# Dropping Duplicates

We have a DataFrame named customers that consists of details like customer_id, name, and email. The goal is to remove duplicate rows based on the email column and only keep the first occurrence of any duplicated email.

-  `DataFrame` :
 
|   customer_id | name     | email             |
|--------------:|:---------|:------------------|
|             1 | Ella     | emily@example.com |
|             2 | David    | michael@example.com|
|             3 | Zachary  | sarah@example.com |
|             4 | Alice    | john@example.com  |
|             5 | Finn     | john@example.com  |
|             6 | Violet   | alice@example.com |

-  `Key Concepts` :
-  1.  `drop_duplicates()` Method : `drop_duplicates` is a method of the pandas DataFrame object. It removes duplicate rows, and you can specify which columns decide whether two rows count as duplicates.

-  2.  `drop_duplicates()` Parameter Definitions : - `subset` : The column label or sequence of labels to consider when identifying duplicate rows. If not provided, all columns in the DataFrame are compared.
-  `Other Parameters` :
  -  `keep` : This argument determines which of the duplicate rows to retain.
       - 'first': (default) Drop duplicates except for the first occurrence.
       - 'last': Drop duplicates except for the last occurrence.
       - False: Drop every row that has a duplicate, keeping none of them.

  -  `inplace` :
      -  If set to True, the DataFrame is modified directly and the method returns None.
      -  If set to False (default), the original is left unchanged and a new DataFrame with duplicates dropped is returned.
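
To make the three `keep` options concrete, here is a minimal sketch that rebuilds the sample table and applies each option in turn. The variable names `keep_first`, `keep_last`, and `keep_none` are illustrative only and do not appear in the original solution.

```python
import pandas as pd

# Rebuild the sample customers table from this section
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "name": ["Ella", "David", "Zachary", "Alice", "Finn", "Violet"],
    "email": [
        "emily@example.com", "michael@example.com", "sarah@example.com",
        "john@example.com", "john@example.com", "alice@example.com",
    ],
})

# inplace is left at its default (False), so each call returns a new DataFrame
keep_first = customers.drop_duplicates(subset="email", keep="first")
keep_last = customers.drop_duplicates(subset="email", keep="last")
keep_none = customers.drop_duplicates(subset="email", keep=False)

print(keep_first["customer_id"].tolist())  # [1, 2, 3, 4, 6]
print(keep_last["customer_id"].tolist())   # [1, 2, 3, 5, 6]
print(keep_none["customer_id"].tolist())   # [1, 2, 3, 6]
```

With `keep='first'` Alice (id 4) survives, with `keep='last'` Finn (id 5) survives, and with `keep=False` both rows sharing john@example.com are dropped.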

-  `Intuition` :
Let’s go step by step through the provided solution:
1. Importing pandas:

 ```python
import pandas as pd
 ```
This imports the pandas library and gives it an alias pd. pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and data manipulation library built on top of the Python programming language.

2. Dropping duplicate rows based on email:

 ```python
customers.drop_duplicates(subset='email', keep='first', inplace=True)
 ```
This line calls the `drop_duplicates` method on the customers DataFrame.
- `subset='email'`: Rows are compared on the email column only.
- `keep='first'`: The first occurrence of any duplicated email is kept and the subsequent occurrences are dropped.
- `inplace=True`: The changes are made directly to the customers DataFrame. The call itself returns None rather than a new DataFrame.
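
The difference between the two `inplace` settings is easy to miss, so here is a small sketch that contrasts them. The three-row frame and the names `deduped` and `result` are made up for illustration.

```python
import pandas as pd

# A small illustrative frame with one repeated email
customers = pd.DataFrame({
    "customer_id": [4, 5, 6],
    "name": ["Alice", "Finn", "Violet"],
    "email": ["john@example.com", "john@example.com", "alice@example.com"],
})

# Default (inplace=False): the original is untouched, a new DataFrame is returned
deduped = customers.drop_duplicates(subset="email", keep="first")
print(len(customers), len(deduped))  # 3 2

# inplace=True: the original is modified and the call returns None
result = customers.drop_duplicates(subset="email", keep="first", inplace=True)
print(result)          # None
print(len(customers))  # 2
```

Because the in-place call returns None, writing `customers = customers.drop_duplicates(..., inplace=True)` would replace the DataFrame with None. Use either the assignment form or the in-place form, not both.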

3. Printing the modified DataFrame:

```python
print(customers)
 ```
Finally, we print the modified customers DataFrame, from which the rows with a duplicated email have been removed.
In this way you can clean up the data in your customers DataFrame and ensure that each customer's email is unique, which helps maintain data integrity. If two customers have the same email address, only the first one encountered is kept in the resulting DataFrame.
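
One way to confirm that every email is unique after the cleanup is to check the column before and after the call. This is a sketch rather than part of the original solution: it uses the pandas `Series.is_unique` property and `Series.duplicated()` method on a shortened, illustrative copy of the data.

```python
import pandas as pd

# Shortened illustrative copy of the customers data
customers = pd.DataFrame({
    "customer_id": [3, 4, 5, 6],
    "name": ["Zachary", "Alice", "Finn", "Violet"],
    "email": ["sarah@example.com", "john@example.com",
              "john@example.com", "alice@example.com"],
})

# Before: the email column contains one repeated value
unique_before = customers["email"].is_unique
repeat_count = int(customers["email"].duplicated().sum())
print(unique_before, repeat_count)  # False 1

customers.drop_duplicates(subset="email", keep="first", inplace=True)

# After: every email appears exactly once
unique_after = customers["email"].is_unique
print(unique_after)  # True
```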

In [1]:
# Import the pandas library under the conventional alias pd
import pandas as pd

In [2]:
# Define the customer data as a list of dictionaries
customer_data = [
    {"customer_id": 1, "name": "Ella", "email": "emily@example.com"},
    {"customer_id": 2, "name": "David", "email": "michael@example.com"},
    {"customer_id": 3, "name": "Zachary", "email": "sarah@example.com"},
    {"customer_id": 4, "name": "Alice", "email": "john@example.com"},
    {"customer_id": 5, "name": "Finn", "email": "john@example.com"},
    {"customer_id": 6, "name": "Violet", "email": "alice@example.com"}
]

In [3]:
# Convert the list of dictionaries to a pandas DataFrame
customer_df = pd.DataFrame(customer_data)  # each dictionary becomes one row

In [4]:
# Display the DataFrame
customer_df

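|    |   customer_id | name    | email               |
|---:|--------------:|:--------|:--------------------|
|  0 |             1 | Ella    | emily@example.com   |
|  1 |             2 | David   | michael@example.com |
|  2 |             3 | Zachary | sarah@example.com   |
|  3 |             4 | Alice   | john@example.com    |
|  4 |             5 | Finn    | john@example.com    |
|  5 |             6 | Violet  | alice@example.com   |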


In [7]:
# Drop rows with a duplicated email using the drop_duplicates method
customer_df.drop_duplicates(subset='email', keep='first', inplace=True)

In [8]:
# Display the DataFrame after dropping duplicates
customer_df

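|    |   customer_id | name    | email               |
|---:|--------------:|:--------|:--------------------|
|  0 |             1 | Ella    | emily@example.com   |
|  1 |             2 | David   | michael@example.com |
|  2 |             3 | Zachary | sarah@example.com   |
|  3 |             4 | Alice   | john@example.com    |
|  5 |             6 | Violet  | alice@example.com   |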
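
Note that the output keeps the original index labels, so label 4 is missing after Finn's row is dropped. If a continuous index is preferred, one option is the `ignore_index=True` parameter of `drop_duplicates`. This sketch assumes pandas 1.0 or later, where that parameter is available, and the names `deduped` and `renumbered` are illustrative.

```python
import pandas as pd

customer_df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "name": ["Ella", "David", "Zachary", "Alice", "Finn", "Violet"],
    "email": [
        "emily@example.com", "michael@example.com", "sarah@example.com",
        "john@example.com", "john@example.com", "alice@example.com",
    ],
})

# Default behaviour: surviving rows keep their original index labels
deduped = customer_df.drop_duplicates(subset="email", keep="first")
print(deduped.index.tolist())  # [0, 1, 2, 3, 5]

# ignore_index=True relabels the result 0..n-1 (assumes pandas >= 1.0)
renumbered = customer_df.drop_duplicates(
    subset="email", keep="first", ignore_index=True
)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```

On older pandas versions, chaining `.reset_index(drop=True)` after `drop_duplicates` gives the same renumbered index.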


## This is how you drop duplicates based on a particular column

In our DataFrame, john@example.com appears twice because two customers share the same email address, which can cause confusion.
So we cleaned our data using the drop_duplicates method.
### Input :
|   customer_id | name     | email             |
|--------------:|:---------|:------------------|
|             1 | Ella     | emily@example.com |
|             2 | David    | michael@example.com|
|             3 | Zachary  | sarah@example.com |
|             4 | Alice    | john@example.com  |
|             5 | Finn     | john@example.com  |
|             6 | Violet   | alice@example.com |
### Output 
|   customer_id | name     | email             |
|--------------:|:---------|:------------------|
|             1 | Ella     | emily@example.com |
|             2 | David    | michael@example.com|
|             3 | Zachary  | sarah@example.com |
|             4 | Alice    | john@example.com  |
|             6 | Violet   | alice@example.com |
### Removed Row :

|   customer_id | name     | email             |
|--------------:|:---------|:------------------|
|             5 | Finn     | john@example.com  |
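
To see which rows would be removed before actually dropping them, the companion method `DataFrame.duplicated` can be used. It takes the same `subset` and `keep` arguments and returns a boolean mask. The following is a sketch; the names `mask`, `removed`, and `kept` are illustrative.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "name": ["Ella", "David", "Zachary", "Alice", "Finn", "Violet"],
    "email": [
        "emily@example.com", "michael@example.com", "sarah@example.com",
        "john@example.com", "john@example.com", "alice@example.com",
    ],
})

# True for every row whose email was already seen in an earlier row
mask = customers.duplicated(subset="email", keep="first")

removed = customers[mask]   # rows that drop_duplicates would discard
kept = customers[~mask]     # rows that drop_duplicates would keep

print(removed["name"].tolist())      # ['Finn']
print(kept["customer_id"].tolist())  # [1, 2, 3, 4, 6]
```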