Real-world data is messy. Pandas gives us powerful tools to clean and transform
data before analysis. The following steps walk through cleaning a small sample dataset.


**Data Loading**

In [3]:
import pandas as pd
df =  pd.read_csv("Data Cleaning Sample.csv")

In [4]:
df

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25.0,New York,F,alice@example.com,01-05-2021
1,Charlie,,Delhi,M,charlie@example,20-07-2021
2,Bob,30.0,Los Angeles,M,bob@example.com,15-06-2020
3,Charlie,,Delhi,M,charlie@example,20-07-2021
4,David,22.0,Mumbai,M,david@example.com,12-11-2019
5,,28.0,Delhi,F,eve@domain.com,
6,Alice,25.0,New York,F,alice@example.com,01-05-2021
7,Alice,25.0,New York,F,alice@example.com,01-05-2021
8,Charlie,,Delhi,M,charlie@example,20-07-2021


**1. Handling Missing Values**

**i. Check for Missing Data**

In [3]:
df.isnull()  # Returns True wherever a value is missing.

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,False,False,False,False,False,False
1,False,True,False,False,False,False
2,False,False,False,False,False,False
3,False,True,False,False,False,False
4,False,False,False,False,False,False
5,True,False,False,False,False,True
6,False,False,False,False,False,False
7,False,False,False,False,False,False
8,False,True,False,False,False,False


In [5]:
df.isnull().sum()           # Count missing values per column

Name         1
Age          3
City         0
Gender       0
Email        0
Join Date    1
dtype: int64
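
The counts above can also be expressed as a share of each column, and the affected rows can be pulled out for inspection. The following is a minimal sketch on a small hand-built frame (the variable names sample, missing_share and rows_with_gaps are illustrative, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Small illustrative frame with a few gaps, modelled on the sample data.
sample = pd.DataFrame({
    "Name": ["Alice", "Charlie", None, "Bob"],
    "Age": [25.0, np.nan, 28.0, 30.0],
    "City": ["New York", "Delhi", "Delhi", "Los Angeles"],
})

# isnull() yields booleans, so the column mean is the fraction of missing values.
missing_share = sample.isnull().mean()

# Keep only the rows that have at least one missing value.
rows_with_gaps = sample[sample.isnull().any(axis=1)]

print(missing_share)
print(rows_with_gaps)
```

Looking at the share rather than the raw count can make it easier to decide whether a column is worth keeping.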

**ii. Drop Missing Data**

In [7]:
df.dropna()      # Drop rows with any missing value

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25.0,New York,F,alice@example.com,01-05-2021
2,Bob,30.0,Los Angeles,M,bob@example.com,15-06-2020
4,David,22.0,Mumbai,M,david@example.com,12-11-2019
6,Alice,25.0,New York,F,alice@example.com,01-05-2021
7,Alice,25.0,New York,F,alice@example.com,01-05-2021


In [9]:
df.dropna(axis=1)     # Drop columns with any missing value

Unnamed: 0,City,Gender,Email
0,New York,F,alice@example.com
1,Delhi,M,charlie@example
2,Los Angeles,M,bob@example.com
3,Delhi,M,charlie@example
4,Mumbai,M,david@example.com
5,Delhi,F,eve@domain.com
6,New York,F,alice@example.com
7,New York,F,alice@example.com
8,Delhi,M,charlie@example
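
Dropping every row or column that has any gap can discard a lot of data. As a hedged sketch, dropna also accepts subset (only look at certain columns) and thresh (minimum number of non-null values a row must have). The frame below is hand-built for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative frame; row 3 has only one non-null value.
sample = pd.DataFrame({
    "Name": ["Alice", "Charlie", None, None],
    "Age": [25.0, np.nan, 28.0, np.nan],
    "City": ["New York", "Delhi", "Delhi", "Delhi"],
})

# Drop a row only when 'Age' is missing; gaps in other columns are ignored.
by_age = sample.dropna(subset=["Age"])

# Keep rows that have at least 2 non-null values.
by_thresh = sample.dropna(thresh=2)

print(by_age)
print(by_thresh)
```

Both calls return a new DataFrame; sample itself still holds all four rows.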


**iii. Filling Missing Data**

In [11]:
df.fillna(0)      # Replace every missing value with 0

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25.0,New York,F,alice@example.com,01-05-2021
1,Charlie,0.0,Delhi,M,charlie@example,20-07-2021
2,Bob,30.0,Los Angeles,M,bob@example.com,15-06-2020
3,Charlie,0.0,Delhi,M,charlie@example,20-07-2021
4,David,22.0,Mumbai,M,david@example.com,12-11-2019
5,0,28.0,Delhi,F,eve@domain.com,0
6,Alice,25.0,New York,F,alice@example.com,01-05-2021
7,Alice,25.0,New York,F,alice@example.com,01-05-2021
8,Charlie,0.0,Delhi,M,charlie@example,20-07-2021


In [14]:
df['Age'].fillna(df['Age'].mean())    # Fill missing ages with the mean of the available values

0    25.000000
1    25.833333
2    30.000000
3    25.833333
4    22.000000
5    28.000000
6    25.000000
7    25.000000
8    25.833333
Name: Age, dtype: float64
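
Note that fillna returns a new Series and leaves df unchanged, which is why later cells still show the missing ages. A minimal sketch of making the fill permanent by assigning the result back, using the Age values from the sample data (the names sample and mean_age are illustrative):

```python
import numpy as np
import pandas as pd

# Age column as it appears in the sample data.
sample = pd.DataFrame({
    "Age": [25.0, np.nan, 30.0, np.nan, 22.0, 28.0, 25.0, 25.0, np.nan],
})

mean_age = sample["Age"].mean()  # mean() skips missing values by default

# Calling fillna alone does not modify the frame.
sample["Age"].fillna(mean_age)
missing_before_assign = int(sample["Age"].isnull().sum())

# Assign the result back to keep the filled values.
sample["Age"] = sample["Age"].fillna(mean_age)
missing_after_assign = int(sample["Age"].isnull().sum())

print(missing_before_assign, missing_after_assign)
```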

In [16]:
df.ffill()   # Forward fill: copy the previous row's value into each gap (top-down)

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25.0,New York,F,alice@example.com,01-05-2021
1,Charlie,25.0,Delhi,M,charlie@example,20-07-2021
2,Bob,30.0,Los Angeles,M,bob@example.com,15-06-2020
3,Charlie,30.0,Delhi,M,charlie@example,20-07-2021
4,David,22.0,Mumbai,M,david@example.com,12-11-2019
5,David,28.0,Delhi,F,eve@domain.com,12-11-2019
6,Alice,25.0,New York,F,alice@example.com,01-05-2021
7,Alice,25.0,New York,F,alice@example.com,01-05-2021
8,Charlie,25.0,Delhi,M,charlie@example,20-07-2021


In [18]:
df.bfill()   # Backward fill: copy the next row's value into each gap (bottom-up); a gap in the last row stays missing

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25.0,New York,F,alice@example.com,01-05-2021
1,Charlie,30.0,Delhi,M,charlie@example,20-07-2021
2,Bob,30.0,Los Angeles,M,bob@example.com,15-06-2020
3,Charlie,22.0,Delhi,M,charlie@example,20-07-2021
4,David,22.0,Mumbai,M,david@example.com,12-11-2019
5,Alice,28.0,Delhi,F,eve@domain.com,01-05-2021
6,Alice,25.0,New York,F,alice@example.com,01-05-2021
7,Alice,25.0,New York,F,alice@example.com,01-05-2021
8,Charlie,,Delhi,M,charlie@example,20-07-2021


**2. Detecting and Removing Duplicates**

In [20]:
df.duplicated()     # Returns True for each row that repeats an earlier row.

0    False
1    False
2    False
3     True
4    False
5    False
6     True
7     True
8     True
dtype: bool

In [22]:
df.drop_duplicates()     # Drops repeated rows, keeping the first occurrence of each.

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25.0,New York,F,alice@example.com,01-05-2021
1,Charlie,,Delhi,M,charlie@example,20-07-2021
2,Bob,30.0,Los Angeles,M,bob@example.com,15-06-2020
4,David,22.0,Mumbai,M,david@example.com,12-11-2019
5,,28.0,Delhi,F,eve@domain.com,


In [23]:
df.duplicated(subset=['Name', 'Age'])   # Detects duplicates based on specific columns.

0    False
1    False
2    False
3     True
4    False
5    False
6     True
7     True
8     True
dtype: bool
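
By default the first occurrence is treated as the original and later copies are flagged. The keep parameter changes that. A small sketch on hand-built data (the frame and variable names are illustrative):

```python
import pandas as pd

# Illustrative frame with repeated rows, modelled on the sample data.
sample = pd.DataFrame({
    "Name": ["Alice", "Charlie", "Bob", "Charlie", "David", "Alice"],
    "City": ["New York", "Delhi", "Los Angeles", "Delhi", "Mumbai", "New York"],
})

# keep="last": flag earlier copies and treat the last occurrence as the original.
flag_last = sample.duplicated(keep="last")

# keep=False: flag every row that has a copy anywhere in the frame.
flag_all = sample.duplicated(keep=False)

# Drop duplicates and renumber the index from 0.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(flag_last.tolist())
print(flag_all.tolist())
print(deduped)
```

Resetting the index is optional; without it the surviving rows keep their original labels, as in the output above.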

**3. String Operations** - The .str accessor applies vectorized string methods to a column and returns a pandas Series.


In [75]:
df["Name"].str.lower()   # Converts every name to lowercase; missing values stay NaN

0      alice
1    charlie
2        bob
3    charlie
4      david
5        NaN
6      alice
7      alice
8    charlie
Name: Name, dtype: object

In [76]:
df["City"].str.contains("delhi", case=False)   # Checks if 'Delhi' is in the 'City' column, case-insensitive

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7    False
8     True
Name: City, dtype: bool

In [77]:
df["Email"].str.split("@")   # Returns a pandas Series where each element is a list

0    [alice, example.com]
1      [charlie, example]
2      [bob, example.com]
3      [charlie, example]
4    [david, example.com]
5       [eve, domain.com]
6    [alice, example.com]
7    [alice, example.com]
8      [charlie, example]
Name: Email, dtype: object
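
The split lists are often easier to use as separate columns, and the sample data contains an address without a top-level domain (charlie@example). The sketch below shows one possible way to handle both; the regular expression is a deliberately simple assumption, not a full email validator:

```python
import pandas as pd

# A few addresses taken from the sample data.
emails = pd.Series(["alice@example.com", "charlie@example", "eve@domain.com"])

# expand=True returns a DataFrame with one column per piece.
parts = emails.str.split("@", expand=True)
parts.columns = ["user", "domain"]

# Rough check (assumed pattern): something@something.something, no spaces.
simple_pattern = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
looks_valid = emails.str.contains(simple_pattern)

print(parts)
print(looks_valid.tolist())
```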

**4. Type Conversion with .astype()** - Convert a column's data type

In [5]:
df2 = df.dropna().copy()

In [6]:
df2["Age"] = df2["Age"].astype(int)
df2

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25,New York,F,alice@example.com,01-05-2021
2,Bob,30,Los Angeles,M,bob@example.com,15-06-2020
4,David,22,Mumbai,M,david@example.com,12-11-2019
6,Alice,25,New York,F,alice@example.com,01-05-2021
7,Alice,25,New York,F,alice@example.com,01-05-2021


In [81]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 0 to 7
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Name       5 non-null      object
 1   Age        5 non-null      int64 
 2   City       5 non-null      object
 3   Gender     5 non-null      object
 4   Email      5 non-null      object
 5   Join Date  5 non-null      object
dtypes: int64(1), object(5)
memory usage: 280.0+ bytes
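
The cells above drop incomplete rows first because a plain int column cannot hold NaN. As a hedged alternative, the nullable "Int64" dtype keeps the gaps, and pd.to_datetime can convert the Join Date strings. The format string assumes the dates are day-month-year, which values such as 20-07-2021 suggest:

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps in both columns.
sample = pd.DataFrame({
    "Age": [25.0, np.nan, 30.0],
    "Join Date": ["01-05-2021", "20-07-2021", None],
})

# Nullable integer dtype (capital "I"): missing values become <NA>.
sample["Age"] = sample["Age"].astype("Int64")

# Parse day-month-year strings; missing entries become NaT.
sample["Join Date"] = pd.to_datetime(sample["Join Date"], format="%d-%m-%Y")

print(sample.dtypes)
print(sample)
```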


**5. Applying Functions**

**i. .apply()** - Apply a function to each element of a Series, or to each row or column of a DataFrame.

In [101]:
df2["Age Group"] = df2["Age"].apply(lambda x : "Adult" if x >= 25 else "Minor")
df2

Unnamed: 0,Name,Age,City,Gender,Email,Join Date,Age Group
0,Alice,25,New York,F,alice@example.com,01-05-2021,Adult
2,Bob,30,Los Angeles,M,bob@example.com,15-06-2020,Adult
4,David,22,Mumbai,M,david@example.com,12-11-2019,Minor
6,Alice,25,New York,F,alice@example.com,01-05-2021,Adult
7,Alice,25,New York,F,alice@example.com,01-05-2021,Adult
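
For a simple two-way condition like this one, a vectorized expression can give the same labels without calling a Python function per row. A minimal sketch using numpy's where, with the same threshold and labels as the cell above:

```python
import numpy as np
import pandas as pd

# Ages taken from the cleaned sample data.
sample = pd.DataFrame({"Age": [25, 30, 22]})

# np.where(condition, value_if_true, value_if_false) works on the whole column.
sample["Age Group"] = np.where(sample["Age"] >= 25, "Adult", "Minor")

# Same labels produced element by element with apply, for comparison.
via_apply = sample["Age"].apply(lambda x: "Adult" if x >= 25 else "Minor")

print(sample)
```

The apply version remains the more flexible choice when the logic does not reduce to a single condition.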


**ii. .map()** - Element-wise mapping for a Series.

In [105]:
df2["Gender"].apply(lambda x: repr(x))

0    'F'
2    'M'
4    'M'
6    'F'
7    'F'
Name: Gender, dtype: object

In [108]:
print(df2.columns)

Index(['Name', 'Age', 'City', 'Gender', 'Email', 'Join Date', 'Age Group'], dtype='object')


In [112]:
gender_mapp = {"M" : "Male", "F" : "Female"}
df2['Gender'] = df2['Gender'].map(gender_mapp)
df2

Unnamed: 0,Name,Age,City,Gender,Email,Join Date,Age Group
0,Alice,25,New York,Female,alice@example.com,01-05-2021,Adult
2,Bob,30,Los Angeles,Male,bob@example.com,15-06-2020,Adult
4,David,22,Mumbai,Male,david@example.com,12-11-2019,Minor
6,Alice,25,New York,Female,alice@example.com,01-05-2021,Adult
7,Alice,25,New York,Female,alice@example.com,01-05-2021,Adult
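
One thing to watch with a dictionary passed to map: any value that is not a key in the dictionary becomes NaN. The sketch below adds a hypothetical "X" code to show this, along with one way to fall back to the original value:

```python
import pandas as pd

# "X" is a made-up code that the mapping does not cover.
gender = pd.Series(["M", "F", "X"])
gender_map = {"M": "Male", "F": "Female"}

# Unmapped values turn into NaN.
mapped = gender.map(gender_map)

# Fall back to the original value wherever the mapping produced NaN.
mapped_with_fallback = gender.map(gender_map).fillna(gender)

print(mapped.tolist())
print(mapped_with_fallback.tolist())
```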


**iii. .replace()** - Replace specific values.

In [9]:
df2['City'] = df2['City'].replace({"Mumbai" : "New Mumbai", "New York" : "New York City"})
df2

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25,New York City,F,alice@example.com,01-05-2021
2,Bob,30,Los Angeles,M,bob@example.com,15-06-2020
4,David,22,New Mumbai,M,david@example.com,12-11-2019
6,Alice,25,New York City,F,alice@example.com,01-05-2021
7,Alice,25,New York City,F,alice@example.com,01-05-2021
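
By default, replace on a Series matches whole values and leaves everything else untouched; it does not edit substrings. For substring edits the .str.replace method is the usual tool. A short sketch on a hand-built Series:

```python
import pandas as pd

cities = pd.Series(["New York", "Mumbai", "Delhi"])

# Whole-value match: only the exact value "Mumbai" changes.
whole_value = cities.replace({"Mumbai": "New Mumbai"})

# "New" is not a complete value in the Series, so nothing changes.
no_match = cities.replace("New", "Old")

# Substring replacement goes through the .str accessor.
substring = cities.str.replace("New", "Old", regex=False)

print(whole_value.tolist())
print(no_match.tolist())
print(substring.tolist())
```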
