# Data Cleaning and Preparation Using Python
Understand how to handle missing values by detecting, filling, and dropping them.
Learn techniques for data transformation such as converting data types, normalizing, and standardizing.
Master the skills to identify and remove duplicates in datasets.
Gain insights into handling outliers through detection and removal/capping methods.

In [1]:
import pandas as pd


In [15]:
## 4.1 Handling Missing Values
# Detecting Missing Values
# The isnull() and notnull() methods flag missing and non-missing values, respectively.
# Sample data with missing values
data = {
    'Name': ['Alice', 'Daniel', 'Charlie', None, 'Evelyn'],
    'Age': [25, None, 30, 35, None],
    'Gender': ['F', 'M', None, 'M', 'F']
}
data



{'Name': ['Alice', 'Daniel', 'Charlie', None, 'Evelyn'],
 'Age': [25, None, 30, 35, None],
 'Gender': ['F', 'M', None, 'M', 'F']}

In [17]:
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Gender
0,Alice,25.0,F
1,Daniel,,M
2,Charlie,30.0,
3,,35.0,M
4,Evelyn,,F


In [35]:
# Detect missing values
print("Missing Values:")
df.isnull()


Missing Values:


Unnamed: 0,Name,Age,Gender
0,False,False,False
1,False,True,False
2,False,False,True
3,True,False,False
4,False,True,False


In [31]:
# Identify rows with at least one missing value

null_values = df[df.isnull().any(axis=1)]
print("Rows with missing values:")
null_values

Rows with missing values:


Unnamed: 0,Name,Age,Gender
1,Daniel,,M
2,Charlie,30.0,
3,,35.0,M
4,Evelyn,,F


In [37]:
# Get count of missing values per column
print("Total missing values per column:")
print(df.isnull().sum())

Total missing values per column:
Name      1
Age       2
Gender    1
dtype: int64
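
Counts are easier to compare across columns when expressed as a share of rows. The following is a minimal sketch, assuming the same sample data as above; the name `missing_share` is illustrative and not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Daniel', 'Charlie', None, 'Evelyn'],
    'Age': [25, None, 30, 35, None],
    'Gender': ['F', 'M', None, 'M', 'F'],
})

# isnull() yields booleans; their column-wise mean is the fraction of missing rows
missing_share = df.isnull().mean()
print(missing_share)
```

A column with a high missing share is often a candidate for dropping rather than imputing, though the threshold depends on the dataset.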


In [43]:
# Filling Missing Values (Imputation)
# The fillna() method replaces missing values with a constant or a computed statistic (e.g., mean, median, or mode).

# Fill missing values with a specified constant (string)
print("Fill missing values with 'Unknown':")
filled_df_string = df.fillna(value={'Name': 'Unknown', 'Gender': 'Unknown'})
filled_df_string

Fill missing values with 'Unknown':


Unnamed: 0,Name,Age,Gender
0,Alice,25.0,F
1,Daniel,,M
2,Charlie,30.0,Unknown
3,Unknown,35.0,M
4,Evelyn,,F


In [45]:
# Fill missing values with mean (numeric)
print("Fill missing values with mean:")
filled_df_mean = df.fillna(value={'Age': df['Age'].mean()})
filled_df_mean

Fill missing values with mean:


Unnamed: 0,Name,Age,Gender
0,Alice,25.0,F
1,Daniel,30.0,M
2,Charlie,30.0,
3,,35.0,M
4,Evelyn,30.0,F


In [47]:
# Fill missing values with mode (categorical)
print("Fill missing values with mode:")
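# mode() returns every tied value in sorted order; here 'F' and 'M' each appear twice, so [0] selects 'F'
mode_gender = df['Gender'].mode()[0]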
filled_df_mode = df.fillna(value={'Gender': mode_gender})
filled_df_mode

Fill missing values with mode:


Unnamed: 0,Name,Age,Gender
0,Alice,25.0,F
1,Daniel,,M
2,Charlie,30.0,F
3,,35.0,M
4,Evelyn,,F
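
The mean is sensitive to extreme values, so the median is a common alternative for skewed numeric columns, and forward fill suits ordered data. The following is a minimal sketch, assuming the same sample data as above; `filled_df_median` and `filled_age_ffill` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Daniel', 'Charlie', None, 'Evelyn'],
    'Age': [25, None, 30, 35, None],
    'Gender': ['F', 'M', None, 'M', 'F'],
})

# Median of the observed ages (25, 30, 35)
median_age = df['Age'].median()
filled_df_median = df.fillna(value={'Age': median_age})

# Forward fill copies the last observed value downward:
# row 1 takes the value from row 0, row 4 takes the value from row 3
filled_age_ffill = df['Age'].ffill()

print(filled_df_median)
print(filled_age_ffill)
```

Forward fill only makes sense when row order carries meaning (for example, time series); for unordered records such as this one it is shown purely to illustrate the call.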


In [49]:
#Dropping Missing Values
#dropna() method removes rows or columns with missing values.
# Drop rows with any missing values
cleaned_df = df.dropna()
print("DataFrame after dropping rows with missing values:")
cleaned_df

DataFrame after dropping rows with missing values:


Unnamed: 0,Name,Age,Gender
0,Alice,25.0,F
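
Dropping every row with any missing value left only one of five rows, which is often too aggressive. The sketch below, assuming the same sample data, shows three narrower options using the `subset`, `thresh`, and `axis` parameters of `dropna()`; the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Daniel', 'Charlie', None, 'Evelyn'],
    'Age': [25, None, 30, 35, None],
    'Gender': ['F', 'M', None, 'M', 'F'],
})

# Drop rows only when 'Age' is missing (rows 1 and 4 are removed)
rows_with_age = df.dropna(subset=['Age'])

# Keep rows that have at least 2 non-missing values (every row qualifies here)
rows_thresh = df.dropna(thresh=2)

# Drop columns that contain any missing value (every column has one, so none remain)
cols_dropped = df.dropna(axis=1)

print(rows_with_age)
print(rows_thresh)
print(cols_dropped.shape)
```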


In [53]:
#4.3 Removing Duplicates
#Identifying Duplicate Rows
#The duplicated() method flags rows that repeat an earlier row. By default all columns are compared; the subset parameter restricts the comparison to specific columns.
# Example DataFrame with duplicates
df = pd.DataFrame({"name": ["Alice", "Bob", "Alice"], "age": [25, 30, 25]})
df

Unnamed: 0,name,age
0,Alice,25
1,Bob,30
2,Alice,25


In [55]:
# Identify duplicate rows
print("Duplicate rows:")
print(df[df.duplicated()])

Duplicate rows:
    name  age
2  Alice   25


In [57]:
#Removing Duplicate Rows
#Use drop_duplicates() method to remove duplicate rows from the dataset.
# Remove duplicate rows
df = df.drop_duplicates()
print("DataFrame after removing duplicates:")
df

DataFrame after removing duplicates:


Unnamed: 0,name,age
0,Alice,25
1,Bob,30
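
Duplicates are not always exact copies of a whole row. The sketch below uses a hypothetical `people` DataFrame (not from the original notebook) in which the two "Alice" rows differ in age, and shows how `subset` and `keep` change what counts as a duplicate and which copy survives.

```python
import pandas as pd

# Hypothetical data: same name, different ages
people = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice"],
    "age": [25, 30, 26],
})

# No row is a full duplicate because the ages differ
full_dupes = people.duplicated()

# Comparing only 'name' flags the second Alice
name_dupes = people.duplicated(subset=["name"])

# keep="last" retains the final occurrence of each name instead of the first
deduped = people.drop_duplicates(subset=["name"], keep="last")

print(full_dupes.tolist())
print(name_dupes.tolist())
print(deduped)
```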


In [15]:
import pandas as pd

In [17]:

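#Data Transformation
#Converting Data Types
#astype() converts a column to the specified data type.
# Example DataFrame with ages stored as strings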
data = {'age': ['25', '30', '35']}
df = pd.DataFrame(data)

# Convert 'age' column to float
df["age"] = df["age"].astype(float)

print("Data types after conversion:")
# Print the data types of each column
print(df.dtypes)

Data types after conversion:
age    float64
dtype: object
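
`astype(float)` raises an error if any string cannot be parsed as a number. A minimal sketch of a more tolerant conversion follows, using `pd.to_numeric` with `errors='coerce'`; the `raw` DataFrame and its 'unknown' entry are hypothetical and added only for illustration.

```python
import pandas as pd

# Hypothetical column with one value that is not a number
raw = pd.DataFrame({'age': ['25', '30', 'unknown']})

# errors='coerce' converts unparseable entries to NaN instead of raising
raw['age'] = pd.to_numeric(raw['age'], errors='coerce')

print(raw)
print(raw.dtypes)
```

The coerced NaN can then be handled with the missing-value techniques from section 4.1.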


Normalizing and Standardizing Data
Normalization (min-max scaling) rescales data to a fixed range, typically [0, 1]. Standardization (z-score scaling) rescales data to have a mean of 0 and a variance of 1.
Normalization Example:

In [19]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['age']] = scaler.fit_transform(df[['age']])
print("Normalized data:")
df


Normalized data:


Unnamed: 0,age
0,0.0
1,0.5
2,1.0


In [21]:
from sklearn.preprocessing import StandardScaler


In [23]:

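#Standardization Example:
# Note: df['age'] was min-max scaled in the previous cell; z-scores are unaffected
# by that linear rescaling, so the result matches standardizing the raw ages.
# StandardScaler divides by the population standard deviation (ddof=0).
scaler = StandardScaler()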
df[['age']] = scaler.fit_transform(df[['age']])
print("Standardized data:")
print(df)

Standardized data:
        age
0 -1.224745
1  0.000000
2  1.224745
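
Both scalers apply simple formulas: min-max scaling computes (x - min) / (max - min), and standardization computes (x - mean) / std. The sketch below reproduces the two results above with plain pandas, assuming the original ages 25, 30, and 35; the variable names are illustrative.

```python
import pandas as pd

ages = pd.Series([25.0, 30.0, 35.0], name='age')

# Min-max normalization: map the smallest value to 0 and the largest to 1
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Z-score standardization with the population standard deviation (ddof=0),
# which is the convention StandardScaler uses
standardized = (ages - ages.mean()) / ages.std(ddof=0)

print(normalized.tolist())
print(standardized.round(6).tolist())
```

Note that pandas' `std()` defaults to the sample standard deviation (`ddof=1`), so `ddof=0` must be passed explicitly to match the scaler output.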


4.4 Handling Outliers
Detecting Outliers
Use statistical methods such as quartiles and the interquartile range (IQR) to detect outliers. A common rule flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1.
Example:

In [27]:

# Create a DataFrame with a column 'age' containing some values
df = pd.DataFrame({"age": [25, 30, 100]})
df


Unnamed: 0,age
0,25
1,30
2,100


In [45]:
# Calculate quartiles
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
# Calculate IQR
iqr = q3 - q1

# Calculate bounds for outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
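# Keep only values within the bounds
# Note: for this three-value sample, q1 = 27.5 and q3 = 65.0, so the bounds are
# [-28.75, 121.25]; 100 falls inside them and no row is removed.
df_no_outliers = df[(df["age"] >= lower_bound) & (df["age"] <= upper_bound)]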
print("DataFrame after removing outliers:")
df_no_outliers


DataFrame after removing outliers:


Unnamed: 0,age
0,25
1,30
2,100
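
With only three values, the quartiles are pulled toward the extreme value and nothing is flagged. The sketch below applies the same IQR rule to a larger, hypothetical sample (the `ages` values are invented for illustration) in which 100 does fall outside the bounds.

```python
import pandas as pd

# Hypothetical sample: seven typical ages plus one extreme value
ages = pd.DataFrame({"age": [22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 35.0, 100.0]})

q1 = ages["age"].quantile(0.25)
q3 = ages["age"].quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Boolean mask of rows outside the bounds
is_outlier = (ages["age"] < lower_bound) | (ages["age"] > upper_bound)

outliers = ages[is_outlier]
ages_no_outliers = ages[~is_outlier]

print(f"bounds: [{lower_bound}, {upper_bound}]")
print(outliers)
print(ages_no_outliers)
```

Keeping the mask in its own variable makes it easy to inspect the flagged rows before deciding whether to remove them.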


In [47]:
#Handling Outliers: Removal or Capping
#Exclusion: Remove outliers from the dataset if they are determined to be errors or not relevant.
#Capping/Flooring: Replace values beyond the bounds with the bounds themselves (the upper or lower limit).
#Example: Capping Outliers
# Cap values to [lower_bound, upper_bound]; no value in this sample lies outside the bounds, so the data is unchanged
df['age'] = df['age'].clip(lower=lower_bound, upper=upper_bound)
print("DataFrame after capping outliers:")
df

DataFrame after capping outliers:


Unnamed: 0,age
0,25
1,30
2,100
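
Capping has a visible effect only when a value actually lies outside the bounds. The sketch below wraps the IQR rule in a hypothetical helper, `cap_with_iqr` (not part of pandas or the original notebook), and applies it to an invented sample in which 100 exceeds the upper bound.

```python
import pandas as pd


def cap_with_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; hypothetical helper for illustration."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Hypothetical sample: seven typical ages plus one extreme value
ages = pd.Series([22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 35.0, 100.0], name="age")
capped = cap_with_iqr(ages)

print(capped.tolist())
```

Unlike removal, capping keeps the row count unchanged, which matters when other columns in the same row still carry useful information.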
