# 🧹 Data Cleaning & Transformation


## 📦 1. Setup and Data Generation

We'll create a **synthetic dataset** that mimics messy, real-world data, including missing values, duplicates, inconsistent data types, typos, and outliers.

In [1]:
import pandas as pd
import numpy as np
import random

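np.random.seed(42)
random.seed(42)  # the categorical columns use random.choice, so seed that generator too

n = 60  # number of rows before duplicates are added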

# Simulate some messy data
data = {
    'Customer_ID': np.random.randint(1000, 1100, n),
    'Name': [random.choice(['Alice', 'Bob', 'Catherine', 'David', 'Evelyn', 'Frank', 'Grace', 'Hannah']) for _ in range(n)],
    'Gender': [random.choice(['M', 'F', 'Male', 'Female', 'male', 'female', np.nan]) for _ in range(n)],
    'Age': [random.choice([20, 25, 30, 35, 40, 45, 50, np.nan, 'thirty']) for _ in range(n)],
    'Join_Date': [random.choice(['2021-01-05', '2020-06-10', '2019/12/15', '15-07-2022', np.nan]) for _ in range(n)],
    'City': [random.choice(['Nairobi', 'Mombasa', 'kisumu', 'NAIROBI', 'Eldoret', np.nan]) for _ in range(n)],
    'Income': [random.choice([35000, 50000, 70000, 90000, np.nan, 200000, 1000000]) for _ in range(n)],
    'Satisfaction_2022': np.random.randint(50, 100, n),
    'Satisfaction_2023': np.random.randint(40, 100, n),
    'Preferred_Channel': [random.choice(['Online', 'In-store', 'Both', np.nan]) for _ in range(n)]
}

# Add duplicates deliberately
df = pd.DataFrame(data)
df = pd.concat([df, df.iloc[:5]], ignore_index=True)

df.head(10)

| | Customer_ID | Name | Gender | Age | Join_Date | City | Income | Satisfaction_2022 | Satisfaction_2023 | Preferred_Channel |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1051 | Grace | Female | 30 | 2019/12/15 | Eldoret | 90000.0 | 77 | 98 | In-store |
| 1 | 1092 | David | M | 35 | 2019/12/15 | Mombasa | 1000000.0 | 96 | 44 | Both |
| 2 | 1014 | Catherine | Male | 30 | 2020-06-10 | kisumu | 200000.0 | 56 | 81 | Both |
| 3 | 1071 | Catherine | NaN | 45 | 2021-01-05 | NaN | 70000.0 | 93 | 78 | Both |
| 4 | 1060 | Frank | M | 35 | NaN | Nairobi | NaN | 57 | 97 | In-store |
| 5 | 1020 | Frank | Female | NaN | 2020-06-10 | NaN | 90000.0 | 96 | 80 | Online |
| 6 | 1082 | Frank | Male | 35 | 2020-06-10 | NAIROBI | 90000.0 | 84 | 67 | Online |
| 7 | 1086 | Evelyn | NaN | 45 | NaN | Mombasa | 35000.0 | 63 | 46 | NaN |
| 8 | 1074 | Frank | M | thirty | 15-07-2022 | Eldoret | 90000.0 | 66 | 48 | Both |
| 9 | 1074 | Grace | Male | 40 | 2020-06-10 | Mombasa | 35000.0 | 85 | 47 | NaN |


## 🔍 2. Initial Data Profiling

Before cleaning, let's inspect the data structure, types, and common problems.

In [2]:
df.info()
print('\nMissing values per column:\n', df.isna().sum())
print('\nUnique values per column:\n', df.nunique())
print('\nSummary statistics:')
display(df.describe(include='all'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65 entries, 0 to 64
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Customer_ID        65 non-null     int32  
 1   Name               65 non-null     object 
 2   Gender             55 non-null     object 
 3   Age                59 non-null     object 
 4   Join_Date          46 non-null     object 
 5   City               56 non-null     object 
 6   Income             55 non-null     float64
 7   Satisfaction_2022  65 non-null     int32  
 8   Satisfaction_2023  65 non-null     int32  
 9   Preferred_Channel  52 non-null     object 
dtypes: float64(1), int32(3), object(6)
memory usage: 4.4+ KB

Missing values per column:
 Customer_ID           0
Name                  0
Gender               10
Age                   6
Join_Date            19
City                  9
Income               10
Satisfaction_2022     0
Satisfaction_2023     0
Preferred_Channel    13
dtype: int64

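| | Customer_ID | Name | Gender | Age | Join_Date | City | Income | Satisfaction_2022 | Satisfaction_2023 | Preferred_Channel |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 65.0 | 65 | 55 | 59 | 46 | 56 | 55.0 | 65.0 | 65.0 | 52 |
| unique | NaN | 8 | 6 | 8 | 4 | 5 | NaN | NaN | NaN | 3 |
| top | NaN | Frank | Female | thirty | 2021-01-05 | kisumu | NaN | NaN | NaN | Both |
| freq | NaN | 14 | 12 | 14 | 17 | 17 | NaN | NaN | NaN | 25 |
| mean | 1049.892308 | NaN | NaN | NaN | NaN | NaN | 266727.272727 | 74.138462 | 69.892308 | NaN |
| std | 29.418384 | NaN | NaN | NaN | NaN | NaN | 372360.113159 | 15.007162 | 16.815286 | NaN |
| min | 1001.0 | NaN | NaN | NaN | NaN | NaN | 35000.0 | 50.0 | 40.0 | NaN |
| 25% | 1021.0 | NaN | NaN | NaN | NaN | NaN | 70000.0 | 60.0 | 54.0 | NaN |
| 50% | 1054.0 | NaN | NaN | NaN | NaN | NaN | 90000.0 | 74.0 | 73.0 | NaN |
| 75% | 1074.0 | NaN | NaN | NaN | NaN | NaN | 200000.0 | 89.0 | 81.0 | NaN |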
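
The summary counts unique values per column but does not show which labels are inconsistent. A minimal sketch, assuming a small stand-in frame (`sample` and the helper `profile_categories` are hypothetical names, not part of the notebook), lists every distinct label per non-numeric column, NaN included, so that variants such as `Nairobi` and `NAIROBI` stand out:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's df, with the same kinds of problems.
sample = pd.DataFrame({
    "Gender": ["M", "Male", "female", np.nan, "F"],
    "City": ["Nairobi", "NAIROBI", "kisumu", np.nan, "Mombasa"],
    "Income": [35000, 50000, np.nan, 90000, 1000000],
})

def profile_categories(frame: pd.DataFrame) -> dict:
    """Return value counts (NaN included) for every non-numeric column."""
    text_cols = frame.select_dtypes(exclude="number").columns
    return {col: frame[col].value_counts(dropna=False) for col in text_cols}

report = profile_categories(sample)
for col, counts in report.items():
    print(f"\n{col}:\n{counts}")

# Labels that differ only by letter case are a common sign of inconsistent entry.
city_labels = sample["City"].dropna()
case_variants = city_labels.str.lower().nunique() < city_labels.nunique()
print("\nCity has case-only variants:", case_variants)
```

Passing `dropna=False` matters here: the default `value_counts()` silently leaves missing values out of the tally.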


## 🧽 3. Data Cleaning Stage

### 3.1 Handling Missing Values (NaN)
Different columns need different strategies based on meaning and type.

In [3]:
# 1️⃣ Drop rows where essential fields are missing
df = df.dropna(subset=['Customer_ID', 'Name'])

# 2️⃣ Impute numeric fields with the median (robust to the extreme incomes)
df['Income'] = df['Income'].fillna(df['Income'].median())

# 3️⃣ Fill categorical NaNs with the mode
df['Preferred_Channel'] = df['Preferred_Channel'].fillna(df['Preferred_Channel'].mode()[0])

# 4️⃣ Conditional imputation: fill missing Gender based on Name
female_names = ['Alice', 'Catherine', 'Evelyn', 'Grace', 'Hannah']
male_names = ['Bob', 'David', 'Frank']
df.loc[df['Name'].isin(female_names) & df['Gender'].isna(), 'Gender'] = 'F'
df.loc[df['Name'].isin(male_names) & df['Gender'].isna(), 'Gender'] = 'M'

# 5️⃣ Fill gaps in a numeric sequence via interpolation
# (Satisfaction_2022 has no missing values here, so this call changes nothing)
df['Satisfaction_2022'] = df['Satisfaction_2022'].interpolate()

df.head(10)

| | Customer_ID | Name | Gender | Age | Join_Date | City | Income | Satisfaction_2022 | Satisfaction_2023 | Preferred_Channel |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1051 | Grace | Female | 30 | 2019/12/15 | Eldoret | 90000.0 | 77 | 98 | In-store |
| 1 | 1092 | David | M | 35 | 2019/12/15 | Mombasa | 1000000.0 | 96 | 44 | Both |
| 2 | 1014 | Catherine | Male | 30 | 2020-06-10 | kisumu | 200000.0 | 56 | 81 | Both |
| 3 | 1071 | Catherine | F | 45 | 2021-01-05 | NaN | 70000.0 | 93 | 78 | Both |
| 4 | 1060 | Frank | M | 35 | NaN | Nairobi | 90000.0 | 57 | 97 | In-store |
| 5 | 1020 | Frank | Female | NaN | 2020-06-10 | NaN | 90000.0 | 96 | 80 | Online |
| 6 | 1082 | Frank | Male | 35 | 2020-06-10 | NAIROBI | 90000.0 | 84 | 67 | Online |
| 7 | 1086 | Evelyn | F | 45 | NaN | Mombasa | 35000.0 | 63 | 46 | Both |
| 8 | 1074 | Frank | M | thirty | 15-07-2022 | Eldoret | 90000.0 | 66 | 48 | Both |
| 9 | 1074 | Grace | Male | 40 | 2020-06-10 | Mombasa | 35000.0 | 85 | 47 | Both |
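
Choosing a strategy "based on meaning" can also mean imputing within a relevant group rather than across the whole column. The following is a sketch under assumptions, not the notebook's method: it fills missing income with the median of the same city, falls back to the overall median, and keeps a flag recording which values were imputed. The frame `toy` and the column `Income_Was_Missing` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: income varies by city, and some values are missing.
toy = pd.DataFrame({
    "City": ["Nairobi", "Nairobi", "Nairobi", "Kisumu", "Kisumu", "Eldoret"],
    "Income": [90000.0, np.nan, 70000.0, 35000.0, np.nan, np.nan],
})

# Record which values are about to be imputed before overwriting them.
toy["Income_Was_Missing"] = toy["Income"].isna()

# Median income within each city, aligned row by row with the original frame.
# Eldoret has no observed income, so its group median is NaN.
city_median = toy.groupby("City")["Income"].transform("median")

# Use the city median where one exists, otherwise the overall median.
overall_median = toy["Income"].median()
toy["Income"] = toy["Income"].fillna(city_median).fillna(overall_median)

print(toy)
```

Keeping the indicator column lets later analysis check whether imputed rows behave differently from observed ones.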


### 3.2 Removing Duplicates

In [4]:
print(f"Duplicates before: {df.duplicated().sum()}")
df = df.drop_duplicates()
print(f"Duplicates after: {df.duplicated().sum()}")

Duplicates before: 5
Duplicates after: 0
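
`drop_duplicates()` only removes rows that match in every column. As a hedged illustration on hypothetical data (`records` is not part of the notebook), the sketch below inspects exact duplicates before dropping them and separately surfaces rows that share a `Customer_ID` but disagree elsewhere, which usually call for a decision rather than a blind drop:

```python
import pandas as pd

# Hypothetical records: the last row repeats the first, and ID 1002 appears
# twice with different cities.
records = pd.DataFrame({
    "Customer_ID": [1001, 1002, 1002, 1003, 1001],
    "Name": ["Alice", "Bob", "Bob", "Grace", "Alice"],
    "City": ["Nairobi", "Mombasa", "Kisumu", "Eldoret", "Nairobi"],
})

# keep=False marks every member of a duplicate group, which helps inspection.
exact_dupes = records[records.duplicated(keep=False)]
print("Exact duplicate rows:\n", exact_dupes)

# Rows that share a key but differ in other columns.
key_dupes = records[records.duplicated(subset=["Customer_ID"], keep=False)]
conflicts = key_dupes.drop_duplicates()
conflicts = conflicts[conflicts.duplicated(subset=["Customer_ID"], keep=False)]
print("\nSame Customer_ID, different details:\n", conflicts)

# Dropping exact duplicates keeps the first occurrence by default.
deduped = records.drop_duplicates().reset_index(drop=True)
print("\nRows after dropping exact duplicates:", len(deduped))
```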


### 3.3 Fixing Data Types

In [5]:
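# Convert Age to numeric; non-numeric entries such as 'thirty' become NaN
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

# Convert Join_Date to datetime, reading ambiguous dates as day-first.
# Note: pandas >= 2.0 infers a single format from the first value, so rows in
# any other format are coerced to NaT (hence the many NaT values in later outputs).
# Passing format='mixed' makes pandas parse each value individually instead.
df['Join_Date'] = pd.to_datetime(df['Join_Date'], errors='coerce', dayfirst=True)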

df.dtypes


Customer_ID                   int32
Name                         object
Gender                       object
Age                         float64
Join_Date            datetime64[ns]
City                         object
Income                      float64
Satisfaction_2022             int32
Satisfaction_2023             int32
Preferred_Channel            object
dtype: object
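
Coercion is convenient but lossy: `errors='coerce'` turns `'thirty'` into NaN, and on pandas 2.x a `to_datetime` call without an explicit `format` applies the layout inferred from the first value to every row. A minimal sketch of a more conservative alternative, assuming the three date layouts seen in `Join_Date` (the helper `parse_mixed_dates`, the mapping `word_to_number`, and the sample Series are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample using the same three date layouts as Join_Date.
raw_dates = pd.Series(["2021-01-05", "2019/12/15", "15-07-2022", np.nan, "not a date"])

def parse_mixed_dates(values: pd.Series, formats: list) -> pd.Series:
    """Try each explicit format in turn and keep the first successful parse."""
    parsed = pd.Series(pd.NaT, index=values.index, dtype="datetime64[ns]")
    for fmt in formats:
        attempt = pd.to_datetime(values, format=fmt, errors="coerce")
        parsed = parsed.fillna(attempt)
    return parsed

join_dates = parse_mixed_dates(raw_dates, ["%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y"])
print(join_dates)

# Hypothetical ages: map known spelled-out numbers before coercing the rest.
raw_age = pd.Series([30, "thirty", np.nan, 45], dtype="object")
word_to_number = {"thirty": 30}  # extend with other spelled-out values as needed
age = pd.to_numeric(raw_age.map(lambda v: word_to_number.get(v, v)), errors="coerce")
print(age)
```

Listing the formats explicitly makes the day-first assumption visible in the code, and anything that matches none of them still ends up as `NaT` for review.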

### 3.4 String Normalization and Categorical Cleanup

In [6]:
# Strip spaces, normalize case
df['City'] = df['City'].str.strip().str.title()

# Normalize Gender labels
df['Gender'] = df['Gender'].replace({'male':'M','female':'F','Male':'M','Female':'F'})

# Ensure categorical columns have consistent categories
df['Preferred_Channel'] = df['Preferred_Channel'].replace({'both':'Both','online':'Online','in-store':'In-store'})

df['Gender'].value_counts(), df['City'].unique()

(Gender
 F    33
 M    27
 Name: count, dtype: int64,
 array(['Eldoret', 'Mombasa', 'Kisumu', nan, 'Nairobi'], dtype=object))
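
An explicit `replace` dictionary has to list every casing variant. One possible alternative, sketched here on hypothetical labels (`gender_raw` and `gender_map` are illustrative names), lowercases first, maps through a single dictionary, reports anything the mapping did not cover, and then fixes the allowed values with a categorical dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical raw labels, including stray whitespace and an unexpected value.
gender_raw = pd.Series(["M", " male", "Female", "f", np.nan, "unknown"])

# Lowercase keys cover every casing variant with a single mapping.
gender_map = {"m": "M", "male": "M", "f": "F", "female": "F"}

cleaned = gender_raw.str.strip().str.lower().map(gender_map)

# Labels that were present but not covered by the mapping deserve a manual check.
unmapped = gender_raw[cleaned.isna() & gender_raw.notna()].unique()
print("Unmapped labels:", list(unmapped))

# A categorical dtype fixes the set of allowed values going forward.
gender = cleaned.astype(pd.CategoricalDtype(categories=["F", "M"]))
print(gender.value_counts(dropna=False))
```

Unlike `replace`, `map` turns unknown labels into NaN, so the validation step is what keeps them from disappearing unnoticed.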

### 3.5 Handling Outliers (IQR & Capping)

In [7]:
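# IQR fences: values more than 1.5 * IQR beyond the quartiles count as outliers
Q1, Q3 = df['Income'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR

# Cap outliers at the fences instead of dropping them, so no rows are lost
df['Income_Capped'] = df['Income'].clip(lower=lower, upper=upper)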

df[['Income', 'Income_Capped']].describe()

| | Income | Income_Capped |
|---|---|---|
| count | 60.0 | 60.0 |
| mean | 235333.333333 | 84000.0 |
| std | 346828.080631 | 26438.287593 |
| min | 35000.0 | 40000.0 |
| 25% | 70000.0 | 70000.0 |
| 50% | 90000.0 | 90000.0 |
| 75% | 90000.0 | 90000.0 |
| max | 1000000.0 | 120000.0 |


In [8]:
df[['Income', 'Income_Capped']]

| | Income | Income_Capped |
|---|---|---|
| 0 | 90000.0 | 90000.0 |
| 1 | 1000000.0 | 120000.0 |
| 2 | 200000.0 | 120000.0 |
| 3 | 70000.0 | 70000.0 |
| 4 | 90000.0 | 90000.0 |
| 5 | 90000.0 | 90000.0 |
| 6 | 90000.0 | 90000.0 |
| 7 | 35000.0 | 40000.0 |
| 8 | 90000.0 | 90000.0 |
| 9 | 35000.0 | 40000.0 |
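
The bounds follow directly from the quartiles in the summary: with Q1 = 70000 and Q3 = 90000, the IQR is 20000, so the lower fence is 70000 - 1.5 × 20000 = 40000 and the upper fence is 90000 + 1.5 × 20000 = 120000. The same capping can be wrapped in a reusable function; this is a sketch on hypothetical values chosen to give those quartiles (`cap_iqr` is an illustrative name):

```python
import pandas as pd

def cap_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; NaN entries stay NaN."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical incomes chosen so that Q1 = 70000 and Q3 = 90000.
income = pd.Series(
    [35000, 70000, 70000, 90000, 90000, 90000, 90000, 200000, 1000000],
    dtype=float,
)
capped = cap_iqr(income)
print(capped.tolist())
```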


### 3.6 Creating New Features

In [None]:
# Create new derived features (2025 is hard-coded as the reference year)
df['Years_Since_Join'] = 2025 - df['Join_Date'].dt.year
# The ratio is NaN wherever Age is missing
df['Income_per_Age'] = df['Income_Capped'] / df['Age']

df[['Age','Join_Date','Years_Since_Join','Income_per_Age']].head()
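
A hard-coded year is easy to forget when the notebook is rerun later. One way to make the reference point explicit is sketched below on hypothetical rows; `REFERENCE_DATE`, `people`, and `Tenure_Years` are assumptions for illustration, not names from the notebook:

```python
import numpy as np
import pandas as pd

# A fixed reference date keeps results reproducible across reruns.
REFERENCE_DATE = pd.Timestamp("2025-01-01")

# Hypothetical rows with one missing join date and one missing age.
people = pd.DataFrame({
    "Join_Date": pd.to_datetime(["2019-12-15", "2022-07-15", None]),
    "Age": [30.0, np.nan, 45.0],
    "Income_Capped": [90000.0, 120000.0, 70000.0],
})

# Difference between the reference year and the join year.
people["Years_Since_Join"] = REFERENCE_DATE.year - people["Join_Date"].dt.year

# Fractional tenure in years, based on elapsed days.
people["Tenure_Years"] = (REFERENCE_DATE - people["Join_Date"]).dt.days / 365.25

# A missing Age or Join_Date propagates as NaN instead of raising an error.
people["Income_per_Age"] = people["Income_Capped"] / people["Age"]
print(people)
```

The year difference counts a customer who joined in December 2019 as six years old by January 2025, while the day-based tenure gives just over five, so the two columns answer slightly different questions.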

## 🔁 4. Data Transformation Stage

### 4.1 Column Renaming and Reordering

In [9]:
df = df.rename(columns={'Income_Capped':'Annual_Income','Preferred_Channel':'Channel'})
cols = ['Customer_ID','Name','Gender','Age','City','Annual_Income','Satisfaction_2022','Satisfaction_2023','Channel','Join_Date']
df = df[cols + [c for c in df.columns if c not in cols]]
df.head()

| | Customer_ID | Name | Gender | Age | City | Annual_Income | Satisfaction_2022 | Satisfaction_2023 | Channel | Join_Date | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1051 | Grace | F | 30.0 | Eldoret | 90000.0 | 77 | 98 | In-store | 2019-12-15 | 90000.0 |
| 1 | 1092 | David | M | 35.0 | Mombasa | 120000.0 | 96 | 44 | Both | 2019-12-15 | 1000000.0 |
| 2 | 1014 | Catherine | M | 30.0 | Kisumu | 120000.0 | 56 | 81 | Both | NaT | 200000.0 |
| 3 | 1071 | Catherine | F | 45.0 | NaN | 70000.0 | 93 | 78 | Both | NaT | 70000.0 |
| 4 | 1060 | Frank | M | 35.0 | Nairobi | 90000.0 | 57 | 97 | In-store | NaT | 90000.0 |


### 4.2 Sorting and Reindexing

In [10]:
df = df.sort_values(by=['City','Annual_Income'], ascending=[True, False]).reset_index(drop=True)
df.head()

| | Customer_ID | Name | Gender | Age | City | Annual_Income | Satisfaction_2022 | Satisfaction_2023 | Channel | Join_Date | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1072 | David | M | 35.0 | Eldoret | 120000.0 | 58 | 91 | In-store | NaT | 200000.0 |
| 1 | 1051 | Grace | F | 30.0 | Eldoret | 90000.0 | 77 | 98 | In-store | 2019-12-15 | 90000.0 |
| 2 | 1074 | Frank | M | NaN | Eldoret | 90000.0 | 66 | 48 | Both | NaT | 90000.0 |
| 3 | 1001 | Alice | F | 40.0 | Eldoret | 90000.0 | 53 | 63 | In-store | NaT | 90000.0 |
| 4 | 1090 | David | M | NaN | Eldoret | 90000.0 | 72 | 40 | Both | NaT | 90000.0 |


### 4.3 Reshaping: Wide to Long (Melt)

In [11]:
df_long = pd.melt(df, 
    id_vars=['Customer_ID','Name','Gender','City'],
    value_vars=['Satisfaction_2022','Satisfaction_2023'],
    var_name='Year', value_name='Satisfaction_Score'
)
df_long.head(10)

| | Customer_ID | Name | Gender | City | Year | Satisfaction_Score |
|---|---|---|---|---|---|---|
| 0 | 1072 | David | M | Eldoret | Satisfaction_2022 | 58 |
| 1 | 1051 | Grace | F | Eldoret | Satisfaction_2022 | 77 |
| 2 | 1074 | Frank | M | Eldoret | Satisfaction_2022 | 66 |
| 3 | 1001 | Alice | F | Eldoret | Satisfaction_2022 | 53 |
| 4 | 1090 | David | M | Eldoret | Satisfaction_2022 | 72 |
| 5 | 1058 | Bob | F | Eldoret | Satisfaction_2022 | 89 |
| 6 | 1013 | David | M | Eldoret | Satisfaction_2022 | 60 |
| 7 | 1089 | Grace | M | Eldoret | Satisfaction_2022 | 57 |
| 8 | 1041 | Bob | M | Eldoret | Satisfaction_2022 | 70 |
| 9 | 1014 | Catherine | M | Kisumu | Satisfaction_2022 | 56 |


### 4.4 Reshaping: Long to Wide (Pivot)

In [12]:
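# Keep only the four-digit year from labels such as 'Satisfaction_2022'
df_long['Year'] = df_long['Year'].str.extract(r'(\d{4})', expand=False)
# pivot_table averages the scores when a (Customer_ID, Name) pair occurs more than once
df_pivot = df_long.pivot_table(index=['Customer_ID','Name'], columns='Year', values='Satisfaction_Score').reset_index()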
df_pivot.head()

| Year | Customer_ID | Name | 2022 | 2023 |
|---|---|---|---|---|
| 0 | 1001 | Alice | 53.0 | 63.0 |
| 1 | 1001 | Frank | 93.0 | 79.0 |
| 2 | 1001 | Grace | 84.0 | 68.0 |
| 3 | 1002 | Alice | 51.0 | 87.0 |
| 4 | 1002 | Frank | 94.0 | 90.0 |
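
The scores come back as floats because `pivot_table` aggregates, using the mean by default, whenever an index/column combination is not unique. Since `Customer_ID` is drawn at random, repeats are possible. The sketch below uses hypothetical data (`wide`, `long_df`, and `Score` are illustrative names) to contrast this with `pivot`, which refuses to reshape duplicated entries:

```python
import pandas as pd

# Hypothetical wide data in which Customer_ID 1001 appears twice.
wide = pd.DataFrame({
    "Customer_ID": [1001, 1001, 1002],
    "Satisfaction_2022": [50, 70, 90],
    "Satisfaction_2023": [60, 80, 95],
})

# Wide to long, then reduce the column label to an integer year.
long_df = wide.melt(id_vars="Customer_ID", var_name="Year", value_name="Score")
long_df["Year"] = long_df["Year"].str.extract(r"(\d{4})", expand=False).astype(int)

# pivot() raises when an index/column pair is not unique.
try:
    long_df.pivot(index="Customer_ID", columns="Year", values="Score")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot raised:", err)

# pivot_table() aggregates the duplicates instead (mean by default).
averaged = long_df.pivot_table(index="Customer_ID", columns="Year", values="Score")
print(averaged)
```

If averaging is not the intended behaviour, deduplicating on the key first or passing a different `aggfunc` makes the choice explicit.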


### 4.5 Aggregation and Grouping

In [13]:
city_summary = df.groupby('City')[['Annual_Income','Satisfaction_2023']].agg(['mean','median','count']).reset_index()
city_summary.head()

| | City | Annual_Income (mean) | Annual_Income (median) | Annual_Income (count) | Satisfaction_2023 (mean) | Satisfaction_2023 (median) | Satisfaction_2023 (count) |
|---|---|---|---|---|---|---|---|
| 0 | Eldoret | 84444.444444 | 90000.0 | 9 | 64.555556 | 63.0 | 9 |
| 1 | Kisumu | 90625.0 | 90000.0 | 16 | 73.375 | 73.0 | 16 |
| 2 | Mombasa | 78000.0 | 80000.0 | 10 | 59.6 | 52.5 | 10 |
| 3 | Nairobi | 80000.0 | 90000.0 | 17 | 72.117647 | 75.0 | 17 |
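
Passing a list of functions to `agg` produces two-level column headers, which can be awkward to reference downstream. Named aggregation is one way to get flat, self-describing columns; this sketch uses hypothetical rows and output names (`income_mean`, `n_customers`, and so on are illustrative):

```python
import pandas as pd

# Hypothetical rows with the same column names as the cleaned dataset.
customers = pd.DataFrame({
    "City": ["Eldoret", "Eldoret", "Kisumu", "Kisumu", "Kisumu"],
    "Annual_Income": [90000.0, 70000.0, 120000.0, 90000.0, 90000.0],
    "Satisfaction_2023": [98, 54, 81, 73, 65],
})

# Named aggregation: output_column=(input_column, function).
flat_summary = (
    customers.groupby("City")
    .agg(
        income_mean=("Annual_Income", "mean"),
        income_median=("Annual_Income", "median"),
        satisfaction_mean=("Satisfaction_2023", "mean"),
        n_customers=("Annual_Income", "size"),
    )
    .reset_index()
)
print(flat_summary)
```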


### 4.6 Merging and Concatenation

In [None]:
# Split the frame in two, then concatenate the halves back together
df_a = df.iloc[:30]
df_b = df.iloc[30:]

merged_df = pd.concat([df_a, df_b])
merged_df.shape
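
The cell above covers concatenation, which stacks rows. Merging instead joins columns from two tables on a shared key. A minimal sketch, assuming a hypothetical city-to-region lookup table (`regions` and its contents are illustrative and not part of the dataset):

```python
import pandas as pd

# Hypothetical customer rows and a hypothetical lookup table.
customers = pd.DataFrame({
    "Customer_ID": [1051, 1092, 1014, 1060],
    "City": ["Eldoret", "Mombasa", "Kisumu", "Nakuru"],
})
regions = pd.DataFrame({
    "City": ["Eldoret", "Mombasa", "Kisumu", "Nairobi"],
    "Region": ["Rift Valley", "Coast", "Nyanza", "Nairobi"],
})

# A left join keeps every customer. validate= checks that each city matches at
# most one lookup row, and indicator= records where each row came from.
merged = customers.merge(
    regions, on="City", how="left", validate="many_to_one", indicator=True
)
print(merged)

# Customers whose city is absent from the lookup are marked 'left_only'.
unmatched = merged.loc[merged["_merge"] == "left_only", "City"].tolist()
print("Cities without a region:", unmatched)
```

With `validate="many_to_one"`, pandas raises a `MergeError` if the lookup table contains a repeated city, which guards against silently multiplying rows.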

## ✅ 5. Final Clean Data Overview

In [14]:
print('Final shape:', df.shape)
print('Missing values per column:')
print(df.isna().sum())
df.head(10)

Final shape: (60, 11)
Missing values per column:
Customer_ID           0
Name                  0
Gender                0
Age                  20
City                  8
Annual_Income         0
Satisfaction_2022     0
Satisfaction_2023     0
Channel               0
Join_Date            50
Income                0
dtype: int64


| | Customer_ID | Name | Gender | Age | City | Annual_Income | Satisfaction_2022 | Satisfaction_2023 | Channel | Join_Date | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1072 | David | M | 35.0 | Eldoret | 120000.0 | 58 | 91 | In-store | NaT | 200000.0 |
| 1 | 1051 | Grace | F | 30.0 | Eldoret | 90000.0 | 77 | 98 | In-store | 2019-12-15 | 90000.0 |
| 2 | 1074 | Frank | M | NaN | Eldoret | 90000.0 | 66 | 48 | Both | NaT | 90000.0 |
| 3 | 1001 | Alice | F | 40.0 | Eldoret | 90000.0 | 53 | 63 | In-store | NaT | 90000.0 |
| 4 | 1090 | David | M | NaN | Eldoret | 90000.0 | 72 | 40 | Both | NaT | 90000.0 |
| 5 | 1058 | Bob | F | NaN | Eldoret | 90000.0 | 89 | 44 | In-store | 2019-12-15 | 90000.0 |
| 6 | 1013 | David | M | 40.0 | Eldoret | 70000.0 | 60 | 78 | In-store | NaT | 70000.0 |
| 7 | 1089 | Grace | M | 35.0 | Eldoret | 70000.0 | 57 | 54 | Both | NaT | 70000.0 |
| 8 | 1041 | Bob | M | 35.0 | Eldoret | 50000.0 | 70 | 65 | Both | NaT | 50000.0 |
| 9 | 1014 | Catherine | M | 30.0 | Kisumu | 120000.0 | 56 | 81 | Both | NaT | 200000.0 |


---
## 🎯 Key Takeaways
- Always **profile your data** before cleaning.
- Choose imputation strategy **based on meaning**, not just statistics.
- Normalize and validate text/categorical data.
- Use **reshaping** (melt/pivot) to prepare for visualization or modeling.
- Create **derived features** for deeper insights.
- Keep transformations **documented and reproducible**.
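
One way to keep transformations documented and reproducible is to express each step as a small function and chain them with `DataFrame.pipe`. The sketch below is an illustration on hypothetical data, covering only three of the steps from this notebook; the function names are assumptions:

```python
import numpy as np
import pandas as pd

def drop_exact_duplicates(frame: pd.DataFrame) -> pd.DataFrame:
    """Remove rows that match another row in every column."""
    return frame.drop_duplicates().reset_index(drop=True)

def normalize_city(frame: pd.DataFrame) -> pd.DataFrame:
    """Trim whitespace and title-case city names."""
    return frame.assign(City=frame["City"].str.strip().str.title())

def impute_income(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill missing income with the column median."""
    return frame.assign(Income=frame["Income"].fillna(frame["Income"].median()))

def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply each step in order without modifying the input frame."""
    return (
        frame.pipe(drop_exact_duplicates)
        .pipe(normalize_city)
        .pipe(impute_income)
    )

# Hypothetical raw input with a duplicate row, messy casing, and a missing value.
raw = pd.DataFrame({
    "City": ["NAIROBI", " kisumu", "NAIROBI", None],
    "Income": [50000.0, np.nan, 50000.0, 90000.0],
})
tidy = clean(raw)
print(tidy)
```

Because every step returns a new frame, `raw` stays untouched and the whole pipeline can be rerun or tested step by step.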