# Data Cleaning:

---

Data cleaning was executed to resolve data quality issues through deduplication, type correction, missing value handling, categorical normalization, and outlier treatment. The result is a standardized, analysis-ready dataset suitable for feature engineering, analytics, and modeling workflows.

### Objective:
Apply corrections identified during quality assessment to ensure a reliable analytical dataset.

---

### Tasks:

- Remove duplicate rows
- Handle missing values
    - Drop only when justified
    - Impute using mean/median/forward fill when appropriate
- Standardize field naming convention
    - Convert to snake_case
- Normalize categorical values
    - Example: unify case and spelling:
        - “fb”, “FB”, “Facebook” → Facebook
- Address outliers
    - Remove, cap, or transform based on statistical justification
- Save cleaned dataset to /data/processed/marketing_campaigns_cleaned.csv
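
The outlier task in the list above has no matching cell further down in this notebook. As a rough sketch only, capping could be done with the conventional 1.5 × IQR fences. The helper name `cap_outliers_iqr`, the multiplier, and the sample values are assumptions for illustration, not the project's chosen method.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the fence values."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical spend values with one extreme entry
spend = pd.Series([100.0, 120.0, 110.0, 130.0, 125.0, 5000.0])
capped = cap_outliers_iqr(spend)
```

Capping keeps the row count unchanged, which may be preferable to removal when every campaign must remain in the dataset.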



### Outcome:

A consistent, analysis-ready dataset to support feature engineering, EDA, and modeling.

---

## Ingest Data:

### Import Libraries:

In [151]:
#Libraries
    
import pandas as pd
import os

### Ingest Data as DF:

In [152]:
# Load data
# Raw yearly files (currently disabled)

# df1 = pd.read_csv("../data/raw/marketing_campaign_2024.csv")
# df2 = pd.read_csv("../data/raw/marketing_campaign_2025.csv")

# Combined 2024 + 2025 dataset
df = pd.read_csv("../data/processed/marketing_campaign_all_clean.csv")
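
The commented-out lines point to two yearly raw files behind the combined dataset. The sketch below shows one way such files might be stacked, using in-memory CSV text as a stand-in for the real files. The sample rows, and the idea that `dataset_year` is assigned at this step, are assumptions.

```python
import io
import pandas as pd

# In-memory stand-ins for the two raw yearly CSV files (hypothetical rows)
csv_2024 = io.StringIO("campaign_id,channel\n2024_0001,Search\n2024_0002,Social\n")
csv_2025 = io.StringIO("campaign_id,channel\n2025_0001,Email\n")

frames = []
for year, source in [(2024, csv_2024), (2025, csv_2025)]:
    part = pd.read_csv(source)
    part["dataset_year"] = year  # record which file each row came from
    frames.append(part)

# Stack the yearly frames and rebuild a clean 0..n-1 index
combined = pd.concat(frames, ignore_index=True)
```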

### Brief Overview:

In [153]:
print("Combined dataset shape: ", df.shape)
df.info()  # info() prints its report directly and returns None
print(df.describe())
df.head(10)

Combined dataset shape:  (1000, 16)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   campaign_id       1000 non-null   object 
 1   campaign_name     1000 non-null   object 
 2   start_date        1000 non-null   object 
 3   end_date          1000 non-null   object 
 4   channel           1000 non-null   object 
 5   region            1000 non-null   object 
 6   impressions       1000 non-null   int64  
 7   clicks            1000 non-null   int64  
 8   conversions       1000 non-null   int64  
 9   spend_usd         1000 non-null   float64
 10  revenue_usd       1000 non-null   float64
 11  target_audience   1000 non-null   object 
 12  product_category  1000 non-null   object 
 13  device            1000 non-null   object 
 14  year              1000 non-null   int64  
 15  dataset_year      1000 non-null   int64  
dtypes: floa

| | campaign_id | campaign_name | start_date | end_date | channel | region | impressions | clicks | conversions | spend_usd | revenue_usd | target_audience | product_category | device | year | dataset_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024_0001 | Campaign_2024_0001 | 2024-05-16 | 2024-08-16 | Search | South America | 28252 | 5609 | 65466 | 39193.43 | 79017.74 | Youth | Electronics | Desktop | 2024 | 2024 |
| 1 | 2024_0002 | Campaign_2024_0002 | 2024-04-06 | 2024-10-13 | Search | Asia | 89608 | 83584 | 26865 | 17291.53 | 49868.54 | Adults | Home | Mobile | 2024 | 2024 |
| 2 | 2024_0003 | Campaign_2024_0003 | 2024-05-08 | 2024-11-27 | Social | Europe | 37853 | 62661 | 43662 | 6729.63 | 63021.28 | Seniors | Electronics | Desktop | 2024 | 2024 |
| 3 | 2024_0004 | Campaign_2024_0004 | 2024-01-28 | 2024-08-03 | Display | Africa | 10577 | 41421 | 75023 | 15077.58 | 133106.71 | Seniors | Clothing | Desktop | 2024 | 2024 |
| 4 | 2024_0005 | Campaign_2024_0005 | 2024-02-06 | 2024-08-23 | Social | Asia | 84039 | 56010 | 11283 | 16877.69 | 144736.99 | Adults | Home | Mobile | 2024 | 2024 |
| 5 | 2024_0006 | Campaign_2024_0006 | 2024-02-11 | 2024-12-19 | Search | Europe | 46728 | 87560 | 69277 | 6698.46 | 105523.82 | Adults | Travel | Mobile | 2024 | 2024 |
| 6 | 2024_0007 | Campaign_2024_0007 | 2024-05-07 | 2024-09-03 | Search | Asia | 24269 | 4485 | 39619 | 8269.18 | 113563.46 | Seniors | Home | Mobile | 2024 | 2024 |
| 7 | 2024_0008 | Campaign_2024_0008 | 2024-03-01 | 2024-09-13 | Search | South America | 45499 | 20267 | 8609 | 4110.73 | 59552.89 | Seniors | Home | Desktop | 2024 | 2024 |
| 8 | 2024_0009 | Campaign_2024_0009 | 2024-03-15 | 2024-07-31 | Print | Asia | 14398 | 79811 | 62095 | 14214.98 | 74892.62 | Adults | Services | Mobile | 2024 | 2024 |
| 9 | 2024_0010 | Campaign_2024_0010 | 2024-03-07 | 2024-10-15 | Social | Africa | 18278 | 57806 | 55147 | 17466.57 | 38287.83 | Youth | Home | Mobile | 2024 | 2024 |
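
In the preview above, some rows report more clicks than impressions or more conversions than clicks. Row 0, for example, has 5609 clicks but 65466 conversions. Whether that is an error depends on how the metrics are defined, which the source does not state. Assuming the usual funnel ordering (impressions ≥ clicks ≥ conversions), a check along these lines could flag such rows for review:

```python
import pandas as pd

# First three rows of the preview, metrics only
sample = pd.DataFrame({
    "campaign_id": ["2024_0001", "2024_0002", "2024_0003"],
    "impressions": [28252, 89608, 37853],
    "clicks": [5609, 83584, 62661],
    "conversions": [65466, 26865, 43662],
})

# Flag rows that break the assumed funnel ordering
# impressions >= clicks >= conversions
funnel_ok = (sample["impressions"] >= sample["clicks"]) & (
    sample["clicks"] >= sample["conversions"]
)
suspect = sample.loc[~funnel_ok, "campaign_id"].tolist()
```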


## THE COLUMNS:

### Standardize Column Names:

In [154]:
# df1.columns = df1.columns.str.lower().str.strip().str.replace(' ', '_')
# df2.columns = df2.columns.str.lower().str.strip().str.replace(' ', '_')

# # for df in [df1, df2]:
# for df in [df]:
# Lowercase, trim surrounding whitespace, and replace spaces with underscores
df.columns = (
    df.columns.str.lower()
    .str.strip()
    .str.replace(" ", "_")
)



In [155]:
df.columns

Index(['campaign_id', 'campaign_name', 'start_date', 'end_date', 'channel',
       'region', 'impressions', 'clicks', 'conversions', 'spend_usd',
       'revenue_usd', 'target_audience', 'product_category', 'device', 'year',
       'dataset_year'],
      dtype='object')
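
The chain above is enough for headers that differ only in case and spacing. For camelCase or punctuated headers, a regex-based helper could be used instead. The sketch below is one possible approach; the function name `to_snake_case` and the sample headers are hypothetical.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Convert an arbitrary column label to snake_case."""
    name = name.strip()
    # Insert an underscore at lower-to-upper boundaries, e.g. "startDate"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse any run of non-alphanumeric characters into one underscore
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.strip("_").lower()

# Hypothetical messy headers
raw = pd.DataFrame(columns=[" Campaign ID ", "startDate", "Spend (USD)"])
renamed = raw.rename(columns=to_snake_case)
```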

### Remove duplicates:

In [156]:
#Remove Duplicates

# for df in [df1, df2]:
# for df in [df]:
df.drop_duplicates(subset=["campaign_id"], inplace=True)
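
Dropping on `campaign_id` alone keeps the first row for each id, even when the other columns differ. Counting the duplicates first shows how many rows the step will remove. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical frame with one repeated campaign_id
demo = pd.DataFrame({
    "campaign_id": ["2024_0001", "2024_0002", "2024_0001"],
    "clicks": [10, 20, 30],
})

# Count rows whose campaign_id already appeared earlier in the frame
n_dupes = int(demo.duplicated(subset=["campaign_id"]).sum())

# keep="first" (the default) retains the earliest row for each id
deduped = demo.drop_duplicates(subset=["campaign_id"], keep="first").reset_index(drop=True)
```

In this notebook the shape stays at (1000, 16) after the step, so no duplicate ids appear to have been present.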


### Unique Values:

In [157]:
numbers = df.select_dtypes(include="number")
numbers.head()

| | impressions | clicks | conversions | spend_usd | revenue_usd | year | dataset_year |
|---|---|---|---|---|---|---|---|
| 0 | 28252 | 5609 | 65466 | 39193.43 | 79017.74 | 2024 | 2024 |
| 1 | 89608 | 83584 | 26865 | 17291.53 | 49868.54 | 2024 | 2024 |
| 2 | 37853 | 62661 | 43662 | 6729.63 | 63021.28 | 2024 | 2024 |
| 3 | 10577 | 41421 | 75023 | 15077.58 | 133106.71 | 2024 | 2024 |
| 4 | 84039 | 56010 | 11283 | 16877.69 | 144736.99 | 2024 | 2024 |


In [158]:
non_numbers = df.select_dtypes(exclude="number")
non_numbers.head()

| | campaign_id | campaign_name | start_date | end_date | channel | region | target_audience | product_category | device |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024_0001 | Campaign_2024_0001 | 2024-05-16 | 2024-08-16 | Search | South America | Youth | Electronics | Desktop |
| 1 | 2024_0002 | Campaign_2024_0002 | 2024-04-06 | 2024-10-13 | Search | Asia | Adults | Home | Mobile |
| 2 | 2024_0003 | Campaign_2024_0003 | 2024-05-08 | 2024-11-27 | Social | Europe | Seniors | Electronics | Desktop |
| 3 | 2024_0004 | Campaign_2024_0004 | 2024-01-28 | 2024-08-03 | Display | Africa | Seniors | Clothing | Desktop |
| 4 | 2024_0005 | Campaign_2024_0005 | 2024-02-06 | 2024-08-23 | Social | Asia | Adults | Home | Mobile |


In [159]:
# Check unique values in the categorical columns for duplicates and possible misspellings

unique_values1 = df["channel"].unique()
unique_values2 = df["region"].unique()
unique_values3 = df["target_audience"].unique()
unique_values4 = df["product_category"].unique()
unique_values5 = df["device"].unique()

print("Unique values:", unique_values1)
print("Unique values:", unique_values2)
print("Unique values:", unique_values3)
print("Unique values:", unique_values4)
print("Unique values:", unique_values5)

Unique values: ['Search' 'Social' 'Display' 'Print' 'Email']
Unique values: ['South America' 'Asia' 'Europe' 'Africa' 'North America']
Unique values: ['Youth' 'Adults' 'Seniors']
Unique values: ['Electronics' 'Home' 'Clothing' 'Travel' 'Services']
Unique values: ['Desktop' 'Mobile' 'Tablet']
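
The values listed above are already consistent, so no normalization was needed here. Had variants turned up, such as the “fb” / “FB” / “Facebook” example from the task list, an alias lookup could unify them. The sketch below assumes a hand-maintained mapping; `CHANNEL_ALIASES` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical alias table: lowercase variant -> canonical label
CHANNEL_ALIASES = {"fb": "Facebook", "facebook": "Facebook"}

channel = pd.Series(["fb", "FB", " Facebook ", "Search"])

# Normalize whitespace and case before looking up the alias
key = channel.str.strip().str.lower()
# Unmapped values fall back to their stripped original form
normalized = key.map(CHANNEL_ALIASES).fillna(channel.str.strip())
```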


## Fix data types:

### Convert Data types - Dates, Float & Numeric:

In [160]:
# Inspect dtypes before converting the date columns

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   campaign_id       1000 non-null   object 
 1   campaign_name     1000 non-null   object 
 2   start_date        1000 non-null   object 
 3   end_date          1000 non-null   object 
 4   channel           1000 non-null   object 
 5   region            1000 non-null   object 
 6   impressions       1000 non-null   int64  
 7   clicks            1000 non-null   int64  
 8   conversions       1000 non-null   int64  
 9   spend_usd         1000 non-null   float64
 10  revenue_usd       1000 non-null   float64
 11  target_audience   1000 non-null   object 
 12  product_category  1000 non-null   object 
 13  device            1000 non-null   object 
 14  year              1000 non-null   int64  
 15  dataset_year      1000 non-null   int64  
dtypes: float64(2), int64(5), object(9)
memory u

In [161]:
# for df in [df]:

# Parse dates and enforce numeric types; unparseable values become NaT/NaN
df['start_date'] = pd.to_datetime(df['start_date'], errors='coerce')
df['end_date'] = pd.to_datetime(df['end_date'], errors='coerce')
df['spend_usd'] = pd.to_numeric(df['spend_usd'], errors='coerce')
df['revenue_usd'] = pd.to_numeric(df['revenue_usd'], errors='coerce')
# df['dataset_year'] = pd.to_datetime(df['dataset_year'], errors='coerce')
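
Because `errors='coerce'` silently turns unparseable values into `NaT`, it can help to count how many values were lost in the conversion and to confirm that each campaign ends after it starts. A sketch on made-up rows (the `demo` frame is hypothetical):

```python
import pandas as pd

# Hypothetical rows: one unparseable date and one reversed date range
demo = pd.DataFrame({
    "start_date": ["2024-05-16", "not a date", "2024-09-01"],
    "end_date": ["2024-08-16", "2024-10-13", "2024-03-01"],
})

parsed = demo.copy()
for col in ["start_date", "end_date"]:
    parsed[col] = pd.to_datetime(demo[col], errors="coerce")

# Values that were present as text but became NaT after coercion
n_coerced = int((parsed["start_date"].isna() & demo["start_date"].notna()).sum())

# Campaigns whose end date precedes their start date
n_reversed = int((parsed["end_date"] < parsed["start_date"]).sum())
```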


In [162]:
# Observe changes in df

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   campaign_id       1000 non-null   object        
 1   campaign_name     1000 non-null   object        
 2   start_date        1000 non-null   datetime64[ns]
 3   end_date          1000 non-null   datetime64[ns]
 4   channel           1000 non-null   object        
 5   region            1000 non-null   object        
 6   impressions       1000 non-null   int64         
 7   clicks            1000 non-null   int64         
 8   conversions       1000 non-null   int64         
 9   spend_usd         1000 non-null   float64       
 10  revenue_usd       1000 non-null   float64       
 11  target_audience   1000 non-null   object        
 12  product_category  1000 non-null   object        
 13  device            1000 non-null   object        
 14  year              1000 no

## Missing Values

### Impute missing metrics (replace NA/blank with 0):

In [164]:
#Missing Values

# # for df in [df1, df2]:
# for df in [df]:
df.fillna({
    'impressions': 0,
    'clicks': 0,
    'conversions': 0,
    'spend_usd': 0,
    'revenue_usd': 0
}, inplace=True)
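
Filling with 0 treats a missing metric as a true zero, which can pull averages down. The task list also names mean, median, and forward fill as options. As a hedged alternative, the sketch below imputes the median and keeps a flag column so imputed rows stay identifiable. The `demo` frame and the `spend_usd_was_missing` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical spend values with one gap
demo = pd.DataFrame({"spend_usd": [100.0, np.nan, 300.0, 200.0]})

# Keep a flag so imputed rows remain identifiable downstream
demo["spend_usd_was_missing"] = demo["spend_usd"].isna()

# The median ignores NaN by default and is robust to extreme values
median_spend = demo["spend_usd"].median()
demo["spend_usd"] = demo["spend_usd"].fillna(median_spend)
```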


In [165]:
# FIND NA VALUES

df.isna().sum()

campaign_id         0
campaign_name       0
start_date          0
end_date            0
channel             0
region              0
impressions         0
clicks              0
conversions         0
spend_usd           0
revenue_usd         0
target_audience     0
product_category    0
device              0
year                0
dataset_year        0
dtype: int64

In [166]:
# isnull() is an alias of isna(), so this repeats the check above

df.isnull().sum()

campaign_id         0
campaign_name       0
start_date          0
end_date            0
channel             0
region              0
impressions         0
clicks              0
conversions         0
spend_usd           0
revenue_usd         0
target_audience     0
product_category    0
device              0
year                0
dataset_year        0
dtype: int64

### Overview of Changes made:

In [169]:
#The result

print("Combined dataset shape: ", df.shape)
df.info()  # info() prints its report directly and returns None
print(df.describe())
df.head(10)

Combined dataset shape:  (1000, 16)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   campaign_id       1000 non-null   object        
 1   campaign_name     1000 non-null   object        
 2   start_date        1000 non-null   datetime64[ns]
 3   end_date          1000 non-null   datetime64[ns]
 4   channel           1000 non-null   object        
 5   region            1000 non-null   object        
 6   impressions       1000 non-null   int64         
 7   clicks            1000 non-null   int64         
 8   conversions       1000 non-null   int64         
 9   spend_usd         1000 non-null   float64       
 10  revenue_usd       1000 non-null   float64       
 11  target_audience   1000 non-null   object        
 12  product_category  1000 non-null   object        
 13  device            1000 non-null   object   

| | campaign_id | campaign_name | start_date | end_date | channel | region | impressions | clicks | conversions | spend_usd | revenue_usd | target_audience | product_category | device | year | dataset_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024_0001 | Campaign_2024_0001 | 2024-05-16 | 2024-08-16 | Search | South America | 28252 | 5609 | 65466 | 39193.43 | 79017.74 | Youth | Electronics | Desktop | 2024 | 2024 |
| 1 | 2024_0002 | Campaign_2024_0002 | 2024-04-06 | 2024-10-13 | Search | Asia | 89608 | 83584 | 26865 | 17291.53 | 49868.54 | Adults | Home | Mobile | 2024 | 2024 |
| 2 | 2024_0003 | Campaign_2024_0003 | 2024-05-08 | 2024-11-27 | Social | Europe | 37853 | 62661 | 43662 | 6729.63 | 63021.28 | Seniors | Electronics | Desktop | 2024 | 2024 |
| 3 | 2024_0004 | Campaign_2024_0004 | 2024-01-28 | 2024-08-03 | Display | Africa | 10577 | 41421 | 75023 | 15077.58 | 133106.71 | Seniors | Clothing | Desktop | 2024 | 2024 |
| 4 | 2024_0005 | Campaign_2024_0005 | 2024-02-06 | 2024-08-23 | Social | Asia | 84039 | 56010 | 11283 | 16877.69 | 144736.99 | Adults | Home | Mobile | 2024 | 2024 |
| 5 | 2024_0006 | Campaign_2024_0006 | 2024-02-11 | 2024-12-19 | Search | Europe | 46728 | 87560 | 69277 | 6698.46 | 105523.82 | Adults | Travel | Mobile | 2024 | 2024 |
| 6 | 2024_0007 | Campaign_2024_0007 | 2024-05-07 | 2024-09-03 | Search | Asia | 24269 | 4485 | 39619 | 8269.18 | 113563.46 | Seniors | Home | Mobile | 2024 | 2024 |
| 7 | 2024_0008 | Campaign_2024_0008 | 2024-03-01 | 2024-09-13 | Search | South America | 45499 | 20267 | 8609 | 4110.73 | 59552.89 | Seniors | Home | Desktop | 2024 | 2024 |
| 8 | 2024_0009 | Campaign_2024_0009 | 2024-03-15 | 2024-07-31 | Print | Asia | 14398 | 79811 | 62095 | 14214.98 | 74892.62 | Adults | Services | Mobile | 2024 | 2024 |
| 9 | 2024_0010 | Campaign_2024_0010 | 2024-03-07 | 2024-10-15 | Social | Africa | 18278 | 57806 | 55147 | 17466.57 | 38287.83 | Youth | Home | Mobile | 2024 | 2024 |


## Keep Changes - Save files as CSV to data/processed folder

In [167]:
#Save up to 3 files (only the 3rd is currently written; the yearly saves are commented out below):
#1st - marketing_campaign_2024_clean.csv
#2nd - marketing_campaign_2025_clean.csv
#3rd - marketing_campaign_all_clean.csv
#The library os is truly a useful tool :)


os.makedirs("../data/processed", exist_ok=True)

# df1.to_csv("../data/processed/marketing_campaign_2024_clean.csv", index=False)
# df2.to_csv("../data/processed/marketing_campaign_2025_clean.csv", index=False)
df.to_csv("../data/processed/marketing_campaign_all_clean.csv", index=False)
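
One caveat with CSV output: it does not store dtypes, so the parsed date columns come back as plain text on the next load unless they are parsed again. The sketch below shows this with a temporary file; the one-row `demo` frame and the file name `demo_clean.csv` are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical one-row frame with a parsed date column
demo = pd.DataFrame({
    "campaign_id": ["2024_0001"],
    "start_date": pd.to_datetime(["2024-05-16"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_clean.csv")
    demo.to_csv(path, index=False)

    # Without parse_dates the column is read back as plain text
    plain = pd.read_csv(path)
    # With parse_dates the datetime dtype is restored on load
    restored = pd.read_csv(path, parse_dates=["start_date"])

plain_is_datetime = pd.api.types.is_datetime64_any_dtype(plain["start_date"])
restored_is_datetime = pd.api.types.is_datetime64_any_dtype(restored["start_date"])
```

Passing `parse_dates=["start_date", "end_date"]` when reloading the cleaned file would make the explicit `pd.to_datetime` step unnecessary on later runs.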
