# Sales Conversion Optimization - Data Exploration & Preprocessing

**Problem Statement**: Optimize an anonymous organization's sales conversions through social media advertising campaigns.

#### Here's the data description
- The dataset contains details about each campaign/ad and the person to whom the ad was shown, along with the amount spent on the ad and the number of inquiries and purchases made after seeing it.

| Feature             | Description                                                   | Data Type      |
|---------------------|---------------------------------------------------------------|----------------|
| ad_id               | Unique ID for each ad.                                        | Numerical      |
| xyz_campaign_id     | ID associated with each ad campaign of XYZ company.           | Categorical    |
| fb_campaign_id      | ID for how Facebook tracks each campaign.                      | Categorical    |
| age                 | Age group of the person to whom the ad is shown.               | Categorical    |
| gender              | Gender of the person to whom the ad is shown.                  | Categorical    |
| interest            | Code specifying the category of the person’s interests.       | Categorical    |
| Impressions         | Number of times the ad was shown.                             | Numerical      |
| Clicks              | Number of clicks on the ad.                                   | Numerical      |
| Spent               | Amount paid by company XYZ to Facebook for the ad.            | Numerical      |
| Total_Conversion    | Total number of inquiries about the product after seeing the ad. | Numerical    |
| Approved_Conversion | Total number of product purchases after seeing the ad.        | Numerical      |


## Importing the necessary libraries

In [2]:
import pandas as pd
import numpy as np

### Reading the dataset

In [3]:
df = pd.read_csv('/home/diwas/Documents/DevStuff/Sales-Conversion-Optimization-Project/data/raw/KAG_conversion_data.csv')

In [7]:
df.head()

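|   | ad_id  | xyz_campaign_id | fb_campaign_id | age   | gender | interest | Impressions | Clicks | Spent | Total_Conversion | Approved_Conversion |
|---|--------|-----------------|----------------|-------|--------|----------|-------------|--------|-------|------------------|---------------------|
| 0 | 708746 | 916             | 103916         | 30-34 | M      | 15       | 7350        | 1      | 1.43  | 2                | 1                   |
| 1 | 708749 | 916             | 103917         | 30-34 | M      | 16       | 17861       | 2      | 1.82  | 2                | 0                   |
| 2 | 708771 | 916             | 103920         | 30-34 | M      | 20       | 693         | 0      | 0.0   | 1                | 0                   |
| 3 | 708815 | 916             | 103928         | 30-34 | M      | 28       | 4259        | 1      | 1.25  | 1                | 0                   |
| 4 | 708818 | 916             | 103928         | 30-34 | M      | 28       | 4133        | 1      | 1.29  | 1                | 1                   |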


## Data Exploration

In [11]:
# Getting basic information about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143 entries, 0 to 1142
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ad_id                1143 non-null   int64  
 1   xyz_campaign_id      1143 non-null   int64  
 2   fb_campaign_id       1143 non-null   int64  
 3   age                  1143 non-null   object 
 4   gender               1143 non-null   object 
 5   interest             1143 non-null   int64  
 6   Impressions          1143 non-null   int64  
 7   Clicks               1143 non-null   int64  
 8   Spent                1143 non-null   float64
 9   Total_Conversion     1143 non-null   int64  
 10  Approved_Conversion  1143 non-null   int64  
dtypes: float64(1), int64(8), object(2)
memory usage: 98.4+ KB
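
The data description lists `xyz_campaign_id`, `fb_campaign_id`, `gender`, and `interest` as categorical, yet `df.info()` reports three of them as `int64` and `gender` as `object`. One possible way to make the dtypes match the description is to cast those columns to the pandas `category` dtype. The snippet below is only a sketch on a small stand-in frame; `sample`, `categorical_cols`, and `typed` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Stand-in frame with the same column names as the notebook's df;
# the values are copied from the first rows shown by df.head().
sample = pd.DataFrame({
    "xyz_campaign_id": [916, 916, 916],
    "fb_campaign_id": [103916, 103917, 103920],
    "gender": ["M", "M", "M"],
    "interest": [15, 16, 20],
    "Clicks": [1, 2, 0],
})

# Columns that the data description marks as categorical.
categorical_cols = ["xyz_campaign_id", "fb_campaign_id", "gender", "interest"]

# Passing a dict to astype casts only the listed columns and returns a new
# DataFrame, so the original frame keeps its dtypes.
typed = sample.astype({col: "category" for col in categorical_cols})

print(typed.dtypes)
```

Applied to the full dataset, the same call would be `df.astype({...})` with the same column list. Whether the cast is worthwhile depends on the later modelling steps.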


In [19]:
# Checking for missing values
null_values = df.isnull().sum()
null_values[null_values > 0]

Series([], dtype: int64)

In [14]:
# Checking unique values in gender column
df['gender'].unique()

array(['M', 'F'], dtype=object)

In [18]:
# Checking unique values in age column
df['age'].unique()

array(['30-34', '35-39', '40-44', '45-49'], dtype=object)

**Insights**
- There are no missing values in any column.
- Values in the `age` column are grouped into bins (`30-34`, `35-39`, `40-44`, `45-49`) rather than given as exact ages.
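
Because the age bins have a natural order, one option is to store them as an ordered categorical so that sorting and comparisons respect that order. The following is a minimal sketch under that assumption, not a step taken in the notebook; `age_order`, `ages`, `age_dtype`, and `age_codes` are illustrative names, and the series holds stand-in values instead of `df['age']`.

```python
import pandas as pd

# The four bins reported by df['age'].unique(), from youngest to oldest.
age_order = ["30-34", "35-39", "40-44", "45-49"]

# Stand-in values; in the notebook this would be df['age'].
ages = pd.Series(["40-44", "30-34", "45-49", "35-39", "30-34"], name="age")

# An ordered categorical dtype lets pandas sort and compare bins by rank.
age_dtype = pd.CategoricalDtype(categories=age_order, ordered=True)
ages_ordered = ages.astype(age_dtype)

# Integer codes (0 to 3) follow the bin order and can act as an ordinal feature.
age_codes = ages_ordered.cat.codes

print(ages_ordered.sort_values().tolist())
print(age_codes.tolist())
```

The integer codes preserve only the rank of each bin, which may be enough when a model needs a numeric age feature.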

## Data Cleaning and Preprocessing

In [22]:
# Checking for duplicated records
df.duplicated().sum()

0
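
`df.duplicated()` only flags rows that match in every column. Since the data description calls `ad_id` a unique ID for each ad, a complementary check is whether any `ad_id` value repeats. The sketch below uses a small stand-in frame in which one `ad_id` is repeated on purpose for illustration; it does not reflect the real dataset.

```python
import pandas as pd

# Stand-in frame: two rows share an ad_id but differ in Clicks,
# so a full-row duplicate check does not flag them.
sample = pd.DataFrame({
    "ad_id": [708746, 708749, 708749],
    "Clicks": [1, 2, 3],
})

# Compares every column: no row is an exact copy of an earlier one.
full_row_dupes = sample.duplicated().sum()

# Compares ad_id only: the second occurrence of a repeated ID is flagged.
repeated_ids = sample.duplicated(subset=["ad_id"]).sum()

print(full_row_dupes, repeated_ids, sample["ad_id"].is_unique)
```

On the real data, `df['ad_id'].is_unique` would give the same answer in a single expression.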

### Handle Outliers