# Project Title: Ad Campaign Analysis and Conversion Prediction

## Project Description:
In this data analysis project, we worked with an ad campaign dataset to perform data preprocessing and exploratory data analysis (EDA), and to build a machine learning model for predicting ad campaign conversions. The dataset contains campaign-related features, including demographic attributes such as age, gender, income, and area, along with true and predicted conversion labels and predicted conversion probabilities.

In [3]:
import piplite
await piplite.install(['ipycytoscape', 'pandas', 'networkx'])
await piplite.install(['numpy', 'seaborn', 'matplotlib', 'scikit-learn'])

In [4]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics
import seaborn as sns


## Data Loading and Preprocessing:

### Initial Data Inspection: 
This step involves loading the dataset using Pandas and checking its structure and contents using df.head() and df.info().

In [5]:
df = pd.read_csv('ad_campaign_data.csv')
df.head()

Unnamed: 0,religion,politics,college_educated,parents,homeowner,gender,age,income,area,true_conversion,predicted_conversion,predicted_probability
0,Unknown,Unknown,1,1,1,Unknown,55-64,Unknown,Unknown,0,0,0.001351
1,Other,Unknown,1,1,1,Unknown,55-64,Unknown,Urban,0,0,0.002238
2,Unknown,Unknown,1,1,1,F,55-64,Unknown,Unknown,0,0,0.002704
3,Unknown,Unknown,1,1,1,F,55-64,Unknown,Unknown,0,0,0.001967
4,Unknown,Unknown,1,1,1,F,55-64,Unknown,Urban,0,0,0.001681


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1443140 entries, 0 to 1443139
Data columns (total 12 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   religion               1443140 non-null  object 
 1   politics               1443140 non-null  object 
 2   college_educated       1443140 non-null  int64  
 3   parents                1443140 non-null  int64  
 4   homeowner              1443140 non-null  int64  
 5   gender                 1443140 non-null  object 
 6   age                    1443140 non-null  object 
 7   income                 1443140 non-null  object 
 8   area                   1443140 non-null  object 
 9   true_conversion        1443140 non-null  int64  
 10  predicted_conversion   1443140 non-null  int64  
 11  predicted_probability  1443140 non-null  float64
dtypes: float64(1), int64(5), object(6)
memory usage: 99.1+ MB
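
Six of the twelve columns are stored as object strings, although the preview suggests each holds only a few distinct labels. One option, sketched here on made-up data rather than the campaign file, is to cast such columns to the pandas category dtype and compare memory use. The column values and row count below are illustrative assumptions.

```python
import pandas as pd

# Toy column standing in for a low-cardinality string feature (values are illustrative).
toy = pd.DataFrame({"area": ["Urban", "Rural", "Unknown"] * 10_000})

# deep=True counts the memory held by the Python string objects themselves.
as_object = toy["area"].memory_usage(deep=True)
as_category = toy["area"].astype("category").memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```

Whether the saving matters depends on the environment; later steps in this notebook work on the original object columns.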


### Handling Missing Values: 
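Missing values are encoded in the data as the string "Unknown". These are first replaced with None so that pandas treats them as missing. The affected columns are then filled in the steps that follow: mean imputation for age, model-based prediction for gender, a constant fill for income, and backward and forward filling for area.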

In [7]:
df.replace("Unknown", None, inplace = True)
df

Unnamed: 0,religion,politics,college_educated,parents,homeowner,gender,age,income,area,true_conversion,predicted_conversion,predicted_probability
0,,,1,1,1,,55-64,,,0,0,0.001351
1,Other,,1,1,1,,55-64,,Urban,0,0,0.002238
2,,,1,1,1,F,55-64,,,0,0,0.002704
3,,,1,1,1,F,55-64,,,0,0,0.001967
4,,,1,1,1,F,55-64,,Urban,0,0,0.001681
...,...,...,...,...,...,...,...,...,...,...,...,...
1443135,Other,,1,1,1,F,25-34,,,0,0,0.002318
1443136,,,1,1,0,F,55-64,,,0,0,0.001420
1443137,,,1,1,1,,55-64,,,0,0,0.002879
1443138,,,1,1,1,F,55-64,,,0,0,0.001905


In [8]:
df['income'].value_counts(dropna=False)

None     1375624
<100K      40957
>100K      26559
Name: income, dtype: int64
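
In the output above, 1,375,624 of the 1,443,140 income values (about 95%) are missing. The same check can be run for every column at once. The sketch below uses a small made-up frame, so the column contents are assumptions and not the campaign data.

```python
import pandas as pd

# Toy frame with gaps in two columns (values are illustrative).
toy = pd.DataFrame({
    "income": [None, "<100K", None, ">100K"],
    "area": ["Urban", None, "Rural", "Urban"],
    "age": ["55-64", "25-34", "55-64", "55-64"],
})

# isna() marks None/NaN cells; the mean of those booleans is the missing share per column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```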

### Age Calculation and Filling: 
A function get_average_age converts each age range (for example, 55-64) to its midpoint (59.5), so that age is represented numerically. Missing age values are then filled with the mean of the known values.

In [9]:

def get_average_age(age_range):
    if age_range is not None:
        age_values = age_range.split('-')
        return (int(age_values[0]) + int(age_values[1])) / 2

df['age'] = df['age'].apply(get_average_age)
df

Unnamed: 0,religion,politics,college_educated,parents,homeowner,gender,age,income,area,true_conversion,predicted_conversion,predicted_probability
0,,,1,1,1,,59.5,,,0,0,0.001351
1,Other,,1,1,1,,59.5,,Urban,0,0,0.002238
2,,,1,1,1,F,59.5,,,0,0,0.002704
3,,,1,1,1,F,59.5,,,0,0,0.001967
4,,,1,1,1,F,59.5,,Urban,0,0,0.001681
...,...,...,...,...,...,...,...,...,...,...,...,...
1443135,Other,,1,1,1,F,29.5,,,0,0,0.002318
1443136,,,1,1,0,F,59.5,,,0,0,0.001420
1443137,,,1,1,1,,59.5,,,0,0,0.002879
1443138,,,1,1,1,F,59.5,,,0,0,0.001905


In [10]:
df['age'] = df['age'].fillna(df['age'].mean())
df['age'].value_counts(dropna=False)
df

Unnamed: 0,religion,politics,college_educated,parents,homeowner,gender,age,income,area,true_conversion,predicted_conversion,predicted_probability
0,,,1,1,1,,59.5,,,0,0,0.001351
1,Other,,1,1,1,,59.5,,Urban,0,0,0.002238
2,,,1,1,1,F,59.5,,,0,0,0.002704
3,,,1,1,1,F,59.5,,,0,0,0.001967
4,,,1,1,1,F,59.5,,Urban,0,0,0.001681
...,...,...,...,...,...,...,...,...,...,...,...,...
1443135,Other,,1,1,1,F,29.5,,,0,0,0.002318
1443136,,,1,1,0,F,59.5,,,0,0,0.001420
1443137,,,1,1,1,,59.5,,,0,0,0.002879
1443138,,,1,1,1,F,59.5,,,0,0,0.001905
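
The two steps above (range to midpoint, then mean fill) can be sketched together on a small made-up series. The helper name range_midpoint and the sample values are assumptions for illustration. Unlike a check against None alone, pd.isna also catches NaN.

```python
import pandas as pd

def range_midpoint(age_range):
    """Return the midpoint of a 'low-high' string, or NaN if the value is missing."""
    if pd.isna(age_range):
        return float("nan")
    low, high = age_range.split("-")
    return (int(low) + int(high)) / 2

# Toy series with one missing entry (values are illustrative).
ages = pd.Series(["55-64", "25-34", None, "18-24"])

midpoint = ages.map(range_midpoint)
# Fill the gap with the mean of the known midpoints.
filled = midpoint.fillna(midpoint.mean())
print(filled.tolist())
```

Mean imputation keeps every row but assigns the same value to all rows with an unknown age.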


### Gender Prediction Model:

#### Logistic Regression for Gender Prediction: 
A logistic regression model is built to predict gender based on features like age, homeowner status, parent status, and college education.


In [11]:
known_gender_data = df[df['gender'].notnull()]
unknown_gender_data = df[df['gender'].isnull()]
unknown_gender_data

Unnamed: 0,religion,politics,college_educated,parents,homeowner,gender,age,income,area,true_conversion,predicted_conversion,predicted_probability
0,,,1,1,1,,59.500000,,,0,0,0.001351
1,Other,,1,1,1,,59.500000,,Urban,0,0,0.002238
12,Christianity,,1,1,1,,59.500000,,,0,0,0.031544
13,,,1,1,1,,59.500000,,,0,0,0.000582
17,,,1,1,1,,59.500000,,,0,0,0.001950
...,...,...,...,...,...,...,...,...,...,...,...,...
1443130,Other,,1,1,1,,59.500000,,Rural,0,0,0.005252
1443131,,,1,1,1,,59.500000,,,0,0,0.001031
1443133,Christianity,,0,1,1,,59.500000,,,0,0,0.002815
1443134,Other,,1,1,1,,58.730928,,,0,0,0.002564


In [12]:
selected_features = ['age', 'homeowner', 'parents', 'college_educated']

#### Data Splitting and Training: 
The dataset is split into training and testing sets using train_test_split. The logistic regression model is trained using the training data.


In [13]:
X_train, X_test, y_train, y_test = train_test_split(known_gender_data[selected_features], known_gender_data['gender'], test_size=0.2, random_state=42)

In [14]:
model = LogisticRegression()
model.fit(X_train, y_train)

#### Model Evaluation: 
The trained model's accuracy is evaluated on the testing data using the accuracy_score metric. This provides an indication of how well the model performs in predicting gender.


In [15]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.6548691924052849
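
An accuracy of about 0.655 is easier to interpret next to a baseline that always predicts the most frequent class. The sketch below shows one way to set up that comparison with scikit-learn's DummyClassifier. The features and labels are synthetic assumptions, so the printed numbers say nothing about the campaign data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
# Synthetic labels: roughly 65% "F", unrelated to X (an assumption for illustration).
y = np.where(rng.random(1000) < 0.65, "F", "M")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression().fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, model.predict(X_test))
print(f"baseline: {baseline_acc:.3f}  logistic regression: {model_acc:.3f}")
```

If the two accuracies are close, the features add little beyond the class frequencies.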

#### Predicting Missing Gender Values: 
The trained model is used to predict missing gender values in the dataset, enhancing the completeness of the data.

In [16]:
unknown_df = df[df['gender'].isnull()]
predictions = model.predict(unknown_df[selected_features])
predictions

array(['F', 'F', 'F', ..., 'F', 'F', 'F'], dtype=object)

In [None]:
# Replace missing gender values with the model's predictions.
# The mask selects the same rows, in the same order, as unknown_df,
# so the predictions line up positionally.
missing_gender = df['gender'].isnull()
df.loc[missing_gender, 'gender'] = predictions

In [None]:
df.head()

In [None]:
df['gender'].value_counts(dropna = False)

### Income & Area Category Correction: 
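Missing income values are filled with "<100K", the more frequent of the two known categories, and the lowercase variant "<100k" is normalized to "<100K". Since about 95% of income values were missing, the filled column is dominated by this single category. Missing area values are filled from neighbouring rows, first backward and then forward.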
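
Both fill directions are needed for area because a backward fill alone cannot fill gaps at the end of the column. A minimal sketch on a made-up series (the values are assumptions):

```python
import pandas as pd

# Toy series with one gap in the middle and two at the end (values are illustrative).
area = pd.Series(["Urban", None, "Rural", None, None])

after_bfill = area.bfill()        # each gap takes the next known value
after_both = after_bfill.ffill()  # remaining trailing gaps take the last known value

print(after_bfill.tolist())
print(after_both.tolist())
```

Neighbour-based filling depends on row order, so it is only meaningful if adjacent rows are related.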

In [None]:
df['income'].value_counts(dropna = False)

In [None]:
df['income'] = df['income'].fillna("<100K")

In [None]:
df.head()

In [None]:
df['income'] = df['income'].replace(to_replace="<100k", value="<100K")

In [None]:
df['income'].value_counts(dropna = False)

In [None]:
df

In [None]:
df['area'].value_counts(dropna = False)

In [None]:
df['area'] = df['area'].bfill()

In [None]:
df['area'].value_counts(dropna = False)

In [None]:
df['area'] = df['area'].ffill()

In [None]:
df['area'].value_counts(dropna = False)

In [None]:
df

In [None]:
df.dtypes

## Exploratory Data Analysis (EDA):
#### Visual Insights: 
Utilizing Seaborn and Matplotlib, various visualization techniques are employed to uncover insights. This includes count plots, pie charts, histograms, box plots, and scatter plots.
#### Data Distribution and Relationships: 
The visualizations provide a visual understanding of the distribution of features, relationships between variables, and potential patterns within the data.
#### Evaluating Gender Balance: 
Gender distribution is analyzed using pie charts and count plots, allowing for insights into the representation of gender in the dataset.
#### Age Distribution and Income Variation: 
Histograms and scatter plots showcase the age distribution and income trends, helping to identify potential correlations between age and income.

In [None]:
# Bar plot for categorical variables
sns.countplot(x='college_educated', data=df)
plt.show()


In [None]:
# Pie chart for 'gender'
df['gender'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.axis('equal')
plt.show()

In [None]:
# Histogram for 'age'
sns.histplot(df['age'], kde=True)
plt.show()

In [None]:
# Box plot of 'age' across 'income' categories
sns.boxplot(x='income', y='age', data=df)
plt.show()

In [None]:
# Scatter plot for 'age' vs. 'income'
sns.scatterplot(x='age', y='income', data=df, hue='true_conversion')
plt.show()

In [None]:
# Bar plot for binary categorical variables
sns.countplot(x='true_conversion', data=df)
plt.show()

In [None]:
# Box plot for 'age' across 'college_educated'
sns.boxplot(x='college_educated', y='age', data=df)
plt.show()

In [None]:
# Pair plot for numeric variables ('income' is categorical, so it is left out)
sns.pairplot(df[['age', 'predicted_probability']])
plt.show()

In [None]:
# Bar plot for True Conversion vs. Predicted Conversion
sns.countplot(x='true_conversion', hue='predicted_conversion', data=df)
plt.show()

In [None]:
# Scatter plot for True Conversion vs. Predicted Probability
sns.scatterplot(x='true_conversion', y='predicted_probability', data=df)
plt.show()
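
Because income and true_conversion each take only two values, plots of one against the other overlap heavily. A grouped conversion rate is a complementary numeric view. The sketch below uses a small made-up frame, so the rates are illustrative and not results from the campaign data.

```python
import pandas as pd

# Toy frame (values are illustrative).
toy = pd.DataFrame({
    "income": ["<100K", "<100K", ">100K", ">100K", "<100K", ">100K"],
    "true_conversion": [0, 1, 0, 0, 0, 1],
})

# Mean of a 0/1 column is the conversion rate; count shows the group size behind it.
rate_by_income = toy.groupby("income")["true_conversion"].agg(["mean", "count"])
print(rate_by_income)
```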

## Conversion Prediction Model:

### Categorical Feature Encoding: 
One-hot encoding is applied to categorical features using Pandas' get_dummies function to transform them into a suitable format for modeling.

In [None]:
df_new = pd.get_dummies(data=df, columns=['gender', 'income', 'area'])
df_new
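
To make the effect of get_dummies concrete, the sketch below encodes a small made-up frame and prints the resulting column names. The drop_first variant is an optional alternative, not something this notebook uses: it removes one level per feature, which avoids perfectly redundant columns in linear models.

```python
import pandas as pd

# Toy frame (values are illustrative).
toy = pd.DataFrame({
    "age": [59.5, 29.5, 59.5],
    "gender": ["F", "M", "F"],
    "area": ["Urban", "Rural", "Urban"],
})

# One indicator column per category; untouched columns come first.
encoded = pd.get_dummies(toy, columns=["gender", "area"])
print(encoded.columns.tolist())
# ['age', 'gender_F', 'gender_M', 'area_Rural', 'area_Urban']

# Optional: drop the first level of each encoded feature.
encoded_reduced = pd.get_dummies(toy, columns=["gender", "area"], drop_first=True)
print(encoded_reduced.columns.tolist())
# ['age', 'gender_M', 'area_Urban']
```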

#### Data Preparation: 
The dataset is preprocessed to prepare it for conversion prediction. The religion and politics columns are dropped, the target true_conversion is cast to a boolean, and the data is split into features (X) and the target (y).


In [None]:
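df_new = df_new.drop(columns=['religion', 'politics'])
# Cast only the boolean dummy columns to int. Casting the whole frame would
# truncate age (59.5 -> 59) and turn every predicted_probability into 0.
dummy_cols = df_new.select_dtypes(include='bool').columns
df_new = df_new.astype({col: int for col in dummy_cols})
selected_features = df_new.drop(columns=['predicted_probability', 'predicted_conversion', 'true_conversion'])
y = df_new['true_conversion']
df_new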

In [None]:
y = df_new['true_conversion'].astype(bool)
print(selected_features.dtypes)  # Check data types of selected_features
print(y.dtypes)  # Check data type of y
print(df.head())  # Print the first few rows of df to verify its contents

### Logistic Regression for Conversion Prediction:
A logistic regression model is employed to predict true conversion based on various features.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(selected_features, y, test_size=0.2, random_state=42)
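
When the positive class is rare, a plain random split can leave the training and test sets with different conversion rates. Passing stratify to train_test_split keeps the class proportions equal in both parts. The sketch below uses synthetic data with an assumed 2% positive rate. It is an optional variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:20] = 1  # 2% positives, an illustrative imbalance

# stratify=y preserves the share of each class in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.mean(), y_test.mean())
```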

In [None]:
lg_model = LogisticRegression(solver='liblinear', max_iter=4000)
lg_model.fit(X_train, y_train)

#### Performance Metrics: 
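The model's performance is evaluated using accuracy, mean absolute error (MAE), mean squared error (MSE), and R-squared. Because both the labels and the predictions are binary, MAE and MSE each equal the misclassification rate (1 - accuracy). R-squared is a regression metric and should be interpreted with caution for a classifier.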


In [None]:
y_pred = lg_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy
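
Accuracy alone can be misleading when one class is rare: a model that never predicts a conversion can still score highly. The sketch below illustrates this on made-up labels and adds precision, recall, and the confusion matrix as complementary views. The label arrays are assumptions, not output from lg_model.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Illustrative labels: 2 conversions among 10 rows, and a model that predicts none.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_hat = np.zeros(10, dtype=int)

acc = accuracy_score(y_true, y_hat)
# zero_division=0 returns 0 instead of warning when no positives are predicted.
rec = recall_score(y_true, y_hat, zero_division=0)
prec = precision_score(y_true, y_hat, zero_division=0)
cm = confusion_matrix(y_true, y_hat)

print("accuracy:", acc)    # 0.8
print("recall:", rec)      # 0.0
print("precision:", prec)  # 0.0
print(cm)                  # rows are true classes, columns are predicted classes
```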

In [None]:
print(type(y_test))
print(type(y_pred))
print(y_test.shape)
print(y_pred.shape)


In [None]:
y_test = y_test.astype(float)
y_pred = y_pred.astype(float)


In [None]:
assert y_test.shape == y_pred.shape, "Shapes of y_test and y_pred do not match."


#### Evaluation and Comparison: 
The metrics below are computed on the held-out test set, so they estimate how well the model generalizes to unseen data.

In [None]:
lg_MAE = metrics.mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", lg_MAE)

In [None]:
lg_MSE = metrics.mean_squared_error(y_test, y_pred)
lg_R2 = metrics.r2_score(y_test, y_pred)
print("Mean Squared Error:", lg_MSE)
print("R2 Score:", lg_R2)
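
The relationship between these regression metrics and accuracy can be checked on made-up labels. With 0/1 labels and 0/1 predictions, every error has absolute value 1, so MAE and MSE both equal 1 - accuracy. The arrays below are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

# Illustrative binary labels and predictions, cast to float as in the cells above.
y_true = np.array([0, 1, 0, 0, 1, 0, 1, 0], dtype=float)
y_hat = np.array([0, 0, 0, 1, 1, 0, 1, 0], dtype=float)

error_rate = 1 - accuracy_score(y_true, y_hat)
mae = mean_absolute_error(y_true, y_hat)
mse = mean_squared_error(y_true, y_hat)

print(error_rate, mae, mse)  # all three are 0.25 (2 errors out of 8)
```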