# Conversion Prediction: EDA
-------
The **Data Science Weekly Newsletter** wants to figure out a way to increase the conversion rate of visitors. In particular, they want to accurately predict whether a visitor to their website will end up subscribing to the newsletter.
 
> In this first phase of the project, we will:

>> explore the data, and
>> preprocess it if needed.

> The visualizations are saved in the **Viz** folder.
------

### Table of Contents

* [1. Load Data](#section1)
* [2. EDA](#section2)
    * [2.1. Explore Dataset](#section21)
    * [2.2. Unique values](#section21)
    * [2.3. Missing values](#section22)
    * [2.4. Duplicates](#section23)
    * [2.5. Sampling](#section24)
    * [2.6. Univariate Analysis](#section25)
    * [2.7. Bivariate Analysis](#section26)
    * [2.8. Correlation](#section27)
* [3. Key Insights](#section3)

 #### Import useful modules ⬇️⬇️ and Global params

In [1]:
# Importing useful libraries
# generic libs
import os
import pandas as pd
from numpy import round

#plotting libs
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "iframe" 

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Predefined Functions
from modules import MyFunctions as MyFunct

# Global parameters 
train_filepath = 'data/conversion_data_train.csv'
pre_train_filepath = 'data/pre_conversion_data_train.csv'

if not os.path.exists("Viz"):
    os.mkdir("Viz")
    
viz_path = "Viz/"

sample_size = 10_000
seed = 42

 # Load Data

In [2]:
print("Loading dataset...")
dataset = pd.read_csv(train_filepath)
print("...Done.")
print()

Loading dataset...
...Done.



# EDA

## Explore Dataset

In [3]:
MyFunct.explore(dataset)

Shape : (284580, 6)

data types : 
country                object
age                     int64
new_user                int64
source                 object
total_pages_visited     int64
converted               int64
dtype: object

Display of dataset: 


Unnamed: 0,country,age,new_user,source,total_pages_visited,converted
0,China,22,1,Direct,2,0
1,UK,21,1,Ads,3,0
2,Germany,20,0,Seo,14,1
3,US,23,1,Seo,3,0
4,US,28,1,Direct,3,0



Basics statistics: 


Unnamed: 0,country,age,new_user,source,total_pages_visited,converted
count,284580,284580.0,284580.0,284580,284580.0,284580.0
unique,4,,,3,,
top,US,,,Seo,,
freq,160124,,,139477,,
mean,,30.564203,0.685452,,4.873252,0.032258
std,,8.266789,0.464336,,3.341995,0.176685
min,,17.0,0.0,,1.0,0.0
25%,,24.0,0.0,,2.0,0.0
50%,,30.0,1.0,,4.0,0.0
75%,,36.0,1.0,,7.0,0.0



Distinct values: 


country                 4
age                    60
new_user                2
source                  3
total_pages_visited    29
converted               2
dtype: int64
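
The code of `MyFunct.explore` is not shown in this notebook. Judging only from the output above, it reports the shape, the data types, the first rows, basic statistics and the number of distinct values. A minimal sketch of such a helper, assuming nothing about the real implementation (the name `explore_sketch` and the toy data are hypothetical), could look like this:

```python
import pandas as pd


def explore_sketch(df: pd.DataFrame, n_rows: int = 5) -> dict:
    """Hypothetical stand-in for MyFunct.explore: collect five quick summaries."""
    return {
        "shape": df.shape,                       # (rows, columns)
        "dtypes": df.dtypes,                     # type of each column
        "head": df.head(n_rows),                 # first rows
        "describe": df.describe(include="all"),  # stats for numeric and object columns
        "nunique": df.nunique(),                 # distinct values per column
    }


# Toy data, only to show the call; not the real dataset.
toy = pd.DataFrame({
    "country": ["US", "UK", "US", "China"],
    "age": [22, 21, 20, 23],
    "converted": [0, 0, 1, 0],
})

summary = explore_sketch(toy)
for name, value in summary.items():
    print(f"{name}:\n{value}\n")
```

Returning a dictionary rather than printing inside the function keeps each summary reusable in later cells.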

## Unique Values

In [4]:
Cols = ['country', 'new_user', 'source', 'converted']
MyFunct.unique_count(dataset, Cols)

unique values of country:


US         160124
China       69122
UK          43641
Germany     11693
Name: country, dtype: int64

unique values of new_user:


1    195066
0     89514
Name: new_user, dtype: int64

unique values of source:


Seo       139477
Ads        80027
Direct     65076
Name: source, dtype: int64

unique values of converted:


0    275400
1      9180
Name: converted, dtype: int64

## Missing Values

In [5]:
print("Missing values: ")
MyFunct.missing(dataset)

Missing values: 
there is no missing values in this dataset
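
`MyFunct.missing` is a project helper whose code is not shown here. As a rough equivalent in plain pandas (an assumption, not the helper's actual implementation), missing values can be counted per column with `isna`. The toy frame below deliberately contains gaps, unlike the real dataset:

```python
import numpy as np
import pandas as pd

# Toy data with two missing entries, for illustration only.
toy = pd.DataFrame({
    "country": ["US", None, "UK"],
    "age": [22, 30, np.nan],
    "converted": [0, 1, 0],
})

missing = toy.isna().sum()                        # count of missing values per column
missing_pct = (toy.isna().mean() * 100).round(2)  # same, as a percentage of rows

if missing.sum() == 0:
    print("No missing values")
else:
    print(pd.DataFrame({"missing": missing, "percent": missing_pct}))
```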


## Duplicates 

In [6]:
print("Duplicates: ")
MyFunct.duplicates_count(dataset)

Duplicates: 


Unnamed: 0,country,age,new_user,source,total_pages_visited,converted,records
0,China,17,0,Ads,1,0,15
1,China,17,0,Ads,2,0,19
2,China,17,0,Ads,3,0,18
3,China,17,0,Ads,4,0,12
4,China,17,0,Ads,5,0,18
...,...,...,...,...,...,...,...
15806,US,70,1,Ads,9,0,1
15807,US,72,1,Direct,4,0,1
15808,US,73,1,Seo,5,0,1
15809,US,77,0,Direct,4,0,1
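
The table above lists each distinct row with the number of times it occurs (`records`). The code of `MyFunct.duplicates_count` is not shown, so the following is only a sketch of one way to get a similar table with plain pandas, on hypothetical toy data:

```python
import pandas as pd

# Toy data: the first two rows are identical.
toy = pd.DataFrame({
    "country": ["US", "US", "UK", "US"],
    "age": [22, 22, 30, 25],
    "converted": [0, 0, 1, 0],
})

# Group on every column and count how often each distinct row occurs.
records = (
    toy.groupby(list(toy.columns))
       .size()
       .reset_index(name="records")
)

# Number of rows that repeat an earlier row.
n_repeated = int(toy.duplicated().sum())

print(records)
print("repeated rows:", n_repeated)
```

Since the dataset has no user identifier, identical rows may well be different visitors with the same profile, so repeated rows are not necessarily errors.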


## Sampling

> The dataset is relatively large, so we draw a smaller, more manageable random sample of `sample_size` rows (10,000) to keep the analysis and the plots responsive.

In [7]:
sample = dataset.sample(n = sample_size, random_state = seed)
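
With only about 3% of rows converted, it is worth checking that a random sample keeps a conversion rate close to that of the full data. The sketch below uses a synthetic population (not the real dataset) to compare a simple random sample with a stratified one drawn through `groupby(...).sample(...)`. The stratified variant is an alternative shown for comparison, not what the notebook does:

```python
import numpy as np
import pandas as pd

# Synthetic population with roughly 3% positives, for illustration only.
rng = np.random.default_rng(42)
n = 100_000
population = pd.DataFrame({
    "age": rng.integers(17, 70, size=n),
    "converted": (rng.random(n) < 0.03).astype(int),
})

# Simple random sample, as in the cell above.
simple = population.sample(n=10_000, random_state=42)

# Stratified alternative: draw the same fraction inside each class.
stratified = (
    population.groupby("converted", group_keys=False)
              .sample(frac=0.1, random_state=42)
)

print("population rate:", population["converted"].mean())
print("simple sample  :", simple["converted"].mean())
print("stratified     :", stratified["converted"].mean())
```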

## Univariate Analysis

In [None]:
title = 'Distribution of the different quantitative variables'
fig = make_subplots(rows=2, cols=2)

fig.add_trace(MyFunct.my_box_plotter(sample['age']), row=1, col=1)
fig.add_trace(MyFunct.my_box_plotter(sample['total_pages_visited']), row=1, col=2)

fig.add_trace(MyFunct.my_hist_plotter(sample['age'], 10), row=2, col=1)
fig.add_trace(MyFunct.my_hist_plotter(sample['total_pages_visited'], 5), row=2, col=2)

# Update xaxis properties
fig.update_xaxes(title_text="age", row=2, col=1)
fig.update_xaxes(title_text="total pages visited", row=2, col=2)

# Update yaxis properties
fig.update_yaxes(title_text="Count", row=2, col=1)

fig.update_layout(
    title= title, title_x = 0.5,
    showlegend=False
)

fig.to_image(format="png", engine="kaleido")
if os.path.exists(viz_path+title+".png"):
    os.remove(viz_path+title+".png")
    
fig.write_image(viz_path+title+".png")

fig.show()

🗒 **Notes**:  
> **age**: the middle 50% of users are between 24 and 36 years old, so this age group is an important one to target.  

> **total_pages_visited**: the middle 50% of users visit between 2 and 7 pages. This is low, as the number of pages visited ranges up to nearly 30. The newsletter should investigate this and try to improve the user experience to encourage users to visit more pages.

> **Outliers**: both quantitative variables, age and total_pages_visited, contain outliers, but they are not erroneous values.
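
Box plots usually flag as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. As a worked example, the fences below are computed from the quartiles reported in the basic statistics of the full dataset (24 and 36 for age, 2 and 7 for total_pages_visited). The quartiles of the sample and the exact rule used by the plotting helper may differ slightly, so this is an approximation:

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles taken from the describe() output of the full dataset.
age_fences = tukey_fences(24.0, 36.0)
pages_fences = tukey_fences(2.0, 7.0)

print(age_fences)    # (6.0, 54.0)
print(pages_fences)  # (-5.5, 14.5)
```

Under this rule, ages above 54 and sessions of 15 pages or more are flagged, which is plausible behaviour rather than a data error.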

In [None]:
title = 'Distribution of the different qualitative variables'

countries = sample.groupby(['country']).size()
source = sample.groupby(['source']).size()
user = sample.groupby(['new_user']).size()
converted = sample.groupby(['converted']).size()

fig = make_subplots(rows=2, cols=2)

# Percentages are computed against sample_size instead of a hard-coded 10000
fig.add_trace(MyFunct.my_bar_plotter(countries.index, countries.values, {'text': round((countries.values / sample_size)*100,2)}), row=1, col=1)
fig.add_trace(MyFunct.my_bar_plotter(source.index, source.values, {'text': round((source.values / sample_size)*100,2)}), row=1, col=2)

fig.add_trace(MyFunct.my_bar_plotter(user.index, user.values,{'text': round((user.values / sample_size)*100,2)} ), row=2, col=1)
fig.add_trace(MyFunct.my_bar_plotter(converted.index, converted.values, {'text': round((converted.values / sample_size)*100,2)}), row=2, col=2)

# Update xaxis properties
fig.update_xaxes(title_text="Country", row=1, col=1)
fig.update_xaxes(title_text="source", row=1, col=2)
fig.update_xaxes(title_text="new user", tickvals = [0, 1], ticktext = ['No', 'Yes'], row=2, col=1)
fig.update_xaxes(title_text="converted",tickvals = [0, 1], ticktext = ['No', 'Yes'], row=2, col=2)

# Update yaxis properties
fig.update_yaxes(title_text="Count", row=1, col=1)
fig.update_yaxes(title_text="Count", row=2, col=1)

fig.update_layout(
    title= title, title_x = 0.5,
    showlegend=False
)

fig.to_image(format="png", engine="kaleido")
if os.path.exists(viz_path+title+".png"):
    os.remove(viz_path+title+".png")
    
fig.write_image(viz_path+title+".png")

fig.show()

🗒 **Notes**:   

> **country**: almost 56% of the newsletter's visitors come from the US and 24% come from China, while the newsletter does not seem to be popular in the UK or Germany. Region-based marketing might help increase the conversion rate.

> **source**: almost half of the visitors arrive through search engine results, 28% arrive after seeing an advertisement and 24% come directly. Hence, optimising the visibility of the newsletter's pages in search results may attract both prospective and existing users.

> **new_user**: returning users are far fewer than new ones. Hence, the newsletter should focus on improving the user experience on the website to encourage new users to return.  

> **converted**: the conversion rate is very low (about 3%), which means that the dataset is highly **imbalanced**.
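
To make the imbalance concrete, the counts of the `converted` column reported for the full dataset (275,400 not converted, 9,180 converted) give the figures below. The "always predict 0" baseline is added here as an illustration of why plain accuracy is a misleading metric on such data:

```python
# Class counts taken from the unique-values output of the full dataset.
n_not_converted = 275_400
n_converted = 9_180
total = n_not_converted + n_converted

conversion_rate = n_converted / total        # share of positive rows
imbalance_ratio = n_not_converted / n_converted
baseline_accuracy = n_not_converted / total  # accuracy of always predicting 0

print(f"conversion rate   : {conversion_rate:.4%}")
print(f"negatives/positive: {imbalance_ratio:.1f}")
print(f"baseline accuracy : {baseline_accuracy:.4%}")
```

A model that never predicts a conversion is already right about 96.8% of the time, so metrics focused on the positive class are more informative here.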

## Bivariate Analysis

In [None]:
title = 'Conversion rate per country'

countries = sample.groupby(['country','converted']).size().reset_index(name='count')

fig = px.histogram(countries, x="country", y="count", color="converted", barmode='relative', barnorm='percent', text_auto=True)
fig.update_traces(texttemplate='%{value:.2f}%')
fig.update_yaxes(title = 'Percent')
fig.update_layout(title = title,title_x = 0.5, legend_title="Converted")

fig.to_image(format="png", engine="kaleido")
if os.path.exists(viz_path+title+".png"):
    os.remove(viz_path+title+".png")
    
fig.write_image(viz_path+title+".png")
fig.show()

🗒 **Notes**:  

> The conversion rate in the **US** is almost equal to the rates in the **UK** and **Germany**. However, the conversion rate in **China** is close to 0%, which should be investigated further. 

> Finally, as most of the newsletter's users come from the **US**, the newsletter team should take care to defend its position in this country against competitors. 
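
The stacked percentage bars can be cross-checked numerically: because `converted` is coded 0/1, its mean within a group is that group's conversion rate. A minimal sketch on hypothetical toy data (the same pattern would apply to `source` or `new_user`):

```python
import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "US", "China", "China", "UK", "UK"],
    "converted": [1, 0, 0, 0, 0, 0, 1, 0],
})

# Visitors, conversions and conversion rate per country.
rates = (
    toy.groupby("country")["converted"]
       .agg(visitors="size", conversions="sum", rate="mean")
       .sort_values("rate", ascending=False)
)
rates["rate_pct"] = (rates["rate"] * 100).round(2)

print(rates)
```

Keeping the visitor count next to the rate helps judge how reliable each rate is, since a rate computed on few visitors is noisy.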

In [None]:
title = 'Conversion rate per user'

users = sample.groupby(['new_user','converted']).size().reset_index(name='count').astype({'new_user': str})

fig = px.histogram(users, x="new_user", y="count", color="converted", barmode='relative', barnorm='percent', text_auto=True)

fig.update_traces(texttemplate='%{value:.2f}%')
fig.update_xaxes(tickvals = [0, 1], ticktext = ['No', 'Yes'])
fig.update_yaxes(title = 'Percent')
fig.update_layout(title = title,title_x = 0.5, legend_title="Converted")
fig.update_coloraxes(showscale=False)

fig.to_image(format="png", engine="kaleido")
if os.path.exists(viz_path+title+".png"):
    os.remove(viz_path+title+".png")
    
fig.write_image(viz_path+title+".png")
fig.show()

🗒 **Notes**:  

> A returning user is more likely to convert than a new one. Again, this argues for improving the user experience so that users return and eventually convert.

In [None]:
title = 'Conversion rate per source'

sources = sample.groupby(['source','converted']).size().reset_index(name='count')

fig = px.histogram(sources, x="source", y="count", color="converted", barmode='relative', barnorm='percent', text_auto=True)

fig.update_traces(texttemplate='%{value:.2f}%')
fig.update_yaxes(title = 'Percent')
fig.update_layout(title = title,title_x = 0.5, legend_title="Converted")

fig.to_image(format="png", engine="kaleido")
if os.path.exists(viz_path+title+".png"):
    os.remove(viz_path+title+".png")
    
fig.write_image(viz_path+title+".png")
fig.show()

🗒 **Notes**:  
> The conversion rate does not seem to depend on the user's source, as all rates are almost equal. Hence, source seems to be a weak predictor.

***********************************************

🗒 **Notes**:   

> Before analysing the conversion rate with respect to the quantitative variables, the values of these continuous variables must be **binned** into a small number of buckets.    

> As we have no rule for defining the bins for age and total_pages_visited, we use the pandas function **qcut**, which places the bin edges at quantiles of the data so that each bin holds roughly the same number of observations.
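
The difference between quantile-based and equal-width binning is easy to see on a small example. The sketch below uses a hypothetical series of twelve ages and compares `pd.qcut` with `pd.cut`:

```python
import pandas as pd

# Twelve toy ages, skewed towards young values.
ages = pd.Series([17, 19, 22, 24, 26, 28, 30, 33, 36, 40, 52, 69])

# Quantile-based bins: edges at the quartiles, similar counts per bin.
quartile_bins = pd.qcut(ages, q=4)
counts = quartile_bins.value_counts().sort_index()

# Equal-width bins for comparison: same width, very different counts.
equal_width = pd.cut(ages, bins=4).value_counts().sort_index()

print(counts)
print(equal_width)
```

With many tied values, as in an integer column such as total_pages_visited, the quantile bins are only approximately equal in size.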

In [None]:
title = 'Conversion rate per age category'

sample_age=sample.copy(deep=True)
sample_age['age']= pd.qcut(sample_age['age'],4, labels=['17-24', '25-30', '31-36', '37-69'])
age = sample_age.groupby(['age','converted']).size().reset_index(name='count')

fig = px.histogram(age, x="age", y="count", color="converted", barmode='relative', barnorm='percent', text_auto=True)

fig.update_traces(texttemplate='%{value:.2f}%')
fig.update_yaxes(title = 'Percent')
fig.update_layout(title = title,title_x = 0.5, legend_title="Converted")

fig.to_image(format="png", engine="kaleido")
if os.path.exists(viz_path+title+".png"):
    os.remove(viz_path+title+".png")
    
fig.write_image(viz_path+title+".png")
fig.show()

🗒 **Notes**:   

> The older the user, the lower the conversion rate, which suggests a negative correlation between age and converted. Hence, the newsletter team should focus on young users.

In [None]:
title = 'Conversion rate per total visited pages'

sample_pages =sample.copy(deep=True)
sample_pages['total_pages_visited']= pd.qcut(sample_pages['total_pages_visited'],4, labels=['1-2', '3-4', '5-7', '8-27'])
pages = sample_pages.groupby(['total_pages_visited','converted']).size().reset_index(name='count')
pages = pages.astype({'total_pages_visited':'str'})

fig = px.histogram(pages, x="total_pages_visited", y="count", color="converted", barmode='relative', barnorm='percent', text_auto=True)

fig.update_traces(texttemplate='%{value:.2f}%')
fig.update_yaxes(title = 'Percent')
fig.update_layout(title = title,title_x = 0.5, legend_title="Converted")

fig.to_image(format="png", engine="kaleido")
if os.path.exists(viz_path+title+".png"):
    os.remove(viz_path+title+".png")
    
fig.write_image(viz_path+title+".png")
fig.show()

🗒 **Notes**:     

> The conversion rate rises progressively with the number of pages visited. This suggests that if the website is made appealing enough for users to stay and browse more pages, they may end up converting. This can be confirmed with controlled experiments.

## Correlation

🗒 **Notes**:  

> To build a prediction tool based on multiple predictors, it is important to check whether some predictors are highly correlated, as this would affect both the prediction model and its interpretation.
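
One way to carry out this check is to compute the correlation matrix of the numeric columns and list the pairs whose absolute correlation exceeds a threshold. The sketch below runs on synthetic data containing one deliberately redundant column (`pages_copy`, a hypothetical name). The 0.8 threshold is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n = 1_000
pages = rng.integers(1, 28, size=n)
toy = pd.DataFrame({
    "age": rng.integers(17, 70, size=n),
    "total_pages_visited": pages,
    "pages_copy": pages * 2 + 1,               # deliberately redundant predictor
    "country": rng.choice(["US", "UK"], size=n),
})

# Pearson correlation on numeric columns only.
corr = toy.select_dtypes("number").corr()

# Keep the upper triangle so that each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Pairs above the chosen threshold.
high = pairs[pairs.abs() > 0.8]
print(high)
```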

In [None]:
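# Restrict to numeric columns: recent pandas versions raise an error on object columns
sample.select_dtypes('number').corr()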

🗒 **Notes**:     
> There is no strong correlation between predictors.

> There is some correlation between the total pages visited and the conversion of a user. We should check whether this correlation is higher in the original dataset.

In [None]:
title = 'Correlation degrees between different variables'
MyFunct.my_heatmap(dataset, title)

🗒 **Notes**:  

> The same relationship between the total pages visited and the conversion of a user is observed in the original dataset. Hence, the total pages visited may be a good predictor of conversion.

# Key Insights

> 1) The newsletter team should focus more on the design and content of the website pages to make them more appealing and user-friendly, as the conversion rate rises with the total number of pages visited.

> 2) As returning users have a higher conversion rate than new ones, it is important to give special attention to new visitors, for example by offering special deals, to encourage them to return and eventually convert.