### What are the objectives?
The data scientists behind the newsletter want to better understand the behavior of users visiting their website. They would like to know whether a predictive model of newsletter subscription can be built from only a few pieces of information. They then plan to analyze the model's parameters to highlight the features that are relevant for explaining this behavior and, possibly, to identify a new lever for action to improve the newsletter's conversion rate.

### The project can be split into four steps:

- **Part 1:** perform an **EDA**, apply the **preprocessing** steps, and train a baseline model with the file data_train.csv
- **Part 2:** improve your model's f1-score on your test set (you can try feature engineering, feature selection, regularization, non-linear models, hyperparameter optimization by grid search, etc.)
- **Part 3:** once you're satisfied with your model's score, use it to make predictions with the file data_test.csv. You will have to dump the predictions into a .csv file that will be sent to Kaggle (actually, to your teacher/TA 🤓). You can make as many submissions as you want, so feel free to try different models!
- **Part 4:** take some time to analyze your best model's parameters. Are there any levers for action that would help improve the newsletter's conversion rate? What recommendations would you make to the team?
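
As a rough illustration of Part 3, here is a minimal sketch of dumping predictions into a CSV. The `predictions` list, the `converted` column name, and the in-memory buffer are all assumptions made for the example; the exact submission format expected by the teacher/TA is not specified here.

```python
import io

import pandas as pd

# Hypothetical predictions; in the notebook they would come from classifier.predict(...)
predictions = [0, 1, 0, 0, 1]

# Assumption: a single column named after the target. The real submission
# format may differ.
submission = pd.DataFrame({"converted": predictions})

# Writing to an in-memory buffer keeps the sketch free of side effects.
# Passing a file path instead (for example a hypothetical "predictions.csv")
# would write the same content to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

`index=False` avoids writing the DataFrame index as an extra unnamed column, which is what produces headers such as `Unnamed: 0` when the file is read back.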

#### Import all libraries

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, confusion_matrix

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
# setting Jedha color palette as default
pio.templates["jedha"] = go.layout.Template(
    layout_colorway=["#4B9AC7", "#4BE8E0", "#9DD4F3", "#97FBF6", "#2A7FAF", "#23B1AB", "#0E3449", "#015955"]
)
pio.templates.default = "jedha"
pio.renderers.default = "svg" # to be replaced by "iframe" if working on JULIE
from IPython.display import display

AttributeError: `np.unicode_` was removed in the NumPy 2.0 release. Use `np.str_` instead.

#### 1- EDA

In [3]:
data = pd.read_csv('./conversion_rate/conversion_data_train.csv')
print('Set with labels (our train+test) :', data.shape)

Set with labels (our train+test) : (284580, 6)


In [4]:
data.head()

   country  age  new_user  source  total_pages_visited  converted
0    China   22         1  Direct                    2          0
1       UK   21         1     Ads                    3          0
2  Germany   20         0     Seo                   14          1
3       US   23         1     Seo                    3          0
4       US   28         1  Direct                    3          0


In [7]:
##### Let's analyze the data
# 1. Look at the first rows
print("=== First rows ===")
print(data.head())

# 2. Look at the last rows (to check consistency)
print("\n=== Last rows ===")
print(data.tail())

# 3. Get the size of the dataset
print("\n=== Dimensions ===")
print(f"Rows: {data.shape[0]}")
print(f"Columns: {data.shape[1]}")

# 4. List all columns
print("\n=== List of columns ===")
print(data.columns.tolist())

# 5. Get general information
print("\n=== General information ===")
data.info()  # info() prints directly and returns None, so no print() wrapper is needed

# Number of rows and columns: is that a lot? A little?
# Column names: are they clear? Understandable?
# Data types: int64, float64, object, datetime?
# Missing values: are there "Non-Null" counts that differ from the total number of rows?

=== First rows ===
   country  age  new_user  source  total_pages_visited  converted
0    China   22         1  Direct                    2          0
1       UK   21         1     Ads                    3          0
2  Germany   20         0     Seo                   14          1
3       US   23         1     Seo                    3          0
4       US   28         1  Direct                    3          0

=== Last rows ===
       country  age  new_user  source  total_pages_visited  converted
284575      US   36         1     Ads                    1          0
284576      US   31         1     Seo                    2          0
284577      US   41         1     Seo                    5          0
284578      US   31         1  Direct                    4          0
284579      US   26         0     Ads                    3          0

=== Dimensions ===
Rows: 284580
Columns: 6

=== List of columns ===
['country', 'age', 'new_user', 'source', 'total_pages_

#### Let's analyze the variable types
- Categorical variables (nominal): country, source
- Numerical variables (discrete): age, new_user, total_pages_visited, converted (new_user and converted are binary 0/1 flags)

In [9]:
# Optional: useful, but not strictly required
# STEP 2.1: Separate the variables by type

# Numerical variables
variables_numeriques = data.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Categorical variables (text/object)
variables_categorielles = data.select_dtypes(include=['object']).columns.tolist()

# Datetime variables (if any)
variables_datetime = data.select_dtypes(include=['datetime64']).columns.tolist()

print("=== NUMERICAL VARIABLES ===")
print(variables_numeriques)

print("\n=== CATEGORICAL VARIABLES ===")
print(variables_categorielles)

print("\n=== DATETIME VARIABLES ===")
print(variables_datetime)

=== NUMERICAL VARIABLES ===
['age', 'new_user', 'total_pages_visited', 'converted']

=== CATEGORICAL VARIABLES ===
['country', 'source']

=== DATETIME VARIABLES ===
[]


See https://pandas.pydata.org/docs/user_guide/migration-3-strings.html#string-migration-select-dtypes for details on how to write code that works with pandas 2 and 3.
  variables_categorielles = data.select_dtypes(include=['object']).columns.tolist()
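
The warning above concerns how string columns are matched by dtype name across pandas versions. One way to sidestep the question, sketched here on a small hypothetical DataFrame (`toy` is not the project data), is to test each column with `pandas.api.types.is_numeric_dtype` instead of listing dtype names. This is an illustrative alternative, not the approach used in the notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data with the same kinds of columns as the project dataset
toy = pd.DataFrame({
    "country": ["China", "UK", "Germany"],
    "age": [22, 21, 20],
    "source": ["Direct", "Ads", "Seo"],
    "converted": [0, 0, 1],
})

# A column is treated as numerical if pandas reports a numeric dtype for it;
# everything else (here, the text columns) is treated as categorical.
numeric_cols = [col for col in toy.columns if is_numeric_dtype(toy[col])]
other_cols = [col for col in toy.columns if col not in numeric_cols]

print(numeric_cols)
print(other_cols)
```

Note that this check also counts boolean columns as numeric, so a dataset with `bool` flags would need an extra rule.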


#### Let's analyze the missing values

In [12]:
# Build a DataFrame of missing-value counts, then keep only the columns with at least one missing value
missing_df = pd.DataFrame({
    'Colonne': data.columns,
    'Valeurs_manquantes': data.isnull().sum(),
    'Pourcentage': (data.isnull().sum() / len(data)) * 100
})

missing_df = missing_df[missing_df['Valeurs_manquantes'] > 0].sort_values(
    'Pourcentage', ascending=False
)

print(missing_df)

Empty DataFrame
Columns: [Colonne, Valeurs_manquantes, Pourcentage]
Index: []


In [13]:
data.describe(include='all')

        country        age  new_user  source  total_pages_visited  converted
count    284580   284580.0  284580.0  284580             284580.0   284580.0
unique        4        NaN       NaN       3                  NaN        NaN
top          US        NaN       NaN     Seo                  NaN        NaN
freq     160124        NaN       NaN  139477                  NaN        NaN
mean        NaN  30.564203  0.685452     NaN             4.873252   0.032258
std         NaN   8.266789  0.464336     NaN             3.341995   0.176685
min         NaN       17.0       0.0     NaN                  1.0        0.0
25%         NaN       24.0       0.0     NaN                  2.0        0.0
50%         NaN       30.0       1.0     NaN                  4.0        0.0
75%         NaN       36.0       1.0     NaN                  7.0        0.0
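
The mean of `converted` in the summary above (about 0.032) is the overall conversion rate, so only around 3.2% of rows are positive and the target is strongly imbalanced. A natural next EDA step is to compare that rate across groups. The sketch below does this on a tiny hypothetical DataFrame (`toy`), not on the real data, so the numbers it prints are purely illustrative.

```python
import pandas as pd

# Hypothetical toy data; the real analysis would use the `data` DataFrame
toy = pd.DataFrame({
    "country": ["US", "US", "UK", "UK", "China", "China"],
    "new_user": [1, 0, 1, 0, 1, 1],
    "converted": [0, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of positives, i.e. the conversion rate
overall_rate = toy["converted"].mean()

# Conversion rate per country, highest first
rate_by_country = (
    toy.groupby("country")["converted"].mean().sort_values(ascending=False)
)

print(f"Overall conversion rate: {overall_rate:.3f}")
print(rate_by_country)
```

The same `groupby(...).mean()` pattern applies to `source` or `new_user`, which can hint at which features deserve a place in the model.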


In [None]:
### To be continued

In [14]:
features_list = ['total_pages_visited']
numeric_indices = [0]
categorical_indices = []
target_variable = 'converted'
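
The baseline uses a single numeric feature, so `categorical_indices` is empty and the imported `ColumnTransformer` goes unused. As a hedged sketch of how the two index lists could drive the preprocessing once more features are added, the example below uses a hypothetical toy DataFrame and an assumed choice of `OneHotEncoder(drop="first")` for the categorical column; neither comes from the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy features: two numeric columns and one categorical column
toy = pd.DataFrame({
    "total_pages_visited": [2, 3, 14, 3],
    "age": [22, 21, 20, 23],
    "country": ["China", "UK", "Germany", "US"],
})

# Positional indices of the columns, in the same spirit as the notebook's lists
numeric_indices = [0, 1]
categorical_indices = [2]

preprocessor = ColumnTransformer(
    transformers=[
        # Standardize numeric columns to zero mean and unit variance
        ("num", StandardScaler(), numeric_indices),
        # One-hot encode categories, dropping the first level to avoid redundancy
        ("cat", OneHotEncoder(drop="first"), categorical_indices),
    ]
)

X_processed = preprocessor.fit_transform(toy)
print(X_processed.shape)
```

With four distinct countries and one level dropped, the output has two scaled numeric columns plus three indicator columns.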


In [15]:
X = data.loc[:, features_list]
Y = data.loc[:, target_variable]

print('Explanatory variables : ', X.columns)
print()


Explanatory variables :  Index(['total_pages_visited'], dtype='str')



In [16]:
# Split the dataset into a train set and a test set
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=0)
print("...Done.")
print()


Dividing into train and test sets...
...Done.



In [17]:
# Put all the preprocessing steps here
print("Encoding categorical features and standardizing numerical features...")

featureencoder = StandardScaler()
X_train = featureencoder.fit_transform(X_train)
print("...Done")
print(X_train[0:5,:])


Encoding categorical features and standardizing numerical features...
...Done
[[-0.26070136]
 [ 0.93728655]
 [-0.85969532]
 [-0.56019834]
 [-0.26070136]]
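
For reference, `StandardScaler` replaces each value with z = (x - mean) / std, where the mean and the population standard deviation are computed on the data passed to `fit`. The sketch below reproduces that formula with NumPy on a small made-up array (the values are illustrative, not the training data).

```python
import numpy as np

# Made-up page counts, used only to illustrate the formula
x = np.array([2.0, 3.0, 14.0, 3.0, 3.0])

mean = x.mean()
std = x.std()  # population standard deviation (ddof=0), as StandardScaler uses

# Standardized values: zero mean and unit variance by construction
z = (x - mean) / std
print(z)
```

Because the mean and standard deviation are learned on the training split only, the test split must be transformed with those same statistics rather than refitted.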


In [18]:
# Train model
print("Train model...")
classifier = LogisticRegression()  # default hyperparameters (baseline)
classifier.fit(X_train, Y_train)
print("...Done.")


Train model...
...Done.


In [19]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = classifier.predict(X_train)
print("...Done.")
print(Y_train_pred)
print()


Predictions on training set...
...Done.
[0 0 0 ... 0 0 0]



In [20]:
# Use X_test, and the same preprocessings as in training pipeline,
# but call "transform()" instead of "fit_transform" methods (see example below)

print("Encoding categorical features and standardizing numerical features...")

X_test = featureencoder.transform(X_test)
print("...Done")
print(X_test[0:5,:])


Encoding categorical features and standardizing numerical features...
...Done
[[ 0.63778957]
 [ 0.03879562]
 [-0.26070136]
 [-0.26070136]
 [ 0.63778957]]


In [21]:
# Predictions on test set
print("Predictions on test set...")
Y_test_pred = classifier.predict(X_test)
print("...Done.")
print(Y_test_pred)
print()


Predictions on test set...
...Done.
[0 0 0 ... 0 0 0]



### Performance assessment

In [22]:
# WARNING : Use the same score as the one that will be used by Kaggle !
# Here, the f1-score will be used to assess the performances on the leaderboard
print("f1-score on train set : ", f1_score(Y_train, Y_train_pred))
print("f1-score on test set : ", f1_score(Y_test, Y_test_pred))


f1-score on train set :  0.6938517686692869
f1-score on test set :  0.7060240963855422


In [23]:
# You can also check more performance metrics to better understand what your model is doing
print("Confusion matrix on train set : ")
print(confusion_matrix(Y_train, Y_train_pred))
print()
print("Confusion matrix on test set : ")
print(confusion_matrix(Y_test, Y_test_pred))
print()


Confusion matrix on train set : 
[[246817   1082]
 [  3280   4943]]

Confusion matrix on test set : 
[[27384   117]
 [  371   586]]

