***Cleaning of data***

In [45]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import sklearn 
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

In [6]:
df = pd.read_csv("../synthetic_ecommerce_data.csv")

In [7]:
df.head()

Unnamed: 0,customer_id,gender,age,country,pages_visited,session_time,purchased,product_clicked,product_category,price
0,71974402-3ecd-4cb6-90cb-84805ac45d6a,Male,19,Sudan,9,8.48,0,seven,Electronics,0.0
1,7f656f70-7a89-4a4e-9bae-9d1d403f70a7,Male,55,Niue,14,2.41,0,between,Clothing,0.0
2,0bb2052d-7d53-4bbb-adc5-0ceb202e3de0,Male,50,Papua New Guinea,20,2.26,0,anything,Beauty,0.0
3,c636d181-1763-4cbc-8efc-afe08c923639,Female,32,Ukraine,15,18.29,0,blue,Clothing,0.0
4,e8dbc1eb-d0ac-4c39-b032-7f1a149e13d6,Female,39,Belarus,9,5.93,1,eat,Electronics,50.91


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customer_id       2000 non-null   object 
 1   gender            2000 non-null   object 
 2   age               2000 non-null   int64  
 3   country           2000 non-null   object 
 4   pages_visited     2000 non-null   int64  
 5   session_time      2000 non-null   float64
 6   purchased         2000 non-null   int64  
 7   product_clicked   2000 non-null   object 
 8   product_category  2000 non-null   object 
 9   price             2000 non-null   float64
dtypes: float64(2), int64(3), object(5)
memory usage: 156.4+ KB


In [9]:
df

Unnamed: 0,customer_id,gender,age,country,pages_visited,session_time,purchased,product_clicked,product_category,price
0,71974402-3ecd-4cb6-90cb-84805ac45d6a,Male,19,Sudan,9,8.48,0,seven,Electronics,0.00
1,7f656f70-7a89-4a4e-9bae-9d1d403f70a7,Male,55,Niue,14,2.41,0,between,Clothing,0.00
2,0bb2052d-7d53-4bbb-adc5-0ceb202e3de0,Male,50,Papua New Guinea,20,2.26,0,anything,Beauty,0.00
3,c636d181-1763-4cbc-8efc-afe08c923639,Female,32,Ukraine,15,18.29,0,blue,Clothing,0.00
4,e8dbc1eb-d0ac-4c39-b032-7f1a149e13d6,Female,39,Belarus,9,5.93,1,eat,Electronics,50.91
...,...,...,...,...,...,...,...,...,...,...
1995,b884d741-1f87-4619-9367-8b841131023c,Female,38,Croatia,17,14.95,1,ball,Beauty,109.09
1996,be6b0907-f556-49ed-9356-614c00a617eb,Male,38,Equatorial Guinea,6,6.97,1,act,Clothing,426.92
1997,9e5f1d24-576c-4afa-ae08-3b28e368430a,Male,38,Japan,17,29.46,1,yeah,Books,118.22
1998,afad0595-9124-4095-8905-dc1e163c5b55,Female,48,Namibia,10,12.07,1,prove,Books,364.72


In [10]:
df.drop(columns = 'customer_id',inplace = True)

In [11]:
df

Unnamed: 0,gender,age,country,pages_visited,session_time,purchased,product_clicked,product_category,price
0,Male,19,Sudan,9,8.48,0,seven,Electronics,0.00
1,Male,55,Niue,14,2.41,0,between,Clothing,0.00
2,Male,50,Papua New Guinea,20,2.26,0,anything,Beauty,0.00
3,Female,32,Ukraine,15,18.29,0,blue,Clothing,0.00
4,Female,39,Belarus,9,5.93,1,eat,Electronics,50.91
...,...,...,...,...,...,...,...,...,...
1995,Female,38,Croatia,17,14.95,1,ball,Beauty,109.09
1996,Male,38,Equatorial Guinea,6,6.97,1,act,Clothing,426.92
1997,Male,38,Japan,17,29.46,1,yeah,Books,118.22
1998,Female,48,Namibia,10,12.07,1,prove,Books,364.72


In [12]:
df['country'].value_counts()

country
Congo                        19
Wallis and Futuna            16
Malawi                       16
Fiji                         15
Austria                      15
                             ..
Netherlands Antilles          3
Cook Islands                  3
Bouvet Island (Bouvetoya)     2
Estonia                       2
Luxembourg                    2
Name: count, Length: 243, dtype: int64

In [13]:
df.drop(columns= 'product_clicked', inplace = True)

In [14]:
df

Unnamed: 0,gender,age,country,pages_visited,session_time,purchased,product_category,price
0,Male,19,Sudan,9,8.48,0,Electronics,0.00
1,Male,55,Niue,14,2.41,0,Clothing,0.00
2,Male,50,Papua New Guinea,20,2.26,0,Beauty,0.00
3,Female,32,Ukraine,15,18.29,0,Clothing,0.00
4,Female,39,Belarus,9,5.93,1,Electronics,50.91
...,...,...,...,...,...,...,...,...
1995,Female,38,Croatia,17,14.95,1,Beauty,109.09
1996,Male,38,Equatorial Guinea,6,6.97,1,Clothing,426.92
1997,Male,38,Japan,17,29.46,1,Books,118.22
1998,Female,48,Namibia,10,12.07,1,Books,364.72


In [15]:
df['country'].nunique()   # Number of unique countries

243

With 243 distinct countries, this column (like the other text columns) has to be converted to numeric form with an encoder before it can be fed to a model.

In [16]:
df

Unnamed: 0,gender,age,country,pages_visited,session_time,purchased,product_category,price
0,Male,19,Sudan,9,8.48,0,Electronics,0.00
1,Male,55,Niue,14,2.41,0,Clothing,0.00
2,Male,50,Papua New Guinea,20,2.26,0,Beauty,0.00
3,Female,32,Ukraine,15,18.29,0,Clothing,0.00
4,Female,39,Belarus,9,5.93,1,Electronics,50.91
...,...,...,...,...,...,...,...,...
1995,Female,38,Croatia,17,14.95,1,Beauty,109.09
1996,Male,38,Equatorial Guinea,6,6.97,1,Clothing,426.92
1997,Male,38,Japan,17,29.46,1,Books,118.22
1998,Female,48,Namibia,10,12.07,1,Books,364.72


In [18]:
categorical_features = ['gender', 'country', 'product_category']
numeric_features = ['age', 'pages_visited', 'session_time', 'price']
target = 'purchased'
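The two feature lists above lend themselves to a `ColumnTransformer`, which the notebook imports but does not use. The following is a minimal sketch, not the approach taken in the cells below: it one-hot encodes the categorical columns and standardizes the numeric ones. The `toy` frame is an assumption made for illustration, built from four of the rows shown in `df.head()`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows copied from the df.head() output above (illustration only).
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female"],
    "age": [19, 55, 32, 39],
    "country": ["Sudan", "Niue", "Ukraine", "Belarus"],
    "pages_visited": [9, 14, 15, 9],
    "session_time": [8.48, 2.41, 18.29, 5.93],
    "product_category": ["Electronics", "Clothing", "Clothing", "Electronics"],
    "price": [0.0, 0.0, 0.0, 50.91],
})

categorical_features = ["gender", "country", "product_category"]
numeric_features = ["age", "pages_visited", "session_time", "price"]

preprocessor = ColumnTransformer(
    transformers=[
        # One binary column per category level; levels unseen at fit time encode as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
        # Rescale each numeric column to zero mean and unit variance.
        ("num", StandardScaler(), numeric_features),
    ]
)

encoded = preprocessor.fit_transform(toy)
print(encoded.shape)
```

On the full dataset the one-hot step would add one column per country (243 of them), which is the main trade-off against integer codes.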

In [23]:
df_cleaned = df.copy()  # work on a copy so later changes do not touch df

In [24]:
X = df_cleaned.drop(columns=[target])
y = df_cleaned[target]

In [25]:
df_cleaned

Unnamed: 0,gender,age,country,pages_visited,session_time,purchased,product_category,price
0,Male,19,Sudan,9,8.48,0,Electronics,0.00
1,Male,55,Niue,14,2.41,0,Clothing,0.00
2,Male,50,Papua New Guinea,20,2.26,0,Beauty,0.00
3,Female,32,Ukraine,15,18.29,0,Clothing,0.00
4,Female,39,Belarus,9,5.93,1,Electronics,50.91
...,...,...,...,...,...,...,...,...
1995,Female,38,Croatia,17,14.95,1,Beauty,109.09
1996,Male,38,Equatorial Guinea,6,6.97,1,Clothing,426.92
1997,Male,38,Japan,17,29.46,1,Books,118.22
1998,Female,48,Namibia,10,12.07,1,Books,364.72


In [56]:
category_encoded = pd.DataFrame({
    'product_category': df_cleaned['product_category'],
    'country': df_cleaned['country'],
    'gender': df_cleaned['gender']
})

In [57]:
category_encoded

Unnamed: 0,product_category,country,gender
0,Electronics,Sudan,Male
1,Clothing,Niue,Male
2,Beauty,Papua New Guinea,Male
3,Clothing,Ukraine,Female
4,Electronics,Belarus,Female
...,...,...,...
1995,Beauty,Croatia,Female
1996,Clothing,Equatorial Guinea,Male
1997,Books,Japan,Male
1998,Books,Namibia,Female


In [58]:
label_encoder = LabelEncoder()
# Each fit_transform call refits the encoder, so afterwards it only remembers the gender mapping.
X['product_category'] = label_encoder.fit_transform(category_encoded['product_category'])
X['country'] = label_encoder.fit_transform(category_encoded['country'])
X['gender'] = label_encoder.fit_transform(category_encoded['gender'])

In [59]:
X

Unnamed: 0,gender,age,country,pages_visited,session_time,price,category_encoded,product_category
0,1,19,205,9,8.48,0.00,3,3
1,1,55,158,14,2.41,0.00,2,2
2,1,50,168,20,2.26,0.00,0,0
3,0,32,227,15,18.29,0.00,2,2
4,0,39,19,9,5.93,50.91,3,3
...,...,...,...,...,...,...,...,...
1995,0,38,53,17,14.95,109.09,0,0
1996,1,38,64,6,6.97,426.92,2,2
1997,1,38,108,17,29.46,118.22,1,1
1998,0,48,148,10,12.07,364.72,1,1


In [60]:
X.drop(columns = "category_encoded", inplace = True)

In [61]:
X

Unnamed: 0,gender,age,country,pages_visited,session_time,price,product_category
0,1,19,205,9,8.48,0.00,3
1,1,55,158,14,2.41,0.00,2
2,1,50,168,20,2.26,0.00,0
3,0,32,227,15,18.29,0.00,2
4,0,39,19,9,5.93,50.91,3
...,...,...,...,...,...,...,...
1995,0,38,53,17,14.95,109.09,0
1996,1,38,64,6,6.97,426.92,2
1997,1,38,108,17,29.46,118.22,1
1998,0,48,148,10,12.07,364.72,1


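All columns in X are now numeric. The integer codes for country and product_category follow the alphabetical order of the labels and do not express any real ranking.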
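If the original labels need to be recovered later, one option is to keep a separate fitted encoder per column. The sketch below assumes a small toy frame and a hypothetical `encoders` dictionary; neither is part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only.
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female"],
    "product_category": ["Electronics", "Clothing", "Beauty"],
})

encoders = {}  # hypothetical helper: column name -> fitted LabelEncoder
encoded = toy.copy()
for col in ["gender", "product_category"]:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc  # keep the fitted encoder so the mapping is not lost

# Map the integer codes back to the original labels.
decoded = encoders["product_category"].inverse_transform(encoded["product_category"])
print(encoded)
print(list(decoded))
```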

In [62]:
X_correlation = X.corr()

In [63]:
X_correlation

Unnamed: 0,gender,age,country,pages_visited,session_time,price,product_category
gender,1.0,-0.030027,0.020337,0.007093,-0.009346,0.006303,-0.00776
age,-0.030027,1.0,-0.017014,-0.011346,0.013134,0.003076,0.032044
country,0.020337,-0.017014,1.0,0.026038,0.031694,-0.024097,0.047757
pages_visited,0.007093,-0.011346,0.026038,1.0,0.013015,-0.054328,0.000665
session_time,-0.009346,0.013134,0.031694,0.013015,1.0,-0.005377,0.03548
price,0.006303,0.003076,-0.024097,-0.054328,-0.005377,1.0,0.006863
product_category,-0.00776,0.032044,0.047757,0.000665,0.03548,0.006863,1.0


In [67]:
X['purchased'] = y

In [68]:
X

Unnamed: 0,gender,age,country,pages_visited,session_time,price,product_category,purchased
0,1,19,205,9,8.48,0.00,3,0
1,1,55,158,14,2.41,0.00,2,0
2,1,50,168,20,2.26,0.00,0,0
3,0,32,227,15,18.29,0.00,2,0
4,0,39,19,9,5.93,50.91,3,1
...,...,...,...,...,...,...,...,...
1995,0,38,53,17,14.95,109.09,0,1
1996,1,38,64,6,6.97,426.92,2,1
1997,1,38,108,17,29.46,118.22,1,1
1998,0,48,148,10,12.07,364.72,1,1


In [69]:
y_correlate = X.corr()['purchased'].sort_values(ascending=False)

In [70]:
y_correlate

purchased           1.000000
price               0.775887
product_category    0.016431
age                -0.000430
gender             -0.012442
country            -0.024449
session_time       -0.030770
pages_visited      -0.044213
Name: purchased, dtype: float64

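Price has a **strong positive** correlation (0.775887) with purchased:
Caution: in every row displayed above, price is 0.00 when purchased is 0 and positive when purchased is 1.
The correlation therefore most likely reflects that a price is recorded only for completed purchases, not that more expensive items are more likely to be bought. If so, price is a consequence of the target rather than a predictor of it.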
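Before reading much into this correlation, it may help to compare price across the two classes. The sketch below uses only the five rows shown in `df.head()`, so it illustrates the check rather than a result for the full dataset; `toy` and `zero_when_not_purchased` are names introduced here.

```python
import pandas as pd

# Rows copied from the df.head() output above (illustration only).
toy = pd.DataFrame({
    "purchased": [0, 0, 0, 0, 1],
    "price": [0.0, 0.0, 0.0, 0.0, 50.91],
})

# Price range within each class.
summary = toy.groupby("purchased")["price"].agg(["min", "max", "mean"])
print(summary)

# True when every non-purchase row carries a price of exactly zero.
zero_when_not_purchased = bool((toy.loc[toy["purchased"] == 0, "price"] == 0).all())
print(zero_when_not_purchased)
```

If the same pattern holds on all 2000 rows, price encodes the outcome directly and the correlation says little about customer behaviour.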

Product Category has an extremely **weak positive** correlation (0.016431):
The category codes are arbitrary integers, so the coefficient is hard to interpret. At most it hints that some categories are purchased marginally more often than others.
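Because the category codes are arbitrary, a per-category purchase rate is usually easier to read than a correlation coefficient. A minimal sketch, assuming a toy frame built from the nine rows displayed in the head and tail of the table; the column name `sessions` is introduced here for illustration.

```python
import pandas as pd

# Rows copied from the displayed head and tail of the dataset (illustration only).
toy = pd.DataFrame({
    "product_category": ["Electronics", "Clothing", "Beauty", "Clothing",
                         "Electronics", "Beauty", "Clothing", "Books", "Books"],
    "purchased": [0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Share of rows that ended in a purchase, plus the row count, per category.
rate_by_category = (
    toy.groupby("product_category")["purchased"]
    .agg(purchase_rate="mean", sessions="count")
    .sort_values("purchase_rate", ascending=False)
)
print(rate_by_category)
```

Nine rows are far too few to draw conclusions; the same call on `df_cleaned` would give the rates for the full dataset.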

***Model implementation***

In [71]:
X

Unnamed: 0,gender,age,country,pages_visited,session_time,price,product_category,purchased
0,1,19,205,9,8.48,0.00,3,0
1,1,55,158,14,2.41,0.00,2,0
2,1,50,168,20,2.26,0.00,0,0
3,0,32,227,15,18.29,0.00,2,0
4,0,39,19,9,5.93,50.91,3,1
...,...,...,...,...,...,...,...,...
1995,0,38,53,17,14.95,109.09,0,1
1996,1,38,64,6,6.97,426.92,2,1
1997,1,38,108,17,29.46,118.22,1,1
1998,0,48,148,10,12.07,364.72,1,1


In [79]:
y_updated = X['purchased']

In [82]:
X.drop(columns='purchased',inplace=True)

In [83]:
X

Unnamed: 0,gender,age,country,pages_visited,session_time,price,product_category
0,1,19,205,9,8.48,0.00,3
1,1,55,158,14,2.41,0.00,2
2,1,50,168,20,2.26,0.00,0
3,0,32,227,15,18.29,0.00,2
4,0,39,19,9,5.93,50.91,3
...,...,...,...,...,...,...,...
1995,0,38,53,17,14.95,109.09,0
1996,1,38,64,6,6.97,426.92,2
1997,1,38,108,17,29.46,118.22,1
1998,0,48,148,10,12.07,364.72,1


In [84]:
y = y_updated

In [85]:
y

0       0
1       0
2       0
3       0
4       1
       ..
1995    1
1996    1
1997    1
1998    1
1999    1
Name: purchased, Length: 2000, dtype: int64

In [86]:
# Train/test split: first 1500 rows for training, remaining 500 for testing (no shuffling)
X_train = X[:1500]
y_train = y[:1500]


In [88]:
X_test = X[1500:]
y_test = y[1500:]

In [90]:
X_test.shape

(500, 7)
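Slicing by position keeps the file order, which can bias the split if the rows are sorted in any way. A shuffled, stratified split is a common alternative. The sketch below uses random stand-in data (`X_demo`, `y_demo`) with the same shape as X and y; it is not the split used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as X and y above (values are random).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(2000, 7)))
y_demo = pd.Series(rng.integers(0, 2, size=2000), name="purchased")

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo,
    test_size=0.25,      # 500 of 2000 rows, the same size as the slice above
    stratify=y_demo,     # keep the class ratio the same in both parts
    random_state=42,     # reproducible shuffle
)
print(X_train.shape, X_test.shape)
```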

In [91]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [92]:
model = RandomForestClassifier(random_state=42)


In [93]:
model.fit(X_train, y_train)


In [99]:
y_pred = model.predict(X_test)

In [100]:
accuracy_score(y_test, y_pred)

1.0

In [101]:
confusion_matrix(y_test, y_pred)

array([[256,   0],
       [  0, 244]])

In [103]:
report = classification_report(y_test, y_pred)

In [104]:
type(report)

str

In [105]:
report

'              precision    recall  f1-score   support\n\n           0       1.00      1.00      1.00       256\n           1       1.00      1.00      1.00       244\n\n    accuracy                           1.00       500\n   macro avg       1.00      1.00      1.00       500\nweighted avg       1.00      1.00      1.00       500\n'

In [106]:
report_dict = classification_report(y_test, y_pred, output_dict=True)

In [107]:
report_dict

{'0': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 256.0},
 '1': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 244.0},
 'accuracy': 1.0,
 'macro avg': {'precision': 1.0,
  'recall': 1.0,
  'f1-score': 1.0,
  'support': 500.0},
 'weighted avg': {'precision': 1.0,
  'recall': 1.0,
  'f1-score': 1.0,
  'support': 500.0}}

In [109]:
report_df = pd.DataFrame(report_dict).transpose()
report_df.style

Unnamed: 0,precision,recall,f1-score,support
0,1.0,1.0,1.0,256.0
1,1.0,1.0,1.0,244.0
accuracy,1.0,1.0,1.0,1.0
macro avg,1.0,1.0,1.0,500.0
weighted avg,1.0,1.0,1.0,500.0


The *classification model* achieved perfect performance metrics: precision, recall, f1-score, and accuracy are all equal to 1.0 for both classes. In other words, it classified all 500 test instances without a single error.

A perfect score on held-out data points to an almost trivially separable problem rather than to classic overfitting, which would show up as a gap between training and test performance. The rows displayed earlier suggest why: price is 0.00 whenever purchased is 0, so a single threshold on price may be enough to separate the classes. Because the dataset is synthetic, this level of performance is unlikely to carry over to real-world data.
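One way to probe a perfect score is to refit the model without a suspect feature and see how far accuracy falls. The sketch below runs on synthetic stand-in data, not on the notebook's dataset: the label is random and price is non-zero only when purchased is 1, mimicking the pattern visible in the displayed rows. `holdout_accuracy` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: random labels, and price is non-zero only when purchased == 1.
rng = np.random.default_rng(0)
n = 2000
y_demo = pd.Series(rng.integers(0, 2, size=n), name="purchased")
X_demo = pd.DataFrame({
    "age": rng.integers(18, 60, size=n),
    "pages_visited": rng.integers(1, 21, size=n),
    "session_time": rng.uniform(1, 30, size=n).round(2),
    "price": (y_demo * rng.uniform(5, 500, size=n)).round(2),
})

def holdout_accuracy(features):
    """Fit a random forest on the given features and score it on a held-out quarter."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y_demo, test_size=0.25, stratify=y_demo, random_state=42
    )
    clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

acc_with_price = holdout_accuracy(X_demo)
acc_without_price = holdout_accuracy(X_demo.drop(columns="price"))
print(acc_with_price, acc_without_price)
```

A large drop once price is removed would indicate that the model relies on that one column. Running the same comparison on X and y would show whether this applies here.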

***Model validation***

In [110]:
from sklearn.model_selection import cross_val_score

In [113]:
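# Note: this cross-validates on the 500 held-out rows only; each fold refits a clone of the model on 400 of them.
cv_scores = cross_val_score(model, X_test, y_test, cv=5, scoring='accuracy')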

In [114]:
cv_scores

array([1., 1., 1., 1., 1.])

In [115]:
print("Cross-validation accuracy scores:", cv_scores)
print("Mean CV accuracy:", cv_scores.mean())
print("Standard deviation:", cv_scores.std())

Cross-validation accuracy scores: [1. 1. 1. 1. 1.]
Mean CV accuracy: 1.0
Standard deviation: 0.0


The model scores 1.0 on every cross-validation fold, with zero variance. This suggests that the dataset is highly structured and the classes are very easy to separate, or that the model is latching onto patterns specific to the synthetic data. It does not show that the model would generalize to real-world data, which is typically noisier and less predictable.
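Cross-validation is more often run on the full dataset (or the training portion) with shuffled, stratified folds. A minimal sketch under that assumption, using stand-in data from `make_classification` rather than the notebook's X and y:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data with the same shape as X and y (2000 rows, 7 features).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=7, n_informative=3, random_state=42
)

# Shuffled folds that preserve the class ratio in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(random_state=42)

scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print("Mean CV accuracy:", scores.mean())
print("Standard deviation:", scores.std())
```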

**Final report and conclusions on the data**

THE BEST 3 INDICATORS THAT DRIVE THE BUSINESS ARE: PRICE, PRODUCT_CATEGORY AND SESSION_TIME
- Price is the only feature with a strong correlation to purchases (0.78), although it may simply be recorded at purchase time. Product_category (0.016) and session_time (-0.031) are close to zero, so the evidence for them is weak.
- Promote best-selling categories on the homepage or through ads.
- Use personalized recommendations to highlight products from categories a user has previously browsed.
- Streamline the user experience to reduce friction, for example with a faster checkout.