# Online Store - Purchase Forecasting 

## I. Introduction

In this report, we present the modeling process for predicting customer behavior in an online store. The goal is to predict whether a given customer will make a purchase (1) or not (0). For this purpose, we use two models: a Random Forest, used to rank feature importance, and Logistic Regression, used for the prediction itself.


### Objectives

1. Exploratory Data Analysis (EDA) to investigate the data.
2. Data preparation - cleaning and preprocessing.
3. Utilizing two classification algorithms - Logistic Regression and Random Forest.
4. Evaluation and comparison of model performances.
5. Addressing class imbalance issues.
6. Selection of the most suitable model for purchase prediction.




## II. Importing the necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report 
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from sklearn.utils import class_weight
from sklearn.model_selection import cross_val_score
from imblearn.under_sampling import RandomUnderSampler
import seaborn as sns
sns.set()




## III. Exploratory Data Analysis (EDA)

During EDA, the following observations were made:

1. Number of observations: 12,330.
2. Number of columns: 18 in the raw data, including the target variable "Revenue"; 29 after one-hot encoding "VisitorType" and "Month".
3. There are no missing values in the dataset.


### Loading the raw data

In [2]:
raw_data = pd.read_csv("C:/Users/aleksandar.dimitrov/Desktop/INFOLITICA/DATA SCIENCE/Projects/Online Visitors/data.csv")
raw_data.head()

| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.0 | 0.2 | 0.2 | 0.0 | 0.0 | Feb | 1 | 1 | 1 | 1 | Returning_Visitor | False | False |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.0 | 0.0 | 0.1 | 0.0 | 0.0 | Feb | 2 | 2 | 1 | 2 | Returning_Visitor | False | False |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.0 | 0.2 | 0.2 | 0.0 | 0.0 | Feb | 4 | 1 | 9 | 3 | Returning_Visitor | False | False |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.05 | 0.14 | 0.0 | 0.0 | Feb | 3 | 2 | 2 | 4 | Returning_Visitor | False | False |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.5 | 0.02 | 0.05 | 0.0 | 0.0 | Feb | 3 | 3 | 1 | 4 | Returning_Visitor | True | False |


### Displaying statistical information about the dataset

In [3]:
raw_data.describe(include = "all")

| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330 | 12330 | 12330 |
| unique | | | | | | | | | | | 10 | | | | | 3 | 2 | 2 |
| top | | | | | | | | | | | May | | | | | Returning_Visitor | False | False |
| freq | | | | | | | | | | | 3364 | | | | | 10551 | 9462 | 10422 |
| mean | 2.315166 | 80.818611 | 0.503569 | 34.472398 | 31.731468 | 1194.74622 | 0.022191 | 0.043073 | 5.889258 | 0.061427 | | 2.124006 | 2.357097 | 3.147364 | 4.069586 | | | |
| std | 3.321784 | 176.779107 | 1.270156 | 140.749294 | 44.475503 | 1913.669288 | 0.048488 | 0.048597 | 18.568437 | 0.198917 | | 0.911325 | 1.717277 | 2.401591 | 4.025169 | | | |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | 1.0 | 1.0 | 1.0 | 1.0 | | | |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 184.1375 | 0.0 | 0.014286 | 0.0 | 0.0 | | 2.0 | 2.0 | 1.0 | 2.0 | | | |
| 50% | 1.0 | 7.5 | 0.0 | 0.0 | 18.0 | 598.936905 | 0.003112 | 0.025156 | 0.0 | 0.0 | | 2.0 | 2.0 | 3.0 | 2.0 | | | |
| 75% | 4.0 | 93.25625 | 0.0 | 0.0 | 38.0 | 1464.157214 | 0.016813 | 0.05 | 0.0 | 0.0 | | 3.0 | 2.0 | 4.0 | 4.0 | | | |


### Checking for missing values

In [4]:
missing_values = raw_data.isnull().sum()
missing_values

Administrative             0
Administrative_Duration    0
Informational              0
Informational_Duration     0
ProductRelated             0
ProductRelated_Duration    0
BounceRates                0
ExitRates                  0
PageValues                 0
SpecialDay                 0
Month                      0
OperatingSystems           0
Browser                    0
Region                     0
TrafficType                0
VisitorType                0
Weekend                    0
Revenue                    0
dtype: int64

## IV. Data Preparation

The data preparation process involves the following steps:

1. One-hot encoding for categorical features "VisitorType" and "Month" - this converts categorical data into numerical values, which is necessary for model training. 

2. Label encoding for the binary feature "Weekend" - this converts binary data into numerical labels (0 and 1). 

3. Label encoding for the target variable "Revenue" - this converts binary data into numerical labels (0 and 1).

4. Splitting the data into training and testing sets - this allows us to train the models on the training data and evaluate them on the testing data. 
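
The encoding steps above can be sketched on a small, made-up frame. This is an illustration under assumptions rather than the notebook's exact code: the `toy` frame and all of its values are hypothetical, and `dtype=int` is passed so that the dummy columns come out as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical three-row frame that mimics the relevant columns of the dataset.
toy = pd.DataFrame({
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Other"],
    "Month": ["Feb", "Mar", "Feb"],
    "Weekend": [False, True, False],
    "Revenue": [False, False, True],
})

# One-hot encode both categorical columns in a single call; the original
# columns are dropped automatically and the new ones get a "<column>_" prefix.
encoded = pd.get_dummies(toy, columns=["VisitorType", "Month"], dtype=int)

# Boolean columns can be cast directly to 0/1 integers.
encoded["Weekend"] = encoded["Weekend"].astype(int)
encoded["Revenue"] = encoded["Revenue"].astype(int)

print(encoded.columns.tolist())
```

Passing `columns=[...]` to `pd.get_dummies` replaces the separate encode, concatenate, and drop steps with one call, which may be easier to keep consistent when more categorical columns are added.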

### Converting the categorical data into the appropriate data type 

#### One-hot encoding for the "VisitorType" column

In [5]:
# One-hot encoding
one_hot_encoded = pd.get_dummies(raw_data['VisitorType'], prefix='VisitorType')

# Concatenate the one-hot encoded columns 
data_mod = pd.concat([raw_data, one_hot_encoded], axis=1)

# Drop the original 'VisitorType' column 
data_mod.drop('VisitorType', axis=1, inplace=True)


#### One-hot encoding for the "Month" column

In [6]:
# One-hot encoding
one_hot_encoded = pd.get_dummies(data_mod["Month"], prefix="Month")

# Concatenate the one-hot encoded columns 
data_mod = pd.concat([data_mod, one_hot_encoded], axis=1)

# Drop the original 'Month' column 
data_mod.drop("Month", axis=1, inplace=True)


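#### Label encoding for the "Weekend" column

In [7]:
# "Weekend" holds True/False values; assuming read_csv parsed it as a boolean
# column (as the preview suggests), a direct cast yields 0/1.
data_mod["Weekend"] = data_mod["Weekend"].astype(int)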





#### Label encoding for the target variable "Revenue"

In [8]:
# "Revenue" is assumed to be a boolean column as well, so a direct cast yields 0/1.
data_mod["Revenue"] = data_mod["Revenue"].astype(int)
data_mod

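| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | ... | Month_Aug | Month_Dec | Month_Feb | Month_Jul | Month_June | Month_Mar | Month_May | Month_Nov | Month_Oct | Month_Sep |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.200000 | 0.200000 | 0.000000 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.000000 | 0.000000 | 0.100000 | 0.000000 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.000000 | 0.200000 | 0.200000 | 0.000000 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.050000 | 0.140000 | 0.000000 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.500000 | 0.020000 | 0.050000 | 0.000000 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12325 | 3 | 145.0 | 0 | 0.0 | 53 | 1783.791667 | 0.007143 | 0.029031 | 12.241717 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12326 | 0 | 0.0 | 0 | 0.0 | 5 | 465.750000 | 0.000000 | 0.021333 | 0.000000 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 12327 | 0 | 0.0 | 0 | 0.0 | 6 | 184.250000 | 0.083333 | 0.086667 | 0.000000 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 12328 | 4 | 75.0 | 0 | 0.0 | 15 | 346.000000 | 0.000000 | 0.021053 | 0.000000 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |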


### Splitting the data into training and testing sets

In [9]:
X = data_mod.drop("Revenue", axis=1)
y = data_mod["Revenue"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
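
The split above is random but not stratified, so the share of purchases in the test set may differ somewhat from the share in the full data. A minimal sketch of a stratified alternative, assuming synthetic stand-in data (the `X_demo` and `y_demo` arrays are made up for illustration); `stratify` is a standard parameter of `train_test_split`. Switching the notebook itself to a stratified split would alter the metrics reported later, so this is shown separately.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows with exactly 15% positives (made-up values).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 150 + [0] * 850)

# stratify=y_demo keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("positive share, full data:", y_demo.mean())
print("positive share, test set: ", y_te.mean())
```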



## V. Modelling

1. Model I - Random Forest
2. Model II - Logistic Regression

### Initializing and training the Random Forest model

We start by creating an instance of the Random Forest Classifier called rf_model using the default hyperparameters. 

In [11]:
rf_model = RandomForestClassifier()

In [12]:
rf_model.fit(X_train, y_train)

### Visualizing feature importance

After training the Random Forest model, we inspect the importance of each feature for predicting the target variable ("Revenue"). scikit-learn computes these values as the mean decrease in impurity: the more a feature's splits reduce node impurity across the trees, the higher its importance. The values are normalized to sum to 1.

In [13]:
feature_importance = rf_model.feature_importances_

In [14]:
feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importance})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
feature_importance_df

| | Feature | Importance |
|---|---|---|
| 8 | PageValues | 0.362259 |
| 5 | ProductRelated_Duration | 0.088646 |
| 7 | ExitRates | 0.087763 |
| 4 | ProductRelated | 0.069047 |
| 1 | Administrative_Duration | 0.058552 |
| 6 | BounceRates | 0.054519 |
| 0 | Administrative | 0.041597 |
| 13 | TrafficType | 0.032269 |
| 12 | Region | 0.031841 |
| 3 | Informational_Duration | 0.027764 |


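By analyzing feature_importance_df, we can identify which features have the greatest impact on predicting whether a customer will make a purchase in the online store. PageValues dominates with an importance of about 0.36, followed by ProductRelated_Duration and ExitRates at roughly 0.09 each.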
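
For context, scikit-learn's `feature_importances_` for tree ensembles is impurity-based: each split is credited with the weighted decrease in impurity it achieves, and those credits are accumulated per feature and normalized. The sketch below shows that calculation for a single split using Gini impurity. The label lists are made up for illustration and the helper names `gini` and `impurity_decrease` are hypothetical, not library functions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Weighted Gini decrease obtained by splitting parent into left/right."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Made-up node with 10 sessions: 4 purchases (1) and 6 non-purchases (0).
parent = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# A hypothetical split that separates the classes fairly well.
left = [1, 1, 1, 0]
right = [1, 0, 0, 0, 0, 0]

print(round(gini(parent), 3))
print(round(impurity_decrease(parent, left, right), 3))
```

A feature that is used in many splits with large decreases, as PageValues appears to be here, ends up with a high importance. Impurity-based scores can favor features with many distinct values, so they are best read as a ranking aid rather than an exact measure of effect.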

### Evaluating the Logistic Regression model BEFORE undersampling

We initialize a Logistic Regression model (logreg_model) with default hyperparameters, train it on the training data (X_train and y_train), and evaluate it on the test set.

In [15]:
logreg_model = LogisticRegression()
logreg_model.fit(X_train, y_train)
y_pred = logreg_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the model before undersampling: {:.2f}%".format(accuracy * 100))
print(classification_report(y_test, y_pred))


Accuracy of the model before undersampling: 86.82%
              precision    recall  f1-score   support

           0       0.88      0.97      0.92      2055
           1       0.71      0.35      0.47       411

    accuracy                           0.87      2466
   macro avg       0.80      0.66      0.70      2466
weighted avg       0.85      0.87      0.85      2466



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


The logistic regression model is trained on the original training data without any correction for class imbalance. The accuracy is 86.82%, but the classification report shows a clear gap between the classes: class "0" (no purchase) reaches a recall of 0.97 and an F1-score of 0.92, whereas class "1" (purchase) reaches a recall of only 0.35 and an F1-score of 0.47, so roughly two out of three purchases are missed. The solver also stopped at its iteration limit, so the coefficients may not be fully converged.
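
One way to put the 86.82% in perspective is to compare it with a trivial baseline that always predicts "no purchase". The sketch below uses only the support counts from the classification report above (2055 and 411); the variable names are illustrative.

```python
# Support counts taken from the classification report above.
n_negative = 2055  # test sessions without a purchase (class 0)
n_positive = 411   # test sessions with a purchase (class 1)

# A model that always predicts class 0 is right on every negative session
# and wrong on every positive one.
baseline_accuracy = n_negative / (n_negative + n_positive)
baseline_recall_positive = 0 / n_positive

print("Baseline accuracy: {:.2f}%".format(baseline_accuracy * 100))
print("Baseline recall for class 1: {:.2f}".format(baseline_recall_positive))
```

The always-negative baseline already reaches about 83% accuracy, so accuracy alone says little on this dataset; recall and F1-score for class "1" are more informative.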

### Undersampling the negative examples

To address the class imbalance issue, where the number of non-purchase instances outweighs the purchase instances, we use Random Under-Sampling. This method removes some examples from the majority class (non-purchase) to balance the two classes.

In [16]:
rus = RandomUnderSampler(sampling_strategy='auto', random_state=42)
X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)


This approach reduces the impact of the majority class during training and ensures more balanced data for model training.
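
Conceptually, random undersampling keeps every minority-class row and draws an equally sized random subset of the majority class. The following sketch spells that out in plain Python on made-up labels. It illustrates the idea and is not the imbalanced-learn implementation; the function name `undersample_indices` is hypothetical.

```python
import random
from collections import Counter

def undersample_indices(labels, seed=42):
    """Return sorted row indices that balance the classes by dropping
    randomly chosen rows from every class larger than the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # Sample without replacement; the smallest class is kept in full.
        kept.extend(rng.sample(indices, target))
    return sorted(kept)

# Made-up labels: 8 non-purchases (0) and 2 purchases (1).
y_toy = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
kept = undersample_indices(y_toy)

print(Counter(y_toy[i] for i in kept))
```

As in the notebook, only the training data should be resampled; the test set keeps its original class distribution so that the evaluation reflects real traffic.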


### Evaluating the Logistic Regression model AFTER undersampling

We initialize a new Logistic Regression model (logreg_model) and train it using the resampled training data (X_train_resampled and y_train_resampled).

In [17]:
logreg_model = LogisticRegression()
logreg_model.fit(X_train_resampled, y_train_resampled)
y_pred = logreg_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the model after undersampling: {:.2f}%".format(accuracy * 100))
print(classification_report(y_test, y_pred))

Accuracy of the model after undersampling: 87.10%
              precision    recall  f1-score   support

           0       0.95      0.90      0.92      2055
           1       0.59      0.74      0.66       411

    accuracy                           0.87      2466
   macro avg       0.77      0.82      0.79      2466
weighted avg       0.89      0.87      0.88      2466



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


After undersampling the majority class (no purchase), the logistic regression model is trained on the resampled data. Accuracy is essentially unchanged (87.10% versus 86.82%), but the behavior on class "1" (purchase) shifts markedly: recall rises from 0.35 to 0.74 and the F1-score from 0.47 to 0.66, while precision drops from 0.71 to 0.59. The model now finds far more of the actual purchases at the cost of more false positives.
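
The class "1" F1-scores in the two classification reports follow directly from precision and recall via F1 = 2PR / (P + R). A quick check, using the rounded values printed in the reports (so the results are approximate); the helper name `f1_score_from` is illustrative:

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Class 1 values as printed in the two classification reports.
before = f1_score_from(precision=0.71, recall=0.35)
after = f1_score_from(precision=0.59, recall=0.74)

print("F1 before undersampling: {:.2f}".format(before))
print("F1 after undersampling:  {:.2f}".format(after))
```

Because F1 is a harmonic mean, it is pulled toward the smaller of the two inputs, which is why the low recall before undersampling keeps the score below 0.5 despite a precision of 0.71.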

### Cross-validation with 5 folds for the Logistic Regression model AFTER undersampling

To check the stability and generalizability of the Logistic Regression model after undersampling, we perform cross-validation with 5 folds.

In [18]:
model = LogisticRegression()
cv_scores = cross_val_score(model, X_train_resampled, y_train_resampled, cv=5, scoring='accuracy')
print("Cross-validation scores: {}".format(cv_scores))
print("Mean accuracy: {:.2f}".format(cv_scores.mean()))
print("Standard deviation: {:.2f}".format(cv_scores.std()))



Cross-validation scores: [0.80634391 0.79298831 0.82804674 0.8163606  0.81103679]
Mean accuracy: 0.81
Standard deviation: 0.01


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

The five fold scores range from 0.79 to 0.83, with a mean accuracy of 0.81 and a standard deviation of 0.01, so the model's performance is stable across folds. Note that these scores are measured on the balanced, undersampled training data and are therefore not directly comparable to the 87.10% accuracy obtained on the imbalanced test set.
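
The solver messages in the cell output above suggest scaling the features or raising `max_iter`. One possible way to do both is sketched below on synthetic data, since the notebook's reported results would change if its cells were altered. A `StandardScaler` and a `LogisticRegression` are combined in a pipeline so that the scaler is re-fitted inside each cross-validation fold. The data from `make_classification`, the value `max_iter=1000`, and the choice of F1 as the scoring metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (about 85% negatives); not the store data.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=42
)

# Scaling lives inside the pipeline, so each fold fits the scaler on its
# own training part only and no information leaks from the held-out part.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="f1")

print("F1 per fold:", f1_scores.round(3))
print("Mean F1: {:.3f}".format(f1_scores.mean()))
```

Scoring with F1 rather than accuracy keeps the focus on the purchase class, which is the class the undersampling step is meant to help.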

## VI. Conclusion


In this report, we presented the modeling process for predicting customer behavior in an online store. We used Random Forest to evaluate feature importance and Logistic Regression for the actual prediction. Undersampling the negative examples addressed the class imbalance: it raised recall for purchases from 0.35 to 0.74 and the F1-score from 0.47 to 0.66, at the cost of lower precision, while overall accuracy stayed at about 87%. Five-fold cross-validation on the undersampled training data gave consistent scores (mean accuracy 0.81, standard deviation 0.01), which suggests that the model's performance is stable across data splits.




