<div style="text-align: center; font-size: 46px; color: blue;">
    <u><b>ONLINE SHOPPERS PURCHASING INTENTION</b></u>
</div>

<div style="text-align: center; font-size: 20px; color: red;">
    <b>SEETHA V</b>
</div>

<div style="text-align: center; font-size: 35px; color: red;">
    <u><b>OVERVIEW OF THE PROJECT</b></u>
</div>

<div style="text-align: left; font-size: 26px; color: PURPLE;">
    <u><b>PROBLEM STATEMENT</b></u>
</div>

<div style="text-align: LEFT; font-size: 20px; color: white;">
    <p>The primary objective of this project is to analyze the Online Shoppers' Intention dataset to identify the factors influencing online shopping behavior and predict the likelihood of a visitor making a purchase, thereby driving revenue. This dataset includes various features that capture user interactions on an e-commerce platform, such as page views, time spent on different page types, and visitor characteristics. By leveraging these insights, the project aims to optimize marketing strategies and enhance the user experience on online shopping platforms. </p>    
</div>

<div style="text-align: left; font-size: 26px; color: PURPLE;">
    <u><b>GOAL</b></u>
</div>

<div style="text-align: LEFT; font-size: 20px; color: white;">
    <p>The objective of this project is to predict the purchase intent of online shoppers based on their browsing behaviors and interactions with the website. Specifically, the goal is to classify whether a visitor will make a purchase (Revenue = TRUE) or not (Revenue = FALSE) by analyzing various features derived from their online activities. </p>    
</div>
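
<div style="text-align: LEFT; font-size: 20px; color: white;">
    <p>As a minimal sketch of how this classification target could be prepared, the snippet below encodes a boolean Revenue column as 0/1 and reports the share of each class. The <code>toy</code> DataFrame and its values are hypothetical stand-ins for the real dataset, used only so the example runs on its own.</p>
</div>

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "PageValues": [0.0, 12.2, 0.0, 35.1, 0.0],
    "Revenue": [False, True, False, True, False],
})

# Encode the boolean target as 0 (no purchase) / 1 (purchase).
target = toy["Revenue"].astype(int)

# Share of each class; a strong skew here would suggest class imbalance.
class_share = target.value_counts(normalize=True).sort_index()
print(class_share)
```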

<div style="text-align: left; font-size: 26px; color: PURPLE;">
    <u><b>SOURCE</b></u>
</div>

<div style="text-align: LEFT; font-size: 20px; color: white;">
    <p>The dataset can be downloaded from the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset</p>    
</div>

<div style="text-align: left; font-size: 26px; color: PURPLE;">
    <u><b>FEATURES</b></u>
</div>

<div style="text-align: LEFT; font-size: 20px; color: white;">
    <p><b>*  Administrative:</b> Number of administrative pages viewed by the visitor.</p>
    <p><b>*  Administrative_Duration:</b> Total time (in seconds) spent on administrative pages.</p>
    <p><b>*  Informational:</b> Number of informational pages viewed.</p>
    <p><b>*  Informational_Duration:</b> Total time (in seconds) spent on informational pages.</p>
    <p><b>*  ProductRelated:</b> Number of product-related pages viewed during the session.</p>
    <p><b>*  ProductRelated_Duration:</b> Total time (in seconds) spent on product-related pages.</p>
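    <p><b>*  BounceRates:</b> Average bounce rate of the pages viewed in the session, i.e., the share of visitors who enter the site on a page and leave without viewing another page (stored as a fraction from 0 to 0.2 in this dataset).</p>
    <p><b>*  ExitRates:</b> Average exit rate of the pages viewed in the session, i.e., the share of views of a page that were the last in a session (stored as a fraction from 0 to 0.2 in this dataset).</p>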
    <p><b>*  PageValues:</b> Average value of a page based on conversion rates, indicating its contribution to revenue generation.</p>
    <p><b>*  SpecialDay:</b> A value between 0 and 1 indicating how close the visit date is to a special day (e.g., a holiday); 0 means the visit was not near a special day.</p>
    <p><b>*  Month:</b> The month during which the visit occurred, represented as a categorical variable.</p>
    <p><b>*  OperatingSystems:</b> The operating system used by the visitor, stored as an integer code (1 to 8 in this dataset).</p>
    <p><b>*  Browser:</b> The web browser used by the visitor, stored as an integer code (1 to 13 in this dataset).</p>
    <p><b>*  Region:</b> Geographic region from which the visitor accessed the site, stored as an integer code (1 to 9 in this dataset).</p>
    <p><b>*  TrafficType:</b> Type of traffic source (e.g., direct, referral), stored as an integer code (1 to 20 in this dataset).</p>
    <p><b>*  VisitorType:</b> Indicates whether the visitor is a new or returning user.</p>
    <p><b>*  Weekend:</b> A binary indicator that denotes if the visit occurred on a weekend.</p>
    <p><b>*  Revenue:</b> A binary outcome variable indicating whether the visit resulted in revenue generation (True for yes, False for no). </p>    
</div>

<div style="text-align: center; font-size: 35px; color: red;">
    <u><b>IMPORTING LIBRARIES</b></u>
</div>

In [147]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.special import boxcox1p
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import PowerTransformer
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib

import warnings
import sys
if not sys.warnoptions:
    warnings.simplefilter("ignore")

<div style="text-align: center; font-size: 36px; color: red;">
    <u><b>LOADING & PREPROCESSING</b></u>
</div>

<div style="text-align: left; font-size: 24px; color: violet;">
    <b>1. LOAD THE DATA</b>
</div>

In [150]:
data = pd.read_csv("online_shoppers_intention.csv")
data.head(10)

,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False
5,0,0.0,0,0.0,19,154.216667,0.015789,0.024561,0.0,0.0,Feb,2,2,1,3,Returning_Visitor,False,False
6,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.4,Feb,2,4,3,3,Returning_Visitor,False,False
7,1,0.0,0,0.0,0,0.0,0.2,0.2,0.0,0.0,Feb,1,2,1,5,Returning_Visitor,True,False
8,0,0.0,0,0.0,2,37.0,0.0,0.1,0.0,0.8,Feb,2,2,2,3,Returning_Visitor,False,False
9,0,0.0,0,0.0,3,738.0,0.0,0.022222,0.0,0.4,Feb,2,4,1,2,Returning_Visitor,False,False


In [151]:
data.shape

(12330, 18)

<div style="text-align: left; font-size: 24px; color: violet;">
    <b>2. DISPLAY FIRST & LAST ROWS</b>
</div>

In [153]:
# DISPLAY FIRST FEW ROWS TO UNDERSTAND THE STRUCTURE OF THE DATA
print(data.head())

   Administrative  Administrative_Duration  Informational  \
0               0                      0.0              0   
1               0                      0.0              0   
2               0                      0.0              0   
3               0                      0.0              0   
4               0                      0.0              0   

   Informational_Duration  ProductRelated  ProductRelated_Duration  \
0                     0.0               1                 0.000000   
1                     0.0               2                64.000000   
2                     0.0               1                 0.000000   
3                     0.0               2                 2.666667   
4                     0.0              10               627.500000   

   BounceRates  ExitRates  PageValues  SpecialDay Month  OperatingSystems  \
0         0.20       0.20         0.0         0.0   Feb                 1   
1         0.00       0.10         0.0         0.0   Feb   

In [154]:
# DISPLAY LAST FEW ROWS TO UNDERSTAND THE STRUCTURE OF THE DATA
print(data.tail())

       Administrative  Administrative_Duration  Informational  \
12325               3                    145.0              0   
12326               0                      0.0              0   
12327               0                      0.0              0   
12328               4                     75.0              0   
12329               0                      0.0              0   

       Informational_Duration  ProductRelated  ProductRelated_Duration  \
12325                     0.0              53              1783.791667   
12326                     0.0               5               465.750000   
12327                     0.0               6               184.250000   
12328                     0.0              15               346.000000   
12329                     0.0               3                21.250000   

       BounceRates  ExitRates  PageValues  SpecialDay Month  OperatingSystems  \
12325     0.007143   0.029031   12.241717         0.0   Dec                 4   
12

<div style="text-align: left; font-size: 24px; color: violet;">
    <b>3. DATATYPE OF EACH COLUMN</b>
</div>

In [156]:
# DISPLAY DATA TYPE OF EACH COLUMN
print("Dataset Info:")
data.info()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  object 
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  TrafficT

<div style="text-align: left; font-size: 24px; color: violet;">
    <b>4. STATISTICAL SUMMARY OF DATA</b>
</div>

In [158]:
# DISPLAY STATISTICAL SUMMARY 
print("Statistical Summary:")
data.describe()

Statistical Summary:


,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,OperatingSystems,Browser,Region,TrafficType
count,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0
mean,2.315166,80.818611,0.503569,34.472398,31.731468,1194.74622,0.022191,0.043073,5.889258,0.061427,2.124006,2.357097,3.147364,4.069586
std,3.321784,176.779107,1.270156,140.749294,44.475503,1913.669288,0.048488,0.048597,18.568437,0.198917,0.911325,1.717277,2.401591,4.025169
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,0.0,7.0,184.1375,0.0,0.014286,0.0,0.0,2.0,2.0,1.0,2.0
50%,1.0,7.5,0.0,0.0,18.0,598.936905,0.003112,0.025156,0.0,0.0,2.0,2.0,3.0,2.0
75%,4.0,93.25625,0.0,0.0,38.0,1464.157214,0.016813,0.05,0.0,0.0,3.0,2.0,4.0,4.0
max,27.0,3398.75,24.0,2549.375,705.0,63973.52223,0.2,0.2,361.763742,1.0,8.0,13.0,9.0,20.0


<div style="text-align: left; font-size: 24px; color: violet;">
    <b>5. DISPLAY ALL COLUMN NAMES</b>
</div>

In [160]:
# DISPLAY ALL COLUMN NAMES
print("Columns of the dataset:")
data.columns

Columns of the dataset:


Index(['Administrative', 'Administrative_Duration', 'Informational',
       'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
       'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'Month',
       'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType',
       'Weekend', 'Revenue'],
      dtype='object')

<div style="text-align: left; font-size: 24px; color: violet;">
    <b>6. NULL / MISSING VALUES IN EACH COLUMN</b>
</div>

In [162]:
# DISPLAY NULL VALUES IN EACH COLUMN
print("Null values in each column:")
print(data.isnull().sum())

Null values in each column:
Administrative             0
Administrative_Duration    0
Informational              0
Informational_Duration     0
ProductRelated             0
ProductRelated_Duration    0
BounceRates                0
ExitRates                  0
PageValues                 0
SpecialDay                 0
Month                      0
OperatingSystems           0
Browser                    0
Region                     0
TrafficType                0
VisitorType                0
Weekend                    0
Revenue                    0
dtype: int64


<div style="text-align: left; font-size: 24px; color: violet;">
    <b>7. DUPLICATE ROWS</b>
</div>

In [164]:
# COUNT ROWS THAT ARE EXACT DUPLICATES OF AN EARLIER ROW
print(f"Total number of duplicate values is {data.duplicated().sum()}")

Total number of duplicate values is 125


In [165]:
data.shape

(12330, 18)

In [166]:
# TO REMOVE DUPLICATES
data.drop_duplicates(inplace=True)

In [167]:
print(f"The shape of the dataset after removing the duplicates is {data.shape}")

The shape of the dataset after removing the duplicates is (12205, 18)
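
<div style="text-align: LEFT; font-size: 20px; color: white;">
    <p>The duplicate check and removal above can be sketched on a small example. The <code>toy</code> DataFrame is hypothetical. The <code>reset_index(drop=True)</code> step is an optional addition that is not part of the cells above; without it, the surviving rows keep their original index labels, which leaves gaps in the index.</p>
</div>

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first row exactly.
toy = pd.DataFrame({
    "ProductRelated": [1, 2, 1, 3],
    "BounceRates": [0.2, 0.0, 0.2, 0.05],
})

# duplicated() flags rows identical to an earlier row (the first occurrence is not flagged).
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset_index renumbers the rows from 0.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Duplicates found: {n_duplicates}, shape after removal: {deduped.shape}")
```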


<div style="text-align: left; font-size: 24px; color: violet;">
    <b>8. UNIQUE VALUES IN EACH COLUMN AND THEIR COUNT</b>
</div>

In [169]:
for column in data.columns:
    unique_values = data[column].unique()  # Get unique values in the column
    unique_count = len(unique_values)  # Get the count of unique values
    print(f"COLUMN: {column}")
    print(f"UNIQUE VALUES: {unique_values}")
    print(f"COUNT OF UNIQUE VALUES: {unique_count}")
    print("\n")

COLUMN: Administrative
UNIQUE VALUES: [ 0  1  2  4 12  3 10  6  5  9  8 16 13 11  7 18 14 17 19 15 24 22 21 20
 23 27 26]
COUNT OF UNIQUE VALUES: 27


COLUMN: Administrative_Duration
UNIQUE VALUES: [  0.         53.         64.6       ... 167.9107143 305.125
 150.3571429]
COUNT OF UNIQUE VALUES: 3335


COLUMN: Informational
UNIQUE VALUES: [ 0  1  2  4 16  5  3 14  6 12  7  9 10  8 11 24 13]
COUNT OF UNIQUE VALUES: 17


COLUMN: Informational_Duration
UNIQUE VALUES: [  0.   120.    16.   ... 547.75 368.25 211.25]
COUNT OF UNIQUE VALUES: 1258


COLUMN: ProductRelated
UNIQUE VALUES: [  1   2  10  19   0   3  16   7   6  23  13  20   8   5  32   4  45  14
  52   9  46  15  22  11  12  36  42  27  90  18  38  17 128  25  30  21
  51  26  28  31  24  50  96  49  68  98  67  55  35  37  29  34  71  63
  87  40  33  54  64  75  39 111  81  61  47  44  88 149  41  79  66  43
 258  80  62  83 173  48  58  57  56  69  82  59 109 287  53  84  78 137
 113  89  65  60 104 129  77  74  93  76  72 194 
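
<div style="text-align: LEFT; font-size: 20px; color: white;">
    <p>When only the counts are needed, a more compact alternative to the loop above is <code>DataFrame.nunique()</code>, which returns the number of distinct values per column. The sketch below uses a hypothetical <code>toy</code> DataFrame rather than the real dataset.</p>
</div>

```python
import pandas as pd

# Hypothetical toy data with a few of the dataset's column names.
toy = pd.DataFrame({
    "Month": ["Feb", "Feb", "Mar", "May"],
    "Weekend": [False, True, False, False],
    "Informational": [0, 1, 0, 2],
})

# Number of distinct values per column, largest first.
unique_counts = toy.nunique().sort_values(ascending=False)
print(unique_counts)
```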

<div style="text-align: left; font-size: 24px; color: violet;">
    <b>9. MAKE COPY OF DATASET</b>
</div>

In [171]:
data_copy = data.copy()
data_copy.head()

,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False


<div style="text-align: left; font-size: 24px; color: violet;">
    <b>10. EXTRACT NUMERICAL COLUMNS</b>
</div>

In [173]:
num_data = data.select_dtypes(include="number")
num_data

,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,OperatingSystems,Browser,Region,TrafficType
0,0,0.0,0,0.0,1,0.000000,0.200000,0.200000,0.000000,0.0,1,1,1,1
1,0,0.0,0,0.0,2,64.000000,0.000000,0.100000,0.000000,0.0,2,2,1,2
2,0,0.0,0,0.0,1,0.000000,0.200000,0.200000,0.000000,0.0,4,1,9,3
3,0,0.0,0,0.0,2,2.666667,0.050000,0.140000,0.000000,0.0,3,2,2,4
4,0,0.0,0,0.0,10,627.500000,0.020000,0.050000,0.000000,0.0,3,3,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12325,3,145.0,0,0.0,53,1783.791667,0.007143,0.029031,12.241717,0.0,4,6,1,1
12326,0,0.0,0,0.0,5,465.750000,0.000000,0.021333,0.000000,0.0,3,2,1,8
12327,0,0.0,0,0.0,6,184.250000,0.083333,0.086667,0.000000,0.0,3,2,1,13
12328,4,75.0,0,0.0,15,346.000000,0.000000,0.021053,0.000000,0.0,2,2,3,11


In [174]:
num_data = data.select_dtypes(include='number')
numeric_columns=list(num_data)
numeric_columns

['Administrative',
 'Administrative_Duration',
 'Informational',
 'Informational_Duration',
 'ProductRelated',
 'ProductRelated_Duration',
 'BounceRates',
 'ExitRates',
 'PageValues',
 'SpecialDay',
 'OperatingSystems',
 'Browser',
 'Region',
 'TrafficType']
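
<div style="text-align: LEFT; font-size: 20px; color: white;">
    <p>Note that <code>select_dtypes(include="number")</code> also picks up OperatingSystems, Browser, Region, and TrafficType, which are stored as integers but describe categories rather than quantities. One possible way to keep them apart is sketched below on a hypothetical <code>toy</code> DataFrame. The names <code>coded_categoricals</code> and <code>continuous_columns</code> are illustrative, and treating these four columns as categorical is an assumption based on the feature descriptions above.</p>
</div>

```python
import pandas as pd

# Hypothetical toy data with a subset of the dataset's columns.
toy = pd.DataFrame({
    "Administrative": [0, 1, 3],
    "BounceRates": [0.2, 0.0, 0.05],
    "OperatingSystems": [1, 2, 4],
    "Browser": [1, 2, 1],
    "Month": ["Feb", "Feb", "Mar"],
})

# Assumption: these integer columns are category codes, not measurements.
coded_categoricals = ["OperatingSystems", "Browser", "Region", "TrafficType"]

# All columns with a numeric dtype, as in the cell above.
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

# Numeric columns that remain after setting the coded categoricals aside.
continuous_columns = [c for c in numeric_columns if c not in coded_categoricals]
print(continuous_columns)
```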