## Project: Predicting the Online Shoppers Purchasing Intention.

##### Prepared by: Md Rahat Mahmud

#### Part 1: Learning and Splitting the data-set based on online shoppers purchasing intention.

Here we will apply machine learning tools and libraries such as pandas, matplotlib and scikit-learn to learn from the online shoppers' purchasing intention recorded in the dataset.


###### Loading necessary libraries:

In [3]:
import sys
print("Python version: {}".format(sys.version))
import pandas as pd
print("pandas version: {}".format(pd.__version__))
import matplotlib
print("matplotlib version: {}".format(matplotlib.__version__))
import numpy as np
print("NumPy version: {}".format(np.__version__))
import scipy as sp
print("SciPy version: {}".format(sp.__version__))
import IPython
print("IPython version: {}".format(IPython.__version__))
import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))
import mglearn
import os


Python version: 3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
pandas version: 0.25.1
matplotlib version: 3.1.1
NumPy version: 1.16.5
SciPy version: 1.3.1
IPython version: 7.8.0
scikit-learn version: 0.21.3




###### Loading the data-set and printing the attributes and the first 5 rows.

In [5]:
intn_dataset = pd.read_csv(r"http://archive.ics.uci.edu/ml/machine-learning-databases/00468/online_shoppers_intention.csv")
intn_dataset.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False


Before starting the analysis with a categorical attribute, it is very important to check whether it has only two categories or more. Here, we see that the Revenue column has only two categories (False and True). That means it is ready to be used as the target for the analysis once the other categorical columns have been one-hot encoded.

###### The One-Hot-Encoding or one-out-of-N encoding by using dummy variables:

In [6]:
print(intn_dataset.Revenue.value_counts())

False    10422
True      1908
Name: Revenue, dtype: int64
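
The class counts above are quite unbalanced, which matters when reading the accuracy figures later on. As a rough reference point, the sketch below computes the accuracy of a model that always predicts the majority class. The counts are copied from the output above; the variable names are only illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {False: 10422, True: 1908}

total = sum(counts.values())

# Accuracy obtained by always predicting the most frequent class.
majority_share = max(counts.values()) / total

print("Sessions: {}".format(total))
print("Majority-class baseline accuracy: {:.3f}".format(majority_share))
```

Any model accuracy is best compared against this baseline of roughly 0.845 rather than against 0.5.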


In [7]:
print("Keys of intn_dataset: \n{}".format(intn_dataset.keys()))

Keys of intn_dataset: 
Index(['Administrative', 'Administrative_Duration', 'Informational',
       'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
       'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'Month',
       'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType',
       'Weekend', 'Revenue'],
      dtype='object')


Here among the attributes, "Administrative", "Administrative Duration", "Informational", "Informational Duration", "Product Related" and "Product Related Duration" represent the number of different types of pages visited by the visitor in that session and total time spent in each of these page categories.

The "Bounce Rate", "Exit Rate" and "Page Value" features represent the metrics measured by "Google Analytics" for each page in the e-commerce site. The value of the "Bounce Rate" feature for a web page refers to the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session. The value of the "Exit Rate" feature for a specific web page is the percentage of all pageviews of that page that were the last in the session.

The "Page Value" feature represents the average value for a web page that a user visited before completing an e-commerce transaction. The "Special Day" feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother’s Day, Valentine's Day) in which the sessions are more likely to be finalized with a transaction.

The value of this attribute is determined by considering the dynamics of e-commerce such as the duration between the order date and delivery date. For example, for Valentine's Day, this value takes a nonzero value between February 2 and February 12, zero before and after this date unless it is close to another special day, and its maximum value of 1 on February 8.

The dataset also includes operating system, browser, region, traffic type, visitor type as returning or new visitor, a Boolean value indicating whether the date of the visit is weekend, and month of the year.
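
The bounce rate and exit rate definitions above can be written out as simple ratios. This is a minimal sketch with made-up counts; the function names and the numbers are assumptions for illustration and are not taken from the dataset.

```python
def bounce_rate(single_page_sessions, entrances):
    """Share of sessions that entered on the page and then left
    without triggering any further request."""
    return single_page_sessions / entrances if entrances else 0.0


def exit_rate(exits, pageviews):
    """Share of all pageviews of the page that were the last in a session."""
    return exits / pageviews if pageviews else 0.0


# Hypothetical counts for one page.
print(bounce_rate(single_page_sessions=20, entrances=100))  # 0.2
print(exit_rate(exits=50, pageviews=400))                   # 0.125
```

The dataset stores these rates as fractions between 0 and 1, as in the `BounceRates` and `ExitRates` columns shown earlier.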

##### Here we apply the get_dummies function that automatically transforms all columns that have object type (like strings) or are categorical

In [8]:
print("Original features:\n", list(intn_dataset.columns), "\n")
data_dummies = pd.get_dummies(intn_dataset)
print("Features after get_dummies:\n", list(data_dummies.columns))

Original features:
 ['Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'Month', 'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType', 'Weekend', 'Revenue'] 

Features after get_dummies:
 ['Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'Weekend', 'Revenue', 'Month_Aug', 'Month_Dec', 'Month_Feb', 'Month_Jul', 'Month_June', 'Month_Mar', 'Month_May', 'Month_Nov', 'Month_Oct', 'Month_Sep', 'VisitorType_New_Visitor', 'VisitorType_Other', 'VisitorType_Returning_Visitor']


###### Here we see that the categorical features were expanded into one new feature for each possible value.

In [9]:
data_dummies.head()


Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,...,Month_Jul,Month_June,Month_Mar,Month_May,Month_Nov,Month_Oct,Month_Sep,VisitorType_New_Visitor,VisitorType_Other,VisitorType_Returning_Visitor
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
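
To make the expansion easier to follow, here is a small sketch of `get_dummies` on a made-up three-row frame. The frame `toy` and its values are assumptions for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Tiny hypothetical frame with one numeric and two string columns.
toy = pd.DataFrame({
    "PageValues": [0.0, 12.5, 0.0],
    "Month": ["Feb", "Mar", "Feb"],
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor"],
})

# String columns are replaced by one indicator column per distinct value;
# the numeric column is left untouched.
toy_dummies = pd.get_dummies(toy)
print(list(toy_dummies.columns))
```

Each original string column disappears and is replaced by columns named `<column>_<value>`, which is the same pattern seen in `data_dummies`.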


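###### Now we will extract the feature columns from data_dummies, leaving out the target, the "Revenue" attribute.

In [10]:
# Label-based slice from 'Administrative' through 'Weekend' (inclusive).
# Note: the slice stops before 'Revenue' and before the Month_* and
# VisitorType_* dummy columns, so X holds 15 of the 28 non-target columns.
features = data_dummies.loc[:, 'Administrative':'Weekend']
# Extracting NumPy arrays
X = features.values
y = data_dummies['Revenue'].values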
print("X.shape: {} y.shape: {}".format(X.shape, y.shape))

X.shape: (12330, 15) y.shape: (12330,)
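
The shape above shows 15 feature columns, because the slice `'Administrative':'Weekend'` ends before the dummy columns that `get_dummies` appended. If the one-hot encoded columns are meant to be used as features too, one option is to drop only the target. The sketch below shows the difference on a made-up frame; `toy` and its values are assumptions, and the accuracy figures in this notebook were not produced this way.

```python
import pandas as pd

# Hypothetical frame that mimics the column order of the real data.
toy = pd.DataFrame({
    "Administrative": [0, 1, 2],
    "Weekend": [False, True, False],
    "Revenue": [False, False, True],
    "Month": ["Feb", "Mar", "Feb"],
})
toy_dummies = pd.get_dummies(toy)

# Label slice: stops at 'Weekend', so the Month_* columns are left out.
sliced = toy_dummies.loc[:, "Administrative":"Weekend"]

# Dropping only the target keeps every other column, dummies included.
all_features = toy_dummies.drop(columns="Revenue")

print(list(sliced.columns))
print(list(all_features.columns))
```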


In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
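
With its default settings, `train_test_split` holds out 25% of the rows as the test set. Since the Revenue classes are unbalanced, passing `stratify` is one way to keep the class proportions similar in both parts. The sketch below uses synthetic data; `X_demo` and `y_demo` are assumptions for illustration, and the notebook itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 15% positive class.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = np.array([False] * 170 + [True] * 30)

# Default test_size is 0.25; stratify keeps the class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo)

print("Train shape: {}  Test shape: {}".format(X_tr.shape, X_te.shape))
print("Positive share, train: {:.2f}  test: {:.2f}".format(
    y_tr.mean(), y_te.mean()))
```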

In [12]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(random_state=42)
mlp.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(mlp.score(X_test, y_test)))
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Accuracy on training set: 0.89
Accuracy on test set: 0.87


###### Now the prepared dataset is ready to be analyzed with scikit-learn. The MLP above gives a first result, but we will try to improve on it by using other models.

## LogisticRegression Method to find the test set accuracy:

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(logreg.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(logreg.score(X_test, y_test)))



Accuracy on training set: 0.89
Accuracy on test set: 0.87


## KNN Method to find the test set accuracy:

In [13]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(knn.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(knn.score(X_test, y_test)))

Accuracy on training set: 0.91
Accuracy on test set: 0.85


We see that the test set accuracy is about 0.85. That means the predictions from this model are correct for about 85% of the sessions in the test set. Under the assumption that new sessions resemble the test data, we can expect the model to be correct about 85% of the time when predicting online shoppers' intentions.

In [15]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(tree.score(X_test, y_test)))

Accuracy on training set: 0.90
Accuracy on test set: 0.88


In [16]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(random_state=42)
mlp.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(mlp.score(X_test, y_test)))

Accuracy on training set: 0.89
Accuracy on test set: 0.87


The accuracy of the MLP is quite good, but it does not beat the decision tree. Neural networks expect all input features to vary in a similar way, and ideally to have a mean of 0 and a variance of 1. To bring the features onto a similar scale, we must rescale the data.

##### Using MinMaxScaler to rescale the data to improve accuracy:

In [17]:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
mlp = MLPClassifier(random_state=0)
mlp.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(
mlp.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test_scaled, y_test)))

Accuracy on training set: 0.908
Accuracy on test set: 0.886
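
The cell above uses `MinMaxScaler`, which maps each feature to the range 0 to 1. The mean of 0 and variance of 1 mentioned earlier correspond to `StandardScaler` instead. The sketch below contrasts the two on a small made-up array; `X_demo` is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two hypothetical features on very different scales.
X_demo = np.array([[0.0, 0.20],
                   [64.0, 0.00],
                   [627.5, 0.02],
                   [2.7, 0.05]])

# MinMaxScaler: each column is shifted and scaled to lie in [0, 1].
X_minmax = MinMaxScaler().fit_transform(X_demo)

# StandardScaler: each column gets mean 0 and unit variance.
X_std = StandardScaler().fit_transform(X_demo)

print("MinMax min/max:", X_minmax.min(axis=0), X_minmax.max(axis=0))
print("Standard mean/std:", X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```

In both cases the scaler should be fit on the training data only and then applied to the test data, as the cell above does.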




## Applying the SVM:

In [18]:
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(
svc.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test, y_test)))



Accuracy on training set: 0.997
Accuracy on test set: 0.830


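With the default settings, the accuracy on the training and test sets is 0.997 and 0.830 respectively, so the model overfits the training data substantially. Note that the SVM here is fit on the unscaled data; SVMs are sensitive to the scale of the features, so fitting on X_train_scaled would be worth trying. We can also try increasing either C or gamma to fit a more complex model: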

In [19]:
svc = SVC(C=1000)
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(
svc.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test, y_test)))



Accuracy on training set: 1.000
Accuracy on test set: 0.828


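Now we see that the training set accuracy has increased to 1.000, while the test set accuracy has dropped slightly from 0.830 to 0.828. Increasing C made the model fit the training data perfectly but did not improve generalization.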
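
The SVM cells above are fit on `X_train` without rescaling. A minimal sketch of combining the scaler and the SVM in one pipeline is shown below, so that scaling is always learned from the training part only. It uses synthetic data from `make_classification`; the data, the parameter values and the variable names are assumptions, and no claim is made about the accuracy this would reach on the shoppers dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one feature is inflated to mimic mixed scales.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_demo[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The pipeline fits the scaler on the training data, then fits the SVM
# on the scaled training data; score() applies the same scaling first.
svc_scaled = make_pipeline(MinMaxScaler(), SVC(C=1.0, gamma="auto"))
svc_scaled.fit(X_tr, y_tr)

train_acc = svc_scaled.score(X_tr, y_tr)
test_acc = svc_scaled.score(X_te, y_te)
print("Accuracy on training set: {:.3f}".format(train_acc))
print("Accuracy on test set: {:.3f}".format(test_acc))
```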

### Opinion:

After analyzing the models, we have seen that for the KNN model, the test set accuracy is 0.85. For the logistic regression model, the test set accuracy is 0.87. For the decision tree model, the test set accuracy is 0.88. For the NNET model, the test set accuracy is 0.87 on the unscaled data and 0.886 after MinMax scaling, and for the SVM model, the test set accuracy is 0.83. So, we see that the NNET model trained on the scaled data is the best model for the dataset, as it gives the highest test set accuracy, narrowly ahead of the decision tree.
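
The comparison can also be laid out as a small ranking. The figures below are copied from the cell outputs in this notebook (some are reported to two decimals and some to three, so close values should be read with care); the dictionary and its labels are only illustrative.

```python
# Test set accuracies as printed in the cells above.
reported = {
    "KNN (n_neighbors=2)": 0.85,
    "LogisticRegression": 0.87,
    "DecisionTree (max_depth=4)": 0.88,
    "MLP (unscaled)": 0.87,
    "MLP (MinMax-scaled)": 0.886,
    "SVM (default)": 0.830,
    "SVM (C=1000)": 0.828,
}

# Print models from highest to lowest reported test accuracy.
for name, acc in sorted(reported.items(), key=lambda kv: kv[1], reverse=True):
    print("{:<28} {:.3f}".format(name, acc))

best = max(reported, key=reported.get)
print("Highest reported test accuracy:", best)
```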

                                            __THE END__