## TIMELINE OF TASKS FOR THE PROJECT

* Function to read the data
* Data Cleaning
* Exploratory Data Analysis
* Data Segmentation and Model Training
* Compare the performance of different models (KNN, Random Forest, XGBoost)
* Evaluation of the Models (Specificity, Sensitivity)
* Conclusion / Insights

### Dirtying the dataset

* Convert one int column to object.
* Choose a binary column and set 5% of its values to null, targeting only rows that hold the column's mode. To fix this, we will fill the nulls with the mode of the column.
* Choose a column with True/False values and replace some of the False values with the string "Fals".
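
The dirtying steps listed above could be sketched roughly as follows. This is a minimal illustration on a toy frame, not the script that produced `shoppingdata.csv`: the helper name `dirty_dataset`, its parameters, the fractions, and the seed are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def dirty_dataset(df, int_col, null_col, typo_col,
                  null_frac=0.05, typo_frac=0.10, seed=0):
    """Return a deliberately corrupted copy of df (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    dirty = df.copy()

    # 1. Store an integer column as generic Python objects.
    dirty[int_col] = dirty[int_col].astype(object)

    # 2. Null out null_frac of all rows, drawing only from rows that hold
    #    the mode, so that mode imputation can restore the original values.
    mode_value = dirty[null_col].mode().iloc[0]
    mode_rows = dirty.index[dirty[null_col] == mode_value].to_numpy()
    n_null = min(int(null_frac * len(dirty)), len(mode_rows))
    null_rows = rng.choice(mode_rows, size=n_null, replace=False)
    dirty[null_col] = dirty[null_col].astype(object)
    dirty.loc[null_rows, null_col] = np.nan

    # 3. Misspell a share of the False labels as the string "Fals".
    false_rows = dirty.index[dirty[typo_col] == False].to_numpy()  # noqa: E712
    n_typo = int(typo_frac * len(false_rows))
    typo_rows = rng.choice(false_rows, size=n_typo, replace=False)
    dirty[typo_col] = dirty[typo_col].astype(object)
    dirty.loc[typo_rows, typo_col] = "Fals"
    return dirty


# Toy data standing in for the real sessions (values are made up).
toy = pd.DataFrame({
    "Region": np.arange(100) % 9 + 1,
    "Weekend": [False] * 80 + [True] * 20,
    "Revenue": [False] * 90 + [True] * 10,
})
dirty_toy = dirty_dataset(toy, "Region", "Weekend", "Revenue")
print(dirty_toy.isnull().sum())
```

Drawing the nulls only from rows that hold the mode is what makes the later mode-based fill a faithful repair, provided the mode is still the most frequent value after the nulls are introduced.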

## Introduction

In this project, we will work with a dataset containing browsing information about internet users: how many pages they've visited, whether they're shopping on a weekend, what web browser they're using, etc. The dataset, obtained from a shopping website, contains about 12,000 user sessions, and each session is described by 18 attributes (columns).

The dataset for this project was provided by [Sakar, C.O., Polat, S.O., Katircioglu, M. et al. Neural Comput & Applic (2018)](https://link.springer.com/article/10.1007/s00521-018-3523-0).

## Information about each column

| **Column** | **Description** |
| :---------- | :---------- |
| Administrative | Number of administrative web pages visited by the user |
| Administrative_Duration | Total amount of time spent on administrative pages |
| Informational | Number of informational web pages visited by the user |
| Informational_Duration | Total amount of time spent on informational pages |
| ProductRelated | Number of product-related web pages visited by the user |
| ProductRelated_Duration | Total amount of time spent on product-related pages |
| BounceRates | A measure from Google Analytics about the page the user visited |
| ExitRates | A measure from Google Analytics about the page the user visited |
| PageValues | A measure from Google Analytics about the page the user visited |
| SpecialDay | A value that measures how close the date of the user's session is to a special day (like Valentine's Day or Mother's Day) |
| Month | An abbreviation of the month in which the user visited |
| OperatingSystems | An integer encoding information about the user's operating system |
| Browser | An integer encoding information about the user's browser |
| Region | An integer representing information about the user's location |
| TrafficType | An integer encoding browsing information about the user |
| VisitorType | Takes the value Returning_Visitor for returning visitors and some other string value for non-returning visitors |
| Weekend | A boolean that states whether the session took place on a weekend |
| Revenue | Indicates whether the user ultimately made a purchase: TRUE if they did, FALSE if they didn't |


## Goal

The goals of this project include:

*   Cleaning the dataset.
*   Analysing the users' information and identifying the relationships that exist between the different variables through exploratory data analysis.
*   Designing a classifier to predict whether any given user plans to make a purchase or not.


## Importing all Relevant Libraries

In [3]:
# Import Libraries
# Matrix, dataframe, and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Model Packages
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression 
from sklearn.preprocessing import MinMaxScaler

## Loading Data

In [4]:
data = pd.read_csv('shoppingdata.csv')
data.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,,Fals
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,,Fals
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,,Fals
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,,Fals
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,Fals


## Data Exploration and Cleaning

Let's explore the dataset to determine which columns require cleaning and to identify the relationships that exist between different columns.

In [5]:
data.describe(include = 'all')

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
count,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330,12330.0,12330.0,12330.0,12330.0,12330,11830,12330
unique,,,,,,,,,,,10,,,,,3,2,3
top,,,,,,,,,,,May,,,,,Returning_Visitor,False,False
freq,,,,,,,,,,,3364,,,,,10551,8962,9076
mean,2.315166,80.818611,0.503569,34.472398,31.731468,1194.74622,0.022191,0.043073,5.889258,0.061427,,2.124006,2.357097,3.147364,4.069586,,,
std,3.321784,176.779107,1.270156,140.749294,44.475503,1913.669288,0.048488,0.048597,18.568437,0.198917,,0.911325,1.717277,2.401591,4.025169,,,
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0,1.0,,,
25%,0.0,0.0,0.0,0.0,7.0,184.1375,0.0,0.014286,0.0,0.0,,2.0,2.0,1.0,2.0,,,
50%,1.0,7.5,0.0,0.0,18.0,598.936905,0.003112,0.025156,0.0,0.0,,2.0,2.0,3.0,2.0,,,
75%,4.0,93.25625,0.0,0.0,38.0,1464.157214,0.016813,0.05,0.0,0.0,,3.0,2.0,4.0,4.0,,,


#### Observations

*   The `Weekend` column has a count of only 11830 entries, unlike the other columns, which have a count of 12330 entries (the total number of rows in the dataset). We will take a closer look to understand why this is the case.
*   The `Revenue` column is supposed to have only two distinct values, True or False, representing whether a visitor made a purchase or not. However, the summary above reports three unique values.
*   All the other columns seem to have their intended behavior. 
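
The two checks behind these observations (missing entries and unexpected labels) can also be run programmatically instead of being read off the summary table. Below is a minimal sketch on a toy frame; the helper name `column_issues` and the set of allowed labels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def column_issues(df, bool_cols):
    """Report missing values and unexpected labels in supposedly boolean columns."""
    allowed = {True, False, "True", "False"}
    rows = []
    for col in bool_cols:
        values = df[col].dropna().unique()
        unexpected = sorted(str(v) for v in values if v not in allowed)
        rows.append({
            "column": col,
            "missing": int(df[col].isnull().sum()),
            "unexpected": unexpected,
        })
    return pd.DataFrame(rows).set_index("column")


# Toy frame mimicking the two problems seen above (values are made up).
toy = pd.DataFrame({
    "Weekend": [True, False, np.nan, False],
    "Revenue": ["True", "False", "Fals", "False"],
})
report = column_issues(toy, ["Weekend", "Revenue"])
print(report)
```

Calling the same helper on `data` with `['Weekend', 'Revenue']` would be one way to confirm the observations before cleaning.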

Let's look at the data type of each column.

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  object 
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  TrafficType           

All the numeric columns have data type int64 or float64, so no further processing needs to be performed on them. Fourteen of the 18 columns (about 78%) are numeric.

According to the dataset's description, the data type of both the `Weekend` and `Revenue` columns should be bool. However, they appear to be of type object.
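
One way to list the non-numeric columns without reading them off the `info()` output is `select_dtypes`. The sketch below uses a small made-up frame, so the column values are assumptions; the same two calls could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame with two numeric and two non-numeric columns (values are made up).
toy = pd.DataFrame({
    "ProductRelated": [1, 2, 10],
    "ExitRates": [0.2, 0.1, 0.05],
    "Month": ["Feb", "Feb", "Mar"],
    "Weekend": [np.nan, False, True],
})

# Split the columns by whether pandas treats them as numeric.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_share = len(numeric_cols) / toy.shape[1]

print(numeric_cols)
print(other_cols)
print(f"{numeric_share:.0%} of the columns are numeric")
```

Any column that is expected to be boolean but shows up in `other_cols` with an object type is a candidate for cleaning.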

Next, let's have a look at the columns that need attention.

In [7]:
data['Revenue'].value_counts()

False    9076
True     1908
Fals     1346
Name: Revenue, dtype: int64

The `Revenue` column contains the string "Fals". Since the `Revenue` column can only contain True or False, we can assume that the string "Fals" occurred as a result of an error while recording the data.

Let's replace every "Fals" value with the boolean False and convert the entire `Revenue` column back to bool.

In [8]:
# Map the misspelled "Fals" and the string labels to real booleans in one pass,
# assigning the result back instead of replacing in place on a column selection.
data['Revenue'] = data['Revenue'].replace({'Fals': False, 'False': False, 'True': True})
data['Revenue'] = data['Revenue'].astype(bool)

data['Revenue'].value_counts()

False    10422
True      1908
Name: Revenue, dtype: int64
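
A caveat worth keeping in mind: on an object column, `astype(bool)` treats every non-empty string as `True`, so the string `'False'` would become `True` if it were not replaced first. A stricter variant, sketched below, maps each known label explicitly and fails loudly on anything it cannot map, including missing values. The helper name `to_strict_bool` and the `LABELS` table are hypothetical, and the toy series is made up.

```python
import pandas as pd

# Every label we are willing to accept, and the boolean it stands for.
LABELS = {True: True, False: False, "True": True, "False": False, "Fals": False}


def to_strict_bool(series):
    """Convert a label column to bool, raising on anything not in LABELS."""
    mapped = series.map(LABELS.get)   # unknown or missing values become None
    bad = series[mapped.isnull()]
    if not bad.empty:
        raise ValueError(f"cannot map values: {bad.unique().tolist()}")
    return mapped.astype(bool)


# Object column of strings, as a CSV column with a stray label would be read.
raw = pd.Series(["False", "True", "Fals", "False"], dtype=object)

naive = raw.astype(bool)      # every non-empty string counts as True
clean = to_strict_bool(raw)

print(naive.tolist())   # [True, True, True, True]
print(clean.tolist())   # [False, True, False, False]
```

The explicit table is more verbose than a chain of replacements, but a new typo such as "Tru" would raise an error instead of being silently turned into `True`.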

In [9]:
data['Weekend'].isnull().sum()

500

500 entries in the `Weekend` column have no data. This represents about 4% of the dataset (500 of 12,330 rows).

We will replace these null values with the mode of the column.

In [10]:
# Fill the missing entries with the most frequent value, assigning the result back
mode = data['Weekend'].mode().iloc[0]
data['Weekend'] = data['Weekend'].fillna(mode)
data['Weekend'].isnull().sum()

0
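
As a quick check of the figures above, the share of missing `Weekend` values and the effect of mode imputation can be reproduced on a small stand-in series. Only the counts 500 and 12330 come from the outputs above; the toy values are assumptions. The final cast to bool is an extra step not performed in the cell above, shown here because the column is expected to be boolean.

```python
import numpy as np
import pandas as pd

# Share of missing Weekend values, using the counts reported above.
missing_share = 500 / 12330
print(f"{missing_share:.2%}")   # 4.06%

# Toy column: False is the most frequent value, two entries are missing.
weekend = pd.Series([False, np.nan, True, False, np.nan, False], dtype=object)

mode_value = weekend.mode().iloc[0]                 # mode() ignores NaN by default
filled = weekend.fillna(mode_value).astype(bool)    # no NaN left, so the cast is safe

print(filled.tolist())
print(filled.dtype)
```

Casting to bool only after the nulls are filled matters, because a remaining NaN would be converted to `True` by `astype(bool)`.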

In [11]:
data.columns.size

18