# Online Shoppers Purchasing Intention Clustering

Isaiah Jenkins

## Import the required libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## 1. About the data

1. a. Description

Throughout this analysis we will explore online shoppers' purchasing habits and online activity to achieve shopper segmentation. Of the 12,330 sessions in the dataset, 84.5% (10,422) were negative class samples that did not end with shopping, and the remaining 15.5% (1,908) were positive class samples ending with shopping.
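As a quick sanity check on the class split quoted above, the shares can be recomputed from the two session counts. This is only an arithmetic sketch; the variable names are illustrative and not part of the notebook.

```python
# Session counts quoted in the dataset description.
total_sessions = 12_330
negative_sessions = 10_422  # sessions that did not end with shopping

# The positive class is whatever remains.
positive_sessions = total_sessions - negative_sessions

negative_share = negative_sessions / total_sessions
positive_share = positive_sessions / total_sessions

print(f"positive sessions: {positive_sessions}")
print(f"negative share: {negative_share:.1%}")
print(f"positive share: {positive_share:.1%}")
```

On the loaded data, the same check could presumably be done with `shoppers['Revenue'].value_counts(normalize=True)`.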

1. b. Attribute dictionary: 17 features plus the Revenue label (18 attributes in total: 10 numerical, 8 categorical)

Administrative, Administrative Duration, Informational, Informational Duration, Product Related, Product Related Duration 
- The number of different types of pages visited by the visitor in that session and total time spent in each of these page categories
  
Bounce Rates, Exit Rates, Page Value
- Metrics measured by "Google Analytics" for each page of the e-commerce site. The "Bounce Rate" of a web page is the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session. The "Exit Rate" of a web page is the percentage of all views of that page that were the last in the session. The "Page Value" feature represents the average value of a web page that a user visited before completing an e-commerce transaction.
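To make the two per-page definitions concrete, here is a minimal sketch that computes bounce and exit rates from a toy session log. The sessions and page names are hypothetical, and the sketch covers only the per-page definitions; how the dataset aggregates these values for each session is not shown here.

```python
from collections import Counter

# Hypothetical toy sessions: each list is the ordered pages viewed in one session.
sessions = [
    ["home"],                     # enters on "home" and leaves: a bounce
    ["home", "product", "cart"],
    ["product"],                  # enters on "product" and leaves: a bounce
    ["home", "product"],
]

entrances = Counter()  # sessions that started on each page
bounces = Counter()    # single-page sessions, keyed by their only page
views = Counter()      # total views of each page
exits = Counter()      # times each page was the last view in a session

for pages in sessions:
    entrances[pages[0]] += 1
    if len(pages) == 1:
        bounces[pages[0]] += 1
    views.update(pages)
    exits[pages[-1]] += 1

# Bounce rate: share of entrances on a page that were single-page sessions.
bounce_rate = {page: bounces[page] / entrances[page] for page in entrances}
# Exit rate: share of all views of a page that were the last view of a session.
exit_rate = {page: exits[page] / views[page] for page in views}

print(bounce_rate)
print(exit_rate)
```

Note that the denominators differ: bounce rate only counts sessions that entered on the page, while exit rate counts every view of the page.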

Special Day
- Indicates the closeness of the site visiting time to a specific special day (e.g. Mother's Day, Valentine's Day) on which sessions are more likely to be finalized with a transaction. The value of this attribute is determined by considering the dynamics of e-commerce, such as the duration between the order date and the delivery date.

Month, Operating Systems, Browser, Region, Traffic Type
- General information about the session

Visitor Type 
- Whether the visitor is returning or new

Weekend
- Boolean value indicating whether the date of the visit falls on a weekend
  
Revenue
- Whether the shopper generated revenue for the company

## 2. Objectives

Throughout this analysis we will explore and build different clustering models to segment shoppers who did and did not buy products. One possible problem is class imbalance: 84.5% of the sessions are negative class samples that did not end with shopping, so the much smaller group of purchasing sessions may be hard to separate into its own clusters. However, with sophisticated clustering models we should be able to obtain proper segmentation.
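The imbalance concern can be illustrated with cluster purity, a metric chosen here only for illustration (the notebook may evaluate its models differently). The sketch below assumes the class counts from the dataset description and uses a hypothetical helper `purity`; it shows that a useless clustering that puts every session in one cluster already scores about 84.5%, so scores need to be read against that baseline.

```python
from collections import Counter

def purity(cluster_labels, true_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    by_cluster = {}
    for cluster, label in zip(cluster_labels, true_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(true_labels)

# Class counts from the dataset description: 10,422 non-purchases, 1,908 purchases.
revenue = [False] * 10_422 + [True] * 1_908

# A degenerate "clustering" that assigns every session to the same cluster.
single_cluster = [0] * len(revenue)

# An idealized clustering that separates the two classes perfectly.
perfect_clusters = [int(r) for r in revenue]

print(f"one big cluster: {purity(single_cluster, revenue):.3f}")
print(f"perfect split:   {purity(perfect_clusters, revenue):.3f}")
```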

## 3. Data Exploration, Cleaning and Feature Engineering

The Plan:

1. Explore the dataset's structure, including the number of rows, features, data types and any missing values.
2. Perform basic statistical analysis such as calculating descriptive statistics (mean, median) for numerical features and value counts for categorical features.
3. Create data visualizations such as box plots to visually inspect the distributions and relationships between variables to identify potential issues like outliers.
4. Engineer relevant features so the models can effectively segment shoppers who did and did not make a purchase.
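The four steps of the plan could be sketched roughly as follows. This is a minimal illustration on a hypothetical toy frame that reuses a few column names from the dataset; the helpers `summarize` and `iqr_outliers` are assumptions for this sketch, and the 1.5 × IQR rule and one-hot encoding are only one possible choice for steps 3 and 4.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Collect the structural and statistical summaries from steps 1 and 2."""
    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")
    return {
        "shape": df.shape,
        "dtypes": df.dtypes,
        "missing": df.isna().sum(),
        "numeric_stats": numeric.agg(["mean", "median"]),
        "value_counts": {col: df[col].value_counts() for col in categorical.columns},
    }

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR], the box plot whisker rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical toy data, not taken from the real file.
toy = pd.DataFrame({
    "ProductRelated": [1, 2, 1, 2, 10, 300],
    "ExitRates": [0.2, 0.1, 0.2, 0.14, 0.05, 0.01],
    "VisitorType": ["Returning_Visitor"] * 5 + ["New_Visitor"],
    "Weekend": [False, False, False, False, True, False],
})

# Steps 1 and 2: structure, missing values, descriptive statistics, value counts.
summary = summarize(toy)
print(summary["shape"])
print(summary["numeric_stats"])

# Step 3 (numeric counterpart of a box plot): flag potential outliers.
outlier_mask = iqr_outliers(toy["ProductRelated"])
print(toy[outlier_mask])

# Step 4: one-hot encode a categorical column so distance-based models can use it.
encoded = pd.get_dummies(toy, columns=["VisitorType"])
print(encoded.columns.tolist())
```

On the real data, the same helpers would be called on `shoppers` once it has been loaded.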

### Load in data

In [2]:
shoppers = pd.read_csv('data/online_shoppers_intention.csv')

In [3]:
shoppers.head()

| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.0 | 0.2 | 0.2 | 0.0 | 0.0 | Feb | 1 | 1 | 1 | 1 | Returning_Visitor | False | False |
| 1 | 0 | 0.0 | 0 | 0.0 | 2 | 64.0 | 0.0 | 0.1 | 0.0 | 0.0 | Feb | 2 | 2 | 1 | 2 | Returning_Visitor | False | False |
| 2 | 0 | 0.0 | 0 | 0.0 | 1 | 0.0 | 0.2 | 0.2 | 0.0 | 0.0 | Feb | 4 | 1 | 9 | 3 | Returning_Visitor | False | False |
| 3 | 0 | 0.0 | 0 | 0.0 | 2 | 2.666667 | 0.05 | 0.14 | 0.0 | 0.0 | Feb | 3 | 2 | 2 | 4 | Returning_Visitor | False | False |
| 4 | 0 | 0.0 | 0 | 0.0 | 10 | 627.5 | 0.02 | 0.05 | 0.0 | 0.0 | Feb | 3 | 3 | 1 | 4 | Returning_Visitor | True | False |


In [4]:
shoppers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  object 
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  TrafficType           

## 4. Cluster Models

### Summary of Models

## 5. Insights and Key Findings

## 6. Next Steps