# <center>Adding Value & Finding Insights From Online Shopper Data</center>

<h1><center>Project Report</center></h1>

<center>Dylan Meyer, Peter Groglio, Archana Agarwal</center>

[Associated Presentation](https://drive.google.com/file/d/1ebdXVQaO21qbGAKmwSZiW0VQ7YOOsV--/view?usp=sharing) 

# Business Understanding:
A dataset has been gathered for the purpose of predicting user intent to purchase, consisting of session level data as well as the aggregation of webpage statistics for each session. 

To best serve our clients, we have decided to use this data to provide the marketing team with targeted assessments of individual clients, as well as user segmentation for groups of clients. This will allow for servicing existing customers in a comprehensive manner as well as prospecting for valuable future clients.


## Business Objectives:
- Establish the ability to determine if users will have a propensity to purchase our product 
- Segment user groups based on their web history on our site for more targeted client interactions
- Explore further insights around user actions and experience


## Data Mining Goals:
- Train a model or multiple models to reasonably predict whether or not a customer will purchase anything from our site.
- Engineer new features from the existing dataset which will have an impact on the model's predictive power
- Recommend additional information to track for more accurate modeling or make our client segmentation more useful
- Explore a variety of models and their efficacy in accuracy, predictive power, and consistency


# Data Understanding:

## Data Assumptions:
All web page stats are treated as the average value of all webpages seen in the user’s session
## Data Dictionary:
A [Data Dictionary](https://docs.google.com/spreadsheets/d/1cvWTEruAo16xvobsGKet90kCKNbkX8sRvzReJT0bCbA/edit?usp=sharing) was compiled to express our collective understanding of each field present in the dataset

## Descriptive Statistics:
- The initial data is composed of 12,330 rows, and 18 columns
- Most of the fields appear to be heavily imbalanced, so binning will likely be required prior to predictions.
    - Similarly, the target variable is also heavily imbalanced. We plan to use SMOTE to address this imbalance.
- Initial analysis has not found any obvious null values. 
    - Imputation may not be necessary. 
- Average time spent per page:
    - admin: 37.95
    - info: 69.40
    - product: 37.75
- Overlap between page types:
    - 12,292 w/ any product related (99.69%)
    - 2,631 w/ any informational (21.34%)
    - 6,562 w/ any admin (53.22%)
    - 2,168 w/ Informational and Admin (17.58%) - a large overlap given that only 21.34% of sessions viewed informational pages; users may cluster here
    - 6,537 w/ Admin + Product (53.02%) - almost all sessions with admin pages also viewed product pages
    - 2,623 w/ Informational and Product (21.27%) - similarly, almost all sessions with informational pages also viewed product pages
    - 2,167 sessions with all 3 types (17.58%)
    - This shows that almost all sessions included product pages, even those which focused on info/admin
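
The overlap counts above can be reproduced with a short pandas sketch. It assumes that a session "has" a page type when the corresponding page count column is greater than zero; the report does not say whether counts or durations were used, so treat that rule and the function name `page_type_overlap` as illustrative.

```python
import pandas as pd

# Page-count columns as named in the dataset
PAGE_TYPES = ["Administrative", "Informational", "ProductRelated"]


def page_type_overlap(sessions: pd.DataFrame) -> pd.DataFrame:
    """Count sessions that viewed each page type and each combination of types."""
    # Assumption: a page type counts as "viewed" when its page count is > 0
    visited = sessions[PAGE_TYPES] > 0
    combos = {
        "any product": ["ProductRelated"],
        "any informational": ["Informational"],
        "any admin": ["Administrative"],
        "informational + admin": ["Informational", "Administrative"],
        "admin + product": ["Administrative", "ProductRelated"],
        "informational + product": ["Informational", "ProductRelated"],
        "all three": PAGE_TYPES,
    }
    rows = []
    for label, cols in combos.items():
        # A session matches a combination when every listed type was viewed
        count = int(visited[cols].all(axis=1).sum())
        rows.append(
            {
                "combination": label,
                "sessions": count,
                "share_pct": round(100 * count / len(sessions), 2),
            }
        )
    return pd.DataFrame(rows)
```

Calling `page_type_overlap(df)` on the raw session table returns one row per combination with its session count and percentage share.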


## Data Exploration:
- 720 rows have sessions with no time allocated towards any webpage type, accounting for ~6% of data
    - Of these, 99% do not have a purchase. A variable “no_time” will be created to account for this.
- A combination of bounce_rate and exit_rate seems to create a distinct separation in the target variable’s classes
- <img src="bounce_exit_plot.png" alt="Drawing" style="width: 350px;"/>
- A new field, “Bounce_exit”, will be created as the linear combination Bounce_rate + (3 * exit_rate)
- No sessions are recorded for January or April, which suggests that the dataset is incomplete for those months.
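
A minimal sketch of the two engineered fields described above, using the dataset's column names. The function name `add_engineered_features` is hypothetical, and the sketch assumes "no time allocated" means all three duration columns equal zero.

```python
import pandas as pd

DURATION_COLS = [
    "Administrative_Duration",
    "Informational_Duration",
    "ProductRelated_Duration",
]


def add_engineered_features(sessions: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the session table with no_time and bounce_exit added."""
    out = sessions.copy()
    # no_time: 1 when the session recorded zero duration on every page type
    # (assumption: "no time allocated" means all three durations are exactly 0)
    out["no_time"] = out[DURATION_COLS].eq(0).all(axis=1).astype(int)
    # bounce_exit: the linear combination BounceRates + 3 * ExitRates
    out["bounce_exit"] = out["BounceRates"] + 3 * out["ExitRates"]
    return out
```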


# Data Preparation:
## Data Set & Select Data:
### Separate Categorical and Continuous Values:
- Continuous
    - Administrative
    - Administrative_Duration
    - Informational
    - Informational_Duration
    - ProductRelated
    - ProductRelated_Duration
    - BounceRates
    - ExitRates
    - PageValues
    - SpecialDay
- Categorical
    - Month
    - OperatingSystems
    - Browser
    - Region
    - TrafficType
    - VisitorType
    - Weekend
    - Revenue


## Clean, Construct, Integrate, Format Data:
## Pipeline Options:
- [__Pipeline A__](https://github.com/fairfield-university-ba545/project2-archana-s-team/blob/master/pete/Pipeline_A.ipynb)
    - Normalization
        - Try various transformation methods for normalizing, including the Yeo-Johnson method.
        - Yeo-Johnson is an alternative to the Box-Cox power transform technique
        - It estimates the optimal value of lambda
        - Yeo-Johnson is similar to Box-Cox, but it also allows for the transformation of non-positive data.
    - ZScore
        - Find the Z-score values for the columns
        - A Z-score measures how far a value lies from the mean, in standard deviations
        - Outlier detection w/ std dev.
        - Find outliers by detecting values that lie more than 3 standard deviations from the mean
        - Write code to push these outliers back into the data set to standardize them
- [__Pipeline B__](https://github.com/fairfield-university-ba545/project2-archana-s-team/blob/master/archana/data_prep-Final.ipynb)
    - IQR outlier detection + handling  
        - Outliers are data points that lie far from the other data points. Histograms were used to check for outliers in the dataset.
        - Upper and lower bounds were defined from the interquartile range, and outliers were moved to within 1.5 * IQR of the quartiles
    - Data Scaling
        - Min-max scaling  -  In this approach, the data is scaled to a fixed range - usually 0 to 1.
        - This approach will end up with smaller standard deviations, which can suppress the effect of outliers.
    - Data Normalization
        - Applied different methods to normalize the data, including the Yeo-Johnson power transformer.
        - It helps make the data more Gaussian-like and handles both positive and negative values, which the Box-Cox method cannot.
- [__Pipeline C__](https://github.com/fairfield-university-ba545/project2-archana-s-team/blob/master/dylan/pipeline_c.ipynb)
    - Outlier detection + handling with IQR
        - Leverage IQR to find outliers and move them to edge of +/- 1.5 * IQR range
    - Binning of Categorical Variables
        - 5-finger-rule for binning of values
    - KMeans clustering for each continuous variable
        - Distortion plots show that fewer than 5 clusters are ideal for all continuous value binning after IQR standardization
        - Each field now associated with a single cluster
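
The building blocks shared by the three pipelines can be sketched with pandas and scikit-learn as below. This is not the code from the linked notebooks. The helper names are hypothetical, and reading "pushing outliers back" as clipping to the boundary is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler, PowerTransformer


def _as_column(col: pd.Series) -> np.ndarray:
    """scikit-learn expects 2-D input, so reshape a Series to (n_rows, 1)."""
    return col.to_numpy(dtype=float).reshape(-1, 1)


def cap_zscore(col: pd.Series, n_std: float = 3.0) -> pd.Series:
    """Pipeline A idea: clip values more than n_std standard deviations from the mean."""
    mean, std = col.mean(), col.std()
    return col.clip(lower=mean - n_std * std, upper=mean + n_std * std)


def cap_iqr(col: pd.Series, k: float = 1.5) -> pd.Series:
    """Pipeline B/C idea: clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return col.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def yeo_johnson(col: pd.Series) -> pd.Series:
    """Yeo-Johnson power transform; lambda is estimated from the data."""
    pt = PowerTransformer(method="yeo-johnson", standardize=True)
    values = pt.fit_transform(_as_column(col)).ravel()
    return pd.Series(values, index=col.index, name=col.name)


def min_max(col: pd.Series) -> pd.Series:
    """Scale a column to the fixed range 0 to 1."""
    values = MinMaxScaler().fit_transform(_as_column(col)).ravel()
    return pd.Series(values, index=col.index, name=col.name)


def kmeans_bin(col: pd.Series, n_clusters: int = 3, seed: int = 0) -> pd.Series:
    """Pipeline C idea: replace a continuous column with its KMeans cluster label."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(_as_column(col))
    return pd.Series(labels, index=col.index, name=f"{col.name}_cluster")
```

For example, a Pipeline B style column could be built as `min_max(yeo_johnson(cap_iqr(df["PageValues"])))`, although the actual ordering of steps in the notebooks may differ.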
        
## Merge Pipeline Data:
- Merge data from all pipelines, add a suffix of ‘pipeline_x’ for each
- Now that all fields are available, the best combination of each can be used in modelling
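
One way to perform this merge, assuming every pipeline output keeps the original session index, is a column-wise join with a per-pipeline suffix. The function name `merge_pipelines` and the exact suffix format are illustrative.

```python
import pandas as pd


def merge_pipelines(outputs: dict) -> pd.DataFrame:
    """Join pipeline outputs column-wise, tagging each column with its pipeline.

    `outputs` maps a pipeline label (e.g. "a") to that pipeline's DataFrame.
    Assumption: all frames share the same row index (one row per session).
    """
    suffixed = [
        frame.add_suffix(f"_pipeline_{label}") for label, frame in outputs.items()
    ]
    # An inner join keeps only the sessions present in every pipeline output
    return pd.concat(suffixed, axis=1, join="inner")
```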


## Correlation Analysis:
- Original Data standalone
    - <img src="original_corr.png" alt="Drawing" style="width: 400px;"/>
- There is a large number of correlated fields. This will need to be taken into account prior to modelling


Pipelines A and B show similar correlation. Pipeline C is all categorical and cannot be viewed in the same manner. 
- Pipeline A
    - <img src="pipeline_a_corr.png" alt="Drawing" style="width: 400px;"/>
- Pipeline B
    - <img src="pipeline_b_corr.png" alt="Drawing" style="width:400px;"/>

 
Each type of correlation analysis helps identify highly correlated fields, both within a single pipeline and across pipelines. This can be taken into account for feature selection purposes.
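
As a rough illustration of how highly correlated fields could be listed from a merged feature table, the sketch below scans the upper triangle of the absolute correlation matrix. The 0.8 threshold and the name `correlated_pairs` are assumptions, not values taken from the project.

```python
import pandas as pd


def correlated_pairs(features: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """List feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = features.corr().abs()
    cols = list(corr.columns)
    rows = []
    # Visit each unordered pair once (upper triangle, diagonal excluded)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if value >= threshold:
                rows.append(
                    {"feature_1": cols[i], "feature_2": cols[j], "abs_corr": value}
                )
    pairs = pd.DataFrame(rows, columns=["feature_1", "feature_2", "abs_corr"])
    return pairs.sort_values("abs_corr", ascending=False).reset_index(drop=True)
```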


# Modelling - Predict Revenue:
## Feature Selection:
- [RFE](https://github.com/fairfield-university-ba545/project2-archana-s-team/blob/master/shared_code/rfe_corr.py) 
    - Recursive Feature Elimination
        - Technique that fits a model which assigns weights to features
        - Selects features by recursively considering smaller and smaller sets of features, dropping the weakest at each step
        - Can be optimized for each model intended for use
- Removal of correlated features
    - Correlation analysis displayed that some means of removing correlated features is required
- Combining Feature Selection Techniques
    - RFE leveraged for specific model types
    - Features ordered through RFE per model
    - Remove features correlated to those seen as having more predictive power
    - Also remove features from alternative pipelines if they are derived from same initial data
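
The combined approach can be sketched as follows: rank features with scikit-learn's `RFE`, then walk the ranking and keep a feature only if it is not highly correlated with one already kept. This is an illustrative sketch rather than the contents of `rfe_corr.py`. The default estimator, the 0.8 threshold, and the function names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression


def rank_features(X: pd.DataFrame, y, estimator=None) -> list:
    """Order features from most to least useful by running RFE down to one feature."""
    if estimator is None:
        # Assumption: any estimator exposing coef_ or feature_importances_ works here
        estimator = LogisticRegression(max_iter=1000)
    rfe = RFE(estimator, n_features_to_select=1).fit(X, y)
    # ranking_ is 1 for the last surviving feature, 2 for the one dropped before it, ...
    return list(X.columns[np.argsort(rfe.ranking_)])


def drop_correlated(X: pd.DataFrame, ordered: list, threshold: float = 0.8) -> list:
    """Keep a feature only if it is not highly correlated with a better-ranked one."""
    corr = X.corr().abs()
    kept = []
    for name in ordered:
        if all(corr.loc[name, other] < threshold for other in kept):
            kept.append(name)
    return kept
```

Because the ranking depends on the estimator passed in, the selected set can differ per model type, which matches the per-model optimization described above.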


## Select Modelling Techniques:

To select the most accurate models, each member of the team tested the six models below and we cross-validated the results.

- Naive Bayes
- Logistic Regression
- Decision Tree
- Random Forest
- SVM
- XGBoost
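
A comparison along these lines could look like the sketch below, which scores each candidate with stratified k-fold cross-validation. To keep the example dependent on scikit-learn only, `GradientBoostingClassifier` is used as a stand-in for XGBoost. The model settings, the ROC AUC scoring choice, and the function names are assumptions.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def candidate_models(seed: int = 0) -> dict:
    """Untuned versions of the six model families under comparison."""
    return {
        "Naive Bayes": GaussianNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=50, random_state=seed),
        "SVM": SVC(random_state=seed),
        # Stand-in for XGBoost so the sketch needs only scikit-learn
        "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=seed),
    }


def compare_models(X, y, folds: int = 5, seed: int = 0) -> dict:
    """Mean cross-validated ROC AUC for each candidate model."""
    # Stratified folds keep the purchase / no-purchase ratio similar in every fold
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
        for name, model in candidate_models(seed).items()
    }
```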


## Assess Initial Models:
- AUC Evaluation
    - The ROC curve and its AUC help measure the performance of the model.
    - The ROC curve plots the true positive rate against the false positive rate across classification thresholds; AUC summarizes it as a single measure of separability.
    - It tells how capable a model is of distinguishing between classes.
    - The higher the AUC, the better the model is at predicting whether or not the customer will make a purchase.
- Classification Report
    - The classification report compares predicted values against actual values and displays the precision, recall, F1, and support scores for the model.
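
Both metrics are available in scikit-learn. The helper below is a minimal sketch that assumes a fitted binary classifier exposing `predict_proba`, with class 1 meaning a purchase. The function name and the class labels passed to the report are illustrative.

```python
from sklearn.metrics import classification_report, roc_auc_score


def evaluate(model, X_test, y_test):
    """Return (ROC AUC, text classification report) for a fitted binary classifier."""
    # AUC is computed from the predicted probability of the positive (purchase) class
    purchase_proba = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, purchase_proba)
    # The report uses hard predictions and lists precision, recall, F1 and support
    report = classification_report(
        y_test,
        model.predict(X_test),
        target_names=["no purchase", "purchase"],
    )
    return auc, report
```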

On the basis of these metrics, the most promising models were XGBoost, Random Forest, and Support Vector Machine.

[Full Results](https://docs.google.com/spreadsheets/d/1_fL1TIb08jYMj-BAQjTE8nMf5x_NnAfwD7jBWfmELww/edit?usp=sharing)

<img src="initial_results.png" alt="Drawing" style="width: 400px;"/>


## Optimize Hyperparameters:
- Grid Search
    - Exhaustive search of model hyperparameters to derive best set
    - Final Models
        - [Random Forest](https://github.com/fairfield-university-ba545/project2-archana-s-team/blob/master/archana/Data%20prep/Model%20tuning.ipynb)
            - Criterion - [Gini or Entropy] - The function to measure the quality of a split.
            - N_estimators - [75,150,200,300,450,500] - Number of trees used in the forest
            - Max_depth - [3,4,5,8,10] - The maximum depth of the tree.
            - Min_samples_split - [2,5,10] - The minimum number of samples required to split an internal node.
            - Min_samples_leaf - [1,2,4] - The minimum number of samples required to be at a leaf node.
        - [SVM (Support Vector Machine)](https://github.com/fairfield-university-ba545/project2-archana-s-team/blob/master/pete/SVM_Hyperparam_Tune.ipynb)
            - C - [1.0,2.0,3.0,4.0,5.0,6.0] - Regularization parameter
            - Decision_Function_Shape - [OVO or OVR] - Whether to return a One vs Rest (ovr) or One vs One (ovo) decision function
            - Kernel -  [RBF or Linear] - Kernel type used in the algorithm
        - [XGBoost](https://github.com/fairfield-university-ba545/project2-archana-s-team/blob/master/dylan/optimize_xgb.ipynb)
            - Max_depth - Maximum depth to which each tree may grow
            - N_estimators - Number of boosted trees in the model
            - Learning Rate - Step-size shrinkage of feature weights at each boosting step to prevent overfitting
            - Gamma - Minimum loss reduction required to further split a leaf node
            - Subsample - Ratio of training data sampled to grow each tree
            - Colsample_bytree - Ratio of features sampled for each tree
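
As an illustration of the grid search step, the sketch below wires the Random Forest grid listed above into scikit-learn's `GridSearchCV`. The scoring metric, fold count, and function name are assumptions, and the linked notebook may be configured differently.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Random Forest search space as listed in this report
RF_GRID = {
    "criterion": ["gini", "entropy"],
    "n_estimators": [75, 150, 200, 300, 450, 500],
    "max_depth": [3, 4, 5, 8, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}


def tune_random_forest(X, y, grid=None, folds: int = 5, seed: int = 0):
    """Exhaustively search the grid and return (best parameters, best CV score)."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        param_grid=RF_GRID if grid is None else grid,
        scoring="roc_auc",  # assumption: AUC is the selection metric
        cv=folds,
    )
    search.fit(X, y)
    return search.best_params_, search.best_score_
```

The full grid contains 2 * 6 * 5 * 3 * 3 = 540 combinations, each fitted once per fold, so a smaller `grid` argument is useful for a quick trial run.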


## Ensemble Model Creation:
### Combine the optimized stand-alone models for increased performance
- [Voting Classifier](https://github.com/fairfield-university-ba545/project2-archana-s-team/blob/master/dylan/ensemble.ipynb)
    - Each model in the classifier yields a “vote” for classifying input data
    - Contribution of each model can be altered to give more weight to preferable models
    - 60/20/20 weighting, with the most weight on XGBoost as it performed the best following individual optimization
- Feature Sets
    - Same features used for each model within the ensemble
    - All features used for initial ensemble, subset of features leveraging correlation analysis used in subsequent iterations
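
A weighted voting ensemble of this shape can be sketched with scikit-learn's `VotingClassifier`. `GradientBoostingClassifier` again stands in for the tuned XGBoost model, and soft voting (averaging predicted probabilities) is an assumption, since the report does not state whether hard or soft voting was used.

```python
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.svm import SVC


def build_ensemble(seed: int = 0) -> VotingClassifier:
    """Weighted ensemble with 60% of the vote on the boosted model."""
    return VotingClassifier(
        estimators=[
            # Stand-in for the tuned XGBoost model
            ("boost", GradientBoostingClassifier(random_state=seed)),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=seed)),
            # probability=True is required for SVC to take part in soft voting
            ("svm", SVC(probability=True, random_state=seed)),
        ],
        voting="soft",  # assumption: probabilities are averaged, not hard votes
        weights=[0.6, 0.2, 0.2],
    )
```

In practice each estimator would be created with the best hyperparameters found during grid search before being passed to the ensemble.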


## Assess Optimal Models:
Report on results of grid search
- AUC Evaluation
- Classification Report
- [Compare results of each model and ensemble](https://docs.google.com/spreadsheets/d/1_fL1TIb08jYMj-BAQjTE8nMf5x_NnAfwD7jBWfmELww/edit?usp=sharing)

<img src="optimized_res.png" alt="Drawing" style="width: 400px;"/>


## Feature Importance:

__SVM Results:__

<img src="svm_feat_imp.png" alt="Drawing" style="width: 400px;"/>

The plot shows that Page Values is an important feature: the higher the page value, the better the chance of a purchase.

<img src="svm_special_day.png" alt="Drawing" style="width: 400px;"/>

Sessions closer to a special day are more likely to result in a sale. Before modelling, some data falls in the middle of the range; after modelling, more of the no-revenue sessions move away from 0 (the time farther away from a special day), meaning that the further a session is from a special day, the less likely it is to end in a purchase.


__Random Forest Results:__

<img src="rf_product.png" alt="Drawing" style="width: 400px;"/>

The results from the model show that most of the revenue is above the mean line. The plot shows that as the product-related duration increases, the probability of a purchase increases.

<img src="rf_bounce.png" alt="Drawing" style="width: 400px;"/>

The model captures the trend correctly: as the bounce rate increases, the probability of a purchase is low, and as the bounce rate decreases, the probability of a purchase is high.


## XGBoost Model Explainability - [SHAP](https://github.com/fairfield-university-ba545/project2-archana-s-team/blob/master/dylan/shap.ipynb)
- SHAP values allow for easier explainability in black-box models
- Color-coded for feature values
- Ability to see impact on model predictions

<img src="shap_overall.png" alt="Drawing" style="width: 400px;"/>

SHAP displays feature values through color coding, where red refers to higher values and blue to lower values. Items to the left of the vertical line in the center correspond to a lower probability of purchasing, while those on the right show values leading to a higher likelihood of purchasing. Additionally, features are ordered from top to bottom by overall importance to the final predictions.

This is best seen in an example:

- PageValues_b
    - Red values on the right of the vertical line imply that high page values lead to a higher likelihood of revenue
    - Blue values on the left of the vertical line show that web sessions with lower page values are less likely to provide revenue
    
- bounce_exit_iqr_standardized_cluster_c_2
    - Red values on the left-hand side of the vertical line imply that higher values for the combination of bounce and exit rate lead to a lower probability of client purchases.
    - Blue values on the right-hand side of the vertical line show that a lower bounce/exit rate leads to a higher likelihood of client purchases
    - Both of these notions make intuitive sense, as a low bounce/exit rate implies a positive experience, while a higher rate most likely means a more negative experience

## One-off Explainability

SHAP provides the ability to see the impact of each feature per prediction. This is particularly useful in building confidence in the model, as well as addressing business concerns where a single example acts as a stand-in for an overall explanation.
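
The per-prediction view rests on the additive property of SHAP values, which can be written as below. The symbols are introduced here for illustration and do not come from the notebook.

$$
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x)
$$

Here $f(x)$ is the model output for a single session $x$, $\phi_0$ is the base value (the average model output over the background data), $M$ is the number of features, and $\phi_i(x)$ is the contribution of feature $i$ to this prediction. For tree-based classifiers the output is often expressed in log-odds rather than probability, in which case the predicted probability is $1 / (1 + e^{-f(x)})$; whether that applies here depends on how the explainer was configured.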


The following is an example web session history which led to a very low predicted probability of purchasing. It can be seen that each individual feature contributed to the overall low prediction.

<img src="shap_neg.png" alt="Drawing" style="width: 400px;"/>

Similarly there are some cases in which each component of the web session contributes positively towards a high probability of revenue. It can be seen that Page Values play a particularly valuable role in the following case based on the "elbow-like" movement towards a very high probability prediction. 

<img src="shap_pos.png" alt="Drawing" style="width: 400px;"/>



## Value of Model:
- Assign purchasing probability to existing clients
    - Have a better sense for how likely clients are to purchase from us
- Prospecting tool for potential new clients
    - Assess whether newly incoming clients are likely to purchase given their initial web sessions
- Set correct expectations for Sales Representatives
    - Know which individuals are easier and harder to sell to
    - Can strategically target only those who are already more likely to spend
    - Alternatively, focus only on clients who are less likely to purchase, as the others are already likely to buy and do not need special attention
- Our model produces reliable results when predicting whether a client will buy products
- Model performance can improve with additional data
    - As more data is collected over time, the model is likely to become more accurate
- Additional data which would be useful
    - Cost of items purchased by clients
    - Number of distinct purchases
    - Time spent as a client
    - Product level information
- Predictions at an individual level are highly valuable across the company



# [High-Level Client Segmentation](https://github.com/fairfield-university-ba545/project2-archana-s-team/blob/master/dylan/user_segmentation.ipynb)
- Previous model focused on individual level predictions
- Predictions are often more actionable when given at scale or higher level
- User clustering can provide more actionable insight in certain cases

## Select Modelling Techniques:
- K Means clustering

## Assess Initial Models:
- Distortion Plots
    - Measure for finding optimal number of clusters, displaying the point of diminishing returns where adding clusters does not add value
    - <img src="distortion_chart.png" alt="Drawing" style="width: 400px;"/>
    - It appears that using more than 10 clusters adds little additional value
    - In normal business operations, fewer clusters are better for clarity and more actionable decision-making. Therefore, 5 clusters will be used

- Cluster Visualization
    - PCA leveraged to reduce dimensionality of input data to qualitatively assess clusters
    - <img src="pca_clusters.png" alt="Drawing" style="width: 400px;"/>
    - Clusters appear relatively distinct in the PCA projection. Adding more clusters does not lead to great improvements
    - Silhouette Score of 0.187 indicates only modest separation, with some overlap between clusters
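
The three assessment steps above (distortion curve, PCA projection, silhouette score) can be sketched with scikit-learn as follows. The function names are hypothetical, and the input `X` is assumed to be an already scaled numeric feature matrix.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def distortion_curve(X, k_values, seed: int = 0) -> dict:
    """Within-cluster sum of squares (KMeans inertia) for each candidate k."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    }


def cluster_and_project(X, n_clusters: int = 5, seed: int = 0):
    """Fit KMeans, project to 2 PCA components for plotting, and score the result."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    # PCA is used only to visualise the clusters; clustering runs on the full data
    coords = PCA(n_components=2, random_state=seed).fit_transform(X)
    # Silhouette ranges from -1 to 1; values near 0 indicate overlapping clusters
    score = silhouette_score(X, labels)
    return labels, coords, score
```

Plotting the values of `distortion_curve(X, range(1, 16))` against k gives the elbow chart, and a scatter of `coords` colored by `labels` gives the cluster visualization.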


## Cluster Use:
- View how different fields are represented across clusters
    - Revenue
        - <img src="sales_clusters.png" alt="Drawing" style="width: 400px;"/>
        - Revenue is concentrated in cluster 5
        - Other clusters have historically contributed very little in overall revenue
    - Bounce/Exit Rates
        - <img src="bounce_exit_clusters.png" alt="Drawing" style="width: 400px;"/>
        - A combination of bounce and exit rate is known to be highly correlated with whether or not revenue will occur. Clustering shows stark differences across these rates
        - Once again, Cluster 5 stands out, this time with the lowest overall bounce/exit rate, which makes intuitive sense given that this same cluster was most successful in delivering revenue
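
A per-cluster summary like the one behind these charts can be produced with a pandas group-by. The sketch assumes the session table already carries a cluster label column (named `cluster` here, a hypothetical name) along with the `Revenue` flag and the engineered `bounce_exit` field. Since `Revenue` is a purchase flag rather than a currency amount, "revenue share" below means share of purchasing sessions.

```python
import pandas as pd


def profile_clusters(sessions: pd.DataFrame, cluster_col: str = "cluster") -> pd.DataFrame:
    """Summarise purchases and bounce/exit behaviour for each cluster."""
    # Cast the purchase flag to int so sums and means behave the same for bool or 0/1
    data = sessions.assign(Revenue=sessions["Revenue"].astype(int))
    grouped = data.groupby(cluster_col)
    profile = pd.DataFrame(
        {
            "sessions": grouped.size(),
            "purchases": grouped["Revenue"].sum(),
            "purchase_rate": grouped["Revenue"].mean(),
            "mean_bounce_exit": grouped["bounce_exit"].mean(),
        }
    )
    # Share of all purchasing sessions that fall in each cluster
    profile["revenue_share"] = profile["purchases"] / profile["purchases"].sum()
    return profile
```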


# Evaluation:
## Value of our Predictions:
- User Segmentation
    - Segmentation provides business partners with an easier way of approximating different client segments
    - Marketing and targeted ads
        - Alter strategies based on the segments particular users fall in
    - AB testing site changes to try to increase revenue from less engaged users
        - Create several different versions of the site for each cluster
        - Alternatively test 2 versions of the site
            - No changes for Cluster 5 which already provides revenue
            - Dramatically alter the site for other clusters, as they are currently less likely to produce revenue

# Overall Value Added:
- Identify key features that drive potential customers to make a purchase
- Suggestions/Recommendations: 
    - A prospecting tool where marketing emails those more likely to purchase based on data
    - A time-bank touch plan for the sales team, built on segmentation
        - 70% of their time in a year on high value
        - 25% of their time in a year on low value
        - 5% of their time in a year on unknown value
- Next Steps:
    - Additional data which would be useful
        - January and April do not have data. Would be useful to have these to test seasonality more thoroughly 
            - Similarly, additional years of data would be very useful for looking at year over year trends
    - More granular data into Product pages
    - Shipping costs for each product which would affect the purchase
    - Purchase behavior after sale is made (if there is a return)
    - Other problems we can solve for the client
        - Recommendation system for different product pages per person