# Machine Team 4 (Michael DiSanto, Dawn Massey & Brian Nicholls)
### BA545: Data Mining - Competition #2 (Online Shoppers' Purchasing Intentions)
#### Data Audit Report - Spring 2020


<img src="https://i.ytimg.com/vi/CRKn-9gVNBw/maxresdefault.jpg" width=60%/>

Note: This work was completed using the CRISP-DM Framework shown above; accordingly, it will serve as an organizing framework for this report.

#### **Table of Contents:**

0. [Part 0: Preparing for Analysis](#part0)
1. [Part I: Business Issue Understanding](#part1)
2. [Part II: Data Understanding & Exploratory Data Analysis (EDA)](#part2)
3. [Part III: Data Preparation](#part3)

#### **Note: Parts IV and onward are for future work**
4. [Part IV: Data Analysis/Modeling](#part4)
5. [Part V: Validation](#part5)
6. [Part VI: Presentation/Visualization](#part6)
7. [Part VII: Sources](#part7)



# Part 0: Preparing for Analysis  <a name="part0"></a>
#### Import the necessary packages for reading, analyzing, tidying, modeling, & evaluating the data

In [None]:
# TO USE FOR ENTIRE TEAM
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
import statsmodels.api as sm
from scipy import stats
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
# TO USE FOR DWM ONLY
import pandas as pd
import numpy as np
#from pandas_profiling import ProfileReport
import statsmodels.api as sm
from scipy import stats
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
#import plotly.express as px

In [None]:
# Processing the data

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler 
# scaler = StandardScaler().fit(X_train) >>> standardized_X = scaler.transform(X_train) >>> standardized_X_test = scaler.transform(X_test)
from sklearn.preprocessing import Normalizer
# scaler = Normalizer().fit(X_train) >>> normalized_X = scaler.transform(X_train) >>> normalized_X_test = scaler.transform(X_test)
from sklearn.preprocessing import Binarizer 
# binarizer = Binarizer(threshold=0.0).fit(X) >>> binary_X = binarizer.transform(X)

# Encoding Categorical Features
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# enc = LabelEncoder()
# y = enc.fit_transform(y)
from sklearn.impute import (SimpleImputer, KNNImputer, MissingIndicator)
from sklearn.experimental import enable_iterative_imputer  # must precede the IterativeImputer import
from sklearn.impute import IterativeImputer
# imp = SimpleImputer(missing_values=0, strategy='mean') >>> imp.fit_transform(X_train)
from sklearn.preprocessing import PolynomialFeatures 
# poly = PolynomialFeatures(5) >>> poly.fit_transform(X)

from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)


# Various Models
from sklearn.cluster import KMeans
# k_means = KMeans(n_clusters=3, random_state=0)

from sklearn.decomposition import PCA
# pca = PCA(n_components=0.95)

from sklearn.linear_model import LogisticRegression
# logreg = LogisticRegression()
from sklearn.linear_model import RidgeCV
# rrm = RidgeCV(alphas=(0.01, 0.1, 1.0, 10.0), normalize=True)

from sklearn.naive_bayes import GaussianNB
# gnb = GaussianNB()
from sklearn.svm import SVC 
# svc = SVC(kernel='linear')
from sklearn.linear_model import LinearRegression
# lr = LinearRegression(normalize=True)
from sklearn import neighbors
# knn = neighbors.KNeighborsClassifier(n_neighbors=5)

## Fit the model
# # Supervised learning
# lr.fit(X, y)
# knn.fit(X_train, y_train)
# svc.fit(X_train, y_train)   

# #Unsupervised Learning 
# k_means.fit(X_train) 
# pca_model = pca.fit_transform(X_train)

## Predict Y
# Supervised Estimators
# y_pred = svc.predict(np.random.random((2,5))) 
# y_pred = lr.predict(X_test)
# y_pred = knn.predict_proba(X_test)   
# Unsupervised Estimators 
# y_pred = k_means.predict(X_test)


# Packages to evaluate Model Performance (Classification)
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support
# print(classification_report(y_test_log, y_pred_log))

# Packages to evaluate Model Performance (Linear)
from sklearn.metrics import mean_absolute_error 
# y_true = [3, -0.5, 2] >>> mean_absolute_error(y_true, y_pred)
from sklearn.metrics import mean_squared_error
# mean_squared_error(y_test, y_pred)
from sklearn.metrics import r2_score 
# r2_score(y_true, y_pred)

# from sklearn.model_selection import cross_val_score
# print(cross_val_score(knn, X_train, y_train, cv=4)) >>> print(cross_val_score(lr, X, y, cv=2))


from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss 

In [None]:
# Pull in our original data
df = pd.read_csv('online_shoppers_intention.csv', na_values=r'-')

# Part I: Business Issue Understanding  <a name="part1"></a>

## A. Research Question:
Overall, this project's research question is: *What drives potential customers to make purchases?*

## B. Scope of Work:
This project is a classification project in which the members of Machine Team 4 (Michael DiSanto, Dawn Massey and Brian Nicholls) will use the data feature, Revenue, as the target feature when predicting whether a consumer made a purchase and, thus, is part of Class 1 (i.e., if Revenue = TRUE) or, instead, whether the consumer did not make a purchase and, thus, is part of Class 0 (i.e., if Revenue = FALSE).
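
As a minimal sketch of the class definition above, the Boolean target can be mapped to integer class labels. The toy dataframe and the `RevenueClass` column name are assumptions made for illustration only; they are not part of the team's pipeline.

```python
import pandas as pd

# Toy stand-in for the real dataset (illustrative values only).
toy = pd.DataFrame({"Revenue": [True, False, False, True, False]})

# Map the Boolean target to class labels: True -> Class 1, False -> Class 0.
# "RevenueClass" is a hypothetical column name used only in this sketch.
toy["RevenueClass"] = toy["Revenue"].astype(int)

# Share of sessions in each class.
class_share = toy["RevenueClass"].value_counts(normalize=True)
print(class_share)
```

On the full dataset, the same two lines would be applied to `df["Revenue"]`.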

Using the 10 numerical (continuous) and 8 categorical features in the given dataset (the latter count includes the target, Revenue), members of the team will utilize advanced and novel methods in preparing the data to design and implement a model for the client that will predict whether a site visitor will make a purchase. The model will be evaluated on the basis of its prediction accuracy and its predictive power.

Deliverable due dates are as follows:
*     Data Audit Report due Tuesday, March 31, 2020
*     Initial Data Model due Tuesday, April 14, 2020
*     Final Presentation and Report due Tuesday, April 28, 2020

## C. Business Understanding:
Online shopping is an important revenue source for many retail businesses, such as our client. According to Sakar et al. (2019), despite increases in e-commerce traffic in the recent past, "conversion" (of browsers to purchasers) has not increased proportionately. Indeed, the dataset includes 12,330 "sessions," of which only 1,908 (15.5%) resulted in conversion (Sakar et al. 2019, 6895). Thus, it is very important for retail companies, such as our client, to better understand - in real time - the cues that drive conversion. Complicating the process is that, unlike "brick and mortar" stores where shoppers can interact with salespeople who, in turn, can help to facilitate (or at least understand) customer conversion on the basis of their interaction, online retail businesses must *infer* customer behavior from other cues. But what are those cues for our client? As part of determining whether a "browser" will become a "purchaser," our client also might like to know about the cues suggesting the opposite behavior - i.e., abandoning the site or the shopping cart. Additionally, the client might like to know why customers purchasing competitors' products fail to visit the client's website. On the basis of our models, our client is interested in knowing what we might suggest doing (in real time) to increase conversion/reduce abandonment. Further, our client might also like to know whether we believe there are other factors, not captured in the dataset, that might help to better predict conversion/abandonment in the future, as well as how to attract to the client's website consumers who make purchases on competitors' websites.

Past research has focused on:
*     clickstream data
*     session information
*     session length in terms of the number of Web pages visited in a session
*     session duration in seconds
*     average time per page in seconds
*     traffic type (representing the page that referred the user to a particular (bookstore) site)
*     three binary variables representing a set of key operations related to the commercial intent
*     a set of product categories viewed during the session
*     sequential data/most frequently followed navigation paths

(see Sakar et al. 2019, p. 6894, ff.)


### _Reference:_
Sakar, C., S. Polat, M. Katircioglu and Y. Kastro. 2019. Real-time prediction of online shoppers' purchasing intention using multilayer perceptron and LSTM recurrent neural networks. *Neural Computing and Applications 31:* 6893-6908.


# Part II: Data Understanding & Exploratory Data Analysis  <a name="part2"></a>

## A. Data Understanding
Data Understanding includes providing an overview of the dataset, conducting exploratory data analysis, verifying data quality, and deciding how to address data quality issues.

#### _1. Overview of Dataset_
The dataset gathered for purposes of this analysis contains 18 variables: Revenue, which is the Target Variable (where Revenue = TRUE if the customer visiting the website made a purchase - i.e., Class 1; and Revenue = FALSE if the customer visiting the website did not make a purchase - i.e., Class 0); and 17 predictor variables, including 10 continuous features and 7 categorical features, each of which is listed below and then delineated within our Data Dictionary.


##### a. Continuous Features:
*     Administrative: Number of pages visited by the visitor about account management  
*     Administrative Duration: The total amount of time (in seconds) the visitor spent on account management-related pages
*     Informational: Number of pages visited by the visitor about the Web site and its communication and address information
*     Informational Duration: The total amount of time (in seconds) the visitor spent on informational pages
*     Product Related: Number of pages visited by the visitor about product-related pages  
*     Product Related Duration: The total amount of time (in seconds) the visitor spent on product-related pages  
*     Bounce Rate: Average bounce rate value of the pages visited by the visitor (Note: a "bounce" occurs when a visitor enters the site from a particular page and then leaves the site without any further activity.)
*     Exit Rate: Average exit rate value of the pages visited by the visitor
*     Page Value: Average page value of the pages visited by the visitor
*     Special Day: Closeness of the visitor's visit to the site to a special day (e.g., Mother's Day, Valentine's Day)


##### b. Categorical Features:
*     OperatingSystems: Operating system of the visitor (8 possible operating systems)
*     Browser: Browser of the visitor (13 possible browsers)
*     Region: Geographic region from which the session has been started by the visitor (9 possible regions)
*     TrafficType: Traffic source by which the visitor has arrived at the Web site - e.g., banner, SMS, direct (20 possible types)
*     VisitorType: Visitor type as "New Visitor," "Returning Visitor," and "Other" (3 possible types)
*     Weekend: Boolean value indicating whether the date of the visit is a weekend (2 possible values)
*     Month: Month value for visit date (12 possible months)


##### c. Data Dictionary:

<table class="tg">
<tbody>
<tr>
<td class="tg-7btt" style="text-align: center;" colspan="4"><strong>Data Dictionary</strong></td>
</tr>
<tr>
<th class="tg-0pky">Variable</th>
<th class="tg-0pky">Variable Name</th>
<th class="tg-0pky">Variable Definition</th>
<th class="tg-fymr">Data Type</th>
</tr>
<tr>
<td class="tg-7btt" style="text-align: center;" colspan="4"><strong>Web Page Analytics &ndash; Numerical</strong></td>
</tr>
<tr>
<td class="tg-0pky"><strong>Home Page</strong></td>
<td class="tg-fymr">Administrative</td>
<td class="tg-fymr">Number of pages visited by the visitor about account management.</td>
<td class="tg-0pky">Continuous/Float</td>
</tr>
<tr>
<td class="tg-0pky"><strong>Home Page Duration</strong></td>
<td class="tg-fymr">Administrative_Duration</td>
<td class="tg-fymr">The total amount of time (in seconds) the visitor spent on account management-related pages.</td>
<td class="tg-0pky">Continuous/Float</td>
</tr>
<tr>
<td class="tg-0pky"><strong>Information Page</strong></td>
<td class="tg-fymr">Informational</td>
<td class="tg-fymr">Number of pages visited by the visitor about the Web site and its communication and address information.</td>
<td class="tg-0pky">Continuous/Float</td>
</tr>
<tr>
<td class="tg-0pky"><strong>Informational Duration</strong></td>
<td class="tg-fymr">Informational_Duration</td>
<td class="tg-fymr">The total amount of time (in seconds) the visitor spent on informational pages.</td>
<td class="tg-0pky">Continuous/Float</td>
</tr>
<tr>
<td class="tg-0pky"><strong>Product Page</strong></td>
<td class="tg-fymr">ProductRelated</td>
<td class="tg-fymr">Number of pages visited by the visitor about product-related pages.</td>
<td class="tg-0pky">Continuous/Float</td>
</tr>
<tr>
<td class="tg-0pky"><strong>Product Related Duration</strong></td>
<td class="tg-fymr">ProductRelated_Duration</td>
<td class="tg-fymr">The total amount of time (in seconds) the visitor spent on product-related pages</td>
<td class="tg-0pky">Continuous/Float</td>
</tr>
<tr>
<td class="tg-0pky"><strong>Bounce Rate</strong></td>
<td class="tg-fymr">BounceRates</td>
<td class="tg-fymr">The percentage of single page visits (or web sessions). It is the percentage of visits in which a person leaves your website from the landing page without browsing any further.</td>
<td class="tg-0pky">Continuous/Float</td>
</tr>
<tr>
<td class="tg-0pky"><strong>Exit Rate</strong></td>
<td class="tg-fymr">ExitRates</td>
<td class="tg-fymr">For all pageviews to the page, Exit Rate is the percentage that were the last in the session</td>
<td class="tg-0pky">Continuous/Float</td>
</tr>
<tr>
<td class="tg-0pky"><strong>Page Value</strong></td>
<td class="tg-fymr">PageValues</td>
<td class="tg-fymr">The average value for a page that a user visited before landing on the goal page or completing an Ecommerce transaction (or both). This value is intended to indicate which pages on the site contributed more to the site's revenue.</td>
<td class="tg-0pky">Continuous/Float</td>
</tr>
<tr>
<td class="tg-0pky"><strong>Special Day</strong></td>
<td class="tg-fymr">SpecialDay</td>
<td class="tg-fymr">The closeness of the site visiting time to a specific special day (e.g., Mother’s Day, Valentine's Day) on which sessions are more likely to be finalized with a transaction.</td>
<td class="tg-0pky">Continuous/Float</td>
</tr>
<tr>
<td class="tg-7btt" style="text-align: center;" colspan="4"><strong>Web Page Analytics &ndash;Categorical</strong></td>
</tr>
<tr>
<td class="tg-0pky"><strong>Month</strong></td>
<td class="tg-fymr">Month</td>
<td class="tg-fymr">Month in which the visit took place</td>
<td class="tg-0pky">Categorical/Int</td>
</tr>
<tr>
<td class="tg-0pky"><strong>OperatingSystems</strong></td>
<td class="tg-fymr">OperatingSystems</td>
<td class="tg-fymr">Operating system of the computer the user used while viewing the site</td>
<td class="tg-0pky">Categorical/Int</td>
</tr>
<tr>
<td class="tg-0pky"><strong>Browser</strong></td>
<td class="tg-fymr">Browser</td>
<td class="tg-fymr">Browser the user used to view the site</td>
<td class="tg-fymr">Categorical/Int</td>
</tr>
<tr>
<td class="tg-0pky"><strong>Region</strong></td>
<td class="tg-fymr">Region</td>
<td class="tg-fymr">Region where the user is located</td>
<td class="tg-fymr">Categorical/Int</td>
</tr>
<tr>
<td class="tg-0pky"><strong>TrafficType</strong></td>
<td class="tg-fymr">TrafficType</td>
<td class="tg-fymr">Traffic source by which the visitor has arrived at the Web site - e.g., banner, SMS, direct (20 possible types)</td>
<td class="tg-fymr">Categorical/Int</td>
</tr>
<tr>
<td class="tg-0pky"><strong>Visitor Type</strong></td>
<td class="tg-fymr">VisitorType</td>
<td class="tg-fymr">Is this a returning visitor, a new visitor, or other?</td>
<td class="tg-fymr">Categorical/String</td>
</tr>
<tr>
<td class="tg-0pky"><strong>Weekend</strong></td>
<td class="tg-fymr">Weekend</td>
<td class="tg-fymr">Did the visit happen on the weekend?</td>
<td class="tg-fymr">Binary/Boolean</td>
</tr>
<tr>
<td class="tg-0pky"><strong>Revenue</strong></td>
<td class="tg-fymr">Revenue</td>
<td class="tg-fymr">Did the visit result in Revenue?</td>
<td class="tg-fymr">Binary/Boolean</td>
</tr>
</tbody>
</table>



### B. Exploratory Data Analysis (EDA) &  Data Quality Verification (DQV)


#### _1. Overview of Findings from EDA & DQV (per the below):_
*     There are 12,330 observations with one target value and 17 features.
*     There are no missing values; however, we did note the following:
    * There are no observations for January and April, which suggests the dataset does not include a full year's worth of information; this may limit our ability to assess monthly trends/differences.
    * A few (85) observations were coded as "other" - meaning they were neither new nor returning customers; since "new" and "returning" customers are mutually exclusive, the observations coded as "other" appear to be erroneous.
    * 85% of the data come from Browser 1 (20%) or 2 (65%); hence, the data are not balanced with regard to browser.
    * Approximately 90% of the data come from days other than "Special" days.
*     Bounce Rate and Exit Rate are highly correlated at 0.91; however, each also is correlated with the target variable (at -0.15 for Bounce Rate and at -0.25 for Exit Rate); hence, we are reluctant to remove either from our analysis. Rather, we will consider engineering a new feature that combines Bounce Rate with Exit Rate (e.g., via an average or weighted average of the two features).
*     Administrative Page and Exit Rate are also correlated, at -0.43; however, they, too, are correlated with the target variable (at 0.62 for Administrative Page and at -0.25 for Exit Rate); hence, we are reluctant to remove either from our analysis. Rather, we will consider engineering a new feature that combines Administrative Page with Exit Rate (e.g., via a division of one feature by the other).
*     Our data is imbalanced toward Revenue = False (i.e, Class 0, no purchases).
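
The two engineered features contemplated above could be sketched as follows. This is only an illustration under stated assumptions: the toy values, the weight `w`, and the new column names `BounceExitAvg` and `AdminPerExit` are hypothetical, and the team has not yet settled on a final formula.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "BounceRates": [0.0, 0.2, 0.05],
    "ExitRates": [0.1, 0.2, 0.0],
    "Administrative": [3, 0, 4],
})

# Weighted average of BounceRates and ExitRates.
# w = 0.5 gives a plain average; the weight is a hypothetical choice.
w = 0.5
toy["BounceExitAvg"] = w * toy["BounceRates"] + (1 - w) * toy["ExitRates"]

# Ratio of Administrative to ExitRates. ExitRates can be zero, so zero
# denominators are replaced with NaN instead of producing infinities.
toy["AdminPerExit"] = toy["Administrative"] / toy["ExitRates"].replace(0, np.nan)

print(toy)
```

Rows with a zero Exit Rate end up with a missing ratio, so an imputation or capping rule would need to be chosen before modeling.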



#### _2. Descriptive Statistics:_

In [None]:
sample = df.sample(30)
sample

### Observations
----------------------------------------------------------------------
##### Initial import seems to be accurate and complete in comparison to the data dictionary

In [None]:
# More info on the dataframe
df.info()

### Observations
----------------------------------------------------------------------
##### There are no null columns on import; the dataframe has 12,330 rows and 18 columns.
##### Month, VisitorType, Weekend, and Revenue are non-numeric attributes that may need adjustment later in the analysis.
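
One possible adjustment for the non-numeric attributes is sketched below: Boolean columns are cast to 0/1 and string columns are one-hot encoded. The toy rows are assumptions for illustration, and one-hot encoding is only one of several options (the notebook also imports `LabelEncoder` and `OneHotEncoder`).

```python
import pandas as pd

# Toy stand-in for the four non-numeric columns (illustrative values only).
toy = pd.DataFrame({
    "Month": ["Feb", "Mar", "May"],
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Other"],
    "Weekend": [False, True, False],
    "Revenue": [False, False, True],
})

# Cast the Boolean columns to integers (False -> 0, True -> 1).
encoded = toy.astype({"Weekend": int, "Revenue": int})

# One-hot encode the string columns; each category becomes its own column.
encoded = pd.get_dummies(encoded, columns=["Month", "VisitorType"])

print(encoded.columns.tolist())
```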

In [None]:
# Initial description of the data
df.describe()

### Observations
----------------------------------------------------------------------

##### BounceRates, ExitRates, & SpecialDay are on a 0-1 scale, while the other numerical attributes are on different scales.
##### SpecialDay, OperatingSystems, Browser, Region, and TrafficType are all categorical attributes, which could be further analyzed using encoding.
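
Because the attributes sit on different scales, one option is to rescale the wider-ranging columns to 0-1 so that they are comparable with BounceRates, ExitRates, and SpecialDay. The sketch below uses plain min-max scaling on toy values; the choice of columns and of min-max scaling (rather than the `StandardScaler` imported earlier) is an assumption made for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "ProductRelated": [1, 10, 100],
    "PageValues": [0.0, 5.0, 20.0],
    "ExitRates": [0.2, 0.05, 0.01],  # already on a 0-1 scale, left untouched
})

to_scale = ["ProductRelated", "PageValues"]  # hypothetical selection
scaled = toy.copy()
for col in to_scale:
    col_min = toy[col].min()
    col_range = toy[col].max() - col_min
    # Min-max scaling: (x - min) / (max - min) maps the column onto [0, 1].
    scaled[col] = (toy[col] - col_min) / col_range

print(scaled)
```

When modeling, the minimum and range would normally be computed on the training split only and then reused on the test split.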

#### _3. Correlation Analysis:_

In [None]:
# Correlation Heatmap for the dataframe
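# Restrict to numeric/Boolean columns so string columns (e.g., Month, VisitorType) do not break corr()
spearman = df.select_dtypes(include=['number', 'bool']).corr(method='spearman')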
plt.figure(figsize=(25,10))
sns.heatmap(spearman, annot=True)

In [None]:
numerical_list=['Administrative','Administrative_Duration','Informational','Informational_Duration','ProductRelated','ProductRelated_Duration','BounceRates','ExitRates','PageValues','SpecialDay','OperatingSystems','Browser','Region','TrafficType']
sns.pairplot(df[numerical_list],corner=True)

### Observations
----------------------------------------------------------------------

##### We observed the following high correlations:
    - The duration attributes may need to be assessed for elimination in the final model as each is very highly correlated with its corresponding non-duration attribute.
    
    - Administrative & ProductRelated have a correlation of 0.46; this merits further investigation and possible feature engineering.
    - Administrative & ProductRelated_Duration have a correlation of 0.42; this merits further investigation and possible feature engineering.
    - Administrative & ExitRates have a correlation of -0.43; this merits further investigation and possible feature engineering.
    - Administrative & PageValues have a correlation of 0.33; this merits further investigation and possible feature engineering.

 
    - Administrative_Duration & ExitRates have a correlation of -0.44; this merits further investigation and possible feature engineering.
    - Administrative_Duration & ProductRelated have a correlation of 0.43; this merits further investigation and possible feature engineering.
    - Administrative_Duration & ProductRelated_Duration have a correlation of 0.41; this merits further investigation and possible feature engineering. 

    - ProductRelated & ExitRates have a correlation of -0.52; this merits further investigation and possible feature engineering.
 
    - ProductRelated_Duration & ExitRates have a correlation of -0.48; this merits further investigation and possible feature engineering.
    
    - BounceRates & ExitRates have a correlation of 0.6; this merits further investigation and possible feature engineering.

##### After initial exploration we decided to compare features that are highly correlated with the target (Revenue)
    - PageValues & Revenue have a correlation of 0.63; as Revenue is the target, we would expect PageValues to be a useful attribute in our model.
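
The comparison against the target can be sketched by pulling the Revenue column out of the Spearman matrix and ranking features by absolute correlation. The toy values below are assumptions for illustration, so the resulting numbers are not the ones reported above.

```python
import pandas as pd

# Toy stand-in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "PageValues": [0.0, 0.0, 12.5, 30.1, 0.0, 8.2],
    "ExitRates": [0.20, 0.10, 0.02, 0.01, 0.15, 0.03],
    "Revenue": [False, False, True, True, False, True],
})

# Cast the Boolean target to 0/1 so it takes part in the correlation.
numeric = toy.astype({"Revenue": int})

# Spearman correlation of every feature with Revenue, strongest first.
target_corr = (
    numeric.corr(method="spearman")["Revenue"]
    .drop("Revenue")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```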

In [None]:
# Code for images using Plotly

# import plotly.express as px
# import os
# from IPython.display import Image

# # ProductRelated vs Administrative scatter plot
# fig6 = px.scatter(df, x="ProductRelated", y="Administrative",facet_col="Revenue", color="Region", trendline="ols",render_mode = 'webgl' )
# img_bytes_6 = fig6.to_image(format="png", width=1200, height=400, scale=1)
# fig6.write_image("images/fig6.png")
# #Image(img_bytes_6)

<img src="images/fig6.png"/>

Administrative vs. ProductRelated does not appear useful for feature engineering, as the trends do not differ when segmented by Revenue (True or False).

In [None]:
# # Code for images using Plotly

# import plotly.express as px
# import os
# from IPython.display import Image

# # ProductRelated_Duration vs Administrative scatter plot
# fig7 = px.scatter(df, x="ProductRelated_Duration", y="Administrative",facet_col="Revenue", color="Region", trendline="ols",render_mode = 'webgl' )
# img_bytes_7 = fig7.to_image(format="png", width=1200, height=400, scale=1)
# fig7.write_image("images/fig7.png")
# #Image(img_bytes_7)

<img src="images/fig7.png"/>

Administrative vs. ProductRelated_Duration does not appear useful for feature engineering, as the trends do not differ when segmented by Revenue (True or False).

In [None]:
# # Code for images using Plotly

# # ExitRates vs Administrative scatter plot
# fig5 = px.scatter(df, x="ExitRates", y="Administrative",facet_col="Revenue", color="Region", trendline="ols",render_mode = 'webgl' )
# img_bytes_5 = fig5.to_image(format="png", width=1200, height=400, scale=1)
# fig5.write_image("images/fig5.png")
# # Image(img_bytes_5)

<img src="images/fig5.png"/>

Administrative vs. ExitRates does not appear useful for feature engineering, as the trends do not differ when segmented by Revenue (True or False).

In [None]:
# # Code for images using Plotly

# # PageValues vs Administrative scatter plot
# fig8 = px.scatter(df, x="PageValues", y="Administrative",facet_col="Revenue", color="Region", trendline="ols",render_mode = 'webgl' )
# img_bytes_8 = fig8.to_image(format="png", width=1200, height=400, scale=1)
# fig8.write_image("images/fig8.png")
# # Image(img_bytes_8)

<img src="images/fig8.png"/>

Administrative vs. PageValues appears useful for feature engineering, as the trends differ when segmented by Revenue (True or False).

In [None]:
# # Code for images using Plotly

# # ExitRates vs Administrative_Duration scatter plot
# fig9 = px.scatter(df, x="ExitRates", y="Administrative_Duration",facet_col="Revenue", color="Region", trendline="ols",render_mode = 'webgl' )
# img_bytes_9 = fig9.to_image(format="png", width=1200, height=400, scale=1)
# fig9.write_image("images/fig9.png")
# # Image(img_bytes_9)

<img src="images/fig9.png"/>

Administrative_Duration vs. ExitRates does not appear useful for feature engineering, as the trends do not differ when segmented by Revenue (True or False).

In [None]:
# # Code for images using Plotly

# # ProductRelated vs Administrative_Duration scatter plot
# fig10 = px.scatter(df, x="ProductRelated", y="Administrative_Duration",facet_col="Revenue", color="Region", trendline="ols",render_mode = 'webgl' )
# img_bytes_10 = fig10.to_image(format="png", width=1200, height=400, scale=1)
# fig10.write_image("images/fig10.png")
# # Image(img_bytes_10)

<img src="images/fig10.png"/>

Administrative_Duration vs. ProductRelated does not appear useful for feature engineering, as the trends do not differ when segmented by Revenue (True or False).

In [None]:
# # Code for images using Plotly

# # ProductRelated_Duration vs Administrative_Duration scatter plot
# fig11 = px.scatter(df, x="ProductRelated_Duration", y="Administrative_Duration",facet_col="Revenue", color="Region", trendline="ols",render_mode = 'webgl' )
# img_bytes_11 = fig11.to_image(format="png", width=1200, height=400, scale=1)
# fig11.write_image("images/fig11.png")
# # Image(img_bytes_11)

<img src="images/fig11.png"/>

Administrative_Duration vs. ProductRelated_Duration does not appear useful for feature engineering, as the trends do not differ when segmented by Revenue (True or False).

In [None]:
# # Code for images using Plotly

# # ExitRates vs ProductRelated scatter plot
# fig13 = px.scatter(df, x="ExitRates", y="ProductRelated",facet_col="Revenue", color="Region", trendline="ols",render_mode = 'webgl' )
# img_bytes_13 = fig13.to_image(format="png", width=1200, height=400, scale=1)
# fig13.write_image("images/fig13.png")
# # Image(img_bytes_13)

<img src="images/fig13.png"/>

ExitRates vs. ProductRelated appears useful for feature engineering, as the trends differ when segmented by Revenue (True or False).

In [None]:
# # Code for images using Plotly

# # ExitRates vs ProductRelated_Duration scatter plot
# fig12 = px.scatter(df, x="ExitRates", y="ProductRelated_Duration",facet_col="Revenue", color="Region", trendline="ols",render_mode = 'webgl' )
# img_bytes_12 = fig12.to_image(format="png", width=1200, height=400, scale=1)
# fig12.write_image("images/fig12.png")
# # Image(img_bytes_12)

<img src="images/fig12.png"/>

ExitRates vs. ProductRelated_Duration appears useful for feature engineering, as the trends differ when segmented by Revenue (True or False).

In [None]:
# # Code for images using Plotly

# # ExitRates vs BounceRates Scatter plot
# fig4 = px.scatter(df, x="BounceRates", y="ExitRates",facet_col="Revenue", color="Region", trendline="ols",render_mode = 'webgl' )
# img_bytes_4 = fig4.to_image(format="png", width=1200, height=400, scale=1)
# fig4.write_image("images/fig4.png")
# Image(img_bytes_4)


<img src="images/fig4.png"/>

BounceRates and ExitRates are highly correlated, including when segmented by Revenue (True or False).

In [None]:
# # Code for images using Plotly

# # BounceRates vs PageValues Scatter plot
# fig2 = px.scatter(df, x="PageValues", y="BounceRates",facet_col="Revenue", color="Region", trendline="ols",render_mode = 'webgl' )
# img_bytes_2 = fig2.to_image(format="png", width=1200, height=400, scale=1)
# fig2.write_image("images/fig2.png")
# Image(img_bytes_2)


<img src="images/fig2.png"/>

BounceRates vs. PageValues appears slightly useful for feature engineering, as the trends differ when segmented by Revenue (True or False).

In [None]:
# # Code for images using Plotly

# # ExitRates vs PageValues Scatter plot
# fig3 = px.scatter(df, x="PageValues", y="ExitRates",facet_col="Revenue", color="Region", trendline="ols",render_mode = 'webgl' )
# img_bytes_3 = fig3.to_image(format="png", width=1200, height=400, scale=1)
# fig3.write_image("images/fig3.png")
# Image(img_bytes_3)

<img src="images/fig3.png"/>

ExitRates vs. PageValues appears slightly useful for feature engineering, as the trends differ when segmented by Revenue (True or False).

#### _4. Preliminary EDA Visualizations:_

In [None]:
profile = ProfileReport(df)
profile.to_file(output_file="Customer_Intentions_Profile.html")
profile

### Observations for Preliminary EDA/Visualizations:
----------------------------------------------------------------------

##### **Warnings – highlights:**
    - High correlation between ExitRates and BounceRates, which we noted in the correlation map.
    - Dataset has 125 duplicate rows, which we decided to allow given that there are no unique identifiers that we could use to verify whether the apparently duplicate entries (~1% of the data) were bogus or legitimate.
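
The duplicate-row count flagged in the warnings can be cross-checked directly in pandas. The sketch below uses a toy dataframe with hypothetical values; on the real data the same calls would be made on `df`.

```python
import pandas as pd

# Toy stand-in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "Administrative": [0, 0, 2, 0],
    "ExitRates": [0.2, 0.2, 0.05, 0.2],
    "Revenue": [False, False, True, False],
})

# duplicated() marks every repeat of an earlier, fully identical row.
dup_mask = toy.duplicated()
n_dups = int(dup_mask.sum())
dup_share = n_dups / len(toy)

# If the team later chose to drop the repeats, this keeps the first copy of each.
deduped = toy.drop_duplicates()

print(n_dups, round(dup_share, 3), len(deduped))
```

Since sessions carry no unique identifier, identical rows may be legitimate, which is why the sketch only counts them and leaves dropping as an option.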

##### **Variables – highlights:**

    - Administrative is a count between 0 and 27; it is right skewed (with ~66% of the dataset in 0, 1, 2)
    - Administrative_Duration captures time spent; it also is right skewed with almost 50% of the data being zero (which makes sense because almost 50% of the data in Administrative is zero)

    - Informational is a count between 0 and 24; it is right skewed (with ~90% of the dataset in 0, 1, 2)
    - Informational_Duration captures time spent; it also is right skewed with over 80% of the data being zero (which makes sense because over 78% of the data in Informational is zero)

    - ProductRelated is a count between 0 and 705; it is right skewed (however, only ~12% of the dataset is in 0, 1, 2)
    - ProductRelated_Duration captures time spent; it also is right skewed with almost 6% of the data being zero (which makes sense because only ~12% of the data in ProductRelated is 0, 1, or 2)

    - BounceRate captures the percentage of visits in which a visitor exits the landing page without browsing any further. It is right skewed, with about 45% of the data being a value of 0. Given that those who “bounce” will certainly not buy, this attribute may well be an important variable in our model. (Note: ExitRate, which is highly correlated with BounceRate, also is likely to have a similar skewness, distribution and importance in predicting online shoppers’ purchasing behavior.)

    - PageValues are dollar amounts – more or less – amounting to sales amounts, divided by page views. The variable is right skewed with approximately 78% of PageValues being zero; this makes sense because about 45% of customers “bounce” immediately, buying nothing, leaving another ~30% to browse without completing a purchase.

    - SpecialDay captures the closeness of the visit to a holiday/special day on a 0-1 scale, with 0 for not near a holiday/special day and 1 for closest to one. The variable is right skewed with approximately 90% of the data being zero (i.e., transaction not occurring near a holiday/special day). This information suggests to us that we are dealing with a unique retail environment (i.e., most retailers experience increased activity at/during holiday times).

    - Month – is the month of the year in which the transaction occurred. We first noted that the dataset is devoid of transactions in January and April. Thus, the dataset does not appear to contain a full year of information, which could impair our ability to complete the analysis in light of potential seasonality. The most popular months for online browsing/shopping are: May (27.3%), November (24.3%), March (15.5%) and December (14.0%). Low months include June-October, perhaps because folks are not browsing/shopping online during the warmer months.

    - OperatingSystems – is a categorical variable and most of the data (~95%) is in one of three operating systems (2, 1, 3). 

    - Browser – is a categorical variable and most of the data (~91%) come from three browsers (2, 1, 4).

    - Region – is a categorical variable for the region from which the visitor came. The top four account for ~77% of the data (i.e., regions 1, 3, 4, 2).

    - TrafficType – is a categorical variable to indicate how the visitor arrived at the website. The top three account for approximately 67% of the referrals (i.e., types 2, 1, 3).

    - VisitorType – is a categorical variable. Most visitors (~86%) are return visitors. A few visitors have been classified as “Other”; however, they should not be so classified as the categories of “Returning_visitor” and “New_visitor” should capture all visitors (i.e., a visitor is either one or the other).

    - Weekend – is a categorical variable to capture whether the visitor is visiting the site on a weekend. Approximately 77% of the visits took place during the week, which makes sense in light of the proportion of weekdays in a week (i.e., 5/7 = 71.4%).

    - Revenue – is the target variable. It is a categorical variable. It is imbalanced – as approximately 85% of the visits resulted in “no sale” (i.e., only ~15% of the visits resulted in sales). As such, we will need to balance the data later.
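
The zero shares and the class balance quoted in the list above can be recomputed directly from the dataframe. The following is a minimal sketch on a made-up frame; the `toy` data and the helper name `zero_share` are illustrative assumptions and are not part of the team's notebook.

```python
import pandas as pd

# Toy stand-in for the real dataframe; all values are made up for illustration.
toy = pd.DataFrame({
    "PageValues": [0.0, 0.0, 12.5, 0.0, 3.1, 0.0, 0.0, 0.0, 0.0, 40.0],
    "BounceRates": [0.0, 0.2, 0.0, 0.0, 0.05, 0.0, 0.1, 0.0, 0.2, 0.0],
    "Revenue": [False, False, True, False, False, False, False, False, False, True],
})

def zero_share(frame, columns):
    """Return the percentage of rows equal to zero for each column."""
    return (frame[columns] == 0).mean().mul(100).round(1)

zeros = zero_share(toy, ["PageValues", "BounceRates"])

# Share of each target class, as a percentage.
class_balance = toy["Revenue"].value_counts(normalize=True).mul(100).round(1)

print(zeros)
print(class_balance)
```

On the real data, the same two calls would be pointed at `df` and its column names.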


In [None]:
# Add Box Plots to further Describe the Data
# For Administrative
df_p=df.iloc[:,0]
df_p.plot.box()

In [None]:
# The group decided not to further consider the outliers above because the values are neither excessively extreme nor outside a reasonable range.

In [None]:
# For Administrative_Duration
df_p=df.iloc[:,1]
df_p.plot.box()

In [None]:
# The group decided not to further consider the outliers because there is an extraordinary number of them.

In [None]:
# For Informational
df_p=df.iloc[:,2]
df_p.plot.box()

In [None]:
# The group decided not to further consider the outliers because the values are neither excessively extreme nor outside a reasonable range.

In [None]:
# For Informational_Duration
df_p=df.iloc[:,3]
df_p.plot.box()

In [None]:
# The group decided not to further consider the outliers because there is an extraordinary number of them.

In [None]:
# For ProductRelated
df_p=df.iloc[:,4]
df_p.plot.box()

In [None]:
# The group decided not to further consider the outliers because there is an extraordinary number of them.

In [None]:
# For ProductRelated_Duration
df_p=df.iloc[:,5]
df_p.plot.box()

In [None]:
# The group decided not to further consider the outliers because there is an extraordinary number of them.

In [None]:
# For BounceRates
df_p=df.iloc[:,6]
df_p.plot.box()

In [None]:
# The group decided not to further consider the outliers because there is an extraordinary number of them and the outlier values are not outside the range of 0-1.

In [None]:
# For ExitRates
df_p=df.iloc[:,7]
df_p.plot.box()

In [None]:
# The group decided not to further consider the outliers because there is an extraordinary number of them and the outlier values are not outside the range of 0-1.

In [None]:
# For PageValues
df_p=df.iloc[:,8]
df_p.plot.box()

In [None]:
# The group decided not to further consider the outliers because there is an extraordinary number of them
# and even the most extreme value of $350+ is reasonable considering the definition of the attribute.
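
To put a number on "an extraordinary number" of outliers, the share of points falling outside the box-plot whiskers can be computed per column. This is a sketch under the usual 1.5 × IQR whisker convention; the helper `iqr_outlier_share` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outlier_share(series, k=1.5):
    """Percentage of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return round(outside.mean() * 100, 1)

# Toy right-skewed column (made-up values); only 400 lies beyond the upper fence.
durations = pd.Series([0, 0, 5, 10, 12, 15, 20, 25, 30, 400])
outlier_pct = iqr_outlier_share(durations)
print(outlier_pct)
```

Applied to each column of the real dataframe, this would give a comparable figure for every box plot above.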

### C. Data Quality Improvement Strategies
#### _1. Overview:_
Given the noted missing/erroneous values, we have formulated the following pipeline for understanding and preparing our data.

We will complete the following pipeline step in the data understanding phase:

        1. Imputation 
           - We will replace VisitorType "Other" with the mode "Returning_Visitor"

We will complete the following pipeline steps in the data preparation phase:
       
       1. Feature Engineering
           - Created new variables 
               - 'Admin_per_Exit', 'Admin_per_Bounce', 'Bounce_Exit_Rate_Avg', 'Bounce_Exit_Rate_WeightedAvg', 'Bounce_per_Exit_Rate', 'Total_Duration', 'Total_Duration_Avg', 'Admin_Duration_percent_TotalDuration',
                 'Info_Duration_percent_TotalDuration', 'Product_Duration_percent_TotalDuration', 'TotalDuration_per_PageValues', 'Admin_per_PageValues', 'AdminDuration_per_PageValues', 'Informational_per_PageValues',
                 'Info_Duration_per_PageValues', 'ProductRelated_per_PageValues', 'Product_Duration_per_PageValues', 'Exit_per_PageValues', 'Bounce_per_PageValues'
          
           - Binned categorical variables to reduce the number of categories to five or fewer (OperatingSystems, Browser, Region, TrafficType, VisitorType, Month and Weekend) 
               - Binned month according to holiday month
               - Binned month according to special day frequency
               - Binned month according to target frequency

       2. Outlier Detection
           - IQR Outlier Detection: We use the IQR to address outliers in all calculated variables
           
       3. Normalization
          - utilized quantile_transform method
          - utilized PowerTransformer as a secondary method
       
       4. Standardization
           a. Min-Max Scaler
           b. Z-Score Standardization
              - We initially omit this step due to successfully using the Min-Max Scaler method
           c. Standard Deviation Outlier
              - We initially omit this step due to addressing outliers via IQR
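
The two rescaling options named in step 4 differ only in which summary statistics they use. The sketch below applies both formulas by hand to a made-up array; it is meant to illustrate the arithmetic, not to replace the scikit-learn scalers used later in the notebook.

```python
import numpy as np

# Made-up feature values for illustration.
values = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max scaling: (x - min) / (max - min); the result lies in [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score standardization: (x - mean) / std; the result has mean 0 and std 1.
z_score = (values - values.mean()) / values.std()

print(min_max)
print(z_score.round(3))
```

Min-max keeps every feature inside a fixed range, which is one reason a team might stop there; z-scores are unbounded and stay more sensitive to extreme values.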

In [None]:
# Pipeline - Initial Imputation of Categorical Features:

## Replace the VisitorType 'Other' with the variable's mode, namely: 'Returning_Visitor'
df['VisitorType'] = df['VisitorType'].replace('Other','Returning_Visitor')
df.groupby('VisitorType').count()

In [None]:
# Initial Imputing Continuous Features

impute1_list =['Administrative','Administrative_Duration','Informational','Informational_Duration','ProductRelated','ProductRelated_Duration','BounceRates','ExitRates','PageValues']

# Shift every value by +1 so that zeros do not produce -inf when the log is taken
for column in impute1_list:
    df[column] = df[column] + 1

display(df.sample(20))

# df['PageValues'] = df['PageValues'] + 1

df['PageValues_Log'] = np.log(df['PageValues'])
df['PageValues_Log10'] = np.log10(df['PageValues'])

df[['PageValues_Log','PageValues_Log10']]

In [None]:
# Check to see if there are any inf values or N/As
display(df[['PageValues_Log','PageValues_Log10']].describe())
display(df[['PageValues_Log','PageValues_Log10']].isna().sum())
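
The reason for the +1 shift before the log can be seen on a handful of values. This is a small sketch with made-up numbers; `np.log1p` is shown only as an equivalent one-step alternative to adding 1 and then calling `np.log`.

```python
import numpy as np

# Made-up PageValues, including zeros.
page_values = np.array([0.0, 0.0, 9.0, 99.0])

# Shifting by 1 before the log keeps zeros finite: log10(0 + 1) = 0.
shifted_log10 = np.log10(page_values + 1)

# np.log1p(x) computes the natural log of (1 + x) in one step.
natural = np.log1p(page_values)

print(shifted_log10)                    # zeros map to 0, 9 maps to 1, 99 maps to 2
print(np.isfinite(natural).all())

# The two logs differ only by the constant factor ln(10).
print(np.allclose(natural, shifted_log10 * np.log(10)))
```

Because the natural log and the base-10 log differ only by a constant factor, the two resulting columns are perfectly correlated with each other, which is worth keeping in mind when both are carried into the correlation analysis.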

# Part III: Data Preparation  <a name="part3"></a>

### A. Overview
Data Preparation includes preprocessing steps for selecting data (e.g., feature engineering/binning) and cleaning data (e.g., recoding any "new" features created; normalizing; handling outliers; dealing with skewness; standardizing; reviewing correlations to identify highly related/correlated features that should be avoided in the analysis).

## 1. Selecting Data
    a. Feature Engineering for Continuous Features

In [None]:
# Relationship of rates or duration / page value, compare to y
# Weighted Avg Bounce & Exit rate
# Pipeline - Feature Engineering:
# Created 19 new variables (all listed in calcualted_cols at the end of this cell)


#Create 'Admin_per_Exit' to enable us to retain two highly correlated variables (i.e., 'Administrative' and 'ExitRates') since both are highly correlated with the target
df['Admin_per_Exit'] = df['Administrative'] / df['ExitRates']
df['Admin_per_Bounce'] = df['Administrative'] / df['BounceRates']

#Create 'Bounce_Exit_Rate_Avg' to enable us to retain two highly correlated variables (i.e., 'BounceRates' and 'ExitRates') since both are highly correlated with the target
df['Bounce_Exit_Rate_Avg'] = (df['BounceRates'] + df['ExitRates'])/2
df['Bounce_Exit_Rate_WeightedAvg'] = ((df['BounceRates']*.6) + (df['ExitRates']*.4))

#Create 'Bounce_per_Exit_Rate' to enable us to retain two highly correlated variables (i.e., 'BounceRates' and 'ExitRates') since both are highly correlated with the target
df['Bounce_per_Exit_Rate'] = df['BounceRates'] / df['ExitRates']


#Create 'Total_Duration' and 'Total_Duration_Avg' to enable us to assess total and average duration, respectively.
df['Total_Duration'] = df['Administrative_Duration'] + df['Informational_Duration'] + df['ProductRelated_Duration']
df['Total_Duration_Avg'] = (df['Total_Duration'])/3
df['Admin_Duration_percent_TotalDuration'] = df['Administrative_Duration'] / df['Total_Duration']
df['Info_Duration_percent_TotalDuration'] = df['Informational_Duration'] / df['Total_Duration']
df['Product_Duration_percent_TotalDuration'] = df['ProductRelated_Duration'] / df['Total_Duration']
df['TotalDuration_per_PageValues'] = df['Total_Duration'] / df['PageValues']


#Create per-PageValues ratios for the Administrative page count and duration
df['Admin_per_PageValues'] = df['Administrative'] / df['PageValues']
df['AdminDuration_per_PageValues'] = df['Administrative_Duration'] / df['PageValues']

#Create per-PageValues ratios for the Informational page count and duration
df['Informational_per_PageValues'] = df['Informational'] / df['PageValues']
df['Info_Duration_per_PageValues'] = df['Informational_Duration'] / df['PageValues']


#Create per-PageValues ratios for the ProductRelated page count and duration
df['ProductRelated_per_PageValues'] = df['ProductRelated'] / df['PageValues']
df['Product_Duration_per_PageValues'] = df['ProductRelated_Duration'] / df['PageValues']


#Create 'Exit_per_PageValues' to relate the exit rate to page value
df['Exit_per_PageValues'] = df['ExitRates'] / df['PageValues']


#Create 'Bounce_per_PageValues' to relate the bounce rate to page value
df['Bounce_per_PageValues'] = df['BounceRates'] / df['PageValues']


calcualted_cols = ['Admin_per_Exit', 'Admin_per_Bounce', 'Bounce_Exit_Rate_Avg', 'Bounce_Exit_Rate_WeightedAvg', 'Bounce_per_Exit_Rate', 'Total_Duration','Total_Duration_Avg','Admin_Duration_percent_TotalDuration',\
           'Info_Duration_percent_TotalDuration','Product_Duration_percent_TotalDuration','TotalDuration_per_PageValues', 'Admin_per_PageValues', 'AdminDuration_per_PageValues', 'Informational_per_PageValues',\
           'Info_Duration_per_PageValues','ProductRelated_per_PageValues', 'Product_Duration_per_PageValues', 'Exit_per_PageValues', 'Bounce_per_PageValues','Revenue']

display(df[calcualted_cols].sample(20))


# display(df[['Administrative', 'ExitRates', 'Admin_per_Exit', 'BounceRates', 'Bounce_Exit_Rate_Avg', 'Bounce_per_Exit_Rate',\
#            'Administrative_Duration','Informational_Duration', 'ProductRelated_Duration', 'Total_Duration', 'Total_Duration_Avg']].sample(20))

In [None]:
# Check the newly calculated features for missing (NaN) values.
# We also review the summary statistics for the calculated features.

df[calcualted_cols].info()
print('-' * 100)
display(df[calcualted_cols].isna().mean().round(4) * 100)
print('-' * 100)
display(df[calcualted_cols].describe())
print('-' * 100)
# Correlation Heatmap for the dataframe
spearman_calculated =df.corr(method ='spearman')
plt.figure(figsize=(35,15))
sns.heatmap(spearman_calculated, annot=True)
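
The missing-value check above matters for ratio features because a zero denominator does not raise an error in pandas; it quietly yields NaN or inf. The sketch below uses made-up values to show the effect and why the earlier +1 shift keeps every ratio finite. Column names mirror the notebook, but the frame itself is an illustration.

```python
import numpy as np
import pandas as pd

# Made-up values; the first two rows have a zero denominator.
toy = pd.DataFrame({
    "Administrative": [0.0, 3.0, 2.0],
    "ExitRates": [0.0, 0.0, 0.05],
})

# 0/0 gives NaN and 3/0 gives inf; neither raises an exception.
raw_ratio = toy["Administrative"] / toy["ExitRates"]

# With both columns shifted by +1 first, every ratio is finite.
shifted_ratio = (toy["Administrative"] + 1) / (toy["ExitRates"] + 1)

# isna() does not flag inf by default, so finiteness is checked explicitly.
print(raw_ratio.isna().sum(), np.isinf(raw_ratio).sum())
print(np.isfinite(shifted_ratio).all())
```

One side effect of the shift is that rate columns originally in the range 0-1 become denominators in the range 1-2, so the ratios vary far less than they would on the raw rates.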

In [None]:
# Code below retained in the event that it is needed in the future

# # Drop calculated columns with more than 20% missing values 
# df.drop(['Admin_per_Bounce','Admin_per_PageValuesLog','AdminDuration_per_PageValuesLog','Informational_per_PageValuesLog','Info_Duration_per_PageValuesLog','Bounce_per_PageValuesLog' ],axis=1, inplace = True)
# # check to see if the correct columns were dropped
# display(df.isna().mean().round(4) * 100)

    b. Data Imputation (NO LONGER NEEDED)

In [None]:
# Code below retained in the event that it is needed in the future


# # Mean imputation

# ## Impute mean values for missing values in Admin_per_Exit
# df['Admin_per_Exit'].fillna(df['Admin_per_Exit'].mean(), inplace=True)


# ## Impute mean values for missing values in Bounce_per_Exit_Rate
# df['Bounce_per_Exit_Rate'].fillna(df['Bounce_per_Exit_Rate'].mean(), inplace=True)

# ## Check the df
# df.isnull().sum()


    c. Binning for Categorical Features

In [None]:
# Viewing if there is a trend between months and Special day
print(df.groupby('Month')['SpecialDay'].sum())

# The observation is that Special days only occur in Feb and May so we will bin based on Feb, May, and Other

In [None]:
# Pipeline - Binning Categorical Features:

##Reduce categories for Operating Systems to the top 3 plus "other"
### Operating Systems – is a categorical variable and most of the data (~95%) is in one of three operating systems (2, 1, 3). 
def binning_operating_systems(B):
    if (B <= 3):
        return(B)
    else:
        return(4) # creating a category of 4 for all Operating Systems > 3

df['OperatingSystems_Bin']=df['OperatingSystems'].apply(binning_operating_systems)   # Creating a new column in the df


      
##Reduce categories for Browser to the top 3 plus "other"
### Browser – is a categorical variable and most of the data (~91%) come from three browsers (2, 1, 4).
def binning_browser(B):
    if (B == 3) or (B > 4): 
        return(3) # reusing code 3 as the "other" category for every Browser outside the top three (2, 1, 4)
    else:
        return(B) 

df['Browser_Bin']=df['Browser'].apply(binning_browser)   # Creating a new column in the df
      
      
##Reduce categories for Region to the top 4 plus "other"
### Region – is a categorical variable for region from which the visitor came. The top four account for ~77% of the data (i.e., region 1, 3, 4, 2).
def binning_region(B):
    if (B <= 4):
        return(B)
    else:
        return(5) # creating a category of 5 for all Regions > 4

df['Region_Bin']=df['Region'].apply(binning_region)   # Creating a new column in the df

      
      
##Reduce categories for TrafficType to the top 3 plus "other"
### TrafficType – is a categorical variable to indicate how visitor arrived at website. The top three account for approximately 67% of the referrals (i.e., types 2, 1, 3).
def binning_traffic_type(B):
    if (B <= 3):
        return(B)
    else:
        return(4) # creating a category of 4 for all Traffic Types > 3

df['TrafficType_Bin']=df['TrafficType'].apply(binning_traffic_type)   # Creating a new column in the df


##Create holiday/non-holiday bin for Feb/May = Holiday; others = Non-holiday
### Holiday_Bin – is a binary flag: 1 if the visit's month is Feb or May (the only months with special days), 0 otherwise.
def holiday_bin_func(month) :
    if month == 'May':
        return int(1)
    elif month == 'Feb':
        return int(1)
    else:
        return int(0)
    
df['Holiday_Bin'] = df['Month'].apply(holiday_bin_func)

##Reduce months to the top 4 in which there are transactions and "other"
def month_bin_func(month) :
    if month == 'May':
        return int(5)
    elif month == 'Nov':
        return int(11)
    elif month == 'Mar':
        return int(3)
    elif month == 'Dec':
        return int(12)
    else:
        return int(0)
    
df['Month_Bin'] = df['Month'].apply(month_bin_func)


##Encode month names to numerical representations

def month_func(month) :
    if month == 'Jan':
        return int(1)
    elif month == 'Feb':
        return int(2)
    elif month == 'Mar':
        return int(3)
    elif month == 'Apr':
        return int(4)
    elif month == 'May':
        return int(5)
    elif month == 'June':
        return int(6)
    elif month == 'Jul':
        return int(7)
    elif month == 'Aug':
        return int(8)
    elif month == 'Sep':
        return int(9)
    elif month == 'Oct':
        return int(10)
    elif month == 'Nov':
        return int(11)
    elif month == 'Dec':
        return int(12)

df['Month'] = df['Month'].apply(month_func)

# validate that each bin function worked as intended
display(df[['OperatingSystems', 'OperatingSystems_Bin', 'TrafficType', 'TrafficType_Bin', 'Browser', 'Browser_Bin', 'Region', 'Region_Bin', 'Month', 'Month_Bin','SpecialDay','Holiday_Bin']].sample(20))
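
The "top N plus other" binning above hard-codes the category codes in each function. A data-driven variant is sketched below; the helper `bin_top_n`, the `other_label` value, and the sample codes are assumptions made for illustration, and the labels it produces differ from the ones the team's functions assign.

```python
import pandas as pd

def bin_top_n(series, n, other_label):
    """Keep the n most frequent categories and map the rest to other_label."""
    top = series.value_counts().nlargest(n).index
    return series.where(series.isin(top), other_label)

# Made-up browser codes: 2, 1 and 4 are the three most frequent.
browser = pd.Series([2, 2, 2, 1, 1, 4, 4, 5, 6, 13])
browser_bin = bin_top_n(browser, n=3, other_label=0)
print(browser_bin.tolist())  # [2, 2, 2, 1, 1, 4, 4, 0, 0, 0]
```

Deriving the top categories from the data avoids editing the function if the frequency ranking changes, at the cost of making the bins depend on whatever sample is passed in.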


In [None]:
# check the data types of the newly created features
df[['OperatingSystems', 'OperatingSystems_Bin', 'TrafficType', 'TrafficType_Bin', 'Browser', 'Browser_Bin', 'Region', 'Region_Bin', 'Month', 'Month_Bin','SpecialDay','Holiday_Bin','Admin_per_Exit', 'Admin_per_Bounce',\
    'Bounce_Exit_Rate_Avg', 'Bounce_Exit_Rate_WeightedAvg', 'Bounce_per_Exit_Rate', 'Total_Duration','Total_Duration_Avg','Admin_Duration_percent_TotalDuration',\
           'Info_Duration_percent_TotalDuration','Product_Duration_percent_TotalDuration','TotalDuration_per_PageValues', 'Admin_per_PageValues', 'AdminDuration_per_PageValues', 'Informational_per_PageValues',\
           'Info_Duration_per_PageValues','ProductRelated_per_PageValues', 'Product_Duration_per_PageValues', 'Exit_per_PageValues', 'Bounce_per_PageValues']].info()

## 2. Data Preparation
    a. Outlier Handling

In [None]:
# Utilize IQR method to address outliers

def replace_columns_outliers_iqr(df, column_list): 
    for my_col in column_list:
        Q1 = df[my_col].quantile(0.25)
        Q3 = df[my_col].quantile(0.75)
        IQR = Q3 - Q1

        u_bound_q3 = (Q3 + 1.5 * IQR)
        l_bound_q1 = (Q1 - 1.5 * IQR)

        # Use .loc rather than chained indexing, which may fail to update df in place
        df.loc[df[my_col] > u_bound_q3, my_col] = u_bound_q3
        df.loc[df[my_col] < l_bound_q1, my_col] = l_bound_q1
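
The capping logic in the function above can also be expressed with `Series.clip`, which returns a new Series rather than modifying the dataframe in place. This is a sketch on made-up values; the helper name `clip_to_iqr_fences` is hypothetical.

```python
import pandas as pd

def clip_to_iqr_fences(series, k=1.5):
    """Cap values at Q1 - k*IQR and Q3 + k*IQR, returning a new Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up ratios: Q1 = 2, Q3 = 4, so the upper fence is 4 + 1.5 * 2 = 7.
ratios = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = clip_to_iqr_fences(ratios)
print(capped.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

Note that this approach, like the function above, replaces extreme values with the fence value instead of dropping rows, so the row count is unchanged.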

In [None]:
# Address Outliers for all calculated variables using the IQR method

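# Copy the list so that removing an entry from calcualted_cols2 does not also modify calcualted_cols
calcualted_cols2 = calcualted_cols.copy()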

In [None]:
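# Remove the target ('Revenue'), the last entry in the list, so that it is not capped as an outlier
del calcualted_cols2[-1]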

In [None]:
calcualted_cols2

In [None]:
replace_columns_outliers_iqr(df=df, column_list=calcualted_cols2) 
df[calcualted_cols2].describe()

In [None]:
# checking to see if the created attributes have null values
df.isnull().sum()

    b. Split the dataset

In [None]:
# Subsetting the data to be used for modeling
display(df.dtypes)
df_list =  df.columns
df_list

In [None]:
df_list = df_list.drop(['Revenue'])

In [None]:
# Encode Revenue before splitting the data to allow for modeling

from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
df['Revenue'] = enc.fit_transform(df['Revenue'])

In [None]:
# Splitting the data into X and y
X,y = df.loc[:,df_list],df.loc[:,'Revenue']

    c. Normalization

In [None]:
X_col_list= X.columns.tolist()
X_col_list

In [None]:
# Subset X between categorical and continuous features

X_continuous = ['Administrative','Administrative_Duration', 'Informational', 'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates', \
                'PageValues', 'PageValues_Log', 'PageValues_Log10','Admin_per_Exit', 'Admin_per_Bounce', 'Bounce_Exit_Rate_Avg', 'Bounce_Exit_Rate_WeightedAvg', 'Bounce_per_Exit_Rate', 'Total_Duration', \
                'Total_Duration_Avg', 'Admin_Duration_percent_TotalDuration', 'Info_Duration_percent_TotalDuration', 'Product_Duration_percent_TotalDuration', 'TotalDuration_per_PageValues', \
                'Admin_per_PageValues', 'AdminDuration_per_PageValues', 'Informational_per_PageValues', 'Info_Duration_per_PageValues', 'ProductRelated_per_PageValues', 'Product_Duration_per_PageValues', \
                'Exit_per_PageValues', 'Bounce_per_PageValues', 'OperatingSystems_Bin', 'Browser_Bin', 'Region_Bin', 'TrafficType_Bin', 'Holiday_Bin', 'Month_Bin']

X_categorical =['SpecialDay','Month','OperatingSystems','Browser','Region','TrafficType','VisitorType','Weekend', 'OperatingSystems_Bin', 'Browser_Bin','Region_Bin','TrafficType_Bin','Holiday_Bin','Month_Bin']

X_continuous_df = X.loc[:,X_continuous]
X_categorical_df = X.loc[:,X_categorical]

In [None]:
# Initially displaying the skewness of all attributes
skew_df = pd.DataFrame(X_continuous_df.skew())

#filter skew attributes by absolute values of 0.5
skew_over = skew_df[(skew_df > 0.5).any(axis=1)]
skew_under = skew_df[(skew_df < -0.5).any(axis=1)]
display(skew_over.index)
display(skew_under.index)
total_skew_df = pd.concat([skew_over, skew_under])

skew_cols = total_skew_df.index.tolist()

In [None]:
skew_cols
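
The two-sided filter above (one mask for skew above 0.5 and one for skew below -0.5) is equivalent to a single mask on the absolute value. The sketch below shows that form on a made-up frame; the column names are illustrative only.

```python
import pandas as pd

# Made-up columns: one symmetric, one right-skewed, one left-skewed.
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [0.0, 0.0, 0.0, 1.0, 50.0],
    "left_skewed": [50.0, 50.0, 50.0, 49.0, 0.0],
})

skewness = toy.skew()

# One mask on the absolute value covers both tails.
skewed_columns = skewness[skewness.abs() > 0.5].index.tolist()
print(skewed_columns)
```

This form also preserves the original column order, whereas concatenating the "over" and "under" frames lists all right-skewed columns first.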

In [None]:
# creating the list of cols to adjust for skewness

for i in skew_cols:
    X[i+'_skew'] = X[i]

# Build the list from skew_cols rather than from a hard-coded column count
cols_to_skew = [i+'_skew' for i in skew_cols]


In [None]:
cols_to_skew

In [None]:
# Normalize using quantile_transform for columns that have skewness (its default output_distribution is 'uniform')

from sklearn.preprocessing import quantile_transform
transformed_qt = quantile_transform(X[cols_to_skew],random_state=0,copy=True)
# Reuse X's index so that the assignment on the next line aligns row-for-row
transformed_qt_df = pd.DataFrame(transformed_qt,columns = cols_to_skew,index = X.index)
X[cols_to_skew] = transformed_qt_df[cols_to_skew]


In [None]:
display(X[cols_to_skew].skew())
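
It may help to see what `quantile_transform` does to a skewed column under its two output settings. The sketch below uses made-up exponential draws; `n_quantiles=100` and the sample size are arbitrary choices for the illustration and are not the settings used in the notebook.

```python
import numpy as np
from sklearn.preprocessing import quantile_transform

rng = np.random.default_rng(0)
# Made-up right-skewed feature (exponential draws), as a single column.
skewed = rng.exponential(scale=2.0, size=(500, 1))

# The default output_distribution is 'uniform': values land in [0, 1].
uniform = quantile_transform(skewed, n_quantiles=100, random_state=0, copy=True)

# 'normal' maps the same ranks onto a Gaussian shape instead.
gaussian = quantile_transform(
    skewed, n_quantiles=100, output_distribution="normal", random_state=0, copy=True
)

print(uniform.min(), uniform.max())
print(round(float(np.median(gaussian)), 2))
```

With the default setting the result is flattened toward a uniform distribution rather than a bell curve, which is one possible reason some columns still register as skewed after this step.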

In [None]:
Still_skew_df = pd.DataFrame(X[cols_to_skew].skew())

#filter skew attributes by absolute values of 0.5
still_skew_over = Still_skew_df[(Still_skew_df > 0.5).any(axis=1)]
still_skew_under = Still_skew_df[(Still_skew_df < -0.5).any(axis=1)]
display(still_skew_over.index)
display(still_skew_under.index)

col_still_skew_df = pd.concat([still_skew_over, still_skew_under])

cols_to_skew_2 = col_still_skew_df.index.tolist()


In [None]:
cols_to_skew_2

In [None]:
# creating a list of columns that need to be transformed due to skewness

# cols_to_skew_2 = ['Informational_skew','Informational_Duration_skew', 'PageValues_skew', 'PageValues_Log_skew', 'PageValues_Log10_skew', 'Holiday_Bin_skew']

In [None]:
# Normalize using PowerTransformer for remaining columns that continue to have skewness
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson')
transform = pt.fit(X[cols_to_skew_2])
transformed = pt.transform((X[cols_to_skew_2]))
# Reuse X's index so that the assignment on the next line aligns row-for-row
transformed_df = pd.DataFrame(transformed,columns = cols_to_skew_2,index = X.index)
X[cols_to_skew_2] = transformed_df[cols_to_skew_2]


In [None]:
# check for remaining skewness
display(X[cols_to_skew_2].skew())

    - After two methods for correcting for skewness we are left with 8 attributes whose distributions remain skewed.
    - We have decided to proceed without further adjustment to these attributes

In [None]:
# visually check for the columns that need to be rescaled based on a max higher than 1
display(X.max())

In [None]:
# review data types to ensure all data is processed for modeling
X.info()

In [None]:
# Encode the categorical variables that remain bool or object for modeling

from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
X['VisitorType'] = enc.fit_transform(X['VisitorType'])
X['Weekend'] = enc.fit_transform(X['Weekend'])
#X['Revenue'] = enc.fit_transform(X['Revenue'])


In [None]:
X[['VisitorType','Weekend']].info()

    d. Rescaling the data

In [None]:
col_list = X.columns

In [None]:
# process the attributes that have a range outside of zero to one (0 - 1)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
scale = scaler.fit(X)
scaled = scaler.transform(X)
scaled_df = pd.DataFrame(scaled,columns = col_list)
X_scaled = scaled_df
# X[scale_cols] = scaled_df[scale_cols]

In [None]:
# validate that the scaler worked as intended
X_scaled.describe()

    f. Naive Model/Baseline Model
>   <b> Assumption: </b> All features are useful for Y1 & Y2!
<br><b> Calculate: </b> How many times are you right?
<br><b> Reason: </b> This can be set as the baseline for our accuracy of Y1 and Y2 - the computer model should at least beat this in order for it to be better than guessing.
<br><b> Answer: </b> 
* <b> Y: </b> 71% accuracy - Original Base at time of Data Audit Report
* <b> Y: </b> 87% accuracy with an AUC of 0.89 at time of Initial Data Models
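
For the "how many times are you right?" question, one common reference point on imbalanced data is the accuracy of always predicting the majority class. The sketch below works this out on made-up labels with the ~85% / ~15% split reported for Revenue; it is a different kind of baseline from the Gaussian Naive Bayes baseline the notebook fits, and is shown only for comparison.

```python
import numpy as np

# Made-up labels with the ~85% / ~15% split reported for Revenue.
y_true = np.array([0] * 85 + [1] * 15)

# A majority-class baseline predicts the most frequent label for every row.
majority_label = np.bincount(y_true).argmax()
y_baseline = np.full_like(y_true, majority_label)

baseline_accuracy = (y_baseline == y_true).mean()
print(majority_label, baseline_accuracy)
```

Because such a baseline never predicts a sale, its accuracy looks respectable while its recall on the positive class is zero, which is why accuracy alone can mislead on this dataset.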

In [None]:
# resplit based on additional data prep completed post initial split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,test_size=0.3,random_state=500) 

In [None]:
#Create a Gaussian Classifier
from sklearn.naive_bayes import GaussianNB
gnb1 = GaussianNB()

#Train the model using the training sets
gnb1.fit(X_train, y_train)


#Predict the response for test dataset
y_pred = gnb1.predict(X_test)


In [None]:
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))


In [None]:
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)

display(cnf_matrix)


In [None]:
#AUC for y Base model
y_pred_proba = gnb1.predict_proba(X_test)[::,1]
fpr_base, tpr_base, _ = metrics.roc_curve(y_test, y_pred_proba)
auc_base = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr_base,tpr_base,label="auc="+str(auc_base))
plt.legend(loc=4)
plt.show()
print("")

class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix Y1', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')


In [None]:
print(metrics.classification_report(y_test, y_pred))
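
The figures in the classification report can be traced back to the four cells of the confusion matrix. The sketch below does that arithmetic on a made-up matrix laid out the way scikit-learn returns it (rows are actual classes, columns are predicted classes); the counts are invented for illustration.

```python
import numpy as np

# Made-up confusion matrix in scikit-learn's layout:
# rows = actual class, columns = predicted class, class order [0, 1].
cnf = np.array([[80, 5],
                [9, 6]])

tn, fp, fn, tp = cnf.ravel()

accuracy = (tp + tn) / cnf.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)         # of predicted sales, how many were real
recall = tp / (tp + fn)            # of real sales, how many were found

print(accuracy, round(precision, 3), recall)
```

In this toy case accuracy is 86% even though only 6 of the 15 actual sales are found, which illustrates why precision and recall for class 1 deserve attention alongside accuracy.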

    g. Reviewing Correlations to Identify Highly Related/Correlated Features to Avoid in Analysis:


In [None]:
#define new dataframe, df_Prepped, which contains all the features, as adjusted per the Data Preparation above, along with the target variable (Revenue)

df_Prepped = pd.concat([X,y],axis=1)

#check new dataframe, df_Prepped                
df_Prepped.head()





In [None]:
#Correlation Heatmap for the dataframe
spearman =df_Prepped.corr(method ='spearman')
plt.figure(figsize=(50,25))
sns.heatmap(spearman, annot=True)

In [None]:
corr_df = pd.DataFrame(X.corrwith(df_Prepped['Revenue']))

#filter attributes whose correlation with Revenue exceeds 0.09 in absolute value
corr_over = corr_df[(corr_df > 0.09).any(axis=1)]
corr_under = corr_df[(corr_df < -0.09).any(axis=1)]
display(corr_over.index)
display(corr_under.index)

corr_with_df = pd.concat([corr_over, corr_under])

model_cols = corr_with_df.index.tolist()
model_cols
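
The correlation screen above can be written with a single absolute-value mask. The sketch below reproduces the idea on a made-up frame, using the same 0.09 cutoff; the column names and values are illustrative assumptions, and the `noise` column is constructed to be uncorrelated with the target.

```python
import pandas as pd

# Made-up data; 'noise' has the same mean in both target classes by construction.
toy = pd.DataFrame({
    "page_values": [0, 0, 5, 0, 9, 0, 7, 3],
    "exit_rate": [0.20, 0.18, 0.02, 0.15, 0.01, 0.19, 0.03, 0.02],
    "noise": [1, 2, 1, 2, 2, 1, 1, 2],
    "revenue": [0, 0, 1, 0, 1, 0, 1, 1],
})

features = toy.drop(columns="revenue")
corr_with_target = features.corrwith(toy["revenue"])

threshold = 0.09  # same cutoff as the notebook cell above
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()
print(selected)
```

A cutoff this low admits almost any feature with a detectable linear relationship, so it works as a coarse first filter rather than a final selection.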

In [None]:
possible_Features_df = X[model_cols]

In [None]:
# determined possible features list by removing corresponding, duplicative features that were not adjusted for skewness
 
possible_Features_list =['Month',
 'Administrative_skew',
 'Administrative_Duration_skew',
 'Informational_skew',
 'Informational_Duration_skew',
 'ProductRelated_skew',
 'ProductRelated_Duration_skew',
 'PageValues_skew',
 'PageValues_Log_skew',
 'PageValues_Log10_skew',
 'Admin_per_Exit_skew',
 'Admin_per_Bounce_skew',
 'Total_Duration_skew',
 'Total_Duration_Avg_skew',
 'Bounce_per_Exit_Rate_skew',
 'VisitorType',
 'Exit_per_PageValues',
 'Bounce_per_PageValues',
 'BounceRates_skew',
 'ExitRates_skew',
 'Bounce_Exit_Rate_Avg_skew',
 'Bounce_Exit_Rate_WeightedAvg_skew',
 'Info_Duration_percent_TotalDuration_skew',
 'TotalDuration_per_PageValues_skew',
 'Admin_per_PageValues_skew',
 'AdminDuration_per_PageValues_skew',
 'ProductRelated_per_PageValues_skew',
 'Product_Duration_per_PageValues_skew','Revenue']    


In [None]:
#Correlation Heatmap for the dataframe
possible_Corr_df = df_Prepped.loc[:,possible_Features_list]
possible_Corr_df


In [None]:
spearman_possible =possible_Corr_df.corr(method ='spearman')
plt.figure(figsize=(30,15))
sns.heatmap(spearman_possible, annot=True)

In [None]:
# Based on the above:
# - Need to choose only one of the variables for which there is a corresponding duration variable - namely:
#    ~Administrative_skew or Administrative_Duration_skew (correlation 0.94)
#    ~Informational_skew or Informational_Duration_skew (correlation 0.95)
#    ~ProductRelated_skew or ProductRelated_Duration_skew (correlation 0.88)
#    **Initial decision: utilize Administrative_skew and ProductRelated_Duration_skew as they are more highly correlated with the target variable;
#      utilize Informational_skew because Informational_Duration_skew is incorporated into the Total_Duration variables (discussed below)

# - Need to choose only one of the variables from pairs that capture related information - namely:
#    ~BounceRates_skew or ExitRates_skew (correlation 0.60)
#    ~Bounce_Exit_Rate_Avg_skew or Bounce_per_Exit_Rate_skew (correlation 0.62)
#    ~Total_Duration_skew or Total_Duration_Avg_skew (correlation 1.00)
#    **Initial decision: For the first two, utilize the ones more strongly correlated with the corresponding Y values - namely: ExitRates_skew & Bounce_Exit_Rate_Avg_skew;
#      In the case of the third variable pair, we opt to use the value for Total_Duration_skew (rather than the averaged value)



In [None]:
# Filtered out intercorrelated features for feature importance
feature_importance_list = ['Month', 'Administrative_skew','ProductRelated_Duration_skew', 'Informational_skew','PageValues_skew','Admin_per_Exit_skew','Admin_per_Bounce_skew',\
                           'Bounce_Exit_Rate_WeightedAvg_skew', 'VisitorType','Exit_per_PageValues','Bounce_per_PageValues','Admin_per_PageValues_skew','TotalDuration_per_PageValues_skew',\
                           'ProductRelated_per_PageValues_skew']
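
The pairwise screening described above was done by reading the heatmap. As a complement, highly correlated pairs can be listed programmatically. The following is a sketch; the helper `correlated_pairs`, the 0.8 threshold, and the toy columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8, method="spearman"):
    """List feature pairs whose absolute correlation exceeds the threshold."""
    corr = frame.corr(method=method).abs()
    # Keep only the upper triangle so that each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Made-up data: the average is the total divided by 3, so the two rank identically.
toy = pd.DataFrame({
    "total_duration": [10.0, 40.0, 25.0, 90.0, 5.0, 60.0],
    "total_duration_avg": [10.0 / 3, 40.0 / 3, 25.0 / 3, 90.0 / 3, 5.0 / 3, 60.0 / 3],
    "exit_rate": [0.05, 0.20, 0.02, 0.10, 0.15, 0.01],
})

high_pairs = correlated_pairs(toy)
print(high_pairs)
```

A list like this makes it easier to confirm that only one member of each flagged pair ends up in the final feature list.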

    h. Split the data into training and test

In [None]:
# Split the df_Models dataset
X_Models,y_Models = possible_Corr_df.loc[:,feature_importance_list],possible_Corr_df.iloc[:,-1]

# Hold out 30% of the data for testing (as above for baseline) given the modest size of the dataset (~12,000 observations)
X_Models_train, X_Models_test, y_Models_train, y_Models_test = train_test_split(X_Models, y_Models, test_size=0.3,random_state=500) 

#print out the first five rows of the training data
display(X_Models_train.head(),y_Models_train.head())
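
Given the class imbalance noted for Revenue, one option worth considering is a stratified split, which keeps the share of positives roughly equal in the training and test partitions. The sketch below uses the `stratify` argument of `train_test_split` on made-up data; it is not what the cell above does, and the array names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 200 rows, 15% positives, mirroring the imbalance noted for Revenue.
X_toy = np.arange(400).reshape(200, 2)
y_toy = np.array([1] * 30 + [0] * 170)

# stratify=y_toy keeps the positive share (about) equal in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=500, stratify=y_toy
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```

Without stratification the positive share in a 30% test set can drift by chance, which makes accuracy and recall slightly harder to compare across runs.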

In [None]:
# Test the Gaussian Naive Bayes (GNB) model without feature importance

In [None]:
# import the metrics class
from sklearn import metrics

# import other required modules for confusion matrices
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')

### 1. Step 1: Specify the Model

In [None]:
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
gnb = GaussianNB()

### 2. Steps 2-4: Generate Test Data, Build the Models & Assess the Models for y2

In [None]:
#Train the model using the training sets - for y2 (Sale)
gnb.fit(X_Models_train, y_Models_train)

#Predict the response for test dataset for y2
y_NB_pred = gnb.predict(X_Models_test)

# Model Accuracy, how often is the classifier correct?
# Accuracy for y2
print("y Accuracy:",metrics.accuracy_score(y_Models_test, y_NB_pred))
print("")

#Can use classification report to assess model adequacy, too
print(metrics.classification_report(y_Models_test, y_NB_pred, labels=class_names))

#AUC for y
y_NB_pred_proba = gnb.predict_proba(X_Models_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_Models_test,  y_NB_pred_proba)
auc = metrics.roc_auc_score(y_Models_test, y_NB_pred_proba)
plt.plot(fpr,tpr,label="auc="+str(auc))
plt.legend(loc=4)
plt.show()
print("")

#Print Confusion Matrix
cnf_matrix = metrics.confusion_matrix(y_Models_test, y_NB_pred)
cnf_matrix
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')


In [None]:
#Conclusion - With accuracy of 81% for each y1 and y2, the Naive Bayes model is superior to the base model which had an accuracy of 71% for each target.
#However, there is still room for improvement.

# Part IV: Data Analysis/Modeling <a name="part4"></a>

##### add resampled data into models **
##### can also delete all of the y2 models

## A. Naive Bayes Model (NB)
>   <b> Assumption: </b> Model features all are independent; we have included all features we believe independent per the immediately preceding correlation analysis
<br><b> Calculate Accuracy: </b> How many times are you right?
<br><b> Answer: </b> 
* <b> Y1 </b>: 81.21% accuracy (AUC = 0.8741)
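
The answer above reports two different measures. As a rough illustration of how they differ, the sketch below computes both by hand on made-up labels and scores (not the shoppers data): accuracy depends on the 0.5 cut-off, while AUC only asks how often a positive case is scored above a negative one. The helper names `accuracy` and `auc_by_ranking` are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (made up for illustration)
toy_y = [0, 0, 0, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.6]
toy_pred = [1 if s >= 0.5 else 0 for s in toy_scores]  # same 0.5 threshold that predict() implies

toy_accuracy = accuracy(toy_y, toy_pred)
toy_auc = auc_by_ranking(toy_y, toy_scores)
print("accuracy:", toy_accuracy)
print("AUC:", toy_auc)
```

On real data, `metrics.accuracy_score` and `metrics.roc_auc_score` do the same job far more efficiently; the pairwise loop is only meant to show what the AUC number means.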

### 0. Step 0: Import Needed Packages

In [None]:
# import the metrics class
from sklearn import metrics

# import other required modules for confusion matrices
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')

### 1. Step 1: Specify the Model

In [None]:
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
gnb = GaussianNB()

### 2. Steps 2-4: Generate Test Data, Build the Models & Assess the Models for y2

In [None]:
#Train the model using the training sets - for y2 (Sale)
gnb.fit(X_Models_train, y_Models_train)

#Predict the response for test dataset for y2
y_NB_pred = gnb.predict(X_Models_test)

# Model Accuracy, how often is the classifier correct?
# Accuracy for y2
print("y Accuracy:",metrics.accuracy_score(y_Models_test, y_NB_pred))
print("")

#Can use classification report to assess model adequacy, too
class_names=[0,1] # name  of classes (defined before first use in this cell)
print(metrics.classification_report(y_Models_test, y_NB_pred, labels=class_names))

#AUC for y
y_NB_pred_proba = gnb.predict_proba(X_Models_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_Models_test,  y_NB_pred_proba)
auc = metrics.roc_auc_score(y_Models_test, y_NB_pred_proba)
plt.plot(fpr,tpr,label="auc="+str(auc))
plt.legend(loc=4)
plt.show()
print("")

#Print Confusion Matrix
cnf_matrix = metrics.confusion_matrix(y_Models_test, y_NB_pred)
cnf_matrix
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')


In [None]:
#Conclusion - With accuracy of 81% for each y1 and y2, the Naive Bayes model is superior to the base model which had an accuracy of 71% for each target.
#However, there is still room for improvement.

In [None]:
# MD's Modeling Work

## B. Decision Tree Model (DT)
>   <b> Assumption: </b> All features are useful for Y1 & Y2!
<br><b> Calculate: </b> How many times are you right?
<br><b> Reason: </b> This can be set as the baseline for our accuracy of Y1 and Y2 - the computer model should at least beat this in order for it to be better than guessing.
<br><b> Answer: </b> 
* <b> Y1 </b>: 85.56% accuracy
* <b> Y2: </b> 85.45% accuracy

#### Specify the Model

In [None]:
# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

In [None]:
# Just to take look at the models df to see if everything is correct
#df_Models.head()

#### Build the Model

In [None]:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree classifier
clf = clf.fit(X_Models_train,y_Models_train)

#Predict the response for test dataset
y_DT_pred = clf.predict(X_Models_test)

In [None]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_Models_test, y_DT_pred))

#### Assess the Model

In [None]:
# Evaluating the Classification Report
print(metrics.classification_report(y_Models_test, y_DT_pred))

In [None]:
# Evaluating the Confusion Matrix
print(metrics.confusion_matrix(y_Models_test, y_DT_pred))

In [None]:
#AUC for y
y_DT_pred_proba = clf.predict_proba(X_Models_test)[::,1]
fpr_DT, tpr_DT, _ = metrics.roc_curve(y_Models_test,  y_DT_pred_proba)
auc_DT = metrics.roc_auc_score(y_Models_test, y_DT_pred_proba)
plt.plot(fpr_DT,tpr_DT,label="auc="+str(auc_DT))
plt.legend(loc=4)
plt.show()
print("")

#Print Confusion Matrix
cnf_matrix_DT = metrics.confusion_matrix(y_Models_test, y_DT_pred)
cnf_matrix_DT
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_DT), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix DT', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

85.45% is much better than the baseline model of 70%
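
How meaningful that gap is depends on what the baseline does. One common reference point, assumed here rather than taken from the earlier baseline section, is the accuracy of always predicting the most frequent class. A minimal sketch on made-up labels (the helper name `majority_baseline_accuracy` is hypothetical):

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    return sum(label == majority_label for label in y_test) / len(y_test)


# Toy labels (made up): 0 = no sale, 1 = sale
toy_train = [0, 0, 0, 0, 1, 0, 1, 0, 0, 0]
toy_test = [0, 1, 0, 0, 0]

baseline = majority_baseline_accuracy(toy_train, toy_test)
print("majority-class baseline accuracy:", baseline)
```

With an imbalanced target, this number can already be high, so a model's accuracy is best read against it rather than against 50%.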

#### *** Attempted Model Optimization - Not 100% sure whether I optimized these correctly

##### y2 (Sale)

In [None]:
# Trying to optimize the Decision Tree Model by setting criterion="entropy" (information gain as the selection measure) and max_depth=3
# Did this in order to reduce the complexity of the Decision Tree, in hopes that it will yield better results
# Create Decision Tree classifier object
clf2 = DecisionTreeClassifier(criterion = "entropy", max_depth=3)

# Train Decision Tree classifier
clf2 = clf2.fit(X_Models_train,y_Models_train)

#Predict the response for test dataset
y_DT_pred2 = clf2.predict(X_Models_test)

In [None]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_Models_test, y_DT_pred2))

In [None]:
# Evaluating the Classification Report
print(metrics.classification_report(y_Models_test, y_DT_pred2))

In [None]:
# Evaluating the Confusion Matrix
print(metrics.confusion_matrix(y_Models_test, y_DT_pred2))

#AUC for y
y_DT_pred_proba2 = clf2.predict_proba(X_Models_test)[::,1]
fpr_DT2, tpr_DT2, _ = metrics.roc_curve(y_Models_test,  y_DT_pred_proba2)
auc_DT2 = metrics.roc_auc_score(y_Models_test, y_DT_pred_proba2)
plt.plot(fpr_DT2,tpr_DT2,label="auc="+str(auc_DT2))
plt.legend(loc=4)
plt.show()
print("")

#Print Confusion Matrix
cnf_matrix_DT2 = metrics.confusion_matrix(y_Models_test, y_DT_pred2)
cnf_matrix_DT2
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_DT2), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix DT 2', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

##### Not sure if the optimized model is overfitting, but 89.53% accuracy is better than the base Decision Tree Models
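
One way to probe the overfitting question is to compare accuracy on the training split, accuracy on the test split, and a cross-validated estimate. The sketch below does this on synthetic stand-in data from `make_classification` (an assumption; it is not the shoppers dataset), for an unrestricted tree and for a `max_depth=3` tree. A large gap between training and test accuracy is the usual warning sign.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the shoppers dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

results = {}
for depth in (None, 3):  # None = grow the tree until the leaves are pure
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    # 5-fold cross-validation on the training split only
    cv_acc = cross_val_score(tree, X_tr, y_tr, cv=5).mean()
    tree.fit(X_tr, y_tr)
    results[depth] = {
        "train": tree.score(X_tr, y_tr),
        "test": tree.score(X_te, y_te),
        "cv": cv_acc,
    }
    print("max_depth =", depth, results[depth])
```

If the shallow tree's training and test accuracy are close, that is some evidence it is not overfitting; the same comparison could be run with `X_Models_train` and `X_Models_test`.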

## C. Random Forest Model (RF)
>   <b> Assumption: </b> All features are useful for Y1 & Y2!
<br><b> Calculate: </b> How many times are you right?
<br><b> Reason: </b> This can be set as the baseline for our accuracy of Y1 and Y2 - the computer model should at least beat this in order for it to be better than guessing.
<br><b> Answer: </b> 
* <b> Y1 </b>: 90% accuracy
* <b> Y2: </b> 90% accuracy (not sure if Y2 result should be the same as Y1 but could make sense)

#### Specify the Model

In [None]:
# Building a Classifier
#Import scikit-learn dataset library
from sklearn import datasets

#### Build Model for y2 (Sale)

In [None]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create a Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=100)

#Train the model using the training sets
rfc.fit(X_Models_train,y_Models_train)

y_RF_pred = rfc.predict(X_Models_test)

In [None]:
# Evaluation of the Classification Report
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_Models_test, y_RF_pred))

In [None]:
# Evaluation of the Confusion Matrix
print(confusion_matrix(y_Models_test, y_RF_pred))

#AUC for y
y_RF_pred_proba = rfc.predict_proba(X_Models_test)[::,1]
fpr_RF, tpr_RF, _ = metrics.roc_curve(y_Models_test,  y_RF_pred_proba)
auc_RF = metrics.roc_auc_score(y_Models_test, y_RF_pred_proba)
plt.plot(fpr_RF,tpr_RF,label="auc="+str(auc_RF))
plt.legend(loc=4)
plt.show()
print("")

#Print Confusion Matrix
cnf_matrix_RF = metrics.confusion_matrix(y_Models_test, y_RF_pred)
cnf_matrix_RF
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_RF), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix RF', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

##### 90% accuracy for both Random Forest Models, so far better than the Decision Tree Models

#### More to do... Can do Feature Importance using scikit_learn - have the code for it but couldn't get it to run ***

In [None]:
# # Attempt on the feature selection using scikit-learn
# # import the package
# from sklearn.ensemble import RandomForestClassifier

# #Create a Random Forest Classifier
# clf=RandomForestClassifier(n_estimators=100)

# #Train the model using the training sets y_pred=clf.predict(X_test)
# clf.fit(X_Models_train,y_Models_train)

In [None]:
# import pandas as pd
# feature_imp = pd.Series(clf.feature_importances_,index=X_Models_train.columns).sort_values(ascending=False)
# feature_imp
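
For reference, `feature_importances_` holds one value per feature column, so the Series index has to be the feature names; a target vector such as `y_Models_test` has one entry per test row and will not line up. A minimal sketch on synthetic data with hypothetical column names (`feature_0`, `feature_1`, ...), assuming the training features are a DataFrame:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Index by the feature columns, not by the target
feature_imp = pd.Series(forest.feature_importances_, index=X_demo.columns).sort_values(ascending=False)
print(feature_imp)
```

With the notebook's own objects, the equivalent index would presumably be `X_Models_train.columns`.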

## D. Support Vector Machines (SVM) Classification Model (SVC)
>   <b> Assumption: </b> All features are useful for Y1 & Y2!
<br><b> Calculate: </b> How many times are you right?
<br><b> Reason: </b> This can be set as the baseline for our accuracy of Y1 and Y2 - the computer model should at least beat this in order for it to be better than guessing.
<br><b> Answer: </b> 
* <b> Y1 </b>: 70.78% accuracy
* <b> Y2: </b> 70.78% accuracy (not sure if Y2 result should be the same as Y1 but could make sense)

In [None]:
###1. Step 1: Specify the Model
from sklearn import svm
svc = svm.SVC(kernel='rbf',cache_size=7000,gamma= 'auto', C=5, probability =True, degree = 5) # alternatives tried: gamma=0.001; kernel='poly', 'rbf', 'linear' (degree is only used by the 'poly' kernel)

In [None]:
###2. Step 2: Generate Test Data
# X_train_svc, X_test_svc, y1_train_svc, y1_test_svc, y2_train_svc, y2_test_svc = train_test_split(X, y1,y2, test_size=0.3,random_state=1000) 
# X_Models_train, X_Models_test, y1_Models_train, y1_Models_test, y2_Models_train, y2_Models_test
#Standard Scale the data to allow for better SVC model performance
# scaler = StandardScaler().fit(X_Models_train) 
# standardized_X_svc = scaler.transform(X_Models_train) 
# standardized_X_test_svc = scaler.transform(X_Models_test)
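
The scaling step is commented out above, so the SVC is fit on unscaled features. RBF kernels are sensitive to feature scale, which may be part of why the SVC lags the other models. A hedged sketch of one way to wire the scaler in is a scikit-learn pipeline, shown here on synthetic data with one deliberately inflated feature (an assumption, not the shoppers data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one feature is put on a much larger scale
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_demo[:, 0] *= 1000.0
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The scaler is fit on the training split only and reused on the test split
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="auto", C=5))
scaled_svc.fit(X_tr, y_tr)

svc_test_acc = scaled_svc.score(X_te, y_te)
print("test accuracy with scaling:", svc_test_acc)
```

Keeping the scaler inside the pipeline means the test data never influences the scaling parameters.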

In [None]:
###3. Step 3: Build the Model
#Train the model using the training sets
svc.fit(X_Models_train, y_Models_train)

#Predict the response for test dataset
y_pred_svc = svc.predict(X_Models_test)

In [None]:
###4. Step 4: Assess the Model
print("Accuracy_svc:",metrics.accuracy_score(y_Models_test, y_pred_svc))
cnf_matrix_svc = metrics.confusion_matrix(y_Models_test, y_pred_svc)

print(cnf_matrix_svc)


#AUC for y1
y_SVM_pred_proba = svc.predict_proba(X_Models_test)[::,1]
fpr_SVM, tpr_SVM, _ = metrics.roc_curve(y_Models_test,  y_SVM_pred_proba)
auc_SVM = metrics.roc_auc_score(y_Models_test, y_SVM_pred_proba)
plt.plot(fpr_SVM,tpr_SVM,label="auc="+str(auc_SVM))
plt.legend(loc=4)
plt.show()
print("")

class_names_svc=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks_svc = np.arange(len(class_names_svc))
plt.xticks(tick_marks_svc, class_names_svc)
plt.yticks(tick_marks_svc, class_names_svc)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_svc), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix svc', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

print(classification_report(y_Models_test, y_pred_svc))
print("Accuracy_svc:",metrics.accuracy_score(y_Models_test, y_pred_svc))

## E. XGBoost Model (XGB)
>   <b> Assumption: </b> XXXAll features are useful for Y1 & Y2!XXX
<br><b> Calculate: </b> How many times are you right?
<br><b> Reason: </b> XXXThis can be set as the baseline for our accuracy of Y1 and Y2 - the computer model should at least beat this in order for it to be better than guessing.XXX
<br><b> Answer: </b> 
* <b> Y2 </b>: 90.19% accuracy
* <b> Y2: </b> XXX94.97 RMSEXXX - not sure this is right!!!

### 0. Step 0: Import Needed Packages

In [None]:
#Import Needed Packages
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

#Do preliminary work


### 1. Step 1: Specify the Model

In [None]:
#Instantiate an XGBoost Classifier Model
# 'binary:logistic' is the objective for a 0/1 target; 'reg:squarederror' is a regression objective
XGB_class = xgb.XGBClassifier(objective ='binary:logistic', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 100)


### 2. Steps 2-4: Generate Test Data, Build the Models & Assess the Models for y2 (Sale)

In [None]:
#Put Data into structure for XGBoost- for y2 (Sale) 
data_dmatrix = xgb.DMatrix(data=X_Models,label=y_Models)

#Train the model using the training sets for y1
XGB_class.fit(X_Models_train, y_Models_train)

#Predict the response for test dataset for y1
y_XGB_pred = XGB_class.predict(X_Models_test)

#Calculate RMSE for y2
rmse_XGB = np.sqrt(mean_squared_error(y_Models_test, y_XGB_pred))
print("XGBoost's RMSE for y2 is: %f" % (rmse_XGB))

#Create error ratio to evaluate results for y2
target_range_XGB = y_Models.max() - y_Models.min()
print("XGB target range is: %f" % (target_range_XGB))
error_ratio_XGB = rmse_XGB/target_range_XGB
print("XGBoost's Error Ratio for y2 is: %f" % (error_ratio_XGB))

In [None]:
#Note on the above: the target range is 1 because the target variable is binary (0/1), so max - min = 1 - 0 = 1. RMSE is a regression metric; accuracy and AUC are the more natural measures for a classifier.
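
For 0/1 labels and 0/1 predictions, each squared error is either 0 (correct) or 1 (wrong), so the mean squared error is simply the misclassification rate and RMSE is its square root. The toy example below (made-up labels, not model output) checks this arithmetic:

```python
import math

# Made-up binary labels and predictions
toy_true = [0, 1, 0, 0, 1, 0, 1, 0]
toy_pred = [0, 1, 0, 1, 0, 0, 1, 0]

n = len(toy_true)
accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / n
mse = sum((t - p) ** 2 for t, p in zip(toy_true, toy_pred)) / n
rmse = math.sqrt(mse)
target_range = max(toy_true) - min(toy_true)  # 1 - 0 for a binary target

print("accuracy:", accuracy)
print("MSE (equals 1 - accuracy):", mse)
print("RMSE:", rmse)
print("target range:", target_range)
```

So an RMSE computed on class predictions carries no information beyond accuracy; dividing by a range of 1 leaves it unchanged.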

In [None]:
# Model Accuracy, how often is the classifier correct?
# Accuracy for y2
print("y Accuracy:",metrics.accuracy_score(y_Models_test, y_XGB_pred))
print("")


#Can use classification report to assess model adequacy, too
class_names=[0,1] # name  of classes (defined before first use in this cell)
print(metrics.classification_report(y_Models_test, y_XGB_pred, labels=class_names))

#AUC for y1
y_XGB_pred_proba = XGB_class.predict_proba(X_Models_test)[::,1]
fpr_XGB, tpr_XGB, _ = metrics.roc_curve(y_Models_test,  y_XGB_pred_proba)
auc_XGB = metrics.roc_auc_score(y_Models_test, y_XGB_pred_proba)
plt.plot(fpr_XGB,tpr_XGB,label="auc="+str(auc_XGB))
plt.legend(loc=4)
plt.show()
print("")

#Print Confusion Matrix
cnf_matrix_XGB = metrics.confusion_matrix(y_Models_test, y_XGB_pred)
cnf_matrix_XGB
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_XGB), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix XGB', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

# More to go...including: k-fold cross-validation, visualization for feature importance & hyper-parameter tuning to improve model
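
As a rough sketch of what the k-fold cross-validation and hyper-parameter tuning steps could look like, the example below runs a small grid search with stratified 5-fold CV. To stay self-contained it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost and synthetic data from `make_classification`; the grid values are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (not the shoppers dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Small illustrative grid; real tuning would likely explore more values
param_grid = {"max_depth": [2, 3], "learning_rate": [0.05, 0.1]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid,
    scoring="roc_auc",  # rank-based metric, less sensitive to class imbalance than accuracy
    cv=cv,
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best cross-validated AUC:", search.best_score_)
```

`GridSearchCV` accepts any estimator with the scikit-learn interface, so the same pattern should carry over to `XGB_class` with its own parameter names.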

## F. Neural Network Model (NN)
>   <b> Assumption: </b> All features are useful for Y1 & Y2!
<br><b> Calculate: </b> How many times are you right?
<br><b> Reason: </b> This can be set as the baseline for our accuracy of Y1 and Y2 - the computer model should at least beat this in order for it to be better than guessing.
<br><b> Answer: </b> 
* <b> Y1 </b>: 90.26% accuracy
* <b> Y2: </b> -% accuracy (not sure if Y2 result should be the same as Y1 but could make sense)

In [None]:
# X_Models_train, X_Models_test, y1_Models_train, y1_Models_test, y2_Models_train, y2_Models_test


###1. Step 1: Specify the Model
# Import the model
from sklearn.neural_network import MLPClassifier

# Initializing the multilayer perceptron
# mlp = MLPClassifier(hidden_layer_sizes = (3,1),solver='sgd',learning_rate_init= 0.01, max_iter=50)
mlp= MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto', beta_1=0.9, 
beta_2=0.999, early_stopping=False, epsilon=1e-08,       
hidden_layer_sizes=(7,3), learning_rate='adaptive',      
learning_rate_init=0.01, max_iter=10000, momentum=0.9,       
nesterovs_momentum=True, power_t=0.5, random_state=1000,       
shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,       
verbose=False, warm_start=True)

In [None]:
###2. Step 2: Generate Test Data
# Train the model
mlp.fit(X_Models_train, y_Models_train)

In [None]:
###3. Step 3: Build the Model
y_pred_nn = mlp.predict(X_Models_test)


In [None]:
###4. Step 4: Assess the Model
# Score takes a feature matrix X_test and the expected target values y_test. 
# Predictions for X_test are compared with y_test

print("MLP score is",mlp.score(X_Models_test,y_Models_test))

# Accuracy for y NN
#print("y Accuracy NN:",metrics.accuracy_score(y_Models_test, y_XGB_pred))
#print("")

###4. Step 4: Assess the Model
print("Accuracy_nn:",metrics.accuracy_score(y_Models_test, y_pred_nn))
cnf_matrix_nn = metrics.confusion_matrix(y_Models_test, y_pred_nn)

print(cnf_matrix_nn)

# Plot ROC curve and AUC
y_pred_proba_nn = mlp.predict_proba(X_Models_test)[::,1]
fpr_nn, tpr_nn, _ = metrics.roc_curve(y_Models_test,  y_pred_proba_nn)
auc_nn = metrics.roc_auc_score(y_Models_test, y_pred_proba_nn)
plt.plot(fpr_nn,tpr_nn,label="data 1, auc="+str(auc_nn))
plt.legend(loc=4)
plt.show()


class_names_nn=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks_nn = np.arange(len(class_names_nn))
plt.xticks(tick_marks_nn, class_names_nn)
plt.yticks(tick_marks_nn, class_names_nn)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_nn), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix NN', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

print(classification_report(y_Models_test, y_pred_nn))
print("Accuracy_nn:",metrics.accuracy_score(y_Models_test, y_pred_nn))


## Logistic Regression Model

>   <b> Assumption: </b> All features are useful for Y
<br><b> Calculate: </b> How many times are you right?
<br><b> Reason: </b> This can be set as the baseline for our accuracy of Y - the computer model should at least beat this in order for it to be better than guessing.
<br><b> Answer: </b> 
* <b> Y </b>: 90.3% accuracy - about the same as our other best scores

In [None]:
#Build the model
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X_Models_train,y_Models_train)

# predict the response for the test dataset
y_LR_pred=logreg.predict(X_Models_test)

In [None]:
# import the metrics class for the Confusion Matrix
from sklearn import metrics
cnf_matrix_LogR = metrics.confusion_matrix(y_Models_test, y_LR_pred)
cnf_matrix_LogR

In [None]:
# Visualizing the Confusion Matrix

class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_LogR), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix Log Reg', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

In [None]:
# ROC Curve
y_LR_pred_proba = logreg.predict_proba(X_Models_test)[::,1]
fpr_LR, tpr_LR, _ = metrics.roc_curve(y_Models_test,  y_LR_pred_proba)
auc_LR = metrics.roc_auc_score(y_Models_test, y_LR_pred_proba)
plt.plot(fpr_LR,tpr_LR,label="data 1, auc="+str(auc_LR))
plt.legend(loc=4)
plt.show()

In [None]:
print("Accuracy:",metrics.accuracy_score(y_Models_test, y_LR_pred))
print("Precision:",metrics.precision_score(y_Models_test, y_LR_pred))
print("Recall:",metrics.recall_score(y_Models_test, y_LR_pred))
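
The three numbers printed above all come from the same confusion matrix. The sketch below shows the arithmetic on a made-up 2x2 matrix in the scikit-learn layout (rows = actual, columns = predicted); the helper name `summarize_confusion` is hypothetical.

```python
def summarize_confusion(cnf):
    """Return (accuracy, precision, recall) for a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cnf
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)  # of the predicted sales, how many were real
    recall = tp / (tp + fn)     # of the real sales, how many were found
    return accuracy, precision, recall


# Made-up counts for illustration
toy_cnf = [[90, 10], [20, 30]]
toy_accuracy, toy_precision, toy_recall = summarize_confusion(toy_cnf)
print("accuracy:", toy_accuracy)
print("precision:", toy_precision)
print("recall:", toy_recall)
```

With an imbalanced target, accuracy can look strong while recall on the sale class stays low, which is why all three are worth reporting.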

In [None]:
#90.3% accuracy 

## K-Means Model

In [None]:
#important packages

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
plt.style.use('ggplot')
%matplotlib inline

In [None]:
X_kmeans = np.array(X_Models_train)
y_kmeans = np.array(y_Models_train)

In [None]:
# Build the model

# load the model
kmeans = KMeans(n_clusters=2, max_iter=600) # 2 clusters, sale or no sale; algorithm='auto' omitted because newer scikit-learn releases no longer accept it
kmeans.fit(X_kmeans)

In [None]:
# Predictions
correct = 0
for i in range(len(X_kmeans)):
    predict_me = np.array(X_kmeans[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y_kmeans[i]:
        correct += 1

print(correct/len(X_kmeans))
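
One caveat when reading this number: K-Means is unsupervised, so which cluster gets id 0 and which gets id 1 is arbitrary. A direct comparison with the 0/1 target can therefore come out as either the accuracy or one minus the accuracy. A small sketch of a mapping-aware score for the two-cluster case, on made-up labels (the helper name `cluster_accuracy` is hypothetical):

```python
def cluster_accuracy(y_true, cluster_ids):
    """Agreement between two-cluster ids and binary targets, under the better of the two label mappings."""
    direct = sum(t == c for t, c in zip(y_true, cluster_ids)) / len(y_true)
    flipped = 1.0 - direct  # agreement if cluster 0 is read as class 1 and vice versa
    return max(direct, flipped)


# Made-up targets and cluster ids where the cluster numbering happens to be reversed
toy_y = [0, 0, 1, 1, 0, 1]
toy_clusters = [1, 1, 0, 0, 1, 1]

toy_cluster_acc = cluster_accuracy(toy_y, toy_clusters)
print("mapping-aware agreement:", toy_cluster_acc)
```

The row-by-row loop above could also be replaced by a single `kmeans.predict(X_kmeans)` call, which returns all cluster ids at once.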

# Resampling & Feature Importance: Next Steps 

## Run the models with resampled data due to imbalance in 'Y'

In [None]:
# Over sample the data using SMOTE

from collections import Counter
from imblearn.over_sampling import SMOTE

print('Original dataset shape %s' % Counter(y_Models_train))
# Named "smote" rather than "sm" so the statsmodels alias (import statsmodels.api as sm) is not overwritten
smote = SMOTE(random_state=42)
over_sampl_X_Models_train, over_sampled_y_Models_train = smote.fit_resample(X_Models_train, y_Models_train) #DWM Note: Are we SURE on "Models" in last X_train, y2_train - per SMOTE doc from Tao
print('Resampled dataset shape %s' % Counter(over_sampled_y_Models_train))


In [None]:
#Print 5 randomly sampled rows (variable name matches the one assigned by SMOTE above)
display(over_sampl_X_Models_train.sample(5))

In [None]:
# Under sample the data using Near Miss
# https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.under_sampling.NearMiss.html
from collections import Counter
from imblearn.under_sampling import NearMiss

print('Original dataset shape %s' % Counter(y_Models_train))
nm = NearMiss(sampling_strategy='all')
under_sampled_X_Models_train, under_sampled_y_Models_train = nm.fit_resample(X_Models_train, y_Models_train) #DWM Note: Are we SURE on "Models" in last X_train, y2_train - per SMOTE doc from Tao
print('Under sampled dataset shape %s' % Counter(under_sampled_y_Models_train))

In [None]:
#Print 5 randomly sampled rows
display(under_sampled_X_Models_train.sample(5))
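
Both resampling cells operate on the training split only, which keeps the test split at its natural class balance. To illustrate that idea without extra dependencies, the sketch below uses plain random duplication of minority rows. This is a simpler technique than SMOTE (which synthesizes new points) and is swapped in here purely for illustration; the data and the helper name `random_oversample` are made up.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Made-up training and test splits: 0 = no sale, 1 = sale
train_rows = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [9.0]]
train_labels = [0, 0, 0, 0, 0, 0, 1]
test_rows = [[7.0], [8.0], [10.0]]
test_labels = [0, 0, 1]

balanced_rows, balanced_labels = random_oversample(train_rows, train_labels)
print("training counts before:", Counter(train_labels))
print("training counts after:", Counter(balanced_labels))
print("test counts (untouched):", Counter(test_labels))
```

Evaluating on the untouched test split, as the models below do with `X_Models_test`, keeps the reported metrics comparable with the models trained on the original data.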

In [None]:
# # Attempt on the feature selection using scikit-learn
# # import the package
# from sklearn.ensemble import RandomForestClassifier

# #Create a Random Forest Classifier
# clf=RandomForestClassifier(n_estimators=100)

# #Train the model using the training sets y_pred=clf.predict(X_test)
# clf.fit(X_Models_train,y_Models_train)


# https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html#sklearn.feature_selection.RFECV

In [None]:
# import pandas as pd
# feature_imp = pd.Series(clf.feature_importances_,index=X_Models_train.columns).sort_values(ascending=False)
# feature_imp

    d. One Hot Encoding & Label Encoding

In [None]:
# # Pipeline - Encoding:
# oh_enc = OneHotEncoder(sparse=False) # initializing One-Hot Encoder Function

# ## One-Hot for VisitorType to create new columns for 'Returning_Visitor' and 'New_Visitor'
# encoder_visitortype = X[['VisitorType']].values
# visitortype_encoded = encoder_visitortype.reshape(len(encoder_visitortype), 1)
# visitortype_onehot_encoded = oh_enc.fit_transform(visitortype_encoded)
# visitortype_onehot_df = pd.DataFrame(visitortype_onehot_encoded, columns = ["Returning_Visitor", "New_Visitor"])
# visitortype_onehot_df.head()

# ## Creating list for newly-created columns for VisitorType
# visitor_list = visitortype_onehot_df.columns


# ## One-Hot for Weekend to create new columns for 'Is_Weekend' and 'Not_Weekend'
# encoder_weekend = df[['Weekend']].values
# weekend_encoded = encoder_weekend.reshape(len(encoder_weekend), 1)
# weekend_onehot_encoded = oh_enc.fit_transform(weekend_encoded)
# weekend_onehot_df = pd.DataFrame(weekend_onehot_encoded, columns = ["Not_Weekend", "Is_Weekend"])
# weekend_onehot_df.head()

# ## Creating list for newly-created columns for Weekend
# weekend_list = weekend_onehot_df.columns


# ## Combine Holiday Seasons Months
# ## One-Hot for Month to create new columns for "Month_May", "Month_Nov","Month_Mar", "Month_Dec", "Month_Other"
# encoder_month = df[['Month_Bin']].values
# month_onehot_encoded = oh_enc.fit_transform(encoder_month)
# month_onehot_df = pd.DataFrame(month_onehot_encoded,  columns = ["Month_May", "Month_Nov","Month_Mar", "Month_Dec", "Month_Other"])
# month_onehot_df.head()

# ## Creating list for newly-created columns for Month
# month_list = month_onehot_df.columns
# df [month_list] = month_onehot_df.loc[:,month_list]
# df['VisitorType'] = df['VisitorType'].replace('Other','Returning_Visitor')

# # ## We can delete the revenue encoding:
# # ## One-Hot Encoding for Y to separate into two new columns for 'Sale' and 'No_Sale'
# # encoder_revenue = df[['Revenue']].values
# # revenue_encoded = encoder_revenue.reshape(len(encoder_revenue), 1)
# # revenue_onehot_encoded = oh_enc.fit_transform(revenue_encoded)
# # revenue_onehot_df = pd.DataFrame(revenue_onehot_encoded, columns = ["No_Sale", "Sale"])
# # revenue_onehot_df.head(30)

# # ## Creating list for newly-created columns for Y
# # rev_list = revenue_onehot_df.columns

# ## Code to add newly created columns to the df
# X[month_list] = month_onehot_df.loc[:,month_list]
# X[visitor_list] = visitortype_onehot_df.loc[:,visitor_list]
# X[weekend_list] = weekend_onehot_df.loc[:,weekend_list]


## A. Naive Bayes Model (NB) Over Under Sampled
>   <b> Assumption: </b> Model features all are independent; we have included all features we believe independent per the immediately preceding correlation analysis
<br><b> Calculate Accuracy: </b> How many times are you right?
<br><b> Answer: </b> 
* <b> Y1 </b>: 81.21% accuracy (AUC = 0.8741)

### 0. Step 0: Import Needed Packages

In [None]:
# import the metrics class
from sklearn import metrics

# import other required modules for confusion matrices
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')

### 1. Step 1: Specify the Model

In [None]:
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
gnbo = GaussianNB()
gnbu = GaussianNB()

### 2. Steps 2-4: Generate Test Data, Build the Models & Assess the Models for y2

In [None]:
#Train the model using the training sets - for y2 (Sale)
gnbo.fit(over_sampl_X_Models_train, over_sampled_y_Models_train)
gnbu.fit(under_sampled_X_Models_train, under_sampled_y_Models_train)


#Predict the response for test dataset for y2
y_NB_pred_over_s = gnbo.predict(X_Models_test)
y_NB_pred_under_s = gnbu.predict(X_Models_test)

# Model Accuracy, how often is the classifier correct?
# Accuracy for y2
print("y Accuracy:",metrics.accuracy_score(y_Models_test, y_NB_pred_over_s))
print("")
print("y Accuracy:",metrics.accuracy_score(y_Models_test, y_NB_pred_under_s))
print("")

#Can use classification report to assess model adequacy, too
class_names=[0,1] # name  of classes (defined before first use in this cell)
print(metrics.classification_report(y_Models_test, y_NB_pred_over_s, labels=class_names))
print(metrics.classification_report(y_Models_test, y_NB_pred_under_s, labels=class_names))

#AUC for y over sampled
y_NB_pred_proba = gnbo.predict_proba(X_Models_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_Models_test,  y_NB_pred_proba)
auc = metrics.roc_auc_score(y_Models_test, y_NB_pred_proba)
plt.plot(fpr,tpr,label="auc="+str(auc))
plt.legend(loc=4)
plt.show()
print("")

#AUC for y under sampled
y_NB_pred_proba = gnbu.predict_proba(X_Models_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_Models_test,  y_NB_pred_proba)
auc = metrics.roc_auc_score(y_Models_test, y_NB_pred_proba)
plt.plot(fpr,tpr,label="auc="+str(auc))
plt.legend(loc=4)
plt.show()
print("")


#Print Confusion Matrix
cnf_matrix_over = metrics.confusion_matrix(y_Models_test, y_NB_pred_over_s)
cnf_matrix_over
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_over), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix Over Sampled', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

#Print Confusion Matrix
cnf_matrix_under = metrics.confusion_matrix(y_Models_test, y_NB_pred_under_s)
cnf_matrix_under
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_under), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix under sampled', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')



In [None]:
#Conclusion - With accuracy of 81% for each y1 and y2, the Naive Bayes model is superior to the base model which had an accuracy of 71% for each target.
#However, there is still room for improvement.

In [None]:
# MD's Modeling Work

## B. Decision Tree Model (DT) Over Under Sampled
>   <b> Assumption: </b> All features are useful for Y1 & Y2!
<br><b> Calculate: </b> How many times are you right?
<br><b> Reason: </b> This can be set as the baseline for our accuracy of Y1 and Y2 - the computer model should at least beat this in order for it to be better than guessing.
<br><b> Answer: </b> 
* <b> Y1 </b>: 85.56% accuracy
* <b> Y2: </b> 85.45% accuracy

#### Specify the Model

In [None]:
# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

In [None]:
# Just to take look at the models df to see if everything is correct
df_Models.head()

#### Build the Model

In [None]:
# Create Decision Tree classifier objects (one for over-sampled data, one for under-sampled data)
clfo = DecisionTreeClassifier(criterion = "entropy", max_depth=3)
clfu = DecisionTreeClassifier(criterion = "entropy", max_depth=3)

# Train Decision Tree classifiers
clfo = clfo.fit(over_sampl_X_Models_train, over_sampled_y_Models_train)
clfu = clfu.fit(under_sampled_X_Models_train, under_sampled_y_Models_train)


#Predict the response for test dataset
y_DT_pred_over = clfo.predict(X_Models_test)
y_DT_pred_under = clfu.predict(X_Models_test) 


In [None]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_Models_test, y_DT_pred_over))
print("Accuracy:",metrics.accuracy_score(y_Models_test, y_DT_pred_under))

#### Assess the Model

In [None]:
# Evaluating the Classification Report
print(metrics.classification_report(y_Models_test, y_DT_pred_over))
print(metrics.classification_report(y_Models_test, y_DT_pred_under))

In [None]:
# Evaluating the Confusion Matrix
print(metrics.confusion_matrix(y_Models_test, y_DT_pred_over))
print(metrics.confusion_matrix(y_Models_test, y_DT_pred_under))

In [None]:
#AUC for y
y_DT_pred_proba_over = clfo.predict_proba(X_Models_test)[::,1]
fpr_DT_over, tpr_DT_over, _ = metrics.roc_curve(y_Models_test,  y_DT_pred_proba_over)
auc_DT_over = metrics.roc_auc_score(y_Models_test, y_DT_pred_proba_over)
plt.plot(fpr_DT_over,tpr_DT_over,label="auc="+str(auc_DT_over))
plt.legend(loc=4)
plt.show()
print("")

#AUC for y
y_DT_pred_proba_under = clfu.predict_proba(X_Models_test)[::,1]
fpr_DT_under, tpr_DT_under, _ = metrics.roc_curve(y_Models_test,  y_DT_pred_proba_under)
auc_DT_under = metrics.roc_auc_score(y_Models_test, y_DT_pred_proba_under)
plt.plot(fpr_DT_under,tpr_DT_under,label="auc="+str(auc_DT_under))
plt.legend(loc=4)
plt.show()
print("")

#Print Confusion Matrix
cnf_matrix_DT_over = metrics.confusion_matrix(y_Models_test, y_DT_pred_over)
cnf_matrix_DT_over
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_DT_over), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix DT Over', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

#Print Confusion Matrix
cnf_matrix_DT_under = metrics.confusion_matrix(y_Models_test, y_DT_pred_under)
cnf_matrix_DT_under
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_DT_under), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix DT Under', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

An accuracy of 85.45% is much better than the baseline model's 70%.
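
When the classes are imbalanced, it helps to state the baseline explicitly alongside the model. The sketch below is an illustration on synthetic data (an assumption, not the shoppers data): it compares a majority-class baseline with a shallow decision tree using both plain accuracy and balanced accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 85% of rows in class 0 (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1
)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=1).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
baseline_bal_acc = balanced_accuracy_score(y_te, baseline.predict(X_te))
tree_acc = accuracy_score(y_te, tree.predict(X_te))
tree_bal_acc = balanced_accuracy_score(y_te, tree.predict(X_te))

print("Baseline accuracy:", baseline_acc, "balanced:", baseline_bal_acc)
print("Tree accuracy:    ", tree_acc, "balanced:", tree_bal_acc)
```

The baseline's plain accuracy equals the share of the majority class, while its balanced accuracy is 0.5, so balanced accuracy makes the gain over guessing easier to read.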

## C. Random Forest Model (RF) Over Under Sampled
>   <b> Assumption: </b> All features are useful for Y1 & Y2!
<br><b> Calculate: </b> How many times are you right?
<br><b> Reason: </b> This can be set as the baseline for our accuracy of Y1 and Y2 - the computer model should at least beat this in order for it to be better than guessing.
<br><b> Answer: </b> 
* <b> Y1 </b>: 90% accuracy
* <b> Y2: </b> 90% accuracy (not sure if Y2 result should be the same as Y1 but could make sense)

#### Specify the Model

In [None]:
# Building a Classifier
#Import scikit-learn dataset library
from sklearn import datasets

In [None]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create Random Forest classifiers (one for over-sampled, one for under-sampled data)
rfco = RandomForestClassifier(n_estimators=100)
rfcu = RandomForestClassifier(n_estimators=100)


#Train the models using the training sets, then predict on the test set
rfco.fit(over_sampl_X_Models_train, over_sampled_y_Models_train)
rfcu.fit(under_sampled_X_Models_train, under_sampled_y_Models_train)

y_RF_over_pred = rfco.predict(X_Models_test)
y_RF_under_pred = rfcu.predict(X_Models_test)


In [None]:
# Evaluation of the Classification Report
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_Models_test, y_RF_over_pred))
print(classification_report(y_Models_test, y_RF_under_pred))

In [None]:
# Evaluation of the Confusion Matrix
print(confusion_matrix(y_Models_test, y_RF_over_pred))
print(confusion_matrix(y_Models_test, y_RF_under_pred))


#AUC for y
y_RF_over_pred_proba = rfco.predict_proba(X_Models_test)[::,1]
fpr_RF_over, tpr_RF_over, _ = metrics.roc_curve(y_Models_test,  y_RF_over_pred_proba)
auc_RF_over = metrics.roc_auc_score(y_Models_test, y_RF_over_pred_proba)
plt.plot(fpr_RF_over,tpr_RF_over,label="auc="+str(auc_RF_over))
plt.legend(loc=4)
plt.show()
print("")

#Print Confusion Matrix
cnf_matrix_RF_over = metrics.confusion_matrix(y_Models_test, y_RF_over_pred) 
cnf_matrix_RF_over
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_RF_over), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix RF Over sampled', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

In [None]:
# Evaluation of the Confusion Matrix
print(confusion_matrix(y_Models_test, y_RF_under_pred))

#AUC for y
y_RF_under_pred_proba = rfcu.predict_proba(X_Models_test)[::,1]
fpr_RF_under, tpr_RF_under, _ = metrics.roc_curve(y_Models_test,  y_RF_under_pred_proba)
auc_RF_under = metrics.roc_auc_score(y_Models_test, y_RF_under_pred_proba)
plt.plot(fpr_RF_under, tpr_RF_under,label="auc="+str(auc_RF_under))
plt.legend(loc=4)
plt.show()
print("")

#Print Confusion Matrix
cnf_matrix_RF_under = metrics.confusion_matrix(y_Models_test, y_RF_under_pred)
cnf_matrix_RF_under
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_RF_under), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix RF under sampled', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

##### 90% accuracy for both Random Forest Models, so far better than the Decision Tree Models

#### More to do... Feature importance can be computed using scikit-learn - we have the code for it but couldn't get it to run ***
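
A fitted random forest exposes one importance value per feature, so the values need to be indexed by the feature (column) names rather than by anything that has one entry per row. A minimal sketch on synthetic data follows; the column names `feature_0` ... `feature_5` are hypothetical stand-ins for the real columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with named columns (assumption for illustration)
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]  # hypothetical names
X_demo = pd.DataFrame(X_arr, columns=feature_names)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X_demo, y_demo)

# One importance per column, indexed by column name, largest first
feature_imp = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(feature_imp)
```

With the notebook's own objects, the index would be the columns of the training feature frame, assuming it is a pandas DataFrame.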

In [None]:
# # Attempt on the feature selection using scikit-learn
# # import the package
# from sklearn.ensemble import RandomForestClassifier

# #Create a Random Forest Classifier
# clf=RandomForestClassifier(n_estimators=100)

# #Train the model using the training sets
# clf.fit(X_Models_train,y_Models_train)

In [None]:
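# import pandas as pd
# # one importance per feature, so index by the training columns (assumes X_Models_train is a DataFrame)
# feature_imp = pd.Series(clf.feature_importances_,index=X_Models_train.columns).sort_values(ascending=False)
# feature_imp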

## D. Support Vector Machines (SVM) Classification Model (SVC) Over Under Sampled
>   <b> Assumption: </b> All features are useful for Y1 & Y2!
<br><b> Calculate: </b> How many times are you right?
<br><b> Reason: </b> This can be set as the baseline for our accuracy of Y1 and Y2 - the computer model should at least beat this in order for it to be better than guessing.
<br><b> Answer: </b> 
* <b> Y1 </b>: 70.78% accuracy
* <b> Y2: </b> 70.78% accuracy (not sure if Y2 result should be the same as Y1 but could make sense)
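
An RBF-kernel SVC is distance based, so features on very different scales can dominate the kernel. The commented-out standardization in Step 2 below addresses this. One way to keep the scaler and the model together is a pipeline, sketched here on synthetic data (an assumption for illustration, not the team's result).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one feature is put on a much larger scale on purpose
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# The scaler is fit on the training data only and reused for the test data
svc_scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="auto", C=5))
svc_scaled.fit(X_tr, y_tr)

scaled_acc = svc_scaled.score(X_te, y_te)
print("Accuracy with scaling inside the pipeline:", scaled_acc)
```

Because the pipeline fits the scaler only on the data passed to `fit`, the test set never influences the scaling parameters.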

In [None]:
###1. Step 1: Specify the Model
from sklearn import svm
svco = svm.SVC(kernel='rbf',cache_size=7000,gamma= 'auto', C=5, probability =True, degree = 5) # other options: gamma=0.001, kernel='poly', 'rbf', 'linear'
svcu = svm.SVC(kernel='rbf',cache_size=7000,gamma= 'auto', C=5, probability =True, degree = 5) # other options: gamma=0.001, kernel='poly', 'rbf', 'linear'

In [None]:
###2. Step 2: Generate Test Data
# X_train_svc, X_test_svc, y1_train_svc, y1_test_svc, y2_train_svc, y2_test_svc = train_test_split(X, y1,y2, test_size=0.3,random_state=1000) 
# X_Models_train, X_Models_test, y1_Models_train, y1_Models_test, y2_Models_train, y2_Models_test
#Standard Scale the data to allow for better SVC model performance
# scaler = StandardScaler().fit(X_Models_train) 
# standardized_X_svc = scaler.transform(X_Models_train) 
# standardized_X_test_svc = scaler.transform(X_Models_test)

In [None]:
###3. Step 3: Build the Model
#Train the model using the training sets
svco.fit(over_sampl_X_Models_train, over_sampled_y_Models_train)
svcu.fit(under_sampled_X_Models_train, under_sampled_y_Models_train)

#Predict the response for test dataset
y_pred_svc_over = svco.predict(X_Models_test)
y_pred_svc_under = svcu.predict(X_Models_test)


In [None]:
###4. Step 4: Assess the Model
print("Accuracy_svc_over:",metrics.accuracy_score(y_Models_test, y_pred_svc_over))
cnf_matrix_svc_over = metrics.confusion_matrix(y_Models_test, y_pred_svc_over)
print("Accuracy_svc_under:",metrics.accuracy_score(y_Models_test, y_pred_svc_under))
cnf_matrix_svc_under = metrics.confusion_matrix(y_Models_test, y_pred_svc_under)



print(cnf_matrix_svc_over)
print(cnf_matrix_svc_under)



#AUC for y1
y_SVM_over_pred_proba = svco.predict_proba(X_Models_test)[::,1]
fpr_SVM_over, tpr_SVM_over, _ = metrics.roc_curve(y_Models_test,  y_SVM_over_pred_proba)
auc_SVM_over = metrics.roc_auc_score(y_Models_test, y_SVM_over_pred_proba)
plt.plot(fpr_SVM_over, tpr_SVM_over,label="auc="+str(auc_SVM_over))
plt.legend(loc=4)
plt.show()
print("")

#AUC for y1
y_SVM_under_pred_proba = svcu.predict_proba(X_Models_test)[::,1]
fpr_SVM_under,tpr_SVM_under, _ = metrics.roc_curve(y_Models_test,  y_SVM_under_pred_proba)
auc_SVM_under = metrics.roc_auc_score(y_Models_test, y_SVM_under_pred_proba)
plt.plot(fpr_SVM_under,tpr_SVM_under,label="auc="+str(auc_SVM_under))
plt.legend(loc=4)
plt.show()
print("")



class_names_svc=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks_svc = np.arange(len(class_names_svc))
plt.xticks(tick_marks_svc, class_names_svc)
plt.yticks(tick_marks_svc, class_names_svc)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_svc_over), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix svc over sampled', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

class_names_svc=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks_svc = np.arange(len(class_names_svc))
plt.xticks(tick_marks_svc, class_names_svc)
plt.yticks(tick_marks_svc, class_names_svc)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_svc_under), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix svc under sampled', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')



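# Classification report and accuracy for the over-sampled SVC
print(metrics.classification_report(y_Models_test, y_pred_svc_over))
print("Accuracy_svc_over:",metrics.accuracy_score(y_Models_test, y_pred_svc_over))

# Classification report and accuracy for the under-sampled SVC
print(metrics.classification_report(y_Models_test, y_pred_svc_under))
print("Accuracy_svc_under:",metrics.accuracy_score(y_Models_test, y_pred_svc_under))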

## E. XGBoost Model (XGB) Over Under Sampled
>   <b> Assumption: </b> XXXAll features are useful for Y1 & Y2!XXX
<br><b> Calculate: </b> How many times are you right?
<br><b> Reason: </b> XXXThis can be set as the baseline for our accuracy of Y1 and Y2 - the computer model should at least beat this in order for it to be better than guessing.XXX
<br><b> Answer: </b> 
* <b> Y2 </b>: 90.19% accuracy
* <b> Y2: </b> XXX94.97 RMSEXXX - not sure this is right!!!
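
For a 0/1 target with hard class predictions, every squared error is either 0 or 1. The mean squared error is therefore the misclassification rate, RMSE equals the square root of (1 - accuracy), and it cannot exceed 1. The small hand-made arrays below are an illustration only, not model output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hand-made binary labels and predictions (assumption for illustration)
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])  # two mistakes out of ten

accuracy = accuracy_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
target_range = y_true.max() - y_true.min()  # 1 for a 0/1 target

print("Accuracy:", accuracy)
print("RMSE:", rmse)
print("sqrt(1 - accuracy):", np.sqrt(1 - accuracy))
print("Target range:", target_range)
```

Since the target range is 1, dividing RMSE by the range leaves it unchanged, so accuracy is the more direct measure for a classifier.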

### 0. Step 0: Import Needed Packages

In [None]:
#Import Needed Packages
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

#Do preliminary work


### 1. Step 1: Specify the Model

In [None]:
#Instantiate XGBoost Classifier Models (one for over-sampled, one for under-sampled data)
#binary:logistic is the standard objective for a 0/1 target; reg:squarederror is a regression objective
XGB_class_o = xgb.XGBClassifier(objective ='binary:logistic', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 100)
XGB_class_u = xgb.XGBClassifier(objective ='binary:logistic', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 100)

### 2. Steps 2-4: Generate Test Data, Build the Models & Assess the Models for y2 (Sale)

In [None]:
#Put Data into structure for XGBoost- for y2 (Sale) 

#Train the model using the training sets for y1
XGB_class_o.fit(over_sampl_X_Models_train, over_sampled_y_Models_train)
XGB_class_u.fit(under_sampled_X_Models_train, under_sampled_y_Models_train)

#Predict the response for test dataset for y1
y_XGB_pred_over = XGB_class_o.predict(X_Models_test)
y_XGB_pred_under = XGB_class_u.predict(X_Models_test)

#Calculate RMSE for y2 (over- and under-sampled models)
rmse_XGB_over = np.sqrt(mean_squared_error(y_Models_test, y_XGB_pred_over))
print("XGBoost's RMSE for y2 (over sampled) is: %f" % (rmse_XGB_over))
rmse_XGB_under = np.sqrt(mean_squared_error(y_Models_test, y_XGB_pred_under))
print("XGBoost's RMSE for y2 (under sampled) is: %f" % (rmse_XGB_under))

#Create error ratio (RMSE divided by the target range) to evaluate results for y2
target_range_XGB = y_Models.max() - y_Models.min()
print("XGB target range is: %f" % (target_range_XGB))

error_ratio_XGB_over = rmse_XGB_over/target_range_XGB
print("XGBoost's Error Ratio for y2 (over sampled) is: %f" % (error_ratio_XGB_over))

error_ratio_XGB_under = rmse_XGB_under/target_range_XGB
print("XGBoost's Error Ratio for y2 (under sampled) is: %f" % (error_ratio_XGB_under))

gnbo.fit(over_sampl_X_Models_train, over_sampled_y_Models_train)
gnbu.fit(under_sampled_X_Models_train, under_sampled_y_Models_train)

In [None]:
#NOTE ON THE ABOVE - THE TARGET RANGE IS 1 BECAUSE THE TARGET VARIABLE IS BINARY (0/1): MAX - MIN = 1 - 0 = 1, SO THE ERROR RATIO EQUALS THE RMSE.

In [None]:
# Model Accuracy, how often is the classifier correct?
# Accuracy for y2
print("y Accuracy Over:",metrics.accuracy_score(y_Models_test, y_XGB_pred_over))
print("")
print("y Accuracy Under:",metrics.accuracy_score(y_Models_test, y_XGB_pred_under))
print("")

#Can use classification report to assess model adequacy, too
print(metrics.classification_report(y_Models_test, y_XGB_pred_over, labels=class_names))
print(metrics.classification_report(y_Models_test, y_XGB_pred_under, labels=class_names))

#AUC for y over sampled
y_XGB_over_pred_proba = XGB_class_o.predict_proba(X_Models_test)[::,1]
fpr_XGB_over, tpr_XGB_over, _ = metrics.roc_curve(y_Models_test,  y_XGB_over_pred_proba)
auc_XGB_over = metrics.roc_auc_score(y_Models_test, y_XGB_over_pred_proba)
plt.plot(fpr_XGB_over,tpr_XGB_over,label="auc="+str(auc_XGB_over))
plt.legend(loc=4)
plt.show()
print("")


#AUC for y under sampled
y_XGB_under_pred_proba = XGB_class_u.predict_proba(X_Models_test)[::,1]
fpr_XGB_under, tpr_XGB_under, _ = metrics.roc_curve(y_Models_test,  y_XGB_under_pred_proba)
auc_XGB_under = metrics.roc_auc_score(y_Models_test, y_XGB_under_pred_proba)
plt.plot(fpr_XGB_under,tpr_XGB_under,label="auc="+str(auc_XGB_under))
plt.legend(loc=4)
plt.show()
print("")

#Print Confusion Matrix
cnf_matrix_XGB_over = metrics.confusion_matrix(y_Models_test, y_XGB_pred_over)
cnf_matrix_XGB_over
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_XGB_over), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix XGB over sampled', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

#Print Confusion Matrix
cnf_matrix_XGB_under = metrics.confusion_matrix(y_Models_test, y_XGB_pred_under)
cnf_matrix_XGB_under
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_XGB_under), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix XGB under sampled', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

# More to go...including: k-fold cross-validation, visualization for feature importance & hyper-parameter tuning to improve model

## F. Neural Network Model (NN) Over Under sampled
>   <b> Assumption: </b> All features are useful for Y1 & Y2!
<br><b> Calculate: </b> How many times are you right?
<br><b> Reason: </b> This can be set as the baseline for our accuracy of Y1 and Y2 - the computer model should at least beat this in order for it to be better than guessing.
<br><b> Answer: </b> 
* <b> Y1 </b>: 90.26% accuracy
* <b> Y2: </b> -% accuracy (not sure if Y2 result should be the same as Y1 but could make sense)

In [None]:
# X_Models_train, X_Models_test, y1_Models_train, y1_Models_test, y2_Models_train, y2_Models_test


###1. Step 1: Specify the Model
# Import the model
from sklearn.neural_network import MLPClassifier

# Initializing the multilayer perceptron
# mlp = MLPClassifier(hidden_layer_sizes = (3,1),solver='sgd',learning_rate_init= 0.01, max_iter=50)
mlpo= MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto', beta_1=0.9, 
beta_2=0.999, early_stopping=False, epsilon=1e-08,       
hidden_layer_sizes=(7,3), learning_rate='adaptive',      
learning_rate_init=0.01, max_iter=10000, momentum=0.9,       
nesterovs_momentum=True, power_t=0.5, random_state=1000,       
shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,       
verbose=False, warm_start=True)

mlpu= MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto', beta_1=0.9, 
beta_2=0.999, early_stopping=False, epsilon=1e-08,       
hidden_layer_sizes=(7,3), learning_rate='adaptive',      
learning_rate_init=0.01, max_iter=10000, momentum=0.9,       
nesterovs_momentum=True, power_t=0.5, random_state=1000,       
shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,       
verbose=False, warm_start=True)

In [None]:
###2. Step 2: Generate Test Data
# Train the model
mlpo.fit(over_sampl_X_Models_train, over_sampled_y_Models_train)
mlpu.fit(under_sampled_X_Models_train, under_sampled_y_Models_train)


In [None]:
###3. Step 3: Build the Model
y_pred_nn_over = mlpo.predict(X_Models_test)
y_pred_nn_under = mlpu.predict(X_Models_test)


In [None]:
###4. Step 4: Assess the Model
# Score takes a feature matrix X_test and the expected target values y_test. 
# Predictions for X_test are compared with y_test

print("MLP over score is",mlpo.score(X_Models_test,y_Models_test))
print("MLP under score is",mlpu.score(X_Models_test,y_Models_test))

# Accuracy for y NN
#print("y Accuracy NN:",metrics.accuracy_score(y_Models_test, y_XGB_pred))
#print("")

# Accuracy and confusion matrix for each model
print("Accuracy_nn_over:",metrics.accuracy_score(y_Models_test, y_pred_nn_over))
cnf_matrix_nn_over = metrics.confusion_matrix(y_Models_test, y_pred_nn_over)

print(cnf_matrix_nn_over)


print("Accuracy_nn_under:",metrics.accuracy_score(y_Models_test, y_pred_nn_under))
cnf_matrix_nn_under = metrics.confusion_matrix(y_Models_test, y_pred_nn_under)

print(cnf_matrix_nn_under)

# Plot ROC curve (with AUC) over sampled
y_pred_proba_nn_over = mlpo.predict_proba(X_Models_test)[::,1]
fpr_nn_over, tpr_nn_over, _ = metrics.roc_curve(y_Models_test,  y_pred_proba_nn_over)
auc_nn_over = metrics.roc_auc_score(y_Models_test, y_pred_proba_nn_over)
plt.plot(fpr_nn_over, tpr_nn_over,label="data 1, auc="+str(auc_nn_over))
plt.legend(loc=4)
plt.show()


# Plot ROC curve (with AUC) under sampled
y_pred_proba_nn_under = mlpu.predict_proba(X_Models_test)[::,1]
fpr_nn_under, tpr_nn_under, _ = metrics.roc_curve(y_Models_test,  y_pred_proba_nn_under)
auc_nn_under = metrics.roc_auc_score(y_Models_test, y_pred_proba_nn_under)
plt.plot(fpr_nn_under, tpr_nn_under,label="data 1, auc="+str(auc_nn_under))
plt.legend(loc=4)
plt.show()

# over sampled
class_names_nn=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks_nn = np.arange(len(class_names_nn))
plt.xticks(tick_marks_nn, class_names_nn)
plt.yticks(tick_marks_nn, class_names_nn)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_nn_over), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix NN over sampled', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

print(classification_report(y_Models_test, y_pred_nn_over))
print("Accuracy_nn_over:",metrics.accuracy_score(y_Models_test, y_pred_nn_over))

# under sampled
class_names_nn=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks_nn = np.arange(len(class_names_nn))
plt.xticks(tick_marks_nn, class_names_nn)
plt.yticks(tick_marks_nn, class_names_nn)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_nn_under), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix NN under sampled', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

print(classification_report(y_Models_test, y_pred_nn_under))
print("Accuracy_nn_under:",metrics.accuracy_score(y_Models_test, y_pred_nn_under))


## Logistic Regression Model Over Under Sampled

>   <b> Assumption: </b> All features are useful for Y
<br><b> Calculate: </b> How many times are you right?
<br><b> Reason: </b> This can be set as the baseline for our accuracy of Y - the computer model should at least beat this in order for it to be better than guessing.
<br><b> Answer: </b> 
* <b> Y </b>: 90.3% accuracy - about the same as our other best scores

In [None]:
#Build the model
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg_over = LogisticRegression()
logreg_under = LogisticRegression()

# fit the model with data
logreg_over.fit(over_sampl_X_Models_train, over_sampled_y_Models_train)
logreg_under.fit(under_sampled_X_Models_train, under_sampled_y_Models_train)

# predict the response for the test dataset
y_LR_pred_over = logreg_over.predict(X_Models_test)
y_LR_pred_under = logreg_under.predict(X_Models_test)


In [None]:
# import the metrics class for the Confusion Matrix
from sklearn import metrics
cnf_matrix_LogR_over = metrics.confusion_matrix(y_Models_test, y_LR_pred_over)
print(cnf_matrix_LogR_over)

cnf_matrix_LogR_under = metrics.confusion_matrix(y_Models_test, y_LR_pred_under)
print(cnf_matrix_LogR_under)

In [None]:
# Visualizing the Confusion Matrix over sampled

class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_LogR_over), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix Log Reg over', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')


# Visualizing the Confusion Matrix under sampled

class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix_LogR_under), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix Log Reg under', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

In [None]:
# ROC Curve
y_LR_pred_proba_over = logreg_over.predict_proba(X_Models_test)[::,1]
fpr_LR_over, tpr_LR_over, _ = metrics.roc_curve(y_Models_test,  y_LR_pred_proba_over)
auc_LR_over = metrics.roc_auc_score(y_Models_test, y_LR_pred_proba_over)
plt.plot(fpr_LR_over, tpr_LR_over,label="data 1, auc="+str(auc_LR_over))
plt.legend(loc=4)
plt.show()

# ROC Curve
y_LR_pred_proba_under = logreg_under.predict_proba(X_Models_test)[::,1]
fpr_LR_under, tpr_LR_under, _ = metrics.roc_curve(y_Models_test,  y_LR_pred_proba_under)
auc_LR_under = metrics.roc_auc_score(y_Models_test, y_LR_pred_proba_under)
plt.plot(fpr_LR_under, tpr_LR_under,label="data 1, auc="+str(auc_LR_under))
plt.legend(loc=4)
plt.show()

In [None]:
print("Accuracy over:",metrics.accuracy_score(y_Models_test, y_LR_pred_over))
print("Precision over:",metrics.precision_score(y_Models_test, y_LR_pred_over))
print("Recall over:",metrics.recall_score(y_Models_test, y_LR_pred_over))

print("Accuracy under:",metrics.accuracy_score(y_Models_test, y_LR_pred_under))
print("Precision under:",metrics.precision_score(y_Models_test, y_LR_pred_under))
print("Recall under:",metrics.recall_score(y_Models_test, y_LR_pred_under))

In [None]:
#90.3% accuracy 

## K-Means Model Over Under Sampled
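
K-Means is unsupervised, so the cluster IDs 0 and 1 are arbitrary and need not line up with the class labels 0 and 1. When cluster assignments are compared with the true labels, a hedged approach is to score both possible mappings and keep the better one. The sketch below uses synthetic blobs (an assumption for illustration) and predicts all rows in one vectorized call instead of looping.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic two-group data (assumption: NOT the shoppers data set)
X_demo, y_demo = make_blobs(n_samples=300, centers=2, random_state=0)

kmeans_demo = KMeans(n_clusters=2, n_init=10, max_iter=600, random_state=0)
kmeans_demo.fit(X_demo)

# Predict every row at once; no per-row loop or reshape is needed
cluster_ids = kmeans_demo.predict(X_demo)

# Share of rows where the cluster ID happens to equal the label
raw_match = np.mean(cluster_ids == y_demo)

# Cluster IDs may be flipped relative to the labels, so take the better mapping
aligned_accuracy = max(raw_match, 1 - raw_match)
print("Raw match:", raw_match, "Aligned accuracy:", aligned_accuracy)
```

A raw match far below 0.5 usually means the cluster IDs are simply swapped, not that the clustering is worse than guessing.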

In [None]:
#important packages

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
plt.style.use('ggplot')
%matplotlib inline

In [None]:
X_kmeans_over = np.array(over_sampl_X_Models_train)
y_kmeans_over = np.array(over_sampled_y_Models_train)

X_kmeans_under = np.array(under_sampled_X_Models_train)
y_kmeans_under = np.array(under_sampled_y_Models_train)

In [None]:
# Build the model

# load the model
# 2 clusters, sale or no sale; the default algorithm is used because algorithm='auto' is not accepted by recent scikit-learn releases
kmeans_over = KMeans(n_clusters=2, max_iter=600)
kmeans_under = KMeans(n_clusters=2, max_iter=600)

kmeans_over.fit(X_kmeans_over)
kmeans_under.fit(X_kmeans_under)


In [None]:
# Predictions
correct_over = 0
for i in range(len(X_kmeans_over)):
    predict_me_over = np.array(X_kmeans_over[i].astype(float))
    predict_me_over = predict_me_over.reshape(-1, len(predict_me_over))
    prediction_over = kmeans_over.predict(predict_me_over)
    if prediction_over[0] == y_kmeans_over[i]:
        correct_over += 1

print(correct_over/len(X_kmeans_over))

correct_under = 0
for i in range(len(X_kmeans_under)):
    predict_me_under = np.array(X_kmeans_under[i].astype(float))
    predict_me_under = predict_me_under.reshape(-1, len(predict_me_under))
    prediction_under = kmeans_under.predict(predict_me_under)
    if prediction_under[0] == y_kmeans_under[i]:
        correct_under += 1
        
print(correct_under/len(X_kmeans_under))

# Part V: Validation  <a name="part5"></a>

Stratified Cross Validation
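
Stratified k-fold keeps the class proportions roughly equal in every fold, which matters when one class is much rarer than the other. The following is a minimal runnable sketch on synthetic data; the logistic regression model and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data (assumption: NOT the shoppers data set)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=0
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
fold_positive_rates = []
for train_index, val_index in skf.split(X_demo, y_demo):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_index], y_demo[train_index])
    # Accuracy on the held-out fold
    fold_scores.append(model.score(X_demo[val_index], y_demo[val_index]))
    # Share of positives in the held-out fold (should be similar across folds)
    fold_positive_rates.append(y_demo[val_index].mean())

print("Mean validation accuracy:", np.mean(fold_scores))
print("Positive rate per fold:", fold_positive_rates)
```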

In [3]:
# skf = StratifiedKFold(n_splits=5, random_state=None)
# # X is the feature set and y is the target
# for train_index, val_index in skf.split(X,y): 
#     print("Train:", train_index, "Validation:", val_index) 
#     X_train, X_test = X[train_index], X[val_index] 
#     y_train, y_test = y[train_index], y[val_index]

In [None]:
# # create some synthetic data for illustration
# X_data = np.random.randint(5, size=(9, 2))
# X_data

Regular K-fold CV:

In [None]:
# kf = KFold(n_splits=3, shuffle=True, random_state=2020)  # random_state only has an effect when shuffle=True
# for train_index, test_index in kf.split(X_data):
#       print("Train:")
#       print(X_data[train_index])
#       print("Test:")
#       print(X_data[test_index])
#       print('\n')

Repeated K-fold CV:

In [None]:
# rkf = RepeatedKFold(n_splits=3, n_repeats=5, random_state=2020)
# for train_index, test_index in rkf.split(X_data):
#       print("Train:")
#       print(X_data[train_index])
#       print("Test:")
#       print(X_data[test_index])
#       print('\n')

Model Selection examples
https://scikit-learn.org/stable/modules/grid_search.html#grid-search
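
As a hedged sketch of the grid search described at the link above, the example below tunes a random forest with stratified cross-validation. The parameter grid, the scoring choice and the synthetic data are assumptions for illustration, not tuned values for this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption: NOT the shoppers data set)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical, deliberately small grid so the sketch runs quickly
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated AUC:", search.best_score_)
```

After fitting, `search.best_estimator_` is the model refit on all the data passed to `fit` with the best parameter combination.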

# Part VI: Presentation/Visualization  <a name="part6"></a>

# Part VII: Sources  <a name="part7"></a>
1. https://i.ytimg.com/vi/CRKn-9gVNBw/maxresdefault.jpg
2. https://support.google.com/analytics

# Part VIII: Next Steps (Discuss deployment, Lessons learned, Additional analyses had time permitted)