## Pre-processing and Training Data Development
This is the fourth step in the Data Science Method. In this exercise, you will prepare the data so that models can be fit to it.

**The Data Science Method**  


1. Problem Identification

2. Data Wrangling
   * Data Collection
   * Data Organization
   * Data Definition
   * Data Cleaning
   * Outliers

3. Exploratory Data Analysis
   * Build data profile tables and plots
     - Outliers & Anomalies
   * Explore data relationships
   * Identification and creation of features

4. **Pre-processing and Training Data Development**
   * Create dummy or indicator features for categorical variables
   * Standardize the magnitude of numeric features
   * Split into testing and training datasets
   * Apply scaler to the testing set

5. Modeling
   * Fit Models with Training Data Set
   * Review Model Outcomes; iterate over additional models as needed
   * Identify the Final Model

6. Documentation
   * Review the Results
   * Present and share your findings (storytelling)
   * Finalize Code
   * Finalize Documentation

In [1]:
# Load Python packages
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import category_encoders as ce
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

%matplotlib inline

### Load clean data from previous step

Also check each column's data type and align it with the analysis from the previous step.

<img src = 'Data_desc.png'>

<img src = 'Data_desc_cat.png'>

In [3]:
path = 'C:/Users/sanja/Jupyter Code/Git Hub/Online-Shopper-Intention-Capstone/'
filename = 'data/clean_data2.csv'
df = pd.read_csv(path+filename)
df.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,-1.0,0,-1.0,1,-1.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False
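
The `Unnamed: 0` column in the preview above is most likely the row index written out by an earlier `to_csv` call. If it is left in place, it is picked up as a numeric column in the later steps and ends up scaled and passed to the model as a feature. The sketch below uses a small made-up CSV string (not the project file) to show how `index_col=0` avoids this.

```python
import io
import pandas as pd

# Hypothetical two-row CSV that mimics a file saved with the default index.
csv_text = ",Administrative,Month\n0,0,Feb\n1,3,Mar\n"

# Default read: the saved index comes back as a column named 'Unnamed: 0'.
with_index_col = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores it as the index, so it cannot be mistaken for a feature.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_col.columns))
print(list(clean.columns))
```

An alternative with the same effect is to drop the column after loading, for example with `df.drop(columns='Unnamed: 0')`.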


In [None]:
#DataFrame dtype info
df.info()

In [None]:
# Cast integer-coded ID columns to object so they are treated as categorical features
#df['SpecialDay'] = df['SpecialDay'].astype('object')
df['Month'] = df['Month'].astype('object')
df['OperatingSystems'] = df['OperatingSystems'].astype('object')
df['Browser'] = df['Browser'].astype('object')
df['Region'] = df['Region'].astype('object')
df['TrafficType'] = df['TrafficType'].astype('object')
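
The per-column casts above can also be written as a single `astype` call with a column-to-dtype mapping. This is only a sketch on a small made-up frame; the column names mirror the dataset, but the values are illustrative.

```python
import pandas as pd

# Small illustrative frame; values are made up for the example.
demo = pd.DataFrame({
    "Month": ["Feb", "Mar"],
    "Browser": [1, 2],
    "Region": [1, 9],
    "PageValues": [0.0, 1.5],
})

# Columns whose integers are labels rather than quantities.
id_cols = ["Month", "Browser", "Region"]

# One call: map each listed column to the object dtype, leave the rest alone.
demo = demo.astype({col: "object" for col in id_cols})

print(demo.dtypes)
```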

In [None]:
df.info()

## Dealing with Categorical Variables

In [None]:
cat_name = df.select_dtypes(include = ['object']).columns
print(cat_name)

In [None]:
for i in cat_name:
    print(i, df[i].nunique())
    print(df[i].value_counts())

In [None]:
#One hot encoding 'VisitorType'

df_enc = pd.concat([df.drop('VisitorType', axis = 1),pd.get_dummies(df['VisitorType'],prefix = 'VT')],axis = 1)
df_enc.head()

In [None]:
# Hash encoding 'Month' into 3 columns (col_0, col_1, col_2)

encoder = ce.HashingEncoder(cols=['Month'], n_components=3)
df_enc = encoder.fit_transform(df_enc)
df_enc.head()
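
Hash encoding maps each category label to one of a fixed number of buckets, so with only 3 components several months necessarily share a bucket. The sketch below illustrates the general hashing trick with the standard library. It is not `category_encoders`' actual implementation, and the helper `hash_bucket` and the month labels are assumptions made for the example.

```python
import hashlib

def hash_bucket(value: str, n_components: int = 3) -> int:
    """Map a category label to one of n_components buckets via an MD5 digest."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

# Illustrative month labels (assumed, not read from the data file).
months = ["Feb", "Mar", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

buckets = {month: hash_bucket(month) for month in months}
for month, bucket in buckets.items():
    print(month, "->", bucket)

# More labels than buckets means collisions are unavoidable.
print("distinct buckets used:", len(set(buckets.values())))
```

The trade-off is fewer columns than one-hot encoding at the cost of months that become indistinguishable to the model.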

In [None]:
df_enc.rename(columns={'col_0': 'Month_0','col_1': 'Month_1','col_2': 'Month_2'},inplace = True)

df_enc.columns

In [None]:
#One hot encoding 'OperatingSystems'

df_enc = pd.concat([df_enc.drop('OperatingSystems', axis = 1),pd.get_dummies(df_enc['OperatingSystems'],prefix = 'OS')],axis = 1)
df_enc.head()

In [None]:
#One hot encoding 'Browser'

df_enc = pd.concat([df_enc.drop('Browser', axis = 1),pd.get_dummies(df_enc['Browser'],prefix = 'Bro')],axis = 1)
df_enc.head()

In [None]:
#One hot encoding 'Region'

df_enc = pd.concat([df_enc.drop('Region', axis = 1),pd.get_dummies(df_enc['Region'],prefix = 'Reg')],axis = 1)
df_enc.head()

In [None]:
#One hot encoding 'TrafficType'

df_enc = pd.concat([df_enc.drop('TrafficType', axis = 1),pd.get_dummies(df_enc['TrafficType'],prefix = 'TT')],axis = 1)
df_enc.head()

In [None]:
df_enc.columns
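
The repeated `pd.concat` / `pd.get_dummies` cells above can be collapsed into one call, because `get_dummies` accepts a `columns` list and a `prefix` mapping. The following is a minimal sketch on made-up rows; the resulting column order differs slightly from the cell-by-cell approach.

```python
import pandas as pd

# Illustrative rows only; the real frame has many more columns.
demo = pd.DataFrame({
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor"],
    "Browser": [1, 2, 1],
    "PageValues": [0.0, 12.5, 0.0],
})

# Encode several columns at once, with one prefix per source column.
encoded = pd.get_dummies(
    demo,
    columns=["VisitorType", "Browser"],
    prefix={"VisitorType": "VT", "Browser": "Bro"},
)

print(encoded.columns.tolist())
```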

## Dealing with Non-Categorical Variables

In [None]:
df_enc.describe()

In [None]:
non_cat_name = df.select_dtypes(exclude = ['object','bool']).columns
print(non_cat_name)

In [None]:
df_enc[non_cat_name].hist(figsize = (15,10))
plt.show()

## Standardization

In [None]:
scaler = StandardScaler()
df_enc[non_cat_name] = scaler.fit_transform(df_enc[non_cat_name])
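
As a quick check of what `StandardScaler` computes, the sketch below compares its output with the manual z-score `(x - mean) / std` on a few made-up values. It uses the population standard deviation, which is NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single illustrative feature column (values are made up).
x = np.array([[0.0], [64.0], [2.666667], [627.5]])

scaled = StandardScaler().fit_transform(x)

# Manual z-score: subtract the column mean, divide by the population std.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, each column has mean 0 and unit variance, but the shape of its distribution is unchanged, so skewed features stay skewed.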

In [None]:
df_enc[non_cat_name].hist(figsize = (15,10))
plt.show()

In [None]:
X = df_enc.drop(['Revenue'], axis=1)
y = df_enc['Revenue'].to_numpy()  # target as a 1-D NumPy array

# Call the train_test_split() function with the first two parameters set to X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
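
The outline at the top lists "Apply scaler to the testing set" as a separate step, which suggests fitting the scaler on the training rows only so that test-set statistics do not leak into the features. The sketch below shows one way to order the steps under that assumption. The frame is synthetic and the column list `num_cols` is a stand-in for the notebook's numeric columns.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded frame (values are random, not real data).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "PageValues": rng.exponential(10.0, size=200),
    "ExitRates": rng.uniform(0.0, 0.2, size=200),
    "Weekend": rng.integers(0, 2, size=200).astype(bool),
    "Revenue": rng.integers(0, 2, size=200).astype(bool),
})
num_cols = ["PageValues", "ExitRates"]  # assumed numeric feature columns

X = demo.drop(columns="Revenue")
y = demo["Revenue"].to_numpy()

# 1) Split first, so the test rows stay unseen during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)
X_train, X_test = X_train.copy(), X_test.copy()

# 2) Fit the scaler on the training rows only.
scaler = StandardScaler().fit(X_train[num_cols])

# 3) Apply the same fitted scaler to both sets.
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print(X_train[num_cols].mean().round(6).tolist())
```

With this ordering the training columns have mean 0 exactly, while the test columns are only approximately centered, which is expected.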