In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

customer= pd.read_pickle('modifiedCus.pkl')
customer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5630 entries, 0 to 5629
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   CustomerID                   5630 non-null   int64  
 1   Churn                        5630 non-null   object 
 2   Tenure                       5366 non-null   float64
 3   PreferredLoginDevice         5630 non-null   object 
 4   CityTier                     5630 non-null   object 
 5   WarehouseToHome              5379 non-null   float64
 6   PreferredPaymentMode         5630 non-null   object 
 7   Gender                       5630 non-null   object 
 8   HourSpendOnApp               5375 non-null   object 
 9   NumberOfDeviceRegistered     5630 non-null   object 
 10  PreferedOrderCat             5630 non-null   object 
 11  SatisfactionScore            5630 non-null   object 
 12  MaritalStatus                5630 non-null   object 
 13  NumberOfAddress   

##### I. Data Cleaning
- Using statistics to define normal data and identify outliers
- Imputing missing values using statistics or a learned model
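
The two cleaning steps listed above can be sketched as follows. This is a minimal illustration on a small made-up frame, not the notebook's data. The 1.5 × IQR rule and median imputation are assumed choices, and the helper name `iqr_outlier_mask` is hypothetical.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN is never flagged."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy values standing in for a numeric column such as Tenure (made up).
toy = pd.DataFrame({"Tenure": [1.0, 2.0, 3.0, 4.0, np.nan, 5.0, 60.0]})

# Step 1: use quartile statistics to define "normal" and flag the rest.
outliers = iqr_outlier_mask(toy["Tenure"])

# Step 2: impute missing values with the median, which the outlier barely moves.
toy["Tenure_imputed"] = toy["Tenure"].fillna(toy["Tenure"].median())
```

A model-based alternative to the median would be something like scikit-learn's `KNNImputer`, which fills each gap from the most similar rows.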

##### II. Feature Selection

Supervised feature-selection techniques fall into three groups: intrinsic methods, where the model selects features automatically as part of fitting; wrapper methods, which explicitly search for the feature subset that yields the best-performing model; and filter methods, which score each input feature and allow a subset to be selected.
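
As a rough sketch of the three families, the snippet below applies one representative of each to synthetic data. The specific choices (`SelectKBest` with an ANOVA F-test as the filter, `RFE` around a logistic regression as the wrapper, and a decision tree's importances as the intrinsic method) are assumptions made for illustration, not methods taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 8 features, 3 of which carry signal.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Filter: score each feature independently of any model, keep the top 3.
filter_sel = SelectKBest(score_func=f_classif, k=3).fit(X, y)
filter_mask = filter_sel.get_support()

# Wrapper: refit a model repeatedly, dropping the weakest feature until 3 remain.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
wrapper_mask = wrapper_sel.get_support()

# Intrinsic: the tree picks features while fitting; importances are a by-product.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
importances = tree.feature_importances_
```

The three masks need not agree, since each method measures relevance differently.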

##### III. Data Transformation 


##### IV. Dimensionality Reduction

In [4]:
customer.select_dtypes(include='object').columns

Index(['Churn', 'PreferredLoginDevice', 'CityTier', 'PreferredPaymentMode',
       'Gender', 'HourSpendOnApp', 'NumberOfDeviceRegistered',
       'PreferedOrderCat', 'SatisfactionScore', 'MaritalStatus', 'Complain'],
      dtype='object')

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import class_weight
# Convert numeric columns that were stored as objects back to float
for col in ['CityTier', 'SatisfactionScore', 'HourSpendOnApp', 'NumberOfDeviceRegistered']:
    customer[col] = customer[col].astype('float64')

# One-hot encode the nominal (unordered categorical) variables
onehot_columns = ['PreferredLoginDevice', 'PreferredPaymentMode', 'Gender', 'PreferedOrderCat', 'MaritalStatus', 'Complain']
onehot_df = pd.get_dummies(customer[onehot_columns], columns=onehot_columns)
score_onehot_drop = customer.drop(onehot_columns, axis=1)
customer_c = pd.concat([score_onehot_drop, onehot_df], axis=1)

# Build features and target from the encoded frame so the model sees no string columns
xVar, yVar = customer_c.drop(['Churn', 'CustomerID'], axis=1), customer_c['Churn']

> Entity Embedding for Categorical Variables

Entity embeddings can outperform one-hot encodings because they represent categorical variables in a compact, continuous way. One-hot encoding treats every pair of values as equally distant and so ignores informative relations between a feature’s values, whereas entity embeddings can map related values closer together in embedding space, revealing the inherent continuity of the data (Guo 2016).
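
To make the contrast concrete, the sketch below shows the mechanics only: an embedding is a lookup into a dense matrix, which is equivalent to multiplying the one-hot vector by that matrix. The matrix here is random and the category codes are made up. In a real model the matrix would be learned jointly with a neural network, which this example does not attempt.

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories, embedding_dim = 4, 2

# Stand-in for a trainable embedding layer (random here, learned in practice).
embedding_matrix = rng.normal(size=(n_categories, embedding_dim))

# Integer-encoded category for each of four rows (hypothetical values).
codes = np.array([2, 0, 2, 3])

# One-hot: one column per category, every pair of categories equally far apart.
one_hot = np.eye(n_categories)[codes]  # shape (4, 4)

# Embedding: one dense row per category, so distances can carry meaning.
embedded = embedding_matrix[codes]  # shape (4, 2)

# The lookup is the same as a one-hot vector times the embedding matrix.
same_result = np.allclose(one_hot @ embedding_matrix, embedded)
```

The width of the encoded feature drops from the number of categories to `embedding_dim`, which is where the compactness comes from.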



In [None]:
customer.select_dtypes(include='object')

<div class="alert alert-block alert-info">
<b>Decision Tree Algorithm</b></div>

Because a decision tree does not require feature scaling (standardization or normalization) and is relatively robust to outliers and, in implementations that support it, to missing values, it is used here as a baseline model to assess data quality.
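
The claim about feature scaling can be checked empirically. The sketch below, on synthetic data rather than the customer table, fits the same tree on raw and standardized features and compares the predictions. Splits depend only on the ordering of values within each feature, so the two trees are expected to agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Same hyperparameters and seed; only the feature scale differs.
raw_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
scaled_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Fraction of samples on which the two trees give the same prediction.
agreement = np.mean(raw_tree.predict(X) == scaled_tree.predict(X_scaled))
print(f"prediction agreement: {agreement:.3f}")
```

The same check would typically fail for distance-based or gradient-based models, which is why scaling matters for those and not for trees.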

In [None]:

x_train, x_test, y_train, y_test= train_test_split(xVar, yVar, test_size=.3, random_state=61)

In [None]:
classWeights= class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
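
For reference, the `'balanced'` heuristic weights each class by `n_samples / (n_classes * count_of_class)`, so rarer classes receive larger weights. The sketch below verifies this on made-up labels and pairs each weight with its class label explicitly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with an 80/20 imbalance (made up).
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
classes = np.unique(y)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same quantity computed by hand from the class counts.
manual = len(y) / (len(classes) * np.bincount(y))

# Key the weights by the actual class labels rather than by position.
weight_by_class = dict(zip(classes.tolist(), weights.tolist()))
```

Keying by label matters when the target is stored as strings or objects, because positional keys such as `0` and `1` would then not match any class.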

In [None]:
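# Key the weights by the actual class labels; Churn is stored as object,
# so positional keys (0, 1, ...) are not guaranteed to match the labels.
dTree_hp = DecisionTreeClassifier(random_state=61,
                                  class_weight=dict(zip(np.unique(y_train), classWeights)))
param = {'max_depth': [3, 5, 7, 10, 15],
         'min_samples_leaf': [1, 3, 5, 10, 15, 20],
         'min_samples_split': [2, 4, 6, 8, 10, 12],
         'criterion': ['gini', 'entropy']}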

GSdt= GridSearchCV(estimator=dTree_hp, param_grid=param, cv=5)
GSdt.fit(x_train, y_train)
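
The imports earlier in the notebook include `RepeatedStratifiedKFold`, which the search above does not use. As a sketch of how it could be plugged in, the example below runs a deliberately small grid on synthetic, imbalanced data. The reduced grid, the `balanced_accuracy` scoring, and the number of repeats are assumptions chosen to keep the example fast, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with roughly an 80/20 class imbalance.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=61
)

# Stratified folds keep the class ratio in every split; repeats reduce variance.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=61)

small_grid = {"max_depth": [3, 5], "min_samples_leaf": [1, 10]}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(class_weight="balanced", random_state=61),
    param_grid=small_grid,
    scoring="balanced_accuracy",  # plain accuracy can look good on imbalanced data
    cv=cv,
)
search.fit(X, y)

best_params, best_score = search.best_params_, search.best_score_
print(best_params, round(best_score, 3))
```

After fitting, `search.best_estimator_` holds the tree refit on all of the data passed to `fit`, and `search.cv_results_` holds the per-combination scores.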