## Binary Classification template

* Import Packages
* Data Exploration
    * Summary of the original dataset
    * Create new features
    * Missing values
    * Dataset normalization
    * (Dataset PCA)
    * Visualization
        * Feature correlation heatmap
        * Certain feature histograms with target variable as hue
        * KDE plot
* Modeling
    * Feature selection
        * Univariate Feature selection
        * Decision tree feature importance
        * RFE
    * Train test datasets split (Cross validation)
    * Build model
        * Logistic regression
        * Decision tree
        * Random Forest 
        * Gradient Boosting
        * SVM
    * Model evaluation on training, test, and validation datasets

### Import packages

In [5]:
# General settings
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Display all plots inline in Jupyter notebook
%matplotlib inline
# Set this to 'png' instead of 'retina' when working in the notebook
%config InlineBackend.figure_format = 'retina'

In [6]:
# Visualisation packages
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns

# Configure visualizations
mpl.style.use('ggplot')
#sns.set_style('white')
pylab.rcParams['figure.figsize'] = 8,6

In [7]:
# import pandas and numpy packages
import numpy as np
import pandas as pd
# Avoid truncating the display
#pd.options.display.max_rows =2000
#pd.options.display.max_columns=2000

import pickle  # Python 3 replacement for cPickle

In [8]:
# Import sklearn
# Modelling Algorithms
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC,LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Modelling Helpers
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
from sklearn.preprocessing import Normalizer, scale, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.model_selection import GridSearchCV  # hyperparameter tuning: tries every combination of the given parameters
from sklearn.model_selection import RandomizedSearchCV  # hyperparameter tuning: samples random combinations of parameters
from sklearn.feature_selection import RFECV,RFE,SelectPercentile,f_classif
import sklearn.metrics as metrics
from sklearn.metrics import confusion_matrix, accuracy_score,recall_score, precision_recall_curve,auc, roc_curve, roc_auc_score, classification_report

### Data Exploration

* Summary of the original dataset
* Create new features
* Handle missing values
* Dataset normalization
* (Dataset PCA)
* Visualization
    * Feature correlation heatmap
    * Certain feature histograms with target variable as hue
    * KDE plot

In [10]:
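# Summary of the original dataset
# Assumes `df` is a pandas DataFrame loaded beforehand (e.g. with pd.read_csv);
# without it this cell fails with: NameError: name 'df' is not defined
print(df.shape)   # shape is an attribute, not a method
df.info()         # column dtypes and non-null counts
df.describe()     # summary statistics for numeric columns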

In [11]:
# Create new features
# x.apply
# lambda function
# Binary columns using np.where, one-hot encoding, dummies
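
The three techniques named in the cell above can be sketched on a small toy frame. The column names (`amount`, `channel`) and the threshold of 100 are hypothetical and only serve the illustration; they are not part of the original template.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only
toy = pd.DataFrame({
    "amount": [10.0, 250.0, 40.0, 990.0],
    "channel": ["web", "store", "web", "phone"],
})

# Element-wise transformation with apply and a lambda
toy["log_amount"] = toy["amount"].apply(lambda v: np.log1p(v))

# Binary column with np.where (1 if amount is above the assumed threshold)
toy["is_large"] = np.where(toy["amount"] > 100, 1, 0)

# One-hot encoding of a categorical column with get_dummies
toy = pd.get_dummies(toy, columns=["channel"], prefix="channel")

print(toy.head())
```

For simple arithmetic such as the log transform, calling `np.log1p(toy["amount"])` directly is usually faster than `apply`; the lambda form is shown only because the template mentions it.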

In [9]:
# Handle missing values
df = df.fillna(0)  # fillna returns a new DataFrame, so assign the result
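
Filling with 0 is only one option. As a hedged alternative, the sketch below counts missing values per column and imputes the column median with `SimpleImputer`. The toy columns `age` and `income` are hypothetical, and the median strategy is an assumption rather than a choice made by the original template.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "income": [30.0, 52.0, np.nan, 48.0],
})

# Count missing values per column before deciding how to handle them
print(toy.isnull().sum())

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy),
                      columns=toy.columns, index=toy.index)
print(filled)
```

When a train/test split is used, fitting the imputer on the training part only and then calling `transform` on the test part avoids leaking test information.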

In [None]:
# Dataset normalization
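
This cell is empty in the template. A minimal sketch of two common options, using the `StandardScaler` and `MinMaxScaler` classes imported earlier, could look as follows. The array `X_toy` is made-up data for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix: two columns on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

# Standardization: each column gets zero mean and unit variance
standard = StandardScaler().fit_transform(X_toy)

# Min-max scaling: each column is mapped to the range [0, 1]
minmax = MinMaxScaler().fit_transform(X_toy)

print(standard)
print(minmax)
```

In a real workflow the scaler would normally be fitted on the training split only and then applied to the test split with `transform`.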

In [None]:
# Dataset PCA
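
The PCA step is also left empty. One possible sketch, assuming standardized inputs and a target of 95% explained variance (both are assumptions, not choices from the template), is shown below on synthetic data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 features, where feature 1 is nearly a copy of feature 0
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 5))
X_toy[:, 1] = 2 * X_toy[:, 0] + rng.normal(scale=0.1, size=100)

# PCA is scale-sensitive, so standardize first
X_scaled = StandardScaler().fit_transform(X_toy)

# A float n_components keeps as many components as needed to reach
# that share of explained variance (requires the 'full' solver)
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```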

In [None]:
# Visualization: feature correlation heatmap (numeric columns only)
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns)

### Modeling

* Feature selection
    * Univariate Feature selection
    * Decision tree feature importance
    * RFE
* Train test datasets split (Cross validation)
* Build model
    * Logistic regression
    * Decision tree
    * Random Forest 
    * Gradient Boosting
    * SVM
* Model evaluation on training, test, and validation datasets
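
The univariate feature selection step in the list above has no cell of its own. A minimal sketch using the `SelectPercentile` and `f_classif` imports could look like this. The synthetic data, the feature names `f0` to `f9`, and the 30% cutoff are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic binary classification data with 10 features
X_arr, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)
X_toy = pd.DataFrame(X_arr, columns=["f%d" % i for i in range(10)])

# Score each feature independently with the ANOVA F-test
# and keep the top 30 percent
selector = SelectPercentile(score_func=f_classif, percentile=30)
selector.fit(X_toy, y_toy)

kept = list(X_toy.columns[selector.get_support()])
print("Selected features:", kept)
print("F-scores:", dict(zip(X_toy.columns, selector.scores_.round(2))))
```

Univariate tests look at one feature at a time, so they can miss features that only matter in combination; the tree-based importance and RFE cells complement them.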

In [12]:
# Feature selection using RandomForestClassifier feature importances

from sklearn.ensemble import RandomForestClassifier
def feature_tree_imp(X,y):
    tree = RandomForestClassifier()
    tree.fit(X, y)
    imp = pd.DataFrame(tree.feature_importances_, columns = [ 'Importance' ], index = X.columns)
    imp = imp.sort_values( ['Importance'], ascending = False)
    return imp

train_X,test_X,train_y,test_y= train_test_split(X, y,test_size=0.3, \
                                                random_state=42,stratify=y)
imp_test = feature_tree_imp(train_X, train_y)
print(imp_test)

imp_feature_list = list(imp_test.index)

In [None]:
# Use RFE or RFECV to find the best features
# use logistic regression here

clf = LogisticRegression()
rfe = RFE(clf, n_features_to_select=8)  # keyword argument is required in recent scikit-learn
rfe = rfe.fit(X,y)
# summarize the selection of the attributes
print('Selected features: %s' % list(X.columns[rfe.support_]))

In [None]:
selected_features = ['','','','']

In [None]:
from sklearn.feature_selection import RFECV
# Create the RFE object and compute a cross-validated score.
# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=LogisticRegression(), step=1, cv=10, scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features: %d" % rfecv.n_features_)
print('Selected features: %s' % list(X.columns[rfecv.support_]))

# Plot number of features vs. cross-validation scores
# (grid_scores_ was removed in scikit-learn 1.2; cv_results_ holds the per-step means)
mean_scores = rfecv.cv_results_['mean_test_score']
plt.figure(figsize=(10,6))
plt.xlabel("Number of features selected")
plt.ylabel("Cross-validation score (mean accuracy)")
plt.plot(range(1, len(mean_scores) + 1), mean_scores)
plt.show()

In [None]:
# Train test datasets split (Cross validation)
# Keep the 15 most important features; stratify on y to preserve the class ratio
train_X,test_X,train_y,test_y= train_test_split(X[imp_feature_list[0:15]],y,test_size=0.3,\
                                                random_state=42,stratify=y)
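
The cell above performs a single hold-out split. For the cross-validation variant mentioned in its comment, a hedged sketch with `StratifiedKFold` and `cross_val_score` follows. The synthetic data, the 5 folds, and the `roc_auc` scoring are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced binary data (about 80% / 20%)
X_toy, y_toy = make_classification(n_samples=500, n_features=8,
                                   weights=[0.8, 0.2], random_state=0)

# Stratified folds keep the class ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_toy, y_toy, cv=cv, scoring="roc_auc")

print("AUC per fold:", scores.round(3))
print("Mean AUC: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```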

In [13]:
# Build models
clf= LogisticRegression(class_weight='balanced', random_state = 0)
# The l1 penalty needs a solver that supports it, e.g. 'liblinear' or 'saga'
#clf= LogisticRegression(penalty = 'l1', solver = 'liblinear', C = 0.00001,class_weight='balanced', random_state = 0)
#clf= LogisticRegression(penalty = 'l1', solver = 'liblinear', C = 0.00001, random_state = 0)
#clf= LogisticRegression(penalty = 'l2', C = 10000, class_weight='balanced',random_state = 0)
# For an unbalanced dataset, RF might not be a good option; some classes may need to be oversampled manually
#clf = GaussianNB()
#clf = DecisionTreeClassifier(max_depth=15)
#clf = RandomForestClassifier(random_state=0, n_estimators=50,max_depth=10)
#clf = RandomForestClassifier()

clf.fit(train_X, train_y)
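
`GridSearchCV` is imported at the top but not used in this template. As a sketch of how the regularization strength `C` might be tuned instead of being set by hand, consider the following. The synthetic data, the parameter grid, and the variable names `tr_X`, `te_X`, `tr_y`, `te_y` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced binary data standing in for the real X, y
X_toy, y_toy = make_classification(n_samples=400, n_features=8,
                                   weights=[0.8, 0.2], random_state=0)
tr_X, te_X, tr_y, te_y = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=42, stratify=y_toy)

# Candidate values for the inverse regularization strength (assumed grid)
param_grid = {"C": [0.01, 0.1, 1, 10]}

base = LogisticRegression(class_weight="balanced", max_iter=1000,
                          random_state=0)
search = GridSearchCV(base, param_grid, cv=5, scoring="roc_auc")
search.fit(tr_X, tr_y)

print("Best parameters:", search.best_params_)
print("Best CV AUC: %.3f" % search.best_score_)
print("Test AUC: %.3f" % search.score(te_X, te_y))
```

By default the search refits the best configuration on the whole training split, so `search` can be used like a fitted classifier afterwards.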

In [None]:
# Model evaluation on the test set
# (repeat with train_X/train_y or a validation split to compare)

# Get predictions
pred = clf.predict(test_X)
predprob = clf.predict_proba(test_X)

# Evaluation
cnf_matrix = confusion_matrix(test_y, pred)
TP = cnf_matrix[1, 1]
TN = cnf_matrix[0, 0]
FP = cnf_matrix[0, 1]
FN = cnf_matrix[1, 0]
accuracy = accuracy_score(test_y, pred)
recall = recall_score(test_y, pred)
specificity = TN / float(TN + FP)
# Named auc_score so it does not shadow sklearn.metrics.auc imported above
auc_score = roc_auc_score(test_y, predprob[:, 1])

print("TP:", TP)  # positives predicted as positive
print("TN:", TN)  # negatives predicted as negative
print("FP:", FP)  # negatives predicted as positive
print("FN:", FN, "\n")  # positives predicted as negative
print(cnf_matrix, "\n")
print("Share of negative class in test set:", 1 - len(test_y[test_y == 1]) / float(len(test_y)))
print("Classification Accuracy:", accuracy)
print("Recall:", recall)
print("Specificity:", specificity)
print("AUC:", auc_score)
print("\n----------Classification Report on dataset-------------")
print(classification_report(test_y, pred))
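
To compare training and test performance side by side, the same metrics can be wrapped in a small helper. This is only a sketch: the function name `evaluate`, the synthetic data, and the split names are hypothetical and not part of the original template.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate(model, X_part, y_part):
    """Return accuracy, recall, specificity and AUC for one data split."""
    pred = model.predict(X_part)
    prob = model.predict_proba(X_part)[:, 1]
    # For binary labels [0, 1], ravel() yields TN, FP, FN, TP in that order
    tn, fp, fn, tp = confusion_matrix(y_part, pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_part, pred),
        "recall": recall_score(y_part, pred),
        "specificity": tn / float(tn + fp),
        "auc": roc_auc_score(y_part, prob),
    }


# Synthetic imbalanced data standing in for the real dataset
X_toy, y_toy = make_classification(n_samples=400, n_features=8,
                                   weights=[0.8, 0.2], random_state=0)
tr_X, te_X, tr_y, te_y = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=42, stratify=y_toy)

model = LogisticRegression(class_weight="balanced", max_iter=1000,
                           random_state=0).fit(tr_X, tr_y)

splits = {"train": (tr_X, tr_y), "test": (te_X, te_y)}
results = {name: evaluate(model, Xp, yp) for name, (Xp, yp) in splits.items()}
for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

A large gap between the training and test rows is a common sign of overfitting.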