## Modeling Trial 2
1. Split the test set into validation and test sets
2. Train the classification models (Logistic Regression, Decision Tree, Random Forest, SVM, and SVM with grid search)
3. Print the confusion matrix and classification report for each model
4. Report test-set metrics for the model with the highest validation F1 score, which is Logistic Regression

**Libraries and Packages needed for modeling**

In [1]:
import pandas as pd
import numpy as np

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
%matplotlib inline

In [4]:
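# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection (available since 0.18)
from sklearn.model_selection  import train_test_split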
from sklearn.metrics          import confusion_matrix,classification_report

from sklearn.svm              import SVC
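# sklearn.grid_search was removed in scikit-learn 0.20; use model_selection (available since 0.18)
from sklearn.model_selection  import GridSearchCV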
from sklearn.linear_model     import LogisticRegression
from sklearn.tree             import DecisionTreeClassifier
from sklearn.ensemble         import RandomForestClassifier



**Import the data**

In [5]:
train = pd.read_csv("C:\\AllamMarwa\\3- Side Projects\\Side Project - titanic\\Data\\train_trail_2.csv")
test  = pd.read_csv("C:\\AllamMarwa\\3- Side Projects\\Side Project - titanic\\Data\\test_trail_2.csv")

**1. Splitting the test set into validation and test sets**

In [6]:
train.drop(['PassengerId'],axis =1, inplace=True)
test.drop(['PassengerId'] ,axis =1, inplace=True)

In [7]:
X_train = train.drop('Survived', axis = 1) 
y_train = train['Survived']

In [8]:
X_Val_tst = test.drop('Survived', axis = 1) 
y_Val_tst = test['Survived']

In [9]:
X_validation, X_test, y_validation, y_test = train_test_split(X_Val_tst, y_Val_tst, test_size=0.5, random_state=101)
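
One thing to keep in mind with a plain random split is that class proportions can drift between the two halves: in the reports further down, the validation half has 66 survivors out of 208 (about 32%) while the test half has 86 out of 209 (about 41%). If that matters, `train_test_split` accepts a `stratify` argument. The following is a minimal sketch on synthetic stand-in data with the same class counts; the `_demo` names are hypothetical and the features are random, so it only illustrates the split, not the Titanic data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the held-out file: 417 rows, 152 survivors (assumed
# from the validation and test supports reported later in this notebook).
rng = np.random.RandomState(101)
X_demo = rng.rand(417, 3)
y_demo = np.array([1] * 152 + [0] * 265)

# stratify=y_demo keeps the survivor rate nearly identical in both halves.
X_val_demo, X_tst_demo, y_val_demo, y_tst_demo = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=101, stratify=y_demo
)

print(len(y_val_demo), round(y_val_demo.mean(), 3))
print(len(y_tst_demo), round(y_tst_demo.mean(), 3))
```

With stratification, validation and test scores become easier to compare because both halves see roughly the same class balance.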

# Modeling Phase
**Building Logistic Regression**



In [10]:
LogisticReg_T2 = LogisticRegression()

In [11]:
LogisticReg_T2.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [41]:
LogRegPred_train = LogisticReg_T2.predict(X_train)

In [42]:
print(confusion_matrix(y_train,LogRegPred_train))
print('\n')
print(classification_report(y_train,LogRegPred_train))

[[476  73]
 [105 235]]


             precision    recall  f1-score   support

          0       0.82      0.87      0.84       549
          1       0.76      0.69      0.73       340

avg / total       0.80      0.80      0.80       889



In [43]:
LogRegPred = LogisticReg_T2.predict(X_validation)

In [44]:
print(confusion_matrix(y_validation,LogRegPred))
print('\n')
print(classification_report(y_validation,LogRegPred))

[[138   4]
 [  2  64]]


             precision    recall  f1-score   support

          0       0.99      0.97      0.98       142
          1       0.94      0.97      0.96        66

avg / total       0.97      0.97      0.97       208
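
To read these reports: scikit-learn's `confusion_matrix` puts true classes in rows and predicted classes in columns, so in the validation matrix above the survivors (class 1) have 64 true positives, 2 false negatives, and 4 false positives. The sketch below recomputes the class-1 row of the report from those counts. The variable names are illustrative and not part of the original notebook.

```python
# Validation confusion matrix for Logistic Regression, copied from the output above.
# Rows are true classes, columns are predicted classes.
cm = [[138, 4],
      [2, 64]]

tn, fp = cm[0]  # true class 0: correctly rejected, wrongly predicted as 1
fn, tp = cm[1]  # true class 1: missed survivors, correctly predicted survivors

precision = tp / (tp + fp)  # of those predicted to survive, how many did
recall = tp / (tp + fn)     # of those who survived, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.94 recall=0.97 f1=0.96
```

These values agree with the class-1 row of the classification report.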



**Building Support Vector Machine**

In [14]:
SupVctr_T2 = SVC()

In [15]:
SupVctr_T2.fit(X_train,y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [16]:
SVMPred_train = SupVctr_T2.predict(X_train)

In [17]:
SVMPred = SupVctr_T2.predict(X_validation)

In [18]:
print(confusion_matrix(y_train,SVMPred_train))
print('\n')
print(classification_report(y_train,SVMPred_train))

[[517  32]
 [ 67 273]]


             precision    recall  f1-score   support

          0       0.89      0.94      0.91       549
          1       0.90      0.80      0.85       340

avg / total       0.89      0.89      0.89       889



In [19]:
print(confusion_matrix(y_validation,SVMPred))
print('\n')
print(classification_report(y_validation,SVMPred))

[[107  35]
 [ 25  41]]


             precision    recall  f1-score   support

          0       0.81      0.75      0.78       142
          1       0.54      0.62      0.58        66

avg / total       0.72      0.71      0.72       208



**Building Support Vector Machine with grid search**

**Grid search**

Finding the right hyperparameters (such as which C or gamma values to use) is tricky. A simple approach is to define a 'grid' of candidate values, try every combination, and keep the one that scores best. This is called a grid search, and it is common enough that Scikit-learn provides it as GridSearchCV. The CV stands for cross-validation.

GridSearchCV takes a dictionary that describes the parameters that should be tried and a model to train. The grid of parameters is defined as a dictionary, where the keys are the parameters and the values are the settings to be tested. 
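
As a hedged illustration of how the grid is expanded, the sketch below runs `GridSearchCV` on a small synthetic dataset (generated with `make_classification`, not the Titanic data). It also passes the grid as a list of dictionaries, which `GridSearchCV` accepts: `gamma` is ignored by the linear kernel, so in a single combined dictionary with 5 values of C, 5 of gamma, and 2 kernels, the 25 linear candidates are only 5 distinct models. The grid values and `scoring="f1"` here are assumptions chosen for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic binary classification problem, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# One sub-grid per kernel, so gamma is searched only where it has an effect.
param_grid_demo = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.1, 0.01]},
]

# cv=3 gives 3-fold cross-validation; scoring="f1" ranks candidates by F1.
search = GridSearchCV(SVC(), param_grid_demo, cv=3, scoring="f1")
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])  # 3 linear + 6 rbf
print(n_candidates, search.best_params_)
```

Because `refit=True` by default, the fitted `search` object can be used directly for `predict`, which is what the cells below do with `grid`.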

In [20]:
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['linear','rbf']} 

In [21]:
grid = GridSearchCV(SVC(),param_grid, verbose= 2)
grid.fit(X_train,y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits
[CV] C=0.1, gamma=1, kernel=linear ...................................
[CV] .......................... C=0.1, gamma=1, kernel=linear -   0.2s
[CV] C=0.1, gamma=1, kernel=linear ...................................
[CV] .......................... C=0.1, gamma=1, kernel=linear -   0.0s
[CV] C=0.1, gamma=1, kernel=linear ...................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s


[CV] .......................... C=0.1, gamma=1, kernel=linear -   1.1s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ............................. C=0.1, gamma=1, kernel=rbf -   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ............................. C=0.1, gamma=1, kernel=rbf -   0.1s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ............................. C=0.1, gamma=1, kernel=rbf -   0.0s
[CV] C=0.1, gamma=0.1, kernel=linear .................................
[CV] ........................ C=0.1, gamma=0.1, kernel=linear -   0.2s
[CV] C=0.1, gamma=0.1, kernel=linear .................................
[CV] ........................ C=0.1, gamma=0.1, kernel=linear -   0.0s
[CV] C=0.1, gamma=0.1, kernel=linear .................................
[CV] ........................ C=0.1, gamma=0.1, kernel=linear -   2.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] .

[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed: 53.2min finished





GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 'kernel': ['linear', 'rbf']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)

In [24]:
grid.best_params_

{'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}

In [25]:
grid_predictions_train = grid.predict(X_train)

In [26]:
print(confusion_matrix(y_train,grid_predictions_train))
print('\n')
print(classification_report(y_train,grid_predictions_train))

[[489  60]
 [ 65 275]]


             precision    recall  f1-score   support

          0       0.88      0.89      0.89       549
          1       0.82      0.81      0.81       340

avg / total       0.86      0.86      0.86       889



In [27]:
grid_predictions = grid.predict(X_validation)

In [28]:
print(confusion_matrix(y_validation,grid_predictions))
print('\n')
print(classification_report(y_validation,grid_predictions))

[[127  15]
 [  3  63]]


             precision    recall  f1-score   support

          0       0.98      0.89      0.93       142
          1       0.81      0.95      0.88        66

avg / total       0.92      0.91      0.92       208



**Building Decision Trees**

In [29]:
dtree = DecisionTreeClassifier()

In [30]:
dtree.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [31]:
DT_pred_train = dtree.predict(X_train)

In [32]:
print(confusion_matrix(y_train, DT_pred_train))
print('\n')
print(classification_report(y_train, DT_pred_train))

[[547   2]
 [ 14 326]]


             precision    recall  f1-score   support

          0       0.98      1.00      0.99       549
          1       0.99      0.96      0.98       340

avg / total       0.98      0.98      0.98       889



In [33]:
DT_pred = dtree.predict(X_validation)

In [34]:
print(confusion_matrix(y_validation, DT_pred))
print('\n')
print(classification_report(y_validation, DT_pred))

[[118  24]
 [ 16  50]]


             precision    recall  f1-score   support

          0       0.88      0.83      0.86       142
          1       0.68      0.76      0.71        66

avg / total       0.82      0.81      0.81       208



**Building Random Forest**

In [35]:
RanForst = RandomForestClassifier(n_estimators=200)

In [36]:
RanForst.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=200, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [37]:
RanFor_Predict_train = RanForst.predict(X_train)

In [38]:
print(confusion_matrix(y_train,RanFor_Predict_train))
print('\n')
print(classification_report(y_train,RanFor_Predict_train))

[[543   6]
 [ 10 330]]


             precision    recall  f1-score   support

          0       0.98      0.99      0.99       549
          1       0.98      0.97      0.98       340

avg / total       0.98      0.98      0.98       889



In [39]:
RanFor_Predict = RanForst.predict(X_validation)

In [40]:
print(confusion_matrix(y_validation,RanFor_Predict))
print('\n')
print(classification_report(y_validation,RanFor_Predict))

[[125  17]
 [ 13  53]]


             precision    recall  f1-score   support

          0       0.91      0.88      0.89       142
          1       0.76      0.80      0.78        66

avg / total       0.86      0.86      0.86       208
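
The model comparison can also be done programmatically instead of by reading the reports. The sketch below recomputes the support-weighted F1 (the `avg / total` row) from each validation confusion matrix printed above and picks the largest. The helper `weighted_f1` and the dictionary labels are hypothetical names introduced for this illustration.

```python
import numpy as np

def weighted_f1(cm):
    """Support-weighted mean of per-class F1 from a confusion matrix
    (rows = true classes, columns = predicted classes)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    support = cm.sum(axis=1)    # TP + FN per class
    predicted = cm.sum(axis=0)  # TP + FP per class
    f1 = 2 * tp / (support + predicted)  # same as 2TP / (2TP + FP + FN)
    return float((f1 * support).sum() / support.sum())

# Validation confusion matrices copied from the outputs above.
validation_cms = {
    "Logistic Regression": [[138, 4], [2, 64]],
    "SVM (defaults)": [[107, 35], [25, 41]],
    "SVM (grid search)": [[127, 15], [3, 63]],
    "Decision Tree": [[118, 24], [16, 50]],
    "Random Forest": [[125, 17], [13, 53]],
}

scores = {name: weighted_f1(cm) for name, cm in validation_cms.items()}
best_model = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:<20} {score:.2f}")
print("best:", best_model)
```

Logistic Regression comes out on top with a validation F1 of about 0.97, which is why it is the model evaluated on the test set.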



**Report performance on the test set using the model with the highest validation F1 score, Logistic Regression**

In [46]:
LogRegPred_test = LogisticReg_T2.predict(X_test)

In [47]:
print(confusion_matrix(y_test,LogRegPred_test))
print('\n')
print(classification_report(y_test,LogRegPred_test))

[[117   6]
 [  6  80]]


             precision    recall  f1-score   support

          0       0.95      0.95      0.95       123
          1       0.93      0.93      0.93        86

avg / total       0.94      0.94      0.94       209

