# 1. Load libraries and dataset

There are 60+ predictive modelling algorithms to choose from. We must understand the type of problem and the solution requirements to narrow these down to a select few models that we can evaluate. Our problem is a **classification** problem: we want to identify the relationship between the output (Survived or not) and the other variables, or features. Because we train our model on a dataset in which the outcome is already known, this falls under the category of machine learning called **supervised learning**. With these two criteria, **Supervised Learning** plus **Classification**, we can narrow down our choice of models to:
- Logistic Regression
- KNN or k-Nearest Neighbors
- Support Vector Machines
- Naive Bayes classifier
- Decision Tree
- Random Forest
- Perceptron
- Artificial Neural Network
- RVM or Relevance Vector Machine

In [1]:
#load libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
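# Note: sklearn.cross_validation (and sklearn.grid_search below) were removed in
# scikit-learn 0.20; this notebook targets an older release.
from sklearn.cross_validation import KFold
from sklearn import cross_validation, metrics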
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.grid_search import GridSearchCV

import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4


#load dataset
train = pd.read_csv(r'C:\Users\LW130003\Documents\GitHub\titanic\train_modified.csv')
test = pd.read_csv(r'C:\Users\LW130003\Documents\GitHub\titanic\test_modified.csv')

train.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 23 columns):
Pclass                      891 non-null int64
Survived                    891 non-null float64
IsAlone                     891 non-null int64
Embarked_C                  891 non-null int64
Embarked_Q                  891 non-null int64
Embarked_S                  891 non-null int64
Sex_female                  891 non-null int64
Sex_male                    891 non-null int64
AgeBand_(-0.08, 16.0]       891 non-null int64
AgeBand_(16.0, 32.0]        891 non-null int64
AgeBand_(32.0, 48.0]        891 non-null int64
AgeBand_(48.0, 64.0]        891 non-null int64
AgeBand_(64.0, 80.0]        891 non-null int64
FareBand_(-0.512, 102.4]    891 non-null int64
FareBand_(102.4, 204.8]     891 non-null int64
FareBand_(204.8, 307.2]     891 non-null int64
FareBand_(307.2, 409.6]     891 non-null int64
FareBand_(409.6, 512.0]     891 non-null int64
Title_Master                891 non-null int64
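
The column names in the listing above (`Sex_female`, `AgeBand_(16.0, 32.0]`, and so on) suggest that the categorical and banded features were one-hot encoded before this notebook was run. The preprocessing itself is not shown here, so the following is only a minimal sketch of how such columns could be produced, assuming `pd.cut` with five equal-width bins followed by `pd.get_dummies`. The toy values in `raw` are hypothetical and are not taken from the Titanic files.

```python
import pandas as pd

# Toy passengers (hypothetical values, for illustration only).
raw = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [0, 20, 40, 60, 80],
})

# Cut Age into 5 equal-width bands (assumed preprocessing step).
raw["AgeBand"] = pd.cut(raw["Age"], 5)

# One-hot encode Sex and AgeBand; each category becomes its own 0/1 column.
encoded = pd.get_dummies(raw[["Sex", "AgeBand"]],
                         columns=["Sex", "AgeBand"], dtype=int)

print(encoded.columns.tolist())
```

Each row then has exactly one 1 among the `Sex_*` columns and one among the `AgeBand_*` columns, which matches the all-integer layout reported by `train.info()`.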


# 2. Generic function to train and test model

In [2]:
#Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome, feature_importance = True):
    #Fit the model:
    model.fit(data[predictors],data[outcome])
    #Make predictions on training set:
    predictions = model.predict(data[predictors])
    #Print accuracy on the training set (accuracy_score expects y_true first)
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))
    #Perform k-fold cross-validation with 5 folds
    kf = KFold(data.shape[0], n_folds=5)
    error = []
    
    for train, test in kf:
        # Filter training data
        train_predictors = (data[predictors].iloc[train,:])
        # The target we're using to train the algorithm.
        train_target = data[outcome].iloc[train] 
        # Training the algorithm using the predictors and target.
        model.fit(train_predictors, train_target)
        #Record the accuracy score from each cross-validation fold
        error.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test]))
    
    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))
    cv = np.mean(error)
    #Fit the model again on the full data so that it can be referred to outside the function:
    model.fit(data[predictors],data[outcome])

    return accuracy, cv
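
The function above relies on the `KFold(n, n_folds=5)` interface from `sklearn.cross_validation`. In scikit-learn 0.18 and later the same tools live in `sklearn.model_selection`. As a rough sketch of the same two measurements (training accuracy and mean 5-fold score) with that newer API, assuming a synthetic dataset from `make_classification` in place of the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the Titanic features (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = LogisticRegression(max_iter=1000)

# Training-set accuracy, as in the first half of classification_model.
train_accuracy = model.fit(X, y).score(X, y)

# 5-fold cross-validation without shuffling, mirroring KFold(n, n_folds=5).
folds = KFold(n_splits=5)
cv_scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")

print("Accuracy : {0:.3%}".format(train_accuracy))
print("Cross-Validation Score : {0:.3%}".format(cv_scores.mean()))
```

`cross_val_score` fits clones of the estimator, so `model` keeps the fit on the full data and no final refit is needed.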

# 3. Train and Predict

In [3]:
#Table to collect the training accuracy and CV score of each model
algorithm = pd.DataFrame({
    'Model' : ['Logistic Regression', 'KNN', 'Support Vector Machines', 
               'Naive Bayes classifier', 'Decision Tree', 'Random Forest' ,
               'Perceptron' , 'Stochastic Gradient Descent', 'Linear SVC'
              ],
    'Accuracy' : np.nan,
    'CV Score' : np.nan
},  columns=['Model', 'Accuracy', 'CV Score'])

models = [LogisticRegression(), KNeighborsClassifier(), SVC(), GaussianNB(), 
          DecisionTreeClassifier(), RandomForestClassifier(), Perceptron(), 
          SGDClassifier(), LinearSVC()]
outcome_var = 'Survived'
predictor_var = train.columns.tolist()
predictor_var.remove('Survived')

for i, model in enumerate(models):
    print(model)
    acc, cvs = classification_model(model, train,predictor_var,outcome_var)
    # Use .loc to avoid chained assignment (source of SettingWithCopyWarning)
    algorithm.loc[i, 'Accuracy'] = acc
    algorithm.loc[i, 'CV Score'] = cvs
    print('-'*40)
algorithm.sort_values(by='CV Score', ascending=False)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Accuracy : 80.920%
Cross-Validation Score : 80.359%
----------------------------------------
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
Accuracy : 83.838%
Cross-Validation Score : 79.911%


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


----------------------------------------
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Accuracy : 80.022%
Cross-Validation Score : 78.226%
----------------------------------------
GaussianNB(priors=None)
Accuracy : 74.860%
Cross-Validation Score : 75.091%
----------------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')




Accuracy : 84.624%
Cross-Validation Score : 79.910%
----------------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
Accuracy : 84.400%
Cross-Validation Score : 78.679%
----------------------------------------
Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,
      max_iter=None, n_iter=None, n_jobs=1, penalty=None, random_state=0,
      shuffle=True, tol=None, verbose=0, warm_start=False)
Accuracy : 80.022%
Cross-Validation Score : 78.010%




----------------------------------------
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)
Accuracy : 78.900%
Cross-Validation Score : 65.875%
----------------------------------------
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
Accuracy : 81.145%
Cross-Validation Score : 79.015%
----------------------------------------




|   | Model | Accuracy | CV Score |
|---|-------|----------|----------|
| 0 | Logistic Regression | 0.809203 | 0.80359 |
| 1 | KNN | 0.838384 | 0.799115 |
| 4 | Decision Tree | 0.84624 | 0.799102 |
| 8 | Linear SVC | 0.811448 | 0.790145 |
| 5 | Random Forest | 0.843996 | 0.786787 |
| 2 | Support Vector Machines | 0.800224 | 0.782261 |
| 6 | Perceptron | 0.800224 | 0.780095 |
| 3 | Naive Bayes classifier | 0.748597 | 0.750907 |
| 7 | Stochastic Gradient Descent | 0.789001 | 0.658747 |
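
One way to read the table above is to compare each model's training accuracy with its cross-validation score. The sketch below copies the reported numbers into a new DataFrame and adds a `Gap` column; `results`, `ranked`, and `Gap` are names introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# Scores copied from the results table above.
results = pd.DataFrame({
    "Model": ["Logistic Regression", "KNN", "Decision Tree", "Linear SVC",
              "Random Forest", "Support Vector Machines", "Perceptron",
              "Naive Bayes classifier", "Stochastic Gradient Descent"],
    "Accuracy": [0.809203, 0.838384, 0.84624, 0.811448, 0.843996,
                 0.800224, 0.800224, 0.748597, 0.789001],
    "CV Score": [0.80359, 0.799115, 0.799102, 0.790145, 0.786787,
                 0.782261, 0.780095, 0.750907, 0.658747],
})

# Gap between training accuracy and the mean cross-validation score.
results["Gap"] = results["Accuracy"] - results["CV Score"]

# Largest gap first.
ranked = results.sort_values(by="Gap", ascending=False).reset_index(drop=True)
print(ranked)
```

On these numbers the largest gap belongs to Stochastic Gradient Descent (about 13 percentage points), followed by Random Forest and Decision Tree. A large gap can hint at overfitting or at unstable training, though a single run is not enough to tell which.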
