# HW-6 (answer key: Titanic-SVM Classification)


[Titanic competition from Kaggle](https://www.kaggle.com/c/titanic).

Instructor: [Pedram Jahangiry](https://github.com/PJalgotrader)

In [42]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
sns.set()  #if you want to use seaborn themes with matplotlib functions

### Data Preprocessing

In [43]:
df = pd.read_csv('titanic_train_clean.csv')
df_test = pd.read_csv('titanic_test_clean.csv')

In [44]:
df.head()

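| | PassengerId | Survived | Age | SibSp | Parch | Fare | Cabin_known | Pclass_2 | Pclass_3 | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 1 | 0 | 1 |
| 1 | 2 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 4 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 5 | 0 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 1 | 0 | 1 |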


In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 12 columns):
PassengerId    889 non-null int64
Survived       889 non-null int64
Age            889 non-null float64
SibSp          889 non-null int64
Parch          889 non-null int64
Fare           889 non-null float64
Cabin_known    889 non-null int64
Pclass_2       889 non-null int64
Pclass_3       889 non-null int64
Sex_male       889 non-null int64
Embarked_Q     889 non-null int64
Embarked_S     889 non-null int64
dtypes: float64(2), int64(10)
memory usage: 83.4 KB


# Question 1: SVM (Classification)

1- Use the SVC class from the sklearn package (with the default properties) and report the estimated accuracy of your model. Note that because the target variable in the test set is unobservable, you need to estimate this accuracy_test by applying cross validation to the train set (try K=5 and K=10). 

2- Use Grid search to find the optimal C and gamma and then rerun the model with the optimized parameters. 
Depending on the speed of your computer, this step may take a while. If your memory is 4 or 8GB, try to limit the set of your parameters to something like this: 

param_grid = {'C': [0.1,10], 'gamma': [1,0.01]} 

3- Report the accuracy_CV for the optimized model. (k=5)


In [46]:
# Defining the target and feature space for both train and test set. Note that the target for the test set is unknown.

X_train= df.drop(['PassengerId','Survived'], axis=1)
y_train= df['Survived']

X_test= df_test.drop('PassengerId',axis=1)

In [47]:
X_train.head()

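| | Age | SibSp | Parch | Fare | Cabin_known | Pclass_2 | Pclass_3 | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 1 | 0 | 1 |
| 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 1 | 0 | 1 |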


In [48]:
y_train.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

## Scaling the features: 

The following step is very important for distance-based classifiers such as SVM and KNN, because a variable measured on a larger scale has a larger effect on the distance between observations. 

In general we need to rescale our variables. If we don't rescale a large-scale feature such as `Fare` in this example, it can dominate the distance calculation, and the model may end up predicting the same class for almost every observation. We have two options now:

1. Rescale the entire data set using StandardScaler
2. Rescale the individual features.

For SVM it is highly recommended that we use the first method, and this is what I do next: the scaler is fit on the train set and then applied, unchanged, to the test set. 
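To make the scale argument concrete, here is a minimal sketch comparing Euclidean distances before and after standardization. The three passengers and their `[Age, Fare]` values are made up for illustration; they are not rows from the data set, and the helper `dist` is just a convenience name.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical passengers (made-up values): columns are [Age, Fare]
X_demo = np.array([
    [22.0,   7.25],   # passenger 0
    [60.0,   8.05],   # passenger 1: very different age, similar fare
    [23.0, 263.00],   # passenger 2: similar age, very different fare
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# On the raw scale, the fare difference swamps the age difference
raw_01 = dist(X_demo[0], X_demo[1])
raw_02 = dist(X_demo[0], X_demo[2])

# After standardization, both features contribute on a comparable scale
X_demo_sc = StandardScaler().fit_transform(X_demo)
sc_01 = dist(X_demo_sc[0], X_demo_sc[1])
sc_02 = dist(X_demo_sc[0], X_demo_sc[2])

print('raw distances   :', round(raw_01, 2), round(raw_02, 2))
print('scaled distances:', round(sc_01, 2), round(sc_02, 2))
```

On the raw data, passenger 2 looks far more distant from passenger 0 than passenger 1 does, purely because fares span a wider numeric range than ages. After scaling, the two distances are of similar size.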

In [49]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test) 

In [50]:
X_train_sc[0:2,:]

array([[-0.53167023,  0.43135024, -0.47432585, -0.50023975, -0.5422472 ,
        -0.51087465,  0.90032807,  0.73534203, -0.30794088,  0.61679395],
       [ 0.68023223,  0.43135024, -0.47432585,  0.78894661,  1.84417735,
        -0.51087465, -1.11070624, -1.35991138, -0.30794088, -1.62128697]])

In [51]:
X_train.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 889.0 | 29.019314 | 13.209814 | 0.42 | 22.0 | 26.0 | 36.5 | 80.0 |
| SibSp | 889.0 | 0.524184 | 1.103705 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
| Parch | 889.0 | 0.382452 | 0.806761 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| Fare | 889.0 | 32.096681 | 49.697504 | 0.0 | 7.8958 | 14.4542 | 31.0 | 512.3292 |
| Cabin_known | 889.0 | 0.227222 | 0.419273 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Pclass_2 | 889.0 | 0.206974 | 0.405365 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Pclass_3 | 889.0 | 0.552306 | 0.497536 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| Sex_male | 889.0 | 0.649044 | 0.477538 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| Embarked_Q | 889.0 | 0.086614 | 0.281427 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Embarked_S | 889.0 | 0.724409 | 0.447063 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |


## Train the model

In [52]:
from sklearn.svm import SVC

In [53]:
SVM_classification = SVC(gamma='auto')
SVM_classification.fit(X_train_sc, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [54]:
y_hat = SVM_classification.predict(X_test_sc)

## Evaluation

We cannot evaluate the model directly on the test set because it is unlabeled: we don't have the actual y there. However, we can apply cross validation to the train set to estimate the accuracy on the test set. 

In [55]:
# defining our own confusion matrix function
from sklearn.metrics import confusion_matrix
def my_confusion_matrix(y, y_hat):
    cm = confusion_matrix(y, y_hat)
    TN, FP, FN, TP = cm[0,0], cm[0,1], cm[1,0], cm[1,1]
    accuracy = round((TP+TN) / (TP+ FP+ FN+ TN) ,2)
    precision = round( TP / (TP+FP),2)
    recall = round( TP / (TP+FN),2)
    cm_labled = pd.DataFrame(cm, index=['Actual : 0 ','Actual : 1'], columns=['Predict : 0','Predict :1 '])
    print('\n')
    print('Accuracy = {}'.format(accuracy))
    print('Precision = {}'.format(precision))
    print('Recall = {}'.format(recall))
    print("-----------------------------------------")
    return cm_labled
 

We can report the confusion matrix for the training set! However, this is not the question of interest. 

In [56]:
my_confusion_matrix(y_train,SVM_classification.predict(X_train_sc) )



Accuracy = 0.84
Precision = 0.89
Recall = 0.67
-----------------------------------------


| | Predict : 0 | Predict : 1 |
|---|---|---|
| Actual : 0 | 520 | 29 |
| Actual : 1 | 111 | 229 |
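As a quick check of the arithmetic, the three metrics can be recomputed by hand from the four counts in the training-set confusion matrix above. This is only a worked example of the formulas used in `my_confusion_matrix`; the counts are copied from that output.

```python
# Counts copied from the training-set confusion matrix above
TN, FP, FN, TP = 520, 29, 111, 229

accuracy = (TP + TN) / (TP + FP + FN + TN)   # share of all predictions that are correct
precision = TP / (TP + FP)                   # share of predicted survivors who survived
recall = TP / (TP + FN)                      # share of actual survivors the model found

print(round(accuracy, 2), round(precision, 2), round(recall, 2))  # 0.84 0.89 0.67
```

The low recall relative to precision reflects the 111 survivors that the model classified as non-survivors.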


###  Cross validation

Now let's try to get an estimate for the accuracy of our model in the test set by applying cross validation technique to the training set.

In [57]:
from sklearn.model_selection import cross_val_score

In [58]:
Accuracy_CV=[]
K = (5,10)
for i in K:
    accuracy = cross_val_score(estimator = SVM_classification, X = X_train_sc, y = y_train, cv = i , scoring="accuracy" )
    Accuracy_CV.append(round(accuracy.mean(),3))

Accuracy_CV_df = pd.DataFrame(K, columns=['K'])
Accuracy_CV_df['Accuracy_CV']= Accuracy_CV

Accuracy_CV_df

| | K | Accuracy_CV |
|---|---|---|
| 0 | 5 | 0.826 |
| 1 | 10 | 0.83 |


So the estimated **accuracy_test** of the model is **0.826** using **5-fold** and **0.83** using **10-fold** cross validation. Note that these numbers are smaller than the **accuracy_train**, which is **0.84**. This is what we would expect, because the training accuracy is measured on the same data the model was fit on. 
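One optional refinement: in the cells above, the scaler was fit on the whole train set before cross validation, so each validation fold has already contributed to the scaler's mean and standard deviation. A minimal sketch of an alternative is to wrap the scaler and the classifier in a pipeline, so that scaling is re-fit inside every fold. The data here is synthetic (`make_classification`) and the names `X_demo`, `y_demo`, and `pipe` are placeholders, so the printed scores are not comparable to the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the unscaled training features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is fit only on the training part of each fold
pipe = make_pipeline(StandardScaler(), SVC(gamma='auto'))

cv_scores = {}
for k in (5, 10):
    scores = cross_val_score(pipe, X_demo, y_demo, cv=k, scoring='accuracy')
    cv_scores[k] = scores.mean()

for k, score in cv_scores.items():
    print('K = {}: accuracy_CV = {:.3f}'.format(k, score))
```

With the notebook's data, the same idea would mean passing the unscaled `X_train` to `cross_val_score` together with the pipeline. The difference in the estimate is usually small for a simple standardization step.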

### What are the parameters in general?

* **C**, in the ISLR formulation, represents the budget for misclassification on the training data. A small C value gives you low bias but high variance: low bias because you restrict the misclassification a lot. 

**IMPORTANT NOTE: in scikit-learn, the interpretation of C is completely reversed compared to ISLR!!** There, C stands for **Cost**. A large C in the sklearn setup means that you are penalizing the errors more strictly, so the margin will be narrower, i.e. the model tends to overfit (small bias, big variance).
https://scikit-learn.org/stable/modules/svm.html


* **gamma** is the free parameter of the radial basis function (rbf). Intuitively, the gamma parameter defines **how far** the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. The gamma parameter can be seen as the inverse of the radius of influence of samples selected by the model as support vectors. 

https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
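Both parameters can be illustrated with a small sketch. The first part evaluates the RBF kernel, K(x, x') = exp(-gamma * ||x - x'||^2), for two points one unit apart, to show how gamma controls the reach of a single example. The second part fits linear SVMs with different C values on synthetic, overlapping blobs and counts the support vectors. The data, the chosen values, and the names `k_by_gamma` and `n_sv_by_C` are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# --- gamma: kernel similarity of two points at distance 1 ---
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 0.0]])
k_by_gamma = {}
for g in (0.01, 0.1, 1):
    # exp(-g * 1^2): close to 1 for small gamma, decays quickly for large gamma
    k_by_gamma[g] = float(rbf_kernel(a, b, gamma=g)[0, 0])
    print('gamma = {:<5} K(a, b) = {:.3f}'.format(g, k_by_gamma[g]))

# --- C: number of support vectors on overlapping synthetic classes ---
X_demo, y_demo = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                            cluster_std=1.5, random_state=0)
n_sv_by_C = {}
for C in (0.01, 1, 100):
    clf = SVC(kernel='linear', C=C).fit(X_demo, y_demo)
    # n_support_ holds the number of support vectors per class
    n_sv_by_C[C] = int(clf.n_support_.sum())
    print('C = {:<6} support vectors = {}'.format(C, n_sv_by_C[C]))
```

A small gamma keeps the similarity near 1 even for distant points (far reach), while a large gamma makes it drop off quickly (close reach). In the sklearn convention, a small C tolerates many margin violations, which typically shows up as a wide margin with many support vectors.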


# Gridsearch

Finding the right parameters (like what C or gamma values to use) is a tricky task! But luckily, we can be a little lazy and just try a bunch of combinations and see what works best! This idea of creating a 'grid' of parameters and trying out all the possible combinations is called a grid search. The method is common enough that scikit-learn has this functionality built in with GridSearchCV! The CV stands for cross-validation.

GridSearchCV takes a dictionary that describes the parameters that should be tried and a model to train. The grid of parameters is defined as a dictionary, where the keys are the parameters and the values are the settings to be tested. 


One of the great things about GridSearchCV is that it is a meta-estimator. It takes an estimator like SVC and creates a new estimator that behaves exactly the same, in this case like a classifier. You should add refit=True and set verbose to whatever number you want: the higher the number, the more verbose the output (verbose just means the text output describing the process).
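The meta-estimator idea can be sketched on a toy problem. The example below uses the reduced grid suggested in the question, fits it on synthetic data (an assumption, so that the block runs on its own), and then uses the fitted search object like any other classifier. The names `X_demo`, `y_demo`, `search`, and `results` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Keys are SVC parameter names, values are the settings to try
small_grid = {'C': [0.1, 10], 'gamma': [1, 0.01]}

search = GridSearchCV(SVC(), small_grid, cv=5, refit=True, verbose=0)
search.fit(X_demo, y_demo)

# One row per parameter combination, with its mean cross-validated accuracy
results = pd.DataFrame(search.cv_results_)[
    ['param_C', 'param_gamma', 'mean_test_score', 'rank_test_score']
]
print(results)

# Because refit=True, the search object predicts with the best model found
preds = search.predict(X_demo[:5])
print(search.best_params_, round(search.best_score_, 3), preds)
```

With `refit=True`, the best combination is re-trained on the full data passed to `fit`, which is why `predict` can be called directly on the search object.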

In [69]:
param_grid = {'C': [0.1,1, 10,100], 'gamma': [1,0.1,0.01], 'kernel': ['rbf','linear']} 

In [70]:
from sklearn.model_selection import GridSearchCV

In [71]:
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=0, cv=10)

In [72]:
# This will take a long time to be computed!! Be patient! 
grid.fit(X_train_sc,y_train)

GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='warn', n_jobs=None,
             param_grid={'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01],
                         'kernel': ['rbf', 'linear']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
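To get a feel for why this step is slow, the number of model fits can be counted with `ParameterGrid`. The sketch below also shows one possible way to shrink the search: since gamma is not used by the linear kernel, the grid can be written as a list of two dictionaries so that gamma is only varied for rbf. The `split_grid` variant is a suggestion, not what was run above.

```python
from sklearn.model_selection import ParameterGrid

# The grid used above: every C is paired with every gamma and every kernel
full_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01],
             'kernel': ['rbf', 'linear']}

# Alternative (an assumption for illustration): vary gamma only for rbf
split_grid = [
    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01]},
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
]

n_full = len(ParameterGrid(full_grid))     # 4 * 3 * 2 = 24 combinations
n_split = len(ParameterGrid(split_grid))   # 4 * 3 + 4 = 16 combinations

cv = 10
print('fits with full grid :', n_full * cv)    # 240
print('fits with split grid:', n_split * cv)   # 160
```

`GridSearchCV` accepts either form for `param_grid`. With `refit=True`, one extra fit on the whole train set is added at the end in both cases.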

You can inspect the best parameters found by GridSearchCV in the best_params_ attribute, and the best estimator in the best\_estimator_ attribute:

In [74]:
grid.best_params_

{'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}

In [75]:
grid.best_estimator_

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

# Reporting the accuracy_CV for the optimized model. (k=10)

In [76]:
grid_predictions = grid.predict(X_test_sc)

In [77]:
accuracy = cross_val_score(estimator = SVC(C=100, gamma=0.01, kernel='rbf'), X = X_train_sc, y = y_train, cv = 10 , scoring="accuracy" )
Accuracy_CV = (round(accuracy.mean(),3))

Accuracy_CV

0.831

### Question 2: Kaggle csv submission

In [78]:
predictions = pd.DataFrame(grid_predictions, columns=["Survived"])
predictions.head()

| | Survived |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


In [79]:
Kaggle_submission= pd.concat([df_test['PassengerId'],predictions], axis=1)

In [80]:
Kaggle_submission.head()

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 0 |


In [82]:
Kaggle_submission.to_csv("Titanic_SVM.csv", index=False)

### My score in Kaggle competition:

* Logistic regression: Score 0.77033 , rank 10,410
* KNN Classification : Score 0.77511 , rank 9,767 
* SVM Classification:  Score **0.78947** , rank **3,463**

So far, the SVM with C=100 and gamma=0.01 has outperformed the logistic regression and KNN models on the Kaggle leaderboard. 