### By: Anika Achary

In this project, I will build an SVM model and apply GridSearchCV to find the best kernel and C for a dataset on passengers of the Titanic. I will then take the best parameters from the search, build another SVC model with them, and compute the confusion matrix, classification report, and accuracy.

I will also use GridSearchCV to find the best number of trees for a Random Forest, and then use the resulting Random Forest Classifier to compute the confusion matrix, classification report, and accuracy.

The purpose of the model is to predict passenger survival from the Pclass, Gender, Age, SibSp, Parch, Fare, and Embarked columns. Most of these are numeric; Gender and Embarked are categorical and are encoded as numbers before modeling.

In [6]:
import pandas as pd

#### Loading the Dataset

In [8]:
df = pd.read_csv('Titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [9]:
df.shape

(891, 12)

The columns PassengerId, Name, Ticket, and Cabin are not useful for the model prediction, so we will drop them.

In [11]:
df = df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])
df.head()

Unnamed: 0,Survived,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [12]:
df.isnull().sum()

Survived      0
Pclass        0
Gender        0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

We have missing values in the Age and Embarked columns that have to be dealt with before we can proceed with SVM (Cabin also had missing values, but it was dropped above). I did some digging in the sklearn guide and found that SimpleImputer replaces missing values with a statistic of the column, such as the median or the most frequent value. I thought this would be a good way to deal with the missing values, so I used the median for Age and the most frequent value for Embarked.
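
As a minimal sketch of how the two strategies behave, here is a toy example on made-up values. The `toy` frame is hypothetical and not part of the Titanic data; only `SimpleImputer` and its `strategy` argument come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values with one gap in each column (not the real dataset)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

# The median of the observed ages 22, 26, 35 is 26, so the gap becomes 26.0
age_imputer = SimpleImputer(strategy="median")
toy["Age"] = age_imputer.fit_transform(toy[["Age"]]).ravel()

# "S" is the most frequent observed port, so the gap becomes "S"
port_imputer = SimpleImputer(strategy="most_frequent")
toy["Embarked"] = port_imputer.fit_transform(toy[["Embarked"]]).ravel()

print(toy)
```

Each imputer is fitted on its own column, so the fill value for Embarked is learned from Embarked and not from Age.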

In [14]:
from sklearn.impute import SimpleImputer
imputer_age = SimpleImputer(strategy='median')
imputer_embarked = SimpleImputer(strategy='most_frequent')

df['Age'] = imputer_age.fit_transform(df[['Age']]).ravel()
# Impute Embarked from its own column (not Age); ravel() flattens the (n, 1) result
df['Embarked'] = imputer_embarked.fit_transform(df[['Embarked']]).ravel()
#df.isnull().sum()

Next we have to change the categorical variables into numerical ones. To do this I will use LabelEncoder.

In [16]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df['Gender'] = label_encoder.fit_transform(df['Gender'])
df['Embarked'] = label_encoder.fit_transform(df['Embarked'])
df.head()

Unnamed: 0,Survived,Pclass,Gender,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,2
1,1,1,0,38.0,1,0,71.2833,0
2,1,3,0,26.0,0,0,7.925,2
3,1,1,0,35.0,1,0,53.1,2
4,0,3,1,35.0,0,0,8.05,2
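
LabelEncoder assigns integer codes in the sorted order of the labels it sees. The following is a small sketch on a made-up list of port codes (the `ports` list is hypothetical, not the real column) that shows how to inspect the mapping through the `classes_` attribute:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of port codes, not the real Embarked column
ports = ["S", "C", "Q", "S"]

encoder = LabelEncoder()
codes = encoder.fit_transform(ports)

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)         # {'C': 0, 'Q': 1, 'S': 2}
print(codes.tolist())  # [2, 0, 1, 2]
```

Note that these codes imply an ordering (C < Q < S) that has no real meaning for ports; that is a property of label encoding in general, not something specific to this notebook.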


In [17]:
x = df.drop(columns=['Survived'])
y = df['Survived']

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state=42)

#### Applying GridSearchCV

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

parameters = {
    'C': [1, 5, 10],
    'kernel': ['linear', 'rbf', 'poly']
}

svc = SVC()
grid_search_SVC = GridSearchCV(svc, param_grid = parameters, cv=3, scoring='accuracy', n_jobs=-1)

# fit() trains every parameter combination with 3-fold cross-validation
grid_search_SVC.fit(x_train, y_train)

best_params_svc = grid_search_SVC.best_params_
best_svc_model = grid_search_SVC.best_estimator_

print(best_params_svc)
print(best_svc_model)

Based on the GridSearchCV results, the best parameters are a linear kernel and C = 1, so we build another SVC model with those settings.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC
from sklearn import metrics

svc = SVC(kernel='linear', C=1)
svc.fit(x_train, y_train)  # fit the model on the training data

y_pred_svc = svc.predict(x_test)  # predict on the held-out test set

print(confusion_matrix(y_test,y_pred_svc))
print(classification_report(y_test,y_pred_svc))

print("Accuracy:", metrics.accuracy_score(y_test, y_pred_svc))
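
To make the confusion matrix easier to read, here is a small sketch on made-up labels (the `y_true` and `y_hat` lists are hypothetical, not model output). In sklearn's layout, rows are the true classes and columns are the predicted classes, and accuracy is the share of predictions on the diagonal.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration (0 = did not survive, 1 = survived)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_hat)
# For binary labels, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

print(cm.tolist())                     # [[3, 1], [1, 3]]
print((tn + tp) / cm.sum())            # 0.75
print(accuracy_score(y_true, y_hat))   # 0.75
```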

#### Using GridSearchCV to find the best number of trees for Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier 
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

param_grid_rf = {
    'n_estimators': [50, 100, 200, 300, 400, 500]
}

rf = RandomForestClassifier(random_state=42)

grid_search_rf = GridSearchCV(estimator=rf, param_grid=param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)

grid_search_rf.fit(x_train, y_train)

best_params_rf = grid_search_rf.best_params_
best_rf_model = grid_search_rf.best_estimator_

print(best_params_rf)
print(best_rf_model)

y_pred_rf = best_rf_model.predict(x_test)

confusion_matrix_rf = confusion_matrix(y_test, y_pred_rf)
class_report_rf = classification_report(y_test, y_pred_rf)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

print(confusion_matrix_rf)
print(class_report_rf)
print("Accuracy:", accuracy_rf)

### The Difference between GridSearchCV and RandomizedSearchCV

RandomizedSearchCV and GridSearchCV are both methods used for hyperparameter tuning.

GridSearchCV performs an exhaustive search over a specified parameter grid: it tries every possible combination of the listed hyperparameter values and scores each one with cross-validation. This guarantees that it finds the best combination within the grid, but it can be very slow when the grid is large, because the number of model fits is the number of combinations multiplied by the number of folds.
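
To see how the cost of a grid adds up, the SVC grid used earlier (three values of C, three kernels, cv=3) can be enumerated in plain Python. This is only a sketch of the counting, not GridSearchCV's internals; the names `combinations` and `n_fits` are made up for illustration.

```python
from itertools import product

# The same grid and fold count as the SVC search above
parameters = {
    "C": [1, 5, 10],
    "kernel": ["linear", "rbf", "poly"],
}
cv_folds = 3

# Every combination of one value per hyperparameter
names = list(parameters)
combinations = [dict(zip(names, values)) for values in product(*parameters.values())]

# Each combination is trained once per fold
n_fits = len(combinations) * cv_folds

print(len(combinations), n_fits)  # 9 27
```

Adding one more hyperparameter with three values would triple both numbers, which is why large grids get slow quickly.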

RandomizedSearchCV randomly samples a fixed number of combinations (set by n_iter) from a specified parameter space. It doesn't try every possible combination like GridSearchCV, but rather evaluates a random subset. Like GridSearchCV, it uses cross-validation to score each combination it picks. It is usually much faster than GridSearchCV when the parameter space is large. However, it might miss the best combination, since it does not evaluate all of them.
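
As a rough sketch of what a randomized search looks like in code, the example below tunes the number of trees for a Random Forest. It runs on synthetic data from `make_classification` so that it works without Titanic.csv; the data, the range 50 to 200, and `n_iter=5` are assumptions chosen for illustration, not settings from this project.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: 7 features, like the Titanic inputs)
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=42)

# Sample n_estimators from a range instead of listing every value;
# randint(50, 201) draws integers from 50 to 200 inclusive
param_distributions = {"n_estimators": randint(50, 201)}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # only 5 sampled candidates are evaluated
    cv=3,
    scoring="accuracy",
    random_state=42,   # makes the sampled candidates reproducible
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(len(random_search.cv_results_["params"]))  # 5
```

A full grid over the same range would have 151 candidates; here only 5 are trained, which is where the speedup comes from.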