# Titanic Data
### Hranush Sahradyan
### 09.12.2021

# Overview

The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

More on https://www.kaggle.com/c/titanic/data?select=test.csv

In [93]:
import pandas as pd
import numpy as np

In [94]:
train_df=pd.read_csv('train.csv')

In [95]:
train_df[10:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.0500,,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.2750,,S
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## Variable Notes

pclass: A proxy for socio-economic status (SES)

- 1st = Upper
- 2nd = Middle
- 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

sibsp: The dataset defines family relations in this way:

- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way:

- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.
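The overview mentions feature engineering, and the variable notes suggest two simple candidates. The following is a minimal sketch, not part of the original notebook: the function name `add_family_features` and the column names `FamilySize` and `AgeEstimated` are hypothetical, and the toy rows are made up for illustration.

```python
import pandas as pd

def add_family_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two illustrative engineered columns."""
    out = df.copy()
    # The passenger plus siblings/spouses plus parents/children aboard.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # Per the variable notes, estimated ages have the form xx.5.
    # Ages below 1 are fractional for a different reason, so exclude them.
    age = out["Age"]
    out["AgeEstimated"] = (age >= 1) & ((age % 1) == 0.5)
    return out

# Hypothetical rows, only for demonstration.
toy = pd.DataFrame({
    "Age": [4.0, 28.5, 0.42, None],
    "SibSp": [1, 0, 0, 1],
    "Parch": [1, 0, 2, 2],
})
engineered = add_family_features(toy)
print(engineered)
```

Whether such columns help would have to be checked with cross-validation; the models below use only the original six features.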

In [96]:
train_df=train_df[['Survived','Pclass','Sex','Age','SibSp','Parch','Fare']]

In [97]:
test_df=pd.read_csv('test.csv')
test_df=test_df[['Pclass','Sex','Age','SibSp','Parch','Fare']]
test_df['Sex']=[0 if x == 'female' else 1 for x in test_df['Sex']] 

In [98]:
train_df['Sex']=[0 if x == 'female' else 1 for x in train_df['Sex']] 


In [99]:
train_df.describe()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,0.647587,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,0.47799,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,0.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,1.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,1.0,38.0,1.0,0.0,31.0
max,1.0,3.0,1.0,80.0,8.0,6.0,512.3292


In [100]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int64  
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
dtypes: float64(2), int64(5)
memory usage: 48.9 KB


In [101]:
train_df['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64
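The class counts above also give a useful reference point for the cross-validation scores reported later. As a small worked example (using only the two counts printed above), the accuracy of a model that always predicts the majority class can be computed directly:

```python
# Class counts copied from the value_counts() output above.
died, survived = 549, 342
total = died + survived

survival_rate = survived / total
# Accuracy of a model that always predicts the majority class (0 = did not survive).
baseline_accuracy = max(died, survived) / total

print(f"survival rate:     {survival_rate:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy score from the grid searches should be read against this baseline of roughly 0.62.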

In [102]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Pclass  418 non-null    int64  
 1   Sex     418 non-null    int64  
 2   Age     332 non-null    float64
 3   SibSp   418 non-null    int64  
 4   Parch   418 non-null    int64  
 5   Fare    417 non-null    float64
dtypes: float64(2), int64(4)
memory usage: 19.7 KB


In [103]:
xTrain=train_df[['Pclass','Sex','Age','SibSp','Parch','Fare']]
yTrain=train_df['Survived']

In [104]:
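from sklearn.impute import KNNImputer
# Fill missing values (here: Age) from the 3 nearest rows, weighted by inverse distance.
imputer = KNNImputer(n_neighbors=3,weights='distance')
imputer.fit(xTrain)
xTrain=imputer.transform(xTrain)
# transform() returns a NumPy array, so rebuild the DataFrame and restore the column names.
xTrain=pd.DataFrame(xTrain)
xTrain.columns=['Pclass','Sex','Age','SibSp','Parch','Fare']
xTrain.info()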

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Pclass  891 non-null    float64
 1   Sex     891 non-null    float64
 2   Age     891 non-null    float64
 3   SibSp   891 non-null    float64
 4   Parch   891 non-null    float64
 5   Fare    891 non-null    float64
dtypes: float64(6)
memory usage: 41.9 KB


In [105]:
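# KNNImputer is already imported above. The imputer is re-fitted on the
# (already imputed) training features and then applied to the test set,
# which is missing values in Age and Fare.
imputer = KNNImputer(n_neighbors=3,weights='distance')
imputer.fit(xTrain)
test_df=imputer.transform(test_df)
# transform() returns a NumPy array, so rebuild the DataFrame and restore the column names.
test_df=pd.DataFrame(test_df)
test_df.columns=['Pclass','Sex','Age','SibSp','Parch','Fare']
test_df.info()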

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Pclass  418 non-null    float64
 1   Sex     418 non-null    float64
 2   Age     418 non-null    float64
 3   SibSp   418 non-null    float64
 4   Parch   418 non-null    float64
 5   Fare    418 non-null    float64
dtypes: float64(6)
memory usage: 19.7 KB
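To make the imputation step easier to follow, here is a minimal sketch on a toy matrix. The values are hypothetical and the variable names (`X_train`, `X_test`) are not from the notebook. It shows one common pattern: fit the imputer once on the training features and apply that same fitted object to both sets, so that test values are filled from training neighbours only.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrices (hypothetical values); np.nan marks a missing entry.
X_train = np.array([
    [1.0, 20.0],
    [2.0, 30.0],
    [3.0, np.nan],
    [4.0, 50.0],
])
X_test = np.array([
    [2.5, np.nan],
    [10.0, 40.0],
])

# Fit once on the training features, then reuse the fitted imputer for the test set.
imputer = KNNImputer(n_neighbors=3, weights="distance")
X_train_filled = imputer.fit_transform(X_train)
X_test_filled = imputer.transform(X_test)

print(X_train_filled)
print(X_test_filled)
```

With `weights="distance"`, closer neighbours contribute more to the filled value, so each imputed entry is a weighted average of the neighbours' observed values.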


## Finding best model with GridSearch

In [23]:
from sklearn.model_selection import GridSearchCV

### SVM

In [24]:
from sklearn.svm import SVC
svc=SVC()
parameters = {'kernel':('poly','rbf'), 'degree':[3,5],'C':[10,100,1000],'class_weight':('balanced',None)}

In [25]:
clf1=GridSearchCV(estimator=svc,param_grid=parameters)

In [26]:
clf1.fit(xTrain,yTrain)

GridSearchCV(estimator=SVC(),
             param_grid={'C': [10, 100, 1000],
                         'class_weight': ('balanced', None), 'degree': [3, 5],
                         'kernel': ('poly', 'rbf')})

In [27]:
clf1.best_score_

0.7968551879982424

In [28]:
clf1.best_params_

{'C': 1000, 'class_weight': None, 'degree': 3, 'kernel': 'rbf'}
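In `SVC`, `degree` is only used by the `poly` kernel and is ignored by the others, so the grid above evaluates each `rbf` candidate twice. As a hedged sketch of one alternative, `param_grid` may also be a list of dictionaries, one per kernel. The names `flat_grid` and `split_grid` are hypothetical; `ParameterGrid` is used here only to count candidates without fitting anything.

```python
from sklearn.model_selection import ParameterGrid

# The grid used above: every combination, including degree for the rbf kernel.
flat_grid = {
    "kernel": ["poly", "rbf"],
    "degree": [3, 5],
    "C": [10, 100, 1000],
    "class_weight": ["balanced", None],
}

# Alternative: one sub-grid per kernel, so degree is only varied for poly.
split_grid = [
    {"kernel": ["poly"], "degree": [3, 5], "C": [10, 100, 1000],
     "class_weight": ["balanced", None]},
    {"kernel": ["rbf"], "C": [10, 100, 1000],
     "class_weight": ["balanced", None]},
]

print(len(ParameterGrid(flat_grid)))   # candidates in the original grid
print(len(ParameterGrid(split_grid)))  # candidates after splitting by kernel
```

`split_grid` can be passed to `GridSearchCV(estimator=svc, param_grid=split_grid)` in the same way as the dictionary version.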

### Gaussian Process

In [113]:
from sklearn.gaussian_process import GaussianProcessClassifier

In [114]:
gpc=GaussianProcessClassifier()
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process.kernels import Matern
from sklearn.gaussian_process.kernels import WhiteKernel
kernel1=1.0 * RBF(1.0)
kernel2=1.0 * Matern(length_scale=1.0, nu=1.5)
kernel3=WhiteKernel(noise_level=0.5)
parameters={'kernel':[kernel1,kernel2,kernel3]}

In [115]:
clf2=GridSearchCV(estimator=gpc,param_grid=parameters)

In [116]:
clf2.fit(xTrain,yTrain)

ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  _check_optimize_result("lbfgs", opt_res)


GridSearchCV(estimator=GaussianProcessClassifier(),
             param_grid={'kernel': [1**2 * RBF(length_scale=1),
                                    1**2 * Matern(length_scale=1, nu=1.5),
                                    WhiteKernel(noise_level=0.5)]})

In [117]:
clf2.best_score_

0.7968991274872887

In [118]:
clf2.best_params_

{'kernel': 1**2 * Matern(length_scale=1, nu=1.5)}
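The warning printed during the fit suggests scaling the data. A minimal sketch of how that could be done inside the grid search is shown below, assuming a `StandardScaler` is an acceptable choice. The synthetic data stands in for `xTrain`/`yTrain`, and the step names `scale` and `gpc` are hypothetical. Placing the scaler in a `Pipeline` means it is re-fitted on each training fold rather than on the full data.

```python
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, Matern
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for xTrain/yTrain so the sketch runs on its own.
X, y = make_classification(n_samples=80, n_features=6, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),  # zero mean, unit variance per feature
    ("gpc", GaussianProcessClassifier(random_state=0)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "gpc__kernel": [1.0 * RBF(1.0), 1.0 * Matern(length_scale=1.0, nu=1.5)],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Whether scaling removes the warning on the Titanic features is not verified here.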

### AdaBoost

In [39]:
from sklearn.ensemble import AdaBoostClassifier
adb=AdaBoostClassifier()
parameters={'learning_rate':[0.1,1],'n_estimators':[50,100,15]}

In [40]:
clf3=GridSearchCV(estimator=adb,param_grid=parameters)

In [41]:
clf3.fit(xTrain,yTrain)

GridSearchCV(estimator=AdaBoostClassifier(),
             param_grid={'learning_rate': [0.1, 1],
                         'n_estimators': [50, 100, 15]})

In [42]:
clf3.best_score_

0.814864101437449

In [43]:
clf3.best_params_

{'learning_rate': 1, 'n_estimators': 100}
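`best_score_` and `best_params_` report only the winning candidate. To see how close the other candidates came, the full `cv_results_` dictionary can be loaded into a DataFrame. The following is a sketch on synthetic data (a stand-in for `xTrain`/`yTrain`), with a reduced grid chosen only to keep it quick:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for xTrain/yTrain so the sketch runs on its own.
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid={"learning_rate": [0.1, 1], "n_estimators": [15, 50]},
    cv=3,
)
search.fit(X, y)

# cv_results_ holds one entry per candidate; a DataFrame makes it easy to rank them.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```

Comparing `std_test_score` with the gaps between candidates gives a rough sense of whether the best setting is meaningfully better than the rest.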

### Nearest Neighbors

In [122]:
from sklearn.neighbors import KNeighborsClassifier  # note: KNN is distance-based, so features should be scaled (not done here)
knnclf = KNeighborsClassifier()
parameters={'n_neighbors':[3,4,5,6,7],'weights':('uniform', 'distance'),'algorithm':('auto', 'ball_tree', 'brute')}

In [123]:
clf4=GridSearchCV(estimator=knnclf,param_grid=parameters,scoring='f1')

In [124]:
clf4.fit(xTrain,yTrain)

GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'algorithm': ('auto', 'ball_tree', 'brute'),
                         'n_neighbors': [3, 4, 5, 6, 7],
                         'weights': ('uniform', 'distance')},
             scoring='f1')

In [125]:
clf4.best_score_

0.6138664654384588

In [126]:
clf4.best_params_

{'algorithm': 'brute', 'n_neighbors': 6, 'weights': 'distance'}
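The search above uses `scoring='f1'`, while the other searches use the default scorer, which is accuracy for classifiers, so the `best_score_` values are not directly comparable. One way to get both numbers from a single search is multi-metric scoring, sketched below on synthetic data (a stand-in for `xTrain`/`yTrain`) with a reduced grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for xTrain/yTrain so the sketch runs on its own.
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    scoring=["accuracy", "f1"],  # evaluate both metrics for every candidate
    refit="accuracy",            # metric used to select best_params_
    cv=3,
)
search.fit(X, y)

best = search.best_index_
print("accuracy:", search.cv_results_["mean_test_accuracy"][best])
print("f1:      ", search.cv_results_["mean_test_f1"][best])
```

With `refit="accuracy"`, `best_score_` is the mean cross-validated accuracy and can be compared with the scores of the other models.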

### QDA

### Random Forest

### Naive Bayes

In [108]:
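# Refit AdaBoost, the best-scoring model above, on the full training set
# using the parameters selected by the grid search.
model=AdaBoostClassifier(learning_rate=1,n_estimators=100)
model.fit(xTrain,yTrain)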


AdaBoostClassifier(learning_rate=1, n_estimators=100)

In [109]:
yPred=model.predict(test_df)
yPred

array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0,

In [121]:
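# Reload the raw test file to recover PassengerId, which was dropped from test_df earlier.
df=pd.read_csv('test.csv')

# Combine the predictions with the passenger IDs and save them as a CSV file.
yPred=pd.DataFrame(yPred,columns=['Survived'])
yPred=pd.concat([yPred,df[['PassengerId']]],axis=1)
yPred.to_csv(r'C:\Users\Acer\predictions.csv',index=False,index_label=False)
yPred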

Unnamed: 0,Survived,PassengerId
0,0,892
1,0,893
2,0,894
3,0,895
4,0,896
...,...,...
413,0,1305
414,1,1306
415,0,1307
416,0,1308
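As a portable variant of the export step, the sketch below builds the same two columns and writes them without a machine-specific path. The IDs and predictions are hypothetical stand-ins, and the `PassengerId`, `Survived` column order is an assumption about the layout of the competition's sample submission file.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the test PassengerId column and the model predictions.
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
predictions = np.array([0, 0, 1])

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions.astype(int),
})

# Written to a temporary directory here; in the notebook any writable path works.
out_path = os.path.join(tempfile.mkdtemp(), "predictions.csv")
submission.to_csv(out_path, index=False)

print(pd.read_csv(out_path))
```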
