## Machine Learning Predictions on Titanic datasets

- In this notebook, we will use random forest classification to predict the Survived column of the Titanic
  dataset


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing


### Data import and preliminary inspection

In [2]:
# import the data as pandas dataframe
test_data  = pd.read_csv('../input/titanic/test.csv');
train_data = pd.read_csv('../input/titanic/train.csv');
train_data.info();

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [3]:
test_data.info()
PID = test_data['PassengerId'];

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


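The inspection shows missing values in the Age, Cabin and Embarked columns of the training data; the test data is missing Age and Cabin values and one Fare value. The titles embedded in the names also carry useful information about social status.
So first we will replace the missing values with the column mean (Age) or mode (Embarked).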

### Filling missing data

In [4]:
# Assign the result back instead of calling fillna(inplace=True) on a selected column
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0]);
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean());
train_data.info();

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
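
The fill strategy above can be tried on a small frame first. The following is a minimal sketch, assuming a made-up toy DataFrame (the values are illustrative and not taken from the Titanic data): the numeric column is filled with its mean and the categorical column with its mode.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for two of the Titanic columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column before imputing.
missing_before = toy.isna().sum()

# Numeric column: fill with the mean of the observed values.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
# Categorical column: fill with the most frequent value.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

missing_after = toy.isna().sum()
print(missing_before.to_dict(), missing_after.to_dict())
```

Counting missing values before and after is a cheap check that no column was overlooked.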


### Removing columns that don't contribute towards survivability

In [5]:
train_data = train_data.drop(columns=['Cabin', 'Ticket', 'PassengerId']);
train_data.info();

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       891 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Fare      891 non-null    float64
 8   Embarked  891 non-null    object 
dtypes: float64(2), int64(4), object(3)
memory usage: 62.8+ KB


### Extract title from name column

- I looked up the unique titles: if a title is professional (Dr., Capt. etc.) or from nobility (Don.), I assume it to be class 1 (high social status).
- If the title is a common one such as Mr. or Ms., I assume it to be class 0 (normal social status).
- After I derive the social status column from the titles, I will remove the name column completely.

In [6]:
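import re

# Raw string: one or more letters followed by a literal period, e.g. "Mr."
regex = r"([A-Za-z]+)\."

def get_title(row):
    # group(0) is the whole match, including the trailing period
    match = re.search(regex, str(row))
    title = match.group(0)
    return title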

social_status = {'Master.':0,
 'Mrs.':0,
 'Mr.':0,
 'Ms.':0,
 'Col.':1,
 'Mme.':0,
 'Countess.':1,
 'Mlle.':0,
 'Don.':1,
 'Lady.':0,
 'Miss.':0,
 'Dr.':1,
 'Sir.':0,
 'Capt.':1,
 'Rev.':1,
 'Major.':1,
 'Jonkheer.':0,
 'Dona':1};  # note: no trailing period, so the extracted title 'Dona.' is not matched by this key

train_data['socialstatus'] = train_data.Name.apply(lambda x: get_title(x));
train_data.replace({'socialstatus':social_status}, inplace=True);
train_data.info();

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Survived      891 non-null    int64  
 1   Pclass        891 non-null    int64  
 2   Name          891 non-null    object 
 3   Sex           891 non-null    object 
 4   Age           891 non-null    float64
 5   SibSp         891 non-null    int64  
 6   Parch         891 non-null    int64  
 7   Fare          891 non-null    float64
 8   Embarked      891 non-null    object 
 9   socialstatus  891 non-null    int64  
dtypes: float64(2), int64(5), object(3)
memory usage: 69.7+ KB
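
The title extraction above can be checked on a few names without loading the data. This is a minimal sketch using only the standard library; the helper `extract_title`, its `default` argument, the reduced `status_map` and the second and third example names are assumptions made for illustration, not part of the notebook.

```python
import re

# Raw string so that "\." is an escaped literal period, not a string escape.
TITLE_RE = re.compile(r"([A-Za-z]+)\.")

def extract_title(name, default="Unknown."):
    """Return the first 'Word.' token in name, or default if none is found."""
    match = TITLE_RE.search(str(name))
    return match.group(0) if match else default

# Hypothetical subset of the mapping, for illustration only.
status_map = {"Mr.": 0, "Mrs.": 0, "Dr.": 1}

names = ["Braund, Mr. Owen Harris", "Smith, Dr. Jane", "No title here"]
titles = [extract_title(n) for n in names]
# dict.get with a default keeps unmapped titles from staying behind as strings.
statuses = [status_map.get(t, 0) for t in titles]
print(titles, statuses)
```

Returning a default when no title is found avoids an AttributeError on names without a period, and looking titles up with a default guarantees a numeric result for every row.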


- Now we can remove the Name column and encode the Sex and Embarked columns with numeric values

In [7]:
train_data.replace({'Sex':{'male':0,'female':1}, 'Embarked':{'S':0,'C':1,'Q':2}}, inplace=True);
train_data = train_data.drop(columns=['Name'], axis=1);
train_data.info();

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Survived      891 non-null    int64  
 1   Pclass        891 non-null    int64  
 2   Sex           891 non-null    int64  
 3   Age           891 non-null    float64
 4   SibSp         891 non-null    int64  
 5   Parch         891 non-null    int64  
 6   Fare          891 non-null    float64
 7   Embarked      891 non-null    int64  
 8   socialstatus  891 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 62.8 KB
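
One caveat of the encoding above is that DataFrame.replace leaves any value without a mapping unchanged, so an unexpected category stays in the column as a string. A minimal sketch of an alternative, assuming a made-up toy frame with an invented port code "X": Series.map turns unmapped values into NaN, which makes them easy to count.

```python
import pandas as pd

# Hypothetical toy frame; "X" is a made-up port code with no mapping.
toy = pd.DataFrame({"Sex": ["male", "female", "male"],
                    "Embarked": ["S", "C", "X"]})

sex_codes = toy["Sex"].map({"male": 0, "female": 1})
embarked_codes = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Series.map yields NaN for keys missing from the dict,
# so unmapped categories can be counted explicitly.
n_unmapped = int(embarked_codes.isna().sum())
print(sex_codes.tolist(), n_unmapped)
```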


### Random Forest
A random forest first draws random samples (with replacement) from the original dataset. A decision tree is then built on each sample, and at each split the tree considers only a random subset of the available features. Many such decision trees constitute a random forest. Each tree in a random forest is different because it sees a different sample and different feature subsets. Each tree can make a different prediction, and the forest returns the class that most trees predict.

A single decision tree is highly sensitive to its training data and hence has high variance. Since a random forest aggregates multiple decision trees, it is less sensitive to the training data and hence has lower variance.
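
The sampling and voting idea can be sketched by hand with plain decision trees. This is an illustration on synthetic data under stated assumptions (25 trees, `max_features="sqrt"`, hard majority vote), not the internals of scikit-learn's RandomForestClassifier, which combines trees by averaging their predicted class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the Titanic data).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(25):  # odd number of trees, so a 0/1 vote cannot tie
    # Bootstrap sample: draw as many rows as the dataset has, with replacement.
    idx = rng.integers(0, len(X_syn), size=len(X_syn))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_syn[idx], y_syn[idx]))

# Majority vote: labels are 0/1, so the mean vote is thresholded at 0.5.
votes = np.stack([t.predict(X_syn) for t in trees])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the hand-rolled forest:", (forest_pred == y_syn).mean())
```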

### Model Accuracy

- To estimate model accuracy, we will split the training data into train and test subsets

In [8]:
X = train_data.drop(columns = ['Survived'],axis=1);
y = train_data['Survived'];
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X,y, test_size=0.2, random_state=10);
y_train_m.head()

57     0
717    1
431    1
633    0
163    0
Name: Survived, dtype: int64

In [9]:
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

# Fit a 100-tree forest on the training split and score it on the held-out split
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train_m, y_train_m)
y_pred_m = model.predict(X_test_m)
print("ACCURACY OF THE MODEL: ", metrics.accuracy_score(y_test_m, y_pred_m))

ACCURACY OF THE MODEL:  0.8324022346368715
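
A single train/test split gives one accuracy number that depends on how the rows happened to be divided and on the forest's random seed. As a hedged sketch of a more stable estimate, k-fold cross-validation can be run with `cross_val_score`; the data here is synthetic, and `cv=5` and `random_state=0` are illustrative choices rather than settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix and target.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Fixing random_state makes the forest reproducible between runs.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
print("fold accuracies:", scores)
print("mean +/- std: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

The spread across folds gives a rough sense of how much a single-split accuracy could vary.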


### Preparing test data for actual competition submission

In [10]:

# Fill missing numeric values; Age reuses the training-set mean
test_data['Age'] = test_data['Age'].fillna(train_data['Age'].mean())
test_data = test_data.drop(columns=['Cabin', 'Ticket', 'PassengerId'])
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
test_data['socialstatus'] = test_data.Name.apply(get_title)
test_data.replace({'socialstatus': social_status}, inplace=True)
# Row 414 has the title 'Dona.', which the 'Dona' key in social_status does not match.
# Set it by label with .loc; chained indexing (test_data['socialstatus'][414] = 1)
# triggers pandas' SettingWithCopyWarning.
test_data.loc[414, 'socialstatus'] = 1
test_data.replace({'Sex': {'male': 0, 'female': 1}, 'Embarked': {'S': 0, 'C': 1, 'Q': 2}}, inplace=True)
test_data = test_data.drop(columns=['Name'])
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Pclass        418 non-null    int64  
 1   Sex           418 non-null    int64  
 2   Age           418 non-null    float64
 3   SibSp         418 non-null    int64  
 4   Parch         418 non-null    int64  
 5   Fare          418 non-null    float64
 6   Embarked      418 non-null    int64  
 7   socialstatus  418 non-null    object 
dtypes: float64(2), int64(5), object(1)
memory usage: 26.2+ KB


### Let's run it on the actual test data and save the result as CSV

In [11]:
y_submission = model.predict(test_data);
output = pd.DataFrame({'PassengerId': PID, 'Survived': y_submission})
output.to_csv('submission8.csv', index=False)