<a href="https://colab.research.google.com/github/M-PRERNA/Onboarding-Titanic/blob/main/Titanic_0_78708.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using random forest classification on the Titanic dataset

## Importing the libraries

In [134]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [135]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

## Importing the dataset

In [136]:
train_data = pd.read_csv("train.csv") 
test_data = pd.read_csv("test.csv")

In [137]:
train_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [138]:
# Define the matrix of features and the dependent variable.
# The dataset has 12 columns, so we list the ones chosen as features.
features = ["Sex","Pclass", "SibSp", "Parch", "Age", "Fare", "Embarked"]
dependant_variable = "Survived"
X_train = train_data[features]
X_test = test_data[features]
Y_train = train_data[dependant_variable]
# There is no Y_test: survival for the test set is what we will predict.

Let's inspect what we have done so far using .head(), which returns the first five rows of a DataFrame.

In [139]:
X_train.head()

Unnamed: 0,Sex,Pclass,SibSp,Parch,Age,Fare,Embarked
0,male,3,1,0,22.0,7.25,S
1,female,1,1,0,38.0,71.2833,C
2,female,3,0,0,26.0,7.925,S
3,female,1,1,0,35.0,53.1,S
4,male,3,0,0,35.0,8.05,S


In [140]:
X_test.head()

Unnamed: 0,Sex,Pclass,SibSp,Parch,Age,Fare,Embarked
0,male,3,0,0,34.5,7.8292,Q
1,female,3,1,0,47.0,7.0,S
2,male,2,0,0,62.0,9.6875,Q
3,male,3,0,0,27.0,8.6625,S
4,female,3,1,1,22.0,12.2875,S


In [141]:
Y_train.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

## Taking care of the missing data 

Searching for the missing values

The first step in processing the data is to fill in the gaps. Of course, this comes after choosing the main features based on an analysis of all the data (for example, with visualization tools), and we must first assess how much data is actually missing.

Missing values are not necessarily lost data. Sometimes a missing value indicates that the feature simply does not exist or cannot be obtained, which is itself an interesting pattern. For example, in a column called Pets of passengers with the values Cats, Dogs and Parrots, it would not be reasonable to fill the missing values with Cats; a missing value there means that the passenger has no pets 🤷‍♀️
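
To assess how much data is missing before choosing fill values, one option is to count the missing entries per column. The sketch below uses a small made-up DataFrame (the name toy and its values are assumptions, not rows from the Titanic data); the same calls could be applied to X_train and X_test.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for X_train; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing entries in each column.
missing_counts = toy.isna().sum()

# Fraction of rows that are missing in each column.
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```

A column with only a few gaps can usually be filled with a simple statistic, while a column that is mostly empty may be better dropped or treated as its own category.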

We can use a different imputation approach for each column, so we'll use .fillna with a dictionary of per-column fill values. This differs from SimpleImputer, which is generally applied with a single strategy to the whole dataset.

In [142]:
X_train = X_train.fillna(value = {'Age': X_train['Age'].mean(), 'Fare': X_train['Fare'].median(), 'Embarked': 'N'})
X_test = X_test.fillna(value = {'Age': X_test['Age'].mean(), 'Fare': X_test['Fare'].median(), 'Embarked': 'N'})
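
The cell above fills each set with its own statistics. A common alternative, shown here only as a sketch, is to compute the fill values once from the training data and reuse them for the test data, so both sets are treated identically. The frames toy_train and toy_test and the dictionary fill_values are hypothetical names with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for X_train and X_test.
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan], "Fare": [10.0, 20.0, 60.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0], "Fare": [np.nan, 8.0]})

# Compute the fill values once, from the training data only.
fill_values = {
    "Age": toy_train["Age"].mean(),      # mean ignores NaN by default
    "Fare": toy_train["Fare"].median(),
}

# Apply the same fill values to both sets.
toy_train_filled = toy_train.fillna(value=fill_values)
toy_test_filled = toy_test.fillna(value=fill_values)

print(toy_test_filled)
```

Whether this changes the final score on the Titanic data is not something this sketch can tell; it only shows the mechanics.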

In [143]:
X_train.head()

Unnamed: 0,Sex,Pclass,SibSp,Parch,Age,Fare,Embarked
0,male,3,1,0,22.0,7.25,S
1,female,1,1,0,38.0,71.2833,C
2,female,3,0,0,26.0,7.925,S
3,female,1,1,0,35.0,53.1,S
4,male,3,0,0,35.0,8.05,S


## Encoding the categorical data

There are two columns that hold categorical data: Sex and Embarked.

In [144]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
# Fit the encoder on the training data, then reuse the same mapping for the test data
for col in ['Sex']:
    X_train[col] = label_encoder.fit_transform(X_train[col])
    X_test[col] = label_encoder.transform(X_test[col])

In [145]:
X_train.head()

Unnamed: 0,Sex,Pclass,SibSp,Parch,Age,Fare,Embarked
0,1,3,1,0,22.0,7.25,S
1,0,1,1,0,38.0,71.2833,C
2,0,3,0,0,26.0,7.925,S
3,0,1,1,0,35.0,53.1,S
4,1,3,0,0,35.0,8.05,S
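
To see which integer LabelEncoder assigned to which label, the fitted encoder's classes_ attribute can be inspected. This is a minimal standalone sketch on a made-up list of labels; the names encoder and mapping are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(["male", "female", "female", "male"])

# classes_ holds the labels in sorted order; a label's position is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)        # {'female': 0, 'male': 1}
print(list(encoded))
```

This matches the table above, where male became 1 and female became 0.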


Now that Sex has been converted to numbers, Embarked is the only categorical column left, so we can convert it into dummy variables with the get_dummies function. We call get_dummies last because, by default, it converts every remaining categorical column into dummies, and we want it to do so only for the Embarked values.

In [146]:
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
X_train.head()

Unnamed: 0,Sex,Pclass,SibSp,Parch,Age,Fare,Embarked_C,Embarked_N,Embarked_Q,Embarked_S
0,1,3,1,0,22.0,7.25,0,0,0,1
1,0,1,1,0,38.0,71.2833,1,0,0,0
2,0,3,0,0,26.0,7.925,0,0,0,1
3,0,1,1,0,35.0,53.1,0,0,0,1
4,1,3,0,0,35.0,8.05,0,0,0,1
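
The get_dummies call above relies on Embarked being the only categorical column left. If the order of the steps matters less, a possible alternative is the columns parameter of pd.get_dummies, which restricts the encoding to the named columns. The frame toy below is a made-up example, not the Titanic data.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns.
toy = pd.DataFrame({
    "Sex": ["male", "female"],
    "Pclass": [3, 1],
    "Embarked": ["S", "C"],
})

# Only Embarked is expanded into dummies; Sex is left untouched.
only_embarked = pd.get_dummies(toy, columns=["Embarked"])

print(list(only_embarked.columns))
```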


In [147]:
X_test.head()

Unnamed: 0,Sex,Pclass,SibSp,Parch,Age,Fare,Embarked_C,Embarked_Q,Embarked_S
0,1,3,0,0,34.5,7.8292,0,1,0
1,0,3,1,0,47.0,7.0,0,0,1
2,1,2,0,0,62.0,9.6875,0,1,0
3,1,3,0,0,27.0,8.6625,0,0,1
4,0,3,1,1,22.0,12.2875,0,0,1


As we can see, the train and test data now have different columns. This is because the test data has no missing Embarked values, whereas the training data does, so the placeholder category N produced an Embarked_N column only in the training data.
We therefore add an Embarked_N column of zeros to the test data, so that both sets have the same features and our model can work properly!

In [148]:
X_test['Embarked_N'] = 0

In [149]:
X_test.head()

Unnamed: 0,Sex,Pclass,SibSp,Parch,Age,Fare,Embarked_C,Embarked_Q,Embarked_S,Embarked_N
0,1,3,0,0,34.5,7.8292,0,1,0,0
1,0,3,1,0,47.0,7.0,0,0,1,0
2,1,2,0,0,62.0,9.6875,0,1,0,0
3,1,3,0,0,27.0,8.6625,0,0,1,0
4,0,3,1,1,22.0,12.2875,0,0,1,0
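
Note that adding the column by hand puts Embarked_N last in the test data, while in the training data it sits between Embarked_C and Embarked_Q. One way to get both the missing column and the same column order in a single step is DataFrame.reindex. The sketch below uses made-up frames (toy_train, toy_test) rather than the real data.

```python
import pandas as pd

# Hypothetical toy frames; only the training frame contains the category N.
toy_train = pd.DataFrame({"Embarked": ["S", "C", "N", "Q"]})
toy_test = pd.DataFrame({"Embarked": ["Q", "S", "C"]})

train_dummies = pd.get_dummies(toy_train)
test_dummies = pd.get_dummies(toy_test)

# Give the test frame the training frame's columns, in the same order.
# Columns missing from the test frame are created and filled with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns))
```

The column order matters because scikit-learn estimators either match features by position or check that the column names appear in the same order as during fitting, depending on the version.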


## Splitting the training data

In [150]:
split_x_train, split_x_test, split_y_train, split_y_test = train_test_split(
    X_train, Y_train, test_size=0.25, random_state=0)

## Training the model

In [151]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 550, max_depth = 11, random_state = 1)
classifier.fit(split_x_train,split_y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=11, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=550,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

## Predicting on the split_x_test

In [152]:
predictions = classifier.predict(split_x_test)

In [153]:
accuracy_score(split_y_test,predictions)

0.8385650224215246
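
The accuracy above comes from a single 75/25 split, so it depends on which rows happened to land in the hold-out set. As a hedged sketch of a less split-dependent estimate, k-fold cross-validation with cross_val_score averages the accuracy over several splits. The data here is synthetic (make_classification), and the forest is smaller than the one in this notebook to keep the example quick; the names X_demo, y_demo and demo_classifier are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; these are not the Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Assumption: a smaller forest than the notebook's, chosen only for speed.
demo_classifier = RandomForestClassifier(n_estimators=50, max_depth=11, random_state=1)

# Five-fold cross-validation: one accuracy score per fold.
scores = cross_val_score(demo_classifier, X_demo, y_demo, cv=5, scoring="accuracy")

print(scores.mean(), scores.std())
```

On the real data the same call could be made with classifier, X_train and Y_train in place of the demo objects.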

## Predicting on the actual data

In [159]:
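# Refit on all training rows; reorder the test columns to match X_train,
# since Embarked_N was appended last in X_test.
classifier.fit(X_train, Y_train)
Actual_pred = classifier.predict(X_test[X_train.columns])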


## Creating the competition submissions

In [155]:
# Use the column name PassengerId, spelled as in the competition data
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': Actual_pred})

In [156]:
submission.to_csv('my_submissions2.csv',index=False)