
In [53]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [54]:
data = pd.read_csv("/content/drive/MyDrive/Datasets/Titanic-Dataset.csv")

In [55]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [56]:
data.shape

(891, 12)

The next cell assigns the two results of the 'train_test_split' function to the variables 'train_data' and 'test_data'.

'train_test_split' splits a dataset into random training and testing subsets. The call uses these parameters:

*   data: The input dataset I want to split.
*   test_size=0.2: This parameter specifies that I want to allocate 20% of the data for testing and 80% for training.
*   random_state=42: This sets the random seed for reproducibility. If I run this code multiple times with the same seed, I get the same split every time.

The result is two parts: train_data, which holds about 80% of the original rows (712 of 891) and is used to train my machine learning model, and test_data, which holds the remaining rows (179) and is set aside to evaluate the model's performance.
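
As a small illustration of the two points above (the 80/20 proportion and the effect of the seed), here is a sketch on a made-up ten-row DataFrame. The 'toy' frame and the variable names are my own stand-ins, not part of the Titanic data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Titanic table: 10 rows, one column.
toy = pd.DataFrame({"PassengerId": range(1, 11)})

# Same arguments as in the notebook: 20% held out, fixed seed.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))          # sizes of the two parts
print(test_a.index.equals(test_b.index))  # True when both calls picked the same rows
```

Changing random_state to another value would generally pick different rows for the test part, while the sizes stay the same.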

In [57]:
train_data, test_data = train_test_split(data, test_size = 0.2, random_state = 42)

In [58]:
train_data.shape

(712, 12)

In [59]:
test_data.shape

(179, 12)

In [60]:
woman = train_data.loc[train_data.Sex == 'female']["Survived"]

In [61]:
woman.shape

(245,)

In [62]:
sum(woman) / len(woman)

0.7387755102040816

In [63]:
women = train_data[train_data.Sex == 'female'].Survived
rate_women =  sum(women) / len(women)

In [64]:
men = train_data[train_data.Sex == 'male'].Survived
rate_men =  sum(men) / len(men)

In [65]:
print("Fraction of women who survived:", rate_women)
print("Fraction of men who survived:", rate_men)

Fraction of women who survived: 0.7387755102040816
Fraction of men who survived: 0.18629550321199143


In [66]:
target = train_data["Survived"]

In [67]:
target

331    0
733    0
382    0
704    0
813    0
      ..
106    1
270    0
860    0
435    1
102    0
Name: Survived, Length: 712, dtype: int64

In [71]:
features = ["Pclass", "Sex", "SibSp", "Parch"]

In [72]:
train_data[features]

Unnamed: 0,Pclass,Sex,SibSp,Parch
331,1,male,0,0
733,2,male,0,0
382,3,male,0,0
704,3,male,1,0
813,3,female,4,2
...,...,...,...,...
106,3,female,0,0
270,1,male,0,0
860,3,male,2,0
435,1,female,1,2


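The 'Sex' column holds strings, so it has to be converted into numerical values before training. 'pd.get_dummies' one-hot encodes it into the columns 'Sex_female' and 'Sex_male' and leaves the numeric columns unchanged.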

In [75]:
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

In [77]:
X_test

Unnamed: 0,Pclass,SibSp,Parch,Sex_female,Sex_male
709,3,1,1,0,1
439,2,0,0,0,1
840,3,0,0,0,1
720,2,0,1,1,0
39,3,1,0,1,0
...,...,...,...,...,...
433,3,0,0,0,1
773,3,0,0,0,1
25,3,1,5,1,0
84,2,0,0,1,0
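
One thing to watch when encoding the training and test frames separately: if a category is missing from one of them, the two encoded frames end up with different columns. That does not happen here, since both splits contain both sexes, but the sketch below shows one way to guard against it. The tiny frames are invented for illustration, and using 'reindex' for the alignment is my assumption about a reasonable fix, not something the notebook does.

```python
import pandas as pd

# Hypothetical miniature frames, not rows from the real dataset.
train_toy = pd.DataFrame({"Pclass": [1, 3, 2], "Sex": ["male", "female", "male"]})
test_toy = pd.DataFrame({"Pclass": [3, 1], "Sex": ["male", "male"]})  # no 'female' rows

X_toy = pd.get_dummies(train_toy)

# On its own, get_dummies(test_toy) would produce no Sex_female column.
# Reindexing to the training columns adds it back, filled with zeros.
X_test_toy = pd.get_dummies(test_toy).reindex(columns=X_toy.columns, fill_value=0)

print(list(X_toy.columns))
print(list(X_test_toy.columns))
```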




1.   **model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)**: This line creates an instance of scikit-learn's RandomForestClassifier class. The RandomForestClassifier is an ensemble learning method that combines multiple decision trees to make predictions. The parameters are:

     *   **n_estimators=100**: This parameter sets the number of decision trees (estimators) in the random forest ensemble, here 100.
     *   **max_depth=5**: This parameter sets the maximum depth of each individual decision tree. It limits how deep a tree can grow, which can help prevent overfitting.
     *   **random_state=1**: This parameter sets the random seed for reproducibility. Running the code again with the same seed and the same data gives the same random decisions during training.

2.   **model.fit(X, target)**: This line fits (trains) the RandomForestClassifier model using the training data. It takes two main arguments:

     *   **X**: This is the feature matrix, the one-hot encoded DataFrame created earlier with pd.get_dummies.
     *   **target**: This is the target variable, the 'Survived' label for each row of X. The model learns to predict the target from the features in X.

3.   **predictions = model.predict(X_test)**: After training the model, this line predicts the target values for X_test, the encoded features of the 20% test split. The predicted values are stored in the predictions variable.
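
To make the three parameters a little more concrete, here is a sketch that fits the same kind of forest on random made-up data and then inspects it. The data, and the names 'X_demo', 'y_demo' and 'forest', are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption): 100 rows, 4 integer features, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 4, size=(100, 4))
y_demo = rng.integers(0, 2, size=100)

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
forest.fit(X_demo, y_demo)

# n_estimators: one fitted decision tree per estimator.
n_trees = len(forest.estimators_)
# max_depth: no tree in the ensemble grows deeper than the limit.
deepest = max(tree.get_depth() for tree in forest.estimators_)

# random_state: refitting with the same seed and data repeats the predictions.
forest_again = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
forest_again.fit(X_demo, y_demo)
same = np.array_equal(forest.predict(X_demo), forest_again.predict(X_demo))

print(n_trees, deepest, same)
```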

In [79]:
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X,target)
predictions = model.predict(X_test)
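
The split was made so that test_data could be used to evaluate the model, but the notebook stops at the predictions. Below is a minimal sketch of how an accuracy check could look. Because the CSV file is not available outside my Drive, it runs on synthetic stand-in data with the same feature columns. The data-generating rule and all '_toy' names are assumptions, and the printed number says nothing about the real Titanic model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Titanic file).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.choice(["male", "female"], size=n),
    "SibSp": rng.integers(0, 3, size=n),
    "Parch": rng.integers(0, 3, size=n),
})
# Made-up rule: women survive, with roughly 20% of labels flipped as noise.
toy["Survived"] = ((toy["Sex"] == "female") ^ (rng.random(n) < 0.2)).astype(int)

features = ["Pclass", "Sex", "SibSp", "Parch"]
train_toy, test_toy = train_test_split(toy, test_size=0.2, random_state=42)

X_toy = pd.get_dummies(train_toy[features])
X_test_toy = pd.get_dummies(test_toy[features]).reindex(columns=X_toy.columns, fill_value=0)

model_toy = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model_toy.fit(X_toy, train_toy["Survived"])

# Compare the predictions with the held-out labels.
accuracy = accuracy_score(test_toy["Survived"], model_toy.predict(X_test_toy))
print(f"Held-out accuracy: {accuracy:.3f}")
```

In this notebook, the analogous call would be accuracy_score(test_data["Survived"], predictions), since test_data still contains the 'Survived' column.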

In [81]:
output = pd.DataFrame({'PassengerID': test_data.PassengerId, 'Survived': predictions})

In [82]:
output.to_csv("Output.csv", index=False)

In [83]:
output

Unnamed: 0,PassengerID,Survived
709,710,0
439,440,0
840,841,0
720,721,1
39,40,1
...,...,...
433,434,0
773,774,0
25,26,0
84,85,1
