# DATA SCIENCE INTERN @BHARAT INTERN

# AUTHOR : ROSHIK YADA
## TASK 2 : Titanic Classification
### PURPOSE : Build a system that predicts whether a passenger survived the sinking.

## About Dataset
One of the most well-known shipwrecks in history is the sinking of the RMS Titanic. Out of 2224 passengers and crew, 1502 died when the Titanic sank on April 15, 1912, during her maiden voyage after striking an iceberg. The catastrophe shocked the international community and prompted stricter ship safety rules.

One factor behind the high death toll was the lack of lifeboats for the passengers and crew. Although luck played some part, certain groups, such as women, children, and the upper class, had a higher chance of surviving than others.

The dataset is available at Kaggle : https://www.kaggle.com/datasets/rahulsah06/titanic

## Problem understanding and Definition
In this challenge, we analyze which kinds of passengers were most likely to survive. In particular, we apply machine learning tools to predict which passengers survived the tragedy.

Goal: predict whether a given passenger survived.

## Data Loading and Required Libraries

In [1]:
import pandas as pd #Data manipulation and analysis
import numpy as np #Linear Algebra

In [2]:
#Algorithms
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings



In [3]:
#load dataset
df=pd.read_csv('Titanic.csv')

In [4]:
# First five rows
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |


In [5]:
df.tail()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.00 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.00 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


In [6]:
#Columns
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [7]:
#information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [8]:
df.describe()

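|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |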


In [9]:
# Handle the missing data
# Replace missing values with 0 for simplicity.
# Note: this also sets every missing Age (177 rows) to 0.
df.fillna(0, inplace=True)
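Filling with 0 is the simplest option, but it makes each of the 177 passengers without a recorded age (891 - 714) look like a newborn. One common alternative, not used in this notebook, is median imputation with an extra indicator column. The sketch below illustrates the idea on a toy frame; the values and the `AgeMissing` column name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data; values are illustrative only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

# Median of the observed ages (NaN values are skipped by default).
age_median = toy["Age"].median()

# Hypothetical indicator column so a model can still see which ages were imputed.
toy["AgeMissing"] = toy["Age"].isna().astype(int)
toy["Age"] = toy["Age"].fillna(age_median)

print(toy)
```

In a full workflow the median would normally be computed on the training split only and then reused for the test split, so that no information from the test set leaks into preprocessing.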

In [10]:
# Select relevant features
X = df[['Pclass', 'Age', 'Sex', 'Fare']].copy()
X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})

In [11]:
#Target variable
Y=df['Survived']

In [12]:
# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=29)
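The split above is purely random, so the share of survivors can differ slightly between the training and test sets. If that matters, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. A minimal sketch on synthetic labels (not the real data, and not what this notebook ran):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data with roughly the Titanic class balance (40% positives here);
# the features are random noise and serve only to illustrate the split.
rng = np.random.default_rng(29)
y_demo = np.array([1] * 40 + [0] * 60)
X_demo = rng.normal(size=(100, 4))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=29, stratify=y_demo
)

# Both shares should be (nearly) identical thanks to stratification.
print(f"Positive share in train: {y_tr.mean():.2f}, in test: {y_te.mean():.2f}")
```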

In [13]:
# Standardize features (fit on the training set only, then apply the same scaling to the test set)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
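Fitting the scaler on the training data and only transforming the test data, as done above, keeps test-set statistics out of preprocessing. The same pattern can be bundled into a scikit-learn `Pipeline` so the two steps cannot be mixed up. The sketch below uses synthetic data from `make_classification` as a stand-in for the four selected features; the step names `"scaler"` and `"forest"` are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four selected features; not the real Titanic data.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=29
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=29
)

# fit() learns the scaling statistics from the training data only;
# predict() and score() reuse those statistics on new data.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("forest", RandomForestClassifier(random_state=29)),
])
pipe.fit(X_tr, y_tr)

print(f"Test accuracy on synthetic data: {pipe.score(X_te, y_te):.2f}")
```

Tree-based models such as Random Forest are generally insensitive to feature scaling, so the scaler here is harmless rather than required; it matters more if the classifier is later swapped for a distance-based or linear model.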

In [14]:
# Create a Random Forest classifier
classifier = RandomForestClassifier(random_state=29)

In [15]:
# Train the model
classifier.fit(X_train, Y_train)

RandomForestClassifier(random_state=29)

In [16]:
# Make predictions
Y_pred = classifier.predict(X_test)

In [17]:
# Calculate accuracy
accuracy = accuracy_score(Y_test, Y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

Accuracy: 78.77%
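To put this accuracy in context, it helps to compare it with a baseline that always predicts the majority class. The test set contains 116 non-survivors and 63 survivors (the support column of the classification report in this notebook), so the baseline can be worked out by hand. The variable names below are chosen for this example.

```python
# Test-set class counts, taken from the support column of the classification report.
n_not_survived = 116
n_survived = 63

# A baseline that always predicts the majority class ("did not survive").
baseline_accuracy = max(n_not_survived, n_survived) / (n_not_survived + n_survived)
print(f"Majority-class baseline: {baseline_accuracy * 100:.2f}%")  # 64.80%
```

The model's 78.77% is therefore about 14 percentage points above what a constant prediction would achieve on this split.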


In [18]:
#Report
print(classification_report(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.84      0.84      0.84       116
           1       0.70      0.70      0.70        63

    accuracy                           0.79       179
   macro avg       0.77      0.77      0.77       179
weighted avg       0.79      0.79      0.79       179



In [19]:
# Confusion matrix
conf_matrix = confusion_matrix(Y_test, Y_pred)
print('Confusion Matrix:')
print(conf_matrix)

Confusion Matrix:
[[97 19]
 [19 44]]
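
The figures in the classification report can be recovered directly from this matrix. In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes, so the four cells are true negatives, false positives, false negatives, and true positives. A short worked check, using only the numbers printed above:

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class.
tn, fp = 97, 19   # true class 0 (did not survive)
fn, tp = 19, 44   # true class 1 (survived)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # of the predicted survivors, the share who survived
recall_1 = tp / (tp + fn)      # of the actual survivors, the share the model found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

Because the number of false positives equals the number of false negatives (19 each), precision and recall coincide for each class, which is why the report shows identical values in those columns.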
