# TITANIC PROJECT

The Titanic dataset is a well-known dataset in data analysis and machine learning. It contains information about the passengers aboard the Titanic, which sank on its maiden voyage in 1912. The dataset includes each passenger's age, gender, ticket class, fare, and whether or not they survived the sinking. It is widely used to build models that predict passenger survival from these features.


# Importing Libraries

In [1]:
# Importing data visualization library
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# for mathematical operations
import numpy as np

# for dataframe manipulations
import pandas as pd

# for model building
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix


# Reading the file

In [2]:
# Let's read the CSV file from the working directory
Titanic = pd.read_csv("titanic.csv")

In [3]:
# Let's check the first rows of the dataset
Titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# Let's check the shape of the dataset
Titanic.shape

(891, 12)

In [5]:
# Let's expand the display to view all the columns and rows
# (the full option names avoid ambiguity with other pandas options)
pd.set_option("display.max_columns", 20)
pd.set_option("display.max_rows", 900)

In [None]:
Titanic

In [None]:
# Our target column here is 'Survived'; we will move it to the end.
# First, collect the current column order.
col_list = list(Titanic.columns.values)

In [None]:
# Let's check the column list before reordering
print(col_list)

In [None]:
# Let's now reorder the columns so that 'Survived' comes last

changed_indexed_columns = ['PassengerId', 'Pclass', 'Name', 'Sex', 
'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 
'Survived']
Titanic = Titanic[changed_indexed_columns]
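
The reordering above lists every column by hand. A minimal sketch of the same idea, moving one column to the end without typing the other names, might look like this. The helper name `move_column_to_end` and the toy frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def move_column_to_end(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a new frame with `column` moved to the last position."""
    other_columns = [name for name in df.columns if name != column]
    return df[other_columns + [column]]


# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
reordered = move_column_to_end(toy, "Survived")
print(list(reordered.columns))
```

This keeps working if columns are added or removed upstream, which a hard-coded list does not.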


In [None]:
# check the data
Titanic.head()

# Exploring the Columns

We will explore each column one by one and act on it according to its importance.



In [None]:
# Pclass refers to the passenger's class on the ship.
# This might play an important role in survival,
# as first-class passengers may have been given priority during evacuation.

In [None]:
Titanic["Pclass"].value_counts() # number of passengers in each class

In [None]:
# Checking missing values in the Pclass column
print(f"Missing values in the Pclass column: {Titanic['Pclass'].isnull().sum()}")


In [None]:
Titanic["Pclass"].isnull().sum()

In [None]:
Titanic.isnull().sum()

In [None]:
# Generally, the name itself should not have any effect on survival,
# so we can drop the Name column.
# Before dropping it, we will split the names to extract each passenger's title,
# which helps us see how the data is divided by age.

In [None]:
Titanic['Name'] = Titanic['Name'].str.split(', ').str[1]
Titanic['Name'] = Titanic['Name'].str.split('.').str[0]
Titanic['Name'].value_counts()
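
The two `split` calls above overwrite `Name` with the title. As a hedged alternative, the same title can be pulled out with a single regular expression. The sketch below uses four names from the table shown earlier; the variable names `names` and `titles` are assumptions for illustration.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
])

# Capture the text between ", " and the first "." that follows it.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.value_counts())
```

Storing the result in a separate variable or column leaves the original names available for later checks.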

In [None]:
# Let's find the maximum age for the title 'Master'.
# This suggests an upper age limit for children in the data

In [None]:
child_age = Titanic[Titanic['Name'] == 'Master']
child_age['Age'].max()


We can see that the maximum age reported here is 29. That is not a usable limit for children, so we will take 15 as the highest age for children in the data.

In [None]:
Titanic.drop(['Name'],axis = 1,inplace = True)

In [None]:
Titanic

In [None]:
Titanic.shape

In [None]:
# Children are given more importance during evacuation,
# so we will combine the sex and age columns into a new column.
# Let's check the statistical distribution of the 'Age' column

In [None]:
print(f"the maximum age in data is {Titanic['Age'].max()}")
print(f"the minimum age in data is {Titanic['Age'].min()}")
print(f"the average age in data is {Titanic['Age'].mean()}")


In [None]:
# Let's check the missing values in the Age column
print(f"Missing values in the Age column: {Titanic['Age'].isnull().sum()}")
print(f"Share of missing values in the Age column: {Titanic['Age'].isnull().sum()/len(Titanic['Age'])}")
len(Titanic['Age'])  # 177 of the 891 values are missing


In [None]:
# The share of missing values here is about 20%,
# so we will fill them with the column mean using fillna.
# Assigning the result back is more reliable than inplace=True on a single column.
Titanic['Age'] = Titanic['Age'].fillna(Titanic['Age'].mean())
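
Mean imputation as used above can be sketched on a toy series. The values below are made up. The median is shown alongside only as a common alternative that is less sensitive to outliers; the notebook itself uses the mean.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 80.0])

# Share of missing entries, analogous to 177/891 in the notebook.
missing_share = ages.isnull().mean()

# Assign the filled result to a new name instead of relying on inplace=True.
ages_mean_filled = ages.fillna(ages.mean())
ages_median_filled = ages.fillna(ages.median())

print(f"missing share: {missing_share:.2f}")
print(ages_mean_filled.tolist())
print(ages_median_filled.tolist())
```

Note how the single large value (80) pulls the mean fill value above the median fill value.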


In [None]:
print(f"Missing values in the Age column after filling: {Titanic['Age'].isnull().sum()}")

In [None]:
# Now let's check the missing values in the 'Sex' column
print(f"Missing values in the Sex column: {Titanic['Sex'].isnull().sum()}")


In [None]:
# Now let's combine the sex and age columns into a new column.
# We will construct male child, female child, male adult and female adult groups.
# Below 15 years of age is a child; 15 years and above is treated as an adult.
# We will use a Python function here

In [None]:
def age_groups(row):
    # `row` is one passenger (a single row of the dataframe)
    if row['Sex'] == 'male' and row['Age'] >= 15:
        return 'Male Adult'
    elif row['Sex'] == 'male' and row['Age'] < 15:
        return 'Male Child'
    elif row['Sex'] == 'female' and row['Age'] >= 15:
        return 'Female Adult'
    elif row['Sex'] == 'female' and row['Age'] < 15:
        return 'Female Child'

In [None]:
# Let's apply the grouping in the dataframe
Titanic['Passenger_group'] = Titanic.apply(age_groups,axis =1)
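
The row-wise `apply` above can also be expressed with vectorized operations, which tends to be faster on larger frames. This is a minimal sketch on made-up rows; it assumes `Sex` holds only `male` or `female` and that `Age` has no missing values, as is the case at this point in the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female"],
    "Age": [22.0, 4.0, 38.0, 15.0],
})

# Age 15 and above counts as adult, so a passenger aged exactly 15 is an adult.
life_stage = pd.Series(
    np.where(toy["Age"] >= 15, "Adult", "Child"), index=toy.index
)
toy["Passenger_group"] = toy["Sex"].str.capitalize() + " " + life_stage
print(toy["Passenger_group"].tolist())
```

A missing age would fall into the "Child" branch here, so the fill step has to run first.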


In [None]:
# Let's check the data
Titanic.head()

In [None]:
# Now that our data is properly grouped, we can go ahead and drop the Sex and Age columns
Titanic.drop(['Sex','Age'],axis = 1,inplace = True)
# Checking the data
Titanic.head()


In [None]:
# The Embarked column is categorical, so we need to convert it.
# Let's find its missing values first
print(f"Missing values in the Embarked column: {Titanic['Embarked'].isnull().sum()}")

In [None]:
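# Let's fill the missing values in Embarked with the most frequent port.
# Filling only this column keeps the port code out of other columns such as Cabin.
Titanic['Embarked'] = Titanic['Embarked'].fillna(Titanic['Embarked'].mode().iloc[0])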
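
Where `fillna` is called matters: on a whole DataFrame it fills missing values in every column, while on a single column it leaves the others untouched. The sketch below shows the difference on a made-up two-column frame; the names `filled_everywhere` and `filled_column` are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Cabin": ["C85", np.nan, np.nan],
    "Embarked": ["S", np.nan, "S"],
})
most_common_port = toy["Embarked"].mode().iloc[0]

# Frame-wide fill: every NaN in every column receives the port code.
filled_everywhere = toy.fillna(most_common_port)

# Column-only fill: Cabin keeps its missing values.
filled_column = toy.copy()
filled_column["Embarked"] = filled_column["Embarked"].fillna(most_common_port)

print(filled_everywhere["Cabin"].tolist())
print(filled_column["Cabin"].isnull().sum())
```

In the frame-wide version, missing cabins end up labelled with a port code, which is unlikely to be what is wanted.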



In [None]:
print(f"the missing value in the data of embarked column: {Titanic['Embarked'].isnull().sum()}")

In [None]:
# Let's convert the categorical Embarked column to numbers using LabelEncoder
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
Titanic['Embarked'] = le.fit_transform(Titanic['Embarked'])
Titanic['Embarked'].value_counts()

In [None]:
# Encode the passenger groups the same way (this refits `le` on the new column)
Titanic['Passenger_group'] = le.fit_transform(Titanic['Passenger_group'])
Titanic['Passenger_group'].value_counts()
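
`LabelEncoder` assigns integers in sorted label order, and the fitted `classes_` attribute records which label received which integer. A small sketch with the three port codes; the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

ports = pd.Series(["S", "C", "Q", "S"])

encoder = LabelEncoder()
codes = encoder.fit_transform(ports)

# classes_ holds the original labels; the label at position i is encoded as i.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoder.inverse_transform(codes).tolist())
```

Because refitting an encoder replaces its `classes_`, using one encoder per column is the safer choice when the original labels need to be recovered later.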

In [None]:
Titanic.head()

In [None]:
Titanic.info()


In [None]:
Titanic.isnull().sum()

In [None]:
col = "Survived"
x = Titanic.loc[:, Titanic.columns != col]
y = Titanic.pop('Survived')

In [None]:
x

In [None]:
y

In [None]:
# Keep only the numeric feature columns (this leaves out Ticket and Cabin)
x = x[['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Fare',
       'Embarked', 'Passenger_group']]


In [None]:
x.head()

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# Use a descriptive name; `object` would shadow the Python built-in
scaler = StandardScaler()

In [None]:
x = scaler.fit_transform(x)

In [None]:
x

In [None]:
# let's do our train test splitting now
from sklearn.model_selection import train_test_split


In [None]:
x_train , x_test , y_train , y_test = train_test_split(x ,y ,stratify = y , random_state = 42)
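
The notebook scales all rows before splitting. A commonly recommended variation fits the scaler on the training split only, so that test-set statistics do not influence the transformation. The sketch below uses random stand-in data rather than the Titanic features, and all names in it are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
features = rng.normal(loc=10.0, scale=3.0, size=(200, 3))  # stand-in features
target = np.array([0] * 120 + [1] * 80)                    # stand-in labels

# stratify keeps the class proportions similar in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, stratify=target, random_state=42
)

scaler = StandardScaler()
x_tr_scaled = scaler.fit_transform(x_tr)  # learn mean and std from training rows
x_te_scaled = scaler.transform(x_te)      # reuse them for the test rows

print(x_tr_scaled.mean(axis=0).round(6))
print(y_tr.mean(), y_te.mean())
```

With the default settings a quarter of the rows go to the test split.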


# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression().fit(x_train,y_train)

In [None]:
# Confusion matrix for the Logistic Regression classifier
from sklearn.metrics import confusion_matrix
y_pred = logreg.predict(x_test)  # predictions on the test set, reused by the report further down
confusion_matrix(y_test, y_pred)
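
For a binary target, scikit-learn lays out the confusion matrix as `[[TN, FP], [FN, TP]]`. The following sketch, on made-up labels, unpacks the four cells and derives accuracy, precision and recall from them by hand.

```python
from sklearn.metrics import confusion_matrix

y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted survivors, how many were right
recall = tp / (tp + fn)      # of the actual survivors, how many were found
print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

Here the counts are 4 true negatives, 1 false positive, 1 false negative and 2 true positives, giving an accuracy of 0.75.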


In [None]:
# Let's plot the confusion matrix
# (plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement)

from sklearn.metrics import ConfusionMatrixDisplay
color = 'black'
matrix = ConfusionMatrixDisplay.from_estimator(logreg, x_test, y_test,
                                               cmap=plt.cm.Blues)
matrix.ax_.set_title('Confusion Matrix', color=color)
matrix.ax_.set_xlabel('Predicted Label', color=color)
matrix.ax_.set_ylabel('True Label', color=color)
plt.gcf().axes[0].tick_params(colors=color)
plt.gcf().axes[1].tick_params(colors=color)
plt.show()


In [None]:
# Scores

print("Training set score:{:.2f}".format(logreg.score(x_train,y_train)))
print("Test set score: {:.2f}".format(logreg.score(x_test,y_test)))


In [None]:
# Running the classification report

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))


# Decision Tree Classifier

In [None]:
# Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
dclf = DecisionTreeClassifier()
dclf = dclf.fit(x_train,y_train)



In [None]:
# Confusion matrix for the Decision Tree classifier
y_pred = dclf.predict(x_test)
confusion_matrix(y_test, y_pred)

In [None]:
# Let's plot the confusion matrix
# (ConfusionMatrixDisplay.from_estimator replaces the removed plot_confusion_matrix)
from sklearn.metrics import ConfusionMatrixDisplay
color = 'black'
matrix = ConfusionMatrixDisplay.from_estimator(dclf, x_test, y_test,
                                               cmap=plt.cm.Blues)
matrix.ax_.set_title('Confusion Matrix', color=color)
matrix.ax_.set_xlabel('Predicted Label', color=color)
matrix.ax_.set_ylabel('True Label', color=color)
plt.gcf().axes[0].tick_params(colors=color)
plt.gcf().axes[1].tick_params(colors=color)
plt.show()

In [None]:
# Running the classification report
print(classification_report(y_test, y_pred))

In [None]:
# Scores
print("Training set score:{:.2f}".format(dclf.score(x_train,y_train)))
print("Test set score: {:.2f}".format(dclf.score(x_test,y_test)))


In [None]:
# Decision Tree Classifier with Entropy
dclf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
dclf = dclf.fit(x_train,y_train)
print("Training set score:{:.2f}".format(dclf.score(x_train,y_train)))
print("Test set score: {:.2f}".format(dclf.score(x_test,y_test)))
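
Limiting `max_depth` trades training accuracy for a simpler tree. The sketch below contrasts an unrestricted tree with a depth-3 tree on synthetic data from `make_classification`; it does not use the Titanic features, and the scores it prints say nothing about the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the Titanic features.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(
    criterion="entropy", max_depth=3, random_state=0
).fit(X_tr, y_tr)

for name, tree in [("unrestricted", deep_tree), ("max_depth=3", shallow_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, "
        f"train={tree.score(X_tr, y_tr):.2f}, test={tree.score(X_te, y_te):.2f}"
    )
```

An unrestricted tree typically memorizes the training rows, so a large gap between its training and test scores is a sign of overfitting.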

# Random Forest Classifier

In [None]:
# Random Forest classifier model

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth=5, 
random_state=1)
rf = rf.fit(x_train,y_train)
print("Training set score:{:.2f}".format(rf.score(x_train,y_train)))
print("Test set score: {:.2f}".format(rf.score(x_test,y_test)))


# Pipeline Creation
Now we will create a pipeline for each of the models we want to train on our data.

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
#We will also import PCA to perform decomposition or dimensionality reduction

In [None]:
from sklearn.decomposition import PCA
# Pipeline creation for Logistic Regression
# Our data is already scaled, so no further scaling step is needed

pipeline_lr = Pipeline([
    ('pca_lr', PCA(n_components=2)),
    ('lr_classifier', LogisticRegression(random_state=42)),
])
# Pipeline creation for Decision Tree Classifier
pipeline_dt = Pipeline([
    ('pca_dt', PCA(n_components=2)),
    ('dt_classifier', DecisionTreeClassifier(criterion="entropy", max_depth=3)),
])
# Pipeline creation for Random Forest Classifier
pipeline_rf = Pipeline([
    ('pca_rf', PCA(n_components=2)),
    ('rf_classifier', RandomForestClassifier(n_estimators=100, max_depth=3,
                                             random_state=42)),
])



In [None]:
# Listing all the Pipelines created
pipelines = [pipeline_lr,pipeline_dt,pipeline_rf]

In [None]:
# Now we will initialize three variables to track which
# model performs best;
# they are updated in the comparison loop further down
best_accuracy = 0.0
best_classifier = 0
best_pipeline = " "

In [None]:
# Let's now create a dictionary that maps each pipeline's index to its model name

pipe_dict = {0:'Logistic regression',
 1:'Decision Tree',
 2:'Random Forest'}



In [None]:
# Let's fit our data into the pipelines
for pipe in pipelines:
    pipe.fit(x_train,y_train)


In [None]:
# Now we will evaluate each model and its test accuracy with the help of a for loop

for i,model in enumerate(pipelines):
    print("{} test accuracy:{:.2f}".format(pipe_dict[i],model.score(x_test,y_test)))



In [None]:
# Now let's use the variables created earlier to record the best-performing model

for i,model in enumerate(pipelines):
    if model.score(x_test,y_test)> best_accuracy:
        best_accuracy = model.score(x_test,y_test)
        best_pipeline = model
        best_classifier = i
print('Classifier with best accuracy:{}'.format(pipe_dict[best_classifier]))

# Pipeline without PCA
We can see our scores decreasing with PCA, so let's try creating models without PCA.
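
One possible reason for a drop in accuracy is that two principal components keep only part of the variance in the features. The sketch below shows how `explained_variance_ratio_` can be inspected; it uses seven random, roughly independent stand-in columns (the same width as `x`), so the printed figure is illustrative and is not the value for the Titanic data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Seven roughly independent stand-in features, scaled like the notebook's `x`.
features = StandardScaler().fit_transform(rng.normal(size=(300, 7)))

pca = PCA(n_components=2).fit(features)
retained = pca.explained_variance_ratio_.sum()
print(f"variance retained by 2 components: {retained:.2f}")
```

Running the same check on the real feature matrix would show how much information the PCA step discards before the classifier sees the data.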

In [None]:
# Pipeline creation for Logistic Regression
pipeline_lr = Pipeline([
    ('lr_classifier', LogisticRegression(random_state=42)),
])

# Pipeline creation for Decision Tree Classifier
pipeline_dt = Pipeline([
    ('dt_classifier', DecisionTreeClassifier(criterion="entropy", max_depth=3)),
])

# Pipeline creation for Random Forest Classifier
pipeline_rf = Pipeline([
    ('rf_classifier', RandomForestClassifier(n_estimators=100, max_depth=3,
                                             random_state=42)),
])


In [None]:
# Listing all the Pipelines created
pipelines = [pipeline_lr,pipeline_dt,pipeline_rf]


In [None]:
# Now we will reset the three variables that track which model performs best;
# they are updated in the comparison loop further down
best_accuracy = 0.0
best_classifier = 0
best_pipeline = " "

In [None]:
# Let's now create a dictionary that maps each pipeline's index to its model name
pipe_dict = {0:'Logistic regression',
 1:'Decision Tree',
 2:'Random Forest'}
 

In [None]:
# Let's fit our data into the pipelines
for pipe in pipelines:
    pipe.fit(x_train,y_train)
# Now we will evaluate each model and its test accuracy with the help of a for loop
for i,model in enumerate(pipelines):
    print("{} test accuracy: {:.2f}".format(pipe_dict[i],model.score(x_test,y_test)))


In [None]:
# Now let's use the variables created earlier to record the best-performing model
for i,model in enumerate(pipelines):
    if model.score(x_test,y_test)>best_accuracy:
        best_accuracy = model.score(x_test,y_test)
        best_pipeline = model
        best_classifier = i
print('Classifier with best accuracy:{}'.format(pipe_dict[best_classifier]))
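
The selection loop above scores each pipeline twice and tracks the winner by index. A more compact sketch of the same comparison keeps the pipelines in a dictionary keyed by name and picks the best with `max`. It runs on synthetic data and covers only two of the models to stay short; the names `candidates`, `scores` and `best_name` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the Titanic features.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

candidates = {
    "Logistic regression": Pipeline(
        [("lr_classifier", LogisticRegression(random_state=42))]
    ),
    "Decision Tree": Pipeline(
        [("dt_classifier", DecisionTreeClassifier(max_depth=3, random_state=42))]
    ),
}

# Fit each pipeline once and score it once on the held-out split.
scores = {
    name: pipe.fit(X_tr, y_tr).score(X_te, y_te)
    for name, pipe in candidates.items()
}
best_name = max(scores, key=scores.get)
print(f"Classifier with best accuracy: {best_name} ({scores[best_name]:.2f})")
```

Keeping names and pipelines together removes the need for a separate index-to-name dictionary.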


# Conclusion

In conclusion, the Titanic dataset provides valuable insight into the tragic event that claimed the lives of passengers and crew members aboard the ill-fated ship. Through data analysis, we can understand the demographics of those on board, the factors associated with survival, and the survival rates of different groups. This dataset serves as a reminder of the importance of safety measures and caution when traveling, especially in dangerous situations. Furthermore, it demonstrates the potential of data analysis to uncover patterns and correlations in historical events.

We cleaned the dataset with the help of descriptive statistics and then built several models to find the one with the best accuracy. Without dimensionality reduction, the models achieved better scores. The classifier with the best accuracy was the Decision Tree, at 0.79.