<span style="font-size: 35px; color: Red;">Titanic Dataset</span>

* In this project, we analyze the Titanic dataset, which records information about passengers on the RMS Titanic, including their age, gender, ticket class, and whether they survived. We use pandas for data manipulation and Plotly for visualization, and finally build a Streamlit app to display the analysis and charts.

* Our goal is to explore the dataset, identify trends and correlations, and present the findings through intuitive visualizations. The analysis aims to show which factors may have contributed to passengers' survival when the Titanic sank.

<span style="font-size: 30px; color: green;">Import the Libraries</span>

In [21]:
import pandas as pd
# import numpy as np
import plotly.express as px
# import matplotlib.pyplot as plt

<span style="font-size: 30px; color: green;">Load the dataset</span>

In [2]:
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(url)
data.head(5)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


<span style="font-size: 30px; color: Green;">Data Preprocessing</span>

<span style="font-size: 20px; color: Blue;">Check for missing values</span>

In [3]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
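
Raw counts are easier to judge as a share of all rows: from the output above, Age is missing in 177 of 891 rows (about 19.9%) and Cabin in 687 of 891 (about 77.1%). A minimal sketch of computing those shares directly is shown below; the `toy` frame and its values are made up for illustration and are not the Titanic data.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are invented)
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() yields a boolean frame; the mean of booleans per column
# is the fraction of missing values in that column
missing_pct = toy.isnull().mean().mul(100).round(1)

# Show the columns with the most missing data first
print(missing_pct.sort_values(ascending=False))
```

Running the same expression on `data` would give the percentage per column, which is a more direct basis for deciding between imputing a column and dropping it.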

<span style="font-size: 20px; color: Blue;">Handle missing values</span>

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [5]:
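# Fill missing ages with the mean age.
# Assigning the result back avoids the chained in-place fillna,
# which newer pandas versions warn about and may not apply.
data["Age"] = data["Age"].fillna(data["Age"].mean())

# Drop the Cabin column, since most of its values are missing (687 of 891)
data = data.drop(columns="Cabin")

# Fill missing Embarked values with the most frequent port (the mode)
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])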

In [6]:
data.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
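
Filling every missing age with one global mean is simple, but it ignores differences between groups of passengers. As a possible alternative (not what this notebook does), the sketch below fills each gap with the median age of the passenger's class. The `toy` frame is invented for illustration.

```python
import pandas as pd

# Toy data: one missing age in each passenger class (values are invented)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, None, 20.0, 24.0, None],
})

# Global mean imputation, the approach used in the notebook
global_fill = toy["Age"].fillna(toy["Age"].mean())

# Per-class median: transform() returns a Series aligned with the
# original rows, so it can be passed straight to fillna()
class_median = toy.groupby("Pclass")["Age"].transform("median")
grouped_fill = toy["Age"].fillna(class_median)

print(pd.DataFrame({"global": global_fill, "by_class": grouped_fill}))
```

In this toy example the global mean puts 33.5 in both gaps, while the per-class median puts 45.0 in the first-class gap and 22.0 in the third-class gap. Whether that difference matters for the real data is something to check rather than assume.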

<span style="font-size: 30px; color: Green;">Feature Engineering</span>

In [7]:
data

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.000000,1,0,A/5 21171,7.2500,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.000000,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.000000,0,0,STON/O2. 3101282,7.9250,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.000000,1,0,113803,53.1000,S
4,5,0,3,"Allen, Mr. William Henry",male,35.000000,0,0,373450,8.0500,S
...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.000000,0,0,211536,13.0000,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.000000,0,0,112053,30.0000,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,W./C. 6607,23.4500,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.000000,0,0,111369,30.0000,C


In [16]:
# Family size: siblings/spouses plus parents/children aboard, plus the passenger
data["FamilySize"] = data["SibSp"] + data["Parch"] + 1

# Binary flag: 1 if the passenger is traveling alone, 0 if with family
data["IsAlone"] = (data["FamilySize"] == 1).astype(int)

In [20]:
data.tail(5)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,FamilySize,IsAlone
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,S,1,1
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,S,1,1
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,W./C. 6607,23.45,S,4,0
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C,1,1
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,Q,1,1
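
One way to check whether the new features carry any signal is to compare survival rates between passengers traveling alone and those with family. The sketch below derives both features with vectorized operations and then groups by the flag. The `toy` frame is made up, so the resulting rates say nothing about the real dataset.

```python
import pandas as pd

# Toy passengers (values are invented for illustration)
toy = pd.DataFrame({
    "SibSp": [1, 0, 0, 1, 0],
    "Parch": [0, 0, 2, 1, 0],
    "Survived": [0, 1, 1, 0, 0],
})

# Same feature definitions as in the notebook
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

# Survived is 0/1, so its mean per group is the survival rate
alone_rate = toy.groupby("IsAlone")["Survived"].mean()
print(alone_rate)
```

Applying the same `groupby` to `data` would show whether `IsAlone` separates survivors from non-survivors in the actual passenger list.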


<span style="font-size: 30px; color: Green;">Data Visualization</span>

<span style="font-size: 20px; color: Blue;">Survival rate by gender</span>

In [31]:
# Survived is 0/1, so the mean per group is the survival rate
survival_data = data.groupby("Sex")["Survived"].mean().reset_index()
fig = px.bar(survival_data, x="Sex", y="Survived", color="Sex", title='Survival Rate by Gender')
fig.show()
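
A rate alone hides how many passengers it is based on. As a small sketch, the aggregation below returns the group size and number of survivors next to the rate, using pandas named aggregation. The `toy` frame and the output column names `passengers`, `survivors`, and `rate` are illustrative choices, not part of the original analysis.

```python
import pandas as pd

# Toy data (values are invented for illustration)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 0, 1],
})

# Named aggregation: each keyword becomes an output column
summary = (
    toy.groupby("Sex")["Survived"]
    .agg(passengers="count", survivors="sum", rate="mean")
    .reset_index()
)
print(summary)
```

The `rate` column is the same quantity plotted in the bar chart, and `passengers` shows how large each group is.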

<span style="font-size: 20px; color: Blue;">Survival rate by Passenger Class</span>

In [36]:
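survival_data = data.groupby("Pclass")["Survived"].mean().reset_index()
# Treat class as a category so Plotly uses discrete colors instead of a continuous color scale
survival_data["Pclass"] = survival_data["Pclass"].astype(str)
fig = px.bar(survival_data, x="Pclass", y="Survived", title='Survival Rate by Passenger Class', color="Pclass")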
fig.show()
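
Gender and passenger class can also be examined together rather than one at a time. A minimal sketch using `pivot_table` on a made-up `toy` frame is shown below; each cell is the survival rate for one combination of sex and class.

```python
import pandas as pd

# Toy data (values are invented for illustration)
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Rows are sexes, columns are classes, cells are mean survival (the rate)
rate_table = toy.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)
print(rate_table)
```

On the real `data` frame, the same call would produce a small table that could be passed to a heatmap or grouped bar chart if a combined view is wanted.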

<span style="font-size: 20px; color: Blue;">Age distribution of passengers</span>

In [49]:
# Age distribution of all passengers, colored by survival (0 = did not survive, 1 = survived)
fig = px.histogram(data, x="Age", nbins=20, color="Survived", title='Age Distribution of Passengers')
fig.show()
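
A histogram shows counts per age bin, but not the survival rate within each bin. One way to get that is to bucket ages with `pd.cut` and group by the bucket. In the sketch below, the bin edges, the labels, and the `toy` frame are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (values are invented for illustration)
toy = pd.DataFrame({
    "Age": [4.0, 15.0, 30.0, 45.0, 70.0, 8.0],
    "Survived": [1, 0, 0, 1, 0, 1],
})

# Hypothetical age brackets; intervals are right-inclusive by default,
# e.g. (0, 12] for "child"
bins = [0, 12, 18, 60, 100]
labels = ["child", "teen", "adult", "senior"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Survival rate per bracket; observed=True keeps only brackets that occur
group_rate = toy.groupby("AgeGroup", observed=True)["Survived"].mean()
print(group_rate)
```

Note that the notebook filled 177 missing ages with the mean (about 29.7), so any bracket containing that value will be inflated on the real data.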