# End-to-end data analysis project

## Table of Contents

### 1 Introduction
### 2 Loading the Data
### 3 Data Preprocessing

* Handling Missing Values
* Feature Engineering

### 4 Data Visualization
* Survival Rate by Gender
* Survival Rate by Passenger Class
* Age Distribution of Passengers
* Age Distribution by Survival Status

### 5 Streamlit App

### 6 Assignment

### 7 Conclusion

## 1 Introduction

In this project, we analyze the Titanic dataset, which contains information about passengers on the RMS Titanic, including their age, gender, ticket class, and whether they survived or not. We use pandas for data manipulation and Plotly for data visualization. Finally, we create a Streamlit app to display our analysis and visualizations.

Our goal is to explore the dataset, identify trends and correlations, and present our findings through intuitive visualizations. Through this analysis, we hope to gain insights into the factors that may have contributed to the passengers' survival during the sinking of the Titanic.

## 2 Loading the Data and Exploratory Data Analysis

### Import necessary libraries:

In [1]:
import pandas as pd
import plotly.express as px


### Load the dataset:

In [2]:
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
data = pd.read_csv(url)


### Inspect the data:

In [3]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Check for missing values:

In [4]:
data.isnull().sum()


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
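
Raw counts do not show how large a share of each column is missing, which matters when deciding whether to fill or drop it. The following is a minimal sketch of a summary that adds percentages; `missing_summary` is a hypothetical helper, and the toy frame stands in for the Titanic data so the example runs without a download.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame with illustrative values only (not taken from the real dataset)
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
report = missing_summary(toy)
print(report)
```

Calling `missing_summary(data)` on the loaded dataset would give the same kind of table for the real columns.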

## 3 Data Preprocessing

### Fill missing values or drop rows/columns with missing data:

In [5]:
# Fill missing values in the 'Age' column with the median age value
data['Age'] = data['Age'].fillna(data['Age'].median())

# Fill missing values in the 'Embarked' column with the most frequent value
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

# Drop the 'Cabin' column from the dataset
data.drop('Cabin', axis=1, inplace=True)
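
The same cleaning steps can also be wrapped in a function that returns a new DataFrame and leaves the input untouched, which makes them easier to reuse and to check. This is a sketch under that assumption; `preprocess` is a hypothetical name, and the toy frame is illustrative only. Assigning the result of `fillna` back to the column avoids relying on in-place modification of a selected column, which recent pandas versions warn about.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: fill Age and Embarked, drop Cabin."""
    out = df.copy()
    # Median is robust to the few very old or very young passengers
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # mode() returns a Series; [0] picks the most frequent value
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Cabin is dropped rather than filled, as in the cell above
    return out.drop(columns="Cabin")

# Toy frame with illustrative values only
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0],
    "Embarked": ["S", "S", None],
    "Cabin": [None, "C85", None],
})
clean = preprocess(toy)
print(clean)
```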



### Feature engineering (create new columns):

In [11]:
# Calculate the size of each passenger's family by summing their siblings/spouses and parents/children, and adding one for themselves
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1

# Create a new binary feature indicating whether a passenger is traveling alone or with family
data['IsAlone'] = data['FamilySize'].apply(lambda x: 1 if x == 1 else 0)
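
To see whether the new features carry any signal, one option is to group by `IsAlone` and compare survival rates. The sketch below does this on a toy frame with made-up values, and uses a vectorized comparison in place of `apply` to build the flag; both choices are illustrative assumptions rather than part of the original analysis.

```python
import pandas as pd

# Toy frame with illustrative values only
toy = pd.DataFrame({
    "SibSp":    [1, 0, 0, 1, 0],
    "Parch":    [0, 0, 0, 2, 0],
    "Survived": [0, 1, 0, 1, 0],
})

# Same definitions as above: family members plus the passenger themselves
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
# Vectorized equivalent of the lambda: True/False converted to 1/0
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

# Mean of a 0/1 column is the survival rate within each group
alone_survival = toy.groupby("IsAlone")["Survived"].mean()
print(alone_survival)
```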



## 4 Data Visualization

### Survival rate by gender:

In [12]:
gender_survival = data.groupby('Sex')['Survived'].mean().reset_index()
fig = px.bar(gender_survival, x='Sex', y='Survived', title='Survival Rate by Gender')
fig.show()


### Survival rate by passenger class:



In [13]:
pclass_survival = data.groupby('Pclass')['Survived'].mean().reset_index()
fig = px.bar(pclass_survival, x='Pclass', y='Survived', title='Survival Rate by Passenger Class')
fig.show()
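
Gender and passenger class can also be examined together rather than one at a time. The following sketch builds a two-way table of survival rates with `pivot_table`; the toy values are made up for illustration and the variable name `rate_table` is an assumption.

```python
import pandas as pd

# Toy frame with illustrative values only
toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female"],
    "Pclass":   [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# Rows are genders, columns are classes, cells are mean survival (the rate)
rate_table = toy.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)
print(rate_table)
```

On the real dataset the same call would be `data.pivot_table(...)`, and the resulting table could be displayed with `px.imshow` in the same way as the correlation matrix.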


### Age distribution of passengers:

In [14]:
fig = px.histogram(data, x='Age', nbins=50, title='Age Distribution of Passengers')
fig.show()


### Age distribution by survival status:

In [15]:
fig = px.histogram(data, x='Age', color='Survived', nbins=50, title='Age Distribution by Survival Status')
fig.show()
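
A histogram shows counts, so it can be hard to read survival rates from it directly. One possible complement is to bin ages into groups with `pd.cut` and compute the rate per group. The bin edges and labels below are arbitrary choices for illustration, and the toy values are made up.

```python
import pandas as pd

# Toy frame with illustrative values only
toy = pd.DataFrame({
    "Age":      [4, 15, 25, 40, 70, 30],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# Hypothetical age bands; intervals are right-inclusive, e.g. (0, 12]
bins = [0, 12, 18, 60, 120]
labels = ["child", "teen", "adult", "senior"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# observed=True restricts the result to groups that actually occur
age_group_survival = toy.groupby("AgeGroup", observed=True)["Survived"].mean()
print(age_group_survival)
```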


### Analyze correlations:

In [16]:
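# Restrict to numeric columns; text columns such as 'Name' and 'Sex' cannot be correlated
correlation_matrix = data.corr(numeric_only=True)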
fig = px.imshow(correlation_matrix, title='Correlation Matrix')
fig.show()
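
A correlation matrix only covers numeric columns, so `Sex` is left out even though the bar chart above suggests it matters. A minimal sketch of one way to include it is to map the two values to 0 and 1 first. The column name `SexCode` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy frame with illustrative values only
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex":      ["male", "female", "female", "male"],
    "Fare":     [7.25, 71.28, 7.92, 8.05],
    "Name":     ["A", "B", "C", "D"],
})

encoded = toy.copy()
# Hypothetical numeric encoding of the text column
encoded["SexCode"] = encoded["Sex"].map({"male": 0, "female": 1})

# select_dtypes drops the remaining text columns before correlating
corr = encoded.select_dtypes(include="number").corr()
print(corr)
```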


## 5 Streamlit App

#### To create a Streamlit app for the Titanic data analysis, follow these steps:

Install Streamlit if you haven't already, for example by running `pip install streamlit` in a terminal.

#### Create a new Python file, e.g., titanic_app.py, and import the necessary libraries:

In [None]:
import pandas as pd
import plotly.express as px
import streamlit as st


#### Add a title and a brief description to the app:

In [None]:
st.title("Titanic Data Analysis")
st.write("This app analyzes the Titanic dataset and displays various visualizations.")


#### Load the dataset and preprocess it (as shown in the Data Preprocessing section):

In [None]:
@st.cache_data
def load_data():
    url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
    data = pd.read_csv(url)
    data['Age'] = data['Age'].fillna(data['Age'].median())
    data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
    data.drop('Cabin', axis=1, inplace=True)
    data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
    data['IsAlone'] = data['FamilySize'].apply(lambda x: 1 if x == 1 else 0)
    return data

data = load_data()


#### Display the data (optional):

In [None]:
if st.checkbox("Show Raw Data"):
    st.subheader("Raw Data")
    st.write(data)


### Create the visualizations using Plotly and display them in the app:

#### Survival rate by gender:

In [None]:
gender_survival = data.groupby('Sex')['Survived'].mean().reset_index()
fig_gender_survival = px.bar(gender_survival, x='Sex', y='Survived', title='Survival Rate by Gender')
st.plotly_chart(fig_gender_survival)


#### Survival rate by passenger class:

In [None]:
pclass_survival = data.groupby('Pclass')['Survived'].mean().reset_index()
fig_pclass_survival = px.bar(pclass_survival, x='Pclass', y='Survived', title='Survival Rate by Passenger Class')
st.plotly_chart(fig_pclass_survival)


#### Age distribution of passengers:

In [None]:
fig_age_distribution = px.histogram(data, x='Age', nbins=50, title='Age Distribution of Passengers')
st.plotly_chart(fig_age_distribution)


#### Age distribution by survival status:

In [None]:
fig_age_survival = px.histogram(data, x='Age', color='Survived', nbins=50, title='Age Distribution by Survival Status')
st.plotly_chart(fig_age_survival)


#### Save the file and run the Streamlit app from a terminal:

In [None]:
streamlit run titanic_app.py


### Here's the complete code for the titanic_app.py file:

In [None]:
import pandas as pd
import plotly.express as px
import streamlit as st

st.title("Titanic Data Analysis")
st.write("This app analyzes the Titanic dataset and displays various visualizations.")

@st.cache_data
def load_data():
    url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
    data = pd.read_csv(url)
    data['Age'] = data['Age'].fillna(data['Age'].median())
    data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
    data.drop('Cabin', axis=1, inplace=True)
    data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
    data['IsAlone'] = data['FamilySize'].apply(lambda x: 1 if x == 1 else 0)
    return data

data = load_data()

if st.checkbox("Show Raw Data"):
    st.subheader("Raw Data")
    st.write(data)

gender_survival = data.groupby('Sex')['Survived'].mean().reset_index()
fig_gender_survival = px.bar(gender_survival, x='Sex', y='Survived', title='Survival Rate by Gender')
st.plotly_chart(fig_gender_survival)

pclass_survival = data.groupby('Pclass')['Survived'].mean().reset_index()
fig_pclass_survival = px.bar(pclass_survival, x='Pclass', y='Survived', title='Survival Rate by Passenger Class')
st.plotly_chart(fig_pclass_survival)

fig_age_distribution = px.histogram(data, x='Age', nbins=50, title='Age Distribution of Passengers')
st.plotly_chart(fig_age_distribution)

fig_age_survival = px.histogram(data, x='Age', color='Survived', nbins=50, title='Age Distribution by Survival Status')
st.plotly_chart(fig_age_survival)


## 6 Assignment

### To host the above Streamlit app using Streamlit Share, follow these steps:

#### 1 Sign up for Streamlit Share:

Go to https://share.streamlit.io/ and sign up using your GitHub account. If you don't have a GitHub account, you will need to create one at https://github.com/.
    
#### 2 Create a GitHub repository:

Create a new public repository on your GitHub account. You can name it something like "titanic-data-analysis". For instructions on how to create a new repository, visit https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-new-repository.
    
#### 3 Add your app code to the repository:

Upload your titanic_app.py file to the newly created GitHub repository. You can either do this by directly uploading the file via the GitHub web interface or by using Git from the command line. For instructions on how to upload a file using the web interface, visit https://docs.github.com/en/github/managing-files-in-a-repository/adding-a-file-to-a-repository.
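
The hosted app also needs to know which packages to install. Streamlit's hosting reads Python dependencies from a `requirements.txt` file in the repository; the steps here do not mention that file, so the sketch below is an assumption about what this deployment needs. It writes a file listing the three packages that titanic_app.py imports, with versions left unpinned.

```python
from pathlib import Path

# Packages imported by titanic_app.py; pin versions if reproducibility matters
packages = ["streamlit", "pandas", "plotly"]

# Write one package per line into the current directory
requirements = Path("requirements.txt")
requirements.write_text("\n".join(packages) + "\n")
print(requirements.read_text())
```

The resulting file would be uploaded to the repository alongside titanic_app.py.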

#### 4 Deploy your app on Streamlit Share:

* Go to https://share.streamlit.io/ and sign in with your GitHub account.
* Click on "New app" in the top right corner.
* Select your GitHub repository ("titanic-data-analysis" or whatever you named it) from the "Repository" dropdown.
* In the "Branch" dropdown, choose the appropriate branch, usually "main" or "master".
* In the "File" dropdown, select the titanic_app.py file.
* Click on "Deploy".



Your app should now be deployed on Streamlit Share, and you will be given a URL to access it. Note that the URL will be in the format https://share.streamlit.io/<your-github-username>/<your-repository-name>/<branch-name>/<app-file-name>.

#### 5 Share your app:
You can now share the URL with anyone, and they will be able to access your Streamlit app without needing to install anything. Streamlit Share will automatically update your app whenever you push changes to the corresponding GitHub repository.