# Titanic Survival Analysis


#### Loading Data <a class="anchor" id="loading"></a>

Import the libraries and load the dataset.

In [4]:
import warnings

import numpy as np
import pandas as pd

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')

# Load the raw Kaggle train and test splits.
train_df = pd.read_csv(r"C:\Users\Believe\OneDrive\Documents\GitHub\Titanic---Machine-Learning-from-Disaster\data\raw\train.csv")
test_df = pd.read_csv(r"C:\Users\Believe\OneDrive\Documents\GitHub\Titanic---Machine-Learning-from-Disaster\data\raw\test.csv")

# Keep head() as the last expression so the cell displays it.
train_df.head()

In [5]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


#### Data Dictionary

- `Survived`: Survival, 0 = No, 1 = Yes
- `Pclass`: Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd
- `SibSp`: Number of siblings / spouses aboard the Titanic
- `Parch`: Number of parents / children aboard the Titanic
- `Ticket`: Ticket number
- `Cabin`: Cabin number
- `Embarked`: Port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton

#### Data Information <a class="anchor" id="data_info"></a>

Some immediate insights from `train_df.info()`:
* There are 12 columns and 891 rows.
* Most columns hold integers (5 columns) or objects (5 columns); only `Age` and `Fare` are floats.
* The `Age` (714 non-null), `Cabin` (204 non-null) and `Embarked` (889 non-null) columns have missing values, so these columns need cleaning before conducting EDA.
* The column names could be renamed for more consistency.
* Basic summary statistics for each numerical variable are available through `describe()`.
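
The missing-value observation can be checked directly rather than read off the `info()` output. The following is a minimal sketch: `missing_summary` is a hypothetical helper name, and the small frame is made-up illustrative data, not the Titanic file.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up rows for illustration only (assumption: not real passenger data).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})
print(missing_summary(toy))
```

Calling `missing_summary(train_df)` in the notebook should list `Cabin`, `Age` and `Embarked`, in that order, given the non-null counts shown above.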

In [12]:
train_df.describe(include = 'all')

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891 | 891 | 714.0 | 891.0 | 891.0 | 891 | 891.0 | 204 | 889 |
| unique | | | | 891 | 2 | | | | 681 | | 147 | 3 |
| top | | | | Dooley, Mr. Patrick | male | | | | 347082 | | G6 | S |
| freq | | | | 1 | 577 | | | | 7 | | 4 | 644 |
| mean | 446.0 | 0.383838 | 2.308642 | | | 29.699118 | 0.523008 | 0.381594 | | 32.204208 | | |
| std | 257.353842 | 0.486592 | 0.836071 | | | 14.526497 | 1.102743 | 0.806057 | | 49.693429 | | |
| min | 1.0 | 0.0 | 1.0 | | | 0.42 | 0.0 | 0.0 | | 0.0 | | |
| 25% | 223.5 | 0.0 | 2.0 | | | 20.125 | 0.0 | 0.0 | | 7.9104 | | |
| 50% | 446.0 | 0.0 | 3.0 | | | 28.0 | 0.0 | 0.0 | | 14.4542 | | |
| 75% | 668.5 | 1.0 | 3.0 | | | 38.0 | 1.0 | 0.0 | | 31.0 | | |


In [13]:
train_df.groupby(['Pclass'], as_index=False)['Survived'].mean()

| | Pclass | Survived |
|---|---|---|
| 0 | 1 | 0.62963 |
| 1 | 2 | 0.472826 |
| 2 | 3 | 0.242363 |


In [14]:
train_df.groupby(['Sex'], as_index=False)['Survived'].mean()

| | Sex | Survived |
|---|---|---|
| 0 | female | 0.742038 |
| 1 | male | 0.188908 |


In [16]:
train_df.groupby(['SibSp'], as_index=False)['Survived'].mean()

| | SibSp | Survived |
|---|---|---|
| 0 | 0 | 0.345395 |
| 1 | 1 | 0.535885 |
| 2 | 2 | 0.464286 |
| 3 | 3 | 0.25 |
| 4 | 4 | 0.166667 |
| 5 | 5 | 0.0 |
| 6 | 8 | 0.0 |


In [17]:
train_df.groupby(['Parch'], as_index=False)['Survived'].mean()

| | Parch | Survived |
|---|---|---|
| 0 | 0 | 0.343658 |
| 1 | 1 | 0.550847 |
| 2 | 2 | 0.5 |
| 3 | 3 | 0.6 |
| 4 | 4 | 0.0 |
| 5 | 5 | 0.2 |
| 6 | 6 | 0.0 |


In [18]:
train_df['Family_Size'] = train_df['SibSp'] + train_df['Parch'] + 1
test_df['Family_Size'] = test_df['SibSp'] + test_df['Parch'] + 1

train_df.groupby(['Family_Size'], as_index=False)['Survived'].mean()

| | Family_Size | Survived |
|---|---|---|
| 0 | 1 | 0.303538 |
| 1 | 2 | 0.552795 |
| 2 | 3 | 0.578431 |
| 3 | 4 | 0.724138 |
| 4 | 5 | 0.2 |
| 5 | 6 | 0.136364 |
| 6 | 7 | 0.333333 |
| 7 | 8 | 0.0 |
| 8 | 11 | 0.0 |


In [38]:
family_map = {1: 'Alone', 2: 'Small', 3: 'Small', 4: 'Small', 5: 'Medium', 6: 'Medium', 7: 'Large', 8: 'Large', 11: 'Large'}
train_df['Family_Size_Grouped'] = train_df['Family_Size'].map(family_map)
test_df['Family_Size_Grouped'] = test_df['Family_Size'].map(family_map)

train_df.groupby(['Family_Size_Grouped'], as_index=False)['Survived'].mean()

| | Family_Size_Grouped | Survived |
|---|---|---|
| 0 | Alone | 0.303538 |
| 1 | Large | 0.16 |
| 2 | Medium | 0.162162 |
| 3 | Small | 0.578767 |
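
The dictionary mapping used for `Family_Size_Grouped` returns `NaN` for any family size that is not listed as a key (for example 9 or 10). One possible alternative, sketched here under the assumption that the same group boundaries are wanted, is range-based binning with `pd.cut`. The function name `group_family_size` is hypothetical.

```python
import numpy as np
import pandas as pd


def group_family_size(sizes: pd.Series) -> pd.Series:
    """Bin family sizes into the same four groups as the dictionary mapping.

    Bins are right-inclusive: (0, 1] Alone, (1, 4] Small,
    (4, 6] Medium, (6, inf] Large.
    """
    return pd.cut(
        sizes,
        bins=[0, 1, 4, 6, np.inf],
        labels=["Alone", "Small", "Medium", "Large"],
    )


# Illustrative sizes, including 9, which the dictionary mapping does not cover.
sizes = pd.Series([1, 2, 4, 5, 6, 7, 11, 9])
print(group_family_size(sizes).tolist())
```

The result is a categorical column, which also keeps the groups in a meaningful order (Alone, Small, Medium, Large) instead of the alphabetical order seen in the table above.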


In [20]:
train_df.groupby(['Embarked'], as_index=False)['Survived'].mean()

| | Embarked | Survived |
|---|---|---|
| 0 | C | 0.553571 |
| 1 | Q | 0.38961 |
| 2 | S | 0.336957 |


### Data Cleaning <a class="anchor" id="cleaing"></a>

The raw Titanic dataset was loaded and inspected in Jupyter. However, data cleaning was performed externally in Excel for simplicity and clarity.

The cleaned dataset `train_clean.csv` was then re-imported into this notebook for further analysis.

### Cleaning Steps Performed in Excel:
* Dropped the `Name`, `Cabin` and `Ticket` columns, which are not used in this analysis.
* Filled in blanks in `Age` with the median of the `Age` column.
* Created a new column named `Family_Size`, calculated as `SibSp + Parch + 1`. Note that the preview of `train_clean.csv` in the next section shows `Family_Size` equal to `SibSp + Parch`, so the `+ 1` appears not to have been applied in the exported file.
* Replaced missing values in `Embarked` with the mode of the column.
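
For readers who prefer to keep the whole pipeline in the notebook, the Excel steps above could be approximated in pandas. This is a sketch, not the procedure actually used: `clean_titanic` is a hypothetical helper, it follows the stated `SibSp + Parch + 1` formula, and the sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd


def clean_titanic(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four cleaning steps to a copy of the raw frame."""
    # 1. Drop the columns not used in the analysis (drop returns a new frame).
    out = df.drop(columns=["Name", "Cabin", "Ticket"], errors="ignore")
    # 2. Fill missing ages with the median age.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # 3. Family size = siblings/spouses + parents/children + the passenger.
    out["Family_Size"] = out["SibSp"] + out["Parch"] + 1
    # 4. Fill missing ports with the most frequent port.
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode().iloc[0])
    return out


# Made-up rows for illustration only (assumption: not real passenger data).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [20.0, np.nan, 40.0, 30.0],
    "SibSp": [1, 0, 0, 2],
    "Parch": [0, 0, 1, 1],
    "Ticket": ["t1", "t2", "t3", "t4"],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", None, "S", "C"],
})
print(clean_titanic(toy))
```

Computing the median and mode on the training set and reusing them for the test set would avoid leaking test-set information; the sketch computes them per frame for brevity.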

### Cleaned Dataset Info

In [36]:
clean_train_df = pd.read_csv(r"C:\Users\Believe\OneDrive\Documents\GitHub\Titanic---Machine-Learning-from-Disaster\data\final\train_clean.csv", delimiter=';')
# clean_test_df = pd.read_csv(...)  # TODO: path to the cleaned test file was left unspecified
clean_train_df.head()

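| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Family_Size | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 1 | 7.25 | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 1 | 71.2833 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 0 | 7.925 | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 1 | 53.1 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 0 | 8.05 | S |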


In [37]:
clean_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          891 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Family_Size  891 non-null    int64  
 8   Fare         891 non-null    float64
 9   Embarked     891 non-null    object 
dtypes: float64(2), int64(6), object(2)
memory usage: 69.7+ KB


## Exploratory Data Analysis

Exploratory data analysis was done in Tableau. The link below opens the final dashboard, and I summarize what the dashboard shows in the findings below the link.

LINK: [Visualization of Titanic Survival Analysis](https://public.tableau.com/views/TitanicSurvivalAnalysisEDA/Dashboard1?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link)

### Findings

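* Women had a much higher survival rate than men (about 74% versus 19%).
* First-class passengers had a higher survival rate (about 63%) than second-class (about 47%) and third-class (about 24%) passengers.
* Younger passengers had slightly better survival chances.
* Very large families had low survival rates, while small families (2–4 members) had the highest, at about 58%. Passengers travelling alone survived at a rate of about 30%, against about 38% overall.
* The majority of passengers boarded the Titanic in Southampton (644 of the 889 with a recorded port).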
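
The age finding comes from the Tableau dashboard and has no matching cell in the notebook. A minimal pandas cross-check might bin `Age` and compare survival rates per band. The band edges and labels below are assumptions chosen for illustration, `survival_by_age_band` is a hypothetical helper, and the sample rows are made up.

```python
import numpy as np
import pandas as pd


def survival_by_age_band(
    df: pd.DataFrame,
    edges=(0, 12, 18, 40, 60, 120),
    labels=("child", "teen", "adult", "middle-aged", "senior"),
) -> pd.Series:
    """Mean of `Survived` per age band; rows with missing `Age` are skipped."""
    bands = pd.cut(df["Age"], bins=list(edges), labels=list(labels)).rename("Age_Band")
    # observed=True omits bands that contain no passengers.
    return df.groupby(bands, observed=True)["Survived"].mean()


# Made-up rows for illustration only (assumption: not real passenger data).
toy = pd.DataFrame({
    "Age": [5, 10, 30, 35, 70, np.nan],
    "Survived": [1, 0, 0, 1, 0, 1],
})
rates = survival_by_age_band(toy)
print(rates)
```

Running `survival_by_age_band(train_df)` on the raw data, before the median fill, avoids the imputed ages inflating one band.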

### Reproducibility
- Data: [Kaggle Titanic Dataset](https://www.kaggle.com/c/titanic)  
- Cleaning: Excel + Pandas.  
- Visualization: [Tableau dashboards](https://public.tableau.com/views/TitanicSurvivalAnalysisEDA/Dashboard1?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link).  
- Code & versioning: [GitHub Repository Link]  
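
The notebook cells load data from absolute paths on one machine, which will not resolve after cloning the repository elsewhere. A small sketch of a relative-path loader follows. It assumes the notebook is run from the repository root and that the `data/raw` and `data/final` folders seen in the paths above sit directly under it; `load_csv` and `DATA_DIR` are hypothetical names.

```python
from pathlib import Path

import pandas as pd

# Assumption: the working directory is the repository root.
DATA_DIR = Path("data")


def load_csv(data_dir, relative_path, **read_kwargs) -> pd.DataFrame:
    """Read a CSV located under data_dir, failing with a clear message."""
    path = Path(data_dir) / relative_path
    if not path.exists():
        raise FileNotFoundError(f"Expected data file at {path.resolve()}")
    return pd.read_csv(path, **read_kwargs)


# Example calls, commented out because they need the Kaggle files on disk:
# train_df = load_csv(DATA_DIR, "raw/train.csv")
# clean_train_df = load_csv(DATA_DIR, "final/train_clean.csv", sep=";")
```

Extra keyword arguments are passed straight to `pd.read_csv`, so the semicolon delimiter of the Excel export can be supplied per call.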
