<a href="https://colab.research.google.com/github/Derpitron/titanic_investigation/blob/main/titanic_investigation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import the numerical and data-analysis libraries.

In [None]:
import numpy as np
import pandas as pd

Mount Google Drive when running in Colab; skip this step when running locally.

In [None]:
try:
	from google.colab import drive
	drive.mount('/content/drive')
except ModuleNotFoundError:
	pass

Load the CSV dataset from the Google Drive folder, falling back to the local `data/` folder.

In [None]:
#import os

try:
	titanic_train_df = pd.read_csv( 'drive/MyDrive/titanic_investigation/titanic.csv' )
except FileNotFoundError:
	titanic_train_df = pd.read_csv( 'data/titanic.csv' )
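
The try/except fallback above could be generalised into a small helper that tries several candidate paths in order. This is only a sketch: `load_first_available` is a hypothetical helper name (not part of pandas), and the demo writes a throwaway CSV with made-up values to a temporary directory so that it runs without Drive or the `data/` folder.

```python
import os
import tempfile

import pandas as pd


def load_first_available(paths):
    """Return a DataFrame read from the first path in `paths` that exists.

    Hypothetical helper for illustration only. Raises FileNotFoundError
    if none of the candidate paths can be opened.
    """
    for path in paths:
        try:
            return pd.read_csv(path)
        except FileNotFoundError:
            continue  # try the next candidate location
    raise FileNotFoundError(f"none of these paths exist: {list(paths)}")


# Demo with a throwaway file (made-up values, not the Titanic data).
with tempfile.TemporaryDirectory() as tmp_dir:
    real_path = os.path.join(tmp_dir, "demo.csv")
    pd.DataFrame({"survived": [0, 1], "pclass": [3, 1]}).to_csv(real_path, index=False)

    missing_path = os.path.join(tmp_dir, "does_not_exist.csv")
    demo_df = load_first_available([missing_path, real_path])

print(demo_df.shape)
```

In the notebook itself, the two candidate paths would be the Drive path and `data/titanic.csv`, in that order.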

Control the color palette for generated graphs.

In [None]:
graph_colour_palette = 'flare'

Inspect the first rows of the Titanic dataframe to check that it loaded correctly.

In [None]:
titanic_train_df.head(10)

The `tail()` method shows the last `n` records (here, the last 10).

In [None]:
titanic_train_df.tail(10)

The `shape` attribute gives the number of rows and columns of the dataframe as a `(rows, columns)` tuple.

In [None]:
titanic_train_df.shape

Display each column's name, non-null count, and dtype.

In [None]:
titanic_train_df.info()

**What each field means:**

`survived`: Whether the person survived or not. `0` = no, `1` = yes

`pclass`: Passenger class: `1`, `2`, or `3` (first, second, or third class)

`name`: Name of passenger

`sex`: Sex/gender of the passenger

`age`: Age of passenger

`cabin`: The cabin number of the passenger

etc.

**Scrub the data and clean it.**

Begin by finding out whether there are any missing values. `isnull()` returns `True` for every cell whose value is missing and `False` for every cell whose value exists.



1. Forecast
2. Classify
3. Identify
4. Clustering



In [None]:
titanic_train_df.describe()

In [None]:
titanic_train_df.isnull()
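
A full boolean frame is hard to read at a glance, so it often helps to aggregate it per column. The sketch below assumes a small toy frame with made-up values (these are not rows from `titanic.csv`); `toy_df`, `missing_per_column`, and `missing_fraction` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (NOT rows from titanic.csv).
toy_df = pd.DataFrame(
    {
        "survived": [1, 0, 1, 0],
        "age": [29.0, np.nan, 2.0, np.nan],
        "cabin": ["B5", None, None, None],
    }
)

# isnull() gives a same-shaped frame of booleans; sum() counts the True cells per column.
missing_per_column = toy_df.isnull().sum()

# The mean of a boolean column is the fraction of missing values in that column.
missing_fraction = toy_df.isnull().mean()

print(missing_per_column)
print(missing_fraction)
```

Columns with a high missing fraction are the natural candidates for dropping during cleaning.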

Data Visualisation:

In [None]:
# Importing data-vis libraries

import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib import style
%matplotlib inline

Generate a heatmap of the missing values in the first 100 rows.

In [None]:
sns.heatmap(
    titanic_train_df.head(100).isnull(),
    yticklabels = True,
    cbar = True,
    cmap = graph_colour_palette
)

Remove the empty or nearly empty columns, along with other columns not used in this analysis, as part of **data cleaning**.

In [None]:
titanic_cleaned_df = titanic_train_df.drop(
	[
		'age',
		'name',
		'fare',
		'ticket',
		'cabin',
		'boat',
		'body',
		'home.dest',
	],
	axis = 1
)

In [None]:
titanic_cleaned_df.shape

In [None]:
titanic_cleaned_df = titanic_cleaned_df.dropna()

In [None]:
titanic_cleaned_df.shape

Regenerate the missing-value heatmap from the cleaned dataframe:

In [None]:
sns.heatmap(
    titanic_cleaned_df.head(100).isnull(),
    yticklabels = True,
    cbar = True,
    cmap = graph_colour_palette
)
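
Besides the heatmap, the effect of the cleaning steps can also be checked numerically. The following is a minimal sketch on a toy frame with made-up values (not rows from `titanic.csv`); the names `toy_df`, `toy_cleaned_df`, `rows_removed`, and `remaining_missing` are illustrative.

```python
import pandas as pd

# Toy frame with made-up values (NOT rows from titanic.csv).
toy_df = pd.DataFrame(
    {
        "survived": [1, 0, 1, 0],
        "sex": ["female", "male", None, "male"],
        "cabin": ["B5", None, None, None],
    }
)

# Drop a sparse column, then drop every row that still has a missing value.
toy_cleaned_df = toy_df.drop(["cabin"], axis=1).dropna()

# How many rows did dropna() remove, and is anything still missing?
rows_removed = len(toy_df) - len(toy_cleaned_df)
remaining_missing = int(toy_cleaned_df.isnull().sum().sum())

print(toy_cleaned_df.shape)
print(rows_removed, remaining_missing)
```

A `remaining_missing` of zero confirms that the cleaned frame has no gaps, which is what an all-one-colour heatmap shows visually.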

## Explore, model, and interpret the data

In [None]:
sns.set_style('whitegrid')
sns.countplot(
    x = 'survived',
    data = titanic_cleaned_df,
    palette = graph_colour_palette
)

In [None]:
sns.set_style('darkgrid')
sns.countplot(
    x = 'survived',
    hue = 'sex',
    data = titanic_cleaned_df,
    palette = graph_colour_palette
)

In [None]:
sns.catplot(
    x = 'sex',
    y = 'survived',
    kind = 'point',
    data = titanic_cleaned_df,
    palette = graph_colour_palette
)
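
A point plot draws the mean of `y` for each category of `x` by default, and because `survived` is coded as 0/1, that mean is the survival rate. The same numbers can be computed with a `groupby`. This is a sketch on a toy frame with made-up values (not rows from `titanic.csv`); `toy_df` and `survival_rate_by_sex` are illustrative names.

```python
import pandas as pd

# Toy frame with made-up values (NOT rows from titanic.csv).
toy_df = pd.DataFrame(
    {
        "sex": ["female", "female", "male", "male", "male", "female"],
        "survived": [1, 1, 0, 0, 1, 0],
    }
)

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate per group.
survival_rate_by_sex = toy_df.groupby("sex")["survived"].mean()

print(survival_rate_by_sex)
```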

In [None]:
sns.catplot(
    x = 'sex',
    col = 'pclass',
    hue = 'sex',
    kind = 'count',
    data = titanic_cleaned_df,
    height = 7,
    aspect = 0.8,
    palette = graph_colour_palette
)

plt.subplots_adjust(top = 0.9)
plt.suptitle('Titanic Investigation - Class and Gender separation and analysis', fontsize = 20)

The count plot above shows how many male and female passengers travelled in each class; it does not show survival on its own. Survival by sex is shown in the earlier plots of `survived` against `sex`, where female passengers have a markedly higher survival rate than male passengers.

Survival also differs by passenger class: the rate is highest in first class (`pclass == 1.0`) and lowest in third class (`pclass == 3.0`).

Thus, `pclass` and `sex` are both strongly associated with a passenger's likelihood of surviving the sinking of the Titanic.
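
The combined effect of `sex` and `pclass` on survival can be tabulated directly with a pivot table of mean survival. The sketch below runs on a toy frame with made-up values (not rows from `titanic.csv`, so the rates it prints are not results for the real data); `toy_df` and `survival_rate` are illustrative names.

```python
import pandas as pd

# Toy frame with made-up values (NOT rows from titanic.csv).
toy_df = pd.DataFrame(
    {
        "pclass": [1, 1, 1, 3, 3, 3, 3, 3],
        "sex": ["female", "male", "male", "female", "female", "male", "male", "male"],
        "survived": [1, 1, 0, 1, 0, 0, 0, 0],
    }
)

# Rows are passenger classes, columns are sexes, cells are mean survival (0 to 1).
survival_rate = toy_df.pivot_table(
    index="pclass", columns="sex", values="survived", aggfunc="mean"
)

print(survival_rate)
```

Running the same `pivot_table` call on the cleaned Titanic dataframe would give the per-class, per-sex survival rates that the conclusion relies on.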