# Titanic Dataset Data Analysis

Perform data cleaning and exploratory data analysis (EDA) on a dataset of your choice, such as the Titanic dataset from Kaggle. Explore the relationships between variables and identify patterns and trends in the data.

## Data Understanding

The descriptions of the columns in the data are as follows:

    1. Survived - Survival (0 = No; 1 = Yes)
    2. Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
    3. Name - Name
    4. Sex - Sex
    5. Age - Age
    6. SibSp - Number of Siblings/Spouses Aboard
    7. Parch - Number of Parents/Children Aboard
    8. Ticket - Ticket Number
    9. Fare - Passenger Fare
    10. Cabin - Cabin
    11. Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

In [2]:
# Importing libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

%matplotlib inline

In [3]:
# Loading data
train_data = pd.read_csv('../Downloads/titanic_data/train.csv')
test_data = pd.read_csv('../Downloads/titanic_data/test.csv')


In [33]:
# Viewing the datasets
display(train_data.sample(3))

display(test_data.sample(3))

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
242,243,0,2,"Coleridge, Mr. Reginald Charles",male,29.0,0,0,W./C. 14263,10.5,,S
661,662,0,3,"Badt, Mr. Mohamed",male,40.0,0,0,2623,7.225,,C
547,548,1,2,"Padro y Manent, Mr. Julian",male,,0,0,SC/PARIS 2146,13.8625,,C


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
229,1121,2,"Hocking, Mr. Samuel James Metcalfe",male,36.0,0,0,242963,13.0,,S
358,1250,3,"O'Keefe, Mr. Patrick",male,,0,0,368402,7.75,,Q
407,1299,1,"Widener, Mr. George Dunton",male,50.0,1,1,113503,211.5,C80,C


The test data lacks one column, 'Survived'; otherwise the two datasets have the same columns and can be combined directly.

Therefore, I will store the 'Survived' column (the one column that differs) in a separate 'survived' variable and drop it from the training data before combining.
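
The column that differs between the two frames can also be found programmatically rather than by eye. The following is a minimal sketch that uses two small toy frames (`toy_train` and `toy_test`, hypothetical stand-ins with made-up rows) instead of the real files:

```python
import pandas as pd

# Toy stand-ins for train_data and test_data (made-up rows, fewer columns)
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
toy_test = pd.DataFrame({"PassengerId": [892, 893], "Pclass": [3, 2]})

# Columns present in one frame but not in the other
only_in_train = toy_train.columns.difference(toy_test.columns)
only_in_test = toy_test.columns.difference(toy_train.columns)

print(list(only_in_train))
print(list(only_in_test))
```

Applied to the real `train_data` and `test_data`, the same two lines should confirm that 'Survived' is the only difference.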

In [35]:
# Keeping the target column separately
survived = train_data['Survived'].copy()

# Dropping it so that both datasets share the same columns
train_data.drop('Survived', axis=1, inplace=True)

# Combining the test and train data
data = pd.merge(test_data, train_data, how='outer')
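
With no `on` argument, `pd.merge` joins on every column the two frames share, so the outer merge above acts as a row-wise stack only as long as no row appears in both frames. A more direct way to express the same intent is `pd.concat`. The sketch below is an alternative, not what this notebook ran; the toy frames and the `source` column are hypothetical:

```python
import pandas as pd

# Toy stand-ins for test_data and train_data (made-up rows)
toy_test = pd.DataFrame({"PassengerId": [892, 893], "Pclass": [3, 2], "Age": [34.5, None]})
toy_train = pd.DataFrame({"PassengerId": [1, 2, 3], "Pclass": [3, 1, 3], "Age": [22.0, 38.0, None]})

# Tag each row with its origin, then stack the rows.
# ignore_index=True rebuilds a clean 0..n-1 index.
stacked = pd.concat(
    [toy_test.assign(source="test"), toy_train.assign(source="train")],
    ignore_index=True,
)

print(stacked)
```

The `source` column makes it easy to split the combined frame back into its two parts later, for example to line the training rows up with `survived` again.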

In [36]:
# Viewing the data
data.sample(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1060,643,3,"Skoog, Miss. Margit Elizabeth",female,2.0,3,2,347088,27.9,,S
996,579,3,"Caram, Mrs. Joseph (Maria Elias)",female,,1,0,2689,14.4583,,C
522,105,3,"Gustafsson, Mr. Anders Vilhelm",male,37.0,2,0,3101276,7.925,,S
504,87,3,"Ford, Mr. William Neal",male,16.0,1,3,W./C. 6608,34.375,,S
528,111,1,"Porter, Mr. Walter Chamberlain",male,47.0,0,0,110465,52.0,C110,S


In [37]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Pclass       1309 non-null   int64  
 2   Name         1309 non-null   object 
 3   Sex          1309 non-null   object 
 4   Age          1046 non-null   float64
 5   SibSp        1309 non-null   int64  
 6   Parch        1309 non-null   int64  
 7   Ticket       1309 non-null   object 
 8   Fare         1308 non-null   float64
 9   Cabin        295 non-null    object 
 10  Embarked     1307 non-null   object 
dtypes: float64(2), int64(4), object(5)
memory usage: 112.6+ KB


The non-null counts show that Age, Fare, Cabin and Embarked contain missing values.

In [38]:
def missing_value_magnitude(data):
    # Percentage of missing values in each column of `data`
    return pd.DataFrame(data.isna().sum() / data.shape[0] * 100, columns=['% missing values'])

In [39]:
display(missing_value_magnitude(train_data))

display(missing_value_magnitude(test_data))

display(missing_value_magnitude(data))

Unnamed: 0,% missing values
PassengerId,0.0
Pclass,0.0
Name,0.0
Sex,0.0
Age,19.86532
SibSp,0.0
Parch,0.0
Ticket,0.0
Fare,0.0
Cabin,77.104377


Unnamed: 0,% missing values
PassengerId,0.0
Pclass,0.0
Name,0.0
Sex,0.0
Age,20.574163
SibSp,0.0
Parch,0.0
Ticket,0.0
Fare,0.239234
Cabin,78.229665


Unnamed: 0,% missing values
PassengerId,0.0
Pclass,0.0
Name,0.0
Sex,0.0
Age,20.091673
SibSp,0.0
Parch,0.0
Ticket,0.0
Fare,0.076394
Cabin,77.463713


The Cabin column is missing roughly 77% of its values. Filling in that many values and then analysing them would introduce too much bias, so dropping the column is the better call.

In [40]:
data.drop('Cabin', axis=1, inplace=True)
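
The same decision can be written as a general rule: drop any column whose share of missing values exceeds a cutoff. This is a minimal sketch on a toy frame with made-up values; the 50% cutoff (`MAX_MISSING_PCT`) is an assumption chosen for illustration, not a value used in this analysis:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: Cabin is mostly missing, Age partly, Fare not at all
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

MAX_MISSING_PCT = 50  # hypothetical cutoff

# isna().mean() gives the fraction of missing values per column
missing_pct = toy.isna().mean() * 100
to_drop = missing_pct[missing_pct > MAX_MISSING_PCT].index.tolist()

cleaned = toy.drop(columns=to_drop)
print(to_drop)
print(cleaned.columns.tolist())
```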

In [45]:
# Checking the summary of the distribution of the data
display(data.describe(include=['float', 'int']))

display(data.describe(include='object'))

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0
mean,655.0,2.294882,29.881138,0.498854,0.385027,33.295479
std,378.020061,0.837836,14.413493,1.041658,0.86556,51.758668
min,1.0,1.0,0.17,0.0,0.0,0.0
25%,328.0,2.0,21.0,0.0,0.0,7.8958
50%,655.0,3.0,28.0,0.0,0.0,14.4542
75%,982.0,3.0,39.0,1.0,0.0,31.275
max,1309.0,3.0,80.0,8.0,9.0,512.3292


Unnamed: 0,Name,Sex,Ticket,Embarked
count,1309,1309,1309,1307
unique,1307,2,929,3
top,"Kelly, Mr. James",male,CA. 2343,S
freq,2,843,11,914


Things to consider:

- Pclass may be an ordinal variable
- Look at the number of people aged below 1
- Look for outliers in Fare. Why do some passengers have a fare of 0?
- Why does the name 'Kelly, Mr. James' appear twice?
- Why are there significantly fewer unique tickets (929) than passengers (1309)?
- Missing values in Age, Fare and Embarked still need to be filled
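
Several of these questions can be checked with simple boolean filters. The sketch below runs on a toy frame with invented names and values (not real passengers), so only the filtering pattern carries over to `data`:

```python
import numpy as np
import pandas as pd

# Toy frame with invented rows that reproduce each situation once
toy = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Doe, Mr. John", "Roe, Miss. Ann", "Poe, Master. Tom"],
    "Age": [34.5, 44.0, 0.75, np.nan],
    "Ticket": ["1001", "1002", "A1", "A1"],
    "Fare": [7.83, 8.05, 0.0, 19.26],
})

# Passengers younger than one year (rows with a missing Age are excluded)
infants = toy[toy["Age"] < 1]

# Passengers recorded with a fare of 0
zero_fare = toy[toy["Fare"] == 0]

# Every row whose Name occurs more than once
repeated_names = toy[toy.duplicated(subset="Name", keep=False)]

# Tickets shared by more than one passenger
ticket_sizes = toy["Ticket"].value_counts()
shared_tickets = ticket_sizes[ticket_sizes > 1]

print(infants, zero_fare, repeated_names, shared_tickets, sep="\n\n")
```

Inspecting the other columns of `repeated_names` shows whether a repeated name is a true duplicate row or two different passengers, and `shared_tickets` indicates whether groups travelling on one ticket explain the gap between ticket and passenger counts.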