## Titanic Dataset
### Dataset Features:
1. survived: Whether the passenger survived (0 = No, 1 = Yes).
2. pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
3. sex: Gender of the passenger.
4. age: Age of the passenger in years. Some values are missing.
5. sibsp: Number of siblings/spouses aboard the Titanic.
6. parch: Number of parents/children aboard the Titanic.
7. fare: Passenger fare.
8. embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).
9. class: Categorical version of 'pclass' (First, Second, Third), used for plotting by Seaborn.
10. who: Describes whether the passenger is a man, woman, or child.
11. adult_male: Indicates whether the passenger is an adult male (True/False).
12. deck: The deck the passenger was on (missing for many passengers).
13. embark_town: The name of the town where the passenger boarded.
14. alive: Indicator of whether the passenger survived (Yes/No, derived from 'survived').
15. alone: Indicates whether the passenger was traveling alone (True/False).


In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

Loading the dataset into the notebook


In [12]:
dataSet=sns.load_dataset('titanic')

In [13]:
dataSet.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [15]:
dataSet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [14]:
print("Random sample from the dataset: ")
dataSet.sample(20)

Random sample from the dataset: 


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
610,0,3,female,39.0,1,5,31.275,S,Third,woman,False,,Southampton,no,False
328,1,3,female,31.0,1,1,20.525,S,Third,woman,False,,Southampton,yes,False
198,1,3,female,,0,0,7.75,Q,Third,woman,False,,Queenstown,yes,True
778,0,3,male,,0,0,7.7375,Q,Third,man,True,,Queenstown,no,True
723,0,2,male,50.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
868,0,3,male,,0,0,9.5,S,Third,man,True,,Southampton,no,True
64,0,1,male,,0,0,27.7208,C,First,man,True,,Cherbourg,no,True
259,1,2,female,50.0,0,1,26.0,S,Second,woman,False,,Southampton,yes,False
196,0,3,male,,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
528,0,3,male,39.0,0,0,7.925,S,Third,man,True,,Southampton,no,True


We can see that some values in the age column are null.
Additionally, some columns are redundant.

### Data Cleaning
#### Removing redundant columns
The "alive" and "survived" columns are one and the same, so we need to get rid of one. The same applies to the "embarked" and "embark_town" columns.

In [None]:
# Check whether the two columns are identical after encoding 'alive' as 1/0
are_identical = dataSet['survived'].equals(dataSet['alive'].apply(lambda x: 1 if x == 'yes' else 0))
print(f"Are the 'survived' and 'alive' identical? {are_identical}")

Are the 'survived' and 'alive' identical? True
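
The check above can also be written without `apply`. The following is a minimal sketch on a hand-made frame; the `toy` DataFrame and the `alive_to_int` mapping are made up for illustration and are not part of the notebook. It uses `Series.map` with a dictionary, which turns any label outside the dictionary into NaN instead of silently encoding it as 0.

```python
import pandas as pd

# Hypothetical toy data standing in for the two Titanic columns
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "alive": ["no", "yes", "yes", "no"],
})

# Assumed mapping between the two encodings
alive_to_int = {"yes": 1, "no": 0}

# map() leaves labels that are not in the dictionary as NaN, so typos are not hidden
alive_as_int = toy["alive"].map(alive_to_int)

# Element-wise comparison, then require every row to match
columns_match = bool((alive_as_int == toy["survived"]).all())
print(f"Are 'survived' and 'alive' identical? {columns_match}")
```

With `apply(lambda x: 1 if x == 'yes' else 0)`, a misspelled label such as "Yes" would be counted as 0; with `map`, it shows up as a missing value and the comparison fails visibly.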


In [22]:
# Compare the two columns directly (no yes/no encoding applies here)
are_identical2 = dataSet['embarked'].equals(dataSet['embark_town'])
print(f"Are 'embarked' and 'embark_town' identical? {are_identical2}")

Are 'embarked' and 'embark_town' identical? False


The "embarked" and "embark_town" columns are not identical as stored. To see whether they carry the same information in different encodings, we compare their unique values.

In [23]:
# Check unique values in both columns
print("Unique values in 'embarked':", dataSet['embarked'].unique())
print("Unique values in 'embark_town':", dataSet['embark_town'].unique())

Unique values in 'embarked': ['S' 'C' 'Q' nan]
Unique values in 'embark_town': ['Southampton' 'Cherbourg' 'Queenstown' nan]
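
The unique values suggest that each code corresponds to exactly one town. One way to confirm this, sketched here on a hypothetical `toy` frame rather than the real data, is a cross-tabulation: if the encoding is one-to-one, every row and every column of the table has exactly one non-zero cell.

```python
import pandas as pd

# Hypothetical toy data using the same codes and town names as above
toy = pd.DataFrame({
    "embarked": ["S", "C", "Q", "S", None],
    "embark_town": ["Southampton", "Cherbourg", "Queenstown", "Southampton", None],
})

# Count how often each (code, town) pair occurs; rows with missing values are left out
pairs = pd.crosstab(toy["embarked"], toy["embark_town"])
print(pairs)

# A one-to-one encoding has exactly one non-zero cell per row and per column
nonzero = pairs > 0
one_to_one = bool((nonzero.sum(axis=1) == 1).all() and (nonzero.sum(axis=0) == 1).all())
print(f"One-to-one mapping: {one_to_one}")
```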


In [30]:
embarked_mapping = {'S':'Southampton','C':'Cherbourg','Q':'Queenstown'}
dataSet['embarked_mapped']=dataSet['embarked'].map(embarked_mapping)

In [31]:
are_identical=dataSet['embarked_mapped'].equals(dataSet['embark_town'])
print(f"Are 'embarked' and 'embark_town' identical? {are_identical}")


Are 'embarked' and 'embark_town' identical? True
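
The result is True even though both columns contain NaN. This is because `Series.equals` treats missing values in the same position as equal, while the `==` operator does not. A small sketch with two made-up Series (`left` and `right` are illustrative names) shows the difference:

```python
import numpy as np
import pandas as pd

# Two hypothetical columns with a missing value in the same position
left = pd.Series(["Southampton", np.nan, "Cherbourg"])
right = pd.Series(["Southampton", np.nan, "Cherbourg"])

# equals() counts NaN in matching positions as equal
same_with_equals = left.equals(right)

# == follows NaN != NaN, so the missing row breaks an all() check
same_with_eq = bool((left == right).all())

print(same_with_equals, same_with_eq)
```

This is why `equals` is the more convenient choice when comparing columns that may have gaps in the same rows.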


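We will drop the redundant columns: "alive", "embark_town", the temporary "embarked_mapped" helper, and "class", which duplicates "pclass".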

In [32]:
# Drop the redundant columns and the temporary helper column in one call
dataSet.drop(columns=['alive', 'embark_town', 'embarked_mapped', 'class'], inplace=True)

In [None]:
dataSet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   survived    891 non-null    int64   
 1   pclass      891 non-null    int64   
 2   sex         891 non-null    object  
 3   age         714 non-null    float64 
 4   sibsp       891 non-null    int64   
 5   parch       891 non-null    int64   
 6   fare        891 non-null    float64 
 7   embarked    889 non-null    object  
 8   who         891 non-null    object  
 9   adult_male  891 non-null    bool    
 10  deck        203 non-null    category
 11  alone       891 non-null    bool    
dtypes: bool(2), category(1), float64(2), int64(4), object(3)
memory usage: 65.7+ KB


In [39]:
dataSet.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [42]:
#Checking for missing values
dataSet.isnull().sum()

survived        0
pclass          0
sex             0
age           177
sibsp           0
parch           0
fare            0
embarked        2
who             0
adult_male      0
deck          688
alone           0
dtype: int64

In [43]:
dataSet.isnull().mean() * 100

survived       0.000000
pclass         0.000000
sex            0.000000
age           19.865320
sibsp          0.000000
parch          0.000000
fare           0.000000
embarked       0.224467
who            0.000000
adult_male     0.000000
deck          77.216611
alone          0.000000
dtype: float64

Since only about 20% of the values in the age column are null, we will drop the rows where age is missing.
However, since about 77% of the values in the deck column are null, we will drop the whole column.
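
The rule described above (drop a column when most of its values are missing, otherwise drop the affected rows) can be sketched as follows. The `toy` frame and the `MAX_MISSING_SHARE` cut-off of 0.5 are assumptions made for this illustration; the notebook itself does not state a threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'deck' is mostly missing, 'age' only partly
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "deck": [np.nan, "C", np.nan, np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
})

MAX_MISSING_SHARE = 0.5  # assumed cut-off, chosen only for this illustration

# Share of missing values per column (0.0 to 1.0)
missing_share = toy.isnull().mean()

# Columns above the cut-off are dropped entirely
sparse_columns = missing_share[missing_share > MAX_MISSING_SHARE].index.tolist()
reduced = toy.drop(columns=sparse_columns)

# Remaining gaps are handled by dropping the affected rows
complete = reduced.dropna()

print(sparse_columns)
print(complete.shape)
```

Dropping the sparse column first matters: calling `dropna()` while "deck" is still present would discard every row with a missing deck value as well.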

In [54]:
dataSet_clean = dataSet.drop(columns=['deck'])
dataSet_clean.head(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,who,adult_male,alone
0,0,3,male,22.0,1,0,7.25,S,man,True,False
1,1,1,female,38.0,1,0,71.2833,C,woman,False,False
2,1,3,female,26.0,0,0,7.925,S,woman,False,True
3,1,1,female,35.0,1,0,53.1,S,woman,False,False
4,0,3,male,35.0,0,0,8.05,S,man,True,True
5,0,3,male,,0,0,8.4583,Q,man,True,True
6,0,1,male,54.0,0,0,51.8625,S,man,True,True
7,0,3,male,2.0,3,1,21.075,S,child,False,False
8,1,3,female,27.0,0,2,11.1333,S,woman,False,False
9,1,2,female,14.0,1,0,30.0708,C,child,False,False


In [56]:
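# Drop rows that still have a missing value; copy() can avoid a SettingWithCopyWarning on later assignments
dataSet_clean1 = dataSet_clean.dropna().copy()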

In [58]:
dataSet_clean1.isnull().sum()

survived      0
pclass        0
sex           0
age           0
sibsp         0
parch         0
fare          0
embarked      0
who           0
adult_male    0
alone         0
dtype: int64

In [66]:
# Convert the age column from float to integer (no nulls remain, so no fill is needed)
dataSet_clean1['age'] = dataSet_clean1['age'].astype(int)

Null values have been dropped and the dataset is "clean"
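
Note that converting a float column with `astype(int)` truncates the fractional part, so an infant age such as 0.42 (the minimum in the `describe()` output above) becomes 0. The sketch below uses a made-up `ages` Series to show this, together with rounding as one possible alternative; it is not what the notebook does.

```python
import pandas as pd

# Hypothetical ages, including fractional values like those recorded for infants
ages = pd.Series([0.42, 0.92, 28.7, 80.0])

# astype(int) drops the fractional part (truncation toward zero)
truncated = ages.astype(int)

# Rounding first is one alternative if nearest-integer ages are preferred
rounded = ages.round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```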

In [67]:
df_cleaned = dataSet_clean1

In [68]:
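# Count duplicated rows (the count is not displayed because it is not the last expression in the cell)
df_cleaned.duplicated().sum()
# Drop duplicated rows, keeping the first occurrence of each
df_cleaned.drop_duplicates(inplace=True)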
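
To make the number of duplicates visible before removing them, the count can be printed explicitly. The following is a minimal sketch on a hypothetical `toy` frame, not the notebook's data:

```python
import pandas as pd

# Hypothetical toy frame with one repeated row
toy = pd.DataFrame({
    "pclass": [3, 3, 1, 3],
    "sex": ["male", "male", "female", "male"],
    "fare": [8.05, 8.05, 53.10, 7.25],
})

# duplicated() marks every repeat of an earlier row (the first occurrence stays False)
n_duplicates = int(toy.duplicated().sum())
print(f"Duplicated rows: {n_duplicates}")

# drop_duplicates() returns a new frame and keeps the first occurrence by default
deduplicated = toy.drop_duplicates()
print(deduplicated.shape)
```

Because the dataset has no passenger identifier column, two identical rows may still describe different passengers, so whether to drop them is a judgment call.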

In [69]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 671 entries, 0 to 890
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   survived    671 non-null    int64  
 1   pclass      671 non-null    int64  
 2   sex         671 non-null    object 
 3   age         671 non-null    int64  
 4   sibsp       671 non-null    int64  
 5   parch       671 non-null    int64  
 6   fare        671 non-null    float64
 7   embarked    671 non-null    object 
 8   who         671 non-null    object 
 9   adult_male  671 non-null    bool   
 10  alone       671 non-null    bool   
dtypes: bool(2), float64(1), int64(5), object(3)
memory usage: 53.7+ KB


Lastly, after data cleaning, we group the ages into categories for easier visualisation later.

In [70]:
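# Bin edges; with right=False each bin includes its left edge and excludes its right edge
bins = [0, 12, 18, 30, 50, 100]
labels = ['child', 'teen', 'young_adult', 'adult', 'senior']

df_cleaned['age_group'] = pd.cut(df_cleaned['age'], bins=bins, labels=labels, right=False)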
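
The effect of `right=False` is easiest to see at the bin edges. This sketch reuses the same `bins` and `labels` on a made-up `ages` Series whose values sit on and next to the edges:

```python
import pandas as pd

bins = [0, 12, 18, 30, 50, 100]
labels = ["child", "teen", "young_adult", "adult", "senior"]

# Hypothetical ages placed on and next to the bin edges
ages = pd.Series([0, 11, 12, 17, 18, 30, 50, 80])

# right=False makes each bin include its left edge and exclude its right edge
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"age": ages, "age_group": groups}))
```

With these settings an age of 12 falls into "teen" and an age of 0 into "child". With the default `right=True`, an age of 0 would fall outside the first bin and be assigned NaN.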

In [73]:
df_cleaned.describe(include='all')

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,who,adult_male,alone,age_group
count,671.0,671.0,671,671.0,671.0,671.0,671.0,671,671,671,671,671
unique,,,2,,,,,3,3,2,2,5
top,,,male,,,,,S,man,True,True,young_adult
freq,,,419,,,,,515,379,379,364,249
mean,0.417288,2.223547,,29.737705,0.539493,0.457526,35.836916,,,,,
std,0.493479,0.846827,,14.732431,0.94845,0.872926,54.193827,,,,,
min,0.0,1.0,,0.0,0.0,0.0,0.0,,,,,
25%,0.0,1.0,,20.0,0.0,0.0,8.05,,,,,
50%,0.0,2.0,,28.0,0.0,0.0,16.1,,,,,
75%,1.0,3.0,,39.0,1.0,1.0,35.0771,,,,,


In [76]:
df_cleaned.describe(include="object")

Unnamed: 0,sex,embarked,who
count,671,671,671
unique,2,3,3
top,male,S,man
freq,419,515,379


In [77]:
df_cleaned.describe(include="bool")

Unnamed: 0,adult_male,alone
count,671,671
unique,2,2
top,True,True
freq,379,364
