**Solutions to Titanic dataset**

**Goals:**

**Steps:**

0. Import libraries

1. Load the dataset

2. Clean the data

3. Use z-score to detect outliers

4. Hypothesis testing

    4.1. Extract the age values of survivors and non-survivors

    4.2. State H0 and H1

5. Exercise

---



# **0. Import libraries**

In [149]:
import pandas as pd
import numpy as np
import scipy.stats as stats

# **1. Load the dataset**

In [117]:
# load the dataset
df = pd.read_csv('/content/drive/MyDrive/1. Data Science/12. Solutions to titanic dataset/train.csv')
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


# **2. Clean the data**

In [118]:
# check the dimension
df.shape

(891, 12)

In [119]:
# check for null values
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [120]:
# calculate the mean of the age
mean_age = df['Age'].mean()
mean_age

29.69911764705882

*The average age is about 29.70.*

---





In [121]:
# replace the NaN values in Age with the mean age
# (assign the result back; inplace=True on a column selection is deprecated in newer pandas)
df['Age'] = df['Age'].fillna(mean_age)
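
A minimal sketch of the same mean imputation on a made-up column (the values are hypothetical, not taken from `train.csv`). It assigns the filled column back to the DataFrame instead of filling in place:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Age column
toy = pd.DataFrame({'Age': [20.0, np.nan, 40.0, np.nan, 30.0]})

# mean() skips NaN by default, so this is the mean of 20, 40 and 30
toy_mean_age = toy['Age'].mean()

# assign the filled column back instead of filling in place
toy['Age'] = toy['Age'].fillna(toy_mean_age)

print(toy_mean_age)               # mean of the non-missing values
print(toy['Age'].isnull().sum())  # number of NaN values left
```

Filling with the mean leaves the column mean unchanged but shrinks its variance, which is worth keeping in mind for the z-scores computed later.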

In [122]:
# drop cabin column
df.drop(columns=['Cabin'], inplace=True)

In [123]:
# find the mode of fare
mode_fare = df['Fare'].mode()[0]
mode_fare

8.05

*The mode of the fare column is 8.05.*

---



In [124]:
# check how many times that mode occurs
df['Fare'].value_counts()[mode_fare]

43

*The mode occurs 43 times.*

---



In [125]:
# replace any NaN values in Fare with the mode
# (Fare has no missing values in this dataset, so this changes nothing here)
df['Fare'] = df['Fare'].fillna(mode_fare)

In [126]:
# show the info of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


In [135]:
# replace the NaN values in Embarked with the mode
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

In [136]:
# show the info of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


*There are no NaN values left.*

---



# **3. Use z-score to detect outliers**

In [127]:
# calculate z-score of age
z_scores = np.abs(stats.zscore(df['Age']))
z_scores

0      0.592481
1      0.638789
2      0.284663
3      0.407926
4      0.407926
         ...   
886    0.207709
887    0.823344
888    0.000000
889    0.284663
890    0.177063
Name: Age, Length: 891, dtype: float64

In [128]:
# check outlier
outliers = df['Age'][z_scores > 3]
outliers

96     71.0
116    70.5
493    71.0
630    80.0
672    70.0
745    70.0
851    74.0
Name: Age, dtype: float64

*We found seven outliers, with ages ranging from 70 to 80.*
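
For reference, `stats.zscore` standardizes each value as (x - mean) / std and uses the population standard deviation (`ddof=0`) by default. The sketch below reproduces that by hand on hypothetical ages (made up for illustration) and applies the same threshold of 3:

```python
import numpy as np
import scipy.stats as stats

# Hypothetical ages: 29 typical values plus one extreme value
ages = np.array([30.0] * 29 + [90.0])

# manual z-score with the population standard deviation (ddof=0)
manual_z = (ages - ages.mean()) / ages.std(ddof=0)

# the same computation through scipy
scipy_z = stats.zscore(ages)

# flag values more than 3 standard deviations away from the mean
flagged = ages[np.abs(scipy_z) > 3]
print(flagged)
```

Only the extreme value is flagged here. Because an outlier also inflates the mean and standard deviation it is measured against, the z-score rule can miss outliers in small samples.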

---



In [129]:
# calculate Q1 and Q3
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)

# calculate IQR
IQR = Q3 - Q1

# define lower and upper bounds
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR

# keep only the rows whose Age lies inside the IQR bounds
df_cleaned = df[(df['Age'] > lower_bound) & (df['Age'] < upper_bound)]
df_cleaned

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.000000 | 1 | 0 | A/5 21171 | 7.2500 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.000000 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.000000 | 1 | 0 | 113803 | 53.1000 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.000000 | 0 | 0 | 373450 | 8.0500 | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.000000 | 0 | 0 | 211536 | 13.0000 | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.000000 | 0 | 0 | 112053 | 30.0000 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 29.699118 | 1 | 2 | W./C. 6607 | 23.4500 | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.000000 | 0 | 0 | 111369 | 30.0000 | C |


*We got 382 rows left.*
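
The IQR rule can be wrapped in a small helper so the bounds and the number of retained rows are easy to check. This is a sketch: `iqr_bounds` is a hypothetical helper name and the ages are made up for illustration.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical ages with one very low and one very high value
toy_ages = pd.Series([1.0, 22.0, 25.0, 28.0, 30.0, 32.0, 35.0, 38.0, 80.0])

low, high = iqr_bounds(toy_ages)

# keep values strictly inside the bounds, as in the cell above
kept = toy_ages[(toy_ages > low) & (toy_ages < high)]

print(low, high)
print(len(kept), 'of', len(toy_ages), 'values kept')
```

Calling `len()` on the filtered result is a quick way to confirm how many rows remain after the filter.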

---



# **4. Hypothesis testing**

## **4.1. Extract the age values of survivors and non-survivors**

In [155]:
# extract
survived_age = df.loc[df['Survived'] == 1, 'Age']
died_age = df.loc[df['Survived'] == 0, 'Age']

## **4.2. State H0 and H1**

H0: m1 = m2

H1: m1 ≠ m2

- m1 = mean age of survivors
- m2 = mean age of non-survivors

---





In [156]:
# calculate the t-statistic and p-value
t_stat, p_value = stats.ttest_ind(survived_age, died_age)
print(f't-statistic: {t_stat:.2f}')
print(f'p-value: {p_value:.2f}')

t-statistic: -2.09
p-value: 0.04


In [163]:
# interpret the result at the 5% significance level
if p_value <= 0.05:
    print('Reject H0 and conclude that there is a statistically significant difference in the mean age of survivors and non-survivors.')
else:
    print('Fail to reject H0: there is not enough evidence of a difference in the mean age of survivors and non-survivors.')

Reject H0 and conclude that there is a statistically significant difference in the mean age of survivors and non-survivors.
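
By default `stats.ttest_ind` assumes the two groups have equal variances. When that assumption is doubtful, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below compares the two on hypothetical samples (made up for illustration) and checks the Welch statistic by hand:

```python
import numpy as np
import scipy.stats as stats

# Hypothetical samples with visibly different spreads
group_a = np.array([22.0, 25.0, 27.0, 30.0, 31.0, 35.0])
group_b = np.array([28.0, 33.0, 36.0, 40.0, 45.0, 52.0, 60.0])

# Student's t-test (pooled variance) and Welch's t-test (separate variances)
student = stats.ttest_ind(group_a, group_b)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Welch statistic by hand: mean difference over the unpooled standard error
standard_error = np.sqrt(group_a.var(ddof=1) / len(group_a)
                         + group_b.var(ddof=1) / len(group_b))
welch_by_hand = (group_a.mean() - group_b.mean()) / standard_error

print(f'Student: t={student.statistic:.2f}, p={student.pvalue:.4f}')
print(f'Welch:   t={welch.statistic:.2f}, p={welch.pvalue:.4f}')
```

If both versions lead to the same decision, the equal-variance assumption is not driving the conclusion.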


In [146]:
# Calculate an effect size from the t-statistic: t / sqrt(degrees of freedom).
# For small |t| this approximates the correlation-type effect size r = t / sqrt(t^2 + df).
effect_size = t_stat / np.sqrt(len(survived_age) + len(died_age) - 2)

# Print the effect size
print(f'The effect size is {effect_size:.2f}.')

The effect size is -0.07.
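
Cohen's d, the difference in means divided by the pooled standard deviation, is another widely used effect size for two independent groups. The sketch below uses a hypothetical helper `cohens_d` and made-up samples. It also checks the identity d = t * sqrt(1/n1 + 1/n2), which holds for the equal-variance t-test:

```python
import numpy as np
import scipy.stats as stats

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * np.var(a, ddof=1)
                  + (n2 - 1) * np.var(b, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

# Hypothetical samples, for illustration only
group_a = np.array([22.0, 25.0, 27.0, 30.0, 31.0, 35.0])
group_b = np.array([28.0, 33.0, 36.0, 40.0, 45.0, 52.0, 60.0])

d = cohens_d(group_a, group_b)

# the same value recovered from the equal-variance t-statistic
t_toy, _ = stats.ttest_ind(group_a, group_b)
d_from_t = t_toy * np.sqrt(1 / len(group_a) + 1 / len(group_b))

print(f"Cohen's d: {d:.2f}")
```

The sign of d follows the order of the groups, so a negative value means the first group has the lower mean.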


# **5. Exercise**

In [152]:
# randomly select 50 people from the dataset
sample = df.sample(n=50, random_state=42)
sample.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 709 | 710 | 1 | 3 | Moubarek, Master. Halim Gonios ("William George") | male | 29.699118 | 1 | 1 | 2661 | 15.2458 | C |
| 439 | 440 | 0 | 2 | Kvillner, Mr. Johan Henrik Johannesson | male | 31.0 | 0 | 0 | C.A. 18723 | 10.5 | S |
| 840 | 841 | 0 | 3 | Alhomaki, Mr. Ilmari Rudolf | male | 20.0 | 0 | 0 | SOTON/O2 3101287 | 7.925 | S |
| 720 | 721 | 1 | 2 | Harper, Miss. Annie Jessie "Nina" | female | 6.0 | 0 | 1 | 248727 | 33.0 | S |
| 39 | 40 | 1 | 3 | Nicola-Yarred, Miss. Jamila | female | 14.0 | 1 | 0 | 2651 | 11.2417 | C |


In [153]:
# extract the fare values of survivors and non-survivors
# (build the mask from the sample itself rather than from the full df)
survived_fare = sample.loc[sample['Survived'] == 1, 'Fare']
died_fare = sample.loc[sample['Survived'] == 0, 'Fare']

In [154]:
# calculate t-statistic and p-value
t_stat, p_value = stats.ttest_ind(survived_fare, died_fare)

H0: m1 = m2

H1: m1 ≠ m2

- m1 = mean fare of survivors

- m2 = mean fare of non-survivors

---



In [168]:
# define the above statements
h0 = 'there is no difference in the mean fare of survivors and non-survivors'
h1 = 'there is a difference in the mean fare of survivors and non-survivors'

In [162]:
# show the result
print(f't-statistic: {t_stat:.2f}')
print(f'p-value: {p_value:.4f}')

t-statistic: -2.09
p-value: 0.0372


In [171]:
# compare the p-value with the significance level
if p_value <= 0.05:
    print(f'Reject H0 and conclude that {h1}.')
else:
    print(f'Fail to reject H0: there is not enough evidence to conclude that {h1}.')

Reject H0 and conclude that there is a difference in the mean fare of survivors and non-survivors.
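
Notebook cells can be run in any order, so reusing `t_stat` and `p_value` for both the age test and the fare test means the printed values reflect whichever test ran last. One way to avoid that is to keep each result under its own name. The sketch below uses a hypothetical helper `compare_group_means` and made-up data:

```python
import pandas as pd
import scipy.stats as stats

def compare_group_means(data, group_col, value_col):
    """Two-sample t-test of value_col between rows where group_col is 1 and 0."""
    group_1 = data.loc[data[group_col] == 1, value_col]
    group_0 = data.loc[data[group_col] == 0, value_col]
    t, p = stats.ttest_ind(group_1, group_0)
    return t, p

# Hypothetical passengers, for illustration only
toy = pd.DataFrame({
    'Survived': [1, 1, 1, 1, 0, 0, 0, 0],
    'Age':      [20.0, 25.0, 30.0, 35.0, 30.0, 35.0, 40.0, 45.0],
    'Fare':     [50.0, 60.0, 70.0, 80.0, 8.0, 9.0, 10.0, 11.0],
})

# each test keeps its own result, so neither overwrites the other
t_age, p_age = compare_group_means(toy, 'Survived', 'Age')
t_fare, p_fare = compare_group_means(toy, 'Survived', 'Fare')

print(f'age:  t={t_age:.2f}, p={p_age:.4f}')
print(f'fare: t={t_fare:.2f}, p={p_fare:.4f}')
```

Printing each result in the same cell that computes it also keeps the output tied to the right test.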
