# ANALYSIS OF TITANIC DATASET USING PANDAS

# Introduction

#### The Titanic sank on April 15, 1912, during her maiden voyage, after colliding with an iceberg. It was one of the most dramatic events of the twentieth century: 1502 of the 2224 passengers and crew died. The data set investigated in the following sections contains detailed information about 891 passengers.

### The dataset consists of the following parameters describing the passengers

- class: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
- survival: Indicator of whether the passenger survived (0 = No; 1 = Yes)
- name: Passenger name, a field rich in information as it contains the title and family name
- sex: Sex of the passenger (male/female)
- age: Age in years; a significant portion of these values is missing
- sibsp: Number of siblings/spouses on the ship
- parch: Number of parents/children on the ship
- ticket: Ticket number of the passenger
- fare: Passenger fare (in British pounds)
- cabin: Cabin number (does the location of the cabin influence the chances of survival?)
- embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- boat: Lifeboat; many values are missing (not part of the `train.csv` file used below)
- body: Body identification number (not part of the `train.csv` file used below)
- home.dest: Home/destination (not part of the `train.csv` file used below)


## Analysis of the Data using Pandas


In [1]:
import pandas as pd

#### First, import the dataset with read_csv

In [2]:
ds = pd.read_csv("train.csv")

In [3]:
ds.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
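As a side note, `read_csv` can restrict the columns at load time through its `usecols` parameter, which is an alternative to dropping columns afterwards. The sketch below is only illustrative: it reads the five rows shown above from an in-memory string instead of `train.csv`, and the names `sample_csv`, `wanted` and `sample` are made up for this example.

```python
import io
import pandas as pd

# The five rows printed by ds.head() above, embedded as text for a self-contained demo
sample_csv = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
'''

# usecols keeps only the listed columns while parsing
wanted = ["Survived", "Pclass", "Sex", "Age", "SibSp"]
sample = pd.read_csv(io.StringIO(sample_csv), usecols=wanted)

print(sample.shape)
print(sample.columns.tolist())
```

With the real file, the same call would take `"train.csv"` in place of the `StringIO` object.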


## Getting information about the dataset using the info() function

In [4]:
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Summarizing the numerical columns using the describe() function

In [5]:
ds.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [6]:
ds.Age.describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

## Summary of the categorical data

In [7]:
print(ds.describe(include=['O']))

                              Name   Sex Ticket Cabin Embarked
count                          891   891    891   204      889
unique                         891     2    681   147        3
top     Navratil, Master. Michel M  male   1601    G6        S
freq                             1   577      7     4      644


## Finding the Null values

In [8]:
ds.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
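The raw counts are easier to judge as a share of the 891 rows. A minimal sketch, using only the three non-zero counts printed above (the names `missing`, `n_rows` and `missing_pct` are illustrative):

```python
import pandas as pd

# Missing-value counts copied from the output above; 891 is the number of rows
missing = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891

# Share of missing values per column, in percent
missing_pct = (missing / n_rows * 100).round(2)
print(missing_pct)

# On the real DataFrame the same figures would come from: ds.isnull().mean() * 100
```

Roughly a fifth of the ages and more than three quarters of the cabin entries are missing, which is why the two columns are treated differently below.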

## Dropping the unnecessary columns.

The dataset contains columns that are not useful here: PassengerId, Ticket and Cabin. So we drop them.


In [9]:
ds.drop(['PassengerId', 'Ticket', 'Cabin'], axis=1, inplace=True)

print(ds.columns.values)

['Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch' 'Fare' 'Embarked']


### Checking the null values again

In [10]:
ds.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

Age and Embarked still contain missing values.

### Now Cleaning the Data.
- To avoid fractional ages, round them up to the next integer using the ceil() function

In [11]:
ds['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [12]:
import numpy as np

In [13]:
ds["Age"]= np.ceil(ds["Age"])

In [14]:
ds["Age"]

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
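Note that `np.ceil` always rounds up, which is not the same as rounding to the closest integer, and that it leaves `NaN` untouched. A small sketch on made-up ages (the Series `ages` is hypothetical) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a fractional infant age and a missing value
ages = pd.Series([0.42, 28.5, 35.0, np.nan])

rounded_up = np.ceil(ages)   # always rounds towards +infinity
nearest = ages.round()       # rounds to the nearest integer (ties go to the even number)

print(rounded_up.tolist())
print(nearest.tolist())
```

Which of the two is appropriate depends on the analysis; the notebook uses `np.ceil`, which is why the mean age shifts slightly from 29.699 to 29.714 afterwards.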

## Filling the missing values

In [15]:
ds.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

##### There are 177 missing values in Age. These are filled with the mean age.

In [16]:
age_mean = ds['Age'].mean()
age_mean

29.714285714285715

In [17]:
# Fill only the Age column; the text columns (Name, Sex, Embarked) have no mean
ds['Age'] = ds['Age'].fillna(age_mean)

In [18]:
ds.isnull().sum()

Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    2
dtype: int64

##### Check if all values have been replaced

In [19]:
print("Remaining NaN values: {}".format(ds['Age'].isnull().sum()))

Remaining NaN values: 0
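Imputing one column at a time keeps the operation explicit and avoids computing a mean over text columns, which recent pandas versions refuse to do unless `numeric_only=True` is passed to `DataFrame.mean()`. A minimal sketch on a made-up frame (`toy` is hypothetical, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and one numeric column containing a gap
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 30.0],
})

# Impute a single column explicitly; the text column is never touched
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

print(toy["Age"].tolist())
```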



##### The Embarked column is missing only 2 values. To replace these, we use the most common port of embarkation, which is by far Southampton (644 of 889 known values, about 72 %).

###### Replacing the missing values with the code 'S'

In [22]:
ds['Embarked'] = ds['Embarked'].fillna('S')

In [23]:
ds.isnull().sum()

Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

Now the missing values are filled
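The fill value can also be derived from the data instead of being typed by hand, which avoids mistakes such as a wrong letter case in the port code. A sketch on a made-up Series (`ports` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical port codes with two gaps; 'S' is the most frequent value
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

most_common = ports.mode()[0]            # mode() returns a Series; take its first entry
ports_filled = ports.fillna(most_common)

print(most_common)
print(ports_filled.value_counts().to_dict())
```

Applied to the notebook's data, the equivalent call would be `ds['Embarked'].fillna(ds['Embarked'].mode()[0])`.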

## Selection of Columns

In [44]:
# Keep only the columns used in the analysis below
columns = ['Survived', 'Sex', 'Pclass', 'SibSp', 'Age']
ds_list = ds[columns].copy()
# Age was already imputed above, so this is only a safeguard for the numeric columns
ds_list = ds_list.fillna(ds_list.mean(numeric_only=True))

In [45]:
ds_list

Unnamed: 0,Survived,Sex,Pclass,SibSp,Age
0,0,male,3,1,22.000000
1,1,female,1,1,38.000000
2,1,female,3,0,26.000000
3,1,female,1,1,35.000000
4,0,male,3,0,35.000000
...,...,...,...,...,...
886,0,male,2,0,27.000000
887,1,female,1,0,19.000000
888,0,female,3,1,29.714286
889,1,male,1,0,26.000000


In [46]:
ds_list["Age"]= np.ceil(ds_list['Age'])
ds_list

Unnamed: 0,Survived,Sex,Pclass,SibSp,Age
0,0,male,3,1,22.0
1,1,female,1,1,38.0
2,1,female,3,0,26.0
3,1,female,1,1,35.0
4,0,male,3,0,35.0
...,...,...,...,...,...
886,0,male,2,0,27.0
887,1,female,1,0,19.0
888,0,female,3,1,30.0
889,1,male,1,0,26.0


# Problem Statements

## 1) How many passengers survived and how many died (in number and %)?

In [47]:
sex_survival = ds[['Sex', 'Survived']].groupby('Sex', as_index= False).mean()
sex_survival

Unnamed: 0,Sex,Survived
0,female,0.742038
1,male,0.188908


In [48]:
total_survived= ds.Survived.sum()
print (f"Total Survived:  {total_survived}")

Total Survived:  342


In [51]:
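total_psgr = len(ds)                  # 891 passengers
total_survived = ds.Survived.sum()    # 342 survivors
# np.ceil rounds the percentage up to the next whole number
in_percent = np.ceil(total_survived / total_psgr * 100)
print(f"Total survived: {in_percent} %")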

Total survived: 39.0 %


In [52]:
total_male=ds_list.Sex == 'male'
survived_male = (ds_list['Survived'] == 1) & (ds_list.Sex == 'male')
s_male = survived_male.sum()
dead_male = (ds_list['Survived'] == 0) & (ds_list.Sex == 'male')
d_male= dead_male.sum()


print(f" Survived male: ", s_male)
print(f" Dead male: ", d_male)
print(f' Total male',total_male.sum())

print("-" *50)

total_female = ds_list.Sex == 'female'
survived_female = (ds_list['Survived'] == 1) & (ds_list.Sex == 'female')
s_female = survived_female.sum()
dead_female=(ds_list['Survived'] == 0) & (ds_list.Sex == 'female')
d_female = dead_female.sum()


print(f" Survived Female : ", s_female)
print(f" Dead Female: ", d_female)
print(f' Total Female', total_female.sum())

print("-" *50)

total_survived2=s_male+s_female
total_died2=d_male+ d_female

print(f"Total Survived: {total_survived2}")
print(f"Total Died: {total_died2}")

 Survived male:  109
 Dead male:  468
 Total male 577
--------------------------------------------------
 Survived Female :  233
 Dead Female:  81
 Total Female 314
--------------------------------------------------
Total Survived: 342
Total Died: 549
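The same tallies can be produced in one step with `pd.crosstab`. The sketch below rebuilds a row-per-passenger frame from the four counts printed above, so it runs without `train.csv`; the names `counts`, `people`, `table` and `rates` are illustrative.

```python
import pandas as pd

# Rebuild a row-per-passenger frame from the four counts printed above
counts = {("male", 1): 109, ("male", 0): 468, ("female", 1): 233, ("female", 0): 81}
rows = [(sex, survived) for (sex, survived), n in counts.items() for _ in range(n)]
people = pd.DataFrame(rows, columns=["Sex", "Survived"])

# Counts per sex and outcome, with row and column totals
table = pd.crosstab(people["Sex"], people["Survived"], margins=True)
print(table)

# Survival rate within each sex (each row sums to 1)
rates = pd.crosstab(people["Sex"], people["Survived"], normalize="index")
print(rates.round(3))
```

The within-sex rates agree with the `groupby` result at the start of this section (about 0.742 for women and 0.189 for men).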


## Total Survived and Dead Passengers 

Collecting the counts above in a DataFrame:

In [53]:
data1= { "Passenger": ["Male", "Female"], "Total Survived":[s_male,s_female ], "Total Deaths": [d_male,d_female ]}
index= [1,2]
ds1=pd.DataFrame(data1,index)
ds1

Unnamed: 0,Passenger,Total Survived,Total Deaths
1,Male,109,468
2,Female,233,81


## Survived and dead passengers in percent (%)

Finding the percentage of survived and dead passengers, relative to all 891 passengers and rounded up to the next whole number

In [54]:
# Each share is relative to all passengers (total_psgr) and rounded up with np.ceil
psm = np.ceil(s_male / total_psgr * 100)
print(f"Total Survived Male: {psm} %")

pdm = np.ceil(d_male / total_psgr * 100)
print(f"Total Dead Male: {pdm} %")

psf = np.ceil(s_female / total_psgr * 100)
print(f"Total Survived female: {psf} %")

pdf = np.ceil(d_female / total_psgr * 100)
print(f"Total Dead female: {pdf} %")

Total Survived Male: 13.0 %
Total Dead Male: 53.0 %
Total Survived female: 27.0 %
Total Dead female: 10.0 %


### Representing in Data Frame:

In [55]:
data= { "Passenger": ["Survived Male", "Dead Male","Survived Female", "Dead Female" ], "Percentage (%)": [psm,pdm,psf,pdf]}
index= [1,2,3,4]
ds2=pd.DataFrame(data,index)
ds2

Unnamed: 0,Passenger,Percentage (%)
1,Survived Male,13.0
2,Dead Male,53.0
3,Survived Female,27.0
4,Dead Female,10.0


## 2) Did survival have anything to do with gender? (Survival vs Gender)

In [56]:
data3 = {
    "Passenger": ["Survived Male", "Dead Male", "Survived Female", "Dead Female"],
    "Total": [s_male, d_male, s_female, d_female],
    "Percentage %": [psm, pdm, psf, pdf],
}
index = [1, 2, 3, 4]
ds3 = pd.DataFrame(data3, index)
ds3

Unnamed: 0,Passenger,Total,Percentage %
1,Survived Male,109,13.0
2,Dead Male,468,53.0
3,Survived Female,233,27.0
4,Dead Female,81,10.0


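### From the above analysis it can be concluded that the survival rate of females (about 74 %) is greater than that of males (about 19 %). These rates come from the groupby on Sex shown earlier; the percentages in the table are shares of all 891 passengers.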

## 3) Does survival have anything to do with the class in which the passenger was traveling, and who survived more: 1st class, 2nd class or 3rd class passengers?


### Pclass

In [57]:
# Class 1: survivors and deaths, plus their share of all passengers (rounded up)
survival_in_Class1 = (ds_list['Survived'] == 1) & (ds_list['Pclass'] == 1)
s_c1 = survival_in_Class1.sum()
print(f"Survival in class 1: {s_c1}")
ps_c1 = np.ceil(s_c1 / total_psgr * 100)
dead_in_Class1 = (ds_list['Survived'] == 0) & (ds_list['Pclass'] == 1)
d_c1 = dead_in_Class1.sum()
print(f"Death in class 1: {d_c1}")
pd_c1 = np.ceil(d_c1 / total_psgr * 100)

print("-" * 50)

# Class 2
survival_in_Class2 = (ds_list['Survived'] == 1) & (ds_list['Pclass'] == 2)
s_c2 = survival_in_Class2.sum()
ps_c2 = np.ceil(s_c2 / total_psgr * 100)
print(f"Survival in class 2: {s_c2}")
dead_in_Class2 = (ds_list['Survived'] == 0) & (ds_list['Pclass'] == 2)
d_c2 = dead_in_Class2.sum()
pd_c2 = np.ceil(d_c2 / total_psgr * 100)
print(f"Death in class 2: {d_c2}")

print("-" * 50)

# Class 3
survival_in_Class3 = (ds_list['Survived'] == 1) & (ds_list['Pclass'] == 3)
s_c3 = survival_in_Class3.sum()
ps_c3 = np.ceil(s_c3 / total_psgr * 100)
print(f"Survival in class 3: {s_c3}")

dead_in_Class3 = (ds_list['Survived'] == 0) & (ds_list['Pclass'] == 3)
d_c3 = dead_in_Class3.sum()
pd_c3 = np.ceil(d_c3 / total_psgr * 100)

print(f"Death in class 3: {d_c3}")

Survival in class 1: 136
Death in class 1: 80
--------------------------------------------------
Survival in class 2: 87
Death in class 2: 97
--------------------------------------------------
Survival in class 3: 119
Death in class 3: 372


### Pclass 1 passengers survived the most (136 of 216, about 63 %)


In [58]:
data4 = {
    "Pclass": ["Class 1", "Class 2", "class 3"],
    "Survived": [s_c1, s_c2, s_c3],
    "Dead": [d_c1, d_c2, d_c3],
    "Survived(%)": [ps_c1, ps_c2, ps_c3],
    "Dead(%)": [pd_c1, pd_c2, pd_c3],
}
index = [1, 2, 3]
ds4 = pd.DataFrame(data4, index)
ds4

Unnamed: 0,Pclass,Survived,Dead,Survived(%),Dead(%)
1,Class 1,136,80,16.0,9.0
2,Class 2,87,97,10.0,11.0
3,class 3,119,372,14.0,42.0
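The percentages in this table are shares of all 891 passengers, so they do not show how likely survival was within a class. A short sketch that derives the within-class rate from the counts above (`by_class` and the added column names are illustrative):

```python
import pandas as pd

# Survivor and death counts per class, copied from the table above
by_class = pd.DataFrame(
    {"Survived": [136, 87, 119], "Dead": [80, 97, 372]},
    index=["Class 1", "Class 2", "Class 3"],
)

# Rate within each class = survivors / passengers in that class
by_class["Total"] = by_class["Survived"] + by_class["Dead"]
by_class["Survival rate (%)"] = (by_class["Survived"] / by_class["Total"] * 100).round(2)
print(by_class)
```

Seen this way, the gap between the classes is much larger than the shares of the total suggest: roughly 63 %, 47 % and 24 % for classes 1, 2 and 3.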


## 4) Which age group survived the most: kids (0-12), teenagers (13-19), youngsters (20-30), middle-aged people (31-50) or elders (over 50)?

In [72]:
kidds = (ds.Survived == 1) & (ds.Age <12)
kids= kidds.sum()
print(f"Total kids( 0-12 years) survived :  {kids}" )
print("-" *50)

teen_agrs = (ds.Survived == 1) & (ds.Age < 19) & (ds.Age > 13)
teen_agers= teen_agrs.sum()
print(f"Total Teen agers ( 13-19 years) survived : {teen_agers} ")
print("-" *50)

young_agers = (ds.Survived == 1) &  (ds.Age < 30) & (ds.Age > 20)
youngsters = young_agers.sum() 
print(f"Total Survived Youngsters ( 20-30 years) are:  {youngsters}")

print("-" *50)

mid_age = (ds.Survived == 1) & (ds.Age < 50) & (ds.Age > 31)
middle_age = mid_age.sum()

print(f"Total Survived Middle age (31-50 years) people are:  {middle_age}")

print("-" *50)

elder_age = (ds.Survived == 1) & (ds.Age > 50)
elders = elder_age.sum()
 
print(f"Total Survived elder(greater than 50 years) age people are:  {elders}")

print("-" *50)

 

Total kids( 0-12 years) survived :  39
--------------------------------------------------
Total Teen agers ( 13-19 years) survived : 28 
--------------------------------------------------
Total Survived Youngsters ( 20-30 years) are:  126
--------------------------------------------------
Total Survived Middle age (31-50 years) people are:  89
--------------------------------------------------
Total Survived elder(greater than 50 years) age people are:  22
--------------------------------------------------
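The strict comparisons in the cell above leave out passengers whose age falls exactly on a boundary (12, 13, 19, 20, 30, 31 or 50). One way to avoid such gaps is to bin the ages with `pd.cut`. The sketch below uses made-up passengers (`toy`), and the choice to put age 50 in the middle-age group is an assumption, since the question lists 50 in two groups.

```python
import pandas as pd

# Hypothetical passengers, chosen to include the boundary ages 12, 13, 19, 20, 30, 31 and 50
toy = pd.DataFrame({
    "Age":      [5, 12, 13, 19, 20, 30, 31, 50, 51, 70],
    "Survived": [1,  1,  0,  1,  0,  1,  1,  0,  0,  1],
})

# Right-inclusive bins: (0, 12], (12, 19], (19, 30], (30, 50], (50, 120]
bins = [0, 12, 19, 30, 50, 120]
labels = ["kids", "teenagers", "youngsters", "middle age", "elders"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Survivors, group size and survival rate per age group
summary = toy.groupby("AgeGroup", observed=False)["Survived"].agg(["sum", "count", "mean"])
print(summary)
```

The `mean` column is the survival rate within each group, which is the quantity needed to compare chances of survival between groups of different size.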


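#### The age group analysis shows that youngsters (20-30 years) account for the largest number of survivors (126) and elders (over 50) for the smallest (22). These are counts, not survival rates, so they also reflect how many passengers each group contains. Two caveats apply: given the steps shown above, the 177 ages imputed with the mean (about 29.7) all fall into the 20-30 group, and the strict comparisons exclude the boundary ages, so the five counts add up to 304 of the 342 survivors.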

## 5) Which people survived the most: those who were traveling with family or those who were traveling alone?

In [73]:
family_survival = ds[['SibSp', 'Survived']].groupby('SibSp',as_index=False).mean()
family_survival

Unnamed: 0,SibSp,Survived
0,0,0.345395
1,1,0.535885
2,2,0.464286
3,3,0.25
4,4,0.166667
5,5,0.0
6,8,0.0
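`SibSp` covers only siblings and spouses. To compare passengers travelling alone with those travelling with any relatives, `SibSp` and `Parch` can be combined. The sketch below uses made-up passengers; `FamilySize` and `Alone` are illustrative names, not columns of the dataset.

```python
import pandas as pd

# Hypothetical passengers; SibSp and Parch follow the column meanings listed in the introduction
toy = pd.DataFrame({
    "SibSp":    [0, 1, 0, 2, 0, 1],
    "Parch":    [0, 0, 2, 1, 0, 1],
    "Survived": [0, 1, 1, 0, 1, 1],
})

# FamilySize and Alone are illustrative names, not columns of the dataset
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1   # +1 counts the passenger
toy["Alone"] = toy["FamilySize"] == 1

# Survival rate for passengers travelling alone versus with relatives
rate_by_alone = toy.groupby("Alone")["Survived"].mean()
print(rate_by_alone)
```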


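#### The above analysis shows that the survival rate is highest for passengers with one sibling/spouse aboard (about 54 %), lower for passengers with none (about 35 %), and decreases steadily as the number grows from 2 to 4, falling to zero for 5 or more.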

## 6) Finally, compare survival vs gender, class, age group, and family

In [74]:
# Every percentage below is a share of all passengers (total_psgr), rounded up with np.ceil
print('Survivors by gender (% of all passengers)')
print('Male survivors:', np.ceil(s_male / total_psgr * 100))
print('Female survivors:', np.ceil(s_female / total_psgr * 100))
print('-' * 50)

print('Survivors by passenger class (% of all passengers)')
print('Class 1 survivors:', ps_c1)
print('Class 2 survivors:', ps_c2)
print('Class 3 survivors:', ps_c3)
print('-' * 50)

print('Survivors by age group (% of all passengers)')
print('Kids (0-12 years):', np.ceil(kids / total_psgr * 100))
print('Teenagers (13-19 years):', np.ceil(teen_agers / total_psgr * 100))
print('Youngsters (20-30 years):', np.ceil(youngsters / total_psgr * 100))
print('Middle age (31-50 years):', np.ceil(middle_age / total_psgr * 100))
print('Elders (over 50 years):', np.ceil(elders / total_psgr * 100))
print('-' * 50)

print('Survivors by siblings/spouses aboard (% of all passengers)')
# SibSp == 1 selects exactly one sibling/spouse aboard; SibSp > 1 selects more than one
one_sibsp = ((ds_list['Survived'] == 1) & (ds_list['SibSp'] == 1)).sum()
several_sibsp = ((ds_list['Survived'] == 1) & (ds_list['SibSp'] > 1)).sum()
print('More than one sibling/spouse:', np.ceil(several_sibsp / total_psgr * 100))
print('Exactly one sibling/spouse:', np.ceil(one_sibsp / total_psgr * 100))

Survivors by gender (% of all passengers)
Male survivors: 13.0
Female survivors: 27.0
--------------------------------------------------
Survivors by passenger class (% of all passengers)
Class 1 survivors: 16.0
Class 2 survivors: 10.0
Class 3 survivors: 14.0
--------------------------------------------------
Survivors by age group (% of all passengers)
Kids (0-12 years): 5.0
Teenagers (13-19 years): 4.0
Youngsters (20-30 years): 15.0
Middle age (31-50 years): 10.0
Elders (over 50 years): 3.0
--------------------------------------------------
Survivors by siblings/spouses aboard (% of all passengers)
More than one sibling/spouse: 3.0
Exactly one sibling/spouse: 13.0


## Conclusions of the Analysis:
- From this analysis of the Titanic dataset it is concluded that women had a much higher chance of survival than men (about 74 % versus 19 %).
- The class (socio-economic status) of the passengers played a vital role in their survival: class 1 had the most survivors (136 of its 216 passengers).
- Among the age groups, youngsters (20-30) account for more survivors than any other group in absolute numbers.
- Survival rates are highest for passengers with one sibling/spouse aboard (about 54 %), decline as that number grows, and are zero for 5 or more.
- Passengers with no siblings/spouses aboard survived at a lower rate (about 35 %) than those with one or two (about 54 % and 46 %).

Thank you