First, we import NumPy and pandas.


In [37]:
import numpy as np
import pandas as pd


Now we load the data from the CSV file (titanic.csv) and look at its first five rows.

In [38]:
df=pd.read_csv('/content/titanic.csv')  # read_csv already returns a DataFrame
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Let's see how many rows and columns are present in our dataset

In [39]:
df.shape


(891, 12)

Now we will see how many null values are present in different columns in our data set


In [40]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

We want to know which columns have null values in more than 35% of the rows.


In [41]:
x=df.isnull().sum()
drop_col=x[x>(35/100)*df.shape[0]]
print(drop_col)

Cabin    687
dtype: int64
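
The same check can also be written in terms of fractions rather than counts. The sketch below uses a small made-up frame (the names `toy`, `null_fraction`, and `mostly_null` are illustrative, not part of the notebook) and assumes the same 35% threshold:

```python
import numpy as np
import pandas as pd

# Made-up toy frame (not the Titanic data) with a known share of missing values.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],   # 50% missing
    "b": [1.0, 2.0, 3.0, np.nan],      # 25% missing
    "c": ["x", "y", "z", "w"],         # 0% missing
})

# isnull() gives booleans; their column-wise mean is the fraction of nulls.
null_fraction = toy.isnull().mean()

# Keep only the columns whose null fraction exceeds the 35% threshold.
threshold = 0.35
mostly_null = null_fraction[null_fraction > threshold].index.tolist()
print(mostly_null)
```

Working with fractions avoids multiplying the threshold by the row count by hand.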


Let's see which column we are going to drop

In [42]:
drop_col.index

Index(['Cabin'], dtype='object')

Now we will drop the column in which more than 35% of the values are null.

In [43]:
df.drop(drop_col.index,axis=1,inplace=True)
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         2
dtype: int64

Now we will fill the null values in the numeric columns with the column mean. Only Age is affected, so its nulls are replaced by the mean age.

In [44]:
df.fillna(df.mean(numeric_only=True),inplace=True)  # numeric_only=True skips string columns such as Name
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       2
dtype: int64

The Embarked column still has 2 null values because it stores strings, and a mean cannot be computed for strings. Now we will look at a summary of the Embarked column using describe().

In [45]:
df['Embarked'].describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

Now we will replace the null values of the Embarked column with 'S', because the summary above shows that S is the most frequent value, with a frequency of 644 out of 889 non-null entries.

In [46]:
df['Embarked']=df['Embarked'].fillna('S')  # assign back instead of inplace=True on a column selection
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
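
Rather than reading the most frequent value off the describe() output and typing it in, it can be looked up with `mode()`. A minimal sketch on a made-up column (the frame `toy` and the column name `port` are hypothetical, not from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up toy column (not the Titanic data) with two missing entries.
toy = pd.DataFrame({"port": ["S", "C", "S", np.nan, "Q", np.nan]})

# mode() returns a Series of the most frequent value(s); take the first one.
most_common = toy["port"].mode()[0]

# Assign the result back to the column rather than filling in place.
toy["port"] = toy["port"].fillna(most_common)
print(most_common, toy["port"].isnull().sum())
```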

Now we will look at the correlation between the numeric columns. A coefficient with a larger magnitude (closer to 1 or -1) indicates a stronger linear relationship between the two columns.

In [47]:
df.corr(numeric_only=True)  # string columns such as Name are excluded

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.033207,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.069809,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.331339,0.083081,0.018443,-0.5495
Age,0.033207,-0.069809,-0.331339,1.0,-0.232625,-0.179191,0.091566
SibSp,-0.057527,-0.035322,0.083081,-0.232625,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.179191,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.091566,0.159651,0.216225,1.0


In the table above, the strongest relationship is between Fare and Pclass (-0.55). The sign is negative because class 1 is the highest class, so a larger fare goes with a smaller Pclass number, that is, a better class.
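
One way to see where the negative sign comes from is to average the fare within each class. The sketch below uses made-up rows (not the Titanic data), so only the direction of the relationship is meaningful, not the numbers:

```python
import pandas as pd

# Made-up rows (not the Titanic data): class 1 pays the most, class 3 the least.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare":   [80.0, 60.0, 20.0, 30.0, 8.0, 7.0],
})

# Mean fare per class: it falls as the class number rises.
mean_fare = toy.groupby("Pclass")["Fare"].mean()

# Pearson correlation between the two columns; negative for data like this.
corr = toy["Pclass"].corr(toy["Fare"])
print(mean_fare)
print(corr < 0)
```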

If we add the two columns SibSp and Parch, we get the family size, that is, the number of relatives travelling with each passenger. We add a column containing this sum and drop the two original columns.

In [48]:
df['FamilySize']=df['SibSp']+df['Parch']
df.drop(['SibSp','Parch'],axis=1,inplace=True)
df.corr(numeric_only=True)

,PassengerId,Survived,Pclass,Age,Fare,FamilySize
PassengerId,1.0,-0.005007,-0.035144,0.033207,0.012658,-0.040143
Survived,-0.005007,1.0,-0.338481,-0.069809,0.257307,0.016639
Pclass,-0.035144,-0.338481,1.0,-0.331339,-0.5495,0.065997
Age,0.033207,-0.069809,-0.331339,1.0,0.091566,-0.248512
Fare,0.012658,0.257307,-0.5495,0.091566,1.0,0.217138
FamilySize,-0.040143,0.016639,0.065997,-0.248512,0.217138,1.0


Above we can see that there is almost no linear correlation between FamilySize and Survived (0.017), so a larger family does not by itself mean a higher survival rate. There is, however, a negative correlation between Survived and Pclass (-0.34), which means that first-class passengers survived at a higher rate.

Now we add a column that indicates whether the passenger is travelling alone (1) or not (0).

In [49]:
df['Alone']=(df['FamilySize']==0).astype(int)  # 1 if no relatives aboard, else 0
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Embarked,FamilySize,Alone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.25,S,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,STON/O2. 3101282,7.925,S,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1,S,1,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.05,S,0,1
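
As a quick check, a whole-column comparison and an element-by-element loop produce the same flag. A minimal sketch on made-up values (not the Titanic data), assuming FamilySize only holds non-negative integers:

```python
import pandas as pd

# Made-up rows (not the Titanic data); column names follow the notebook.
toy = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

# Loop form: 0 when any relative is aboard, otherwise 1.
alone_loop = [0 if size > 0 else 1 for size in toy["FamilySize"]]

# Vectorized form: compare the whole column at once, then cast bool to int.
toy["Alone"] = (toy["FamilySize"] == 0).astype(int)

print(toy["Alone"].tolist() == alone_loop)
```

The vectorized form avoids indexing the frame row by row, which matters more as the number of rows grows.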


We want to know the relationship between travelling alone and survival.

In [50]:
df.groupby(['Alone'])['Survived'].mean()

Alone
0    0.505650
1    0.303538
Name: Survived, dtype: float64

The result above shows that passengers travelling alone survived at a lower rate (about 30%) than passengers travelling with family (about 51%).
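
A group mean is easier to judge when the group size is shown next to it. The sketch below uses made-up rows (not the Titanic data) and `agg` to return both in one table; the name `summary` is illustrative:

```python
import pandas as pd

# Made-up rows (not the Titanic data); column names follow the notebook.
toy = pd.DataFrame({
    "Alone":    [0, 0, 0, 1, 1, 1, 1, 1],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 0],
})

# mean of a 0/1 column is the survival rate; count is the group size.
summary = toy.groupby("Alone")["Survived"].agg(["mean", "count"])
print(summary)
```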

In [51]:
df[['Alone','Fare']].corr()

,Alone,Fare
Alone,1.0,-0.271832
Fare,-0.271832,1.0


This negative correlation indicates that passengers travelling with family tended to pay a higher fare, and those travelling alone a lower one.

Now we will look at the chances of surviving according to gender.

In [52]:
df['Sex']=(df['Sex']!='male').astype(int)  # 1 for female and 0 for male
df.groupby(['Sex'])['Survived'].mean()

Sex
0    0.188908
1    0.742038
Name: Survived, dtype: float64

Here we see that female passengers had a much higher survival rate (about 74%) than male passengers (about 19%).
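
An explicit dictionary is another way to write such an encoding. It documents the codes in one place and turns unexpected labels into NaN instead of silently encoding them as 1. A minimal sketch on made-up values (the `SexCode` column and the `"unknown"` label are hypothetical, not from the dataset):

```python
import pandas as pd

# Made-up rows (not the Titanic data), including one unexpected label.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male", "unknown"]})

# The mapping states the encoding explicitly: 0 for male, 1 for female.
encoding = {"male": 0, "female": 1}

# Labels missing from the dictionary become NaN, which makes them easy to spot.
toy["SexCode"] = toy["Sex"].map(encoding)
print(toy["SexCode"].isnull().sum())
```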

In [53]:
df.groupby(['Embarked'])['Survived'].mean() 

Embarked
C    0.553571
Q    0.389610
S    0.339009
Name: Survived, dtype: float64

**CONCLUSION**

1. Female passengers survived at a much higher rate than male passengers, which suggests that they were prioritized.
2. Passengers in a higher class (wealthier passengers) had a higher survival rate than others. The class hierarchy may have been followed while rescuing the passengers.
3. Passengers travelling with their family had a higher survival rate than those travelling alone.
4. Passengers who boarded the ship at Cherbourg survived in greater proportion than those who boarded elsewhere.
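
The conclusion about passenger class rests on the correlation table only. The survival rate per class could also be checked directly, in the same way as for Alone, Sex, and Embarked. The sketch below uses made-up rows (not the Titanic data), so its output says nothing about the real dataset:

```python
import pandas as pd

# Made-up rows (not the Titanic data); only the column names match the notebook.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# normalize="index" turns each row of counts into proportions that sum to 1,
# so column 1 holds the survival rate within each class.
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(rates[1])
```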






