# Data Analysis with Python
## Author: Soumarya Basak


In [113]:
import pandas as pd
import numpy as np

In [114]:
df = pd.DataFrame(pd.read_csv(r"train.csv"))

In [115]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [116]:
df.shape

(891, 12)

So, our data consists of 12 columns and for each column we have 891 observations.

### Data Cleaing
But from this large data we hava to clean the data by removing null values

In [117]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

So, there is 177 Null values in "Age" column and 687 Null values is "Cabin" column

#### Criteria
We will remove that colum which has more than 35% of the data as Null

In [118]:
df.isnull().sum() > (.35 * df.shape[0])


PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked       False
dtype: bool

In [119]:
dropcol = df.isnull().sum()[df.isnull().sum() > (.35*df.shape[0])]
dropcol

Cabin    687
dtype: int64

So, we have to remove the column "cabin"


In [102]:
dropcol.index

Index(['Cabin'], dtype='object')

In [120]:
df= df.drop(dropcol.index,axis =1)

In [121]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [122]:
df.shape

(891, 11)

In [123]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         2
dtype: int64

Now for the rest of the data we have to remove the null values from the data set by replacing them with either their column mean or max value or frequently used value 

In [125]:
df= df.fillna(df.mean())

In [126]:
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       2
dtype: int64

still the 'Embarked' is consists of 2 null values

**Here we will replacxe the Null values in a different way**

In [127]:
df['Embarked'].describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

In [128]:
df['Embarked'].fillna("s", inplace=True)

In [130]:
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

#### Analysis

In [131]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [133]:
df['family_size']= df['SibSp']+ df['Parch']

In [142]:
df.drop(['SibSp','Parch'],axis=1 ,inplace=True)

In [144]:
df.corr()


Unnamed: 0,PassengerId,Survived,Pclass,Age,Fare,family_size
PassengerId,1.0,-0.005007,-0.035144,0.033207,0.012658,-0.040143
Survived,-0.005007,1.0,-0.338481,-0.069809,0.257307,0.016639
Pclass,-0.035144,-0.338481,1.0,-0.331339,-0.5495,0.065997
Age,0.033207,-0.069809,-0.331339,1.0,0.091566,-0.248512
Fare,0.012658,0.257307,-0.5495,0.091566,1.0,0.217138
family_size,-0.040143,0.016639,0.065997,-0.248512,0.217138,1.0


#### **Conclusion 1**
So we can see that survival rate is slightly depending on the fare of ship,
So it can be said that the higher class people got priority at the time of escape from the ship.



**Q:** If a person is alone or not does effect the survival rate?

In [147]:
df['Alone']= [0 if df['family_size'][i]>0 else 1 for i in df.index]

In [148]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Embarked,family_size,Alone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.25,S,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,STON/O2. 3101282,7.925,S,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1,S,1,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.05,S,0,1


In [150]:
df.groupby(['Alone'])['Survived'].mean()

Alone
0    0.505650
1    0.303538
Name: Survived, dtype: float64

#### **Outcome**
If a person is travelling with his or her family then he or she has a greater chance to survive.

**$H_0 :$** Taking the *hypothesis* that a person who is travelling with his family might belong to rich class and might be prioritized over other.


In [152]:
df[['Alone','Fare']].corr()

Unnamed: 0,Alone,Fare
Alone,1.0,-0.271832
Fare,-0.271832,1.0



#### **So,  If the person is traveling alone then the fare of the ticket would be low  & if the person is travelling with his family then the chance that the price of ticket would be higher would be higher.**



Which also supports our 2nd outcome.


#### **Comclusion 2**
If a person is travelling with his or her family then he or she has a greater chance to survive.

**Q:** Whether survival rate depends upon gender or not?

In [155]:
df['Sex']=[0 if df['Sex'][i]=='male' else 1 for i in df.index]

In [156]:
df['Sex'].head()

0    0
1    1
2    1
3    1
4    0
Name: Sex, dtype: int64

In [157]:
df.groupby(['Sex'])['Survived'].mean()

Sex
0    0.188908
1    0.742038
Name: Survived, dtype: float64

#### Conclusion 3
The survival rate of the female passengers is higher than the males.


**Q:** whether the survival rate depends on the place of Embarked or not?

In [158]:
df.groupby(['Embarked'])['Survived'].mean()

Embarked
C    0.553571
Q    0.389610
S    0.336957
s    1.000000
Name: Survived, dtype: float64

#### Conclusion 4
The passengers who boarded from Cherbourg had a higher chance of surviving.




## Result of the analysis

1) We can see that survival rate is slightly depending on the fare of ship. So it can be said that the higher class people got priority at the time of escape from the ship.

2) If a person is travelling with his or her family then he or she has a greater chance to survive.

3) The survival rate of the female passengers is higher than the males.

4) The passengers who boarded from Cherbourg had a higher chance of surviving.