Explore The Data: What Data Are We Using?
This notebook uses the Titanic dataset from the Kaggle Titanic competition.

This dataset contains information about 891 people who were on board the ship when it sank on April 15th, 1912. As noted in the description on Kaggle's website, some people aboard were more likely to survive the wreck than others: there were not enough lifeboats for everybody, so women, children, and the upper class were prioritized. Using the information about these 891 passengers, the challenge is to build a model that predicts which people survived, based on the following fields:

Name (str) - Name of the passenger
Pclass (int) - Ticket class (1st, 2nd, or 3rd)
Sex (str) - Gender of the passenger
Age (float) - Age in years
SibSp (int) - Number of siblings and spouses aboard
Parch (int) - Number of parents and children aboard
Ticket (str) - Ticket number
Fare (float) - Passenger fare
Cabin (str) - Cabin number
Embarked (str) - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Read In Data

# read the data

In [56]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats


In [8]:
data = pd.read_csv("data/titanic.csv")
data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [10]:
# number of rows and columns
data.shape

(891, 12)

In [12]:
# data type of each column
data.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [23]:
data['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64
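The raw counts are easier to interpret as proportions. The sketch below recomputes them from the two counts printed above; on the full frame, `data['Survived'].value_counts(normalize=True)` should give the same numbers. The `counts` and `rates` names are introduced here only for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above (0 = did not survive, 1 = survived)
counts = pd.Series({0: 549, 1: 342}, name="Survived")

# Share of each class; roughly 62% did not survive and 38% survived
rates = counts / counts.sum()
print(rates.round(3))
```

The classes are moderately imbalanced, which is worth keeping in mind when choosing an evaluation metric later.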

In [14]:
# explore the continuous features

In [22]:
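# drop all categorical features
# errors='ignore' keeps this cell safe to re-run; without it, a second run raises
# KeyError: "['PassengerId' 'Name' 'Sex' 'Ticket' 'Cabin' 'Embarked'] not found in axis"
# because the columns have already been dropped
cat_feat = ['PassengerId','Name','Sex','Ticket','Cabin','Embarked']
data.drop(columns=cat_feat,errors='ignore',inplace=True)
data.head()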
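An alternative to listing the columns by hand is to select them by dtype. This is only a sketch: the `toy` frame is a small stand-in built from the first rows shown by `data.head()` earlier, and the `numeric` name is hypothetical. Because nothing is modified in place, re-running the cell does not raise a `KeyError`.

```python
import pandas as pd

# Toy stand-in for the full dataset, using the first three rows shown earlier
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley", "Heikkinen, Miss. Laina"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

# Keep numeric columns only, then drop the identifier, which is numeric but not a feature
numeric = toy.select_dtypes(include="number").drop(columns=["PassengerId"])
print(numeric.columns.tolist())
```

Note that `PassengerId` still has to be removed explicitly, since dtype alone cannot tell an identifier from a real numeric feature.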

In [27]:
data.describe()


,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [28]:
data.head()

,Survived,Pclass,Age,SibSp,Parch,Fare
0,0,3,22.0,1,0,7.25
1,1,1,38.0,1,0,71.2833
2,1,3,26.0,0,0,7.925
3,1,1,35.0,1,0,53.1
4,0,3,35.0,0,0,8.05


In [30]:
# look at the correlation matrix
data.corr()

,Survived,Pclass,Age,SibSp,Parch,Fare
Survived,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0
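To read the matrix with the target in mind, it can help to rank the features by the strength of their correlation with `Survived`. The sketch below copies the `Survived` row from the output above; on the full frame, the same values should come from `data.corr()['Survived'].drop('Survived')`. The variable names are illustrative only.

```python
import pandas as pd

# Survived row copied from the correlation matrix above
corr_with_target = pd.Series({
    "Pclass": -0.338481,
    "Age": -0.077221,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Fare": 0.257307,
})

# Order features by absolute correlation, strongest first, keeping the sign
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.loc[order]
print(ranked)
```

`Pclass` and `Fare` stand out, and they are also strongly correlated with each other (-0.5495), so they likely carry overlapping information. Pearson correlation only captures linear association, so a weak value for `Age` does not rule out a non-linear relationship.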


In [41]:
# look at the fare values within each passenger class
data.groupby('Pclass')['Fare'].value_counts()

Pclass  Fare   
1       26.5500    15
        52.0000     7
        0.0000      5
        30.0000     5
        30.5000     5
                   ..
3       15.5500     1
        17.4000     1
        21.6792     1
        22.0250     1
        22.5250     1
Name: Fare, Length: 255, dtype: int64

In [43]:
data.groupby('Pclass')['Fare'].count()


Pclass
1    216
2    184
3    491
Name: Fare, dtype: int64

In [44]:
data.groupby('Pclass')['Fare'].describe()

Pclass,count,mean,std,min,25%,50%,75%,max
1,216.0,84.154687,78.380373,0.0,30.92395,60.2875,93.5,512.3292
2,184.0,20.662183,13.417399,0.0,13.0,14.25,26.0,73.5
3,491.0,13.67555,11.778142,0.0,7.75,8.05,15.5,69.55


In [57]:
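def describe_cont_feature(feature):
    # summary statistics of the feature for each value of the target
    print(f'\n ****results for**** {feature}')
    print(data.groupby('Survived')[feature].describe())

def ttest(feature):
    # Welch's t-test (equal_var=False): compares the feature's mean between the two groups
    # note: features with missing values (e.g. Age) return nan, because ttest_ind
    # propagates NaN by default
    survived = data[data['Survived']==1][feature]
    not_survived = data[data['Survived']==0][feature]
    tstat,pval = stats.ttest_ind(survived,not_survived,equal_var=False)
    print('tstatistics:{:.1f},p-value:{:.3}'.format(tstat,pval))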

In [58]:
# look at the distribution of each feature at each level of the target variable
for feature in ['Pclass','Age','SibSp','Parch','Fare']:
    describe_cont_feature(feature)
    ttest(feature)


 ****results for**** Pclass
          count      mean       std  min  25%  50%  75%  max
Survived                                                    
0         549.0  2.531876  0.735805  1.0  2.0  3.0  3.0  3.0
1         342.0  1.950292  0.863321  1.0  1.0  2.0  3.0  3.0
tstatistics:-10.3,p-value:2.91e-23

 ****results for**** Age
          count       mean        std   min   25%   50%   75%   max
Survived                                                           
0         424.0  30.626179  14.172110  1.00  21.0  28.0  39.0  74.0
1         290.0  28.343690  14.950952  0.42  19.0  28.0  36.0  80.0
tstatistics:nan,p-value:nan

 ****results for**** SibSp
          count      mean       std  min  25%  50%  75%  max
Survived                                                    
0         549.0  0.553734  1.288399  0.0  0.0  0.0  1.0  8.0
1         342.0  0.473684  0.708688  0.0  0.0  0.0  1.0  4.0
tstatistics:-1.2,p-value:0.233

 ****results for**** Parch
          count      mean       std
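The `nan` result for `Age` in the output above comes from the 177 missing ages (714 of 891 values are present): `stats.ttest_ind` propagates NaN by default. A minimal sketch of a NaN-safe variant follows. The `ttest_dropna` helper and the `toy` frame are hypothetical, and the toy values are made up for illustration, so the printed statistic says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

def ttest_dropna(df, feature, target="Survived"):
    """Welch's t-test between the two target groups, ignoring missing values."""
    survived = df.loc[df[target] == 1, feature].dropna()
    not_survived = df.loc[df[target] == 0, feature].dropna()
    return stats.ttest_ind(survived, not_survived, equal_var=False)

# Made-up toy data with one missing Age
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0],
})

tstat, pval = ttest_dropna(toy, "Age")
print("tstatistics:{:.1f},p-value:{:.3}".format(tstat, pval))
```

Dropping the missing rows silently assumes that ages are missing at random, which may not hold here, so the result should be read with that caveat.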

In [59]:
# compare passengers with a missing Age (True) against those with a recorded Age (False)
data.groupby(data['Age'].isnull()).mean()

Age is null,Survived,Pclass,Age,SibSp,Parch,Fare
False,0.406162,2.236695,29.699118,0.512605,0.431373,34.694514
True,0.293785,2.59887,,0.564972,0.180791,22.158567
