# **Missing Values in the Titanic Dataset**

**Dataset Link** : https://www.kaggle.com/c/titanic/data

#### Read the data into a pandas DataFrame called titanic_df and explore its characteristics. Use `columns` and `describe()`, and compare the data with the Kaggle description


In [1]:
import numpy as np 
import pandas as pd 

In [2]:
titanic_df = pd.read_csv("titanic.csv", sep=",")
titanic_df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [3]:
titanic_df.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

In [4]:
titanic_df.describe()

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,0.381971,29.881138,0.498854,0.385027,33.295479,160.809917
std,0.837836,0.486055,14.413493,1.041658,0.86556,51.758668,97.696922
min,1.0,0.0,0.17,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,256.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,328.0








1.  **Survived:** Outcome of survival (0 = No; 1 = Yes)
2.  **Pclass:** Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
3.  **Name:** Name of passenger
4.  **Sex:** Sex of the passenger
5.  **Age:** Age of the passenger (Some entries contain NaN)
6.  **SibSp:** Number of siblings and spouses of the passenger aboard
7.  **Parch:** Number of parents and children of the passenger aboard
8.  **Ticket:** Ticket number of the passenger
9.  **Fare:** Fare paid by the passenger
10. **Cabin:** Cabin number of the passenger (Some entries contain NaN)
11. **Embarked:** Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)









The `describe()` method returns summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for the numeric columns of the dataset.

### Use two different methods to find the number of non-null values per column
`info()` reports the column datatypes, non-null counts and memory consumption. Find a second method that also gives the non-null counts.

In [5]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
pclass       1309 non-null int64
survived     1309 non-null int64
name         1309 non-null object
sex          1309 non-null object
age          1046 non-null float64
sibsp        1309 non-null int64
parch        1309 non-null int64
ticket       1309 non-null object
fare         1308 non-null float64
cabin        295 non-null object
embarked     1307 non-null object
boat         486 non-null object
body         121 non-null float64
home.dest    745 non-null object
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


In [6]:
titanic_df.count()  # number of non-null values per column

pclass       1309
survived     1309
name         1309
sex          1309
age          1046
sibsp        1309
parch        1309
ticket       1309
fare         1308
cabin         295
embarked     1307
boat          486
body          121
home.dest     745
dtype: int64

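## This exercise focuses on three columns with missing values: age, cabin and embarked.
The counts above show that fare, boat, body and home.dest also contain nulls. Calculate the percentage of null values for the numeric columns.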
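
One way to get the share of missing values for every column at once is to average the boolean mask returned by `isna()`. The sketch below runs on a small made-up frame (`toy_df` is hypothetical, not the Titanic data) so that it is self-contained; the same expressions should apply to `titanic_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df
toy_df = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.34, 151.55, np.nan, 7.25],
    "sex": ["female", "male", "female", "male"],
})

# isna() gives True/False per cell; the column mean of that mask is the null fraction
null_pct = (toy_df.isna().mean() * 100).round(2)

# Keep only the numeric columns
numeric_cols = toy_df.select_dtypes(include="number").columns
numeric_null_pct = null_pct[numeric_cols]
print(numeric_null_pct)
```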

## cabin column

In [7]:
### Calculate the percentage of null values for the cabin column


In [8]:
cabin_nulls = (titanic_df.cabin.isna().sum()/len(titanic_df.cabin)*100).round(2)
cabin_nulls

77.46

### Over 77% of the values in this column are missing. What do you think is the best approach for this column?

In [10]:
titanic_df.cabin

0            B5
1       C22 C26
2       C22 C26
3       C22 C26
4       C22 C26
         ...   
1304        NaN
1305        NaN
1306        NaN
1307        NaN
1308        NaN
Name: cabin, Length: 1309, dtype: object

In [None]:
# Returns a copy without the cabin column; titanic_df itself is not modified
titanic_df.drop(columns="cabin")

In [64]:
# thresh is the minimum number of non-null values a column needs to be kept (1013 here)
titanic_df.dropna(axis=1, thresh=int(len(titanic_df) * (cabin_nulls / 100))).head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,S
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,S
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,S
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,S
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,S
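
Because `thresh` in `dropna` counts non-null values, it is usually derived from the share of missing values one is willing to tolerate. A minimal sketch, assuming a hypothetical tolerance `max_null_frac` of 50% and a made-up `toy_df`:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df
toy_df = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],          # 25% missing
    "cabin": ["B5", np.nan, np.nan, np.nan],   # 75% missing
    "sex": ["female", "male", "female", "male"],
})

max_null_frac = 0.5  # assumed tolerance, chosen only for illustration

# A column is kept only if it has at least this many non-null values
min_non_null = math.ceil(len(toy_df) * (1 - max_null_frac))

reduced_df = toy_df.dropna(axis=1, thresh=min_non_null)
print(list(reduced_df.columns))
```

With this formulation the cutoff is stated directly as a tolerated null fraction instead of being tied to the null percentage of one particular column.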


## embarked column

In [None]:
### Embarked is a categorical column. Find which values this variable takes

In [12]:
titanic_df["embarked"].unique()

array(['S', 'C', nan, 'Q'], dtype=object)

In [66]:
titanic_df.groupby('embarked').count()

embarked,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,boat,body,home.dest
C,270,270,270,270,212,270,270,270,270,118,149,25,172
Q,123,123,123,123,50,123,123,123,123,5,38,7,37
S,914,914,914,914,782,914,914,914,913,170,297,89,535


In [None]:
### Count the occurrences of each value and calculate the percentage of the most frequent value
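
A possible way to do this is with `value_counts`, sketched here on a made-up series (`embarked` below is hypothetical toy data, not the real column):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for titanic_df["embarked"]
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", "S", "C"], name="embarked")

# Absolute counts, with the missing values listed as their own row
counts = embarked.value_counts(dropna=False)

# Percentage of each value among the non-null entries
shares = embarked.value_counts(normalize=True) * 100

top_value = shares.idxmax()
top_pct = round(shares.max(), 2)
print(counts)
print(top_value, top_pct)
```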

In [None]:
### What do you think is the best approach in this case?
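
Since only 2 of the 1309 embarked values are missing, one common option (an assumption here, not necessarily the intended answer) is to fill them with the most frequent port. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for titanic_df["embarked"]
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan], name="embarked")

# mode() ignores NaN and returns a Series (there can be ties); take the first entry
most_frequent = embarked.mode()[0]

# fillna returns a new Series; the original is left unchanged
embarked_filled = embarked.fillna(most_frequent)
print(embarked_filled.value_counts())
```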

## age column

In [None]:
### Calculate the percentage of null values for the age column

In [None]:
### Plot a histogram of the distribution of the age column
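
A histogram could be drawn with matplotlib as sketched below. The ages are made up for illustration and the bin count of 8 is an arbitrary assumption; with the real data, `titanic_df["age"]` would take the place of `age`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy ages standing in for titanic_df["age"]
age = pd.Series([29.0, 0.92, 2.0, 30.0, 25.0, np.nan, 48.0, np.nan, 63.0, 39.0], name="age")

fig, ax = plt.subplots()

# dropna() makes it explicit that missing ages are left out of the plot
counts, bin_edges, _ = ax.hist(age.dropna(), bins=8)

ax.set_xlabel("age")
ax.set_ylabel("number of passengers")
ax.set_title("Distribution of age (non-null values)")
```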

### Build a DataFrame with the age and sex columns

### Use a lambda function to assign the mean age per sex to the null values
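
One way to do this is `groupby` combined with `transform`, which applies the lambda to each sex group and returns a result aligned with the original rows. The frame `age_sex_df` and the column name `age_mean_filled` are hypothetical names on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df[["age", "sex"]]
age_sex_df = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "age": [20.0, 30.0, np.nan, 40.0, np.nan, 60.0],
})

# Within each sex group, replace missing ages with that group's mean age
age_sex_df["age_mean_filled"] = (
    age_sex_df.groupby("sex")["age"]
    .transform(lambda s: s.fillna(s.mean()))
)
print(age_sex_df)
```

Writing the result to a new column keeps the original `age` values available for comparing the distributions before and after imputation.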

### Plot a histogram of the distribution of the column with the new values. Do you think this is a good approach?

### Use a lambda function to assign the median age per sex to the null values
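
For the median variant only the statistic inside the lambda changes. The made-up data below (hypothetical `age_sex_df`) includes one much older passenger in the female group to illustrate that the median is less affected by such a value than the mean:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df[["age", "sex"]]
age_sex_df = pd.DataFrame({
    "sex": ["female"] * 4 + ["male"] * 4,
    "age": [20.0, 22.0, 60.0, np.nan, 30.0, 32.0, 34.0, np.nan],
})

# Within each sex group, replace missing ages with that group's median age
age_sex_df["age_median_filled"] = (
    age_sex_df.groupby("sex")["age"]
    .transform(lambda s: s.fillna(s.median()))
)

# Group means of the original column, for comparison with the medians used above
group_means = age_sex_df.groupby("sex")["age"].mean()
print(age_sex_df)
print(group_means)
```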

### Plot a histogram of the distribution of the column with the new values. Do you think this is a good approach?