# Working With Missing Data
[From the dataquest.io site](https://www.dataquest.io/m/83/data-manipulation-with-pandas/5/normalizing-columns-in-a-data-set)
### <p style="color:Tomato">Learn to handle missing data using pandas, and a data set on Titanic survival.</p>
#### <p style="color:Gray">Clean and analyze data on passenger survival from the Titanic. Many of the columns, such as age and sex, have missing data.</p>
Missing data can:
* cause errors
* make calculations fail: finding the mean of a column that contains a missing value does not succeed, because a missing value cannot be averaged.


<p style="color:Blue">**1. titanic_survival.csv**</p>

In [1]:
import pandas as pd

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
titanic_survival = pd.read_csv("titanic_survival.csv")

In [4]:
age = titanic_survival["age"]
print(age.loc[10:20])

10    47.0
11    18.0
12    24.0
13    26.0
14    80.0
15     NaN
16    24.0
17    50.0
18    32.0
19    36.0
20    37.0
Name: age, dtype: float64


In [5]:
age_is_null = pd.isnull(age)
age_null_true = age[age_is_null]
age_null_count = len(age_null_true)
print(age_null_count)

264


#### <p style="color:Gray">NaN</p>
Short for "not a number"; pandas uses it to indicate a missing value.
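
One property of NaN is worth seeing directly: it never compares equal to anything, itself included, so `==` cannot be used to find it and a dedicated check is needed. A minimal sketch, using a made-up value rather than the Titanic data:

```python
import math

import pandas as pd

# A stand-in for a missing value (not taken from titanic_survival.csv).
missing = float("nan")

# NaN is not equal to anything, including itself, so == cannot find it.
equals_itself = missing == missing

# Dedicated checks report it correctly.
found_by_math = math.isnan(missing)
found_by_pandas = pd.isnull(missing)

print(equals_itself, found_by_math, found_by_pandas)
```
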
#### <p style="color:Gray">pandas.isnull()</p>
Takes a Series and returns a new Series of True and False values, with True wherever the original value is missing (NaN).
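
To make the True/False mask concrete, here is a small sketch on a hand-made Series (the values and the `sample_` names are illustrative, not from the data set). Because True counts as 1 when summed, the mask also gives the number of missing values without building a filtered Series first.

```python
import numpy as np
import pandas as pd

# Illustrative ages; the NaN entries play the role of missing data.
sample_age = pd.Series([47.0, 18.0, np.nan, 26.0, np.nan])

# Boolean mask: True where the value is missing.
sample_is_null = pd.isnull(sample_age)

# Use the mask to select only the missing rows, or count them by summing.
missing_rows = sample_age[sample_is_null]
missing_count = int(sample_is_null.sum())

print(sample_is_null.tolist())
print(missing_count)
```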

In [6]:
sex = titanic_survival["sex"]
sex_is_null = pd.isnull(sex)
print(sex_is_null)

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
1280    False
1281    False
1282    False
1283    False
1284    False
1285    False
1286    False
1287    False
1288    False
1289    False
1290    False
1291    False
1292    False
1293    False
1294    False
1295    False
1296    False
1297    False
1298    False
1299    False
1300    False
1301    False
1302    False
1303    False
1304    False
1305    False
1306    False
1307    False
1308    False
1309     True
Name: sex, Length: 1310, dtype: bool


In [7]:
sex_null_true = sex[sex_is_null]
print(sex_null_true)

1309    NaN
Name: sex, dtype: object


In [8]:
mean_age = sum(titanic_survival["age"])/len(titanic_survival["age"])
print(mean_age)

nan


The result is nan because any calculation that involves a null value also results in a null value.
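
The same propagation can be reproduced on a few hand-picked numbers (illustrative only, not from the data set). Python's built-in `sum()` adds every element, so a single NaN makes the total NaN, and the division then yields NaN as well.

```python
import math

# Three illustrative ages, one of them missing.
ages = [24.0, float("nan"), 50.0]

total = sum(ages)                # NaN: 24.0 + NaN is NaN, and so is NaN + 50.0
naive_mean = total / len(ages)   # still NaN

print(math.isnan(total), math.isnan(naive_mean))
```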

In [9]:
age_is_null = pd.isnull(titanic_survival["age"])
good_ages = titanic_survival["age"][~age_is_null]
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)

29.8811345124


#### <p style="color:Gray">Series.mean()</p>
Calculates the mean of a column.<br/>
Missing values are not included in the calculation.
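
Skipping missing values is controlled by the `skipna` parameter of `Series.mean()`, which defaults to `True`. A short sketch on made-up values (not the Titanic data) shows the difference between the two settings:

```python
import numpy as np
import pandas as pd

# Illustrative values with one missing entry.
sample = pd.Series([20.0, np.nan, 40.0])

mean_skipping_nan = sample.mean()              # averages 20.0 and 40.0 only
mean_keeping_nan = sample.mean(skipna=False)   # the NaN propagates

print(mean_skipping_nan)
print(mean_keeping_nan)
```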

In [10]:
correct_mean_age = titanic_survival["age"].mean()
print(correct_mean_age)

29.8811345124283


Assign the mean of the "fare" column to correct_mean_fare.

In [15]:
correct_mean_fare = titanic_survival["fare"].mean()
print(correct_mean_fare)

33.29547928134572
