In [1]:
import pandas as pd
titanic_survival=pd.read_csv("titanic_survival.csv")
titanic_survival.head(2)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"


Finding the Missing Data

Missing data can take a few different forms:

    In Python, the None keyword and type indicates no value.
    The Pandas library uses NaN, which stands for "not a number", to indicate a missing value.

In general terms, both NaN and None can be called null values.

To see which values are NaN, we can use the pandas.isnull() function. It takes a pandas Series and returns a Series of True and False values, much as NumPy returns a Boolean array when we compare arrays.
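As a minimal sketch of this behaviour, the snippet below runs pd.isnull() on a small made-up Series (the values are illustrative and not taken from the Titanic file). It assumes a reasonably recent pandas, where None in a numeric Series is stored as NaN.

```python
import numpy as np
import pandas as pd

# Toy data (not from the Titanic file): one NaN and one None among real values.
sample = pd.Series([22.0, np.nan, 35.0, None])

# pd.isnull() flags both kinds of null value with True.
null_mask = pd.isnull(sample)
print(null_mask.tolist())

# Summing a Boolean mask counts the True entries, i.e. the missing values.
print(null_mask.sum())
```

Summing the mask is a common shortcut for counting nulls, as an alternative to taking len() of the filtered Series.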

In [3]:
age = titanic_survival["age"]  # the "age" column as a Series
print(age.loc[10:20])
# pd.isnull() takes a Series and returns a Boolean Series: True where the value is null, False otherwise
age_is_null = pd.isnull(age)
# Boolean indexing keeps only the entries of age that are null (the result is a Series, not a DataFrame)
age_null_true = age[age_is_null]
age_null_count=len(age[age_is_null])
print('no of rows which has age with Nan is ',age_null_count)

10    47.0
11    18.0
12    24.0
13    26.0
14    80.0
15     NaN
16    24.0
17    50.0
18    32.0
19    36.0
20    37.0
Name: age, dtype: float64
('no of rows which has age with Nan is ', 264)


So, we know that quite a few values are missing from the "age" column, and other columns are missing data too. But why is this a problem?

Let's look at a typical approach to calculating the average of the "age" column:

mean_age = sum(titanic_survival["age"]) / len(titanic_survival["age"])

The result is that mean_age is nan, because any calculation involving a null value also produces a null value. This makes sense when you think about it: how can you add a null value to a known value?

Instead, we have to filter out the missing values before we calculate the mean.
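A small sketch of the problem and the fix, using made-up ages rather than the Titanic file (the variable names here are illustrative only):

```python
import math
import pandas as pd

ages = pd.Series([20.0, None, 40.0])  # made-up values; None is stored as NaN

# The built-in sum() propagates NaN, so the naive mean is NaN as well.
naive_mean = sum(ages) / len(ages)
print(math.isnan(naive_mean))

# Keep only the non-null entries, then average those.
known_ages = ages[~pd.isnull(ages)]
filtered_mean = sum(known_ages) / len(known_ages)
print(filtered_mean)
```

Note that the filtered mean divides by the number of known values, not by the original length of the column.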


In [9]:
age_is_null = pd.isnull(titanic_survival["age"])
print(type(age_is_null))
print(age_is_null[0:16])
good_age = titanic_survival["age"][~age_is_null]
print(good_age.head(10))
correct_mean_age=sum(good_age)/len(good_age)
print(correct_mean_age)


<class 'pandas.core.series.Series'>
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15     True
Name: age, dtype: bool
0    29.0000
1     0.9167
2     2.0000
3    30.0000
4    25.0000
5    48.0000
6    63.0000
7    39.0000
8    53.0000
9    71.0000
Name: age, dtype: float64
29.8811345124


In [10]:
# Series.mean() skips missing values by default, so no manual filtering is needed
correct_mean_age = titanic_survival["age"].mean()
correct_mean_fare = titanic_survival["fare"].mean()

In [11]:
# mean fare for each passenger class
passenger_classes = [1, 2, 3]
fares_by_class = {}
for pclass in passenger_classes:
    # Boolean mask selecting the rows for this class
    selected_class = titanic_survival["pclass"] == pclass
    selected_rows = titanic_survival[selected_class]
    # Series.mean() ignores missing fares
    fares_by_class[pclass] = selected_rows["fare"].mean()
print(fares_by_class)

{1: 87.50899164086687, 2: 21.1791963898917, 3: 13.302888700564957}


Making Pivot Tables

Pivot tables provide an easy way to subset by one column and then apply a calculation like a sum or a mean.
Pivot tables first group and then apply a calculation. In the previous example, we actually made a pivot table manually by grouping by the column "pclass" and then calculating the mean of the "fare" column for each class.

passenger_class_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc=np.mean)

The index parameter tells the method which column to group by. The values parameter is the column that we want to apply the calculation to, and aggfunc specifies the calculation we want to perform. The default for aggfunc is the mean, so if that is the calculation we want, we can omit this parameter.
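To make the equivalence with manual grouping concrete, here is a sketch on a made-up five-row frame. It reuses the column names "pclass" and "fare", but the values are invented for illustration.

```python
import pandas as pd

# Made-up rows, using the same column names as the Titanic data.
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3],
    "fare": [100.0, 50.0, 20.0, 10.0, 8.0],
})

# Manual approach: filter each class, then average its fares.
manual = {}
for pclass in [1, 2, 3]:
    manual[pclass] = toy.loc[toy["pclass"] == pclass, "fare"].mean()

# Pivot table: group by "pclass" and average "fare"; aggfunc is omitted,
# so the default (the mean) is used.
pivot = toy.pivot_table(index="pclass", values="fare")

print(manual)
print(pivot)
```

Both approaches give the same per-class means. Depending on the pandas version, the pivot table may print as a one-column DataFrame rather than as a Series.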

In [13]:
# mean survival rate for each passenger class
import numpy as np
passenger_survival=titanic_survival.pivot_table(index="pclass",values="survived",aggfunc=np.mean)
passenger_survival

pclass
1.0    0.619195
2.0    0.429603
3.0    0.255289
Name: survived, dtype: float64

In [15]:
#pivot table that calculates the total fares collected ("fare") and total number of survivors ("survived") for each embarkation port ("embarked")
port_stats=titanic_survival.pivot_table(index="embarked",values=["fare","survived"],aggfunc=np.sum)
port_stats

embarked,fare,survived
C,16830.7922,150.0
Q,1526.3085,44.0
S,25033.3862,304.0


Drop Missing Values

Another way to handle missing data is to drop the rows or columns that contain it. The DataFrame.dropna() method does this; by default it drops any rows that contain missing values.

The dropna() method takes an axis parameter, which indicates whether you would like to drop rows or columns. Specifying axis=0 or axis='index' will drop any rows that have null values, while specifying axis=1 or axis='columns' will drop any columns that have null values. We will use 0 and 1 since they're more commonly used, but you can use either.

The code below will drop all rows in titanic_survival that have null values.

drop_na_rows = titanic_survival.dropna(axis=0)
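The effect of the axis parameter is easier to see on a tiny frame. The sketch below uses made-up data with a few of the Titanic column names, and also shows the subset parameter of dropna(), which restricts the null check to the listed columns.

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 is missing "age" and "cabin", row 2 is missing "cabin".
toy = pd.DataFrame({
    "age": [29.0, np.nan, 40.0],
    "sex": ["female", "male", "male"],
    "cabin": ["B5", None, None],
})

rows_dropped = toy.dropna(axis=0)  # keeps only rows with no nulls at all
cols_dropped = toy.dropna(axis=1)  # keeps only columns with no nulls at all
age_known = toy.dropna(axis=0, subset=["age"])  # only "age" is checked

print(rows_dropped.shape, cols_dropped.shape, age_known.shape)
```

Dropping rows without subset can discard a lot of data when one sparsely filled column (such as "cabin" here) is null in most rows.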


In [16]:
drop_na_rows = titanic_survival.dropna(axis=0)
drop_na_columns=titanic_survival.dropna(axis=1)
# with the subset parameter, only rows with a null in the listed columns are dropped
new_titanic_survival=titanic_survival.dropna(axis=0,
                                             subset=["age","sex"])

Using iloc to Access Rows by Position

In previous missions, we have used row labels to select data in pandas using DataFrame.loc[]. These work just like column labels, and can be values like numbers, characters, and strings.

Sometimes your dataset will have row labels that are not numbers, or that are not in order. We have sorted the new_titanic_survival dataframe by the "age" column from highest to lowest. Here is a preview of a few of the columns for the first five rows of the data, or the five oldest passengers onboard.
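The difference between selecting by position and selecting by label can be sketched on a made-up three-row frame. The names and ages are invented, and DataFrame.sort_values() is assumed here as the way to do the sort.

```python
import pandas as pd

toy = pd.DataFrame(
    {"name": ["A", "B", "C"], "age": [30.0, 71.0, 45.0]}
)

# Sort from oldest to youngest; each row keeps its original label.
by_age = toy.sort_values("age", ascending=False)
print(by_age.index.tolist())

oldest = by_age.iloc[0]      # position 0: the first row after sorting
label_zero = by_age.loc[0]   # label 0: the row that was first before sorting
print(oldest["name"], label_zero["name"])
```

After sorting, iloc[0] and loc[0] refer to different rows, which is why iloc is the safer choice when "the first n rows" is what we mean.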

In [17]:
first_five_rows = new_titanic_survival.iloc[0:5]
first_ten_rows=new_titanic_survival.iloc[0:10]
row_position_fifth=new_titanic_survival.iloc[4]
row_index_25=new_titanic_survival.loc[25]

In [18]:
first_row_first_column = new_titanic_survival.iloc[0,0]
all_rows_first_three_columns = new_titanic_survival.iloc[:,0:3]
row_index_83_age = new_titanic_survival.loc[83,"age"]
row_index_766_pclass = new_titanic_survival.loc[766,"pclass"]
row_index_1100_age=new_titanic_survival.loc[1100,"age"]
row_index_25_survived=new_titanic_survival.loc[25,"survived"]
five_rows_three_cols=new_titanic_survival.iloc[0:5,0:3]

Reindexing Rows

After we sorted new_titanic_survival by age, the row indexes were no longer sequential. Each row retained its original index from titanic_survival.

Sometimes it's useful to reindex, starting from 0. We can use the DataFrame.reset_index() method to do this. By default, the method retains the old index by adding an extra column to the dataframe with the old index values.

In this exercise, we don't want to retain the index. Check the documentation to see what parameter you need to add so that we don't retain the old index.
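As a sketch of the two behaviours on a made-up frame (not the Titanic data), compare the default call with one that passes drop=True:

```python
import pandas as pd

toy = pd.DataFrame({"age": [30.0, 71.0, 45.0]})
by_age = toy.sort_values("age", ascending=False)

# Default: the old labels are kept in a new column named "index".
kept = by_age.reset_index()

# drop=True discards the old labels instead.
dropped = by_age.reset_index(drop=True)

print(kept.columns.tolist())
print(dropped.columns.tolist())
```

In both cases the new row labels run from 0 upward; the only difference is whether the old labels survive as a column.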

In [None]:
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print(titanic_reindexed.iloc[0:5,0:3])

Applying a Function to a Row



In [21]:
def is_minor(row):
    # True when the passenger is under 18; a missing age compares as False
    return row["age"] < 18

minors = titanic_survival.apply(is_minor, axis=1)

def generate_age_label(row):
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

titanic_survival["age_labels"] = titanic_survival.apply(generate_age_label, axis=1)

In [23]:
# pivot table to find the probability of survival for each age group
import numpy as np
age_group_survival=titanic_survival.pivot_table(index="age_labels",values="survived",aggfunc=np.mean)
print(age_group_survival)

age_labels
adult      0.387892
minor      0.525974
unknown    0.277567
Name: survived, dtype: float64
