# Data Quality Considerations

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('titanic.csv')

## Managing Missing Data

How big is the dataset?

In [2]:
len(df)

1309

How big would it be if we were to drop all rows missing data?

In [3]:
len(df.dropna())

183

So how many NaNs are there in each of the columns? The following lambda function calls the **isna** method of each *Series* (one per column), which returns a *Series* of boolean values, then sums them (True counts as 1 and False as 0).

In [4]:
df.aggregate(lambda x: x.isna().sum())

Unnamed: 0       0
Cabin         1014
Embarked         2
Fare             1
Pclass           0
Ticket           0
Age            263
Name             0
Parch            0
Sex              0
SibSp            0
Survived       418
dtype: int64
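
The same per-column count can be written without a lambda, because `DataFrame.isna()` already returns a boolean frame that can be summed column by column. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the Titanic data) so it runs on its own; taking the mean instead of the sum gives the fraction missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# isna() gives a boolean frame; sum() counts the True values in each column
missing_per_column = toy.isna().sum()

# The mean of a boolean column is the fraction of rows that are missing
missing_fraction = toy.isna().mean()

print(missing_per_column)
print(missing_fraction)
```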

Extract all rows where neither Embarked nor Fare is NaN:

In [5]:
df_valid = df.loc[(~df.Embarked.isna()) & (~df.Fare.isna())]
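
As an alternative sketch, `dropna` accepts a `subset` argument that restricts the check to the named columns, so rows missing other values (such as Age) are kept. The frame below is made up for illustration and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 lacks Embarked, row 2 lacks Fare, row 3 lacks Age
toy = pd.DataFrame({
    "Embarked": ["S", None, "C", "Q"],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Age": [22.0, 38.0, 26.0, np.nan],
})

# Boolean-mask version, mirroring the cell above
mask_version = toy.loc[toy.Embarked.notna() & toy.Fare.notna()]

# dropna version: only Embarked and Fare are checked for missing values
subset_version = toy.dropna(subset=["Embarked", "Fare"])

print(subset_version)
```

Both versions keep rows 0 and 3 here; row 3 survives even though its Age is missing.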

Of the valid data, what is the average age of each of the Pclasses?

In [6]:
df_valid.loc[df_valid.Pclass == 1, 'Age'].mean()  # build the mask from df_valid itself, not df

39.08304964539007

In [7]:
df_valid.loc[df_valid.Pclass == 2, 'Age'].mean()  # build the mask from df_valid itself, not df

29.506704980842912
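
Rather than filtering once per class, a `groupby` can compute the mean age of every Pclass in one call. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical); `mean()` skips NaN values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing Age in class 2
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40.0, 38.0, 30.0, np.nan, 20.0, 24.0],
})

# One mean per Pclass; NaN ages are ignored when averaging
mean_age_by_class = toy.groupby("Pclass")["Age"].mean()

print(mean_age_by_class)
```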

Let's fill in the missing Age values with the mean age of each (Pclass, Sex) group.

In [8]:
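def fill_with_group_mean(x):
    # Replace the NaNs in one group's Series with that group's mean
    return x.fillna(x.mean())

In [9]:
# transform returns a Series with the same index as df_valid, one filled Age per row
ages_filled = df_valid.groupby(['Pclass', 'Sex'])['Age'].transform(fill_with_group_mean)
ages_filled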

0       22.000000
1       38.000000
2       26.000000
3       35.000000
4       35.000000
5       25.863017
6       54.000000
7        2.000000
8       27.000000
9       14.000000
10       4.000000
11      58.000000
12      20.000000
13      39.000000
14      14.000000
15      55.000000
16       2.000000
17      30.815380
18      31.000000
19      22.185329
20      35.000000
21      34.000000
22      15.000000
23      28.000000
24       8.000000
25      38.000000
26      25.863017
27      19.000000
28      22.185329
29      25.863017
          ...    
1279    21.000000
1280     6.000000
1281    23.000000
1282    51.000000
1283    13.000000
1284    47.000000
1285    29.000000
1286    18.000000
1287    24.000000
1288    48.000000
1289    22.000000
1290    31.000000
1291    30.000000
1292    38.000000
1293    22.000000
1294    17.000000
1295    43.000000
1296    20.000000
1297    23.000000
1298    50.000000
1299    22.185329
1300     3.000000
1301    22.185329
1302    37.000000
1303    28
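
To make the group-wise fill easier to check by hand, here is a small sketch on a made-up frame (all names and values are hypothetical). Each missing Age is replaced by the mean of its own (Pclass, Sex) group, and the result is written to a new frame with `assign`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing Age in (1, female), one in (3, male)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "male", "male", "male", "male"],
    "Age": [30.0, np.nan, 45.0, 20.0, np.nan, 30.0],
})

# transform keeps the original row order and index, so the result lines up with toy
filled = toy.groupby(["Pclass", "Sex"])["Age"].transform(
    lambda s: s.fillna(s.mean())
)

# Store the imputed ages in a copy rather than overwriting the original column
toy_filled = toy.assign(Age=filled)

print(toy_filled)
```

The (1, female) gap becomes 30.0 and the (3, male) gap becomes 25.0, the mean of 20.0 and 30.0.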

## Class Imbalance

The Survived classes are imbalanced: fewer passengers survived than did not. (The 418 rows with a missing Survived value fall into neither count.)

In [10]:
len(df.loc[df.Survived == 1])

342

In [11]:
len(df.loc[df.Survived == 0])

549
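
The two counts can also be read off in a single call with `value_counts`. The sketch below uses a made-up Survived column (hypothetical values); `dropna=False` makes the missing labels visible, and `normalize=True` reports proportions among the labelled rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels: three 0s, two 1s, one missing
toy = pd.DataFrame({"Survived": [0, 0, 0, 1, 1, np.nan]})

# Counts per class, including a separate entry for NaN
counts = toy.Survived.value_counts(dropna=False)

# Proportions per class; NaN rows are excluded by default
proportions = toy.Survived.value_counts(normalize=True)

print(counts)
print(proportions)
```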

We could randomly remove some of the Survived = 0 rows to even out the numbers (downsampling):

In [12]:
num_to_drop = len(df.loc[df.Survived == 0]) - len(df.loc[df.Survived == 1]) # Get the number of samples to drop
indices_to_drop = df.loc[df.Survived == 0].index.tolist() # Get the indices of Survived == 0 as a list
np.random.shuffle(indices_to_drop) # Randomly shuffle the indices

df_dropped_survive = df.drop(indices_to_drop[:num_to_drop]) # drop returns a new DataFrame, so df is left unchanged

How is the balance now?


In [13]:
len(df_dropped_survive.loc[df_dropped_survive.Survived == 1])

342

In [14]:
len(df_dropped_survive.loc[df_dropped_survive.Survived == 0])

342
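
A shorter way to downsample, sketched here on made-up data, is `DataFrame.sample`, which draws rows at random without a manual shuffle. The frame and variable names are hypothetical, and `random_state=0` is an arbitrary seed chosen only to make the draw repeatable.

```python
import pandas as pd

# Hypothetical toy frame: six rows of class 0, three rows of class 1
toy = pd.DataFrame({
    "Survived": [0] * 6 + [1] * 3,
    "Fare": list(range(9)),
})

majority = toy.loc[toy.Survived == 0]
minority = toy.loc[toy.Survived == 1]

# Draw as many majority rows as there are minority rows, without replacement
majority_down = majority.sample(n=len(minority), random_state=0)

# Recombine into a balanced frame
balanced = pd.concat([majority_down, minority])

print(balanced.Survived.value_counts())
```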

We can also upsample the Survived = 1 class by appending copies of randomly chosen Survived = 1 rows:

In [15]:
num_to_add = num_to_drop
indices_to_append = df.loc[df.Survived == 1].index.tolist() # Get the indices of Survived == 1 as a list
np.random.shuffle(indices_to_append) # Randomly shuffle the indices

samples_to_append = df.loc[indices_to_append[:num_to_add]] # These are index labels, so select with loc rather than iloc

df_upsampled_survive = pd.concat((df, samples_to_append)) # concat returns a new DataFrame, so df is left unchanged

How is the balance now?


In [16]:
len(df_upsampled_survive.loc[df_upsampled_survive.Survived == 1])

549

In [17]:
len(df_upsampled_survive.loc[df_upsampled_survive.Survived == 0])

549
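
Upsampling can likewise be sketched with `DataFrame.sample`, this time with `replace=True` so the same minority row may be drawn more than once. The data and names below are hypothetical and the seed is arbitrary; `ignore_index=True` renumbers the rows so the duplicated rows do not share index labels.

```python
import pandas as pd

# Hypothetical toy frame: six rows of class 0, three rows of class 1
toy = pd.DataFrame({
    "Survived": [0] * 6 + [1] * 3,
    "Fare": list(range(9)),
})

majority = toy.loc[toy.Survived == 0]
minority = toy.loc[toy.Survived == 1]

# Draw enough extra minority rows, with replacement, to match the majority count
num_to_add = len(majority) - len(minority)
minority_extra = minority.sample(n=num_to_add, replace=True, random_state=0)

# Append the duplicates and renumber the index
upsampled = pd.concat([toy, minority_extra], ignore_index=True)

print(upsampled.Survived.value_counts())
```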