# Chapter 3: Data exploration

In [5]:
%reset
low_memory=False
import numpy as np
import pandas as pd

Once deleted, variables cannot be recovered. Proceed (y/[n])?  y


## 3.1 Introduction & Problem Setting

As said during the first lesson, the most important purpose of data processing is making sure the data we provide contains **no garbage**. Because of this, we have to explore our dataset and figure out exactly how it's constructed and which values it contains, so we can fix the 'cracks'. Only then will we end up with a high-quality dataset.

## 3.2 Explore the content of the dataset

Let's dive into it! Today we will be working with the Titanic dataset, a commonly used dataset that contains data about the passengers on board the Titanic when it sank. Load the dataset and have a quick look at the data.

In [6]:
titanic = pd.read_csv("titanic_uncleaned.csv", sep = ";")
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,No,3.0,"Braund, Mr. Owen Harris",male,,1.0,0.0,A/5 21171,7.25,,S
1,2,Yes,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1.0,0.0,PC 17599,712.833,C85,C
2,3,Yes,3.0,"Heikkinen, Miss. Laina",female,26.0,0.0,0.0,STON/O2. 3101282,7.925,,S
3,4,Yes,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1.0,0.0,113803,53.1,C123,S
4,5,No,3.0,"Allen, Mr. William Henry",male,35.0,0.0,0.0,373450,8.05,,S


Another way to look at some data is by using .sample instead of .head. You can pass it a number to determine how many rows are sampled.

In [7]:
titanic.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
302,303,No,3.0,"Johnson, Mr. William Cahoone Jr",male,19.0,0.0,0.0,LINE,0.0,,S
471,472,No,3.0,"Cacic, Mr. Luka",male,38.0,0.0,0.0,315089,86.625,,S
563,564,No,3.0,"Simmons, Mr. John",male,,0.0,0.0,SOTON/OQ 392082,8.05,,S
588,589,No,3.0,"Gilinski, Mr. Eliezer",male,22.0,0.0,0.0,14973,8.05,,S
10,11,Yes,3.0,"Sandstrom, Miss. Marguerite Rut",female,4.0,1.0,1.0,PP 9549,16.7,G6,S


### Question 1: Based on the output above, what is the main difference between .head and .sample? And why would you use .sample over .head?

In general, they do a similar thing: both show a number of records, and both accept a number that determines how many records to show. The main difference is that .head always shows the **first X records**, while .sample shows **X random records**. A random sample usually gives a more representative picture of the dataset than just its beginning!
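The difference is easy to see on a small frame. The sketch below uses a made-up `toy` DataFrame (not the Titanic data); `random_state` is an optional argument that makes the random draw repeatable between runs.

```python
import pandas as pd

# Hypothetical toy data, only used to illustrate the behaviour
toy = pd.DataFrame({"value": range(100)})

first_rows = toy.head(3)                      # always the rows at positions 0, 1, 2
random_rows = toy.sample(3, random_state=42)  # 3 rows drawn at random, without replacement

print(first_rows.index.tolist())
print(random_rows.index.tolist())
```

Leaving out `random_state` gives a different selection on every run, which is what happens in the cell above.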

We've also seen another method to explore our dataset, this was by looking at the datatypes.

In [8]:
titanic.dtypes

PassengerId      int64
Survived           str
Pclass         float64
Name               str
Sex                str
Age            float64
SibSp          float64
Parch          float64
Ticket             str
Fare               str
Cabin              str
Embarked           str
dtype: object

### Question 2: Looking at the datatypes of each column, which ones will we likely need to adapt?

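We want as many numeric columns as possible, since most statistics and models work on numbers. Because of this, integers and floats are our ideal data types, while strings ('str' in the output above, often shown as 'object') are less ideal and will have to be adapted.

These columns are:
- Survived
- Name
- Sex
- Ticket
- Fare
- Cabin
- Embarked

Note that Fare is stored as text even though it should be a number, while Age is already a float.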
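Instead of reading the list of datatypes by eye, pandas can list the non-numeric columns directly. A minimal sketch on a made-up `toy` frame, assuming `select_dtypes` with `exclude="number"` is an acceptable way to find them:

```python
import pandas as pd

# Hypothetical toy data mimicking a few Titanic columns
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Fare": ["7.25", "53.1"],  # numbers stored as text
})

# Keep only the columns that are NOT numeric
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

On the real dataset the same call would be `titanic.select_dtypes(exclude="number")`.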

Another method to explore our dataset is the describe function. It gives you information more on the statistical side of things, such as the minimum value, maximum value, average value, and much more!

In [9]:
titanic.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch
count,892.0,891.0,714.0,891.0,891.0
mean,446.5,2.308642,29.70612,0.523008,0.381594
std,257.642517,0.836071,14.523986,1.102743,0.806057
min,1.0,1.0,0.42,0.0,0.0
25%,223.75,2.0,20.125,0.0,0.0
50%,446.5,3.0,28.0,0.0,0.0
75%,669.25,3.0,38.0,1.0,0.0
max,892.0,3.0,80.0,8.0,6.0


Similar to how we create Series filled with booleans when selecting subsets of the data, we can create entire dataframes of booleans! The most useful case for this is identifying missing values.

In [10]:
titanic.isna()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,True,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False
890,False,False,False,False,False,False,False,False,False,False,True,False


But what can we do with this information? Simple! When we combine it with the .describe method we saw earlier, we get slightly different behaviour that gives us a great overview of the missing values!

In [11]:
titanic.isna().describe()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,892,892,892,892,892,892,892,892,892,892,892,892
unique,1,2,2,2,1,2,2,2,2,2,2,2
top,False,False,False,False,False,False,False,False,False,False,True,False
freq,892,891,891,891,892,714,891,891,891,891,688,890


### Question 3: Try to interpret the table above. Where do the most missing values lie? Are there columns with no missing values?

The most missing values lie in the 'Cabin' column. You can see this because it is the only column where 'True' (as in, 'yes, this is a missing value' by our .isna) is the most common value. By looking at the frequency we can see there are 688 missing values in this column.

The only columns with no missing values are PassengerId and Sex, as they both only have 'False' values.
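If you only need the counts, summing the boolean frame is a more direct route, because True counts as 1 and False as 0. A small sketch on made-up data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Sex": ["male", "female", "female", "male"],
})

missing_counts = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```

On the real dataset this would be `titanic.isna().sum()`.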

## 3.3 First data transformations 

So we've talked about modifying data types, now let's actually start doing this!

The easiest and most common way to do so is called **one-hot encoding**, also known as creating **dummy variables**. The theory behind this is simple: when a variable is binary, meaning either A or B, we can represent it with the numbers 0 and 1.

This column then represents **whether the given variable is present**. This is important, because it means we must pick what our interpretation of 'the variable being present' is.

When looking at the titanic dataset, a first column we may want to encode this way is 'Survived'. In this case we would define a passenger surviving as the variable being present. This means that 'Yes' becomes '1' and 'No' becomes '0'.

Now the question remains: how do we actually implement this? Do we go over each record and manually replace the values? Of course not, that would take waaay too much time. By now you should know that for almost everything you can think of, someone has probably already written a Python package. Luckily for us, this functionality is baked right into pandas with the .map method.

All we have to do is select the column we want to map and pass a dictionary with our mapping, where the keys are the values we want to replace and the values are what we want to replace them with.

In [12]:
titanic['Survived'] = titanic['Survived'].map({'No': 0, 'Yes' : 1})

In [13]:
titanic['Survived']

0      0.0
1      1.0
2      1.0
3      1.0
4      0.0
      ... 
887    1.0
888    0.0
889    1.0
890    0.0
891    0.0
Name: Survived, Length: 892, dtype: float64
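You may have noticed the result is float64 rather than an integer type. With .map, any value that is not a key in the dictionary (including a value that was already missing) becomes NaN, and a column containing NaN is stored as floats. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical answers: one missing value and one misspelled 'yes'
answers = pd.Series(["No", "Yes", None, "yes"])

# Values without a matching key in the dictionary turn into NaN
mapped = answers.map({"No": 0, "Yes": 1})

print(mapped)
print(mapped.isna().sum())  # how many values were not mapped
```

Checking `.isna().sum()` before and after mapping is a quick way to spot values the dictionary did not cover.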

### Question 4: One-hot encoding works by indicating whether a certain variable is present, thus limiting it to binary columns only. This means we cannot simply apply it to columns like 'Embarked'. Can you think of a way to use the dummy variable technique anyway to encode such columns? Are there any new limitations?

A column like 'Embarked' has multiple possible values. While we cannot convert it into a single column indicating the variable is present, we can create **a new column** for **each possible value** and indicate if that value is present for each record. We have to make sure that such a transformation makes sense.

For example, if we have a car dataset and did this for the model name column, it would **not gain us any new information**. Why? Simple! Each car has a unique name, so each new column would consist almost entirely of zeroes, with only a single one. Doing this transformation might even make your overall dataset a lot worse, as you are increasing its size for no good reason, but we'll talk more about that in the intro to ML class. Where it might be beneficial, however, is the brand name column. Lots of cars will have the same brand, and this will lead to actual information gain.

Now again, we are not coding manually. Pandas once again saves the day with a simple yet elegant function: .get_dummies

In [14]:
titanic = pd.get_dummies(titanic, columns = ['Embarked'])

In [15]:
titanic.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_C,Embarked_Q,Embarked_S
107,108,1.0,3.0,"Moss, Mr. Albert Johan",male,,0.0,0.0,312991,7.775,,False,False,True
623,624,0.0,3.0,"Hansen, Mr. Henry Damsgaard",male,21.0,0.0,0.0,350029,78.542,,False,False,True
332,333,0.0,1.0,"Graham, Mr. George Edward",male,38.0,0.0,1.0,PC 17582,1.534.625,C91,False,False,True
803,804,1.0,3.0,"Thomas, Master. Assad Alexander",male,0.42,0.0,1.0,2625,85.167,,True,False,False
62,63,0.0,1.0,"Harris, Mr. Henry Birkhardt",male,45.0,1.0,0.0,36973,83.475,C83,False,False,True
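The new columns above contain True/False. If you prefer 0/1, `pd.get_dummies` accepts a `dtype` argument. A small sketch on a made-up `toy` frame:

```python
import pandas as pd

# Hypothetical toy column with three possible values
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One new column per distinct value, stored as integers instead of booleans
dummies = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

print(dummies)
```

Each row has exactly one 1 across the three new columns, because every passenger in the toy data embarked at exactly one port.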


Amazing! Now, are these the only ways to execute this transformation? Of course not! Often you will find datasets already use booleans to indicate binary data. Well, pandas makes it super easy to 'swap' datatypes, on the condition that they are compatible (such as True/False & 1/0).

In [16]:
titanic['Survived'].astype('bool')

0      False
1       True
2       True
3       True
4      False
       ...  
887     True
888    False
889     True
890    False
891    False
Name: Survived, Length: 892, dtype: bool

If you ever lose track of what datatype you are currently working with, remember to simply check it!

In [17]:
titanic['Survived'].dtype

dtype('float64')
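The check still reports float64 because .astype returns a new Series instead of changing the column in place; to keep the result, you have to assign it back. The sketch below, on a made-up Series, also shows a pitfall: a missing value converts to True, so it is worth handling missing values before converting.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 column with one missing value
survived = pd.Series([0.0, 1.0, np.nan])

as_bool = survived.astype("bool")  # new Series; 'survived' itself is untouched

print(survived.dtype)    # still the original dtype
print(as_bool.tolist())  # note what happened to the missing value
```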

## 3.4 Missing values

There are three main ways to handle missing data:

- Removing data
- Data imputation
- Data flagging

Removing data is the simplest method of them all, but also leads to the most information loss.

In [18]:
titanic2 = titanic.dropna()

In [19]:
titanic.isna().describe()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_C,Embarked_Q,Embarked_S
count,892,892,892,892,892,892,892,892,892,892,892,892,892,892
unique,1,2,2,2,1,2,2,2,2,2,2,1,1,1
top,False,False,False,False,False,False,False,False,False,False,True,False,False,False
freq,892,891,891,891,892,714,891,891,891,891,688,892,892,892


In [20]:
titanic2.isna().describe()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_C,Embarked_Q,Embarked_S
count,185,185,185,185,185,185,185,185,185,185,185,185,185,185
unique,1,1,1,1,1,1,1,1,1,1,1,1,1,1
top,False,False,False,False,False,False,False,False,False,False,False,False,False,False
freq,185,185,185,185,185,185,185,185,185,185,185,185,185,185


Notice how easy it was to drop all the missing data, but how in the end we retained only 185 of the 892 rows? That is a huge loss! This is because by default the .dropna() function deletes any row that contains at least one missing value. To tweak this a bit, we can adjust the threshold.

In [21]:
titanic2 = titanic.dropna(thresh=5)

In [22]:
titanic2.isna().describe()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_C,Embarked_Q,Embarked_S
count,892,892,892,892,892,892,892,892,892,892,892,892,892,892
unique,1,2,2,2,1,2,2,2,2,2,2,1,1,1
top,False,False,False,False,False,False,False,False,False,False,True,False,False,False
freq,892,891,891,891,892,714,891,891,891,891,688,892,892,892


Now we have removed no rows at all... With thresh=5, a row is kept as soon as it has at least 5 non-missing values, and every row here does. You can see how this threshold is a finicky thing to play with!
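To get a feel for how `thresh` behaves, it helps to try it on a frame small enough to check by hand. The sketch below uses made-up data and also shows `subset`, another .dropna argument that only looks at the columns you name:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows have 3, 2, 1 and 0 non-missing values
toy = pd.DataFrame({
    "Age":   [22.0, 30.0, np.nan, np.nan],
    "Cabin": ["C85", None, None, None],
    "Fare":  [7.25, 8.05, 8.05, np.nan],
})

strict = toy.dropna()                   # drop rows with ANY missing value
at_least_two = toy.dropna(thresh=2)     # keep rows with at least 2 non-missing values
fare_known = toy.dropna(subset=["Fare"])  # only drop rows where Fare is missing

print(len(strict), len(at_least_two), len(fare_known))
```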

Another way to handle these empty values is by filling them in with a 'good enough' value. There are several default functions to do so!

In [23]:
age = titanic['Age'].ffill()
age.value_counts()

Age
24.00    39
19.00    38
21.00    33
28.00    32
22.00    31
         ..
24.50     1
0.67      1
0.42      1
34.50     1
74.00     1
Name: count, Length: 88, dtype: int64

In [24]:
age = titanic['Age'].bfill()
age.value_counts()

Age
24.00    38
18.00    35
28.00    30
22.00    30
25.00    30
         ..
24.50     1
0.67      1
0.42      1
34.50     1
74.00     1
Name: count, Length: 88, dtype: int64
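The value counts alone do not show what these two functions actually do. On a small made-up Series it becomes clear: .ffill copies the last known value forward, while .bfill copies the next known value backward.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with gaps
ages = pd.Series([22.0, np.nan, np.nan, 35.0, np.nan])

forward = ages.ffill()   # gaps take the previous known value
backward = ages.bfill()  # gaps take the next known value

print(forward.tolist())
print(backward.tolist())
```

Note that .bfill cannot fill the last gap, since no later value exists. Both functions depend on row order, so they only make sense when neighbouring rows are actually related.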

Sometimes a default function isn't good enough or makes no sense for the use case, and then you have to code a custom one.

In [25]:
# .median() skips missing values by default, so no extra filtering is needed
age = titanic['Age'].fillna(titanic['Age'].median())
age.value_counts()

Age
28.00    203
24.00     30
18.00     26
22.00     26
19.00     25
        ... 
24.50      1
0.67       1
0.42       1
34.50      1
74.00      1
Name: count, Length: 88, dtype: int64

Remember to be extremely careful when filling the data like this! It's easy for a value to look good, but completely ruin the dataset! That's why sometimes it is better to simply 'flag' the missing values. This way they get an assigned value, but this assigned value won't (if you did it correctly) ruin your dataset.

Again, it's important to make sure you pick values that **do not ruin your dataset**! Don't pick extremely low or high values for example!

In [26]:
age = titanic['Age'].fillna(0)
age.value_counts()

Age
0.00     178
24.00     30
18.00     26
22.00     26
28.00     25
        ... 
24.50      1
0.67       1
0.42       1
34.50      1
74.00      1
Name: count, Length: 89, dtype: int64
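A flag value of 0 sits right next to real ages in this data (the youngest passenger is 0.42), so it can be hard to tell apart from genuine values. One common alternative, sketched here on made-up data, is to record the flag in a separate boolean column before filling. The column names `Age_missing` and `Age_filled` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages, including a real value close to 0
toy = pd.DataFrame({"Age": [22.0, np.nan, 0.42, np.nan]})

# Remember which rows were missing BEFORE filling anything in
toy["Age_missing"] = toy["Age"].isna()

# Fill in a separate column, so the original stays available for comparison
toy["Age_filled"] = toy["Age"].fillna(toy["Age"].median())

print(toy)
```

This way the filled value can be anything reasonable, while the flag column still tells you which ages were never observed.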