# Chapter 3: Data exploration

In [None]:
%reset
# Note: as a bare variable this has no effect; low_memory only matters when passed to pd.read_csv
low_memory=False
import numpy as np
import pandas as pd

## 3.1 Introduction & Problem Setting

As said during the first lesson, the most important purpose of data processing is making sure the data we provide contains **no garbage**. Because of this, we have to explore our dataset and figure out exactly how it's constructed and which values it contains, so we can fix the 'cracks'. Only then will we be able to end up with a high-quality dataset.

## 3.2 Explore the content of the dataset

Let's dive into it! Today we will be working with the Titanic dataset, a common dataset that contains data about the passengers on the Titanic at the time it sank. Load in the dataset and have a quick look at the data.

In [None]:
titanic = pd.read_csv("titanic_uncleaned.csv", sep = ";")
titanic.head()

Another way to look at some data is by using .sample instead of .head. Here you can pass along a number to determine how many samples you get.

In [None]:
titanic.sample(15)

### Question 1: Based on the output above, what is the main difference between .head and .sample? And why would you use .sample over .head?

In [None]:
# Replace 'Embarked' with one indicator column per port (e.g. Embarked_C)
titanic = pd.get_dummies(titanic, columns = ['Embarked'])

In [None]:
titanic.head()

We've also seen another way to explore our dataset: looking at the datatypes.

In [None]:
titanic.dtypes

### Question 2: Looking at the datatypes of each column, which ones will we likely need to adapt?

Another method to explore our dataset is the describe function. It gives you a lot of statistical information, such as the minimum value, maximum value, average value, and much more!

In [None]:
titanic.describe()
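By default, .describe only summarises the numeric columns. Here is a minimal sketch on a small made-up dataframe (the `toy` data is hypothetical, not the Titanic file) showing how `include="all"` also covers text columns:

```python
import pandas as pd

# Hypothetical toy data standing in for a few Titanic-like columns
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

numeric_summary = toy.describe()             # numeric columns only
full_summary = toy.describe(include="all")   # adds count/unique/top/freq for text columns

print(numeric_summary)
print(full_summary)
```

Note that the `count` row only counts the values that are actually filled in, so it already hints at missing data.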

Similar to how we would create Series filled with booleans when selecting subsets of the data, we can create entire dataframes of booleans! The most useful case for this is identifying missing values.

In [None]:
titanic.isna()

But what can we do with this information? Simple! When we combine it with the .describe method we saw earlier, we get slightly different behaviour that gives us a great overview of the missing values!

In [None]:
titanic.isna().describe()
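On boolean columns, .describe reports `top` (the most common value) and `freq` (how often it occurs). If you prefer plain counts, summing the booleans is a common alternative. A small sketch on made-up data (the `toy` dataframe and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data with a few gaps
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

missing_flags = toy.isna()              # dataframe of booleans
print(missing_flags.describe())         # count / unique / top / freq per column

# True counts as 1, so the sum is the number of missing values per column
missing_counts = missing_flags.sum()
print(missing_counts)
```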

### Question 3: Try to interpret the table above. Where do the most missing values lie? Are there columns with no missing values?

## 3.3 First data transformations 

So we've talked about modifying data types, now let's actually start doing this!

The easiest and most common way to do so is called **one-hot encoding**, also called **dummy variables**. The theory behind this is simple: when a variable is binary, meaning either A or B, we can represent this with the numbers 0 and 1.

This column then represents **whether the given variable is present**. This is important, because it means we must pick what our interpretation of 'the variable being present' is.

When looking at the titanic dataset, a first column we may want to encode this way is 'Survived'. In this case we would define a passenger surviving as the variable being present. This means that 'Yes' becomes '1' and 'No' becomes '0'.

Now the question remains: how do we actually implement this? Do we go over each record and manually replace the values? Of course not, that would take waaay too much time. By now you should know that for everything you can think of, someone has probably already written a Python package. Luckily for us, this functionality is baked right into pandas with the .map function.

All we have to do is select the column we want to map and pass a dictionary describing our mapping, where the keys are the values we want to replace and the dictionary's values are what we want to have instead.

In [None]:
titanic['Survived'] = titanic['Survived'].map({'No': 0, 'Yes' : 1})

In [None]:
titanic['Survived']
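One thing to watch out for: any value that does not appear in the dictionary becomes NaN after .map. A minimal sketch, assuming a made-up column with one unexpected spelling (the `survived` Series is hypothetical):

```python
import pandas as pd

# Hypothetical column containing one unexpected spelling ("yes")
survived = pd.Series(["Yes", "No", "yes", "No"])

encoded = survived.map({"No": 0, "Yes": 1})
print(encoded)

# Values missing from the dictionary turn into NaN, so it is worth counting them
unmapped = encoded.isna().sum()
print(unmapped)
```

If the count is not zero, have a look at the original values before you continue.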

### Question 4: One-hot encoding works by indicating whether a certain variable is present, thus limiting it to binary columns only. This means we cannot simply apply it to columns like 'Embarked'. Can you think of a way to use the dummy variables technique anyway to encode such columns? Are there any new limitations for this?

Amazing! Now, are these the only ways to execute this transformation? Of course not! Often you will find datasets already use booleans to indicate binary data. Well, pandas makes it super easy to 'swap' datatypes, on the condition that they are compatible (such as True/False & 1/0).

In [None]:
titanic.head()

In [None]:
titanic['Embarked_C'] = titanic['Embarked_C'].astype('int')
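To see what 'compatible' means in practice, here is a small sketch on made-up Series (both are hypothetical, not taken from the Titanic file): booleans convert cleanly to integers, while text that does not look like a number raises an error.

```python
import pandas as pd

# Hypothetical boolean dummy column
flags = pd.Series([True, False, True])

as_int = flags.astype("int")   # True becomes 1, False becomes 0
print(as_int.tolist())

# Incompatible values raise an error instead of converting silently
try:
    pd.Series(["C", "S"]).astype("int")
except (ValueError, TypeError) as err:
    print("cannot convert:", err)
```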

If you ever lose track of what datatype you are currently working with, remember to simply check it!

In [None]:
titanic['Survived'].dtype

## 3.4 Missing values

There are three main ways to handle missing data:

- Removing data
- Data imputation
- Data flagging

Removing data is the simplest method of them all, but also leads to the most information loss.

In [None]:
titanic2 = titanic.dropna()

In [None]:
titanic.isna().describe()

In [None]:
titanic2.isna().describe()

Notice how easy it was to drop all the missing data, but also how we ended up retaining only 185 of the 891 rows? That is a huge loss! This is because by default the .dropna() function deletes any row in which a missing value is present. To tweak this a bit, we can adjust the threshold: the minimum number of non-missing values a row needs in order to be kept.

In [None]:
titanic2 = titanic.dropna(thresh=5)

In [None]:
titanic2.isna().describe()

Now we haven't removed any missing values at all... You can see that this threshold is a finicky thing to play with!
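To get a feeling for how `thresh` behaves, here is a minimal sketch on a made-up dataframe in which every row has a different number of filled-in values (the `toy` data is hypothetical). It also shows `subset`, which only looks at the columns you name:

```python
import pandas as pd

# Hypothetical dataframe: row 0 has 3 values, row 1 has 2, row 2 has 1
toy = pd.DataFrame({
    "A": [1.0, None, None],
    "B": [2.0, 5.0, None],
    "C": [3.0, 6.0, 9.0],
})

strict = toy.dropna()                # keep only rows without any missing value
lenient = toy.dropna(thresh=2)       # keep rows with at least 2 non-missing values
targeted = toy.dropna(subset=["B"])  # only look at column B when deciding

print(len(strict), len(lenient), len(targeted))
```

Since `thresh` counts non-missing values, a low threshold on a dataframe with many columns will rarely remove anything.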

Another way to handle these empty values is by filling them in with a 'good enough' value. There are several default functions to do so!

In [None]:
age = titanic['Age'].ffill()
age.value_counts()

In [None]:
age = titanic['Age'].bfill()
age.value_counts()
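The difference between the two is easier to see on a short made-up Series (the `ages` values below are hypothetical):

```python
import pandas as pd

# Hypothetical ages with gaps in the middle and at the end
ages = pd.Series([22.0, None, None, 35.0, None])

forward = ages.ffill()   # copy the last known value downwards
backward = ages.bfill()  # copy the next known value upwards

print(forward.tolist())
print(backward.tolist())
```

Notice that .bfill cannot fill the last value, because there is no later value to copy from. In the same way, .ffill would leave a gap at the very start of a Series.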

Sometimes a default function isn't good enough or makes no sense for the use case, and then you have to code a custom one.

In [None]:
age = titanic['Age'].fillna(titanic[titanic['Age'].notna()].Age.median())
age.value_counts()
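A small sketch of the same idea on made-up ages (the `ages` Series is hypothetical). It also shows that .median skips missing values by default, so filtering with .notna first is optional:

```python
import pandas as pd

# Hypothetical ages with two gaps
ages = pd.Series([20.0, None, 30.0, 40.0, None])

median_age = ages.median()                    # NaN values are skipped by default
explicit_median = ages[ages.notna()].median() # same result with an explicit filter

filled = ages.fillna(median_age)

print(median_age, explicit_median)
print(filled.tolist())
```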

Remember to be extremely careful when filling the data like this! It's easy for a value to look good, but completely ruin the dataset! That's why sometimes it is better to simply 'flag' the missing values. This way they get an assigned value, but this assigned value won't (if you did it correctly) ruin your dataset.

Again, it's important to make sure you pick values that **do not ruin your dataset**! Don't pick extremely low or high values for example!

In [None]:
age = titanic['Age'].fillna(0)
age.value_counts()
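One possible way to keep track of which values were flagged is to store that information in a separate indicator column before filling. This is only a sketch under assumptions: the `toy` dataframe and the column name `Age_missing` are made up for illustration.

```python
import pandas as pd

# Hypothetical data with two missing ages
toy = pd.DataFrame({"Age": [22.0, None, 35.0, None]})

# Record which rows were missing before overwriting them
toy["Age_missing"] = toy["Age"].isna().astype("int")
toy["Age"] = toy["Age"].fillna(0)

print(toy)
```

This way the placeholder value can always be told apart from a real age later on.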