# Exercise Notebook - SLU8 - Data Problems

This notebook is associated with this [presentation](https://docs.google.com/presentation/d/1bu6ORtlvKfPI7ZwEA-BOSxpg1pHGvsemFHo0MhoR6ss/edit?usp=sharing). What we cover here:
- Common data entry problems
- Missing data
- Duplicated data
- Outlier detection
- Dealing with outliers
- Nunique
- Drop duplicates
- Converting dtypes
- Data imputation techniques

The **main objective** is to arrive at the end of this notebook with our dataset "cleaned" of any problems: entry problems fixed, duplicates eliminated, all missing values identified, and outliers handled.

-----
_By: Hugo Lopes  
LDSA - SLU8_

In [1]:
import pandas as pd
import numpy as np

%matplotlib inline
from matplotlib import pyplot as plt

# Load Data
We will use an extracted **and altered** (!) subset of the [Titanic Dataset](https://www.kaggle.com/c/titanic).

In [15]:
df = pd.read_csv('titanic_exercise.csv')
print('Initial Shape:', df.shape)
df.head()

Initial Shape: (896, 12)


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


# Exercise 1: Data Entry Problems
The feature `Sex` has a problem. Let's solve it.

In [3]:
# EXERCISE
# Check the unique values of Sex, and assign it to a variable 'uniques' 
# uniques = ...
# YOUR CODE HERE (~1 line)
uniques = df.Sex.unique()
#raise NotImplementedError()


# For validation (do not modify):
print('Unique values:', uniques)

Unique values: ['male' 'female' 'Squirrel']


Expected output:
    
    Unique values: ['male' 'female' 'Squirrel']

Looks like we found a _Squirrel_! This does not make sense!! Let's drop the rows where we can find `Squirrel`. First, let's check how many squirrels...

In [4]:
# EXERCISE
# Find the rows with Squirrel (create a boolean mask)
# mask = ...
# YOUR CODE HERE
mask = df.Sex == 'Squirrel'
#raise NotImplementedError()


# For validation (do not modify):
print('Number of Squirrels =', mask.sum())

Number of Squirrels = 1


Expected output:
    
    Number of Squirrels = 1

In [5]:
# Now drop the rows that have Squirrel (update 'df')
# df = ...
# YOUR CODE HERE
df = df[~mask]
#raise NotImplementedError()


# For validation (do not modify):
print('Shape after dropping:', df.shape)

Shape after dropping: (895, 12)


Expected output:
    
    Shape after dropping: (895, 12)
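
The steps above (inspect the categories, build a mask, filter) can be sketched on a toy dataframe. This is only an illustration: the `toy` data and the `valid` set are made up for the example and are not part of the exercise dataset.

```python
import pandas as pd

# Toy data with one obviously invalid category (hypothetical values).
toy = pd.DataFrame({
    "Sex": ["male", "female", "Squirrel", "female"],
    "Age": [22, 38, 3, 26],
})

# value_counts() lists each category with its frequency,
# which makes rare, suspicious entries easy to spot.
counts = toy["Sex"].value_counts()
print(counts)

# Keep only the rows whose category belongs to a known-valid set.
valid = {"male", "female"}
clean = toy[toy["Sex"].isin(valid)]
print("Shape after filtering:", clean.shape)
```

Filtering against a set of allowed values is an alternative to masking one bad value at a time; it also catches invalid entries you have not seen yet.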

# Exercise 2: Duplicated Data
Time to check if we have duplicated observations (rows). 

In [6]:
# Drop the duplicated lines according to the 'PassengerId' subset. 
# Update the existing 'df' variable.
# df = ...
# YOUR CODE HERE (~1 line)
df = df.drop_duplicates(subset='PassengerId')
#raise NotImplementedError()

# For validation (do not modify):
print('Number of duplicated lines after drop:', df.duplicated().sum())
print('Shape after drop:', df.shape)

Number of duplicated lines after drop: 0
Shape after drop: (890, 12)


Expected output:

    Number of duplicated lines after drop: 0
    Shape after drop: (890, 12)
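
To see why the `subset` argument matters, here is a small sketch on made-up data (the `toy` dataframe is hypothetical). It contrasts full-row duplicates with duplicates defined by a key column only.

```python
import pandas as pd

# Hypothetical data: PassengerId 2 is repeated exactly,
# PassengerId 3 is repeated with a different Fare.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3, 3],
    "Fare": [7.25, 71.28, 71.28, 8.05, 9.00],
})

# Full-row duplicates: every column must match an earlier row.
n_full = toy.duplicated().sum()

# Key-based duplicates: only PassengerId must match an earlier row.
n_key = toy.duplicated(subset="PassengerId").sum()

# Keep the first occurrence of each PassengerId.
dedup = toy.drop_duplicates(subset="PassengerId", keep="first")

print("Full-row duplicates:", n_full)
print("PassengerId duplicates:", n_key)
print("Shape after drop:", dedup.shape)
```

Which definition is right depends on the data: a key-based drop assumes that the identifier alone determines the observation, so conflicting rows for the same key deserve a closer look before being discarded.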

# Exercise 3: Missing Values
Missing values are among the most complex and most common data problems. Entire books have been written about handling them!

Sometimes values are missing completely at random, and sometimes they are _missing_ for a reason (e.g., I may not want to state my income on a loan application because I earn very little money, or because I simply have no income).

Since this is a very complex topic, we will focus on solving it the easy (not optimal) way:
- Dropping columns with a high percentage of missing values.
- Numerical features: Replacing the missing values with a single value (e.g., the median).
- Categorical features: Replacing the missing values with a new category (e.g., 'unknown').
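
The three strategies listed above can be sketched on a toy dataframe as follows. The data and the 50% threshold are assumptions chosen for illustration only; the right threshold depends on your problem.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in every column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", None, "C", "S"],
    "Cabin": [np.nan, np.nan, np.nan, "C123"],
})

# Fraction of missing values per column.
missing_share = toy.isnull().mean()
print(missing_share)

# 1) Drop columns whose share of missing values exceeds the threshold.
threshold = 0.5  # assumed value, not a general rule
to_drop = missing_share[missing_share > threshold].index
toy = toy.drop(columns=to_drop)

# 2) Numerical feature: fill with the median of the observed values.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# 3) Categorical feature: fill with a new, explicit category.
toy["Embarked"] = toy["Embarked"].fillna("unknown")

print(toy)
```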

In [7]:
def eliminate_missing_values(data):
    """
    Eliminate the missing values (numpy.nan) of one numerical feature and
    one categorical feature. Also, drop a feature that has a lot of missing values.
    """
    data = data.copy()
    
    # 1) Analysis
    # Count the number of missing values in the full dataset. 
    # Use pandas '.isnull()'. Return a single 'int' number
    # number_of_missing = ...
    # YOUR CODE HERE (~1 line)
    number_of_missing = data.isnull().sum().sum()
    #raise NotImplementedError()
    
    
    # 2) Solving Numerical Features
    # Fill the missing values of 'Age' by the median. 
    # You can use 'fillna'
    # data.Age = ...
    # YOUR CODE HERE (~1-2 lines)
    data.Age = data.Age.fillna(data.Age.median())
    #raise NotImplementedError()
    
    
    # 3) Solving Categorical Features
    # Replace the missing values in the feature 'Embarked' by 'unknown'
    # You can use 'fillna()'
    # data.Embarked = ...
    # YOUR CODE HERE (~1 line)
    data.Embarked = data.Embarked.fillna('unknown')
    #raise NotImplementedError()
    
    
    # 4) Drop the feature 'Cabin' which has a lot of missing values
    # You can use the method 'drop(...)'.
    # data = ...
    # YOUR CODE HERE (~1 line)
    data = data.drop('Cabin', axis=1)
    #raise NotImplementedError()
    
    return number_of_missing, data

In [8]:
# For validation (do not modify):
number_of_missing, df = eliminate_missing_values(df)

print('Number of missing values', number_of_missing)
print('Uniques of Embarked', df.Embarked.unique())
print('Age most common value:', df.Age.value_counts().index[0])
print('Shape after handling missing values', df.shape)
print('Current number of missing values:', df.isnull().sum().sum())

Number of missing values 865
Uniques of Embarked ['S' 'C' 'Q' 'unknown']
Age most common value: 28.0
Shape after handling missing values (890, 11)
Current number of missing values: 0


Expected output:

    Number of missing values 865
    Uniques of Embarked ['S' 'C' 'Q' 'unknown']
    Age most common value: 28.0
    Shape after handling missing values (890, 11)
    Current number of missing values: 0

# Exercise 4: Outliers
Rumor has it that the `Age` variable has some outliers. Time to take a look at it.

In [9]:
def eliminate_age_outliers(data, minimum, maximum):
    """
    Eliminate the outliers in Age, and update full dataframe, 
    by dropping the rows with these outliers.
    """
    data = data.copy()

    # 1) Create a boolean mask with the values out of range [minimum, maximum]
    # Treat the minimum and maximum as inclusive: values equal to them
    # are valid, not outliers.
    # Also, count the number of outliers that were found
    # mask = ...
    # number_of_outliers = ...
    # YOUR CODE HERE (~2 lines)
    mask = (data['Age'] < minimum) | (data['Age'] > maximum)
    number_of_outliers = mask.sum()
    #raise NotImplementedError()
    
    # 2) Update the dataframe 'data'. Keep only the rows that do not
    # have outliers in 'Age'. 
    # data = ...
    # YOUR CODE HERE (~1 line)
    data = data[~mask]
    #raise NotImplementedError()
    
    assert mask.dtype == 'bool', "The mask must be of bool type"
    return data, number_of_outliers

In [10]:
# For validation (do not modify):
print('Shape before removing outliers:', df.shape)

df, num_of_outliers = eliminate_age_outliers(df, 0, 117)

print('Number of outliers:', num_of_outliers)
print('Final shape of dataset:', df.shape)

Shape before removing outliers: (890, 11)
Number of outliers: 2
Final shape of dataset: (888, 11)


Expected output:

    Shape before removing outliers: (890, 11)
    Number of outliers: 2
    Final shape of dataset: (888, 11)
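
As an alternative to writing the two comparisons by hand, the range check can be sketched with `Series.between`, which is inclusive on both ends by default. The `ages` values below are made up for illustration.

```python
import pandas as pd

# Hypothetical ages, including two impossible values and both boundaries.
ages = pd.Series([22, -5, 38, 250, 0, 117], name="Age")

# True for values inside [0, 117]; the boundaries 0 and 117 count as valid.
in_range = ages.between(0, 117)

outliers = ages[~in_range]
kept = ages[in_range]

print("Number of outliers:", (~in_range).sum())
print("Outlier values:", outliers.tolist())
print("Kept values:", kept.tolist())
```

Note that this is a rule-based check using domain knowledge (no one is younger than 0 or, here, older than 117); it is not a statistical outlier test.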

# Exercise 5: Data Types
The most common data types are:
- `str` (shown as `object` in a dataframe), e.g., `female`
- `float`, e.g., `13.2`
- `int`, e.g., `120` 

Sometimes you'll need to convert between data types. For example, you might have a variable with values `['3.1', '4.6', '???', '3.9']` that you are sure is numerical. After you take care of that `???`, by any method you wish, you will need to convert the values to either `float` or `int`. The array does not become numerical automatically just because all of its remaining values look like numbers.
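
One possible way to handle the example above is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN` so they can then be treated like any other missing value. Filling with the median afterwards is just one choice among many, used here for illustration.

```python
import pandas as pd

# The example values from the text, stored as strings.
raw = pd.Series(["3.1", "4.6", "???", "3.9"])

# errors="coerce" converts what it can and replaces the rest with NaN.
numeric = pd.to_numeric(raw, errors="coerce")
print("Missing after conversion:", numeric.isnull().sum())

# Treat the NaN like any other missing value (here: fill with the median).
numeric = numeric.fillna(numeric.median())

print(numeric.tolist())
print("dtype:", numeric.dtype)
```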

For more advanced information on dtypes, please refer to [numpy](https://docs.scipy.org/doc/numpy-1.14.0/reference/arrays.dtypes.html), which has its own dtypes (you will interact with them a lot when using pandas).

Let's convert the feature `Age` from `float` to `int`.

In [11]:
# RUN this cell to check the dtypes of all features. Check that Age is NOT an int.
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Embarked        object
dtype: object

In [12]:
# EXERCISE
# Convert the feature Age from float to int. Update the dataframe.
# Use the method `astype()`.
# df.Age = ...
# YOUR CODE HERE (~1 line)
df.Age = df.Age.astype(int)
#raise NotImplementedError()

# For validation (do not modify):
print('New Age dtype:', df.Age.dtype)

New Age dtype: int64


Expected output:

    New Age dtype: int64
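
Two details of `astype(int)` are worth keeping in mind, sketched below on made-up values: it truncates the fractional part instead of rounding, and it fails if the column still contains `NaN`, which is one reason the missing values were handled before this step.

```python
import numpy as np
import pandas as pd

# Hypothetical fractional ages.
ages = pd.Series([0.42, 28.7, 70.9])

# astype(int) simply drops the fractional part.
truncated = ages.astype(int)

# Rounding first gives the nearest integer instead.
rounded = ages.round().astype(int)

print("Truncated:", truncated.tolist())
print("Rounded:", rounded.tolist())

# A float column with NaN cannot be cast to a plain int dtype.
try:
    pd.Series([22.0, np.nan]).astype(int)
    nan_cast_failed = False
except ValueError:
    nan_cast_failed = True

print("Cast with NaN failed:", nan_cast_failed)
```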

# Final Dataset

In [13]:
df.head(10)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | 28 | 0 | 0 | 330877 | 8.4583 | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.075 | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.1333 | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14 | 1 | 0 | 237736 | 30.0708 | C |


# EXTRA Exercises: 
## 1) Our workflow might lead us to one problem. Can you spot it?
(hint: replace with mean)

## 2) Can you find people from the same family? [advanced!]
(hint: use Python's `re`, regular expressions)

## 3) Are there any outliers in the feature `Fare`? Why?