# Exercise Notebook - SLU8 - Data Problems

This notebook is associated with this [presentation](https://docs.google.com/presentation/d/1bu6ORtlvKfPI7ZwEA-BOSxpg1pHGvsemFHo0MhoR6ss/edit?usp=sharing). What we cover here:
- Common data entry problems
- Missing data
- Duplicated data
- Outlier detection
- Dealing with outliers
- Number of uniques (_nunique_)
- Drop duplicates
- Converting dtypes
- Data imputation techniques

The **main objective** is to arrive at the end of this notebook with our dataset "cleaned" of any problems: entry problems fixed, duplicates eliminated, missing values identified and handled, and outliers handled.

-----
_By: Hugo Lopes  
LDSA - SLU8_

In [None]:
import pandas as pd
import numpy as np

%matplotlib inline
from matplotlib import pyplot as plt

# Load Data
We will use an extracted **and altered** (!) subset of the [Titanic Dataset](https://www.kaggle.com/c/titanic).

In [None]:
df = pd.read_csv('titanic_exercise.csv')
print('Initial Shape:', df.shape)
df.head()

# Exercise 1: Data Entry Problems
The feature `Sex` has a problem. Let's solve it.

In [None]:
# EXERCISE
# Check the unique values of Sex, and assign it to a variable 'uniques' 
# uniques = ...
### BEGIN SOLUTION
uniques = df.Sex.unique()
### END SOLUTION


# For validation (do not modify):
print('Unique values:', uniques)

Expected output:
    
    Unique values: ['male' 'female' 'Squirrel']

In [None]:
### BEGIN TESTS
assert set(list(uniques)) == set(['male', 'female', 'Squirrel']) 
### END TESTS

Looks like we found a _Squirrel_! This does not make sense!! Let's drop the rows where we can find `Squirrel`. First, let's check how many squirrels...

In [None]:
# EXERCISE
# Find the rows with Squirrel (create a boolean mask)
# mask = ...
### BEGIN SOLUTION
mask = df.Sex == 'Squirrel'
### END SOLUTION


# For validation (do not modify):
print('Number of Squirrels =', mask.sum())

Expected output:
    
    Number of Squirrels = 1

In [None]:
### BEGIN TESTS
assert mask.sum() == 1
assert mask.dtype == bool, 'Mask is not boolean'
### END TESTS

In [None]:
# Now drop the rows that have Squirrel (update 'df')
# df = ...
### BEGIN SOLUTION
df = df[~mask]
### END SOLUTION


# For validation (do not modify):
print('Shape after dropping:', df.shape)

Expected output:
    
    Shape after dropping: (895, 12)

In [None]:
### BEGIN TESTS
assert df.shape == (895, 12)
### END TESTS
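
The same pattern generalizes to more than one bad value. Below is a minimal sketch on hypothetical toy data (not the Titanic file): `value_counts()` makes rare, suspicious categories stand out, and `isin()` with a list of accepted values builds the mask. The column names and the list `valid_values` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data with one invalid category in 'sex'.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Nutty", "Cid"],
    "sex": ["female", "male", "Squirrel", "male"],
})

# Frequency of each category: rare entries are often typos or bad records.
counts = toy["sex"].value_counts()

# Assumed list of accepted values; anything else is flagged as invalid.
valid_values = ["male", "female"]
invalid_mask = ~toy["sex"].isin(valid_values)

# Keep only the valid rows and rebuild a clean 0..n-1 index.
cleaned = toy[~invalid_mask].reset_index(drop=True)

print(counts)
print("Invalid rows:", invalid_mask.sum())
print(cleaned)
```

Dropping is only one option; when the intended value is obvious (for example `'Male'` instead of `'male'`), mapping it to the correct category keeps the row.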

# Exercise 2: Duplicated Data
Time to check if we have duplicated observations (rows). 

In [None]:
# Find the duplicated lines according to the 'PassengerId' subset. 
# Create a mask out of it (hint: use 'duplicated()')
# duplicates = ...
### BEGIN SOLUTION
duplicates = df.duplicated(subset=['PassengerId'])
### END SOLUTION

print('Number of duplicates:', duplicates.sum())

Expected output:

    Number of duplicates: 5

In [None]:
### BEGIN TESTS
assert duplicates.sum() == 5, 'Wrong number of duplicates'
### END TESTS

Now, after verification, we know how many duplicates we have. It's time to drop the duplicated rows:

In [None]:
# Drop the duplicated lines according to the 'PassengerId' subset.
# Update 'df' with the result (hint: use 'drop_duplicates()')
# df = ...
### BEGIN SOLUTION
df = df.drop_duplicates(subset=['PassengerId'])
### END SOLUTION

# For validation (do not modify):
print('Number of duplicated lines after drop:', df.duplicated().sum())
print('Shape after drop:', df.shape)

Expected output:

    Number of duplicated lines after drop: 0
    Shape after drop: (890, 12)

In [None]:
### BEGIN TESTS
assert df.duplicated().sum() == 0, 'You still have duplicates'
assert df.shape == (890, 12), 'Dataframe shape is not correct'
### END TESTS
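
To see how `subset` and `keep` change what counts as a duplicate, here is a small sketch on hypothetical toy data (the column names are made up for illustration). By default, `duplicated()` flags every repeat except the first occurrence; `keep=False` flags all rows that share a key, which is handy for inspecting them before dropping.

```python
import pandas as pd

# Hypothetical toy data: ids 2 and 3 are repeated.
toy = pd.DataFrame({
    "passenger_id": [1, 2, 2, 3, 3, 3],
    "fare": [7.25, 71.28, 71.28, 8.05, 8.05, 9.00],
})

# Default keep='first': repeats are flagged, the first occurrence is not.
dup_first = toy.duplicated(subset=["passenger_id"])

# keep=False: every row whose id appears more than once is flagged.
dup_all = toy.duplicated(subset=["passenger_id"], keep=False)

# Without 'subset', a row is a duplicate only if ALL columns match.
dup_full_row = toy.duplicated()

# Keep the first row for each id.
deduped = toy.drop_duplicates(subset=["passenger_id"])

print("Repeats by id:", dup_first.sum())
print("Rows sharing an id:", dup_all.sum())
print("Fully identical repeats:", dup_full_row.sum())
print(deduped)
```

Note that the last row of id 3 has a different fare, so it is a duplicate by id but not a fully identical row. Which row to keep in such a case is a judgment call that depends on the data.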

# Exercise 3: Missing Values
Missing values are among the most common and most complex data problems there are. Entire books have been written about handling them!

You can think of cases where values are missing completely at random, and cases where they are _missing_ for a reason (e.g., I may not want to state my income on a loan application because I earn very little, or because I simply have no income).

Since this is a very complex topic, we will focus on solving it the easy (not optimal) way:
- Dropping columns with a high percentage of missing values (rule of thumb >80-90%).
- Numerical features: Replacing the missing values by a value.
- Categorical features: Replacing the missing values by a new category (e.g. 'unknown').
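
The three strategies above can be sketched on a tiny, hypothetical dataframe. The column names and the threshold value are assumptions chosen only so the example is easy to follow; the threshold is lower than the rule of thumb because the toy frame has just four rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in every column.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "port": ["S", None, "C", "S"],
    "cabin": [np.nan, np.nan, np.nan, "C85"],
})

# Fraction of missing values per column.
missing_ratio = toy.isnull().mean()

# 1) Drop columns whose missing fraction exceeds the (assumed) threshold.
threshold = 0.7
to_drop = missing_ratio[missing_ratio > threshold].index
cleaned = toy.drop(columns=to_drop)

# 2) Numerical feature: replace missing values with the median.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())

# 3) Categorical feature: replace missing values with a new category.
cleaned["port"] = cleaned["port"].fillna("unknown")

print(missing_ratio)
print(cleaned)
```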

In [None]:
def eliminate_missing_values(data):
    """
    Eliminate the missing values, numpy.nan, of numerical features and
    categorical features. Also, drop one of the features which has a lot of missing data.
    """
    # Work on a copy so the caller's dataframe is not modified in place
    data = data.copy()
    # 1) Analysis
    # Count the number of missing values in the full dataset. 
    # Use pandas '.isnull()'. Number_of_missing should be a single int number
    # number_of_missing = ...
    ### BEGIN SOLUTION
    number_of_missing = data.isnull().sum().sum()
    ### END SOLUTION
    
    
    # 2) Cleaning missing data on numerical features
    # Fill the missing values of 'Age' by the median. 
    # You can use 'fillna'
    # data.Age = ...
    ### BEGIN SOLUTION
    # Use the median of the dataframe passed in, not the global 'df'
    data.Age = data.Age.fillna(data.Age.median())
    ### END SOLUTION
    
    
    # 3) Solving Categorical Features
    # Replace the missing values in the feature 'Embarked' by 'unknown'
    # You can use 'fillna()'
    # data.Embarked = ...
    ### BEGIN SOLUTION
    data.Embarked = data.Embarked.fillna('unknown')
    ### END SOLUTION
    
    
    # 4) Drop the feature 'Cabin' which has a lot of missing values
    # You can use the method 'drop(...)'. 
    # Hint: remember what you learned about the axis number
    # data = ...
    ### BEGIN SOLUTION
    data = data.drop('Cabin', axis=1)
    ### END SOLUTION
    
    return number_of_missing, data

In [None]:
# For validation (do not modify):
number_of_missing, df = eliminate_missing_values(df)

print('Number of missing values', number_of_missing)
print('Uniques of Embarked', df.Embarked.unique())
print('Age most common value:', df.Age.value_counts().index[0])
print('Shape after handling missing values', df.shape)

Expected output:

    Number of missing values 865
    Uniques of Embarked ['S' 'C' 'Q' 'unknown']
    Age most common value: 28.0
    Shape after handling missing values (890, 11)

In [None]:
### BEGIN TESTS
assert number_of_missing == 865
assert set(list(df.Embarked.unique())) == set(['S', 'C', 'Q', 'unknown'])
assert np.isclose(df.Age.value_counts().index[0], 28)
assert df.shape == (890, 11)
### END TESTS

# Exercise 4: Outliers
You suspect that the `Age` variable has some outliers. Time to take a look at it.

In [None]:
def eliminate_age_outliers(data, minimum, maximum):
    """
    Eliminate the outliers in Age, and update full dataframe, 
    by dropping the rows with these outliers.
    """
    data = data.copy()

    # 1) Create a boolean mask that is True for the values inside the range
    # [minimum, maximum] (both bounds inclusive); its negation flags the outliers.
    # Also, count the number of outliers that were found
    # Hint: beware of your parentheses!
    # mask = ...
    # number_of_outliers = ...
    ### BEGIN SOLUTION
    # Use the function arguments rather than hard-coded bounds
    mask = (data['Age'] >= minimum) & (data['Age'] <= maximum)
    number_of_outliers = (~mask).sum()
    ### END SOLUTION
    
    # 2) Update the dataframe 'data'. Keep only the rows that do not
    # have outliers in 'Age'. 
    # data = ...
    ### BEGIN SOLUTION
    data = data[mask]
    ### END SOLUTION
    
    assert mask.dtype == 'bool', "The mask must be of bool type"
    return data, number_of_outliers

In [None]:
# For validation (do not modify):
print('Shape before removing outliers:', df.shape)

df, num_of_outliers = eliminate_age_outliers(df, 0, 117)

print('Number of outliers:', num_of_outliers)
print('Final shape of dataset:', df.shape)

Expected output:

    Shape before removing outliers: (890, 11)
    Number of outliers: 2
    Final shape of dataset: (888, 11)

In [None]:
### BEGIN TESTS
assert num_of_outliers == 2, 'Incorrect number of outliers'
assert df.shape == (888, 11)
### END TESTS
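
As an alternative to combining two comparisons by hand, pandas offers `Series.between()`, which is inclusive on both ends by default. The sketch below uses hypothetical ages (not the exercise data) and checks that both approaches give the same mask.

```python
import pandas as pd

# Hypothetical ages, including two impossible values (-3 and 250).
ages = pd.Series([22.0, -3.0, 40.0, 117.0, 250.0, 0.0])

# True for values inside [0, 117]; both bounds are included by default.
in_range = ages.between(0, 117)

# Equivalent mask written with explicit comparisons.
# The parentheses are required because '&' binds tighter than '>=' and '<='.
manual = (ages >= 0) & (ages <= 117)

n_outliers = (~in_range).sum()
kept = ages[in_range]

print("Masks are identical:", in_range.equals(manual))
print("Number of outliers:", n_outliers)
print(kept.tolist())
```

One difference worth knowing: missing values compare as `False` in both versions, so rows with a missing age would be treated as out of range unless they are handled first.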

# Exercise 5: Data Types
The most common data types are:
- `str` (in dataframe it is seen as `object`), e.g., `female`
- `float`, e.g., `13.2`
- `int`, e.g., `120` 

Sometimes you'll need to convert between data types. For example, you might have a variable with values `['3.1', '4.6', '???', '3.9']` that you are sure is numerical. After you take care of that `???`, by any method you wish, you will need to convert the values to either `float` or `int`. The array is not automatically converted to a numerical type just because it only contains numerical data.
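
One way to handle such a column, sketched here with the values from the example above, is `pd.to_numeric` with `errors="coerce"`: entries that cannot be parsed become `NaN`, which can then be treated like any other missing value. Filling with the median is just one choice made for illustration.

```python
import pandas as pd

# The example values: numbers stored as strings, plus one bad entry.
raw = pd.Series(["3.1", "4.6", "???", "3.9"])

# Unparseable entries ('???') are turned into NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce")

# Treat the NaN like any other missing value (median chosen as an example).
filled = numeric.fillna(numeric.median())

print("dtype before:", raw.dtype)
print("dtype after:", numeric.dtype)
print(filled.tolist())
```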

Let's convert the feature `Age` data type.

In [None]:
# EXERCISE
# Check the Age dtype, assign it to 'dtype' variable
# dtype = ...
### BEGIN SOLUTION
dtype = df.Age.dtype
### END SOLUTION

print('Current Age dtype:', dtype)

Expected output:

    Current Age dtype: float64

In [None]:
### BEGIN TESTS
assert dtype == np.float64
### END TESTS

In [None]:
# EXERCISE
# Convert the feature Age to int. Update the dataframe.
# Hint: Use the method `astype()`.
# df.Age = ...
### BEGIN SOLUTION
# Explicit width: plain 'int' can map to int32 on some platforms
df.Age = df.Age.astype(np.int64)
### END SOLUTION

# For validation (do not modify):
print('New Age dtype:', df.Age.dtype)

Expected output:

    New Age dtype: int64

In [None]:
### BEGIN TESTS
assert df.Age.dtype == np.int64
### END TESTS

# Final Dataset

In [None]:
df.head(10)

# EXTRA (optional) Exercises: 
## 1) Our workflow might lead us to a problem. Can you spot it?
(hint: replace with mean)
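
One possible reading of the hint (an assumption, not necessarily the intended answer): in this notebook the missing values are filled before the outliers are removed, so the statistic used for filling is computed on data that still contains the outliers. The sketch below, on hypothetical ages, shows how strongly that affects the mean compared with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: one missing value and one obvious outlier (1000).
ages = pd.Series([20.0, 30.0, 40.0, np.nan, 1000.0])

# Statistics computed while the outlier is still present.
mean_with = ages.mean()
median_with = ages.median()

# Remove the outlier first, keeping the missing entry for later imputation.
clean = ages[ages.between(0, 117) | ages.isnull()]
mean_without = clean.mean()
median_without = clean.median()

print("Mean with / without outlier:", mean_with, mean_without)
print("Median with / without outlier:", median_with, median_without)
```

The mean moves from 30.0 to 272.5 because of a single bad value, while the median only moves from 30.0 to 35.0.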

## 2) Can you find people from the same family? [advanced!]
(hint: use Python's `re`, regular expressions)
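
A possible starting point, sketched on a few illustrative names: this assumes names are written as `Surname, Title. Given names` and groups passengers by the text before the first comma. The pattern and variable names are hypothetical choices, and sharing a surname does not prove that two people are related.

```python
import re
import pandas as pd

# Illustrative names, assumed to follow 'Surname, Title. Given names'.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Braund, Mr. Lewis Richard",
    "Heikkinen, Miss. Laina",
])

# Capture everything before the first comma as the surname.
surname_pattern = re.compile(r"^([^,]+),")
surnames = names.apply(lambda s: surname_pattern.match(s).group(1).strip())

# Surnames that appear more than once are candidate families.
family_sizes = surnames.value_counts()
shared = family_sizes[family_sizes > 1].index.tolist()

print(surnames.tolist())
print("Surnames shared by several passengers:", shared)
```

Combining the surname with other features of the dataset could help separate real families from unrelated passengers who happen to share a name.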

## 3) Is there any outlier in feature `Fare`? Why?