# SLU06 - Dealing with Data Problems

In [None]:
import os
import pandas as pd
import numpy as np
import copy
import hashlib
import calendar
import datetime
import pycountry_convert as cc

def _hash(s):
    return hashlib.blake2b(bytes(str(s), encoding='utf8'), digest_size=5).hexdigest()

Welcome to the wonderful world of Data Cleanup! In the real world, a lot of good people spend a lot of time cleaning datasets and getting them into a form they can work with.

There is one thing that you should always keep in mind when working with data:

<img src="media/clean-you-must.jpg"/>

Let's get our hands dirty.

## The CR hotels

You took a job as a data scientist in the fancy CR hotel imperium. Your first task is to clean a dataset of reservations from 2015 to 2017. You need to get it nice and tidy before it can be analysed.

Here is the data dictionary for the dataset:
- **hotel:** Resort Hotel or City Hotel
- **is_canceled:** Value indicating if the booking was canceled (1) or not (0)
- **lead_time:** Number of days that elapsed between the date of the booking and the arrival date
- **arrival_date**: Arrival date
- **stays_in_weekend_nights:** Number of weekend nights (Saturday or Sunday) the guest booked at the hotel
- **stays_in_week_nights:** Number of week nights (Monday to Friday) the guest booked at the hotel
- **adults**: Number of adults
- **is_repeated_guest:** Value indicating if the booking is from a repeated guest (1) or not (0)
- **previous_cancelations:** Number of previous bookings that were canceled by the customer prior to the current booking
- **agent:** ID of the travel agency that made the booking
- **adr:** Average Daily Rate defined as the total price of the stay divided by the number of staying nights
- **total_of_special_requests:** Number of special requests made by the customer (e.g. twin bed or high floor)
- **reservation_status:** Last status of the reservation, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer checked in and out of the hotel; No-Show – customer did not check in and did not inform the hotel of the reason why
- **reservation_status_date:** Date at which the last status was set. This variable can be used together with `reservation_status` to determine when the booking was canceled or when the customer checked out of the hotel

## Exercise 1 - Clean the data

### Exercise 1.1 - Get the data

Let's start by importing the dataset and taking a look at it. The dataset is located in the `data` folder, in a file named `crset_hotel_bookings.csv`. This file came straight out of MS Excel, so the values are separated by semicolons. Load the dataset into the `df_crset` dataframe.
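To see why the separator matters, here is a minimal sketch on a made-up, in-memory sample (the three columns and two rows below are hypothetical and are not taken from the real file). It compares the default comma separator with `sep=";"`:

```python
import io

import pandas as pd

# Hypothetical two-row sample in the semicolon-separated layout described above.
raw = "hotel;is_canceled;lead_time\nResort Hotel;0;342\nCity Hotel;1;85\n"

# With the default sep="," each row is read as a single column.
wrong = pd.read_csv(io.StringIO(raw))

# With sep=";" the three fields are split as intended.
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

A dataframe with a single, very wide column is usually a sign that the wrong separator was used.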

In [None]:
# use pandas to load the data into the df_crset dataframe
# df_crset = ... 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(df_crset, pd.DataFrame), "df_crset should be a dataframe"
assert df_crset.shape == (119390, 14), "The shape of the dataframe is different than expected. Are you using the right separator?"
df_crset.head()

### Exercise 1.2 - Arrival date 

Let's start by cleaning the `arrival_date` column.

Create a function called `format_arrival_date()` that extracts the day, month, and year values from this column and stores this information in new columns `arrival_date_month`, `arrival_date_day` and `arrival_date_year`. All values should be integers. Remove the  `arrival_date` column and return the cleaned dataframe. 

You can use the `calendar.month_name` array from the [calendar python module](https://docs.python.org/3/library/calendar.html#module-calendar).
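As a small illustration of the `calendar` hint, the sketch below builds a lookup from month name to month number. It only covers the name-to-number step; how the month is written inside `arrival_date` is something you need to check in the data yourself. The name `month_to_number` is just an illustrative choice.

```python
import calendar

# calendar.month_name is array-like: index 0 is an empty string and
# indices 1..12 hold the month names for the current locale.
month_to_number = {
    name: number for number, name in enumerate(calendar.month_name) if name
}

print(month_to_number["July"])  # 7 with the default (English) locale
```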

In [None]:
def format_arrival_date(df: pd.DataFrame)->pd.DataFrame:
    """
    This function cleans the "arrival_date" column
    """

    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
clean_arrival = format_arrival_date(df_crset)
assert isinstance(clean_arrival, pd.DataFrame), "The function should return a dataframe."
assert clean_arrival.shape == (119390, 16), "The shape of the dataframe is different than expected."
assert 'arrival_date' not in clean_arrival.columns, "You should remove the old arrival_date column."
assert 'arrival_date_month' in clean_arrival.columns, \
"You're missing the arrival_date_month column. Have you named the new column correctly?"
assert 'arrival_date_day' in clean_arrival.columns, \
"You're missing the arrival_date_day column. Have you named the new column correctly?"
assert 'arrival_date_year'  in clean_arrival.columns, \
"You're missing the arrival_date_year column. Have you named the new column correctly?"
assert all(isinstance(item, int) for item in clean_arrival.arrival_date_month), "Months should be saved as integers." 
assert all(isinstance(item, int) for item in clean_arrival.arrival_date_day), "Days should be saved as integers."
assert all(isinstance(item, int) for item in clean_arrival.arrival_date_year), "Years should be saved as integers."
assert clean_arrival.arrival_date_month.sum()==782301, "Something is wrong with your month conversion."
assert clean_arrival.arrival_date_day.sum()==1886152, "Something is wrong with your day conversion."
assert clean_arrival.arrival_date_year.sum()==240708931, "Something is wrong with your year conversion."

### Exercise 1.3 - Week of year 

Create a function named `get_week_of_year` that takes the newly created columns `arrival_date_month`, `arrival_date_day` and `arrival_date_year` and creates a new variable in the same dataframe called `arrival_date_week_number` with the week number of the arrival date. The function should return a dataframe with the newly created column.

You can use the `date.isocalendar()` method from the [datetime python module](https://docs.python.org/3/library/datetime.html).
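For reference, this is roughly how `date.isocalendar()` behaves on two hand-picked dates (chosen only for illustration). Note that ISO weeks do not always line up with the calendar year:

```python
import datetime

# isocalendar() returns (ISO year, ISO week number, ISO weekday).
iso_year, iso_week, iso_weekday = datetime.date(2015, 7, 1).isocalendar()
print(iso_week)  # 27

# The first days of January can still belong to the last ISO week
# of the previous year.
new_year = datetime.date(2016, 1, 1).isocalendar()
print(new_year[0], new_year[1])  # 2015 53
```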

In [None]:
def get_week_of_year(df: pd.DataFrame)->pd.DataFrame:
    """
    This function gets the arrival week of the year
    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
clean_arrival_week_of_year = get_week_of_year(clean_arrival)
assert isinstance(clean_arrival_week_of_year, pd.DataFrame), "The function should return a dataframe."
assert clean_arrival_week_of_year.shape == (119390, 17), \
"The shape of the dataframe is different than expected. Have you saved the new column?"
assert 'arrival_date_week_number' in clean_arrival_week_of_year.columns, \
"You're missing the arrival_date_week_number column. Have you named the new column correctly?"
assert all(isinstance(item, int) for item in clean_arrival_week_of_year.arrival_date_week_number), \
"The values in the new column should be integers." 
assert clean_arrival_week_of_year.arrival_date_week_number.sum()==3194375, "Something is wrong with your data conversion."

### Exercise 1.4 - The reservation status date

Follow the same process you used in Exercises 1.2 and 1.3, but this time for the `reservation_status_date` column: extract the day, month, year, and week number, and store them in new columns as integers. Name the columns `reservation_status_date_day`, `reservation_status_date_month`, and so on, and don't forget to remove the `reservation_status_date` column as well.

Return the cleaned dataframe.

Implement these steps in the function below.

In [None]:
def process_reservation_status_date(df: pd.DataFrame)->pd.DataFrame:
    """
    This function cleans the "reservation_status_date" column
    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
clean_status_date = process_reservation_status_date(clean_arrival_week_of_year)
assert isinstance(clean_status_date, pd.DataFrame), "The function should return a dataframe."
assert clean_status_date.shape == (119390, 20), \
"The shape of the dataframe is different than expected. Have you dropped the old reservation_status_date column?"
assert 'reservation_status_date' not in clean_status_date.columns, "You should remove the old reservation_status_date column."
assert 'reservation_status_date_day' in clean_status_date.columns, \
"You're missing the day column. Have you named the new column correctly?"
assert 'reservation_status_date_month' in clean_status_date.columns, \
"You're missing the month column. Have you named the new column correctly?"
assert 'reservation_status_date_year'  in clean_status_date.columns, \
"You're missing the year column. Have you named the new column correctly?"
assert 'reservation_status_date_week_number'  in clean_status_date.columns, \
"You're missing the week number column. Have you named the new column correctly?"
assert all(isinstance(item, int) for item in clean_status_date.reservation_status_date_day), "Days should be integers." 
assert all(isinstance(item, int) for item in clean_status_date.reservation_status_date_month), "Months should be integers."
assert all(isinstance(item, int) for item in clean_status_date.reservation_status_date_year), "Years should be integers."
assert all(isinstance(item, int) for item in clean_status_date.reservation_status_date_week_number), "Week of the year should be an integer."
assert clean_status_date.reservation_status_date_week_number.sum()==3092150, \
"Something is wrong with your data conversion in reservation_status_date_week_number."
assert clean_status_date.reservation_status_date_month.sum()==756231, "Something is wrong with your month conversion."
assert clean_status_date.reservation_status_date_day.sum()==1870440, "Something is wrong with your day conversion."
assert clean_status_date.reservation_status_date_year.sum()==240701432, "Something is wrong with your year conversion."

## Exercise 2 - Missing data

Let's now look at missing data.

There are over 16000 missing values in the `agent` column, representing over 10% of the total observations, and we need to do something about it. At the moment, the missing values are simply empty strings.

Usually, if more than 70% of the values in a column are missing and there is no way to fill them in, the column can be dropped from the dataset entirely. Our `agent` column is well below that threshold. It is a categorical variable that represents the ID of the travel agency that made the booking, so we can fill in the missing values with a new category named `unknown`.

Create a new function named `impute_agents` that does exactly that.
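The idea can be sketched on a toy series (the agent IDs below are invented for illustration). The sketch assumes the missing entries really are empty strings; if they had been parsed as `NaN` instead, `fillna` would be the counterpart to look at.

```python
import pandas as pd

# Hypothetical agent IDs; the empty strings stand in for missing values.
agents = pd.Series(["9", "", "240", "", "9"], name="agent")

# Replace every empty string with the new 'unknown' category.
imputed = agents.replace("", "unknown")

print(imputed.tolist())  # ['9', 'unknown', '240', 'unknown', '9']
```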

In [None]:
def impute_agents(df: pd.DataFrame)->pd.DataFrame:
    """
    This function imputes the missing values in the agent column with a 
    new 'unknown' category
    """

    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
imputed_df = impute_agents(clean_status_date) 
assert isinstance(imputed_df, pd.DataFrame), "The function should return a dataframe."
assert _hash(imputed_df.agent.sort_values())=='31878956f7', "Did you fill in all the missing values?"

## Exercise 3 - Drop duplicates

The last thing you need to ensure is that your dataset doesn't have any duplicated data.
Create a short function to remove duplicates!
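As a reminder of what counts as a duplicate, here is a minimal sketch on a made-up three-row dataframe: by default, `drop_duplicates()` removes rows that are identical across all columns and keeps the first occurrence.

```python
import pandas as pd

# Hypothetical bookings; the first two rows are identical in every column.
bookings = pd.DataFrame(
    {
        "hotel": ["City Hotel", "City Hotel", "Resort Hotel"],
        "adults": [2, 2, 2],
    }
)

deduplicated = bookings.drop_duplicates()

print(deduplicated.shape)  # (2, 2)
```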

In [None]:
def drop_duplicated_entries(df: pd.DataFrame)->pd.DataFrame:
    """
    This function drops duplicates from the dataframe
    """

    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
clean_crset_df = drop_duplicated_entries(imputed_df)
assert isinstance(clean_crset_df, pd.DataFrame), "The function should return a dataframe."
assert clean_crset_df.shape == (82870, 20), \
"The shape of the dataframe is different than expected. Have you removed all the duplicated rows?" 

Congratulations! The *CRSet() Hotel* dataset is looking very clean and tidy!
<img src="media/CRset.png" width="400">

## A mess to be tidied up

You decided to switch to a more interesting job with the World Health Organization analyzing disease incidence across the world. Your first task is to analyse the incidence of tuberculosis between 1989 and 2008. WHO has been recording the cases all over the world. They have good intentions, but not very good methods to store data.

<img src="media/tbc.png"  width="400">

Let's have a look:

In [None]:
df_tb_who = pd.read_csv(os.path.join('data', 'tb.csv'), sep=',')
df_tb_who.head()

The dataset contains counts of confirmed tuberculosis cases by country, year and demographic group. The demographic data contains information on sex (*m* for male and *f* for female)  and age (*0-14, 15-24, 25-34, 35-44, 45-54, 55-64* and *65+*). Except for the column `year`, the column names are not very intuitive. The column `iso2` contains the country code in [iso2 format](https://www.iso.org/iso-3166-country-codes.html). The remaining columns are incidence data for different age and sex groups.

## Exercise 4 - Country 

Start by addressing the `iso2` column. Store the country name corresponding to each iso2 code in a new `country` column. The [pycountry-convert](https://pypi.org/project/pycountry-convert/) package is your friend! It's already imported as `cc`.

First, create a function `get_country` that converts the iso2 code to country name. If the name cannot be retrieved, the function should return `np.nan`. You can use a `try .. except` block in the function.

Afterwards, copy the WHO dataframe to a new `df_tb_who_country` dataframe, apply the function and drop the `iso2` column.
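The `try .. except` pattern can be sketched without the package itself. In the example below, a small dictionary (`CODE_TO_NAME`, purely hypothetical) stands in for the real conversion, and `lookup_name` is an illustrative helper, not part of `pycountry-convert`. The exceptions a real conversion raises may differ, so check which ones you need to catch.

```python
import numpy as np

# Hypothetical lookup table standing in for a real conversion helper.
CODE_TO_NAME = {"PT": "Portugal", "BR": "Brazil"}


def lookup_name(code):
    """Return the name for a code, or np.nan when the lookup fails."""
    try:
        return CODE_TO_NAME[code]
    except (KeyError, TypeError):
        # Unknown or malformed codes end up here instead of raising.
        return np.nan


print(lookup_name("PT"))  # Portugal
print(lookup_name("XX"))  # nan
```

Returning `np.nan` rather than raising keeps the function safe to use with `apply` on a whole column.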

In [None]:
# def get_country():
# ...

# df_tb_who_country = ...  

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(df_tb_who_country, pd.DataFrame), "The result should be a dataframe."
assert 'iso2' not in df_tb_who_country.columns, "You should drop the original iso2 column."
assert 'country' in df_tb_who_country.columns, "Have you stored the results in a new column named 'country'?"
assert _hash(df_tb_who_country['country'].sort_values())=='4ad9abb441', "Have you converted the iso2 codes to the country name?"

## Exercise 5 - the melt function

This is our dataframe with country names:

In [None]:
df_tb_who_country.head()

Before we can continue, we need to tidy it up. Of all the columns, only two are variables: `year` and `country`. All the other columns do not follow the 'one column, one variable' convention, and their names are actually values.

Use the function `melt()` to tidy the dataframe and store it in `tidy_tb`. Keep the `country` and `year` columns and melt all the other columns. Store the melted column names in the variable `column_name` and their values in variable `cases`.
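To make the reshaping concrete, here is a minimal sketch of `melt()` on a made-up wide dataframe with two value columns, `a` and `b` (hypothetical names). Each value column becomes a set of rows, so 2 rows and 2 value columns give 4 rows:

```python
import pandas as pd

# Hypothetical wide table: 'a' and 'b' are values hiding in column names.
wide = pd.DataFrame(
    {
        "country": ["Portugal", "Brazil"],
        "year": [2000, 2000],
        "a": [1, 3],
        "b": [2, 4],
    }
)

# id_vars are kept as they are; every other column is stacked into two
# new columns, one for the old column name and one for its value.
long = wide.melt(
    id_vars=["country", "year"], var_name="column_name", value_name="cases"
)

print(long.shape)  # (4, 4)
```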

In [None]:
#tidy_tb = 
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(tidy_tb, pd.DataFrame), "The result should be a dataframe."
assert tidy_tb.shape == (121149, 4), "Your dataframe doesn't have the expected shape. Have you melted the dataframe correctly?"
assert "column_name" in tidy_tb.columns, "The variables other than 'country' and 'year' should be stored in a column named 'column_name'."
assert "cases" in tidy_tb.columns, "Number of cases should be stored in a column named 'cases'."
assert _hash(tidy_tb['column_name'].sort_values())=='882405bb95', "The column_name column doesn't look as expected."
assert tidy_tb['cases'].sum()==49790719.0, "The cases column doesn't look as expected."

## Exercise 6 - Data cleanup

Our dataframe is tidy, but it's not clean. 

From the `tidy_tb` dataframe, drop all the rows where the information about `cases` **OR** `country` is missing, as we just cannot guess the number of cases or the country of origin. (Remember how the missing country values are represented?) 

Convert the `cases` column to `int`. 

Save the final dataframe in `clean_tidy_tb` sorted by `country`, `year`, and `column_name`. The index should be reset (with `drop=True`).
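These steps chain together naturally. The sketch below runs them on a tiny invented dataframe (`records` and its contents are hypothetical), assuming the missing values are stored as `NaN`:

```python
import numpy as np
import pandas as pd

# Hypothetical records with one missing country and one missing case count.
records = pd.DataFrame(
    {
        "country": ["Portugal", np.nan, "Brazil", "Brazil"],
        "year": [2001, 2001, 2000, 2000],
        "column_name": ["a", "a", "b", "a"],
        "cases": [5.0, 7.0, np.nan, 3.0],
    }
)

cleaned = (
    records.dropna(subset=["country", "cases"])  # drop a row if either is missing
    .astype({"cases": int})                      # safe once the NaNs are gone
    .sort_values(["country", "year", "column_name"])
    .reset_index(drop=True)                      # discard the old index
)

print(cleaned["country"].tolist())  # ['Brazil', 'Portugal']
print(cleaned["cases"].tolist())    # [3, 5]
```

The order matters: converting to `int` before dropping the missing values would fail, because `NaN` cannot be represented as an integer.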

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(clean_tidy_tb, pd.DataFrame), "The result should be a dataframe."
assert clean_tidy_tb.shape == (38358, 4), "The shape of the dataframe is not correct."
assert _hash(clean_tidy_tb['country'].sort_values())=='bb7e8179ca', "There is something wrong with the values in the country column."
assert clean_tidy_tb['cases'].sum()==49648715, "There is something wrong with the values in the cases column."

## Exercise 7 - Multiple variables stored in one column

Our `clean_tidy_tb` is looking better, but now we need to address the problem of having multiple variables stored in the `column_name` column. Let's fix that in a few steps.

### Exercise 7.1 

From the `column_name` column, extract the codes for female/male to a new column named `sex` and the codes for the age to a new column `age`. Drop all missing values afterwards.
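One possible approach is `Series.str.extract` with one capture group per new column. The sketch below uses invented labels of the form prefix + sex letter + age code; this format is an assumption, so inspect the actual values in `column_name` and adapt the pattern before relying on it.

```python
import pandas as pd

# Hypothetical labels; the last one carries no sex/age code at all.
labels = pd.Series(["new_sp_m014", "new_sp_f65", "new_sp_mu", "new_sp"])

# Group 1 captures the sex letter, group 2 the age code (digits or 'u').
parts = labels.str.extract(r"_([mf])(\d+|u)$")
parts.columns = ["sex", "age"]

print(parts)
```

Labels that do not match the pattern yield `NaN` in both columns, which is why missing values need to be dropped afterwards.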

In [None]:
#clean_tidy_tb[["sex", "age"]] =

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(clean_tidy_tb, pd.DataFrame), "The result should be a dataframe."
assert clean_tidy_tb.shape == (35311, 6), "The shape of your dataframe is off. Have you dropped the missing values?"
assert _hash(sorted(clean_tidy_tb['sex'].unique()))=='f9c570761b', "The values in the sex column are not ok."
assert _hash(sorted(clean_tidy_tb['age'].unique()))=='248571e48d', "The values in the age column are not ok."

### Exercise 7.2

The values in the `age` column are not very easy to understand. Use the `decode_age` dictionary to convert them to a more readable format. Drop any row where the values could not be converted.
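As a sketch of dictionary-based decoding, `Series.map` replaces each value with its dictionary entry and yields `NaN` for anything that is not a key. The codes and the shortened `mapping` below are illustrative only:

```python
import pandas as pd

# Hypothetical codes; '04' has no entry in the mapping below.
codes = pd.Series(["014", "65", "04", "u"])
mapping = {"014": "0-14", "65": "65+", "u": "unknown"}

# Unmapped values become NaN and are removed by dropna().
decoded = codes.map(mapping).dropna()

print(decoded.tolist())  # ['0-14', '65+', 'unknown']
```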

In [None]:
decode_age =   {
        "014": "0-14",
        "1524": "15-24",
        "2534": "25-34",
        "3544": "35-44",
        "4554": "45-54",
        "5564": "55-64",
        "65": "65+",
        "u": "unknown",
    }

In [None]:
# clean_tidy_tb["age"] = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert clean_tidy_tb.shape == (33725, 6), "The shape of your dataframe is off. Have you dropped the missing values?"
assert _hash(clean_tidy_tb.sort_index()['age'])=='e18f2dc8d2', "The decoding did not work as expected."

### Exercise 7.3

Finally, save in `final_tb_df` the dataframe with just the columns "country", "year", "sex", "age" and "cases".

In [None]:
# final_tb_df = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(final_tb_df, pd.DataFrame), "The result should be a dataframe."
assert final_tb_df.shape == (33725, 5), "The shape of your dataframe is off."
assert sorted(final_tb_df.columns) == ['age', 'cases', 'country', 'sex', 'year'], "The column names are not as expected."

Congratulations!!! You're a data cleaning master!

<img src="media/good-job.jpg"  width="400">