# SLU06 - Dealing with Data Problems


In [19]:
!pip install -r requirements.txt

In [20]:
import os
import pandas as pd
import numpy as np
import copy
import hashlib
import json
import warnings
import calendar
import datetime
warnings.filterwarnings('ignore')
import pycountry_convert as cc

Welcome to the wonderful world of Data Cleanup! In the real world, a lot of good people spend a lot of time cleaning datasets and getting them into a form they can work with.

There is one thing that you should always keep in mind when working with data:

<img src="media/clean-you-must.jpg"/>

Let's get our hands dirty.

## The CR hotels

You took a job as a data scientist in the fancy CR hotel imperium. Your first task is to clean a dataset of reservations from 2015 to 2017. You need to get it nice and tidy before it can be analysed.

Here is the data dictionary for the dataset:
- **hotel:** Resort Hotel or City Hotel
- **is_canceled:** Value indicating if the booking was canceled (1) or not (0)
- **lead_time:** Number of days that elapsed between the date of the booking and the arrival date
- **arrival_date:** Arrival date formatted as "Month Day Year"
- **stays_in_weekend_nights:** Number of weekend nights (Saturday or Sunday) the guest booked at the hotel
- **stays_in_week_nights:** Number of week nights (Monday to Friday) the guest booked at the hotel
- **adults:** Number of adults
- **is_repeated_guest:** Value indicating if the booking is from a repeated guest (1) or not (0)
- **previous_cancelations:** Number of previous bookings that were canceled by the customer prior to the current booking
- **agent:** ID of the travel agency that made the booking
- **adr:** Average Daily Rate defined as the total price of the stay divided by the number of staying nights
- **total_of_special_requests:** Number of special requests made by the customer (e.g. twin bed or high floor)
- **reservation_status:** Last status of the reservation, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer checked in and out of the hotel; No-Show – customer did not check in and did not inform the hotel of the reason why
- **reservation_status_date:** Date at which the last status was set. This variable can be used in conjunction with `reservation_status` to understand when the booking was canceled or when the customer checked out of the hotel

### Exercise 1.1

Let's start by importing the dataset and taking a look at it. The dataset is located in the `data` folder, in a file named `crset_hotel_bookings.csv`. This file came straight out of MS Excel, so the values are separated by semicolons. Load the dataset into the `df_crset` dataframe.

In [21]:
# use pandas to load the data into the df_crset dataframe
# df_crset = ... 

# YOUR CODE HERE
df_crset = pd.read_csv('data/crset_hotel_bookings.csv', sep=';')

In [22]:
df_crset.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date,stays_in_weekend_nights,stays_in_week_nights,adults,is_repeated_guest,previous_cancelations,agent,adr,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,July 1 2015,0,0,2,0,0,,0.0,0,Check-Out,01/07/2015
1,Resort Hotel,0,737,July 1 2015,0,0,2,0,0,,0.0,0,Check-Out,01/07/2015
2,Resort Hotel,0,7,July 1 2015,0,1,1,0,0,,75.0,0,Check-Out,02/07/2015
3,Resort Hotel,0,13,July 1 2015,0,1,1,0,0,304.0,75.0,0,Check-Out,02/07/2015
4,Resort Hotel,0,14,July 1 2015,0,2,2,0,0,240.0,98.0,1,Check-Out,03/07/2015


In [23]:
assert isinstance(df_crset, pd.DataFrame), "df_crset should be a dataframe"
assert df_crset.shape == (119390, 14), "The shape of the dataframe is different than expected. Are you using the right separator?"
df_crset.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date,stays_in_weekend_nights,stays_in_week_nights,adults,is_repeated_guest,previous_cancelations,agent,adr,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,July 1 2015,0,0,2,0,0,,0.0,0,Check-Out,01/07/2015
1,Resort Hotel,0,737,July 1 2015,0,0,2,0,0,,0.0,0,Check-Out,01/07/2015
2,Resort Hotel,0,7,July 1 2015,0,1,1,0,0,,75.0,0,Check-Out,02/07/2015
3,Resort Hotel,0,13,July 1 2015,0,1,1,0,0,304.0,75.0,0,Check-Out,02/07/2015
4,Resort Hotel,0,14,July 1 2015,0,2,2,0,0,240.0,98.0,1,Check-Out,03/07/2015


### Exercise 1.2 - Arrival Date 

Let's start by cleaning the arrival date. According to the data dictionary and the first 5 rows of the dataframe, the `arrival_date` column stores a string with the month spelled out, followed by the day and the year as numbers.

Create a function called `format_arrival_date()` that extracts the day, month, and year values from this column and stores them in new columns `arrival_date_month`, `arrival_date_day` and `arrival_date_year`. All values should be integers. Remove the `arrival_date` column and return the cleaned dataframe.

You can use the `calendar.month_name` array from the [calendar python module](https://docs.python.org/3/library/calendar.html#module-calendar) and the pandas `map()` method.

In [24]:
calendar.month_name[7]

'July'

In [25]:
{name: num for num, name in enumerate(calendar.month_name) if name}


{'January': 1,
 'February': 2,
 'March': 3,
 'April': 4,
 'May': 5,
 'June': 6,
 'July': 7,
 'August': 8,
 'September': 9,
 'October': 10,
 'November': 11,
 'December': 12}

In [26]:
#def format_arrival_date(df: pd.DataFrame) -> pd.DataFrame:
#    """
#    This function cleans the "arrival_date" column
#    """
#    # YOUR CODE HERE
#
#    # Define a mapping from month name to month number
#    month_to_num = {name: num for num, name in enumerate(calendar.month_name) if name}
#
#    # Function to extract day, month, and year
#    def extract_date_components(date_str):
#        parts = date_str.split()
#        month = month_to_num[parts[0]]
#        day = int(parts[1])
#        year = int(parts[2])
#        return month, day, year
#
#    # Apply the function to the arrival_date column
#    df[['arrival_date_month', 'arrival_date_day', 'arrival_date_year']] = \
#        df['arrival_date'].map(lambda x: extract_date_components(x)).tolist()
#
#    # Drop the original arrival_date column
#    df.drop(columns=['arrival_date'], inplace=True)
#
#    return df
#

In [27]:
def format_arrival_date(df: pd.DataFrame) -> pd.DataFrame:
    """
    This function cleans the 'arrival_date' column
    """
    # Define a mapping from month name to month number
    month_to_num = {name: num for num, name in enumerate(calendar.month_name) if name}

    # Apply a lambda function to extract day, month, and year
    df[['arrival_date_month', 'arrival_date_day', 'arrival_date_year']] = df['arrival_date'].map(
        lambda date_str: (month_to_num[date_str.split()[0]], int(date_str.split()[1]), int(date_str.split()[2]))
    ).tolist()

    # Drop the original arrival_date column
    df.drop(columns=['arrival_date'], inplace=True)

    return df

In [28]:
clean_arrival = format_arrival_date(df_crset)
assert isinstance(clean_arrival, pd.DataFrame), "The function should return a dataframe."
assert clean_arrival.shape == (119390, 16), "The shape of the dataframe is different than expected."
assert 'arrival_date' not in clean_arrival.columns, "You should remove the old arrival_date column."
assert 'arrival_date_month' in clean_arrival.columns, "You're missing the arrival_date_month column. Have you named the new column correctly?"
assert 'arrival_date_day' in clean_arrival.columns, "You're missing the arrival_date_day column. Have you named the new column correctly?"
assert 'arrival_date_year'  in clean_arrival.columns, "You're missing the arrival_date_year column. Have you named the new column correctly?"
assert all(isinstance(item, int) for item in clean_arrival.arrival_date_month), "Months should be saved as integers." 
assert all(isinstance(item, int) for item in clean_arrival.arrival_date_day), "Days should be saved as integers."
assert all(isinstance(item, int) for item in clean_arrival.arrival_date_year), "Years should be saved as integers."
assert hashlib.sha256(json.dumps(str(clean_arrival.arrival_date_month.sum())).encode()).hexdigest() == 'd3d4cd02abe661b2bc608a13cc791e7543da035d1c729ce171b1fda52effd8c8', "Something is wrong with your month conversion."
assert hashlib.sha256(json.dumps(str(clean_arrival.arrival_date_day.sum())).encode()).hexdigest() == '4db570fffca303e3eeb7ec3bc940a3812f8466a3b6e0d8b44633eaabde411ed4', "Something is wrong with your day conversion."
assert hashlib.sha256(json.dumps(str(clean_arrival.arrival_date_year.sum())).encode()).hexdigest() == '2f83dbe44d408f6735bc778eed604edf730377524018bb9f6a176e503511d033', "Something is wrong with your year conversion."

### Exercise 1.3 - Week of year 

Create a function named `get_week_of_year` that takes the newly created columns `arrival_date_month`, `arrival_date_day` and `arrival_date_year` and creates a new variable in the same dataframe called `arrival_date_week_number` with the week number of the arrival date.

You can use the `date.isocalendar()` method from the [datetime python module](https://docs.python.org/3/library/datetime.html) and pandas `apply()` method.
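
As a quick illustration of what `isocalendar()` returns, here is a minimal sketch on two hand-picked dates (the dates are chosen for illustration and are not taken from the dataset, apart from the first arrival date shown above):

```python
import datetime

# Build a date from separate year, month and day integers
arrival = datetime.date(2015, 7, 1)

# isocalendar() returns (ISO year, ISO week number, ISO weekday)
iso_year, iso_week, iso_weekday = arrival.isocalendar()
print(iso_week)  # 27

# Around New Year the ISO week can belong to the previous ISO year
new_year_week = datetime.date(2016, 1, 1).isocalendar()[1]
print(new_year_week)  # 53
```

The second date shows why the week number alone can be surprising: 1 January 2016 falls in the last ISO week of 2015.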

In [29]:
datetime.date(df_crset['arrival_date_year'][0], df_crset['arrival_date_month'][0], df_crset['arrival_date_day'][0])


datetime.date(2015, 7, 1)

In [30]:
#def get_week_of_year(df: pd.DataFrame) -> pd.DataFrame:
#    # Function to calculate the week number
#    def calculate_week_number(row):
#        # Create a date object
#        d = datetime.date(row['arrival_date_year'], row['arrival_date_month'], row['arrival_date_day'])
#        # Get the week number
#        return d.isocalendar()[1]
#
#    # Apply the function to each row and create a new column
#    df['arrival_date_week_number'] = df.apply(calculate_week_number, axis=1)
#
#    return df

In [31]:
def get_week_of_year(df: pd.DataFrame) -> pd.DataFrame:
    # Apply a lambda function to calculate the week number directly
    df['arrival_date_week_number'] = df.apply(
        lambda row: datetime.date(row['arrival_date_year'], row['arrival_date_month'], row['arrival_date_day']).isocalendar()[1], 
        axis=1
    )
    return df


In [32]:
clean_arrival_week_of_year = get_week_of_year(clean_arrival)
assert isinstance(clean_arrival_week_of_year, pd.DataFrame), "The function should return a dataframe."
assert clean_arrival_week_of_year.shape == (119390, 17), "The shape of the dataframe is different than expected. Have you saved the new column?"
assert 'arrival_date_week_number' in clean_arrival_week_of_year.columns, "You're missing the arrival_date_week_number column. Have you named the new column correctly?"
assert all(isinstance(item, int) for item in clean_arrival_week_of_year.arrival_date_week_number), "The values in the new column should be integers." 
assert hashlib.sha256(json.dumps(str(clean_arrival_week_of_year.arrival_date_week_number)).encode()).hexdigest() == 'fbd3bdce9f2b3a1aff60369c351a1ceb63a167a80d78b140859af7f687cd76a3', "Something is wrong with your data conversion."

### Exercise 1.4 - The reservation status date

Do the same processing as for the `arrival_date` column but this time for the `reservation_status_date` - extract the day, month, year, and week number and store them in new columns as integers. The columns should be named `reservation_status_date_day`, `reservation_status_date_month` and so on. Remove the `reservation_status_date` column and return the cleaned dataframe.

All the steps should be done in a single function named `process_reservation_status_date()`. 
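
One possible alternative to parsing each string by hand is to let pandas parse the whole column at once and read the components through the `.dt` accessor. The sketch below uses three made-up values in the same day/month/year layout; it assumes a pandas version that provides `Series.dt.isocalendar()` (1.1 or later), and it is only one of several ways to approach the exercise:

```python
import pandas as pd

# Toy values in the same dd/mm/yyyy layout as reservation_status_date
status_dates = pd.Series(["01/07/2015", "03/07/2015", "01/01/2016"])

# Parse the strings with an explicit format to avoid day/month ambiguity
parsed = pd.to_datetime(status_dates, format="%d/%m/%Y")

# Collect the components; isocalendar().week is cast to a plain integer dtype
parts = pd.DataFrame({
    "day": parsed.dt.day,
    "month": parsed.dt.month,
    "year": parsed.dt.year,
    "week_number": parsed.dt.isocalendar().week.astype(int),
})
print(parts)
```

Passing the format explicitly matters here, because a value such as `01/07/2015` could otherwise be read as 7 January instead of 1 July.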

In [15]:
#def process_reservation_status_date(df: pd.DataFrame) -> pd.DataFrame:
#    """
#    This function cleans "reservation_status_date" column
#    """
#    # Function to parse the date and extract components
#    def parse_date(date_str):
#        parsed_date = datetime.datetime.strptime(date_str, '%d/%m/%Y')
#        return parsed_date.day, parsed_date.month, parsed_date.year, parsed_date.isocalendar()[1]
#
#    # Apply the function to the reservation_status_date column and create new columns
#    df[['reservation_status_date_day', 
#        'reservation_status_date_month', 
#        'reservation_status_date_year', 
#        'reservation_status_date_week_number']] = df['reservation_status_date'].apply(lambda x: parse_date(x)).tolist()
#
#    # Drop the original reservation_status_date column
#    df.drop(columns=['reservation_status_date'], inplace=True)
#
#    return df


In [35]:
def process_reservation_status_date(df: pd.DataFrame) -> pd.DataFrame:
    """
    This function cleans the 'reservation_status_date' column
    """

    # Apply a lambda function to parse the date and extract components
    date_components = df['reservation_status_date'].apply(
        lambda x: datetime.datetime.strptime(x, '%d/%m/%Y')
    )

    # Extract and assign components to new columns
    df['reservation_status_date_day'] = date_components.dt.day
    df['reservation_status_date_month'] = date_components.dt.month
    df['reservation_status_date_year'] = date_components.dt.year
    df['reservation_status_date_week_number'] = date_components.apply(lambda x: x.isocalendar()[1])

    # Drop the original reservation_status_date column
    df.drop(columns=['reservation_status_date'], inplace=True)

    return df

In [36]:
clean_status_date = process_reservation_status_date(clean_arrival_week_of_year)
assert isinstance(clean_status_date, pd.DataFrame), "The function should return a dataframe."
assert clean_status_date.shape == (119390, 20), "The shape of the dataframe is different than expected. Have you dropped the old reservation_status_date column?"
assert 'reservation_status_date' not in clean_status_date.columns, "You should remove the old reservation_status_date column."
assert 'reservation_status_date_day' in clean_status_date.columns, "You're missing the day column. Have you named the new column correctly?"
assert 'reservation_status_date_month' in clean_status_date.columns, "You're missing the month column. Have you named the new column correctly?"
assert 'reservation_status_date_year'  in clean_status_date.columns, "You're missing the year column. Have you named the new column correctly?"
assert 'reservation_status_date_week_number'  in clean_status_date.columns, "You're missing the week number column. Have you named the new column correctly?"
assert all(isinstance(item, int) for item in clean_status_date.reservation_status_date_day), "Days should be integers." 
assert all(isinstance(item, int) for item in clean_status_date.reservation_status_date_month), "Months should be integers."
assert all(isinstance(item, int) for item in clean_status_date.reservation_status_date_year), "Years should be integers."
assert all(isinstance(item, int) for item in clean_status_date.reservation_status_date_week_number), "Week of the year should be an integer."
assert hashlib.sha256(json.dumps(str(clean_status_date.reservation_status_date_week_number)).encode()).hexdigest() == '035d73f3bf98929a3e2d1ea7b1dd02645a86cc8f1d751dd088533b0e11de0352', "Something is wrong with your data conversion in reservation_status_date_week_number."
assert hashlib.sha256(json.dumps(str(clean_status_date.reservation_status_date_month.sum())).encode()).hexdigest() == '8505858d4a9ee251ff772f4486d8b18f63e2d2015da3ab435371a604800b8f69', "Something is wrong with your month conversion."
assert hashlib.sha256(json.dumps(str(clean_status_date.reservation_status_date_day.sum())).encode()).hexdigest() == 'f8c8d1439b0d0d51ea8a9dc900dc4309b7364a5c4791b417c0e6a5cc093fc9b8', "Something is wrong with your day conversion."
assert hashlib.sha256(json.dumps(str(clean_status_date.reservation_status_date_year.sum())).encode()).hexdigest() == '0c1465052551cc91c40abbd465ae0337ebc526a00b5b536b9d8bfbe8cc35ef3f', "Something is wrong with your year conversion."

### Exercise 2 - Missing data

Let's now look at missing data.

In [37]:
np.sum(clean_status_date.isnull())

hotel                                      0
is_canceled                                0
lead_time                                  0
stays_in_weekend_nights                    0
stays_in_week_nights                       0
adults                                     0
is_repeated_guest                          0
previous_cancelations                      0
agent                                  16340
adr                                        0
total_of_special_requests                  0
reservation_status                         0
arrival_date_month                         0
arrival_date_day                           0
arrival_date_year                          0
arrival_date_week_number                   0
reservation_status_date_day                0
reservation_status_date_month              0
reservation_status_date_year               0
reservation_status_date_week_number        0
dtype: int64

There are over 16,000 missing values in the `agent` column, roughly 14% of all observations, so we need to do something about them.

A common rule of thumb is to drop a column entirely when more than about 70% of its values are missing and there is no way to fill them in. That is not the case here. Our `agent` column is a categorical variable that holds the ID of the travel agency that made the booking, so we can fill the missing values with a new category named `unknown`.

Create a new function named `impute_agents` that does exactly that.
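
To see what this kind of imputation does, here is a minimal sketch on a four-row toy column (the values are made up; only the column name `agent` comes from the dataset):

```python
import numpy as np
import pandas as pd

# Toy column with two missing agency IDs
toy = pd.DataFrame({"agent": [304.0, np.nan, 240.0, np.nan]})

# Replace every missing value with the new 'unknown' category
toy["agent"] = toy["agent"].fillna("unknown")

print(toy["agent"].isnull().sum())  # 0
```

Because numeric IDs and the string label now share one column, the column ends up with the generic `object` dtype.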

In [None]:
def impute_agents(df: pd.DataFrame)->pd.DataFrame:
    """
    This function imputes the missing values in the agent column with a new 'unknown' category
    """

    # YOUR CODE HERE
    raise NotImplementedError()


In [None]:
imputed_df = impute_agents(clean_status_date) 
assert isinstance(imputed_df, pd.DataFrame), "The function should return a dataframe."
assert hashlib.sha256(json.dumps(str(imputed_df.agent)).encode()).hexdigest() == 'e88eaebe8668246fc74259c12ea8e38311c86ec96a73e2a8de26eb30bb7aefb9', "Something is wrong with your data imputation." 
assert hashlib.sha256(json.dumps(sorted(imputed_df.agent[imputed_df.agent == 'unknown'].sum())).encode()).hexdigest() == 'bb80fda4a664ce06ee5d39edb0488f95a9496f1a7dde86e8b750246967e5a921', "Did you fill in all the missing values?"

### Exercise 3 - Drop duplicates

The last thing you need to ensure is that your dataset doesn't have any duplicated data.
Create a short function to remove duplicates!
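
As a small illustration of how pandas identifies and removes fully duplicated rows, consider this sketch on made-up data (the rows are invented; the column names are borrowed from the data dictionary):

```python
import pandas as pd

# The first two rows are identical in every column
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel"],
    "adults": [2, 2, 1],
})

# duplicated() flags every repeat of an earlier row
n_duplicates = toy.duplicated().sum()
print(n_duplicates)  # 1

# drop_duplicates() keeps the first occurrence of each row by default
deduplicated = toy.drop_duplicates()
print(deduplicated.shape)  # (2, 2)
```

Note that the surviving rows keep their original index labels (0 and 2 here) unless the index is reset afterwards.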

In [None]:
def drop_duplicated_entries(df: pd.DataFrame)->pd.DataFrame:
    """
    This function drops duplicates from the dataframe
    """

    # YOUR CODE HERE
    raise NotImplementedError()
    

In [None]:
clean_crset_df = drop_duplicated_entries(imputed_df)
assert isinstance(clean_crset_df, pd.DataFrame), "The function should return a dataframe."
assert clean_crset_df.shape == (82870, 20), "The shape of the dataframe is different than expected. Have you removed all the duplicated rows?" 

Congratulations! The *CRSet() Hotel* dataset is looking very clean and tidy!
<img src="media/CRset.png" width="400">

## A mess to be tidied up

You decided to switch to a more interesting job with the World Health Organization analyzing disease incidence across the world. Your first task is to analyse the incidence of tuberculosis between 1989 and 2008. WHO has been recording the cases all over the world. They have good intentions, but not very good methods to store data.

<img src="media/tbc.png"  width="400">

Let's have a look:

In [None]:
df_tb_who = pd.read_csv(os.path.join('data', 'tb.csv'), sep=',')
df_tb_who.head()

The dataset contains counts of confirmed tuberculosis cases by country, year and demographic group. The demographic data contains information on sex (*m* for male and *f* for female) and age (*0-14, 15-24, 25-34, 35-44, 45-54, 55-64* and *65+*). Except for the column `year`, the column names are not very intuitive. The column `iso2` contains the two-letter country code in [iso2 format](https://www.iso.org/iso-3166-country-codes.html). The remaining columns each combine two variables, `sex` and `age`, in a single column name.

## Exercise 4 - Country 

Start by addressing the `iso2` column. Save in a new `country` column the corresponding country name from the iso2 code. The [pycountry-convert](https://pypi.org/project/pycountry-convert/) package is your friend! It's already imported as `cc`.

First, create a function `get_country` that converts the iso2 code to the country name. If the name cannot be retrieved, the function should return the original iso2 code as a string. Afterwards, copy the WHO dataframe to a new `df_tb_who_country` dataframe, apply the function, and drop the `iso2` column.
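
The "return the original code if the lookup fails" behaviour can be sketched independently of any package. In the example below, `TOY_NAMES`, `lookup_name` and `get_country_sketch` are hypothetical stand-ins invented for illustration; in the exercise, the dictionary lookup would be replaced by the conversion call from `pycountry-convert`:

```python
# Hypothetical stand-in for a real code-to-name lookup
TOY_NAMES = {"PT": "Portugal", "BR": "Brazil"}


def lookup_name(code):
    """Return the country name, raising KeyError for unknown codes."""
    return TOY_NAMES[code]


def get_country_sketch(code):
    """Convert a code to a name, falling back to the code as a string."""
    try:
        return lookup_name(code)
    except (KeyError, TypeError):
        # The lookup failed, so keep whatever we were given, as a string
        return str(code)


print(get_country_sketch("PT"))          # Portugal
print(get_country_sketch("XK"))          # XK
print(get_country_sketch(float("nan")))  # nan
```

The last call shows a side effect of the fallback: a missing value becomes the literal string `"nan"`, which no longer counts as missing for pandas.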

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#def get_country():

#df_tb_who_country = ...

In [None]:
assert isinstance(df_tb_who_country, pd.DataFrame), "The result should be a dataframe."
assert 'iso2' not in df_tb_who_country.columns, "You should drop the original iso2 column."
assert 'country' in df_tb_who_country.columns, "Have you stored the results in a new column named 'country'?"
assert hashlib.sha256(json.dumps(sorted(df_tb_who_country['country'].unique())).encode()).hexdigest() == '40515e68a196feaac974999d8d4fa9f3dd814e1bde66243a968fadf41a8e84de', "Have you converted the iso2 codes to the country NAME?"

## Exercise 5 - the melt function

This is our dataframe with country names:

In [None]:
df_tb_who_country.head()

Before we can continue, we need to tidy it up. Of all the columns, only two are variables, `year` and `country`. All the other column names are values. 

Use the function `melt()` to tidy the dataframe and store the result in `tidy_tb`. Keep the `country` and `year` columns and melt all the other columns. Store the melted column names in a column named `column_name` and their values in a column named `cases`.
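
To make the reshaping concrete, here is a minimal sketch of `pd.melt` on a tiny wide table. The countries, counts and the two demographic column names (`m014`, `f014`) are made up for illustration and are not necessarily the names used in `tb.csv`:

```python
import pandas as pd

# Wide layout: one column per demographic group
wide = pd.DataFrame({
    "country": ["Portugal", "Brazil"],
    "year": [2000, 2000],
    "m014": [1, 5],
    "f014": [2, 6],
})

# Long layout: one row per (country, year, demographic group)
long = pd.melt(
    wide,
    id_vars=["country", "year"],  # columns kept as identifiers
    var_name="column_name",       # holds the former column names
    value_name="cases",           # holds the former cell values
)
print(long)
```

Two identifier rows and two melted columns give four rows in the long table, which is the same logic behind the expected shape in the assertion below.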

In [None]:
#tidy_tb = pd.melt(...)
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(tidy_tb, pd.DataFrame), "The result should be a dataframe."
assert tidy_tb.shape == (121149, 4), "Your dataframe doesn't have the expected shape. Have you melted the dataframe correctly?"
assert "column_name" in tidy_tb.columns, "The variables other than 'country' and 'year' should be stored in a column named 'column_name'."
assert "cases" in tidy_tb.columns, "Number of cases should be stored in a column named 'cases'."
assert hashlib.sha256(json.dumps(sorted(tidy_tb['column_name'])).encode()).hexdigest() == '4ba594d958d63b5bab87fe50944b16f30a93824e56fa331d5d9b59dddf285e35', "The column_name column doesn't look as expected" 
assert hashlib.sha256(json.dumps(sorted(tidy_tb['cases'])).encode()).hexdigest() == '8a294cd49c8fc1b29b60893e45426a80cb9681be4c699ae950c7103552cc7153', "The cases column doesn't look as expected"

## Exercise 6 - Data cleanup

Our dataframe is tidy, but it's not clean. From the `tidy_tb` dataframe, drop all the rows where `cases` **OR** `country` is null: we have no information there, and we cannot guess the number of cases or the country of origin. Convert the `cases` column to `int`. Note that in Exercise 4 the `country` column was converted to string, so its NaN values are now the string "nan". The `.isnull()` and `.notnull()` methods do not treat that string as missing, so you need to handle it explicitly. Save the final dataframe in `clean_tidy_tb`, sorted by `country`, `year`, and `column_name`, with the index reset (using `drop=True`).
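
The "nan" string pitfall is easiest to see on a toy table. The sketch below uses invented rows and is only one possible way to chain the steps described above:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["Portugal", "nan", "Brazil", "Brazil"],
    "year": [2001, 2001, 2000, 2000],
    "column_name": ["m014", "m014", "m014", "f014"],
    "cases": [3.0, 7.0, np.nan, 2.0],
})

# The literal string "nan" is not recognised as missing
print(toy["country"].isnull().sum())  # 0

# Turn the "nan" strings back into real missing values
toy["country"] = toy["country"].replace("nan", np.nan)

cleaned = (
    toy.dropna(subset=["cases", "country"])  # drop if either is missing
       .astype({"cases": int})               # safe once NaN is gone
       .sort_values(["country", "year", "column_name"])
       .reset_index(drop=True)
)
print(cleaned)
```

The cast to `int` has to come after dropping the missing values, since NaN cannot be stored in a plain integer column.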

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(clean_tidy_tb, pd.DataFrame), "The result should be a dataframe."
assert clean_tidy_tb.shape == (38619, 4), "The shape of the dataframe is not correct."
assert hashlib.sha256(json.dumps(sorted(tidy_tb['country'])).encode()).hexdigest() == '43543c9e06fe9846897c269db635da02fc76dae4527775062a2f66efc707e87e', "There is something wrong with the values in the country column."
assert hashlib.sha256(json.dumps(sorted(tidy_tb['cases'])).encode()).hexdigest() == '8a294cd49c8fc1b29b60893e45426a80cb9681be4c699ae950c7103552cc7153', "There is something wrong with the values in the cases column."

## Exercise 7 - Multiple Variables stored in one Column

Our `clean_tidy_tb` is looking better, but now we need to address the problem of having multiple variables stored in the `column_name` column. Let's fix that in a few steps.

### Exercise 7.1 

From the `column_name` column, extract the codes for female/male to a new column named `sex` and the codes for the age to a new column `age`. Use pandas `str.extract` to do this. Drop all missing values afterwards.
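
As a sketch of how `str.extract` splits one string into several columns, the example below applies a regular expression with two capture groups to made-up names. Both the names (`new_sp_m014` and so on) and the pattern are assumptions for illustration; the pattern needed for the real column names may differ:

```python
import pandas as pd

toy = pd.DataFrame(
    {"column_name": ["new_sp_m014", "new_sp_f65", "new_sp_mu", "new_sp"]}
)

# Group 1 captures the sex code, group 2 the age code at the end of the name
pattern = r"_([mf])(\d+|u)$"
toy[["sex", "age"]] = toy["column_name"].str.extract(pattern)

# Names that do not match the pattern produce NaN in both new columns
toy = toy.dropna(subset=["sex", "age"])
print(toy)
```

Each capture group becomes one output column, so a pattern with two groups fills exactly the two new columns.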

In [None]:
#clean_tidy_tb[["sex", "age"]] =

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(clean_tidy_tb, pd.DataFrame), "The result should be a dataframe."
assert clean_tidy_tb.shape == (35552, 6), "The shape of your dataframe is off. Have you dropped the missing values?"
assert hashlib.sha256(json.dumps(sorted(clean_tidy_tb['sex'].unique())).encode()).hexdigest() == '1a336f5ee71cf591bfd047e8facc048011b4b2bb760743e979ebe7c445dacf1b', "The values in the sex column are not ok."
assert hashlib.sha256(json.dumps(sorted(clean_tidy_tb['age'].unique())).encode()).hexdigest() == '70a6918917681862857b955b14e70f7bc68d0050382ef81945fe963e48135f10', "The values in the age column are not ok."

### Exercise 7.2

The values in the `age` column are not very easy to understand. Use the `decode_age` dictionary to convert them to a more readable format. Drop any row where the values could not be converted.
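
Decoding with a dictionary can be sketched with the pandas `map()` method. The example uses a reduced, two-entry stand-in for the dictionary (`toy_decode`) and an invented code `"04"` that is deliberately absent from it:

```python
import pandas as pd

# Reduced stand-in for the decoding dictionary
toy_decode = {"014": "0-14", "65": "65+"}

ages = pd.Series(["014", "65", "04"])

# Codes missing from the dictionary are mapped to NaN
decoded = ages.map(toy_decode)
n_unmapped = decoded.isna().sum()
print(n_unmapped)  # 1

# Drop the values that could not be converted
decoded = decoded.dropna()
print(decoded.tolist())  # ['0-14', '65+']
```

Since `map()` returns NaN for any key it cannot find, dropping missing values afterwards removes exactly the rows that could not be decoded.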

In [None]:
decode_age =   {
        "014": "0-14",
        "1524": "15-24",
        "2534": "25-34",
        "3544": "35-44",
        "4554": "45-54",
        "5564": "55-64",
        "65": "65+",
        "u": "unknown",
    }

In [None]:
# clean_tidy_tb["age"] = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert clean_tidy_tb.shape == (33962, 6), "The shape of your dataframe is off. Have you dropped the missing values?"
assert hashlib.sha256(json.dumps(sorted(clean_tidy_tb['age'].unique())).encode()).hexdigest() == '8135dd0c090f9073cbb69a7bbacefd8ad0ecdb6e26415ece93e3fb5f8f5d17e6', "The decoding did not work as expected."

### Exercise 7.3

Finally, save in `final_tb_df` the dataframe with just the columns "country", "year", "sex", "age" and "cases".

In [None]:
# final_tb_df = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(final_tb_df, pd.DataFrame), "The result should be a dataframe."
assert final_tb_df.shape == (33962, 5), "The shape of your dataframe is off."
assert sorted(final_tb_df.columns) == ['age', 'cases', 'country', 'sex', 'year'], "The column names are not as expected."

Congratulations!!! You're a data cleaning master!

<img src="media/good-job.jpg"  width="400">