# Programming in Python for Data Science 

# Assignment 8: A Slice of NumPy and Advanced Data Wrangling

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

Below are links to two videos that explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA) and,       
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).       

### Assignment Learning Goals:

By the end of the module, students are expected to:

- Use [NumPy](https://numpy.org/) to create ndarrays with `np.array()` and from functions such as `np.arange()`, `np.linspace()` and `np.ones()`.
- Describe the shape, dimension and size of an array.
- Identify null values in a dataframe and manage them by removing them using [`.dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) or replacing them using [`.fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html).
- Manipulate non-standard date/time formats into standard Pandas datetime using [`pd.to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html).
- Find and replace text in a dataframe using verbs such as [`.replace()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) and [`.str.contains()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html).
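
As a quick refresher on the first two goals, here is a minimal sketch of the array-creation functions and of the `shape`, `ndim` and `size` attributes. The variable names and values are illustrative only and are not part of the assignment.

```python
import numpy as np

# Three common ways to build an ndarray without typing every value.
evenly_stepped = np.arange(0, 10, 2)   # start, stop (exclusive), step
evenly_spaced = np.linspace(0, 1, 5)   # start, stop (inclusive), number of points
all_ones = np.ones((2, 3))             # shape given as a tuple

print(evenly_stepped)   # [0 2 4 6 8]
print(evenly_spaced)    # [0.   0.25 0.5  0.75 1.  ]

# shape: length along each axis, ndim: number of axes, size: total element count.
print(all_ones.shape, all_ones.ndim, all_ones.size)   # (2, 3) 2 6
```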


This assignment covers [Module 8](https://prog-learn.mds.ubc.ca/en/module8) of the online course. You should complete this module before attempting this assignment.

Any place you see `...`, you must fill in the function, variable, or data to complete the code. Substitute the `None` and the `raise NotImplementedError # No Answer - remove if you provide an answer` with your completed code and answers then proceed to run the cell!

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct. These questions are worth 2 points.

In [187]:
# Import libraries needed for this lab
import pandas as pd
import numpy as np
import test_assignment8 as t
from hashlib import sha1
import altair as alt
import inspect

In [188]:
import datetime as dt

## 1.  Using NumPy 

**Question 1(a)** <br> {points: 1}  

Create a slice from `arr` named `answer_1a` of the values `[1,5,9]`.
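
For reference, slicing with a step follows the pattern `start:stop:step`. The sketch below uses a made-up array of letters (not the `arr` from this question) to show the idea.

```python
import numpy as np

letters = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g'])

# start:stop:step, where stop is exclusive and omitted parts take their defaults.
every_third = letters[0:7:3]
print(every_third)   # ['a' 'd' 'g']

# A basic slice is a view, so it shares memory with the original array.
print(np.shares_memory(letters, every_third))   # True
```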

In [189]:
arr = np.arange(1, 11)
# Take every 4th element starting at index 0 (indices 0, 4 and 8).
answer_1a = arr[::4]
display(answer_1a)

array([1, 5, 9])

In [190]:
t.test_1a(answer_1a)

'Success'

**Question 1(b)** <br> {points: 1}  

Create a 2d array named `answer_1b` of shape (2,2) filled with value 3.4 using `np.full()`.

In [191]:
?np.full

Signature: np.full(shape, fill_value, dtype=None, order='C', *, like=None)
Docstring:
Return a new array of given shape and type, filled with `fill_value`.

Parameters
----------
shape : int or sequence of ints
    Shape of the new array, e.g., ``(2, 3)`` or ``2``.
fill_value : scalar or array_like
    Fill value.
dtype : data-type, optional
    The desired data-type for the array  The default, None, means
     ``np.array(fill_value).dtype``.
order : {'C', 'F'}, optional
    Whether to store multidimensional data in C- or Fortran-contiguous
    (row- or column-wise) order in memory.
like : array_like
    Reference object to allow the creation of arrays which are not
    NumPy arrays. If an array-like passed in as ``like``

In [192]:
answer_1b = np.full(
    shape = (2,2),
    fill_value = 3.4
)

display(answer_1b)


array([[3.4, 3.4],
       [3.4, 3.4]])

In [193]:
t.test_1b(answer_1b)

'Success'

**Question 1(c)** <br> {points: 1}  

Create a 3d array named `answer_1c` of shape (2, 3, 4) using `np.ones()`.

In [194]:
answer_1c = np.ones(
    shape = (2, 3, 4)
)
display(answer_1c)

array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]])

In [195]:
t.test_1c(answer_1c)

'Success'

**Question 1(d)** <br> {points: 2}  

Which of the following arrays are two dimensional? 

 `array_1 = np.array([1, 4, 5, 6])`

 `array_2 = np.array([[1, 4, 5, 6]])`

 `array_3 = np.array([[1], [4], [5], [6]])`

 `array_4 = np.array([[[1, 4]], [[5, 6]]])`

Save all possible answers as strings within a list, choosing from the four arrays above.

***Example:***    

`answer1_d = ['array_1', 'array_2']`
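
One way to reason about dimensionality is to count the levels of nested brackets and confirm with the `ndim` attribute. The arrays below are hypothetical examples, deliberately different from the four in the question.

```python
import numpy as np

flat = np.array([7, 8, 9])               # one level of brackets
row = np.array([[7, 8, 9]])              # two levels
stacked = np.array([[[7], [8], [9]]])    # three levels

for name, example in [('flat', flat), ('row', row), ('stacked', stacked)]:
    print(name, example.ndim, example.shape)
# flat 1 (3,)
# row 2 (1, 3)
# stacked 3 (1, 3, 1)
```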


In [196]:
array_1 = np.array([1, 4, 5, 6])
array_2 = np.array([[1, 4, 5, 6]])
array_3 = np.array([[1], [4], [5], [6]])
array_4 = np.array([[[1, 4]], [[5, 6]]])

In [197]:
print(f'Array 1 Shape {array_1.shape}')
print(f'Array 2 Shape {array_2.shape}')
print(f'Array 3 Shape {array_3.shape}')
print(f'Array 4 Shape {array_4.shape}')

Array 1 Shape (4,)
Array 2 Shape (1, 4)
Array 3 Shape (4, 1)
Array 4 Shape (2, 1, 2)


In [198]:
answer1_d = ['array_2', 'array_3']

In [199]:
# Check that the variable exists
assert 'answer1_d' in globals(), "Please make sure that your solution is named 'answer1_d'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

## 2. DateTime Wrangling 

<a href="https://en.wikipedia.org/wiki/Chopped_(TV_series)" target="_blank">Chopped</a> is a cooking show aired in North America where 4 contestants must prepare a dish that incorporates unusual basket ingredients unknown to the contestants beforehand. The dishes are then presented to a panel of three celebrity chef judges where the contestant of the least liked dish is "chopped" from the competition. There are 3 rounds in the contest ("Appetizer", "Entrée", and "Dessert") and the winner of the final round is deemed the "Chopped Champion". 

[This Chopped open-source dataset](https://www.kaggle.com/jeffreybraun/chopped-10-years-of-episode-data) allows us to identify some insights into this popular TV series. 


**Question 2(a)** <br> {points: 1}  

Load in the data, parsing the `air_date` column as a `datetime64` dtype.     
Save the dataframe as an object named `chopped`.
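
There are two common routes to a datetime column: parse it while reading the file, or convert it afterwards with `pd.to_datetime()`. The sketch below uses a small in-memory CSV (a made-up stand-in, not the Chopped file) so that it runs on its own.

```python
import io
import pandas as pd

csv_text = "episode,air_date\n1,2021-03-02\n2,2021-03-09\n"

# Option 1: parse the column while reading.
parsed_on_read = pd.read_csv(io.StringIO(csv_text), parse_dates=['air_date'])

# Option 2: read it as text, then convert it.
converted_later = pd.read_csv(io.StringIO(csv_text))
converted_later['air_date'] = pd.to_datetime(converted_later['air_date'])

# Both columns now report a datetime64 dtype.
print(parsed_on_read['air_date'].dtype)
print(converted_later['air_date'].dtype)
```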

In [200]:
chopped = pd.read_csv(
    'data/chopped.csv',
    parse_dates = ['air_date']
)

display(chopped['air_date'].dtypes)

dtype('<M8[ns]')

In [201]:
t.test_2a(chopped)

'Success'

**Question 2(b)** <br> {points: 2}  

Determine how long the show has been airing for (in years) by looking at the earliest and latest air dates.

Save the result as an object named `air_length_yrs`. 
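
Subtracting two timestamps gives a `Timedelta`, whose `.days` attribute can be divided by an average year length. A minimal sketch with made-up dates follows; using `.min()` and `.max()` means the rows do not need to be sorted.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2015-06-01', '2010-01-01', '2020-01-01']))

span = dates.max() - dates.min()   # a pandas Timedelta
span_years = span.days / 365.25    # average year length, leap years included

print(span.days)              # 3652
print(round(span_years, 2))   # 10.0
```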

In [202]:
display(f"First Row : {chopped.iloc[0]['air_date']}")
display(f"Final Row : {chopped.iloc[-1]['air_date']}")

'First Row : 2009-01-13 00:00:00'

'Final Row : 2020-07-28 00:00:00'

In [203]:
# chopped.sort_values(by = 'air_date', ascending = True, inplace = True)

In [204]:
display(f"First Row : {chopped.iloc[0]['air_date']}")
display(f"Final Row : {chopped.iloc[-1]['air_date']}")

'First Row : 2009-01-13 00:00:00'

'Final Row : 2020-07-28 00:00:00'

In [205]:
# Assumes the rows are in chronological order, so the first and last rows hold the earliest and latest air dates.
air_length_yrs = (chopped.iloc[-1]['air_date'] - chopped.iloc[0]['air_date']).days
days_per_year = 365.25 # Average number of days per year; the extra 0.25 accounts for leap years.
air_length_yrs /= days_per_year

air_length_yrs = round(air_length_yrs, 2) # This will round your answer to 2 decimal places. Do not delete! 
display(air_length_yrs)

11.54

In [206]:
# Check that the variable exists
assert 'air_length_yrs' in globals(), "Please make sure that your solution is named 'air_length_yrs'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 2(c)** <br> {points: 1}  

How many days are between each of the 569 episodes?  
Save this as an object named `days_apart`. 

*Hints:* 
- You may need to use `.diff()` and `days_apart` should have 568 rows.        
- Here you are measuring the time between episodes. `diff()` produces a Series that has a `NaT` value in the first row, since there is no earlier episode to calculate an interval from. We need to remove this row. Although there are 569 episodes, the number of intervals *between* episodes is 568.  
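
The behaviour described in the hints can be seen on a tiny, made-up Series of three dates: three episodes produce two intervals once the leading `NaT` is dropped.

```python
import pandas as pd

air_dates = pd.Series(pd.to_datetime(['2021-01-05', '2021-01-12', '2021-01-26']))

gaps = air_dates.diff()                       # the first value is NaT
gaps = gaps.dropna().reset_index(drop=True)   # keep only the real intervals

print(len(air_dates), len(gaps))   # 3 2
print(gaps.dt.days.tolist())       # [7, 14]
```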


In [207]:
# Drop the leading NaT and renumber the index from 0.
days_apart = chopped['air_date'].diff().dropna().reset_index(drop = True)
display(days_apart)

0      7 days
1      7 days
2      7 days
3      7 days
4      7 days
        ...  
563    7 days
564    7 days
565    7 days
566   14 days
567    7 days
Name: air_date, Length: 568, dtype: timedelta64[ns]

In [208]:
t.test_2c(days_apart)

'Success'

**Question 2(d)** <br> {points: 1}  

Of these inter-episode intervals, what fraction of them were not aired on a weekly basis? 

Save the result in an object named `irregular_aired_fraction`.
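
Because `True` counts as 1 and `False` as 0, the mean of a boolean Series is the fraction of `True` values. A small sketch with hypothetical intervals:

```python
import pandas as pd

gaps = pd.Series(pd.to_timedelta([7, 7, 14, 7], unit='D'))

not_weekly = gaps != pd.Timedelta(days=7)   # boolean Series
print(not_weekly.sum())    # 1
print(not_weekly.mean())   # 0.25
```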


In [209]:
# The mean of a boolean Series is the fraction of True values.
irregular_aired_fraction = (days_apart != dt.timedelta(days = 7)).mean()
display(irregular_aired_fraction)

0.477112676056338

In [210]:
t.test_2d(irregular_aired_fraction)

'Success'

**Question 2(e)** <br> {points: 1}  

Make a new dataframe named `chopped2` that contains an additional column named `weekday_aired` that specifies the day of the week on which each episode aired.

*Hint: you'll need to use `.dt.day_name()`* 
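
The `.dt` accessor exposes datetime methods on a whole column at once. Here is a minimal sketch on a hypothetical three-row dataframe; `.assign()` returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

shows = pd.DataFrame({
    'air_date': pd.to_datetime(['2021-03-01', '2021-03-02', '2021-03-06'])
})

shows_with_day = shows.assign(weekday=shows['air_date'].dt.day_name())

print(shows_with_day['weekday'].tolist())   # ['Monday', 'Tuesday', 'Saturday']
print('weekday' in shows.columns)           # False
```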


In [211]:
?pd.Series.dt.day_name

Signature: pd.Series.dt.day_name(self, *args, **kwargs)
Docstring:
Return the day names of the DateTimeIndex with specified locale.

Parameters
----------
locale : str, optional
    Locale determining the language in which to return the day name.
    Default is English locale.

Returns
-------
Index
    Index of day names.

Examples
--------
>>> idx = pd.date_range(start='2018-01-01', freq='D', periods=3)
>>> idx
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'],
              dtype='datetime64[ns]', freq='D')
>>> idx.day_name()
Index(['Monday', 'Tuesday', 'Wednesday'], dtype='object')
File:      c:\users\muntakim\appdata\local\programs\python\python310\lib\site-packages\pandas\core\accessor.py
Type:      function


In [212]:
chopped2 = chopped.assign(weekday_aired = lambda x : x['air_date'].dt.day_name())
display(chopped2.head())
display(chopped2.tail())

Unnamed: 0,season,season_episode,series_episode,episode_name,episode_notes,air_date,judge1,judge2,judge3,appetizer,...,dessert,contestant1,contestant1_info,contestant2,contestant2_info,contestant3,contestant3_info,contestant4,contestant4_info,weekday_aired
0,1,1,1,"""Octopus, Duck, Animal Crackers""",This is the first episode with only three offi...,2009-01-13,Marc Murphy,Alex Guarnaschelli,Aarón Sánchez,"baby octopus, bok choy, oyster sauce, smoked ...",...,"prunes, animal crackers, cream cheese",Summer Kriegshauser,Private Chef and Nutrition Coach New York NY,Perry Pollaci,Private Chef and Sous chef Bar Blanc New Yo...,Katie Rosenhouse,Pastry Chef Olana Restaurant New York NY,Sandy Davis,Catering Chef Showstoppers Catering at Union...,Tuesday
1,1,2,2,"""Tofu, Blueberries, Oysters""",This is the first of a few episodes with five ...,2009-01-20,Aarón Sánchez,Alex Guarnaschelli,Marc Murphy,"firm tofu, tomato paste, prosciutto",...,"phyllo dough, gorgonzola cheese, pineapple ri...",Raymond Jackson,Private Caterer and Culinary Instructor West...,Klaus Kronsteiner,Chef de cuisine Liberty National Golf Course...,Christopher Jackson,Executive Chef and Owner Ted and Honey Broo...,Pippa Calland,Owner and Chef Chef for Hire LLC Newville PA,Tuesday
2,1,3,3,"""Avocado, Tahini, Bran Flakes""",,2009-01-27,Aarón Sánchez,Alex Guarnaschelli,Marc Murphy,"lump crab meat, dried shiitake mushrooms, pin...",...,"brioche, cantaloupe, pecans, avocados",Margaritte Malfy,Executive Chef and Co-owner La Palapa New Y...,Rachelle Rodwell,Chef de cuisine SoHo Grand Hotel New York NY,Chris Burke,Private Chef New York NY,Andre Marrero,Chef tournant L’Atelier de Joël Robuchon Ne...,Tuesday
3,1,4,4,"""Banana, Collard Greens, Grits""","In the appetizer round, Chef Chuboda refused t...",2009-02-03,Scott Conant,Amanda Freitag,Geoffrey Zakarian,"ground beef, wonton wrappers, cream of mushro...",...,"maple syrup, black plums, almond butter, waln...",Sean Chudoba,Executive Chef Ayza Wine Bar New York NY,Kyle Shadix,Chef Registered Dietician and Culinary Consu...,Luis Gonzales,Executive Chef Knickerbocker Bar & Grill Ne...,Einat Admony,Chef and Owner Taïm New York NY,Tuesday
4,1,5,5,"""Yucca, Watermelon, Tortillas""",,2009-02-10,Geoffrey Zakarian,Alex Guarnaschelli,Marc Murphy,"watermelon, canned sardines, pepper jack chee...",...,"flour tortillas, prosecco, Canadian bacon, ro...",John Keller,Personal Chef New York NY,Andrea Bergquist,Executive Chef New York NY,Ed Witt,Executive Chef / Wine Director Bloomingdale ...,Josh Emett,Chef de cuisine Gordon Ramsay at The London ...,Tuesday


Unnamed: 0,season,season_episode,series_episode,episode_name,episode_notes,air_date,judge1,judge2,judge3,appetizer,...,dessert,contestant1,contestant1_info,contestant2,contestant2_info,contestant3,contestant3_info,contestant4,contestant4_info,weekday_aired
564,45,9,563,"""Terrine Cuisine""",Chef Jose cut himself in the first round and c...,2020-06-23,Chris Santos,Scott Conant,Erik Ramirez,"rabbit terrine, guanciale, spring garlic, bur...",...,"feta ice cream, pears, blueberry ketchup, cho...",Jose Luis Chavez,Chef and Owner from New York NY,Matt Greiner,Executive Chef from Raleigh NC,Mimi Weissenborn,Executive Chef from Harlem NY,Nemo Bolin,Chef and Owner from Providence RI,Tuesday
565,45,10,564,"""Time and Turmoil""",Chef Arden forgot an ingredient in the first r...,2020-06-30,Amanda Freitag,Maneet Chauhan,Scott Conant,"hash brown patties, Manila clams, escarole, b...",...,"boozy cranberry gelatin, cherry scones, necta...",Lindsay Smith-Rosales,Chef and Owner from Laguna Beach CA,Arden Lewis,Executive Chef from New York NY,Lina Zarcaro,Private Chef from Bradley Beach NJ,Luca Annunziata,Executive Chef from Charlotte NC,Tuesday
566,45,11,565,"""Jarring Jars""",The guest judge in this episode was Chef Ray G...,2020-07-07,Scott Conant,Geoffrey Zakarian,Ray Garcia,"sea beans, dehydrated carrot sticks, egg coff...",...,"guava, kefir, honeycomb, pickled pig lips",May Siricharoen,Executive Chef from Los Angeles CA,Chris Day,Executive Sous Chef from Boston MA,Patrick McKee,Executive Chef from Portland OR,Phillip Esteban,Research & Development Chef from San Diego CA,Tuesday
567,45,12,566,"""Cauliflower Power""",In this unofficially vegetarian themed episode...,2020-07-21,Maneet Chauhan,Marc Murphy,Esther Choi,"cauliflower avocado toast, cauliflower rice, ...",...,"cauliflower oatmeal, halo-halo fruit mix, red...",Manjit Manohar,Executive Sous Chef from New York NY,Edy Massih,Private Chef and Caterer from Brooklyn NY,Megan Marlow,Executive Chef and Owner from Los Angeles CA,Kei Ohdera,Chef and Owner from Portland OR,Tuesday
568,45,13,567,"""Quail Without Fail""",,2020-07-28,Chris Santos,Maneet Chauhan,Geoffrey Zacharian,"gopchang, ghost pepper aioli, nopales, hominy",...,"chicken salt, syrniki, passion fruit, cajeta",Bryant Kryck,Executive Chef from Portland OR,Caroline Hough,Chef de Cuisine from Philadelphia PA,Marco Maestoso,Chef and Owner from San Diego CA,Calin Sauvron,Executive Chef from Bethel CT,Tuesday


In [213]:
t.test_2e(chopped2)

'Success'

**Question 2(f)** <br> {points: 1}  

Most Chopped episodes are aired on a `Tuesday`. How many were not? 
Save this value in an object named `irregular_airdays`.


In [214]:
irregular_airdays = len(chopped2.query('weekday_aired != "Tuesday"'))
display(irregular_airdays)

94

In [215]:
t.test_2f(irregular_airdays)

'Success'

**Question 2(g)** <br> {points: 2}  

How many of the 45 Chopped seasons had a perfectly consistent schedule, with each episode being released exactly one week after the previous one?
Save this value in an object named `num_perfect_season`.

*Hint:*

* You may find some of the skills you used in 2(c) and 2(d) helpful here. 
* To loop over all the groups in a groupby object you can use the syntax `for name, group in data.groupby(['grouping_column']):`.
* For a season to have a consistent airing schedule, both the max and min days between episodes must equal 7.
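
The looping pattern from the hints can be sketched on a tiny, invented dataframe with two seasons. The column names mirror the Chopped data, but the dates are made up. The grouping key is passed as a plain string here so that each group name is a scalar rather than a tuple.

```python
import pandas as pd

episodes = pd.DataFrame({
    'season': [1, 1, 1, 2, 2, 2],
    'air_date': pd.to_datetime([
        '2021-01-05', '2021-01-12', '2021-01-19',   # season 1: strictly weekly
        '2021-06-01', '2021-06-08', '2021-06-22',   # season 2: one two-week gap
    ]),
})

week = pd.Timedelta(days=7)
weekly_seasons = []
for season, group in episodes.groupby('season'):
    # Intervals within this season only; the leading NaT is dropped.
    gaps = group['air_date'].diff().dropna()
    if gaps.min() == week and gaps.max() == week:
        weekly_seasons.append(int(season))

print(weekly_seasons)   # [1]
```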

In [216]:
?dt.timedelta

Init signature: dt.timedelta(self, /, *args, **kwargs)
Docstring:
Difference between two datetime values.

timedelta(days=0, seconds=0, microseconds=0, milliseconds=0, minutes=0, hours=0, weeks=0)

All arguments are optional and default to 0.
Arguments may be integers or floats, and may be positive or negative.
File:           c:\users\muntakim\appdata\local\programs\python\python310\lib\datetime.py
Type:           type
Subclasses:     _Timedelta


In [217]:
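# Sort so that differences are taken between consecutive episodes within each season.
regular_aired = chopped[['season', 'air_date']].sort_values(['season', 'air_date'], ascending = True)

num_perfect_season = 0
for name, season in regular_aired.groupby('season') :
    # Intervals between consecutive episodes of this season; the leading NaT is dropped.
    # A separate name is used so that the `days_apart` answer from 2(c) is not overwritten.
    season_gaps = season['air_date'].diff().dropna()
    # A perfectly weekly season has both a minimum and a maximum gap of exactly 7 days.
    if (season_gaps.min() == dt.timedelta(days = 7)) and (season_gaps.max() == dt.timedelta(days = 7)) :
        num_perfect_season += 1
display(num_perfect_season)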

3

In [218]:
t.test_2g(num_perfect_season)

'Success'

## 3. Cleaning a dataframe with Strings and Handling missing values

**Question 3** <br> {points: 8}  

Now that you have learned about string operations and the basics of regular expressions, let's see you apply your skills to a real dataset. 

In this exercise, you will start with the dirty version of the `Gapminder` dataset that we've seen before. By "dirty" we mean that there are some inconsistencies and irregularities in the dataset, as one would typically find with real-world data. Your task is to write a function named `cleaned_gapminder` that takes in this dataset as an argument and returns a cleaned-up dataframe. The goal of this exercise is to use Python code to clean up `dirty_gapminder` to the point that it's identical to `clean_gapminder`. 

Note: in the real world you wouldn't have a `clean_gapminder` reference to compare to!

Things you might want to do to clean up `dirty_gapminder`:

1. We recommend first writing code that cleans this dataset and then moving it all into a function after. 
1. If there is missing data (NaNs or empty strings) fill it in with sensible values.
1. Check that all values match those in `clean_gapminder` (e.g., check capitalization, spelling, grammar, etc).
1. There may be entries that appear to have the exact same spelling and capitalization in both the dirty and clean gapminder datasets, but still don't match... Extra whitespace is often a frustrating (and invisible) problem when wrangling text data. You can use `print('**' + x + '**')` to identify any strings with whitespace and `Series.str.strip()` to trim unwanted whitespace around a string. 
1. When you are ready, test that your dirty dataframe matches the clean gapminder data using `df.equals()`.
1. Since you are writing a function named `cleaned_gapminder`, our autograding tests will check that your function contains certain code and returns the expected output.
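
A few of the steps above can be sketched on a small, invented dataframe. The values, the `'canada'` misspelling and the `'Americas'` fill value are assumptions for illustration only; the real dataset needs its own inspection.

```python
import numpy as np
import pandas as pd

messy = pd.DataFrame({
    'country': ['Canada ', 'canada', ' Chile'],
    'continent': ['Americas', np.nan, ''],
})

# Wrapping each value in markers makes stray whitespace visible.
for value in messy['country']:
    print('**' + value + '**')

tidy = messy.assign(
    # Trim whitespace first, then fix the capitalization of a known entry.
    country=messy['country'].str.strip().replace({'canada': 'Canada'}),
    # Treat empty strings as missing, then fill every missing value.
    continent=messy['continent'].replace('', np.nan).fillna('Americas'),
)

print(tidy['country'].tolist())     # ['Canada', 'Canada', 'Chile']
print(tidy['continent'].tolist())   # ['Americas', 'Americas', 'Americas']
```

Stripping whitespace before replacing values matters: a replacement that targets `'canada'` would not match `'canada '`.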

Hint: We've provided a unit test for you to compare the two dataframes after wrangling. However, during your wrangling you can check the equality of individual elements in two dataframes using `df.eq()`. If your dataframes are `df1` and `df2`, you can check which rows are not equal using `df1[(~df2.eq(df1)).any(axis=1)]` (You've seen something of this nature in Module 3).
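
To make the hint concrete, here is a minimal sketch with two hypothetical three-row dataframes that differ in a single cell.

```python
import pandas as pd

df1 = pd.DataFrame({'country': ['Chile', 'china', 'Cuba'], 'year': [1952, 1952, 1952]})
df2 = pd.DataFrame({'country': ['Chile', 'China', 'Cuba'], 'year': [1952, 1952, 1952]})

# Element-wise comparison, then flag every row with at least one mismatch.
mismatch_mask = (~df2.eq(df1)).any(axis=1)

print(df1[mismatch_mask])   # only the row at index 1
print(df1.equals(df2))      # False
```

Note that `.eq()` treats two `NaN` values as unequal, so rows with missing data on both sides are also flagged, whereas `.equals()` considers `NaN` values in the same position equal.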



In [219]:
dirty = pd.read_csv('data/dirty_gapminder.csv')
display(dirty.head())
display(dirty.tail())

Unnamed: 0,year,pop,lifeExp,gdpPercap,continent,country
0,1952,8425333.0,28.801,779.445314,Asia,Afghanistan
1,1957,9240934.0,30.332,820.85303,Asia,Afghanistan
2,1962,10267083.0,31.997,853.10071,Asia,Afghanistan
3,1967,11537966.0,34.02,836.197138,Asia,Afghanistan
4,1972,13079460.0,36.088,739.981106,Asia,Afghanistan


Unnamed: 0,year,pop,lifeExp,gdpPercap,continent,country
1699,1987,9216418.0,62.351,706.157306,Africa,Zimbabwe
1700,1992,10704340.0,60.377,693.420786,Africa,Zimbabwe
1701,1997,11404948.0,46.809,792.44996,Africa,Zimbabwe
1702,2002,11926563.0,39.989,672.038623,Africa,Zimbabwe
1703,2007,12311143.0,43.487,469.709298,Africa,Zimbabwe


In [220]:
clean = pd.read_csv('data/clean_gapminder.csv')
display(clean.head())
display(clean.tail())

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
0,Afghanistan,1952,8425333.0,Asia,28.801,779.445314
1,Afghanistan,1957,9240934.0,Asia,30.332,820.85303
2,Afghanistan,1962,10267083.0,Asia,31.997,853.10071
3,Afghanistan,1967,11537966.0,Asia,34.02,836.197138
4,Afghanistan,1972,13079460.0,Asia,36.088,739.981106


Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
1699,Zimbabwe,1987,9216418.0,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340.0,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948.0,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563.0,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143.0,Africa,43.487,469.709298


The code below shows that there are 28 rows in total that are not equal between the two dataframes.

In [221]:
dirty[(~clean.eq(dirty)).any(axis=1)].shape

(28, 6)

In [233]:
def cleaned_gapminder(dirty_df):
    '''
    Clean the GapMinder DataFrame to be the same as the Cleaned DataFrame.

    Parameters
    ----------
    dirty_df : pandas.core.frame.DataFrame
        The dataframe to clean     

    Returns
    -------
    pandas.core.frame.DataFrame
        The cleaned GapMinder DataFrame 

    Examples
    --------
    >>> cleaned_gapminder(dirty)

    '''
    dirty_df = dirty_df[[
        'country',
        'year',
        'pop',
        'continent',
        'lifeExp',
        'gdpPercap',
    ]].replace(
        to_replace = 'china',
        value = 'China',
    ).replace(
        to_replace = 'Central african republic',
        value = 'Central African Republic',
    ).dropna(how = 'all').drop_duplicates()

    # Sort the DataFrame in ascending order by country, then year.
    dirty_df = dirty_df.sort_values(
        ['country', 'year']
    ).reset_index(drop = True)
    
    # Assign the continent 'Americas' to Canada.
    dirty_df.loc[dirty_df['country'] == 'Canada', 'continent'] = 'Americas'
    
    # Remove Leading/Trailing Whitespaces
    dirty_df['country'] = dirty_df['country'].str.strip()
    dirty_df['continent'] = dirty_df['continent'].str.strip()

    return dirty_df

cleaned_data = cleaned_gapminder(dirty)

# display(cleaned_data.head())
# display(cleaned_data.tail())

# display(cleaned_data.query(f'country == "Czech Republic"'))
# display(clean.query(f'country == "Czech Republic"'))

df_1 = cleaned_data[(~clean.eq(cleaned_data)).any(axis=1)]
df_2 = clean[(~cleaned_data.eq(clean)).any(axis=1)]

# for col in df_1.columns :
#     if not (df_1.iloc[0][col] == df_2.iloc[0][col]) :
#         print(f'DF1 {col} : {df_1.iloc[0][col]}')
#         print(f'DF2 {col} : {df_2.iloc[0][col]}')

display(df_1.tail())
display(df_2.tail())

# assert cleaned_data.equals(clean), "Dataframes are not the same!"

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
403,Czech Republic,1992,10315702.0,Europe,72.4,14297.02122
404,Czech Republic,1997,10300707.0,Europe,74.01,16048.51424
405,Czech Republic,2002,10256295.0,Europe,75.51,17596.21022
406,Czech Republic,2007,10228744.0,Europe,76.486,22833.30851
407,Democratic Republic of the Congo,1972,23007669.0,Africa,45.989,904.896068


Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
403,Czech Republic,1987,10311597.0,Europe,71.58,16310.4434
404,Czech Republic,1992,10315702.0,Europe,72.4,14297.02122
405,Czech Republic,1997,10300707.0,Europe,74.01,16048.51424
406,Czech Republic,2002,10256295.0,Europe,75.51,17596.21022
407,Czech Republic,2007,10228744.0,Europe,76.486,22833.30851


In [232]:
t.test_3(cleaned_gapminder,dirty,clean)

AssertionError: Make sure you are replacing the missing values.

Before submitting your assignment please do the following:

- Read through your solutions
- **Restart your kernel and clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success"

This is a simple way to make sure that you are submitting all the variables needed to mark the assignment. This method should help avoid losing marks due to changes in your environment.  

You did it! You got to the end of all 8 assignments in Programming in Python for Data Science. We are all very proud of you here and are excited to see you translate everything you've learned into a final project! 

## Attributions
- Gapminder Dataset - [Gapminder](https://www.gapminder.org/data/)
- UBC's original STAT545 - [Stat545 by Jenny Bryan](https://stat545.com/)
- MDS DSCI 523 - Data Wrangling course - [MDS's GitHub website](https://ubc-mds.github.io/) 
- Chopped Dataset - [Kaggle](https://www.kaggle.com/jeffreybraun/chopped-10-years-of-episode-data)

## Module Debriefing

If this video is not showing up below, click on the cell and click the ▶ button in the toolbar above.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('PCBPzCFQwHs', width=854, height=480)