# SLU01 - Pandas 101: Exercise notebook

In [1]:
# First things first, import:
import pandas as pd
import numpy as np
import hashlib

- In this notebook the following is tested:

    - Creating pandas Series
    - Creating pandas DataFrame
    - Load a dataset from a file
    - Preview a dataframe
    - Convert datatypes in DataFrame

## Exercise 1: Series

### 1.1 Create a pandas Series from a list.

Create a `list` with the names of the items given in the cell below. Then create a `Series` using that list, using the order provided.

Note: Let pandas generate the index on its own. Also, don't forget to delete the raise error!
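
As a minimal sketch of what the note relies on: when no `index` argument is passed, pandas labels the values with the integers 0 to n-1. The `colors` list below is a made-up example and is not part of the exercise.

```python
import pandas as pd

# Hypothetical example data, unrelated to the exercise items.
colors = ["red", "green", "blue"]

# No index argument: pandas generates a default integer index.
colors_series = pd.Series(colors)

print(colors_series.index.tolist())  # [0, 1, 2]
print(colors_series.loc[1])          # green
```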

In [2]:
# items (each of type string): pork, milk, peas, peanuts, potatoes
# items_list = 

# Create a series with the items above, and call it items_series
# items_series = 

### BEGIN SOLUTION
items_list = ['pork', 'milk', 'peas', 'peanuts', 'potatoes']
items_series = pd.Series(items_list)
### END SOLUTION

In [3]:
# run this cell to see the results 
items_series

0        pork
1        milk
2        peas
3     peanuts
4    potatoes
dtype: object

In [4]:
assert(isinstance(items_list, list)), "The data structure used is not the correct one."
assert(len([i for i in items_list if not isinstance(i, str)]) == 0), "The items inside the list should be of type string."
assert(len(items_series) == 5), "The length of the Series is not correct."
assert(hashlib.sha256(''.join(items_list).encode()).hexdigest() == '2b0898564acfeeacc4c2f2ae3fb7e8057552d46849f2523d5d4edb3a234528a2'), "The items in the list don't seem to be correct."
assert(hashlib.sha256(''.join(items_series.values.tolist()).encode()).hexdigest() == '2b0898564acfeeacc4c2f2ae3fb7e8057552d46849f2523d5d4edb3a234528a2'), "The items in the Series don't seem to be correct."
assert(np.equal(items_series.index.to_numpy(), np.arange(0, 5)).sum() == 5), "The Series index doesn't seem to be correct."

### 1.2 CO<sub>2</sub> equivalents

Let's now use the `items_list` as index and build a `Series` containing the kg CO<sub>2</sub> equivalents per item (the amount of CO<sub>2</sub> corresponding to the warming effect of greenhouse gases emitted during production of the item, [ref](https://doi.org/10.1126/science.aaq0216)).
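
A small sketch of the same idea with made-up data: the `index` argument of `pd.Series` supplies the labels, and `data` supplies the values, matched by position. The names `labels` and `weights` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and values, chosen only for illustration.
labels = ["a", "b", "c"]
weights = np.array([1.5, 2.0, 0.25])

# data supplies the values, index supplies the labels (matched by position).
weights_series = pd.Series(data=weights, index=labels)

print(weights_series["b"])            # 2.0
print(weights_series.index.tolist())  # ['a', 'b', 'c']
```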

In [5]:
# RUN THIS CELL FIRST
# kg CO2 equivalents of the foods in items_list:
food_CO2 = np.array([7.6, 3.2, 0.4, 1.2, 0.6]) 

In [6]:
# Create a series which has items_list (created in exercise 1.1) as index, 
# and the respective values in the food_CO2 array given in the cell above as values.
# CO2_eq = 

### BEGIN SOLUTION
CO2_eq = pd.Series(data=food_CO2, index=items_list)
### END SOLUTION

In [7]:
# run this cell to see the results 
CO2_eq

pork        7.6
milk        3.2
peas        0.4
peanuts     1.2
potatoes    0.6
dtype: float64

In [8]:
assert (isinstance(CO2_eq, pd.Series)), "CO2_eq is not a series."
assert(len(CO2_eq) == 5), "The length of CO2_eq is not correct."
assert(np.isclose(CO2_eq.sum(), 13)), 'Something is wrong with the values of the series. Did you use the correct input?'
assert(hashlib.sha256(''.join(CO2_eq.index.tolist()).encode()).hexdigest() == '2b0898564acfeeacc4c2f2ae3fb7e8057552d46849f2523d5d4edb3a234528a2'), 'The order of the index is not correct.'

Cool! In 1 exercise you learned 2 things:
- that a pandas Series can be built using several data structures, such as lists or numpy arrays
- that if you want to reduce your carbon footprint and help save the planet, you should turn to a plant-based diet (<300 g meat and <1.7 kg dairy per week, see here: https://takethejump.org/)
😃 🌍

## Exercise 2: DataFrames

The goal of this exercise is to create a simple DataFrame from several data structures.
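
For orientation, here is a minimal sketch of building a DataFrame from a dictionary whose values are different data structures (a list and a NumPy array). All names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: one plain list and one NumPy array.
cities = ["Lisbon", "Porto", "Faro"]
visits = np.array([3, 1, 2])

# Each dictionary key becomes a column name; index sets the row labels.
df_trips = pd.DataFrame(
    data={"city": cities, "visits": visits},
    index=["first", "second", "third"],
)

print(df_trips.shape)                  # (3, 2)
print(df_trips.loc["second", "city"])  # Porto
```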

In [9]:
# RUN THIS CELL FIRST
# this is the data you'll use to fill each column of your dataframe
github_repos = ['freeCodeCamp', 'free-programming-books', 'awesome', '996.ICU', 'coding-interview-university']
repo_creators = np.array(['freeCodeCamp', 'EbookFoundation', 'sindresorhus', '996icu', 'jwasham'])
number_of_forks = [33599, 57194, 26485, 21497, 69434]
number_of_stars = [374074, 298393, 269997, 267901, 265161]

In [10]:
# Add the data from the lists github_repos, repo_creators, number_of_forks, number_of_stars
# to a dictionary called dict_most_popular_repos:
#   - use the 4 variables created in the cell above to fill the data for each key
#   - each key should be a string containing the name of the corresponding variable.
# dict_most_popular_repos = 

# Create a dataframe called df_most_popular_repos
#   - set an index with the values 'first', 'second', 'third', 'fourth', 'fifth'
#   - use the dictionary created above to populate the dataframe.
# df_most_popular_repos = ...


### BEGIN SOLUTION
dict_most_popular_repos = {
    'github_repos': github_repos,
    'repo_creators': repo_creators,
    'number_of_forks': number_of_forks,
    'number_of_stars': number_of_stars
}

df_most_popular_repos = pd.DataFrame(
    data = dict_most_popular_repos,
    index = ('first','second','third','fourth','fifth')
)
### END SOLUTION

In [11]:
assert(list(dict_most_popular_repos.keys())==['github_repos', 'repo_creators', 'number_of_forks', 'number_of_stars']), "The dictionary keys are not correct."
assert(len(dict_most_popular_repos) == 4), "The length of the dictionary is not correct."
assert(isinstance(dict_most_popular_repos,dict)), 'dict_most_popular_repos is not a dictionary.'
assert(isinstance(df_most_popular_repos, pd.DataFrame)), 'df_most_popular_repos is not a DataFrame'
assert(df_most_popular_repos['github_repos'].tolist()==github_repos), "The 'github_repos' column doesn't look right."
assert(df_most_popular_repos['repo_creators'].tolist()==list(repo_creators)), "The 'repo_creators' column doesn't look right."
assert(df_most_popular_repos['number_of_forks'].tolist()==number_of_forks), "The 'number_of_forks' column doesn't look right."
assert(df_most_popular_repos['number_of_stars'].tolist()==number_of_stars), "The 'number_of_stars' column doesn't look right."
assert(df_most_popular_repos.shape == (5, 4)), 'The size of the dataframe is not correct.'
assert(df_most_popular_repos.index.tolist() == ['first', 'second', 'third', 'fourth', 'fifth']), 'The index is not correct. Reread the instructions.'

## Exercise 3: Loading DataFrames from files

### 3.1 Load a dataset into a dataframe
Let's load a dataset with data about chocolate bars. It is a subset from a Kaggle dataset available [here](https://www.kaggle.com/datasets/rtatman/chocolate-bar-ratings/).
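
`pd.read_csv` accepts a path or a file-like object. The sketch below uses an in-memory `io.StringIO` buffer with made-up content, so it runs without the exercise's data file. The column names and values are hypothetical.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk.
csv_text = "name,score\nalpha,1.5\nbeta,2.0\ngamma,3.25\n"

# The first line is used as the header by default.
df_scores = pd.read_csv(io.StringIO(csv_text))

print(df_scores.head(2))  # first two rows only
print(df_scores.shape)    # (3, 2)
```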

In [12]:
# Load the dataset from the file located at data/chocolate_bars.csv.
# ds_chocolate = 

### BEGIN SOLUTION
ds_chocolate = pd.read_csv('data/chocolate_bars.csv')
### END SOLUTION

In [13]:
# Print the dataframe head() to get an idea of what you've just loaded.
ds_chocolate.head()

| | co | name | date | pct | free | loc | rtg | btp | org |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Artisan du Chocolat | Venezuela | 2010 | 100 | True | U.K. | 1.75 | | Venezuela |
| 1 | Bonnat | One Hundred | 2006 | 100 | True | France | 1.5 | | |
| 2 | Bouga Cacao (Tulicorp) | El Oro, Hacienda de Oro | 2009 | 100 | True | Ecuador | 1.5 | Forastero (Arriba) | Ecuador |
| 3 | C-Amaro | Ecuador | 2013 | 100 | True | Italy | 3.5 | | Ecuador |
| 4 | Claudio Corallo | Principe | 2008 | 100 | True | Sao Tome | 1.0 | Forastero | Sao Tome & Principe |


In [14]:
assert(isinstance(ds_chocolate, pd.DataFrame)), "Something is wrong. ds_chocolate is not a dataframe."
assert(ds_chocolate.shape == (45, 9)), "The shape is not correct. Did you follow all the instructions in the comments?"
assert(sum(ds_chocolate.columns == ['co', 'name', 'date', 'pct', 'free', 'loc', 'rtg', 'btp', 'org']) == 9), "The columns don't look right."
assert(ds_chocolate.rtg.sum()+ds_chocolate.date.sum()+ds_chocolate.pct.sum() == 94886),"The content of the dataframe is not correct"
assert(hashlib.sha256(''.join(ds_chocolate.name.tolist()).encode()).hexdigest() == '9c1403a3605e81d77ee5379fc733705bace8235c4a5cf36f27f9af27440b68f3'), "The content of the dataframe is not correct."

### 3.2 Load a dataset, but this time better

Notice that the column names in the `ds_chocolate` dataframe are not very informative, which makes the data hard to read for anyone looking at it. Instead, we want to load the dataset with the following column names (in this order):
- `'company'` - maker of the chocolate bar
- `'bar_name'` - original name of the chocolate bar
- `'review_date'` - date when the rating was given
- `'cocoa_percentage'`
- `'sugarfree'` - contains sugar or not
- `'company_location'` 
- `'rating'` - from 1 to 5
- `'cocoa_bean_type'` - cocoa variety
- `'cocoa_bean_origin'` - where the bean type originated
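
One way to set column names while loading is the `names` parameter of `pd.read_csv`, combined with `header=0` so that the file's existing header line is replaced. The sketch below uses made-up CSV text and hypothetical column names.

```python
import io
import pandas as pd

# Made-up CSV whose header line uses terse names.
csv_text = "co,rtg\nAcme,3.5\nBeta,2.0\n"

# names= supplies the new column labels; header=0 tells pandas that row 0
# of the file is the old header, so it is replaced instead of read as data.
df_renamed = pd.read_csv(
    io.StringIO(csv_text), names=["company", "rating"], header=0
)

print(df_renamed.columns.tolist())  # ['company', 'rating']
print(df_renamed.shape)             # (2, 2)
```

Without `header=0`, the original header line would be read as the first data row.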

In [15]:
# Load the file at 'data/chocolate_bars.csv' into a dataframe ds_chocolate.
# Set the column names to the ones listed above, in this order.
# You will need to check the documentation at 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html to see how you can do this.
# ds_chocolate = 


### BEGIN SOLUTION
col_names = ['company','bar_name','review_date','cocoa_percentage','sugarfree',
             'company_location','rating','cocoa_bean_type','cocoa_bean_origin']
ds_chocolate = pd.read_csv('data/chocolate_bars.csv', names=col_names, header=0)
### END SOLUTION

In [16]:
# Print the dataframe head() to get an idea of what you've just loaded.
ds_chocolate.head()

| | company | bar_name | review_date | cocoa_percentage | sugarfree | company_location | rating | cocoa_bean_type | cocoa_bean_origin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Artisan du Chocolat | Venezuela | 2010 | 100 | True | U.K. | 1.75 | | Venezuela |
| 1 | Bonnat | One Hundred | 2006 | 100 | True | France | 1.5 | | |
| 2 | Bouga Cacao (Tulicorp) | El Oro, Hacienda de Oro | 2009 | 100 | True | Ecuador | 1.5 | Forastero (Arriba) | Ecuador |
| 3 | C-Amaro | Ecuador | 2013 | 100 | True | Italy | 3.5 | | Ecuador |
| 4 | Claudio Corallo | Principe | 2008 | 100 | True | Sao Tome | 1.0 | Forastero | Sao Tome & Principe |


In [17]:
assert(isinstance(ds_chocolate, pd.DataFrame)), "Something is wrong. ds_chocolate is not a dataframe."
assert(ds_chocolate.shape == (45, 9)), "The shape is not correct. Did you follow all the instructions in the comments?"
assert(sum(ds_chocolate.columns == ['company','bar_name','review_date','cocoa_percentage','sugarfree','company_location','rating','cocoa_bean_type','cocoa_bean_origin']) == 9), "The columns don't look right."
assert(ds_chocolate.rating.sum()+ds_chocolate.review_date.sum()+ds_chocolate.cocoa_percentage.sum() == 94886),"The content of the dataframe is not correct"
assert(hashlib.sha256(''.join(ds_chocolate.bar_name.tolist()).encode()).hexdigest() == '9c1403a3605e81d77ee5379fc733705bace8235c4a5cf36f27f9af27440b68f3'), "The content of the dataframe is not correct."
assert(hashlib.sha256(''.join(ds_chocolate.columns).encode()).hexdigest() == '9681020039c95c14566f818837a704bafa96b014f0438921179a1ff52e2286ed'), "The column names are not correct"

### 3.3 Preview the datatypes

In [18]:
# Store the datatypes of all columns of ds_chocolate in ds_chocolate_dtypes.
# Use the method you learned in the learning notebook.
# ds_chocolate_dtypes = 

# Note: if you used the correct method, 
# the result will be a pandas series containing the datatypes of each column,
# with the index formed by the column names


### BEGIN SOLUTION
ds_chocolate_dtypes = ds_chocolate.dtypes
### END SOLUTION

In [19]:
# Check your output - there should be object, float, bool, and integer types.
ds_chocolate_dtypes

company               object
bar_name              object
review_date            int64
cocoa_percentage       int64
sugarfree               bool
company_location      object
rating               float64
cocoa_bean_type       object
cocoa_bean_origin     object
dtype: object

In [20]:
assert(sum([x in ds_chocolate_dtypes.index for x in ds_chocolate.columns]) == 9), "The index of ds_chocolate_dtypes should contain all columns in ds_chocolate."
assert(hashlib.sha256(''.join([str(typ) for typ in ds_chocolate_dtypes]).encode()).hexdigest() == '10f3499fbbb3d16acebe010d689e3d64246a7cd6dff7bdb18cbfa845d5caf3c0'), "The dtypes are not correct."

### 3.4 Set the correct datatypes
The datatypes in `ds_chocolate` were inferred, so all `strings` are set as `objects`. Convert all these `object` datatypes to `string` using a function you learned in the learning notebook.
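
One way to do such a conversion is `DataFrame.convert_dtypes`, which returns a new DataFrame using pandas' nullable extension dtypes. The sketch below applies it to a made-up frame. The exact dtype labels printed can differ between pandas versions.

```python
import pandas as pd

# Made-up frame whose dtypes are inferred by pandas.
df_demo = pd.DataFrame({
    "label": ["x", "y", "z"],
    "count": [1, 2, 3],
    "ratio": [0.5, 1.25, 2.75],
    "flag": [True, False, True],
})

# convert_dtypes returns a new DataFrame; df_demo itself is unchanged.
df_converted = df_demo.convert_dtypes()

print(df_demo.dtypes)
print(df_converted.dtypes)
```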

In [21]:
# Set the correct datatypes in the ds_chocolate dataframe - convert the objects to strings.
# Store the new dtypes in the variable ds_chocolate_dtypes_converted.
# ds_chocolate = 
# ds_chocolate_dtypes_converted = 


### BEGIN SOLUTION
ds_chocolate=ds_chocolate.convert_dtypes()
ds_chocolate_dtypes_converted = ds_chocolate.dtypes
### END SOLUTION

In [22]:
# Check your solution and compare it to the result of exercise 3.3. There will be pandas datatypes now.
ds_chocolate_dtypes_converted

company              string[python]
bar_name             string[python]
review_date                   Int64
cocoa_percentage              Int64
sugarfree                   boolean
company_location     string[python]
rating                      Float64
cocoa_bean_type      string[python]
cocoa_bean_origin    string[python]
dtype: object

In [23]:
assert(sum([x in ds_chocolate_dtypes_converted.index for x in ds_chocolate.columns]) == 9), "The index of ds_chocolate_dtypes_converted should contain all columns in ds_chocolate."
assert(hashlib.sha256(str(ds_chocolate_dtypes_converted['company']).encode()).hexdigest() == 
       '473287f8298dba7163a897908958f7c0eae733e25d2e027992ea2edc9bed2fa8'), "The dtype of column 'company' is not as expected."
assert(hashlib.sha256(str(ds_chocolate_dtypes_converted['cocoa_percentage']).lower().encode()).hexdigest() == 
       'bc2229666b96007e875c5f62897ee5b7648db2baa5fabf3e771afac323afbd57'), "The dtype of column 'cocoa_percentage' is not as expected."
assert(hashlib.sha256(str(ds_chocolate_dtypes_converted['sugarfree'])[:4].encode()).hexdigest()==
       'b760f44fa5965c2474a3b471467a22c43185152129295af588b022ae50b50903'), "The dtype of column 'sugarfree' is not as expected."

### 3.5 Get information about the dataframe size
Use a method you learned in the learning notebook to retrieve the `number of rows` and the `number of columns` in the ds_chocolate dataframe.
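
A minimal sketch, assuming the tool meant here is the `shape` attribute: it holds a `(rows, columns)` tuple that can be unpacked directly. The frame below is made up.

```python
import pandas as pd

# Made-up frame with 4 rows and 2 columns.
df_small = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = df_small.shape

print(n_rows, n_cols)  # 4 2
```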

In [None]:
# number_of_rows = 
# number_of_columns = 


### BEGIN SOLUTION
number_of_rows = ds_chocolate.shape[0]
number_of_columns = ds_chocolate.shape[1]
### END SOLUTION

In [None]:
assert(hashlib.sha256(str(int(number_of_rows)).encode()).hexdigest() == '811786ad1ae74adfdd20dd0372abaaebc6246e343aebd01da0bfc4c02bf0106c'), "The number of rows is not correct."
assert(hashlib.sha256(str(int(number_of_columns)).encode()).hexdigest() == '19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7'), "The number of columns is not correct."

### 3.6 Load a json file into a dataframe
Let's load a new dataframe called `hdi` from the file stored at `data/HDI.json`. It contains Human Development Index statistics for the years 1990-2019, a subset of a Kaggle dataset available [here](https://www.kaggle.com/datasets/elmartini/human-development-index-historical-data).
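
As a sketch of how `pd.read_json` works, the example below parses made-up JSON text from an in-memory buffer. The record layout and the values are assumptions for illustration only and may not match the structure of `data/HDI.json`.

```python
import io
import pandas as pd

# Made-up JSON records; the values are illustrative, not real statistics.
json_text = '[{"country": "A", "score": 0.5}, {"country": "B", "score": 0.75}]'

# orient="records" matches a list of {column: value} objects.
df_json = pd.read_json(io.StringIO(json_text), orient="records")

print(df_json.columns.tolist())  # ['country', 'score']
print(df_json.shape)             # (2, 2)
```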

In [None]:
# Load the datafile from data/HDI.json and store it in the variable hdi. Use the appropriate method for json files.
# hdi = 


### BEGIN SOLUTION
hdi = pd.read_json('data/HDI.json')
### END SOLUTION

In [None]:
# Preview your dataframe
hdi.head()

In [None]:
assert(isinstance(hdi, pd.DataFrame)), "hdi is not a dataframe."
assert(hdi.shape == (189, 9)), "The shape of the dataframe is not correct."
assert(sum(hdi.columns == ['HDI Rank', 'Country', '1991', '1996', '2001', '2006', '2011', '2016', '2019']) == 9), "The columns don't look right."
assert(hashlib.sha256(''.join(hdi.Country.tolist()).encode()).hexdigest() == '8471ec328ba910fd961813c7619ae8776684a396672a959975015b24b3335ca3'), "The Country column looks wrong."
assert(hdi['HDI Rank'].sum() == 17914), "Something is wrong with the 'HDI Rank' column."
assert(np.isclose(hdi['2001'].sum(), 111.082)), "Something is wrong with the '2001' column."

### 3.7 Get a numpy array of column names
Store the names of the `columns` in the hdi dataframe as a `numpy array`.
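
A small sketch with a made-up frame: `DataFrame.columns` returns a pandas `Index`, and `Index.to_numpy()` converts it to a NumPy array holding the same labels.

```python
import numpy as np
import pandas as pd

# Made-up frame; only the column labels matter here.
df_cols = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})

cols = df_cols.columns        # a pandas Index object
cols_array = cols.to_numpy()  # the same labels as a NumPy array

print(type(cols).__name__)    # Index
print(cols_array.tolist())    # ['x', 'y', 'z']
```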

In [None]:
# First extract the columns into hdi_cols.
# hdi_cols = 

# Then convert the output into a NumPy array.
# hdi_cols_array = 


### BEGIN SOLUTION
hdi_cols = hdi.columns
hdi_cols_array = hdi_cols.to_numpy()
### END SOLUTION

In [None]:
# Always preview your variables to see the result of the operations.
print(hdi_cols, type(hdi_cols), "\n", sep="\n")
print(hdi_cols_array, type(hdi_cols_array), sep="\n")

In [None]:
assert(isinstance(hdi_cols, pd.core.indexes.base.Index)), "Use the method you learned to extract the columns into hdi_cols."
assert(len(hdi_cols) == 9), "There are 9 columns in the hdi dataframe. Did you extract them all? Also, make sure you don't change the variable hdi."
assert(isinstance(hdi_cols_array, np.ndarray)), "The hdi_cols_array is not a numpy array."
assert(hashlib.sha256(''.join(hdi_cols).encode()).hexdigest() == 'd29520e57d10ac733aad2b8a0e5b044083fc150a5110543e05f5a0c82699cd8f'),"The extracted column names are not correct."

### 3.8 Extract the index as a numpy array
Do the same as in exercise 3.7, but now for the index of hdi.

In [None]:
# Extract the index using the method you learned.
# hdi_index = 

# Convert it to a numpy array.
# hdi_index_array = 


### BEGIN SOLUTION
hdi_index = hdi.index
hdi_index_array = hdi_index.to_numpy()
### END SOLUTION

In [None]:
assert(isinstance(hdi_index, pd.core.indexes.base.Index)), "Use the method you learned to extract the index into hdi_index."
assert(len(hdi_index) == 189), "The length of the hdi_index variable is incorrect."
assert(sum(hdi_index_array) == 17766), "Something is wrong with the index array."
assert(isinstance(hdi_index_array, np.ndarray)), "The hdi_index_array does not look like a numpy array."


### 3.9 Describe the data in your dataframe
Last but not least, remember how you can get some stats and info on your dataframe? If you don't, make sure to reread the learning notebook. If you do, let's jump to this final exercise.

Using only the two methods you learned to get information and statistics on a dataframe, manually answer the three questions in the cell below.
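
A minimal sketch, assuming the two methods meant here are `describe()` and `info()`: `describe()` returns a table of summary statistics that can be read with `.loc`, and `info()` prints the non-null count and dtype of each column. The frame below is made up.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value.
df_stats = pd.DataFrame({"v": [1.0, 2.0, np.nan, 6.0]})

# describe() returns a DataFrame of summary statistics, indexed by statistic.
summary = df_stats.describe()
print(summary.loc["mean", "v"])   # 3.0
print(summary.loc["max", "v"])    # 6.0
print(summary.loc["count", "v"])  # 3.0 (NaN is not counted)

# info() prints the non-null count and dtype of every column.
df_stats.info()
```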

In [None]:
# Use this draft cell to print stuff to help you answer the questions below.


In [None]:
# Question 1
# What is the mean value for HDI in the year 2011 (rounded to 2 decimal places)?
# mean_HDI_2011 = 

# Question 2
# What is the maximum value for HDI in the year 1996 (rounded to 2 decimal places)?
# max_HDI_1996 = 

# Question 3 
# How many non-null entries do we have for the year 2001? Store the answer as an integer.
# nonnull_HDI_2001 = 



### BEGIN SOLUTION
mean_HDI_2011 = round(hdi.describe().loc['mean', '2011'], 2)
max_HDI_1996 = round(hdi.describe().loc['max', '1996'], 2)
nonnull_HDI_2001 = 174
### END SOLUTION

In [None]:
assert (isinstance(mean_HDI_2011, float)), "mean_HDI_2011 should be a float."
assert (isinstance(max_HDI_1996, float)), "max_HDI_1996 should be a float."
assert (isinstance(nonnull_HDI_2001, int)), "nonnull_HDI_2001 should be an integer."
assert(hashlib.sha256(str(round(mean_HDI_2011,2)).encode()).hexdigest() == '18db98e3760ea46503aa9f79ad850cc408d0192446a5666220ca3b99346e3e8c'), "mean_HDI_2011 does not look right."
assert(hashlib.sha256(str(round(max_HDI_1996,2)).encode()).hexdigest() == '0003dc8ba4e379d078ccc278b8c13ca665b2440770df16f54511bf7609203239'), "max_HDI_1996 does not look right."
assert(hashlib.sha256(str(int(nonnull_HDI_2001)).encode()).hexdigest() == '41e521adf8ae7a0f419ee06e1d9fb794162369237b46f64bf5b2b9969b0bcd2e'), "nonnull_HDI_2001 does not look right."