# SLU01 - Pandas 101: Exercise notebook

In [None]:
# First things first, import:
import pandas as pd
import numpy as np
import hashlib

- In this notebook the following is tested:

    - Creating pandas Series
    - Creating pandas DataFrames
    - Loading a dataset
    - Previewing a dataframe

---

## Exercise 1: Series

### 1.1) Create a pandas Series from a list.

Create a list with the names of the items given in the cell below. Then create a series using that list, using the order provided.

Note: Let pandas generate the index on its own. Also, don't forget to delete the `raise NotImplementedError()` line!
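As a reminder of the general pattern, here is a minimal sketch of building a Series from a list. It uses made-up data rather than the exercise items, and the names `colors` and `colors_series` are hypothetical.

```python
import pandas as pd

# Hypothetical data, unrelated to the exercise items
colors = ["red", "green", "blue"]

# With no index argument, pandas assigns a default integer index starting at 0
colors_series = pd.Series(colors)

print(colors_series.index.tolist())   # [0, 1, 2]
print(colors_series.values.tolist())  # ['red', 'green', 'blue']
```

The order of the values in the Series follows the order of the list.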

In [None]:
# items (each of type string): coffee, tea, beef, apple
# items_list = ...

# Create a series with the items above, and call it items_series
# items_series = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# run this cell to see the results 
items_series

In [None]:
assert(isinstance(items_list, list)), "The data structure used is not the correct one."
assert(isinstance(items_list[0], str)), "The items inside the list should be of type string."
assert(len(items_series) == 4), "The length of the series is not correct."
assert(sum([x == y for x, y in zip(items_list, items_series.values.tolist())]) == 4), "The items don't seem correct."
assert(np.equal(items_series.index.to_numpy(), np.arange(0, 4)).sum() == 4), "The index doesn't seem correct."

### 1.2) Virtual use of water

Let's now use the `items_list` as index and build a series containing the virtual use of water per item (that is, the gallons of water used to produce each item).
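For reference, a Series with custom labels can be sketched as below. The data is made up, and `cities`, `temperatures` and `city_temps` are hypothetical names that do not appear in the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and values, unrelated to the exercise data
cities = ["Lisbon", "Porto", "Faro"]
temperatures = np.array([21, 18, 24])

# data supplies the values, index supplies the labels, matched by position
city_temps = pd.Series(data=temperatures, index=cities)

print(city_temps["Porto"])   # label-based access -> 18
print(city_temps.iloc[-1])   # position-based access -> 24
```

With a string index, `.iloc` is the explicit way to select by position.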

In [None]:
# RUN THIS CELL FIRST
# Data on virtual use of water per item (in gallons)
water_used = np.array([37, 9, 1500, 18])

In [None]:
# Create a series which has the items_list (created in exercise 1.1) as index,
# and the respective water_used given in the cell above as values.
# virtual_water = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# run this cell to see the results 
virtual_water

In [None]:
assert (isinstance(virtual_water, pd.Series)), "virtual_water is not a series."
assert(len(virtual_water) == 4), "The length of virtual_water is not correct."
assert(virtual_water.iloc[-3] == 9), 'Something is wrong with the values of the series. Did you use water_used?'
assert(virtual_water.index[1] == 'tea'), 'The order of the index is not correct.'

Cool! In 1 exercise you've learned 2 things:
- that a pandas Series can be built using several data structures, such as lists or numpy arrays;
- that if you want to reduce your water footprint, you should drink tea instead of coffee, and become vegan.

😃

---

## Exercise 2: DataFrames

In this exercise the goal is to create a simple DataFrame, from several data structures.
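As a rough illustration of the idea, the sketch below builds a small DataFrame from a dictionary whose values are a list and a NumPy array. The data and the names `fruit_data` and `fruit_df` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each key becomes a column name, each value a column
fruit_data = {
    "fruit": ["apple", "pear"],      # a plain list
    "price": np.array([1.2, 0.8]),   # a NumPy array
}

# index sets the row labels; its length must match the column length
fruit_df = pd.DataFrame(fruit_data, index=["a", "b"])

print(fruit_df.shape)              # (2, 2)
print(fruit_df.loc["b", "fruit"])  # pear
```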

In [None]:
# RUN THIS CELL FIRST
# this is the data you'll use to fill each column of your dataframe
emojis = ['Face with Tears of Joy', 'Loudly Crying Face', 'Pleading Face', 'Red Heart']
search_engines = np.array(['Google', 'Bing', 'Yahoo!', 'Baidu'])
social_network = ['Facebook', 'YouTube', 'WhatsApp', 'Facebook Messenger']
social_network_active_users = [2740000000, 2291000000, 2000000000, 1300000000]

In [None]:
# Add the data from emojis, search_engines, social_network and social_network_active_users to a dictionary 
#  called most_popular_2020_dictionary:
#    - use the 4 variables created in the cell above to fill the data for each key
#    - each key should be a string containing the name of the corresponding variable
# most_popular_2020_dictionary = ...


# Create a dataframe called most_popular_2020
#   - Set an index with the values 'first', 'second', 'third', 'fourth'
#   - Use the dictionary created above to populate the dataframe
# most_popular_2020 = ...


# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# check your dataframe
most_popular_2020

In [None]:
assert(isinstance(most_popular_2020, pd.DataFrame)), 'most_popular_2020 is not a DataFrame'
assert(isinstance(most_popular_2020_dictionary,dict)), 'Something is wrong! most_popular_2020_dictionary is not a dictionary.'
assert(most_popular_2020['emojis'].tolist()==emojis), "The emojis column doesn't look right."
assert(most_popular_2020['search_engines'].tolist()==list(search_engines)), "The search_engines column doesn't look right."
assert(most_popular_2020['social_network'].tolist()==social_network), "The social_network column doesn't look right."
assert(most_popular_2020.shape == (4, 4)), 'The size of the dataframe is not correct.'
assert(most_popular_2020.index.tolist() == ['first', 'second', 'third', 'fourth']), 'The index is not correct. Reread the instructions.'

---

## Exercise 3: Files and dataframes

Let's load a dataset with r/VaccineMyths subreddit posts and comments.

---

The dataset we're going to load is a subset from a Kaggle dataset available [here](https://www.kaggle.com/gpreda/reddit-vaccine-myths).

### 3.1) Load a dataset into `vaccine_myths` dataframe

In [None]:
# Load the dataset, which is located at data/reddit_vm.json
# Notice that the file is of JSON format. How can you read the file in pandas?
# (recheck the learning notebook if you get stuck, the answer is there)
# NOTE: don't use any additional arguments other than the filepath. We want to load the file as is.
# vaccine_myths = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# print the head to get an idea of what you've just loaded
vaccine_myths.head()

In [None]:
assert(isinstance(vaccine_myths, pd.DataFrame)), "Something is wrong. vaccine_myths does not look like a dataframe."
assert(vaccine_myths.shape == (554, 5)), "The shape is not correct. Did you follow all the instructions in the comments?"
assert(sum(vaccine_myths.columns == ['t', 's', 'u', 'comms_num', 'created']) == 5), "The columns don't look right."
assert(vaccine_myths.index[3] == 'lnptv8' and vaccine_myths.index[552] == '1v3blj'), "The index looks wrong."
assert(vaccine_myths.t.iloc[5] == 'Canada: Oxford-AstraZeneca vaccine approval expected this week'), "Something is wrong. Did you follow all the instructions in the comments?"
assert(vaccine_myths.s.sum() == 4274), "Something is wrong. Did you follow all the instructions in the comments?"
assert(vaccine_myths.comms_num.max() == 596), "Something is wrong. Did you follow all the instructions in the comments?"
assert(vaccine_myths.u.iloc[446] == 'http://www.activistpost.com/2014/03/hands-off-our-vaccine-exemptions.html'), "Something is wrong. Did you follow all the instructions in the comments?"

### 3.2) Load a dataset, but this time better

Notice how we have 5 columns named `'t'`, `'s'`, `'u'`, `'comms_num'` and `'created'` in our dataframe `vaccine_myths`?

This is not very useful to someone looking at the data. Instead we want to load the dataset with the following column names:
- `'title'` - relevant for posts
- `'score'` - relevant for posts - based on impact, number of comments
- `'url'` - relevant for posts - url of post thread

And leave the last 2 columns as they are:
- `'comms_num'` - relevant for posts - number of comments to this post (same name as before)
- `'created'` - date of creation in seconds since epoch (same name as before)
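To illustrate the idea of renaming columns while loading, here is a minimal sketch on an in-memory CSV. The content and the names `csv_text` and `people` are hypothetical, and the exercise file may be laid out differently, so check the `read_csv` documentation for your case.

```python
import io
import pandas as pd

# Hypothetical CSV content with terse header names
csv_text = "n,a\nAna,31\nRui,27\n"

# header=0 says the first row is a header (so it is not read as data);
# names replaces those header labels with more descriptive ones
people = pd.read_csv(io.StringIO(csv_text), header=0, names=["name", "age"])

print(people.columns.tolist())  # ['name', 'age']
print(people.shape)             # (2, 2)
```

Without `header=0`, passing `names` alone would make pandas treat the original header row as a data row.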

In [None]:
# Load the file at 'data/vaccine_myths_100.csv' into a dataframe (notice now the file is a CSV !!)
#   - set the column names as 'title', 'score', 'url', 'comms_num' and 'created', in this order
#   - make sure pandas uses the first row as header !!
# You will need to check the documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html, to see how you can do this.
# vaccine_myths_100 = ...


# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# print the head to get an idea of what you've just loaded
vaccine_myths_100.head()

In [None]:
assert(isinstance(vaccine_myths_100, pd.DataFrame)), "Something is wrong. vaccine_myths_100 does not look like a dataframe."
assert(vaccine_myths_100.shape == (100, 5)), "The shape is not correct. Did you follow all the instructions in the comments?"
assert(vaccine_myths_100.shape != (101, 5)), "Something is wrong. You have 1 more row than expected. Hint: Did you tell pandas to use the 1st row as header?"
assert(sum(vaccine_myths_100.columns != ['t', 's', 'u', 'comms_num', 'created']) == 3), "Don't forget to tell pandas the new column names."
assert(vaccine_myths_100.index[3] == 'lnptv8'), "The index looks wrong."
assert(vaccine_myths_100.title.iloc[5] == 'Canada: Oxford-AstraZeneca vaccine approval expected this week'), "Something is wrong. Did you follow all the instructions in the comments?"
assert(vaccine_myths_100.score.sum() == 179), "Something is wrong. Did you follow all the instructions in the comments?"
assert(vaccine_myths_100.comms_num.max() == 8), "Something is wrong. Did you follow all the instructions in the comments?"
assert(vaccine_myths_100.loc['e83687'].url == 'https://www.facebook.com/groups/445000352804849/?ref=share'), "Something is wrong. Did you follow all the instructions in the comments?"

### 3.3) Preview the last 7 entries of `vaccine_myths_100`

In [None]:
# Store the last 7 entries of vaccine_myths_100
# in a new dataframe called last_seven
# Use a method you learned in the learning notebook to select the last n rows of a dataframe
# last_seven = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert(last_seven.shape[0] == 7), "The number of rows does not look right. You need to take the last 7 rows."
assert(last_seven.shape[1] == 5), "The number of columns does not look right. All 5 columns should be kept."
assert(last_seven.iloc[2].title == 'A vaccine debate group juts started up. If you’re interested in joining.'), "Something is wrong. Reread the instructions and try again."

### 3.4) Get general information about a dataframe

Let's load [a new dataset](https://www.kaggle.com/crawford/80-cereals). This one is for the cereal lovers:

In [None]:
cereals = pd.read_csv('data/cereal.csv')
cereals.head()

In [None]:
# Use a method you learned about in the learning notebook to retrieve the number of rows
# and the number of columns in cereals
# number_of_rows = ...
# number_of_columns = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert(hashlib.sha256(str(int(number_of_rows)).encode()).hexdigest() == 'a88a7902cb4ef697ba0b6759c50e8c10297ff58f942243de19b984841bfe1f73'), "The number of rows is not correct."
assert(hashlib.sha256(str(int(number_of_columns)).encode()).hexdigest() == 'b17ef6d19c7a5b1ee83b907c595526dcb1eb06db8227d650d5dda0a9f4ce8cd9'), "The number of columns is not correct."

### 3.5) Use a method to determine all the types

In [None]:
# just in case you inadvertently overwrote cereals...
cereals = pd.read_csv('data/cereal.csv')

# make sure you don't overwrite any variables in exercise notebooks, 
# otherwise you might inadvertently not pass asserts because you're not working over the correct data.

In [None]:
# Store the types of all columns of cereals in cereals_dtypes
# Use the method you learned in the learning notebook.
# cereals_dtypes = ...



# Note: if you used the correct method, 
# the result will be a pandas series containing the data types of each column,
# with index formed by the columns of cereals

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# check your output -- there should be object, float and integer types in cereals_dtypes
cereals_dtypes

In [None]:
assert(sum([x in cereals_dtypes.index for x in cereals.columns]) == 16), "The index of cereals_dtypes should contain all columns in cereals."
assert(hashlib.sha256(str(cereals_dtypes['name']).encode()).hexdigest() == '2958d416d08aa5a472d7b509036cb7eafd542add84527e66a145ea64cb4cdc75'), "The dtype of column 'name' is not as expected."

### 3.6) Get a numpy array of column names

In [None]:
# store the names of the columns in cereals as a numpy array

# first extract the columns into cereals_cols
# cereals_cols = ...

# then convert the output into a NumPy array
# cereals_cols_array = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# always preview your variables
print(cereals_cols, type(cereals_cols), "\n", sep="\n")
print(cereals_cols_array, type(cereals_cols_array), sep="\n")

In [None]:
assert(isinstance(cereals_cols, pd.core.indexes.base.Index)), "Use the method you learned to extract the columns into cereals_cols."
assert(len(cereals_cols) == 16), "There are 16 columns in cereals. Did you extract them all? Also, make sure you don't change the variable cereals."
assert(isinstance(cereals_cols_array, np.ndarray)), "The cereals_cols_array does not look like a numpy array."

### 3.7) Extract the index as a numpy array

In [None]:
# do the same you did in exercise 3.6, but now for the index of cereals

# extract the index using the method you learned
# cereals_index = ...

# convert it to a numpy array
# cereals_index_array = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# if all is correct, you should notice below that your index, in this case, is a range from 0 to the number of rows
# with step = 1
cereals_index

In [None]:
assert(isinstance(cereals_index, pd.core.indexes.base.Index)), "Use the method you learned to extract the index into cereals_index."
assert(len(cereals_index) == 77), "There are 77 rows in cereals. Did you extract the whole index? Also, make sure you don't change the variable cereals."
assert(isinstance(cereals_index_array, np.ndarray)), "The cereals_index_array does not look like a numpy array."

### 3.8) Describe the data in your dataframe

Last but not least, remember how you can get some stats and info on your dataframe?

If you don't, make sure to reread the learning notebook.

If you do, let's jump to this final exercise.
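As a refresher, the sketch below assumes the two methods meant here are `DataFrame.describe` and `DataFrame.info` (an assumption, since this notebook does not name them). The dataframe `df` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with one text column (containing a missing value)
# and one numeric column
df = pd.DataFrame({"label": ["x", "y", None], "value": [1, 2, 6]})

# describe() returns summary statistics; by default only numeric columns
stats = df.describe()
print(stats.loc["mean", "value"])  # 3.0
print(stats.loc["max", "value"])   # 6.0

# info() prints the dtype and the non-null count of every column
df.info()
```

Note that `describe` reports values such as the maximum as floats, even when the column holds integers.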

In [None]:
# RUN THIS CELL FIRST
# bring back the reddit data!
vaccine_myths_reddit = pd.read_json('data/reddit_vm.json')
vaccine_myths_reddit.columns = ['title', 'score', 'url', 'number_of_comments', 'date_created']
vaccine_myths_reddit.head()

You can see that `vaccine_myths_reddit` has 2 different data types in its columns, `object` (which are actually strings) and numeric values.

In [None]:
# Use this draft cell to print stuff to help you answer the questions below
# ...

In [None]:
# Using only the 2 methods you learned to get information and statistics on a dataframe
# answer the following questions manually

# Question 1 - What's the mean value for `score` (rounded to 2 decimal points)?
# mean_score = ...

# Question 2 - What's the maximum value for `number_of_comments`? (store the answer as an integer, NOT a float)
# max_number_of_comments = ...

# Question 3 - How many non-null entries do we have for `url`? (store the answer as an integer)
# nonnull_url = ...


# YOUR CODE HERE
raise NotImplementedError()

In [None]:
np.testing.assert_almost_equal(float(mean_score), 7.71, decimal=2, err_msg="mean_score does not look right.")
assert(hashlib.sha256(str(int(max_number_of_comments)).encode()).hexdigest() == 'be6b5b7140b02bff9ad8fa5aaaeca5973791521c5029c9f6b42390f8b87ce2bd'), "max_number_of_comments does not look right."
assert(hashlib.sha256(str(int(nonnull_url)).encode()).hexdigest() == '549a2fac47d713cc00f2db498ad6b5574fb03c9293aef6c7ad50a11b394c197d'), "nonnull_url does not look right."

---