<figure>
   <IMG SRC="https://mamba-python.nl/images/logo_basis.png" WIDTH=125 ALIGN="right">

</figure>

# Pandas benchmarking exercise
    
*Developed by MAMBA*

This notebook contains 8 exercises with pandas. The exercises can be done individually or as a group. The first 4 exercises (1.1 - 1.4) use the IMDB top 1000 dataset. The other 4 exercises (2.1 - 2.4) use the COVID-19 vaccine dataset. At the top of each set of 4 exercises there is code that loads the dataset into a DataFrame.
    
The goal is to write the most efficient Python code (in terms of computation time) that produces what is asked. You can use the [`%%timeit` Jupyter cell magic](https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html) to get the average execution time of a cell. Be aware that variables created in a cell with `%%timeit` on top are not stored in memory and cannot be used later on. It is therefore best to write the code first and only add `%%timeit` at the end, once you are sure the code does what it is supposed to do.
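If you want to time code outside a notebook cell, or keep the results of the timed code, the standard library `timeit` module offers a similar measurement. The sketch below compares two hypothetical functions (`squares_loop` and `squares_comprehension` are illustrations, not part of the exercises) and checks that they agree before timing them.

```python
import timeit


# Hypothetical workload: two ways to build a list of squares.
def squares_loop(n):
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    return [i * i for i in range(n)]


# Check correctness first, time afterwards.
assert squares_loop(1000) == squares_comprehension(1000)

# Each repeat calls the function `number` times; keep the fastest repeat.
loop_time = min(timeit.repeat(lambda: squares_loop(1000), number=200, repeat=3))
comp_time = min(timeit.repeat(lambda: squares_comprehension(1000), number=200, repeat=3))

print(f"loop: {loop_time:.4f} s, comprehension: {comp_time:.4f} s (per 200 calls)")
```

The same pattern applies to the exercises: verify the result once, then time the candidate solutions on the full DataFrame.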
    
For each exercise, some Stack Overflow posts that could be helpful are listed at the bottom of this page.

## Table of contents

1. [IMDB exercises](#1-IMDB-exercises)
2. [Covid vaccine exercises](#2-Covid-vaccine-exercises)

## 1 IMDB exercises

```python
import pandas as pd
imdb_df = pd.read_csv('imdb_1000.csv')
imdb_df.head()
```

| Unnamed: 0 | star_rating | title | content_rating | genre | duration | actors_list |
|------------|-------------|-------|----------------|-------|----------|-------------|
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
| 1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
| 2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
| 4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |


## Exercise 1.1 Actors in the IMDB top 1000

Use the `imdb_df` for this exercise. Find the most efficient way to obtain the following:
1. a list with all the actors in the IMDB top 1000. The list should not contain any duplicates
2. a pandas Series with the names of the actors as the index and the number of times an actor appears in the IMDB top 1000 as the values.
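As a starting point, note that each entry of `actors_list` is a string that looks like a Python list, not an actual list. The sketch below shows one possible way to parse and count such strings on a made-up three-row Series (`actors_raw` is hypothetical, not the real data). It is not necessarily the fastest approach, so time it against your own alternatives.

```python
import ast
from collections import Counter

import pandas as pd

# Hypothetical stand-in for imdb_df['actors_list'] (made-up rows).
actors_raw = pd.Series([
    "[u'Al Pacino', u'Robert De Niro']",
    "[u'Al Pacino', u'Marlon Brando']",
    "[u'Tim Robbins', u'Morgan Freeman']",
])

# Turn each string into a real Python list of names.
actor_lists = actors_raw.apply(ast.literal_eval)

# Flatten all lists and count every name in a single pass.
counts = Counter(name for names in actor_lists for name in names)

# The Counter keys are unique, so they double as the de-duplicated list.
unique_actors = list(counts)

# Series with actor names as index and number of appearances as values.
actor_counts = pd.Series(counts).sort_values(ascending=False)
```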

## Exercise 1.2 Al Pacino in the IMDB top 1000

Use the `imdb_df` for this exercise. Find the most efficient way to obtain the following:
1. The number of movies in the IMDB top 1000 that feature Al Pacino
2. The average duration and rating of the movies that feature Al Pacino
3. The number of movies that feature Al Pacino per genre
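One possible approach is a boolean mask built with a substring test on the raw `actors_list` strings. The sketch below uses a made-up DataFrame (`toy_df` and its values are hypothetical) and is meant as a baseline to benchmark against, not as the fastest solution.

```python
import pandas as pd

# Hypothetical stand-in for imdb_df (values are made up).
toy_df = pd.DataFrame({
    "star_rating": [9.0, 8.0, 7.0],
    "genre": ["Crime", "Crime", "Drama"],
    "duration": [100, 200, 150],
    "actors_list": [
        "[u'Al Pacino', u'Robert De Niro']",
        "[u'Al Pacino', u'Marlon Brando']",
        "[u'Tim Robbins', u'Morgan Freeman']",
    ],
})

# regex=False treats the pattern as a plain substring, not a regular expression.
is_pacino = toy_df["actors_list"].str.contains("Al Pacino", regex=False)
pacino_df = toy_df[is_pacino]

n_movies = int(is_pacino.sum())                            # number of matching movies
averages = pacino_df[["duration", "star_rating"]].mean()   # average duration and rating
per_genre = pacino_df["genre"].value_counts()              # matching movies per genre
```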

## Exercise 1.3 Rating per actor in the IMDB top 1000

Use the `imdb_df` for this exercise. Find the most efficient way to obtain the following:
1. The average rating of the movies in the IMDB top 1000 per actor.
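A sketch of one way to get a rating per actor: give every (movie, actor) pair its own row and then group by actor. It assumes pandas 0.25 or newer (for `DataFrame.explode`) and uses a made-up DataFrame (`toy_df` is hypothetical). Other approaches may well be faster on the real data.

```python
import ast

import pandas as pd

# Hypothetical stand-in for imdb_df (values are made up).
toy_df = pd.DataFrame({
    "star_rating": [9.0, 8.0, 7.0],
    "actors_list": [
        "[u'Al Pacino', u'Robert De Niro']",
        "[u'Al Pacino', u'Marlon Brando']",
        "[u'Tim Robbins', u'Morgan Freeman']",
    ],
})

# Parse the strings into lists, then create one row per (movie, actor) pair.
exploded = (
    toy_df.assign(actor=toy_df["actors_list"].apply(ast.literal_eval))
          .explode("actor")
)

# Average star rating over all movies each actor appears in.
rating_per_actor = exploded.groupby("actor")["star_rating"].mean()
```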


## Exercise 1.4 Weird correlations

1. add a column to the `imdb_df` with the number of words in the movie title.
2. check if there is any correlation between the number of words in the movie title and the star rating or duration of the movie.
3. create a DataFrame with the average star_rating and duration per number of words in the movie title. See the example below (not the actual values)

| nwords | average_star_rating | average_duration |
|--------|---------------------|------------------|
| 1      | 9.5                 | 120              |
| 2      | 7.6                 | 200              |
| ...    | ...                 | ...              |

4. plot the table from step 3
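The first three steps can be sketched as follows on a made-up DataFrame (`toy_df` and its ratings and durations are hypothetical). The sketch assumes words are separated by whitespace and that pandas 0.25 or newer is available for named aggregation.

```python
import pandas as pd

# Hypothetical stand-in for imdb_df (ratings and durations are made up).
toy_df = pd.DataFrame({
    "title": ["Pulp Fiction", "The Dark Knight", "Heat", "Fight Club"],
    "star_rating": [8.0, 9.0, 8.0, 9.0],
    "duration": [150, 150, 170, 140],
})

# Step 1: split each title on whitespace and count the pieces.
toy_df["nwords"] = toy_df["title"].str.split().str.len()

# Step 2: Pearson correlation of nwords with the other two columns.
correlations = toy_df[["nwords", "star_rating", "duration"]].corr()["nwords"]

# Step 3: one row per word count, with the mean rating and mean duration.
summary = toy_df.groupby("nwords").agg(
    average_star_rating=("star_rating", "mean"),
    average_duration=("duration", "mean"),
)
```

For step 4, `summary.plot(subplots=True)` is one option when matplotlib is installed, since the two columns have very different scales.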

## 2 Covid vaccine exercises

For the following exercises we use the dataset `country_vaccinations.csv`. This dataset contains daily data per country about COVID-19 vaccinations.

```python
# https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations
# https://www.kaggle.com/gpreda/covid-world-vaccination-progress
import pandas as pd
vaccine_df = pd.read_csv(r'country_vaccinations.csv', index_col=0)
```

## Exercise 2.1

- What is the highest number of vaccine doses administered in one day in one country?
- Which country had the highest number of vaccinations in a day? Which day was it?
- Which country has the highest number of fully vaccinated people per hundred?
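A minimal sketch of how `idxmax` can answer these questions, on a made-up DataFrame. The column names `country`, `date` and `people_fully_vaccinated_per_hundred` are assumptions about the dataset, and the sketch assumes a unique row index. Because the notebook loads the file with `index_col=0`, you may need `vaccine_df.reset_index()` first.

```python
import pandas as pd

# Hypothetical stand-in for vaccine_df.reset_index() (values are made up,
# column names are assumed to match the dataset).
toy_vaccine_df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"],
    "daily_vaccinations": [100.0, 300.0, 250.0, None],
    "people_fully_vaccinated_per_hundred": [1.0, 2.0, 5.0, 4.0],
})

# idxmax returns the index label of the largest value (NaN is skipped),
# so .loc gives the full row with country and date in one lookup.
peak_row = toy_vaccine_df.loc[toy_vaccine_df["daily_vaccinations"].idxmax()]
peak_value = peak_row["daily_vaccinations"]
peak_country = peak_row["country"]
peak_date = peak_row["date"]

# Highest fully-vaccinated percentage reached per country, then the top country.
top_fully_vaccinated = (
    toy_vaccine_df.groupby("country")["people_fully_vaccinated_per_hundred"]
    .max()
    .idxmax()
)
```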

## Exercise 2.2 visualise time series

- plot the number of daily vaccinations per million for the Netherlands, Belgium, Australia and Israel in one figure
- create a plot with a bar chart of the daily vaccinations in the Netherlands on the left y-axis and a line graph of the total vaccinations in the Netherlands on the right y-axis, both plotted against time on the x-axis.
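For the second plot, a second y-axis can be created with matplotlib's `Axes.twinx`. The sketch below uses made-up numbers (`toy_nl` is hypothetical) and calls matplotlib directly for both layers so that the bar chart and the line share the same date axis. The `Agg` backend line only makes the sketch runnable without a display; leave it out in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the Dutch rows of vaccine_df (values are made up).
dates = pd.date_range("2021-01-01", periods=5, freq="D")
toy_nl = pd.DataFrame({"daily_vaccinations": [10, 20, 30, 25, 40]}, index=dates)
toy_nl["total_vaccinations"] = toy_nl["daily_vaccinations"].cumsum()

fig, ax_left = plt.subplots()

# Left y-axis: bar chart of the daily values.
ax_left.bar(toy_nl.index, toy_nl["daily_vaccinations"], color="tab:blue")
ax_left.set_ylabel("daily vaccinations")

# Right y-axis: line graph of the running total, sharing the same x-axis.
ax_right = ax_left.twinx()
ax_right.plot(toy_nl.index, toy_nl["total_vaccinations"], color="tab:red")
ax_right.set_ylabel("total vaccinations")

fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```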

## Exercise 2.3 vaccine type

- list all the countries that use or have used the Pfizer/BioNTech vaccine
- find all the vaccine types that are in this dataset
- create a Series with the vaccine types as index and the number of countries that use them as values
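A possible starting point, under two assumptions about the dataset that you should verify yourself: the vaccine types are stored in a column named `vaccines`, and multiple types in one cell are separated by `", "`. The DataFrame below is made up (`toy_vaccine_df` is hypothetical), and `explode` needs pandas 0.25 or newer.

```python
import pandas as pd

# Hypothetical stand-in for vaccine_df.reset_index() (values are made up;
# the 'vaccines' column name and the ', ' separator are assumptions).
toy_vaccine_df = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "vaccines": [
        "Moderna, Pfizer/BioNTech",
        "Moderna, Pfizer/BioNTech",
        "Sputnik V",
        "Pfizer/BioNTech",
    ],
})

# The daily rows repeat the same information, so de-duplicate first.
per_country = toy_vaccine_df.drop_duplicates(subset=["country", "vaccines"])

# Countries whose vaccine string mentions Pfizer/BioNTech (plain substring match).
uses_pfizer = per_country["vaccines"].str.contains("Pfizer/BioNTech", regex=False)
pfizer_countries = per_country.loc[uses_pfizer, "country"].unique().tolist()

# One row per (country, vaccine type) pair.
types_long = per_country.assign(
    vaccine=per_country["vaccines"].str.split(", ")
).explode("vaccine")

vaccine_types = sorted(types_long["vaccine"].unique())
countries_per_type = types_long.groupby("vaccine")["country"].nunique()
```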

## Exercise 2.4 Dataset consistency

The dataset contains the columns 'daily_vaccinations' and 'daily_vaccinations_per_million'. This gives us the option to compute the population per country. Since we have all the values per day we can check if the population values are fairly constant over time. If the population changes over such a short period of time, we have an indication that something is wrong with the data.

- calculate the mean population over time per country.
- calculate the standard deviation of the population over time per country.
- use both previous values to find countries where the population changes over time.
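The population follows from the two columns as `daily_vaccinations / daily_vaccinations_per_million * 1e6`. The sketch below applies this to a made-up DataFrame (`toy_vaccine_df` is hypothetical and `country` is an assumed column name) and flags countries by the ratio of standard deviation to mean. The 5% threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical stand-in for vaccine_df.reset_index() (values are made up).
toy_vaccine_df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "daily_vaccinations": [1000.0, 2000.0, 500.0, 500.0],
    "daily_vaccinations_per_million": [100.0, 200.0, 50.0, 100.0],
})

# Skip rows where the per-million value is zero or missing (division is undefined).
valid = toy_vaccine_df[toy_vaccine_df["daily_vaccinations_per_million"] > 0]

# Implied population for every remaining row.
population = valid["daily_vaccinations"] / valid["daily_vaccinations_per_million"] * 1e6

# Mean and standard deviation of the implied population per country.
stats = population.groupby(valid["country"]).agg(["mean", "std"])

# Relative spread; a constant population gives 0.
stats["relative_std"] = stats["std"] / stats["mean"]

# Arbitrary 5% threshold, chosen only for this illustration.
suspicious = stats[stats["relative_std"] > 0.05].index.tolist()
```

Comparing the standard deviation to the mean, rather than using it directly, keeps large and small countries on the same scale.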

## Tips

### Tips Exercise 1.1

These Stack Overflow posts can help you:
- https://stackoverflow.com/questions/1894269/how-to-convert-string-representation-of-list-to-a-list
- https://stackoverflow.com/questions/30885005/pandas-series-of-lists-to-one-series
- https://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item
- https://stackoverflow.com/questions/7961363/removing-duplicates-in-the-lists

### Tips Exercise 1.2

These Stack Overflow posts can help you:
- https://stackoverflow.com/questions/30944577/check-if-string-is-in-a-pandas-dataframe

### Tips Exercise 1.3

These tips can help you:
- use your answer to exercise 1.1
- https://stackoverflow.com/questions/30944577/check-if-string-is-in-a-pandas-dataframe

### Tips Exercise 1.4

These Stack Overflow posts can help you:
- https://stackoverflow.com/questions/45019319/pandas-split-a-string-and-then-create-a-new-column
- https://stackoverflow.com/questions/52247376/count-total-number-of-list-elements-in-pandas-column
- https://stackoverflow.com/questions/42579908/use-corr-to-get-the-correlation-between-two-columns
- https://stackoverflow.com/questions/39403705/mean-of-values-in-a-column-for-unique-values-in-another-column
- https://stackoverflow.com/questions/17812978/how-to-plot-two-columns-of-a-pandas-data-frame-using-points

### Tips Exercise 2.1

These Stack Overflow posts can help you:
- https://stackoverflow.com/questions/41509496/find-row-index-of-highest-value-in-given-column-of-dataframe
- https://stackoverflow.com/questions/39403705/mean-of-values-in-a-column-for-unique-values-in-another-column


### Tips Exercise 2.2

These Stack Overflow posts can help you:
- https://stackoverflow.com/questions/44729498/plotting-data-from-multiple-pandas-data-frames-in-one-plot
- https://stackoverflow.com/questions/29498652/plot-bar-graph-from-pandas-dataframe
- https://stackoverflow.com/questions/51784495/plot-dataframe-with-two-y-axes

### Tips Exercise 2.3

These Stack Overflow posts can help you:
- https://stackoverflow.com/questions/30944577/check-if-string-is-in-a-pandas-dataframe


### Tips Exercise 2.4

These Stack Overflow posts can help you:
- https://stackoverflow.com/questions/39403705/mean-of-values-in-a-column-for-unique-values-in-another-column