# BLU03 - Exercises Notebook

In [None]:
import hashlib # for grading purposes
import math
import numpy as np
import pandas as pd
import requests
import sqlalchemy

from bs4 import BeautifulSoup

## Part A - SQL exercises

### Querying the FIFAdb with a SQL client

Open your favorite SQL client and connect to the FIFAdb.
The connection settings are the following.

* host: batch4-s02-db-instance.ctq2kxc7kx1i.eu-west-1.rds.amazonaws.com
* port: 5432
* user: ldsa_student
* database: batch4_s02_db
* schema: public
* password: XXX (shared through slack)

This is a different schema from the one we used in the learning notebooks (don't forget to change to this schema, see the Learning Notebook). This schema contains information about football matches, players, teams, and the league and country in which these matches took place. It also contains the players' and teams' "attributes", sourced from the EA Sports FIFA video game series.

The tables in this schema are the following:

1. Match: contains information about the football matches: the 11 home and 11 away players (identified by their player_id), the number of goals each team scored, the date of the match, the league ID, and the home and away team IDs.
2. Player: contains information about the players.
3. Team: contains information about the teams.
4. League: contains information about the football leagues, including the id of the country where they take place.
5. Country: contains the names and IDs of the countries.
6. Player_Attributes: contains the attributes for each player.
7. Team_Attributes: contains the attributes for each team.

You can preview these tables using the SQL client.

### Q1. Select the name of the country with id 24558

Write a query that selects the name of the country with id 24558, and run it in the SQL client.

Then, assign the result to variable q1_answer (just copy and paste the name you obtained).
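
The grading cells in this notebook compare SHA-256 hashes rather than raw values, so an answer has to match character by character. The following sketch shows what that implies, using a hypothetical helper `sha256_of` and a made-up answer that is not taken from the FIFAdb:

```python
import hashlib


def sha256_of(answer):
    """Hash the string form of an answer, mirroring the grading cells."""
    return hashlib.sha256(str(answer).encode()).hexdigest()


# "Atlantis" is a made-up answer, used only to show how the comparison behaves.
reference = sha256_of("Atlantis")

same_text = sha256_of("Atlantis") == reference        # identical text matches
trailing_space = sha256_of("Atlantis ") == reference  # stray whitespace does not
int_vs_str = sha256_of(42) == sha256_of("42")         # str() makes these equal
```

If an assertion fails on an answer that looks right, extra whitespace or different capitalisation in the pasted value is a likely cause.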

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
expected_hash = '2275583196d791405892aaca0d87743c872f3fc0cf3308a6c3ef82528918aa8a'
assert hashlib.sha256(q1_answer.encode()).hexdigest() == expected_hash

### Q2. Count in how many games the home team didn't score any goals

Write a query that counts the number of games in which the home team didn't score any goals (see the `home_team_goal` column).

Then, assign the result to variable q2_answer (just copy and paste the value).
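
As a rough illustration of counting rows that satisfy a condition, here is a sketch that runs against an in-memory SQLite database. The table `toy_match` and its rows are invented for the example and only borrow column names from the description above:

```python
import sqlite3

# Throwaway in-memory database with a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_match (id INTEGER, home_team_goal INTEGER, away_team_goal INTEGER)"
)
conn.executemany(
    "INSERT INTO toy_match VALUES (?, ?, ?)",
    [(1, 0, 2), (2, 3, 1), (3, 0, 0), (4, 1, 1)],
)

# COUNT(*) counts the rows that survive the WHERE filter.
goalless_home = conn.execute(
    "SELECT COUNT(*) FROM toy_match WHERE home_team_goal = 0"
).fetchone()[0]
conn.close()
```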

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
expected_hash = '816d797f1f1c71bb6104ad8a44416f92eb1a08fdc4bbfa5f33c20b304b2b47a7'
assert hashlib.sha256(str(q2_answer).encode()).hexdigest() == expected_hash

### Q3. Find out the name of the shortest player who is really good at jumping and whose first name is John

That's quite a lot to ask!

Let's break it down. Write a query that:

* takes all players whose `jumping` attribute is greater than 75.
* filters only those whose name is **LIKE** "John *something*".
* sorts them by height in ascending order.

Then, assign the result to variable q3_answer.

**Hints**: check the [LIKE](https://www.postgresql.org/docs/current/static/functions-matching.html#FUNCTIONS-LIKE) keyword for this exercise. Also: the player height is not on the Player_Attributes table - you'll have to get it from somewhere else.
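
The three steps above can be sketched on toy data. The example below uses an in-memory SQLite database with invented tables (`toy_player`, `toy_player_attributes`) and invented players, so the table layout is an assumption and not the FIFAdb schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE toy_player (id INTEGER, name TEXT, height REAL);
    CREATE TABLE toy_player_attributes (player_id INTEGER, jumping INTEGER);
    INSERT INTO toy_player VALUES
        (1, 'John Tall', 190), (2, 'John Short', 170),
        (3, 'Johnny Bravo', 160), (4, 'Peter John', 165), (5, 'John Low', 150);
    INSERT INTO toy_player_attributes VALUES
        (1, 90), (2, 80), (3, 95), (4, 85), (5, 60);
    """
)

# The trailing space in 'John %' rules out first names such as 'Johnny';
# '%' matches any sequence of characters after it.
query = """
    SELECT p.name, p.height
    FROM toy_player AS p
    JOIN toy_player_attributes AS a ON a.player_id = p.id
    WHERE a.jumping > 75 AND p.name LIKE 'John %'
    ORDER BY p.height ASC
"""
johns = conn.execute(query).fetchall()
conn.close()
```

One difference to keep in mind: `LIKE` is case-insensitive for ASCII characters in SQLite by default, whereas in PostgreSQL it is case-sensitive.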

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
expected_hash = 'b5e5851f0a161f23043ca27717d62d0f102a8f89906e3e6f5bbc0656a5bb0ef9'
assert hashlib.sha256(q3_answer.encode()).hexdigest() == expected_hash

### Q4. Find out which leagues have had at least one game where the visiting team scored over 7 goals

Write a query that gets the name of all **DISTINCT** leagues that have had at least one game where the visiting ("away") team has scored more than 7 goals.

Order the results **by name** in descending order. Create a list with the results, and assign it to variable q4_answer.

**Hints**: keep in mind you only want to select DISTINCT league names. Meaning: even if a league has had more than one game with the required goal count, we don't want its name to appear more than once in the result. For this, the [DISTINCT](https://www.postgresql.org/docs/current/static/sql-select.html#SQL-DISTINCT) keyword will be essential. Also, keep in mind that the league names are not available in the match table!
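
To see what `DISTINCT` changes, here is a small sketch on an in-memory SQLite database. The tables `toy_league` and `toy_match` and all their rows are made up for the illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE toy_league (id INTEGER, name TEXT);
    CREATE TABLE toy_match (id INTEGER, league_id INTEGER, away_team_goal INTEGER);
    INSERT INTO toy_league VALUES (1, 'Alpha League'), (2, 'Beta League'), (3, 'Gamma League');
    INSERT INTO toy_match VALUES (1, 1, 8), (2, 1, 9), (3, 2, 3), (4, 3, 8);
    """
)

base_query = """
    SELECT {distinct} l.name
    FROM toy_league AS l
    JOIN toy_match AS m ON m.league_id = l.id
    WHERE m.away_team_goal > 7
    ORDER BY l.name DESC
"""

# Without DISTINCT, 'Alpha League' appears once per qualifying match.
with_duplicates = [row[0] for row in conn.execute(base_query.format(distinct=""))]
# With DISTINCT, each league name appears at most once.
unique_names = [row[0] for row in conn.execute(base_query.format(distinct="DISTINCT"))]
conn.close()
```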

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
expected_hash = '43f63158b3cf7dc5024c77e3bc669a85fba1d9e26d52996616b5c487ddd494d3'
assert hashlib.sha256(str(q4_answer).encode()).hexdigest() == expected_hash

### Q5. Find out which country had the fewest matches

Write a query to find out the name of the country that has had the fewest football matches.

Assign this country to variable q5_answer_1.

Also find out how many matches were played in that country, and assign that value to q5_answer_2.

**Hint**: there isn't a direct connection between the matches and the country, but you can get there using an extra table.
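
A two-step join followed by a grouped count could look like the sketch below. It runs on an in-memory SQLite database whose tables (`toy_country`, `toy_league`, `toy_match`) and rows are invented; only the idea of going from matches to countries through leagues comes from the hint:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE toy_country (id INTEGER, name TEXT);
    CREATE TABLE toy_league (id INTEGER, country_id INTEGER);
    CREATE TABLE toy_match (id INTEGER, league_id INTEGER);
    INSERT INTO toy_country VALUES (1, 'Northland'), (2, 'Southland');
    INSERT INTO toy_league VALUES (10, 1), (20, 2);
    INSERT INTO toy_match VALUES (1, 10), (2, 10), (3, 10), (4, 20);
    """
)

# Matches link to leagues, leagues link to countries; GROUP BY then
# produces one row per country with its match count.
query = """
    SELECT c.name, COUNT(*) AS n_matches
    FROM toy_match AS m
    JOIN toy_league AS l ON l.id = m.league_id
    JOIN toy_country AS c ON c.id = l.country_id
    GROUP BY c.name
    ORDER BY n_matches ASC
    LIMIT 1
"""
country, n_matches = conn.execute(query).fetchone()
conn.close()
```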

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
expected_country_hash = '2275583196d791405892aaca0d87743c872f3fc0cf3308a6c3ef82528918aa8a'
assert hashlib.sha256(q5_answer_1.encode()).hexdigest() == expected_country_hash, "Wrong country!"

expected_matches_hash = 'fd53efd8940f305f79e212dc2e0a557d23eab8f2f60fbf219e19e3351b68e732'
assert hashlib.sha256(str(q5_answer_2).encode()).hexdigest() == expected_matches_hash, "Wrong number of matches!"

### Querying the FIFAdb with pandas

In these exercises, the goal is to query the FIFAdb using pandas.

### Q6. Find the maximum number of goals scored away by teams with high defence pressure

The connection settings to use in this exercise are the same ones as in the previous exercises.
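
SQLAlchemy accepts PostgreSQL connection URLs of the form `postgresql://user:password@host:port/database`. The sketch below assembles such a URL from the settings listed in Part A. The password is a made-up placeholder, and escaping it with `quote_plus` is a precaution in case the real one contains characters such as `@` or `/`:

```python
from urllib.parse import quote_plus

# Settings from Part A; the password here is a hypothetical placeholder.
settings = {
    "user": "ldsa_student",
    "password": "p@ss/word",
    "host": "batch4-s02-db-instance.ctq2kxc7kx1i.eu-west-1.rds.amazonaws.com",
    "port": 5432,
    "database": "batch4_s02_db",
}

db_url = (
    f"postgresql://{settings['user']}:{quote_plus(settings['password'])}"
    f"@{settings['host']}:{settings['port']}/{settings['database']}"
)

# The URL would then be handed to SQLAlchemy (not executed in this sketch):
# engine = sqlalchemy.create_engine(db_url)
```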

Write a query to find the team ID, short name and *max amount of goals scored when playing away* of the teams with a high "defencepressure" team attribute (*greater than 60*).

Search only for teams with:

* an *average number of goals conceded* (when playing away) greater than 1 (that is, an average number of goals scored by the home team greater than 1)
* more than 25 games played away, to reduce the number of statistically insignificant results.

Give the team ID column the `tid` alias. 

Order the results by the team short names in ascending order.

Assign the result to dataframe df6.
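
Conditions on aggregated values, such as an average or a game count, belong in a `HAVING` clause rather than in `WHERE`. Here is a minimal sketch on an in-memory SQLite database. The tables, rows, and the lowered game threshold (more than 2 instead of more than 25) are all invented for the toy data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE toy_team (id INTEGER, short_name TEXT);
    CREATE TABLE toy_team_attributes (team_id INTEGER, defencepressure INTEGER);
    CREATE TABLE toy_match (away_team_id INTEGER, home_team_goal INTEGER, away_team_goal INTEGER);
    INSERT INTO toy_team VALUES (1, 'AAA'), (2, 'BBB'), (3, 'CCC');
    INSERT INTO toy_team_attributes VALUES (1, 70), (2, 50), (3, 65);
    INSERT INTO toy_match VALUES
        (1, 2, 1), (1, 1, 3), (1, 2, 0),
        (2, 3, 4), (2, 2, 2), (2, 2, 1),
        (3, 0, 2), (3, 1, 5), (3, 1, 0);
    """
)

# WHERE filters rows before grouping; HAVING filters the groups afterwards.
query = """
    SELECT t.id AS tid, t.short_name, MAX(m.away_team_goal) AS max_away_goals
    FROM toy_team AS t
    JOIN toy_match AS m ON m.away_team_id = t.id
    WHERE t.id IN (
        SELECT team_id FROM toy_team_attributes WHERE defencepressure > 60
    )
    GROUP BY t.id, t.short_name
    HAVING AVG(m.home_team_goal) > 1 AND COUNT(*) > 2
    ORDER BY t.short_name ASC
"""
teams = conn.execute(query).fetchall()
conn.close()
```

In this toy data, team `BBB` is dropped by the pressure filter and team `CCC` by the average in `HAVING`. The sketch filters the attribute table in a subquery so that repeated attribute rows could not inflate the per-team counts; whether the real tables need that is worth checking by previewing them.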

In [None]:
# Create an engine that connects to the FIFAdb PostgreSQL database
# engine = sqlalchemy.create_engine(...)
# YOUR CODE HERE
raise NotImplementedError()


# Write the query as specified in the question
# query = ...
# YOUR CODE HERE
raise NotImplementedError()

# Use pandas read_sql_query function to read the query result into a DataFrame
# df6 = pd.read_sql_query(...)
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert type(engine) == sqlalchemy.engine.base.Engine
assert len(df6) == 3
assert len(df6.columns) == 3, "Are you sure you selected the requested columns?"

expected_hash = '72353f3e1b10c9a397090043256be11d2a5922815f8313ef638ffbeea5dcadce'
assert hashlib.sha256(df6.iloc[2]["tid"].astype(str).encode()).hexdigest() == expected_hash, "Wrong data"

expected_hash = 'fe34924d143b814542bfb9714341fa68ac9fca7a0b4eeda1b654abacae2d1a50'
assert hashlib.sha256(df6.iloc[1].short_name.encode()).hexdigest() == expected_hash, "Wrong data"

### Q7. Find out some attributes from players with fast reactions and high potential

In this exercise, we want to query a local SQLite database.
In order to do this, connect to the FIFAdb.sqlite database, as was done in the learning notebooks for the_movies.db. The database file we're using is in the **data** directory, and the table names are the same as in the PostgreSQL database.

Write a query that selects the player name, height, weight, sprint_speed, acceleration and shot_power for all players with a value of the `reactions` attribute greater than 85, and a value of the `potential` attribute greater than or equal to 90. Order these results by the `positioning` attribute in descending order.

Use pandas to read this query into a DataFrame called df7 with six columns: name, height, weight, sprint_speed, acceleration and shot_power.
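
Two details of this kind of query are easy to get wrong: strict versus non-strict bounds, and ordering by a column that is not selected. The sketch below shows both on an in-memory SQLite database, using the standard-library `sqlite3` module instead of SQLAlchemy to stay dependency-free. The tables and players are invented and carry fewer columns than the exercise asks for:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE toy_player (id INTEGER, name TEXT, height REAL);
    CREATE TABLE toy_player_attributes (
        player_id INTEGER, reactions INTEGER, potential INTEGER,
        positioning INTEGER, sprint_speed INTEGER
    );
    INSERT INTO toy_player VALUES
        (1, 'Ann Example', 180), (2, 'Bob Sample', 175),
        (3, 'Cy Demo', 185), (4, 'Di Mock', 170);
    INSERT INTO toy_player_attributes VALUES
        (1, 90, 92, 70, 80), (2, 88, 90, 85, 75),
        (3, 86, 89, 99, 90), (4, 85, 95, 99, 60);
    """
)

# `> 85` excludes a value of exactly 85, while `>= 90` keeps exactly 90.
# `positioning` drives the ordering without appearing in the SELECT list.
query = """
    SELECT p.name, p.height, a.sprint_speed
    FROM toy_player AS p
    JOIN toy_player_attributes AS a ON a.player_id = p.id
    WHERE a.reactions > 85 AND a.potential >= 90
    ORDER BY a.positioning DESC
"""
cursor = conn.execute(query)
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()
```

The same query string can be passed to `pd.read_sql_query` together with an engine. For a SQLite file, SQLAlchemy URLs take the form `sqlite:///` followed by the path to the file.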

In [None]:
# Create an engine that connects to the FIFAdb SQLite database
# engine = sqlalchemy.create_engine(...)
# YOUR CODE HERE
raise NotImplementedError()


# Write the query as specified in the question
# query = ...
# YOUR CODE HERE
raise NotImplementedError()

# Use pandas read_sql_query function to read the query result into a DataFrame
# df7 = pd.read_sql_query(...)
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert type(engine) == sqlalchemy.engine.base.Engine
assert len(df7) == 4
assert len(df7.columns) == 6, "Are you sure you selected the right number of columns?"
assert df7.columns.tolist() == ['name', 'height', 'weight', 'sprint_speed', 'acceleration', 'shot_power'], "Are you sure you selected the requested columns?"

expected_hash = 'e3ccd9684de593c7c6b6354cbe413d233959e7677258bfc3727d807e5900dce2'
assert hashlib.sha256(df7.loc[0, 'name'].encode()).hexdigest() == expected_hash, "Wrong data"

expected_hash = 'a9e2cdc1c1dab67f2dbd9694e8504ae974bf9e98cffeff654bf30cc8b9107423'
assert hashlib.sha256(str(df7.loc[2, 'height']).encode()).hexdigest() == expected_hash, "Wrong data"

expected_hash = '91a78f834681a13134c5cea155b51d1c832aec895b74fc42a3ab757aed5df8e2'
assert hashlib.sha256(str(df7.loc[3, 'shot_power']).encode()).hexdigest() == expected_hash, "Wrong data"

## Part B - Public APIs


-----------------------------------

In these exercises, the goal is to get data from a public API. We'll go full geek, and use a Pokemon API hosted by the LDSA for this BLU! (Credit for the data goes to GitHub user `fanzeyi`.)

The base URL of the API is the following: https://pokemon-api.lisbondatascience.org/

To complete the exercises, you'll have to open the API's documentation (`ui` endpoint) in your browser. More specifically, you'll have to find out which endpoints you can GET information from.
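
A GET request with filters is ultimately just a URL with a query string. The sketch below builds one with the standard library so the result can be inspected without touching the network. The endpoint name `example-endpoint` and the parameters `name` and `limit` are hypothetical; the real ones have to be read from the API documentation:

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://pokemon-api.lisbondatascience.org/"

# Hypothetical endpoint and query parameters, for illustration only.
endpoint = "example-endpoint"
params = {"name": "Char", "limit": 5}

full_url = urljoin(BASE_URL, endpoint) + "?" + urlencode(params)

# With the requests library, the equivalent call would be (not executed here):
# response = requests.get(urljoin(BASE_URL, endpoint), params=params)
```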

<br>

<img src="media/api-image.jpg" width=600>

<br>

### Q8. Find all of Charmander's evolutions!

As you might know, Pokemon evolve as they grow. Several Pokemon keep a similar name when they evolve. Let's consider my favourite starter Pokemon, Charmander:

<br>

<img src="media/charmander.png" width=300>

<br>

Use the API to find all Charmander's evolutions! You will have to get all Pokemon with `Char` in their name, and you'll also have to filter for "Fire" type Pokemon, since there are a couple of results unrelated to Charmander.

Extract their names from the `["name"]["english"]` attribute of each result, in the order they are returned, and assign the resulting list to the `q8_answer_names` variable.

Also extract their speeds (`["base"]["Speed"]`) and assign them to variable `q8_answer_speeds`.
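
Once a response has been parsed as JSON, extracting nested fields is plain list and dictionary work. The following sketch uses a made-up payload with invented creatures. The `name` and `base` keys follow the paths quoted above, while the `type` key holding a list of types is an assumption about the response format:

```python
import json

# Made-up payload; with requests, response.json() would do the parsing step.
payload = """[
    {"name": {"english": "Examplemon"}, "type": ["Fire"], "base": {"Speed": 10}},
    {"name": {"english": "Samplechar"}, "type": ["Water"], "base": {"Speed": 20}},
    {"name": {"english": "Demochar"}, "type": ["Fire", "Flying"], "base": {"Speed": 30}}
]"""
results = json.loads(payload)

# Keep only the entries that list "Fire" among their types (assumed key: "type").
fire_results = [entry for entry in results if "Fire" in entry["type"]]

# The comprehensions preserve the order in which the entries were returned.
names = [entry["name"]["english"] for entry in fire_results]
speeds = [entry["base"]["Speed"] for entry in fire_results]
```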

In [None]:
# Do an HTTP GET request to the Pokemon API to get information about
# all Pokemon with "Char" in their name
# response = ...
# q8_answer_names = ...
# q8_answer_speeds = ...

# YOUR CODE HERE
raise NotImplementedError()



In [None]:
assert type(q8_answer_names) == list, "Names must be in a list"
assert type(q8_answer_speeds) == list, "Speeds must be in a list"

names_hash = '4530988a30da58ce7b0045234c8499b1cc5bbf39412591a28bc49b876dba223c'
assert hashlib.sha256(str(q8_answer_names).encode()).hexdigest() == names_hash, "Wrong names!"

speeds_hash = 'e0919cb78353fd21778684cebea362f41ccaa283ce2aa8d86a190ccc9daec2aa'
assert hashlib.sha256(str(q8_answer_speeds).encode()).hexdigest() == speeds_hash, "Wrong speeds!"

### Q9. Find the strongest Pokemon moves!

Now, use a different endpoint to find out which Pokemon moves have a `power` stat of 150 or higher.

Extract their `enames` (english names) and assign the resulting list to variable `q9_answer`.
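
If the endpoint does not offer a way to filter by power directly, the filtering can be done on the client side. The sketch below uses invented moves and assumes each record has an `ename` and a `power` key, where `power` may be missing or null for some moves:

```python
# Invented records; the key names "ename" and "power" are assumptions.
moves = [
    {"ename": "Example Beam", "power": 150},
    {"ename": "Sample Tackle", "power": 40},
    {"ename": "Demo Dance", "power": None},
    {"ename": "Mock Blast", "power": 200},
]

# Skip records without a power value before comparing, since None >= 150
# would raise a TypeError.
strong_moves = [
    move["ename"]
    for move in moves
    if move.get("power") is not None and move["power"] >= 150
]
```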

In [None]:
# Do an HTTP GET request to find which Pokemon moves have 150 or more power.

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert type(q9_answer) == list, "Moves must be in a list!"
assert len(q9_answer) == 14, "Wrong number of moves!"

expected_moves_hash = 'a84955c5d99c54a3ed5b21d710bc2ac0c5bb48dfc11a441fdd3fb2943b70017f'
assert hashlib.sha256(str(q9_answer).encode()).hexdigest() == expected_moves_hash

## Part C - Web scraping

In this exercise, we're going to use web scraping to get data from the page of a former LDSA student, Bork Pawson!
Bork has kindly made his very simple and amateurish website available for us to scrape!

You can find his website here: https://s02-infrastructure.s3.eu-west-1.amazonaws.com/ldsa-bork/index.html

### Q10. Scrape Bork's ABSOLUTE favourite things in the world.

Bork has written down his five favourite things in the world. You can find them in a list on the website's sidebar.
Scrape the 5 items in order, using the `requests` and `BeautifulSoup` libraries, store them in a list, and assign it to the `q10_answer` variable. No cheating!
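
As a rough sketch of pulling list items out of a page, the example below parses an inline HTML snippet instead of a live website. The markup, the `sidebar` id and the `favourites` class are invented, so the selectors for Bork's page will differ:

```python
from bs4 import BeautifulSoup

# Invented markup standing in for a downloaded page (response.content).
html = """
<html><body>
  <div id="sidebar">
    <ul class="favourites">
      <li>First thing</li>
      <li> Second thing </li>
    </ul>
  </div>
  <ul><li>Unrelated item</li></ul>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Narrow the search to the one list of interest before collecting its items,
# so that <li> tags elsewhere on the page are left out.
sidebar_list = soup.find("ul", class_="favourites")
items = [li.get_text(strip=True) for li in sidebar_list.find_all("li")]
```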


In [None]:
# Assign the URL of the page to be scraped to variable url
# url = ...
# YOUR CODE HERE
raise NotImplementedError()

# Do a GET request to get the page content, using the url we've just defined
# response = ...
# YOUR CODE HERE
raise NotImplementedError()

# Instantiate a soup object using the response of the GET request
# YOUR CODE HERE
raise NotImplementedError()
    
# Now it's the tricky part!
# Parse the soup in order to retrieve the list of things.
# In the end, store the favourite things in a list and assign it to variable q10_answer.
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
expected_hash = 'b42c63516b06440a9481cbfbc100f23f2b47f68a008f3e073d6e67ce81a6b81e'
assert hashlib.sha256(str(sorted(q10_answer)).encode()).hexdigest() == expected_hash

### Q11. Find the tennis ball tag

Scrape the tag containing the tennis ball image that is in the center of the grid with Bork's favourite things.
Assign the tag (not the image content) to variable `q11_answer`.

Note: You'll have to find a different way to pass the attribute you want to filter, since the attribute name conflicts with an argument of the `find` function. You can figure out how to do this in the [BeautifulSoup documentation](https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find#the-keyword-arguments)!
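
To illustrate the kind of conflict the note describes, the sketch below uses an invented snippet where the images carry a `name` attribute. Whether Bork's page uses this particular attribute is an assumption; the point is the `attrs` dictionary, which accepts any attribute name:

```python
from bs4 import BeautifulSoup

# Invented markup; the attribute on the real page may be a different one.
html = '<div><img name="first" src="a.png"><img name="second" src="b.png"></div>'
soup = BeautifulSoup(html, "html.parser")

# `name` is find()'s own argument for the tag name, so this looks for a
# <second> tag and finds nothing.
wrong = soup.find(name="second")

# Passing the attribute through `attrs` avoids the clash.
tag = soup.find("img", attrs={"name": "second"})
```

The result is a tag object, so its attributes can be read like dictionary entries, for example `tag["src"]`.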

In [None]:
# Assign the URL of the page to be scraped to variable url
# url = ...
# YOUR CODE HERE
raise NotImplementedError()

# Do a GET request to get the page content, using the url we've just defined
# response = ...
# YOUR CODE HERE
raise NotImplementedError()

# Instantiate a soup object using the response of the GET request
# YOUR CODE HERE
raise NotImplementedError()

# Parse the soup in order to retrieve the tag of the tennis ball image.
# Assign it to variable q11_answer.
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
expected_hash = '369917cf8ea4d7906841cb6e6c264b124911e6d805bd122a23ffcee8fcb67de7'
assert hashlib.sha256(str(q11_answer).encode()).hexdigest() == expected_hash