<a href="https://colab.research.google.com/github/michalis0/BigScaleAnalytics/blob/master/week2/bsa_lab_sql_exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BSA Lab Week 2 - SQL in Python

There is an IPython extension that lets us run SQL from a notebook through so-called "magic" commands (`%...`). It supports several SQL engines (PostgreSQL, MySQL, etc.). For these exercises, we will use SQLite. Rather than being a full-fledged client-server database engine, SQLite is a library that can be embedded into any program.
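To make the "embedded" point concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module instead of the `%sql` magic. The table name `demo` and its contents are made up for illustration.

```python
import sqlite3

# ":memory:" creates a throwaway database inside this Python process;
# no database server has to be installed or started.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, label TEXT)")  # hypothetical table
conn.execute("INSERT INTO demo VALUES (1, 'hello')")

rows = conn.execute("SELECT id, label FROM demo").fetchall()
print(rows)
conn.close()
```

The `sqlite://` connection string used in the next cell likewise points at an in-memory database, so the tables disappear when the notebook session ends.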

In [28]:
%load_ext sql
%sql sqlite://

import pandas as pd

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


## <font color = 'green'>World Cup Data</font>

This dataset consists of two CSV files, `Players.csv` and `Teams.csv`, which have already been joined into a third one for your convenience (`PlayersExt.csv`). We are loading them directly from the GitHub repository, and then persisting the tables to our SQL database so that we can run SQL queries (as opposed to using pandas, for example).
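Conceptually, persisting a table means creating it in the database and copying the rows over. The sketch below mimics that with the standard library only. The inline CSV text, the `toy_players` table and its columns are invented for illustration, and this is not a description of how `%sql persist` works internally.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for a downloaded file.
csv_text = "surname,team,minutes\nAlpha,Redland,90\nBeta,Blueland,45\n"

reader = csv.DictReader(io.StringIO(csv_text))
records = [(r["surname"], r["team"], int(r["minutes"])) for r in reader]

conn = sqlite3.connect(":memory:")
conn.execute("DROP TABLE IF EXISTS toy_players")  # same reset step as in the lab cells
conn.execute("CREATE TABLE toy_players (surname TEXT, team TEXT, minutes INTEGER)")
conn.executemany("INSERT INTO toy_players VALUES (?, ?, ?)", records)

count = conn.execute("SELECT COUNT(*) FROM toy_players").fetchone()[0]
print(count)
```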

In [29]:
# Load Players
players_url = "https://raw.githubusercontent.com/michalis0/BigScaleAnalytics/master/week2/data/Players.csv"
Players = pd.read_csv(players_url, index_col=0, encoding="utf-8")
%sql drop table if exists Players;
%sql persist Players

# Load Teams
teams_url = "https://raw.githubusercontent.com/michalis0/BigScaleAnalytics/master/week2/data/Teams.csv"
Teams = pd.read_csv(teams_url, index_col=0, encoding="utf-8")
%sql drop table if exists Teams;
%sql persist Teams

# Load joined tables
playersext_url = "https://raw.githubusercontent.com/michalis0/BigScaleAnalytics/master/week2/data/PlayersExt.csv"
PlayersExt = pd.read_csv(playersext_url, index_col=0, encoding="utf-8")
%sql drop table if exists PlayersExt;
%sql persist PlayersExt

 * sqlite://
Done.
 * sqlite://
 * sqlite://
Done.
 * sqlite://
 * sqlite://
Done.
 * sqlite://


'Persisted playersext'

### Preview

Let's run our first SQL queries in order to see what the table attributes and values look like. Here are a couple of things to note about the queries:

* As you can see, any text following the `%%sql` command is interpreted as query language.
* For the SQL keywords (SELECT, FROM, GROUP BY, etc.), it doesn't matter whether you use lowercase or uppercase.
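* When you need to quote a string value, SQLite accepts both single quotes (' ') and double quotes (" "). Standard SQL reserves double quotes for identifiers such as column names, so single quotes are the safer habit.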
* You can also spread your statements across several lines for better readability.
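The notes on keyword case and multi-line statements can be checked with a small sketch. It uses Python's `sqlite3` module and a made-up table `toy`, and runs the same query twice in two different styles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy (name TEXT, score INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO toy VALUES (?, ?)", [("a", 1), ("b", 2), ("c", 3)])

# Uppercase keywords, single line.
upper = conn.execute("SELECT name FROM toy WHERE score > 1 ORDER BY name").fetchall()

# Lowercase keywords, spread over several lines.
lower = conn.execute("""
    select name
    from toy
    where score > 1
    order by name
""").fetchall()

print(upper == lower)
```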

In [30]:
%%sql
SELECT * FROM Players LIMIT 5

 * sqlite://
Done.


surname,team,position,minutes,shots,passes,tackles,saves
Abdoun,Algeria,midfielder,16,0,6,0,0
Belhadj,Algeria,defender,270,1,146,8,0
Boudebouz,Algeria,midfielder,74,3,28,1,0
Bougherra,Algeria,defender,270,1,89,11,0
Chaouchi,Algeria,goalkeeper,90,0,17,0,2


In [31]:
%%sql
select * from Teams limit 5

 * sqlite://
Done.


team,ranking,games,wins,draws,losses,goalsFor,goalsAgainst,yellowCards,redCards
Brazil,1,5,3,1,1,9,4,7,2
Spain,2,6,5,0,1,7,2,3,0
Portugal,3,4,1,2,1,7,1,8,1
Netherlands,4,6,6,0,0,12,5,15,0
Italy,5,3,0,2,1,4,5,5,0


In [32]:
%%sql
select *
from PlayersExt
limit 5

 * sqlite://
Done.


surname,team,ranking,games,wins,draws,losses,goalsFor,goalsAgainst,yellowCards,redCards,position,minutes,shots,passes,tackles,saves
Abdoun,Algeria,30,3,0,1,2,0,2,4,2,midfielder,16,0,6,0,0
Belhadj,Algeria,30,3,0,1,2,0,2,4,2,defender,270,1,146,8,0
Boudebouz,Algeria,30,3,0,1,2,0,2,4,2,midfielder,74,3,28,1,0
Bougherra,Algeria,30,3,0,1,2,0,2,4,2,defender,270,1,89,11,0
Chaouchi,Algeria,30,3,0,1,2,0,2,4,2,goalkeeper,90,0,17,0,2


### Basic Queries

The first two questions are already solved for you, so that you have concrete examples of queries. Try to solve the remaining four!

*1)  Which player on a team ending with "ia" played less than 200 minutes and made more than 100 passes? Return the player's surname and team.*

**Hint**: To check if attribute A contains a (sub)string S, use the LIKE keyword (e.g. `A like '%S%'`). The % sign indicates a wildcard.
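As a small illustration of where the wildcard can sit in a pattern, the sketch below runs three LIKE patterns against a made-up `cities` table (not part of the lab data), using Python's `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT)")  # hypothetical table
conn.executemany("INSERT INTO cities VALUES (?)",
                 [("Lausanne",), ("Geneva",), ("Lugano",), ("Basel",)])

def names(pattern):
    """Return the city names matching a LIKE pattern, sorted alphabetically."""
    query = "SELECT name FROM cities WHERE name LIKE ? ORDER BY name"
    return [row[0] for row in conn.execute(query, (pattern,))]

starts_with_l = names("L%")   # wildcard after the text: prefix match
ends_with_a = names("%a")     # wildcard before the text: suffix match
contains_an = names("%an%")   # wildcards on both sides: substring match
print(starts_with_l, ends_with_a, contains_an)
```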

In [33]:
%%sql
select surname, team
from Players
where Players.team like '%ia'
  and Players.minutes < 200 
  and Players.passes > 100

 * sqlite://
Done.


surname,team
Kuzmanovic,Serbia


*2) Find all players who made more than 20 shots. Return all player information in descending order of shots made.*

**Hint**: Sorting results is done via the ORDER BY keyword. The default order is ascending (ASC). If you want descending order, add DESC after the column it applies to (e.g. `ORDER BY column_1, column_2 DESC` sorts by `column_1` ascending, then by `column_2` descending).
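The following sketch shows that DESC applies per column. The `results` table and its rows are invented for illustration; the query is run through Python's `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (athlete TEXT, club TEXT, points INTEGER)")  # hypothetical
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", [
    ("Ana", "North", 7), ("Ben", "South", 9), ("Cem", "North", 12), ("Dia", "South", 4),
])

# club is sorted ascending (the default); points is sorted descending within each club.
ordered = conn.execute(
    "SELECT athlete, club, points FROM results ORDER BY club, points DESC"
).fetchall()
for row in ordered:
    print(row)
```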

In [34]:
%%sql
select *
from Players
where shots > 20
order by shots desc

 * sqlite://
Done.


surname,team,position,minutes,shots,passes,tackles,saves
Gyan,Ghana,forward,501,27,151,1,0
Villa,Spain,forward,529,22,169,2,0
Messi,Argentina,forward,450,21,321,10,0


*3) Find the goalkeepers of teams that played more than four games. List the surname of the goalkeeper, the team, and the number of minutes the goalkeeper played.*

**Hint**: Use the `PlayersExt` table.

In [None]:
%%sql
YOUR QUERY HERE

*4) How many players on a team with a ranking lower than 10 played more than 350 minutes? Return a single number in a column named "superstar".*

**Hint**: To rename a column, use the AS keyword (e.g. `SELECT column_1 AS label`).
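A quick sketch of AS combined with COUNT, on a made-up `books` table: the alias becomes the column name reported by the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, pages INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO books VALUES (?, ?)", [("A", 120), ("B", 480), ("C", 650)])

cursor = conn.execute("SELECT COUNT(*) AS long_books FROM books WHERE pages > 400")
column_name = cursor.description[0][0]  # name of the first result column
value = cursor.fetchone()[0]
print(column_name, value)
```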

In [None]:
%%sql
YOUR QUERY HERE

*5) What is the average number of passes made by forwards? What about midfielders? Write one query that returns both values with the corresponding position.*

**Hint**: Use the GROUP BY keyword. GROUP BY statements are often used in conjunction with aggregate functions like AVG(), SUM() or COUNT().
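To see how GROUP BY and an aggregate function work together, here is a sketch on an invented `orders` table: the rows are split into one group per customer, and AVG() is computed separately for each group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("x", 10), ("x", 30), ("y", 5), ("y", 15), ("y", 10)])

averages = conn.execute("""
    SELECT customer, AVG(amount) AS avg_amount
    FROM orders
    GROUP BY customer
    ORDER BY customer
""").fetchall()
print(averages)
```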

In [None]:
%%sql
YOUR QUERY HERE

*6) Which team has the highest average number of passes per minute played? Return the team's name and average number of passes per minute.*

**Hint #1**: You can compute a team's average number of passes per minute played by dividing the total number of passes by the total number of minutes. To force floating point division, multiply one operand by 1.0.

**Hint #2**: Consider using the LIMIT keyword.
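Both hints can be tried out on toy data first. The sketch below compares integer and floating-point division in SQLite, then uses ORDER BY with LIMIT to keep only the top row. The `runs` table and its values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dividing two integers truncates; multiplying one operand by 1.0 avoids that.
int_div, float_div = conn.execute("SELECT 7 / 2, 7 * 1.0 / 2").fetchone()
print(int_div, float_div)

conn.execute("CREATE TABLE runs (runner TEXT, km INTEGER, hours INTEGER)")  # hypothetical
conn.executemany("INSERT INTO runs VALUES (?, ?, ?)",
                 [("Pat", 21, 2), ("Sam", 10, 1), ("Lee", 42, 5)])

# Sort by the computed ratio and keep only the first row.
fastest = conn.execute("""
    SELECT runner, km * 1.0 / hours AS speed
    FROM runs
    ORDER BY speed DESC
    LIMIT 1
""").fetchone()
print(fastest)
```

With integer division, Pat and Sam would both come out at 10, so the ranking would no longer be meaningful.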

In [None]:
%%sql
YOUR QUERY HERE

### Advanced Queries

Now, on to more challenging questions...

*1) Find all pairs of teams that have the same number of `goalsFor` as well as the same number of `goalsAgainst` as each other. Return the team pairs and their respective numbers of `goalsFor` and `goalsAgainst` (make sure to return each pair only once!).*

**Hint**: You basically need to do a "self join" of the `Teams` table. For that, you need to join different name aliases of the same table. Check [this page](https://www.w3schools.com/sql/sql_join_self.asp) for help.
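As an illustration of the aliasing idea on unrelated toy data, the sketch below joins a made-up `employees` table with itself to list colleagues who share an office. The `a.name < b.name` condition is one possible way (an assumption of this sketch, not the only option) to avoid matching a row with itself and to avoid listing each pair twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, office TEXT)")  # hypothetical table
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ana", "Bern"), ("Ben", "Bern"), ("Cem", "Zug"), ("Dia", "Bern")])

# The same table appears twice under two aliases, a and b.
pairs = conn.execute("""
    SELECT a.name, b.name, a.office
    FROM employees AS a
    JOIN employees AS b ON a.office = b.office
    WHERE a.name < b.name
    ORDER BY a.name, b.name
""").fetchall()
for pair in pairs:
    print(pair)
```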

In [None]:
%%sql
YOUR QUERY HERE

*2) Find all teams with a ranking below 30 where no player has made more than 150 passes. Return the team's name and ranking.*

**Hint #1**: Consider using the HAVING keyword.

**Hint #2**: You may also want to look up nested queries.
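The two techniques named in the hints look as follows on an invented `sales` table. This is only a sketch of the syntax, not a solution to the question above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (shop TEXT, amount INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("s1", 5), ("s1", 7), ("s2", 50), ("s2", 3), ("s3", 2)])

# HAVING filters whole groups after aggregation; WHERE filters single rows before it.
big_shops = conn.execute("""
    SELECT shop, SUM(amount) AS total
    FROM sales
    GROUP BY shop
    HAVING SUM(amount) > 10
    ORDER BY shop
""").fetchall()

# Nested query: the inner SELECT computes one value that the outer WHERE uses.
above_average = conn.execute("""
    SELECT shop, amount
    FROM sales
    WHERE amount > (SELECT AVG(amount) FROM sales)
""").fetchall()
print(big_shops, above_average)
```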

In [None]:
%%sql
YOUR QUERY HERE

*3) Which team has the highest ratio of goalsFor to goalsAgainst?*

In [None]:
%%sql
YOUR QUERY HERE

*4) Find all teams whose defenders averaged more than 150 passes. Return the team and average number of passes by defenders, in descending order of average passes.*

In [None]:
%%sql
YOUR QUERY HERE

## <font color = 'green'>Titanic Data</font>

This dataset contains the list of passengers who were on board the Titanic.

Personal information such as their name, gender and age is shown. We can also see information about their journey (which class they were travelling in, how much they paid, etc.), and whether they survived or not.

Feel free to do these exercises on your own time in order to prepare for part 1 of the personal assignment.

In [35]:
# Load table from CSV file
titanic_url = "https://raw.githubusercontent.com/michalis0/BigScaleAnalytics/master/week2/data/Titanic.csv"
Titanic = pd.read_csv(titanic_url, index_col=0, encoding="utf-8")
%sql drop table if exists Titanic;
%sql persist Titanic

 * sqlite://
Done.
 * sqlite://


'Persisted titanic'

### Preview

In [36]:
%%sql
select * from Titanic limit 5

 * sqlite://
Done.


last,first,gender,age,class,fare,embarked,survived
Braund,Mr. Owen Harris,M,22.0,3,7.25,Southampton,no
Cumings,Mrs. John Bradley (Florence Briggs Thayer),F,38.0,1,71.2833,Cherbourg,yes
Heikkinen,Miss Laina,F,26.0,3,7.925,Southampton,yes
Futrelle,Mrs. Jacques Heath (Lily May Peel),F,35.0,1,53.1,Southampton,yes
Allen,Mr. William Henry,M,35.0,3,8.05,Southampton,no


### Basic Queries

*1) How many married women over age 50 embarked in Cherbourg?*

**Hint**: You will need to use wildcards.

In [None]:
%%sql
YOUR QUERY HERE

*2) List the average fare paid by passengers in each of the embarkation cities (along with the city), in descending order of average fare.*

In [None]:
%%sql
YOUR QUERY HERE

*3) What is the most common last name among passengers?*

In [None]:
%%sql
YOUR QUERY HERE

*4) Write 3 queries:*

* *Total number of passengers*
* *Number of passengers under 30*
* *Number of passengers 30 or older*

*Why do the second and third numbers not add up to the first?*

In [None]:
%%sql
YOUR QUERY HERE

In [None]:
%%sql
YOUR QUERY HERE

In [None]:
%%sql
YOUR QUERY HERE

Blanks in SQL tables are given the special value `null`. Conditions like `A is null` and `A is not null` can be used in WHERE clauses to check whether attribute A is blank or not. Ordinary comparisons such as `A < 30` or `A >= 30` are never true for a null value.
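The behaviour of null in comparisons can be observed on a made-up `readings` table, where Python's `None` is stored as null:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")  # hypothetical table
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("a", 1.5), ("b", None), ("c", 4.0), ("d", None)])

def count(condition):
    """Count the rows of readings that satisfy the given WHERE condition."""
    return conn.execute(f"SELECT COUNT(*) FROM readings WHERE {condition}").fetchone()[0]

below = count("value < 2")
at_least = count("value >= 2")
missing = count("value IS NULL")
equals_null = count("value = NULL")  # never true, even for null rows
print(below, at_least, missing, equals_null)
```

Rows with a null value match neither `value < 2` nor `value >= 2`, and `= NULL` matches nothing, which is why `IS NULL` exists as a separate test.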

*5) How many passengers don't have a value for age? Now do your numbers add up?*

In [None]:
%%sql
YOUR QUERY HERE

*6) How many passengers were in each of the following categories, and what was their average fare paid?*

* *Male survivors*
* *Female survivors*
* *Male non-survivors*
* *Female non-survivors*

In [None]:
%%sql
YOUR QUERY HERE

### Advanced Queries

*1) Are there any pairs of passengers with the same last name where one is in first class and the other is in third class?*

*If so, return the last name and each person's first name. Label the first name column as "first" for the passenger in first class, and "third" for the passenger in third class.*

In [None]:
%%sql
YOUR QUERY HERE

*2) Which embarkation cities have more than 40 passengers whose age is missing?*

In [None]:
%%sql
YOUR QUERY HERE

*3) Find all classes where the average fare paid by passengers in that class was more than twice the overall average or less than half the overall average.*

In [None]:
%%sql
YOUR QUERY HERE

*4) EXTRA DIFFICULT CHALLENGE: List each class and its survival rate, i.e., the fraction of passengers in that class who survived.*

In [None]:
%%sql
YOUR QUERY HERE

### Titanic Data Modification

Here again, we are giving you the solution for the first two questions to get you acquainted with the modification syntax.

Note: You may want to reload the CSV frequently to reset the data as you experiment with modifications.

In [None]:
# Load table from CSV file
titanic_url = "https://raw.githubusercontent.com/michalis0/BigScaleAnalytics/master/week2/data/Titanic.csv"
Titanic = pd.read_csv(titanic_url, index_col=0, encoding="utf-8")
%sql drop table if exists Titanic;
%sql persist Titanic

*1) Subtract 5 from the fare paid by all passengers under the age of 10. Then compute the new average fare similar to question 2 in the previous section.*

**Hint #1**: You can put two SQL statements in one cell separated by a semicolon.

**Hint #2**: Use the UPDATE and SET keywords to modify a column (see [documentation](https://www.w3schools.com/sql/sql_update.asp)).

In [None]:
%%sql
update Titanic
set fare = fare - 5
where age < 10
;
select embarked, avg(fare) as avg_fare
from Titanic
group by embarked
order by avg_fare desc

*2) Create a new table called "Survivors" containing the first and last names of all passengers who survived. Then count the number of tuples in the new table.*

**Hint**: Use the CREATE TABLE keyword.

In [None]:
%%sql
drop table if exists Survivors
;
create table Survivors as
select first, last
from Titanic 
where survived = 'yes'
;
select count(*) as count from Survivors

*3) In the Titanic table, delete all but the passengers who paid more than 300. Then count the number of tuples in the table.*

**Hint**: Use the DELETE FROM keyword.
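Here is a sketch of the DELETE FROM syntax on an invented `tasks` table. Note that the WHERE clause selects the rows to remove, and that omitting it would delete every row in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (title TEXT, done INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO tasks VALUES (?, ?)",
                 [("write", 1), ("test", 0), ("ship", 1)])

# Remove the finished tasks; unfinished ones stay in the table.
cursor = conn.execute("DELETE FROM tasks WHERE done = 1")
deleted = cursor.rowcount
remaining = conn.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
print(deleted, remaining)
```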

In [None]:
%%sql
YOUR QUERY HERE

*4) In what's left of the table after question 3, insert a new tuple for yourself. You can decide your class, fare, where you embarked, and whether you survived. Then show the whole table.*

**Hint**: Use the INSERT INTO keyword.
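The two common forms of INSERT INTO are sketched below on a made-up `pets` table: one supplies a value for every column in table order, the other names the columns it fills and leaves the rest null.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (name TEXT, species TEXT, age INTEGER)")  # hypothetical

# Form 1: one value per column, in the order the columns were declared.
conn.execute("INSERT INTO pets VALUES ('Milo', 'cat', 3)")

# Form 2: list the target columns explicitly; age is not given and stays null.
conn.execute("INSERT INTO pets (name, species) VALUES ('Rex', 'dog')")

rows = conn.execute("SELECT name, species, age FROM pets ORDER BY name").fetchall()
print(rows)
```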

In [None]:
%%sql
YOUR QUERY HERE