# CS145: Project 2
## Part 1 | Exploring World Bank Datasets with Colaboratory (40 points)
---



### Notes (read carefully!):

* Be sure you read the instructions on each cell and understand what it is doing before running it.
* Don't forget that you can always re-download the starter notebook from the course website if you need to.
* You may create new cells to use for testing, debugging, exploring, etc., and this is in fact encouraged!
**Just make sure that the final answer for each question is _in its own cell_ and _clearly indicated_**.
* Colab will not warn you about how many bytes your SQL query will consume.  **Be sure to check on the BigQuery UI first before running queries here!**
* See the assignment handout for submission instructions.
* Have fun!

## Collaborators:
Please list the names and SUNet IDs of your collaborators below:
* *Name, SUNet ID*

## Setting up BigQuery and Dependencies



Run the two cells below (shift + enter) to authenticate your project and import the libraries you'll need. 

Note that you need to fill in the `project_id` variable with the Google Cloud project id you are using for this course.  You can see your project ID by going to https://console.cloud.google.com/cloud-resource-manager

In [0]:
# Run this cell to authenticate yourself to BigQuery.
from google.colab import auth
auth.authenticate_user() 
project_id = "third-diorama-233818"

In [0]:
# Some imports you will need
import pandas as pd
import altair as alt

### How to BigQuery in Colab

Jupyter notebooks (what Colab notebooks are based on) have a concept called "magics".
If you write the following line at the top of a `Code` cell:

```
%%bigquery --project $project_id variable # this is the key line
SELECT ....
FROM ...
```

The `%%bigquery` magic converts the cell into a SQL cell. The resulting table is saved into `variable` as a data frame. Then in a second cell:

```python
alt.Chart(variable).mark_line().encode(
...
)
```

You can use the variable like so in order to create a chart! 

# Section 1 | Schema Design!

---





The World Bank collects and aggregates data from many public sources around the world and publishes it online. Google BigQuery has made this data available for us to play with, and it contains a tremendous number of metrics about nation-level activity and outcomes. 

For this project, we will be using the [`world_bank_health_population`](https://bigquery.cloud.google.com/dataset/bigquery-public-data:world_bank_health_population) public dataset.

## Question 1: Describe the dataset of the World Bank (1 point)

If you had to describe the way the data is set up in the World Bank datasets (any of the four datasets, as they have identical structure), what would you say about it? 
**Note:** The rest of this question is about going through the structure of the dataset, so we only need a couple of sentences here. We're looking for your impressions - what do you notice?

*The World Bank dataset describes the health status of every country from 1960 to 2018.*

---

## Meet OKV, the Anti-Schema

---

**`OKV`** stands for `Object-Key-Value` [[1]](https://colab.research.google.com/drive/1asMZjcxwBqGpurcPOZ6pmtdUaXipWdmX#scrollTo=ZvK6dAU2k8Qa&line=4&uniqifier=1). It's a way of storing information in a database that has the opposite of a schema - you're free to define any property you'd like on any object. Think of it like a gigantic hashmap (10B rows is nothing with this kind of model [[2]](https://colab.research.google.com/drive/1asMZjcxwBqGpurcPOZ6pmtdUaXipWdmX#scrollTo=ZvK6dAU2k8Qa&line=4&uniqifier=1)) for every variable on every object in your system. 



Here's one way a table for such a system could be laid out:


|  object  |             key          |  value  
|-------------|---------------------------|------
|    102    |        "name"         |  "John Watson"
|    103    |        "name"         |  "Sherlock Holmes"
|    102    |       "address"      |  "221B Baker Street, London, UK"
|    107    |        "name"         |  "Oprah Winfrey"
|    103    |       "address"      |  "221B Baker Street, London, UK"
|    102    |        "canes"        |  26
|    103    |  "cases_solved" |  60




As you can see, the three objects in this table all have different "shapes" (another word for "schema" in systems that don't have a formal schema). 

When you want to query for an object, you will need a query like:

```sql
SELECT key, value
FROM table
WHERE object = 102
```

Then merging all the rows in the output will give you the object!
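
As a minimal sketch of that query-then-merge step, the snippet below loads the example table into an in-memory SQLite database and folds the rows for one object into a dictionary. The table name `okv` and the helper `load_object` are illustrative assumptions, not part of the assignment.

```python
import sqlite3

# In-memory database holding the example OKV table from above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE okv (object INTEGER, key TEXT, value)")
conn.executemany(
    "INSERT INTO okv VALUES (?, ?, ?)",
    [
        (102, "name", "John Watson"),
        (103, "name", "Sherlock Holmes"),
        (102, "address", "221B Baker Street, London, UK"),
        (107, "name", "Oprah Winfrey"),
        (103, "address", "221B Baker Street, London, UK"),
        (102, "canes", 26),
        (103, "cases_solved", 60),
    ],
)


def load_object(conn, object_id):
    """Fetch every (key, value) row for one object and merge them into a dict."""
    rows = conn.execute(
        "SELECT key, value FROM okv WHERE object = ?", (object_id,)
    )
    return dict(rows)


watson = load_object(conn, 102)
print(watson)
```

Each object comes back with only the keys it actually has, which is what "different shapes" means in practice.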


### Notes
1. Other three-word versions of this idea include ID-Key-Value, Object-Property-Value, Entity-Attribute-Value, and Entity-Property-Value, each with its own three-letter acronym (TLA).
2. This is because the number of rows in an OKV store equals the number of cells in a regular table.
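
To make note 2 concrete, here is a small sketch that unpivots a regular (wide) table into OKV rows. It assumes the wide table has one column per property seen in the example, and that empty cells are simply not stored, so the OKV row count equals the number of populated cells.

```python
# A regular (wide) table: one row per object, one column per property.
# None marks a property the object does not have.
columns = ["name", "address", "canes", "cases_solved"]
wide_table = {
    102: ["John Watson", "221B Baker Street, London, UK", 26, None],
    103: ["Sherlock Holmes", "221B Baker Street, London, UK", None, 60],
    107: ["Oprah Winfrey", None, None, None],
}

# Unpivot: every populated cell becomes one (object, key, value) row.
okv_rows = [
    (obj, key, value)
    for obj, cells in wide_table.items()
    for key, value in zip(columns, cells)
    if value is not None
]

total_cells = len(wide_table) * len(columns)
print(len(okv_rows), "OKV rows from", total_cells, "cells")
```

For a dense table the two counts coincide; for a sparse one like this, the OKV layout stores fewer entries than the wide table has cells.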


### Further Reading (Optional)
* Wikipedia [article](https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model) on this structure of data
* This (OKV store, and database design in general) is a great thing to ask Shiva about during his office hours!

## Question 2: Ruminating about OKVs (6 points)

Compare and contrast the OKV store to the traditional relational table (the kind you see in class) - 
What are the advantages? What are some of the difficulties? 
#### (Keep your answer around 200 words, bullet-points suggested!)

### Hints
* One helpful checklist is the acronym **CRUD**, which stands for **Create**, **Read**, **Update**, **Delete**. These are the four basic data operations. You can create/read/update/delete the values in a database, or the schema of a database (eg, add/delete a property, or change its type). 

* When you think about advantages and disadvantages in software, some of the common desirable attributes are - performance (how fast queries run), memory footprint (less is better), maintainability (can the database gracefully change with product requirements?), and code complexity (does a db design encourage large, unwieldy queries? Programmers are human after all - the more complexity, the more bugs!). Combining these attributes with the operations above should give you a place to start your comparison.

* An extra hint: We haven't talked about performance in databases yet, but for the purposes of this question, you can think in three tiers:
  * Lookup: You have a key of a table, and you want the row (or some subset of the row). You can conceptualize this as O(1)
  * Scan: When you need to lookup rows by some criteria (eg, people taller than 6 feet). Height is not a key, so the database has to scan all the rows. You can conceptualize this as O(N). Scanning also happens when writing a property to many rows at once.
  * Join: When you join tables together, it creates a Cartesian Product of sets (also called pairs). If there are N rows in a table, and another table has roughly the same number of rows (so N there too), then you can think of a two-table join as O(N^2).
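
The three tiers can be sketched in plain Python, under the simplifying assumptions that a keyed table behaves like a dictionary and that a join is a naive nested loop (real databases often do better than this). The `people` table and its heights are made up for illustration.

```python
# Hypothetical table of people keyed by id; heights are made up.
people = {i: {"id": i, "height_in": 60 + (i % 20)} for i in range(1000)}

# Lookup: the key is known, so a hash lookup touches one row -> O(1).
row = people[42]

# Scan: height is not a key, so every row must be inspected -> O(N).
tall = [p["id"] for p in people.values() if p["height_in"] > 72]


def nested_loop_join(left, right, predicate):
    """Naive join: compare every left row with every right row -> O(N^2)."""
    comparisons = 0
    matches = []
    for l_row in left:
        for r_row in right:
            comparisons += 1
            if predicate(l_row, r_row):
                matches.append((l_row, r_row))
    return matches, comparisons


small = list(people.values())[:50]
pairs, comparisons = nested_loop_join(
    small, small, lambda l_row, r_row: l_row["id"] == r_row["id"]
)
print(len(tall), len(pairs), comparisons)
```

Joining just 50 rows with 50 rows already costs 2,500 comparisons, which is why the number of joins a design forces on you matters.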


---

# OKV vs Relational Databases

*   **Performance**:
    1.  Create - O(1) for both.
    2.  Read   - O(1) for both.
    3.  Update - O(1) for both.
    4.  Delete - O(1) for both.

*   **Memory**:
    For sparse data, OKV is better than a relational table because null-valued attributes are simply not stored. For dense data, the relational model saves space because OKV repeats the object id on every row.
*   **Maintainability**:
    If the set of attributes keeps changing, the relational design gets more complicated. In those cases OKV works well because attributes can be added dynamically.

*   **Code Complexity**:
    As queries grow more complex, they are harder to write against OKV than against relational tables.
---

## One More Thing - Property Names




As you learned in class, redundant data in a table is undesirable, because if you want to change something, then you have to update all the rows corresponding to that value (which is very expensive; remember that an OKV model has way more rows than a traditional table). This is also called an **update anomaly**.

How to solve this issue? Simple, a property table:

```sql
-- Schema (in the syntax of the Gradiance homework):
Property(id, name)
Data(id, pid, value)
```

So we'd take the table above, and replace it with:

Property Table:

  
| id | name |
|-----|----------|
| 1  | "name"
| 2  | "address"
| 3  | "canes"
| 4  | "cases_solved"


Data Table:

|    id    |  pid   |  value |
|--------|-------------|-------------|
|  102   |  1      |  "John Watson"
|  103   |  1      |  "Sherlock Holmes"
|  102   |  2      |  "221B Baker Street, London, UK"
|  107   |  1      |  "Oprah Winfrey"
|  103   |  2      |  "221B Baker Street, London, UK"
|  102   |  3      |  26
|  103   |  4      |  60
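
A minimal sketch of the two-table layout, again using in-memory SQLite. The lower-case table names and the `load_object` helper are assumptions for illustration. It shows the price (every read now needs a join) and the payoff (renaming a property touches a single row).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE property (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE data (id INTEGER, pid INTEGER REFERENCES property(id), value);
""")
conn.executemany(
    "INSERT INTO property VALUES (?, ?)",
    [(1, "name"), (2, "address"), (3, "canes"), (4, "cases_solved")],
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        (102, 1, "John Watson"),
        (103, 1, "Sherlock Holmes"),
        (102, 2, "221B Baker Street, London, UK"),
        (107, 1, "Oprah Winfrey"),
        (103, 2, "221B Baker Street, London, UK"),
        (102, 3, 26),
        (103, 4, 60),
    ],
)


def load_object(conn, object_id):
    """Join data with property so the result is keyed by property name."""
    rows = conn.execute(
        "SELECT p.name, d.value FROM data AS d "
        "JOIN property AS p ON p.id = d.pid WHERE d.id = ?",
        (object_id,),
    )
    return dict(rows)


# Renaming a property updates one row in property, not every row in data.
cursor = conn.execute("UPDATE property SET name = 'full_name' WHERE id = 1")
renamed_rows = cursor.rowcount
holmes = load_object(conn, 103)
print(renamed_rows, holmes)
```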


## Question 3: Rumination Redux (2 points)
Update your comparison above to account for the property table - which things have been made easy to do with this change, and which things are still hard to do? 

Please only comment on the differences - you don't need to redo the whole analysis here.

---
**Pros**: 
*  By placing the key in another table, the redundant string data is replaced by redundant integer data. Strings take up more space than integers, and searching for strings is more complicated than searching for integers.
*  Update and delete anomalies are removed.

**Cons**:
*   An extra table needs to be added.





---

## One MORE Thing - Types! 

In `SQL`, every column is required to have a type [[1]](https://colab.research.google.com/drive/1asMZjcxwBqGpurcPOZ6pmtdUaXipWdmX#scrollTo=inVjXmRsiXc2&line=3&uniqifier=1). So when we mixed *string* values and *int* values in a single column, that was a simplification. 

There are multiple design options that work for this situation - see the next question for a discussion of them.


#### Notes
1. There are databases that do not offer this feature/suffer from this curse (depending on your opinions), where you are free to put whatever value you'd like in the third column as shown above. If you're curious why anyone would want that - consider the difference between a static vs dynamically typed programming language (e.g. Java vs Python). There are similar tradeoffs in choosing a database that has typed schema vs type agnostic schema. This is another great thing to talk to Shiva about if you have more questions. 

## Question 4:  A Friendly Dialogue (6 points)



(PS - This is a very common software engineering interview question!).

Your well-meaning friend tries to implement an OKV store in SQL and runs into the snag above. Let's say, for simplification, they are only interested in storing *string* values and *int* values (you can imagine extending this to more types).

They propose the following solution (assume the property table also exists, it's not shown here):

id    | pid | string_value       | int_value
------|-------|---------------------------|--------
102 | 1   | "Sherlock Holmes" | null
103 | 1   | "John Watson"     | null
102 | 3   | null                        | 60


To explain: if the value is a *string* value, then fill the `string_value` column and fill the `int_value` column with null; vice versa if there's an *int* value. 
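
To see what reading from this layout involves, here is a rough sketch in SQLite using the three rows shown above. The helper `read_value` is hypothetical. Note that SQLite enforces column types only loosely, so this illustrates the layout rather than strict typing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (id INTEGER, pid INTEGER, string_value TEXT, int_value INTEGER)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        (102, 1, "Sherlock Holmes", None),
        (103, 1, "John Watson", None),
        (102, 3, None, 60),
    ],
)


def read_value(conn, object_id, pid):
    """Return whichever typed column is populated for this (id, pid) pair."""
    row = conn.execute(
        "SELECT string_value, int_value FROM data WHERE id = ? AND pid = ?",
        (object_id, pid),
    ).fetchone()
    if row is None:
        return None
    string_value, int_value = row
    # The reader has to inspect every typed column to find the real value.
    return string_value if string_value is not None else int_value


# Every row stores one NULL per unused type column.
null_cells = conn.execute(
    "SELECT SUM(string_value IS NULL) + SUM(int_value IS NULL) FROM data"
).fetchone()[0]
print(read_value(conn, 102, 1), read_value(conn, 102, 3), null_cells)
```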


### a) What's wrong with this picture? (2 points)

If you had to critique this design, what would you tell your friend? Again, what is difficult or not desirable about this design?

---
The whole point of using OKV is to avoid storing null data, and this design brings the nulls back.

**Performance**: 
  *   As the number of data types increases, retrieving a value takes longer, because each read has to check which typed column is non-null before returning it.
  
  
 **Memory**: 
  *   The number of columns grows with the number of data types.

---

### b) Your alternative (2 points)

Come up with an alternative design (hint: you may need more than 1 table) that addresses some of the concerns you brought up above.


---
*   **Approach 1**: Separate data into different tables based on their data types.
*   **Approach 2**: *ANYDATA* data type.
*   **Approach 3**:  Convert all the data to string.
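
Approach 1 could look roughly like the sketch below. The table names `string_data` and `int_data`, the sample rows, and the helper are assumptions for illustration. Since SQLite is loosely typed, treat this as a picture of the layout, not of type enforcement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE string_data (id INTEGER, pid INTEGER, value TEXT);
    CREATE TABLE int_data (id INTEGER, pid INTEGER, value INTEGER);
""")
conn.executemany(
    "INSERT INTO string_data VALUES (?, ?, ?)",
    [(102, 1, "John Watson"), (103, 1, "Sherlock Holmes")],
)
conn.executemany(
    "INSERT INTO int_data VALUES (?, ?, ?)",
    [(102, 3, 26), (103, 4, 60)],
)


def load_object(conn, object_id):
    """Collect one object's values from every per-type table."""
    result = {}
    # Table names come from this fixed tuple, never from user input.
    for table in ("string_data", "int_data"):
        query = f"SELECT pid, value FROM {table} WHERE id = ?"
        for pid, value in conn.execute(query, (object_id,)):
            result[pid] = value
    return result


watson = load_object(conn, 102)
print(watson)
```

No nulls are stored, but reading a whole object now takes one query per type table.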


---


### c)  No free lunch! (2 points)

Your friend looks at your design - what critique would he offer you? What is made difficult or not desirable about your design?

---
# Approach 1:
*   The number of tables grows with the number of data types.
*   Every query has to decide which table to read based on the data type of the value.
*   String and integer data live in separate tables, so retrieving both kinds of value for one object means querying several tables.

# Approach 3:
*   Querying non-string data becomes difficult, because every value has to be converted back from a string first.

---

## Apply Your Knowledge!

The World Bank data is structured as an OKV... with a small twist. The tables that contain the information are more or less arranged as follows:

```
object | key | year | value
```

Where `object` = country code & `key` = indicator code.

There are a couple of other fields (a description for the object and the key), but the overall structure is still OKV.
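
As a rough illustration of the twist, the sketch below builds a tiny SQLite table in the `object | key | year | value` shape and reads one time series from it. The table name `indicators` and the helper are hypothetical; the sample rows are copied from query outputs shown later in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; the columns mirror object | key | year | value.
conn.execute(
    "CREATE TABLE indicators "
    "(country_code TEXT, indicator_code TEXT, year INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO indicators VALUES (?, ?, ?, ?)",
    [
        ("USA", "SP.POP.0014.TO", 1991, 55176304.0),
        ("USA", "SP.POP.0014.TO", 1983, 51201638.0),
        ("USA", "SP.POP.0014.TO", 2015, 61653419.0),
        ("CHN", "SH.XPD.CHEX.PC.CD", 2001, 45.327517),
    ],
)


def time_series(conn, country_code, indicator_code):
    """Return (year, value) pairs for one object/key pair, ordered by year."""
    rows = conn.execute(
        "SELECT year, value FROM indicators "
        "WHERE country_code = ? AND indicator_code = ? ORDER BY year",
        (country_code, indicator_code),
    )
    return rows.fetchall()


series = time_series(conn, "USA", "SP.POP.0014.TO")
print(series)
```

With the year column, a single value is identified by the triple (object, key, year) rather than by (object, key) alone.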

With this new understanding try reading the schema and identifying other features of the key-value store setup that we identified above. 


## Question 5: Schema Comprehension (3 points)

Each of the following parts is worth 1 point.


### a)  What is the name of the table that functions as the property table in the four world bank data sets?


---
*country_series_definitions*,  *series_summary*

---

### b) Which table contains extra information about the "objects" (as in Object-Key-Value) in the table?


---
*health_nutrition_population*

---


### c) What is the "key" (as in Object-Key-Value) of the `health_nutrition_population` table?

---
*indicator_code*

---

## Question 6: Design Theory (12 points)

Design your own schema for the world bank data! Your goal is to make visualizations to answer these questions:

* What is the breakdown of population by decade (0-9, 10-19, 20-29, etc) by male and female for the US over time? 
* Is the US population shifting towards an aging one or a young one? 
* How does the population breakdown of the US and China (or any other country) differ? 
* What is the life expectancy versus the health care expenditure for all the countries of the world?
* Where is HIV in the world? Make a picture that shows the distribution of AIDS patients by major region.  

#### Further Requirements

* No matter which schema you choose, it should be clear how you could add data from more indicators to it.  This could be adding a column(s), row(s) or table(s) depending on your design.
  * Eg, if you're not storing GDP/capita, how could you add it to the table?



#### Hints:

In the real world, stuff happens! We want our databases to be able to handle those cases.
Remember the CRUD (create, read, update, delete) acronym! What if:
* A statistic turns out to be incorrect and it needs to be updated?
* You need to add all the data for all the countries when it's published at the end of 2018?
* A country splits into two because of a revolutionary war?
* A country changes its name back to a pre-colonial era one?
* You need to store very small percentages (think prevalence of rare diseases)?
* There are statistics that only apply to some countries and not others (eg, fishing in a landlocked country)?
* You suddenly want to store data at a higher sampling rate (say, monthly or weekly rather than yearly)?

It's very unlikely your design performs well under all these conditions (and many more you can come up with) and that's okay! No design is perfect - we're looking for you to show your understanding of the tradeoffs you made and what that means for any application you write on top of your database.

### a) What entities are present in your schema? (2 points)


---
# Entities
*   country_details
*   world_bank_health_population
*   international_debt
*   international_education
*   indicators_data
*   series_summary_health
*   series_summary_debt
*   series_summary_education
*   series_summary_indicators
*   series_times_health
*   series_times_debt
*   series_times_education
*   series_times_indicators
---

### b) What is the relationship between them? (You don't need to draw a perfect ER diagram - you can also just list 'one to one', 'one to many' and 'many to many' for each pair). (2 points)



--- 

*   world_bank_health_population -> series_summary_health (Many to one)

*   international_debt -> series_summary_debt (Many to one)

*   international_education -> series_summary_education (Many to one)

*   indicators_data -> series_summary_indicators (Many to one)

*   series_summary_health -> series_times_health (One to One)

*   series_summary_debt ->   series_times_debt (One to One)

*   series_summary_education -> series_times_education (One to One)

*   series_summary_indicators -> series_times_indicators (One to One)

---

### c) Draw out your tables (like you've seen above and in class), and clearly note which column(s) form the key for that table, and which columns are keys of another table (foreign keys). (3 points)


---
My tables are the same as those in the public dataset, except that country_name and indicator_name are removed from the main tables, and country_name is stored in a separate table (country_details). This reduces data redundancy.

---

### d) List the (minimal) functional dependencies that are present in your tables. (2 points)


---
*   country_code -> country_name (Country Details Table)

*   indicator_code -> indicator_name

*   country_code, indicator_code, year -> value (Main Tables)

*   series_code -> all other attributes (Series Summary Tables)

*   series_code, year -> description (Series Time Tables)
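
One way to sanity-check a candidate functional dependency is to test it against sample rows. The helper `holds` below is a hypothetical sketch: it can refute a dependency on the data it sees, but a passing check on a sample does not prove the dependency holds in general. The rows are copied from query outputs shown later in this notebook.

```python
def holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        left = tuple(row[col] for col in lhs)
        right = tuple(row[col] for col in rhs)
        # setdefault stores the first right-hand side seen for this left-hand side.
        if seen.setdefault(left, right) != right:
            return False
    return True


rows = [
    {"country_code": "USA", "indicator_code": "SP.POP.0014.TO",
     "year": 1983, "value": 51201638.0},
    {"country_code": "USA", "indicator_code": "SP.POP.0014.TO",
     "year": 1991, "value": 55176304.0},
    {"country_code": "CHN", "indicator_code": "SH.XPD.CHEX.PC.CD",
     "year": 2001, "value": 45.327517},
]

with_year = holds(rows, ["country_code", "indicator_code", "year"], ["value"])
without_year = holds(rows, ["country_code", "indicator_code"], ["value"])
print(with_year, without_year)
```

Dropping `year` from the left-hand side breaks the dependency, because the same country and indicator take different values in different years.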


---

### e) Comment on your design - what is it good/bad for? *What are the tradeoffs that you're making in choosing your design?* (3 points)


---
The design opted for is OKV or EAV. 

*   Pros:
  1.   The data is sparse, so a conventional wide relational table would waste a lot of memory on nulls.
  2.   New attributes are added very frequently, and EAV adapts to such changes easily.
  3.   The data type of the value is always a number, which makes EAV easier to use.
  

 *  Cons:
  1.   Queries can become complex.
  
---

# Section 2 | Learn you a Visualization for Great Good
---

(This title is a pun on a famous series of programming books that look like `Learn you a ____ for Great Good`.)

In this section, you'll be answering questions, similar to the first project. The difference is that instead of just answering with a query, you will be answering with a visualization. Part of this assignment is for you to think about what type of chart/picture/visualization will convey your information well, and to think about which data (specifically, which indicators) you should be using in order to answer a particular question. 

We're focusing on visualizations because they are a primary method of understanding and communicating the nature of data. Especially with the large datasets that are available today, a picture is worth 1M rows :) 

For a look at what is possible, see [Gapminder](https://www.gapminder.org/tools). Gapminder is a professional visualization of all the world bank indicators which is also interactive! You can look up some cool TED talks that use Gapminder to display world metrics (it's a very popular tool).

If you need to see "the answers" for some of the relational (see: scatterplot) type data, feel free to look it up here and verify that you have something that looks right. Also, part of this project will be choosing the right indicators. You are free to use Gapminder to play with different options before deciding on one!

## General Instructions
* For each question, you will have at least two cells - a SQL cell where you run your query (and save the output to a data frame), and a visualization cell, where you construct your chart. Please be clear that **all** data manipulation is to be done in SQL. Please do not use `pandas` or any other python library to massage your data output - in the real world, that would be impossible. 
* Please make all charts legible - this includes axes labels, clear tick marks, clear point markers/lines/color schemes (i.e. don't repeat colors across categories), legends, and so on. 
* If you're asking for help, be sure to talk to the TAs about which indicators you are using and how you determined they were the right ones. Ultimately we care about the chart, so if you're aggregating the same data stored in a different way, that's fine (eg population by decade instead of population by phase of life). Some indicators will lead to easier solutions though, so we encourage you to spend some time making sure you've got one that feels straightforward to use. Feel free to come to office hours if you want to discuss finding the right indicators. 

## Visualization Libraries
The Colab notebook comes preinstalled with a visualization library called **Altair**. 
You can read its docs here: https://altair-viz.github.io/ 

There are some basic code snippets available if you open the menu on the left of the notebook and look under code snippets. 
We expect you to read the docs and understand how to use this library. The exploration you do now will help for project 3 when you do your own visualizations as well, and for later in your career if you ever want to play around with data (the Jupyter notebook, which is what Colab is based on, is a common data analysis tool these days!)





## Indicators

The World Bank indicators are available and searchable [here](https://data.worldbank.org/). If you want to browse some of what's available, you can check [here](https://data.worldbank.org/indicators). The browsing page doesn't have all the indicators (despite saying "all indicators", it only has the "Primary World Bank" indicators).  
If that fails, then you can Google "World Bank ____________ indicator" and click on the results from the world bank that come up.

You will likely need to look up indicator codes and indicator code patterns in order to extract the necessary data for this portion of the assignment. Once you've found your indicator, you will arrive at a page like `https://data.worldbank.org/indicator/XXXXXXXXX`. The X's are where the `indicator_code` for that indicator is found. For example, in `https://data.worldbank.org/indicator/SH.XPD.CHEX.GD.ZS`, the `SH.XPD.CHEX.GD.ZS` is the `indicator_code` you would search for in the database.
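
If you are collecting several indicator pages, a small helper can pull the code out of each URL. This is a sketch that assumes the code is always the last path segment, as in the example above; `indicator_code_from_url` is a made-up name.

```python
from urllib.parse import urlparse


def indicator_code_from_url(url):
    """Return the last path segment of a data.worldbank.org indicator URL."""
    path = urlparse(url).path  # drops any query string or fragment
    return path.rstrip("/").rsplit("/", 1)[-1]


code = indicator_code_from_url(
    "https://data.worldbank.org/indicator/SH.XPD.CHEX.GD.ZS"
)
print(code)
```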

Alternately, you are free to query for keywords in BigQuery directly (it may be easier for some of the simpler plots).

Many of these questions are _intentionally_ left open ended for you to think about what the right metrics are (another important data skill). Part of asking and answering interesting questions is thinking about the blind spots of your metrics. 
For example, say you plot $ spent versus educational attainment for various countries. Would you rather use the raw money spent, or money as % GDP (Gross Domestic Product)? What are the tradeoffs?

## Question 7 (3 points)


First, something basic - let's plot the total population of the USA over time as a stacked area chart - there should be multiple bands for the granularities of population that are recorded (0-14, 15-64, 65+). The x axis should be 'year' and the y axis should be 'population'. The sum of all the bands will equal the total population in the US at that time. 

 **Hint:** BigQuery's REGEX functions may be helpful here. (You may want to test your regex at [regex101.com](https://regex101.com) before using it in BigQuery to make sure it works!)
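
As a sketch of the kind of pattern the hint has in mind, the snippet below tests a regex locally in Python against a few indicator codes. The pattern is an assumption about how the age-band codes are shaped, and BigQuery uses RE2 syntax, so re-check the pattern there before relying on it.

```python
import re

# Matches age-band population codes such as SP.POP.0014.TO and captures the band.
AGE_BAND = re.compile(r"^SP\.POP\.(\d{4}|65UP)\.TO$")

codes = [
    "SP.POP.TOTL",
    "SP.POP.0014.TO",
    "SP.POP.1564.TO",
    "SP.POP.65UP.TO",
    "SP.DYN.LE00.IN",
]

bands = {}
for code in codes:
    match = AGE_BAND.match(code)
    if match:
        bands[code] = match.group(1)

print(bands)
```

Escaping the dots matters: an unescaped `.` matches any character, so a looser pattern could pick up unrelated indicators.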

In [3]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [4]:
%%bigquery --project third-diorama-233818 q7
-- Band 1: ages 65+, derived as total population minus the 0-14 and 15-64 bands.
-- It is labeled explicitly so the chart legend does not show it as SP.POP.TOTL.
SELECT 'Population, ages 65+, total (derived)' AS indicator_name,
       'SP.POP.65UP (derived)' AS indicator_code,
       total.value - young.value - working.value AS population_count,
       total.year AS year
FROM `bigquery-public-data.world_bank_intl_education.international_education` AS total
JOIN `bigquery-public-data.world_bank_intl_education.international_education` AS young
  ON young.country_code = total.country_code AND young.year = total.year
JOIN `bigquery-public-data.world_bank_intl_education.international_education` AS working
  ON working.country_code = total.country_code AND working.year = total.year
WHERE total.country_code = 'USA' AND
      total.indicator_code = 'SP.POP.TOTL' AND
      young.indicator_code = 'SP.POP.0014.TO' AND
      working.indicator_code = 'SP.POP.1564.TO'

UNION ALL

SELECT indicator_name, 
       indicator_code, 
       value as population_count, 
       year
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE country_code = 'USA' AND 
      indicator_code = 'SP.POP.0014.TO'

UNION ALL

SELECT indicator_name, 
       indicator_code, 
       value, 
       year
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE country_code = 'USA' AND 
      indicator_code = 'SP.POP.1564.TO'

Unnamed: 0,indicator_name,indicator_code,population_count,year
0,"Population, ages 0-14, total",SP.POP.0014.TO,51201638.0,1983
1,"Population, ages 0-14, total",SP.POP.0014.TO,55176304.0,1991
2,"Population, ages 0-14, total",SP.POP.0014.TO,61653419.0,2015
3,"Population, ages 0-14, total",SP.POP.0014.TO,52437806.0,1978
4,"Population, ages 0-14, total",SP.POP.0014.TO,55654422.0,1973
5,"Population, ages 0-14, total",SP.POP.0014.TO,62354688.0,2008
6,"Population, ages 0-14, total",SP.POP.0014.TO,54006943.0,1975
7,"Population, ages 0-14, total",SP.POP.0014.TO,61734640.0,2014
8,"Population, ages 0-14, total",SP.POP.0014.TO,62283423.0,2007
9,"Population, ages 0-14, total",SP.POP.0014.TO,53437830.0,1989


In [5]:
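import altair as alt

# Stacked area chart: one band per age group; the bands sum to the total population.
alt.Chart(q7).mark_area().encode(
    x=alt.X('year', axis=alt.Axis(title='Year', format='d')),
    y=alt.Y('sum(population_count)', axis=alt.Axis(title='Population Count')),
    color=alt.Color('indicator_code', legend=alt.Legend(title='Age band (indicator code)'))
)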

## Question 8 (3 points)

Is the US population aging or getting younger overall? Make a normalized, stacked area chart so you can see the answer!
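
Normalizing simply divides each band by the year's total, so every year sums to 1 and only the proportions remain. A tiny sketch of that arithmetic, with made-up counts for a single year:

```python
def normalize(bands):
    """Convert absolute counts per band into shares of the total."""
    total = sum(bands.values())
    return {name: count / total for name, count in bands.items()}


# Made-up counts (in millions) for one year, for illustration only.
shares = normalize({"0-14": 20.0, "15-64": 65.0, "65+": 15.0})
print(shares)
```

In the chart itself this division is done for you by the stacking option, so the query does not need to change.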

In [6]:
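import altair as alt

# Normalized stacked area chart: each band is shown as a share of the total.
alt.Chart(q7).mark_area().encode(
    x=alt.X('year', axis=alt.Axis(title='Year', format='d')),
    y=alt.Y('sum(population_count)', stack='normalize',
            axis=alt.Axis(title='Share of Population', format='%')),
    color=alt.Color('indicator_code', legend=alt.Legend(title='Age band (indicator code)'))
)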


# Insight
The US population is aging: the oldest age band makes up a growing share of the total over time.



## Question 9 (4 points)

Let's make a plot just like Gapminder - Who's getting the most health for their money? Plot "money spent on healthcare" versus "life expectancy" (play with Gapminder to find the right metrics; there are a few options). Make a bubble plot where the size is population of the country, the bubbles are colored by region, and use a slider to change the year (note: choose something reasonable for the range)! Please also include some way of seeing which country is which (a tooltip, perhaps).

In [15]:
%%bigquery --project third-diorama-233818 q9

SELECT country_code, year, value as money_spend_on_health
FROM `bigquery-public-data.world_bank_health_population.health_nutrition_population` 
WHERE indicator_code = 'SH.XPD.CHEX.PC.CD' AND 
      year = 2001

Unnamed: 0,country_code,year,money_spend_on_health
0,CHN,2001,45.327517
1,COD,2001,5.897850
2,MDA,2001,24.867511
3,BLZ,2001,153.678823
4,ARM,2001,45.388312
5,MCO,2001,1436.012458
6,SAU,2001,385.650145
7,ARE,2001,771.683488
8,UMC,2001,99.908334
9,NAC,2001,4631.607700


In [14]:
%%bigquery --project third-diorama-233818 helper_dataframe

SELECT country_code, year, value as life_expectancy
FROM `bigquery-public-data.world_bank_health_population.health_nutrition_population` 
WHERE indicator_code = 'SP.DYN.LE00.IN' AND 
      year = 2001

Unnamed: 0,country_code,year,life_expectancy
0,AZE,2001,67.054000
1,LUX,2001,77.824390
2,LIE,2001,79.275610
3,GRC,2001,78.387805
4,IRN,2001,70.526000
5,ATG,2001,73.752000
6,URY,2001,75.022000
7,NCL,2001,74.831707
8,MUS,2001,71.765854
9,AUT,2001,78.575610
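
Because all data manipulation has to happen in SQL, the two indicators need to end up side by side in one result rather than in two data frames. One possible approach is a self-join on country and year, sketched below with in-memory SQLite. The table name `hnp`, the country codes, and all values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hnp "
    "(country_code TEXT, indicator_code TEXT, year INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO hnp VALUES (?, ?, ?, ?)",
    [
        ("AAA", "SH.XPD.CHEX.PC.CD", 2001, 100.0),
        ("AAA", "SP.DYN.LE00.IN", 2001, 70.0),
        ("BBB", "SH.XPD.CHEX.PC.CD", 2001, 2000.0),
        ("BBB", "SP.DYN.LE00.IN", 2001, 78.0),
        # No spending row for CCC, so the inner join drops it.
        ("CCC", "SP.DYN.LE00.IN", 2001, 65.0),
    ],
)

# Self-join: one alias per indicator, matched on country and year.
rows = conn.execute("""
    SELECT s.country_code, s.year,
           s.value AS health_spending,
           l.value AS life_expectancy
    FROM hnp AS s
    JOIN hnp AS l
      ON l.country_code = s.country_code AND l.year = s.year
    WHERE s.indicator_code = 'SH.XPD.CHEX.PC.CD'
      AND l.indicator_code = 'SP.DYN.LE00.IN'
    ORDER BY s.country_code
""").fetchall()
print(rows)
```

Each extra indicator (population for bubble size, for instance) would add one more alias and join condition in the same style.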


In [0]:
alt.Chart(q9).mark_point().encode(
    x = '',
    y = '',
    size = ''
)