In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("02_groupby_apis.ipynb")

# Assignment: Groupby and API access

**Author**: (Write your name here)

## Introduction

There are two goals for this assignment:

1. Practice using the split-apply-combine strategy using the `.groupby` method 
2. Practice integrating with an API to obtain data

## Section 1: World Bank API

In this section we will be continuing our practice with APIs.

In homework 2, we used the `world_bank_data` package as a way to access some time series data from the World Bank

In this assignment we will ask you to gather the same data, but this time making the API calls yourself

You are not to use `world_bank_data`, but rather a combination of `json`, `requests`, and `pandas` to obtain the datasets

As a refresher, here is some background information on the three data series we worked with in homework 2:

- **Primary completion rate** (world bank code `"SE.PRM.CMPT.ZS"`), or gross intake ratio to the last grade of primary education, is the number of new entrants (enrollments minus repeaters) in the last grade of primary education, regardless of age, divided by the population at the entrance age for the last grade of primary education. Data limitations preclude adjusting for students who drop out during the final year of primary education.
- **GDP** (world bank code `"NY.GDP.MKTP.CD"`) at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in current U.S. dollars.
- **Population** (world bank code `"SP.POP.TOTL"`) is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship. The values shown are midyear estimates.

### Question 1

The main documentation page for the World Bank API is located at: https://datahelpdesk.worldbank.org/knowledgebase/topics/125589-developer-information

We will be able to extract the information we need from a single page they call the [API Basic Call Structures page](https://datahelpdesk.worldbank.org/knowledgebase/articles/898581-api-basic-call-structures)

Click the link above to open the API Basic Call Structures page and look for the url needed to access data for **all** countries for a particular indicator

> Hint: the API Basic Call Structures page has an example for the `SP.POP.TOTL` indicator

In the code cell below, alter the right-hand-side of `url = ` and set it to the url you identified

Instead of writing the indicator from the example, use `{}` in its place as a placeholder

> Hint: For example `test_url = "https://github.{}/QuantEcon/{}"` has two placeholders, one after `github.` and one at the end of the string
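
As a quick illustration of how such placeholders get filled in, `str.format` substitutes its arguments into the `{}` slots from left to right. The values `"com"` and `"lecture-python"` below are illustrative choices, not part of the assignment.

```python
# Template with two positional placeholders (taken from the hint above)
test_url = "https://github.{}/QuantEcon/{}"

# Arguments fill the placeholders in order; both values are illustrative
filled = test_url.format("com", "lecture-python")
print(filled)  # https://github.com/QuantEcon/lecture-python
```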

In [None]:
url = ...

In [None]:
grader.check("q2.1")

<!-- BEGIN QUESTION -->

### Question 2

Navigate back to the API Basic Call Structures page and study the examples

Determine where to put the indicator name

Is it part of the path, query, headers, or payload of the request?
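
To make the vocabulary concrete, the sketch below splits a made-up URL into its parts with the standard library's `urllib.parse`. The URL is hypothetical and unrelated to the World Bank API. Headers and payload travel alongside a request and never appear in the URL itself.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, used only to illustrate the vocabulary
example = "https://api.example.com/v1/items/42?color=red&limit=5"

parts = urlsplit(example)
print(parts.netloc)           # api.example.com
print(parts.path)             # /v1/items/42
print(parts.query)            # color=red&limit=5
print(parse_qs(parts.query))  # {'color': ['red'], 'limit': ['5']}
```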

Write your answer in the cell below:


_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Question 3

In the BLS API example shown in class, the data was returned to us in JSON format

This allowed us to use `res.json()` to read the data into a Python dict (where `res` is a `requests.Response` object)

JSON is not the default return value for the World Bank API, but it can be chosen via a query parameter. 

Look at the examples on the API Basic Call Structures page and determine how you can request that the data be returned in JSON format (Hint: it is a query parameter)
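
As a reminder of the mechanics, query parameters follow a `?` at the end of the URL and are separated from one another by `&`. The sketch below builds such a string with `urllib.parse.urlencode`. The base URL and parameter names are hypothetical, so you still need to find the real ones in the World Bank documentation.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, for illustration only
base = "https://api.example.com/v1/items"
params = {"color": "red", "limit": 5}

full_url = base + "?" + urlencode(params)
print(full_url)  # https://api.example.com/v1/items?color=red&limit=5

# A further parameter is appended with "&"
longer_url = full_url + "&sort=asc"
print(longer_url)  # https://api.example.com/v1/items?color=red&limit=5&sort=asc
```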

In the code cell below, create a new variable `url_q3` that adds the JSON format query parameter to the url you identified in question 1

In [None]:
# create url_q3 here
url_q3 = ...

In [None]:
grader.check("q2.3")

### Question 4

How can you specify the range of dates for which you would like the indicator data?

Create a variable `url_q4` that starts from the value of `url_q3` and adds a query parameter for getting data for years 2000 through 2019 

> Hint 1: you may want to start this exercise by copying and pasting `url_q3` from above, then renaming it and making your change

> Hint 2: You may want to consult the `dates` object from 01_dataset_manipulation.ipynb for another clue

In [None]:
# create url_q4 here
url_q4 = ...

In [None]:
grader.check("q2.4")

### Question 5

By default, the World Bank API will only return 50 data points at a time. The programmer is then required to move through multiple **page**s of data in order to obtain the entire dataset

There is a query parameter that can increase the number of data points on each page of data. Determine the name of this query parameter
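
For context, here is a rough sketch of what moving through pages looks like when the page size is left small. The `fetch_page` function, its return shape, and the page size are hypothetical stand-ins for a real API call, and no network request is made.

```python
# Hypothetical dataset and page size, standing in for a remote API
ALL_ROWS = list(range(60))
PAGE_SIZE = 25


def fetch_page(page: int) -> dict:
    """Return one page (1-indexed) plus the total number of pages."""
    start = (page - 1) * PAGE_SIZE
    n_pages = -(-len(ALL_ROWS) // PAGE_SIZE)  # ceiling division
    return {"pages": n_pages, "rows": ALL_ROWS[start:start + PAGE_SIZE]}


# Request pages until the last one has been collected
rows = []
page = 1
while True:
    result = fetch_page(page)
    rows.extend(result["rows"])
    if page >= result["pages"]:
        break
    page += 1

print(page, len(rows))  # 3 60
```

Raising the page size so that everything fits on the first page avoids this loop entirely, which is the approach this question takes.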

Create a variable `url_q5` that starts from `url_q4` and sets the query parameter you just identified to 10000 (a large enough number for all the data we will fetch to fit on the first page)

In [None]:
# create url_q5 here
url_q5 = ...

In [None]:
grader.check("q2.5")

### Question 6

Fill in the python function below

Consult the documentation string to determine what the function should do

> Hint: see the bls_api notebook from class for how to check status codes and parse the JSON response to a Python dict

> Hint 2: You've been building up a url that can be **format**ted to request the data indicated in the docstring
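
The status check described in the first hint can be sketched without touching the network. `FakeResponse` below is a hypothetical stand-in that mimics only the `status_code` attribute and `json()` method of `requests.Response`. Raising a `ValueError` on failure is an assumption made for this sketch, not a requirement of the assignment.

```python
class FakeResponse:
    """Hypothetical stand-in for requests.Response (illustration only)."""

    def __init__(self, status_code: int, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


def parse_if_ok(res):
    # Treat any status code of 300 or above as a failure (assumed behavior)
    if res.status_code >= 300:
        raise ValueError(f"request failed with status {res.status_code}")
    return res.json()


print(parse_if_ok(FakeResponse(200, {"answer": 42})))  # {'answer': 42}
```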

In [None]:
import requests

def request_wb_data(series_code: str) -> dict:
    """
    Uses the requests library to make a GET request to
    obtain the data in JSON format for the chosen world 
    bank series (indicated by `series_code`) for all countries 
    for years 2000 to 2019
    
    This function will check that the status code on the 
    requests.Response object is less than 300 and will then
    return the parsed JSON data as a Python dictionary
    
    If the status_code from the requests.Response object 
    
    Parameters
    ----------
    series_code: str
        The world bank series code for which to fetch data
    
    Returns
    -------
    results: dict
        A Python dictionary containing the results of the
        API call
    
    """
    ...

In [None]:
grader.check("q2.6")

### Question 7

Call the `request_wb_data` function from above using `"SP.POP.TOTL"` as the `series_code` argument

Study the structure/format of the data that is returned

Fill in the indicated sections of the Python function below so that it successfully returns the value indicated in the docstring

Note there are helpful comments inside the body of the function that provide guidance on how to do this
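
If you have not built a DataFrame from JSON-like data before, the sketch below shows the general pattern on made-up records. The field names are hypothetical and do not match the World Bank response, so study the real structure before adapting any of it.

```python
import pandas as pd

# Hypothetical records: a list of dicts, one dict per row
records = [
    {"city": "Oslo", "year": "2019", "info": {"id": "TEMP", "unit": "C"}, "value": 6.9},
    {"city": "Lima", "year": "2019", "info": {"id": "TEMP", "unit": "C"}, "value": 19.4},
]

# Each dict becomes a row; nested dicts are kept as dict-valued cells
df = pd.DataFrame(records)

# Pull one field out of the nested dicts and list its distinct values
ids = df["info"].apply(lambda d: d["id"]).unique()
print(ids)

# Keep only a subset of the columns
subset = df[["city", "year", "value"]]
print(subset)
```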


In [None]:
import pandas as pd

def parse_wb_response(results: list) -> pd.DataFrame:
    """
    Given a list containing the parsed JSON response from a request
    for World Bank indicator data, construct a DataFrame containing
    the data values
    
    Parameters
    ----------
    results: list
        A list containing the response from the World Bank API. This should
        be the return value from the `request_wb_data` function
    
    Returns
    -------
    data: pd.DataFrame
        A pandas DataFrame containing the data returned by the World Bank.
        This DataFrame will have columns [countryiso3code, date, value, series_code]
    
    """
    # step 1: construct DataFrame with all columns returned by World Bank
    df = ...
    
    # step 2: find the series_code (should only be one) 
    series_code = ...
    
    # step 3: Add the series_code as a column to the DataFrame
    ...
    
    # step 4: limit the DataFrame to the columns indicated in the docstring
    out = ...
    
    # step 5: return!
    ...

In [None]:
grader.check("q2.7")

### Question 8

Write one Python function that will take a World Bank series code and return a DataFrame containing the data for that series, for all countries, and the years 2000 to 2019

Make sure to include a helpful docstring for this function by following the examples above

> Hint: you should call both `request_wb_data` and `parse_wb_response` inside your new function

In [None]:
...

In [None]:
grader.check("q2.8")

### Question 9

Call your newly defined function from question 8 to obtain the three World Bank indicators mentioned above ("SE.PRM.CMPT.ZS", "NY.GDP.MKTP.CD", and "SP.POP.TOTL")

Combine the three DataFrames you receive into a single long-form DataFrame
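
As a reminder of what long form means, the sketch below stacks two small frames on top of each other with `pd.concat`. The frames and column names are hypothetical; the point is that every row keeps a column identifying the series it came from.

```python
import pandas as pd

# Hypothetical frames sharing the same columns, one per series
a = pd.DataFrame({"country": ["X", "Y"], "value": [1.0, 2.0], "series": "A"})
b = pd.DataFrame({"country": ["X", "Y"], "value": [3.0, 4.0], "series": "B"})

# Long form: stack rows, keeping one column that identifies the series
long_df = pd.concat([a, b], ignore_index=True)
print(long_df.shape)  # (4, 3)
```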

In [None]:
pop = ...
gdp = ...
edu = ...

df = ...

In [None]:
grader.check("q2.9")

## Section 2: Groupby

There is one problem in this section: cohort analysis using Shopify data

This problem was introduced as part of the groupby lecture.

The problem introduction and explanation has been copied from that lecture and repeated here

The only modification is that instead of using the `qeds` library to simulate data, we load it from a file.

In order to read the `shopify_orders.parquet` file you will need to have the `pyarrow` library installed. The code cell below includes a `pip` instruction for installing this package. If you need to install it, please remove the comment in the cell below and execute the pip command.

If after installing pyarrow you get errors about pyarrow not being available when trying to read the data, please restart your jupyter kernel and try loading the data again.

In [None]:
#%pip install --upgrade pyarrow

In [None]:
import pandas as pd
import requests

orders = pd.read_parquet("shopify_orders.parquet")
orders.info()
orders.head()

**Definition:** A customer’s cohort is the month in which a customer placed
their first order

The customer type column indicates whether the order was placed by a new or returning customer

We now describe the *want* for the exercise, which we ask you to complete

**Want**: Compute the monthly total number of orders, total sales, and
total quantity separated by customer cohort and customer type

Read that carefully one more time…

### Extended Exercise

Using the reshape and `groupby` tools you have learned, apply the want
operator described above

See below for advice on how to proceed

When you are finished, you should have something that looks like this:

<img src="shopify_cohort_answer.png" alt="groupby\_cohort\_analysis\_exercise\_output.png" style="">

  

A few notes on the table above:

1. Your actual output will be much bigger. This is just to give you an idea of what it might look like
1. The numbers you produce should match those in this table. Index into your answer and compare it with this table to verify your progress
1. The labels will not have "Month-year" by default -- they will be numerical dates like `2019-07-31`. This is ok. Converting to the "Month-year" representation is optional

Now, how to do it?

There is more than one way to code this, but here are some suggested
steps.

1. Convert the `Day` column to have a `datetime` `dtype` instead of object (Hint: use the `pd.to_datetime` function)
1. Add a new column that specifies the date associated with each
  customer’s `"First-time"` order
  - Hint 1: You can do this with a combination of `groupby` and
    `merge`
  - Hint 2: `customer_type` is always either `Returning` or
    `First-time`  
  - Hint 3: Some customers don’t have a
    `customer_type == "First-time"` entry. You will need to set the
    value for these users to some date that precedes the dates in the
    sample. After adding valid data back into the `orders` DataFrame,
    you can identify which customers don’t have a `"First-time"`
    entry by checking for missing data in the new column.  
1. You’ll need to group by 3 things  
1. You can apply one of the built-in aggregation functions to the GroupBy
1. After doing the aggregation, you’ll need to use your reshaping skills to
  move things to the right place in rows and columns
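
The mechanics behind these steps can be sketched on a tiny made-up table. Everything below is hypothetical: the column names differ from those in `orders`, and taking each customer's earliest date is a simplification of the `"First-time"` logic described in the hints.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative, not those of `orders`
toy = pd.DataFrame({
    "day": ["2020-01-05", "2020-01-20", "2020-02-03", "2020-02-10"],
    "customer": ["a", "b", "a", "b"],
    "kind": ["new", "new", "repeat", "repeat"],
    "sales": [10.0, 20.0, 5.0, 7.0],
})
toy["day"] = pd.to_datetime(toy["day"])

# groupby + merge: attach each customer's earliest date to every row
first = toy.groupby("customer")["day"].min().rename("first_day").reset_index()
toy = toy.merge(first, on="customer", how="left")

# Group by several keys at once, aggregate, then reshape with unstack
summary = (
    toy.groupby([toy["day"].dt.to_period("M"), "kind"])["sales"]
    .sum()
    .unstack("kind")
)
print(summary)
```

The exercise itself needs three grouping keys rather than two, but the pattern of grouping, aggregating, and then unstacking is the same.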


Good luck!

> NOTE: At the very end of my code, I ran the following to make the dates appear in a human-readable form
> ```
>     .rename(columns=lambda x: x.strftime("%B-%y"))
>     .rename(index=lambda x: x.strftime("%B-%y"), level="cohort")
> ```
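
To see what the `"%B-%y"` format string produces, here is a small check on a single date. The date is an arbitrary example, and the month name assumes an English locale.

```python
import pandas as pd

# "%B" is the full month name and "%y" is the two-digit year
stamp = pd.Timestamp("2019-07-31")
label = stamp.strftime("%B-%y")
print(label)  # July-19 (assuming an English locale)
```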

In [None]:
solution = ...
months = ["July-20", "August-20", "September-20"]
solution.loc[pd.IndexSlice[:, :, months], months]

In [None]:
grader.check("q_groupby_shopify")