# APIs and Data Frames

Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

# Data from APIs 

The data we get from APIs typically arrives as JSON, which Python translates into a dictionary (or a list). This is useful for getting all sorts of data in all sorts of formats, but for analysis we typically want it in a more tabular format. That usually means a DataFrame rather than a dictionary.
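As a minimal sketch of that translation, the snippet below parses a small JSON string and turns it into a DataFrame. The JSON text and the names `raw_json`, `payload`, and `fruit_df` are made up for illustration and do not come from any real API.

```python
import json

import pandas as pd

# A made-up JSON payload; real API responses are usually larger and more nested.
raw_json = '{"fruit": ["apple", "pear"], "count": [3, 5]}'

# json.loads turns the JSON text into a regular Python dictionary.
payload = json.loads(raw_json)
print(type(payload))

# A dictionary of equal-length lists converts directly into a DataFrame.
fruit_df = pd.DataFrame(payload)
print(fruit_df)
```

When the JSON is nested, or is a list rather than a dictionary, some reshaping is usually needed before this last step.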

In [1]:
from requests import get

import numpy as np 
import pandas as pd 

## Census Data
Let's first use the Census API to get some data to analyze. We'll bring in our census key from a text file first. Remember to copy and paste that file into this folder so that the following piece of code works.

In [2]:
with open('census-key.txt', 'r') as f:
    # strip() drops any trailing newline so it is not sent as part of the key
    census_key = f.readline().strip()

Let's start by bringing in a set of variables from the 2022 American Community Survey (ACS) Data Profiles tables. We'll look at some employment and education characteristics of people by state. The variables we'll want to pull are:
- `NAME`: State name
- `DP02_0001E`: Total number of households in the state
- `DP03_0087E`: Mean income of people in the state
- `DP03_0002PE`: Percent of people 16 years and older in labor force
- `DP02_0068PE`: Percent of people 25 years and older with a Bachelor's Degree or higher
- `DP02_0066PE`: Percent of people 25 years and older with a Graduate or Professional degree

Note that we are using the Data Profile tables (note the DP at the beginning of each variable name), so the base URL needs to be the one for that particular type of table.

In [3]:
year = 2022
census_base_url = f'https://api.census.gov/data/{year}/acs/acs1/profile'

census_params = {'get':'NAME,DP02_0001E,DP03_0087E,DP03_0002PE,DP02_0068PE,DP02_0066PE',
                 'for':'state:*',
                 'key':census_key}

r = get(census_base_url, params = census_params)
r.status_code

200
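To see roughly how a base URL and a parameter dictionary combine into a single request URL, here is a small sketch using only the standard library. The key is a placeholder, and `requests` may encode some characters slightly differently than shown here.

```python
from urllib.parse import urlencode

base_url = 'https://api.census.gov/data/2022/acs/acs1/profile'
params = {'get': 'NAME,DP02_0001E',
          'for': 'state:*',
          'key': 'YOUR_KEY_HERE'}  # placeholder, not a real key

# Keep commas, colons and asterisks readable instead of percent-encoding them.
query_string = urlencode(params, safe=',:*')
full_url = f'{base_url}?{query_string}'
print(full_url)
```

Each key-value pair becomes one `name=value` piece of the query string, joined by `&`.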

In [4]:
# Drop the last row (Puerto Rico) due to lack of data.
people_by_state = r.json()[:-1]

In [5]:
people_by_state

[['NAME',
  'DP02_0001E',
  'DP03_0087E',
  'DP03_0002PE',
  'DP02_0068PE',
  'DP02_0066PE',
  'state'],
 ['Alabama', '2016448', '100785', '58.6', '28.8', '11.3', '01'],
 ['Alaska', '274574', '124663', '67.0', '30.6', '11.3', '02'],
 ['Arizona', '2850377', '116717', '60.7', '33.0', '12.5', '04'],
 ['Arkansas', '1216207', '92935', '57.8', '25.4', '9.7', '05'],
 ['California', '13550586', '147628', '63.9', '37.0', '14.4', '06'],
 ['Colorado', '2384584', '142387', '68.9', '45.9', '17.1', '08'],
 ['Connecticut', '1433635', '157696', '65.7', '41.9', '18.9', '09'],
 ['Delaware', '402334', '124756', '62.7', '36.5', '15.5', '10'],
 ['District of Columbia', '326970', '202466', '71.8', '65.4', '38.9', '11'],
 ['Florida', '8826394', '115717', '59.6', '34.3', '12.9', '12'],
 ['Georgia', '4092467', '116323', '63.9', '34.7', '14.0', '13'],
 ['Hawaii', '494827', '135028', '63.5', '35.4', '13.3', '15'],
 ['Idaho', '717151', '108727', '62.4', '32.3', '10.8', '16'],
 ['Illinois', '5056360', '129493', '6

### Method 1: Create a dictionary

One common way of creating a Data Frame is by first creating a dictionary of lists (or other sequences, like arrays), then just converting that into a Data Frame. The keys in the dictionary will be set as the column names, and the values associated with those keys will become the data in the columns. 

In [6]:
example_df = pd.DataFrame({'example':[1,2,3], 'example2':[4,5,6]})
example_df

| | example | example2 |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |


So, in order to create a DataFrame, we can just make sure that our data is in the correct dictionary format, then convert it. This means we will need to make sure to convert the data we get back from the API into this type of dictionary. Let's take a look at what we got from the census API to see what we would need to do.

In [7]:
type(people_by_state)

list

In [8]:
len(people_by_state)

52

In [9]:
people_by_state[0]

['NAME',
 'DP02_0001E',
 'DP03_0087E',
 'DP03_0002PE',
 'DP02_0068PE',
 'DP02_0066PE',
 'state']

In [None]:
people_by_state[:4]

<font color ='red'>**Question 1: Create a list called `hh` that has the number of households in each state.**</font>

<font color ='red'>**Question 2: Now, following this model, create a dictionary called `census_dict` with the keys `state`, `num_households`, `mean_income`, `percent_employed`, `percent_bachelors`, and `percent_graduate`. Each key should map to a list with the corresponding data. Make sure that the numbers are stored as numeric values rather than as strings.**</font>

Note: This is possible to do with dictionary comprehension! Think about how you might do this.
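As a rough illustration of the general pattern, on toy data rather than the census response, a dictionary comprehension can walk over the key names and pull the matching position out of every row. The names `rows`, `toy_keys`, and `to_number` are hypothetical helpers for this sketch only.

```python
# Toy data in the same shape as the API response: a header row, then one row per record.
rows = [['NAME', 'COUNT', 'PCT', 'id'],
        ['Alpha', '10', '1.5', '01'],
        ['Beta', '20', '2.5', '02']]

toy_keys = ['name', 'count', 'pct']  # hypothetical names; the id column is left out


def to_number(text):
    """Return an int when possible, otherwise a float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


# For each key, collect the value at that position from every data row (skipping the header).
# Position 0 holds the name, so it stays a string; everything else is converted.
toy_dict = {key: [row[i] if i == 0 else to_number(row[i]) for row in rows[1:]]
            for i, key in enumerate(toy_keys)}
print(toy_dict)
```

The inner list comprehension builds one column at a time, which is exactly the dictionary-of-lists layout that `pd.DataFrame` expects.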

In [None]:
keys = ['state', 'num_households','mean_income','percent_employed','percent_bachelors','percent_graduate']

In [None]:
census_dict = ...





Once you get the dictionary set up correctly, it's very easy to turn it into a DataFrame -- just use `pd.DataFrame`. 

In [None]:
census_df = pd.DataFrame(census_dict)
census_df.head()

<font color ='red'>**Question 3: Use the `.describe` method to look at some summary statistics for the variables that we got. Then, use the `sort_values` method to figure out which state had the highest mean income and which had the lowest.**</font>

Hint: Look at the [documentation for sort_values](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.sort_values.html) to see how you might use this method. You'll need to give it a column name as an argument, and you can also use `ascending=False` to sort in descending order.

In [None]:
census_df.describe()
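For reference, here is a small sketch of `sort_values` on a toy DataFrame. The data and the names `toy`, `ranked`, `highest`, and `lowest` are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'name': ['Alpha', 'Beta', 'Gamma'],
                    'score': [72, 95, 58]})

# Sort by the 'score' column, largest first.
ranked = toy.sort_values('score', ascending=False)

# After sorting, the first row holds the largest value and the last row the smallest.
highest = ranked.iloc[0]['name']
lowest = ranked.iloc[-1]['name']
print(highest, lowest)
```

Note that `sort_values` returns a new DataFrame and leaves the original unchanged unless you reassign it.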

### Method 2: Convert to Data Frame, then clean

We actually could have made the list of lists into a Data Frame much faster -- by simply turning it into a Data Frame immediately.

In [10]:
df2 = pd.DataFrame(people_by_state)
df2.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | NAME | DP02_0001E | DP03_0087E | DP03_0002PE | DP02_0068PE | DP02_0066PE | state |
| 1 | Alabama | 2016448 | 100785 | 58.6 | 28.8 | 11.3 | 01 |
| 2 | Alaska | 274574 | 124663 | 67.0 | 30.6 | 11.3 | 02 |
| 3 | Arizona | 2850377 | 116717 | 60.7 | 33.0 | 12.5 | 04 |
| 4 | Arkansas | 1216207 | 92935 | 57.8 | 25.4 | 9.7 | 05 |


However, because we aren't being careful about how to specify column names, we end up with the first row containing the column names instead. This is problematic, because we don't want them there and we might want to use the numbers to calculate summaries. We can instead set the column names manually, then remove the first row so that it isn't in our data. If we weren't going to change the column names, then we could have just pulled out the first row and assigned that as the column names instead.

While we're doing this, we'll also remove the last column that has the state ID. Since we have the state names in the dataset already, this is redundant. Note that in the first method above, we simply omitted it while creating the dictionary.

In [11]:
col_names = ['state', 'num_households','mean_income',
             'percent_employed','percent_bachelors',
             'percent_graduate','state_id']
df2.columns = col_names

In [12]:
df2.head()

| | state | num_households | mean_income | percent_employed | percent_bachelors | percent_graduate | state_id |
|---|---|---|---|---|---|---|---|
| 0 | NAME | DP02_0001E | DP03_0087E | DP03_0002PE | DP02_0068PE | DP02_0066PE | state |
| 1 | Alabama | 2016448 | 100785 | 58.6 | 28.8 | 11.3 | 01 |
| 2 | Alaska | 274574 | 124663 | 67.0 | 30.6 | 11.3 | 02 |
| 3 | Arizona | 2850377 | 116717 | 60.7 | 33.0 | 12.5 | 04 |
| 4 | Arkansas | 1216207 | 92935 | 57.8 | 25.4 | 9.7 | 05 |
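If you wanted to keep the original names from the header row instead of renaming the columns, one possible approach is sketched below on toy data. The names `rows`, `direct`, and `promoted` are made up for this example.

```python
import pandas as pd

rows = [['NAME', 'COUNT'],
        ['Alpha', '10'],
        ['Beta', '20']]

# Option 1: pass the header row as column names when building the DataFrame.
direct = pd.DataFrame(rows[1:], columns=rows[0])

# Option 2: build first, then promote the first row to the header and drop it.
promoted = pd.DataFrame(rows)
promoted.columns = promoted.iloc[0]
promoted = promoted.iloc[1:].reset_index(drop=True)
promoted.columns.name = None  # clear the leftover label from the promoted row

print(direct)
print(promoted)
```

Both options end with the same columns and data; the first avoids ever having the header in the rows, while the second is handy when the DataFrame already exists.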


<font color ='red'>**Question 4: Remove the first row and the last column of `df2`. Call the new DataFrame `census`.**</font>


In [14]:
census.head()

| | state | num_households | mean_income | percent_employed | percent_bachelors | percent_graduate |
|---|---|---|---|---|---|---|
| 1 | Alabama | 2016448 | 100785 | 58.6 | 28.8 | 11.3 |
| 2 | Alaska | 274574 | 124663 | 67.0 | 30.6 | 11.3 |
| 3 | Arizona | 2850377 | 116717 | 60.7 | 33.0 | 12.5 |
| 4 | Arkansas | 1216207 | 92935 | 57.8 | 25.4 | 9.7 |
| 5 | California | 13550586 | 147628 | 63.9 | 37.0 | 14.4 |


After that, we need to convert the numbers from strings to numeric values. Remember, the Census API returns every value as a string, so pandas will not treat these columns as numbers: `describe` reports only `count`, `unique`, `top`, and `freq` for them instead of means and quantiles.

In [15]:
census.describe()

| | state | num_households | mean_income | percent_employed | percent_bachelors | percent_graduate |
|---|---|---|---|---|---|---|
| count | 51 | 51 | 51 | 51.0 | 51.0 | 51.0 |
| unique | 51 | 51 | 51 | 38.0 | 47.0 | 40.0 |
| top | Alabama | 2016448 | 100785 | 62.8 | 24.8 | 11.3 |
| freq | 1 | 1 | 1 | 4.0 | 2.0 | 2.0 |


The `pd.to_numeric` function can help us do some of the conversion, but it would be quite tedious to have to do this manually for each column that needs it. Instead, we can use the `apply` method.

In [16]:
pd.to_numeric(census.mean_income)

1     100785
2     124663
3     116717
4      92935
5     147628
6     142387
7     157696
8     124756
9     202466
10    115717
11    116323
12    135028
13    108727
14    129493
15    106944
16    114205
17    113962
18     99631
19     96505
20    114321
21    148486
22    162406
23    111255
24    133458
25     87442
26    108019
27    109448
28    116618
29    112544
30    140757
31    157601
32     97014
33    141334
34    112123
35    125399
36    110719
37     98937
38    123364
39    121871
40    133207
41    105495
42    110227
43    105555
44    117182
45    127229
46    117603
47    141078
48    146344
49     89306
50    114906
51    106006
Name: mean_income, dtype: int64

### Using apply

We have already used `apply` to apply a function to an array. We can do the same to a DataFrame over all of its rows or columns. This might be helpful in cases where you want to do something to every single row or every single column (like convert it to a numeric value). 

Pandas has a `to_numeric` function that we can use on a list, tuple, 1-d array or Series object to convert to numeric. However, we can't apply it to the DataFrame overall. We'll need to apply it to each column. 

We'll apply it to every column except the first one (because we don't want to try to turn the state names into numeric values).

In [None]:
variables_to_convert = ['num_households', 'mean_income','percent_employed',
                       'percent_bachelors','percent_graduate']
# The default axis=0 passes each column to pd.to_numeric, so every column keeps its own numeric type
census[variables_to_convert] = census[variables_to_convert].apply(pd.to_numeric)

In [None]:
census.describe()
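The same column-wise conversion can be checked on a small toy DataFrame. This is only a sketch, and the names `toy` and `numeric_cols` are invented for the example.

```python
import pandas as pd

toy = pd.DataFrame({'name': ['Alpha', 'Beta'],
                    'count': ['10', '20'],
                    'pct': ['1.5', '2.5']})

numeric_cols = ['count', 'pct']

# With the default axis=0, apply passes one column at a time to pd.to_numeric.
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric)

# Each column now has its own numeric dtype; 'name' is left untouched.
print(toy.dtypes)
```

Because each column is converted separately, whole-number columns can stay integers while columns with decimals become floats.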

<font color ='red'>**Question 5: Note that the last three variables are shown as percentages. Using `apply`, change the percentages into proportions (so that they are between 0 and 1) and create a new Data Frame called `census_props`. Everything else in the Data Frame should be the same.**</font>


## A Note on Getting Data into Data Frames

The second method might seem like a more convoluted way of getting the data into the format we need, but that isn't always the case. For example, if we had strings or categorical variables instead of numbers, we would not have needed the conversion to numeric or the `apply` step. It can also be tedious to work out how to build the dictionary to begin with, especially if it takes you some time to think through the list and dictionary comprehension pieces.

The path you take to get data into Data Frames will differ depending on the data source and the format it comes in. Many times, you'll still have to do some data management steps even after getting a dictionary in the right format.

Try to think about the format that you want the data in. Some questions to ask are:
- What are the rows? 
- What are the columns?
- What type of data are you starting with? 
- Where is that data? 

Let's look at another example using the NY Times API. Make sure to copy and paste your NY Times API key text file into this folder before running the following code.

In [None]:
# My file with the key is called nyt-key.txt. Make sure you have that file with your key!
with open('nyt-key.txt', 'r') as f:
    # strip() drops any trailing newline so it is not sent as part of the key
    nyt_key = f.readline().strip()

base_url = "https://api.nytimes.com/svc/archive/v1/2019/1.json"
r = get(base_url, params={'api-key': nyt_key})
# The articles are nested under the 'response' and 'docs' keys
archive_2019_01 = r.json()['response']['docs']


In [None]:
type(archive_2019_01)

In [None]:
len(archive_2019_01)

In [None]:
archive_2019_01[0].keys()
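When the data is a list of dictionaries, `pd.DataFrame` can take the list directly, using the dictionary keys as column names. The sketch below uses made-up records; the field names `title`, `kind`, `words`, and `extra` are placeholders and are not the NY Times fields.

```python
import pandas as pd

# Made-up records shaped like a list of dictionaries; the values are placeholders.
records = [{'title': 'First', 'kind': 'News', 'words': 500, 'extra': {'a': 1}},
           {'title': 'Second', 'kind': 'Op-Ed', 'words': 800, 'extra': {'a': 2}}]

wanted = ['title', 'kind', 'words']

# Option 1: build the full DataFrame, then keep only the columns of interest.
full = pd.DataFrame(records)
subset = full[wanted]

# Option 2: pick the fields out of each dictionary before building the DataFrame.
picked = pd.DataFrame([{key: rec[key] for key in wanted} for rec in records])

print(subset)
```

Option 2 can be preferable when the records contain large nested fields that you never need as columns.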

<font color ='red'>**Question 6: Create a Data Frame called `nyt_archive` that has the `abstract`, `web_url`, `type_of_material` and `word_count` of the articles in `archive_2019_01`.**</font>


<font color ='red'>**Question 7: What types of articles appeared in the NY Times in January 2019? What was the most common type of article? Which types of articles had the highest average word counts?**</font>

### Scatterplots

To look at the relationship between two numerical variables, we can use a scatterplot. The `plot.scatter` method with the arguments for the variable names to go on the x and y axes does this for us.

In [None]:
census.plot.scatter('percent_bachelors','percent_employed')
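As a self-contained sketch on toy data, the example below draws a scatterplot and also computes a correlation coefficient as a rough numeric summary of the linear relationship. The column names `x_pct` and `y_pct` and all values are invented; the `matplotlib.use('Agg')` line is only an assumption to let the code run without a display and can be omitted in a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in a notebook
import pandas as pd

toy = pd.DataFrame({'x_pct': [10.0, 20.0, 30.0, 40.0],
                    'y_pct': [12.0, 19.0, 33.0, 41.0]})

# plot.scatter returns a matplotlib Axes object that can be customized further.
ax = toy.plot.scatter(x='x_pct', y='y_pct')
ax.set_title('Toy scatterplot')

# Pearson correlation between the two columns (between -1 and 1).
r_value = toy['x_pct'].corr(toy['y_pct'])
print(round(r_value, 3))
```

A correlation close to 1 or -1 suggests a strong linear relationship, but it is still worth looking at the plot itself, since a single number can hide curved patterns or outliers.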

<font color = 'red'>**Question 8: Create a scatterplot of `percent_bachelors` and `mean_income`. Does it look like there is a relationship between the two variables?**</font>