## NYTimes Data

A news article can have multiple headlines, multiple authors, and multiple keywords, so it is hard to store "article data" in a tabular format. We will look at January 2020 data from the [NYTimes Archive API](https://developer.nytimes.com/docs/archive-product/1/overview) to answer the following questions:
- How many unique authors are working for NYTimes?
- How many "subjects" have NYTimes written about?
- Which author has written about the largest number of "subjects"?
- Which "subject" is the most popular across authors for NYTimes?

To answer these questions, we will wrangle the data in `"nytimes_archive_202001.json"` into a data frame where the columns are the different subjects and the rows are the different authors.
- Definition for "subject": each article, under `keywords`, can have several types of keywords such as `"glocations"`, `"subject"`, or `"persons"`. The `"value"` of each keyword whose `"name"` is `"subject"` is what we define as a subject. In the example below, `"Deaths (Obituaries)"` is a subject but `"Vasulka, Woody"` is not.
  ```
  [{'name': ['persons'],
    'value': ['Vasulka, Woody'],
    'rank': [1],
    'major': ['N']},
   {'name': ['subject'],
    'value': ['Deaths (Obituaries)'],
    'rank': [2],
    'major': ['N']}]
  ```
- Definition for author: an author is anyone listed under `"person"` within the `"byline"` field who has at least one letter in their first, middle, or last name. It is therefore possible to have an author named `"J R"`.
  - We will assume that authors with different `"{FIRST_NAME} {MIDDLE_NAME} {LAST_NAME}"` are different individuals. If someone does not have a middle name, there should be exactly one space between their first and last name. If someone only has a first name, there should be no spaces before or after their name.
  - It is possible to not have an author!
- The values within the data frame should be integers giving the number of articles that author has written or co-authored with that subject tag.
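
The two definitions above can be sketched as a pair of small helpers. This is only one possible reading of the definitions, not a reference solution: the function names `extract_subjects`, `name_part`, and `author_name` are hypothetical, and the code assumes every record follows the shape shown in the sample output (present fields are one-element lists, missing fields are empty dicts).

```python
def extract_subjects(keywords):
    """Return the values of all keywords whose name is 'subject'."""
    subjects = []
    for kw in keywords or []:          # tolerate a missing keyword list
        if "subject" in kw.get("name", []):
            subjects.extend(kw.get("value", []))
    return subjects


def name_part(field):
    """Return one name component as a string, or '' if it is missing."""
    # Assumption: a present field is a non-empty list holding one string,
    # and a missing field is an empty dict, as in the sample record.
    if isinstance(field, list) and field and isinstance(field[0], str):
        return field[0].strip()
    return ""


def author_name(person):
    """Build 'FIRST MIDDLE LAST' with single spaces, or None if no letters."""
    keys = ("firstname", "middlename", "lastname")
    parts = [name_part(person.get(k)) for k in keys]
    name = " ".join(p for p in parts if p)
    return name if any(ch.isalpha() for ch in name) else None


# Toy records copied from the shapes shown in this notebook.
sample_keywords = [
    {"name": ["persons"], "value": ["Vasulka, Woody"], "rank": [1], "major": ["N"]},
    {"name": ["subject"], "value": ["Deaths (Obituaries)"], "rank": [2], "major": ["N"]},
]
sample_person = {"firstname": ["Richard"], "middlename": {}, "lastname": ["Sandomir"]}

print(extract_subjects(sample_keywords))
print(author_name(sample_person))
```

Returning `None` for a person with no letters in any name field makes it easy to skip non-authors later with a simple `if name is not None` check.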

## Task 1 - wrangle

Please wrangle the data under `"nytimes_archive_202001.json"` to the data frame format described above.
- Please name your final data frame `"auth_sub_df"`
- Please make sure the author names are stored in the `index` field for your data frame.
- Please make sure the subjects are stored as `columns` for the data frame as well.
- The values in the data frame should correspond to the frequency of that author writing on that subject across all articles.
- Everything should be case sensitive.
- You may need to download this dataset and complete the task in your local environment first. Ed will likely crash.
- For beginners, I recommend looping over the data **twice**: figure out the total number of unique authors and subjects with the first pass, create an empty data frame, then fill in the data frame afterwards with the second pass.
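
The two-pass idea can be illustrated on a toy input. This is a minimal sketch, not the expected solution: `toy_articles` is made-up data standing in for the (authors, subjects) pairs you would extract from each article, and `toy_df` is a hypothetical name. The sketch also assumes an article counts once per author and subject pair, and it keeps subjects from authorless articles as all-zero columns, which the task text does not specify.

```python
import pandas as pd

# Made-up (authors, subjects) pairs, one tuple per article.
toy_articles = [
    (["Ann Lee"], ["Art", "Music"]),
    (["Ann Lee", "Bo Chen"], ["Art"]),
    ([], ["Politics"]),   # an article with no author
    (["Bo Chen"], []),    # an article with no subject
]

# Pass 1: collect the unique authors and subjects.
authors, subjects = set(), set()
for auths, subs in toy_articles:
    authors.update(auths)
    subjects.update(subs)

# Create a zero-filled integer data frame: authors as index, subjects as columns.
toy_df = pd.DataFrame(0, index=sorted(authors), columns=sorted(subjects), dtype=int)

# Pass 2: count each article once for every (author, subject) pair it carries.
for auths, subs in toy_articles:
    for author in set(auths):
        for subject in set(subs):
            toy_df.loc[author, subject] += 1

print(toy_df)
```

Incrementing cells with `.loc` is easy to follow but slow on the full archive; accumulating the counts in a dictionary first and building the data frame once at the end is a common alternative.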

In [13]:
import json
with open("nytimes_archive_202001.json", "r") as f:
    dat = json.load(f)

In [14]:
# Dictionaries preserve insertion order in Python 3.7+, so this picks
# the second key in the file; on older versions you may get a different outcome
demo_key = list(dat.keys())[1]

In [15]:
dat[demo_key].get("keywords")

[{'name': ['persons'],
  'value': ['Vasulka, Woody'],
  'rank': [1],
  'major': ['N']},
 {'name': ['subject'],
  'value': ['Deaths (Obituaries)'],
  'rank': [2],
  'major': ['N']},
 {'name': ['subject'], 'value': ['Art'], 'rank': [3], 'major': ['N']},
 {'name': ['organizations'],
  'value': ['Kitchen, The (Manhattan, NY, Performance Space)'],
  'rank': [4],
  'major': ['N']},
 {'name': ['subject'],
  'value': ['Video Recordings, Downloads and Streaming'],
  'rank': [5],
  'major': ['N']}]

In [16]:
dat[demo_key].get("byline").get("person")

[{'firstname': ['Richard'],
  'middlename': {},
  'lastname': ['Sandomir'],
  'qualifier': {},
  'title': {},
  'role': ['reported'],
  'organization': [''],
  'rank': [1]}]

In [0]:
### TEST FUNCTION: test_nytimes
# DO NOT REMOVE THE LINE ABOVE



## Task 2

Please answer the original questions using `"auth_sub_df"`.

- How many unique authors are working for NYTimes? Please assign the answer to a variable named `"nyt1"`.
- How many "subjects" have NYTimes written about? Please assign the answer to a variable named `"nyt2"`.
- Which author has written about the largest number of "subjects"? Please assign the answer to a variable named `"nyt3"`.
- Which "subject" is the most popular across authors for NYTimes (this is different from the number of articles written on that topic)? Please assign the answer to a variable named `"nyt4"`.
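
The difference between "most popular across authors" and "most articles" is easier to see on a toy frame. The sketch below uses made-up counts and hypothetical variable names; it shows one way to read the four questions from an author-by-subject count table, under the assumption that an author "has written about" a subject when the count is greater than zero.

```python
import pandas as pd

# Made-up author-by-subject counts in the same layout as the target data frame.
toy_df = pd.DataFrame(
    {"Art": [2, 1, 0], "Music": [1, 0, 0], "Politics": [0, 0, 5]},
    index=["Ann Lee", "Bo Chen", "Cy Diaz"],
)

n_authors = toy_df.shape[0]    # number of rows = unique authors
n_subjects = toy_df.shape[1]   # number of columns = unique subjects

# True where an author has at least one article on a subject.
covered = toy_df > 0

# Author covering the most distinct subjects (row-wise count of True).
widest_author = covered.sum(axis=1).idxmax()

# Subject covered by the most distinct authors (column-wise count of True).
most_shared_subject = covered.sum(axis=0).idxmax()

# For contrast: the subject with the most articles overall.
most_written_subject = toy_df.sum(axis=0).idxmax()

print(n_authors, n_subjects, widest_author, most_shared_subject, most_written_subject)
```

Here "Politics" has the most articles, but "Art" is shared by the most authors. Note that `idxmax` returns the first label when there is a tie.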


In [0]:
### TEST FUNCTION: test_nytimes2
# DO NOT REMOVE THE LINE ABOVE

