# Week 7: Jupyter Notebook Assignment - Working with Data

Fill out the cell below with your information. 

* Student Name: 
* Date: 
* Instructor: Lisa Rhody
* Assignment due: 
* Methods of Text Analysis
* MA in DH at The Graduate Center, CUNY

## Objectives
The purpose of this notebook is to get some hands-on experience putting what you've seen in tutorials about importing and working with text in Python into practice. You'll also be asked to put the reading you've been doing all semester into conversation with the process of importing, cleaning, and preparing data. 

The object of the notebooks this week is: 
* To practice several ways of importing text into your Python environment to study; 
* To become more familiar with various pipelines for cleaning and preparing data for text analysis; 
* To consider the challenges that the availability and scarcity of data presents to the literary scholar (and to consider how other kinds of research might also need to address similar issues); 
* To connect examples of real-world text analysis projects with the practical process of cleaning and preparing data. 

## Importing Data
You've worked with data during the DataCamp exercises, but that was a much more controlled environment. When you do your own text analysis project, the process will be much messier. This week's readings describe what cleaning takes place and some of the challenges that data presents when working with text. In particular, we're looking at text analysis from a humanities / literary perspective; however, one might argue that these challenges have more in common with the text analysis one might perform in the social sciences or with non-fiction work than might appear to be the case on the surface. 

In this lesson, we'll practice importing data: 
* from a file already on your computer (using a directory path); 
* from a file on the web using a URL request 
* from a file on the web using Beautiful Soup. 

There are many other ways to collect data, and perhaps some of our time could / should be spent on webscraping in order to collect data. The easiest way to get lots of data is by importing with an API. For those who are looking for a challenge, try importing Movie Reviews from the New York Times via their API for your notebook assignment this week instead. Be sure to use BeautifulSoup to prettify it, and then store the data in a pandas dataframe. 

### Loading data from a flat file on your local computer
When you downloaded the zip file for this week, included in the folder is a copy of Charlotte Perkins Gilman's _Herland_. During this activity, you will open that file, preview it, manipulate it, and do some cleaning and statistical inquiry. 

The first thing we're going to do is import the packages that we may need. Then we're going to use the Python function `open()`. The following code says that we want to `open` the file `herland.txt` so we can read it (argument `mode='r'`); the result is a file object assigned to the variable `herland`. Next, we call `.read()` on that file object and assign the resulting data, which is now a string, to the variable `hertext`. Finally, we close the file.



In [85]:
import nltk
import numpy as np
import pandas as pd
import urllib

filename = 'herland.txt'
herland = open(filename, mode='r')
hertext = herland.read()
herland.close()

In [86]:
# Here is how you print a string from a file without having to close the file using a context manager
with open('herland.txt','r') as file:
    print(file.read())

﻿The Project Gutenberg EBook of Herland, by Charlotte Perkins Stetson Gilman

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Herland

Author: Charlotte Perkins Stetson Gilman

Posting Date: June 25, 2008 [EBook #32]
Release Date: May 10, 1992
Last Updated: October 14, 2016

Language: English

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK HERLAND ***










HERLAND

by Charlotte Perkins Stetson Gilman




CHAPTER 1. A Not Unnatural Enterprise


This is written from memory, unfortunately. If I could have brought with
me the material I so carefully prepared, this would be a very different
story. Whole books full of notes, carefully copied records, firsthand
descriptions, and the pictures--that’s the worst loss. We had some
bird’s-eyes of the cities and

In [87]:
# If you don't want to save the text of the file, but just want to peek into it to see what's there, you could use this method. 

with open('herland.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

﻿The Project Gutenberg EBook of Herland, by Charlotte Perkins Stetson Gilman



This eBook is for the use of anyone anywhere at no cost and with



### What happens when you import a flat file? 

The Python function `type()` returns the data type of the object you pass to it. When you pass the object `herland` to the `type()` function below, what response do you get? The response will look different from other data types that you've used before. In this case, it is read in as a "file object." Remember that a Python function only works on the data types it was written to accept. In the next input, we ask Python for the length of the file object. This will throw an error. Why do you think that is? 

In [94]:
# herland is a file object, not a string. 
type(herland)

_io.TextIOWrapper

In [89]:
# since herland is a file object and not a string, you can't find the length of it.
len(herland)

TypeError: object of type '_io.TextIOWrapper' has no len()

#### Response here: 

### We had to go through a process to convert the file object to a string. 
Looking at the cells below, which variable should return `type()` as a string? (The answer is in the cell below.) 

In [90]:
# but hertext is a different datatype. How would you check? 
type(hertext)

str

Once you have a string, there are a number of functions that you can make use of. One of those is the `len()` command, which you can run below. 

In [91]:
# How many characters are in the hertext string? 
len(hertext)

315999
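
The difference between a file object and the string read out of it can be reproduced without `herland.txt`. What follows is a minimal sketch, assuming a small throwaway file whose name (`sample.txt`) and contents are made up for illustration:

```python
import os
import tempfile

# Create a small throwaway text file (hypothetical contents, not herland.txt)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.txt')
with open(sample_path, 'w', encoding='utf-8') as f:
    f.write('This is written from memory, unfortunately.')

with open(sample_path, 'r', encoding='utf-8') as sample_file:
    file_type = type(sample_file).__name__   # the file object itself
    sample_text = sample_file.read()          # the string read out of it
    try:
        len(sample_file)                      # file objects have no length
        file_has_len = True
    except TypeError:
        file_has_len = False

print(file_type)                    # TextIOWrapper
print(type(sample_text).__name__)   # str
print(file_has_len)                 # False
print(len(sample_text))             # number of characters in the string
```

The file object is only a handle on the file; `.read()` is the step that produces something `len()` can measure.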

Once an object is recognized as a string, you can begin manipulating it. For example, you could count the number of times the sequence of characters "her" appears within the entire text of _Herland_.

In [104]:
hertext.count('her', 0, -1)

1244
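
Keep in mind that `count()` looks for a sequence of characters, not a word, so it also finds "her" inside longer words. The sketch below uses a made-up sentence (not a quotation from _Herland_) to compare that count with a whole-word count. The whole-word count uses a regular expression with word boundaries, which is one possible approach among several:

```python
import re

# A made-up sample sentence, not a quotation from Herland
sample = 'She told her brother that there were others here.'

# str.count() counts every occurrence of the character sequence,
# including the ones inside longer words (brother, there, others, here)
substring_hits = sample.count('her')

# A regular expression with word boundaries (\b) counts only the whole word
whole_word_hits = len(re.findall(r'\bher\b', sample))

print(substring_hits)    # 5
print(whole_word_hits)   # 1
```

Both searches are case-sensitive, so "Her" at the start of a sentence would be missed by either one unless the text is lowercased first.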

The ability to count characters, words, n-grams, etc. means that we can also more easily target specific sections of the text. For example, when you print the opening of the `herland.txt` file to your screen, you notice that it is accompanied by metadata. For the purposes of text analysis, what would be the advantages or disadvantages of removing the metadata associated with _Herland_?

In [92]:
# What is happening at the beginning of the herland.txt file, though? We can check to see by using an index. 
print(hertext[:660])

﻿The Project Gutenberg EBook of Herland, by Charlotte Perkins Stetson Gilman

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Herland

Author: Charlotte Perkins Stetson Gilman

Posting Date: June 25, 2008 [EBook #32]
Release Date: May 10, 1992
Last Updated: October 14, 2016

Language: English

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK HERLAND ***










HERLAND

by Charlotte Perkins Stetson Gilman




CHAPTER 1. 
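
If you decide to separate the metadata from the text, the `*** START OF ...` line shown above is a convenient place to split. This is a minimal sketch, assuming a shortened, made-up stand-in for the opening of the file. The variable names `sample`, `start_marker`, `metadata`, and `body` are illustrative only:

```python
# A shortened, made-up stand-in for the opening of a Project Gutenberg file
sample = (
    'Title: Herland\n'
    'Language: English\n'
    '*** START OF THIS PROJECT GUTENBERG EBOOK HERLAND ***\n'
    '\n'
    'HERLAND\n'
    'CHAPTER 1. A Not Unnatural Enterprise\n'
)

start_marker = '*** START OF THIS PROJECT GUTENBERG EBOOK HERLAND ***'

# str.find() returns the index where the marker begins, or -1 if it is absent
marker_position = sample.find(start_marker)

if marker_position == -1:
    metadata = ''
    body = sample  # marker not found: leave the text untouched
else:
    # Keep the header separately instead of throwing it away
    metadata = sample[:marker_position]
    # Keep everything after the marker and trim surrounding whitespace
    body = sample[marker_position + len(start_marker):].strip()

print(body)
```

Storing the header in its own variable keeps the information available if you later decide it matters for your analysis.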


Working with a string is *more* helpful than simply working with a file object, but there are other things that we can do to the text to make it more easily manipulated in Python and NLTK. For example, when you're working with a string, it's not easy to count whole words. The NLTK word tokenizer function, however, will take a string and turn it into "tokens"--discrete segments of characters. Tokenized strings become a new data type--a list. 

In [93]:
hertokens = nltk.word_tokenize(hertext)
type(hertokens)

list
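
To get a feel for what tokenizing does without downloading any NLTK data, here is a rough sketch that uses a regular expression as a stand-in. It is not NLTK's tokenizer and will split some text differently (contractions, for example). The sample sentence is taken from the _Herland_ output shown elsewhere in this notebook:

```python
import re
from collections import Counter

sample = 'I told the boys about these stories, and they laughed at them.'

# \w+ matches runs of letters/digits; [^\w\s] matches single punctuation marks.
# This is a rough approximation, not NLTK's tokenizer.
tokens = re.findall(r"\w+|[^\w\s]", sample)

print(type(tokens).__name__)   # list
print(tokens[:5])              # ['I', 'told', 'the', 'boys', 'about']

# Counting characters in the string also matches "these", "they", and "them"
print(sample.count('the'))     # 4

# Because tokens are list items, whole words can be counted directly
word_counts = Counter(token.lower() for token in tokens)
print(word_counts['the'])      # 1
```

The two counts differ because the string method matches character sequences while the token list holds whole words.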

A tokenized list can be called, acted upon, and manipulated differently than a string. If we call just the first 15 tokens (index positions 0 through 14), here is what you would get:

In [40]:
hertokens[:15]

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Herland',
 ',',
 'by',
 'Charlotte',
 'Perkins',
 'Stetson',
 'Gilman',
 'This',
 'eBook',
 'is']

In [41]:
text1 = nltk.Text(hertokens)

In [42]:
type(text1)

nltk.text.Text

In [43]:
len(text1)

68494

In [105]:
text1[1000:1025]

['that',
 'they',
 'seemed',
 'sure',
 '.',
 'I',
 'told',
 'the',
 'boys',
 'about',
 'these',
 'stories',
 ',',
 'and',
 'they',
 'laughed',
 'at',
 'them',
 '.',
 'Naturally',
 'I',
 'did',
 'myself',
 '.',
 'I']

## Review
When you import text from a flat file that is saved on your local computer, what will you need to do in order to select parts of the text using an index? 

## Next, we're going to retrieve text directly from a URL with the `urllib` package
To do this, we're going to use the package `urllib`, and specifically the function `urlretrieve` from its `request` module. First, we need to assign the address of the file to a variable. In this case, that variable is `url`. Then we run `urlretrieve` with two parameters: the URL you want to import (which you assigned to the variable `url`) and the file name and extension to save it under. Here that is `203-0.txt`. `urlretrieve` saves the file to your computer, so you still need to open it and read it into a string, just as you did with `herland.txt`. 

In [107]:
from urllib.request import urlretrieve
url = 'https://www.gutenberg.org/files/203/203-0.txt'

# Download the file at `url` and save it in this folder as '203-0.txt'
urlretrieve(url, '203-0.txt')

In [108]:
with open('203-0.txt','r') as file:
    uncletom=file.read()
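
The download-then-read pattern can be tried without an internet connection. The sketch below is only an illustration: it assumes a small local file addressed by a `file://` URL as a stand-in for the Project Gutenberg address, and the names `source.txt` and `copy.txt` are made up:

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

# Stand-in for a remote text file: a small local file addressed by a file:// URL
work_dir = tempfile.mkdtemp()
source_path = Path(work_dir) / 'source.txt'
source_path.write_text("Uncle Tom's Cabin\nBy Harriet Beecher Stowe\n", encoding='utf-8')
sample_url = source_path.as_uri()

# urlretrieve(url, filename) copies the resource to `filename` and
# returns a tuple: the local path and the response headers
target_path = os.path.join(work_dir, 'copy.txt')
saved_path, headers = urlretrieve(sample_url, target_path)

# The download is a file on disk; reading it is a separate step
with open(saved_path, 'r', encoding='utf-8') as file:
    sample_text = file.read()

print(saved_path == target_path)
print(sample_text.splitlines()[0])
```

Retrieving and reading are two separate steps: the first puts a file on disk, and the second loads its contents into memory.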

### Using what you've learned so far, how would you figure out what data type the file `uncletom` is? Add a cell below and show how you would find the answer. 

## In the next 2 cells, print a selection of Uncle Tom's Cabin and the length of it. What else could we do with the text at this point?

In [109]:
print(uncletom[721:1200])




By Harriet Beecher Stowe




VOLUME I


CHAPTER I

In Which the Reader Is Introduced to a Man of Humanity


Late in the afternoon of a chilly day in February, two gentlemen were
sitting alone over their wine, in a well-furnished dining parlor, in
the town of P----, in Kentucky. There were no servants present, and the
gentlemen, with chairs closely approaching, seemed to be discussing some
subject with great earnestness.

For convenience sake, we have said, hitherto, two _


In [110]:
len(uncletom)

1025490

## Importing an HTML file using an HTTP request
The previous two files that we imported were _plain text_ files. In other words, there is little to no descriptive encoding. However, we can also use another module from the `urllib` package that is designed to import an .html file directly from the web. We can actually do this with just a few lines of code. First, we import the `urllib` package, and specifically the `request` module. We assign the URL we want to work with to a variable. Next, we pass the URL to the `urllib.request.urlopen` function and, at the same time, "read" the response. The result is assigned to the variable `html`. When we print the variable `html`, we discover that all of the HTML from the page has been pulled into the variable (note the leading `b`, which marks the value as bytes rather than a string). Unfortunately, it doesn't look very clean. 

In [119]:
# Now import the Bishop Henry McNeal Turner exhibit page from Colored Conventions as HTML
import urllib.request
anotherurl='http://coloredconventions.org/exhibits/show/bishophmturner'

In [127]:
html = urllib.request.urlopen(anotherurl).read()
print(html)

b'<!DOCTYPE html>\n<html class="" lang="en-US">\n<head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0, user-scalable=yes" />\n    \n        <title>Bishop Henry McNeal Turner: Visionary, Preacher, Prophet &middot; ColoredConventions.org</title>\n\n    <link rel="alternate" type="application/rss+xml" title="Omeka RSS Feed" href="/items/browse?output=rss2" /><link rel="alternate" type="application/atom+xml" title="Omeka Atom Feed" href="/items/browse?output=atom" />\n        <!-- Stylesheets -->\n    <link href="http://coloredconventions.org/plugins/ExhibitBuilder/views/public/css/exhibits.css" media="all" rel="stylesheet" type="text/css" >\n<link href="http://coloredconventions.org/themes/berlin_child/css/style.css" media="all" rel="stylesheet" type="text/css" >    <!-- JavaScripts -->\n                    <script type="text/javascript" src="//ajax.googleapis.com/ajax/libs/jquery/1.10.1/jquery.m
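
The `b'...'` prefix and the literal `\n` sequences in the output above show that `.read()` returned bytes, not a string. Below is a minimal sketch of the conversion, using a short hand-written bytes value in place of the real page. It assumes UTF-8 because the page declares `charset="utf-8"`:

```python
# A short bytes literal standing in for what urlopen(...).read() returns
raw = b'<title>Bishop Henry McNeal Turner &middot; ColoredConventions.org</title>\n'

print(type(raw).__name__)      # bytes

# The page declares charset="utf-8", so decode with that encoding
decoded = raw.decode('utf-8')

print(type(decoded).__name__)  # str
print(decoded)                 # the \n is now a real line break
```

Decoding gives you a string to work with, but the tags and entities such as `&middot;` are still in it.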

If you are interested in doing text analysis of a webpage, and the only way to ingest the web page is with HTML included, what are things you might need to learn to do to separate the HTML tags from the text? Look at the code above and write a short description of what might need to stay and what might need to be extracted. Should the extracted data be preserved or discarded? 

# Importing Data by Webscraping with BeautifulSoup
If you are interested in scraping data from the open web, BeautifulSoup is a Python package worth exploring in detail. For our purposes here, though, we're going to consider how to use BeautifulSoup to turn "unstructured" data into "structured" data. As you read through this section, consider Muñoz and Rawson's argument about data cleaning. Is there a need for the data to stay unstructured? What is the value of cleaning? 

In [130]:
import requests
from bs4 import BeautifulSoup

In [131]:
# Specify url: url
url4 = 'http://coloredconventions.org/press#scholarship'

# Package the request, send the request and catch the response: r
r = requests.get(url4)

# Extracts the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(pretty_soup)

<!DOCTYPE html>
<html class="" lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0, user-scalable=yes" name="viewport"/>
  <title>
   Press &amp; Notices · ColoredConventions.org
  </title>
  <link href="/items/browse?output=rss2" rel="alternate" title="Omeka RSS Feed" type="application/rss+xml"/>
  <link href="/items/browse?output=atom" rel="alternate" title="Omeka Atom Feed" type="application/atom+xml"/>
  <!-- Stylesheets -->
  <link href="http://coloredconventions.org/plugins/ExhibitBuilder/views/public/css/exhibits.css" media="all" rel="stylesheet" type="text/css"/>
  <link href="http://coloredconventions.org/themes/berlin_child/css/style.css" media="all" rel="stylesheet" type="text/css"/>
  <!-- JavaScripts -->
  <script src="//ajax.googleapis.com/ajax/libs/jquery/1.10.1/jquery.min.js" type="text/javascript">
  </script>
  <script src="//ajax.googleapis.com/ajax/libs/jqueryui/1.10.3/jquery-ui

Compare the output produced with BeautifulSoup against the raw output you got by importing the entire file with `urllib`. 

## Cleaning up Webscraped text

In [132]:
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url5 = 'http://coloredconventions.org/press#scholarship'

# Package the request, send the request and catch the response: r
r = requests.get(url5)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Get the title of Colored Conventions' webpage: ccc_title
ccc_title = soup.title

# Print the title of Colored Conventions' webpage to the shell
print(ccc_title)


<title>Press &amp; Notices · ColoredConventions.org</title>


In [133]:
# Get Colored Conventions' text: ccc_text
ccc_text = soup.get_text()

# Print CCC's text 
print(ccc_text)





Press & Notices · ColoredConventions.org



 



    //<!--
    jQuery.noConflict();    //-->









Search
  

Search using this query type:
Keyword
Boolean
Exact match 

Search only these record types:
 Item
 Exhibit
 Exhibit Page
 Simple Page

Advanced Search (Items only)







Home


Conventions


Conventions by Year


National Conventions


State Conventions


Regional Conventions


Conventions by Region


Transcribe Minutes


Transcribe-a-thon


AME Church Minutes


Baptist Church Minutes




CCP Corpus


Submit Minutes


Search the Networks




Exhibits


A Brief Introduction to the Movement


To Stay or To Go?: The National Emigration Convention of 1854


The 1853 Manual Labor College Initiative


Bishop Henry McNeal Turner


Mobility, Migration, and the 1855 Philadelphia National Convention


Henry Highland Garnet's "Address"


What Did They Eat? Where Did They Stay?


Black Wealth and the 1843 Convention


Black Women's Economic Power


The First National Convention
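
Notice that the output above still contains a line of JavaScript (`jQuery.noConflict();`), because the text inside `<script>` elements is returned along with everything else. One possible way to leave it out is sketched below with Python's built-in `html.parser` module rather than BeautifulSoup. The class name `VisibleTextParser` and the HTML fragment are made up for illustration:

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text that is not inside <script> or <style> elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a script/style element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

# A made-up fragment loosely modeled on the page above
sample_html = (
    '<html><head><title>Press &amp; Notices</title>'
    '<script>jQuery.noConflict();</script></head>'
    '<body><a href="/conventions">Conventions</a>'
    '<p>Exhibits</p></body></html>'
)

parser = VisibleTextParser()
parser.feed(sample_html)
parser.close()
print(parser.chunks)
```

Dropping the script text is itself a cleaning decision, so it is worth noting what you removed and why.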




In [134]:
# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))

/items/search
/
/conventions
/convention-by-year
/national-conventions
/state-conventions
/regional-conventions
/conventions-by-region
/transcribe-minutes
http://coloredconventions.org/hbd
http://coloredconventions.org/transcribe-minutes?set=ame
http://coloredconventions.org/transcribe-minutes?set=baptist
http://coloredconventions.org/intro-corpus
/submit-minutes
/delegate-search
http://coloredconventions.org/exhibits
http://coloredconventions.org/introduction-to-movement
http://coloredconventions.org/exhibits/show/conventions-black-press
http://coloredconventions.org/exhibits/show/1853-manual-labor
http://coloredconventions.org/exhibits/show/bishophmturner
http://coloredconventions.org/exhibits/show/mobilitymigration1855
http://coloredconventions.org/exhibits/show/henry-highland-garnet-address
http://coloredconventions.org/exhibits/show/williams-forson-exhibit
http://coloredconventions.org/exhibits/show/exhibit-1843
http://coloredconventions.org/exhibits/show/womens-economic-power
htt
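
Some of the links above are relative (`/conventions`) and others are absolute (`http://coloredconventions.org/exhibits`). If you wanted to follow them, you would first need full addresses. Here is a small sketch using `urljoin` from the standard library, with the page address (minus the `#scholarship` fragment) as the base. The list `hrefs` is a hand-copied sample of the output:

```python
from urllib.parse import urljoin

base_url = 'http://coloredconventions.org/press'

# A few of the href values printed above
hrefs = [
    '/items/search',
    '/conventions',
    'http://coloredconventions.org/exhibits',
]

# urljoin() resolves relative paths against the base and leaves absolute URLs alone
absolute_urls = [urljoin(base_url, href) for href in hrefs]

for link in absolute_urls:
    print(link)
```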

Explain what the value is of importing HTML files using BeautifulSoup. How does this relate to the concerns that Rawson and Muñoz raise in their article? 

## Access data using an API
In the following exercise, you will import data from the Chronicling America API. You will set parameters for what content and keywords to pull in, then you will send the request to the server. Your search results come back packaged in a data format called JSON. After you import the data, you'll organize and clean it up: we will ingest the JSON, turn it into a dictionary, and then turn part of that dictionary into a pandas DataFrame. All we're doing when we turn text data into a DataFrame is organizing the metadata and the files into a format that can be used and acted upon in order to do other kinds of analysis. 

In [353]:
# Make the Requests module available
import requests
import pandas as pd

In [354]:
# Create a variable called 'api_search_url' and give it a value
api_search_url = 'https://chroniclingamerica.loc.gov/search/pages/results/'

In [406]:
# This creates a dictionary called 'params' and sets values for the API's mandatory parameters
params = {
    'proxtext': 'poetry' # Search for this keyword -- feel free to change!
    
}

(Later on, you will be asked to return to the above cell and change the search parameters. You do this by replacing `poetry` with `yourterm`.)

In [407]:
# This adds a value for 'format' to our dictionary
params['format'] = 'json'

# Let's view the updated dictionary
params

{'proxtext': 'poetry', 'format': 'json'}

In [423]:
# This sends our request to the API and stores the result in a variable called 'response'
response = requests.get(api_search_url, params=params)

# This shows us the url that's sent to the API
print('Here\'s the formatted url that gets sent to the Chronicling America API:\n{}\n'.format(response.url)) 

# This checks the status code of the response to make sure there were no errors
if response.status_code == requests.codes.ok:
    print('All ok')
elif response.status_code == 403:
    print('There was an authentication error. Did you paste your API key above?')
else:
    print('There was a problem. Error code: {}'.format(response.status_code))
    print('Try running this cell again.')

Here's the formatted url that gets sent to the Chronicling America API:
https://chroniclingamerica.loc.gov/search/pages/results/?proxtext=poetry&format=json

All ok
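
The formatted URL printed above is the search address followed by `?` and the contents of `params` written as `key=value` pairs. The sketch below rebuilds it with the standard library to show where each piece comes from. It illustrates the idea and is not a description of how the `requests` library works internally:

```python
from urllib.parse import urlencode

api_search_url = 'https://chroniclingamerica.loc.gov/search/pages/results/'
params = {'proxtext': 'poetry', 'format': 'json'}

# urlencode() turns the dictionary into key=value pairs joined by '&'
query_string = urlencode(params)
full_url = api_search_url + '?' + query_string

print(full_url)
```

Changing a value in `params` changes only the part of the URL after the question mark.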


In [424]:
# Get the API's JSON results and make them available as a Python variable called 'data'
data = response.json()

In [428]:
# Let's prettify the raw JSON data and then display it.

# We're using the Pygments library to add some colour to the output, so we need to import it
import json
from pygments import highlight, lexers, formatters

# This uses Python's JSON module to output the results as nicely indented text
formatted_data = json.dumps(data, indent=2)

# This colours the text
highlighted_data = highlight(formatted_data, lexers.JsonLexer(), formatters.TerminalFormatter())

# And now display the results
print(highlighted_data)


{
  "totalItems": 419070,
  "endIndex": 20,
  "startIndex": 1,
  "itemsPerPage": 20,
  "items": [
    {
      "sequence": 25,
      "county": [
        "New York"
      ],
      "edition": null,
      "frequency": "Daily",
      "id": "/lccn/sn83030272/1913-05-04/ed-1/seq-25/",
      "subject": [
        "New York (N.Y.)--Newspapers.",
        "New York (State)--New York County.--fast--(OCoLC)fst01234953",
        "New York (State)--New York.--fast--(OCoLC)fst01204333",
        "New York County (N.Y.)--Newspapers."
      ],
      "city": [
        "New 

The output of the above cell will be quite long. Before turning in this assignment, please delete the cell above so the file you turn in is not difficult to read. Thank you!

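What kind of data type is `outfile`? (Judging by the output below, `outfile` is what you get back when `data.json` is opened in write mode.)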

In [429]:
print(outfile)

<_io.TextIOWrapper name='data.json' mode='w' encoding='UTF-8'>


In [437]:
# Get the API's JSON results and make them available as a Python variable called 'data'
data = response.json()


In [438]:
type(data)

dict
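
Because `data` is a dictionary, you can reach into it with keys and index positions. The sketch below uses a tiny hand-written sample shaped like the response (only a few fields, copied from the results shown in this notebook), so it runs without contacting the API. The name `sample_data` is illustrative:

```python
import json

# A tiny, hand-written sample shaped like the API response (fields abbreviated)
raw_json = '''
{
  "totalItems": 419070,
  "itemsPerPage": 20,
  "items": [
    {"sequence": 25, "city": ["New York"], "title": "The sun. [volume]", "edition": null},
    {"sequence": 131, "city": ["Washington"], "title": "Evening star. [volume]", "edition": null}
  ]
}
'''

sample_data = json.loads(raw_json)

print(type(sample_data).__name__)            # dict
print(type(sample_data['items']).__name__)   # list
print(sample_data['items'][0]['title'])      # The sun. [volume]
print(sample_data['items'][0]['edition'])    # None (JSON null becomes Python None)

# Pull one field out of every item
titles = [item['title'] for item in sample_data['items']]
print(titles)
```

The top level is a dictionary, `items` is a list, and each entry in that list is another dictionary.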

In the cell below, we will take the nested dictionary, whose structure mirrors the JSON it came from, and convert it into a DataFrame. 

In [430]:
pd.DataFrame.from_dict(data)

| | totalItems | endIndex | startIndex | itemsPerPage | items |
|---|---|---|---|---|---|
| 0 | 419070 | 20 | 1 | 20 | {'sequence': 25, 'county': ['New York'], 'edit... |
| 1 | 419070 | 20 | 1 | 20 | {'sequence': 131, 'county': [None], 'edition':... |
| 2 | 419070 | 20 | 1 | 20 | {'sequence': 15, 'county': ['Prince George's']... |
| 3 | 419070 | 20 | 1 | 20 | {'sequence': 17, 'county': ['Cook County'], 'e... |
| 4 | 419070 | 20 | 1 | 20 | {'sequence': 13, 'county': ['Cook County'], 'e... |
| 5 | 419070 | 20 | 1 | 20 | {'sequence': 9, 'county': ['Cook County'], 'ed... |
| 6 | 419070 | 20 | 1 | 20 | {'sequence': 3, 'county': ['Cook County'], 'ed... |
| 7 | 419070 | 20 | 1 | 20 | {'sequence': 1, 'county': ['Fayette', 'Hamilto... |
| 8 | 419070 | 20 | 1 | 20 | {'sequence': 93, 'county': [None], 'edition': ... |
| 9 | 419070 | 20 | 1 | 20 | {'sequence': 24, 'county': ['Cook County'], 'e... |


You may notice that a lot of the cells repeat the same data over and over again. What do you think is showing up in each row and column? 

In [431]:
pd.DataFrame.from_dict(data, orient='index')

| | 0 |
|---|---|
| totalItems | 419070 |
| endIndex | 20 |
| startIndex | 1 |
| itemsPerPage | 20 |
| items | [{'sequence': 25, 'county': ['New York'], 'edi... |


If we switch the layout of the dataframe, it becomes easier to see how the labels for the dataframe are different from the many items in the `items` observation. We can try to use the pandas function `json_normalize` to flatten out the data into columns. 


In [432]:
df = pd.json_normalize(data)
df.columns

Index(['endIndex', 'items', 'itemsPerPage', 'startIndex', 'totalItems'], dtype='object')
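
"Flattening" here means turning nested dictionaries into columns whose names join the nested keys with a dot, while lists are left as they are. That is why `items`, which is a list, stays in a single column. The function below is a simplified sketch of that idea, not pandas' actual implementation. The `record` dictionary and the `flatten` helper are invented for illustration:

```python
def flatten(record, prefix=''):
    """Return a flat dict whose keys join nested keys with dots."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, extending the dotted name
            flat.update(flatten(value, prefix=name + '.'))
        else:
            # Lists and plain values are kept as they are
            flat[name] = value
    return flat

# Hypothetical record, invented to show nesting
record = {
    'title': 'The sun. [volume]',
    'place': {'city': 'New York', 'state': 'New York'},
    'subject': ['Newspapers'],
}

flat_record = flatten(record)
print(list(flat_record))   # ['title', 'place.city', 'place.state', 'subject']
```

To spread the entries of a list such as `items` into rows, you pass the list itself to the normalizing function rather than the dictionary that contains it.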

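Here `json_normalize` has already placed the whole response in a single observation, with the list of items in one cell. The `MultiIndex.from_tuples` function then splits any dotted column names into hierarchical levels. None of these column names contains a dot, so the table looks the same. 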

In [433]:
df.columns = pd.MultiIndex.from_tuples([tuple(c.split('.')) for c in df.columns])
df

| | endIndex | items | itemsPerPage | startIndex | totalItems |
|---|---|---|---|---|---|
| 0 | 20 | [{'sequence': 25, 'county': ['New York'], 'edi... | 20 | 1 | 419070 |


In [440]:
json=pd.DataFrame.from_dict(data)

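If we name the DataFrame `json`, we can run a short loop over it that prints its column labels, which match the top-level keys of the dictionary `data`. Note that this name replaces the `json` module imported earlier, so `json.dumps` is no longer available after this cell runs.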

In [456]:
for key in json:
    print(key)

totalItems
endIndex
startIndex
itemsPerPage
items


The `.tail()` method will print out just the last (in this case) 6 rows of the DataFrame.

In [457]:
json.tail(6)

| | totalItems | endIndex | startIndex | itemsPerPage | items |
|---|---|---|---|---|---|
| 14 | 419070 | 20 | 1 | 20 | {'sequence': 18, 'county': ['Douglas'], 'editi... |
| 15 | 419070 | 20 | 1 | 20 | {'sequence': 30, 'county': ['Cook County'], 'e... |
| 16 | 419070 | 20 | 1 | 20 | {'sequence': 38, 'county': ['New York'], 'edit... |
| 17 | 419070 | 20 | 1 | 20 | {'sequence': 9, 'county': ['Cook County'], 'ed... |
| 18 | 419070 | 20 | 1 | 20 | {'sequence': 9, 'county': ['Cook County'], 'ed... |
| 19 | 419070 | 20 | 1 | 20 | {'sequence': 9, 'county': ['Cook County'], 'ed... |


The `shape` attribute will show how many rows and how many columns are in your dataframe. Because it is an attribute and not a method, it is written without parentheses.

In [458]:
json.shape

(20, 5)

Ok, we have lots of differently shaped data objects now. Let's see what the differences are. In the first case, if we take the variable `json`, which is a DataFrame, and print its `items` column, we get a pandas Series in which each row holds one item's dictionary.

In [459]:
print(json['items'])

0     {'sequence': 25, 'county': ['New York'], 'edit...
1     {'sequence': 131, 'county': [None], 'edition':...
2     {'sequence': 15, 'county': ['Prince George's']...
3     {'sequence': 17, 'county': ['Cook County'], 'e...
4     {'sequence': 13, 'county': ['Cook County'], 'e...
5     {'sequence': 9, 'county': ['Cook County'], 'ed...
6     {'sequence': 3, 'county': ['Cook County'], 'ed...
7     {'sequence': 1, 'county': ['Fayette', 'Hamilto...
8     {'sequence': 93, 'county': [None], 'edition': ...
9     {'sequence': 24, 'county': ['Cook County'], 'e...
10    {'sequence': 41, 'county': [None], 'edition': ...
11    {'sequence': 42, 'county': [None], 'edition': ...
12    {'sequence': 40, 'county': [None], 'edition': ...
13    {'sequence': 9, 'county': ['Hennepin', 'Ramsey...
14    {'sequence': 18, 'county': ['Douglas'], 'editi...
15    {'sequence': 30, 'county': ['Cook County'], 'e...
16    {'sequence': 38, 'county': ['New York'], 'edit...
17    {'sequence': 9, 'county': ['Cook County'],

When we print `data`, we get the full nested dictionary.

In [460]:
print(data)

{'totalItems': 419070, 'endIndex': 20, 'startIndex': 1, 'itemsPerPage': 20, 'items': [{'sequence': 25, 'county': ['New York'], 'edition': None, 'frequency': 'Daily', 'id': '/lccn/sn83030272/1913-05-04/ed-1/seq-25/', 'subject': ['New York (N.Y.)--Newspapers.', 'New York (State)--New York County.--fast--(OCoLC)fst01234953', 'New York (State)--New York.--fast--(OCoLC)fst01204333', 'New York County (N.Y.)--Newspapers.'], 'city': ['New York'], 'date': '19130504', 'title': 'The sun. [volume]', 'end_year': 1916, 'note': ['A facsimile of Vol. 1, no. 1 (Sept. 3, 1833) issued by The Sun (New York, N.Y. : 1920) on Sept. 2, 1933.', 'Also issued on microfilm by New York Public Library.', 'Archived issues are available in digital format as part of the Library of Congress Chronicling America online collection.', 'Evening eds.: Evening sun (New York, N.Y. : 1852), <1852>, and: Evening sun (New York, N.Y. : 1887), 1887-1916.', 'Publisher varies: Benjamin H. Day & George W. Wisner, 1833-1835; Benjamin H

When we "normalize" the list stored under the dictionary key `items`, we turn it into a dataframe, and when we call the dataframe, we get the contents of each item laid out in its own row. Keys are at the top of each column.

In [448]:
json_file = pd.json_normalize(data['items'])

In [452]:
json_file

| | alt_title | batch | city | country | county | date | edition | edition_label | end_year | frequency | ... | publisher | section_label | sequence | start_year | state | subject | title | title_normal | type | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [Extra sun, New York sun] | nn_ehrlich_ver02 | [New York] | New York | [New York] | 19130504 | | | 1916 | Daily | ... | Benj. H. Day | THIRD SECTION SUBURBAN REAL ESTATE SECTION | 25 | 1833 | [New York] | [New York (N.Y.)--Newspapers., New York (State... | The sun. [volume] | sun. | page | https://chroniclingamerica.loc.gov/lccn/sn8303... |
| 1 | [Star, Sunday star] | dlc_2goncharova_ver03 | [Washington] | District of Columbia | [None] | 19480926 | | | 1972 | Daily | ... | W.D. Wallach & Hope | | 131 | 1854 | [District of Columbia] | [Washington (D.C.)--fast--(OCoLC)fst01204505, ... | Evening star. [volume] | evening star. | page | https://chroniclingamerica.loc.gov/lccn/sn8304... |
| 2 | [Greenbelt] | mdu_annapolis_ver01 | [Greenbelt] | Maryland | [Prince George's] | 19380824 | | | 1954 | Weekly | ... | [s.n.] | | 15 | 1937 | [Maryland] | [Greenbelt (Md.)--Newspapers., Maryland--Green... | Greenbelt cooperator. | greenbelt cooperator. | page | https://chroniclingamerica.loc.gov/lccn/sn8906... |
| 3 | [] | iune_echo_ver01 | [Chicago] | Illinois | [Cook County] | 19120206 | | | 1917 | Daily (except Sunday and holidays) | ... | N.D. Cochran | | 17 | 1911 | [Illinois] | [Chicago (Ill.)--Newspapers., Illinois--Chicag... | The day book. [volume] | day book. | page | https://chroniclingamerica.loc.gov/lccn/sn8304... |
| 4 | [] | iune_golf_ver01 | [Chicago] | Illinois | [Cook County] | 19150202 | | LAST EDITION | 1917 | Daily (except Sunday and holidays) | ... | N.D. Cochran | | 13 | 1911 | [Illinois] | [Chicago (Ill.)--Newspapers., Illinois--Chicag... | The day book. [volume] | day book. | page | https://chroniclingamerica.loc.gov/lccn/sn8304... |
| 5 | [] | iune_foxtrot_ver01 | [Chicago] | Illinois | [Cook County] | 19140204 | | NOON EDITION | 1917 | Daily (except Sunday and holidays) | ... | N.D. Cochran | | 9 | 1911 | [Illinois] | [Chicago (Ill.)--Newspapers., Illinois--Chicag... | The day book. [volume] | day book. | page | https://chroniclingamerica.loc.gov/lccn/sn8304... |
| 6 | [] | iune_foxtrot_ver01 | [Chicago] | Illinois | [Cook County] | 19140304 | | LAST EDITION | 1917 | Daily (except Sunday and holidays) | ... | N.D. Cochran | | 3 | 1911 | [Illinois] | [Chicago (Ill.)--Newspapers., Illinois--Chicag... | The day book. [volume] | day book. | page | https://chroniclingamerica.loc.gov/lccn/sn8304... |
| 7 | [Blade, Bluegrass blade] | kyu_dylan_ver01 | [Lexington, Cincinnati] | Kentucky | [Fayette, Hamilton] | 19080209 | | | 1999 | Weekly | ... | Blade Pub. Co. | | 1 | 1880 | [Kentucky, Ohio] | [Fayette County (Ky.)--Newspapers., Kentucky--... | Blue-grass blade. [volume] | blue-grass blade. | page | https://chroniclingamerica.loc.gov/lccn/sn8606... |
| 8 | [Star, Sunday star] | dlc_1noguchi_ver01 | [Washington] | District of Columbia | [None] | 19390122 | | | 1972 | Daily | ... | W.D. Wallach & Hope | | 93 | 1854 | [District of Columbia] | [Washington (D.C.)--fast--(OCoLC)fst01204505, ... | Evening star. [volume] | evening star. | page | https://chroniclingamerica.loc.gov/lccn/sn8304... |
| 9 | [] | iune_hotel_ver01 | [Chicago] | Illinois | [Cook County] | 19151214 | | LAST EDITION | 1917 | Daily (except Sunday and holidays) | ... | N.D. Cochran | | 24 | 1911 | [Illinois] | [Chicago (Ill.)--Newspapers., Illinois--Chicag... | The day book. [volume] | day book. | page | https://chroniclingamerica.loc.gov/lccn/sn8304... |


## Reflection
In this exercise, you queried the Chronicling America API and pulled in records that matched the search term "poetry." Those records were then cleaned and made slightly more tidy by highlighting the "keys" of the dictionary, and then taking one small section of the dictionary and turning it into a dataframe. In a markdown cell, look over what you have done, and try changing the search *parameter* at the top of the exercise. What changes when you rerun the activity? What is "messy" about the data that makes it hard to work with? What is "clean" about the data that makes it easier to work with? 