# __Combining JSON files__

This notebook unzips the accompanying archive of JSON files and combines them into a single pandas dataframe.

Each JSON file contains the following fields, which become the dataframe columns:
* `id`: Persistent id of the newspaper in the Chronicling America database
* `date_issued`: Date the article was issued
* `ed`: Edition in which the article appeared
* `seq`: Image sequence number (possibly equivalent to the page number)
* `paper`: Name of the newspaper to which the associated article belongs
* `text`: Full text of the article
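
As a rough illustration of the schema above, the sketch below builds a one-row dataframe and converts `date_issued`, `ed`, and `seq` to typed columns. The record mirrors the first row of this notebook's `df.head()` output with a placeholder for `text`. It assumes that `ed` and `seq` are stored as prefixed strings such as `ed-1` and `seq-10`, as in that output; the column names `ed_num` and `seq_num` are hypothetical.

```python
import pandas as pd

# One illustrative record; `text` is a placeholder, not real article text.
sample = pd.DataFrame([{
    "id": "sn99021999",
    "date_issued": "1918-10-18",
    "ed": "ed-1",
    "seq": "seq-10",
    "paper": "Omaha daily bee.",
    "text": "placeholder article text",
}])

# Parse the issue date into a proper datetime column.
sample["date_issued"] = pd.to_datetime(sample["date_issued"], format="%Y-%m-%d")

# Strip the assumed "ed-" / "seq-" prefixes and keep the numeric part.
sample["ed_num"] = sample["ed"].str.replace("ed-", "", regex=False).astype(int)
sample["seq_num"] = sample["seq"].str.replace("seq-", "", regex=False).astype(int)

print(sample.dtypes)
```

Typed columns of this kind make it easier to sort by date or filter by page, but they are optional for the loading step in this notebook.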

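The URL for any page can be reconstructed from the fields above (`id` as `lccn`, `date_issued` as `date`, and the numeric parts of `ed` and `seq`) using the following format: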

https://chroniclingamerica.loc.gov/lccn/{lccn}/{date}/ed-{edition}/seq-{imagesequence}
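
A minimal sketch of the format above as a helper function. The name `build_page_url` is hypothetical, and the helper assumes `date_issued` is a `YYYY-MM-DD` string. Because the sample records store `ed` and `seq` with their prefixes already attached (`ed-1`, `seq-10`), it keeps only the part after the last hyphen so the prefix is not doubled.

```python
def build_page_url(lccn, date_issued, ed, seq):
    """Return a Chronicling America page URL for one record (illustrative helper)."""
    # Accept either bare numbers ("1") or prefixed values ("ed-1", "seq-10").
    edition = str(ed).split("-")[-1]
    sequence = str(seq).split("-")[-1]
    return (
        "https://chroniclingamerica.loc.gov/lccn/"
        f"{lccn}/{date_issued}/ed-{edition}/seq-{sequence}"
    )


# Values taken from the first sample record in this notebook.
url = build_page_url("sn99021999", "1918-10-18", "ed-1", "seq-10")
print(url)
```

Applied to a dataframe with these columns, the same helper could be called row by row, for example with `df.apply(lambda r: build_page_url(r["id"], r["date_issued"], r["ed"], r["seq"]), axis=1)`.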


In [None]:
import zipfile
import os, json
import pandas as pd

#Identify file location, and location to save unzipped files
filepath = 'C:/Users/Lofgran/Documents/Python Scripts/CoronaWhy/spanish_flu_scraper/spanish_flu_data-20200707T034359Z-001.zip'
save_location = 'C:/Users/Lofgran/Documents/Python Scripts/CoronaWhy/spanish_flu_scraper'

#Unzip files
with zipfile.ZipFile(filepath,"r") as zip_ref:
    zip_ref.extractall(save_location)

#Get list of files in folder
json_files = [pos_json for pos_json in os.listdir(save_location+'/spanish_flu_data') if pos_json.endswith('.json')]
print('Number of json_files: ', len(json_files))

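# Load each file into its own dataframe, then concatenate them in one step.
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
# Each file keeps its own row index, so index values repeat across files.
frames = []
for j in json_files:
    frames.append(pd.read_json('/'.join((save_location, 'spanish_flu_data', j)), orient='columns'))
df = pd.concat(frames)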
    
print(df.info())

df.head()

Number of json_files:  318
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15900 entries, 0 to 49
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           15900 non-null  object
 1   date_issued  15900 non-null  object
 2   ed           15900 non-null  object
 3   seq          15900 non-null  object
 4   paper        15900 non-null  object
 5   text         15900 non-null  object
dtypes: object(6)
memory usage: 869.5+ KB
None


| | id | date_issued | ed | seq | paper | text |
|---|---|---|---|---|---|---|
| 0 | sn99021999 | 1918-10-18 | ed-1 | seq-10 | Omaha daily bee. | 10\nTHE BEE: OMAHA, FRIDAY, OCTOBER 18, tftl8.... |
| 1 | sn87057934 | 1918-10-17 | ed-1 | seq-4 | Audubon County journal. | UEGISTRATION OF WAR\nBONDS IS URGED BY\nLOAN O... |
| 2 | sn88064328 | 1918-10-12 | ed-1 | seq-6 | New Iberia enterprise and independent observer. | SPANISH INFLUENZA\nTHREE DAY FEVER.\nWe were h... |
| 3 | sn89066315 | 1918-11-04 | ed-1 | seq-1 | The Evening Missourian. [volume] | I\nsrtgvS!iSiTIBriff!5S5S33BBBaS\n,,w, - p i.j... |
| 4 | sn99021999 | 1918-10-06 | ed-1 | seq-5 | Omaha daily bee. | THE OMAHA SUNDAY - BEE: OCTOBER 6. 1918.\n'FLU... |


In [None]:
df.id.value_counts(dropna=False)

sn85034235    636
sn89058012    636
sn99021999    636
sn83030193    636
sn88085488    636
sn88076432    636
sn89066315    636
sn88056159    318
sn89055004    318
sn88064328    318
sn87065612    318
sn86063041    318
sn87062268    318
sn87096037    318
sn84038095    318
sn82014519    318
sn84027718    318
sn86058242    318
sn94052320    318
sn91068748    318
sn86091100    318
sn90051006    318
sn85038531    318
sn85040437    318
sn95060583    318
sn85040652    318
sn86091084    318
sn85029856    318
sn84026853    318
sn88064055    318
sn89067273    318
sn94050892    318
sn86086586    318
sn86063774    318
sn85050913    318
sn87057934    318
sn85035720    318
sn85040720    318
sn84031081    318
sn88085318    318
sn86069675    318
sn00065154    318
sn91068765    318
Name: id, dtype: int64
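
Every count above is a multiple of 318, the number of files loaded, which may indicate that the same records occur in more than one file. A minimal sketch of how that could be checked, using a small hypothetical frame `toy` in place of `df` and assuming that a record is identified by `id`, `date_issued`, `ed`, and `seq`:

```python
import pandas as pd

# Hypothetical stand-in for `df`: the first two rows describe the same page.
toy = pd.DataFrame({
    "id": ["sn99021999", "sn99021999", "sn87057934"],
    "date_issued": ["1918-10-18", "1918-10-18", "1918-10-17"],
    "ed": ["ed-1", "ed-1", "ed-1"],
    "seq": ["seq-10", "seq-10", "seq-4"],
})

key = ["id", "date_issued", "ed", "seq"]  # assumed record identity

# Count rows that repeat an earlier row on the key columns.
n_repeats = int(toy.duplicated(subset=key).sum())
print("Repeated rows:", n_repeats)

# Keep the first occurrence of each page and renumber the index.
deduped = toy.drop_duplicates(subset=key).reset_index(drop=True)
print(len(deduped), "unique pages")
```

Whether dropping repeats is appropriate depends on how the files were produced, so the check is worth running on `df` before any rows are removed.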