# Load JSON files from a directory (or list of directories)

#### Note:
* If you are using BINDER, or ran a git clone to your local Jupyter Notebook server, skip cell 1, place the cursor on cell 2 and, from the main menu, choose "Cell" > "Run All Below" (unless you prefer to execute cell by cell).

In [None]:
# RUN this CELL only if you are using Google COLAB
# Downloading the zip file (224 MB) can take up to 30 minutes
# Not needed if you are using BINDER or ran a git clone to your local Jupyter Notebook server, as you already have the data locally

!wget https://cld.pt/dl/download/20ee7ff2-7e35-41fc-9898-35704e77e0a2/dig19cbooksjsontext_sample_0037.zip

# Only RUN THE UNZIP ONCE, please
!unzip dig19cbooksjsontext_sample_0037.zip 

!mkdir data
!mv dig19cbooksjsontext/ data/dig19cbooksjsontext/

In [1]:
# Load the necessary modules / Libraries:

import os, json
import pandas as pd

### Let's check the files in this Directory:

In [2]:
# If all the files are in the same directory, a simpler way to locate them is:
# path_to_json = 'data/dig19cbooksjsontext/json/0037/'
# json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

# This locates all the JSON files inside the main directory and any sub-folder:
path_to_json = 'data/dig19cbooksjsontext/'

# To match several file extensions, pass a tuple: name.endswith((".ext1", ".ext2"))
json_files = [os.path.join(root, name)
              for root, dirs, files in os.walk(path_to_json)
              for name in files
              if name.endswith(".json")]

print('Number of JSON files ready to be loaded: ' + str(len(json_files)))


Number of JSON files ready to be loaded: 1080


In [3]:
print('Path to the first file: '+json_files[0])
print('Path to the last file: '+json_files[-1])

Path to the first file: data/dig19cbooksjsontext/json/0050/005009643_01_text.json
Path to the last file: data/dig19cbooksjsontext/json/0118/011836203_01_text.json


### Files are ready to be loaded into a big DataFrame
We will add the ALEPH SYS ID of the book ("bookid") and the volume number, both extracted from the file name of each JSON file.

* A similar process can be used to append all the data into a single JSON, instead of loading the data from the JSON files into a DataFrame.
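
As a rough illustration of that alternative, the sketch below merges several JSON files into one JSON file keyed by source file name. The helper `combine_json_files`, the output name `combined.json` and the demo files are hypothetical, and the `[page, text]` layout of the demo data is an assumption based on the two columns (`0` and `1`) that `pd.read_json` produces in this notebook.

```python
import json
import os
import tempfile


def combine_json_files(paths, out_path):
    """Merge several JSON files into one JSON file keyed by source file name."""
    combined = {}
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            combined[os.path.basename(path)] = json.load(handle)
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(combined, handle, ensure_ascii=False)
    return combined


# Tiny self-contained demo with two made-up files (not the real data set).
demo_dir = tempfile.mkdtemp()
demo_paths = []
for file_name, pages in [
    ("000000001_01_text.json", [[1, ""], [2, "First page"]]),
    ("000000002_01_text.json", [[1, "Another book"]]),
]:
    demo_path = os.path.join(demo_dir, file_name)
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump(pages, handle)
    demo_paths.append(demo_path)

combined = combine_json_files(demo_paths, os.path.join(demo_dir, "combined.json"))
print(len(combined))
```

With the real data, `json_files` from the listing cell could be passed as `paths`. Keying by file name assumes that no two files in different sub-folders share the same name.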

In [4]:
# Read each file into its own DataFrame, then concatenate once at the end.
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.)
frames = []

for js in json_files:
    jsons_df_ins = pd.read_json(js)  # with the single-directory listing above, use pd.read_json(path_to_json + js) instead
    jsons_df_ins["bookid"] = js[35:44]  # ALEPH SYS ID: characters 35-43 of the path
    jsons_df_ins["volume"] = js[45:47]  # volume number: characters 45-46 of the path
    frames.append(jsons_df_ins)

jsons_df = pd.concat(frames, ignore_index=True)


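# Note: DataFrame.size counts cells (rows x columns), not rows; use len(jsons_df) for the number of rows
print('Blocks of text loaded into the DataFrame: ' + str(jsons_df.size))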

Blocks of text loaded into the DataFrame: 1587840
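
The fixed character positions `[35:44]` and `[45:47]` only work while every path has the same prefix length. A minimal sketch of a less fragile alternative is shown below: it splits the file name itself, assuming every file follows the `<bookid>_<volume>_text.json` pattern seen in the paths printed earlier. The helper name `parse_book_filename` is hypothetical.

```python
import os


def parse_book_filename(path):
    """Return (bookid, volume) from a path ending in '<bookid>_<volume>_text.json'."""
    file_name = os.path.basename(path)
    bookid, volume, _rest = file_name.split("_", 2)
    return bookid, volume


bookid, volume = parse_book_filename(
    "data/dig19cbooksjsontext/json/0050/005009643_01_text.json"
)
print(bookid, volume)
```

Because it only looks at the file name, this keeps working if the data is moved to a different directory.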


In [5]:
print('First 20 blocks of text:')
jsons_df.head(20)

First 20 blocks of text:


|  | 0 | 1 | bookid | volume |
|---|---|---|---|---|
| 0 | 1 |  | 5009643 | 1 |
| 1 | 2 |  | 5009643 | 1 |
| 2 | 3 |  | 5009643 | 1 |
| 3 | 4 |  | 5009643 | 1 |
| 4 | 5 | REPORT ONTHB GEOLOGY, MINERALOGY, BOTANY, AND ... | 5009643 | 1 |
| 5 | 6 | Entered according to Act of Congress, in the y... | 5009643 | 1 |
| 6 | 7 | To His Excellency, John davis, Esq.. Governor ... | 5009643 | 1 |
| 7 | 8 |  | 5009643 | 1 |
| 8 | 9 | INTRODUCTORY OR HISTORICAL NOTE. On the 3d of ... | 5009643 | 1 |
| 9 | 10 | INTRODUCTION. IV Commonwealth, in connection w... | 5009643 | 1 |


### Given that some blocks are empty, let's create a new DataFrame with only the blocks that have Text

In [6]:
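# Drop blocks of text that are empty

non_empty = jsons_df[1] != ""  # boolean mask (named to avoid shadowing the built-in filter)
blocksText_df = jsons_df[non_empty]
# Note: .size counts cells (rows x columns), not rows
print('Blocks of text non empty: ' + str(blocksText_df.size))
print('Excluded ' + str(jsons_df.size - blocksText_df.size) + ' that are empty')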


Blocks of text non empty: 1504596
Excluded 83244 that are empty
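
`DataFrame.size` returns rows × columns, so the figures printed above count cells rather than blocks of text. The sketch below, on a small made-up DataFrame with the same column layout, shows the difference between `len()` and `.size`. It also treats whitespace-only blocks as empty, which is an assumption about what "empty" should mean and goes beyond the `!= ""` test used in the cell above.

```python
import pandas as pd

# Made-up rows in the same shape as the DataFrame above (columns 0, 1, bookid, volume).
demo_df = pd.DataFrame(
    {
        0: [1, 2, 3, 4],
        1: ["", "Some text", "   ", "More text"],
        "bookid": ["000000001"] * 4,
        "volume": ["01"] * 4,
    }
)

# Keep rows whose text still has characters after stripping whitespace.
has_text = demo_df[1].fillna("").str.strip() != ""
demo_text_df = demo_df[has_text]

print("Rows:", len(demo_df), "Cells:", demo_df.size)
print("Rows with text:", len(demo_text_df))
```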


In [7]:
blocksText_df.head(20)

|  | 0 | 1 | bookid | volume |
|---|---|---|---|---|
| 4 | 5 | REPORT ONTHB GEOLOGY, MINERALOGY, BOTANY, AND ... | 5009643 | 1 |
| 5 | 6 | Entered according to Act of Congress, in the y... | 5009643 | 1 |
| 6 | 7 | To His Excellency, John davis, Esq.. Governor ... | 5009643 | 1 |
| 8 | 9 | INTRODUCTORY OR HISTORICAL NOTE. On the 3d of ... | 5009643 | 1 |
| 9 | 10 | INTRODUCTION. IV Commonwealth, in connection w... | 5009643 | 1 |
| 10 | 11 | INTRODUCTION. V such way and manner as he shal... | 5009643 | 1 |
| 11 | 12 | VI INTRODUCTION. ' Resolved, that the said fiv... | 5009643 | 1 |
| 12 | 13 | CONTENTS. PART I. ECONOMICAL GEOLOGY. Explanat... | 5009643 | 1 |
| 13 | 14 | CONTENTS VI II METALS AND THEIR ORES. Iron, • ... | 5009643 | 1 |
| 14 | 15 | CONTENTS IX Fall River: Pawtucket Falls : The ... | 5009643 | 1 |


### And now we can do whatever is needed with this DataFrame

#### E.g., let's look for blocks of text that mention London:

In [15]:
London_df = blocksText_df[blocksText_df[1].str.contains("London")]

In [16]:
print(London_df)

          0                                                  1     bookid  \
176     177  Origin of Diluvium. 171 ists,- that the land w...  005009643   
183     184  178 Scientific Geology. Clay. I am not certain...  005009643   
197     198  192 Scientific Geology. The conglomerate occup...  005009643   
205     206  Scientific Geology. 200 duction of the green s...  005009643   
207     208  202 Scientific Geology. with amber, and the in...  005009643   
...     ...                                                ...        ...   
395727   90  80 KAIRA : A DISTRICT IN FERTILE GUJARAT. No. ...  011834222   
395777   16  IV PREFACE. Island of Madagascar, I prepared a...  011844350   
396058    5  LIFE' IN THE MOFUSSIL; OR, THE CIVILIAN IN LOW...  011837920   
396153  100  Life in the Mofussil. 92 considered the new pr...  011837920   
396906  437  421 APPENDIX H tinge is much less extensive an...  011836203   

       volume  
176        01  
183        01  
197        01  
205        

#### ... and, of those, the blocks that also contain the term "love":

In [17]:
London_love_df = London_df[London_df[1].str.contains("love")]

In [18]:
print(London_love_df)

          0                                                  1     bookid  \
4002      3  THE VINCENTS «coo» ■ - N the corner of Middles...  003798471   
4356    341  CHATTO &> WIND US. PICCADILLY. 9 Faraday (Mich...  003761409   
6132      9  CONTENTS. PAGE. Chapter the First. — The Metro...  003794928   
6137     14  LONDON AND ITS ENVIRONS. 2 Till the last centu...  003794928   
6153     30  LONDON AND ITS ENVIRONS. 18 III. THE ANCIENT C...  003794928   
...     ...                                                ...        ...   
393912  135  111 STREET COMMERCE. PIC-NICS. dency has to wr...  010878173   
394022  245  LUCKNOW. 221 articles recommend them to the pu...  010878173   
394788  291  A TRIP TO CASHMERE AND LADAK. By Cowley Lamber...  011837920   
395539  214  194 ASHE PYEE. by an extraordinary helmet, 'so...  011833856   
395564  239  219 MISCELLANEOUS RECORD. route to the amount ...  011833856   

       volume  
4002       01  
4356       01  
6132       01  
6137       
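
Note that `str.contains("love")` is a case-sensitive substring test, so it also matches words such as "gloves" or "lovely" and misses "LOVE". If whole-word, case-insensitive matching is wanted, a regular expression with word boundaries is one option. The sketch below uses made-up sample strings, not the real data.

```python
import pandas as pd

demo_blocks = pd.Series(
    [
        "their love for old customs",
        "a pair of gloves",
        "LOVE AND WAR",
        "the lovely town of Dover",
    ]
)

# Plain substring search, as used in the cell above.
substring_hits = demo_blocks[demo_blocks.str.contains("love")]

# Whole-word search that ignores case.
whole_word_hits = demo_blocks[
    demo_blocks.str.contains(r"\blove\b", case=False, regex=True)
]
print(len(substring_hits), len(whole_word_hits))
```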

#### Let's check one of the records, ID = 6153 (5th in the list)

In [20]:
London_love_df[1][6153]

'LONDON AND ITS ENVIRONS. 18 III. THE ANCIENT CITY. REMAINS OF ROMAN LONDON. THE LONDON STONE. THE TOWER. — QUEEN VICTORIA\'S KEYS. — ST. JOHN\'S GATE. SHAKESPEARE\'S HOUSE. The almost entire absence of ancient monuments in a city, the origin of wliich is so antiquated, is not one of the least surprises which London presents to the stranger. One fancies usually that the old "City" encloses a number of old (secular) edifices; that in those dark streets and obscure passages called "lanes," new buildings would be the exception, not the rule. But this is an error. The dark houses which border the City ways are relatively modern, for, notwithstanding the respect whicli the English profess for all that savours of antiquity ; notwithstanding their love for old customs and ancient traditions (the word "old" is in English a term of affection), they know how to make allowance for increasing necessities and new wants, and sacrifice — not without hesitation, it is true, but nevertheless with deter

(...) REMAINS OF ROMAN **LONDON**. THE **LONDON** STONE.... notwithstanding their **love** for old customs and ancient traditions ... 
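
A highlighted excerpt like the one above could be produced with a small helper along these lines. This is only a sketch: the function name `highlight_terms` and the `context` parameter are hypothetical, and matching is a case-insensitive substring search, so "love" would also be highlighted inside "gloves".

```python
import re


def highlight_terms(text, terms, context=30):
    """Return short snippets around each match, with the match wrapped in ** **."""
    pattern = re.compile("|".join(re.escape(term) for term in terms), re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(text))
        snippets.append(
            text[start:match.start()] + "**" + match.group(0) + "**" + text[match.end():end]
        )
    return " ... ".join(snippets)


sample = "REMAINS OF ROMAN LONDON. notwithstanding their love for old customs"
print(highlight_terms(sample, ["London", "love"], context=10))
```

On the real data, the text of a record such as `London_love_df[1][6153]` could be passed in place of `sample`.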