# Load JSON files from a directory (or list of directories)

#### Note:
* If using BINDER or a did a git clone to your Jupyter Notebook LOCAL SERVER, skip cell 1, place the cursor on cell 2 and from the main menu, choose "Cell" > "Run All Bellow", if not executing cell-by-cell.

In [None]:
# Only RUN this CELL if using Google COLAB
# To download the zip file (224MB), it could take up to 30 minutes
# Not needed if using BINDER or a did a git clone to your Jupyter Notebook LOCAL SERVER, as you already have the data locally

!wget https://cld.pt/dl/download/20ee7ff2-7e35-41fc-9898-35704e77e0a2/dig19cbooksjsontext_sample_0037.zip

# Only RUN THE UNZIP ONCE, please
!unzip dig19cbooksjsontext_sample_0037.zip 

!mkdir data
!mv dig19cbooksjsontext/ data/dig19cbooksjsontext/

In [1]:
# Load the necessary modules / Libraries:

import os, json
import pandas as pd

### Let's check the files in this Directory:

Number of JSON files ready to be loaded: 930


In [13]:
# If all the files are under the same directory a simpler way to locate them would be:
# path_to_json = 'data/dig19cbooksjsontext/json/0037/'
# json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

#this will locate all the JSON files inside the main Directory and any sub-Folder:
path_to_json = 'data/dig19cbooksjsontext/'

json_files = [os.path.join(root, name)
             for root, dirs, files in os.walk(path_to_json)
             for name in files
             if name.endswith((".json"))] #If we needed to read several files extensions: if name.endswith((".ext1", ".ext2"))

print('Number of JSON files ready to be loaded: ' + str(len(json_files)))


Number of JSON files ready to be loaded: 930


In [6]:
print('Path to the first file: '+json_files[0])
print('Path the the last file: '+json_files[len(json_files)-1])

Name of the first file: data/dig19cbooksjsontext/json/0037/003705089_01_text.json
Name of the last file: data/dig19cbooksjsontext/json/0037/003792400_01_text.json


### Files are ready to be loaded into a big DataFrame
We will add ALEPH SYS ID of the book ("bookid") and the Volume information extracted from the filename of the JSON file.

* A similar process can be used to append all the data into a single JSON, instead of loading the data from the JSON files into a DataFrame.

In [8]:
jsons_df = pd.DataFrame()
#jsons_df = pd.DataFrame(columns=['', '0', '1','bookid','volume'])

for index, js in enumerate(json_files):
    jsons_df_ins = pd.read_json(json_files[index])  #if we were just loading from one Directory, first solution above, we would need to add the path to the file: jsons_df_ins = pd.read_json(path_to_json+json_files[index])
    jsons_df_ins["bookid"] = json_files[index][0:9]
    jsons_df_ins["volume"] = json_files[index][10:12]
    jsons_df = jsons_df.append(jsons_df_ins, ignore_index=True)


print('Blocks of text loaded into the DataFrame: ' + str(jsons_df.size))
    

Blocks of text loaded into the DataFrame: 1284528


In [9]:
print('First 20 blocks of text:')
jsons_df.head(20)

First 20 blocks of text:


Unnamed: 0,0,1,bookid,volume
0,1,,data/dig1,cb
1,2,,data/dig1,cb
2,3,,data/dig1,cb
3,4,,data/dig1,cb
4,5,,data/dig1,cb
5,6,,data/dig1,cb
6,7,"Der m der Gcywnz in seiner Veranlagung, Wirkli...",data/dig1,cb
7,8,,data/dig1,cb
8,9,Vorrede des Verfassers. So viele Daistellungen...,data/dig1,cb
9,10,IV ten gar keine Einsicht in die sogenannten „...,data/dig1,cb


### Given that some blocks are empty, let's create a new DataFrame with only the blocks that have Text

In [10]:
#Drop blocks of text that are empty

filter = jsons_df[1] != ""
blocksText_df = jsons_df[filter]
print('Blocks of text non empty: ' + str(blocksText_df.size))
print('Excluded ' + str(jsons_df.size - blocksText_df.size) + ' that are empty')


Blocks of text non empty: 1214884
Excluded 69644 that are empty


In [11]:
blocksText_df.head(20)

Unnamed: 0,0,1,bookid,volume
6,7,"Der m der Gcywnz in seiner Veranlagung, Wirkli...",data/dig1,cb
8,9,Vorrede des Verfassers. So viele Daistellungen...,data/dig1,cb
9,10,IV ten gar keine Einsicht in die sogenannten „...,data/dig1,cb
10,11,V des politisch-konfessionellen Kampfes von 18...,data/dig1,cb
11,12,VI Schrift zur Aufdeckung vieler unheilvollen ...,data/dig1,cb
12,13,Erster Theil. Die Veranlassung des Krieges I. ...,data/dig1,cb
13,14,"2 In diesen Thälern nun, welche dic Kantone Ui...",data/dig1,cb
14,15,3 8. 2. Die Grundlage der schweizerischen Eidg...,data/dig1,cb
15,16,4 Waldstättc. Daher stunden sie ihm wider den ...,data/dig1,cb
16,17,5 seinem Neffen Johann von Schwaben und dessen...,data/dig1,cb


### And now we can do whatever needed with this DataFrame.