<div class="jumbotron jumbotron-fluid">
  <div class="container">
    <h1 class="display-4">Load JSON files from a directory (and sub-directories)</h1>
    <p class="lead">Examples of how to load the full text from JSON files stored in a local directory (and its sub-directories), or from a file downloaded from a URL.</p>
  </div>
</div>

<a href="https://colab.research.google.com/github/BL-Labs/Jupyter-notebooks-projects-using-BL-Sources/blob/master/Microsoft19thCenturyBooks/load_JSON_files_not_run.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### In this Notebook:

 * Read JSON files containing full text, from a local zip file or one downloaded from a URL;
 * Load the contents of the JSON files, even when spread over several sub-directories, into one big DataFrame;
 * Add the ALEPH SYS ID of the book ("bookid") and the Volume information extracted from the filename of each JSON file;
 * Exclude blocks of text that are empty;
 * Search for blocks of text that mention "London";
 * From those, search for the ones that contain the term "love";
 * Display a couple of those blocks of text;
 * Retrieve the BookID and Volume to which those blocks of text belong;
 * Link to their catalog record on BL Explorer.


<div class="alert alert-info" role="alert">
    
#### Notes:

* If using BINDER, or if you did a git clone to your Jupyter Notebook LOCAL SERVER, skip cell 1, place the cursor on cell 2 and, from the main menu, choose "Cell" > "Run All Below", unless you are executing cell-by-cell;

* If using Google Colab, place the cursor on the first cell and, from the main menu, choose "Runtime" > "Run after", unless you are executing cell-by-cell.
</div> 
<br/>

In [None]:
# Only RUN this CELL if using Google COLAB
# Downloading the zip file (295MB) can take from 10 to 30 minutes
# Not needed if using BINDER or if you did a git clone to your Jupyter Notebook LOCAL SERVER, as you already have the data locally

!wget https://cld.pt/dl/download/b2d718a4-2aa2-496c-aad8-d889f27a4be2/dig19cbooksjsontext_sample.zip

# Only RUN THE UNZIP ONCE, please
!unzip dig19cbooksjsontext_sample.zip 

!mkdir -p data
!mv dig19cbooksjsontext/ data/dig19cbooksjsontext/

In [None]:
# Load the necessary modules / Libraries:

import os, json
import pandas as pd

### Let's check the files in this Directory:

In [None]:
# If all the files are under the same directory a simpler way to locate them would be:
# path_to_json = 'data/dig19cbooksjsontext/json/0037/'
# json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

#this will locate all the JSON files inside the main Directory and any sub-Folder:
path_to_json = 'data/dig19cbooksjsontext/'

json_files = [os.path.join(root, name)
             for root, dirs, files in os.walk(path_to_json)
             for name in files
             if name.endswith(".json")]  # To read several file extensions, pass a tuple: name.endswith((".ext1", ".ext2"))

print('Number of JSON files ready to be loaded: ' + str(len(json_files)))
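As an alternative to `os.walk`, the same recursive search can be sketched with `pathlib`. The helper name `find_json_files` and the file names in the demo are hypothetical; the demo runs on a throwaway directory so it does not depend on the downloaded data.

```python
import tempfile
from pathlib import Path


def find_json_files(base_dir):
    """Return the sorted paths (as strings) of all .json files under base_dir."""
    return sorted(str(p) for p in Path(base_dir).rglob("*.json") if p.is_file())


# Self-contained demo on a temporary directory tree (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "json" / "0037"
    nested.mkdir(parents=True)
    (nested / "000000037_01_text.json").write_text("[]")
    (Path(tmp) / "readme.txt").write_text("not JSON")

    found = find_json_files(tmp)
    print("Number of JSON files found:", len(found))
```

Sorting the result gives a stable order for "first" and "last" file, whereas the order returned by `os.walk` depends on the file system.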


In [None]:
print('Path to the first file: ' + json_files[0])
print('Path to the last file: ' + json_files[-1])

### Files are ready to be loaded into a big DataFrame
We will add the ALEPH SYS ID of the book ("bookid") and the Volume information extracted from the filename of the JSON file.

* A similar process can be used to append all the data into a single JSON, instead of loading the data from the JSON files into a DataFrame.
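
The single-JSON variant mentioned above could look roughly like the sketch below. It assumes that every file holds a top-level JSON list (for example, `[index, text]` pairs); the function name `merge_json_files` and the demo file names are hypothetical.

```python
import json
import os
import tempfile


def merge_json_files(paths, out_path):
    """Concatenate the top-level lists of several JSON files into one JSON file."""
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            merged.extend(json.load(fh))  # assumes each file contains a list
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(merged, fh, ensure_ascii=False)
    return len(merged)


# Self-contained demo with two tiny, made-up files
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.json")
    second = os.path.join(tmp, "b.json")
    with open(first, "w", encoding="utf-8") as fh:
        json.dump([[1, "first block"], [2, ""]], fh)
    with open(second, "w", encoding="utf-8") as fh:
        json.dump([[1, "third block"]], fh)

    out_file = os.path.join(tmp, "merged.json")
    total = merge_json_files([first, second], out_file)
    with open(out_file, encoding="utf-8") as fh:
        merged_back = json.load(fh)
    print("Blocks written:", total)
```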

In [None]:
frames = []  # DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
#jsons_df = pd.DataFrame(columns=['indexglob', 'indexlocal', 'text','bookid','volume'])

for js in json_files:
    jsons_df_ins = pd.read_json(js)  #if we were just loading from one Directory, first solution above, we would need to add the path to the file: jsons_df_ins = pd.read_json(path_to_json + js)
    # These fixed slices depend on the length of path_to_json; adjust them if the base path changes
    jsons_df_ins["bookid"] = js[35:44]
    jsons_df_ins["volume"] = js[45:47]
    frames.append(jsons_df_ins)

jsons_df = pd.concat(frames, ignore_index=True)


# len() counts rows (one per block of text); .size would count rows x columns
print('Blocks of text loaded into the DataFrame: ' + str(len(jsons_df)))
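
The fixed slices `[35:44]` and `[45:47]` used above only work while the base path keeps the same length. A sketch of a more tolerant approach parses the file name itself. The pattern below (nine digits, an underscore, two digits) is an assumption inferred from those slice positions, and the example path is hypothetical.

```python
import os
import re

# Assumed file-name pattern: 9-digit book id, underscore, 2-digit volume
FILENAME_RE = re.compile(r"^(\d{9})_(\d{2})")


def parse_bookid_volume(path):
    """Return (bookid, volume) parsed from the file name, or (None, None) if it does not match."""
    match = FILENAME_RE.match(os.path.basename(path))
    if match is None:
        return None, None
    return match.group(1), match.group(2)


example = "data/dig19cbooksjsontext/json/0037/000000037_01_text.json"  # hypothetical path
bookid, volume = parse_bookid_volume(example)
print(bookid, volume)
```

Inside the loading loop, this would replace the two slice assignments with `jsons_df_ins["bookid"], jsons_df_ins["volume"] = parse_bookid_volume(js)`.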
    

In [None]:
print('First 20 blocks of text:')
jsons_df.head(20)

### Given that some blocks are empty, let's create a new DataFrame with only the blocks that have Text

In [None]:
#Drop blocks of text that are empty

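non_empty = jsons_df[1] != ""  # named non_empty because 'filter' would shadow the Python built-in
blocksText_df = jsons_df[non_empty]
# len() counts rows (blocks); .size would count rows x columns
print('Non-empty blocks of text: ' + str(len(blocksText_df)))
print('Excluded ' + str(len(jsons_df) - len(blocksText_df)) + ' that are empty')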


In [None]:
blocksText_df.head(20)

### And now we can do whatever needed with this DataFrame

#### E.g., let's look for blocks of text that mention London:

In [None]:
London_df = blocksText_df[blocksText_df[1].str.contains("London")]

In [None]:
print(London_df)

#### ... and, of those, the ones that contain the term "love":

In [None]:
London_love_df = London_df[London_df[1].str.contains("love")]

In [None]:
print(London_love_df)
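
The two searches can also be combined into a single boolean mask. The sketch below uses a small made-up DataFrame in place of `blocksText_df`; `case=False` makes the match case-insensitive and `na=False` treats missing values as non-matches, which the cells above do not do.

```python
import pandas as pd

# Made-up blocks of text; column 1 holds the text, as in the notebook's DataFrame
blocks = pd.DataFrame({1: [
    "A walk through London with love",
    "LONDON fog",
    "Paris, with love",
    None,
]})

mentions_london = blocks[1].str.contains("London", case=False, na=False)
mentions_love = blocks[1].str.contains("love", case=False, na=False)

# Keep only the rows where both conditions hold
both = blocks[mentions_london & mentions_love]
print(len(both))
```

Note that `"love"` is matched as a substring, so "lover" and "loved" also count, as the excerpts in this notebook show. A pattern such as `r"\blove\b"` would restrict the match to the whole word.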

#### Let's check one of the records, ID = 6153 (3rd in the list)

In [None]:
London_love_df[1][6153]

# if using Google COLAB, comment out the line above and uncomment the one below, please
# London_love_df[1][186]

(...) REMAINS OF ROMAN **LONDON**. THE **LONDON** STONE.... notwithstanding their **love** for old customs and ancient traditions ... 

or, if using Google Colab:

(...) I am come down to take her to **London**, where (...) as the only true **love**r and friend she had (...)

#### Retrieve the BookID and Volume to which this block of text belongs:

In [None]:
London_love_df.loc[6153, ['bookid', 'volume']]

#### We can also check a random one (a sample of 1):

In [None]:
London_love_df[1].sample(1)
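
Because `sample(1)` returns a different row on each run, the index then has to be copied by hand into the following cells. A sketch of a reproducible variant fixes `random_state` and reads the index from the result. The Series below is made up and stands in for `London_love_df[1]`, and the seed value is arbitrary.

```python
import pandas as pd

# Made-up stand-in for London_love_df[1]
texts = pd.Series(["block a", "block b", "block c"], index=[186, 6153, 347374])

picked = texts.sample(1, random_state=42)  # same row on every run
picked_index = picked.index[0]             # reuse this instead of typing the index by hand
print(picked_index, texts.loc[picked_index])
```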

In [None]:
London_love_df[1][347374] # use the index, the first number retrieved above -- when this was run, it gave the index 347374

(...) St. Paul's cathedral in **London**, it was regarded (...) at that period, so **love**d and reverenced by the people (...)

#### And again, let's retrieve the BookID and Volume to which this block of text belongs:

In [None]:
London_love_df.loc[347374, ['bookid', 'volume']]

#### The bookid is the ALEPH SYS number, so we can search for it on BL Explorer:

In [None]:
print('Link to this book\'s catalog record at BL explorer:')
print('http://explore.bl.uk/primo_library/libweb/action/search.do?cs=frb&doc=BLL'+ London_love_df["bookid"][347374] + '&dscnt=1&scp.scps=scope:(BLCONTENT)&frbg=&tab=local_tab&srt=rank&ct=search&mode=Basic&dum=true&tb=t&indx=1&vl(freeText0)='+ London_love_df["bookid"][347374] + '&fn=search&vid=BLVU1')
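
To reuse the link for other records, the URL above can be wrapped in a small helper. The function name `explore_url` and the example book ID are hypothetical; the URL pieces are copied unchanged from the cell above.

```python
def explore_url(bookid):
    """Build the BL Explorer search URL for a given ALEPH SYS number."""
    return (
        "http://explore.bl.uk/primo_library/libweb/action/search.do"
        f"?cs=frb&doc=BLL{bookid}&dscnt=1&scp.scps=scope:(BLCONTENT)"
        "&frbg=&tab=local_tab&srt=rank&ct=search&mode=Basic&dum=true&tb=t&indx=1"
        f"&vl(freeText0)={bookid}&fn=search&vid=BLVU1"
    )


example_id = "000000037"  # hypothetical book ID
url = explore_url(example_id)
print(url)
```

In the notebook, this would be called as `explore_url(London_love_df["bookid"][347374])`.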