This notebook provides starter code for using AlchemyAPI, IBM Watson's natural language processing service. The code expects its input as text stored in a JSON file, and it will need to be modified to match the structure of your JSON file.


**Library imports**

In [1]:
import json
import nltk
import os
from watson_developer_cloud import AlchemyLanguageV1
from sys import getsizeof

**Extract information from JSON file**


Alter the following code to match the format of your JSON file. As written, the cell below reads a JSON file containing report data and builds an array of objects, each of which has a title, a year, and OCR-processed text.

In [2]:
annual_reports = []
with open("hopland.json") as json_file:
    json_data = json.load(json_file,strict=False)
for j in json_data['features']:
    if j['attributes']['doctypename'] == "Annual Project Summary":
        annual_report = {"title":None,"year":None,"OCR":None}
        annual_report['title'] = j['attributes']['title']
        annual_report['year'] = j['attributes']['year']
        annual_report['OCR'] = j['attributes']['OCR']
    
        annual_reports.append(annual_report)
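
For reference, the loop above assumes a layout in which a top-level `features` list holds objects with an `attributes` dictionary. The following is a minimal sketch of that layout, using made-up sample values and a hypothetical helper named `load_reports`. It is meant only to show the nesting the cell relies on, not the contents of the real file.

```python
import json

# Made-up sample mirroring the nesting the loop above expects.
sample = '''
{"features": [
  {"attributes": {"doctypename": "Annual Project Summary",
                  "title": "Sample report", "year": 1975,
                  "OCR": "Some scanned text."}},
  {"attributes": {"doctypename": "Map",
                  "title": "Sample map", "year": 1980, "OCR": ""}}
]}
'''

def load_reports(json_text, doctype="Annual Project Summary"):
    """Return title/year/OCR dicts for features of the given document type."""
    data = json.loads(json_text, strict=False)
    reports = []
    for feature in data["features"]:
        attrs = feature["attributes"]
        if attrs.get("doctypename") == doctype:
            reports.append({"title": attrs.get("title"),
                            "year": attrs.get("year"),
                            "OCR": attrs.get("OCR") or ""})
    return reports

sample_reports = load_reports(sample)
print(len(sample_reports), sample_reports[0]["title"])
```

If your file uses different key names, only the lookups inside the loop need to change.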

Create an account at http://www.alchemyapi.com/api/register.html to obtain an API key, then add the key below.

In [8]:
alchemy_language = AlchemyLanguageV1(api_key='YOUR_API_KEY')
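
To avoid saving the key inside the notebook, one option is to read it from an environment variable. This is a sketch only: the variable name `ALCHEMY_API_KEY` and the helper `get_api_key` are hypothetical and not part of the Watson SDK.

```python
import os

def get_api_key(var_name="ALCHEMY_API_KEY"):
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError("Set the %s environment variable first" % var_name)
    return key

# Usage in this notebook (left commented so the sketch runs without a key):
# alchemy_language = AlchemyLanguageV1(api_key=get_api_key())
```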

### Concepts

The function below takes an array of objects containing OCR-processed text and extracts concepts from the text using AlchemyAPI.

In [32]:
def extract_concepts(reports):
    concepts = []
    for record in reports:
        text = record['OCR']
        year = record['year']
        print(len(text))  # progress output: length of the OCR text
        # Send placeholder text when the OCR field is empty
        if len(text) == 0:
            text = 'Empty OCR'
        # Unfinished idea: split long texts into chunks before calling the API
#         if len(text) > 50000:
#             smaller_text_list = [text[i:i + 50000] for i in range(0, len(text), 50000)]
        # Call AlchemyAPI; the response is stored as a JSON-formatted string
        concept = json.dumps(alchemy_language.concepts(language='english', text=text), indent=2)
        concepts.append({'year': year, 'concept': concept})
    return concepts
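
The commented-out lines in `extract_concepts` point toward splitting long texts into smaller pieces before calling the API. Below is a minimal sketch of one way to do that. The 50000 threshold is taken from those comments and treated here as a character count, which is an assumption (the service's actual limit may be measured differently), and `chunk_text` is a hypothetical helper. The sketch measures length with `len` rather than `sys.getsizeof`, which reports the memory footprint of the string object instead of its length.

```python
def chunk_text(text, max_chars=50000):
    """Split text into consecutive pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

pieces = chunk_text("a" * 120000)
print([len(p) for p in pieces])  # [50000, 50000, 20000]
```

Each piece could then be sent to the API separately. How to merge the per-piece results is left open here.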

In [None]:
annual_report_concepts = extract_concepts(annual_reports)

Export data into a new JSON file.

In [34]:
import json
with open('FinalOutput_AnnualProjectSummary.json', 'w') as outfile:
    json.dump(annual_report_concepts, outfile)
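
Because each API response is passed through `json.dumps` before being stored, the `concept` field in the exported file is itself a JSON string and needs a second decoding step when read back. The sketch below illustrates this with a made-up stand-in for the file contents. `decode_records` is a hypothetical helper, and Python 3 is assumed.

```python
import json

def decode_records(records, field):
    """Return copies of records with the JSON string in `field` parsed."""
    decoded = []
    for rec in records:
        rec = dict(rec)  # copy so the input list is left untouched
        if isinstance(rec.get(field), str):
            rec[field] = json.loads(rec[field])
        decoded.append(rec)
    return decoded

# Stand-in for the exported file contents; the response body is made up.
exported = json.dumps([{"year": 1975, "concept": json.dumps({"concepts": []})}])
records = decode_records(json.loads(exported), "concept")
print(type(records[0]["concept"]))
```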

### Entities

The function below takes an array of objects containing OCR-processed text and extracts entities (people, organizations, places) from the text using AlchemyAPI.

In [13]:
def get_entities(reports):
    entities = []
    for record in reports:
        text = record['OCR']
        year = record['year']
        if len(text) == 0:
            text = 'Empty OCR'
        # Call AlchemyAPI
        entity = json.dumps(alchemy_language.entities(language='english',text=text))
        e = {'year':None,'entities':None}
        e['year'] = year
        e['entities'] = entity
        
        entities.append(e)
    return entities

In [None]:
# annual_report_entities = get_entities(annual_reports)
annual_report_entities = []

for i in annual_reports[:100]:
    x = json.dumps(alchemy_language.entities(text=i['OCR']))
    e = {'year':None,'entities':None}
    e['year'] = i['year']
    e['entities'] = x
    annual_report_entities.append(e)

In [14]:
annual_report_entities = get_entities(annual_reports)

In [15]:
import json
with open('FinalOutput_AnnualProjectSummaryEntities.json', 'w') as outfile:
    json.dump(annual_report_entities, outfile)
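
Once the entities are collected, a quick tally by type can serve as a sanity check. The sketch below assumes each stored response is a JSON string whose decoded form has an `entities` list of objects with a `type` field. That response shape is an assumption to verify against real output, the sample data is made up, and `count_entity_types` is a hypothetical helper.

```python
import json
from collections import Counter

def count_entity_types(records):
    """Count entity types across records shaped like get_entities output."""
    counts = Counter()
    for rec in records:
        response = json.loads(rec["entities"])
        # Assumed response shape: {"entities": [{"type": ..., "text": ...}, ...]}
        for entity in response.get("entities", []):
            counts[entity.get("type", "Unknown")] += 1
    return counts

# Made-up records standing in for annual_report_entities.
sample_records = [
    {"year": 1975, "entities": json.dumps({"entities": [
        {"type": "Person", "text": "Jane Doe"},
        {"type": "Organization", "text": "Example Org"},
        {"type": "Person", "text": "John Roe"},
    ]})},
]
type_counts = count_entity_types(sample_records)
print(type_counts)
```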

### Keywords

The function below takes an array of objects containing OCR-processed text and extracts important keywords from the text using AlchemyAPI.

In [4]:
def get_keywords(reports):
    keywords = []
    for record in reports:
        text = record['OCR']
        year = record['year']
        # Call AlchemyAPI
        keyword = json.dumps(alchemy_language.keywords(text=text))
        e = {'year':None,'keywords':None}
        e['year'] = year
        e['keywords'] = keyword
        
        keywords.append(e)
    return keywords

In [5]:
annual_report_keywords = get_keywords(annual_reports)

In [6]:
import json
with open('FinalOutput_AnnualReportKeywords.json', 'w') as outfile:
    json.dump(annual_report_keywords, outfile)
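
The three extraction functions in this notebook differ only in the API method they call and the key under which they store the result. One possible refactoring is a shared helper that takes the method as an argument. This is a sketch: `extract_feature` and the stand-in `fake_call` are hypothetical names, and `fake_call` exists only so the example runs without network access.

```python
import json

def extract_feature(reports, api_call, key, placeholder="Empty OCR"):
    """Apply api_call to each report's OCR text and collect {'year', key} dicts."""
    results = []
    for record in reports:
        # Fall back to placeholder text when the OCR field is empty
        text = record["OCR"] or placeholder
        response = api_call(text=text)
        results.append({"year": record["year"], key: json.dumps(response)})
    return results

# Stand-in for a real API method, so the sketch runs offline.
def fake_call(text):
    return {"status": "OK", "length": len(text)}

demo = extract_feature([{"year": 1975, "OCR": ""}], fake_call, "keywords")
print(demo)
```

In the notebook, the call would look like `extract_feature(annual_reports, alchemy_language.keywords, "keywords")`.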