# Natural Language Processing: Information Extraction

Information extraction (IE) is the task of automatically extracting <font color="green">structured information</font> from <font color="brown">unstructured and/or semi-structured</font> machine-readable documents. In most cases, this activity involves processing human-language texts by means of natural language processing (NLP).

<b><font color="blue">Objective:</font></b>   
We will consider a collection of text passages that briefly describe some of the most popular football players of the current generation. The passages are taken from the players' respective Wikipedia pages. The task is to extract the following key information and relationships from this unstructured data:    
1. Player Name   
2. Country      
3. Teams (Clubs and National)    
4. Birth date   
5. Debut year   
6. Playing positions on the field   

Finally, we create a JSON-LD representation of the data that can be used to build a knowledge representation or ontology of football players.    

<b><font color="blue">Note:</font></b>     
We extensively use <font color="brown">*regular expressions* and NLTK</font> functions for implementing various NLP Information Extraction techniques.   


In [2]:
#Import all the required packages
import nltk
import re
from statistics import mode

In [4]:
inputfile = 'football_players.txt'  # Location of the file
# Read the file, split the text on new lines and exclude blank lines
with open(inputfile, encoding="UTF-8") as buf:
    list_of_doc = [l for l in buf.read().split('\n') if l != '']

# 1. Preprocessing the text
The text from each document is preprocessed with the following steps:   

1. *Sentence segmentation*    
2. *Tokenization*    
3. *Part-of-speech tagging*     

In [5]:
def ie_preprocess(document):
    '''
    Split the text into sentences, sentences into words, and annotate the words with POS tags
    
    Arguments:
        document -- Chunk of text to be processed
    
    Returns:
        List of sentences, each a list of (word, POS tag) tuples
    '''
    pos_sentences = nltk.sent_tokenize(document)
    pos_sentences = [nltk.word_tokenize(sent) for sent in pos_sentences]
    pos_sentences = [nltk.pos_tag(sent) for sent in pos_sentences]
    return pos_sentences

<font color="blue">Unit Test: Preprocessing</font>    
Test the above function on the first document (about the footballer *Ronaldo*).

In [4]:
first_doc=list_of_doc[0]
pos_sent=ie_preprocess(first_doc)
pos_sent

[[('Cristiano', 'NNP'),
  ('Ronaldo', 'NNP'),
  ('dos', 'NN'),
  ('Santos', 'NNP'),
  ('Aveiro', 'NNP'),
  (',', ','),
  ('ComM', 'NNP'),
  (',', ','),
  ('GOIH', 'NNP'),
  ('(', '('),
  ('born', 'VBN'),
  ('5', 'CD'),
  ('February', 'NNP'),
  ('1985', 'CD'),
  (')', ')'),
  ('is', 'VBZ'),
  ('a', 'DT'),
  ('Portuguese', 'JJ'),
  ('professional', 'JJ'),
  ('footballer', 'NN'),
  ('who', 'WP'),
  ('plays', 'VBZ'),
  ('for', 'IN'),
  ('Spanish', 'JJ'),
  ('club', 'NN'),
  ('Real', 'NNP'),
  ('Madrid', 'NNP'),
  ('and', 'CC'),
  ('the', 'DT'),
  ('Portugal', 'NNP'),
  ('national', 'JJ'),
  ('team', 'NN'),
  ('.', '.')],
 [('He', 'PRP'),
  ('is', 'VBZ'),
  ('a', 'DT'),
  ('forward', 'NN'),
  ('and', 'CC'),
  ('serves', 'NNS'),
  ('as', 'IN'),
  ('captain', 'NN'),
  ('for', 'IN'),
  ('Portugal', 'NNP'),
  ('.', '.')],
 [('In', 'IN'),
  ('2008', 'CD'),
  (',', ','),
  ('he', 'PRP'),
  ('won', 'VBD'),
  ('his', 'PRP$'),
  ('first', 'JJ'),
  ('Ballon', 'NNP'),
  ("d'Or", 'NN'),
  ('and', 'CC')

Expected output
 [...[('He', 'PRP'),
  ('is', 'VBZ'),
  ('a', 'DT'),
  ('forward', 'NN'),
  ('and', 'CC'),
  ('serves', 'NNS'),
  ('as', 'IN'),
  ('captain', 'NN'),
  ('for', 'IN'),
  ('Portugal', 'NNP'),
  ('.', '.')], ...]

# 2. Named Entity Extraction    
Next, we write a function that takes the list of POS-tagged tokens of a sentence and returns its named entities (NE). In information extraction, a named entity is a real-world object, such as a person, location, organization or product, that can be denoted with a proper name. [Wikipedia]

In [6]:
def named_entity_finding(pos_sent):
    '''
    Extract named entities from the sentence
    
    Arguments:
        pos_sent -- a POS tagged sentence from the text
    
    Returns:
        List of named tagged entities from the sentence
    '''
    
    #Parse POS tags words to extract chunks in a tree
    tree = nltk.ne_chunk(pos_sent, binary=True)
    named_entities = []

    #Extract chunks tagged specifically as NE (Named Entities) from the subtree
    for subtree in tree.subtrees():
        if subtree.label() == 'NE':
            entity = ""
            for leaf in subtree.leaves():
                entity = entity + leaf[0] + " "
            named_entities.append(entity.strip())
    return named_entities

#Test the output with first sentence from the given text
pos_sents = ie_preprocess(list_of_doc[0])
named_entity_finding(pos_sents[0])

['Cristiano Ronaldo',
 'Santos Aveiro',
 'ComM',
 'GOIH',
 'Portuguese',
 'Spanish',
 'Real Madrid',
 'Portugal']

## 3. Extract and combine all Named Entities
Extract all the named entities from each player's description and combine them into a single list.     

In [6]:
def NE_flat_list_fn(pos_sents): 
    '''
    Extract all named entities from the given document and flatten them in a single list
    
    Arguments:
        pos_sents -- a paragraph of text (with POS tags) from which all NEs are to be extracted
    
    Returns:
        A flattened list of named tagged entities from the text
    '''
    NE = []
    for pos_sent in pos_sents:
        #Collect the named entities of each sentence as one sub-list
        NE.append(named_entity_finding(pos_sent))
    #Flatten the list of lists into a single list (done once, after the loop,
    #so the result is also defined when pos_sents is empty)
    NE_flat_list = [ne for line in NE for ne in line]
    return NE_flat_list


NE_flat_list_fn(pos_sents)

['Cristiano Ronaldo',
 'Santos Aveiro',
 'ComM',
 'GOIH',
 'Portuguese',
 'Spanish',
 'Real Madrid',
 'Portugal',
 'Portugal',
 'Ballon',
 'FIFA',
 'FIFA Ballon',
 'Ronaldo',
 'Ronaldo',
 'Portuguese',
 'Portuguese Football Federation',
 'European Golden Shoe',
 'ESPN',
 'Ronaldo',
 'Manchester United',
 'England',
 'United',
 'UEFA Champions League',
 'FIFA Club',
 'Ballon',
 'FIFA',
 'Manchester United',
 'Madrid',
 'Spain',
 'Ronaldo',
 'UEFA Champions League',
 'Ronaldo',
 'La Liga',
 'Ronaldo',
 'UEFA Champions League',
 'Real Madrid',
 'La Liga',
 'Lionel Messi',
 'Ronaldo',
 'Portugal',
 'Portugal',
 'European',
 'FIFA World Cups',
 'Portuguese',
 'Portugal',
 'Portugal',
 'Portugal',
 'Ronaldo',
 'UEFA European',
 'European',
 'Michel Platini',
 'Ronaldo',
 'Portugal',
 'France',
 'Silver Boot']
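
The flattening step can also be expressed with `itertools.chain.from_iterable` from the standard library. The sketch below uses a small hand-written list of per-sentence results (a subset of the output above) in place of real calls to `named_entity_finding`; the function name `flatten_entities` is illustrative and not part of the notebook.

```python
from itertools import chain

def flatten_entities(entities_per_sentence):
    """Flatten a list of per-sentence entity lists into one list, keeping order."""
    return list(chain.from_iterable(entities_per_sentence))

# Hand-written stand-in for the per-sentence output of named_entity_finding()
per_sentence = [
    ['Cristiano Ronaldo', 'Santos Aveiro', 'Real Madrid', 'Portugal'],
    ['Portugal'],
    [],  # a sentence without named entities
]
print(flatten_entities(per_sentence))
```

Duplicates such as the repeated `'Portugal'` are kept, as in the notebook's own output.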

## 4. Extract information for key fields

Now, we extract the name of the player, country of origin and date of birth as well as the following relations: team(s) of the player and position(s) of the player.

### 4.1. Name of the player

In [7]:
def name_of_the_player(doc):
    '''
    Find name of the player from given text
    
    Arguments:
        doc -- a text paragraph from document related to a player
    
    Returns:
        name of the player
    '''
    # Hint: Use the named_entity_finding() function
    #tokenize and pos tag sentences from the text before extracting named entities
    pos_sents = ie_preprocess(doc)
    #Extract NEs from the first sentence and take the first one as the player's name
    #(with binary=True the chunker labels entities only as 'NE', not as 'PERSON')
    name = named_entity_finding(pos_sents[0])[0]
    
    return name

#Test function to extract names of all players one by one
[name_of_the_player(paragraph) for paragraph in list_of_doc]

['Cristiano Ronaldo',
 'Lionel Andrés',
 'Neymar',
 'Ronaldo',
 'Wayne Mark Rooney',
 'Zlatan Ibrahimović',
 'David Robert Joseph Beckham',
 'Mesut Özil',
 'Gareth Frank Bale',
 'Andrés Iniesta Luján']
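
The chunker sometimes returns only part of a name (for example `'Lionel Andrés'`). One possible fallback, sketched here under the assumption that a passage opens with the full name followed by `(born ...)`, is to take the text before that marker. The function `name_before_born` is hypothetical, and the sample sentence is a shortened form of the first sentence of the first document.

```python
import re

def name_before_born(passage):
    """Return the text preceding the first '(born' marker, cut at the first comma."""
    match = re.search(r"^(.*?)\s*\(born\b", passage)
    if match is None:
        return None  # the passage does not follow the assumed layout
    # Drop trailing honours such as ", ComM, GOIH" by keeping the part before the first comma
    return match.group(1).split(",")[0].strip()

sample = ("Cristiano Ronaldo dos Santos Aveiro, ComM, GOIH (born 5 February 1985) "
          "is a Portuguese professional footballer.")
print(name_before_born(sample))
```

This relies entirely on the layout of the opening sentence, so it would need checking against every passage before replacing the NE-based approach.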

### 4.2. Country of origin
Ideally, the country of origin can be deduced by extracting a PERSON-GPE relationship from sentences like "is a ... [Nationality] professional footballer", for example:

In [8]:
sentence = "Christiano Ronaldo is a Portuguese football player who plays for Spanish club Real Madrid and Portugal national team."
chunked_sentence = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
IS = re.compile(r'.*\bis\b.*')
rel = nltk.sem.relextract.extract_rels('PER', 'GPE', chunked_sentence, corpus='ace', pattern=IS, window=20) 
print (nltk.sem.relextract.rtuple(rel[0]))

[PER: 'Ronaldo/NNP'] 'is/VBZ a/DT' [GPE: 'Portuguese/JJ']


However, the presence of non-English names and inaccurate NE tagging of some terms yields incorrect or no results in some cases. So, we resort to extraction using regex patterns that find the national team each player represents.

In [9]:
def country_of_origin(doc):
    '''
    Find the country of origin from the given text
    
    Arguments:
        doc -- a text paragraph from document related to a player
    
    Returns:
        country of the player
    '''
    #Specify a regex pattern that "looks ahead" for 'national team' and returns the preceding word
    national_pattern = re.compile(r"\w+(?=\snational\steam)")
    #search for the first instance of above defined pattern in the text
    country = national_pattern.search(doc).group(0)
    return country

#Test function for all players one by one
[{name_of_the_player(paragraph) : country_of_origin(paragraph)} for paragraph in list_of_doc]

[{'Cristiano Ronaldo': 'Portugal'},
 {'Lionel Andrés': 'Argentina'},
 {'Neymar': 'Brazil'},
 {'Ronaldo': 'Brazil'},
 {'Wayne Mark Rooney': 'England'},
 {'Zlatan Ibrahimović': 'Sweden'},
 {'David Robert Joseph Beckham': 'England'},
 {'Mesut Özil': 'German'},
 {'Gareth Frank Bale': 'Wales'},
 {'Andrés Iniesta Luján': 'Spain'}]
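
To make the lookahead concrete, here is a minimal, self-contained sketch of the same pattern with a guard for passages that never mention a national team. The helper name `country_or_none` is an assumption made for this example, and the sample sentence is adapted from the first document.

```python
import re

# Lookahead: match a word only when " national team" follows it;
# the lookahead text itself is not part of the match.
NATIONAL_TEAM = re.compile(r"\w+(?=\snational\steam)")

def country_or_none(passage):
    """Return the word before 'national team', or None when the phrase is absent."""
    match = NATIONAL_TEAM.search(passage)
    return match.group(0) if match else None

sentence = ("He plays for Spanish club Real Madrid "
            "and the Portugal national team.")
print(country_or_none(sentence))            # Portugal
print(country_or_none("He is a forward."))  # None
```

The pattern returns whatever single word precedes the phrase, which is why an adjective such as `'German'` comes back when the text says "German national team".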

### 4.3. Date of Birth

In [10]:
def date_of_birth(doc):
    '''
    Extract date of birth for each player from the given text
    
    Arguments:
        doc -- a text paragraph from document related to a player
    
    Returns:
        DOB of the player
    '''
    #Regex pattern to match 'born' date in "DD Month YYYY" format
    birthdate_pattern = re.compile(r"born\s[\d]{1,2}\s[A-Z][a-z]+\s\d{4}")
    #Search for the date and remove the "born " prefix (5 characters) preceding it
    date = re.search(string = doc, pattern = birthdate_pattern).group()[5:]
    return date

#Test function for all players one by one
[{name_of_the_player(paragraph) : date_of_birth(paragraph)} for paragraph in list_of_doc]

[{'Cristiano Ronaldo': '5 February 1985'},
 {'Lionel Andrés': '24 June 1987'},
 {'Neymar': '5 February 1992'},
 {'Ronaldo': '21 March 1980'},
 {'Wayne Mark Rooney': '24 October 1985'},
 {'Zlatan Ibrahimović': '3 October 1981'},
 {'David Robert Joseph Beckham': '2 May 1975'},
 {'Mesut Özil': '15 October 1988'},
 {'Gareth Frank Bale': '16 July 1989'},
 {'Andrés Iniesta Luján': '11 May 1984'}]
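
The extracted strings keep the "DD Month YYYY" form of the source text. If a sortable form is preferred, one option is to convert them to ISO 8601 (`YYYY-MM-DD`) with the standard library. This is a sketch that assumes English month names and an English or C locale, since `%B` is locale-dependent; `to_iso_date` is a name introduced here.

```python
from datetime import datetime

def to_iso_date(birthdate):
    """Convert a 'DD Month YYYY' string to 'YYYY-MM-DD'."""
    return datetime.strptime(birthdate, "%d %B %Y").date().isoformat()

print(to_iso_date("5 February 1985"))  # 1985-02-05
```

`strptime` raises `ValueError` for strings that do not match the format, which doubles as a sanity check on the regex output.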

### 4.4. Teams of the player
 - The names of teams that a player has represented can be found all over the paragraph and not only in the first few sentences.    
 - A player is generally associated with a team by 'playing' for it, 'signing' with it or being 'transferred' to it. We first <strong>stem</strong> the words of each sentence (using the Porter stemmer) and look for the roots of such keywords. Only the sentences that contain one of these roots are considered relevant for extracting team names.    
 - <font color="blue">Club teams:</font> English sentences follow a few grammar rules that help us determine player-club relationships. For example, a player plays for / signed with (<font color="green">VERB + PREPOSITION</font>), followed by <font color="green">DETERMINER (optional) + ADJECTIVE (optional) + PROPER NOUN</font>, which is most likely the club the player represents. We define a few of these commonly observed grammar patterns and use the <strong><i>RegexpParser</i></strong> to chunk the sentences accordingly.
 - <font color="blue">National team:</font> A regex pattern to find the word preceding the term "national team" is used to determine the player's national team.   
 - The unique national and club team names dispersed throughout the paragraph thus extracted are consolidated in a single list
 - <strong>Note:</strong> A helper function is defined to process the text to combine together named entities for club names like "Paris Saint-Germain", "Real Madrid", "Manchester United" which are usually separated upon word tokenization.

In [11]:
import string
def process_sentence(sentence):
    '''
    Helper function that preprocesses a sentence from the text and merges separated named entities

    Arguments:
        sentence -- a sentence from the text
    
    Returns:
        POS-tagged tokens in which consecutive proper nouns are merged into one token
    '''
    sent = nltk.word_tokenize(sentence) 
    pos_tags = nltk.pos_tag(sent)
    processed_pos_tags = []
    one_word = []
    #Merge consecutive proper nouns into one single proper noun
    for tag in pos_tags:
        if tag[1] =='NNP':
            one_word.append(tag[0])
        elif tag[1] in (",", ".") and len(one_word) > 0:
            processed_pos_tags.append((' '.join(one_word), 'NNP'))
            one_word.clear()
        elif tag[0] == 'and' and len(one_word) > 0:
            processed_pos_tags.append((' '.join(one_word), 'NNP'))
            one_word.clear()
        else:
            one_word.clear()
            processed_pos_tags.append(tag)
            
    #Remove unnecessary punctuations from the text
    processed_pos_tags = [w for w in processed_pos_tags if w[0] not in string.punctuation]
    return processed_pos_tags
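
The core idea of the helper, joining runs of consecutive proper nouns into one token, can be shown in isolation with `itertools.groupby`. This is a simplified sketch and not the notebook's helper: it works on hand-written tags, keeps every merged run and every other token, and the name `merge_proper_nouns` is introduced only for this example.

```python
from itertools import groupby

def merge_proper_nouns(tagged_tokens):
    """Join each run of consecutive NNP tokens into a single NNP token."""
    merged = []
    for is_nnp, run in groupby(tagged_tokens, key=lambda tok: tok[1] == "NNP"):
        run = list(run)
        if is_nnp:
            merged.append((" ".join(word for word, _ in run), "NNP"))
        else:
            merged.extend(run)
    return merged

# Hand-written tags for part of the first sentence of the first document
tagged = [("plays", "VBZ"), ("for", "IN"), ("Spanish", "JJ"), ("club", "NN"),
          ("Real", "NNP"), ("Madrid", "NNP"), ("and", "CC"), ("the", "DT"),
          ("Portugal", "NNP"), ("national", "JJ"), ("team", "NN")]
print(merge_proper_nouns(tagged))
```

Here `('Real', 'NNP'), ('Madrid', 'NNP')` becomes `('Real Madrid', 'NNP')`, so a grammar rule ending in a noun tag sees the club as one token.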

In [12]:
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

def team_of_the_player(doc):
    '''
    Determine all the teams that the player has played for, signed with or transferred to.
    Arguments:
        doc -- paragraph of text related to the player
    
    Returns:
        team of the player
    '''
    
    #Filter all the relevant lines from the text using stemmed keywords
    #(any() keeps each sentence once, even when it contains several keywords)
    keywords = ['play', 'sign', 'club', 'transfer']
    all_lines = nltk.sent_tokenize(doc)
    club_team_info_lines = [l for l in all_lines
                            if any(porter_stemmer.stem(w) in keywords for w in nltk.word_tokenize(l))]
    teams = []
    
    #Define general grammar rules observed in English when one entity performs an action associated with another entity,
    # e.g. playing for / associating with / signing for / transferred to
    grammar = r"""
      PLAYS_FOR: {<VB.*><IN><JJ>?<NN.*>+}   
          {<VB.*><IN><DT><NN><IN><JJ><NN.*>+}
          {<VB.*><IN><CC><NNS><DT><NN.*>+} 
          {<VB.*><IN><DT><NN><IN><NN.*>+}
          {<VB.*><IN><DT><JJ><NN><IN><NN.*>+}
    """
    #Finding chunks based on defined grammar rules
    chunkParser = nltk.RegexpParser(grammar)
    #Parse through all relevant sentences and create parse trees with "PLAYS_FOR" clauses
    for raw_sentence in club_team_info_lines:
        pos_tagged_sentence = process_sentence(raw_sentence)
        tree = chunkParser.parse(pos_tagged_sentence)

        #Extract all proper nouns from the clause extracted with defined grammar rule
        for subtree in tree.subtrees():
            if subtree.label() == "PLAYS_FOR": 
                clubs = [leaf[0] for leaf in subtree.leaves() if leaf[1] == 'NNP']
                teams += clubs
    
    #Extracting national team names
    #Define regex pattern to extract country name preceding national team keyword
    national_pattern = re.compile(r"\w+(?=\snational\steam)")
    national_team_info_line = [l for l in all_lines if 'national team' in l]
    national_team = national_pattern.findall(national_team_info_line[0]) 
    teams += [national_team[0] + ' national team']
    
    #Only keep the unique team names
    teams = list(set(teams))
    return teams

#Test function to extract team names for all players one by one
[{name_of_the_player(paragraph) : team_of_the_player(paragraph)} for paragraph in list_of_doc]

[{'Cristiano Ronaldo': ['Real Madrid', 'Portugal national team']},
 {'Lionel Andrés': ['FC Barcelona', 'Argentina national team']},
 {'Neymar': ['Brazil national team', 'FC Barcelona']},
 {'Ronaldo': ['Brazil national team', 'Flamengo', 'Querétaro']},
 {'Wayne Mark Rooney': ['Manchester United', 'England national team']},
 {'Zlatan Ibrahimović': ['Sweden national team',
   'Manchester United',
   'Juventus',
   'Ajax',
   'Inter Milan']},
 {'David Robert Joseph Beckham': ['LA Galaxy',
   'England national team',
   'Manchester United',
   'Milan',
   'Preston North End',
   'Paris Saint-Germain',
   'Real Madrid']},
 {'Mesut Özil': ['German national team', 'Arsenal']},
 {'Gareth Frank Bale': ['Real Madrid', 'Wales national team']},
 {'Andrés Iniesta Luján': ['Spain national team', 'FC Barcelona']}]

### 4.5. Position(s) of the player
 The relation between a player and their playing positions can be extracted by performing a **<i>lexical lookup</i>** on an already available <font color = "blue">football field positions **ontology**</font>. Field positions are a well-known part of football knowledge representation, so we search each player's description for such positional information. Some players play, or have played, in more than one position, so we capture the positions as a list.

In [13]:
def position_of_the_player(doc):
    '''
    Extract positions on the field that a given player has played at

    Arguments:
        doc -- a text paragraph from document related to a player

    Returns:
        a list of positions on the field where the player has played at
    '''   

    #Regex alternation over the football field positions to look for
    positions_pattern = r"""
    midfield[a-z]{2}|winger|forward|striker|goalkeeper|defender
    """
    #Search for these positions within the description (re.VERBOSE ignores the whitespace in the pattern)
    position_regex = re.compile(positions_pattern, re.VERBOSE)
    positions = list(set(position_regex.findall(doc)))
    return positions

#Test function to extract playing positions for all players one by one
[{name_of_the_player(paragraph) : position_of_the_player(paragraph)} for paragraph in list_of_doc]

[{'Cristiano Ronaldo': ['forward']},
 {'Lionel Andrés': ['forward']},
 {'Neymar': ['forward']},
 {'Ronaldo': ['midfielder', 'forward']},
 {'Wayne Mark Rooney': ['forward']},
 {'Zlatan Ibrahimović': ['striker']},
 {'David Robert Joseph Beckham': ['winger']},
 {'Mesut Özil': ['midfielder']},
 {'Gareth Frank Bale': ['winger', 'defender']},
 {'Andrés Iniesta Luján': ['midfielder']}]
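
A slightly stricter variant of the lexical lookup is sketched below: it anchors each term on word boundaries, ignores case, tolerates a plural "s", and returns a sorted list so that the output order is stable. The vocabulary and the names `POSITIONS` and `find_positions` are assumptions made for this example, not a complete ontology.

```python
import re

# Illustrative vocabulary; a fuller ontology would list more positions.
POSITIONS = ("goalkeeper", "defender", "midfielder", "winger", "forward", "striker")
POSITION_RE = re.compile(r"\b(" + "|".join(POSITIONS) + r")s?\b", re.IGNORECASE)

def find_positions(passage):
    """Return the distinct position terms found in the passage, sorted alphabetically."""
    return sorted({m.group(1).lower() for m in POSITION_RE.finditer(passage)})

print(find_positions("He is a forward and serves as captain for Portugal."))
# ['forward']
```

A plain lookup like this cannot tell a position from another use of the same word (for instance "forward" as an adverb), so the results still deserve a manual check.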

### 4.6. Debut year of the player

In [24]:
def debut_year(doc):
    '''
    Extract player's debut year (if available) from the given text
    
    Arguments:
        doc -- a text paragraph from document related to a player
    
    Returns:
        debut year of the player
    '''
    year = None
    #Regex pattern to extract sentence with keywords debut or started in it
    debut_sentence_pattern = re.compile(r"[^.?!]*(?<=[.?\s!])((started)|(debut))(?=[\s.?!])[^.?!]*[.?!]")
    #Regex pattern for finding year value in the sentence
    debut_year_pattern = re.compile(r"\d{4}")
    debut_sentence = re.search(string = doc, pattern = debut_sentence_pattern)
    if debut_sentence is not None:
        year = re.search(string = debut_sentence.group(), pattern = debut_year_pattern)
        if year is not None:
            year = year.group()
    return year

#Find debut year for players for whom the information is available
[{name_of_the_player(paragraph) : debut_year(paragraph)} for paragraph in list_of_doc]

[{'Cristiano Ronaldo': '2003'},
 {'Lionel Andrés': '2004'},
 {'Neymar': None},
 {'Ronaldo': None},
 {'Wayne Mark Rooney': '2002'},
 {'Zlatan Ibrahimović': '2001'},
 {'David Robert Joseph Beckham': '1992'},
 {'Mesut Özil': '2006'},
 {'Gareth Frank Bale': '2006'},
 {'Andrés Iniesta Luján': '2002'}]
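
The same two-stage idea (first find the sentence that mentions the debut, then find a year inside it) can be written with simpler patterns. The sketch below is an alternative formulation, not the notebook's regex; the sample passage is hypothetical and was written for this example, and the year pattern assumes years between 1900 and 2099.

```python
import re

SENTENCE_SPLIT = re.compile(r"(?<=[.?!])\s+")   # split after sentence-ending punctuation
DEBUT_WORD = re.compile(r"\b(debut|started)\b")
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

def first_debut_year(passage):
    """Return the first year in the first sentence mentioning 'debut' or 'started'."""
    for sentence in SENTENCE_SPLIT.split(passage):
        if DEBUT_WORD.search(sentence):
            year = YEAR.search(sentence)
            return year.group(0) if year else None
    return None

# Hypothetical passage written for this example (not taken from the dataset)
sample = ("The player was born in 1985. He made his senior debut in 2003. "
          "He moved clubs in 2009.")
print(first_debut_year(sample))  # 2003
```

As in the notebook, only the first matching sentence is inspected, so a debut sentence without a year yields `None` even if a later sentence contains one.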

# 5. Generate JSON-LD format data for knowledge representation

We use the outputs from the previous functions to generate JSON-LD output as follows.

{ "@id": "http://my-soccer-ontology.com/footballer/name_of_the_player",

    "name": "",
    "born": "",
    "country": "",
    "position": [
        { "@id": "http://my-soccer-ontology.com/position",
            "type": ""
        }
     ],
     "team": [
        { "@id": "http://my-soccer-ontology.com/team",
            "name": ""
        }   
     ]
}


<strong>Note:</strong>   
The PyLD Python library is used to generate the JSON-LD representation. It can be installed with:
    
    pip install PyLD

In [28]:
from pyld import jsonld
import json

def generate_jsonld(arg_player_name, arg_birthdate, arg_country, arg_position, arg_teams, arg_debut_year = "NA"):
    '''
    Create a JSON-LD representation of player information

    Arguments:
        arg_player_name -- name of the player
        arg_birthdate   -- player's birth date (DD Month YYYY format)
        arg_country     -- country of origin
        arg_position    -- list of field positions the player plays at
        arg_teams       -- list of national and club teams the player represents
        arg_debut_year  -- year the player debuted as per the given description - default: "NA" (not available)
        
    Returns:
        a JSON-LD representation of player information
    '''  
    
    doc = {
        "http://my-soccer-ontology.com/name": arg_player_name,
        "http://my-soccer-ontology.com/born": arg_birthdate,
        "http://my-soccer-ontology.com/country": arg_country,
        "http://my-soccer-ontology.com/debut": arg_debut_year, 
        "http://my-soccer-ontology.com/position": arg_position,
        "http://my-soccer-ontology.com/team": arg_teams 
    }

    context = {
        "name": "http://my-soccer-ontology.com/name",
        "born": "http://my-soccer-ontology.com/born",
        "country": "http://my-soccer-ontology.com/country",
        "debut": "http://my-soccer-ontology.com/debut",
        "position": {"@id": "http://my-soccer-ontology.com/position", "@type": "@id"},
        "team": {"@id": "http://my-soccer-ontology.com/team", "@type": "@id"}
    }
    
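    #Identifier of the player node, built from the player's name
    #(defined for reference; it is not attached to doc)
    player_id = {
        "@id": "http://my-soccer-ontology.com/footballer/" + arg_player_name
    }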

    # compact a document according to a particular context
    # see: http://json-ld.org/spec/latest/json-ld/#compacted-document-form
    compacted = jsonld.compact(doc, context)
    #retrieve an expanded JSON-LD document if required
    #expanded = jsonld.expand(compacted)
    return compacted

In [25]:
#Create JSON-LD with newly created relation above, player debut year
#Create the JSON-LD for a particular player, using functions defined earlier
player_paragraph = 6
player_name = name_of_the_player(list_of_doc[player_paragraph])
player_country = country_of_origin(list_of_doc[player_paragraph])
player_birthday = date_of_birth(list_of_doc[player_paragraph])
player_debut_year = debut_year(list_of_doc[player_paragraph])
player_positions = position_of_the_player(list_of_doc[player_paragraph])
player_teams = team_of_the_player(list_of_doc[player_paragraph])
player_json_ld = generate_jsonld(player_name, player_birthday, 
                                 player_country, player_positions, player_teams, player_debut_year)
print(json.dumps(player_json_ld, indent=2))

{
  "@context": {
    "name": "http://my-soccer-ontology.com/name",
    "born": "http://my-soccer-ontology.com/born",
    "country": "http://my-soccer-ontology.com/country",
    "debut": "http://my-soccer-ontology.com/debut",
    "position": {
      "@id": "http://my-soccer-ontology.com/position",
      "@type": "@id"
    },
    "team": {
      "@id": "http://my-soccer-ontology.com/team",
      "@type": "@id"
    }
  },
  "born": "2 May 1975",
  "country": "England",
  "debut": "1992",
  "name": "David Robert Joseph Beckham",
  "http://my-soccer-ontology.com/position": "winger",
  "http://my-soccer-ontology.com/team": [
    "LA Galaxy",
    "England national team",
    "Manchester United",
    "Milan",
    "Preston North End",
    "Paris Saint-Germain",
    "Real Madrid"
  ]
}
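
The printed result has no `@id` for the player and keeps `position` and `team` as plain strings. The following sketch builds a node shaped like the template at the start of this section using only the standard library, without PyLD. The function `build_player_node`, the percent-encoding of the name, and the inclusion of a `debut` key are assumptions made for this example; the values are taken from the output above, with a shortened team list.

```python
import json
from urllib.parse import quote

BASE = "http://my-soccer-ontology.com"

def build_player_node(name, born, country, positions, teams, debut=None):
    """Build a plain-dict node shaped like the JSON-LD template in this section."""
    return {
        # Player identifier derived from the name, with spaces percent-encoded
        "@id": f"{BASE}/footballer/{quote(name)}",
        "name": name,
        "born": born,
        "country": country,
        "debut": debut,
        "position": [{"@id": f"{BASE}/position", "type": p} for p in positions],
        "team": [{"@id": f"{BASE}/team", "name": t} for t in teams],
    }

node = build_player_node("David Robert Joseph Beckham", "2 May 1975", "England",
                         ["winger"], ["Manchester United", "Real Madrid"], "1992")
print(json.dumps(node, indent=2))
```

Without a `@context`, the short keys are not mapped to IRIs, so a context like the one in `generate_jsonld` would still be needed for the result to carry linked-data meaning.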


