# <center>Supporting data modelling through NLP</center>                         

## Knowledge Modelling and Artificial Intelligence

Data models and data catalogues support AI applications by pointing to the data those applications need.

Likewise, AI applications in the field of Natural Language Processing (NLP) can support the process of data modelling. One area that benefits from applying NLP to unstructured data is knowledge graphs, also called ontologies. They represent a set of concepts within a domain and the relationships among those concepts. Ontologies can boost AI applications because they contain a formal expression of knowledge and can thus steer reasoning, e.g. in a chatbot.

It is often difficult to find a starting point and get a complete picture of what to model in an ontology. Some approaches to this problem, such as posing competency questions, rarely work in the corporate world. NLP can provide a starting point by taking various texts from the business, extracting their topics and potentially revealing connections that have not been thought of before (e.g. through word clouds). In that sense, NLP speeds up the ontology creation process in an environment where a lot of written documentation is present but overview and structure are lacking. We find this to be the case for many companies, where documentation about processes etc. is spread across Confluence, Jira and other tools.

In this notebook, we give you an overview of the necessary steps to extract knowledge from unstructured text and pinpoint where support from data managers is needed.

![title](archi1.PNG)



The architecture below summarizes the steps that are applied to text data in order to generate ontology structures. Note that while a great deal can be automated using NLP pipelines, human experts still need to be involved in evaluating the intermediary outputs of the NLP components and in formalizing the ontologies.

Open-source communities have put a great deal of work into NLP engines. While there are a few tools that can complete the process steps below, we decided to use Stanford CoreNLP, mainly because of its ability to generate triples, a common building block of ontology creation.

![title](archi.PNG)

## Data Preparation

Data preparation in this context means extracting data from the web and bringing it into shape for consumption by the Stanford CoreNLP engine. We do not go into much detail with regard to data cleaning, since this depends heavily on the underlying text data. Instead, we give an overview of how to properly set up the various components of Stanford CoreNLP.

### Set up and use Stanford CoreNLP Server with Python 

- To download Stanford CoreNLP, go to https://stanfordnlp.github.io/CoreNLP/index.html#download and click on “Download CoreNLP”. The latest version of Stanford CoreNLP at the time of writing is v3.9.2.<br>

- Once the download has completed, unzip the file, for example with: unzip stanford-corenlp-full-2017-06-09.zip (the archive name depends on the version you downloaded) <br>

- Install Java 8 (if not installed) <br>

- Run the Stanford CoreNLP server. With the environment ready, go to the directory of the unzipped Stanford CoreNLP and execute the command below: <br>

 <font color='grey'> java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize,ssplit,pos,lemma,parse,sentiment" -port 9000 -timeout 30000 </font> 
 
 For more information, see e.g. https://www.khalidalnajjar.com/setup-use-stanford-corenlp-server-python/
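
If the server is to be started from Python rather than from a shell, the command above can be assembled as an argument list. This is only a minimal sketch: the helper name `build_server_command` and its parameter names are our own, and the directory in the commented-out call is a placeholder.

```python
def build_server_command(port=9000, timeout_ms=30000, memory="4g",
                         annotators="tokenize,ssplit,pos,lemma,parse,sentiment"):
    """Return the argument list for the server start command shown above."""
    return [
        "java", f"-mx{memory}", "-cp", "*",
        "edu.stanford.nlp.pipeline.StanfordCoreNLPServer",
        "-annotators", annotators,
        "-port", str(port),
        "-timeout", str(timeout_ms),
    ]

command = build_server_command()
print(" ".join(command))

# To actually start the server, run the command from the unzipped CoreNLP
# directory (the path below is a placeholder, not a real location):
# import subprocess
# server = subprocess.Popen(command, cwd="/path/to/stanford-corenlp")
```

Passing the arguments as a list avoids shell quoting issues; the `*` classpath wildcard is then handed to Java unchanged.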

In [1]:
# import all libraries (install them and their dependencies beforehand)

from collections import defaultdict  # needed by StanfordNLP.tokens_to_dict
from stanfordcorenlp import StanfordCoreNLP
import logging
import json
import pandas as pd
from nltk.parse.stanford import StanfordDependencyParser
import os
import numpy as np
from graphviz import Source
from nltk.tree import *


The following class StanfordNLP wraps the syntactic and semantic natural language processing tasks that are used to annotate text corpora.

In [2]:

class StanfordNLP:
    def __init__(self, host='http://localhost', port=9000):
        self.nlp = StanfordCoreNLP(host, port=port,
                                   timeout=30000)  # , quiet=False, logging_level=logging.DEBUG)
        self.props = {
            'annotators': 'tokenize,ssplit,pos,lemma,ner,parse,depparse,dcoref,relation',
            'pipelineLanguage': 'en',
            'outputFormat': 'json'
        }

    def word_tokenize(self, sentence):
        return self.nlp.word_tokenize(sentence)

    def pos(self, sentence):
        return self.nlp.pos_tag(sentence)

    def ner(self, sentence):
        return self.nlp.ner(sentence)

    def parse(self, sentence):
        return self.nlp.parse(sentence)

    def dependency_parse(self, sentence):
        return self.nlp.dependency_parse(sentence)

    def annotate(self, sentence):
        return json.loads(self.nlp.annotate(sentence, properties=self.props))
    

    @staticmethod
    def tokens_to_dict(_tokens):
        tokens = defaultdict(dict)
        for token in _tokens:
            tokens[int(token['index'])] = {
                'word': token['word'],
                'lemma': token['lemma'],
                'pos': token['pos'],
                'ner': token['ner']
            }
        return tokens

In [3]:
if __name__ == '__main__':
    sNLP = StanfordNLP()
    text= "Rooney joined the Everton youth team at the age of 9 and made his professional debut for the club in 2002 at the age of 16."    
    


### Semantic Annotation


Any object, such as a text fragment, can be examined by a multitude of annotators, each yielding a different output that might be beneficial for the modelling task. The Stanford CoreNLP documentation (https://stanfordnlp.github.io/CoreNLP/annotators.html) lists all available annotators and explains how to build custom ones. Throughout this document, we highlight the ones that help the modelling process. Which annotator yields the best results depends on the texts.

The following steps require tinkering with the data, so some of them might need to be done again iteratively, done differently or omitted. 

#### Word Tokenization and Sentence Splitting

Word tokenization is the first step of a natural language processing task: it splits a sentence into a sequence of tokens. It is an important preliminary activity for semantic annotation, as the number of tokens produced influences the granularity of the modelling. For example, one could consider removing stopwords as well, but this needs care: omitting too many may distort the descriptions of entities or relationships.

Sentence splitting is often used in combination with tokenization. A missing link between sentences can be repaired through coreference resolution. If many sentences start with "he" or "she", these pronouns should be replaced by the entity found in the previous sentence(s).

In [7]:
print ("Tokens:", sNLP.word_tokenize(text))

Tokens: ['Rooney', 'joined', 'the', 'Everton', 'youth', 'team', 'at', 'the', 'age', 'of', '9', 'and', 'made', 'his', 'professional', 'debut', 'for', 'the', 'club', 'in', '2002', 'at', 'the', 'age', 'of', '16', '.']
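
To illustrate the caveat about stopword removal, the following sketch filters the token list printed above. The stopword set is hand-picked for this single sentence and is purely an assumption; it is not part of Stanford CoreNLP.

```python
# Token list copied from the output above.
tokens = ['Rooney', 'joined', 'the', 'Everton', 'youth', 'team', 'at', 'the',
          'age', 'of', '9', 'and', 'made', 'his', 'professional', 'debut',
          'for', 'the', 'club', 'in', '2002', 'at', 'the', 'age', 'of', '16', '.']

# Hand-picked stopwords for this one sentence (an assumption for illustration).
STOPWORDS = {"the", "at", "of", "and", "his", "for", "in"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

content_tokens = remove_stopwords(tokens)
print(content_tokens)
```

The filtered list is shorter, but the prepositions that tied "age" to "9" and "club" to "2002" are gone, which is exactly the kind of information loss to watch out for.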


#### Part of speech

Part-of-speech tagging annotates each token with its part of speech, such as noun, verb or adjective, based on its context and definition. This may already give a good approximation of the subject-predicate-object triple structure required for ontology modelling, but it lacks insight into the dependencies between the tokens.

In [8]:
print ("POS:", sNLP.pos(text))
   

POS: [('Rooney', 'NNP'), ('joined', 'VBD'), ('the', 'DT'), ('Everton', 'NNP'), ('youth', 'NN'), ('team', 'NN'), ('at', 'IN'), ('the', 'DT'), ('age', 'NN'), ('of', 'IN'), ('9', 'CD'), ('and', 'CC'), ('made', 'VBD'), ('his', 'PRP$'), ('professional', 'JJ'), ('debut', 'NN'), ('for', 'IN'), ('the', 'DT'), ('club', 'NN'), ('in', 'IN'), ('2002', 'CD'), ('at', 'IN'), ('the', 'DT'), ('age', 'NN'), ('of', 'IN'), ('16', 'CD'), ('.', '.')]
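
As a rough first approximation of the triple structure, the tagged tokens above can be split into noun candidates (possible subjects and objects) and verb candidates (possible predicates). This is a sketch under the assumption that the Penn Treebank tag prefixes `NN` and `VB` are sufficient; the function name is our own.

```python
# Tagged tokens copied from the output above.
pos_tags = [('Rooney', 'NNP'), ('joined', 'VBD'), ('the', 'DT'), ('Everton', 'NNP'),
            ('youth', 'NN'), ('team', 'NN'), ('at', 'IN'), ('the', 'DT'), ('age', 'NN'),
            ('of', 'IN'), ('9', 'CD'), ('and', 'CC'), ('made', 'VBD'), ('his', 'PRP$'),
            ('professional', 'JJ'), ('debut', 'NN'), ('for', 'IN'), ('the', 'DT'),
            ('club', 'NN'), ('in', 'IN'), ('2002', 'CD'), ('at', 'IN'), ('the', 'DT'),
            ('age', 'NN'), ('of', 'IN'), ('16', 'CD'), ('.', '.')]

def candidates_from_pos(tagged):
    """Split tagged tokens into noun and verb candidates by tag prefix."""
    nouns = [word for word, tag in tagged if tag.startswith("NN")]
    verbs = [word for word, tag in tagged if tag.startswith("VB")]
    return nouns, verbs

nouns, verbs = candidates_from_pos(pos_tags)
print("Noun candidates:", nouns)
print("Verb candidates:", verbs)
```

The lists say nothing about which noun belongs to which verb; that is the gap the parsing steps further down address.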


#### Named Entity Recognition


Named Entity Recognition (NER) is the task of finding entities in a text, such as persons, locations, organizations and countries. It does so by cross-checking values against a controlled vocabulary and looking at the token structure. It is a very valuable functionality, because the entity labels already act as metadata. This makes it easier to model triples in the form subject - predicate - object. The first part of the sentence below can consequently be modelled as the triple 'Person'--(joins)-->'Organization'. 

In [9]:
print ("NER:", sNLP.ner(text))

NER: [('Rooney', 'PERSON'), ('joined', 'O'), ('the', 'O'), ('Everton', 'ORGANIZATION'), ('youth', 'O'), ('team', 'O'), ('at', 'O'), ('the', 'O'), ('age', 'O'), ('of', 'O'), ('9', 'NUMBER'), ('and', 'O'), ('made', 'O'), ('his', 'O'), ('professional', 'O'), ('debut', 'O'), ('for', 'O'), ('the', 'O'), ('club', 'O'), ('in', 'O'), ('2002', 'DATE'), ('at', 'O'), ('the', 'O'), ('age', 'O'), ('of', 'O'), ('16', 'NUMBER'), ('.', 'O')]
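
The labelled tokens above can be condensed into a list of entities with their types, which gives the class-level view needed for a triple such as 'Person'--(joins)-->'Organization'. The helper below is our own minimal sketch; it merges adjacent tokens that share a label, which is a simplification (two distinct neighbouring entities of the same type would be merged as well).

```python
from itertools import groupby

# Labelled tokens copied from the output above.
ner_tags = [('Rooney', 'PERSON'), ('joined', 'O'), ('the', 'O'), ('Everton', 'ORGANIZATION'),
            ('youth', 'O'), ('team', 'O'), ('at', 'O'), ('the', 'O'), ('age', 'O'),
            ('of', 'O'), ('9', 'NUMBER'), ('and', 'O'), ('made', 'O'), ('his', 'O'),
            ('professional', 'O'), ('debut', 'O'), ('for', 'O'), ('the', 'O'),
            ('club', 'O'), ('in', 'O'), ('2002', 'DATE'), ('at', 'O'), ('the', 'O'),
            ('age', 'O'), ('of', 'O'), ('16', 'NUMBER'), ('.', 'O')]

def extract_entities(tagged):
    """Merge adjacent tokens sharing a non-'O' label into one entity."""
    entities = []
    for label, group in groupby(tagged, key=lambda pair: pair[1]):
        if label != "O":
            entities.append((" ".join(word for word, _ in group), label))
    return entities

entities = extract_entities(ner_tags)
print(entities)
```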


#### Shallow parsing and dependency parsing

Understanding the topics and the structure of the text is instrumental for deducing the metadata and for the modelling aspect of the task. Shallow parsing and dependency parsing in combination show how the part-of-speech tags relate to each other. This helps to identify the type of relationships between the entities.

In [8]:

print ("Parse:", sNLP.parse(text))

Parse: (ROOT
  (S
    (NP (NNP Rooney))
    (VP
      (VP (VBD joined)
        (NP (DT the) (NNP Everton) (NN youth) (NN team))
        (PP (IN at)
          (NP
            (NP (DT the) (NN age))
            (PP (IN of)
              (NP (CD 9))))))
      (CC and)
      (VP (VBD made)
        (NP (PRP$ his) (JJ professional) (NN debut))
        (PP (IN for)
          (NP
            (NP (DT the) (NN club))
            (PP (IN in)
              (NP (CD 2002)))))
        (PP (IN at)
          (NP
            (NP (DT the) (NN age))
            (PP (IN of)
              (NP (CD 16)))))))
    (. .)))


In [10]:
print ("Dep Parse:", sNLP.dependency_parse(text))
   

Dep Parse: [('ROOT', 0, 2), ('nsubj', 2, 1), ('det', 6, 3), ('compound', 6, 4), ('compound', 6, 5), ('dobj', 2, 6), ('case', 9, 7), ('det', 9, 8), ('nmod', 2, 9), ('case', 11, 10), ('nmod', 9, 11), ('cc', 2, 12), ('conj', 2, 13), ('nmod:poss', 16, 14), ('amod', 16, 15), ('dobj', 13, 16), ('case', 19, 17), ('det', 19, 18), ('nmod', 13, 19), ('case', 21, 20), ('nmod', 13, 21), ('case', 24, 22), ('det', 24, 23), ('nmod', 13, 24), ('case', 26, 25), ('nmod', 24, 26), ('punct', 2, 27)]
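
The dependency output is easier to read once the indices are replaced by words. The sketch below assumes, based on the output above, that each tuple is (relation, governor index, dependent index) with 1-based token positions and 0 standing for the root; the helper name is our own.

```python
# Tokens and dependency tuples copied from the outputs above.
tokens = ['Rooney', 'joined', 'the', 'Everton', 'youth', 'team', 'at', 'the',
          'age', 'of', '9', 'and', 'made', 'his', 'professional', 'debut',
          'for', 'the', 'club', 'in', '2002', 'at', 'the', 'age', 'of', '16', '.']
dependencies = [('ROOT', 0, 2), ('nsubj', 2, 1), ('det', 6, 3), ('compound', 6, 4),
                ('compound', 6, 5), ('dobj', 2, 6), ('case', 9, 7), ('det', 9, 8),
                ('nmod', 2, 9), ('case', 11, 10), ('nmod', 9, 11), ('cc', 2, 12),
                ('conj', 2, 13), ('nmod:poss', 16, 14), ('amod', 16, 15), ('dobj', 13, 16),
                ('case', 19, 17), ('det', 19, 18), ('nmod', 13, 19), ('case', 21, 20),
                ('nmod', 13, 21), ('case', 24, 22), ('det', 24, 23), ('nmod', 13, 24),
                ('case', 26, 25), ('nmod', 24, 26), ('punct', 2, 27)]

def resolve_dependencies(tokens, dependencies):
    """Replace 1-based token indices by words; index 0 denotes the root."""
    words = ["ROOT"] + tokens
    return [(rel, words[gov], words[dep]) for rel, gov, dep in dependencies]

resolved = resolve_dependencies(tokens, dependencies)

# Subjects and direct objects are the closest match to triple building blocks.
core = [triple for triple in resolved if triple[0] in {"nsubj", "dobj"}]
print(core)
```

Note that "made" has a direct object but no subject of its own here; it is linked to "joined" through the `conj` relation, so the shared subject has to be carried over when triples are modelled.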


### Ontology Creation

#### Relation extraction using OPENIE CoreNLP

OpenIE is a built-in component of the Stanford CoreNLP pipeline. It generates triples in the form s-p-o (subject-predicate-object). This retrieves (multiple) individuals of a to-be ontology. The "data model", i.e. the ontology itself, can be aided by extracting the concepts involved, e.g. through named entity recognition, but expert knowledge is also required to decide what is important to keep and what is not.

In [11]:
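# pycorenlp also exposes a class named StanfordCoreNLP; importing it here
# replaces the stanfordcorenlp class imported at the top of the notebook.
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost:9000/")
print(text)
print("\n")

# Request OpenIE triples from the running server as JSON.
output = nlp.annotate(text, properties={"annotators": "tokenize,ssplit,pos,depparse,natlog,openie",
                                        "outputFormat": "json", "triple.strict": "true"})
# Collect the OpenIE results of every sentence, not only the first one.
result = [sentence["openie"] for sentence in output["sentences"]]
# print(result)
for sentence_triples in result:
    for rel in sentence_triples:
        relationSent = rel['subject'], rel['relation'], rel['object']
        print(relationSent)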

Rooney joined the Everton youth team at the age of 9 and made his professional debut for the club in 2002 at the age of 16.


('Rooney', 'joined Everton youth team at', 'age of 9')
('Rooney', 'made', 'his professional debut')
('Rooney', 'made', 'his debut')
('Rooney', 'joined Everton youth team at', 'age')
('Rooney', 'joined', 'Everton youth team')
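
Several of the triples above are shortened variants of one another. Before handing results to an expert, such variants can be filtered with a simple heuristic: drop a triple if another triple with the same subject and relation has an object that contains all of its words in the same order. This is our own illustrative sketch, not a CoreNLP feature.

```python
# Triples copied from the output above.
triples = [
    ('Rooney', 'joined Everton youth team at', 'age of 9'),
    ('Rooney', 'made', 'his professional debut'),
    ('Rooney', 'made', 'his debut'),
    ('Rooney', 'joined Everton youth team at', 'age'),
    ('Rooney', 'joined', 'Everton youth team'),
]

def is_contained(shorter, longer):
    """True if all words of `shorter` appear in `longer` in the same order."""
    remaining = iter(longer.split())
    return all(word in remaining for word in shorter.split())

def drop_redundant(triples):
    """Keep only triples whose object is not a shortened form of another one."""
    kept = []
    for s, p, o in triples:
        redundant = any(
            (s, p) == (s2, p2) and o != o2 and is_contained(o, o2)
            for s2, p2, o2 in triples
        )
        if not redundant:
            kept.append((s, p, o))
    return kept

for triple in drop_redundant(triples):
    print(triple)
```

Whether the most specific variant is really the one worth keeping remains a modelling decision, so the filter only reduces the review workload.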


The work of experts can be facilitated through software. For example, when presented with different s-p-o results, they can decide whether the information contained is important. After that, the annotated entities and relationships of the remaining triples can be collected and modelled in an ontology. 

![title](Interface.JPG)
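
Once experts have accepted a set of triples, these can be written out in a format that ontology tools can read. The sketch below emits Turtle text using only the standard library. The accepted triples are taken from the OpenIE output above, while the selection itself, the `ex` prefix, the `http://example.org/` namespace and the helper names are assumptions made for illustration.

```python
import re

# Triples an expert might have accepted (hypothetical selection).
accepted = [
    ("Rooney", "joined", "Everton youth team"),
    ("Rooney", "made", "his professional debut"),
]

def local_name(text):
    """Turn free text into a safe local name by replacing non-word runs with '_'."""
    return re.sub(r"\W+", "_", text.strip()).strip("_")

def to_turtle(triples, prefix="ex", namespace="http://example.org/"):
    """Serialise s-p-o triples as Turtle statements under one namespace."""
    lines = [f"@prefix {prefix}: <{namespace}> .", ""]
    for s, p, o in triples:
        lines.append(
            f"{prefix}:{local_name(s)} {prefix}:{local_name(p)} {prefix}:{local_name(o)} ."
        )
    return "\n".join(lines)

print(to_turtle(accepted))
```

The result describes individuals only; classes such as Person or Organization, for example derived from the NER labels, would still have to be added by the modeller.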