# <center>Tutorial for Ontology Creation From Unstructured Data using CoreNLP</center>                         

## <center>  <font color='red'>  Introduction to the Ontology Creation </font></center>

### Introduction

What is an ontology? It is a data model that represents a set of concepts within a domain and the relationships among those concepts. In this tutorial, the ontology is built by converting unstructured text into structured data using Natural Language Processing (NLP). NLP is a branch of artificial intelligence that deals with the interaction between computers and humans through natural language. It powers tasks such as automatic summarization, sentiment analysis, speech recognition (Siri, Google Assistant, Alexa), creditworthiness assessment in banking systems, chatbots and many other real-world applications. For example, NLP can find the names of people and companies in free text and link them to public records or a company directory. (Jackson and I., 2007)

![title](archi1.PNG)



![title](archi.PNG)

## <center>  <font color='red'> Coding Part </font></center>

### 1. Data Preparation





## Step 1: Install the following packages

<br>

### Setup and use Stanford CoreNLP Server with Python 
<br>
- To download Stanford CoreNLP, go to https://stanfordnlp.github.io/CoreNLP/index.html#download and click on “Download CoreNLP”. The latest version of Stanford CoreNLP at the time of writing is v3.9.2.<br>

- Once the download has completed, unzip the file using the following command (replace the file name with that of the archive you downloaded): `unzip stanford-corenlp-full-2017-06-09.zip` <br>

- Install Java 8 if it is not already installed. <br>

- Run the Stanford CoreNLP Server: with the environment ready, go to the directory of the unzipped Stanford CoreNLP and execute the command below: <br>

 <font color='red'> java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize,ssplit,pos,lemma,parse,sentiment" -port 9000 -timeout 30000 </font> 
 
 https://www.khalidalnajjar.com/setup-use-stanford-corenlp-server-python/
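
Before moving on to a client package, it can help to see what a request to the server looks like. The following is a minimal sketch using only the Python standard library, assuming the server started above is listening on `http://localhost:9000`. The helper names `build_corenlp_url` and `annotate_raw` are hypothetical and introduced only for illustration. The server reads the pipeline configuration from a `properties` query parameter and the text from the POST body.

```python
import json
import urllib.parse
import urllib.request


def build_corenlp_url(host="http://localhost", port=9000, annotators="tokenize,ssplit,pos"):
    """Build the server URL with the pipeline properties encoded in the query string."""
    props = {"annotators": annotators, "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    return f"{host}:{port}/?{query}"


def annotate_raw(text, url, timeout=30):
    """POST raw text to the server and return the decoded JSON response."""
    request = urllib.request.Request(url, data=text.encode("utf-8"))
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_corenlp_url()
print(url)
# With the server from the previous step running, this call would return the annotations:
# result = annotate_raw("Rooney joined the Everton youth team.", url)
```

The network call is left commented out so that the snippet also runs when no server is available.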

## Step 2: Accessing Stanford CoreNLP Server using Python

In [1]:

from stanfordcorenlp import StanfordCoreNLP  # pip install stanfordcorenlp
# This code uses the Python package "stanfordcorenlp".
# Below is sample code for accessing the server and analysing some text.
import logging
import json
from collections import defaultdict  # needed by StanfordNLP.tokens_to_dict
import pandas as pd
# from nltk.parse.stanford import StanfordDependencyParser
import os
import numpy as np
from graphviz import Source
from nltk.tree import Tree  # used to pretty-print the constituency parse

###  Natural Language Processing Task

The following points describe syntactic and semantic natural language processing tasks that
are used for the annotation of text corpora with relevant features.

In [2]:

class StanfordNLP:
    def __init__(self, host='http://localhost', port=9000):
        self.nlp = StanfordCoreNLP(host, port=port,
                                   timeout=30000)  # , quiet=False, logging_level=logging.DEBUG)
        self.props = {
            'annotators': 'tokenize,ssplit,pos,lemma,ner,parse,depparse,dcoref,relation',
            'pipelineLanguage': 'en',
            'outputFormat': 'json'
        }

    def word_tokenize(self, sentence):
        return self.nlp.word_tokenize(sentence)

    def pos(self, sentence):
        return self.nlp.pos_tag(sentence)

    def ner(self, sentence):
        return self.nlp.ner(sentence)

    def parse(self, sentence):
        return self.nlp.parse(sentence)

    def dependency_parse(self, sentence):
        return self.nlp.dependency_parse(sentence)

    def annotate(self, sentence):
        return json.loads(self.nlp.annotate(sentence, properties=self.props))
    

    @staticmethod
    def tokens_to_dict(_tokens):
        tokens = defaultdict(dict)
        for token in _tokens:
            tokens[int(token['index'])] = {
                'word': token['word'],
                'lemma': token['lemma'],
                'pos': token['pos'],
                'ner': token['ner']
            }
        return tokens
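
The `tokens_to_dict` helper is not called anywhere else in this tutorial, so a short usage sketch may be useful. It uses a standalone copy of the function and a hand-written sample (an assumption, not real server output) that mimics the token fields the helper reads: `index`, `word`, `lemma`, `pos` and `ner`.

```python
from collections import defaultdict


def tokens_to_dict(_tokens):
    """Standalone copy of StanfordNLP.tokens_to_dict, for illustration only."""
    tokens = defaultdict(dict)
    for token in _tokens:
        tokens[int(token['index'])] = {
            'word': token['word'],
            'lemma': token['lemma'],
            'pos': token['pos'],
            'ner': token['ner']
        }
    return tokens


# Hand-written sample (assumed shape of a sentence's token list in the JSON output)
sample_tokens = [
    {'index': 1, 'word': 'Rooney', 'lemma': 'Rooney', 'pos': 'NNP', 'ner': 'PERSON'},
    {'index': 2, 'word': 'joined', 'lemma': 'join', 'pos': 'VBD', 'ner': 'O'},
]

by_index = tokens_to_dict(sample_tokens)
print(by_index[1]['ner'])    # PERSON
print(by_index[2]['lemma'])  # join
```

Indexing tokens by their 1-based position makes it easy to look up the words that a dependency relation points to.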

## Step 3: Calling Stanford Function

In [3]:
if __name__ == '__main__':
    sNLP = StanfordNLP()
    text= "Rooney joined the Everton youth team at the age of 9 and made his professional debut for the club in 2002 at the age of 16."    
    


### Tokenization and sentence splitting

Tokenization is the first step of a natural language processing pipeline: it splits a sentence
into a sequence of tokens. Sentence splitting, which divides the text into sentences, is usually
performed together with tokenization.


### i) Annotation

In [4]:
print ("Annotate:", sNLP.annotate(text))

Annotate: {'sentences': [{'index': 0, 'parse': '(ROOT\r\n  (S\r\n    (NP (NNP Rooney))\r\n    (VP\r\n      (VP (VBD joined)\r\n        (NP (DT the) (NNP Everton) (NN youth) (NN team))\r\n        (PP (IN at)\r\n          (NP\r\n            (NP (DT the) (NN age))\r\n            (PP (IN of)\r\n              (NP (CD 9))))))\r\n      (CC and)\r\n      (VP (VBD made)\r\n        (NP (PRP$ his) (JJ professional) (NN debut))\r\n        (PP (IN for)\r\n          (NP\r\n            (NP (DT the) (NN club))\r\n            (PP (IN in)\r\n              (NP (CD 2002)))))\r\n        (PP (IN at)\r\n          (NP\r\n            (NP (DT the) (NN age))\r\n            (PP (IN of)\r\n              (NP (CD 16)))))))\r\n    (. .)))', 'basicDependencies': [{'dep': 'ROOT', 'governor': 0, 'governorGloss': 'ROOT', 'dependent': 2, 'dependentGloss': 'joined'}, {'dep': 'nsubj', 'governor': 2, 'governorGloss': 'joined', 'dependent': 1, 'dependentGloss': 'Rooney'}, {'dep': 'det', 'governor': 6, 'governorGloss': 'team',

### ii) Part of speech tagging

Part-of-speech tagging annotates each token with its part of speech, such as noun, verb or adjective, based on its definition and context.

In [5]:
print ("POS:", sNLP.pos(text))
   

POS: [('Rooney', 'NNP'), ('joined', 'VBD'), ('the', 'DT'), ('Everton', 'NNP'), ('youth', 'NN'), ('team', 'NN'), ('at', 'IN'), ('the', 'DT'), ('age', 'NN'), ('of', 'IN'), ('9', 'CD'), ('and', 'CC'), ('made', 'VBD'), ('his', 'PRP$'), ('professional', 'JJ'), ('debut', 'NN'), ('for', 'IN'), ('the', 'DT'), ('club', 'NN'), ('in', 'IN'), ('2002', 'CD'), ('at', 'IN'), ('the', 'DT'), ('age', 'NN'), ('of', 'IN'), ('16', 'CD'), ('.', '.')]
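
For ontology building, nouns are natural candidates for concepts and verbs for relations. The sketch below shows one way to filter the tagger output by tag prefix. The list is copied from the first tokens of the output above, and the helper name `words_with_prefix` is hypothetical.

```python
# First six (word, tag) pairs from the POS output above
pos_tags = [('Rooney', 'NNP'), ('joined', 'VBD'), ('the', 'DT'),
            ('Everton', 'NNP'), ('youth', 'NN'), ('team', 'NN')]


def words_with_prefix(tagged, prefix):
    """Return the words whose Penn Treebank tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


nouns = words_with_prefix(pos_tags, 'NN')  # NN, NNS, NNP, NNPS
verbs = words_with_prefix(pos_tags, 'VB')  # VB, VBD, VBG, ...
print(nouns)  # ['Rooney', 'Everton', 'youth', 'team']
print(verbs)  # ['joined']
```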


### iii) Tokens

In [6]:
print ("Tokens:", sNLP.word_tokenize(text))
  

Tokens: ['Rooney', 'joined', 'the', 'Everton', 'youth', 'team', 'at', 'the', 'age', 'of', '9', 'and', 'made', 'his', 'professional', 'debut', 'for', 'the', 'club', 'in', '2002', 'at', 'the', 'age', 'of', '16', '.']


### iii) Named Entity Recognition


Named Entity Recognition (NER) is the task of finding entities in a text, such as persons,
locations, organizations and countries.

In [7]:
print ("NER:", sNLP.ner(text))

NER: [('Rooney', 'PERSON'), ('joined', 'O'), ('the', 'O'), ('Everton', 'ORGANIZATION'), ('youth', 'O'), ('team', 'O'), ('at', 'O'), ('the', 'O'), ('age', 'O'), ('of', 'O'), ('9', 'NUMBER'), ('and', 'O'), ('made', 'O'), ('his', 'O'), ('professional', 'O'), ('debut', 'O'), ('for', 'O'), ('the', 'O'), ('club', 'O'), ('in', 'O'), ('2002', 'DATE'), ('at', 'O'), ('the', 'O'), ('age', 'O'), ('of', 'O'), ('16', 'NUMBER'), ('.', 'O')]
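
The tagger labels tokens one by one, so an entity that spans several tokens comes back as a run of identical labels. A minimal sketch of merging such runs into entity spans is shown below. The first list is a subset of the output above, while the `Wayne Rooney` example is hand-written to show the merging. The function name `entity_spans` is hypothetical, and two distinct adjacent entities with the same label would also be merged by this simple approach.

```python
from itertools import groupby

# Subset of the (word, label) pairs from the NER output above
ner_tags = [('Rooney', 'PERSON'), ('joined', 'O'), ('the', 'O'),
            ('Everton', 'ORGANIZATION'), ('youth', 'O'), ('team', 'O')]


def entity_spans(tagged):
    """Merge consecutive tokens that share a non-'O' label into (text, label) spans."""
    spans = []
    for label, group in groupby(tagged, key=lambda pair: pair[1]):
        if label != 'O':
            spans.append((" ".join(word for word, _ in group), label))
    return spans


print(entity_spans(ner_tags))
# [('Rooney', 'PERSON'), ('Everton', 'ORGANIZATION')]

# Hand-written example with a two-token entity
print(entity_spans([('Wayne', 'PERSON'), ('Rooney', 'PERSON'), ('scored', 'O')]))
# [('Wayne Rooney', 'PERSON')]
```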


### iv) Constituency Parser / Shallow parsing

Constituency parsing assigns syntactic structure to the tokens: it groups the constituent
parts of a sentence (nouns, verbs, adjectives) into higher-order units that carry more
grammatical meaning, such as noun phrases, prepositional phrases and verb phrases.
Shallow parsing identifies these phrases without building the full parse tree.


In [8]:

print ("Parse:", sNLP.parse(text))

Parse: (ROOT
  (S
    (NP (NNP Rooney))
    (VP
      (VP (VBD joined)
        (NP (DT the) (NNP Everton) (NN youth) (NN team))
        (PP (IN at)
          (NP
            (NP (DT the) (NN age))
            (PP (IN of)
              (NP (CD 9))))))
      (CC and)
      (VP (VBD made)
        (NP (PRP$ his) (JJ professional) (NN debut))
        (PP (IN for)
          (NP
            (NP (DT the) (NN club))
            (PP (IN in)
              (NP (CD 2002)))))
        (PP (IN at)
          (NP
            (NP (DT the) (NN age))
            (PP (IN of)
              (NP (CD 16)))))))
    (. .)))
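
The bracketed string can also be processed programmatically, for example to pull out noun phrases as candidate concepts. The following is a minimal sketch that uses only the standard library. The functions `parse_bracketed`, `leaves` and `phrases` are hypothetical helpers written for this illustration, and the input is a shortened version of the parse above.

```python
import re


def parse_bracketed(s):
    """Parse a Penn-Treebank-style bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    stack = [("TOP", [])]  # artificial top node that collects the real root
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            # The token right after "(" is the node label
            stack.append((tokens[i + 1], []))
            i += 2
        elif tok == ")":
            node = stack.pop()
            stack[-1][1].append(node)
            i += 1
        else:
            stack[-1][1].append(tok)  # a leaf word
            i += 1
    return stack[0][1][0]


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1] for word in leaves(child)]


def phrases(node, label):
    """Yield the text of every subtree with the given label, in pre-order."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield " ".join(leaves(node))
    for child in node[1]:
        yield from phrases(child, label)


short_parse = "(ROOT (S (NP (NNP Rooney)) (VP (VBD joined) (NP (DT the) (NNP Everton) (NN youth) (NN team))) (. .)))"
tree = parse_bracketed(short_parse)
print(list(phrases(tree, "NP")))  # ['Rooney', 'the Everton youth team']
```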


####  Pretty Print of Constituency Parser

In [9]:
a=sNLP.parse(text)
Tree.fromstring(a).pretty_print()

                                                                              ROOT                                                                                         
                                                                               |                                                                                            
                                                                               S                                                                                           
   ____________________________________________________________________________|_________________________________________________________________________________________   
  |                                                                            VP                                                                                        | 
  |                               _____________________________________________|___________________________________                       

### v) Dependency Parser

In [10]:
print ("Dep Parse:", sNLP.dependency_parse(text))
   

Dep Parse: [('ROOT', 0, 2), ('nsubj', 2, 1), ('det', 6, 3), ('compound', 6, 4), ('compound', 6, 5), ('dobj', 2, 6), ('case', 9, 7), ('det', 9, 8), ('nmod', 2, 9), ('case', 11, 10), ('nmod', 9, 11), ('cc', 2, 12), ('conj', 2, 13), ('nmod:poss', 16, 14), ('amod', 16, 15), ('dobj', 13, 16), ('case', 19, 17), ('det', 19, 18), ('nmod', 13, 19), ('case', 21, 20), ('nmod', 13, 21), ('case', 24, 22), ('det', 24, 23), ('nmod', 13, 24), ('case', 26, 25), ('nmod', 24, 26), ('punct', 2, 27)]
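
Each tuple in this output has the form (relation, governor index, dependent index), where the indices are 1-based token positions and 0 stands for the artificial ROOT node. A small sketch of turning the indices back into words follows. It uses the first six tokens and relations from the outputs above, and `readable_dependencies` is a hypothetical helper name.

```python
# First six tokens and dependency triples from the outputs above
tokens = ['Rooney', 'joined', 'the', 'Everton', 'youth', 'team']
dep_triples = [('ROOT', 0, 2), ('nsubj', 2, 1), ('det', 6, 3),
               ('compound', 6, 4), ('compound', 6, 5), ('dobj', 2, 6)]


def readable_dependencies(tokens, triples):
    """Replace token indices by words; index 0 is the artificial ROOT node."""
    words = ['ROOT'] + list(tokens)
    return [(rel, words[gov], words[dep]) for rel, gov, dep in triples]


for rel, governor, dependent in readable_dependencies(tokens, dep_triples):
    print(f"{rel}({governor}, {dependent})")
# ROOT(ROOT, joined)
# nsubj(joined, Rooney)
# det(team, the)
# compound(team, Everton)
# compound(team, youth)
# dobj(joined, team)
```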


## Step 4: Relation extraction using OPENIE CoreNLP

Relation extraction is the task of determining the semantic links between the entities found in the text.

In [11]:
# pip install pycorenlp; note that this class shadows the stanfordcorenlp one imported earlier
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost:9000/")
print(text)
print("\n")

output = nlp.annotate(text, properties={"annotators": "tokenize,ssplit,pos,depparse,natlog,openie",
                                        "outputFormat": "json", "triple.strict": "true"})
# Collect the OpenIE triples of every sentence in the response
result = [sentence["openie"] for sentence in output["sentences"]]
# print(result)
for triples in result:
    for rel in triples:
        relationSent = rel['subject'], rel['relation'], rel['object']
        print(relationSent)

Rooney joined the Everton youth team at the age of 9 and made his professional debut for the club in 2002 at the age of 16.


('Rooney', 'joined Everton youth team at', 'age of 9')
('Rooney', 'made', 'his professional debut')
('Rooney', 'made', 'his debut')
('Rooney', 'joined Everton youth team at', 'age')
('Rooney', 'joined', 'Everton youth team')
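
These (subject, relation, object) triples are the raw material for an ontology: subjects and objects become concepts, and relations become the links between them. As a minimal sketch of that idea, and not necessarily the structure used in the rest of this tutorial, the triples printed above can be grouped into a nested dictionary. The function name `build_graph` is hypothetical.

```python
from collections import defaultdict

# Triples copied from the OpenIE output above
triples = [
    ('Rooney', 'joined Everton youth team at', 'age of 9'),
    ('Rooney', 'made', 'his professional debut'),
    ('Rooney', 'made', 'his debut'),
    ('Rooney', 'joined Everton youth team at', 'age'),
    ('Rooney', 'joined', 'Everton youth team'),
]


def build_graph(triples):
    """Group triples as subject -> relation -> set of objects."""
    graph = defaultdict(lambda: defaultdict(set))
    for subject, relation, obj in triples:
        graph[subject][relation].add(obj)
    return graph


graph = build_graph(triples)
for relation, objects in graph['Rooney'].items():
    print(relation, '->', sorted(objects))
# joined Everton youth team at -> ['age', 'age of 9']
# made -> ['his debut', 'his professional debut']
# joined -> ['Everton youth team']
```

Using sets removes exact duplicates, but near-duplicates such as 'his debut' and 'his professional debut' remain and would need further filtering.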
