<div class="alert alert-block alert-info" >
    <h1>Knowledge Graph</h1>
    <h3>Created on $1^{st}$ December, 2020 </h3>
</div>

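The aim is to build an NLP workflow that extracts text from PDF documents, cleans it, extracts keywords with spaCy, and pushes the result to a Neo4j graph database.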

# Table of contents
1. [Why Neo4j?](#Neo4j)
2. [Document extraction](#docextraction)
3. [Text cleaning](#textcleaning)
4. [Text preprocessing using spaCy](#spacy)
5. [Connecting to Neo4j from Python](#neo2py)
6. [Reference](#reference)

### Why Neo4j? <a name="Neo4j"></a>
- Ranks among the top 20 database systems in the DB-Engines ranking as of December 2020 [1]
- It is the only graph database among the top-ranked systems [1]
- It is open source [2]
- A detailed comparison with OrientDB and RDF4J is provided in [2]
- Neo4j accelerates Natural Language Processing (NLP) [3][4]

## Imports

In [2]:
# from neo4j import GraphDatabase
from py2neo import Node, Graph
from tika import parser
import nltk
import re
from nltk.corpus import stopwords
import spacy
from collections import Counter
from spacy import displacy
import gensim
from gensim.utils import simple_preprocess 
import gensim.corpora as corpora
from pprint import pprint
from typing import List, Tuple


## Package initialization for NLP

In [3]:
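nltk.download("wordnet")
# The cells below also read the NLTK "stopwords" and "words" corpora;
# fetch them with nltk.download("stopwords") and nltk.download("words") if they are missing.
nlp = spacy.load("en_core_web_sm")
eng = spacy.lang.en.English()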

[nltk_data] Downloading package wordnet to /home/ganesh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Neo4j Credentials 

In [4]:
# Database Credentials
uri             = "bolt://localhost:7687"
userName        = "neo4j"
password        = " "
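
As an alternative to writing the password into the notebook, the three values above could be read from environment variables. The following is a minimal sketch; the variable names `NEO4J_URI`, `NEO4J_USER` and `NEO4J_PASSWORD` and the helper `load_neo4j_credentials` are assumptions made for illustration, not part of the original setup.

```python
import os

def load_neo4j_credentials(env=None):
    """Return (uri, user, password) from environment variables, with local defaults."""
    env = os.environ if env is None else env
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")  # hypothetical variable name
    user = env.get("NEO4J_USER", "neo4j")                # hypothetical variable name
    password = env.get("NEO4J_PASSWORD", "")             # hypothetical variable name
    return uri, user, password

uri, userName, password = load_neo4j_credentials()
```

The names `uri`, `userName` and `password` match the ones used in the connection cell, so the rest of the notebook would not need to change.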

## Variable declaration

In [5]:
file_path = '/home/ganesh/Documents/NLP-Fraunhofer/Pdf_extraction/pdf/'

document = [file_path + 'SafetyCompanion-2020-EN.pdf',
            file_path + 'Test_pdf_1.pdf',
            file_path + 'Test_pdf_2.pdf',
            file_path + 'Test_pdf_3.pdf',
            file_path + 'Test_pdf_4.pdf',
            file_path + 'euroncap-2019-vw-golf-datasheet.pdf',
            file_path + 'vehicle_crashworthiness_complete.pdf']
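
The list above is written out by hand. If every PDF in the folder should be processed, it could instead be collected with `pathlib`. This is only a sketch: `list_pdfs` is a hypothetical helper, and because it sorts by file name, `document[-1]` would not necessarily point at the same file as in the hand-written list.

```python
from pathlib import Path

def list_pdfs(directory):
    """Return the PDF files directly inside `directory`, sorted by name, as strings."""
    return sorted(str(p) for p in Path(directory).glob("*.pdf"))

# Example: document = list_pdfs(file_path)
```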

## Document Extraction<a name="docextraction"></a>

In [6]:
def tika_Extraction(filename):
    """Extract the plain text of a document with Apache Tika."""
    file_data = parser.from_file(filename)
    # 'content' holds the extracted text; 'metadata' holds the document properties
    text = file_data['content']
    meta_data = file_data['metadata']  # kept for inspection, not returned
    return text

def nltk_tokens(article):
    # Keep only tokens found in the NLTK English word list (non-alphabetic tokens pass through).
    # NOTE: en_article is not used below; the tokenizer still runs on the full article,
    # so the outputs in this notebook include non-English words.
    words = set(nltk.corpus.words.words())
    en_article = " ".join(w for w in nltk.wordpunct_tokenize(article) if w.lower() in words or not w.isalpha())

    # Treebank tokenizer from NLTK for better English word tokenization
    tokenizer = nltk.tokenize.TreebankWordTokenizer()
    tokens_with_stopwords = tokenizer.tokenize(article)

    # Remove stop words
    stop_words = set(stopwords.words("english"))
    tokens = [word for word in tokens_with_stopwords if word not in stop_words]
    return tokens
    
def text_cleaning(tokens):
    # Keep only letters and digits in each token; every other character becomes a space,
    # so "isn't" ends up as "isn t"

    # Collapse runs of whitespace (including newlines) into a single space
    no_new_lines = [re.sub(r'\s+', ' ', tok) for tok in tokens]
    # Replace every non-alphanumeric character with a space
    non_letters = [re.sub(r'[^a-zA-Z0-9]', ' ', no_new_line) for no_new_line in no_new_lines]
    # Apostrophes were already replaced above, so this step leaves the tokens unchanged
    no_quotes = [re.sub(r"'", '', non_letter) for non_letter in non_letters]
    return no_quotes
    
# Break sentences down into lists of lowercase word tokens
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

# N-gram phrase models (bigram_mod and trigram_mod are globals built in later cells)
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]
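
The three regex steps in `text_cleaning` above can be traced on a few tokens without NLTK. The sketch below is a standalone copy written for experimenting; `clean_tokens` is a hypothetical name. It shows that an apostrophe is turned into a space by the second step, so nothing is left for the third step to remove.

```python
import re

def clean_tokens(tokens):
    """Standalone copy of the three regex steps, for experimenting without NLTK."""
    collapsed = [re.sub(r"\s+", " ", tok) for tok in tokens]               # whitespace runs -> one space
    alnum_only = [re.sub(r"[^a-zA-Z0-9]", " ", tok) for tok in collapsed]  # other characters -> space
    return [re.sub(r"'", "", tok) for tok in alnum_only]                   # no apostrophes remain here

print(clean_tokens(["isn't", "thin-walled", "a\n\nb"]))
# ['isn t', 'thin walled', 'a b']
```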

## Keyword extraction using spaCy <a name="spacy"></a>

In [7]:
def keywords_spacy(text, pos_tags, n):
    keywords = []
    unwanted_words = ["page", "fig", "figure", "th"]
    nouns = set(pos_tags["NOUN"])
    i = 0
    # Run the spaCy pipeline on the extracted text
    doc = nlp(text)
    # Tokenize, dropping punctuation and stop words
    token_ = [token.text for token in doc if not token.is_stop and not token.is_punct]
    # Get the word frequency
    word_freq = Counter(token_)
    # Words sorted from most to least frequent
    common_words = word_freq.most_common()
    # Collect the most frequent nouns; the >= comparison means n + 1 keywords are returned.
    # The bounds check stops the loop when the text has fewer matching nouns.
    while n >= len(keywords) and i < len(common_words):
        word = common_words[i][0]
        if word in nouns and word not in unwanted_words:
            keywords.append(word)
        i += 1

    # Words that occur exactly once
    unique_words = [word for (word, freq) in word_freq.items() if freq == 1]
    return {
        'unique_words': unique_words,
        'common_words': common_words[:n],
        'keywords': keywords
    }
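
The ranking idea in `keywords_spacy` (count the tokens, walk them from most to least frequent, keep the nouns that are not on the unwanted list) can be tried without spaCy. The sketch below assumes the tokens and the noun set are already available; `top_nouns` is a hypothetical helper and, unlike the cell above, it returns at most `n` words.

```python
from collections import Counter

def top_nouns(tokens, nouns, n, unwanted=("page", "fig", "figure", "th")):
    """Return up to n of the most frequent tokens that are nouns and not unwanted."""
    allowed = set(nouns) - set(unwanted)
    ranked = [word for word, _ in Counter(tokens).most_common() if word in allowed]
    return ranked[:n]

tokens = ["crash", "crash", "crash", "vehicle", "vehicle",
          "page", "page", "page", "page", "stiff"]
print(top_nouns(tokens, nouns={"crash", "vehicle", "page"}, n=2))
# ['crash', 'vehicle']
```

Here "page" is the most frequent token but is filtered out by the unwanted list, and "stiff" is dropped because it is not in the noun set.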
            

In [20]:
def keywords_extractor(para, n_keywords, min_words):

    # Local pipeline: NER and parser are disabled, sentencizer supplies sentence boundaries
    nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
    nlp.add_pipe('sentencizer')
    nlp.max_length = 1500000  # allow long documents

    doc = nlp(para)
    perfect_keyword = []
    accumulator = []       # words of the current candidate phrase
    accumulator_pos = []   # POS tags of those words
    indicator = ["ADJ", "PROPN", "NOUN"]  # tags that may appear inside a phrase
    end = ["NOUN"]  # , "PROPN"]  # a phrase must contain one of these tags
    symbol = set(str(token) for token in doc if token.pos_ == "SYM")

    for token in doc:
        if token.pos_ in indicator and str(token) not in symbol:
            accumulator.append(str(token))
            accumulator_pos.append(token.pos_)
        else:
            # Any other token closes the run. Note that the accumulator is only
            # cleared when a phrase is emitted, so shorter or noun-free runs carry over.
            if len(accumulator) >= min_words:
                if end[0] in accumulator_pos:
                    # Cut the phrase after its last noun
                    index = max(idx for idx, pos in enumerate(accumulator_pos) if pos in end)
                    accumulator = accumulator[:index+1]
                    perfect_keyword.append(tuple(accumulator))
                    accumulator = []
                    accumulator_pos = []

    # Most frequent phrases as (phrase, count) pairs
    keywords = Counter(perfect_keyword).most_common(n_keywords)
    return keywords
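
The chunking idea behind `keywords_extractor` can be explored on hand-tagged input without loading a spaCy model. The sketch below is a variant, not a copy of the cell above: it clears the run after every break, applies `min_words` after trimming to the last noun, and also flushes a run that reaches the end of the input. The name `noun_phrases` and the `(word, pos)` input format are assumptions made for illustration.

```python
from collections import Counter

def noun_phrases(tagged, min_words=2, inside=("ADJ", "PROPN", "NOUN"), end="NOUN"):
    """Count runs of ADJ/PROPN/NOUN tokens, each trimmed to end at its last noun."""
    phrases = []
    run = []  # (word, pos) pairs of the current run

    def flush():
        # Trim trailing non-noun words, then keep the run if it is long enough
        last_noun = max((i for i, (_, pos) in enumerate(run) if pos == end), default=None)
        if last_noun is not None and last_noun + 1 >= min_words:
            phrases.append(tuple(word for word, _ in run[: last_noun + 1]))
        run.clear()

    for word, pos in tagged:
        if pos in inside:
            run.append((word, pos))
        else:
            flush()
    flush()  # do not lose a run that reaches the end of the text
    return Counter(phrases)

tagged = [("finite", "ADJ"), ("element", "NOUN"), ("model", "NOUN"), ("is", "AUX"),
          ("stiff", "ADJ"), ("and", "CCONJ"), ("restraint", "NOUN"), ("system", "NOUN")]
print(noun_phrases(tagged).most_common())
# [(('finite', 'element', 'model'), 1), (('restraint', 'system'), 1)]
```

The lone adjective "stiff" is discarded because its run contains no noun.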

## POS tagging using spaCy

In [9]:
def POS_tag_spacy(text):
    pos_tags = dict()
    doc = nlp(text)

    # Collect [POS tag, word] pairs from spaCy, skipping stop words and punctuation
    pos_list = [[token.pos_, str(token)] for sent in doc.sents for token in sent if not token.is_stop and not token.is_punct]
    # Group the words by POS tag
    for key, value in pos_list:
        if key in pos_tags:
            pos_tags[key].append(value)
        else:
            pos_tags[key] = [value]

    # Print all the tags with the words as a list
    for key in pos_tags:
        print('\n The words with the POS tag {} are {}.\n'.format(key, pos_tags[key]))
        for token in pos_tags[key]:
            # Everything should be a string
            assert isinstance(token, str), 'Each token should be a string'
    return pos_tags
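
The grouping step inside `POS_tag_spacy` is a plain "group values by key" operation. A minimal sketch with `collections.defaultdict` is shown below; `group_by_pos` is a hypothetical helper that takes `(tag, word)` pairs that have already been produced by a tagger.

```python
from collections import defaultdict

def group_by_pos(tagged):
    """Group words by POS tag, keeping the order in which they appear."""
    groups = defaultdict(list)
    for pos, word in tagged:
        groups[pos].append(word)
    return dict(groups)

print(group_by_pos([("ADJ", "frontal"), ("NOUN", "impact"), ("ADJ", "rigid")]))
# {'ADJ': ['frontal', 'rigid'], 'NOUN': ['impact']}
```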

In [11]:
article = tika_Extraction(document[-1]).lower()
print(article)

untitled




vehicle
crashworthiness
and
occupant protection

paul du bois
clifford c. chou
bahig b. fileta
tawfik b. khalil
albert i. king
hikmat f. mahmood
harold j. mertz
jac wismans

editors:

priya prasad
jamel e. belwafa

sponsored by:

automotive applications committee
american iron and steel institute
southfield, michigan



disclaimer
the opinions included in this publication are those of the indi-

vidual authors and in no way represent endorsement of the edi-

tors or american iron and steel institute (aisi) safety panel

members.

copyright c 2004

american iron and steel institute

2000 town center

southfield, michigan  48075



  page  iii

contents

introduction........................................................ 1
1.1 motor vehicle safety ..................................................................... 1

1.2 the automobile structure .............................................................. 3

1.3 materials .........

In [12]:
tokens = nlp(article)

total_sentence = [[sent] for sent in tokens.sents]


In [14]:
result = []
# Treebank tokenizer from NLTK for better English word tokenization
tokenizer = nltk.tokenize.TreebankWordTokenizer()
stop_words = set(stopwords.words("english"))
for sentence in total_sentence:
    tokens_with_stopwords = tokenizer.tokenize(str(sentence))

    # Remove stop words
    tokens = [word for word in tokens_with_stopwords if word not in stop_words]
    # Same cleaning steps as text_cleaning()
    no_new_lines = [re.sub(r'\s+', ' ', tok) for tok in tokens]
    non_letters = [re.sub(r'[^a-zA-Z0-9]', ' ', no_new_line) for no_new_line in no_new_lines]
    no_quotes = [re.sub(r"'", '', non_letter) for non_letter in non_letters]
    data_words = list(sent_to_words(no_quotes))
    # Build the bigram and trigram models
    bigram = gensim.models.Phrases(data_words, min_count=3, threshold=100)
    trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

    # Faster way to get a sentence clubbed as a trigram/bigram
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram_mod = gensim.models.phrases.Phraser(trigram)



    # Form Bigrams
    data_words_bigrams = make_bigrams(data_words)
    result.append(data_words_bigrams)
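
To get a feel for what a bigram model does to the token lists, the sketch below merges adjacent word pairs that occur often enough into a single `a_b` token. It is a deliberately simplified stand-in based on raw pair counts; gensim's `Phrases` applies a scoring function with a threshold instead, so its results will differ. The name `merge_frequent_bigrams` is an assumption made for illustration.

```python
from collections import Counter

def merge_frequent_bigrams(sentences, min_count=2):
    """Join adjacent word pairs seen at least min_count times into 'a_b' tokens."""
    pair_counts = Counter(
        pair for sent in sentences for pair in zip(sent, sent[1:])
    )
    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and pair_counts[(sent[i], sent[i + 1])] >= min_count:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2  # skip the second word of the merged pair
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged

sentences = [["thin", "walled", "beam"], ["thin", "walled", "column"], ["stiff", "beam"]]
print(merge_frequent_bigrams(sentences))
# [['thin_walled', 'beam'], ['thin_walled', 'column'], ['stiff', 'beam']]
```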

In [15]:
gl = []
for x in result:
    local =[]
    for y in x:
        if y!=[]:
            local.append(" ".join(y))
    if local!=[]:
        gl.append(local)  

sent_list = [' '.join(i) for i in gl]
para = ". ".join(sent_list)
para

'untitled vehicle crashworthiness occupant protection paul du bois clifford chou. bahig fileta tawfik khalil albert king hikmat mahmood harold mertz jac wismans editors priya prasad jamel. belwafa sponsored automotive applications committee american iron steel institute southfield michigan disclaimer opinions included publication indi vidual authors way represent endorsement edi tors american iron steel institute aisi safety panel members. copyright american iron steel institute town center southfield michigan page iii contents introduction motor vehicle safety. automobile structure materials crashworthiness crashworthiness goals crashworthiness requirements achieving crashworthiness crashworthiness models requirements introduction current design practice comparison lms fe based crashworthiness processes lumped mass spring models limitations lms models crash crush design techniques front structures basic principles designing crash energy manage ment desired dummy performance stiff cage

In [16]:
pos_tags = POS_tag_spacy(para)


 The words with the POS tag ADJ are ['untitled', 'priya', 'automotive', 'american', 'disclaimer', 'indi', 'vidual', 'american', 'american', 'crashworthiness', 'crashworthiness', 'current', 'mass', 'basic', 'dummy', 'stiff', 'structural', 'progressive', 'crush', 'limited', 'efficient', 'analytical', 'collapsible', 'dynamic', 'new', 'analytical', 'axial', 'mathematical', 'structural', 'general', 'collapsible', 'thin', 'walled', 'structural', 'different', 'current', 'frontal', 'preliminary', 'frontal', 'analytical', 'historical', 'explicit', 'explicit', 'shell', 'current', 'initial', 'rear', 'basic', 'basic', 'additional', 'equivalent', 'square', 'good', 'frontal', 'structural', 'ugrading', 'supplemental', 'analytical', 'occupant', 'occupant', 'constant', 'traditional', 'multi', 'multi', 'finite', 'multi', 'multi', 'rigid', 'flexible', 'joint', 'kinematic', 'damper', 'dynamic', 'joint', 'multi', 'finite', 'crash', 'dummy', 'real', 'human', 'human', 'lateral', 'low', 'high', 'automotive',

In [21]:
keywords = keywords_extractor(para, n_keywords=60, min_words=2)

# Each entry is a (phrase, count) pair; print only the phrase
for keyword in keywords:
    print(keyword[0])

('proceedings', 'stapp', 'conference')
('paper', 'no')
('et',)
('design', 'vehicle', 'structures')
('vehicle', 'crashworthiness', 'occupant', 'protection', 'page')
('proceedings', 'st', 'stapp', 'conference')
('btable', 'dummies')
('restraint', 'system')
('safety', 'vehicles')
('energy', 'management', 'page')
('energy', 'management', 'page', 'fig')
('stapp', 'car', 'crash', 'conference')
('rigid', 'bodies')
('international', 'technical', 'conference')
('frontal', 'impacts')
('mahmood', 'paluszny')
('finite', 'element', 'model')
('fundamental', 'principles', 'vehicle', 'occupant', 'system', 'analysis', 'page')
('vol', 'bed', 'vol')
('shell', 'elements')
('kn',)
('males', 'ft')
('stapp', 'conference')
('ircobi', 'bron', 'france')
('mechanical', 'response')
('dummy', 'family')
('crash', 'energy')
('total', 'energy')
('car', 'body')
('vehicle', 'crashworthiness', 'occupant', 'protection', 'page', 'fig')
('publisher', 'pp')
('force', 'occupant')
('kinetic', 'energy')
('proceedings', 'intern

In [22]:
result = keywords_spacy(para, pos_tags, 60)
result['keywords']

['vehicle',
 'occupant',
 'impact',
 'crash',
 'body',
 'design',
 'model',
 'injury',
 'element',
 'models',
 'response',
 'energy',
 'system',
 'crashworthiness',
 'dummy',
 'force',
 'time',
 'analysis',
 'human',
 'data',
 'protection',
 'head',
 'frontal',
 'crush',
 'restraint',
 'et',
 'structure',
 'sae',
 'velocity',
 'test',
 'bending',
 'safety',
 'structures',
 'finite',
 'conference',
 'tolerance',
 'tests',
 'simulation',
 'elements',
 'mass',
 'paper',
 'load',
 'section',
 'impacts',
 'proceedings',
 'lb',
 'collapse',
 'car',
 'acceleration',
 'barrier',
 'neck',
 'dummies',
 'sey',
 'joint',
 'modeling',
 'stiffness',
 'mph',
 'hybrid',
 'vehicles',
 'motion',
 'injuries']

## Text cleaning <a name="textcleaning"></a>

In [24]:
cleaned_text = text_cleaning(nltk_tokens(article))

data_words = list(sent_to_words(cleaned_text))
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=3, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)



# Form Bigrams
data_words_bigrams = make_bigrams(data_words)


In [25]:
# Flatten the list of token lists into a single paragraph
flat_list = [item for sublist in data_words_bigrams for item in sublist]
para_1 = " ".join(flat_list)
para_1

'untitled vehicle crashworthiness occupant protection paul du bois clifford chou bahig fileta tawfik khalil albert king hikmat mahmood harold mertz jac wismans editors priya prasad jamel belwafa sponsored automotive applications committee american iron steel institute southfield michigan disclaimer opinions included publication indi vidual authors way represent endorsement edi tors american iron steel institute aisi safety panel members copyright american iron steel institute town center southfield michigan page iii contents introduction motor vehicle safety automobile structure materials crashworthiness crashworthiness goals crashworthiness requirements achieving crashworthiness crashworthiness models requirements introduction current design practice comparison lms fe based crashworthiness processes lumped mass spring models limitations lms models crash crush design techniques front structures basic principles designing crash energy manage ment desired dummy performance stiff cage str

In [26]:
POS_tag_spacy(para_1)


 The words with the POS tag ADJ are ['untitled', 'priya', 'automotive', 'american', 'disclaimer', 'indi', 'vidual', 'american', 'american', 'crashworthiness', 'crashworthiness', 'current', 'mass', 'basic', 'dummy', 'stiff', 'structural', 'progressive', 'crush', 'limited', 'efficient', 'analytical', 'collapsible', 'dynamic', 'new', 'analytical', 'axial', 'mathematical', 'structural', 'general', 'collapsible', 'thin_walled', 'finite', 'structural', 'different', 'current', 'frontal', 'preliminary', 'frontal', 'analytical', 'historical', 'explicit', 'explicit', 'current', 'initial', 'rear', 'basic', 'basic', 'additional', 'equivalent', 'square', 'good', 'frontal', 'structural', 'ugrading', 'supplemental', 'analytical', 'occupant', 'occupant', 'constant', 'traditional', 'multi', 'multi', 'finite', 'multi', 'multi', 'rigid', 'flexible', 'joint', 'kinematic', 'joint', 'dynamic', 'joint', 'multi', 'finite', 'real', 'human', 'human', 'lateral', 'low', 'high', 'automotive', 'related', 'abdomina

{'ADJ': ['untitled',
  'priya',
  'automotive',
  'american',
  'disclaimer',
  'indi',
  'vidual',
  'american',
  'american',
  'crashworthiness',
  'crashworthiness',
  'current',
  'mass',
  'basic',
  'dummy',
  'stiff',
  'structural',
  'progressive',
  'crush',
  'limited',
  'efficient',
  'analytical',
  'collapsible',
  'dynamic',
  'new',
  'analytical',
  'axial',
  'mathematical',
  'structural',
  'general',
  'collapsible',
  'thin_walled',
  'finite',
  'structural',
  'different',
  'current',
  'frontal',
  'preliminary',
  'frontal',
  'analytical',
  'historical',
  'explicit',
  'explicit',
  'current',
  'initial',
  'rear',
  'basic',
  'basic',
  'additional',
  'equivalent',
  'square',
  'good',
  'frontal',
  'structural',
  'ugrading',
  'supplemental',
  'analytical',
  'occupant',
  'occupant',
  'constant',
  'traditional',
  'multi',
  'multi',
  'finite',
  'multi',
  'multi',
  'rigid',
  'flexible',
  'joint',
  'kinematic',
  'joint',
  'dynamic',
 

In [27]:
# pos_tags was computed from para in the earlier cell, not from para_1
result = keywords_spacy(para_1, pos_tags, 7)
result['keywords']

['vehicle', 'occupant', 'impact', 'crash', 'body', 'design', 'model', 'injury']

## Dependency tree from spaCy

In [28]:
def dependency_tree(text):
    doc = nlp(text)
    displacy.render(doc, style="dep")

In [29]:
dependency_tree(str(total_sentence[4]))

## Connecting to Neo4j from python <a name="neo2py"></a>

In [5]:
# Connect to the neo4j database server
# graphDB_Driver  = GraphDatabase.driver(uri, auth=(userName, password)) 
graph = Graph(host='localhost', user=userName, password=password)
tx = graph.begin()

## Pushing the text to Neo4j

In [6]:
Nodes = Node("Article")
print(Nodes)
graph.create(Nodes)
tx.merge(Nodes)


(:Article {})
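
The node created above has no properties yet. One possible next step is to attach the extracted keywords to the article. The sketch below only builds parameterized Cypher statements and does not need a running database; the `title` and `name` properties, the `Keyword` label and the `HAS_KEYWORD` relationship type are assumptions made for illustration, not a schema defined in this notebook.

```python
def keyword_merge_statements(article_title, keywords):
    """Build (query, parameters) pairs that link an article to its keywords."""
    query = (
        "MERGE (a:Article {title: $title}) "
        "MERGE (k:Keyword {name: $name}) "
        "MERGE (a)-[:HAS_KEYWORD]->(k)"
    )
    return [(query, {"title": article_title, "name": kw}) for kw in keywords]

statements = keyword_merge_statements("vehicle crashworthiness", ["vehicle", "occupant"])
for query, params in statements:
    print(params)
    # With an open connection, each pair could be sent with graph.run(query, params)
```

Using `MERGE` with parameters keeps the statements idempotent, so running them twice should not create duplicate keyword nodes.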


### Reference <a name="reference"></a>

1. <a href="https://db-engines.com/en/ranking" target="_top">Db engines Ranking</a><br>
2. <a href="https://db-engines.com/en/system/Neo4j%3BOrientDB%3BRDF4J" target="_top">Neo4j Ranking</a><br>
3. <a href="https://tbgraph.wordpress.com/2020/05/12/nlp-and-graphs-go-hand-in-hand-with-neo4j-and-apoc/" target="_top">NLP and Neo4j</a><br>
4. <a href="https://neo4j.com/blog/accelerating-towards-natural-language-search-graphs/" target="_top">NLP towards Neo4j</a><br>
5. <a href="https://gist.github.com/SandieIJ/69fc80c372e823fecfd4eeeda2156936" target="_top">Cleaning the text</a><br>
