Reference: https://www.kaggle.com/code/pavansanagapati/knowledge-graph-nlp-tutorial-bert-spacy-nltk/notebook

In [6]:
import re
import pandas as pd
import bs4
import requests
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')

from spacy.matcher import Matcher 
from spacy.tokens import Span 

import networkx as nx

import matplotlib.pyplot as plt
from tqdm import tqdm

pd.set_option('display.max_colwidth', 200)
%matplotlib inline

In [7]:
# import DOJ sentences
candidate_sentences = pd.read_csv("doj_api_data_new.csv")
candidate_sentences.shape

(48050, 2)

In [9]:
candidate_sentences['body'].sample(5)

8280     BUFFALO, N.Y.--U.S. Attorney William J. Hochul, Jr. announced today that Myron Johnson, 39, of Buffalo, N.Y., who was convicted of possession with intent to distribute cocaine, was sentenced to 51...
24831    Follow @SDILNews \n\nNAEEM MAHMOOD KOHLI, 60, of Effingham, Illinois, was convicted of seven counts of illegal dispensation of a Schedule II Controlled Substance following a 17-day jury trial held...
4303     WASHINGTON – The Department of Justice and the Department of the Interior announced today that Freeport-McMoRan Corporation and Freeport-McMoRan Morenci Inc. (Freeport-McMoRan) have agreed to pay ...
4782     A federal judge in Worcester, Mass., sentenced William Scott Dion today to 84 months in prison for conspiring to defraud the United States, and for obstructing the Internal Revenue Service (IRS), ...
33373    St. Louis, MO – JOEY D. WOOD pled  guilty to filing four false tax returns for himself and two others claiming  refunds totaling over $23,000 for tax years

### Entity Extraction
A knowledge graph is built from nodes (the entities) and the edges between them (the relations), so the first step is to extract a subject/object entity pair from each text.

In [10]:
def get_entities(sent):
  ## chunk 1
  ent1 = ""
  ent2 = ""

  prv_tok_dep = ""    # dependency tag of previous token in the sentence
  prv_tok_text = ""   # previous token in the sentence

  prefix = ""
  modifier = ""

  #############################################################
  
  for tok in nlp(sent):
    ## chunk 2
    # if token is a punctuation mark then move on to the next token
    if tok.dep_ != "punct":
      # check: token is a compound word or not
      if tok.dep_ == "compound":
        prefix = tok.text
        # if the previous word was also a 'compound' then add the current word to it
        if prv_tok_dep == "compound":
          prefix = prv_tok_text + " "+ tok.text
      
      # check: token is a modifier or not
      if tok.dep_.endswith("mod"):
        modifier = tok.text
        # if the previous word was also a 'compound' then add the current word to it
        if prv_tok_dep == "compound":
          modifier = prv_tok_text + " "+ tok.text
      
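      ## chunk 3
      # subject found: combine it with any modifier/compound collected so far.
      # ("subj" in tok.dep_ replaces find("subj") == True, which only worked
      # because "subj" happens to start at index 1 in tags such as "nsubj")
      if "subj" in tok.dep_:
        ent1 = modifier +" "+ prefix + " "+ tok.text
        prefix = ""
        modifier = ""
        prv_tok_dep = ""
        prv_tok_text = ""      

      ## chunk 4
      # object found: combine it with any modifier/compound collected so far
      if "obj" in tok.dep_:
        ent2 = modifier +" "+ prefix +" "+ tok.text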
        
      ## chunk 5  
      # update variables
      prv_tok_dep = tok.dep_
      prv_tok_text = tok.text
  #############################################################

  return [ent1.strip(), ent2.strip()]

In [11]:
get_entities("the film had 200 patents")

['film', '200  patents']

In [14]:
entity_pairs = []

for i in tqdm(candidate_sentences["body"]):
  entity_pairs.append(get_entities(i))

100%|███████████████████████████████████| 48050/48050 [2:02:33<00:00,  6.53it/s]


In [15]:
entity_pairs[10:20]

[['dangerous Ervin Prenci', 'criminal Fugitive Investigative More'],
 ['30 Americas participants', 'legal American States'],
 ['serious  ERO', 'illegally United country'],
 ['', ''],
 ['I', 'forward  questions'],
 ['Western office', 'crucial  David'],
 ['so cyber we', 'important  me'],
 ['also  I', 'forward  questions'],
 ['international  efforts', 'currently Northern visit'],
 ['true Justice Department', 'passing']]

### Relationship Extraction

In [27]:
def get_relation(sent):

  doc = nlp(sent)

  # Matcher class object 
  matcher = Matcher(nlp.vocab)

  #define the pattern 
  pattern = [{'DEP':'ROOT'}, 
            {'DEP':'prep','OP':"?"},
            {'DEP':'agent','OP':"?"},  
            {'POS':'ADJ','OP':"?"}] 

  matcher.add("matching_1",[pattern]) 

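  matches = matcher(doc)
  if not matches:
    # nothing matched (e.g. empty text), so there is no relation to return
    return ""

  # use the last match returned by the matcher
  span = doc[matches[-1][1]:matches[-1][2]]

  return span.text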

In [28]:
get_relation("John completed the task")

'completed'

In [None]:
relations = [get_relation(i) for i in tqdm(candidate_sentences['body'])]

 15%|█████▎                              | 7059/48050 [10:58<1:11:41,  9.53it/s]

In [24]:
pd.Series(relations).value_counts()[:50]

NameError: name 'relations' is not defined

Time cost: the spaCy entity extraction above took about two hours. Doing the same with a GPT model means developing prompts, and going through the API means writing code as well. If we want the relationships in a specific format, we would have to write them out to a file. How does spaCy handle that? Is it the same order of time with a GPT model? Can it do the extraction automatically and faster? Is the bar to make it work much lower, even if spaCy gives better results?
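On the "relationships into a file" point, a minimal sketch of what that could look like, assuming the `entity_pairs` and `relations` lists have the shapes produced by the cells above. The function name `triples_to_csv`, the column names, and the sample data are hypothetical and chosen only for illustration.

```python
import csv
import io

def triples_to_csv(entity_pairs, relations, out):
    """Write (subject, relation, object) rows to a file-like object.

    Rows with an empty subject, relation, or object are skipped.
    Returns the number of data rows written (header excluded).
    """
    writer = csv.writer(out)
    writer.writerow(["subject", "relation", "object"])
    written = 0
    for (subj, obj), rel in zip(entity_pairs, relations):
        if subj and rel and obj:
            writer.writerow([subj, rel, obj])
            written += 1
    return written

# Hypothetical sample data in the same shape as entity_pairs / relations.
sample_pairs = [["film", "200  patents"], ["", ""], ["John", "task"]]
sample_relations = ["had", "said", "completed"]

# io.StringIO stands in for open("triples.csv", "w", newline="").
buffer = io.StringIO()
n_rows = triples_to_csv(sample_pairs, sample_relations, buffer)
print(n_rows)
print(buffer.getvalue())
```

Dropping rows with an empty field is a design choice for the sketch; keeping them would make it easier to count how often extraction fails.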

When we start evaluating these models, one downfall is that they have no memory. The Raven one has infinite memory, but with the GPT-based ones we have to use zero-shot memory by injecting the data into the prompts themselves. Are there things we can do to improve results? One idea: extract relationship information (even if a GPT model does the extraction) and inject it into zero-shot learning to improve how the model evaluates subsequent topics. Zero-shot training. AutoGPT and Jarvis from Microsoft are the corollary to that.
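The idea of injecting extracted relationships into the prompt could be sketched roughly as follows. This is only an illustration: `build_prompt`, the prompt wording, and the sample triples are assumptions, and no model is actually called.

```python
def build_prompt(question, triples, max_facts=5):
    """Prepend extracted (subject, relation, object) facts to a question."""
    lines = ["Known relationships:"]
    for subj, rel, obj in triples[:max_facts]:
        lines.append(f"- {subj} {rel} {obj}")
    lines.append("")  # blank line between the facts and the question
    lines.append(f"Question: {question}")
    return "\n".join(lines)

# Hypothetical triples, e.g. loaded from an earlier extraction run.
facts = [("John", "completed", "task"), ("film", "had", "200 patents")]
prompt = build_prompt("Who completed the task?", facts)
print(prompt)
```

The `max_facts` cap is there because the prompt has limited room, so only a few of the extracted relationships can be injected per query.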

RE (relation extraction) from spaCy and GPT: comparison.
Article is trying to visualize an LLM as it's being trained. Look at this capability. Knowledge of the embedding space.
Take the HT data and parse it into sections that make sense (paragraphs, for example), create summaries of those sections, and use one of the embedding systems to store them in a vector set for searching. Then take the prompt meant for the LLM, pass in its concepts, get the similar entries from the vector data space, and inject those summaries into the prompt given to the LLM. This is a preprocessing step: take the initial query and, before passing it to the LLM, get its embedding, find which entries in the vector database are relevant, and pull the articles linked to those embeddings. The prompt then contains the original question plus the information retrieved from the vector store. If we want the bot to answer a certain way, it can use the provided data to answer. We get the LLM's command of the language together with relevant information from the data set. The prompts have to be summarized to carry that new background info from the vector store.
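The preprocessing step described above (embed the summaries, find the ones closest to the query, inject them into the prompt) can be sketched with the standard library alone. The word-count `embed` function below is a toy stand-in for a real embedding model, `VectorStore` is a stand-in for a vector database, and the summaries are made up for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a real system would call an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class VectorStore:
    """In-memory stand-in for a vector database."""
    def __init__(self):
        self.items = []  # list of (vector, summary) pairs

    def add(self, summary):
        self.items.append((embed(summary), summary))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [summary for _, summary in ranked[:k]]

def augment(query, store, k=2):
    """Build a prompt holding the k most similar summaries plus the original question."""
    context = "\n".join(f"- {s}" for s in store.search(query, k))
    return f"Background:\n{context}\n\nQuestion: {query}"

store = VectorStore()
store.add("Defendant sentenced to 51 months for cocaine distribution in Buffalo.")
store.add("Company agreed to pay a settlement for environmental damage.")
store.add("Defendant pled guilty to filing false tax returns.")

prompt = augment("Who was sentenced for cocaine distribution?", store, k=1)
print(prompt)
```

Swapping `embed` for a real embedding call and `VectorStore` for a hosted vector database would leave the `augment` step unchanged, which is the part that matters for the prompt.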

In an HT use case, I want to figure out which places are used as trafficking locations, and why. The LLM asks itself what it would need to do to figure that out: it breaks the question down into tasks and decides to run web queries, do statistical analysis, etc. The tasks are autogenerated, and at the end it provides a result based on all the tasks it has created.
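A very rough sketch of that task loop, to make the control flow concrete. This is not how AutoGPT is implemented: `plan` and `execute` are hypothetical stubs standing in for LLM calls and tool calls, and the task names are invented.

```python
from collections import deque

def plan(goal):
    """Stub planner: a real agent would ask the LLM to propose these tasks."""
    return [
        f"web search: {goal}",
        f"statistical analysis: {goal}",
        f"summarize findings: {goal}",
    ]

def execute(task):
    """Stub executor: a real agent would call a search API, run code, or query the LLM."""
    return f"result of [{task}]"

def run_agent(goal, max_steps=10):
    """Work through the generated tasks in order and collect (task, result) pairs."""
    queue = deque(plan(goal))
    results = []
    while queue and len(results) < max_steps:
        task = queue.popleft()
        results.append((task, execute(task)))
    return results

results = run_agent("locations used for trafficking")
for task, result in results:
    print(task, "->", result)
```

The `max_steps` limit is a guard for the case where an executor adds new tasks to the queue, so the loop cannot run indefinitely.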

Worthwhile getting exposed to all of this. Use the latest and greatest models: GPT-3.5 and GPT-4. Get a feel for capacity and capability.

Try to utilize this, figure out the weaknesses and strengths, and see what can be done to improve the weaknesses. Touch as many tools as possible: Pinecone for vector databases, AutoGPT for task creation.

Creating a Pinecone account to set up a vector database for long-term memory.

Preprocessing the prompt to look for similarities.