<a href="https://colab.research.google.com/github/AyushiKashyapp/foodwise_knowledgeDB/blob/main/TripleExtraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting relations from web pages.

# Aim:

The aim is to extract relations (triples) from Wikipedia and other online information pages for the major stakeholders and committee members of the Food Wise 2025 and Food Vision 2030 projects.

The code overall follows these major steps:

- Get the URLs for the stakeholder organisations and committee members from an Excel file.
- Scrape the web pages and store the scraped text as a string to be tokenized.
- Extract the entities and relations from the text using the ***rebel-large model***.
- Export the extracted relations to an Excel file.

1. **Installing required libraries.**
- transformers  : Library provided by Hugging Face that provides general-purpose architectures (BERT, GPT-2, RoBERTa, etc.) for natural language understanding (NLU) and natural language generation (NLG).
- wikipedia     : To access and parse data from Wikipedia.
- newspaper3k   : Extracting and parsing articles from websites (including newspapers).
- pyvis         : For visualization of graphs and networks.

In [1]:
!pip install transformers wikipedia newspaper3k GoogleNews pyvis

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Collecting newspaper3k
  Downloading newspaper3k-0.2.8-py3-none-any.whl (211 kB)
Collecting GoogleNews
  Downloading GoogleNews-1.6.14-py3-none-any.whl (8.5 kB)
Collecting pyvis
  Downloading pyvis-0.3.2-py3-none-any.whl (756 kB)
Collecting cssselect>=0.9.2 (from newspaper3k)
  Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Collecting feedparser>=5.2.1 (from newspaper3k)
  Downloading feedparser-6.0.11-py3-none-any.whl (81 kB)
Collecting tldextract>=2.0.1 (from newspaper3k)
  Downloading tl

In [2]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import math
import torch
import wikipedia
from newspaper import Article, ArticleException
from GoogleNews import GoogleNews
import IPython
from pyvis.network import Network

2. **Reading links**

Reading the links corresponding to ***stakeholder organisations*** and ***committee members*** and storing the links in the list **urls**.

In [3]:
import pandas as pd

df = pd.read_excel('stakeholders.xlsx', sheet_name='Sheet1', engine='openpyxl')
urls = df['Links'].tolist()
print(urls)

['https://birdwatch-europe.org/', 'https://www.fertilizer.org/', 'https://iitc.ie/', 'https://www.localroots.ie/', 'https://icmsa.ie/', 'https://www.thepimlicoproject.com/', 'https://en.wikipedia.org/wiki/Department_of_Transport_(Ireland)', 'https://en.wikipedia.org/wiki/Department_of_Agriculture,_Food_and_the_Marine', 'https://www.epa.ie/', 'https://www.creativead.ie/', 'https://www.glennonbrothers.ie/', 'https://en.wikipedia.org/wiki/Department_of_Public_Expenditure,_National_Development_Plan_Delivery_and_Reform', 'https://www.teagasc.ie/', 'https://bim.ie/', 'https://www.bordbia.ie/', 'https://en.wikipedia.org/wiki/Department_of_Finance_(Ireland)', 'https://en.wikipedia.org/wiki/Department_of_Housing,_Local_Government_and_Heritage', 'https://www.kerry.com/', 'https://www.danone.ie/', 'https://lic.ie/', 'https://www.oecd.org/agriculture/about/', 'https://loveirishfood.ie/brands/glenisk-2/', 'https://enterprise.gov.ie/en/who-we-are/offices-agencies/enterprise-ireland.html', 'https://e
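
A spreadsheet column can contain blank cells or repeated links, which would reach `urls` as `NaN` values or duplicates. The following is a minimal sketch of a pre-filter, not part of the notebook: `keep_valid_urls` is a hypothetical helper, and it assumes that only `http`/`https` links should be scraped.

```python
from urllib.parse import urlparse

def keep_valid_urls(raw_values):
    """Return unique http(s) URLs in their original order (hypothetical helper)."""
    seen = set()
    valid = []
    for value in raw_values:
        # Blank Excel cells arrive as float NaN, so skip anything that is not a string.
        if not isinstance(value, str):
            continue
        url = value.strip()
        parsed = urlparse(url)
        if parsed.scheme in ("http", "https") and parsed.netloc and url not in seen:
            seen.add(url)
            valid.append(url)
    return valid

# Illustrative input: two links from the list above, a blank cell, a duplicate, and junk.
sample = ["https://www.teagasc.ie/", float("nan"), " https://bim.ie/ ",
          "https://www.teagasc.ie/", "not a link"]
print(keep_valid_urls(sample))
```

In the notebook this could be applied as `urls = keep_valid_urls(df['Links'].tolist())`.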

3. **Loading the rebel-large model**.

Loading the *tokenizer* that accompanies the rebel-large model from the ***Babelscape*** repository on the Hugging Face Hub.

Loading the *rebel-large model*, a pretrained sequence-to-sequence model for relation extraction provided by ***Babelscape***.

In [4]:
tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



4. **Function to extract triples from a text.**

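This function parses the decoded output of the model, which marks each triple with the special tokens `<triplet>`, `<subj>` and `<obj>`: the text after `<triplet>` is the head entity, the text after `<subj>` is the tail entity, and the text after `<obj>` is the relation type.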

In [5]:
def extract_relations_from_model_output(text):
    relations = [] #Store the extracted relationships as dictionaries.
    subject, relation, object_ = '', '', '' #Hold the current subject, relation type, and object being processed.
    text = text.strip()
    current = 'x'
    text_replaced = text.replace("<s>", "").replace("<pad>", "").replace("</s>", "") #Removes special tokens
    for token in text_replaced.split(): #Iterates through each token in the processed text
        if token == "<triplet>": #start of a new triplet.
            current = 't'
            if relation != '':
                relations.append({
                    'head': subject.strip(),
                    'type': relation.strip(),
                    'tail': object_.strip()
                })
                relation = ''
            subject = ''
        elif token == "<subj>": #start of a subject in a triplet.
            current = 's'
            if relation != '':
                relations.append({
                    'head': subject.strip(),
                    'type': relation.strip(),
                    'tail': object_.strip()
                })
            object_ = ''
        elif token == "<obj>": #start of an object in a triplet.
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '': #Add the last pending triple only if head, relation, and tail are all non-empty.
        relations.append({
            'head': subject.strip(),
            'type': relation.strip(),
            'tail': object_.strip()
        })
    return relations
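
To make the linearized format concrete, here is a compact sketch that parses the same `<triplet> head <subj> tail <obj> relation` layout with string partitioning. It is an illustrative re-implementation under that assumed layout, not the function above: `parse_linearized_triples` is a hypothetical name, and the sample string is hand-written rather than real model output.

```python
import re

def parse_linearized_triples(decoded):
    """Parse '<triplet> head <subj> tail <obj> relation' groups (illustrative sketch)."""
    # Drop sequence-level special tokens; '<subj>' is untouched because '<s>' must match exactly.
    cleaned = re.sub(r"</?s>|<pad>", " ", decoded)
    triples = []
    for block in cleaned.split("<triplet>")[1:]:
        head, _, rest = block.partition("<subj>")
        # One head can be followed by several '<subj> tail <obj> relation' pairs.
        for pair in rest.split("<subj>"):
            tail, separator, relation = pair.partition("<obj>")
            if separator and head.strip() and tail.strip() and relation.strip():
                triples.append({
                    "head": head.strip(),
                    "type": relation.strip(),
                    "tail": tail.strip(),
                })
    return triples

# Hand-written sample string (an assumption, not actual model output).
sample_output = "<s><triplet> Teagasc <subj> Ireland <obj> country <subj> 1988 <obj> inception</s>"
for triple in parse_linearized_triples(sample_output):
    print(triple)
```

The sample shows why the head is kept across `<subj>` tokens: a single `<triplet>` block can yield more than one relation for the same head entity.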

5. **Knowledge Base class.**

The class KB (Knowledge Base) manages relations and provides methods to add, merge, and print them.

In [6]:
class KB():
    def __init__(self): #Initializing the KB class with an empty list relations to store relationship dictionaries.
        self.relations = []

    def are_relations_equal(self, r1, r2): #Checking if two relations r1 and r2 are equal based on their head, type, and tail attributes.
        return all(r1[attr] == r2[attr] for attr in ["head", "type", "tail"])

    def exists_relation(self, r1): #Checking if a given relation r1 already exists in the relations list.
        return any(self.are_relations_equal(r1, r2) for r2 in self.relations)

    def merge_relations(self, r1): #Merging the spans of two equal relations (r1 and r2) by adding spans from r1 that are not already in r2.
        r2 = [r for r in self.relations
              if self.are_relations_equal(r1, r)][0]
        spans_to_add = [span for span in r1["meta"]["spans"]
                        if span not in r2["meta"]["spans"]]
        r2["meta"]["spans"] += spans_to_add


    def add_relation(self, r): #Adding a new relation r to self.relations if it doesn't already exist; otherwise merging its spans into the stored copy.
        if not self.exists_relation(r):
            self.relations.append(r)
        else:
            self.merge_relations(r)

    def print(self): #Printing all relations stored in self.relations in a formatted manner.
        print("Relations:")
        for r in self.relations:
            print(f"  {r}")
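
The `KB` class scans the whole list on every insert. A possible alternative, sketched below under the assumption that triples are compared only by `head`, `type`, and `tail`, keys a dictionary on that tuple so the lookup does not depend on the number of stored relations. `DictKB` is a hypothetical class, not the notebook's implementation, and the second span in the example is made up to trigger a merge.

```python
class DictKB:
    """Sketch of a knowledge base keyed by (head, type, tail); hypothetical alternative to KB."""

    def __init__(self):
        # Maps each unique triple to the list of spans it was extracted from.
        self.spans_by_triple = {}

    def add_relation(self, relation):
        key = (relation["head"], relation["type"], relation["tail"])
        spans = self.spans_by_triple.setdefault(key, [])
        # Merge step: only add spans that are not recorded yet.
        for span in relation["meta"]["spans"]:
            if span not in spans:
                spans.append(span)

    @property
    def relations(self):
        # Rebuild the same dictionary layout that KB.relations uses.
        return [
            {"head": head, "type": type_, "tail": tail, "meta": {"spans": spans}}
            for (head, type_, tail), spans in self.spans_by_triple.items()
        ]

demo_kb = DictKB()
demo_kb.add_relation({"head": "IFA2024", "type": "country", "tail": "Singapore",
                      "meta": {"spans": [[0, 128]]}})
# Same triple seen again in another (made-up) span: the spans are merged.
demo_kb.add_relation({"head": "IFA2024", "type": "country", "tail": "Singapore",
                      "meta": {"spans": [[77, 205]]}})
print(demo_kb.relations)
```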

6. **Function to organise the Knowledge Base.**

Tokenizes the input text, splits the tokens into overlapping spans, generates relations for each span using the pretrained model, and organizes them into a knowledge base (KB) object.

In [7]:
def from_text_to_kb(text, span_length=128, verbose=False):
    # Tokenizing the input text using a pretrained tokenizer (tokenizer) and returns the tokenized inputs as PyTorch tensors
    inputs = tokenizer([text], return_tensors="pt")

    # compute span boundaries based on the length of input_ids and span_length.
    num_tokens = len(inputs["input_ids"][0])
    if verbose:
        print(f"Input has {num_tokens} tokens")
    num_spans = math.ceil(num_tokens / span_length)
    if verbose:
        print(f"Input has {num_spans} spans")
    overlap = math.ceil((num_spans * span_length - num_tokens) /
                        max(num_spans - 1, 1))
    spans_boundaries = []
    start = 0
    for i in range(num_spans):
        spans_boundaries.append([start + span_length * i,
                                 start + span_length * (i + 1)])
        start -= overlap
    if verbose:
        print(f"Span boundaries are {spans_boundaries}")

    # Transforms the tokenized input into spans by slicing input_ids and attention_mask tensors based on the computed spans_boundaries
    tensor_ids = [inputs["input_ids"][0][boundary[0]:boundary[1]]
                  for boundary in spans_boundaries]
    tensor_masks = [inputs["attention_mask"][0][boundary[0]:boundary[1]]
                    for boundary in spans_boundaries]
    inputs = {
        "input_ids": torch.stack(tensor_ids),
        "attention_mask": torch.stack(tensor_masks)
    }

    # Generates relations using a pretrained model (model.generate). Parameters for the generation are passed through gen_kwargs.
    num_return_sequences = 3
    gen_kwargs = {
        "max_length": 256,
        "length_penalty": 0,
        "num_beams": 3,
        "num_return_sequences": num_return_sequences
    }
    generated_tokens = model.generate(
        **inputs,
        **gen_kwargs,
    )

    # Decodes the generated tokens into relations using the tokenizer (tokenizer.batch_decode).
    decoded_preds = tokenizer.batch_decode(generated_tokens,
                                           skip_special_tokens=False)

    # Initializes a knowledge base (kb), iterates over the decoded predictions, extracts relations using extract_relations_from_model_output, and adds them to the KB.
    kb = KB()
    i = 0
    for sentence_pred in decoded_preds:
        current_span_index = i // num_return_sequences
        relations = extract_relations_from_model_output(sentence_pred)
        for relation in relations:
            relation["meta"] = {
                "spans": [spans_boundaries[current_span_index]]
            }
            kb.add_relation(relation)
        i += 1

    return kb
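
The span arithmetic can be checked in isolation. The sketch below copies only the boundary computation from `from_text_to_kb` into a standalone function (`compute_span_boundaries` is a name introduced here for illustration) so that the overlap logic can be tried without loading the model.

```python
import math

def compute_span_boundaries(num_tokens, span_length=128):
    """Reproduce the boundary computation of from_text_to_kb (illustrative copy)."""
    num_spans = math.ceil(num_tokens / span_length)
    # Spread the surplus (num_spans * span_length - num_tokens) as overlap between spans.
    overlap = math.ceil((num_spans * span_length - num_tokens) / max(num_spans - 1, 1))
    boundaries = []
    start = 0
    for i in range(num_spans):
        boundaries.append([start + span_length * i, start + span_length * (i + 1)])
        start -= overlap  # Each later span is shifted left by one more overlap.
    return boundaries

print(compute_span_boundaries(205))  # 205 tokens: ceil(205 / 128) = 2 spans, overlap 51
print(compute_span_boundaries(300))  # 300 tokens: 3 spans, overlap 42
```

For 205 tokens the surplus is 2 * 128 - 205 = 51, so the second span starts at 128 - 51 = 77 and ends exactly at token 205.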

7. **Function to extract web pages.**

Extract the text of every URL using the newspaper library, then concatenate the texts into a single string.

In [8]:
from newspaper import Article, ArticleException

def article_extraction(urls):
    all_text = []  # Text content of each successfully parsed article.
    for url in urls:  # Download and parse each URL using the Article class from newspaper.
        try:
            article = Article(url)
            article.download()
            article.parse()
            all_text.append(article.text)  # Keep only the extracted text content.
        except ArticleException as e:
            print(f"Error downloading or parsing article: {e}")

    return all_text

texts = article_extraction(urls)

text = ''.join(texts) #Concatenating all the text strings in texts into a single string text.

Error downloading or parsing article: Article `download()` failed with 403 Client Error: Forbidden for url: https://www.oecd.org/agriculture/about/ on URL https://www.oecd.org/agriculture/about/
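
Joining with an empty string glues the last word of one page to the first word of the next, which is where fused words such as `ConferenceCelebrating` in the chunk output come from. A small sketch of the difference, using shortened stand-in strings rather than the real scraped pages:

```python
# Shortened stand-ins for two scraped pages (not the full article texts).
page_texts = [
    "thank the delegates who made it to Singapore for our Annual Conference",
    "Celebrating 100 Years in Business",
]

fused = "".join(page_texts)  # behaviour of ''.join(texts)
# Assumption: a newline between pages is acceptable, since clean_text later turns it into a space.
separated = "\n".join(t.strip() for t in page_texts if t.strip())

print(fused)
print(separated)
```

Switching the notebook to a separator would change the chunk contents, so the outputs shown in the following cells correspond to the empty-string join.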


8. **Text Cleaning**

Removing the special characters, emojis and any other unwanted characters.

In [9]:
import re

def clean_text(text):
    # Remove special characters and punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # Remove emojis
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F700-\U0001F77F"  # alchemical symbols
                               u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                               u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                               u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                               u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                               u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                               u"\U00002702-\U000027B0"  # Dingbats
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # dingbats
                               u"\u3030"
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)

    # Replace unwanted whitespace characters (newlines, tabs, form feeds) with a space
    # Add more characters inside [] to replace more unwanted characters
    text = re.sub(r'[\r\n\t\f\v]', ' ', text)

    return text.strip()  # Strip any leading or trailing whitespace


cleaned_text = clean_text(text)
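
The first substitution in `clean_text` removes every punctuation mark, including hyphens and apostrophes inside words, which is why forms like `farmrelated` and `Irelands` appear in the cleaned text. The sketch below contrasts that pattern with a gentler one; keeping hyphens and straight apostrophes is an assumption made here for illustration, not something the notebook does, and both function names are hypothetical.

```python
import re

def strip_all_punctuation(text):
    # Same pattern as the first step of clean_text.
    return re.sub(r"[^\w\s]", "", text)

def keep_word_internal_marks(text):
    # Assumed variant: hyphens and straight apostrophes survive, other punctuation is removed.
    return re.sub(r"[^\w\s'\-]", "", text)

sample_sentence = "Ireland's largest distributors, with farm-related solutions!"
print(strip_all_punctuation(sample_sentence))
print(keep_word_internal_marks(sample_sentence))
```

Curly apostrophes are not covered by the second pattern and would still be removed unless added to the character class.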


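9. **Splitting the complete text into smaller chunks.**

Since the rebel-large model cannot process texts over 1029 tokens, a helper function splits the text into chunks of a specified length. Note that `create_chunks` slices the string by characters rather than tokens, so each chunk holds at most 1029 characters; this stays well under the token limit (the first chunk tokenizes to 205 tokens in the test further down).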

In [10]:
def create_chunks(text, chunk_length):
    chunks = []

    for i in range(0, len(text), chunk_length):
        chunks.append(text[i:i+chunk_length])

    return chunks
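
Fixed-width slicing can cut a word or an entity name in half at a chunk boundary. A minimal sketch of a word-aware variant follows; `create_word_chunks` is a hypothetical helper, and it assumes that whitespace between words can be normalised to single spaces.

```python
def create_word_chunks(text, chunk_length):
    """Split text into chunks of at most chunk_length characters without breaking words."""
    chunks = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > chunk_length and current:
            # Adding this word would overflow the chunk, so close it and start a new one.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

print(create_word_chunks("Teagasc is based in Ireland", 12))
```

A single word longer than `chunk_length` still becomes its own oversized chunk, so this variant does not guarantee a hard upper bound.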


In [14]:
max_token_length = 1029
tokenized_text = create_chunks(cleaned_text, max_token_length)

In [15]:
print(tokenized_text[0])

And thats a wrap on IFA2024 in Singapore All of us at IFA would like to thank the 1189 delegates representing 552 companies from 72 countries who made it to Singapore for our Annual ConferenceCelebrating 100  Years in Business  Welcome to the IITC website We take pride in being one of Irelands largest Hardware Steel Salt and Agricultural Plastics and Plumbing distributors for the past 100 yearsWith our Jam Jar science project we hope to grow the smartest native oak trees in Ireland We are partnering with three National schools helping every student in 3rd class to grow their very own oak tree The acorn grows roots down into the jam jar and a hardy stem with leaves up into the air  all while just sitting on the class window sill The brains behind this initiative are our friends at OakyworldICMSA has represented farm families from all over Ireland at local national and European level with diligence passion integrity and an emphasis on finding solutions to their farmrelated problems for n

In [20]:
len(cleaned_text)

108478

**Testing from_text_to_kb on the first chunk.**

In [17]:
kb = from_text_to_kb(tokenized_text[0], verbose=True)
kb.print()

Input has 205 tokens
Input has 2 spans
Span boundaries are [[0, 128], [77, 205]]
Relations:
  {'head': 'IFA2024', 'type': 'country', 'tail': 'Singapore', 'meta': {'spans': [[0, 128]]}}
  {'head': 'IFA2024', 'type': 'location', 'tail': 'Singapore', 'meta': {'spans': [[0, 128]]}}
  {'head': 'IFA2024 in Singapore', 'type': 'country', 'tail': 'Singapore', 'meta': {'spans': [[0, 128]]}}
  {'head': 'Jam Jar', 'type': 'subclass of', 'tail': 'jam jar', 'meta': {'spans': [[77, 205]]}}
  {'head': 'oak tree', 'type': 'has parts of the class', 'tail': 'leaves', 'meta': {'spans': [[77, 205]]}}
  {'head': 'acorn', 'type': 'subclass of', 'tail': 'leaves', 'meta': {'spans': [[77, 205]]}}


10. **Creating complete knowledge base (KB).**

Creating a knowledge base (KB) for the whole text by splitting it into chunks and generating relations for each chunk with `from_text_to_kb`.

The function iterates over the chunks, extracts the relations from each one, and stores them in a dictionary keyed by chunk (`Chunk_1`, `Chunk_2`, ...).

In [18]:
def create_kb_from_text_chunks(text, max_token_length=1029):
    chunks = create_chunks(text, max_token_length)
    relations_dict = {}

    for idx, chunk in enumerate(chunks):
        kb = from_text_to_kb(chunk, verbose=False)
        relations_dict[f'Chunk_{idx+1}'] = kb.relations

    return relations_dict

In [21]:
max_token_length = 1029
relations_dict = create_kb_from_text_chunks(cleaned_text, max_token_length)

In [22]:
for chunk, relations in relations_dict.items():
    print(f"Relations for {chunk}:")
    for relation in relations:
        print(relation)

Relations for Chunk_1:
{'head': 'IFA2024', 'type': 'country', 'tail': 'Singapore', 'meta': {'spans': [[0, 128]]}}
{'head': 'IFA2024', 'type': 'location', 'tail': 'Singapore', 'meta': {'spans': [[0, 128]]}}
{'head': 'IFA2024 in Singapore', 'type': 'country', 'tail': 'Singapore', 'meta': {'spans': [[0, 128]]}}
{'head': 'Jam Jar', 'type': 'subclass of', 'tail': 'jam jar', 'meta': {'spans': [[77, 205]]}}
{'head': 'oak tree', 'type': 'has parts of the class', 'tail': 'leaves', 'meta': {'spans': [[77, 205]]}}
{'head': 'acorn', 'type': 'subclass of', 'tail': 'leaves', 'meta': {'spans': [[77, 205]]}}
Relations for Chunk_2:
{'head': 'democratic', 'type': 'subclass of', 'tail': 'nonpolitical', 'meta': {'spans': [[0, 128]]}}
{'head': 'democratic', 'type': 'subclass of', 'tail': 'nondenominational', 'meta': {'spans': [[0, 128]]}}
{'head': 'nonpolitical', 'type': 'subclass of', 'tail': 'democratic', 'meta': {'spans': [[0, 128]]}}
{'head': 'Dáil Éireann', 'type': 'has part', 'tail': 'Teachta Dála', 

11. **Storing the extracted relations in a dataframe.**

The lists of relation dictionaries are stored in a single column (`triples`) of a dataframe, with one row per chunk.

In [37]:
relation_df = pd.DataFrame(relations_dict.items())
relation_df = relation_df.rename(columns={0: 'chunk_id', 1: 'triples'})
relation_df.head()

  chunk_id                                            triples
0  Chunk_1  [{'head': 'IFA2024', 'type': 'country', 'tail'...
1  Chunk_2  [{'head': 'democratic', 'type': 'subclass of',...
2  Chunk_3  [{'head': 'Minister for Transport', 'type': 'p...
3  Chunk_4  [{'head': 'Shannon', 'type': 'located in the a...
4  Chunk_5  [{'head': 'Department of Agriculture Food and ...


12. **Flattening into a new dataframe.**

Extracting the relations for each chunk out of the dictionary and storing each part of the relation (head, type, and tail) as a separate column in a new dataframe.

In [38]:
flattened_data = []
for index, row in relation_df.iterrows():
    for d in row['triples']:
        flattened_data.append({
            'chunk_id': row['chunk_id'],
            'head': d['head'],
            'type': d['type'],
            'tail': d['tail']
        })

new_relation_df = pd.DataFrame(flattened_data)
print("\nNew DataFrame:")
print(new_relation_df)


New DataFrame:
      chunk_id                    head                    type       tail
0      Chunk_1                 IFA2024                 country  Singapore
1      Chunk_1                 IFA2024                location  Singapore
2      Chunk_1    IFA2024 in Singapore                 country  Singapore
3      Chunk_1                 Jam Jar             subclass of    jam jar
4      Chunk_1                oak tree  has parts of the class     leaves
..         ...                     ...                     ...        ...
649  Chunk_105  Farming and the Burren                owned by    Teagasc
650  Chunk_105  Farming and the Burren        publication date       2005
651  Chunk_106              NUI Galway                 country    Ireland
652  Chunk_106        Heritage Council                 country    Ireland
653  Chunk_106           Ashoka Fellow               inception       2011

[654 rows x 4 columns]
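
The same flattening can also be done directly on `relations_dict`, without the intermediate dataframe. A minimal sketch in plain Python; `flatten_relations` is a hypothetical helper, and the sample dictionary reuses three triples from the chunk output shown earlier.

```python
def flatten_relations(relations_by_chunk):
    """Turn {chunk_id: [relation, ...]} into one flat row per triple (drops the span metadata)."""
    return [
        {
            "chunk_id": chunk_id,
            "head": relation["head"],
            "type": relation["type"],
            "tail": relation["tail"],
        }
        for chunk_id, relations in relations_by_chunk.items()
        for relation in relations
    ]

sample_relations = {
    "Chunk_1": [
        {"head": "IFA2024", "type": "country", "tail": "Singapore", "meta": {"spans": [[0, 128]]}},
        {"head": "Jam Jar", "type": "subclass of", "tail": "jam jar", "meta": {"spans": [[77, 205]]}},
    ],
    "Chunk_2": [
        {"head": "democratic", "type": "subclass of", "tail": "nonpolitical", "meta": {"spans": [[0, 128]]}},
    ],
}

rows = flatten_relations(sample_relations)
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)` to obtain the four-column layout.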


13. **Exporting the relations to an Excel file.**

Exporting the extracted relations to an Excel file, to be used later to build a knowledge base in Neo4j. Saving the triples avoids rerunning the model on the same dataset.

In [40]:
excel_file = 'triples_data.xlsx'
new_relation_df.to_excel(excel_file, index=False)

print(f"Triples saved to {excel_file}")

Triples saved to triples_data.xlsx
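
Neo4j can also ingest CSV files through its `LOAD CSV` clause, so a CSV export is a possible alternative to the Excel file. The sketch below uses only the standard library and writes to an in-memory buffer; `triples_to_csv` is a hypothetical helper, and the two sample rows are taken from the dataframe printed above.

```python
import csv
import io

def triples_to_csv(rows):
    """Serialise flat triple rows to CSV text with a header line (illustrative sketch)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["chunk_id", "head", "type", "tail"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

sample_rows = [
    {"chunk_id": "Chunk_1", "head": "IFA2024", "type": "country", "tail": "Singapore"},
    {"chunk_id": "Chunk_106", "head": "NUI Galway", "type": "country", "tail": "Ireland"},
]

csv_text = triples_to_csv(sample_rows)
print(csv_text)
```

To write a file instead, the same writer can be pointed at `open('triples_data.csv', 'w', newline='')`, where the file name is only an example.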
