<a href="https://colab.research.google.com/github/Danzigerrr/MultiClass-Entity-Linking-System/blob/main/WordNet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WordNet - creating gold and silver seed sets

## Import libraries


In [None]:
import nltk
from nltk.corpus import wordnet as wn
from collections import deque

# Ensure NLTK data is downloaded
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Create Gold dataset

Select a seed set of the 200 highest-level nominal concepts from the WordNet hypernymy taxonomy.

In [None]:
# Function to get the highest-level nominal concepts (those without hypernyms)
def get_highest_level_nominals():
    # Get all noun synsets
    noun_synsets = list(wn.all_synsets(wn.NOUN))

    highest_level_nominals = []

    for synset in noun_synsets:
        # Keep synsets with no class hypernyms, or whose direct hypernym is 'entity'.
        # Note: named instances (e.g. battles) expose only instance_hypernyms(),
        # so hypernyms() is empty for them and they pass this check unfiltered.
        if not synset.hypernyms() or any(hyp.name().split('.')[0] == 'entity' for hyp in synset.hypernyms()):
            highest_level_nominals.append(synset)

        if len(highest_level_nominals) >= 200:  # Adjust to 200 as per your requirement
            break

    return highest_level_nominals

# Get the seed set of 200 highest-level nominal concepts
highest_level_nominals = get_highest_level_nominals()

# Sort the concepts alphabetically by their names
highest_level_nominals_sorted = sorted(highest_level_nominals, key=lambda synset: synset.name())

# Print the sorted seed set with a counter
print(f"\nTotal {len(highest_level_nominals_sorted)} highest-level nominal concepts:")
for idx, synset in enumerate(highest_level_nominals_sorted, start=1):
    print(f"{idx}. Synset: {synset.name()}, Definition: {synset.definition()}")



Total 200 highest-level nominal concepts:
1. Synset: abstraction.n.06, Definition: a general concept formed by extracting common features from specific examples
2. Synset: actium.n.02, Definition: the naval battle in which Antony and Cleopatra were defeated by Octavian's fleet under Agrippa in 31 BC
3. Synset: aegates_isles.n.01, Definition: islands west of Sicily (now known as the Egadi Islands) where the Romans won a naval victory over the Carthaginians that ended the first Punic War in 241 BC
4. Synset: aegospotami.n.02, Definition: a river in ancient Thrace (now Turkey); in the mouth of this river the Spartan fleet under Lysander destroyed the Athenian fleet in the final battle of the Peloponnesian War (404 BC)
5. Synset: agincourt.n.01, Definition: a battle in northern France in which English longbowmen under Henry V decisively defeated a much larger French army in 1415
6. Synset: alamo.n.01, Definition: a siege and massacre at a mission in San Antonio in 1836; Mexican forces und
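
The list above is dominated by named battles because WordNet links instances to their class through `instance_hypernyms()` rather than `hypernyms()`, so `not synset.hypernyms()` is true for them. Below is a minimal sketch of one possible stricter check. `is_taxonomy_root` and the `StubSynset` stand-in are hypothetical names, introduced only so the example runs without the WordNet corpus. The toy relations are simplified and are not the real WordNet chain. The function relies only on the two methods shown, so it should also accept real NLTK synsets.

```python
from dataclasses import dataclass, field


@dataclass
class StubSynset:
    """Hypothetical stand-in exposing the two NLTK Synset methods used below."""
    name_: str
    hyper: list = field(default_factory=list)
    inst_hyper: list = field(default_factory=list)

    def name(self):
        return self.name_

    def hypernyms(self):
        return self.hyper

    def instance_hypernyms(self):
        return self.inst_hyper


def is_taxonomy_root(synset):
    """True only if the synset has neither class nor instance hypernyms."""
    return not synset.hypernyms() and not synset.instance_hypernyms()


# Toy data (assumed, simplified): a root, a class below it, and a named instance
entity = StubSynset("entity.n.01")
battle = StubSynset("battle.n.01", hyper=[entity])
actium = StubSynset("actium.n.02", inst_hyper=[battle])

roots = [s.name() for s in (entity, battle, actium) if is_taxonomy_root(s)]
print(roots)  # ['entity.n.01']
```

Applied inside `get_highest_level_nominals`, a check of this kind would drop instance synsets such as `actium.n.02` from the seed set. Whether that is desirable depends on how the gold set is meant to be annotated.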

Print the top-level concepts beneath `physical_entity.n.01`, a direct hyponym of the root `entity.n.01`.

In [None]:
import nltk
from nltk.corpus import wordnet

def print_hierarchy(synset, depth=1, indent=0, count=0):
  """Prints the hierarchy of a given WordNet synset up to a specified depth and counts the concepts.

  Args:
    synset: The WordNet synset to print.
    depth: The maximum depth to print.
    indent: The indentation level for the current synset.
    count: The current count of concepts.

  Returns:
    The updated count of concepts.
  """

  if depth == 0:
    return count

  print(f"{'  ' * indent}{synset.name()}  {synset.definition()}")
  count += 1

  for hyponym in synset.hyponyms():
    count = print_hierarchy(hyponym, depth - 1, indent + 1, count)

  return count

# Get the synset for 'physical_entity', used here as the root of the printed hierarchy
entity_synset = wordnet.synset('physical_entity.n.01')

# Set the desired depth for the hierarchy
depth = 3  # Adjust this to control the depth of the hierarchy

# Print the hierarchy and count the concepts
total_concepts = print_hierarchy(entity_synset, depth)

print(f"\nTotal concepts: {total_concepts}")

physical_entity.n.01  an entity that has physical existence
  thing.n.12  a separate and self-contained entity
    unit.n.05  a single undivided natural thing occurring in the composition of something else
    body_of_water.n.01  the part of the earth's surface covered with water (such as a river or lake or ocean)
    variable.n.01  something that is likely to vary; something that is subject to variation
    necessity.n.02  anything indispensable
    reservoir.n.04  anything (a person or animal or plant or substance) in which an infectious agent normally lives and multiplies
    inessential.n.01  anything that is not essential
    subject.n.02  something (a person or object or scene) selected by an artist or photographer for graphic representation
    part.n.03  a portion of a natural object
  causal_agent.n.01  any entity that produces an effect or is responsible for events or results
    catalyst.n.02  something that causes an important event to happen
    cause_of_death.n.01  the ca

## Silver dataset creation
Expand the gold dataset into a larger silver set with a breadth-first search over hyponyms.

In [None]:
# Example: Create a BFS function to expand seed set
def expand_seed_set_bfs(seed_set, target_size=1000):
    # Initialize a queue for BFS
    queue = deque(seed_set)
    expanded_set = set(seed_set)  # Set to avoid duplicates
    visited = set(seed_set)  # Track visited synsets

    while queue and len(expanded_set) < target_size:
        current_synset = queue.popleft()

        # Get all hyponyms (children) of the current synset
        for hyponym in current_synset.hyponyms():
            if hyponym not in visited:
                visited.add(hyponym)
                queue.append(hyponym)
                expanded_set.add(hyponym)

                # Stop if we've reached the desired size
                if len(expanded_set) >= target_size:
                    break

    return expanded_set

# In the full pipeline the seed set is the 200 manually annotated gold concepts.
# Two synsets are used here as a small example.
seed_set = [wn.synset('abstraction.n.06'), wn.synset('actium.n.02')]  # Your seed set here

# Expand seed set using BFS to get a silver set of around 1000 concepts
silver_seed_set = expand_seed_set_bfs(seed_set, target_size=1000)

# Print the expanded seed set (optional)
print(f"Total {len(silver_seed_set)} concepts in the expanded silver seed set:")
for idx, synset in enumerate(silver_seed_set, start=1):
    print(f"{idx}. Synset: {synset.name()}, Definition: {synset.definition()}")


Total 1000 concepts in the expanded silver seed set:
1. Synset: second.n.01, Definition: 1/60 of a minute; the basic unit of time adopted under the Systeme International d'Unites
2. Synset: radical.n.04, Definition: (mathematics) a quantity expressed as the root of another quantity
3. Synset: outwardness.n.02, Definition: the quality or state of being outside or directed toward or relating to the outside or exterior
4. Synset: unconfessed.n.01, Definition: people who have not confessed
5. Synset: officialese.n.01, Definition: the style of writing characteristic of some government officials: formal and obscure
6. Synset: self-expression.n.01, Definition: the expression of one's individuality (usually through creative activities)
7. Synset: severalty.n.02, Definition: exclusive individual ownership
8. Synset: changelessness.n.02, Definition: the quality of being unchangeable; having a marked tendency to remain unchanged
9. Synset: stockholding.n.02, Definition: ownership of stocks; the s
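
Because `expanded_set` is a Python `set`, the order printed above is arbitrary and does not reflect BFS depth. Below is a minimal sketch of a variant that returns concepts in the order they were reached. `expand_in_order` and the toy `children` mapping are hypothetical. With WordNet, the child lookup would be `lambda s: s.hyponyms()`.

```python
from collections import deque


def expand_in_order(seeds, get_children, target_size):
    """Breadth-first expansion that returns nodes in the order they were reached."""
    # Unique seeds in their original order, truncated to the size cap
    ordered = list(dict.fromkeys(seeds))[:target_size]
    visited = set(ordered)
    queue = deque(ordered)

    while queue and len(ordered) < target_size:
        node = queue.popleft()
        for child in get_children(node):
            if child not in visited:
                visited.add(child)
                ordered.append(child)
                queue.append(child)
                # Return as soon as the cap is reached
                if len(ordered) >= target_size:
                    return ordered
    return ordered


# Toy taxonomy (assumed data, not taken from WordNet)
children = {
    "entity": ["physical_entity", "abstraction"],
    "physical_entity": ["object", "process"],
    "abstraction": ["attribute", "measure"],
}

result = expand_in_order(["entity"], lambda n: children.get(n, []), target_size=5)
print(result)
# ['entity', 'physical_entity', 'abstraction', 'object', 'process']
```

Returning a list keeps shallower concepts first, which makes it easier to inspect how far the expansion went before hitting the cap.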

In [None]:
import nltk
from nltk.corpus import wordnet as wn
from collections import deque

# Ensure NLTK data is downloaded
nltk.download('wordnet')

# Function to expand a set of seed concepts using BFS based on hyponymy
def expand_bfs(seed_set, max_concepts=1000):
    expanded_concepts = set(seed_set)  # Use a set to avoid duplicates
    queue = deque(seed_set)

    while queue and len(expanded_concepts) < max_concepts:
        current_concept = queue.popleft()

        # Get hyponyms (children) of the current concept
        hyponyms = current_concept.hyponyms()

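        for hyponym in hyponyms:
            if hyponym not in expanded_concepts:
                expanded_concepts.add(hyponym)
                queue.append(hyponym)
                # Stop as soon as the cap is reached so the result never exceeds max_concepts
                if len(expanded_concepts) >= max_concepts:
                    break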

    return expanded_concepts

# Function to assign a NER class to each concept
# In this case, we simulate the NER class as a simple example
def assign_ner_class(concept):
    # Simulated NER class assignment based on the synset name (for example)
    # In practice, you would integrate with BabelNet or a similar resource for better classification
    return {
        'concept': concept.name(),
        'class': 'general_noun'  # Placeholder, assign actual class based on your system
    }

# Example seed set (200 highest-level nominal concepts) - this should be your previously generated set
# For now, let's assume highest_level_nominals contains the seed synsets
highest_level_nominals = get_highest_level_nominals()  # Previously defined function to get your top 200 concepts

# Expand the seed set using BFS
expanded_concepts = expand_bfs(highest_level_nominals)

# Create a list of dictionaries containing concepts and their assigned NER class
concepts_with_classes = [assign_ner_class(concept) for concept in expanded_concepts]

# Print the first few examples of expanded concepts with their NER class
print(concepts_with_classes[:10])  # Example output


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


[{'concept': 'radical.n.04', 'class': 'general_noun'}, {'concept': 'abator.n.01', 'class': 'general_noun'}, {'concept': 'cutting.n.04', 'class': 'general_noun'}, {'concept': 'cashier.n.02', 'class': 'general_noun'}, {'concept': 'second_crusade.n.01', 'class': 'general_noun'}, {'concept': 'atlanta.n.02', 'class': 'general_noun'}, {'concept': 'stiffening.n.02', 'class': 'general_noun'}, {'concept': 'learner.n.01', 'class': 'general_noun'}, {'concept': 'new_river_gorge_bridge.n.01', 'class': 'general_noun'}, {'concept': 'change.n.06', 'class': 'general_noun'}]
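
Every concept above receives the same placeholder class. As one possible interim step before integrating an external resource, the class could be derived from the WordNet lexicographer file, which NLTK exposes through `synset.lexname()` (values such as `noun.person` or `noun.location`). The sketch below is an assumption and not the notebook's method. `LEXNAME_TO_CLASS`, `ner_class_from_lexname`, the class labels, and the example concept names are all hypothetical. It works on plain strings so that it runs without the WordNet corpus.

```python
# Hypothetical mapping from WordNet lexicographer files to coarse classes.
# The labels are placeholders, not the project's actual class inventory.
LEXNAME_TO_CLASS = {
    "noun.person": "PERSON",
    "noun.location": "LOCATION",
    "noun.group": "ORGANIZATION",
    "noun.event": "EVENT",
}


def ner_class_from_lexname(lexname, default="general_noun"):
    """Map a lexicographer file name to a coarse class, falling back to the default."""
    return LEXNAME_TO_CLASS.get(lexname, default)


def assign_ner_class_by_lexname(name, lexname):
    """Build the same record shape as assign_ner_class, using the lexname lookup."""
    return {"concept": name, "class": ner_class_from_lexname(lexname)}


# Hand-written (name, lexname) pairs for illustration only
examples = [
    ("some_person.n.01", "noun.person"),
    ("some_quality.n.01", "noun.attribute"),
]
records = [assign_ner_class_by_lexname(name, lexname) for name, lexname in examples]
print(records)
```

With real synsets the call would be `assign_ner_class_by_lexname(concept.name(), concept.lexname())`.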


## TTL file and ontology

In [None]:
!pip install rdflib

Collecting rdflib
  Downloading rdflib-7.1.1-py3-none-any.whl.metadata (11 kB)
Collecting isodate<1.0.0,>=0.7.2 (from rdflib)
  Downloading isodate-0.7.2-py3-none-any.whl.metadata (11 kB)
Downloading rdflib-7.1.1-py3-none-any.whl (562 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 562.4/562.4 kB 7.1 MB/s eta 0:00:00
Downloading isodate-0.7.2-py3-none-any.whl (22 kB)
Installing collected packages: isodate, rdflib
Successfully installed isodate-0.7.2 rdflib-7.1.1


In [None]:
from rdflib import Graph, Namespace

# Load the Turtle file
ttl_file_path = "/content/ontology.ttl"
g = Graph()
g.parse(ttl_file_path, format="turtle")

# Define the namespaces
OWL = Namespace("http://www.w3.org/2002/07/owl#")
RDFS = Namespace("http://www.w3.org/2000/01/rdf-schema#")


def print_hierarchy(parent_class, level=0, max_depth=3):
  """
  Recursively prints the hierarchy of subclasses up to a specified maximum depth.

  Args:
      parent_class (rdflib.term.URIRef): The parent class.
      level (int, optional): The current indentation level. Defaults to 0.
      max_depth (int, optional): The maximum depth to print. Defaults to 3.
  """
  indent = "-" * level
  for subclass, superclass in g.subject_objects(RDFS.subClassOf):
    if superclass == parent_class:
      # Get the comment value
      comment = g.value(subclass, RDFS.comment)
      subclass_name = subclass.split("http://dbpedia.org/ontology/")[1]
      print(f"{indent} {subclass_name}")
      if level < max_depth:  # Check if level is below max depth before recursion
        print_hierarchy(subclass, level + 1, max_depth)


# Print the subclass hierarchy under owl:Thing; max_depth=2 yields indentation levels 0 to 2
print("Subclasses of owl:Thing (up to 2 levels):")
print_hierarchy(OWL.Thing, max_depth=2)

Subclasses of owl:Thing (up to 2 levels):
 Activity
- Game
-- BoardGame
-- CardGame
- Sales
- Sport
-- Athletics
-- TeamSport
 Agent
- Deity
- Employer
- Family
-- NobleFamily
- FictionalCharacter
-- ComicsCharacter
-- DisneyCharacter
-- MythologicalFigure
-- NarutoCharacter
-- SoapCharacter
- Organisation
-- Broadcaster
-- Company
-- EducationalInstitution
-- EmployersOrganisation
-- GeopoliticalOrganisation
-- GovernmentAgency
-- Group
-- InternationalOrganisation
-- Legislature
-- MilitaryUnit
-- Non-ProfitOrganisation
-- Parliament
-- PoliticalParty
-- ReligiousOrganisation
-- SambaSchool
-- SportsClub
-- SportsLeague
-- SportsTeam
-- TermOfOffice
-- TradeUnion
 Algorithm
 Altitude
 AnatomicalStructure
- Artery
- BloodVessel
- Bone
- Brain
- Embryology
- Ligament
- Lymph
- Muscle
- Nerve
- Vein
 ArchitecturalStructure
- AmusementParkAttraction
-- RollerCoaster
-- WaterRide
- Arena
- Building
-- Casino
-- Castle
-- Factory
-- HistoricBuilding
-- Hotel
-- Museum
-- Prison
-- Religiou