#### Demonstrating coreference resolution, anaphora resolution, coherence modelling, rhetorical structure theory (RST), discourse parsing, entity tracking, lexical chains, topic modelling, cohesion analysis, and temporal relation identification.

#### Coreference resolution (code demo)

- Most of these techniques use popular NLP libraries in Python, such as spaCy, NLTK, Transformers, Gensim, and others.
- Here's a brief idea of how each technique can be demonstrated with code.

In [None]:
# Install the package needed for coreference resolution.
# Skip this cell if coreferee is already installed on your system.
!python -m pip install coreferee

In [None]:
# Installs the English coreference resolution model for the coreferee package.
!python -m coreferee install en

In [None]:
# Download the large English model ('en_core_web_lg') for spaCy.
# This model includes word vectors and is more accurate for NER and POS tagging.
# It is larger and more powerful than the smaller models (like 'en_core_web_sm'), but it needs more memory.
# We do not load it directly in the main code, but without this step the code threw an error,
# so it is a required step.
!python -m spacy download en_core_web_lg

In [11]:
# Code to ignore warnings. 
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [23]:
# Import the required package. 
import spacy
# Load the transformer-based English model.
# We use it as-is for now and take it up in detail in the later chapters.
nlp = spacy.load('en_core_web_trf')

""" 
Add 'coreferee' to the pipeline.
'coreferee' is a spaCy extension that enables the identification and linking of coreferences in a text.
Adding it to the pipeline so that coreference resolution can be performed after other NLP tasks like tokenization and parsing.
"""
nlp.add_pipe('coreferee')
# Create the sample text for demo. 
sample_doc = nlp(""" Although she was very busy at office work, Mary felt she had had enough of it. 
                     She and her spouse decided they needed to go on a holiday. 
                     They travelled by train to France because they had enough friends in the country.""")
print("")
print("OUTPUT \n")
sample_doc._.coref_chains.print()


OUTPUT 

0: she(2), Mary(10), she(12), She(20), her(22)
1: work(8), it(17)
2: [She(20); spouse(23)], they(25), They(34), they(41)
3: France(39), country(47)


The output above is not easy to read at first glance. Each line is one coreference chain, and the number in parentheses is the token's index in the document. The first line (indexed 0) tells us that the pronouns she(2), she(12), She(20), and her(22) refer to the same name, Mary(10). Similarly, the second line (indexed 1) tells us that it(17) stands for work(8). At index 2, they(25), They(34), and they(41) refer to the pair [She(20); spouse(23)]. Finally, at index 3, country(47) stands for France(39). Notice that every pronoun in sample_doc is linked to the noun or name it refers to; this linking is the coreference resolution process.
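To make the reading of the chains concrete, here is a minimal sketch that turns the printed chains into a lookup from token index to antecedent. It does not call coreferee; the chains are copied by hand from the output above, and the names `chains`, `PRONOUNS`, `is_pronoun`, and `antecedent_map` are our own illustrative assumptions.

```python
# Chains copied by hand from the printed output above. Each mention is a list
# of (token_text, token_index) pairs; a coordinated mention has two pairs.
chains = [
    [[("she", 2)], [("Mary", 10)], [("she", 12)], [("She", 20)], [("her", 22)]],
    [[("work", 8)], [("it", 17)]],
    [[("She", 20), ("spouse", 23)], [("they", 25)], [("They", 34)], [("they", 41)]],
    [[("France", 39)], [("country", 47)]],
]

# Hypothetical pronoun list, sufficient for this sample text only.
PRONOUNS = {"she", "her", "it", "they"}


def is_pronoun(mention):
    """A mention counts as a pronoun if it is a single token from PRONOUNS."""
    return len(mention) == 1 and mention[0][0].lower() in PRONOUNS


def antecedent_map(chains):
    """Map each token index to the first non-pronoun mention of its chain."""
    resolved = {}
    for chain in chains:
        anchors = [m for m in chain if not is_pronoun(m)]
        if not anchors:
            continue
        anchor = anchors[0]
        anchor_text = " and ".join(text for text, _ in anchor)
        for mention in chain:
            if mention is anchor:
                continue
            for _, index in mention:
                resolved[index] = anchor_text
    return resolved


resolved = antecedent_map(chains)
for index in sorted(resolved):
    print(f"token {index} -> {resolved[index]}")
```

Picking the first non-pronoun mention as the antecedent is a simplification that happens to work for this text; real systems choose the most representative mention more carefully.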

#### Rhetorical structure theory (RST) (code demo)

In [None]:
!pip install stanza

In [42]:
import stanza

# Download and set up the Stanford NLP model.
stanza.download('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

2024-09-08 16:28:37 INFO: Downloaded file to C:\Users\Shailendra Kadre\stanza_resources\resources.json
2024-09-08 16:28:37 INFO: Downloading default packages for language: en (English) ...
2024-09-08 16:28:38 INFO: File exists: C:\Users\Shailendra Kadre\stanza_resources\en\default.zip
2024-09-08 16:28:40 INFO: Finished downloading models and saved to C:\Users\Shailendra Kadre\stanza_resources


In [43]:
# Initialize a Stanza Pipeline for processing English text.
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma,depparse')

2024-09-08 16:30:35 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

2024-09-08 16:30:36 INFO: Downloaded file to C:\Users\Shailendra Kadre\stanza_resources\resources.json
2024-09-08 16:30:36 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |
| depparse  | combined_charlm   |

2024-09-08 16:30:36 INFO: Using device: cpu
2024-09-08 16:30:36 INFO: Loading: tokenize
2024-09-08 16:30:36 INFO: Loading: mwt
2024-09-08 16:30:36 INFO: Loading: pos
2024-09-08 16:30:36 INFO: Loading: lemma
2024-09-08 16:30:36 INFO: Loading: depparse
2024-09-08 16:30:37 INFO: Done loading processors!


In [44]:
# Process a text
doc = nlp("Your example sentence goes here.")
for sentence in doc.sentences:
    print("Tokens:", [word.text for word in sentence.words])
    print("POS Tags:", [word.pos for word in sentence.words])
    print("Dependencies:", [(word.text, word.deprel) for word in sentence.words])

Tokens: ['Your', 'example', 'sentence', 'goes', 'here', '.']
POS Tags: ['PRON', 'NOUN', 'NOUN', 'VERB', 'ADV', 'PUNCT']
Dependencies: [('Your', 'nmod:poss'), ('example', 'compound'), ('sentence', 'nsubj'), ('goes', 'root'), ('here', 'advmod'), ('.', 'punct')]


In [46]:
# Create the sample text for demo. 
sample_text = """ Although she was very busy at office work, Mary felt she had had enough of it. 
                     She and her spouse decided they needed to go on a holiday. 
                     They travelled by train to France because they had enough friends in the country."""

# Process the text (run the pipeline once, on the raw string).
doc = nlp(sample_text)

# Print out tokens, POS tags, and dependency relations.
for sentence in doc.sentences:
    print("Sentence:", " ".join([word.text for word in sentence.words]))
    print("Tokens:", [word.text for word in sentence.words])
    print("POS Tags:", [word.pos for word in sentence.words])
    print("Dependencies:", [(word.text, word.deprel) for word in sentence.words])
    print("-----")

Sentence: Although she was very busy at office work , Mary felt she had had enough of it .
Tokens: ['Although', 'she', 'was', 'very', 'busy', 'at', 'office', 'work', ',', 'Mary', 'felt', 'she', 'had', 'had', 'enough', 'of', 'it', '.']
POS Tags: ['SCONJ', 'PRON', 'AUX', 'ADV', 'ADJ', 'ADP', 'NOUN', 'NOUN', 'PUNCT', 'PROPN', 'VERB', 'PRON', 'AUX', 'VERB', 'ADJ', 'ADP', 'PRON', 'PUNCT']
Dependencies: [('Although', 'mark'), ('she', 'nsubj'), ('was', 'cop'), ('very', 'advmod'), ('busy', 'advcl'), ('at', 'case'), ('office', 'compound'), ('work', 'obl'), (',', 'punct'), ('Mary', 'nsubj'), ('felt', 'root'), ('she', 'nsubj'), ('had', 'aux'), ('had', 'ccomp'), ('enough', 'obj'), ('of', 'case'), ('it', 'obl'), ('.', 'punct')]
-----
Sentence: She and her spouse decided they needed to go on a holiday .
Tokens: ['She', 'and', 'her', 'spouse', 'decided', 'they', 'needed', 'to', 'go', 'on', 'a', 'holiday', '.']
POS Tags: ['PRON', 'CCONJ', 'PRON', 'NOUN', 'VERB', 'PRON', 'VERB', 'PART', 'VERB', 'ADP', 

We will take the middle sentence for elaborations on the output.

Sentence: "She and her spouse decided they needed to go on a holiday."

Tokens and POS Tags:
Tokens: ['She', 'and', 'her', 'spouse', 'decided', 'they', 'needed', 'to', 'go', 'on', 'a', 'holiday', '.']
POS Tags: ['PRON', 'CCONJ', 'PRON', 'NOUN', 'VERB', 'PRON', 'VERB', 'PART', 'VERB', 'ADP', 'DET', 'NOUN', 'PUNCT']
PRON (Pronoun): 'She', 'her', 'they'
CCONJ (Coordinating Conjunction): 'and'
NOUN (Noun): 'spouse', 'holiday'
VERB (Verb): 'decided', 'needed', 'go'
PART (Particle): 'to'
ADP (Adposition): 'on'
DET (Determiner): 'a'
PUNCT (Punctuation): '.'

Dependencies:
('She', 'nsubj'): 'She' is the nominal subject of the main verb 'decided'.
('and', 'cc'): 'and' is a coordinating conjunction linking 'She' and 'her spouse'.
('her', 'nmod:poss'): 'her' is a possessive modifier for 'spouse'.
('spouse', 'conj'): 'spouse' is a conjunct connected to 'She' by 'and'.
('decided', 'root'): 'decided' is the main verb (root) of the sentence.
('they', 'nsubj'): 'they' is the subject of the embedded verb 'needed'.
('needed', 'ccomp'): 'needed' is a clausal complement of 'decided'.
('to', 'mark'): 'to' is a marker for the infinitive verb 'go'.
('go', 'xcomp'): 'go' is an open clausal complement of 'needed'.
('on', 'case'): 'on' is a preposition marking the case of 'holiday'.
('a', 'det'): 'a' is a determiner for 'holiday'.
('holiday', 'obl'): 'holiday' is an oblique nominal that modifies the verb 'go' and is introduced by the preposition 'on'.
('.', 'punct'): '.' is punctuation marking the end of the sentence.

With this information in hand, you can apply RST principles and complete the RST analysis manually. Stanza provides valuable syntactic information. However, an end-to-end RST analysis needs additional steps or tools to explicitly recognise and categorise the rhetorical relationships between different parts of the text. More advanced tools exist that automate parts of this process, but to our knowledge no ready-to-use Python library for complete RST analysis is available at the time of writing.
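As one illustration of such a manual step, the sketch below splits the first sample sentence into a nucleus and a satellite, two core RST notions, using only the (word, deprel) pairs printed by Stanza above. This is a toy heuristic under stated assumptions, not an RST parser: the marker-to-relation table `RELATION_BY_MARKER` and the function `split_nucleus_satellite` are hypothetical names of our own, and the rule only handles a sentence-initial marker followed by a comma.

```python
# (word, deprel) pairs copied from the Stanza output for the first sentence.
pairs = [
    ('Although', 'mark'), ('she', 'nsubj'), ('was', 'cop'), ('very', 'advmod'),
    ('busy', 'advcl'), ('at', 'case'), ('office', 'compound'), ('work', 'obl'),
    (',', 'punct'), ('Mary', 'nsubj'), ('felt', 'root'), ('she', 'nsubj'),
    ('had', 'aux'), ('had', 'ccomp'), ('enough', 'obj'), ('of', 'case'),
    ('it', 'obl'), ('.', 'punct'),
]

# Hypothetical toy mapping from a subordinating marker to a relation name.
RELATION_BY_MARKER = {"although": "Concession", "because": "Cause"}


def split_nucleus_satellite(pairs):
    """Split 'Marker A, B' into satellite (A) and nucleus (B), or return None."""
    first_text, first_rel = pairs[0]
    if first_rel != "mark" or first_text.lower() not in RELATION_BY_MARKER:
        return None
    comma = next(
        (i for i, (text, rel) in enumerate(pairs) if text == "," and rel == "punct"),
        None,
    )
    if comma is None:
        return None
    satellite = " ".join(text for text, _ in pairs[:comma])
    nucleus = " ".join(text for text, rel in pairs[comma + 1:] if rel != "punct")
    return {
        "relation": RELATION_BY_MARKER[first_text.lower()],
        "satellite": satellite,
        "nucleus": nucleus,
    }


result = split_nucleus_satellite(pairs)
print(result)
```

The subordinate "Although" clause is treated as the satellite because it supports the main clause, which carries the central message of the sentence.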

#### Discourse parsing (code demo)

To our knowledge, no dedicated discourse parsing library is currently available in Python. Practitioners often combine advanced NLP libraries such as spaCy or AllenNLP with deep learning techniques. We will stick to spaCy for a simplified demo and base our code on the following logic.

1. Load spaCy Model
   - Initialize spaCy's English language model

2. Define Sample Text
   - Set a string variable with the sample text

3. Process Text
   - Use spaCy to analyze the sample text
   - Split the text into sentences

4. Define Function to Extract Discourse Information
   - For each sentence in the processed text:
     a. Print the sentence
     b. Extract and print named entities (if any)
     c. Extract and print syntactic dependencies for each token
     d. Infer basic discourse relations based on keywords:
        - If the sentence contains the word "because":
          - Record that this sentence provides a reason for the previous sentence
        - If the sentence contains the word "although":
          - Record that this sentence contrasts with the previous sentence

5. Call Function to Extract and Display Discourse Information
   - Print inferred discourse relations

In [None]:
!python -m spacy download en_core_web_sm

In [53]:
import spacy

# Load SpaCy's English model.
nlp = spacy.load('en_core_web_sm')

# Sample text.
text = """
Although she was very busy with office work, Mary felt she had had enough of it. 
She and her spouse decided they needed to go on a holiday. 
They travelled by train to France because they had enough friends in the country.
"""

# Process the text with SpaCy.
doc = nlp(text)

# Function to extract and display discourse-like information
def extract_discourse_info(doc):
    sentences = list(doc.sents)
    relations = []

    for i, sent in enumerate(sentences):
        print(f"Sentence {i+1}: {sent.text}")

        # Extract named entities.
        entities = [(ent.text, ent.label_) for ent in sent.ents]
        if entities:
            print("  Named Entities:", entities)

        # Extract syntactic dependencies
        dependencies = [(token.text, token.dep_, token.head.text) for token in sent]
        print("  Dependencies:", dependencies)

        # Basic inference of discourse relations.
        if i > 0:
            previous_sent = sentences[i-1]
            if "because" in sent.text.lower():
                relations.append(f"Sentence {i+1} provides a reason for Sentence {i}")
            elif "although" in sent.text.lower():
                relations.append(f"Sentence {i+1} contrasts with Sentence {i}")

    return relations

# Extract and print discourse information.
relations = extract_discourse_info(doc)
print("\nInferred Discourse Relations:")
for relation in relations:
    print(relation)

Sentence 1: 

  Dependencies: [('\n', 'dep', '\n')]
Sentence 2: Although she was very busy with office work, Mary felt she had had enough of it. 

  Named Entities: [('Mary', 'PERSON')]
  Dependencies: [('Although', 'mark', 'was'), ('she', 'nsubj', 'was'), ('was', 'advcl', 'felt'), ('very', 'advmod', 'busy'), ('busy', 'acomp', 'was'), ('with', 'prep', 'busy'), ('office', 'compound', 'work'), ('work', 'pobj', 'with'), (',', 'punct', 'felt'), ('Mary', 'nsubj', 'felt'), ('felt', 'ROOT', 'felt'), ('she', 'nsubj', 'had'), ('had', 'aux', 'had'), ('had', 'ccomp', 'felt'), ('enough', 'dobj', 'had'), ('of', 'prep', 'enough'), ('it', 'pobj', 'of'), ('.', 'punct', 'felt'), ('\n', 'dep', '.')]
Sentence 3: She and her spouse decided they needed to go on a holiday. 

  Named Entities: [('a holiday', 'DATE')]
  Dependencies: [('She', 'nsubj', 'decided'), ('and', 'cc', 'She'), ('her', 'poss', 'spouse'), ('spouse', 'conj', 'She'), ('decided', 'ROOT', 'decided'), ('they', 'nsubj', 'needed'), ('needed', 

Discourse parsing deals with how different parts of a text relate to one another through logical and communicative connections. The main part of the output is the "discourse information."

Note that the sample text begins with a line break, which spaCy treats as a separate, empty Sentence 1. The three real sentences are therefore numbered 2 to 4 in the output:
(Sentence 2) Although she was very busy with office work, Mary felt she had had enough of it. 
(Sentence 3) She and her spouse decided they needed to go on a holiday. 
(Sentence 4) They travelled by train to France because they had enough friends in the country.

The following is the summary of the discourse relations that the keyword rules infer:
- Contrasting Relation: Sentence 2 contains the word "Although", so the code records that it contrasts with Sentence 1. Strictly speaking, the contrast holds inside Sentence 2, between being very busy with office work and feeling she had had enough of it.
- Reason Relation: Sentence 4 contains the word "because", so the code records that it provides a reason for Sentence 3. Strictly speaking, the "because" clause explains why they travelled to France, and the trip itself follows from the decision made in Sentence 3.

#### Entity tracking (code demo)

Entity tracking deals with keeping track of the entities (like people, places, or objects) mentioned throughout a text. It's like keeping track of characters throughout a novel. Here's a simple code demo using spaCy for Named Entity Recognition (NER) and tracking those entities.

In [55]:
import spacy

# Load the English NLP model
nlp = spacy.load('en_core_web_sm')

# Sample text.
text = """
Although she was very busy with office work, Mary felt she had had enough of it. 
She and her spouse decided they needed to go on a holiday. 
They travelled by train to France because they had enough friends in the country.
"""

# Process the text with spaCy
doc = nlp(text)

# Dictionary to track entities
entity_tracking = {}

# Iterate through the sentences in the doc
for sent in doc.sents:
    print(f"Sentence: {sent}")
    # Iterate through named entities in the sentence
    for ent in sent.ents:
        print(f"Entity: {ent.text}, Label: {ent.label_}")
        # Track entities and update if seen again
        if ent.text in entity_tracking:
            entity_tracking[ent.text] += 1
        else:
            entity_tracking[ent.text] = 1

# Output tracked entities
print("\nEntity Tracking:")
for entity, count in entity_tracking.items():
    print(f"{entity}: mentioned {count} times")

Sentence: 

Sentence: Although she was very busy with office work, Mary felt she had had enough of it. 

Entity: Mary, Label: PERSON
Sentence: She and her spouse decided they needed to go on a holiday. 

Entity: a holiday, Label: DATE
Sentence: They travelled by train to France because they had enough friends in the country.

Entity: France, Label: GPE

Entity Tracking:
Mary: mentioned 1 times
a holiday: mentioned 1 times
France: mentioned 1 times


More professional entity tracking code is written using packages like spaCy, AllenNLP, or Hugging Face Transformers. Such systems first detect and classify entities and then link them across texts. They combine NER with coreference resolution to track entities even when they are mentioned through pronouns or synonyms. Professional code is frequently part of larger NLP pipelines that integrate with databases or knowledge graphs to capture entity relationships and maintain the required accuracy over long input documents or conversations.
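The idea of combining NER with coreference resolution can be sketched with the results we already have. The snippet below is a simplified illustration, not a production approach: the entity labels come from the NER output above, the chains are copied by hand (as plain word lists) from the coreference demo earlier in this chapter, and `count_with_coref` is a hypothetical helper name.

```python
from collections import Counter

# Named entities found by spaCy in the demo above (text -> label).
# 'a holiday' (DATE) is left out because it is not an entity we want to track.
entities = {"Mary": "PERSON", "France": "GPE"}

# Coreference chains copied by hand from the coreferee output shown earlier,
# simplified to plain word lists for this illustration.
chains = [
    ["she", "Mary", "she", "She", "her"],
    ["work", "it"],
    ["France", "country"],
]


def count_with_coref(entities, chains):
    """Count mentions per entity, including pronouns that share its chain."""
    counts = Counter()
    for chain in chains:
        names = [mention for mention in chain if mention in entities]
        if names:
            counts[names[0]] += len(chain)
    # Entities that appear in no chain are still mentioned once.
    for name in entities:
        counts.setdefault(name, 1)
    return counts


counts = count_with_coref(entities, chains)
for name, count in counts.items():
    print(f"{name} ({entities[name]}): mentioned {count} times")
```

With the pronouns included, Mary is counted five times instead of once, which is closer to how a reader follows her through the text.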

#### Lexical chains (code demo)

A lexical chain is a sequence of related words that are connected either through direct synonyms or semantically related terms. These related words share a common meaning or topic. We discussed lexical chains with an example in the theory section. Here, we provide a simple code demo of lexical chains using WordNet from the nltk library. 

For professional NLP applications, lexical chain code involves advanced algorithms that may be based on WordNet, distributional semantics, or word embeddings like Word2Vec or BERT to pick up semantic relationships between words. Such systems make use of synonyms, hypernyms, and context-based similarities to arrive at accurate lexical chains. This code is often combined with text segmentation, word sense disambiguation, and coherence modelling for tasks involving summarization or topic detection.

In [None]:
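# !pip install nltk
# !python -m nltk.downloader wordnet
# nltk.word_tokenize also needs tokenizer data (newer NLTK releases use 'punkt_tab' instead of 'punkt'):
# !python -m nltk.downloader punkt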

In [6]:
import nltk
from nltk.corpus import wordnet as wn

# Sample text
text = "Anil is a blood student in my class. He runs very fast."

# Tokenize the text
words = nltk.word_tokenize(text.lower())

# Function to find synonyms from WordNet
def get_synonyms(word):
    synonyms = set()
    for syn in wn.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
    return synonyms

# Build lexical chains
lexical_chains = []

for word in words:
    found_chain = False
    word_synonyms = get_synonyms(word)
    
    # Check if word fits into any existing chain
    for chain in lexical_chains:
        if chain.intersection(word_synonyms):
            chain.update(word_synonyms)
            found_chain = True
            break
    
    # If not, start a new chain
    if not found_chain and word_synonyms:
        lexical_chains.append(set(word_synonyms))

# Output lexical chains
for i, chain in enumerate(lexical_chains):
    print(f"Chain {i + 1}: {chain}")

Chain 1: {'indigo', 'indigotin', 'Indigofera_suffruticosa', 'anil', 'Indigofera_anil'}
Chain 2: {'personify', 'represent', 'be', 'equal', 'follow', 'live', 'cost', 'constitute', 'embody', 'comprise', 'make_up', 'exist'}
Chain 3: {'type_A', 'adenine', 'ampere', 'a', 'angstrom_unit', 'antiophthalmic_factor', 'vitamin_A', 'angstrom', 'axerophthol', 'amp', 'group_A', 'deoxyadenosine_monophosphate', 'A'}
Chain 4: {'debauched', 'lineage', 'line', 'riotous', 'blood', 'blood_line', 'fast', 'degenerate', 'quick', 'bloodline', 'dissolute', 'flying', 'libertine', 'tight', 'fasting', 'firm', 'parentage', 'rake', 'ancestry', 'degraded', 'loyal', 'profligate', 'rip', 'rakehell', 'stock', 'dissipated', 'immobile', 'stemma', 'pedigree', 'descent', 'truehearted', 'line_of_descent', 'roue', 'origin'}
Chain 5: {'scholarly_person', 'bookman', 'pupil', 'educatee', 'student', 'scholar'}
Chain 6: {'Indiana', 'atomic_number_49', 'In', 'IN', 'inward', 'Hoosier_State', 'inch', 'inwards', 'indium', 'in'}
Chain 7

The output shows the chains our code builds: groups of WordNet synonyms for the words in the input text. Many terms in these chains do not appear in the input text at all. This happens because the code adds every synonym of every sense of a word: 'anil' is treated as the indigo plant, 'a' as vitamin A, and 'in' as Indiana. It is an example of how lexical chaining can bring in terms that aren't explicitly present but are linked in the word database. In practice, the algorithm needs fine-tuning, for example through word sense disambiguation, to limit the chains to terms that closely match your input text. 
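One simple way to keep chains close to the input is to link only words that actually occur in the text. The sketch below illustrates this under clear assumptions: it uses a tiny hand-written synonym table (`TOY_SYNONYMS`, hypothetical) instead of WordNet so that it runs without any downloads, and the sample word list is our own.

```python
# Hypothetical, hand-written synonym table standing in for WordNet.
TOY_SYNONYMS = {
    "student": {"pupil", "scholar"},
    "pupil": {"student", "scholar"},
    "class": {"course", "grade"},
    "course": {"class"},
    "fast": {"quick", "rapid"},
    "quick": {"fast", "rapid"},
}


def lexical_chains(words, synonyms):
    """Group input words into chains; only words from the input are kept."""
    chains = []
    for word in words:
        related = synonyms.get(word, set()) | {word}
        for chain in chains:
            # Join a chain if the word is related to any word already in it.
            if any(w in related or word in synonyms.get(w, set()) for w in chain):
                if word not in chain:
                    chain.append(word)
                break
        else:
            # No matching chain: start a new one if the word is in the table.
            if word in synonyms:
                chains.append([word])
    return chains


words = ["the", "student", "is", "quick", "every", "pupil", "in", "the",
         "course", "runs", "fast", "to", "class"]
result = lexical_chains(words, TOY_SYNONYMS)
for i, chain in enumerate(result, start=1):
    print(f"Chain {i}: {chain}")
```

Because synonyms are used only to decide whether two input words belong together, no outside terms such as 'Indiana' or 'vitamin_A' can enter a chain.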

#### Topic modelling (code demo)

Topic modelling techniques use word patterns and groupings to identify the key themes or topics in a large collection of texts. In this chapter, we present a simple demo using the TF-IDF technique, which we discussed in detail in the earlier chapters. Professional topic modelling code is often written using advanced techniques like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF), leveraging Python packages like Gensim or scikit-learn. We will take up LDA in detail in the upcoming chapters. 

We will follow these five simple steps to write our code. 
- Prepare Documents: Start with a list of text documents you want to analyse.
- Initialize Vectorizer: Create a TfidfVectorizer object. It converts the input text documents into numerical TF-IDF scores and, as configured here, ignores English stop words.
- Transform Text: Use the vectorizer to convert the input text documents into a TF-IDF matrix. 
- Get Feature Names: Extract the list of words that were considered in the TF-IDF analysis.
- Display Scores: For each document, print the words and their TF-IDF scores.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data (documents)

documents =   ["Although she was very busy with office work, Mary felt she had had enough of it.", 
                "She and her spouse decided they needed to go on a holiday.",
                "They travelled by train to France because they had enough friends in the country."]


# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the documents into a TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get the feature names (terms) from the TF-IDF model
feature_names = tfidf_vectorizer.get_feature_names_out()

# Display the TF-IDF scores for each document
for doc_idx, doc in enumerate(documents):
    print(f"\nDocument {doc_idx + 1}: {doc}")
    # Get the TF-IDF scores for each word in the document
    for word_idx in tfidf_matrix[doc_idx].nonzero()[1]:
        print(f"{feature_names[word_idx]}: {tfidf_matrix[doc_idx, word_idx]:.4f}")


Document 1: Although she was very busy with office work, Mary felt she had had enough of it.
felt: 0.4472
mary: 0.4472
work: 0.4472
office: 0.4472
busy: 0.4472

Document 2: She and her spouse decided they needed to go on a holiday.
holiday: 0.5000
needed: 0.5000
decided: 0.5000
spouse: 0.5000

Document 3: They travelled by train to France because they had enough friends in the country.
country: 0.4472
friends: 0.4472
france: 0.4472
train: 0.4472
travelled: 0.4472


The output shows the TF-IDF scores for the words in Documents 1, 2, and 3. The higher the score, the more relevant the word is to the content of the document. In Document 2, each of the four remaining words has a score of 0.5000, which means all of them are equally important in this document. A high TF-IDF score means that the word is frequent in that document but rare in the others, so it can be a strong pointer to the document's topic or content. Note that this is a simplified analysis; better results can be obtained using advanced techniques like LDA. 
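The identical scores within each document have a simple explanation. By default, scikit-learn's TfidfVectorizer scales each document's scores so that their squares sum to 1 (L2 normalisation). After stop-word removal, every remaining word in our three documents occurs once and in one document only, so all weights in a document are equal and each becomes 1 divided by the square root of the number of words. The short check below reproduces the two values seen in the output; the helper name `equal_weight_score` is our own.

```python
import math


def equal_weight_score(n_terms):
    """Score of each term when n_terms equally weighted terms are L2-normalised."""
    return 1 / math.sqrt(n_terms)


# Document 2 keeps four words; Documents 1 and 3 keep five words each.
for n_terms in (4, 5):
    print(f"{n_terms} terms -> {equal_weight_score(n_terms):.4f}")
```

This also shows why the scores here say little about topics: with such short documents and no shared vocabulary, every surviving word gets the same weight.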

#### Cohesion analysis (code demo)

Cohesion analysis has two objectives. First, it looks at how well different parts of a text fit together. Second, it looks at how they connect to make the text flow smoothly and make sense.
In the demo below, we bring in a new concept, cosine similarity, which measures the similarity between two vectors. An interpretation of the similarity scores is given after the output. We will cover this concept in detail in the later pages of this chapter. 

We have used a basic TF-IDF technique in our code demo. Professionals also use more advanced vectorization techniques, such as the word embeddings Word2Vec and GloVe, to capture more nuanced insights into text similarity and cohesion. We will take up these techniques later in this chapter.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Revised sentences
sentences = [
    "Mary is feeling overwhelmed with her busy office job but finds some relief in her evening walks.",
    "Mary feels overwhelmed with her hectic work schedule, yet she finds relaxation in her daily evening walks.",
    "Despite a busy workday, Mary enjoys unwinding with a walk in the evening."
]

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the sentences into a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(sentences)

# Compute the cosine similarity matrix
cosine_sim_matrix = cosine_similarity(tfidf_matrix)

# Display the cosine similarity matrix
print("Cosine Similarity Matrix:")
for i in range(len(sentences)):
    for j in range(len(sentences)):
        print(f"Similarity between Sentence {i + 1} and Sentence {j + 1}: {cosine_sim_matrix[i, j]:.2f}")


Cosine Similarity Matrix:
Similarity between Sentence 1 and Sentence 1: 1.00
Similarity between Sentence 1 and Sentence 2: 0.32
Similarity between Sentence 1 and Sentence 3: 0.19
Similarity between Sentence 2 and Sentence 1: 0.32
Similarity between Sentence 2 and Sentence 2: 1.00
Similarity between Sentence 2 and Sentence 3: 0.10
Similarity between Sentence 3 and Sentence 1: 0.19
Similarity between Sentence 3 and Sentence 2: 0.10
Similarity between Sentence 3 and Sentence 3: 1.00


Interpretation of the output:
- Interpreting the scores: Higher scores indicate more cohesive sentences with shared vocabulary and similar content.
- Low inter-sentence similarity: The highest score between two different sentences is 0.32, which shows limited shared vocabulary. TF-IDF matches exact words only, so 'feeling' and 'feels' count as different terms.
- Minimal overlap: The low similarity scores of 0.10 and 0.19 indicate minimal thematic or lexical overlap between those sentences.
- A similarity score of 1.00 appears where a sentence is compared with itself.
- A similarity score of 0 would indicate no shared vocabulary between two sentences.
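To see where such numbers come from, cosine similarity can be computed by hand: it is the dot product of two vectors divided by the product of their lengths. The sketch below is a simplified stand-in for the demo above. As an assumption for brevity, it uses plain word counts instead of TF-IDF weights and splits on whitespace, so its values will not match the matrix above; the function name `cosine` is our own.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    length_a = math.sqrt(sum(c * c for c in counts_a.values()))
    length_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if length_a == 0 or length_b == 0:
        return 0.0
    return dot / (length_a * length_b)


print(f"{cosine('mary walks home', 'mary walks'):.2f}")
print(f"{cosine('mary walks', 'busy office'):.2f}")
```

Two texts with no words in common score 0, and a text compared with itself scores 1, which matches the interpretation given above.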

#### Temporal relation identification (code demo)

Temporal relation identification is the process of finding time-based relationships between events in an input text. For instance, in the sentence "I want to complete this chapter before lunch," the word "before" indicates the time-based connection between two events. It tells us that the work is to be finished earlier than lunch.

Our code demo for temporal relation identification first performs dependency parsing to extract events and detects temporal signals, which are words like "before," "after," "during," "until," and "while." After this, the events are classified through rule-based methods or machine-learning models. This approach aims to find the sequence and timing of actions in a given input text. We will utilise spaCy to extract events. 

Our program below follows these five steps:
- Import Libraries and Load Model
- Feature Extraction from Sentences
- Prepare Training Data
- Train Random Forest Classifier
- Predict Temporal Relations in New Sentences

In [19]:
# Required Libraries
import spacy
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Load the spaCy model for dependency parsing
nlp = spacy.load("en_core_web_sm")

# Sample sentences with temporal relations
sentences = [
    "She finished her work before going to lunch.",
    "He went to the gym after work.",
    "They waited until the show started.",
    "The meeting was delayed during the storm."
]

# Temporal signals
temporal_signals = ["before", "after", "during", "until", "while"]

# Feature extraction - identifying events and temporal signals
def extract_features(sent):
    doc = nlp(sent)
    events = []
    temporal_relation = ""
    
    for token in doc:
        if token.dep_ == "ROOT":  # Event (verb) extraction
            events.append(token.lemma_)
        if token.text in temporal_signals:  # Temporal signal detection
            temporal_relation = token.text
    return events, temporal_relation

# Prepare training data - sentences, events, and labels (1 for temporal relation present, 0 otherwise)
X = []
y = []

for sentence in sentences:
    events, signal = extract_features(sentence)
    if signal:  # If temporal signal is present, we classify as 1
        X.append([len(events)])  # Simple feature: number of events
        y.append(1)
    else:
        X.append([0])
        y.append(0)

# Convert to numpy arrays
X = np.array(X)
y = np.array(y)

# Train a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit(X, y)

# Predict on new sentences
new_sentences = ["John started his project before the deadline.",
                 "She arrived after the party had begun."]

for new_sent in new_sentences:
    events, signal = extract_features(new_sent)
    prediction = clf.predict([[len(events)]])[0]  # predict returns an array; take its single element
    print(f"Sentence: {new_sent}")
    print(f"Detected Temporal Signal: {signal}, Prediction: {'Temporal relation' if prediction == 1 else 'No temporal relation'}\n")

Sentence: John started his project before the deadline.
Detected Temporal Signal: before, Prediction: Temporal relation

Sentence: She arrived after the party had begun.
Detected Temporal Signal: after, Prediction: Temporal relation



We will shed some light on the function extract_features(sent) for better understanding.
- extract_features(sent): Processes each sentence using spaCy. The function returns the list of events (verbs) and the temporal relation signal (if present).
- doc = nlp(sent): Processes the sentence and converts it into a structured format for analysis.
- for token in doc: Loops through each token in the sentence.
- token.dep_ == "ROOT": Checks if the token is the main verb (the root action) and stores its lemma.
- if token.text in temporal_signals: If the token is found in the list of temporal signals, it is stored as the temporal signal.

Converting words into ML model format: Instead of vectorizing the entire sentence, the code uses a simple scheme for converting words into numbers. It uses the number of root verbs (events) as the feature in X, and sets a binary label (1 or 0) in y based on the presence of temporal signals. This keeps the dataset simple and in numerical format without using a vectorizer.

The output confirms that the code detects the temporal signals present in the input sentences. Note, however, that every training sentence contains a temporal signal, so all the labels are 1 and the classifier predicts "Temporal relation" for any input. To make the predictions meaningful, the training data would also need sentences without temporal signals.
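The rule-based route mentioned earlier can be sketched without any model. The snippet below is a minimal illustration, assuming only the signals "before" and "after" and a single signal per sentence; it splits the sentence at the signal and reports which part happens earlier. The names `SIGNALS` and `temporal_order` are hypothetical.

```python
import re

# Only two signals are handled in this sketch.
SIGNALS = ("before", "after")


def temporal_order(sentence):
    """Return the earlier and later parts of a sentence, or None if no signal."""
    text = sentence.strip().rstrip(".")
    for signal in SIGNALS:
        match = re.search(rf"\b{signal}\b", text, flags=re.IGNORECASE)
        if match:
            left = text[:match.start()].strip()
            right = text[match.end():].strip()
            if signal == "after":
                # "A after B": B happens first.
                return {"signal": signal, "earlier": right, "later": left}
            # "A before B": A happens first.
            return {"signal": signal, "earlier": left, "later": right}
    return None


for sent in ["John started his project before the deadline.",
             "She arrived after the party had begun.",
             "The weather was pleasant."]:
    print(sent, "->", temporal_order(sent))
```

Unlike the binary label in the demo above, this returns the order of the two events, and it returns None for a sentence without a temporal signal.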

In [None]:
>>> Code Snippet 7.1