## Libraries

In [None]:
!pip install --quiet transformers==4.5.0
!pip install --quiet sentencepiece==0.1.95
!pip install --quiet textwrap3==0.9.2
!pip install --quiet nltk==3.2.5

In [None]:
!pip install --quiet ipython-autotime
%load_ext autotime


## Text wrapping

In [None]:
text = """Orcas, also known as killer whales, are one of the most awe-inspiring and iconic aquatic animals found in the world's oceans. Belonging to the family of dolphins, they are characterized by their distinctive black and white markings, impressive size, and powerful presence. Orcas are highly social creatures, living in tight-knit pods led by a matriarchal hierarchy. These pods often consist of multiple generations, with individuals forming strong bonds through complex communication and cooperative hunting strategies.

Remarkably intelligent and adaptable, orcas have earned a reputation as apex predators, capable of hunting a wide variety of prey, including fish, seals, and even other marine mammals like dolphins and whales. They employ sophisticated hunting techniques, such as coordinated teamwork and strategic maneuvers, to outsmart their prey. Despite their formidable hunting prowess, orcas are also known for their playful behavior, often engaging in acrobatic displays and social interactions, including breaching, tail-slapping, and vocalizations.

Orcas occupy a diverse range of habitats, from polar regions to tropical seas, demonstrating their remarkable ability to thrive in various environments. They have been observed in all the world's oceans, from the Arctic to the Antarctic, showcasing their adaptability to different climates and ecosystems. However, like many marine species, orcas face numerous threats, including habitat degradation, pollution, climate change, and overfishing of their prey species.

Conservation efforts are crucial to safeguard the future of orcas and their marine habitats. By protecting their habitats, regulating human activities, and promoting sustainable fishing practices, we can help ensure the survival of these magnificent creatures. Orcas serve as ambassadors for the health of our oceans, highlighting the interconnectedness of all marine life and the importance of preserving these ecosystems for future generations to enjoy and appreciate. Through research, education, and collaborative conservation initiatives, we can work together to ensure a brighter future for orcas and all aquatic animals. """
# Wrap the text at 150 characters per line for display
from textwrap3 import wrap

for wrp in wrap(text, 150):
    print(wrp)
print("\n")

Orcas, also known as killer whales, are one of the most awe-inspiring and iconic aquatic animals found in the world's oceans. Belonging to the family
of dolphins, they are characterized by their distinctive black and white markings, impressive size, and powerful presence. Orcas are highly social
creatures, living in tight-knit pods led by a matriarchal hierarchy. These pods often consist of multiple generations, with individuals forming strong
bonds through complex communication and cooperative hunting strategies.  Remarkably intelligent and adaptable, orcas have earned a reputation as apex
predators, capable of hunting a wide variety of prey, including fish, seals, and even other marine mammals like dolphins and whales. They employ
sophisticated hunting techniques, such as coordinated teamwork and strategic maneuvers, to outsmart their prey. Despite their formidable hunting
prowess, orcas are also known for their playful behavior, often engaging in acrobatic displays and social intera

# **Summarization with T5 Transformer**

In [None]:
# Import the necessary libraries for torch and transformers
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the pre-trained T5 model and tokenizer for text summarization
summary_model = T5ForConditionalGeneration.from_pretrained('t5-base')
summary_tokenizer = T5Tokenizer.from_pretrained('t5-base')

# Set the device to use for computations (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the selected device
summary_model = summary_model.to(device)

In [None]:
# Import the libraries whose random number generators need seeding
import random
import numpy as np

# Define a function to set the random seed for reproducibility
def set_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Set the random seed for reproducibility
set_seed(42)

In [None]:
# Import the necessary NLTK libraries for text processing
import nltk
nltk.download('punkt')
nltk.download('brown')
nltk.download('wordnet')

# Import the NLTK modules for WordNet access and sentence tokenization
from nltk.corpus import wordnet as wn
from nltk.tokenize import sent_tokenize

# Define a function to post-process the text by capitalizing each sentence
def postprocesstext(content):
  final = ""
  for sent in sent_tokenize(content):
    sent = sent.capitalize()
    final = final + " " + sent
  return final

# Define a function to generate the summary of the text using the pre-trained T5 model
def summarizer(text, model, tokenizer):

  # Pre-process the text by removing newline characters and adding the "summarize: " prefix
  text = text.strip().replace("\n", " ")
  text = "summarize: " + text

  # Encode the text with the tokenizer, truncating to at most 512 tokens
  max_len = 512
  encoding = tokenizer.encode_plus(text, max_length=max_len, padding=False, truncation=True, return_tensors="pt").to(device)

  # Extract the input_ids and attention_mask from the encoding
  input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

  # Generate the summary using the T5 model with beam search and length constraints
  outs = model.generate(input_ids=input_ids,
                        attention_mask=attention_mask,
                        early_stopping=True,
                        num_beams=3,
                        num_return_sequences=1,
                        no_repeat_ngram_size=2,
                        min_length = 75,
                        max_length=300)

  # Decode the summary from the output ids and post-process it by capitalizing each sentence
  dec = [tokenizer.decode(ids,skip_special_tokens=True) for ids in outs]
  summary = dec[0]
  summary = postprocesstext(summary)
  summary = summary.strip()

  # Return the summary
  return summary

# Call the summarizer function with the input text and generate the summary
summarized_text = summarizer(text, summary_model, summary_tokenizer)

# Function to print text line by line, left-justified and padded with spaces to a fixed width
def pretty_print(text, width=80):
    for line in text.split('\n'):
        print(line.ljust(width))

# Print the original text
print("\nOriginal Text:")
pretty_print(text)

# Print the summarized text
print("\nSummarized Text:")
pretty_print(summarized_text)
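
One detail of `postprocesstext` worth knowing: `str.capitalize()` uppercases the first character and lowercases everything after it, so proper nouns inside a sentence lose their capitals. The sketch below shows the difference on a small sample. It is an illustration under stated assumptions, not a replacement for the function above: `capitalize_first` and `split_sentences` are hypothetical helpers, and the naive regex splitter stands in for `sent_tokenize`.

```python
import re

def capitalize_first(sentence: str) -> str:
    """Uppercase only the first character; leave the rest untouched."""
    return sentence[:1].upper() + sentence[1:]

def split_sentences(text: str) -> list:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "orcas live in the Arctic. they hunt seals."

# str.capitalize() lowercases "Arctic"; capitalize_first() preserves it
with_capitalize = " ".join(s.capitalize() for s in split_sentences(sample))
with_first_only = " ".join(capitalize_first(s) for s in split_sentences(sample))

print(with_capitalize)
print(with_first_only)
```

Whether the lowercasing matters depends on the downstream use; for display to readers, preserving proper nouns is usually preferable.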

# **Keywords and Noun Phrases Span Extraction**

In [None]:
!pip install --quiet git+https://github.com/boudinfl/pke.git
!pip install --quiet flashtext==2.7

In [None]:
import nltk

# Download stopwords from the NLTK corpus
nltk.download('stopwords')

# Import the stopwords from the NLTK corpus
from nltk.corpus import stopwords

# Import the string module for punctuation characters
import string

# Import the pke package for keyphrase extraction
import pke

# Import the traceback module for debugging
import traceback

def get_nouns_multipartite(content):
    # Initialize an empty list to store the extracted nouns
    out = []

    # Check if the content is empty
    if not content:
        # Print an error message and return the empty list
        print("Error: content is empty")
        return out

    # Try to execute the following code block
    try:
        # Initialize the MultipartiteRank extractor
        extractor = pke.unsupervised.MultipartiteRank()

        # Create a stoplist that includes punctuation characters,
        # left and right brackets, and English stopwords
        stoplist = list(string.punctuation)
        stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']
        stoplist += stopwords.words('english')

        # Load the content and stoplist into the extractor
        extractor.load_document(input=content, stoplist=stoplist)

        # Specify the part-of-speech (POS) tags of the nouns and proper nouns
        pos = {'PROPN', 'NOUN'}

        # Select the candidate phrases based on the POS tags
        extractor.candidate_selection(pos=pos)

        # Assign weights to the candidate phrases based on their relevance
        extractor.candidate_weighting(alpha=1.1, threshold=0.75, method='average')

        # Extract the top-ranked nouns as key phrases
        keyphrases = extractor.get_n_best(n=15)

        # Add the key phrases to the output list
        for val in keyphrases:
            out.append(val[0])

    # If any exception occurs, print the error message and the traceback
    except Exception as e:
        print(f"Error: {e}")
        traceback.print_exc()

    # Return the output list
    return out

In [None]:
!pip install flashtext

In [None]:
from flashtext import KeywordProcessor

def get_keywords(originaltext, summarytext):
    # Extract keywords from the original text
    keywords = get_nouns_multipartite(originaltext)
    print("keywords unsummarized: ", keywords)

    # Create a KeywordProcessor object to extract keywords from the summary text
    keyword_processor = KeywordProcessor()

    # Add the extracted keywords to the KeywordProcessor object
    for keyword in keywords:
        keyword_processor.add_keyword(keyword)

    # Extract keywords from the summary text
    keywords_found = keyword_processor.extract_keywords(summarytext)

    # Convert the extracted keywords to a set and then back to a list to remove duplicates
    keywords_found = list(set(keywords_found))
    print("keywords_found in summarized: ", keywords_found)

    # Identify the keywords that are present in both the original text and the summary text
    important_keywords = []
    for keyword in keywords:
        if keyword in keywords_found:
            important_keywords.append(keyword)

    # Return the first four important keywords
    return important_keywords[:4]


imp_keywords = get_keywords(text, summarized_text)
print("important keywords: ", imp_keywords)
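
The filtering step in `get_keywords` can be pictured without pke or flashtext: rank keyphrases on the full text, then keep only those that also occur in the summary, preserving the original ranking. The sketch below is a stand-in under stated assumptions: the ranked list is hard-coded rather than produced by MultipartiteRank, a case-insensitive whole-word regex takes the place of `KeywordProcessor`, and `keywords_in_summary` is a hypothetical name.

```python
import re

def keywords_in_summary(ranked_keywords, summary, limit=4):
    """Keep ranked keywords that appear as whole words or phrases in the summary."""
    kept = []
    for kw in ranked_keywords:
        # \b anchors give whole-word matching, roughly like flashtext's default behavior
        pattern = r"\b" + re.escape(kw) + r"\b"
        if re.search(pattern, summary, flags=re.IGNORECASE):
            kept.append(kw)
    return kept[:limit]

# Hard-coded stand-in for the extractor output (assumption)
ranked = ["orcas", "pods", "prey", "habitats", "conservation efforts"]
summary = "Orcas live in pods and hunt a wide variety of prey."

print(keywords_in_summary(ranked, summary))
```

Because the result follows the order of the ranked list, truncating to `limit` keeps the highest-ranked keyphrases that survived summarization.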

# **Question generation with T5**

In [None]:
question_model = T5ForConditionalGeneration.from_pretrained('ramsrigouthamg/t5_squad_v1')
question_tokenizer = T5Tokenizer.from_pretrained('ramsrigouthamg/t5_squad_v1')
question_model = question_model.to(device)

In [None]:
def get_question(context, answer, model, tokenizer):
    """
    Generates a question based on the given context and answer using a pre-trained model.

    Args:
        context (str): The context or passage from which the question is to be generated.
        answer (str): The answer present in the context for which the question is to be generated.
        model: The pre-trained model used for question generation.
        tokenizer: The tokenizer used for encoding the text.

    Returns:
        str: The generated question based on the context and answer.
    """
    # Combine context and answer into a single text string
    text = "context: {} answer: {}".format(context, answer)

    # Tokenize and encode the text, truncating to at most 384 tokens
    encoding = tokenizer.encode_plus(text, max_length=384, padding=False, truncation=True, return_tensors="pt").to(device)
    input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

    # Generate questions using the model
    outs = model.generate(input_ids=input_ids,
                          attention_mask=attention_mask,
                          early_stopping=True,
                          num_beams=5,
                          num_return_sequences=1,
                          no_repeat_ngram_size=2,
                          max_length=72)

    # Decode the generated question from token IDs and remove special tokens
    dec = [tokenizer.decode(ids, skip_special_tokens=True) for ids in outs]
    question = dec[0].replace("question:", "")
    question = question.strip()
    return question


# Print the summarized text, wrapping lines at 150 characters
for wrp in wrap(summarized_text, 150):
    print(wrp)
print("\n")

# Generate questions for important keywords and print them along with the keywords
for answer in imp_keywords:
    ques = get_question(summarized_text, answer, question_model, question_tokenizer)
    print(ques)
    print(answer.capitalize())
    print("\n")
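
`get_question` relies on two string conventions: the input prompt takes the form `context: … answer: …`, and a leading `question:` label is stripped from the decoded output. The sketch below isolates those two steps so they can be checked without loading a model. `build_prompt` and `clean_question` are hypothetical helper names, and the sample decoded string is made up for illustration.

```python
def build_prompt(context: str, answer: str) -> str:
    """Format the input the way get_question does before tokenization."""
    return "context: {} answer: {}".format(context, answer)

def clean_question(decoded: str) -> str:
    """Drop the 'question:' label and surrounding whitespace."""
    return decoded.replace("question:", "").strip()

prompt = build_prompt("Orcas live in tight-knit pods.", "pods")
print(prompt)

# Made-up decoded string standing in for real model output (assumption)
cleaned = clean_question("question: What do orcas live in?")
print(cleaned)
```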


# **Filter keywords with Maximal Marginal Relevance (MMR) algorithm to select a diverse set of keywords based on cosine similarity**

In [None]:
!pip install --quiet keybert==0.2.0
!pip install --quiet strsim==0.0.3

In [None]:
!wget https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz
!tar -xvf  s2v_reddit_2015_md.tar.gz

In [None]:
!pip install -q sense2vec

In [None]:
import numpy as np
from sense2vec import Sense2Vec
s2v = Sense2Vec().from_disk("s2v_old")

In [None]:
from sentence_transformers import SentenceTransformer
sentence_transformer_model = SentenceTransformer('msmarco-distilbert-base-v3')

In [None]:
!pip install strsim

In [None]:
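# Importing the Normalized Levenshtein distance calculation library
from similarity.normalized_levenshtein import NormalizedLevenshtein

# Importing cosine similarity, which the mmr function below relies on
from sklearn.metrics.pairwise import cosine_similarity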

# Initializing the Normalized Levenshtein object
normalized_levenshtein = NormalizedLevenshtein()

# Function to filter words based on their semantic sense
def filter_same_sense_words(original, wordlist):
    """
    Filters words from a given list that share the same semantic sense as the original word.
    
    Parameters:
    - original: The original word whose sense we want to match against.
    - wordlist: A list of words to filter through.
    
    Returns:
    - A list of words from the input list that share the same sense as the original word.
    """
    filtered_words = []  # List to store words of the same sense
    base_sense = original.split('|')[1]  # Extracting the base sense from the original word
    
    # Iterating over each word in the wordlist
    for eachword in wordlist:
        # Checking if the sense of the current word matches the base sense
        if eachword[0].split('|')[1] == base_sense:
            # Adding the word to the filtered list after processing
            filtered_words.append(eachword[0].split('|')[0].replace("_", " ").title().strip())
    
    return filtered_words

# Function to calculate the highest similarity score between a word and a list of words
def get_highest_similarity_score(wordlist, wrd):
    """
    Calculates the highest similarity score between a given word and a list of words using the Normalized Levenshtein distance.
    
    Parameters:
    - wordlist: A list of words to compare against.
    - wrd: The word to find the highest similarity score for.
    
    Returns:
    - The maximum similarity score found among the words in the list.
    """
    score = []  # List to store similarity scores
    
    # Calculating similarity scores for each word in the list
    for each in wordlist:
        score.append(normalized_levenshtein.similarity(each.lower(), wrd.lower()))
    
    return max(score)

# Function to retrieve words associated with a given sense using Sense2Vec
def sense2vec_get_words(word, s2v, topn, question):
    """
    Retrieves words associated with a given sense using Sense2Vec, filters them based on semantic similarity, and excludes words already present in the question.
    
    Parameters:
    - word: The word to find similar words for.
    - s2v: The Sense2Vec model instance.
    - topn: The number of top similar words to retrieve.
    - question: The question to exclude certain words from the result.
    
    Returns:
    - A list of words that are semantically similar to the input word and not part of the question.
    """
    output = []  # List to store the final set of words
    
    # Attempting to retrieve the best sense for the input word
    try:
        # Sense labels follow the sense2vec tag set; multi-word labels use underscores
        sense = s2v.get_best_sense(word, senses=["NOUN", "PERSON", "PRODUCT", "LOC", "ORG", "EVENT", "NORP", "WORK_OF_ART", "FAC", "GPE", "NUM", "FACILITY"])
        most_similar = s2v.most_similar(sense, n=topn)
        
        # Filtering words of the same sense from the most similar words
        output = filter_same_sense_words(sense, most_similar)
    except Exception:
        # No sense found for the word (or lookup failed): fall back to no candidates
        output = []
    
    threshold = 0.6  # Threshold for filtering words based on similarity score
    final = [word]  # Initial list containing the input word
    
    # Excluding words already present in the question
    checklist = question.split()
    
    # Filtering words based on similarity score and exclusion criteria
    for x in output:
        if get_highest_similarity_score(final, x) < threshold and x not in final and x not in checklist:
            final.append(x)
    
    return final[1:]  # Returning the list excluding the initial word

# Function to perform Maximal Marginal Relevance (MMR) keyword extraction
def mmr(doc_embedding, word_embeddings, words, top_n, lambda_param):
    """
    Performs Maximal Marginal Relevance (MMR) keyword extraction from a document embedding and a list of word embeddings.
    
    Parameters:
    - doc_embedding: The document embedding vector.
    - word_embeddings: The list of word embedding vectors.
    - words: The list of words corresponding to the word embeddings.
    - top_n: The number of top keywords to extract.
    - lambda_param: The parameter balancing the trade-off between diversity and relevance in MMR.
    
    Returns:
    - A list of top keywords extracted using MMR.
    """
    word_doc_similarity = cosine_similarity(word_embeddings, doc_embedding)
    word_similarity = cosine_similarity(word_embeddings)

    keywords_idx = [np.argmax(word_doc_similarity)]  # Initial keyword index
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]  # Candidate indices excluding the initial keyword

    for _ in range(top_n - 1):  # Iterating until the desired number of keywords is reached
        candidate_similarities = word_doc_similarity[candidates_idx, :]  # Similarity scores of candidates with the document
        target_similarities = np.max(word_similarity[candidates_idx][:, keywords_idx], axis=1)  # Maximum similarity scores of candidates with the current keywords
        
        # Calculating MMR scores: relevance to the document minus redundancy with the chosen keywords
        mmr_scores = (lambda_param) * candidate_similarities - (1-lambda_param) * target_similarities.reshape(-1, 1)
        
        # Finding the index of the candidate with the highest MMR score
        mmr_idx = candidates_idx[np.argmax(mmr_scores)]
        
        # Updating the list of keywords and removing the selected candidate from the candidates list
        keywords_idx.append(mmr_idx)
        candidates_idx.remove(mmr_idx)
    
    # Returning the list of keywords
    return [words[idx] for idx in keywords_idx]
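
To see what the `lambda_param` trade-off does, here is a small, dependency-free sketch of the same greedy selection on hand-made 2-D vectors. The vectors, the candidate words, and the helper names `cosine` and `mmr_select` are all assumptions for illustration; the `mmr` function above performs the same arithmetic with NumPy and scikit-learn.

```python
import math

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def mmr_select(doc_vec, word_vecs, words, top_n, lambda_param):
    """Greedy MMR: start with the most relevant word, then trade relevance for novelty."""
    relevance = [cosine(v, doc_vec) for v in word_vecs]
    selected = [max(range(len(words)), key=lambda i: relevance[i])]
    candidates = [i for i in range(len(words)) if i != selected[0]]
    while candidates and len(selected) < top_n:
        def score(i):
            # Redundancy is the highest similarity to any word already selected
            redundancy = max(cosine(word_vecs[i], word_vecs[j]) for j in selected)
            return lambda_param * relevance[i] - (1 - lambda_param) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [words[i] for i in selected]

# Toy 2-D "embeddings" (assumption): Dolphin is almost a duplicate of Orca
doc = (1.0, 0.0)
words = ["Orca", "Dolphin", "Shark"]
vecs = [(1.0, 0.1), (1.0, 0.2), (0.5, 1.0)]

relevance_heavy = mmr_select(doc, vecs, words, top_n=2, lambda_param=0.9)
diversity_heavy = mmr_select(doc, vecs, words, top_n=2, lambda_param=0.2)
print(relevance_heavy)
print(diversity_heavy)
```

With a high `lambda_param` the near-duplicate is picked second because it is still highly relevant; with a low value the redundancy penalty dominates and the more distinct word wins.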


In [None]:
import nltk
nltk.download('wordnet')

In [None]:
# Importing necessary libraries
from collections import OrderedDict
from sklearn.metrics.pairwise import cosine_similarity

# Function to generate distractors using WordNet
def get_distractors_wordnet(word):
    """
    Generates distractors for a given word using WordNet by collecting its co-hyponyms: other hyponyms of the word's first hypernym.
    
    Parameters:
    - word: The word for which distractors are to be generated.
    
    Returns:
    - A list of distractors for the input word.
    """
    distractors = []  # List to store distractors
    
    try:
        # Normalizing the word to WordNet's lemma format (lowercase, underscores)
        word = word.lower().replace(" ", "_")

        # Retrieving the first noun synset for the word
        syn = wn.synsets(word, 'n')[0]

        # Getting hypernyms of the synset
        hypernym = syn.hypernyms()

        # If no hypernyms exist, returning an empty list
        if len(hypernym) == 0:
            return distractors

        # Iterating through hyponyms of the first hypernym
        for item in hypernym[0].hyponyms():
            name = item.lemmas()[0].name()

            # Skipping the original word itself (lemma names use underscores too)
            if name.lower() == word:
                continue

            # Formatting the name
            name = name.replace("_", " ")
            name = " ".join(w.capitalize() for w in name.split())

            # Adding unique names to the distractors list
            if name not in distractors:
                distractors.append(name)
    except Exception:
        # Typically an IndexError when the word has no noun synset
        print("Wordnet distractors not found")
    
    return distractors

# Function to generate distractors using Sense2Vec and Sentence Transformer models
def get_distractors(word, origsentence, sense2vecmodel, sentencemodel, top_n, lambdaval):
    """
    Generates distractors for a given word using both Sense2Vec and Sentence Transformer models, aiming to diversify the selection.
    
    Parameters:
    - word: The word for which distractors are to be generated.
    - origsentence: The original sentence containing the word.
    - sense2vecmodel: The Sense2Vec model instance.
    - sentencemodel: The Sentence Transformer model instance.
    - top_n: The number of top similar words to consider.
    - lambdaval: The lambda value for MMR keyword extraction.
    
    Returns:
    - A list of diversified distractors for the input word.
    """
    distractors = sense2vec_get_words(word, sense2vecmodel, top_n, origsentence)
    print("distractors ", distractors)
    
    # If no distractors are found, returning an empty list
    if len(distractors) == 0:
        return distractors
    
    distractors_new = [word.capitalize()]  # Starting with the capitalized word itself
    distractors_new.extend(distractors)  # Extending with the distractors found
    
    # Encoding the original sentence and the new distractors
    embedding_sentence = origsentence + " " + word.capitalize()
    keyword_embedding = sentencemodel.encode([embedding_sentence])
    distractor_embeddings = sentencemodel.encode(distractors_new)
    
    # Determining the maximum number of keywords to keep
    max_keywords = min(len(distractors_new), 5)
    
    # Applying MMR to filter keywords
    filtered_keywords = mmr(keyword_embedding, distractor_embeddings, distractors_new, max_keywords, lambdaval)
    
    # Finalizing the list of keywords, excluding duplicates and the original word
    final = [word.capitalize()]
    for wrd in filtered_keywords:
        if wrd.lower() != word.lower():
            final.append(wrd.capitalize())
    final = final[1:]
    
    return final

# Example usage
sent = "Which animal is called killer whale"
keyword = "orca"

# s2v and sentence_transformer_model are the models loaded in the cells above
print(get_distractors(keyword, sent, s2v, sentence_transformer_model, 40, 0.2))
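
`sense2vec_get_words` drops candidates that look too much like a word already kept, using a normalized Levenshtein similarity with a 0.6 threshold. The sketch below reimplements that measure in plain Python as 1 − distance / longer length, which is my understanding of what strsim's `NormalizedLevenshtein.similarity` computes (treat that as an assumption), and applies the same filter to a made-up candidate list. `drop_near_duplicates` is a hypothetical helper name.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance using a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """1 - distance / longer length; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def drop_near_duplicates(word, candidates, threshold=0.6):
    """Keep a candidate only if it is dissimilar to everything kept so far."""
    kept = [word]
    for cand in candidates:
        best = max(similarity(k.lower(), cand.lower()) for k in kept)
        if best < threshold and cand not in kept:
            kept.append(cand)
    return kept[1:]  # exclude the seed word itself

# Made-up candidates (assumption): plural variants should be filtered out
filtered = drop_near_duplicates("orca", ["Orcas", "Dolphin", "Dolphins", "Shark"])
print(filtered)
```

The check runs against every word kept so far, not just the seed word, which is why a plural of an already accepted distractor is also rejected.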


# **JSON generation - MCQs**

In [None]:
import random

# Function to generate a question along with its distractors and correct answers
def generate_question(context, chooseOption):
    """
    Generates a question based on a given context, including distractors and correct answers.
    
    Parameters:
    - context: The text context from which to generate the question.
    - chooseOption: "Wordnet" to generate distractors with WordNet; any other value uses Sense2Vec.
    
    Returns:
    - A tuple containing a list of questions and a summary of the context.
    """
    # Summarize the context and extract keywords
    summary_text = summarizer(context, summary_model, summary_tokenizer)
    keywords = get_keywords(context, summary_text)

    questions = []
    for answer in keywords:
        # Generate a question based on the answer
        ques = get_question(summary_text, answer, question_model, question_tokenizer)
        
        # Determine distractors based on the chosen method
        if chooseOption == "Wordnet":
            distractors = get_distractors_wordnet(answer)
        else:
            distractors = get_distractors(answer.capitalize(), ques, s2v, sentence_transformer_model, 40, 0.2)

        # Prepare the correct answer plus up to four distractors, in random order
        correct_answers = [answer.capitalize()]
        answers = correct_answers + distractors[:4]
        random.shuffle(answers)
        correct_answer_indices = [i for i, ans in enumerate(answers) if ans in correct_answers]

        # Construct the question dictionary
        question = {
            "question": ques,
            "answers": answers,
            "correct_answer_indices": correct_answer_indices,
            "correct_answers": correct_answers
        }
        questions.append(question)

    # Label the summary for display
    summary = f"Summary: \n{summary_text}"

    return questions, summary

# Context and radio button choice
context = """
Your context goes here...
"""
chooseOption = "Sense2Vec"

# summarizer, get_keywords, get_question and the distractor functions are defined in the cells above
import json
questions, summary = generate_question(context, chooseOption)
print(json.dumps(questions, indent=4))
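
Each entry in the JSON output pairs a shuffled answer list with the positions of the correct answers. The sketch below reproduces just that assembly step with made-up inputs, so the shape of one record can be inspected without running the models. `build_mcq` is a hypothetical helper, and the seeded `random.Random` instance is only there to make the shuffle repeatable.

```python
import json
import random

def build_mcq(question, correct, distractors, rng):
    """Assemble one MCQ record in the same shape generate_question emits."""
    correct_answers = [correct]
    # One correct answer plus at most four distractors
    answers = correct_answers + distractors[:4]
    rng.shuffle(answers)
    indices = [i for i, ans in enumerate(answers) if ans in correct_answers]
    return {
        "question": question,
        "answers": answers,
        "correct_answer_indices": indices,
        "correct_answers": correct_answers,
    }

# Made-up question and distractors (assumption)
record = build_mcq(
    "Which animal is called killer whale?",
    "Orca",
    ["Dolphin", "Shark", "Seal", "Walrus", "Otter"],
    random.Random(42),
)
print(json.dumps(record, indent=4))
```

Storing indices alongside the answer text lets a front end mark the correct option after shuffling without comparing strings again.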