### Installing the required libraries

In [None]:
!pip install --quiet flashtext==2.7  # Install flashtext library version 2.7
!pip install git+https://github.com/boudinfl/pke.git  # Install pke library from GitHub repository
!pip install --quiet transformers==4.8.1  # Install transformers library version 4.8.1
!pip install --quiet sentencepiece==0.1.95  # Install sentencepiece library version 0.1.95
!pip install --quiet textwrap3==0.9.2  # Install textwrap3 library version 0.9.2
!pip install --quiet strsim==0.0.3  # Install strsim library version 0.0.3
!pip install --quiet sense2vec==2.0.0  # Install sense2vec library version 2.0.0
!pip install --quiet sentence-transformers==2.2.2  # Install sentence-transformers library version 2.2.2
!pip install intel-extension-for-pytorch==2.0.100  # Install Intel extension for PyTorch version 2.0.100


### Import the `wrap` function from textwrap3 and wrap the sample text to lines of at most 150 characters

In [None]:
from textwrap3 import wrap
text = """Elon Musk has shown again he can influence the digital currency market with just his tweets. After saying that his electric vehicle-making company
Tesla will not accept payments in Bitcoin because of environmental concerns, he tweeted that he was working with developers of Dogecoin to improve
system transaction efficiency. Following the two distinct statements from him, the world's largest cryptocurrency hit a two-month low, while Dogecoin
rallied by about 20 percent. The SpaceX CEO has in recent months often tweeted in support of Dogecoin, but rarely for Bitcoin.  In a recent tweet,
Musk put out a statement from Tesla that it was “concerned” about the rapidly increasing use of fossil fuels for Bitcoin (price in India) mining and
transaction, and hence was suspending vehicle purchases using the cryptocurrency.  A day later he again tweeted saying, “To be clear, I strongly
believe in crypto, but it can't drive a massive increase in fossil fuel use, especially coal”.  It triggered a downward spiral for Bitcoin value but
the cryptocurrency has stabilised since.   A number of Twitter users welcomed Musk's statement. One of them said it's time people started realising
that Dogecoin “is here to stay” and another referred to Musk's previous assertion that crypto could become the world's future currency."""

for wrp in wrap(text, 150):  # Wrap the text into lines of maximum 150 characters
  print(wrp)  # Print each wrapped line
print("\n")  # Print a new line


# **Summarization of text with T5**
This section imports the libraries, loads the T5 model and tokenizer, sets the device to CPU, and optimizes the model with the Intel Extension for PyTorch.

### Intel optimizations
`import intel_extension_for_pytorch as ipex` imports the Intel Extension for PyTorch.  
`summary_model = ipex.optimize(summary_model)` optimizes the model with the Intel Extension for PyTorch.

In [None]:
import torch  # Import the torch library
from transformers import T5ForConditionalGeneration, T5Tokenizer  # Import T5 model and tokenizer from the transformers library
import os  # Import the os module
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # Set the CUDA_VISIBLE_DEVICES environment variable to '-1' to disable GPU usage

summary_model = T5ForConditionalGeneration.from_pretrained('t5-base')  # Load the T5 model for conditional generation
summary_tokenizer = T5Tokenizer.from_pretrained('t5-base')  # Load the T5 tokenizer
device = torch.device("cpu")  # Set the device to CPU
summary_model = summary_model.to(device)  # Move the summary model to the CPU device

import intel_extension_for_pytorch as ipex  # Import the Intel Extension for PyTorch module
summary_model = ipex.optimize(summary_model)  # Optimize the summary model using Intel Extension for PyTorch


#### The code below sets the random seed for reproducibility. The `set_seed` function takes an integer seed and applies it to the `random`, `numpy`, and `torch` generators, including all CUDA devices, so that random behavior is consistent across runs. It is then called with a seed of 42.


In [None]:
import random  # Import the random module
import numpy as np  # Import the numpy module

def set_seed(seed: int):
    random.seed(seed)  # Set the seed for the random module
    np.random.seed(seed)  # Set the seed for the numpy module
    torch.manual_seed(seed)  # Set the seed for the torch module
    torch.cuda.manual_seed_all(seed)  # Set the seed for all CUDA devices

set_seed(42)  # Call the set_seed function with seed value 42
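
The effect of seeding can be checked with a minimal sketch that uses only the standard-library `random` module. The helper `draw` is hypothetical and not part of the notebook; the same idea applies to the `numpy` and `torch` generators seeded by `set_seed`.

```python
import random

def draw(seed, n=3):
    # Hypothetical helper: re-seed the generator, then draw n pseudo-random integers.
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]

first = draw(42)
second = draw(42)

# Re-seeding with the same value replays the same sequence.
print(first == second)
```

Seeding fixes only the pseudo-random draws; it does not by itself guarantee identical model outputs across different library versions or hardware.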



### Importing libraries, downloading necessary resources from NLTK, defining functions for text post-processing and summarization, generating a summary, and printing the original and summarized text.

In [None]:
import nltk  # Import the nltk library
nltk.download('punkt')  # Download the Punkt tokenizer data
nltk.download('brown')  # Download the Brown corpus data
nltk.download('wordnet')  # Download the WordNet corpus data
from nltk.corpus import wordnet as wn  # Import the WordNet module from nltk.corpus
from nltk.tokenize import sent_tokenize  # Import the sent_tokenize function from nltk.tokenize

def postprocesstext(content):
    final = ""  # Initialize an empty string for the final processed text
    for sent in sent_tokenize(content):  # Iterate over each sentence in the content
        sent = sent.capitalize()  # Capitalize the sentence
        final = final + " " + sent  # Append the capitalized sentence to the final text
    return final  # Return the final processed text


def summarizer(text, model, tokenizer):
    text = text.strip().replace("\n", " ")  # Remove leading/trailing spaces and replace newline characters with spaces
    text = "summarize: " + text  # Prepend "summarize: " to the text
    max_len = 512  # Set the maximum length for encoding
    encoding = tokenizer.encode_plus(text, max_length=max_len, padding=False, truncation=True,
                                     return_tensors="pt").to(device)  # Encode the text; `padding=False` replaces the deprecated `pad_to_max_length=False`

    input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]  # Retrieve the input IDs and attention mask

    outs = model.generate(input_ids=input_ids,
                          attention_mask=attention_mask,
                          early_stopping=True,
                          num_beams=3,
                          num_return_sequences=1,
                          no_repeat_ngram_size=2,
                          min_length=75,
                          max_length=300)  # Generate the summary using the model

    dec = [tokenizer.decode(ids, skip_special_tokens=True) for ids in outs]  # Decode the generated summary
    summary = dec[0]  # Retrieve the first generated summary
    summary = postprocesstext(summary)  # Apply post-processing to the summary
    summary = summary.strip()  # Remove leading/trailing spaces

    return summary  # Return the generated summary


summarized_text = summarizer(text, summary_model, summary_tokenizer)  # Generate the summary using the provided text and models

print("\nOriginal Text >>")
for wrp in wrap(text, 150):  # Wrap the original text into lines of maximum 150 characters
    print(wrp)  # Print each wrapped line
print("\n")
print("Summarized Text >>")
for wrp in wrap(summarized_text, 150):  # Wrap the summarized text into lines of maximum 150 characters
    print(wrp)  # Print each wrapped line
print("\n")
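
One detail of `postprocesstext` is worth seeing in isolation: `str.capitalize()` uppercases the first character and lowercases the rest, so proper nouns inside a sentence lose their capitals. The sketch below illustrates this with a naive regex splitter standing in for NLTK's `sent_tokenize` (an assumption made to keep the example dependency-free). The helper names are hypothetical, and `capitalize_first_only` is one possible alternative, not the notebook's method.

```python
import re

def split_sentences(content):
    # Naive stand-in for nltk's sent_tokenize: split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", content.strip()) if s]

def capitalize_sentences(content):
    # Same per-sentence step as postprocesstext: str.capitalize() lowercases the tail.
    return " ".join(s.capitalize() for s in split_sentences(content))

def capitalize_first_only(content):
    # Alternative: uppercase only the first character and keep the rest untouched.
    return " ".join(s[0].upper() + s[1:] for s in split_sentences(content))

sample = "the SpaceX CEO tweeted about Dogecoin. musk rarely mentions Bitcoin."
print(capitalize_sentences(sample))
# The spacex ceo tweeted about dogecoin. Musk rarely mentions bitcoin.
print(capitalize_first_only(sample))
# The SpaceX CEO tweeted about Dogecoin. Musk rarely mentions Bitcoin.
```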



### Answer Span Extraction (Keywords and Noun Phrases)
This section imports the libraries, downloads the NLTK stopwords, defines the `get_nouns_multipartite` function for keyphrase extraction with the MultipartiteRank algorithm, and shows an example call.

In [None]:
import nltk  # Import the nltk library
nltk.download('stopwords')  # Download the stopwords data from NLTK
from nltk.corpus import stopwords  # Import the stopwords corpus from nltk.corpus
import string  # Import the string module for handling punctuation
import pke  # Import the pke library for keyphrase extraction
import traceback  # Import the traceback module for error handling

def get_nouns_multipartite(content):
    out = []  # Initialize an empty list for storing keyphrases
    try:
        extractor = pke.unsupervised.MultipartiteRank()  # Create an instance of MultipartiteRank for keyphrase extraction
        extractor.load_document(input=content, language='en')  # Load the content into the extractor with English language

        pos = {'PROPN', 'NOUN'}  # Define the part-of-speech tags to consider as candidate keyphrases
        stoplist = list(string.punctuation)  # Create a list of punctuation marks
        stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']  # Add additional brackets to the stoplist
        stoplist += stopwords.words('english')  # Add the English stopwords to the stoplist

        extractor.candidate_selection(pos=pos)  # Select candidate keyphrases by POS tag (the stoplist built above is not passed to this call)

        # Build the multipartite graph and rank candidates with a random walk.
        # alpha controls the strength of the weight adjustment; threshold is the minimum
        # similarity for clustering candidates into topics, and method is the linkage method.
        extractor.candidate_weighting(alpha=1.1, threshold=0.75, method='average')

        keyphrases = extractor.get_n_best(n=15)  # Retrieve the top 15 keyphrases

        for val in keyphrases:
            out.append(val[0])  # Append each keyphrase to the output list
    except Exception:  # Catch ordinary errors without swallowing KeyboardInterrupt
        out = []  # In case of an error, reset the output list to empty
        traceback.print_exc()  # Print the traceback information for debugging purposes

    return out  # Return the list of extracted keyphrases


# Example usage:
content = "This is an example sentence. Keyphrases are important for text analysis."
keyphrases = get_nouns_multipartite(content)
print(keyphrases)


### Importing the KeywordProcessor class from flashtext, defining the get_keywords function for extracting important keywords from the original and summarized text, and providing an example usage of the function.

In [None]:
from flashtext import KeywordProcessor  # Import the KeywordProcessor class from the flashtext library

def get_keywords(originaltext, summarytext):
    keywords = get_nouns_multipartite(originaltext)  # Extract keywords from the original text using the get_nouns_multipartite function
    print("keywords unsummarized: ", keywords)  # Print the extracted keywords

    keyword_processor = KeywordProcessor()  # Create an instance of the KeywordProcessor
    for keyword in keywords:
        keyword_processor.add_keyword(keyword)  # Add each keyword to the keyword processor

    keywords_found = keyword_processor.extract_keywords(summarytext)  # Extract keywords from the summary text using the keyword processor
    keywords_found = list(set(keywords_found))  # Remove duplicate keywords from the extracted keywords
    print("keywords_found in summarized: ", keywords_found)  # Print the extracted keywords found in the summary

    important_keywords = []
    for keyword in keywords:
        if keyword in keywords_found:  # Check if a keyword from the original text is present in the extracted keywords from the summary
            important_keywords.append(keyword)  # Append the important keyword to the list

    return important_keywords[:4]  # Return the top 4 important keywords


imp_keywords = get_keywords(text, summarized_text)  # Call the get_keywords function with the provided text and summarized text
print(imp_keywords)  # Print the important keywords
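
The filtering step in `get_keywords` can be sketched without flashtext. The version below is an approximation under stated assumptions: it uses a case-insensitive, word-boundary regex search in place of `KeywordProcessor`, and the function name `keywords_in_summary`, the keyword list, and the summary string are all made up for illustration.

```python
import re

def keywords_in_summary(ranked_keywords, summary, limit=4):
    # Keep keywords (already ranked on the original text) that also occur in the summary.
    kept = []
    for kw in ranked_keywords:
        # Match the keyword as a whole word or phrase, ignoring case.
        pattern = r"(?<!\w)" + re.escape(kw) + r"(?!\w)"
        if re.search(pattern, summary, flags=re.IGNORECASE):
            kept.append(kw)
    return kept[:limit]  # Preserve the original ranking and cap the result

ranked = ["dogecoin", "bitcoin", "tesla", "fossil fuels", "twitter users"]
summary = "Musk said Tesla will not accept Bitcoin. Dogecoin rallied by about 20 percent."
print(keywords_in_summary(ranked, summary))
# ['dogecoin', 'bitcoin', 'tesla']
```

Iterating over the ranked list rather than over the matches keeps the ordering from the keyphrase extractor, which is why the result follows the ranking and not the order of appearance in the summary.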


### Question generation with T5
Loading the T5 model and tokenizer for question generation, moving the model to the specified device, and optimizing the model using Intel Extension for PyTorch.

### Intel optimizations
`question_model = ipex.optimize(question_model)` optimizes the model with the Intel Extension for PyTorch.

In [None]:
question_model = T5ForConditionalGeneration.from_pretrained('ramsrigouthamg/t5_squad_v1')  # Load the T5 model for question generation
question_tokenizer = T5Tokenizer.from_pretrained('ramsrigouthamg/t5_squad_v1')  # Load the tokenizer for the T5 model
question_model = question_model.to(device)  # Move the question model to the device set earlier (CPU in this notebook)
question_model = ipex.optimize(question_model)  # Optimize the question model using Intel Extension for PyTorch

### The get_question function for generating a question given a context and an answer, the loop to print the summarized text, and the loop to generate questions for each important keyword and print them along with the capitalized form of the answer.

In [None]:
def get_question(context, answer, model, tokenizer):
    text = "context: {} answer: {}".format(context, answer)  # Format the context and answer into a text string
    encoding = tokenizer.encode_plus(text, max_length=384, padding=False, truncation=True, return_tensors="pt").to(device)  # `padding=False` replaces the deprecated `pad_to_max_length=False`
    input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

    outs = model.generate(input_ids=input_ids,
                          attention_mask=attention_mask,
                          early_stopping=True,
                          num_beams=5,
                          num_return_sequences=1,
                          no_repeat_ngram_size=2,
                          max_length=72)

    dec = [tokenizer.decode(ids, skip_special_tokens=True) for ids in outs]

    question = dec[0].replace("question:", "")  # Strip the "question:" prefix from the generated output
    question = question.strip()  # Remove leading and trailing whitespace
    return question


# Print the summarized text
for wrp in wrap(summarized_text, 150):
    print(wrp)
print("\n")

# Generate questions for each important keyword
for answer in imp_keywords:
    ques = get_question(summarized_text, answer, question_model, question_tokenizer)  # Generate a question for the answer using the question model and tokenizer
    print(ques)  # Print the generated question
    print(answer.capitalize())  # Print the capitalized form of the answer
    print("\n")
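
The text handling around the model call in `get_question` can be shown on its own. This is a small sketch with hypothetical helper names (`build_prompt`, `clean_question`) that mirror the string formatting and prefix stripping in the cell above. The decoded string is a made-up example, not real model output.

```python
def build_prompt(context, answer):
    # Input format used in get_question: the context, then the answer span to ask about.
    return "context: {} answer: {}".format(context, answer)

def clean_question(decoded):
    # Drop the "question:" prefix that the decoded output may carry, then trim whitespace.
    return decoded.replace("question:", "").strip()

prompt = build_prompt("Musk tweeted in support of Dogecoin.", "Dogecoin")
print(prompt)
# context: Musk tweeted in support of Dogecoin. answer: Dogecoin

print(clean_question("question: What did Musk tweet in support of?"))
# What did Musk tweet in support of?
```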

### Filter keywords with Maximal Marginal Relevance
This section downloads and extracts the sense2vec model, loads the sentence transformer model, and defines functions for filtering same-sense words, calculating similarity scores, generating sense2vec words, performing MMR (Maximal Marginal Relevance) keyword selection, and getting WordNet distractors. The main code at the end gets distractors for a given keyword and sentence using the loaded models and functions.

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

# Download and extract sense2vec model
!wget https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz
!tar -xvf s2v_reddit_2015_md.tar.gz

import numpy as np
from sense2vec import Sense2Vec
s2v = Sense2Vec().from_disk('s2v_old')

from sentence_transformers import SentenceTransformer
# Load the sentence transformer model
sentence_transformer_model = SentenceTransformer('msmarco-distilbert-base-v3')

from similarity.normalized_levenshtein import NormalizedLevenshtein
normalized_levenshtein = NormalizedLevenshtein()

def filter_same_sense_words(original, wordlist):
    filtered_words = []
    base_sense = original.split('|')[1]

    for eachword in wordlist:
        if eachword[0].split('|')[1] == base_sense:
            filtered_words.append(eachword[0].split('|')[0].replace("_", " ").title().strip())

    return filtered_words

def get_highest_similarity_score(wordlist, wrd):
    score = []
    for each in wordlist:
        score.append(normalized_levenshtein.similarity(each.lower(), wrd.lower()))
    return max(score)

def sense2vec_get_words(word, s2v, topn, question):
    output = []
    try:
        sense = s2v.get_best_sense(word, senses=["NOUN", "PERSON", "PRODUCT", "LOC", "ORG", "EVENT", "NORP", "WORK OF ART", "FAC", "GPE", "NUM", "FACILITY"])
        most_similar = s2v.most_similar(sense, n=topn)
        output = filter_same_sense_words(sense, most_similar)
    except Exception:  # No usable sense2vec entry for this word; fall back to no candidates
        output = []

    threshold = 0.6
    final = [word]
    checklist = question.split()
    for x in output:
        if get_highest_similarity_score(final, x) < threshold and x not in final and x not in checklist:
            final.append(x)

    return final[1:]

def mmr(doc_embedding, word_embeddings, words, top_n, lambda_param):
    word_doc_similarity = cosine_similarity(word_embeddings, doc_embedding)
    word_similarity = cosine_similarity(word_embeddings)

    keywords_idx = [np.argmax(word_doc_similarity)]
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]

    for _ in range(top_n - 1):
        candidate_similarities = word_doc_similarity[candidates_idx, :]
        target_similarities = np.max(word_similarity[candidates_idx][:, keywords_idx], axis=1)

        # Relevance to the document minus redundancy with already selected words
        mmr_scores = lambda_param * candidate_similarities - (1 - lambda_param) * target_similarities.reshape(-1, 1)
        mmr_idx = candidates_idx[np.argmax(mmr_scores)]

        keywords_idx.append(mmr_idx)
        candidates_idx.remove(mmr_idx)

    return [words[idx] for idx in keywords_idx]

!pip install scikit-learn-intelex
from sklearnex import patch_sklearn
patch_sklearn()
from collections import OrderedDict
from sklearn.metrics.pairwise import cosine_similarity

def get_distractors_wordnet(word):
    distractors = []
    try:
        syn = wn.synsets(word, 'n')[0]  # Take the first noun synset for the word
        word = word.lower()
        if len(word.split()) > 1:
            word = word.replace(" ", "_")  # WordNet joins multi-word lemmas with underscores
        hypernym = syn.hypernyms()
        if len(hypernym) == 0:
            return distractors
        for item in hypernym[0].hyponyms():  # Siblings of the word under its first hypernym
            name = item.lemmas()[0].name()
            if name.lower() == word:
                continue  # Skip the original word itself
            name = name.replace("_", " ")
            name = " ".join(w.capitalize() for w in name.split())
            if name not in distractors:
                distractors.append(name)
    except Exception:
        print("Wordnet distractors not found")

    return distractors

def get_distractors(word, origsentence, sense2vecmodel, sentencemodel, top_n, lambdaval):
    distractors = sense2vec_get_words(word, sense2vecmodel, top_n, origsentence)
    if len(distractors) == 0:
        return distractors
    distractors_new = [word.capitalize()]
    distractors_new.extend(distractors)

    embedding_sentence = origsentence + " " + word.capitalize()
    keyword_embedding = sentencemodel.encode([embedding_sentence])
    distractor_embeddings = sentencemodel.encode(distractors_new)

    max_keywords = min(len(distractors_new), 5)
    filtered_keywords = mmr(keyword_embedding, distractor_embeddings, distractors_new, max_keywords, lambdaval)

    final = [word.capitalize()]
    for wrd in filtered_keywords:
        if wrd.lower() != word.lower():
            final.append(wrd.capitalize())
    final = final[1:]

    return final

sent = "What cryptocurrency did Musk rarely tweet about?"
keyword = "Bitcoin"

print(get_distractors(keyword, sent, s2v, sentence_transformer_model, 40, 0.2))
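
The selection rule inside `mmr` can be sketched in pure Python to show how `lambda_param` trades relevance against diversity. This is a simplified illustration, not the notebook's implementation: it uses plain lists instead of NumPy arrays, the function names are hypothetical, and the two-dimensional vectors are invented toy values, not real embeddings. Unlike the original, the loop also stops early if it runs out of candidates.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(doc_vec, word_vecs, words, top_n, lambda_param):
    doc_sim = [cosine(w, doc_vec) for w in word_vecs]
    # Start with the word most similar to the document.
    selected = [max(range(len(words)), key=lambda i: doc_sim[i])]
    candidates = [i for i in range(len(words)) if i != selected[0]]
    while candidates and len(selected) < top_n:
        def score(i):
            # Redundancy: highest similarity to any word already selected.
            redundancy = max(cosine(word_vecs[i], word_vecs[j]) for j in selected)
            return lambda_param * doc_sim[i] - (1 - lambda_param) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [words[i] for i in selected]

# Toy vectors (assumed values): "Bitcoins" is a near-duplicate of "Bitcoin".
doc = [1.0, 0.0]
words = ["Bitcoin", "Bitcoins", "Ethereum"]
vecs = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5]]

print(mmr_select(doc, vecs, words, 2, 0.2))  # ['Bitcoin', 'Ethereum']
print(mmr_select(doc, vecs, words, 2, 0.9))  # ['Bitcoin', 'Bitcoins']
```

With a low `lambda_param` such as the 0.2 used above, the redundancy penalty dominates and the near-duplicate is skipped in favor of a more distinct word. A high value favors relevance and lets the near-duplicate through.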

### Finally, generating MCQs

Overall process of generating questions from the given context:

* Summarize the context using the summarizer model.

* Extract the keywords from the context and summary.

* Iterate over each keyword and generate a question using the question model and tokenizer.

* Retrieve distractors for the answer using the `get_distractors` function.

* Generate the output string with the question, answer, distractors, and summary.

In [None]:
context = '''Once upon a time, there lived a rabbit and tortoise. The rabbit could run fast. He was very proud of his speed. While the turtle was slow and consistent.
One day that tortoise came to meet him. The tortoise was walking very slow as usual. The rabbit looked and laughed at him.
The tortoise asked “what happened?”
The rabbit replied, “You walk so slowly! How can you survive like this?”.
The turtle listened to everything and felt humiliated by the rabbit’s words.
The tortoise replied, “Hey friend! You are very proud of your speed. Let’s have a race and see who is faster”.
The rabbit was surprised by the challenge of the tortoise. But he accepted the challenge as he thought it would be a cakewalk for him.
So, the tortoise and rabbit started the race. The rabbit was as usual very fast and went far away. While the tortoise was left behind.
After a while, the rabbit looked behind.
He said to himself, “The slow turtle will take ages to come near me. I should rest a bit”.
The rabbit was tired from running fast. The sun was high too. He ate some grass and decided to take a nap.
He said to himself, “I am confident; I can win even if the tortoise passes me. I should rest a bit”. With that thought, he slept and lost the track of time.
Meanwhile, the slow and steady turtle kept on moving. Although he was tired, he didn’t rest.
Sometime later, he passed the rabbit when the rabbit was still sleeping.
The rabbit suddenly woke up after sleeping for a long time. He saw that the tortoise was about to cross the finishing line.
He started running very fast with his full energy. But it was too late.
The slow turtle had already touched the finishing line. He has already won the race.
The rabbit was very disappointed with himself while the tortoise was very happy to win the race with his slow speed. He could not believe his eyes. He was shocked by the end results.
At last, the tortoise asked the rabbit “Now who is faster”. The rabbit had learned his lesson. He could not utter a word. The tortoise said bye to the rabbit and left that place calmly and happily.'''

def generate_question(context):
    summary_text = summarizer(context, summary_model, summary_tokenizer)
    keywords = get_keywords(context, summary_text)  # Named `keywords` rather than `np` to avoid shadowing the numpy alias

    output = ""
    for answer in keywords:
        ques = get_question(summary_text, answer, question_model, question_tokenizer)
        distractors = get_distractors(answer.capitalize(), ques, s2v, sentence_transformer_model, 40, 0.2)

        output += f"\nQuestion: {ques}\n"
        output += f"Answer: {answer.capitalize()}\n"

        if len(distractors) > 0:
            output += "Distractors:\n"
            for distractor in distractors[:4]:
                output += f"- {distractor}\n"

    summary = f"\nSummary: {summary_text}"
    output += summary
    return output

print(generate_question(context))