## Title: "IntelliGen: AI-Powered Question Generator for Personalized Learning Based on Bloom's Taxonomy"



Name = Manas Ajay Rathi

Email = 23manasrathi@gmail.com

Ph no = 9579933691

College = PICT, Pune


Importing Libraries and Setting Device:


The code imports the required libraries: torch (PyTorch), and T5ForConditionalGeneration and T5Tokenizer from the transformers library for loading the T5 model.
nltk is imported for sentence tokenization and for access to WordNet.
The code checks whether a GPU is available and sets the device accordingly.

Initializing Models and Tokenizers:

The T5 model for text summarization is loaded using T5ForConditionalGeneration.from_pretrained('t5-base'), and the corresponding tokenizer is loaded using T5Tokenizer.from_pretrained('t5-base').
The model is moved to the GPU if available.

Text Preprocessing and Summarization:

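The postprocesstext function capitalizes the first letter of each sentence in a given text (Python's str.capitalize also lowercases the remaining characters of the sentence).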
The summarizer function takes a text as input, preprocesses it, and uses the T5 model and tokenizer to generate a summary of the text.
The generated summary is then post-processed to capitalize the sentences using the postprocesstext function.
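
The capitalization step has a side effect worth seeing in isolation: acronyms such as "DL" come back as "Dl". The sketch below illustrates this under one assumption: a naive regex splitter stands in for `nltk.sent_tokenize` so that the example has no dependencies. The names `split_sentences`, `capitalize_sentences`, and `capitalize_first_only` are hypothetical and are not part of the notebook.

```python
import re

def split_sentences(text):
    # Naive stand-in for nltk.sent_tokenize: split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def capitalize_sentences(text):
    # Same idea as postprocesstext: str.capitalize() lower-cases everything after the first character.
    return " ".join(s.capitalize() for s in split_sentences(text))

def capitalize_first_only(text):
    # Alternative: upper-case only the first character and leave the rest untouched.
    return " ".join(s[0].upper() + s[1:] for s in split_sentences(text))

sample = "deep learning (DL) is popular. DL models use GPUs."
print(capitalize_sentences(sample))   # acronyms are lower-cased
print(capitalize_first_only(sample))  # acronyms survive
```

If preserving acronyms matters for the generated questions, the second variant may be the safer choice.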

Noun Phrase Extraction:

The get_nouns_multipartite function extracts the top noun phrases from a given text using the pke library's MultipartiteRank algorithm.
The algorithm considers candidate phrases with parts-of-speech tags as proper nouns (PROPN) and common nouns (NOUN).
The extracted noun phrases are returned as a list.

Question Generation:

The get_question function takes a context, answer, taxonomy level, model, and tokenizer as input.
It combines the context, answer, and taxonomy level into a single text and encodes it using the tokenizer.
The encoded input is passed to the T5 model to generate a question.
The generated question is extracted from the model's output and post-processed to remove the "question:" prefix and any leading or trailing whitespace.
The processed question is returned.
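
The input and output handling around the model can be sketched without loading T5 at all. The example below shows only the "context: ... answer: ..." prompt layout and the prefix stripping that appear in the notebook code; how a taxonomy level would be added to the prompt is not shown there, so it is left out. `build_prompt` and `clean_question` are hypothetical helper names.

```python
def build_prompt(context, answer):
    # Same "context: ... answer: ..." layout the notebook passes to the tokenizer.
    return "context: {} answer: {}".format(context, answer)

def clean_question(decoded):
    # Drop the "question:" prefix emitted by the model and trim surrounding whitespace.
    return decoded.replace("question:", "").strip()

prompt = build_prompt("Deep learning is a subset of machine learning.", "machine learning")
print(prompt)
print(clean_question("question: Deep learning is a subset of what?"))
```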

Generating Questions and Distractors:

The generate_question function takes a context, a radio-button choice (either "Wordnet" or "Sense2Vec"), and a taxonomy level as input.
It generates a summary of the context using the summarizer function.
Noun phrases are extracted from the summary using the get_nouns_multipartite function.
For each noun phrase, a question is generated using the get_question function and the specified taxonomy level.
If the radiobutton choice is "Wordnet," distractors are generated using the get_distractors_wordnet function. Otherwise, distractors are obtained using the get_distractors function with Sense2Vec word embeddings.
The questions, correct answers, and distractors are formatted and appended to the output string.
The summary is post-processed to highlight the noun phrases using HTML tags.
The final output string, including questions, answers, distractors, and the highlighted summary, is returned.
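
The highlighting step can be sketched as a single regex pass. This is one possible approach, not the notebook's exact method (the notebook uses repeated `str.replace` calls): it escapes the summary first, matches whole phrases case-insensitively, and wraps each match once so tags never nest. `highlight_phrases` is a hypothetical name.

```python
import html
import re

def highlight_phrases(summary, phrases):
    # Escape first so the summary text cannot inject markup of its own.
    escaped = html.escape(summary)
    # Longest phrases first so a longer phrase wins over a shorter one it contains.
    ordered = sorted({html.escape(p) for p in phrases if p}, key=len, reverse=True)
    if not ordered:
        return escaped
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )
    # One pass over the text, so an inserted <b> tag is never matched again.
    return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", escaped)

print(highlight_phrases("Dl models learn from data.", ["dl models", "data"]))
```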

Gradio Interface:

The Gradio interface is created using gr.Interface.
It takes the generate_question function as the main function and sets the input and output components.
The input components include a textbox for the context, radio buttons for the distractor generation method, and radio buttons for the taxonomy level.
The output component is a textbox to display the generated questions, answers, and distractors.
The interface is launched using iface.launch().

## Installation of libraries

In [None]:
!pip install --quiet flashtext==2.7
!pip install git+https://github.com/boudinfl/pke.git

Collecting git+https://github.com/boudinfl/pke.git
  Cloning https://github.com/boudinfl/pke.git to /tmp/pip-req-build-rher63jh
  Running command git clone --filter=blob:none --quiet https://github.com/boudinfl/pke.git /tmp/pip-req-build-rher63jh
  Resolved https://github.com/boudinfl/pke.git to commit 69871ffdb720b83df23684fea53ec8776fd87e63
  Preparing metadata (setup.py) ... done
time: 12.7 s (started: 2023-10-15 12:03:51 +00:00)


In [None]:
!pip install --quiet transformers==4.8.1
!pip install --quiet sentencepiece==0.1.95
!pip install --quiet textwrap3==0.9.2


  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  error: subprocess-exited-with-error

  × Building wheel for tokenizers (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for tokenizers (pyproject.toml) ... error
  ERROR: Failed building wheel for tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects
  Preparing metadata (setup.py) ... done
  Building wheel for sentencepiece (setup.py) ... done
time: 2min 9s (started: 2023-10-15 12:04:03 +00:00)


In [None]:
!pip install --quiet strsim==0.0.3
!pip install --quiet sense2vec==2.0.0

time: 11.9 s (started: 2023-10-15 12:06:13 +00:00)


In [None]:
!pip install --quiet ipython-autotime
%load_ext autotime

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 7.34 s (started: 2023-10-15 12:06:24 +00:00)


In [None]:
!pip install --quiet sentence-transformers==2.2.2

time: 4.62 s (started: 2023-10-15 12:06:32 +00:00)


## Example 1

In [None]:
from textwrap3 import wrap

text = """Deep Learning (DL) is a subset of machine learning that focuses on training artificial neural networks to learn and make predictions from large amounts of data. DL has gained significant popularity and has become a powerful tool across various domains due to its ability to automatically extract meaningful features from raw data.

DL finds its application in numerous fields, ranging from computer vision and natural language processing to healthcare and finance. One of the notable applications of DL is image recognition. DL models can be trained to accurately classify and identify objects within images, enabling tasks such as facial recognition, object detection, and autonomous driving. DL has also revolutionized the field of natural language processing, enabling machines to understand and generate human-like text. This has led to advancements in machine translation, sentiment analysis, chatbots, and more.

DL models have proven to be invaluable in the healthcare industry. They can analyze medical images, such as X-rays and MRIs, to assist in diagnosing diseases and detecting abnormalities. DL models are also used for predicting patient outcomes, drug discovery, and personalized medicine. In finance, DL models can analyze large amounts of financial data to make predictions about stock prices, fraud detection, credit scoring, and algorithmic trading.

DL has also found applications in recommendation systems, where it powers personalized suggestions for products, movies, music, and more. By analyzing user preferences and behavior, DL models can provide tailored recommendations that enhance user experience and engagement. Additionally, DL has made significant contributions to the field of robotics, enabling robots to perceive and interact with the environment autonomously.

Furthermore, DL has facilitated advancements in the field of autonomous vehicles. DL models can process sensor data from cameras, lidars, and radars to perceive the environment, detect objects, and make decisions in real-time, enabling self-driving cars."""

for wrp in wrap(text, 150):
  print (wrp)
print ("\n")

Deep Learning (DL) is a subset of machine learning that focuses on training artificial neural networks to learn and make predictions from large
amounts of data. DL has gained significant popularity and has become a powerful tool across various domains due to its ability to automatically
extract meaningful features from raw data.  DL finds its application in numerous fields, ranging from computer vision and natural language processing
to healthcare and finance. One of the notable applications of DL is image recognition. DL models can be trained to accurately classify and identify
objects within images, enabling tasks such as facial recognition, object detection, and autonomous driving. DL has also revolutionized the field of
natural language processing, enabling machines to understand and generate human-like text. This has led to advancements in machine translation,
sentiment analysis, chatbots, and more.  DL models have proven to be invaluable in the healthcare industry. They can analy

# **Summarization with T5**

In [None]:


!pip install transformers
!pip install torch

!pip install sentencepiece
import sentencepiece as spm


import torch
from transformers import T5ForConditionalGeneration,T5Tokenizer
summary_model = T5ForConditionalGeneration.from_pretrained('t5-base')
summary_tokenizer = T5Tokenizer.from_pretrained('t5-base')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
summary_model = summary_model.to(device)




For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


time: 33.1 s (started: 2023-10-15 12:06:36 +00:00)


In [None]:
import random
import numpy as np

def set_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)

time: 629 µs (started: 2023-10-15 12:07:10 +00:00)


In [None]:
import nltk
nltk.download('punkt')
nltk.download('brown')
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from nltk.tokenize import sent_tokenize

def postprocesstext (content):
  final=""
  for sent in sent_tokenize(content):
    sent = sent.capitalize()
    final = final +" "+sent
  return final


def summarizer(text,model,tokenizer):
  # T5 expects a task prefix; newlines are flattened so the input is one passage.
  text = text.strip().replace("\n"," ")
  text = "summarize: "+text
  max_len = 512
  # Inputs longer than max_len tokens are truncated; a single sequence needs no padding.
  encoding = tokenizer.encode_plus(text,max_length=max_len, padding=False,truncation=True, return_tensors="pt").to(device)

  input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

  # Beam search with a no-repeat bigram constraint; output length is 75 to 300 tokens.
  outs = model.generate(input_ids=input_ids,
                                  attention_mask=attention_mask,
                                  early_stopping=True,
                                  num_beams=3,
                                  num_return_sequences=1,
                                  no_repeat_ngram_size=2,
                                  min_length = 75,
                                  max_length=300)


  dec = [tokenizer.decode(ids,skip_special_tokens=True) for ids in outs]
  summary = dec[0]
  summary = postprocesstext(summary)
  summary= summary.strip()

  return summary


summarized_text = summarizer(text,summary_model,summary_tokenizer)


print ("\noriginal Text >>")
for wrp in wrap(text, 150):
  print (wrp)
print ("\n")
print ("Summarized Text >>")
for wrp in wrap(summarized_text, 150):
  print (wrp)
print ("\n")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



original Text >>
Deep Learning (DL) is a subset of machine learning that focuses on training artificial neural networks to learn and make predictions from large
amounts of data. DL has gained significant popularity and has become a powerful tool across various domains due to its ability to automatically
extract meaningful features from raw data.  DL finds its application in numerous fields, ranging from computer vision and natural language processing
to healthcare and finance. One of the notable applications of DL is image recognition. DL models can be trained to accurately classify and identify
objects within images, enabling tasks such as facial recognition, object detection, and autonomous driving. DL has also revolutionized the field of
natural language processing, enabling machines to understand and generate human-like text. This has led to advancements in machine translation,
sentiment analysis, chatbots, and more.  DL models have proven to be invaluable in the healthcare indust

# **Answer Span Extraction (Keywords and Noun Phrases)**

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string
import pke
import traceback

def get_nouns_multipartite(content):
    out=[]
    try:
        # 1. load the document (pke uses spaCy for tokenization and POS tagging).
        extractor = pke.unsupervised.MultipartiteRank()
        extractor.load_document(input=content,language='en')
        # 2. select candidates: sequences of proper nouns and common nouns.
        pos = {'PROPN','NOUN'}
        extractor.candidate_selection(pos=pos)
        # 3. build the Multipartite graph and rank candidates using random walk,
        #    alpha controls the weight adjustment mechanism, see TopicRank for
        #    threshold/method parameters.
        extractor.candidate_weighting(alpha=1.1,
                                      threshold=0.75,
                                      method='average')
        # 4. keep the 15 best keyphrases; each item is a (phrase, score) pair.
        keyphrases = extractor.get_n_best(n=15)

        for val in keyphrases:
            out.append(val[0])
    except Exception:
        # On failure, log the traceback and return an empty list.
        out = []
        traceback.print_exc()

    return out

time: 5.11 ms (started: 2023-10-15 12:07:12 +00:00)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
from flashtext import KeywordProcessor


def get_keywords(originaltext,summarytext):
  keywords = get_nouns_multipartite(originaltext)
  print ("keywords unsummarized: ",keywords)
  keyword_processor = KeywordProcessor()
  for keyword in keywords:
    keyword_processor.add_keyword(keyword)

  keywords_found = keyword_processor.extract_keywords(summarytext)
  keywords_found = list(set(keywords_found))
  print ("keywords_found in summarized: ",keywords_found)

  important_keywords =[]
  for keyword in keywords:
    if keyword in keywords_found:
      important_keywords.append(keyword)

  return important_keywords[:4]


imp_keywords = get_keywords(text,summarized_text)
print (imp_keywords)


keywords unsummarized:  ['dl models', 'data', 'image recognition', 'fields', 'application', 'machine learning', 'objects', 'finance', 'predictions', 'amounts', 'language processing', 'recommendation systems', 'user preferences', 'healthcare', 'images']
keywords_found in summarized:  ['data', 'amounts', 'machine learning', 'language processing', 'dl models', 'objects', 'images']
['dl models', 'data', 'machine learning', 'objects']
time: 1.31 s (started: 2023-10-15 12:07:12 +00:00)
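
The filtering logic above (keep only keyphrases that also occur in the summary, in their original rank order, capped at four) can be reproduced without flashtext. The sketch below substitutes a case-insensitive whole-word regex search for `KeywordProcessor`. That is an approximation, since flashtext resolves overlapping matches differently. `keep_keywords_in_summary` is a hypothetical name and the inputs are shortened versions of the output shown above.

```python
import re

def keep_keywords_in_summary(ranked_keywords, summary, limit=4):
    # Keep ranked keywords that occur in the summary as whole words, preserving rank order.
    kept = []
    for kw in ranked_keywords:
        if re.search(r"\b" + re.escape(kw) + r"\b", summary, flags=re.IGNORECASE):
            kept.append(kw)
    return kept[:limit]

ranked = ["dl models", "data", "image recognition", "machine learning", "objects", "images"]
summary = ("Deep learning (dl) is a subset of machine learning that learns from large amounts of data. "
           "Dl models can classify objects within images.")
print(keep_keywords_in_summary(ranked, summary))
```

Because the ranked list is iterated rather than the set of matches, the highest-ranked phrases are the ones that survive the cap.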


# **Question generation with T5**

In [None]:
question_model = T5ForConditionalGeneration.from_pretrained('ramsrigouthamg/t5_squad_v1')
question_tokenizer = T5Tokenizer.from_pretrained('ramsrigouthamg/t5_squad_v1')
question_model = question_model.to(device)

time: 4.16 s (started: 2023-10-15 12:07:14 +00:00)


In [None]:
def get_question(context,answer,model,tokenizer):
  # The question model expects the passage and the target answer in a single prompt.
  text = "context: {} answer: {}".format(context,answer)
  encoding = tokenizer.encode_plus(text,max_length=384, padding=False,truncation=True, return_tensors="pt").to(device)
  input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

  outs = model.generate(input_ids=input_ids,
                                  attention_mask=attention_mask,
                                  early_stopping=True,
                                  num_beams=5,
                                  num_return_sequences=1,
                                  no_repeat_ngram_size=2,
                                  max_length=72)


  dec = [tokenizer.decode(ids,skip_special_tokens=True) for ids in outs]


  Question = dec[0].replace("question:","")
  Question= Question.strip()
  return Question



for wrp in wrap(summarized_text, 150):
  print (wrp)
print ("\n")

for answer in imp_keywords:
  ques = get_question(summarized_text,answer,question_model,question_tokenizer)
  print (ques)
  print (answer.capitalize())
  print ("\n")


Deep learning (dl) is a subset of machine learning that focuses on training artificial neural networks to learn from large amounts of data. Dl models
can be trained to accurately classify and identify objects within images, enabling tasks such as facial recognition, object detection, and autonomous
driving. It has also revolutionized the field of natural language processing, leading to advancements in machine translation, sentiment analysis,
chatbots and more.


What can be trained to classify and identify objects within images?
Dl models


Deep learning trains neural networks to learn from large amounts of what?
Data


Deep learning is a subset of what?
Machine learning


What can Dl models be trained to classify within images?
Objects


time: 1.25 s (started: 2023-10-15 12:07:18 +00:00)


# **Gradio UI Visualization**

In [None]:
!pip install gradio==3.14.0





import gradio as gr

context = gr.inputs.Textbox(lines=10, placeholder="Enter paragraph/content here...")
output = gr.outputs.HTML(  label="Question and Answers")


def generate_question(context):
  summary_text = summarizer(context,summary_model,summary_tokenizer)
  for wrp in wrap(summary_text, 150):
    print (wrp)
  # Named noun_phrases rather than np so the numpy alias is not shadowed.
  noun_phrases =  get_keywords(context,summary_text)
  print ("\n\nNoun phrases",noun_phrases)
  output=""
  for answer in noun_phrases:
    ques = get_question(summary_text,answer,question_model,question_tokenizer)
    # Question in blue, followed by its answer in green.
    output = output + "<b style='color:blue;'>" + ques + "</b>"
    output = output + "<b style='color:green;'>" + "Ans: " +answer.capitalize()+  "</b>"
    output = output + "<br>"

  # Bold every answer phrase in the summary (lowercase and capitalized forms).
  summary ="Summary: "+ summary_text
  for answer in noun_phrases:
    summary = summary.replace(answer,"<b>"+answer+"</b>")
    summary = summary.replace(answer.capitalize(),"<b>"+answer.capitalize()+"</b>")
  output = output + "<p>"+summary+"</p>"

  return output

iface = gr.Interface(
  fn=generate_question,
  inputs=context,
  outputs=output)
iface.launch(share=True)




Collecting gradio==3.14.0
  Downloading gradio-3.14.0-py3-none-any.whl (13.8 MB)
Collecting fastapi (from gradio==3.14.0)
  Using cached fastapi-0.103.2-py3-none-any.whl (66 kB)
Collecting ffmpy (from gradio==3.14.0)
  Using cached ffmpy-0.3.1-py3-none-any.whl
Collecting httpx (from gradio==3.14.0)
  Downloading httpx-0.25.0-py3-none-any.whl (75 kB)
Collecting orjson (from gradio==3.14.0)
  Using cached orjson-3.9.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (138 kB)
Collecting pycryptodome (from gradio==3.14.0)
  Using cached pycryptodome-3.19.0-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
Collecting pydub (from gradio==3.14.0)
  Using cached pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Collecting python-multipart



Colab notebook detected. To show errors in colab notebook, set debug=True in launch()

Setting up a public link... we have recently upgraded the way public links are generated. If you encounter any problems, please report the issue and downgrade to gradio version 3.13.0.
Running on public URL: https://1e26499c-95c8-4044.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces




time: 22 s (started: 2023-10-15 12:08:19 +00:00)


# **Filter keywords with Maximal Marginal Relevance**

In [None]:
!wget https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz
!tar -xvf  s2v_reddit_2015_md.tar.gz

--2023-10-15 12:12:16--  https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/50261113/52126080-0993-11ea-8190-8f0e295df22a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20231015%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231015T121216Z&X-Amz-Expires=300&X-Amz-Signature=24fdcf2a042c2002bccd66e285a5c1da1b25b14402969e6332f67065448f026a&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=50261113&response-content-disposition=attachment%3B%20filename%3Ds2v_reddit_2015_md.tar.gz&response-content-type=application%2Foctet-stream [following]
--2023-10-15 12:12:16--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/50261113/52126080-0993-11ea-8190-8

In [None]:
import numpy as np
from sense2vec import Sense2Vec
s2v = Sense2Vec().from_disk('s2v_old')

time: 6.12 s (started: 2023-10-15 12:12:38 +00:00)


In [None]:
from sentence_transformers import SentenceTransformer
# paraphrase-distilroberta-base-v1
sentence_transformer_model = SentenceTransformer('msmarco-distilbert-base-v3')


time: 5.24 s (started: 2023-10-15 12:12:11 +00:00)


In [None]:
from similarity.normalized_levenshtein import NormalizedLevenshtein
normalized_levenshtein = NormalizedLevenshtein()

def filter_same_sense_words(original,wordlist):
  filtered_words=[]
  base_sense =original.split('|')[1]
  print (base_sense)
  for eachword in wordlist:
    if eachword[0].split('|')[1] == base_sense:
      filtered_words.append(eachword[0].split('|')[0].replace("_", " ").title().strip())
  return filtered_words

def get_highest_similarity_score(wordlist,wrd):
  score=[]
  for each in wordlist:
    score.append(normalized_levenshtein.similarity(each.lower(),wrd.lower()))
  return max(score)

def sense2vec_get_words(word,s2v,topn,question):
    output = []
    print ("word ",word)
    try:
      sense = s2v.get_best_sense(word, senses= ["NOUN", "PERSON","PRODUCT","LOC","ORG","EVENT","NORP","WORK OF ART","FAC","GPE","NUM","FACILITY"])
      most_similar = s2v.most_similar(sense, n=topn)
      # print (most_similar)
      output = filter_same_sense_words(sense,most_similar)
      print ("Similar ",output)
    except Exception:
      # No usable sense or no neighbours for this word: fall back to no candidates.
      output =[]

    threshold = 0.6
    final=[word]
    checklist =question.split()
    for x in output:
      if get_highest_similarity_score(final,x)<threshold and x not in final and x not in checklist:
        final.append(x)

    return final[1:]

from sklearn.metrics.pairwise import cosine_similarity

def mmr(doc_embedding, word_embeddings, words, top_n, lambda_param):
    """Pick top_n words that are relevant to the document yet dissimilar to each other."""

    # Extract similarity within words, and between words and the document
    word_doc_similarity = cosine_similarity(word_embeddings, doc_embedding)
    word_similarity = cosine_similarity(word_embeddings)

    # Initialize candidates and already choose best keyword/keyphrase
    keywords_idx = [np.argmax(word_doc_similarity)]
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]

    for _ in range(top_n - 1):
        # Extract similarities within candidates and
        # between candidates and selected keywords/phrases
        candidate_similarities = word_doc_similarity[candidates_idx, :]
        target_similarities = np.max(word_similarity[candidates_idx][:, keywords_idx], axis=1)

        # Calculate MMR: reward relevance to the document, penalize similarity to picks so far
        mmr_scores = (lambda_param) * candidate_similarities - (1-lambda_param) * target_similarities.reshape(-1, 1)
        mmr_idx = candidates_idx[np.argmax(mmr_scores)]

        # Update keywords & candidates
        keywords_idx.append(mmr_idx)
        candidates_idx.remove(mmr_idx)

    return [words[idx] for idx in keywords_idx]

time: 9.8 ms (started: 2023-10-15 12:13:23 +00:00)
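
To see how `lambda_param` trades relevance against diversity, here is a dependency-free sketch of the same MMR selection rule on hand-made 2-D vectors. The vectors and phrase labels are invented for illustration and are not real embeddings. `mmr_select` is a hypothetical name, and unlike the function above it stops early if the candidates run out.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr_select(doc_vec, cand_vecs, cands, top_n, lambda_param):
    # Relevance of every candidate to the document.
    relevance = [cosine(v, doc_vec) for v in cand_vecs]
    # Start with the single most relevant candidate.
    selected = [max(range(len(cands)), key=lambda i: relevance[i])]
    remaining = [i for i in range(len(cands)) if i != selected[0]]
    while remaining and len(selected) < top_n:
        def score(i):
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return lambda_param * relevance[i] - (1 - lambda_param) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [cands[i] for i in selected]

doc = [1.0, 0.0]
vectors = [[1.0, 0.1], [1.0, 0.2], [0.5, 1.0]]  # toy vectors, not real embeddings
names = ["neural nets", "neural network", "genetic algorithms"]
print(mmr_select(doc, vectors, names, 2, 0.9))  # relevance dominates
print(mmr_select(doc, vectors, names, 2, 0.2))  # diversity dominates
```

With a high `lambda_param` the second pick is the near-duplicate "neural network"; with a low value the more distant "genetic algorithms" is chosen instead, which is the behaviour wanted for distractors.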


In [None]:
from collections import OrderedDict
from sklearn.metrics.pairwise import cosine_similarity

def get_distractors_wordnet(word):
    """Return sibling terms (co-hyponyms) of `word` from WordNet as distractors."""
    distractors=[]
    try:
      # WordNet lemma names are lowercase and use underscores instead of spaces.
      word = word.lower().replace(" ","_")
      syn = wn.synsets(word,'n')[0]

      hypernym = syn.hypernyms()
      if len(hypernym) == 0:
          return distractors
      for item in hypernym[0].hyponyms():
          name = item.lemmas()[0].name()
          # Skip the answer itself (compare in the same underscore form).
          if name.lower() == word:
              continue
          name = name.replace("_"," ")
          name = " ".join(w.capitalize() for w in name.split())
          if name not in distractors:
              distractors.append(name)
    except Exception:
      print ("Wordnet distractors not found")
    return distractors

def get_distractors (word,origsentence,sense2vecmodel,sentencemodel,top_n,lambdaval):
  distractors = sense2vec_get_words(word,sense2vecmodel,top_n,origsentence)
  print ("distractors ",distractors)
  if len(distractors) ==0:
    return distractors
  distractors_new = [word.capitalize()]
  distractors_new.extend(distractors)
  # print ("distractors_new .. ",distractors_new)

  embedding_sentence = origsentence+ " "+word.capitalize()
  # embedding_sentence = word
  keyword_embedding = sentencemodel.encode([embedding_sentence])
  distractor_embeddings = sentencemodel.encode(distractors_new)

  # filtered_keywords = mmr(keyword_embedding, distractor_embeddings,distractors,4,0.7)
  max_keywords = min(len(distractors_new),5)
  filtered_keywords = mmr(keyword_embedding, distractor_embeddings,distractors_new,max_keywords,lambdaval)
  # filtered_keywords = filtered_keywords[1:]
  final = [word.capitalize()]
  for wrd in filtered_keywords:
    if wrd.lower() !=word.lower():
      final.append(wrd.capitalize())
  final = final[1:]
  return final

sent = "What is application of DL"
keyword = "neural networks"



print (get_distractors(keyword,sent,s2v,sentence_transformer_model,40,0.2))


word  neural networks
NOUN
Similar  ['Neural Nets', 'Neural Network', 'Genetic Algorithms', 'Artificial Neural Networks', 'Deep Learning', 'Machine Learning', 'Neural Networks', 'Pattern Recognition', 'Evolutionary Algorithms', 'Human Brain', 'Anns', 'Machine Learning', 'Computation', 'Algorithms', 'Complex Systems', 'Biological Systems', 'Quantum Computation', 'Computer Vision', 'Natural Language Processing', 'Complexity Theory', 'Neural Net', 'Computational Models', 'Biological Brains', 'Computer Programs', 'Human Brains', 'Heuristics', 'Artificial Intelligence', 'Information Processing', 'Training Data', 'Computational Complexity', 'Quantum Computing', 'Unsupervised Learning', 'Human Cognition', 'Mathematical Models', 'Turing Machines', 'Computations', 'Cognitive Psychology', 'Connectome', 'Computer Simulations', 'Machine Vision']
distractors  ['Genetic Algorithms', 'Artificial Neural Networks', 'Deep Learning', 'Machine Learning', 'Pattern Recognition', 'Evolutionary Algorithms', '
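
The gap between the "Similar" list and the "distractors" list comes from the edit-distance filter in `sense2vec_get_words`: candidates that are too close to the answer, or to a candidate already kept, are dropped. The sketch below reimplements that filter in plain Python, assuming normalized Levenshtein similarity is defined as 1 minus the edit distance divided by the longer string's length. The function names are hypothetical, and the extra check against words in the question is omitted.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, computed one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a, b):
    # Assumed definition: 1 - distance / length of the longer string.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def drop_near_duplicates(word, candidates, threshold=0.6):
    # Keep a candidate only if it is sufficiently different from everything kept so far.
    kept = [word]
    for cand in candidates:
        if max(similarity(cand.lower(), k.lower()) for k in kept) < threshold:
            kept.append(cand)
    return kept[1:]

candidates = ["Neural Nets", "Neural Network", "Genetic Algorithms", "Deep Learning"]
print(drop_near_duplicates("neural networks", candidates))
```

"Neural Nets" and "Neural Network" are only a few edits away from the answer, so they are rejected, which matches their absence from the distractor list above.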

In [None]:
import torch
from transformers import T5ForConditionalGeneration,T5Tokenizer
summary_model = T5ForConditionalGeneration.from_pretrained('t5-base')
summary_tokenizer = T5Tokenizer.from_pretrained('t5-base')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
summary_model = summary_model.to(device)


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


time: 5.49 s (started: 2023-10-15 12:13:51 +00:00)


In [None]:
import nltk
nltk.download('punkt')
nltk.download('brown')
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from nltk.tokenize import sent_tokenize

def postprocesstext (content):
  final=""
  for sent in sent_tokenize(content):
    sent = sent.capitalize()
    final = final +" "+sent
  return final


def summarizer(text,model,tokenizer):
  text = text.strip().replace("\n"," ")
  text = "summarize: "+text
  # print (text)
  max_len = 512
  encoding = tokenizer.encode_plus(text,max_length=max_len, pad_to_max_length=False,truncation=True, return_tensors="pt").to(device)

  input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

  outs = model.generate(input_ids=input_ids,
                                  attention_mask=attention_mask,
                                  early_stopping=True,
                                  num_beams=3,
                                  num_return_sequences=1,
                                  no_repeat_ngram_size=2,
                                  min_length = 75,
                                  max_length=300)


  dec = [tokenizer.decode(ids,skip_special_tokens=True) for ids in outs]
  summary = dec[0]
  summary = postprocesstext(summary)
  summary= summary.strip()

  return summary


summarized_text = summarizer(text,summary_model,summary_tokenizer)


print ("\noriginal Text >>")
for wrp in wrap(text, 150):
  print (wrp)
print ("\n")
print ("Summarized Text >>")
for wrp in wrap(summarized_text, 150):
  print (wrp)
print ("\n")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



original Text >>
Deep Learning (DL) is a subset of machine learning that focuses on training artificial neural networks to learn and make predictions from large
amounts of data. DL has gained significant popularity and has become a powerful tool across various domains due to its ability to automatically
extract meaningful features from raw data.  DL finds its application in numerous fields, ranging from computer vision and natural language processing
to healthcare and finance. One of the notable applications of DL is image recognition. DL models can be trained to accurately classify and identify
objects within images, enabling tasks such as facial recognition, object detection, and autonomous driving. DL has also revolutionized the field of
natural language processing, enabling machines to understand and generate human-like text. This has led to advancements in machine translation,
sentiment analysis, chatbots, and more.  DL models have proven to be invaluable in the healthcare indust
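
The summarizer passes its result through postprocesstext, which calls `str.capitalize()` on every sentence. `capitalize()` uppercases the first character and lowercases all the others, so acronyms such as "DL" or "MRIs" lose their casing in the summary. Below is a minimal sketch of an alternative that touches only the first character of each sentence. The name `capitalize_first` is hypothetical, and the regex splitter is a rough stand-in for `sent_tokenize` (it will mis-split abbreviations) so that the example runs without NLTK data.

```python
import re


def capitalize_first(text: str) -> str:
    """Uppercase the first character of each sentence and leave the rest untouched."""
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(s[:1].upper() + s[1:] for s in sentences if s)


sample = "deep learning (DL) is a subset of machine learning. DL models analyze MRIs."

# What str.capitalize() does to a single sentence:
print("DL models analyze MRIs.".capitalize())  # Dl models analyze mris.

# First-character-only capitalization keeps the acronyms intact:
print(capitalize_first(sample))
# Deep learning (DL) is a subset of machine learning. DL models analyze MRIs.
```

Whether this is preferable depends on the model output: T5 summaries are often entirely lowercase, in which case both approaches give the same result.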


# **SAMPLE TEXT**


Deep Learning (DL) is a subset of machine learning that focuses on training artificial neural networks to learn and make predictions from large amounts of data. DL has gained significant popularity and has become a powerful tool across various domains due to its ability to automatically extract meaningful features from raw data.

DL finds its application in numerous fields, ranging from computer vision and natural language processing to healthcare and finance. One of the notable applications of DL is image recognition. DL models can be trained to accurately classify and identify objects within images, enabling tasks such as facial recognition, object detection, and autonomous driving. DL has also revolutionized the field of natural language processing, enabling machines to understand and generate human-like text. This has led to advancements in machine translation, sentiment analysis, chatbots, and more.

DL models have proven to be invaluable in the healthcare industry. They can analyze medical images, such as X-rays and MRIs, to assist in diagnosing diseases and detecting abnormalities. DL models are also used for predicting patient outcomes, drug discovery, and personalized medicine. In finance, DL models can analyze large amounts of financial data to make predictions about stock prices, fraud detection, credit scoring, and algorithmic trading.

DL has also found applications in recommendation systems, where it powers personalized suggestions for products, movies, music, and more. By analyzing user preferences and behavior, DL models can provide tailored recommendations that enhance user experience and engagement. Additionally, DL has made significant contributions to the field of robotics, enabling robots to perceive and interact with the environment autonomously.

In [None]:
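import random
import string

import gradio as gr
import nltk
import pke
import torch
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.tokenize import sent_tokenize
from transformers import T5ForConditionalGeneration, T5Tokenizer

# question_model, question_tokenizer, s2v, sentence_transformer_model,
# get_distractors and get_distractors_wordnet are used below but are not defined
# in this cell; they are assumed to come from earlier cells of the notebook.
# stopwords.words('english') also needs the NLTK 'stopwords' corpus, which is
# assumed to have been downloaded earlier.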

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

summary_model = T5ForConditionalGeneration.from_pretrained('t5-base')
summary_tokenizer = T5Tokenizer.from_pretrained('t5-base')
summary_model = summary_model.to(device)

nltk.download('punkt')
nltk.download('brown')
nltk.download('wordnet')


def postprocesstext(content):
    final = ""
    for sent in sent_tokenize(content):
        sent = sent.capitalize()
        final = final + " " + sent
    return final


def summarizer(text, model, tokenizer):
    text = text.strip().replace("\n", " ")
    text = "summarize: " + text
    max_len = 512
    # Truncate to max_len tokens; a single sequence needs no padding.
    encoding = tokenizer.encode_plus(text, max_length=max_len, padding=False, truncation=True,
                                     return_tensors="pt").to(device)

    input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

    outs = model.generate(input_ids=input_ids,
                          attention_mask=attention_mask,
                          early_stopping=True,
                          num_beams=3,
                          num_return_sequences=1,
                          no_repeat_ngram_size=2,
                          min_length=75,
                          max_length=300)

    dec = [tokenizer.decode(ids, skip_special_tokens=True) for ids in outs]
    summary = dec[0]
    summary = postprocesstext(summary)
    summary = summary.strip()

    return summary


def get_nouns_multipartite(content):
    out = []
    try:
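        extractor = pke.unsupervised.MultipartiteRank()
        # Custom stoplist: punctuation, bracket tokens and English stopwords.
        stoplist = list(string.punctuation)
        stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']
        stoplist += stopwords.words('english')
        # Assumes pke >= 2.0, where the stoplist is passed to load_document();
        # older releases take it in candidate_selection() instead.
        extractor.load_document(input=content, language='en', stoplist=stoplist)
        # Keep only candidates made of proper nouns and common nouns.
        pos = {'PROPN', 'NOUN'}
        extractor.candidate_selection(pos=pos)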
        extractor.candidate_weighting(alpha=1.1,
                                      threshold=0.75,
                                      method='average')
        keyphrases = extractor.get_n_best(n=15)

        for val in keyphrases:
            out.append(val[0])
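    except Exception as exc:
        # Fall back to an empty list, but report the failure instead of hiding it.
        print("Keyphrase extraction failed:", exc)
        out = []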

    return out


def get_question(context, answer, taxonomy_level, model, tokenizer):
    text = "context: {} answer: {} taxonomy_level: {}".format(context, answer, taxonomy_level)
    # Truncate to 384 tokens; a single sequence needs no padding.
    encoding = tokenizer.encode_plus(
        text, max_length=384, padding=False, truncation=True, return_tensors="pt"
    ).to(device)
    input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

    outs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        early_stopping=True,
        num_beams=5,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        max_length=72,
    )

    dec = [tokenizer.decode(ids, skip_special_tokens=True) for ids in outs]

    # Strip the "question:" prefix that the model emits before the question text.
    question = dec[0].replace("question:", "")
    question = question.strip()
    return question


def generate_question(context, radiobutton, taxonomy_level):
    summary_text = summarizer(context, summary_model, summary_tokenizer)
    # Candidate answers: the top noun phrases found in the summary.
    noun_phrases = get_nouns_multipartite(summary_text)

    output = ""
    for answer in noun_phrases:
        ques = get_question(summary_text, answer, taxonomy_level, question_model, question_tokenizer)
        if radiobutton == "Wordnet":
            distractors = get_distractors_wordnet(answer)
        else:
            distractors = get_distractors(answer.capitalize(), ques, s2v, sentence_transformer_model, 40, 0.2)

        output = output + ques + "\n" + "Ans: " + answer.capitalize() + "\n"
        if len(distractors) > 0:
            output = output + "Distractors: " + ", ".join(distractors[:4]) + "\n"
        output = output + "\n"

    summary = "Summary: " + summary_text
    for answer in noun_phrases:
        # Highlight the phrase as extracted and in its capitalized spelling.
        # The second pass is skipped when both spellings are identical, so the
        # phrase is never wrapped in <b> tags twice.
        summary = summary.replace(answer, "<b>" + answer + "</b>")
        if answer.capitalize() != answer:
            summary = summary.replace(answer.capitalize(), "<b>" + answer.capitalize() + "</b>")
    output = output + summary
    return output


# Input and output components for the Gradio interface.
context = gr.Textbox(lines=10, placeholder="Enter paragraph/content here...")
radiobutton = gr.Radio(["Wordnet", "Sense2Vec"])
taxonomy_level = gr.Radio(["knowledge", "comprehension", "application", "analysis", "synthesis", "evaluation"])
output = gr.Textbox()

iface = gr.Interface(
    fn=generate_question,
    inputs=[context, radiobutton, taxonomy_level],
    outputs=output
)

iface.launch(debug=True)




Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.


Wordnet distractors not found
Wordnet distractors not found
Wordnet distractors not found
Wordnet distractors not found
word  Objects
NOUN
Similar  ['Objects', 'Other Objects', 'Object', 'Most Objects', 'Only Objects', 'Vectors', 'Particles', 'Single Object', 'Many Objects', 'Other Object', 'Planets', 'Real Objects', 'Different Objects', 'Entities', 'First Object', 'Massive Objects', 'Smaller Objects', 'Certain Objects', 'Object(S', 'Massive Object', 'Separate Objects', '3-Dimensional Space', 'Whole Object', 'Target Object', 'Earth-Moon System', 'Three-Dimensional Space', 'Second Object', 'Particles', 'Same Object', 'Multiple Objects', 'Celestial Bodies', 'Large Objects', 'Original Object', 'Nearby Objects', 'Individual Objects', 'Celestial Objects', 'Gravitational Fields', 'Local Space', 'Planetary Bodies']
distractors  ['Other Objects', 'Vectors', 'Particles', 'Single Object', 'Planets', 'Different Objects', 'Entities', 'Massive Objects', 'Separate Objects', '3-Dimensional Space', 'E
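
The "Similar" list printed above contains the answer itself ("Objects") and the entry "Particles" twice. get_distractors is defined outside this section, so its actual filtering is not shown here. As a minimal sketch of one cleanup step that such a list needs, the hypothetical helper `dedupe_candidates` below removes the answer and repeated entries while preserving order. It does not attempt to reproduce any similarity-based filtering that get_distractors may apply.

```python
def dedupe_candidates(answer, candidates):
    """Drop the answer itself and repeated entries, case-insensitively, keeping order."""
    seen = {answer.strip().lower()}
    unique = []
    for cand in candidates:
        key = cand.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cand)
    return unique


# A shortened excerpt of the "Similar" list from the output above.
similar = ["Objects", "Other Objects", "Object", "Particles", "Planets", "Particles"]
print(dedupe_candidates("Objects", similar))
# ['Other Objects', 'Object', 'Particles', 'Planets']
```

Near-duplicates such as "Object" survive this step; removing them would need a similarity measure rather than exact matching.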



time: 7min 37s (started: 2023-10-15 12:15:11 +00:00)
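
get_question builds its model input as "context: ... answer: ... taxonomy_level: ...", and the interface offers six Bloom's taxonomy levels. The sketch below shows one way to validate the level and normalize whitespace before the prompt reaches the tokenizer. The helper name `build_question_prompt` and the constant `BLOOM_LEVELS` are hypothetical; only the prompt format and the level names are taken from the code above.

```python
# The six levels offered by the taxonomy_level radio button.
BLOOM_LEVELS = ("knowledge", "comprehension", "application",
                "analysis", "synthesis", "evaluation")


def build_question_prompt(context, answer, taxonomy_level):
    """Return the prompt string in the format used by get_question."""
    # Treat a missing value as an empty string so it fails validation cleanly.
    level = (taxonomy_level or "").strip().lower()
    if level not in BLOOM_LEVELS:
        raise ValueError(
            "taxonomy_level must be one of {}, got {!r}".format(BLOOM_LEVELS, taxonomy_level)
        )
    # Collapse newlines and repeated spaces so the context is a single line.
    context = " ".join(context.split())
    return "context: {} answer: {} taxonomy_level: {}".format(context, answer, level)


print(build_question_prompt("Deep  learning\nis useful.", "Deep learning", "Analysis"))
# context: Deep learning is useful. answer: Deep learning taxonomy_level: analysis
```

Failing early on an unknown level gives a clearer error than letting the model generate from a malformed prompt.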
