<a href="https://colab.research.google.com/github/HimathR/Zentern-Public/blob/main/Question_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚧 **Installations**
Welcome to the question generator. To start, follow the instructions listed here. 
Run the following code block at the start of each run-time to acquire all the libraries and resources needed to execute the program!
Please note that a full installation takes 5-10 minutes on average, so please be patient!

In [None]:
!pip install git+https://github.com/ramsrigouthamg/Questgen.ai &> /dev/null
!pip install --quiet git+https://github.com/boudinfl/pke.git &> /dev/null
!python -m nltk.downloader universal_tagset &> /dev/null
!python -m spacy download en &> /dev/null
!wget https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz &> /dev/null
!tar -xvf  s2v_reddit_2015_md.tar.gz &> /dev/null
!ls s2v_old &> /dev/null
import nltk
nltk.download('stopwords')
print("Installations Complete")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Installations Complete


# 📚 **Text File Formatting**
When it comes to reading in files, you must ensure that the input files are of the correct format. 
The correct format for the best possible content generation should include:
*   A .txt file with no unknown or broken characters
*   No extraneous text (**anything not related to the main story**). Removing this stops any chance of these details being used in question generation. **Examples of extraneous details that can interfere with question generation include:**
 *   Page Numbers
 *   Publishing Details (including Author Name, ISBN)
 *   Standalone 1-2 Word Image Captions 
 *   Question/activity sections that may potentially appear throughout, or at the end of the book 


Because every book differs in when and where these extraneous details appear, this process has to be done **manually**.
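
The clean-up itself stays manual, but a small helper can point out lines that are worth a look. The sketch below is only an illustration: `flag_extraneous_lines`, its patterns, and the two-word caption threshold are assumptions and are not part of this notebook, and the sample text (including the ISBN) is made up. It flags candidates for review and never deletes anything.

```python
import re

# Hypothetical patterns: illustrative guesses, not rules used by this notebook.
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)
ISBN = re.compile(r"\bISBN\b", re.IGNORECASE)

def flag_extraneous_lines(text, max_caption_words=2):
    """Return (line_number, line, reason) tuples for lines worth checking by hand."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are harmless
        if PAGE_NUMBER.match(stripped):
            flagged.append((number, stripped, "page number"))
        elif ISBN.search(stripped):
            flagged.append((number, stripped, "publishing detail"))
        elif len(stripped.split()) <= max_caption_words:
            flagged.append((number, stripped, "short line (possible caption)"))
    return flagged

# Made-up sample text, for demonstration only.
sample = (
    "The fox ran across the field.\n"
    "12\n"
    "ISBN 978-0-00-000000-0\n"
    "A fox\n"
    "It hid under a log."
)
for number, line, reason in flag_extraneous_lines(sample):
    print(number, repr(line), reason)
```

Short lines of dialogue will also be flagged, so treat the result as a checklist rather than a list of lines to delete.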








# 📑 **Adding Text Files To Be Read In**
All story book files to be processed must be uploaded as a single zip file named "stories.zip".

The stories.zip file must be structured so that the program can read it. To convert an archive of stories into a stories.zip, follow the steps in the program found [here](https://colab.research.google.com/drive/1gDq0Mh2Eqjt2sIaF8rqrDtFGg_HTqgYj?usp=sharing)

To upload the zip file, navigate to the "files" section on the left of this page, and click on "upload to session storage".
[Click me](https://gyazo.com/726677814c4f254f3baf36ce65b5a676) for a visual guide. 

Then, simply select the stories.zip file and it will be added to this run-time. 
After this, execute the cell below.

❓ **Troubleshooting:**
* Ensure that the file name stories.zip contains no capital letters
* If the current run-time runs out, the zip file must be reuploaded (the run-time only runs out if the computer is turned off, internet connection is lost, or the browser is exited)
* Don't forget to execute the code block below
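
If the unzip step seems to do nothing, it can help to confirm that the upload actually arrived. The following is a minimal sketch using Python's standard `zipfile` module; `check_stories_zip` is a hypothetical helper, not part of this notebook, and the default path assumes the file was uploaded to `/content` as described above.

```python
import os
import zipfile

def check_stories_zip(path="/content/stories.zip"):
    """Return the .txt entries in the archive, or raise a descriptive error."""
    if not os.path.isfile(path):
        raise FileNotFoundError(path + " not found: re-upload stories.zip to session storage")
    if not zipfile.is_zipfile(path):
        raise ValueError(path + " is not a valid zip archive")
    with zipfile.ZipFile(path) as archive:
        # Only .txt entries are read by the question generator
        return [name for name in archive.namelist() if name.endswith(".txt")]

# Example call in Colab, after uploading the file:
# print(check_stories_zip())
```

An empty list means the archive was found but contains no .txt files.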


In [None]:
#!unzip /stories.zip &> /dev/null 
!unzip /content/stories.zip &> /dev/null 

# ⏭ **Running The Program**
Once this has been completed, run the code block below. Once again, please allow 2-3 minutes for relevant packages to download. 
You will know it worked as intended if it prints the file names of all the story books. It displays those names for 10 seconds and then clears the output, meaning it is now ready for action. 


In [None]:
# relevant libraries
from IPython.display import clear_output
from Questgen import main
from pprint import pprint
import nltk.data
import random 
import time
import csv 
import os
import re
# for punctuation checking
regex = re.compile(r'[,.!?]') # raw string avoids an invalid-escape warning

# used for question generation
qe = main.BoolQGen()
answer = main.AnswerPredictor()
qg = main.QGen()

# test reading in of extracted files
filenum = 0
for filename in os.listdir("/content/stories"):
    if filename.endswith("txt"): 
      filenum += 1
      print(filename)

time.sleep(10)
clear_output()

### **Next, run the below code block to generate questions!**
As a general rule of thumb, it takes approximately 10 minutes of processing time per 100 stories, though this can vary depending on things like story length. During this processing time you can leave the program running in the background and do something else. The message "PROCESSING COMPLETE" will appear in the output box once the program is finished.


In [None]:
def draw_progress(percent, barLen = 25): # this will print out current completion
    # it is defined as the percentage of current files processed divided by total files available
    progress = ""
    for i in range(barLen):
        if i < int(barLen * percent):
            progress += "█"
        else:
            progress += " "
    print("[%s] %.2f%%" % (progress, percent * 100))
              
def scramble_words(text):
  sorted_list = sorted(text.split(), key = len, reverse=True)
  if not sorted_list: # nothing to scramble in an empty story
    return ""
  if len(sorted_list) <= 3:
    longest_word = sorted_list[0]
  else:
    longest_word = random.choice(sorted_list[:3]) # retrieve one of the top 3 longest words in the text
  count = 0
  while "-" in longest_word or len(longest_word) <= 1:
    longest_word = random.choice(sorted_list[:3])
    count += 1
    if count > 3: # give up after a few attempts
      return ""   
  answer = regex.sub('', longest_word) # remove unnecessary punctuation
  answer = re.sub(u'\u2019', u"\u0027", answer) # replace curly apostrophes with straight ones
  longest = list(answer)
  # shuffle the letters and turn them back into a string
  shuffled_word = ''.join(random.sample(longest, len(longest)))
  # convert to answer format
  answer = (shuffled_word + " | " + answer + " [CORRECT]")
  return answer

def scramble_sentences(text):
  sentences = nltk.tokenize.sent_tokenize(text) # tokenize the sentences
  choice = random.choice(sentences) # choose a random sentence 
  random_sentence = (choice).split()
  count = 0
  while len(random_sentence) <= 2:
    choice = random.choice(sentences) # choose a random sentence 
    random_sentence = (choice).split()
    count += 1
    if count >= 5:
      return ""
  random.shuffle(random_sentence) # rearrange random sentence order
  shuffled_sentence = ' '.join(random_sentence) 
  answer = (shuffled_sentence + " | " + choice + " [CORRECT]") # convert to answer format
  return answer

def clean_word(random_word): # used to remove a trailing punctuation mark at the end (commas mainly)
  original = random_word
  random_word = list(random_word)
  end = 0 # 0 means no trailing punctuation was found
  if not random_word[-1].isalpha():
    # pop() drops the last character; remove() would drop its first occurrence instead
    end = str(random_word.pop())
  return ''.join(random_word), original, end

def fill_blanks(text):
  sentences = nltk.tokenize.sent_tokenize(text) # tokenize the sentences
  random_sentence = (random.choice(sentences)).split() # choose a random sentence
  copy = random_sentence[1:] # skip the first word
  if len(copy) <= 1: 
    return ""
  random_word = (random.choice(copy)) # choose a random word 
  random_word, original, end = clean_word((random_word)) # remove punctuation marks
  count = 0
  while len(random_word) < 3 and count <= 5: # make sure the word is long enough
    random_word = (random.choice(random_sentence))
    random_word,  original, end = clean_word((random_word))
    count += 1
    if count == 5: # if no words long enough found, skip the fill in the blanks qs
      return ""
  for item in range(len(random_sentence)): 
    new_word = random_sentence[item]
    new_word, original, end = clean_word(new_word) 
    if new_word == random_word: 
      if end != 0: # replace word in original sentence with blanks (_)
        blank = "_"*(len(random_word))+str(end) # keep the trailing punctuation mark
        random_sentence[item] = blank
      else:
        random_sentence[item] = "_"*(len(random_word))
  blanked_sentence = ' '.join(random_sentence)
  answer = (blanked_sentence + " | " + random_word + " [CORRECT]") # convert to answer format
  return answer

def create_mcq(payload):
  output = qg.predict_mcq(payload)
  question_storage = []
  contexts = []
  for key,value in output.items():
    if key == 'questions':
      for outputs in value:
        for key2, value2 in outputs.items():
            if key2 == 'question_statement': # this is the actual question
                    question = value2
            if key2 == 'answer': # this is the answer to the question
                    answer = value2
            if key2 == 'options': # these are the other (wrong) multiple choice answers
                    options = ",".join(value2)
            if key2 == 'context': 
                    # these are contexts for each question: 
                    # the section of text from which the question was derived
                    contexts.append(value2)
        finalstring = question + "," + answer + "[CORRECT]," + options
        question_storage.append(finalstring)
  return question_storage, contexts

# create output.csv document
with open("/output.csv", 'w', encoding='utf-16') as f: # same encoding as the rows appended later
  # these are the headings: can change as needed from here 
  header = ['Book Title', 'Sentence Scramble', 'Word Scramble', 'Fill Blanks', 'MCQ1', 'MCQ2', 'MCQ3'] #, 'Context1', 'Context2', 'Context3'
  writer = csv.writer(f)
  writer.writerow(header)
  writer.writerow([]) # blank row under the headings

testmode = True
# When this is set to 'True', it will make a test document.
# This will only run a few documents at a time, which can be used
# for just testing formatting and making sure the program set up
# is working (if needed).
# Set this value to False for normal operation (all stories are processed).
count = 0
iterations = 5 # change to how many stories you want to test out in a single run
for filename in os.listdir("/content/stories"):
    count += 1
    # stop once 'iterations' entries have been processed (">=" stopped one early)
    if testmode and count > iterations:
      break
    if count % 5 == 0:
      print(count)
      clear_output()
    if filename.endswith("txt"): 
      with open("/content/stories/" + filename, 'r') as file:
          data = file.read().replace('\n', ' ') # read in all story data as single string
          data = re.sub(u'\u2019',u"\u0027", data)
          print("Creating Questions For " + filename)
          payload = {"input_text": data}
          all_qs = [] # used to store all question types
          mcq_qs, contexts = create_mcq(payload) # for multiple choice qs
          blanks = fill_blanks(data) # for fill in the blank qs
          scrambled_sentences = scramble_sentences(str(data)) # for unscrambling full sentence qs
          scrambled_words = scramble_words(str(data)) # for unscrambling word qs
          print("Writing New Output To CSV")
          with open("/output.csv", 'a', encoding='utf-16') as f: 
              all_qs.insert(0, filename[:-4]) # Adds the book name
              # Adds the 3-5 generated questions
              all_qs.append(scrambled_sentences) 
              all_qs.append(scrambled_words)
              all_qs.append(blanks)
              for item in mcq_qs:
                all_qs.append(item)
              # adds context for each mcq question on same row 
              fixrow = ['', '', '', '']
              for item in contexts:
                fixrow.append(item)
              writer = csv.writer(f) 
              writer.writerow(all_qs)
              writer.writerow(fixrow)
          print("MODEL PROGRESS: ", end=' ')
          if not testmode:
            iterations = filenum
          progress = float(count/iterations)
          draw_progress(progress)

# automatically downloads the csv to browser once complete
print("PROCESSING COMPLETE, DOWNLOADING FILES")
from google.colab import files
files.download('/output.csv')

# ✅ **Finalisation**
All generated questions are automatically written to the output.csv file, which is then downloaded automatically.
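
The scramble and fill-in-the-blank cells in the CSV follow the pattern `prompt | answer [CORRECT]`. When reviewing the output, a helper along these lines can split such a cell back into its two parts. This is a sketch only: `split_answer_cell` is a hypothetical name and is not part of this notebook, and it does not handle the comma-separated multiple choice cells.

```python
def split_answer_cell(cell):
    """Split 'prompt | answer [CORRECT]' into (prompt, answer)."""
    if not cell:
        return "", ""  # the generator writes an empty cell when it gives up
    # rpartition splits on the last " | ", in case the prompt contains one
    prompt, _, answer = cell.rpartition(" | ")
    suffix = " [CORRECT]"
    if answer.endswith(suffix):
        answer = answer[: -len(suffix)]
    return prompt, answer

print(split_answer_cell("tca | cat [CORRECT]"))  # ('tca', 'cat')
```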

### ❗ **Some Important Final Notes**
This program will not always generate ideal questions, and some may lack coherence due to current limitations in its language processing. **Therefore, some degree of manual checking will almost always be needed.** That being said, it can still serve as a solid foundation, as most questions need only minimal human editing to be up to standard. 

### 📭 **Further Contact Details** 
This program was made by Himath Ratnayake for MELearning. Anybody who uses or improves upon this program in the future is welcome to reach out with any further questions. 
* 📧 **E-Mail 1:** himath4510@gmail.com
* 📧 **E-Mail 2:** himath.ratnayake@griffithuni.edu.au
* 📧 **Discord ID:** <@239156955988492298> (post this on any discord server to access profile)

### ⭐ **Other Useful Resources**
* **QuestGen Open-Source Library:** 
 The primary tool used to generate the multiple choice questions, using a module that has already been carefully developed and trained. 
 * Available at: https://github.com/ramsrigouthamg/Questgen.ai 

* **Natural Language Toolkit and Corpora:** 
 For relevant lexical resources (word banks), text processing features (tokenization) and other NLP tasks. 
 * Available at: https://www.nltk.org/ 

* **File Processing Program Link:**
 * Available at: https://colab.research.google.com/drive/1gDq0Mh2Eqjt2sIaF8rqrDtFGg_HTqgYj?usp=sharing 

