#Resources

* https://www.youtube.com/watch?v=ZzgUqFtxgXI
* https://www.youtube.com/watch?v=7aBRk_JP-qY
* https://learn.deeplearning.ai/langchain-chat-with-your-data/lesson/3/document-splitting

#Introduction

Title: Exam MCQ Generator

Game Plan:

*   Divide the document into multiple chunks, select chunks at random, pass each selected chunk to LangChain, and ask it to generate an MCQ from that chunk
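
The plan above can be sketched without any LLM dependency. This is only a minimal illustration: `fake_llm` is a hypothetical stand-in for the real model call, and the fixed-width splitter is a simplification of the LangChain splitter used later in the notebook.

```python
import random


def split_into_chunks(text, chunk_size):
    """Naive fixed-width split (the notebook uses LangChain's splitter instead)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def fake_llm(prompt):
    """Hypothetical stand-in for the model call; it only reports the prompt size."""
    return f"<MCQ generated from a prompt of {len(prompt)} characters>"


def make_mcq(text, chunk_size=50, rng=random):
    chunks = split_into_chunks(text, chunk_size)
    chunk = rng.choice(chunks)  # pick one chunk at random
    prompt = f"Text:\n{chunk}\nGenerate a multiple-choice question from this text."
    return chunk, fake_llm(prompt)


sample_text = "Dorothy lived in Kansas. " * 10
chunk, mcq = make_mcq(sample_text, rng=random.Random(0))
print(chunk)
print(mcq)
```

Passing a seeded `random.Random` makes the chunk selection repeatable, which helps when debugging prompts.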

#Installs

In [112]:
!pip install -q python-dotenv

!pip install -q langchain_experimental
!pip install -q langchain

!pip install -q openai

In [113]:
!pip install -q pydantic

#Imports

In [114]:
import os

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter

from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain

#Data Source

In [115]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [116]:
wizard_in_oz_text_path = '/content/drive/MyDrive/datasets/llm-from-scratch/wizard_in_oz.txt'

In [139]:
def read_text_file(path):
  with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

  return text

In [140]:
text = read_text_file(wizard_in_oz_text_path)

In [141]:
print(text[:100])

Title: Dorothy and the Wizard in Oz


Author: L. Frank Baum

Illustrator: John R. Neill

Release da


#Env

In [119]:
env_path = '/content/drive/MyDrive/credentials/data-analytics-demo/.env'

In [120]:
from dotenv import load_dotenv, find_dotenv

load_dotenv(env_path)

True

In [121]:
GOOGLE_PALM_API_KEY = os.environ['GOOGLE_PALM_API_KEY']
HUGGINGFACE_API_KEY = os.environ['HUGGINGFACE_API_KEY']
OPEN_AI_API_KEY = os.environ['OPEN_AI_API_KEY']

# GOOGLE_PALM_API_KEY, HUGGINGFACE_API_KEY, OPEN_AI_API_KEY
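
Indexing `os.environ` directly raises a bare `KeyError` when a key is missing from the `.env` file. A small helper can report every missing name at once. This is a sketch only: `require_env` is a hypothetical name, not part of `python-dotenv`, and the demonstration uses a plain dictionary with a made-up value instead of the real environment.

```python
import os


def require_env(names, environ=os.environ):
    """Return {name: value} for all names, or raise listing every missing one."""
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Demonstration with a stand-in mapping (the value is a placeholder, not a real key).
keys = require_env(["OPEN_AI_API_KEY"], environ={"OPEN_AI_API_KEY": "sk-example"})
print(sorted(keys))
```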

#Model

In [122]:
from pydantic import BaseModel, Field

class MCQModel(BaseModel):
  question: str = Field(description="This is the question text")
  options: list[str] = Field(description="This is a list of multiple choices or options available")
  correct_option: str = Field(description="This is the correct choice or option")

In [123]:
from langchain.output_parsers import PydanticOutputParser

pydantic_parser = PydanticOutputParser(pydantic_object=MCQModel)

format_instructions = pydantic_parser.get_format_instructions()

print(format_instructions)

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"question": {"title": "Question", "description": "This is the question text", "type": "string"}, "options": {"title": "Options", "description": "This is a list of multiple choices or options available", "type": "array", "items": {"type": "string"}}, "correct_option": {"title": "Correct Option", "description": "This is the correct choice or option", "type": "string"}}, "required": ["question", "options", "correct_option"]}
```


In [124]:
langchain_llm = ChatOpenAI(openai_api_key=OPEN_AI_API_KEY, model_name="gpt-3.5-turbo-0613", verbose=False)

In [125]:
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

In [126]:
texts = text_splitter.create_documents([text])

In [127]:
print(f'Total Documents Created: {len(texts)}')

Total Documents Created: 287
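
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified greedy splitter written with the standard library only. It approximates the idea behind the splitter configured above and is not LangChain's actual implementation; the function name and the tiny sizes in the demonstration are illustrative assumptions.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Pack separator-delimited pieces into chunks of at most chunk_size
    characters, repeating up to chunk_overlap trailing characters of each
    chunk at the start of the next one."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until the carried-over tail fits the overlap
            # budget and still leaves room for the next piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


lines = [f"line {i:02d}" for i in range(10)]
chunks = split_with_overlap("\n".join(lines), chunk_size=25, chunk_overlap=8)
for c in chunks:
    print(repr(c))
```

In this sketch a single piece longer than `chunk_size` still becomes its own oversized chunk, because pieces are never cut in the middle.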


In [128]:
prompt = (
    "Text:"
    "\n{text}"
    "\nGenerate a Multiple Choice Question from this text. Also return the correct option."
    "\nDo not use any outside information"
    "\n{format_instructions}"
)

In [129]:
prompt_template = ChatPromptTemplate.from_template(prompt)

prompt_template

ChatPromptTemplate(input_variables=['format_instructions', 'text'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['format_instructions', 'text'], template='Text:\n{text}\nGenerate a Multiple Choice Question from this text. Also return the correct option.\nDo not use any outside information\n{format_instructions}'))])
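
The format instructions themselves contain literal braces. A minimal sketch with plain `str.format`, using a shortened stand-in template and made-up values, suggests why that is harmless: braces inside a substituted value are not re-parsed as placeholders. LangChain's own template handling may differ in detail.

```python
# Shortened, hypothetical template with the same two placeholders as the prompt.
template = "Text:\n{text}\n{format_instructions}"

# Made-up instructions containing literal JSON braces.
instructions = 'Reply with JSON such as {"question": "...", "options": ["..."]}'

filled = template.format(
    text="Dorothy lived in Kansas.",
    format_instructions=instructions,
)
print(filled)
```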

In [130]:
llm_chain = LLMChain(llm=langchain_llm, prompt=prompt_template, output_key="result")

In [131]:
from langchain.callbacks import get_openai_callback as get_langchain_openai_callback
import random

def get_question():
  doc_index = random.randint(0, len(texts)-1)

  with get_langchain_openai_callback() as cb_langchain:
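    # Pass the chunk's text, not the Document object, so the prompt does not contain the object's repr
    response = llm_chain.invoke({"text": texts[doc_index].page_content, "format_instructions": format_instructions})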

  return {'doc_index': doc_index, 'mcq': response['result'], 'cb_langchain': cb_langchain}
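
Because `get_question` draws an index independently on every call, repeated calls can land on the same chunk. When several questions are needed, one option is to sample the indices up front without replacement. The helper below is a hypothetical sketch using only the standard library; the count of 287 is taken from the output above.

```python
import random


def pick_chunk_indices(n_chunks, n_questions, seed=None):
    """Return distinct chunk indices; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(range(n_chunks), k=min(n_questions, n_chunks))


indices = pick_chunk_indices(n_chunks=287, n_questions=5, seed=42)
print(indices)
```

Each returned index could then be used in place of the `random.randint` draw.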

#Testing

In [132]:
response_raw = get_question()
response_parsed = MCQModel.parse_raw(response_raw['mcq'])

In [133]:
print(response_parsed)

print('\n')
print(f"doc_index = {response_raw['doc_index']}")
print('\ncosting:\n')
print(f"{response_raw['cb_langchain']}")

question='Who ruled the Land of Oz?' options=['The Princess', 'The Wizard', 'The Witches', 'The Munchkins'] correct_option='The Wizard'


doc_index = 218

costing:

Tokens Used: 538
	Prompt Tokens: 497
	Completion Tokens: 41
Successful Requests: 1
Total Cost (USD): $0.0008275
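
The reported total can be reproduced by hand. The per-token rates below are an assumption: they are not stated anywhere in this notebook and were chosen because they reproduce the figure shown above.

```python
PROMPT_RATE_PER_1K = 0.0015      # USD per 1,000 prompt tokens (assumed)
COMPLETION_RATE_PER_1K = 0.002   # USD per 1,000 completion tokens (assumed)


def estimate_cost(prompt_tokens, completion_tokens):
    """Cost in USD for one request under the assumed rates."""
    return (
        prompt_tokens * PROMPT_RATE_PER_1K
        + completion_tokens * COMPLETION_RATE_PER_1K
    ) / 1000


prompt_tokens, completion_tokens = 497, 41
cost = estimate_cost(prompt_tokens, completion_tokens)
print(f"Tokens used: {prompt_tokens + completion_tokens}")
print(f"Estimated cost (USD): ${cost:.7f}")
```

Under these rates the prompt tokens account for most of the cost, so the chunk size is the main lever on the price per question.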


In [134]:
response_parsed

MCQModel(question='Who ruled the Land of Oz?', options=['The Princess', 'The Wizard', 'The Witches', 'The Munchkins'], correct_option='The Wizard')
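
The schema only constrains field types, so it cannot guarantee that `correct_option` is actually one of the `options`. A small standard-library check can catch that before a question is shown to a student. This is a sketch under stated assumptions: `check_mcq` is a hypothetical helper that works on the raw JSON string, not part of Pydantic or LangChain.

```python
import json


def check_mcq(raw_json):
    """Return a list of problems found in a raw MCQ JSON string (empty if none)."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for key in ("question", "options", "correct_option"):
        if key not in data:
            problems.append(f"missing field: {key}")

    options = data.get("options", [])
    if not isinstance(options, list):
        problems.append("options is not a list")
    elif not all(isinstance(o, str) for o in options):
        problems.append("options must all be strings")
    elif len(set(options)) != len(options):
        problems.append("options contain duplicates")

    if (
        "correct_option" in data
        and isinstance(options, list)
        and data["correct_option"] not in options
    ):
        problems.append("correct_option is not one of the options")
    return problems


good = json.dumps({
    "question": "Who ruled the Land of Oz?",
    "options": ["The Princess", "The Wizard", "The Witches", "The Munchkins"],
    "correct_option": "The Wizard",
})
print(check_mcq(good))
```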

#Pydantic Demo

In [135]:
from pydantic import BaseModel, EmailStr, validator

class User(BaseModel):
  name: str
  # email: EmailStr
  email: str
  account_id: int

  @validator("account_id")
  def validate_account_id(cls, value):
    if value<=0:
      raise ValueError(f'Account ID must be positive: {value}')

    return value

user = User(name='jack', email='a@b.com', account_id=1234)

print(user)
print(user.json())
print(user.dict())
print(user.parse_raw(user.json()))

name='jack' email='a@b.com' account_id=1234
{"name": "jack", "email": "a@b.com", "account_id": 1234}
{'name': 'jack', 'email': 'a@b.com', 'account_id': 1234}
name='jack' email='a@b.com' account_id=1234


#Working Demo

In [142]:
texts_path = '/content/drive/MyDrive/datasets/exam_mcq_generator/texts'

In [147]:
text_path_01 = os.path.join(texts_path, '01.txt')
text_path_02 = os.path.join(texts_path, '02.txt')
text_path_03 = os.path.join(texts_path, '03.txt')
text_path_04 = os.path.join(texts_path, '04.txt')
text_path_05 = os.path.join(texts_path, '05.txt')
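
If the number of files grows, the numbered paths could be built and read in a loop instead of one variable per file. The sketch below assumes the `01.txt`, `02.txt`, ... naming seen above; `read_numbered_texts` is a hypothetical helper, and the demonstration writes to a throwaway temporary directory rather than Google Drive.

```python
import tempfile
from pathlib import Path


def read_numbered_texts(folder, count):
    """Read 01.txt, 02.txt, ... up to `count` from `folder` into a list of strings."""
    folder = Path(folder)
    return [
        (folder / f"{i:02d}.txt").read_text(encoding="utf-8")
        for i in range(1, count + 1)
    ]


# Demonstration against a temporary directory with made-up file contents.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(1, 4):
        (Path(tmp) / f"{i:02d}.txt").write_text(f"sample text {i}", encoding="utf-8")
    demo_texts = read_numbered_texts(tmp, 3)

print(demo_texts)
```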

In [158]:
text_01 = read_text_file(text_path_01)

display(text_01)

"In the small coastal village of Eldridge Haven, where the briny scent of the ocean mingled with the sweet fragrance of blooming flowers, lived a young woman named Seraphina. Her days were filled with the rhythmic ebb and flow of tide, the gentle whispers of the wind through ancient willows, and the comforting routine of life in a close-knit community. Yet, beneath the surface of tranquil routine lurked an undercurrent of restlessness that stirred within Seraphina's soul, beckoning her toward the unknown.\n\nOne fateful evening, as the sun dipped below the horizon, casting a warm palette of pinks and purples across the sky, Seraphina took a solitary stroll along the shore. The rhythmic crashing of the waves seemed to sync with the beating of her heart, echoing the unspoken desire for adventure that tugged at the edges of her consciousness. Little did she know that this serene walk would unravel the fabric of her familiar existence and set her on a path of unforeseen destiny.\n\nIn the 

In [159]:
text_02 = read_text_file(text_path_02)

display(text_02)

'In the not-so-distant future, a revolutionary technological marvel known as SynthEra emerged from the laboratories of Quantum Dynamics Innovations, forever altering the landscape of human experience. SynthEra, a groundbreaking neural interface, represented the pinnacle of advancements in neurotechnology, seamlessly blending the digital and biological realms.\n\nAt its core, SynthEra was designed to establish a direct and symbiotic connection between the human brain and the vast expanse of digital information. Consisting of a sleek neuro-interface headset adorned with an intricate web of sensors, SynthEra employed advanced neurostimulation algorithms to interpret neural signals, transforming thoughts into actionable commands. The device seamlessly integrated with the user\'s cognitive functions, creating an immersive and intuitive interface for navigating the digital world.\n\nThe possibilities unlocked by SynthEra were nothing short of awe-inspiring. Users could interact with augmente

In [160]:
text_03 = read_text_file(text_path_03)

display(text_03)

"Deep within the esoteric realm of abstract mathematics, a conundrum known as the Collatz conjecture captivated the minds of scholars and amateurs alike. Born in the mid-20th century from the intellect of German mathematician Lothar Collatz, this seemingly straightforward problem birthed a puzzle that has defied resolution for decades, its elusiveness challenging the very core of mathematical understanding.\n\nAt its heart, the Collatz conjecture revolves around the interplay of natural numbers. The rules are deceptively simple: take any positive integer 'n,' if it is even, divide it by 2; if it is odd, triple it and add 1. Repeat this process iteratively, and the conjecture posits that, regardless of the starting value, the sequence will eventually reach the elusive cycle of 4, 2, 1, and from thereon endlessly loop in this enigmatic dance.\n\nMathematicians, drawn to the allure of simplicity masking complexity, embarked on a journey to unlock the secrets embedded in the Collatz conjec

In [161]:
text_04 = read_text_file(text_path_04)

display(text_04)

"In the vast expanse of human linguistic articulation, poetry emerges as an enigmatic manifestation, a convolution of lexical cadence that transcends the conventional confines of mundane discourse. Within the nebulous realms of metaphorical articulation, poets, as lexical alchemists, wield a contrivance of language that begets an ethereal symphony, wherein the elusive shades of human sentiment meander.\n\nThe nuanced choreography of syllabic configurations and rhythmic resonances imbues poets with the capacity to metamorphose prosaic verbiage into an intricate mosaic of expressive profundity. In its essence, a meticulously wrought poem becomes a tapestry of allegorical impressions, each epithet and turn of phrase akin to a brushstroke upon the canvas of cognition.\n\nThe potency of poetry resides not solely within its articulated lexicon but rather within the interstices of verbal modulations, the intervals of pregnant reticence that implore the discerning reader to tarry and ruminate.

In [162]:
text_05 = read_text_file(text_path_05)

display(text_05)

"The history of Large Language Models (LLMs) is a fascinating journey that spans the intersection of artificial intelligence, natural language processing, and machine learning. The roots of LLMs can be traced back to the early endeavors in computational linguistics, where researchers sought to develop systems capable of understanding and generating human-like language.\n\nThe seeds of LLMs were sown in the mid-20th century, with early attempts at computer-based language translation. The Georgetown-IBM experiment in 1954 marked a pioneering effort, utilizing an early computer to translate Russian sentences into English. However, these early endeavors were constrained by the limited computational power of the time, and the results fell short of achieving true linguistic nuance.\n\nAs computational capabilities burgeoned in the latter half of the 20th century, researchers delved into more sophisticated approaches to natural language processing. Rule-based systems emerged, relying on prede