# Interview Questions Creator

Inside this folder (`Interview_Questions_Creator/`) we will build a pipeline that generates interview questions from a set of source documents.

We want the model to provide not only the questions but also the answers.

We will first implement everything in this Jupyter notebook, then refactor the code into modules inside `Interview_Questions_Creator/`.

## Setup

First of all, we create a conda virtual environment specifically for this project and install all the required packages in it (see `README.md`).

```bash
conda create -n interview python=3.10 -y
```

Then activate the environment and install the dependencies:
```bash 
conda activate interview
pip install -r requirements.txt
```

In [1]:
# get OpenAI key

from dotenv import load_dotenv
import os

load_dotenv()
openai_key = os.getenv("OPENAI_API_KEY")
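`os.getenv` returns `None` when the variable is missing, so a misconfigured `.env` file only shows up later as an authentication error. As a minimal sketch (the helper name `require_env` is hypothetical, not part of this project), we could fail early with a clearer message:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    `require_env` is a hypothetical helper introduced for illustration.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value


# Usage, after load_dotenv() has run:
# openai_key = require_env("OPENAI_API_KEY")
```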

## Get Data

Next, we need to load the data from which we will generate questions and answers. The source file is stored in the `data/` folder.

### Extract Documents

In [2]:
from langchain.document_loaders import PyPDFLoader

In [3]:
file_path = "data/SDG.pdf"
loader = PyPDFLoader(file_path=file_path)
data = loader.load()
data

[Document(metadata={'producer': 'Adobe PDF Library 15.0', 'creator': 'Adobe InDesign CC 13.0 (Macintosh)', 'creationdate': '2017-12-26T16:03:25-05:00', 'moddate': '2017-12-27T15:29:10-05:00', 'trapped': '/False', 'source': 'data/SDG.pdf', 'total_pages': 24, 'page': 0, 'page_label': '1'}, page_content=''),
 Document(metadata={'producer': 'Adobe PDF Library 15.0', 'creator': 'Adobe InDesign CC 13.0 (Macintosh)', 'creationdate': '2017-12-26T16:03:25-05:00', 'moddate': '2017-12-27T15:29:10-05:00', 'trapped': '/False', 'source': 'data/SDG.pdf', 'total_pages': 24, 'page': 1, 'page_label': '2'}, page_content=''),
 Document(metadata={'producer': 'Adobe PDF Library 15.0', 'creator': 'Adobe InDesign CC 13.0 (Macintosh)', 'creationdate': '2017-12-26T16:03:25-05:00', 'moddate': '2017-12-27T15:29:10-05:00', 'trapped': '/False', 'source': 'data/SDG.pdf', 'total_pages': 24, 'page': 2, 'page_label': '3'}, page_content='IN THE YEAR 2015, LEADERS FROM 193 COUNTRIES OF THE WORLD \nCAME TOGETHER TO FACE T

In [4]:
question_gen = ""
for page in data:
    question_gen += page.page_content
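The loop above joins pages with no separator, so the last word of one page can run into the first word of the next. A possible variant, sketched here with a hypothetical `Page` stand-in class (only the `page_content` attribute is assumed, mirroring the loaded documents), skips empty pages and inserts a newline between the rest:

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only `page_content` is used."""
    page_content: str


def join_pages(pages, separator="\n"):
    """Concatenate non-empty page texts, placing a separator between pages."""
    texts = [p.page_content.strip() for p in pages]
    return separator.join(t for t in texts if t)


# Toy data: two empty pages (like the first pages of the PDF) and two with text.
pages = [Page(""), Page(""), Page("First page text."), Page("Second page text.")]
print(join_pages(pages))
```

With the real data, `join_pages(data)` would play the role of the loop that builds `question_gen`.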

In [5]:
question_gen

'IN THE YEAR 2015, LEADERS FROM 193 COUNTRIES OF THE WORLD \nCAME TOGETHER TO FACE THE FUTURE.\nAnd what they saw was daunting. Famines. Drought. Wars. Plagues. Poverty. \nNot just in some faraway place, but in their own cities and towns and villages.\nThey knew things didn’t have to be this way. They knew we had enough \nfood to feed the world, but that it wasn’t getting shared. They knew there \nwere medicines for HIV and other diseases, but they cost a lot. They knew \nthat earthquakes and floods were inevitable, but that the high death \ntolls were not. \nThey also knew that billions of people worldwide shared their hope for a \nbetter future.\nSo leaders from these countries created a plan called the Sustainable \nDevelopment Goals (SDGs). This set of 17 goals imagines a future just 15 years \noff that would be rid of poverty and hunger, and safe from the worst effects of \nclimate change. It’s an ambitious plan. \nBut there’s ample evidence that we can succeed. In the past 15 yea

This is the full text, successfully extracted from the file. Next, we split it into chunks:

### Process Data Into Chunks

We can create chunks from documents in [different ways](https://python.langchain.com/docs/how_to/split_by_token/). When working with OpenAI models, it makes sense to measure chunk size in tokens using `tiktoken`, a fast BPE tokenizer created by OpenAI, since tokens are how these models count their input.

To do so, we will use LangChain's [`TokenTextSplitter`](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.TokenTextSplitter.html#tokentextsplitter), which works with tiktoken directly and ensures that each split is no larger than the chunk size.
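To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of sliding-window splitting over a token list. It is not LangChain's implementation: the function name `split_on_tokens` is hypothetical, and whitespace-separated words stand in for real BPE tokens.

```python
def split_on_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most `chunk_size` tokens.

    Consecutive windows share `chunk_overlap` tokens. Illustrative only.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):  # the last window reached the end of the text
            break
        start += step
    return chunks


# Assumption: words play the role of tokens in this toy example.
tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = split_on_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

When `chunk_size` exceeds the length of the text, the function returns a single chunk, which is the situation we are in when splitting for question generation.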

In [6]:
from langchain.text_splitter import TokenTextSplitter

splitter_ques_gen = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=10000,  # why so long? see below
    chunk_overlap=200,
)


In [7]:
chunk_ques_gen = splitter_ques_gen.split_text(question_gen)

In [8]:
chunk_ques_gen   # our document is shorter than 10,000 tokens, so we get a single chunk

['IN THE YEAR 2015, LEADERS FROM 193 COUNTRIES OF THE WORLD \nCAME TOGETHER TO FACE THE FUTURE.\nAnd what they saw was daunting. Famines. Drought. Wars. Plagues. Poverty. \nNot just in some faraway place, but in their own cities and towns and villages.\nThey knew things didn’t have to be this way. They knew we had enough \nfood to feed the world, but that it wasn’t getting shared. They knew there \nwere medicines for HIV and other diseases, but they cost a lot. They knew \nthat earthquakes and floods were inevitable, but that the high death \ntolls were not. \nThey also knew that billions of people worldwide shared their hope for a \nbetter future.\nSo leaders from these countries created a plan called the Sustainable \nDevelopment Goals (SDGs). This set of 17 goals imagines a future just 15 years \noff that would be rid of poverty and hunger, and safe from the worst effects of \nclimate change. It’s an ambitious plan. \nBut there’s ample evidence that we can succeed. In the past 15 ye

In [9]:
len(chunk_ques_gen)

1

But we need one more step: before we can use these chunks in LangChain, we need to **convert our data to the `Document` format**.

In [10]:
type(chunk_ques_gen)

list

In [11]:
from langchain.docstore.document import Document

document_ques_gen = [Document(page_content=t) for t in chunk_ques_gen]

In [12]:
print(f"{document_ques_gen} \n\n{type(document_ques_gen[0])}")

[Document(metadata={}, page_content='IN THE YEAR 2015, LEADERS FROM 193 COUNTRIES OF THE WORLD \nCAME TOGETHER TO FACE THE FUTURE.\nAnd what they saw was daunting. Famines. Drought. Wars. Plagues. Poverty. \nNot just in some faraway place, but in their own cities and towns and villages.\nThey knew things didn’t have to be this way. They knew we had enough \nfood to feed the world, but that it wasn’t getting shared. They knew there \nwere medicines for HIV and other diseases, but they cost a lot. They knew \nthat earthquakes and floods were inevitable, but that the high death \ntolls were not. \nThey also knew that billions of people worldwide shared their hope for a \nbetter future.\nSo leaders from these countries created a plan called the Sustainable \nDevelopment Goals (SDGs). This set of 17 goals imagines a future just 15 years \noff that would be rid of poverty and hunger, and safe from the worst effects of \nclimate change. It’s an ambitious plan. \nBut there’s ample evidence tha

In [13]:
# now we perform the actual chunking, with smaller chunks:

splitter_ans_gen = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=1000,
    chunk_overlap=100,
)

In [14]:
document_answer_gen = splitter_ans_gen.split_documents(
    document_ques_gen
)

In [15]:
document_answer_gen

[Document(metadata={}, page_content='IN THE YEAR 2015, LEADERS FROM 193 COUNTRIES OF THE WORLD \nCAME TOGETHER TO FACE THE FUTURE.\nAnd what they saw was daunting. Famines. Drought. Wars. Plagues. Poverty. \nNot just in some faraway place, but in their own cities and towns and villages.\nThey knew things didn’t have to be this way. They knew we had enough \nfood to feed the world, but that it wasn’t getting shared. They knew there \nwere medicines for HIV and other diseases, but they cost a lot. They knew \nthat earthquakes and floods were inevitable, but that the high death \ntolls were not. \nThey also knew that billions of people worldwide shared their hope for a \nbetter future.\nSo leaders from these countries created a plan called the Sustainable \nDevelopment Goals (SDGs). This set of 17 goals imagines a future just 15 years \noff that would be rid of poverty and hunger, and safe from the worst effects of \nclimate change. It’s an ambitious plan. \nBut there’s ample evidence tha

In [16]:
len(document_answer_gen)

4

## Create Questions through the LLM

The next step is to create the questions using an LLM.

We will not be **embedding** our document data at this stage. Later on, for answer generation, we will convert the document chunks into an embedding representation.

### Choose model

In [17]:
from langchain.chat_models import ChatOpenAI    

In [18]:
llm_ques_gen_pipeline = ChatOpenAI(
    model='gpt-3.5-turbo',
    temperature=0.3
)

### Define Prompts

A super important step is to define a good prompt for our model:

In [19]:
# PROMPT TEMPLATE

prompt_template = """
You are an expert at creating questions based on coding materials and documentation.
Your goal is to prepare a coder or programmer for their exam and coding tests.
You do this by asking questions about the text below:

------------
{text}
------------

Create questions that will prepare the coders or programmers for their tests.
Make sure not to lose any important information.

QUESTIONS:
"""

Then, to turn it into a prompt template that LangChain can use, we wrap it in [`PromptTemplate`](https://python.langchain.com/api_reference/core/prompts/langchain_core.prompts.prompt.PromptTemplate.html#prompttemplate):

In [20]:
from langchain.prompts import PromptTemplate

PROMPT_QUESTIONS = PromptTemplate(template=prompt_template, input_variables=['text'])
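To see what filling the `{text}` placeholder amounts to, here is a plain-Python sketch using `str.format`, which behaves much like `PromptTemplate`'s default f-string-style formatting for a simple template such as ours. The shortened template and the sample sentence are made up for illustration.

```python
# A shortened, illustrative version of the prompt with the same placeholder.
template = """
You do this by asking questions about the text below:

------------
{text}
------------

QUESTIONS:
"""

# Assumption: a made-up sentence stands in for a real document chunk.
rendered = template.format(text="LangChain splits documents into chunks.")
print(rendered)
```

Every placeholder named in the template needs a value: calling `template.format()` without `text` raises a `KeyError`, which is why the variable names declared in `input_variables` have to match the template.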

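Next we add a refining step: the questions generated so far are passed back to the model together with additional context, and the model is instructed to refine them. LangChain's refine chain passes the previous output in a variable named `existing_answer`, which is why that name appears in the template even though it holds questions.

This pattern is useful when a document is too long to fit into a single prompt.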

In [21]:
refine_template = ("""
You are an expert at creating practice questions based on coding material and documentation.
Your goal is to help a coder or programmer prepare for a coding test.
We have received some practice questions to a certain extent: {existing_answer}.
We have the option to refine the existing questions or add new ones
(only if necessary) with some more context below.
------------
{text}
------------

Given the new context, refine the original questions in English.
If the context is not helpful, please provide the original questions.
QUESTIONS:
"""
)

In [22]:
REFINE_PROMPT_QUESTIONS = PromptTemplate(
    input_variables=["existing_answer", "text"],
    template=refine_template,
)
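Conceptually, a refine strategy uses the first prompt on the first chunk and the refine prompt on every following chunk, carrying the previous output along. The sketch below illustrates that loop in plain Python; `refine_over_chunks` and `fake_llm` are hypothetical names, and the fake model only counts its calls instead of generating text.

```python
def refine_over_chunks(chunks, initial_prompt, refine_prompt, llm):
    """Run an initial prompt on the first chunk, then refine with each later chunk."""
    if not chunks:
        return ""
    result = llm(initial_prompt.format(text=chunks[0]))
    for chunk in chunks[1:]:
        result = llm(refine_prompt.format(existing_answer=result, text=chunk))
    return result


calls = []  # records every prompt sent to the fake model


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; returns a versioned label."""
    calls.append(prompt)
    return f"questions v{len(calls)}"


# Simplified templates using the same variable names as the prompts above.
initial = "Write questions about:\n{text}\nQUESTIONS:"
refine = "Existing questions: {existing_answer}\nNew context:\n{text}\nQUESTIONS:"

final = refine_over_chunks(["chunk A", "chunk B", "chunk C"], initial, refine, fake_llm)
print(final)  # questions v3
```

Note that with a single input chunk the loop body never runs, so only the initial prompt is used.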

### Create Chain

For this task we can use LangChain's [`load_summarize_chain`](https://python.langchain.com/api_reference/langchain/chains/langchain.chains.summarize.chain.load_summarize_chain.html#load-summarize-chain), which, in the words of its documentation, returns "A chain to use for summarizing."

In effect, we are treating question generation as a form of summarization: distilling a larger document into a list of questions.

In [23]:
from langchain.chains.summarize import load_summarize_chain

ques_gen_chain = load_summarize_chain(
    llm=llm_ques_gen_pipeline,
    chain_type="refine",
    verbose=True,
    question_prompt=PROMPT_QUESTIONS,
    refine_prompt=REFINE_PROMPT_QUESTIONS,
)

In [24]:
ques = ques_gen_chain.run(document_ques_gen)

print(ques)

> Entering new RefineDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:

You are an expert at creating questions based on coding materials and documentation.
Your goal is to prepare a coder or programmer for their exam and coding tests.
You do this by asking questions about the text below:

------------
IN THE YEAR 2015, LEADERS FROM 193 COUNTRIES OF THE WORLD 
CAME TOGETHER TO FACE THE FUTURE.
And what they saw was daunting. Famines. Drought. Wars. Plagues. Poverty. 
Not just in some faraway place, but in their own cities and towns and villages.
They knew things didn’t have to be this way. They knew we had enough 
food to feed the world, but that it wasn’t getting shared. They knew there 
were medicines for HIV and other diseases, but they cost a lot. They knew 
that earthquakes and floods were inevitable, but that the high death 
tolls were not. 
They also knew that billions of people worldwide shared their hope for a 
b

## Create Answers

### Vector Embedding

Now, as mentioned earlier, we embed the text so that the passages relevant to each question can be retrieved efficiently when generating answers.

We can use [OpenAI embeddings](https://python.langchain.com/docs/integrations/text_embedding/openai/) directly from `langchain.embeddings.openai`:

In [25]:
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

Then we need a vector store to hold the embeddings and search them.

This time we will use [`FAISS`](https://ai.meta.com/tools/faiss/) from Meta, which we can use [directly through LangChain](https://python.langchain.com/docs/integrations/vectorstores/faiss/):

In [26]:
from langchain.vectorstores import FAISS
vector_store = FAISS.from_documents(document_answer_gen, embeddings)
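The idea behind a vector store can be sketched without any library: each chunk is stored next to its vector, and a query returns the chunks whose vectors are closest to the query vector. The example below is a brute-force illustration with made-up 3-dimensional vectors and Euclidean distance as one common choice; real embeddings have far more dimensions, and FAISS uses optimized indexes rather than a full sort.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored items closest to the query vector."""
    ranked = sorted(store, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Assumption: toy vectors invented for illustration, not real embeddings.
store = [
    ("chunk about poverty", [0.9, 0.1, 0.0]),
    ("chunk about climate", [0.1, 0.9, 0.1]),
    ("chunk about education", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]  # pretend embedding of a question about poverty

print(top_k(query, store, k=2))
```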

### Model and Chain for Answer Retrieval

In [28]:
# initialize the answering model:
llm_answer_gen = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo")

In [29]:
ques

'1. What is the main focus of the Sustainable Development Goals (SDGs) mentioned in the text?\n2. How many countries are involved in the Sustainable Development Goals initiative?\n3. What is the goal related to poverty mentioned in the text and what is the target year for achieving it?\n4. How has hunger improved in the past 20 years and what actions are suggested to further address this issue?\n5. What are some of the key aspects of the goal related to ensuring healthy lives and promoting well-being for all at all ages?\n6. What progress has been made in achieving gender equality according to the text?\n7. Why is it important to ensure sustainable consumption and production patterns?\n8. How can individuals contribute to helping achieve the Sustainable Development Goals by 2030 according to the text?\n9. What are some of the challenges related to climate change mentioned in the text, and how can they be addressed?\n10. How can the global partnership for sustainable development be revi

In [31]:
ques_list = ques.split("\n")
ques_list

['1. What is the main focus of the Sustainable Development Goals (SDGs) mentioned in the text?',
 '2. How many countries are involved in the Sustainable Development Goals initiative?',
 '3. What is the goal related to poverty mentioned in the text and what is the target year for achieving it?',
 '4. How has hunger improved in the past 20 years and what actions are suggested to further address this issue?',
 '5. What are some of the key aspects of the goal related to ensuring healthy lives and promoting well-being for all at all ages?',
 '6. What progress has been made in achieving gender equality according to the text?',
 '7. Why is it important to ensure sustainable consumption and production patterns?',
 '8. How can individuals contribute to helping achieve the Sustainable Development Goals by 2030 according to the text?',
 '9. What are some of the challenges related to climate change mentioned in the text, and how can they be addressed?',
 '10. How can the global partnership for sus
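Splitting on `"\n"` keeps any blank lines the model may emit between questions, and each blank entry would later be sent to the answering chain as if it were a question. A small sketch of a more defensive parser (the name `parse_questions` is hypothetical, and stripping the leading numbers is an optional design choice):

```python
import re


def parse_questions(raw: str) -> list[str]:
    """Split model output into questions, dropping blank lines and numbering."""
    questions = []
    for line in raw.split("\n"):
        line = line.strip()
        if not line:
            continue  # skip blank lines between questions
        # Remove a leading "1." or "1)" style prefix if present.
        questions.append(re.sub(r"^\d+[.)]\s*", "", line))
    return questions


# Assumption: a short made-up string in the same shape as the model output.
raw = "1. What are the SDGs?\n\n2. How many countries signed?\n"
print(parse_questions(raw))
```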

For the efficient retrieval of answers we can use LangChain's `RetrievalQA` chain:

In [36]:
from langchain.chains import RetrievalQA

answer_generation_chain = RetrievalQA.from_chain_type(
    llm=llm_answer_gen,
    chain_type="stuff",
    retriever=vector_store.as_retriever(),
)
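The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows the idea in plain Python; the function name `build_stuff_prompt` and the prompt wording are hypothetical and do not reproduce LangChain's actual prompt.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Combine all retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Assumption: two made-up chunks stand in for the retriever's results.
chunks = [
    "Leaders from 193 countries created the Sustainable Development Goals.",
    "The plan is a set of 17 goals.",
]
print(build_stuff_prompt("How many goals are there?", chunks))
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit within the model's context window, which is one reason to keep the answer-generation chunks small.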



### Generate answers for each question

In [37]:
# Answer each question and save to a file
for question in ques_list:
    print("Question: ", question)
    answer = answer_generation_chain.run(question)
    print("Answer: ", answer)
    print("--------------------------------------------------\n\n")
    # Save answer to file
    with open("answers.txt", "a") as f:
        f.write("Question: " + question + "\n")
        f.write("Answer: " + answer + "\n")
        f.write("--------------------------------------------------\n\n")




Question:  1. What is the main focus of the Sustainable Development Goals (SDGs) mentioned in the text?
Answer:  The main focus of the Sustainable Development Goals (SDGs) mentioned in the text is to address and work towards ending extreme poverty, hunger, ensuring healthy lives, promoting well-being, providing quality education, combating climate change, conserving ecosystems, promoting peace and justice, reducing inequality, making cities sustainable, ensuring sustainable consumption and production patterns, and achieving gender equality, among other goals.
--------------------------------------------------


Question:  2. How many countries are involved in the Sustainable Development Goals initiative?
Answer:  193 countries are involved in the Sustainable Development Goals initiative.
--------------------------------------------------


Question:  3. What is the goal related to poverty mentioned in the text and what is the target year for achieving it?
Answer:  The goal related 