In [1]:
from src.data_handling import WikiHandler
from src.config import AppResources
from pathlib import Path
from src.summarizer import OpenAISummarizer
from src.question_answering_system import *

## Summarizer task

To run this task we need three classes:
<ol>
<li><strong>DataHandler:</strong> This class handles the input JSON and converts the conversation segments into unified text. It also merges consecutive segments from the same speaker that were split apart.</li>
<li><strong>AppResources:</strong> This class holds all the resources the summarizer needs. The resources are listed in a config.json file, which is required to instantiate the class.</li>
<li><strong>OpenAISummarizer:</strong> This class contains all the summarization logic, including three summarization techniques:
    <ol>
        <li><strong>StuffSummarizer</strong></li>
        <li><strong>MapReduceSummarizer</strong></li>
        <li><strong>RefineSummarizer</strong></li>
    </ol>
    </li>

</ol>


In [2]:
# Load the dataset for the "Mughals" topic through WikiHandler
dataset = WikiHandler.get_data("Mughals")
# Create the resources object; it abstracts away everything the summarizer needs
app_res = AppResources.from_config_file(config_path=Path('config_files/config.json'))







In [4]:
len(dataset)

4

#### Stuff Summarizer
The idea is simple: if the model permits it, pass all the text in a single summarization call. Unfortunately, gpt-3.5 has a maximum length of 16k tokens per call, and our demo text has close to 25k tokens. So if the input is longer than 16k tokens and <strong>stuff</strong> is specified, the code falls back to the <strong>MapReduceSummarizer</strong>.

In [None]:
summarizer = OpenAISummarizer(app_res, summarizer_type='stuff')
summary = summarizer.summarize(dataset.data_as_str)
print(summary)
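
The fallback described above can be sketched as a small strategy check. This is a minimal illustration, not the actual `OpenAISummarizer` logic: `estimate_tokens` and `choose_strategy` are hypothetical helpers, and the four-characters-per-token ratio is a rough heuristic standing in for a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def choose_strategy(text: str, requested: str, max_tokens: int = 16_000) -> str:
    """Return the strategy to run, falling back from 'stuff' when the text is too long."""
    if requested == "stuff" and estimate_tokens(text) > max_tokens:
        # The whole text does not fit into one call, so summarize chunk by chunk instead.
        return "mapreduce"
    return requested


short_text = "word " * 100        # about 125 estimated tokens
long_text = "word " * 20_000      # about 25,000 estimated tokens

print(choose_strategy(short_text, "stuff"))
print(choose_strategy(long_text, "stuff"))
```

A tokenizer that matches the model would give an exact count; the character heuristic is only there to keep the sketch dependency-free.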

#### MapReduce Summarizer

The idea here is to split the text into multiple smaller chunks and summarize each chunk separately. The intermediate summaries are then combined into a single final summary.

In [None]:
summarizer = OpenAISummarizer(app_res, summarizer_type='mapreduce')
summary = summarizer.summarize(dataset.data_as_str)
print(summary)
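
A minimal sketch of the map-reduce idea, assuming a generic `summarize` callable in place of the OpenAI call. The names `split_into_chunks`, `map_reduce_summarize`, and `toy_summarize`, as well as the word-based chunking, are illustrative assumptions and not the implementation inside `OpenAISummarizer`.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def map_reduce_summarize(text: str, summarize: Callable[[str], str], chunk_size: int = 1000) -> str:
    """Summarize every chunk independently (map), then summarize the joined results (reduce)."""
    chunks = split_into_chunks(text, chunk_size)
    partial_summaries = [summarize(chunk) for chunk in chunks]   # map step
    return summarize("\n".join(partial_summaries))               # reduce step


def toy_summarize(text: str, keep: int = 3) -> str:
    """Stand-in for a model call: keeps only the first `keep` words."""
    return " ".join(text.split()[:keep])


demo_text = " ".join(f"w{i}" for i in range(20))
print(map_reduce_summarize(demo_text, toy_summarize, chunk_size=10))
```

Because the map calls do not depend on each other, they could be issued in parallel; only the reduce step has to wait for all of them.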


#### Refine Summarizer
The idea here is to split the text into multiple smaller chunks, summarize the first chunk, and then feed the running summary together with each following chunk into the next call until we reach the final summary. This process is the most expensive of the three summarizers, because the calls form a linear chain: each call depends on the result of the previous one.



In [None]:
summarizer = OpenAISummarizer(app_res, summarizer_type='refine')
summary = summarizer.summarize(dataset.data_as_str)
print(summary)
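
The sequential chain can be sketched as a simple loop. As before, this is an assumption-laden illustration: `refine_summarize`, `split_into_chunks`, and `toy_summarize` are hypothetical names, and the `summarize` callable stands in for the model call made by `OpenAISummarizer`.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def refine_summarize(text: str, summarize: Callable[[str], str], chunk_size: int = 1000) -> str:
    """Summarize the first chunk, then refine the running summary with each later chunk."""
    chunks = split_into_chunks(text, chunk_size)
    if not chunks:
        return ""
    summary = summarize(chunks[0])
    for chunk in chunks[1:]:
        # Each call needs the previous summary, so the calls cannot run in parallel.
        summary = summarize(summary + "\n" + chunk)
    return summary


def toy_summarize(text: str, keep: int = 3) -> str:
    """Stand-in for a model call: keeps only the first `keep` words."""
    return " ".join(text.split()[:keep])


demo_text = " ".join(f"w{i}" for i in range(20))
print(refine_summarize(demo_text, toy_summarize, chunk_size=10))
```

The loop makes one call per chunk, and every call after the first carries the previous summary as extra input.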




## Question Answering task

To run this task we need four classes:
<ol>
<li><strong>DataHandler:</strong> This class handles the input JSON and converts the conversation segments into unified text. It also merges consecutive segments from the same speaker that were split apart.</li>
<li><strong>AppResources:</strong> This class holds all the resources the QA system needs. The resources are listed in a config.json file, which is required to instantiate the class.</li>
<li><strong>QuestionAnsweringSystem:</strong> This class contains all the logic of the QA system. It also applies some verification checks: it can verify whether an answer was hallucinated by the model and whether the generated text is relevant to the question.</li>
<li><strong>VectorDatabase:</strong> This class transforms our documents into embeddings so that we can semantically find passages related to a question. The data is not persisted; it is stored only in memory. The default database used here is <strong>FAISS</strong>.</li>
    

</ol>
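
To illustrate what semantic lookup in an in-memory store involves, here is a toy sketch. It is not FAISS and not the project's `VectorDatabase`: `embed`, `cosine`, and `InMemoryIndex` are hypothetical names, and word-count vectors stand in for real learned embeddings.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryIndex:
    """Keeps documents and their vectors in memory only; nothing is persisted."""

    def __init__(self) -> None:
        self._docs: List[Tuple[str, Counter]] = []

    def add_documents(self, docs: List[str]) -> None:
        for doc in docs:
            self._docs.append((doc, embed(doc)))

    def search(self, query: str, k: int = 2) -> List[Tuple[str, float]]:
        """Return the k documents most similar to the query, best first."""
        query_vec = embed(query)
        scored = [(doc, cosine(query_vec, vec)) for doc, vec in self._docs]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


index = InMemoryIndex()
index.add_documents([
    "the mughal empire ruled india",
    "python is a programming language",
    "babur founded the mughal empire",
])
for doc, score in index.search("who founded the mughal empire", k=2):
    print(f"{score:.2f}  {doc}")
```

A real vector store replaces the word counts with dense embeddings from a model and the linear scan with an approximate nearest-neighbour index, but the retrieve-by-similarity contract is the same.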

In [None]:
# Create a DataHandler for the demo segments file
dataset = DataHandler("datasets/demo-segments.json")

# Create the resources object; it abstracts away everything the QA system needs
app_res = AppResources.from_config_file(config_path=Path('config_files/config.json'))

# create and store the embeddings of chunked text into the database.
db = VectorDatabase(app_res)
db.add_documents(dataset.data_as_str)

# create the retriever 
retriever = db.get_retriever()

# Instantiate the QAsystem
qa_system = QuestionAnsweringSystem(app_res, retriever)


In [None]:
# Provide the question
question = "Who is Lancelot?"

In [None]:
generated_answer = qa_system.run(question)
print(f"Answer: {generated_answer}")