# Wikipedia Chatbot
## Try to create a chatbot using Wikipedia as a knowledge base

## Creating/Importing the knowledge base (IR system)
We will use two methods to create the knowledge base:
- Importing a pre-made knowledge base
- Using the Wikipedia API

### Importing a pre-made knowledge base
We first load a pre-built knowledge base from a previously downloaded model file, taken from the [ParlAI model zoo](https://parl.ai/docs/zoo.html#wikipedia-retriever-used-for-wizard-of-wikipedia).

This is an agent that retrieves the 5 most relevant documents from Wikipedia given a query. It is trained on the [Wizard of Wikipedia](https://parl.ai/projects/wizard_of_wikipedia/) dataset.
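
As a rough picture of what a TF-IDF retriever does, here is a minimal sketch in plain Python. It is not ParlAI's implementation: the function names (`tokenize`, `build_index`, `retrieve`), the weighting formula, and the toy documents are all assumptions made for illustration. Each document becomes a vector of term weights, and documents are ranked by cosine similarity to the query.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def build_index(docs):
    """Return (idf, vectors): smoothed IDF weights and one TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, docs, k=5):
    """Return up to k documents with a non-zero score, most similar first."""
    idf, vectors = build_index(docs)
    counts = Counter(tokenize(query))
    query_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [docs[i] for score, i in scored[:k] if score > 0.0]


# Toy corpus (made up for this sketch, not the real Wikipedia dump)
TOY_DOCS = [
    "The White House is the residence of the President of the United States.",
    "The husky is a sled dog used in polar regions.",
    "Rome is the capital of Italy.",
]

top = retrieve("White House", TOY_DOCS, k=1)
```

A real retriever over all of Wikipedia needs a sparse, precomputed index instead of rebuilding the vectors for every query, as this sketch does.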

In [1]:
import yaml
from typing import Dict

from parlai.core.agents import create_agent
from parlai.agents.image_seq2seq.image_seq2seq import ImageSeq2seqAgent
from parlai.scripts.interactive import setup_args

In [2]:
# Using my local file path
local_file_path = "/Users/lorenzobenzoni/miniconda3/envs/tensorflow/lib/python3.10/site-packages/data/models/wikipedia_full/tfidf_retriever/model/model"
parlai_agent_kwargs = {"model_file": local_file_path}

parser = setup_args()
opt = parser.parse_kwargs(**parlai_agent_kwargs)
parlai_agent = create_agent(opt, requireModelExists=True)

11:58:03 | Overriding opt["model_file"] to /Users/lorenzobenzoni/miniconda3/envs/tensorflow/lib/python3.10/site-packages/data/models/wikipedia_full/tfidf_retriever/model/model (previously: wiki_full_notitle)
11:58:03 | Loading /Users/lorenzobenzoni/miniconda3/envs/tensorflow/lib/python3.10/site-packages/data/models/wikipedia_full/tfidf_retriever/model/model.tfidf


In [8]:
# Query the agent: observe() feeds it the message, act() returns its response
parlai_agent.observe({"text": "White House"})
white_house = parlai_agent.act()

In [9]:
# These are the candidates
white_house["text_candidates"]

['\nThe White House is the official residence and workplace of the President of the United States. It is located at 1600 Pennsylvania Avenue NW in Washington, D.C., and has been the residence of every U.S. president since John Adams in 1800. The term "White House" is often used as a metonym for the president and his advisers, as in "The White House announced that...".\n\nThe residence was designed by Irish-born architect James Hoban in the neoclassical style. Construction took place between 1792 and 1800 using Aquia Creek sandstone painted white. When Thomas Jefferson moved into the house in 1801, he (with architect Benjamin Henry Latrobe) added low colonnades on each wing that concealed stables and storage. In 1814, during the War of 1812, the mansion was set ablaze by the British Army in the Burning of Washington, destroying the interior and charring much of the exterior. Reconstruction began almost immediately, and President James Monroe moved into the partially reconstructed Execut

### Using the Wikipedia API
We now use the Wikipedia API to build a knowledge base, through the [Wikipedia-API](https://pypi.org/project/Wikipedia-API/) Python library (imported as `wikipediaapi`). The main limitation of this approach is that we cannot search for a specific sentence, only for a whole page. So we have to select the **context** (the page) at the start of the conversation, as in the ParlAI Wizard of Wikipedia paper.
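
Selecting the context once at the start of a conversation can be sketched as a small cache keyed by topic. This is only an illustration: the class name `TopicContext` and the `fake_fetch` stand-in are hypothetical, and the fetcher is injected so the sketch runs without network access.

```python
from typing import Callable, Dict, List


class TopicContext:
    """Fetch a page summary once per topic and reuse it on every turn."""

    def __init__(self, fetch_summary: Callable[[str], str]):
        self._fetch_summary = fetch_summary
        self._cache: Dict[str, str] = {}

    def get(self, topic: str) -> str:
        # Only hit the underlying source the first time a topic is requested
        if topic not in self._cache:
            self._cache[topic] = self._fetch_summary(topic)
        return self._cache[topic]


# Stand-in fetcher (hypothetical) that records how often it is called
calls: List[str] = []


def fake_fetch(topic: str) -> str:
    calls.append(topic)
    return f"Summary of {topic}."


# With a wikipediaapi client named wiki_wiki, the fetcher could instead be:
#   lambda topic: wiki_wiki.page(topic).summary
contexts = TopicContext(fake_fetch)
first_turn = contexts.get("White House")
second_turn = contexts.get("White House")
```

Both turns see the same context, and the page is requested only once.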

In [12]:
# The PyPI package is named "wikipedia-api"; the module it installs is imported as "wikipediaapi"
%pip install wikipedia-api


In [5]:
import wikipediaapi

# Note: newer releases of the library also expect a user_agent argument here
wiki_wiki = wikipediaapi.Wikipedia('en')

white_house = wiki_wiki.page('White House')
print(f"Page - Exists: {white_house.exists()}")

white_house.summary

Page - Exists: True


'The White House is the official residence and workplace of the president of the United States. It is located at 1600 Pennsylvania Avenue NW in Washington, D.C., and has been the residence of every U.S. president since John Adams in 1800 when the national capital was moved from Philadelphia to Washington, D.C. The term "White House" is often used as metonymy for the president and his advisers.\nThe residence was designed by Irish-born architect James Hoban in the neoclassical style. Hoban modelled the building on Leinster House in Dublin, a building which today houses the Oireachtas, the Irish legislature. Construction took place between 1792 and 1800, using Aquia Creek sandstone painted white. When Thomas Jefferson moved into the house in 1801, he and architect Benjamin Henry Latrobe added low colonnades on each wing to conceal what then were stables and storage. In 1814, during the War of 1812, the mansion was set ablaze by British forces in the Burning of Washington, destroying the 

## The model
We tried several kinds of models: question answering, conversational, and text2text. The text2text model gave the best results.

### T5 Model - Text2Text

**Overview**

The T5 model was presented in *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer* by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. The abstract of the paper reads:

> Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pretraining objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
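
To make the "text-to-text format" concrete, here is a small sketch of how different tasks can be serialized into a single input string. The helper `to_text_to_text` is hypothetical, and the prefixes only follow the style of the task prefixes described in the T5 paper; they are illustrative, not a specification.

```python
def to_text_to_text(task: str, **fields: str) -> str:
    """Serialize a task and its inputs into one plain-text model input."""
    if task == "translate":
        return f"translate {fields['source']} to {fields['target']}: {fields['text']}"
    if task == "summarize":
        return f"summarize: {fields['text']}"
    if task == "qa":
        return f"question: {fields['question']} context: {fields['context']}"
    raise ValueError(f"unknown task: {task}")


# Every task ends up as a string in, string out problem
examples = [
    to_text_to_text("translate", source="English", target="German", text="That is good."),
    to_text_to_text("summarize", text="A long article about the White House."),
    to_text_to_text("qa", question="Which is the capital of Italy?", context="Rome is the capital of Italy."),
]
```

Because the output is also plain text, the same model and decoding loop can be reused across tasks, which is what makes a general-purpose checkpoint convenient for a chatbot.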

In [44]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# FLAN-T5 is an instruction-tuned, general-purpose model that can be used zero-shot or few-shot
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float32)

Without a knowledge base, the model can answer some questions, but it fails on questions that require specific factual knowledge. For example, when asked "Which is the capital of Italy?", it does not answer "Rome".

In [22]:
input_text = "Which is the capital of Italy?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

output_ids = model.generate(input_ids, max_new_tokens=1000)
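# Let the tokenizer drop the special tokens instead of slicing them off by position
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Wrong output: the model has neither been fine-tuned for this task nor given retrieved knowledge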
print(output_text)

turin


If we give the model a context taken from Wikipedia, it can use that knowledge to answer the question. For example, when asked "Which is the capital of Italy?", the model now answers "Rome".

In [19]:
italy = wiki_wiki.page('Italy').summary

italy[:150]

'Italy (Italian: Italia [iˈtaːlja] (listen)), officially the Italian Republic or the Republic of Italy, is a country in Southern and Western Europe. Lo'

In [20]:
input_text = f'In a funny way, try to answer this question, knowing that <knowledge-base> {italy} </knowledge-base>: Question: Which is the capital of Italy?'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

output_ids = model.generate(input_ids, max_new_tokens=10000)
# Let the tokenizer drop the special tokens instead of slicing them off by position
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)

Rome
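
The pattern used above, retrieved context followed by the question, can be wrapped in a small helper. This is a sketch under assumptions: `build_prompt` and its `max_context_words` budget are hypothetical and not part of the notebook. The word budget only illustrates one way to keep long summaries from dominating the input.

```python
def build_prompt(context: str, question: str, max_context_words: int = 300) -> str:
    """Wrap the (possibly truncated) context and the question in one prompt string."""
    words = context.split()
    # Keep only the first max_context_words words of the context
    trimmed = " ".join(words[:max_context_words])
    return (
        "Answer the question using the knowledge base. "
        f"<knowledge-base> {trimmed} </knowledge-base> "
        f"Question: {question}"
    )


prompt = build_prompt(
    "Rome is the capital of Italy. It is also the largest city in the country.",
    "Which is the capital of Italy?",
    max_context_words=6,
)
```

The resulting string can be passed to the tokenizer exactly like `input_text` in the cells above.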


**Another Example**
- **Question**: How many students are there at Politecnico di Milano?

In [21]:
politecnico = wiki_wiki.page('Politecnico di Milano').summary

politecnico

"The Polytechnic University of Milan (Politecnico di Milano) is the largest technical university in Italy, with about 42,000 students. \nThe university offers undergraduate, graduate and higher education courses in engineering, architecture and design. \nFounded in 1863, it is the oldest university in Milan.\nThe Polytechnic University of Milan has two main campuses in the city of Milan, Italy, where the majority of the research and teaching activities are located, as well as other satellite campuses in five other cities across the Lombardy and Emilia-Romagna regions. \nThe central offices and headquarters are located in the historical campus of Città Studi in Milan, which is also the largest, active since 1927.\nAccording to the QS World University Rankings for the subject area 'Engineering & Technology', it ranked in 2022 as the 13th best in the world. It ranked 6th worldwide for Design, 9th for Civil and Structural Engineering, 9th for Mechanical, Aerospace Engineering and 7th for A

In [22]:
input_text = f'In a funny way, try to answer this question, knowing that <knowledge-base> {politecnico} </knowledge-base>: Question: How many students are there at Politecnico di Milano?'

input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

output_ids = model.generate(input_ids, max_new_tokens=10000)
# Let the tokenizer drop the special tokens instead of slicing them off by position
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)

42,000


**Another Example**
- **Question**: Who is the most famous Architect that studied at Politecnico di Milano?

In [23]:
input_text = f'In a funny way, try to answer this question, knowing that <knowledge-base> {politecnico} </knowledge-base> Question: Who is the most famous Architect that studied at Politecnico di Milano?'

input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

output_ids = model.generate(input_ids, max_new_tokens=10000)
# Let the tokenizer drop the special tokens instead of slicing them off by position
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)

Renzo Piano


### RoBERTa Model - Question Answering
[RoBERTa Model - SQuAD 2.0](https://huggingface.co/deepset/roberta-base-squad2) is the roberta-base model fine-tuned on the SQuAD 2.0 [dataset](https://huggingface.co/datasets/squad_v2). It has been trained on question-answer pairs, including unanswerable questions, for extractive question answering.

In [21]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "deepset/roberta-base-squad2"

# a) Get predictions with a pipeline
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)

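# b) Load model & tokenizer directly (not needed by the pipeline above; this rebinds `model` and `tokenizer` from the T5 cells)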
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [30]:
QA_input = {
    'question': 'how many students study at Politecnico di Milano?',
    'context': politecnico
}
res = nlp(QA_input)
res['answer']

'42,000'

In [27]:
QA_input = {
    'question': 'Who is the most famous Architect that studied at Politecnico di Milano?',
    'context': politecnico
}
res = nlp(QA_input)
res['answer']

'Renzo Piano and Aldo Rossi'
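
Since the model was also trained on unanswerable questions, a chatbot can decline to answer when the prediction looks unreliable. The sketch below post-processes a result dictionary with the `answer` and `score` keys that the question-answering pipeline returns. The helper name `answer_or_abstain`, the threshold value, and the fallback sentence are assumptions chosen for illustration. The pipeline also accepts a `handle_impossible_answer` option, which may be worth trying alongside this.

```python
from typing import Any, Mapping


def answer_or_abstain(
    result: Mapping[str, Any],
    min_score: float = 0.3,
    fallback: str = "I don't know.",
) -> str:
    """Return the predicted answer, or a fallback when it is empty or low-confidence."""
    answer = str(result.get("answer", "")).strip()
    score = float(result.get("score", 0.0))
    if not answer or score < min_score:
        return fallback
    return answer


# Hand-written result dictionaries standing in for real pipeline outputs
confident = answer_or_abstain({"answer": "42,000", "score": 0.9})
uncertain = answer_or_abstain({"answer": "Rome", "score": 0.01})
```

With the notebook's pipeline, the call would look like `answer_or_abstain(nlp(QA_input))`. The threshold should be tuned on a few known questions.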

### Testing T5 on the Wizard of Wikipedia dataset
We now go back to T5 because it is the most general-purpose of the models above.

In [1]:
import json

# Load the Wizard of Wikipedia dialogues, closing the file afterwards
with open('./wizard_of_wikipedia/data.json') as f:
    data = json.load(f)

In [2]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# FLAN-T5 is an instruction-tuned, general-purpose model that can be used zero-shot or few-shot
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float32)

We first test the model without the IR system, to see what it answers when no knowledge base is provided.

In [3]:
# Test on the first example
chosen_topic = data[0]['chosen_topic']

statement = data[0]['dialog'][0]['text']

input_text = f'Statement : {statement}'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

output_ids = model.generate(input_ids, max_new_tokens=10000)
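# Let the tokenizer drop the special tokens instead of slicing them off by position
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)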

print(f'Statement: {statement}\nAnswer: {output_text}\nChosen topic: {chosen_topic}')

Statement: I think science fiction is an amazing genre for anything. Future science, technology, time travel, FTL travel, they're all such interesting concepts.
Answer: I love science fiction
Chosen topic: Science fiction
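
To test more than one example, the dataset can be walked as (topic, statement, reference reply) triples. The sketch below relies only on the `chosen_topic`, `dialog` and `text` keys used above. The helper name `first_exchanges` and the two sample records are made up for illustration, and the reference reply in the sample is not taken from the dataset.

```python
from typing import Any, Dict, Iterator, List, Tuple


def first_exchanges(episodes: List[Dict[str, Any]]) -> Iterator[Tuple[str, str, str]]:
    """Yield (topic, opening statement, reference reply) for each usable episode."""
    for episode in episodes:
        dialog = episode.get("dialog", [])
        # Skip episodes that have no reply to compare against
        if len(dialog) < 2:
            continue
        yield episode["chosen_topic"], dialog[0]["text"], dialog[1]["text"]


# Tiny hand-made sample mimicking the structure of data.json
sample = [
    {
        "chosen_topic": "Science fiction",
        "dialog": [
            {"text": "I think science fiction is an amazing genre."},
            {"text": "Placeholder reference reply."},
        ],
    },
    {"chosen_topic": "Husky", "dialog": [{"text": "Only one turn here."}]},
]

pairs = list(first_exchanges(sample))
```

Each triple provides what a test loop needs: the topic to retrieve a context for, the statement to answer, and the reference reply to score against.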


Now we test the model with the IR system, to see if the answer can be improved.

In [29]:
# test on a random example
chosen_topic = data[1355]['chosen_topic']

statement = data[1355]['dialog'][0]['text']

context = wiki_wiki.page(chosen_topic).summary

input_text = f'In a funny way, try to answer this question, knowing that <knowledge-base> {context} </knowledge-base> Statement : {statement}'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

output_ids = model.generate(input_ids, max_new_tokens=10000)
# Let the tokenizer drop the special tokens instead of slicing them off by position
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(f'Statement: {statement}\nAnswer: {output_text}\nChosen topic: {chosen_topic}')

from parlai.core.metrics import BleuMetric

bleu = BleuMetric.compute(output_text, [data[1355]["dialog"][1]["text"]])
print(f"BLEU: {bleu}")

Statement: I need a new winter hobby and picked Sled Dogs. I'm thinking I want a husky. Have  you ever had one?
Answer: Huskies are also kept as pets, and groups work to find new pet homes for retired racing and adventure-trekking dogs.
Chosen topic: Husky
BLEU: 5.049e-11


Because the generated answers are more concise than the reference responses in the dataset, the BLEU score is consistently very low.
For a generative task like this one, we think human evaluation is a much better choice, because BLEU is not a good metric for it.
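
To see why short answers score so poorly, here is a minimal sentence-level BLEU sketch in plain Python. It is not the ParlAI implementation: the function names are hypothetical, and replacing zero precisions with a tiny epsilon is an assumption made to keep the geometric mean defined. Two effects combine: the brevity penalty shrinks the score when the answer is shorter than the reference, and a single missing n-gram order pushes the geometric mean close to zero.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hyp, ref, n):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp_counts = ngrams(hyp, n)
    ref_counts = ngrams(ref, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


def brevity_penalty(hyp_len, ref_len):
    """Penalty applied when the hypothesis is shorter than the reference."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)


def bleu(hypothesis, reference, max_n=4, eps=1e-12):
    """Sentence-level BLEU with uniform weights and epsilon smoothing (an assumption)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    precisions = [max(modified_precision(hyp, ref, n), eps) for n in range(1, max_n + 1)]
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(len(hyp), len(ref)) * geo_mean


# Made-up sentences: a short answer that shares words, but no longer n-grams, with the reference
reference = "i have never had a husky but they are great pets"
short_answer = "huskies are pets"
score = bleu(short_answer, reference)
```

Here the answer shares two of its three words with the reference, yet the score is close to zero. This is the kind of mismatch between BLEU and perceived answer quality that motivates human evaluation.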

![alt text](./bot/Screenshot%202023-05-27%20alle%2013.52.15.png "Fishing")