# WizardOfWikipedia: Knowledge-powered Conversational Agents
## Authors: 
 - Jacopo Di Ventura - jacopo.di.ventura@usi.ch
 - Jury Andrea D'Onofrio - jury.donofrio@usi.ch
 - Matteo Martinoli - matteo.martinoli@usi.ch
 - Roberto Neglia - roberto.neglia@usi.ch
 
#### Installation and upgrade of libraries

In [None]:
%%time 
!pip install hnswlib -q
!apt-get install espeak-ng -y -q
!pip install -U torch torchaudio -q
!pip install TTS -q
!pip uninstall -y transformers -q
!pip install transformers==4.22.2 -q
!pip install --upgrade accelerate -q
!pip install -U sentence-transformers -q

#### Python library imports

In [None]:
import json
import numpy as np
import re
import torch
import pickle
import hnswlib
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer, CrossEncoder
from IPython.display import Audio
from IPython.utils import io
from TTS.api import TTS

In [3]:
# Data paths
MODEL_PATH = "/kaggle/input/weights/file/test1"
EMBEDDINGS_PATH = "/kaggle/input/weights/embeddings_cache.pickle"
INDEX_PATH = "/kaggle/input/weights/embeddings_index_100.index"
PASSAGES_PATH = "/kaggle/input/weights/passages_list.json"

In [4]:
# Device definition
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Pretrained models:
### Define tokenizer and custom model

In [5]:
print('Loading pretrained model...', end="\r")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH).to(device)
print('Loading pretrained model. Done')

Loading pretrained model. Done


### Load SentenceTransformer and CrossEncoder models
This cell loads two pretrained models that together form a retrieve-and-rerank pipeline:

1. **Sentence Transformer Model**
   - `semb_model` is initialized from the 'multi-qa-MiniLM-L6-cos-v1' model. It maps a sentence to a dense embedding, so the user message can be compared with the precomputed passage embeddings through cosine similarity.

2. **Cross-Encoder Model**
   - `xenc_model` is initialized from the 'cross-encoder/ms-marco-MiniLM-L-6-v2' model. It scores a (message, passage) pair jointly and is used to rerank the candidate passages returned by the embedding search.

The embedding search is fast but coarse, while the cross-encoder is slower but more precise, so the cross-encoder is applied only to the retrieved candidates.


In [6]:
print('Loading SentenceTransformer...', end="\r")
with io.capture_output() as captured:
    semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
print('Loading SentenceTransformer. Done')
    
print('Loading CrossEncoder...', end="\r")
with io.capture_output() as captured:    
    xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
print('Loading CrossEncoder. Done')

Loading SentenceTransformer. Done
Loading CrossEncoder. Done


### Load embeddings and HNSW index 


In [7]:
print('Loading embeddings from cache...', end="\r")
with open(EMBEDDINGS_PATH, 'rb') as f:
    passage_embeddings = pickle.load(f)
print('Loading embeddings from cache. Done')

print('Loading index...', end="\r")
# Initialize an HNSW index for efficient similarity search using cosine distance.
index = hnswlib.Index(space='cosine', dim=passage_embeddings.size(1))
index.load_index(INDEX_PATH)
print('Loading index. Done')

Loading embeddings from cache. Done
Loading index. Done


In [8]:
# Loading pre-computed passages list
with open(PASSAGES_PATH) as f:
    passages = json.load(f)

### Load text-to-speech model

In [9]:
print('Loading Text-to-Speech Model...', end="\r")
with io.capture_output() as captured:
    tts = TTS('tts_models/en/jenny/jenny', progress_bar=False).to(device)
print('Loading Text-to-Speech Model. Done')

Loading Text-to-Speech Model. Done


#### `sample_to_string` Function

- The `sample_to_string` function formats a sample as a single prompt string. The string contains the context, the input text, and the reply that uses information in the context, each followed by the end-of-sequence token.

- Args:
  - `sample` (dict): A dictionary with fields 'context', 'input', and 'target'.
  - `eos_token` (str): An end-of-sequence token used for separating different parts of the formatted string.

- Returns:
  - `str`: A formatted string that combines the context, input text, and reply.


In [10]:
def sample_to_string(sample, eos_token):
    return f"context: {sample['context']} {eos_token} input text: {sample['input']} {eos_token} reply using information in the context: {sample['target']} {eos_token}"
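
During the interactive dialog the `target` field is empty, so the formatted string ends with the reply prompt followed by the end-of-sequence token. The sketch below repeats the function so that it runs on its own; the sample values and the `<|endoftext|>` token are illustrative assumptions, not values taken from the trained tokenizer.

```python
def sample_to_string(sample, eos_token):
    # Same format as the function above, repeated so the sketch is self-contained
    return (
        f"context: {sample['context']} {eos_token} "
        f"input text: {sample['input']} {eos_token} "
        f"reply using information in the context: {sample['target']} {eos_token}"
    )

# Hypothetical sample; the target is left empty, as it is during the interactive dialog
example = {
    "context": "Rome is the capital of Italy.",
    "input": "What is the capital of Italy?",
    "target": "",
}

# "<|endoftext|>" is an assumed EOS token used only for illustration
prompt = sample_to_string(example, "<|endoftext|>")
print(prompt)
```

Printing the prompt this way is a quick check that the three segments and their separators appear in the order the model was fine-tuned on.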

#### `get_response` Function

This function retrieves the passages most relevant to a given input message, using an embedding search followed by cross-encoder reranking.

The function takes three main inputs:
  - `message` (str): The input message for which a response is generated.
  - `index`: An index structure (HNSW index) for efficient similarity search.
  - `passages` (list): A list of passages to search for responses.

The function performs the following steps:
  1. Encodes the input message into an embedding using the Sentence Transformer model (`semb_model`).
  2. Performs an approximate nearest-neighbor search with the embedding to retrieve the 64 closest candidate passages (`corpus_ids`).
  3. Pairs the message with each candidate passage to build the cross-encoder inputs.
  4. Predicts a cross-encoder score for each pair.
  5. Sorts the candidate passages by cross-encoder score and min-max normalizes the scores to the range [0, 1].
  6. Selects the passages whose normalized score is above 0.9.
  7. Returns the selected passages, best first.
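
Steps 5 to 7 can be traced on a handful of made-up scores. The following is a minimal pure-Python sketch of the normalize-and-threshold logic, assuming hypothetical score values and a hypothetical helper name `select_above_threshold`; it is not the notebook's implementation, which uses NumPy. Because min-max normalization maps the best score to 1.0, at least one passage always passes the threshold, unless all scores are equal, in which case the denominator is zero. The sketch guards against that case by keeping every candidate.

```python
def select_above_threshold(scores, threshold=0.9):
    """Return candidate positions whose min-max normalized score exceeds threshold, best first."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate case: all scores are equal, so normalization is undefined; keep everything
        return list(range(len(scores)))
    normalized = [(s - lo) / (hi - lo) for s in scores]
    selected = [i for i, n in enumerate(normalized) if n > threshold]
    # Order the selected positions by their original score, highest first
    return sorted(selected, key=lambda i: scores[i], reverse=True)

# Hypothetical cross-encoder scores for four candidate passages
toy_scores = [-4.0, 6.0, 2.0, 5.5]
picked = select_above_threshold(toy_scores)
print(picked)
```

Note that the threshold is relative to the spread of scores in the current candidate set, so it selects the passages close to the best one rather than passages above an absolute relevance level.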


In [11]:
def get_response(message, index, passages):
    # Encode the input message into an embedding using the Sentence Transformer model
    message_embedding = semb_model.encode(message, convert_to_tensor=True).cpu()
    
    # Perform nearest neighbor search to retrieve candidate passages
    corpus_ids, _ = index.knn_query(message_embedding, k=64)
    
    # Prepare model inputs for cross-encoder predictions
    model_inputs = [(message, passages[idx]) for idx in corpus_ids[0]]
    
    # Predict cross-encoder scores for candidate passages
    cross_scores = xenc_model.predict(model_inputs, show_progress_bar=False)
    
    # Sort candidate passages by cross-encoder scores
    best_idxs = np.argsort(cross_scores)
    best_scores = np.sort(cross_scores)
    
    # Normalize scores
    best_scores = (best_scores-np.min(best_scores))/(np.max(best_scores)-np.min(best_scores))
    
    # Select passages with normalized scores above 0.9
    idxs = []
    for i in range(len(best_scores)):
        if best_scores[i] > 0.9:
            idxs.append(best_idxs[i])
    
    # Return the selected passages as the best responses
    return [passages[corpus_ids[0][idx]] for idx in reversed(idxs)]
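
The `knn_query` call returns approximate nearest neighbors under cosine distance. For intuition, the exact search it approximates can be written as a brute-force scan. This is a pure-Python sketch with hypothetical toy vectors and a hypothetical helper `exact_knn`; it takes cosine distance to be 1 minus cosine similarity, and it is far too slow for a real passage collection, which is why an HNSW index is used instead.

```python
import math

def cosine_distance(a, b):
    # Cosine distance defined as 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def exact_knn(query, vectors, k):
    """Return (ids, distances) of the k vectors closest to query, nearest first."""
    ranked = sorted(range(len(vectors)), key=lambda i: cosine_distance(query, vectors[i]))
    top = ranked[:k]
    return top, [cosine_distance(query, vectors[i]) for i in top]

# Hypothetical 2-dimensional "passage embeddings"
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
ids, dists = exact_knn([1.0, 0.1], toy_vectors, k=2)
print(ids)
```

The returned ids play the same role as `corpus_ids[0]` in `get_response`: they are positions in the passage list, which the cross-encoder then reranks.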

#### `text_to_speech` Function

This function converts text to speech using a Text-to-Speech (TTS) system.

The function takes two inputs:
  - `text` (str): The input text that we want to convert to speech.
  - `tts`: A TTS object responsible for text-to-speech conversion.


In [12]:
def text_to_speech(text, tts):
    # Generate speech from the input text
    wav = tts.tts(text=text)
    
    # Create an audio widget to play the generated speech
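    # Play back at the synthesizer's own output sample rate to avoid pitch and speed shifts
    return Audio(wav, rate=tts.synthesizer.output_sample_rate, autoplay=True)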
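
The `rate` argument tells the audio widget how many samples to play per second, so it should match the sampling rate of the generated waveform; a mismatch changes both duration and pitch. The arithmetic below illustrates the effect with hypothetical numbers (a 48,000 Hz waveform is an assumption for illustration, not a confirmed property of the model used here).

```python
def playback_duration(num_samples, playback_rate):
    # Duration in seconds when num_samples are played at playback_rate samples per second
    return num_samples / playback_rate

# Hypothetical waveform: 2 seconds of audio synthesized at 48,000 Hz
num_samples = 2 * 48_000

matched = playback_duration(num_samples, 48_000)
mismatched = playback_duration(num_samples, 45_000)
print(f"matched: {matched:.3f} s, mismatched: {mismatched:.3f} s")
```

A lower playback rate stretches the same samples over a longer time, which also lowers the perceived pitch of the voice.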

## Chatbot dialog initialization and interaction:

In this code block, the bot interacts with the user in a conversation-like manner. Here are the main components of the code:

### Dialog initialization

- `dialog` is initialized using a dictionary. It includes fields for `input` (user's message), `context` (contextual information), and `target` (target response).

- `max_len` sets the maximum length of the conversation, measured in turn pairs (one user message and one chatbot reply).

### User Interaction

In each turn:
  - The user's message is read and stored in the dialog's `input` field.
  - The top passages are retrieved using the `get_response` function based on the user's message.
  - The returned passages are joined into one string and stored as the dialog context, replacing the context of the previous turn.
  - The dialog is converted to a string and encoded for the model.
  - The response is generated by the model based on the dialog dictionary.
  - The generated response is processed and displayed as the chatbot's reply.
  - The chatbot's response is also converted to speech and played back using text-to-speech conversion.
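
The decoding and clean-up steps can be illustrated without the model. A causal language model returns the prompt tokens followed by the newly generated ones, so the reply is obtained by slicing off the first `prompt_len` tokens; runs of whitespace are then collapsed. The token strings and the helper `extract_reply` below are hypothetical stand-ins for real token ids and `tokenizer.decode`, and the sketch leaves out the notebook's fixed-length trim of the decoded string.

```python
import re

def extract_reply(output_tokens, prompt_len):
    """Drop the prompt tokens, join the rest, and normalize whitespace."""
    new_tokens = output_tokens[prompt_len:]
    text = "".join(new_tokens)
    # Collapse any run of whitespace (spaces, newlines, tabs) into a single space
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical token strings standing in for the model's token ids
prompt_tokens = ["input", " text", ":", " Hi"]
generated = prompt_tokens + [" Hello", "\n\n", "there", "  !"]
reply = extract_reply(generated, len(prompt_tokens))
print(reply)
```

Slicing by prompt length, rather than searching the decoded text for the prompt, stays correct even when the reply happens to repeat part of the prompt.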


In [None]:
# Initialize dialog
dialog = {
    "input": "",
    "context": "",
    "target": "",
}

# Maximum dialog length (in turn pairs)
max_len = 3

for i in range(max_len):
    # Read user message
    user_message = input("> APPRENTICE:")
    # Add user message to dialog
    dialog["input"] = user_message

    # get top passages
    top_passages = get_response(dialog["input"], index, passages)
    # add top passages to dialog
    dialog["context"] = " ".join(top_passages)
    
    # Convert dialog to string
    input_string = sample_to_string(dialog, tokenizer.eos_token)
    
    # Encode input
    input_encoding = tokenizer(input_string, return_tensors="pt").to(device)
    
    # Generate response
    output_ids = model.generate(
        input_encoding.input_ids,
        max_new_tokens=50,
        do_sample=True,
        temperature=1.3,
        num_beams=5,
        top_k=5,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,
    )
    
    # Decode response
    chatbot_response = tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1) :], skip_special_tokens=True)
    
    # Fix response string
    chatbot_response = chatbot_response[41:].rstrip()
    chatbot_response = re.sub(r'\s+', ' ', chatbot_response)
    print(f"> WIZARD: {chatbot_response}\n")
    print(f"> CONTEXT USED: {dialog['context']}")
    
    # Text-to-Speech Conversion
    with io.capture_output() as captured:
        audio_track = text_to_speech(chatbot_response, tts)
    display(audio_track)  

## Conclusions:
In summary, we achieved our goal of creating a knowledge-powered conversational agent that generates an abstractive answer based on context retrieved from the user’s input. We also augmented the bot with a Text-to-Speech (TTS) system.

## Future work:
Several directions could further improve the system. For example:
- Experimenting with alternative pre-trained models.
- Refining the `sample_to_string` function for improved performance.
- Exploring different sets of hyper-parameters to optimize system behavior.
- Investigating the development and integration of a Speech-to-Text system for comprehensive functionality.