In [1]:
import pandas as pd
import numpy as np

## Steps in Creating the AI ChatBOT

1) Choosing a fine-tuned model
2) Loading the pre-trained model
3) Deploying it with the Wikipedia API
4) Integrating it with a Gradio UI to make it interactive and prettier
5) Testing the bot


### Choosing a fine-tuned model

I chose a BERT model from the Hugging Face repository that has already been fine-tuned on the Stanford Question Answering Dataset (SQuAD). The `squad2` in the model name refers to SQuAD 2.0, which also contains questions that have no answer in the given context.

#### Loading the pre-trained model
> Link: https://huggingface.co/deepset/bert-base-cased-squad2 

In [48]:
!pip install wikipedia
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("deepset/bert-base-cased-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")

Downloading:   0%|          | 0.00/433M [00:00<?, ?B/s]

#### Playing with our model

> * Tokenize the Input
> * Obtain model scores
> * Get the result

Documentation Link: https://huggingface.co/docs/transformers/notebooks

In [60]:
context = ''' Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan. 
            It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael 
            Caine. Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts 
            who travel through a wormhole near Saturn in search of a new home for mankind. Interstellar premiered on October 26,
            2014, in Los Angeles. In the United States, it was first released on film stock, expanding to venues using digital 
            projectors. The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making 
            it the tenth-highest grossing film of 2014. It received acclaim for its performances, direction, screenplay, musical
            score, visual effects, ambition, themes, and emotional weight. It has also received praise from many astronomers for
            its scientific accuracy and portrayal of theoretical astrophysics. Since its premiere, Interstellar gained a cult 
            following, and now is regarded by many sci-fi experts as one of the best science-fiction films of all time.
            Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received 
            numerous other accolades. '''

question = "Who directed Interstellar"

# 1. Tokenize the input
# return_tensors="pt" makes the tokenizer return PyTorch tensors (torch.Tensor objects).
inputs = tokenizer.encode_plus(question, context, return_tensors = "pt") 

# 2. Obtain model scores
answer_start_scores, answer_end_scores = model(**inputs, return_dict=False)
# Get the most likely beginning of the answer
answer_start = torch.argmax(answer_start_scores) 
# Get the most likely end of the answer
answer_end = torch.argmax(answer_end_scores) + 1

# 3. Get the result
# Converting the start and end tokens back to words
tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end]))

'Christopher Nolan'
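
Taking the two `argmax` values independently, as in the cell above, can occasionally give an end position that comes before the start. A minimal sketch of one common alternative is to score every valid (start, end) pair jointly and keep the best one. The helper name `best_span` and the `max_answer_len` limit are illustrative assumptions, not part of this notebook or of `transformers`. The scores are plain Python lists here, so the model's score tensors would need to be converted to lists first.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) with start <= end that maximises the summed score.

    `end` is inclusive here; add 1 before slicing the token ids.
    Returns None when the score lists are empty.
    """
    best = None
    best_score = float("-inf")
    for i, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            score = start_score + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best
```

With this helper, a start score peaking at the last token and an end score peaking at the first token no longer produce an empty slice; the best consistent pair is returned instead.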

#### Deploying it with Wikipedia API


##### Searching keywords on Wikipedia

In [69]:
import wikipedia as wiki

# Define the question
question = "Who directed Interstellar?"

# Use the Wikipedia API to search for the most relevant page based on the question
search_results = wiki.search(question)
if not search_results:
    answer = "Sorry, I could not find any relevant information on Wikipedia."
else:
    # Use the first search result as the page title
    page_title = search_results[0]
    # Get the Wikipedia page for the page title
    page = wiki.page(page_title)
    # Use the page content as the context for the Q&A model
    context = page.content

    # Set a maximum sequence length for the input
    max_seq_length = 512

    # Tokenize the input
    inputs = tokenizer.encode_plus(question, context, max_length=max_seq_length, truncation=True, padding='max_length', return_tensors="pt")

    # Obtain model scores
    answer_start_scores, answer_end_scores = model(**inputs, return_dict=False)

    # Get the most likely beginning of the answer
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of the answer
    answer_end = torch.argmax(answer_end_scores) + 1

    # Convert the start and end tokens back to words
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end]))

print(answer)

Christopher Nolan
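
Note that `max_length=512` together with `truncation=True` means the model only reads the opening of the article, so an answer that appears further down the page is never seen. One possible workaround, sketched below under stated assumptions, is to split the tokenized context into overlapping windows and run the model on each window. `context_windows`, `question_len`, and `overlap` are hypothetical names used for illustration, and the 3 reserved positions assume BERT's `[CLS] question [SEP] context [SEP]` layout.

```python
def context_windows(context_tokens, question_len, max_seq_length=512, overlap=128):
    """Split context tokens into overlapping windows that fit next to the question.

    Three positions are reserved for [CLS] and the two [SEP] tokens.
    """
    window = max_seq_length - question_len - 3
    if window <= overlap:
        raise ValueError("question too long for the chosen max_seq_length/overlap")
    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(context_tokens[start:start + window])
        # Stop once the current window reaches the end of the context
        if start + window >= len(context_tokens):
            break
        start += step
    return windows
```

Each window would then be paired with the question and scored separately, keeping the answer from the highest-scoring window. Hugging Face tokenizers also accept `stride` and `return_overflowing_tokens` arguments that perform a similar split during tokenization, which may be the more convenient route in practice.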


In [72]:
# Define the question
question = input("Ask a question ?")

# Use the Wikipedia API to search for the most relevant page based on the question
search_results = wiki.search(question)
if not search_results:
    answer = "Sorry, I could not find any relevant information on Wikipedia."
else:
    # Use the first search result as the page title
    page_title = search_results[0]
    # Get the Wikipedia page for the page title
    page = wiki.page(page_title)
    # Use the page content as the context for the Q&A model
    context = page.content

    # Set a maximum sequence length for the input
    max_seq_length = 512

    # Tokenize the input
    inputs = tokenizer.encode_plus(question, context, max_length=max_seq_length, truncation=True, padding='max_length', return_tensors="pt")

    # Obtain model scores
    answer_start_scores, answer_end_scores = model(**inputs, return_dict=False)

    # Get the most likely beginning of the answer
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of the answer
    answer_end = torch.argmax(answer_end_scores) + 1

    # Convert the start and end tokens back to words
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end]))

print(answer)

Ask a question ?Who directed Interstellar
Christopher Nolan
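
`wiki.page` can raise an exception rather than return a page, for instance when the title is ambiguous. The sketch below shows one way to guard that call. `fetch_context` and `max_options` are hypothetical names, and the `wikipedia` module is passed in as an argument purely so the function can be exercised without network access. `DisambiguationError`, its `options` list, and `PageError` come from the `wikipedia` package.

```python
def fetch_context(title, wiki, max_options=3):
    """Return the page text for `title`, or None if no usable page is found.

    `wiki` is the imported `wikipedia` module (assumption: passed in for testability).
    """
    try:
        return wiki.page(title, auto_suggest=False).content
    except wiki.exceptions.DisambiguationError as err:
        # Ambiguous title: try the first few suggested alternatives
        for option in err.options[:max_options]:
            try:
                return wiki.page(option, auto_suggest=False).content
            except (wiki.exceptions.DisambiguationError, wiki.exceptions.PageError):
                continue
    except wiki.exceptions.PageError:
        # No page exists under this title
        pass
    return None
```

In the cells above this could be used as `context = fetch_context(search_results[0], wiki)`, with the "Sorry" message returned when the result is `None`.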


In [73]:
# Define the question
question = input("Ask a question ?")

# Use the Wikipedia API to search for the most relevant page based on the question
search_results = wiki.search(question)
if not search_results:
    answer = "Sorry, I could not find any relevant information on Wikipedia."
else:
    # Use the first search result as the page title
    page_title = search_results[0]
    # Get the Wikipedia page for the page title
    page = wiki.page(page_title)
    # Use the page content as the context for the Q&A model
    context = page.content

    # Set a maximum sequence length for the input
    max_seq_length = 512

    # Tokenize the input
    inputs = tokenizer.encode_plus(question, context, max_length=max_seq_length, truncation=True, padding='max_length', return_tensors="pt")

    # Obtain model scores
    answer_start_scores, answer_end_scores = model(**inputs, return_dict=False)

    # Get the most likely beginning of the answer
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of the answer
    answer_end = torch.argmax(answer_end_scores) + 1

    # Convert the start and end tokens back to words
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end]))

print(answer)

Ask a question ?Who was the founder of Mughal Empire in the India subcontinent?
Babur


In [76]:
# Define the question
question = input("Ask a question ?  ")

# Use the Wikipedia API to search for the most relevant page based on the question
search_results = wiki.search(question)
if not search_results:
    answer = "Sorry, I could not find any relevant information on Wikipedia."
else:
    # Use the first search result as the page title
    page_title = search_results[0]
    # Get the Wikipedia page for the page title
    page = wiki.page(page_title)
    # Use the page content as the context for the Q&A model
    context = page.content

    # Set a maximum sequence length for the input
    max_seq_length = 512

    # Tokenize the input
    inputs = tokenizer.encode_plus(question, context, max_length=max_seq_length, truncation=True, padding='max_length', return_tensors="pt")

    # Obtain model scores
    answer_start_scores, answer_end_scores = model(**inputs, return_dict=False)

    # Get the most likely beginning of the answer
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of the answer
    answer_end = torch.argmax(answer_end_scores) + 1

    # Convert the start and end tokens back to words
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end]))

print(answer)

Ask a question ?  Who is Elon Musk?
[CLS]


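**Error** - The expected output is a short description of Elon Musk, but the bot returns `[CLS]`. The model is extractive: it can only copy a span out of the context, so it cannot write a summary. When it finds no suitable span, it points at the `[CLS]` token at position 0, which is how models trained on SQuAD 2.0 signal "no answer".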
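
One way to handle the `[CLS]` case is to check the predicted span before decoding it and show a friendly message instead. The following is a minimal sketch, assuming the tokens come from `tokenizer.convert_ids_to_tokens`. `extract_answer` and the fallback message are illustrative names, and the wordpiece joining is a simplified stand-in for `convert_tokens_to_string`.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}
NO_ANSWER = "Sorry, I could not find an answer in the article."

def extract_answer(tokens, start, end, no_answer=NO_ANSWER):
    """Decode tokens[start:end], or return a fallback when no answer was found."""
    # Index 0 is [CLS]; an empty or reversed span is also unusable
    if start == 0 or end <= start:
        return no_answer
    pieces = [t for t in tokens[start:end] if t not in SPECIAL_TOKENS]
    if not pieces:
        return no_answer
    words = []
    for piece in pieces:
        # "##" marks a wordpiece that continues the previous word
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)
```

Here `start` and `end` follow the same convention as `answer_start` and `answer_end` in the cells above, where `end` is exclusive.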

### Putting it all together

#### Gradio UI

In [81]:
%pip install gradio

Collecting gradio
  Downloading gradio-3.20.1-py3-none-any.whl (14.3 MB)
Note: you may need to restart the kernel to use updated packages.

In [105]:
import gradio as gr
import wikipedia as wiki

# Load the pre-trained Q&A model and tokenizer
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("deepset/bert-base-cased-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")

# Define the function to get the answer
def get_answer(question):
    
    # Use the Wikipedia API to search for the most relevant page based on the question
    search_results = wiki.search(question)
    if not search_results:
        answer = "Sorry, I could not find any relevant information on Wikipedia."
    else:
        
        # Use the first search result as the page title
        page_title = search_results[0]
        
        # Get the Wikipedia page for the page title
        page = wiki.page(page_title)
        
        # Use the page content as the context for the Q&A model
        context = page.content

        # Set a maximum sequence length for the input
        max_seq_length = 512

        # Tokenize the input
        inputs = tokenizer.encode_plus(question, context, max_length=max_seq_length, truncation=True, padding='max_length', return_tensors="pt")

        # Obtain model scores
        answer_start_scores, answer_end_scores = model(**inputs, return_dict=False)

        # Get the most likely beginning of the answer
        answer_start = torch.argmax(answer_start_scores)
        # Get the most likely end of the answer
        answer_end = torch.argmax(answer_end_scores) + 1

        # Convert the start and end tokens back to words
        answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end]))
    
    return answer

# Create the Gradio interface
gr.Interface(
    fn = get_answer,
    inputs = gr.Textbox(label = "Question"),
    outputs = "text",
    title = "Wikipedia Q&A ChatBOT",
    description = "Type your question, I'm happy to get the answer from Wikipedia.",
    examples = [
        ["Who directed Interstellar?"],
        ["Who is Robert James Moroso?"],
        ["Who was the founder of the Mughal Empire in the Indian subcontinent?"],
        ["Where was Elijah Mudenda born?"],
        ["What is Cassiopeia?"],
        ["Which dynasty was established by Pushyavarman?"],
        ["Featherstone Prison was constructed on property previously owned by?"],
        ["Which team event did India win in the Olympics?"]
    ]).launch();

Running on local URL:  http://127.0.0.1:7877

To create a public link, set `share=True` in `launch()`.
