Retrieval and Question Answering Exercise

In this exercise, you will use a vector database to retrieve context relevant to questions about Best Picture winners since 2000. Each question can be answered from the Wikipedia page of the corresponding movie.

You have been provided a list of movies and links to their Wikipedia pages in the file best_picture_2000.csv.

Build a vector database from these Wikipedia pages that, given a query, returns passages likely to contain the answer.

Then use a question-answering model from Hugging Face to answer the question.

A list of question and answer pairs is given in QAs.csv, but feel free to add to it yourself.
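
The two stages described above (retrieve context, then read an answer out of it) can be wired together roughly as follows. This is only a sketch: `answer_question`, `toy_retrieve`, `toy_read` and `CORPUS` are hypothetical names, and the toy retriever and reader are stand-ins so the example runs without a database or a model.

```python
from typing import Callable, Dict, List


def answer_question(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    read: Callable[[str, str], Dict],
    n_results: int = 5,
) -> Dict:
    """Retrieve passages for a question, then let a reader extract the answer."""
    passages = retrieve(question, n_results)
    context = " ".join(passages)
    prediction = read(question, context)
    return {"question": question, "answer": prediction["answer"], "context": context}


# Toy stand-ins (assumptions for illustration only).
CORPUS = [
    "Gladiator was directed by Ridley Scott.",
    "The Hurt Locker was directed by Kathryn Bigelow.",
]


def toy_retrieve(question: str, n_results: int) -> List[str]:
    # Rank sentences by how many lowercase words they share with the question.
    q_words = set(question.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:n_results]


def toy_read(question: str, context: str) -> Dict:
    # A real reader would locate an answer span; this one returns the whole context.
    return {"answer": context, "score": 0.0}


result = answer_question(
    "Who directed The Hurt Locker?", toy_retrieve, toy_read, n_results=1
)
print(result["answer"])
```

In the notebook, `retrieve` could wrap `collection.query(query_texts=[question], n_results=n)['documents'][0]`, and `read` could wrap a `transformers` `pipeline("question-answering")` object, which is called with `question=` and `context=` and returns a dict containing `answer` and `score`.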

In [1]:
import chromadb
from nltk.tokenize import sent_tokenize
from tqdm.notebook import tqdm
import pandas as pd
import numpy as np

In [2]:
import requests
from bs4 import BeautifulSoup
import re

In [5]:
# Import package
import wikipedia


In [3]:
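client = chromadb.PersistentClient(path="./qa_vdb")
# get_or_create_collection also works on a first run, when the collection does not exist yet
collection = client.get_or_create_collection("best_pictures")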

In [4]:
movies = pd.read_csv('../data/best_picture_2000.csv')
movies = movies.reset_index()
movies

Unnamed: 0,index,title,link,year
0,0,Gladiator,https://en.wikipedia.org/wiki/Gladiator_(2000_...,2000
1,1,A Beautiful Mind,https://en.wikipedia.org/wiki/A_Beautiful_Mind...,2001
2,2,Chicago,https://en.wikipedia.org/wiki/Chicago_(2002_film),2002
3,3,The Lord of the Rings: The Return of the King,https://en.wikipedia.org/wiki/The_Lord_of_the_...,2003
4,4,Million Dollar Baby,https://en.wikipedia.org/wiki/Million_Dollar_Baby,2004
5,5,Crash,https://en.wikipedia.org/wiki/Crash_(2004_film),2005
6,6,The Departed,https://en.wikipedia.org/wiki/The_Departed,2006
7,7,No Country for Old Men,https://en.wikipedia.org/wiki/No_Country_for_O...,2007
8,8,Slumdog Millionaire,https://en.wikipedia.org/wiki/Slumdog_Millionaire,2008
9,9,The Hurt Locker,https://en.wikipedia.org/wiki/The_Hurt_Locker,2009


In [32]:
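# Strip the URL prefix to get the page title (regex=False treats the prefix as a literal string)
movies['page'] = movies['link'].str.replace('https://en.wikipedia.org/wiki/', '', regex=False)
# Decode percent-encoded apostrophes (%27)
movies['page'] = movies['page'].str.replace('%27', "'", regex=False)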
movies

Unnamed: 0,index,title,link,year,page
0,0,Gladiator,https://en.wikipedia.org/wiki/Gladiator_(2000_...,2000,Gladiator_(2000_film)
1,1,A Beautiful Mind,https://en.wikipedia.org/wiki/A_Beautiful_Mind...,2001,A_Beautiful_Mind_(film)
2,2,Chicago,https://en.wikipedia.org/wiki/Chicago_(2002_film),2002,Chicago_(2002_film)
3,3,The Lord of the Rings: The Return of the King,https://en.wikipedia.org/wiki/The_Lord_of_the_...,2003,The_Lord_of_the_Rings:_The_Return_of_the_King
4,4,Million Dollar Baby,https://en.wikipedia.org/wiki/Million_Dollar_Baby,2004,Million_Dollar_Baby
5,5,Crash,https://en.wikipedia.org/wiki/Crash_(2004_film),2005,Crash_(2004_film)
6,6,The Departed,https://en.wikipedia.org/wiki/The_Departed,2006,The_Departed
7,7,No Country for Old Men,https://en.wikipedia.org/wiki/No_Country_for_O...,2007,No_Country_for_Old_Men
8,8,Slumdog Millionaire,https://en.wikipedia.org/wiki/Slumdog_Millionaire,2008,Slumdog_Millionaire
9,9,The Hurt Locker,https://en.wikipedia.org/wiki/The_Hurt_Locker,2009,The_Hurt_Locker
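
The cell above decodes only `%27`. A more general way to get a page title from a link is to decode every percent escape with the standard library. The sketch below assumes the links all follow the `https://en.wikipedia.org/wiki/<title>` pattern; `page_title_from_url` and `WIKI_PREFIX` are hypothetical names, and the second URL is an illustrative example of an encoded apostrophe.

```python
from urllib.parse import unquote, urlparse

WIKI_PREFIX = "/wiki/"  # assumed path prefix of English Wikipedia article URLs


def page_title_from_url(url: str) -> str:
    """Return the article title encoded in a Wikipedia URL (hypothetical helper)."""
    path = urlparse(url).path
    if not path.startswith(WIKI_PREFIX):
        raise ValueError(f"not a Wikipedia article URL: {url}")
    # unquote decodes every %XX escape, not only %27
    return unquote(path[len(WIKI_PREFIX):])


titles = [
    page_title_from_url(u)
    for u in (
        "https://en.wikipedia.org/wiki/Chicago_(2002_film)",
        "https://en.wikipedia.org/wiki/The_King%27s_Speech",
    )
]
print(titles)
```

With the DataFrame from this notebook, the same helper could be applied as `movies['page'] = movies['link'].map(page_title_from_url)`.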


In [33]:
movies['page']

0                             Gladiator_(2000_film)
1                           A_Beautiful_Mind_(film)
2                               Chicago_(2002_film)
3     The_Lord_of_the_Rings:_The_Return_of_the_King
4                               Million_Dollar_Baby
5                                 Crash_(2004_film)
6                                      The_Departed
7                            No_Country_for_Old_Men
8                               Slumdog_Millionaire
9                                   The_Hurt_Locker
10                                The_King's_Speech
11                                The_Artist_(film)
12                                 Argo_(2012_film)
13                          12_Years_a_Slave_(film)
14                                   Birdman_(film)
15                                 Spotlight_(film)
16                            Moonlight_(2016_film)
17                               The_Shape_of_Water
18                                Green_Book_(film)
19          

# Specify the title of the Wikipedia page
wiki = wikipedia.page('Chicago_(2002_film)')
# Extract the plain text content of the page
text = wiki.content
print(text)

# Import package
import re

# Clean text
text = re.sub(r'==.*?==+', '', text)
text = text.replace("\n", '')
print(text)

# Using requests instead of wikipedia package: Specify url of the web page
movie_wiki = requests.get('https://en.wikipedia.org/wiki/Chicago_(2002_film)')

movie_wiki

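# Parse the HTML; naming the parser explicitly keeps the result independent of which parsers are installed
soup = BeautifulSoup(movie_wiki.text, 'html.parser')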

# Extract the plain text content from paragraphs
paras = []
for paragraph in soup.find_all('p'):
    paras.append(str(paragraph.text))

In [None]:
# Extract text from paragraph headers
# heads = []
# for head in soup.find_all('span', class_='mw-headline'):
#     heads.append(str(head.text))

In [None]:
# Interleave paragraphs & headers
# text = [val for pair in zip(paras, heads) for val in pair]
# text = ' '.join(text)

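# Join the paragraphs into one string, then drop bracketed footnote markers such as [12].
# (Calling str() on the list would add its own '[' and ']', which the regex would also match.)
paras = re.sub(r"\[[^\]]*\]", '', ' '.join(paras))

# Remove newline characters
paras = paras.replace('\n', '')
print(paras)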

In [34]:
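def wiki_page(page):
    # auto_suggest=False looks up the exact title instead of a suggested one
    wiki = wikipedia.page(page, auto_suggest=False)
    return wiki

def wiki_content(wiki):
    # Plain-text content of the page, including '== Heading ==' markers
    content = wiki.content
    return content

def wiki_text(content):
    # Drop section headings, then remove newlines
    text = re.sub(r'==.*?==+', '', content)
    text = text.replace("\n", '')
    return text

def wiki_process(page):
    # Fetch a page by title and return its cleaned text
    return wiki_text(wiki_content(wiki_page(page)))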

In [35]:
wiki_process("The_King's_Speech")

'The King\'s Speech is a 2010 historical drama film directed by Tom Hooper and written by David Seidler. Colin Firth plays the future King George VI who, to cope with a stammer, sees Lionel Logue, an Australian speech and language therapist played by Geoffrey Rush. The men become friends as they work together, and after his brother abdicates the throne, the new king relies on Logue to help him make his first wartime radio broadcast upon Britain\'s declaration of war on Germany in 1939.Seidler read about George VI\'s life after learning to manage a stuttering condition he developed during his own youth. He started writing about the relationship between the therapist and his royal patient as early as the 1980s, but at the request of the King\'s widow, Queen Elizabeth The Queen Mother, postponed work until her death in 2002. He later rewrote his screenplay for the stage to focus on the essential relationship between the two protagonists. Nine weeks before filming began, the filmmakers lea
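
Note the `1939.Seidler` in the output above: because `wiki_text` replaces headings and newlines with an empty string, the last sentence of one paragraph is fused to the first sentence of the next, which can keep a sentence tokenizer from splitting them. One possible alternative, shown here as a sketch on a made-up sample string, is to substitute a space and then collapse runs of whitespace. `clean_wiki_text` is a hypothetical name, not part of the notebook.

```python
import re


def clean_wiki_text(content: str) -> str:
    """Drop '== Heading ==' markers and newlines without fusing sentences."""
    # Same heading pattern as wiki_text, but replaced with a space
    text = re.sub(r'==.*?==+', ' ', content)
    # Collapse newlines and repeated spaces into single spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


# Made-up sample in the layout that the wikipedia package returns
sample = "Intro sentence one.\n\n\n== Plot ==\nPlot sentence two.\nMore plot."

# Behaviour of the original wiki_text, for comparison
fused = re.sub(r'==.*?==+', '', sample).replace("\n", '')
cleaned = clean_wiki_text(sample)

print(fused)
print(cleaned)
```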

In [37]:
movies['text'] = movies['page'].apply(wiki_process)

In [38]:
movies

Unnamed: 0,index,title,link,year,page,text
0,0,Gladiator,https://en.wikipedia.org/wiki/Gladiator_(2000_...,2000,Gladiator_(2000_film),Gladiator is a 2000 historical epic film direc...
1,1,A Beautiful Mind,https://en.wikipedia.org/wiki/A_Beautiful_Mind...,2001,A_Beautiful_Mind_(film),A Beautiful Mind is a 2001 American biographic...
2,2,Chicago,https://en.wikipedia.org/wiki/Chicago_(2002_film),2002,Chicago_(2002_film),Chicago is a 2002 American musical black comed...
3,3,The Lord of the Rings: The Return of the King,https://en.wikipedia.org/wiki/The_Lord_of_the_...,2003,The_Lord_of_the_Rings:_The_Return_of_the_King,The Lord of the Rings: The Return of the King ...
4,4,Million Dollar Baby,https://en.wikipedia.org/wiki/Million_Dollar_Baby,2004,Million_Dollar_Baby,Million Dollar Baby is a 2004 American sports ...
5,5,Crash,https://en.wikipedia.org/wiki/Crash_(2004_film),2005,Crash_(2004_film),Crash is a 2004 American crime drama film prod...
6,6,The Departed,https://en.wikipedia.org/wiki/The_Departed,2006,The_Departed,The Departed is a 2006 American epic crime thr...
7,7,No Country for Old Men,https://en.wikipedia.org/wiki/No_Country_for_O...,2007,No_Country_for_Old_Men,No Country for Old Men is a 2007 American neo-...
8,8,Slumdog Millionaire,https://en.wikipedia.org/wiki/Slumdog_Millionaire,2008,Slumdog_Millionaire,Slumdog Millionaire is a 2008 British drama fi...
9,9,The Hurt Locker,https://en.wikipedia.org/wiki/The_Hurt_Locker,2009,The_Hurt_Locker,The Hurt Locker is a 2008 American war thrille...


In [39]:
collection.get(ids = ['0_1'])

{'ids': [],
 'embeddings': None,
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None}

In [40]:
def add_movie(movie):
    sentences = sent_tokenize(movie['text'])
    collection.add(
        documents = sentences,
        ids = [f'{movie["index"]}_{i}' for i in range(len(sentences))],
        metadatas = [{'movie': movie['title']}] * len(sentences)
    )
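
The ID scheme used in `add_movie` (`<movie index>_<sentence number>`) can be checked without touching the database by building the three parallel lists first. The sketch below does that with a hypothetical helper, `build_records`. It also skips blank sentences and creates one metadata dict per sentence, whereas `[{...}] * n` repeats a reference to a single dict (harmless as long as it is never mutated).

```python
from typing import Dict, List


def build_records(movie_index: int, title: str, sentences: List[str]) -> Dict[str, list]:
    """Build parallel ids/documents/metadatas lists for one movie (hypothetical helper)."""
    # Drop empty or whitespace-only sentences so every document has content
    documents = [s.strip() for s in sentences if s.strip()]
    # IDs follow the notebook's '<movie index>_<sentence number>' scheme
    ids = [f"{movie_index}_{i}" for i in range(len(documents))]
    # One separate dict per sentence
    metadatas = [{"movie": title} for _ in documents]
    return {"ids": ids, "documents": documents, "metadatas": metadatas}


records = build_records(
    0, "Gladiator", ["First sentence.", "   ", "Second sentence."]
)
print(records["ids"])
```

The result could then be passed along as `collection.add(**records)`. If the notebook is re-run against a collection that already holds these IDs, `collection.upsert(**records)` may be the better fit.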

In [41]:
for _, row in tqdm(movies.iterrows()):
    add_movie(row)

0it [00:00, ?it/s]

In [42]:
collection.get(ids = ['0_1'])

{'ids': ['0_1'],
 'embeddings': None,
 'metadatas': [{'movie': 'Gladiator'}],
 'documents': ['It stars Russell Crowe, Joaquin Phoenix, Connie Nielsen, Tomas Arana, Ralf Möller, Oliver Reed (in his final role), Djimon Hounsou, Derek Jacobi, John Shrapnel, Richard Harris, and Tommy Flanagan.Crowe portrays Roman general Maximus Decimus Meridius, who is betrayed when Commodus, the ambitious son of Emperor Marcus Aurelius, murders his father and seizes the throne.'],
 'uris': None,
 'data': None}

In [68]:
query_text = "shot in"

results = collection.query(
    query_texts = [query_text],
    where={"movie": "The Hurt Locker"},
    n_results = 10
)

print(query_text)
print('\n -------- \n')
print('\n'.join(results['documents'][0]))
print('\n --------- \n')

shot in

 -------- 

Temperatures averaged 120 °F (49 °C) over the 44 days of shooting.
"There were two-by-fours with nails being dropped from two-story buildings that hit me in the helmet, and they were throwing rocks.... We got shot at a few times while we were filming", Renner said.
And in this way, the full impact of the Iraq war—at least as it was fought in 2004—becomes clear: American soldiers shot at Iraqi civilians even when, for example, they just happened to be holding a cell phone and standing near an IED."
James and Sanborn rescue him, although Eldridge is shot in the leg.
Describing the experience of filming in Jordan in the summer, he said, "It was so desperately hot, and we were so easily agitated.
According to Renner, shooting the film in the Middle East contributed to this.
Bigelow had wanted to film in Iraq, but the production security team could not guarantee their safety from snipers.Principal photography began in July 2007 in Jordan and Kuwait.
While fixing the tir