# Semantic Search - Wikipedia Question-Answer-Retrieval

This example demonstrates the setup for question-answer retrieval.

You can enter a query or a question. The script then uses semantic search
to find relevant passages in Simple English Wikipedia, which is smaller than the full English Wikipedia and fits better in RAM.

As the model, we use nq-distilbert-base-v1.

It was trained on the Natural Questions dataset, which pairs real questions from Google Search with annotated Wikipedia data that provides the answers. For the passages, we encode the Wikipedia article title together with each individual text passage.
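
To make the passage format concrete, here is a minimal sketch of how one article record can be turned into `[title, paragraph]` pairs. It assumes each line of the JSON Lines file is an object with a `title` string and a `paragraphs` list, which is the shape the loading cell in this notebook reads. The helper name `load_passages` and the two sample records are hypothetical and only serve as an illustration.

```python
import gzip
import io
import json


def load_passages(fileobj):
    """Return [title, paragraph] pairs from a JSON Lines text stream.

    Hypothetical helper: assumes every record has 'title' and 'paragraphs'.
    """
    passages = []
    for line in fileobj:
        line = line.strip()
        if not line:
            continue  # Skip empty lines instead of failing in json.loads
        record = json.loads(line)
        for paragraph in record["paragraphs"]:
            # Each passage keeps its article title as context
            passages.append([record["title"], paragraph])
    return passages


# Made-up sample data, gzip-compressed in memory so no download is needed
sample_records = [
    {"title": "Paris", "paragraphs": ["Paris is the capital of France.", "It lies on the Seine."]},
    {"title": "Cat", "paragraphs": ["Cats are small mammals."]},
]
raw = "\n".join(json.dumps(r) for r in sample_records).encode("utf8")
compressed = gzip.compress(raw)

with gzip.open(io.BytesIO(compressed), "rt", encoding="utf8") as f_in:
    sample_passages = load_passages(f_in)

print("Passages:", len(sample_passages))
print(sample_passages[0])
```

Every paragraph becomes its own passage, so an article with many paragraphs contributes many entries that all share the same title.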

In [None]:
!pip install -U sentence-transformers


In [None]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import time
import gzip
import os
import torch

if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add a GPU to your notebook.")


# We use the bi-encoder to encode all passages, so that we can use it with semantic search
model_name = 'nq-distilbert-base-v1'
bi_encoder = SentenceTransformer(model_name)
top_k = 5  # Number of passages we want to retrieve with the bi-encoder

# As the dataset, we use Simple English Wikipedia. Unlike the full English Wikipedia, it has only
# about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder.

wikipedia_filepath = 'data/simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())
        for paragraph in data['paragraphs']:
            # We encode the passages as [title, text]
            passages.append([data['title'], paragraph])

# If you like, you can also limit the number of passages you want to use
print("Passages:", len(passages))

# To speed things up, we download pre-computed embeddings.
# The provided file contains the passages encoded with the model 'nq-distilbert-base-v1'
if model_name == 'nq-distilbert-base-v1':
    embeddings_filepath = 'simplewiki-2020-11-01-nq-distilbert-base-v1.pt'
    if not os.path.exists(embeddings_filepath):
        util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01-nq-distilbert-base-v1.pt', embeddings_filepath)

    corpus_embeddings = torch.load(embeddings_filepath)
    corpus_embeddings = corpus_embeddings.float()  # Convert embedding file to float
    if torch.cuda.is_available():
        corpus_embeddings = corpus_embeddings.to('cuda')
else:  # Here, we compute the corpus_embeddings from scratch (which can take a while depending on the GPU)
    corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)


Passages: 509663



In [None]:
def search(query):
    # Encode the query using the bi-encoder and find potentially relevant passages
    start_time = time.time()
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query
    end_time = time.time()

    # Output of top-k hits
    print("Input question:", query)
    print("Results (after {:.3f} seconds):".format(end_time - start_time))
    for hit in hits:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']]))
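
To illustrate what the retrieval step does conceptually, the following sketch ranks a toy corpus by cosine similarity and returns the top-k hits as dictionaries with `corpus_id` and `score` keys, the same fields the `search` function reads from each hit. This is a brute-force, pure-Python illustration and not the library's implementation. The names `cosine_similarity`, `top_k_search`, `toy_corpus`, and `toy_query` are hypothetical, and the 2-dimensional vectors are made up; real embeddings have far more dimensions.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k_search(query_vec, corpus_vecs, k):
    """Score every corpus vector against the query and keep the k best."""
    scored = (
        {"corpus_id": i, "score": cosine_similarity(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    )
    # nlargest returns the hits sorted by descending score
    return heapq.nlargest(k, scored, key=lambda hit: hit["score"])


# Made-up 2-dimensional "embeddings" for four passages and one query
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.1]

toy_hits = top_k_search(toy_query, toy_corpus, k=2)
for hit in toy_hits:
    print("{:.3f}\t{}".format(hit["score"], hit["corpus_id"]))
```

Scoring every passage like this is linear in the corpus size, which is one reason the notebook keeps the pre-computed embeddings on the GPU when one is available.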

In [None]:
search(query = "What is the capital of the France?")

Input question: What is the capital of the France?
Results (after 1.493 seconds):
	0.826	['Capital of France', 'The capital of France is Paris. In the course of history, the national capital has been in many locations other than Paris.']
	0.753	['Arrondissement of Sarlat-la-Canéda', 'The arrondissement of Sarlat-la-Canéda is an arrondissement of France. It is part of the Dordogne "département" in the Nouvelle-Aquitaine region. Its capital is the city of Sarlat-la-Canéda.']
	0.752	['Arrondissement of Figeac', 'The arrondissement of Figeac is an arrondissement of France. It is part of the Lot "département" in the Occitanie region. Its capital is the city of Figeac.']
	0.746	["Arrondissement of Saint-Jean-d'Angély", "The arrondissement of Saint-Jean-d'Angély is an arrondissement of France, in the Charente-Maritime department, Nouvelle-Aquitaine region. Its capital is the city of Saint-Jean-d'Angély."]
	0.745	['Arrondissement of Confolens', 'The arrondissement of Confolens is an arrondisse

In [None]:
search(query = "What is the best orchestra in the world?")

Input question: What is the best orchestra in the world?
Results (after 0.645 seconds):
	0.642	['Orchestra', 'Some of the greatest orchestras today include: the New York Philharmonic Orchestra, the Boston Symphony Orchestra, the Chicago Symphony Orchestra, the Cleveland Orchestra, the Los Angeles Philharmonic Orchestra, the London Symphony Orchestra, the London Philharmonic Orchestra, the BBC Symphony Orchestra, the Royal Concertgebouw Orchestra, the Vienna Philharmonic Orchestra, the Berlin Philharmonic Orchestra, the Leipzig Gewandhaus Orchestra, the , the St Petersburg Philharmonic Orchestra, the Israel Philharmonic Orchestra, and the NHK Symphony Orchestra (Tokyo). Opera houses usually have their own orchestra, e.g. the orchestras of the Metropolitan Opera House, La Scala, or the Royal Opera House.']
	0.557	['Vienna Philharmonic', 'The Vienna Philharmonic (in German: die Wiener Philharmoniker) is an orchestra based in Vienna, Austria. It is thought of as one of the greatest orchest

In [None]:
search(query = "Number countries Europe")

Input question: Number countries Europe
Results (after 0.643 seconds):
	0.641	['Europe', 'There are at least 43 countries in Europe (the European identities of 5 transcontinental countries:Cyprus, Georgia, Kazakhstan, Russia and Turkey are disputed). Most of these countries are members of the European Union.']
	0.639	['Eurozone', 'There are 19 members in the Eurozone']
	0.638	['Northern Europe', "Europe, the planet's 6th largest continent, includes 47 countries and assorted dependencies, islands and territories."]
	0.628	['Europe', 'Within these regions, there are up to 48 independent European countries (with the identities of 5 transcontinental countries being disputed). The largest is the Russian Federation, which covers 39% of Europe.']
	0.616	['Schengen Area', 'Twenty-six countries belong to the Schengen Area. All these countries are members of the European Union, except Iceland, Liechtenstein, Norway and Switzerland.']


In [None]:
search(query = "When did the cold war end?")

Input question: When did the cold war end?
Results (after 0.646 seconds):
	0.715	['Cold War', 'Not all historians agree on when the Cold War ended. Some think it ended when the Berlin Wall fell. Others think it ended when the Soviet Union collapsed in 1991.']
	0.668	['Cold War', 'After the fall of the Berlin Wall in 1989 and without Communist rule holding together the countries that comprised the Soviet Union, the USSR broke into smaller countries, like Russia, Ukraine, Lithuania and Georgia. The nations of Eastern Europe returned to capitalism, and the period of the Cold War was over. The Soviet Union ended in December 1991.']
	0.643	['Ronald Reagan', 'When Reagan visited Moscow for the fourth summit in 1988, he was seen as a celebrity by the Soviets. A journalist asked the president if he still considered the Soviet Union the evil empire. "No", he replied, "I was talking about another time, another era". In November 1989, ten months after Reagan left office, the Berlin Wall was torn 

In [None]:
search(query = "How long do cats live?")

Input question: How long do cats live?
Results (after 0.656 seconds):
	0.805	['Aging in cats', 'Reliable information on the lifespans of house cats is hard to find. However, research has been done to get an estimate (an educated guess) on how long cats usually live. Cats usually live for 13 to 20 years. Sometimes cats can live for 22 to 30 years but there are claims of cats dying at ages of more than 30 years old.']
	0.744	['Aging in cats', 'The life expectancy of an indoor cat is around 17 years, but the life expectancy of outdoor cats is 5.6 years.']
	0.628	['Mouse-like hamsters', 'Mouse-like hamsters live the longest of all the mouse species. They have been recorded as living 9 years, 3 months and 18 days in captivity. They regularly live over 4 years in captivity. The species which lives the second longest in the muroids lives 7 years, 8 months, which is the Canyon Mouse, "Peromyscus crinitus".']
	0.608	['American Shorthair', 'The American Shorthair is a working cat. It has a large

In [None]:
search(query = "How many people live in Toronto?")

Input question: How many people live in Toronto?
Results (after 0.657 seconds):
	0.753	['Toronto', 'The City of Toronto has a population of over 3 million people. Even more people live in the regions around it. All together, the Greater Toronto Area is home to over 6 million people. That makes it the biggest metropolitan area in Canada.']
	0.606	['Markham, Ontario', 'Markham, Ontario is a city in Regional Municipality of York, in the Greater Toronto Area of Southern Ontario, Canada. There are twice as many people there as in 1990. 261,573 people live in Markham. It is the 4th largest town in the Greater Toronto Area, after Toronto, Mississauga, and Brampton.']
	0.577	['Toronto', 'Toronto is the capital city of the province of Ontario in Canada. It is also the largest city in both Ontario and Canada. Found It is on the north-west side of Lake Ontario.']
	0.561	['Toronto', 'Toronto is a very multicultural city. Different people from around the world have moved to Toronto to live since th

In [None]:
search(query = "Oldest US president")

Input question: Oldest US president
Results (after 0.650 seconds):
	0.643	['William Henry Harrison', 'He was elected president in 1840, and took the oath of office on March 4, 1841. His inauguration speech lasted an hour and forty minutes. William Henry Harrison caught a serious case of pneumonia, and on April 4 that same year he died. He was the first President to die in office. Harrison was the oldest president to take office at , until 1981 when Ronald Reagan was a year older than Harrison. He was the last president to be born before the United States Declaration of Independence.']
	0.615	['Martin Van Buren', 'Martin Van Buren (December 5, 1782 – July 24, 1862) was the eighth President of the United States. He was the first president born after the United States Declaration of Independence, making him the first president who was born as a U.S. citizen.']
	0.571	['George Washington', 'George Washington (February 22, 1732 – December 14, 1799) was the first President of the United Stat

In [None]:
search(query = "Coldest place earth")

Input question: Coldest place earth
Results (after 0.662 seconds):
	0.648	['East Antarctica', 'East Antarctica, also called Greater Antarctica, is the largest part (two-thirds) of the Antarctic continent. It is on the Indian Ocean side of the Transantarctic Mountains. It is the coldest, windiest, and driest part of Earth. East Antarctica holds the record as the coldest place on earth.']
	0.619	['Ulan Bator', 'Ulaanbaatar is at about 1350 meters (4430\xa0ft) above sea level. For this high elevation, and for the high latitude, and location hundreds of kilometres from any coast, Ulaanbaatar is the coldest national capital in the world, with a subarctic climate.']
	0.615	['Cold', 'The coldest possible temperature is −273.15 °C, which can be expressed as -459.67 °F on the Fahrenheit scale. This is called absolute zero. Absolute zero is also 0 K on the Kelvin scale and 0 °R on the Rankine scale']
	0.608	['Antarctica', 'Antarctica is the coldest, driest and windiest continent. It is also, on 

In [None]:
search(query = "When was Barack Obama born?")

Input question: When was Barack Obama born?
Results (after 0.655 seconds):
	0.717	['Barack Obama', 'Obama was born on August 4, 1961 in Kapiʻolani Medical Center for Women and Children (called Kapiʻolani Maternity & Gynecological Hospital in 1961) in Honolulu, Hawaii and is the first President to have been born in Hawaii. His father was a black exchange student from Kenya named Barack Obama Sr. He died in a motorcycle accident in Kenya in 1982. His mother was a white woman from Kansas named Ann Dunham, who was an anthropologist and died in 1995. He spent most of his childhood in Hawaii and Chicago, Illinois, although he lived in Jakarta, Indonesia with his mother and stepfather from age 6 to age 10. He later moved back to Hawaii to live with his grandparents.']
	0.699	['Stanley Armour Dunham', 'Stanley Armour Dunham (March 23, 1918 – February 8, 1992) was the maternal grandfather of U.S. President Barack Obama. He and his wife Madelyn Payne Dunham raised Obama from the age of 10 in Hon

In [None]:
search(query = "Paris eiffel tower")

Input question: Paris eiffel tower
Results (after 0.660 seconds):
	0.667	['Eiffel Tower', 'The Eiffel Tower (French: La Tour Eiffel, ], IPA pronunciation: "EYE-full" English; "eh-FEHL" French) is a landmark in Paris. It was built between 1887 and 1889 for the Exposition Universelle (World Fair). The Tower was the Exposition\'s main attraction.']
	0.645	['France', 'Some of the most famous attractions in Paris, are the Eiffel Tower and the Arc de Triomphe. Another one is Mont Saint Michel, in Normandy.']
	0.564	['Arc de Triomphe', 'The Arc de Triomphe (meaning "arch of victory)", at the centre of the place de l\'Étoile and the western end of the Champs-Elysées, is a very famous monument in Paris.']
	0.560	['Multiple choice', '2. Where is the Eiffel Tower?<br>A) London<br>B) Paris<br>C) Singapore<br>D) New York']
	0.508	['Louise Michel (Paris Metro)', 'Louise Michel is a station of the Paris Métro, serving Line 3. It is located in the commune of Levallois-Perret northwest of the French ca

In [None]:
search(query = "Which US president was killed?")

Input question: Which US president was killed?
Results (after 0.648 seconds):
	0.665	['Assassination', 'In the United States, four presidents were assassinated within 100 years. They were Presidents Abraham Lincoln (1865), James Garfield (1881), William McKinley (1901), and John F. Kennedy (1963).']
	0.640	['Abraham Lincoln', 'Abraham Lincoln (February 12, 1809 \xa0– April 15, 1865) was an American politician. He was the 16th President of the United States. He was president from 1861 to 1865, during the American Civil War. Just five days after most of the Confederate forces had surrendered and the war was ending, John Wilkes Booth assassinated Lincoln. Lincoln was the first president of the United States to be assassinated. Lincoln has been remembered as the "Great Emancipator" because he worked to end slavery in the United States.']
	0.630	['James A. Garfield', 'James Abram Garfield (November 19, 1831 - September 19, 1881) was the 20th (1881) President of the United States and the 2nd

In [None]:
search(query="When is Chinese New Year")

Input question: When is Chinese New Year
Results (after 0.652 seconds):
	0.761	['Chinese New Year', 'Chinese New Year, known in China as the SpringFestival and in Singapore as the LunarNewYear, is a holiday on and around the new moon on the first day of the year in the traditional Chinese calendar. This calendar is based on the changes in the moon and is only sometimes changed to fit the seasons of the year based on how the Earth moves around the sun. Because of this, Chinese New Year is never on January1. It moves around between January21 and February20.']
	0.745	["People's Republic of China", "Chinese New Year lasts fifteen days, including one week as a national holiday. It starts with the first day of the Chinese lunar year and ends with the full moon fifteen days later. It is always in the middle of winter, but is called the Spring Festival in Chinese because Chinese seasons are a little different from English ones. On the first day of the Chinese New Year, people call on friends a

In [None]:
search(query="what is the name of manchester united stadium")

Input question: what is the name of manchester united stadium
Results (after 0.657 seconds):
	0.747	['Manchester United F.C.', 'Manchester United F.C. is a football club that plays in the Premier League. They play their home games at Old Trafford which is in Greater Manchester.']
	0.706	['Old Trafford', 'Old Trafford is a football stadium in Manchester in North West England. Its nickname is "The Theatre of Dreams". It is home to the club Manchester United F.C.. It is the biggest club stadium in Great Britain and second biggest stadium in Great Britain, with Wembley Stadium being the biggest. Old Trafford hosted most of England\'s home matches while Wembley was being built. It was built in 1910. It cost about £60,000,000 to build.']
	0.701	['Manchester Regional Arena', 'The Manchester Regional Arena is a stadium in Manchester, England used mostly for athletics and association football. It was originally made as the warm-up track for the 2002 Commonwealth Games held at the City of Manche

In [None]:
search(query="who wrote cant get you out of my head lyrics")

Input question: who wrote cant get you out of my head lyrics
Results (after 0.652 seconds):
	0.602	["Can't Get You Out of My Head", '"Can\'t Get You Out of My Head" is a song recorded by Australian singer Kylie Minogue. It is the first single from her eighth studio album, "Fever". It is credited with reigniting her career in the 2000s. It is a dance-pop song. The song was reported to reach number one in her home country of Australia, along with New Zealand and every country in Europe except Finland, where it reached the top five. It reached number seven on the "Billboard" Hot 100, becoming her second US top-ten hit after her version of "The Locomotion" in 1988. It has become Minogue\'s signature song.']
	0.445	["Just Can't Get Enough", '"Just Can\'t Get Enough" is a song by American hip hop group The Black Eyed Peas. William Adams, Allan Pineda, Jaime Gomez and Stacy Ferguson of the Black Eyed Peas wrote the song. Joshua Alvarez, Stephen Shadowen, Rodney "Darkchild" Jerkins, LaShawn Da

In [None]:
search(query="where does the story the great gatsby take place")

Input question: where does the story the great gatsby take place
Results (after 0.669 seconds):
	0.769	['The Great Gatsby', 'The Great Gatsby is a novel by F. Scott Fitzgerald. It was first sold in 1925. The novel takes place in New York City and Long Island in New York.']
	0.768	['The Great Gatsby', 'The story is told by Nick Carraway, a man who moves to Long Island, New York, from the Midwest. Nick is not rich, but he lives in a rich area that has two towns called East Egg and West Egg. The "old rich" live in East Egg while the "new rich" live in West Egg. Nick lives in a small house in West Egg. Nick\'s next-door neighbor is Jay Gatsby. Jay is in love with Nick\'s cousin Daisy. However, Daisy is married to a man named Tom. The novel is about Jay and his hope that he can steal Daisy from Tom.']
	0.762	['The Great Gatsby', "The events of the novel happen in the summer of 1922. Nick Carraway, a man who grew up in the American Midwest, is the narrator. Nick is a World War I veteran and 

In [None]:
search(query="who turned out to be the mother on how i met your mother")

Input question: who turned out to be the mother on how i met your mother
Results (after 0.652 seconds):
	0.730	['Cristin Milioti', 'Cristin Milioti (born August 16, 1985) is an American actress and singer. She is best known for playing the Mother on the sitcom "How I Met Your Mother" from 2013 to 2014. She has also played Teresa Petrillo Belfort in the 2013 movie "The Wolf of Wall Street", and Betsy Solverson in the second season of "Fargo" (2015).']
	0.674	['How I Met Your Mother', "A retelling of Ted Mosby's (Josh Radnor) life before he met his wife (Cristin Milioti), whom is telling his kids in the year 2030, hence the title, 'How I Met Your Mother'. The story focuses on Ted's everyday life and relationships between his four best friends, Marshall Eriksen (Jason Segel), Lily Auldren (Alyson Hannigan), Barney Stinson (Neil Patrick Harris) and Robin Scherbatsky (Cobie Smulders). The show is mainly a comedy, along with heart-warming scenes and tearful moments. The show was loosely base

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('nq-distilbert-base-v1')

query_embedding = model.encode('How many people live in London?')

# The passages are encoded as [title, text]
passage_embedding = model.encode([['London', 'London has 9,787,426 inhabitants at the 2011 census.']])

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))

Similarity: tensor([[0.6503]])


In [None]:
query_embedding = model.encode('who turned out to be the mother on how i met your mother')

# The passages are encoded as [title, text]
passage_embedding = model.encode([['The Mother (How I Met Your Mother)', 'The Mother (How I Met Your Mother) Tracy McConnell (colloquial: "The Mother") is the title character from the CBS television sitcom "How I Met Your Mother". The show, narrated by Future Ted (Bob Saget), tells the story of how Ted Mosby (Josh Radnor) met The Mother. Tracy McConnell appears in eight episodes, from "Lucky Penny" to "The Time Travelers", as an unseen character; she was first seen fully in "Something New" and was promoted to a main character in season 9. The Mother is played by Cristin Milioti. The story of how Ted met The Mother is the framing device'],
                                  ['Make It Easy on Me', 'and Pete Waterman on her 1993 album "Good \'N\' Ready", on which a remixed version of the song is included. "Make It Easy On Me", a mid-tempo R&B jam, received good reviews (especially for signalling a different, more soulful and mature sound atypical of the producers\' Europop fare), but failed to make an impact on the charts, barely making the UK top 100 peaking at #99, and peaking at #52 on the "Billboard" R&B charts. The pop group Steps covered the song on their 1999 album "Steptacular". It was sung as a solo by Lisa Scott-Lee. Make It Easy on']])

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))

Similarity: tensor([[ 0.7562, -0.0835]])
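
The printed tensor has one row per query and one column per passage, so the two values above are the scores of the first and second passage for the single query. As a small sketch of how such a row can be turned into a ranking, the snippet below copies the two printed scores into a plain Python list and sorts the passage titles by score. The variable names are hypothetical, and the scores are taken from the output above rather than recomputed.

```python
# Scores copied from the printed output: one row per query, one column per passage
similarity = [[0.7562, -0.0835]]
passage_titles = ["The Mother (How I Met Your Mother)", "Make It Easy on Me"]

# Pair each score with its passage title and sort from most to least similar
ranked = sorted(zip(similarity[0], passage_titles), reverse=True)

best_score, best_title = ranked[0]
print("Best match: {} ({:.4f})".format(best_title, best_score))
```

The relevant passage scores far above the unrelated one, which lands slightly below zero.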


In [None]:
query_embedding = model.encode('where does the story the great gatsby take place')
passage_embedding = model.encode([['The Great Gatsby',
 'The Great Gatsby The Great Gatsby is a 1925 novel written by American author F. Scott Fitzgerald that follows a cast of characters living in the fictional towns of West Egg and East Egg on prosperous Long Island in the summer of 1922. The story primarily concerns the young and mysterious millionaire Jay Gatsby and his quixotic passion and obsession with the beautiful former debutante Daisy Buchanan. Considered to be Fitzgerald\'s magnum opus, "The Great Gatsby" explores themes of decadence, idealism, resistance to change, social upheaval, and excess, creating a portrait of the Roaring Twenties that has been described as'],
 ['The Producers (1967 film)', '2005 (to coincide with the remake released that year). In 2011, MGM licensed the title to Shout! Factory to release a DVD and Blu-ray combo pack with new HD transfers and bonus materials. StudioCanal (worldwide rights holder to all of the Embassy Pictures library) released several R2 DVD editions and Blu-ray B releases using a transfer slightly different from the North Ameri can DVD and BDs. The Producers (1967 film) The Producers is a 1967 American satirical comedy film written and directed by Mel Brooks and starring Zero Mostel, Gene Wilder, Dick Shawn, and Kenneth Mars. The film was Brooks\'s directorial']
])

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))


Similarity: tensor([[ 0.8294, -0.2055]])
