# Demo QA

A demonstration of the question-answering system.

In [1]:
# plotly standard imports
import plotly.graph_objs as go
import chart_studio.plotly as py

# Cufflinks wrapper on plotly
import cufflinks

# Data science imports
import pandas as pd
import numpy as np

# Options for pandas
pd.options.display.max_columns = 30

# Display all cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

from plotly.offline import iplot, init_notebook_mode
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=True)

# Set global theme
cufflinks.set_config_file(world_readable=True, theme='pearl')

## Evaluate Search Engine

Check which context passages the search engine retrieves for the system.

In [2]:
from datasets import load_dataset

eli5 = load_dataset('eli5', cache_dir='./datasets')
wiki40b_snippets = load_dataset('wiki_snippets', name='wiki40b_en_100_0', cache_dir='./datasets')['train']

Reusing dataset eli5 (./datasets/eli5/LFQA_reddit/1.0.0/339112ecaedfbceb5b50e2b05935a382d504f72b4fdb27ce0f697102d4eb0535)
Reusing dataset wiki_snippets (./datasets/wiki_snippets/wiki40b_en_100_0/1.0.0/d152a0e6a420c02b9b26e7f75f45fb54c818cae1d83e8f164f0b1a13ac7998ae)


In [3]:
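# The names used below without an explicit import (AutoTokenizer, AutoModel,
# AutoModelForSeq2SeqLM, faiss, query_qa_dense_index, embed_questions_for_retrieval,
# make_qa_s2s_batch) are expected to come from this wildcard import.
from lfqa_utils import *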

In [4]:
qar_tokenizer = AutoTokenizer.from_pretrained('yjernite/retribert-base-uncased', cache_dir='./tokenaizers')
qar_model = AutoModel.from_pretrained('yjernite/retribert-base-uncased', cache_dir='./models').to('cuda:0')
_ = qar_model.eval()

Some weights of RetriBertModel were not initialized from the model checkpoint at yjernite/retribert-base-uncased and are newly initialized: ['bert_query.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
# faiss_res = faiss.StandardGpuResources()
wiki40b_passage_reps = np.memmap(
    'wiki40b_passages_reps_32_l-8_h-768_b-512-512.dat',
    dtype='float32',
    mode='r',
    shape=(wiki40b_snippets.num_rows, 128),
)

# wiki40b_index_flat = faiss.IndexFlatIP(128)
# wiki40b_gpu_index = faiss.index_cpu_to_gpu(faiss_res, 0, wiki40b_index_flat)
# wiki40b_gpu_index.add(wiki40b_passage_reps)

wiki40b_index_flat = faiss.IndexFlatIP(128)

wiki40b_index_flat.add(wiki40b_passage_reps)
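
For intuition, the exhaustive inner-product search that a flat index performs can be sketched in plain NumPy. This is a toy stand-in under stated assumptions, not the faiss implementation: `inner_product_search` and the toy arrays are hypothetical names and values introduced only for illustration.

```python
import numpy as np


def inner_product_search(passage_reps, query_reps, k):
    """Exhaustive maximum-inner-product search (toy stand-in for a flat IP index)."""
    # Score every query against every passage: shape (n_queries, n_passages)
    scores = query_reps @ passage_reps.T
    # Sort passages by descending score and keep the k best per query
    top_ids = np.argsort(-scores, axis=1, kind='stable')[:, :k]
    top_scores = np.take_along_axis(scores, top_ids, axis=1)
    return top_scores, top_ids


# Hypothetical toy data: 5 passages and 1 query in a 4-dimensional space
toy_passages = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.75, 0.75, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.2, 0.2, 1.0, 1.0],
], dtype='float32')
toy_query = np.array([[1.0, 1.0, 0.0, 0.0]], dtype='float32')

toy_D, toy_I = inner_product_search(toy_passages, toy_query, k=3)
print(toy_I)  # passage indices, best match first
print(toy_D)  # matching inner-product scores
```

As with the `D, I` pair returned by the index search in this notebook, the first array holds scores and the second holds passage indices, one row per query.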

In [6]:
question = eli5['test_eli5'][12342]['title']
doc, res_list = query_qa_dense_index(question, qar_model, qar_tokenizer, wiki40b_snippets, wiki40b_index_flat, device='cuda:0')

df = pd.DataFrame({
    'Article': ['---'] + [res['article_title'] for res in res_list],
    'Sections': ['---'] + [res['section_title'] if res['section_title'].strip() != '' else res['article_title']
                 for res in res_list],
    'Text': ['--- ' + question] + [res['passage_text'] for res in res_list],
})
df.style.set_properties(**{'text-align': 'left'})

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.

The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).



,Article,Sections,Text
0,---,---,--- What is stopping us from covering large sections of the desert with solar panels?
1,Renewable energy in the United States,Concentrated solar power,"disturb an average of 2.7 to 2.9 acres per gigawatt-hour/year, and use from 3.5 to 3.8 acres per gW-hr/year for the entire sites. According to a 2009 study, this intensity of land use is less than that of the country's average power plant using surface-mined coal. Some of the land in the eastern portion of the Mojave Desert is to be preserved, but the solar industry is more interested in areas of the western desert, ""where the sun burns hotter and there is easier access to transmission lines"". Some of the largest solar thermal power plants in the United States are"
2,Solar power plants in the Mojave Desert,Land use issues & Water use issues,"acres of offshore exploration in the Gulf of Mexico are under lease for oil and gas development, exploration and production. Some of the land in the eastern Mojave Desert will be preserved, but the solar industry is mainly interested in areas of the western desert, ""where the sun burns hotter and there is easier access to transmission lines"", said Kenn J. Arnecke of FPL Energy, a view shared by many industry executives. Water use issues Concentrating solar plants in the Mojave Desert have brought up issues of water use, because concentrating solar power plants with wet-cooling systems have high water-consumption intensities"
3,Masdar City,Renewable resources,"scale. Then you realise it's much more efficient to build your solar field on the ground in the middle of the desert. You can send a man to brush them off every day, rather than having to access everyone's buildings individually, and you can make sure that they are running at their absolute peak. It's much better than putting them on every building in the city."" Blowing sand has been a problem for its solar panels, so Masdar has been working with other companies to engineer surfaces with pores smaller than sand particles to stop them from sticking on the panels."
4,Renewable energy debate,Solar power,"were built, they would total 7,387 megawatts. The requirement for so much land has spurred efforts to encourage solar facilities to be built on already-disturbed lands, and the Department of Interior identified Solar Energy Zones that it judges to contain lower value habitat where solar development would have less of an impact on ecosystems. Sensitive wildlife impacted by large solar facility plans include the desert tortoise, Mohave Ground Squirrel, Mojave fringe-toed lizard, and desert bighorn sheep. In the United States, some of the land in the eastern portion of the Mojave Desert is to be preserved, but the solar industry"
5,Energy in Jordan,Solar,"per kWh tendered in early 2015 for the second phase of the Mohammed bin Rashid Al Maktoum Solar Park in the United Arab Emirates. A plan to put solar panels at all 6000 mosques in the country was announced in February 2015. Jordan inaugurated its first solar-powered charging station for electric cars in February 2012. Located at El Hassan Science City (EHSC), the station is considered the first step towards promoting solar-powered vehicles and building more solar-charging facilities on the streets of Jordan. The Sahara Forest Project, a Norwegian endeavour to create oases in hot, arid and uninhabited lands, is currently being"
6,Solar power in the United Arab Emirates,Abu Dhabi & Dubai,"all rooftop panels, it was found easier to clean the sand off ground mounted panels at a single location. Dubai The Dubai Clean Energy Strategy aims to provide 7 per cent of Dubai’s energy from clean energy sources by 2020. It will increase this target to 25 per cent by 2030 and 75 per cent by 2050. Due to a variety of factors, a Saudi-backed consortium had a low bid to build the solar farm in Dubai for only 3¢/kWh. The first phase of the proposed 1,000 MW Mohammed bin Rashid Al Maktoum Solar Park, in Seih Al-Dahal, about 50 kilometers"
7,Desert,Solar energy capture,"for solar energy, partly due to low amounts of cloud cover. Many solar power plants have been built in the Mojave Desert such as the Solar Energy Generating Systems and Ivanpah Solar Power Facility. Large swaths of this desert are covered in mirrors. The potential for generating solar energy from the Sahara Desert is huge, the highest found on the globe. Professor David Faiman of Ben-Gurion University has stated that the technology now exists to supply all of the world's electricity needs from 10% of the Sahara Desert. Desertec Industrial Initiative was a consortium seeking $560 billion to invest in North"
8,Renewable energy in Pakistan,Solar power,"district) (now that system is unserviceable) and Dittal Khan Laghari, Digri (Mirpurkhas district).The Punjab government announced the establishment of Quaid-e-Azam Solar Park over an area of 5,000 acres in the Cholistan Development Authority in Bahawalpur. A practical example of the use of solar energy can be seen in some rural villages of Pakistan where houses have been provided with solar panels that run electric fans and energy-saving bulbs. One notable and successfully implemented case was the village of Narian Khorian (about 50 kilometers from Islamabad), which employs the use of 100 solar panels installed by a local firm, free of cost;"
9,Concentrated solar power,Very large scale solar power plants & Suitable sites,"solar power plants using 1% of each of the world's deserts. Total consumption worldwide was 15,223 TWh/year (in 2003). The gigawatt size projects would have been arrays of standard-sized single plants. In 2012, the BLM made available 97,921,069 acres (39,627,251 hectares) of land in the southwestern United States for solar projects, enough for between 10,000 and 20,000 GW. The largest single plant in operation is the 510 MW Noor Solar Power Station. Suitable sites The locations with highest direct irradiance are dry, at high altitude, and located in the tropics. These locations have a higher potential for CSP than areas with less sun. Abandoned"


In [7]:
q_rep = embed_questions_for_retrieval([question], qar_tokenizer, qar_model)
D, I = wiki40b_index_flat.search(q_rep, 10)
res_passages_lst = [[wiki40b_snippets[int(i)] for i in i_lst] for i_lst in I]

res_passages_lst

[[{'_id': '{"datasets_id": 5770, "wiki_id": "Q3246573", "sp": 14, "sc": 1444, "ep": 14, "ec": 2032}',
   'datasets_id': 5770,
   'wiki_id': 'Q3246573',
   'start_paragraph': 14,
   'start_character': 1444,
   'end_paragraph': 14,
   'end_character': 2032,
   'article_title': 'Renewable energy in the United States',
   'section_title': 'Concentrated solar power',
   'passage_text': 'disturb an average of 2.7 to 2.9 acres per gigawatt-hour/year, and use from 3.5 to 3.8 acres per gW-hr/year for the entire sites. \nAccording to a 2009 study, this intensity of land use is less than that of the country\'s average power plant using surface-mined coal. Some of the land in the eastern portion of the Mojave Desert is to be preserved, but the solar industry is more interested in areas of the western desert, "where the sun burns hotter and there is easier access to transmission lines".\nSome of the largest solar thermal power plants in the United States are'},
  {'_id': '{"datasets_id": 88543, "wi

In [8]:
res_passages_lst[0][0]

{'_id': '{"datasets_id": 5770, "wiki_id": "Q3246573", "sp": 14, "sc": 1444, "ep": 14, "ec": 2032}',
 'datasets_id': 5770,
 'wiki_id': 'Q3246573',
 'start_paragraph': 14,
 'start_character': 1444,
 'end_paragraph': 14,
 'end_character': 2032,
 'article_title': 'Renewable energy in the United States',
 'section_title': 'Concentrated solar power',
 'passage_text': 'disturb an average of 2.7 to 2.9 acres per gigawatt-hour/year, and use from 3.5 to 3.8 acres per gW-hr/year for the entire sites. \nAccording to a 2009 study, this intensity of land use is less than that of the country\'s average power plant using surface-mined coal. Some of the land in the eastern portion of the Mojave Desert is to be preserved, but the solar industry is more interested in areas of the western desert, "where the sun burns hotter and there is easier access to transmission lines".\nSome of the largest solar thermal power plants in the United States are'}

In [10]:


passages = res_passages_lst[0]
scores = D[0]


df = pd.DataFrame({
    'Article': [p['article_title'] for p in passages],
    'Sections': [p['section_title'] if p['section_title'].strip() != '' else p['article_title'] for p in passages],
    'Text': [p['passage_text'] for p in passages],
    'Score': scores
})
df.style.set_properties(**{'text-align': 'left'})

,Article,Sections,Text,Score
0,Renewable energy in the United States,Concentrated solar power,"disturb an average of 2.7 to 2.9 acres per gigawatt-hour/year, and use from 3.5 to 3.8 acres per gW-hr/year for the entire sites. According to a 2009 study, this intensity of land use is less than that of the country's average power plant using surface-mined coal. Some of the land in the eastern portion of the Mojave Desert is to be preserved, but the solar industry is more interested in areas of the western desert, ""where the sun burns hotter and there is easier access to transmission lines"". Some of the largest solar thermal power plants in the United States are",24.65955
1,Solar power plants in the Mojave Desert,Land use issues & Water use issues,"acres of offshore exploration in the Gulf of Mexico are under lease for oil and gas development, exploration and production. Some of the land in the eastern Mojave Desert will be preserved, but the solar industry is mainly interested in areas of the western desert, ""where the sun burns hotter and there is easier access to transmission lines"", said Kenn J. Arnecke of FPL Energy, a view shared by many industry executives. Water use issues Concentrating solar plants in the Mojave Desert have brought up issues of water use, because concentrating solar power plants with wet-cooling systems have high water-consumption intensities",24.249338
2,Masdar City,Renewable resources,"scale. Then you realise it's much more efficient to build your solar field on the ground in the middle of the desert. You can send a man to brush them off every day, rather than having to access everyone's buildings individually, and you can make sure that they are running at their absolute peak. It's much better than putting them on every building in the city."" Blowing sand has been a problem for its solar panels, so Masdar has been working with other companies to engineer surfaces with pores smaller than sand particles to stop them from sticking on the panels.",24.224705
3,Renewable energy debate,Solar power,"were built, they would total 7,387 megawatts. The requirement for so much land has spurred efforts to encourage solar facilities to be built on already-disturbed lands, and the Department of Interior identified Solar Energy Zones that it judges to contain lower value habitat where solar development would have less of an impact on ecosystems. Sensitive wildlife impacted by large solar facility plans include the desert tortoise, Mohave Ground Squirrel, Mojave fringe-toed lizard, and desert bighorn sheep. In the United States, some of the land in the eastern portion of the Mojave Desert is to be preserved, but the solar industry",24.21981
4,Energy in Jordan,Solar,"per kWh tendered in early 2015 for the second phase of the Mohammed bin Rashid Al Maktoum Solar Park in the United Arab Emirates. A plan to put solar panels at all 6000 mosques in the country was announced in February 2015. Jordan inaugurated its first solar-powered charging station for electric cars in February 2012. Located at El Hassan Science City (EHSC), the station is considered the first step towards promoting solar-powered vehicles and building more solar-charging facilities on the streets of Jordan. The Sahara Forest Project, a Norwegian endeavour to create oases in hot, arid and uninhabited lands, is currently being",23.429373
5,Solar power in the United Arab Emirates,Abu Dhabi & Dubai,"all rooftop panels, it was found easier to clean the sand off ground mounted panels at a single location. Dubai The Dubai Clean Energy Strategy aims to provide 7 per cent of Dubai’s energy from clean energy sources by 2020. It will increase this target to 25 per cent by 2030 and 75 per cent by 2050. Due to a variety of factors, a Saudi-backed consortium had a low bid to build the solar farm in Dubai for only 3¢/kWh. The first phase of the proposed 1,000 MW Mohammed bin Rashid Al Maktoum Solar Park, in Seih Al-Dahal, about 50 kilometers",23.358679
6,Desert,Solar energy capture,"for solar energy, partly due to low amounts of cloud cover. Many solar power plants have been built in the Mojave Desert such as the Solar Energy Generating Systems and Ivanpah Solar Power Facility. Large swaths of this desert are covered in mirrors. The potential for generating solar energy from the Sahara Desert is huge, the highest found on the globe. Professor David Faiman of Ben-Gurion University has stated that the technology now exists to supply all of the world's electricity needs from 10% of the Sahara Desert. Desertec Industrial Initiative was a consortium seeking $560 billion to invest in North",23.26635
7,Renewable energy in Pakistan,Solar power,"district) (now that system is unserviceable) and Dittal Khan Laghari, Digri (Mirpurkhas district).The Punjab government announced the establishment of Quaid-e-Azam Solar Park over an area of 5,000 acres in the Cholistan Development Authority in Bahawalpur. A practical example of the use of solar energy can be seen in some rural villages of Pakistan where houses have been provided with solar panels that run electric fans and energy-saving bulbs. One notable and successfully implemented case was the village of Narian Khorian (about 50 kilometers from Islamabad), which employs the use of 100 solar panels installed by a local firm, free of cost;",23.240244
8,Concentrated solar power,Very large scale solar power plants & Suitable sites,"solar power plants using 1% of each of the world's deserts. Total consumption worldwide was 15,223 TWh/year (in 2003). The gigawatt size projects would have been arrays of standard-sized single plants. In 2012, the BLM made available 97,921,069 acres (39,627,251 hectares) of land in the southwestern United States for solar projects, enough for between 10,000 and 20,000 GW. The largest single plant in operation is the 510 MW Noor Solar Power Station. Suitable sites The locations with highest direct irradiance are dry, at high altitude, and located in the tropics. These locations have a higher potential for CSP than areas with less sun. Abandoned",23.155958
9,Net metering,Related technology & Solar Guerrilla,safe. Solar Guerrilla Solar Guerrilla (or the guerrilla solar movement) is a term originated by Home Power Magazine and is applied to someone who connects solar panels without permission or notification and uses monthly net metering without regard for law.,23.089102


## Answer generation

In [11]:
qa_s2s_tokenizer = AutoTokenizer.from_pretrained('yjernite/bart_eli5', cache_dir='./tokenaizers')
qa_s2s_model = AutoModelForSeq2SeqLM.from_pretrained('yjernite/bart_eli5', cache_dir='./models').to('cuda:0')
_ = qa_s2s_model.eval()

In [21]:
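def answer_on(questions):
    # Accept a single question string as well as a list of questions;
    # iterating over a bare string would loop over its characters.
    if isinstance(questions, str):
        questions = [questions]
    question_docs = []
    for question in questions: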
        doc, res_list = query_qa_dense_index(
            question, qar_model, qar_tokenizer,
            wiki40b_snippets, wiki40b_index_flat, device='cuda:0'
        )
        # concatenate question and support document into BART input
        question_docs.append("question: {} context: {}".format(question, doc))
    
    # generate an answer with beam search

    num_answers=10
    num_beams=8
    min_len=64
    max_len=256
    max_input_length=1024
    do_sample=False
    temp=1.0
    top_p=None
    top_k=None
    device="cuda:0"
    
    model_inputs = make_qa_s2s_batch([(doc, "A") for doc in question_docs], qa_s2s_tokenizer, max_input_length, device=device,)
    
    n_beams = num_answers if num_beams is None else max(num_beams, num_answers)
    generated_ids = qa_s2s_model.generate(
        input_ids=model_inputs["input_ids"],
        attention_mask=model_inputs["attention_mask"],
        min_length=min_len,
        max_length=max_len,
        do_sample=do_sample,
        early_stopping=True,
        num_beams=1 if do_sample else n_beams,
        temperature=temp,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=qa_s2s_tokenizer.eos_token_id,
        no_repeat_ngram_size=3,
        num_return_sequences=num_answers,
        decoder_start_token_id=qa_s2s_tokenizer.bos_token_id,
    )
    # batch_decode returns strings; strip surrounding whitespace from each answer
    decoded = qa_s2s_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return [ans.strip() for ans in decoded]
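
Because `generate` is called with `num_return_sequences=num_answers`, the decoded list is flat: it holds `num_answers` candidates for the first question, then `num_answers` for the second, and so on. A minimal sketch of regrouping that list per question follows; `group_answers` is a hypothetical helper, not part of `lfqa_utils`.

```python
def group_answers(flat_answers, num_questions, num_answers):
    """Split a flat list of generated answers into one sub-list per question."""
    expected = num_questions * num_answers
    if len(flat_answers) != expected:
        raise ValueError(
            "expected {} answers, got {}".format(expected, len(flat_answers))
        )
    return [
        flat_answers[i * num_answers:(i + 1) * num_answers]
        for i in range(num_questions)
    ]


# Hypothetical example: 2 questions with 3 candidate answers each
flat = ['q0-a0', 'q0-a1', 'q0-a2', 'q1-a0', 'q1-a1', 'q1-a2']
grouped = group_answers(flat, num_questions=2, num_answers=3)
print(grouped[1][0])  # first candidate for the second question
```

With this grouping, `grouped[i][0]` is the first candidate for question `i`, whereas index `0` of the flat list only ever refers to the first question.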


In [25]:
eli5['test_eli5'][12342:12345]['title']

['What is stopping us from covering large sections of the desert with solar panels?',
 'Why is tuna never a sustainable dinner option? What makes even farmed tuna never sustainable?',
 'How do elevator logistics work? That is: What happens to an elevator car after it deposits somebody at their floor? Does it stay at Floor N until called to another, or immediately go back down to Floor 1?']

In [26]:
questions = eli5['test_eli5'][12342:12345]['title']
answer = answer_on(questions)
print(questions[0])
print(len(answer), answer[0])

RuntimeError: CUDA out of memory. Tried to allocate 120.00 MiB (GPU 0; 5.93 GiB total capacity; 3.46 GiB already allocated; 215.12 MiB free; 3.63 GiB reserved in total by PyTorch)
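
One way to lower peak GPU memory, assuming the error comes from generating for all questions in a single batch, is to feed the questions through in smaller chunks. The sketch below is untested against this GPU; `in_batches` and `answer_in_batches` are hypothetical helpers, and `answer_fn` stands for any function that maps a list of questions to a list of answers. Reducing `num_beams` or `num_answers` is another option.

```python
def in_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def answer_in_batches(questions, answer_fn, batch_size=1):
    """Call `answer_fn` on small chunks of questions and concatenate the results."""
    answers = []
    for batch in in_batches(questions, batch_size):
        answers.extend(answer_fn(batch))
    return answers


# Hypothetical stand-in for a real answer function, used only to show the call shape
def fake_answer_fn(batch):
    return ['answer to: ' + q for q in batch]


demo_answers = answer_in_batches(['q1', 'q2', 'q3'], fake_answer_fn, batch_size=2)
print(demo_answers)
```

In the notebook, the call would look like `answer_in_batches(questions, answer_on, batch_size=1)`.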

In [11]:
questions = []
answers = []

for i in [12342] + list(range(4)):
    # create support document with the dense index
    question = eli5['test_eli5'][i]['title']
    answer = answer_on(question)[0]
    questions += [question]
    answers += [answer]

df = pd.DataFrame({
    'Question': questions,
    'Answer': answers,
})
df.style.set_properties(**{'text-align': 'left'})

,Question,Answer
0,What is stopping us from covering large sections of the desert with solar panels?,"Nothing is stopping us from covering large sections of the desert with solar panels. The problem is that it takes a lot of money to do so, and it's very expensive to do it on a large scale. There's also the issue of how much water it takes to power the panels, and how much of that water is actually used."
1,Why do you get chills/goosebumps from hearing large crowds sing along to songs?,"It's called frisson, and it's caused by the release of a chemical called dopamine, which makes you feel good. URL_0 > Frisson is a pleasure experience that causes changes in your heart rate and goose bumps. Frisson may be associated with music as a prerequisite. It has been shown that some people experiencing musical frisson are more likely to be high on drugs, alcohol, or money. The pleasure experience is driven by the chemical dopamine."
2,"How did studded leather and heavy eye makeup come to be the Hollywood dress code for dystopian, post-apocalyptic societies?","Studded leather and heavy eye makeup have been around for a long time. It's not a new thing. URL_0 It's just Hollywood decided it was a good idea to use it in a post-apocalyptic setting, so it became the standard. The same thing happened in the 80s, 90s, and 00s."
3,"What's the difference between a bush, a shrub, and a tree?","A tree is a living thing. A shrub is a type of plant. A bush is a kind of plant that grows in the ground. A tree has a trunk, a shrub has a root system, and a bush has a stem. It's a bit of a misnomer to say that a ""tree"" is a ""branch"" of a ""bush""."
4,Why is it hard to breathe with a strong air gust blowing straight at your face?,"It's not hard to breathe with a strong air gust blowing straight at your face. It's hard to breath when the wind is blowing in the opposite direction. The wind is pushing air away from your face, and the air you're trying to inhale is trying to go the other way. So you're not getting enough air into your lungs."


In [33]:
def print_answer(question):
    answer = answer_on(question)[0]
    print('Question:', question, '\nAnswer:', answer)

In [34]:
print_answer('Why sky is blue?')

Question: Why sky is blue? 
Answer: The sky is blue because the air is mostly made up of water droplets suspended in the air. Water droplets absorb blue light, which makes the sky look blue. It's the same reason the sky looks blue when you look at it through a telescope. The blue light from the sun is scattered by the atmosphere, so the sky appears blue.


In [35]:
print_answer('Why so hard to generate ideas?')

Question: Why so hard to generate ideas? 
Answer: It's not hard to generate ideas, it's just hard to come up with the *idea* for the idea. It's like trying to think of an idea when you have no idea what you're going to do with it. You can think of it, but you don't know how to do it.


In [36]:
print_answer('Why we feels bad, when long time not sleep?')

Question: Why we feels bad, when long time not sleep? 
Answer: When you don't sleep, your body releases a chemical called melatonin, which makes you feel sleepy. When you do sleep, it takes a while for melatonin to build up in your brain, so you feel groggy when you wake up and don't feel rested when you go back to sleep. This is why it's important to get a good night's sleep.
