# AIR - Exercise in Google Colab

## Colab Preparation

Open the notebook via Google Drive -> right-click the file: Open with Colab

**Get a GPU**

Toolbar -> Runtime -> Change Runtime Type -> GPU
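
To confirm that the runtime actually received a GPU, a quick check like the one below can be run in a cell. This is only a minimal sketch: `pick_device` is a hypothetical helper name, and the import guard is an assumption added so the snippet also runs where PyTorch is not installed.

```python
try:
    import torch  # pre-installed on Colab
except ImportError:  # assumption: allow running outside Colab without PyTorch
    torch = None


def pick_device():
    """Return "cuda" if a GPU is visible to PyTorch, otherwise "cpu"."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

If this prints `cpu` after switching the runtime type, reconnecting the runtime is usually the first thing to try.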

**Mount Google Drive**

* Download the data and clone your GitHub repo into your Google Drive folder
* Use Google Drive as the connection between GitHub and Colab (direct GitHub access would also work, but re-submitting credentials might be annoying)
* Commit to GitHub locally from the synced Drive folder

**Keep Alive**

During training, Google Colab tends to disconnect the session. This might help: https://medium.com/@shivamrawat_756/how-to-prevent-google-colab-from-disconnecting-717b88a128c0

**Get Started**

Run the following cells to mount Google Drive and install the needed Python packages. PyTorch comes pre-installed.

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')


Mounted at /content/gdrive


In [None]:
## implement part 2 here

# PLAN
1. Select a pre-trained extractive QA (question answering) model from the model hub
2. Load the model
3. Tokenize the (query, passage) pairs
4. Run inference
5. Store the results obtained with the HuggingFace library: provide >=1 text span that answers a given (query, passage) pair
6. Evaluate on at least the top-1 MSMARCO passage results from the best re-ranking model

Sources:
- Tutorial for Pipeline: https://towardsdatascience.com/question-and-answering-with-bert-6ef89a78dac
- Explanation of the Transformer architecture behind BERT: http://jalammar.github.io/illustrated-transformer/


# Import packages 

In [None]:
!pip install transformers
!pip install datasets
!apt install git-lfs  # needed to push the model to the Hub

import pandas as pd
import numpy as np
import torch

from transformers import BertTokenizer

import datasets
from datasets import Dataset
from datasets import load_metric
from datasets import load_dataset

In [None]:
### 0. Test Data
# Test sets: the data cannot be read in yet (TODO: check core.metrics)
# 1. msmarco-fira-21.qrels.qa-answers.tsv:
# queryid
# documentid
# relevance-grade text-selection (multiple answers possible, split with tab)

# 2. msmarco-fira-21.qrels.qa-tuples.tsv
# queryid
# documentid
# relevance-grade
# query-text
# document-text
# text-selection (multiple answers possible, split with tab)

In [None]:
# Training data
# Input: (query,passage) pairs --> Use msmarco qa tuples?
# Output of model: >=1 text spans answering the pair --> msmarco qa answers
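
Because a pair can have several reference answers, one common way to score a predicted span is token-level F1 against the best-matching reference, similar to the SQuAD evaluation. The sketch below is a simplified version that only lowercases and splits on whitespace, with no punctuation or article stripping. `token_f1` and `best_f1` are hypothetical helper names, not functions from the exercise's `core.metrics`.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between one predicted span and one reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match, exactly one empty as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, references):
    """Score a prediction against the best-matching reference answer."""
    return max(token_f1(prediction, ref) for ref in references)


partial = token_f1("chocolate-flavored shake", "chocolate-flavored shake contains 190")
print(round(partial, 3))
```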

# 1. Read in Data

In [7]:
qa_tuples = pd.read_csv("/content/gdrive/MyDrive/air-20222-group-8-data/msmarco-fira-21.qrels.qa-tuples.tsv", sep = "\t", names=["query_id", "document_id", "relevance_grade", "query_text", "document_text", "NA","text_selection"])

In [9]:
#Drop NA Column
qa_tuples = qa_tuples.drop(columns='NA')

  


In [10]:
qa_tuples

Unnamed: 0,query_id,document_id,relevance_grade,query_text,document_text,text_selection
0,135386,100163,3,definition of imagination,imagination - the formation of a mental image ...,the formation of a mental image of something t...
1,290779,101026,3,how many oscars has clint eastwood won?pdrijgh...,Clint Eastwood -- five-time Oscar winner and e...,five
2,21741,1021598,3,are cold sores and fever blisters the same,"Cold sores, sometimes called fever blisters, a...","Cold sores, sometimes called fever blisters"
3,810210,1029662,3,what is the cause of blood in the stool,Having blood in the stool can be the result of...,"wide variety of conditions, such as hemorrhoid..."
4,1097448,103635,3,how many calories in slim fast shakes,"The chocolate-flavored shake contains 190, whi...",chocolate-flavored shake contains 190
...,...,...,...,...,...,...
52601,525779,4877404,2,twin tower adress,The twin towers were built in the borough of M...,The twin towers were built in the borough of M...
52602,210442,4877731,3,how can i get more energy while pregnant,To compensate for this your body will require ...,To compensate for this your body will require ...
52603,1088928,4878423,3,"vitamins a, d, e, and k are dependent upon","Four important fat-soluble vitamins are A, D, ...","Vitamins A, D, and K cooperate synergistically..."
52604,550565,4881591,3,what age can you wear baby on back in a carrier?,When can I carry my baby in a front pack facin...,As soon as your baby can hold his head up stea...


## Test Dataset: QA Answers

In [11]:
qa_answers = pd.read_csv("/content/gdrive/MyDrive/air-20222-group-8-data/msmarco-fira-21.qrels.qa-answers.tsv", sep = "\t", names=["query_id","document_id", "relevance_grade", "NA", "text_selection"])

In [12]:
qa_answers.head()

Unnamed: 0,query_id,document_id,relevance_grade,NA,text_selection
0,135386,100163,3,,the formation of a mental image of something t...
1,290779,101026,3,,five
2,21741,1021598,3,,"Cold sores, sometimes called fever blisters"
3,810210,1029662,3,,"wide variety of conditions, such as hemorrhoid..."
4,1097448,103635,3,,chocolate-flavored shake contains 190


TODO: How can the full answers dataset be read in here?
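
One possible way to keep every answer is to parse the file line by line instead of forcing a fixed number of columns. This is a minimal sketch, assuming each line holds the query id, document id and relevance grade, followed by one or more tab-separated text selections, and that empty fields can be skipped. `parse_answers_line`, `read_answers`, and the second answer in the sample line are illustrative assumptions.

```python
def parse_answers_line(line):
    """Split one TSV line into the three fixed fields plus a list of answers."""
    fields = line.rstrip("\n").split("\t")
    return {
        "query_id": fields[0],
        "document_id": fields[1],
        "relevance_grade": int(fields[2]),
        # Everything after the third field is an answer; empty fields are dropped
        "answers": [f for f in fields[3:] if f.strip()],
    }


def read_answers(path):
    """Read a whole answers file into a list of records, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [parse_answers_line(line) for line in handle if line.strip()]


# Sample line: the second answer is invented to show the multi-answer case
sample = "21741\t1021598\t3\tCold sores, sometimes called fever blisters\tfever blisters\n"
record = parse_answers_line(sample)
print(record["answers"])
```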

Explanation: https://huggingface.co/course/chapter7/7?fw=pt

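Format we need (SQuAD style):
- context: document text (document_text in qa tuples)
- question: query (query_text in qa tuples)
- answer_start: character offset at which the text selection begins in the document text
- answer_end: answer_start plus the length of the text selection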
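
The mapping above can be sketched as a small conversion into the SQuAD layout (`id`, `context`, `question`, and `answers` with `text` and `answer_start`). It assumes the text selection occurs verbatim in the document text. `to_squad_example` is a hypothetical helper, the `id` scheme is an assumption, and the example passage is made up.

```python
def to_squad_example(query_id, document_id, query_text, document_text, text_selection):
    """Build one SQuAD-style record, or return None if the answer is not found verbatim."""
    start = document_text.find(text_selection)  # character offset, -1 if absent
    if start == -1:
        return None
    return {
        "id": f"{query_id}_{document_id}",  # assumed id scheme
        "context": document_text,
        "question": query_text,
        "answers": {"text": [text_selection], "answer_start": [start]},
    }


# Made-up passage for illustration
example = to_squad_example(
    290779,
    101026,
    "how many oscars has clint eastwood won",
    "Clint Eastwood is a five-time Oscar winner.",
    "five",
)
start = example["answers"]["answer_start"][0]
end = start + len(example["answers"]["text"][0])
print(start, end, example["context"][start:end])
```

Rows for which the helper returns `None` would need separate handling, for example fuzzy matching or dropping them.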

In [None]:
raw_datasets = load_dataset("squad")



In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [None]:
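def prepare_data(row):
    # Compute SQuAD-style character offsets of the answer span for one row.
    answer = str(row["text_selection"])
    # str.find returns -1 if the selection does not occur verbatim in the passage
    row["answer_start"] = str(row["document_text"]).find(answer)
    row["answer_end"] = row["answer_start"] + len(answer) if row["answer_start"] >= 0 else -1
    return row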

In [None]:
#qa_tuples = qa_tuples.apply(prepare_data, axis=1)  # pandas row-wise apply; .rdd.map only exists on Spark DataFrames

# 3. Choose a pre-trained extractive QA (question answering) model from the model hub to use