# Datasets

Main objective:
- (1) finetune gpt2 model to generate text based on keywords
- (2) text should be as close to question answering as possible
- (3) Ideally generated text should include a question inside some tags

Example:
- Real
    - Q: How do muscles grow? --> A: "I hope this answer qualifies as technical, yet simple enough (as I very rarely post here), but the basic idea that I understand is that your muscles rip and tear on a microscopic level when you are working out \[...\] by larger muscle fibers."

- Generation
    - <|startoftext|>~^muscles^growth~@How do muscles grow? --> "I hope this answer \[...\] by larger muscle fibers.".

So given the two keywords "muscle" and "growth", it generates a question and the corresponding answer. This is because for the intended newsletter format, question-answer-pairs are needed.

---

# 1. Model tuning

Reference:
- https://hyunjoonlee70.github.io/Blog_Post_3/
- https://github.com/mallorbc/GPT_Neo_quotes_dataset/blob/main/quotes_dataset.py

Potential Datasets:
- https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate
- https://github.com/huggingface/datasets/tree/master/datasets/eli5

In [3]:
pip install datasets

Collecting datasets
  Downloading datasets-1.17.0-py3-none-any.whl (306 kB)
[K     |████████████████████████████████| 306 kB 4.9 MB/s eta 0:00:01
[?25hCollecting multiprocess
  Downloading multiprocess-0.70.12.2-py38-none-any.whl (128 kB)
[K     |████████████████████████████████| 128 kB 8.0 MB/s eta 0:00:01
[?25hCollecting dill
  Downloading dill-0.3.4-py2.py3-none-any.whl (86 kB)
[K     |████████████████████████████████| 86 kB 6.3 MB/s eta 0:00:01
[?25hCollecting pyarrow!=4.0.0,>=3.0.0
  Downloading pyarrow-6.0.1-cp38-cp38-macosx_10_13_x86_64.whl (19.1 MB)
[K     |████████████████████████████████| 19.1 MB 681 kB/s eta 0:00:01
[?25hCollecting tqdm>=4.62.1
  Using cached tqdm-4.62.3-py2.py3-none-any.whl (76 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp38-cp38-macosx_10_9_x86_64.whl (574 kB)
[K     |████████████████████████████████| 574 kB 6.9 MB/s eta 0:00:01
[?25hCollecting xxhash
  Downloading xxhash-2.0.2-cp38-cp38-macosx_10_9_x86_64.whl (31 kB)
Collecting fsspec[ht

In [45]:
from datasets import list_datasets, load_dataset
import pandas as pd
import numpy as np

In [5]:
eli5_dataset = load_dataset('eli5')

Downloading:   0%|          | 0.00/5.63k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

Downloading and preparing dataset eli5/LFQA_reddit (download: 6.03 MiB, generated: 1.26 GiB, post-processed: Unknown size, total: 1.26 GiB) to /Users/ayman/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa...


Downloading:   0%|          | 0.00/3.50k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/576M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/21.1M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/286M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/9.65M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/17.7M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/330M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/18.7M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/36.2M [00:00<?, ?B/s]

Dataset eli5 downloaded and prepared to /Users/ayman/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa. Subsequent calls will reuse this data.


  0%|          | 0/9 [00:00<?, ?it/s]

In [32]:
# Provides multiple answers and their scoring
# Stored in descending order, only first (best) answer will be considered
eli5_dataset["train_eli5"][0]

{'q_id': '1oy5tc',
 'title': 'in football whats the point of wasting the first two plays with a rush - up the middle - not regular rush plays i get those',
 'selftext': '',
 'document': '',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['ccwtgnz', 'ccwtmho', 'ccwt946', 'ccwvj0u'],
  'text': ["Keep the defense honest, get a feel for the pass rush, open up the passing game. An offense that's too one dimensional will fail. And those rushes up the middle can be busted wide open sometimes for big yardage.",
   "If you throw the ball all the time, then the defense will adapt to always cover for a pass.  By doing a simple running play every now and then, you force the defense to stay close and guard against the run.  Sometimes, the offense can catch the defense off guard by faking a run and freeing up their receivers.\n\nAlso, you don't have to gain massive yards on every single play.  Sometimes, it works best to gain a few yards at a time.  As long as you get the first down, you ar

In [36]:
eli5_dataset.shape

{'train_eli5': (272634, 9),
 'validation_eli5': (9812, 9),
 'test_eli5': (24512, 9),
 'train_asks': (131778, 9),
 'validation_asks': (2281, 9),
 'test_asks': (4462, 9),
 'train_askh': (98525, 9),
 'validation_askh': (4901, 9),
 'test_askh': (9764, 9)}

In [47]:
dataset = eli5_dataset["test_eli5"] #has only 25000 entries, therefor faster model training
dataframe = pd.DataFrame(data=dataset)

In [49]:
dataframe = dataframe.drop(columns=["q_id", "selftext", "document", "subreddit", "title_urls", "selftext_urls", "answers_urls"])

In [74]:
# filtering the answers
for i in dataframe.index:
    dataframe.answers[i] = dataframe.answers[i]["text"][0]

In [101]:
# merging question and answers to one
dataframe["full_text"] = dataframe["title"] + " " + dataframe["answers"]

In [102]:
dataframe

Unnamed: 0,title,answers,full_text
0,Why do you get chills/goosebumps from hearing ...,"I think it's because, at that moment, it's bas...",Why do you get chills/goosebumps from hearing ...
1,How did studded leather and heavy eye makeup c...,I like to think that leather clothing is rathe...,How did studded leather and heavy eye makeup c...
2,"What's the difference between a bush, a shrub,...",Shrubs and trees are both specifically *woody*...,"What's the difference between a bush, a shrub,..."
3,Why is it hard to breathe with a strong air gu...,Moving air = lower pressure. The greater the d...,Why is it hard to breathe with a strong air gu...
4,how having hereditary cancer genes doesn’t nec...,"It's kind of like a ""3 strikes and you're out""...",how having hereditary cancer genes doesn’t nec...
...,...,...,...
24507,Why game companies doesn't release the source ...,Because it's probably not just their code.\n\n...,Why game companies doesn't release the source ...
24508,Why do dogs and cats lift their paws really hi...,They feel as though they can lift their feet o...,Why do dogs and cats lift their paws really hi...
24509,"How are bugs, in software, ""fixed""?",The first step in fixing a software bug is try...,"How are bugs, in software, ""fixed""? The first ..."
24510,What's the point of a passing lane on the high...,"The speed limit tells you how fast you can go,...",What's the point of a passing lane on the high...


## Keyword extraction

In [107]:
pip install yake

Collecting yake
  Downloading yake-0.4.8-py2.py3-none-any.whl (60 kB)
[K     |████████████████████████████████| 60 kB 4.4 MB/s eta 0:00:011
[?25hCollecting jellyfish
  Downloading jellyfish-0.8.9-cp38-cp38-macosx_10_14_x86_64.whl (24 kB)
Collecting segtok
  Downloading segtok-1.5.11-py3-none-any.whl (24 kB)
Collecting tabulate
  Downloading tabulate-0.8.9-py3-none-any.whl (25 kB)
Installing collected packages: jellyfish, segtok, tabulate, yake
Successfully installed jellyfish-0.8.9 segtok-1.5.11 tabulate-0.8.9 yake-0.4.8
Note: you may need to restart the kernel to use updated packages.


In [178]:
import yake
import sys

In [179]:
def keywords_yake(text, language = "en", max_ngram_size = 2, numOfKeywords = 1):

    custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, top=numOfKeywords)
    keywords = custom_kw_extractor.extract_keywords(text)
    return keywords

In [180]:
for i in dataframe.index:
    dataframe["keyword"][i] = keywords_yake(dataframe.full_text[i])[0][0]
    sys.stdout.write("\rExtracting keyword: %i" % i)
    sys.stdout.flush()

Doing thing 24511

In [202]:
sample_df = dataframe[:100]

# Model

In [191]:
from transformers import pipeline, set_seed, GPT2LMHeadModel,  GPT2Tokenizer, GPT2Config, GPT2LMHeadModel

In [197]:
special_tokens  = { "bos_token": "<|BOS|>",
                    "eos_token": "<|EOS|>",
                    "unk_token": "<|UNK|>",                    
                    "pad_token": "<|PAD|>",
                    "sep_token": "<|SEP|>"}

In [185]:
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Hello, I'm a language model,", max_length=30)

In [198]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2').add_special_tokens(special_tokens)

In [203]:
sample_df

Unnamed: 0,title,answers,full_text,keyword
0,Why do you get chills/goosebumps from hearing ...,"I think it's because, at that moment, it's bas...",Why do you get chills/goosebumps from hearing ...,crowds sing
1,How did studded leather and heavy eye makeup c...,I like to think that leather clothing is rathe...,How did studded leather and heavy eye makeup c...,Hollywood dress
2,"What's the difference between a bush, a shrub,...",Shrubs and trees are both specifically *woody*...,"What's the difference between a bush, a shrub,...",stems
3,Why is it hard to breathe with a strong air gu...,Moving air = lower pressure. The greater the d...,Why is it hard to breathe with a strong air gu...,gust blowing
4,how having hereditary cancer genes doesn’t nec...,"It's kind of like a ""3 strikes and you're out""...",how having hereditary cancer genes doesn’t nec...,n’t necessarily
...,...,...,...,...
95,How do normal 3D glasses differ from IMAX 3D g...,3d glasses are meant to make your eyes see sli...,How do normal 3D glasses differ from IMAX 3D g...,glasses
96,What would happen if the umbilical cord was ne...,It shrivels up and falls off on it's own event...,What would happen if the umbilical cord was ne...,umbilical cord
97,Why do so many people consider older instrumen...,"First of all, there is very little about the ""...",Why do so many people consider older instrumen...,newly made
98,We've been on the verge of running out of IP a...,[This guy says it well.](_URL_0_)\n\n > Does...,We've been on the verge of running out of IP a...,URL


In [207]:
input_ids = []
attn_masks = []
max_length = 768

for i in sample_df.index:

  encodings_dict = tokenizer('<|BOS|>'+ sample_df.keyword[i] + '<|SEP|>' + sample_df.full_text[i] + '<|EOS|>', truncation=True, max_length=max_length, padding="max_length")

  input_ids.append(torch.tensor(encodings_dict['input_ids']))
  attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

TypeError: 'int' object is not callable