# Paper Summary

The paper demonstrated that state-of-the-art results could be obtained on downstream tasks such as textual entailment, question answering, semantic similarity assessment, and document classification by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning (with task-specific input transformations) on each specific task.

Stage 1 - Unsupervised pre-training of a language model (transformer decoder) on the BookCorpus dataset

Stage 2 - Supervised fine-tuning of the model from stage 1 (with an added linear layer at the end) on a target task, e.g. entailment, using a labeled dataset

In this setup, I train the exact `124M`-parameter model on the BookCorpus dataset and evaluate it using perplexity; the original paper reported a perplexity of 18.4.
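For reference, perplexity is the exponential of the average per-token negative log-likelihood. The snippet below is a minimal sketch, assuming the per-token log-probabilities (natural log) have already been collected from the model; `perplexity` and its argument are illustrative names, not part of this notebook's training code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_log_probs` is assumed to hold natural-log probabilities that the
    model assigned to each target token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Sanity check: a model that gives every token probability 1/4
# is as uncertain as a uniform choice among 4 tokens.
uniform_log_probs = [math.log(0.25)] * 10
print(perplexity(uniform_log_probs))
```

Lower is better: a perplexity of 18.4 roughly means the model is, on average, as uncertain as if it were choosing uniformly among about 18 tokens at each step.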

Then I fine-tune on the ROCStories dataset (five-sentence short stories for evaluating commonsense reasoning, i.e. causal and temporal commonsense reasoning) and evaluate on the Story Cloze Test dataset [the task here is `Commonsense reasoning`].

# Imports

In [None]:
import os
import pandas as pd

# Get Dataset

In [6]:

bookcorpus_dataset_path = "datasets/bookcorpus"
rocstories_dataset_path = "datasets/finetune/rocstories"
clozetest_dataset_path = "datasets/finetune/clozetest"
model_path = "models/v0"
os.makedirs(bookcorpus_dataset_path, exist_ok=True)
os.makedirs(model_path, exist_ok=True)

In [None]:
# !wget https://storage.googleapis.com/huggingface-nlp/datasets/bookcorpus/bookcorpus.tar.bz2 -O datasets/bookcorpus/bookcorpus.tar.bz2
# !tar -xf datasets/bookcorpus/bookcorpus.tar.bz2 -C datasets/bookcorpus/

In [None]:
with open(f"{bookcorpus_dataset_path}/books_large_p1.txt") as f:
    lines1 = f.readlines()
with open(f"{bookcorpus_dataset_path}/books_large_p2.txt") as f:
    lines2 = f.readlines()
bookcorpus = lines1 + lines2
print(f"{len(bookcorpus):,} lines")
bookcorpus[:10]

74,004,228 lines


['the half-ling book one in the fall of igneeria series kaylee soderburg copyright 2013 kaylee soderburg all rights reserved .\n',
 'isbn : 1492913731 isbn-13 : 978-1492913733 for my family , who encouraged me to never stop fighting for my dreams chapter 1 summer vacations supposed to be fun , right ?\n',
 'i wish i had a better answer to that question .\n',
 'starlings , new york is not the place youd expect much to happen .\n',
 'its a small quiet town , the kind where everyone knows your name .\n',
 'its a place where your parents wouldnt even care if you stayed out late biking with your friends .\n',
 'only because everyone felt so safe , so comfy .\n',
 'they dont know the half of it .\n',
 'but i do .\n',
 'i know it all and starlings is not the place where you want to be after dark .\n']

## Finetuning data

The dataset was obtained via a request to [rochester.edu](https://cs.rochester.edu/nlp/rocstories/)

In [22]:
rocstories_spring_2016 = pd.read_csv(f"{rocstories_dataset_path}/ROCStories__spring2016 - ROCStories_spring2016.csv")
clozetest_spring_2016_test = pd.read_csv(f"{clozetest_dataset_path}/cloze_test_test__spring2016 - cloze_test_ALL_test.csv")
clozetest_spring_2016_val = pd.read_csv(f"{clozetest_dataset_path}/cloze_test_val__spring2016 - cloze_test_ALL_val.csv")

rocstories_spring_2016.head()

Unnamed: 0,storyid,storytitle,sentence1,sentence2,sentence3,sentence4,sentence5
0,9a51198e-96f1-42c3-b09d-a3e1e067d803,Overweight Kid,Dan's parents were overweight.,Dan was overweight as well.,The doctors told his parents it was unhealthy.,His parents understood and decided to make a c...,They got themselves and Dan on a diet.
1,617e7ada-3878-488d-bd56-40695b91f053,The Bike Accident,Carrie had just learned how to ride a bike.,She didn't have a bike of her own.,Carrie would sneak rides on her sister's bike.,She got nervous on a hill and crashed into a w...,The bike frame bent and Carrie got a deep gash...
2,79b0da1f-e460-4173-ba58-8c9e2553c53a,Beach,Morgan enjoyed long walks on the beach.,She and her boyfriend decided to go for a long...,"After walking for over a mile, something happe...",Morgan decided to propose to her boyfriend.,Her boyfriend was upset he didn't propose to h...
3,d173b7de-4611-4cdf-934c-912834755e41,The bad customer.,Jane was working at a diner.,"Suddenly, a customer barged up to the counter.",He began yelling about how long his food was t...,Jane didn't know how to react.,"Luckily, her coworker intervened and calmed th..."
4,af0fd5a4-de36-47ba-8aa2-e99d10986d7a,Being Patient,I was talking to my crush today.,She continued to complain about guys flirting ...,I decided to agree with what she says and list...,"After I got home, I got a text from her.",She asked if we can hang out tomorrow.


In [23]:
clozetest_spring_2016_val.head()

Unnamed: 0,InputStoryid,InputSentence1,InputSentence2,InputSentence3,InputSentence4,RandomFifthSentenceQuiz1,RandomFifthSentenceQuiz2,AnswerRightEnding
0,138d5bfb-05cc-41e3-bf2c-fa85ebad14e2,Rick grew up in a troubled household.,"He never found good support in family, and tur...",It wasn't long before Rick got shot in a robbery.,The incident caused him to turn a new leaf.,He is happy now.,He joined a gang.,1
1,bff9f820-9605-4875-b9af-fe6f14d04256,Laverne needs to prepare something for her fri...,She decides to bake a batch of brownies.,She chooses a recipe and follows it closely.,Laverne tests one of the brownies to make sure...,The brownies are so delicious Laverne eats two...,Laverne doesn't go to her friend's party.,1
2,e8f628d5-9f97-40ed-8611-fc0e774673c4,Sarah had been dreaming of visiting Europe for...,She had finally saved enough for the trip.,She landed in Spain and traveled east across t...,She didn't like how different everything was.,Sarah then decided to move to Europe.,Sarah decided that she preferred her home over...,2
3,f5226bfe-9f26-4377-b05f-3d9568dbdec1,Gina was worried the cookie dough in the tube ...,She was very happy to find she was wrong.,The cookies from the tube were as good as from...,Gina intended to only eat 2 cookies and save t...,Gina liked the cookies so much she ate them al...,Gina gave the cookies away at her church.,1
4,69ac9b05-b956-402f-9fff-1f926ef9176b,It was my final performance in marching band.,I was playing the snare drum in the band.,We played Thriller and Radar Love.,The performance was flawless.,I was very proud of my performance.,I was very ashamed of my performance.,1
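Each cloze row can be turned into two candidate sequences (the four context sentences plus one ending each), in the spirit of the task-specific input transformations mentioned in the summary. This is a minimal sketch and not necessarily how the rest of this notebook does it: the special-token strings `<start>`, `<delim>`, `<extract>` and the helper `build_cloze_example` are assumptions for illustration; only the column names come from the table above.

```python
# Hypothetical special tokens; the real strings depend on the tokenizer used.
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

CONTEXT_COLS = [f"InputSentence{i}" for i in range(1, 5)]
ENDING_COLS = ["RandomFifthSentenceQuiz1", "RandomFifthSentenceQuiz2"]


def build_cloze_example(row):
    """Return (candidate_texts, label) for one Story Cloze row.

    `row` is any mapping keyed by the CSV column names (a dict or a
    pandas Series). The returned label is 0-based, whereas
    AnswerRightEnding in the CSV is 1-based.
    """
    context = " ".join(str(row[col]) for col in CONTEXT_COLS)
    candidates = [
        f"{START} {context} {DELIM} {row[col]} {EXTRACT}" for col in ENDING_COLS
    ]
    label = int(row["AnswerRightEnding"]) - 1
    return candidates, label


# Row 4 from the validation preview above, written out as a plain dict.
example_row = {
    "InputSentence1": "It was my final performance in marching band.",
    "InputSentence2": "I was playing the snare drum in the band.",
    "InputSentence3": "We played Thriller and Radar Love.",
    "InputSentence4": "The performance was flawless.",
    "RandomFifthSentenceQuiz1": "I was very proud of my performance.",
    "RandomFifthSentenceQuiz2": "I was very ashamed of my performance.",
    "AnswerRightEnding": 1,
}

candidates, label = build_cloze_example(example_row)
for text in candidates:
    print(text)
print("correct ending index:", label)
```

With the DataFrame loaded above, the same helper could be applied per row, for example `[build_cloze_example(r) for _, r in clozetest_spring_2016_val.iterrows()]`. The model would then score both candidates and the higher-scoring one is taken as the predicted ending.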
