In [1]:
from fastai.text import *
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

**TA Note:** This is the revised version of this assignment made to work with `fastai` v2.

# Language models

A language model is an algorithm that takes a sequence of words and outputs the likely next word in the sequence. Most language models output a list of words, each with its probability of occurrence. For example, if we had a sentence that started `I would like to eat a hot`, then ideally the algorithm would predict that the word `dog` had a much higher chance of being the next word than the word `meeting`.
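
To make "a list of words, each with its probability" concrete, here is a toy sketch that just counts which word follows which in a tiny made-up corpus. The corpus and the `next_word_probs` helper are assumptions for illustration only; the model we train later is a neural network, not a word counter.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus (an assumption for illustration, not our dataset).
corpus = [
    "i would like to eat a hot dog",
    "i would like to eat a hot dog now",
    "i would like to drink a hot coffee",
    "we have a hot meeting today",
]

# Count which word follows each word.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        next_word_counts[current][following] += 1

def next_word_probs(word):
    """Return {next_word: probability} for the words seen after `word`."""
    counts = next_word_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

probs = next_word_probs("hot")
print(probs)
```

In this toy corpus `dog` follows `hot` more often than `meeting` does, so it gets the higher probability. A real language model looks at the whole preceding sequence, not just the last word.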

Language models are a very powerful building block in natural language processing. They are used for classifying text (e.g. is this review positive or negative?), for answering questions based on text (e.g. "what is the capital of Finland?" based on the Wikipedia page on Finland), and language translation (e.g. English to Japanese).

## The intuition behind why language models are so broadly useful
How can this simple-sounding algorithm be that broadly useful? Intuitively, this is because predicting the next word in a sentence requires a lot of information, not just about grammar and syntax, but also about semantics: what things mean in the real world. For instance, we know that `I would like to eat a hot dog` is semantically reasonable, but `I would like to eat a hot cat` is nonsensical.

# Step 1: Load all the data 
In this example, we are going to use a dataset of headlines from [the Onion](https://www.theonion.com), as well as some non-sarcastic news sources. I found this dataset on [Kaggle](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection).

The JSON file for this dataset is in the directory `./resources`.


In [2]:
from pathlib import Path
data_path = Path('./resources')

The data is in a JSON file, so I am using the pandas `read_json` function. If your data is CSV, use `read_csv` instead.

We use the `lines=True` argument here because the author formatted each line as a separate JSON object. I think at least half of your time as a data scientist/AI researcher is spent dealing with other people's data formats!
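
If "each line is a separate JSON object" is unfamiliar, this small sketch shows the layout using only the standard library. The two records are made up, and parsing line by line is only roughly what `lines=True` does for us, not pandas' actual implementation.

```python
import io
import json

# Two made-up records in "JSON Lines" layout: one complete JSON object per line.
raw = io.StringIO(
    '{"is_sarcastic": 1, "headline": "example satirical headline"}\n'
    '{"is_sarcastic": 0, "headline": "example real headline"}\n'
)

# The text as a whole is not a single JSON value, so we parse it one line at a time.
records = [json.loads(line) for line in raw if line.strip()]
print(len(records), records[0]["headline"])
```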


In [3]:
data = pd.read_json(data_path/'Sarcasm_Headlines_Dataset_v2.json', lines=True)

In [4]:
data.head()

| | is_sarcastic | headline | article_link |
|---|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... | https://www.theonion.com/thirtysomething-scien... |
| 1 | 0 | dem rep. totally nails why congress is falling... | https://www.huffingtonpost.com/entry/donna-edw... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes | https://www.huffingtonpost.com/entry/eat-your-... |
| 3 | 1 | inclement weather prevents liar from getting t... | https://local.theonion.com/inclement-weather-p... |
| 4 | 1 | mother comes pretty close to using word 'strea... | https://www.theonion.com/mother-comes-pretty-c... |


As you can see, some of this dataset is sarcastic (drawn from The Onion), while the rest is drawn from places like the Huffington Post that publish real news, not satire. 

## Step 1a: Examine the data set (5 points)

Before we go off adventuring, let's first see what this dataset looks like. 

### Q: How large is this dataset? Is it balanced? (1 point)

In [5]:
# Insert code here to check size of dataset, and how many are positive (is_sarcastic = 1) and how many negative?

### Q: How long on average is each headline? (4 points)
Longer text = more information. We want to see what the length of the headline is in order to see how much information it may have. 

In [6]:
# Insert code here to find the average length of headline (in words)
## Hint: see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.count.html 
# the '\s' regex matches whitespace characters (spaces, tabs, etc.).

# Step 2: Build a language model that knows how to write news headlines

This is the first step of our project that will be using a machine learning model. 

We are going to use the [fast.ai](https://fast.ai/) library to create this model. If you need help with understanding this section, look at the fast.ai documentation.

**TA Note**: If the import fails, ensure you install `fastai` with either `pip`, `conda`, etc.

**TA Note**: If the assertion is triggered, you don't have `fastai` v2!

In [7]:
import fastai
from fastai.text.all import * 

print(fastai.__version__)
assert fastai.__version__.split('.')[0] == '2'

2.3.0


In [8]:
dls_lm = TextDataLoaders.from_df(
    data, 
    path=data_path,
    text_col='headline',
    valid_pct=0,
    is_lm=True
)

## So here is what happened above. 

First, we tell `fastai` that we want to work with text data (headlines in our case), that are stored in the dataframe `data` (that's the `TextDataLoaders.from_df` part). We also pass in our data path, so after we process our data, we can store it at that location. Finally, we tell it where to look for the headline in the dataframe (which column to use, `text_col=`). 

Then there are two other important parts. We'll take it from the end. A `DataLoaders` object is a `fastai` v2 convenience. It keeps all your training, validation and test data together.

But what kind of validation data do we need for a language model? Remember that a language model predicts the next word in an input sequence of words. So, we don't want to just take some of the headlines and set them aside as a validation set. Instead, we want to use _all_ the sentences and validate whether we can guess the right next word some fraction of the time. 


So, in the keyword arguments to `from_df`: 

- We pass `valid_pct=0` so that 0% of the data is split out into a validation set. 
- We pass `is_lm=True` so that the loader labels the "next word" as the label for each sequence of words.
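
Here is a small sketch of what "the next word is the label" means. The token list is made up and tokenized by hand; as far as I understand it, fastai actually joins all the texts into one long stream and cuts it into fixed-length chunks, but the shift-by-one idea is the same.

```python
# A made-up headline, tokenized by hand. fastai's tokenizer adds special
# tokens such as xxbos (beginning of text), which we imitate here.
tokens = ["xxbos", "students", "protest", "new", "cafeteria", "menu"]

# For every position, the words seen so far are the input
# and the word that comes next is the label.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for seen, label in pairs:
    print(" ".join(seen), "->", label)
```

Notice that no human labeling is needed: every headline supplies its own labels, which is why we can use all of the data.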


**TA Note:** If you're coming to this version of the assignment from the original, `DataLoaders` replaces `DataBunch`. However, we don't need to save it like we did there as we will not be loading it again later.

## Step 2a: Learn the model

Now that we have the data, it's time to train the model.

Now, we *could* learn a language model from scratch. But we're instead going to "cheat" and use _transfer learning_. We're going to use a pretrained language model, and fine-tune it for our purpose. Specifically, we're going to use a model trained on the `WikiText-103` corpus.

One way to understand it is to think of our pre-trained model as a model that can predict the next word in a Wikipedia article. We want to train it to write headlines instead. Since headlines still have to sound like English, i.e. follow grammar and syntax, be generally plausible, etc., being able to predict the next word in Wikipedia is super useful. It allows us to start with a model that already knows some English, and then train it just for writing headlines.


In [9]:
learn = language_model_learner(
    dls_lm, 
    arch=AWD_LSTM, 
    drop_mult=0.5,
    path=data_path,
).to_fp16()

Let's train it.

**TA Note**: I have a GPU-accelerated setup, which I've used to train the provided model for several extra (4) epochs. If you're training your own model, you should probably reduce this to 1 epoch, which will take about 20 minutes without magic NVIDIA CUDA sauce (which reduces this to ~15 seconds). Training for 4 epochs might take over an hour on your own machine. Google Colab does provide GPU acceleration though.

In [61]:
learn.fine_tune(4, base_lr=1e-2)

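| epoch | train_loss | valid_loss | time |
|---|---|---|---|
| 0 | 4.478631 | | 00:16 |

| epoch | train_loss | valid_loss | time |
|---|---|---|---|
| 0 | 4.301938 | | 00:16 |
| 1 | 4.265944 | | 00:16 |
| 2 | 4.038555 | | 00:16 |
| 3 | 3.813424 | | 00:17 |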


**TA Note**: The `fine_tune` method provided by `fastai` v2 is equivalent to running `fit_one_cycle`, `unfreeze` and then `fit_one_cycle` again (which is what the original version of the assignment did), but with some sensible defaults. If this were a machine learning course we'd go into more detail about what's happening under the hood, but it isn't and we won't.
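
If you're curious anyway, here is a rough sketch of that sequence, which is also why the training output shows two tables (one frozen epoch, then four more). `RecordingLearner` and `fine_tune_sketch` are hypothetical names made up for this example, and the learning-rate details are my reading of `fastai` v2 and may differ between versions, so treat it as an approximation rather than the library's code.

```python
class RecordingLearner:
    """Stand-in for a fastai Learner that only logs which methods were called."""
    def __init__(self):
        self.calls = []

    def freeze(self):
        self.calls.append("freeze")

    def unfreeze(self):
        self.calls.append("unfreeze")

    def fit_one_cycle(self, n_epoch, lr_max):
        self.calls.append(f"fit_one_cycle({n_epoch})")


def fine_tune_sketch(learn, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100):
    """Approximate shape of fine_tune: train the last layers, then everything."""
    learn.freeze()                                   # only the last layer group trains
    learn.fit_one_cycle(freeze_epochs, slice(base_lr))
    base_lr /= 2                                     # assumed: smaller rate for stage two
    learn.unfreeze()                                 # now every layer can change
    learn.fit_one_cycle(epochs, slice(base_lr / lr_mult, base_lr))


demo = RecordingLearner()
fine_tune_sketch(demo, 4, base_lr=1e-2)
print(demo.calls)
```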

Once trained, it's time to write some headlines! We give it a starting sequence `Students protest ` and see what it comes up with. 

In [15]:
learn.predict("Students protest ", n_words=5, no_unk=True)

'Students protest democratic debates on kim kardashian'

Pretty good, huh? 

In [73]:
learn.predict('The Fed is expected to', n_words=5, no_unk=True)

'The Fed is expected to make example of national debt'

Well, maybe not perfect but...

OK, now let's save our hard work. We'll use this later. (Pssst: why is it called an encoder? Look at the fastai docs to find out!)

In [11]:
learn.save_encoder('headlines-awd.pkl')

Note that we also want to save the whole model, so we can reuse it in our Twitter bot.


In [12]:
learn.export('headlines-lm.pkl')

## Step 2b: See how well the language model works (15 points)

Try generating a few more headlines. Then, answer the following questions. Wherever possible, show what code you ran, or what predictions you asked it for. *Suggestion: Try using punctuation, numbers, texts of different lengths etc.*

### Q: What is the effect of starting with longer strings? (5 points)

We could start our headline generation with just one word, e.g. `learn.predict('White', n_words=9)` or with many: `learn.predict('White House Says Whistleblower Did', n_words=5)`. What is the difference you see in the kinds of headlines generated?


In [59]:
## Your answer here. Insert more cells if you want to insert code etc.
learn.predict('White', n_words=9)
#learn.predict('White House Says Whistleblower Did', n_words=5)

'White house unable to afford best colored glass drawer'

### Q: What aspects of the task of generating headlines does our language model do well? (5 points)
For example, does it get grammar right? Does it know genders of people or objects? etc.

In [60]:
#Your answer here. Insert more cells if you want to insert code etc.




### Q: What aspects of the task of generating headlines does our model do poorly? (5 points)
What does it frequently get wrong? Why might it make these mistakes?



In [61]:
## Your answer here





# Step 3: Learn a classifier to see which headlines are satire

Remember, our dataset has some stories that are satire (from the Onion) and others that are real. Now, we're going to train a classifier to distinguish one from the other. 

In [15]:
dls_class = TextDataLoaders.from_df(
    data, 
    path=data_path,
    text_col='headline', 
    text_vocab=dls_lm.vocab,
    label_col='is_sarcastic',
    valid_pct=0.2
)

Due to IPython and Windows limitation, python multiprocessing isn't available now.
So `number_workers` is changed to 0 to avoid getting stuck


We're using a similar `TextDataLoaders.from_df` call as we did for our language model above. Here, we are using `valid_pct=0.2` so we keep some fraction (20%) of our dataset as a validation set. There is one other trick: `text_vocab=...` ensures that our classifier only uses words that we have in our language model, so it never deals with words it hasn't encountered before. (Consider: why is this important?)

See if you can work out what the other arguments are. 

In [16]:
dls_class.show_batch()

| | text | category |
|---|---|---|
| 0 | xxbos ' 12 years a slave , ' ' captain xxunk , ' ' american xxunk , ' ' wolf of wall street , ' ' blue xxunk , ' ' dallas buyers club , ' ' her , ' ' xxunk , ' ' before midnight , ' and ' xxunk ' all written during same continuing education screenwriting class | 1 |
| 1 | xxbos xxunk xxunk : ' it 's possible my son was xxunk by frank sinatra , mario xxunk , george xxunk , robert altman , anthony xxunk , xxunk xxunk , robert redford , michael xxunk , danny xxunk , or bruce xxunk ' | 1 |
| 2 | xxbos roy moore on pedophilia accusers : ' these women are only xxunk me now because xxunk xxunk norms have created an environment in which assault allegations are taken seriously ' | 1 |
| 3 | xxbos giuliani says kim jong - un xxunk like a has - been - politician - turned - hack - attorney trying to get a job at the white house | 1 |
| 4 | xxbos ' new york times ' moves all content you wo n't give a shit about unless you make at least $ xxunk a year into one convenient section | 1 |
| 5 | xxbos ' it 's just a costume , it 's just a costume , ' man nervously assures himself as giant hot dog starts walking toward him | 1 |
| 6 | xxbos debut author virginia franken talks about flawed characters , her ( new ) addiction to coffee , what dance taught her about writing , and more | 0 |
| 7 | xxbos ' is it too late to audition ? ' asks perfect actor for role , xxunk head into room just as producers were giving up hope | 1 |
| 8 | xxbos ' how could harvey weinstein get away with this ? ' asks man currently ignoring sexual misconduct of 17 separate coworkers , friends , acquaintances | 1 |


Above: what our data looks like after we apply the vocabulary restriction. `xxunk` is an unknown word. 
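
As a toy sketch of where those `xxunk` tokens come from: any word that is not in the vocabulary gets swapped for the unknown token. The mini-vocabulary and the `restrict_to_vocab` helper below are hypothetical; the real vocabulary comes from `dls_lm.vocab`, and fastai's tokenizer does a lot more than splitting on spaces.

```python
# Hypothetical mini-vocabulary; the real one comes from dls_lm.vocab.
vocab = {"xxbos", "xxunk", "man", "bites", "dog"}

def restrict_to_vocab(text, vocab):
    """Replace any word outside `vocab` with the unknown token xxunk."""
    return [w if w in vocab else "xxunk" for w in text.lower().split()]

restricted = restrict_to_vocab("Man bites armadillo", vocab)
print(restricted)
```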

Below: we're creating a classifier. 

In [18]:
classify = text_classifier_learner(
    dls_class, 
    arch=AWD_LSTM, 
    drop_mult=0.5,
    path=data_path,
    metrics=accuracy
).to_fp16()

In [21]:
classify = text_classifier_learner(dls_class, arch=AWD_LSTM, drop_mult=0.5, metrics=accuracy)

Remember that language model encoder we saved earlier? It's time to load it back!

In [22]:
classify.load_encoder('headlines-awd.pkl')

<fastai.text.learner.TextLearner at 0x1fc14d679d0>

What's happening here? 

Here's the trick: a language model predicts the next word in a sequence using all the information it has so far (all the previous words). When we train a classifier, we ask it to predict the label (satire or not) instead of the next word. 

The intuition here is that if you can tell what the next word in a sentence is, you can tell if it is satirical. (Similarly, if you can tell what the next word in an email is, you can tell if it is spam, etc.)

In [23]:
classify.fit_one_cycle(1, 1e-2)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.415637 | 0.349997 | 0.850428 | 00:28 |


In [24]:
classify.freeze_to(-2)

Above: this is similar to the `unfreeze()` that `fine_tune()` above uses. Except, here we only allow the last few layers of your model to change.
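
A tiny sketch of the idea, assuming the model is split into ordered layer groups and that a negative index counts from the end, like Python list indexing. The group names and this `freeze_to` function are made up for illustration; fastai's real grouping for `AWD_LSTM` is different.

```python
# Hypothetical layer groups, ordered from the input side to the output side.
groups = ["embedding", "lstm_1", "lstm_2", "lstm_3", "classifier_head"]

def freeze_to(groups, n):
    """Return {group: trainable?}; every group before index n is frozen."""
    cut = n if n >= 0 else len(groups) + n
    return {g: i >= cut for i, g in enumerate(groups)}

trainable = freeze_to(groups, -2)
print([g for g, on in trainable.items() if on])
```

So `freeze_to(-2)` leaves only the last two groups trainable, while everything earlier keeps the weights it already has.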

**TA Note:** This is called "gradual unfreezing". It's a technique used in NLP-related ML tasks where we gradually train from the back of the neural network to the front. Why does this produce better results? It's not really clear, but empirical studies show it helps.

¯\\_(ツ)_/¯

In [25]:
classify.fit_one_cycle(1, 1e-3)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.360557 | 0.304304 | 0.86895 | 00:30 |


Wow! An accuracy of around 87%! That sounds great, and for not that much work.

Now, let's try it on some headlines, to see how well it does. 

# Step 4: Try out the classifier (20 points)

In [41]:
classify.predict("Despair for Many and Silver Linings for Some in California Wildfires")

('0', tensor(0), tensor([0.9042, 0.0958]))

Here in the output, the first part of this tuple is the chosen category (`0`, i.e. not satire), the second part is that category's index as a tensor, and the last part is a tensor of probabilities, one per category. The classifier suggests that the headline (which I got from the [New York Times](https://www.nytimes.com/2019/10/29/us/california-fires-homes.html?action=click&module=Top%20Stories&pgtype=Homepage)) is not satire, with about 90% confidence.
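
Here is a small sketch of pulling the confidence out of that tuple. To keep it runnable without `fastai`, the `result` below is a made-up stand-in that uses plain Python values where `predict` returns tensors; with the real output you would probably want to wrap the probability in `float(...)`.

```python
# Made-up stand-in for the tuple that classify.predict(...) returns;
# plain Python values replace the tensors so this runs without fastai.
result = ("0", 0, [0.9042, 0.0958])

label, label_idx, probs = result
confidence = probs[label_idx]        # probability of the chosen category
is_confident = confidence >= 0.70    # the 70% bar used later in this assignment

print(f"label={label} confidence={confidence:.1%} confident={is_confident}")
```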

## Step 4a: Try out this classifier (10 points)

Below, try the classifier with some headlines, real or made up (including made up by the language model above). 


In [65]:
## Two headlines that the classifier correctly classifies (1 point)

In [67]:
## Two headlines that the classifier classifies incorrectly (1 point)

Now, we want to find two headlines that the classifier is really confident about, but classifies incorrectly. We want the confidence of the prediction to be at least 70%.

One headline is anything you want to write. Another must be a real headline (not satire) that you could trick the classifier into misclassifying by changing only one word. For instance, taking `"Despair for Many and Silver Linings for Some in California Wildfires"`, a real NYTimes headline, you can change it to `"Despair for Many and Silver Linings for Some in Oregon Wildfires"` (note that this particular change does not cause the classifier to misclassify).

In [68]:
## Insert one headline of your own that the classifier misclassifies with high confidence (at least 70%). (4 points)


In [71]:
## Insert one real headline, with only one word changed, that the classifier misclassifies with high confidence (at least 70%). (4 points)

# Also, insert link to the original headline/article.


## Step 4b: What kinds of headlines are misclassified? (10 points)

Write your hypothesis below on what kinds of headlines are misclassified. Show your work.

In [None]:
## Show work here

(Add your interpretation here)

# Step 5: Save your classifier
Now that we've trained the classifier, you're ready for Part 2. You'll use this saved file in your bot later.

In [29]:
classify.export('satire_awd.pkl')

Later, you'll use it like so.

In [33]:
serve_classifier = load_learner(data_path/'satire_awd.pkl')
serve_lm = load_learner(data_path/'headlines-lm.pkl')

In [34]:
serve_classifier.predict('How the New Syria Took Shape')

('0', tensor(0), tensor([0.9733, 0.0267]))

In [60]:
serve_lm.predict('US President', n_words=10, no_unk=True)

'US President : how to get money from home , hip hop'

**TA Note:** President Biden's mixtape is not expected to drop until he delivers his State of the Union address.

# Step 6: Add the bot code

See the assignment document for what the bot code should look like. You can add it just below here, but you are also welcome to create a new notebook where you put that code. 