In [1]:
from fastai.text import *
import pandas as pd

# Language models

Maybe here: https://raw.githubusercontent.com/mkearney/trumptweets/master/data/trumptweets-1515775693.tweets.csv

A language model is an algorithm that takes a sequence of words and outputs the likely next word in the sequence. Most language models output a list of words, each with its probability of occurrence. For example, if we had a sentence that started `I would like to eat a hot`, then ideally the algorithm would predict that the word `dog` had a much higher chance of being the next word than the word `meeting`. 
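
To make the idea of "a list of words, each with its probability" concrete, here is a minimal sketch of a toy language model. It is an assumption-laden simplification: it only looks at the single previous word (a bigram model), and the three-sentence corpus is made up for illustration. The model trained later in this notebook uses all the previous words, not just one.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count, for each word, which words follow it in the training sentences."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            followers[current_word][next_word] += 1
    return followers

def next_word_probabilities(followers, word):
    """Turn the follower counts for `word` into probabilities that sum to 1."""
    counts = followers[word.lower()]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

# Tiny made-up corpus, for illustration only.
toy_corpus = [
    "i would like to eat a hot dog",
    "she ate a hot dog at the game",
    "we sat through a hot meeting",
]
model = train_bigram_model(toy_corpus)
probabilities = next_word_probabilities(model, "hot")
print(probabilities)
```

In this toy corpus `dog` follows `hot` twice and `meeting` once, so `dog` gets the higher probability.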

Language models are a very powerful building block in natural language processing. They are used for classifying text (e.g. is this review positive or negative?), for answering questions based on text (e.g. "what is the capital of Finland?" based on the Wikipedia page on Finland), and language translation (e.g. English to Japanese).

## The intuition behind why language models are so broadly useful
How can this simple-sounding algorithm be that broadly useful? Intuitively, this is because predicting the next word in a sentence requires a lot of information, not just about grammar and syntax, but also about semantics: what things mean in the real world. For instance, we know that `I would like to eat a hot dog` is semantically reasonable, but `I would like to eat a hot cat` is nonsensical. 

I trained a simple language model, and asked it to predict the word following `I would like to eat a `. 

We get:
    

# Step 1: Load all the data 
In this example, we are going to use a dataset of headlines from [the Onion](https://www.theonion.com), as well as some non-sarcastic news sources. I found this dataset on [Kaggle](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection). 

Before I started creating this notebook, I downloaded the JSON file to a folder `haii-assignment4`.

In [2]:
from pathlib import Path
data_path = Path('./haii-assignment4')

The data is in a JSON file, so I am using the `read_json` method. If your data is CSV, use the `read_csv` method instead. 

We use the `lines=True` argument here because the author formatted each line as a separate JSON object. I think at least half of your time as a data scientist/AI researcher is spent dealing with other people's data formats!
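
As a rough sketch of what "each line is a separate JSON object" means, the snippet below parses such a file by hand using only the standard library. The two records are made up, and this is only an illustration of the format, not how pandas implements `lines=True`.

```python
import json

# Two made-up records in the same "one JSON object per line" layout.
raw_text = (
    '{"is_sarcastic": 1, "headline": "example satirical headline"}\n'
    '{"is_sarcastic": 0, "headline": "example real headline"}\n'
)

# json.loads(raw_text) would fail because the text as a whole is not one JSON value,
# so each non-empty line is parsed on its own.
records = [json.loads(line) for line in raw_text.splitlines() if line.strip()]
print(len(records), records[0]["headline"])
```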


In [3]:
headlines = pd.read_json(data_path/'Sarcasm_Headlines_Dataset_v2.json', lines=True)

In [4]:
headlines

Unnamed: 0,is_sarcastic,headline,article_link
0,1,thirtysomething scientists unveil doomsday clo...,https://www.theonion.com/thirtysomething-scien...
1,0,dem rep. totally nails why congress is falling...,https://www.huffingtonpost.com/entry/donna-edw...
2,0,eat your veggies: 9 deliciously different recipes,https://www.huffingtonpost.com/entry/eat-your-...
3,1,inclement weather prevents liar from getting t...,https://local.theonion.com/inclement-weather-p...
4,1,mother comes pretty close to using word 'strea...,https://www.theonion.com/mother-comes-pretty-c...
...,...,...,...
28614,1,jews to celebrate rosh hashasha or something,https://www.theonion.com/jews-to-celebrate-ros...
28615,1,internal affairs investigator disappointed con...,https://local.theonion.com/internal-affairs-in...
28616,0,the most beautiful acceptance speech this week...,https://www.huffingtonpost.com/entry/andrew-ah...
28617,1,mars probe destroyed by orbiting spielberg-gat...,https://www.theonion.com/mars-probe-destroyed-...


As you can see, some of this dataset is drawn from the Onion; the rest is drawn from places like the Huffington Post, which publish real news, not satire. 

## Step 1a: Examine the data set (5 points)

Before we go off adventuring, let's first see what this dataset looks like. 

### Q: How large is this dataset? Is it balanced? (1 point)

In [5]:
# Insert code here to check size of dataset, and how many are positive (is_sarcastic = 1) and how many negative?
# Hint: Your output will look like this.
is_sarcastic = headlines['is_sarcastic']  # select the label column (a pandas Series)
is_sarcastic.value_counts()

0    14985
1    13634
Name: is_sarcastic, dtype: int64

Answer: The dataset has 28,619 headlines and is roughly balanced: 14,985 are not sarcastic (`is_sarcastic = 0`) and 13,634 are sarcastic (`is_sarcastic = 1`). 
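
A quick way to put a number on "balanced" is to turn the counts into shares. The sketch below copies the two counts from the output above. It also computes the accuracy of a hypothetical classifier that always predicts the more common class, which is a useful baseline to keep in mind when judging any accuracy figure.

```python
# Counts copied from the value_counts() output above.
class_counts = {0: 14985, 1: 13634}

total = sum(class_counts.values())
class_shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of a classifier that always predicts the more common class.
majority_baseline = max(class_shares.values())

for label, share in class_shares.items():
    print(f"is_sarcastic = {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

The split works out to roughly 52% versus 48%.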

### Q: How long on average is each headline? (4 points)
Longer text = more information. We want to see what the length of the headline is in order to see how much information it may have. 

In [6]:
# Insert code here to find the average length of headline (in words)
## Hint: see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.count.html 
# the r'\s' regex matches any whitespace character (the raw string avoids an invalid-escape warning).
headline = headlines['headline']
headline.str.count(r'\s')

0         7
1        12
2         6
3         7
4         8
         ..
28614     6
28615    12
28616    11
28617     7
28618     6
Name: headline, Length: 28619, dtype: int64

In [7]:
# Adding 1 because a headline with n whitespace characters has n + 1 words
headline.str.count(r'\s').mean() + 1

10.052552500087355

Answer: Each headline has about 10 words on average. 
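
Counting whitespace and adding one assumes every pair of words is separated by exactly one space and that there are no leading or trailing spaces. The sketch below compares that approach with splitting on runs of whitespace, using one headline from the dataset and a deliberately messy variant of it (the messy variant is made up).

```python
import re

def count_by_whitespace(text):
    """Mirror the notebook's approach: number of whitespace characters plus one."""
    return len(re.findall(r"\s", text)) + 1

def count_by_split(text):
    """Split on runs of whitespace, ignoring leading and trailing spaces."""
    return len(text.split())

clean = "eat your veggies: 9 deliciously different recipes"
messy = "eat your  veggies: 9 deliciously different recipes "  # double space, trailing space

print(count_by_whitespace(clean), count_by_split(clean))
print(count_by_whitespace(messy), count_by_split(messy))
```

Both methods agree on the clean headline, but the whitespace count overstates the messy one. In pandas, the split-based version could be written as `headlines['headline'].str.split().str.len().mean()`; whether it changes the average here depends on how clean the headlines are.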

# Step 2: Build a language model that knows how to write news headlines

This is the first step of our project that will be using a machine learning model. 

We are going to use the [fast.ai](https://fast.ai/) library to create this model. If you need help with understanding this section, look at the fast.ai documentation -- it is fantastic! The steps below are modified from the [online tutorial](https://docs.fast.ai/text.html#Quick-Start:-Training-an-IMDb-sentiment-model-with-ULMFiT)

In [8]:
import fastai
from fastai.text import * 

*Note: if this import fails for you, make sure you've installed fastai first. Do that by creating a new cell, and typing `!pip install fastai`*

*Note to self: I had to use `conda install -c pytorch -c fastai fastai` in the env in which I am running this notebook instead of `!pip install fastai`. Pip install did not install bottleneck package and some others when run in command line. Code reference for the same can be found [here.](https://github.com/fastai/fastai)*

In [9]:
data_lm = (TextList.from_df(headlines, path=data_path, cols='headline').split_none().label_for_lm().databunch())

## So here is what happened above. 

First, we tell fastai that we want to work on a list of texts (headlines in our case), that are stored in a dataframe (that's the `TextList.from_df` part.) We also pass in our data path, so after we process our data, we can store it at that location. Finally, we tell it where to look for the headline in the dataframe (which column to use, `cols=`). 

Then there are two other important parts. We'll take it from the end. A `databunch` is a fastai convenience. It keeps all your training, validation and test data together. But what kind of validation data do we need for a language model? Remember that a language model predicts the next word in an input sequence of words. So, we can't just take some of the headlines and set them aside as validation. Instead, we want to use all the sentences and validate whether we can guess the right next word some fraction of the time. So, we first say `split_none` so you use all your data. Then we say `label_for_lm` so it labels the "next word" as the label for each sequence of words. It's a clever method -- see the source if you're curious!
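
To illustrate the "next word as the label" idea, here is a simplified sketch. It is not fastai's implementation (as far as I understand, fastai joins all the texts into one long stream and cuts it into fixed-length sequences), and `next_word_pairs` is a hypothetical helper. The core idea is the same, though: the target sequence is the input sequence shifted by one token.

```python
def next_word_pairs(tokens):
    """Pair every token (except the last) with the token that follows it."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return list(zip(inputs, targets))

tokens = "xxbos eat your veggies".split()
for input_token, target_token in next_word_pairs(tokens):
    print(f"{input_token!r} -> {target_token!r}")
```

Because the labels come from the text itself, no human labeling is needed to train a language model.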


In [10]:
data_lm.save('data_lm_export.pkl')

Let's save this databunch. We'll use this saved copy later. 

## Step 2a: Learn the model

Now that we have the data, it's time to train the model.

Now, we *could* learn a language model from scratch. But we're instead going to cheat. We're going to use a pretrained language model, and finetune it for our purpose. Specifically, we're going to use a model trained on the `Wikitext-103` corpus. 

One way to understand it is to think of our pre-trained model as a model that can predict the next word in a Wikipedia article. We want to train it to write headlines instead. Since headlines still have to sound like English, i.e. follow grammar and syntax, be generally plausible, etc., being able to predict the next word in Wikipedia is super useful. It allows us to start with a model that already knows some English, and then just train it for writing headlines.



In [11]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)

This `AWD_LSTM` is the pretrained Wikipedia model.

Let's train it.

In [12]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,6.052151,#na#,,04:03


Once trained, it's time to write some headlines! We give it a starting sequence `Students protest ` and see what it comes up with. 

In [13]:
learn.predict("Students protest ", n_words=5, no_unk=True)

'Students protest  the set of restaurants in'

Pretty good, huh? 

In [14]:
learn.predict('The Fed is expected to', n_words=3, no_unk=True)

'The Fed is expected to lose 3 separate'

OK, it's not perfect! Let's make it a little better. 

The `unfreeze` below is telling fastai to allow us to change the weights throughout the model. We do this when we want to make the model generate text that's more similar to our headlines (than to Wikipedia). 

*Note to self: `unfreeze` will unfreeze all layers of your model, so you will be training the early and later layers, although you still may be training the different layer groups at different learning rates. This is called ‘discriminative learning rates’ or ‘discriminative layer training’. Referenced from a [fast.ai forum](https://forums.fast.ai/t/can-anyone-explain-me-what-does-freeze-and-unfreeze-do/42025).*
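
As a sketch of what "different layer groups at different learning rates" could look like numerically, the snippet below spreads learning rates across layer groups with a constant ratio between neighbours. The function name, the number of groups, and the rate range are all hypothetical, and the geometric spacing is an assumption rather than something this notebook shows. In fastai v1, my understanding is that passing a `slice` (for example `max_lr=slice(1e-5, 1e-3)`) to `fit_one_cycle` is how such a range is requested.

```python
def spread_learning_rates(lowest, highest, n_groups):
    """Return n_groups learning rates spaced by a constant ratio from lowest to highest."""
    if n_groups == 1:
        return [highest]
    ratio = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * ratio ** i for i in range(n_groups)]

# Hypothetical values: four layer groups, with the earliest group trained 100x more gently.
rates = spread_learning_rates(1e-5, 1e-3, 4)
for group_index, rate in enumerate(rates):
    print(f"layer group {group_index}: lr = {rate:.2e}")
```

The intuition is that early layers hold general knowledge of English and should change slowly, while later layers are more task-specific and can change faster.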

In [15]:
learn.unfreeze()

In [16]:
learn.fit_one_cycle(cyc_len=1, max_lr=1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,5.181231,#na#,,06:56


In [17]:
learn.predict('New Study', n_words=5)

'New Study finds how much eagles you'

In [18]:
learn.predict('16 Problems', n_words=5)

"16 Problems that help save bale 's"

OK, now let's save our hard work. We'll use this later. (Pssst: why is it called an encoder? Look at the Fastai docs to find out!)

*Note to self: The `encoder` is essentially tasked with creating a mathematical representation of the language based on the task for predicting the next word. A `decoder` is responsible for taking that representation and applying it to some problem (e.g., predicting the next word, understanding sentiment, etc.). Referenced from a [fast.ai forum](https://forums.fast.ai/t/what-is-an-encoder-and-what-is-save-load-encoder-actually-doing/8281/3).*

In [19]:
learn.save_encoder('headlines-awd.pkl')

Note that we also want to save the whole model, so we can reuse it in our Twitter bot. 


In [20]:
learn.export('headlines-lm.pkl')

# Step 2b: See how well the language model works (15 points)

Try generating a few more headlines. Then, answer the following questions. Wherever possible, show what code you ran, or what predictions you asked it for. *Suggestion: Try using punctuation, numbers, texts of different lengths etc.*

### Q: What is the effect of starting with longer strings? (5 points)

We could start our headline generation with just one word, e.g. `learn.predict('White', n_words=9)` or with many: `learn.predict('White House Says Whistleblower Did', n_words=5)`. What is the difference you see in the kinds of headlines generated?


In [21]:
## Your answer here. Insert more cells if you want to insert code etc.
learn.predict('White', n_words=9)

"White house creators : ' emotional debt ' xxbos how"

In [22]:
learn.predict('White', n_words=4)

'White house audience wondering if'

In [23]:
learn.predict('White House Says Whistleblower Did', n_words=5)

'White House Says Whistleblower Did not arrive after the uprising'

In [24]:
learn.predict('White House is located in', n_words=2)

'White House is located in south africa'

In [25]:
learn.predict('White House is located in', n_words=10)

'White House is located in wide sector of central capitol building , having a completely'

In [26]:
learn.predict('White House is located in', n_words=10)

"White House is located in asia it 's never like to see movie despite everyone"

Answer: There are two parameters here: the number of words given to the model as the start of the headline (which we can call `string_words`), and the number of additional words the model generates to complete the headline (`n_words`). After experimenting a bit, I think that headlines generated with more `string_words` are less likely to contain grammatical errors. However, the headlines sometimes do not make factual sense, and aren't specific to what we may be looking for. As for `n_words`, the more words we ask the model to predict, the more likely it is to emit special tokens such as `xxbos` or to leave the sentence incomplete. Thus the best combination for a good headline seems to be more `string_words` and fewer `n_words`, which could be quantified by running some more experiments.

### Q: What aspects of the task of generating headlines does our language model do well? (5 points)
For example, does it get grammar right? Does it know genders of people or objects? etc.

In [27]:
#Your answer here. Insert more cells if you want to insert code etc.
learn.predict('Donald Trump is the president', n_words=5)

'Donald Trump is the president of american radio stations ,'

In [28]:
learn.predict('Inclement weather prevents liars from', n_words=4)

'Inclement weather prevents liars from leaving conviction xxbos voting'

Answer: In my view, the model does the following well: 
* The grammar output by the model is acceptable, although sentence formation and punctuation could be better.
* The model uses advanced words such as gubernatorial, semitic etc. (These words might not appear in the outputs above if the notebook is run again.)
* The model can generate an output with numbers and words together. 
* The model is able to generate the names of famous celebrities, and uses them with their name and surname.
* The model is able to use present tense, past tense, present participle etc. pretty well.

### Q: What aspects of the task of generating headlines does our model do poorly? (5 points)
What does it frequently get wrong? Why might it make these mistakes?



In [29]:
## Your answer here
learn.predict('twinkle twinkle little', n_words=2)

'twinkle twinkle little found through'

Answer: In my view, the model does the following poorly: 
* Sentence formation and punctuation are haphazard. This could be because even after learning from Wikipedia, the model still needs to be taught the rules of sentence formation and punctuation.
* The model does not know context or associations, such as that Michael Jordan is a basketball player, or that Donald Trump is the president of the USA. This is perhaps because famous personalities and what they are associated with need to be taught to the model separately. 
* The model is unable to complete poems known by everyone, such as 'twinkle twinkle little star'. This is perhaps because the Wikipedia pages might not contain this information.
* A headline is usually a compelling sentence. The model is unable to make a headline statement. Sometimes it starts a new sentence and drops it in the middle, leaving the rest to the user's imagination. This mistake could be because the training data is more paragraph-like. 
* When more than 3 or 4 words are requested, the model starts adding `xxbos` at unexpected places. 
* Sometimes the model gives out fewer words than the user asked for. 
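
One possible reading of the stray `xxbos` tokens is that `xxbos` marks the beginning of a new text, so the model has effectively finished one headline and started another. Under that assumption, a small post-processing step can cut the generated string at the first `xxbos`. The helper below is hypothetical, not part of fastai, and the two inputs are copied from the outputs earlier in this notebook.

```python
BOS_TOKEN = "xxbos"  # fastai's marker for the beginning of a new text

def trim_at_bos(generated):
    """Keep only the part of a generated string before the first xxbos token."""
    kept = []
    for token in generated.split():
        if token == BOS_TOKEN:
            break
        kept.append(token)
    return " ".join(kept)

print(trim_at_bos("White house creators : ' emotional debt ' xxbos how"))
print(trim_at_bos("Inclement weather prevents liars from leaving conviction xxbos voting"))
```

This also explains why trimmed headlines can come out shorter than the number of words requested.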

# Step 3: Learn a classifier to see which headlines are satire

Remember, our dataset has some stories that are satire (from the Onion) and others that are real. Now, we're going to train a classifier to distinguish one from the other. 

In [30]:
data_clas = (TextList.from_df(df=headlines, path=data_path, vocab= data_lm.train_ds.vocab, cols='headline').split_by_rand_pct(valid_pct=0.2).label_from_df(cols='is_sarcastic').databunch())


We're using a similar databunch method as we did for our language model above. Here, we are using `split_by_rand_pct` so we keep some fraction of our dataset as a validation set. There is one other trick: `vocab= data_lm.train_ds.vocab` ensures that our classifier only uses words that we have in our language model -- so it never deals with words it hasn't encountered before. (Consider: why is this important?)

*Note to self: `data_lm.train_ds.vocab` this could be important so that the context and intent pertain to the dataset. But this could also be a limiting factor for the type of responses that are given out.*

See if you can work out what the other arguments are. 

In [31]:
data_clas.show_batch()

text,target
"xxbos hot wheels ranked number one toy for rolling down ramp , knocking over xxunk that send xxunk down a funnel , dropping onto teeter - xxunk that yanks on string , causing xxunk system to raise wooden block , xxunk series of twine xxunk that unwind spring , launching tennis ball across room , xxunk tire down slope until it hits power switch , xxunk table fan that blows",1
"xxbos tv showdown expected as ' sleepy hollow ' debuts tonight against hbo 's ' xxunk , ' xxunk 's ' xxunk xxunk , ' xxunk 's ' xxunk rider '",1
"xxbos past xxunk and on to xxunk , one of israel 's premier xxunk sites : spring break 2016 , breaking bad on the looney front - part 1",0
"xxbos greece 's rock portrait gallery , from xxunk xxunk to de xxunk 's nose : suspended in mid - air on the looney front , part ii",0
"xxbos the xxunk & xxunk years : a conversation with laurie anderson on lou reed , plus roger xxunk presents xxunk xxunk 's teen cancer xxunk road xxunk",0


Above: what our data looks like after we apply the vocabulary restriction. `xxunk` is an unknown word. 
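
The effect of the vocabulary restriction can be sketched in a few lines: any token that is not in the vocabulary is replaced by `xxunk`. The vocabulary below is a tiny made-up set and `restrict_to_vocab` is a hypothetical helper; the real vocabulary comes from `data_lm.train_ds.vocab`, and fastai does this mapping internally.

```python
UNKNOWN_TOKEN = "xxunk"

def restrict_to_vocab(tokens, vocab):
    """Replace every token that is missing from the vocabulary with xxunk."""
    return [token if token in vocab else UNKNOWN_TOKEN for token in tokens]

# Hypothetical, tiny vocabulary; the real one comes from data_lm.train_ds.vocab.
toy_vocab = {"xxbos", "scientists", "unveil", "doomsday", "clock", "of", "hair", "loss"}
tokens = "xxbos thirtysomething scientists unveil doomsday clock of hair loss".split()
print(" ".join(restrict_to_vocab(tokens, toy_vocab)))
```

Rare words such as `thirtysomething` are the ones most likely to be missing, so they collapse into a single `xxunk` token.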

Below: we're creating a classifier. 

In [32]:
classify = text_classifier_learner(data=data_clas, arch=AWD_LSTM, drop_mult=0.5)

Remember that language model encoder we saved earlier? It's time to load it back!

In [33]:
classify.load_encoder('headlines-awd.pkl')

RNNLearner(data=TextClasDataBunch;

Train: LabelList (22896 items)
x: TextList
xxbos xxunk scientists unveil doomsday clock of hair loss,xxbos dem rep . totally nails why congress is falling short on gender , racial equality,xxbos eat your xxunk : 9 xxunk different recipes,xxbos xxunk weather prevents liar from getting to work,xxbos my white inheritance
y: CategoryList
1,0,0,1,0
Path: haii-assignment4;

Valid: LabelList (5723 items)
x: TextList
xxbos have we already solved the student debt crisis ?,xxbos chance the rapper teams up with naacp for # xxunk campaign,xxbos cousin of nba star xxunk wade killed in chicago shooting,xxbos how xxunk immigration policy affects the xxunk,xxbos dad busts his daughter for drinking in the most epic way possible
y: CategoryList
0,0,0,0,0
Path: haii-assignment4;

Test: None, model=SequentialRNN(
  (0): MultiBatchEncoder(
    (module): AWD_LSTM(
      (encoder): Embedding(11152, 400, padding_idx=1)
      (encoder_dp): EmbeddingDropout(
        (emb): Em

What's happening here? 

Here's the trick: a language model predicts the next word in a sequence using all the information it has so far (all the previous words). When we train a classifier, we ask it to predict the label (satire or not) instead of the next word. 

The intuition here is that if you can tell what the next word in a sentence is, you can tell if it is satirical. (Similarly, if you can tell what the next word in an email is, you can tell if it is spam, etc.)

In [34]:
classify.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.459102,0.377744,0.82981,02:17


In [35]:
classify.freeze_to(-2)

Above: this is similar to the `unfreeze()` that we used before, except that you only allow the last few layer groups of your model to change (here, the last two). Then we can train again, just as we did after `unfreeze()`.

In [36]:
classify.fit_one_cycle(1, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,0.388193,0.325575,0.85934,02:43


Wow! An accuracy of 85%! That sounds great, and for not that much work. 

Now, let's try it on some headlines, to see how well it does. 

# Step 4: try out the classifier (20 points)

In [37]:
classify.predict("Despair for Many and Silver Linings for Some in California Wildfires")

(Category 0, tensor(0), tensor([0.9879, 0.0121]))

Here in the output, the first part of this tuple is the chosen category (`0`, i.e. not satire), and the last part is an array of probabilities, one per category. The classifier suggests that the headline (which I got from the [New York Times](https://www.nytimes.com/2019/10/29/us/california-fires-homes.html?action=click&module=Top%20Stories&pgtype=Homepage)) is not satire, with about 99% confidence (the first probability, `0.9879`). 
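
Reading the confidence off the tuple amounts to picking the largest probability and the category it belongs to. A minimal sketch, using a plain Python list in place of the tensor and a hypothetical `summarize_prediction` helper:

```python
def summarize_prediction(probabilities, labels=("not satire", "satire")):
    """Return the most likely label and its probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best_index], probabilities[best_index]

# Probabilities copied from the tensor in the output above.
label, confidence = summarize_prediction([0.9879, 0.0121])
print(f"{label} ({confidence:.1%} confidence)")
```

The label order assumes index 0 corresponds to `is_sarcastic = 0` and index 1 to `is_sarcastic = 1`, which matches the categories shown in the outputs.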

## Step 4a: Try out this classifier (10 points)

Below, try the classifier with some headlines, real or made up (including made up by the language model above). 


### Two headlines that the classifier correctly classifies (1 point)

In [39]:
classify.predict("Joker' smashes October box office record with $93.5M debut")

(Category 0, tensor(0), tensor([0.6847, 0.3153]))

In [40]:
classify.predict("Woman wakes up to find a cat staring in her face")

(Category 1, tensor(1), tensor([0.1533, 0.8467]))

### Two headlines that the classifier classifies incorrectly (1 point)

In [42]:
classify.predict("You may be at risk of throat cancer if you have a throat or mouth")

(Category 0, tensor(0), tensor([0.9147, 0.0853]))

In [43]:
classify.predict("Man takes a Harley Davidson bike for a test ride and steals it")

(Category 0, tensor(0), tensor([0.8689, 0.1311]))

### Now, we want to find two headlines that the classifier is really confident about, but classifies incorrectly. We want the confidence of the prediction to be at least 85%.

One headline is anything you want to write. Another must be a real headline (not satire) that you could trick the classifier into misclassifying by changing only one word. For instance, taking `"Despair for Many and Silver Linings for Some in California Wildfires"`, a real NYTimes headline, you can change it to `"Despair for Many and Silver Linings for Some in Oregon Wildfires"` (note that this particular change does not cause the classifier to misclassify).

In [44]:
## Insert one headline that the classifier classifies incorrectly, with false high confidence. (4 points)
classify.predict("Man takes a Harley Davidson bike for a test ride and steals it")

(Category 0, tensor(0), tensor([0.8689, 0.1311]))

In [45]:
## Insert one headline that the classifier classifies incorrectly, with false high confidence. (4 points)
classify.predict("Mississippi judge accused of selling fake funeral plans to senior citizens")
# Also, insert link to the original headline/article.

(Category 0, tensor(0), tensor([0.6769, 0.3231]))

link to original headline: https://www.wmcactionnews5.com/2019/05/21/mississippi-coroner-accused-selling-fake-funeral-plans-senior-citizens/

## Step 4b: What kinds of headlines are misclassified? (10 points)

Write your hypothesis below on what kinds of headlines are misclassified. If it helps you, use the [TextClassificationInterpretation](https://docs.fast.ai/text.learner.html#TextClassificationInterpretation) utility. Show your work, especially if you use this utility.

### Trying two strings for validating

In [46]:
## Show work here
import matplotlib.cm as cm

In [47]:
txt_ci = TextClassificationInterpretation.from_learner(classify)
test_text = "Mississippi coroner accused of selling fake funeral plans to senior citizens"
txt_ci.show_intrinsic_attention(test_text,cmap=cm.Purples)



In [48]:
txt_ci.intrinsic_attention(test_text)[1]



tensor([0.2181, 0.1548, 0.6053, 1.0000, 0.5298, 0.1507, 0.4832, 0.6368, 0.5153,
        0.2190, 0.1319, 0.6186, 0.2511])

In [49]:
txt_ci = TextClassificationInterpretation.from_learner(classify)
test_text = "Man takes a Harley Davidson bike for a test ride and steals it"
txt_ci.show_intrinsic_attention(test_text,cmap=cm.Purples)



In [50]:
txt_ci.intrinsic_attention(test_text)[1]



tensor([0.3626, 0.3066, 0.7198, 0.6549, 0.3873, 0.2485, 0.3887, 0.1311, 0.5101,
        0.4689, 0.1839, 0.1910, 0.2853, 0.2878, 0.1991, 1.0000, 0.1556])

Answer: From what I read in blogs and articles, a text classifier breaks the text into separate tokens and measures the effect of each one on the predicted category. In the `TextClassificationInterpretation` output above, the darker the highlight on a word, the more that word contributes to the classification (here, satire or not). In the headline where a man takes a bike for a test ride and steals it, the parts that make the sentence funny are `bike`, `test ride` and `steals`. However, the words with the highest impact are `man`, `takes`, `davidson`, `bike` and `steals`. The phrase `test ride`, which makes the sentence funny, is not picked up as important by the classifier, and this leads to the misclassification. The same can be said of the fake funeral plan example.

To summarise: 
WHEN THE WORD(S) THAT MAKE THE SENTENCE FUNNY ARE NOT MARKED AS IMPORTANT BY THE CLASSIFIER, THEN THE SENTENCE IS MISCLASSIFIED. 
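
To check which words carry the most weight without relying on the colour shading, the attention scores can be paired with tokens and sorted. The scores below are copied from the output for the Harley Davidson headline. The token list is an assumption: the notebook does not print it, so it is reconstructed from fastai's usual tokenization (`xxbos` at the start, `xxmaj` before a capitalized word), which happens to give the same number of tokens as there are scores.

```python
# Scores copied from the intrinsic_attention output above.
scores = [0.3626, 0.3066, 0.7198, 0.6549, 0.3873, 0.2485, 0.3887, 0.1311, 0.5101,
          0.4689, 0.1839, 0.1910, 0.2853, 0.2878, 0.1991, 1.0000, 0.1556]

# ASSUMPTION: fastai's default tokenization of the headline (xxbos marks the start,
# xxmaj marks a capitalized word). The notebook does not print the tokens.
tokens = ("xxbos xxmaj man takes a xxmaj harley xxmaj davidson bike "
          "for a test ride and steals it").split()
assert len(tokens) == len(scores)

SPECIAL_TOKENS = {"xxbos", "xxmaj"}
ranked = sorted(
    ((score, token) for token, score in zip(tokens, scores) if token not in SPECIAL_TOKENS),
    reverse=True,
)
top_five = [token for score, token in ranked[:5]]
print(top_five)
```

Under that tokenization assumption, `test` and `ride` both score below 0.3, well under the top five.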

*Useful links*
1. https://medium.com/@ageitgey/text-classification-is-your-new-secret-weapon-7ca4fad15788
2. https://monkeylearn.com/text-classification/

# Step 5: Save your classifier
Now that we've trained the classifier, you're ready for Part 2. You'll use this saved file in your bot later.

In [51]:
classify.export(file='satire_awd.pkl')

Later, you'll use it like so.

In [52]:
serve_classifier = load_learner(path=data_path, file='satire_awd.pkl')
serve_lm = load_learner(path=data_path, file='headlines-lm.pkl')

In [53]:
serve_classifier.predict('How the New Syria Took Shape')

(Category 0, tensor(0), tensor([0.9987, 0.0013]))

In [54]:
serve_lm.predict('Rising Seas', n_words=7)

'Rising Seas actually reaches global speed join in the'

# Step 6: add the bot code. 

See the assignment document for what the bot code should look like. You can add it just below here, but you are also welcome to create a new notebook where you put that code. 

#### P.S. Credentials are in JSON format
*Process of building the bot:*
* tweepy installed in the env for this notebook using `conda install -c conda-forge tweepy` in command line
* JSON file parsed to extract the credentials

In [55]:
# Import the necessary modules
import json
import tweepy

# Read the API credentials from a JSON file that is kept out of the notebook.
# The `with` block closes the file once it has been read.
with open('credentials.json', 'r') as credentials_file:
    credentials_dict = json.load(credentials_file)

consumer_key = credentials_dict['consumer_key']
consumer_secret = credentials_dict['consumer_secret']
access_token = credentials_dict['access_token']
access_token_secret = credentials_dict['access_token_secret']

# An alternative is to import the secrets from a separate .py file:
# https://stackoverflow.com/questions/25501403/storing-the-secrets-passwords-in-a-separate-file

In [56]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

In [57]:
user = api.me()
print(user.name)

Harshika Jain


In [67]:
# Define the search
query = '@harshikerfuffle'
# changed max_tweets from 100 to 10 since my bot broke once.
max_tweets = 10

# Do the search
searched_tweets = []
last_id = -1
while len(searched_tweets) < max_tweets:
    count = max_tweets - len(searched_tweets)
    try:
        new_tweets = api.search(q=query, count=count, max_id=str(last_id - 1))
        if not new_tweets:
            break
        searched_tweets.extend(new_tweets)
        last_id = new_tweets[-1].id
    except tweepy.TweepError:
        # Depending on TweepError.code, one may want to retry or wait.
        # To keep things simple, we give up on the first error.
        break
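
The loop above pages backwards through search results by asking, on each request, only for tweets older than the oldest one seen so far. The sketch below shows the same pattern without any network access. `fake_search` is a made-up stand-in for the search endpoint and the ids are invented, so this illustrates the pagination logic only, not the Twitter API.

```python
def fake_search(all_ids, count, max_id=None):
    """Stand-in for a search endpoint: newest ids first, none above max_id."""
    eligible = [i for i in sorted(all_ids, reverse=True) if max_id is None or i <= max_id]
    return eligible[:count]

def collect(all_ids, max_items, page_size):
    """Page backwards through results until max_items are collected or none remain."""
    collected = []
    max_id = None
    while len(collected) < max_items:
        count = min(page_size, max_items - len(collected))
        page = fake_search(all_ids, count, max_id)
        if not page:
            break
        collected.extend(page)
        max_id = page[-1] - 1  # next page: strictly older than the oldest id seen
    return collected

tweet_ids = [101, 102, 103, 104, 105, 106, 107]  # made-up ids
print(collect(tweet_ids, max_items=5, page_size=2))
```

Subtracting one from the oldest id is what prevents the same tweet from being returned twice on consecutive pages.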

## Step 6a: Integrating the bot with the satire classifier
Now that you can do basic replies with your bot, it’s time to make it do something useful! Specifically, our bot should do two things:<br>
1. When someone tweets a headline @ the bot, it replies with whether the headline is satire.
2. It also makes up a headline that plays off the original headline, and tweets it back. 

### Explanation of Code
1. The status text is assigned to a variable. When testing the bot on Twitter, I found that if the bot handle (the query keyword in this scenario) appears in the first 4 words of the tweet, the handle also ends up in the reply, so it had to be removed. 
2. To do this, `tweet_text` is redefined using the `.replace()` method, which replaces the query keyword (the bot handle) with an empty string.
3. The `first_n_words` variable takes the first 4 words of the queried sentence. 
4. Since splitting returns a list, the words have to be joined back into a string, so a `join` is performed. 
5. In the earlier code we predicted a headline starting from 2 words; here we use `first_n_words` instead, since they are picked up from the queried sentence. 
6. The classifier output is a tuple. To check whether the predicted category is 0 or 1, we use `output[0].obj` and then reply with a predicted headline accordingly. <br>

P.S. This code was refined by testing with users. 
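
Steps 1 to 4 can be tried out without Twitter or the models by putting them in a small function. `build_prompt` is a hypothetical helper written for illustration, and the sample tweet is modeled on a headline a participant sent to the bot.

```python
def build_prompt(tweet_text, bot_handle, n_words=4):
    """Remove the bot handle from a tweet and keep its first n_words words."""
    cleaned = tweet_text.replace(bot_handle, "")
    return " ".join(cleaned.split()[:n_words])

sample_tweet = "@harshikerfuffle Icy Road Conditions Lead To Multi-Deer Pileup On Highway"
print(build_prompt(sample_tweet, "@harshikerfuffle"))
```

Testing the text handling separately like this makes it easier to see what the language model will actually receive as its starting words.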

In [62]:
for status in searched_tweets:
    tweet_text = status.text
    tweet_text = tweet_text.replace(query, '')
    output = serve_classifier.predict(tweet_text)
    first_n_words = tweet_text.split()[:4]
    first_n_words = ' '.join(first_n_words)
    headline = serve_lm.predict(first_n_words, n_words=6)
    twitter_handle = '@' + status.author.screen_name
    
    # output[0] is the predicted Category; its .obj is the label (0 = real, 1 = satire).
    if output[0].obj == 0:
        api.update_status(
            "Yeah, that seems real, not satirical. But here’s what my bot says: " + twitter_handle + " " + str(headline),
            status.id_str)

    elif output[0].obj == 1:
        api.update_status(
            "Haha, that's so funny. To reiterate what you just said, my bot has to say: " + twitter_handle + " " + str(headline),
            status.id_str)

### Trials that were done before writing the bot integration code<br>a.k.a Mini Appendix

In [59]:
# Iterate over the search
# for status in searched_tweets:
#     # do something with all these tweets
#     print(status)
    

# for (let i = 0; i < searched_tweets.length; i++) {
#     const tweet = searched_tweets[i];
#     print tweet
# }

In [60]:
# When you want to update your status
# api.update_status('I’m on a ferris wheel!')

In [66]:
print(status.text)

Icy Road Conditions Lead To Multi-Deer Pileup On Highway. Did you hear about this? 
@harshikerfuffle


In [63]:
# checking the type of status.text, it is used later in the code.
type(status.text)

str

In [64]:
# checking output type for writing the if statement
output[0].obj #https://docs.fast.ai/core.html#Category

0

In [65]:
#testing
tweet_text = "the world is ending but I want twinkies"
output = serve_classifier.predict(tweet_text)
print(output)

(Category 0, tensor(0), tensor([0.9986, 0.0014]))


## Step 6b. Reflection
Would you recommend using our satire-classifier as a good starting point to build a fake-news classifier? (10 points)<br>
If so, what changes would we need to make to make it useful for this purpose? If not, why not? 

Answer: Yes, I think the satire classifier can be a starting point to build a fake-news classifier, but it definitely needs to be more advanced than it is right now to achieve that. Fake news is challenging to define, and detecting it can depend on multiple underlying questions such as 
* Are the facts correct?
* Is the source of the news authentic?
* Is the news itself biased?
* Is the headline misleading?
* Is the news article clickbait? etc. 

Our model is just looking at headlines, but for fake news we would have to account for some of these problems.

To use the satire classifier as a fake-news classifier, we would also need to train the model on sentences or keywords from actual reported fake news so that it learns to detect it, and we would need some way to verify the credibility of the source. When we run the TextClassificationInterpretation utility on it, a basic fake-news classifier would probably highlight words such as please, share, posted, and like as the ones that contribute most to fakeness. However, as we saw with the satire classifier, the model will sometimes misclassify.

*Relevant articles:*
1. https://www.datacamp.com/community/tutorials/scikit-learn-fake-news
2. https://miguelmalvarez.com/2017/03/23/how-can-machine-learning-and-ai-help-solving-the-fake-news-problem/

# Step 7: Test with Users and Iterate
In this part, you’ll ask three participants to interact with your bot. You’ll give the user high-level information about what the domain of the bot is, and then see how they interact with it. Ask each of the participants to ask your chatbot at least three different things. Record how they interact with your bot. After this participant input, update your bot to attempt to address how that participant interacted with your chatbot. 

Add to bot notebook: How did what your participants input compare to the ones you tested so far? How did participants react when the chatbot didn’t respond correctly, or responded with nonsense? (2.5 points)

Answer: The participants' tweets and my bot's subsequent responses can be seen here: https://twitter.com/harshikerfuffle

INPUT COMPARISON
* The inputs that I had tested with consisted of sentences of about 10 words. Some participants sent tweets with fewer than 10 words. With a one-word tweet, when my bot predicted 6 more words, there was an `xxbos` in the response.
* A participant tweeted a question. My bot didn't reiterate the participant's question with another question but rather with a sentence. 

PARTICIPANTS' REACTION
* A friend said the bot's answers resemble porn spam, since one of the initial replies during testing came out as something like `waitress.. number xxbos`
* During one test, my bot sent the same reply to all participants, which disappointed some, but I quickly fixed this by checking the for loop that calls `api.update_status`

Add to bot notebook: what change could you make in response to this feedback? (2.5 points)
* One change I made on the fly was to ask the participants to tweet something longer than 4 words, since the classifier is using the first four words to predict the rest of the sentence. 
* I do not know how I can modify the responses that are given (sentences dropping off in the middle etc.), except by changing the number of words of the sentence that are picked up and reused in the new headline. 
* Use more data to train the model so that it can give better responses.
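
Two of the issues above, the stray `xxbos` in replies and very short tweets, could also be handled in code before the bot replies. The sketch below is one possible approach, not what the bot currently does. `clean_generated_text` and `has_enough_words` are hypothetical helpers, and the four-word minimum is an assumption based on the four seed words mentioned above. The token names are fastai's special tokens.

```python
# fastai's text preprocessing inserts special tokens that all start with "xx".
SPECIAL_TOKENS = {"xxbos", "xxeos", "xxpad", "xxunk", "xxfld",
                  "xxmaj", "xxup", "xxrep", "xxwrep"}

def clean_generated_text(text):
    """Drop fastai special tokens from generated text (hypothetical helper).

    Note: xxrep/xxwrep are followed by a repeat count, which this simple
    filter leaves in place.
    """
    words = [word for word in text.split() if word not in SPECIAL_TOKENS]
    return " ".join(words)

def has_enough_words(tweet_text, min_words=4):
    """Check that a tweet has enough words to seed the language model."""
    return len(tweet_text.split()) >= min_words

print(clean_generated_text("waitress number xxbos"))  # waitress number
print(has_enough_words("hello"))                      # False
```

With helpers like these, the bot could skip or politely decline tweets that are too short, and strip tokens from the generated headline before posting it.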

# Appendix

In [68]:
classify.predict("Why poor people tend to be more generous than the rich")

(Category 0, tensor(0), tensor([0.9907, 0.0093]))

In [69]:
classify.predict("World bank says poor need more money")

(Category 1, tensor(1), tensor([0.4031, 0.5969]))

In [70]:
classify.predict("Police: Fake Money Making The Rounds In Pittsburgh Area")

(Category 0, tensor(0), tensor([0.9542, 0.0458]))

In [71]:
classify.predict("Police: Fake Money making The bloopers In Pittsburgh Area")

(Category 0, tensor(0), tensor([0.8960, 0.1040]))

In [72]:
classify.predict("Man throws 20ft up National Christmas Tree near White House")

(Category 1, tensor(1), tensor([0.1957, 0.8043]))

In [73]:
classify.predict("I Accidentally Uncovered a Nationwide Scam on 9Gag")

(Category 0, tensor(0), tensor([0.9391, 0.0609]))

In [74]:
classify.predict("Parents Scale school Building to Help Students cheat on Exams")

(Category 1, tensor(1), tensor([0.4809, 0.5191]))

In [75]:
classify.predict("Man Buys Solid Gold Shirt to score with the Ladies")

(Category 0, tensor(0), tensor([0.5963, 0.4037]))

In [76]:
classify.predict("Parents Scale office building to Help apple cheat off Exams")

(Category 1, tensor(1), tensor([0.4368, 0.5632]))

In [77]:
classify.predict("Bar Fight De-Escalates After Both Parties Unable To Tear Off T-Shirts")

(Category 1, tensor(1), tensor([0.1455, 0.8545]))
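
Several of the predictions above are close calls. As a rough way to see which ones, the sketch below re-enters the probabilities printed in the cells above and lists every headline whose top probability is under a cutoff. The `0.6` cutoff and the `low_confidence` helper are arbitrary choices for illustration.

```python
# (headline, [P(class 0), P(class 1)]) copied from the outputs above.
results = [
    ("Why poor people tend to be more generous than the rich", [0.9907, 0.0093]),
    ("World bank says poor need more money", [0.4031, 0.5969]),
    ("Police: Fake Money Making The Rounds In Pittsburgh Area", [0.9542, 0.0458]),
    ("Police: Fake Money making The bloopers In Pittsburgh Area", [0.8960, 0.1040]),
    ("Man throws 20ft up National Christmas Tree near White House", [0.1957, 0.8043]),
    ("I Accidentally Uncovered a Nationwide Scam on 9Gag", [0.9391, 0.0609]),
    ("Parents Scale school Building to Help Students cheat on Exams", [0.4809, 0.5191]),
    ("Man Buys Solid Gold Shirt to score with the Ladies", [0.5963, 0.4037]),
    ("Parents Scale office building to Help apple cheat off Exams", [0.4368, 0.5632]),
    ("Bar Fight De-Escalates After Both Parties Unable To Tear Off T-Shirts", [0.1455, 0.8545]),
]

def low_confidence(rows, cutoff=0.6):
    """Return headlines whose most likely class has probability below cutoff."""
    return [headline for headline, probs in rows if max(probs) < cutoff]

for headline in low_confidence(results):
    print(headline)
```

Four of the ten headlines fall under the cutoff, which suggests the classifier is unsure about a sizeable share of these hand-written examples.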