In [1]:
from fastai.text import *
import pandas as pd

# Language models

Maybe here: https://raw.githubusercontent.com/mkearney/trumptweets/master/data/trumptweets-1515775693.tweets.csv

A language model is an algorithm that takes a sequence of words and outputs the likely next word in the sequence. Most language models output a list of words, each with its probability of occurrence. For example, if we had a sentence that started `I would like to eat a hot`, then ideally the algorithm would predict that the word `dog` had a much higher chance of being the next word than the word `meeting`. 
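To make the idea concrete, here is a minimal sketch of a toy "bigram" language model that only looks at the previous word. The corpus is made up for illustration, and this is far simpler than the neural model used later in this notebook.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, purely for illustration.
corpus = [
    "i would like to eat a hot dog",
    "i would like to eat a hot dog",
    "i would like to eat a hot meal",
    "we have a hot meeting",
]

# Count which word follows each word.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        next_word_counts[current][following] += 1

def next_word_probabilities(word):
    """Return {next_word: probability} for the words seen after `word`."""
    counts = next_word_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

probs = next_word_probabilities("hot")
print(probs)
```

In this toy corpus `dog` follows `hot` more often than `meeting` does, so it gets the higher probability.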

Language models are a very powerful building block in natural language processing. They are used for classifying text (e.g. is this review positive or negative?), for answering questions based on text (e.g. "what is the capital of Finland?" based on the Wikipedia page on Finland), and language translation (e.g. English to Japanese).

## The intuition behind why language models are so broadly useful
How can this simple-sounding algorithm be that broadly useful? Intuitively, this is because predicting the next word in a sentence requires a lot of information, not just about grammar and syntax, but also about semantics: what things mean in the real world. For instance, we know that `I would like to eat a hot dog` is semantically reasonable, but `I would like to eat a hot cat` is nonsensical. 

I trained a simple language model, and asked it to predict the word following `I would like to eat a `. 

We get:
    

# Step 1: Load all the data 
In this example, we are going to use a dataset of headlines from [the Onion](https://www.theonion.com), as well as some non-sarcastic news sources. I found this dataset on [Kaggle](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection). 

Before I started creating this notebook, I downloaded the JSON file to a folder `haii-assignment4`.

In [2]:
from pathlib import Path
data_path = Path('./haii-assignment4')

The data is in a JSON file, so I am using the `read_json` method. If your data is CSV, use the `read_csv` method instead. 

We use the `lines=True` argument here because the author formatted each line as a separate JSON object. I think at least half of your time as a data scientist/AI researcher is spent dealing with other people's data formats!
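As a rough illustration of what "each line is a separate JSON object" means, the sketch below parses two made-up records with the standard `json` module. The records are invented for this example; `pd.read_json(..., lines=True)` does the equivalent work and builds a dataframe.

```python
import json

# Two made-up records in the "one JSON object per line" layout.
raw = (
    '{"is_sarcastic": 1, "headline": "example satirical headline"}\n'
    '{"is_sarcastic": 0, "headline": "example real headline"}\n'
)

# Parse each non-empty line as its own JSON document.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]
print(len(records), records[0]["headline"])

# Parsing the whole text as a single JSON document fails,
# which is why the line-by-line mode is needed.
try:
    json.loads(raw)
    whole_file_ok = True
except json.JSONDecodeError:
    whole_file_ok = False
print(whole_file_ok)
```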


In [3]:
headlines = pd.read_json(data_path/'Sarcasm_Headlines_Dataset_v2.json', lines=True)

In [4]:
headlines

Unnamed: 0,is_sarcastic,headline,article_link
0,1,thirtysomething scientists unveil doomsday clo...,https://www.theonion.com/thirtysomething-scien...
1,0,dem rep. totally nails why congress is falling...,https://www.huffingtonpost.com/entry/donna-edw...
2,0,eat your veggies: 9 deliciously different recipes,https://www.huffingtonpost.com/entry/eat-your-...
3,1,inclement weather prevents liar from getting t...,https://local.theonion.com/inclement-weather-p...
4,1,mother comes pretty close to using word 'strea...,https://www.theonion.com/mother-comes-pretty-c...
...,...,...,...
28614,1,jews to celebrate rosh hashasha or something,https://www.theonion.com/jews-to-celebrate-ros...
28615,1,internal affairs investigator disappointed con...,https://local.theonion.com/internal-affairs-in...
28616,0,the most beautiful acceptance speech this week...,https://www.huffingtonpost.com/entry/andrew-ah...
28617,1,mars probe destroyed by orbiting spielberg-gat...,https://www.theonion.com/mars-probe-destroyed-...


As you can see, some of this dataset is drawn from The Onion; the rest is drawn from places like the Huffington Post, which publish real news, not satire. 

## Step 1a: Examine the data set (5 points)

Before we go off adventuring, let's first see what this dataset looks like. 

### Q: How large is this dataset? Is it balanced? (1 point)

In [10]:
# Insert code here to check size of dataset, and how many are positive (is_sarcastic = 1) and how many negative?
# Hint: Your output will look like this.
is_sarcastic = headlines['is_sarcastic']  # select the label column (a pandas Series)
is_sarcastic.value_counts()

0    14985
1    13634
Name: is_sarcastic, dtype: int64

Answer: The dataset is roughly balanced: there are 14,985 headlines with `is_sarcastic = 0` and 13,634 with `is_sarcastic = 1`, so the two classes are almost equal in size. 
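To put a number on "balanced", a small sketch can turn the two counts from the output above into shares. In pandas, `value_counts(normalize=True)` should give the same proportions directly.

```python
# Counts copied from the value_counts() output above.
counts = {0: 14985, 1: 13634}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"is_sarcastic = {label}: {share:.1%}")
```

The split comes out at roughly 52% to 48%.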

### Q: How long on average is each headline? (4 points)
Longer text = more information. We want to see what the length of the headline is in order to see how much information it may have. 

In [15]:
# Insert code here to find the average length of headline (in words)
## Hint: see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.count.html 
# the '\s' regex looks for spaces.
headline = headlines['headline']
headline.str.len()

0        61
1        79
2        49
3        52
4        61
         ..
28614    44
28615    87
28616    71
28617    61
28618    34
Name: headline, Length: 28619, dtype: int64

In [16]:
headline.str.len().mean()

62.30857122890387

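Answer: Each headline is about 62 characters long on average. Note that `str.len()` counts characters, not words, so a word count would be a smaller number. 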
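The question asks for the length in words, while `str.len()` measures characters. Here is a minimal sketch of the difference, using two headlines from the table above and plain Python string methods. On the full dataframe, something like `headlines['headline'].str.split().str.len().mean()` should give the average word count; that result is not shown here because it was not run in this notebook.

```python
# Two headlines copied from the dataframe preview above.
sample_headlines = [
    "eat your veggies: 9 deliciously different recipes",
    "jews to celebrate rosh hashasha or something",
]

# Characters per headline (what str.len() measures).
char_counts = [len(h) for h in sample_headlines]

# Words per headline: split on whitespace and count the pieces.
word_counts = [len(h.split()) for h in sample_headlines]

mean_words = sum(word_counts) / len(word_counts)
print(char_counts, word_counts, mean_words)
```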

# Step 2: Build a language model that knows how to write news headlines

This is the first step of our project that will be using a machine learning model. 

We are going to use the [fast.ai](https://fast.ai/) library to create this model. If you need help with understanding this section, look at the fast.ai documentation -- it is fantastic! The steps below are modified from the [online tutorial](https://docs.fast.ai/text.html#Quick-Start:-Training-an-IMDb-sentiment-model-with-ULMFiT).

In [17]:
import fastai
from fastai.text import * 

*Note: if this import fails for you, make sure you've installed fastai first. Do that by creating a new cell, and typing `!pip install fastai`*

*Note to self: I had to use `conda install -c pytorch -c fastai fastai` in the env in which I am running this notebook instead of `!pip install fastai`. Pip install did not install bottleneck package and some others when run in command line. Code reference for the same can be found [here.](https://github.com/fastai/fastai)*

In [18]:
data_lm = (
    TextList.from_df(headlines, path=data_path, cols='headline')
    .split_none()
    .label_for_lm()
    .databunch()
)

## So here is what happened above. 

First, we tell fastai that we want to work on a list of texts (headlines in our case), that are stored in a dataframe (that's the `TextList.from_df` part.) We also pass in our data path, so after we process our data, we can store it at that location. Finally, we tell it where to look for the headline in the dataframe (which column to use, `cols=`). 

Then there are two other important parts. We'll take it from the end. A `databunch` is a fastai convenience. It keeps all your training, validation and test data together. But what kind of validation data do we need for a language model? Remember that a language model predicts the next word in an input sequence of words. So, we can't just take some of the headlines and set them aside as validation. Instead, we want to use all the sentences and validate whether we can guess the right next word some fraction of the time. So, we first say `split_none` so you use all your data. Then we say `label_for_lm` so it labels the "next word" as the label for each sequence of words. It's a clever method -- see the source if you're curious!
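The "next word as label" idea can be sketched in a few lines of plain Python. This only illustrates the principle on a made-up headline; fastai's actual implementation works on batches of a long token stream and is more involved.

```python
# A made-up, already tokenized headline (xxbos marks the start of a text).
tokens = "xxbos students protest tuition hike".split()

# For every position, the words so far are the input
# and the word that follows is the label.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, label in pairs:
    print(" ".join(context), "->", label)
```

Every headline therefore yields several training examples, one per word, without any manual labelling.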


In [19]:
data_lm.save('data_lm_export.pkl')

Let's save this databunch. We'll use this saved copy later. 

## Step 2a: Learn the model

Now that we have the data, it's time to train the model.

Now, we *could* learn a language model from scratch. But we're instead going to cheat. We're going to use a pretrained language model, and finetune it for our purpose. Specifically, we're going to use a model trained on the `Wikitext-103` corpus. 

One way to understand this is to think of our pretrained model as a model that can predict the next word in a Wikipedia article. We want to train it to write headlines instead. Since headlines still have to sound like English (i.e., follow grammar and syntax, be generally plausible, etc.), being able to predict the next word in Wikipedia is super useful. It allows us to start with a model that already knows some English, and then just train it for writing headlines.



In [20]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)

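Here, `AWD_LSTM` names the model architecture; by default, `language_model_learner` loads weights for it that were pretrained on Wikipedia (`Wikitext-103`).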

Let's train it.

In [21]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,6.062365,#na#,04:02,


Once trained, it's time to write some headlines! We give it a starting sequence `Students protest ` and see what it comes up with. 

In [22]:
learn.predict("Students protest ", n_words=5, no_unk=True)

'Students protest  from stop in raised neighborhood'

Pretty good, huh? 

In [23]:
learn.predict('The Fed is expected to', n_words=3, no_unk=True)

'The Fed is expected to hold 14 %'

OK, it's not perfect! Let's make it a little better. 

The `unfreeze` below is telling fastai to allow us to change the weights throughout the model. We do this when we want to make the model generate text that's more similar to our headlines (than to Wikipedia). 

*Note to self: `unfreeze` will unfreeze all layers of your model, so you will be training the early and later layers, although you still may be training the different layer groups at different learning rates. This is called ‘discriminative learning rates’ or ‘discriminative layer training’. Referenced from a [fast.ai forum](https://forums.fast.ai/t/can-anyone-explain-me-what-does-freeze-and-unfreeze-do/42025).*
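To illustrate what "different layer groups at different learning rates" can look like, here is a small sketch that spreads learning rates geometrically from the earliest layer group to the last one. The function name and the geometric spacing are assumptions made for this example; in fastai v1 the usual way to request this is to pass a `slice` as the learning rate, e.g. `max_lr=slice(1e-5, 1e-3)`, and the library's exact spacing may differ.

```python
def spread_learning_rates(lowest, highest, n_groups):
    """Hypothetical helper: geometrically spaced rates, one per layer group.

    The earliest group gets `lowest`, the last group gets `highest`.
    """
    if n_groups == 1:
        return [highest]
    ratio = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * ratio ** i for i in range(n_groups)]

# Three layer groups: early layers change slowly, late layers faster.
rates = spread_learning_rates(1e-5, 1e-3, 3)
print(rates)
```

The intuition is that early layers hold general knowledge about English and should move little, while later layers are more task-specific.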

In [24]:
learn.unfreeze()

In [25]:
learn.fit_one_cycle(cyc_len=1, max_lr=1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,5.178275,#na#,06:26,


In [26]:
learn.predict('New Study', n_words=5)

'New Study finds scientific economics anti -'

In [27]:
learn.predict('16 Problems', n_words=5)

'16 Problems with an empire xxbos signature'

OK, now let's save our hard work. We'll use this later. (Pssst: why is it called an encoder? Look at the Fastai docs to find out!)

*Note to self: The `encoder` is essentially tasked with creating a mathematical representation of the language based on the task for predicting the next word. A `decoder` is responsible for taking that representation and applying it to some problem (e.g., predicting the next word, understanding sentiment, etc.). Referenced from a [fast.ai forum](https://forums.fast.ai/t/what-is-an-encoder-and-what-is-save-load-encoder-actually-doing/8281/3).*

In [28]:
learn.save_encoder('headlines-awd.pkl')

Note that we also want to save the whole model, so we can reuse it in our Twitter bot. 


In [29]:
learn.export('headlines-lm.pkl')

# Step 2b: See how well the language model works (15 points)

Try generating a few more headlines. Then, answer the following questions. Wherever possible, show what code you ran, or what predictions you asked it for. *Suggestion: Try using punctuation, numbers, texts of different lengths etc.*

### Q: What is the effect of starting with longer strings? (5 points)

We could start our headline generation with just one word, e.g. `learn.predict('White', n_words=9)` or with many: `learn.predict('White House Says Whistleblower Did', n_words=5)`. What is the difference you see in the kinds of headlines generated?


In [35]:
## Your answer here. Insert more cells if you want to insert code etc.
learn.predict('White', n_words=9)

'White house officials agree to issue ethics enforcement semitic payments'

In [51]:
learn.predict('White', n_words=4)

'White house recipients choose to'

In [49]:
learn.predict('White House Says Whistleblower Did', n_words=5)

"White House Says Whistleblower Did ' die he got it"

In [43]:
learn.predict('White House is located in', n_words=2)

'White House is located in the theater'

In [46]:
learn.predict('White House is located in', n_words=10)

'White House is located in elusive flying carrier , looking famous gubernatorial vermont xxbos population'

In [55]:
learn.predict('White House is located in', n_words=10)

'White House is located in Chicago where it runs past 12 location dog not'

Answer: After experimenting a bit, I think that headlines that start with more words stay closer to the topic we have in mind. However, the more words we ask the model to generate, the more likely it is to emit special tokens such as `xxbos` (fastai's marker for the beginning of a new text) in the middle of the sentence. 
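Since `xxbos` marks the start of a new text, one simple way to tidy a generated headline is to cut it off at the first `xxbos` that appears after some words. The helper below is a hypothetical post-processing sketch, not part of fastai; the two inputs are outputs generated earlier in this notebook.

```python
def first_headline(text, boundary="xxbos"):
    """Hypothetical helper: keep the words up to the first boundary token."""
    kept = []
    for token in text.split():
        if token == boundary:
            if kept:
                break      # a new headline starts here, so stop
            continue       # ignore a boundary token at the very start
        kept.append(token)
    return " ".join(kept)

cleaned_a = first_headline("16 Problems with an empire xxbos signature")
cleaned_b = first_headline(
    "White House is located in elusive flying carrier , "
    "looking famous gubernatorial vermont xxbos population"
)
print(cleaned_a)
print(cleaned_b)
```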

### Q: What aspects of the task of generating headlines does our language model do well? (5 points)
For example, does it get grammar right? Does it know genders of people or objects? etc.

In [66]:
#Your answer here. Insert more cells if you want to insert code etc.
learn.predict('Donald Trump is the president', n_words=5)

'Donald Trump is the president of the new york times'

In [75]:
learn.predict('Inclement weather prevents liars from', n_words=4)

'Inclement weather prevents liars from arriving end of year'

Answer: Here is what I think the model does well: 
* The model uses advanced words such as gubernatorial and semitic. (These words may no longer appear in the outputs above if the notebook is run again.)
* The model uses numbers and words together. 
* The model is able to generate the names of famous people, using both first name and surname.

### Q: What aspects of the task of generating headlines does our model do poorly? (5 points)
What does it frequently get wrong? Why might it make these mistakes?



In [68]:
## Your answer here
learn.predict('twinkle twinkle little', n_words=2)

'twinkle twinkle little bit for'

Answer: Here is what I think the model does poorly: 
* Grammar is frequently wrong. I think this is because, even after learning from Wikipedia, the model has not fully picked up the rules of sentence formation and grammar usage.
* When more than 3 or 4 words are requested, the model starts adding `xxbos` in unexpected places. 
* Sometimes the model produces fewer words than the user asked for. 
* The model does not know context or associations, such as that Michael Jordan is a basketball player or that Donald Trump is the president of the USA. This is because the sentence generation is random.
* The model is unable to complete rhymes that everyone knows, such as 'twinkle twinkle little star'. This is perhaps because the Wikipedia pages might not contain this text.
* The model is unable to start a new sentence; instead, it adds commas. This is because even punctuation is something that needs to be taught to the model. 

# Step 3: Learn a classifier to see which headlines are satire

Remember, our dataset has some stories that are satire (from the Onion) and others that are real. Now, we're going to train a classifier to distinguish one from the other. 

In [76]:
data_clas = (
    TextList.from_df(df=headlines, path=data_path, vocab=data_lm.train_ds.vocab, cols='headline')
    .split_by_rand_pct(valid_pct=0.2)
    .label_from_df(cols='is_sarcastic')
    .databunch()
)


We're using a similar databunch method as we did for our language model above. Here, we are using `split_by_rand_pct` so we keep some fraction of our dataset as a validation set. There is one other trick: `vocab= data_lm.train_ds.vocab` ensures that our classifier only uses words that we have in our language model -- so it never deals with words it hasn't encountered before. (Consider: why is this important?)

*Note to self: reusing `data_lm.train_ds.vocab` could be important so that the context and intent pertain to the dataset. But it could also be a limiting factor for the type of responses that are given out.*

See if you can work out what the other arguments are. 

In [77]:
data_clas.show_batch()

text,target
"xxbos hot wheels ranked number one toy for rolling down ramp , knocking over xxunk that send xxunk down a funnel , dropping onto teeter - xxunk that yanks on string , causing xxunk system to raise wooden block , xxunk series of twine xxunk that unwind spring , launching tennis ball across room , xxunk tire down slope until it hits power switch , xxunk table fan that blows",1
"xxbos ' how do we treat the little people , joan ? ' i asked . and she said , ' why , we treat them better . we only s -- t on people at our level or higher . '",0
"xxbos jared kushner claims that russian interference less damaging to u.s . democracy than saudi arabia , nepotism , israel , cambridge analytica , uae , illicit donations , erik prince , bill barr , and financial xxunk",1
"xxbos ' men are not xxunk , ' says woman who has no idea what it like to take two whole xxunk to get to your clothing section at zara",1
"xxbos annoying guy in movie theater constantly screaming ' get out of there , you idiot ' at bradley cooper 's character in ' a star is born '",1


Above: what our data looks like after we apply the vocabulary restriction. `xxunk` is an unknown word. 
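The vocabulary restriction can be pictured as a simple lookup: any token that is not in the vocabulary is replaced by `xxunk`. The sketch below uses a tiny hand-made vocabulary and tokens modelled on the first headline in the dataset; the helper name is hypothetical, and fastai's own tokenization and numericalization do more than this.

```python
UNKNOWN = "xxunk"

def restrict_to_vocab(tokens, vocab):
    """Hypothetical helper: replace out-of-vocabulary tokens with xxunk."""
    return [tok if tok in vocab else UNKNOWN for tok in tokens]

# Tiny hand-made vocabulary; the real one comes from the language model data.
vocab = {"xxbos", "scientists", "unveil", "doomsday", "clock", "of", "hair", "loss"}

tokens = "xxbos thirtysomething scientists unveil doomsday clock of hair loss".split()
restricted = restrict_to_vocab(tokens, vocab)
print(" ".join(restricted))
```

Because `thirtysomething` is missing from this toy vocabulary, it turns into `xxunk`, while every other word passes through unchanged.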

Below: we're creating a classifier. 

In [78]:
classify = text_classifier_learner(data=data_clas, arch=AWD_LSTM, drop_mult=0.5)

Remember that language model we saved earlier? It's time to load it back!

In [79]:
classify.load_encoder('headlines-awd.pkl')

RNNLearner(data=TextClasDataBunch;

Train: LabelList (22896 items)
x: TextList
xxbos xxunk scientists unveil doomsday clock of hair loss,xxbos dem rep . totally nails why congress is falling short on gender , racial equality,xxbos mother comes pretty close to using word ' streaming ' correctly,xxbos my white inheritance,xxbos 5 ways to file your taxes with less stress
y: CategoryList
1,0,1,0,0
Path: haii-assignment4;

Valid: LabelList (5723 items)
x: TextList
xxbos someone xxunk the ' elf ' trailer as a thriller , and it 's terrifying,xxbos gun lobbyist warns gun owners could resort to ' bullet box ' if they do n't like election results,xxbos man stays up most of night rocking cat back to sleep,xxbos ariana grande xxunk saves tidal with musical xxunk on ' snl ',xxbos fired u.s . attorney xxunk xxunk said to have been investigating hhs secretary tom price
y: CategoryList
0,0,1,0,0
Path: haii-assignment4;

Test: None, model=SequentialRNN(
  (0): MultiBatchEncoder(
    (module): AWD_LSTM(

What's happening here? 

Here's the trick: a language model predicts the next word in a sequence using all the information it has so far (all the previous words). When we train a classifier, we ask it to predict the label (satire or not) instead of the next word. 

The intuition here is that if you can tell what the next word in a sentence is, you can tell if it is satirical. (Similarly, if you can tell what the next word in an email is, you can tell if it is spam, etc.)

In [80]:
classify.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.445014,0.373388,0.835226,02:04


In [81]:
classify.freeze_to(-2)

Above: this is similar to the `unfreeze()` we used before, except that only the last few layer groups of the model are allowed to change (here, `freeze_to(-2)` unfreezes the last two). Then we can train again, just as we did after `unfreeze()`.
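The effect of `freeze_to` can be sketched with ordinary list slicing: groups before index `n` are frozen and groups from `n` onward are trainable. The group names below are placeholders, and the sketch ignores details of how fastai handles individual layers inside a group.

```python
def trainable_after_freeze_to(layer_groups, n):
    """Hypothetical helper: map each layer group to True if it stays trainable."""
    trainable = set(layer_groups[n:])
    return {name: name in trainable for name in layer_groups}

# Placeholder names; the real model defines its own layer groups.
groups = ["group_0", "group_1", "group_2", "group_3", "group_4"]

status_by_group = trainable_after_freeze_to(groups, -2)
for name, is_trainable in status_by_group.items():
    print(name, "trainable" if is_trainable else "frozen")
```

With `-2`, only the last two groups are trainable, which is a gentler step than unfreezing everything at once.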

In [82]:
classify.fit_one_cycle(1, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,0.392299,0.329438,0.857417,02:31


Wow! An accuracy of about 86%! That sounds great, and for not that much work. 

Now, let's try it on some headlines, to see how well it does. 

# Step 4: try out the classifier (20 points)

In [87]:
classify.predict("Despair for Many and Silver Linings for Some in California Wildfires")

(Category 0, tensor(0), tensor([0.8974, 0.1026]))

Here in the output, the first part of this tuple is the chosen category (`0`, i.e. not satire), and the last part is an array of probabilities. The classifier suggests that the headline (which I got from the [New York Times](https://www.nytimes.com/2019/10/29/us/california-fires-homes.html?action=click&module=Top%20Stories&pgtype=Homepage)) is not satire, with about 90% confidence. 
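Here is a small sketch of how to read that tuple in code. Plain Python values stand in for the fastai `Category` and the tensors, so this runs without fastai; the numbers are copied from the output above.

```python
# Stand-in for the tuple returned by classify.predict(...):
# (category, class index, probability for each class).
prediction = ("0", 0, [0.8974, 0.1026])

category, index, probabilities = prediction

# The confidence is the probability assigned to the chosen class.
confidence = probabilities[index]
label = "satire" if index == 1 else "not satire"
print(f"{label} with {confidence:.0%} confidence")
```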

## Step 4a: Try out this classifier (10 points)

Below, try the classifier with some headlines, real or made up (including made up by the language model above). 


In [84]:
## Two headlines that the classifier correctly classifies (1 point)

In [97]:
classify.predict("Joker' smashes October box office record with $93.5M debut")

(Category 0, tensor(0), tensor([0.6481, 0.3519]))

In [110]:
classify.predict("Woman wakes up to find a cat staring in her face")

(Category 1, tensor(1), tensor([0.0679, 0.9321]))

In [67]:
## Two headlines that the classifier classifies incorrectly (1 point)

In [108]:
classify.predict("You may be at risk of throat cancer if you have a throat or mouth")

(Category 0, tensor(0), tensor([0.9494, 0.0506]))

In [109]:
classify.predict("Man takes a Harley Davidson bike for a test ride and steals it")

(Category 0, tensor(0), tensor([0.8899, 0.1101]))

Now, we want to find two headlines that the classifier is really confident about, but classifies incorrectly. We want the confidence of the prediction to be at least 85%.

One headline is anything you want to write. Another must be a real headline (not satire) that you could trick the classifier into misclassifying by changing only one word. For instance, taking `"Despair for Many and Silver Linings for Some in California Wildfires"`, a real NYTimes headline, you can change it to `"Despair for Many and Silver Linings for Some in Oregon Wildfires"` (note that this particular change does not cause the classifier to misclassify).
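When hunting for such headlines, it can help to check both conditions at once: the prediction is wrong, and the probability of the predicted class is at least 85%. The helper below is a hypothetical sketch that takes a plain list of probabilities (for example from `probs.tolist()` on the tensor) and the label you believe is correct.

```python
def is_confident_mistake(probabilities, true_label, threshold=0.85):
    """Hypothetical helper: True if the top class is wrong and at least `threshold` likely."""
    predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
    return predicted != true_label and probabilities[predicted] >= threshold

# Example probabilities; the true labels here are supplied by hand.
print(is_confident_mistake([0.8899, 0.1101], true_label=1))  # wrong and confident
print(is_confident_mistake([0.6315, 0.3685], true_label=1))  # wrong, but not confident enough
print(is_confident_mistake([0.8974, 0.1026], true_label=0))  # correct prediction
```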

In [122]:
## Insert one headline that the classifier classifies incorrectly, with false high confidence. (4 points)
classify.predict("Man takes a Harley Davidson bike for a test ride and steals it")

(Category 0, tensor(0), tensor([0.8899, 0.1101]))

In [276]:
## Insert one headline that the classifier classifies incorrectly, with false high confidence. (4 points)
classify.predict("Mississippi judge accused of selling fake party plans to senior citizens")
# Also, insert link to the original headline/article.

(Category 0, tensor(0), tensor([0.6315, 0.3685]))

link to original headline: https://www.wmcactionnews5.com/2019/05/21/mississippi-coroner-accused-selling-fake-funeral-plans-senior-citizens/

In [230]:
classify.predict("Mississippi coroner accused of selling fake funeral plans to senior citizens")

(Category 1, tensor(1), tensor([0.4767, 0.5233]))

## Step 4b: What kinds of headlines are misclassified? (10 points)

Write your hypothesis below on what kinds of headlines are misclassified. If it helps you, use the [TextClassificationInterpretation](https://docs.fast.ai/text.learner.html#TextClassificationInterpretation) utility. Show your work, especially if you use this utility.

In [254]:
## Show work here
import matplotlib.cm as cm

In [258]:
txt_ci = TextClassificationInterpretation.from_learner(classify)
test_text = "Mississippi coroner accused of selling fake party plans to senior citizens"
txt_ci.show_intrinsic_attention(test_text,cmap=cm.Purples)



In [277]:
txt_ci.intrinsic_attention(test_text)[1]



tensor([0.2414, 0.1866, 0.6415, 1.0000, 0.4387, 0.1313, 0.4741, 0.5150, 0.1970,
        0.2122, 0.1205, 0.5394, 0.2369])

In [278]:
txt_ci = TextClassificationInterpretation.from_learner(classify)
test_text = "You may be at risk of throat cancer if you have a throat or mouth"
txt_ci.show_intrinsic_attention(test_text,cmap=cm.Purples)



(Add your interpretation here)

# Step 5: Save your classifier
Now that we've trained the classifier, you're ready for Part 2. You'll use this saved file in your bot later.

In [247]:
classify.export(file='satire_awd.pkl')

Later, you'll use it like so.

In [248]:
serve_classifier = load_learner(path=data_path, file='satire_awd.pkl')
serve_lm = load_learner(path=data_path, file='headlines-lm.pkl')

In [249]:
serve_classifier.predict('How the New Syria Took Shape')

(Category 0, tensor(0), tensor([0.9793, 0.0207]))

In [252]:
serve_lm.predict('Rising Seas', n_words=7)

'Rising Seas : the 6 gates going which have'

# Step 6: add the bot code. 

See the assignment document for what the bot code should look like. You can add it just below here, but you are also welcome to create a new notebook where you put that code. 

*Process of building the bot:*
* tweepy installed in new env for this notebook using `conda install -c conda-forge tweepy` in command line

In [356]:
#Import the necessary methods from tweepy library
import tweepy
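# Assumes the local credentials.py defines these four names
from credentials import consumer_key, consumer_secret, access_token, access_token_secret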

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

In [357]:
user = api.me()
print(user.name)

Harshika Jain


In [378]:
# Define the search
query = '@CMUInnovation'
max_tweets = 100

# Do the search
searched_tweets = []
last_id = -1
while len(searched_tweets) < max_tweets:
    count = max_tweets - len(searched_tweets)
    try:
        new_tweets = api.search(q=query, count=count, max_id=str(last_id - 1))
        if not new_tweets:
            break
        searched_tweets.extend(new_tweets)
        last_id = new_tweets[-1].id
    except tweepy.TweepError as e:
        # Depending on TweepError.code, one may want to retry or wait.
        # To keep things simple, we give up on an error.
        break

# Iterate over the search results
for status in searched_tweets:
    # do something with all these tweets
    print(status)

Status(_api=<tweepy.api.API object at 0x1a2f29e210>, _json={'created_at': 'Mon Nov 04 14:27:07 +0000 2019', 'id': 1191361234035236865, 'id_str': '1191361234035236865', 'text': 'Team Presentation Photos from @IBM #BlueHack are now up on our FB page! Go check them out &amp; tag your team members!… https://t.co/0ZkSVws57k', 'truncated': True, 'entities': {'hashtags': [{'text': 'BlueHack', 'indices': [35, 44]}], 'symbols': [], 'user_mentions': [{'screen_name': 'IBM', 'name': 'IBM', 'id': 18994444, 'id_str': '18994444', 'indices': [30, 34]}], 'urls': [{'url': 'https://t.co/0ZkSVws57k', 'expanded_url': 'https://twitter.com/i/web/status/1191361234035236865', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [120, 143]}]}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_t

In [379]:
api.update_status('I’m on a plane!')

Status(_api=<tweepy.api.API object at 0x1a2f29e210>, _json={'created_at': 'Mon Nov 04 20:33:16 +0000 2019', 'id': 1191453378766868480, 'id_str': '1191453378766868480', 'text': 'I’m on a plane!', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': []}, 'source': '<a href="https://www.github.com/harshikerfuffle" rel="nofollow">harapug</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 704772464769765376, 'id_str': '704772464769765376', 'name': 'Harshika Jain', 'screen_name': 'harshikerfuffle', 'location': 'Pittsburgh, PA', 'description': 'User Experience Designer | NIDian | Tartan', 'url': 'https://t.co/ZZ9imZvwFE', 'entities': {'url': {'urls': [{'url': 'https://t.co/ZZ9imZvwFE', 'expanded_url': 'http://cryptdecrypt.blogspot.in', 'display_url': 'cryptdecrypt.blogspot.in', 'indices': [0, 23]}]}, 'description': {'urls': 

In [380]:
api.update_status(
  'this is a reply! @@CMUInnovation', status.author.screen_name,
  status.id_str
)

Status(_api=<tweepy.api.API object at 0x1a2f29e210>, _json={'created_at': 'Mon Nov 04 20:33:27 +0000 2019', 'id': 1191453424820326402, 'id_str': '1191453424820326402', 'text': 'this is a reply! @@CMUInnovation', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'CMUInnovation', 'name': 'CMU Innovation', 'id': 2337236814, 'id_str': '2337236814', 'indices': [18, 32]}], 'urls': []}, 'source': '<a href="https://www.github.com/harshikerfuffle" rel="nofollow">harapug</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 704772464769765376, 'id_str': '704772464769765376', 'name': 'Harshika Jain', 'screen_name': 'harshikerfuffle', 'location': 'Pittsburgh, PA', 'description': 'User Experience Designer | NIDian | Tartan', 'url': 'https://t.co/ZZ9imZvwFE', 'entities': {'url': {'urls': [{'url': 'https://t.co/ZZ9imZvwFE', 'ex
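In the output above, `in_reply_to_status_id` is `None`, so the tweet was posted as a standalone status rather than as a reply. A minimal sketch of one way to build a proper reply follows, assuming tweepy's `update_status` accepts `status` and `in_reply_to_status_id` as keyword arguments, and that the reply text has to mention the author of the original tweet. The helper name, screen name and id are placeholders.

```python
def build_reply(screen_name, status_id, message):
    """Hypothetical helper: keyword arguments for a reply to one tweet."""
    return {
        # Mention the original author so the tweet is threaded as a reply.
        "status": f"@{screen_name} {message}",
        # Id of the tweet being replied to.
        "in_reply_to_status_id": status_id,
    }

# Placeholder values; in the notebook these would come from
# status.author.screen_name and status.id_str.
reply_kwargs = build_reply("example_user", "1234567890", "this is a reply!")
print(reply_kwargs)

# With an authenticated client, the reply would then be posted with:
# api.update_status(**reply_kwargs)
```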

# Appendix

In [173]:
classify.predict("Why poor people tend to be more generous than the rich")

(Category 0, tensor(0), tensor([0.9350, 0.0650]))

In [187]:
classify.predict("World bank says poor need more money")

(Category 1, tensor(1), tensor([0.2322, 0.7678]))

In [188]:
classify.predict("Police: Fake Money Making The Rounds In Pittsburgh Area")

(Category 0, tensor(0), tensor([0.8743, 0.1257]))

In [213]:
classify.predict("Police: Fake Money making The bloopers In Pittsburgh Area")

(Category 0, tensor(0), tensor([0.7909, 0.2091]))

In [215]:
classify.predict("Man throws 20ft up National Christmas Tree near White House")

(Category 1, tensor(1), tensor([0.1094, 0.8906]))

In [129]:
classify.predict("I Accidentally Uncovered a Nationwide Scam on 9Gag")

(Category 0, tensor(0), tensor([0.8012, 0.1988]))

In [182]:
classify.predict("Parents Scale school Building to Help Students cheat on Exams")

(Category 1, tensor(1), tensor([0.1536, 0.8464]))

In [217]:
classify.predict("Man Buys Solid Gold Shirt to score with the Ladies")

(Category 1, tensor(1), tensor([0.2849, 0.7151]))

In [186]:
classify.predict("Parents Scale office building to Help apple cheat off Exams")

(Category 1, tensor(1), tensor([0.1436, 0.8564]))