## Analyzing Tweets

We can gather a sample of Twitter data using the Twitter API (https://dev.twitter.com). To do so, we'll need to create a Twitter application and get credentials for it. You can do this manually at https://app.twitter.com. Once you have an app, go to the "Keys and Access Tokens" tab to find your credentials.

In [1]:
import twitter

In [125]:
# Fill these in!

CONSUMER_KEY = ''
CONSUMER_SECRET = ''
ACCESS_TOKEN = ''
ACCESS_SECRET = ''
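
If you'd rather not paste secrets into the notebook, one option is to read them from environment variables. This is only a sketch: the variable names below (`TWITTER_CONSUMER_KEY` and so on) are hypothetical, so pick whatever names suit your setup.

```python
import os

# Hypothetical environment variable names; nothing in the Twitter API requires them.
CONSUMER_KEY = os.environ.get('TWITTER_CONSUMER_KEY', '')
CONSUMER_SECRET = os.environ.get('TWITTER_CONSUMER_SECRET', '')
ACCESS_TOKEN = os.environ.get('TWITTER_ACCESS_TOKEN', '')
ACCESS_SECRET = os.environ.get('TWITTER_ACCESS_SECRET', '')

# Report anything that is still empty before we try to authenticate.
credentials = {
    'CONSUMER_KEY': CONSUMER_KEY,
    'CONSUMER_SECRET': CONSUMER_SECRET,
    'ACCESS_TOKEN': ACCESS_TOKEN,
    'ACCESS_SECRET': ACCESS_SECRET,
}
missing = [name for name, value in credentials.items() if not value]
if missing:
    print('Missing credentials:', ', '.join(missing))
```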

In [3]:
api = twitter.Twitter(
    auth=twitter.OAuth(
        consumer_key=CONSUMER_KEY,
        consumer_secret=CONSUMER_SECRET,
        token=ACCESS_TOKEN,
        token_secret=ACCESS_SECRET,
    ))

stream = twitter.TwitterStream(
    auth=twitter.OAuth(
        consumer_key=CONSUMER_KEY,
        consumer_secret=CONSUMER_SECRET,
        token=ACCESS_TOKEN,
        token_secret=ACCESS_SECRET,
    ))

In [4]:
import itertools
%time st = list(itertools.islice(stream.statuses.sample(), 0, 1000))

CPU times: user 1.33 s, sys: 256 ms, total: 1.59 s
Wall time: 21.5 s


In [12]:
[t.get('text', None) for t in st][:10]

[None,
 'Your reading Into it ....',
 'RT @WTFFacts: Short term memory is the key to academic achievement.',
 '7/30の日曜日は札幌ゲイ文館で私と握手',
 'RT @people: Driver Allegedly Claims He Didn’t Know He Was Smuggling People in Incident that Left 10 Dead https://t.co/p3zOyot4ai',
 'Akira, cielo. —queesunaniñaunmomentoqueledaalgo—',
 'With another White House delay, rule to bolster safety data on generic labels may be dead https://t.co/SsyEMlNKvG via @statnews',
 "RT @rainnwilson: America is 5% of the worlds population but has 50% of the worlds guns. Cmon! Let's get that number up, people!",
 'Ben çocuklarıma babanızı instagramdan buldum diyemem mesela',
 'RT @fengzilin0312: #송중기 肤浅的我今天就是一条颜🐶 https://t.co/gxxVeFeIn7']
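
The `None` at the top of that list is a stream message with no `text` field. The sample stream mixes other message types, such as deletion notices, in with the tweets. A minimal sketch of filtering them out, using made-up stand-ins for the stream messages rather than real API output:

```python
# Hypothetical stand-ins for messages yielded by the stream.
messages = [
    {'delete': {'status': {'id': 1}}},  # deletion notice: no 'text' key
    {'text': 'Your reading Into it ....'},
    {'text': 'RT @WTFFacts: Short term memory is the key to academic achievement.'},
]

# Keep only the messages that actually carry tweet text.
tweets = [m for m in messages if 'text' in m]
texts = [m['text'] for m in tweets]
print(len(messages), 'messages,', len(texts), 'with text')
```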

In [1]:
import pandas as pd
import nb_utils

emotion_csv = nb_utils.download('https://www.crowdflower.com/wp-content/uploads/2016/07/text_emotion.csv')
emotion_df = pd.read_csv(emotion_csv)

In [2]:
emotion_df.head()

Unnamed: 0,tweet_id,sentiment,author,content
0,1956967341,empty,xoshayzers,@tiffanylue i know i was listenin to bad habi...
1,1956967666,sadness,wannamama,Layin n bed with a headache ughhhh...waitin o...
2,1956967696,sadness,coolfunky,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,czareaquino,wants to hang out with friends SOON!
4,1956968416,neutral,xkilljoyx,@dannycastillo We want to trade with someone w...


In [3]:
emotion_df['sentiment'].value_counts()

neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
boredom        179
anger          110
Name: sentiment, dtype: int64

## Trying out a simple learner

Before we try to build our deep learning models, let's make sure we can learn something using a simple linear model.

In [4]:
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
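# sklearn.cross_validation was deprecated and later removed; cross_val_score lives in model_selection.
from sklearn.model_selection import cross_val_score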

VOCAB_SIZE = 50000

tfidf_vec = TfidfVectorizer(max_features=VOCAB_SIZE)
label_encoder = LabelEncoder()

linear_x = tfidf_vec.fit_transform(emotion_df['content'])
linear_y = label_encoder.fit_transform(emotion_df['sentiment'])

sgd = SGDClassifier(loss='hinge')
bayes = MultinomialNB()



In [5]:
{ 
    'sgd': cross_val_score(sgd, linear_x, linear_y),
    'bayes': cross_val_score(bayes, linear_x, linear_y),
}

{'bayes': array([ 0.24615731,  0.28706412,  0.28954082]),
 'sgd': array([ 0.29196971,  0.32943382,  0.320003  ])}

## Checking what our model learned

Our linear models appear to be learning something more powerful than "pick the most popular category".  We can take a quick look at which words they find the most correlated with each category before moving on to our neural network.
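
To make that comparison concrete, here is a small sketch that computes the "pick the most popular category" baseline from the label counts printed earlier and sets it next to the mean cross-validation scores. The numbers are copied from the outputs above; nothing is re-trained here.

```python
from statistics import mean

# Label counts copied from the value_counts() output above.
label_counts = {
    'neutral': 8638, 'worry': 8459, 'happiness': 5209, 'sadness': 5165,
    'love': 3842, 'surprise': 2187, 'fun': 1776, 'relief': 1526,
    'hate': 1323, 'empty': 827, 'enthusiasm': 759, 'boredom': 179, 'anger': 110,
}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
# Accuracy we would get by always predicting the most common label.
majority_baseline = label_counts[majority_label] / total

# Fold scores copied from the cross_val_score output above.
cv_scores = {
    'bayes': [0.24615731, 0.28706412, 0.28954082],
    'sgd': [0.29196971, 0.32943382, 0.320003],
}
cv_means = {name: mean(scores) for name, scores in cv_scores.items()}

print('always-%s baseline: %.3f' % (majority_label, majority_baseline))
for name, score in sorted(cv_means.items()):
    print('%s mean accuracy: %.3f' % (name, score))
```

Always guessing `neutral` would be right a little under 22% of the time, so both linear models clear that bar, though not by a wide margin.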

In [6]:
fitted_sgd = sgd.fit(linear_x, linear_y)
fitted_sgd

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)

In [7]:
import numpy as np
inverse_vocab = { v:k for (k,v) in tfidf_vec.vocabulary_.items() }

for i, klass in enumerate(label_encoder.classes_):
    sorted_coef = np.argsort(fitted_sgd.coef_[i])
    print(klass, [inverse_vocab[j] for j in sorted_coef[-5:]])

anger ['rigging', 'transtelecom', 'aaaaaaaaaaa', 'fridaaaayyyyy', 'confuzzled']
boredom ['cleanin', 'interminable', 'squeaking', 'meanmillies', 'documentation']
empty ['bethsybsb', 'makinitrite', '_cheshire_cat_', 'conversating', 'kimbermuffin']
enthusiasm ['krisswouldhowse', 'sotongs', 'foolproofdiva', 'lena_distractia', 'npyskater']
fun ['yaaaaay', 'threee', 'yeaahh', 'tunes', 'bamboozle']
happiness ['werewolfseth', 'wars', 'juddday', 'excellent', 'woohoo']
hate ['cricinfo', 'bastard', 'zomberellamcfox', 'grrrr', 'hate']
love ['loved', 'mommies', 'loving', 'mothers', 'love']
neutral ['surfin', 'itchy', 'gut', 'frenchieb', 'mcraddictal']
relief ['mastered', 'surviving', 'relaxed', 'relief', 'chiacy']
sadness ['sadly', 'disappointed', 'depressing', 'cried', 'sad']
surprise ['suprisingly', 'titanite', 'surprised', 'himym', 'surprise']
worry ['scared', 'nervous', 'poor', 'throat', 'worried']


## Training a deep model

Now that we've seen how well a simple linear model can do, let's see if we can do any better with a deep learning model. We don't have much training data here, which constrains the models we can train effectively: if the model is too big, we'll end up overfitting.

We'll use pre-trained word embeddings again to bootstrap our model.

## Featurizing and preparing our data

Just like we did when computing word embeddings, we want to featurize our data so we can classify it effectively.
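
As a rough picture of what this featurization does, the sketch below mimics the two Keras steps we rely on: mapping words to integer ids ordered by frequency (with 0 reserved for padding), and left-padding every sequence to the same length. It is a simplification under stated assumptions: it splits on whitespace only, whereas the real `Tokenizer` also strips punctuation, and the helper names are our own.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign ids starting at 1, most frequent word first (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def texts_to_sequences(texts, word_index, num_words):
    """Replace each known word by its id, dropping ids at or above num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_left(sequences, pad_value=0):
    """Left-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(s) for s in sequences)
    return [[pad_value] * (max_len - len(s)) + s for s in sequences]

texts = ['happy happy day', 'sad day']
word_index = fit_word_index(texts)
padded = pad_left(texts_to_sequences(texts, word_index, num_words=50000))
print(word_index)
print(padded)
```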

In [8]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import one_hot

VOCAB_SIZE = 50000
tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(emotion_df['content'])

Using TensorFlow backend.


In [9]:
# This may take a while to load

w2v, idf = nb_utils.load_w2v(tokenizer)

In [30]:
tokens = tokenizer.texts_to_sequences(emotion_df['content'])
tokens = pad_sequences(tokens)
labels = label_encoder.transform(emotion_df['sentiment'])

training_count = int(0.9 * len(tokens))
training_tokens, training_labels = tokens[:training_count], labels[:training_count]
test_tokens, test_labels = tokens[training_count:], labels[training_count:]

In [38]:
import tensorflow as tf
from keras import layers, models
import keras.backend as K


def make_embedding(name, vocab_size, embedding_size, weights=None, mask_zero=True):
    if weights is not None:
        return layers.Embedding(mask_zero=mask_zero, input_dim=vocab_size, 
                                output_dim=weights.shape[1], 
                                weights=[weights], trainable=False, 
                                name='%s/embedding' % name)
    else:
        return layers.Embedding(mask_zero=mask_zero, input_dim=vocab_size, 
                                output_dim=embedding_size,
                                name='%s/embedding' % name)

def create_unigram_model(vocab_size, embedding_size=None, embedding_weights=None, idf_weights=None):
    assert not (embedding_size is None and embedding_weights is None)
    message = layers.Input(shape=(None,), dtype='int32', name='message')
    
    embedding = make_embedding('message_vec', vocab_size, embedding_size, embedding_weights)
    idf = make_embedding('message_idf', vocab_size, embedding_size, idf_weights)

    mask = layers.Masking(mask_value=0)
    def _combine_and_sum(args):
        [embedding, idf] = args
        return K.sum(embedding * K.abs(idf), axis=1)

    sum_layer = layers.Lambda(_combine_and_sum, name='combine_and_sum')
    sum_msg = sum_layer([mask(embedding(message)), idf(message)])
    fc1 = layers.Dense(units=128, activation='relu')(sum_msg)
    categories = layers.Dense(units=len(label_encoder.classes_), activation='softmax')(fc1)
    
    model = models.Model(
        inputs=[message],
        outputs=categories,
    )
    
    model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    model.summary()
    return model

unigram_model = create_unigram_model(vocab_size=VOCAB_SIZE,
                                     embedding_weights=w2v,
                                     idf_weights=idf)

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
message (InputLayer)             (None, None)          0                                            
____________________________________________________________________________________________________
message_vec/embedding (Embedding (None, None, 300)     15000000    message[0][0]                    
____________________________________________________________________________________________________
masking_7 (Masking)              (None, None, 300)     0           message_vec/embedding[0][0]      
____________________________________________________________________________________________________
message_idf/embedding (Embedding (None, None, 1)       50000       message[0][0]                    
___________________________________________________________________________________________
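
The core of this model is the `_combine_and_sum` step: each word vector is scaled by the absolute value of that word's IDF weight, and the scaled vectors are summed over the message. Here is a plain-Python sketch of that arithmetic. The 2-dimensional embeddings and the IDF values are made up for illustration; the real model uses the `w2v` and `idf` matrices.

```python
# Hypothetical embeddings and IDF weights, indexed by token id (0 is padding).
embeddings = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
idf_weights = {0: 0.0, 1: 0.5, 2: 2.0, 3: -1.0}

def combine_and_sum(token_ids):
    """Sum the embeddings of a message, each scaled by |idf| of its token."""
    dim = len(embeddings[0])
    total = [0.0] * dim
    for token in token_ids:
        weight = abs(idf_weights[token])
        for d in range(dim):
            total[d] += embeddings[token][d] * weight
    return total

message = [0, 0, 1, 2, 3]  # left-padded token ids
message_vector = combine_and_sum(message)
print(message_vector)
```

The result is a single fixed-size vector per message regardless of length, which is what lets us feed it straight into the dense layers.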

In [41]:
unigram_model.fit(training_tokens, training_labels, epochs=10)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7fc599833400>

In [42]:
unigram_model.evaluate(test_tokens, test_labels, verbose=2)

[2.8783021488189697, 0.29899999999999999]

## Learning Embeddings

It looks like our model with pre-trained embeddings isn't doing any better than the linear models: its test accuracy of about 30% is in the same range as the cross-validation scores we saw earlier.

We can also try training a model "from scratch", and learn the word embeddings from our training data.  Note that we use a small embedding size here to speed up training and to try to avoid overfitting.

Training for only 10 epochs stops the model while it is still improving on the training set, which helps keep it from overfitting. We can formalize this by using a validation set and early stopping.
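
The idea behind early stopping is simple enough to sketch in plain Python: watch the loss on a held-out validation set after each epoch, and stop once it has failed to improve for a set number of epochs (the "patience"). The loss values below are hypothetical, not measurements from this notebook. In practice Keras offers an `EarlyStopping` callback that does this bookkeeping during `fit`.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 0-indexed.

    Training stops at the first epoch that is `patience` epochs past the
    best validation loss seen so far, or at the last epoch otherwise.
    """
    best_loss = float('inf')
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation losses: improving at first, then creeping back up.
hypothetical_val_losses = [2.10, 1.98, 1.93, 1.95, 1.99, 2.05]
best_epoch, stop_epoch = early_stopping_epoch(hypothetical_val_losses, patience=2)
print('best epoch:', best_epoch, 'stopped at epoch:', stop_epoch)
```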

In [43]:
learned_embeddings_model = create_unigram_model(vocab_size=VOCAB_SIZE, embedding_size=25)

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
message (InputLayer)             (None, None)          0                                            
____________________________________________________________________________________________________
message_vec/embedding (Embedding (None, None, 25)      1250000     message[0][0]                    
____________________________________________________________________________________________________
masking_8 (Masking)              (None, None, 25)      0           message_vec/embedding[0][0]      
____________________________________________________________________________________________________
message_idf/embedding (Embedding (None, None, 25)      1250000     message[0][0]                    
___________________________________________________________________________________________

In [45]:
learned_embeddings_model.fit(training_tokens, training_labels, epochs=10, batch_size=128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc56477d160>

In [46]:
# Note the test set accuracy is lower than that on the training set.

learned_embeddings_model.evaluate(test_tokens, test_labels, verbose=2)

[2.0585128955841063, 0.35225000000000001]

## More Complex Models

As with our previous task, we can try using more powerful models to classify our text. In this case, the small training set and the short length of each message limit their effectiveness.

In [105]:
def create_cnn_model(vocab_size, embedding_size=None, embedding_weights=None):
    message = layers.Input(shape=(None,), dtype='int32', name='title')
    
    # The convolution layer in keras does not support masking, so we just allow
    # the embedding layer to learn an explicit value.
    embedding = make_embedding('message_vec', vocab_size, embedding_size, embedding_weights,
                              mask_zero=False)

    def _combine_sum(v):
        return K.sum(v, axis=1)

    cnn_1 = layers.Convolution1D(128, 3)
    cnn_2 = layers.Convolution1D(128, 3)
    cnn_3 = layers.Convolution1D(128, 3)
    
    global_pool = layers.GlobalMaxPooling1D()
    local_pool = layers.MaxPooling1D(strides=1, pool_size=3)

    cnn_encoding = global_pool(cnn_3(local_pool(cnn_2(local_pool(cnn_1(embedding(message)))))))
    fc1 = layers.Dense(units=128, activation='elu')(cnn_encoding)
    categories = layers.Dense(units=len(label_encoder.classes_), activation='softmax')(fc1)
    model = models.Model(
        inputs=[message],
        outputs=[categories],
    )
    model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    return model

In [106]:
cnn_model = create_cnn_model(VOCAB_SIZE, embedding_weights=w2v)
cnn_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
title (InputLayer)           (None, None)              0         
_________________________________________________________________
message_vec/embedding (Embed (None, None, 300)         15000000  
_________________________________________________________________
conv1d_7 (Conv1D)            (None, None, 128)         115328    
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, None, 128)         0         
_________________________________________________________________
conv1d_8 (Conv1D)            (None, None, 128)         49280     
_________________________________________________________________
conv1d_9 (Conv1D)            (None, None, 128)         49280     
_________________________________________________________________
global_max_pooling1d_3 (Glob (None, 128)               0         
__________
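
We can sanity-check the convolution rows of this summary by hand. A 1D convolution has one weight per (kernel position, input channel, filter) plus one bias per filter, and with the default "valid" padding each window of size 3 shortens the sequence by 2. The sketch below reproduces the parameter counts and traces the sequence length through the conv/pool stack; the starting length of 30 is a hypothetical padded message length, not a value taken from the data.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def valid_output_length(length, window, stride=1):
    """Output length of a conv or pooling window with 'valid' padding."""
    return (length - window) // stride + 1

first_conv = conv1d_params(3, 300, 128)  # reads the 300-dim embeddings
later_conv = conv1d_params(3, 128, 128)  # reads 128 channels from the previous layer
print(first_conv, later_conv)

length = 30  # hypothetical padded message length
lengths = [length]
# conv, pool, conv, pool, conv: every window has size 3 and stride 1.
for window in (3, 3, 3, 3, 3):
    length = valid_output_length(length, window)
    lengths.append(length)
print(lengths)
```

Each of the five layers trims two positions, so a message needs at least 11 tokens (after padding) to survive until the global max pool.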

In [107]:
cnn_model.fit(training_tokens, training_labels, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fc4aa983ac8>

In [108]:
cnn_model.evaluate(test_tokens, test_labels)



[1.9418595876693725, 0.35949999999999999]

In [68]:
def create_lstm_model(vocab_size, embedding_size=None, embedding_weights=None):
    message = layers.Input(shape=(None,), dtype='int32', name='title')
    embedding = make_embedding('message_vec', vocab_size, embedding_size, embedding_weights)(message)

    lstm_1 = layers.LSTM(units=128, return_sequences=False)(embedding)
#     lstm_2 = layers.LSTM(units=128, return_sequences=False)(lstm_1)
    category = layers.Dense(units=len(label_encoder.classes_), activation='softmax')(lstm_1)
    
    model = models.Model(
        inputs=[message],
        outputs=[category],
    )
    model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    return model

In [69]:
lstm_model = create_lstm_model(VOCAB_SIZE, embedding_weights=w2v)
lstm_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
title (InputLayer)           (None, None)              0         
_________________________________________________________________
message_vec/embedding (Embed (None, None, 300)         15000000  
_________________________________________________________________
lstm_8 (LSTM)                (None, 128)               219648    
_________________________________________________________________
dense_26 (Dense)             (None, 13)                1677      
Total params: 15,221,325
Trainable params: 221,325
Non-trainable params: 15,000,000
_________________________________________________________________
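
The parameter counts in this summary can also be reproduced by hand, which is a handy way to check that the model is wired the way we expect. An LSTM has four gates, each with a weight matrix over the concatenated input and hidden state plus a bias; a dense layer has one weight per input-output pair plus a bias per output. The helper names are our own.

```python
def lstm_params(input_dim, units):
    """Four gates, each with (input + hidden) x units weights and units biases."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units

embedding = 50000 * 300         # frozen pre-trained embeddings
lstm = lstm_params(300, 128)    # 300-dim word vectors in, 128 units
dense = dense_params(128, 13)   # 13 emotion categories out

print('lstm:', lstm, 'dense:', dense)
print('trainable:', lstm + dense, 'total:', embedding + lstm + dense)
```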


In [70]:
lstm_model.fit(training_tokens, training_labels, epochs=5, batch_size=128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fc4af4563c8>

In [72]:
lstm_model.evaluate(test_tokens, test_labels)



[1.868003028869629, 0.38124999999999998]

## Comparing our models

Let's compare the predictions from our models on a sample of our data.

In [109]:
predictions = {
    'lstm': lstm_model.predict(test_tokens[:100]),
    'cnn': cnn_model.predict(test_tokens[:100]),
    'unigram': unigram_model.predict(test_tokens[:100]),
}

In [121]:
# Make a dataframe just for test data

pd.options.display.max_colwidth = 128
test_df = emotion_df[training_count:training_count+100].reset_index()
eval_df = pd.DataFrame({
    'content': test_df['content'],
    'true': test_df['sentiment'],
    'lstm': [label_encoder.classes_[np.argmax(x)] for x in predictions['lstm']],
    'cnn': [label_encoder.classes_[np.argmax(x)] for x in predictions['cnn']],
    'unigram': [label_encoder.classes_[np.argmax(x)] for x in predictions['unigram']],
})
eval_df = eval_df[['content', 'true', 'lstm', 'cnn', 'unigram']]
eval_df.head(10)

Unnamed: 0,content,true,lstm,cnn,unigram
0,HAPPY MOTHER'S DAY to all of the wonderful women out there. Have a great and relaxful day.,happiness,love,love,love
1,"browsing thru adopting agencies, i'm gonna get some exotic kids",enthusiasm,neutral,fun,happiness
2,"I am tired of my phone. Walkman works like a charm, but l need better video and wap really. Thanks for yesterday and for buy...",love,relief,happiness,love
3,Happy Mother's Day to all the Mommiessss,love,love,love,love
4,@mattgarner haha what's up Matt ?,happiness,neutral,happiness,neutral
5,What's up!!? @guillermop,neutral,neutral,neutral,neutral
6,@KandyBee we shuld do a dance like that its seriously the best thing haha. see yu tomoro.,fun,happiness,happiness,happiness
7,@TravelTweetie I will go to sleep now. Might be awakened early w/breakfast tray from my 'spark' &amp; my 'joper' w/their Dad...,happiness,worry,neutral,worry
8,@nak1a &quot;If there's a camel up a hill&quot; and &quot;I'll give you plankton&quot; ....HILARIOUS!!,happiness,happiness,neutral,neutral
9,@Bern_morley LOL I love your kids,love,love,love,love


## Qualitative Evaluation

We can examine some of our error cases by hand. The models often agree when they make mistakes, and the mistakes aren't unreasonable: this task would be challenging even for a human.

In [124]:
eval_df[eval_df['lstm'] != eval_df['true']].head(10)

Unnamed: 0,content,true,lstm,cnn,unigram
0,HAPPY MOTHER'S DAY to all of the wonderful women out there. Have a great and relaxful day.,happiness,love,love,love
1,"browsing thru adopting agencies, i'm gonna get some exotic kids",enthusiasm,neutral,fun,happiness
2,"I am tired of my phone. Walkman works like a charm, but l need better video and wap really. Thanks for yesterday and for buy...",love,relief,happiness,love
4,@mattgarner haha what's up Matt ?,happiness,neutral,happiness,neutral
6,@KandyBee we shuld do a dance like that its seriously the best thing haha. see yu tomoro.,fun,happiness,happiness,happiness
7,@TravelTweetie I will go to sleep now. Might be awakened early w/breakfast tray from my 'spark' &amp; my 'joper' w/their Dad...,happiness,worry,neutral,worry
10,@davecandoit dude that honest to god happens to me all the time.. minus the trail mix.,sadness,neutral,surprise,surprise
12,Happy Mother's Day to the tweetin' mamas Nite tweeple!,worry,happiness,love,neutral
13,On my way home...then SLEEP! Seeing Amber Pacific tomorow with the besties,happiness,neutral,neutral,neutral
14,@xoMusicLoverxo I'm using it in a story. I actually already wrote it but have to write the chapters before it.,relief,neutral,neutral,neutral
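
To put a rough number on how often the models agree when they are wrong, we can count agreement over the ten LSTM errors listed above. The labels below are copied from that table; with only ten rows this is an illustration, not a measurement.

```python
# (true, lstm, cnn, unigram) for the ten LSTM errors shown above.
rows = [
    ('happiness', 'love', 'love', 'love'),
    ('enthusiasm', 'neutral', 'fun', 'happiness'),
    ('love', 'relief', 'happiness', 'love'),
    ('happiness', 'neutral', 'happiness', 'neutral'),
    ('fun', 'happiness', 'happiness', 'happiness'),
    ('happiness', 'worry', 'neutral', 'worry'),
    ('sadness', 'neutral', 'surprise', 'surprise'),
    ('worry', 'happiness', 'love', 'neutral'),
    ('happiness', 'neutral', 'neutral', 'neutral'),
    ('relief', 'neutral', 'neutral', 'neutral'),
]

# Rows where all three models predict the same (wrong) label.
all_agree = sum(1 for _, lstm, cnn, uni in rows if lstm == cnn == uni)
# Rows where at least one other model shares the LSTM's wrong answer.
lstm_has_company = sum(1 for _, lstm, cnn, uni in rows if lstm in (cnn, uni))

print('all three agree: %d of %d' % (all_agree, len(rows)))
print('LSTM agrees with at least one other model: %d of %d' % (lstm_has_company, len(rows)))
```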
