## Trying out a simple learner

Before we try to build our deep learning models, let's make sure we can learn something using a simple linear model.

In [13]:
import numpy as np
import pandas as pd
from keras.utils.data_utils import get_file
from keras import regularizers
import nb_utils

emotion_csv = get_file('text_emotion.csv', 
                       'https://www.crowdflower.com/wp-content/uploads/2016/07/text_emotion.csv')
emotion_df = pd.read_csv(emotion_csv)

In [2]:
emotion_df.head()

Unnamed: 0,tweet_id,sentiment,author,content
0,1956967341,empty,xoshayzers,@tiffanylue i know i was listenin to bad habi...
1,1956967666,sadness,wannamama,Layin n bed with a headache ughhhh...waitin o...
2,1956967696,sadness,coolfunky,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,czareaquino,wants to hang out with friends SOON!
4,1956968416,neutral,xkilljoyx,@dannycastillo We want to trade with someone w...


In [3]:
emotion_df['sentiment'].value_counts()

neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
boredom        179
anger          110
Name: sentiment, dtype: int64

In [4]:
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

VOCAB_SIZE = 50000

tfidf_vec = TfidfVectorizer(max_features=VOCAB_SIZE)
label_encoder = LabelEncoder()

X = tfidf_vec.fit_transform(emotion_df['content'])
y = label_encoder.fit_transform(emotion_df['sentiment'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [5]:
bayes = MultinomialNB()
bayes.fit(X_train, y_train)
predictions = bayes.predict(X_test)
precision_score(predictions, y_test, average='micro')

0.2802272727272727

In [6]:
classifiers = {'sgd': SGDClassifier(loss='hinge'),
               'svm': SVC(),
               'random_forest': RandomForestClassifier()}

for lbl, clf in classifiers.items():
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    print(lbl, precision_score(predictions, y_test, average='micro'))



sgd 0.32856060606060605
svm 0.21863636363636363
random_forest 0.2821212121212121


## Checking what our model learned

Our linear models appear to be learning something more powerful than "pick the most popular category".  We can take a quick look at which words they find the most correlated with each category before moving on to our neural network.

In [7]:
from scipy.sparse import eye
d = eye(len(tfidf_vec.vocabulary_))
word_pred = bayes.predict_proba(d)


In [8]:
inverse_vocab = {v: k for k, v in tfidf_vec.vocabulary_.items()}

from collections import Counter, defaultdict
by_cls = defaultdict(Counter)
for word_idx, pred in enumerate(word_pred):
    for class_idx, score in enumerate(pred):
        cls = label_encoder.classes_[class_idx]
        by_cls[cls][inverse_vocab[word_idx]] = score

In [9]:
for k in by_cls:
    words = [x[0] for x in by_cls[k].most_common(5)]
    print(k, ':', ' '.join(words))

anger : confuzzled fridaaaayyyyy aaaaaaaaaaa transtelecom filthy
boredom : squeaking ouuut cleanin sooooooo candyland3
empty : _cheshire_cat_ bethsybsb conversating kimbermuffin less_than_3
enthusiasm : lena_distractia foolproofdiva attending krisswouldhowse tatt
fun : xbox bamboozle sanctuary oldies toodaayy
happiness : excited woohoo excellent yay wars
hate : hate hates suck fucking zomberellamcfox
love : love mothers mommies moms loved
neutral : www painting souljaboytellem link frenchieb
relief : finally relax mastered relief inspiration
sadness : sad sadly cry cried miss
surprise : surprise wow surprised wtf surprisingly
worry : worried poor throat hurts sick


## Training a deep model

Now that we've seen how well a simple linear model can do, let's see if we can do any better with a deep learning model.  In this case, we don't have an excessive amount of training data: this constrains the models we can train effectively: use too big of a model, and we'll end up overfitting our data.

We'll start with a CNN.

In [11]:
from itertools import chain
from keras.preprocessing.sequence import pad_sequences

chars = list(sorted(set(chain(*emotion_df['content']))))
char_to_idx = {ch: idx for idx, ch in enumerate(chars)}
max_sequence_len = max(len(x) for x in emotion_df['content'])

char_vectors = []
for txt in emotion_df['content']:
    vec = np.zeros((max_sequence_len, len(char_to_idx)))
    vec[np.arange(len(txt)), [char_to_idx[ch] for ch in txt]] = 1
    char_vectors.append(vec)
char_vectors = np.asarray(char_vectors)
char_vectors = pad_sequences(char_vectors)
labels = label_encoder.transform(emotion_df['sentiment'])


def split(lst):
    training_count = int(0.9 * len(char_vectors))
    return lst[:training_count], lst[training_count:]

training_char_vectors, test_char_vectors = split(char_vectors)
training_labels, test_labels = split(labels)

char_vectors.shape

(40000, 167, 100)

In [14]:
from keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense, Dropout, Merge, LSTM
from keras.models import Model
from keras.layers.merge import Concatenate

def create_char_cnn_model(num_chars, max_sequence_len, num_labels):
    char_input = Input(shape=(max_sequence_len, num_chars), name='input')
    
    conv_1x = Conv1D(128, 6, activation='relu', padding='valid')(char_input)
    max_pool_1x = MaxPooling1D(6)(conv_1x)
    conv_2x = Conv1D(256, 6, activation='relu', padding='valid')(max_pool_1x)
    max_pool_2x = MaxPooling1D(6)(conv_2x)

    flatten = Flatten()(max_pool_2x)
    dense = Dense(128, 
                  activation='relu',
                  kernel_regularizer=regularizers.l2(0.01))(flatten)
    preds = Dense(num_labels, activation='softmax')(dense)

    model = Model(char_input, preds)
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='rmsprop',
                  metrics=['acc'])
    return model

char_cnn_model = create_char_cnn_model(len(char_to_idx), char_vectors.shape[1], len(label_encoder.classes_))
char_cnn_model.summary()

Instructions for updating:
keep_dims is deprecated, use keepdims instead
Instructions for updating:
keep_dims is deprecated, use keepdims instead
Instructions for updating:
keep_dims is deprecated, use keepdims instead
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, 167, 100)          0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 162, 128)          76928     
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 27, 128)           0         
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 22, 256)           196864    
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 3, 256)            0         
_______________________________________________________

In [15]:
char_cnn_model.fit(training_char_vectors, training_labels, epochs=20, batch_size=1024)
char_cnn_model.evaluate(test_char_vectors, test_labels)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


[2.027604729652405, 0.33625]

In [16]:
from keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense, Dropout, Merge, LSTM
from keras.models import Model
from keras.layers.merge import Concatenate

def create_char_cnn_model(num_chars, max_sequence_len, num_labels):
    char_input = Input(shape=(max_sequence_len, num_chars), name='input')
    
    layers = []
    for window in (5, 6, 7):
        conv_1x = Conv1D(128, window, activation='relu', padding='valid')(char_input)
        max_pool_1x = MaxPooling1D(window)(conv_1x)
        dropout_1x = Dropout(0.3)(max_pool_1x)
        conv_2x = Conv1D(128, window, activation='relu', padding='valid')(dropout_1x)
        max_pool_2x = MaxPooling1D(window)(conv_2x)
        dropout_2x = Dropout(0.3)(max_pool_2x)
        layers.append(dropout_2x)

    if len(layers) > 1:
        merged = Concatenate(axis=1)(layers)
    else:
        merged = layers[0]

    dropout = Dropout(0.3)(merged)
    
    flatten = Flatten()(dropout)
    dense = Dense(128, activation='relu')(flatten)
    preds = Dense(num_labels, activation='softmax')(dense)

    model = Model(char_input, preds)
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='rmsprop',
                  metrics=['acc'])
    return model

char_cnn_model = create_char_cnn_model(len(char_to_idx), char_vectors.shape[1], len(label_encoder.classes_))
char_cnn_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input (InputLayer)              (None, 167, 100)     0                                            
__________________________________________________________________________________________________
conv1d_5 (Conv1D)               (None, 163, 128)     64128       input[0][0]                      
__________________________________________________________________________________________________
conv1d_7 (Conv1D)               (None, 162, 128)     76928       input[0][0]                      
__________________________________________________________________________________________________
conv1d_9 (Conv1D)               (None, 161, 128)     89728       input[0][0]                      
__________________________________________________________________________________________________
max_poolin

In [17]:
char_cnn_model.fit(training_char_vectors, training_labels, epochs=20, batch_size=1024)
char_cnn_model.evaluate(test_char_vectors, test_labels)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


[2.0358053636550903, 0.329]

## Featurizing and preparing our data

Just like we did when computing word embeddings, we want to featurize our data so we can classify it effectively.

In [18]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import one_hot

VOCAB_SIZE = 50000
tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(emotion_df['content'])

In [19]:
# This may take a while to load

w2v, idf = nb_utils.load_w2v(tokenizer)

In [21]:
tokens = tokenizer.texts_to_sequences(emotion_df['content'])
tokens = pad_sequences(tokens)


training_count = int(0.9 * len(char_vectors))
training_tokens, training_labels = tokens[:training_count], labels[:training_count]
test_tokens, test_labels = tokens[training_count:], labels[training_count:]

In [22]:
from keras import layers, models
import keras.backend as K


def make_embedding(name, vocab_size, embedding_size, weights=None, mask_zero=True):
    if weights is not None:
        return layers.Embedding(mask_zero=mask_zero, input_dim=vocab_size, 
                                output_dim=weights.shape[1], 
                                weights=[weights], trainable=False, 
                                name='%s/embedding' % name)
    else:
        return layers.Embedding(mask_zero=mask_zero, input_dim=vocab_size, 
                                output_dim=embedding_size,
                                name='%s/embedding' % name)

def create_unigram_model(vocab_size, embedding_size=None, embedding_weights=None, idf_weights=None):
    assert not (embedding_size is None and embedding_weights is None)
    message = layers.Input(shape=(None,), dtype='int32', name='message')
    
    embedding = make_embedding('message_vec', vocab_size, embedding_size, embedding_weights)
    idf = make_embedding('message_idf', vocab_size, embedding_size, idf_weights)

    mask = layers.Masking(mask_value=0)
    def _combine_and_sum(args):
        embedding, idf = args
        return K.sum(embedding * K.abs(idf), axis=1)

    sum_layer = layers.Lambda(_combine_and_sum, name='combine_and_sum')
    sum_msg = sum_layer([mask(embedding(message)), idf(message)])
    fc1 = layers.Dense(units=128, activation='relu')(sum_msg)
    categories = layers.Dense(units=len(label_encoder.classes_), activation='softmax')(fc1)
    
    model = models.Model(
        inputs=[message],
        outputs=categories,
    )
    
    model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    model.summary()
    return model

unigram_model = create_unigram_model(vocab_size=VOCAB_SIZE,
                                     embedding_weights=w2v,
                                     idf_weights=idf)

Instructions for updating:
keep_dims is deprecated, use keepdims instead
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
message (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
message_vec/embedding (Embeddin (None, None, 300)    15000000    message[0][0]                    
__________________________________________________________________________________________________
masking_1 (Masking)             (None, None, 300)    0           message_vec/embedding[0][0]      
__________________________________________________________________________________________________
message_idf/embedding (Embeddin (None, None, 1)      50000       message[0][0]                    
____________________________________

In [23]:
unigram_model.fit(training_tokens, training_labels, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f19482867f0>

In [24]:
unigram_model.evaluate(test_tokens, test_labels, verbose=2)

[2.4151585874557493, 0.302]

## Learning Embeddings

It looks like our model with pre-trained embeddings isn't doing much better than the linear models.

We can also try training a model "from scratch", and learn the word embeddings from our training data.  Note that we use a small embedding size here to speed up training and to try to avoid overfitting.

Only training for 10 epochs stops the model while it is still improving on the training set, but prevents it
from overfitting.  We can formalize this by using a validation set and early stopping.

In [25]:
learned_embeddings_model = create_unigram_model(vocab_size=VOCAB_SIZE, embedding_size=25)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
message (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
message_vec/embedding (Embeddin (None, None, 25)     1250000     message[0][0]                    
__________________________________________________________________________________________________
masking_2 (Masking)             (None, None, 25)     0           message_vec/embedding[0][0]      
__________________________________________________________________________________________________
message_idf/embedding (Embeddin (None, None, 25)     1250000     message[0][0]                    
__________________________________________________________________________________________________
combine_an

In [26]:
learned_embeddings_model.fit(training_tokens, training_labels, epochs=10, batch_size=128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f19156797f0>

In [27]:
# Note the test set accuracy is lower than that on the training set.

learned_embeddings_model.evaluate(test_tokens, test_labels, verbose=2)

[2.0061434240341187, 0.35275]

# More Complex Models

As with our previous task, we can try using more powerful models to classify our text.  In this case, the limited training data and text size limit their effectiveness.

In [28]:
def create_cnn_model(vocab_size, embedding_size=None, embedding_weights=None):
    message = layers.Input(shape=(None,), dtype='int32', name='title')
    
    # The convolution layer in keras does not support masking, so we just allow
    # the embedding layer to learn an explicit value.
    embedding = make_embedding('message_vec', vocab_size, embedding_size, embedding_weights,
                              mask_zero=False)

    def _combine_sum(v):
        return K.sum(v, axis=1)

    cnn_1 = layers.Convolution1D(128, 3)
    cnn_2 = layers.Convolution1D(128, 3)
    cnn_3 = layers.Convolution1D(128, 3)
    
    global_pool = layers.GlobalMaxPooling1D()
    local_pool = layers.MaxPooling1D(strides=1, pool_size=3)

    cnn_encoding = global_pool(cnn_3(local_pool(cnn_2(local_pool(cnn_1(embedding(message)))))))
    fc1 = layers.Dense(units=128, activation='elu')(cnn_encoding)
    categories = layers.Dense(units=len(label_encoder.classes_), activation='softmax')(fc1)
    model = models.Model(
        inputs=[message],
        outputs=[categories],
    )
    model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    return model

In [29]:
cnn_model = create_cnn_model(VOCAB_SIZE, embedding_weights=w2v)
cnn_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
title (InputLayer)              (None, None)         0                                            
__________________________________________________________________________________________________
message_vec/embedding (Embeddin (None, None, 300)    15000000    title[0][0]                      
__________________________________________________________________________________________________
conv1d_11 (Conv1D)              (None, None, 128)    115328      message_vec/embedding[0][0]      
__________________________________________________________________________________________________
max_pooling1d_11 (MaxPooling1D) (None, None, 128)    0           conv1d_11[0][0]                  
                                                                 conv1d_12[0][0]                  
__________

In [30]:
cnn_model.fit(training_tokens, training_labels, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f19157b81d0>

In [31]:
cnn_model.evaluate(test_tokens, test_labels)



[2.817425817489624, 0.323]

In [32]:
def create_lstm_model(vocab_size, embedding_size=None, embedding_weights=None):
    message = layers.Input(shape=(None,), dtype='int32', name='title')
    embedding = make_embedding('message_vec', vocab_size, embedding_size, embedding_weights)(message)

    lstm_1 = layers.LSTM(units=128, return_sequences=False)(embedding)
#     lstm_2 = layers.LSTM(units=128, return_sequences=False)(lstm_1)
    category = layers.Dense(units=len(label_encoder.classes_), activation='softmax')(lstm_1)
    
    model = models.Model(
        inputs=[message],
        outputs=[category],
    )
    model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    return model

In [33]:
lstm_model = create_lstm_model(VOCAB_SIZE, embedding_weights=w2v)
lstm_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
title (InputLayer)           (None, None)              0         
_________________________________________________________________
message_vec/embedding (Embed (None, None, 300)         15000000  
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               219648    
_________________________________________________________________
dense_11 (Dense)             (None, 13)                1677      
Total params: 15,221,325
Trainable params: 221,325
Non-trainable params: 15,000,000
_________________________________________________________________


In [34]:
lstm_model.fit(training_tokens, training_labels, epochs=10, batch_size=128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f1913112080>

In [35]:
lstm_model.evaluate(test_tokens, test_labels)



[1.8966071195602416, 0.38475]

## Comparing our models

Let's compare the predictions from our models on a sample of our data.

In [36]:
predictions = {
    'lstm': lstm_model.predict(test_tokens[:100]),
    'char_cnn': char_cnn_model.predict(test_char_vectors[:100]),
    'cnn': cnn_model.predict(test_tokens[:100]),
    'unigram': unigram_model.predict(test_tokens[:100]),
}

In [37]:
# Make a dataframe just for test data

pd.options.display.max_colwidth = 128
test_df = emotion_df[training_count:training_count+100].reset_index()
eval_df = pd.DataFrame({
    'content': test_df['content'],
    'true': test_df['sentiment'],
    'lstm': [label_encoder.classes_[np.argmax(x)] for x in predictions['lstm']],
    'cnn': [label_encoder.classes_[np.argmax(x)] for x in predictions['cnn']],
    'char_cnn': [label_encoder.classes_[np.argmax(x)] for x in predictions['char_cnn']],    
    'unigram': [label_encoder.classes_[np.argmax(x)] for x in predictions['unigram']],
})
eval_df = eval_df[['content', 'true', 'lstm', 'cnn', 'char_cnn', 'unigram']]
eval_df.head(10)

Unnamed: 0,content,true,lstm,cnn,char_cnn,unigram
0,HAPPY MOTHER'S DAY to all of the wonderful women out there. Have a great and relaxful day.,happiness,love,love,love,love
1,"browsing thru adopting agencies, i'm gonna get some exotic kids",enthusiasm,neutral,worry,worry,worry
2,"I am tired of my phone. Walkman works like a charm, but l need better video and wap really. Thanks for yesterday and for buy...",love,relief,relief,happiness,love
3,Happy Mother's Day to all the Mommiessss,love,love,love,love,happiness
4,@mattgarner haha what's up Matt ?,happiness,neutral,fun,neutral,worry
5,What's up!!? @guillermop,neutral,neutral,neutral,neutral,neutral
6,@KandyBee we shuld do a dance like that its seriously the best thing haha. see yu tomoro.,fun,happiness,fun,happiness,neutral
7,@TravelTweetie I will go to sleep now. Might be awakened early w/breakfast tray from my 'spark' &amp; my 'joper' w/their Dad...,happiness,neutral,worry,worry,worry
8,@nak1a &quot;If there's a camel up a hill&quot; and &quot;I'll give you plankton&quot; ....HILARIOUS!!,happiness,happiness,neutral,happiness,neutral
9,@Bern_morley LOL I love your kids,love,love,love,love,love


## Qualitative Evaluation

We can examine some of our error cases by hand.  Often, the models tend to agree when they make mistakes, and that the mistakes aren't unreasonable: this task would be challenging even for a human.

In [38]:
eval_df[eval_df['lstm'] != eval_df['true']].head(10)

Unnamed: 0,content,true,lstm,cnn,char_cnn,unigram
0,HAPPY MOTHER'S DAY to all of the wonderful women out there. Have a great and relaxful day.,happiness,love,love,love,love
1,"browsing thru adopting agencies, i'm gonna get some exotic kids",enthusiasm,neutral,worry,worry,worry
2,"I am tired of my phone. Walkman works like a charm, but l need better video and wap really. Thanks for yesterday and for buy...",love,relief,relief,happiness,love
4,@mattgarner haha what's up Matt ?,happiness,neutral,fun,neutral,worry
6,@KandyBee we shuld do a dance like that its seriously the best thing haha. see yu tomoro.,fun,happiness,fun,happiness,neutral
7,@TravelTweetie I will go to sleep now. Might be awakened early w/breakfast tray from my 'spark' &amp; my 'joper' w/their Dad...,happiness,neutral,worry,worry,worry
10,@davecandoit dude that honest to god happens to me all the time.. minus the trail mix.,sadness,neutral,happiness,sadness,neutral
12,Happy Mother's Day to the tweetin' mamas Nite tweeple!,worry,love,love,love,happiness
13,On my way home...then SLEEP! Seeing Amber Pacific tomorow with the besties,happiness,neutral,happiness,happiness,happiness
14,@xoMusicLoverxo I'm using it in a story. I actually already wrote it but have to write the chapters before it.,relief,neutral,neutral,sadness,neutral


In [39]:
emotion_df.count()

tweet_id     40000
sentiment    40000
author       40000
content      40000
dtype: int64

## Analyzing Tweets

We can gather a sample of Twitter data using the Twitter API (https://dev.twitter.com).  To do so, we'll need to create a Twitter application and get credentials for it.  You can do this manually at https://app.twitter.com.  Once you have an app, go to the "Key and Access Tokens" tab to find your credentials.

In [42]:
import twitter
import emoji

In [43]:
# Fill these in!

CONSUMER_KEY = 'xbMuxcJpRTiVGt2C2EYnA'
CONSUMER_SECRET = '2DbQTsvIptkPTdaUcos8DDvQH9fzO0hNjJpUT2uVzQ'
ACCESS_TOKEN = '7319442-EDm4CPxL7W4KkZcGWRMJNVHp88W5OH9vgblu898fg'
ACCESS_SECRET = '5ZxJSbqXhG7uhgXzTFWf9XhkfsxxinlPRXyDTzbA9w'

In [44]:
api = twitter.Twitter(
    auth=twitter.OAuth(
        consumer_key=CONSUMER_KEY,
        consumer_secret=CONSUMER_SECRET,
        token=ACCESS_TOKEN,
        token_secret=ACCESS_SECRET,
    ))

stream = twitter.TwitterStream(
    auth=twitter.OAuth(
        consumer_key=CONSUMER_KEY,
        consumer_secret=CONSUMER_SECRET,
        token=ACCESS_TOKEN,
        token_secret=ACCESS_SECRET,
    ))

In [45]:
import itertools
def has_emoji(tweet):
    if tweet.get('lang') != 'en':
        return False
    return any(ch for ch in tweet.get('text', '') if ch in emoji.UNICODE_EMOJI)

%time st = list(itertools.islice(filter(has_emoji, stream.statuses.sample()), 0, 10))

CPU times: user 199 ms, sys: 23 ms, total: 222 ms
Wall time: 8.88 s


In [46]:
len(st), [t.get('text', None) for t in st][:10]

(10,
 ['@JTMusicTeam Congrats my fav peeps. You guys just keep surprising  each and everyday!! LOVE YOU!!❤❤❤',
  '@grace_ashe Omg Stop get out of here!! You’re too nice💘💘',
  'The best advice!!!! Thank you doctor love ❤️❤️❤️',
  'So one piece will be on break next week.😭😭😭😭😭😭😭😭',
  'RT @Sporf: 👤 @ManUtd managers win %:\n\n🏴\U000e0067\U000e0062\U000e0073\U000e0063\U000e0074\U000e007f Sir Alex Ferguson\n✅ 59%\n\n🇵🇹 Jose Mourinho \n✅ 58%\n\n🏴\U000e0067\U000e0062\U000e0065\U000e006e\U000e0067\U000e007f Ernest Mangnall\n✅ 54%\n\n🏴\U000e0067\U000e0062\U000e0073\U000e0063\U000e0074\U000e007f Davi…',
  "RT @mihyochaeng: I'm still thinking about this 😂 #MiChaeng https://t.co/Dj9cp60GvQ",
  'RT @DayswithDae: Jongdae                                 me\n                           🤝\n               \n                  nothing i just…',
  'RT @Tee_Jaruji: 27 December 2018 , SBFIVE  SPARK (ช็อต...หัวใจ) ⚡️💘\n#SparkSBFIVE #SBFIVE #Starhunterstudio https://t.co/PLvWxjI6d6',
  'RT @Mo3tadilaCBA: For all

## Save Emojis

fetch many emoji-tweets and save as 'data/emojis.txt' for next section.

In [64]:
tweets = list(itertools.islice(
    filter(has_emoji, stream.statuses.sample()), 0, 100000))

In [65]:
stripped = []
for tweet in tweets:
    text = tweet['text']
    emojis = {ch for ch in text if ch in emoji.UNICODE_EMOJI}
    if len(emojis) == 1:
        emoiji = emojis.pop()
        text = ''.join(ch for ch in text if ch != emoiji)
        stripped.append((text, emoiji))

In [66]:
pd.DataFrame(stripped).to_csv('data/emojis.txt', header=['text', 'emoji'], index=None)