# EMOJI PREDICTION CHALLENGE: Modelling Walkthrough

This notebook is a starter example of developing a model to predict emojis from Twitter data. It covers:
- Tweet & emoji data exploration
- NLP preprocessing
- GBMs

In [1]:
from ast import literal_eval
from collections import Counter
from multiprocessing import Pool

import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier

# Data Reading

In [237]:
## preprocess.py should have created a test.csv and train.csv 
## - make sure they are in the same directory as this notebook

In [55]:
DATA_FNAME = 'train.csv'
TWEET_COL = 'text'
EMOJI_COL = 'emoticons'

df = pd.read_csv(DATA_FNAME)
df = df[['tweet_id', TWEET_COL, EMOJI_COL]]
df = df[df[EMOJI_COL].apply(type) == str]

# Data Exploration
### Look at some general views of the data

In [56]:
df.head(20)

| | tweet_id | text | emoticons |
|---|---|---|---|
| 0 | 1433932 | @AmazonHelp Prime shipping is useless since ev... | [':face_with_tears_of_joy:'] |
| 1 | 1175287 | @118625 This is the 2nd time from a different ... | [':angry_face:'] |
| 2 | 1929034 | @116062 where are all your curly hair products... | [':thinking_face:'] |
| 3 | 2852186 | @792994 Olá, Idris! É difícil se resistir, não... | [':thinking_face:'] |
| 4 | 2007969 | @593759 Thanks for the suggestion, Dijonay! We... | [':green_heart:'] |
| 5 | 997293 | how do u cancel an uber eats order i accidenta... | [':face_with_tears_of_joy:'] |
| 6 | 2114182 | @623333 Hey Naina! We're launching regularly i... | [':green_heart:'] |
| 7 | 1559747 | @GWRHelp If service is displaying as 4minutes ... | [':face_with_rolling_eyes:'] |
| 8 | 1701953 | @115858 can make brand new phone with all thes... | [':face_with_rolling_eyes:'] |
| 9 | 1481022 | @282263 Hi there Are you wanting to order a re... | [':thumbs_down:'] |


In [57]:
df.describe()

| | tweet_id | text | emoticons |
|---|---|---|---|
| count | 29713 | 29713 | 29713 |
| unique | 29713 | 29583 | 55 |
| top | 828269 | @AmazonHelp | [':face_with_tears_of_joy:'] |
| freq | 1 | 24 | 8007 |


### Data types

#### Each tweet's text was read as a string, which makes sense:

In [58]:
df.head()[TWEET_COL].apply(type)

0    <class 'str'>
1    <class 'str'>
2    <class 'str'>
3    <class 'str'>
4    <class 'str'>
Name: text, dtype: object

#### Each tweet's list of emojis was read as one big string:

In [59]:
df.head()[EMOJI_COL].apply(type)

0    <class 'str'>
1    <class 'str'>
2    <class 'str'>
3    <class 'str'>
4    <class 'str'>
Name: emoticons, dtype: object

In [60]:
df.loc[1, EMOJI_COL]

"[':angry_face:']"

#### However, what we want is a list of emojis. Apply `ast.literal_eval()` to convert each emoji-list-string to an actual list of emoji strings:

In [63]:
df['emoji_list'] = df[EMOJI_COL].apply(literal_eval)

In [65]:
df.loc[1, 'emoji_list']

[':angry_face:']

In [83]:
df['emoji_list'][:10]

0    [:face_with_tears_of_joy:]
1                [:angry_face:]
2             [:thinking_face:]
3             [:thinking_face:]
4               [:green_heart:]
5    [:face_with_tears_of_joy:]
6               [:green_heart:]
7    [:face_with_rolling_eyes:]
8    [:face_with_rolling_eyes:]
9               [:thumbs_down:]
Name: emoji_list, dtype: object

In [None]:
### Plot the distribution of tweet lengths
### What are the min/max/mean lengths? Do these make sense from what you know about tweets?
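
One way to approach the tweet-length question is sketched below. The `sample_tweets` list is a made-up stand-in for `df[TWEET_COL].tolist()`, and the 20-character bucket width is an arbitrary choice. In the notebook itself, `df[TWEET_COL].str.len().hist()` should draw the plot directly.

```python
from collections import Counter
from statistics import mean

# Hypothetical stand-in for df[TWEET_COL].tolist()
sample_tweets = [
    "@AmazonHelp where is my parcel?",
    "how do u cancel an order",
    "Thanks for the quick reply, really appreciate the help with my account today!",
]

# Length of each tweet in characters
tweet_lengths = [len(tweet) for tweet in sample_tweets]
print("min:", min(tweet_lengths))
print("max:", max(tweet_lengths))
print("mean:", mean(tweet_lengths))

# Text histogram: group the lengths into 20-character buckets
BUCKET_WIDTH = 20
buckets = Counter((length // BUCKET_WIDTH) * BUCKET_WIDTH for length in tweet_lengths)
for bucket_start in sorted(buckets):
    bucket_end = bucket_start + BUCKET_WIDTH - 1
    print(f"{bucket_start:>3}-{bucket_end:<3} {'#' * buckets[bucket_start]}")
```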

### Explore emojis

#### Emojis per tweet

In [76]:
### What is the distribution of Emojis per tweet?

#### Tweets per emoji

In [74]:
### What emojis appear in the dataset?
### With what frequency do they appear?
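
Both emoji questions can be answered with `collections.Counter`, which the first cell already imports. The sketch below uses an invented `sample_emoji_lists` in place of `df['emoji_list'].tolist()`; the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical stand-in for df['emoji_list'].tolist()
sample_emoji_lists = [
    [":face_with_tears_of_joy:"],
    [":angry_face:", ":thumbs_down:"],
    [":face_with_tears_of_joy:", ":thinking_face:"],
    [":thinking_face:"],
]

# Emojis per tweet: how many tweets carry 1, 2, ... distinct emojis
emojis_per_tweet = Counter(len(set(emojis)) for emojis in sample_emoji_lists)

# Tweets per emoji: in how many tweets each emoji appears
tweets_per_emoji = Counter(
    emoji for emojis in sample_emoji_lists for emoji in set(emojis)
)

print(sorted(emojis_per_tweet.items()))
for emoji, n_tweets in tweets_per_emoji.most_common():
    print(emoji, n_tweets)
```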

# Data Prep

### Split dataset into train & validation samples

In [162]:
tweets_train, tweets_eval, emojis_train, emojis_eval = train_test_split(
    df[TWEET_COL].tolist(),
    df['emoji_list'].tolist(),
    random_state=12
)
print(len(tweets_train), len(emojis_train))
print(len(tweets_eval), len(emojis_eval))

22284 22284
7429 7429


### Each tweet has a list of one or more emojis that appeared in it. Convert each emoji list to True/False dummy values (one for each emoji) indicating which emojis appeared
Note: We only need this for the train dataset, so eval & test are ignored

In [195]:
## Hard-coded with 3 emojis - but you might want to extend to include all unique emojis in the dataset 

distinct_emojis = [':face_with_tears_of_joy:', ':thumbs_down:', ':smiling_face:']

n_emojis = len(distinct_emojis)
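
As a possible extension of the hard-coded list, the distinct emojis could be collected from the training labels instead. This is only a sketch: `toy_emojis_train` is an invented stand-in for `emojis_train`, and `dummify` is a hypothetical variant of the notebook's helper that takes the emoji vocabulary as an argument. Keep in mind that each extra emoji means one more GBM to train.

```python
# Hypothetical stand-in for emojis_train
toy_emojis_train = [
    [":face_with_rolling_eyes:"],
    [":face_with_tears_of_joy:"],
    [":thumbs_down:", ":face_with_rolling_eyes:"],
    [":smiling_face:"],
]

# Sorted so that the column order is reproducible between runs
all_emojis = sorted({emoji for emojis in toy_emojis_train for emoji in emojis})


def dummify(emojis, vocabulary):
    """Return one True/False flag per vocabulary emoji."""
    emoji_set = set(emojis)
    return [emoji in emoji_set for emoji in vocabulary]


toy_dummies = [dummify(emojis, all_emojis) for emojis in toy_emojis_train]
print(all_emojis)
print(toy_dummies[2])
```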

In [188]:
def dummify_emojis(emojis):
    emoji_set = set(emojis)
    return [emoji in emoji_set for emoji in distinct_emojis]
emoji_dummies_train = np.array([dummify_emojis(emojis) for emojis in emojis_train])
emoji_dummies_train.shape

(22284, 3)

In [189]:
for i, emoji_tweet in enumerate(emojis_train[:10]):
    print(str(emoji_tweet) + ":\t" + str(emoji_dummies_train[i]))

[':face_with_rolling_eyes:']:	[False False False]
[':face_with_tears_of_joy:']:	[ True False False]
[':face_with_rolling_eyes:']:	[False False False]
[':smiling_face:']:	[False False  True]
[':face_with_rolling_eyes:']:	[False False False]
[':thinking_face:']:	[False False False]
[':face_with_tears_of_joy:']:	[ True False False]
[':thumbs_down:']:	[False  True False]
[':smiling_face_with_heart-eyes:']:	[False False False]
[':smiling_face:']:	[False False  True]


#### ^ This array contains the training targets for our models

# GBM Approach

Gradient Boosting Model (GBM) = ensemble of decision trees:
![](gbm.png)

- This baseline modeling approach uses 3 GBMs (one for each sample emoji) to predict if its respective emoji appears in a given tweet.
- For each tweet, all 3 GBMs will output some probability between 0 and 1.
- These 3 outputs must be used to determine which emojis to predict belong with the tweet.

### Convert each tweet's text to feature vectors

This is the biggest NLP-y step. We need to convert the raw text of each tweet into a set of structured features that a modeling algorithm can learn from. To accomplish that, `sklearn`'s `CountVectorizer` is used here:

In [190]:
vectorizer = CountVectorizer()

This class can handle:

- Tokenization
 - Splitting each tweet string into "tokens" (basically a list of words)
- n-grams
 - Pairs of two words that appear in sequence (bi-grams), three words (tri-grams), etc.
- Dropping stop words
 - Words that appear very frequently and don't mean much for our prediction task, e.g. "the", "a", etc.
- Lowercasing
 - Treats "The" and "the" as the same token
- And more
 - See the docs for all options: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Finally, it counts the number of times each token appears in each tweet, and returns a sparse matrix of the token counts for each tweet.
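
To make those options concrete, here is a small sketch on two invented tweets. The parameter values (`stop_words="english"`, `ngram_range=(1, 2)`) are illustrative choices, not the settings used by this notebook, which keeps the defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up tweets, for illustration only
toy_tweets = [
    "The train is delayed",
    "the TRAIN was cancelled",
]

toy_vectorizer = CountVectorizer(
    lowercase=True,        # "The" and "the" become the same token
    stop_words="english",  # drop very common English words
    ngram_range=(1, 2),    # keep single words and bi-grams
)
toy_counts = toy_vectorizer.fit_transform(toy_tweets)

# vocabulary_ maps each token (or bi-gram) to its column index
print(sorted(toy_vectorizer.vocabulary_))
# Dense view is fine for a toy example; keep real data sparse
print(toy_counts.toarray())
```

Stop words are removed before the n-grams are built, so the bi-grams here are formed from the remaining words only.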

In [191]:
tweet_vectors_train = vectorizer.fit_transform(tweets_train)
tweet_vectors_train

<22284x36529 sparse matrix of type '<class 'numpy.int64'>'
	with 360452 stored elements in Compressed Sparse Row format>

In [192]:
tweet_vectors_eval = vectorizer.transform(tweets_eval)
tweet_vectors_eval

<7429x36529 sparse matrix of type '<class 'numpy.int64'>'
	with 112459 stored elements in Compressed Sparse Row format>

#### ^ These sparse matrices are the input feature values for our GBM models
- Most values in each matrix are 0, since any given tweet only uses a small subset of the vocabulary

### We are now ready to train the GBMs

In [196]:
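def fit_emoji_i_gbm(i):
    # Fit one binary GBM for the i-th emoji in distinct_emojis
    X = tweet_vectors_train
    y = emoji_dummies_train[:, i]
    gbm = GradientBoostingClassifier()
    gbm.fit(X, y)
    return gbm

# One worker process per emoji; the context manager shuts the pool down afterwards
with Pool(n_emojis) as pool:
    gbms = pool.map(fit_emoji_i_gbm, range(n_emojis))
len(gbms)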

3

In [199]:
gbms

[GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.1, loss='deviance', max_depth=3,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=100,
               presort='auto', random_state=None, subsample=1.0, verbose=0,
               warm_start=False),
 GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.1, loss='deviance', max_depth=3,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=100,
               presort='auto', random_state=None, subsample=1.0, verbose=0,
               warm_start=False),
 GradientBoostingClassifier(criterion='f

### With training complete, call each model's `predict_proba` on the validation sample to get 3 emoji probabilities for each tweet

In [200]:
def predict_probas(gbm, tweet_vectors):
    X = tweet_vectors
    cls_probas = gbm.predict_proba(X)
    pos_cls_ix = gbm.classes_.argmax()
    pos_cls_probas = cls_probas[:, pos_cls_ix]
    return pos_cls_probas

emoji_probas_eval = np.array([predict_probas(gbm, tweet_vectors_eval) for gbm in gbms])
emoji_probas_eval = emoji_probas_eval.T  # transpose probas from by-emoji to by-tweet
emoji_probas_eval.shape

(7429, 3)

In [201]:
emoji_probas_eval[:2]

array([[ 0.42524449,  0.063263  ,  0.06133028],
       [ 0.13776286,  0.04504414,  0.03926895]])
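
The column selection inside `predict_probas` relies on two scikit-learn conventions: a fitted classifier stores its labels in sorted order in `classes_`, and `predict_proba` returns one column per entry of `classes_`. The sketch below simulates both arrays with made-up values rather than fitting a model, to show why `classes_.argmax()` picks the "emoji present" column.

```python
import numpy as np

# Simulated stand-ins for gbm.classes_ and gbm.predict_proba(X)
classes_ = np.array([False, True])
cls_probas = np.array([
    [0.57, 0.43],  # tweet 1: P(absent), P(present)
    [0.86, 0.14],  # tweet 2
])

# True sorts after False, so argmax returns the index of the positive class
pos_cls_ix = classes_.argmax()
pos_cls_probas = cls_probas[:, pos_cls_ix]
print(pos_cls_ix, pos_cls_probas)
```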

## Predict Emojis for each tweet

In [203]:
def predict_emojis(emoji_probas, threshold=0.5):
    emojis_pred = [emoji for emoji, emoji_proba in zip(distinct_emojis, emoji_probas) if emoji_proba > threshold]
    if emojis_pred:
        return emojis_pred
    else:
        max_proba_emoji_ix = np.argmax(emoji_probas)
        return [distinct_emojis[max_proba_emoji_ix]]

emoji_preds_eval = [predict_emojis(emoji_probas) for emoji_probas in emoji_probas_eval]
len(emoji_preds_eval)

7429

In [206]:
emoji_preds_eval[:10]

[[':face_with_tears_of_joy:'],
 [':face_with_tears_of_joy:'],
 [':face_with_tears_of_joy:'],
 [':face_with_tears_of_joy:'],
 [':smiling_face:'],
 [':face_with_tears_of_joy:'],
 [':face_with_tears_of_joy:'],
 [':face_with_tears_of_joy:'],
 [':face_with_tears_of_joy:'],
 [':face_with_tears_of_joy:']]

### Score the emoji predictions against the true emojis

In [204]:
def score_preds(emojis_true, emojis_pred):
    def score_1(trues, preds):
        trues = set(trues)
        preds = set(preds)
        n_correct = len(trues.intersection(preds))
        score = n_correct / max(len(trues), len(preds))
        return score
    scores = [score_1(trues, preds)
              for trues, preds in zip(emojis_true, emojis_pred)]
    mean_score = sum(scores) / len(scores)
    print('# tweets:', len(scores))
    print('score:', mean_score)

score_preds(emojis_eval, emoji_preds_eval)

# tweets: 7429
score: 0.2779647328038767
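
A few hand-worked cases may help in reading that number. The sketch re-implements the per-tweet score from `score_1` under the hypothetical name `score_one`. Note that a tweet whose true emoji is not among the three modelled emojis can only score 0, which caps the achievable mean score for this baseline.

```python
def score_one(trues, preds):
    """Correct emojis divided by the larger of the two set sizes."""
    trues, preds = set(trues), set(preds)
    return len(trues & preds) / max(len(trues), len(preds))


# Exact match: 1 correct / max(1, 1)
exact = score_one([":thumbs_down:"], [":thumbs_down:"])
# One extra prediction: 1 correct / max(1, 2)
over_predicted = score_one([":thumbs_down:"], [":thumbs_down:", ":smiling_face:"])
# True emoji outside the modelled set: 0 correct / max(1, 1)
missed = score_one([":thinking_face:"], [":face_with_tears_of_joy:"])

print(exact, over_predicted, missed)
```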


## Part 1: Understand the baseline model

Questions you might want to think about, or ask a friendly drop-in face!

* What is a GBM model?
* What are the Features in this model?
* What are the Targets in this model?

## Part 2: Improve the model!

- Hand-craft features
  - Use the @<user> handles
    - Is there a correlation between the number of handles used and the number of emojis?
  - Any rules, i.e. `'love'` -> `':heart:'`?
  - Are any of the train tweets duplicated in the test set?
- Tuning the GBM parameters
- Optimize the emoji probability-to-prediction function
  - Maximize expected points
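
As one possible starting point for the last idea, the sketch below sweeps the prediction threshold and keeps the value with the best mean score. Everything in it is illustrative: the probabilities and labels are invented, the helper names are hypothetical, and the candidate grid is arbitrary. On real data the sweep should use the validation split, and the chosen threshold may overfit to it.

```python
import numpy as np

toy_emojis = [":face_with_tears_of_joy:", ":thumbs_down:", ":smiling_face:"]

# Invented probabilities (one row per tweet) and true labels
toy_probas = np.array([
    [0.42, 0.06, 0.06],
    [0.14, 0.35, 0.04],
    [0.30, 0.05, 0.33],
    [0.45, 0.40, 0.02],
])
toy_trues = [
    [":face_with_tears_of_joy:"],
    [":thumbs_down:"],
    [":smiling_face:"],
    [":face_with_tears_of_joy:", ":thumbs_down:"],
]


def predict(probas, threshold):
    """Emojis above the threshold, falling back to the most probable one."""
    preds = [emoji for emoji, p in zip(toy_emojis, probas) if p > threshold]
    return preds or [toy_emojis[int(np.argmax(probas))]]


def mean_score(trues_list, preds_list):
    """Same per-tweet score as score_preds, averaged over tweets."""
    scores = [
        len(set(t) & set(p)) / max(len(set(t)), len(set(p)))
        for t, p in zip(trues_list, preds_list)
    ]
    return sum(scores) / len(scores)


# Try each candidate threshold and record the mean score
results = {}
for threshold in [0.1, 0.2, 0.3, 0.4, 0.5]:
    preds = [predict(row, threshold) for row in toy_probas]
    results[threshold] = mean_score(toy_trues, preds)

best_threshold = max(results, key=results.get)
print(results)
print("best threshold:", best_threshold)
```

A single shared threshold is the simplest choice; a per-emoji threshold is a natural next step, since rare emojis tend to receive lower probabilities.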