
# ELMo: Embeddings from Language Models 

---



In [0]:
import pandas as pd
import numpy as np
import spacy
from tqdm import tqdm
import re
import time
import pickle

In [0]:
train = pd.read_csv("train_2kmZucJ.csv")
test = pd.read_csv("test_oJQbWVk.csv")

train.shape, test.shape

((7920, 3), (1953, 2))

Let's check the class distribution in the train set (the test set has no *label* column):

In [0]:
train['label'].value_counts(normalize = True)

0    0.744192
1    0.255808
Name: label, dtype: float64

In the output above, label 0 marks a non-negative tweet and label 1 marks a negative tweet, so roughly a quarter of the train set is negative.
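
To make the normalized counts concrete, here is a minimal pure-Python sketch of the same computation. The helper name `class_proportions` and the toy labels are made up for illustration; the notebook itself relies on pandas' `value_counts(normalize=True)`.

```python
from collections import Counter

def class_proportions(labels):
    """Return {label: share of rows}, similar to value_counts(normalize=True)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Toy labels (hypothetical): three non-negative tweets and one negative tweet.
toy_labels = [0, 0, 1, 0]
proportions = class_proportions(toy_labels)
print(proportions)  # {0: 0.75, 1: 0.25}
```

With roughly 74% of the real tweets in class 0, a model that always predicts 0 would already reach about 74% accuracy, which is one reason a metric such as F1 is informative here.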

Let's peep into the first five rows of our train set:

In [0]:
train.head()

Unnamed: 0,id,label,tweet
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1...
1,2,0,Finally a transparant silicon case ^^ Thanks t...
2,3,0,We love this! Would you go? #talk #makememorie...
3,4,0,I'm wired I know I'm George I was made that wa...
4,5,1,What amazing service! Apple won't even talk to...


Then the first five rows of our test set:

In [0]:
test.head()

Unnamed: 0,id,tweet
0,7921,I hate the new #iphone upgrade. Won't let me d...
1,7922,currently shitting my fucking pants. #apple #i...
2,7923,"I'd like to puts some CD-ROMS on my iPad, is t..."
3,7924,My ipod is officially dead. I lost all my pict...
4,7925,Been fighting iTunes all night! I only want th...


Our target variable is the column *label*, which is present only in the train set and must be predicted for the test set, while the independent variable is the column *tweet*.

---

## Clean the Text and Preprocess the Data

---

Any elements of our tweets that do not give us data about sentiment will be removed. First, we remove any URL links.

In [0]:
train['clean_tweet'] = train['tweet'].apply(lambda x: re.sub(r'http\S+', '', x))

test['clean_tweet'] = test['tweet'].apply(lambda x: re.sub(r'http\S+', '', x))
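
As a quick illustration of what this pattern does, the sketch below applies the same regular expression to a single string. The sample text and the helper name `strip_urls` are hypothetical.

```python
import re

# Same pattern as in the cell above: "http" followed by any run of non-whitespace.
URL_PATTERN = re.compile(r'http\S+')

def strip_urls(text):
    """Remove every http/https link from a string."""
    return URL_PATTERN.sub('', text)

sample = "New case arrived! https://example.com/abc #iphone"
cleaned = strip_urls(sample)
print(repr(cleaned))  # 'New case arrived!  #iphone'
```

Two details are worth noting: the removal leaves a double space behind, and links written without the `http` prefix (for example `www.example.com`) are not matched by this pattern.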

Let's compare the new *clean_tweet* column with the original *tweet* column to check that the URLs are gone:

In [0]:
train.head()

Unnamed: 0,id,label,tweet,clean_tweet
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1...,#fingerprint #Pregnancy Test #android #apps #...
1,2,0,Finally a transparant silicon case ^^ Thanks t...,Finally a transparant silicon case ^^ Thanks t...
2,3,0,We love this! Would you go? #talk #makememorie...,We love this! Would you go? #talk #makememorie...
3,4,0,I'm wired I know I'm George I was made that wa...,I'm wired I know I'm George I was made that wa...
4,5,1,What amazing service! Apple won't even talk to...,What amazing service! Apple won't even talk to...


In [0]:
test.head()

Unnamed: 0,id,tweet,clean_tweet
0,7921,I hate the new #iphone upgrade. Won't let me d...,I hate the new #iphone upgrade. Won't let me d...
1,7922,currently shitting my fucking pants. #apple #i...,currently shitting my fucking pants. #apple #i...
2,7923,"I'd like to puts some CD-ROMS on my iPad, is t...","I'd like to puts some CD-ROMS on my iPad, is t..."
3,7924,My ipod is officially dead. I lost all my pict...,My ipod is officially dead. I lost all my pict...
4,7925,Been fighting iTunes all night! I only want th...,Been fighting iTunes all night! I only want th...


That looks right. Next, let's apply the routine cleaning steps common to most NLP tasks: removing punctuation, lowercasing, removing numbers, and collapsing whitespace.

In [0]:
# characters to strip (periods, commas, and apostrophes are not listed, so they are kept)
# built once as a set instead of being rebuilt for every character
punctuation = set('!"#$%&()*+-/:;<=>?@[\\]^_`{|}~')

# remove punctuation
train['clean_tweet'] = train['clean_tweet'].apply(lambda x: ''.join(ch for ch in x if ch not in punctuation))
test['clean_tweet'] = test['clean_tweet'].apply(lambda x: ''.join(ch for ch in x if ch not in punctuation))

# convert text to lowercase
train['clean_tweet'] = train['clean_tweet'].str.lower()
test['clean_tweet'] = test['clean_tweet'].str.lower()

# remove numbers (regex=True is needed for pattern matching in recent pandas versions)
train['clean_tweet'] = train['clean_tweet'].str.replace("[0-9]", " ", regex=True)
test['clean_tweet'] = test['clean_tweet'].str.replace("[0-9]", " ", regex=True)

# collapse repeated whitespace into single spaces
train['clean_tweet'] = train['clean_tweet'].apply(lambda x: ' '.join(x.split()))
test['clean_tweet'] = test['clean_tweet'].apply(lambda x: ' '.join(x.split()))
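
The four cleaning steps can also be sketched as one function applied to a single string, which makes their combined effect easier to see. The function name `clean_text` and the sample tweet are hypothetical; the steps and the punctuation list mirror the cell above.

```python
import re

# Same punctuation list as the cell above, stored as a set for fast lookups.
PUNCTUATION = set('!"#$%&()*+-/:;<=>?@[\\]^_`{|}~')

def clean_text(text):
    """Strip punctuation, lowercase, blank out digits, collapse whitespace."""
    text = ''.join(ch for ch in text if ch not in PUNCTUATION)
    text = text.lower()
    text = re.sub(r'[0-9]', ' ', text)
    return ' '.join(text.split())

sample = "Loving my NEW phone case #apple :) 100%"
print(clean_text(sample))  # loving my new phone case apple
```

Because periods, commas, and apostrophes are not in the punctuation list, a string such as `"Won't work."` comes out as `"won't work."`.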

Another step is to normalize the text. In sentiment analysis, reducing a word to its base form can benefit our model.

For instance, the base form of the words *sucked*, *sucking*, and *sucks* is *suck*. Quite often, a sentiment can be analyzed by the base word alone.

In [0]:
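# 'en' is a shortcut link from spaCy 2.x; spaCy 3+ needs the full model name (e.g. 'en_core_web_sm')
# the parser and NER components are disabled because only lemmas are needed
nlp = spacy.load('en', disable=['parser', 'ner'])

def lemmatization(texts):
    """Return each text with every token replaced by its lemma."""
    output = []
    for text in texts:
        lemmas = [token.lemma_ for token in nlp(text)]
        output.append(' '.join(lemmas))
    return output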

In [0]:
train['clean_tweet'] = lemmatization(train['clean_tweet'])
test['clean_tweet'] = lemmatization(test['clean_tweet'])

Now let's check out our original tweets and compare them with their cleaned versions:

In [0]:
train.sample(10)

Unnamed: 0,id,label,tweet,clean_tweet
2823,2824,0,Sunrise . . . #sky #skylovers #skyline #horizo...,sunrise . . . sky skylover skyline horizon clo...
3103,3104,0,pamela neighbours in love #Jeanne #and #pamela...,pamela neighbour in love jeanne and pamela nei...
7731,7732,0,hacker Wifi Password Prank https://goo.gl/uKuX...,hacker wifi password prank android app beautif...
3413,3414,0,I just want someone to me as much as my #iPhon...,i just want someone to -PRON- as much as -PRON...
2783,2784,0,Summer sunday morning #enjoyeverymoment #iphon...,summer sunday morning enjoyeverymoment iphone ...
3154,3155,0,@Xbox Any news as to when the new Nokia Lumia ...,xbox any news as to when the new nokia lumia w...
2983,2984,1,Just spent 40 freakin bucks on an iPhone charg...,just spend freakin buck on an iphone charger a...
5055,5056,0,Cakesmash #Sony #A6300 #Selp18105g #F4 #Cakesm...,cakesmash sony a selp g f cakesmash lumberjack...
5033,5034,0,"i12 - ""APPLE"" #stevejobs #apple #iphone7 #ipho...",i apple stevejob apple iphone iphone plus ipad...
1082,1083,0,Happy new year freckles #freckles #portrait #s...,happy new year freckle freckle portrait smile ...


In [0]:
!pip install "tensorflow>=1.7.0"
!pip install tensorflow-hub



In [0]:
import tensorflow_hub as hub
import tensorflow as tf

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

W0610 17:51:19.924640 140589041088384 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14
W0610 17:51:20.386457 140589041088384 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py:3632: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.


In [0]:
# write a random sentence as an example for ELMo
x = ["Roasted ants are a popular snack in Colombia"]

# extract ELMo features 
embeddings = elmo(x, signature="default", as_dict=True)["elmo"]

embeddings.shape

TensorShape([Dimension(1), Dimension(8), Dimension(1024)])

The output is a three-dimensional tensor, TensorShape(*D1*, *D2*, *D3*), here of shape (1, 8, 1024):


D1.   The number of input strings (here, one sentence)

D2.   The number of tokens in the longest string in the input list (here, eight)

D3.   The length of the ELMo vector for each token
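
The three dimensions can be reproduced without running the model. The sketch below assumes that tokens are obtained by splitting on whitespace (an assumption made for illustration) and uses made-up sentences; `expected_elmo_shape` is a hypothetical helper, not part of TensorFlow Hub.

```python
ELMO_DIM = 1024  # length of each ELMo vector, as reported in the shape above

def expected_elmo_shape(sentences):
    """Return (number of strings, tokens in the longest string, vector length)."""
    token_counts = [len(sentence.split()) for sentence in sentences]
    return (len(sentences), max(token_counts), ELMO_DIM)

# One eight-token sentence gives the same kind of shape as the example above.
single = expected_elmo_shape(["my new phone has a really great camera"])
print(single)  # (1, 8, 1024)

# With several strings, the second dimension follows the longest one.
batch = expected_elmo_shape(["good phone", "this charger broke after one day"])
print(batch)  # (2, 6, 1024)
```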



Let's use a function that extracts the ELMo vectors for the cleaned tweets in the train and test datasets. 

Then, to arrive at the vector representation of an entire tweet, we take the mean of the ELMo vectors of constituent terms or tokens of the tweet.

In [0]:
def elmo_vectors(x):
  embeddings = elmo(x.tolist(), signature="default", as_dict=True)["elmo"]

  with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    # return average of ELMo features
    return sess.run(tf.reduce_mean(embeddings,1))
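
To show what averaging over the token axis means, here is a toy, pure-Python sketch of mean pooling. It uses 4-dimensional vectors instead of 1024-dimensional ones, and the numbers and the helper name `mean_pool` are made up for illustration.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into a single vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]

# A toy "tweet" with three tokens, one vector per token.
toy_tweet = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
tweet_vector = mean_pool(toy_tweet)
print(tweet_vector)  # [2.0, 1.0, 2.0, 2.0]
```

`tf.reduce_mean(embeddings, 1)` performs this averaging along axis 1 for every tweet in the batch at once. Since that axis has the length of the longest tweet in the batch, the plain mean presumably also covers padding positions of shorter tweets; a length-aware mean would be one possible refinement.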

Split both the train and test sets into batches of one hundred samples each to avoid running out of memory while running the above function:

In [0]:
list_train = [train[i:i+100] for i in range(0,train.shape[0],100)]
list_test = [test[i:i+100] for i in range(0,test.shape[0],100)]
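
The slicing above can be checked with plain integers. This sketch computes the start and end row of each batch for the dataset sizes reported earlier (7920 train rows, 1953 test rows); `batch_bounds` is a hypothetical helper used only for illustration.

```python
def batch_bounds(n_rows, batch_size=100):
    """Return (start, end) row indices for consecutive batches."""
    return [(start, min(start + batch_size, n_rows))
            for start in range(0, n_rows, batch_size)]

train_batches = batch_bounds(7920)
test_batches = batch_bounds(1953)
print(len(train_batches), train_batches[-1])  # 80 (7900, 7920)
print(len(test_batches), test_batches[-1])    # 20 (1900, 1953)
```

The last batch is simply shorter than the others (20 and 53 rows respectively), which slicing handles without any special case.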

Now iterate through these batches and extract ELMo vectors:

In [0]:
# Extract ELMo embeddings
elmo_train = [elmo_vectors(x['clean_tweet']) for x in list_train]
elmo_test = [elmo_vectors(x['clean_tweet']) for x in list_test]

Concatenate the per-batch vectors back into a single array:

In [0]:
elmo_train_new = np.concatenate(elmo_train, axis = 0)
elmo_test_new = np.concatenate(elmo_test, axis = 0)

Save the ELMo-vectorized arrays as *pickle* files:

In [0]:
# save elmo_train_new
with open("elmo_train_03032019.pickle", "wb") as pickle_out:
    pickle.dump(elmo_train_new, pickle_out)

# save elmo_test_new
with open("elmo_test_03032019.pickle", "wb") as pickle_out:
    pickle.dump(elmo_test_new, pickle_out)

Load the pickle files in which we saved our ELMo-vectorized arrays:

In [0]:
# load elmo_train_new
with open("elmo_train_03032019.pickle", "rb") as pickle_in:
    elmo_train_new = pickle.load(pickle_in)

# load elmo_test_new
with open("elmo_test_03032019.pickle", "rb") as pickle_in:
    elmo_test_new = pickle.load(pickle_in)
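
A small round-trip check can confirm that what is loaded matches what was saved. The sketch below uses a temporary directory and a short list as a stand-in for the real arrays; the file name `toy_vectors.pickle` is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for an array of tweet vectors (the real arrays are much larger).
vectors = [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vectors.pickle")
    # Write, then read back, closing each file handle automatically.
    with open(path, "wb") as f:
        pickle.dump(vectors, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == vectors)  # True
```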

In [0]:
from sklearn.model_selection import train_test_split

xtrain, xvalid, ytrain, yvalid = train_test_split(elmo_train_new, 
                                                  train['label'],  
                                                  random_state=42, 
                                                  test_size=0.2)

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

lreg = LogisticRegression()
lreg.fit(xtrain, ytrain)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
preds_valid = lreg.predict(xvalid)

In [0]:
f1_score(yvalid, preds_valid)

0.7752675386444708
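
For reference, the F1 score is the harmonic mean of precision and recall, and scikit-learn's `f1_score` computes it for the positive class (label 1) by default. The sketch below is a hand-rolled illustration on made-up labels, not a replacement for the library function; `binary_f1` is a hypothetical name.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels (hypothetical): 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = binary_f1(y_true, y_pred)
print(round(score, 4))  # 0.6667
```

A classifier that always predicts 0 would score an F1 of 0.0 on the negative class, even though its accuracy on an imbalanced dataset could look respectable.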

In [0]:
preds_test = lreg.predict(elmo_test_new)

In [0]:
# prepare submission dataframe
sub = pd.DataFrame({'id':test['id'], 'label':preds_test})

# write predictions to a CSV file
sub.to_csv("sub_lreg.csv", index=False)

# Future Considerations

---

A natural next step is to work with ELMo embeddings through fastai and PyTorch rather than TensorFlow Hub.