# Deep Learning Assignment 4: Analyzing Sentiment

 * some data: http://help.sentiment140.com/for-students

**Columns:**

    0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
    1 - the id of the tweet (2087)
    2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
    3 - the query (lyx). If there is no query, then this value is NO_QUERY.
    4 - the user that tweeted (robotickilldozr)
    5 - the text of the tweet (Lyx is cool)

In [1]:
import pandas as pd

#### the data has already been split into train and test sets

In [2]:
cols = ['polarity','id', 'date', 'query', 'user', 'tweet']

data = pd.read_csv('sentiment.csv',names=cols, encoding='ISO-8859-1')
print('length of data {}'.format(len(data)))

length of data 1600000


This is a lot of data. That's great! However, it will take a long time to get through this notebook with all of that data, so I'm going to randomly sample 0.5% of it (8,000 tweets). We also don't need all of those columns, so let's keep only the ones we need.

In [3]:
data=data.sample(frac=0.005,random_state=200)
data = data.drop(['id', 'date', 'query', 'user'], axis=1)
#data[:3]
data

Unnamed: 0,polarity,tweet
1542165,4,@FatDaddySweets I usually adore your goodies f...
1023877,4,@trohman That's excellent!! We'd love to have ...
1345012,4,Going to go take a shower &amp; get ready
1336524,4,@jasdeep but there is a thin strip where ishq...
389691,0,"It's coming to an end.... Lived, Partied, and ..."
505832,0,Some people do not understand that should not ...
854712,4,@BOHEMiahne shower with a friend or two
1510237,4,Mojo...where art thou??
852964,4,@ddlovato don't care bout ppl who r saying u r...
1547906,4,@bigskyvip Get 100 followers a day using www.t...


How many of each type are there?

In [4]:
set(data.polarity)

{0, 4}

## 1.) How many of each polarity are there?

* Hint: use a mask over the `data` dataframe

In [5]:
print('There are ' + str(len(data[data.polarity == 0])) + ' entries with polarity 0')
print('There are ' + str(len(data[data.polarity == 4])) + ' entries with polarity 4')

There are 4014 entries with polarity 0
There are 3986 entries with polarity 4


## 2.) Change all 4s in polarity to 1

* Hint: a lambda function might be useful here

In [6]:
data['polarity'] = data['polarity'].map(lambda x: 1 if x == 4 else 0)

In [7]:
#data[:3]
data

Unnamed: 0,polarity,tweet
1542165,1,@FatDaddySweets I usually adore your goodies f...
1023877,1,@trohman That's excellent!! We'd love to have ...
1345012,1,Going to go take a shower &amp; get ready
1336524,1,@jasdeep but there is a thin strip where ishq...
389691,0,"It's coming to an end.... Lived, Partied, and ..."
505832,0,Some people do not understand that should not ...
854712,1,@BOHEMiahne shower with a friend or two
1510237,1,Mojo...where art thou??
852964,1,@ddlovato don't care bout ppl who r saying u r...
1547906,1,@bigskyvip Get 100 followers a day using www.t...


## 3.) split the data into 10% test, 10% dev, and 80% train

* create `test`, `dev`, and `train` data tables
* you can use the `.sample()` method for the dataframe
* print out the shapes of each of the three tables
* What is the baseline for this task? 

In [8]:
import random
from sklearn import model_selection as ms

# Shuffle the data so that each of our slices have some of each label
# Stratify based on class labels
# sklearn has functions for this, but only for splitting between train and test,
# so we hack that a bit to split between train and test/dev, then between test and
# dev
sss = ms.StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 1234)
train = None
dev_test = None
dev = None
test = None
for train_idx, dev_test_idx in sss.split(data.tweet, data.polarity):
    train = data.iloc[train_idx]
    dev_test = data.iloc[dev_test_idx]
sss = ms.StratifiedShuffleSplit(n_splits = 1, test_size = 0.5, random_state = 1234)
for dev_idx, test_idx in sss.split(dev_test.tweet, dev_test.polarity):
    dev = dev_test.iloc[dev_idx]
    test = dev_test.iloc[test_idx]
# drop dev_test from memory. I'm running into memory issues.
dev_test = None
# Expected output: ((128029, 2), (16004, 2), (16004, 2))
train.shape, dev.shape, test.shape

((6400, 2), (800, 2), (800, 2))
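
The baseline question above is not answered by the cell itself. One common choice, assumed here, is the majority-class baseline: the accuracy obtained by always predicting the most frequent label. A minimal sketch using the class counts printed in step 1 (the helper name `majority_baseline` is hypothetical):

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)

# Class counts reported in step 1: 4014 negative (0) and 3986 positive (1).
labels = [0] * 4014 + [1] * 3986
print(majority_baseline(labels))
```

Because the classes are nearly balanced, this baseline sits at about 50%, which is the number the models below need to beat.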

## 4.) Use a LabelEncoder to convert the tweet column to numbers

* I do this for you. Just run the following cells to see how well representing a full tweet as an index number works for this task.

In [9]:
# .values replaces .as_matrix(), which was removed in pandas 1.0
y = train.polarity.values

# Expected (128029,)
y.shape

(6400,)

In [10]:
from sklearn import preprocessing

leX = preprocessing.LabelEncoder()
leX.fit(data.tweet) # use the original data df so all possibilities are encoded
X = leX.transform(train.tweet)
X = X.reshape(X.shape[0], 1)

# Expected (128029, 1)
X.shape

(6400, 1)

In [11]:
from sklearn import linear_model

model = linear_model.LogisticRegression(penalty='l2')
model.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [12]:
Xdev = leX.transform(dev.tweet)
Xdev = Xdev.reshape(Xdev.shape[0], 1)
ydev = dev.polarity.values  # .as_matrix() was removed in pandas 1.0

In [13]:
from sklearn.metrics import accuracy_score

# Expected 0.49725068732816796
# accuracy_score takes (y_true, y_pred); accuracy itself is symmetric
accuracy_score(ydev, model.predict(Xdev))

0.56875

## 5.) Analysis of LabelEncoder

* How well does LabelEncoder perform compared to the baseline?
* Why does it perform so poorly? What does it have to do with the way the features are represented?

On the dev set the label encoder reaches 0.56875 accuracy, only slightly above the roughly 50% obtained by always guessing one class on this nearly balanced data. A label encoder makes no connection between different sentences. For example, the sentences s_1 = "I had a burger at McDonalds" and s_2 = "My favorite burger place is Five Guys" receive completely unrelated label values, even though both talk about food in general and burgers in particular. Encoding a whole sentence as a single number discards the words it contains, and with them any semantic connection between sentences.

In addition, a label encoder imposes an arbitrary order on the sentences. If the sentence s_3 = "I like cats" happens to be encoded as 3, s_1 happens to be encoded as 4, and s_2 happens to be encoded as 9999, then the model would treat s_1 as semantically closer to s_3 than to s_2, even though intuitively s_1 and s_2 should be closer to each other.
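
The ordering problem can be reproduced on the three example sentences. This is only a sketch: the helper `label_encode` is hypothetical and mimics a label encoder by ranking the distinct sentences in sorted order, so the resulting codes (0, 1, 2) differ from the illustrative 3, 4, and 9999 used above.

```python
def label_encode(sentences):
    """Map each distinct sentence to its rank in sorted order."""
    classes = sorted(set(sentences))
    return {sentence: index for index, sentence in enumerate(classes)}

s_1 = "I had a burger at McDonalds"
s_2 = "My favorite burger place is Five Guys"
s_3 = "I like cats"

codes = label_encode([s_1, s_2, s_3])

# The only notion of distance the model sees is the gap between integer codes.
gap_burger_vs_cats = abs(codes[s_1] - codes[s_3])
gap_burger_vs_burger = abs(codes[s_1] - codes[s_2])
print(gap_burger_vs_cats, gap_burger_vs_burger)
```

The two burger sentences end up farther apart than the burger and cat sentences, purely because of alphabetical order.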

## 6.) one-hot encoding

* Repeat the steps of preparing the test and dev data as in #4, only this time use one-hot vectors instead of the label encoder
* Hint: do you want to represent the entire tweet as a vector, or each word? (Hint: use words to make the one-hot encoder, then sum them to represent the entire tweet)
* Hint: try `get_dummies()`, or alternatively use scikit-learn's `OneHotEncoder`

In [14]:
# scipy.sparse for sparse matrices
import scipy as sp
import scipy.sparse  # load the submodule explicitly so sp.sparse is always available
import numpy as np

# Drop data to save memory
data = None

# Transform the data by splitting on word
# .values replaces .as_matrix(), which was removed in pandas 1.0
X_tr = train.tweet.map(lambda x: x.split()).values
X_dev = dev.tweet.map(lambda x: x.split()).values

y_tr = train.polarity.values
y_dev = dev.polarity.values

# I couldn't figure out how to get get_dummies or sklearn's OneHotEncoder
# to work with the words in the sentences instead of just the sentences.
# The one hot was just taking each sentence value and making a new column
# for each sentence instead of a column for each word, so I ended up kinda
# implementing my own sentence word one-hot encoder.

# Pass in a list of lists. The outer list is the rows of our training data,
# the inner lists are the words. Returns a list encoding indexes to words.
def one_hot_sentence_word_fit(X):
    result = set()
    for X_cur in X:
        for word_cur in X_cur:
            result.add(word_cur)
    # Convert result into a list to map index (column in the new matrix) to word.
    return list(result)

# X is the data to fit, model is the list of words seen in our training data.
# Encode the sentences as one-hot on each word. Words not in the model will be completely
# ignored in this approach. A second approach could be to have a wildcard entry in the table to hold
# words we've never seen before.
def one_hot_sentence_word_transform(X, model):
    # Was using csr, but I got a warning that lil would be better
    result = sp.sparse.lil_matrix((X.shape[0], len(model)))
    for i in range(X.shape[0]):
        for j in range(len(model)):
            word = model[j]
            num_appearances = X[i].count(word)
            # Only populate the matrix if non-zero to keep it sparse
            if num_appearances != 0:
                result[i, j] = num_appearances
    return result

ohm = one_hot_sentence_word_fit(X_tr)
X_tr = one_hot_sentence_word_transform(X_tr, ohm)
X_dev = one_hot_sentence_word_transform(X_dev, ohm)

In [15]:
def fit_predict_accuracy_lrm(X_tr, X_te, y_tr, y_te):
    lrm = linear_model.LogisticRegression(penalty = 'l2')
    lrm.fit(X_tr, y_tr)
    return accuracy_score(lrm.predict(X_te), y_te)

fit_predict_accuracy_lrm(X_tr, X_dev, y_tr, y_dev)

0.7175
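
The transform above scans the whole vocabulary for every tweet. A possible alternative, sketched here under the assumption that a word-to-column dictionary is acceptable, looks each word up directly and also implements the "wildcard entry" idea mentioned in the comments. The names `UNK`, `fit_vocab`, and `transform` are hypothetical and not part of the notebook.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical wildcard token for words never seen in training

def fit_vocab(token_lists):
    """Assign a column index to every training word, plus one UNK column."""
    vocab = {UNK: 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def transform(token_lists, vocab):
    """Return one {column: count} dict per tweet (a sparse row of word counts)."""
    rows = []
    for tokens in token_lists:
        counts = Counter(vocab.get(tok, vocab[UNK]) for tok in tokens)
        rows.append(dict(counts))
    return rows

train_tokens = [["lyx", "is", "cool"], ["cool", "cool", "story"]]
vocab = fit_vocab(train_tokens)
train_rows = transform(train_tokens, vocab)
unseen_rows = transform([["lyx", "is", "weird"]], vocab)
print(unseen_rows)
```

Each row could then be written into a sparse matrix column by column; the unseen word "weird" is counted in column 0 instead of being dropped.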

## 7.) word2vec

* download the `GoogleNews-vectors-negative300.bin` file from https://github.com/mmihaltz/word2vec-GoogleNews-vectors and unzip the file
* load the file by running the cell below (you may need to pip install gensim and you may need to change the path to the file)

In [16]:
#from gensim.models.word2vec import Word2Vec as w
from gensim.models import KeyedVectors as kv
w2v = kv.load_word2vec_format('/run/media/wd_blue_1000/word2vec/GoogleNews-vectors-negative300.bin',binary=True)

* You can access vectors like a dictionary:

In [17]:
w2v['red'][:3] # show the first three values for the vector for 'red'

array([ 0.09716797, -0.08496094,  0.27148438], dtype=float32)

* vectors are length 300

In [18]:
len(w2v['red'])

300

* Repeat the steps of preparing the test and dev data as in #4, only this time use w2v vectors
* How do you represent a tweet, which is multiple words, as a single vector? (Hint: try summing the vectors)
* Note: w2v only has lower-cased words
* Hint: if w2v doesn't have a word you are looking for, just ignore that word

In [19]:
# Transform the data by splitting on word
# .values replaces .as_matrix(), which was removed in pandas 1.0
X_tr = train.tweet.map(lambda x: x.split()).values
X_dev = dev.tweet.map(lambda x: x.split()).values

y_tr = train.polarity.values
y_dev = dev.polarity.values

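def word_lists_to_w2v_vectors(X):
    result = []
    for x_cur in X:
        # Start from a NumPy zero vector: `+=` on a Python list would
        # extend the list with the array's elements instead of adding them.
        cur_vec = np.zeros(300, dtype=np.float32)
        for word in x_cur:
            # Words missing from the w2v vocabulary are skipped
            if word in w2v:
                cur_vec += w2v[word]
        result.append(cur_vec)
    # One row per tweet, 300 columns
    return np.array(result)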

X_tr = word_lists_to_w2v_vectors(X_tr)
X_dev = word_lists_to_w2v_vectors(X_dev)

In [20]:
fit_predict_accuracy_lrm(X_tr, X_dev, y_tr, y_dev)

0.705
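
To make the summing step concrete, here is a small sketch with toy 3-dimensional vectors standing in for the 300-dimensional w2v vectors. The dictionary `toy_vectors` and the helper `tweet_vector` are hypothetical and their values are made up purely for illustration.

```python
# Toy embeddings standing in for w2v; the numbers are invented for illustration.
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "day": [0.5, 1.0, 0.0],
}

def tweet_vector(tokens, vectors, dim):
    """Sum the vectors of all known tokens; unknown tokens are ignored."""
    total = [0.0] * dim
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:
            continue  # same policy as the hint above: skip missing words
        total = [a + b for a, b in zip(total, vec)]
    return total

summed = tweet_vector(["good", "day", "omg!!!!!!!"], toy_vectors, 3)
print(summed)
```

The token "omg!!!!!!!" contributes nothing because it has no entry, so a tweet made up mostly of such tokens collapses toward the zero vector.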

## 8.) Comparing the three approaches

* Now that you've tried things out on your `dev` set, train on your `train`+`dev` data and test on your `test` data for all three approaches and report the results. 
* Why do you think one-hot and word2vec worked better than the label encoder?
* Did one-hot or word2vec work better? Why do you think that is the case?
* What do you think would happen if you cleaned up the tweets (e.g., removed punctuation, emojis, etc.)? 

In [21]:
# I don't quite understand the importance of the dev set. Your example code has training based on train,
# then testing based on dev. Now we're just training with train+dev and testing with test. Aren't those
# two things very similar?
def get_new_clean_matrices(train, dev, test):
    # .values replaces .as_matrix(), which was removed in pandas 1.0
    X_tr = train.tweet.map(lambda x: x.split()).values
    X_dev = dev.tweet.map(lambda x: x.split()).values
    X_te = test.tweet.map(lambda x: x.split()).values

    # Train on train + dev
    X_tr = np.concatenate((X_tr, X_dev))
    X_dev = None

    y_tr = train.polarity.values
    y_dev = dev.polarity.values
    y_te = test.polarity.values

    y_tr = np.concatenate((y_tr, y_dev))
    y_dev = None
    return X_tr, X_te, y_tr, y_te

# Label encoder
X_tr, X_te, y_tr, y_te = get_new_clean_matrices(train, dev, test)

lem = preprocessing.LabelEncoder()
lem.fit(np.concatenate((X_tr, X_te)))
X_tr = lem.transform(X_tr)
X_tr = X_tr.reshape(X_tr.shape[0], 1)
X_te = lem.transform(X_te)
X_te = X_te.reshape(X_te.shape[0], 1)

print(fit_predict_accuracy_lrm(X_tr, X_te, y_tr, y_te))

# Word one-hot encoder
X_tr, X_te, y_tr, y_te = get_new_clean_matrices(train, dev, test)

ohm = one_hot_sentence_word_fit(X_tr)
X_tr = one_hot_sentence_word_transform(X_tr, ohm)
X_te = one_hot_sentence_word_transform(X_te, ohm)

print(fit_predict_accuracy_lrm(X_tr, X_te, y_tr, y_te))

# w2v encoder
X_tr, X_te, y_tr, y_te = get_new_clean_matrices(train, dev, test)

X_tr = word_lists_to_w2v_vectors(X_tr)
X_te = word_lists_to_w2v_vectors(X_te)

print(fit_predict_accuracy_lrm(X_tr, X_te, y_tr, y_te))

0.56375
0.7375
0.74625


One-hot and word2vec worked better because they represent a tweet by the words it contains, which preserves much of the sentence's meaning. By processing each word in the sentence and then combining the results (in both cases by a sort of vector addition), we represent the sentence as a sum of words. Though we lose information about the ordering of the words, this representation is far more informative than a label encoder. In addition, those representations do not have the "arbitrary ordering" problem of the label encoder. In the case of one-hot, each word is a new orthogonal unit vector, so the dimensionality grows with the vocabulary. Thus all words are equidistant from each other, whereas with the label encoder some sentences are closer together and others farther apart with no rhyme or reason.

One-hot and w2v worked about the same on the test set (0.7375 and 0.74625 accuracy), which surprised me. I was expecting w2v to pull ahead of the one-hot encoder, because w2v is specially designed for this purpose. There could be several reasons for this.

One reason is that tweets don't follow rigorous English standards, so you may end up with a word at the end of a sentence being "omg!!!!!!!", which may not be present in that exact form in the w2v dictionary. In that case, my w2v code ignores that word and moves on. If tweets are composed of many of these abbreviations or words with weird punctuation, the sentence's representation could be pared down from, say, 12 words to a sum of a mere four or five vectors. Note that this problem is also present in the one-hot encoder, but only if we train the encoder on just the training data. If we were to train the one-hot encoder on train + test (potentially bad practice), we could capture all words in all tweets present in the data set, so each word would have a unique orthogonal vector (no information loss).

Another potential reason that the w2v encoder didn't pull ahead is that the w2v vectors are constrained to 300 dimensions. Though this may seem like a lot, it pales in comparison to the thousands of dimensions of the one-hot encoder. Though the w2v dictionary is specially designed for this purpose, this dimensionality could be limiting how accurately it represents words. It could be that the developers made a tradeoff decision based on the size of the w2v data. The scope of the dictionary is huge, since it must basically cover every word of the English language. The dictionary is already ~3 GB, and adding more dimensions would balloon that size even more.

If the tweets were cleaned up, both the one-hot and the w2v encoders would likely perform better. They would encounter fewer words that are not in their respective dictionaries and thus ignore less of the input data. In addition, for the one-hot encoder, words like "omg!!!!!" and "omg" would no longer map to different one-hot values, so the model would treat those two words as the same (as it should, I think).
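
One possible cleanup step, sketched below, lower-cases each tweet and keeps only runs of letters, digits, and apostrophes. The helper `clean_tokens` and the choice of pattern are assumptions, and whether this actually improves accuracy was not tested in this notebook.

```python
import re

def clean_tokens(tweet):
    """Lower-case a tweet and keep only runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", tweet.lower())

tokens = clean_tokens("OMG!!!!!!! that's excellent...")
print(tokens)
```

With this tokenizer, "omg!!!!!" and "omg" yield the same token, so they would share one one-hot column and one w2v lookup. It could be swapped in for `x.split()` in the cells above.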