## Building a Sentiment Predicting Model on a Social Media Corpus

### Using the SemEval 2017 Task 4A: Positive/Negative/Neutral Classifier Corpus

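This model represents each tweet as a TF-IDF-weighted average of Word2Vec vectors trained on the corpus itself. It does not incorporate pretrained word embeddings or any smoothing.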

Helpful Resources:
https://medium.com/@thoszymkowiak/how-to-implement-sentiment-analysis-using-word-embedding-and-convolutional-neural-networks-on-keras-163197aef623

SemEval Tweet Download: 
https://github.com/seirasto/twitter_download

Most of this code is built on this post:
https://ahmedbesbes.com/sentiment-analysis-on-twitter-using-word2vec-and-keras.html

Another corpus to consider (1.6 million tweets labeled positive/negative):
https://drive.google.com/uc?id=0B04GJPshIjmPRnZManQwWEdTZjg&export=download




In [13]:

# Standard python helper libraries.
import collections
import itertools
import json
import os
import re
import sys
import time

# Numerical manipulation libraries.
import numpy
import pandas as pd
from scipy import stats
import scipy.optimize

# NLTK
import nltk
from nltk.tokenize import word_tokenize

# Helper libraries (from w266 Materials).
# import segment
from shared_lib import utils
from shared_lib import vocabulary

# Machine Learning Packages
from sklearn.model_selection import train_test_split

# Word2Vec Model
import gensim
from gensim.models.word2vec import Word2Vec # the word2vec model gensim class

# Keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, Conv1D, Flatten, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.callbacks import TensorBoard


In [2]:
# Pull in Tweet Data (Must be downloaded using https://github.com/seirasto/twitter_download)
tweets = pd.read_table("Data/twitter_download-master/2016train.txt_semeval_tweets.txt", header=None)
tweets

# SemEval Dataset is actually relatively small (6000 tweets in 2016). 
# We can group all of the Train/Test/Dev data from 2013 through 2016 to get more.
# Additionally, we could consider using this data which has 1.6 million rows but it is only a binary positive/negative class 
# https://drive.google.com/uc?id=0B04GJPshIjmPRnZManQwWEdTZjg&export=download

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 628949369883000832 | negative | dear @Microsoft the newOoffice for Mac is grea... |
| 1 | 628976607420645377 | negative | @Microsoft how about you make a system that do... |
| 2 | 629023169169518592 | negative | Not Available |
| 3 | 629179223232479232 | negative | Not Available |
| 4 | 629186282179153920 | neutral | If I make a game as a #windows10 Universal App... |
| 5 | 629226490152914944 | positive | Microsoft, I may not prefer your gaming branch... |
| 6 | 629345637155360768 | negative | @MikeWolf1980 @Microsoft I will be downgrading... |
| 7 | 629394528336637953 | negative | @Microsoft 2nd computer with same error!!! #Wi... |
| 8 | 629650766580609026 | positive | Just ordered my 1st ever tablet; @Microsoft Su... |
| 9 | 629797991826722816 | negative | After attempting a reinstall, it still bricks,... |
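Rows 2 and 3 above hold `Not Available` instead of tweet text (presumably tweets the download script could not retrieve), yet they still carry a label. Dropping them before training is probably worth trying. A minimal sketch, assuming the same tab-separated layout of id, label, text; the sample rows and the helper name `load_available` are made up for illustration:

```python
import csv
import io

# Hypothetical sample in the same tab-separated layout (tweet id, label, text).
SAMPLE = (
    "1\tnegative\tthis update broke everything\n"
    "2\tnegative\tNot Available\n"
    "3\tpositive\tloving the new tablet\n"
    "4\tneutral\tNot Available\n"
)

def load_available(handle, placeholder="Not Available"):
    """Return (label, text) pairs, skipping rows whose text is the placeholder."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(label, text) for _id, label, text in reader if text != placeholder]

rows = load_available(io.StringIO(SAMPLE))
print(len(rows), "usable rows out of 4")
```

With the DataFrame loaded in this notebook, the equivalent filter would be `tweets = tweets[tweets[2] != "Not Available"]`.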


In [3]:
# Segregate X and Y
X = tweets[2]
Y = tweets[1]

### Cleaning

In [4]:
# Tokenize each tweet on whitespace. Series.apply builds a new Series, which is
# faster than assigning row by row and avoids pandas' SettingWithCopyWarning.
X = X.apply(lambda tweet: tweet.split())


# print(X)

# Alternatively, use this?
# from nltk.tokenize import TweetTokenizer # a tweet tokenizer from nltk.
# tokenizer = TweetTokenizer()
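Whitespace splitting leaves punctuation glued to words (for example `wrong,` or `Friday:`) and keeps case differences. A tweet-aware tokenizer is one alternative. Below is a rough regex-based sketch, not the NLTK `TweetTokenizer`; the pattern and the decision to lowercase are my own assumptions:

```python
import re

# One alternative per token type, tried left to right.
TOKEN_RE = re.compile(
    r"https?://\S+"      # URLs
    r"|[@#]\w+"          # @mentions and #hashtags
    r"|\w+(?:'\w+)*"     # words, including contractions such as doesn't
    r"|[^\w\s]"          # any other single non-space character
)

def tokenize(tweet):
    """Lowercase a tweet and split it into tokens."""
    return TOKEN_RE.findall(tweet.lower())

# Made-up example tweet.
print(tokenize("@user the update doesn't work!!! #fail http://example.com/x"))
```

Mentions, hashtags and URLs stay as single tokens, while each `!` becomes its own token.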


### Preprocessing & (future) Feature Engineering

### Split Train and Test Sets

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.20,random_state=100)
print(X_train[:10])

4395    [#Apology, to, Jeb, Bush, for, John, Dempsey, ...
4068    [Top, 5, Gambling, Apps, for, the, iPad, http:...
3710                                     [Not, Available]
4516    [@Milbank, doesn't, think, Vice, President, Jo...
4243    [1st, day, back, at, work, after, a, terrible,...
4288                                     [Not, Available]
4916                                     [Not, Available]
1360    [1), may, be, wrong,, but, if, I, read, it, ri...
1012    [@GailSimone, Donald, Trump, may, think, he's,...
1538    [#nowplaying, Bob, Marley, -, Sun, Is, Shining...
Name: 2, dtype: object
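The split above is purely random, so the class proportions in train and test can drift apart. `train_test_split` accepts a `stratify=` argument (for example `stratify=Y`) to keep them aligned. The idea can be sketched in plain Python; the function name and the toy labels below are hypothetical:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=100):
    """Return (train_idx, test_idx) with every class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels, made up for illustration.
labels = ["neutral"] * 50 + ["positive"] * 30 + ["negative"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```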


### Word2Vec

In [6]:
# Convert each word into a vector representation. I couldn't get Keras working with
# raw word indexes, so I followed the steps laid out here:
# https://ahmedbesbes.com/sentiment-analysis-on-twitter-using-word2vec-and-keras.html

vec_dim = 10

tweet_w2v = Word2Vec(size=vec_dim, min_count=2) # vector size and minimum count for a word to be kept
sentences = X_train.tolist() # a list can be iterated more than once, unlike a generator
tweet_w2v.build_vocab(sentences)
tweet_w2v.train(sentences, total_examples=tweet_w2v.corpus_count, epochs=1)

49155

In [7]:
tweet_w2v.wv.most_similar('yes')

[('livestream', 0.8935967683792114),
 ('Friday:', 0.8406975269317627),
 ('recent', 0.795516312122345),
 ("it'd", 0.7905986905097961),
 ('fucked', 0.7877628803253174),
 ('books', 0.778687059879303),
 ('point', 0.7704617977142334),
 ('&amp;...', 0.770136833190918),
 ('happiest', 0.7694127559661865),
 ('Edgar', 0.7674282789230347)]
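The scores returned by `most_similar` are cosine similarities between word vectors. The neighbours of `yes` look unrelated, which is plausible for 10-dimensional vectors trained for a single epoch on a few thousand tweets. A small sketch of the measure itself, using made-up vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors: same direction, orthogonal, opposite.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```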

### Create Word Vectors 
https://ahmedbesbes.com/sentiment-analysis-on-twitter-using-word2vec-and-keras.html

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

print('building tf-idf matrix ...')
# The tweets are already token lists, so the analyzer passes them through unchanged.
vectorizer = TfidfVectorizer(analyzer=lambda x: x, min_df=10)
matrix = vectorizer.fit_transform(X_train)
tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
print('vocab size :', len(tfidf))

building tf-idf matrix ...
vocab size : 798
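Only the `idf_` weights are kept from the vectorizer. With its default settings, scikit-learn computes a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and `min_df` drops tokens that appear in too few documents. A plain-Python sketch of that calculation on made-up documents (the helper name is hypothetical):

```python
import math
from collections import Counter

def smoothed_idf(documents, min_df=1):
    """Map each token to ln((1 + n) / (1 + df)) + 1, keeping tokens with df >= min_df."""
    n_docs = len(documents)
    # Count each token once per document it appears in.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    return {
        token: math.log((1 + n_docs) / (1 + df)) + 1
        for token, df in doc_freq.items()
        if df >= min_df
    }

docs = [["good", "phone"], ["bad", "phone"], ["good", "day"]]
idf = smoothed_idf(docs, min_df=2)
print(sorted(idf))
```

Here `bad` and `day` appear in one document each, so `min_df=2` removes them, just as `min_df=10` shrinks the vocabulary to 798 tokens above.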


In [9]:
def buildWordVector(tokens, size):
    """Average the TF-IDF-weighted Word2Vec vectors of a tweet's tokens.

    Returns an array of shape (1, size); it stays all zeros when no token is known.
    """
    vec = numpy.zeros((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += tweet_w2v.wv[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError: # the token is missing from the Word2Vec vocabulary
                         # or from the tf-idf table; skip it.
            continue
    if count != 0:
        vec /= count
    return vec
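To make the arithmetic concrete, here is the same averaging written with plain lists and toy values. The vectors and weights are invented, and `weighted_average_vector` is a hypothetical stand-in for `buildWordVector`:

```python
def weighted_average_vector(tokens, vectors, weights, size):
    """Sum weight * vector over known tokens, then divide by the number of known tokens."""
    total = [0.0] * size
    count = 0
    for token in tokens:
        if token not in vectors or token not in weights:
            continue  # mirrors the KeyError branch: unknown tokens are skipped
        for i, value in enumerate(vectors[token]):
            total[i] += value * weights[token]
        count += 1
    if count:
        total = [value / count for value in total]
    return total

# Toy 2-dimensional word vectors and tf-idf weights.
vectors = {"good": [1.0, 0.0], "phone": [0.0, 2.0]}
weights = {"good": 2.0, "phone": 1.0}
print(weighted_average_vector(["good", "phone", "unseen"], vectors, weights, 2))
```

The weighted sum is `[2.0, 2.0]` and two tokens are known, so the result is `[1.0, 1.0]`. Note that the sum is divided by the token count, not by the sum of the weights, so tweets with high-IDF words get larger vectors.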

In [10]:
from sklearn.preprocessing import StandardScaler

# Build one averaged vector per tweet.
train_vecs_w2v = numpy.concatenate([buildWordVector(z, vec_dim) for z in X_train])

# Fit the scaler on the training vectors only, then reuse it for the test vectors.
scaler = StandardScaler()
scaler.fit(train_vecs_w2v)
train_vecs_w2v = scaler.transform(train_vecs_w2v)

test_vecs_w2v = numpy.concatenate([buildWordVector(z, vec_dim) for z in X_test])
test_vecs_w2v = scaler.transform(test_vecs_w2v)
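The scaler is fitted on the training vectors and only applied to the test vectors, so no test statistics leak into training. What `StandardScaler` does per column can be sketched in plain Python (toy numbers; the helper names are hypothetical):

```python
import statistics

def fit_scaler(rows):
    """Return per-column means and population standard deviations."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # A constant column has zero spread; use 1.0 instead to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Subtract the column mean and divide by the column standard deviation."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

train = [[1.0, 10.0], [3.0, 10.0]]
test = [[2.0, 12.0]]
means, stds = fit_scaler(train)          # statistics come from the training rows only
print(transform(train, means, stds))
print(transform(test, means, stds))
```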

In [11]:
from keras.utils.np_utils import to_categorical
from sklearn.preprocessing import LabelEncoder

print("Original Y:", y_train[:10])
encoder = LabelEncoder()
encoder.fit(Y)
y_train = encoder.transform(y_train)
y_test = encoder.transform(y_test)
print("Encoded Y:", y_train[:10])

y_train = to_categorical(y_train, 3)
y_test = to_categorical(y_test, 3)
print("One Hot Y:", y_train[:10])

Using TensorFlow backend.


Original Y: 4395     neutral
4068     neutral
3710    positive
4516     neutral
4243    negative
4288     neutral
4916     neutral
1360     neutral
1012     neutral
1538    positive
Name: 1, dtype: object
Encoded Y: [1 1 2 1 0 1 1 1 1 2]
One Hot Y: [[ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
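The two steps above (string label to integer, integer to one-hot row) can be reproduced in a few lines of plain Python. This is only a sketch of the idea; the classes are sorted alphabetically, which gives the same negative/neutral/positive order as the encoded output above:

```python
def encode_labels(labels):
    """Map each label to the index of its class in alphabetical order."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]

def one_hot(indices, num_classes):
    """Turn each class index into a row with a single 1.0."""
    return [[1.0 if i == idx else 0.0 for i in range(num_classes)] for idx in indices]

# First five labels from the output above.
classes, encoded = encode_labels(["neutral", "neutral", "positive", "neutral", "negative"])
print(classes)
print(encoded)
print(one_hot(encoded, len(classes)))
```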


### Dense Feed-Forward Model (really simple)

In [14]:
# Pad the sequence to the same length
# train_vecs_w2v = sequence.pad_sequences(train_vecs_w2v, maxlen=vec_dim)
# test_vecs_w2v = sequence.pad_sequences(test_vecs_w2v, maxlen=vec_dim)

# Build Keras Model
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=vec_dim))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_vecs_w2v, y_train, epochs=9, batch_size=32, verbose=2)

Epoch 1/9
 - 0s - loss: 1.0243 - acc: 0.4945
Epoch 2/9
 - 0s - loss: 0.9949 - acc: 0.5119
Epoch 3/9
 - 0s - loss: 0.9882 - acc: 0.5157
Epoch 4/9
 - 0s - loss: 0.9840 - acc: 0.5142
Epoch 5/9
 - 0s - loss: 0.9819 - acc: 0.5151
Epoch 6/9
 - 0s - loss: 0.9799 - acc: 0.5153
Epoch 7/9
 - 0s - loss: 0.9780 - acc: 0.5174
Epoch 8/9
 - 0s - loss: 0.9778 - acc: 0.5174
Epoch 9/9
 - 0s - loss: 0.9760 - acc: 0.5191


<keras.callbacks.History at 0x197f0f8a550>
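The model above is one hidden `Dense` layer with ReLU followed by a 3-way softmax. As a reminder of what a forward pass computes, here is a plain-Python sketch on a tiny network; the weights, layer sizes and inputs are invented and have nothing to do with the trained model:

```python
import math

def dense(inputs, weights, biases):
    """weights[j] holds the incoming weights of output unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    shift = max(values)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical tiny network: 2 inputs -> 2 hidden units -> 3 classes.
hidden = relu(dense([1.0, -1.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
probs = softmax(dense(hidden, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]))
print(hidden)
print(probs)
```

The three outputs sum to 1 and can be read as class probabilities, which is what `categorical_crossentropy` expects.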

In [15]:
score = model.evaluate(test_vecs_w2v, y_test, batch_size=128, verbose=2)
print("Accuracy: %.2f%%" % (score[1]*100))

Accuracy: 50.64%
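Accuracy alone is hard to interpret on a three-class problem with uneven classes. Two cheap checks are the accuracy of always predicting the most frequent training label, and the recall averaged over classes. A sketch with made-up labels (the function names are hypothetical, and no numbers here come from the SemEval data):

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority)
    return majority, hits / len(test_labels)

def macro_recall(true_labels, predicted_labels):
    """Mean of the per-class recalls, so every class counts equally."""
    correct, total = Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Made-up labels, purely for illustration.
train = ["neutral", "neutral", "neutral", "positive", "negative"]
test = ["neutral", "neutral", "positive", "negative"]
majority, baseline = majority_baseline_accuracy(train, test)
print(majority, baseline)
print(macro_recall(test, [majority] * len(test)))
```

If the model's test accuracy is close to the majority baseline, it has probably learned little beyond the class frequencies.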


In [16]:
# # LSTM for sequence classification

# from keras.models import Sequential
# from keras.layers import Dense
# from keras.layers import LSTM, Conv1D, Flatten, Dropout
# from keras.layers.embeddings import Embedding
# from keras.preprocessing import sequence
# from keras.callbacks import TensorBoard

# # # Using keras to load the dataset with the top_words
# # top_words = 10000


# # # Pad the sequence to the same length
# max_review_length = vec_dim
# X_train = sequence.pad_sequences(train_vecs_w2v, maxlen=max_review_length)
# X_test = sequence.pad_sequences(test_vecs_w2v, maxlen=max_review_length)

# # Using embedding from Keras
# # embedding_vector_length = 300
# model = Sequential()
# # model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))


# # Convolutional model (3x conv, flatten, 2x dense)
# model.add(Conv1D(64, 3, padding='same', input_shape=(None, vec_dim)))
# model.add(Conv1D(32, 3, padding='same'))
# model.add(Conv1D(16, 3, padding='same'))
# model.add(Flatten())
# model.add(Dropout(0.2))
# model.add(Dense(180,activation='relu'))
# model.add(Dropout(0.2))
# model.add(Dense(3,activation='softmax')) # one output per class; softmax over a single unit always outputs 1

# # Log to tensorboard
# tensorBoardCallback = TensorBoard(log_dir='./logs', write_graph=True)
# model.compile(loss='categorical_crossentropy', optimizer='adagrad', metrics=['accuracy'])

# model.fit(train_vecs_w2v, y_train, epochs=3, callbacks=[tensorBoardCallback], batch_size=64)

# # Evaluation on the test set
# scores = model.evaluate(test_vecs_w2v, y_test, verbose=0)
# print("Accuracy: %.2f%%" % (scores[1]*100))

### Applied to Reddit

In [17]:
# Function to Predict Positive/Neutral/Negative

def prediction(text):
    """Return the predicted sentiment label for a raw text string."""
    # Same order as the integer codes assigned by LabelEncoder (alphabetical).
    sentiment = ["Negative", "Neutral", "Positive"]
    tokens = text.split() # Tokenize
    vec = buildWordVector(tokens, vec_dim)
    vec = scaler.transform(vec)
    probs = model.predict(vec, batch_size=32)
    return sentiment[probs.argmax(axis=1)[0]]

In [18]:
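import os
import praw

# this is a read-only instance
# Credentials are read from environment variables (the variable names are a
# hypothetical choice) so that they are not stored in the notebook.
reddit = praw.Reddit(user_agent='first_scrape (by /u/dswald)',
                     client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'])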

subreddit = reddit.subreddit('Portland')
hot_posts = subreddit.hot(limit = 3) # need to view >2 to get past the stickied posts

for submission in hot_posts:
    if not submission.stickied: # skip posts pinned ('stickied') to the top of the subreddit
        print('Title: {}, ups: {}, downs: {}, Have we visited: {}'.format(submission.title,
                                                                          submission.ups,
                                                                          submission.downs,
                                                                          submission.visited))
        submission.comments.replace_more(limit=0) # drop "load more comments" placeholders, which have no body
        comments = submission.comments.list() # flattened list of all comments
        for comment in comments:
            print(20*'-')
            print('Parent ID:', comment.parent())
            print('Comment ID:', comment.id)
            print(comment.body)
            print("#"*10, 'PREDICTED SENTIMENT:', prediction(comment.body), "#"*10)

Title: Happy Whalesplosion Day!, ups: 349, downs: 0, Have we visited: False
--------------------
Parent ID: 7cgsh0
Comment ID: dppspoz
>THE BLAST BLASTED BLUBBER BEYOND ALL BELIEVABLE BOUNDS!
########## PREDICTED SENTIMENT: Positive ##########
--------------------
Parent ID: 7cgsh0
Comment ID: dpprm5u
What I would love, more than anything, would be to get a hold of the insurance claim made on [this car.](http://www.salem-news.com/nphotos/smashed-car.jpg)

That damage was caused by flying whale blubber.  That claim has to be framed in an insurance office somewhere as the weirdest claim they've ever seen.
########## PREDICTED SENTIMENT: Positive ##########
--------------------
Parent ID: 7cgsh0
Comment ID: dppr9fs
I remember first learning about this and being so stunned. After watching the video I was like sad and disgusted but also... so fucking amused- what a ridiculous plan. real life slap-stick

I'm all about respecting the corpses of all creatures and stuff, but like, the effort wa