# Building The Classifier

This notebook is for building the classifier to analyze the tweets from 2011 and 2019. I found a dataset of tweets called [Sentiment140](http://help.sentiment140.com/for-students) that had already been given a positive and negative ranking based on which emoticons the tweet used. I decided to use that as a starting off point for building my classifier. 

First I import all the necessary libraries:

In [1]:
import pandas as pd
import re
import pickle
import nltk

In [2]:
# Formatting
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%pprint

Pretty printing has been turned OFF


Opening the dataset and putting it into a dataframe and annotated the columns based on what the Sentiment140 site had said. I am most interested in polarity (0 for negative and 4 for positive) and text.

In [3]:
#building classifier
f = open(r'D:\Documents\Classes\Spring2020\ling1340\Twitter-Positivity-Analysis\data\trainingandtestdata\training.1600000.processed.noemoticon.csv', 'r+')
classify = pd.read_csv(f, index_col=False, names=["polarity", "tweet_id", "date", "query", "username", "text"] ,error_bad_lines=False)

In [4]:
classify['polarity'].value_counts()

4    800000
0    800000
Name: polarity, dtype: int64

This is a huge corpus, over 1500000 entries! I spent a lot of time trying to work with the entire corpus which caused me a LOT of problems trying to compile. For the sake of time and resources I decided to pare down the corpus to about 1/4 the size.

I first shuffled the dataset, as the entries were ordered by polarity. Then I took the first 200,000 entries and put them in `classifysm`.

In [5]:
from sklearn.utils import shuffle
classify = shuffle(classify)
classify[:10]
classifysm = classify[:200000]

Unnamed: 0,polarity,tweet_id,date,query,username,text
608693,0,2223442081,Thu Jun 18 08:16:44 PDT 2009,NO_QUERY,ambick,My poor son is being circumcised right now
1463003,4,2064087901,Sun Jun 07 05:10:51 PDT 2009,NO_QUERY,gemcruz,@mulder8scully5 hi pet!!!
978725,4,1833860922,Mon May 18 00:43:13 PDT 2009,NO_QUERY,ShelleAmanda,@ddlovato Demi.. Don't say that! you are soo w...
911725,4,1752051769,Sat May 09 20:56:07 PDT 2009,NO_QUERY,benlawsonphoto,"@NicholeAudrey LOL! Ok, &quot;we&quot; will ge..."
255843,0,1984584311,Sun May 31 15:23:48 PDT 2009,NO_QUERY,MGHarris,@casshorowitz good luck. I totally failed to i...
914768,4,1752943543,Sat May 09 23:32:12 PDT 2009,NO_QUERY,ThankYouProject,@ggw_bach Thank you for YOUR positive energy +...
1421569,4,2058347713,Sat Jun 06 14:45:31 PDT 2009,NO_QUERY,iamdebra,@BethRosen Thanks. I'm going to see if I can ...
1003629,4,1880334713,Fri May 22 00:52:42 PDT 2009,NO_QUERY,Livedreams9,@billyraycyrus Tweet dreams :] Tweep tight :] ...
1051607,4,1961195251,Fri May 29 09:07:40 PDT 2009,NO_QUERY,santaaurelia,@safirathetiger thanks for follow
1473579,4,2065576124,Sun Jun 07 08:52:43 PDT 2009,NO_QUERY,haleyymae,@myria101 I am now


Then I sorted the dataframe into positive and negative tweets. I did this because I thought it might be useful and so that i could manipulate either polarity easier.

In [6]:
pos_tweets = pd.DataFrame()
pos_tweets = classifysm[classifysm['polarity'] == 4]
pos_tweets

neg_tweets = pd.DataFrame()
neg_tweets = classifysm[classifysm['polarity'] == 0]
neg_tweets

Unnamed: 0,polarity,tweet_id,date,query,username,text
1463003,4,2064087901,Sun Jun 07 05:10:51 PDT 2009,NO_QUERY,gemcruz,@mulder8scully5 hi pet!!!
978725,4,1833860922,Mon May 18 00:43:13 PDT 2009,NO_QUERY,ShelleAmanda,@ddlovato Demi.. Don't say that! you are soo w...
911725,4,1752051769,Sat May 09 20:56:07 PDT 2009,NO_QUERY,benlawsonphoto,"@NicholeAudrey LOL! Ok, &quot;we&quot; will ge..."
914768,4,1752943543,Sat May 09 23:32:12 PDT 2009,NO_QUERY,ThankYouProject,@ggw_bach Thank you for YOUR positive energy +...
1421569,4,2058347713,Sat Jun 06 14:45:31 PDT 2009,NO_QUERY,iamdebra,@BethRosen Thanks. I'm going to see if I can ...
...,...,...,...,...,...,...
1390508,4,2053224846,Sat Jun 06 03:45:54 PDT 2009,NO_QUERY,jetdillo,errr...should have said &quot;Leaving for RIC&...
893693,4,1691979600,Sun May 03 19:01:15 PDT 2009,NO_QUERY,StephParrott,Sitting here talking to my boyfriend who just ...
824434,4,1556242184,Sat Apr 18 22:07:14 PDT 2009,NO_QUERY,cyberfx1,@ShannonLeto Awwww....you made me SMILE
895955,4,1692855543,Sun May 03 20:56:52 PDT 2009,NO_QUERY,sonjaphelps,has just hung with Drew in FL ~ great friends ...


Unnamed: 0,polarity,tweet_id,date,query,username,text
608693,0,2223442081,Thu Jun 18 08:16:44 PDT 2009,NO_QUERY,ambick,My poor son is being circumcised right now
255843,0,1984584311,Sun May 31 15:23:48 PDT 2009,NO_QUERY,MGHarris,@casshorowitz good luck. I totally failed to i...
400617,0,2057455901,Sat Jun 06 13:02:18 PDT 2009,NO_QUERY,Brittneyondich,Grad party and then a soriee. I wish I was at ...
309062,0,2000826995,Mon Jun 01 23:04:09 PDT 2009,NO_QUERY,supergirlnancy,twitter wont let me upload a new pic ???
146776,0,1882577611,Fri May 22 07:07:17 PDT 2009,NO_QUERY,ShakilaKelley,sp proud of the mr. a lil disappointed that i ...
...,...,...,...,...,...,...
168678,0,1962256864,Fri May 29 10:46:01 PDT 2009,NO_QUERY,purity_xo,"@reactiveretro yeah i agree, he was the best! ..."
211840,0,1974589898,Sat May 30 13:26:40 PDT 2009,NO_QUERY,bobbi10100,@TessMorris Poor you Hope it doesn't last lo...
135109,0,1836430771,Mon May 18 07:59:13 PDT 2009,NO_QUERY,Torontonian_Fan,@sofdlovesbsb i wish i had gone wouldnt it ha...
667563,0,2245648277,Fri Jun 19 16:26:15 PDT 2009,NO_QUERY,kangaroo5383,"goodbye t-mobile, you've been good to me all t..."


Then, I cleaned the text column and took out things like usernames, hashtags, and other symbols. I word tokenized the cleaned list and normalized all the words to lowercase.

In [7]:
pos_list = []
neg_list= []
pos_list = ([re.sub(r'(?:(@|&|;|http|https)[\w_]+)', '', i) for i in pos_tweets['text']])
neg_list = ([re.sub(r'(?:(@|&|;|http|https)[\w_]+)', '', i) for i in neg_tweets['text']])

In [8]:
from nltk.tokenize import word_tokenize
pos_toks = [word_tokenize(i) for i in pos_list]
neg_toks = [word_tokenize(i) for i in neg_list]

In [9]:
poslower = []
for line in pos_toks:
    poslower.append([w.lower() for w in line])

In [10]:
neglower = []
for line in neg_toks:
    neglower.append([w.lower() for w in line])

I then took the tokenized words and started to build a simple Naive Bayers classifier based on chapter 6 of the nltk book for [this](https://www.nltk.org/book/ch06.html) movie reviews classifier. 

In [11]:
posneglower = poslower+neglower

In [12]:
i = 1
all_words = [nltk.FreqDist(posneglower[0])]
while True:
    all_words = nltk.FreqDist(posneglower[i])
    if i > len(posneglower)-2:
        break
    i = i+1
word_features = list(all_words)[:2000]

In [13]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [14]:
#creating a tuple of the tweet tokens and the polarity ranking
pos_tup = [(x,4) for x in poslower]
neg_tup = [(x,0) for x in neglower]
posng_tup = pos_tup+neg_tup

from random import shuffle
shuffle(posng_tup)
posng_tup[:5]

[(['o', '...', '..k', '...', '.', 'that', 'looks', 'like', 'a', 'really', 'wierd', 'show', ':', 's', '...', '.', 'i', 'ca', "n't", 'watch', 'a', 'show', 'without', 'plot', 'lines'], 4), (['i', 'knoww', ',', 'so', 'excitedd', 'we', "'ll", 'need', 'to', 'have', 'a', 'major', 'catch-up', 'before', 'we', 'get', 'the', 'bus', 'ca', "n't", 'wait', 'to', 'see', 'you', '!', 'xx'], 4), (['baby', 'shower', 'today', '...', 'and', 'not', 'of', 'the', '4-legged', 'variety'], 0), (['seriously', '!', 'i', 'know', 'this', 'cuz', 'i', 'have', 'a', 'gay', 'friend', 'lol', 'i', "'m", 'trying', 'to', 'cheer', 'up', 'bb..', 'it', 'ai', "n't", 'easy', 'though'], 0), (['you', "'re", 'probably', 'not', 'but', 'you', "'re", 'only', 'one', 'i', 'know', 'that', 'dances', 'in', 'there', 'undies', 'on', 'twitter', 'lol'], 4)]

In [15]:
featuresets = [(document_features(d), c) for (d,c) in posng_tup]
#90/10 split to the data
train_set, test_set = featuresets[20000:], featuresets[:20000]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [16]:
nltk.classify.accuracy(classifier, test_set)

0.5888

In [17]:
classifier.show_most_informative_features(20)

Most Informative Features
         contains(check) = True                4 : 0      =      3.0 : 1.0
           contains(n't) = True                0 : 4      =      2.2 : 1.0
         contains(marks) = True                0 : 4      =      2.1 : 1.0
           contains(but) = True                0 : 4      =      1.7 : 1.0
          contains(went) = True                0 : 4      =      1.5 : 1.0
            contains(my) = True                0 : 4      =      1.5 : 1.0
             contains(i) = True                0 : 4      =      1.4 : 1.0
          contains(they) = True                0 : 4      =      1.4 : 1.0
             contains(i) = False               4 : 0      =      1.3 : 1.0
          contains(were) = True                0 : 4      =      1.2 : 1.0
            contains(to) = True                0 : 4      =      1.2 : 1.0
         contains(there) = True                0 : 4      =      1.2 : 1.0
             contains(,) = True                4 : 0      =      1.1 : 1.0

I admit this is not the best result, but I spent so long fiddling with the data that unfortuately most of my time went into that. 

Moving forward I want to clean up the test data in a way that improves accuracy, and also look into a classifier better suited to this dataset. For now, i'm going to move forward with this less-than-ideal classifier and describe the kinds of things I would look into with a better classifier in the future. I'm just going to pickle a few important things so I can use them back in [this notebook](https://github.com/Data-Science-for-Linguists-2020/Twitter-Positivity-Analysis/blob/master/notebooks/data_analysis.ipynb). See you there!

In [22]:
f = open( "classifier.pkl", "wb" )
pickle.dump(classifier, f)
f.close()

In [23]:
f = open( "word_features.pkl", "wb" )
pickle.dump(word_features, f)
f.close()