Creating a sentiment analyzer using Python's Natural Language Toolkit.

<!--more-->

TABLE OF CONTENTS

This is the fifth post in an ongoing Pokemon Go analysis series.  Last time, we discussed how a Naive Bayes Classifier can be used to predict the class of a sample given a number of features about that sample.  In this post, we'll apply the technique to our Pokemon Go tweets to build a sentiment analyzer that automatically classifies whether each tweet has a positive or negative tone.  Once it's complete, we'll use the sentiment analyzer to remove negative tweets from our data set, and then use the remaining positive tweets to map out the dominance of each team in each state.


We'll cover the following topics:

1. [Manually classifying a training set](#trainSet)
2. [Python's Natural Language Toolkit](#NLTK)
3. [Extracting features from our data](#features)
4. [Training our sentiment analyzer](#features)
5. [Evaluating the analyzer's performance](#eval)


# <a name="trainSet"></a> Manually classifying a training set

Recall from our last post that the implementation of a Naive Bayes Classifier requires a training set of samples which have known features and known classes.  Right now, all of the tweets we've collected are unclassified.  Before we create our sentiment analyzer, we'll have to manually label a subset of the tweets as positive or negative to form our training set.  

Manually labeling tweets is time-consuming and tedious, but we can use Python to make the process a little more bearable.  First, we'll import the Pandas and JSON libraries.

In [2]:
import json
import pandas as pd

If you aren't familiar with the Pandas package, I highly recommend watching [Wes McKinney's tutorial](https://github.com/estimate/pandas-exercises) on the library that he created.  Pandas is designed to provide intuitive interactions with tabular data by introducing data frames to Python.  We'll be using it to create some data frames about our Pokemon Go tweets. Before we do so, we'll need to define a function that loads our Pokemon Go tweets from the JSON text file we created in the Tweepy blog post:

In [3]:
#Define a function to load our Pokemon Go tweets
def load_twitter_data(tweets_data_path):
    tweets_data = []
    
    #Open the text file that contains the tweets we collected;
    #the with block closes the file once we're done reading
    with open(tweets_data_path, "r") as tweets_file:
    
        #Read the text file line by line
        for line in tweets_file:
        
            #Append the content of each tweet to a tweets_data list,
            #skipping any line that isn't valid JSON (e.g. blank lines)
            try:
                tweet = json.loads(line)
                tweets_data.append(tweet)
            except ValueError:
                continue
            
    #return the list of tweets_data
    return tweets_data
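
To see why the loader skips lines that fail to parse, here is a minimal sketch that runs the same idea against a small made-up file. The sample tweets, the temporary file, and the name `load_json_lines` are all assumptions for illustration; they aren't part of the data we collected.

```python
import json
import os
import tempfile

def load_json_lines(path):
    """Return every line of `path` that parses as JSON, skipping the rest."""
    records = []
    with open(path, "r") as handle:
        for line in handle:
            try:
                records.append(json.loads(line))
            except ValueError:
                # Blank lines and truncated records end up here
                continue
    return records

# Hypothetical sample: two valid tweets, one blank line, one truncated record
sample_lines = [
    '{"text": "Team Mystic all the way!", "coordinates": null}',
    '',
    '{"text": "cut off mid-twe',
    '{"text": "Valor took the gym again", "coordinates": null}',
]

# Write the sample to a temporary file so the loader reads it like a real one
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write("\n".join(sample_lines))
    sample_path = tmp.name

loaded = load_json_lines(sample_path)
os.remove(sample_path)

print(len(loaded), "of", len(sample_lines), "lines parsed")
```

Only the two complete records survive, which is the behavior we want when a stream writes blank lines or cuts a tweet off partway.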

In [None]:
def pop_tweets(path):
    #Use the previous function to load our tweets from the text file
    tweets_data = load_twitter_data(path)
    
    #Declare a new data frame with pandas, with some specific column names
    tweets = pd.DataFrame(columns=['screenName','userId','text','latt','long','location'])

    #For each tweet in the list
    for tweet in tweets_data:
        if 'text' in tweet:
            #Some tweets carry no place information, so fall back to None
            place = tweet.get('place') or {}
            location = place.get('full_name')
            
            if tweet.get('coordinates') is not None:
                #Twitter stores coordinates in GeoJSON order: [longitude, latitude]
                lon, latt = tweet['coordinates']['coordinates']
            else:
                lon, latt = float('nan'), float('nan')
                
            tweets.loc[len(tweets)] = [tweet['user']['screen_name'], tweet['user']['id'],
                                       tweet['text'], latt, lon, location]
        
    return tweets
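
Adding rows one at a time with `tweets.loc[len(tweets)]` gets slow as the data frame grows. One possible alternative, sketched below, is to collect the rows in a plain list and build the data frame once at the end. The function name `tweets_to_frame` and the sample tweets are made up for illustration. The sketch also assumes the `coordinates` field follows GeoJSON order (longitude first), and that a tweet may arrive without a `place`.

```python
import pandas as pd

COLUMNS = ['screenName', 'userId', 'text', 'latt', 'long', 'location']

def tweets_to_frame(tweets_data):
    """Build the tweets data frame in one step from a list of tweet dicts."""
    rows = []
    for tweet in tweets_data:
        # Skip stream messages that aren't tweets (they have no 'text' key)
        if 'text' not in tweet:
            continue
        coords = tweet.get('coordinates')
        place = tweet.get('place') or {}
        if coords is not None:
            # Assumed GeoJSON order: [longitude, latitude]
            lon, lat = coords['coordinates'][0], coords['coordinates'][1]
        else:
            lon, lat = float('nan'), float('nan')
        rows.append({
            'screenName': tweet['user']['screen_name'],
            'userId': tweet['user']['id'],
            'text': tweet['text'],
            'latt': lat,
            'long': lon,
            'location': place.get('full_name'),
        })
    return pd.DataFrame(rows, columns=COLUMNS)

# Hypothetical tweets, trimmed down to the fields the function reads
sample_tweets = [
    {'text': 'Team Mystic!', 'user': {'screen_name': 'ash', 'id': 1},
     'coordinates': {'type': 'Point', 'coordinates': [-122.4, 37.8]},
     'place': {'full_name': 'San Francisco, CA'}},
    {'text': 'Valor forever', 'user': {'screen_name': 'misty', 'id': 2},
     'coordinates': None, 'place': None},
    {'limit': {'track': 5}},  # not a tweet, so it is skipped
]

frame = tweets_to_frame(sample_tweets)
print(frame)
```

The result has the same columns as `pop_tweets`, so the rest of the post would work with either version.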

Now that we have a function that builds a data frame from the raw tweets, we can load the tweets we collected:

In [None]:
PoGo_tweets = pop_tweets('PoGo_USA.json')

In [None]:
len(PoGo_tweets)

In [None]:
#Remove tweets that discuss two or more teams

#Even though we didn't track mystic, valor, and instinct alone when getting tweets,
#we can track them here since the tweets are specific to Pokemon Go
for row in range(len(PoGo_tweets)):
    has_b=any(word in PoGo_tweets.loc[row,'text'].lower() for word in ['team mystic','#teamblue','#teammystic','#mystic','mystic'])
    has_r=any(word in PoGo_tweets.loc[row,'text'].lower() for word in ['team valor','#teamred','#teamvalor','#valor','valor'])
    has_y=any(word in PoGo_tweets.loc[row,'text'].lower() for word in ['team instinct','#teamyellow','#teaminstinct','#instinct','instinct'])
    if has_b+has_r+has_y > 1:
        PoGo_tweets.loc[row,'multi-team']=True
    else:
        PoGo_tweets.loc[row,'multi-team']=False
    

In [None]:
#Applying cut to remove multi-team tweets
PoGo_tweets = PoGo_tweets[PoGo_tweets['multi-team'] == False]
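
The same flag-and-cut step can also be written with pandas' vectorized string methods instead of a row-by-row loop. The sketch below is one way to do it, using a small made-up data frame. `TEAM_KEYWORDS` and `flag_multi_team` are hypothetical names, and the keyword lists are shortened because `'mystic'` already matches `'team mystic'`, `'#teammystic'`, and `'#mystic'` as a substring.

```python
import re
import pandas as pd

# Shortened keyword lists: each bare team name also matches its hashtag forms
TEAM_KEYWORDS = {
    'mystic': ['mystic', '#teamblue'],
    'valor': ['valor', '#teamred'],
    'instinct': ['instinct', '#teamyellow'],
}

def flag_multi_team(frame):
    """Return a copy of `frame` with a boolean 'multi-team' column."""
    text = frame['text'].str.lower()
    # One boolean column per team: does the tweet mention that team at all?
    mentions = pd.DataFrame({
        team: text.str.contains('|'.join(re.escape(word) for word in words))
        for team, words in TEAM_KEYWORDS.items()
    })
    flagged = frame.copy()
    flagged['multi-team'] = mentions.sum(axis=1) > 1
    return flagged

# Hypothetical tweets for illustration
demo = pd.DataFrame({'text': [
    'Team Mystic rules',
    'Valor beats Instinct any day',
    'Caught a Pikachu #teamyellow',
]})

flagged = flag_multi_team(demo)
single_team = flagged[~flagged['multi-team']]
print(single_team)
```

Only the second tweet mentions two teams, so it is the only one removed by the cut.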

In [None]:
len(PoGo_tweets)

<h1> Select the first 2000 tweets to be manually labeled </h1>

In [None]:
isCont=input('Are you continuing from a previous checkpoint? (yes/no)')

if isCont == 'yes':
    #START HERE IF CONTINUING
    #Import the csv dataframe
    import pandas as pd
    from nltk.corpus import stopwords
    PoGo_labeled = pd.read_csv('PoGo_Sentiment_Labeled_extended.csv')
    PoGo_labeled.drop('Unnamed: 0', axis=1, inplace=True)

else:
    #.ix has been removed from pandas; .loc makes the same label-based selection
    PoGo_labeled = PoGo_tweets.loc[:2000]
    PoGo_labeled = PoGo_labeled.reset_index(drop=True)


In [None]:
PoGo_labeled.head(n=5)

In [None]:
len(PoGo_labeled)

In [None]:
#Extending PoGo_labeled to include more tweets
#DataFrame.append and .ix have been removed from pandas; use pd.concat and .loc instead
PoGo_labeled = pd.concat([PoGo_labeled, PoGo_tweets.loc[4000:6000]])

In [None]:
len(PoGo_labeled)

In [None]:
#Take a real copy so later changes to PoGo_labeled can't touch the backup
backup = PoGo_labeled.copy()

In [None]:
PoGo_labeled = PoGo_labeled.reset_index(drop=True)

In [None]:
#Display the tweet for each row, and ask the user to label it
#Labels: pos = can identify the user as being on the team,
#neg = negative tweet about a team,
#nan = can identify the user as being on a different team, or can't identify a team at all

import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

#Left off at 3633
for row in range(3960,4000):
    print(PoGo_labeled.loc[row,'text'])
    PoGo_labeled.loc[row,'sentiment'] = input()
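
Because the loop stores whatever we type, a single typo becomes a bad label. The sketch below is one possible variation that only accepts the three labels described above and lets us stop early. The function `label_rows`, the `q` shortcut, and the `ask` parameter (which stands in for `input` so the loop can be exercised without typing) are all assumptions, not part of the original workflow.

```python
import pandas as pd

VALID_LABELS = {'pos', 'neg', 'nan'}

def label_rows(frame, start, stop, ask=input):
    """Label rows start..stop-1 in place and return the next row to label."""
    if 'sentiment' not in frame.columns:
        frame['sentiment'] = None
    last = min(stop, len(frame))
    for row in range(start, last):
        print(frame.loc[row, 'text'])
        answer = ask('pos / neg / nan (q to stop): ').strip().lower()
        # Keep asking until we get one of the known labels or the quit key
        while answer not in VALID_LABELS and answer != 'q':
            answer = ask('Please type pos, neg, nan or q: ').strip().lower()
        if answer == 'q':
            return row
        frame.loc[row, 'sentiment'] = answer
    return last

# Hypothetical tweets and scripted answers, so the example runs unattended
demo = pd.DataFrame({'text': ['Mystic forever', 'Ugh, Valor again', 'Go Instinct']})
scripted = iter(['pos', 'negative', 'neg', 'q'])
next_row = label_rows(demo, 0, 3, ask=lambda prompt: next(scripted))

print('Resume labeling at row', next_row)
```

The returned row number plays the same role as the "Left off at" comment: it tells us where to pick up in the next session.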


In [None]:
row

<h1> Save the data </h1>

In [None]:
PoGo_labeled.to_csv('PoGo_Sentiment_Labeled_extended.csv')
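
Two things happen when this checkpoint is read back in. Here is a small sketch that shows both, using a made-up data frame and an in-memory buffer in place of the real CSV file. First, `to_csv` writes the index as an unnamed first column, which is why the `Unnamed: 0` column has to be dropped after reloading. Second, `read_csv` treats the string `nan` as a missing value by default, so tweets labeled `nan` come back as missing rather than as text.

```python
import io
import pandas as pd

# Hypothetical labeled tweets for illustration
labeled = pd.DataFrame({
    'text': ['Mystic forever', 'Just caught a Pidgey'],
    'sentiment': ['pos', 'nan'],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
labeled.to_csv(buffer)

# A plain reload picks up the saved index as an extra column
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))

# Telling read_csv that the first column is the index avoids the extra column
buffer.seek(0)
reloaded_clean = pd.read_csv(buffer, index_col=0)
print(reloaded_clean['sentiment'].isna().tolist())
```

Either approach works: dropping the column after loading, as we do at the checkpoint, or passing `index_col=0` when reading.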