<h3, align=center>Tweet about us!</h3>

<a href="https://twitter.com/intent/tweet?hashtags=BDUmeetup%2C&original_referer=https%3A%2F%2Fabout.twitter.com%2Fresources%2Fbuttons&ref_src=twsrc%5Etfw&related=SaeedAghabozorg&text=%23Text%20%23Analysis%20%26%20%23NLP%20%20%40BigDataU%20%23NLP%20%23TextMining%20%23datascience%20Livestream%20here%3A%20http%3A%2F%2Fbit.ly%2FEventTextMining&tw_p=tweetbutton"><img src = 'http://ibm.box.com/shared/static/71e9pzujwf4094sp762cib6eswf5cald.png', width=600></a>

#1.  Importing packages
For this project we are going to need a variety of python modules mostly from the Natural Language Toolkit (NLTK) package. 

In [1]:
import csv
import nltk, nltk.classify.util, nltk.metrics
from nltk.classify import MaxentClassifier, NaiveBayesClassifier
import re 
import string 
from nltk.corpus import reuters, stopwords
from nltk.tokenize import word_tokenize 
from nltk.tag import pos_tag
from collections import Counter
import collections
from nltk.corpus import wordnet as wn
from nltk.stem.lancaster import LancasterStemmer as stem

In order to save memory python's nltk package replaces some of the features with "Lazy" versions of themselves on import. To access the full version of the Reuters corpus, the stopwords corpus, the wordnet, and the part of speech tagger we will use later on in this notebook we must use the nltk downloader

In [2]:
nltk.download("reuters")
nltk.download("stopwords") 
nltk.download('maxent_treebank_pos_tagger')
nltk.download("wordnet")

[nltk_data] Downloading package reuters to /Users/evgenus/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/evgenus/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package maxent_treebank_pos_tagger to
[nltk_data]     /Users/evgenus/nltk_data...
[nltk_data]   Package maxent_treebank_pos_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /Users/evgenus/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

#2. Downloading datasets

In [3]:
!wget -O /resources/Sentiment140.200000.csv https://ibm.box.com/shared/static/cdzdpognqgn2ny54e66uoipnorxsfccg.csv

/bin/sh: wget: command not found


In [4]:
!wget -O /resources/ELXN42test2.txt  https://ibm.box.com/shared/static/ssx1isyzg7szp11u71uchn2114td8ndn.txt

/bin/sh: wget: command not found


# 3. Pre-Processing The Text Data 

   ###A) The Reuters Corpus

The first major step is constructing the $\textbf{Term-Document Matrix}$ or $\textbf{TDM}$ for the Reuters corpus. The TDM is a very widely known and widely used tool in natural language processing. In this case study we use the TDM solely to get prior probabilities for our spell-checker but it is included because knowledge of the TDM translates well into any NLP project the reader may try. The first thing we must do with regards specifically to the Reuters corpus is write a function that can take a list of words and return only the important ones and remove any punctuation. Unimportant words in NLP are called $\textbf{stop words}$ and can be removed with a simple reference to the nltk stopwords corpus. The function "_clean_up2" will be needed to create the normalized tweet training data using the Lancaster Stemmer we imported earlier, _clean_up has the added parameter in the definition as a convenience for the "format_train"function later

In [5]:
def _clean_up2(word_list, st): 
    '''Return a list of strings cleaned_words which consists of the a
    Lancaster Stemmer st normailized words from a list of 
    strings word_list with stop words and punctuation removed'''
    stop_words = stopwords.words('english') 
    cleaned_words = [st.stem(cleanup(word)) for word in word_list if (word not in stop_words and 
                     word not in list(string.punctuation))]
    return cleaned_words

def _clean_up(word_list, st=None): 
    '''Return a list of strings cleaned_words which consists of the words from a list of 
    strings word_list with stop words and punctuation removed'''
    stop_words = stopwords.words('english') 
    cleaned_words = [cleanup(word) for word in word_list if (word not in stop_words and 
                     word not in list(string.punctuation))]
    return cleaned_words

def cleanup(s): 
    '''remove punctuation and unwanted characters or hyperlinks from list of string s'''
    return s.translate(str.maketrans(string.punctuation, " "*(len(string.punctuation))))  

The Reuters corpus is divided into 10788 documents which are various types of news articles. The nltk CorpusReader class contains a method which takes any one of these 10788 filenames and returns a list of $\textbf{normalized}$ words used in the document. To normalize a word is to represent it by its base (e.g. dogs $\rightarrow$ normalization $\rightarrow$ dog)  

Most of the time the TDM is actually a Matrix put in python it is almost always better to use dictionaries (called hash tables in other languages) for storing large amounts of information. So to make the TDM we can use the following function and if we need the name of a document represented at position $\textit{j}$ we can find it by looking up the $\textit{jth}$ name in reuters.fileids()

In [6]:
def make_tdm(corpus): 
    '''Return a list of dictionaries TDM where each element contains the words
    used and word counts in each corpus document''' 
    TDM = []
    for f in corpus.fileids(): 
        word_list = corpus.words(f) 
        cleaned_words = _clean_up(word_list) 
        DM = Counter(cleaned_words) 
        TDM.append(DM)
    
    return TDM 
    

Now that we have made the TDM we can use this code heavily inspired by Peter Norvig (you can find the original here: $\href{url}{http://norvig.com/spell-correct.html}$). The idea is we would like to estimate what word is was meant by a misspelled word we try to maximize P(intended word | input string), using the Bayes Rule $argmax_{w}P(w|s) = argmax_{w}P(s|w)*P(w)$. The goal the TDM serves is in estimating the prior probability P(w) which is the MLE for our $\textbf{language model}$ in this example we assign equal probability to any of the five operations you can see in the "edits1(word)" function below. If you wanted to build a better $\textbf{error model}$ from which to draw the probability that an author typed a string $\textit{s}$ when he/she meant $\textit{w}$ you could add weights for keys that are close together or use misspelled data to estimate a better distribution. 

For now we develop the aggregated TDM in NWORDS and we will later call the function correct(word) to spell check our data. 

In [7]:
def train_tdm(tdm): 
    model = collections.defaultdict(lambda: 1)
    for doc in tdm: 
        for word in doc.keys(): 
            model[word] += doc[word]
    return model

TDM = make_tdm(reuters) 
NWORDS = train_tdm(TDM)
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=NWORDS.get)

###B) The ELXN42 Twitter Data

Now that we have the spell checker implemented we can load and clean the test_data. We want to compare results of the classifiers on the clean and unclean data so we will keep both. Since here we don't get normalized words, we can also get the normalized words by using the LancasterStemmer from the nltk packagage which we imported earlier as "stem". 

In [8]:
with open("ELXN42test2.txt") as f:
    cdnpoli = f.readlines()

In [9]:
clean_list = [] 
st = stem()
for tweet in cdnpoli: 
    clean_tweet = _clean_up(tweet.split()) 
    for i in range(len(clean_tweet)): 
        if not wn.synsets(clean_tweet[i]): 
            clean_tweet[i] = correct(st.stem(clean_tweet[i]))
    clean_string = " ".join(clean_tweet) 
    clean_list.append(clean_string) 
          

#4. Subject Determination

### Part of Speech Tagging (POST) 

The second big step in this project is determining the subject and main political party referred to by a tweet. In NLP this is called the "aboutness" problem and there is often no easy solution. For subject determination we will be using $\textbf{part of speech tagging (POST)}$ although I encourage the reader to take a look at $\textbf{Semantic Role Labelling}$ in the $\textit{nlpnet}$ package as further reading. 

For determining the political party in question we will try to extract relevant hashtags and then determine which of those are in the body of a tweet and if none are, using the trailing hashtags as an indicator. 

In getting the subject we will use the pos_tag and word_tokenize functions we imported earlier. If we can see the subjects of tweets we can use opinion mining to estimate how people feel about certain issues as well as political parties. First we implement a help function which gets all of the nouns in a given tweet. 

In [13]:
def _get_subjects(tweet): 
    subjects = []
    word_list = tweet.split() 
    trail_tags = 0
    for i in range(len(word_list)-1, -1, -1):
        x = word_list[i]
        if x[0]=="#": 
            trail_tags += 1 
    word_list2 = word_list[:(len(word_list)-trail_tags)]
    new_tweet = ' '.join(word_list2) 
    tagged = pos_tag(word_tokenize(new_tweet))
    for word, tag in tagged: 
        if (tag in ["NNP", "NNS", "PRP$", "NNPS", "NN"]): 
            subjects.append(word) 
    
    return subjects 

Using this helper function we can construct a dictionary where each key is a noun and each value is a list of tweets which contain that noun

In [14]:
def make_sub_dict(tweet_list): 
    d = {} 
    for tweet in tweet_list: 
        subs = _get_subjects(tweet)
        if len(subs)> 0:  
            for sub in subs:
                if sub in d.keys(): 
                    d[sub].append(tweet) 
                else: 
                    d[sub] = [tweet]
    return d

Now we simply make the dictionaries, one for the unprocessed test data and one for the processed. Note: this may take a while

In [15]:
sub_d = make_sub_dict(cdnpoli) 

LookupError: 
**********************************************************************
  Resource [93mpunkt[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt')
  [0m
  Searched in:
    - '/Users/evgenus/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/Users/evgenus/anaconda3/nltk_data'
    - '/Users/evgenus/anaconda3/lib/nltk_data'
    - ''
**********************************************************************


In [49]:
clean_sub_d = make_sub_dict(clean_list) 

#5. Political Party Determination

Getting the relevant political party is a different process altogether. What we do is find all the hashtags using regular expressions (python's re module) and then filter the most common ones with rule-based affiliations. For example we consider any tweet where the first political hashtag or word used it "trudeau" to be about the Liberal Party. 

###A) Finding Top Hashtags in ELXN42 Twitter Data

In [50]:
def get_hashtags(tweet_list, n): 
    '''Return the top n hashtags used in election tweet strings
    stored in a list tweet_list'''
    tag_list = [] 
    for tweet in tweet_list: 
        tags = re.findall("#\w\w+", tweet) 
        tag_list += (tags) 
    tag_dict = Counter(tag_list) 
    return tag_dict.most_common(n) 

In [51]:
hashtag_tups = get_hashtags(cdnpoli, 600) 
hashtags = [tag for tag, value in hashtag_tups]

###B) Building a List of Party-Relevant Hashtags

Then we instantiate certain partisan hashtags manually (rule-based) and look for similar common hashtags to build a larger network of partisan hashtags

In [52]:
def get_partisan(hashtags): 
    Liberal_tags = ["#liberal", "justin", "trudeau", "lpc", "#lpc" "realchage", "justnotready"
                   "#justin", "#trudeau", "#realchage", "liberal", "#justnotready"]
    NDP_tags = ["#ndp", "tom", "thomas", "mulcair", "changethatsready"
               "ndp", "#tom", "#thomas", "#mulcair", "#changethatsready"]
    Tory_tags = ["#conservative", "steve", "steven" "harper", "tory",
                "#steve", "#steven" "#harper", "#tory", "conservative", "cpc", "#cpc"]
    partisan = [Tory_tags, NDP_tags, Liberal_tags]
    for tag in hashtags:
        for party in partisan: 
            for string in party: 
                if string in tag and tag not in party: 
                    party.append(tag) 
                    party.append(tag[1:])
                    
    return partisan

Now we can use the hashtags we extracted earlier to make a list of lists of partisan hashtags 

In [53]:
p = get_partisan(hashtags)

###C) Determining A Likely Political Party By Tweet 

Now to get the party referred to by a tweet we look at the first partisan word or tag used in the tweet using the following helper function 

In [54]:
def _get_party(tweet, partisan):
    '''Return a string word which is our best guess for the subject of 
    a string tweet using our list hashtags if we have no guess return the 
    string "unkown"'''
    tagged = pos_tag(word_tokenize(tweet))
    for word, tag in tagged:
        for party in partisan:
            if word in party: 
                return party[0]

    return "unknown"

As before with the subject dictionary, we use a dictionary where each key is a political party (or "unknown") and each value is a list of relevant tweets

In [55]:
def make_party_dict(tweet_list, partisan): 
    '''Return a Dictionary where each key is a subject and each 
    value a list of tweets with that subject'''
    tweet_dict = {} 
    for tweet in tweet_list: 
        sub = _get_party(tweet, partisan) 
        if sub in tweet_dict.keys(): 
            tweet_dict[sub].append(tweet)
        else: 
            tweet_dict[sub] = []
            tweet_dict[sub].append(tweet)
            
    return tweet_dict

In [56]:
par_d = make_party_dict(cdnpoli, p)

In [57]:
clean_par_d = make_party_dict(clean_list, p) 

#6. Text Classification

###A) Formatting The Sentiment140 Training Data

First we must load and format the data. nltk.classify.classifiers take training input in the following form [({feature1: value, feature2: value, feature3: value, ....}, category}), ({feature1: value, feature2: value, feature3: value, ....}, category})] in other words, a list of tuples where each entry is a training case and its class label. In our case the features will be words and the values their respective counts. We also allow the choice of cleaning functions for later 

In [58]:
def format_train(trainData, clean_function, st=None):
    '''Return a list of Tuples of tweets in csv.DictReader trainData
    where the first entry is a dictionary of words in a tweet and the 
    value "True" and the second is the class value we choose the 
    cleaning function from _clean_up and _clean_up2'''
    tweet_list = [] 
    for row in trainData: 
        tweet = row["text"]
        twl = tweet.split()
        twl = clean_function(twl, st)
        cat = row["class"]
        features = {}
        for word in twl: 
            features[word] = True
        if cat == "0": 
            tup = (features, "negative") 
        else: 
            tup = (features, "positive") 
        tweet_list.append(tup) 
    return tweet_list


Now we load and format the data

In [59]:
with open('/resources/Sentiment140.200000.csv', 'rb') as csvfile:
    trainData = csv.DictReader(csvfile) 
    dat = format_train(trainData, _clean_up)

And for the cleaned data we use _clean_up2 instead

In [60]:
with open('/resources/Sentiment140.200000.csv', 'rb') as csvfile:
    trainData = csv.DictReader(csvfile) 
    st = stem()
    dat2 = format_train(trainData, _clean_up2, st)

###B) Training the Classifiers 

In machine learning algorithms we often want to test the performance of a classifier on known examples so that we can get a grasp on how confident we should be in the results on the data we are really concerned with this. We do this by a process called $\textbf{cross validation}$ wherein we divide our training data further into a training set and labelled testing set. Since we have two categories, we would like an equal amount of training cases for both which we can implement in the following: 

In [61]:
positive = dat[100000:]
negative = dat[:100000]
train1 = positive[:80000] + negative[:80000]
test = positive[80000:] + negative[80000:] 
algorithm = MaxentClassifier.ALGORITHMS[0]
MEclassifier = MaxentClassifier.train(train1, algorithm,max_iter=3)
NBclassifier = NaiveBayesClassifier.train(train1) 

  ==> Training (3 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.500
             2          -0.69314        0.868
         Final          -0.69313        0.868


And we do the same for the cleaning data

In [62]:
positive2 = dat2[100000:]
negative2 = dat2[:100000]
train2 = positive2[:80000] + negative2[:80000]
test2 = positive2[80000:] + negative2[80000:] 
algorithm2 = MaxentClassifier.ALGORITHMS[0]
MEclassifier2 = MaxentClassifier.train(train2, algorithm,max_iter=3)
NBclassifier2 = NaiveBayesClassifier.train(train2) 

  ==> Training (3 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.500
             2          -0.69314        0.847
         Final          -0.69313        0.847


###C) Testing The Classifiers

We can then test the performance of our classifier by the $\textbf{categorization error}$ which is the percentage of the labelled test set cases it misclassifies, the functin is easy enough to write and uses the nltk.classify.classifier class' method "classify(case)" as follows: 

In [63]:
def get_cat_error(test, classifier): 
    '''Return float cat_error which is the categorization error 
    of an nltk.classify classifier on a testset test which is a 
    list of tuples with the first entry being a dictionary of 
    features and the second a string denoting a category'''
    wrong = 0 
    for case in test: 
        cla = classifier.classify(case[0]) 
        if cla != case[1]: 
            wrong += 1 
    
    error = float(wrong)/float(len(test))
    return error

In [64]:
print get_cat_error(test, MEclassifier) 
print get_cat_error(test, NBclassifier) 

0.25345
0.257525


#7. Getting Our Results

The fourth and final major step in this project is getting our actual results. Now that we have everything made we will simply need to find the percentage of positive tweets for a certain key value in one of the 4 dictionaries we constructed earlier.

###A) Results By Political Party

In [65]:
def get_percent_positive(key, d, classifier):
    pos_count = 0 
    for tweet in d[key]: 
        cla = classifier.classify(Counter(tweet)) 
        if cla == "positive": 
            pos_count += 1
    
    return 100*float(pos_count)/len(d[key])
    

Now to get the results by party we can simply run:

In [66]:
for classifier in [MEclassifier, NBclassifier]: 
    print "Percent Positive", "Total Tweets"
    for party in ["#ndp", "#liberal", "#conservative"]: 
        print party, get_percent_positive(party, par_d, classifier), len(par_d[party])
        

Percent Positive Total Tweets
#ndp 54.7008547009 234
#liberal 45.6035767511 671
#conservative 43.8066465257 331
Percent Positive Total Tweets
#ndp 51.7094017094 234
#liberal 43.9642324888 671
#conservative 44.1087613293 331


And for the cleaned data:

In [67]:
for classifier in [MEclassifier2, NBclassifier2]: 
    print "Percent Positive", "Total Tweets"
    for party in ["#ndp", "#liberal", "#conservative"]: 
        print party, get_percent_positive(party, clean_par_d, classifier), len(clean_par_d[party])
        

Percent Positive Total Tweets
#ndp 53.5269709544 241
#liberal 59.3917710197 559
#conservative 48.7179487179 156
Percent Positive Total Tweets
#ndp 53.5269709544 241
#liberal 59.3917710197 559
#conservative 48.7179487179 156


###B) Results By Subject

We can similarly classify tweets by subject using the other two dictionaries we constructed previously. Since there are so many subjects we can restrict them to a certain sample in the interest of brevity.

In [68]:
some_keys = ["jobs", "pm", "women", "realchange"]

In [69]:
for classifier in [MEclassifier, NBclassifier]: 
    print "Percent Positive", "Total Tweets"
    for sub in some_keys: 
        print sub, get_percent_positive(sub, sub_d, classifier), len(sub_d[sub])

Percent Positive Total Tweets
jobs 77.2727272727 22
pm 29.6296296296 27
women 52.380952381 21
realchange 47.4576271186 59
Percent Positive Total Tweets
jobs 77.2727272727 22
pm 29.6296296296 27
women 52.380952381 21
realchange 42.3728813559 59


And for the cleaned data: 

In [70]:
some_keys = ["job", "pm", "women", "realchang"]
for classifier in [MEclassifier2, NBclassifier2]: 
    print "Percent Positive", "Total Tweets"
    for sub in some_keys: 
        print sub, get_percent_positive(sub, clean_sub_d, classifier), len(clean_sub_d[sub])

Percent Positive Total Tweets
job 68.4210526316 19
pm 61.7647058824 34
women 47.619047619 21
realchang 58.2677165354 127
Percent Positive Total Tweets
job 68.4210526316 19
pm 61.7647058824 34
women 42.8571428571 21
realchang 58.2677165354 127


#8. Conclusion

So we see that with all kinds of data the NDP leads in terms of positive public opinion but the Liberals have the highest volume of relevant tweets by far. If we look at it by subject, we see that tweets related to "jobs" are incredibly positive and those relating to women are distinctly negative (if you take a look at these tweets with sub_d["women"] you will see that they are mostly about women's rights and are mostly negative).

As for clean vs. unclean data, we can see that the normalization process may over-restrict our ability to find party relevant tweets, however it seems to enhance our ability to filter by subject.

And that concludes the project! Feel free to analyze some of the tweets to see how well you think the classifiers did. The only real way to get a concrete grasp on the actual sentiment is to read them yourself I'm afraid but randomly sample a few tweets from each dictionary key and you can see that it fairly accurately reflects the sentiment of the tweet. 

Thank you for reading! Best of luck in all of your future NLP endeavors!

<h3, align=center>We are sponsoring a SportsHack!</h3>

<a href = "http://sportshackweekend.org/ca/2015/"><img src = "http://sportshackweekend.org/ca/2015/img/ca2015.jpg", width = 400></a>

<h3, align=center>Tweet about us!</h3>

<a href="https://twitter.com/intent/tweet?hashtags=BDUmeetup%2C&original_referer=https%3A%2F%2Fabout.twitter.com%2Fresources%2Fbuttons&ref_src=twsrc%5Etfw&related=SaeedAghabozorg&text=%23Text%20%23Analysis%20%26%20%23NLP%20%20%40BigDataU%20%23NLP%20%23TextMining%20%23datascience%20Livestream%20here%3A%20http%3A%2F%2Fbit.ly%2FEventTextMining&tw_p=tweetbutton"><img src = 'http://ibm.box.com/shared/static/71e9pzujwf4094sp762cib6eswf5cald.png', width=600></a>