## Twitter Classification with Naive Bayes

In this notebook, we'll try to predict which of the two Twitter handles you selected a user follows, based on the text of that user's profile description. The goals here are potentially two-fold. 

1. Build a Classifier: We might be legitimately interested in classification. For instance, we could pick two handles that sit at opposite ends of some dimension we care about, then take any description and score it along that dimension. Could you do this with just general text? What might be the strengths and weaknesses of doing so?
1. Naive Bayes (NB) for Exploration: If we just want to understand how two groups use (this very particular sub-species of) language, NB could help us do it. As we'll see below, the `show_most_informative_features` for sets of words can give us a view into the raw language that's being used. 


In [None]:
import nltk
import random
from string import punctuation
from pprint import pprint

Let's start by simply using the words in descriptions. First, let's read in the data. 

In [None]:
# Replace the file names and labels below with your own. 

d = []

with open("20191014_GeneralMills_followers.txt",'r') as infile :
    next(infile) # skip the header row
    
    for line in infile :
        # strip only the newline so empty tab-separated fields are kept
        line = line.strip("\n").split("\t")
        
        if len(line) > 6 and line[6] : # skip short rows and empty descriptions
            d.append((line[6],'big_food'))

with open("20191014_michaelpollan_followers.txt",'r') as infile :
    next(infile) # skip the header row
    
    for line in infile :
        line = line.strip("\n").split("\t")
        if len(line) > 6 and line[6] :
            d.append((line[6],'pollan'))


As always, let's look at a little of the data. I'll shuffle it first.

In [None]:
random.shuffle(d)
sample = d[:5]
print(sample)

Now we need to write a function that cleans up the description and maps it onto words. 

In [None]:
def desc_features(the_description) :
    """ Input: A twitter description
        Output: A dictionary listing the words that are in 
                the description.
                
        This function does some cleaning on the descriptions,
        removing some punctuation, splitting on whitespace, 
        dropping to lower case. It returns a dictionary 
        of the form 
            {example : True,
             word :    True}
    
        """
    exclude = set(punctuation)
    exclude.remove("#") #useful for twitter...
    
    # Found this at https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python
    the_description = ''.join([ch.lower() for ch in the_description if ch not in exclude])
    
    word_list = the_description.split()

    ret_val = {}
    
    for word in word_list :
        ret_val[word] = True
    
    return ret_val
    

As always, it's a good idea to test your functions.

In [None]:
for a in sample :
    desc, label = a
    print("Started with: " + desc)
    print("-------------------Then got---------------------------")
    pprint(desc_features(desc))
    print("------------------------------------------------------")
    print()
    print()

Okay, now we're ready to do the NB stuff. It's actually shockingly easy at this point, since we've done the work to set it up. We've got 255K total descriptions (found by typing `len(d)` in a cell). That's big enough that I'll use a full 5000 for our test set.

In [None]:
test_set_size = 5000

featuresets = [(desc_features(desc), label) for (desc, label) in d]
train_set, test_set = featuresets[test_set_size:], featuresets[:test_set_size]
classifier = nltk.NaiveBayesClassifier.train(train_set)

How'd we do?

In [None]:
print(nltk.classify.accuracy(classifier, test_set))

Not terrible, assuming it's about a 50/50 split. Let's see what that is, using my trick from last time.

In [None]:
from collections import Counter

Counter([label for desc, label in d])

Hmm, in my example Pollan is about 83% of the data, so we're not doing better than if we just guessed "pollan" all the time. Well, we can think about making it better in a minute. For now, let's see what's predictive.
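
To make that comparison concrete, the majority-class baseline can be computed straight from the label counts. This is a minimal sketch: `majority_baseline` is a hypothetical helper name, and the toy labels are made up to mirror the roughly 83/17 split mentioned above rather than taken from the real data.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, accuracy from always guessing that label)."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)

# Made-up labels that imitate an 83/17 class split.
toy_labels = ["pollan"] * 83 + ["big_food"] * 17
print(majority_baseline(toy_labels))  # ('pollan', 0.83)
```

On the real data, something like `majority_baseline([label for desc, label in d])` would give the number the classifier's accuracy has to beat.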

In [None]:
classifier.show_most_informative_features(20)

Lots of the big food features seem pretty "spammy", for instance "freebie", "#sweepstakes", and "#giveaways", although some of them seem legit, like those related to coupons or Pillsbury. 
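
For intuition about where those ratios come from: for each word, the listing compares how likely the word is to show up in a description from one class versus the other. The sketch below recomputes that kind of ratio by hand on made-up toy data. It is an approximation under stated assumptions, not NLTK's exact code: `word_ratio` is a hypothetical helper, and the smoothing constant of 0.5 is an assumption chosen to keep the ratio finite for words never seen in one class.

```python
def word_ratio(word, docs_a, docs_b, smoothing=0.5):
    """Ratio of smoothed P(word present | class a) to P(word present | class b).

    docs_a and docs_b are lists of feature dicts shaped like the output
    of desc_features, e.g. {"word": True}.
    """
    def p_present(docs):
        hits = sum(1 for doc in docs if word in doc)
        # Two outcomes (present / absent), so smoothing is added once per outcome.
        return (hits + smoothing) / (len(docs) + 2 * smoothing)

    return p_present(docs_a) / p_present(docs_b)

# Made-up toy feature sets, not real follower data.
big_food = [{"#giveaways": True, "mom": True}, {"coupons": True}, {"#giveaways": True}]
pollan = [{"farmer": True}, {"food": True, "writer": True}, {"coupons": True}, {"chef": True}]

print(word_ratio("#giveaways", big_food, pollan))  # well above 1: leans big_food
print(word_ratio("coupons", big_food, pollan))     # closer to 1: less informative
```

A ratio far from 1 in either direction is what makes a word "informative"; rare words can produce large ratios from very few examples, which is one reason the top of the list can look spammy.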

If I were going to improve it, here's some stuff I'd try:

1. Remove stopwords: Remember, these are those common words that don't carry a lot of meaning. Might not matter, but it'd be cleaner and faster. 
1. Limit the model to just the top $N$ remaining words. Not sure what to pick for $N$, but I'd try 1000 or so. If we were headed down this path, it'd be worth doing the whole `train_set/dev_test_set/test_set` split so we could try a bunch of $N$s. 
1. See if number of followers is predictive. Using continuous variables in Naive Bayes is [a bit tricky](https://stats.stackexchange.com/questions/61034/naive-bayes-on-continuous-variables), but it can sometimes be quite helpful. 
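
Here is a minimal sketch of the first two ideas, under some assumptions: `STOPWORDS` is a tiny hand-picked set used only for illustration (a fuller list, such as NLTK's English stopword corpus, would be the natural choice in practice), and `tokenize`, `build_vocab`, and `limited_features` are hypothetical helper names. The tokenization is a plain lower-case whitespace split rather than the punctuation handling in `desc_features`.

```python
from collections import Counter

# Tiny illustrative stopword set (an assumption; a real list is much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "i", "is", "for", "my"}

def tokenize(description):
    """Lower-case, split on whitespace, and drop stopwords."""
    return [w for w in description.lower().split() if w not in STOPWORDS]

def build_vocab(descriptions, n):
    """Return the set of the n most frequent non-stopword tokens."""
    counts = Counter()
    for desc in descriptions:
        counts.update(set(tokenize(desc)))  # count each word once per description
    return {word for word, _ in counts.most_common(n)}

def limited_features(description, vocab):
    """Feature dict restricted to the words found in vocab."""
    return {w: True for w in tokenize(description) if w in vocab}

# Made-up descriptions, not real follower data.
toy = ["Lover of food and the farm",
       "Food writer and cook",
       "I am a writer who loves food"]
vocab = build_vocab(toy, n=2)
print(limited_features("The food writer", vocab))
```

If you go this route, it is probably safest to build the vocabulary from the training descriptions only and compare a few values of `n` on a held-out dev set, so the test set stays untouched.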

From an exploratory standpoint, I might be able to get more interesting results by downsampling my more prevalent class (pollan for me). Let's take a look at that.

In [None]:
len(d)

In [None]:
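# Let's get 65K from Pollan, sampled without replacement so no description is duplicated
new_d = [item for item in d if item[1]=="big_food"]
pollan = [item for item in d if item[1]=="pollan"]
 
new_d.extend(random.sample(pollan,k=min(65000,len(pollan))))
random.shuffle(new_d) # mix the classes so the first 5000 rows make a fair test set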

In [None]:
# Did we get what we expected? 
Counter([label for desc, label in new_d])

In [None]:
test_set_size = 5000

featuresets = [(desc_features(desc), label) for (desc, label) in new_d]
train_set, test_set = featuresets[test_set_size:], featuresets[:test_set_size]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(20)

Now we're getting more words that seem interesting both ways. 

In [None]:
# I'm curious to view these words "in situ"

count = 0
for item in d :
    desc, label = item
    
    # features are lower-cased, so match case-insensitively
    if "herbalist" in desc.lower() :
        print(desc)
        print(label)
        print("\n")
        count += 1

    if count > 5 : # stop after six examples
        break
        


Do these features seem more informative? 