## Twitter Classification with Naive Bayes

In this notebook, we'll try to predict which twitter user follows which handle you selected. The goals here are potentially two-fold. 

1. Build a Classifier: We might be legitimately interested in classification. For instance, we could do this along some dimension we might care about. Then we could take any description and score it along this dimension. Could you do this with just general text? What might be the strengths and weaknesses of doing so?
1. Naive Bayes (NB) for Exploration: If we just want to understand how two groups use (this very particular sub-species of) language, NB could help us do it. As we'll see below, the `show_most_informative_features` for sets of words can give us a view into the raw language that's being used. 


In [1]:
import nltk
import random
from string import punctuation
from pprint import pprint

Let's start by simply using the words in descriptions. First, let's read in the data. 

In [2]:
# you'll need to replace your file names and labels here. 

d = []

with open("20181106_GeneralMills_followers.txt",'r') as infile :
    next(infile)
    
    for line in infile.readlines() :
        line = line.strip("\n").split("\t") # need to specify what we're stripping here.
        
        if line[6] : # test for empty description
            d.append((line[6],'big_food'))

with open("20181106_michaelpollan_followers.txt",'r') as infile :
    next(infile)
    
    for line in infile.readlines() :
        line = line.strip("\n").split("\t")
        if line[6] :
            d.append((line[6],'pollan'))


As always, let's look at a little of the data. I'll shuffle it first.

In [3]:
random.shuffle(d)
sample = d[:5]
print(sample)

[('There is a mead out there for everyone. Let us help you find yours!  Tasting Room Hours: Thurs-Fri: 4pm-7pm Sun: 1pm-4pm', 'pollan'), ('is a fire fighter And wants to be a YouTuber but needs advice and wants to be a official actor', 'big_food'), ('Writer of mysteries, young adult fantasies, and military memoir. Also an avid knitter, crocheter, and sometime designer.', 'pollan'), ('Club DJ plays House music,Deep,classic,afrosoul,company Jivemore Entertainment,DJ or SOUND system bookings: buti200@webmail.co.za 0825868070', 'big_food'), ('HAHAHAHAHAHAHAHAH', 'big_food')]


Now we need to write a function that cleans up the description and maps it on to words. 

In [4]:
def desc_features(the_description) :
    """ Input: A twitter description
        Output: A dictionary listin the words that are in 
                the description.
                
        This function does some cleaning on the descriptions,
        removing some punctuation, splitting on whitespace, 
        dropping to lower case. It returns a dictionary 
        of the form 
            {example : True,
             word :    True}
    
        """
    exclude = set(punctuation)
    exclude.remove("#") #useful for twitter...
    
    # Found this at https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python
    the_description = ''.join([ch.lower() for ch in the_description if ch not in exclude])
    
    word_list = the_description.split()

    ret_val = {}
    
    for word in word_list :
        ret_val[word] = True
    
    return(ret_val)
    

As always, it's a good idea to test your functions.

In [5]:
for a in sample :
    desc, label = a
    print("Started with: " + desc)
    print("-------------------Then got---------------------------")
    pprint(desc_features(desc))
    print("------------------------------------------------------")
    print()
    print()

Started with: There is a mead out there for everyone. Let us help you find yours!  Tasting Room Hours: Thurs-Fri: 4pm-7pm Sun: 1pm-4pm
-------------------Then got---------------------------
{'1pm4pm': True,
 '4pm7pm': True,
 'a': True,
 'everyone': True,
 'find': True,
 'for': True,
 'help': True,
 'hours': True,
 'is': True,
 'let': True,
 'mead': True,
 'out': True,
 'room': True,
 'sun': True,
 'tasting': True,
 'there': True,
 'thursfri': True,
 'us': True,
 'you': True,
 'yours': True}
------------------------------------------------------


Started with: is a fire fighter And wants to be a YouTuber but needs advice and wants to be a official actor
-------------------Then got---------------------------
{'a': True,
 'actor': True,
 'advice': True,
 'and': True,
 'be': True,
 'but': True,
 'fighter': True,
 'fire': True,
 'is': True,
 'needs': True,
 'official': True,
 'to': True,
 'wants': True,
 'youtuber': True}
------------------------------------------------------


Started wit

Okay, now we're ready to do the NB stuff. It's actually shockingly easy at this point, since we've done the work to set it up. We've got 255K total descriptions (found by typing `len(d)` in a cell). That's big enough that I'll use a full 5000 for our test set.

In [6]:
test_set_size = 5000

featuresets = [(desc_features(desc), label) for (desc, label) in d]
train_set, test_set = featuresets[test_set_size:], featuresets[:test_set_size]
classifier = nltk.NaiveBayesClassifier.train(train_set)

How'd we do?

In [7]:
print(nltk.classify.accuracy(classifier, test_set))

0.759


Not terrible, assuming it's about a 50/50 split. Let's see what that is, using my trick from last time.

In [8]:
from collections import Counter

Counter([label for desc, label in d])

Counter({'big_food': 59782, 'pollan': 195324})

Hmm, in my example Pollan is about 76% of the data, so we're not doing better than if we just guessed "pollan" all the time. Well, we can think about making it better in a minute. For now, let's see what's predictive.

In [9]:
classifier.show_most_informative_features(20)

Most Informative Features
             influenster = True           big_fo : pollan =     70.8 : 1.0
                 rockalt = True           big_fo : pollan =     64.3 : 1.0
               hiphoprap = True           big_fo : pollan =     62.1 : 1.0
                bzzagent = True           big_fo : pollan =     59.9 : 1.0
            #sweepstakes = True           big_fo : pollan =     53.4 : 1.0
               pillsbury = True           big_fo : pollan =     49.0 : 1.0
                  roblox = True           big_fo : pollan =     47.7 : 1.0
              #giveaways = True           big_fo : pollan =     44.4 : 1.0
              #packaging = True           big_fo : pollan =     42.5 : 1.0
               couponing = True           big_fo : pollan =     41.6 : 1.0
             #industrial = True           big_fo : pollan =     40.3 : 1.0
                   brony = True           big_fo : pollan =     38.1 : 1.0
                   amosc = True           big_fo : pollan =     38.1 : 1.0

Lots of the big food features seem to be pretty "spammy". For instance "influenster", "#sweepstakes", "#giveaways". Although some of them seem legit like those related to coupons or pillsbury. 

If I was going to try to improve it, here's some stuff I'd try:

1. Remove stopwords: Remember, these are those common words that don't carry a lot of meaning. Might not matter, but it'd be cleaner and faster. 
1. Limit the model to just the top $N$ remaining words. Not sure what to pick for $N$, but I'd try 1000 or so. It'd be worth it to do the whole `train_set/dev_test_set/test_set` if we were headed down this path and we could try a bunch of $N$s. 
1. See if number of followers is predictive. Using continuous variables in Naive Bayes is [a bit tricky](https://stats.stackexchange.com/questions/61034/naive-bayes-on-continuous-variables), but it can sometimes be quite helpful. 

From an exploratory standpoint, I might be able to get more interesting results by sampling my more pervasive class (pollan for me). Let's take a look at that.

In [10]:
len(d)

255106

In [11]:
# Let's get 60K from Pollan
new_d = [item for item in d if item[1]=="big_food"]
pollan = [item for item in d if item[1]=="pollan"]
 
new_d.extend(random.choices(pollan,k=60000))

In [12]:
# Did we get what we expected? 
Counter([label for desc, label in new_d])

Counter({'big_food': 59782, 'pollan': 60000})

In [13]:
test_set_size = 5000

featuresets = [(desc_features(desc), label) for (desc, label) in new_d]
train_set, test_set = featuresets[test_set_size:], featuresets[:test_set_size]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [14]:
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(20)

0.6912
Most Informative Features
               hiphoprap = True           big_fo : pollan =     67.5 : 1.0
             sweepstakes = True           big_fo : pollan =     61.7 : 1.0
                freebies = True           big_fo : pollan =     50.0 : 1.0
                    fmcg = True           big_fo : pollan =     50.0 : 1.0
               herbalist = True           pollan : big_fo =     45.3 : 1.0
            permaculture = True           pollan : big_fo =     39.6 : 1.0
                  herbal = True           pollan : big_fo =     38.0 : 1.0
                fortnite = True           big_fo : pollan =     35.4 : 1.0
               couponing = True           big_fo : pollan =     32.5 : 1.0
            naturopathic = True           pollan : big_fo =     32.0 : 1.0
               minecraft = True           big_fo : pollan =     31.0 : 1.0
                  gamers = True           big_fo : pollan =     31.0 : 1.0
                     wwe = True           big_fo : pollan =     30.

In [20]:
count = 0
for item in d :
    desc, label = item
    
    if ("sweepstakes" in desc) :
        print(desc)
        print(label)
        print("\n")
        count += 1

    if count > 5 :
        break
        


42yo gay Caucasian(fair complexion) male/36 inch waist/blue-grey eyes/Zodiac: Cancer/ animal lover/ dining out/ shopping/sweepstakes lover.
big_food


i love tv,giveaways,contests,&sweepstakes my ,mom of one son,nkotb fan 4 -life,! i love my boyfriend to :) love u babe! :)
big_food


 my baby girl and hubby, my dog and cat, sweepstakes, twitter parties, contests, outside and nature, non-GMO/natural/organics, and food of course 
big_food


I m a married mom of 3 who loves bargains, coupons and sweepstakes!
big_food


WIN Cash & Prizes!! Easy to enter sweepstakes, contests & giveaways. #giveaway #win #contest #free #cash WIN BIG!!
big_food


I am originally from Hawaii and live in So Cal.I love to coupon and enter sweepstakes.I am I breast cancer survivor and am obsessed with cats.
big_food




Do these features seem more informative? 