In [1]:
import nltk
import random
from string import punctuation
from pprint import pprint


I'm probably just going to use the words in descriptions. Let's see how it goes.

In [2]:
d = []

with open("20171107_AstroKomrade_followers.txt",'r') as infile :
    next(infile)
    
    for line in infile.readlines() :
        line = line.strip("\n").split("\t") # need to specify what we're stripping here.
        
        if line[6] : # test for empty description
            d.append((line[6],'astronaut'))

with open("20171107_FlatEarthOrg_followers.txt",'r') as infile :
    next(infile)
    
    for line in infile.readlines() :
        line = line.strip("\n").split("\t")
        if line[6] :
            d.append((line[6],'flatearth'))


As always, let's look at a little of the data. I'll shuffle it first.

In [3]:
random.shuffle(d)
sample = d[:3]
print(sample)

[('instagram >>     ___criss_____', 'astronaut'), ('Native of the District. Uppsala University and Clemson University alumnus. Go Tigers!', 'astronaut'), ('Alhamdualliuh', 'astronaut')]


Now we need to write a function that cleans up the description and maps it onto a set of word features.

In [4]:
def desc_features(the_description) :
    """ Input: A twitter description
        Output: A dictionary listing the words that are in
                the description.
                
        This function does some cleaning on the descriptions,
        removing some punctuation, splitting on whitespace, 
        dropping to lower case. It returns a dictionary 
        of the form 
            {example : True,
             word :    True}
    
        """
    exclude = set(punctuation)
    
    # Found this at https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python
    the_description = ''.join([ch.lower() for ch in the_description if ch not in exclude])
    
    word_list = the_description.split()

    ret_val = {}
    
    for word in word_list :
        ret_val[word] = True
    
    return ret_val
    

As always, it's a good idea to test your functions.

In [5]:
for a in sample :
    desc, label = a
    print("Started with: " + desc)
    print("-------------------Then got---------------------------")
    pprint(desc_features(desc))
    print("------------------------------------------------------")

Started with: instagram >>     ___criss_____
-------------------Then got---------------------------
{'criss': True, 'instagram': True}
------------------------------------------------------
Started with: Native of the District. Uppsala University and Clemson University alumnus. Go Tigers!
-------------------Then got---------------------------
{'alumnus': True,
 'and': True,
 'clemson': True,
 'district': True,
 'go': True,
 'native': True,
 'of': True,
 'the': True,
 'tigers': True,
 'university': True,
 'uppsala': True}
------------------------------------------------------
Started with: Alhamdualliuh
-------------------Then got---------------------------
{'alhamdualliuh': True}
------------------------------------------------------
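Eyeballing printed output works, but the same checks can be pinned down with assertions so they fail loudly if the cleaning logic changes. This is a minimal sketch: it restates `desc_features` in condensed form so the cell runs on its own, and apart from the first sample description the test strings are made up for illustration.

```python
from string import punctuation

def desc_features(the_description):
    """Same cleaning as above: drop punctuation, lower-case, split on whitespace."""
    exclude = set(punctuation)
    cleaned = ''.join(ch.lower() for ch in the_description if ch not in exclude)
    return {word: True for word in cleaned.split()}

# Punctuation runs and extra whitespace should vanish (first sample above).
assert desc_features("instagram >>     ___criss_____") == {'criss': True, 'instagram': True}

# Repeated words collapse to one key, and case is folded (made-up input).
assert desc_features("Go Tigers! GO tigers") == {'go': True, 'tigers': True}

# A description that is only punctuation yields no features (made-up input).
assert desc_features("!!! ...") == {}
```

If any of these fail, the cell raises an `AssertionError` rather than leaving us to spot the problem by eye.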


Okay, now we're ready to do the Naive Bayes (NB) stuff. It's actually shockingly easy at this point, since we've done the work to set it up. We've got about 60K total descriptions (found by typing `len(d)` in a cell). That's big enough that I'll hold out a full 5000 for our test set and train on the rest.

In [6]:
test_set_size = 5000

featuresets = [(desc_features(desc), label) for (desc, label) in d]
train_set, test_set = featuresets[test_set_size:], featuresets[:test_set_size]
classifier = nltk.NaiveBayesClassifier.train(train_set)

How'd we do?

In [7]:
print(nltk.classify.accuracy(classifier, test_set))

0.6956


Not terrible, assuming the two labels are split about 50/50. Let's see what the split actually is, using my trick from last time.

In [8]:
from collections import Counter

Counter([label for desc, label in d])

Counter({'astronaut': 39744, 'flatearth': 20564})
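Those counts tell us how well we'd do by always guessing the bigger class, which is the number our accuracy needs to beat. A quick sketch, with the counts and the accuracy copied by hand from the outputs above (the variable names are mine). Note that this is the baseline over all of the data; the 5000-row test set could differ slightly.

```python
from collections import Counter

# Label counts copied from the Counter output above.
label_counts = Counter({'astronaut': 39744, 'flatearth': 20564})

# Accuracy of a "classifier" that always predicts the most common label.
majority_label, majority_count = label_counts.most_common(1)[0]
baseline = majority_count / sum(label_counts.values())

model_accuracy = 0.6956  # accuracy reported by nltk.classify.accuracy above

print(f"Always guessing {majority_label!r}: {baseline:.4f}")
print(f"Model improvement over baseline: {model_accuracy - baseline:.4f}")
```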

Hmm, astronaut is about 66% of the data, so our 69.6% accuracy isn't much better than if we just guessed "astronaut" all the time. Well, we can think about making it better in a minute. For now, let's see what's predictive.

In [9]:
classifier.show_most_informative_features(20)

Most Informative Features
               flatearth = True           flatea : astron =    227.7 : 1.0
                    flat = True           flatea : astron =    110.6 : 1.0
                    mgwv = True           flatea : astron =     70.3 : 1.0
                    rico = True           astron : flatea =     40.1 : 1.0
                 truther = True           flatea : astron =     36.8 : 1.0
             flatearther = True           flatea : astron =     36.8 : 1.0
                     iss = True           astron : flatea =     32.6 : 1.0
                 earther = True           flatea : astron =     31.8 : 1.0
           follow4follow = True           flatea : astron =     31.6 : 1.0
               astronaut = True           astron : flatea =     28.5 : 1.0
                    kita = True           flatea : astron =     27.7 : 1.0
             exploration = True           astron : flatea =     27.6 : 1.0
          teamfollowback = True           flatea : astron =     26.0 : 1.0

Number 3 on the list above is [mgwv](https://www.drewtolbert.com/what-does-mgwv-mean/), which seems like a spammy thing. The flat earth group has a lot of spammy words ("f4f", "teamfollowback") in their most informative features, but also just a lot of obvious words like "flat". Not the most impressive model ever.

If I were going to try to improve it, here's some stuff I'd try:

1. Remove stopwords: We haven't talked about them much, but those are just our most common words. Might not matter, but it'd be cleaner. 
1. Limit the model to just the top $N$ remaining words. Not sure what to pick for $N$, but I'd try 1000 or so. If we were headed down this path, it'd be worth doing the whole `train_set/dev_test_set/test_set` split so we could try a bunch of values of $N$. 
1. See if number of followers is predictive. Using continuous variables in Naive Bayes is [a bit tricky](https://stats.stackexchange.com/questions/61034/naive-bayes-on-continuous-variables), but it can sometimes be quite predictive. 
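
Here's a rough sketch of what those three ideas might look like in code. Everything in it is an assumption for illustration rather than something I've run on the real data: the tiny `STOPWORDS` set is hand-picked so the cell runs without downloads (NLTK's stopwords corpus would be the fuller option), the helper names are made up, and the follower bins are arbitrary. Binning is just one simple way to get a continuous variable into Naive Bayes as a categorical feature.

```python
from collections import Counter
from string import punctuation

# ASSUMPTION: a small hand-picked stopword list, for illustration only.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "i", "for", "my"}
EXCLUDE = set(punctuation)

def tokenize(description):
    """Same cleaning as desc_features, with stopwords dropped."""
    cleaned = ''.join(ch.lower() for ch in description if ch not in EXCLUDE)
    return [w for w in cleaned.split() if w not in STOPWORDS]

def build_vocab(descriptions, n):
    """Top-n words, ranked by how many descriptions each appears in."""
    counts = Counter()
    for desc in descriptions:
        counts.update(set(tokenize(desc)))  # count each word once per description
    return {word for word, _ in counts.most_common(n)}

def desc_features_limited(description, vocab):
    """Like desc_features, but only keeps words in the chosen vocabulary."""
    return {w: True for w in tokenize(description) if w in vocab}

def follower_bucket(n_followers):
    """ASSUMPTION: arbitrary bins turning a follower count into a category."""
    if n_followers < 100:
        return "lt_100"
    if n_followers < 1000:
        return "100_to_999"
    return "1000_plus"

# Toy descriptions, made up for the example.
toy = ["Native of the District. Go Tigers!",
       "Flat earth truther",
       "The earth is flat"]

vocab = build_vocab(toy, n=2)
features = desc_features_limited("The flat earth society", vocab)
features["followers"] = follower_bucket(250)
print(features)
```

On the real data, the vocabulary should be built from the training descriptions only, so the test set doesn't leak into the feature choice, and $N$ should be picked by comparing accuracy on a dev-test set.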
