## Naive Bayes in Scikit-Learn: Trump Tweet Classifier



This article is a good introduction to Naive Bayes:
* yhat post on Naive Bayes: http://blog.yhathq.com/posts/naive-bayes-in-python.html

For scikit-learn, it's convenient to keep the data files in separate directories, one per class.
So there's a "positive" and a "negative" folder for movie reviews, or an "android" and an "iphone" folder for Trump tweets. For training on reviews with scores, you might want a folder of 1-star reviews vs. a folder of 5-star reviews.
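Here is a minimal sketch of that layout, built in a throwaway directory. The file names and texts are made up for illustration; they are not the real tweet files.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical toy corpus: one folder per class, one text file per document.
toy_corpus = {
    "android": ["so dishonest and sad", "they are failing badly"],
    "iphone": ["join us at the rally tonight", "thank you for the support"],
}

with tempfile.TemporaryDirectory() as root:
    for category, texts in toy_corpus.items():
        folder = Path(root) / category
        folder.mkdir()
        for i, text in enumerate(texts):
            (folder / f"{i}.txt").write_text(text, encoding="utf-8")

    # Folder names become the class names, in alphabetical order,
    # and each file is labeled with the index of its folder.
    toy_bunch = load_files(root, encoding="utf-8")

print(toy_bunch.target_names)
print(len(toy_bunch.data), toy_bunch.target)
```

Note that `load_files` shuffles the documents by default, so the order of `data` and `target` does not follow the folder order.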

We are using the Trump tweet data from David Robinson's analysis: http://varianceexplained.org/r/trump-tweets/

In [22]:
import nlp_utilities as mytools

from sklearn.datasets import load_files

Now let's use the data in your data/trump folders, since it has more tweets than we had in our small sample.

In [29]:
# If you want to load data into sklearn, be sure to provide the names of the subfolders for the categories.

bunchf = load_files('data/trump/', categories=['android','iphone'], encoding="latin1")

In [31]:
ls data/trump

android/  iphone/


In [30]:
# This is what sklearn creates:

bunchf.keys()

dict_keys(['DESCR', 'data', 'target', 'filenames', 'target_names'])

In [32]:
len(bunchf.data)

1385

In [33]:
bunchf.filenames[0:10]

array(['data/trump/iphone/I7.605198474616586e+17.txt',
       'data/trump/iphone/I7.5247194141662e+17.txt',
       'data/trump/android/A7.036017628763054e+17.txt',
       'data/trump/iphone/I7.202473441978204e+17.txt',
       'data/trump/iphone/I7.278983529652593e+17.txt',
       'data/trump/android/A7.240390814208082e+17.txt',
       'data/trump/iphone/I7.097367646137713e+17.txt',
       'data/trump/android/A7.249115634051891e+17.txt',
       'data/trump/iphone/I7.471948656335954e+17.txt',
       'data/trump/iphone/I7.468950655917834e+17.txt'],
      dtype='<U45')

In [38]:
bunchf['target'][0:10] # we can see that the 1 corresponds to iphone, 0 to android.

array([1, 1, 0, 1, 1, 0, 1, 0, 1, 1])

In [39]:
bunchf['target_names']   # target_names is just the list of class names for labels 0 and 1

['android', 'iphone']

### Split up the data into train and test sets using sklearn's model_selection

In [18]:
from sklearn import model_selection
# Try passing a fixed random_state and changing test_size: different splits give noticeably different results.
Xf_train, Xf_test, yf_train, yf_test = model_selection.train_test_split(bunchf.data, 
                                                                         bunchf.target, 
                                                                         test_size=0.10)
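With 1385 documents and `test_size=0.10`, this split holds out 139 tweets and trains on the other 1246. The call above draws a new random split on every run. Here is a small sketch of how to pin the split down, on stand-in data; the `toy_*` names and the seed value are arbitrary choices for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 documents, 10 per class.
toy_docs = [f"tweet number {i}" for i in range(20)]
toy_labels = [0] * 10 + [1] * 10


def split(seed):
    # random_state pins the shuffle; stratify keeps the class balance
    # the same in the train and test parts.
    return train_test_split(toy_docs, toy_labels, test_size=0.10,
                            random_state=seed, stratify=toy_labels)


X_train, X_test, y_train, y_test = split(seed=42)
X_train_again, X_test_again, _, _ = split(seed=42)

print(len(X_train), len(X_test))   # sizes of the two parts
print(X_test == X_test_again)      # same seed gives the same split
```

Fixing the seed makes a single run repeatable, but it is still worth trying a few seeds to see how much the accuracy moves.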

In [11]:
# instead of a simple true/false for a feature (word), we'll use the TF-IDF weight.

from sklearn.feature_extraction.text import TfidfVectorizer

In [12]:
# The sklearn vectorizer for TF-IDF has the stopwords as an option and a
# lot of other features we can play with.
# This is where I hacked around for a while trying to improve the results. You can too!

tfidfvec = TfidfVectorizer(tokenizer=mytools.tokenize_clean,
                           stop_words=["'s", "'m", "n't", "'d", "'ve", "'ll", "'re"],
                           #ngram_range=(1,2),
                           max_df=0.75,
                           max_features=20000,
                           min_df=3)

# we create the tf-idf model from the training data:
vectors_train = tfidfvec.fit_transform(Xf_train)

# Depending on whether you stemmed or lemmatized, you'll get different column numbers here!
vectors_train.shape

(1246, 930)
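The `min_df` and `max_df` options are what shrink the vocabulary down to a few hundred columns. Here is a toy sketch of their effect, using a made-up corpus and the default tokenizer rather than `mytools.tokenize_clean`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical four-document corpus.
toy_texts = [
    "great rally tonight",
    "great crowd tonight",
    "great rally",
    "dishonest media",
]

# min_df=2 drops words that appear in fewer than 2 documents.
# max_df=0.6 drops words that appear in more than 60% of the documents.
toy_vec = TfidfVectorizer(min_df=2, max_df=0.6)
toy_matrix = toy_vec.fit_transform(toy_texts)

print(sorted(toy_vec.vocabulary_))  # the words that survived the pruning
print(toy_matrix.shape)             # (documents, surviving words)
```

Here "great" is in 3 of 4 documents, so `max_df` removes it, while "crowd", "dishonest" and "media" each appear once, so `min_df` removes them. Only "rally" and "tonight" are left, and the last document ends up as an all-zero row.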

In [159]:
TfidfVectorizer?

In [13]:
from sklearn.naive_bayes import MultinomialNB
import sklearn.metrics as metrics

# We set up our classifier
clf = MultinomialNB(alpha=.01)
# We train the classifier on the training data and target classes (android/iphone)
clf.fit(vectors_train, yf_train)

# We use the model on the test data:
vectors_test = tfidfvec.transform(Xf_test)
# We get a prediction from the test data 
pred = clf.predict(vectors_test)
# We check the accuracy against the "truth" in the yf_test var
metrics.accuracy_score(yf_test, pred)

0.75539568345323738
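With 139 tweets in the test set, an accuracy of 0.7554 works out to 105 correct predictions. Accuracy alone doesn't say which class the mistakes fall in, so a confusion matrix is a useful companion. A small sketch on hand-written labels (the `toy_*` arrays are made up, not the notebook's predictions):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels: 0 = android, 1 = iphone.
toy_true = [0, 0, 0, 1, 1, 1, 1, 1]
toy_pred = [0, 1, 0, 1, 1, 0, 1, 1]

toy_acc = accuracy_score(toy_true, toy_pred)
# Rows are the true classes, columns are the predicted classes.
toy_cm = confusion_matrix(toy_true, toy_pred)

print(toy_acc)
print(toy_cm)
```

On the real data the same two calls would take `yf_test` and `pred`.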

Getting the most informative features out of sklearn is a little ugly:

In [14]:
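# Code for the binary classification case, adapted from:
# http://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers

def show_most_informative_features(vectorizer, classifier, n=10):
    # scikit-learn >= 1.0; older releases use vectorizer.get_feature_names()
    feature_names = vectorizer.get_feature_names_out()
    # Log-probability of each feature under class 1. For a binary model this is
    # what classifier.coef_[0] returned before coef_ was removed from naive Bayes.
    coefs_with_fns = sorted(zip(classifier.feature_log_prob_[1], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))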

In [40]:
# Left column: features with the lowest log-probability under class 1 (iphone), i.e. the least iphone-like.
# Right column: features with the highest log-probability under class 1.
show_most_informative_features(tfidfvec, clf, n=30)

	-11.8122	\the           		-2.5509	https          
	-11.8122	a.m.           		-3.2237	thank          
	-11.8122	abc            		-3.2491	trump2016      
	-11.8122	action         		-3.6585	makeamericagreatagain
	-11.8122	agent          		-4.3619	great          
	-11.8122	ago            		-4.4612	america        
	-11.8122	allowed        		-4.6306	amp            
	-11.8122	along          		-4.6651	americafirst   
	-11.8122	angry          		-4.7050	new            
	-11.8122	anncoulter     		-4.7546	join           
	-11.8122	anticipated    		-4.7723	make           
	-11.8122	anyone         		-4.8030	hillary        
	-11.8122	arena          		-4.8305	crookedhillary 
	-11.8122	audit          		-4.9135	votetrump      
	-11.8122	badly          		-4.9668	imwithyou      
	-11.8122	ballot         		-5.0513	soon           
	-11.8122	barbara        		-5.2014	clinton        
	-11.8122	bertshad       		-5.2391	tonight        
	-11.8122	biggest        		-5.2512	support        
	-11.8122	board          

We can now use the model to classify a tweet if we want.
(Class 0, android, is Trump himself; class 1, iphone, is not Trump.)

In [16]:
clf.predict(tfidfvec.transform(["Proof she is a liar! Bad!"]))   # android

array([0])

In [17]:
clf.predict(tfidfvec.transform(["Join me at 7 and #makeamericangreat"]))  # iphone

array([1])
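`predict` only returns the winning class. If you also want to know how confident the model is, `predict_proba` gives a probability for each class. A self-contained sketch on a made-up corpus (the `toy_*` names and texts are hypothetical; with the notebook's objects the equivalent call would be `clf.predict_proba(tfidfvec.transform([...]))`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 0 = android, 1 = iphone.
toy_train = ["sad bad liar", "bad dishonest media", "join me tonight", "thank you ohio"]
toy_target = [0, 0, 1, 1]

# The pipeline applies the vectorizer and then the classifier in one step.
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.01))
toy_model.fit(toy_train, toy_target)

# One row per input text, one column per class (android, iphone).
toy_proba = toy_model.predict_proba(["bad liar"])
print(toy_proba)
```

Naive Bayes probabilities tend to be pushed toward 0 or 1, so treat them as a rough ranking rather than a calibrated confidence.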

In [41]:
bunchf['target_names'][0]

'android'

In [42]:
bunchf['target_names'][1]

'iphone'