## Naive Bayes in Scikit-Learn: A Brief Intro



This article is a good introduction to Naive Bayes:
* yhat post on Naive Bayes: http://blog.yhathq.com/posts/naive-bayes-in-python.html

For scikit-learn, it's convenient to keep the data files in separate directories, one per class.
For movie reviews that means a "positive" and a "negative" folder; for Trump tweets, say, an "iphone" and an "android" folder. To train on reviews with star ratings, you might use a folder of 1-star reviews and a folder of 5-star reviews.
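
As a rough illustration of that one-folder-per-class layout, the sketch below builds a tiny throwaway directory tree and reads it back into parallel lists of texts and labels. It uses only the standard library; the helper `read_labeled_folders`, the file names, and the review texts are made up for the example, and it only approximates what a loader such as `load_files` returns.

```python
import tempfile
from pathlib import Path

def read_labeled_folders(root):
    """Read root/<category>/*.txt into parallel lists of texts and integer labels."""
    root = Path(root)
    # Category names come from the subfolder names, sorted alphabetically.
    target_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    data, target = [], []
    for label, name in enumerate(target_names):
        for path in sorted((root / name).glob("*.txt")):
            data.append(path.read_text(encoding="latin1"))
            target.append(label)
    return data, target, target_names

# Build a throwaway example tree (hypothetical file names and contents).
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("positive", "loved it"), ("negative", "hated it")]:
        folder = Path(tmp) / name
        folder.mkdir()
        (folder / "review1.txt").write_text(text, encoding="latin1")
    data, target, target_names = read_labeled_folders(tmp)

print(target_names)  # ['negative', 'positive']
print(target)        # [0, 1]
```

Because the folder names are sorted, "negative" becomes label 0 and "positive" becomes label 1, which matches the `target_names` ordering shown later in this notebook.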

In [2]:
import nlp_utilities as mytools

from sklearn.datasets import load_files

The following command will not work on Windows:

In [44]:
!ls data/movie_reviews/

SOURCE_README.txt  all_pos.txt        negative
all_neg.txt        allfiles           positive


Now let's use the data in your nltk_data/corpora/movie_reviews folder, since it has more reviews than our small sample.

In [45]:
# If you want to load data into sklearn, be sure to provide the names of the subfolders for the categories.

bunchf = load_files('data/movie_reviews/', categories=['positive','negative'], encoding="latin1")

In [46]:
# This is what sklearn creates:

bunchf.keys()

dict_keys(['target', 'filenames', 'data', 'DESCR', 'target_names'])

In [49]:
len(bunchf)

5

In [50]:
bunchf['target']

array([0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1])

In [18]:
bunchf['filenames'][0:10]

array(['data/movie_reviews/negative/cv696_tok-28821.txt',
       'data/movie_reviews/positive/cv675_tok-11864.txt',
       'data/movie_reviews/positive/cv699_tok-10425.txt',
       'data/movie_reviews/negative/cv698_tok-20916.txt',
       'data/movie_reviews/negative/cv681_tok-11979.txt',
       'data/movie_reviews/negative/cv672_tok-20564.txt',
       'data/movie_reviews/positive/cv674_tok-11591.txt',
       'data/movie_reviews/positive/cv698_tok-27735.txt',
       'data/movie_reviews/positive/cv680_tok-18142.txt',
       'data/movie_reviews/negative/cv692_tok-4797.txt'],
      dtype='<U47')

In [53]:
bunchf['target_names']  # 0 is negative, 1 is positive

['negative', 'positive']

In [47]:
len(bunchf.data)

60

### Split the data into train and test sets using sklearn's model_selection module

In [33]:
from sklearn import model_selection

# Try passing a random_state and changing the share of test data: the results vary noticeably.
Xf_train, Xf_test, yf_train, yf_test = model_selection.train_test_split(bunchf.data, 
                                                                         bunchf.target, 
                                                                         test_size=0.10)
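
A minimal sketch of how `random_state` and `stratify` affect the split, assuming a stand-in list of 60 documents with 30 per class (the `docs` and `labels` names are hypothetical, not the movie review data):

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for bunchf.data / bunchf.target: 60 documents, 30 per class.
docs = ["review %d" % i for i in range(60)]
labels = [0] * 30 + [1] * 30

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels,
    test_size=0.10,     # 6 of the 60 documents are held out
    random_state=42,    # fixes the shuffle so reruns give the same split
    stratify=labels,    # keeps the class ratio the same in both parts
)

print(len(X_train), len(X_test))  # 54 6
print(sorted(y_test))             # [0, 0, 0, 1, 1, 1]
```

With only six test documents, a single misclassification moves the accuracy by almost 17 points, so different random seeds can give quite different scores.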

In [26]:
# instead of a simple true/false for a feature (word), we'll use the TF-IDF weight.

from sklearn.feature_extraction.text import TfidfVectorizer

In [27]:
# The sklearn vectorizer for TF-IDF has the stopwords as an option and a
# lot of other features we can play with.
# This is where I hacked around for a while trying to improve the results. You can too!

tfidfvec = TfidfVectorizer(tokenizer=mytools.tokenize_clean,
                           stop_words=["'s", "'m", "n't", "'d", "'ve", "'t", "'ll", "'re"],
                           ngram_range=(1,2),
                           max_df=0.80,
                           #max_features=20000,
                           min_df=3)

# we create the tf-idf model from the training data:
vectors_train = tfidfvec.fit_transform(Xf_train)

# Depending on whether you stemmed or lemmatized, you'll get different column numbers here!
vectors_train.shape

(54, 1271)
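
To get a feel for what `ngram_range`, `min_df`, and `max_df` do, here is a small sketch on a made-up four-sentence corpus. It uses the default tokenizer rather than `mytools.tokenize_clean`, so the exact vocabulary for the movie reviews will differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical four-document corpus, not the movie review data.
docs = [
    "the movie was good",
    "the movie was bad",
    "the plot was good",
    "the plot twist was bad",
]

# Unigrams and bigrams; keep terms found in at least 2 documents
# and in no more than 80% of them.
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.80)
X = vec.fit_transform(docs)

print(X.shape)                  # (4, 9)
print(sorted(vec.vocabulary_))
# ['bad', 'good', 'movie', 'movie was', 'plot', 'the movie', 'the plot', 'was bad', 'was good']
```

Here "the" and "was" occur in every document, so `max_df=0.80` drops them, while "twist" and the bigrams that occur only once fall below `min_df=2`.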

In [18]:
TfidfVectorizer?

In [28]:
from sklearn.naive_bayes import MultinomialNB
import sklearn.metrics as metrics

# We set up our classifier
clf = MultinomialNB(alpha=.01)

# We train the classifier on the training data and target classes (pos/neg)
clf.fit(vectors_train, yf_train)

# We use the model on the test data:
vectors_test = tfidfvec.transform(Xf_test)

# We get a prediction from the test data 
pred = clf.predict(vectors_test)
# We check the accuracy against the "truth" in the yf_test var
metrics.accuracy_score(yf_test, pred)

0.5
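
The `alpha` argument controls additive smoothing: each feature's total weight in a class gets `alpha` added before normalizing, so features never seen in a class still receive a small nonzero probability. The sketch below is a hand-rolled approximation of that formula on made-up counts, not scikit-learn's own code:

```python
import math

def smoothed_log_probs(feature_counts, alpha):
    """Log of (count + alpha) / (total + alpha * n_features) for one class."""
    total = sum(feature_counts)
    denom = total + alpha * len(feature_counts)
    return [math.log((c + alpha) / denom) for c in feature_counts]

# Hypothetical per-class feature totals; the last two features never occur in this class.
counts = [3.0, 1.0, 0.0, 0.0]

for alpha in (1.0, 0.01):
    logs = smoothed_log_probs(counts, alpha)
    print(alpha, [round(v, 2) for v in logs])
```

A smaller `alpha` pushes the unseen features toward a much lower log-probability, and all unseen features in a class share exactly the same value.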

Getting most informative features out of sklearn is a little ugly:

In [29]:
# code for binary classification case posted here: 
# http://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers

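def show_most_informative_features(vectorizer, classifier, n=10):
    # Note: recent scikit-learn releases replace get_feature_names() with
    # get_feature_names_out() and no longer expose coef_ on MultinomialNB.
    feature_names = vectorizer.get_feature_names()
    # Sort features by their weight in the classifier that was passed in.
    coefs_with_fns = sorted(zip(classifier.coef_[0], feature_names))
    # Pair the n lowest-weighted features with the n highest-weighted ones.
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))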

In [30]:
# Left column: the features with the lowest weights; right column: the highest.
# For this binary model the weights are log-probabilities under the positive class.
show_most_informative_features(tfidfvec, clf, n=60)

	-10.2393	action sequences		-5.1675	movie          
	-10.2393	adventure      		-5.3990	like           
	-10.2393	air            		-5.4937	jackie         
	-10.2393	al             		-5.5858	good           
	-10.2393	almost every   		-5.6310	damon          
	-10.2393	archived       		-5.6580	see            
	-10.2393	badly          		-5.6648	time           
	-10.2393	battle         		-5.6978	show           
	-10.2393	beautiful      		-5.7301	also           
	-10.2393	brad           		-5.8238	douglas        
	-10.2393	central        		-5.8428	wife           
	-10.2393	character much 		-5.8492	new            
	-10.2393	cinematography 		-5.8654	first          
	-10.2393	clark          		-5.8856	way            
	-10.2393	crap           		-5.8989	great          
	-10.2393	critical       		-5.9013	performances   
	-10.2393	cry            		-5.9044	two            
	-10.2393	cut            		-5.9107	plot           
	-10.2393	cute           		-5.9191	even           
	-10.2393	davis          		-5.
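
The table above appears to rank features by a single per-class weight, so very common words such as "movie" can end up at the top. One alternative, sketched here on a made-up toy corpus, is to rank by the difference between the two classes' log-probabilities using the classifier's `feature_log_prob_` attribute. The helper name `top_features_by_log_ratio` and the toy data are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def top_features_by_log_ratio(vectorizer, classifier, n=3):
    """Return (most class-0-leaning, most class-1-leaning) features for a binary model."""
    # Order feature names by column index; vocabulary_ maps term -> column.
    names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    # feature_log_prob_ has one row per class: log P(feature | class).
    log_prob = classifier.feature_log_prob_
    ratio = log_prob[1] - log_prob[0]
    ranked = sorted(zip(ratio, names))
    return ranked[:n], ranked[:-(n + 1):-1]

# Hypothetical toy data: label 0 = negative, 1 = positive.
docs = ["awful boring film", "boring awful plot", "great fun film", "fun great plot"]
labels = [0, 0, 1, 1]
vec = TfidfVectorizer()
clf_toy = MultinomialNB(alpha=0.01).fit(vec.fit_transform(docs), labels)

neg, pos = top_features_by_log_ratio(vec, clf_toy, n=2)
print("negative-leaning:", [name for _, name in neg])
print("positive-leaning:", [name for _, name in pos])
```

Words that occur equally in both classes ("film", "plot" in this toy corpus) get a ratio near zero and drop out of both lists.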

We can now use it to classify a review if we want (negative reviews are 0, positive are 1):

In [31]:
clf.predict(tfidfvec.transform(["This movie was very bad. But I loved the characters and plot."]))

array([0])

In [32]:
clf.predict(tfidfvec.transform(["This movie was very good. But I loved the characters and plot."]))

array([1])
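
If a hard 0/1 label is not enough, `predict_proba` returns the estimated probability for each class. A minimal sketch on a made-up toy training set (the `docs`, `labels`, and `model` names are hypothetical; with the real data you would call `clf.predict_proba(tfidfvec.transform([...]))` instead):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set: label 0 = negative, 1 = positive.
docs = ["very bad movie", "bad plot", "very good movie", "loved the plot"]
labels = [0, 0, 1, 1]

vec = TfidfVectorizer()
model = MultinomialNB(alpha=0.01).fit(vec.fit_transform(docs), labels)

# One row per input text, one column per class, in the order of model.classes_.
proba = model.predict_proba(vec.transform(["This movie was very bad."]))
print(model.classes_)
print(proba.round(3))
```

Words that the vectorizer never saw during training ("this", "was" here) are simply ignored at prediction time.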