# Classification Problem -- Naive Bayes
<br>
**Pros**: Works with a small amount of data, handles multiple classes 

**Cons**: Sensitive to how the input data is prepared

**Works with**: Nominal values

## 1. Introduction
<font size = 3.5>
<br>
<font color = 'green'>**Naive**</font>: plain; that is, working under the simplest, most basic assumptions.
<br><br>
**Put simply, we choose the class with the higher probability. That’s Bayesian decision theory in a nutshell: choosing the decision with the highest probability.**
<br><br>
For example, we have an equation for the probability of a piece of data belonging to Class 1 (the circles): $p_1(x, y)$, and we have an equation for the probability of it belonging to Class 2 (the triangles): $p_2(x, y)$. To classify a new measurement with features (x, y), we use the following rules:
<br><br>
If $p_1(x, y) > p_2(x, y)$, then the class is 1. 
<br><br>
If $p_2(x, y) > p_1(x, y)$, then the class is 2.
<br><br>
With the kNN method, the classification problem above would be too computationally expensive. If we use decision trees from chapter 2 and split the data once along the x-axis and once along the y-axis, the result might not be satisfactory. Given this problem, the best choice would be the probability comparison we just discussed.
<br><br>
A useful way to manipulate conditional probabilities is known as **Bayes’ rule**.
<br><br>
$$p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)}$$
<br><br>
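One thing I can't help adding:
    the prior probability $p(\theta)$ is based on experience or other inference, while the posterior probability means reasoning backward from an observed effect to its cause.
    <br><br>
    From class: the prior probability is what we believe about $\theta$, and the posterior probability is how we adjust our belief about $\theta$ given some data.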
</font>
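
As a small numerical illustration of Bayes' rule from the introduction, the sketch below computes the posterior for a two-class problem and then picks the class with the highest value. The priors and likelihoods are made-up numbers, and `posterior` is a hypothetical helper name, not something from the original notes.

```python
def posterior(priors, likelihoods):
    """Return p(class | x) for each class via Bayes' rule.

    priors[c]      = p(c)
    likelihoods[c] = p(x | c)
    """
    # Unnormalized posteriors: p(x | c) * p(c)
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    # The evidence p(x) is the sum of the joint terms over all classes
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Made-up numbers for illustration only
priors = {1: 0.5, 2: 0.5}
likelihoods = {1: 0.2, 2: 0.6}

post = posterior(priors, likelihoods)
predicted = max(post, key=post.get)   # choose the class with the highest posterior
print(post, predicted)
```

Because the evidence $p(x)$ is the same for every class, comparing the unnormalized products would lead to the same decision.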

## 2. Document classification with naive Bayes
<br>
<font size = 3.5>The first assumption of naive Bayes is independence. By independence, I mean statistical independence: one feature or word is just as likely by itself as it is next to other words. The other assumption we make is that every feature is equally important.
<br>
<br>
Make a quick filter for an online message board that flags a message as inappropriate if the author uses negative or abusive language. Filtering out this sort of thing is common because abusive postings make people not come back and can hurt an online community. We’ll have two categories: abusive and not. We’ll use 1 to represent abusive and 0 to represent not abusive.
</font>
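
Written out, the independence assumption says that the probability of a whole document's words given a class factorizes into per-word terms. The notation here ($w_j$ for the $j$-th word feature, $c_i$ for class $i$, $\mathbf{w}$ for the full word vector) is an assumption chosen to match the products used later in these notes:

$$P(w_0, w_1, \dots, w_N \mid c_i) = \prod_{j=0}^{N} P(w_j \mid c_i)$$

Plugging this into Bayes' rule gives the quantity that is compared across classes:

$$P(c_i \mid \mathbf{w}) = \frac{P(c_i)\prod_{j=0}^{N} P(w_j \mid c_i)}{P(\mathbf{w})}$$

Since $P(\mathbf{w})$ is the same for every class, only the numerator matters when choosing the most probable class.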

### 2.1 Prepare: making word vectors from text

In [1]:
def load_dataset():
    posting_list=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    class_vec = [0,1,0,1,0,1]            #1 is abusive, 0 not
    return posting_list, class_vec 

def create_vocab_list(dataset):          #extract the unique word list from the text
    vocabulary_set = set([])                              #create empty set
    for document in dataset:
        vocabulary_set = vocabulary_set | set(document)   #union of the two sets
    return list(vocabulary_set)


# mark whether each vocabulary word appears in the given post
def wordset_to_vec(vocab_list, input_set):           #output is a vector containing 0 or 1
    return_vec = [0]*len(vocab_list)                 #create a vector containing only 0
    for word in input_set:
        if word in vocab_list:
            return_vec[vocab_list.index(word)] = 1
        else: 
            print("the word: %s is not in my Vocabulary!" % word)
    return return_vec
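
The set-of-words conversion above can also be sketched with a lookup table, which avoids the repeated `list.index` scans. This is only an illustrative variant under the assumption that out-of-vocabulary words can be skipped silently; `wordset_to_vec_indexed` and the toy vocabulary are hypothetical names and data.

```python
def wordset_to_vec_indexed(vocab_list, input_words):
    """Set-of-words vector using a word -> position lookup table."""
    # Build the lookup once; dict access is O(1) versus O(n) for list.index
    index = {word: i for i, word in enumerate(vocab_list)}
    vec = [0] * len(vocab_list)
    for word in input_words:
        pos = index.get(word)
        if pos is not None:       # silently skip out-of-vocabulary words
            vec[pos] = 1
    return vec


vocab = ['dog', 'my', 'stupid', 'love']          # toy vocabulary
result = wordset_to_vec_indexed(vocab, ['my', 'dog', 'my', 'cat'])
print(result)
```

The repeated word `'my'` still yields a single 1, which is what makes this a set-of-words rather than a bag-of-words vector.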


### 2.2 Naive Bayes classifier training function

In [2]:
def train_nb0(train_matrix, train_category):
    # train_matrix comes from return vec and train category comes from class_vec
    
    num_train_docs = len(train_matrix)
    num_words = len(train_matrix[0])
    p_abusive = sum(train_category)/float(num_train_docs)
    p0_numer = np.zeros(num_words) 
    p1_numer = np.zeros(num_words)              
    p0_denom = 0.0                      
    p1_denom = 0.0
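    # here each numerator is a numpy array with one element per vocabulary word,
    # while each denominator is the total word count
    # the 0 and 1 in p0 / p1 indicate whether the class is abusive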
    
    for i in range(num_train_docs):
        if train_category[i] == 1:
            p1_numer += train_matrix[i]        # add the word vector of an abusive post (a vector)
            p1_denom += sum(train_matrix[i])   # add the total number of words in that post (a number)
        else:
            p0_numer += train_matrix[i]
            p0_denom += sum(train_matrix[i])
    p1_vec = p1_numer/p1_denom                 # probability of each word appearing in abusive posts
    p0_vec = p0_numer/p0_denom          
    return p0_vec, p1_vec, p_abusive


### 2.3 Simple test

In [3]:
# create dataset for simple test

import numpy as np
posting_list, class_list = load_dataset()
print(posting_list)
print(class_list)
vocab_list = create_vocab_list(posting_list)
print(vocab_list)

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], ['stop', 'posting', 'stupid', 'worthless', 'garbage'], ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
[0, 1, 0, 1, 0, 1]
['maybe', 'food', 'dog', 'worthless', 'love', 'him', 'stop', 'posting', 'dalmation', 'licks', 'quit', 'take', 'to', 'so', 'I', 'is', 'flea', 'mr', 'my', 'park', 'please', 'not', 'steak', 'has', 'cute', 'garbage', 'how', 'help', 'ate', 'buying', 'stupid', 'problems']


In [4]:
# use the wordset_to_vec function to convert the text into the training set
# append every vector

train_mat = []
for posting in posting_list:
    train_mat.append(wordset_to_vec(vocab_list, posting))
train_mat

[[0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1],
 [1,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0],
 [0,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  1,
  1,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 [0,
  0,
  0,
  1,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  1,
  0],
 [0,
  0,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  0,
  0],
 [0,
  1,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  1,
  0]]

In [5]:
p0_vec, p1_vec, p_abusive = train_nb0(train_mat, class_list)
print(p0_vec)
print(p1_vec)
print(p_abusive)

[0.         0.         0.04166667 0.         0.04166667 0.08333333
 0.04166667 0.         0.04166667 0.04166667 0.         0.
 0.04166667 0.04166667 0.04166667 0.04166667 0.04166667 0.04166667
 0.125      0.         0.04166667 0.         0.04166667 0.04166667
 0.04166667 0.         0.04166667 0.04166667 0.04166667 0.
 0.         0.04166667]
[0.05263158 0.05263158 0.10526316 0.10526316 0.         0.05263158
 0.05263158 0.05263158 0.         0.         0.05263158 0.05263158
 0.05263158 0.         0.         0.         0.         0.
 0.         0.05263158 0.         0.05263158 0.         0.
 0.         0.05263158 0.         0.         0.         0.05263158
 0.15789474 0.        ]
0.5


We can see that the 31st entry of p1_vec (index 30) holds the largest probability, 0.15789474, and the word at that position in vocab_list is *stupid*, which means that *stupid* is the word most indicative of class 1 (abusive).
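
Rather than scanning the printed vector by eye, the position of the largest probability can be looked up programmatically. The sketch below uses short toy stand-ins for `vocab_list` and `p1_vec` (the values are illustrative, not the real output above); with a NumPy array, `np.argmax` would return the same index.

```python
# Toy stand-ins for vocab_list and p1_vec (values are illustrative only)
toy_vocab = ['dog', 'my', 'stupid', 'love']
toy_p1 = [0.10, 0.00, 0.16, 0.00]

# 0-based index of the largest probability
best = max(range(len(toy_p1)), key=lambda i: toy_p1[i])
print(best, toy_vocab[best])
```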

### 2.4 Modifying the classifier for real-world conditions
#### 2.4.1 The zero-probability problem
When we attempt to classify a document, we multiply a lot of probabilities together to get the probability that a document belongs to a given class. This will look something like $P(w_0|1)P(w_1|1)P(w_2|1)$. If any of these numbers are 0, then when we multiply them together we get 0. 

To lessen the impact of this, we’ll initialize all of our occurrence counts to 1, and we’ll initialize the denominators to 2. 
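
To see why the initialization matters, here is a minimal sketch comparing raw and smoothed estimates on toy counts. The helper `word_probs` and the counts are hypothetical; the smoothing follows the description above (counts start at 1, the denominator starts at 2).

```python
def word_probs(counts, total, smooth=False):
    """Estimate P(word | class) from raw counts, optionally smoothed."""
    if smooth:
        # Start every count at 1 and the denominator at 2
        return [(c + 1) / (total + 2) for c in counts]
    return [c / total for c in counts]


counts = [3, 0, 1]          # toy word counts for one class
total = sum(counts)

raw = word_probs(counts, total)
smoothed = word_probs(counts, total, smooth=True)

product_raw = raw[0] * raw[1] * raw[2]                  # one zero wipes out the product
product_smoothed = smoothed[0] * smoothed[1] * smoothed[2]
print(product_raw, product_smoothed)
```

With smoothing, a word that never appeared in a class only lowers the product instead of forcing it to zero.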

The <font color = 'blue'>train_nb0 function</font> changes as below:

In [6]:
def train_nb0(train_matrix, train_category):
    # train_matrix comes from return vec and train category comes from class_vec
    
    num_train_docs = len(train_matrix)
    num_words = len(train_matrix[0])
    p_abusive = sum(train_category)/float(num_train_docs)
    p0_numer = np.ones(num_words)          #changes here
    p1_numer = np.ones(num_words)          #changes here    
    p0_denom = 2.0                         #changes here
    p1_denom = 2.0                         #changes here
    
    for i in range(num_train_docs):
        if train_category[i] == 1:
            p1_numer += train_matrix[i]        
            p1_denom += sum(train_matrix[i])      
        else:
            p0_numer += train_matrix[i]
            p0_denom += sum(train_matrix[i])
    p1_vec = p1_numer/p1_denom                 
    p0_vec = p0_numer/p0_denom          
    return p0_vec, p1_vec, p_abusive


#### 2.4.2 The underflow problem
We might do too many multiplications of small numbers. When we go to calculate the product $P(w_0|c_i)P(w_1|c_i)P(w_2|c_i)...P(w_N|c_i)$ and many of these numbers are very small, we’ll get underflow, or an incorrect answer.

One solution to this is to take the natural logarithm of this product.
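
The effect is easy to reproduce. The sketch below multiplies 200 toy probabilities of 0.01 directly and also sums their logarithms, relying on $\ln(a \cdot b) = \ln(a) + \ln(b)$; the specific numbers are chosen only to force underflow.

```python
import math

probs = [0.01] * 200            # 200 small conditional probabilities (toy values)

product = 1.0
for p in probs:
    product *= p                # the true value, 1e-400, is below the float64 range

log_sum = sum(math.log(p) for p in probs)   # sum of logs instead of a product

print(product)    # underflows to 0.0
print(log_sum)    # finite, roughly -921
```

Since the logarithm is monotonically increasing, comparing the log sums of two classes picks the same class as comparing the products would.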

The <font color = 'blue'>train_nb0 function</font> changes as below:

In [7]:
from math import log

def train_nb0(train_matrix, train_category):
    # train_matrix comes from return vec and train category comes from class_vec
    
    num_train_docs = len(train_matrix)
    num_words = len(train_matrix[0])
    p_abusive = sum(train_category)/float(num_train_docs)
    p0_numer = np.ones(num_words) 
    p1_numer = np.ones(num_words)              
    p0_denom = 2.0                      
    p1_denom = 2.0
    
    for i in range(num_train_docs):
        if train_category[i] == 1:
            p1_numer += train_matrix[i]        
            p1_denom += sum(train_matrix[i])      
        else:
            p0_numer += train_matrix[i]
            p0_denom += sum(train_matrix[i])
    p1_vec = [log(x) for x in p1_numer/p1_denom]              #changes here           
    p0_vec = [log(x) for x in p0_numer/p0_denom]              #changes here
    return p0_vec, p1_vec, p_abusive


### 2.5 Naive Bayes classify function

In [8]:
def classify_nb(vec_to_classify, p0_vec, p1_vec, p_class1):
    p1 = sum(vec_to_classify * p1_vec) + log(p_class1)     #element-wise multiplication
                                                           #to multiply the first elements of both vectors, and so on
    p0 = sum(vec_to_classify * p0_vec) + log(1.0 - p_class1)
    if p1 > p0:
        return 1
    else: 
        return 0

In [9]:
# convenience function that wraps up all the steps (load_dataset, create_vocab_list,
# wordset_to_vec, train_nb0, classify_nb) and saves typing the code again

def testing_nb():
    list_of_posts, list_classes = load_dataset()
    vocab_list = create_vocab_list(list_of_posts)
    train_mat=[]
    for post_in_doc in list_of_posts:
        train_mat.append(wordset_to_vec(vocab_list, post_in_doc))
    p0_vector, p1_vector, p_abusive = train_nb0(np.array(train_mat), np.array(list_classes))
    test_entry = ['love', 'my', 'dalmation']
    this_doc = np.array(wordset_to_vec(vocab_list, test_entry))
    print(test_entry,'classified as: ', classify_nb(this_doc, p0_vector, p1_vector, p_abusive))
    test_entry = ['stupid', 'garbage']
    this_doc = np.array(wordset_to_vec(vocab_list, test_entry))
    print(test_entry,'classified as: ', classify_nb(this_doc, p0_vector, p1_vector, p_abusive))


In [10]:
testing_nb()

['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1


### 2.6 Prepare: the bag-of-words document model
Up until this point we’ve treated the presence or absence of a word as a feature. This could be described as a **set-of-words model**. 

If a word appears <font color = 'red'>more than once in a document</font>, that might convey some sort of information about the document over just the word occurring in the document or not. This approach is known as a **bag-of-words model**. A bag of words can have multiple occurrences of each word, whereas a set of words can have only one occurrence of each word.

In [11]:
# instead of setting the entry to 1, add 1 for every occurrence

def wordbag_to_vec(vocab_list, input_set):
    return_vec = [0]*len(vocab_list)
    for word in input_set:
        if word in vocab_list:
            return_vec[vocab_list.index(word)] += 1
    return return_vec
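
The difference between the two models shows up as soon as a word repeats. The following sketch builds both vectors for one toy document using `collections.Counter`; the vocabulary and document are made up for illustration.

```python
from collections import Counter

vocab = ['dog', 'my', 'stupid', 'love']             # toy vocabulary
doc = ['stupid', 'dog', 'stupid', 'stupid', 'cat']  # 'cat' is out of vocabulary

counts = Counter(doc)
bag_vec = [counts[w] for w in vocab]                 # bag-of-words: occurrence counts
set_vec = [1 if counts[w] else 0 for w in vocab]     # set-of-words: presence only
print(set_vec)
print(bag_vec)
```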

## 3. Example: classifying spam email with naive Bayes
### 3.1 Prepare: tokenizing text

In [12]:
# try to split the text

my_text = 'This book is the best book for Python or M.L. I\'ve ever laid eyes upon.'
my_text.split()
# That works well, but the punctuation is considered part of the word.

['This',
 'book',
 'is',
 'the',
 'best',
 'book',
 'for',
 'Python',
 'or',
 'M.L.',
 "I've",
 'ever',
 'laid',
 'eyes',
 'upon.']

In [13]:
# We could use regular expressions to split up the sentence on anything that isn’t a word or number.
import re
reg_ex = re.compile(r'\W+')     # \W matches any character that is not a word character
                                # * matches zero or more times; + one or more times; ? zero or one time
list_of_token = reg_ex.split(my_text)
print(list_of_token)
print(re.split(r'\W+', my_text))
print(re.split(r'\w+', my_text))
print(re.split(r'\s+', my_text))

['This', 'book', 'is', 'the', 'best', 'book', 'for', 'Python', 'or', 'M', 'L', 'I', 've', 'ever', 'laid', 'eyes', 'upon', '']
['This', 'book', 'is', 'the', 'best', 'book', 'for', 'Python', 'or', 'M', 'L', 'I', 've', 'ever', 'laid', 'eyes', 'upon', '']
['', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '.', '. ', "'", ' ', ' ', ' ', ' ', '.']
['This', 'book', 'is', 'the', 'best', 'book', 'for', 'Python', 'or', 'M.L.', "I've", 'ever', 'laid', 'eyes', 'upon.']


In [14]:
[tok for tok in list_of_token if len(tok)>0]               # drop empty strings

['This',
 'book',
 'is',
 'the',
 'best',
 'book',
 'for',
 'Python',
 'or',
 'M',
 'L',
 'I',
 've',
 'ever',
 'laid',
 'eyes',
 'upon']

In [15]:
[tok.lower() for tok in list_of_token if len(tok)>0]       # drop empty strings and convert everything to lowercase

['this',
 'book',
 'is',
 'the',
 'best',
 'book',
 'for',
 'python',
 'or',
 'm',
 'l',
 'i',
 've',
 'ever',
 'laid',
 'eyes',
 'upon']

### 3.2 Test: cross validation with naive Bayes

In [16]:
def text_parse(big_string):    # input is big string and output is word list
    import re
    list_of_tokens = re.split(r'\W+', big_string)
    return [tok.lower() for tok in list_of_tokens if len(tok) > 2] 
# very short tokens carry little meaning, so they are dropped
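
An alternative way to get the same kind of token list is to match the words directly instead of splitting on the separators, which never produces empty strings. This is only a sketch; `tokenize` and its `min_len` parameter are hypothetical names.

```python
import re

def tokenize(text, min_len=3):
    """Lowercase the text and keep runs of word characters of at least min_len."""
    return [tok for tok in re.findall(r'\w+', text.lower()) if len(tok) >= min_len]


sample = "This book is the best book for Python or M.L. I've ever laid eyes upon."
tokens = tokenize(sample)
print(tokens)
```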

In [17]:
import random

def spam_test():
    doc_list=[]
    class_list = []
    full_text = []
    for i in range(1,26):
        word_list = text_parse(open('/Users/elinabian 1/Desktop/CU-life/summerlearning/mlinaction/dataset/email/spam/%d.txt' % i, encoding = "ISO-8859-1").read())
        doc_list.append(word_list)     #every text record as a list 
        full_text.extend(word_list)    #every text record combining into one list
        class_list.append(1)           #class 1 represents spam
        word_list = text_parse(open('/Users/elinabian 1/Desktop/CU-life/summerlearning/mlinaction/dataset/email/ham/%d.txt' % i, encoding = "ISO-8859-1").read())
        doc_list.append(word_list)
        full_text.extend(word_list)
        class_list.append(0)           #class 0 represents not spam
    vocab_list = create_vocab_list(doc_list)       #create vocabulary list
    
    
    training_set = list(range(50))
    test_set=[]                                    #create test set
    for i in range(10):                #randomly select 10 of the 50 files as the test set
        rand_index = int(random.uniform(0,len(training_set)))     #draw a position uniformly from the remaining indices; 10 draws in total
        
        # Each selected index is added to the test set and removed from the training set,
        # so training_set and test_set are both lists of document indices.
        test_set.append(training_set[rand_index])
        del(training_set[rand_index]) 
        
        
    train_mat=[]
    train_classes = []
    for doc_index in training_set:      #train the classifier (get probs) train_nb0
        train_mat.append(wordbag_to_vec(vocab_list, doc_list[doc_index]))
        train_classes.append(class_list[doc_index])
    p0_vec, p1_vec, p_spam = train_nb0(np.array(train_mat), np.array(train_classes))
    
    
    error_count = 0
    for doc_index in test_set:          #classify the remaining items in test set
        word_vector = wordbag_to_vec(vocab_list, doc_list[doc_index])
        if classify_nb(np.array(word_vector), p0_vec, p1_vec, p_spam) != class_list[doc_index]:
            error_count += 1
            print("classification error", doc_list[doc_index])
    print('the error rate is: ', float(error_count)/len(test_set))
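
The hold-out split inside `spam_test` can also be sketched with `random.sample`, which draws indices without replacement in one call. `holdout_split` and its `seed` parameter are hypothetical additions for illustration, not part of the code above.

```python
import random

def holdout_split(n_docs, n_test, seed=None):
    """Return (train_indices, test_indices) with n_test indices drawn without replacement."""
    rng = random.Random(seed)
    test_idx = rng.sample(range(n_docs), n_test)
    held_out = set(test_idx)
    train_idx = [i for i in range(n_docs) if i not in held_out]
    return train_idx, test_idx


train_idx, test_idx = holdout_split(50, 10, seed=0)
print(len(train_idx), len(test_idx))
```

Passing a fixed seed makes a particular split reproducible, which helps when comparing two versions of the classifier on the same test set.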


In [18]:
spam_test()

the error rate is:  0.0


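Because the test set is drawn at random, the result changes from run to run (the run shown above happened to give 0.0). When 1 of the 10 test emails is misclassified, a single word list is printed after "classification error"; if two are misclassified, two lists are printed.

Running the wrapped function several times and averaging the error rates gives a more reliable estimate of the error rate.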
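
A minimal sketch of that averaging is shown here. It assumes a variant of `spam_test` that returns its error rate instead of only printing it; since that variant is not part of these notes, `fake_run` stands in for it with made-up error rates, and `average_error` is a hypothetical helper.

```python
import random

def average_error(run_once, n_runs=10):
    """Average the error rate returned by run_once over n_runs independent runs."""
    rates = [run_once() for _ in range(n_runs)]
    return sum(rates) / len(rates)


rng = random.Random(0)

def fake_run():
    # Stand-in for a spam_test variant that returns its error rate;
    # pretends that 0 to 2 of the 10 test emails are misclassified.
    return rng.randint(0, 2) / 10


mean_rate = average_error(fake_run, n_runs=20)
print(mean_rate)
```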

## 4. Example: using naïve Bayes to reveal local attitudes from personal ads
We’re going to see if people in different cities use different words.
### 4.1 Collect: importing RSS feeds

In [19]:
import feedparser
ny = feedparser.parse('http://newyork.craiglist.org/sss/index.rss')
sf = feedparser.parse('https://sfbay.craigslist.org/sss/index.rss')

In [20]:
print(ny['entries'][0]['summary'])
print(len(ny['entries']))
print(len(sf['entries']))

# Not sure why only 25 entries can be retrieved

Up for sale is a barely used Apple TV with brand new HDMI cables, Power Cable, and OEM Remote. 
I am moving and do not need it. This was used for a few months max used for less than an 2 hours a week. Adult used, pet free, and smoke free home. First  ...
25
25


### 4.2 RSS feed classifier and frequent word removal functions

In [21]:
# calculate the frequency of word occurrence

def cal_most_freq(vocab_list, full_text):
    import operator
    freq_dict = {}
    for token in vocab_list:
        freq_dict[token]=full_text.count(token)
    sorted_freq = sorted(freq_dict.items(), key=operator.itemgetter(1), reverse=True)      #sort the dict items by value; also used in kNN
    return sorted_freq[:30]     
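
The same frequency ranking can be sketched with `collections.Counter`, which counts all tokens in a single pass instead of calling `full_text.count` once per vocabulary word. `most_frequent` is a hypothetical name, and unlike `cal_most_freq` it never lists a vocabulary word with a count of zero.

```python
from collections import Counter

def most_frequent(vocab_list, full_text, n=30):
    """Return the n most frequent (word, count) pairs, restricted to vocab_list."""
    vocab = set(vocab_list)
    counts = Counter(tok for tok in full_text if tok in vocab)   # one pass over the text
    return counts.most_common(n)


tokens = ['the', 'dog', 'the', 'cat', 'the', 'dog']
top2 = most_frequent(['the', 'dog', 'cat'], tokens, n=2)
print(top2)
```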

In [22]:
# similar to spam test

def local_words(feed1, feed0):
    import feedparser
    doc_list=[]
    class_list = []
    full_text =[]
    min_len = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(min_len):
        word_list = text_parse(feed1['entries'][i]['summary'])
        doc_list.append(word_list)
        full_text.extend(word_list)
        class_list.append(1)                             #NY is class 1
        word_list = text_parse(feed0['entries'][i]['summary'])
        doc_list.append(word_list)
        full_text.extend(word_list)
        class_list.append(0)
        
        
    vocab_list = create_vocab_list(doc_list)             #create vocabulary
    top30_words = cal_most_freq(vocab_list, full_text)   #remove top 30 words
    for pair_w in top30_words:
        if pair_w[0] in vocab_list: 
            vocab_list.remove(pair_w[0])                 #remove an element from a list
    
    
    training_set = list(range(2*min_len))
    test_set=[]                                           #create test set
    for i in range(5):
        rand_index = int(random.uniform(0, len(training_set)))
        test_set.append(training_set[rand_index])
        del(training_set[rand_index])  
        
        
    train_mat=[]
    train_classes = []
    for doc_index in training_set:                        #train the classifier (get probs) train_nb0
        train_mat.append(wordbag_to_vec(vocab_list, doc_list[doc_index]))
        train_classes.append(class_list[doc_index])
    p0_vec, p1_vec, p_spam = train_nb0(np.array(train_mat), np.array(train_classes))
    
    
    error_count = 0
    for doc_index in test_set:                            #classify the remaining items
        word_vector = wordbag_to_vec(vocab_list, doc_list[doc_index])
        if classify_nb(np.array(word_vector), p0_vec, p1_vec, p_spam) != class_list[doc_index]:
            error_count += 1
    print('the error rate is: ', float(error_count)/len(test_set))
    return vocab_list, p0_vec, p1_vec


In [23]:
local_words(ny, sf)

the error rate is:  0.2


(['mountainbike',
  'free',
  'contact',
  'couple',
  '650',
  'tested',
  'shown',
  'top',
  'back',
  'long',
  'utf8',
  'jailbroken',
  'deliver',
  'including',
  'smoke',
  'alexa',
  'misc',
  'mustang',
  'mobile',
  'checking',
  'undercover',
  'vehicle',
  'dual',
  'children',
  'radiator',
  'ones',
  'yours',
  'cannot',
  'mattress',
  'lightbulbs',
  'wii',
  'job',
  'emblem',
  'closet',
  'barely',
  'left',
  'super',
  'cover',
  'girl',
  'audio',
  'infinite',
  'front',
  'od_aui_detailpages02',
  'store',
  'wieght',
  '918',
  'pipes',
  'delete',
  'phones',
  'charger',
  'urn',
  'lego',
  'french',
  'world',
  'firm',
  'shape',
  '1527089727',
  'rb14b',
  'cheapest',
  'little',
  'wood',
  'features',
  'mid',
  'business',
  'antique',
  'convertible',
  'tires',
  'qid',
  'billing',
  'plates',
  'machine',
  'stain',
  'ride',
  'already',
  'oem',
  'collection',
  'black',
  'press',
  'first',
  'pin',
  'weights',
  'lighted',
  'watts',
  'h

We can comment out the three lines that removed the most frequently used words and see the performance before and after. Error rate is 54% without these lines and 70% with the lines included. 

An interesting observation is that the top 30 words in these posts make up close to 30% of all the words used. The size of the *vocab_list* was about 3000 words when I was testing this. A small percentage of the total words makes up a large portion of the text. 

The reason for this is that a large percentage of language is *redundancy and structural glue*. 

Another common approach is to not just remove the most common words but to also remove this structural glue from a predefined list. This is known as a *stop word list*, and there are a number of sources of this available. 
<br><br>
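<font color = 'green'>The point this passage makes is only that the top words account for a large share of the text. As for their effect on classification, my feeling is that they should be ignored, yet the result here is that the classification error rate is lower when these words are kept.</font>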
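
Filtering the vocabulary against a predefined stop word list could look like the sketch below. The `STOP_WORDS` set is a tiny hand-picked example for illustration only (real lists are much longer), and `remove_stop_words` is a hypothetical helper.

```python
# A tiny, hand-picked stop word set for illustration only
STOP_WORDS = {'the', 'and', 'for', 'that', 'has', 'very', 'with', 'this'}

def remove_stop_words(vocab_list, stop_words=STOP_WORDS):
    """Return vocab_list without any word that appears in stop_words."""
    return [w for w in vocab_list if w not in stop_words]


vocab = ['weights', 'that', 'lamp', 'has', 'very', 'bench']
filtered = remove_stop_words(vocab)
print(filtered)
```

In `local_words`, such a filter would be applied to `vocab_list` in place of, or in addition to, the removal of the 30 most frequent words.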

### 4.3 Analyze: displaying locally used words

In [26]:
# Most descriptive word display function

def get_top_words(ny, sf):
    import operator
    vocab_list, p0_vec, p1_vec = local_words(ny, sf)
    top_ny=[]
    top_sf=[]
    for i in range(len(p0_vec)):
        if p0_vec[i] > -5.0 :          #logs were taken earlier, so the values are negative
                                       #this keeps every word above a threshold rather than the top N words
            top_sf.append((vocab_list[i], p0_vec[i]))
        if p1_vec[i] > -5.0 : 
            top_ny.append((vocab_list[i], p1_vec[i]))
            
            
    sorted_sf = sorted(top_sf, key=lambda pair: pair[1], reverse=True)
    print("SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**")
    for item in sorted_sf:
        print(item[0])
        
        
    sorted_ny = sorted(top_ny, key=lambda pair: pair[1], reverse=True)
    print("NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**")
    for item in sorted_ny:
        print(item[0])

In [31]:
get_top_words(ny, sf)

the error rate is:  0.2
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**
weights
lamp
bench
each
wieght
bar
seats
right
buy
get
games
NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**
delete
game
that
sale
few
channel
block
old
everything
has
book
liqour
power
inquiries
approximately
comes
well
very
great
never
vague


The words from this output are entertaining. 

One thing to note: a lot of stop words appear in the output. It would be interesting to see how things would change if you removed the *fixed stop words*. In my experience, the classification error will also go down.