# Tweet classification with naive Bayes

For this notebook we are going to implement a naive Bayes classifier that decides whether a tweet is about Trump or Obama based on the words in the tweet. Recall that for two events A and B, Bayes' theorem says

$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$

where P(A) is the ***class probability***, P(B|A) is the ***conditional probability***, and P(B) is the probability of observing B at all. This gives us the probability of A happening given that B has occurred. For example, if we want the probability that a tweet is about Trump given that it contains the word "president", we obtain the following

$$ P(\text{"Trump"}|\text{"president" in tweet}) = \frac{P(\text{"president" in tweet}|\text{"Trump"})P(\text{"Trump"})}{P(\text{"president" in tweet})} $$

This means that to find the probability that a tweet is about Trump given that it contains the word "president", we need the probability of "president" being in a tweet about Trump, the probability of a tweet being about Trump, and the probability of "president" being in a tweet.

Similarly, for the opposite question, "is this a tweet about Obama given that it contains the word "president"?", we get

$$ P(\text{"Obama"}|\text{"president" in tweet}) = \frac{P(\text{"president" in tweet}|\text{"Obama"})P(\text{"Obama"})}{P(\text{"president" in tweet})} $$

where we need the probability of "president" being in a tweet about Obama, the probability of a tweet being about Obama and the probability of "president" being in a tweet. 

We can now build a classifier that compares those two probabilities and assigns the tweet to whichever class has the larger one:

if P("Trump"|"president" in tweet) $>$ P("Obama"|"president" in tweet)
    
   Tweet is about Trump

else
   
   Tweet is about Obama
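
The comparison above can be sketched in a few lines. The counts below are invented for illustration and are not taken from the tweet data, and `posterior` is a hypothetical helper name.

```python
def posterior(p_word_given_class, p_class, p_word):
    """Bayes' theorem: P(class | word) = P(word | class) * P(class) / P(word)."""
    return p_word_given_class * p_class / p_word

# Invented toy counts: 100 tweets, 70 about Trump and 30 about Obama.
# "president" appears in 14 of the Trump tweets and 12 of the Obama tweets.
n_trump, n_obama = 70, 30
n_total = n_trump + n_obama
pres_trump, pres_obama = 14, 12

p_word = (pres_trump + pres_obama) / n_total  # P("president" in tweet)
p_trump = posterior(pres_trump / n_trump, n_trump / n_total, p_word)
p_obama = posterior(pres_obama / n_obama, n_obama / n_total, p_word)

label = "Trump" if p_trump > p_obama else "Obama"
print(round(p_trump, 3), round(p_obama, 3), label)
```

With these toy numbers the two posteriors add up to 1, because every tweet belongs to exactly one of the two classes.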

Now let's expand this to handle multiple features and put the naive assumption into Bayes' theorem. The naive assumption is that the features are independent of each other, and for independent events we have

$$ P(A,B) = P(A)P(B) $$

This gives us:

$$ P(A|b_1,b_2,...,b_n) = \frac{P(b_1|A)P(b_2|A)...P(b_n|A)P(A)}{P(b_1)P(b_2)...P(b_n)} $$

or

$$ P(A|b_1,b_2,...,b_n) = \frac{P(A)\prod_{i=1}^{n}P(b_i|A)}{P(b_1)P(b_2)...P(b_n)} $$


So our previous example, expanded with one more word ("is this a tweet about Trump given that it contains the words "president" and "America"?"), gives us

$$ P(\text{"Trump"}|\text{"president", "America" in tweet}) = \frac{P(\text{"president" in tweet}|\text{"Trump"})P(\text{"America" in tweet}|\text{"Trump"})P(\text{"Trump"})}{P(\text{"president" in tweet})P(\text{"America" in tweet})} $$

As you can see, the denominator does not depend on the class, so it is the same for every class we compare. We can therefore drop it, and the final classifier becomes

$$ y = \underset{A}{\operatorname{argmax}} \; P(A)\prod_{i=1}^{n}P(b_i|A) $$
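
As a rough illustration of this rule, the sketch below scores each class with its prior times the product of the conditional probabilities and picks the largest. The probabilities are invented, and `classify` is a hypothetical helper, not part of the notebook.

```python
import math

def classify(words, priors, cond_probs):
    """Return the class A that maximises P(A) * prod_i P(b_i | A), plus all scores."""
    scores = {}
    for cls, prior in priors.items():
        scores[cls] = prior * math.prod(cond_probs[cls][w] for w in words)
    return max(scores, key=scores.get), scores

# Invented toy probabilities, not estimated from the tweet data.
priors = {"Trump": 0.7, "Obama": 0.3}
cond_probs = {
    "Trump": {"president": 0.2, "america": 0.10},
    "Obama": {"president": 0.4, "america": 0.05},
}
best, scores = classify(["president", "america"], priors, cond_probs)
print(best, scores)
```

Here the Trump score is 0.7 * 0.2 * 0.10 = 0.014 and the Obama score is 0.3 * 0.4 * 0.05 = 0.006, so the argmax picks Trump even though "president" alone is more likely under Obama.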

In [2]:
#stuff to import
import pandas as pd
import numpy as np
import random
import sklearn
from sklearn.model_selection import train_test_split

In [3]:
print(pd.__version__)
print(np.__version__)
print(sklearn.__version__)
def version_tuple(v):
    # Compare versions as numbers (as plain strings, "1.10" would sort before "1.9").
    # Assumes plain numeric release versions such as "1.2.4".
    return tuple(int(part) for part in v.split(".")[:3])

tested_with = [(pd, "1.2.1"), (np, "1.19.4"), (sklearn, "0.24.0")]
if all(version_tuple(mod.__version__) >= version_tuple(v) for mod, v in tested_with):
    print("I assume we're good to go!")
else:
    print("Looks like you have older versions than us, things may behave differently!")

1.2.4
1.20.1
0.24.1
I assume we're good to go!

Load the data and explore

In [4]:
df_t = pd.read_csv('trump_20200530.csv')
trump_tweets = df_t['text']
df_t = pd.read_csv('Tweets-BarackObama.csv')
obama_tweets = df_t['Tweet-text']

# pd.concat instead of Series.append, which was removed in pandas 2.0
tweet_data = pd.concat([trump_tweets, obama_tweets], ignore_index=True)
tweet_labels = np.array(['T' for _ in range(len(trump_tweets))] + ['O' for _ in range(len(obama_tweets))])

In [5]:
lab, counts = np.unique(tweet_labels, return_counts=True)
print('Number of tweets about ', lab[0], ': ', counts[0])
print('Number of tweets about ', lab[1], ': ', counts[1])

Number of tweets about  O :  6851
Number of tweets about  T :  18467


As you can see, we have many more Trump tweets than Obama tweets, so simply guessing that every tweet is a Trump tweet already gives a classifier that is correct about 73% of the time (18467 of 25318), but we can do better than this.
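
The baseline figure follows directly from the class counts printed above. A minimal sketch (the helper name `majority_baseline` is made up for this example):

```python
def majority_baseline(class_counts):
    """Accuracy of always predicting the most frequent class."""
    label = max(class_counts, key=class_counts.get)
    return label, class_counts[label] / sum(class_counts.values())

# Class counts as printed by the cell above.
label, acc = majority_baseline({"O": 6851, "T": 18467})
print(label, round(acc, 4))
```

Any classifier we build should beat this number on the test set to be worth having.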

Now let's split the data into a training set and a test set using scikit-learn's train_test_split function 
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html


In [6]:
#Split data into train_tweets, test_tweets, train_labels and test_labels
train_tweets, test_tweets, train_labels, test_labels = sklearn.model_selection.train_test_split(tweet_data, tweet_labels)
#understanding the split function:
test = sklearn.model_selection.train_test_split(tweet_labels)
print(len(test))
print(len(test[0]))
print(len(test[1]))

2
18988
6330
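
The sizes printed above match a 75/25 split, which is what train_test_split uses when neither test_size nor train_size is given. As a rough stdlib-only sketch of what such a shuffle split does (the helper name `shuffle_split` and the fixed seed are assumptions for illustration; the real function has more options, for example `stratify` and `random_state`):

```python
import math
import random

def shuffle_split(items, labels, test_fraction=0.25, seed=0):
    """Shuffle the indices, then use the last ceil(test_fraction * n) as the test set."""
    n = len(items)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * n)
    train_idx, test_idx = indices[:n - n_test], indices[n - n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Dummy stand-ins with the same number of rows as the tweet data.
items = list(range(25318))
labels = ["T"] * 18467 + ["O"] * 6851
train_x, test_x, train_y, test_y = shuffle_split(items, labels)
print(len(train_x), len(test_x))
```

Because the split is random, the class proportions in the two parts are only approximately equal to those of the full data set.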


What we need to build our classifier is the "probability of tweet about Obama" P(O), the "probability of tweet about Trump" P(T), the "probability of word in tweet given tweet about Obama" P(w|O) and the "probability of word in tweet given tweet about Trump" P(w|T). Start by calculating the probability that a tweet is about Obama and about Trump, respectively.
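
The class probabilities are just label frequencies. A minimal sketch on an invented label list (`class_priors` is a hypothetical helper name):

```python
from collections import Counter

def class_priors(labels):
    """Estimate P(class) as the fraction of labels belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Invented toy labels: 3 Obama tweets and 7 Trump tweets.
toy_labels = ["O"] * 3 + ["T"] * 7
priors = class_priors(toy_labels)
print(priors)
```

To keep the test set unseen, these frequencies should be computed from the training labels only.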

In [7]:
import string
Ob = Tr = total = 0
if lab[0] == 'O' and lab[1] == 'T':
    #We good!
    Ob = counts[0]
    Tr = counts[1]
elif lab[1] == 'O' and lab[0] == 'T':
    #we also good but they reversed... ok for now
    Ob = counts[1]
    Tr = counts[0]
else:
    raise Exception("Hey something is off with your data..")
total = Ob+Tr
print(f"Ob:{Ob}, Tr:{Tr}, total:{total}")

P_O = Ob/total
P_T = Tr/total
print(P_O, P_T)

## Now we can only use our training data!
lab, counts = np.unique(train_labels, return_counts=True)
print('Number of tweets about ', lab[0], ': ', counts[0])
print('Number of tweets about ', lab[1], ': ', counts[1])
total = len(train_tweets)
Ob = counts[0]
Tr = counts[1]
P_O = Ob/total
P_T = Tr/total

train_O_tweets = []
train_T_tweets = []
index = 0
for (tw,lb) in zip(train_tweets,train_labels):
    if lb == 'T':
        train_T_tweets.append(tw)
    elif lb == 'O':
        train_O_tweets.append(tw)
    else:
        raise Exception("Unexpected label!")      
    index+=1
assert(index==total)
print(f"len T tweets: {len(train_T_tweets)} len O tweets: {len(train_O_tweets)}")


#create new dict with each word in it to count
# got some help from https://www.geeksforgeeks.org/python-count-occurrences-of-each-word-in-given-text-file-using-dictionary/
# as well as from https://datagy.io/python-remove-punctuation-from-string/
# also this code is fairly repetitive, I should probably refactor it into some function... apologies
word_occurrences = dict()
for tw in train_tweets:
    tw = tw.lower()
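    # NOTE: with an empty third argument this translate call removes nothing;
    # string.punctuation was probably intended, so punctuation stays attached to words
    tw = tw.translate(str.maketrans('', '', ""))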
    words = tw.split(" ")
    for word in words:
        if word in word_occurrences:
            word_occurrences[word] = word_occurrences[word] + 1
        else:
            word_occurrences[word] = 1
print(len(word_occurrences))
#these are for all tweets. now let's create 2 more dict, one for t and one for O and count their word counts
word_occurrences_T = dict()
for tw in train_T_tweets:
    tw = tw.lower()
    tw = tw.translate(str.maketrans('', '', ""))
    words = tw.split(" ")
    for word in words:
        if word in word_occurrences_T:
            word_occurrences_T[word] = word_occurrences_T[word] + 1
        else:
            word_occurrences_T[word] = 1
word_occurrences_O = dict()
for tw in train_O_tweets:
    tw = tw.lower()
    tw = tw.translate(str.maketrans('', '', ""))
    words = tw.split(" ")
    for word in words:
        if word in word_occurrences_O:
            word_occurrences_O[word] = word_occurrences_O[word] + 1
        else:
            word_occurrences_O[word] = 1
assert(word_occurrences_O['president'] + word_occurrences_T['president'] == word_occurrences['president'])
print(word_occurrences_O['president'], word_occurrences_T['president'],word_occurrences_O['president']+word_occurrences_T['president'])
print( word_occurrences['president']  )
#print( word_occurrences )


Ob:6851, Tr:18467, total:25318
0.2705979935223951 0.7294020064776049
Number of tweets about  O :  5109
Number of tweets about  T :  13879
len T tweets: 13879 len O tweets: 5109
43766
912 1374 2286
2286


In [8]:
word_occurrences_unique = dict() #each word is only added maximum once per tweet here
for tw in train_tweets:
    tw = tw.lower()
    tw = tw.translate(str.maketrans('', '', ""))
    words = tw.split(" ")
    word_history = []
    for word in words:
        if word in word_occurrences_unique:
            if word not in word_history:
                word_occurrences_unique[word] = word_occurrences_unique[word] + 1
                word_history.append(word)
        else:
            if word not in word_history:
                word_occurrences_unique[word] = 1
                word_history.append(word)
    word_history = []



word_occurrences_unique_O = dict() #each word is only added maximum once per tweet here
for tw in train_O_tweets:
    tw = tw.lower()
    tw = tw.translate(str.maketrans('', '', ""))
    words = tw.split(" ")
    word_history = []
    for word in words:
        if word in word_occurrences_unique_O:
            if word not in word_history:
                word_occurrences_unique_O[word] = word_occurrences_unique_O[word] + 1
                word_history.append(word)
        else:
            if word not in word_history:
                word_occurrences_unique_O[word] = 1
                word_history.append(word)
    word_history = []

word_occurrences_unique_T = dict() #each word is only added maximum once per tweet here
for tw in train_T_tweets:
    tw = tw.lower()
    tw = tw.translate(str.maketrans('', '', ""))
    words = tw.split(" ")
    word_history = []
    for word in words:
        if word in word_occurrences_unique_T:
            if word not in word_history:
                word_occurrences_unique_T[word] = word_occurrences_unique_T[word] + 1
                word_history.append(word)
        else:
            if word not in word_history:
                word_occurrences_unique_T[word] = 1
                word_history.append(word)
    word_history = []

assert(word_occurrences_unique_O['president'] + word_occurrences_unique_T['president'] == word_occurrences_unique['president'])

For P(w|O) and P(w|T) we need to count how many tweets each word occurs in. Count the number of tweets each word occurs in and store the result in the word counter. An entry in the word counter is, for instance, {'president': {'O': 87, 'T': 100}}, meaning "president" occurs in 87 tweets about Obama and 100 tweets about Trump. Be aware that we are not interested in counting multiple occurrences of the same word in the same tweet. Convert each word to lower case; you can use Python's [lower](https://www.w3schools.com/python/ref_string_lower.asp). Another handy Python string method is [split](https://www.w3schools.com/python/ref_string_split.asp).
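
One compact way to get these per-tweet counts is to turn each tweet into a set of words before counting, so repeats inside a tweet collapse to one. The sketch below uses invented toy tweets, and `count_tweets_per_word` is a hypothetical helper name.

```python
from collections import defaultdict

def count_tweets_per_word(tweets, labels):
    """Map each word to {'O': n, 'T': m}: the number of tweets per class containing it."""
    counter = defaultdict(lambda: {"O": 0, "T": 0})
    for tweet, label in zip(tweets, labels):
        # A set keeps each word once, so repeats within a tweet are not double-counted.
        for word in set(tweet.lower().split()):
            counter[word][label] += 1
    return dict(counter)

# Invented toy tweets for illustration.
toy_tweets = ["The President the president", "president Obama", "Great great rally"]
toy_labels = ["T", "O", "T"]
word_counter_demo = count_tweets_per_word(toy_tweets, toy_labels)
print(word_counter_demo["president"])
```

Calling `split()` without an argument splits on any run of whitespace and never yields empty strings, whereas `split(" ")` produces an empty-string token wherever two spaces follow each other.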

In [12]:
word_counter = {}
print(f"length of tweets obama: {len(obama_tweets)} length of tweets trump:{len(trump_tweets)} ")
def word_counter_f(word, data):
    """Count how many strings in data contain word (case-insensitive).

    This is a substring test, so "president" also matches "presidents"
    and "vice-president". Repeated occurrences within one tweet are
    counted once.
    """
    word = word.lower()
    count = 0
    for text in data:  # named text so the string module is not shadowed
        if word in text.lower():
            count += 1
    return count


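#attempt 1
# NOTE: dict.fromkeys gives every key the SAME inner dict object, so all words
# end up sharing one pair of counts (visible in the attempt1 output)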
word_list = ["president", "hello", "obama", "trump", "word"]
word_counter = set()
for w in word_list:
    word_counter.add(w)
word_counter = dict.fromkeys(word_counter, {'O':0,'T':0})
for w in word_list:
    word_counter[w]['O'] += word_counter_f(w, obama_tweets)
    word_counter[w]['T'] += word_counter_f(w, trump_tweets)
print(f"attempt1{word_counter}")

#attempt 2
word_counter = {}
for w in word_list:
    word_counter[w] = [f"O:{(word_counter_f(w, obama_tweets))}"]
    word_counter[w].append( f"T:{word_counter_f(w, trump_tweets)}" )
print(f"attempt2{word_counter}")

#attempt 3
word_counter = {}
for w in word_list:
    word_counter[w] = {'O':0,'T':0}
for w in word_list:
    word_counter[w]['O']+=(word_counter_f(w, obama_tweets))
    word_counter[w]['T']+=(word_counter_f(w, trump_tweets))
print(f"attempt3{word_counter}")


#attempt 4
word_counter = {}
#obama
for key in list(word_occurrences_O.keys()):
    if key in word_counter:
            word_counter[key]['O'] = word_occurrences_O[key] + 1
    else:
        word_counter[key] = {'O':0,'T':0}
        word_counter[key]['O'] = word_counter[key]['O'] + word_occurrences_O[key]
#trump
for key in list(word_occurrences_T.keys()):
    if key in word_counter:
            word_counter[key]['T'] = word_occurrences_T[key] + 1
    else:
        word_counter[key] = {'O':0,'T':0}
        word_counter[key]['T'] = word_counter[key]['O'] + word_occurrences_T[key]
print(f"attempt4{word_counter['word']}")
#print(f"attempt4{word_counter}")

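#attempt 5 (per-tweet counts; this is the word_counter used from here on)
# NOTE: in the Trump loop, the "+ 1" in the if-branch adds one extra count to
# every word that also occurs in Obama tweets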
word_counter = {}
#obama
for key in list(word_occurrences_unique_O.keys()):
    if key in word_counter:
            word_counter[key]['O'] = word_occurrences_unique_O[key] + 1
    else:
        word_counter[key] = {'O':0,'T':0}
        word_counter[key]['O'] = word_occurrences_unique_O[key]
#trump
for key in list(word_occurrences_unique_T.keys()):
    if key in word_counter:
            word_counter[key]['T'] = word_occurrences_unique_T[key] + 1
    else:
        word_counter[key] = {'O':0,'T':0}
        word_counter[key]['T'] = word_occurrences_unique_T[key]
print(f"attempt5{word_counter['president']}")
print(len(train_O_tweets),len(train_T_tweets))


#word_counter = {}
train_data = train_tweets
for (tweet, label) in zip(train_data, train_labels):
    # ... Count number of tweets each word occurs in and store in word_counter where an entry looks like ex. {'word': 'O':98, 'T':10}
    #i = 1
    continue #do nothing
    
#print(word_counter)

length of tweets obama: 6851 length of tweets trump:18467 
attempt1{'trump': {'O': 5708, 'T': 6373}, 'word': {'O': 5708, 'T': 6373}, 'president': {'O': 5708, 'T': 6373}, 'hello': {'O': 5708, 'T': 6373}, 'obama': {'O': 5708, 'T': 6373}}
attempt2{'president': ['O:2775', 'T:2180'], 'hello': ['O:5', 'T:9'], 'obama': ['O:2874', 'T:505'], 'trump': ['O:0', 'T:3534'], 'word': ['O:54', 'T:145']}
attempt3{'president': {'O': 2775, 'T': 2180}, 'hello': {'O': 5, 'T': 9}, 'obama': {'O': 2874, 'T': 505}, 'trump': {'O': 0, 'T': 3534}, 'word': {'O': 54, 'T': 145}}
attempt4{'O': 9, 'T': 38}
attempt5{'O': 903, 'T': 1274}
5109 13879


Let's work with a smaller subset of words. Find the 100 most frequently occurring words in the tweet data.

In [13]:
nr_of_words_to_use = 100
popular_words = sorted(word_counter.items(), key=lambda x: x[1]['O'] + x[1]['T'], reverse=True)
popular_words = [x[0] for x in popular_words[:nr_of_words_to_use]]
print(popular_words)


['the', 'to', 'and', 'of', 'a', 'in', 'is', 'rt', 'for', 'on', 'that', 'are', 'our', 'with', 'be', 'will', 'we', 'i', 'this', 'president', 'have', 'you', 'great', 'it', 'obama', 'at', 'they', 'has', '&amp;', 'was', 'all', 'not', 'by', 'from', 'people', 'just', '—president', 'he', 'who', 'as', 'very', 'your', 'my', 'their', 'more', 'about', 'no', 'thank', 'so', 'democrats', 'if', 'but', 'do', 'get', 'now', 'an', 'new', 'his', 'than', 'out', '-', 'trump', 'been', 'what', 'time', 'up', '', 'big', 'or', 'american', 'should', 'news', 'make', 'fake', 'many', 'can', 'one', 'would', 'today', 'country', 'want', 'never', 'there', 'house', 'when', '@realdonaldtrump', 'u.s.', 'america', 'congress', 'good', '@realdonaldtrump:', 'like', 'me', 'how', 'united', 'going', 'even', 'only', 'much', 'years']
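
To see how the sort key works on a small scale, here is the same selection applied to an invented four-word counter (`most_common_words` is a hypothetical wrapper around the sorting step above):

```python
def most_common_words(word_counter, k):
    """Return the k words with the largest combined tweet count over both classes."""
    ranked = sorted(word_counter.items(),
                    key=lambda item: item[1]["O"] + item[1]["T"],
                    reverse=True)
    return [word for word, _ in ranked[:k]]

# Invented toy counts for illustration.
toy_counter = {
    "the": {"O": 40, "T": 90},
    "president": {"O": 30, "T": 20},
    "fake": {"O": 0, "T": 25},
    "hello": {"O": 1, "T": 2},
}
print(most_common_words(toy_counter, 2))
```

Ranking by the combined count favours words that are common in both classes, so frequent words are not necessarily the ones that separate the classes best.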


Now let's compute P(w|O) and P(w|T) for the popular words

In [14]:
P_w_given_t = dict()
P_w_given_o = dict()
for word in popular_words:
    # popular_words has no duplicates, so each word is filled in exactly once
    assert word not in P_w_given_o and word not in P_w_given_t
    # P(w|O): fraction of Obama training tweets that contain the word
    # (n_containing avoids reusing the name word_occurrences, which is the count dict from earlier)
    n_containing = word_counter_f(word, train_O_tweets)
    P_w_given_o[word] = n_containing / len(train_O_tweets)
    # P(w|T): fraction of Trump training tweets that contain the word
    n_containing = word_counter_f(word, train_T_tweets)
    P_w_given_t[word] = n_containing / len(train_T_tweets)
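
With the estimate above, a word that never occurs in one class gets a probability of exactly 0, which wipes out the whole product for that class. A common remedy, which this notebook does not use, is add-one (Laplace) smoothing. A minimal sketch, where `smoothed_probability` and `alpha` are names chosen for this example:

```python
def smoothed_probability(n_tweets_with_word, n_tweets_in_class, alpha=1.0):
    """P(word in tweet | class) with add-alpha smoothing.

    The 2 * alpha in the denominator covers the two possible outcomes,
    word present and word absent.
    """
    return (n_tweets_with_word + alpha) / (n_tweets_in_class + 2 * alpha)

# Invented toy counts: a word seen in 0 of 50 tweets and one seen in 10 of 50.
p_unseen = smoothed_probability(0, 50)
p_seen = smoothed_probability(10, 50)
print(p_unseen, p_seen)
```

The smoothed values always lie strictly between 0 and 1, which also keeps 1 - P(w|class) away from zero.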
    

In [15]:
classifier = {
    'basis'  : popular_words,
    'P(T)'   : P_T,
    'P(O)'   : P_O,
    'P(w|O)' : P_w_given_o,
    'P(w|T)' : P_w_given_t
    }   

Write a tweet_classifier function that takes your trained classifier and a tweet and returns whether the tweet is about Trump or Obama, using the popular words selected. Note that basis words that do not occur in the tweet contribute the complementary probability, i.e. P(w_1 occurs) * P(w_2 does not occur) * ... if w_1 occurs and w_2 does not occur. The function should return 'T' or 'O'.

In [18]:
def tweet_classifier(tweet, classifier_dict):
    """ param tweet: string containing tweet message
        param classifier_dict: dict containing 'basis' - training words
                                               'P(T)' - class probabilities
                                               'P(O)' - class probabilities
                                               'P(w|O)' - conditional probabilities
                                               'P(w|T)' - conditional probabilities
        
        return: either 'T' or 'O'
    """
    words_in_tweet = set(x.lower() for x in tweet.split())

    # Compare P(O) * prod P(w|O) against P(T) * prod P(w|T) over the basis
    # words found in the tweet. The denominator P(w_1)...P(w_n) is the same
    # positive number for both classes, so it cannot change which side is
    # larger and is left out.
    # NOTE: basis words that are absent from the tweet are not yet taken
    # into account here.
    score_o = classifier_dict['P(O)']
    score_t = classifier_dict['P(T)']
    for wd in words_in_tweet:
        if wd in classifier_dict['P(w|O)']:
            score_o = score_o * classifier_dict['P(w|O)'][wd] #P(w1|O) * ... * P(wn|O)
        if wd in classifier_dict['P(w|T)']:
            score_t = score_t * classifier_dict['P(w|T)'][wd] #P(w1|T) * ... * P(wn|T)

    if score_t > score_o:
        return 'T'
    else:
        return 'O'
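
The task description above asks for basis words that are absent from the tweet to contribute 1 - P(w|class). One way this could look is sketched below, working with sums of logarithms instead of products. The toy classifier is invented, and `bernoulli_nb_classify` is a hypothetical name; it assumes every probability is strictly between 0 and 1.

```python
import math

def bernoulli_nb_classify(tweet, classifier_dict):
    """Score each class with log P(class) plus, for every basis word,
    log P(w|class) if w is in the tweet and log(1 - P(w|class)) otherwise."""
    words_in_tweet = set(tweet.lower().split())
    scores = {}
    for cls in ("O", "T"):
        score = math.log(classifier_dict[f"P({cls})"])
        for word in classifier_dict["basis"]:
            p = classifier_dict[f"P(w|{cls})"][word]
            score += math.log(p if word in words_in_tweet else 1.0 - p)
        scores[cls] = score
    return max(scores, key=scores.get)

# Invented toy classifier; all probabilities are strictly between 0 and 1,
# otherwise math.log would fail on a zero.
toy_classifier = {
    "basis": ["president", "fake", "hope"],
    "P(T)": 0.7,
    "P(O)": 0.3,
    "P(w|T)": {"president": 0.10, "fake": 0.30, "hope": 0.02},
    "P(w|O)": {"president": 0.20, "fake": 0.01, "hope": 0.25},
}
first = bernoulli_nb_classify("Fake news again", toy_classifier)
second = bernoulli_nb_classify("Hope for the president", toy_classifier)
print(first, second)
```

Words outside the basis ("news", "again", "for", "the") have no effect, while the absence of "hope" in the first tweet counts as mild evidence against Obama.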





In [33]:
import math
def tweet_classifier(tweet, classifier_dict):
    """ param tweet: string containing tweet message
        param classifier: dict containing 'basis' - training words
                                          'P(T)' - class probabilities
                                          'P(O)' - class probabilities
                                          'P(w|O)' - conditional probabilities
                                          'P(w|T)' - conditional probabilities
        
        return: either 'T' or 'O'
    """
    words_in_tweet = np.unique([x.lower() for x in tweet.split()])
    
    # ... Code for classifying tweets using the naive bayes classifier
    """if P("trump"|"president" in tweet) > P("obama"|"president" in tweet) 
    then tweet is about trump"""   
    #obama
    list_of_probabilities = [] #P("president" in tweet|"obama"), ... ,P("america" in tweet|"obama")
    for word in words_in_tweet:
        try: 
            tweet_occurrences = word_occurrences_unique_O[word]
            # NOTE: len(word_occurrences_unique_O) is the number of distinct words, not the number of
            # Obama training tweets (len(train_O_tweets)), so this is not the tweet fraction P(w|O)
            P_w_given_o = tweet_occurrences/len(word_occurrences_unique_O)
            list_of_probabilities.append(P_w_given_o)
        except KeyError:
            #word does not occur in any Obama training tweet
            list_of_probabilities.append(1) #we ignore such words by multiplying with 1
    prob_products = math.prod(list_of_probabilities) #P("president" in tweet|"obama")*...*P("america" in tweet|"obama")
    nominator_o = P_O * prob_products #P("obama") * P("president" in tweet|"obama")*...*P("america" in tweet|"obama")
    
    #trump
    list_of_probabilities = [] #P("president" in tweet|"trump"), ... ,P("america" in tweet|"trump")
    for word in words_in_tweet:
        try: 
            tweet_occurrences = word_occurrences_unique_T[word]
            # NOTE: same normalisation issue as above (distinct words instead of len(train_T_tweets))
            P_w_given_t = tweet_occurrences/len(word_occurrences_unique_T)
            list_of_probabilities.append(P_w_given_t)
        except KeyError:
            #word does not occur in any Trump training tweet
            list_of_probabilities.append(1) #we ignore such words by multiplying with 1
    prob_products = math.prod(list_of_probabilities) #P("president" in tweet|"trump")*...*P("america" in tweet|"trump")
    nominator_t = P_T * prob_products #P("trump") * P("president" in tweet|"trump")*...*P("america" in tweet|"trump")



    #denominator (identical for both classes, so it does not affect the comparison below)
    list_of_probabilities = [] #P("president" in tweet), ... ,P("america" in tweet)
    for word in words_in_tweet:
        try: 
            tweet_occurrences = word_occurrences_unique[word]
            P_w = tweet_occurrences/len(word_occurrences_unique)
            list_of_probabilities.append(P_w)
        except KeyError:
            #word does not exist in training data
            list_of_probabilities.append(1) #we ignore new words by multiplying with 1 (new unseen words don't help with classification)
    prob_products = math.prod(list_of_probabilities) #P("president" in tweet)*...*P("america" in tweet)
    denominator = prob_products #P("president" in tweet)*...*P("america" in tweet)
    #print(probs)
    P_o_given_w = nominator_o/denominator
    P_t_given_w = nominator_t/denominator


    #print( )
    #P_t_given_w = 0.7
    #P_o_given_w = 0.3
    if P_t_given_w > P_o_given_w:
        return 'T'
    else:
        return 'O'





In [34]:
def test_classifier(classifier, test_tweets, test_labels):
    total = len(test_tweets)
    correct = 0
    for (tweet,label) in zip(test_tweets, test_labels):
        predicted = tweet_classifier(tweet,classifier)
        if predicted == label:
            correct = correct + 1
    return(correct/total)
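
Overall accuracy can hide a classifier that almost always predicts the same class. A possible companion to test_classifier counts (true label, predicted label) pairs instead. The helper name `confusion_counts` and the stand-in predictor are hypothetical, and the data is invented.

```python
from collections import Counter

def confusion_counts(predict, tweets, labels):
    """Count (true label, predicted label) pairs for a predict(tweet) function."""
    return Counter((label, predict(tweet)) for tweet, label in zip(tweets, labels))

def always_obama(tweet):
    """Stand-in predictor that ignores the tweet and always answers 'O'."""
    return "O"

# Invented toy data: three Trump tweets and one Obama tweet.
toy_tweets = ["a", "b", "c", "d"]
toy_labels = ["T", "T", "T", "O"]
pair_counts = confusion_counts(always_obama, toy_tweets, toy_labels)
accuracy = sum(n for (true, pred), n in pair_counts.items() if true == pred) / len(toy_labels)
print(pair_counts, accuracy)
```

If nearly all the mass sits in pairs with the same predicted label, the accuracy mostly reflects the class balance of the test set rather than anything the classifier has learned.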

In [35]:
acc = test_classifier(classifier, test_tweets, test_labels)
print(f"Accuracy: {acc:.4f}")

  P_o_given_w = nominator_o/denominator
  P_t_given_w = nominator_t/denominator


Accuracy: 0.2295
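
This accuracy is below what always guessing 'T' would achieve, and the division warnings above are consistent with the probability products becoming so small that they underflow. The sketch below shows the effect on invented numbers and the usual workaround of summing logarithms; it illustrates the numerical issue only and is not a diagnosis of every problem in the classifier.

```python
import math

# Invented example: 80 words that each have probability 1e-5.
probs = [1e-5] * 80

product = math.prod(probs)                 # underflows to 0.0 in double precision
log_sum = sum(math.log(p) for p in probs)  # stays a finite number

print(product)
print(log_sum)
```

Because the logarithm is monotonic, comparing log scores between classes picks the same class as comparing the products would, without the risk of both sides collapsing to zero.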
