# Discriminative ML

The goal is to learn a model which can tell the difference between classes. For example, consider sentences from class1 = Wikipedia (C1) or class2 = Twitter (C2). If we show our model a new input sentence (X) where we do NOT know the origin, our model should be able to tell where the sentence came from. 

Formally, the $i^{th}$ input is called $x_{n}$ and its true class is called $y_{n}$. Our model predicts $\hat y_{n}$ given $x_{n}$, and we want to train it so that $\hat y_{n}$ is close to $y_{n}$.

There are many, many different ways to do this prediction, and we will look at a simple one called Naive Bayes.

## Using a model to make a prediction
 
 How can we predict a class? 
 Lets think of a simple example: We are given a sentence ($x$), and need to predict where it came from ($\hat y$), Wikipedia (C1) or Twitter (C2). We start by asking a friend where they think it came from, and they say it has a 70% chance of being from Wikipedia and 30% from Twitter. 

 In mathmatical terms we can rewrite the "_probability of this specific sentence $x$ being from Wikipedia, $\hat y = C1$, is 0.7_" as $p(\hat y = C1 | x) = 0.7$. Similarly, we can write $p(\hat y = C2 | x) = 0.3$ for the sentence being from twitter.

 Now it is pretty easy to make a prediction, Wikipedia has a 70% chance and Twitter only has a 30% chance so we should predict Wikipedia!

 ### But how did our friend come up with $p(\hat y| x)$ in the first place?

 This is what Naive Bayes solves!

 ## Naive Bayes - Probability theory (spooky)

 Note: Naive Bayes is pretty simple, and is quite intuitive when you wrap your head around it, but if this is your first introduction to probability it can be quite confusing! If you don't understand at first don't get discoraged! I find that drawing diagrams and thinking about it from a few directions helps really understand.

 The end goal of a model is to calculate $p(y = C|X=x)$. In plain english, this can be read as calculate the _probability that the true class of the input is C given what we know about the sentence_. For a concrete example, lets use the sentence "The University of Auckland was founded on 23 May 1883". We want to predict the probability that $y = Wikipedia$ and $y = Twitter$ given that the sentence $x$ = "The University of Auckland was founded on 23 May 1883". This can be difficult to calculate!

 Instead of calculating this directly, _Bayes Theorum_ gives us a way to swap things around.

 $$ p(y=C|X=x) = p(X=x|y=C)p(y=C)


 

Load dependencies and files

In [53]:
import pathlib
import random
import numpy as np
import math

data_path = pathlib.Path('data')
chess_filename = "chess.txt"
music_filename = "music.txt"
angry_filename = "angry_topical_chat.txt"
happy_filename = "happy_topical_chat.txt"
disgusted_filename = "disgusted_topical_chat.txt"
trumpspeech_filename = "trumpSpeech.txt"
wallstreetbets_filename = "wallstreetbets_comments.txt"
javascript_filename = "javascript.txt"
shakespeare_filename = "shakespeare.txt"

def get_file_or_cache(path):
    cache = None
    def get():
        nonlocal cache
        if not cache:
            with path.open('r', encoding='utf8') as f:
                cache = f.readlines()
        return cache
    return (get, path.stem)

data_files = [get_file_or_cache(data_path / x) for x in [chess_filename, music_filename, happy_filename, trumpspeech_filename, wallstreetbets_filename, javascript_filename, shakespeare_filename]]

Construct probabilities. Want p(source), p(word) and p(word|Source)

In [2]:
p_source = {}
p_word = {}
p_word_given_source = {}
total_words = 0
for source_constructor, source in data_files:
    lines = source_constructor()
    for line in lines:
        for word in line.split(' '):
            if p_source.get(source, 0) > 400000:
                continue
            p_source[source] = p_source.get(source, 0) + 1
            p_word[word] = p_word.get(word, 0) + 1
            source_conditional = p_word_given_source.get(source, {})
            source_conditional[word] = source_conditional.get(word, 0) + 1
            p_word_given_source[source] = source_conditional
            total_words += 1
num_unique_words = len(p_word.keys())

In [3]:
p_source

{'chess': 115493,
 'music': 202479,
 'happy_topical_chat': 400001,
 'trumpSpeech': 400001,
 'wallstreetbets_comments': 400001,
 'javascript': 5848,
 'shakespeare': 7330}

In [79]:
def likelihood_word_from_source(word, source):
    """ number of times word is seen in source / number of words in source
    """
    # print(word)
    source_count_smoothing = (p_word_given_source[source].get(word, 0) + 1)
    source_total_smoothing = (p_source[source] + num_unique_words)
    likelihood = math.log(source_count_smoothing) - math.log(source_total_smoothing)
    # print(f"word count: {source_count_smoothing} / {source_total_smoothing} = {likelihood}")
    return likelihood

def likelihood_sentence_from_source(sentence, source):
    """ Naive Bayes uses the product of words to get sentence
    likelihood
    """
    likelihood_product = 0
    for word in sentence:
        word_likelihood = likelihood_word_from_source(word, source)
        
        likelihood_product += word_likelihood
        
    return likelihood_product

def likelihood_source_from_sentence(sentence, source, prior):
    """ Using the bayes rule, p(A|B) = p(B|A)p(A) / p(B).
    We ignore p(B) as it is the same for any A so we dont need
    it to compare As.
    Here A is the source and B is the sentence. p(A) is
    the number of words in a source / total number of words seen.
    """
    source_likelihood = (p_source[source] / total_words)
    sentence_likelihood = likelihood_sentence_from_source(sentence, source)
    # print(source, source_likelihood)
    # print(sentence_likelihood)
    return sentence_likelihood + prior

In [157]:
for r in range(5):
    # Prior is prob of drawning sentence from pool
    priors = {s: math.log(p_source[s]) - math.log(total_words) for g, s in data_files}
    # Prior uniformly pick source then sentence
    priors = {s: 1 / len(data_files) for g, s in data_files}
    data_index = random.choices(range(len(data_files)), [priors[s] for g,s in data_files ])[0]
    get_data, name = data_files[data_index]
    print(name)
    lines = get_data()
    sentence_str = random.choice(lines)
    sentence = sentence_str.split(' ')
    print("-------")
    print(f"Sentence is: {sentence_str}")
    print(f"True label, y = {name}")
    likelihoods = []
    for source in p_source:
        source_likelihood = likelihood_source_from_sentence(sentence, source, priors[source])
        print(f"{source} has likelihood {source_likelihood}")
        likelihoods.append((source_likelihood, source))
    likelihoods.sort(key = lambda x: x[0], reverse=True)
    print(f"we predict yhat = {likelihoods[0][1]}")
    print(f"We were {'right! :)' if likelihoods[0][1] == name else 'wrong. :('}")
    print("-------")

chess
-------
Sentence is: 1. d4 Nf6 2. c4 e6 3. Nc3 Bb4 4. Qc2 d5 5. a3 Bxc3+ 6. Qxc3 Ne4 7. Qb3 dxc4 8. Qxc4 Nd7 9. Nf3 O-O 10. Bf4 Nd6 11. Bxd6 cxd6 12. Rc1 Nb6 13. Qc7 h6 14. Qxd8 Rxd8 15. e4 Bd7 16. Be2 Rac8 17. O-O a6 18. b3 Rc6 19. b4 Rdc8 20. Rxc6 Rxc6 21. e5 d5 22. b5 Rc3 23. bxa6 bxa6 24. Rb1 Nc4 25. Rb8+ Kh7 26. Rb7 Bb5 27. Rc7 Rc1+ 28. Bf1 Ra1 29. h3 Nxa3 30. Nh2 Bxf1 31. Nxf1 Nc4 32. f3 Nd2 33. Kf2 Nxf1 34. Rxf7 Nd2 35. Re7 Nc4 36. Rxe6 a5 37. Rc6 a4 38. e6 Ra2+ 39. Kg3 a3 40. Rc7 Re2 41. e7 a2 42. Ra7 Rxe7 43. Rxa2 Re3 44. Kf4 Rd3 45. Kg4 Rxd4+ 46. Kf5 Rd1 47. Re2 d4 48. Ra2 d3 49. Ra4 Rc1 50. Ra2 d2 51. Rxd2 Nxd2 52. f4 Rg1 {White resigns} 0-1

True label, y = chess
chess 0.07542877818219342
-1104.6557750591624
chess has likelihood -1106.6016852082178
music 0.13223956064482126
-1941.6145484590863
music has likelihood -1943.5604586081417
happy_topical_chat 0.26124169171859374
-2077.7698904043286
happy_topical_chat has likelihood -2079.715800553384
trumpSpeech 0.2612416917

In [165]:
right = 0
wrong = 0
for r in range(10000):
    # Prior is prob of drawning sentence from pool
    priors = {s: p_source[s] / total_words for g, s in data_files}
    # Prior uniformly pick source then sentence
    priors = {s: 1 / len(data_files) for g, s in data_files}
    data_index = random.choices(range(len(data_files)), [priors[s] for g,s in data_files ])[0]
    get_data, name = data_files[data_index]
    lines = get_data()
    sentence_str = random.choice(lines)
    sentence = sentence_str.split(' ')
    likelihoods = []
    for source in p_source:
        source_likelihood = likelihood_source_from_sentence(sentence, source, priors[source])
        likelihoods.append((source_likelihood, source))
    likelihoods.sort(key = lambda x: x[0], reverse=True)
    correct = likelihoods[0][1] == name
    if correct:
        right += 1
    else:
        wrong += 1
print(f"Accuracy: {right / (right + wrong)}")

Accuracy: 0.85


In [55]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

X = []
Y = []
source_IDs = {}
ID_to_source = {}
source_ID = 0
for source_constructor, source in data_files:
    source_IDs[source] = source_ID
    ID_to_source[source_ID] = source
    lines = source_constructor()
    for line in lines:
        X.append(line)
        Y.append(source_ID)
    source_ID += 1




In [56]:
cv = CountVectorizer()
X_cv = cv.fit_transform(X)
X_cv, X, Y = shuffle(X_cv, X, Y) 
train_len = int(len(X) * 0.9)
# X_train, X_test, Y_train, Y_test = train_test_split(X_cv, Y)
X_train = X_cv[:train_len]
X_test = X_cv[train_len:]
Y_train = Y[:train_len]
Y_test = Y[train_len:]
X_sentence_train = X[:train_len]
X_sentence_test = X[train_len:]

In [57]:
len(X)

1193341

In [58]:
clf = MultinomialNB().fit(X_train, Y_train)

In [59]:

i = 3
sentence = X_test[i]
yhat = clf.predict(X_test[0])[0]
print(cv.inverse_transform(sentence))
print(ID_to_source[yhat])
print(ID_to_source[Y_test[i]])


[array([], dtype='<U1411')]
wallstreetbets_comments
wallstreetbets_comments


In [65]:
import tqdm
p_source = {}
p_word = {}
p_word_given_source = {}
total_words = 0
for line_cv, line, source_cv in tqdm.tqdm(zip(X_train, X_sentence_train, Y_train), total = X_train.shape[0]):
    source = ID_to_source[source_cv]
    # print(line)
    # print(source)
    for word in line.split(' '):
        p_source[source] = p_source.get(source, 0) + 1
        p_word[word] = p_word.get(word, 0) + 1
        source_conditional = p_word_given_source.get(source, {})
        source_conditional[word] = source_conditional.get(word, 0) + 1
        p_word_given_source[source] = source_conditional
        total_words += 1
num_unique_words = len(p_word.keys())

100%|██████████| 1074006/1074006 [00:58<00:00, 18208.34it/s]


In [78]:
right = 0
wrong = 0
sklearn_right = 0
sklearn_wrong = 0
for r in tqdm.tqdm(range(X_test.shape[0])):
    # Prior is prob of drawning sentence from pool
    priors = {s: math.log(p_source[s]) - math.log(total_words) for g, s in data_files}
    # Prior uniformly pick source then sentence
    # priors = {s: 1 / len(data_files) for g, s in data_files}
    sentence_cv = X_test[r]
    source_ID = Y_test[r]
    # print(sentence)
    # print(source_ID)
    sentence = X_sentence_test[r]
    source = ID_to_source[source_ID]
    sentence_str = ' '.join(sentence)
    likelihoods = []
    for s in p_source:
        source_likelihood = likelihood_source_from_sentence(sentence, s, priors[s])
        likelihoods.append((source_likelihood, s))
    likelihoods.sort(key = lambda x: x[0], reverse=True)
    # print(likelihoods)
    our_yhat = likelihoods[0][1]
    correct = our_yhat == source
    if correct:
        right += 1
    else:
        wrong += 1
    
    sklearn_yhat  = clf.predict(sentence_cv)[0]
    sk_correct = sklearn_yhat == source_ID
    if sk_correct:
        sklearn_right += 1
    else:
        sklearn_wrong += 1
    if not correct and False:
        print(sentence)
        print(source_ID)
        print(likelihoods)
        print(f"Ours {our_yhat} Theirs {sklearn_yhat} GT: {source}")
print(f"Our Acc: {right / (right + wrong)}")
print(f"SK Acc: {sklearn_right / (sklearn_right + sklearn_wrong)}")

100%|██████████| 119335/119335 [10:09<00:00, 195.67it/s]Our Acc: 0.9392382787949889
SK Acc: 0.9791008505467801



In [76]:
p_word_given_source['wallstreetbets_comments']['[deleted]']

2