# Machine Learning for text classification

Let's set the stage! After a year full of COVID disruptions, the university has made all exams online. Celebrations all around! But wait, what about the cheating, you say? Fortunately (or unfortunately), the uni has predicted this and hired a crack team of Designated Exam Validators (or DEVs for short) to check for plagiarism. And you've just been hired! And of course, you're going to use ML to do all your work for you!

# Problem Statement

Given a sentence, we need to find the probability that it came from each possible source.

For example, given the text "The University of Auckland began as a constituent college of the University of New Zealand, founded on 23 May 1883 as Auckland University College", the probabilities might be "Wikipedia" = 0.9 and "Twitter" = 0.1.

We need to build a machine learning classifier to do this for us!

# 1) Collect data

First, we need some data from each source. This will allow us to find patterns, i.e., learn what makes *wikipedia* sentences different to *twitter* sentences.

This is normally the *most important part of ML*, but we've taken care of it! In the GitHub repo there is a data folder containing some datasets from different sources.

First we will just load this in.

In [1]:
# Load in some libraries
# The pathlib library makes handling filepaths easier, letting us open data files.
import pathlib

# Set the folder containing our data
data_path = pathlib.Path('data')

# Set up available filenames
chess_filename = "chess.txt"
music_filename = "music.txt"
angry_filename = "angry_topical_chat.txt"
happy_filename = "happy_topical_chat.txt"
disgusted_filename = "disgusted_topical_chat.txt"
trumpspeech_filename = "trumpSpeech.txt"
wallstreetbets_filename = "wallstreetbets_comments.txt"
javascript_filename = "javascript.txt"
shakespeare_filename = "shakespeare.txt"

# Just a helper function, don't worry about this one!
def get_file_or_cache(path):
    cache = None
    def get():
        nonlocal cache
        if cache is None:
            with path.open('r', encoding='utf8') as f:
                cache = f.readlines()
        return cache
    return (get, path.stem)

# Construct a list of possible files and their names
data_files = [get_file_or_cache(data_path / x) for x in [chess_filename, music_filename, happy_filename, trumpspeech_filename, wallstreetbets_filename, javascript_filename, shakespeare_filename]]

# Construct dataset of sentences and labels from every source
# Note! Rather than using the actual names for the datasets, we give each one
# its own numeric ID.
# We record these so we can swap between human readable name and ID.
X = []
Y = []
source_IDs = {}
ID_to_source = {}
source_ID = 0
for source_constructor, source in data_files:
    source_IDs[source] = source_ID
    ID_to_source[source_ID] = source
    lines = source_constructor()
    for line in lines:
        X.append(line)
        Y.append(source_ID)
    source_ID += 1

observations = list(zip(X, Y))

Now we have our data loaded into X, a list of sentences, and Y, a list of the sources each sentence came from.
Let's check out some sample lines from each source!

In [2]:
inspect_index = 100000
for index, (line, source) in enumerate(observations[inspect_index:inspect_index+5]):
    print("Observation", index)
    print("X{} = ".format(index), line)
    # Source is stored as a number, representing the source (explained later!)
    print("Y{} =".format(index), source, "Source name:", ID_to_source[source])
    print('-----------')

Observation 0
X0 =  I don't know how to process this. My brain... It can't... What? Yay?

Y0 = 4 Source name: wallstreetbets_comments
-----------
Observation 1
X1 =  [removed]

Y1 = 4 Source name: wallstreetbets_comments
-----------
Observation 2
X2 =  [deleted]

Y2 = 4 Source name: wallstreetbets_comments
-----------
Observation 3
X3 =  [removed]

Y3 = 4 Source name: wallstreetbets_comments
-----------
Observation 4
X4 =  I fnally got back in after I forgot to delete a stop loss on RH. If anyone is wondering, Chase and Fidelity are not restricting trading. Trade at your own risk and always be careful.

Y4 = 4 Source name: wallstreetbets_comments
-----------


# How do we *actually* predict?

The goal is to learn a model which can tell the difference between classes. For example, consider sentences from Wikipedia (W) or Twitter (T). If we show our model a new input sentence (X) where we do NOT know the origin, our model should be able to tell where the sentence came from. 

Formally, the $i^{th}$ input is called $x_{i}$ and its true class is called $y_{i}$.
For us, $x_{i}$ is the $i^{th}$ sentence we need to label and $y_{i}$ is where this sentence actually came from.
Our model predicts some label, $\hat y_{i}$, for this sentence (so *wikipedia* or *twitter*).
We want to train it so that *our* label, $\hat y_{i}$, is close to the *true* label, $y_{i}$.

There are many, many different ways to do this prediction, and we will look at a simple one called Naive Bayes.

Let's try a manual version to get the idea:


In [3]:
import random
# Pick two random labels to choose from
g, label_opt1 = random.choice(data_files)
g, label_opt2 = random.choice(data_files)

# Select a random observation from a random source
source_gen, true_label = random.choice(data_files)

# Shuffle labels
label_options = [label_opt1, label_opt2, true_label]
random.shuffle(label_options)

lines = source_gen()
line = random.choice(lines)
print(line)
predicted_label = input("Where did this sentence come from? {} or {} or {}".format(*label_options))
print("You were", "Right" if true_label == predicted_label else "Wrong", "!")
print("Your predicted label (y^) was: {}. The true label (y) was {}".format(predicted_label, true_label))

Where my name is to go?

Where did this sentence come from? javascript or shakespeare or trumpSpeechshakespeare
You were Wrong !
Your predicted label (y^) was: shakespeare. The true label (y) was trumpSpeech


## Using a model to make a prediction
 
 How can we *train the computer* to predict a class for us? 
 Let's think of a simple example: We are given a sentence ($x$), and need to predict where it came from ($\hat y$), Wikipedia (W) or Twitter (T). We start by asking a friend where they think it came from, and they say it has a 70% chance of being from Wikipedia and 30% from Twitter.

 In mathematical terms we can rewrite the "_probability of this specific sentence $x$ being from Wikipedia, $\hat y = W$, is 0.7_" as $p(\hat y = W | x) = 0.7$. Similarly, we can write $p(\hat y = T | x) = 0.3$ for the sentence being from Twitter.

 Now it is pretty easy to make a prediction, Wikipedia has a 70% chance and Twitter only has a 30% chance so we should predict Wikipedia! 
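
As a minimal sketch of this step, picking a prediction is just taking the label with the largest probability. The dictionary `friend_probs` is a hypothetical name, and its values are the friend's guesses from the example above, not anything learned from data.

```python
# Hypothetical probabilities, matching the friend's guess in the text
friend_probs = {"Wikipedia": 0.7, "Twitter": 0.3}

# Predict the label with the highest probability
prediction = max(friend_probs, key=friend_probs.get)
print(prediction)
```

The same idea carries over to any number of sources: whichever label has the largest probability wins.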

 ### But how did our friend come up with $p(\hat y| x)$ in the first place?
 This is basically what the computer needs to do, find the probability of a sentence coming from each source.
 This is what Naive Bayes solves!

 ## Naive Bayes - Probability theory (spooky)

 Note: Naive Bayes is pretty simple, and is quite intuitive once you wrap your head around it, but if this is your first introduction to probability it can be quite confusing! If you don't understand at first, don't get discouraged! I find that drawing diagrams and thinking about it from a few directions really helps me understand it.

 The end goal of a model is to calculate $p(y = C|X=x)$. In plain English, this can be read as: calculate the _probability that the true class of the input is C given what we know about the sentence_. For a concrete example, let's use the sentence "The University of Auckland was founded on 23 May 1883". We want to predict the probability that $y = Wikipedia$ or $y = Twitter$ given that the sentence $x$ = "The University of Auckland was founded on 23 May 1883". This can be difficult to calculate!

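 Instead of calculating this directly, _Bayes' Theorem_ gives us a way to swap things around.

 $$ p(y=C|X=x) = \frac{p(X=x|y=C)\,p(y=C)}{p(X=x)}$$

 The denominator $p(X=x)$ is the same for every class $C$, so it does not change which class ends up with the highest probability.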
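
To make the swap concrete, here is a small worked sketch of Bayes' Theorem, which in full says the posterior is the likelihood times the prior, divided by the overall probability of the sentence. All the numbers below are made up for illustration (they are not computed from the datasets), and the names `prior`, `likelihood` and `posterior` are just labels chosen for this example.

```python
# Made-up numbers for illustration only (not computed from the datasets)
prior = {"Wikipedia": 0.5, "Twitter": 0.5}           # p(y=C)
likelihood = {"Wikipedia": 0.009, "Twitter": 0.001}  # p(X=x | y=C)

# Likelihood times prior for each class
unnormalised = {c: likelihood[c] * prior[c] for c in prior}

# p(X=x): total probability of the sentence across all classes
evidence = sum(unnormalised.values())

# Dividing by the evidence makes the class probabilities sum to 1
posterior = {c: unnormalised[c] / evidence for c in unnormalised}
print(posterior)
```

With these invented inputs, Wikipedia comes out at roughly 0.9 and Twitter at roughly 0.1. Dividing by `evidence` rescales both classes by the same amount, so it never changes which class is the most likely.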
 
 ## Intuition
 Instead of directly trying to work out the probability of Wikipedia or Twitter, let's look at individual words. If we see the words "follow me!" we can say Twitter has a pretty high probability. On the other hand, if we see the words "Auckland[1][2]" we could say Wikipedia has a high probability. How did we do this?
 
 We looked at the probability of each word (or few words) coming from each source! This is the $p(X=x|y=C)$ in the formula!
 
Let's see if we can use this to do some machine learning!

# Exercise 1: Calculate word probabilities

We need to *learn* 2 things from our data.
1. What is the probability of seeing each word overall?
    1. This is $p(X=x)$, or the *probability* that a word is $x$
    2. e.g. p('and') is high, it occurs a lot! But p('founded') is low, it doesn't occur too often.
2. What is the probability of seeing each word *from each source*?
    1. This is $p(X=x|y=C)$, or the *probability* that a word is $x$ if we *know* it is from e.g. Wikipedia
    2. e.g. p('founded'|Twitter) might be pretty low, it's an uncommon word on Twitter. But p('founded'|Wikipedia) is high, it occurs all the time on Wikipedia!
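
Before working on the real data, it may help to see the second quantity computed on a toy example. This is only a sketch: the three sentences and their labels are invented, and `toy_X`, `toy_Y`, `counts` and `toy_probs` are hypothetical names. It counts words per source and divides by the total number of words in that source.

```python
from collections import Counter

# Toy training data, invented for illustration
toy_X = ["follow me please", "follow for follow", "auckland was founded in 1883"]
toy_Y = ["Twitter", "Twitter", "Wikipedia"]

# Count how often each word appears in each source
counts = {source: Counter() for source in set(toy_Y)}
for sentence, source in zip(toy_X, toy_Y):
    counts[source].update(sentence.split())

# Divide by the total number of words in the source to get p(word | source)
toy_probs = {}
for source, counter in counts.items():
    total = sum(counter.values())
    toy_probs[source] = {word: n / total for word, n in counter.items()}

print(toy_probs["Twitter"]["follow"])  # 3 of the 6 Twitter words
```

Notice that 'founded' has no entry at all under Twitter in this toy version, which would amount to a probability of zero. Starting every count at 1 instead of 0 is one common way around that.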

In [7]:
def fit_NB_model(X, Y):
    # fill word probabilities so it contains every word, and the probability of each
    # word_probabilities = {
    #       'and': 0.75,
    #       'founded': 0.01,
    #       ....
    #}
    word_probabilities = {}
    # fill word_probs_by_source so it contains a dict for each source
    # each containing each word and probability only in that source
    # word_probs_by_source = {
    #       'Twitter': {
    #              'founded': 0.001,
    #               ...,
    #        },
    #       'Wikipedia': {
    #              'founded': 0.05,
    #              ...,
    #       }
    #}
    word_probs_by_source = {}
    sources = set(Y)

    for x, y in zip(X, Y):
        words = x.split()
        for word in words:
            # Check word is in dict, else add it
            if word not in word_probabilities:
                word_probabilities[word] = 0
            word_probabilities[word] += 1

            # Give the word a starting count of 1 in every source (add-one smoothing),
            # so a word never seen in a source still gets a small non-zero probability
            for s in sources:
                if s not in word_probs_by_source:
                    word_probs_by_source[s] = {}
                if word not in word_probs_by_source[s]:
                    word_probs_by_source[s][word] = 1

            # Increment counter
            word_probs_by_source[y][word] += 1

    # Convert counts to probabilities, taking into account initial 1s
    for source in sources:
        # Get total number of words in each source
        source_total = sum(word_probs_by_source[source].values())
        for word in word_probabilities:
            source_word_probability = word_probs_by_source[source][word] / source_total
            word_probs_by_source[source][word] = source_word_probability
    return word_probs_by_source

NB_model = fit_NB_model(X, Y)

{0, 1, 2, 3, 4, 5, 6}


Let's check the output by showing the most likely words for each source.

In [8]:
for gen, source in data_files:
    word_probs = NB_model[source_IDs[source]].items()
    print(source)
    print(sorted(word_probs, key = lambda x: x[1])[-20:])

chess
[('16.', 0.0016418199832438306), ('d4', 0.00164715634288601), ('15.', 0.0016596078487177617), ('14.', 0.0016827320738338721), ('13.', 0.0016969623662130167), ('12.', 0.0017111926585921615), ('11.', 0.0017147502316869478), ('10.', 0.001718307804781734), ('9.', 0.0017272017375186996), ('8.', 0.0017414320298978442), ('7.', 0.0017414320298978442), ('6.', 0.0017449896029926304), ('5.', 0.001755662322276989), ('4.', 0.0017609986819191684), ('3.', 0.0017627774684665614), ('2.', 0.0017681138281087408), ('Nf6', 0.001771671401203527), ('1.', 0.0017805653339404925), ('Nf3', 0.001844601649646644), ('O-O', 0.002561452628246063)]
music
[('z', 0.0010849852796859666), ('G', 0.001088125613346418), ('F2', 0.001110107948969578), ('E2', 0.0012011776251226692), ('f2', 0.0012922473012757605), ('A', 0.001348773307163886), ('d', 0.0015026496565260059), ('c2', 0.0017224730127576055), ('e2', 0.0018512266928361139), ('g2', 0.0019203140333660452), ('D2', 0.0022249263984298333), (':|', 0.0026582924435721296)

Now we can make predictions for single words! Let's write a function which takes a word and returns the most likely source.

In [22]:
def get_word_probabilities(model, word):
    """ Returns a list of (probability, source ID) tuples
    giving the probability of word coming from each source.
    Raises KeyError if word was never seen during fitting.
    """
    source_probs = []
    for source in model:
        source_word_probability = model[source][word]
        source_probs.append((source_word_probability, source))
    return source_probs

def predict_single_word(model, word):
    """ Return the source ID word is most likely
    to have come from.
    """
    source_probs = get_word_probabilities(model, word)
    return sorted(source_probs)[-1][1]
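
One thing to be aware of: a lookup like `model[source][word]` raises a `KeyError` for any word that never appeared in the training data. The sketch below shows one possible way to handle that. The function name `get_word_probabilities_safe` and the `fallback` value of `1e-9` are assumptions made for this example, not part of the workshop code, and `toy_model` is a tiny hand-made model.

```python
def get_word_probabilities_safe(model, word, fallback=1e-9):
    """Hypothetical variant of get_word_probabilities: returns
    (probability, source ID) tuples, using a small fallback
    probability for words never seen in training."""
    return [(probs.get(word, fallback), source) for source, probs in model.items()]

# Tiny hand-made model for illustration
toy_model = {0: {"thou": 0.01}, 1: {"thou": 0.0001}}
print(get_word_probabilities_safe(toy_model, "thou"))
print(get_word_probabilities_safe(toy_model, "zebra"))
```

Because an unseen word gets the same fallback in every source, it does not favour any source over another.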

        

In [14]:
word_pred_games_played = 0
word_pred_human_correct = 0
word_pred_ML_correct = 0

In [21]:
import random
s, label_opt1 = random.choice(data_files)
s, label_opt2 = random.choice(data_files)
source_gen, true_label = random.choice(data_files)
label_options = [label_opt1, label_opt2, true_label]
random.shuffle(label_options)
lines = source_gen()
line = random.choice(lines)
word = random.choice(line.split())
print("Word is", word)
predicted_label = input("Where did this word come from? {} or {} or {}".format(*label_options))
ML_predicted_label_ID = predict_single_word(NB_model, word)
ML_predicted_label = ID_to_source[ML_predicted_label_ID]
print("You were", "Right" if true_label == predicted_label else "Wrong", "!")
print("The computer was", "Right" if true_label == ML_predicted_label else "Wrong", "!")
print("Your predicted label (y^) was: {}. The computer predicted {}. The true label (y) was {}".format(predicted_label, ML_predicted_label, true_label))
word_pred_games_played += 1
word_pred_human_correct += (true_label == predicted_label)
word_pred_ML_correct += (true_label == ML_predicted_label)
print("Your score:", str(word_pred_human_correct / word_pred_games_played), "AI score:", str(word_pred_ML_correct / word_pred_games_played))

Word is thou
Where did this word come from? shakespeare or happy_topical_chat or wallstreetbets_commentsshakespeare
[(1.7787865473930993e-06, 0), (1.5701668302257114e-06, 1), (9.725384322874979e-07, 2), (9.036190847965185e-07, 3), (3.7607351615902056e-06, 4), (2.229788638334972e-06, 5), (8.810611500491191e-05, 6)]
You were Right !
The computer was Right !
Your predicted label (y^) was: shakespeare. The computer predicted shakespeare. The true label (y) was shakespeare
Your score: 0.7142857142857143 AI score: 0.7142857142857143


## Predicting Sentences

Now that we can classify the source of individual words, how can we upgrade to sentences?

Let's think about probability a little bit. What is the probability of heads when a coin is flipped?
$$p(H) = \frac{1}{2}$$
Now what is the probability of flipping two heads in a row?
$$ p(H) \times p(H) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$$

We can check this is correct, as there are four possible options, HH, HT, TH, TT, and all are equally likely.
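
This check can also be done by brute force. The short sketch below lists every outcome of two flips and counts the fraction that are two heads, assuming a fair coin so that all outcomes are equally likely.

```python
from itertools import product
from fractions import Fraction

# Every possible result of two flips: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

# Fraction of outcomes that are two heads
p_two_heads = Fraction(outcomes.count(("H", "H")), len(outcomes))
print(outcomes)
print(p_two_heads)  # 1/4
```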

Can we do something similar with a sentence? We have the probability of each word coming from each source, can we just multiply them all together to get the probability of the sentence coming from a source?



### Yes! (Kind of)

It turns out that yes, we can! If we make some assumptions...

If we assume that all words are independent from each other, this works! This is actually exactly how Naive Bayes works!
(Though this may not be so realistic...)

So, let's put all of our intuition together!
1. Firstly, we want to predict the probability of a label (wikipedia, twitter, etc.) given some sentence ("Follow me please!!!", "Auckland was founded...").
    1. In statistics, this can be written p(y=Y|x=X)
2. This is *hard* to calculate, so we use Bayes' rule to flip it around. We predict the probability of the sentence we have coming from each of the sources!
    1. This is now: p(x=X|y=Y)
3. The probability of the sentence coming from a source can be thought of as the product of each *word* coming from that source!
    1. p(Follow me please|Twitter) = p(Follow|Twitter) x p(me|Twitter) x p(please|Twitter)
4. Once probabilities are calculated, we can take the most likely label as our prediction!
    1. If p(Follow me please|Twitter) = 0.2 and p(Follow me please|Wikipedia) = 0.001, we can predict the sentence "Follow me please" is from Twitter!
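
One practical wrinkle with step 3: multiplying many small probabilities quickly produces numbers too small for the computer to store. A common workaround is to add up the *logarithms* of the probabilities instead, since the log of a product is the sum of the logs. The sketch below uses a made-up probability of 0.001 for each of 400 words, purely to show the effect.

```python
import math

# 400 words, each with a made-up probability of 0.001
word_probs = [0.001] * 400

# Multiplying directly: the true value (1e-1200) is too small for a float
direct_product = 1.0
for p in word_probs:
    direct_product *= p
print(direct_product)  # 0.0

# Summing logs instead keeps a usable number
log_total = sum(math.log(p) for p in word_probs)
print(log_total)
```

Because the logarithm is an increasing function, the source with the largest sum of logs is also the source with the largest product, so the prediction is the same.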

In [70]:
import math
def get_sentence_probs(model, sentence):
    """ Returns (source ID, log-probability) pairs giving
    the log-probability of sentence coming from each source.
    """
    source_probs = {}
    for word in sentence.split():
        source_probabilities = get_word_probabilities(model, word)
        for prob, s_ID in source_probabilities:
            if s_ID not in source_probs:
                # Start at log(1) = 0, the log of an empty product
                source_probs[s_ID] = 0
            # Adding logs is the same as multiplying probabilities,
            # but avoids underflow to 0 on long sentences
            source_probs[s_ID] += math.log(prob)
    # Empty sentence: every source gets the same score
    if len(source_probs) < 1:
        for source in model:
            source_probs[source] = 0
    return source_probs.items()
def predict_sentence(model, sentence):
    """ Return the source ID sentence is most likely to have come from. """
    source_probs = get_sentence_probs(model, sentence)
    # Sort by log-probability and take the source ID of the largest
    return sorted(source_probs, key= lambda x: x[1])[-1][0]

In [48]:
sent_pred_games_played = 0
sent_pred_human_correct = 0
sent_pred_ML_correct = 0

In [56]:
import random
s, label_opt1 = random.choice(data_files)
s, label_opt2 = random.choice(data_files)
source_gen, true_label = random.choice(data_files)
label_options = [label_opt1, label_opt2, true_label]
random.shuffle(label_options)
lines = source_gen()
line = random.choice(lines)
print("Line is:", line)
predicted_label = input("Where did this word come from? {} or {} or {}".format(*label_options))
ML_predicted_label_ID = predict_sentence(NB_model, line)
ML_predicted_label = ID_to_source[ML_predicted_label_ID]
print("You were", "Right" if true_label == predicted_label else "Wrong", "!")
print("The computer was", "Right" if true_label == ML_predicted_label else "Wrong", "!")
print("Your predicted label (y^) was: {}. The computer predicted {}. The true label (y) was {}".format(predicted_label, ML_predicted_label, true_label))
sent_pred_games_played += 1
sent_pred_human_correct += (true_label == predicted_label)
sent_pred_ML_correct += (true_label == ML_predicted_label)
print("Your score:", str(sent_pred_human_correct / sent_pred_games_played), "AI score:", str(sent_pred_ML_correct / sent_pred_games_played))

Line is: B2A2 G3G | AGFD F2dc | B2AG ^FGAF | G4 G2 ||

Where did this word come from? wallstreetbets_comments or happy_topical_chat or musicchess
You were Wrong !
The computer was Right !
Your predicted label (y^) was: chess. The computer predicted music. The true label (y) was music
Your score: 0.75 AI score: 0.875


## Evaluating

So, after playing the ML algorithm, it seems pretty good!
The next step is to test *how good*. Evaluation is an important part of machine learning: it tells us how good our models actually are. This is required for three things:
1. Making sure we can actually put it into critical decision making roles
2. Making sure we understand *what* its decisions are based on
3. Making sure any changes we make actually are improving it!

Let's look at the most basic form of evaluation: how well the predicted labels match the known labels. This is called *accuracy*!

In [None]:
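num_right = 0
num_wrong = 0
# Use new loop variable names so the X and Y lists are not overwritten
for sentence, label in observations:
    ML_predicted_label_ID = predict_sentence(NB_model, sentence)
    num_right += (ML_predicted_label_ID == label)
    num_wrong += (ML_predicted_label_ID != label)
# Note: these are the same sentences the model was fitted on,
# so this accuracy is likely optimistic
accuracy = num_right / (num_right + num_wrong)
print(accuracy)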
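
Measuring accuracy on the same sentences the model learned from tends to flatter it. A common alternative is to hold some sentences back and only score the model on those. The sketch below shows one way this could look. The helper names `train_test_split` and `accuracy`, the 20% test fraction and the toy data are all assumptions made for this example, not part of the workshop code.

```python
import random

def train_test_split(pairs, test_fraction=0.2, seed=0):
    """Shuffle (sentence, label) pairs and split them into train and test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, pairs):
    """Fraction of pairs where predict(sentence) equals the known label."""
    correct = sum(predict(sentence) == label for sentence, label in pairs)
    return correct / len(pairs)

# Toy data and a toy predictor, just to show the calls
toy_pairs = [("thou art", 1), ("buy the dip", 0)] * 10
train, test = train_test_split(toy_pairs)
print(len(train), len(test))
print(accuracy(lambda s: 1 if "thou" in s else 0, test))
```

With the notebook's own functions, the idea would be to fit the model on the training pairs only and then score `predict_sentence` on the held-out pairs. Be aware that held-out sentences can contain words the model has never seen, which `get_word_probabilities` as written does not handle.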