# Machine Learning for text classification

Let's set the stage! After a year full of COVID disruptions, the university has made all exams online. Celebrations all around! But wait, what about the cheating, you say? Fortunately (or unfortunately), the uni has predicted this and hired a crack team of Designated Exam Validators (or DEVs for short) to check for plagiarism. And you've just been hired! And of course, you're going to use ML to do all your work for you!

# Problem Statement

Given a sentence, we need to find the probability that it came from each possible source.

For example, given the text "The University of Auckland began as a constituent college of the University of New Zealand, founded on 23 May 1883 as Auckland University College", the probabilities might be "Wikipedia" = 0.9 and "Twitter" = 0.1.

We need to build a machine learning classifier to do this for us!

# 1) Collect data

First, we need some data from each source. This will allow us to find patterns, i.e., learn what makes *Wikipedia* sentences different from *Twitter* sentences.

This is normally the *most important part of ML*, but we've taken care of it! In the GitHub repo there is a data folder containing some datasets from different sources.

First we will just load this in.

In [1]:
# Load in some libraries
# The pathlib library makes handling filepaths easier, letting us open data files.
import pathlib

# Set the folder containing our data
data_path = pathlib.Path('data')

# Set up available filenames
chess_filename = "chess.txt"
music_filename = "music.txt"
angry_filename = "angry_topical_chat.txt"
happy_filename = "happy_topical_chat.txt"
disgusted_filename = "disgusted_topical_chat.txt"
trumpspeech_filename = "trumpSpeech.txt"
wallstreetbets_filename = "wallstreetbets_comments.txt"
javascript_filename = "javascript.txt"
shakespeare_filename = "shakespeare.txt"

# Just a helper function, don't worry about this one!
def get_file_or_cache(path):
    cache = None
    def get():
        nonlocal cache
        # Compare against None so an empty file is not re-read on every call
        if cache is None:
            with path.open('r', encoding='utf8') as f:
                cache = f.readlines()
        return cache
    return (get, path.stem)

# Construct a list of possible files and their names
data_files = [get_file_or_cache(data_path / x) for x in [chess_filename, music_filename, happy_filename, trumpspeech_filename, wallstreetbets_filename, javascript_filename, shakespeare_filename]]

Now that we have the data loaded, let's check out some sample lines from each source!

In [2]:
for data_generator, data_name in data_files:
    print("Source: ", data_name)
    lines = data_generator()
    print("Lines:")
    for line in lines[:5]:
        print(line)

Source:  chess
Lines:
1. e4 e6 2. d4 d5 3. Nc3 Nf6 4. e5 Nfd7 5. f4 g6 6. Nf3 Bg7 7. Bd3 O-O 8. O-O Re8 9. Be3 a6 10. Ng5 h6 11. Nf3 Qe7 12. Qd2 Nb6 13. f5 exf5 14. Bxh6 Be6 15. Bxg7 Kxg7 16. Ng5 Rh8 17. Qf4 Rh5 18. h4 c5 19. Be2 Nc6 20. Bxh5 gxh5 21. dxc5 Qxc5+ 22. Kh1 Qe7 23. Qf3 Rh8 24. Qg3 Kf8 25. Rad1 d4 26. Ne2 Qc5 27. c3 dxc3 28. Qxc3 Qxc3 29. Nxc3 Ke7 30. g3 Nc4 31. Rf2 Ne3 32. Rd3 Ng4 33. Re2 Nb4 34. Rd6 Rc8 35. Rb6 Rc4 36. Rxb7+ Kf8 37. Ra7 Rc6 38. Rd2 Bc4 39. b3 Be6 40. Rd4 Nc2 41. Rd6 Rxc3 42. Nxe6+ fxe6 43. Rd8# {Black checkmated} 1-0

1. d4 g6 2. c4 Bg7 3. Nc3 d6 4. f3 Nc6 5. Be3 Nf6 6. Qd2 O-O 7. a3 Na5 8. Qd3 Be6 9. d5 Bf5 10. Ne4 Nxe4 11. fxe4 Bg4 12. Rb1 c5 13. h3 Bd7 14. b3 f5 15. b4 fxe4 16. Qc2 Be5 17. bxa5 Qxa5+ 18. Bd2 Bg3+ 19. Kd1 Rxf1+ {White resigns} 0-1

1. d4 d5 2. c4 e6 3. Nf3 Nf6 4. Nc3 Be7 5. Bg5 O-O 6. e3 h6 7. Bh4 c6 8. Bd3 dxc4 9. Bxc4 b5 10. Bd3 Nd5 11. Bxe7 Qxe7 12. Nxd5 exd5 13. O-O Nd7 14. Rc1 Qd6 15. Qc2 Bb7 16. Kh1 Nf6 17. Ne5 Rac8 18. Bf5 Rc7 19

# How do we *actually* predict?

The goal is to learn a model which can tell the difference between classes. For example, consider sentences from Wikipedia (W) or Twitter (T). If we show our model a new input sentence (X) where we do NOT know the origin, our model should be able to tell where the sentence came from. 

Formally, the $i^{th}$ input is called $x_{i}$ and its true class is called $y_{i}$.
For us, $x_{i}$ is the $i^{th}$ sentence we need to label and $y_{i}$ is where this sentence actually came from.
Our model predicts some label, $\hat y_{i}$, for this sentence (so *wikipedia* or *twitter*).
We want to train it so that *our* label, $\hat y_{i}$, is close to the *true* label, $y_{i}$.
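To make the notation concrete, here is a minimal sketch with made-up data. The sentences, the labels, and the "model output" are all invented for illustration; nothing here comes from the real data files. It simply checks, for each $i$, whether $\hat y_{i}$ equals $y_{i}$.

```python
# Toy, made-up data: three sentences x_i with their true labels y_i
sentences = [
    "The University of Auckland was founded on 23 May 1883",
    "follow me!",
    "lol follow back pls",
]
true_labels = ["wikipedia", "twitter", "twitter"]

# Hypothetical model output: one predicted label y_hat_i per sentence
predicted_labels = ["wikipedia", "twitter", "wikipedia"]

# Compare each predicted label with the true label
matches = [y_hat == y for y_hat, y in zip(predicted_labels, true_labels)]
print(matches)  # [True, True, False]

# Fraction of sentences where the prediction matched the true label
fraction_correct = sum(matches) / len(matches)
print(fraction_correct)
```

The higher this fraction, the closer the predicted labels are to the true ones.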

There are many, many different ways to do this prediction, and we will look at a simple one called Naive Bayes.

Let's try a manual version to get the idea:


In [11]:
import random
s, label_opt1 = random.choice(data_files)
s, label_opt2 = random.choice(data_files)
source_gen, true_label = random.choice(data_files)
label_options = [label_opt1, label_opt2, true_label]
random.shuffle(label_options)
lines = source_gen()
line = random.choice(lines)
print(line)
predicted_label = input("Where did this sentence come from? {} or {} or {}".format(*label_options))
print("You were", "Right" if true_label == predicted_label else "Wrong", "!")
print("Your predicted label (y^) was: {}. The true label (y) was {}".format(predicted_label, true_label))

['shakespeare', 'trumpSpeech', 'wallstreetbets_comments']
Thawing cold fear, that mean and gentle all,

Where did this sentence come from? shakespeare or trumpSpeech or wallstreetbets_commentsshakespeare
You were Right !
Your predicted label (y^) was: shakespeare. The true label (y) was shakespeare


## Using a model to make a prediction
 
 How can we *train the computer* to predict a class for us? 
 Let's think of a simple example: we are given a sentence ($x$) and need to predict where it came from ($\hat y$): Wikipedia (W) or Twitter (T). We start by asking a friend where they think it came from, and they say it has a 70% chance of being from Wikipedia and a 30% chance of being from Twitter.

 In mathematical terms, we can rewrite "_the probability of this specific sentence $x$ being from Wikipedia, $\hat y = W$, is 0.7_" as $p(\hat y = W | x) = 0.7$. Similarly, we can write $p(\hat y = T | x) = 0.3$ for the sentence being from Twitter.

 Now it is pretty easy to make a prediction: Wikipedia has a 70% chance and Twitter only has a 30% chance, so we should predict Wikipedia!
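In code, "pick the most likely source" is just taking the key with the largest value. A minimal sketch, using the friend's numbers from the example (the variable names are only illustrative):

```python
# Probabilities from the friend example: p(y_hat = source | x)
probabilities = {"Wikipedia": 0.7, "Twitter": 0.3}

# Predict the source with the highest probability
predicted_label = max(probabilities, key=probabilities.get)
print(predicted_label)  # Wikipedia
```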

 ### But how did our friend come up with $p(\hat y| x)$ in the first place?
 This is basically what the computer needs to do: find the probability of a sentence coming from each source.
 This is what Naive Bayes solves!

 ## Naive Bayes - Probability theory (spooky)

 Note: Naive Bayes is pretty simple, and is quite intuitive once you wrap your head around it, but if this is your first introduction to probability it can be quite confusing! If you don't understand at first, don't get discouraged! I find that drawing diagrams and thinking about it from a few directions really helps.

 The end goal of a model is to calculate $p(y = C|X=x)$. In plain English, this can be read as "calculate the _probability that the true class of the input is C given what we know about the sentence_". For a concrete example, let's use the sentence "The University of Auckland was founded on 23 May 1883". We want to predict the probability that $y = Wikipedia$ or $y = Twitter$ given that the sentence $x$ = "The University of Auckland was founded on 23 May 1883". This can be difficult to calculate!

 Instead of calculating this directly, _Bayes' Theorem_ gives us a way to swap things around.

 $$ p(y=C|X=x) = \frac{p(X=x|y=C)\,p(y=C)}{p(X=x)} $$
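A quick numeric check can make the swap less abstract. The sketch below uses made-up numbers: the likelihoods and the equal priors are assumptions for illustration, not values learned from the data. It computes $p(X=x)$ by summing $p(X=x|y=C)\,p(y=C)$ over the classes, then divides by it so the results add up to 1.

```python
# Made-up numbers for a single observation x (say, seeing the word 'founded')
likelihood = {"Wikipedia": 0.05, "Twitter": 0.001}  # p(X=x | y=C)
prior = {"Wikipedia": 0.5, "Twitter": 0.5}          # p(y=C), assumed equal

# p(X=x): sum over classes of p(X=x | y=C) * p(y=C)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Bayes' theorem: p(y=C | X=x) = p(X=x | y=C) * p(y=C) / p(X=x)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)
```

With these numbers almost all of the probability lands on Wikipedia, because the observation is fifty times more likely there than on Twitter.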
 
 ## Intuition
 Instead of directly trying to work out the probability of Wikipedia or Twitter, let's look at individual words. If we see the words "follow me!" we can say Twitter has a pretty high probability. On the other hand, if we see the words "Auckland[1][2]" we could say Wikipedia has a high probability. How did we do this?
 
 We looked at the probability of each word (or few words) coming from each source! This is the $p(X=x|y=C)$ in the formula!
 
Let's see if we can use this to do some machine learning!

# Exercise 1: Calculate word probabilities

We need to *learn* 2 things from our data.
1. What is the probability of seeing each word overall?
    1. This is $p(x=X)$, or the *probability* that a word is X
    2. e.g. p('and') is high, it occurs a lot! But p('founded') is low, it doesn't occur too often.
2. What is the probability of seeing each word *from each source*?
    1. This is $p(x=X|y=Y)$, or the *probability* that a word is X if we *know* it is from e.g. Wikipedia
    2. e.g. p('founded'|Twitter) might be pretty low, it's an uncommon word on Twitter. But p('founded'|Wikipedia) is high, it occurs all the time on Wikipedia!
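Before filling in the exercise cell, it may help to see the counting idea on a tiny made-up corpus. This is only a sketch of one possible approach: `toy_corpus` and its sentences are invented for illustration and are not the real data files. Each probability is estimated as a count divided by a total.

```python
from collections import Counter

# Made-up mini corpus, two lines per source (not the real data files)
toy_corpus = {
    "Wikipedia": ["the college was founded in 1883", "the city was founded early"],
    "Twitter": ["follow me", "follow me and the band"],
}

overall_counts = Counter()
counts_by_source = {}
for source, lines in toy_corpus.items():
    source_counts = Counter()
    for line in lines:
        source_counts.update(line.split())
    counts_by_source[source] = source_counts
    overall_counts.update(source_counts)

# p(word): fraction of all words, across every source, that are this word
total_words = sum(overall_counts.values())
word_probabilities = {w: c / total_words for w, c in overall_counts.items()}

# p(word | source): fraction of that source's words that are this word
word_probs_by_source = {
    source: {w: c / sum(counts.values()) for w, c in counts.items()}
    for source, counts in counts_by_source.items()
}

print(word_probs_by_source["Wikipedia"]["founded"])
print(word_probs_by_source["Twitter"].get("founded", 0.0))
```

Note that 'founded' never appears in the toy Twitter lines, so it is missing from that dict entirely; `.get(word, 0.0)` treats a missing word as probability zero.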

In [None]:
# fill word probabilities so it contains every word, and the probability of each
# word_probabilities = {
#       'and': 0.75,
#       'founded': 0.01,
#       ....
#}
word_probabilities = {}
# fill word_probs_by_source so it contains a dict for each source
# each containing each word and probability only in that source
# word_probs_by_source = {
#       'Twitter': {
#              'founded': 0.001,
#               ...,
#        },
#       'Wikipedia': {
#              'founded': 0.05,
#              ...,
#       }
#}
word_probs_by_source = {}

for data_generator, data_name in data_files:
    lines = data_generator()
    for line in lines:
        words = line.split()
        for word in words:
            # ..... Add code!
            pass

Now we can make predictions for single words! Let's write a function which takes a word and returns the most likely source.
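If you get stuck, here is a rough sketch of one possible shape for such a function. Everything in it is hypothetical: the toy probability tables stand in for the ones you build in Exercise 1, the equal priors are an assumption, and the `_sketch` name is there so it does not clash with your own function. Since $p(x)$ is the same for every source, it can be dropped when we only want to know which source scores highest.

```python
# Hypothetical probabilities, standing in for the tables built in Exercise 1
toy_word_probs_by_source = {
    "Twitter": {"founded": 0.001, "follow": 0.05},
    "Wikipedia": {"founded": 0.05, "follow": 0.002},
}
# Assumed prior p(source): both sources equally likely
toy_source_priors = {"Twitter": 0.5, "Wikipedia": 0.5}

def predict_single_word_sketch(word):
    """Return the source with the largest p(word | source) * p(source)."""
    scores = {
        source: probs.get(word, 0.0) * toy_source_priors[source]
        for source, probs in toy_word_probs_by_source.items()
    }
    return max(scores, key=scores.get)

print(predict_single_word_sketch("founded"))  # Wikipedia
print(predict_single_word_sketch("follow"))   # Twitter
```

One limitation of this sketch: a word that appears in no source scores zero everywhere, so the function just returns whichever source comes first in the dict.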

In [None]:
def predict_single_word(word):
    # .... Add code!
    pass

In [None]:
import random
s, label_opt1 = random.choice(data_files)
s, label_opt2 = random.choice(data_files)
source_gen, true_label = random.choice(data_files)
label_options = [label_opt1, label_opt2, true_label]
random.shuffle(label_options)
lines = source_gen()
line = random.choice(lines)
word = random.choice(line.split())
print("Word is", word)
predicted_label = input("Where did this word come from? {} or {} or {}".format(*label_options))
ML_predicted_label = predict_single_word(word)
print("You were", "Right" if true_label == predicted_label else "Wrong", "!")
print("The computer was", "Right" if true_label == ML_predicted_label else "Wrong", "!")
print("Your predicted label (y^) was: {}. The computer predicted {}. The true label (y) was {}".format(predicted_label, ML_predicted_label, true_label))