# Markov Model - Text Classifier

Starting from two sets of poems by two different authors, Edgar Allan Poe and Robert Frost, build a text classifier that can distinguish between the two authors.
 - Compute train and test accuracy
 - Check for class imbalance, compute F1-score if imbalanced
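
The two checks above can be sketched in plain Python. This is only an illustration under stated assumptions: the labels below are made up (they are not taken from the poems), and `f1_score_binary` is a hypothetical helper name, with label 1 treated as the positive class.

```python
from collections import Counter

def f1_score_binary(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels, invented for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

class_counts = Counter(y_true)  # how many samples per class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f1_score_binary(y_true, y_pred)
print(class_counts, accuracy, f1)
```

When the class counts differ a lot, accuracy can look good even if the minority class is rarely predicted correctly, which is why the F1-score is worth reporting alongside it.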

### Outline of the code:
 - Loop through each file, save each line to a list (one line == one sample)
 - Save the labels too
 - train-test split
 - Create a mapping from unique word to unique int index
    - loop through data and tokenize each line (string split is enough)
    - Assign each unique word a unique index
    - create a special index for unknown words (words that appear in the test set but not in the training set)
 - Convert each line of text into integer lists
 - Train a Markov model for each class (Edgar Allan Poe / Robert Frost)
 - Use smoothing (add-one smoothing)
 - Do we need A and pi or just log(A) and log(pi)?
 - We also need to compute the priors p(class = k) to know whether we need to take them into account in Bayes' rule.
 - Write a function to compute the posterior for each class, given an input
 - Take the argmax over the posteriors to get the predicted class
 - Make predictions for both train and test sets
 - Compute accuracy for train and test sets
 - Check for class imbalance
 - Check confusion matrix and F1-score
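
The posterior and argmax steps of the outline can be sketched as follows. This is a minimal illustration, not the notebook's final code: the 3-word vocabulary, the probabilities, the toy labels and the function names `log_likelihood`, `predict` and `to_log` are all assumptions made up for the example. It works in log space only, which suggests that log(A) and log(pi) are enough for prediction, since the argmax of the log posterior is the same as the argmax of the posterior.

```python
import math

def log_likelihood(seq, log_pi, log_A):
    """log p(seq | class) = log pi[s_0] + sum over t of log A[s_(t-1)][s_t]."""
    if not seq:
        return 0.0
    total = log_pi[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        total += log_A[prev][curr]
    return total

def predict(seq, models, log_priors):
    """Return the class k that maximizes log p(seq | k) + log p(k)."""
    scores = [
        log_likelihood(seq, log_pi, log_A) + log_prior
        for (log_pi, log_A), log_prior in zip(models, log_priors)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

def to_log(rows):
    """Element-wise log of a matrix stored as a list of lists."""
    return [[math.log(p) for p in row] for row in rows]

# Toy models over a 3-word vocabulary (all numbers are invented)
pi0 = [0.8, 0.1, 0.1]
A0 = [[0.8, 0.1, 0.1] for _ in range(3)]
pi1 = [0.1, 0.1, 0.8]
A1 = [[0.1, 0.1, 0.8] for _ in range(3)]
models = [
    ([math.log(p) for p in pi0], to_log(A0)),
    ([math.log(p) for p in pi1], to_log(A1)),
]

# Priors estimated as class frequencies of (toy) training labels
toy_labels = [0, 0, 1, 1]
log_priors = [math.log(toy_labels.count(k) / len(toy_labels)) for k in (0, 1)]

pred_a = predict([0, 0, 1], models, log_priors)
pred_b = predict([2, 2, 1], models, log_priors)
print(pred_a, pred_b)
```

With equal priors the `log_prior` term does not change the argmax; it only matters when the classes are imbalanced.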

In [1]:
#get datasets
#!wget -nc https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/hmm_class/edgar_allan_poe.txt
#!wget -nc https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/hmm_class/robert_frost.txt


### Dataset creation

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import string
import pandas as pd
from sklearn.model_selection import train_test_split

In [5]:
input_files = [
  '../datasets/poems/edgar_allan_poe.txt',
  '../datasets/poems/robert_frost.txt',
]
# label: Edgar Allan Poe => 0, Robert Frost => 1

In [6]:
inputs = []
labels = []
for i, filepath in enumerate(input_files):
    with open(filepath) as file:
        for line in file:
            inputs.append(line)
            labels.append(i)

In [14]:
inputs_train, inputs_test, Ytrain, Ytest = train_test_split(inputs, labels, random_state=123)

### Create a mapping from each word to a unique index

In [99]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
vectorizer = CountVectorizer(tokenizer=(lambda x: x.lower().split(' ')))

In [100]:
Xtrain = vectorizer.fit_transform(inputs_train)
Xtest = vectorizer.transform(inputs_test)



In [101]:
#Get the vocabulary dict from vectorizer
word2idx = vectorizer.vocabulary_
idx2word = {v: k for k, v in word2idx.items()}

In [102]:
# create a special index for "unknown" word
idx2word[len(word2idx)] = "unk"
word2idx["unk"] = len(word2idx)

In [103]:
vocab_size = len(word2idx)
print("vocabulary size: ", vocab_size)

vocabulary size:  3786
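
The "unk" index becomes useful when encoding lines that may contain words never seen in training: a direct lookup `word2idx[x]` would raise a `KeyError` for such words. A minimal sketch of a lookup with fallback is shown here; `encode_line` and the toy dictionary are hypothetical, and the sketch assumes that "unk" does not also occur as a real word in the corpus.

```python
def encode_line(line, word2idx, unk_token="unk"):
    """Map a line of text to a list of ints, using the unknown index as fallback."""
    unk_idx = word2idx[unk_token]
    # same tokenization as the vectorizer above: lowercase, split on single spaces
    tokens = line.lower().split(' ')
    return [word2idx.get(token, unk_idx) for token in tokens]

# Toy mapping, invented for illustration (the real one comes from the vectorizer)
toy_word2idx = {"the": 0, "raven": 1, "road": 2, "unk": 3}
encoded = encode_line("The Raven flew", toy_word2idx)
print(encoded)
```

Here "flew" is not in the toy vocabulary, so it is mapped to the index of "unk".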


### Train a Markov model A0 and pi0 for Edgar Allan Poe, and another model A1 and pi1 for Robert Frost

Let's compute pi0 first.  
pi0 is a vector of size vocab_size; component i is the total count of word i divided by the total count of all words, both taken over Poe's training lines.  
So, writing Xtrain_0 for the rows Xtrain[k,:] such that Ytrain[k] == 0, pi0[i] = np.sum(Xtrain_0[:,i]) / np.sum(Xtrain_0)

In [104]:
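Ytrain_arr = np.array(Ytrain)
# Xtrain was built before "unk" was added, so it has vocab_size - 1 columns
n_cols = Xtrain.shape[1]
# per-word counts summed over the training lines of each class
counts_0 = np.asarray(Xtrain[np.flatnonzero(Ytrain_arr == 0)].sum(axis=0)).ravel()
counts_1 = np.asarray(Xtrain[np.flatnonzero(Ytrain_arr == 1)].sum(axis=0)).ravel()
pi0 = np.zeros(vocab_size)
pi1 = np.zeros(vocab_size)
# the last entry ("unk") has no training count and stays at 0
pi0[:n_cols] = counts_0 / counts_0.sum()
pi1[:n_cols] = counts_1 / counts_1.sum()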

In [105]:
# The word with the highest initial probability for Edgar Allan Poe is:
idx2word[np.argmax(pi0)]

'the'

Let's compute A0 and A1 now.  
With add-one smoothing, Aij = (count(word i to word j) + 1) / (count(word i) + vocab_size).  
To compute count(word i to word j), we first need to transform each list of words into a list of ints using word2idx.
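
The smoothing formula can be checked on a tiny example. The sketch below is an illustration only: the sequences, the vocabulary size and the function name `smoothed_transition_matrix` are made up, and plain lists are used to keep it dependency-free. It assumes that count(word i) means the number of transitions leaving word i, which is the reading under which every row of A sums to 1.

```python
def smoothed_transition_matrix(sequences, vocab_size):
    """Add-one smoothed transition matrix from lists of integer-encoded words."""
    # start every count at 1: this is the "+ 1" of add-one smoothing
    counts = [[1.0] * vocab_size for _ in range(vocab_size)]
    for seq in sequences:
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1
    A = []
    for row in counts:
        # row total = (transitions leaving word i) + vocab_size
        row_total = sum(row)
        A.append([c / row_total for c in row])
    return A

# Toy corpus: transitions 0->1 (twice), 1->2 (once), 1->1 (once)
toy_sequences = [[0, 1, 2], [0, 1, 1]]
A = smoothed_transition_matrix(toy_sequences, vocab_size=3)
for row in A:
    print(row)
```

Word 2 is never followed by anything in the toy corpus, so its row is uniform: smoothing gives unseen transitions a small non-zero probability instead of 0.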


In [106]:
inputs_train[0].split(' ')

['Of', 'all', 'to', 'whom', 'thine', 'absence', 'is', 'the', 'night-\n']

In [107]:
list(map(lambda x: word2idx[x],inputs_train[0].lower().split(' ')))

[2185, 187, 3270, 3630, 3179, 137, 1626, 3132, 2128]

In [108]:
inputs_train_index = []
A0 = np.ones([vocab_size, vocab_size]) # we start at 1 for Add-One smoothing
A1 = np.ones([vocab_size, vocab_size])
for i in range(len(inputs_train)):
  # tokenize then transform into list of int
  inputs_train_index.append(list(map(lambda x: word2idx[x],inputs_train[i].lower().split(' '))))
  for j in range(len(inputs_train_index[i])-1):
    if Ytrain[i] == 0:
      # compute A0
      A0[inputs_train_index[i][j],inputs_train_index[i][j+1]]+=1
    if Ytrain[i] == 1:
      # compute A1
      A1[inputs_train_index[i][j],inputs_train_index[i][j+1]]+=1



In [114]:
A0

array([[1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.]])

In [112]:
# we still need to divide by count(word i) + vocab_size
indices_label_0 = [i for i in range(len(Ytrain)) if Ytrain[i] == 0]
indices_label_1 = [i for i in range(len(Ytrain)) if Ytrain[i] == 1]
    

In [115]:
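word_counts_0 = np.ones(vocab_size)*vocab_size
word_counts_1 = np.ones(vocab_size)*vocab_size

# Xtrain has vocab_size - 1 columns ("unk" was added to word2idx after fitting
# the vectorizer), so looping j up to vocab_size - 1 indexes a missing column.
# Only the first Xtrain.shape[1] entries receive counts.
n_cols = Xtrain.shape[1]
word_counts_0[:n_cols] += np.asarray(Xtrain[indices_label_0].sum(axis=0)).ravel()
word_counts_1[:n_cols] += np.asarray(Xtrain[indices_label_1].sum(axis=0)).ravel()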