In [1]:
import numpy as np
import string
from sklearn.model_selection import train_test_split

We import everything we need and create a list with the files that we're going to use.

In [2]:
input_files = [
    'otro/edgar_allan_poe.txt',
    'otro/robert_frost.txt'
]

Although we would otherwise be facing an unsupervised learning task, we turn it into a supervised one by storing the lines of the dataset and their labels in two lists. One important point: we are not using two different datasets. Both authors go into a single dataset, and the split into training and test sets is made from there. The following step does several things:

-For each file we store its label in our lists: every line from Poe gets a 0 and every line from Frost gets a 1, so the labels list will look like [0, 0, 0, ..., 1, 1, 1], with one entry per line in the files.

-We normalize the text in order to reduce dimensionality: we convert every word to lowercase, strip the newline at the end of each line, remove punctuation and skip empty lines.
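As a quick illustration of the normalization step, here is a minimal sketch applied to a single line. The sample string is made up for this example and does not come from the dataset files.

```python
import string

# Hypothetical sample line, not taken from the dataset files
raw_line = "Hello, World! It's a test.\n"

# Same steps as the notebook: strip trailing whitespace, lowercase, drop punctuation
remove_punctuation = str.maketrans('', '', string.punctuation)
clean_line = raw_line.rstrip().lower().translate(remove_punctuation)

print(clean_line)  # hello world its a test
```

Note that removing punctuation also merges contractions ("it's" becomes "its"), which is why words such as 'isnt' and 'hadnt' show up in the vocabulary.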

In [3]:
input_texts = []
labels = []

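for label, document in enumerate(input_files):
    print(f"{document} corresponds to label {label}")
    # the with block makes sure the file is closed after reading
    with open(document) as f:
        for line in f:
            # strip trailing whitespace (including the newline) and lowercase
            line = line.rstrip().lower()
            if line:
                # remove every punctuation character
                line = line.translate(str.maketrans('','',string.punctuation))
                input_texts.append(line)
                labels.append(label)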

otro/edgar_allan_poe.txt corresponds to label 0


FileNotFoundError: [Errno 2] No such file or directory: 'otro/edgar_allan_poe.txt'

Once this is done we use sklearn to split the data into train and test sets. The text variables will contain the lines and the Y variables will contain the labels. Remember that, although they are in different lists, we keep track of which line corresponds to which label because they share the same index in both lists.

In [None]:
train_text, test_text, Ytrain, Ytest = train_test_split(input_texts,labels)
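To check that the split keeps each line paired with its label, here is a small sketch on toy data. The toy lines, the variable names and the random_state value are assumptions made for this example; the notebook itself calls train_test_split without a seed.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: each line carries its own label as its last word
toy_lines = [f"line {i} label {i % 2}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

# random_state is added here only to make the example reproducible
tr_x, te_x, tr_y, te_y = train_test_split(toy_lines, toy_labels, random_state=0)

# Every line still sits at the same index as its label after shuffling
aligned = all(int(x.split()[-1]) == y for x, y in zip(tr_x, tr_y))
print(aligned, len(tr_x), len(te_x))
```

By default train_test_split shuffles the data and holds out 25% of it for testing.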

We create an index variable and a dictionary to hold the word-to-index mapping. Each word is represented by a number in this mapping. To handle unknown words that can appear in the test set, we reserve index 0 for a special '<unknown>' token, and after that we fill the dictionary with the words from the training set.

In [None]:
indice = 1
word_to_index_mapping = {'<unknown>':0}

In [None]:
for text in train_text:
    text = text.split()
    for word in text:
        if word not in word_to_index_mapping:
            word_to_index_mapping[word]=indice
            indice+=1

In [None]:
word_to_index_mapping

{'<unknown>': 0,
 'whose': 1,
 'entablatures': 2,
 'intertwine': 3,
 'lying': 4,
 'down': 5,
 'to': 6,
 'die': 7,
 'have': 8,
 'suddenly': 9,
 'arisen': 10,
 'and': 11,
 'both': 12,
 'that': 13,
 'morning': 14,
 'equally': 15,
 'lay': 16,
 'politician': 17,
 'at': 18,
 'odd': 19,
 'seasons': 20,
 'i': 21,
 'saw': 22,
 'but': 23,
 'them': 24,
 'only': 25,
 'for': 26,
 'hours': 27,
 'perhaps': 28,
 'hear': 29,
 'some': 30,
 'word': 31,
 'about': 32,
 'the': 33,
 'weather': 34,
 'a': 35,
 'winter': 36,
 'garden': 37,
 'in': 38,
 'an': 39,
 'alder': 40,
 'swamp': 41,
 'such': 42,
 'as': 43,
 'it': 44,
 'is': 45,
 'isnt': 46,
 'worth': 47,
 'mortgage': 48,
 'who': 49,
 'has': 50,
 'heart': 51,
 'your': 52,
 'getting': 53,
 'lost': 54,
 'hadnt': 55,
 'you': 56,
 'long': 57,
 'suspected': 58,
 'where': 59,
 'were': 60,
 'love': 61,
 'of': 62,
 'years': 63,
 'till': 64,
 'they': 65,
 'sorrowfully': 66,
 'trailed': 67,
 'dust': 68,
 'made': 69,
 'him': 70,
 'throw': 71,
 'his': 72,
 'bare': 73,

Now that we have our word-to-index mapping we are going to create two lists of lists. Each inner list is the index representation of one line. For the test set, any word that is not in the mapping is replaced by index 0.

In [None]:
train_text_as_int = []
test_text_as_int = []


for sentence in train_text:
    tokens = sentence.split()
    line_as_int = [word_to_index_mapping[token] for token in tokens]
    train_text_as_int.append(line_as_int)

for sentence in test_text:
    tokens = sentence.split()
    line_as_int = [word_to_index_mapping.get(token, 0) for token in tokens]
    test_text_as_int.append(line_as_int)
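The fallback to index 0 can be seen on a tiny example. The mapping and the sentence below are invented for illustration only.

```python
# Hypothetical mapping, much smaller than the real vocabulary
toy_mapping = {'<unknown>': 0, 'the': 1, 'raven': 2, 'said': 3}

# 'flew' was never seen when the mapping was built, so it falls back to 0
unseen_line = "the raven flew"
encoded = [toy_mapping.get(token, 0) for token in unseen_line.split()]

print(encoded)  # [1, 2, 0]
```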

In [None]:
len(word_to_index_mapping)

2509

Now that our dataset is as ready as it can be, it's time to create the model. Remember that a Markov model needs two things: a state transition matrix and an initial state distribution vector. For performance we're going to use numpy arrays. AX will be the transition matrix and piX the initial state vector for class X. NOTE that we create a matrix and a vector of ones, not zeros: a transition that never occurs in the training data (which will probably happen) would otherwise get probability 0, and multiplying by it would make the probability of the whole line 0, which we don't want. This technique is called add-one smoothing and, when it is used, we have to account for it when going from counts to probabilities. For now we leave the counts as they are.

In [None]:
V = len(word_to_index_mapping)

A0 = np.ones((V,V))
pi0 = np.ones(V)

A1 = np.ones((V,V))
pi1 = np.ones(V)
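The effect of starting from ones can be sketched on a single row of counts. The counts below are hypothetical and only meant to show the difference between smoothed and unsmoothed probabilities.

```python
import numpy as np

# Hypothetical transition counts out of one word; the third transition never occurred
counts = np.array([3.0, 1.0, 0.0])

# Without smoothing the unseen transition gets probability exactly 0
unsmoothed = counts / counts.sum()

# Add-one smoothing: every count starts at 1, so no probability is 0
smoothed = (counts + 1) / (counts + 1).sum()

print(unsmoothed)
print(smoothed)
```

The smoothed row is [4/7, 2/7, 1/7]: it still sums to 1, but the unseen transition keeps a small non-zero probability, so its log is finite.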

To populate the matrix and the vector we iterate through the encoded lines. For every line, if a word is the first element of the line we add 1 to its entry in the vector; otherwise we add 1 to the matrix entry whose row is the previous word and whose column is the current word.

In [None]:
def populate_matrix_and_initial_vector(text_as_int,matrix,vector):
    for line in text_as_int:
        last_index = None
        for state in line:
            if last_index is None:
                vector[state] +=1
            else:
                matrix[last_index, state] +=1
            last_index = state
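To see what the counting does, here is a sketch that runs the same logic on a three-word vocabulary. The function is repeated under a shorter, hypothetical name so the example is self-contained, and the two encoded lines are made up.

```python
import numpy as np

def populate(text_as_int, matrix, vector):
    # Same counting logic as populate_matrix_and_initial_vector
    for line in text_as_int:
        last_index = None
        for state in line:
            if last_index is None:
                vector[state] += 1  # first word of the line
            else:
                matrix[last_index, state] += 1  # transition previous -> current
            last_index = state

# Start from ones (add-one smoothing), vocabulary of size 3
toy_A = np.ones((3, 3))
toy_pi = np.ones(3)

# Two hypothetical encoded lines
populate([[1, 2, 1], [2, 1]], toy_A, toy_pi)

print(toy_pi)
print(toy_A)
```

Here toy_pi ends up as [1, 2, 2] (one line starts with word 1, one with word 2), and the entry toy_A[2, 1] ends up as 3 because the transition 2 -> 1 occurs twice on top of the initial 1.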



Now we populate our vectors and matrices for both labels. To see what zip does, have a look at https://www.programiz.com/python-programming/methods/built-in/zip

In [None]:
populate_matrix_and_initial_vector([line for line, label in zip(train_text_as_int, Ytrain) if label == 0], A0, pi0)
populate_matrix_and_initial_vector([line for line, label in zip(train_text_as_int, Ytrain) if label == 1], A1, pi1)

As we said earlier, we need to account for the add-one smoothing. For each row of the matrix, and for the vector, we sum all the entries and divide each entry by that sum. The initial ones are included in the sums, so every row still adds up to 1 and we obtain valid probabilities. After that we take the log: multiplying many small probabilities gives values so close to 0 that the computer would round them to 0, while adding their logs stays numerically stable.

In [None]:
A0 /= A0.sum(axis=1, keepdims=True)
pi0 /= pi0.sum()

A1 /= A1.sum(axis=1, keepdims=True)
pi1 /= pi1.sum()

logA0 = np.log(A0)
logpi0 = np.log(pi0)

logA1 = np.log(A1)
logpi1 = np.log(pi1)
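The reason for working with logs can be demonstrated with a deliberately exaggerated example. The sequence length and the probability value are assumptions chosen to force underflow; real lines of poetry are much shorter.

```python
import numpy as np

# Hypothetical sequence: 400 steps, each with probability 0.001
probs = np.full(400, 1e-3)

# The true product is 1e-1200, far below what float64 can represent
product = np.prod(probs)

# The sum of logs is an ordinary finite number
log_sum = np.sum(np.log(probs))

print(product)  # 0.0
print(log_sum)
```

The product underflows to exactly 0.0, so two different models could not be compared, while the log sum (about -2763.1) keeps the information.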

Because the data is imbalanced, a line is more likely a priori to be from Frost than from Poe: the dataset contains many more lines from Frost than from Poe. To quantify this we compute the priors, which then have to be taken into account when computing the probabilities.


In [None]:
count0 = sum(y == 0 for y in Ytrain)
# Number of total labels in Ytrain from poe
count1 = sum(y == 1 for y in Ytrain)
# Number of total labels in Ytrain from frost


total = len(Ytrain)
# Total number of labels

probability0 = count0 / total
probability1 = count1 / total
# probabilities for each author

logprobability0 = np.log(probability0)
logprobability1 = np.log(probability1)
# Log of the probabilities
probability0,probability1

(0.3461300309597523, 0.6538699690402476)

As we can see, the probability of a line being from Poe's writings is only about 0.35, while the probability of it being from Frost is about 0.65. Keep that in mind for later.

In [None]:
logprobability0

-1.0609407625016578
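As an aside, the same priors could be computed in a vectorized way. This is only a sketch of an alternative using np.bincount on made-up labels, not what the notebook does.

```python
import numpy as np

# Hypothetical labels: two lines from class 0 and four from class 1
toy_labels = np.array([0, 1, 1, 0, 1, 1])

# np.bincount counts how many times each non-negative integer label appears
priors = np.bincount(toy_labels) / len(toy_labels)
log_priors = np.log(priors)

print(priors)
print(log_priors)
```

This gives one prior per class in a single array, which generalizes to more than two authors without extra code.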

We're going to create a class whose constructor takes the log transition matrices, the log initial vectors and the log priors. It also stores the number of priors (that is, the number of classes) as an attribute.
This class will have 2 methods:

-The first method is named _compute_log_likelihood. It takes an input sequence and a class, looks up the matrix and the vector of that class and, using the rules of a Markov chain, computes the probability of the sequence. Remember that log(AB) = log(A) + log(B), so in log space we add instead of multiplying, which avoids numerical underflow.

-The second method is predict. Given a list of inputs, it returns a vector with one prediction per line. For each line and each model (each initial vector together with its transition matrix) we compute the log likelihood and add the log of the prior, obtaining a list with two elements, one score per model. Taking the larger one decides whether the line is from Poe or from Frost (0 or 1), and we store that in predictions. predictions is a vector with as many elements as there are lines in the input, so in the end we have a vector with the predictions for the whole list of inputs given to the model.
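The decision rule described above can be sketched on a toy problem with a two-word vocabulary. All probabilities, names and the input sequence below are invented for illustration; the helper assumes a non-empty sequence.

```python
import numpy as np

def sequence_log_likelihood(seq, log_pi, log_A):
    # log p(first word) plus the sum of log p(word | previous word)
    total = log_pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        total += log_A[prev, cur]
    return total

# Hypothetical models for two classes
log_pis = [np.log([0.9, 0.1]), np.log([0.2, 0.8])]
log_As = [np.log([[0.7, 0.3], [0.4, 0.6]]),
          np.log([[0.5, 0.5], [0.1, 0.9]])]
log_priors = np.log([0.35, 0.65])

seq = [0, 0, 1]
scores = [sequence_log_likelihood(seq, log_pis[c], log_As[c]) + log_priors[c]
          for c in range(2)]
predicted = int(np.argmax(scores))

print(scores, predicted)
```

For class 0 the unnormalized posterior is 0.9 * 0.7 * 0.3 * 0.35 = 0.06615, and for class 1 it is 0.2 * 0.5 * 0.5 * 0.65 = 0.0325, so the sketch predicts class 0 even though class 1 has the larger prior.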

In [None]:
class Classifier:
    def __init__(self,logAs,logpis,logpriors):
        self.logAs = logAs
        self.logpis = logpis
        self.logpriors = logpriors
        self.Modelos = len(logpriors)

    def _compute_log_likelihood(self, input_,class_):
        logA = self.logAs[class_]
        logpi = self.logpis[class_]

        last_state = None
        logprob = 0
        for state in input_:
            if last_state is None:
                logprob += logpi[state]
            else:
                logprob += logA[last_state, state]
            last_state = state
        return logprob

    def predict(self, inputs):
        predictions = np.zeros(len(inputs))
        for index, input_ in enumerate(inputs):
            posteriors = [self._compute_log_likelihood(input_, c) + self.logpriors[c] for c in range(self.Modelos)]
            pred = np.argmax(posteriors)
            predictions[index] = pred
            # np.argmax returns the index of the largest value in the array or vector
        return predictions

Once the class is defined, we instantiate it with our previous data.

In [None]:
clf = Classifier([logA0, logA1],[logpi0,logpi1],[logprobability0,logprobability1])

And now we call predict on both the training set and the test set. Remember that this gives a vector with the prediction for each line in the set, so taking the mean of the comparison with Ytrain (or Ytest), which holds the true labels, gives the accuracy obtained by our model.

In [None]:
Ptrain = clf.predict(train_text_as_int)
print(f"Train acc: {np.mean(Ptrain==Ytrain)}")

Train acc: 0.9956656346749226


In [None]:
Ptest = clf.predict(test_text_as_int)
print(f"Test acc: {np.mean(Ptest==Ytest)}")

Test acc: 0.8367346938775511
