## Reading in and cleaning the data
Reading in the data for both the test set and the training set I've decided to remove stopwords before we get any counts for any words at all. Removing these words will help our predicting performance as these stop words aren't actually really related to the class, for example the which is the most common word in the english language will not have any correlation to what the actual text is about and its class. Hence removing these words will improve our model performance and not try to make predictions on attributes which are uncorrelated/unimportant. 

### The data
the data we read in had a huge class imbalance where we only had ~130 class instances for A and V while B & E had ~1600 and ~1300 instances respectively. This meant that the priors for both A and V would be massively overshadowed by the priors to B & E which is one of the major problems of Naive Bayes if our training set is not very representative of real world data. Ie if we had in the real world, more classifications of A & V then our naive bayes would work terribly. To tackle this huge class imbalance there is such a thing as Complement Naive Bayes which is one of the implementations I decided to use to increase the prediction performance of the model.

### Accuracy
For calculating the accuracy I just created a simple function called accuracy_metric. This function will calculate the accuracy and the performance of the naive bayes models predictions by comparing the actual values from the test set to the predicted classes from our Naive Bayes.

### Cross validation
To validate the performance of the model I chose to use cross validation with the accruacy metric to run the model on different training and test sets. I chose to run the algorithms on a 10-fold cross validation 10 times over. I wanted run the 10-fold cross validation more times like 30 to really get an accurate representation of the different extended Naive Bayes implementations. Also I found that using stratified cross validation would've been a much better way to get an accurate representation of our model performance due to the huge class imbalance. But I ran out of time for implementation so I cannot be 100% accurate with the model performances.

### Training the model and getting prior probabilities
To "train" our naive bayes model I decided to go through each document and count each occurence of a word in that document using the Python Counter built in module. The counts would be attributed to a dict of words with their counts to each different class in the whole dataset. Hence, we get the occurence of words given a class. After getting these counts we get the priori through simply doing a division of the count of a class over the total amount of documents we read over. Hence giving us prior probabilites of 0.032,0.4005,0.0315,0.536 to a,b,v,e respectively which again shows us the imbalance talked about before. So using these priors we will have a prediction that it is a class given a document, It will in this case most likely be B and E.

### Standard Naive Bayes
For implementing the standard Naive Bayes I chose to first have a training function where we get all the probabilites of a word occuring in that class. For example all the probabilities for class a wouled be represneted as a_probabilites={'The':1,.... }.
I got these probabilites by going through the training data, calculating the frequency of a word in a class lets say Nc, getting the sum of all unique words we've trained on so far |V| , getting a count of all words in a specific class,Nci, and then using these variables to get the probabilites.
I iterated through each word in a document and assigning the probability of a class given that word for each class by doing the following.

    Iterate through each document
        Count each word in document
                calculate Nc + 1 / Nci + |V| for each class and each word
            apply the priori to above claculation to specific class
        return all probabilites for a class of each word
    
The calculation I used was with laplace smoothing to ensure that if a word that hasnt been seen before we would not get a 0 division error.
Then on the testing it would be pretty similar

    Itrate through each testing document
        count each word in document
                calculate Nc + 1 / Nci + |V| for each class and each word if not seen already
                times that calculation by the frequency of word in this document.
            apply the priori
            return all probabilites for a class of each word
        return the probability of a class of that specific document
    return all predictions of a class given the document

The performance I recieved for standarnd multinomial Naive Bayes was pretty bad, I got an accuracy of around 50% so it wasn't very good at predicting much at all, it would just choose the majority class which was almost always E and this could be due to the prior of E which was quite high.

### TF * IDF Naive Bayes
For my extension of Naive Bayes I decided to add TF * IDF to discount words which would appear very frequently in the documents. If a word would appear in many documents in different classes then it won't really help us in our classification as that word most likely isn't related to the class and is hence pretty useless. With TF * IDF we apply a weighting to our words to prefer the rarer words and make their probability for a class higher than common words which will apear in every text, the texts with high frequency will recieve a weighting which reduces the probability of that word given a class which has significantly increased the performance of the algorithm. This gave an accuracy of about 91.7 when run through cross validation, I thought that I could get a higher accuracy than this so I decided to combine this with Complement Naive bayes.

### Complement TF * IDF Naive Bayes
In complement Naive bayes we instead of calculating the likelihood that a document is inside a class by calculating the occurance of that word in a class / all words in that class (and laplace smoothing) we instead calculate the likelihood by calculating the occurance of word not in class c / all words not in class c. This will reduce the load of the huge class imbalance I've mentioned earlier by negating the priors for each class as well, when ran with TF&IDF we got a increase in performance of around 1% but due to not being able to find a suitable way to calculate the statistical significance of the means of these classes I cannot be 100% to say this was not attributed to chance rather than an actual increase in accuracy. But testing as well on the kaggle performance, this gave a much higher accuracy than just standard TF&IDF Naive Bayes which was around 92% acuracy while the accuracy for Complement TF * IDF Naive Bayes was 93.36% 

#### N grams
I as well implemented N grams to hopefully increase the performance of the model but due to the size of the dataset this was quite unlikely as we would need a much larger dataset to find common occurence of lets say "a virus" rather than just "a" and "virus". N grams would ensure that we wouldn't lose the meaning of a sentence so we would include lets say 2 words instead of them single-handedly but we would need a much larger dataset to represent all different N-grams. So when ran with cross validation the performance of the model actually decreased due to this, hence I did not include it as my extension but more so added it in to talk about.

In [31]:
import collections
from collections import Counter
import numpy
import math
import csv
import operator
from random import seed
from random import randrange
import multiprocessing
from random import randrange
import statistics 

In [32]:
redundant_words = ["the", "of", "and", "a", "in", "to", "that", "is", "with", "for", "from", "are", "by", " ", "was", "we", "this", "were", "as", "an", "have" ,"which", "has", "these", "at", "be"]
with open('trg.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))
    for i in data:
        a_str = ""
        for j in i[2].split():
            if j not in redundant_words:
                a_str += j + " "
        i[2] = a_str

In [33]:
with open('tst.csv', newline='') as csvfile:
    test_data = list(csv.reader(csvfile))
    for i in test_data:
        a_str = ""
        for j in i[1].split():
            if j not in redundant_words:
                a_str += j + " "
        i[1] = a_str
test_data.pop(0)

['id', 'abstract ']

In [34]:
# Calculate accuracy percentage between two lists
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i][1] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

In [35]:
seed(1)
def cross_validation_split(dataset, folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / folds)
    for i in range(folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

In [36]:
text_in_class = {'E':{}, 'B':{}, 'A': {}, 'V':{}}
a_counter = Counter()
e_counter = Counter()
b_counter = Counter()
v_counter = Counter()

In [37]:
def get_counts(data):
    text_in_class = {'E':{}, 'B':{}, 'A': {}, 'V':{}}
    a_counter = Counter()
    e_counter = Counter()
    b_counter = Counter()
    v_counter = Counter()
    word_freq = Counter()
    x = []
    y = []
    count = 0
    for i in data:
        if count != 0:
            x.append(i[2])
            y.append(i[1])
            if (i[1] == "A"):
                a = Counter(i[2].split())
                a_counter += a
                word_freq += Counter(set(a))
            if (i[1] == "B"):
                b = Counter(i[2].split())
                b_counter += b
                word_freq += Counter(set(b))
            if (i[1] == "E"):
                e = Counter(i[2].split())
                e_counter += e
                word_freq += Counter(set(e))
            if (i[1] == "V"):
                v = Counter(i[2].split())
                v_counter += v
                word_freq += Counter(set(v))
        count += 1
    target_counter = Counter(y)
    
    return word_freq, target_counter, a_counter, b_counter, e_counter, v_counter

word_freq_in_doc, target_counter, a_counter, b_counter, e_counter, v_counter = get_counts(data)
##print(a_counter)
print(target_counter)
print(len(word_freq_in_doc))

Counter({'E': 2144, 'B': 1602, 'A': 128, 'V': 126})
32216


In [38]:
a_sum = sum(target_counter.values())
a_probability = target_counter["A"] / a_sum
b_probability = target_counter["B"] / a_sum
v_probability = target_counter["V"] / a_sum
e_probability = target_counter["E"] / a_sum

In [39]:
print(a_probability)
print(b_probability)
print(v_probability)
print(e_probability)

0.032
0.4005
0.0315
0.536


In [40]:
a_probabilites = {}
b_probabilites = {}
v_probabilites = {}
e_probabilites = {}

### Standard Naive Bayes Training

In [41]:
a_word_sum = sum(a_counter.values())
b_word_sum = sum(b_counter.values())
v_word_sum = sum(v_counter.values())
e_word_sum = sum(e_counter.values())

In [42]:
all_unique = a_counter + b_counter + v_counter + e_counter
sum_unique = len(all_unique.keys())

In [43]:
def standard(all_unique, a_counter, b_counter, e_counter, v_counter):
    a_probabilites = {}
    b_probabilites = {}
    v_probabilites = {}
    e_probabilites = {}
    for word in all_unique:
        if word not in a_counter.keys():
            a_prob = 1
        elif word in a_counter.keys():
            a_prob = a_counter.get(word) + 1
        a_probabilites[word] = a_prob / a_word_sum + sum_unique

        if word not in b_counter.keys():
            b_prob = 1
        elif word in b_counter.keys():
            b_prob = b_counter.get(word) + 1
        b_probabilites[word] = b_prob / b_word_sum + sum_unique

        if word not in v_counter.keys():
            v_prob = 1
        elif word in b_counter.keys():
            v_prob = v_counter.get(word) + 1
        v_probabilites[word] = v_prob / v_word_sum + sum_unique


        if word not in e_counter.keys():
            e_prob = 1
        elif word in e_counter.keys():
            e_prob = e_counter.get(word) + 1
        e_probabilites[word] = e_prob / e_word_sum + sum_unique
    return a_probabilites, b_probabilites, v_probabilites, e_probabilites
a_probabilites, b_probabilites, v_probabilites, e_probabilites = standard(all_unique, a_counter, b_counter, e_counter, v_counter)

### Standard Naive Bayes testing

In [None]:
def standard_test(all_unique,test_data,  a_counter, b_counter, e_counter, v_counter, index, a_probabilites, b_probabilites, v_probabilites, e_probabilites):
    all_unique = a_counter + b_counter + v_counter + e_counter
    sum_unique = len(all_unique.keys())
    a_word_sum = len(a_counter.values())
    b_word_sum = len(b_counter.values())
    v_word_sum = len(v_counter.values())
    e_word_sum = len(e_counter.values())
    predictions = []

    for i in test_data:
        count_train = Counter(i[index].split())
        #count_train = Counter(i[1])
        class_prob = {"A": 0, "B": 0, "V": 0, "E": 0}
        for word in i[index].split():
            a_prob = a_probabilites.get(word)
            if (a_prob == None):
                a_prob = 1 / a_word_sum + sum_unique
            class_prob["A"] = math.log(a_prob) * count_train[word] + class_prob["A"]


            b_prob = b_probabilites.get(word) 
            if (b_prob == None):
                b_prob = 1 / b_word_sum + sum_unique
            class_prob["B"] =  math.log(b_prob) * count_train[word] + class_prob["B"]


            v_prob = v_probabilites.get(word)
            if (v_prob == None):
                v_prob = 1 / v_word_sum + sum_unique
            class_prob["V"] = math.log(v_prob) * count_train[word] + class_prob["V"]


            e_prob = e_probabilites.get(word)
            if (e_prob == None):
                e_prob = 1 / e_word_sum + sum_unique
            class_prob["E"] = math.log(e_prob) * count_train[word] + class_prob["E"]
        #print(class_prob)
        class_prob["A"] = class_prob["A"] + math.log(a_probability)
        class_prob["B"] = class_prob["B"] + math.log(b_probability)
        class_prob["V"] = class_prob["V"] + math.log(v_probability)
        class_prob["E"] = class_prob["E"] + math.log(e_probability)
        predictions.append(max(class_prob.items(), key=operator.itemgetter(1))[0])
    return predictions
predictions = standard_test(all_unique,test_data, a_counter, b_counter, e_counter, v_counter, 1, a_probabilites, b_probabilites, v_probabilites, e_probabilites)
##print(all_unique)

In [None]:
k = 10
rand_split = cross_validation_split(data, k)
overall_accuracies = []
for i in range(10):
    accuracies = []
    for i in range(k):
        test_split = rand_split[i]
        training_split = rand_split.copy()
        training_split.pop(i)

        training_split = [item for sublist in training_split for item in sublist]
        text_in_class = {'E':{}, 'B':{}, 'A': {}, 'V':{}}
        a_counter = Counter()
        e_counter = Counter()
        b_counter = Counter()
        v_counter = Counter()

        word_freq_in_doc,target_counter, a_counter, b_counter, e_counter, v_counter = get_counts(training_split)

    #    a_counter = Counter(dict(a_counter.most_common(1250)))
    #    b_counter = Counter(dict(b_counter.most_common(1250)))
    #    e_counter = Counter(dict(e_counter.most_common(1250)))
    #    v_counter = Counter(dict(v_counter.most_common(1250)))

        all_unique = a_counter + b_counter + v_counter + e_counter
        sum_unique = len(all_unique.keys())

        a_sum = sum(target_counter.values())
        a_probability = target_counter["A"] / a_sum
        b_probability = target_counter["B"] / a_sum
        v_probability = target_counter["V"] / a_sum
        e_probability = target_counter["E"] / a_sum


        a_sum = len(a_counter.values())
        b_sum = len(b_counter.values())
        v_sum = len(v_counter.values())
        e_sum = len(e_counter.values())

        a_probabilites, b_probabilites, v_probabilites, e_probabilites = standard(all_unique, a_counter, b_counter, e_counter, v_counter)
        predictions = standard_test(all_unique,test_split,  a_counter, b_counter, e_counter, v_counter, 2, a_probabilites, b_probabilites, v_probabilites, e_probabilites)


        accuracies.append(accuracy_metric(test_split, predictions))
        print(accuracies)
    overall_accuracies.append(statistics.mean(accuracies))
    #print(overall_accuracies)
print(overall_accuracies)

### TF * IDF Naive Bayes Training

In [None]:
def tf_idf_train(training_split_len, all_unique, word_freq_in_doc,a_sum, b_sum, e_sum, v_sum, sum_unique, a_counter, b_counter, e_counter, v_counter):
    a_probabilites = {}
    b_probabilites = {}
    v_probabilites = {}
    e_probabilites = {}
    
    for word in all_unique:

        idf = math.log(training_split_len / word_freq_in_doc.get(word))

        if word not in a_counter.keys():
            a_prob = 1
        elif word in a_counter.keys():
            a_prob = a_counter.get(word) + 1
        a_prob = (math.log(a_prob)*idf) / ((math.log(a_sum)*idf) + sum_unique)
        if a_prob == 0 or a_prob < 0:
            a_prob = 0.00001
        a_probabilites[word] = a_prob 

        if word not in b_counter.keys():
            b_prob = 1
        elif word in b_counter.keys():
            b_prob = b_counter.get(word) + 1
        b_prob = (math.log(b_prob)*idf) / ((math.log(b_sum)*idf) + sum_unique)
        if b_prob == 0 or b_prob < 0:
            b_prob = 0.00001
        b_probabilites[word] = b_prob

        if word not in v_counter.keys():
            v_prob = 1
        elif word in b_counter.keys():
            v_prob = v_counter.get(word) + 1
        v_prob = (math.log(v_prob)*idf) / ((math.log(v_sum)*idf) + sum_unique)
        if v_prob == 0 or v_prob < 0:
            v_prob = 0.00001
        v_probabilites[word] = v_prob


        if word not in e_counter.keys():
            e_prob = 1
        elif word in e_counter.keys():
            e_prob = e_counter.get(word) + 1
        e_prob = (math.log(e_prob)*idf) / ((math.log(e_sum)*idf) + sum_unique)
        if e_prob == 0 or e_prob < 0:
            e_prob = 0.00001
        e_probabilites[word] = e_prob
        
    return a_probabilites, b_probabilites, e_probabilites, v_probabilites

In [None]:
a_probabilites, b_probabilites, e_probabilites, v_probabilites = tf_idf_train(len(data),all_unique, word_freq_in_doc,a_sum, b_sum, e_sum, v_sum, sum_unique,  a_counter, b_counter, e_counter, v_counter)

### TF * IDF Naive Bayes Testing

In [None]:
def tf_idf_test(a_word_sum, b_word_sum, v_word_sum, e_word_sum,test_data, a_probabilites, b_probabilites, e_probabilites, v_probabilites, a_probability, b_probability, v_probability, e_probability, index):
    predictions = []
    for i in test_data:
        count_train = Counter(i[index].split())
        #count_train = Counter(i[1])
        class_prob = {"A": 0, "B": 0, "V": 0, "E": 0}
        for word in i[index].split():
            a_prob = a_probabilites.get(word)
            if (a_prob == None):
                a_prob = 1 / a_word_sum + sum_unique
            class_prob["A"] = math.log(a_prob) * count_train[word] + class_prob["A"]


            b_prob = b_probabilites.get(word) 
            if (b_prob == None):
                b_prob = 1 / b_word_sum + sum_unique
            class_prob["B"] =  math.log(b_prob) * count_train[word] + class_prob["B"]


            v_prob = v_probabilites.get(word)
            if (v_prob == None):
                v_prob = 1 / v_word_sum + sum_unique
            class_prob["V"] = math.log(v_prob) * count_train[word] + class_prob["V"]


            e_prob = e_probabilites.get(word)
            if (e_prob == None):
                e_prob = 1 / e_word_sum + sum_unique
            class_prob["E"] = math.log(e_prob) * count_train[word] + class_prob["E"]
        #print(class_prob)
        class_prob["A"] =  math.log(a_probability) + class_prob["A"]
        class_prob["B"] =  math.log(b_probability) + class_prob["B"]
        class_prob["V"] =  math.log(v_probability) + class_prob["V"]
        class_prob["E"] =  math.log(e_probability) + class_prob["E"]
        predictions.append(max(class_prob.items(), key=operator.itemgetter(1))[0])
    
    return predictions
predictions = tf_idf_test(a_word_sum, b_word_sum, v_word_sum, e_word_sum,test_data, a_probabilites, b_probabilites, e_probabilites, v_probabilites, a_probability, b_probability, v_probability, e_probability, 1)
##print(all_unique)
print(predictions)

### Cross Validation


In [None]:
k = 10
rand_split = cross_validation_split(data, k)
overall_accuracies = []
for i in range(10):
    accuracies = []
    for i in range(k):
        test_split = rand_split[i]
        training_split = rand_split.copy()
        training_split.pop(i)

        training_split = [item for sublist in training_split for item in sublist]
        text_in_class = {'E':{}, 'B':{}, 'A': {}, 'V':{}}
        a_counter = Counter()
        e_counter = Counter()
        b_counter = Counter()
        v_counter = Counter()

        word_freq_in_doc,target_counter, a_counter, b_counter, e_counter, v_counter = get_counts(training_split)

    #    a_counter = Counter(dict(a_counter.most_common(1250)))
    #    b_counter = Counter(dict(b_counter.most_common(1250)))
    #    e_counter = Counter(dict(e_counter.most_common(1250)))
    #    v_counter = Counter(dict(v_counter.most_common(1250)))

        all_unique = a_counter + b_counter + v_counter + e_counter
        sum_unique = len(all_unique.keys())

        a_sum = sum(target_counter.values())
        a_probability = target_counter["A"] / a_sum
        b_probability = target_counter["B"] / a_sum
        v_probability = target_counter["V"] / a_sum
        e_probability = target_counter["E"] / a_sum


        a_sum = len(a_counter.values())
        b_sum = len(b_counter.values())
        v_sum = len(v_counter.values())
        e_sum = len(e_counter.values())

        a_probabilites, b_probabilites, e_probabilites, v_probabilites = tf_idf_train(len(training_split), all_unique, word_freq_in_doc,a_sum, b_sum, e_sum, v_sum, sum_unique, a_counter, b_counter, e_counter, v_counter)
        predictions = tf_idf_test(a_sum, b_sum, v_sum, e_sum,test_split, a_probabilites, b_probabilites, e_probabilites, v_probabilites, a_probability, b_probability, v_probability, e_probability, 2)


        accuracies.append(accuracy_metric(test_split, predictions))
        print(accuracies)
    overall_accuracies.append(statistics.mean(accuracies))
    #print(overall_accuracies)
print(overall_accuracies)

In [None]:
print(statistics.mean(overall_accuracies))

### Accuracy with TF*IDF
after a 10-fold cross validation which was run 10 times I got an average accuracy of 91.704166666667 which shows us that our model performed much better after applying a wieghting to the frequencies of the words overall.

### Complement TF*IDF Naive Bayes Training

In [None]:
def complement_idf(a_counter, b_counter, v_counter, e_counter, a_probability, b_probability, v_probability, e_probability, test_data, training_len, index):
    predictions = []
    # Complement naive bayes so get all words in other classes
    not_a = e_counter + b_counter + v_counter
    not_b = a_counter + e_counter + v_counter
    not_e = a_counter + v_counter + b_counter
    not_v = a_counter + e_counter + b_counter
    
    sum_unique = a_counter + b_counter + v_counter + e_counter
    sum_unique = len(sum_unique.keys())

    #Get the sum of all counts in a class. Total word count for each class
    a_sum = sum(not_a.values())
    b_sum = sum(not_b.values())
    e_sum = sum(not_e.values())
    v_sum = sum(not_v.values())

    #Iterate through the csv
    for i in test_data:
        count_train = Counter(i[index].split())
        # For every document, do a word count
        #count_train = Counter(i[1])
        class_prob = {"A": 1, "B": 1, "V": 1, "E": 1}
        # Iterate through each word in a document
        for word in i[index].split():
            # Get the frequency of a word in all training documents
            if word in word_freq_in_doc:
                word_in_doc = word_freq_in_doc[word]
            else:
                word_in_doc = 0
    #        print(word_in_doc)
            #Calculate the idf 
            if word_in_doc != 0:
                idf = math.log(training_len / word_in_doc)
            else:
                idf = 0
            if word not in not_a.keys():
                a_prob = 1
            elif word in not_a.keys():
                a_prob = not_a.get(word) + 1
            a_prob = (math.log(a_prob)*idf) / ((math.log(a_sum)*idf) + sum_unique)
            if a_prob == 0 or a_prob < 0:
                a_prob = 0.00001
           # print(a_prob)
            class_prob["A"] = math.log(a_prob) * count_train[word] + class_prob["A"]

            if word not in not_b.keys():
                b_prob = 1
            elif word in not_b.keys():
                b_prob = not_b.get(word) + 1
           # print("b_Probability " + str(b_prob))
            #print("idf " + str(idf))
            b_prob = (math.log(b_prob)*idf) / ((math.log(b_sum)*idf) + sum_unique)
            if b_prob == 0 or b_prob < 0:
                b_prob = 0.00001
            class_prob["B"] =  math.log(b_prob) * count_train[word] + class_prob["B"]


            if word not in not_v.keys():
                v_prob = 1
            elif word in not_v.keys():
                v_prob = not_v.get(word) + 1
            v_prob = (math.log(v_prob)*idf) / ((math.log(v_sum)*idf) + sum_unique)
            if v_prob == 0 or v_prob <0:
                v_prob = 0.00001
            class_prob["V"] = math.log(v_prob) * count_train[word] + class_prob["V"]

            if word not in not_e.keys():
                e_prob = 1
            elif word in not_e.keys():
                e_prob = not_e.get(word) + 1
            e_prob = (math.log(e_prob)*idf) / ((math.log(e_sum)*idf) + sum_unique)
            if e_prob == 0 or e_prob < 0:
                e_prob = 0.00001
            class_prob["E"] = math.log(e_prob) * count_train[word] + class_prob["E"]
        class_prob["A"] =  math.log(a_probability) - class_prob["A"]
        class_prob["B"] =  math.log(b_probability) - class_prob["B"]
        class_prob["V"] =  math.log(v_probability) - class_prob["V"]
        class_prob["E"] =  math.log(e_probability) - class_prob["E"]
        #print(class_prob)
        predictions.append(max(class_prob.items(), key=operator.itemgetter(1))[0])
    return predictions
print(predictions)

In [None]:
overall_accuracy = []
for j in range(10):
    k = 10
    rand_split = cross_validation_split(data, k)
    accuracies = []
    for i in range(k):
        test_split = rand_split[i]
        training_split = rand_split.copy()
        training_split.pop(i)

        training_split = [item for sublist in training_split for item in sublist]
        text_in_class = {'E':{}, 'B':{}, 'A': {}, 'V':{}}
        a_counter = Counter()
        e_counter = Counter()
        b_counter = Counter()
        v_counter = Counter()

        word_freq_in_doc,target_counter, a_counter, b_counter, e_counter, v_counter = get_counts(training_split)

        a_sum = sum(target_counter.values())
        a_probability = target_counter["A"] / a_sum
        b_probability = target_counter["B"] / a_sum
        v_probability = target_counter["V"] / a_sum
        e_probability = target_counter["E"] / a_sum

        training_len = len(training_split)

        predictions = complement_idf(a_counter, b_counter, v_counter, e_counter, a_probability, b_probability, v_probability, e_probability, test_split, training_len, 2)


    #    a_probabilites, b_probabilites, e_probabilites, v_probabilites = complement_train(len(training_split), all_unique, word_freq_in_doc,a_sum, b_sum, e_sum, v_sum, sum_unique, not_a, not_b, not_e, not_v)
     #   predictions = complement_test(word_freq_in_doc,a_sum, b_sum, v_sum, e_sum,len(training_split),test_split, a_probabilites, b_probabilites, e_probabilites, v_probabilites, a_probability, b_probability, v_probability, e_probability, 2)


        accuracies.append(accuracy_metric(test_split, predictions))
    print(accuracies)
    overall_accuracy.append(statistics.mean(accuracies))

In [None]:
print(statistics.mean(overall_accuracy))

In [None]:
with open('predictions.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["id", "class"])
    for i in range(len(predictions)):
        writer.writerow([i+1, predictions[i]])

In [None]:
def create_ngram(s, n):
    
    ngrams = [ngram for ngram in s.split(" ") if ngram != ""]
    
    ngrams = zip(*[ngrams[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]