<h2>Report Section</h2>
<h3>The Data</h3>

<h3>Basic Preprocessing</h3>

<p>The raw data from the trg.csv contains the id, actual classification and an abstract. The abstracts have already been converted to lowercase, and punctuation removed. Lastly, the data is randomized to prevent the order from affecting evaluation. From here, the data can then be further processed to enable compatibility with Naive Bayes.</p>

<p>Basic preprocessing for Naive Bayes involves creating a matrix with a count vectorizer from sk learn. The columns represent unique words, and the rows are the abstracts. The values in the matrix are filled in with 1 or 0, depending on if the abstract contains a word. Furthermore, as the matrix is created from the training data set, the test data may contain words that have yet to be seen. Hence we need to fit the test data to this model. These matrixes are then rejoined with the respective datasets to retain the relationship between the actual protein and the abstract.</p>
 
<h3>Basic Naive Bayes</h3>
<p>The idea behind Naive Bayes is to create a table of probabilities based on the occurrence of unique words in the abstract. For every potential prediction(A, B, E, V), a probability for a given abstract can be calculated. The prediction would be the one with the highest probability. The numerator of the probability is the number of rows where the protein matches the actual protein and has a value of 1, plus one for Laplace smoothing. The denominator is the number of rows with the matching protein plus the number of unique words. This is repeated for every row and every protein type giving a probability matrix with columns equal to unique words and rows equal to the number of proteins(4)</p>

<p>Predicting the class/protein of an abstract involves summing the log2 of the probabilities. Furthermore, it doesn't change the criteria for prediction, as the class with the highest probability is still predicted.</p>

<h3>Problems and Improvements</h3>

<p>The first change is to remove uninformative words which don't hold any valuable information("a", "the", "you",...). These words dilute the information, making predictions less accurate. Furthermore, the storage of irrelevant words is increasing the data table's width, slowing both the training and evaluation methods. As seen in the modified preprocessing code in the coding section under #change1.=,  the stop words function and max_df are used in the count vectorizer to ignore specific words and words that occur in too many rows. In addition, some smaller changes have been made, such as removing words that don't occur frequently enough and weighting probabilities based on the frequency of the word in an abstract.</p>


<p>From the tables in the results section, a clear problem is present the model isn't predicting abstracts to be A or V. The large imbalance in classes causes this. The class imbalance lowers accuracy because it is influenced by the proportion of classes in the table. The solution is to equalize the amount of each class in the training data before it trains the probabilities. This could be achieved through over or under-sampling. In this case, oversampling the minority classes will preserve the resolution of the large dataset, maintaining accuracy. This can be achieved with the sampling method for data frames with replacement. By sampling all the classes for the size of the largest class number of examples.</p>

<h3>Evaluation Method</h3>

<p>The basic evaluation of the method looks at the accuracy of predictions made by the model through the portion of correct predictions and a table with the predicted and actual values. The data is split on a 75 - 25 training test split. Ideally, cross-validation would be used to measure average performance; however, the way data is handled and processed, it becomes overly complex to use. Instead, each test is run across five different randomizations and an average is taken.</p>

<h3>Results</h3>  
<p>All the results can be found in a raw form in the testing section due to size constraints </p>
For the basic model, the main insight we can gain is what sort of training-test split should be used and the starting accuracy. From testing five different shuffles of the data and splits the average accuracy, here are the results<p>
    
 <p>The observation is that the algorithm continues improving as we increase the size of the training set and the test set shrinks. This is the result of the algorithm being familiar with a wider range of vocabulary and being able to recognize it in the test set. Hence the more it can train on, the better for unseen results, as shown by the gradual increase in accuracy. However, for now, we'll stick with a 75-25 split for internal testing, as a small test set size may skew the evaluation. Finally, 83-85% is a good point to start making adjustments to improve.</p>
 
 <h4>Modified Model</h4>
 <p>Before the model is modified, a baseline average accuracy from a 75 - 25 training test split is 85%. The results after implementing the changes to reduce the width of the data table and the frequency weighting is an average accuracy of 86.8% with a table:</p>
   

<p>The results show a slight improvement over the baseline. However, the table in the testing section shows that the model isn't identifying abstracts for A or V and mispredicting B as E. One of the causes of this is the imbalance in classes, with A and V combined only making up 6% of the total dataset and B being just under two-thirds the size of class B. Justifying the change to equalize the number of each class in the training set. The result of this change is an increase in accuracy to 95.4%. All classes are being predicted at around 95% except class V which is approximately 85%, depending on the sample. From this point, fine-tuning parameters will be the key to getting extra accuracy.



<p> Regarding Kaggle, using the above methods with a training set comprised of the entire 4000 abstract and the test set from Kaggle results in an accuracy of 97%. Before the equalization of the dataset, the highest accuracy was 89%, and before any modifications, it was in the mid 80's. These are slightly higher than the internal result because the size of the training set is larger.</p>

<p> Some final comments about different improvement methods. N-grams could have been used for higher accuracy. However, they increased the run time to an unfeasible timeframe for testing. An additional addition would be using parallel processing to speed up the predictions. Lastly, a lot of the initial data from the report has been cut due to space. However, all the raw data is in the testing section, and when running the program, be aware it will take about 10 min per evaluation.

<h2>Basic Naive Bayes Coding Component</h2>
<h4>Classifying abstracts based on what protein they focus on</h4>

<h4>Step 1: Data preprocessing</h4>
<p>Firstly, we need to move the data out of the Excel spreadsheet into a data frame to operate on it. From here, I need to create a second table that stores in binary which unique words each abstract uses.



In [76]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from openpyxl import load_workbook #used for writing to excel file for kaggle

#extracting data from csv
names = ["Id","ActualProtein","Abstract"]
rawData = pd.read_csv("trg.csv",header=0,names = names)
print(rawData)



        Id ActualProtein                                           Abstract
0        1             B  the 4 202 353 bp genome of the alkaliphilic ba...
1        2             A  the complete 1751377-bp sequence of the genome...
2        3             E  in 1992 we started assembling an ordered libra...
3        4             E  the aim of this study is to measure human mito...
4        5             B  the amino acid sequence of the spirulina maxim...
...    ...           ...                                                ...
3995  3996             E  we have isolated and characterized two diureti...
3996  3997             E  myotonias are muscle diseases in which the fun...
3997  3998             E  cysteine synthase o-acetylserine sulfhydrylase...
3998  3999             E  a region of 25 nucleotides is highly conserved...
3999  4000             B  thermoanaerobacter tengcongensis is a rod-shap...

[4000 rows x 3 columns]


In [119]:
#preprocessing
#the data has already been converted to lower case and there appears to be no puntuation.
#so no preprocessing required for basic functionality
#ratio is the training - split,data is the base dataframe,randomizer is the seed for the data to be randomized

def basicPreprocess(ratio,data,randomizer):
    #randomize data and split into training and test
    randomizedData = data.sample(frac = 1,random_state = randomizer)
    splitPoint = int(round(len(data)*ratio,0))
    trainingData = randomizedData.iloc[:splitPoint]
    testData = randomizedData.iloc[splitPoint:]
    
    #need create data frame that stores a binary record of unique words in the abstracts. Will use sk learn count vectorizer.
    #need to convert the abstract column to an array/list to pass to the vectorizer.

    trainingAbstractList = trainingData["Abstract"]
    countVectorizer = CountVectorizer(binary=True)
    trainArray = countVectorizer.fit_transform(trainingAbstractList).toarray()
    
    #convert back to dataframe
    fittedTrainingData = pd.DataFrame(data=trainArray,columns = countVectorizer.get_feature_names_out())
    
    
    #fit test data to this matrix as it handles words not seen in the training data
    
    testArray = countVectorizer.transform(testData["Abstract"]).toarray()
    fittedTestData = pd.DataFrame(data=testArray,columns = countVectorizer.get_feature_names_out())
    
    #rejoin the new tables with original to retain the actual values for evaluation
    trainingData = trainingData.drop("Abstract",axis=1)
    testData = testData.drop("Abstract",axis=1)
    trainingData = trainingData.reset_index()
    testData = testData.reset_index()
   
    trainingData = pd.concat([trainingData,fittedTrainingData],axis=1)
    testData = pd.concat([testData,fittedTestData],axis=1)
    trainingData = trainingData.drop("index",axis=1)
    testData = testData.drop("index",axis=1)
    
    
    
    return trainingData,testData
    




<h4>Step 2: Naive Bayes </h4>
<p>want to create a table of probabilities for our training set. Note that laplace smoothing will be used to prevent probability values of 0 from breaking the log function.</p>
<p>Note that this initial implementation is prolonged as it must calculate probabilities for all 26000 rows for each type of protein. In the modified version, optimizations will have to be made</p>

In [78]:
#Training
#takes a set fo data and creates a 2d arry holding probabilites. table will need a column for each word and a row
#for each potential output(protein).
def trainProbabilities(trainingData):
    
    probabilities = []
    types = ["A","B","E","V"]
    columnNames = list(trainingData.columns)
    for protein in types:
        probProtein = []
        #can shift denominator out as it will be the same for all values for the same protein
        #denominator is number of rows where the protein specified is present plus the number of unique words less two to account 
        #for columns not holding words
        denominator = len(trainingData[(trainingData["ActualProtein"] ==protein)])+len(trainingData.columns)

        for i in range(2,len(trainingData.columns)):
            
            #numerator is number of rows with value with the given protein specified + 1 for laplace smoothing.
            numerator = len(trainingData[(trainingData["ActualProtein"] == protein) & (trainingData[columnNames[i]] != 0)])+1
            probProtein.append(numerator/denominator)


        probabilities.append(probProtein)

    
    return probabilities


In [118]:
#Predict
#takes an abstract that has been vectorised as in the preprocessing step and returns the predicted protein type using the table
def predict(row,trainingData,probabilities):
    #calculate probabilty of abstract being each protein and return the protein with highest probability
    #probabilty is sum of the log of relevant probabilities plus the log of overall probabilty of the protein
    #calculate probaility of protein in occuring in table
    
    #A
    probA = np.log2(len(trainingData[(trainingData["ActualProtein"] =="A")])/len(trainingData))
    probB = np.log2(len(trainingData[(trainingData["ActualProtein"] =="B")])/len(trainingData))
    probE = np.log2(len(trainingData[(trainingData["ActualProtein"] =="E")])/len(trainingData))
    probV = np.log2(len(trainingData[(trainingData["ActualProtein"] =="V")])/len(trainingData))
    
    for i in range(2,len(row)):
        #probability table doesn't include rows for index and actual protein hence it is offset by 2 which must be accounted for
        probIndex = i -2
        
        if row[i] == 0:
            probA = probA + np.log2(1-(probabilities[0][probIndex]))
            probB = probB + np.log2(1-(probabilities[1][probIndex]))
            probE = probE + np.log2(1-(probabilities[2][probIndex]))
            probV = probV + np.log2(1-(probabilities[3][probIndex]))
            
        else:
            #change 2 multiply probability by frequency, not we can get away with using the originl predict
            #as the binary matrix from before will multiply by 1 and have no change.
            probA = probA + np.log2((row[i]*probabilities[0][probIndex]))
            probB = probB + np.log2((row[i]*probabilities[1][probIndex]))
            probE = probE + np.log2((row[i]*probabilities[2][probIndex]))
            probV = probV + np.log2((row[i]*probabilities[3][probIndex]))
        
        

    maxProb = max(probA,probB,probE,probV)
    
    if maxProb == probA:
        return "A",row[1]
    if maxProb == probB:
        return "B",row[1]
    if maxProb == probE:
        return "E",row[1]
    if maxProb == probV:
        return "V",row[1]

#predict values for test data and compare to actual values and return the proportion predicted correctly
def internalEvaluate(probabilities,trainingData,testData):
    
    correct = 0
    
    predictionTable = [[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0]]
    for row in range(0,len(testData)):
        
        predicted,actual = predict(testData.iloc[row],trainingData,probabilities)
        if predicted == actual:
                correct += 1   


        if actual == "A":
            column = 0
        elif actual == "B":
            column = 1
        elif actual == "E":
            column = 2
        elif actual == "V":
            column = 3

        if predicted == "A":
            row = 0
        elif predicted == "B":
            row = 1
        elif predicted == "E":
            row = 2
        elif predicted == "V":
            row = 3
        predictionTable[row][column] = (predictionTable[row][column]) + 1 
        
    
    return correct/len(testData),predictionTable
        
   
    

<h3>Modified preprocessing</h3>
<p>The initial data collection from the .csv is perfectly fine. However, improvements could be made to the preprocessing stage. Modify the data tables to allow for improved training and prediction. Any changes to the code will be mentioned in the surrounding comments</p>

In [124]:
def modifiedPreprocess(ratio,data,randomizer):
    
    randomizedData = data.sample(frac = 1,random_state = randomizer)
    splitPoint = int(round(len(data)*ratio,0))
    trainingData = randomizedData.iloc[:splitPoint]
    testData = randomizedData.iloc[splitPoint:]
    
    #need create data frame that stores a binary record of unique words in the abstracts. Will use sk learn count vectorizer.
    #need to convert the abstract column to an array/list to pass to the vectorizer.

    trainingAbstractList = trainingData["Abstract"]
    #CHANGE #1
    #Change count vectoriser has a parameter that allows you to insert a list of words that will be ignored
    #reading through a few abstracts manually at rondom i picked out a selection of irrelevant words
    ignoredWords = ["a","the","an","you","your","my","our","who","what","when","why","than","then","that","youre","once","even","though","and","because","human","test","testing","one","two","three","four","five","six","seven","eight","nine","research","often","some","few","other"]
    countVectorizer = CountVectorizer(min_df = 1,max_df = 0.7, stop_words = ignoredWords)
    trainArray = countVectorizer.fit_transform(trainingAbstractList).toarray()
    
    #convert back to dataframe
    fittedTrainingData = pd.DataFrame(data=trainArray,columns = countVectorizer.get_feature_names_out())
    
    
    #fit test data to this matrix as it handles words not seen in the training data
    
    testArray = countVectorizer.transform(testData["Abstract"]).toarray()
    fittedTestData = pd.DataFrame(data=testArray,columns = countVectorizer.get_feature_names_out())
    
    #rejoin the new tables with original to retain the actual values for evaluation
    trainingData = trainingData.drop("Abstract",axis=1)
    testData = testData.drop("Abstract",axis=1)
    trainingData = trainingData.reset_index()
    testData = testData.reset_index()
   
    trainingData = pd.concat([trainingData,fittedTrainingData],axis=1)
    testData = pd.concat([testData,fittedTestData],axis=1)
    trainingData = trainingData.drop("index",axis=1)
    testData = testData.drop("index",axis=1)
    
    
    
    
    return trainingData,testData
    

In [106]:
def dataEqualisation(data,r):
    #given a dataset want to equalize the quantity of each class within reason, won't be perfectly equal but close enough will do
    dataA = data.loc[data["ActualProtein"] == "A"]
    dataB = data.loc[data["ActualProtein"] == "B"]
    dataE = data.loc[data["ActualProtein"] == "E"]
    dataV = data.loc[data["ActualProtein"] == "V"]
    
    #want to keep resolution of largest dataset so instead want to oversmaple minority classes
    maxDataSize = max(len(dataA),len(dataB),len(dataB),len(dataB))
    
    
    
    upsizedDataA = dataA.sample(n=maxDataSize,replace=True,random_state = r)    
    upsizedDataB = dataB.sample(n=maxDataSize,replace=True,random_state = r)
    upsizedDataE = dataE.sample(n=maxDataSize,replace=True,random_state = r)
    upsizedDataV = dataV.sample(n=maxDataSize,replace=True,random_state = r)
        
   
        
    equalizedData = pd.concat([upsizedDataA,upsizedDataB,upsizedDataE,upsizedDataV])
    return equalizedData

<h4>Kaggle Versions</h4>
<p>
as the Kaggle submission requires an Excel spreadsheet with the predictions for the unseen abstract, it requires some custom preprocess and evaluation steps to format it correctly. These functions will be functionally identical to those above; it just handles the Excel documents better.</p>

In [137]:
def kagglePreprocess(trainingData,testData): # removes the split and randomisation part
    print("Kaggle preprocess start")
    trainingAbstractList = trainingData["Abstract"]
    #Change count vectoriser has a parameter that allows you to insert a list of words that will be ignored
    #reading through a few abstracts manually at rondom i picked out a selection of irrelevant words
    ignoredWords = ["a","the","an","you","your","my","our","who","what","when","why","than","then","that","youre","once","even","though","and","because","human","test","testing","one","two","three","four","five","six","seven","eight","nine","research","often","some","few","other"]
    
    countVectorizer = CountVectorizer(min_df = 1,max_df = 0.7,stop_words = ignoredWords)
    trainArray = countVectorizer.fit_transform(trainingAbstractList).toarray()
    
    #convert back to dataframe
    fittedTrainingData = pd.DataFrame(data=trainArray,columns = countVectorizer.get_feature_names_out())
    
    
    #fit test data to this matrix as it handles words not seen in the training data
    
    testArray = countVectorizer.transform(testData["Abstract"]).toarray()
    fittedTestData = pd.DataFrame(data=testArray,columns = countVectorizer.get_feature_names_out())
    
    #rejoin the new tables with original to retain the actual values for evaluation
    trainingData = trainingData.drop("Abstract",axis=1)
    testData = testData.drop("Abstract",axis=1)
    trainingData = trainingData.reset_index()
    testData = testData.reset_index()
   
    trainingData = pd.concat([trainingData,fittedTrainingData],axis=1)
    testData = pd.concat([testData,fittedTestData],axis=1)
    trainingData = trainingData.drop("index",axis=1)
    testData = testData.drop("index",axis=1)
    
    #remove numbers from columns
    columnNames = testData.columns
    remove = []
    for word in columnNames:
        
        if word.isnumeric() == True:
            remove.append(word)
    
    trainingData = trainingData.drop(remove,axis=1)
    testData = testData.drop(remove,axis=1)
    
    
    print("Kaggle preprocess complete")
    return trainingData,testData
    

def kaggleEvaluate(testData,trainingData,probabilities):
    print("Kaggle evaluate start")
    predictions = []
    for row in range(0,len(testData)):
        
        prediction = kagglePredict(testData.iloc[row],trainingData,probabilities)
        predictions.append(prediction)
    
    #put predictions into dataframe to then convert to csv
    predictionsDF = pd.DataFrame(predictions,columns=["Predictions"])
    predictionsDF.to_csv("Results.csv")
    print("Kaggle evaluate complete")
    return predictions 

def kagglePredict(row,trainingData,probabilities):
    probA = np.log2(len(trainingData[(trainingData["ActualProtein"] =="A")])/len(trainingData))
    probB = np.log2(len(trainingData[(trainingData["ActualProtein"] =="B")])/len(trainingData))
    probE = np.log2(len(trainingData[(trainingData["ActualProtein"] =="E")])/len(trainingData))
    probV = np.log2(len(trainingData[(trainingData["ActualProtein"] =="V")])/len(trainingData))
    
    for i in range(1,len(row)):
        #probability table doesn't include rows for index and actual protein hence it is offset by 2 which must be accounted for
        probIndex = i -1
        
        if row[i] == 0:
            probA = probA + np.log2(1-(probabilities[0][probIndex]))
            probB = probB + np.log2(1-(probabilities[1][probIndex]))
            probE = probE + np.log2(1-(probabilities[2][probIndex]))
            probV = probV + np.log2(1-(probabilities[3][probIndex]))
            
        else:
            #change 2 multiply probability by frequency, not we can get away with using the originl predict
            #as the binary matrix from before will multiply by 1 and have no change.
            probA = probA + np.log2((row[i]*probabilities[0][probIndex]))
            probB = probB + np.log2((row[i]*probabilities[1][probIndex]))
            probE = probE + np.log2((row[i]*probabilities[2][probIndex]))
            probV = probV + np.log2((row[i]*1.2*probabilities[3][probIndex]))
        
        

    maxProb = max(probA,probB,probE,probV)
    
    if maxProb == probA:
        return "A"
    if maxProb == probB:
        return "B"
    if maxProb == probE:
        return "E"
    if maxProb == probV:
        return "V"

columnNames = ["Id","ActualProtein","Abstract"]
trainData = pd.read_csv("trg.csv",header=0,names = columnNames)
columnNames = ["Id","Abstract"]
testData = pd.read_csv("tst.csv",header=0,names = columnNames)

trainData,testData = kagglePreprocess(trainData,testData)
probs = trainProbabilities(dataEqualisation(trainData,31))
predictions = kaggleEvaluate(testData,trainData,probs)



Kaggle preprocess start
Kaggle preprocess complete
Kaggle evaluate start
Kaggle evaluate complete


<h3>Testing</h3>
<p>The purpose of this section is to demonstrate the figures mentioned in the report at the beginning as well as show additional testing completed during the development process.</p>
<h4>Baseline</h4>
<p> testing the unmodified naive Bayes algorithm, preprocessing and evaluation components. The only variable that can be modified here is the training test split. For testing, we'll use various randomizer values and take an average accuracy measure at each split ratio.</p>


In [83]:
randomizerValues = [100,200,300,400,500]
splitratios = [0.5,0.6,0.7,0.8,0.9]
#go through all the ratios
for ratio in splitratios:
    print("The accuracies when the split value is " + str(ratio))
    total = 0
    #go through all the randomizer values with that ratio
    for r in randomizerValues:
        trainData,testData = basicPreprocess(ratio,rawData,r)
        probabilities = trainProbabilities(trainData)
        accuracy,table = internalEvaluate(probabilities,trainData,testData)
        print("r = "+ str(r) + ": accuracy = "+ str(accuracy))
        total = total + accuracy
    
    print("The average accuracy when the training split ratio = "+str(ratio) + " is "+str(total/len(randomizerValues)))
    print()#adding a blank line for seperation

The accuracies when the split value is 0.5
r = 100: accuracy = 0.8305
r = 200: accuracy = 0.848
r = 300: accuracy = 0.8275
r = 400: accuracy = 0.8185
r = 500: accuracy = 0.8605
The average accuracy when the training split ratio = 0.5 is 0.8370000000000001

The accuracies when the split value is 0.6
r = 100: accuracy = 0.840625
r = 200: accuracy = 0.844375
r = 300: accuracy = 0.83125
r = 400: accuracy = 0.824375
r = 500: accuracy = 0.858125
The average accuracy when the training split ratio = 0.6 is 0.8397500000000001

The accuracies when the split value is 0.7
r = 100: accuracy = 0.8433333333333334
r = 200: accuracy = 0.8583333333333333
r = 300: accuracy = 0.8358333333333333
r = 400: accuracy = 0.8183333333333334
r = 500: accuracy = 0.8583333333333333
The average accuracy when the training split ratio = 0.7 is 0.8428333333333333

The accuracies when the split value is 0.8
r = 100: accuracy = 0.85
r = 200: accuracy = 0.855
r = 300: accuracy = 0.85375
r = 400: accuracy = 0.8325
r = 500: 


<table>
        <tr>
            <td>Training Portion</td>
            <td>Average accuracy</td>
        </tr>
        <tr>
            <td>0.5</td>
            <td>0.837</td>
        </tr>
            <td>0.6</td>
            <td>0.840</td>
        <tr>
            <td>0.7</td>
            <td>0.843</td>
        </tr>
        <tr>
            <td>0.8</td>
            <td>0.849</td>
        </tr>
        <tr>
            <td>0.9</td>
            <td>0.853</td>
        </tr>
     </table>
<h4>Basic model on a 75 - 25 split</h4>
<p> for the tables the columns are the actual values (A,B,E,V) and the rows are the predicted values (A,B,E,V).

In [128]:
randomizerValues = [100,200,300,400,500]
ratio = 0.75
totalbasic = 0

#go through all the randomizer values with that ratio
for r in randomizerValues:
    #label variables associated with basic methods with a B
    trainDataB,testDataB = basicPreprocess(ratio,rawData,r)
    probabilitiesB = trainProbabilities(trainDataB)
    accuracyB,tableB = internalEvaluate(probabilitiesB,trainDataB,testDataB)
    print("basic: r = "+ str(r) + ": accuracy = "+ str(accuracyB))
    for row in tableB:
        print(row)
    print()
    totalbasic = totalbasic + accuracyB
    
print("The average accuracy for the basic preprocessing is "+str(totalbasic/len(randomizerValues)))

    

basic: r = 100: accuracy = 0.851
[0, 0, 0, 0]
[18, 298, 0, 1]
[13, 85, 553, 32]
[0, 0, 0, 0]

basic: r = 200: accuracy = 0.847
[0, 0, 0, 0]
[13, 296, 0, 0]
[8, 98, 551, 34]
[0, 0, 0, 0]

basic: r = 300: accuracy = 0.848
[0, 0, 0, 0]
[17, 293, 0, 1]
[13, 95, 555, 26]
[0, 0, 0, 0]

basic: r = 400: accuracy = 0.827
[0, 0, 0, 0]
[15, 291, 1, 1]
[14, 104, 536, 38]
[0, 0, 0, 0]

basic: r = 500: accuracy = 0.852
[0, 0, 0, 0]
[21, 299, 0, 2]
[15, 73, 553, 37]
[0, 0, 0, 0]

The average accuracy for the basic preprocessing is 0.845


<h4>Removing oversused words, underused words and weighting by frequency</h4>

In [129]:
randomizerValues = [100,200,300,400,500]
ratio = 0.75
totalmodified = 0
#go through all the randomizer values with that ratio
for r in randomizerValues:
    #label variable assocaited with modified methods with an M
    trainDataM,testDataM = modifiedPreprocess(ratio,rawData,r)
    probabilitiesM = trainProbabilities(trainDataM)
    accuracyM,tableM = internalEvaluate(probabilitiesM,trainDataM,testDataM)
    print("modified: r = "+ str(r) + ": accuracy = "+ str(accuracyM))
    for row in tableM:
        print(row)
    
    print()
    totalmodified = totalmodified + accuracyM
    
print("The average accuracy for the modified preprocessing is "+str(totalmodified/len(randomizerValues)))
    

modified: r = 100: accuracy = 0.872
[0, 0, 0, 0]
[24, 319, 0, 1]
[7, 64, 553, 32]
[0, 0, 0, 0]

modified: r = 200: accuracy = 0.876
[0, 0, 0, 0]
[16, 326, 1, 0]
[5, 68, 550, 34]
[0, 0, 0, 0]

modified: r = 300: accuracy = 0.871
[0, 0, 0, 0]
[20, 317, 1, 1]
[10, 71, 554, 26]
[0, 0, 0, 0]

modified: r = 400: accuracy = 0.847
[0, 0, 0, 0]
[15, 312, 2, 1]
[14, 83, 535, 38]
[0, 0, 0, 0]

modified: r = 500: accuracy = 0.872
[0, 0, 0, 0]
[25, 319, 0, 3]
[11, 53, 553, 36]
[0, 0, 0, 0]

The average accuracy for the modified preprocessing is 0.8676


<h4>Balancing the training dataset with the modifications from above</h4>


In [131]:
randomizerValues = [100,200,300,400,500]
ratio = 0.75
totalmodified = 0
#go through all the randomizer values with that ratio
for r in randomizerValues:
    #label variables associated with basic methods with a B
    #label variable assocaited with modified methods with an M
    
    trainDataM,testDataM = modifiedPreprocess(ratio,rawData,r)
    probabilitiesM = trainProbabilities(dataEqualisation(trainDataM,r))
    accuracyM,tableM = internalEvaluate(probabilitiesM,trainDataM,testDataM)
    print("modified: r = "+ str(r) + ": accuracy = "+ str(accuracyM))
    for row in tableM:
        print(row)
        
    print()
    totalmodified = totalmodified + accuracyM
    
print("The average accuracy for the modified preprocessing is "+str(totalmodified/len(randomizerValues)))

modified: r = 100: accuracy = 0.954
[24, 1, 0, 0]
[5, 356, 3, 1]
[2, 25, 546, 4]
[0, 1, 4, 28]

modified: r = 200: accuracy = 0.963
[16, 0, 0, 0]
[4, 377, 9, 0]
[0, 17, 541, 5]
[1, 0, 1, 29]

modified: r = 300: accuracy = 0.96
[25, 0, 0, 0]
[3, 371, 9, 1]
[2, 16, 541, 3]
[0, 1, 5, 23]

modified: r = 400: accuracy = 0.942
[21, 2, 0, 0]
[5, 369, 8, 2]
[3, 23, 527, 12]
[0, 1, 2, 25]

modified: r = 500: accuracy = 0.953
[28, 0, 0, 0]
[4, 354, 10, 4]
[3, 18, 541, 5]
[1, 0, 2, 30]

The average accuracy for the modified preprocessing is 0.9544
