#REPORT

Initial Kaggle Score: 88%
Final Kaggle Score: 80.33%

To start, I have to make it clear that due to having no Cross Validation code, I don't have valuable statistics to demonstrate the range of my models performance, but as it's well below 90% it's quite clear that my model is not performing the way it should be anyway.

In terms of pre-processing, once the training data is read, I made sure to do the standard word cleaning procedures like making sure all words contain only lowercase letters, remove any punctuation to avoid situations like "test" and "test." being added to the Bag of Words separately. Once that was done, while scanning each abstract and appending its contents to a dictionary to send to my runNB function, I also removed a list of common stop-words (these were taken from a list of supposed words google used to filter in their search engine algorithm). Something I could've implemented in my pre-processing was 'stemming': I noticed that often times in my bag of words, I'd come across words like 'gene'/'genes' or 'sequence'/'sequences'. In my model, these would be stored as seperate words, but they should've been stored as the same word by only keeping the root word (in these examples removing the 's' at the end, for other cases it could be that we remove 'ed', 'ing' etc. 

Once the dictionary is appended to the runNB function, I've scanned each paragraph and taken the relevant information (total words, total words per class, unique words per class etc) for calculating priors and likelihood. From there, the counts of how many times a word appears per class were taken, and these counts + the words found in each class were zipped together into a dictionary, creating my bag of words. Once the bag of words was created, I've gone through every count value and divided it by the total number of words in the class, giving word likelihood values. Finally, these numbers were logged to prevent underflow, as I knew looking at the training & test data, that the model would be reading a lot of words and that some of these values might be lost to underflow. By logging them, python doesn't round them to 0, losing valuable information for rarely occuring words. After this, priors were calculated (and also logged, not to prevent underflow but for consistency in the method). Lastly, the prior values were added to my bag of words so that the next function, makePrediction, had all necessary information to classify each instance. 

Finally I created a list of 4 0's, and after reading the test file I ran each row of the test file through makePrediction, which examined every word against the bag of words, and appended each classes likelihood to classLikeliness for final scoring. The Naive Bayes calculation essentially happens at this step, with these values all being added together as Log(A * B) is equivalent to Log(A) + Log(B) via log properties. Then, to actually choose the classifier for each row, I've taken the min() of classLikeliness. This is actually weird because I know the Naive Bayes is supposed to take the max() value, which would make sense as the likelihoods for each word in classes where they appear often were way smaller than classes where they were sparse, therefore the most likely class SHOULD be the smallest negative number. The errors in my algorithms logic, (like in this case) are likely the cause for the subpar classification accuracy, something I'll have to examine closely in the future. 

In terms of extensions, while I didn't get to implement any, I was deciding between using TF-IDF, and N-Grams. For N-Grams, I was considering this extension as I noticed that a lot of the paragraphs that were being read into the model had a lot of similar words between eachother. Therefore, by implementing either bi-grams or tri-grams into my model, I believe it would've helped contextualize what data was being stored into a way that would've been more predictive of each class. While a problem with the N-Grams extension is that it requires a lot of training data to get more examples (so that each bi/tri-gram isnt stored in the BoW with a low likelihood count), I dont think that would've been an issue in the case of the data provided for this assignment. One issue I did forsee happening however, was that implementation of this extension might've increased runtime, which while not a problem in terms of classification accuracy, would've been worse for quality of life.

Therefore, I think in the end I would've chosen to go with TF-IDF. As I didnt remove the top1000 words in my model, implementing TF-IDF would've helped weight the rarer, more indicitive words/features higher in the model, possibly leading to a solid gain in accuracy. I noticed when working with the data that words like 'gene', 'sequence' and many other biology related words were appearing a lot. as they were typically the highest counted words across all of the classifiers, their inclusion in the calculations for P(C) would've been more of a hinderance than a benefit. 

##FILE HANDLING

In [120]:
import csv
import math
import numpy as np


def readFile(filename):
    with open(filename, 'r') as theFile:
        row = csv.reader(theFile, delimiter=',')
        next(row)
        data = [data for data in row]
        
    return data

def getTestData(filename):
    
    with open(filename, 'r') as theFile:
        file = csv.reader(theFile, delimiter=',')
        
        #cleaning the input data for better readability from the model
        allRows = [row[1].translate(str.maketrans('','',string.punctuation)).lower() for row in file][1:]
    
    for i in range(len(allRows)):
        allRows[i]=allRows[i].split()
        
    return allRows


def dataToDict(fileData):

    #Stop words to ignore, taken from the list of popular stopwords google used to use in their search algorithm
    uselessWords = ['the','i', 'a', 'about', 'an', 'are', 'as', 'at', 'be', 'by', 'for', 'from', 'how', 'in', 'is'
                   'it', 'of', 'on', 'or', 'that', 'the', 'this', 'to', 'was', 'what', 'when', 'where', 'who',
                    'will', 'with']

    wordDict={}

    for para in fileData[:50]:#
        target = para[1]
        paragraph = para[2].translate(str.maketrans('','',string.punctuation)).lower()

        splitPara = paragraph.split()
        cleanPara =[]

        if target not in wordDict.keys():
            wordDict[target] = []
            wordDict[target].append(cleanPara)
        else:
            wordDict[target].append(cleanPara)

        for word in splitPara:
            if word not in uselessWords:
                cleanPara.append(word)
                
    return wordDict


def output(results): 
    int = 0

    file = open('jizz216.csv', 'a') 
    file.write('id,class'+'\n')
    
    for target in results: 
        int += 1
        file.write(str(int)+','+target+'\n')  

    file.close() 

                
Array = np.asarray(readFile('trg.csv'))
forModel = dataToDict(Array)


##NAIVE BAYES CALCULATION

In [112]:
def runNB(inputDict):
    
    countDict = {}
    
    allWords = []
    
    classTotals = []

    #look at each classifier, and each paragraph classified under this classifier
    for target in inputDict: 
        
        classWordsDuplicate = []
        wordCount = []

        #finding the total words in each class
        for para in inputDict[target]: 
            for word in para: 
                classWordsDuplicate.append(word)
                    
        #getting the list of unique words from the total words list
        uniqueWords = np.unique(classWordsDuplicate)
        
        #adding unique words to the vocab
        allWords.extend(uniqueWords)
        
        #total word value for each class
        classWordAmount = len(classWordsDuplicate)
        
        #getting the count for each unique word, adding 1 to avoid multiplication by 0
        for word in uniqueWords: 
            wordAmount = classWordsDuplicate.count(word) + 1
            wordCount.append(wordAmount)
        
        #taking class unique words & their word counts to create the Bag of Words
        classWordCounts = dict(zip(uniqueWords, wordCount)) 
        
        #taking the dictionary we just created, and assigning it to its respective class
        if target not in countDict.keys(): 
            countDict[target] = classWordCounts 

        #saving the word count for each class to calculate denominator for NB later
        classTotals.append(classWordAmount) 

    #beginning the prior calculation for each class
    totalParagraphs = 0
    classParaCounts = []
    priorList = []
    
    #getting the class occurence count and total row amount
    for key in inputDict:
        classParaCounts.append(len(inputDict[key]))
        totalParagraphs += len(inputDict[key])
    
    #calculating the priors, using log as we will log the feature conditionals as well
    for val in classParaCounts:
        priorList.append(math.log(val/totalParagraphs))
        
    outputDict = {}
   
    #getting the count of all words in the input data
    wordAmountConstant = len(allWords)
    
    #saving the calculated denominator values
    denomValues = []
    for val in classTotals:
        denomValues.append(val+wordAmountConstant)

    
    #preparing our final bag of words to send to predictive function for classification, first adding priors
    counter = 0
    for key in inputDict:
        if key not in outputDict:
            outputDict[key] = {'Prior'+key: priorList[counter]} #math.log()
        counter += 1
    
    #then merging the output dictionary with our priors with the bag of words
    counter = 0
    for target,classDict in countDict.items():
        for key,occurs in classDict.items():
            countDict[target][key] = math.log(occurs/denomValues[counter])
        
        for key, occurs in classDict.items():
            outputDict[target][key] = occurs
            
        counter += 1

    return outputDict


readyToPredict = runNB(forModel)

testData = getTestData("tst.csv")


##MAKING PREDICTION 

In [121]:
def makePrediction(toBePredicted, bagOfWords):

    uselessWords = ['the','i', 'a', 'about', 'an', 'are', 'as', 'at', 'be', 'by', 'for', 'from', 'how', 'in', 'is'
               'it', 'of', 'on', 'or', 'that', 'the', 'this', 'to', 'was', 'what', 'when', 'where', 'who',
                'will', 'with']
    
    classLikeliness = [0  ,0  ,0  ,0 ]
    targetName = ['A','B','E','V']
    priors = []
    
    #adding the priors to a local priors list
    for target, probabilityDict in bagOfWords.items():
        for key, val in probabilityDict.items():
            if key == 'PriorA':
                priors.append(probabilityDict['PriorA'])
            elif key == 'PriorB':
                priors.append(probabilityDict['PriorB'])
            elif key == 'PriorE':
                priors.append(probabilityDict['PriorE'])
            elif key == 'PriorV':
                priors.append(probabilityDict['PriorV'])

    #looking at every word in the paragraph, and based on its likeliness, adding its score to each classifier respectively
    for word in toBePredicted:
        if word not in uselessWords:
            for target, probabilityDict in bagOfWords.items():
                for key, val in probabilityDict.items():
                    if word == key:
                        if target == 'A':
                            classLikeliness[0] += val
                        if target == 'B':
                            classLikeliness[1] += val
                        if target == 'E':
                            classLikeliness[2] += val
                        if target == 'V':
                            classLikeliness[3] += val

    #adding the logged priors to each classes likeliness score
    for i in range(len(priors)):
        classLikeliness[i] += priors[i]

    prediction = targetName[classLikeliness.index(min(classLikeliness))]
    
    return(prediction) # for every paragraph we input, spit out either A B E or V

finalPredictions = []

#looking through every row of the test data, and predicting its class
for row in testData:
    finalPredictions.append(makePrediction(row, readyToPredict))


output(finalPredictions)