# Abstract Classification Report

### About

This naive bayes algorithm predicts the domain from the following: 
Archaea(A), Bacteria(B), Eukaryota(E) or Virus(V) on the abstracts from research papers 
about proteins taken from the MEDLINE database.


### Pre-processing

Since I only use NumPy to manipulate the data, I decided to first read the class and abstract rows in the csv into a list each, then used pop() to remove the headers. To remove the stop words in the abstracts I got a list of stop words and assigned the list a variable. I used split() on the abstracts so that I could iterate through all the words in the abstract and remove all stop words. In the end I am passing a NumPy array of a list of sub lists (each sub list is an abstract) of words. Also passing a NumPy array of classes.


### Data representation

I decided to represent my abstract data as a dictionary where we have a word frequency for each word within that abstract. I decided to use dictionaries because it lets me access each word’s frequency very easily by simply using a key/ value for loop and if I want to access a specific abstract, I can just use another for loop to do so.


### Method extension

For my method extension I tried a few, like n-grams, TF-IDF and complement naïve bayes. Using n-grams, specifically bigrams, I saw a huge decrease in my model’s accuracy all the way down to 50%. Which I justified by the fact that when you join two terms the likelihood of the same joint term occurring in the data becomes quite rare, so it does not really provide any useful training data for my model. Using TF-IDF I saw my model’s performance increase slightly but by a very small amount. The increase in performance, I justified by the fact that TF-IDF tries to make up for the fact that some words like ‘gene’ for example appear a lot in the different abstracts, so that word does not really help in classification, so TF-IDF makes up for this by assigning it a lower weight than words that do not occur in as many abstracts meaning the word ‘gene’ has a smaller effect on our classification.

Complement Naïve Bayes gave me the best increase in performance, which is because, instead of calculating the conditional probability of a word occurring in the class, it calculates the complement probability of that class i.e., the probability of the word occurring in other classes. At the end we take the class with the lowest value instead of the highest value because the class with the highest value means, the likelihood that, an abstract with those words occurring with that class is highly unlikely. This strategy of taking the complement of a class works especially well with datasets that have imbalanced classes, which our dataset happens to have.

So, the extended method I decided to use was Complement Naïve Bayes because it gave me the highest accuracy out of all the ones that I tried.


### Implementation
For my standard naïve bayes function I passed three arguments the training abstracts, training classes and the test abstracts. I stored each abstract in the training set into its respective class and stored them into list variables. Then found the frequency of each word in each class and stored them into dictionaries of word and frequency pairs. 

Then I did the calculations of the conditional probability of each word but iterated through the test abstract instead of the training abstract to avoid making any unnecessary calculations because not all words in the training set are in the test set. I also applied smoothing variables to the equation in case of a word not appearing within the class. I did this calculation for each class and stored the words in a dictionary of word and conditional probability for each class. Then I iterated through each abstract in the test set and each word in the abstract and did the final calculation for predicting the class. To avoid underflow due to multiplying very small numbers I got the log of each value in my prediction calculation and then, finally, I chose the class with the highest value as the classification for that test abstract. 

For my complement naïve bayes I created a similar function passing the exact same three arguments. But instead of calculating conditional probabilities I calculated the complement of that class. I did this by getting the total frequency of the word in all other classes’ training set and adding them together along with a smoothing variable and dividing that by the number of words within all other classes’ training set along with a smoothing variable I did this for all other classes. The prediction calculation was the exact same as the standard but instead of getting the max value I got the min because like I mentioned above, in the complement NB the higher-class values mean higher likely hood of those words not appearing within an abstract of that classification.


### Performance
standard NB score: 0.932 / complement NB score: 0.947
 
standard NB score: 0.929 / complement NB score: 0.947
 
standard NB score: 0.931 / complement NB score: 0.946
 
standard NB score: 0.930 / complement NB score: 0.946
 
standard NB score: 0.929 / complement NB score: 0.946
 
standard NB mean accuracy: 0.930 std: 0.001

complement NB mean accuracy: 0.946 std: 0.000

After running 10-fold cross validation five times we got a higher score for complement naïve bayes every time compared to the standard one. Overall, the complement naïve bayes gave us 1.6% increase in accuracy.

When running the model on the tst.csv dataset and then uploading it on Kaggle, the standard model had a 92% accuracy score while the complement naïve bayes had a score of 93.6% so again, a 1.6% increase in performance. Therefore, I can say that the extended model performs better than the standard model.


### Model Evaluation/Validation

To evaluate my model, I decided to use k-fold cross validation since it splits the data into k-folds so that each fold gets to be in the train and test set. This gives a good estimation of the performance of our model on the entire dataset. I decided to go for 10 k-fold because it splits my data set evenly and is universally regarded as a good number for k-fold due to its not very high bias/ variance. To calculate the accuracy mean I took the mean of the 10 results which is given by a score function which just compares the models’ class predictions to the test classes.








# Abstract Classification Code

In [1]:
import csv
import numpy as np
import random
import math

### Preprocessing trg.csv file

In [2]:
classes = []
abstracts = []
w_sep = []
new_words = []
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
with open("trg.csv") as csv_file:
    read_file = csv.reader(csv_file, delimiter = ',', quotechar='"')
    for row in read_file:
        classes.append(row[1])
        abstracts.append(row[2])
    classes.pop(0)
    abstracts.pop(0)
    for i in abstracts:
        spl = i.split()
        w_sep.append(spl)
    for i in w_sep:
        clean = [word for word in i if not word in stopwords]
        new_words.append(clean)
np_classes = np.array(classes)
np_abstracts = np.array(new_words, dtype=object)
#print(type(np_abstracts[:10]))
#print(np_classes[:100])

### Cross-validation function

In [3]:
def cross_validation(dataset, k):
    data = []
    index = list(range(0, len(dataset)))
    rand_index = random.sample(index, len(index))
    np_index = np.array(rand_index)
    for split in np.split(np_index, k):
        sp = split.tolist()
        data.append(sp)
    for i in range(k):
        test_index = data[i]
        train = data.copy()
        train.pop(i)
        train_index = [item for sublist in train for item in sublist]
        yield train_index, test_index

### Turns list of words into a dictionary of frequencies

In [4]:
def wordfreq(data):
    unique, counts = np.unique(data, return_counts=True)
    freq = dict(zip(unique, counts))
    return freq

### Standard Naive Bayes function

In [5]:
def naivebayes(X, y, X_test):
    classes_count = wordfreq(y)
    values = classes_count.values()
    total_classes = sum(values)
    words_A = []
    words_B = []
    words_E = []
    words_V = []
    all_words = []
    all_test_words = []
    cond_prob_A = {}
    cond_prob_B = {}
    cond_prob_E = {}
    cond_prob_V = {}
    predict = {}
    prior = {}
    final_prediction = []
    for key, value in classes_count.items():
        prior[key] = value / total_classes    
        
    for count, i in enumerate(y):
        if i == 'A':
            words_A.extend(X[count])
        elif i == 'B':
            words_B.extend(X[count])
        elif i == 'E':
            words_E.extend(X[count])
        elif i == 'V':
            words_V.extend(X[count])
        all_words.extend(X[count])
        
    all_unique_words = wordfreq(all_words)
    words_A_count = wordfreq(words_A)
    words_B_count = wordfreq(words_B)
    words_E_count = wordfreq(words_E)
    words_V_count = wordfreq(words_V)
    
    for i in X_test:
        all_test_words.extend(i)

    unique_words_test = wordfreq(all_test_words)
    for test_key in unique_words_test.keys():
        value_A = words_A_count.get(test_key)
        if value_A != None:
            cond_prob_A[test_key] = value_A + 1 / (len(words_A) + len(all_unique_words))
        else:
            cond_prob_A[test_key] = 0 + 1 / (len(words_A) + len(all_unique_words))
                
        value_B = words_B_count.get(test_key)
        if value_B != None:
            cond_prob_B[test_key] = value_B + 1 / (len(words_B) + len(all_unique_words))
        else:
            cond_prob_B[test_key] = 0 + 1 / (len(words_B) + len(all_unique_words))
                
        value_E = words_E_count.get(test_key)
        if value_E != None:
            cond_prob_E[test_key] = value_E + 1 / (len(words_E) + len(all_unique_words))
        else:
            cond_prob_E[test_key] = 0 + 1 / (len(words_E) + len(all_unique_words))
                
        value_V = words_V_count.get(test_key)   
        if value_V != None:
            cond_prob_V[test_key] = value_V + 1 / (len(words_V) + len(all_unique_words))
        else:
            cond_prob_V[test_key] = 0 + 1 / (len(words_V) + len(all_unique_words))
         

    for abstract in X_test:
        abs_count = wordfreq(abstract)
        for key, value in abs_count.items():
            predict['A'] = predict.get('A', 0) + value * math.log(cond_prob_A.get(key))
            predict['B'] = predict.get('B', 0) + value * math.log(cond_prob_B.get(key))
            predict['E'] = predict.get('E', 0) + value * math.log(cond_prob_E.get(key))
            predict['V'] = predict.get('V', 0) + value * math.log(cond_prob_V.get(key))

        final_A =  predict.get('A') + math.log(prior.get('A'))
        final_B =  predict.get('B') + math.log(prior.get('B'))
        final_E =  predict.get('E') + math.log(prior.get('E'))
        final_V =  predict.get('V') + math.log(prior.get('V'))
        final = {'A': final_A, 'B': final_B, 'E': final_E, 'V': final_V}    
        letter = max(final, key=final.get)    
        final_prediction.extend(letter)
        predict.clear()
        
    return final_prediction

### Complement Naive Bayes function 

In [6]:
def naivebayes_complement(X, y, X_test):
    classes_count = wordfreq(y)
    values = classes_count.values()
    total_classes = sum(values)
    total_docs = len(X)
    words_A = []
    words_B = []
    words_E = []
    words_V = []
    all_words = []
    all_test_words = []
    comp_prob_A = {}
    comp_prob_B = {}
    comp_prob_E = {}
    comp_prob_V = {}
    predict = {}
    prior = {}
    final_prediction = []
    for key, value in classes_count.items():
        prior[key] = value / total_classes    
        
    for count, i in enumerate(y):
        if i == 'A':
            words_A.extend(X[count])
        elif i == 'B':
            words_B.extend(X[count])
        elif i == 'E':
            words_E.extend(X[count])
        elif i == 'V':
            words_V.extend(X[count])
        all_words.extend(X[count])
        
    all_unique_words = wordfreq(all_words)
    words_A_count = wordfreq(words_A)
    words_B_count = wordfreq(words_B)
    words_E_count = wordfreq(words_E)
    words_V_count = wordfreq(words_V)
    
    for i in X_test:
        all_test_words.extend(i)
        
    unique_words_test = wordfreq(all_test_words)
    for test_key in unique_words_test.keys():
        value_A = words_A_count.get(test_key)
        if value_A != None:
            comp_prob_A[test_key] = ((words_B_count.get(test_key, 0) + words_E_count.get(test_key, 0) + words_V_count.get(test_key, 0)) + 1) / ((len(words_B) + len(words_E) + len(words_V)) + len(unique_words_test))
                
        value_B = words_B_count.get(test_key)
        if value_B != None:
            comp_prob_B[test_key] = ((words_A_count.get(test_key, 0) + words_E_count.get(test_key, 0) + words_V_count.get(test_key, 0)) + 1) / ((len(words_A) + len(words_E) + len(words_V)) + len(unique_words_test))
                
        value_E = words_E_count.get(test_key)
        if value_E != None:
            comp_prob_E[test_key] = ((words_A_count.get(test_key, 0) + words_B_count.get(test_key, 0) + words_V_count.get(test_key, 0)) + 1) / ((len(words_A) + len(words_B) + len(words_V)) + len(unique_words_test))
                
        value_V = words_V_count.get(test_key)   
        if value_V != None:
            comp_prob_V[test_key] = ((words_A_count.get(test_key, 0) + words_B_count.get(test_key, 0) + words_E_count.get(test_key, 0)) + 1) / ((len(words_A) + len(words_B) + len(words_E)) + len(unique_words_test))
         

    for abstract in X_test:
        abs_count = wordfreq(abstract)
        for key, value in abs_count.items():
            prob_A = comp_prob_A.get(key)
            if prob_A != None:
                predict['A'] = predict.get('A', 0) + value * math.log(comp_prob_A.get(key))
            else:
                predict['A'] = predict.get('A', 0) + 0
                
            prob_B = comp_prob_B.get(key)
            if prob_B != None:
                predict['B'] = predict.get('B', 0) + value * math.log(comp_prob_B.get(key))
            else:
                predict['B'] = predict.get('B', 0) + 0
                
            prob_E = comp_prob_E.get(key)
            if prob_E != None:
                predict['E'] = predict.get('E', 0) + value * math.log(comp_prob_E.get(key))
            else:
                predict['E'] = predict.get('E', 0) + 0
                
            prob_V = comp_prob_V.get(key)
            if prob_V != None:
                predict['V'] = predict.get('V', 0) + value * math.log(comp_prob_V.get(key))
            else:
                predict['V'] = predict.get('V', 0) + 0

        final_A =  predict.get('A') + math.log(prior.get('A'))
        final_B =  predict.get('B') + math.log(prior.get('B'))
        final_E =  predict.get('E') + math.log(prior.get('E'))
        final_V =  predict.get('V') + math.log(prior.get('V'))
        final = {'A': final_A, 'B': final_B, 'E': final_E, 'V': final_V}    
        letter = min(final, key=final.get)    
        final_prediction.extend(letter)
        predict.clear()
        
    return final_prediction

### Scoring function

In [7]:
def score(nb, y):
    n = np.array(nb)
    correct = (n == y)
    accuracy = correct.sum() / correct.size
    
    return accuracy

### Runs 10 fold CV on NB functions and gives the mean accuracy

In [8]:
total_stand = []
total_comp = []
cv_stand = []
cv_comp = []
for i in range(5):
    for train_index, test_index in cross_validation(classes, 10):
        X_train, X_test = np_abstracts[train_index], np_abstracts[test_index]
        y_train, y_test = np_classes[train_index], np_classes[test_index]
        nb_stand = naivebayes(X_train, y_train, X_test)
        nb_comp = naivebayes_complement(X_train, y_train, X_test)
        score_stand = score(nb_stand, y_test)
        score_comp = score(nb_comp, y_test)
        cv_stand.append(score_stand)
        cv_comp.append(score_comp)
    cv_stand_avg = (np.mean(np.array(cv_stand)))
    cv_comp_avg = (np.mean(np.array(cv_comp)))
    print(f"standard NB score: {cv_stand_avg:.3f}")
    print(f"complement NB score: {cv_comp_avg:.3f}")
    print(" ")
    total_stand.append(cv_stand_avg)
    total_comp.append(cv_comp_avg)
    cv_stand.clear()
    cv_comp.clear()
overall_avg_stand = np.mean(np.array(total_stand))
overall_avg_comp = np.mean(np.array(total_comp))
overall_std_stand = np.std(np.array(total_stand))
overall_std_comp = np.std(np.array(total_comp))
print(f"standard NB mean accuracy: {overall_avg_stand:.3f} std: {overall_std_stand:.3f}")
print(f"complement NB mean accuracy: {overall_avg_comp:.3f} std: {overall_std_comp:.3f}")

standard NB score: 0.930
complement NB score: 0.946
 
standard NB score: 0.929
complement NB score: 0.948
 
standard NB score: 0.929
complement NB score: 0.946
 
standard NB score: 0.927
complement NB score: 0.943
 
standard NB score: 0.929
complement NB score: 0.947
 
standard NB mean accuracy: 0.929 std: 0.001
complement NB mean accuracy: 0.946 std: 0.002
