## Report

### Chosen Representation
The input csv file was converted into a data frame with abstract and class columns containing the corresponding columns in the csv file.
The classes were converted from letters "A", "B", "E", "V" to numbers 0, 1, 2, 3 respectively.  
These were subsequently converted back into the letters when needing to submit for Kaggle.

### Data Processing
Punctuation was removed from the raw data via `string.punctuation`. This prevents words which should be the same be affected by punctuation. This should make our model more generalisable. 

### Method Extensions

#### Stopwords
Stopwords are words which add to the length of a text but do not contribute to the overall meaning of the text. Some very common example are "the" and "a". This allows us to focus on more important words which actually differentiate classes.

#### Underflow prevention
It is common to encounter numeric instability when using a Naive Bayes classifier. When multiplying many small numbers together, we can encounter underflow. This causes the value of a variable to be set to 0, which prevents any comparisons when choosing a class.

The severity of this issue is correlated with input text length and number of words.

### Implementation
Naive Bayes was implemented with smoothing of constant 1.
First the data was read in and preprocessed. Next the 1000 most common words were counted using `Counter`. This represents our features for each abstract text.

The priors and likelihoods are calculated using the determined features and abstracts. From this it is possible to calculate the posteriors for each word.

To find the predicted class for each word, we use the posteriors and priors multiplied together depending on what words are present in the test text.

### Results and Performance

10-fold cross validation was implemented to test the models for accurate results. Extended Naive Bayes with stopwords achieved 93% on Kaggle.

We can see from the following data that each extension adds more accuracy and reduces the standard deviation of predictions.

The most impressive gain in accuracy is from stopwords. This is obvious by looking at the top 10 most common words before removing stop words. Nine out of ten are stop words, with very high frequency counts.
`('the', 46507), ('of', 36531), ('and', 24296), ('a', 16588), ('in', 16082), ('to', 12778), ('that', 7743), ('is', 7540), ('genes', 6396), ('with', 6185)`

Naive Bayes

| Mean  | Median | Standard Deviation |
| :--:  |:-----: | :-----: |
| 0.527 | 0.539  |  0.0262 |

Extended with Stopwords

| Mean  | Median | Standard Deviation |
| :--:  |:-----: | :-----: |
| 0.905 | 0.908  |  0.0144 |

Extended with Stopwords + Underflow prevention

| Mean  | Median | Standard Deviation |
| :--:  |:-----: | :-----: |
| 0.910 | 0.908  |  0.0142 |



## Code

### Read in Data

In [1]:
import csv
import numpy as np 
def readCsv(filename):
    with open(f'{filename}.csv', newline='') as csvfile:
        data = list(csv.reader(csvfile))
    data = np.array(data[1:])
    return data
data = readCsv('trg')

### Preprocessing Data

We want to convert the letter values for the classes into ints so they are easier to work with.

In [2]:
for i, clazz in enumerate(data[:,1]):
    if clazz == "A":
        data[i][1] = 0
    elif clazz == "B":
        data[i][1] = 1
    elif clazz == "E":
        data[i][1] = 2
    elif clazz == "V":
        data[i][1] = 3

Next, we want to remove punctuations

In [3]:
from collections import Counter
import string
counts = Counter()

for i, abstract in enumerate(data[:,2]):
    # Remove punctuation
    abstract = abstract.translate(str.maketrans('', '', string.punctuation))
    words = abstract.split()
    counts.update(words)
    data[i,2] = " ".join(words)

In [4]:
rawData = data
data = {}
data['class'] = [int(clazz) for clazz in rawData[:, 1]]
data['abstract'] = rawData[:, 2]

In [5]:
print(len(data['abstract']))

4000


### Construct Attributes
Top 1000 words

In [6]:
print(len(counts))
print(counts.most_common(50))

31424
[('the', 46507), ('of', 36531), ('and', 24296), ('a', 16588), ('in', 16082), ('to', 12778), ('that', 7743), ('is', 7540), ('genes', 6396), ('with', 6185), ('for', 5830), ('sequence', 5184), ('gene', 5097), ('from', 5081), ('are', 4901), ('was', 4488), ('by', 4317), ('genome', 3728), ('protein', 3503), ('were', 3428), ('as', 3291), ('this', 3136), ('which', 3048), ('an', 3046), ('we', 3030), ('have', 2757), ('amino', 2635), ('two', 2590), ('these', 2549), ('proteins', 2520), ('sequences', 2430), ('acid', 2128), ('human', 2049), ('has', 2048), ('be', 2036), ('dna', 1993), ('been', 1914), ('other', 1843), ('on', 1823), ('at', 1809), ('analysis', 1792), ('cdna', 1774), ('identified', 1645), ('region', 1545), ('or', 1410), ('found', 1384), ('not', 1353), ('chromosome', 1349), ('expression', 1289), ('also', 1250)]


In [7]:
def getAttributes(abstracts):
    word_counts = Counter()
    for abstract in abstracts:
        word_list = abstract.split()
        word_counts.update(word_list)
    attributes = [word for word,count in counts.most_common(1000)]
    return attributes

In [8]:
def extractFeatures(attributes, abstracts):
    feature_counts = []
    for abstract in abstracts:
        new_row = []
        for word in attributes:
            new_row.append(abstract.count(word))
        feature_counts.append(new_row)
    return feature_counts

In [9]:
def getPriors(classes):
    priors = Counter(classes)
    for prior in priors:
        priors[prior] /= len(classes)
    return priors

In [10]:
def getLikelihoods(priors, attributes, classes, abstracts):
    # Generate word counts for each class
    counts_for_class = [Counter() for i in range(4)]
    for clazz, abstract in zip(classes, abstracts):
        cnts = Counter([word for word in abstract.split() if word in attributes])
        counts_for_class[int(clazz)].update(cnts)
    # Convert these word counts into likelihoods per class
    likelihoods = [dict() for i in range(4)]
    for clazz in priors:
        clazz = int(clazz)
        for word in attributes:
            likelihood = (counts_for_class[clazz][word] + 1) / (sum(counts_for_class[clazz].values())+len(attributes))
            likelihoods[clazz][word] = likelihood
    return likelihoods

In [11]:
def getProbabilityData(data):
    classes = data['class']
    abstracts = data['abstract']
    attributes = getAttributes(abstracts)
    
    feature_counts = extractFeatures(attributes, abstracts)
    priors = getPriors(classes)
    likelihoods = getLikelihoods(priors, attributes, classes, abstracts)
    
    return attributes, priors, likelihoods

The `attributes` variable represents the top 1000 most common words in the abstracts data

### Testing Model

In [12]:
def calculate_posterior_naive(clazz, priors, likelihoods, words):
        posterior = priors[clazz]
        for word in words:
            posterior *= likelihoods[clazz].get(word)
        return posterior
    
import math
def calculate_posterior_log(clazz, priors, likelihoods, words):
        posterior = math.log(float(priors[clazz]))
        for word in words:
            posterior += math.log(likelihoods[clazz].get(word))
        return posterior

In [13]:
def makePrediction(test_abstract, likelihoods, priors, attributes, log=False):
    test_words = [word for word in test_abstract.split() if word in attributes]
    
    posteriors = []
    for clazz in range(4):
        if log:
            posterior = calculate_posterior_log(clazz, priors, likelihoods, test_words)
        else: 
            posterior = calculate_posterior_naive(clazz, priors, likelihoods, test_words)
        posteriors.append(posterior)
    return np.argmax(posteriors)

In [14]:
def makePredictions(abstracts, likelihoods, priors, attributes, log=False):
    predictions = []
    for abstract in abstracts:
        prediction = makePrediction(abstract, likelihoods, priors, attributes, log)
        predictions.append(prediction)
    return predictions

In [15]:
def checkAccuracy(predictions, actualClasses):
    totalCorrect = 0
    for prediction, actualClass in zip(predictions, actualClasses):
        if prediction == int(actualClass):
            totalCorrect += 1
    return totalCorrect / len(predictions)

In [16]:
def getTrainTestSplit(start, end, data):
    test = {}
    test['class'] = data['class'][start: end]
    test['abstract'] = data['abstract'][start:end]
    
    train = {}
    train['abstract'] = np.concatenate((data['abstract'][:start], data['abstract'][end:]))
    train['class'] = np.concatenate((data['class'][:start], data['class'][end:]))

    return train, test

def crossValidate(data, folds=10, log=False):
    data['class'] = [int(clazz) for clazz in data['class']]
    percentageTrain = 1 / folds 
    trainLen = int(len(data['class']) * percentageTrain)
    
    accuracies = []
    for fold in range(folds):
        start = fold * trainLen
        end = start + trainLen
        train, test = getTrainTestSplit(start, end, data)
        attributes, priors, likelihoods = getProbabilityData(train)
        predictions = makePredictions(test['abstract'], likelihoods, priors, attributes, log)
        accuracy = checkAccuracy(predictions, test['class'])
        print(f'Accuracy: {accuracy}')
        accuracies.append(accuracy)
    return accuracies

#### Naive Bayes

In [17]:
accuracies = crossValidate(data, 10)

Accuracy: 0.4675
Accuracy: 0.5425
Accuracy: 0.5025
Accuracy: 0.5475
Accuracy: 0.5225
Accuracy: 0.55
Accuracy: 0.5525
Accuracy: 0.545
Accuracy: 0.505
Accuracy: 0.535


In [18]:
print(np.mean(accuracies))
print(np.std(accuracies))
print(np.median(accuracies))

0.5269999999999999
0.02621545345783666
0.5387500000000001


#### With Stopwords

Remove all stopwords

In [19]:
stopwords = ["able","about","above","abroad","according","accordingly","across","actually","adj","after","afterwards","again","against","ago","ahead","ain't","all","allow","allows","almost","alone","along","alongside","already","also","although","always","am","amid","amidst","among","amongst","an","and","another","any","anybody","anyhow","anyone","anything","anyway","anyways","anywhere","apart","appear","appreciate","appropriate","are","aren't","around","as","a's","aside","ask","asking","associated","at","available","away","awfully","back","backward","backwards","be","became","because","become","becomes","becoming","been","before","beforehand","begin","behind","being","believe","below","beside","besides","best","better","between","beyond","both","brief","but","by","came","can","cannot","cant","can't","caption","cause","causes","certain","certainly","changes","clearly","c'mon","co","co.","com","come","comes","concerning","consequently","consider","considering","contain","containing","contains","corresponding","could","couldn't","course","c's","currently","dare","daren't","definitely","described","despite","did","didn't","different","directly","do","does","doesn't","doing","done","don't","down","downwards","during","each","edu","eg","eight","eighty","either","else","elsewhere","end","ending","enough","entirely","especially","et","etc","even","ever","evermore","every","everybody","everyone","everything","everywhere","ex","exactly","example","except","fairly","far","farther","few","fewer","fifth","first","five","followed","following","follows","for","forever","former","formerly","forth","forward","found","four","from","further","furthermore","get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","had","hadn't","half","happens","hardly","has","hasn't","have","haven't","having","he","he'd","he'll","hello","help","hence","her","here","hereafter","hereby","herein","here's","hereupon","hers","herself","he's","hi","him","himself","his","hither","hopefully","how","howbeit","however","hundred","i'd","ie","if","ignored","i'll","i'm","immediate","in","inasmuch","inc","inc.","indeed","indicate","indicated","indicates","inner","inside","insofar","instead","into","inward","is","isn't","it","it'd","it'll","its","it's","itself","i've","just","k","keep","keeps","kept","know","known","knows","last","lately","later","latter","latterly","least","less","lest","let","let's","like","liked","likely","likewise","little","look","looking","looks","low","lower","ltd","made","mainly","make","makes","many","may","maybe","mayn't","me","mean","meantime","meanwhile","merely","might","mightn't","mine","minus","miss","more","moreover","most","mostly","mr","mrs","much","must","mustn't","my","myself","name","namely","nd","near","nearly","necessary","need","needn't","needs","neither","never","neverf","neverless","nevertheless","new","next","nine","ninety","no","nobody","non","none","nonetheless","noone","no-one","nor","normally","not","nothing","notwithstanding","novel","now","nowhere","obviously","of","off","often","oh","ok","okay","old","on","once","one","ones","one's","only","onto","opposite","or","other","others","otherwise","ought","oughtn't","our","ours","ourselves","out","outside","over","overall","own","particular","particularly","past","per","perhaps","placed","please","plus","possible","presumably","probably","provided","provides","que","quite","qv","rather","rd","re","really","reasonably","recent","recently","regarding","regardless","regards","relatively","respectively","right","round","said","same","saw","say","saying","says","second","secondly","see","seeing","seem","seemed","seeming","seems","seen","self","selves","sensible","sent","serious","seriously","seven","several","shall","shan't","she","she'd","she'll","she's","should","shouldn't","since","six","so","some","somebody","someday","somehow","someone","something","sometime","sometimes","somewhat","somewhere","soon","sorry","specified","specify","specifying","still","sub","such","sup","sure","take","taken","taking","tell","tends","th","than","thank","thanks","thanx","that","that'll","thats","that's","that've","the","their","theirs","them","themselves","then","thence","there","thereafter","thereby","there'd","therefore","therein","there'll","there're","theres","there's","thereupon","there've","these","they","they'd","they'll","they're","they've","thing","things","think","third","thirty","this","thorough","thoroughly","those","though","three","through","throughout","thru","thus","till","to","together","too","took","toward","towards","tried","tries","truly","try","trying","t's","twice","two","un","under","underneath","undoing","unfortunately","unless","unlike","unlikely","until","unto","up","upon","upwards","us","use","used","useful","uses","using","usually","v","value","various","versus","very","via","viz","vs","want","wants","was","wasn't","way","we","we'd","welcome","well","we'll","went","were","we're","weren't","we've","what","whatever","what'll","what's","what've","when","whence","whenever","where","whereafter","whereas","whereby","wherein","where's","whereupon","wherever","whether","which","whichever","while","whilst","whither","who","who'd","whoever","whole","who'll","whom","whomever","who's","whose","why","will","willing","wish","with","within","without","wonder","won't","would","wouldn't","yes","yet","you","you'd","you'll","your","you're","yours","yourself","yourselves","you've","zero","a","how's","i","when's","why's","b","c","d","e","f","g","h","j","l","m","n","o","p","q","r","s","t","u","uucp","w","x","y","z","I","www","amount","bill","bottom","call","computer","con","couldnt","cry","de","describe","detail","due","eleven","empty","fifteen","fifty","fill","find","fire","forty","front","full","give","hasnt","herse","himse","interest","itse”","mill","move","myse”","part","put","show","side","sincere","sixty","system","ten","thick","thin","top","twelve","twenty","abst","accordance","act","added","adopted","affected","affecting","affects","ah","announce","anymore","apparently","approximately","aren","arent","arise","auth","beginning","beginnings","begins","biol","briefly","ca","date","ed","effect","et-al","ff","fix","gave","giving","heres","hes","hid","home","id","im","immediately","importance","important","index","information","invention","itd","keys","kg","km","largely","lets","line","'ll","means","mg","million","ml","mug","na","nay","necessarily","nos","noted","obtain","obtained","omitted","ord","owing","page","pages","poorly","possibly","potentially","pp","predominantly","present","previously","primarily","promptly","proud","quickly","ran","readily","ref","refs","related","research","resulted","resulting","results","run","sec","section","shed","shes","showed","shown","showns","shows","significant","significantly","similar","similarly","slightly","somethan","specifically","state","states","stop","strongly","substantially","successfully","sufficiently","suggest","thered","thereof","therere","thereto","theyd","theyre","thou","thoughh","thousand","throug","til","tip","ts","ups","usefully","usefulness","'ve","vol","vols","wed","whats","wheres","whim","whod","whos","widely","words","world","youd","youre"]
stopwords = set(stopwords)

for i, abstract in enumerate(data['abstract']):
    words = abstract.split()
    words = [word for word in words if word not in stopwords]
    data['abstract'][i] = " ".join(words)

In [20]:
accuracies = crossValidate(data, 10)

Accuracy: 0.9075
Accuracy: 0.8775
Accuracy: 0.8875
Accuracy: 0.89
Accuracy: 0.9075
Accuracy: 0.91
Accuracy: 0.905
Accuracy: 0.915
Accuracy: 0.9175
Accuracy: 0.9275


In [21]:
print(np.mean(accuracies))
print(np.std(accuracies))
print(np.median(accuracies))

0.9045
0.014439529078193665
0.9075


#### Underflow prevention + Stopwords

In [22]:
accuracies = crossValidate(data, 10, log=True)

Accuracy: 0.9075
Accuracy: 0.8775
Accuracy: 0.9025
Accuracy: 0.905
Accuracy: 0.9075
Accuracy: 0.93
Accuracy: 0.9075
Accuracy: 0.92
Accuracy: 0.9175
Accuracy: 0.9275


In [23]:
print(np.mean(accuracies))
print(np.std(accuracies))
print(np.median(accuracies))

0.9102500000000001
0.014206072645175396
0.9075


### Kaggle Submission

In [24]:
def changePredictionToLetterClass(predictions):
    for i, prediction in enumerate(predictions):
        if prediction == 0:
            predictions[i] = "A"
        elif prediction == 1:
            predictions[i] = "B"
        elif prediction == 2:
            predictions[i] = "E"
        elif prediction == 3:
            predictions[i] = "V"
    return predictions

In [28]:
def kaggleSubmission(data):
    test_data = readCsv('tst')
    attributes, priors, likelihoods = getProbabilityData(data)
    predictions = makePredictions(test_data[:,1], likelihoods, priors, attributes, log=True)
    predictions = changePredictionToLetterClass(predictions)

    output = list(zip(range(1,1001), predictions))

    with open('submission.csv', 'w') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(("id", "class"))
        for row in output:
            writer.writerow(row)

In [29]:
kaggleSubmission(data)