Report section
====

Chosen representation & data preprocessing:
----

1. The data representation is the occurrence frequency of each word in the abstract. We accept the assumption that each word appears independently. Word combinations in the abstracts are joined with "-" and can therefore be treated as single words, so no separate handling of word combinations is needed. To build the representation, the abstract is simply split on spaces. 
2. The data preprocessing strategy is to remove common words, single characters and numbers.   
    Reasons:  
    a. Common words such as "it", "the", "a" are of no use for classification.  
    b. Numbers are usually specific figures reported in a paper; they are unique and very unlikely to appear again.  
    c. Single characters like 'a', 'b', 'c', '-' are meaningless.    
        
    How to achieve this:  
    a. English has a fixed set of common stop words; I hardcoded them in a list. During preprocessing, any word in this list is discarded.  
    b. The Python 3 string class has an isdigit() method. It is used to check whether a word is a number; if so, the word is discarded.  
    c. The length of each word is checked; if it is 1, the word is discarded.   
    After these three steps, the data sets consist of more distinctive words.
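
The three filters can be sketched as a single function. This is a minimal illustration, not the classifier's own code: the function name `preprocess` and the shortened `STOP_WORDS` set are assumptions made for the example.

```python
# Illustrative sketch: STOP_WORDS is a shortened, hypothetical stop list.
STOP_WORDS = {"it", "the", "a", "of", "is", "in", "and"}


def preprocess(abstract):
    """Split on single spaces, then drop stop words, digit strings and single characters."""
    kept = []
    for word in abstract.split(" "):
        if word in STOP_WORDS:   # common words carry no class signal
            continue
        if word.isdigit():       # pure digit strings such as "42"
            continue
        if len(word) == 1:       # single characters such as "-" or "b"
            continue
        kept.append(word)
    return kept


print(preprocess("the rate of b - 42 is found in co-expressed genes"))
# -> ['rate', 'found', 'co-expressed', 'genes']
```

Note that the hyphenated term "co-expressed" survives as one token, which is why no separate handling of word combinations is needed.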


Method extensions and implementation (standard and extended)
----
1. To calculate each posterior probability, use Bayes' theorem:    
$$
Posterior ∝ Likelihood × Prior
$$
For this assignment, this equation can be applied as follows:    
$$
P(class | abstract) ∝ P(abstract | class) × P(class)
$$

Since the assumption is that the words appear independently, the equation for an abstract of n words becomes:    
$$
P(class | abstract) ∝ P(class) × \prod_{i = 1}^{n} P(word_i | class)
$$

The model calculates this probability for each class and chooses the highest one as the prediction. <br>
This is the maximum a posteriori (MAP) rule:
$$
h_{MAP} = \arg\max_{h∈H} P(h | D) = \arg\max_{class} \left[ P(class) × \prod_{i = 1}^{n} P(word_i | class) \right]
$$
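
As a small illustration of the MAP rule, the sketch below scores two classes for a three-word abstract. The priors and per-word likelihoods are made-up numbers chosen for the example, not values estimated from the assignment data, and `map_class` is a hypothetical helper name.

```python
# Hypothetical priors and word likelihoods for two classes, for illustration only.
prior = {"A": 0.25, "B": 0.75}
likelihood = {
    "A": {"gene": 0.5, "protein": 0.25, "cell": 0.25},
    "B": {"gene": 0.25, "protein": 0.25, "cell": 0.5},
}


def map_class(words):
    """Return the class maximising P(class) * prod P(word_i | class), plus all scores."""
    scores = {}
    for label in prior:
        score = prior[label]
        for word in words:
            score *= likelihood[label][word]
        scores[label] = score
    return max(scores, key=scores.get), scores


best, scores = map_class(["gene", "gene", "protein"])
print(best, scores)
# -> A {'A': 0.015625, 'B': 0.01171875}
```

Here class A wins even though its prior is smaller, because the word likelihoods outweigh the prior.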
    
2. Laplacian smoothing. Not every word in the test set appears in the training set, so some word probabilities can be zero. A single zero makes the whole product for that class zero, even when the other words in the abstract are informative. To prevent this, Laplacian smoothing adds 1 to every numerator and adds to every denominator the number of unique words in the training set plus the number of test-set words that never appear in the training set. The unsmoothed likelihood is:
$$
Likelihood = \prod_{i = 1}^{n} P(word_i | class)
$$
    
    After Laplacian smoothing, each factor becomes:
$$
P(word_i | class) = \frac{wordAppearTimes + 1}{totalCount + uniqueCount + unseenCount}
$$
    Here wordAppearTimes is the number of times word_i occurs in the class, totalCount is the number of word occurrences in the class, uniqueCount is the size of the training vocabulary, and unseenCount is the number of test-set word occurrences that are not in the training vocabulary.
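
The smoothed estimate can be sketched as follows. The function name `smoothed_prob`, its parameter names and the toy counts are assumptions made for the example; the denominator follows the description above (words in the class, plus training vocabulary size, plus test-set words unseen in training).

```python
from collections import Counter


def smoothed_prob(word, class_counts, vocab_size, unseen_count):
    """(count + 1) / (words in class + vocabulary size + unseen test words)."""
    total = sum(class_counts.values())
    return (class_counts.get(word, 0) + 1) / (total + vocab_size + unseen_count)


# Toy class with 6 word occurrences; assume a training vocabulary of 3 words
# and 1 test-set word occurrence that never appeared in training.
class_counts = Counter(["gene", "gene", "gene", "cell", "cell", "protein"])
p_seen = smoothed_prob("gene", class_counts, vocab_size=3, unseen_count=1)
p_unseen = smoothed_prob("virus", class_counts, vocab_size=3, unseen_count=1)
print(p_seen, p_unseen)
# -> 0.4 0.1
```

The unseen word "virus" gets a small non-zero probability, so it no longer wipes out the product for the class.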
        
3. Use log values to represent each probability. Computers store numbers with finite floating-point precision, so multiplying many values close to 0 underflows and the product becomes inaccurate. Each word probability can be very close to 0, but taking logarithms avoids the problem. The logarithm is monotonically increasing, so it does not change the relative ordering of the classes, and it turns multiplication into addition. The final equation for the posterior, where C is a term that does not depend on the class, is therefore:
$$
    \log_2 P(class | abstract) = \log_2 P(class) + \sum_{i=1}^{n} \log_2 P(word_i | class) + C
$$
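
The underflow problem is easy to reproduce. The sketch below uses 400 hypothetical word probabilities of 0.001 each; the numbers are chosen only to trigger the effect and do not come from the assignment data.

```python
import math

# 400 hypothetical word probabilities of 1e-3 each.
probs = [1e-3] * 400

product = 1.0
for p in probs:
    product *= p  # underflows to 0.0 well before the loop ends

# Summing base-2 logarithms instead keeps the score finite.
log_sum = sum(math.log(p, 2) for p in probs)

print(product)  # 0.0
print(log_sum)  # a finite negative number
```

With the plain product every class would end up with a score of exactly 0.0 and could no longer be ranked, whereas the log-sum still distinguishes them.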

4. The standard Naive Bayes model uses no preprocessing, no Laplacian smoothing and no logarithms. Words whose probability in a class is 0 are skipped instead of being multiplied in as 0. It uses the following equation:
$$
   P(class | abstract) ∝ P(class) × \prod_{i = 1}^{n} P(word_i | class) 
$$

    

Performance (training and validation) results
----
1. Training and validation use K-fold cross-validation. I used 5 folds because fewer folds take less time to run. In each fold, 80% of the training data is used for fitting and the remaining 20% for validation. This ensures that each model is fitted on enough data for the validation score to be meaningful. <br>
2. The cross-validation scores of the extended Naive Bayes model are [0.94875, 0.94875, 0.96375, 0.94125, 0.945], and its predictions on the test set reach 95.666% accuracy on Kaggle. 
3. The standard model is trained and validated with the same k-fold procedure. Its validation scores are [0.24875, 0.21375, 0.24125, 0.22875, 0.175]. 
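
The 80%/20% split per fold can be sketched with plain index lists. This is a simplified stand-in that uses the standard library `random` module rather than NumPy; the function name `kfold_indices` is hypothetical and the sketch is not the splitting class used for the reported scores.

```python
import random


def kfold_indices(n_samples, n_splits=5, seed=4321):
    """Yield (train_idx, valid_idx) pairs that partition the shuffled sample indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for i in range(n_splits):
        stop = start + base + (1 if i < extra else 0)
        valid_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, valid_idx
        start = stop


for train_idx, valid_idx in kfold_indices(10, n_splits=5):
    print(len(train_idx), len(valid_idx))  # 8 2 on every fold
```

Every sample lands in exactly one validation fold, so each of the five scores is computed on data the corresponding model has not seen.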
  

In [1]:
import numpy as np
import pandas as pd
import os
import math

In [2]:
training_set = pd.read_csv(os.path.join("data", "trg.csv"))
test_set = pd.read_csv(os.path.join("data", "tst.csv"))

## KFolds class and Extended Naive Bayes Classifier
The classifier class is used in the following steps: first create the model, then call fit(X, y) to fit it, then call predict(testSet) to get the predictions. It also has a method results_To_csv(filename) that saves the predictions to a CSV file, which can be uploaded to Kaggle to see the result.  
The KFolds class is used for k-fold cross-validation. First create it with the number of folds as the argument, then iterate over its split method in a for loop to get each training set and validation set.

In [3]:
class MyBayesCLF:
    def __init__(self):
        self.__data_Class = {}
        self.__each_Class_Words = {}
        self.__each_Class_Word_Num = {}
        self.__p = {}
        self.__unique_Words = []
        self.__result = []
        self.__training_set_size = 0
        self.__each_Class_Words_Count = {}
        self.stop_Words = ['ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than']

    def fit(self, X, y):
        """
        Fit the model: count class frequencies and per-class word occurrences
        from the abstracts X and their labels y.
        """
        for [abstract, label] in zip(X, y):
            if label not in self.__data_Class:
                self.__data_Class[label] = 1
            else:
                self.__data_Class[label] += 1
            words = self.__preprocess(abstract.split(" "))
            # words = abstract.split(" ")
            if label not in self.__each_Class_Words:
                self.__each_Class_Words[label] = words
            else:
                self.__each_Class_Words[label] += words
        self.__training_set_size = len(X)
        self.__calculate_Word_count()
        self.__calculate_Unique_Words()
        self.__calculate_Prob_Of_Class()
        self.__count_Each_Class_Words()

    def predict(self, X_test):
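        # count = number of test-set word occurrences that are not in the
        # training vocabulary; it is added to the smoothing denominator
        count = 0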
        processed_WordList = []
        for index, value in X_test.items():
            words = self.__preprocess(value.split(" "))
            # words = X_test[i].split(" ")

            processed_WordList.append(words)
            for g in words:
                if g not in self.__unique_Words:
                    count += 1
        result = []
        for sample in processed_WordList:
            result.append(self.__predict_Item(sample, count))
        self.__result = result
        return result

    def __predict_Item(self, list_Words, count):
        prob_Dict = {label: value for label, value in self.__p.items()}
#         print(prob_Dict)
#         print(list_Words)
        for word in list_Words:
            for label in prob_Dict.keys():
                # dict lookup: same result as searching the word list, but O(1)
                if word in self.__each_Class_Word_Num[label]:
                    prob_Dict[label] +=  math.log((self.__each_Class_Word_Num[label][word]+1) / (self.__each_Class_Words_Count[label] + len(self.__unique_Words) + count), 2)
                else:
                    prob_Dict[label] += math.log( 1 / (self.__each_Class_Words_Count[label] + len(self.__unique_Words) + count), 2)
#         print(prob_Dict)
        sorted_prob_Dict = sorted(prob_Dict.items(), key = lambda a: a[1], reverse = True)
#         print(sorted_prob_Dict)
        return sorted_prob_Dict[0][0]

    def __preprocess(self, list_Words):
        """Drop stop words, digit-only tokens and single-character tokens."""
        result = []
        for word in list_Words:
            # remove stop words
            if word in self.stop_Words:
                continue
            # remove numeric tokens such as 1, 2, 3
            if word.isdigit():
                continue
            # remove single characters, which are meaningless
            if len(word) == 1:
                continue
            result.append(word)
        return result

    def results_To_csv(self, fileName):
        # Write the predictions as (id, class) rows, with ids starting at 1.
        rows = [[i, pred] for i, pred in enumerate(self.__result, start=1)]
        pd.DataFrame(rows, columns=["id", "class"]).to_csv(fileName, index=False)

    def __calculate_Prob_Of_Class(self):
        self.__p = {}
        for name in self.__data_Class:
            self.__p[name] = math.log(self.__data_Class[name] / self.__training_set_size, 2)
#         print(self.__p)

    def __calculate_Word_count(self):
        """
        Count how many times each word appears in each class.
        """
        for y in self.__data_Class.keys():
            if y not in self.__each_Class_Word_Num:
                self.__each_Class_Word_Num[y] = {}
            for word in self.__each_Class_Words[y]:
                if word not in self.__each_Class_Word_Num[y]:
                    self.__each_Class_Word_Num[y][word] = 1
                else:
                    self.__each_Class_Word_Num[y][word] += 1
        # print(self.__each_Class_Word_Num)

    def __calculate_Unique_Words(self):
        words = []
        for y in self.__each_Class_Word_Num.keys():
            words += self.__each_Class_Word_Num[y].keys()
        self.__unique_Words = set(words)
        # print(self.__unique_Words)
    def __count_Each_Class_Words(self):
        for label in self.__each_Class_Words:
            self.__each_Class_Words_Count[label] = len(self.__each_Class_Words[label])
        # print(self.__each_Class_Words_Count)

    def __str__(self):
        msg = ""
        msg += f"Class in the training set: {self.__data_Class}\n"
        msg += f"Words in each class: {self.__each_Class_Words}\n"
        return msg


In [4]:
class KFolds:
    def __init__(self, n_splits, shuffle = True, seed = 4321):
        self.seed = seed
        self.shuffle = shuffle
        self.n_splits = n_splits
    # Generator: iterate over it in a for loop to get each (train, validation) pair
    def split(self, X):
        num_Of_Samples = X.shape[0]
        indices = np.arange(num_Of_Samples)
        if self.shuffle:
            random_State = np.random.RandomState(self.seed)
            random_State.shuffle(indices)
        for test_mask in self._iter_test_masks(num_Of_Samples, indices):
            train_index = indices[np.logical_not(test_mask)]
            test_index = indices[test_mask]
            train_set = X.filter(train_index, axis = 0)
            test_set = X.filter(test_index, axis = 0)
            yield train_set, test_set
        
    def _iter_test_masks(self, num_Of_Samples, indices):
        fold_sizes = (num_Of_Samples // self.n_splits) * np.ones(self.n_splits, dtype = np.int64)
        fold_sizes[:num_Of_Samples % self.n_splits] += 1

        current = 0
        for fold_size in fold_sizes:
            start, stop = current, current + fold_size
            test_indices = indices[start:stop]
            test_mask = np.zeros(num_Of_Samples, dtype = bool)
            test_mask[test_indices] = True
            yield test_mask
            current = stop
def calScore(pred_List, real_List):
    count = 0
    for pred, real in zip(pred_List, real_List):
        if pred == real:
            count += 1
    return count / len(real_List)

## Cross validation on the training set and calculate the scores of the model.

In [5]:
scores = []
kfold = KFolds(5)
for train,valid in kfold.split(training_set):
    myCLF = MyBayesCLF()
    myCLF.fit(X = train["abstract"], y = train["class"])
    
    result = myCLF.predict(valid["abstract"])
    score = calScore(result, valid["class"])
    scores.append(score)
scores

[0.94875, 0.94875, 0.96375, 0.94125, 0.945]

## Get the final result for prediction on the test set. Use all the training set to fit the model.

In [6]:
model = MyBayesCLF()
model.fit(training_set["abstract"], training_set["class"])
predictions = model.predict(test_set["abstract"])

model.results_To_csv("out.csv")

count = {}
for item in predictions:
    if item in count:
        count[item]+= 1
    else:
        count[item] = 1
count

{'B': 386, 'E': 565, 'A': 30, 'V': 19}

## Standard Naive Bayes model and cross-validation score

In [7]:

class StandardBayesCLF:
    def __init__(self):
        self.__data_Class = {}
        self.__each_Class_Words = {}
        self.__each_Class_Word_Num = {}
        self.__p = {}
        self.__unique_Words = []
        self.__result = []
        self.__training_set_size = 0
        self.__each_Class_Words_Count = {}
        self.stop_Words = ['ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than']

    def fit(self, X, y):
        """
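        Fit the model: count class frequencies and per-class word occurrences
        from the abstracts X and their labels y (no preprocessing).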
        """
        for [abstract, label] in zip(X, y):
            if label not in self.__data_Class:
                self.__data_Class[label] = 1
            else:
                self.__data_Class[label] += 1
            words = abstract.split(" ")
            if label not in self.__each_Class_Words:
                self.__each_Class_Words[label] = words
            else:
                self.__each_Class_Words[label] += words
        self.__training_set_size = len(X)
        self.__calculate_Word_count()
        self.__calculate_Prob_Of_Class()
        self.__count_Each_Class_Words()

    def predict(self, X_test):
        count = 0
        processed_WordList = []
        for index, value in X_test.items():
            words = value.split(" ")

            processed_WordList.append(words)
            for g in words:
                if g not in self.__unique_Words:
                    count += 1
        result = []
        for sample in processed_WordList:
            result.append(self.__predict_Item(sample, count))
        self.__result = result
        return result

    def __predict_Item(self, list_Words, count):
        prob_Dict = {label: value for label, value in self.__p.items()}
#         print(prob_Dict)
#         print(list_Words)
        for word in list_Words:
            for label in prob_Dict.keys():
                # dict lookup: same result as searching the word list, but O(1)
                if word in self.__each_Class_Word_Num[label]:
                    prob_Dict[label] *=  (self.__each_Class_Word_Num[label][word]+1) / (self.__each_Class_Words_Count[label] )
#         print(prob_Dict)
        sorted_prob_Dict = sorted(prob_Dict.items(), key = lambda a: a[1], reverse = True)
#         print(sorted_prob_Dict)
        return sorted_prob_Dict[0][0]


    def results_To_csv(self, fileName):
        # Write the predictions as (id, class) rows, with ids starting at 1.
        rows = [[i, pred] for i, pred in enumerate(self.__result, start=1)]
        pd.DataFrame(rows, columns=["id", "class"]).to_csv(fileName, index=False)

    def __calculate_Prob_Of_Class(self):
        self.__p = {}
        for name in self.__data_Class:
            self.__p[name] = self.__data_Class[name] / self.__training_set_size
#         print(self.__p)

    def __calculate_Word_count(self):
        """
        Count how many times each word appears in each class.
        """
        for y in self.__data_Class.keys():
            if y not in self.__each_Class_Word_Num:
                self.__each_Class_Word_Num[y] = {}
            for word in self.__each_Class_Words[y]:
                if word not in self.__each_Class_Word_Num[y]:
                    self.__each_Class_Word_Num[y][word] = 1
                else:
                    self.__each_Class_Word_Num[y][word] += 1
        # print(self.__each_Class_Word_Num)

    def __count_Each_Class_Words(self):
        for label in self.__each_Class_Words:
            self.__each_Class_Words_Count[label] = len(self.__each_Class_Words[label])
        # print(self.__each_Class_Words_Count)

    def __str__(self):
        msg = ""
        msg += f"Class in the training set: {self.__data_Class}\n"
        msg += f"Words in each class: {self.__each_Class_Words}\n"
        return msg
scores = []
kfold = KFolds(5)
for train,valid in kfold.split(training_set):
    myCLF = StandardBayesCLF()
    myCLF.fit(X = train["abstract"], y = train["class"])
    
    result = myCLF.predict(valid["abstract"])
    score = calScore(result, valid["class"])
    scores.append(score)
scores

[0.24875, 0.21375, 0.24125, 0.22875, 0.175]