# Classify news using Naïve Bayes Model
In this computer assignment, we implement a Naïve Bayes model to classify news using their descriptions and headlines.

## Libraries
1. `re` is imported to remove non-alphanumeric characters from strings.

2. `nltk` is imported to preprocess the given descriptions and headlines, for example by removing stop words and by stemming or lemmatizing.

In [1]:
# !pip3 install nltk
import re
import nltk
import operator
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer 
from nltk.stem import WordNetLemmatizer 
from IPython.display import Markdown, display
# nltk.download("all")

## Data
The `data.csv` file contains a list of news items with details such as authors, category, headline, date and short_description. There is also a `test.csv` file with the same layout as `data.csv`, except that it does not contain the category field.

## Bayesian Inference
Naïve Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori or MAP decision rule. 
$$ C_{MAP} = argmax \ P(c|d) = argmax \ \frac{P(d|c)P(c)}{P(d)} $$
>### 1. Posterior probability $P(c|d)$:
$$P(c|d) = \frac{P(d|c) P(c)}{P(d)}$$

>### 2. Class prior probability $P(c)$:
$$P(c) = \frac{count(c)}{\sum_{c_i}count(c_i)}$$

>### 3. Likelihood $P(d|c)$:
$$P(d|c) = P(w_1|c)P(w_2|c)...P(w_n|c) = \prod_{i=1,..,n}\frac{count(w_i, c) + 1}{\sum_{w \in V} count(w, c) + |V|}$$

Using the equations above, we can pick the most probable class.
$$\Rightarrow C_{MAP} = argmax \ P(d|c)P(c) = argmax \ P(w_1|c)P(w_2|c)...P(w_n|c)P(c) = argmax \prod_{i=1,..,n}\frac{count(w_i, c) + 1}{\sum_{w \in V} count(w, c) + |V|} P(c)$$
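
As a minimal sketch of the MAP rule above, the following standalone example trains on three toy documents and scores a new one. The toy documents and the function names `train` and `predict` are assumptions made for illustration and are not part of the notebook's class. The score is computed as a sum of logarithms, which has the same argmax as the product but does not underflow for long documents.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (words, label). Returns priors, per-class word counts and the vocabulary."""
    class_counts = Counter(label for _, label in docs)
    word_counts = {c: Counter() for c in class_counts}
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    total = sum(class_counts.values())
    priors = {c: class_counts[c] / total for c in class_counts}
    return priors, word_counts, vocab

def predict(words, priors, word_counts, vocab):
    """Return the class maximizing log P(c) + sum_i log P(w_i | c) with add-one smoothing."""
    scores = {}
    for c in priors:
        # Denominator of the likelihood: total word count in class c plus |V|.
        denom = sum(word_counts[c].values()) + len(vocab)
        score = math.log(priors[c])
        for w in words:
            # A word never seen in class c has count 0, so its term is 1 / denom.
            score += math.log((word_counts[c][w] + 1) / denom)
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set.
toy_docs = [
    ("stock market rise".split(), "BUSINESS"),
    ("market profit bank".split(), "BUSINESS"),
    ("beach hotel flight".split(), "TRAVEL"),
]
model = train(toy_docs)
print(predict("bank market".split(), *model))
print(predict("cheap flight".split(), *model))
```

The second query contains the word "cheap", which appears in no training document; thanks to the `+ 1` in the numerator it lowers every class score instead of zeroing it.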

## Oversampling

An imbalanced dataset may lead to a gap between recall and precision. One way to tackle this issue is oversampling: after finding the size of the largest class, the number of samples in every other class is increased to match it. In other words, oversampling consists of replicating some points from the minority classes in order to increase their cardinality.
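
The idea can be sketched on plain Python lists, independently of the DataFrame-based `over_sample()` method used in this notebook. The function name `oversample`, the fixed seed and the toy rows are assumptions made for this illustration only.

```python
import random
from collections import Counter, defaultdict

def oversample(rows, seed=0):
    """rows: list of (text, label). Duplicate random rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # Draw the missing rows with replacement; k is 0 for the largest class.
        extra = rng.choices(group, k=target - len(group))
        balanced.extend(group + extra)
    return balanced

# Hypothetical imbalanced toy data: 5 TRAVEL rows, 2 BUSINESS rows.
toy = [("doc %d" % i, "TRAVEL") for i in range(5)]
toy += [("doc 5", "BUSINESS"), ("doc 6", "BUSINESS")]
balanced = oversample(toy)
print(Counter(label for _, label in balanced))
```

Only the training split should be oversampled; replicating rows before splitting would leak copies of training rows into the validation set.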

## Bayesian Classifier
The `Bayesian_Classifier` class is in charge of cleaning the data and training a Bayesian model to classify news using their short_descriptions and headlines.

### 1. `clean_data()`
>This method removes all non-alphanumeric characters from the given field of a dataframe, along with stop words (frequent English words), using nltk. It also reduces each word to a root/base form with the given method, stemming or lemmatization.

### 2. `over_sample()`
>To handle an imbalanced dataset with oversampling, the data size of each category must be increased to the size of the biggest one. The duplicated rows are selected randomly from the data using `DataFrame.sample()`.

### 3. `split_train_validation()`
>It splits the dataframe into a train dataframe and a validation dataframe with $splitRatio = 0.8$. If the `oversample` flag is set to true, it calls `over_sample()` to handle the imbalanced dataset.

### 4. `prepare_dict()`
>In this method, a nested dictionary is created that stores the number of occurrences of each word in every category.

### 5. `preprocessing()`
>Removing NaN values, cleaning the data, splitting it into train and validation sets, and preparing the dictionary are the required preprocessing steps implemented in this function.

### 6. `calculate_confusion_matrix()`
>It prepares a numpy array from a dictionary that contains the prediction results of each category.

### 7. `find_categories()`
>It takes a dataframe and returns the predicted labels in a list, according to Bayesian inference.






In [2]:
class Bayesian_Classifier:
        
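    def over_sample(self, data, final_size):
        # DataFrame.append was removed in pandas 2.0, so the sampled rows are added with pd.concat.
        n_extra = final_size - len(data)
        # Sample with replacement only when more rows are needed than the class contains.
        extra = data.sample(n = n_extra, replace = n_extra > len(data))
        return pd.concat([data, extra], ignore_index = True)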
        
    def split_train_validation(self, data, categories, oversample):
        max_size = data['category'].value_counts().max()*4//5
        # Group by the column name itself: grouping by a one-element list yields tuple keys in pandas 2.x.
        categorized_data = data.groupby('category')
        train_data = []
        validation_data = []
        category_names = []
        for category, dataframe in categorized_data:
            if categories and category not in categories:
                continue
            train_dataframe = dataframe[:len(dataframe)*4//5]  
            if oversample:
                train_dataframe = self.over_sample(train_dataframe, max_size)
            train_data.append(train_dataframe)
            validation_data.append(dataframe[len(dataframe)*4//5:])
            category_names.append(category)    
        number_of_data_in_each_class = {category_names[i]: len(train_data[i]) for i in range(len(train_data))}
            
        return pd.concat(train_data), pd.concat(validation_data), category_names, number_of_data_in_each_class

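    def clean_data(self, data, field, method):
        stop_words = set(stopwords.words('english'))

        def clean_text(text):
            # Keep letters and digits only, lower-case, drop stop words, then stem/lemmatize each token.
            s = re.sub('[^0-9a-zA-Z]+', ' ', text).lower()
            return ' '.join(method(w) for w in nltk.word_tokenize(s) if w not in stop_words)

        # Assign the whole column: writing to the rows yielded by iterrows() is not guaranteed to modify the dataframe.
        data[field] = data[field].apply(clean_text)
        return data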
    
    def prepare_dict(self, train_data, category_names, field):
        all_words = set()
        training_dict = {c : dict() for c in category_names}    
        for _, row in train_data.iterrows():
            category = row['category']
            for word in row[field].split():
                all_words.add(word)
                if word in training_dict[category]:
                    training_dict[category][word] += 1
                else:
                    training_dict[category][word] = 1
                    
        return training_dict, all_words
    
    def preprocessing(self, data, method, categories, oversample):
        data = data.replace(np.nan, '', regex=True)
        data = pd.DataFrame(data = {'category' : data.loc[:, 'category'], \
                                    'description' : data.loc[:, 'short_description'] + " " + data.loc[:, 'headline'] + " " + data.loc[:, 'headline']})
        
        data = data.dropna().reset_index(drop=True).sample(frac = 1)
        data = self.clean_data(data, 'description', method)
        train_data, validation_data, category_names, number_of_data_in_each_class = self.split_train_validation(data, categories, oversample)
        
        
        training_dict, all_words = self.prepare_dict(train_data, category_names, 'description')
        
        return training_dict, validation_data, category_names, len(all_words), number_of_data_in_each_class

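    def __init__(self, train_file = "data.csv", method="lemmatize", categories=None, oversample=True):
        # categories=None (or an empty list) means that every category in the file is used.
        if method == "lemmatize":
            self.method = WordNetLemmatizer().lemmatize
        elif method == "stem":    
            self.method = PorterStemmer().stem
        else:
            raise ValueError("method must be 'lemmatize' or 'stem'")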
        
        data = pd.read_csv(train_file)
        training_dict, validation_data, category_names, num_of_all_words, number_of_data_in_each_class = self.preprocessing(data, self.method, categories, oversample)

        self.num_of_all_words = num_of_all_words
        self.training_dict = training_dict
        self.category_names = category_names
        self.validation_data = validation_data
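        # Note: this stores the number of distinct words seen in each class, whereas the likelihood formula
        # above uses the total word count of the class in its denominator.
        self.num_of_words_each_class = {c : len(training_dict[c]) for c in training_dict}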
        
        total_num_of_data = 0
        for c in category_names:
            total_num_of_data += number_of_data_in_each_class[c]
        self.probability_of_each_class = {c : number_of_data_in_each_class[c]/total_num_of_data for c in category_names}
        
    def find_categories(self, dataframe, field):
        result = []
        for _, row in dataframe.iterrows():
            p = {c : self.probability_of_each_class[c] for c in self.category_names}
            for c in self.category_names:
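                for word in row[field].split():
                    # Numerator count(w, c) + 1; a word unseen in class c contributes 0 + 1 = 1.
                    if word in self.training_dict[c]: 
                        p[c] *= (self.training_dict[c][word] + 1)
                    # Smoothing denominator. The constant 5000 is applied to every class alike, so it does not
                    # change the argmax; it presumably keeps the product from underflowing to zero.
                    p[c] *= 5000/(self.num_of_words_each_class[c] + self.num_of_all_words)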
            result.append(max(p.items(), key=operator.itemgetter(1))[0])
        return result
    
    def calculate_confusion_matrix(self, result, category_names):
        num_of_classes = len(self.category_names)
        confusion_matrix = np.zeros((num_of_classes, num_of_classes), dtype=int)
        for index1, category1 in enumerate(category_names):
            for index2, category2 in enumerate(category_names):
                confusion_matrix[index1][index2] = result[category1][category2]
            
        return confusion_matrix
        
    def calculate_recall(self, confusion_matrix, index):
        return confusion_matrix[index, index]/np.sum(confusion_matrix[index])
    
    def calculate_precision(self, confusion_matrix, index):
        return confusion_matrix[index, index]/np.sum(confusion_matrix[:, index])
    
    def calculate_accuracy(self, confusion_matrix):
        acc = 0
        for i in range(len(confusion_matrix)):
            acc += confusion_matrix[i, i]
        return acc/np.sum(confusion_matrix)
    
    def show_confusion_matrix_and_evaluation_measures_table(self, confusion_matrix):
        table = "<center>\n"
        table += "<table>\n"
        table += "<tr><th>Confusion matrix </th><th>Evaluation measures</th></tr>\n"
        table += "<tr><td>\n\n"

        table += "| |"
        for c in self.category_names:
            table += c + "|"
        table += "\n|"
        for i in range(len(self.category_names) + 1):
            table += ":-:|"    
        table += "\n"
        for i, c in enumerate(self.category_names):
            table += "|**" + c + "**|"
            for j, _ in enumerate(self.category_names):
                table += str(confusion_matrix[i, j]) + "|"
            table += "\n"
        table += "\n"

        table += "</td><td>\n\n"

        table += "| |"
        for c in self.category_names:
            table += c + "|"
        table += "\n|"
        for i in range(len(self.category_names) + 1):
            table += ":-:|"    
        table += "\n"
        table += "|**Recall**|"
        for i in range(len(self.category_names)):
            table += str("{:.2f}".format(self.calculate_recall(confusion_matrix, i))) + "|"
        table += "\n"
        table += "|**Precision**|"
        for i in range(len(self.category_names)):
            table += str("{:.2f}".format(self.calculate_precision(confusion_matrix, i))) + "|"
        table += "\n\n"

        table += "</td></tr> </table></center>\n"
        
        display(Markdown(table))
    
    def show_validation_result(self, field = 'description'):
        real_labels = self.validation_data['category'].tolist()
        predicted_labels = self.find_categories(self.validation_data ,field)
        result = {c : {cc : 0 for cc in self.category_names} for c in self.category_names}
        for i in range(len(predicted_labels)):
            result[real_labels[i]][predicted_labels[i]] += 1
        
        confusion_matrix = self.calculate_confusion_matrix(result, self.category_names)
        self.show_confusion_matrix_and_evaluation_measures_table(confusion_matrix)
        print("Accuracy is: {:.2f}".format(self.calculate_accuracy(confusion_matrix)))
        
    def evaluate_test(self, filename):
        data = pd.read_csv(filename)
        data = data.replace(np.nan, '', regex=True)
        data = pd.DataFrame(data = {'description' : data.loc[:, 'short_description'] + " " + data.loc[:, 'headline']})
        data = self.clean_data(data, 'description', self.method)
    
        return self.find_categories(data, field='description')

## Confusion Matrix and Evaluation Measures

### 1. Confusion Matrix
>A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized as counts and broken down by class. In other words, CM \[i, j\] is the number of samples that belong to category i but are classified by the model into category j.
### 2. Accuracy
>Accuracy is the fraction of data that is classified correctly.
$$ Accuracy = \frac{Correct \ Detected}{Total} $$
### 3. Recall
>High recall means that the class is detected well: most items that truly belong to the category are found.
$$ Recall = \frac{Correct \ Detected \ category}{All \ category} $$
### 4. Precision
>High precision means that when the model detects the class, the detection is highly trustworthy.
$$ Precision = \frac{Correct \ Detected \ category}{Detected \ category} $$
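
As a small worked check, the measures can be computed by hand from the two-class confusion matrix reported in the Results section (rows are true classes, columns are predicted classes). The helper names below are assumptions for this sketch; they follow the same row-sum and column-sum logic as `calculate_recall()` and `calculate_precision()` in the class above, but use plain lists instead of numpy.

```python
def recall(cm, i):
    """Correct predictions for class i divided by all true members of class i (row sum)."""
    return cm[i][i] / sum(cm[i])

def precision(cm, i):
    """Correct predictions for class i divided by everything predicted as class i (column sum)."""
    return cm[i][i] / sum(row[i] for row in cm)

def accuracy(cm):
    """Sum of the diagonal divided by the total number of samples."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)

# BUSINESS / TRAVEL matrix from the oversampled two-class run.
cm = [[1016, 53],
      [96, 1684]]
for i, name in enumerate(["BUSINESS", "TRAVEL"]):
    print(name, "recall=%.2f" % recall(cm, i), "precision=%.2f" % precision(cm, i))
print("accuracy=%.2f" % accuracy(cm))
```

For BUSINESS this gives a recall of 1016/1069 and a precision of 1016/1112, which round to the 0.95 and 0.91 shown in the evaluation table.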

## Results
In the following parts we show the results of both phase 1 and phase 2 with the *lemmatize* method.
### Two classes classification using lemmatize and oversampling

In [3]:
BC1 = Bayesian_Classifier(method="lemmatize", categories=['BUSINESS', 'TRAVEL'])
BC1.show_validation_result()

<center>
<table>
<tr><th>Confusion matrix </th><th>Evaluation measures</th></tr>
<tr><td>

| |BUSINESS|TRAVEL|
|:-:|:-:|:-:|
|**BUSINESS**|1016|53|
|**TRAVEL**|96|1684|

</td><td>

| |BUSINESS|TRAVEL|
|:-:|:-:|:-:|
|**Recall**|0.95|0.95|
|**Precision**|0.91|0.97|

</td></tr> </table></center>


Accuracy is: 0.95


### Three classes classification using lemmatize and oversampling

In [4]:
BC2 = Bayesian_Classifier(method="lemmatize")
BC2.show_validation_result()

<center>
<table>
<tr><th>Confusion matrix </th><th>Evaluation measures</th></tr>
<tr><td>

| |BUSINESS|STYLE & BEAUTY|TRAVEL|
|:-:|:-:|:-:|:-:|
|**BUSINESS**|1008|22|39|
|**STYLE & BEAUTY**|57|1651|29|
|**TRAVEL**|85|63|1632|

</td><td>

| |BUSINESS|STYLE & BEAUTY|TRAVEL|
|:-:|:-:|:-:|:-:|
|**Recall**|0.94|0.95|0.92|
|**Precision**|0.88|0.95|0.96|

</td></tr> </table></center>


Accuracy is: 0.94


### Two classes classification using lemmatize

In [5]:
BC3 = Bayesian_Classifier(method="lemmatize", categories=['BUSINESS', 'TRAVEL'], oversample = False)
BC3.show_validation_result()

<center>
<table>
<tr><th>Confusion matrix </th><th>Evaluation measures</th></tr>
<tr><td>

| |BUSINESS|TRAVEL|
|:-:|:-:|:-:|
|**BUSINESS**|849|220|
|**TRAVEL**|18|1762|

</td><td>

| |BUSINESS|TRAVEL|
|:-:|:-:|:-:|
|**Recall**|0.79|0.99|
|**Precision**|0.98|0.89|

</td></tr> </table></center>


Accuracy is: 0.92


### Three classes classification using lemmatize

In [6]:
BC4 = Bayesian_Classifier(method="lemmatize", oversample = False)
BC4.show_validation_result()

<center>
<table>
<tr><th>Confusion matrix </th><th>Evaluation measures</th></tr>
<tr><td>

| |BUSINESS|STYLE & BEAUTY|TRAVEL|
|:-:|:-:|:-:|:-:|
|**BUSINESS**|828|73|168|
|**STYLE & BEAUTY**|11|1665|61|
|**TRAVEL**|10|50|1720|

</td><td>

| |BUSINESS|STYLE & BEAUTY|TRAVEL|
|:-:|:-:|:-:|:-:|
|**Recall**|0.77|0.96|0.97|
|**Precision**|0.98|0.93|0.88|

</td></tr> </table></center>


Accuracy is: 0.92


### Prepare test labels
Using the best Bayesian classifier, the predicted labels for the test data are saved in `output.csv`.

In [7]:
labels = BC2.evaluate_test("test.csv")
answer = pd.DataFrame({'index': range(len(labels)), 'category': labels})
answer.to_csv('output.csv', index=False, header=True)

## Questions
### 1. Lemmatization vs Stemming
>#### Stemming
Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word.
>#### Lemmatization
Lemmatization, on the other hand, takes into consideration the morphological analysis of the words.

Lemmatizers have proven useful for search engines because lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language.
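
The contrast can be illustrated with a deliberately simplified toy. The suffix list and the lemma dictionary below are hypothetical stand-ins written for this sketch; they are not the Porter or WordNet algorithms that the notebook imports from nltk, and real stemmers apply far more careful rules.

```python
# Toy illustration only: NOT the Porter or WordNet algorithms used in the notebook.
SUFFIXES = ("ies", "ing", "ed", "es", "s")  # hypothetical suffix list, longest first

def toy_stem(word):
    """Chop the first matching suffix without checking that the result is a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written lemma dictionary standing in for a morphological lexicon.
LEMMAS = {"studies": "study", "running": "run", "mice": "mouse"}

def toy_lemmatize(word):
    """Look the word up and return its dictionary form, or the word itself if unknown."""
    return LEMMAS.get(word, word)

for w in ["studies", "running", "mice"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stems "stud" and "runn" are not English words, and the irregular plural "mice" is left untouched by suffix stripping, while the dictionary lookup returns valid base forms in all three cases.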

### 2. TF-IDF
TF-IDF stands for “Term Frequency — Inverse Document Frequency”.
>**Term Frequency** measures the frequency of a word in a document. To neutralize the effect of the length of a document, we normalize the frequency value: we divide the frequency by the total number of words in the document.

$$ tf(w, d) = \frac{count(w, d)}{\sum_{w_i \in d} count(w_i, d)}$$

>**Document Frequency** measures how widespread a term is across the whole corpus; it is very similar to TF. The only difference is that TF counts the occurrences of a term t within a document d, whereas DF counts the documents in the document set N in which t occurs. In other words, DF is the number of documents in which the word is present. A document counts once if the term appears in it at least once; we do not need to know how many times the term is present.

$$df(w) = number \ of \ documents \ containing \ w$$

To train the Bayesian model with TF-IDF, the likelihood would be calculated with the TF-IDF values.

$$P(w_i|c) = \frac{tf(w_i, d)\times\frac{1}{df(w_i)} + 1}{\sum_{w \in V} tf(w, d)\times\frac{1}{df(w)} + |V|}$$
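
The two quantities can be sketched on a toy corpus using the definitions given in this section, with the weight taken as $tf \times \frac{1}{df}$ exactly as in the formula above (many libraries use a logarithmic inverse document frequency instead). The function names and the three toy documents are assumptions made for this example.

```python
from collections import Counter

def tf(word, doc):
    """Term frequency: occurrences of word in doc divided by the document length."""
    return Counter(doc)[word] / len(doc)

def df(word, docs):
    """Document frequency: number of documents that contain word at least once."""
    return sum(1 for doc in docs if word in doc)

def tf_idf(word, doc, docs):
    """Weight tf(w, d) * 1 / df(w); word must occur in at least one document."""
    return tf(word, doc) / df(word, docs)

# Hypothetical toy corpus, already tokenized.
docs = [
    "market rise".split(),
    "market hotel".split(),
    "beach hotel flight".split(),
]
print(tf("market", docs[0]), df("market", docs), tf_idf("market", docs[0], docs))
print(tf("rise", docs[0]), df("rise", docs), tf_idf("rise", docs[0], docs))
```

Both words have the same term frequency in the first document, but "rise" occurs in only one document and therefore receives twice the weight of "market".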

### 3. High Precision
Precision and recall are two parameters that should be increased together. High precision means that if a classifier detects a class, it is likely to be correct. In other words, with high precision and low recall the model can’t detect the class well but is highly trustworthy when it does.

### 4. Rare words
If a rare word such as "Tabriz" in the test files appears in only one category of the training data, a plain probability calculation may lead to a wrong classification; the following equation is used to prevent this issue.

