# P08 - Multinomial Naive Bayes classifier for Fake News recognition

| Name   | ID |
| -------- | ------- |
| Calandra Buonaura Lorenzo | 2107761     |
| Turci Andrea  |2106724   |

# Introduction

In an era where information dissemination is increasingly digital and instantaneous, the proliferation of fake news has become a pervasive issue. The New York Times defines fake news as "a made-up story with an intention to deceive", highlighting its purpose to confuse or mislead the audience. This phenomenon is predominantly propagated through social media platforms and various online applications, embedding itself deeply in our daily lives. The ability to distinguish between fake and authentic news has emerged as one of the most pressing challenges for the modern news industry.

This assignment delves into the application of Multinomial Naive Bayes classifiers for the recognition of fake news. These classifiers are non-machine-learning classifier, renowned for their efficacy in text data analysis, particularly in classification tasks where the objective is to categorize text into multiple classes. Why are they called Multinomial Naive Bayes classifier?
- The term "multinomial" assumes that the features (word counts) are generated from a multinomial distribution; this distribution models the number of times an event occurs in a fixed number of trials, where each trial results in one of several possible outcomes and it is particularly suitable for data where the features are counts or frequencies, such as the number of times a word appears in a document. This assumption, while simplifying computations, has been proven effective in practical text classification tasks.
- The term "naive" refers to the strong (and often unrealistic) assumption of conditional independence between features. In mathematical terms, this means that the probability of observing the conjunction of features is simply the product of the probabilities for each individual feature, given the class label.
- The term "Bayes" refers to the fact that these classifiers are based on Bayes' Theorem and they leverage the probability of features (words or phrases) to determine the likelihood of a particular category (e.g., fake or real news). 

The goal of this project is to implement a Multinomial Naive Bayes classifier in R. The focus will be on evaluating the performance of this classifier in categorizing som sort of document, coming mostly from social media. The project will involve several key steps:

- Data Preprocessing: this involves cleaning the text data to remove noise, such as punctuation and stop words, and transforming the text into a format suitable for analysis, typically using techniques like tokenization and lemmatization.
- Model Training: using the cleaned dataset, the Multinomial Naive Bayes classifier will be trained to learn the patterns associated with fake and real news. This involves calculating the prior probabilities of each class and then the conditional probabilities (posterior) of each word given the class.
- Hyperparameter Tuning: the model's parameters will be adjusted to improve its performance.
- Validation and Testing: the final model will be validated using cross-validation techniques to ensure its robustness and tested on unseen data to assess its generalization capability. This will help in understanding the effectiveness of the model in distinguishing fake news from real news when facing unseen data.

# Datasets

Two different datasets were used in order to test the algortim. 
The first $^1$ contains 10240 documents classified according to six labels:

* $\textit{Barely-True: 0}$
* $\textit{False: 1}$
* $\textit{Half-True: 2}$
* $\textit{Mostly-True: 3}$
* $\textit{Not-Known: 4}$
* $\textit{True: 5}$

In contrast, the second $^2$ contains 20387 news articles, which were merged into a single dataset in which class 0 was assigned to reliable news and class 1 to unreliable news.

In the following sections, the algorithm will be applied initially to the first dataset, the results of which will be displayed, and only afterwards to the second dataset.

$^1 \small{https://www.kaggle.com/datasets/anmolkumar/fake-news-content-detection/overview}$

$^2 \small{https://www.kaggle.com/competitions/fake-news/overview}$

# Theory

### General approach

In text classification, we are given a description $d \in X$ of a document, where $X$ is the document space; and a fixed set of classes $C = {c_1, c_2, \dots, c_J}$ (also called categories or labels). We are also given a training set $D$ of labeled documents $<d, c>$, where $d, c \in X \times C$.  Using a learning method or learning algorithm, we then wish to learn a classifier or classification function $\gamma$ that maps documents to classes: $\gamma : X \rightarrow C$ (so we are in the supervised learning framework).

The probability of a text or document $d$ to belong to category $c$ can be obtained from Bayes' theorem: 

$$ P(c|d) \approx P(c) \ \prod_k P(t_k | c) $$

where $P(t_k | c)$ corresponds to the probability that the term $t_k$ appears in a document of class $c$; this probability is a measure of how much $t_k$ contributes for $c$ to be the correct class to assign to the document. As we can see, here the "naive" property enters, as each word is considered independent from the others and the probability of each document is simply propportional to the product of the conditional probability of each word the document contains. The conditional probability is estimated as the relative frequency of the term $t$ in the documents belonging to the class $c$: 

$$ P(t|c) = \frac{T_{ct}}{\sum_{t'} T_{ct'}} $$

where $T_{ct}$ is the number of times the term $t$ appears in the document of class $c$. Finally $P(c)$ is the a priori probability that a document belongs to class $c$; in general, the priori probability is estimated by exploiting relative frequencies: 

$$ P(c) = \frac{N_c}{N} $$

where $N_c$ corresponds to the total number of documents of class $c$ and $N$ is the total number of documents in the dataset.

Since the probability $P(c|d)$ is given by the product of many conditional probabilities, it is computationally advantageous to consider the logarithms of these quantities and sum them together. Finally we can say that in the Naive Bayes classification the best class is the most likely or maximum a posteriori (MAP) class $c_{map}$:

$$c_{map} = argmax_{c \in C} \; P(c|d) = argmax_{c \in C} \; \left[\log(P(c))+ \sum_{k} \log(P(t_k | c))\right]$$

### Laplace smoothing

In the practical implementation of a Naive Bayes classifier, it is common to encounter the zero probability problem. This occurs when a term $t_k$ does not appear in any of the documents of a class $c$. In that case, the conditional probability $P(t_k | c)$ will be zero, and consequently, the product of the conditional probabilities will become zero, canceling the total probability $P(c|d)$. To avoid this problem, a technique called "Laplace smoothing" or "add-one smoothing" is used, and the prior probability and the conditional probability are modified as follows.

The a priori probability $P(c)$ remains the same, since it is not affected by the presence or absence of terms.

$$ P(c) = \frac{N_c}{N} $$
 
The conditional probability, instead, is calculated by adding 1 to the numerator (number of times the term $t$ appears in class $c$ documents) and adding the total number of distinct terms $|V|$ to the denominator (total number of occurrences of all terms in class $c$ documents):

$$ P(t|c) = \frac{T_{ct} + 1}{\sum_{t'} T_{ct'} + |V|} $$
 
where:

- $T_{ct}$ is the number of times the term $t$ appears in the documents of class $c$.
- $\sum_{t'} T_{ct'}$ is the total number of occurrences of all terms in the documents of class $c$.
- $|V|$ is the number of distinct terms in the total vocabulary, so the length of the vocabulary.

The final formula for determining the best class $c_{map}$ remains similar, but uses smoothed conditional probabilities:

$$ c_{map} = argmax_{c \in C} \; P(c|d) = argmax_{c \in C} \; \left[\log(P(c))+ \sum_{k} \log\left(\frac{T_{ct_k} + 1}{\sum_{t'} T_{ct'} + |V|}\right)\right] $$

The application of Laplace smoothing has several advantages:
- It avoids the problem of zero probabilities.
- It provides a more robust estimate of conditional probabilities.
- It allows better handling of new or rare words that may appear in test documents.

# Code

In this section we present the most important functions that are used for the analysis of the two dataset.

### Libraries

First of all we need to import the libraries used for the project:

In [1]:
library(tm)
library(textstem)
library(SnowballC)
library(dplyr)

Loading required package: NLP



Loading required package: koRpus.lang.en

Loading required package: koRpus

Loading required package: sylly

For information on available language packages for 'koRpus', run

  available.koRpus.lang()

and see ?install.koRpus.lang()



Attaching package: ‘koRpus’


The following object is masked from ‘package:tm’:

    readTagged



Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




The libraries `tm`, `textsetm` and `SnowballC` are used for data tokenization, lemmatization and stemming. The library `dplyr`, instead, is designed to make data manipulation easier and more intuitive (allowing for example piping operations, thanks to the pipe operator `>%>`)

### Dataset Cleaning

Before training the model, we need to pre-process our data in order to clean them from noise and unuseful words (like stopwords), so we proceed with a data cleaning, which includes tokenization, lemmatization and stemming.

In [2]:
change_labels <- function(labels) {
  label_map <- c("0" = 2, "1" = 1, "2" = 3, "3" = 4, "4" = 0, "5" = 5)
  new_labels <- sapply(labels, function(label) label_map[as.character(label)])
  return(new_labels)
}

This function change_labels remaps a set of categorical labels according to a specified mapping (defined by  `label_map`). The input labels is expected to be a vector of labels and then the `sapply` function is used to apply this mapping to each label in the input vector; the new vector of remapped labels is then returned by the function. This function is used only for the first dataset, in order to make the labelling more consistent with the meaning of each label; after the mapping, the correspondance is:

* $\textit{Not-Known: 0}$
* $\textit{False: 1}$
* $\textit{Barely-True: 2}$
* $\textit{Half-True: 3}$
* $\textit{Mostly-True: 4}$
* $\textit{True: 5}$

In [3]:
lemmatize_text <- function(text) {
  lemmatized <- textstem::lemmatize_words(unlist(strsplit(text, "\\s+")))
  lemmatized <- SnowballC::wordStem(lemmatized, language = "en")

  return(paste(lemmatized, collapse = " "))
}

The `lemmatize_text` function processes an input string text by first lemmatizing and then stemming each word. Lemmatization (using `textstem::lemmatize_words`) converts words to their base or dictionary form, while stemming (using `SnowballC::wordStem`) reduces words to their stem form. The input text is split into individual words, processed, and then recombined into a single string, which is then returned.

In [4]:
filter_non_english_words <- function(text) {
  tokens <- unlist(strsplit(text, "\\s+"))
  is_english <- hunspell::hunspell_check(tokens)
  english_tokens <- tokens[is_english]
  cleaned_text <- paste(english_tokens, collapse = " ")
  return(cleaned_text)
}

This function `filter_non_english_words` removes non-English words from a given input string text. It tokenizes the text into individual words, checks each word for being an English word using `hunspell::hunspell_check`, and retains only the words identified as English. The cleaned text, composed only of English words, is then reassembled into a single string and returned. This is done because our datasets are made mostly of english articles and the non-english words are not significative in the analysis.

In [5]:
remove_numbers_inside_words <- function(text) {
  words <- unlist(strsplit(text, "\\s+"))

  clean_words <- lapply(words, function(word) {
    if (grepl("\\d", word)) {
      word <- gsub("\\d", "", word)
    }
    return(word)
  })

  cleaned_text <- paste(clean_words, collapse = " ")
  return(cleaned_text)
}

The `remove_numbers_inside_words` function cleans a given input string text by removing any numerical digits within words. It splits the text into individual words, processes each word to remove digits (using `gsub`), and then recombines the cleaned words into a single string. The resulting string, with numbers removed from within words, is returned. 

In [6]:
clean <- function(document, tokenize = TRUE, lemmatize = TRUE) {
  clean_doc <- tm::VCorpus(tm::VectorSource(document))

  if (tokenize) {
    clean_doc <- tm::tm_map(clean_doc, tm::content_transformer(tolower))
    clean_doc <- tm::tm_map(clean_doc, tm::removePunctuation)
    clean_doc <- tm::tm_map(clean_doc, tm::removeWords, tm::stopwords("en"))
    clean_doc <- tm::tm_map(clean_doc, tm::content_transformer(filter_non_english_words))
    clean_doc <- tm::tm_map(clean_doc, tm::content_transformer(remove_numbers_inside_words))
    clean_doc <- tm::tm_map(clean_doc, tm::stripWhitespace)
  }

  if (lemmatize) {
    clean_doc <- tm::tm_map(clean_doc, tm::content_transformer(lemmatize_text))
  }

  return(sapply(clean_doc, NLP::content))
}

The `clean` function performs comprehensive text cleaning on an input document. It first converts the input document into a text corpus using `tm::VCorpus`. If tokenize is set to `TRUE`, the function applies a series of transformations: converting text to lowercase, removing punctuation, removing stop words, filtering non-English words, removing numbers from within words, and stripping whitespace. If lemmatize is set to `TRUE`, it also lemmatizes the text. The function returns the cleaned document as a character vector.

In [7]:
clean_empty_rows <- function(dataframe) {
  empty_rows <- which(nchar(trimws(dataframe$Text)) == 0)
  if (length(empty_rows) != 0) {
    dataframe <- dataframe[-empty_rows, ]
  }
  return(dataframe)
}

The `clean_empty_rows` function removes rows from a dataframe where the `Text` column is empty or contains only whitespace. It identifies such rows trimming the whitespaces (`trimws`) and then counting how many character are still there (`nchar`), then excludes rows with zero characters from the dataframe. The cleaned dataframe, with empty rows removed, is returned. This function ensures that the dataframe only contains rows with meaningful text data, because after the cleaning process it could happen that all the words contained in a row are neglected.

### Vocabularies

After the cleaning process, we need to identify the vocabulary of the clean dataset, which is a sorted list of unique words contained in the dataset. As explained later, we identified different techniques for building the vocabulary: 

In [8]:
get_vocabulary_six <- function(document, threshold) {
  words <- unlist(strsplit(document, "\\s+"))
  words <- words[words != ""]
  words_table <- table(words)

  words_freq <- as.data.frame(words_table, stringsAsFactors = FALSE)
  colnames(words_freq) <- c("word", "occurrencies")

  vocabulary <- words_freq[words_freq$occurrencies >= threshold, ]$word
  return(vocabulary)
}

The `get_vocabulary_six` function creates a vocabulary list from a given text document (of the first dataset, the one with six labels) by including only those words that occur at least a specified number of times. It begins by tokenizing the input document, splitting it into individual words using spaces as delimiters (empty strings that may result from this split are removed). The function then constructs a frequency table of these words, which is converted into a data frame with columns named word and occurrencies, representing each unique word and its frequency of occurrence, respectively. The function filters this data frame to include only those words whose frequency meets or exceeds the specified threshold. The resulting vocabulary is returned as a vector of words that meet this criterion.

In [9]:
get_vocabulary_tags <- function(df, threshold) {
  tag_texts <- list()
  all_tags <- unique(unlist(strsplit(df$Tag, ",")))

  for (tag in all_tags) {
    matching_docs <- df[grep(tag, df$Tag), "Text"]
    doc <- paste(matching_docs, collapse = " ")

    voc <- get_vocabulary_six(doc, threshold)
    tag_texts <- append(tag_texts, voc)
  }

  return(sort(unique(unlist(tag_texts))))
}

The `get_vocabulary_tags` function constructs a vocabulary list based on the tags associated with text documents in a data frame. It first identifies all unique tags in the `Tag` column by splitting the tags on commas and finding unique entries. For each unique tag, the function retrieves the texts of all documents associated with that tag, concatenating them into a single text string. It then uses the `get_vocabulary_six` function to generate a vocabulary list for this concatenated text, filtering words based on the specified threshold. The vocabularies for all tags are combined into a single list, which is returned as a sorted vector of unique words. This approach ensures that the vocabulary reflects the terms most commonly associated with each tag, based on their frequency in the relevant documents.

In [10]:
get_vocabulary_two <- function(document, threshold) {
  words <- unlist(strsplit(document, "\\s+"))
  words <- words[words != ""]
  words_table <- table(words)

  words_freq <- as.data.frame(words_table, stringsAsFactors = FALSE)
  colnames(words_freq) <- c("word", "occurrencies")

  total_words <- sum(words_freq$occurrencies)
  words_freq$occurrencies <- words_freq$occurrencies / total_words

  vocabulary <- words_freq[words_freq$occurrencies >= threshold, ]$word
  return(voc = vocabulary)
}

The `get_vocabulary_two` function also generates a vocabulary list from a text document (from the second dataset, so two labels only), but does so based on the relative frequency of words. Similar to `get_vocabulary_six`, it starts by tokenizing the document and removing any empty strings. It creates a frequency table of the words and converts it into a data frame with word and occurrencies columns. The function then calculates the total number of words and converts the occurrencies column to represent the relative frequency of each word. Words whose relative frequency meets or exceeds the specified threshold are filtered and stored in the vocabulary, which is then returned.

### Training

After data pre-processing and vocabulary building, we are ready for the training of our model. 

In [11]:
train_multinomial_nb <- function(classes, data, threshold, type) {
  n <- length(data$Text)

  if (type == "Six") {
    vocabulary <- get_vocabulary_six(paste(data$Text, collapse = " "), threshold)
  } else if (type == "Two") {
    vocabulary <- get_vocabulary_two(paste(data$Text, collapse = " "), threshold)
  } else if (type == "Tags") {
    vocabulary <- get_vocabulary_tags(data, threshold)
  } else {
    stop("Invalid type specified")
  }

  prior <- numeric(length(classes))
  names(prior) <- classes
  post <- matrix(0, nrow = length(vocabulary), ncol = length(classes), dimnames = list(vocabulary, classes))

  for (c in seq_along(classes)) {
    class_label <- classes[c]
    docs_in_class <- data[data$Label == class_label, "Text"]
    prior[c] <- length(docs_in_class) / n

    textc <- paste(docs_in_class, collapse = " ")
    tokens <- table(strsplit(tolower(textc), "\\W+")[[1]])
    vocab_counts <- sapply(vocabulary, function(t) if (t %in% names(tokens)) tokens[t] else 0)

    post[, c] <- (vocab_counts + 1) / (sum(vocab_counts) + length(vocabulary))
  }

  return(list(vocab = vocabulary, prior = prior, condprob = post))
}

The `train_multinomial_nb` function trains a Multinomial Naive Bayes classifier based on text data and specified classes. This function performs several critical tasks, including constructing the vocabulary, calculating prior probabilities, and computing conditional probabilities for each class.

The function starts by determining the length of the data, which is the number of text documents. It then decides how to build the vocabulary based on the specified type parameter:
- If type is `"Six"`, it calls `get_vocabulary_six` to generate the vocabulary from the combined text of all documents. 
- If type is `"Two"`, it calls `get_vocabulary_two` and retrieves the vocabulary part of the returned list. 
- If type is `"Tags"`, it calls `get_vocabulary_tags` to generate a vocabulary based on tags associated with the documents. 
- If an invalid type is provided, the function stops and raises an error.

Next, the function initializes the prior probability array and the conditional probability matrix. The prior array has a length equal to the number of classes and is named according to the class labels. The conditional probability matrix has rows corresponding to the vocabulary and columns corresponding to the classes, initialized to zeros.

The function then iterates over each class to compute the prior and conditional probabilities. For each class, it filters the documents that belong to the current class and calculates the prior probability as the ratio of the number of documents in the class to the total number of documents. It concatenates the text of all documents in the class into a single string and tokenizes this string into words. It counts the frequency of each word and constructs a frequency table. The function computes the conditional probabilities using Laplace smoothing: for each word in the vocabulary, it adds one to the word count (to avoid zero probabilities) and normalizes by the total word count plus the size of the vocabulary. This ensures that every word has a non-zero probability.

Finally, the function returns a list containing three elements: the vocabulary, the prior probabilities, and the conditional probability matrix. This trained model can then be used for classifying new text documents.

### Log-likelihood

After training, we are ready to use the trained model on unseen data:

In [12]:
apply_multinomial_nb <- function(classes, vocab, prior, condprob, doc) {
  tokens <- intersect(unlist(strsplit(doc, "\\s+")), vocab)

  score_matrix <- matrix(0, nrow = length(tokens), ncol = length(classes))
  rownames(score_matrix) <- tokens
  colnames(score_matrix) <- classes

  for (c in seq_along(classes)) {
    for (t in seq_along(tokens)) {
      term <- tokens[t]
      score_matrix[t, c] <- log(condprob[term, c])
    }
  }

  scores <- colSums(score_matrix) + log(prior)

  return(names(which.max(scores)))
}

The `apply_multinomial_nb` function applies a trained Multinomial Naive Bayes classifier to a new document in order to classify it. This function uses the vocabulary, prior probabilities, and conditional probabilities computed during training to determine the most likely class for the given document.

The function begins by tokenizing the input document into individual words. It then intersects these tokens with the provided vocabulary to ensure that only relevant words (those present both in the vocabulary and in the document) are considered. Then, a score matrix is initialized (filled with zeros), with rows representing the intersected tokens and columns representing the classes. The function then iterates over each class and each token, populating the score matrix with the log of the conditional probability of each token given the class. This involves two nested loops: the outer loop iterates over the classes, and the inner loop iterates over the tokens.

Once the score matrix is populated, the function calculates the total score for each class by summing the log-probabilities in the score matrix and adding the log of the prior probability for each class. The class with the highest total score is selected as the predicted class.
The function returns the name of the class with the maximum score, indicating the predicted classification for the input document. This approach ensures that the classification takes into account both the prior probability of each class and the likelihood of the document given each class, making use of the Naive Bayes assumption that the presence of each word is conditionally independent given the class.

### Validation

We also defined two function to perform hyperparameters tuning, leveraging the validation set:

In [13]:
validation <- function(dataset, thresholds, type) {
  seventy_percent <- floor(length(dataset$Text) * 0.7)
  eightyfive_percent <- floor(length(dataset$Text) * 0.85)
  n <- nrow(dataset)

  dataset <- dataset[sample(n), ]

  training_set <- dataset[1:seventy_percent, ]
  validation_set <- dataset[(seventy_percent + 1):eightyfive_percent, ]

  accuracies <- numeric(length(thresholds))
  classes <- as.integer(sort(unique(dataset$Label)))

  for (i in seq_along(thresholds)) {
    model <- train_multinomial_nb(classes, training_set, thresholds[[i]], type)
    pred_labels <- sapply(validation_set$Text, function(doc) {
      apply_multinomial_nb(classes, model$vocab, model$prior, model$condprob, doc)
    })

    correct_predictions <- sum(validation_set$Label == pred_labels)
    total_predictions <- length(validation_set$Label)
    accuracies[[i]] <- correct_predictions / total_predictions
  }

  return(data.frame(threshold = thresholds, accuracy = accuracies))
}

The `validation` function is designed to validate a Multinomial Naive Bayes classifier over different threshold values. It does this by splitting the dataset into training and validation sets; then, for each threshold, it trains the model on the training set, and then evaluates its performance on the validation set. The function then returns a data frame containing the accuracy for each threshold value tested.

In [14]:
kfold_cross_validation <- function(dataset, k = 5, thresholds, type) {
  n <- nrow(dataset)
  fold_size <- floor(n / k)

  accuracies <- matrix(0, nrow = k, ncol = length(thresholds))
  classes <- as.integer(sort(unique(dataset$Label)))

  for (fold in 1:k) {
    validation_indices <- ((fold - 1) * fold_size + 1):(fold * fold_size)
    train_indices <- setdiff(1:n, validation_indices)
    training_set <- dataset[train_indices, ]
    validation_set <- dataset[validation_indices, ]

    for (i in seq_along(thresholds)) {
      model <- train_multinomial_nb(classes, training_set, thresholds[i], type)

      pred_labels <- sapply(validation_set$Text, function(doc) {
        apply_multinomial_nb(classes, model$vocab, model$prior, model$condprob, doc)
      })

      correct_predictions <- sum(validation_set$Label == pred_labels)
      total_predictions <- length(validation_set$Label)
      accuracies[fold, i] <- correct_predictions / total_predictions
    }
  }

  mean_accuracies <- colMeans(accuracies)
  return(data.frame(threshold = thresholds, mean_accuracy = mean_accuracies))
}

The `kfold_cross_validation` function performs k-fold cross-validation on a given dataset to evaluate the performance of a Multinomial Naive Bayes classifier with different threshold values. This function helps to assess the classifier's accuracy by splitting the data into k subsets (folds) and iteratively training and validating the model on these folds.

The function starts by determining the number of rows (documents) in the dataset and calculating the size of each fold. It initializes a matrix accuracies to store the accuracy results for each fold and each threshold value. The unique class labels in the dataset are sorted and stored as integers in the classes vector.

The function then enters a loop that iterates over each fol: for each fold, it determines the indices of the validation set and the training set, which consists of all documents not in the validation set. Within each fold, the function iterates over the specified threshold values: for each threshold, it trains a Multinomial Naive Bayes classifier using the `train_multinomial_nb` function, which builds a vocabulary, computes prior probabilities, and calculates conditional probabilities based on the training set. 

Next, the function applies the trained model to each document in the validation set using the `apply_multinomial_nb` function, which classifies the document based on the trained model and returns the predicted class label. The function compares the predicted labels to the actual labels of the validation set to count the number of correct predictions: the accuracy for each threshold and fold is computed as the ratio of correct predictions to the total number of predictions and stored in the accuracies matrix.

After completing the cross-validation process for all folds and thresholds, the function calculates the mean accuracy for each threshold by taking the column-wise mean of the accuracies matrix. The function returns a data frame containing the threshold values and their corresponding mean accuracies.

This cross-validation approach ensures a robust evaluation of the classifier's performance by training and validating the model on different subsets of the data, thereby reducing the risk of overfitting and providing a more reliable estimate of the classifier's accuracy.

# Analysis

## 1. Six-label dataset

### Data pre-processing

As previously introduced, the first dataset we analyze is composed of documents with assigned one of six labels, which indicate the level of truthness of each document, and a tag that indicates the main topics of the document. We upload the data as a dataframe using the `read.csv()` function, naming the three columns. First of all, as previously explained, we change the labels in order to make their meaning consistent with their value. Secondly, we save the unique labels and tags in two vectors, which will be used later.

In [15]:
dataset <- read.csv("six_label_dataset.csv", col.names = c("Label", "Text", "Tag"))
dataset$Label <- change_labels(dataset$Label)
head(dataset)

Unnamed: 0_level_0,Label,Text,Tag
Unnamed: 0_level_1,<dbl>,<chr>,<chr>
1,1,Says the Annies List political group supports third-trimester abortions on demand.,abortion
2,3,When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration.,"energy,history,job-accomplishments"
3,4,"Hillary Clinton agrees with John McCain ""by voting to give George Bush the benefit of the doubt on Iran.""",foreign-policy
4,1,Health care reform legislation is likely to mandate free sex change surgeries.,health-care
5,3,The economic turnaround started at the end of my term.,"economy,jobs"
6,5,The Chicago Bears have had more starting quarterbacks in the last 10 years than the total number of tenured (UW) faculty fired during the last two decades.,education


In [16]:
classes <- as.integer(sort(unique(dataset$Label)))
classes

In [17]:
args <- sort(unique(unlist(strsplit(dataset$Tag, ","))))
args

After an initial look to the dataset, we can see how many unique words the dataset contains before cleaning it. Then, after applying the `clean()` function and performing lemmatization and stemming, we can see how much the vocabulary has been reduced.

In [18]:
len_voc <- length(get_vocabulary_six(dataset$Text, threshold = 1))
len_voc

In [19]:
dataset$Text <- clean(dataset$Text)
dataset <- clean_empty_rows(dataset)

In [20]:
len_voc_cleaned <- length(get_vocabulary_six(dataset$Text, threshold = 1))
len_voc_cleaned

In [21]:
len_voc_cleaned <- length(get_vocabulary_six(dataset$Text, threshold = 5))
len_voc_cleaned

We can see that the cleaning process reduces a lot the total number of words that are actually unique in our dataset; in particular we get that, using the previously presented techniques for stemming and lemmatizing, the final vocabulary is only 23.7% of the initial vocabulary. If we include also a frequency check, choosing a threshold greater than 1, we are able to reduce the dimension of the vocabulary even more; for example, for `threshold = 5`, the final vocabulary is only 9.6% of the initial vocabulary.

### Model training

After the preprocessing of the dataset, we are ready to train our Multinomial Naive Bayes model; the first thing to do is to divide the whole dataset in training set, validation set and test set, in order to tune the hyper-parameter of the model annd study its accuracy on unseen data. Before the division we randomly permutate the dataset, in order to remove possible correlation between consecutive documents.

In [22]:
seventy_percent <- floor(length(dataset$Text) * 0.7)
eightyfive_percent <- floor(length(dataset$Text) * 0.85)
n <- nrow(dataset)

dataset <- dataset[sample(n), ]

training_set <- dataset[1:seventy_percent, ]
validation_set <- dataset[(seventy_percent + 1):eightyfive_percent, ]
test_set <- dataset[(eightyfive_percent + 1):n, ]

In this part we consider `threshold = 3` as an example; later in the notebook we proceed to a tuning of this parameter using the validation set and then choosing the model that has the best accuracy on it. After the training, the output of the model are presented to give an idea of how things work.

In [23]:
model <- train_multinomial_nb(classes, training_set, threshold = 3, type = "Six")

In [24]:
print(model$vocab)

   [1] "2"             "3"             "4"             "5"            
   [5] "6"             "abil"          "abl"           "abolish"      
   [9] "abort"         "absente"       "absolut"       "abus"         
  [13] "academi"       "acceler"       "accept"        "access"       
  [17] "accid"         "accident"      "accommod"      "accord"       
  [21] "account"       "accumul"       "accus"         "achiev"       
  [25] "acknowledg"    "acorn"         "acr"           "across"       
  [29] "act"           "action"        "activ"         "activist"     
  [33] "actual"        "ad"            "add"           "addict"       
  [37] "addit"         "address"       "adjust"        "administr"    
  [41] "admir"         "admiss"        "admit"         "adopt"        
  [45] "adult"         "advanc"        "advantag"      "advertis"     
  [49] "advis"         "advisor"       "advisori"      "advoc"        
  [53] "advocaci"      "affair"        "affect"        "affili"       
  [57]

In [25]:
print(model$prior)

         0          1          2          3          4          5 
0.08164689 0.19078856 0.16427076 0.20935101 0.19023029 0.16371249 


In [26]:
model$condprob

Unnamed: 0,0,1,2,3,4,5
2,0.0002843737,5.940006e-04,4.725526e-04,7.563820e-04,1.110803e-03,5.659767e-04
3,0.0002843737,3.712504e-04,3.150350e-04,5.042546e-04,3.471258e-04,5.659767e-04
4,0.0001421868,2.970003e-04,7.875876e-05,1.260637e-04,1.388503e-04,2.425614e-04
5,0.0001421868,1.485001e-04,7.875876e-05,1.890955e-04,2.777006e-04,2.425614e-04
6,0.0001421868,1.485001e-04,1.575175e-04,6.303183e-05,2.082755e-04,1.617076e-04
abil,0.0001421868,3.712504e-04,3.937938e-04,1.260637e-04,2.777006e-04,8.085382e-05
abl,0.0002843737,6.682507e-04,3.937938e-04,3.151592e-04,6.248264e-04,3.234153e-04
abolish,0.0002843737,3.712504e-04,2.362763e-04,1.260637e-04,6.942516e-05,1.617076e-04
abort,0.0008531210,2.153252e-03,1.338899e-03,1.638828e-03,9.025271e-04,1.697930e-03
absente,0.0001421868,1.485001e-04,7.875876e-05,1.260637e-04,1.388503e-04,8.085382e-05


### Testing on validation set

We then use the result from the training to test the accuracy of the produced model on the validation set. The accuracy is simply defined as the number of the correct predicted labels; for a more deep analysis we also provide the confusion matrix, in order to see if specific patterns are present (for example a label which is predicted much more times than the others without any reason). 

In [27]:
pred_labels <- sapply(validation_set$Text, function(doc) {
  apply_multinomial_nb(classes, model$vocab, model$prior, model$condprob, doc)
})

In [28]:
correct_predictions <- sum(test_set$Label == pred_labels)
total_predictions <- length(test_set$Label)
accuracy <- correct_predictions / total_predictions
confusion_matrix <- table(True = test_set$Label, Predicted = pred_labels)

cat("Accuracy:", accuracy, "\n\n")
cat("Confusion Matrix:\n")
print(confusion_matrix)

Accuracy: 0.1848958 

Confusion Matrix:
    Predicted
True  0  1  2  3  4  5
   0  9 25 24 26 33 16
   1  7 61 45 79 81 36
   2 14 46 37 57 46 37
   3 10 51 58 90 53 44
   4 22 71 48 79 60 36
   5 14 46 36 66 46 27


As we can see the accuracy obtained on the validation set is really low. Our model performs a little better than choosing at random (which will give an average accuracy of 0.167, 1 over 6), but obviously this result indicates that this methods is not capable of classifying well the documents. From the conclusion matrix we see that no specific pattern arises and in general we don't have a general behaviour that explains the misclassified documents. 

### Tuning of the hyper-parameters

The only parameter that we can tune using the validation set in this case is the occurrency threshold for our vocabulary. In order to find the best parameter, we can simply train different models and choose the one that maximizes the accuracy on the validation set. 

In [29]:
poss_thresholds <- 1:20
val_results <- validation(dataset, poss_thresholds, type = "Six")
val_results

threshold,accuracy
<int>,<dbl>
1,0.2311198
2,0.2259115
3,0.2207031
4,0.2167969
5,0.2174479
6,0.2174479
7,0.218099
8,0.21875
9,0.2200521
10,0.2213542


In [30]:
best_threshold <- val_results$threshold[which.max(val_results$accuracy)]
cat("Best threshold: ", best_threshold, "\n")
cat("Best accuracy: ", max(val_results$accuracy), "\n")

Best threshold:  1 
Best accuracy:  0.2311198 


In this way we are able to tune the best threshold for our model: as we can see, even after a tuning, we still obtain a really small value for the accuracy, which indicates that this parameter is not the main responsable for the poor performances of the model.

### Testing on test set

After the choice of the bets hyper-parameters we proceed testing the model on unseen data, the test set. We train again the model with the best threshold for the vocabulary and then we study the accuracy on the training set.

In [31]:
model <- train_multinomial_nb(classes, training_set, best_threshold, type = "Six")

pred_labels <- sapply(test_set$Text, function(doc) {
  apply_multinomial_nb(classes, model$vocab, model$prior, model$condprob, doc)
})

In [32]:
correct_predictions <- sum(test_set$Label == pred_labels)
total_predictions <- length(test_set$Label)
accuracy <- correct_predictions / total_predictions
confusion_matrix <- table(True = test_set$Label, Predicted = pred_labels)

cat("Accuracy:", accuracy, "\n\n")
cat("Confusion Matrix:\n")
print(confusion_matrix)

Accuracy: 0.2395833 

Confusion Matrix:
    Predicted
True   0   1   2   3   4   5
   0   4  36  14  41  20  18
   1   6  73  37  96  66  31
   2   7  52  43  68  47  20
   3   4  45  44 113  78  22
   4   5  62  38  77 100  34
   5   5  39  32  72  52  35


From this final analysis we obtain again a very low accuracy for our model; again no specific pattern can be deduced from the confusion matrix.

One thing that in general we can conclude is that we don't have overfitting or underfitting as the training, the validation and the test errors are all similar. One possible cause of the poor performance is the small length of each document in the dataset, which makes hard for the model to classify only on the basis of a few words; at the same time, the presence of six different lables makes things more difficult for the model, as similar labels could share similar general patterns (this is amplified by the small number of words per document).

### K-fold cross validation

Another possible reason for the poor performance of the model is a not enough large dataset for training and validation; in order to remove this possibility we proceed using the K-fold cross validation approach. In the following cells, we perform the same operations done in the previous points, studying possible values for the threshold. Moreover, this time we divide the dataset only in training set and test set, as the validation set is directly selected by the `kfold_cross_validation` function.

In [33]:
eigthy_percent <- floor(length(dataset$Text) * 0.8)
n <- nrow(dataset)

dataset <- dataset[sample(n), ]

training_set <- dataset[1:eigthy_percent, ]
test_set <- dataset[(eigthy_percent + 1):n, ]

In [34]:
poss_thresholds <- 1:20
crossval_results <- kfold_cross_validation(training_set, k = 5, thresholds = poss_thresholds, type = "Six")
crossval_results

threshold,mean_accuracy
<int>,<dbl>
1,0.2260232
2,0.2266341
3,0.2267563
4,0.2222358
5,0.222358
6,0.2217471
7,0.2216249
8,0.2200367
9,0.2197923
10,0.2200367


In [35]:
best_threshold <- crossval_results$threshold[which.max(crossval_results$mean_accuracy)]
best_threshold

In [36]:
model <- train_multinomial_nb(classes, training_set, best_threshold, type = "Six")
pred_labels <- sapply(test_set$Text, function(doc) {
  apply_multinomial_nb(classes, model$vocab, model$prior, model$condprob, doc)
})

In [37]:
correct_predictions <- sum(test_set$Label == pred_labels)
total_predictions <- length(test_set$Label)
accuracy <- correct_predictions / total_predictions
confusion_matrix <- table(True = test_set$Label, Predicted = pred_labels)

cat("Accuracy:", accuracy, "\n\n")
cat("Confusion Matrix:\n")
print(confusion_matrix)

Accuracy: 0.237793 

Confusion Matrix:
    Predicted
True   0   1   2   3   4   5
   0  22  50  36  30  25  19
   1  25 100  55  85  76  54
   2  15  84  62  85  68  25
   3  12  85  53 127  83  51
   4  11  62  46  99 101  69
   5  15  52  33  72  86  75


From this approach we obtain similar results as before. Depending on the initial random shuffling of the dataset we obtain values of accuracies for the best threshold between 0.20 and 0.23, which is still an indicator of a very bad performance of our model. In any case, this result tells us that the k-fold cross validation doesn't change a lot the behaviour of the model; this could indicate the necessity of a different pre-processing technique. 

### Analysis using tags

The approaches used up to this point have not produced a succesfull model. As already anticipated, probably the low number of words for document is one of the biggest problems for the performance of our model: for this reason, we leverage the presence of the column `Tag`, building the vocabulary in a different way. Rather than looking to all the document, we consider the different tags and build a different vocabulary for each tag: then we unify the vocabularies in a single one. The idea behind this process is that for different tags we have different main words and more words are under the threshold (and thus not considered). We load again the dataset in order to prove that the vocabulary obtained this way is smaller than with the previous approach.

In [38]:
dataset <- read.csv("six_label_dataset.csv", col.names = c("Label", "Text", "Tag"))
dataset$Label <- change_labels(dataset$Label)
classes <- as.integer(sort(unique(dataset$Label)))
args <- sort(unique(unlist(strsplit(dataset$Tag, ","))))

In [39]:
len_voc <- length(get_vocabulary_tags(dataset, threshold = 1))
len_voc

In [40]:
dataset$Text <- clean(dataset$Text)
dataset <- clean_empty_rows(dataset)

In [41]:
len_voc <- length(get_vocabulary_tags(dataset, threshold = 5))
len_voc

As we can see, using `threshold = 5` in this case we able to reduce the vocabulary to 5.3% of the initial vocabulary. Next, we proceed to a k-fold cross validation in order to select the best threshold.

In [42]:
poss_thresholds <- 0:20
crossval_results <- kfold_cross_validation(training_set, k = 5, thresholds = poss_thresholds, type = "Tags")
crossval_results

threshold,mean_accuracy
<int>,<dbl>
0,0.225901
1,0.225901
2,0.2254123
3,0.2245571
4,0.2250458
5,0.2218693
6,0.2188149
7,0.2196701
8,0.2216249
9,0.2186927


In [43]:
best_threshold <- crossval_results$threshold[which.max(crossval_results$mean_accuracy)]
best_threshold

In [44]:
model <- train_multinomial_nb(classes, training_set, best_threshold, type = "Tags")
pred_labels <- sapply(test_set$Text, function(doc) {
  apply_multinomial_nb(classes, model$vocab, model$prior, model$condprob, doc)
})

In [45]:
correct_predictions <- sum(test_set$Label == pred_labels)
total_predictions <- length(test_set$Label)
accuracy <- correct_predictions / total_predictions
confusion_matrix <- table(True = test_set$Label, Predicted = pred_labels)

cat("Accuracy:", accuracy, "\n\n")
cat("Confusion Matrix:\n")
print(confusion_matrix)

Accuracy: 0.2421875 

Confusion Matrix:
    Predicted
True   0   1   2   3   4   5
   0  16  47  34  42  26  17
   1  15  98  52 103  80  47
   2   8  83  62  89  71  26
   3   6  80  51 146  81  47
   4   6  60  38 120 106  58
   5   6  55  31  81  92  68


Again, also in this case, we are not able to achieve an accuracy higher than 25%, thus we can conclude that also this approach is not correct. The only thing that we can observe is that reducing the size of the vocabulary without any other kind of preprocessing doesn't really produce any gain in the accuracy; thus, this is probably not the best strategy for this dataset and other possibilities should be studies.

____________________________________________________

## 2. Two-label dataset

### Data pre-processing

The second dataset is composed of documents labelled either 0 (reliable) or 1 (unreliable); due to the large number of document and also to the large number of words per document, we don't print lot of results in order to leave the notebook lighter. This time when computing the vocabulary, we use a frequency threshold instead of an occurrency threshold, as the number of words for document is much higher. 

In [46]:
dataset <- read.csv("two_label_dataset.csv", col.names = c("ID", "Title", "Author", "Text", "Label"))
classes <- as.integer(sort(unique(dataset$Label)))

In [47]:
len_voc <- length(get_vocabulary_two(dataset$Text, threshold = 0))
len_voc

In [48]:
dataset$Text <- clean(dataset$Text)
dataset <- clean_empty_rows(dataset)

In [49]:
len_voc <- length(get_vocabulary_two(dataset$Text, threshold = 0))
len_voc

In [50]:
len_voc <- length(get_vocabulary_two(dataset$Text, threshold = 5e-5))
len_voc

Also in this case we can see that the cleaning process reduces a lot the total number of words that are actually unique in our dataset; in particular we get that, using the previously presented techniques for stemming and lemmatizing, the final vocabulary is only 5.1% of the initial vocabulary. If we include also a frequency check, choosing a threshold greater than 0, we are able to reduce the dimension of the vocabulary even more; for example, for `threshold = 5e-5`, the final vocabulary is only 0.5% of the initial vocabulary.

### Model training and k-fold cross validation

From the previous analysis we obtained that k-fold cross validation yields similar values of accuracy as normal validation; given the recognized strength of this method, we use it directly when analyzing the second dataset.

In [51]:
eighty_percent <- as.integer(length(dataset$Text) * 0.8)

training_set <- dataset[1:eighty_percent, ]
test_set <- dataset[(eighty_percent + 1):length(dataset$Text), ]

In [52]:
poss_thresholds <- c(5e-08, 1e-07, 5e-07, 1e-06, 5e-06, 1e-05, 1.6e-05, 2e-05, 5e-05)
crossval_results <- kfold_cross_validation(training_set, k = 5, thresholds = poss_thresholds, type = "Two")
crossval_results

threshold,mean_accuracy
<dbl>,<dbl>
5e-08,0.8670313
1e-07,0.8670313
5e-07,0.8669705
1e-06,0.8659982
5e-06,0.8635673
1e-05,0.8615618
1.6e-05,0.8586448
2e-05,0.8576724
5e-05,0.8497113


In [53]:
best_threshold <- crossval_results$threshold[which.max(crossval_results$mean_accuracy)]
best_threshold

As we can see, in this case we are able to obtain a much higher validation accuracy, probably due to both the presence of only two labels and of much longer documents, which help the model choosing the correct labelling. In any case, the best thresholds is chosen to be 5e-08, with a validation accuracy of 0.867. 

### Testing on test set

In [54]:
model <- train_multinomial_nb(classes, training_set, best_threshold, type = "Two")
pred_labels <- sapply(test_set$Text, function(doc) {
  apply_multinomial_nb(classes, model$vocab, model$prior, model$condprob, doc)
})

In [55]:
correct_predictions <- sum(test_set$Label == pred_labels)
total_predictions <- length(test_set$Label)
accuracy <- correct_predictions / total_predictions
confusion_matrix <- table(True = test_set$Label, Predicted = pred_labels)

cat("Accuracy:", accuracy, "\n\n")
cat("Confusion Matrix:\n")
print(confusion_matrix)

Accuracy: 0.8746051 

Confusion Matrix:
    Predicted
True    0    1
   0 1803  239
   1  277 1796


From the testing we are able to gain an even higher accuracy, which indicates that the general strategy, when applied to a large enough dataset, works pretty well (even if some improvements can be still obtained). In particular for this dataset we should try to reduce the number of false-negative, as a fake-news classified reliable is much worse than the opposite; some specific techniques should be developed to cope with this problem.