In [51]:
source("./aux.R")

# Analysis

## 1. Six-label dataset

### Data preprocessing

As previously introduced, the first dataset we analyze is composed of documents with assigned one of six labels, which indicate the level of truthness of each document, and a tag that indicates the main topics of the document. We upload the data as a dataframe using the `read.csv()` function, naming the three columns. First of all, as previously explained, we change the labels in order to make their meaning consistent with their value. Secondly, we save the unique labels and tags in two vectors, which will be used later.

In [4]:
dataset <- read.csv("six_label_dataset.csv", col.names = c("Label", "Text", "Tag"))
dataset$Label <- change_labels(dataset$Label)
head(dataset)

Unnamed: 0_level_0,Label,Text,Tag
Unnamed: 0_level_1,<dbl>,<chr>,<chr>
1,1,Says the Annies List political group supports third-trimester abortions on demand.,abortion
2,3,When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration.,"energy,history,job-accomplishments"
3,4,"Hillary Clinton agrees with John McCain ""by voting to give George Bush the benefit of the doubt on Iran.""",foreign-policy
4,1,Health care reform legislation is likely to mandate free sex change surgeries.,health-care
5,3,The economic turnaround started at the end of my term.,"economy,jobs"
6,5,The Chicago Bears have had more starting quarterbacks in the last 10 years than the total number of tenured (UW) faculty fired during the last two decades.,education


In [5]:
classes <- as.integer(sort(unique(dataset$Label)))
classes

In [6]:
args <- sort(unique(unlist(strsplit(dataset$Tag, ","))))
args

After an initial look to the dataset, we can see how many unique words the dataset contains before cleaning it. Then, after applying the `clean()` function and performing lemmatization and stemming, we can see how much the vocabulary has been reduced.

In [7]:
len_voc <- length(get_vocabulary_six(dataset$Text, threshold = 1))
len_voc

In [8]:
dataset$Text <- clean(dataset$Text)
dataset <- clean_empty_rows(dataset)

In [19]:
len_voc_cleaned <- length(get_vocabulary_six(dataset$Text, threshold = 1))
len_voc_cleaned

In [17]:
len_voc_cleaned <- length(get_vocabulary_six(dataset$Text, threshold = 5))
len_voc_cleaned

We can see that the cleaning process reduces a lot the total number of words that are actually unique in our dataset; in particular we get that, using the previously presented techniques for stemming and lemmatizing, the final vocabulary is only 23.7% of the initial vocabulary. If we include also a frequency check, choosing a threshold greater than 1, we are able to reduce the dimension of the vocabulary even more; for example, for `threshold = 5`, the final vocabulary is only 9.6% of the initial vocabulary.

### Model Training

After the preprocessing of the dataset, we are ready to train our Multinomial Naive Bayes model; the first thing to do is to divide the whole dataset in training set, validation set and test set, in order to tune the hyper-parameter of the model annd study its accuracy on unseen data. Before the division we randomly permutate the dataset, in order to remove possible correlation between consecutive documents.

In [10]:
seventy_percent <- floor(length(dataset$Text) * 0.7)
eightyfive_percent <- floor(length(dataset$Text) * 0.85)
n <- nrow(dataset)

dataset <- dataset[sample(n), ]

training_set <- dataset[1:seventy_percent, ]
validation_set <- dataset[(seventy_percent + 1):eightyfive_percent, ]
test_set <- dataset[(eightyfive_percent + 1):n, ]

In this part we consider `threshold = 3` as an example; later in the notebook we proceed to a tuning of this parameter using the validation set and then choosing the model that has the best accuracy on it. After the training, the output of the model are presented to give an idea of how things work.

In [20]:
model <- train_multinomial_nb(classes, training_set, threshold = 3, type = "Six")

In [22]:
print(model$vocab)

   [1] "2"               "3"               "4"               "5"              
   [5] "6"               "9"               "abil"            "abl"            
   [9] "abolish"         "abort"           "absente"         "absolut"        
  [13] "abus"            "academ"          "academi"         "accept"         
  [17] "access"          "accid"           "accord"          "account"        
  [21] "accumul"         "accus"           "achiev"          "acknowledg"     
  [25] "acorn"           "acr"             "across"          "act"            
  [29] "action"          "activ"           "activist"        "actual"         
  [33] "ad"              "add"             "addict"          "addit"          
  [37] "address"         "adjust"          "administ"        "administr"      
  [41] "admir"           "admiss"          "admit"           "adopt"          
  [45] "adult"           "advanc"          "advantag"        "advertis"       
  [49] "advis"           "advisor"         "advisori

In [15]:
print(model$prior)

        0         1         2         3         4         5 
0.0820656 0.1907886 0.1624564 0.2092114 0.1912073 0.1642708 


In [23]:
model$condprob

Unnamed: 0,0,1,2,3,4,5
2,0.000142288,5.224660e-04,3.191320e-04,6.292474e-04,9.793634e-04,4.029658e-04
3,0.000142288,2.239140e-04,3.989150e-04,5.663227e-04,4.197272e-04,4.029658e-04
4,0.000142288,2.985520e-04,7.978299e-05,1.258495e-04,2.798181e-04,1.611863e-04
5,0.000142288,1.492760e-04,7.978299e-05,1.887742e-04,3.497726e-04,2.417795e-04
6,0.000142288,7.463801e-05,1.595660e-04,6.292474e-05,2.098636e-04,1.611863e-04
9,0.000142288,1.492760e-04,1.595660e-04,1.258495e-04,1.399091e-04,8.059317e-05
abil,0.000142288,3.731900e-04,3.191320e-04,2.516990e-04,4.197272e-04,8.059317e-05
abl,0.000284576,5.224660e-04,5.584809e-04,3.146237e-04,7.694998e-04,3.223727e-04
abolish,0.000142288,5.224660e-04,7.978299e-05,1.887742e-04,6.995453e-05,1.611863e-04
abort,0.001280592,1.642036e-03,1.755226e-03,1.887742e-03,6.295908e-04,1.450677e-03


### Testing on validation set

We then use the result from the training to test the accuracy of the produced model on the validation set. The accuracy is simply defined as the number of the correct predicted labels; for a more deep analysis we also provide the confusion matrix, in order to see if specific patterns are present (for example a label which is predicted much more times than the others without any reason). 

In [26]:
pred_labels <- sapply(validation_set$Text, function(doc) {
  apply_multinomial_nb(classes, model$vocab, model$prior, model$condprob, doc)
})

In [27]:
correct_predictions <- sum(test_set$Label == pred_labels)
total_predictions <- length(test_set$Label)
accuracy <- correct_predictions / total_predictions
confusion_matrix <- table(True = test_set$Label, Predicted = pred_labels)

cat("Accuracy:", accuracy, "\n\n")
cat("Confusion Matrix:\n")
print(confusion_matrix)

Accuracy: 0.1927083 

Confusion Matrix:


    Predicted
True  0  1  2  3  4  5
   0  8 18 22 35 26 16
   1 19 64 47 84 68 33
   2 11 53 50 59 45 21
   3 16 56 46 80 81 42
   4 17 67 43 64 69 45
   5  9 56 33 60 48 25


As we can see the accuracy obtained on the validation set is really low. Our model performs a little better than choosing at random (which will give an average accuracy of 0.167, 1 over 6), but obviously this result indicates that this methods is not capable of classifying well the documents. From the conclusion matrix we see that no specific pattern arises and in general we don't have a general behaviour that explains the misclassified documents. 

### Tuning of the hyper-parameters

The only parameter that we can tune using the validation set in this case is the occurrency threshold for our vocabulary. In order to find the best parameter, we can simply train different models and choose the one that maximizes the accuracy on the validation set. 

In [33]:
# SHOULD WE DO A FUNCTION FOR THIS???? FOR ME YES, IT'S REALLY EASY

poss_thresholds <- 1:20
accuracies <- numeric(length(poss_thresholds))

for (i in seq_along(poss_thresholds)) {
  model <- train_multinomial_nb(classes, training_set, threshold = poss_thresholds[[i]], type = "Six")
  pred_labels <- sapply(validation_set$Text, function(doc) {
    apply_multinomial_nb(classes, model$vocab, model$prior, model$condprob, doc)
  })

  correct_predictions <- sum(test_set$Label == pred_labels)
  total_predictions <- length(test_set$Label)
  accuracies[[i]] <- correct_predictions / total_predictions
}

best_threshold <- poss_thresholds[which.max(accuracies)]
cat("Best threshold: ", best_threshold, "\n")
cat("Best accuracy: ", max(accuracies), "\n")

Best threshold:  2 
Best accuracy:  0.1966146 


In this way we are able to tune the best threshold for our model: as we can see, even after a tuning, we still obtain a really small value for the accuracy, which indicates that this parameter is not the main responsable for the poor performances of the model.

### Testing on test set

After the choice of the bets hyper-parameters we proceed testing the model on unseen data, the test set. We train again the model with the best threshold for the vocabulary and then we study the accuracy on the training set.

In [35]:
model <- train_multinomial_nb(classes, training_set, best_threshold, type = "Six")

pred_labels <- sapply(test_set$Text, function(doc) {
  apply_multinomial_nb(classes, model$vocab, model$prior, model$condprob, doc)
})

In [36]:
correct_predictions <- sum(test_set$Label == pred_labels)
total_predictions <- length(test_set$Label)
accuracy <- correct_predictions / total_predictions
confusion_matrix <- table(True = test_set$Label, Predicted = pred_labels)

cat("Accuracy:", accuracy, "\n\n")
cat("Confusion Matrix:\n")
print(confusion_matrix)

Accuracy: 0.2128906 

Confusion Matrix:
    Predicted
True  0  1  2  3  4  5
   0 10 25 22 30 22 16
   1 17 70 55 89 45 39
   2  8 50 44 66 45 26
   3 13 58 55 88 73 34
   4 12 50 40 70 83 50
   5  7 45 27 58 62 32


From this final analysis we obtain again a very low accuracy for our model; again no specific pattern can be deduced from the confusion matrix.

One thing that in general we can conclude is that we don't have overfitting or underfitting as the training, the validation and the test errors are all similar. One possible cause of the poor performance is the small length of each document in the dataset, which makes hard for the model to classify only on the basis of a few words; at the same time, the presence of six different lables makes things more difficult for the model, as similar labels could share similar general patterns (this is amplified by the small number of words per document).

### K-fold cross validation

Another possible reason for the poor performance of the model is a not enough large dataset for training and validation; in order to remove this possibility we proceed using the K-fold cross validation approach. In the following cells, we perform the same operations done in the previous points, studying possible values for the threshold. Moreover, this time we divide the dataset only in training set and test set, as the validation set is directly selected by the `kfold_cross_validation` function.

In [37]:
eigthy_percent <- floor(length(dataset$Text) * 0.8)
n <- nrow(dataset)

dataset <- dataset[sample(n), ]

training_set <- dataset[1:eigthy_percent, ]
test_set <- dataset[(eigthy_percent + 1):n, ]

In [38]:
poss_thresholds <- 1:20
crossval_results <- kfold_cross_validation(training_set, k = 5, thresholds = poss_thresholds, type = "Six")
crossval_results

threshold,mean_accuracy
<int>,<dbl>
1,0.225901
2,0.2219914
3,0.2238241
4,0.2233354
5,0.2227245
6,0.2229688
7,0.220281
8,0.2217471
9,0.2217471
10,0.2224801


In [42]:
best_threshold <- crossval_results$threshold[which.max(crossval_results$mean_accuracy)]
best_threshold

In [43]:
model <- train_multinomial_nb(classes, training_set, best_threshold, type = "Six")
pred_labels <- sapply(test_set$Text, function(doc) {
  apply_multinomial_nb(classes, model$vocab, model$prior, model$condprob, doc)
})

In [44]:
correct_predictions <- sum(test_set$Label == pred_labels)
total_predictions <- length(test_set$Label)
accuracy <- correct_predictions / total_predictions
confusion_matrix <- table(True = test_set$Label, Predicted = pred_labels)

cat("Accuracy:", accuracy, "\n\n")
cat("Confusion Matrix:\n")
print(confusion_matrix)

Accuracy: 0.2294922 

Confusion Matrix:
    Predicted
True   0   1   2   3   4   5
   0  18  49  21  38  27  15
   1  10 100  59 106  78  45
   2   7  67  56  98  68  36
   3   8  81  49 128  98  50
   4   2  52  32 133 114  59
   5   7  54  33  90 106  54


From this approach we obtain similar results as before. Depending on the initial random shuffling of the dataset we obtain values of accuracies for the best threshold between 0.20 and 0.23, which is still an indicator of a very bad performance of our model. In any case, this result tells us that the k-fold cross validation doesn't change a lot the behaviour of the model; this could indicate the necessity of a different pre-processing technique. 

### Analysis using tags

The approaches used up to this point have not produced a succesfull model. As already anticipated, probably the low number of words for document is one of the biggest problems for the performance of our model: for this reason, we leverage the presence of the column `Tag`, building the vocabulary in a different way. Rather than looking to all the document, we consider the different tags and build a different vocabulary for each tag: then we unify the vocabularies in a single one. The idea behind this process is that for different tags we have different main words and more words are under the threshold (and thus not considered).   

In [55]:
# THIS DOESN'T WORK: IT SHOULD RETURN 21768, I DON'T UNDERSTAND WHY...
# COULD IT BE SAME PROBLEM OF YESTERDAY?

len_voc <- length(get_vocabulary_tags(dataset, threshold = 0))
len_voc

In [53]:
len_voc <- length(get_vocabulary_tags(dataset, threshold = 5))
len_voc

As we can see, using `threshold = 5` in this case we able to reduce the vocabulary to 5.3% of the initial vocabulary. Next, we proceed to a k-fold cross validation in order to select the best threshold.

In [54]:
poss_thresholds <- 0:20
crossval_results <- kfold_cross_validation(training_set, k = 5, thresholds = poss_thresholds, type = "Tags")
crossval_results

threshold,mean_accuracy
<int>,<dbl>
0,0.2260232
1,0.2260232
2,0.2222358
3,0.2215027
4,0.2215027
5,0.2224801
6,0.222358
7,0.223091
8,0.2263897
9,0.227978


In [56]:
best_threshold <- crossval_results$threshold[which.max(crossval_results$mean_accuracy)]
best_threshold

In [57]:
model <- train_multinomial_nb(classes, training_set, best_threshold, type = "Tags")
pred_labels <- sapply(test_set$Text, function(doc) {
  apply_multinomial_nb(classes, model$vocab, model$prior, model$condprob, doc)
})

In [58]:
correct_predictions <- sum(test_set$Label == pred_labels)
total_predictions <- length(test_set$Label)
accuracy <- correct_predictions / total_predictions
confusion_matrix <- table(True = test_set$Label, Predicted = pred_labels)

cat("Accuracy:", accuracy, "\n\n")
cat("Confusion Matrix:\n")
print(confusion_matrix)

Accuracy: 0.2216797 

Confusion Matrix:
    Predicted
True   0   1   2   3   4   5
   0  17  55  26  35  15  20
   1  18 102  56 113  61  48
   2  16  76  64  90  51  35
   3  13  78  73 121  75  54
   4  10  67  40 109  95  71
   5  12  72  39  71  95  55


Again, also in this case, we are not able to achieve an accuracy higher than 25%, thus we can conclude that also this approach is not correct. The only thing that we can observe is that reducing the size of the vocabulary without any other kind of preprocessing doesn't really produce any gain in the accuracy; thus, this is probably not the best strategy for this dataset and other possibilities should be studies.

____________________________________________________

## 2. Two-label dataset

In [36]:
dataset <- read.csv("two_label_dataset.csv", col.names = c("ID", "Title", "Author", "Text", "Label"))
classes <- as.integer(sort(unique(dataset$Label)))

In [37]:
dataset$Text <- clean(dataset$Text)
dataset <- clean_empty_rows(dataset)

In [38]:
eighty_percent <- as.integer(length(dataset$Text) * 0.8)

training_set <- dataset[1:eighty_percent, ]
test_set <- dataset[(eighty_percent + 1):length(dataset$Text), ]

: 

In [39]:
crossval_results <- kfold_cross_validation(training_set, k = 5, thresholds = c(1e-10, 1e-9, 5e-9, 1e-8, 5e-8, 1e-7, 5e-7, 1e-6, 5e-6, 1e-5, 1.6e-5, 2e-5, 5e-5), type = "Two")
crossval_results

In [None]:
best_threshold <- crossval_results$threshold[which.max(crossval_results$mean_accuracy)]
model <- train_multinomial_nb(classes, training_set, best_threshold, type = "Two")

In [None]:
pred_labels <- sapply(test_set$Text, function(doc) {
  apply_multinomial_nb(classes, model$vocab, model$prior, model$condprob, doc)
})

In [None]:
correct_predictions <- sum(test_set$Label == pred_labels)
total_predictions <- length(test_set$Label)
accuracy <- correct_predictions / total_predictions
confusion_matrix <- table(True = test_set$Label, Predicted = pred_labels)

cat("Accuracy:", accuracy, "\n\n")
cat("Confusion Matrix:\n")
print(confusion_matrix)