# **Coronavirus tweets-Text Classification**

#### Author: Martina Cavallucci
#### Email: nome.cognome@studio.unibo.it
#### Release: January 2020

*This R script models and classifies the text of tweets posted in March and April 2020.
The tweets all concern one specific topic: Covid-19.
The goal is to classify each tweet by sentiment (Positive, Negative, Neutral).*



---

#### Import the R and text-mining libraries
##### This step takes a few minutes
---




In [None]:
# Install each package only if it is missing, then attach it.
# tidyverse provides count(), which is used further down.
pkgs <- c("data.table", "checkmate", "stringr", "caret", "quanteda",
          "quanteda.textmodels", "R.utils", "tidyverse")
for (pkg in pkgs) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
  library(pkg, character.only = TRUE)
}

---

#### Download the train and test sets from GitHub

---


In [None]:
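# mode = "wb" keeps the gzip archives intact on Windows
download.file('https://github.com/CavallucciMartina/Coronavirus-tweets-Text-Classification/blob/main/input/Corona_NLP_test.csv.gz?raw=true', 'test.csv.gz', mode = "wb") #, method="curl")
# overwrite = TRUE lets the cell be re-run when the unpacked file already exists
gunzip('test.csv.gz', overwrite = TRUE)
download.file('https://github.com/CavallucciMartina/Coronavirus-tweets-Text-Classification/blob/main/input/Corona_NLP_train.csv.gz?raw=true', 'train.csv.gz', mode = "wb") #, method="curl")
gunzip('train.csv.gz', overwrite = TRUE)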
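
As an aside, base R can read a gzip-compressed CSV without unpacking it first, which would make the `gunzip` step optional. The sketch below is only an illustration: it writes a tiny made-up table to a temporary `.csv.gz` file (the column names mirror those used later in this notebook, the rows are invented) and reads it straight back with `read.csv`.

```r
# Toy data: invented rows, column names borrowed from the notebook.
toy <- data.frame(
  OriginalTweet = c("stock up on food", "stay safe everyone"),
  Sentiment     = c("Neutral", "Positive"),
  stringsAsFactors = FALSE
)

# Write the table through a gzip connection to a temporary file.
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects gzip compression itself.
toy_back <- read.csv(gz_path, stringsAsFactors = FALSE)
print(toy_back)
```

Keeping the explicit `gunzip` call, as the notebook does, has the advantage of leaving a plain `train.csv` and `test.csv` on disk for inspection.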

---

#### First look at the train set

---


In [None]:
train <- read.csv("train.csv")
test <- read.csv("test.csv")
head(train)

---

#### Dimensions of the train set and the test set

---


In [None]:
dim(train)
dim(test)


---

#### Sentiment counts in the train set

---

In [None]:
uniqueSentiment = count(train,Sentiment)
uniqueSentiment

---

#### Chart of the sentiment percentages in the train set

---

In [None]:
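#jpeg("PieChartSentiment.jpg")
# count() returns the classes in alphabetical order, so the labels and colours
# are taken from the table itself instead of being hard-coded in a fixed order.
sentiment <- uniqueSentiment$n
classes <- as.character(uniqueSentiment$Sentiment)
pct <- round(sentiment / sum(sentiment) * 100)
lbls <- paste0(classes, " ", pct, "%") # add percentages to the labels
class_cols <- c("Positive" = "#66CC00", "Negative" = "#CC0000", "Neutral" = "#CCCCCC",
                "Extremely Positive" = "#336600", "Extremely Negative" = "#990000")
pie(sentiment, labels = lbls, col = class_cols[classes],
    main = "Pie chart of sentiment")
#dev.off()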
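
Because a count table lists the classes in sorted order rather than in the order one might type them, it helps to build the slice labels from the table itself. A minimal base-R sketch with invented labels (the numbers are not taken from the dataset, and `toy_sentiment` is a hypothetical name):

```r
# Invented sentiment labels, for illustration only.
toy_sentiment <- c(rep("Positive", 3), rep("Negative", 3), rep("Neutral", 2),
                   "Extremely Positive", "Extremely Negative")

# table() sorts the class names alphabetically.
counts <- table(toy_sentiment)
pct <- round(100 * counts / sum(counts))

# Build each label from the name attached to its own count.
labels <- paste0(names(counts), " ", pct, "%")
print(labels)
```

Labels built this way stay attached to the right slice even if the class order changes.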

---

#### Prepare the train and test sets for classification.
Map the 5 classes to 3: Positive, Negative, Neutral

---

In [46]:
new_train = data.frame(
                text = train$OriginalTweet,
                labels = train$Sentiment,
                stringsAsFactors=F)

new_test = data.frame(
                text = test$OriginalTweet,
                labels = test$Sentiment,
                stringsAsFactors=F)

In [8]:
#Free memory
rm(train)
rm(test)

In [None]:
# Collapse the 5 classes into 3: "0" = negative, "1" = neutral, "2" = positive.

classes_def <- function(x) {
    if (x == "Extremely Positive") {
         "2"
    } else if (x == "Extremely Negative") {
         "0"
    } else if (x == "Negative") {
         "0"
    } else if (x == "Positive") {
         "2"
    } else {
         "1"
    }
}

# vapply() returns a plain character vector; lapply() would make labels a list column.
new_train$labels <- vapply(as.character(new_train$labels), classes_def, character(1), USE.NAMES = FALSE)
new_test$labels <- vapply(as.character(new_test$labels), classes_def, character(1), USE.NAMES = FALSE)

uniqueSentiment_trasf = count(new_train,labels)
uniqueSentiment_trasf
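
The same 5-to-3 mapping can also be written as a lookup table. The sketch below is an alternative formulation, not the notebook's own code: `class_map` and `toy_labels` are hypothetical names and the example labels are invented.

```r
# Named vector acting as a lookup table: original label -> class code.
class_map <- c(
  "Extremely Negative" = "0",
  "Negative"           = "0",
  "Neutral"            = "1",
  "Positive"           = "2",
  "Extremely Positive" = "2"
)

# Invented example labels.
toy_labels <- c("Positive", "Extremely Negative", "Neutral", "Negative")

# Indexing by name maps the whole vector at once; unname() drops the names.
toy_codes <- unname(class_map[toy_labels])
print(toy_codes)
```

One difference to keep in mind: the if/else chain sends any unexpected label to class "1", whereas the lookup returns `NA` for it, which makes stray labels easier to spot.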


In [None]:
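#jpeg("PieChartSentiment.jpg")
sentiment <- uniqueSentiment_trasf$n
# Name and colour each slice from its class code instead of relying on row order
class_names <- c("0" = "Negative", "1" = "Neutral", "2" = "Positive")
class_cols <- c("0" = "#CC0000", "1" = "#CCCCCC", "2" = "#66CC00")
codes <- as.character(unlist(uniqueSentiment_trasf$labels))
pct <- round(sentiment / sum(sentiment) * 100)
lbls <- paste0(class_names[codes], " ", pct, "%") # add percentages to the labels
pie(sentiment, labels = lbls, col = class_cols[codes],
    main = "Pie chart of sentiment")
#dev.off()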

---

#### Build the corpus

---

In [None]:
train_corpus <- corpus(new_train)
docvars(train_corpus, "Textno") <-
  sprintf("%02d", 1:ndoc(train_corpus)) 

In [None]:
train_corpus.stats <- summary(train_corpus)
head(train_corpus.stats, n = 10)

In [None]:
head(kwic(train_corpus, "covid19", window=4),10)

In [None]:
head(kwic(train_corpus, "work", window=4),10)

In [None]:
head(kwic(train_corpus, "food", window=4),10)

---

#### Text preprocessing

---

In [None]:
head(train_corpus,5)

In [19]:
train_token <-
  tokens(
    train_corpus,
    remove_numbers = TRUE,
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_url = TRUE,
    split_hyphens = TRUE,
    include_docvars = TRUE
  )

In [None]:
head(train_token,20)

In [None]:
token_ungd <- tokens_select(
  train_token,
  c("(http|https)://([^\\s]+)", "<.*?>","#\\w+","@\\w+","\\s+","&\\w+"),
  selection = "remove",
  valuetype = "regex",
  verbose = TRUE
)
toks_nostop <- tokens_select(token_ungd, pattern = stopwords("en"), selection = "remove")
print(toks_nostop)
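
To see which kinds of token the regular expressions above are meant to catch, the patterns can be tried on a handful of strings with base R. This is only an approximation of what `tokens_select` does: it uses `grepl` with `perl = TRUE` rather than quanteda's own regex engine, and the tokens (`toy_tokens`) are invented.

```r
# Patterns copied from the tokens_select() call above.
patterns <- c("(http|https)://([^\\s]+)", "<.*?>", "#\\w+", "@\\w+", "\\s+", "&\\w+")

# Invented tokens, for illustration only.
toy_tokens <- c("#covid19", "@someone", "&amp", "prices",
                "https://example.com/x", "<U+0001F600>", "panic")

# A token is dropped if at least one pattern matches somewhere inside it.
matches_any <- vapply(
  toy_tokens,
  function(tok) any(vapply(patterns, function(p) grepl(p, tok, perl = TRUE), logical(1))),
  logical(1),
  USE.NAMES = FALSE
)
kept <- toy_tokens[!matches_any]
print(kept)
```

Hashtags, mentions, HTML entities, URLs and escaped Unicode markers are removed, while ordinary words pass through.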

In [None]:
#Free memory
rm(train_corpus)

---

#### Build the document-term matrix and apply TF-IDF

---

In [None]:
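# No stemming is applied. stem = FALSE was the default, and passing it to dfm()
# is deprecated in quanteda 3, so the argument is left out.
dfm_train <- dfm(toks_nostop, tolower = TRUE)
# Despite its name, dfm_train_tfidf.trim holds the trimmed raw counts.
dfm_train_tfidf.trim <- dfm_trim(dfm_train, min_termfreq = 10, min_docfreq = 2)
dfm_train_tfidf <- dfm_tfidf(dfm_train_tfidf.trim)
dfm_train_tfidf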
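
As a worked illustration of the weighting, the sketch below computes TF-IDF by hand on a tiny invented count matrix. It assumes the settings that `dfm_tfidf` is understood to use by default (raw counts as term frequency, and `log10(N / df)` as inverse document frequency); the matrix and its term names are made up.

```r
# Toy document-term count matrix: 3 documents, 3 terms (invented numbers).
counts <- matrix(
  c(2, 1, 0,
    0, 1, 0,
    1, 1, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("food", "covid", "work"))
)

n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)        # number of documents containing each term
idf      <- log10(n_docs / doc_freq)   # inverse document frequency, base 10

# Multiply every column by the idf of its term.
tfidf <- sweep(counts, 2, idf, `*`)
print(round(tfidf, 3))
```

A term that occurs in every document, like "covid" here, gets weight zero, while a term concentrated in one document is weighted up.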

In [23]:
#Free memory
rm(new_train)

---

#### Generate the word cloud

---

In [None]:
#jpeg("wordcloud.jpeg")
set.seed(100)
textplot_wordcloud(dfm_train_tfidf.trim, min_count = 100, random_order = FALSE,
                   rotation = .25, 
                   color = RColorBrewer::brewer.pal(8,"Dark2"))
#dev.off()

---

## **Classification**

---

---

##### Function that builds the confusion matrix from a model and the train and test data.

---

In [25]:
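computeConfusionMatrix <- function(test, train, model) {
  # Align the test features with the features the model was trained on
  dfmat_matched <- dfm_match(test, features = featnames(train))
  actual_class <- unlist(dfmat_matched$labels)
  predicted_class <- predict(model, newdata = dfmat_matched, force = TRUE)
  # Rows = actual, columns = predicted. caret reads a table as rows = predicted,
  # so per-class Sensitivity/Recall and Pos Pred Value/Precision come out swapped;
  # accuracy, kappa and F1 do not depend on the orientation.
  tab_class <- table(actual_class, predicted_class)
  # Separate name for the result, so it does not shadow caret::confusionMatrix()
  cm <- caret::confusionMatrix(tab_class, mode = "everything")
  return(cm)
}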
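
How the table is oriented matters for the per-class statistics. The base-R sketch below recomputes recall and precision by hand from the counts of the multinomial Naive Bayes run reported in this notebook, with rows as actual classes and columns as predicted classes; the variable names are illustrative.

```r
# Counts from the multinomial Naive Bayes run: rows = actual, columns = predicted.
tab <- matrix(
  c(1101, 235,  297,
     136, 388,   95,
     267, 188, 1091),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual = c("0", "1", "2"), predicted = c("0", "1", "2"))
)

correct <- diag(tab)

# Recall: correct predictions divided by the number of actual cases (row sums).
recall <- correct / rowSums(tab)
# Precision: correct predictions divided by the number of predictions made (column sums).
precision <- correct / colSums(tab)

print(round(rbind(recall, precision), 4))
```

In the reported output for that model, the rows labelled Sensitivity and Recall match the precision computed here and the Precision row matches the recall, which is what one would expect when a table with actual classes in rows is read as having predictions in rows.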

---

##### Build the document-term matrix and apply TF-IDF for the test set.

---

In [None]:
test_corpus <- corpus(new_test)
token <-
  tokens(
    test_corpus,
    remove_numbers = TRUE,
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_url = TRUE,
    split_hyphens = TRUE,
    include_docvars = TRUE
  )

token_ungd <- tokens_select(
  token,
  c("(http|https)://([^\\s]+)", "<.*?>","#\\w+","@\\w+","\\s+","&\\w+"),
  selection = "remove",
  valuetype = "regex",
  verbose = TRUE
)
toks_nostop <- tokens_select(token_ungd, pattern = stopwords("en"), selection = "remove")
# Build the dfm from the stopword-free tokens, as done for the train set
dfm_test <- dfm(toks_nostop, tolower = TRUE)
dfm_test_tfidf.trim <- dfm_trim(dfm_test, min_termfreq = 10, min_docfreq = 2)
dfm_test_tfidf <- dfm_tfidf(dfm_test_tfidf.trim)
dfm_test_tfidf

In [27]:
#Free memory
rm(new_test)
rm(test_corpus)


---

## **Naive Bayes Multinomial**

---

In [74]:
nb_mult <- textmodel_nb(dfm_train_tfidf, unlist(docvars(dfm_train_tfidf, "labels")), distribution = c("multinomial"))

In [75]:
computeConfusionMatrix(dfm_test_tfidf,dfm_train_tfidf, nb_mult)

Confusion Matrix and Statistics

            predicted_class
actual_class    0    1    2
           0 1101  235  297
           1  136  388   95
           2  267  188 1091

Overall Statistics
                                          
               Accuracy : 0.6793          
                 95% CI : (0.6642, 0.6941)
    No Information Rate : 0.396           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4958          
                                          
 Mcnemar's Test P-Value : 1.185e-12       

Statistics by Class:

                     Class: 0 Class: 1 Class: 2
Sensitivity            0.7320   0.4784   0.7357
Specificity            0.7681   0.9227   0.8035
Pos Pred Value         0.6742   0.6268   0.7057
Neg Pred Value         0.8139   0.8669   0.8259
Precision              0.6742   0.6268   0.7057
Recall                 0.7320   0.4784   0.7357
F1                     0.7019   0.5427   0.7204
Prevalence   

---

## **Naive Bayes Bernoulli**

---

In [76]:
nb_bern <- textmodel_nb(dfm_train_tfidf, unlist(docvars(dfm_train_tfidf, "labels")), distribution = c("Bernoulli"))

In [77]:
computeConfusionMatrix(dfm_test_tfidf,dfm_train_tfidf, nb_bern)

Confusion Matrix and Statistics

            predicted_class
actual_class    0    1    2
           0 1016  391  226
           1   77  495   47
           2  267  355  924

Overall Statistics
                                          
               Accuracy : 0.6411          
                 95% CI : (0.6256, 0.6564)
    No Information Rate : 0.3581          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4599          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: 0 Class: 1 Class: 2
Sensitivity            0.7471   0.3989   0.7719
Specificity            0.7469   0.9515   0.7609
Pos Pred Value         0.6222   0.7997   0.5977
Neg Pred Value         0.8411   0.7653   0.8788
Precision              0.6222   0.7997   0.5977
Recall                 0.7471   0.3989   0.7719
F1                     0.6789   0.5323   0.6737
Prevalence   


---

## **Linear SVM**

---

In [78]:
svm <- textmodel_svm(dfm_train_tfidf, y = quanteda::docvars(dfm_train_tfidf, "labels"))

In [79]:
computeConfusionMatrix(dfm_test_tfidf,dfm_train_tfidf, svm)

Confusion Matrix and Statistics

            predicted_class
actual_class    0    1    2
           0 1151  240  242
           1   49  517   53
           2  184  174 1188

Overall Statistics
                                          
               Accuracy : 0.752           
                 95% CI : (0.7379, 0.7656)
    No Information Rate : 0.3905          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6151          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: 0 Class: 1 Class: 2
Sensitivity            0.8316   0.5553   0.8011
Specificity            0.8003   0.9644   0.8454
Pos Pred Value         0.7048   0.8352   0.7684
Neg Pred Value         0.8924   0.8698   0.8690
Precision              0.7048   0.8352   0.7684
Recall                 0.8316   0.5553   0.8011
F1                     0.7630   0.6671   0.7844
Prevalence   


---

## **Logistic regression**

---

In [80]:
log_reg <- textmodel_lr(dfm_train_tfidf, unlist(quanteda::docvars(dfm_train_tfidf, "labels")))

In [81]:
computeConfusionMatrix(dfm_test_tfidf,dfm_train_tfidf, log_reg)

Confusion Matrix and Statistics

            predicted_class
actual_class    0    1    2
           0 1152  246  235
           1   34  538   47
           2  159  172 1215

Overall Statistics
                                          
               Accuracy : 0.7649          
                 95% CI : (0.7511, 0.7783)
    No Information Rate : 0.3942          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6362          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: 0 Class: 1 Class: 2
Sensitivity            0.8565   0.5628   0.8116
Specificity            0.8039   0.9715   0.8561
Pos Pred Value         0.7055   0.8691   0.7859
Neg Pred Value         0.9109   0.8685   0.8748
Precision              0.7055   0.8691   0.7859
Recall                 0.8565   0.5628   0.8116
F1                     0.7737   0.6832   0.7986
Prevalence   
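
To compare the four models side by side, accuracy and Cohen's kappa can be recomputed directly from the confusion tables printed above. This is a small base-R sketch; the helper names (`make_tab`, `accuracy`, `cohen_kappa`) are hypothetical and the counts are copied from the outputs.

```r
# Confusion tables copied from the four runs (rows = actual, columns = predicted).
make_tab <- function(v) matrix(v, nrow = 3, byrow = TRUE)
tabs <- list(
  nb_multinomial = make_tab(c(1101, 235, 297, 136, 388, 95, 267, 188, 1091)),
  nb_bernoulli   = make_tab(c(1016, 391, 226, 77, 495, 47, 267, 355, 924)),
  linear_svm     = make_tab(c(1151, 240, 242, 49, 517, 53, 184, 174, 1188)),
  logistic_reg   = make_tab(c(1152, 246, 235, 34, 538, 47, 159, 172, 1215))
)

# Share of tweets on the diagonal, i.e. classified correctly.
accuracy <- function(tab) sum(diag(tab)) / sum(tab)

# Cohen's kappa: agreement beyond what the row and column totals alone would give.
cohen_kappa <- function(tab) {
  po <- accuracy(tab)
  pe <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
  (po - pe) / (1 - pe)
}

summary_tab <- data.frame(
  accuracy = sapply(tabs, accuracy),
  kappa    = sapply(tabs, cohen_kappa)
)

# Best model first.
print(round(summary_tab[order(-summary_tab$accuracy), ], 4))
```

On these counts logistic regression comes out on top, followed by the linear SVM and then the two Naive Bayes variants, in line with the accuracies reported above.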