**CLASSIFICATION**

In this notebook, the goal is to use the previously analyzed data to train classification models such as SVM, Random Forest and KNN, and to predict whether a comment expresses a given sentiment. We therefore train a separate binary model for each sentiment. This choice is due to the small number of samples available for each class.
Let's import all the necessary packages.

In [18]:
import spacy
import pandas as pd
import numpy as np
from imblearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, make_scorer

These are the models we compared. We also used a Dummy Classifier as baseline.

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import  KNeighborsClassifier
from sklearn.dummy import DummyClassifier
from sklearn.svm import LinearSVC

As always, we need to read each .csv file

In [3]:
emotions = ['love','anger','fear','surprise','joy','sadness']

df_love = pd.read_csv('C:/users/kecco/Documenti/Github/DataMining-EmotionDetection/data/Processed/Dataset/StackOverflow/love.csv',sep=';',encoding='iso-8859-1')

df_anger = pd.read_csv('C:/users/kecco/Documenti/Github/DataMining-EmotionDetection/data/Processed/Dataset/StackOverflow/anger.csv',sep=';',encoding='iso-8859-1')

df_fear = pd.read_csv('C:/users/kecco/Documenti/Github/DataMining-EmotionDetection/data/Processed/Dataset/StackOverflow/fear.csv',sep=';',encoding='iso-8859-1')

df_surprise = pd.read_csv('C:/users/kecco/Documenti/Github/DataMining-EmotionDetection/data/Processed/Dataset/StackOverflow/surprise.csv',sep=';',encoding='iso-8859-1')

df_joy = pd.read_csv('C:/users/kecco/Documenti/Github/DataMining-EmotionDetection/data/Processed/Dataset/StackOverflow/joy.csv',sep=';',encoding='iso-8859-1')

df_sadness = pd.read_csv('C:/users/kecco/Documenti/Github/DataMining-EmotionDetection/data/Processed/Dataset/StackOverflow/sadness.csv',sep=';',encoding='iso-8859-1')

df_list = [df_love,df_anger,df_fear,df_surprise,df_joy,df_sadness]

Word vectors, or word embeddings, are vectors of numbers that provide information about the meaning of a word, as well as its context.

spaCy’s small pipeline packages (all packages that end in sm) don’t ship with word vectors and only include context-sensitive tensors. So, in order to use real word vectors, we need to download a larger pipeline package (word vectors of this kind can be generated with algorithms such as word2vec):

In [4]:
#!pip install spacy-transformers
# en_core_web_trf

!python -m spacy download en_core_web_lg

nlp = spacy.load("en_core_web_lg")

Collecting en-core-web-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0-py3-none-any.whl (587.7 MB)
     -------------------------------------- 587.7/587.7 MB 2.4 MB/s eta 0:00:00
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')





Every dataframe contains the same set of sentences, so x is the same for every model.

In [5]:
sentences = df_love['Text']

print(sentences)

0       SVG transform on text attribute works excellen...
1       Excellent! This is exactly what I needed. Thanks!
2       Have added a modern solution as of May 2014 in...
3       Have you tried removing 'preload' attribute? (...
4       A smarter, entirely C++-way of doing what you ...
                              ...                        
4795    Yes - that feature is extremely useful for wri...
4796    Works great! And you can add "desc" after the ...
4797    Yeah, I didn't know about the non-greedy thing...
4798    Fortunately I'm doing *very* little with Offic...
4799    Another very fast approach is the [seek method...
Name: Text, Length: 4800, dtype: object


We need a transformer class to plug the spaCy features into the scikit-learn pipeline in the next phase. The transformer needs a fit and a transform method; in transform we apply the preprocessing steps that clean the text and then extract each comment's vector.

spaCy packages that come with built-in word vectors make them available as the Token.vector attribute.
Doc.vector and Span.vector default to the average of their token vectors. Because of the averaging, the vector of a multi-token text is insensitive to word order.
Each vector is a one-dimensional NumPy array of 300 floats. The sentence vector has the same shape as a word vector because it is the average of the vectors of the words in the sentence.
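
As a small illustration of the averaging described above, here is a sketch with made-up 4-dimensional vectors instead of the 300-dimensional spaCy ones. The `toy_vectors` dictionary and the `sentence_vector` helper are hypothetical names used only for this example.

```python
import numpy as np

# Hypothetical 4-dimensional "word vectors" (real en_core_web_lg vectors have 300).
toy_vectors = {
    "works": np.array([1.0, 0.0, 2.0, 0.0]),
    "great": np.array([0.0, 3.0, 0.0, 1.0]),
    "not":   np.array([2.0, 0.0, 0.0, 4.0]),
}

def sentence_vector(tokens):
    """Average the token vectors, mimicking the default behaviour of Doc.vector."""
    return np.mean([toy_vectors[t] for t in tokens], axis=0)

a = sentence_vector(["works", "not", "great"])
b = sentence_vector(["not", "works", "great"])

print(a.shape)            # same shape as a single word vector
print(np.allclose(a, b))  # word order does not change the average
```

Both orderings give the same vector, which is why this representation cannot distinguish sentences that differ only in word order.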

In [6]:
class WordVectorTransformer(TransformerMixin, BaseEstimator):
    """Turn raw comments into spaCy document vectors (average of token vectors)."""

    def __init__(self, model="en_core_web_lg"):
        self.model = model

    def fit(self, X, y=None):
        # Nothing to learn: the word vectors are pretrained.
        return self

    def transform(self, X):
        nlp = spacy.load(self.model)

        # Drop stop words, digits, punctuation, brackets, numbers, URLs and quotes.
        cleaned = []
        for text in X:
            tokens = [
                token.text for token in nlp(text)
                if not (token.is_stop or token.is_digit or token.is_punct
                        or token.is_bracket or token.like_num
                        or token.like_url or token.is_quote)
            ]
            cleaned.append(' '.join(tokens))

        # One row per comment: the average of the remaining token vectors.
        return np.concatenate([nlp(doc).vector.reshape(1, -1) for doc in cleaned])

Now we need to associate each sentence with a binary label (y) for the chosen sentiment.

In [7]:
# i is the index of the emotion in the emotions list

def set_labels(df,i):
    y = df['label']
    y = y.fillna(0)
    y = y.map({emotions[i].upper(): 1, 0: 0}).astype(int)
    return y

x = sentences

y = set_labels(df_anger,1)
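
To make the label mapping concrete, here is a minimal sketch on a made-up dataframe. The `toy` dataframe and its contents are hypothetical, and it assumes, as `set_labels` does, that the label column holds either the upper-case emotion name or a missing value.

```python
import pandas as pd

# Hypothetical miniature of one emotion file: 'label' is the emotion name or missing.
# dtype=object is set explicitly so the column can hold both strings and the filler 0.
toy = pd.DataFrame({
    "Text": ["This is so annoying", "Thanks, it works", "I hate this API", "See the docs"],
    "label": pd.Series(["ANGER", None, "ANGER", None], dtype=object),
})

# Same mapping as set_labels(df, 1): missing -> 0, 'ANGER' -> 1.
y_map = toy["label"].fillna(0).map({"ANGER": 1, 0: 0}).astype(int)

# Comparison-based form: any value other than 'ANGER' becomes 0.
y_eq = (toy["label"] == "ANGER").astype(int)

print(y_map.tolist())
print(y_eq.tolist())
```

The two forms agree on this toy data. They differ only if the column contains an unexpected value: the dictionary mapping turns it into NaN, so `astype(int)` fails, while the comparison silently labels it 0.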

SVM

We create a pipeline with 3 steps:
- the WordVectorTransformer class created previously, which preprocesses the text and extracts the vectors
- a SMOTE instance, which oversamples the minority class with the SMOTE algorithm. We also tried a RandomUnderSampler, but we found SMOTE to work better.
- a LinearSVC instance as the classifier

In [28]:
steps = [
    ("scaler", WordVectorTransformer()),
    ("sampler", SMOTE()),
    ("classification", LinearSVC(max_iter=10000))
]

pipeline = Pipeline(steps)
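
The idea behind the SMOTE step can be sketched in a few lines of NumPy. This is a simplified illustration of the interpolation, not imblearn's implementation; the `minority` points and the `smote_like_sample` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in 2-D (the real inputs would be 300-D vectors).
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def smote_like_sample(X, k=2):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(X))
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # skip position 0: the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                        # interpolation factor in [0, 1)
    return X[i] + lam * (X[j] - X[i])

synthetic = np.array([smote_like_sample(minority) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies on the segment between two existing minority samples. In an imblearn Pipeline the sampler is applied only while fitting, so the validation folds keep their original class distribution.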


We also define a helper that prints the classification report for each fold and returns the accuracy, and we pass it to cross_val_score as a scorer. Cross-validation uses StratifiedKFold with n_splits=5, which builds 5 folds that preserve the percentage of samples of each class.
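
The stratification can be checked on toy labels. This is a small sketch with hypothetical data (`X_toy`, `y_toy`), not the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 20 positives out of 100 samples.
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
positives_per_fold = [int(y_toy[test_idx].sum()) for _, test_idx in cv.split(X_toy, y_toy)]

print(positives_per_fold)
```

Every test fold receives the same share of positive samples, which is why the support columns in the reports below are nearly identical from fold to fold.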

In [29]:
def classification_report_with_accuracy_score(y_test, y_pred):

    print(classification_report(y_test, y_pred)) # print classification report
    return accuracy_score(y_test, y_pred) # return accuracy score

svm_results = cross_val_score(
    pipeline,
    x,
    y,
    cv = StratifiedKFold(shuffle=True, random_state=15, n_splits=5),
    scoring=make_scorer(classification_report_with_accuracy_score)
)

print(svm_results)



              precision    recall  f1-score   support

           0       0.91      0.77      0.84       784
           1       0.40      0.66      0.50       176

    accuracy                           0.75       960
   macro avg       0.65      0.72      0.67       960
weighted avg       0.82      0.75      0.78       960





              precision    recall  f1-score   support

           0       0.92      0.76      0.83       784
           1       0.40      0.72      0.52       176

    accuracy                           0.75       960
   macro avg       0.66      0.74      0.68       960
weighted avg       0.83      0.75      0.78       960





              precision    recall  f1-score   support

           0       0.91      0.80      0.85       784
           1       0.43      0.65      0.52       176

    accuracy                           0.78       960
   macro avg       0.67      0.73      0.69       960
weighted avg       0.82      0.78      0.79       960





              precision    recall  f1-score   support

           0       0.92      0.82      0.87       783
           1       0.46      0.68      0.55       177

    accuracy                           0.79       960
   macro avg       0.69      0.75      0.71       960
weighted avg       0.83      0.79      0.81       960





              precision    recall  f1-score   support

           0       0.93      0.77      0.84       783
           1       0.43      0.76      0.54       177

    accuracy                           0.77       960
   macro avg       0.68      0.76      0.69       960
weighted avg       0.84      0.77      0.79       960

[0.75416667 0.753125   0.77604167 0.79479167 0.76666667]


RANDOM FOREST

In [13]:
steps = [
    ("scaler", WordVectorTransformer()),
    ("sampler", SMOTE()),
    ("classification", RandomForestClassifier())
]

pipeline = Pipeline(steps)

rf_results = cross_val_score(
    pipeline,
    x,
    y,
    cv = StratifiedKFold(shuffle=True, random_state=15, n_splits=5),
    scoring=make_scorer(classification_report_with_accuracy_score)
)

print(rf_results)

              precision    recall  f1-score   support

           0       0.86      0.93      0.90       784
           1       0.53      0.34      0.41       176

    accuracy                           0.82       960
   macro avg       0.69      0.63      0.65       960
weighted avg       0.80      0.82      0.81       960

              precision    recall  f1-score   support

           0       0.85      0.95      0.90       784
           1       0.54      0.27      0.36       176

    accuracy                           0.82       960
   macro avg       0.70      0.61      0.63       960
weighted avg       0.80      0.82      0.80       960

              precision    recall  f1-score   support

           0       0.86      0.91      0.89       784
           1       0.47      0.35      0.40       176

    accuracy                           0.81       960
   macro avg       0.66      0.63      0.64       960
weighted avg       0.79      0.81      0.80       960

[output truncated]

KNN

In [30]:
steps = [
    ("scaler", WordVectorTransformer()),
    ("sampler", SMOTE()),
    ("classification", KNeighborsClassifier())
]

pipeline = Pipeline(steps)

knn_results = cross_val_score(
    pipeline,
    x,
    y,
    cv = StratifiedKFold(shuffle=True, random_state=15, n_splits=5),
    scoring=make_scorer(classification_report_with_accuracy_score)
)

print(knn_results)

              precision    recall  f1-score   support

           0       0.95      0.41      0.57       784
           1       0.26      0.90      0.40       176

    accuracy                           0.50       960
   macro avg       0.60      0.66      0.49       960
weighted avg       0.82      0.50      0.54       960

              precision    recall  f1-score   support

           0       0.94      0.34      0.50       784
           1       0.24      0.90      0.37       176

    accuracy                           0.45       960
   macro avg       0.59      0.62      0.44       960
weighted avg       0.81      0.45      0.48       960

              precision    recall  f1-score   support

           0       0.96      0.39      0.56       784
           1       0.25      0.93      0.40       176

    accuracy                           0.49       960
   macro avg       0.61      0.66      0.48       960
weighted avg       0.83      0.49      0.53       960

[output truncated]

DUMMY

In [31]:
steps = [
    ("scaler", WordVectorTransformer()),
    ("sampler", SMOTE()),
    ("classification", DummyClassifier())
]

pipeline = Pipeline(steps)

dummy_results = cross_val_score(
    pipeline,
    x,
    y,
    cv = StratifiedKFold(shuffle=True, random_state=15, n_splits=5),
    scoring=make_scorer(classification_report_with_accuracy_score)
)

print(dummy_results)

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.82      1.00      0.90       784
           1       0.00      0.00      0.00       176

    accuracy                           0.82       960
   macro avg       0.41      0.50      0.45       960
weighted avg       0.67      0.82      0.73       960



KeyboardInterrupt: 

Let's try the other sentiment that was well represented, LOVE

In [32]:
y = set_labels(df_love,0)

SVM

In [33]:
steps = [
    ("scaler", WordVectorTransformer()),
    ("sampler", SMOTE()),
    ("classification", LinearSVC(max_iter=10000))
]

pipeline = Pipeline(steps)

svm_results_love = cross_val_score(
    pipeline,
    x,
    y,
    cv = StratifiedKFold(shuffle=True, random_state=15, n_splits=5),
    scoring=make_scorer(classification_report_with_accuracy_score)
)

print(svm_results_love)



              precision    recall  f1-score   support

           0       0.91      0.83      0.87       716
           1       0.61      0.77      0.68       244

    accuracy                           0.81       960
   macro avg       0.76      0.80      0.77       960
weighted avg       0.83      0.81      0.82       960





              precision    recall  f1-score   support

           0       0.91      0.83      0.87       716
           1       0.61      0.76      0.68       244

    accuracy                           0.81       960
   macro avg       0.76      0.80      0.77       960
weighted avg       0.83      0.81      0.82       960





              precision    recall  f1-score   support

           0       0.90      0.82      0.86       716
           1       0.58      0.72      0.64       244

    accuracy                           0.79       960
   macro avg       0.74      0.77      0.75       960
weighted avg       0.82      0.79      0.80       960





              precision    recall  f1-score   support

           0       0.90      0.80      0.85       716
           1       0.56      0.73      0.63       244

    accuracy                           0.79       960
   macro avg       0.73      0.77      0.74       960
weighted avg       0.81      0.79      0.79       960





              precision    recall  f1-score   support

           0       0.91      0.84      0.87       716
           1       0.62      0.75      0.68       244

    accuracy                           0.82       960
   macro avg       0.76      0.80      0.78       960
weighted avg       0.83      0.82      0.82       960

[0.81458333 0.81354167 0.79479167 0.78541667 0.81770833]


RANDOM FOREST

In [34]:
steps = [
    ("scaler", WordVectorTransformer()),
    ("sampler", SMOTE()),
    ("classification", RandomForestClassifier())
]

pipeline = Pipeline(steps)

rf_results_love = cross_val_score(
    pipeline,
    x,
    y,
    cv = StratifiedKFold(shuffle=True, random_state=15, n_splits=5),
    scoring=make_scorer(classification_report_with_accuracy_score)
)

print(rf_results_love)

              precision    recall  f1-score   support

           0       0.86      0.93      0.89       716
           1       0.73      0.57      0.64       244

    accuracy                           0.84       960
   macro avg       0.80      0.75      0.77       960
weighted avg       0.83      0.84      0.83       960

              precision    recall  f1-score   support

           0       0.86      0.92      0.89       716
           1       0.71      0.57      0.64       244

    accuracy                           0.83       960
   macro avg       0.79      0.75      0.76       960
weighted avg       0.83      0.83      0.83       960

              precision    recall  f1-score   support

           0       0.87      0.93      0.90       716
           1       0.75      0.58      0.65       244

    accuracy                           0.84       960
   macro avg       0.81      0.76      0.77       960
weighted avg       0.84      0.84      0.84       960

[output truncated]

With SADNESS

In [35]:
y = set_labels(df_sadness,5)

SVM

In [36]:
steps = [
    ("scaler", WordVectorTransformer()),
    ("sampler", SMOTE()),
    ("classification", LinearSVC(max_iter=10000))
]

pipeline = Pipeline(steps)

svm_results_sadness = cross_val_score(
    pipeline,
    x,
    y,
    cv = StratifiedKFold(shuffle=True, random_state=15, n_splits=5),
    scoring=make_scorer(classification_report_with_accuracy_score)
)

print(svm_results_sadness)



              precision    recall  f1-score   support

           0       0.96      0.83      0.89       914
           1       0.10      0.37      0.16        46

    accuracy                           0.81       960
   macro avg       0.53      0.60      0.53       960
weighted avg       0.92      0.81      0.86       960





              precision    recall  f1-score   support

           0       0.97      0.84      0.90       914
           1       0.14      0.54      0.23        46

    accuracy                           0.82       960
   macro avg       0.56      0.69      0.56       960
weighted avg       0.93      0.82      0.87       960





              precision    recall  f1-score   support

           0       0.97      0.86      0.91       914
           1       0.13      0.39      0.19        46

    accuracy                           0.84       960
   macro avg       0.55      0.63      0.55       960
weighted avg       0.93      0.84      0.88       960





              precision    recall  f1-score   support

           0       0.97      0.85      0.91       914
           1       0.14      0.46      0.21        46

    accuracy                           0.84       960
   macro avg       0.55      0.66      0.56       960
weighted avg       0.93      0.84      0.87       960





              precision    recall  f1-score   support

           0       0.97      0.81      0.88       914
           1       0.13      0.59      0.22        46

    accuracy                           0.79       960
   macro avg       0.55      0.70      0.55       960
weighted avg       0.93      0.79      0.85       960

[0.81145833 0.82291667 0.840625   0.83541667 0.79479167]


RANDOM FOREST

In [37]:
steps = [
    ("scaler", WordVectorTransformer()),
    ("sampler", SMOTE()),
    ("classification", RandomForestClassifier())
]

pipeline = Pipeline(steps)

rf_results_sadness = cross_val_score(
    pipeline,
    x,
    y,
    cv = StratifiedKFold(shuffle=True, random_state=15, n_splits=5),
    scoring=make_scorer(classification_report_with_accuracy_score)
)

print(rf_results_sadness)

              precision    recall  f1-score   support

           0       0.96      0.98      0.97       914
           1       0.22      0.09      0.12        46

    accuracy                           0.94       960
   macro avg       0.59      0.54      0.55       960
weighted avg       0.92      0.94      0.93       960

              precision    recall  f1-score   support

           0       0.96      0.99      0.98       914
           1       0.50      0.20      0.28        46

    accuracy                           0.95       960
   macro avg       0.73      0.59      0.63       960
weighted avg       0.94      0.95      0.94       960

              precision    recall  f1-score   support

           0       0.96      0.99      0.97       914
           1       0.37      0.15      0.22        46

    accuracy                           0.95       960
   macro avg       0.66      0.57      0.59       960
weighted avg       0.93      0.95      0.94       960

[output truncated]

**CONCLUSIONS**

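For our case study, we think the per-class precision and recall matter more than the single accuracy value, so we used classification reports to evaluate each cross-validation run. The Dummy baseline shows why: on ANGER it reaches 0.82 accuracy while never predicting the positive class. Macro averages could be used to compare the models.
Generally, the SVM performs better: on ANGER its recall on the positive class is 0.65-0.76, with a macro F1 of 0.67-0.71. The Random Forest sometimes scores well too, for example on LOVE (macro F1 of 0.76-0.77 in the folds shown), but on ANGER its accuracy (0.81-0.82) is no better than the baseline and its positive-class recall is only 0.27-0.35. KNN does not perform as well as the other two: its positive-class recall is about 0.90, but its precision is about 0.25 and its accuracy about 0.50.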
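
One possible way to compare models on macro averages is scikit-learn's cross_validate with several scorers. The sketch below runs on synthetic data from make_classification rather than on the comment vectors, so `X_toy`, `y_toy` and the resulting numbers are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import LinearSVC

# Synthetic, imbalanced stand-in for the comment vectors (not the notebook's data).
X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, weights=[0.8, 0.2], random_state=0
)

models = {
    "svm": LinearSVC(max_iter=10000),
    "random_forest": RandomForestClassifier(random_state=0),
    "dummy": DummyClassifier(strategy="most_frequent"),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
scoring = ["accuracy", "recall_macro", "f1_macro"]

summary = {}
for name, model in models.items():
    scores = cross_validate(model, X_toy, y_toy, cv=cv, scoring=scoring)
    # Average each metric over the 5 folds.
    summary[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, {m: round(v, 2) for m, v in summary[name].items()})
```

A classifier that always predicts the majority class gets a macro recall of 0.5 whatever its accuracy, so the macro columns separate it from models that actually detect the minority class.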