<a href="https://colab.research.google.com/github/MatteoGuglielmi-tech/Polarity-and-Subjectivity-Detection/blob/main/src/BaselineModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Baseline model:
The baseline is obtained with a Multinomial Naive Bayes classifier. 
The code is partly taken from the laboratory session dedicated to sentiment analysis (SA).

## Importing modules and downloading archives
The following cell imports the modules needed to obtain a reference accuracy for later models to surpass.

In [None]:
import nltk
from nltk.corpus import movie_reviews
from nltk.corpus import subjectivity
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.metrics import classification_report
from typing import Dict, List, Tuple
from nltk.sentiment.util import mark_negation
import pandas as pd

Downloading the Punkt tokenizer models from NLTK. These are used in the preprocessing phase to split text into sentences.

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Downloading the movie reviews dataset. In this project, it serves as the polarity dataset on which classification is performed.

In [None]:
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

Downloading the subjectivity dataset, used to recognize whether a given sentence expresses a subjective opinion or not.

In [None]:
nltk.download('subjectivity')

[nltk_data] Downloading package subjectivity to /root/nltk_data...
[nltk_data]   Package subjectivity is already up-to-date!


True

### Subjectivity

In [None]:
def subj_negative_marking(sent: List[str]) -> str:
    ''' Mark negated words, with double negation flipping

        Parameters :
        ------------
            sent : list(str)
                sentence, organized as a list of words, to which negation marking is applied
        
        Return :
        ------------
            str: 
                Processed sentence
    '''

    # https://www.nltk.org/api/nltk.sentiment.util.html#nltk.sentiment.util.mark_negation -> wants a list
    # words that fall inside a negation scope get a "_NEG" suffix
    negated_doc = mark_negation(sent, double_neg_flip=True)
    return " ".join(negated_doc)
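
To make the effect of negation marking concrete, here is a minimal pure-Python sketch of the idea behind `mark_negation`. It is a simplified re-implementation for illustration only, not NLTK's code: the name `mark_negation_sketch` and the small `NEGATIONS` and `CLAUSE_PUNCT` sets are assumptions made here, whereas NLTK relies on broader regular expressions.

```python
# Hypothetical, deliberately small word lists (NLTK uses regular expressions instead).
NEGATIONS = {"not", "no", "never", "n't", "cannot"}
CLAUSE_PUNCT = {".", ":", ";", "!", "?"}


def mark_negation_sketch(tokens, double_neg_flip=False):
    """Append "_NEG" to tokens that follow a negation, up to the next clause punctuation."""
    marked = []
    in_scope = False
    for tok in tokens:
        if tok.lower() in NEGATIONS:
            if not in_scope or double_neg_flip:
                # a negation opens a scope; with flipping, a second one closes it again
                in_scope = not in_scope
                marked.append(tok)
            else:
                # without flipping, a nested negation is marked like any other word
                marked.append(tok + "_NEG")
        elif in_scope and tok in CLAUSE_PUNCT:
            # clause-level punctuation ends the negation scope
            in_scope = False
            marked.append(tok)
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
    return marked


single = mark_negation_sketch("I did not like this film . Great cast".split())
flipped = mark_negation_sketch(["not", "never", "boring"], double_neg_flip=True)
unflipped = mark_negation_sketch(["not", "never", "boring"], double_neg_flip=False)
print(" ".join(single))
print(" ".join(flipped))
print(" ".join(unflipped))
```

With `double_neg_flip=True`, the second negation cancels the first, so the words after it are left unmarked.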

In the following cell, subjective and objective sentences are fetched and a single corpus is built by concatenating the two lists.

In [None]:
subj_docs = [sent for sent in subjectivity.sents(categories='subj')]
obj_docs = [sent for sent in subjectivity.sents(categories='obj')]
corp = subj_docs+obj_docs

The negation marking function defined above is applied sentence-wise to the whole corpus.

In [None]:
subj_corpus = [subj_negative_marking(los) for los in corp]
subj_labels = numpy.array([1] * len(subj_docs) + [0] * len(obj_docs))

A [Count Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.transform) and a [Naive Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) are initialized. These will be used for:
- turning sentences into vectors of token counts
- turning those vectors into predictions, from which accuracy is computed

respectively.
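
As a rough illustration of the first step, the sketch below builds count vectors by hand using only the standard library. It assumes the default `CountVectorizer` behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the helper names `tokenize`, `fit_vocabulary` and `transform` are hypothetical, and the real vectorizer returns a sparse matrix rather than nested lists.

```python
import re
from collections import Counter

# Assumed to mirror CountVectorizer's default token pattern.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of at least two word characters."""
    return TOKEN_RE.findall(text.lower())


def fit_vocabulary(corpus):
    """Map every distinct token to a column index, in alphabetical order."""
    vocab = sorted({tok for text in corpus for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(vocab)}


def transform(corpus, vocabulary):
    """Turn each text into a dense list of counts, one entry per vocabulary token."""
    rows = []
    for text in corpus:
        counts = Counter(tokenize(text))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


toy_corpus = ["not good_NEG at_NEG all_NEG", "a good good film"]
toy_vocab = fit_vocabulary(toy_corpus)
toy_vectors = transform(toy_corpus, toy_vocab)
print(toy_vocab)
print(toy_vectors)
```

Since the underscore counts as a word character, a marked token such as `good_NEG` stays a single feature, distinct from `good`.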

In [None]:
vectorizer = CountVectorizer()
classifier = MultinomialNB()

In the following, the vectorizer first transforms each sentence into a vector of token counts, which is the input to the classifier. Accuracy is then estimated with the [cross_validate](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) function from scikit-learn. In this particular case, a stratified 10-fold cross-validation is performed.
Note that the flag `return_estimator=True` adds the estimator fitted on each split to the returned dictionary. This is exploited to extract the classifier with the highest test accuracy across all splits.
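
The idea of a stratified split, where each fold keeps roughly the same class proportions as the whole dataset, can be sketched as follows. The round-robin assignment and the name `stratified_folds` are assumptions chosen for brevity; scikit-learn's `StratifiedKFold` allocates samples differently, although it also preserves class proportions.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Assign sample indices to folds so that each class is spread evenly."""
    folds = [[] for _ in range(n_splits)]
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # deal the indices of each class to the folds in turn
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return [sorted(fold) for fold in folds]


toy_labels = [1] * 6 + [0] * 4
toy_folds = stratified_folds(toy_labels, n_splits=2)
for fold in toy_folds:
    positives = sum(toy_labels[i] for i in fold)
    print(fold, "positives:", positives, "negatives:", len(fold) - positives)
```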

In [None]:
# building sparse matrix with count vectors
vectors = vectorizer.fit_transform(subj_corpus)

# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html -> see return estimator here
scores = cross_validate(classifier, vectors, subj_labels, cv=StratifiedKFold(n_splits=10), scoring=['accuracy'], return_estimator=True)
scores

{'fit_time': array([0.00653934, 0.00634074, 0.0064044 , 0.00648046, 0.00838113,
        0.00634336, 0.00632811, 0.00621843, 0.00624347, 0.00624967]),
 'score_time': array([0.00110602, 0.0010879 , 0.00101066, 0.00099754, 0.00103593,
        0.00098228, 0.00099468, 0.00098443, 0.00098181, 0.00107074]),
 'estimator': [MultinomialNB(),
  MultinomialNB(),
  MultinomialNB(),
  MultinomialNB(),
  MultinomialNB(),
  MultinomialNB(),
  MultinomialNB(),
  MultinomialNB(),
  MultinomialNB(),
  MultinomialNB()],
 'test_accuracy': array([0.89 , 0.909, 0.919, 0.894, 0.918, 0.912, 0.912, 0.927, 0.896,
        0.898])}

In [None]:
# classifier with the highest test accuracy across all folds
best_score_idx = scores["test_accuracy"].argmax()
best_est = scores['estimator'][best_score_idx]
print(f"Chosen {best_est} estimator with peak accuracy of : {scores['test_accuracy'][best_score_idx]}")

Chosen MultinomialNB() estimator with peak accuracy of : 0.927
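
The selection step can be replayed on the fold accuracies reported in the output above, using plain Python. The variable names are illustrative. Because the best fold is picked by its own test score, the peak value is likely an optimistic figure, so the mean across folds is computed alongside it for comparison.

```python
from statistics import mean

# test accuracies of the 10 folds, copied from the cross_validate output
fold_accuracies = [0.89, 0.909, 0.919, 0.894, 0.918, 0.912, 0.912, 0.927, 0.896, 0.898]

# index of the fold with the highest accuracy (what argmax returns for this data)
best_fold = max(range(len(fold_accuracies)), key=fold_accuracies.__getitem__)
peak_accuracy = fold_accuracies[best_fold]
mean_accuracy = mean(fold_accuracies)

print(f"best fold: {best_fold}, peak accuracy: {peak_accuracy}")
print(f"mean accuracy across folds: {mean_accuracy:.4f}")
```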


### Polarity

In [None]:
def pol_negative_marking(doc: List[List[str]]) -> str:
    ''' Flatten a document and mark negated words, with double negation flipping

        Parameters:
        ------------
            doc : list[list[str]]
                document where each element is a sentence given as a list of words
        Returns :
        ------------
            str :
                document after negation marking
    '''

    # flatten the sentences into a single list of words, as mark_negation expects
    flat_doc = [w for sent in doc for w in sent]
    negated_doc = mark_negation(flat_doc, double_neg_flip=True)

    return " ".join(negated_doc)

In [None]:
def filter_objectiveness(doc: List[List[List[str]]],
                         labels: List[int],
                         vect: CountVectorizer, 
                         clf: MultinomialNB
                         ) -> Tuple[List[str], List[int]]:
    ''' This function filters documents based on the prediction of a classifier.
    Only the documents predicted as belonging to class 1 are kept. In this case class 1
    corresponds to "Subjective".

        Parameters :
        ------------
            doc : list(list(list(str)))
                corpus of documents, each a list of sentences given as lists of words
            labels : list(int)
                corresponding labels of each document
            vect : CountVectorizer
                fitted vectorizer used to encode the documents
            clf : MultinomialNB
                fitted classifier used to make predictions

        Returns :
        -----------
            df_pol_list : List[str]
                list containing all the documents predicted as label 1
            df_pol_label : List[int]
                ground truth of the documents predicted as subjective
    '''
    # https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.transform

    # negation marking is applied once; each document becomes a single string
    original_corpus = [pol_negative_marking(d) for d in doc]
    # encode with the vectorizer passed as argument, not a global one
    pol_vectors = vect.transform(original_corpus)
    preds = clf.predict(pol_vectors)

    # one row per document: text, ground-truth label, predicted subjectivity
    df_pol = pd.DataFrame({'text': original_corpus,
                           'labels': labels,
                           'predictions': preds})

    # keep only the documents predicted as subjective (class 1)
    df_pol = df_pol.loc[df_pol['predictions'] == 1]
    df_pol_list = df_pol.text.values.tolist()
    df_pol_label = df_pol.labels.values.tolist()

    return df_pol_list, df_pol_label
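
Stripped of the pandas bookkeeping, the filtering logic amounts to keeping the (text, label) pairs whose prediction is 1. The sketch below shows this with plain lists; the function name `keep_predicted_subjective` and the toy data are hypothetical, and the predictions are hard-coded instead of coming from a classifier.

```python
def keep_predicted_subjective(texts, labels, predictions):
    """Return the texts and ground-truth labels whose prediction equals 1."""
    kept = [(text, label)
            for text, label, pred in zip(texts, labels, predictions)
            if pred == 1]
    kept_texts = [text for text, _ in kept]
    kept_labels = [label for _, label in kept]
    return kept_texts, kept_labels


toy_texts = ["a moving story", "the film runs 90 minutes", "dull and lifeless"]
toy_polarity = [1, 1, 0]        # ground-truth polarity labels
toy_subjectivity = [1, 0, 1]    # stand-in for the subjectivity classifier output

kept_texts, kept_labels = keep_predicted_subjective(toy_texts, toy_polarity, toy_subjectivity)
print(kept_texts)
print(kept_labels)
```

The polarity labels travel with their texts, so the ground truth stays aligned after filtering.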

In the following cells, the polarity dataset is first loaded and a new corpus is built by concatenating the positive and negative documents. 
Subsequently, documents predicted as objective are filtered out with the `filter_objectiveness` function defined above.

In [None]:
mr = movie_reviews
neg = mr.paras(categories='neg')
pos = mr.paras(categories='pos')
cor = pos+neg
# labels follow the order of `cor`: positive documents first, then negative ones
pol_labels = numpy.array([1] * len(pos) + [0] * len(neg))

In [None]:
df_pol_list, df_pol_labels = filter_objectiveness(cor, pol_labels, vectorizer, best_est)

Now, a brand new vectorizer and a new classifier are instantiated to act upon the pre-processed `movie_reviews` dataset.

In [None]:
# instantiating a new vectorizer and classifier
pol_vec = CountVectorizer()
pol_clf = MultinomialNB()

In [None]:
pol_vectors = pol_vec.fit_transform(df_pol_list)

Last but not least, as for subjectivity classification, a 10-fold cross-validation is performed to assess the performance of the model.  
The resulting average polarity classification accuracy across the 10 splits is `83.2%`.

In [None]:
# 10-fold cross-validation
scores = cross_validate(pol_clf, pol_vectors, df_pol_labels, cv=StratifiedKFold(n_splits=10), scoring=['accuracy'])
average = sum(scores['test_accuracy'])/len(scores['test_accuracy'])
print(f"Baseline : {round(average,3)} ACC")

Baseline : 0.832 ACC
