# SENTIMENT ANALYSIS

En este ejercicio nos vamos a basar en los ejemplos que vimos en casa y en los que vienen en el libro "Practical Machine Learning with Python" para hacer un estudio de un modelo de *Sentiment Analysis*.

Empezamos instalando los paquetes necesarios que vamos a usar:

In [1]:
!pip install spacy

Collecting spacy
[?25l  Downloading https://files.pythonhosted.org/packages/24/de/ac14cd453c98656d6738a5669f96a4ac7f668493d5e6b78227ac933c5fd4/spacy-2.0.12.tar.gz (22.0MB)
[K    100% |████████████████████████████████| 22.0MB 1.7MB/s 
Collecting murmurhash<0.29,>=0.28 (from spacy)
  Downloading https://files.pythonhosted.org/packages/5e/31/c8c1ecafa44db30579c8c457ac7a0f819e8b1dbc3e58308394fff5ff9ba7/murmurhash-0.28.0.tar.gz
Collecting cymem<1.32,>=1.30 (from spacy)
  Downloading https://files.pythonhosted.org/packages/f8/9e/273fbea507de99166c11cd0cb3fde1ac01b5bc724d9a407a2f927ede91a1/cymem-1.31.2.tar.gz
Collecting preshed<2.0.0,>=1.0.0 (from spacy)
[?25l  Downloading https://files.pythonhosted.org/packages/be/fc/09684555ce0ee7086675e6be698e4efeb6d9b315fd5aa96bed347572282b/preshed-1.0.1.tar.gz (112kB)
[K    100% |████████████████████████████████| 122kB 20.8MB/s 
[?25hCollecting thinc<6.11.0,>=6.10.3 (from spacy)
[?25l  Downloading https://files.pythonhosted.org/packages/94/b1/47a88

In [2]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
!python -m spacy download en_core_web_md

Collecting en_core_web_md==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.0.0/en_core_web_md-2.0.0.tar.gz#egg=en_core_web_md==2.0.0
[?25l  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.0.0/en_core_web_md-2.0.0.tar.gz (120.8MB)
[K    100% |████████████████████████████████| 120.9MB 46.7MB/s 
[?25hInstalling collected packages: en-core-web-md
  Running setup.py install for en-core-web-md ... [?25l- \ | / - \ | / - \ | done
[?25hSuccessfully installed en-core-web-md-2.0.0

[93m    Linking successful[0m
    /usr/local/lib/python3.6/dist-packages/en_core_web_md -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en_core_web_md

    You can now load the model via spacy.load('en_core_web_md')



In [0]:
import spacy
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import re
from bs4 import BeautifulSoup
import unicodedata

nlp = spacy.load('en_core_web_md')
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

np.set_printoptions(precision=2, linewidth=80)

Subimos nuestro dataset. Lo cargamos y comprobamos su tamaño.

In [5]:
from google.colab import files
uploaded = files.upload()

Saving train_sentiment_utf8.csv to train_sentiment_utf8.csv


In [6]:
dataset = pd.read_csv(r'train_sentiment_utf8.csv')

# take a peek at the data
print(dataset.head())

   ItemID  Sentiment                                      SentimentText
0       1          0                       is so sad for my APL frie...
1       2          0                     I missed the New Moon trail...
2       3          1                            omg its already 7:30 :O
3       4          0            .. Omgaga. Im sooo  im gunna CRy. I'...
4       5          0           i think mi bf is cheating on me!!!   ...


In [7]:
dataset.shape

(99989, 3)

En esta primera fase vamos a preprocesar el texto y normalizarlo. Para ello creamos una serie de funciones a las que luego llamaremos para que se encargan de prepar nuestro texto. Las funciones que usamos son:

*   Strip_html_tags: eliminamos etiquetas HTML. Nos ayudamos de la librería BeatifulSoup.
*   Remove_accented_chars: quitamos los caracteres acentuados a su equivalente en ASCII.
*   Remove_special_characters: suprimimos los caracteres especiales.
*   Lemmatize_text: eliminamos sufijos y nos quedamos con la raíz de la palabra.
*   Remove stopwords: suprimimos aquellas palabras que no tienen mucha importancia.

También aprovechamos para borrar saltos de línea, espacios innecesarios y demás. Todo ello desde la función normalize_corpus.



In [0]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

In [0]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

In [0]:
def remove_special_characters(text):
    text = re.sub('[^a-zA-z0-9\s]', '', text)
    return text

In [0]:
def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

In [0]:
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

In [0]:
def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
        # remove extra newlines
        doc = re.sub(r'[\r|\n|\r\n]+', ' ',doc)
        # insert spaces between special characters to isolate them    
        special_char_pattern = re.compile(r'([{.(-)!}])')
        doc = special_char_pattern.sub(" \\1 ", doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters    
        if special_char_removal:
            doc = remove_special_characters(doc)  
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
            
        normalized_corpus.append(doc)
        
    return normalized_corpus

Los pasos que seguimos ahora (para un problema de Sentiment Analysis con aprendizaje supervisado) son:


1.   Preparar datasets de *train* y *test*.
2.   Preprocesar y normalizar los datasets.
3.   Obtener características.
4.   Entrenar el modelo.
5.   Evaluación y predicción del modelo.


Dividimos nuestro dataset en *train* y *test*. Teníamos 99989 filas así que, para hacer una proporción de 80%-20% vamos a dividirlo por la fila 79000.

In [0]:
sentimentTexts = np.array(dataset['SentimentText'])
sentiments = np.array(dataset['Sentiment'])

# build train and test datasets
train_sentimentTexts = sentimentTexts[:79000]
train_sentiments = sentiments[:79000]
test_sentimentTexts = sentimentTexts[79000:]
test_sentiments = sentiments[79000:]

Subimos el fichero model_evaluation_utils.py que utilizaremos para nuestros modelos.

In [15]:
from google.colab import files
src = list(files.upload().values())[0]

Saving model_evaluation_utils.py to model_evaluation_utils.py


In [0]:
open('model_evaluation_utils.py','wb').write(src)
import model_evaluation_utils as meu

Vamos con el punto 2 y normalizamos el texto de nuestro dataset.

In [0]:
# normalize datasets
norm_train_sentimentTexts = normalize_corpus(train_sentimentTexts)
norm_test_sentimentTexts = normalize_corpus(test_sentimentTexts)

Para la obtención de características, usamos el modelo BOW y TF-IDF

In [0]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# build BOW features on train reviews
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1,2))
cv_train_features = cv.fit_transform(norm_train_sentimentTexts)
# build TFIDF features on train reviews
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0, ngram_range=(1,2),
                     sublinear_tf=True)
tv_train_features = tv.fit_transform(norm_train_sentimentTexts)

In [0]:
# transform test reviews into features
cv_test_features = cv.transform(norm_test_sentimentTexts)
tv_test_features = tv.transform(norm_test_sentimentTexts)

In [20]:
print('BOW model:> Train features shape:', cv_train_features.shape, ' Test features shape:', cv_test_features.shape)
print('TFIDF model:> Train features shape:', tv_train_features.shape, ' Test features shape:', tv_test_features.shape)

BOW model:> Train features shape: (79000, 445683)  Test features shape: (20989, 445683)
TFIDF model:> Train features shape: (79000, 445683)  Test features shape: (20989, 445683)


Entrenamos el modelo y realizamos evaluación de la predicción y del rendimiento.

In [0]:
from sklearn.linear_model import SGDClassifier, LogisticRegression

lr = LogisticRegression(penalty='l2', max_iter=100, C=1)
svm = SGDClassifier(loss='hinge', n_iter=100)

Funciones que vamos a usar para evaluar y entrenar el modelo.

In [0]:
from sklearn import metrics

def get_metrics(true_labels, predicted_labels):
    
    print('Accuracy:', np.round(
                        metrics.accuracy_score(true_labels, 
                                               predicted_labels),
                        4))
    print('Precision:', np.round(
                        metrics.precision_score(true_labels, 
                                               predicted_labels,
                                               average='weighted'),
                        4))
    print('Recall:', np.round(
                        metrics.recall_score(true_labels, 
                                               predicted_labels,
                                               average='weighted'),
                        4))
    print('F1 Score:', np.round(
                        metrics.f1_score(true_labels, 
                                               predicted_labels,
                                               average='weighted'),
                        4))

def display_confusion_matrix(true_labels, predicted_labels, classes=[1,0]):
    
    total_classes = len(classes)
    level_labels = [total_classes*[0], list(range(total_classes))]

    cm = metrics.confusion_matrix(y_true=true_labels, y_pred=predicted_labels, 
                                  labels=classes)
    cm_frame = pd.DataFrame(data=cm, 
                            columns=pd.MultiIndex(levels=[['Predicted:'], classes], 
                                                  labels=level_labels), 
                            index=pd.MultiIndex(levels=[['Actual:'], classes], 
                                                labels=level_labels)) 
    print(cm_frame) 
    
def display_classification_report(true_labels, predicted_labels, classes=[1,0]):

    report = metrics.classification_report(y_true=true_labels, 
                                           y_pred=predicted_labels, 
                                           labels=classes) 
        
    print(report)
    
def display_model_performance_metrics(true_labels, predicted_labels, classes=[1,0]):
    print('Model Performance metrics:')
    print('-'*30)
    get_metrics(true_labels=true_labels, predicted_labels=predicted_labels)
    print('\nModel Classification report:')
    print('-'*30)
    display_classification_report(true_labels=true_labels, predicted_labels=predicted_labels)
    
    print('\nPrediction Confusion Matrix:')
    print('-'*30)
    display_confusion_matrix(true_labels=true_labels, predicted_labels=predicted_labels)

Regresión logística con las características de BOW.

In [30]:
# Logistic Regression model on BOW features
lr_bow_predictions = meu.train_predict_model(classifier=lr, 
                                             train_features=cv_train_features, train_labels=train_sentiments,
                                             test_features=cv_test_features, test_labels=test_sentiments)
display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=lr_bow_predictions,
                                      classes=['positive', 'negative'])

Model Performance metrics:
------------------------------
Accuracy: 0.7658
Precision: 0.7644
Recall: 0.7658
F1 Score: 0.7647

Model Classification report:
------------------------------
             precision    recall  f1-score   support

          1       0.79      0.82      0.81     12445
          0       0.72      0.69      0.70      8544

avg / total       0.76      0.77      0.76     20989


Prediction Confusion Matrix:
------------------------------
          Predicted:      
                   1     0
Actual: 1      10206  2239
        0       2677  5867


Regresión logística con las características de TF-IDF.

In [33]:
# Logistic Regression model on TF-IDF features
lr_tfidf_predictions = meu.train_predict_model(classifier=lr, 
                                               train_features=tv_train_features, train_labels=train_sentiments,
                                               test_features=tv_test_features, test_labels=test_sentiments)
#·meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=lr_tfidf_predictions,
#·                                      classes=['positive', 'negative'])

meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=lr_tfidf_predictions)

Model Performance metrics:
------------------------------
Accuracy: 0.7653
Precision: 0.7659
Recall: 0.7653
F1 Score: 0.7655

Model Classification report:
------------------------------
             precision    recall  f1-score   support

          1       0.81      0.80      0.80     12445
          0       0.71      0.72      0.71      8544

avg / total       0.77      0.77      0.77     20989


Prediction Confusion Matrix:
------------------------------
          Predicted:      
                   1     0
Actual: 1       9915  2530
        0       2397  6147


Como acabamos los datos son muy similares, quizá un poco mejor en el primer ejemplo.

Probamos ahora SVM sobre BOW.

In [35]:
svm_bow_predictions = meu.train_predict_model(classifier=svm, 
                                             train_features=cv_train_features, train_labels=train_sentiments,
                                             test_features=cv_test_features, test_labels=test_sentiments)
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=svm_bow_predictions)



Model Performance metrics:
------------------------------
Accuracy: 0.7658
Precision: 0.764
Recall: 0.7658
F1 Score: 0.764

Model Classification report:
------------------------------
             precision    recall  f1-score   support

          1       0.79      0.83      0.81     12445
          0       0.73      0.67      0.70      8544

avg / total       0.76      0.77      0.76     20989


Prediction Confusion Matrix:
------------------------------
          Predicted:      
                   1     0
Actual: 1      10341  2104
        0       2811  5733


Y ahora SVM sobre TF-IDF.

In [36]:
svm_tfidf_predictions = meu.train_predict_model(classifier=svm, 
                                                train_features=tv_train_features, train_labels=train_sentiments,
                                                test_features=tv_test_features, test_labels=test_sentiments)
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=svm_tfidf_predictions)



Model Performance metrics:
------------------------------
Accuracy: 0.7527
Precision: 0.7507
Recall: 0.7527
F1 Score: 0.7509

Model Classification report:
------------------------------
             precision    recall  f1-score   support

          1       0.78      0.82      0.80     12445
          0       0.71      0.66      0.68      8544

avg / total       0.75      0.75      0.75     20989


Prediction Confusion Matrix:
------------------------------
          Predicted:      
                   1     0
Actual: 1      10185  2260
        0       2930  5614


Igual que el anterior, obtenemos mejores resultados sobre BOW que sobre TF-IDF.

**DEEP LEARNING**

Nos toca ahora aplicar algoritmos de Deep Learning. Lo primero que hacemos es buscar los tokens de este dataset así que cada sentencia se descompone en sus respectivos tokens.

In [0]:
tokenized_train = [tokenizer.tokenize(text) for text in norm_train_sentimentTexts]
tokenized_test = [tokenizer.tokenize(text) for text in norm_test_sentimentTexts]

Construimos nuestro propio diccionario.

In [43]:
from collections import Counter

# build word to index vocabulary
token_counter = Counter([token for review in tokenized_train for token in review])
vocab_map = {item[0]: index+1 for index, item in enumerate(dict(token_counter).items())}
max_index = np.max(list(vocab_map.values()))
vocab_map['PAD_INDEX'] = 0
vocab_map['NOT_FOUND_INDEX'] = max_index+1
vocab_size = len(vocab_map)
# view vocabulary size and part of the vocabulary map
print('Vocabulary Size:', vocab_size)
print('Sample slice of vocabulary map:', dict(list(vocab_map.items())[10:20]))

Vocabulary Size: 82721
Sample slice of vocabulary map: {'omgaga': 11, 'sooo': 12, 'gunna': 13, 'cry': 14, 'dentist': 15, 'since': 16, '11': 17, 'supos': 18, '2': 19, 'get': 20}


Ahora codificamos cada uno de las opiniones transformándolo en una secuencia numérica.

In [44]:
from keras.preprocessing import sequence
from sklearn.preprocessing import LabelEncoder

# get max length of train corpus and initialize label encoder
le = LabelEncoder()
num_classes=2 # positive -> 1, negative -> 0
max_len = np.max([len(review) for review in tokenized_train])

## Train reviews data corpus
# Convert tokenized text reviews to numeric vectors
train_X = [[vocab_map[token] for token in tokenized_review] for tokenized_review in tokenized_train]
train_X = sequence.pad_sequences(train_X, maxlen=max_len) # pad 
## Train prediction class labels
# Convert text sentiment labels (negative\positive) to binary encodings (0/1)
train_y = le.fit_transform(train_sentiments)

## Test reviews data corpus
# Convert tokenized text reviews to numeric vectors
test_X = [[vocab_map[token] if vocab_map.get(token) else vocab_map['NOT_FOUND_INDEX'] 
           for token in tokenized_review] 
              for tokenized_review in tokenized_test]
test_X = sequence.pad_sequences(test_X, maxlen=max_len)
## Test prediction class labels
# Convert text sentiment labels (negative\positive) to binary encodings (0/1)
test_y = le.transform(test_sentiments)

# view vector shapes
print('Max length of train review vectors:', max_len)
print('Train review vectors shape:', train_X.shape, ' Test review vectors shape:', test_X.shape)

Using TensorFlow backend.


Max length of train review vectors: 82
Train review vectors shape: (79000, 82)  Test review vectors shape: (20989, 82)


Saltamos ahora a los pasos 3 y 4. Podemos introducir la capa "*embedding*" usando la arquitectura basada en LSTM.

In [0]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Dropout, SpatialDropout1D
from keras.layers import LSTM

EMBEDDING_DIM = 128 # dimension for dense embeddings for each token
LSTM_DIM = 64 # total LSTM units

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=EMBEDDING_DIM, input_length=max_len))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(LSTM_DIM, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation="sigmoid"))

model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])

In [46]:
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 82, 128)           10588288  
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 82, 128)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                49408     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 10,637,761
Trainable params: 10,637,761
Non-trainable params: 0
_________________________________________________________________
None


In [54]:
batch_size = 100
model.fit(train_X, train_y, epochs=5, batch_size=batch_size, 
          shuffle=True, validation_split=0.1, verbose=1)

Train on 71100 samples, validate on 7900 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f808fcff048>

Vemos que en este entrenamiento que acabamos de hacer, a pesar de tener solo 5 épocas, tenemos muy buenos valores en el campo accuracy.

In [58]:
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=predictions)

Model Performance metrics:
------------------------------
Accuracy: 0.713
Precision: 0.7265
Recall: 0.713
F1 Score: 0.7155

Model Classification report:
------------------------------
             precision    recall  f1-score   support

          1       0.80      0.69      0.74     12445
          0       0.62      0.74      0.68      8544

avg / total       0.73      0.71      0.72     20989


Prediction Confusion Matrix:
------------------------------
          Predicted:      
                   1     0
Actual: 1       8617  3828
        0       2195  6349


En los resultados que hemos obtenido vemos que tenemos un resultado de 74% en el marcador F1-score lo cual no está mal.