# Case Study: Sentiment Analysis

## Dataset: Movie Reviews

http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

References:

- [scikit-learn: Working With Text Data](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
- [NLTK: Learning to Classify Text](http://www.nltk.org/book/ch06.html)

  _Apparently in this corpus, a review that mentions "Seagal" is almost 8 times more likely to be negative than positive, while a review that mentions "Damon" is about 6 times more likely to be positive._


## Load

In [1]:
from sklearn.datasets import load_files
dataset = load_files('review_polarity/txt_sentoken', shuffle=False)

In [2]:
dataset.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
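`load_files` infers the classes from the directory layout: one subfolder per category, one file per document. The sketch below rebuilds that layout on a tiny throwaway corpus to show where `target_names`, `target` and `data` come from. The folder contents and the file name `review_000.txt` are made up for illustration and are not part of the real dataset.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a tiny throwaway corpus with the same layout as txt_sentoken:
# one subfolder per class, one text file per review.
with tempfile.TemporaryDirectory() as root:
    for label, text in [('neg', 'a dull , lifeless film .'),
                        ('pos', 'a smart , moving film .')]:
        folder = Path(root) / label
        folder.mkdir()
        (folder / 'review_000.txt').write_text(text, encoding='utf-8')

    toy = load_files(root, shuffle=False)

# Folder names become target_names, and each target indexes into that list.
print(toy.target_names)
print(toy.target)
# Without an `encoding` argument the documents are returned as raw bytes.
print(type(toy.data[0]))
```

Passing `encoding='utf-8'` to `load_files` would return `str` documents instead of bytes.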

In [4]:
dataset['target_names'][1]

'pos'

In [6]:
dataset['filenames'][0]
# dataset['data'][0]
# print(dataset['data'][0].decode('utf-8'))
dataset['target'][0]

0

In [7]:
import pandas as pd
data = pd.DataFrame({'data': dataset['data'], 'target': dataset['target']})

In [8]:
data.describe()

| | target |
|---|---|
| count | 2000.0 |
| mean | 0.5 |
| std | 0.500125 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.5 |
| 75% | 1.0 |
| max | 1.0 |


In [9]:
# length of each review in bytes (the documents are loaded as bytes)
data['data_len'] = data['data'].apply(len)

In [10]:
data.describe()

| | target | data_len |
|---|---|---|
| count | 2000.0 | 2000.0 |
| mean | 0.5 | 3893.002 |
| std | 0.500125 | 1712.425852 |
| min | 0.0 | 91.0 |
| 25% | 0.0 | 2737.75 |
| 50% | 0.5 | 3622.5 |
| 75% | 1.0 | 4720.25 |
| max | 1.0 | 14957.0 |


In [11]:
data.groupby('target').describe()

Statistics of `data_len`, grouped by `target`:

| target | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 1000.0 | 3661.721 | 1530.136157 | 91.0 | 2667.25 | 3455.5 | 4423.75 | 11408.0 |
| 1 | 1000.0 | 4124.283 | 1849.144232 | 727.0 | 2833.25 | 3840.5 | 5016.25 | 14957.0 |


In [12]:
[x for x in dataset['data'] if len(x) < 100]

[b"this film is extraordinarily horrendous and i'm not going to waste any more words on it . \n"]
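Because the documents are `bytes`, `data_len` measures bytes rather than characters or words. A minimal sketch, using the short review printed above, of decoding a document and counting its whitespace-separated tokens (splitting on whitespace is an assumption that suits this pre-tokenized corpus):

```python
raw = b"this film is extraordinarily horrendous and i'm not going to waste any more words on it . \n"

# Decode the raw bytes into a str before doing any text processing.
text = raw.decode('utf-8')

# The corpus is already tokenized, so splitting on whitespace yields tokens.
tokens = text.split()

print(len(raw))     # length in bytes, the quantity stored in data_len
print(len(tokens))  # number of tokens, punctuation included
```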

## Split: Train, Dev, and Test

There are 2000 instances.
We will use the following split:
  - Train: 60% (1200)
  - Dev: 15% (300)
  - Test: 25% (500)

First we extract the test set:

In [14]:
from sklearn.model_selection import train_test_split
docs, X_test, y, y_test = train_test_split(
    dataset.data,
    dataset.target,
    test_size=0.25,
    random_state=42
)

Then we separate train from dev (20% of the remaining 1500 documents is 300, i.e. 15% of the total):

In [15]:
X_train, X_dev, y_train, y_dev = train_test_split(
    docs,
    y,
    test_size=0.2,
    random_state=42)

In [16]:
len(X_train), len(X_dev), len(X_test)

(1200, 300, 500)
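The two-step split is easy to get wrong because the second `test_size` applies to what is left after the first split, not to the full dataset. The following sketch redoes the arithmetic with exact fractions. The variable names are illustrative only.

```python
from fractions import Fraction

n_total = 2000
test_frac = Fraction(25, 100)         # first split: test_size=0.25
dev_frac_of_rest = Fraction(20, 100)  # second split: test_size=0.2

# The second split only sees what remains after removing the test set.
rest_frac = 1 - test_frac
dev_frac = rest_frac * dev_frac_of_rest
train_frac = rest_frac - dev_frac

fractions_by_split = {'train': train_frac, 'dev': dev_frac, 'test': test_frac}
sizes = {name: int(frac * n_total) for name, frac in fractions_by_split.items()}

for name in fractions_by_split:
    print('{}\t{:.0%}\t{}'.format(name, float(fractions_by_split[name]), sizes[name]))
```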

In [17]:
from collections import Counter
Counter(y_train), Counter(y_dev), Counter(y_test)

(Counter({1: 619, 0: 581}),
 Counter({1: 138, 0: 162}),
 Counter({1: 243, 0: 257}))
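The class counts above drift slightly away from 50/50 because the splits are purely random. If balanced splits are preferred, `train_test_split` accepts a `stratify` argument. The sketch below shows this on placeholder data, not on the real corpus: `docs_all` and `labels_all` are hypothetical stand-ins for `dataset.data` and `dataset.target`.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder corpus: 2000 dummy documents, 1000 per class.
docs_all = ['doc %d' % i for i in range(2000)]
labels_all = [0] * 1000 + [1] * 1000

# stratify keeps the class proportions of the given labels in both parts.
rest_docs, test_docs, rest_y, test_y = train_test_split(
    docs_all, labels_all, test_size=0.25, random_state=42, stratify=labels_all)
train_docs, dev_docs, train_y, dev_y = train_test_split(
    rest_docs, rest_y, test_size=0.2, random_state=42, stratify=rest_y)

print(Counter(train_y), Counter(dev_y), Counter(test_y))
```

The results reported in the rest of this section come from the unstratified split, so switching to a stratified one would change the numbers.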

**Exercise:** What is the "resolution" of the datasets? That is, how much accuracy is each item of dev and test worth?
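One way to check an answer to the exercise: a single item moves accuracy by one over the size of the split. A minimal sketch, with the split sizes taken from the output above:

```python
split_sizes = {'dev': 300, 'test': 500}

# Getting one more item right (or wrong) changes accuracy by 1 / n.
resolution = {name: 1 / n for name, n in split_sizes.items()}

for name, step in resolution.items():
    print('{}: one item = {:.2%} accuracy'.format(name, step))
```

Accuracy differences smaller than these steps cannot be observed on the corresponding split.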

## Baselines

In [19]:
# required by DummyClassifier: reshape into a 2D array with a single column
import numpy as np
X_train = np.reshape(X_train, (-1, 1))
X_dev = np.reshape(X_dev, (-1, 1))
X_test = np.reshape(X_test, (-1, 1))

Always classify as 'neg':

In [20]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy='constant', constant=0)
clf.fit(X_train, y_train)

DummyClassifier(constant=0, random_state=None, strategy='constant')

Classify with the majority class ('pos'):

In [23]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy='most_frequent')
clf.fit(X_train, y_train)

DummyClassifier(constant=None, random_state=None, strategy='most_frequent')

Classify at random, respecting the class distribution:

In [24]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy='stratified', random_state=0)
clf.fit(X_train, y_train)

DummyClassifier(constant=None, random_state=0, strategy='stratified')
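The accuracy to expect from each of the three baselines can be estimated from the class counts alone, before running any prediction. The sketch below does this with the counts copied from the `Counter` output above. The helper names `share` and `expected_accuracy` are illustrative, and the stratified figure is an expectation over random draws, so a single run will land somewhat above or below it.

```python
# Class counts copied from the Counter output above (0 = neg, 1 = pos).
train_counts = {'neg': 581, 'pos': 619}
dev_counts = {'neg': 162, 'pos': 138}
test_counts = {'neg': 257, 'pos': 243}


def share(counts, label):
    """Fraction of the items in `counts` that carry `label`."""
    return counts[label] / sum(counts.values())


def expected_accuracy(eval_counts):
    """Expected accuracy of the three baselines on one evaluation split."""
    majority = max(train_counts, key=train_counts.get)  # majority class in train
    return {
        'always_neg': share(eval_counts, 'neg'),
        'most_frequent': share(eval_counts, majority),
        # A stratified guess is right when the label drawn with the train
        # class probabilities happens to match the true label.
        'stratified': sum(share(train_counts, label) * share(eval_counts, label)
                          for label in train_counts),
    }


expected = {'dev': expected_accuracy(dev_counts),
            'test': expected_accuracy(test_counts)}
for split_name, scores in expected.items():
    for baseline, acc in scores.items():
        print('{}\t{}\t{:.3f}'.format(split_name, baseline, acc))
```

Note that the majority class in train ('pos') is the minority class in dev and test, so `most_frequent` is expected to score below 50% on both.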

## Persistence

Save a model to disk:

In [25]:
import pickle
filename = '2018-07-27_random_baseline'
# the with-block closes the file, so the pickle is fully written to disk
with open(filename, 'wb') as f:
    pickle.dump(clf, f)

Load a model saved to disk:

In [26]:
import pickle
filename = '2018-07-27_random_baseline'
# only unpickle files you trust: loading a pickle can execute arbitrary code
with open(filename, 'rb') as f:
    clf = pickle.load(f)
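A quick way to gain confidence in the save/load cycle is a round-trip check: dump an object, load it back, and compare. The sketch below uses a plain dictionary as a hypothetical stand-in for the classifier and a temporary directory so that nothing is left on disk. The file name `baseline.pkl` is an arbitrary choice.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object works the same way.
model = {'strategy': 'stratified', 'random_state': 0}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'baseline.pkl')

    # Context managers guarantee the file is closed after each step.
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored object is an equal copy, not the same object.
print(restored == model, restored is model)
```

For a real classifier, comparing the predictions of the original and the restored model on the same inputs plays the role of the equality check.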

## Evaluation and Metrics

We will compute accuracy and macro F1.

On development:

In [27]:
y_pred = clf.predict(X_dev)

In [28]:
from sklearn import metrics
acc = metrics.accuracy_score(y_dev, y_pred)
print('accuracy\t{:2.2f}\n'.format(acc))
print(metrics.classification_report(y_dev, y_pred, target_names=['neg', 'pos']))

accuracy	0.53

             precision    recall  f1-score   support

        neg       0.56      0.54      0.55       162
        pos       0.49      0.51      0.50       138

avg / total       0.53      0.53      0.53       300



Confusion matrix:

In [30]:
cm = metrics.confusion_matrix(y_dev, y_pred)
print(cm)

[[87 75]
 [67 71]]
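Every number in the development report can be recovered from this confusion matrix (rows are true classes, columns are predicted classes). The sketch below recomputes accuracy and the per-class scores by hand and then averages the two F1 values to get macro F1. In the scikit-learn version used here, the `avg / total` row of the report is a support-weighted average rather than the macro average. With scikit-learn, `metrics.f1_score(y_dev, y_pred, average='macro')` is the direct way to obtain macro F1.

```python
# Confusion matrix on dev: rows = true class, columns = predicted class.
cm = [[87, 75],
      [67, 71]]
labels = ['neg', 'pos']

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
print('accuracy\t{:.2f}'.format(accuracy))

f1_scores = {}
for i, label in enumerate(labels):
    true_positives = cm[i][i]
    predicted = sum(row[i] for row in cm)  # column sum: items predicted as label
    actual = sum(cm[i])                    # row sum: items that truly are label
    precision = true_positives / predicted
    recall = true_positives / actual
    f1_scores[label] = 2 * precision * recall / (precision + recall)
    print('{}\tP={:.2f}\tR={:.2f}\tF1={:.2f}'.format(
        label, precision, recall, f1_scores[label]))

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print('macro F1\t{:.3f}'.format(macro_f1))
```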


Evaluation on test:

In [31]:
y_pred = clf.predict(X_test)
acc = metrics.accuracy_score(y_test, y_pred)
print('accuracy\t{:2.2f}\n'.format(acc))
print(metrics.classification_report(y_test, y_pred, target_names=['neg', 'pos']))

accuracy	0.50

             precision    recall  f1-score   support

        neg       0.51      0.48      0.49       257
        pos       0.48      0.51      0.50       243

avg / total       0.50      0.50      0.50       500

