# Sentiment Analysis: Error Analysis

Why does the model perform poorly with stop-word filtering?

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
from util import load_datasets
train, dev, test = load_datasets()
X_train, y_train = train
X_dev, y_dev = dev
X_test, y_test = test

## Current State of the Art + Stop Words

In [5]:
from model import build_pipeline

pipeline = build_pipeline()
pipeline.set_params(vect__stop_words='english')
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.9, max_features=None, min_df=3,
        ngram_range=(1, 5), preprocessor=None, stop_words='english',
        s...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [6]:
from util import print_eval
print_eval(pipeline, X_dev, y_dev)

accuracy	0.85

             precision    recall  f1-score   support

        neg       0.86      0.86      0.86       162
        pos       0.84      0.83      0.84       138

avg / total       0.85      0.85      0.85       300

[[140  22]
 [ 23 115]]
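
The figures in the report can be recomputed by hand from the confusion matrix. The sketch below assumes the matrix follows the scikit-learn convention (rows are true labels, columns are predictions, in the order neg, pos), which is consistent with the support column; `precision_recall` is a hypothetical helper, not part of `util`.

```python
# Confusion matrix copied from the dev-set evaluation.
# Assumption: rows = true class, columns = predicted class, order (neg, pos).
cm = [[140, 22],
      [23, 115]]

total = sum(sum(row) for row in cm)
correct = cm[0][0] + cm[1][1]
accuracy = correct / total
n_errors = total - correct


def precision_recall(cm, k):
    """Precision and recall for class k of a 2x2 confusion matrix."""
    tp = cm[k][k]
    predicted_k = cm[0][k] + cm[1][k]  # column sum: everything predicted as k
    actual_k = sum(cm[k])              # row sum: everything that truly is k
    return tp / predicted_k, tp / actual_k


for k, name in enumerate(["neg", "pos"]):
    p, r = precision_recall(cm, k)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f} errors={n_errors}")
```

The off-diagonal cells add up to the number of misclassified dev reviews, which is the set examined in the rest of this analysis.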


## Listing Errors

Let's see where the model went wrong.

In [8]:
y_pred = pipeline.predict(X_dev)

In [88]:
import pandas as pd
# pd.options.display.max_colwidth = 0

errors = []
for x, y1, y2 in zip(X_dev, y_dev, y_pred):
    if y1 != y2:
        errors.append({
            'item': x,
            'true': y1,
            'pred': y2})

errdf = pd.DataFrame(errors)
errdf['len'] = errdf['item'].apply(len)  # review length in bytes

In [89]:
errdf

Unnamed: 0,item,pred,true,len
0,b'topless women talk about their lives falls i...,1,0,2535
1,"b' "" the red violin "" is a cold , sterile feat...",1,0,2038
2,b'it was a rainy friday afternoon in columbus ...,0,1,3237
3,b'bob the happy bastard\'s quickie review : \n...,0,1,1887
4,"b""well i'll be damned . . . \nthe canadians ca...",0,1,2545
5,"b""warning : anyone offended by blatant , leeri...",0,1,1917
6,b'senseless ( r ) marlon wayans is a very tale...,1,0,2665
7,"b""just look back two years ago at the coen bro...",0,1,4063
8,b'synopsis : nice girl susanne has sex with he...,1,0,2275
9,b' * * * the following review contains some ha...,1,0,6766


## Listing the "Worst" Errors

Let's use the predicted probabilities to see where the model was most badly wrong.

In [11]:
y_prob = pipeline.predict_proba(X_dev)

In [28]:
import pandas as pd
# pd.options.display.max_colwidth = 0

errors = []
for i, (x, y1, y2, y2p) in enumerate(zip(X_dev, y_dev, y_pred, y_prob)):
    if y1 != y2:
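        # Probability of the true class minus that of the predicted class;
        # values close to -1 are the most confident mistakes.
        diff = y2p[y1] - y2p[y2]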
        errors.append({
            'index': i,
            'item': x,
            'true': y1,
            'pred': y2,
            'pneg': y2p[0],
            'ppos': y2p[1],
            'diff': diff})

errdf = pd.DataFrame(errors)
errdf.sort_values('diff', inplace=True)

In [29]:
errdf[:10]

Unnamed: 0,diff,index,item,pneg,ppos,pred,true
44,-0.99612,294,"b'miramax "" disinvited "" on-line media from pr...",0.00194,0.99806,1,0
40,-0.927204,268,"b""carry on at your convenience is all about th...",0.036398,0.963602,1,0
14,-0.925333,96,b'plot : a little boy born in east germany ( n...,0.037334,0.962666,1,0
1,-0.829715,7,"b' "" the red violin "" is a cold , sterile feat...",0.085142,0.914858,1,0
17,-0.820588,121,b'bob the happy bastard\'s quickie review : \n...,0.910294,0.089706,0,1
29,-0.817714,204,b'it was with great trepidation that i approac...,0.908857,0.091143,0,1
21,-0.797474,135,b'it might surprise some to know that joel and...,0.898737,0.101263,0,1
3,-0.719233,23,b'bob the happy bastard\'s quickie review : \n...,0.859616,0.140384,0,1
20,-0.711007,131,"b""an indian runner was more than a courier . \...",0.855503,0.144497,0,1
32,-0.678771,237,b'what do you get when you combine clueless an...,0.839385,0.160615,0,1
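
The ranking idea can be tried in isolation. This is a minimal sketch on made-up numbers (the labels and probabilities below are hypothetical, not taken from the dev set), and `rank_errors` is an illustrative helper rather than anything defined in `util` or `model`.

```python
# Toy data, invented for illustration only.
y_true = [0, 1, 1, 0]
y_pred = [1, 0, 1, 0]
y_prob = [
    [0.10, 0.90],  # predicted pos, actually neg: confident mistake
    [0.55, 0.45],  # predicted neg, actually pos: borderline mistake
    [0.20, 0.80],  # correct
    [0.70, 0.30],  # correct
]


def rank_errors(y_true, y_pred, y_prob):
    """Return (index, diff) pairs for misclassified items, worst first."""
    ranked = []
    for i, (t, p, probs) in enumerate(zip(y_true, y_pred, y_prob)):
        if t != p:
            # True-class probability minus predicted-class probability.
            # Never positive for an error; closer to -1 is more confident.
            ranked.append((i, probs[t] - probs[p]))
    return sorted(ranked, key=lambda pair: pair[1])


worst = rank_errors(y_true, y_pred, y_prob)
print(worst)
```

Sorting in ascending order puts the most confident mistakes first, which is why the head of the table is the natural place to start inspecting.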


## Inspecting an Error

Let's pick one example and see what happens.

In [119]:
x = X_dev[294]
print(x.decode('utf-8'))
#[s for s in x.decode('utf-8').split('\n') if 'movie' in s]

miramax " disinvited " on-line media from press screenings of scream 3 . 
they ostensibly feared that folks like me would write spoiler-filled reviews and post them prior to the film's february 4th release date-unsound reasoning . 
you see , 'net critics established enough to be on any sort of vip list are professionals-miramax surely knows the difference between a member of the on-line film critics society ( ofcs ) and the type of fanboy who posts spy reports at ain't it cool news . 
no , the ? mini major' was afraid we'd let a bigger cat out of the bag than whodunit , that scream 3 is a dismal conclusion to the beloved ( by this writer , at least ) franchise . 
something smells rotten in the state of california right from the get-go : cotton weary ( liev schrieber ) , the former lover and would-be killer of maureen prescott , sidney's mother , is juggling phone calls in his luxury car . 
 ( once considered a danger to society , weary now hosts his own talk show , " 100% cotton " , a 

Observations:

- Many negative words: dismal, rotten, clever, disappointingly, worn off, creaky, silly,
laughless, sanctimony, woefully, nightmare, dissapoint...
- A lot of noise: much of the text is about the plot.
  - The last paragraph seems more relevant.
  - So do the parts that mention the film (or 'movie').
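
One way to act on these observations would be to keep only the lines that mention the film, plus the closing lines of the review, before vectorizing. The sketch below is only an illustration of that idea: `focus_text`, its keyword list, and `tail_lines` are hypothetical choices, and the sample review is invented.

```python
def focus_text(review, keywords=("movie", "film"), tail_lines=3):
    """Keep lines that mention a keyword, plus the last few lines."""
    if isinstance(review, bytes):
        review = review.decode("utf-8")
    # In this corpus each line holds roughly one sentence.
    lines = [ln.strip() for ln in review.split("\n") if ln.strip()]
    tail_start = max(len(lines) - tail_lines, 0)
    kept = [
        ln for i, ln in enumerate(lines)
        if i >= tail_start or any(k in ln for k in keywords)
    ]
    return "\n".join(kept)


# Invented sample, formatted like the dataset (bytes, one sentence per line).
sample = (
    b"the plot follows a detective in california . \n"
    b"the movie drags badly in the second act . \n"
    b"the costumes are elaborate . \n"
    b"overall , a dismal conclusion . \n"
)
print(focus_text(sample, tail_lines=1))
```

Whether this kind of filtering actually helps the classifier would have to be checked on the dev set; nothing here has been evaluated.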
  

In [112]:
vect = pipeline.named_steps['vect']
features = vect.get_feature_names()
neg_ws = 'dismal rotten clever disappointingly creaky silly laughless sanctimony woefully nightmare dissapoint'.split()
set(neg_ws) - set(features)

{'dissapoint', 'laughless', 'sanctimony'}

In [104]:
new_x = x.decode('utf-8').replace('dismal', 'bad')
pipeline.predict_proba([x, new_x])

array([[0.00193984, 0.99806016],
       [0.00452386, 0.99547614]])
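
Probabilities this close to 1 hide how much the score moved, so it can help to compare the two predictions in log-odds. A small sketch using only the two positive-class probabilities printed above; `logit` is a helper defined here, not a library call.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


p_before = 0.99806016  # P(pos) for the original review
p_after = 0.99547614   # P(pos) after replacing 'dismal' with 'bad'

shift = logit(p_before) - logit(p_after)
print(f"log-odds before={logit(p_before):.3f} after={logit(p_after):.3f} shift={shift:.3f}")
```

The shift is under one unit of log-odds against a score of roughly six, so swapping a single word is nowhere near enough to flip the prediction.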

In [56]:
vect = pipeline.named_steps['vect']
clf = pipeline.named_steps['clf']
coef = clf.coef_
coef.shape

(1, 24642)

In [109]:
features = vect.get_feature_names()
x2 = vect.transform([x])
active = vect.inverse_transform(x2)[0]
active

array(['100', '13th', '4th', 'act', 'actress', 'acts', 'actually',
       'advisor', 'afraid', 'ain', 'akin', 'alternative', 'american',
       'american pop', 'american pop culture', 'appeal', 'arquette',
       'articulate', 'aside', 'available', 'available video', 'bag',
       'beings', 'beloved', 'bets', 'bigger', 'blatantly', 'bob',
       'boyfriend', 'breaking', 'california', 'calls', 'cameo', 'camera',
       'campbell', 'car', 'cat', 'character', 'chase', 'cheating',
       'classmate', 'clever', 'climaxes', 'college', 'combination',
       'come', 'comedy', 'coming', 'conclusion', 'considered', 'cool',
       'costume', 'costumes', 'cotton', 'course', 'cox', 'craven',
       'creaky', 'critics', 'crowd', 'cues', 'culture', 'danger', 'date',
       'dated', 'death', 'denied', 'departure', 'detective', 'dewey',
       'difference', 'direction', 'disappoint', 'disappointingly',
       'dismal', 'distracting', 'door', 'draft', 'duo', 'effective',
       'elaborate', 'electronic'

In [67]:
coef[x2.nonzero()]

array([ 7.46707137e-02, -6.07881591e-02,  3.24702186e-02, -3.20960576e-02,
       -8.26177301e-02,  2.86623452e-02,  2.97099736e-02,  1.02419531e-01,
       -5.34478850e-02,  2.24655147e-02,  2.12032040e-03,  3.48536312e-02,
        3.97134842e-01,  2.81924462e-02,  2.81924462e-02, -1.03947843e-01,
       -2.90210058e-02, -7.15029251e-03, -9.75696498e-03,  8.49456202e-02,
        9.74311479e-03,  1.62097026e-02,  9.39678120e-02,  7.48764826e-02,
       -4.43224084e-02, -6.41704409e-02, -5.96478271e-02,  1.11983883e-02,
        4.59420816e-02, -2.44780940e-02,  5.06063880e-02, -5.20867085e-02,
       -5.80600544e-02, -6.70297108e-03,  7.08489108e-02, -9.17564191e-02,
       -8.50600287e-03,  7.34729176e-03,  8.65829381e-02, -2.63850536e-02,
       -3.09672603e-02,  1.71081391e-01,  5.60834466e-02,  1.14672101e-01,
        5.11535832e-02, -2.21458797e-01, -3.11064753e-01,  5.11307609e-02,
        3.50202881e-02,  1.42962249e-02, -7.30518372e-02,  1.21660380e-02,
       -1.06314899e-02,  

In [87]:
active_df = pd.DataFrame({'name': active, 'coef': coef[x2.nonzero()]})
active_df.sort_values('coef', inplace=True)
active_df[-20:]

Unnamed: 0,name,coef
54,course,0.172425
242,radio,0.174973
311,suspenseful,0.190568
189,movies,0.191495
293,society,0.217737
112,genre,0.224701
159,life,0.226431
172,master,0.22898
107,game,0.236412
213,parts,0.241621


In [74]:
clf.decision_function(x2)
# clf.intercept_ + coef[x2.nonzero()].sum()

array([6.2432087])
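
This score ties back to the probability seen earlier. Assuming `clf` is a binary logistic regression (which the truncated pipeline repr suggests but does not fully show), the positive-class probability is the sigmoid of the decision function, and with binary features the decision function is the intercept plus the sum of the coefficients of the active features. The sketch below checks the first part with the printed score; the numbers passed to `linear_score` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_score(intercept, active_coefs):
    """Score of a linear model when every active feature has value 1."""
    return intercept + sum(active_coefs)


score = 6.2432087  # decision_function output for review 294
p_pos = sigmoid(score)
p_neg = 1.0 - p_pos
print(f"ppos={p_pos:.5f} pneg={p_neg:.5f}")

# Hypothetical intercept and coefficients, for illustration only.
toy_score = linear_score(-0.5, [0.25, 0.75, -0.125])
print(toy_score, sigmoid(toy_score))
```

The recovered probability agrees with the `ppos` value listed for index 294 in the worst-errors table, so the error comes from many small positive coefficients adding up rather than from a few dominant features.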