# Sentiment Analysis: First Exploration

We already have a good model. We want to improve it.

There are so many options that the approach here is to explore each one superficially.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from util import load_datasets
train, dev, test = load_datasets()
X_train, y_train = train
X_dev, y_dev = dev
X_test, y_test = test

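## Different Classification Models

We try different classification models using their default parameter values.

We evaluate on train (bias) and on dev (variance): a low score on train points to high bias, and a large gap between the train and dev scores points to high variance.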
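
The source of `util.print_short_eval` is not shown in this notebook. As a rough sketch of the kind of evaluation it reports (accuracy and macro F1), assuming it wraps `sklearn.metrics`, something like the following could be used. The helper name `short_eval` and the toy data are hypothetical and only serve as illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import Pipeline


def short_eval(model, X, y):
    """Return (accuracy, macro F1) of a fitted model on one data split.

    Hypothetical stand-in for util.print_short_eval, whose code is not shown.
    """
    y_pred = model.predict(X)
    return accuracy_score(y, y_pred), f1_score(y, y_pred, average='macro')


# Toy data invented for illustration (not the notebook's dataset).
X_toy_train = ['good movie', 'great film', 'bad movie', 'awful film']
y_toy_train = ['pos', 'pos', 'neg', 'neg']
X_toy_dev = ['good film', 'bad film']
y_toy_dev = ['pos', 'neg']

toy_pipeline = Pipeline([
    ('vect', CountVectorizer(binary=True)),
    ('clf', LogisticRegression(random_state=0)),
])
toy_pipeline.fit(X_toy_train, y_toy_train)

# Score on train (bias) and on dev; the gap between both hints at variance.
train_acc, train_f1 = short_eval(toy_pipeline, X_toy_train, y_toy_train)
dev_acc, dev_f1 = short_eval(toy_pipeline, X_toy_dev, y_toy_dev)
print('train\taccuracy\t%.2f\tmacro f1\t%.2f' % (train_acc, train_f1))
print('dev\taccuracy\t%.2f\tmacro f1\t%.2f' % (dev_acc, dev_f1))
print('train - dev accuracy gap\t%.2f' % (train_acc - dev_acc))
```

Returning the numbers instead of only printing them makes it easy to compute the train/dev gap per model.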

In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier

clfs = [
    KNeighborsClassifier(),
    MultinomialNB(),
    DecisionTreeClassifier(random_state=0),
    LogisticRegression(random_state=0),
    LinearSVC(random_state=0),
    SVC(random_state=0),
    RandomForestClassifier(random_state=0),
]

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from util import print_short_eval

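# Binary bag-of-words features: 1 if the word occurs in the document, else 0.
vect = CountVectorizer(binary=True)

for clf in clfs:
    print(str(clf.__class__))
    pipeline = Pipeline([
        ('vect', vect),
        ('clf', clf),
    ])
    pipeline.fit(X_train, y_train)
    # First output line: scores on train. Second output line: scores on dev.
    print_short_eval(pipeline, X_train, y_train)
    print_short_eval(pipeline, X_dev, y_dev)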

<class 'sklearn.neighbors.classification.KNeighborsClassifier'>
accuracy	0.70	macro f1	0.67
accuracy	0.45	macro f1	0.40
<class 'sklearn.naive_bayes.MultinomialNB'>
accuracy	1.00	macro f1	1.00
accuracy	0.81	macro f1	0.81
<class 'sklearn.tree.tree.DecisionTreeClassifier'>
accuracy	1.00	macro f1	1.00
accuracy	0.67	macro f1	0.67
<class 'sklearn.linear_model.logistic.LogisticRegression'>
accuracy	1.00	macro f1	1.00
accuracy	0.85	macro f1	0.85
<class 'sklearn.svm.classes.LinearSVC'>
accuracy	1.00	macro f1	1.00
accuracy	0.84	macro f1	0.84
<class 'sklearn.svm.classes.SVC'>


accuracy	0.52	macro f1	0.34
accuracy	0.46	macro f1	0.32
<class 'sklearn.ensemble.forest.RandomForestClassifier'>
accuracy	1.00	macro f1	1.00
accuracy	0.65	macro f1	0.64


The best models appear to be logistic regression and the SVM with a linear kernel.

What other conclusions can we draw? **Exercise!**

**More exercises:**
1. Reduce the bias of KNN.
2. What is happening to the SVM with the RBF kernel?
3. Reduce the overfitting of Random Forest.
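
As a possible starting point for exercise 2, it can help to look at which labels a classifier actually predicts: when macro F1 sits far below accuracy, one class may be predicted rarely or never. The sketch below counts predictions per label. The helper name `prediction_counts` and the toy data are hypothetical, and the counts on the toy data say nothing about the real dataset.

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def prediction_counts(model, X):
    """Count how many times a fitted model predicts each label."""
    return Counter(model.predict(X).tolist())


# Toy data invented for illustration (not the notebook's dataset).
X_toy = ['good movie', 'great film', 'nice plot',
         'bad movie', 'awful film', 'poor plot']
y_toy = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']

rbf_pipeline = Pipeline([
    ('vect', CountVectorizer(binary=True)),
    ('clf', SVC(random_state=0)),  # default kernel is RBF
])
rbf_pipeline.fit(X_toy, y_toy)

counts = prediction_counts(rbf_pipeline, X_toy)
print(counts)
```

Running the same helper on `X_dev` with each fitted pipeline from the loop above shows at a glance whether a model's predictions are skewed towards one label.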

## TF-IDF Vectorizer

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from util import print_short_eval

vect = TfidfVectorizer(binary=True)

for clf in clfs:
    print(str(clf.__class__))
    pipeline = Pipeline([
        ('vect', vect),
        ('clf', clf),
    ])
    pipeline.fit(X_train, y_train)
    print_short_eval(pipeline, X_train, y_train)
    print_short_eval(pipeline, X_dev, y_dev)

<class 'sklearn.neighbors.classification.KNeighborsClassifier'>
accuracy	0.82	macro f1	0.82
accuracy	0.75	macro f1	0.75
<class 'sklearn.naive_bayes.MultinomialNB'>
accuracy	0.99	macro f1	0.99
accuracy	0.74	macro f1	0.74
<class 'sklearn.tree.tree.DecisionTreeClassifier'>
accuracy	1.00	macro f1	1.00
accuracy	0.59	macro f1	0.58
<class 'sklearn.linear_model.logistic.LogisticRegression'>
accuracy	1.00	macro f1	1.00
accuracy	0.84	macro f1	0.84
<class 'sklearn.svm.classes.LinearSVC'>
accuracy	1.00	macro f1	1.00
accuracy	0.85	macro f1	0.85
<class 'sklearn.svm.classes.SVC'>


accuracy	0.52	macro f1	0.34
accuracy	0.46	macro f1	0.32
<class 'sklearn.ensemble.forest.RandomForestClassifier'>
accuracy	0.99	macro f1	0.99
accuracy	0.65	macro f1	0.64


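For the best models (logistic regression and LinearSVC) the results are very similar. KNN improves markedly on dev (0.45 to 0.75) but still trails them. For now we will stick with CountVectorizer.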
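
To make the difference between the two vectorizers concrete, here is a small sketch on a toy corpus (invented for illustration). With `binary=True`, `CountVectorizer` only marks whether a word is present. `TfidfVectorizer` with `binary=True` starts from the same presence indicators, then weights each word by its inverse document frequency and, by default, L2-normalizes each row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus invented for illustration (not the notebook's dataset).
toy_corpus = ['good movie', 'good good film', 'bad movie']

count_matrix = CountVectorizer(binary=True).fit_transform(toy_corpus).toarray()
tfidf_matrix = TfidfVectorizer(binary=True).fit_transform(toy_corpus).toarray()

# Presence/absence only: every entry is 0 or 1, even for the repeated 'good'.
print(count_matrix)

# Same non-zero positions, but rarer words get larger weights.
print(np.round(tfidf_matrix, 2))

# Each TF-IDF row has unit L2 norm (the default norm='l2').
row_norms = np.linalg.norm(tfidf_matrix, axis=1)
print(row_norms)
```

Both matrices share the same sparsity pattern, which may be part of why linear models score almost the same with either representation.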

## Logistic Regression

Let's adopt logistic regression, which is simpler and more interpretable than the SVM.

Later, if we want, we can return to the linear SVM.
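
One way to make the interpretability concrete: in a binary logistic regression over bag-of-words features, each word has a single coefficient, and its sign tells which class the word pushes towards. The sketch below lists the words with the most negative and most positive coefficients. The helper name `top_words` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def top_words(pipeline, k=2):
    """Return (most_negative, most_positive) words by coefficient.

    Assumes a binary classifier: positive coefficients push towards
    clf.classes_[1], negative ones towards clf.classes_[0].
    """
    vocabulary = pipeline.named_steps['vect'].vocabulary_  # word -> column
    words = np.array(sorted(vocabulary, key=vocabulary.get))
    coefs = pipeline.named_steps['clf'].coef_[0]
    order = np.argsort(coefs)  # ascending: most negative first
    most_negative = words[order[:k]].tolist()
    most_positive = words[order[-k:][::-1]].tolist()
    return most_negative, most_positive


# Toy data invented for illustration (not the notebook's dataset).
X_toy = ['good movie', 'good film', 'great good plot',
         'bad movie', 'bad film', 'awful bad plot']
y_toy = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']

toy_model = Pipeline([
    ('vect', CountVectorizer(binary=True)),
    ('clf', LogisticRegression(random_state=0)),
])
toy_model.fit(X_toy, y_toy)

most_negative, most_positive = top_words(toy_model, k=2)
print('classes:', toy_model.named_steps['clf'].classes_)
print('most negative words:', most_negative)
print('most positive words:', most_positive)
```

The same helper can be applied to the real pipeline once it is fitted, as long as the task stays binary.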

In [6]:
from util import print_eval

pipeline = Pipeline([
    ('vect', CountVectorizer(binary=True)),
    ('clf', LogisticRegression(random_state=0)),
])
pipeline.fit(X_train, y_train)
print_eval(pipeline, X_dev, y_dev)

accuracy	0.85

             precision    recall  f1-score   support

        neg       0.85      0.88      0.86       162
        pos       0.85      0.82      0.83       138

avg / total       0.85      0.85      0.85       300

[[142  20]
 [ 25 113]]
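
The per-class numbers in the report can be recomputed from the confusion matrix alone. In the matrix, rows are true labels and columns are predicted labels, in the order (neg, pos); the row sums match the support column (162 and 138). A short worked example with NumPy:

```python
import numpy as np

# Dev confusion matrix from the output above.
# Rows: true label (neg, pos). Columns: predicted label (neg, pos).
cm = np.array([[142, 20],
               [25, 113]])

correct = np.diag(cm)                # correctly classified per class
precision = correct / cm.sum(axis=0)  # correct / predicted as that class
recall = correct / cm.sum(axis=1)     # correct / actually in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

for label, p, r, f in zip(['neg', 'pos'], precision, recall, f1):
    print('%s\tprecision %.2f\trecall %.2f\tf1 %.2f' % (label, p, r, f))
print('accuracy\t%.2f' % accuracy)
```

For example, precision for neg is 142 / (142 + 25) ≈ 0.85 and recall for neg is 142 / (142 + 20) ≈ 0.88, matching the report.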


We evaluate on test and save the model:

In [7]:
print_eval(pipeline, X_test, y_test)
from util import save_model
save_model(pipeline, '2018-07-27_count_logreg')

accuracy	0.83

             precision    recall  f1-score   support

        neg       0.82      0.85      0.84       257
        pos       0.84      0.81      0.82       243

avg / total       0.83      0.83      0.83       500

[[219  38]
 [ 47 196]]
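
The code of `util.save_model` is not shown in this notebook. As a minimal sketch of how a fitted pipeline could be persisted and restored, assuming plain `pickle` is acceptable, see below. The helper names `save_model_sketch` and `load_model_sketch`, the `.pkl` extension, the model name, and the toy data are all hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def save_model_sketch(model, name, directory):
    """Pickle a fitted model to <directory>/<name>.pkl and return the path."""
    path = os.path.join(directory, name + '.pkl')
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    return path


def load_model_sketch(path):
    """Load a model previously written by save_model_sketch."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Toy data invented for illustration (not the notebook's dataset).
X_toy = ['good movie', 'great film', 'bad movie', 'awful film']
y_toy = ['pos', 'pos', 'neg', 'neg']

toy_model = Pipeline([
    ('vect', CountVectorizer(binary=True)),
    ('clf', LogisticRegression(random_state=0)),
])
toy_model.fit(X_toy, y_toy)

# Save to a temporary directory and load it back.
model_dir = tempfile.mkdtemp()
model_path = save_model_sketch(toy_model, 'toy_count_logreg', model_dir)
restored = load_model_sketch(model_path)

# The restored pipeline should reproduce the original predictions.
print(restored.predict(X_toy).tolist() == toy_model.predict(X_toy).tolist())
```

Pickled models are tied to the library version they were saved with, so recording the scikit-learn version next to the file is a sensible habit.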
