# Sentiment Analysis: Vectorization

References:

- [scikit-learn: Text feature extraction](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
- [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from util import load_datasets
train, _, _ = load_datasets()
X_train, y_train = train

## Bag-of-words

We fit a bag-of-words vectorizer on the training texts:

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
vect.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
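
As a rough illustration of what the default settings shown above amount to, the following sketch redoes the counting step in plain Python: lowercase the text, extract tokens with the default `token_pattern`, and count them. The helper name `bag_of_words` and the sample sentence are made up for this example; it is a simplification, not scikit-learn's actual implementation.

```python
import re
from collections import Counter

# Default token pattern from the CountVectorizer repr:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def bag_of_words(text):
    """Hypothetical helper: lowercase, tokenize, and count the tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return Counter(tokens)


counts = bag_of_words("A good film, a really GOOD film. I liked it!")
print(sorted(counts.items()))
# [('film', 2), ('good', 2), ('it', 1), ('liked', 1), ('really', 1)]
```

Note that single-character tokens such as "a" and "I" are dropped by the pattern, and that "GOOD" and "good" are merged by lowercasing.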

We vectorize a single document. The result is a sparse matrix:

In [4]:
x = vect.transform([X_train[0]])
x

<1x32422 sparse matrix of type '<class 'numpy.int64'>'
	with 642 stored elements in Compressed Sparse Row format>

Let's see which entries are nonzero:

In [8]:
# List the (column index, count) pairs of the nonzero entries.
[(i, x[0, i]) for i in range(x.shape[1]) if x[0, i]]

[(567, 3),
 (592, 6),
 (594, 1),
 (614, 1),
 (665, 1),
 (783, 1),
 (784, 1),
 (788, 1),
 (794, 3),
 (815, 1),
 (870, 2),
 (1022, 1),
 (1034, 1),
 (1036, 2),
 (1048, 1),
 (1089, 1),
 (1219, 4),
 (1223, 1),
 (1247, 1),
 (1260, 2),
 (1276, 1),
 (1278, 3),
 (1296, 1),
 (1379, 1),
 (1388, 1),
 (1398, 1),
 (1404, 1),
 (1430, 6),
 (1466, 60),
 (1502, 1),
 (1563, 1),
 (1631, 3),
 (1636, 2),
 (1637, 1),
 (1640, 1),
 (1691, 1),
 (1805, 9),
 (1875, 1),
 (1897, 1),
 (1953, 11),
 (2059, 1),
 (2080, 1),
 (2098, 8),
 (2162, 1),
 (2183, 1),
 (2198, 1),
 (2237, 1),
 (2246, 1),
 (2250, 2),
 (2330, 1),
 (2340, 1),
 (2348, 1),
 (2377, 1),
 (2387, 4),
 (2424, 1),
 (2444, 3),
 (2662, 1),
 (2743, 6),
 (2802, 5),
 (2813, 1),
 (2815, 1),
 (2840, 5),
 (2874, 1),
 (2899, 1),
 (2918, 1),
 (2922, 1),
 (2924, 1),
 (3010, 1),
 (3042, 5),
 (3065, 1),
 (3131, 1),
 (3147, 2),
 (3190, 1),
 (3215, 1),
 (3449, 2),
 (3547, 1),
 (3610, 1),
 (3796, 1),
 (3843, 1),
 (3867, 1),
 (3895, 2),
 (3905, 1),
 (3912, 1),
 (3975, 1),
 

In [12]:
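# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead.
features = vect.get_feature_names()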
features[592]

'about'
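
Scanning every column works, but a CSR matrix already stores the nonzero columns of each row. The sketch below reads them directly and maps the indices back to terms. The toy corpus is made up for illustration, and `vocabulary_` is used for the term lookup because it is available across scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good film good plot", "bad film", "good acting"]  # toy corpus (assumption)
vect = CountVectorizer()
X = vect.fit_transform(docs)

# vocabulary_ maps each term to its column index; invert it for lookups.
index_to_term = {int(j): term for term, j in vect.vocabulary_.items()}

# In CSR format, row i occupies the slice indptr[i]:indptr[i + 1]
# of the `indices` (columns) and `data` (values) arrays.
start, end = X.indptr[0], X.indptr[1]
nonzero = sorted(
    (index_to_term[int(j)], int(v))
    for j, v in zip(X.indices[start:end], X.data[start:end])
)
print(nonzero)
# [('film', 1), ('good', 2), ('plot', 1)]
```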

## Min counts

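It may be better to require a minimum frequency for the words we consider. Note that `min_df` is a document frequency: `min_df=5` keeps only words that appear in at least 5 documents.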

In [13]:
vect = CountVectorizer(min_df=5)
vect.fit(X_train)
x = vect.transform(X_train[:1])
x

<1x9756 sparse matrix of type '<class 'numpy.int64'>'
	with 593 stored elements in Compressed Sparse Row format>
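
To make the effect of `min_df` concrete, here is a small sketch on a made-up corpus. It also shows that the threshold counts documents, not occurrences: "great" appears three times, but only in one document, so it is dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this example.
docs = ["good film", "good plot", "bad film", "great great great"]

# No threshold: every token becomes a feature.
all_terms = sorted(CountVectorizer().fit(docs).vocabulary_)

# min_df=2: keep only terms that occur in at least 2 documents.
frequent_terms = sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_)

print(all_terms)       # ['bad', 'film', 'good', 'great', 'plot']
print(frequent_terms)  # ['film', 'good']
```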

We can ask the vectorizer which features it found:

In [14]:
features = vect.get_feature_names()
features[2000:2010]

['coverage',
 'covered',
 'covering',
 'covers',
 'cowboy',
 'cox',
 'crab',
 'crack',
 'cracked',
 'cracking']

In [31]:
[(features[i], x[0, i]) for i in range(x.shape[1]) if x[0, i]]

[('able', 3),
 ('about', 6),
 ('above', 1),
 ('absolutely', 1),
 ('actor', 1),
 ('actors', 1),
 ('acts', 1),
 ('actually', 3),
 ('adapting', 1),
 ('admired', 2),
 ('after', 1),
 ('afterwards', 1),
 ('again', 2),
 ('agent', 1),
 ('aid', 1),
 ('all', 4),
 ('alleged', 1),
 ('allowed', 1),
 ('almost', 2),
 ('already', 1),
 ('also', 3),
 ('although', 1),
 ('amiable', 1),
 ('amongst', 1),
 ('amount', 1),
 ('an', 6),
 ('and', 60),
 ('angle', 1),
 ('annoying', 1),
 ('any', 3),
 ('anyone', 2),
 ('anything', 1),
 ('anyway', 1),
 ('appearance', 1),
 ('are', 9),
 ('around', 1),
 ('arrive', 1),
 ('as', 11),
 ('assume', 1),
 ('astonishingly', 1),
 ('at', 8),
 ('attention', 1),
 ('attributed', 1),
 ('audience', 1),
 ('austin', 1),
 ('authenticity', 1),
 ('authorities', 2),
 ('away', 1),
 ('awhile', 1),
 ('awry', 1),
 ('baby', 1),
 ('back', 4),
 ('bad', 1),
 ('bag', 3),
 ('basically', 1),
 ('be', 6),
 ('because', 5),
 ('become', 1),
 ('becoming', 1),
 ('been', 5),
 ('begins', 1),
 ('being', 1),
 ('bel

## Max Features

We can also limit the number of features to the N most frequent across the corpus:

In [37]:
vect = CountVectorizer(max_features=100)
vect.fit(X_train)
x = vect.transform(X_train[:1])
x

<1x100 sparse matrix of type '<class 'numpy.int64'>'
	with 92 stored elements in Compressed Sparse Row format>

We can see that the selected features do not seem very informative about polarity:

In [39]:
features = vect.get_feature_names()
features[10:20]

['be',
 'because',
 'been',
 'but',
 'by',
 'can',
 'character',
 'characters',
 'do',
 'does']
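
The same tendency shows up on a tiny made-up corpus. In this sketch, `max_features=3` keeps the three terms with the highest total count, which here are mostly function words rather than polarity words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this example.
docs = ["the film was good", "the plot was bad", "the film was long"]

# Total counts: the=3, was=3, film=2, and 1 for every other term.
vect = CountVectorizer(max_features=3)
vect.fit(docs)
top_terms = sorted(vect.vocabulary_)

print(top_terms)  # ['film', 'the', 'was']
```

The polarity-bearing words "good" and "bad" are exactly the ones that get discarded.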

In [40]:
[(features[i], x[0, i]) for i in range(x.shape[1]) if x[0, i]]

[('about', 6),
 ('after', 1),
 ('all', 4),
 ('also', 3),
 ('an', 6),
 ('and', 60),
 ('any', 3),
 ('are', 9),
 ('as', 11),
 ('at', 8),
 ('be', 6),
 ('because', 5),
 ('been', 5),
 ('but', 18),
 ('by', 9),
 ('can', 4),
 ('character', 9),
 ('characters', 8),
 ('do', 3),
 ('does', 3),
 ('even', 6),
 ('film', 12),
 ('films', 1),
 ('first', 1),
 ('for', 10),
 ('from', 6),
 ('good', 3),
 ('had', 4),
 ('has', 2),
 ('have', 6),
 ('he', 11),
 ('her', 1),
 ('him', 5),
 ('his', 12),
 ('if', 7),
 ('in', 18),
 ('into', 2),
 ('is', 21),
 ('it', 29),
 ('its', 5),
 ('just', 3),
 ('like', 9),
 ('little', 2),
 ('make', 4),
 ('more', 4),
 ('most', 5),
 ('movie', 2),
 ('much', 9),
 ('never', 2),
 ('no', 3),
 ('not', 11),
 ('of', 40),
 ('off', 3),
 ('on', 2),
 ('one', 5),
 ('only', 1),
 ('or', 6),
 ('other', 1),
 ('out', 4),
 ('over', 2),
 ('plot', 4),
 ('scene', 4),
 ('see', 1),
 ('so', 9),
 ('some', 6),
 ('story', 4),
 ('that', 28),
 ('the', 87),
 ('their', 3),
 ('them', 4),
 ('there', 6),
 ('they', 15),
 

## Vocabulary

We can also restrict the features to a predefined vocabulary.
For example, if we have lexicons of positive and negative words:

In [46]:
positive_words = [
    'good',
    'best',
    'excellent',
    'awesome',
]
negative_words = [
    'bad',
    'worst',
    'horrendous',
    'awful',
]
vocabulary = positive_words + negative_words

In [47]:
vect = CountVectorizer(vocabulary=vocabulary)
vect.fit(X_train)
x = vect.transform(X_train[:1])
x

<1x8 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [49]:
features = vect.get_feature_names()
[(features[i], x[0, i]) for i in range(x.shape[1]) if x[0, i]]

[('good', 3), ('best', 5), ('bad', 1)]
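
With a fixed vocabulary, the columns follow the order of the list that was passed in, which is why 'good' and 'best' come before 'bad' above. A minimal sketch with a made-up four-word lexicon and two invented sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ["good", "best", "bad", "worst"]  # toy lexicon (assumption)
vect = CountVectorizer(vocabulary=vocabulary)

# fit learns no new terms when the vocabulary is given;
# words outside the vocabulary are simply ignored.
X = vect.fit_transform(["The best film, good cast, good plot", "Not bad"])
dense = X.toarray().tolist()

print(dense)  # [[2, 1, 0, 0], [0, 0, 1, 0]]
```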

## Other parameters

- **binary=True**: binarize counts (0 or 1)
- **ngram_range=(p, q)**: count word n-grams for every n with p <= n <= q
- **stop_words**: filter out a given set of words
- **analyzer='char'**: use characters instead of words

and several more...
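
The following sketch tries each of these parameters on a one-sentence toy input. The inputs and the stop word list are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not good not good at all"]  # toy input (assumption)

# binary=True: presence/absence instead of counts.
# Columns are sorted alphabetically: all, at, good, not.
binary_counts = CountVectorizer(binary=True).fit_transform(docs).toarray().tolist()

# ngram_range=(1, 2): unigrams and bigrams.
ngrams = sorted(CountVectorizer(ngram_range=(1, 2)).fit(docs).vocabulary_)

# stop_words: drop the listed words before counting.
no_stops = sorted(CountVectorizer(stop_words=["at", "all"]).fit(docs).vocabulary_)

# analyzer='char': features are characters instead of word tokens.
chars = sorted(CountVectorizer(analyzer="char").fit(["abba"]).vocabulary_)

print(binary_counts)  # [[1, 1, 1, 1]]
print(ngrams)
print(no_stops)       # ['good', 'not']
print(chars)          # ['a', 'b']
```

Bigrams such as "not good" are often useful for sentiment, since they keep a negation together with the word it modifies.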

## TF-IDF

Some words (articles, prepositions) are very frequent in all documents, and are therefore not very informative.
TF-IDF scales each count down by a factor that measures how common the word is across documents.
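
As a sketch of the idea in plain Python, the code below weights raw counts by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then L2-normalizes each document. This follows the formula scikit-learn documents for its default settings (`smooth_idf=True`, `norm='l2'`); the toy corpus and the helper names `idf` and `tfidf` are made up for this example.

```python
import math

# Toy tokenized corpus (invented for illustration).
docs = [
    ["the", "film", "was", "annoying"],
    ["the", "plot", "was", "fine"],
    ["the", "film", "was", "fine"],
]
n_docs = len(docs)


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf(doc):
    """Raw count times IDF for each term, then L2-normalized."""
    weights = {t: doc.count(t) * idf(t) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


weights = tfidf(docs[0])
print(weights["the"] < weights["annoying"])  # True
```

"the" occurs in every document, so its IDF is exactly 1 (the minimum), while "annoying" occurs in only one document and gets a larger factor.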

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(binary=True)
vect.fit(X_train)

TfidfVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [17]:
features = vect.get_feature_names()
x = vect.transform(X_train[:1])
x

<1x32422 sparse matrix of type '<class 'numpy.float64'>'
	with 642 stored elements in Compressed Sparse Row format>

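We can see, for example, that the word 'the' has much less weight than the word 'annoying', despite being far more frequent. Note that with `binary=True` the term-frequency part is 0 or 1, so the weights here reflect only the IDF factor and the L2 normalization: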

In [18]:
x[0, features.index('the')], x[0, features.index('annoying')]

(0.00949812921017883, 0.03301349592851511)

In [133]:
# Raw occurrence counts using a simple, case-sensitive whitespace split.
tokens = X_train[0].decode('utf-8').split()
tokens.count('the'), tokens.count('annoying')

(86, 1)
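
The comparison above depends on `binary=True`. The sketch below, on a made-up two-document corpus, contrasts it with the default `binary=False`, where a word repeated many times in a document can still end up with the larger weight. The helper `weights_for` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the the the film", "the plot"]  # toy corpus (assumption)


def weights_for(vect):
    """Hypothetical helper: weights of 'the' and 'film' in the first document."""
    X = vect.fit_transform(docs)
    col = vect.vocabulary_
    return X[0, col["the"]], X[0, col["film"]]


# binary=True: the term frequency is 0 or 1, so only the IDF matters.
the_bin, film_bin = weights_for(TfidfVectorizer(binary=True))

# binary=False (default): the raw count multiplies the IDF.
the_raw, film_raw = weights_for(TfidfVectorizer())

print(the_bin < film_bin)  # True
print(the_raw > film_raw)  # True
```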