# Natural Language Processing with Tf-idf

## Tf-idf vectorizer

Tf-idf is computed with scikit-learn's `TfidfVectorizer` class.

This class can perform the following in a single step:

1. Cleaning
    1. Lowercasing
    1. Removing accents
    1. Removing stop words
1. Tokenizing by
    1. words
    1. characters
    1. regular expression
1. Building the Tf-idf matrix
    1. with `fit_transform`
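
The cleaning and tokenizing steps listed above can be sketched in a single call. This is a minimal sketch and not part of the original notebook: the two sample sentences and the two-word stop list are assumptions made for illustration, and the vocabulary is read from the fitted `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample texts (not from the notebook).
sample_texts = ["Él también programa", "ELLA programa también"]

demo_vectorizer = TfidfVectorizer(
    lowercase=True,             # "ELLA" -> "ella"
    strip_accents="unicode",    # "también" -> "tambien", "él" -> "el"
    stop_words=["el", "ella"],  # illustrative stop list, already lowercased and unaccented
    analyzer="word",            # tokenize by words
)
demo_matrix = demo_vectorizer.fit_transform(sample_texts)

# Vocabulary ordered by column index.
vocabulary = sorted(demo_vectorizer.vocabulary_, key=demo_vectorizer.vocabulary_.get)
print(vocabulary)
print(demo_matrix.shape)
```

Stop words are compared against tokens after lowercasing and accent stripping, which is why the stop list in this sketch is written without accents.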



### Import the vectorizer

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

### Parameters of TfidfVectorizer

(a selection of the most commonly used)

**input** : string {'filename', 'file', 'content'}

**encoding** : string, 'utf-8' by default.

**strip_accents** : {'ascii', 'unicode', None}
    Remove accents and perform other character normalization during the preprocessing step.
    'ascii' is a fast method that only works on characters that have a direct ASCII mapping.
    'unicode' is a slightly slower method that works on any characters.
    None (default) does nothing.

**lowercase** : boolean, default True

**analyzer** : string, {'word', 'char'} or callable
    Whether the feature should be made of word or character n-grams.

**stop_words** : string {'english'}, list, or None (default)
    'english' is currently the only supported string value. There are several known issues with 'english' and you should consider an alternative. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.

**token_pattern** : string
    Regular expression denoting what constitutes a "token", only used if ``analyzer == 'word'``
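
To see what `analyzer` and `token_pattern` do without fitting anything, one option is `build_analyzer()`, which returns the callable the vectorizer applies to each text. The sketch below is illustrative only; the sample strings and the alternative pattern `(?u)\b\w+\b` are assumptions, not taken from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the callable that turns one text into its features.
word_analyzer = TfidfVectorizer(analyzer="word").build_analyzer()
char_analyzer = TfidfVectorizer(analyzer="char", ngram_range=(2, 2)).build_analyzer()
custom_analyzer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b").build_analyzer()

word_tokens = word_analyzer("a b cd")      # default pattern needs 2+ word characters
char_tokens = char_analyzer("hola")        # character 2-grams
custom_tokens = custom_analyzer("a b cd")  # custom pattern also keeps 1-character tokens

print(word_tokens)
print(char_tokens)
print(custom_tokens)
```

The default pattern silently drops one-letter words, which matters for languages where words such as "a" or "y" carry meaning.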



### Full parameter list

In [2]:
TfidfVectorizer?

Init signature:
TfidfVectorizer(
    input='content', encoding='utf-8', decode_error='strict',
    strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None,
    analyzer='word', stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b',
    ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None,
    vocabulary=None, binary=False, dtype=<class 'numpy.float64'>,
    norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False,

## Applying TfidfVectorizer

### Example 1

#### Input

Tf-idf is normally used on a collection of texts, but in this case I will analyze a single text.

One of the accepted input forms is a list of texts, which is why the format is `texto = ['bla ... bla']`.
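
The reason for wrapping a single text in a list can be sketched as follows. The sample string is a placeholder; the behavior shown (a bare string is rejected with a `ValueError`) is what scikit-learn does in the versions I am aware of, so treat it as an assumption for other versions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

single_text = "bla ... bla bla"  # placeholder text

# Wrapping the string in a list makes it a one-document corpus.
matrix_ok = TfidfVectorizer().fit_transform([single_text])
print(matrix_ok.shape)

# Passing the bare string is rejected rather than iterated character by character.
try:
    TfidfVectorizer().fit_transform(single_text)
    bare_string_rejected = False
except ValueError as error:
    bare_string_rejected = True
    print(error)
```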

In [3]:
texto = ["Now we have cleaned the data. The next thing we can do is Feature Engineering.\
Feature Engineering is basically a technique for finding Feature or Data from the currently available data.\
There are several ways to do this technique. More often, it is about common sense. Let’s take a look at the \
Embarked data: it is filled with Q, S, or C. The Python library will not be able to process this, since it is\
only able to process numbers. So you need to do something called One Hot Vectorization, changing the column into\
three columns. Let’s say Embarked_Q, Embarked_S, and Embarked_C which are filled with 0 or 1 whether the person\
embarked from that harbor or not. The other example is SibSp and Parch. Maybe there is nothing interesting in both\
of those columns, but you might want to know how big the family was of the passenger who boarded in the ship. You might\
assume that if the family was bigger, then the chance of survival would increase, since they could help each other. On\
other hand, solo people would’ve had it hard. So you want to create another column called family size, which consists\
of sibsp + parch + 1 (the passenger themself). The last example is called bin columns. It is a technique which creates\
ranges of values to group several things together, since you assume it is hard to differentiate things with similar value.\
For example, Age. For a person aged 5 and 6, is there any significant difference? or for person aged 45 and 46, is there\
any big difference?"]

#### Define the vectorizer with parameters

In [4]:
vectorizador = TfidfVectorizer(lowercase=True, stop_words="english")

#### fit_transform

In [5]:
matriz = vectorizador.fit_transform(texto)
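
`fit_transform` combines two operations: `fit` learns the vocabulary and the idf weights, and `transform` builds the matrix. The sketch below, on a hypothetical two-sentence corpus, checks that both routes give the same result and shows that a fitted vectorizer can also weight a new text.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat down"]  # hypothetical mini corpus

# One step: learn vocabulary and idf, then build the matrix.
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: fit() learns vocabulary and idf, transform() builds the matrix.
fitted = TfidfVectorizer().fit(docs)
two_steps = fitted.transform(docs)

same_result = np.allclose(one_step.toarray(), two_steps.toarray())
print(same_result)

# A fitted vectorizer can weight unseen text using the learned vocabulary.
new_row = fitted.transform(["the cat sat down"])
print(new_row.shape)
```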

The result is a matrix with as many rows as there are texts in the input list (in this case, only 1).

The number of columns equals the number of distinct tokens found across all inputs together, after lowercasing and stop-word removal.

In [6]:
matriz.shape

(1, 83)
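
The shape can be predicted by hand. The following plain-Python sketch does not call scikit-learn; it assumes the default word tokenizer (the `token_pattern` shown earlier) and default `min_df`, `max_df` and `max_features`, and uses a hypothetical mini corpus.

```python
import re

# scikit-learn's default token_pattern: words of 2 or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def expected_shape(documents, stop_words=()):
    """Return (rows, columns) that a word-level vectorizer would produce."""
    vocabulary = set()
    for document in documents:
        tokens = TOKEN_PATTERN.findall(document.lower())
        vocabulary.update(t for t in tokens if t not in stop_words)
    return len(documents), len(vocabulary)

docs = ["The cat sat", "The dog sat down"]  # hypothetical mini corpus
shape_all = expected_shape(docs)
shape_filtered = expected_shape(docs, stop_words={"the"})
print(shape_all)       # (2, 5)
print(shape_filtered)  # (2, 4)
```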

The column labels are listed below. In scikit-learn 1.2 and later this call is spelled `get_feature_names_out()`.

In [7]:
print(vectorizador.get_feature_names())

['45', '46', 'able', 'age', 'aged', 'assume', 'available', 'basically', 'big', 'bigger', 'bin', 'boarded', 'bothof', 'called', 'chance', 'changing', 'cleaned', 'column', 'columns', 'common', 'consistsof', 'create', 'createsranges', 'currently', 'data', 'difference', 'differentiate', 'embarked', 'embarked_c', 'embarked_q', 'embarked_s', 'engineering', 'example', 'family', 'feature', 'filled', 'finding', 'group', 'hand', 'harbor', 'hard', 'help', 'hot', 'increase', 'interesting', 'intothree', 'isonly', 'know', 'let', 'library', 'look', 'maybe', 'mightassume', 'need', 'numbers', 'onother', 'parch', 'passenger', 'people', 'person', 'personembarked', 'process', 'python', 'say', 'sense', 'ship', 'sibsp', 'significant', 'similar', 'size', 'solo', 'survival', 'technique', 'themself', 'thereany', 'thing', 'things', 'value', 'values', 've', 'vectorization', 'want', 'ways']
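
Several labels in this list, such as 'bothof', 'consistsof' and 'intothree', are two words glued together. This comes from how `texto` was typed: a backslash at the end of a line inside a string literal removes the line break without inserting a space. The sketch below reproduces the effect on a short fragment and shows one possible way to avoid it; the parenthesized form is a suggestion, not what the notebook used.

```python
# A backslash at the end of a line inside a string literal joins the two
# lines directly, so the last and first words merge.
glued = "there is nothing interesting in both\
of those columns"
print(glued)

# One alternative: adjacent string literals inside parentheses, each
# ending with an explicit space.
spaced = (
    "there is nothing interesting in both "
    "of those columns"
)
print(spaced)
```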


And the matrix is:

In [8]:
matriz.todense()

matrix([[0.07179582, 0.07179582, 0.14359163, 0.07179582, 0.14359163,
         0.07179582, 0.07179582, 0.07179582, 0.14359163, 0.07179582,
         0.07179582, 0.07179582, 0.07179582, 0.21538745, 0.07179582,
         0.07179582, 0.07179582, 0.14359163, 0.21538745, 0.07179582,
         0.07179582, 0.07179582, 0.07179582, 0.07179582, 0.28718326,
         0.14359163, 0.07179582, 0.07179582, 0.07179582, 0.07179582,
         0.07179582, 0.14359163, 0.21538745, 0.21538745, 0.21538745,
         0.14359163, 0.07179582, 0.07179582, 0.07179582, 0.07179582,
         0.14359163, 0.07179582, 0.07179582, 0.07179582, 0.07179582,
         0.07179582, 0.07179582, 0.07179582, 0.14359163, 0.07179582,
         0.07179582, 0.07179582, 0.07179582, 0.07179582, 0.07179582,
         0.07179582, 0.14359163, 0.14359163, 0.07179582, 0.14359163,
         0.07179582, 0.14359163, 0.07179582, 0.07179582, 0.07179582,
         0.07179582, 0.14359163, 0.07179582, 0.07179582, 0.07179582,
         0.07179582, 0.07179582, 0
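
The values above come in multiples of 0.07179582 because there is only one document: every term then has the same idf, and each weight reduces to the term count divided by the L2 norm of all counts. The sketch below is plain Python and assumes the default settings (`smooth_idf=True`, `norm='l2'`); the figure 194 is inferred from the printed output rather than counted from the text.

```python
import math

n_docs = 1
doc_freq = 1  # every term appears in the only document

# Smoothed idf as used when smooth_idf=True: ln((1 + n) / (1 + df)) + 1
idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
print(idf)  # 1.0, identical for every term

# Inferred from the output above: 0.07179582 is 1 / sqrt(194), so the
# squared term counts of the text appear to add up to 194.
norm = math.sqrt(194)
weights_by_count = {count: round(count * idf / norm, 8) for count in (1, 2, 3, 4)}
print(weights_by_count)
```

Under these assumptions a weight of 0.28718326 (the column for 'data') corresponds to a term that occurs four times.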

### Example 2

#### Input

In this case we have a collection of 5 texts.

In [10]:
txt = ['hola, mi nombre es Ricardo', 'Hola, me llamo Juan', 'hola! yo me llamo Julia!', 'Mi nombre es Ana', "Ana es mi nombre tambien"]

#### Tf-idf

In [11]:
vect = TfidfVectorizer()
matr = vect.fit_transform(txt)

In [12]:
vect

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [13]:
print(vect.get_feature_names())

['ana', 'es', 'hola', 'juan', 'julia', 'llamo', 'me', 'mi', 'nombre', 'ricardo', 'tambien', 'yo']


In [14]:
print(matr.todense())

[[0.         0.40065484 0.40065484 0.         0.         0.
  0.         0.40065484 0.40065484 0.59824977 0.         0.        ]
 [0.         0.         0.40382593 0.60298477 0.         0.48648432
  0.48648432 0.         0.         0.         0.         0.        ]
 [0.         0.         0.34582166 0.         0.51637397 0.41660727
  0.41660727 0.         0.         0.         0.         0.51637397]
 [0.57099526 0.47397764 0.         0.         0.         0.
  0.         0.47397764 0.47397764 0.         0.         0.        ]
 [0.46607785 0.38688672 0.         0.         0.         0.
  0.         0.38688672 0.38688672 0.         0.57769148 0.        ]]
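
The first row of this matrix can be reproduced by hand. The sketch below is plain Python and assumes the defaults shown in the `vect` output: word tokens of two or more characters, smoothed idf computed as ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper names `tokenize` and `tfidf_row` are made up for this illustration.

```python
import math
import re

txt = ['hola, mi nombre es Ricardo', 'Hola, me llamo Juan',
       'hola! yo me llamo Julia!', 'Mi nombre es Ana',
       "Ana es mi nombre tambien"]

def tokenize(text):
    """Lowercase and split with scikit-learn's default token pattern."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

docs = [tokenize(t) for t in txt]
vocabulary = sorted({token for doc in docs for token in doc})
n = len(docs)

# Document frequency and smoothed idf for every term.
df = {term: sum(term in doc for doc in docs) for term in vocabulary}
idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(doc):
    """Raw count times idf, then divide by the row's Euclidean length."""
    raw = [doc.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

first_row = tfidf_row(docs[0])
print(vocabulary)
print([round(v, 8) for v in first_row])
```

'ricardo' gets a higher weight than 'hola' in the first text because it appears in only one of the five texts, while 'hola' appears in three.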


#### Removing Spanish stop words

To do this we import the stop-word list from `nltk.corpus` (the corpus has to be downloaded once beforehand, for example with `nltk.download('stopwords')`).

In [15]:
from nltk.corpus import stopwords
superfluas = set(stopwords.words('spanish'))

The variable `superfluas` holds the stop words; to also include the word 'tambien' we have to use `.add()`.

In [16]:
superfluas.add('tambien')

`add` is used because it is a `set` (an unordered collection of unique elements).

In [18]:
type(superfluas)

set
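
The difference between a `set` and a `list` can be sketched with a small made-up collection of words (these three words are an illustration, not the NLTK list):

```python
stop_words = {"de", "la", "que"}  # hypothetical small set

stop_words.add("tambien")  # add() inserts one element
stop_words.add("de")       # already present: the set is unchanged
set_size = len(stop_words)
print(set_size)            # 4

# A list allows duplicates and uses append() instead.
as_list = ["de", "la", "que"]
as_list.append("de")
list_size = len(as_list)
print(list_size)           # 4

# Membership tests work on both.
print("tambien" in stop_words)  # True
```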

#### Tf-idf

In [19]:
vec = TfidfVectorizer(stop_words=list(superfluas))  # recent scikit-learn versions require a list here, not a set
mat = vec.fit_transform(txt)

In [20]:
print(vec.get_feature_names())

['ana', 'hola', 'juan', 'julia', 'llamo', 'nombre', 'ricardo']


In [21]:
print(mat.todense())

[[0.         0.48624042 0.         0.         0.         0.48624042
  0.72604443]
 [0.         0.4622077  0.69015927 0.         0.55681615 0.
  0.        ]
 [0.         0.4622077  0.         0.69015927 0.55681615 0.
  0.        ]
 [0.76944707 0.         0.         0.         0.         0.63871058
  0.        ]
 [0.76944707 0.         0.         0.         0.         0.63871058
  0.        ]]


As can be seen, there is one column per word returned by `.get_feature_names()`, in the same order.

The rows correspond to the 5 input texts.

The values are the Tf-idf weights of each word in each text; with the default `norm='l2'`, each row is scaled to unit length.
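
Two properties of the last matrix can be checked directly from the printed numbers. The sketch below copies the five rows by hand and assumes the default `norm='l2'` shown earlier in the `vect` output.

```python
import math

# Rows copied from the printed matrix above (7 columns each).
rows = [
    [0.0, 0.48624042, 0.0, 0.0, 0.0, 0.48624042, 0.72604443],
    [0.0, 0.4622077, 0.69015927, 0.0, 0.55681615, 0.0, 0.0],
    [0.0, 0.4622077, 0.0, 0.69015927, 0.55681615, 0.0, 0.0],
    [0.76944707, 0.0, 0.0, 0.0, 0.0, 0.63871058, 0.0],
    [0.76944707, 0.0, 0.0, 0.0, 0.0, 0.63871058, 0.0],
]

# With norm='l2' every row should have Euclidean length 1.
lengths = [math.sqrt(sum(v * v for v in row)) for row in rows]
print([round(length, 6) for length in lengths])

# Texts 4 and 5 keep the same two words once stop words are removed,
# so their rows are identical.
rows_match = rows[3] == rows[4]
print(rows_match)
```

Once 'tambien' is treated as a stop word, 'Mi nombre es Ana' and 'Ana es mi nombre tambien' both reduce to 'ana' and 'nombre', which is why the vectorizer can no longer tell them apart.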