## Creating the feature vectors

We use the `fit_transform` method of `CountVectorizer` to build the bag-of-words vocabulary and transform the following sentences into vectors:

1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two


### The bag-of-words model lets us represent text as numerical feature vectors.

**Algorithm:**

1. Create a vocabulary of unique words from the entire set of documents.
2. Build a feature vector for each document that contains the number of times each vocabulary word appears in that document. (The counts can also be reduced to a binary present/absent indicator.)
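
The two steps above can be sketched without any library. This is a minimal illustration, not scikit-learn's implementation: the `tokenize` and `vectorize` helpers are hypothetical names, and the regular expression is an assumption that approximates `CountVectorizer`'s default tokenization (lowercased tokens of two or more word characters).

```python
import re

docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet, and one and one is two",
]

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters (assumed tokenizer).
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: vocabulary of unique words, sorted and mapped to column indices.
unique_words = sorted({word for doc in docs for word in tokenize(doc)})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}

# Step 2: one count vector per document (optionally clipped to 0/1).
def vectorize(text, binary=False):
    vec = [0] * len(vocabulary)
    for word in tokenize(text):
        if word in vocabulary:
            vec[vocabulary[word]] += 1
    return [min(v, 1) for v in vec] if binary else vec

bow = [vectorize(doc) for doc in docs]
for row in bow:
    print(row)
```

Calling `vectorize(doc, binary=True)` gives the 0/1 variant, where each entry only records whether the word occurs in the document.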


In [1]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(binary=True, strip_accents='ascii')  # binary=True: no counts, only 0 and 1
frases = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one one and one is two'])
bag = count.fit_transform(frases)

In [2]:
bag

<3x9 sparse matrix of type '<class 'numpy.int64'>'
	with 17 stored elements in Compressed Sparse Row format>

In [3]:
print(bag)

  (0, 6)	1
  (0, 4)	1
  (0, 1)	1
  (0, 3)	1
  (1, 6)	1
  (1, 1)	1
  (1, 8)	1
  (1, 5)	1
  (2, 6)	1
  (2, 4)	1
  (2, 1)	1
  (2, 3)	1
  (2, 8)	1
  (2, 5)	1
  (2, 0)	1
  (2, 2)	1
  (2, 7)	1


In [4]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [5]:
count.get_feature_names_out()

array(['and', 'is', 'one', 'shining', 'sun', 'sweet', 'the', 'two',
       'weather'], dtype=object)

In [6]:
# The resulting matrix is also known as a unigram (1-gram) model
bag1 = bag.toarray()
print(bag1)  # With binary=True, each entry marks the presence (1) or absence (0) of a token in each of the 3 sentences

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [1 1 1 1 1 1 1 1 1]]


In [7]:
import pandas as pd
df_bow = pd.DataFrame(bag1, columns = count.get_feature_names_out())
df_bow.head()

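|   | and | is | one | shining | sun | sweet | the | two | weather |
|---|-----|----|-----|---------|-----|-------|-----|-----|---------|
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |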


In [8]:
frases

array(['The sun is shining', 'The weather is sweet',
       'The sun is shining, the weather is sweet, and one one and one is two'],
      dtype='<U68')

## TF (term frequency)
Here we count every occurrence of each term in each document.

In [15]:
count = CountVectorizer()  # options: binary=True for 0/1 only; max_df=0.9 drops terms whose document frequency exceeds the threshold; min_df does the opposite
frases = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
tf = count.fit_transform(frases)  # fit_transform returns a sparse matrix

tf = pd.DataFrame(tf.toarray(), columns = count.get_feature_names_out())
tf.head()

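|   | and | is | one | shining | sun | sweet | the | two | weather |
|---|-----|----|-----|---------|-----|-------|-----|-----|---------|
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 2 | 2 | 3 | 2 | 1 | 1 | 1 | 2 | 1 | 1 |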
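
The `max_df` and `min_df` options mentioned in the code comment above filter the vocabulary by document frequency. The following pure-Python sketch illustrates the idea on the same three sentences. It assumes, as in scikit-learn, that a float threshold is a proportion of documents and an integer is an absolute count; `filter_vocabulary` is a hypothetical helper, not a library function.

```python
import re

docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet, and one and one is two",
]

def tokenize(text):
    # Assumed tokenizer: lowercased tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def document_frequency(docs):
    # Number of documents in which each term appears at least once.
    df = {}
    for doc in docs:
        for term in set(tokenize(doc)):
            df[term] = df.get(term, 0) + 1
    return df

def filter_vocabulary(docs, min_df=1, max_df=1.0):
    n_docs = len(docs)
    # Floats are proportions of documents, integers are absolute counts.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    df = document_frequency(docs)
    return sorted(term for term, count in df.items() if low <= count <= high)

kept_with_max_df = filter_vocabulary(docs, max_df=0.9)  # drops terms present in every document
kept_with_min_df = filter_vocabulary(docs, min_df=2)    # drops terms present in a single document
print(kept_with_max_df)
print(kept_with_min_df)
```

With `max_df=0.9`, "is" and "the" are removed because they occur in all three documents; with `min_df=2`, "and", "one" and "two" are removed because they occur in only one.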


## tf-idf: term frequency-inverse document frequency

Term frequency-inverse document frequency (tf-idf) reduces the weight of words that appear frequently across multiple documents. It is defined as:

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

* tf-idf is the product of the term frequency $\text{tf}(t,d)$ and the inverse document frequency $\text{idf}(t,d)$, where ***t*** is a term and ***d*** is a document.

* ***idf(t,d)*** is computed with the following equation:


$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d, t)},$$

* where $n_d$ is the total number of documents
* ***df(d,t)*** is the number of documents ***d*** that contain the term ***t***
* The logarithm is used so that low document frequencies do not receive too much weight
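
The idf definition above can be checked on the three example sentences. This is a minimal sketch under stated assumptions: it uses the natural logarithm, and the regular-expression tokenizer is an approximation of `CountVectorizer`'s default behaviour rather than the library's own code.

```python
import math
import re

docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet, and one and one is two",
]

def tokenize(text):
    # Assumed tokenizer: lowercased tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

n_d = len(docs)  # total number of documents

# df(d, t): number of documents that contain term t.
df = {}
for doc in docs:
    for term in set(tokenize(doc)):
        df[term] = df.get(term, 0) + 1

# idf(t, d) = log(n_d / (1 + df(d, t)))
idf = {term: math.log(n_d / (1 + count)) for term, count in df.items()}

for term in sorted(idf):
    print(f"{term:8s} df={df[term]}  idf={idf[term]:+.3f}")
```

With only three documents the `1 +` in the denominator has a visible effect: terms found in two documents get an idf of 0, and terms found in all three get a slightly negative idf.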

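### Example:
The term "is" appears in all three documents ($n_d = 3$, $\text{df} = 3$) and accounts for 3 of the 14 tokens in document 3. With the $1+$ in the denominator, as in the equation above:

$$\text{idf}("is", d_3) = \log \frac{3}{1+3} \approx -0.288$$


$$\text{tf-idf}("is",d_3)= \frac{3}{14} \times (-0.288) \approx -0.062$$

Without the $1+$ term, the idf is $\log \frac{3}{3} = 0$ and the tf-idf is exactly 0. Either way, a word that occurs in every document receives almost no weight.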

In [18]:
idf__ = np.log(10000/(10000+1))  # if the term appears in (almost) every document, the idf is close to 0
idf__

-9.999500033332494e-05

In [21]:
idf__ = np.log(100000/1000)  # if the term appears in few documents, the idf is large
idf__

4.605170185988092

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf_array = tfidf.fit_transform(frases)
print(tfidf_array)
tfidf_array=tfidf_array.toarray()

TF_IDF_sk=pd.DataFrame(tfidf_array, columns=tfidf.get_feature_names_out())
TF_IDF_sk

  (0, 3)	0.5584778353707552
  (0, 1)	0.4337078595086741
  (0, 4)	0.5584778353707552
  (0, 6)	0.4337078595086741
  (1, 5)	0.5584778353707552
  (1, 8)	0.5584778353707552
  (1, 1)	0.4337078595086741
  (1, 6)	0.4337078595086741
  (2, 7)	0.25119322405394995
  (2, 2)	0.5023864481078999
  (2, 0)	0.5023864481078999
  (2, 5)	0.191038921512224
  (2, 8)	0.191038921512224
  (2, 3)	0.191038921512224
  (2, 1)	0.44507629390649395
  (2, 4)	0.191038921512224
  (2, 6)	0.296717529270996


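|   | and | is | one | shining | sun | sweet | the | two | weather |
|---|-----|----|-----|---------|-----|-------|-----|-----|---------|
| 0 | 0.0 | 0.433708 | 0.0 | 0.558478 | 0.558478 | 0.0 | 0.433708 | 0.0 | 0.0 |
| 1 | 0.0 | 0.433708 | 0.0 | 0.0 | 0.0 | 0.558478 | 0.433708 | 0.0 | 0.558478 |
| 2 | 0.502386 | 0.445076 | 0.502386 | 0.191039 | 0.191039 | 0.191039 | 0.296718 | 0.251193 | 0.191039 |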


In [23]:
phrase = ['hi, the sun is two Adios hola']

In [24]:
pd.DataFrame(tfidf.transform(phrase).toarray(), columns=tfidf.get_feature_names_out())
# tf*(1+idf)

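|   | and | is | one | shining | sun | sweet | the | two | weather |
|---|-----|----|-----|---------|-----|-------|-----|-----|---------|
| 0 | 0.0 | 0.391484 | 0.0 | 0.0 | 0.504107 | 0.0 | 0.391484 | 0.66284 | 0.0 |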



The equations for idf and tf-idf implemented in scikit-learn (with the default `smooth_idf=True`) are:

$$\text{idf} (t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

By default, scikit-learn then applies L2 normalization to the tf-idf values, which returns a vector of length 1 by dividing the unnormalized feature vector ***v*** by its ***L2 norm***:

$$v_{\text{norm}} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + \dots + v_{n}^{2}}} = \frac{v}{\big (\sum_{i=1}^{n} v_{i}^{2}\big)^\frac{1}{2}}$$
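
To see how these equations fit together, the sketch below recomputes by hand the tf-idf vectors for the third sentence and for the unseen phrase used earlier. It is an illustration under stated assumptions, not scikit-learn's code: tf is the raw count, the logarithm is natural, the tokenizer is an approximation of the default one, and `tfidf_l2` is a hypothetical helper name.

```python
import math
import re

docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet, and one and one is two",
]

def tokenize(text):
    # Assumed tokenizer: lowercased tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

n_d = len(docs)
vocab = sorted({term for doc in docs for term in tokenize(doc)})
df = {term: sum(term in set(tokenize(doc)) for doc in docs) for term in vocab}

# Smoothed idf: log((1 + n_d) / (1 + df)); the "+ 1" is added in tfidf_l2.
idf = {term: math.log((1 + n_d) / (1 + df[term])) for term in vocab}

def tfidf_l2(text):
    tokens = tokenize(text)
    # tf-idf(t, d) = tf(t, d) * (idf(t, d) + 1); out-of-vocabulary tokens are ignored.
    raw = {term: tokens.count(term) * (idf[term] + 1) for term in vocab}
    # L2 normalization: divide by the Euclidean length of the vector.
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: (value / norm if norm else 0.0) for term, value in raw.items()}

doc3 = tfidf_l2(docs[2])
new_phrase = tfidf_l2("hi, the sun is two Adios hola")
print({term: round(value, 6) for term, value in doc3.items()})
print({term: round(value, 6) for term, value in new_phrase.items()})
```

The results should agree with the `TfidfVectorizer` output shown earlier, for example about 0.445076 for "is" in the third sentence and about 0.66284 for "two" in the new phrase. Words such as "hi", "adios" and "hola" contribute nothing because they are not in the fitted vocabulary.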