## __Text Mining and Natural Language Processing (NLP)__

__Instructor__: Anthony D. Cho

__Topic__: Document representation

__Method__: Vector space model

***

__Dependencies__

```{python}
    python -m pip install nltk spacy scikit-learn pandas joblib
    python -m spacy download en_core_web_sm
    python -m spacy download es_core_news_sm
```

## Libraries

In [1]:
from glob import glob
import re
import joblib
from pandas import DataFrame

from string import punctuation
from spacy.lang.es.stop_words import STOP_WORDS
from spacy import load

from sklearn.feature_extraction.text import TfidfVectorizer

## Language model instance (Spanish)
nlp = load('es_core_news_sm')

## Loading the documents

In [2]:
from pathlib import Path

## Find the path of each file of interest (sorted for a reproducible order)
path_docs = sorted(glob('*/doc*.txt'))

## Storage for the document contents and ids (file name without extension)
corpus, doc_id = [], []

## Load the documents
if path_docs:
    for file in path_docs:

        ## Read the text; the context manager closes the file afterwards
        with open(file, 'r', encoding='utf-8') as f:
            text = f.read()

        ## Store the text
        corpus.append(text)

        ## Store the id; Path.stem does not depend on the OS path separator
        doc_id.append(Path(file).stem)
else:
    print('No documents were found.')

In [3]:
doc_id

['doc1', 'doc2', 'doc3', 'doc4', 'doc5', 'doc6', 'doc7']

In [4]:
corpus

['Ninguna laptop en años recientes ha tenido un impacto mayor que la XPS 13. Esta fue la que comenzó la carrera por los biseles delgados en 2015, cosa que influyó sobre absolutamente todos los dispositivos que cuentan con pantalla.\n\nEn su versión más reciente, Dell ha llevado las cosas aún más lejos. La XPS 13 de 2020 ha ampliado su pantalla con una relación de aspecto de 16:10, encogiendo el bisel inferior.\n\nEl resultado es una pantalla más grande en una laptop del mismo tamaño. La XPS 13 de este año también tiene un teclado y un touchpad más grandes, aprovechando así todas las superficies posibles del dispositivo.\n\nClaro, es tan poderosa y duradera como siempre, sin sacrificar funcionalidad por el diseño. Ya no es tan barata como antes, pero sin duda se ha ganado su lugar dentro de las opciones premium. Definitivamente, esta es la mejor laptop que puedes comprar.\n\nLa XPS 13 empieza con un Core i3-1115G4, 8 GB de RAM, un 256 GB SSD y un pantalla Full HD (1,920 x 1,080 pixeles)

#### Preprocessing

In [5]:
## Text cleaning
cleanTexts = []

for doc in corpus:

    ## Remove digits, double quotes, '¿', '°' and ASCII punctuation
    doc = re.sub(r'["¿°\d]', '', doc)
    doc = ''.join(s for s in doc if s not in punctuation)

    ## Normalize (lowercase) and remove stopwords
    documento = nlp(doc.lower())
    tokens = [word.text for word in documento]
    doc = ' '.join(word for word in tokens if word not in STOP_WORDS)
    doc = re.sub(pattern=r'\s+', repl=' ', string=doc)

    ## Apply lemmatization
    documento = nlp(doc)
    lemmas = [word.lemma_ for word in documento]
    doc = ' '.join(lemmas)
    doc = re.sub(pattern=r'\s+', repl=' ', string=doc)

    ## Store the processed content
    cleanTexts.append(doc)

## Show the processed content
cleanTexts

['laptop año reciente impacto xps comenzar carrera bisel delgado cosa influir absolutamente dispositivo contar pantalla versión reciente dell llevado cosa lejos xps ampliado pantalla relación aspecto encoger bisel inferior resultado pantalla laptop tamaño xps año teclado touchpad aprovechar superficie posible dispositivo poderoso duradero sacrificar funcionalidad diseño barata duda ganado lugar opción premium definitivamente laptop poder comprar xps empezar core ig gb ram gb ssd pantalla full hd x pixel configurada core ig gb ram tb almacenamiento pantalla oled',
 'ver macbook air reciente desviación mac año macbook pro mac mini macbook air funcionar silicio apple chip m costo procesador m aportar ventaja importante macbook air convirtiéndola portátil perfecto estudiante chip m poder configurar él gb ram tb ssd mejora traer chip m duración rendimiento bateer viejo macbook air incapacitado procesador lento doble núcleo utilizado mantener sistema completamente ventilador apple realizar t
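
The character-level cleaning explains tokens such as `ig` in the output above: `i3-1115G4` loses its digits and its hyphen before tokenization. Below is a minimal sketch of that first step in isolation, using only the standard library. The helper name `strip_chars` is hypothetical and not part of the notebook, and stopword removal and lemmatization are left out on purpose.

```python
import re
from string import punctuation


def strip_chars(text):
    """Drop digits, double quotes, '¿', '°' and ASCII punctuation, then lowercase."""
    text = re.sub(r'["¿°\d]', '', text)
    text = ''.join(ch for ch in text if ch not in punctuation)
    # Collapse the runs of spaces left behind by the removed characters.
    return re.sub(r'\s+', ' ', text.lower()).strip()


print(strip_chars('Core i3-1115G4, 8 GB de RAM'))  # core ig gb de ram
```

If model names such as `i3-1115G4` matter for the analysis, the digit removal would have to be relaxed; that is a design choice the notebook does not make.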

## Model: TfidfVectorizer

The [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) model exposes several hyperparameters:<br>

```{python}
    model = TfidfVectorizer(use_idf=True,
                            norm='l2',
                            ngram_range=(1,1),
                            binary=False,
                            max_df=1.0,
                            min_df=1, 
                            max_features=None
                            )
```

| Hyperparameter | Description |
|----------------|-------------|
| __use_idf__  | Enable IDF weighting (default: True). |
| __norm__ | Normalization applied to each document vector: 'l1', 'l2' or None (default: 'l2'). |
| __ngram_range__ | Lower and upper bounds of the n-gram size (default: (1, 1), unigrams only). |
| __binary__ | If True, every non-zero term count is set to 1 (default: False). |
| __max_df__ | Ignore terms whose document frequency is above the threshold. A float in [0.0, 1.0] is a proportion of the documents; an integer is an absolute number of documents (default: 1.0, i.e. 100%). |
| __min_df__ | Ignore terms whose document frequency is below the threshold. A float in [0.0, 1.0] is a proportion of the documents; an integer is an absolute number of documents (default: 1 document). |
| __max_features__ | Maximum number of terms to keep, choosing the most frequent terms across the corpus (default: None, all terms). |
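
To make the vocabulary-related hyperparameters concrete, here is a minimal sketch on a toy corpus invented for this example (it is not the notebook's corpus). It only shows how `min_df`, `ngram_range` and `max_features` change the vocabulary that is learned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration only.
docs = ['laptop pantalla laptop', 'laptop teclado', 'pantalla oled']

# min_df=2 keeps only the terms that occur in at least two documents.
vec_min_df = TfidfVectorizer(min_df=2).fit(docs)
print(sorted(vec_min_df.vocabulary_))   # ['laptop', 'pantalla']

# ngram_range=(1, 2) adds bigrams to the unigram vocabulary.
vec_ngrams = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)
print(sorted(vec_ngrams.vocabulary_))

# max_features=1 keeps the single most frequent term in the corpus.
vec_top = TfidfVectorizer(max_features=1).fit(docs)
print(sorted(vec_top.vocabulary_))      # ['laptop']
```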


The following methods are used to fit the model and transform the data:

* __fit(X)__: fits the model on a collection of documents.

* __fit_transform(X)__: fits the model and returns the document-term matrix.

* __transform(X)__: transforms documents into a document-term matrix using the fitted vocabulary.
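
The difference between the three methods can be sketched on toy documents invented for this example. Raw counts (`use_idf=False`, `norm=None`) are used only to keep the numbers easy to read.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for illustration only.
train_docs = ['laptop pantalla', 'laptop teclado']
new_docs = ['pantalla oled']

vec = TfidfVectorizer(use_idf=False, norm=None)

# fit() only learns the vocabulary; transform() maps documents onto it.
vec.fit(train_docs)
X_train = vec.transform(train_docs)

# fit_transform() performs both steps in a single call.
X_same = TfidfVectorizer(use_idf=False, norm=None).fit_transform(train_docs)

# 'oled' was not seen during fit(), so transform() ignores it.
X_new = vec.transform(new_docs)
print(X_new.toarray())
```

The columns of `X_new` follow the vocabulary learned from `train_docs`, which is why new documents must go through `transform` rather than a second `fit_transform`.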

#### Attributes of the fitted model
Once the model has been fitted, it exposes new attributes that we can inspect:<br>

<br>

| Attribute | Description |
|-----------|-------------|
| vocabulary_ | Dictionary that maps each term to its column index. |
| idf_ | Vector of computed IDF weights (available only when `use_idf=True`). |

## Binary representation

In [6]:
## Model instance
model = TfidfVectorizer(use_idf=False,  ## <- no IDF weighting
                        norm=None,
                        ngram_range=(1, 1),
                        binary=True     ## <- presence/absence instead of counts
                        )

## Fit the model and return the binary document-term matrix
tf_sparse = model.fit_transform(cleanTexts)

## Extract the vocabulary built by the model (dict :: key (word), value (index))
vocabulary = model.vocabulary_

In [7]:
features = sorted(vocabulary.items(), key=lambda x: x[1])
features = [f for f, _ in features]

tf_table = DataFrame(tf_sparse.toarray(), columns=features)
tf_table

Unnamed: 0,absolutamente,acer,activo,actual,actualizado,agreguir,aguantar,air,alattiyah,alemán,...,ver,versión,victoria,video,viejo,volver,web,xps,él,óptimo
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [9]:
## To export this information to a file
vsm_info = {'vsm': tf_table, 
            'IDF': [],          ## empty: IDF is disabled in this model
            'vocabulary': vocabulary}

## Export to a file 
joblib.dump(value=vsm_info, 
            filename='binary_model.joblib', 
            compress=9)

## To import it back, use
modelo_binario = data = joblib.load(filename='binary_model.joblib')

## Term-frequency representation

In [10]:
## Model instance
model = TfidfVectorizer(use_idf=False,      # <-- no IDF weighting
                        norm=None,
                        ngram_range=(1,1),
                        binary=False        # <-- keep raw term counts
                        )

## Fit the model and return the TF matrix
tf_sparse = model.fit_transform(cleanTexts)

## Extract the vocabulary built by the model (dict :: key (word), value (index))
vocabulary = model.vocabulary_

In [11]:
features = sorted(vocabulary.items(), key=lambda x: x[1])
features = [f for f, _ in features]

tf_table = DataFrame(tf_sparse.toarray(), columns=features)
tf_table

Unnamed: 0,absolutamente,acer,activo,actual,actualizado,agreguir,aguantar,air,alattiyah,alemán,...,ver,versión,victoria,video,viejo,volver,web,xps,él,óptimo
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
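
This table differs from the binary one only where a term repeats within a document (for example `xps`, 4.0 instead of 1.0 in document 0). The relation between the two representations can be sketched in plain Python on toy, pre-tokenized documents invented for this example. It illustrates the idea and is not how scikit-learn is implemented.

```python
from collections import Counter

# Toy pre-tokenized documents, made up for illustration only.
docs = [['xps', 'pantalla', 'xps'], ['macbook', 'pantalla']]

# Column order: alphabetically sorted vocabulary.
vocab = sorted({term for doc in docs for term in doc})

# Frequency representation: how many times each term occurs per document.
tf_rows = [[Counter(doc)[term] for term in vocab] for doc in docs]

# Binary representation: 1 if the term occurs at all, 0 otherwise.
binary_rows = [[int(count > 0) for count in row] for row in tf_rows]

print(vocab)        # ['macbook', 'pantalla', 'xps']
print(tf_rows)      # [[0, 1, 2], [1, 1, 0]]
print(binary_rows)  # [[0, 1, 1], [1, 1, 0]]
```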


In [13]:
## To export this information to a file
vsm_info = {'vsm': tf_table, 
            'IDF': [],          ## empty: IDF is disabled in this model
            'vocabulary': vocabulary}

## Export to a file 
joblib.dump(value=vsm_info, 
            filename='tf_model.joblib', 
            compress=9)

## To import it back, use
modelo_tf = data = joblib.load(filename='tf_model.joblib')

## TF-IDF representation (term frequency-inverse document frequency)

In [14]:
## Model instance
model = TfidfVectorizer(use_idf=True,       # <-- enable IDF weighting
                        norm=None,          # <-- no normalization of the document vectors
                        ngram_range=(1,1),
                        binary=False        
                        )

## Fit the model and return the TF-IDF matrix
tf_sparse = model.fit_transform(cleanTexts)

## Extract the vocabulary built by the model (dict :: key (word), value (index))
vocabulary = model.vocabulary_

In [15]:
features = sorted(vocabulary.items(), key=lambda x: x[1])
features = [f for f, _ in features]

tf_table = DataFrame(tf_sparse.toarray(), columns=features)
tf_table

Unnamed: 0,absolutamente,acer,activo,actual,actualizado,agreguir,aguantar,air,alattiyah,alemán,...,ver,versión,victoria,video,viejo,volver,web,xps,él,óptimo
0,2.386294,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.980829,0.0,0.0,0.0,0.0,0.0,9.545177,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,2.386294,7.923317,0.0,0.0,...,1.980829,0.0,0.0,0.0,2.386294,0.0,0.0,0.0,1.980829,0.0
2,0.0,4.772589,0.0,0.0,2.386294,0.0,0.0,0.0,0.0,0.0,...,0.0,1.980829,0.0,2.386294,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,2.386294,0.0,1.980829,0.0,0.0,...,1.980829,0.0,0.0,0.0,0.0,2.386294,2.386294,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.386294
6,0.0,0.0,2.386294,2.386294,0.0,0.0,0.0,0.0,2.386294,2.386294,...,0.0,0.0,7.158883,0.0,0.0,0.0,0.0,0.0,1.980829,0.0
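
The values in this table are consistent with scikit-learn's default smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents that contain t. Because `norm=None`, each cell is simply the raw term count multiplied by that weight. Here is a small check using only the standard library, assuming n = 7 and reading the counts from the frequency table shown earlier (the helper `smooth_idf` is hypothetical, written for this check).

```python
from math import log


def smooth_idf(n_docs, doc_freq):
    """Smoothed IDF, as scikit-learn computes it when smooth_idf=True (the default)."""
    return log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 7

# 'absolutamente' occurs once, in a single document.
print(round(1 * smooth_idf(n_docs, 1), 6))   # 2.386294

# 'xps' occurs 4 times in document 0 and in no other document.
print(round(4 * smooth_idf(n_docs, 1), 6))   # 9.545177

# 'air' occurs in two documents, 4 times in document 1.
print(round(4 * smooth_idf(n_docs, 2), 6))   # 7.923317
```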


In [None]:
## To export this information to a file
vsm_info = {'vsm': tf_table, 
            'IDF': model.idf_,
            'vocabulary': vocabulary}

## Export to a file 
joblib.dump(value=vsm_info, 
            filename='tfidf_model.joblib', 
            compress=9)

## To import it back, use
modelo_tfidf = data = joblib.load(filename='tfidf_model.joblib')
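
The exported dictionary stores the table, the IDF vector and the vocabulary, but it cannot transform new documents on its own. One possible complement, shown here as a sketch and not as part of the original workflow, is to persist the fitted vectorizer as well. The toy corpus, the temporary directory and the file name `tfidf_vectorizer.joblib` are assumptions made for this example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['laptop pantalla', 'laptop teclado']   # toy corpus
model = TfidfVectorizer(use_idf=True, norm=None).fit(docs)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, chosen for this example only.
    path = os.path.join(tmp_dir, 'tfidf_vectorizer.joblib')
    joblib.dump(value=model, filename=path, compress=9)
    restored = joblib.load(filename=path)

# The restored vectorizer can transform documents it has never seen.
new_doc = ['teclado pantalla oled']
original = model.transform(new_doc).toarray()
reloaded = restored.transform(new_doc).toarray()
print((original == reloaded).all())
```

Files written by `joblib` should only be loaded from trusted sources and, ideally, with the same scikit-learn version that produced them.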