The resenna dataset contains reviews of a library in Spain.

The goal is to classify them according to the users' demands.

- Read the file and encode the reviews using Tf*Idf, grouping 1, 2, or (1 and 2) words.
- Define a criterion for reducing the dimension of the vectors produced by the encoding.
- Reduce the dimensionality of the encoded reviews and cluster them with K-means.
- Which grouping best separated the reviews according to the library resources involved (that is, deficiencies in WiFi, the electrical supply, staff treatment, etc.)?
- Finally, evaluate the following sentences with the Tf*Idf transformation, the feature hashing algorithm, and the trained K-means algorithm:

'Hace demasiado calor',
'Poco tiempo de servicio.',
'Falta de personal capacitado.',
'Excelente servicio',
'Falta de enchufes',
'No hay internet',
'Poco espacio para buscar libros',
'El internet falla mucho',
'Falta iluminacion',
'Falta literatura',
'Luz pesima',
'Mucho frio',
'Mucho calor en invierno',
'Calefaccion espantosa',
'Mucho frio en verano',
'Falta de contactos',
'La luz es deficiente',
'Personal maleducado',
'La climatizacion es espantosa',
'Trato desagradable de los empleados',
'Falta de mantenimiento',
'Falta de limpieza'


In [1]:
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
data = pd.read_csv('data/resenna.csv')
data.head(15)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nhernand\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,CAUSA
0,Vacio
1,Vacio
2,Vacio
3,Vacio
4,Vacio
5,Hace mucho calor en la sala de lectura y entra...
6,Vacio
7,Unos días hace mucho frio y no se puede estar ...
8,Cuando ponen la calefación hace mucho calor y ...
9,Vacio


In [2]:
# 'Vacio' (empty) adds no value to the analysis; drop it with a boolean mask.
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below.
data = data.loc[data['CAUSA'] != 'Vacio'].copy()
# Remove digits as part of preprocessing; they are not of interest here.
data['CAUSA'] = data['CAUSA'].str.replace(r'\d+', '', regex=True)
data.head(15)



Unnamed: 0,CAUSA
5,Hace mucho calor en la sala de lectura y entra...
7,Unos días hace mucho frio y no se puede estar ...
8,Cuando ponen la calefación hace mucho calor y ...
10,"Hace demasiado calor en invierno y verano, imp..."
16,Temperatura en la sala de estudio excesivament...
17,Mucho calor
18,Espacio de préstamo infantil muy caluroso
22,Hace demasiado calor en la sala de lectura. Ya...
23,no se pueden cargar los ordenadores porque no ...
24,Falta de enchufes en sala de lectura. Mala ref...


In [3]:
# Load the standard Spanish stop words from NLTK's stopwords corpus.
stop_words_array = stopwords.words("spanish")
stop_words_array[0:15] # Show the first 15

['de',
 'la',
 'que',
 'el',
 'en',
 'y',
 'a',
 'los',
 'del',
 'se',
 'las',
 'por',
 'un',
 'para',
 'con']

## Read the file and encode the reviews using Tf*Idf, grouping 1, 2, or (1 and 2) words.

In [4]:
# Define the vectorizer
# Each row of the dataset is treated as one document
vectorizer = TfidfVectorizer(
    lowercase=True, # Convert all characters to lowercase before tokenizing.
    min_df=2, # Minimum document frequency: keep terms that appear in at least 2 documents (rows)
    max_df=.7, # Maximum document frequency: drop terms that appear in more than 70% of the rows
    stop_words=stop_words_array,
    ngram_range=(1, 2) # Generate unigrams and bigrams
    )
vectorizer

In [5]:
# Fit the vectorizer on the reviews
vectorizer.fit(data['CAUSA'])

In [6]:
# Build the Tf*Idf matrix of the reviews
column_causa = data['CAUSA']
X = vectorizer.transform(column_causa)
TermFrecuencyInverseDocumentFrecuency_df = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())
TermFrecuencyInverseDocumentFrecuency_df

Unnamed: 0,abiertas,abrigo,abrigo puesto,abrir,abrir ventanas,accesibilidad,accesible,acceso,acondicionado,acondicionado hace,...,ventanas abiertas,ventilación,ventilación natural,ver,verano,volver,vuelva,web,wifi,wifi falla
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.308987,0.379932,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.229314,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.337730,0.0,0.0,0.0,0.0,0.0
194,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
196,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0


In [7]:
importancia_palabras = TermFrecuencyInverseDocumentFrecuency_df.sum().sort_values(ascending=False)
importancia_palabras = pd.DataFrame(importancia_palabras, columns=['sum'])
importancia_palabras.describe()

Unnamed: 0,sum
count,318.0
mean,1.464614
std,1.540607
min,0.369867
25%,0.673752
50%,0.92736
75%,1.528653
max,14.527798


In [8]:
importancia_palabras.head(10)

Unnamed: 0,sum
calor,14.527798
hace,10.95847
funciona,8.397159
verano,7.156647
invierno,6.763244
hace calor,6.420005
biblioteca,6.34107
frío,6.023094
sala,6.014995
refrigeración,5.720571


## Define a criterion for reducing the dimension of the vectors produced by the encoding.
The criterion used is PCA, keeping the first two principal components.
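The number of components can also be derived from the data rather than fixed in advance. One possible criterion, sketched here as an assumption and not as the method used in this notebook, is to keep the smallest number of components whose cumulative explained variance reaches a threshold. The helper `components_for_variance`, the 0.90 threshold, and the synthetic matrix are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(matrix, threshold=0.90):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    pca = PCA().fit(matrix)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against floating-point rounding when threshold is close to 1.0
    return min(k, len(cumulative))


# Synthetic data: 20 columns driven by only 3 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
mixing = rng.normal(size=(3, 20))
synthetic = latent @ mixing + 0.01 * rng.normal(size=(100, 20))

n_needed = components_for_variance(synthetic, threshold=0.90)
print(n_needed)
```

On the review data the same helper could be called with `X.toarray()`; two components are convenient for plotting but may retain only a small share of the variance of a sparse Tf*Idf matrix.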


## Reduce the dimensionality of the encoded reviews and cluster them with K-means.

In [9]:
from sklearn.decomposition import PCA

# Create a PCA instance to reduce the dimensionality of the encoded reviews
pca = PCA(n_components=2)

# Reduce the dimensionality of the encoded reviews
reviews_pca = pca.fit_transform(X.toarray())
reviews_pca

array([[ 0.15258876, -0.11110612],
       [ 0.35182968,  0.02647342],
       [ 0.41669818,  0.23892432],
       [ 0.31196818, -0.03337227],
       [ 0.00620207, -0.13190784],
       [ 0.46609883,  0.035856  ],
       [-0.12938551, -0.06769313],
       [ 0.09219803, -0.12061776],
       [-0.10491486, -0.08391995],
       [-0.14077438, -0.19894233],
       [-0.10214463, -0.07960582],
       [-0.12686335, -0.00508984],
       [ 0.02281982, -0.17992644],
       [-0.0943992 , -0.04695833],
       [-0.0907715 , -0.04465964],
       [ 0.3114007 , -0.04465398],
       [-0.11682945, -0.18600494],
       [-0.1118454 , -0.10543309],
       [-0.06992316, -0.03038499],
       [-0.11159444, -0.06883686],
       [-0.07032275,  0.23439102],
       [-0.09171841, -0.04873579],
       [ 0.34142214, -0.00477061],
       [-0.09497392, -0.06809644],
       [ 0.31398541,  0.19957605],
       [-0.08982576, -0.10938977],
       [-0.07767121, -0.02946374],
       [-0.13660464, -0.10260577],
       [-0.12039803,

In [15]:
# For the K-means clustering, the number of groups is controlled with n_clusters
from sklearn.cluster import KMeans

n_grupos = 5
# Create a KMeans instance to cluster the reviews
kmeans = KMeans(n_clusters=n_grupos, random_state=42)

# Cluster the reviews with K-means
kmeans.fit(reviews_pca)

# Add the groups as a new column of the original dataframe
data['grupo'] = kmeans.labels_
data.sort_values(by='grupo', ascending=False).head()



Unnamed: 0,CAUSA,grupo
5,Hace mucho calor en la sala de lectura y entra...,4
789,Demasiado calor a partir de mayo,4
364,Siempre hay problemas con la temperatura del e...,4
379,Hace muchísimo calor!!!!,4
461,No he sentido ni frío en invierno ni calor en ...,4


## Which grouping best separated the reviews according to the library resources involved? (that is, deficiencies in WiFi, the electrical supply, staff treatment, etc.)
Groups 4 and 2 were the best separated. Both refer to temperature, which suggests that the number of groups could be reduced.
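Whether a smaller number of groups fits better can also be checked numerically. The sketch below compares silhouette scores for several values of `n_clusters`; the silhouette score is one possible measure and an assumption here, since the notebook does not use it. The synthetic blobs and the helper `silhouette_by_k` are hypothetical stand-ins for `reviews_pca`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D points with three well-separated groups (stand-in for reviews_pca)
points, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)


def silhouette_by_k(data, candidates):
    """Fit K-means for each candidate k and return {k: silhouette score}."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(data)
        scores[k] = silhouette_score(data, labels)
    return scores


scores = silhouette_by_k(points, range(2, 7))
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

A higher score means tighter, better separated clusters. On real text data the differences between values of k are usually much smaller than on these blobs, so the score is a guide and not a verdict.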

## Finally, evaluate the sentences (listed above) with the Tf*Idf transformation, the feature hashing algorithm, and the trained K-means algorithm


In [16]:
oraciones_propuestas = ['Hace demasiado calor',
'Poco tiempo de servicio.',
'Falta de personal capacitado.',
'Excelente servicio',
'Falta de enchufes',
'No hay internet',
'Poco espacio para buscar libros',
'El internet falla mucho',
'Falta iluminacion',
'Falta literatura',
'Luz pesima',
'Mucho frio',
'Mucho calor en invierno',
'Calefaccion espantosa',
'Mucho frio en verano',
'Falta de contactos',
'La luz es deficiente',
'Personal maleducado',
'La climatizacion es espantosa',
'Trato desagradable de los empleados',
'Falta de mantenimiento',
'Falta de limpieza']

X = vectorizer.transform(oraciones_propuestas)
TermFrecuencyInverseDocumentFrecuency_df = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())
TermFrecuencyInverseDocumentFrecuency_df


Unnamed: 0,abiertas,abrigo,abrigo puesto,abrir,abrir ventanas,accesibilidad,accesible,acceso,acondicionado,acondicionado hace,...,ventanas abiertas,ventilación,ventilación natural,ver,verano,volver,vuelva,web,wifi,wifi falla
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
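Several of the proposed sentences are written without accents (for example 'Mucho frio' and 'Calefaccion espantosa'), while the fitted vocabulary contains accented forms such as `frío` and `refrigeración`. Unaccented words therefore do not match those columns. The sketch below, on a made-up training set, shows the effect of the `strip_accents="unicode"` option of `TfidfVectorizer`; using it in this notebook is a suggestion, not something the original cells do.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training texts with accents and queries without them (hypothetical)
training = ["hace mucho frío", "la calefacción no funciona", "falta iluminación"]
queries = ["Mucho frio", "Calefaccion espantosa"]

# Default behaviour: 'frio' and 'frío' are different tokens
plain = TfidfVectorizer().fit(training)
# Accents are removed before tokenizing, both at fit and at transform time
folded = TfidfVectorizer(strip_accents="unicode").fit(training)

plain_hits = plain.transform(queries).count_nonzero()
folded_hits = folded.transform(queries).count_nonzero()

print("matched terms without accent folding:", plain_hits)
print("matched terms with accent folding:", folded_hits)
```

If this option is combined with a stop-word list that contains accented words, the list should be accent-stripped in the same way, because scikit-learn compares stop words against the already preprocessed tokens.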


In [17]:
reviews_pca = pca.fit_transform(X.toarray())
reviews_pca

array([[-0.26047039, -0.01582732],
       [-0.32931163, -0.26498335],
       [ 0.44754615, -0.01672011],
       [-0.32931163, -0.26498335],
       [ 0.30474217,  0.00106561],
       [-0.33116411, -0.39738654],
       [-0.25084782, -0.01349831],
       [-0.33116411, -0.39738654],
       [ 0.74633835,  0.00792819],
       [ 0.74633835,  0.00792819],
       [-0.31884121, -0.08964331],
       [-0.33685734,  0.79283935],
       [-0.26047039, -0.01582732],
       [-0.18678375, -0.00570469],
       [-0.33685734,  0.79283935],
       [ 0.74633835,  0.00792819],
       [-0.31884121, -0.08964331],
       [-0.13221511, -0.03115206],
       [-0.18678375, -0.00570469],
       [-0.25084782, -0.01349831],
       [ 0.74633835,  0.00792819],
       [ 0.42312588,  0.00350213]])
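Calling `fit_transform` on the new sentences learns a new projection from those 22 sentences alone. If the intent is to evaluate them with the models trained on the reviews, one option is to call `transform` on the fitted vectorizer and PCA, and `predict` on the fitted K-means. The sketch below shows that pattern on a made-up corpus; all texts and variable names in it are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training corpus and new sentences (hypothetical)
train_texts = [
    "hace mucho calor", "mucho calor en verano", "demasiado calor",
    "el wifi falla", "wifi muy lento", "no funciona el wifi",
]
new_texts = ["mucho calor", "el wifi falla mucho"]

# Fit every step on the training texts only
vec = TfidfVectorizer()
train_tfidf = vec.fit_transform(train_texts)
pca = PCA(n_components=2, random_state=0)
train_2d = pca.fit_transform(train_tfidf.toarray())
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(train_2d)

# Reuse the fitted objects on the new sentences: transform / predict, no refit
new_tfidf = vec.transform(new_texts)
new_2d = pca.transform(new_tfidf.toarray())
new_labels = km.predict(new_2d)

print(new_2d)
print(new_labels)
```

With this pattern the group numbers assigned to the new sentences refer to the same clusters that were found on the training reviews, so they can be compared directly.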

In [20]:
n_grupos = 6
# Create a KMeans instance to cluster the proposed sentences
kmeans = KMeans(n_clusters=n_grupos, random_state=42)

# Cluster the proposed sentences with K-means (this fits a new model on them)
kmeans.fit(reviews_pca)

# Put the sentences and their groups in a new dataframe
second_data = pd.DataFrame(oraciones_propuestas, columns=['CAUSA'])
second_data['grupo'] = kmeans.labels_
second_data.sort_values(by='grupo', ascending=False)



Unnamed: 0,CAUSA,grupo
18,La climatizacion es espantosa,5
17,Personal maleducado,5
13,Calefaccion espantosa,5
1,Poco tiempo de servicio.,4
3,Excelente servicio,4
5,No hay internet,4
7,El internet falla mucho,4
21,Falta de limpieza,3
2,Falta de personal capacitado.,3
4,Falta de enchufes,3
