#### TF-IDF: Examples with Spanish-language reviews

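The TF-IDF (Term Frequency-Inverse Document Frequency) transformation converts a collection of text documents into a numeric matrix with one row per document and one column per term. Each value weighs how often the term occurs in that document against how common it is across the whole collection, so terms that are frequent in one document but rare elsewhere score highest.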
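
For reference, the weighting can be written out as below. This is a sketch of the variant that scikit-learn's `TfidfVectorizer` applies with its default settings (raw term counts, smoothed IDF, L2 normalization per document), which is what the examples in this section use; other libraries and textbooks define the IDF term slightly differently. The symbols $n$ (number of documents), $\mathrm{df}(t)$ (number of documents containing term $t$), $\mathrm{tf}(t,d)$ (count of $t$ in document $d$) and $w_{t,d}$ are notation introduced here.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{t,d} = \mathrm{tf}(t,d)\,\mathrm{idf}(t),
\qquad
\text{tf-idf}(t,d) = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}
```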


##### 1. Simple example

In [63]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
import numpy as np

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tomas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [22]:

# The Spanish stopwords were already downloaded in the cell above (only needed the first time)

# Example documents
documents = [
    "El gato está en el tejado",
    "El perro está en el jardín",
    "El gato y el perro son amigos"
]

# Spanish stopword list
spanish_stopwords = stopwords.words('spanish')

# Create a TF-IDF vectorizer that removes Spanish stopwords
vectorizer = TfidfVectorizer(stop_words=spanish_stopwords)

# Fit the vectorizer and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Convert the TF-IDF matrix into a pandas DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Create a DataFrame with additional variables
additional_data = {
    'author': ['Juan', 'Ana', 'Pedro'],
    'length': [5, 5, 6]
}
additional_df = pd.DataFrame(additional_data)

# Concatenate the DataFrames
combined_df = pd.concat([additional_df, tfidf_df], axis=1)

print(combined_df)


  author  length    amigos      gato    jardín     perro    tejado
0   Juan       5  0.000000  0.605349  0.000000  0.000000  0.795961
1    Ana       5  0.000000  0.000000  0.795961  0.605349  0.000000
2  Pedro       6  0.680919  0.517856  0.000000  0.517856  0.000000
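
The weights printed above can be checked by hand. The sketch below assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization per row) and takes the tokens that survive stopword removal as given rather than tokenizing the sentences; `tfidf_row` is a helper name introduced here for illustration.

```python
import math
from collections import Counter

# Tokens left after lowercasing and removing Spanish stopwords
# (written out by hand here; the vectorizer derives them automatically).
docs = [
    ["gato", "tejado"],
    ["perro", "jardín"],
    ["gato", "perro", "amigos"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf_row(doc):
    """Return {term: weight} for one document, L2-normalized."""
    counts = Counter(doc)
    raw = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print({term: round(w, 6) for term, w in row.items()})
```

"gato" scores lower than "tejado" in the first document because it also appears in the third one, which lowers its IDF.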


##### 2. Example with film reviews
https://www.kaggle.com/datasets/pshiju/djinn-spanish-text-vector

###### Open questions
What values does the matrix return, frequencies? Done.
What does the dense matrix do?
How do we join the dense matrix? The merge process has been changed.
To keep things simpler, we could use the review title instead.

In [25]:
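# `path` must point to the local copy of the dataset file, whose fields are separated by '||'.
# The raw string keeps '\|' from being read as an invalid escape sequence.
df2 = pd.read_table(path, sep=r'\|\|', header=0, engine='python')
df2.sample(5)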


Unnamed: 0,film_name,gender,film_avg_rate,review_rate,review_title,review_text
5010,Mientras dure la guerra,Drama,"6,8",6.0,Mientras dure la guerra por Cine de Patio,Un momento clave de la historia de España es a...
1357,El orfanato,Terror,"6,7",9.0,"Una de las mejores películas del cine español,...","""El orfanato"" es una mezcla de intriga y terro..."
1645,Torrente 2: Misión en Marbella,Comedia,"5,3",9.0,Muy buena segunda parte de Torrente,Santiago Segura ha encontrado la gallina de lo...
5920,Tres metros sobre el cielo,Romance,"4,7",4.0,"""FEA"" como dice el protagonista al inicio, en ...",Tan famosa que me digné por fin a verla.Chico ...
3354,Palmeras en la nieve,Drama,"6,0",6.0,Interruptus,"Pensaba que..., pero... agridulce sensación. T..."
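
A note on the double-pipe separator: `|` is a regex metacharacter, so a multi-character separator has to be escaped and parsed with the Python engine. The sketch below shows this on an in-memory sample; the two rows are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the '||'-separated layout
raw = (
    "film_name||gender||review_rate||review_title\n"
    "Pelicula A||Drama||6.0||Titulo uno\n"
    "Pelicula B||Comedia||9.0||Titulo dos\n"
)

# A regex separator requires engine='python'; the raw string keeps the backslashes literal
sample = pd.read_csv(io.StringIO(raw), sep=r"\|\|", header=0, engine="python")
print(sample.shape)
print(sample.columns.tolist())
```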


In [33]:
nltk.download('stopwords')
spanish_stopwords = stopwords.words('spanish')
vectorizer = TfidfVectorizer(stop_words=spanish_stopwords)



In [99]:
films = df2.groupby(['film_name']).count()
print(films)

                                         gender  film_avg_rate  review_rate  \
film_name                                                                     
Ahora o nunca                                81             81           81   
AirBag                                       98             98           98   
Alatriste                                   445            445          445   
Atrapa la bandera                            51             51           51   
Campeones                                   199            199          199   
Celda 211                                   487            487          487   
Días de fútbol                               57             57           57   
El bola                                      64             64           64   
El laberinto del fauno                      495            495          495   
El mejor verano de mi vida                   54             54           54   
El orfanato                                 446     
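
`count()` reports the number of non-null values in every column, which is why the same figure repeats across each row above. If only the number of reviews per film is needed, `size()` returns a single value per group. A small sketch with made-up rows (`toy` and its values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "film_name": ["A", "A", "B", "A", "B"],
    "review_rate": [7.0, 6.0, None, 8.0, 4.0],
})

# One number per film: how many rows (reviews) it has
reviews_per_film = toy.groupby("film_name").size().sort_values(ascending=False)
print(reviews_per_film)

# count() skips missing values, so it can be smaller than size()
rated_per_film = toy.groupby("film_name")["review_rate"].count()
print(rated_per_film)
```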

In [88]:
df3=df2[df2['film_name']=='Tadeo Jones 2']
df3

Unnamed: 0,film_name,gender,film_avg_rate,review_rate,review_title,review_text
3236,Tadeo Jones 2,Animación,"5,7",7.0,Superándose paso a paso.,"""¿Qué es eso que me llama este señor de gitani..."
3237,Tadeo Jones 2,Animación,"5,7",6.0,Tadeo Jones 2. El secreto del rey Midas por Ci...,l simpático aventurero de producción española ...
3238,Tadeo Jones 2,Animación,"5,7",8.0,"Muy divertida, magnífica factura técnica y mej...",Creo que no me divertía tanto en una sala de c...
3239,Tadeo Jones 2,Animación,"5,7",6.0,Tadeo conquista Granada,Es una excelente noticia que Tadeo Jones sea u...
3240,Tadeo Jones 2,Animación,"5,7",4.0,5 Opiniones Rápidas de Tadeo Jones 2,1- Lo mejor que se puede decir de esta películ...
3241,Tadeo Jones 2,Animación,"5,7",8.0,Tadeo vuelve a lo grande.,Ahora que han pasado cinco años desde el estre...
3242,Tadeo Jones 2,Animación,"5,7",8.0,Superior a la primera entrega y tremendamente ...,Me ha parecido superior en casi todos los aspe...
3243,Tadeo Jones 2,Animación,"5,7",6.0,Entretenimiento a medias,"Me hace dudar un poco, a mi niña le ha encanta..."
3244,Tadeo Jones 2,Animación,"5,7",7.0,"Para mí, mejor que la primera",Me fuí a verla con mi família y nos divertimos...
3245,Tadeo Jones 2,Animación,"5,7",5.0,Aparenta pero no cuaja,La historia tiene cierto interés y cierto brío...


In [90]:
reviews = df3['review_text'].tolist()
tfidf_matrix3 = vectorizer.fit_transform(reviews)
tfidf_df3 = pd.DataFrame(tfidf_matrix3.toarray(), columns=vectorizer.get_feature_names_out())

In [83]:
tfidf_df3.head(20)

Unnamed: 0,000,05,10,100,10retalesdeacetato,10verde,150,18,2009,2011,...,énfasis,ésta,ésto,éxito,últimamente,último,últimos,única,únicamente,único
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.056082,0.0,0.0,0.0,0.0,0.0,0.0,0.112164,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.084752,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.053793,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.141791,0.07908,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.056411,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [70]:
tfidf_df3.columns.tolist()

['000',
 '05',
 '10',
 '100',
 '10retalesdeacetato',
 '10verde',
 '150',
 '18',
 '2009',
 '2011',
 '2012',
 '2015',
 '2016',
 '2018',
 '27',
 '2d',
 '3d',
 '51',
 '600',
 '80',
 '81',
 '90',
 'abandona',
 'abierto',
 'absoluta',
 'absolutamente',
 'absurdos',
 'aburrida',
 'aburrí',
 'acaba',
 'acabar',
 'acabo',
 'acabó',
 'acción',
 'acertada',
 'acertados',
 'acierto',
 'acompaña',
 'acompañado',
 'acompañan',
 'acompañaras',
 'acorde',
 'actor',
 'acudir',
 'adecuada',
 'adelantaba',
 'adelante',
 'además',
 'adinerada',
 'adrian',
 'adultos',
 'afeitarse',
 'afirma',
 'agarre',
 'agotamiento',
 'agradable',
 'agradece',
 'agradecen',
 'agradecer',
 'aguas',
 'ahora',
 'ahí',
 'ajena',
 'albañil',
 'alberto',
 'alguna',
 'alguno',
 'algún',
 'alhambra',
 'aliciente',
 'alimón',
 'allá',
 'alma',
 'alonso',
 'alrededor',
 'alta',
 'altamente',
 'alto',
 'altura',
 'amante',
 'ambas',
 'ambienta',
 'ambientación',
 'ambientes',
 'ambulante',
 'amena',
 'amenaza',
 'americana',
 'amer

In [96]:
# Earlier attempt, kept for reference: pd.concat([df3, tfidf_df3], axis=0)
# stacks the rows of both frames instead of adding the TF-IDF columns.
df3 = df3.copy()  # work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
df3['key'] = range(len(df3))
tfidf_df3['key'] = range(len(tfidf_df3))

# Join the review metadata and the TF-IDF columns on the positional key
combined_df3 = pd.merge(df3, tfidf_df3, on='key', how='left')
print(combined_df3)

        film_name     gender film_avg_rate  review_rate  \
0   Tadeo Jones 2  Animación           5,7          7.0   
1   Tadeo Jones 2  Animación           5,7          6.0   
2   Tadeo Jones 2  Animación           5,7          8.0   
3   Tadeo Jones 2  Animación           5,7          6.0   
4   Tadeo Jones 2  Animación           5,7          4.0   
5   Tadeo Jones 2  Animación           5,7          8.0   
6   Tadeo Jones 2  Animación           5,7          8.0   
7   Tadeo Jones 2  Animación           5,7          6.0   
8   Tadeo Jones 2  Animación           5,7          7.0   
9   Tadeo Jones 2  Animación           5,7          5.0   
10  Tadeo Jones 2  Animación           5,7          6.0   
11  Tadeo Jones 2  Animación           5,7          6.0   
12  Tadeo Jones 2  Animación           5,7          6.0   
13  Tadeo Jones 2  Animación           5,7          9.0   
14  Tadeo Jones 2  Animación           5,7          6.0   
15  Tadeo Jones 2  Animación           5,7          4.0 
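
The helper `key` column is needed because the two frames carry different row labels: the filtered `df3` keeps the labels from `df2` (3236, 3237, ...), while the TF-IDF frame is labelled 0, 1, 2, and so on. `pd.concat(..., axis=1)` aligns on labels, so concatenating the frames directly would pad them with NaN rather than pair the rows. Aligning the indexes first is an alternative to the key-based merge. A sketch with toy frames (`meta`, `weights` and their values are illustrative):

```python
import pandas as pd

meta = pd.DataFrame({"review_rate": [7.0, 6.0, 8.0]}, index=[3236, 3237, 3238])
weights = pd.DataFrame({"gato": [0.1, 0.0, 0.3], "perro": [0.0, 0.2, 0.0]})

# Labels do not match, so concat returns the union of both indexes
misaligned = pd.concat([meta, weights], axis=1)
print(misaligned.shape)  # (6, 3): rows are padded with NaN instead of paired

# Option 1: drop the original labels and pair rows by position
by_position = pd.concat([meta.reset_index(drop=True), weights], axis=1)

# Option 2: keep the original labels by copying them onto the TF-IDF frame
by_label = pd.concat([meta, weights.set_axis(meta.index)], axis=1)
print(by_label)
```

Option 2 keeps the original row labels, which makes it easier to trace a row back to `df2`.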


In [97]:
combined_df3.loc[32]

film_name        Tadeo Jones 2
gender               Animación
film_avg_rate              5,7
review_rate                1.0
review_title             SOPOR
                     ...      
último                     0.0
últimos                    0.0
única                      0.0
únicamente                 0.0
único                 0.181096
Name: 32, Length: 1881, dtype: object

In [98]:
# to_csv writes plain CSV text, so the file gets a .csv extension
combined_df3.to_csv('tadeo.csv', index=False)

In [67]:
# Dense matrix: todense() materializes every cell and returns a numpy.matrix
dense_tfidf_matrix = tfidf_matrix3.todense()
print("TF-IDF matrix:")
print(dense_tfidf_matrix)

TF-IDF matrix:
[[0.         0.         0.         ... 0.1121638  0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.05609628 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
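
On the question of what the dense matrix is for: `fit_transform` returns a SciPy sparse matrix that stores only the non-zero entries, while `toarray()` and `todense()` materialize every cell, zeros included. `toarray()` gives a plain `numpy.ndarray`; `todense()` gives the older `numpy.matrix` type. The sketch below uses a hand-built sparse matrix as a stand-in for `tfidf_matrix3`; the values are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for a TF-IDF matrix: 3 documents x 5 terms, mostly zeros
sparse = csr_matrix(np.array([
    [0.0, 0.6, 0.0, 0.0, 0.8],
    [0.0, 0.0, 0.8, 0.6, 0.0],
    [0.7, 0.5, 0.0, 0.5, 0.0],
]))

# Matrix shape and number of stored non-zero entries
print(sparse.shape, sparse.nnz)

as_array = sparse.toarray()    # numpy.ndarray
as_matrix = sparse.todense()   # numpy.matrix
print(type(as_array).__name__, type(as_matrix).__name__)
```

For a real vocabulary of well over a thousand terms, most cells are zero, so the dense form mainly matters when a downstream tool cannot consume sparse input.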


In [68]:
# pd.concat fails on the dense matrix: it only accepts Series and DataFrame objects
combined_df3 = pd.concat([df3, dense_tfidf_matrix], axis=1)
print(combined_df3)

TypeError: cannot concatenate object of type '<class 'numpy.matrix'>'; only Series and DataFrame objs are valid
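
One way around the `TypeError` is to wrap the matrix in a DataFrame before concatenating, giving it the term names as columns and the same index as the metadata frame. The sketch below uses toy data; `meta`, `terms` and `tfidf` are illustrative names, and `pd.DataFrame.sparse.from_spmatrix` is shown as an option that avoids densifying the matrix at all.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

meta = pd.DataFrame({"review_rate": [7.0, 6.0]}, index=[3236, 3237])
terms = ["gato", "perro", "tejado"]
tfidf = csr_matrix(np.array([[0.6, 0.0, 0.8], [0.0, 1.0, 0.0]]))

# Dense route: ndarray -> DataFrame that shares meta's index
dense_df = pd.DataFrame(tfidf.toarray(), columns=terms, index=meta.index)
combined = pd.concat([meta, dense_df], axis=1)
print(combined)

# Sparse route: zeros stay implicit inside pandas sparse columns
sparse_df = pd.DataFrame.sparse.from_spmatrix(tfidf, index=meta.index, columns=terms)
combined_sparse = pd.concat([meta, sparse_df], axis=1)
print(combined_sparse.shape)
```

Passing `index=meta.index` in both routes is what lets `pd.concat` pair the rows correctly.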