In [1]:
import pandas as pd

In [2]:
path_clean_df = "/content/drive/MyDrive/00. Universidad/02. Analisis de Datos/evento evaluativo 4/amazon_review_stemming.parquet"
df = pd.read_parquet(path_clean_df, engine="pyarrow")
df.head()

Unnamed: 0,rating,clean_title,clean_review,clean_review_stemming
0,3,more like funchuck,gave this to my dad for a gag gift after direc...,gave dad gag gift direct nunsens got reall kick
1,5,inspiring,i hope a lot of people hear this cd we need mo...,hope lot peopl hear cd need strong posit vibe ...
2,5,the best soundtrack ever to anything,im reading a lot of reviews saying that this i...,im read lot review say best game soundtrack fi...
3,4,chrono cross ost,the music of yasunori misuda is without questi...,music yasunori misuda without question close s...
4,5,too good to be true,probably the greatest soundtrack in history us...,probabl greatest soundtrack histori usual bett...


In [3]:
df.isnull().sum()

Unnamed: 0,0
rating,0
clean_title,0
clean_review,0
clean_review_stemming,0


In [4]:
# Drop the 'clean_review' and 'clean_title' columns; only the stemmed text is used from here on
df.drop(columns=['clean_review', 'clean_title'], inplace=True)

# (optional) force garbage collection to release memory
import gc
gc.collect()


16

## Vector Representation: Bag-of-Words and TF-IDF




**Objective: Convert the reviews into numerical representations using Bag-of-Words and TF-IDF for later analysis.**


**Bag-of-Words (BoW):** A technique that turns text into a numerical representation by counting how often each word occurs in a document, ignoring word order and grammar.

Each document is represented as a vector in which every dimension corresponds to one word of the vocabulary and the value is the number of times that word occurs in the document.

**TF-IDF (Term Frequency-Inverse Document Frequency):** This technique refines the BoW representation by weighting word frequencies according to how informative each word is across the corpus.

It computes how often a word occurs in a document (TF) and multiplies it by the inverse document frequency (IDF), a factor that shrinks as more documents contain the word. This reduces the influence of common words and highlights more informative terms.
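
The sketch below recomputes TF-IDF by hand on a hypothetical three-document corpus and compares it with `TfidfVectorizer`. It assumes scikit-learn's default settings, which use the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` and then L2-normalise each document vector. Other TF-IDF variants exist, and the variable names here are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
toy_corpus = [
    "great soundtrack great game",
    "bad game",
    "great game",
]

# Raw term counts (TF) and number of documents containing each term (DF)
counts = CountVectorizer().fit_transform(toy_corpus).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF as used by scikit-learn's defaults
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF * IDF, then L2-normalise each row (default norm="l2")
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

sk_tfidf = TfidfVectorizer().fit_transform(toy_corpus).toarray()
matches = np.allclose(manual, sk_tfidf)
print(matches)  # True
```

The term "game" appears in every document, so its IDF is the minimum value of 1, while "bad" and "soundtrack" appear in one document each and receive a larger weight.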


In [5]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [6]:
corpus = df["clean_review_stemming"].tolist()

### 📍 Bag-of-Words Representation with CountVectorizer

In [7]:
cv = CountVectorizer()
bow_matrix = cv.fit_transform(corpus)
print("Bag-of-Words matrix shape:", bow_matrix.shape)

Bag-of-Words matrix shape: (3629444, 1759292)


In [8]:
print("Example terms (BoW):", cv.get_feature_names_out()[:100])

Example terms (BoW): ['aa' 'aaa' 'aaaa' 'aaaaa' 'aaaaaa' 'aaaaaaa' 'aaaaaaaa' 'aaaaaaaaaa'
 'aaaaaaaaaaaa' 'aaaaaaaaaaaaaaaaaaa' 'aaaaaaaaaaaaaaaaaaaaa'
 'aaaaaaaaaaaaaaaaaaaaaaaaaaa' 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaheven'
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaagh' 'aaaaaaaaaaaaaaaaaaaaaaaaaamazoooon'
 'aaaaaaaaaaaaaaaaaaaaaaaaayyyyyyyyyyyyyyyiiiiiiiiiiiiiiiaaaaaaaaaahhhhhh'
 'aaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhh' 'aaaaaaaaaaaaaaaaaaaahal'
 'aaaaaaaaaaaaaaaaaaahhh' 'aaaaaaaaaaaaaaaaaaargh'
 'aaaaaaaaaaaaaaaaaahhhhhhhhhhhhh' 'aaaaaaaaaaaaaaaaahhhhhhh'
 'aaaaaaaaaaaaaaaaahhhhhhhhhhhhh' 'aaaaaaaaaaaaaaaagggggggggg'
 'aaaaaaaaaaaaaaaarh' 'aaaaaaaaaaaaaaagh' 'aaaaaaaaaaaaaaaplatoon'
 'aaaaaaaaaaaaaaargh' 'aaaaaaaaaaaaaaauuuuuu' 'aaaaaaaaaaaaaagh'
 'aaaaaaaaaaaaahhhhhhhhhhh' 'aaaaaaaaaaaaahit'
 'aaaaaaaaaaaaasssssssssssfrom' 'aaaaaaaaaaaahhhhhh'
 'aaaaaaaaaaaahhhhhhhhhhhhhhhhh' 'aaaaaaaaaaaaiiiieeeeee'
 'aaaaaaaaaaaarrrrrrrrgh' 'aaaaaaaaaaaawwwwwwwwwsssommmmeeecas'
 'aaaaaaaaaaahggggghhhoooooo

### 📍 TF-IDF Representation with TfidfVectorizer

In [9]:
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)
print("TF-IDF matrix shape:", tfidf_matrix.shape)

TF-IDF matrix shape: (3629444, 1759292)


In [10]:
print("Example terms (TF-IDF):", tfidf.get_feature_names_out()[:100])

Example terms (TF-IDF): ['aa' 'aaa' 'aaaa' 'aaaaa' 'aaaaaa' 'aaaaaaa' 'aaaaaaaa' 'aaaaaaaaaa'
 'aaaaaaaaaaaa' 'aaaaaaaaaaaaaaaaaaa' 'aaaaaaaaaaaaaaaaaaaaa'
 'aaaaaaaaaaaaaaaaaaaaaaaaaaa' 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaheven'
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaagh' 'aaaaaaaaaaaaaaaaaaaaaaaaaamazoooon'
 'aaaaaaaaaaaaaaaaaaaaaaaaayyyyyyyyyyyyyyyiiiiiiiiiiiiiiiaaaaaaaaaahhhhhh'
 'aaaaaaaaaaaaaaaaaaaaahhhhhhhhhhhhhh' 'aaaaaaaaaaaaaaaaaaaahal'
 'aaaaaaaaaaaaaaaaaaahhh' 'aaaaaaaaaaaaaaaaaaargh'
 'aaaaaaaaaaaaaaaaaahhhhhhhhhhhhh' 'aaaaaaaaaaaaaaaaahhhhhhh'
 'aaaaaaaaaaaaaaaaahhhhhhhhhhhhh' 'aaaaaaaaaaaaaaaagggggggggg'
 'aaaaaaaaaaaaaaaarh' 'aaaaaaaaaaaaaaagh' 'aaaaaaaaaaaaaaaplatoon'
 'aaaaaaaaaaaaaaargh' 'aaaaaaaaaaaaaaauuuuuu' 'aaaaaaaaaaaaaagh'
 'aaaaaaaaaaaaahhhhhhhhhhh' 'aaaaaaaaaaaaahit'
 'aaaaaaaaaaaaasssssssssssfrom' 'aaaaaaaaaaaahhhhhh'
 'aaaaaaaaaaaahhhhhhhhhhhhhhhhh' 'aaaaaaaaaaaaiiiieeeeee'
 'aaaaaaaaaaaarrrrrrrrgh' 'aaaaaaaaaaaawwwwwwwwwsssommmmeeecas'
 'aaaaaaaaaaahggggghhhooo

## 3) Key Term Extraction and Topic Modeling 🔍

**Objective: Use LDA to extract topics and key terms from the reviews.**

**Topic modeling with LDA (Latent Dirichlet Allocation)**: LDA is a generative modeling technique that assumes each document is a mixture of topics and each topic is a mixture of words. It helps uncover hidden topics in a collection of documents.

**Keyword extraction:** Methods such as term frequency and TF-IDF, and algorithms such as RAKE (Rapid Automatic Keyword Extraction), are used to identify words or phrases that capture the essence of a document.

In [11]:
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
# Fit a 5-topic LDA model on the raw term counts (LDA expects counts, not TF-IDF weights)
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(bow_matrix)

In [None]:
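def display_topics(model, feature_names, no_top_words):
    """Print the no_top_words highest-weighted terms of each topic."""
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % topic_idx)
        # argsort is ascending, so the reversed slice yields the top terms first
        top_indices = topic.argsort()[:-no_top_words - 1:-1]
        print(" ".join(feature_names[i] for i in top_indices))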

In [None]:
display_topics(lda, cv.get_feature_names_out(), 10)