In [1]:
import pandas as pd

In [2]:
path = "/home/juanbetancur/analisis_datos_universidad/evento_evaluativo_4/ejercicio_3/amazon_review_lemmatized_sampled.parquet"
df = pd.read_parquet(path, engine="pyarrow")
df.head()

Unnamed: 0,rating,clean_title,clean_review,clean_review_stemming,clean_review_lemmatization
0,1,useless junk,i thought this would be a nifty gadget safer t...,thought would nifti gadget safer knife easier ...,think would nifty gadget safe knife easy use s...
1,1,poor quality cord light,i purchased one of these in the length and af...,purchas one length year light duti home garag ...,purchase one length year light duty home g...
2,1,i would give it no stars if i could,i bought this book looking for fun things to d...,bought book look fun thing date serious doubt ...,buy book look fun thing date seriously doubt a...
3,1,this program is flawed,this program has a lot of bugs in it it has th...,program lot bug tendenc crash system addit ans...,program lot bug tendency crash system addition...
4,1,sending it back,i too was disappointed in my set i thought it ...,disappoint set thought would wood open saw car...,disappoint set think would wood open see cardb...


In [3]:
df.isnull().sum()

rating                        0
clean_title                   0
clean_review                  0
clean_review_stemming         0
clean_review_lemmatization    0
dtype: int64

In [4]:
# Drop the raw 'clean_review' and 'clean_title' columns from the DataFrame in place
df.drop(columns=['clean_review', 'clean_title'], inplace=True)

# (Optional) force a garbage-collection pass to free memory
import gc
gc.collect()


0

## Vector Representation: Bag-of-Words and TF-IDF

**Objective: Convert the reviews into numerical representations using Bag-of-Words and TF-IDF for later analysis.**


**Bag-of-Words (BoW):** A technique that converts text into a numerical representation by counting how often each word occurs in a document, ignoring word order and grammar.

Each document is represented as a vector in which every dimension corresponds to one word of the vocabulary and the value is that word's frequency in the document.
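
As a minimal sketch of that idea, the snippet below builds count vectors by hand for a tiny, made-up corpus (the three strings are hypothetical and not taken from the review dataset). It splits on whitespace only, which is a simplification compared with the tokenization `CountVectorizer` applies.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the review dataset)
docs = ["good book good stori", "bad book", "good movi"]

# Vocabulary: sorted unique tokens across all documents
vocab = sorted({token for doc in docs for token in doc.split()})

# One count vector per document, aligned with the vocabulary order
bow = []
for doc in docs:
    counts = Counter(doc.split())  # missing tokens count as 0
    bow.append([counts[token] for token in vocab])

print(vocab)
for row in bow:
    print(row)
```

Each printed row has one entry per vocabulary term, so the first document gets a 2 in the position of `good` and a 0 for every term it does not contain.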

**TF-IDF (Term Frequency-Inverse Document Frequency):** This technique refines the BoW representation by weighting each word's frequency by its importance in the corpus.

It multiplies a word's frequency within a document (TF) by its inverse document frequency (IDF), a factor that shrinks as the word appears in more documents of the corpus. This reduces the influence of common words and highlights more informative terms.
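
The weighting can be sketched by hand on a toy corpus. The example below assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which is the variant scikit-learn's `TfidfVectorizer` applies with its default settings. The corpus and the names `df_counts`, `idf` and `tfidf_vector` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the review dataset)
docs = ["good book good stori", "bad book", "good movi"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df_counts = Counter(token for tokens in tokenized for token in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf_vector(tokens):
    # Raw term counts times IDF, then scaled to unit L2 norm
    tf = Counter(tokens)
    weights = {term: count * idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document `book` and `stori` both occur once, yet `stori` receives the higher weight because it appears in fewer documents of the corpus.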


In [5]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [6]:
corpus = df["clean_review_stemming"].tolist()

### 📍 Bag-of-Words Representation with CountVectorizer

In [7]:
cv = CountVectorizer()
bow_matrix = cv.fit_transform(corpus)
print("Bag-of-Words matrix shape:", bow_matrix.shape)

Bag-of-Words matrix shape: (362944, 339264)


In [9]:
print("Sample terms (BoW):", cv.get_feature_names_out()[:100])

Sample terms (BoW): ['aa' 'aaa' 'aaaa' 'aaaaa' 'aaaaaaaaaaaa'
 'aaaaaaaaaaaaaaaaaaaaaaaaaamazoooon'
 'aaaaaaaaaaaaaaaaaaaaaaaaayyyyyyyyyyyyyyyiiiiiiiiiiiiiiiaaaaaaaaaahhhhhh'
 'aaaaaaaaaaaaaaaarh' 'aaaaaaaaaaaarrrrrrrrgh' 'aaaaaaaaahhh' 'aaaaaahhh'
 'aaaaaahhhhh' 'aaaaaahhhhhhhahahahahhahahahathi' 'aaaaahahaha'
 'aaaaahhhhh' 'aaaaand' 'aaaaaon' 'aaaaarrrrrggggggghhhhh' 'aaaah'
 'aaaahhh' 'aaaahhhh' 'aaaahhhhh' 'aaaahhhhhhh' 'aaaalllllll' 'aaaargh'
 'aaaarrrggghh' 'aaaawooooooooooooow' 'aaagghh' 'aaagh' 'aaaghhhh' 'aaah'
 'aaahhh' 'aaahhhhhhhh' 'aaahth' 'aaand' 'aaaprogram' 'aaaremot' 'aaargh'
 'aaarrrgggghh' 'aaarrrggghhhyou' 'aaawwwsbut' 'aaberg' 'aabook' 'aac'
 'aacbbc' 'aaccept' 'aachen' 'aack' 'aad' 'aadd' 'aadland' 'aadp' 'aae'
 'aaearo' 'aaer' 'aaf' 'aafair' 'aafter' 'aagain' 'aagghhh' 'aago' 'aah'
 'aaha' 'aahh' 'aahhh' 'aahhhh' 'aahhhhh' 'aahhhhhhhhhhh' 'aahz' 'aahzzzz'
 'aai' 'aain' 'aaition' 'aaj' 'aak' 'aaker' 'aakhir' 'aaland' 'aalash'
 'aaliah' 'aaliyah' 'aaliyahi'

### 📍 TF-IDF Representation with TfidfVectorizer

In [10]:
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)
print("TF-IDF matrix shape:", tfidf_matrix.shape)

TF-IDF matrix shape: (362944, 339264)


In [11]:
print("Sample terms (TF-IDF):", tfidf.get_feature_names_out()[:100])

Sample terms (TF-IDF): ['aa' 'aaa' 'aaaa' 'aaaaa' 'aaaaaaaaaaaa'
 'aaaaaaaaaaaaaaaaaaaaaaaaaamazoooon'
 'aaaaaaaaaaaaaaaaaaaaaaaaayyyyyyyyyyyyyyyiiiiiiiiiiiiiiiaaaaaaaaaahhhhhh'
 'aaaaaaaaaaaaaaaarh' 'aaaaaaaaaaaarrrrrrrrgh' 'aaaaaaaaahhh' 'aaaaaahhh'
 'aaaaaahhhhh' 'aaaaaahhhhhhhahahahahhahahahathi' 'aaaaahahaha'
 'aaaaahhhhh' 'aaaaand' 'aaaaaon' 'aaaaarrrrrggggggghhhhh' 'aaaah'
 'aaaahhh' 'aaaahhhh' 'aaaahhhhh' 'aaaahhhhhhh' 'aaaalllllll' 'aaaargh'
 'aaaarrrggghh' 'aaaawooooooooooooow' 'aaagghh' 'aaagh' 'aaaghhhh' 'aaah'
 'aaahhh' 'aaahhhhhhhh' 'aaahth' 'aaand' 'aaaprogram' 'aaaremot' 'aaargh'
 'aaarrrgggghh' 'aaarrrggghhhyou' 'aaawwwsbut' 'aaberg' 'aabook' 'aac'
 'aacbbc' 'aaccept' 'aachen' 'aack' 'aad' 'aadd' 'aadland' 'aadp' 'aae'
 'aaearo' 'aaer' 'aaf' 'aafair' 'aafter' 'aagain' 'aagghhh' 'aago' 'aah'
 'aaha' 'aahh' 'aahhh' 'aahhhh' 'aahhhhh' 'aahhhhhhhhhhh' 'aahz' 'aahzzzz'
 'aai' 'aain' 'aaition' 'aaj' 'aak' 'aaker' 'aakhir' 'aaland' 'aalash'
 'aaliah' 'aaliyah' 'aaliya

## 3) Key Term Extraction and Topic Modeling 🔍

**Objective: Use LDA to extract topics and key terms from the reviews.**

**Topic modeling with LDA (Latent Dirichlet Allocation)**: LDA is a generative modeling technique that assumes each document is a mixture of topics and each topic is a mixture of words. It helps uncover hidden topics in a collection of documents.

**Keyword extraction:** Methods such as term frequency and TF-IDF, and algorithms such as RAKE (Rapid Automatic Keyword Extraction), are used to identify the words or phrases that capture the essence of a document.

In [12]:
from sklearn.decomposition import LatentDirichletAllocation

In [13]:
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(bow_matrix)

Parameter,Value
n_components,5
doc_topic_prior,None
topic_word_prior,None
learning_method,'batch'
learning_decay,0.7
learning_offset,10.0
max_iter,10
batch_size,128
evaluate_every,-1
total_samples,1000000.0


In [14]:
def display_topics(model, feature_names, no_top_words):
    """Print the highest-weighted terms of each topic in a fitted model."""
    for topic_idx, topic in enumerate(model.components_):
        # argsort is ascending, so reverse it and keep the first no_top_words indices
        top_indices = topic.argsort()[::-1][:no_top_words]
        print(f"Topic {topic_idx}:")
        print(" ".join(feature_names[i] for i in top_indices))

In [15]:
display_topics(lda, cv.get_feature_names_out(), 10)

Topic 0:
album song cd music like one sound good listen great
Topic 1:
movi watch one dvd good film would great time like
Topic 2:
game like movi play one film get good make fun
Topic 3:
use work one get product would like time great good
Topic 4:
book read stori one like good would charact time author


In [16]:
df.to_parquet("/home/juanbetancur/analisis_datos_universidad/evento_evaluativo_4/ejercicio_3/v2_amazon_review_lemmatized_sampled.parquet", engine="pyarrow")