@author: Octavio Gutiérrez, Código Máquina

Channel URL: https://www.youtube.com/CodigoMaquina

Video URL: https://youtu.be/_yQ0N83LXyg

# Installing Gensim:
## Topic modelling for humans

In [1]:
# Restart the Colab session after installing gensim
!pip install --upgrade gensim



# Import Modules

In [2]:
from gensim.models import Word2Vec
import pandas as pd
from time import time

# Load the Normalized Corpus

In [3]:
def carga_csv(nombre_archivo):
  """Read one dialogue per line and return a list of token lists."""
  lista = []
  with open(nombre_archivo) as dialogos:
    for dialogo in dialogos:
      # Each line holds the comma-separated tokens of one normalized dialogue
      lista.append(dialogo.strip().split(","))
  return lista

# Download the dialogues from the Código Máquina repository
!wget https://raw.githubusercontent.com/CodigoMaquina/code/main/datos/dialogos_simpsons.csv

# Load the dialogues
dialogos = carga_csv("dialogos_simpsons.csv")

# Example of an original phrase before normalization
original = "No, actually, it was a little of both. Sometimes when a disease is in all the magazines and all the news shows, it's only natural that you think you have it."

print("PHRASE:", original, "\n\n", "NORMALIZED:", str(dialogos[0]))

--2025-05-25 19:29:10--  https://raw.githubusercontent.com/CodigoMaquina/code/main/datos/dialogos_simpsons.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3943400 (3.8M) [text/plain]
Saving to: ‘dialogos_simpsons.csv’


2025-05-25 19:29:11 (10.4 MB/s) - ‘dialogos_simpsons.csv’ saved [3943400/3943400]

PHRASE: No, actually, it was a little of both. Sometimes when a disease is in all the magazines and all the news shows, it's only natural that you think you have it. 

 NORMALIZED: ['actually', 'little', 'sometimes', 'disease', 'magazine', 'news', 'show', 'natural', 'think']
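
To make the expected file layout concrete, the sketch below writes a two-line file in the same shape (one dialogue per line, tokens separated by commas) and reads it back the way `carga_csv` does. The helper name `load_tokens`, the sample lines, and the temporary file are assumptions made for illustration; they are not part of the original notebook or dataset.

```python
import os
import tempfile

def load_tokens(path):
    """Read one dialogue per line and split it into tokens on commas."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip().split(",") for line in handle]

# Hypothetical two-line sample in the same layout as dialogos_simpsons.csv
sample = "actually,little,sometimes\nthink,natural\n"

# Write the sample to a temporary file, load it, then clean up
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

sample_dialogues = load_tokens(sample_path)
os.remove(sample_path)
print(sample_dialogues)
```

Note that an empty line would come back as `['']`, so it may be worth filtering such rows out if the file contains any.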


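# Preparing the Word2Vec Model

In [4]:
w2v_model = Word2Vec(min_count=20,      # ignore words that appear fewer than 20 times
                     window=3,          # maximum distance between target and context word
                     vector_size=300,   # dimensionality of the word vectors
                     alpha=0.03,        # initial learning rate
                     min_alpha=0.0008)  # learning rate reached at the end of training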
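
`window=3` limits how far, in tokens, a context word may be from the target word. The sketch below lists the (target, context) pairs that a fixed window of 3 would produce for the first tokens of the normalized phrase shown earlier. It is a simplified illustration: `context_pairs` is a hypothetical helper, and Gensim by default also shrinks the effective window at random for each target word, which this sketch does not reproduce.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["actually", "little", "sometimes", "disease", "magazine"]
pairs = context_pairs(tokens, window=3)

# Show the context words seen by the first token
for target, context in pairs:
    if target == "actually":
        print(target, "->", context)
```

With a window of 3, `actually` is paired with `little`, `sometimes`, and `disease`, but not with `magazine`, which is four positions away.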

# Building and Printing the Vocabulary

In [5]:
w2v_model.build_vocab(dialogos)

print("First elements of the vocabulary:")
for i in range(10):
  print(w2v_model.wv.index_to_key[i])

First elements of the vocabulary:
get
go
well
oh
know
like
one
want
hey
make
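
`build_vocab` counts every token in the corpus, drops those seen fewer than `min_count` times, and by default orders the surviving words from most to least frequent, which is why common words such as `get` and `go` head the list. A rough sketch of that filtering and ordering with `collections.Counter` follows. The toy corpus, the threshold of 2 (instead of 20), and the helper `build_toy_vocab` are assumptions made for illustration.

```python
from collections import Counter

def build_toy_vocab(sentences, min_count):
    """Count tokens, drop the rare ones, and sort by descending frequency."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = [(token, n) for token, n in counts.items() if n >= min_count]
    # Highest frequency first; ties are broken alphabetically here,
    # which is a choice of this sketch rather than Gensim's behavior.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [token for token, _ in kept]

# Made-up corpus: "get" appears 3 times, "go" twice, everything else once
toy_corpus = [
    ["get", "donut", "go"],
    ["go", "get", "home"],
    ["get", "well"],
]
toy_vocab = build_toy_vocab(toy_corpus, min_count=2)
print(toy_vocab)
```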


# Training the Word2Vec Model

In [6]:
inicio = time()
w2v_model.train(dialogos, total_examples=w2v_model.corpus_count, epochs=200)
print("Training time (min):", (time() - inicio)/60)

Training time (min): 3.5568908214569093
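
During training the learning rate is not constant: Gensim lowers it linearly from `alpha` (0.03 here) towards `min_alpha` (0.0008) as training progresses. The sketch below approximates that schedule at epoch granularity for the 200 epochs used above. Gensim itself adjusts the rate as batches of work are processed, so the real values differ in detail, and `learning_rate` is a hypothetical helper.

```python
def learning_rate(epoch, epochs, alpha, min_alpha):
    """Linearly interpolate from alpha (start) to min_alpha (end of training)."""
    progress = epoch / epochs  # 0.0 at the start, 1.0 at the end
    return alpha - (alpha - min_alpha) * progress

# Print the approximate learning rate at a few points of the 200 epochs
for epoch in (0, 50, 100, 150, 200):
    print(epoch, round(learning_rate(epoch, 200, 0.03, 0.0008), 5))
```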


# Printing the Word2Vec Vectors

In [7]:
vectores = {}
for token in w2v_model.wv.key_to_index:
  vectores[token] = w2v_model.wv.get_vector(token)
pd.DataFrame(vectores).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
get,-0.341634,-0.147085,0.452184,0.165629,-0.045593,-0.806689,-0.179760,0.554167,-0.181399,0.121413,...,-0.177341,-0.190816,-0.130059,0.012598,-0.054698,0.143954,0.502874,0.827890,0.398336,-0.271156
go,-0.250801,0.348151,0.534864,0.156522,0.625729,-0.875020,-0.305177,0.659000,0.406319,0.276048,...,0.161218,-0.013050,0.012121,0.197311,0.010109,-0.140635,0.359158,0.762844,0.342202,-0.630401
well,-0.430355,0.989587,-0.302183,-0.099498,0.270420,-0.545650,-0.386592,0.377178,0.120030,0.269556,...,0.382344,-1.066782,0.577887,0.160749,0.468509,0.043186,0.667267,0.735951,-0.303049,-0.907901
oh,-0.143332,0.738958,-0.601170,0.011486,-0.636087,-0.265921,-0.184024,0.602027,-0.433004,-0.136089,...,0.420697,-0.238304,0.175771,-0.082627,-0.056166,0.002163,0.282298,-0.041295,0.225549,-1.057170
know,0.119214,0.460971,-0.268641,-0.496179,0.506486,-0.711128,-0.254007,0.338830,0.358949,0.060950,...,0.342039,-0.577183,0.278252,-0.061161,0.197765,0.341048,0.715658,0.805424,-0.239793,-1.477056
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
axe,0.221762,-0.785455,0.457117,-0.719445,0.296938,0.133906,-1.109314,1.398572,-1.354237,-0.012611,...,-0.305576,0.519550,0.095241,0.384838,-1.219508,1.763072,0.431381,-0.256670,-0.637031,-0.008623
creation,0.838550,-2.262791,-0.200947,-0.389770,0.130256,0.404359,-0.968281,-1.127842,-0.587720,1.191573,...,0.075675,-0.903077,-1.510179,1.835596,-0.865057,0.192526,-0.012474,-0.264978,-0.261964,-0.029345
goody,1.303053,-0.817680,-0.216581,-0.444975,-0.006102,0.042719,-0.827180,-0.735294,0.851971,-0.399054,...,0.538625,-0.110128,-1.306515,-0.851550,1.492288,-1.451139,-0.921094,-0.339436,-0.729292,0.118068
sec,0.316105,-1.045335,1.698271,-0.094029,0.721454,0.763765,0.334262,0.128074,-0.640058,-0.842532,...,1.262637,0.848095,0.561692,-0.208138,-0.061811,1.385835,1.017699,-0.871946,-0.395905,0.088709


# Printing Homer's Vector

In [8]:
w2v_model.wv.get_vector("homer")

array([ 0.21551219,  0.7324492 , -0.09772965,  0.06095229,  0.0999118 ,
        0.17195983, -0.42796835,  0.73555726, -0.43890068, -0.15403172,
        0.46705943,  0.25333604, -0.12723036,  0.11804603, -0.09819105,
        0.3688238 ,  0.78898716, -0.4697144 , -0.17905807, -0.30443656,
       -0.02638143, -0.10123169,  0.50068814,  0.53991574,  0.01498612,
       -0.28858772, -0.08235972,  0.6069229 , -0.04642946,  0.14377058,
       -0.09737492, -0.12917532,  0.18688703,  0.00200988, -0.26855507,
        0.43361855, -0.60166234, -0.1140897 , -0.15934493, -0.47394174,
       -0.44271696,  0.16210704,  0.4093245 ,  0.05178909,  0.6845876 ,
        0.26417124, -0.13741554, -0.12935393,  0.442258  ,  0.367383  ,
        0.02501332, -0.02682321,  0.51590544, -0.5147172 , -0.5206722 ,
       -0.3135782 , -0.01828674, -0.16872741,  0.30188763, -0.14667925,
        0.47589234, -0.36216196, -0.6799591 ,  0.58322865, -0.24766643,
        0.365898  ,  0.41958594, -0.21645059,  0.15736462, -0.10

# Most Similar Vectors/Words for the Main Characters

In [9]:
personajes = ["homer", "marge", "bart", "lisa", "maggie"]
for personaje in personajes:
  for similitud in w2v_model.wv.most_similar(positive=[personaje], topn=2):
    print(personaje, similitud)
  print("")

homer ('marge', 0.5101196765899658)
homer ('dad', 0.39377567172050476)

marge ('homer', 0.5101196765899658)
marge ('homie', 0.39671674370765686)

bart ('lisa', 0.5017989873886108)
bart ('dad', 0.48083242774009705)

lisa ('bart', 0.5017989873886108)
lisa ('honey', 0.3668513000011444)

maggie ('bart', 0.3305065929889679)
maggie ('baby', 0.2943394184112549)
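
The scores returned by `most_similar` are cosine similarities between word vectors, which is why `homer`/`marge` and `marge`/`homer` share the same value. A minimal sketch of that ranking in plain Python follows. The three-dimensional vectors are invented toy values, not taken from the trained model, and `top_similar` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_similar(word, vectors, topn):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:topn]

# Toy vectors (made up for illustration only)
toy_vectors = {
    "homer": [1.0, 0.2, 0.0],
    "marge": [0.9, 0.4, 0.1],
    "donut": [0.1, 0.0, 1.0],
}
ranking = top_similar("homer", toy_vectors, topn=2)
print(ranking)
```

In this toy setup `marge` ranks first for `homer` because the two vectors point in nearly the same direction, regardless of their lengths.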



# Similarity between Bart and Other Characters

In [10]:
principal = "bart"
personajes = ["lisa", "milhouse", "homer", "marge", "nelson"]

for personaje in personajes:
  print("Similarity between", principal, "and", personaje, "=",
        w2v_model.wv.similarity(principal, personaje))

Similarity between bart and lisa = 0.501799
Similarity between bart and milhouse = 0.33638844
Similarity between bart and homer = 0.31096557
Similarity between bart and marge = 0.2752386
Similarity between bart and nelson = 0.23447007


# Which Character Does Not Belong to the Group

In [11]:
grupo = ["homer", "patty", "selma"]
print("The odd one out of the group", grupo, "is", w2v_model.wv.doesnt_match(grupo))

The odd one out of the group ['homer', 'patty', 'selma'] is homer
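
Roughly speaking, and as far as Gensim's implementation can be summarized here, `doesnt_match` scales each word vector to unit length, averages them, and reports the word whose vector is least similar to that average. The sketch below mirrors that idea with invented two-dimensional vectors. `odd_one_out` is a hypothetical helper, not part of Gensim.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word least similar to the mean of the group's unit vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(units[w][d] for w in words) / len(words)
                 for d in range(dims)])
    # The dot product of two unit vectors equals their cosine similarity
    scores = {w: sum(a * b for a, b in zip(units[w], mean)) for w in words}
    return min(scores, key=scores.get)

# Toy vectors (made up): patty and selma point the same way, homer does not
toy_vectors = {
    "homer": [1.0, 0.0],
    "patty": [0.1, 1.0],
    "selma": [0.2, 1.0],
}
print(odd_one_out(["homer", "patty", "selma"], toy_vectors))
```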


# Operations with Vectors


# Bart - funny = Lisa

In [12]:
w2v_model.wv.most_similar(positive=["bart"], negative=["funny"], topn=1)[0][0]

'lisa'

# Lisa - intelligent = Bart

In [13]:
w2v_model.wv.most_similar(positive=["lisa"], negative=["intelligent"], topn=1)[0][0]

'bart'

# Homer - donut = Marge

In [14]:
w2v_model.wv.most_similar(positive=["homer"], negative=["donut"], topn=1)[0][0]

'marge'

# Bart + adult = Dad

In [15]:
w2v_model.wv.most_similar(positive=["bart", "adult"], topn=1)[0][0]

'dad'

# Homer - woman = Marge

In [16]:
w2v_model.wv.most_similar(positive=["homer"], negative=["woman"], topn=1)[0][0]

'marge'
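
With both `positive` and `negative` words, `most_similar` combines the unit-length vectors (adding the positive ones, subtracting the negative ones) and ranks the rest of the vocabulary by cosine similarity to the result. The words passed in are excluded from the ranking, so `bart - funny` can never return `bart` itself. A result such as `lisa` may therefore partly reflect that it is already the nearest neighbour of `bart`. A minimal sketch with invented toy vectors, where `analogy` is a hypothetical helper:

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), inputs excluded."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    target = unit(target)
    # Input words are never returned as answers
    candidates = [w for w in vectors if w not in positive and w not in negative]
    # The dot product of unit vectors gives the cosine similarity
    return max(candidates,
               key=lambda w: sum(a * b for a, b in zip(unit(vectors[w]), target)))

# Toy vectors (made up for illustration only)
toy_vectors = {
    "bart":  [1.0, 1.0],
    "funny": [0.0, 1.0],
    "lisa":  [1.0, 0.2],
    "donut": [-1.0, 0.5],
}
print(analogy(["bart"], ["funny"], toy_vectors))
```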

# References

### Normalized data set (Gutiérrez-García, 2025):

Gutiérrez-García, J.O. [Código Máquina]. (2025). Diálogos Normalizados de los Simpsons [Data set]. https://github.com/CodigoMaquina/code/blob/main/datos/dialogos_simpsons.csv

### Non-normalized data set (Ambarish, 2025):

Ambarish "BUKUN" (2025). Fun in Text Mining with Simpsons [Data set]. Kaggle. https://www.kaggle.com/ambarish/fun-in-text-mining-with-simpsons/data

### Megret (2025) is the author of the original idea for the Word2Vec exercise with the Simpsons:

Megret, P. (2025). Gensim Word2Vec tutorial [Notebook]. Kaggle. https://www.kaggle.com/code/pierremegret/gensim-word2vec-tutorial