
# Word2vec with Gensim

In this Jupyter notebook you will use the [Gensim](https://radimrehurek.com/gensim/index.html) library to experiment with word2vec. The notebook focuses on the intuition behind the concepts rather than on implementation details, and it is inspired by this [guide](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html).

To see what Word2Vec can do, we will download a pretrained model and play with it. We will use the Word2Vec model trained on a Google News dataset, which covers approximately 3 million words and phrases. Training a model like this can take hours, but since it is already available, downloading and loading it with Gensim takes only a few minutes.

**Important**

The model is approximately 2 GB, so you will need a decent network connection to continue.


## 1. Installing and loading the model

In [1]:
!pip install --upgrade gensim



In [2]:
import gensim.downloader as api

In [3]:
# Download the pretrained model (Gensim caches it locally after the first download)
model = api.load('word2vec-google-news-300')



In [4]:
model["king"].shape

(300,)

In [None]:
model["king"]

## 2. Word similarity

In this section we will see how to compute the similarity between two words using a pretrained word embedding.

In [6]:
model.similarity("king", "queen")

0.6510957

In [7]:
model.similarity("king", "man")

0.22942673

In [8]:
model.similarity("king", "potato")

0.09978465

In [9]:
model.similarity("king", "king")

1.0
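
`similarity` reports the cosine similarity between the two word vectors: 1.0 when they point in the same direction, and values near 0 when the words are unrelated. As a minimal sketch of that computation, the following uses made-up 3-dimensional vectors. The names `toy_king`, `toy_queen`, and `toy_potato` and their values are invented for illustration; the real vectors have 300 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, invented for this example (not taken from the model).
toy_king = [0.9, 0.8, 0.1]
toy_queen = [0.8, 0.9, 0.2]
toy_potato = [0.1, 0.0, 0.9]

print("king vs queen: ", cosine_similarity(toy_king, toy_queen))
print("king vs potato:", cosine_similarity(toy_king, toy_potato))
print("king vs king:  ", cosine_similarity(toy_king, toy_king))
```

Because the measure depends only on direction, comparing a vector with itself always gives 1.0 (up to floating-point error), which matches the `model.similarity("king", "king")` result above.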

Next, we will see how to find the words most similar to a given set of words.

In [10]:
model.most_similar(["king", "queen"], topn=5)

[('monarch', 0.7042067050933838),
 ('kings', 0.6780861616134644),
 ('princess', 0.6731551885604858),
 ('queens', 0.6679497957229614),
 ('prince', 0.6435247659683228)]

In [11]:
model.most_similar(["tomato", "carrot"], topn=5)

[('carrots', 0.7536594867706299),
 ('tomatoes', 0.7129638195037842),
 ('celery', 0.7025030851364136),
 ('broccoli', 0.6796350479125977),
 ('cherry_tomatoes', 0.662927508354187)]
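
When given several words, `most_similar` combines their vectors into a single query and ranks the rest of the vocabulary by cosine similarity to it, leaving out the query words themselves. A rough sketch of that idea, assuming a tiny hypothetical vocabulary (`toy_vectors` and `toy_most_similar` are invented here and are not part of Gensim):

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def toy_most_similar(vectors, query_words, topn=2):
    # Average the unit-length vectors of the query words.
    normed = [unit(vectors[w]) for w in query_words]
    mean = [sum(column) / len(normed) for column in zip(*normed)]
    # Score every other word against that average; skip the query words.
    candidates = [
        (word, cosine(mean, vec))
        for word, vec in vectors.items()
        if word not in query_words
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "monarch": [0.85, 0.85, 0.15],
    "potato": [0.1, 0.0, 0.9],
    "carrot": [0.0, 0.1, 0.95],
}

print(toy_most_similar(toy_vectors, ["king", "queen"], topn=2))
```

In this toy vocabulary "monarch" ranks first because its vector lies between those of "king" and "queen".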

You can also do other interesting things, such as finding the word that does not belong in a list.

In [12]:
model.doesnt_match(["summer", "fall", "spring", "air"])

'air'
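
One way to think about `doesnt_match`: average the vectors of all the words in the list and pick the word that is least similar to that average. The sketch below illustrates this with invented vectors; `toy_doesnt_match` and `toy_vectors` are hypothetical names, not Gensim functions.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_doesnt_match(vectors, words):
    # Normalize each word vector, then average them component-wise.
    normed = {w: unit(vectors[w]) for w in words}
    mean = [sum(column) / len(words) for column in zip(*normed.values())]

    def closeness(word):
        # Dot product with the average: lower means further from the group.
        return sum(a * b for a, b in zip(normed[word], mean))

    # The odd one out is the word least aligned with the average.
    return min(words, key=closeness)

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "summer": [0.9, 0.1, 0.2],
    "fall": [0.8, 0.2, 0.1],
    "spring": [0.85, 0.15, 0.25],
    "air": [0.1, 0.9, 0.3],
}

print(toy_doesnt_match(toy_vectors, ["summer", "fall", "spring", "air"]))
```

The three season vectors point in nearly the same direction, so they dominate the average and "air" ends up furthest from it.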

## Exercises

1. Use the word2vec model to rank the following 15 words by their similarity to the words "man" and "woman". For each pair, print its similarity.

In [13]:
words = [
"wife",
"husband",
"child",
"queen",
"king",
"man",
"woman",
"birth",
"doctor",
"nurse",
"teacher",
"professor",
"engineer",
"scientist",
"president"]


In [22]:
for word in words:
  print("Man vs", word, "similarity rate:", model.similarity("man", word))

Man vs wife similarity rate: 0.32920915
Man vs husband similarity rate: 0.34499747
Man vs child similarity rate: 0.31633338
Man vs queen similarity rate: 0.16658202
Man vs king similarity rate: 0.22942673
Man vs man similarity rate: 1.0
Man vs woman similarity rate: 0.76640123
Man vs birth similarity rate: 0.11078789
Man vs doctor similarity rate: 0.31448963
Man vs nurse similarity rate: 0.2547229
Man vs teacher similarity rate: 0.25000125
Man vs professor similarity rate: 0.09415862
Man vs engineer similarity rate: 0.15128928
Man vs scientist similarity rate: 0.15824963
Man vs president similarity rate: 0.028424604


In [23]:
for word in words:
  print("Woman vs", word, "similarity rate:", model.similarity("woman", word))

Woman vs wife similarity rate: 0.444824
Woman vs husband similarity rate: 0.49281383
Woman vs child similarity rate: 0.47500372
Woman vs queen similarity rate: 0.31618136
Woman vs king similarity rate: 0.12847973
Woman vs man similarity rate: 0.76640123
Woman vs woman similarity rate: 1.0
Woman vs birth similarity rate: 0.21471293
Woman vs doctor similarity rate: 0.37945858
Woman vs nurse similarity rate: 0.44135594
Woman vs teacher similarity rate: 0.31357846
Woman vs professor similarity rate: 0.13077852
Woman vs engineer similarity rate: 0.09435377
Woman vs scientist similarity rate: 0.15486898
Woman vs president similarity rate: 0.062676705
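
The loops above print the similarities in list order. To turn them into an actual ranking, the words can be sorted by score in descending order. The sketch below does this with the values printed above, rounded to four decimals; `rank`, `man_scores`, and `woman_scores` are names chosen for this example.

```python
# Scores copied from the printed output above, rounded to four decimals.
man_scores = {
    "wife": 0.3292, "husband": 0.3450, "child": 0.3163, "queen": 0.1666,
    "king": 0.2294, "man": 1.0, "woman": 0.7664, "birth": 0.1108,
    "doctor": 0.3145, "nurse": 0.2547, "teacher": 0.2500, "professor": 0.0942,
    "engineer": 0.1513, "scientist": 0.1582, "president": 0.0284,
}
woman_scores = {
    "wife": 0.4448, "husband": 0.4928, "child": 0.4750, "queen": 0.3162,
    "king": 0.1285, "man": 0.7664, "woman": 1.0, "birth": 0.2147,
    "doctor": 0.3795, "nurse": 0.4414, "teacher": 0.3136, "professor": 0.1308,
    "engineer": 0.0944, "scientist": 0.1549, "president": 0.0627,
}

def rank(scores):
    """Return (word, score) pairs sorted from most to least similar."""
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

for label, scores in (("man", man_scores), ("woman", woman_scores)):
    print(f"Ranking by similarity to '{label}':")
    for position, (word, score) in enumerate(rank(scores), start=1):
        print(f"{position:2d}. {word:<10} {score:.4f}")
```

With the model loaded, the same ordering could be obtained directly with `sorted(words, key=lambda w: model.similarity("man", w), reverse=True)`.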



In [25]:
import pandas as pd

# Initialize empty lists to store the data
word_list = []
similarity_woman_list = []
similarity_man_list = []


# Iterate through the words and compute the similarity to "woman" and to "man"
for word in words:
    similarity_woman = model.similarity("woman", word)
    similarity_man = model.similarity("man", word)  # singular "man", as the exercise asks
    word_list.append(word)
    similarity_woman_list.append(similarity_woman)
    similarity_man_list.append(similarity_man)


# Create a DataFrame from the lists
data = {'Word': word_list, 'Similarity_woman': similarity_woman_list, 'Similarity_man': similarity_man_list}
df = pd.DataFrame(data)
df

|    | Word      | Similarity_woman | Similarity_man |
|----|-----------|------------------|----------------|
| 0  | wife      | 0.444824         | 0.329209       |
| 1  | husband   | 0.492814         | 0.344997       |
| 2  | child     | 0.475004         | 0.316333       |
| 3  | queen     | 0.316181         | 0.166582       |
| 4  | king      | 0.128480         | 0.229427       |
| 5  | man       | 0.766401         | 1.000000       |
| 6  | woman     | 1.000000         | 0.766401       |
| 7  | birth     | 0.214713         | 0.110788       |
| 8  | doctor    | 0.379459         | 0.314490       |
| 9  | nurse     | 0.441356         | 0.254723       |
| 10 | teacher   | 0.313578         | 0.250001       |
| 11 | professor | 0.130779         | 0.094159       |
| 12 | engineer  | 0.094354         | 0.151289       |
| 13 | scientist | 0.154869         | 0.158250       |
| 14 | president | 0.062677         | 0.028425       |


**2. Complete the following analogies on your own (without using the model)**

a. king is to throne as judge is to _

b. USA is to burger as Mexico is to _

c. French is to France as Spaniard is to _

d. bad is to good as sad is to _

e. nurse is to hospital as teacher is to _

f. universe is to planet as house is to _

**3. Now complete the analogies using a word2vec model**

Here is an example of how to do it. You can solve analogies such as "A is to B as C is to _" by computing C + B - A.

In [26]:
# man is to woman as king is to ___?
model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)

[('queen', 0.7118193507194519)]
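
The arithmetic C + B - A can be illustrated without the model. The following sketch uses invented 3-dimensional vectors whose axes loosely stand for royalty, maleness, and femaleness; the vectors and the `toy_analogy` helper are assumptions made for this example. Gensim's `most_similar` also normalizes the vectors to unit length before combining them, a step this sketch skips for simplicity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_analogy(vectors, a, b, c):
    """Solve 'a is to b as c is to ?' by ranking words against c + b - a."""
    target = [vc + vb - va for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words, as they would otherwise rank near the top.
    candidates = [w for w in vectors if w not in (a, b, c)]
    best = max(candidates, key=lambda w: cosine(target, vectors[w]))
    return best, cosine(target, vectors[best])

# Hypothetical vectors: [royalty, maleness, femaleness]. Invented values.
toy_vectors = {
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "prince": [0.7, 0.8, 0.1],
    "potato": [0.0, 0.2, 0.2],
}

# man is to woman as king is to ___?
print(toy_analogy(toy_vectors, "man", "woman", "king"))
```

Subtracting "man" removes the maleness component from "king", and adding "woman" puts the femaleness component in its place, so the result lands next to "queen".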

In [27]:
# USA is to burger as Mexico is to ___?
model.most_similar(positive=["Mexico", "burger"], negative=["USA"], topn=1)

[('taco', 0.6266060471534729)]

In [28]:
# nurse is to hospital as teacher is to ___?
model.most_similar(positive=["teacher", "hospital"], negative=["nurse"], topn=1)

[('school', 0.60170978307724)]

In [29]:
# king is to throne as judge is to ___?
model.most_similar(positive=["judge", "throne"], negative=["king"], topn=1)

[('appellate_court', 0.5845253467559814)]