<a href="https://colab.research.google.com/github/Paolino1994/NLP-CEIA-Fiuba/blob/main/TP1/1a_word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/FIUBA-Posgrado-Inteligencia-Artificial/procesamiento_lenguaje_natural/raw/main/logoFIUBA.jpg" width="500" align="center">


# Natural Language Processing
## Word2vec


In [54]:
import numpy as np
import pandas as pd

In [2]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * (np.linalg.norm(b)))
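
As a quick sanity check on the function above, parallel vectors should score 1 and orthogonal vectors should score 0. The sketch below redefines the function so that it runs on its own; the example vectors are made up for illustration and are not part of the exercise.

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the Euclidean norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Illustrative vectors (assumptions, not taken from the corpus)
parallel = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(parallel, orthogonal)
```

Note that the function is undefined for an all-zero vector, since the denominator becomes zero.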

### Data

In [3]:
corpus = np.array(['que dia es hoy', 'martes el dia de hoy es martes', 'martes muchas gracias'])

Document 1 --> que dia es hoy \
Document 2 --> martes el dia de hoy es martes \
Document 3 --> martes muchas gracias

### 1 - Get the corpus vocabulary (the terms used)
- Turn each document into a list of terms
- Build a vector of the unique terms across all documents

In [16]:
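# Build the vocabulary as a list of unique terms, in order of first appearance.
# A plain list avoids comparing strings against an empty float array.
corpusFinal=[]
for phrase in corpus:
  for word in phrase.split(" "):
    if word not in corpusFinal:
      corpusFinal.append(word)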



In [53]:
# Map each term to its column index
corpusDict={word:i for i,word in enumerate(corpusFinal)}
corpusDict

{'que': 0,
 'dia': 1,
 'es': 2,
 'hoy': 3,
 'martes': 4,
 'el': 5,
 'de': 6,
 'muchas': 7,
 'gracias': 8}
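
The same vocabulary and index mapping can also be built without an explicit membership test. The sketch below is one possible alternative, not the notebook's method: it relies on `dict.fromkeys` keeping the first occurrence of each key in insertion order. The names `vocab` and `word_to_index` are chosen here for illustration.

```python
corpus = ['que dia es hoy', 'martes el dia de hoy es martes', 'martes muchas gracias']

# dict.fromkeys drops repeated terms while preserving first-appearance order
vocab = list(dict.fromkeys(word for doc in corpus for word in doc.split()))

# Map each term to its position in the vocabulary
word_to_index = {word: i for i, word in enumerate(vocab)}
print(word_to_index)
```

Each lookup in a dict is constant time on average, so this avoids rescanning the vocabulary for every word.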

### 2 - One-hot encoding
Given a list of texts, return a matrix with their one-hot encoding.

In [30]:
OneHot=np.zeros((len(corpus),len(corpusFinal)))
for i,phrase in enumerate(corpus):
  for word in phrase.split(" "):
    OneHot[i][corpusDict[word]]=1

In [84]:
OneHot

array([[1., 1., 1., 1., 0., 0., 0., 0., 0.],
       [0., 1., 1., 1., 1., 1., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 1., 1.]])

In [72]:
pd.DataFrame(OneHot,columns=corpusDict.keys())

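|   | que | dia | es | hoy | martes | el | de | muchas | gracias |
|---|-----|-----|----|-----|--------|----|----|--------|---------|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |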


### 3 - Frequency vectors
Given a list of texts, return a matrix with their term-frequency representation.

In [38]:
freqVecs=np.zeros((len(corpus),len(corpusFinal)))
for i,phrase in enumerate(corpus):
  for word in phrase.split(" "):
    freqVecs[i][corpusDict[word]]+=1

In [85]:
freqVecs

array([[1., 1., 1., 1., 0., 0., 0., 0., 0.],
       [0., 1., 1., 1., 2., 1., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 1., 1.]])

In [71]:
pd.DataFrame(freqVecs,columns=corpusDict.keys())

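|   | que | dia | es | hoy | martes | el | de | muchas | gracias |
|---|-----|-----|----|-----|--------|----|----|--------|---------|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |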
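
The frequency and one-hot matrices are closely related: the one-hot matrix is just the frequency matrix with every nonzero count clipped to 1. The sketch below shows one way to get both using `collections.Counter`. The helper `count_vector` and the variable names are hypothetical, introduced only for this example.

```python
from collections import Counter
import numpy as np

corpus = ['que dia es hoy', 'martes el dia de hoy es martes', 'martes muchas gracias']
vocab = list(dict.fromkeys(word for doc in corpus for word in doc.split()))

def count_vector(doc, vocab):
    # Counter returns 0 for terms that do not occur in the document
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

# One row per document, one column per vocabulary term
freq = np.array([count_vector(doc, vocab) for doc in corpus], dtype=float)

# One-hot (presence/absence) matrix derived from the counts
one_hot = (freq > 0).astype(float)
print(freq)
print(one_hot)
```

The only cell where the two matrices differ for this corpus is "martes" in the second document, which occurs twice.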


### 4 - TF-IDF
Given a list of texts, return a matrix with their TF-IDF representation.

In [86]:
TFIDF=freqVecs*np.log10(len(corpus)/OneHot.sum(axis=0))
TFIDF


array([[0.47712125, 0.17609126, 0.17609126, 0.17609126, 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.17609126, 0.17609126, 0.17609126, 0.35218252,
        0.47712125, 0.47712125, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.17609126,
        0.        , 0.        , 0.47712125, 0.47712125]])

In [87]:
pd.DataFrame(TFIDF,columns=corpusDict.keys())

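|   | que | dia | es | hoy | martes | el | de | muchas | gracias |
|---|-----|-----|----|-----|--------|----|----|--------|---------|
| 0 | 0.477121 | 0.176091 | 0.176091 | 0.176091 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.176091 | 0.176091 | 0.176091 | 0.352183 | 0.477121 | 0.477121 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.176091 | 0.0 | 0.0 | 0.477121 | 0.477121 |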
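
Each entry in the matrix above is the term count multiplied by log10(N / df), where N is the number of documents (3) and df is the number of documents containing the term. A few entries can be checked by hand; the variable names below are illustrative only.

```python
import math

n_docs = 3

# "que" appears in 1 document, with count 1 in document 0
tfidf_que_doc0 = 1 * math.log10(n_docs / 1)

# "dia" appears in 2 documents, with count 1 in document 0
tfidf_dia_doc0 = 1 * math.log10(n_docs / 2)

# "martes" appears in 2 documents, with count 2 in document 1
tfidf_martes_doc1 = 2 * math.log10(n_docs / 2)

print(tfidf_que_doc0, tfidf_dia_doc0, tfidf_martes_doc1)
```

A term that appeared in all three documents would get log10(3 / 3) = 0, so it would contribute nothing to any document vector.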


### 5 - Document comparison
Write a function that takes the corpus and the index of a document and returns the other documents sorted by cosine similarity, most similar first.

In [113]:
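def similaridad(corpus,idx):
  # Term-frequency matrix: one row per document, one column per vocabulary term
  freqVecs=np.zeros((len(corpus),len(corpusFinal)))
  for i,phrase in enumerate(corpus):
    for word in phrase.split(" "):
      freqVecs[i][corpusDict[word]]+=1
  # Document frequency computed from the corpus argument, not the global OneHot
  docFreq=(freqVecs>0).sum(axis=0)
  TFIDF=freqVecs*np.log10(len(corpus)/docFreq)
  # Cosine similarity between the chosen document and every other document
  others=[i for i in range(len(corpus)) if i!=idx]
  sims=np.array([cosine_similarity(TFIDF[idx],TFIDF[i]) for i in others])
  # argsort on the negated scores gives descending order; the indices must not be negated
  return np.asarray(corpus)[others][np.argsort(-sims)]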

In [115]:
similaridad(corpus,0)

array(['martes el dia de hoy es martes', 'martes muchas gracias'],
      dtype='<U30')
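
The result above shows the order but not the scores. The sketch below is a self-contained variant that returns each document together with its cosine similarity. The function name `rank_by_similarity` and its return format are assumptions made for this example, and it assumes no document has an all-zero TF-IDF vector (which would cause a division by zero).

```python
import numpy as np

def rank_by_similarity(corpus, idx):
    """Return (document, similarity) pairs for all documents except corpus[idx]."""
    # Vocabulary in order of first appearance, and term -> column index
    vocab = list(dict.fromkeys(w for doc in corpus for w in doc.split()))
    index = {w: i for i, w in enumerate(vocab)}

    # Term-frequency matrix
    tf = np.zeros((len(corpus), len(vocab)))
    for row, doc in enumerate(corpus):
        for w in doc.split():
            tf[row, index[w]] += 1

    # TF-IDF with base-10 log, matching the formula used in section 4
    idf = np.log10(len(corpus) / (tf > 0).sum(axis=0))
    tfidf = tf * idf

    # Cosine similarity of every row against row idx
    norms = np.linalg.norm(tfidf, axis=1)
    sims = tfidf @ tfidf[idx] / (norms * norms[idx])

    # Descending order, skipping the query document itself
    order = [i for i in np.argsort(-sims) if i != idx]
    return [(corpus[i], float(sims[i])) for i in order]

corpus = ['que dia es hoy', 'martes el dia de hoy es martes', 'martes muchas gracias']
ranking = rank_by_similarity(corpus, 0)
for doc, score in ranking:
    print(f"{score:.3f}  {doc}")
```

For document 0, the second document shares "dia", "es" and "hoy" and so gets a positive score, while the third shares no terms and scores 0.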