<img src="https://github.com/FIUBA-Posgrado-Inteligencia-Artificial/procesamiento_lenguaje_natural/raw/main/logoFIUBA.jpg" width="500" align="center">


# Natural language processing
## Word2vect


In [1]:
import numpy as np

In [2]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * (np.linalg.norm(b)))
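
The function above computes the cosine of the angle between two vectors, $a \cdot b / (\lVert a \rVert \, \lVert b \rVert)$. As a quick sanity check, the sketch below applies the same formula using only the standard library. The name `cosine_similarity_plain` is hypothetical, and the zero-vector guard is an assumption that the notebook's version does not make.

```python
import math

def cosine_similarity_plain(a, b):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat an all-zero vector as dissimilar to everything
        # instead of dividing by zero.
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity_plain([1, 0], [0, 1]))  # orthogonal vectors
print(cosine_similarity_plain([1, 2], [2, 4]))  # parallel vectors
print(cosine_similarity_plain([0, 0], [1, 1]))  # guarded zero-vector case
```

Orthogonal vectors should score 0 and parallel vectors should score 1 (up to floating-point rounding), which is the behavior the document comparison in the last section relies on.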

### Datos

In [3]:
corpus = np.array(['que dia es hoy', 'martes el dia de hoy es martes', 'martes muchas gracias'])

Document 1 --> que dia es hoy \
Document 2 --> martes el dia de hoy es martes \
Document 3 --> martes muchas gracias

### 1 - Obtain the corpus vocabulary (the terms used)
- Turn each document into a list of terms
- Build a vector of the unique terms across all documents

In [4]:
def splitDocs(corp):
  splited = [doc.split(" ") for doc in corp]
  return splited

splited_docs = splitDocs(corpus)
print(splited_docs)

[['que', 'dia', 'es', 'hoy'], ['martes', 'el', 'dia', 'de', 'hoy', 'es', 'martes'], ['martes', 'muchas', 'gracias']]


In [5]:
import itertools
all_docs = list(itertools.chain.from_iterable(splited_docs))
all_terms = np.unique(all_docs)
print(all_terms)

['de' 'dia' 'el' 'es' 'gracias' 'hoy' 'martes' 'muchas' 'que']


### 2 - One-hot encoding
Given a list of texts, return a matrix with their one-hot encoding

In [6]:
import pandas as pd

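def getOneHotEncoding(txt_docs):
  # Tokenize the documents passed in as the argument
  docs = splitDocs(txt_docs)

  data = []
  for doc in docs:  # iterate over the argument's documents, not the global splited_docs
    # One row per document, one column per vocabulary term (global all_terms)
    data_row = np.zeros(len(all_terms))
    for i,col in enumerate(all_terms):
        if col in doc:
          data_row[i] = 1  # mark the term as present
    data.append(data_row)
  
  return pd.DataFrame(data, columns=all_terms)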
  
df_OneHotEncoding = getOneHotEncoding(corpus)
df_OneHotEncoding.head()

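| | de | dia | el | es | gracias | hoy | martes | muchas | que |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |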
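
`getOneHotEncoding` takes its column order from the notebook-level variable `all_terms`, so it only works for texts whose terms come from the original corpus. A minimal sketch of a variant that derives the vocabulary from its own argument is shown here. The names `one_hot_rows` and `corpus_example` are hypothetical, and the sketch returns plain lists instead of a DataFrame to stay dependency-free.

```python
def one_hot_rows(texts):
    """Return (vocabulary, rows); rows[i][j] is 1 if term j occurs in text i."""
    tokenized = [text.split(" ") for text in texts]
    # Sorted unique terms across all documents.
    vocabulary = sorted({term for doc in tokenized for term in doc})
    rows = []
    for doc in tokenized:
        present = set(doc)  # constant-time membership test per term
        rows.append([1 if term in present else 0 for term in vocabulary])
    return vocabulary, rows

corpus_example = ['que dia es hoy',
                  'martes el dia de hoy es martes',
                  'martes muchas gracias']
vocabulary, rows = one_hot_rows(corpus_example)
print(vocabulary)
for row in rows:
    print(row)
```

Because the vocabulary is rebuilt on every call, two calls on different corpora produce matrices with different columns. Whether that is desirable depends on whether new texts must be compared against an existing corpus.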


### 3 - Frequency vectors
Given a list of texts, return a matrix with their term-frequency representation

In [7]:
def getTF(txt_docs):

  docs = splitDocs(txt_docs)
  
  data = []
  for doc in docs:
    data_row = np.zeros(len(all_terms),dtype=int)
    for term in doc:
        data_row[list(all_terms).index(term)] += 1
    data.append(data_row)
  
  return pd.DataFrame(data, columns=all_terms)

TF = getTF(corpus)
TF.head()

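| | de | dia | el | es | gracias | hoy | martes | muchas | que |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 2 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |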
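
`getTF` calls `list(all_terms).index(term)` for every token, which rebuilds the list and scans it each time, and raises `ValueError` for a term outside the vocabulary. The sketch below shows one possible alternative with a term-to-column dictionary and `collections.Counter`. The names `term_frequency_rows` and `vocabulary_example` are hypothetical, and silently skipping unknown terms is an assumption, not the notebook's behavior.

```python
from collections import Counter

def term_frequency_rows(texts, vocabulary):
    """Count how often each vocabulary term occurs in each text."""
    index_of = {term: i for i, term in enumerate(vocabulary)}  # term -> column
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for term, count in Counter(text.split(" ")).items():
            if term in index_of:  # assumption: ignore out-of-vocabulary terms
                row[index_of[term]] = count
        rows.append(row)
    return rows

vocabulary_example = ['de', 'dia', 'el', 'es', 'gracias',
                      'hoy', 'martes', 'muchas', 'que']
tf_rows = term_frequency_rows(
    ['que dia es hoy', 'martes el dia de hoy es martes', 'martes muchas gracias'],
    vocabulary_example,
)
for row in tf_rows:
    print(row)
```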


### 4 - TF-IDF
Given a list of texts, return a matrix with their TF-IDF representation
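
The cells in this section use the base-10 logarithm and the unsmoothed form of the weighting. Written out, with $N$ the number of documents, $\mathrm{df}(t)$ the number of documents that contain term $t$, and $\mathrm{tf}(t, d)$ the count of $t$ in document $d$ (the symbol names are chosen here for illustration):

$$
\mathrm{idf}(t) = \log_{10}\frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

For example, `martes` appears in 2 of the 3 documents, so $\mathrm{idf} = \log_{10}(3/2) \approx 0.176091$. It occurs twice in the second document, which gives $2 \cdot \log_{10}(3/2) \approx 0.352183$. A term that appears in every document would get $\log_{10}(1) = 0$ and drop out of the representation.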

In [8]:
N = len(corpus)
DF = df_OneHotEncoding.sum(axis=0)
IDF = np.log10(N/DF)
IDF

de         0.477121
dia        0.176091
el         0.477121
es         0.176091
gracias    0.477121
hoy        0.176091
martes     0.176091
muchas     0.477121
que        0.477121
dtype: float64

In [9]:
TF_IDF = TF*IDF
TF_IDF

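| | de | dia | el | es | gracias | hoy | martes | muchas | que |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.176091 | 0.0 | 0.176091 | 0.0 | 0.176091 | 0.0 | 0.0 | 0.477121 |
| 1 | 0.477121 | 0.176091 | 0.477121 | 0.176091 | 0.0 | 0.176091 | 0.352183 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.477121 | 0.0 | 0.176091 | 0.477121 | 0.0 |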


### 5 - Document comparison
Write a function that takes the corpus and the index of a document and returns the documents sorted by cosine similarity to that document

In [10]:
def compareDocs(txt_docs,idx):
  docs = np.array(getTF(txt_docs))  # Get TF from Docs
  refDocs = docs[idx]               # Get document to compare with
  
  simLvl = []
  for doc in docs:
    simLvl.append(cosine_similarity(doc,refDocs))
  simLvl = np.array(simLvl) # cosine Similarity for each document

  arrIdx = simLvl.argsort()
  sorted_txt_docs = txt_docs[arrIdx[::-1]]
  return sorted_txt_docs, simLvl

In [11]:
idx = 1
sorted_txt_docs, simLvl = compareDocs(corpus,idx)

print("Original corpus:")
print(corpus)
print("\nRefer doc (idx = "+str(idx)+"): "+str(corpus[idx]))
print("\nSimilarity: "+str(simLvl))

print("\nOrdered corpus by similarity: ")
print(sorted_txt_docs)

Original corpus:
['que dia es hoy' 'martes el dia de hoy es martes' 'martes muchas gracias']

Refer doc (idx = 1): martes el dia de hoy es martes

Similarity: [0.5        1.         0.38490018]

Ordered corpus by similarity: 
['martes el dia de hoy es martes' 'que dia es hoy' 'martes muchas gracias']
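
`compareDocs` ranks documents on the raw term counts from `getTF`. The sketch below shows how the same ranking might look with the TF-IDF weights from section 4 instead. It uses only the standard library, assumes the unsmoothed base-10 IDF used in this notebook, and the name `rank_by_tfidf_cosine` is hypothetical.

```python
import math
from collections import Counter

def rank_by_tfidf_cosine(texts, ref_index):
    """Return (order, sims): indices sorted by TF-IDF cosine similarity."""
    tokenized = [text.split(" ") for text in texts]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = {t: sum(1 for doc in tokenized if t in doc) for t in vocabulary}
    idf = {t: math.log10(n_docs / doc_freq[t]) for t in vocabulary}

    # One TF-IDF vector per document, aligned with the vocabulary order.
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[t] * idf[t] for t in vocabulary])

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0  # assumption: zero vector scores 0

    sims = [cosine(vec, vectors[ref_index]) for vec in vectors]
    order = sorted(range(n_docs), key=lambda i: sims[i], reverse=True)
    return order, sims

order, sims = rank_by_tfidf_cosine(
    ['que dia es hoy', 'martes el dia de hoy es martes', 'martes muchas gracias'],
    1,
)
print(order)
print(sims)
```

For this corpus the ordering matches the one obtained with raw counts, although the similarity values are lower because the terms the documents share have small IDF weights.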
