<img src="https://github.com/FIUBA-Posgrado-Inteligencia-Artificial/procesamiento_lenguaje_natural/raw/main/logoFIUBA.jpg" width="500" align="center">


# Natural language processing
## Vectorization


In [None]:
import numpy as np

In [None]:
def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

### Data

In [None]:
corpus = np.array(
    ['que dia es hoy', 'martes el dia de hoy es martes', 'martes muchas gracias'])

Document 1 --> que dia es hoy \
Document 2 --> martes el dia de hoy es martes \
Document 3 --> martes muchas gracias

### 1 - Get the corpus vocabulary (the terms used)
- Transform each document into a list of terms
- Build a vector of the unique terms across all documents

In [4]:
import nltk
nltk.download('punkt')


# Documents of the corpus
document1 = "que dia es hoy"
document2 = "martes el dia de hoy es martes"
document3 = "martes muchas gracias"


def document2list(document):
    # Split the document into terms on whitespace
    return document.split()


print(document2list(document1))

print(document2list(document2))

document3_to_list = nltk.word_tokenize(document3)
print(document3_to_list)

[nltk_data] Downloading package punkt to /home/benja/nltk_data...


['que', 'dia', 'es', 'hoy']
['martes', 'el', 'dia', 'de', 'hoy', 'es', 'martes']
['martes', 'muchas', 'gracias']


[nltk_data]   Unzipping tokenizers/punkt.zip.


In [11]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["que dia es hoy",
          "martes el dia de hoy es martes",
          "martes muchas gracias"]

# Create an object of CountVectorizer class
vectorizer = CountVectorizer()

# Tokenize and build vocabulary
vectorizer.fit(corpus)

# Encode
vector = vectorizer.transform(corpus)
print(vectorizer.vocabulary_)

{'que': 8, 'dia': 1, 'es': 3, 'hoy': 5, 'martes': 6, 'el': 2, 'de': 0, 'muchas': 7, 'gracias': 4}
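Both bullets of this step can also be done by hand, without scikit-learn. The following is a minimal sketch, assuming whitespace tokenization is enough for this corpus; `build_vocabulary` is a hypothetical helper name, not part of the original notebook. Sorting the unique terms alphabetically should give the same term-to-index mapping that `CountVectorizer` printed above for this corpus.

```python
corpus = ["que dia es hoy",
          "martes el dia de hoy es martes",
          "martes muchas gracias"]


def build_vocabulary(documents):
    """Return the sorted list of unique terms found in all documents."""
    # Split each document into terms (whitespace tokenization, an assumption)
    terms = [term for document in documents for term in document.split()]
    # A set removes duplicates; sorting gives a stable, alphabetical order
    return sorted(set(terms))


vocabulary = build_vocabulary(corpus)
# Map each term to its position in the vocabulary vector
term_to_index = {term: index for index, term in enumerate(vocabulary)}
print(vocabulary)
print(term_to_index)
```

The index of each term is what later sections use as the column position in the document-term matrix.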


### 2- One-hot encoding
Given a list of texts, return a matrix with their one-hot encoding representation

In [31]:
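# One-hot encoding: 1 if the term occurs in the document, 0 otherwise
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Binarizer

corpus = ['The cat sat on the mat.',
          'The dog chased the cat.',
          'CS224n at Stanford is the best NLP class you can ever take!']


# Count term occurrences per document (CountVectorizer lowercases by default)
freq = CountVectorizer()
counts = freq.fit_transform(corpus)

# Binarize the counts: any value greater than 0 becomes 1
one_hot_encoding = Binarizer()
matrix_words = one_hot_encoding.fit_transform(counts.toarray())
print(matrix_words)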

[[0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 0 1 0]
 [0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0]
 [1 1 1 0 0 1 1 0 1 1 0 1 0 0 1 1 1 1]]
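The exercise asks for a function that takes a list of texts and returns the matrix. Below is a minimal sketch of such a function written with NumPy only. `tokenize` and `one_hot_matrix` are hypothetical names, and the regular expression is an assumption chosen to mirror the default token pattern of `CountVectorizer` (lowercased tokens of two or more word characters), so for these sentences it should reproduce the matrix printed above.

```python
import re

import numpy as np


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def one_hot_matrix(texts):
    """Return (vocabulary, matrix); matrix[i, j] is 1 if term j occurs in text i."""
    tokenized = [tokenize(text) for text in texts]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    term_to_index = {term: j for j, term in enumerate(vocabulary)}

    matrix = np.zeros((len(texts), len(vocabulary)), dtype=int)
    for i, tokens in enumerate(tokenized):
        for term in tokens:
            # Presence only: a repeated term still yields 1
            matrix[i, term_to_index[term]] = 1
    return vocabulary, matrix


sentences = ['The cat sat on the mat.',
             'The dog chased the cat.',
             'CS224n at Stanford is the best NLP class you can ever take!']
vocabulary, matrix = one_hot_matrix(sentences)
print(matrix)
```

Returning the vocabulary together with the matrix keeps the column-to-term mapping available to the caller.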


### 3- Frequency vectors
Given a list of texts, return a matrix with their frequency representation

In [34]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['The cat sat on the mat.',
             'The dog chased the cat.',
             'CS224n at Stanford is the best NLP class you can ever take!']

# lowercase=False keeps 'The' and 'the' as separate vocabulary terms
vectorizer = CountVectorizer(lowercase=False)

vectorizer.fit(sentences)

print(vectorizer.vocabulary_)

print(vectorizer.transform(sentences).toarray())

{'The': 3, 'cat': 7, 'sat': 15, 'on': 14, 'the': 17, 'mat': 13, 'dog': 10, 'chased': 8, 'CS224n': 0, 'at': 4, 'Stanford': 2, 'is': 12, 'best': 5, 'NLP': 1, 'class': 9, 'you': 18, 'can': 6, 'ever': 11, 'take': 16}
[[0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0]
 [0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0]
 [1 1 1 0 1 1 1 0 0 1 0 1 1 0 0 0 1 1 1]]
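The matrix printed above contains only 0s and 1s because `lowercase=False` counts 'The' and 'the' as different terms, so no term repeats within a sentence. To see counts greater than 1, here is a minimal sketch of a frequency function applied to the Spanish corpus from the Data section, where 'martes' appears twice in the second document. `frequency_matrix` is a hypothetical name, and whitespace tokenization is an assumption that only holds for punctuation-free text.

```python
from collections import Counter

import numpy as np


def frequency_matrix(texts):
    """Return (vocabulary, matrix); matrix[i, j] counts term j in text i."""
    # Whitespace tokenization, assumed sufficient for this punctuation-free corpus
    counts = [Counter(text.split()) for text in texts]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that are absent from a document
    matrix = np.array([[count[term] for term in vocabulary] for count in counts])
    return vocabulary, matrix


corpus = ["que dia es hoy",
          "martes el dia de hoy es martes",
          "martes muchas gracias"]
vocabulary, matrix = frequency_matrix(corpus)
print(vocabulary)
print(matrix)
```

The second row holds a 2 in the 'martes' column, which is the information a one-hot encoding discards.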


### 4- TF-IDF
Given a list of texts, return a matrix with their TF-IDF representation

In [30]:
# Based on https://hackernoon.com/document-term-matrix-in-nlp-count-and-tf-idf-scores-explained
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["You don’t want to waste your time. If you’re going to put aside the time and energy needed to learn new programming languages, you want to make sure, without a doubt, that the ones you choose are the most in-demand programming languages on the market. "]

vectorizer = TfidfVectorizer(stop_words='english', smooth_idf=True)

input_matrix = vectorizer.fit_transform(text).toarray()
print(input_matrix)

[[0.1796053 0.1796053 0.1796053 0.1796053 0.1796053 0.1796053 0.1796053
  0.3592106 0.1796053 0.1796053 0.1796053 0.1796053 0.1796053 0.1796053
  0.3592106 0.1796053 0.3592106 0.3592106 0.1796053]]
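Since `text` holds a single document, every term gets the same IDF, so the values printed above are just L2-normalized term counts: 0.1796 ≈ 1/√31 for terms that occur once and twice that for terms that occur twice. IDF only starts to matter with several documents. The sketch below computes TF-IDF by hand on the Spanish corpus from the Data section. It assumes the smoothed formula that scikit-learn documents for `smooth_idf=True`, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row; `tfidf_matrix` is a hypothetical name and the tokenization is plain whitespace splitting.

```python
from collections import Counter

import numpy as np


def tfidf_matrix(texts):
    """Return (vocabulary, matrix) of L2-normalized TF-IDF weights."""
    counts = [Counter(text.split()) for text in texts]
    vocabulary = sorted(set().union(*counts))
    tf = np.array([[count[term] for term in vocabulary] for count in counts],
                  dtype=float)

    n_documents = len(texts)
    # Document frequency: number of documents that contain each term
    df = (tf > 0).sum(axis=0)
    # Smoothed IDF (assumed formula, see the paragraph above)
    idf = np.log((1 + n_documents) / (1 + df)) + 1

    weights = tf * idf
    # L2-normalize each row so every document vector has unit length
    return vocabulary, weights / np.linalg.norm(weights, axis=1, keepdims=True)


corpus = ["que dia es hoy",
          "martes el dia de hoy es martes",
          "martes muchas gracias"]
vocabulary, matrix = tfidf_matrix(corpus)
print(np.round(matrix, 3))
```

In the third document, 'gracias' and 'muchas' end up with a higher weight than 'martes', because 'martes' also appears in the second document and therefore has a lower IDF.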


### 5 - Document comparison
Write a function that receives the corpus and the index of a document and returns the documents sorted by cosine similarity
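One possible shape for this function is sketched below. It is not the notebook's reference solution: it assumes term-frequency vectors built with whitespace tokenization as the document representation, reuses the `cosine_similarity` definition from the top of the notebook, and leaves the query document out of the ranking. `rank_by_similarity` is a hypothetical name.

```python
from collections import Counter

import numpy as np


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def rank_by_similarity(corpus, index):
    """Return (document, similarity) pairs for the other documents, most similar first."""
    # Term-frequency vectors over the corpus vocabulary (assumed representation)
    counts = [Counter(document.split()) for document in corpus]
    vocabulary = sorted(set().union(*counts))
    vectors = np.array([[count[term] for term in vocabulary] for count in counts])

    # Compare the query document against every other document
    scores = [(document, float(cosine_similarity(vectors[index], vectors[i])))
              for i, document in enumerate(corpus) if i != index]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


corpus = ["que dia es hoy",
          "martes el dia de hoy es martes",
          "martes muchas gracias"]
for document, score in rank_by_similarity(corpus, 0):
    print(f"{score:.3f}  {document}")
```

For document 0 this ranks "martes el dia de hoy es martes" first with a similarity of 0.500 (three shared terms, vector norms 2 and 3) and "martes muchas gracias" last with 0.000, since it shares no terms with the query. Swapping the frequency vectors for TF-IDF vectors would only change how `vectors` is built.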