<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center">


# Natural language processing
## Bot with NLTK using a Wikipedia corpus


In [1]:
import json
import string
import random
import re # Regular Expressions (regex)
import urllib.request

import numpy as np

# To read and parse the HTML text from Wikipedia
import bs4 as bs

import nltk
# Download the tokenizer models and the WordNet dictionaries
nltk.download("punkt")
nltk.download("wordnet")
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

### 1 - Data
The data comes from the English Wikipedia article on Our Lady of Guadalupe.

In [3]:
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Our_Lady_of_Guadalupe')
raw_html = raw_html.read()

# Parse the article; 'lxml' is the parser to use
article_html = bs.BeautifulSoup(raw_html, 'lxml')

# Find all the paragraphs in the HTML (under the <p> tag)
# and keep them available as a list
article_paragraphs = article_html.find_all('p')

article_text = ''

for para in article_paragraphs:
    article_text += para.text

article_text = article_text.lower()

In [4]:
# Take a look
article_text

'coordinates: 19°29′04″n 99°07′02″w\ufeff / \ufeff19.48444°n 99.11722°w\ufeff / 19.48444; -99.11722\nour lady of guadalupe (spanish: nuestra señora de guadalupe), also known as the virgin of guadalupe (spanish: virgen de guadalupe), is a catholic title of mary, mother of jesus associated with a series of five marian apparitions, which are believed to have occurred in december 1531, and a venerated image on a cloak enshrined within the basilica of our lady of guadalupe in mexico city. the basilica is the most-visited catholic shrine in the world, and the world\'s third most-visited sacred site.[1][2]\npope leo xiii granted the image a decree of canonical coronation on 8 february 1887 and it was pontifically crowned on 12 october 1895.\naccording to the nican mopohua, included in the 17th-century huei tlamahuiçoltica, written in nahuatl, the virgin mary appeared four times to juan diego, a chichimec peasant, and once to his uncle, juan bernardino. the first apparition occurred on the mor

In [5]:
print("Number of characters in the article:", len(article_text))

Number of characters in the article: 41848


### 2 - Preprocessing
- Remove special characters
- Remove extra spaces and line breaks

In [6]:
# Replace each regex match with a single space:

text = re.sub(r'\[(.*?)\]', ' ', article_text) # replace text between square brackets
text = re.sub(r'\((.*?)\)', ' ', text) # replace text between parentheses

text = re.sub(r'\s+', ' ', text) # collapse runs of spaces, line breaks or tabs
text = re.sub(r'\)', ' ', text) # replace stray closing parentheses


In [7]:
# Take a look
text

'coordinates: 19°29′04″n 99°07′02″w\ufeff / \ufeff19.48444°n 99.11722°w\ufeff / 19.48444; -99.11722 our lady of guadalupe , also known as the virgin of guadalupe , is a catholic title of mary, mother of jesus associated with a series of five marian apparitions, which are believed to have occurred in december 1531, and a venerated image on a cloak enshrined within the basilica of our lady of guadalupe in mexico city. the basilica is the most-visited catholic shrine in the world, and the world\'s third most-visited sacred site. pope leo xiii granted the image a decree of canonical coronation on 8 february 1887 and it was pontifically crowned on 12 october 1895. according to the nican mopohua, included in the 17th-century huei tlamahuiçoltica, written in nahuatl, the virgin mary appeared four times to juan diego, a chichimec peasant, and once to his uncle, juan bernardino. the first apparition occurred on the morning of saturday, 9 december 1531 gregorian calendar in present use . juan di

In [8]:
print("Number of characters in the text:", len(text))

Number of characters in the text: 39939
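
To see what each substitution does in isolation, the same four regular expressions can be applied to a short string. This is a minimal sketch: `clean_text` and `sample` are hypothetical names introduced only for illustration, and the sample is a shortened fragment in the style of the article.

```python
import re

def clean_text(raw):
    # Drop bracketed reference markers such as [1] or [2]
    cleaned = re.sub(r'\[(.*?)\]', ' ', raw)
    # Drop parenthesised asides such as (spanish: ...)
    cleaned = re.sub(r'\((.*?)\)', ' ', cleaned)
    # Collapse runs of spaces, line breaks and tabs into one space
    cleaned = re.sub(r'\s+', ' ', cleaned)
    # Replace any closing parenthesis that was left unmatched
    cleaned = re.sub(r'\)', ' ', cleaned)
    return cleaned

sample = ("our lady of guadalupe (spanish: nuestra señora de guadalupe) "
          "is a title.[1][2]\npope leo xiii")
print(clean_text(sample))
```

Because the bracket and parenthesis patterns are non-greedy (`.*?`), each match stops at the first closing delimiter, so two separate references such as `[1][2]` are removed one by one rather than as a single span.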


### 3 - Split the text into sentences and words

In [9]:
corpus = nltk.sent_tokenize(text) # split into sentences
words = nltk.word_tokenize(text) # split into tokens

In [10]:
# Take a look
corpus[:10]

['coordinates: 19°29′04″n 99°07′02″w\ufeff / \ufeff19.48444°n 99.11722°w\ufeff / 19.48444; -99.11722 our lady of guadalupe , also known as the virgin of guadalupe , is a catholic title of mary, mother of jesus associated with a series of five marian apparitions, which are believed to have occurred in december 1531, and a venerated image on a cloak enshrined within the basilica of our lady of guadalupe in mexico city.',
 "the basilica is the most-visited catholic shrine in the world, and the world's third most-visited sacred site.",
 'pope leo xiii granted the image a decree of canonical coronation on 8 february 1887 and it was pontifically crowned on 12 october 1895. according to the nican mopohua, included in the 17th-century huei tlamahuiçoltica, written in nahuatl, the virgin mary appeared four times to juan diego, a chichimec peasant, and once to his uncle, juan bernardino.',
 'the first apparition occurred on the morning of saturday, 9 december 1531 gregorian calendar in present u

In [11]:
# Take a look
words[:20]

['coordinates',
 ':',
 '19°29′04″n',
 '99°07′02″w\ufeff',
 '/',
 '\ufeff19.48444°n',
 '99.11722°w\ufeff',
 '/',
 '19.48444',
 ';',
 '-99.11722',
 'our',
 'lady',
 'of',
 'guadalupe',
 ',',
 'also',
 'known',
 'as',
 'the']

In [12]:
print("Number of tokens:", len(words))

Number of tokens: 7439
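
The count printed above is the length of the token list, so a word that appears several times is counted every time. The number of distinct tokens is usually smaller and can be obtained with a set. The sketch below uses a hypothetical `sample_tokens` list in place of `words` so that it runs on its own.

```python
from collections import Counter

# Hypothetical token list standing in for `words`
sample_tokens = ['our', 'lady', 'of', 'guadalupe', ',',
                 'the', 'virgin', 'of', 'guadalupe']

token_count = len(sample_tokens)      # every occurrence is counted
vocabulary = set(sample_tokens)       # distinct tokens only
frequencies = Counter(sample_tokens)  # occurrences per token

print("Tokens:", token_count)
print("Distinct tokens:", len(vocabulary))
print("Most frequent:", frequencies.most_common(2))
```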


### 4 - Helper functions to clean and process the user's input
- Lemmatize the tokens of the sentence
- Remove punctuation symbols

In [13]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def perform_lemmatization(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

# ord() returns the Unicode code point of a given character
punctuation_removal = dict((ord(punctuation), None) for punctuation in string.punctuation)

def get_processed_text(document):
    # 1 - convert the text to lowercase (str.lower())
    # 2 - remove the punctuation symbols (str.translate())
    # 3 - tokenize (nltk.word_tokenize)
    # 4 - lemmatize (our perform_lemmatization function)
    return perform_lemmatization(nltk.word_tokenize(document.lower().translate(punctuation_removal)))
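
The punctuation step can be tried on its own with the standard library. In this minimal sketch, `punctuation_table` is a hypothetical name that mirrors `punctuation_removal`, and tokenization and lemmatization are left out so that the example runs without NLTK.

```python
import string

# Mapping a code point to None makes str.translate delete that character
punctuation_table = {ord(ch): None for ch in string.punctuation}

sample = "Juan Diego's uncle, Juan Bernardino, became ill!"
stripped = sample.lower().translate(punctuation_table)
print(stripped)

# str.maketrans with a third argument builds an equivalent deletion table
same = sample.lower().translate(str.maketrans('', '', string.punctuation))
print(stripped == same)

# string.punctuation only covers ASCII, so symbols such as ° ′ ″ survive
coordinate = "19°29′04″n"
print(coordinate.translate(punctuation_table))
```

The last line explains why tokens such as `19°29′04″n` from the article's coordinate header pass through this step unchanged.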

### 5 - Use TF-IDF vectors and cosine similarity built from the Wikipedia article corpus

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate_response(user_input, corpus):
    response = ''
    # Append the user's question to the corpus so that it is tokenized
    # and vectorized together with the article sentences and can be
    # compared against each of them
    corpus.append(user_input)

    # Create a TF-IDF vectorizer that drops English stop words and uses
    # our "get_processed_text" function to obtain the lemmatized tokens
    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')

    # Build the vectors from the corpus
    all_word_vectors = word_vectorizer.fit_transform(corpus)

    # Compute the cosine similarity between the user input (the last vector, index -1)
    # and every document in the corpus, the input itself included
    # NOTE: we will look at this similarity matrix in more detail with word embeddings
    similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors)

    # Get the index of the vector closest to our sentence
    # --> skipping the top match, which is the input compared with itself
    similar_sentence_number = similar_vector_values.argsort()[0][-2]
    matched_vector = similar_vector_values.flatten()
    matched_vector.sort()
    vector_matched = matched_vector[-2]

    if vector_matched == 0: # the cosine similarity was zero (no terms in common)
        response = "I am sorry, I could not understand you"
    else:
        response = corpus[similar_sentence_number] # most similar document in the corpus
    
    corpus.pop() # drop the user input appended above
    return response
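
To make the scoring less of a black box, the sketch below reimplements the idea in plain Python on a toy corpus. It is a simplified illustration and not the code the notebook runs: `tfidf_vectors`, `cosine`, `best_match` and `toy_sentences` are hypothetical names, tokens are split on whitespace instead of going through `get_processed_text`, and no stop words are removed. The IDF formula is the smoothed variant that scikit-learn documents as the `TfidfVectorizer` default, followed by L2 normalisation.

```python
import math
from collections import Counter

FALLBACK = "I am sorry, I could not understand you"

def tfidf_vectors(tokenized_docs):
    """Return one L2-normalised TF-IDF vector (term -> weight) per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Raw term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
                   for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

def cosine(u, v):
    # Both vectors have unit length, so the dot product is the cosine
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

def best_match(query, sentences):
    # Vectorize sentences and query together without modifying `sentences`
    docs = [s.lower().split() for s in sentences] + [query.lower().split()]
    vectors = tfidf_vectors(docs)
    scores = [cosine(vectors[-1], vec) for vec in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    answer = sentences[best] if scores[best] > 0 else FALLBACK
    return answer, scores

toy_sentences = [
    "the basilica is in mexico city",
    "he concluded that juan diego had not existed",
    "the virgin appeared to juan diego and his uncle juan bernardino",
]

answer, scores = best_match("juan diego", toy_sentences)
print(answer)
print(best_match("eyes", toy_sentences)[0])
```

Building a new list inside `best_match` avoids appending the query to the caller's corpus and removing it afterwards. Scoring the query only against the sentences also removes the need to skip the self-match with index `-2`.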

### 6 - Try out the system
The system will try to find the part of the article most closely related to our input text. Suggested queries to try:
- Juan Diego
- Mexico
- Jesus
- Virgin
- Guadalupe
- eyes

In [15]:
# gradio is used to try out the bot
# A handy tool for quickly building interfaces to test models
# https://gradio.app/
import sys
!{sys.executable} -m pip install gradio --quiet

In [16]:
import gradio as gr

def bot_response(human_text):
    print("Q:", human_text)    
    resp = generate_response(human_text.lower(), corpus)
    print("A:", resp)
    return resp

iface = gr.Interface(
    fn=bot_response,
    inputs=["textbox"],
    outputs="text",
    layout="vertical")

iface.launch(debug=True)



Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.

Q: Guadalupe
A: coordinates: 19°29′04″n 99°07′02″w﻿ / ﻿19.48444°n 99.11722°w﻿ / 19.48444; -99.11722 our lady of guadalupe , also known as the virgin of guadalupe , is a catholic title of mary, mother of jesus associated with a series of five marian apparitions, which are believed to have occurred in december 1531, and a venerated image on a cloak enshrined within the basilica of our lady of guadalupe in mexico city.
Q: Juan Diego
A: he concluded that juan diego had not existed.
Q: Juan
A: by monday, december 11, however, juan diego's uncle, juan bernardino, became ill, which obligated juan diego to attend to him.
Q: Eyes
A: in 1929 and 1951 photographers said they found a figure reflected in the virgin's eyes; upon inspection they said that the reflection was tripled in what is called the purkinje effect, commonly found in human eyes.
Keyboard interruption in main thread... closing server.




### Conclusions

- The bot works as a kind of search engine: it returns the most "similar" text it finds in the corpus.
- What it ranks as most "similar" is not necessarily the most relevant answer to the question or search.
- If a query word does not exactly match a term in the corpus, the bot cannot understand what we are asking.
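
The last point follows from how the similarity is computed: two texts with no term in common always get a cosine similarity of zero. The sketch below shows this at the level of shared terms. `shared_terms` is a hypothetical helper, the sentence is adapted from the article, and whitespace splitting stands in for the notebook's tokenizer and lemmatizer.

```python
def shared_terms(query, sentence):
    """Return the terms that the query and the sentence have in common."""
    return set(query.lower().split()) & set(sentence.lower().split())

sentence = "the virgin mary appeared four times to juan diego"

for query in ["virgin", "virgen", "our lady"]:
    overlap = shared_terms(query, sentence)
    if overlap:
        print(f"{query!r} shares {sorted(overlap)}")
    else:
        print(f"{query!r} shares no terms, so the similarity is 0")
```

A misspelling ("virgen") or a paraphrase ("our lady") leaves the overlap empty even though a human reader would consider both queries related to the sentence.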