### Natural language processing: information retrieval system with NLTK using a Wikipedia corpus
### Practical assignment 2 (TP 2)


In [39]:
! pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [40]:
import json
import string
import random
import re # Regular Expressions (regex)
import urllib.request
import numpy as np
# To read and parse the HTML text of the web page
from bs4 import BeautifulSoup
import nltk
# Download the tokenizer models and the WordNet data
nltk.download("punkt")
nltk.download("wordnet")
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### Data
The data come from the English-language article on vitamin C published on Harvard's The Nutrition Source site (the URL in the next cell).

In [41]:
url = 'https://www.hsph.harvard.edu/nutritionsource/vitamin-c/'

# Fetch the HTML content of the web page
response = urllib.request.urlopen(url)
raw_html = response.read()

# Parse the HTML with BeautifulSoup
article_html = BeautifulSoup(raw_html, 'lxml')

# Find every paragraph in the HTML (under the <p> tag)
# and get the text of each paragraph
article_paragraphs = article_html.find_all('p')
article_text = ' '.join([para.text for para in article_paragraphs])

# Convert the text to lowercase
article_text = article_text.lower()

# Print the text and the number of characters in the article
print(article_text,'\n')
print("Number of characters in the article:", len(article_text))


is a glass of oj or vitamin c tablets your go-to when the sniffles come? loading up on this vitamin was a practice spurred by linus pauling in the 1970s, a double nobel laureate and self-proclaimed champion of vitamin c who promoted daily megadoses (the amount in 12 to 24 oranges) as a way to prevent colds and some chronic diseases. vitamin c, or ascorbic acid, is a water-soluble vitamin. this means that it dissolves in water and is delivered to the body’s tissues but is not well stored, so it must be taken daily through food or supplements. even before its discovery in 1932, nutrition experts recognized that something in citrus fruits could prevent scurvy, a disease that killed as many as two million sailors between 1500 and 1800. [1] vitamin c plays a role in controlling infections and healing wounds, and is a powerful antioxidant that can neutralize harmful free radicals. it is needed to make collagen, a fibrous protein in connective tissue that is weaved throughout various systems 
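
To see what the paragraph extraction does without downloading anything, the same idea can be sketched on a made-up HTML snippet. This sketch uses the standard library's `html.parser` as a stand-in for BeautifulSoup; the `ParagraphExtractor` class and the sample HTML are hypothetical and only illustrate the "keep the text under `<p>`, join it, lowercase it" step.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while the parser is inside a <p> element
        self.paragraphs = []  # one string per paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph
        if self.depth > 0:
            self.paragraphs[-1] += data


# Made-up HTML, not the real page
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>Vitamin C is <b>water-soluble</b>.</p>"
    "<div>menu</div>"
    "<p>It helps make collagen.</p>"
    "</body></html>"
)

extractor = ParagraphExtractor()
extractor.feed(sample_html)

# Same post-processing as the cell above: join paragraphs, then lowercase
sample_text = " ".join(extractor.paragraphs).lower()
print(sample_text)
```

The heading and the `<div>` content are dropped because they are not inside a `<p>` element, which is also why navigation text from the real page does not end up in `article_text`.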

### 2 - Preprocessing
- Remove special characters
- Remove extra spaces and line breaks

In [42]:
# Replace the matches of each regex with a single space:
text = re.sub(r'\[[0-9]*\]|\([^)]*\)', ' ', article_text)  # Drop bracketed reference numbers and parenthesized text
text = re.sub(r'\s+', ' ', text)  # Collapse runs of spaces, line breaks, or tabs into one space
print(text,'\n')
print("Number of characters in the text:", len(text))

is a glass of oj or vitamin c tablets your go-to when the sniffles come? loading up on this vitamin was a practice spurred by linus pauling in the 1970s, a double nobel laureate and self-proclaimed champion of vitamin c who promoted daily megadoses as a way to prevent colds and some chronic diseases. vitamin c, or ascorbic acid, is a water-soluble vitamin. this means that it dissolves in water and is delivered to the body’s tissues but is not well stored, so it must be taken daily through food or supplements. even before its discovery in 1932, nutrition experts recognized that something in citrus fruits could prevent scurvy, a disease that killed as many as two million sailors between 1500 and 1800. vitamin c plays a role in controlling infections and healing wounds, and is a powerful antioxidant that can neutralize harmful free radicals. it is needed to make collagen, a fibrous protein in connective tissue that is weaved throughout various systems in the body: nervous, immune, bone, c
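
The effect of the two substitutions is easier to check on a short string. The sketch below applies the same two regular expressions to a made-up sample (the sample text is an assumption, loosely based on the article).

```python
import re

# Made-up sample containing a reference marker, a parenthetical, a line break and a tab
sample = "scurvy killed many sailors. [1] megadoses (12 to 24 oranges)\n  were promoted\tin the 1970s."

# Step 1: bracketed numbers such as [1] and anything between parentheses become a space
no_refs = re.sub(r'\[[0-9]*\]|\([^)]*\)', ' ', sample)

# Step 2: every run of whitespace (spaces, line breaks, tabs) collapses to one space
clean = re.sub(r'\s+', ' ', no_refs)

print(repr(clean))
```

Note that the first pattern removes the whole parenthetical, so information such as "the amount in 12 to 24 oranges" disappears from the corpus and cannot be retrieved later.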

### 3 - Split the text into sentences and words

In [43]:
corpus = nltk.sent_tokenize(text) # split into sentences
words = nltk.word_tokenize(text) # split into tokens
# Take a look at the first sentences
print(corpus[:10])
# Take a look at the first tokens
print(words[:20])
print("Number of tokens:", len(words))

['is a glass of oj or vitamin c tablets your go-to when the sniffles come?', 'loading up on this vitamin was a practice spurred by linus pauling in the 1970s, a double nobel laureate and self-proclaimed champion of vitamin c who promoted daily megadoses as a way to prevent colds and some chronic diseases.', 'vitamin c, or ascorbic acid, is a water-soluble vitamin.', 'this means that it dissolves in water and is delivered to the body’s tissues but is not well stored, so it must be taken daily through food or supplements.', 'even before its discovery in 1932, nutrition experts recognized that something in citrus fruits could prevent scurvy, a disease that killed as many as two million sailors between 1500 and 1800. vitamin c plays a role in controlling infections and healing wounds, and is a powerful antioxidant that can neutralize harmful free radicals.', 'it is needed to make collagen, a fibrous protein in connective tissue that is weaved throughout various systems in the body: nervous
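
The number of tokens and the size of the vocabulary (distinct tokens) are different quantities; `len(words)` in the cell above counts tokens. A minimal sketch of the difference follows. It uses a simple regular expression as a stand-in tokenizer, which is an assumption made so the example runs without NLTK data; it does not split text exactly like `nltk.word_tokenize` (for instance, it breaks "water-soluble" into three tokens).

```python
import re
from collections import Counter

sample = "vitamin c, or ascorbic acid, is a water-soluble vitamin."

# Stand-in tokenizer: runs of word characters, or single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", sample)

# Counter keeps one entry per distinct token
counts = Counter(tokens)

print("Tokens:", len(tokens))
print("Distinct tokens (vocabulary):", len(counts))
print("Most frequent:", counts.most_common(2))
```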

### 4 - Helper functions to clean and process the user input
- Lemmatize the tokens of the sentence
- Remove punctuation symbols
- Remove stop words

In [44]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

lemmatizer = WordNetLemmatizer()
punctuation_removal = dict((ord(punctuation), None) for punctuation in string.punctuation)
stop_words = set(stopwords.words('english'))

def perform_lemmatization(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

def get_processed_text(document):
    # Convert the text to lowercase
    document = document.lower()

    # Remove punctuation symbols
    document = document.translate(punctuation_removal)

    # Tokenize
    tokens = nltk.word_tokenize(document)

    # Remove stop words
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize
    lemmatized_tokens = perform_lemmatization(tokens)

    return lemmatized_tokens




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
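
The punctuation and stop-word steps of `get_processed_text` can be tried in isolation. The sketch below is a simplified stand-in: `simple_process` and `demo_stop_words` are hypothetical names, whitespace splitting replaces `nltk.word_tokenize`, the stop-word list is a tiny made-up subset, and lemmatization is left out because it needs the WordNet data.

```python
import string

# Same translation table as in the cell above: every ASCII punctuation mark is deleted
punctuation_removal = {ord(ch): None for ch in string.punctuation}

# Tiny made-up stop-word list, only for this demonstration
demo_stop_words = {"what", "is", "the", "a", "of", "does"}


def simple_process(document):
    # Lowercase, strip ASCII punctuation, split on whitespace, drop stop words
    document = document.lower().translate(punctuation_removal)
    tokens = document.split()
    return [token for token in tokens if token not in demo_stop_words]


print(simple_process("What is Vitamin C?"))
print(simple_process("The body's tissues, water-soluble!"))
print(simple_process("body’s"))
```

`string.punctuation` only contains ASCII characters, so the typographic apostrophe in "body’s" (the form that appears in the article text) is not removed by this table, while the ASCII apostrophe and the hyphen are.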


### 5 - Use TF-IDF vectors and cosine similarity built from the article corpus

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate_response(user_input, corpus):
    # Add the user's question to a copy of the corpus so that it is tokenized
    # and vectorized together with the article sentences (the corpus itself is not modified)
    documents = corpus + [user_input]

    # TF-IDF vectorizer that removes English stop words and uses our
    # "get_processed_text" function to obtain lemmatized tokens
    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')

    # Build the vectors from the documents
    all_word_vectors = word_vectorizer.fit_transform(documents)

    # Cosine similarity between the question (the last row, "-1") and every corpus sentence
    # NOTE: this similarity matrix is covered in more detail with word embeddings
    similarities = cosine_similarity(all_word_vectors[-1], all_word_vectors[:-1]).flatten()

    # Index and score of the sentence closest to the question
    best_index = similarities.argmax()
    best_score = similarities[best_index]

    if best_score == 0:  # zero cosine similarity (no terms in common)
        return "I am sorry, I could not understand you"

    return corpus[best_index]  # the most similar sentence in the corpus
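
To make the retrieval step less of a black box, here is a minimal standard-library sketch of TF-IDF weighting followed by cosine similarity. It is not the scikit-learn implementation: `tfidf_vectors` and `cosine` are hypothetical helpers, the documents are already tokenized by hand, and the weighting assumes the smoothed form `idf = ln((1 + n) / (1 + df)) + 1` with L2 normalization, which is how scikit-learn documents its default settings.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one sparse, L2-normalized TF-IDF vector (a dict) per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed inverse document frequency (assumed formula, see the text above)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in tokenized_docs:
        weights = {t: tf * idf[t] for t, tf in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def cosine(u, v):
    # Both vectors are already unit length, so the dot product is the cosine
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Made-up, pre-tokenized "sentences" and queries
sentences = [
    ["vitamin", "c", "water", "soluble", "vitamin"],
    ["scurvy", "killed", "sailor"],
    ["megadoses", "prevent", "cold"],
]
query = ["vitamin", "c"]
off_topic = ["orange", "tree"]

vectors = tfidf_vectors(sentences + [query])
scores = [cosine(vectors[-1], v) for v in vectors[:-1]]
best = max(range(len(scores)), key=scores.__getitem__)
print("scores:", scores, "best sentence index:", best)

off_vectors = tfidf_vectors(sentences + [off_topic])
off_scores = [cosine(off_vectors[-1], v) for v in off_vectors[:-1]]
print("off-topic scores:", off_scores)
```

A query that shares no term with any sentence gets a score of zero everywhere, which is the case the bot answers with its fallback message.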

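### 6 - Try out the system
The system tries to find the sentence of the article that is most related to the input text. Suggested queries:
- what is vitamin c?
- how much vitamin c should a person consume?
- What vitamin does orange contain?
- How far does an orange fall from the tree? (off-topic, should trigger the fallback answer)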

In [46]:
# Gradio is used to try out the bot
# A handy tool for quickly building interfaces to test models
# https://gradio.app/
import sys
!{sys.executable} -m pip install gradio --quiet

In [47]:
import gradio as gr

def bot_response(human_text):
    print("Q:", human_text)    
    resp = generate_response(human_text.lower(), corpus)
    print("A:", resp)
    return resp

iface = gr.Interface(
    fn=bot_response,
    inputs=["textbox"],
    outputs="text")

iface.launch(debug=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.



Q: What vitamin does orange contain?
A: vitamin c, or ascorbic acid, is a water-soluble vitamin.




Q: what is vitamin c?
A: vitamin c, or ascorbic acid, is a water-soluble vitamin.
Q: how much vitamin c should a person consume?
A: small trials suggest that the amount of vitamin c in a typical multivitamin taken at the start of a cold might ease symptoms, but for the average person, there is no evidence that megadoses make a difference, or that they prevent colds.
Q: How far does an orange fall from the tree?
A: I am sorry, I could not understand you
Keyboard interruption in main thread... closing server.




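### Conclusions
The bot returns reasonable answers given the text it was provided: on-topic questions retrieve a relevant sentence from the article, and a question with no terms in common gets the fallback message. Since it can only return sentences that already exist in the corpus, better-formed answers or wider topic coverage would require a larger dataset.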

### Student

- Take one of the two example bots and build your own.
- Draw conclusions from the results.

__IMPORTANT__: For the submission, the questions and the bot's answers must remain recorded in the Colab notebook so that the final performance can be evaluated.