<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center">


# Natural language processing
## Information retrieval system with NLTK using a Wikipedia corpus


In [41]:
import json
import string
import random
import re # Regular Expressions (regex)
import urllib.request

import numpy as np
import pandas as pd
# To read and parse the Wikipedia HTML
import bs4 as bs

import nltk
# Download the tokenizer models and the WordNet data

# Only needed on first use
#nltk.download("punkt")
#nltk.download("wordnet")
#nltk.download('omw-1.4')

### 1 - Data
The data comes from the English Wikipedia article on the country "Colombia".

In [65]:

raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Colombia')
raw_html = raw_html.read()

# Parse the article; 'lxml' is the parser to use
article_html = bs.BeautifulSoup(raw_html, 'lxml')

# Find every paragraph in the HTML (under the <p> tag)
# and keep them available as a list
article_paragraphs = article_html.find_all('p')

article_text = ''

for para in article_paragraphs:
    article_text += para.text

article_text = article_text.lower()
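
To see what the paragraph extraction does without a network request, here is a minimal offline sketch. It swaps BeautifulSoup for the standard library's `html.parser`, so it is an illustration of the idea rather than the notebook's method; `ParagraphExtractor` and `sample_html` are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while we are inside a <p> element
        self.paragraphs = []    # one string per paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside <p> (titles, menus, ...) is ignored
        if self.depth > 0:
            self.paragraphs[-1] += data


# Made-up HTML, standing in for the downloaded article
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p><p>Second one.</p>"
    "</body></html>"
)

extractor = ParagraphExtractor()
extractor.feed(sample_html)
sample_text = " ".join(extractor.paragraphs).lower()
print(sample_text)
```

The heading is dropped and nested markup such as `<b>` is flattened into the paragraph text, which is the same effect the `find_all('p')` loop relies on.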

In [None]:
'''
with open('C:\\Users\\Jesus\\Documents\\CEIA\\Procesamiento de lenguaje natural\\Trabajos propuestos\\2\\corpus.txt', 'r') as file:
    article_text = file.read()
    print(article_text)
    
article_text = article_text.lower()
'''

In [66]:
# Let's take a look
article_text

'\ncolombia (/kəˈlʌmbiə/ (listen), /-ˈlɒm-/;[13] spanish:\xa0[koˈlombja] (listen)), officially the republic of colombia,[a] is a country  mostly in south america with insular regions in north america. the colombian mainland is bordered by the caribbean sea to the north, venezuela to the east and northeast, brazil to the southeast, ecuador and peru to the south and southwest, the pacific ocean to the west, and panama to the northwest. colombia is divided into 32 departments. the capital district of bogotá is also the country\'s largest city hosting the main financial and cultural hub, and other major urbes include medellín, cali, barranquilla, cartagena, santa marta, cúcuta, ibagué, villavicencio and bucaramanga. it covers an area of 1,141,748 square kilometers (440,831 sq mi), and has a population of around 52 million. colombia is the largest spanish-speaking country in south america. its cultural heritage[14]—including language, religion, cuisine, and art—reflects its history as a col

In [67]:
print("Number of characters in the article:", len(article_text))

Number of characters in the article: 80007


### 2 - Preprocessing
- Remove special characters
- Remove extra spaces and line breaks

In [68]:
import unicodedata

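# Strip accents: NFKD splits accented letters into base letter + accent mark,
# and the ASCII encode/decode then drops every non-ASCII character
# (this also removes symbols with no ASCII equivalent, such as IPA letters)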
text = unicodedata.normalize('NFKD', article_text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
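
A small sketch of what this normalization does to different kinds of characters. The `strip_to_ascii` helper is a hypothetical wrapper around the same call chain, and the sample strings are fragments of the article text shown earlier.

```python
import unicodedata


def strip_to_ascii(value):
    # Same normalize / encode / decode chain as in the cell above
    return unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('utf-8', 'ignore')


# Accented letters keep their base letter
accented = strip_to_ascii('bogotá, medellín, cúcuta')

# IPA letters have no ASCII decomposition, so they disappear entirely
ipa = strip_to_ascii('/kəˈlʌmbiə/')

# The same happens to the em dash (U+2014), which glues the two words together
dash = strip_to_ascii('art\u2014reflects')

print(accented)
print(ipa)
print(dash)
```

This explains why the cleaned text contains fragments such as `/klmbi/` and `artreflects`: the characters were dropped, not replaced by a space.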

In [69]:
# Regex reference:
# https://docs.python.org/3/library/re.html

# To practice regex:
# https://regex101.com/

# The 'r' prefix marks a raw string, in which backslashes are not escape characters:
# '\n' is read by Python as a single newline character
# r'\n' is read by Python as a two-character string:
#  backslash and n

# Regex substitutions, each match replaced by a single space:
text = re.sub(r'\[[0-9]*\]', ' ', text) # bracketed numbers such as [13] (citation markers)
# (the backslashes make the square brackets match literally)
text = re.sub('\u200b', ' ', text) # zero-width spaces (already dropped by the ASCII step above, kept as a safeguard)
text = re.sub(r'\s+', ' ', text) # runs of spaces, line breaks or tabs

# Try the patterns above in regex101 with:
# 'Hola [1], [], [ estoy bien   [123]. [12sss]. OK!   .'
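
The suggested test string can also be checked directly in Python. This is a minimal sketch that applies the bracket and whitespace patterns to it, then illustrates the raw-string point with a zero-width space; the variable names are introduced here for illustration only.

```python
import re

sample = 'Hola [1], [], [ estoy bien   [123]. [12sss]. OK!   .'

# Only brackets containing digits (or nothing) are replaced:
# '[ estoy' and '[12sss]' do not match and are left in place
without_refs = re.sub(r'\[[0-9]*\]', ' ', sample)

# Collapse every run of whitespace into one space
cleaned = re.sub(r'\s+', ' ', without_refs)
print(cleaned)

# Raw strings: '\n' is one character, r'\n' is two
print(len('\n'), len(r'\n'))

# r'\\u200b' looks for a literal backslash followed by "u200b",
# so it does not find the zero-width space character itself
with_zwsp = 'a\u200bb'
print(re.search(r'\\u200b', with_zwsp))           # no match
print(re.search('\u200b', with_zwsp) is not None)  # match
```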

In [70]:
# Let's take a look
text

' colombia (/klmbi/ (listen), /-lm-/; spanish: [kolombja] (listen)), officially the republic of colombia,[a] is a country mostly in south america with insular regions in north america. the colombian mainland is bordered by the caribbean sea to the north, venezuela to the east and northeast, brazil to the southeast, ecuador and peru to the south and southwest, the pacific ocean to the west, and panama to the northwest. colombia is divided into 32 departments. the capital district of bogota is also the country\'s largest city hosting the main financial and cultural hub, and other major urbes include medellin, cali, barranquilla, cartagena, santa marta, cucuta, ibague, villavicencio and bucaramanga. it covers an area of 1,141,748 square kilometers (440,831 sq mi), and has a population of around 52 million. colombia is the largest spanish-speaking country in south america. its cultural heritage including language, religion, cuisine, and artreflects its history as a colony, fusing cultural 

In [71]:
print("Number of characters in the text:", len(text))

Number of characters in the text: 77584


### 3 - Split the text into sentences and words

In [72]:
corpus = nltk.sent_tokenize(text) # split into sentences
words = nltk.word_tokenize(text) # split into tokens

In [73]:
len(corpus)

483

In [74]:
# Let's take a look
corpus[:]

[' colombia (/klmbi/ (listen), /-lm-/; spanish: [kolombja] (listen)), officially the republic of colombia,[a] is a country mostly in south america with insular regions in north america.',
 'the colombian mainland is bordered by the caribbean sea to the north, venezuela to the east and northeast, brazil to the southeast, ecuador and peru to the south and southwest, the pacific ocean to the west, and panama to the northwest.',
 'colombia is divided into 32 departments.',
 "the capital district of bogota is also the country's largest city hosting the main financial and cultural hub, and other major urbes include medellin, cali, barranquilla, cartagena, santa marta, cucuta, ibague, villavicencio and bucaramanga.",
 'it covers an area of 1,141,748 square kilometers (440,831 sq mi), and has a population of around 52 million.',
 'colombia is the largest spanish-speaking country in south america.',
 'its cultural heritage including language, religion, cuisine, and artreflects its history as a 

In [75]:
# Let's take a look
words[:20]

['colombia',
 '(',
 '/klmbi/',
 '(',
 'listen',
 ')',
 ',',
 '/-lm-/',
 ';',
 'spanish',
 ':',
 '[',
 'kolombja',
 ']',
 '(',
 'listen',
 ')',
 ')',
 ',',
 'officially']

In [76]:
# len(words) counts every token, repetitions and punctuation included
print("Number of tokens:", len(words))

Number of tokens: 13771
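
The count above includes every occurrence of every token. A vocabulary size would count each distinct token once; the sketch below shows the difference on a made-up sentence, using `str.split()` as a stand-in for `nltk.word_tokenize` so it runs without NLTK data.

```python
# Made-up sentence, already lowercased and space-separated
tokens = "the capital of colombia is bogota , the largest city of colombia .".split()

token_count = len(tokens)    # every occurrence counts
vocabulary = set(tokens)     # each distinct token counts once

print("Tokens:", token_count)
print("Vocabulary:", len(vocabulary))
```

In the notebook, `len(set(words))` would give the corresponding vocabulary size for the article.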


### 4 - Helper functions to clean and process the user's input
- Lemmatize the tokens of the sentence
- Remove punctuation symbols

In [77]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def perform_lemmatization(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

# ord() returns the Unicode code point of a given character
punctuation_removal = dict((ord(punctuation), None) for punctuation in string.punctuation)

def get_processed_text(document):
    # 1 - lowercase the text (str.lower())
    # 2 - remove punctuation symbols (str.translate())
    # 3 - tokenize (nltk.word_tokenize)
    # 4 - lemmatize (our perform_lemmatization function)
    return perform_lemmatization(nltk.word_tokenize(document.lower().translate(punctuation_removal)))
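
The first two steps of this pipeline can be tried without NLTK. The sketch below rebuilds the same translation table, checks that it matches what `str.maketrans` produces, and cleans a made-up question; `str.split()` is an assumption standing in for `nltk.word_tokenize`, and lemmatization is left out because it needs the WordNet data.

```python
import string

# Same table as above: every punctuation code point maps to None (i.e. is deleted)
punctuation_removal = dict((ord(p), None) for p in string.punctuation)

# str.maketrans with a third argument builds an equivalent table
same_table = punctuation_removal == str.maketrans('', '', string.punctuation)

question = "What's the country's capital?"
cleaned = question.lower().translate(punctuation_removal)
tokens = cleaned.split()  # stand-in for nltk.word_tokenize

print(same_table)
print(cleaned)
print(tokens)
```

Note that deleting the apostrophe turns `country's` into `countrys`, so possessives will not match the plain noun unless the lemmatizer or the corpus happens to produce the same form.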

### 5 - Use TF-IDF vectors and cosine similarity built from the Wikipedia article corpus

In [78]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate_response(user_input, corpus):
    # Add the user's question to a copy of the corpus so that it is
    # tokenized and vectorized together with the article sentences
    # (the original list is left untouched)
    documents = corpus + [user_input]

    # Create a TF-IDF vectorizer that removes English stop words and uses
    # our "get_processed_text" function to obtain the lemmatized tokens
    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')

    # Build the vectors from the documents
    all_word_vectors = word_vectorizer.fit_transform(documents)

    # Cosine similarity between the question (the last row, "-1") and every document
    # NOTE: this similarity matrix is covered in more detail with word embeddings
    similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors)

    # Index of the vector closest to our question
    # --> "-2" skips the top match, which is normally the question itself
    similar_sentence_number = similar_vector_values.argsort()[0][-2]
    vector_matched = similar_vector_values[0][similar_sentence_number]

    if vector_matched == 0: # zero cosine similarity (no terms in common)
        return "I am sorry, I could not understand you"
    return documents[similar_sentence_number] # most similar document
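
To make the scoring less of a black box, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity on three made-up sentences. It is an illustration under stated assumptions, not the scikit-learn implementation: it uses a smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1` with L2 normalization, tokenizes with `str.split()`, and skips stop-word removal and lemmatization. The names `tfidf_vectors`, `cosine` and `retrieve` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Term count times smoothed IDF
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def cosine(a, b):
    # Both vectors have unit length, so the dot product is the cosine
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def retrieve(query, sentences):
    """Return (best sentence, score), or (None, 0.0) when nothing overlaps."""
    docs = [s.split() for s in sentences] + [query.split()]
    vectors = tfidf_vectors(docs)
    scores = [cosine(vectors[-1], v) for v in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return (sentences[best], scores[best]) if scores[best] > 0 else (None, 0.0)


# Made-up mini corpus in the style of the article sentences
sentences = [
    "colombia is divided into 32 departments",
    "spanish is the official language",
    "the capital district is bogota",
]

match, score = retrieve("which is the capital", sentences)
off_topic = retrieve("artificial intelligence", sentences)
print(match, round(score, 3))
print(off_topic)
```

Because the query is scored only against the corpus rows, there is no need to skip a self-match; a query with no terms in common yields a score of exactly zero, which is the case the notebook answers with the apology message.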

### 6 - Try the system
The system will try to find the sentence of the article that best matches our input text. Suggested queries:
- which is the capital
- which is the language
- how is it divided?
- artificial intelligence

In [39]:
# Gradio will be used to try the bot
# A powerful tool for quickly building interfaces to test models
# https://gradio.app/
import sys
#!{sys.executable} -m pip install gradio --quiet

In [80]:
import gradio as gr

def bot_response(human_text):
    print("Q:", human_text)    
    resp = generate_response(human_text.lower(), corpus)
    print("A:", resp,"\n")
    return resp

iface = gr.Interface(
    fn=bot_response,
    inputs=["textbox"],
    outputs="text")

iface.launch(debug=True)



Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


Q: which is the capital
A: [citation needed] colombia is divided into 32 departments and one capital district, which is treated as a department (bogota also serves as the capital of the department of cundinamarca). 





Q: which is the religion
A: 1.8% of colombians adhere to jehovah's witnesses and adventism and less than 1% adhere to other religions, such as the bahai faith, islam, judaism, buddhism, mormonism, hinduism, indigenous religions, hare krishna movement, rastafari movement, eastern orthodox church, and spiritual studies. 

Q: which is the language
A: more than 99.2% of colombians speak spanish, also called castilian; 65 amerindian languages, two creole languages, the romani language and colombian sign language are also used in the country. 

Q: what is the area of the country
A: about 82.5% of the country's total area lies in the warm altitudinal zone. 

Q: how is it divided?
A: colombia is divided into 32 departments. 

Q: what is the current population
A: colombia is projected to have a population of 55.3 million by 2050. the population is concentrated in the andean highlands and along the caribbean coast, also the population densities are generally higher in the andean region. 

Q: wha



### Student

- Take one of the example bots (one of the two) and build your own.
- Draw conclusions from the results.

__IMPORTANT__: For the submission, the questions and the bot's answers must remain recorded in the Colab notebook so that the final performance can be evaluated.
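
One possible way to keep the questions and answers on record is to run a fixed list of questions through the bot and print the transcript in a regular cell. This is only a sketch: `record_session` and `fake_bot` are hypothetical names, and `fake_bot` is a stand-in for the notebook's `bot_response` so the example runs on its own.

```python
def record_session(answer_fn, questions):
    """Run each question through answer_fn and return the 'Q:/A:' transcript."""
    lines = []
    for question in questions:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer_fn(question)}")
    return "\n".join(lines)


# Stand-in for bot_response, with canned answers (assumption for this sketch)
def fake_bot(question):
    if "divided" in question:
        return "colombia is divided into 32 departments."
    return "I am sorry, I could not understand you"


transcript = record_session(fake_bot, ["how is it divided?", "artificial intelligence"])
print(transcript)
```

In the notebook, passing `bot_response` instead of `fake_bot` would leave the full exchange in the cell output, independent of the Gradio session.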