<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center">


# Procesamiento de lenguaje natural
## Sistema de obtención de información con NLTK utilizando un corpus de wikipedia


In [1]:
import json
import string
import random
import re # Regular Expressions (regex)
import urllib.request

import numpy as np

# Para leer y parsear el texto en HTML de wikipedia
import bs4 as bs

import nltk
# Descargar el diccionario
nltk.download("punkt")
nltk.download("wordnet")
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\alba_\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\alba_\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\alba_\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### Datos
Se consumirán los datos del artículo de wikipedia sobre el deporte "Tennis" en inglés.

In [2]:
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Association_football')
#raw_html = urllib.request.urlopen('https://es.wikipedia.org/wiki/Futbol')
raw_html = raw_html.read()

# Parsear artículo, 'lxml' es el parser a utilizar
article_html = bs.BeautifulSoup(raw_html, 'lxml')

# Encontrar todos los párrafos del HTML (bajo el tag <p>)
# y tenerlos disponible como lista
article_paragraphs = article_html.find_all('p')

article_text = ''

for para in article_paragraphs:
    article_text += para.text

article_text = article_text.lower()

In [3]:
# Demos un vistazo
article_text

'\nassociation football, more commonly known as football or soccer,[a] is a team sport played between two teams of 11 players each, who primarily use their feet to propel a ball around a rectangular field called a pitch. the objective of the game is to score more goals than the opposing team by moving the ball beyond the goal line into a rectangular-framed goal defended by the opposing team. traditionally, the game has been played over two 45-minute halves, for a total match time of 90 minutes. with an estimated 250 million players active in over 200 countries and territories, it is the world\'s most popular sport.\nthe game of association football is played in accordance with the laws of the game, a set of rules that has been in effect since 1863 and maintained by the ifab since 1886. the game is played with a football that is 68–70\xa0cm (27–28\xa0in) in circumference. the two teams compete to get the ball into the other team\'s goal (between the posts and under the bar), thereby sco

In [4]:
print("Cantidad de caracteres en la nota:", len(article_text))

Cantidad de caracteres en la nota: 42127


### 2 - Preprocesamiento
- Remover caracteres especiales
- Quitar espacios o saltos

In [5]:
# Repaso de regex:
# https://docs.python.org/3/library/re.html

# Para practicar regex:
# https://regex101.com/

# el inicio con 'r' antes de cada string indica que se interprete como raw string
# '\n' es interpretado por Python como salto de linea
# r'\n' es interpretado por Python como el string formado por dos caracteres: 
#  backslash y n

# substituir con regex con espacio vacío:
text = re.sub(r'\[[0-9]*\]', ' ', article_text) # substituir los números entre corchetes
text = re.sub(r'\[[a-z]*\]', ' ', article_text) # substituir las referencias alfabeticas entre corchetes

# (notar que los corchetes son interpretados literalmente por los backlsash)
text = re.sub(r'\s+', ' ', text) # substituir más de un caracter de espacio, salto de línea o tabulación

# probar en regex101 con los patrones anteriores:
# 'Hola [1], [], [ estoy bien   [123]. [12sss]. OK!   .'

In [6]:
# Demos un vistazo
text

' association football, more commonly known as football or soccer, is a team sport played between two teams of 11 players each, who primarily use their feet to propel a ball around a rectangular field called a pitch. the objective of the game is to score more goals than the opposing team by moving the ball beyond the goal line into a rectangular-framed goal defended by the opposing team. traditionally, the game has been played over two 45-minute halves, for a total match time of 90 minutes. with an estimated 250 million players active in over 200 countries and territories, it is the world\'s most popular sport. the game of association football is played in accordance with the laws of the game, a set of rules that has been in effect since 1863 and maintained by the ifab since 1886. the game is played with a football that is 68–70 cm (27–28 in) in circumference. the two teams compete to get the ball into the other team\'s goal (between the posts and under the bar), thereby scoring a goal

In [7]:
print("Cantidad de caracteres en el texto:", len(text))

Cantidad de caracteres en el texto: 42104


### 3 - Dividir el texto en sentencias y en palabras

In [8]:
corpus = nltk.sent_tokenize(text) # divide en oraciones
words = nltk.word_tokenize(text) # divide en términos

In [9]:
# Demos un vistazo
corpus[:10]

[' association football, more commonly known as football or soccer, is a team sport played between two teams of 11 players each, who primarily use their feet to propel a ball around a rectangular field called a pitch.',
 'the objective of the game is to score more goals than the opposing team by moving the ball beyond the goal line into a rectangular-framed goal defended by the opposing team.',
 'traditionally, the game has been played over two 45-minute halves, for a total match time of 90 minutes.',
 "with an estimated 250 million players active in over 200 countries and territories, it is the world's most popular sport.",
 'the game of association football is played in accordance with the laws of the game, a set of rules that has been in effect since 1863 and maintained by the ifab since 1886. the game is played with a football that is 68–70 cm (27–28 in) in circumference.',
 "the two teams compete to get the ball into the other team's goal (between the posts and under the bar), the

In [10]:
# Demos un vistazo
words[:20]

['association',
 'football',
 ',',
 'more',
 'commonly',
 'known',
 'as',
 'football',
 'or',
 'soccer',
 ',',
 'is',
 'a',
 'team',
 'sport',
 'played',
 'between',
 'two',
 'teams',
 'of']

In [11]:
print("Vocabulario:", len(words))

Vocabulario: 8297


### 4 - Funciones de ayuda para limpiar y procesar el input del usuario
- Lematizar los tokens de la oración
- Quitar símbolos de puntuación

In [12]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def perform_lemmatization(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

# ord() nos da el código Unicode para un caracter dado
punctuation_removal = dict((ord(punctuation), None) for punctuation in string.punctuation)

def get_processed_text(document):
    # 1 - reduce el texto a mínuscula (string.lower())
    # 2 - quitar los simbolos de puntuacion (string.translate())
    # 3 - realiza la tokenización (nltk.word_tokenize)
    # 4 - realiza la lematización (nuestra función perform_lemmatization)
    return perform_lemmatization(nltk.word_tokenize(document.lower().translate(punctuation_removal)))

### 5 - Utilizar vectores TF-IDF y la similitud coseno construido con el corpus del artículo de wikipedia

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate_response(user_input, corpus):
    response = ''
    # Sumar al corpus la pregunta del usuario para calcular
    # su cercania con otros documentos/sentencias
    # la entrada del usuario se usa para tokenizar y vectorizar
    corpus.append(user_input)

    # Crear un vectorizar TFIDF que quite las "stop words" del ingles y utilice
    # nuestra funcion para obtener los tokens lematizados "get_processed_text"
    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')

    # Crear los vectores a partir del corpus
    all_word_vectors = word_vectorizer.fit_transform(corpus)

    # Calcular la similitud coseno entre todas los documentos excepto el agregado (el útlimo "-1")
    # NOTA: con los word embedings veremos más en detalle esta matriz de similitud
    similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors)

    # Obtener el índice del vector más cercano a nuestra oración
    # --> descartando la similitud contra nuestor vector propio
    similar_sentence_number = similar_vector_values.argsort()[0][-2]
    matched_vector = similar_vector_values.flatten()
    matched_vector.sort()
    vector_matched = matched_vector[-2]

    if vector_matched == 0: # si la similaridad coseno fue nula (ningún término en común)
        response = "I am sorry, I could not understand you"
    else:
        response = corpus[similar_sentence_number] # obtener el documento del corpus más similar
    
    corpus.remove(user_input)
    return response

### 6 - Ensayar el sistema
El sistema intentará encontrar la parte del artículo que más se relaciona con nuestro texto de entrada.

In [16]:
def bot_response(human_text):
    print("Q:", human_text)    
    resp = generate_response(human_text.lower(), corpus)
    print("A:", resp)
    return resp

In [26]:
bot_response('what is soccer')

Q: what is soccer
A: the recognised international governing body of football (and associated games, such as futsal and beach soccer) is fifa.


'the recognised international governing body of football (and associated games, such as futsal and beach soccer) is fifa.'

In [19]:
bot_response('how many players are there in a game?')

Q: how many players are there in a game?
A: while players typically spend most of the game in a specific position, there are few restrictions on player movement, and players can switch positions at any time.


'while players typically spend most of the game in a specific position, there are few restrictions on player movement, and players can switch positions at any time.'

In [25]:
bot_response('how long does a game last?')

Q: how long does a game last?
A: women may have been playing football for as long as the game has existed.


'women may have been playing football for as long as the game has existed.'

In [24]:
bot_response('when was soccer created?')

Q: when was soccer created?
A: [45] in the second half of the century, the european cup and the copa libertadores were created, and the champions of these two club competitions would contest the intercontinental cup to prove which team was the best in the world.


'[45] in the second half of the century, the european cup and the copa libertadores were created, and the champions of these two club competitions would contest the intercontinental cup to prove which team was the best in the world.'

In [23]:
bot_response('what is the length of the pitch?')

Q: what is the length of the pitch?
A: fields for non-international matches may be 90–120 m (100–130 yd) in length and 45–90 m (50–100 yd) in width, provided the pitch does not become square.


'fields for non-international matches may be 90–120 m (100–130 yd) in length and 45–90 m (50–100 yd) in width, provided the pitch does not become square.'

In [28]:
bot_response('Discuss the evolution of soccer as a global phenomenon, exploring its historical origins, the development of its rules, the growth of international competitions and the influence of key players and managers on the game')

Q: Discuss the evolution of soccer as a global phenomenon, exploring its historical origins, the development of its rules, the growth of international competitions and the influence of key players and managers on the game
A: in the case of international club competition, it is the country of origin of the clubs involved, not the nationalities of their players, that renders the competition international in nature.


'in the case of international club competition, it is the country of origin of the clubs involved, not the nationalities of their players, that renders the competition international in nature.'

In [30]:
bot_response('What is golf?')

Q: What is golf?
A: I am sorry, I could not understand you


'I am sorry, I could not understand you'

### 7 - Conclusiones

El sistema da respuestas solidas cuando las preguntas son basadas en el contenido del corpus, podría ser particularmente útil en el caso de que se necesite encontrar respuestas de preguntas similares al contenido.

Sin embargo tambien puede otorgar respuestas erroneas ya que depende de la calidad de los datos del corpus.

Aparentemente no captura el significado semántico de las oraciones y es posible que pueda mejorarse utilizando embeddings.