<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center">


# Procesamiento de lenguaje natural
## Bot con NLTK utilizando un corpus de wikipedia


In [1]:
import json
import string
import random
import re
import urllib.request

import numpy as np

# Para leer y parsear el texto en HTML de wikipedia
import bs4 as bs

import nltk
# Descargar el diccionario
nltk.download("punkt")
nltk.download("wordnet")
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\apguz\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\apguz\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\apguz\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### Datos
Se consumira los datos del artículo de wikipedia sobre el "memex" en ingles.

El memex fue una dispositivo nunca desarrollado (o por lo menos nunca desarrollado por Vannevar Bush) que basicamente era un consultor de textos y apuntes usando microfilms. Un poco la idea era la misma que con este bot: consultar textos basado en palabras claves.

Contexto: https://proyectoidis.org/memex/

In [2]:
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Memex')
raw_html = raw_html.read()

article_html = bs.BeautifulSoup(raw_html, 'lxml')

article_paragraphs = article_html.find_all('p')

article_text = ''

for para in article_paragraphs:
    article_text += para.text

article_text = article_text.lower()

In [3]:
# Demos un vistazo
article_text

'memex is a hypothetical electromechanical device for interacting with microform documents and described in vannevar bush\'s 1945 article "as we may think". bush envisioned the memex as a device in which individuals would compress and store all of their books, records, and communications, "mechanized so that it may be consulted with exceeding speed and flexibility". the individual was supposed to use the memex as an automatic personal filing system, making the memex "an enlarged intimate supplement to his memory".[1] the name memex is a portmanteau of memory and expansion.\nthe concept of the memex influenced the development of early hypertext systems, eventually leading to the creation of the world wide web, and personal knowledge base software.[2] the hypothetical implementation depicted by bush for the purpose of concrete illustration was based upon a document bookmark list of static microfilm pages and lacked a true hypertext system, where parts of pages would have internal structu

In [4]:
print("Cantidad de caracteres en la nota:", len(article_text))

Cantidad de caracteres en la nota: 12692


### 2 - Preprocesamiento
- Remover caracteres especiales
- Quitar espacios o saltos

In [5]:
#sustituir los números entre corchetes por un espacio en blanco
text = re.sub(r'\[[0-9]*\]', ' ', article_text)
#sustituir más de un caracter de espacio, salto de línea o tabulación
text = re.sub(r'\s+', ' ', text)

In [6]:
# Demos un vistazo
text

'memex is a hypothetical electromechanical device for interacting with microform documents and described in vannevar bush\'s 1945 article "as we may think". bush envisioned the memex as a device in which individuals would compress and store all of their books, records, and communications, "mechanized so that it may be consulted with exceeding speed and flexibility". the individual was supposed to use the memex as an automatic personal filing system, making the memex "an enlarged intimate supplement to his memory". the name memex is a portmanteau of memory and expansion. the concept of the memex influenced the development of early hypertext systems, eventually leading to the creation of the world wide web, and personal knowledge base software. the hypothetical implementation depicted by bush for the purpose of concrete illustration was based upon a document bookmark list of static microfilm pages and lacked a true hypertext system, where parts of pages would have internal structure beyo

In [7]:
print("Cantidad de caracteres en el texto:", len(text))

Cantidad de caracteres en el texto: 12585


### 3 - Dividir el texto en sentencias y en palabras

In [8]:
corpus = nltk.sent_tokenize(text)
words = nltk.word_tokenize(text)

In [9]:
# Demos un vistazo
corpus[:10]

['memex is a hypothetical electromechanical device for interacting with microform documents and described in vannevar bush\'s 1945 article "as we may think".',
 'bush envisioned the memex as a device in which individuals would compress and store all of their books, records, and communications, "mechanized so that it may be consulted with exceeding speed and flexibility".',
 'the individual was supposed to use the memex as an automatic personal filing system, making the memex "an enlarged intimate supplement to his memory".',
 'the name memex is a portmanteau of memory and expansion.',
 'the concept of the memex influenced the development of early hypertext systems, eventually leading to the creation of the world wide web, and personal knowledge base software.',
 'the hypothetical implementation depicted by bush for the purpose of concrete illustration was based upon a document bookmark list of static microfilm pages and lacked a true hypertext system, where parts of pages would have in

In [10]:
# Demos un vistazo
words[:20]

['memex',
 'is',
 'a',
 'hypothetical',
 'electromechanical',
 'device',
 'for',
 'interacting',
 'with',
 'microform',
 'documents',
 'and',
 'described',
 'in',
 'vannevar',
 'bush',
 "'s",
 '1945',
 'article',
 '``']

In [11]:
print("Vocabulario:", len(words))

Vocabulario: 2319


### 4 - Funciones de ayuda para limpiar y procesar el input del usuario
- Lematizar los tokens de la oración
- Quitar símbolos de puntuación

In [48]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def perform_lemmatization(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

punctuation_removal = dict((ord(punctuation), None) for punctuation in string.punctuation)

def get_processed_text(document):
    # 1 - reduce el texto a mínuscula
    # 2 - quitar los simbolos de puntuacion
    # 3 - realiza la tokenización
    # 4 - realiza la lematización
    return perform_lemmatization(nltk.word_tokenize(document.lower().translate(punctuation_removal)))

### 5 - Utilizar vectores TF-IDF y la similitud coseno construido con el corpus de wikipedia

In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate_response(user_input, corpus):
    response = ''
    # Sumar al corpus la pregunta del usuario para calcular
    # su cercania con otros documentos/sentencias
    corpus.append(user_input)

    # Crear un vectorizar TFIDF que quite las "stop words" del ingles y utilice
    # nuestra funcion para obtener los tokens lematizados "get_processed_text"
    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')

    # Crear los vectores a partir del corpus
    all_word_vectors = word_vectorizer.fit_transform(corpus)

    # Calcular la similitud coseno entre todas los documentos excepto el agregado (el útlimo "-1")
    # NOTA: con los word embedings veremos más en detalle esta matriz de similitud
    similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors)

    # Obtener el índice del vector más cercano a nuestra oración
    # --> descartando la similitud contra nuestor vector propio
    similar_sentence_number = similar_vector_values.argsort()[0][-2]
    matched_vector = similar_vector_values.flatten()
    matched_vector.sort()
    vector_matched = matched_vector[-2]

    if vector_matched == 0:
        response = "I am sorry, I could not understand you"
    else:
        response = corpus[similar_sentence_number]
    
    corpus.remove(user_input)
    return response

### 6 - Ensayar el sistema
El sistema intentará encontrar la parte del artículo que más se relaciona con nuestro texto de entrada. Sugerencias ensayar:
- Vannevar Bush
- Memory
- Hypertext
- Microfilm

In [None]:
import gradio as gr

def bot_response(human_text):
    print("Q:", human_text)    
    resp = generate_response(human_text.lower(), corpus)
    print("A:", resp)
    return resp

iface = gr.Interface(
    fn=bot_response,
    inputs=["textbox"],
    outputs="text",
    layout="vertical")

iface.launch(debug=True)



Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


Q: Vannebar Bush




A: in 1962, engelbart sent bush a draft article for comment, bush never replied.
Q: Memory
A: the name memex is a portmanteau of memory and expansion.
Q: Memory
A: the name memex is a portmanteau of memory and expansion.
Q: hypertext
A: in 1968, nelson collaborated with andries van dam to implement the hypertext editing system (hes).
Q: memex
A: the top of the memex would have a transparent platen.
Q: device

A: the article claims that magnetic tape would be central to the creation of a modern memex device.
Q: documents

A: memex is a hypothetical electromechanical device for interacting with microform documents and described in vannevar bush's 1945 article "as we may think".
Q: technological

A: published 22 years after his initial conception of the memex, bush details the various technological advancements that have made his vision a possibility.
Q: colombia

A: I am sorry, I could not understand you
Q: digital

A: bush also relates that, unlike digital technology, memex would be of 

## Segunda Prueba

La siguiente prueba es con una página que plantea el aprendizaje automático como un nooscopio. 

Acá el buscador: https://nooscope.ai/

In [34]:
raw_html = urllib.request.urlopen('https://nooscope.ai/')
raw_html = raw_html.read()

article_html = bs.BeautifulSoup(raw_html, 'lxml')

article_paragraphs = article_html.find_all('p')

article_text_2 = ''

for para in article_paragraphs:
    article_text_2 += para.text

article_text_2 = article_text_2.lower()

In [35]:
article_text_2

"download diagram pdf and essay pdf\n              by vladan joler and matteo pasquinelli\n              (2020)\n            the nooscope is a cartography of the limits of artificial intelligence, intended as a provocation to both computer science and the humanities. any map is a partial perspective, a way to provoke debate. similarly, this map is a manifesto — of ai dissidents. its main purpose is to challenge the mystifications of artificial intelligence. first, as a technical definition of intelligence and, second, as a political form that would be autonomous from society and the human.\n                  1\n                  in the expression ‘artificial intelligence’ the adjective ‘artificial’ carries the myth of the technology’s autonomy: it hints to caricatural ‘alien minds’ that self-reproduce in silico but, actually, mystifies two processes of proper alienation: the growing geopolitical autonomy of hi-tech companies and the invisibilization of workers’ autonomy worldwide. the 

In [13]:
print("Cantidad de caracteres en la nota:", len(article_text_2))

Cantidad de caracteres en la nota: 30002


In [36]:
#sustituir los números que aparecen como citas al final de algunos párrafos
text_2 = re.sub(r'\[0-9]*', ' ', article_text_2)
#sustituir más de un caracter de espacio, salto de línea o tabulación
text_2 = re.sub(r'\s+', ' ', article_text_2)

In [37]:
text_2

"download diagram pdf and essay pdf by vladan joler and matteo pasquinelli (2020) the nooscope is a cartography of the limits of artificial intelligence, intended as a provocation to both computer science and the humanities. any map is a partial perspective, a way to provoke debate. similarly, this map is a manifesto — of ai dissidents. its main purpose is to challenge the mystifications of artificial intelligence. first, as a technical definition of intelligence and, second, as a political form that would be autonomous from society and the human. 1 in the expression ‘artificial intelligence’ the adjective ‘artificial’ carries the myth of the technology’s autonomy: it hints to caricatural ‘alien minds’ that self-reproduce in silico but, actually, mystifies two processes of proper alienation: the growing geopolitical autonomy of hi-tech companies and the invisibilization of workers’ autonomy worldwide. the modern project to mechanise human reason has clearly mutated, in the 21st century

In [39]:
print("Cantidad de caracteres en el texto:", len(text_2))

Cantidad de caracteres en el texto: 69986


In [41]:
corpus_2 = nltk.sent_tokenize(text_2)
words_2 = nltk.word_tokenize(text_2)

In [43]:
corpus_2[:5]

['download diagram pdf and essay pdf by vladan joler and matteo pasquinelli (2020) the nooscope is a cartography of the limits of artificial intelligence, intended as a provocation to both computer science and the humanities.',
 'any map is a partial perspective, a way to provoke debate.',
 'similarly, this map is a manifesto — of ai dissidents.',
 'its main purpose is to challenge the mystifications of artificial intelligence.',
 'first, as a technical definition of intelligence and, second, as a political form that would be autonomous from society and the human.']

In [46]:
words_2[:30]

['download',
 'diagram',
 'pdf',
 'and',
 'essay',
 'pdf',
 'by',
 'vladan',
 'joler',
 'and',
 'matteo',
 'pasquinelli',
 '(',
 '2020',
 ')',
 'the',
 'nooscope',
 'is',
 'a',
 'cartography',
 'of',
 'the',
 'limits',
 'of',
 'artificial',
 'intelligence',
 ',',
 'intended',
 'as',
 'a']

In [47]:
print("Vocabulario:", len(words_2))

Vocabulario: 12227


In [None]:
import gradio as gr

def bot_response(human_text):
    print("Q:", human_text)    
    resp = generate_response(human_text.lower(), corpus)
    print("A:", resp)
    return resp

iface = gr.Interface(
    fn=bot_response,
    inputs=["textbox"],
    outputs="text",
    layout="vertical")

iface.launch(debug=True)



Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


Q: layers




A: in this modality, the neural network is run backwards (moving from the smaller output layer toward the larger input layer) to generate new patterns after being trained at classifying them, a process that usually moves from the larger input layer to the smaller output layer.
Q: ethics
A: I am sorry, I could not understand you
Q: ethic
A: I am sorry, I could not understand you
Q: black boxes
A: through techniques of face obfuscation, humans have decided to become unintelligible to artificial intelligence: that is to become, themselves, black boxes.
Q: nooscope
A: there is an important perspective to take into account, in order to understand ai as a nooscope.
Q: nooscope diagram
A: similarly, the nooscope diagram exposes the skeleton of the ai black box and shows that ai is not a thinking automaton but an algorithm that performs pattern recognition.
Q: algorithm
A: but what do the algorithms of machine learning really do?
Q: hallucination
A: more in general, ai is a new regime of truth