## Text classification using AutoML


Text classification consists of assigning one of several categories to a given text. Some examples of this task are:

- given a tweet, categorize its sentiment as positive, negative, or neutral.
- given a Facebook post, classify whether or not it contains offensive language.
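To make the task concrete before turning to AutoML, here is a minimal sketch of a hand-written text classifier: a bag-of-words Naive Bayes with Laplace smoothing, using only the standard library. The training sentences, labels, and function names are invented for illustration and are not part of AutoGOAL or of any dataset used in this notebook.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: fraction of training documents in this category.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace (add-one) smoothing so unseen words do not zero out the score.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data (hypothetical examples).
train_texts = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "boring plot awful acting",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_naive_bayes(train_texts, train_labels)
print(classify("loved the great acting", model))  # prints "positive"
```

An AutoML system automates the choices made by hand here (tokenization, features, and the classification algorithm) by searching over many candidate pipelines.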

In this activity we will explore how to use the *out of the box* solutions for this task included in the [AutoGOAL](https://github.com/autogoal/autogoal) library, and how to apply them to classify [IMDB](https://www.imdb.com/) movie reviews into the categories \[$positive$, $negative$\].



**Instructions:**

- follow the directions and comments in each section.


**After this activity we will be familiar with:**
- how to model a classification problem with AutoGOAL.
- how to use AutoGOAL to automatically search for a text classification *pipeline*.
- how to use this *pipeline* to classify new texts.

**Requirements**
- python 3.6.12 - 3.8
- tensorflow==2.3.0
- autogoal==0.3.2
- pandas==1.1.5
- plotly==4.13.0
- tqdm==4.56.0
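One way to confirm that an environment matches these pins is to compare the installed versions against the list. The sketch below assumes Python 3.8, where `importlib.metadata` is part of the standard library (on 3.6 or 3.7 the `importlib_metadata` backport would be needed instead); the `check_versions` helper is a hypothetical name introduced for illustration.

```python
from importlib import metadata  # standard library since Python 3.8

# Pins copied from the requirements list above.
PINNED = {
    "tensorflow": "2.3.0",
    "autogoal": "0.3.2",
    "pandas": "1.1.5",
    "plotly": "4.13.0",
    "tqdm": "4.56.0",
}


def check_versions(pinned):
    """Map each package to 'ok', 'missing', or a description of the mismatch."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if installed == wanted:
            report[name] = "ok"
        else:
            report[name] = f"found {installed}, expected {wanted}"
    return report


for name, status in check_versions(PINNED).items():
    print(f"{name}: {status}")
```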


## Installing libraries and importing dependencies

To begin, we need to install the dependencies and perform the necessary imports.

Run the following cells, paying attention to the additional instructions in the comments.

In [2]:
%%capture
# %%capture is a cell magic, so it must be the first line of the cell.
# Install libraries. This cell is useful, for example, when running the notebook in Google Colab.
# Note that other dependencies such as tensorflow==2.3.0, pandas==1.1.3, etc. would already be installed in that case.
!pip install plotly==4.13.0 tqdm==4.51.0 autogoal[contrib]

print('Done!')

In [3]:
import pandas as pd
import plotly.graph_objects as go
from collections import Counter

print('Done!')

Done!


In [4]:
from autogoal.ml import AutoML
from autogoal.datasets import haha
from autogoal.search import (
    Logger,
    PESearch,
    ConsoleLogger,
    ProgressLogger,
    MemoryLogger,
)
from autogoal.kb import List, Sentence, Tuple, CategoricalVector
from autogoal.contrib import find_classes
from sklearn.metrics import f1_score

print('Done!')



Done!


In [5]:
iterations = 1
popsize = 50
timeout = 600
global_timeout = None
memory = 20
examples = None

classifier = AutoML(
    search_algorithm=PESearch,
    input=List(Sentence()),
    output=CategoricalVector(),
    search_iterations=iterations,
    score_metric=f1_score,
    search_kwargs=dict(
        pop_size=popsize,
        search_timeout=global_timeout,
        evaluation_timeout=timeout,
        memory_limit=memory * 1024 ** 3,
    ),
)
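Two values in this configuration are worth unpacking. `score_metric=f1_score` ranks candidate pipelines by F1, and `memory_limit=memory * 1024 ** 3` converts 20 GiB into bytes. The sketch below computes a binary F1 by hand to show what the metric measures; it assumes labels encoded as 0/1 with 1 as the positive class (scikit-learn's default for `f1_score`), which is an assumption about the dataset. The `binary_f1` helper and the label lists are hypothetical.

```python
def binary_f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(round(binary_f1(y_true, y_pred), 3))  # prints 0.667

# The memory limit passed to the search, expressed in bytes.
memory_limit_bytes = 20 * 1024 ** 3
print(memory_limit_bytes)  # prints 21474836480
```

Because F1 ignores true negatives, it is a stricter score than accuracy when the positive class is the minority.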

In [6]:
class CustomLogger(Logger):
    def error(self, e: Exception, solution):
        if e and solution:
            with open("haha_errors.log", "a") as fp:
                fp.write(f"solution={repr(solution)}\nerror={repr(e)}\n\n")

    def update_best(self, new_best, new_fn, *args):
        with open("haha.log", "a") as fp:
            fp.write(f"solution={repr(new_best)}\nfitness={new_fn}\n\n")

# Basic logging configuration.
# Note that CustomLogger, defined above, is not included in this list.
logger = MemoryLogger()
loggers = [ProgressLogger(), ConsoleLogger(), logger]


In [7]:
X_train, y_train, X_test, y_test = haha.load(max_examples=examples)

100%|██████████| 1.60M/1.60M [00:00<00:00, 17.1MB/s]
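Before launching the search, it can help to check how balanced the classes are, since a skewed distribution affects how an F1 score should be read. The sketch below uses `Counter` (already imported earlier) on a placeholder list; in the notebook one would presumably pass `y_train` instead, assuming it is a flat sequence of labels. The `label_distribution` helper is a hypothetical name.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, fraction)} for a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Placeholder labels standing in for y_train (made-up values).
example_labels = [1, 0, 0, 1, 0, 0, 0, 1]

for label, (n, fraction) in sorted(label_distribution(example_labels).items()):
    print(f"class {label}: {n} examples ({fraction:.1%})")
```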


In [None]:
classifier.fit(X_train, y_train, logger=loggers)

  defaults = yaml.load(f)
Failed to start diagnostics server on port 8787. [Errno 99] Cannot assign requested address
Could not launch service 'bokeh' on port 8787. Got the following message:

[Errno 99] Cannot assign requested address
  self.scheduler.start(scheduler_address)


Sentence()
List(Word())
Word()
Tuple(List(Word()), List(Flags()))
Flags()
List(Postag())
Postag()
MatrixContinuousDense()
List(Stem())
Stem()
List(ContinuousVector())
ContinuousVector()
List(ContinuousVector())
ContinuousVector()
List(Flags())
Flags()
List(Flags())
Flags()
List(Summary())
Summary()
List(Summary())
Summary()
List(Flags())
Flags()
MatrixContinuousDense()
MatrixContinuousSparse()
MatrixContinuousSparse()
DiscreteVector()
MatrixContinuousDense()
ContinuousVector()
MatrixContinuousDense()
MatrixContinuousSparse()
Flags()
List(List(Sentence()))
List(Sentence())
Sentence()
List(List(Flags()))
List(Flags())
Flags()
List(MatrixContinuousSparse())
MatrixContinuousSparse()
List(Tensor3())
Tensor3()
List(List(List(Word())))
List(List(Word()))
List(Word())
Word()
List(List(Tuple(List(Word()), List(Flags()))))
List(Tuple(List(Word()), List(Flags())))
Tuple(List(Word()), List(Flags()))
List(List(Flags()))
List(Flags())
Flags()
List(MatrixContinuousDense())
MatrixContinuousDense()
Lis

100%|██████████| 2.73G/2.73G [06:06<00:00, 8.00MB/s]


(!) Error evaluating pipeline: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 20 and the array at index 1 has size 10

Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/autogoal/utils/_process.py", line 40, in _restricted_function
    result = self.function(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/autogoal/ml/_automl.py", line 187, in fitness_fn
    pipeline.run((X_train, y_train))
  File "/usr/local/lib/python3.7/dist-packages/autogoal/kb/_algorithm.py", line 427, in run
    raise e from None
  File "/usr/local/lib/python3.7/dist-packages/autogoal/kb/_algorithm.py", line 425, in run
    x = step.run(x)
  File "/usr/local/lib/python3.7/dist-packages/autogoal/kb/_data.py", line 257, in run_method
    elements[index] = self.inner.run(elements[index])
  File "/usr/local/lib/python3.7/dist-packages/autogoal/contrib/wrappers.py", line 62, i



  lis = BeautifulSoup(html).find_all('li')


(!) Error evaluating pipeline:
Fitness=0.000
(!) Error evaluating pipeline: Error while generating solution: Cannot find compatible implementations for interface <class 'types.Algorithm[List(Word()), List(Word())]'>
(!) Error evaluating pipeline: Error while generating solution: Cannot find compatible implementations for interface <class 'types.Algorithm[List(Word()), List(Word())]'>
Evaluating pipeline:
Pipeline(
    steps=[
        TupleWrapper[
            Tuple(List(Sentence()), CategoricalVector()),
            Tuple(List(List(Word())), CategoricalVector()),
        ](
            inner=ListAlgorithm[List(Sentence()), List(List(Word()))](
                inner=MWETokenizer()
            )
        ),
        TupleWrapper[
            Tuple(List(List(Word())), CategoricalVector()),
            Tuple(List(List(Summary())), CategoricalVector()),
        ](
            inner=ListAlgorithm[
                List(List(Word(domain=general, 



  lis = BeautifulSoup(html).find_all('li')
