<a href="https://colab.research.google.com/github/CD-AC/Master_AI/blob/main/mia07_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TEXT CLASSIFICATION ACTIVITY

In this activity we classify texts, covering the whole process from fetching the dataset to the classification itself. Along the way we build a vocabulary, use embeddings, and create models.

The exercises in this activity are based on a notebook created by François Chollet, one of the creators of Keras and author of the book "Deep Learning with Python".

This notebook works with the "Newsgroup20" dataset, which contains approximately 20,000 messages belonging to 20 different categories.

The goal is to understand the concepts involved and to be able to run small experiments that improve the notebook.

# Libraries

In [23]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Downloading the Data

In [None]:
data_path = keras.utils.get_file(
    "news20.tar.gz",
    "http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
    untar=True,
)

Downloading data from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz
17329808/17329808 ━━━━━━━━━━━━━━━━━━━━ 3s 0us/step


In [24]:
import os
import pathlib

!tar -xzf /content/news20.tar.gz -C /content

data_path = "/content"

print("Path set to:", data_path)
print("Contents of data_path:\n", os.listdir(data_path))

Path set to: /content
Contents of data_path:
 ['.config', '20_newsgroup', 'news20.tar.gz', 'sample_data']
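
The `!tar` shell command above only runs inside a notebook environment. As a minimal sketch, the same extraction step could be done with Python's standard `tarfile` module. The helper name `extract_archive` and the tiny demo archive built in a temporary directory are assumptions made for illustration; they are not part of the original notebook.

```python
import pathlib
import tarfile
import tempfile


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the top-level entry names."""
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


# Demo on a tiny archive created on the fly (hypothetical data, not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    src = tmp / "demo_newsgroup" / "comp.graphics"
    src.mkdir(parents=True)
    (src / "00001").write_text("hello", encoding="latin-1")

    archive = tmp / "demo.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(tmp / "demo_newsgroup", arcname="demo_newsgroup")

    extracted = extract_archive(archive, tmp / "out")
    print(extracted)
```

In the notebook itself, the call would presumably look like `extract_archive("/content/news20.tar.gz", "/content")`, assuming the archive is at that location.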


In [28]:
# Directory structure of the dataset
data_dir = pathlib.Path(data_path) / "20_newsgroup"
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", dirnames)

Number of directories: 20
Directory names: ['soc.religion.christian', 'rec.sport.baseball', 'rec.autos', 'alt.atheism', 'talk.politics.misc', 'comp.sys.ibm.pc.hardware', 'comp.graphics', 'rec.motorcycles', 'comp.sys.mac.hardware', 'comp.windows.x', 'sci.space', 'misc.forsale', 'sci.electronics', 'talk.politics.guns', 'talk.politics.mideast', 'rec.sport.hockey', 'comp.os.ms-windows.misc', 'sci.med', 'talk.religion.misc', 'sci.crypt']


In [29]:
print(data_dir)

/content/20_newsgroup


In [30]:
# Some files from the "comp.graphics" category
fnames = os.listdir(data_dir / "comp.graphics")
print("Number of files in comp.graphics:", len(fnames))
print("Some example filenames:", fnames[:5])

Number of files in comp.graphics: 1000
Some example filenames: ['38258', '38778', '38884', '38337', '38585']


In [31]:
import pathlib

print("data_path:", data_path)
print("Contents of its parent directory:")
parent_dir = pathlib.Path(data_path).parent
print(list(parent_dir.iterdir()))


data_path: /content
Contents of its parent directory:
[PosixPath('/tmp'), PosixPath('/libx32'), PosixPath('/bin'), PosixPath('/lib64'), PosixPath('/sys'), PosixPath('/srv'), PosixPath('/dev'), PosixPath('/opt'), PosixPath('/home'), PosixPath('/mnt'), PosixPath('/run'), PosixPath('/proc'), PosixPath('/root'), PosixPath('/usr'), PosixPath('/lib'), PosixPath('/boot'), PosixPath('/media'), PosixPath('/sbin'), PosixPath('/var'), PosixPath('/lib32'), PosixPath('/etc'), PosixPath('/content'), PosixPath('/.dockerenv'), PosixPath('/tools'), PosixPath('/datalab'), PosixPath('/python-apt'), PosixPath('/python-apt.tar.xz'), PosixPath('/NGC-DL-CONTAINER-LICENSE'), PosixPath('/cuda-keyring_1.1-1_all.deb')]


In [32]:
# Example text from the "comp.graphics" category
print((data_dir / "comp.graphics" / "37261").read_text(encoding="latin-1"))

Xref: cantaloupe.srv.cs.cmu.edu comp.graphics:37261 alt.graphics:519 comp.graphics.animation:2614
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu!zaphod.mps.ohio-state.edu!darwin.sura.net!dtix.dt.navy.mil!oasys!lipman
From: lipman@oasys.dt.navy.mil (Robert Lipman)
Newsgroups: comp.graphics,alt.graphics,comp.graphics.animation
Subject: CALL FOR PRESENTATIONS: Navy SciViz/VR Seminar
Message-ID: <32850@oasys.dt.navy.mil>
Date: 19 Mar 93 20:10:23 GMT
Article-I.D.: oasys.32850
Expires: 30 Apr 93 04:00:00 GMT
Reply-To: lipman@oasys.dt.navy.mil (Robert Lipman)
Followup-To: comp.graphics
Distribution: usa
Organization: Carderock Division, NSWC, Bethesda, MD
Lines: 65


			CALL FOR PRESENTATIONS
	
      NAVY SCIENTIFIC VISUALIZATION AND VIRTUAL REALITY SEMINAR

			Tuesday, June 22, 1993

	    Carderock Division, Naval Surface Warfare Center
	      (formerly the David Taylor Research Center)
			  Bethesda, Maryland

SPONSOR: NESS (Navy Engineering Software System) is sponsori

In [33]:
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()

# Average number of spaCy tokens over the first five comp.graphics files
num_files = 5
total_tokens = 0
for fname in fnames[:num_files]:
    doc = nlp((data_dir / "comp.graphics" / fname).read_text(encoding="latin-1"))
    total_tokens += len(doc)

average_tokens = total_tokens / num_files

print(average_tokens)

2020.4
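
The cell above averages spaCy token counts over five files. A minimal sketch of the same averaging step, written as a reusable function that accepts any tokenizer, might look like the following. The name `average_token_count` is hypothetical, and the default whitespace tokenizer (`str.split`) is only a stand-in: it will generally produce different counts than spaCy.

```python
def average_token_count(texts, tokenize=str.split):
    """Return the mean number of tokens per text under the given tokenizer."""
    if not texts:
        raise ValueError("texts must not be empty")
    return sum(len(tokenize(t)) for t in texts) / len(texts)


# Toy inputs for illustration only.
demo_texts = ["the cat sat on the mat", "hello world"]
demo_average = average_token_count(demo_texts)
print(demo_average)
```

With spaCy loaded, one could presumably pass `tokenize=nlp` to get the same kind of count as in the cell above, since `len()` works on a spaCy `Doc`.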


In [34]:
# Some files from the "talk.politics.misc" category
fnames = os.listdir(data_dir / "talk.politics.misc")
print("Number of files in talk.politics.misc:", len(fnames))
print("Some example filenames:", fnames[:5])

Number of files in talk.politics.misc: 1000
Some example filenames: ['178913', '178805', '178437', '178859', '176903']


In [35]:
# Example text from the "talk.politics.misc" category
print((data_dir / "talk.politics.misc" / "178463").read_text(encoding="latin-1"))

Xref: cantaloupe.srv.cs.cmu.edu talk.politics.guns:54219 talk.politics.misc:178463
Newsgroups: talk.politics.guns,talk.politics.misc
Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!spool.mu.edu!darwin.sura.net!martha.utcc.utk.edu!FRANKENSTEIN.CE.UTK.EDU!VEAL
From: VEAL@utkvm1.utk.edu (David Veal)
Subject: Re: Proof of the Viability of Gun Control
Message-ID: <VEAL.749.735192116@utkvm1.utk.edu>
Lines: 21
Sender: usenet@martha.utcc.utk.edu (USENET News System)
Organization: University of Tennessee Division of Continuing Education
References: <1qpbqd$ntl@access.digex.net> <C5otvp.ItL@magpie.linknet.com>
Date: Mon, 19 Apr 1993 04:01:56 GMT

[alt.drugs and alt.conspiracy removed from newsgroups line.]

In article <C5otvp.ItL@magpie.linknet.com> neal@magpie.linknet.com (Neal) writes:

>   Once the National Guard has been called into federal service,
>it is under the command of the present. Tha N

In [36]:
# Select only certain classes
list_all_dir = [
    'alt.atheism',
    'comp.graphics',
    'comp.sys.mac.hardware',
    'comp.windows.x',
    'misc.forsale',
    'rec.autos',
    'rec.sport.baseball',
    'rec.sport.hockey',
    'sci.crypt',
    'sci.med',
    'sci.space',
    'soc.religion.christian',
    'talk.politics.guns',
    'talk.politics.misc',
    'talk.religion.misc'
]

In [37]:
samples = []
labels = []
class_names = []
for class_index, dirname in enumerate(list_all_dir):
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print("Processing %s, %d files found" % (dirname, len(fnames)))
    for fname in fnames:
        # latin-1 can decode any byte sequence, so no file fails to load
        content = (dirpath / fname).read_text(encoding="latin-1")
        # Drop the first 10 lines, which hold most of the message header
        lines = content.split("\n")[10:]
        samples.append("\n".join(lines))
        labels.append(class_index)

print("Classes:", class_names)
print("Number of samples:", len(samples))

Processing alt.atheism, 1000 files found
Processing comp.graphics, 1000 files found
Processing comp.sys.mac.hardware, 1000 files found
Processing comp.windows.x, 1000 files found
Processing misc.forsale, 1000 files found
Processing rec.autos, 1000 files found
Processing rec.sport.baseball, 1000 files found
Processing rec.sport.hockey, 1000 files found
Processing sci.crypt, 1000 files found
Processing sci.med, 1000 files found
Processing sci.space, 1000 files found
Processing soc.religion.christian, 997 files found
Processing talk.politics.guns, 1000 files found
Processing talk.politics.misc, 1000 files found
Processing talk.religion.misc, 1000 files found
Classes: ['alt.atheism', 'comp.graphics', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.misc', 'talk.religion.misc']
Number of samples: 14997
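
The loader drops a fixed number of lines, so headers longer than ten lines can leak into the samples. One possible alternative, sketched here under the assumption that each message separates its header from its body with the first blank line, is to split at that blank line instead. The function name `strip_header` and the demo message are hypothetical, and this variant would change the training data compared to the original notebook.

```python
def strip_header(message):
    """Return everything after the first blank line of a message.

    If the message contains no blank line, it is returned unchanged.
    """
    _head, separator, body = message.partition("\n\n")
    return body if separator else message


# Made-up message for illustration only.
demo_message = (
    "From: someone@example.com\n"
    "Subject: test\n"
    "Lines: 2\n"
    "\n"
    "First body line.\n"
    "Second body line.\n"
)
demo_body = strip_header(demo_message)
print(demo_body)
```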


# Shuffling the data to split it into training and validation sets

In [38]:
# Shuffle the data
seed = 1337
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)
keras.utils.set_random_seed(seed)

# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]
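
The shuffle above keeps samples and labels aligned because both lists have the same length and each is shuffled by a generator freshly seeded with the same value. A sketch of an alternative that makes the alignment explicit is to draw a single permutation and apply it to both lists. The helper name `shuffle_together` is an assumption, and the resulting order is not guaranteed to match the order produced by the original cell.

```python
import numpy as np


def shuffle_together(samples, labels, seed=1337):
    """Return shuffled copies of samples and labels using one shared permutation."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(samples))
    return [samples[i] for i in order], [labels[i] for i in order]


# Toy data: each sample "a".."e" is paired with the label 0..4.
demo_samples = ["a", "b", "c", "d", "e"]
demo_labels = [0, 1, 2, 3, 4]
shuffled_samples, shuffled_labels = shuffle_together(demo_samples, demo_labels)

# Every sample is still paired with its original label.
pairs = list(zip(shuffled_samples, shuffled_labels))
print(pairs)
```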

In [39]:
print(train_samples[:3])

["Lines: 13\n\nIn article <C5t05K.DB6@research.canon.oz.au>, enzo@research.canon.oz.au (Enzo Liguori) writes:\n\n<<<most of message deleted>>>\n\n> What about light pollution in observations? (I read somewhere else that\n> it might even be visible during the day, leave alone at night).\n\n> Really, really depressed.\n> \n>              Enzo\n\nNo need to be depressed about this one.  Lights aren't on during the day\nso there shouldn't be any daytime light pollution.\n", "\nIn article <30121@ursa.bear.com>, halat@pooh.bears (Jim Halat) writes:\n>In article <115288@bu.edu>, jaeger@buphy.bu.edu (Gregg Jaeger) writes:\n>>\n>>He'd have to be precise about is rejection of God and his leaving Islam.\n>>One is perfectly free to be muslim and to doubt and question the\n>>existence of God, so long as one does not _reject_ God. I am sure that\n>>Rushdie has be now made his atheism clear in front of a sufficient \n>>number of proper witnesses. The question in regard to the legal issue\n>>is his st

In [40]:
print(val_samples[:1])

["NNTP-Posting-Host: hardy.u.washington.edu\n\npb6755@csc.albany.edu (BROWN PHILIP H) writes:\n\n>I watched the final inning of Bosio's no-hitter with several people at\n>work. After Vizquel made that barehanded grab of the chopper up the\n>middle, someone remarked that if he had fielded it with his glove, he\n>wouldn't have had time to throw Riles out. Yet, the throw beat Riles\n>by about two steps. I wonder how many others who watched the final out\n>think Vizquel had no choice but to make the play with his bare hand.\n\nIn this morning's paper (or was it on the radio?), Vizquel was quoted as\nsaying that he could have fielded the ball with his glove and still\neasily thrown out Riles, that he barehanded it instead so as to make the\nfinal play more memorable.  Seems a litle cocky to me, but he made it\nwork so he's entitled.\n-- \nDoug Dudgeon                             Dept. of Chemical Engineering, BF-10\ndudgeon@opus.cheme.washington.edu        University of Washington, Seattle\

In [41]:
print(train_labels[:1])

[10]


In [42]:
print(val_labels[:1])

[6]


# Tokenizing the words with TextVectorization

In [43]:
from tensorflow.keras.layers import TextVectorization
vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)

In [44]:
vectorizer.get_vocabulary()[:5]

['', '[UNK]', 'the', 'to', 'of']

In [45]:
len(vectorizer.get_vocabulary())

20000

# Inspecting the vectorizer output

In [46]:
output = vectorizer([["the cat sat on the mat"]])
output.numpy()[0, :6]

array([   2, 3709, 2056,   19,    2, 9656])

In [47]:
output

<tf.Tensor: shape=(1, 200), dtype=int64, numpy=
array([[   2, 3709, 2056,   19,    2, 9656,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,   

In [48]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [49]:
test = ["the", "cat", "sat", "on", "the", "mat"]
[word_index[w] for w in test]

[2, 3709, 2056, 19, 2, 9656]
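
To make the mapping above more concrete, here is a rough pure-Python sketch of the same idea: lowercase the text and strip punctuation, build a frequency-ordered vocabulary with index 0 reserved for padding (`''`) and index 1 for unknown words (`'[UNK]'`), then convert text to a fixed-length list of indices. It is a simplified illustration, not the actual `TextVectorization` implementation; the function names and the alphabetical tie-breaking between equally frequent words are assumptions.

```python
import re
import string
from collections import Counter


def standardize(text):
    """Lowercase the text and remove ASCII punctuation."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())


def build_vocabulary(texts, max_tokens):
    """Return ['', '[UNK]'] followed by the most frequent words, up to max_tokens entries."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Sort by descending frequency; ties are broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ["", "[UNK]"] + [word for word, _ in ranked[: max_tokens - 2]]


def vectorize(text, word_index, sequence_length):
    """Map words to indices (1 for unknown), then truncate or zero-pad to sequence_length."""
    ids = [word_index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


# Toy corpus for illustration only.
demo_corpus = ["The cat sat on the mat.", "The dog sat."]
demo_voc = build_vocabulary(demo_corpus, max_tokens=6)
demo_index = {word: i for i, word in enumerate(demo_voc)}
demo_ids = vectorize("the cat sat on the sofa", demo_index, sequence_length=8)
print(demo_voc)
print(demo_ids)
```

Words that did not make it into the vocabulary ("on" and "sofa" in this toy case) map to index 1, and the trailing zeros are padding, mirroring the zero-padded tensor shown earlier.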

# Tokenizing the training and validation data

In [50]:
x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

y_train = np.array(train_labels)
y_val = np.array(val_labels)

# Building and training the model

In [51]:
from tensorflow.keras import layers

# Number of classes = number of categories in class_names
num_classes = len(class_names)

# Define a sequential model with:
# 1) An Embedding layer that maps token indices to vectors
# 2) GlobalAveragePooling1D, which averages the vectors over the whole sequence
# 3) A hidden Dense layer with ReLU, followed by Dropout
# 4) A Dense output layer with softmax for multiclass classification
modeloEmbeddingGloveTransformers = keras.Sequential(
    [
        layers.Embedding(
            input_dim=20000,        # Matches max_tokens in TextVectorization
            output_dim=64,          # Embedding dimension
            input_length=200        # Matches output_sequence_length
        ),
        layers.GlobalAveragePooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(num_classes, activation="softmax"),
    ],
    name="modeloEmbeddingGloveTransformers"
)

# Compile the model
modeloEmbeddingGloveTransformers.compile(
    loss="sparse_categorical_crossentropy",  # Sparse because y_train and y_val are integer labels, not one-hot
    optimizer="adam",
    metrics=["accuracy"]
)

# Print a summary of the architecture
modeloEmbeddingGloveTransformers.summary()

# Train the model
history = modeloEmbeddingGloveTransformers.fit(
    x_train,
    y_train,
    epochs=5,
    batch_size=32,
    validation_data=(x_val, y_val)
)



Epoch 1/5
375/375 ━━━━━━━━━━━━━━━━━━━━ 7s 14ms/step - accuracy: 0.1270 - loss: 2.6228 - val_accuracy: 0.3044 - val_loss: 2.1198
Epoch 2/5
375/375 ━━━━━━━━━━━━━━━━━━━━ 10s 12ms/step - accuracy: 0.3459 - loss: 1.9382 - val_accuracy: 0.5922 - val_loss: 1.4867
Epoch 3/5
375/375 ━━━━━━━━━━━━━━━━━━━━ 6s 16ms/step - accuracy: 0.5404 - loss: 1.4009 - val_accuracy: 0.6959 - val_loss: 1.1275
Epoch 4/5
375/375 ━━━━━━━━━━━━━━━━━━━━ 5s 12ms/step - accuracy: 0.6812 - loss: 1.0161 - val_accuracy: 0.7449 - val_loss: 0.8871
Epoch 5/5
375/375 ━━━━━━━━━━━━━━━━━━━━ 5s 12ms/step - accuracy: 0.7649 - loss: 0.7611 - val_accuracy: 0.7806 - val_loss: 0.7430

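The first two layers of the model can be pictured with plain NumPy: the embedding is a table lookup, and global average pooling is a mean over the sequence axis. The sketch below uses toy sizes and a random matrix in place of trained weights, so all values are illustrative assumptions. Note that with the default settings used in this notebook no mask is applied, so the padding positions (index 0) are included in the average.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy sizes for illustration; the notebook uses 20000 tokens and 64 dimensions.
vocab_size, embed_dim = 10, 4
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# One sequence of 5 token indices, padded with 0 at the end.
token_ids = np.array([[2, 5, 7, 0, 0]])

# Embedding: look up one row of the matrix per token -> shape (1, 5, 4)
embedded = embedding_matrix[token_ids]

# Global average pooling: mean over the sequence axis -> shape (1, 4)
pooled = embedded.mean(axis=1)

print(embedded.shape, pooled.shape)
```

The pooled vector has a fixed size regardless of sequence length, which is what allows the dense layers that follow to operate on it.
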

# Evaluation

In [55]:
import tensorflow as tf

string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = modeloEmbeddingGloveTransformers(x)
end_to_end_model = keras.Model(string_input, preds)

# Prepare the texts as a tensor of strings
pred_texts = tf.constant(
    [["this message is about computer graphics and 3D modeling"]],
    dtype=tf.string
)

probabilities = end_to_end_model.predict(pred_texts)
print("Predictions:", probabilities)
print("Predicted class:", class_names[np.argmax(probabilities[0])])

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 257ms/step
Predictions: [[0.06691647 0.4548046  0.05941276 0.04086138 0.07130327 0.03141871
  0.0046276  0.00063965 0.00800469 0.09905253 0.01470952 0.11056899
  0.00333143 0.0046735  0.02967502]]
Predicted class: comp.graphics
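
`np.argmax` reports only the single most likely class. To see how confident the model is relative to the runners-up, one option is to list the top few classes with their probabilities. The sketch below copies the class names and the probability vector from the outputs above; the helper name `top_k` is hypothetical.

```python
import numpy as np

# Class names and probabilities copied from the notebook outputs above.
class_names = [
    "alt.atheism", "comp.graphics", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.sport.baseball", "rec.sport.hockey",
    "sci.crypt", "sci.med", "sci.space", "soc.religion.christian",
    "talk.politics.guns", "talk.politics.misc", "talk.religion.misc",
]
probabilities = np.array([[
    0.06691647, 0.4548046, 0.05941276, 0.04086138, 0.07130327, 0.03141871,
    0.0046276, 0.00063965, 0.00800469, 0.09905253, 0.01470952, 0.11056899,
    0.00333143, 0.0046735, 0.02967502,
]])


def top_k(probs, names, k=3):
    """Return the k (class name, probability) pairs with the highest probability."""
    order = np.argsort(probs)[::-1][:k]
    return [(names[i], float(probs[i])) for i in order]


top3 = top_k(probabilities[0], class_names)
for name, prob in top3:
    print(f"{name}: {prob:.3f}")
```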


In [59]:
pred_texts = tf.constant(
    [["politics and federal courts law that people understand with politician and elects congressman"]],
    dtype=tf.string
)

probabilities = end_to_end_model.predict(pred_texts)
predicted_label = class_names[np.argmax(probabilities[0])]
print("Predicted class:", predicted_label)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
Predicted class: talk.religion.misc


In [60]:
pred_texts = tf.constant(
    [["we are talking about religion"]],
    dtype=tf.string
)

probabilities = end_to_end_model.predict(pred_texts)
predicted_label = class_names[np.argmax(probabilities[0])]
print("Predicted class:", predicted_label)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 57ms/step
Predicted class: soc.religion.christian
