### Text vectorization and Naïve Bayes classification with the 20 newsgroups dataset

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.metrics import f1_score

# 20 newsgroups is a classic NLP dataset, so it ships with sklearn
# already formatted
from sklearn.datasets import fetch_20newsgroups
import numpy as np

## Loading the data

In [2]:
# load the data (already split into train and test by default)
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

## Vectorization

In [3]:
# instantiate a vectorizer
# see the sklearn documentation for the available constructor parameters
tfidfvect = TfidfVectorizer()

In [4]:
# the `data` attribute holds the raw text
newsgroups_train.data[0]

'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.'

In [5]:
# with the usual sklearn interface we can fit the vectorizer
# (learn the vocabulary and compute the IDF vector)
# and transform the data in a single call
X_train = tfidfvect.fit_transform(newsgroups_train.data)
# `X_train` is the document-term matrix
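
To make the fitting step more concrete, here is a minimal sketch on a hypothetical three-sentence corpus (the names `toy_docs`, `toy_vect` and `toy_X` are illustrative and not part of the notebook). It shows what fitting produces: a vocabulary that maps each term to a column, and one IDF weight per column. Terms that occur in every document get the lowest IDF.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "the car is red",
    "the car is fast",
    "the sky is blue",
]

toy_vect = TfidfVectorizer()
toy_X = toy_vect.fit_transform(toy_docs)

# vocabulary_: term -> column index; idf_: one IDF weight per column
for term, col in sorted(toy_vect.vocabulary_.items(), key=lambda kv: kv[1]):
    print(f"{term:>5}  column={col}  idf={toy_vect.idf_[col]:.3f}")

# one row per document, one column per vocabulary term
print(toy_X.shape)
```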

In [6]:
# count-based vectorizations are sparse, so sklearn conveniently
# returns the document vectors as sparse matrices
print(type(X_train))
print(f'shape: {X_train.shape}')
print(f'number of documents: {X_train.shape[0]}')
print(f'vocabulary size (dimensionality of the vectors): {X_train.shape[1]}')

<class 'scipy.sparse._csr.csr_matrix'>
shape: (11314, 101631)
number of documents: 11314
vocabulary size (dimensionality of the vectors): 101631
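
How sparse such a matrix is can be measured from its `nnz` attribute (the number of stored entries). The sketch below does so on a hypothetical toy corpus; the same expression should apply unchanged to `X_train`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "the car is red",
    "the car is fast",
    "the sky is blue",
]
toy_X = TfidfVectorizer().fit_transform(toy_docs)

n_docs, n_terms = toy_X.shape
# fraction of cells that actually hold a value
density = toy_X.nnz / (n_docs * n_terms)
print(f"stored entries: {toy_X.nnz} of {n_docs * n_terms} cells")
print(f"density: {density:.2%}")
```

On a real corpus the density is typically far lower than in this toy case, because each document uses only a small fraction of the vocabulary.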


In [7]:
# once the vectorizer is fitted we can access attributes such as the learned
# vocabulary: a dictionary that maps terms to indices.
# The index is the term's position in the document vector.
tfidfvect.vocabulary_['car']

25775

In [8]:
# it is very useful to have the inverse dictionary, from indices to terms
idx2word = {v: k for k, v in tfidfvect.vocabulary_.items()}

In [10]:
# there are 20 classes, one per newsgroup
print(f'classes {np.unique(newsgroups_test.target)}')
newsgroups_test.target_names

classes [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]


['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

### Challenge 1 assignment

**1**. Vectorize the documents. Pick 5 documents at random and measure their similarity to the rest of the documents.
Study the 5 most similar documents for each one and analyze whether the similarity makes sense
given the text content and the classification label.

**2**. Train Naïve Bayes classification models to maximize classification performance
(macro F1 score) on the test set. Consider changing the constructor parameters
of the vectorizer and of the models, and try both the Multinomial Naïve Bayes
and ComplementNB models.

**3**. Transpose the document-term matrix. This yields a
term-document matrix that can be interpreted as a collection of word vectors.
Then study similarity between words by picking 5 words and examining the 5 most similar to each.


## Exercise 1

In [47]:
# Draw 5 random integers: the indices of the documents we will inspect
# (np.random.randint samples with replacement, so repeats are possible)
randomIdx = np.random.randint(len(newsgroups_train.data), size=5)
# Retrieve the documents; we print them further down
randomDocuments = [newsgroups_train.data[idx] for idx in randomIdx]
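
If reproducible and guaranteed-distinct indices are preferred, one option is NumPy's `Generator` API. This is only a sketch: the seed value and the names `rng`, `n_documents` and `sample_idx` are assumptions, and the outputs shown in this notebook were not produced with it.

```python
import numpy as np

# size of the training split (11314 documents, as reported by X_train.shape)
n_documents = 11314

# the seed value 0 is an arbitrary assumption; any fixed integer works
rng = np.random.default_rng(seed=0)

# replace=False guarantees five distinct indices
sample_idx = rng.choice(n_documents, size=5, replace=False)
print(sample_idx)
```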

In [48]:
# Instantiate a vectorizer
tfidfvect = TfidfVectorizer()

In [49]:
# Fit the vectorizer and transform the data
X_train = tfidfvect.fit_transform(newsgroups_train.data)
# Class labels of the training documents (used below as `y_train`)
y_train = newsgroups_train.target

In [50]:
def get_highest_similarity_texts(idx, docTermMatrix, number=5):
    # cosine similarity between document `idx` and every training document
    cossim = cosine_similarity(docTermMatrix[idx], docTermMatrix)[0]
    # indices of the `number` most similar documents;
    # we skip position 0 because the top match is assumed to be the document itself
    return np.argsort(cossim)[::-1][1:number+1]
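
The ranking logic of this function can be checked on a small case. The sketch below uses a hypothetical three-document corpus and also illustrates a property that holds with the default `TfidfVectorizer` settings (`norm='l2'`): rows have unit length, so cosine similarity reduces to a sparse dot product.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "the shuttle reached orbit",
    "the shuttle launch was delayed",
    "the pitcher threw a fastball",
]
toy_X = TfidfVectorizer().fit_transform(toy_docs)

# similarity of document 0 against every document (itself included)
sims = cosine_similarity(toy_X[0], toy_X)[0]
ranking = np.argsort(sims)[::-1]
print(ranking)        # query document first, then neighbours by similarity
print(sims.round(3))

# with unit-length rows, cosine similarity equals the dot product
dot = (toy_X[0] @ toy_X.T).toarray()[0]
print(np.allclose(sims, dot))
```

Document 1 shares two terms with the query and document 2 only one, so the ranking follows the amount of shared vocabulary.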

### First document

In [51]:
# Class label
print(newsgroups_train.target_names[y_train[randomIdx[0]]])

sci.space


In [52]:
highest_sim = get_highest_similarity_texts(randomIdx[0],X_train)

print(randomDocuments[0])
for i in highest_sim:
  print(newsgroups_train.target_names[y_train[i]])

Is anybody out there willing to discuss with me careers in the Army that deal
with space?  After I graduate, I will have a commitment to serve in the Army, 
and I would like to spend it in a space-related field.  I saw a post a long
time ago about the Air Force Space Command which made a fleeting reference to
its Army counter-part.  Any more info on that would be appreciated.  I'm 
looking for things like: do I branch Intelligence, or Signal, or other?  To
whom do I voice my interest in space?  What qualifications are necessary?
Etc, etc.  BTW, my major is computer science engineering.

Please reply to ktj@reef.cis.ufl.edu
sci.space
sci.space
sci.space
sci.space
sci.space


All five appear to share the same topic, so a high cosine similarity makes sense.

### Second document

In [53]:
# Class label
print(newsgroups_train.target_names[y_train[randomIdx[1]]])

comp.os.ms-windows.misc


In [54]:
highest_sim = get_highest_similarity_texts(randomIdx[1],X_train)

print(randomDocuments[1])
for i in highest_sim:
  print(newsgroups_train.target_names[y_train[i]])



I don't believe IRQ5 is the problem. I tried a mouse on COM3, IRQ4 (the
usual place) and it still did not like it. Simply, Windows seems to only
support mice on COM1 or COM2. The funny part is, though, that
Microsoft's own mouse driver (8.xx) was quite happy with my mouse
sitting on COM3. Why can't Windows use the mouse driver, or at least
support COM3? :-)


I've tried this too. Actually, I wanted to be able to use my second
modem (COM3/IRQ5) from Windows. It still will not talk to that modem. I
created two profiles, AMSTRAD (for my Amstrad modem on COM1/IRQ4) and
MAESTRO (for my Maestro on COM3/IRQ5). It will not talk to the Maestro
at all.


Nor here. (Windows 3.0).


I've seen nothing like that. I've experimented with Logitech's mouse
driver too, with no sucess.


If you have a SoundBlaster Pro, it should support IRQ10 as well.
Finally, a board that supports IRQs >9. The only one I have (except my
IDE controller).

hamish

comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys

There is some confusion, at least in the labels of the nearest documents. Let's pick one of those texts, for example the second, and print it to analyze its content.

In [56]:
print(newsgroups_train.data[highest_sim[1]])


Not completly true.  For AT class and later machines, IRQ5 is
reserved for LPT2.  Since it's rare to have a second parallel
port in a PC, it's usually a good safe choice if you need an
interrupt.

On the other hand, we just ran into a problem with that here
at work on a Gateway computer (4DX-33V).  It has a Modem on COM1,
a Mouse on COM2, and the other serial port was set to COM3 (which
normally uses the same interrupt as COM1).  We had a real fight
with a board when trying to use IRQ5, and discoverd the problem
was that Gateway had set it up such that COM3 used IRQ5.  As soon
as we disabled COM3, our problems went away.  Grumble ... after
several days of trying to figure out why the interrupt didn't work.


Although its label is comp.sys.ibm.pc.hardware, this text also discusses modems, COM ports, and IRQs, so the high similarity is reasonable.

### Third document

In [57]:
# Class label
print(newsgroups_train.target_names[y_train[randomIdx[2]]])

rec.sport.baseball


In [60]:
highest_sim = get_highest_similarity_texts(randomIdx[2],X_train)

print(randomDocuments[2])
for i in highest_sim:
  print(newsgroups_train.target_names[y_train[i]])

Was going over some videos last night.....

Studying 1986 and 1992 videotapes of Jose Canseco proved to be very
interesting.  And enlightening.

Here's my analysis of Jose Canseco, circa Sep '92, and Jose Canseco,
circa June 1986.

1.  He's bulked up too much.  Period.  He needs to LOSE about 20 pounds,
    not gain more bulk.

2.  His bat speed has absolutely VANISHED.  Conservatively, I'd say he's
    lost 4%-7% of his bat speed, and that's a HUGE amount of speed.

3.  That open stance is KILLING him.   Note that he acts sort of like
    Brian Downing - way open to start, then closes up as ball is
    released.  Downing could do this without significant head movement -
    Canseco can't.  Also, note that Canseco doesn't always close his
    stance the same way - sometimes, his hips are open, sometimes,
    they're fully closed.  Without a good starting point, it's hard
    to make adjustments in your swing.

What would I do, if I were Jose?

Aside from salting away a large sum of a c

The text is about baseball, yet a text about religion shows up among its 5 most similar documents. Let's print it to look for similarities.

In [61]:
print(newsgroups_train.data[highest_sim[3]])

A listmember (D Andrew Killie, I think) wrote, in response to the
suggestion that genocide may sometimes be the will of God:

 > Any God who works that way is indescribably evil,
 > and unworthy of my worship or faith.

Nobuya "Higgy" Higashiyama replied (as, in substance, did others):

 > Where is your source of moral standards by which you judge God's
 > behavior?

It is often argued that we have no standing by which to judge God's
actions.  Who is the clay to talk back to the potter? But we find a
contrary view in Scripture. When God proposes to destroy the city of

 + Suppose that there are some good men in the city.
 + Will you destroy the righteous along with the wicked?
 + Far be it from you, Lord, to do such a thing!
 + Shall not the Judge of all the earth do right?

I am told that the Hebrew is actually a bit stronger than this, and
can perhaps be better rendered (dynamic equivalence) as

 + Shame on you, Lord, if you do such a thing!

There are those who say that the definiti

This text is fairly long and appears to criticize positions about God. Although it has nothing to do with baseball, the baseball text is itself a critique of a player, so the two may share argumentative vocabulary, which could explain the high similarity.

### Fourth document

In [64]:
# Class label
print(newsgroups_train.target_names[y_train[randomIdx[3]]])

rec.sport.baseball


In [65]:
highest_sim = get_highest_similarity_texts(randomIdx[3],X_train)

print(randomDocuments[3])
for i in highest_sim:
  print(newsgroups_train.target_names[y_train[i]])

Count me interested in a Cardinal's mailing list.  If anyone
finds one or starts one, please let me know.

Thanks,

Dick Detweiler
comp.windows.x
sci.electronics
rec.sport.baseball
rec.sport.baseball
rec.sport.baseball


Let's look at the most similar one, which comes from a computing newsgroup.

In [66]:
print(newsgroups_train.data[highest_sim[0]])

Please subscribe me to this mailing list


As we can see, the two texts share many words. The first one is hard to classify blindly as a baseball post, which is why it gets matched with unrelated topics.

### Fifth document

In [67]:
# Class label
print(newsgroups_train.target_names[y_train[randomIdx[4]]])

comp.sys.mac.hardware


In [68]:
highest_sim = get_highest_similarity_texts(randomIdx[4],X_train)

print(randomDocuments[4])
for i in highest_sim:
  print(newsgroups_train.target_names[y_train[i]])



OTOH, some of us get lucky-- I've unplugged and replugged SCSI and ADB
quite often, and never blown anything.  I blew out the ADB by shorting
the cable, though.

comp.sys.mac.hardware
comp.sys.mac.hardware
comp.sys.mac.hardware
comp.sys.mac.hardware
comp.sys.mac.hardware


In this case all of the most similar documents share the same topic, probably because of the mentions of SCSI and ADB.

## Exercise 2

In [72]:
X_test = tfidfvect.transform(newsgroups_test.data)
y_test = newsgroups_test.target

In [73]:
# Train Multinomial Naive Bayes with default parameters
clf = MultinomialNB()
clf.fit(X_train, y_train)
# Measure the macro F1 score on the test data
y_pred = clf.predict(X_test)
f1_score(y_test, y_pred, average='macro')

0.5854345727938506

In [75]:
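# Train Multinomial Naive Bayes again; with the default alpha=1.0,
# force_alpha=True has no effect, so the score matches the previous run
clf = MultinomialNB(force_alpha=True)
clf.fit(X_train, y_train)
# Measure the macro F1 score on the test data
y_pred = clf.predict(X_test)
f1_score(y_test, y_pred, average='macro')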

0.5854345727938506

In [82]:
# Train Multinomial Naive Bayes with much weaker smoothing
clf = MultinomialNB(alpha=0.01, force_alpha=True)
clf.fit(X_train, y_train)
# Measure the macro F1 score on the test data
y_pred = clf.predict(X_test)
f1_score(y_test, y_pred, average='macro')

0.682861129525057

Lowering alpha to 0.01 improves the F1 score substantially (from about 0.585 to about 0.683).
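
For context, the scikit-learn documentation describes the smoothed estimate that `MultinomialNB` uses for the weight of feature $i$ in class $y$ as:

$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha\, n}$$

Here $N_{yi}$ is the total weight of feature $i$ over the training documents of class $y$, $N_{y} = \sum_{i} N_{yi}$, and $n$ is the number of features. With a vocabulary of $n = 101631$ terms, the term $\alpha\, n$ can outweigh $N_{y}$ when $\alpha = 1$, which flattens the estimates towards a uniform distribution. This is a plausible reading of why a smaller $\alpha$ helps here, not a verified diagnosis.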

Let's now try ComplementNB.

In [83]:
# Train Complement Naive Bayes with default parameters
clf = ComplementNB()
clf.fit(X_train, y_train)
# Measure the macro F1 score on the test data
y_pred = clf.predict(X_test)
f1_score(y_test, y_pred, average='macro')

0.692953349950875

In [99]:
# Train Complement Naive Bayes with alpha=0.3
clf = ComplementNB(alpha=0.3, force_alpha=True)
clf.fit(X_train, y_train)
# Measure the macro F1 score on the test data
y_pred = clf.predict(X_test)
f1_score(y_test, y_pred, average='macro')

0.6999368058272992
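
The manual trials above could also be organized as a grid search. The sketch below is one possible setup, not the procedure used in this notebook: it selects the model and `alpha` by cross-validation on the training data (via `GridSearchCV`) instead of comparing scores on the test set, and it runs on a hypothetical toy corpus so that it is self-contained. The names `toy_docs`, `toy_labels`, `pipe` and `search` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for newsgroups_train.data / .target
toy_docs = [
    "the engine of this car is loud", "my car needs new tires",
    "the dealer sold me a used car", "this car has a fast engine",
    "the rocket reached orbit today", "the shuttle launch was delayed",
    "a new satellite is in orbit", "the rocket engine failed at launch",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# vectorizer and classifier chained so both are refit inside each fold
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# try both Naive Bayes variants with several smoothing values
param_grid = {
    "nb": [MultinomialNB(), ComplementNB()],
    "nb__alpha": [0.01, 0.1, 0.3, 1.0],
}
search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=2)
search.fit(toy_docs, toy_labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With the real data, `toy_docs` and `toy_labels` would be replaced by `newsgroups_train.data` and `newsgroups_train.target`, and vectorizer parameters could be added to the grid under the `tfidf__` prefix.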

## Exercise 3

In [169]:
termDocMatrix = X_train.T
def find_closest_words(word):
    # cosine similarity between the row of `word` and every term row
    cossim = cosine_similarity(termDocMatrix[tfidfvect.vocabulary_[word]], termDocMatrix)[0]
    # skip position 0, assumed to be the word itself
    top5 = np.argsort(cossim)[::-1][1:6]
    print(f'The words most similar to {word} are:')
    for i in top5:
        print(idx2word[i])

In [170]:
find_closest_words('car')

The words most similar to car are:
cars
criterium
civic
owner
dealer


In [165]:
find_closest_words('god')

The words most similar to god are:
jesus
bible
that
existence
christ


In [166]:
find_closest_words('politics')

The words most similar to politics are:
iftccu
hesh
fascism
bmwmoa
lapse


In [167]:
find_closest_words('windows')

The words most similar to windows are:
dos
ms
microsoft
nt
for


In [168]:
find_closest_words('email')

The words most similar to email are:
please
me
replies
reconsidered
address

The results are not 'ideal' (function words such as 'that' and 'for' and rare tokens such as 'iftccu' show up), but they are consistent with what we observed in the dataset.