In [1]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.metrics import f1_score

## Task 1

Vectorize the documents. Pick 5 documents at random and measure their similarity to the rest of the documents.
Study the 5 most similar documents for each one and analyze whether the similarity
makes sense given the text content and the classification label.

### Data loading and vectorization

In [2]:
# Load the 20newsgroups dataset.
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

# TF-IDF vectorization of the training set.
tfidfvect = TfidfVectorizer()
X_train = tfidfvect.fit_transform(newsgroups_train.data)  # Document-term matrix
y_train = newsgroups_train.target

docs_count = X_train.shape[0]
print(f'The corpus consists of {docs_count} documents\n')

print(f'The documents are classified into the following classes\n{newsgroups_test.target_names}\n')

# Share of documents per class.
# Note: the counts come from the test split while docs_count is the size of the
# training split, so the printed shares do not add up to 100%.
for t in np.unique(newsgroups_test.target):
    print(f'{np.count_nonzero(newsgroups_test.target == t) / docs_count:.2%} of the data belongs to class {newsgroups_test.target_names[t]}')

print('\nSome of the vectorizer terms:')
for term in np.sort(list(tfidfvect.vocabulary_.keys()))[20000:20020]:
    print(term)

The corpus consists of 11314 documents

The documents are classified into the following classes
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

2.82% of the data belongs to class alt.atheism
3.44% of the data belongs to class comp.graphics
3.48% of the data belongs to class comp.os.ms-windows.misc
3.46% of the data belongs to class comp.sys.ibm.pc.hardware
3.40% of the data belongs to class comp.sys.mac.hardware
3.49% of the data belongs to class comp.windows.x
3.45% of the data belongs to class misc.forsale
3.50% of the data belongs to class rec.autos
3.52% of the data belong
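
The loop above takes its counts from `newsgroups_test.target` but divides by the number of training documents, so the printed shares do not add up to 100%. A minimal sketch of a per-split class distribution, assuming integer labels in `0..n_classes-1` as `fetch_20newsgroups` provides. The helper name `class_shares` and the toy labels are illustrative and not part of the notebook:

```python
import numpy as np

def class_shares(targets, target_names):
    """Return {class name: share of documents} for a single split."""
    targets = np.asarray(targets)
    # bincount counts how many documents carry each integer label.
    counts = np.bincount(targets, minlength=len(target_names))
    return {name: count / targets.size for name, count in zip(target_names, counts)}

# Toy labels standing in for newsgroups_train.target (hypothetical data).
toy_targets = [0, 0, 1, 2, 2, 2, 1, 0]
toy_names = ['alt.atheism', 'comp.graphics', 'sci.med']

shares = class_shares(toy_targets, toy_names)
for name, share in shares.items():
    print(f'{share:.2%} of the data belongs to class {name}')
```

In the notebook this would be called as `class_shares(newsgroups_train.target, newsgroups_train.target_names)`, keeping counts and denominator on the same split.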

### Selecting 5 documents and studying similarity

In [3]:
# Randomly pick the indices of 5 distinct documents from the training set.
# For each one, get the indices of its 5 most similar documents and store them
# in a dictionary.
indices = np.random.choice(X_train.shape[0], size=5, replace=False)
doc_to_similars = {}
for i in indices:
    cos_sim = cosine_similarity(X_train[i], X_train)[0]
    cos_sim[i] = -np.inf  # Exclude the document itself from the ranking.
    most_sim = np.argsort(cos_sim)[::-1][:5]
    doc_to_similars[i] = most_sim
    print(f'The 5 documents most similar to the document at index {i} are: {most_sim}')

The 5 documents most similar to the document at index 5210 are: [7068 6635 5446 6933 4253]
The 5 documents most similar to the document at index 10033 are: [  747  8754 10779  1519 10323]
The 5 documents most similar to the document at index 6423 are: [ 4988  5856  1220 10575  6635]
The 5 documents most similar to the document at index 4018 are: [ 2369 10217  9459  6542  9448]
The 5 documents most similar to the document at index 9961 are: [3981 3354 9973 5849 1225]
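
`TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the cosine similarity between two rows equals their dot product. A minimal sketch of a top-k lookup built on that property, which also removes the query row explicitly instead of relying on it being ranked first. `top_k_similar` and the toy corpus are hypothetical names introduced here for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_k_similar(X, i, k=5):
    """Indices of the k rows of X most similar to row i, excluding i itself."""
    # Dot products equal cosine similarities when rows are L2-normalized.
    sims = linear_kernel(X[[i]], X)[0]
    sims[i] = -np.inf                # never return the query row
    order = np.argsort(sims)[::-1]   # descending similarity
    return order[:k]

# Toy corpus standing in for newsgroups_train.data (hypothetical sentences).
toy_docs = [
    'the braves won the baseball game',
    'the baseball game went to extra innings',
    'xlib font property for the window',
    'create a window with xlib',
]
X_toy = TfidfVectorizer().fit_transform(toy_docs)
neighbours = top_k_similar(X_toy, 0, k=2)
print(neighbours)
```

Masking the query row matters when several rows tie with it, for example exact duplicates or documents left empty after stripping headers, footers and quotes.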


#### Analyzing document 1

In [4]:
# Document 1 is:
doc_idx = 0
print(newsgroups_train.data[indices[doc_idx]])

# Its target is:
target_doc = newsgroups_train.target_names[y_train[indices[doc_idx]]]
print(f'\nIts target is {target_doc}')

Top Ten Ways Slick Willie Could Improve His Standing With Americans



10. Institute a national sales tax to pay for the socialization of
    America's health care resources.

9.  Declare war on Serbia. Reenact the draft.

8.  Stimulate the economy with massive income transfers to Democtratic
    constituencies.

7.  Appoint an unrepetent socialist like Mario Cuomo to the Suprmeme Court.

6.  Focus like a laser beam on gays in the military.

5.  Put Hillary in charge of the Ministry of Truth and move Stephanopoulos
    over to socialzed health care.

4.  Balance the budget through confiscatory taxation.

3.  Remind everyone, again, how despite the Democrats holding the
    Presidency, the majority of seats in the House, and in the Senate,
    the Republicans have still managed to block his tax-and-spend programs.

2.  Go back to England and get a refresher course in European Socialism.

1.  Resign, now!



Copyright (c) Edward A. Ipser, Jr., 1993

Its target is talk.politics.misc


In [5]:
# The 5 documents most similar to document 1 are:
targets_accum = 0
for i in range(5):
    j = doc_to_similars[indices[doc_idx]][i]
    print(newsgroups_train.data[j])
    target_similar = newsgroups_train.target_names[y_train[j]]
    targets_accum += 1 if target_doc == target_similar else 0
    print(f'\nIts target is {target_similar}')
    print('\n====================================================\n')

print(f'\n{100 * targets_accum / 5}% of the 5 most similar documents share the same target.')

:>Top Ten Ways Slick Willie Could Improve His Standing With Americans
:>
:>10. Institute a national sales tax to pay for the socialization of
:>    America's health care resources.
:>
:>9.  Declare war on Serbia. Reenact the draft.
:>
:>8.  Stimulate the economy with massive income transfers to Democtratic
:>    constituencies.
:>
:>7.  Appoint an unrepetent socialist like Mario Cuomo to the Suprmeme Court.
:>
:>6.  Focus like a laser beam on gays in the military.
:>
:>5.  Put Hillary in charge of the Ministry of Truth and move Stephanopoulos
:>    over to socialzed health care.
:>
:>4.  Balance the budget through confiscatory taxation.
:>
:>3.  Remind everyone, again, how despite the Democrats holding the
:>    Presidency, the majority of seats in the House, and in the Senate,
:>    the Republicans have still managed to block his tax-and-spend programs.
:>
:>2.  Go back to England and get a refresher course in European Socialism.
:>

  ***SNIP***

And the number one way Slick Willie c

#### Analyzing document 2

In [6]:
# Document 2 is:
doc_idx = 1
print(newsgroups_train.data[indices[doc_idx]])

# Its target is:
target_doc = newsgroups_train.target_names[y_train[indices[doc_idx]]]
print(f'\nIts target is {target_doc}')



See, there you go again, saying that a moral act is only significant
if it is "voluntary."  Why do you think this?

And anyway, humans have the ability to disregard some of their instincts.


You are attaching too many things to the term "moral," I think.
Let's try this:  is it "good" that animals of the same species
don't kill each other.  Or, do you think this is right? 

Or do you think that animals are machines, and that nothing they do
is either right nor wrong?



Those weren't arbitrary killings.  They were slayings related to some sort
of mating ritual or whatnot.


Yes it was, but I still don't understand your distinctions.  What
do you mean by "consider?"  Can a small child be moral?  How about
a gorilla?  A dolphin?  A platypus?  Where is the line drawn?  Does
the being need to be self aware?

What *do* you call the mechanism which seems to prevent animals of
the same species from (arbitrarily) killing each other?  Don't
you find the fact that they don't at all significant

In [7]:
# The 5 documents most similar to document 2 are:
targets_accum = 0
for i in range(5):
    j = doc_to_similars[indices[doc_idx]][i]
    print(newsgroups_train.data[j])
    target_similar = newsgroups_train.target_names[y_train[j]]
    targets_accum += 1 if target_doc == target_similar else 0
    print(f'\nIts target is {target_similar}')
    print('\n====================================================\n')

print(f'\n{100 * targets_accum / 5}% of the 5 most similar documents share the same target.')


If you force me to do something, am I morally responsible for it?


Well, make up your mind.    Is it to be "instinctive not to murder"
or not?


It's not even correct.    Animals of the same species do kill
one another.


Sigh.   I wonder how many times we have been round this loop.

I think that instinctive bahaviour has no moral significance.
I am quite prepared to believe that higher animals, such as
primates, have the beginnings of a moral sense, since they seem
to exhibit self-awareness.


So what?     Are you trying to say that some killing in animals
has a moral significance and some does not?   Is this your
natural morality>



Are you blind?   What do you think that this sentence means?

	"There must be the possibility that the organism - it's not 
	just people we are talking about - can consider alternatives."

What would that imply?


I find the fact that they do to be significant. 

Its target is alt.atheism



/(hudson)
/If someone inflicts pain on themselves, whether the

#### Analyzing document 3

In [8]:
# Document 3 is:
doc_idx = 2
print(newsgroups_train.data[indices[doc_idx]])

# Its target is:
target_doc = newsgroups_train.target_names[y_train[indices[doc_idx]]]
print(f'\nIts target is {target_doc}')


IMHO, the original poster has no business soliciting diagnoses off the net,
nor does Dr./Mr.  O'Donnell have any business supplying same. This is one
major reason real physicians avoid this newsgroup like the plague. It is
also another example of the double standard: if I as a physician offered
to diagnose and treat on the net, I can be sued. But people without
qualifications are free to do whatever they want and disclaim it all with
"I'm not a doctor."

Get and keep this crap off the net. Period.

Its target is sci.med


In [9]:
# The 5 documents most similar to document 3 are:
targets_accum = 0
for i in range(5):
    j = doc_to_similars[indices[doc_idx]][i]
    print(newsgroups_train.data[j])
    target_similar = newsgroups_train.target_names[y_train[j]]
    targets_accum += 1 if target_doc == target_similar else 0
    print(f'\nIts target is {target_similar}')
    print('\n====================================================\n')

print(f'\n{100 * targets_accum / 5}% of the 5 most similar documents share the same target.')

GREAT post Martin.  Very informative, well-balanced, and humanitarian
without neglecting the need for scientific rigor.


(Cross-posted to alt.psychology.personality since some personality typing
will be discussed at the beginning - Note: I've set all followups to sci.med
since most of my comments are more sci.med oriented and I'm sure most of the
replies, if any, will be med-related.)




They are just responding in their natural way:  Hyper-Choleric Syndrome (HCS).
Oops, that is not a recognized "illness" in the psychological community,
better not say that since it therefore must not, and never will, exist.  :^)

Actually, it is fascinating that a disproportionate number of physicians
will type out as NT (for those not familiar with the Myers-Briggs system,
just e-mail me and I'll send a summary file to you).  In the general
population, NT's comprise only about 12% of the population, but among
physicians it is much much higher (I don't know the exact percentage -
any help here a.p.p.

#### Analyzing document 4

In [10]:
# Document 4 is:
doc_idx = 3
print(newsgroups_train.data[indices[doc_idx]])

# Its target is:
target_doc = newsgroups_train.target_names[y_train[indices[doc_idx]]]
print(f'\nIts target is {target_doc}')



Try this:

char *name=NULL;
unsigned long value;

if(XGetFontProperty(font, XA_FONT, value)) 
    name=XGetAtomName(dpy, value);

where dpy is your Display connection and font your XFontStruct pointer.


Its target is comp.windows.x


In [11]:
# The 5 documents most similar to document 4 are:
targets_accum = 0
for i in range(5):
    j = doc_to_similars[indices[doc_idx]][i]
    print(newsgroups_train.data[j])
    target_similar = newsgroups_train.target_names[y_train[j]]
    targets_accum += 1 if target_doc == target_similar else 0
    print(f'\nIts target is {target_similar}')
    print('\n====================================================\n')

print(f'\n{100 * targets_accum / 5}% of the 5 most similar documents share the same target.')

Anyone know how an application can retrieve the name of the font from
an application given an XFontStruct *? 
Would XGetFontProperty work if I passed XA_FONT_NAME? 
anyone know details of this?  Thanks in advance.
Brian


Its target is comp.windows.x


Greetings,


I am using an X server that provides 3 visuals:
PseudoColor 8 bit, Truecolor 24 bit and DirectColor 24 bit.

A problem occurs when I try to create a window with a visual that is different
from the visual of the parent (which uses the default visual which is TC24).

In the Xlib reference guide from 'O reilly one can read in the
section about XCteateWindow, something like:
"In the current implementation of X11: When using a visual other than the
parent's, be sure to create or find a suitable colourmap which is to be used 
in the window attributes when creating, or else a BadMatch occurs."

of the X11R5 guides.

However, even if I pass along a suitable colourmap, I still get a BadMatch
when I create a window with a non-default v

#### Analyzing document 5

In [12]:
# Document 5 is:
doc_idx = 4
print(newsgroups_train.data[indices[doc_idx]])

# Its target is:
target_doc = newsgroups_train.target_names[y_train[indices[doc_idx]]]
print(f'\nIts target is {target_doc}')

Y'all lighten up on Harry, Skip'll be like that in a couple of years!!>

Harry's a great personality.  He's the reason I like Cubs broadcasts.
(It's certainly not the quality of the team).

Chop Chop

Michael Mule'



Its target is rec.sport.baseball


In [13]:
# The 5 documents most similar to document 5 are:
targets_accum = 0
for i in range(5):
    j = doc_to_similars[indices[doc_idx]][i]
    print(newsgroups_train.data[j])
    target_similar = newsgroups_train.target_names[y_train[j]]
    targets_accum += 1 if target_doc == target_similar else 0
    print(f'\nIts target is {target_similar}')
    print('\n====================================================\n')

print(f'\n{100 * targets_accum / 5}% of the 5 most similar documents share the same target.')


Didn't Alicea get a hit, though? 

See y'all at the ballyard
Go Braves
Chop Chop

Michael Mule'


Its target is rec.sport.baseball



e,

If memory serves me well, Alicea hit it, and damn near tied the game.
Torre obviously knows his players better than you do. 


See y'all at the ballyard
Go Braves
Chop Chop

Michael Mule'


Its target is rec.sport.baseball


Deion Sanders hit a home run in his only AB today.  Nixon was 1 for 4.  Infield
single.  Deion's batting over .400 Nixon: around .200.   Whom would YOU start?
Wise up, Bobby. 


See y'all at the ballyard
Go Braves
Chop Chop

Michael Mule'


Its target is rec.sport.baseball


HEY!!! All you Yankee fans who've been knocking my prediction of Baltimore.
You flooded my mailbox with cries of "Militello's good, Militello's good."

Where is he??!! I noticed he got skipped over after that oh so strong first
outing.  He's not by any chance in Columbus  now, is he?  Please don't tell
me you're relying on this guy to be the *fourth*, not the f

### Comments

For each of 5 randomly chosen documents, the 5 most similar documents were retrieved, using TF-IDF as the vectorization technique and cosine similarity as the metric. For each analyzed document, the percentage of its 5 most similar documents that share its target was computed. Across repeated runs this percentage was sometimes below 50%, which suggests that the vectorization technique used could be improved.
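
Five query documents give a noisy picture of how often neighbours share a label. A minimal sketch, under the assumption that averaging the same top-k agreement over many query rows is an acceptable summary. `neighbour_label_agreement` is a hypothetical helper, and the toy matrix stands in for `X_train` and `y_train`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def neighbour_label_agreement(X, y, query_idx, k=5):
    """Mean fraction of the k most similar rows that share the query's label."""
    y = np.asarray(y)
    query_idx = np.asarray(query_idx)
    sims = cosine_similarity(X[query_idx], X)               # (n_queries, n_docs)
    sims[np.arange(len(query_idx)), query_idx] = -np.inf    # mask each query itself
    top_k = np.argsort(sims, axis=1)[:, ::-1][:, :k]        # k best columns per row
    same = y[top_k] == y[query_idx][:, None]
    return same.mean()

# Toy data: two well-separated groups of three rows each (hypothetical values).
X_toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                  [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
queries = np.arange(6)

agreement = neighbour_label_agreement(X_toy, y_toy, queries, k=2)
print(f'{agreement:.0%} of the nearest neighbours share the query label')
```

On the notebook's data, `queries` could be a few hundred indices drawn with `np.random.choice(X_train.shape[0], size=500, replace=False)`, which keeps the similarity matrix small enough to hold in memory.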

## Task 2

Train Naive Bayes classification models to maximize classification performance
(macro F1 score) on the test set. Consider changing the instantiation parameters
of the vectorizer and the models, and try both Multinomial Naive Bayes
and ComplementNB models.

### Naive Bayes and Complement Naive Bayes using the vectorization obtained so far and default parameters

In [14]:
# Instantiate and fit a multinomial Naive Bayes classifier.
mult_nb = MultinomialNB()
mult_nb.fit(X_train, y_train)

# Vectorize the test set and predict.
X_test = tfidfvect.transform(newsgroups_test.data)
y_test = newsgroups_test.target
y_pred = mult_nb.predict(X_test)

# Compute the macro F1 score.
print('Macro F1 score with multinomial Naive Bayes:', f1_score(y_test, y_pred, average='macro'))

# Instantiate and fit a complement Naive Bayes classifier.
compl_nb = ComplementNB()
compl_nb.fit(X_train, y_train)

# Predict.
y_pred = compl_nb.predict(X_test)

# Compute the macro F1 score.
print('Macro F1 score with complement Naive Bayes:', f1_score(y_test, y_pred, average='macro'))

Macro F1 score with multinomial Naive Bayes: 0.5854345727938506
Macro F1 score with complement Naive Bayes: 0.692953349950875


In [15]:
# Instantiate the vectorizer again, this time removing English stop words.
tfidfvect = TfidfVectorizer(stop_words='english')
X_train = tfidfvect.fit_transform(newsgroups_train.data)

X_test = tfidfvect.transform(newsgroups_test.data)

# Instantiate and fit a complement Naive Bayes classifier.
compl_nb = ComplementNB()
compl_nb.fit(X_train, y_train)

# Predict.
y_pred = compl_nb.predict(X_test)

print('Macro F1 score with complement Naive Bayes:', f1_score(y_test, y_pred, average='macro'))

Macro F1 score with complement Naive Bayes: 0.6936107849650025
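
One way to explore vectorizer and model parameters systematically is a cross-validated grid search on the training split, leaving the test split for the final score only. A minimal sketch, assuming a small illustrative grid (the values below are assumptions, not tuned results) and a toy corpus in place of 20newsgroups:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('nb', ComplementNB()),
])

# Illustrative grid; the parameter values are assumptions, not tuned results.
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__sublinear_tf': [False, True],
    'nb__alpha': [0.1, 0.5, 1.0],
}

# Select the configuration with the best cross-validated macro F1.
search = GridSearchCV(pipeline, param_grid, scoring='f1_macro', cv=2)

# Toy corpus standing in for newsgroups_train.data / newsgroups_train.target.
toy_docs = [
    'baseball game pitcher inning', 'pitcher threw a fast ball',
    'home run in the ninth inning', 'baseball season opens today',
    'window manager font server', 'xlib font property lookup',
    'server display window colormap', 'font rendering on the display',
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

search.fit(toy_docs, toy_labels)
print(search.best_params_)
print(f'Best cross-validated macro F1: {search.best_score_:.3f}')
```

With the notebook's data, `search.fit(newsgroups_train.data, newsgroups_train.target)` followed by `f1_score(y_test, search.predict(newsgroups_test.data), average='macro')` would report the test score of the selected configuration.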


## Task 3

Transpose the document-term matrix. This yields a term-document matrix
that can be interpreted as a collection of word vectorizations.
Now study similarity between words by picking 5 words and examining the 5 most similar to each.

In [18]:
# Transpose X_train to obtain the term-document matrix. Each row can be interpreted as the vectorization of a term.
X_train_t = X_train.T
X_train_t.shape

# Build a dictionary mapping indices to terms.
idx_to_term = {v: k for k, v in tfidfvect.vocabulary_.items()}

# Randomly pick the indices of 5 distinct terms from the training vocabulary.
# For each one, get the indices of its 5 most similar terms and store them
# in a dictionary.
indices = np.random.choice(X_train_t.shape[0], size=5, replace=False)
term_to_similars = {}
for i in indices:
    cos_sim = cosine_similarity(X_train_t[i], X_train_t)[0]
    # [1:6] drops only the top-ranked entry. Terms with identical vectors tie at
    # similarity 1.0, so the term itself may not rank first and can appear in its own list.
    most_sim = np.argsort(cos_sim)[::-1][1:6]
    term_to_similars[i] = most_sim
    print(f'The 5 terms most similar to the term at index {i} are: {most_sim}')

The 5 terms most similar to the term at index 34180 are: [75052 90044 76983  9759  2271]
The 5 terms most similar to the term at index 25920 are: [33764 97119 30898 37424 25920]
The 5 terms most similar to the term at index 4653 are: [80746 98059 81272 81033 81013]
The 5 terms most similar to the term at index 38227 are: [45333 30039 73364 20808 35108]
The 5 terms most similar to the term at index 57399 are: [20303 20200 56550 20720 88291]


#### Analyzing term 1

In [19]:
# Term 1 is:
term_idx = 0
print(f'The term at index {indices[term_idx]} is {idx_to_term[indices[term_idx]]}')

The term at index 34180 is dissect


In [21]:
# The 5 terms most similar to term 1 are:
for i in range(5):
    j = term_to_similars[indices[term_idx]][i]
    print(f'Index: {j}. Term: {idx_to_term[j]}')
    print('\n====================================================\n')

Index: 75052. Term: quato


Index: 90044. Term: trnc


Index: 76983. Term: rekindled


Index: 9759. Term: 5zo


Index: 2271. Term: 122057




#### Analyzing term 2

In [22]:
# Term 2 is:
term_idx = 1
print(f'The term at index {indices[term_idx]} is {idx_to_term[indices[term_idx]]}')

The term at index 25920 is cashing


In [23]:
# The 5 terms most similar to term 2 are:
for i in range(5):
    j = term_to_similars[indices[term_idx]][i]
    print(f'Index: {j}. Term: {idx_to_term[j]}')
    print('\n====================================================\n')

Index: 33764. Term: dirtier


Index: 97119. Term: wreaked


Index: 30898. Term: culling


Index: 37424. Term: environmentalists


Index: 25920. Term: cashing




#### Analyzing term 3

In [24]:
# Term 3 is:
term_idx = 2
print(f'The term at index {indices[term_idx]} is {idx_to_term[indices[term_idx]]}')

The term at index 4653 is 22b


In [25]:
# The 5 terms most similar to term 3 are:
for i in range(5):
    j = term_to_similars[indices[term_idx]][i]
    print(f'Index: {j}. Term: {idx_to_term[j]}')
    print('\n====================================================\n')

Index: 80746. Term: scim


Index: 98059. Term: xf7h0aib


Index: 81272. Term: seki


Index: 81033. Term: sduiu


Index: 81013. Term: sdd2k




#### Analyzing term 4

In [26]:
# Term 4 is:
term_idx = 3
print(f'The term at index {indices[term_idx]} is {idx_to_term[indices[term_idx]]}')

The term at index 38227 is evidences


In [27]:
# The 5 terms most similar to term 4 are:
for i in range(5):
    j = term_to_similars[indices[term_idx]][i]
    print(f'Index: {j}. Term: {idx_to_term[j]}')
    print('\n====================================================\n')

Index: 45333. Term: hallucinated


Index: 30039. Term: counterclaim


Index: 73364. Term: proofs


Index: 20808. Term: axioms


Index: 35108. Term: droplet




#### Analyzing term 5

In [28]:
# Term 5 is:
term_idx = 4
print(f'The term at index {indices[term_idx]} is {idx_to_term[indices[term_idx]]}')

The term at index 57399 is lowest


In [29]:
# The 5 terms most similar to term 5 are:
for i in range(5):
    j = term_to_similars[indices[term_idx]][i]
    print(f'Index: {j}. Term: {idx_to_term[j]}')
    print('\n====================================================\n')

Index: 20303. Term: attribution


Index: 20200. Term: atop


Index: 56550. Term: liftoff


Index: 20720. Term: awaiting


Index: 88291. Term: thatn




### Comments

For each of 5 randomly chosen terms, the 5 most similar terms were retrieved, using TF-IDF as the vectorization technique and cosine similarity as the metric. Across repeated runs, no semantic resemblance was observed between the compared terms. This is expected given the vectorization technique: TF-IDF does not capture the semantic meaning of words and does not consider the context in which they appear. Words such as "chair" and "seat", for example, could end up with distant vectors depending on the different frequencies with which they occur in a given corpus. The result would have been different had word embeddings trained on a sufficiently rich corpus been used.
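
Part of the effect has a mechanical explanation: a term that occurs in a single document has a term vector with one non-zero entry, so every other term that occurs only in that same document gets cosine similarity 1.0 with it, regardless of meaning. A minimal sketch on a toy corpus (hypothetical sentences), which also shows how the `min_df` parameter of `TfidfVectorizer` keeps such one-document terms out of the vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (hypothetical): 'liftoff' and 'quato' occur only in the first document.
toy_docs = [
    'rocket liftoff delayed quato',
    'rocket launch delayed again',
    'rocket launch window opens',
]

# Default vocabulary: every term is kept, including one-document terms.
vect = TfidfVectorizer()
X_t = vect.fit_transform(toy_docs).T   # term-document matrix
vocab = vect.vocabulary_
sim = cosine_similarity(X_t[[vocab['liftoff']]], X_t[[vocab['quato']]])[0, 0]
print(f'cosine(liftoff, quato) = {sim:.2f}')

# min_df=2 keeps only terms that appear in at least two documents.
vect_min_df = TfidfVectorizer(min_df=2)
vect_min_df.fit(toy_docs)
print(sorted(vect_min_df.vocabulary_))
```

Raising `min_df` does not make TF-IDF term vectors semantic, but it removes the ties among rare tokens that dominate neighbour lists such as the ones shown above.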