# Challenge 1 - Assignment

2. Train Naïve Bayes classification models to maximize classification performance (macro F1-score) on the test set. Consider changing the instantiation parameters of the vectorizer and of the models, and try both the Multinomial Naïve Bayes and ComplementNB models.

3. Transpose the document-term matrix. This yields a term-document matrix that can be interpreted as a collection of word vectors. Then study similarity between words by taking 5 words and examining the 5 most similar to each. The words should not be chosen at random, to avoid hard-to-interpret terms; choose them "manually".

## Library imports

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.metrics import f1_score
from sklearn.datasets import fetch_20newsgroups
import numpy as np

## Data loading

In [3]:
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

## Document vectorization

1. Vectorize the documents. Take 5 documents at random and measure their similarity to the rest of the documents. Study the 5 most similar documents for each one and analyze whether the similarity makes sense given the text content and the classification label.

In [4]:
tfidfvect = TfidfVectorizer()

X_train = tfidfvect.fit_transform(newsgroups_train.data)
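
The vectorizer above uses default settings. As a hedged illustration of how instantiation parameters change the vocabulary, the sketch below compares a default `TfidfVectorizer` with one that drops English stop words and rare terms. The toy corpus and the chosen parameter values are assumptions for illustration only, not tuned settings for 20 Newsgroups.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the notebook would use newsgroups_train.data.
toy_docs = [
    "the rocket reached orbit after launch",
    "nasa planned the rocket launch into orbit",
    "the goalie stopped the puck in the hockey game",
    "the hockey team won the game with a late puck shot",
]

default_vect = TfidfVectorizer()
# Illustrative values: remove stop words, keep terms present in >= 2 documents,
# and dampen raw term counts with 1 + log(tf).
tuned_vect = TfidfVectorizer(stop_words="english", min_df=2, sublinear_tf=True)

default_vocab_size = len(default_vect.fit(toy_docs).vocabulary_)
tuned_vocab_size = len(tuned_vect.fit(toy_docs).vocabulary_)

print(f"Default vocabulary size: {default_vocab_size}")
print(f"Tuned vocabulary size:   {tuned_vocab_size}")
```

A smaller vocabulary reduces the 101631-column matrix seen in this notebook, at the risk of discarding terms that are rare but discriminative.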

In [5]:
y_train = newsgroups_train.target
y_train[:10]

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])
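
With a feature matrix and labels in hand, the model comparison requested in task 2 could look roughly like the sketch below. It runs on hypothetical toy data (`train_docs`, `test_docs`) and an assumed small grid of `alpha` values; in the notebook the inputs would be the 20 Newsgroups train and test splits. Selecting the configuration on the test set follows the wording of the assignment; a held-out validation split would be the more cautious choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the newsgroups train/test splits.
train_docs = [
    "rocket launch orbit nasa",
    "satellite orbit mission launch",
    "hockey puck goalie game",
    "team won hockey game",
]
train_labels = [0, 0, 1, 1]
test_docs = ["nasa satellite mission", "goalie stopped the puck"]
test_labels = [0, 1]

# Evaluate every (model, alpha) combination with macro-averaged F1.
results = {}
for model_cls in (MultinomialNB, ComplementNB):
    for alpha in (0.01, 0.1, 1.0):
        clf = make_pipeline(TfidfVectorizer(), model_cls(alpha=alpha))
        clf.fit(train_docs, train_labels)
        preds = clf.predict(test_docs)
        score = f1_score(test_labels, preds, average="macro")
        results[(model_cls.__name__, alpha)] = score

best_config = max(results, key=results.get)
print(f"Best configuration: {best_config}, macro F1 = {results[best_config]:.3f}")
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary fitted on training data only, so the test documents are transformed consistently.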

In [6]:
print(type(X_train))
print(f'Shape: {X_train.shape}')
print(f'Number of documents: {X_train.shape[0]}')
print(f'Vocabulary size: {X_train.shape[1]}')

<class 'scipy.sparse._csr.csr_matrix'>
Shape: (11314, 101631)
Number of documents: 11314
Vocabulary size: 101631


In [25]:
np.random.seed(40)
documents_idx = np.random.choice(X_train.shape[0], 5)
print(documents_idx)

[11256  7608  5426  5959  3603]
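
Note that `np.random.choice` samples with replacement by default, so the same document could in principle be drawn twice. A minimal sketch of sampling distinct indices with NumPy's `Generator` API follows; because it uses a different random generator, it will not reproduce the indices printed above.

```python
import numpy as np

n_documents = 11314  # X_train.shape[0] in this notebook

# replace=False guarantees five distinct document indices.
rng = np.random.default_rng(40)
sampled_idx = rng.choice(n_documents, size=5, replace=False)
print(sampled_idx)
```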


### Analysis of document 11256

In [89]:
main_idx = 11256
cos_sim = cosine_similarity(X_train[main_idx], X_train)[0]
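
Because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), cosine similarity between TF-IDF rows reduces to a plain dot product. The sketch below checks this on a hypothetical toy corpus; `linear_kernel` is shown only as an equivalent alternative, not as what the notebook uses.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus; the notebook would use newsgroups_train.data.
toy_docs = [
    "the rocket reached orbit after launch",
    "nasa planned the rocket launch into orbit",
    "the goalie stopped the puck in the hockey game",
    "the hockey team won the game with a late puck shot",
]

X = TfidfVectorizer().fit_transform(toy_docs)  # rows are L2-normalized

# Euclidean norm of every row; each should be 1.
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())

cos = cosine_similarity(X[0], X)[0]  # normalizes again, then takes dot products
dot = linear_kernel(X[0], X)[0]      # dot products only

print(row_norms)
print(np.allclose(cos, dot))
```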

In [90]:
most_similar_idx = np.argsort(cos_sim)[::-1][1:6]

print(f'Most similar documents to document {main_idx}:')
print(most_similar_idx)

Most similar documents to document 11256:
[ 1794 10714  8506   111  9736]
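
The slice `[1:6]` assumes the query document itself ranks first. When the corpus contains exact duplicates (similarity 1.0), the query may land in another position and remain in the result. A hedged alternative is to mask the query index explicitly, as in this sketch; `top_k_neighbors` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def top_k_neighbors(similarities, query_idx, k=5):
    """Return indices of the k most similar items, excluding the query itself."""
    sims = np.asarray(similarities, dtype=float).copy()
    sims[query_idx] = -np.inf  # the query can never be selected
    return np.argsort(sims)[::-1][:k]


# Toy similarities: document 2 is an exact duplicate of query document 0.
toy_sims = np.array([1.0, 0.2, 1.0, 0.7, 0.0])
neighbors = top_k_neighbors(toy_sims, query_idx=0, k=2)
print(neighbors)
```

In the notebook this would be called as `top_k_neighbors(cos_sim, main_idx)`.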


In [91]:
# Report the class of the main document itself (not of its nearest neighbor)
print(f'Class for document {main_idx}: {newsgroups_train.target_names[y_train[main_idx]]}')
print(f'Content:\n{newsgroups_train.data[main_idx]}')

Class for document 11256: comp.sys.ibm.pc.hardware
Content:
: I need to know the Pins to connect to make a loopback connector for a serial
: port so I can build one.  The loopback connector is used to test the 
: serial port.
: 
: Thanks for any help.
: 
: 
: Steve
: 
Me Too!!!!!!!
skcgoh@tartarus.uwa.edu.au



In [92]:
def analyze_similarity(most_similar_idx, documents, labels, cosine_sim_matrix, main_idx, top_n=5):
    """
    Print a preview and the class of each of the most similar documents.

    Parameters:
    - most_similar_idx: Indices of the documents most similar to the main document.
    - documents: List of documents in the dataset.
    - labels: Array of integer labels, one per document.
    - cosine_sim_matrix: 1-D array of cosine similarities between the main
      document and every document (currently unused; kept for the call signature).
    - main_idx: Index of the main document being analyzed.
    - top_n: Number of similar documents; used only in the printed header.
    """
    # Class names come from the globally loaded training set
    class_names = newsgroups_train.target_names
    
    print(f"\nTop {top_n} Similar Documents:\n{'-'*50}")

    for idx in most_similar_idx:
        print(f"\n{'='*50}")
        print(f"Analyzing Document {idx}")
        print(f"{'='*50}")
        print(f"Content:\n{documents[idx][:200]}\n")
        print(f"Document Class: {class_names[labels[idx]]}")
        print(f"Main Document Class: {class_names[labels[main_idx]]}")
        print(f"{'='*50}")
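
To complement the printed previews with a number, the share of neighbors that carry the same label as the main document can be computed. This is a minimal sketch on toy labels; `neighbor_label_agreement` is a hypothetical helper, and in the notebook it would receive `y_train`, `main_idx` and `most_similar_idx`.

```python
import numpy as np


def neighbor_label_agreement(labels, main_idx, neighbor_idx):
    """Fraction of neighbors whose label equals the main document's label."""
    labels = np.asarray(labels)
    return float(np.mean(labels[neighbor_idx] == labels[main_idx]))


# Toy labels: three of the five neighbors share the class of document 0.
toy_labels = np.array([3, 3, 7, 3, 1, 3])
agreement = neighbor_label_agreement(toy_labels, main_idx=0, neighbor_idx=[1, 2, 3, 4, 5])
print(f"Label agreement: {agreement:.2f}")
```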

In [93]:
analyze_similarity(most_similar_idx=most_similar_idx, documents=newsgroups_train.data, labels=y_train, cosine_sim_matrix=cos_sim, main_idx=main_idx)


Top 5 Similar Documents:
--------------------------------------------------

Analyzing Document 1794
Content:
I need to know the Pins to connect to make a loopback connector for a serial
port so I can build one.  The loopback connector is used to test the 
serial port.

Thanks for any help.


Document Class: comp.sys.ibm.pc.hardware
Main Document Class: comp.sys.ibm.pc.hardware

Analyzing Document 10714
Content:
Subject says it  all.  Please email soon.  
skcgoh@tartarus.uwa.edu.au


Document Class: comp.sys.ibm.pc.hardware
Main Document Class: comp.sys.ibm.pc.hardware

Analyzing Document 8506
Content:


From a recent BYTE magazine i got the following:

[Question and part of the answer deleted]

  If you are handy with a soldering iron, the loopback plugs are easy to
make.  On a serial RS-232 nine-p

Document Class: comp.sys.ibm.pc.hardware
Main Document Class: comp.sys.ibm.pc.hardware

Analyzing Document 111
Content:

It would depend on the requirements of the poster's data, for some

### Analysis of document 7608

In [94]:
main_idx = 7608
cos_sim = cosine_similarity(X_train[main_idx], X_train)[0]

In [95]:
most_similar_idx = np.argsort(cos_sim)[::-1][1:6]

print(f'Most similar documents to document {main_idx}:')
print(most_similar_idx)
newsgroups_train.target_names[y_train[main_idx]]

Most similar documents to document 7608:
[2172 6387 7545 2800 9986]


'sci.space'

In [96]:
# Report the class of the main document itself (not of its nearest neighbor)
print(f'Class for document {main_idx}: {newsgroups_train.target_names[y_train[main_idx]]}')
print(f'Content:\n{newsgroups_train.data[main_idx]}')

Class for document 7608: sci.space
Content:
Other idea for old space crafts is as navigation beacons and such..
Why not?? If you can put them on "safe" "pause" mode.. why not have them be
activated by a signal from a space craft (manned?) to act as a naviagtion
beacon, to take a directional plot on??


In [97]:
analyze_similarity(most_similar_idx=most_similar_idx, documents=newsgroups_train.data, labels=y_train, cosine_sim_matrix=cos_sim, main_idx=main_idx)


Top 5 Similar Documents:
--------------------------------------------------

Analyzing Document 2172
Content:
   Other idea for old space crafts is as navigation beacons and such..
   Why not??

Because to be any use as a nav point you need to know -exactly- where
it is, which means you either nail it to some

Document Class: sci.space
Main Document Class: sci.space

Analyzing Document 6387
Content:



There is a whole constellation of custom built navigation beacon satellites
in the process of being phased out right now. The TRANSIT/OSCAR satellites
are being replaced by GPS. Or were you thinkin

Document Class: sci.space
Main Document Class: sci.space

Analyzing Document 7545
Content:
There is an interesting opinion piece in the business section of today's
LA Times (Thursday April 15, 1993, p. D1).  I thought I'd post it to
stir up some flame wars - I mean reasoned debate.  Let me 

Document Class: sci.space
Main Document Class: sci.space

Analyzing Document 2800
Content:
Archive-na

### Analysis of document 5426

In [98]:
main_idx = 5426
cos_sim = cosine_similarity(X_train[main_idx], X_train)[0]

In [99]:
most_similar_idx = np.argsort(cos_sim)[::-1][1:6]

print(f'Most similar documents to document {main_idx}:')
print(most_similar_idx)
newsgroups_train.target_names[y_train[main_idx]]

Most similar documents to document 5426:
[11124  7902  9030   955  6635]


'rec.motorcycles'

In [100]:
# Report the class of the main document itself (not of its nearest neighbor)
print(f'Class for document {main_idx}: {newsgroups_train.target_names[y_train[main_idx]]}')
print(f'Content:\n{newsgroups_train.data[main_idx]}')

Class for document 5426: rec.motorcycles
Content:


Hmmm.. The LDDC security guards over here in Docklands only place parking 
stickers on the drivers SIDE windows.. But on reflection that could still 
cause an accident.. Suppose it's because people aren't as litigious over 
here as in the states :-)

Stephen


In [101]:
analyze_similarity(most_similar_idx=most_similar_idx, documents=newsgroups_train.data, labels=y_train, cosine_sim_matrix=cos_sim, main_idx=main_idx)


Top 5 Similar Documents:
--------------------------------------------------

Analyzing Document 11124
Content:
PATRICK
1st rd:	Pens over Isles in 4.
	Devils over Caps in 6.
2nd:	Pens over Devils in 7.

ADAMS
1st rd: B's over Sabres in 5.
	Nords over Habs in 5.
2nd:	B's over Nords in 6.

NORRIS
1st:	Hawks over 

Document Class: rec.sport.hockey
Main Document Class: rec.motorcycles

Analyzing Document 7902
Content:

(1)  Stephen said you took a quote out of context
(2)  You noted that Stephen had not replied to some other t.r.m article
     (call it A) that took a quote out of context
(3)  But the lack of eviden

Document Class: talk.religion.misc
Main Document Class: rec.motorcycles

Analyzing Document 9030
Content:

(not that logic has anything to do with it, but...)
I can see the liability of putting stickers on the car while it was moving,
or something, but it's the BDI that chooses to start and then drive the

Document Class: rec.motorcycles
Main Document Class: rec.motorcycles

An

### Analysis of document 5959

In [102]:
main_idx = 5959
cos_sim = cosine_similarity(X_train[main_idx], X_train)[0]

In [103]:
most_similar_idx = np.argsort(cos_sim)[::-1][1:6]

print(f'Most similar documents to document {main_idx}:')
print(most_similar_idx)
newsgroups_train.target_names[y_train[main_idx]]

Most similar documents to document 5959:
[ 913  919 3746 5826 4271]


'comp.sys.ibm.pc.hardware'

In [104]:
# Report the class of the main document itself (not of its nearest neighbor)
print(f'Class for document {main_idx}: {newsgroups_train.target_names[y_train[main_idx]]}')
print(f'Content:\n{newsgroups_train.data[main_idx]}')

Class for document 5959: comp.sys.ibm.pc.hardware
Content:


We have plenty of computer labs where the computers are left on all the
time. I don't see any shorter lifespan than the ones we have in the
offices which does get turned off at the end of the day. In fact, some
of the computers in the labs have outlived some of the same ones in the
offices. But it goes both ways so can't conclude anything.


In [105]:
analyze_similarity(most_similar_idx=most_similar_idx, documents=newsgroups_train.data, labels=y_train, cosine_sim_matrix=cos_sim, main_idx=main_idx)


Top 5 Similar Documents:
--------------------------------------------------

Analyzing Document 913
Content:
The recent rise of nostalgia in this group, combined with the
  incredible level of utter bullshit, has prompted me to comb
  through my archives and pull out some of "The Best of Alt.Atheism"
  for y

Document Class: alt.atheism
Main Document Class: comp.sys.ibm.pc.hardware

Analyzing Document 919
Content:
Accounts of Anti-Armenian Human Right Violatins in Azerbaijan #009
                 Prelude to Current Events in Nagorno-Karabakh

      +--------------------------------------------------------------

Document Class: talk.politics.mideast
Main Document Class: comp.sys.ibm.pc.hardware

Analyzing Document 3746
Content:
THE WHITE HOUSE

                  Office of the Press Secretary
                 (Vancouver, British Columbia) 
______________________________________________________________


                      

Document Class: talk.politics.misc
Main Document Class: com

### Analysis of document 3603

In [108]:
main_idx = 3603
cos_sim = cosine_similarity(X_train[main_idx], X_train)[0]

In [109]:
most_similar_idx = np.argsort(cos_sim)[::-1][1:6]

print(f'Most similar documents to document {main_idx}:')
print(most_similar_idx)
newsgroups_train.target_names[y_train[main_idx]]

Most similar documents to document 3603:
[5928 2495 1739 8101 9764]


'comp.sys.mac.hardware'

In [110]:
# Report the class of the main document itself (not of its nearest neighbor)
print(f'Class for document {main_idx}: {newsgroups_train.target_names[y_train[main_idx]]}')
print(f'Content:\n{newsgroups_train.data[main_idx]}')

Class for document 3603: comp.sys.mac.hardware
Content:


...


Mac sound hardware is diverse; some macs play in stereo and
mix the output (the SE/30 for instance) while others play in
stereo but ONLY has the left channel for the speaker, while
some are "truly" mono (like the LC)

Developers know that stuff played in the left channel is
guaranteed to be heard, while the right channel isn't. Some
send data to both, some only send data to the left channel
(the first is preferrable, of course)

Cheers,

					/ h+


In [111]:
analyze_similarity(most_similar_idx=most_similar_idx, documents=newsgroups_train.data, labels=y_train, cosine_sim_matrix=cos_sim, main_idx=main_idx)


Top 5 Similar Documents:
--------------------------------------------------

Analyzing Document 5928
Content:
or
there


Okay, I guess its time for a quick explanation of Mac sound.

The original documentation for the sound hardware (IM-3) documents how to
make sound by directly accessing hardware.  Basically

Document Class: comp.sys.mac.hardware
Main Document Class: comp.sys.mac.hardware

Analyzing Document 2495
Content:
Hi.  I think I have a problem with the stereo sound output on my Quadra
900, but I am not totally sure because my roomate has the same problem
on his PowerBook 170.  Any info or experience anyopne has

Document Class: comp.sys.mac.hardware
Main Document Class: comp.sys.mac.hardware

Analyzing Document 1739
Content:
I'm looking for a used/inexpensive audio mixer.  I need at least 
4 channels of stereo input and 1 channel of stereo output, but I would
prefer 8 or more input channels.  Each channel needs to have at

Document Class: misc.forsale
Main Document Class: com