# Classification example

In general, a classification task consists of the following steps:


1.   Preprocess the data. Read the data and apply standardization, cleaning, and similar techniques.
2.   Split the data. Form the training, validation, and test sets.
3.   Train the model. Train until we are satisfied with its performance on the validation set.
4.   Evaluate the model. Apply a metric to the predictions on the test set.
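
The four steps above can be sketched end to end on a tiny corpus. This is only an illustration under stated assumptions: the toy texts, the labels, the `toy_` names, and the linear kernel are invented for the example and are not the data or model used in the rest of this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy corpus (not 20 Newsgroups): 0 = graphics, 1 = hardware.
graphics = ["Render image polygon ", "image texture shader",
            "polygon shader render", "texture image render"]
hardware = ["disk drive controller", " Memory board controller",
            "drive memory disk", "board disk memory"]
texts = graphics * 5 + hardware * 5
labels = [0] * 20 + [1] * 20

# Step 1: preprocess (here only trimming whitespace and lowercasing).
texts = [t.strip().lower() for t in texts]

# Step 2: split into train / validation / test (60% / 20% / 20%).
rest_x, test_x, rest_y, test_y = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)
train_x, val_x, train_y, val_y = train_test_split(
    rest_x, rest_y, test_size=0.25, random_state=0, stratify=rest_y)

# Vectorize: the vocabulary is learned from the training texts only.
toy_vectorizer = CountVectorizer()
X_train = toy_vectorizer.fit_transform(train_x)

# Step 3: train; the validation score is what would guide further tuning.
toy_model = SVC(kernel="linear").fit(X_train, train_y)
val_acc = accuracy_score(val_y, toy_model.predict(toy_vectorizer.transform(val_x)))

# Step 4: evaluate once on the held-out test set.
test_acc = accuracy_score(test_y, toy_model.predict(toy_vectorizer.transform(test_x)))
print(len(train_x), len(val_x), len(test_x), val_acc, test_acc)
```

The test set is touched only in the last step, so its score is not influenced by any decision made during training.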



In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

The dataset has the following categories:
```
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc'
]
```

In [None]:
categories = [
    'comp.graphics',
    'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware',
    'comp.sys.mac.hardware',
    'comp.windows.x',
]
remove = ('headers', 'footers', 'quotes')

In [None]:
train_data = fetch_20newsgroups(subset="train", categories=categories, remove=remove)

In [None]:
train_data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [None]:
len(train_data['data'])

2936

In [None]:
print(train_data['data'][0])

Hi!

I remember reading (or hallucinating) that NCD's PC-Xremote functionality had 
been given, by NCD, to MIT for inclusion in X11R6.  Is this true?  If so,
(set mode/cheap) can I just wait for X11R6 to get compressed serial line
X server support?

Thanks!


In [None]:
print(train_data['target'][0], train_data['target_names'][train_data['target'][0]])

4 comp.windows.x


In [None]:
test_data = fetch_20newsgroups(subset="test", categories=categories, remove=remove)
len(test_data['data'])

1955

We store the texts and labels in separate variables for the training and test sets.

In [None]:
train = train_data['data']
test = test_data['data']
y_train = train_data['target']
y_test = test_data['target']

We create the validation set by holding out 20% of the training data.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train, val, y_train, y_val = train_test_split(train, y_train, test_size=0.2, random_state=42)

In [None]:
print(len(train), len(val), len(test))

2348 588 1955
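
The sizes above follow from how `train_test_split` rounds: the held-out part gets `ceil(0.2 * 2936) = 588` items and the remaining 2348 stay in training. The sketch below reproduces that arithmetic; the list of integers is only a placeholder standing in for the texts.

```python
import math
from sklearn.model_selection import train_test_split

n_samples = 2936                    # size of the training subset reported above
items = list(range(n_samples))      # placeholder data, not the real texts
train_part, val_part = train_test_split(items, test_size=0.2, random_state=42)

expected_val = math.ceil(0.2 * n_samples)
print(len(train_part), len(val_part), expected_val)  # 2348 588 588
```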


We convert the text to bag-of-words count vectors. The vectorizer is fit on the training set only.

In [None]:
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(train)

In [None]:
train_vec = vectorizer.transform(train)

In [None]:
train_vec.shape

(2348, 44247)
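
The shape reads as 2348 documents by 44247 vocabulary terms: each column counts how often one term occurs in each document. A minimal sketch on two invented sentences (the `docs` list and the `toy_vectorizer` name are assumptions for illustration) makes the layout visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented sentences, used only to show the document-term layout.
docs = ["graphics card driver", "driver for the graphics card card"]

toy_vectorizer = CountVectorizer(stop_words='english')
counts = toy_vectorizer.fit_transform(docs)

# Order the terms by their column index in the matrix.
terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(terms)             # ['card', 'driver', 'graphics']
print(counts.toarray())  # [[1 1 1]
                         #  [2 1 1]]
```

The stop words "for" and "the" get no column, and "card" is counted twice in the second document.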

In [None]:
print(train[1])




Another program which produces this effect is:
- SpyGlass Transform 2.1 (while contouring a big 257*257 array).

Thanks for any information about this problem,


In [None]:
print(train_vec[1])

  (0, 2245)	2
  (0, 10163)	1
  (0, 11287)	1
  (0, 13775)	1
  (0, 16484)	1
  (0, 21871)	1
  (0, 31829)	1
  (0, 31857)	1
  (0, 31874)	1
  (0, 36291)	1
  (0, 37807)	1
  (0, 38293)	1
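
Each entry in the sparse output is `(row, column) count`, so `(0, 2245) 2` means the term in column 2245 occurs twice in this document. One way to read such a row is to invert `vocabulary_` (term to column index) and look up each column. The sketch below does this on an invented two-document corpus, not on the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented corpus for illustration.
docs = ["big array contouring program", "program crashes on a big big array"]

toy_vectorizer = CountVectorizer(stop_words='english')
counts = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps term -> column index; invert it to go back to terms.
index_to_term = {idx: term for term, idx in toy_vectorizer.vocabulary_.items()}

# For a single CSR row, .indices holds the columns and .data the counts.
row = counts[1]
decoded = {index_to_term[j]: int(c) for j, c in zip(row.indices, row.data)}
print(decoded)
```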


In [None]:
for i, w in enumerate(vectorizer.vocabulary_):
  print(w)
  if i == 10:
    break

uptight
computer
literate
people
advantages
act
like
mac
ate
cat
program


In [None]:
val_vec = vectorizer.transform(val)
test_vec = vectorizer.transform(test)

In [None]:
val_vec.shape, test_vec.shape

((588, 44247), (1955, 44247))
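
The validation and test matrices keep the same 44247 columns because `transform` reuses the vocabulary learned from the training texts; words that never appeared in training are simply dropped. A small sketch with invented sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(["graphics card driver"])   # stands in for the training texts

# "monitor" and "cable" were not seen during fit, so they are ignored.
unseen = toy_vectorizer.transform(["graphics monitor cable"])
print(unseen.shape)      # (1, 3)
print(unseen.toarray())  # [[0 0 1]]
```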

We apply an SVM to the data.

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [None]:
model = SVC(C=1000, kernel='rbf')

In [None]:
model.fit(train_vec, y_train)

In [None]:
train_pred = model.predict(train_vec)

In [None]:
train_pred[0:10], y_train[0:10]

(array([3, 0, 3, 2, 2, 0, 2, 4, 2, 0]), array([1, 3, 3, 2, 2, 0, 2, 4, 2, 0]))
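
Accuracy is the fraction of predictions that match the true labels. On the ten values shown above, the first two predictions differ and the remaining eight agree, so the accuracy on this small slice is 0.8. The sketch below checks that by hand and with `accuracy_score`; the variable names are chosen for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# The ten predictions and labels printed above.
pred_head = np.array([3, 0, 3, 2, 2, 0, 2, 4, 2, 0])
true_head = np.array([1, 3, 3, 2, 2, 0, 2, 4, 2, 0])

manual_acc = float(np.mean(pred_head == true_head))  # matches / total
library_acc = float(accuracy_score(true_head, pred_head))
print(manual_acc, library_acc)  # 0.8 0.8
```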

In [None]:
print("Acc on train =", accuracy_score(y_train, train_pred))

Acc on train = 0.8747870528109029


In [None]:
val_pred = model.predict(val_vec)
print("Acc on val =", accuracy_score(y_val, val_pred))

Acc on val = 0.7125850340136054


In [None]:
test_pred = model.predict(test_vec)
print("Acc on test =", accuracy_score(y_test, test_pred))

Acc on test = 0.6562659846547314
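
Training accuracy (about 0.87) is well above validation (about 0.71) and test (about 0.66) accuracy, which suggests the model fits the training set more closely than it generalizes. One common use of the validation set is to compare hyperparameter values before touching the test set. The sketch below shows that loop on synthetic data from `make_classification`; the candidate values of `C`, the `toy_` names, and the data are assumptions for illustration, and no result here says which `C` would work best on the newsgroups data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (not the newsgroups vectors).
toy_X, toy_y = make_classification(n_samples=300, n_features=20,
                                   n_informative=5, n_classes=3, random_state=0)
X_rest, X_te, y_rest, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Score each candidate C on the validation set only.
candidates = [0.1, 1, 10, 1000]
val_scores = {}
for C in candidates:
    candidate_model = SVC(C=C, kernel='rbf').fit(X_tr, y_tr)
    val_scores[C] = accuracy_score(y_va, candidate_model.predict(X_va))

# Keep the best candidate and report its test accuracy once.
best_C = max(val_scores, key=val_scores.get)
best_model = SVC(C=best_C, kernel='rbf').fit(X_tr, y_tr)
toy_test_acc = accuracy_score(y_te, best_model.predict(X_te))
print(val_scores, best_C, toy_test_acc)
```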
