Clasificación de texto usando Naive Bayes
===

**Juan David Velásquez Henao**  
jdvelasq@unal.edu.co   
Universidad Nacional de Colombia, Sede Medellín  
Facultad de Minas  
Medellín, Colombia

---

Haga click [aquí](https://github.com/jdvelasq/IPython-for-predictive-analytics/blob/master/1-classification-5-Bayes-20newsgroups.ipynb) para acceder a la última versión online.

Haga click [aquí](http://nbviewer.jupyter.org/github/jdvelasq/IPython-for-predictive-analytics/blob/master/1-classification-5-Bayes-20newsgroups.ipynb) para ver la última versión online en `nbviewer`. 

---
[Licencia](https://github.com/jdvelasq/IPython-for-predictive-analytics/blob/master/LICENSE)  
[Readme](https://github.com/jdvelasq/IPython-for-predictive-analytics/blob/master/readme.md)

**Bibliografia.**

* Learning scikit-learn: Machine Learning in Python. R. Garreta, G. Moncecchi. Packt Publishing, 2013.

# Definición del problema real

Se desea realizar la identificación del tipo de flor (virginica, setosa, versicolor) a partir de la medición del tamaño del sépalo y el pétalo.

# Definición del problema en términos de los datos

Se tienen 150 medidiciones del ancho y el largo del sépalo y el pétalo para las tres especies de la flor Iris, con 50 mediciones para cada especie. Se desea construir un clasificador que pronostique la especie de la flor a partir de dichas mediciones.

In [30]:
## los datos se encuentran disponibles directamente en scikit-learn
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
print(news.keys())

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])


In [31]:
print(news.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [32]:
print(news.data[0])

From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




In [33]:
SPLIT_PERC = 0.75
split_size = int(len(news.data)*SPLIT_PERC)
X_train = news.data[:split_size]
X_test = news.data[split_size:]
y_train = news.target[:split_size]
y_test = news.target[split_size:]

In [34]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer

In [35]:
clf_1 = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])
clf_2 = Pipeline([
    ('vect', HashingVectorizer(non_negative=True)),
    ('clf', MultinomialNB()),
])
clf_3 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

In [37]:
from sklearn.model_selection import cross_val_score, KFold
from scipy.stats import sem
import numpy as np

def evaluate_cross_validation(clf, X, y, K):
    kf = KFold(n_splits=5, shuffle=True, random_state=33)
    cv = kf.get_n_splits(X)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    print ("Mean score: {0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores)))
    
clfs = [clf_1, clf_2, clf_3]
for clf in clfs:
    evaluate_cross_validation(clf, news.data, news.target, 5)

[0.85245033 0.8619136  0.85809019 0.85767392 0.84529506]
Mean score: 0.855 (+/-0.003)




[0.77562914 0.78849722 0.79098143 0.7732342  0.77804359]
Mean score: 0.781 (+/-0.004)
[0.85562914 0.8632388  0.85835544 0.85369092 0.84848485]
Mean score: 0.856 (+/-0.002)


In [48]:
clf_4 = Pipeline([
    ('vect', TfidfVectorizer(token_pattern=u"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",)),
    ('clf', MultinomialNB()),
])

evaluate_cross_validation(clf_4, news.data, news.target, 5)

ValueError: empty vocabulary; perhaps the documents only contain stop words

In [None]:
def get_stop_words():
     result = set()
     for line in open('stopwords_en.txt', 'r').readlines():
         result.add(line.strip())
     return result

In [50]:
from sklearn import metrics

def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    clf.fit(X_train, y_train)
    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))
    y_pred = clf.predict(X_test)
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))
    
train_and_evaluate(clf_1, X_train, X_test, y_train, y_test)

Accuracy on training set:
0.9267015706806283
Accuracy on testing set:
0.8425297113752123
Classification Report:
             precision    recall  f1-score   support

          0       0.90      0.87      0.88       216
          1       0.61      0.85      0.71       246
          2       0.94      0.12      0.21       274
          3       0.61      0.85      0.71       235
          4       0.89      0.87      0.88       231
          5       0.75      0.90      0.82       225
          6       0.88      0.68      0.77       248
          7       0.90      0.88      0.89       275
          8       0.94      0.94      0.94       226
          9       0.97      0.94      0.96       250
         10       0.97      0.98      0.98       257
         11       0.87      0.98      0.92       261
         12       0.85      0.86      0.85       216
         13       0.90      0.92      0.91       257
         14       0.91      0.93      0.92       246
         15       0.81      0.95      0

In [51]:
print(len(clf_1.named_steps['vect'].get_feature_names()))

140678


In [52]:
clf_1.named_steps['vect'].get_feature_names()

['00',
 '000',
 '0000',
 '00000',
 '000000',
 '00000000',
 '0000000004',
 '0000000005',
 '00000000b',
 '00000001',
 '00000001b',
 '0000000667',
 '00000010',
 '00000010b',
 '00000011',
 '00000011b',
 '0000001200',
 '00000074',
 '00000100',
 '00000100b',
 '00000101',
 '00000101b',
 '00000110',
 '00000110b',
 '00000111',
 '00000111b',
 '000003',
 '00000315',
 '000005102000',
 '00000510200001',
 '000007',
 '00000f',
 '00001000',
 '00001000b',
 '00001001',
 '00001001b',
 '00001010',
 '00001010b',
 '00001011',
 '00001011b',
 '00001100',
 '00001100b',
 '00001101',
 '00001101b',
 '00001110',
 '00001110b',
 '00001111',
 '00001111b',
 '000020',
 '000021',
 '000050',
 '000062david42',
 '0000vec',
 '0001',
 '00010000',
 '00010000b',
 '00010001',
 '00010001b',
 '00010010',
 '00010010b',
 '00010011',
 '00010011b',
 '000100255pixel',
 '00010100',
 '00010100b',
 '00010101',
 '00010101b',
 '00010110',
 '00010110b',
 '00010111',
 '00010111b',
 '00011000',
 '00011000b',
 '00011001',
 '00011001b',
 '00011

In [49]:
clf_5 = Pipeline([
    ('vect', TfidfVectorizer(stop_words= get_stop_words(), 
                             token_pattern=u"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",)),
    ('clf', MultinomialNB())])

evaluate_cross_validation(clf_5, news.data, news.target, 5)

NameError: name 'get_stop_words' is not defined

In [None]:
clf_7 = Pipeline([
    ('vect', TfidfVectorizer(stop_words=stop_words,
                             token_pattern=ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",)),
    ('clf', MultinomialNB(alpha=0.01))])

evaluate_cross_validation(clf_75, news.data, news.target, 5)

---

Clasificación de texto usando Naive Bayes
===

**Juan David Velásquez Henao**  
jdvelasq@unal.edu.co   
Universidad Nacional de Colombia, Sede Medellín  
Facultad de Minas  
Medellín, Colombia

---

Haga click [aquí](https://github.com/jdvelasq/IPython-for-predictive-analytics/blob/master/1-classification-5-Bayes-20newsgroups.ipynb) para acceder a la última versión online.

Haga click [aquí](http://nbviewer.jupyter.org/github/jdvelasq/IPython-for-predictive-analytics/blob/master/1-classification-5-Bayes-20newsgroups.ipynb) para ver la última versión online en `nbviewer`. 

---
[Licencia](https://github.com/jdvelasq/IPython-for-predictive-analytics/blob/master/LICENSE)  
[Readme](https://github.com/jdvelasq/IPython-for-predictive-analytics/blob/master/readme.md)