In [0]:
# These two commands make IPython reload modified packages automatically, so there is no need to call reload by hand
%load_ext autoreload
%autoreload 2

# Naive Bayes

Naive Bayes is a statistical technique that applies the previous method to problems whose events are not independent, while still assuming independence.
Take email classification as an example. Here we can estimate the probability of each word occurring given the category the email belongs to.

Let's look at a concrete example.
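To make the independence assumption concrete, here is a minimal from-scratch sketch on a made-up toy corpus. The documents, the helper names `log_posterior` and `classify`, and the Laplace smoothing constant `alpha` are illustrative assumptions, not part of the dataset used below. Each class score is the log prior plus the sum of per-word log likelihoods, as if the words occurred independently given the class.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, label) pairs, 1 = spam, 0 = not spam
train = [
    (["cheap", "pills", "offer"], 1),
    (["limited", "offer", "cheap"], 1),
    (["meeting", "schedule", "today"], 0),
    (["project", "meeting", "notes"], 0),
]

vocab = {w for tokens, _ in train for w in tokens}
class_docs = Counter(label for _, label in train)      # documents per class
word_counts = {c: Counter() for c in class_docs}       # word counts per class
for tokens, label in train:
    word_counts[label].update(tokens)


def log_posterior(tokens, c, alpha=1.0):
    """Unnormalized log P(c | tokens) under the naive independence assumption."""
    log_prob = math.log(class_docs[c] / len(train))    # log prior
    total = sum(word_counts[c].values())
    for w in tokens:
        if w not in vocab:
            continue                                   # ignore words never seen in training
        # Laplace-smoothed estimate of P(w | c)
        log_prob += math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
    return log_prob


def classify(tokens):
    """Pick the class with the highest unnormalized log posterior."""
    return max(class_docs, key=lambda c: log_posterior(tokens, c))


print(classify(["cheap", "offer"]))
print(classify(["meeting", "notes"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.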

## Loading the data


Download the data from GitHub.

Tip: a leading `!` runs the rest of the line in the system shell, and `wget` downloads files.

In [2]:
! wget 'https://raw.githubusercontent.com/rn-2019-itba/Clase-3---K-folding-TFIDF-Dask-/master/data/emails.csv'

--2019-08-21 12:22:29--  https://raw.githubusercontent.com/rn-2019-itba/Clase-3---K-folding-TFIDF-Dask-/master/data/emails.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8954755 (8.5M) [text/plain]
Saving to: ‘emails.csv’


2019-08-21 12:22:29 (110 MB/s) - ‘emails.csv’ saved [8954755/8954755]



We load the data into a pandas DataFrame.

In [0]:
import pandas as pd
dataset = pd.read_csv('emails.csv')

## Let's explore the data a bit

In [4]:
dataset.keys()

Index(['text', 'spam'], dtype='object')

In [5]:
len(dataset.text)

5728

In [8]:
print(dataset['spam'][-10:])

5718    0
5719    0
5720    0
5721    0
5722    0
5723    0
5724    0
5725    0
5726    0
5727    0
Name: spam, dtype: int64


In [9]:
print(dataset['text'][0:5])

0    Subject: naturally irresistible your corporate...
1    Subject: the stock trading gunslinger  fanny i...
2    Subject: unbelievable new homes made easy  im ...
3    Subject: 4 color printing special  request add...
4    Subject: do not have money , get software cds ...
Name: text, dtype: object


In [10]:
dataset.text[0]

"Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  ma

In [11]:
dataset.spam[0]

1

Computing the prior: the rows listed below are the 1368 emails labeled as spam (spam = 1).


0       1
1       1
2       1
3       1
4       1
5       1
6       1
7       1
8       1
9       1
10      1
11      1
12      1
13      1
14      1
15      1
16      1
17      1
18      1
19      1
20      1
21      1
22      1
23      1
24      1
25      1
26      1
27      1
28      1
29      1
       ..
1338    1
1339    1
1340    1
1341    1
1342    1
1343    1
1344    1
1345    1
1346    1
1347    1
1348    1
1349    1
1350    1
1351    1
1352    1
1353    1
1354    1
1355    1
1356    1
1357    1
1358    1
1359    1
1360    1
1361    1
1362    1
1363    1
1364    1
1365    1
1366    1
1367    1
Name: spam, Length: 1368, dtype: int64

In [17]:
print(len(dataset[dataset['spam']==0])/len(dataset)*100)

76.11731843575419
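The class priors can also be read off in a single call. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical stand-in, not the real data) so that it runs on its own; `value_counts(normalize=True)` returns the relative frequency of each label.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: 2 spam and 6 non-spam rows
toy = pd.DataFrame({"spam": [1, 1, 0, 0, 0, 0, 0, 0]})

# Relative frequency of each label, i.e. an estimate of the prior P(class)
priors = toy["spam"].value_counts(normalize=True)
print(priors)

p_not_spam = priors[0]   # label-based lookup: frequency of spam == 0
p_spam = priors[1]       # frequency of spam == 1
```

On the real data the same idea would be `dataset['spam'].value_counts(normalize=True)`.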


In short, we have a dataset of 5728 emails classified into 2 classes: 1368 spam (about 23.9%) and 4360 not spam (about 76.1%).
Next we will build a bag of words and filter the vocabulary a bit.




## Preprocessing the data
### For a single email
We will apply the following processing steps:

- **Tokenization (nltk):** Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string.

In [18]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

em = dataset.text[0]
tok=word_tokenize(em)
print("\nTokenized email:")
print(tok)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

Tokenized email:
['Subject', ':', 'naturally', 'irresistible', 'your', 'corporate', 'identity', 'lt', 'is', 'really', 'hard', 'to', 'recollect', 'a', 'company', ':', 'the', 'market', 'is', 'full', 'of', 'suqgestions', 'and', 'the', 'information', 'isoverwhelminq', ';', 'but', 'a', 'good', 'catchy', 'logo', ',', 'stylish', 'statlonery', 'and', 'outstanding', 'website', 'will', 'make', 'the', 'task', 'much', 'easier', '.', 'we', 'do', 'not', 'promise', 'that', 'havinq', 'ordered', 'a', 'iogo', 'your', 'company', 'will', 'automaticaily', 'become', 'a', 'world', 'ieader', ':', 'it', 'isguite', 'ciear', 'that', 'without', 'good', 'products', ',', 'effective', 'business', 'organization', 'and', 'practicable', 'aim', 'it', 'will', 'be', 'hotat', 'nowadays', 'market', ';', 'but', 'we', 'do', 'promise', 'that', 'your', 'marketing', 'efforts', 'will', 'become', 'much', 'more', 'effectiv


- **Lemmatization (nltk):** Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.


In [19]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
lem=[lemmatizer.lemmatize(x,pos='v') for x in tok]
print("\nLemmatization")
print(lem)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.

Lemmatization
['Subject', ':', 'naturally', 'irresistible', 'your', 'corporate', 'identity', 'lt', 'be', 'really', 'hard', 'to', 'recollect', 'a', 'company', ':', 'the', 'market', 'be', 'full', 'of', 'suqgestions', 'and', 'the', 'information', 'isoverwhelminq', ';', 'but', 'a', 'good', 'catchy', 'logo', ',', 'stylish', 'statlonery', 'and', 'outstanding', 'website', 'will', 'make', 'the', 'task', 'much', 'easier', '.', 'we', 'do', 'not', 'promise', 'that', 'havinq', 'order', 'a', 'iogo', 'your', 'company', 'will', 'automaticaily', 'become', 'a', 'world', 'ieader', ':', 'it', 'isguite', 'ciear', 'that', 'without', 'good', 'products', ',', 'effective', 'business', 'organization', 'and', 'practicable', 'aim', 'it', 'will', 'be', 'hotat', 'nowadays', 'market', ';', 'but', 'we', 'do', 'promise', 'that', 'your', 'market', 'efforts', 'will', 'become', 'much', 'more', 'effective', '.', 'he

- **Stop Words (nltk):** A common pre-processing step is to filter out words that carry little information. In natural language processing these are called stop words: commonly used words such as “the”, “a”, “an”, or “in”.


In [20]:
from nltk.corpus import stopwords

nltk.download('stopwords')
# Build the stop-word set once; set membership tests are much faster than rescanning the list for every token
stop_words = set(stopwords.words('english'))
stop = [x for x in lem if x not in stop_words]
print("\nRemoving stop words:")
print(stop)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

Removing stop words:
['Subject', ':', 'naturally', 'irresistible', 'corporate', 'identity', 'lt', 'really', 'hard', 'recollect', 'company', ':', 'market', 'full', 'suqgestions', 'information', 'isoverwhelminq', ';', 'good', 'catchy', 'logo', ',', 'stylish', 'statlonery', 'outstanding', 'website', 'make', 'task', 'much', 'easier', '.', 'promise', 'havinq', 'order', 'iogo', 'company', 'automaticaily', 'become', 'world', 'ieader', ':', 'isguite', 'ciear', 'without', 'good', 'products', ',', 'effective', 'business', 'organization', 'practicable', 'aim', 'hotat', 'nowadays', 'market', ';', 'promise', 'market', 'efforts', 'become', 'much', 'effective', '.', 'list', 'clear', 'benefit', ':', 'creativeness', ':', 'hand', '-', 'make', ',', 'original', 'logos', ',', 'specially', 'reflect', 'distinctive', 'company', 'image', '.', 'convenience', ':', 'logo', 'stationery', 'provide', 'for

- **Stemming (nltk):** Stemmers remove morphological affixes from words, leaving only the word stem.


In [21]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stem=[stemmer.stem(x) for x in stop]
print("\nApplying stemming")
print(stem)



Applying stemming
['subject', ':', 'natur', 'irresist', 'corpor', 'ident', 'lt', 'realli', 'hard', 'recollect', 'compani', ':', 'market', 'full', 'suqgest', 'inform', 'isoverwhelminq', ';', 'good', 'catchi', 'logo', ',', 'stylish', 'statloneri', 'outstand', 'websit', 'make', 'task', 'much', 'easier', '.', 'promis', 'havinq', 'order', 'iogo', 'compani', 'automaticaili', 'becom', 'world', 'ieader', ':', 'isguit', 'ciear', 'without', 'good', 'product', ',', 'effect', 'busi', 'organ', 'practic', 'aim', 'hotat', 'nowaday', 'market', ';', 'promis', 'market', 'effort', 'becom', 'much', 'effect', '.', 'list', 'clear', 'benefit', ':', 'creativ', ':', 'hand', '-', 'make', ',', 'origin', 'logo', ',', 'special', 'reflect', 'distinct', 'compani', 'imag', '.', 'conveni', ':', 'logo', 'stationeri', 'provid', 'format', ';', 'easi', '-', '-', 'use', 'content', 'manag', 'system', 'letsyou', 'chang', 'websit', 'content', 'even', 'structur', '.', 'prompt', ':', 'see', 'logo', 'draft', 'within', 'three',

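- **Word filtering:** We keep only the tokens made up entirely of alphabetic characters, which drops punctuation and numbers.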


In [22]:
alpha=[x for x in stem if x.isalpha()]
print("\nFiltering out non-words:")
print(alpha)


Filtering out non-words:
['subject', 'natur', 'irresist', 'corpor', 'ident', 'lt', 'realli', 'hard', 'recollect', 'compani', 'market', 'full', 'suqgest', 'inform', 'isoverwhelminq', 'good', 'catchi', 'logo', 'stylish', 'statloneri', 'outstand', 'websit', 'make', 'task', 'much', 'easier', 'promis', 'havinq', 'order', 'iogo', 'compani', 'automaticaili', 'becom', 'world', 'ieader', 'isguit', 'ciear', 'without', 'good', 'product', 'effect', 'busi', 'organ', 'practic', 'aim', 'hotat', 'nowaday', 'market', 'promis', 'market', 'effort', 'becom', 'much', 'effect', 'list', 'clear', 'benefit', 'creativ', 'hand', 'make', 'origin', 'logo', 'special', 'reflect', 'distinct', 'compani', 'imag', 'conveni', 'logo', 'stationeri', 'provid', 'format', 'easi', 'use', 'content', 'manag', 'system', 'letsyou', 'chang', 'websit', 'content', 'even', 'structur', 'prompt', 'see', 'logo', 'draft', 'within', 'three', 'busi', 'day', 'afford', 'market', 'break', 'make', 'gap', 'budget', 'satisfact', 'guarante', 'provi


More info at:
http://text-processing.com/demo/stem/

### Now we preprocess all the data

**WARNING: THIS CELL TAKES A WHILE TO RUN**

In [23]:
# Process every email:
stop_words = set(stopwords.words('english'))  # build the stop-word set once, outside the loop
emails_filtrados = list()
for idx in range(len(dataset.text)):
    if idx % 100 == 0:
        print("\r Processed: {}".format(idx), end="")
    em = dataset.text[idx]
    tok = word_tokenize(em)
    lem = [lemmatizer.lemmatize(x, pos='v') for x in tok]
    stop = [x for x in lem if x not in stop_words]
    stem = [stemmer.stem(x) for x in stop]
    alpha = [x for x in stem if x.isalpha()]
    emails_filtrados.append(" ".join(alpha))

 Processed: 5700

In [25]:
print(emails_filtrados[2])

subject unbeliev new home make easi im want show homeown pre approv home loan fix rate offer extend uncondit credit way factor take advantag limit time opportun ask visit websit complet minut post approv form look foward hear dorca pittman


### Saving the preprocessed data
We save the preprocessed emails with pickle, which serializes Python objects so they can be stored on disk. Knowing how to do this is important if you don't want to waste time rerunning slow steps!

In [0]:
# Save the preprocessed emails to disk:
import pickle

with open('em_filt.pck', 'wb') as fp:
    pickle.dump(emails_filtrados, fp)

In [0]:
# Load the preprocessed emails back from disk:
with open('em_filt.pck', 'rb') as fp:
    itemlist = pickle.load(fp)

In [30]:
print(itemlist[16])

subject softwar guarante legal name brand softwar low low low low price everyth come hustl wait mani would coward courag enough
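One way to avoid rerunning a slow cell by accident is to wrap the save and load steps in a small cache helper. The sketch below is an assumption about how one might organize this: `load_or_compute`, `slow_step`, and the file name `demo.pck` are hypothetical names, not part of the notebook.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the object cached at `path`, computing and saving it on first use."""
    if os.path.exists(path):
        with open(path, "rb") as fp:
            return pickle.load(fp)
    result = compute()
    with open(path, "wb") as fp:
        pickle.dump(result, fp)
    return result


# Demo with a throwaway directory and a stand-in for the slow preprocessing step
calls = []


def slow_step():
    calls.append(1)                      # record each time the slow step actually runs
    return ["subject offer", "subject meet"]


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "demo.pck")
    first = load_or_compute(cache, slow_step)    # runs slow_step and writes the file
    second = load_or_compute(cache, slow_step)   # reads the file; slow_step is not called again

print(len(calls))
```

Keep in mind that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.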


### More preprocessing

- **Building the vocabulary (TfidfVectorizer / CountVectorizer) and computing the probabilities**

TfidfVectorizer stands for Term Frequency – Inverse Document Frequency vectorizer. Term frequency (tf) is how many times a given term appears in a document. Inverse document frequency (idf) measures how rare a term is across the whole collection of documents: terms that appear in many documents get a low weight, and rare terms get a high weight.
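As a sanity check on the definition, the sketch below recomputes the TF-IDF weights by hand on a made-up two-document corpus and compares them with scikit-learn's output. It assumes `TfidfVectorizer`'s default settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the toy documents and the `manual` array are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["spam offer offer", "spam meeting"]   # hypothetical toy corpus
vect = TfidfVectorizer()
X = vect.fit_transform(docs).toarray()

n_docs = len(docs)
manual = np.zeros_like(X)
for term, col in vect.vocabulary_.items():
    # Raw term frequency of `term` in each document
    tf = np.array([doc.split().count(term) for doc in docs], dtype=float)
    df = np.count_nonzero(tf)                      # number of documents containing the term
    idf = np.log((1 + n_docs) / (1 + df)) + 1      # smoothed idf (default settings)
    manual[:, col] = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)   # L2-normalize each row

print(np.allclose(X, manual))
```

In the first document, "offer" ends up with a larger weight than "spam" because it is both more frequent in that document and rarer across the corpus.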

With max_df we set a maximum document frequency: terms that appear in more than that fraction of the documents (0.8 here, i.e. 80%) are dropped, since such common words carry little information.

With min_df we set the minimum number of documents a term must appear in to be kept (10 here).
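The effect of both thresholds is easier to see on a tiny example. The four documents below are made up for illustration: "spam" appears in all of them (above `max_df=0.8`), "offer" and "today" appear in two each, and every other word appears in only one (below `min_df=2`).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
docs = [
    "spam offer now",
    "spam offer today",
    "spam meeting notes",
    "spam lunch today",
]

vect = TfidfVectorizer(max_df=0.8, min_df=2)
vect.fit(docs)

# Only the terms that pass both filters remain in the vocabulary
kept = sorted(vect.vocabulary_)
print(kept)
```

Note that a float value is read as a fraction of the documents, while an integer value is read as an absolute document count.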


In [0]:
# Extracting features from articles

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(max_df=0.8, min_df=10)
raw_data = tfidf_vect.fit_transform(itemlist)  # Learns the vocabulary, assigns each word a column index, and returns the TF-IDF matrix

In [41]:
tfidf_vect.vocabulary_  # Column index assigned to each word

{'natur': 2632,
 'corpor': 849,
 'ident': 1891,
 'lt': 2345,
 'realli': 3208,
 'hard': 1763,
 'recollect': 3221,
 'compani': 732,
 'market': 2411,
 'full': 1614,
 'suqgest': 3837,
 'inform': 1959,
 'good': 1693,
 'catchi': 563,
 'logo': 2316,
 'stylish': 3798,
 'statloneri': 3744,
 'outstand': 2798,
 'websit': 4287,
 'make': 2384,
 'task': 3888,
 'much': 2605,
 'easier': 1189,
 'promis': 3097,
 'order': 2770,
 'iogo': 2028,
 'becom': 351,
 'world': 4366,
 'ieader': 1895,
 'isguit': 2040,
 'without': 4350,
 'product': 3082,
 'effect': 1221,
 'busi': 500,
 'organ': 2776,
 'practic': 3027,
 'aim': 92,
 'hotat': 1846,
 'nowaday': 2701,
 'effort': 1223,
 'list': 2287,
 'clear': 672,
 'benefit': 368,
 'creativ': 887,
 'hand': 1754,
 'origin': 2780,
 'special': 3691,
 'reflect': 3236,
 'distinct': 1105,
 'imag': 1908,
 'conveni': 823,
 'stationeri': 3742,
 'provid': 3117,
 'format': 1564,
 'easi': 1188,
 'use': 4146,
 'content': 810,
 'manag': 2392,
 'system': 3870,
 'letsyou': 2257,
 'chang'

In [42]:
vocabulary = tfidf_vect.get_feature_names()  # Words selected for the vocabulary, in alphabetical order (on scikit-learn >= 1.0 use get_feature_names_out() instead)
print(vocabulary)

['aa', 'ab', 'abil', 'abl', 'abroad', 'absenc', 'absolut', 'absorb', 'abstract', 'abu', 'abus', 'ac', 'academ', 'acceler', 'accept', 'access', 'accid', 'accommod', 'accomod', 'accompani', 'accomplish', 'accord', 'accordingli', 'account', 'accumul', 'accur', 'accuraci', 'achiev', 'acknowledg', 'acquir', 'acquisit', 'acrobat', 'across', 'act', 'action', 'activ', 'actual', 'ad', 'adam', 'adapt', 'add', 'addit', 'address', 'addresse', 'adequ', 'adjust', 'admin', 'administ', 'administr', 'admir', 'admiss', 'admit', 'adob', 'adopt', 'adrian', 'adult', 'advanc', 'advantag', 'advantaq', 'advers', 'advertis', 'advic', 'advis', 'advisor', 'advisori', 'advoc', 'affair', 'affect', 'affili', 'afford', 'afraid', 'africa', 'afternoon', 'ag', 'age', 'agenc', 'agenda', 'agent', 'aggreg', 'aggress', 'agnihotri', 'ago', 'agre', 'agreement', 'agricultur', 'ahead', 'ahmad', 'aicohol', 'aid', 'aiesec', 'aiia', 'ail', 'aim', 'ainsley', 'air', 'aircraft', 'airlin', 'airport', 'al', 'alan', 'albani', 'albert',

In [43]:
print(vocabulary[641])
print(tfidf_vect.vocabulary_["christoph"])

christoph
641


### Or we can use CountVectorizer

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(max_df=0.8, min_df=100)
raw_data = count_vect.fit_transform(emails_filtrados)  # Learns the vocabulary and assigns each word a column index

In [64]:
raw_data.shape  # Each document gets a vector of word counts

(5728, 984)

In [65]:
raw_data.toarray()  # It is a sparse matrix, so we expand it into a dense array

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [2, 0, 0, ..., 0, 0, 0]])

In [66]:
raw_data.toarray()[0, :].argmax()  # Column index of the word with the highest count in the first email

504
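To turn a column index like this back into a word, the `vocabulary_` mapping can be inverted. The sketch below does so on a made-up two-document corpus; `index_to_word` and `top_word` are hypothetical names used only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["offer offer offer meeting", "meeting notes"]   # hypothetical toy corpus
vect = CountVectorizer()
counts = vect.fit_transform(docs)

# vocabulary_ maps word -> column index; invert it to go from index -> word
index_to_word = {idx: word for word, idx in vect.vocabulary_.items()}

top_idx = counts.toarray()[0, :].argmax()   # column with the highest count in the first document
top_word = index_to_word[top_idx]
print(top_word)
```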

## Now we train a Naive Bayes model

First we create a training set and a test set.

In [0]:
# Use the first 90% of the rows for training and the last 10% for testing (the rows are not shuffled)
len_train = int(raw_data.shape[0]*0.9)
len_test = raw_data.shape[0]-len_train
X_train = raw_data[0:len_train]
X_test  = raw_data[len_train:]
Y_train = dataset[0:len_train]['spam']
Y_test  = dataset[len_train:]['spam']
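Because the rows of this dataset are ordered by label (the spam emails come first), slicing off the last 10% yields a test set with a single class. A shuffled, stratified split is one possible alternative. The sketch below illustrates the difference on made-up labels with the same ordering; the toy arrays and the names `X_tr`, `X_te`, `y_tr`, `y_te` are assumptions, and the result reported in this notebook was not produced this way.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data ordered like the real dataset: 20 spam rows first, then 80 non-spam rows
y = np.array([1] * 20 + [0] * 80)
X = np.arange(100).reshape(-1, 1)

# Plain slice: the last 10% contains no spam at all
tail_spam = int(y[90:].sum())

# Shuffled, stratified split: class proportions are preserved in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)
strat_spam = int(y_te.sum())

print(tail_spam, strat_spam)
```

`train_test_split` also accepts sparse matrices, so it could be applied to `raw_data` and `dataset['spam']` directly.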

We train the model with fit.

In [72]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X_train.toarray(), Y_train)


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
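`MultinomialNB` accepts sparse matrices directly, so the call to `toarray()` is optional. The vectorizer and the classifier can also be chained so that raw text goes in and labels come out. The sketch below does this on a made-up toy corpus; the documents and the name `model` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = spam, 0 = not spam
docs = [
    "cheap pills offer",
    "limited offer cheap",
    "meeting schedule today",
    "project meeting notes",
]
labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and feeds the sparse counts straight into the classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

pred = model.predict(["cheap offer", "meeting notes"])
print(pred)
```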

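We check the accuracy. Keep in mind that the split above is not shuffled and the dataset is ordered by label (all 1368 spam emails come first), so this test set contains only non-spam emails and the figure below measures accuracy on that class alone.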

In [74]:
import numpy as np
# Fraction of test emails whose predicted label matches the true label, as a percentage
porc = np.mean(clf.predict(X_test.toarray()) == np.array(Y_test)) * 100
print("Percentage of correctly classified emails: {:.2f}%".format(porc))

Percentage of correctly classified emails: 99.65%
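Accuracy alone can hide how the classifier behaves on each class, especially when the classes are imbalanced. A confusion matrix and the recall on the spam class give a fuller picture. The sketch below uses made-up labels and predictions (`y_true` and `y_pred` are hypothetical, not outputs of the model above).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical true labels and predictions: 1 = spam, 0 = not spam
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]

# Rows are true classes (0, 1), columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
spam_recall = recall_score(y_true, y_pred)   # fraction of real spam that was caught

print(cm)
print(acc, spam_recall)
```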
