# Text classifier (multiclass classifier)
In this notebook we build a multiclass text classifier with `scikit-learn` on top of BoW, TF-IDF and *averaged word vectors* models.  
As test data we use the *20newsgroups* dataset, which consists of about 18,000 newsgroup posts in English divided into 20 categories.  
### Downloading the dataset
We download the dataset and create the training and test sets.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

def get_data():
    data = fetch_20newsgroups(subset='all',
                              shuffle=True,
                              remove=('headers', 'footers', 'quotes'))
    return data
    
def remove_empty_docs(corpus, labels):
    filtered_corpus = []
    filtered_labels = []
    for doc, label in zip(corpus, labels):
        if doc.strip():
            filtered_corpus.append(doc)
            filtered_labels.append(label)

    return filtered_corpus, filtered_labels
    
    
dataset = get_data()

corpus, labels = dataset.data, dataset.target
corpus, labels = remove_empty_docs(corpus, labels)

print('\nLoaded {} documents'.format(len(corpus)))
print('Classes:\n', dataset.target_names)
print('\nSample document:\n', corpus[10])
print('\nClass: {} ({})'.format(labels[10], dataset.target_names[labels[10]]))

train_corpus, test_corpus, train_labels, test_labels = train_test_split(corpus,
                                                                        labels,
                                                                        test_size=0.3, random_state=42)


Loaded 18331 documents
Classes:
 ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

Sample document:
 the blood of the lamb.

This will be a hard task, because most cultures used most animals
for blood sacrifices. It has to be something related to our current
post-modernism state. Hmm, what about used computers?

Cheers,
Kent

Class: 19 (talk.religion.misc)


The number of documents per class is fairly balanced:

In [2]:
from collections import Counter
import pandas as pd

pd.DataFrame([(dataset.target_names[k], v) for k,v in Counter(train_labels).items()], columns=['clase', 'N'])

|   | clase | N |
|---|---|---|
| 0 | sci.crypt | 689 |
| 1 | rec.sport.baseball | 658 |
| 2 | comp.windows.x | 690 |
| 3 | misc.forsale | 650 |
| 4 | talk.politics.misc | 528 |
| 5 | talk.politics.mideast | 629 |
| 6 | rec.autos | 649 |
| 7 | rec.sport.hockey | 699 |
| 8 | comp.sys.mac.hardware | 668 |
| 9 | talk.politics.guns | 628 |
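
One way to put a number on "fairly balanced" is the ratio between the largest and the smallest class. The sketch below uses only the ten counts visible in the output above; the remaining classes are not shown, so the ratio over all 20 classes may differ. The helper name `imbalance_ratio` is made up for this illustration.

```python
# Class counts copied from the ten rows shown above (training set only).
shown_counts = {
    "sci.crypt": 689,
    "rec.sport.baseball": 658,
    "comp.windows.x": 690,
    "misc.forsale": 650,
    "talk.politics.misc": 528,
    "talk.politics.mideast": 629,
    "rec.autos": 649,
    "rec.sport.hockey": 699,
    "comp.sys.mac.hardware": 668,
    "talk.politics.guns": 628,
}


def imbalance_ratio(counts):
    """Return the largest class size divided by the smallest one."""
    sizes = list(counts.values())
    return max(sizes) / min(sizes)


ratio = imbalance_ratio(shown_counts)
print("largest / smallest = {:.2f}".format(ratio))
```

A ratio close to 1 means the classes are similar in size; among the rows shown, the largest class is roughly a third bigger than the smallest.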


### Text preprocessing
We clean the text (removing punctuation, whitespace and purely numeric tokens) and keep the lowercased lemma of each word.

In [3]:
import spacy
import re
import string

nlp=spacy.load('en_core_web_md')

def normalize_document(doc):
    '''Clean and normalize a document
    passed as a string'''
    # tokenize the text
    tokens = nlp(doc)
    # drop punctuation, whitespace and numeric tokens
    filtered_tokens = [t for t in tokens if not t.is_punct and not t.is_space and not t.is_digit]
    # take the lemma; pronouns keep their lowercased surface form
    # ("-PRON-" is the placeholder lemma that spaCy 2.x assigns to pronouns)
    lemmas = []
    for tok in filtered_tokens:
        lemmas.append(re.sub('[{}]'.format(re.escape(string.punctuation)), '', tok.lemma_.lower())
                      if tok.lemma_ != "-PRON-" else tok.lower_)
    # join back into a single string
    doc = ' '.join(lemmas)
    return doc

def normalize_corpus(corpus):
    '''Apply the normalization function to a corpus
    passed as a list of strings'''
    return [normalize_document(text) for text in corpus]

As an example, here is document no. 15, first raw and then normalized:

In [4]:
print(corpus[15])

In the following report: _Turkey Eyes Regional Role_ ANKARA, Turkey (AP)
April 27, 1993, we find in the last paragraph:

[Turanist] Although Premier Suleyman Demirel criticized Ozal's often
[Turanist] brash calls for more Turkish influence, he also has spoken
[Turanist] of a swath of Turkic peoples "stretching from the Adriatic
[Turanist] Sea to the Great Wall of China."

Who does Demirel think he is fooling? It seems at both ends of his envisioned 
pan-Turkic Empire -- the Balkans and the Caucasus -- Turkey's fascist boasts
are being pre-empted.

I would suggest Turkey let the world feel some of their "Grey Wolf Teeth", and
attempt to stretch from the Adriatic to China! Turkey will have cried "wolf"
just once too much! 




In [5]:
print(normalize_document(corpus[15]))

in the following report turkey eyes regional role ankara turkey ap april we find in the last paragraph turanist although premier suleyman demirel criticize ozal s often turanist brash call for more turkish influence he also have speak turanist of a swath of turkic people stretch from the adriatic turanist sea to the great wall of china who do demirel think he be fool it seem at both end of his envision pan turkic empire the balkans and the caucasus turkey s fascist boast be be pre empte i would suggest turkey let the world feel some of their grey wolf teeth and attempt to stretch from the adriatic to china turkey will have cry wolf just once too much
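
The isolated `s` tokens in the normalized text (for example `ozal s`) are a side effect of the punctuation-stripping regex applied to each lemma. The sketch below reproduces only that regex step, without spaCy. The sample lemmas are hypothetical and assume the tokenizer splits a possessive such as "Ozal's" into `Ozal` and `'s`; `strip_punct` is a name introduced here for illustration.

```python
import re
import string

# Same character class as in normalize_document: every ASCII punctuation mark.
PUNCT_RE = re.compile('[{}]'.format(re.escape(string.punctuation)))


def strip_punct(lemma):
    """Lowercase a lemma and remove any punctuation left inside it."""
    return PUNCT_RE.sub('', lemma.lower())


# Hypothetical lemmas, not actual spaCy output.
for lemma in ["Ozal", "'s", "e-mail", "U.S."]:
    print(repr(lemma), '->', repr(strip_punct(lemma)))
```

If these leftover fragments turn out to be noise, one option is to drop lemmas that are empty or a single character long after stripping.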


We normalize the whole set of texts.

In [6]:
norm_train_corpus = normalize_corpus(train_corpus)
norm_test_corpus = normalize_corpus(test_corpus)

## BoW and TF-IDF models
We define functions to obtain the BoW and TF-IDF features.  
We use `max_df=0.95` to drop corpus-specific stop words, i.e. terms that appear in more than 95% of the documents.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def bow_extractor(corpus):
    
    vectorizer = CountVectorizer(min_df=1, max_df=0.95)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

def tfidf_extractor(corpus):
    
    vectorizer = TfidfVectorizer(min_df=1, max_df=0.95)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features
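
To make the effect of `max_df` concrete, the sketch below mimics the filter in pure Python on a toy corpus. It is not scikit-learn's implementation: the function name `filter_by_max_df` is hypothetical and tokenization is a plain whitespace split, which is simpler than what `CountVectorizer` does.

```python
from collections import Counter


def filter_by_max_df(docs, max_df=0.95):
    """Return the sorted vocabulary after dropping terms whose document
    frequency (share of documents containing the term) exceeds max_df."""
    n_docs = len(docs)
    # set() makes each term count once per document
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(t for t, df in doc_freq.items() if df / n_docs <= max_df)


toy_docs = [
    "the cat sleeps",
    "the dog barks",
    "the bird sings",
    "the cat purrs",
]
vocab = filter_by_max_df(toy_docs)
print(vocab)
```

Here `the` occurs in every document (document frequency 1.0), so it is removed, while `cat` (0.5) and the remaining words are kept.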

## Averaged Word Vectors model  
To build the word-vector (WV) models we first train the word vectors on the corpus with the `gensim` library, using only the training set. `gensim` expects each document as a list of tokens, so we tokenize the corpus first.  

In [8]:
import gensim

def word_tokenize(text):
    return [token.text for token in nlp.tokenizer(text)]


procesar = 0  # set to 1 to train the vectors (may take a while); 0 loads the saved file

# tokenize the documents (needed for feature extraction in both cases)
tokenized_train = [word_tokenize(text)
                   for text in norm_train_corpus]
tokenized_test = [word_tokenize(text)
                  for text in norm_test_corpus]

if procesar == 1:
    # note: gensim >= 4.0 renames `size` to `vector_size`
    model_vectors = gensim.models.Word2Vec(tokenized_train, 
                                   size=100,
                                   window=10,
                                   min_count=2,
                                   sample=1e-3)
    model = model_vectors.wv
    model.save("wv_20newsgroup.dict")

model_vectors = gensim.models.KeyedVectors.load("wv_20newsgroup.dict")

# number of words in the vocabulary (gensim >= 4.0: len(model_vectors.key_to_index))
len(model_vectors.vocab)

37951
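
The manual `procesar` flag can also be replaced by a check on whether the saved file already exists. The sketch below shows that pattern in isolation. `load_or_compute` and `expensive` are hypothetical names, and JSON stands in for the `gensim` save and load calls so that the example runs without any trained model.

```python
import json
import os
import tempfile


def load_or_compute(path, compute):
    """Load a cached JSON result from `path`, or compute it and cache it."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    result = compute()
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(result, fh)
    return result


calls = []


def expensive():
    calls.append(1)  # record every real computation
    return {"vocab_size": 3}


cache_path = os.path.join(tempfile.mkdtemp(), "cache.json")
first = load_or_compute(cache_path, expensive)   # computes and saves
second = load_or_compute(cache_path, expensive)  # loads from disk
print(first == second, len(calls))
```

The second call returns the same result without running `expensive` again, which is the behaviour the flag is meant to provide.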

In [9]:
import numpy as np
palabras = model_vectors.index2word
sorted(np.random.choice(palabras, 25, replace=False))

['01001100b',
 '1970',
 '317',
 'advisability',
 'astonomical',
 'attached',
 'colorview',
 'cronin',
 'cts',
 'distort',
 'eimac',
 'exporectx',
 'idealist',
 'kartonlar',
 'm5u',
 'mm1',
 'normal14',
 'pollute',
 'problems',
 'prompt',
 'republish',
 'saddlebag',
 'stratospheric',
 'survellience',
 'sysadmin']

We compute two models based on word vectors:  
* the plain average of the WVs of all tokens, with the same weight for every word.  
* a weighted average in which each word's WV is weighted by its TF-IDF value in the document.  

We define the functions that compute these two feature matrices.

In [10]:
import numpy as np

def average_word_vectors(words, model, vocabulary, num_features):
    '''Compute the average of the WVs of all the words in a
    document passed as a list of tokens, given:
    model: Word2Vec word-vector model
    vocabulary: set of words in the model'''
    feature_vector = np.zeros((num_features,),dtype="float64")
    nwords = 0.
    
    for word in words:
        if word in vocabulary: 
            nwords = nwords + 1.
            feature_vector = np.add(feature_vector, model[word])
    
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)
        
    return feature_vector

def averaged_word_vectorizer(corpus, model, num_features):
    '''Apply the average-WV computation to every
    document in the corpus (each doc is a list of tokens)'''
    vocabulary = set(model.index2word)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)

def tfidf_wtd_avg_word_vectors(words, tfidf_vector, tfidf_vocabulary, model, num_features):
    
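    # TF-IDF weight of each word; 0 if the vectorizer never saw the word
    # (membership test: a truthiness check on .get(word) would drop the term at column 0)
    word_tfidfs = [tfidf_vector[0, tfidf_vocabulary[word]] 
                   if word in tfidf_vocabulary 
                   else 0 for word in words]    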
    word_tfidf_map = {word:tfidf_val for word, tfidf_val in zip(words, word_tfidfs)}
    
    feature_vector = np.zeros((num_features,),dtype="float64")
    vocabulary = set(model.index2word)
    wts = 0.
    for word in words:
        if word in vocabulary: 
            word_vector = model[word]
            weighted_word_vector = word_tfidf_map[word] * word_vector
            wts = wts + word_tfidf_map[word]
            feature_vector = np.add(feature_vector, weighted_word_vector)
    if wts:
        feature_vector = np.divide(feature_vector, wts)
        
    return feature_vector
    
def tfidf_weighted_averaged_word_vectorizer(corpus, tfidf_vectors, 
                                   tfidf_vocabulary, model, num_features):
                                       
    docs_tfidfs = [(doc, doc_tfidf) 
                   for doc, doc_tfidf 
                   in zip(corpus, tfidf_vectors)]
    features = [tfidf_wtd_avg_word_vectors(tokenized_sentence, tfidf, tfidf_vocabulary,
                                   model, num_features)
                    for tokenized_sentence, tfidf in docs_tfidfs]
    return np.array(features)
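
The difference between the two averaging schemes is easiest to see on a toy case. The sketch below uses made-up 2-dimensional vectors and made-up weights in place of the trained Word2Vec model and the TF-IDF matrix. The names `plain_average` and `weighted_average` are hypothetical simplifications of the functions above.

```python
import numpy as np

# Hypothetical 2-dimensional word vectors and per-word weights (toy values).
toy_vectors = {"the": np.array([0.0, 1.0]), "cat": np.array([1.0, 0.0])}
toy_weights = {"the": 1.0, "cat": 3.0}


def plain_average(tokens, vectors, dim=2):
    """Unweighted mean of the vectors of the tokens found in `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


def weighted_average(tokens, vectors, weights, dim=2):
    """Mean of the known tokens' vectors, each scaled by its weight."""
    known = [t for t in tokens if t in vectors]
    total = sum(weights.get(t, 0.0) for t in known)
    if not total:
        return np.zeros(dim)
    return sum(weights.get(t, 0.0) * vectors[t] for t in known) / total


doc = ["the", "cat", "unknown"]  # "unknown" has no vector and is skipped
plain = plain_average(doc, toy_vectors)
weighted = weighted_average(doc, toy_vectors, toy_weights)
print(plain, weighted)
```

With equal weights the document vector sits halfway between the two words; with the larger weight on `cat` it moves toward the `cat` vector.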

## Feature extraction
We extract features with each model: the vectorizers are fitted on the training set and then applied to the test set.

In [11]:
# bag-of-words features
bow_vectorizer, bow_train_features = bow_extractor(norm_train_corpus)  
bow_test_features = bow_vectorizer.transform(norm_test_corpus) 

# tf-idf features
tfidf_vectorizer, tfidf_train_features = tfidf_extractor(norm_train_corpus)  
tfidf_test_features = tfidf_vectorizer.transform(norm_test_corpus)    

# averaged word vector features
avg_wv_train_features = averaged_word_vectorizer(corpus=tokenized_train,
                                                 model=model_vectors,
                                                 num_features=100)                   
avg_wv_test_features = averaged_word_vectorizer(corpus=tokenized_test,
                                                model=model_vectors,
                                                num_features=100)                                                 

# tf-idf weighted averaged word vector features
vocab = tfidf_vectorizer.vocabulary_
tfidf_wv_train_features = tfidf_weighted_averaged_word_vectorizer(corpus=tokenized_train, 
                                                                  tfidf_vectors=tfidf_train_features, 
                                                                  tfidf_vocabulary=vocab, 
                                                                  model=model_vectors, 
                                                                  num_features=100)
tfidf_wv_test_features = tfidf_weighted_averaged_word_vectorizer(corpus=tokenized_test, 
                                                                 tfidf_vectors=tfidf_test_features, 
                                                                 tfidf_vocabulary=vocab, 
                                                                 model=model_vectors, 
                                                                 num_features=100)

In [12]:
bow_train_features.shape

(12831, 95573)

In [13]:
tfidf_train_features.shape

(12831, 95573)

In [14]:
avg_wv_train_features.shape

(12831, 100)

In [15]:
tfidf_wv_train_features.shape

(12831, 100)

## Classification
We apply several classifiers to each model to see which works best on our data.  
We define some functions to train the classifiers and measure their performance. 

In [16]:
from sklearn import metrics

def get_metrics(true_labels, predicted_labels):
    """Compute several metrics of the
    model's performance."""
    
    print('Accuracy:', np.round(
                        metrics.accuracy_score(true_labels, 
                                               predicted_labels),
                        2))
    print('Precision:', np.round(
                        metrics.precision_score(true_labels, 
                                               predicted_labels,
                                               average='weighted'),
                        2))
    print('Recall:', np.round(
                        metrics.recall_score(true_labels, 
                                               predicted_labels,
                                               average='weighted'),
                        2))
    print('F1 Score:', np.round(
                        metrics.f1_score(true_labels, 
                                               predicted_labels,
                                               average='weighted'),
                        2))
                        

def train_predict_evaluate_model(classifier, 
                                 train_features, train_labels, 
                                 test_features, test_labels):
    """Train a classification model on a training set,
    apply it to a test set, print the performance
    metrics and return the predictions"""
    # fit the model    
    classifier.fit(train_features, train_labels)
    # predict on the test set
    predictions = classifier.predict(test_features) 
    # evaluate the predictions   
    get_metrics(true_labels=test_labels, 
                predicted_labels=predictions)
    return predictions    
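
With `average='weighted'`, each per-class score is weighted by the class support (its number of true samples). For recall, that weighted average reduces to plain accuracy, so `get_metrics` reports the same value for both. A minimal pure-Python sketch on made-up labels; it does not call scikit-learn, and the function names are chosen for this illustration:

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights equal to class support."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)


# Made-up labels for three classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

acc = accuracy(y_true, y_pred)
wrec = weighted_recall(y_true, y_pred)
print(round(acc, 4), round(wrec, 4))
```

The support terms cancel, leaving the total number of correct predictions divided by the number of samples. Weighted precision and F1 do not simplify this way, which is why they can differ from accuracy.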

We train on the training set and evaluate on the test set.  

In [17]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier

mnb = MultinomialNB()
svm = SGDClassifier(loss='hinge', max_iter=100)

# Multinomial Naive Bayes with bag of words features
print('Multinomial Naive Bayes with Bag of Words features')
mnb_bow_predictions = train_predict_evaluate_model(classifier=mnb,
                                           train_features=bow_train_features,
                                           train_labels=train_labels,
                                           test_features=bow_test_features,
                                           test_labels=test_labels)

# Support Vector Machine with bag of words features
print('Support Vector Machine with Bag of Words features')
svm_bow_predictions = train_predict_evaluate_model(classifier=svm,
                                           train_features=bow_train_features,
                                           train_labels=train_labels,
                                           test_features=bow_test_features,
                                           test_labels=test_labels)
                                           
# Multinomial Naive Bayes with tfidf features
print('Multinomial Naive Bayes with tf-idf features')
mnb_tfidf_predictions = train_predict_evaluate_model(classifier=mnb,
                                           train_features=tfidf_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features,
                                           test_labels=test_labels)

# Support Vector Machine with tfidf features
print('Support Vector Machine with tf-idf features')
svm_tfidf_predictions = train_predict_evaluate_model(classifier=svm,
                                           train_features=tfidf_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features,
                                           test_labels=test_labels)

# Support Vector Machine with averaged word vector features
print('Support Vector Machine with averaged word vector features')
svm_avgwv_predictions = train_predict_evaluate_model(classifier=svm,
                                           train_features=avg_wv_train_features,
                                           train_labels=train_labels,
                                           test_features=avg_wv_test_features,
                                           test_labels=test_labels)

# Support Vector Machine with tfidf weighted averaged word vector features
print('Support Vector Machine with tfidf weighted averaged word vector features')
svm_tfidfwv_predictions = train_predict_evaluate_model(classifier=svm,
                                           train_features=tfidf_wv_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_wv_test_features,
                                           test_labels=test_labels)

Multinomial Naive Bayes with Bag of Words features
Accuracy: 0.62
Precision: 0.72
Recall: 0.62
F1 Score: 0.61
Support Vector Machine with Bag of Words features
Accuracy: 0.66
Precision: 0.69
Recall: 0.66
F1 Score: 0.65
Multinomial Naive Bayes with tf-idf features
Accuracy: 0.68
Precision: 0.74
Recall: 0.68
F1 Score: 0.67
Support Vector Machine with tf-idf features
Accuracy: 0.77
Precision: 0.76
Recall: 0.77
F1 Score: 0.76
Support Vector Machine with averaged word vector features
Accuracy: 0.44
Precision: 0.47
Recall: 0.44
F1 Score: 0.4
Support Vector Machine with tfidf weighted averaged word vector features
Accuracy: 0.41
Precision: 0.5
Recall: 0.41
F1 Score: 0.41


The best model appears to be the SVM with TF-IDF features. Let's look at its confusion matrix:

In [28]:
# Confusion matrix
import pandas as pd
cm = metrics.confusion_matrix(test_labels, svm_tfidf_predictions)
pd.DataFrame(cm, index=svm.classes_, columns=svm.classes_)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,129,1,0,2,1,1,3,3,1,2,5,4,4,4,7,40,3,10,6,12
1,0,209,7,6,7,11,6,2,2,2,0,2,5,3,4,2,0,0,2,1
2,0,19,209,19,11,19,4,0,1,1,1,1,6,0,0,0,0,1,3,0
3,0,10,21,193,17,1,9,4,1,1,1,2,6,2,1,0,1,2,0,0
4,0,4,6,18,197,4,6,2,3,1,1,2,10,4,1,0,1,0,1,0
5,0,21,20,1,3,242,0,0,0,0,0,0,1,0,1,1,1,1,0,0
6,0,2,5,12,12,2,239,11,3,2,0,1,11,2,3,0,3,1,0,0
7,2,3,2,3,0,2,7,224,19,1,2,0,10,3,3,1,4,1,1,0
8,1,0,1,3,4,1,4,28,238,3,1,3,0,1,3,0,2,3,1,0
9,3,2,1,0,1,3,4,1,2,258,10,1,2,2,1,3,1,4,0,1
