# Topic Classification of Sentences

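Here we will see how to use Naive Bayes for multi-class classification.

Import the class sklearn.naive_bayes.MultinomialNB, the multinomial Naive Bayes classifier. It works on count features such as word frequencies and handles any number of classes. It is a different model from multinomial logistic regression.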

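If your features are binary (a word is present or absent) rather than counts, BernoulliNB may be a better fit. The choice depends on the feature type, not on the number of classes.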

I will also compare the accuracy obtained with BOW (bag-of-words) and TF-IDF vectorization.
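
To make the classifier less of a black box, here is a minimal pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It illustrates the idea only and is not scikit-learn's implementation; the toy sentences, the labels, and the helper names train_nb and predict_nb are made up for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-probabilities."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    # log P(class): fraction of training sentences in each class
    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        # alpha is added to every word count so unseen words never get probability 0
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(text, log_prior, log_like):
    """Return the class with the highest log prior + summed word log-likelihoods."""
    scores = {}
    for c in log_prior:
        # words outside the training vocabulary are skipped
        known = [w for w in text.lower().split() if w in log_like[c]]
        scores[c] = log_prior[c] + sum(log_like[c][w] for w in known)
    return max(scores, key=scores.get)

# toy data, assumed for illustration only
texts = ["new running shoes", "running shoes sale",
         "phone battery life", "phone screen battery"]
labels = ["wear", "wear", "electronic", "electronic"]

log_prior, log_like = train_nb(texts, labels)
print(predict_nb("cheap running shoes", log_prior, log_like))
print(predict_nb("battery life", log_prior, log_like))
```

The score of a class is its log prior plus one log-likelihood term per word occurrence, which is why word counts are the natural input for this model.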

In [1]:
from sklearn import metrics
import numpy as np
import sklearn.datasets
import re
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

Define some helper functions for preprocessing.

In [2]:
# keep only ASCII letters, digits and spaces, then collapse repeated spaces
def clearstring(string):
    string = re.sub('[^A-Za-z0-9 ]+', '', string)
    string = string.split(' ')
    string = filter(None, string)
    string = [y.strip() for y in string]
    string = ' '.join(string)
    return string

# sklearn.datasets.load_files reads each file as a single element,
# so we split every file into one sample per line
def separate_dataset(trainset):
    datastring = []
    datatarget = []
    for i in range(len(trainset.data)):
        data_ = trainset.data[i].split('\n')
        # list() is needed on Python 3; on Python 2, filter already returns a list
        data_ = list(filter(None, data_))
        for n in range(len(data_)):
            data_[n] = clearstring(data_[n])
        datastring += data_
        for n in range(len(data_)):
            datatarget.append(trainset.target[i])
    return datastring, datatarget
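
To see what these helpers do, the sketch below runs condensed but equivalent versions of them on a stand-in object. The SimpleNamespace object named fake and its two made-up documents are assumptions for illustration; the real input is the object returned by load_files.

```python
import re
from types import SimpleNamespace

def clearstring(string):
    # same rule as above: drop everything except ASCII letters, digits, spaces
    string = re.sub('[^A-Za-z0-9 ]+', '', string)
    # split() without arguments also collapses repeated spaces
    return ' '.join(string.split())

def separate_dataset(trainset):
    datastring, datatarget = [], []
    for document, label in zip(trainset.data, trainset.target):
        # one sample per non-empty line, each inheriting the file's label
        lines = [clearstring(line) for line in document.split('\n') if line]
        datastring += lines
        datatarget += [label] * len(lines)
    return datastring, datatarget

# hypothetical stand-in for the load_files result: two files, labels 4 and 2
fake = SimpleNamespace(
    data=["Nike's new shoes!\nJust do it.\n", "I am so hungry :(\n"],
    target=[4, 2],
)

sentences, targets = separate_dataset(fake)
print(sentences)  # ['Nikes new shoes', 'Just do it', 'I am so hungry']
print(targets)    # [4, 4, 2]
```

Note that the regular expression also removes non-ASCII letters, so accented or non-Latin text would be stripped by this cleaning step.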

The local/ directory contains 6 classes, one subfolder each:
1. adidas (wear)
2. apple (electronic)
3. hungry (status)
4. kerajaan (government related)
5. nike (wear)
6. pembangkang (opposition related)

In [3]:
# change the encoding if your files are not UTF-8
trainset = sklearn.datasets.load_files(container_path = 'local', encoding = 'UTF-8')
trainset.data, trainset.target = separate_dataset(trainset)
print ("List of Classes: %s" %trainset.target_names)
print ("# of Samples: %s" %len(trainset.data))
print ("# of Labels: %s" %len(trainset.target))

List of Classes: ['adidas', 'apple', 'hungry', 'kerajaan', 'nike', 'pembangkang']
# of Samples: 25292
# of Labels: 25292


Change n to view a different sample from the dataset.

In [4]:
n=0
print("Sentence: %s" %trainset.data[n])
print("Class: %s" %trainset.target_names[trainset.target[n]])

Sentence: Najib emulating Trump in using tweets to spread his politics of fear hatred and lies
Class: pembangkang


Let's split the data into train (80%) and test (20%) sets.

In [5]:
train_data, test_data, train_Y, test_Y = train_test_split(trainset.data, trainset.target, test_size = 0.2)
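
The split above is random, so the accuracy figures change slightly from run to run. As a hedged sketch, the example below shows two optional train_test_split parameters: random_state for a repeatable split and stratify to keep class proportions equal in both parts. The toy sentences and the 80/20 class imbalance are made up for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# toy data: 100 placeholder sentences, 80 of class 0 and 20 of class 1
sentences = ["s%d" % i for i in range(100)]
labels = [0] * 80 + [1] * 20

train_X, test_X, train_y, test_y = train_test_split(
    sentences, labels,
    test_size=0.2,
    random_state=42,   # fixed seed: the same split on every run
    stratify=labels,   # keep the 80/20 class ratio in train and test
)

print(len(train_X), len(test_X))
print(Counter(test_y))
```

Stratifying can matter here because the six classes differ a lot in size.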

# Using BOW

Now convert the sentences into BOW (bag-of-words) vectors.

In [6]:
bow = CountVectorizer().fit(train_data) # create a BOW vectorizer and fit it on the training data
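
Conceptually, a BOW vector has one slot per vocabulary word and stores how often that word occurs in the sentence. The hand-rolled sketch below shows the idea on two made-up sentences; it is not how CountVectorizer is implemented, and the helper name to_bow is hypothetical.

```python
from collections import Counter

# toy training sentences, assumed for illustration
docs = ["nike shoes are great", "apple phone and apple watch"]

# "fit": collect the vocabulary and give every word a column index
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def to_bow(text):
    """'transform': count known words, ignore words outside the vocabulary."""
    vec = [0] * len(vocab)
    for word, count in Counter(text.split()).items():
        if word in index:
            vec[index[word]] = count
    return vec

print(vocab)
print(to_bow("apple phone and apple watch"))
print(to_bow("samsung phone"))  # 'samsung' was never seen, so it is dropped
```

CountVectorizer does the same kind of counting but additionally lowercases the text, by default ignores single-character tokens, and returns a sparse matrix.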

Train and test Naive Bayes using BOW

In [7]:
bow_train_X = bow.transform(train_data)
bow_test_X = bow.transform(test_data)

bow_bayes_multinomial = MultinomialNB().fit(bow_train_X, train_Y)
predicted = bow_bayes_multinomial.predict(bow_test_X)
print('accuracy validation set: %s' %np.mean(predicted == test_Y))

# print scores
print(metrics.classification_report(test_Y, predicted, target_names = trainset.target_names))

accuracy validation set: 0.854121367859
             precision    recall  f1-score   support

     adidas       0.92      0.77      0.84       313
      apple       0.81      0.88      0.84       450
     hungry       0.86      0.95      0.90      1068
   kerajaan       0.86      0.82      0.84      1367
       nike       0.89      0.83      0.86       317
pembangkang       0.85      0.83      0.84      1544

avg / total       0.86      0.85      0.85      5059
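
For reference, the columns of this report can be computed by hand. The sketch below does so for one class at a time on made-up labels; the function name per_class_scores is hypothetical and the toy predictions are not taken from the model above.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted
    fp = sum(t != label and p == label for t, p in pairs)  # predicted but wrong
    fn = sum(t == label and p != label for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# toy labels, assumed for illustration
y_true = ["nike", "nike", "nike", "apple", "apple", "hungry"]
y_pred = ["nike", "nike", "apple", "apple", "nike", "hungry"]

for label in ["nike", "apple", "hungry"]:
    print(label, per_class_scores(y_true, y_pred, label))
```

F1 is the harmonic mean of precision and recall. Plugging in the adidas row (precision 0.92, recall 0.77) gives about 0.84, which matches the report.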



Let's test the trained model using our own sentence.

Try using a sentence that would fall under one of the 6 classes. Don't forget to try something that is considered difficult to classify.

Example: "People who starve can not afford expensive shoes"

In [18]:
sentence = "election is important"

s = bow.transform([sentence])
l = bow_bayes_multinomial.predict(s)
print("Class: %s" %trainset.target_names[l[0]])

Class: pembangkang
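
When experimenting with your own sentences, keep in mind that words the vectorizer never saw during fitting are dropped. If nothing informative is left, the class priors decide, so the prediction leans toward the largest class. The sketch below shows this on a made-up, deliberately imbalanced toy corpus; it illustrates general behaviour and is not a diagnosis of the result above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy corpus, assumed for illustration: 3 'politics' sentences, 1 'wear'
texts = ["government budget vote", "government policy vote",
         "minister policy budget", "running shoes sale"]
labels = ["politics", "politics", "politics", "wear"]

vectorizer = CountVectorizer().fit(texts)
model = MultinomialNB().fit(vectorizer.transform(texts), labels)

# none of these tokens are in the fitted vocabulary
unseen = vectorizer.transform(["zzz qqq"])
print(unseen.nnz)                # count of non-zero features in the vector
print(model.predict(unseen)[0])  # decided by the class priors alone
```

Calling predict_proba instead of predict shows how confident the model is, which helps when judging difficult sentences.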


# Using TF-IDF

Now convert the data into a TF-IDF representation.

In [19]:
# TF-IDF weights are computed from the BOW counts, so BOW comes first
tfidf = TfidfTransformer().fit(bow_train_X) # create a TF-IDF transformer and fit it on the training counts
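
As a rough illustration of what the transformer computes, the sketch below applies the weighting by hand to a tiny count matrix. It assumes scikit-learn's documented defaults as I understand them (smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row); the count matrix and the helper name tfidf_row are made up.

```python
import math

# toy BOW counts: 2 documents x 3 terms, assumed for illustration
counts = [
    [1, 1, 0],   # document 0
    [0, 1, 2],   # document 1
]
n_docs = len(counts)
n_terms = len(counts[0])

# document frequency: in how many documents each term appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# smoothed inverse document frequency: rarer terms get larger weights
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight the counts by idf, then scale the row to unit L2 length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]

print(df)
print(idf)
print(tfidf_row(counts[0]))
```

Term 1 appears in both documents, so its idf is the minimum value of 1, while terms that occur in only one document are weighted more heavily.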

Train Naive Bayes using TF-IDF

In [20]:
# must get data from BOW first
tfidf_train_X = tfidf.transform(bow_train_X)
tfidf_test_X = tfidf.transform(bow_test_X)

tfidf_bayes_multinomial = MultinomialNB().fit(tfidf_train_X, train_Y)
predicted = tfidf_bayes_multinomial.predict(tfidf_test_X)
print('accuracy validation set: %s' %np.mean(predicted == test_Y))

# print scores
print(metrics.classification_report(test_Y, predicted, target_names = trainset.target_names))

accuracy validation set: 0.812611187982
             precision    recall  f1-score   support

     adidas       0.95      0.55      0.70       313
      apple       0.97      0.62      0.75       450
     hungry       0.80      0.92      0.85      1068
   kerajaan       0.86      0.83      0.84      1367
       nike       0.93      0.64      0.76       317
pembangkang       0.74      0.87      0.80      1544

avg / total       0.83      0.81      0.81      5059



Let's test the trained model using our own sentence.

Try using a sentence that would fall under one of the 6 classes. Don't forget to try something that is considered difficult to classify.

Example: "People who starve can not afford expensive shoes"

In [21]:
sentence = "this device is too good to be true"

# the model was trained on TF-IDF features, so apply both steps: BOW, then TF-IDF
s = tfidf.transform(bow.transform([sentence]))
l = tfidf_bayes_multinomial.predict(s)
print("Class: %s" %trainset.target_names[l[0]])

Class: pembangkang
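
A model trained on TF-IDF features must receive TF-IDF features at prediction time as well. One way to make it hard to skip a step is to chain everything in a scikit-learn Pipeline, so raw text goes in and the same transformations are always applied. The sketch below is one possible arrangement on a made-up toy corpus, not the setup used above; the step names "bow", "tfidf" and "nb" are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy corpus, assumed for illustration
texts = ["new running shoes", "running shoes sale",
         "phone battery life", "phone screen battery"]
labels = ["wear", "wear", "electronic", "electronic"]

clf = Pipeline([
    ("bow", CountVectorizer()),      # text -> word counts
    ("tfidf", TfidfTransformer()),   # counts -> TF-IDF weights
    ("nb", MultinomialNB()),         # TF-IDF weights -> class
])
clf.fit(texts, labels)

# raw text in, label out: both transformations run automatically
print(clf.predict(["phone battery"])[0])
```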
