CLASSIFICATION

We will classify consumer posts based on the topic of the message.

We'll use a dataset containing about 18,000 posts (messages) on 20 topics. It's part of the sklearn dataset collection. Posts are divided into "train" and "test" subsets.

Install the modules we'll need:

In [1]:
import sys

!{sys.executable} -m pip install numpy
import numpy as np

# The PyPI package is named scikit-learn; it is imported as sklearn.
!{sys.executable} -m pip install scikit-learn
from sklearn import metrics

!{sys.executable} -m pip install pandas
import pandas as pd


Download the dataset:

In [2]:
from sklearn.datasets import fetch_20newsgroups 

Of the 20 available topics, we will use posts on only 4 (classes): atheism, religion, computer graphics, and space science.

In [4]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

twenty_train = fetch_20newsgroups(categories = categories,
                                  subset = 'train', 
                                  shuffle = False, 
                                  remove = ('headers', 'footers', 'quotes')) 

twenty_test = fetch_20newsgroups(categories = categories,
                                 subset='test', 
                                 shuffle=False,
                                 remove=('headers', 'footers', 'quotes')) 

Let's inspect the training data. First, have a look at one of the posts:

In [5]:
print(twenty_train.data[7])       

->	First I want to start right out and say that I'm a Christian.  It 
->makes sense to be one.  Have any of you read Tony Campollo's book- liar, 
->lunatic, or the real thing?  (I might be a little off on the title, but he 
->writes the book.  Anyway he was part of an effort to destroy Christianity, 
->in the process he became a Christian himself.

Sounds like you are saying he was a part of some conspiracy.  Just what organization did he 
belong to? Does it have a name?

->	The book says that Jesus was either a liar, or he was crazy ( a 
->modern day Koresh) or he was actually who he said he was.

Logic alert - artificial trifercation.  The are many other possible explainations.  Could have been
that he never existed.  There have been some good points made in this group that is not 
impossible  that JC is an amalgam of a number of different myths, Mithra comes to mind.

->	Some reasons why he wouldn't be a liar are as follows.  Who would 
->die for a lie?  Wouldn't people be able to t

The classes (topics) that you will be predicting for each message are encoded as numbers and can be accessed via the attribute .target; their names can be accessed via .target_names:

In [6]:
print("Category names: ", twenty_train.target_names)    
print("Categories for first 10 observations: ", twenty_train.target[:10])     
print("Number of posts in the training dataset: ", twenty_train.filenames.shape[0]) 

Category names:  ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
Categories for first 10 observations:  [0 2 1 0 2 3 0 0 2 2]
Number of posts in the training dataset:  2034
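
To read the numeric labels, index .target_names with each value of .target. A minimal sketch using the values printed above (copied by hand here, so no download is needed; the variable names are illustrative):

```python
# Values copied from the output above; in the notebook they come from
# twenty_train.target_names and twenty_train.target[:10].
target_names = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
first_targets = [0, 2, 1, 0, 2, 3, 0, 0, 2, 2]

# Each numeric label is an index into target_names.
decoded = [target_names[t] for t in first_targets]
for position, name in enumerate(decoded):
    print(position, name)
```

Post 7, the one printed earlier, decodes to alt.atheism.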


Define a function to be used later for feature matrix description:

In [7]:
def fmat_descr_fun(your_feature_matrix, your_vectorizer):
    """Print the shape, first feature names, and sparsity of a sparse feature matrix."""
    n_posts, n_features = your_feature_matrix.shape
    print("Dimensions (number of posts x number of features): ", your_feature_matrix.shape)
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
    print("The first 5 features - names: ", your_vectorizer.get_feature_names_out()[:5].tolist())
    print("Share of non-zero elements in the matrix: ",
          your_feature_matrix.nnz / (n_posts * n_features))
    print("Average number of features present, per post: ",
          round(your_feature_matrix.nnz / n_posts, 1))

* FEATURE EXTRACTION

Let's do feature extraction for our TRAIN data using the "bag-of-words" and TF-IDF methods. We'll use the option stop_words = 'english' to remove stopwords from the set of features. 
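
To make the difference between the two methods concrete, here is a hand-rolled sketch on a toy corpus. The corpus, the whitespace tokenizer, and the three-word stop list are illustrative assumptions, not what scikit-learn uses internally. The IDF formula, ln((1 + n) / (1 + df)) + 1 followed by L2 row normalization, is the one documented for TfidfVectorizer with its default smooth_idf=True.

```python
import math
from collections import Counter

docs = ["the rocket reached orbit",
        "the image shows the rocket",
        "render the image"]
stop_words = {"the", "a", "of"}  # hypothetical tiny stop list

# Tokenize on whitespace and drop stop words.
tokens = [[w for w in d.split() if w not in stop_words] for d in docs]
vocab = sorted({w for doc in tokens for w in doc})

# Bag-of-words: raw term counts per document.
bow = [[Counter(doc)[w] for w in vocab] for doc in tokens]

# Smoothed inverse document frequency per term.
n_docs = len(docs)
df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# TF-IDF: count * idf, then scale each row to unit L2 norm.
tfidf = []
for row in bow:
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    tfidf.append([x / norm for x in weighted])

print(vocab)
print(bow[0])
print([round(x, 3) for x in tfidf[0]])
```

In the first document, "orbit" (found in one document) ends up with a larger weight than "rocket" (found in two), even though both occur once. Bag-of-words would treat them equally.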

**First, let's apply the TF-IDF method to the TRAIN data:**

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words = 'english', norm = 'l2')
X_train_tfidf = tfidf_vectorizer.fit_transform(twenty_train.data)

fmat_descr_fun(X_train_tfidf,tfidf_vectorizer)

Dimensions (number of posts x number of features):  (2034, 26576)
The first 5 features - names:  ['00', '000', '0000', '00000', '000000']
Share of non-zero elements in the matrix:  0.002472159028010871
Average number of features present, per post:  65.7
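
The last two lines of the output are linked: the average number of features per post equals the share of non-zero elements times the number of columns. A quick check with the numbers printed above (variable names are illustrative):

```python
n_posts, n_features = 2034, 26576      # matrix shape from the output above
share_nonzero = 0.002472159028010871   # share of non-zero elements, as printed

# nnz / n_posts == (nnz / (n_posts * n_features)) * n_features
avg_features_per_post = share_nonzero * n_features
print(round(avg_features_per_post, 1))
```

In other words, a typical post uses only a few dozen of the 26,576 vocabulary terms, which is why the matrix is kept in sparse form.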


Let's have a look at the first 5 rows of the TF-IDF matrix:

In [9]:
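# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
tfidf_features_names = tfidf_vectorizer.get_feature_names_out()
# Densify only for display: the full dense matrix is large.
X_train_tfidf_table = pd.DataFrame(data = X_train_tfidf.toarray(), columns = tfidf_features_names)
X_train_tfidf_table.head(5)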

,00,000,0000,00000,000000,000005102000,000062david42,0001,000100255pixel,00041032,...,zurich,zurvanism,zus,zvi,zwaartepunten,zwak,zwakke,zware,zwarte,zyxel
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's transform the TEST dataset using the TF-IDF method.

**IMPORTANT: To transform the test data, use the feature names extracted from the train data and compute the values for those features on the test data (do not create new feature names based on the test data). Therefore, 1) we do not define a new vectorizer, and 2) we call .transform (not .fit_transform) on the existing vectorizer with the test data.**

In [10]:
X_test_tfidf = tfidf_vectorizer.transform(twenty_test.data)
fmat_descr_fun(X_test_tfidf,tfidf_vectorizer)

Dimensions (number of posts x number of features):  (1353, 26576)
The first 5 features - names:  ['00', '000', '0000', '00000', '000000']
Share of non-zero elements in the matrix:  0.0023848546254604903
Average number of features present, per post:  63.4
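
The fit-on-train, transform-on-test rule can be seen on a toy example. ToyCountVectorizer below is a hypothetical, simplified stand-in written for this illustration; it is not scikit-learn's implementation, and the documents are made up.

```python
class ToyCountVectorizer:
    """Hypothetical, simplified stand-in for a count vectorizer."""

    def fit(self, docs):
        # Learn the vocabulary from the given documents only.
        words = sorted({w for d in docs for w in d.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Count only words already in the vocabulary; unseen words are dropped.
        rows = []
        for d in docs:
            row = [0] * len(self.vocabulary_)
            for w in d.lower().split():
                if w in self.vocabulary_:
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

train_docs = ["rocket launch today", "render image today"]
test_docs = ["rocket image telescope"]  # "telescope" never appears in train

vec = ToyCountVectorizer().fit(train_docs)
X_train = vec.transform(train_docs)
X_test = vec.transform(test_docs)

print(sorted(vec.vocabulary_))
print(X_test)
```

The test matrix has the same number of columns as the train matrix, which is what lets a classifier fitted on one be applied to the other. This matches the outputs above, where both matrices have 26576 columns.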


**Now let's apply the "bag-of-words" method to the TRAIN data:**

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

bow_vectorizer = CountVectorizer(stop_words = 'english') 
X_train_bow = bow_vectorizer.fit_transform(twenty_train.data)

Let's describe the resulting matrix:

In [12]:
fmat_descr_fun(X_train_bow,bow_vectorizer)

Dimensions (number of posts x number of features):  (2034, 26576)
The first 5 features - names:  ['00', '000', '0000', '00000', '000000']
Share of non-zero elements in the matrix:  0.002472159028010871
Average number of features present, per post:  65.7


**EXERCISE: Transform the TEST dataset using the "bag-of-words" method. IMPORTANT: To transform the test data, use the feature names extracted from the train data and do the counts for those feature names on the test data (do not create new feature names based on the test data). Therefore, 1) we do not define a new vectorizer, and 2) we call .transform (not .fit_transform) on the existing vectorizer with the test data.**

In [15]:
X_test_bow = bow_vectorizer.transform(twenty_test.data)
fmat_descr_fun(X_test_bow,bow_vectorizer)

Dimensions (number of posts x number of features):  (1353, 26576)
The first 5 features - names:  ['00', '000', '0000', '00000', '000000']
Share of non-zero elements in the matrix:  0.0023848546254604903
Average number of features present, per post:  63.4


* NAIVE BAYES CLASSIFIER

Let's classify the posts using a Naive Bayes classifier with the TF-IDF feature matrix:

In [26]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha=0.1) 
clf.fit(X_train_tfidf, twenty_train.target)
predicted_nb = clf.predict(X_test_tfidf)

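Note: you can tune the hyperparameter alpha by trying different values > 0. With alpha = 0, a feature that never occurred in a class's training documents gets probability zero for that class, so any test document containing that feature is assigned a probability of zero for the class. (Words that are absent from the training vocabulary altogether are simply ignored by the vectorizer.)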
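
The role of alpha can be sketched with the smoothed estimate of a word's probability within one class, (count + alpha) / (total + alpha * V), where V is the vocabulary size. The word counts below are made up for illustration, and the helper name is hypothetical:

```python
def smoothed_word_prob(word, class_counts, vocab_size, alpha):
    """Additive (Lidstone) smoothing: (count + alpha) / (total + alpha * V)."""
    total = sum(class_counts.values())
    return (class_counts.get(word, 0) + alpha) / (total + alpha * vocab_size)

# Made-up word counts for one class; "orbit" never occurs in this class.
graphics_counts = {"render": 3, "image": 2}
vocab_size = 3  # render, image, orbit

p_zero = smoothed_word_prob("orbit", graphics_counts, vocab_size, alpha=0.0)
p_small = smoothed_word_prob("orbit", graphics_counts, vocab_size, alpha=0.1)

print(p_zero)   # unseen word: probability 0, which zeroes the whole product
print(p_small)  # small but positive probability
```

Because Naive Bayes multiplies the per-word probabilities, a single zero factor rules the class out regardless of the other words in the post.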

Evaluate the predictive power:

In [27]:
cm = metrics.confusion_matrix(twenty_test.target, predicted_nb)
print("Confusion matrix: \n", pd.DataFrame(data = cm, 
                                           columns = twenty_train.target_names,
                                           index = twenty_train.target_names),"\n")
print("Accuracy rate: ", metrics.accuracy_score(twenty_test.target, predicted_nb),"\n") 

Confusion matrix: 
                     alt.atheism  comp.graphics  sci.space  talk.religion.misc
alt.atheism                 224             11         35                  49
comp.graphics                 7            358         23                   1
sci.space                    21             17        353                   3
talk.religion.misc           84              9         23                 135 

Accuracy rate:  0.7908351810790836 
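
The accuracy rate can be recovered from the confusion matrix itself: it is the sum of the diagonal (correct predictions) divided by the total number of test posts. The sketch below copies the matrix from the output above and also computes per-class recall, which the notebook does not print:

```python
labels = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
# Confusion matrix copied from the output above (rows = true class, columns = predicted).
cm = [[224,  11,  35,  49],
      [  7, 358,  23,   1],
      [ 21,  17, 353,   3],
      [ 84,   9,  23, 135]]

correct = sum(cm[i][i] for i in range(len(cm)))
total = sum(sum(row) for row in cm)
accuracy = correct / total

# Recall per class: correct predictions divided by the row total.
recall = {labels[i]: cm[i][i] / sum(cm[i]) for i in range(len(cm))}

print(round(accuracy, 4))
for name, value in recall.items():
    print(name, round(value, 3))
```

Per-class recall shows what the single accuracy number hides: talk.religion.misc is recognized far less reliably than comp.graphics or sci.space.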



**EXERCISE: Fit the Naive Bayes classifier with "bag-of-words" features in the cell below and compare its performance to the classifier using TF-IDF features:**

In [28]:
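# Use a separate classifier so the TF-IDF model in clf is not overwritten.
clf_bow = MultinomialNB(alpha=0.1)
clf_bow.fit(X_train_bow, twenty_train.target)
predicted_nb_bow = clf_bow.predict(X_test_bow)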
cm_bow = metrics.confusion_matrix(twenty_test.target, predicted_nb_bow)
print("Confusion matrix: \n", pd.DataFrame(data = cm_bow, 
                                           columns = twenty_train.target_names,
                                           index = twenty_train.target_names),"\n")
print("Accuracy rate: ", metrics.accuracy_score(twenty_test.target, predicted_nb_bow),"\n") 

Confusion matrix: 
                     alt.atheism  comp.graphics  sci.space  talk.religion.misc
alt.atheism                 227              4         28                  60
comp.graphics                11            351         24                   3
sci.space                    19             21        343                  11
talk.religion.misc           82              7         21                 141 

Accuracy rate:  0.7849223946784922 



* SUPPORT VECTOR MACHINES (SVM) CLASSIFIER (for TF-IDF features only) 

In [19]:
from sklearn import linear_model

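To train a linear SVM, set loss = 'hinge' in linear_model.SGDClassifier, which fits the model by stochastic gradient descent. Because the training data is shuffled at random, results can vary slightly between runs unless random_state is set: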

In [21]:
clf_svm = linear_model.SGDClassifier(loss='hinge') 
clf_svm.fit(X_train_tfidf, twenty_train.target) 
predicted_svm = clf_svm.predict(X_test_tfidf)  
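
For intuition about what loss='hinge' optimizes: for a label y in {-1, +1} and a raw decision score s, the hinge loss is max(0, 1 - y*s). It is zero once an example is on the correct side with a margin of at least 1, and grows linearly otherwise. A minimal sketch for the binary case (SGDClassifier handles several classes by combining binary classifiers one-versus-all; the function name is illustrative):

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw decision score."""
    return max(0.0, 1.0 - y * score)

print(hinge_loss(+1, 2.5))   # correct with a wide margin: no loss
print(hinge_loss(+1, 0.4))   # correct but inside the margin: some loss
print(hinge_loss(+1, -1.0))  # wrong side of the boundary: larger loss
```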



Let's evaluate the SVM classifier performance:

In [22]:
cm = metrics.confusion_matrix(twenty_test.target, predicted_svm)
print("Confusion matrix: \n", pd.DataFrame(data = cm, 
                                           columns = twenty_train.target_names,
                                           index = twenty_train.target_names),"\n")
print("Accuracy score: ", metrics.accuracy_score(twenty_test.target, predicted_svm),"\n") 

Confusion matrix: 
                     alt.atheism  comp.graphics  sci.space  talk.religion.misc
alt.atheism                 209              9         32                  69
comp.graphics                19            348         17                   5
sci.space                    38             19        327                  10
talk.religion.misc           71             13         17                 150 

Accuracy score:  0.7642276422764228 



* LOGIT-BASED CLASSIFIER

In [23]:
# loss='log' was renamed to 'log_loss' in scikit-learn 1.1 (the old name was removed in 1.3).
clf_log = linear_model.SGDClassifier(loss='log_loss')
clf_log.fit(X_train_tfidf, twenty_train.target)
predicted_log = clf_log.predict(X_test_tfidf)

print("Accuracy score: ", metrics.accuracy_score(twenty_test.target, predicted_log),"\n") 
cm = metrics.confusion_matrix(twenty_test.target, predicted_log)                                                     
print("Confusion matrix: \n", pd.DataFrame(data = cm, 
                                           columns = twenty_train.target_names,
                                           index = twenty_train.target_names), "\n")

Accuracy score:  0.770879526977088 

Confusion matrix: 
                     alt.atheism  comp.graphics  sci.space  talk.religion.misc
alt.atheism                 197             13         51                  58
comp.graphics                 8            354         26                   1
sci.space                    18             21        354                   1
talk.religion.misc           66             16         31                 138 
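
With the logistic loss, the model's raw score s is mapped to a probability by the sigmoid function, and the loss for a label y in {-1, +1} is ln(1 + exp(-y*s)). This loss never reaches exactly zero, so even confidently correct examples keep contributing a little. A minimal sketch for the binary case, with illustrative function names:

```python
import math

def sigmoid(score):
    """Map a raw decision score to a probability for the positive class."""
    return 1.0 / (1.0 + math.exp(-score))

def log_loss_single(y, score):
    """Logistic loss for a label y in {-1, +1} and a raw decision score."""
    return math.log(1.0 + math.exp(-y * score))

print(round(sigmoid(0.0), 3))               # score 0 means a 50/50 call
print(round(log_loss_single(+1, 2.5), 4))   # confident and correct: small loss
print(round(log_loss_single(+1, -1.0), 4))  # wrong side: larger loss
```

For a positive label, the loss equals -ln(sigmoid(s)), the negative log of the probability the model gives to the true class.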





**EXERCISE: Provide your comments on the performance of:**

**1) Naive Bayes classifier with "bag-of-words" versus TF-IDF features**

**2) Naive Bayes, Logit-Based and SVM classifiers with TF-IDF features. Which of the 3 performed best? Did any classifier perform better than the others at predicting a particular topic? When a classifier made a mistake and misclassified a "Computer Graphics" post, to which class was the post typically assigned? What about a post on the "Atheism" topic?**

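1) The Naive Bayes classifier with "bag-of-words" features performs slightly worse than with TF-IDF features: accuracy 0.785 versus 0.791.

2) Naive Bayes performs best overall (accuracy 0.791, versus 0.771 for Logit-Based and 0.764 for SVM), and it is clearly ahead on alt.atheism (224 correct, versus 209 for SVM and 197 for Logit-Based). SVM is best at talk.religion.misc (150 correct). On comp.graphics and sci.space, Naive Bayes and Logit-Based are nearly tied (358 versus 354, and 353 versus 354). Misclassified "Computer Graphics" posts are typically assigned to sci.space (the exception is the SVM run, which sends slightly more of them to alt.atheism); misclassified "Atheism" posts are typically assigned to talk.religion.misc.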
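
These observations can be checked directly from the three confusion matrices. The sketch below copies them from the outputs above and reports, for each classifier, its accuracy and the class that comp.graphics and alt.atheism posts are most often mistaken for (the helper name is illustrative):

```python
labels = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
# Confusion matrices copied from the outputs above (rows = true, columns = predicted).
matrices = {
    'Naive Bayes': [[224, 11, 35, 49], [7, 358, 23, 1], [21, 17, 353, 3], [84, 9, 23, 135]],
    'SVM':         [[209, 9, 32, 69], [19, 348, 17, 5], [38, 19, 327, 10], [71, 13, 17, 150]],
    'Logit':       [[197, 13, 51, 58], [8, 354, 26, 1], [18, 21, 354, 1], [66, 16, 31, 138]],
}

def top_confusion(cm, true_idx):
    """Return the wrong class that a given true class is most often assigned to."""
    errors = [(count, j) for j, count in enumerate(cm[true_idx]) if j != true_idx]
    return labels[max(errors)[1]]

for name, cm in matrices.items():
    accuracy = sum(cm[i][i] for i in range(4)) / sum(map(sum, cm))
    print(name, round(accuracy, 4),
          '| comp.graphics ->', top_confusion(cm, 1),
          '| alt.atheism ->', top_confusion(cm, 0))
```

Since SGDClassifier was run without a fixed random_state, the SVM and Logit numbers may differ somewhat on a rerun.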