In [1]:
from __future__ import print_function

import lime
import numpy as np
import sklearn
import sklearn.ensemble
import sklearn.feature_extraction.text  # needed for TfidfVectorizer below
import sklearn.metrics

## Fetching data, training a classifier

In the previous tutorial, we looked at LIME in the two-class case. In this tutorial, we use the 20 newsgroups dataset again, this time with all 20 classes.

In [4]:
from sklearn.datasets import fetch_20newsgroups

In [5]:
newsgroups_train = fetch_20newsgroups(data_home="./dataset",subset='train')
newsgroups_test = fetch_20newsgroups(data_home="./dataset",subset='test')

In [6]:
[x for x in newsgroups_train.target_names]

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [7]:
# making class names shorter
class_names = [x.split('.')[-1] if 'misc' not in x else '.'.join(x.split('.')[-2:]) for x in newsgroups_train.target_names]
print(class_names)
class_names[3] = 'pc.hardware'
class_names[4] = 'mac.hardware'
print(','.join(class_names))

['atheism', 'graphics', 'ms-windows.misc', 'hardware', 'hardware', 'x', 'misc.forsale', 'autos', 'motorcycles', 'baseball', 'hockey', 'crypt', 'electronics', 'med', 'space', 'christian', 'guns', 'mideast', 'politics.misc', 'religion.misc']
atheism,graphics,ms-windows.misc,pc.hardware,mac.hardware,x,misc.forsale,autos,motorcycles,baseball,hockey,crypt,electronics,med,space,christian,guns,mideast,politics.misc,religion.misc
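
The first printed list above contains `'hardware'` twice, which is why indices 3 and 4 are overwritten by hand. A minimal sketch of how such collisions could be detected automatically is shown here. The helpers `shorten` and `find_duplicates` are hypothetical names introduced for illustration, and the input is a small hand-written subset of the target names rather than the full dataset.

```python
from collections import Counter


def shorten(name):
    """Same rule as the cell above: keep the last dotted component,
    or the last two components if the name contains 'misc'."""
    parts = name.split('.')
    return parts[-1] if 'misc' not in name else '.'.join(parts[-2:])


def find_duplicates(names):
    """Return the sorted names that occur more than once (hypothetical helper)."""
    counts = Counter(names)
    return sorted(n for n, c in counts.items() if c > 1)


# Hand-written subset of newsgroups_train.target_names, for illustration only.
target_names = [
    'comp.sys.ibm.pc.hardware',
    'comp.sys.mac.hardware',
    'misc.forsale',
    'talk.politics.misc',
]

short_names = [shorten(n) for n in target_names]
duplicates = find_duplicates(short_names)

print(short_names)
print(duplicates)
```

Any name reported in `duplicates` needs a manual override like the two assignments in the cell above.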


In [10]:
# inspect the dataset: uncomment a line below, or print its description
# print(newsgroups_train.data)
# print(newsgroups_train.target_names)
print(newsgroups_train.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

In [11]:
print(newsgroups_train.filenames[:2])
print(newsgroups_train.filenames.shape)

print(newsgroups_train.target[:10])
print(newsgroups_train.target.shape)

['./20news_home\\20news-bydate-train\\rec.autos\\102994'
 './20news_home\\20news-bydate-train\\comp.sys.mac.hardware\\51861']
(11314,)
[ 7  4  4  1 14 16 13  3  2  4]
(11314,)


Again, let's use the TF-IDF vectorizer, which is commonly used for text. First, a quick look at the raw test documents it will transform.

In [21]:
print(len(newsgroups_test.data))
print(newsgroups_test.data[:1])
print(newsgroups_test.data[:2])

7532
['From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)\nSubject: Need info on 88-89 Bonneville\nOrganization: University at Buffalo\nLines: 10\nNews-Software: VAX/VMS VNEWS 1.41\nNntp-Posting-Host: ubvmsd.cc.buffalo.edu\n\n\n I am a little confused on all of the models of the 88-89 bonnevilles.\nI have heard of the LE SE LSE SSE SSEI. Could someone tell me the\ndifferences are far as features or performance. I am also curious to\nknow what the book value is for prefereably the 89 model. And how much\nless than book value can you usually get them for. In other words how\nmuch are they in demand this time of year. I have heard that the mid-spring\nearly summer is the best time to buy.\n\n\t\t\tNeil Gandler\n']
['From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)\nSubject: Need info on 88-89 Bonneville\nOrganization: University at Buffalo\nLines: 10\nNews-Software: VAX/VMS VNEWS 1.41\nNntp-Posting-Host: ubvmsd.cc.buffalo.edu\n\n\n I am a little confused on all of the models of 

In [22]:
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False)
train_vectors = vectorizer.fit_transform(newsgroups_train.data)
test_vectors = vectorizer.transform(newsgroups_test.data)
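
The cell above fits the vectorizer on the training texts only and then reuses the learned vocabulary for the test texts. A small sketch of that behaviour on a toy corpus may help. The documents and the `toy_` names are made up for illustration, and the vectorizer uses its default tokenization apart from `lowercase=False`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only.
toy_train = [
    "the car engine is loud",
    "the Car dealer sold the car",
    "hockey game tonight",
]
toy_test = ["a very loud hockey game"]

# lowercase=False keeps 'Car' and 'car' as separate features, as in the cell above.
toy_vectorizer = TfidfVectorizer(lowercase=False)
toy_train_vectors = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary
toy_test_vectors = toy_vectorizer.transform(toy_test)        # reuses it unchanged

vocab = toy_vectorizer.vocabulary_
print(sorted(vocab))
print(toy_train_vectors.shape, toy_test_vectors.shape)
print(toy_test_vectors.nnz)  # number of test tokens found in the training vocabulary
```

Words in the test document that never appeared in training (here `very`) are silently ignored, so both matrices end up with the same number of columns.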

In [30]:
print(type(train_vectors))
print(train_vectors.shape)
print(train_vectors[0].shape)
print(train_vectors[0])

print(test_vectors.shape)

<class 'scipy.sparse.csr.csr_matrix'>
(11314, 155448)
(1, 155448)
  (0, 59369)	0.17908353775015898
  (0, 132203)	0.13577137702003086
  (0, 155175)	0.044125915410153815
  (0, 105796)	0.03881774495734832
  (0, 105351)	0.09519041111814386
  (0, 51612)	0.10755152832954751
  (0, 86626)	0.05908531795040087
  (0, 128778)	0.06393020017755233
  (0, 136529)	0.06728923560344167
  (0, 128204)	0.07135462952190691
  (0, 118839)	0.16006402566276856
  (0, 121176)	0.029779770743408693
  (0, 155163)	0.06201334318028585
  (0, 123788)	0.07838557702811846
  (0, 153723)	0.08525072063585917
  (0, 133980)	0.03312754295109964
  (0, 121755)	0.0864103260109497
  (0, 128695)	0.06450082671060631
  (0, 137820)	0.11128186113339457
  (0, 155067)	0.06171071861129325
  (0, 145341)	0.11128186113339457
  (0, 115422)	0.09979707979259114
  (0, 131877)	0.07243362676554121
  (0, 130741)	0.09316443062335987
  (0, 148355)	0.17908353775015898
  :	:
  (0, 4664)	0.06410135095294822
  (0, 59613)	0.019456489657940477
  (0, 72526)	0

This time we will use Multinomial Naive Bayes for classification, so that we can refer to the scikit-learn documentation on this dataset:
https://scikit-learn.org/stable/datasets/#filtering-text-for-more-realistic-training

In [None]:
from sklearn.naive_bayes import MultinomialNB

# alpha is the additive smoothing parameter
nb = MultinomialNB(alpha=0.01)
nb.fit(train_vectors, newsgroups_train.target)
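
To check that a fitted multiclass model behaves sensibly, one option is to predict on held-out texts and compute a weighted F1 score. The sketch below does this on a tiny made-up three-class corpus rather than the newsgroups data, so the score says nothing about the real classifier. All `toy_` names and documents are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

# Made-up three-class corpus, for illustration only.
toy_docs = [
    "engine car wheels", "car dealer engine",
    "hockey goal ice", "ice hockey puck",
    "orbit rocket launch", "rocket moon orbit",
]
toy_labels = [0, 0, 1, 1, 2, 2]
toy_class_names = ['autos', 'hockey', 'space']

toy_vectorizer = TfidfVectorizer(lowercase=False)
X = toy_vectorizer.fit_transform(toy_docs)

toy_nb = MultinomialNB(alpha=0.01)
toy_nb.fit(X, toy_labels)

# Held-out texts with their expected labels.
new_docs = ["car engine", "rocket orbit", "hockey puck"]
true_labels = [0, 2, 1]
X_new = toy_vectorizer.transform(new_docs)

pred = toy_nb.predict(X_new)         # one class index per document
proba = toy_nb.predict_proba(X_new)  # one row per document, one column per class

print([toy_class_names[i] for i in pred])
print(proba.shape)
print(f1_score(true_labels, pred, average='weighted'))
```

The same two calls, `predict` and `f1_score(..., average='weighted')`, would apply to `nb`, `test_vectors`, and `newsgroups_test.target` from the cells above.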