# Lab 6: Document classification

## Load the 20 Newsgroups dataset
You can find more information at http://qwone.com/~jason/20Newsgroups/

The scikit-learn library already provides access to the 20 Newsgroups dataset.

Instead of retrieving the whole document collection, you can load the standard split of the collection into training and test sets.

In [1]:
from sklearn.datasets import fetch_20newsgroups
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

You can see the first 10 documents in the dataset using train.data[:10] and the classes of those documents using train.target[:10]. You will notice that the classes are represented as numbers. To see the class names you can use: train.target_names.
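
The mapping from numeric labels to class names can be sketched without downloading the dataset. The snippet below hard-codes the class names in label order as an assumption (in the notebook, read them from train.target_names instead) and uses the ten labels printed by train.target[:10] in this lab.

```python
# Class names of the 20 Newsgroups collection in label order.
# Hard-coded here as an assumption; in the notebook use train.target_names.
target_names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]

# Labels of the first 10 training documents, as printed by train.target[:10].
labels = [7, 4, 4, 1, 14, 16, 13, 3, 2, 4]

# Translate each numeric label into its newsgroup name.
label_names = [target_names[label] for label in labels]
print(label_names[:3])
```

With the real dataset the same comprehension works on train.target directly, since each label is an index into train.target_names.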

In [2]:
train.data[:10]

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

In [3]:
train.target[:10]

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])

## 1. Classifying the 20 Newsgroups collection


In [4]:
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer

def vectorize(train_data, test_data, min_df=1, max_df=1.0, use_idf=False):
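    '''
    Convert a collection of raw documents to a matrix of TF-IDF features.
    Equivalent to CountVectorizer followed by TfidfTransformer.

    With use_idf=False (the default here) the IDF weighting is switched off,
    so the features are L2-normalised term frequencies.
    '''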

    vectorizer = TfidfVectorizer(min_df=min_df, max_df=max_df, use_idf=use_idf)
    trainvec = vectorizer.fit_transform(train_data)
    testvec = vectorizer.transform(test_data)

    return trainvec, testvec


def classify(classifier, trainvec, train_target, testvec):
    classifier.fit(X=trainvec, y=train_target)
    classes = classifier.predict(testvec)
    return classes


def evaluate(test_target, classes):
    print('Accuracy score: {}'.format(metrics.accuracy_score(test_target, classes)))
    print()
    print(metrics.classification_report(test_target, classes))
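
The docstring of vectorize() says that TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer. The sketch below checks this on a small made-up corpus (toy_docs is hypothetical, not part of the lab data), using the same use_idf=False setting as the helper. It also shows that each document vector has unit L2 norm under scikit-learn's default normalisation.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Toy corpus (hypothetical, for illustration only).
toy_docs = ['the car has a fast engine',
            'the rocket reached orbit',
            'the engine of the rocket failed']

# One-step route, with the same use_idf=False setting as vectorize().
one_step = TfidfVectorizer(use_idf=False).fit_transform(toy_docs)

# Two-step route: raw term counts first, then the weighting/normalisation.
counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer(use_idf=False).fit_transform(counts)

# Both routes should give the same matrix.
same = np.allclose(one_step.toarray(), two_step.toarray())

# With the default norm='l2', every non-empty row has length 1.
row_norms = np.linalg.norm(one_step.toarray(), axis=1)
print(same, row_norms)
```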

### 1.1. Implement a classifier for the 20 Newsgroups collection and measure its performance.
For instance, you can use the Multinomial Naïve Bayes classifier available in scikit-learn (sklearn.naive_bayes.MultinomialNB).
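
Before running on the full collection, the vectorize-then-classify idea can be tried on a few made-up sentences. This is only a sketch: the toy documents and labels are hypothetical, and make_pipeline is used as a convenience that the lab's own helper functions do not rely on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical): 0 = cars, 1 = space.
toy_train = ['engine car wheel', 'car brake engine',
             'rocket orbit launch', 'orbit moon rocket']
toy_labels = [0, 0, 1, 1]

# Chain the vectorizer and the classifier so that raw text goes in
# and predicted labels come out.
model = make_pipeline(TfidfVectorizer(use_idf=False), MultinomialNB())
model.fit(toy_train, toy_labels)

# The first query uses only "cars" vocabulary, the second only "space" vocabulary.
predicted = model.predict(['car engine', 'moon orbit'])
print(predicted)
```

A pipeline fits the vectorizer on the training text only, which is the same discipline that vectorize() follows by calling fit_transform on the training data and transform on the test data.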

In [5]:
from sklearn.naive_bayes import MultinomialNB
trainvec, testvec = vectorize(train.data, test.data, use_idf=False)
classes = classify(MultinomialNB(), trainvec, train.target, testvec)
evaluate(test.target, classes)

Accuracy score: 0.7052575677110993

              precision    recall  f1-score   support

           0       0.85      0.24      0.37       319
           1       0.71      0.60      0.65       389
           2       0.79      0.65      0.71       394
           3       0.63      0.75      0.69       392
           4       0.86      0.68      0.76       385
           5       0.88      0.68      0.77       395
           6       0.90      0.72      0.80       390
           7       0.71      0.92      0.80       396
           8       0.84      0.91      0.87       398
           9       0.86      0.85      0.86       397
          10       0.90      0.93      0.91       399
          11       0.52      0.96      0.67       396
          12       0.78      0.52      0.63       393
          13       0.82      0.76      0.79       396
          14       0.83      0.81      0.82       394
          15       0.34      0.98      0.51       398
          16       0.66      0.80      0.73  

### 1.2. Try to improve the classification by:

(a) Removing very rare words (e.g. words that occur in fewer than 2 documents) or very frequent words (e.g. words that occur in more than 90% of the documents) using the vectorizer facilities provided by scikit-learn.

See: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Attention:

* **max_df: float or int, default=1.0** When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold. If float in range [0.0, 1.0], the parameter represents a proportion of documents, integer absolute counts.


* **min_df: float or int, default=1** When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. If float in range of [0.0, 1.0], the parameter represents a proportion of documents, integer absolute counts.
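
The effect of the two thresholds can be seen on a tiny made-up corpus (toy_docs is hypothetical). In the sketch below, "the" appears in all 4 documents, which is above max_df=0.9, while "dog" and "bird" appear in only 1 document, which is below min_df=2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical). Document frequencies:
# the=4, cat=2, sat=2, flew=2, dog=1, bird=1.
toy_docs = ['the cat sat', 'the dog sat', 'the bird flew', 'the cat flew']

# Default thresholds keep every term.
full = TfidfVectorizer(use_idf=False).fit(toy_docs)

# Drop terms found in fewer than 2 documents or in more than 90% of them.
pruned = TfidfVectorizer(min_df=2, max_df=0.9, use_idf=False).fit(toy_docs)

full_vocab = sorted(full.vocabulary_)
pruned_vocab = sorted(pruned.vocabulary_)
print(full_vocab)    # ['bird', 'cat', 'dog', 'flew', 'sat', 'the']
print(pruned_vocab)  # ['cat', 'flew', 'sat']
```

Note that both thresholds count documents, not total occurrences: a word repeated many times inside a single document still has a document frequency of 1.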



In [8]:
from sklearn.linear_model import Perceptron
trainvec, testvec = vectorize(train.data, test.data, min_df=2, max_df=0.9, use_idf=False)
classes = classify(Perceptron(), trainvec, train.target, testvec)
evaluate(test.target, classes)

Accuracy score: 0.7603558151885289

              precision    recall  f1-score   support

           0       0.51      0.83      0.64       319
           1       0.80      0.64      0.71       389
           2       0.68      0.63      0.65       394
           3       0.49      0.77      0.60       392
           4       0.78      0.73      0.75       385
           5       0.74      0.73      0.74       395
           6       0.80      0.83      0.82       390
           7       0.92      0.75      0.83       396
           8       0.94      0.90      0.92       398
           9       0.85      0.88      0.87       397
          10       0.94      0.92      0.93       399
          11       0.76      0.91      0.83       396
          12       0.78      0.57      0.66       393
          13       0.69      0.81      0.74       396
          14       0.91      0.84      0.87       394
          15       0.88      0.82      0.85       398
          16       0.75      0.73      0.74  

(b) Compare the performance against alternative classification algorithms, such as:

* a nearest neighbour classifier (sklearn.neighbors.KNeighborsClassifier)
* the perceptron algorithm (sklearn.linear_model.Perceptron)
* support vector machines (sklearn.svm.LinearSVC)

In [7]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Perceptron
from sklearn.svm import LinearSVC

classifiers = [MultinomialNB(), KNeighborsClassifier(), Perceptron(), LinearSVC()]

# The features do not depend on the classifier, so vectorize only once.
trainvec, testvec = vectorize(train.data, test.data, use_idf=False)

for classifier in classifiers:
    print('Using {} as the classifier.'.format(classifier))
    classes = classify(classifier, trainvec, train.target, testvec)
    evaluate(test.target, classes)

Using MultinomialNB() as the classifier.
Accuracy score: 0.7052575677110993

              precision    recall  f1-score   support

           0       0.85      0.24      0.37       319
           1       0.71      0.60      0.65       389
           2       0.79      0.65      0.71       394
           3       0.63      0.75      0.69       392
           4       0.86      0.68      0.76       385
           5       0.88      0.68      0.77       395
           6       0.90      0.72      0.80       390
           7       0.71      0.92      0.80       396
           8       0.84      0.91      0.87       398
           9       0.86      0.85      0.86       397
          10       0.90      0.93      0.91       399
          11       0.52      0.96      0.67       396
          12       0.78      0.52      0.63       393
          13       0.82      0.76      0.79       396
          14       0.83      0.81      0.82       394
          15       0.34      0.98      0.51       398
    