# Text Classification - 20 Newsgroups - scikit-learn
This project classifies the text of the 20 Newsgroups dataset. The data was originally posted at http://qwone.com/~jason/20Newsgroups/ and is also available through the built-in dataset loaders in sklearn.

The machine learning algorithms used in the project also come from sklearn.

#### Reference
https://scikit-learn.org/ Version 0.23.1

Load Required Modules

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from time import time
from pprint import pprint
import numpy as np
import logging

Load Datasets

In [2]:
train = fetch_20newsgroups(subset='train', shuffle=True)
test = fetch_20newsgroups(subset='test', shuffle=True)

#### Why Pipeline?
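A Pipeline chains the preprocessing steps and the final classifier into a single estimator, so the whole workflow can be fitted and applied with one call.

To make predictions, the test data has to go through exactly the same preprocessing as the training data. A Pipeline ensures this by applying the transformers that were fitted on the training set to any new data.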
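As a minimal sketch of what the Pipeline saves, the example below runs the same three steps by hand and through a Pipeline, then compares the predictions. The tiny corpus and its labels are made up for illustration and are not part of 20 Newsgroups.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: label 0 = space, label 1 = hockey
train_docs = [
    "the rocket launched into orbit",
    "nasa plans a new space mission",
    "the goalie saved the puck",
    "hockey playoffs start tonight",
]
train_labels = [0, 0, 1, 1]
test_docs = ["a mission to orbit", "the puck hit the goalie"]

# Manual workflow: fit every step on the training data ...
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = MultinomialNB()
X_train = tfidf.fit_transform(vect.fit_transform(train_docs))
clf.fit(X_train, train_labels)

# ... then only transform (never refit) the test data with the fitted steps
X_test = tfidf.transform(vect.transform(test_docs))
manual_pred = clf.predict(X_test)

# Pipeline workflow: the same steps behind a single fit/predict interface
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
pipe.fit(train_docs, train_labels)
pipe_pred = pipe.predict(test_docs)

print(list(manual_pred) == list(pipe_pred))
```

Both routes give the same predictions; the Pipeline version simply removes the chance of calling `fit_transform` on the test data by mistake.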

#### Multinomial Naive Bayes

In [22]:
# Create a Multinomial Naive Bayes pipeline
NB_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                   ('tfidf', TfidfTransformer()),
                   ('clf', MultinomialNB())
                  ])

In [23]:
# Train the model
NB_clf = NB_clf.fit(train.data, train.target)

In [24]:
# Predict labels for the test set
NB_predict = NB_clf.predict(test.data)

In [25]:
np.mean(NB_predict == test.target)

0.8169144981412639
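The value above is the accuracy: the fraction of test documents whose predicted label equals the true label. A small sketch with made-up labels (the arrays `y_true` and `y_pred` are hypothetical) shows that the `np.mean` expression matches `sklearn.metrics.accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for five documents
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1])

# Comparing the arrays gives booleans; their mean is the share of correct predictions
manual_acc = np.mean(y_pred == y_true)

# The same quantity computed by scikit-learn
sk_acc = accuracy_score(y_true, y_pred)

print(manual_acc, sk_acc)  # 4 of 5 labels match, so both are 0.8
```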

#### SVM

In [7]:
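# SGDClassifier with its default loss='hinge' fits a linear SVM
# using stochastic gradient descent
SVM_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                    ('tfidf', TfidfTransformer()),
                    ('clf', SGDClassifier())
                   ])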

In [8]:
# Train the model
SVM_clf = SVM_clf.fit(train.data, train.target)

In [9]:
# Predict labels for the test set
SVM_predict = SVM_clf.predict(test.data)

In [10]:
np.mean(SVM_predict == test.target)

0.8494423791821561

#### Model Selection for SVM

In [11]:
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

In [12]:
# Parameters to search over
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    # 'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    # 'tfidf__use_idf': (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    'clf__max_iter': (20,),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__max_iter': (10, 50, 80),
}

In [13]:
# Find the best parameters for both the feature extraction and the classifier
grid_search = GridSearchCV(SVM_clf, parameters, n_jobs=-1, verbose=1)
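The cost of the search can be estimated from the grid before running it. The sketch below expands the grid from In [12] using only the standard library; the 5 folds are an assumption that matches the default `cv` of GridSearchCV in scikit-learn 0.22 and later.

```python
from itertools import product

# Same grid as in In [12], without the commented-out entries
parameters = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "clf__max_iter": (20,),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
}

# Every combination of one value per parameter is one candidate
names = list(parameters)
candidates = [dict(zip(names, combo)) for combo in product(*parameters.values())]

n_candidates = len(candidates)    # 3 * 2 * 1 * 2 * 2
n_folds = 5                       # assumed default cv
n_fits = n_candidates * n_folds   # one fit per candidate per fold

print(n_candidates, n_fits)  # 24 120
```

Uncommenting further entries multiplies the number of candidates, so the run time grows quickly with each added parameter.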

In [14]:
# Run the search and report the results
print("Performing grid search...")
print("pipeline:", [name for name, _ in SVM_clf.steps])
print("parameters:")
pprint(parameters)

t0 = time()
grid_search.fit(train.data, train.target)
print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__alpha': (1e-05, 1e-06),
 'clf__max_iter': (20,),
 'clf__penalty': ('l2', 'elasticnet'),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   49.9s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:  5.4min finished


done in 341.653s

Best score: 0.928
Best parameters set:
	clf__alpha: 1e-05
	clf__max_iter: 20
	clf__penalty: 'elasticnet'
	vect__max_df: 0.5
	vect__ngram_range: (1, 2)
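Note that `best_score_` is the mean cross-validated accuracy on the training data, so it is not directly comparable with the test accuracies reported earlier. A minimal sketch of scoring the refitted best estimator on held-out data follows; the toy corpus, the labels, and the reduced grid are assumptions for illustration and stand in for the 20 Newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: label 0 = space, label 1 = hockey
train_docs = [
    "the rocket reached orbit",
    "nasa launched a new satellite",
    "astronauts trained for the moon mission",
    "the shuttle docked with the station",
    "the goalie stopped the puck",
    "the team won the hockey playoffs",
    "a penalty shot decided the game",
    "the coach changed the defense line",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
test_docs = ["the satellite entered orbit", "the puck crossed the line"]
test_labels = [0, 1]

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Reduced grid so the example runs in well under a second
param_grid = {
    "clf__alpha": (1e-4, 1e-3),
    "vect__ngram_range": ((1, 1), (1, 2)),
}

# refit=True (the default) retrains the best candidate on all training data
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(train_docs, train_labels)

cv_score = search.best_score_                       # mean accuracy over the CV folds
test_score = search.score(test_docs, test_labels)   # accuracy on held-out data

print("CV accuracy:   %0.3f" % cv_score)
print("Test accuracy: %0.3f" % test_score)
```

In the notebook, the equivalent call would be `grid_search.score(test.data, test.target)`, which gives a number that can be compared with the accuracies from In [25] and In [10].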
