# Text Classification with scikit-learn

## Resources
- http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
- https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a

## MDSD File Sizes
```
1.4G sorted_data/books/all.review
217M sorted_data/music/all.review
198M sorted_data/dvd/all.review
 56M sorted_data/video/all.review
 24M sorted_data/electronics/all.review
 19M sorted_data/kitchen_&_housewares/all.review
 13M sorted_data/toys_&_games/all.review
8.7M sorted_data/camera_&_photo/all.review
7.0M sorted_data/apparel/all.review
6.8M sorted_data/health_&_personal_care/all.review
5.7M sorted_data/sports_&_outdoors/all.review
4.9M sorted_data/magazines/all.review
4.4M sorted_data/computer_&_video_games/all.review
4.2M sorted_data/baby/all.review
2.9M sorted_data/software/all.review
2.7M sorted_data/beauty/all.review
2.3M sorted_data/grocery/all.review
1.7M sorted_data/jewelry_&_watches/all.review
1.5M sorted_data/outdoor_living/all.review
1.5M sorted_data/gourmet_food/all.review
1.1M sorted_data/cell_phones_&_service/all.review
654K sorted_data/automotive/all.review
444K sorted_data/office_products/all.review
315K sorted_data/musical_instruments/all.review
110K sorted_data/tools_&_hardware/all.review
```


In [73]:
import mdsd 
from pathlib import Path
import numpy

# Pick one dataset; the smaller software file is handy for quick experiments
# file = Path("../sorted_data/software/all.review")  # 2390 Reviews
file = Path("../sorted_data/electronics/all.review") # 23009 Reviews

# sklearn expects a list of documents and a parallel array of labels
data        = mdsd.parse_file(file)
targets     = numpy.array([review['rating'] for review in data])
review_text = [review['text'] for review in data]

File: C:\Users\Owner\Projects\ml-review-classification\sorted_data\electronics\all.review
Lines: 868686
100%|███████████████████████████████| 868686/868686 [00:08<00:00, 98138.34it/s]
Reviews: 23009


In [86]:
# Separate into train and test datasets

from sklearn.model_selection import train_test_split

review_train, review_test, target_train, target_test = train_test_split(review_text, targets, test_size=0.10)

print('Review Stars: ' , '☆' * int(target_train[0]))
print('Review Text: ' , review_train[0])


Review Stars:  ☆☆☆☆☆
Review Text:  When checking printers in" Consumer Reports " this was rated "best buy".I see no reason to call it anything else



In [75]:
# Feature Extraction - Get count of words 

from sklearn.feature_extraction.text import CountVectorizer

count_vect     = CountVectorizer()
X_train_counts = count_vect.fit_transform(review_train)

X_train_counts.shape

(20708, 33831)

## From occurrences to frequencies

Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.
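
The two steps described above can be sketched on a tiny, hypothetical corpus (the three documents below are made up for illustration and are not from the review data). The sketch computes tf by hand, checks it against `TfidfTransformer` with idf switched off and L1 row normalisation, and then looks at the idf weights scikit-learn learns with its default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, for illustration only
docs = [
    "great printer great price",
    "printer broke",
    "great sound",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)          # sparse matrix of raw word counts

# tf by hand: each count divided by the total number of words in its document
counts_dense = counts.toarray()
tf_manual = counts_dense / counts_dense.sum(axis=1, keepdims=True)

# The same quantity from scikit-learn: idf disabled, rows normalised to sum to 1
tf_sklearn = TfidfTransformer(use_idf=False, norm="l1").fit_transform(counts).toarray()
print(np.allclose(tf_manual, tf_sklearn))

# With the defaults, idf is enabled: words found in fewer documents get a larger weight
tfidf = TfidfTransformer().fit(counts)
great_idx = vectorizer.vocabulary_["great"]      # appears in 2 of 3 documents
broke_idx = vectorizer.vocabulary_["broke"]      # appears in 1 of 3 documents
print(tfidf.idf_[great_idx], tfidf.idf_[broke_idx])
```

Note that the default `TfidfTransformer()` used in the next cell normalises each row to unit L2 length rather than dividing by the word total, so its values differ from the hand-computed tf while serving the same purpose of removing the document-length effect.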

In [76]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf     = tfidf_transformer.fit_transform(X_train_counts)

X_train_tfidf.shape

(20708, 33831)

## Training a classifier

In [77]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_train_tfidf, target_train)

## Evaluation of classifier

In [78]:
# Reuse the fitted vectorizer and transformer on the test reviews (transform, not fit_transform)

x_test_counts = count_vect.transform(review_test)
x_test_tfidf  = tfidf_transformer.transform(x_test_counts)
score         = round(clf.score(x_test_tfidf, target_test) * 100, 3)

print('Accuracy: {}%'.format(score))

Accuracy: 53.672%
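
An accuracy figure is easier to judge next to a trivial baseline. In the classification report further down, 5-star reviews account for 1234 of the 2301 test reviews (about 53.6%), so, assuming the same split, this score is roughly what always predicting 5 stars would achieve. A minimal sketch of measuring such a baseline with `DummyClassifier` follows; the label arrays are hypothetical stand-ins for `target_train` and `target_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical star ratings standing in for target_train / target_test
y_train = np.array([5, 5, 5, 4, 1, 5, 2, 5, 4, 5])
y_test  = np.array([5, 4, 5, 1, 5])

# DummyClassifier ignores the features, so a column of zeros is enough
X_train = np.zeros((len(y_train), 1))
X_test  = np.zeros((len(y_test), 1))

# Always predict the most frequent training label
baseline     = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

print('Baseline accuracy: {}%'.format(round(baseline_acc * 100, 3)))
```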


## [Linear Support Vector Machine (SVM)](http://scikit-learn.org/stable/modules/svm.html#svm)

Linear SVMs are widely regarded as among the best text classification algorithms, although they are a bit slower than naïve Bayes.
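
The cell below does not use an `SVC` class. It uses `SGDClassifier`, whose default `loss="hinge"` gives a linear SVM fitted by stochastic gradient descent. A minimal sketch on a hypothetical toy corpus (documents and labels invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews: 5 = positive, 1 = negative
docs = [
    "great printer love it",
    "excellent sound quality",
    "terrible printer broke",
    "awful sound quality",
]
labels = [5, 5, 1, 1]

# No loss argument is given, so the default hinge loss (linear SVM) is used
svm = SGDClassifier(random_state=0)
model = make_pipeline(TfidfVectorizer(), svm)
model.fit(docs, labels)

print(svm.loss)
pred = model.predict(["great sound", "printer broke"])
print(pred)
```

`TfidfVectorizer` is used here only to keep the sketch short; it combines the `CountVectorizer` and `TfidfTransformer` steps from the earlier cells.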

In [79]:
from sklearn.linear_model import SGDClassifier

svm_clf = SGDClassifier(max_iter=10, tol=None)
svm_clf.fit(X_train_tfidf, target_train)

score = round(svm_clf.score(x_test_tfidf, target_test) * 100, 3)
print('Accuracy: {}%'.format(score))

Accuracy: 69.883%


In [80]:
from sklearn import metrics

predicted = svm_clf.predict(x_test_tfidf)
print(metrics.classification_report(target_test, predicted))

             precision    recall  f1-score   support

          1       0.70      0.72      0.71       346
          2       0.73      0.08      0.14       143
          4       0.62      0.37      0.47       578
          5       0.72      0.92      0.80      1234

avg / total       0.69      0.70      0.66      2301



## Hyperparameter tuning with GridSearchCV

[Tuning the hyper-parameters of an estimator](http://scikit-learn.org/stable/modules/grid_search.html)

Hyper-parameters are parameters that are not directly learnt within estimators; they are set when the estimator is constructed, such as `alpha` for `SGDClassifier`.

Instead of tweaking the parameters of the various components of the chain by hand, it is possible to run an exhaustive search for the best parameters on a grid of possible values. Here every combination is tried: unigrams only or unigrams plus bigrams, with or without idf, and a regularization strength `alpha` of either 0.01 or 0.001 for the linear SVM, giving 2 × 2 × 2 = 8 candidate pipelines:

In [70]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(max_iter=5, tol=None)),
])

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
}
gs_clf     = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf     = gs_clf.fit(review_train, target_train)

gs_clf.cv_results_



{'mean_fit_time': array([0.78221997, 2.3468008 , 0.47002689, 2.75962027, 0.4628009 ,
        1.90420341, 0.4488674 , 1.79633792]),
 'std_fit_time': array([0.07671648, 0.27333641, 0.02483424, 1.00170188, 0.01945677,
        0.09616504, 0.01368241, 0.04716935]),
 'mean_score_time': array([0.3160123 , 0.52236334, 0.21867911, 0.62154984, 0.19240022,
        0.43680072, 0.17160026, 0.43260074]),
 'std_score_time': array([0.07041602, 0.05170299, 0.04818533, 0.15638328, 0.00735395,
        0.01273751, 0.01273742, 0.03238825]),
 'param_clf__alpha': masked_array(data=[0.01, 0.01, 0.01, 0.01, 0.001, 0.001, 0.001, 0.001],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_tfidf__use_idf': masked_array(data=[True, True, False, False, True, True, False, False],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_vect__ngram_range'
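
Each array in `cv_results_` has one entry per candidate, which is why every array above has eight elements. As a small sketch, the candidates can be listed without fitting anything by expanding the same grid with `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}

# Cartesian product of all parameter values: one dict per candidate pipeline
grid = list(ParameterGrid(parameters))
print(len(grid))
for combo in grid:
    print(combo)
```

`GridSearchCV` fits each of these candidates once per cross-validation fold, so the total number of fits is the candidate count times the number of folds.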

In [71]:
# Best parameter combination found by the grid search

gs_clf.best_params_


{'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
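
With the default `refit=True`, `GridSearchCV` retrains the best pipeline on all the training data, so the fitted search object can be scored on held-out reviews directly. The sketch below shows this on a hypothetical toy corpus with a reduced grid; the documents, labels, and `cv=3` are assumptions made to keep it small, not the notebook's settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews standing in for review_train / review_test
train_docs = [
    "great printer love it", "excellent sound quality", "works great",
    "love this camera", "excellent value", "great battery life",
    "terrible printer broke", "awful sound quality", "stopped working",
    "hate this camera", "awful value", "terrible battery life",
]
train_labels = [5] * 6 + [1] * 6
test_docs    = ["great camera", "terrible sound"]
test_labels  = [5, 1]

pipeline = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(random_state=0)),
])
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__alpha': (1e-2, 1e-3),
}

search = GridSearchCV(pipeline, parameters, cv=3)
search.fit(train_docs, train_labels)

print(search.best_params_)                    # best combination on the toy data
print(search.best_score_)                     # its mean cross-validated accuracy
print(search.score(test_docs, test_labels))   # accuracy of the refitted best pipeline
```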