# Text Classification with scikit-learn

## Resources
- http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
- https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a


In [35]:
import mdsd 
from pathlib import Path
import numpy

file = Path("../sorted_data/software/all.review")

# scikit-learn expects slightly different structures:
# a NumPy array of labels and a plain list of raw review texts
data        = mdsd.parse_file(file)
targets     = numpy.array([review['rating'] for review in data])
review_text = [review['text'] for review in data]

File: C:\Users\Owner\Projects\ml-review-classification\sorted_data\software\all.review
Lines: 93808



100%|█████████████████████████████████| 93808/93808 [00:01<00:00, 91604.13it/s]


Reviews: 2390



In [36]:
# Separate into train and test sets (10% of the reviews are held out for testing)

from sklearn.model_selection import train_test_split

review_train, review_test, target_train, target_test = train_test_split(review_text, targets, test_size=0.10)

print(review_train[0])
print(target_train[0])

Be advised, this Software is for a SINGLE USER, only licensed for one computer. I bought this for Home use after using SS% for 2 years and V-Com said it would no longer be supported of 03/06. I installed it on 2 home computers, it worked fine for 4 weeks, then it popped up a Box on my second computer that I couls not use it because it was licensed on another computer, I needed to buy additional license for each computer. When I tried to use it on my Main computer, it popped up the same message. So now it will not work on either computer. I uninstalled it and sent it back. Just because it's Cheap does not mean it's any good

1


In [37]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect     = CountVectorizer()
X_train_counts = count_vect.fit_transform(review_train)

X_train_counts.shape

(2151, 11703)

## From occurrences to frequencies

Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.
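
The two steps described above can be checked by hand on a tiny corpus. The following is a minimal sketch, not part of the original notebook: the three toy documents are made up, and the idf formula shown is the smoothed variant that `TfidfTransformer` applies with its default settings (`smooth_idf=True`, `norm='l2'`).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, used only for illustration
docs = [
    "good software good price",
    "bad software",
    "good support",
]

counts_sparse = CountVectorizer().fit_transform(docs)
counts = counts_sparse.toarray()

# tf: occurrences of each word divided by the total number of words in the document
tf = counts / counts.sum(axis=1, keepdims=True)

# idf: words that appear in many documents get a smaller weight
# (smoothed form used by scikit-learn's defaults: ln((1 + n) / (1 + df)) + 1)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# scikit-learn multiplies the raw counts by idf and then scales each row to unit
# L2 length, which makes a separate division by document length unnecessary
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfTransformer().fit_transform(counts_sparse).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

Because each row is normalized at the end, long and short documents that use words in the same proportions end up with the same feature vector.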

In [38]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf     = tfidf_transformer.fit_transform(X_train_counts)

X_train_tfidf.shape

(2151, 11703)

## Training a classifier

In [39]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_train_tfidf, target_train)

## Evaluation of classifier

In [40]:
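# Reuse the vectorizer and transformer fitted on the training data:
# call transform(), not fit_transform(), so the test set is encoded
# with the same vocabulary and idf weights as the training set.
x_test_counts = count_vect.transform(review_test)
x_test_tfidf  = tfidf_transformer.transform(x_test_counts)
score         = round(clf.score(x_test_tfidf, target_test) * 100, 3)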

print('Accuracy: {}%'.format(score))

Accuracy: 55.23%
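
The count, tf–idf, and classifier steps used so far can also be chained into a single scikit-learn `Pipeline`, which keeps the fit/transform bookkeeping in one object. This is a sketch under assumptions rather than the notebook's own code: the toy reviews, their ratings, and the step names `vect`, `tfidf`, and `clf` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for review_train / target_train
toy_reviews = [
    "great product works well",
    "terrible product crashed constantly",
    "works great love it",
    "crashed and would not install",
]
toy_ratings = [5, 1, 5, 1]

# Each step is fitted on the training text in order; the last step is the classifier
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(toy_reviews, toy_ratings)

# predict() applies transform() on each earlier step before classifying
predicted = text_clf.predict(["great software, works well"])
print(predicted)
```

With the real data, `text_clf.fit(review_train, target_train)` followed by `text_clf.score(review_test, target_test)` would replace the separate vectorizer and transformer calls.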


## [Linear Support Vector Machine (SVM)](http://scikit-learn.org/stable/modules/svm.html#svm)

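Linear SVMs are widely regarded as one of the best text classification algorithms, although they are a bit slower to train than naïve Bayes. The cell below uses `SGDClassifier`, whose default hinge loss gives a linear SVM trained with stochastic gradient descent.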

In [47]:
from sklearn.linear_model import SGDClassifier

svm_clf = SGDClassifier(max_iter=10, tol=None)
svm_clf.fit(X_train_tfidf, target_train)

score = round(svm_clf.score(x_test_tfidf, target_test) * 100, 3)
print('Accuracy: {}%'.format(score))

Accuracy: 66.527%
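
The accuracy figures in this notebook will vary from run to run, because neither `train_test_split` nor `SGDClassifier` is given a seed. The sketch below spells out the SVM settings and fixes `random_state` so that repeated runs give the same model. It is an illustration under assumptions, not the original experiment: the toy data and the seed value 42 are made up, and `TfidfVectorizer` is swapped in as the single-step equivalent of `CountVectorizer` followed by `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, for illustration only
toy_reviews = [
    "great product works well",
    "terrible product crashed constantly",
    "works great love it",
    "crashed and would not install",
]
toy_ratings = [5, 1, 5, 1]

svm_pipeline = make_pipeline(
    TfidfVectorizer(),
    SGDClassifier(
        loss="hinge",      # hinge loss with an L2 penalty is a linear SVM
        penalty="l2",
        max_iter=10,       # same epoch budget as the cell above
        tol=None,          # run all epochs instead of stopping early
        random_state=42,   # assumed seed, chosen arbitrarily
    ),
)
svm_pipeline.fit(toy_reviews, toy_ratings)

# Accuracy on the training data only; a held-out set is needed for a fair estimate
train_accuracy = svm_pipeline.score(toy_reviews, toy_ratings)
print('Training accuracy: {}%'.format(round(train_accuracy * 100, 3)))
```

Passing the same kind of seed to `train_test_split(..., random_state=...)` would make the train/test partition repeatable as well.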
