# Classification problem on the Reuters newswire dataset

In [0]:
import nltk

The Reuters dataset is available in the NLTK library: http://www.nltk.org/nltk_data/

It contains structured information about newswire articles that can be assigned to several classes, making it a multi-label problem. The collection originally consisted of 21,578 documents, but a subset and split is traditionally used. The most common split is ModApte, which only considers categories that have at least one document in both the training set and the test set. The ModApte split has 90 categories, with a training set of 7,769 documents and a test set of 3,019 documents.




NLTK has built-in support for dozens of corpora and trained models. The recommended way to obtain them is the NLTK corpus downloader, `nltk.download()`.

In [0]:
nltk.download('reuters')

[nltk_data] Downloading package reuters to /root/nltk_data...


True

*In natural language processing, words that carry little useful information are referred to as stop words. A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.*
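As a small illustration of the idea, the sketch below filters a sentence against scikit-learn's built-in English stop-word list, which is the list selected by `stop_words = 'english'` in the vectorizer cell of this notebook. The sample sentence and variable names are made up for this example.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical sample sentence, split on whitespace for simplicity.
sample_tokens = "the price of copper rose in march".split()

# Keep only the tokens that are not in the stop-word list.
content_tokens = [t for t in sample_tokens if t not in ENGLISH_STOP_WORDS]

print(content_tokens)
```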

In [0]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
from nltk.corpus import reuters
from nltk.corpus import stopwords
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer

**Setting up train and test data**

In [0]:
train_documents, train_categories = zip(*[(reuters.raw(i), reuters.categories(i)) for i in reuters.fileids() if i.startswith('training/')])
test_documents, test_categories = zip(*[(reuters.raw(i), reuters.categories(i)) for i in reuters.fileids() if i.startswith('test/')])
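The two comprehensions above rely on the file ID prefix: NLTK's Reuters file IDs start with `training/` or `test/`. A minimal sketch of that partitioning, using made-up file IDs so that it runs without the corpus (the helper name `split_fileids` is hypothetical):

```python
def split_fileids(fileids):
    """Partition file IDs into (train, test) lists based on their prefix."""
    train_ids = [f for f in fileids if f.startswith('training/')]
    test_ids = [f for f in fileids if f.startswith('test/')]
    return train_ids, test_ids

# Made-up IDs in the same 'split/number' shape as reuters.fileids().
example_ids = ['test/100', 'test/101', 'training/1', 'training/2', 'training/3']
train_ids, test_ids = split_fileids(example_ids)

print(len(train_ids), len(test_ids))  # 3 2
```

With the real corpus, `split_fileids(reuters.fileids())` should give list lengths that match the ModApte split sizes quoted earlier.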

`tokenize` returns a list of stems for the tokens in the text that is passed as an argument. As written, it only tokenizes and stems: it does not drop short tokens or tokens that contain non-letter characters (e.g., numbers). Stop words are filtered later, through the `stop_words` argument of TfidfVectorizer.
Here I have used the Porter stemmer to stem the words.

In [0]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()  # build the stemmer once instead of once per token

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = []
    for item in tokens:
        stems.append(stemmer.stem(item))
    return stems

The above cell defines a function `tokenize` that performs the following actions:

1. Receive a document as an argument to the function
2. Tokenize the document using nltk.word_tokenize()
3. Use the PorterStemmer provided by NLTK to remove morphological affixes from each token
4. Append each stemmed token to an already defined list stems
5. Return the list stems
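Dropping tokens that are too short or that contain anything other than letters could be sketched as below. This is an illustrative helper, not part of the notebook's pipeline: the name `filter_tokens`, the `min_length` threshold of 3, and the letters-only regular expression are all assumptions.

```python
import re

# Matches tokens made up of ASCII letters only (an assumed definition of "letters").
LETTERS_ONLY = re.compile(r'^[a-zA-Z]+$')

def filter_tokens(tokens, min_length=3):
    """Lowercase tokens and keep those that are alphabetic and long enough."""
    kept = []
    for token in tokens:
        word = token.lower()
        if len(word) < min_length:
            continue  # too short
        if not LETTERS_ONLY.match(word):
            continue  # contains digits or punctuation
        kept.append(word)
    return kept

sample = ['Copper', 'rose', '3.5', 'pct', 'to', 'U.S.', 'buyers']
print(filter_tokens(sample))  # ['copper', 'rose', 'pct', 'buyers']
```

Such a filter could be applied to the token list before stemming, so that the stemmer only sees tokens that will be kept.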

In [0]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

**TF**: *Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in short ones. Thus, the term frequency is often divided by the document length (i.e., the total number of terms in the document) as a way of normalization.*

**TF(t)** = *(Number of times term t appears in a document) / (Total number of terms in the document)*

**IDF**: *Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:*

**IDF(t)** = *log_e(Total number of documents / Number of documents with term t in it)*

**Example:**

Consider a document containing 100 words wherein the word cat appears 3 times.

*The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Using a base-10 logarithm to keep the numbers round, the inverse document frequency (i.e., idf) is log10(10,000,000 / 1,000) = 4, so the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. With the natural logarithm from the IDF formula above, the idf would be about 9.21 and the weight about 0.28.*
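The arithmetic of this example can be checked with a few lines of Python. The helper names `tf` and `idf` are made up for this sketch; the logarithm base is passed in explicitly because the result depends on it.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: occurrences of the term divided by document length."""
    return term_count / total_terms

def idf(total_docs, docs_with_term, log=math.log):
    """Inverse document frequency with a configurable logarithm."""
    return log(total_docs / docs_with_term)

cat_tf = tf(3, 100)
idf_base10 = idf(10_000_000, 1_000, log=math.log10)
idf_natural = idf(10_000_000, 1_000)  # natural logarithm by default

print(round(cat_tf * idf_base10, 4))   # 0.12
print(round(cat_tf * idf_natural, 4))  # 0.2763
```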


To begin, I used TF-IDF for feature extraction on both the train and the test data using TfidfVectorizer.

But first, what does TfidfVectorizer actually do?

TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features.

TF-IDF?

*TF-IDF (short for term frequency–inverse document frequency)* is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

Why TfidfVectorizer?

*TfidfVectorizer* scales down the impact of tokens that occur very frequently (e.g., “a”, “the”, and “of”) in a given corpus.

I gave the following two arguments to TfidfVectorizer:

* tokenizer: the tokenize function
* stop_words: 'english'

Then I used *fit_transform* on the train documents and *transform* on the test documents, so that the vocabulary and IDF weights are learned from the training data only.



In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer = tokenize, stop_words = 'english')

vectorised_train_documents = vectorizer.fit_transform(train_documents)
vectorised_test_documents = vectorizer.transform(test_documents)
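To see what `fit_transform` and `transform` do on a small scale, here is a sketch on made-up documents with the default tokenizer. Note that scikit-learn's default IDF is smoothed, `ln((1 + n) / (1 + df)) + 1`, and rows are L2-normalized, so the values differ from the textbook formula. All `toy_*` names are hypothetical.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train_docs = ["copper prices rose", "wheat exports fell", "copper exports rose"]
toy_test_docs = ["gold prices fell"]

toy_vectorizer = TfidfVectorizer()
toy_train = toy_vectorizer.fit_transform(toy_train_docs)  # learns vocabulary and IDF
toy_test = toy_vectorizer.transform(toy_test_docs)        # reuses them unchanged

print(toy_train.shape, toy_test.shape)      # same number of columns
print('gold' in toy_vectorizer.vocabulary_)  # unseen test words are ignored

# 'copper' appears in 2 of the 3 training documents.
copper_idf = toy_vectorizer.idf_[toy_vectorizer.vocabulary_['copper']]
expected_idf = math.log((1 + 3) / (1 + 2)) + 1
print(copper_idf, expected_idf)
```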



The problem we are solving is multi-label in nature, and because of this, there are two changes that I have to make in the code that are not needed for binary classification.

**1. Firstly, the data representation for the category assignment is slightly different: each document is viewed as a list of bits indicating whether or not it belongs to each of the categories. This change is made using the MultiLabelBinarizer, as the code shows.**


In [0]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
train_labels = mlb.fit_transform(train_categories)
test_labels = mlb.transform(test_categories)
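A toy run of MultiLabelBinarizer makes the "list of bits" representation concrete. The category lists below are made up; only the class names are borrowed from the Reuters categories.

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_categories = [['copper'], ['grain', 'wheat'], ['copper', 'trade']]

toy_mlb = MultiLabelBinarizer()
toy_labels = toy_mlb.fit_transform(toy_categories)

print(list(toy_mlb.classes_))  # ['copper', 'grain', 'trade', 'wheat']
print(toy_labels.tolist())     # [[1, 0, 0, 0], [0, 1, 0, 1], [1, 0, 1, 0]]

# inverse_transform maps the bit rows back to category names.
print(toy_mlb.inverse_transform(toy_labels))
```

Each column corresponds to one class in sorted order, and each row holds one bit per class.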

**2. Secondly, we have to train our model (which is binary by nature) N times, once per category, where the negative cases are the documents in all the other categories. This allows our model to make a binary decision per category and produce multi-label results. This can be done with the OneVsRestClassifier object in scikit-learn.**

[YouTube link: one-vs-rest classifier explained by Andrew Ng](https://www.youtube.com/watch?v=ZvaELFv5IpM)

[Another YouTube link to understand the one-vs-rest classifier](https://www.youtube.com/watch?v=6_YvpI-oDIs)

In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

classifier = OneVsRestClassifier(LinearSVC())
classifier.fit(vectorised_train_documents, train_labels)

OneVsRestClassifier(estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
          n_jobs=None)
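The "one binary model per category" idea can be observed on a tiny made-up dataset: after fitting, OneVsRestClassifier holds one fitted LinearSVC per label column. The data and `toy_*` names below are assumptions chosen so that every label has both positive and negative examples.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Six made-up samples with two features and three possible labels.
toy_X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [2., 0.], [0., 2.]])
toy_Y = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 1],
                  [0, 1, 1], [1, 0, 0], [0, 1, 0]])

toy_ovr = OneVsRestClassifier(LinearSVC(random_state=0))
toy_ovr.fit(toy_X, toy_Y)

print(len(toy_ovr.estimators_))  # one binary classifier per label column
toy_pred = toy_ovr.predict(toy_X)
print(toy_pred.shape)            # one 0/1 decision per sample and label
```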

I used KFold with cross_val_score because KFold supports shuffling the data before splitting.

I also set random_state to 42 so that the shuffled folds are reproducible.

In [0]:
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=10, random_state = 42, shuffle = True)
scores = cross_val_score(classifier, vectorised_train_documents, train_labels, cv = kf)



In [0]:
print('Cross-validation scores:', scores)
print('Cross-validation accuracy: {:.4f} (+/- {:.4f})'.format(scores.mean(), scores.std() * 2))

Cross-validation scores: [0.83655084 0.86743887 0.8043758  0.83011583 0.83655084 0.81724582
 0.82754183 0.8030888  0.80694981 0.82731959]
Cross-validation accuracy: 0.8257 (+/- 0.0368)
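What a shuffled KFold hands to cross_val_score can be inspected directly. The sketch below splits ten made-up samples into five folds; the `toy_*` names and sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_samples = np.arange(10).reshape(-1, 1)
toy_kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Each split yields the indices used for training and the indices held out.
folds = list(toy_kf.split(toy_samples))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)

# Every sample is held out exactly once across the folds.
held_out = sorted(int(i) for _, test_idx in folds for i in test_idx)
print(held_out)
```

Fixing `random_state` makes the shuffled assignment of samples to folds repeatable between runs.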


***Evaluation***

Measuring the quality of a classifier is a necessary step in order to potentially improve it. The main metrics for text classification are:

*Precision*: Number of documents correctly assigned to a category out of the total number of documents predicted to be in that category.

*Recall*: Number of documents correctly assigned to a category out of the total number of documents that actually belong to that category.

*F1*: Metric that combines precision and recall using the harmonic mean.

If the evaluation is being done in multi-class or multi-label environments, the method becomes slightly more complicated because the quality metrics have to be either shown per category or globally aggregated. There are two main aggregation approaches:

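**Micro-average:** This pools the contributions (true positives, false positives, false negatives) of all classes and computes the metric once from the pooled counts, so frequent classes dominate.

**Macro-average:** This computes the metric independently for each class and then takes the unweighted average, so every class counts equally regardless of its size.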

[Reference for detailed explanation about micro and macro averages](https://datascience.stackexchange.com/a/24051)
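The difference between the two averages shows up clearly when one class is frequent and well predicted and another is rare and poorly predicted. The counts below are invented purely for illustration.

```python
# Hypothetical per-class counts: (true positives, false positives).
toy_counts = {'earn': (90, 10), 'rye': (1, 9)}

# Macro: compute precision per class, then average the per-class values.
per_class = {c: tp / (tp + fp) for c, (tp, fp) in toy_counts.items()}
toy_macro_precision = sum(per_class.values()) / len(per_class)

# Micro: pool the counts over all classes, then compute precision once.
total_tp = sum(tp for tp, _ in toy_counts.values())
total_fp = sum(fp for _, fp in toy_counts.values())
toy_micro_precision = total_tp / (total_tp + total_fp)

print(round(toy_macro_precision, 4))  # 0.5
print(round(toy_micro_precision, 4))  # 0.8273
```

A large gap between micro and macro scores therefore suggests that rare categories are handled much worse than frequent ones.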


In [0]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

predictions = classifier.predict(vectorised_test_documents)


accuracy = accuracy_score(test_labels, predictions)

macro_precision = precision_score(test_labels, predictions, average='macro')
macro_recall = recall_score(test_labels, predictions, average='macro')
macro_f1 = f1_score(test_labels, predictions, average='macro')

micro_precision = precision_score(test_labels, predictions, average='micro')
micro_recall = recall_score(test_labels, predictions, average='micro')
micro_f1 = f1_score(test_labels, predictions, average='micro')

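# Caveat: argmax keeps only the first positive label in each row and maps rows
# with no positive label to index 0, so this is only a rough single-label view.
cm = confusion_matrix(test_labels.argmax(axis = 1), predictions.argmax(axis = 1))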



In [0]:
print("Accuracy: {:.4f}\nPrecision:\n- Macro: {:.4f}\n- Micro: {:.4f}\nRecall:\n- Macro: {:.4f}\n- Micro: {:.4f}\nF1-measure:\n- Macro: {:.4f}\n- Micro: {:.4f}".format(accuracy, macro_precision, micro_precision, macro_recall, micro_recall, macro_f1, micro_f1))

Accuracy: 0.8099
Precision:
- Macro: 0.6076
- Micro: 0.9471
Recall:
- Macro: 0.3708
- Micro: 0.7981
F1-measure:
- Macro: 0.4410
- Micro: 0.8662
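When reading the accuracy figure, it helps to know that for multi-label indicator arrays `accuracy_score` computes subset accuracy: a document counts as correct only if all of its labels are predicted exactly. A sketch on two made-up documents:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Document 1 is predicted perfectly; document 2 misses its second label.
toy_true = np.array([[1, 0], [1, 1]])
toy_pred = np.array([[1, 0], [1, 0]])

toy_accuracy = accuracy_score(toy_true, toy_pred)
toy_precision = precision_score(toy_true, toy_pred, average='micro')
toy_recall = recall_score(toy_true, toy_pred, average='micro')

print(toy_accuracy)   # 0.5: only one of the two rows matches exactly
print(toy_precision)  # 1.0: every predicted label is correct
print(toy_recall)     # 2 of the 3 true labels are found
```

This strictness is one plausible reason why accuracy can sit below the micro-averaged scores.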


# **Using the pickled objects to predict the category of a given text or file**

In [0]:
import pickle

In [0]:
with open('classifier.pickle','wb') as f:
    pickle.dump(classifier,f)
    
# Saving the Tf-Idf model
with open('tfidfmodel.pickle','wb') as f:
    pickle.dump(vectorizer,f)

In [0]:
# Loading the classifier and vectorizer
with open('classifier.pickle','rb') as f:
    classifier = pickle.load(f)
    
with open('tfidfmodel.pickle','rb') as f:
    tfidf = pickle.load(f)    
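A quick way to gain confidence in the save/load step is to round-trip a fitted vectorizer through pickle in memory and compare its output. The sketch uses a toy vectorizer with the default tokenizer; the `toy_*` and `restored_vec` names are assumptions. One caveat: pickle stores functions by reference, so a vectorizer built with a custom tokenizer such as `tokenize` needs that function to be defined or importable in the session that loads the file.

```python
import pickle
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["copper prices rose", "wheat exports fell"]
toy_vec = TfidfVectorizer().fit(toy_docs)

# Serialize to bytes and load back, without touching the file system.
restored_vec = pickle.loads(pickle.dumps(toy_vec))

original_matrix = toy_vec.transform(toy_docs).toarray()
restored_matrix = restored_vec.transform(toy_docs).toarray()
print(np.allclose(original_matrix, restored_matrix))
```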
    

**PREDICTING THE CATEGORY OF THE GIVEN TEXT OR TEXT FILE**

In [0]:
sent = classifier.predict(tfidf.transform(["this is a copper can"]).toarray())



In [0]:
sent

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0]])

In [0]:
# Read the file inside a with-block so that it is closed afterwards
with open('14826.txt') as myfile:
    text = myfile.read()

In [0]:
senet = classifier.predict(tfidf.transform([text]).toarray())

In [0]:
senet

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0]])

In [0]:
categories = reuters.categories()

print(categories)

['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']


In [0]:
categories[84]

'trade'

In [0]:
categories[10]

'copper'
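Instead of counting positions by hand, the positive columns of a prediction row can be mapped to names programmatically. The helper `decode_labels` and the shortened class list below are hypothetical; the lookup is only valid when the name list is in the same order as the label columns.

```python
import numpy as np

def decode_labels(prediction_rows, class_names):
    """Map each 0/1 prediction row to the names of its positive columns."""
    return [[class_names[i] for i in np.flatnonzero(row)]
            for row in prediction_rows]

example_classes = ['acq', 'alum', 'copper', 'trade']
example_predictions = np.array([[0, 0, 1, 0],
                                [1, 0, 0, 1],
                                [0, 0, 0, 0]])

print(decode_labels(example_predictions, example_classes))
# [['copper'], ['acq', 'trade'], []]
```

In the notebook itself, `mlb.inverse_transform(sent)` would be an alternative, since the fitted MultiLabelBinarizer keeps the column order in `mlb.classes_`.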