# Text Mining - Multi-Label Classification on Reuters

In [1]:
# Full name, starting with your LASTNAME and then your FIRSTNAME(s)
name = "Okeke, Ebuka Chinagorom"

# Matriculation Number
number = "87565"

# Email address which you used on PIAZZA (or your @gw.uni-passau.de address)
email = "okeke01@gw.uni-passau.de"

In [2]:
try:
    import nltk
except ModuleNotFoundError:
    !pip install nltk

In [3]:
## This code downloads the required packages.
## You can run `nltk.download('all')` to download everything.

nltk_packages = [
    ("reuters", "corpora/reuters.zip")
]

for pid, fid in nltk_packages:
    try:
        nltk.data.find(fid)
    except LookupError:
        nltk.download(pid)

[nltk_data] Downloading package reuters to /Users/EBUKA/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


## Setting up corpus

In [4]:
from nltk.corpus import reuters

## Setting up train/test data

In [5]:
train_documents, train_categories = zip(*[(reuters.raw(i), reuters.categories(i)) for i in reuters.fileids() if i.startswith('training/')])
test_documents, test_categories = zip(*[(reuters.raw(i), reuters.categories(i)) for i in reuters.fileids() if i.startswith('test/')])

In [6]:
all_categories = sorted(list(set(reuters.categories())))

# Add your code here

In [7]:
import nltk 
nltk.download('all')
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score

[nltk_data] Downloading collection 'all'

In [8]:
from nltk.corpus import stopwords
cachedStopWords = set(stopwords.words('english'))

In [9]:
import warnings
warnings.filterwarnings('ignore')

# Pre-processing the text: lowercasing, removing stop words, stemming, and keeping only tokens that start with a letter and have at least 3 characters

In [10]:
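# Create the stemmer and the pattern once instead of once per token
stemmer = PorterStemmer()
token_pattern = re.compile('[a-zA-Z]+')

def tokenize(text):
    min_length = 3
    # Lowercase every token and drop English stop words
    words = [word.lower() for word in word_tokenize(text)]
    words = [word for word in words if word not in cachedStopWords]
    # Reduce each remaining word to its Porter stem
    tokens = [stemmer.stem(word) for word in words]
    # Keep stems that start with a letter and are long enough
    return [token for token in tokens
            if token_pattern.match(token) and len(token) >= min_length]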
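
The final filter in `tokenize` relies on `re.match`, which only anchors at the start of the token, so a token passes as soon as its first character is a letter. The following is a small sketch of that filter in isolation; the token list is hypothetical and chosen only for illustration, and NLTK is left out so the snippet runs on its own.

```python
import re

# Same pattern and length threshold as in tokenize().
pattern = re.compile('[a-zA-Z]+')
min_length = 3

# Hypothetical, already-stemmed tokens (illustration only).
candidates = ["trade", "u.s.", "3rd", "1987", "oil", "mln-dlr", "of"]

# re.match anchors at the start only: "u.s." and "mln-dlr" pass because
# they begin with a letter, "3rd" and "1987" fail, and "of" is too short.
kept = [t for t in candidates if pattern.match(t) and len(t) >= min_length]
print(kept)
```

If purely alphabetic tokens were wanted instead, `pattern.fullmatch(t)` would be the stricter alternative; the notebook itself does not do this.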

# Using TfidfVectorizer with the custom tokenizer to transform the documents into TF-IDF vectors
# Using MultiLabelBinarizer to encode each document's set of categories as a binary indicator (multi-hot) vector

In [11]:
def represent(documents):
    train_docs_id = list(filter(lambda doc: doc.startswith("train"), documents))
    test_docs_id = list(filter(lambda doc: doc.startswith("test"), documents))
    
    train_docs = [reuters.raw(doc_id) for doc_id in train_docs_id]
    test_docs = [reuters.raw(doc_id) for doc_id in test_docs_id]
    
    # Tokenisation
    vectorizer = TfidfVectorizer(tokenizer=tokenize)
    
    # Learn and transform train documents
    vectorised_train_documents = vectorizer.fit_transform(train_docs)
    vectorised_test_documents = vectorizer.transform(test_docs)

    # Transform multilabel labels
    mlb = MultiLabelBinarizer()
    train_labels = mlb.fit_transform([reuters.categories(doc_id) for doc_id in train_docs_id]) 
    test_labels = mlb.transform([reuters.categories(doc_id) for doc_id in test_docs_id])
    
    return (vectorised_train_documents, train_labels, vectorised_test_documents, test_labels)
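
To make the shapes produced by `represent` concrete, here is a minimal sketch on a toy corpus. The documents and labels are made up for illustration, and the default `TfidfVectorizer` tokenizer is used instead of the custom `tokenize` so that the snippet does not need NLTK data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data (not taken from Reuters).
toy_train = ["wheat prices rise", "crude oil exports fall"]
toy_test = ["wheat exports surge"]
toy_train_labels = [["grain", "wheat"], ["crude"]]
toy_test_labels = [["wheat"]]

# The vocabulary is learned on the training documents only;
# test documents are projected onto that fixed vocabulary.
vec = TfidfVectorizer()
X_train = vec.fit_transform(toy_train)
X_test = vec.transform(toy_test)

# Each label set becomes one row with a 1 for every category present.
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(toy_train_labels)
Y_test = mlb.transform(toy_test_labels)

print(list(mlb.classes_))
print(Y_train.tolist())
print(Y_test.tolist())
print(X_train.shape, X_test.shape)
```

The word "surge" never appears in the training documents, so it has no column and is ignored when the test document is transformed.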

# Using OneVsRestClassifier to fit one binary classifier per category; each per-category classifier can be inspected on its own, which keeps the model easy to interpret.

In [12]:
def train_classifier(train_docs, train_labels):
    classifier = OneVsRestClassifier(LinearSVC(random_state=42))
    classifier.fit(train_docs, train_labels)
    return classifier
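
As a quick illustration of "one classifier per class", the sketch below fits the same kind of model on a tiny, made-up multi-label dataset and checks that one underlying estimator is created per label column. The feature values and labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Hypothetical 2-feature data with 2 labels; a row may carry both labels.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
              [0.1, 0.9], [1.0, 1.0], [0.9, 0.8]])
Y = np.array([[1, 0], [1, 0], [0, 1],
              [0, 1], [1, 1], [1, 1]])

clf = OneVsRestClassifier(LinearSVC(random_state=42)).fit(X, Y)

# One fitted LinearSVC per label column.
print(len(clf.estimators_))
# Predictions come back in the same indicator format as Y.
print(clf.predict(X).shape)
```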

# Since we are using OneVsRestClassifier, we can use micro as well as macro averaging to generalize the binary performance metrics (precision, recall, F1-score) to the multi-label setting.

In [13]:
def evaluate(test_labels, predictions):
    precision = precision_score(test_labels, predictions, average='micro')
    recall = recall_score(test_labels, predictions, average='micro')
    f1 = f1_score(test_labels, predictions, average='micro')
    print("Micro-average quality numbers")
    print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

    precision = precision_score(test_labels, predictions, average='macro')
    recall = recall_score(test_labels, predictions, average='macro')
    f1 = f1_score(test_labels, predictions, average='macro')

    print("Macro-average quality numbers")
    print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

In [14]:
documents = reuters.fileids()

In [15]:
train_docs, train_labels, test_docs, test_labels = represent(documents)

In [16]:
model = train_classifier(train_docs, train_labels)

In [17]:
predictions = model.predict(test_docs)

# In micro-averaging, we pool the true positives, false positives, and false negatives of all k classes and compute the metric, e.g., precision, from the pooled counts.

# In macro-averaging, we compute the metric for each class separately and then take the unweighted mean over the classes.
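
The difference between the two averaging schemes can be worked through by hand. The per-class counts below are hypothetical and only meant to show how one poorly handled rare class pulls the macro average down while barely affecting the micro average. Treating an undefined ratio as 0 is an assumption of this sketch.

```python
# Hypothetical (TP, FP, FN) counts per class, for illustration only.
counts = {
    "frequent": (90, 5, 10),
    "medium": (8, 2, 2),
    "rare": (0, 0, 5),
}

def ratio(numerator, denominator):
    # Assumption: an undefined ratio (0/0) is scored as 0.
    return numerator / denominator if denominator else 0.0

# Micro: pool the counts over all classes, then compute the metric once.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_precision = ratio(tp, tp + fp)
micro_recall = ratio(tp, tp + fn)

# Macro: compute the metric per class, then take the unweighted mean.
macro_precision = sum(ratio(t, t + f) for t, f, _ in counts.values()) / len(counts)
macro_recall = sum(ratio(t, t + n) for t, _, n in counts.values()) / len(counts)

print("micro: P={:.4f} R={:.4f}".format(micro_precision, micro_recall))
print("macro: P={:.4f} R={:.4f}".format(macro_precision, macro_recall))
```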

In [18]:
evaluate(test_labels, predictions)

Micro-average quality numbers
Precision: 0.9455, Recall: 0.8013, F1-measure: 0.8674
Macro-average quality numbers
Precision: 0.6493, Recall: 0.3948, F1-measure: 0.4665


# Result Interpretation:
For the micro average, precision (0.9455) is high and recall (0.8013) is reasonably high. High precision means that most of the predicted labels are correct, i.e. there are few false positives. Recall, also called sensitivity or true positive rate, is the share of the actual labels that the model recovers, so about 80% of the true labels are found. The F1 score is the harmonic mean of precision and recall and is therefore also high (0.8674).

For the macro average, precision is moderate (0.6493) while recall is low (0.3948). Macro-averaging gives every category the same weight regardless of how many documents it contains, so these lower values indicate that the classifier does considerably worse on some categories than the pooled figures suggest, most likely the categories with few documents. For those categories a large share of the true labels is missed, i.e. sensitivity is low. The macro F1 score (0.4665) is correspondingly mediocre.
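
The relation between F1, precision, and recall can be checked against the numbers reported above. The sketch below only uses the four printed values; the helper name `harmonic_f1` is introduced here for illustration. The macro F1 reported by scikit-learn is the mean of the per-class F1 scores, so it does not have to equal the harmonic mean of macro precision and macro recall.

```python
def harmonic_f1(precision, recall):
    # F1 as the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Values copied from the evaluation output above.
micro_f1 = harmonic_f1(0.9455, 0.8013)
macro_harmonic = harmonic_f1(0.6493, 0.3948)

# Micro: agrees with the reported 0.8674 up to rounding.
print("micro F1 from P and R: {:.3f}".format(micro_f1))
# Macro: differs from the reported 0.4665, because macro F1 averages
# the per-class F1 scores instead of combining the averaged P and R.
print("harmonic mean of macro P and R: {:.3f}".format(macro_harmonic))
```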

# Conclusion:
- I will proceed with the micro-averaged metrics, since micro-averaged results are a measure of effectiveness on the large classes in a test collection. The micro-average method pools the per-document decisions across classes and then computes the effectiveness measure on the pooled contingency table.

- To summarize this approach: I pre-processed the text, tokenized it, used TF-IDF to create document vectors, used MultiLabelBinarizer to encode the multiple labels per document as binary indicator vectors, trained a OneVsRest classifier for the multi-label classification, and evaluated the model with micro-averaged precision, recall, and F1-score.

- A way forward is to introduce hand-crafted features related to the business domain of the dataset, which may further improve precision, recall, and F1-score. We can also try other NLP steps such as lemmatization, named entity recognition, and PoS tagging to further clean the training data and improve the model.