![license_header_logo](../../../images/license_header_logo.png)

> **Copyright (c) 2021 CertifAI Sdn. Bhd.**<br>
<br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

# Introduction

This tutorial demonstrates how to implement text classification on news data using a machine learning approach. The classifier is required to assign each text to its corresponding category in a supervised manner.

# What will we accomplish?

Steps to implement a text classifier with machine learning:

> Step 1: Importing Libraries

> Step 2: Loading Datasets & Exploratory Data Analysis

> Step 3: Text Pre-processing

> Step 4: Feature Extraction (Vectorization)

> Step 5: Running ML algorithms

> Step 6: Grid Search for parameter tuning

# Prerequisite

<h5>Python version: 3.9.6 (Python 3+)</h5>

### Libraries Required:
    1. NLTK
    2. sklearn
    3. re # Regular Expression
    4. gensim
    
**Note**: A basic knowledge of Python and ML, including text classification, is required.

# Notebook Content

* [Introduction](#Introduction)


* [Step 1: Importing Libraries](#Step-1:-Importing-Libraries)


* [Step 2: Loading Datasets & EDA](#Step-2:-Loading-Datasets-&-EDA)


* [Step 3: Text Pre-Processing](#Step-3:-Text-Pre-Processing)

    * [Text Cleaning](#Text-Cleaning)
    
    * [Stopword Removal](#Stopword-Removal)
    
    * [Lemmatization](#Lemmatization)
    
    * [Full Text Preprocessing](#Full-Text-Preprocessing)
    
    * [Data Cleaning on Text Data](#Data-Cleaning-on-Text-Data)
    
    * [Pickling Cleaned Text Data](#Pickling-Cleaned-Text-Data)
    
    * [Loading Pickle File](#Loading-Pickle-File)
    

* [Step 4: Feature Extraction (Vectorization)](#Step-4:-Feature-Extraction-(Vectorization))
    
    * [Term Frequency-Inverse Document Frequencies (TF-IDF)](#Term-Frequency-Inverse-Document-Frequencies-(TF-IDF))
    
    * [Word2Vec](#Word2Vec)
    
    
* [Step 5: Running ML algorithms](#Step-5:-Running-ML-algorithms)
    
    * [Naive Bayes (NB)](#Naive-Bayes-(NB))
    
    * [Support Vector Machine](#Support-Vector-Machine)


* [Step 6: Grid Search](#Step-6:-Grid-Search)


* [Conclusion](#Conclusion)

# Introduction

**Document/text classification** is an important and typical task in **supervised machine learning (ML)**. Assigning categories to documents, which can be web pages, library books, media articles, gallery items, etc., has many applications, such as **spam filtering**, **email routing**, and **sentiment analysis**.

![text-classification](../../../images/text-classification.png)

# Step 1: Importing Libraries

In [1]:
import numpy as np

# Import fetch_20newsgroups
from sklearn.datasets import fetch_20newsgroups

# For text pre-processing
import re, string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

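# WordNet data for the lemmatizer. word_tokenize, stopwords and nltk.pos_tag
# need their own NLTK resources as well (for example 'punkt', 'stopwords' and
# 'averaged_perceptron_tagger'; exact names depend on the NLTK version).
# Download them the same way if they are missing.
nltk.download('wordnet')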

# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# Pickling File
import pickle

# Word2Vec
from gensim.models import Word2Vec

# Machine Learning algorithm
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# For fine tuning
from sklearn.model_selection import GridSearchCV

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tanch\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Step 2: Loading Datasets & EDA

The dataset used in this tutorial is the well-known “20 Newsgroups” dataset. The original [website](http://qwone.com/~jason/20Newsgroups/) describes it as follows:

> The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

scikit-learn ships a loader for this dataset, `fetch_20newsgroups`, which downloads the data on first use and caches it locally, so we don’t need to download it manually.

In [2]:
train = fetch_20newsgroups(subset='train', shuffle=True)
test = fetch_20newsgroups(subset='test', shuffle=True)

## Exploratory Data Analysis

Both the training set and the test set are `sklearn.utils.Bunch` objects, dictionary-like containers whose values can be accessed either by key or as attributes. For more information, you may refer to https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html

In [3]:
type(train)

sklearn.utils.Bunch
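
As a small illustration of that dual access, the sketch below builds a toy `Bunch` by hand and reads it both ways. The field values are made up for this example and are not taken from the dataset.

```python
from sklearn.utils import Bunch

# Toy Bunch with made-up contents; the object returned by fetch_20newsgroups
# exposes fields with the same names (data, target, target_names).
toy = Bunch(
    data=["first post", "second post"],
    target=[1, 0],
    target_names=["rec.autos", "sci.space"],
)

# Dictionary-style and attribute-style access return the same object
same_object = toy["data"] is toy.data
print(same_object)

print(sorted(toy.keys()))
print(toy.target_names[toy.target[0]])
```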

To get all the target names (categories) of the news data, use the `.target_names` attribute.

In [4]:
train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Let's get some intuition for what our data looks like.

In [5]:
print(train.data[0])
print("Target label:", train.target_names[train.target[0]])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





Target label: rec.autos


# Step 3: Text Pre-Processing

Before we move to model building, we need to preprocess the dataset by **removing punctuation** and **special characters**, **cleaning the text**, **removing stop words**, and **applying lemmatization**.

**Simple text cleaning processes**: 
Common text cleaning steps include:
* Removing punctuation, special characters, URLs and hashtags
* Removing leading, trailing and extra white space/tabs
* Correcting typos and slang, and expanding abbreviations to their long forms

**Stop-word removal**: 
We can remove generic English stop words using the list that nltk provides. A few such words are ‘i’, ‘you’, ‘a’, ‘the’, ‘he’, ‘which’, etc.

**Stemming**: 
The process of chopping off the end or the beginning of a word to remove affixes (prefixes/suffixes). The result is not always a valid word.


**Lemmatization**: 
The process of reducing a word to its dictionary base form (lemma), for example "sold" to "sell".


![Stemming vs Lemmatization](../../../images/stemming_lemmatization.png)
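
To make the contrast concrete, here is a deliberately simplified sketch. Both `toy_stem` and `toy_lemmatize`, along with the suffix list and the lemma table, are hypothetical stand-ins written for this illustration; they are not NLTK's stemmer or lemmatizer.

```python
# Toy illustration only: neither function is NLTK's implementation.
SUFFIXES = ("ing", "ies", "ed", "s")  # hypothetical suffix list

def toy_stem(word):
    """Crude stemmer: chop off the first matching suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma table standing in for a lexical resource such as WordNet
LEMMAS = {"studies": "study", "running": "run", "better": "good"}

def toy_lemmatize(word):
    """Dictionary lookup: return the base form, or the word itself if unknown."""
    return LEMMAS.get(word, word)

for w in ["studies", "running", "better"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer only manipulates characters, so it can produce fragments such as `stud` or `runn`, while the lookup-based lemmatizer returns real words but only for entries it knows.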

In [6]:
sample_text = train.data[10]
print(sample_text)

From: irwin@cmptrc.lonestar.org (Irwin Arnstein)
Subject: Re: Recommendation on Duc
Summary: What's it worth?
Distribution: usa
Expires: Sat, 1 May 1993 05:00:00 GMT
Organization: CompuTrac Inc., Richardson TX
Keywords: Ducati, GTS, How much? 
Lines: 13

I have a line on a Ducati 900GTS 1978 model with 17k on the clock.  Runs
very well, paint is the bronze/brown/orange faded out, leaks a bit of oil
and pops out of 1st with hard accel.  The shop will fix trans and oil 
leak.  They sold the bike to the 1 and only owner.  They want $3495, and
I am thinking more like $3K.  Any opinions out there?  Please email me.
Thanks.  It would be a nice stable mate to the Beemer.  Then I'll get
a jap bike and call myself Axis Motors!

-- 
-----------------------------------------------------------------------
"Tuba" (Irwin)      "I honk therefore I am"     CompuTrac-Richardson,Tx
irwin@cmptrc.lonestar.org    DoD #0826          (R75/6)
-------------------------------------------------------------------

## Text Cleaning

In [7]:
def cleaning(text):
    # lowercase all the characters
    text = text.lower()
    # Remove leading, trailing whitespaces
    text = text.strip()
    
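    # Using regular expression (re) to preprocess text
    text = re.compile(r'<.*?>').sub('', text)  # drop HTML-like tags
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)  # punctuation -> space
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    text = re.sub(r'\[[0-9]*\]', ' ', text)  # drop bracketed numbers such as [12]
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())  # drop remaining non-word characters
    text = re.sub(r'\d', ' ', text)  # digits -> space
    text = re.sub(r'\s+', ' ', text)  # collapse whitespace again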
    
    # Return preprocessed text
    return text

In [8]:
print(cleaning(sample_text))

from irwin cmptrc lonestar org irwin arnstein subject re recommendation on duc summary what s it worth distribution usa expires sat may gmt organization computrac inc richardson tx keywords ducati gts how much lines i have a line on a ducati gts model with k on the clock runs very well paint is the bronze brown orange faded out leaks a bit of oil and pops out of st with hard accel the shop will fix trans and oil leak they sold the bike to the and only owner they want and i am thinking more like k any opinions out there please email me thanks it would be a nice stable mate to the beemer then i ll get a jap bike and call myself axis motors tuba irwin i honk therefore i am computrac richardson tx irwin cmptrc lonestar org dod r 


## Stopword Removal

In [9]:
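# Build the stop-word set once; calling stopwords.words() for every token is slow
STOP_WORDS = set(stopwords.words('english'))

def remove_stopword(text):
    filtered_words = [w for w in word_tokenize(text) if w not in STOP_WORDS]
    return " ".join(filtered_words)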

In [10]:
print(remove_stopword(sample_text))

From : irwin @ cmptrc.lonestar.org ( Irwin Arnstein ) Subject : Re : Recommendation Duc Summary : What 's worth ? Distribution : usa Expires : Sat , 1 May 1993 05:00:00 GMT Organization : CompuTrac Inc. , Richardson TX Keywords : Ducati , GTS , How much ? Lines : 13 I line Ducati 900GTS 1978 model 17k clock . Runs well , paint bronze/brown/orange faded , leaks bit oil pops 1st hard accel . The shop fix trans oil leak . They sold bike 1 owner . They want $ 3495 , I thinking like $ 3K . Any opinions ? Please email . Thanks . It would nice stable mate Beemer . Then I 'll get jap bike call Axis Motors ! -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - '' Tuba '' ( Irwin ) `` I honk therefore I '' CompuTrac-Richardson , Tx irwin @ cmptrc.lonestar.org DoD # 0826 ( R75/6 ) -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -


## Lemmatization

In [11]:
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Helper function that maps NLTK part-of-speech (POS) tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    
    elif tag.startswith('V'):
        return wordnet.VERB
    
    elif tag.startswith('N'):
        return wordnet.NOUN
    
    elif tag.startswith('R'):
        return wordnet.ADV
    
    else:
        return wordnet.NOUN
    
# Working on lemmatization
# The lemmatizer works best when it is given each word's part-of-speech (POS) tag
def lemmatization(text):
    # Get NLTK POS tags
    word_pos_tags = nltk.pos_tag(word_tokenize(text))
    # Map the POS tags to WordNet tags and lemmatize each word/token
    processed_words = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in word_pos_tags]
    return " ".join(processed_words)

In [12]:
print(lemmatization(sample_text))

From : irwin @ cmptrc.lonestar.org ( Irwin Arnstein ) Subject : Re : Recommendation on Duc Summary : What 's it worth ? Distribution : usa Expires : Sat , 1 May 1993 05:00:00 GMT Organization : CompuTrac Inc. , Richardson TX Keywords : Ducati , GTS , How much ? Lines : 13 I have a line on a Ducati 900GTS 1978 model with 17k on the clock . Runs very well , paint be the bronze/brown/orange fade out , leak a bit of oil and pop out of 1st with hard accel . The shop will fix trans and oil leak . They sell the bike to the 1 and only owner . They want $ 3495 , and I be think more like $ 3K . Any opinion out there ? Please email me . Thanks . It would be a nice stable mate to the Beemer . Then I 'll get a jap bike and call myself Axis Motors ! -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - '' Tuba '' ( Irwin ) `` I honk therefore I be '' CompuTrac-Richardson , Tx irwin @ cmptrc.lonestar.org DoD # 0826 ( R75/6 ) -- -- -- -- -- -- --

## Full Text Preprocessing

In [13]:
def preprocess(text):
    cleaned_text = cleaning(text)
    filtered_text = remove_stopword(cleaned_text)
    preprocessed_text = lemmatization(filtered_text)
    return preprocessed_text

In [14]:
print(preprocess(sample_text))

irwin cmptrc lonestar org irwin arnstein subject recommendation duc summary worth distribution usa expire sat may gmt organization computrac inc richardson tx keywords ducati gts much line line ducati gts model k clock run well paint bronze brown orange fade leaks bit oil pop st hard accel shop fix trans oil leak sell bike owner want think like k opinion please email thanks would nice stable mate beemer get jap bike call axis motor tuba irwin honk therefore computrac richardson tx irwin cmptrc lonestar org dod r


## Data Cleaning on Text Data

In [15]:
# # Cleaning train data
# for idx, text in enumerate(train.data):
#     train.data[idx] = preprocess(text)
    
# print("Training data is cleaned")

In [16]:
# for idx, text in enumerate(test.data):
#     test.data[idx] = preprocess(text)

# print("Test data is cleaned")

## Pickling Cleaned Text Data

In [17]:
# with open("model/train.pkl", 'wb') as file:
#     pickle.dump(train, file)

In [18]:
# with open("model/test.pkl", 'wb') as file:
#     pickle.dump(test, file)

## Loading Pickle File

In [19]:
# Load train data
with open("model/train.pkl", 'rb') as file:
    train = pickle.load(file)

In [20]:
# Load test data
with open("model/test.pkl", 'rb') as file:
    test = pickle.load(file)
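
The cells above assume that a `model/` folder already exists, since `open()` does not create missing directories. As a minimal sketch of the save-and-load round trip, the example below pickles a small made-up object into a temporary folder; the `cleaned` dictionary and the temporary path are placeholders for illustration, not the notebook's real data.

```python
import os
import pickle
import tempfile

# Stand-in for the cleaned dataset; any picklable object works the same way
cleaned = {"data": ["irwin cmptrc lonestar org"], "target": [8]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = os.path.join(tmp_dir, "model")
    os.makedirs(model_dir, exist_ok=True)  # create the folder before writing
    path = os.path.join(model_dir, "train.pkl")

    with open(path, "wb") as file:
        pickle.dump(cleaned, file)

    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == cleaned)
```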

# Step 4: Feature Extraction (Vectorization)

It’s difficult to work with **text data** directly when building **machine learning** models, since these models need well-defined **numerical data**. The process of converting text into numerical vectors is called **vectorization**; dense, learned vectors for words are known in NLP as **word embeddings**. **Bag-of-Words (BoW)** and **word embedding (with Word2Vec)** are two well-known ways to convert text to numerical data. In this notebook, we use Term Frequency-Inverse Document Frequency (TF-IDF), a weighted form of BoW, and Word2Vec for feature extraction and vectorization.

## Term Frequency-Inverse Document Frequencies (TF-IDF)

With TF-IDF, the weight of a word **increases** with the number of times it appears in a document, but is **offset** by the number of documents in the corpus that contain it, so words that appear almost everywhere carry little weight.
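
As a rough sketch of how this plays out in scikit-learn, the toy example below checks the inverse document frequency by hand. It assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`), under which the IDF of a term `t` is `ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing `t`. The three-document corpus is made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up): "car" occurs in every document, "engine" in only one
docs = ["car engine car", "car road", "car bike"]

vec = TfidfVectorizer()  # defaults: use_idf=True, smooth_idf=True, norm='l2'
X = vec.fit_transform(docs)
vocab = vec.vocabulary_  # maps term -> column index

# Recompute the smoothed IDF by hand and compare with the fitted values
n_docs = len(docs)
doc_freq = {"car": 3, "engine": 1}
for term, df in doc_freq.items():
    manual_idf = np.log((1 + n_docs) / (1 + df)) + 1
    print(term, manual_idf, vec.idf_[vocab[term]])

# Each document vector is L2-normalised after the tf * idf weighting
row = X[0].toarray().ravel()
print("norm of document 0:", np.linalg.norm(row))
```

The rare term `engine` receives a larger IDF than `car`, which appears in every document and therefore gets the minimum IDF of 1.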

In [21]:
# Initialize the TF-IDF vectorizer; fit on the training data only
tf_idf = TfidfVectorizer(use_idf=True)
X_train_tfidf = tf_idf.fit_transform(train.data)
X_test_tfidf = tf_idf.transform(test.data)

In [22]:
X_train_tfidf.shape

(11314, 79397)

## Word2Vec

Word2Vec is one of the most popular techniques for learning word embeddings. It uses a shallow neural network and can capture the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.

In [23]:
# Word2Vec runs on tokenized sentences
X_train_tok = [nltk.word_tokenize(w) for w in train.data]
X_test_tok = [nltk.word_tokenize(w) for w in test.data]

Let’s try to understand the hyperparameters of this model.

* vector_size: The number of dimensions of the embeddings. The default is 100.
* window: The maximum distance between a target word and the words around it. The default is 5.
* min_count: The minimum word count to consider; words that occur fewer times than this are ignored. The default is 5.
* workers: The number of worker threads used during training. The default is 3.
* sg: The training algorithm, either CBOW (0) or skip-gram (1). The default is CBOW.

After training the Word2Vec model, we can obtain word embeddings directly from the trained model, as follows.

In [24]:
# Turns each tokenized document into one vector by averaging its Word2Vec word vectors
class MeanEmbeddingVectorizer(object):
    def __init__(self, model):
        self.model = model
        
        self.word2vec = model.wv
        
        self.dim = model.vector_size
        
    def fit(self, X, y):
        return self
    
    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
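
The averaging logic in `transform` can be seen in isolation with a few hand-written vectors. This is only a sketch: `toy_vectors` is a made-up dictionary standing in for `model.wv`, and `mean_embedding` is a hypothetical helper that mirrors the list comprehension above.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; a trained model.wv would supply these
toy_vectors = {
    "car": np.array([1.0, 0.0, 2.0]),
    "engine": np.array([3.0, 2.0, 0.0]),
}
dim = 3

def mean_embedding(tokens):
    """Average the vectors of known tokens; all-zero vector if none are known."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    return np.mean(known or [np.zeros(dim)], axis=0)

print(mean_embedding(["car", "engine", "unknownword"]))  # unknown token is skipped
print(mean_embedding(["unknownword"]))  # falls back to a zero vector
```

Because every document collapses to a single average, word order is lost and documents with no known tokens all map to the same zero vector.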

In [25]:
clean_text_tok = [nltk.word_tokenize(i) for i in train.data]
model = Word2Vec(clean_text_tok, min_count=1, vector_size=150, workers=8, window=8) 

modelw = MeanEmbeddingVectorizer(model)

# converting text to numerical data using Word2Vec
X_train_w2v = modelw.transform(X_train_tok)
X_test_w2v = modelw.transform(X_test_tok)

In [26]:
X_train_w2v.shape

(11314, 150)

# Step 5: Running ML algorithms

It’s time to train a machine learning model on the vectorized dataset and test it. Now that we have converted the text data to numerical data, we can fit ML models on ***X_train_tfidf*** and ***train.target***. We’ll then predict on ***X_test_tfidf*** and compare the predictions with ***test.target*** to evaluate the performance of the model.

## Naive Bayes (NB)

### Using TFIDF-Text Data

In [27]:
# Initialize Multinomial Naive Bayes
clf = MultinomialNB()

In [28]:
# Fit Naive Bayes classifier according to X, y
clf.fit(X_train_tfidf, train.target)

MultinomialNB()

In [29]:
predicted = clf.predict(X_test_tfidf)

In [30]:
accuracy = np.mean(predicted == test.target)

print("NB-TFIDF Model Accuracy:")
print(accuracy)

NB-TFIDF Model Accuracy:
0.8076208178438662
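
The accuracy above is computed by hand as the fraction of matching labels. As a small sketch on made-up labels (not results from this notebook), the same quantity can be obtained with `accuracy_score`, and a confusion matrix shows which classes get mixed up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

manual = np.mean(y_pred == y_true)
library = accuracy_score(y_true, y_pred)
print(manual, library)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```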


### Using Word2vec Text Data

Our Word2Vec features contain negative values, and MultinomialNB fails when features are negative. We therefore use GaussianNB instead, which models each feature with a Gaussian distribution and accepts any real-valued input.
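
The difference between the two classifiers can be checked on a tiny made-up feature matrix. This is only a sketch; the values below are invented to resemble averaged embeddings and are not taken from `X_train_w2v`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Tiny made-up feature matrix containing negative values
X = np.array([[-0.5, 1.2], [0.3, -0.7], [-0.4, 1.0], [0.2, -0.9]])
y = np.array([0, 1, 0, 1])

try:
    MultinomialNB().fit(X, y)
    multinomial_ok = True
except ValueError as err:
    # MultinomialNB expects count-like, non-negative features
    multinomial_ok = False
    print("MultinomialNB:", err)

gnb = GaussianNB().fit(X, y)
print("GaussianNB predictions:", gnb.predict(X))
```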

In [31]:
clf = GaussianNB().fit(X_train_w2v, train.target)

In [32]:
predicted = clf.predict(X_test_w2v)

In [33]:
accuracy = np.mean(predicted == test.target)

print("NB-W2V Model Accuracy:")
print(accuracy)

NB-W2V Model Accuracy:
0.45419543281996816


## Support Vector Machine

### Using TFIDF-Text Data

In [34]:
SVM = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=100, random_state=42)

In [35]:
SVM.fit(X_train_tfidf, train.target)

SGDClassifier(alpha=0.001, max_iter=100, random_state=42)

In [36]:
predicted = SVM.predict(X_test_tfidf)

In [37]:
accuracy = np.mean(predicted == test.target)

print("SVM-TFIDF Model Accuracy:")
print(accuracy)

SVM-TFIDF Model Accuracy:
0.8166489644184811


### Using Word2vec Text Data

In [38]:
SVM.fit(X_train_w2v, train.target)

SGDClassifier(alpha=0.001, max_iter=100, random_state=42)

In [39]:
predicted = SVM.predict(X_test_w2v)

In [40]:
accuracy = np.mean(predicted == test.target)

print("SVM-W2V Model Accuracy:")
print(accuracy)

SVM-W2V Model Accuracy:
0.6034253850238981


As the results show, both the SVM and NB models produce lower accuracy with the Word2Vec features than with TF-IDF. A likely reason is that our Word2Vec model is trained only on the training set (11,314 documents), which is a small corpus for learning embeddings, so the averaged vectors capture the meaning of each document poorly.
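
One way to probe this explanation is to measure how many test tokens the embedding vocabulary actually covers, since tokens outside it are skipped when averaging. The sketch below does this on made-up token lists standing in for `X_train_tok` and `X_test_tok`; it is an illustrative check, not something the original notebook runs.

```python
# Made-up tokenized documents standing in for X_train_tok / X_test_tok
train_tokens = [["car", "engine", "oil"], ["bike", "engine"]]
test_tokens = [["car", "gearbox"], ["bike", "oil", "clutch", "engine"]]

# Vocabulary the embedding model would know (every training token here)
vocabulary = {tok for doc in train_tokens for tok in doc}

total = sum(len(doc) for doc in test_tokens)
covered = sum(tok in vocabulary for doc in test_tokens for tok in doc)

coverage = covered / total
print(f"{covered}/{total} test tokens covered ({coverage:.0%})")
```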

# Step 6: Grid Search

Almost all classifiers have parameters that can be tuned for better performance. Scikit-learn provides a very useful tool for this, `GridSearchCV`. In this section, let's fine-tune our SVM model on the TF-IDF text data. We wrap the vectorizer and the classifier in a `Pipeline` so that both are tuned together; the raw text in `train.data` is passed in, and parameters are addressed as `<step>__<parameter>`.

In [41]:
text_cls = Pipeline([('tfidf', TfidfVectorizer()), 
                     ('cls', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=1000))])

In [42]:
parameters = {
    'tfidf__use_idf': (True, False),
    'cls__alpha': [1e-2, 1e-3]
}

In [43]:
gs_text_cls = GridSearchCV(text_cls, parameters, n_jobs=-1)

In [44]:
text_cls.get_params()

{'memory': None,
 'steps': [('tfidf', TfidfVectorizer()), ('cls', SGDClassifier(alpha=0.001))],
 'verbose': False,
 'tfidf': TfidfVectorizer(),
 'cls': SGDClassifier(alpha=0.001),
 'tfidf__analyzer': 'word',
 'tfidf__binary': False,
 'tfidf__decode_error': 'strict',
 'tfidf__dtype': numpy.float64,
 'tfidf__encoding': 'utf-8',
 'tfidf__input': 'content',
 'tfidf__lowercase': True,
 'tfidf__max_df': 1.0,
 'tfidf__max_features': None,
 'tfidf__min_df': 1,
 'tfidf__ngram_range': (1, 1),
 'tfidf__norm': 'l2',
 'tfidf__preprocessor': None,
 'tfidf__smooth_idf': True,
 'tfidf__stop_words': None,
 'tfidf__strip_accents': None,
 'tfidf__sublinear_tf': False,
 'tfidf__token_pattern': '(?u)\\b\\w\\w+\\b',
 'tfidf__tokenizer': None,
 'tfidf__use_idf': True,
 'tfidf__vocabulary': None,
 'cls__alpha': 0.001,
 'cls__average': False,
 'cls__class_weight': None,
 'cls__early_stopping': False,
 'cls__epsilon': 0.1,
 'cls__eta0': 0.0,
 'cls__fit_intercept': True,
 'cls__l1_ratio': 0.15,
 'cls__learning_r

In [45]:
gs_text_cls = gs_text_cls.fit(train.data, train.target)

In [46]:
# Mean cross-validated accuracy of the best parameter setting
gs_text_cls.best_score_

0.8812091099152827

In [47]:
# Best model parameters
gs_text_cls.best_params_

{'cls__alpha': 0.001, 'tfidf__use_idf': True}
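
Note that `best_score_` is the mean cross-validated accuracy over the training data, so it is not directly comparable with the test accuracies reported in Step 5. The sketch below, on a made-up toy corpus, shows how a fitted `GridSearchCV` object can also be scored on held-out data. The corpus, the labels and `cv=2` are assumptions chosen for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up two-class corpus, repeated so every CV fold contains both classes
train_docs = ["fast car engine", "bike oil leak",
              "rocket orbit launch", "moon space probe"] * 5
train_labels = [0, 0, 1, 1] * 5
held_out_docs = ["car engine oil", "space rocket orbit"]
held_out_labels = [0, 1]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("cls", SGDClassifier(loss="hinge", random_state=0)),
])

# Parameter keys follow the pattern <step name>__<parameter name>
grid = {"tfidf__use_idf": [True, False], "cls__alpha": [1e-2, 1e-3]}

search = GridSearchCV(pipe, grid, cv=2).fit(train_docs, train_labels)

print("CV score on training data:", search.best_score_)
# score() uses the best pipeline, refitted on all training documents
held_out_acc = search.score(held_out_docs, held_out_labels)
print("Accuracy on held-out data:", held_out_acc)
```

In the notebook's setting, the equivalent call would be `gs_text_cls.score(test.data, test.target)`.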

# Conclusion

In this tutorial, I demonstrated the basics of building a text classification model, comparing **Bag-of-Words (with TF-IDF)** and **word embedding with Word2Vec**. You can further improve the performance of your model by

* using other classification algorithms such as XGBoost, ensemble models, neural networks, etc.
* using grid search over more parameters to tune the hyperparameters of your model
* using pretrained or more advanced word-embedding methods like GloVe and BERT

# Contributors

**Author**
<br>Chee Lam

# References

1. https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a
2. https://medium.com/analytics-vidhya/nlp-tutorial-for-text-classification-in-python-8f19cd17b49e