# Classification on Textual data

### Classification is the case of **Supervised Learning** where the target is always given. The explicit examples of output is provided what target model is supposed to produce for specific input. The data is split into an input space ${X}$ and an output space  ${Y}$

$$\Large{f:X->Y} $$

#### For this classifcation demo we are going to identify the genre of the text by using:
    - Statistical Models
        - Naive Bayes
        - SVM 
    - Neural Network Models
        - BiLSTM
    - Pretrained Transformer Models

##### We have the news data. I am not sure from where I got this dataset but I don't claim this is my property or I created it

In [1]:
from sklearn import model_selection, preprocessing, linear_model, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from TextPreProcessing import Preprocessing
from sklearn import decomposition, ensemble
import pandas as pd
import numpy as np
import nltk, string
import spacy, pickle
import warnings
warnings.filterwarnings('ignore')
# Loading model

from sklearn.naive_bayes import MultinomialNB

## Pre-Processing

In [2]:
# load the data using our favourite library pandas :D 
df = pd.read_csv("./train.csv", sep=',',  encoding='utf8')
# to see how much data we have on each class 
df.groupby('label').size()

label
business         610
entertainment    486
food              59
graphics          65
historical        22
politics         517
sport            298
tech             413
dtype: int64

In [3]:
# Preprocessing news and save them in new column
preprocess = Preprocessing()
df['cleanText'] = df['text'].apply(preprocess.normalizeText)


In [4]:
df.head()

Unnamed: 0,text,label,cleanText
0,Ad sales boost Time Warner profit\r\n\r\nQuart...,business,ad sales boost time warner profit quarterly pr...
1,Dollar gains on Greenspan speech\r\n\r\nThe do...,business,dollar gains on greenspan speech the dollar ha...
2,Yukos unit buyer faces loan claim\r\n\r\nThe o...,business,yukos unit buyer faces loan claim the owners o...
3,High fuel prices hit BA's profits\r\n\r\nBriti...,business,high fuel prices hit ba s profits british airw...
4,Pernod takeover talk lifts Domecq\r\n\r\nShare...,business,pernod takeover talk lifts domecq shares in uk...


In [5]:
# converting the labels into catgories (like 'business' -> 1, entertainment -> 2 etc) save them in new column labelCategory

df['label'] = df['label'].astype('category')
df['labelCategory'] = df['label'].cat.codes
df['label'].unique()

[business, entertainment, politics, sport, tech, food, graphics, historical]
Categories (8, object): [business, entertainment, politics, sport, tech, food, graphics, historical]

# train and test split

### This function from scikit-learn will randomly split the data into test and train 

In [6]:
# train and test split
trainX, testX, trainY, testY = model_selection.train_test_split(df['cleanText'], df['label']) 

# feature engineering
transform text data into feature vector. Features can be get using

- Counter Vector or Bag of Words: converting text to matrix 
- TF-IDF Vector
     - Word level
     - Character level
     - N-Gram level
- Word embeddings
- Text/NLP based features
- Topic modelling

#### Bag of Words

Scikit-learn’s CountVectorizer converts the text documents into count vector form.The strings are converted in tokens first then it places 1 on each word in the document and increases its count if it appears again in the corpus.

In [7]:
# 1- Counter Vector
countVec = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
countVec.fit(df['text'].apply(lambda x: np.str_(x)))

# transform the training and testing data using count vectorize object 
trainXCount = countVec.transform(trainX.apply(lambda x: np.str_(x)))
testXCount = countVec.transform(testX.apply(lambda x: np.str_(x)))

# to see the features countVec.get_feature_names()
print(testXCount.shape, trainXCount.shape, trainY.shape)

# saving countVector to pickle file
pickle.dump(countVec.vocabulary_, open("countVector.pkl", "wb"))

(618, 29336) (1852, 29336) (1852,)


#### Term Frequence Inverse Term Frequence (TF-IDF)
TF-IDF stands for term frequency-inverse document frequency. The higher the value of TF-IDF for a term, the higher its frequency in the corpus and low document frequency will be. In other words, to get the higher value means a word is rear in the whole document but frequent in a document.

Tf-IDF can be performed on:
    - Character level
    - Word level
    - N-gram level

For this demo we are going to perfom on all the three to see and compare the results 

In [8]:
# word level
tfIDF = TfidfVectorizer(analyzer="word", token_pattern=r'\w{1,}', max_features=5000)
tfIDF.fit(df['cleanText'].apply(lambda x: np.str_(x)))
trainX_TfIDF = tfIDF.transform(trainX.apply(lambda x: np.str_(x)))
testX_TfIDF = tfIDF.transform(testX.apply(lambda x: np.str_(x)))
pickle.dump(tfIDF, open("tfIDFWord.pkl", "wb"))

# ngram level 
tfIDF = TfidfVectorizer(token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)

tfIDF.fit(df['cleanText'].apply(lambda x: np.str_(x)))
trainX_TfIDFNgram = tfIDF.fit_transform(trainX.apply(lambda x: np.str_(x)))
testX_TfIDFNgram = tfIDF.fit_transform(testX.apply(lambda x: np.str_(x)))
pickle.dump(tfIDF, open("tfIDFNGram.pkl", "wb"))

In [9]:
# count vector level
tfIDF = TfidfTransformer()
trainX_TfCount = tfIDF.fit_transform(trainXCount)
testX_TfIDFCount = tfIDF.fit_transform(testXCount)
pickle.dump(tfIDF, open("tfIDFCount.pkl", "wb"))

In [10]:
def modelTraining(model, trainX, trainY, testX):
    """ 
        trainX = vectorized data
        trainY = labels of 'vectorized data'
        testY = test data 
    """
    classifier = model.fit(trainX, trainY)
#     pickle.dump(model, open("modelMNB_VecCount_TfIDF.pkl", "wb"))
    pickle.dump(model, open("modelSVM_TfIDF.pkl", "wb"))
    predictions = classifier.predict(testX)
    print(metrics.confusion_matrix(testY, predictions))
    print(metrics.accuracy_score(predictions, testY) * 100)
    print( metrics.classification_report(testY, predictions, target_names=  ['business', 'entertainment', 'politics', 'sport', 'tech', 'food', 'graphics', 'historical']))
    return metrics.accuracy_score(predictions, testY) * 100
    

In [11]:
# Naive Bayes
# 1- vectorCount => 49%
# accuracy = modelTraining(MultinomialNB(), trainXCount, trainY, testXCount) 
# print("Naive Bayes accuracy -> vectorCount = ", accuracy)
# 2- tf-IDF Word level => 83.44
# accuracy = modelTraining(MultinomialNB(), trainX_TfIDF, trainY, testX_TfIDF)
# print("Naive Bayes accuracy-> tf-IDF Word level = ", accuracy)
# # 3- tf-IDF n-gram uni and bigram level => 86.08, bi and trigram => 82.67
# accuracy = modelTraining(MultinomialNB(), trainX_TfIDFNgram, trainY, testX_TfIDFNgram)
# print("Naive Bayes accuracy-> tf-IDF n-gram = ", accuracy)
# # 4- tf-IDF Vector Count transformer level => 87.05
accuracy = modelTraining(MultinomialNB(), trainX_TfCount, trainY, testX_TfIDFCount)
print("Naive Bayes accuracy-> tf-IDF Word level = ", accuracy)

[[145   0   0   0   0   1   0   0]
 [  2 111   0   0   0   3   0   0]
 [ 13   0   3   0   0   0   0   0]
 [  0   0   0   0   0   5   0  10]
 [  4   0   0   0   0   3   0   0]
 [  2   0   0   0   0 134   0   0]
 [  3   1   0   0   0   7  53   0]
 [  7   2   0   0   0   6   0 103]]
88.83495145631069
               precision    recall  f1-score   support

     business       0.82      0.99      0.90       146
entertainment       0.97      0.96      0.97       116
     politics       1.00      0.19      0.32        16
        sport       0.00      0.00      0.00        15
         tech       0.00      0.00      0.00         7
         food       0.84      0.99      0.91       136
     graphics       1.00      0.83      0.91        64
   historical       0.91      0.87      0.89       118

     accuracy                           0.89       618
    macro avg       0.69      0.60      0.61       618
 weighted avg       0.87      0.89      0.87       618

Naive Bayes accuracy-> tf-IDF Word lev

In [12]:
# 4- SVM => 50.16
from sklearn.linear_model import SGDClassifier
SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42)
accuracy = modelTraining(svm.SVC(), trainX_TfCount, trainY, testX_TfIDFCount)
print("SVM accuracy-> tf-IDF n-gram = ", accuracy)

[[145   0   0   0   0   0   0   1]
 [  1 114   0   0   0   1   0   0]
 [  2   0  14   0   0   0   0   0]
 [  4   0   0  11   0   0   0   0]
 [  4   2   0   0   0   1   0   0]
 [  5   0   0   0   0 130   0   1]
 [  2   0   0   0   0   0  62   0]
 [  2   3   0   0   0   0   0 113]]
95.30744336569579
               precision    recall  f1-score   support

     business       0.88      0.99      0.93       146
entertainment       0.96      0.98      0.97       116
     politics       1.00      0.88      0.93        16
        sport       1.00      0.73      0.85        15
         tech       0.00      0.00      0.00         7
         food       0.98      0.96      0.97       136
     graphics       1.00      0.97      0.98        64
   historical       0.98      0.96      0.97       118

     accuracy                           0.95       618
    macro avg       0.85      0.81      0.83       618
 weighted avg       0.95      0.95      0.95       618

SVM accuracy-> tf-IDF n-gram =  95.307

# Transformer model

In [25]:
import sklearn
from sklearn import model_selection, preprocessing, linear_model, metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn import decomposition, ensemble
from simpletransformers.classification import ClassificationModel, ClassificationArgs

In [14]:
trainX, testX, trainY, testY = model_selection.train_test_split(df['cleanText'], df['labelCategory'])

In [18]:
# we are using wandb library 
model_args = ClassificationArgs()

model_args.num_train_epochs = 3
model_args.wandb_project = 'newsClassification'
model_args.use_early_stopping = True
model_args.early_stopping_delta = 0.01
model_args.early_stopping_metric = "mcc"
model_args.early_stopping_metric_minimize = False
model_args.early_stopping_patience = 5
model_args.evaluate_during_training_steps = 1000


train_df = pd.DataFrame(trainX)

train_df['label'] = pd.DataFrame(trainY)
labelsdict = dict(enumerate(df['label'].cat.categories))
print(labelsdict)

{0: 'business', 1: 'entertainment', 2: 'food', 3: 'graphics', 4: 'historical', 5: 'politics', 6: 'sport', 7: 'tech'}


In [27]:
# fine-tune RoBERTa based model for classification
model = ClassificationModel(
     "roberta", 
    # './trained-models/transformer-small/',  
    "roberta-base",  
    num_labels=len(labelsdict),
    use_cuda=False,
    args=model_args
) 
# for training
model.train_model(train_df, acc=sklearn.metrics.accuracy_score)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.de