# Classification on Textual data

### Classification is a form of **Supervised Learning**, where the target is always given: every training input comes with an explicit example of the output the model is supposed to produce. The data is split into an input space ${X}$ and an output space ${Y}$, and the model learns a mapping between them:

$$\Large{f: X \rightarrow Y}$$

#### For this classification demo we are going to identify the genre of the text by using:
- Statistical Models
    - Naive Bayes
    - SVM
- Neural Network Models
    - BiLSTM
- Pretrained Transformer Models

##### The data is a collection of news articles. I am not sure where I got this dataset, and I do not claim that it is my property or that I created it.

In [25]:
from sklearn import model_selection, preprocessing, linear_model, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from TextPreProcessing import Preprocessing
from sklearn import decomposition, ensemble
import pandas as pd
import numpy as np
import nltk, string
import spacy, pickle
# Loading model

from sklearn.naive_bayes import MultinomialNB

## Pre-Processing

In [3]:
# load the data using our favourite library pandas :D 
df = pd.read_csv("./train.csv", sep=',',  encoding='utf8')
# to see how much data we have on each class 
df.groupby('label').size()

label
business         610
entertainment    486
food              59
graphics          65
historical        22
politics         517
sport            298
tech             413
dtype: int64

In [4]:
# Preprocess the news text and save the result in a new column
preprocess = Preprocessing()
df['cleanText'] = df['text'].apply(preprocess.normalizeText)
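
The `Preprocessing` class comes from a local module that is not shown here. Judging only from the `cleanText` values in the output (lowercase, punctuation replaced by spaces, line breaks collapsed), a rough stand-in could look like the sketch below. The function name `normalize_text` and its exact rules are assumptions, not the actual `normalizeText` implementation.

```python
import re
import string

# Hypothetical stand-in for Preprocessing.normalizeText (the real module is not shown).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def normalize_text(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = str(text).lower()
    text = text.translate(_PUNCT_TO_SPACE)
    # \s+ also matches the \r\n sequences found in the raw articles
    return re.sub(r"\s+", " ", text).strip()

sample = "High fuel prices hit BA's profits\r\n\r\nBritish Airways"
print(normalize_text(sample))
# high fuel prices hit ba s profits british airways
```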


In [5]:
df.head()

Unnamed: 0,text,label,cleanText
0,Ad sales boost Time Warner profit\r\n\r\nQuart...,business,ad sales boost time warner profit quarterly pr...
1,Dollar gains on Greenspan speech\r\n\r\nThe do...,business,dollar gains on greenspan speech the dollar ha...
2,Yukos unit buyer faces loan claim\r\n\r\nThe o...,business,yukos unit buyer faces loan claim the owners o...
3,High fuel prices hit BA's profits\r\n\r\nBriti...,business,high fuel prices hit ba s profits british airw...
4,Pernod takeover talk lifts Domecq\r\n\r\nShare...,business,pernod takeover talk lifts domecq shares in uk...


In [27]:
# converting the labels into categories (like 'business' -> 0, 'entertainment' -> 1 etc.) and saving them in a new column labelCategory

df['label'] = df['label'].astype('category')
df['labelCategory'] = df['label'].cat.codes
df['label'].unique()

[business, entertainment, politics, sport, tech, food, graphics, historical]
Categories (8, object): [business, entertainment, politics, sport, tech, food, graphics, historical]
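
One detail that is easy to miss: for string labels, pandas assigns category codes in sorted order starting at 0, not in order of appearance. The toy series below (made-up values, not the news data) is a small sketch of how to recover the code-to-label mapping.

```python
import pandas as pd

# Toy labels; "sport" appears first but does not get code 0.
labels = pd.Series(["sport", "business", "tech", "business"]).astype("category")

categories = list(labels.cat.categories)   # sorted category names
codes = labels.cat.codes.tolist()          # integer code per row
code_to_label = dict(enumerate(labels.cat.categories))

print(categories)     # ['business', 'sport', 'tech']
print(codes)          # [1, 0, 2, 0]
print(code_to_label)  # {0: 'business', 1: 'sport', 2: 'tech'}
```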

# train and test split

### `train_test_split` from scikit-learn randomly splits the data into train and test sets (by default, 25% of the rows go to the test set)

In [10]:
# train and test split
trainX, testX, trainY, testY = model_selection.train_test_split(df['cleanText'], df['label']) 
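
Because the classes are very unevenly sized (22 `historical` rows against 610 `business` rows), a purely random split can leave a rare class with very few test rows. A possible variation, shown here on a made-up toy frame, is to pass `stratify` so each class keeps its proportion, and `random_state` so the split is reproducible. The toy data and the seed value are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an 80/20 class imbalance.
toy = pd.DataFrame({
    "cleanText": [f"doc {i}" for i in range(20)],
    "label": ["business"] * 16 + ["historical"] * 4,
})

train_x, test_x, train_y, test_y = train_test_split(
    toy["cleanText"], toy["label"],
    test_size=0.25,          # same proportion as the default
    stratify=toy["label"],   # keep class proportions in both splits
    random_state=42,         # arbitrary seed for reproducibility
)

print(len(train_y), len(test_y))
print(test_y.value_counts().to_dict())
```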

# feature engineering
Transform the text data into feature vectors. Features can be obtained using:

- Count Vector: converts the text into a matrix of token counts
- TF-IDF Vector
     - Word level
     - Character level
     - N-Gram level
- Word embeddings
- Text/NLP based features
- Topic modelling
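
To make the three TF-IDF levels in the list concrete, here is a minimal sketch on a three-sentence toy corpus. It only varies the `analyzer` and `ngram_range` arguments of `TfidfVectorizer`; the corpus and the chosen n-gram ranges are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "dollar gains on speech",
    "fuel prices hit profits",
    "dollar prices rise",
]

vectorizers = {
    "word": TfidfVectorizer(analyzer="word"),                       # single words
    "char": TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),   # character 2- and 3-grams
    "ngram": TfidfVectorizer(analyzer="word", ngram_range=(2, 2)),  # word bigrams
}

# Each vectorizer yields one row per document and one column per feature.
shapes = {name: vec.fit_transform(corpus).shape for name, vec in vectorizers.items()}
print(shapes["word"])   # (3, 9)
print(shapes["ngram"])  # (3, 8)
```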

#### Count Vector

In [20]:
# 1- Count Vector
countVec = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
# fit on the cleaned text, the same column the train/test splits are taken from
countVec.fit(df['cleanText'].apply(lambda x: np.str_(x)))

# transform the training and testing data using the count vectorizer object
trainXCount = countVec.transform(trainX.apply(lambda x: np.str_(x)))
testXCount = countVec.transform(testX.apply(lambda x: np.str_(x)))

# to see the features: countVec.get_feature_names() (get_feature_names_out() in scikit-learn >= 1.0)
# print(testXCount.shape, trainXCount.shape, trainY.shape)

# saving the count vector vocabulary to a pickle file
with open("countVector.pkl", "wb") as f:
    pickle.dump(countVec.vocabulary_, f)
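
Since only the vocabulary dictionary is pickled, it has to be handed back to a new `CountVectorizer` through its `vocabulary` argument before it can transform new text. The sketch below shows that round trip on toy documents, using an in-memory pickle as a stand-in for `countVector.pkl`.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["dollar gains on speech", "fuel prices hit profits"]
vec = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")
vec.fit(train_docs)

# In-memory stand-in for writing and reading countVector.pkl
blob = pickle.dumps(vec.vocabulary_)
restored = CountVectorizer(
    analyzer="word",
    token_pattern=r"\w{1,}",
    vocabulary=pickle.loads(blob),  # fixed vocabulary, no fit needed
)

new_doc = ["dollar prices fall"]
original_counts = vec.transform(new_doc).toarray()
restored_counts = restored.transform(new_doc).toarray()
same = bool((original_counts == restored_counts).all())
print(same)
```

The restored vectorizer must be created with the same `analyzer` and `token_pattern` as the original, because those settings are not part of the pickled vocabulary.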

#### Term Frequency-Inverse Document Frequency (TF-IDF)

In [21]:
# word level
tfIDF = TfidfVectorizer(analyzer="word", token_pattern=r'\w{1,}', max_features=5000)
tfIDF.fit(df['cleanText'].apply(lambda x: np.str_(x)))
trainX_TfIDF = tfIDF.transform(trainX.apply(lambda x: np.str_(x)))
testX_TfIDF = tfIDF.transform(testX.apply(lambda x: np.str_(x)))
with open("tfIDFWord.pkl", "wb") as f:
    pickle.dump(tfIDF, f)

# n-gram level (bigrams and trigrams)
tfIDF = TfidfVectorizer(token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
# fit once on the training split, then reuse the same vocabulary for the test split;
# calling fit_transform on the test split would build a different vocabulary, so the
# test columns would no longer line up with the features the model was trained on
trainX_TfIDFNgram = tfIDF.fit_transform(trainX.apply(lambda x: np.str_(x)))
testX_TfIDFNgram = tfIDF.transform(testX.apply(lambda x: np.str_(x)))
with open("tfIDFNGram.pkl", "wb") as f:
    pickle.dump(tfIDF, f)

In [22]:
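# count vector level
tfIDF = TfidfTransformer()
trainX_TfCount = tfIDF.fit_transform(trainXCount)
# reuse the IDF weights learned on the training counts instead of refitting on the test counts
testX_TfIDFCount = tfIDF.transform(testXCount)
with open("tfIDFCount.pkl", "wb") as f:
    pickle.dump(tfIDF, f)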
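
A vectorizer should learn its vocabulary and IDF weights from the training split only and then be applied unchanged to the test split. The sketch below shows that pattern on toy sentences, first by hand and then with a scikit-learn `Pipeline`, which enforces it automatically. The documents and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "shares rise as profits grow",
    "bank reports quarterly profits",
    "team wins the cup final",
    "striker scores in the final match",
]
train_labels = ["business", "business", "sport", "sport"]
test_docs = ["profits rise at the bank", "cup match ends in a draw"]

# By hand: fit_transform on train, transform (not fit_transform) on test.
vec = TfidfVectorizer(ngram_range=(1, 2))
train_matrix = vec.fit_transform(train_docs)
test_matrix = vec.transform(test_docs)
print(train_matrix.shape[1] == test_matrix.shape[1])  # columns line up

# With a pipeline: fit() fits the vectorizer and the classifier on train only,
# predict() applies the already-fitted vectorizer to the new text.
pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
pipeline.fit(train_docs, train_labels)
predicted = pipeline.predict(test_docs)
print(list(predicted))
```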

In [28]:
def modelTraining(model, trainX, trainY, testX):
    """
        model  = unfitted scikit-learn classifier
        trainX = vectorized training data
        trainY = labels of the vectorized training data
        testX  = vectorized test data
        The test labels are read from the notebook-level variable testY.
    """
    classifier = model.fit(trainX, trainY)
#     pickle.dump(model, open("modelMNB_VecCount_TfIDF.pkl", "wb"))
    # note: every model passed in is written to this same file name
    with open("modelSVM_TfIDF.pkl", "wb") as f:
        pickle.dump(model, f)
    predictions = classifier.predict(testX)
    accuracy = metrics.accuracy_score(testY, predictions) * 100
    print(metrics.confusion_matrix(testY, predictions))
    print(accuracy)
    # scikit-learn orders the report rows by sorted label; leaving out target_names
    # lets it print the actual label next to each row instead of a hand-written list
    print(metrics.classification_report(testY, predictions))
    return accuracy
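
A point worth checking in any evaluation helper like this: `confusion_matrix` and `classification_report` order their rows by sorted label, so any list of display names has to follow that same order, or the rows end up attached to the wrong class. The sketch below uses made-up predictions to show one way of keeping labels and names aligned.

```python
from sklearn import metrics

y_true = ["tech", "business", "sport", "business", "tech"]
y_pred = ["tech", "business", "sport", "tech", "tech"]

# Derive the label order once and pass it everywhere.
labels = sorted(set(y_true))  # ['business', 'sport', 'tech']

matrix = metrics.confusion_matrix(y_true, y_pred, labels=labels)
report = metrics.classification_report(
    y_true, y_pred, labels=labels, target_names=labels
)

print(labels)
print(matrix)   # rows = true class, columns = predicted class, both in `labels` order
print(report)
```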
    

In [29]:
# Naive Bayes
# 1- vectorCount => 49%
# accuracy = modelTraining(MultinomialNB(), trainXCount, trainY, testXCount) 
# print("Naive Bayes accuracy -> vectorCount = ", accuracy)
# 2- tf-IDF Word level => 83.44
# accuracy = modelTraining(MultinomialNB(), trainX_TfIDF, trainY, testX_TfIDF)
# print("Naive Bayes accuracy-> tf-IDF Word level = ", accuracy)
# # 3- tf-IDF n-gram uni and bigram level => 86.08, bi and trigram => 82.67
# accuracy = modelTraining(MultinomialNB(), trainX_TfIDFNgram, trainY, testX_TfIDFNgram)
# print("Naive Bayes accuracy-> tf-IDF n-gram = ", accuracy)
# # 4- tf-IDF Vector Count transformer level => 87.05
accuracy = modelTraining(MultinomialNB(), trainX_TfCount, trainY, testX_TfIDFCount)
print("Naive Bayes accuracy-> tf-IDF Vector Count transformer level = ", accuracy)

[[146   0   0   0   0   2   0   1]
 [  5 120   0   0   0  11   0   0]
 [  8   0   6   0   0   1   0   1]
 [  1   0   0   0   0   2   0  12]
 [  2   0   0   0   0   5   0   0]
 [  2   0   0   0   0 123   0   0]
 [  3   0   0   0   0  14  49   0]
 [  0   0   0   0   0   6   0  98]]
87.70226537216828
               precision    recall  f1-score   support

     business       0.87      0.98      0.92       149
entertainment       1.00      0.88      0.94       136
         food       1.00      0.38      0.55        16
     graphics       0.00      0.00      0.00        15
   historical       0.00      0.00      0.00         7
     politics       0.75      0.98      0.85       125
        sport       1.00      0.74      0.85        66
         tech       0.88      0.94      0.91       104

     accuracy                           0.88       618
    macro avg       0.69      0.61      0.63       618
 weighted avg       0.86      0.88      0.86       618

Naive Bayes accuracy-> tf-IDF Word lev
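
Rows with 0.00 precision in the report come from classes the model never predicted. In that case precision is undefined (0 divided by 0) and scikit-learn substitutes 0 and emits a warning. Assuming scikit-learn 0.22 or later, the `zero_division` argument makes that choice explicit and silences the warning. The labels below are toy values.

```python
from sklearn import metrics

y_true = ["business", "business", "historical", "tech"]
y_pred = ["business", "business", "business", "tech"]  # "historical" is never predicted

report = metrics.classification_report(
    y_true, y_pred,
    zero_division=0,    # report 0 for undefined precision, without a warning
    output_dict=True,   # return a nested dict instead of a formatted string
)

print(report["historical"]["precision"])  # 0.0
print(report["historical"]["recall"])     # 0.0
```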



In [30]:
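# SVM on the tf-IDF count transformer features => 95.31 in the run recorded below
# A linear SVM trained with SGD was set up here as an alternative but never used:
# from sklearn.linear_model import SGDClassifier
# SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42)
accuracy = modelTraining(svm.SVC(), trainX_TfCount, trainY, testX_TfIDFCount)
print("SVM accuracy-> tf-IDF Vector Count transformer level = ", accuracy)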

[[146   0   0   0   0   3   0   0]
 [  2 132   0   0   0   1   0   1]
 [  4   0  12   0   0   0   0   0]
 [  0   0   0  13   0   0   0   2]
 [  7   0   0   0   0   0   0   0]
 [  5   1   0   0   0 119   0   0]
 [  2   0   0   0   0   0  64   0]
 [  0   0   0   0   0   1   0 103]]
95.30744336569579
               precision    recall  f1-score   support

     business       0.88      0.98      0.93       149
entertainment       0.99      0.97      0.98       136
         food       1.00      0.75      0.86        16
     graphics       1.00      0.87      0.93        15
   historical       0.00      0.00      0.00         7
     politics       0.96      0.95      0.96       125
        sport       1.00      0.97      0.98        66
         tech       0.97      0.99      0.98       104

     accuracy                           0.95       618
    macro avg       0.85      0.81      0.83       618
 weighted avg       0.95      0.95      0.95       618

SVM accuracy-> tf-IDF n-gram =  95.307
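
The cell above builds an `SGDClassifier` with hinge loss, which is a linear SVM trained by stochastic gradient descent, but only `svm.SVC()` is actually trained. For comparison, here is a minimal sketch of how that linear variant could be used, with the same hyperparameters, on made-up sentences. It is an illustration of the API, not a result from this dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

docs = [
    "shares rise as profits grow",
    "bank reports quarterly profits",
    "team wins the cup final",
    "striker scores in the final match",
]
labels = ["business", "business", "sport", "sport"]

# loss='hinge' with an l2 penalty corresponds to a linear SVM objective
linear_svm = make_pipeline(
    TfidfVectorizer(),
    SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3, random_state=42),
)
linear_svm.fit(docs, labels)

predicted = linear_svm.predict(["bank profits rise", "cup final tonight"])
print(list(predicted))
```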

