## Text Classification

For reference, I followed these resources:

https://www.kaggle.com/sudhirnl7/logistic-regression-with-stratifiedkfold
https://medium.com/datadriveninvestor/k-fold-and-other-cross-validation-techniques-6c03a2563f1e
https://realpython.com/python-keras-text-classification/#defining-a-baseline-model
https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/

Here, I tested four basic feature extractors:

1) Count Vector
2) Word TF-IDF
3) N-gram TF-IDF
4) Character TF-IDF

For training, I tested two classifiers:

1) Logistic Regression
2) Support Vector Machine with Linear Kernel

I have attached a PDF with performance metrics for the above features and classifiers.

In [47]:
## import important modules
import numpy as np
import pandas as pd
from sklearn.model_selection import  train_test_split, StratifiedKFold
from sklearn import svm
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

In [48]:
# Load the data sets
train_data = pd.read_csv(r'train_set.csv', encoding='ISO-8859-1')
test_data = pd.read_csv(r'test_set.csv', encoding='ISO-8859-1')

# Only the training data is used for training; split it into labels and texts
labels = train_data['label']
texts = train_data['text']

In [49]:
# Split the data into training and validation sets.
# A proper split makes the performance estimates (prediction metrics) more stable.
# Stratified k-fold cross-validation keeps the class proportions of every fold
# close to those of the full data set, which gives balanced splits.
# With n_splits=5, the train-to-validation ratio in each fold is 4:1.
kf = StratifiedKFold(n_splits=5,shuffle=True,random_state=42)

xtr =[]   # training texts, one entry per fold (5 in total)
ytr =[]   # training labels, one entry per fold
xvl =[]   # validation texts, one entry per fold
yvl =[]   # validation labels, one entry per fold
i=0       # number of folds processed

for train_index,test_index in kf.split(texts,labels):
    # kf.split yields positional indices, so select rows with .iloc
    xtr.append (texts.iloc[train_index])
    ytr.append (labels.iloc[train_index])
    xvl.append (texts.iloc[test_index])
    yvl.append (labels.iloc[test_index])
    i+=1
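
As a quick sanity check on what stratification does, the sketch below runs the same splitter on a small, made-up label array (80 samples of one class, 20 of the other; these toy arrays are assumptions for illustration, not the real data) and records the share of the minority class in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1
toy_labels = np.array([0] * 80 + [1] * 20)
toy_texts = np.array([f"doc {n}" for n in range(100)])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_ratios = []
for train_idx, val_idx in skf.split(toy_texts, toy_labels):
    # Share of class 1 in this validation fold
    fold_ratios.append(toy_labels[val_idx].mean())
    print(len(train_idx), len(val_idx), fold_ratios[-1])
```

Every validation fold should hold about 20% of class 1, matching the full set, which is why the per-fold scores are comparable with each other.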
    

In [70]:
# feature extractors

# count vector features
def count_vector (xtr,xvl):
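    # min_df=1 (the default) keeps every term; recent scikit-learn releases reject an integer min_df of 0
    vectorizer = CountVectorizer(min_df=1, lowercase=False)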
    vectorizer.fit(xtr)
    cnt_vector_xtr=vectorizer.transform(xtr)
    cnt_vector_xvl=vectorizer.transform(xvl)
    
    return vectorizer,cnt_vector_xtr,cnt_vector_xvl

# word level tf-idf
def word_tfidf( xtr,xvl):

    tfidf_vect = TfidfVectorizer()
    tfidf_vect.fit(xtr)
    xtrain_tfidf =  tfidf_vect.transform(xtr)
    xvalid_tfidf =  tfidf_vect.transform(xvl)
    
    return tfidf_vect, xtrain_tfidf,xvalid_tfidf

# ngram level tf-idf 
def ngram_tfidf(xtr,xvl):
    
    tfidf_vect_ngram = TfidfVectorizer(ngram_range=(2,3))
    tfidf_vect_ngram.fit(xtr)
    xtrain_tfidf_ngram =  tfidf_vect_ngram.transform(xtr)
    xvalid_tfidf_ngram =  tfidf_vect_ngram.transform(xvl)
    
    return tfidf_vect_ngram, xtrain_tfidf_ngram,xvalid_tfidf_ngram

# characters level tf-idf
def character_tfidf(xtr,xvl):
    
    tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char',ngram_range=(2,3))
    tfidf_vect_ngram_chars.fit(xtr)
    xtrain_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(xtr) 
    xvalid_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(xvl) 
    
    return tfidf_vect_ngram_chars, xtrain_tfidf_ngram_chars, xvalid_tfidf_ngram_chars
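
To make the difference between these extractors concrete, here is a minimal sketch that fits three of the vectorizer configurations used above on a two-document toy corpus (the corpus is made up for illustration) and collects the vocabulary each one learns.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical two-document corpus
toy_docs = ["good movie", "bad movie"]

word_vec = CountVectorizer().fit(toy_docs)                                     # single words
ngram_vec = TfidfVectorizer(ngram_range=(2, 3)).fit(toy_docs)                  # word 2- and 3-grams
char_vec = TfidfVectorizer(analyzer='char', ngram_range=(2, 3)).fit(toy_docs)  # character 2- and 3-grams

# vocabulary_ maps each learned feature to its column index
word_vocab = sorted(word_vec.vocabulary_)
ngram_vocab = sorted(ngram_vec.vocabulary_)
char_vocab = sorted(char_vec.vocabulary_)

print(word_vocab)
print(ngram_vocab)
print(char_vocab)
```

The character-level vocabulary is the largest of the three and includes fragments that span word boundaries, which makes it more tolerant of misspellings at the cost of a wider feature matrix.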

In [77]:
      
# Model Trainer and performance estimator
def model_trainer(classifier, feature_vector_train, feature_vector_validation,train_y, validation_y):
    
    
    model = classifier.fit (feature_vector_train, train_y)

    # Predict once and reuse the predictions for both metrics
    predictions = model.predict(feature_vector_validation)
    acc_score = accuracy_score(validation_y, predictions)
    f1_scr= f1_score(validation_y, predictions, average='weighted')
    
    return model, acc_score,f1_scr
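
The trainer reports a weighted F1 score, which averages the per-class F1 values using each class's support as the weight. The small worked example below, on hand-made labels (an assumption for illustration), shows how that differs from the unweighted macro average.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: four samples of class 0 and two of class 1
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
per_class = f1_score(y_true, y_pred, average=None)       # one F1 value per class
weighted = f1_score(y_true, y_pred, average='weighted')  # weighted by class support
macro = f1_score(y_true, y_pred, average='macro')        # plain mean over classes

print(acc, per_class, weighted, macro)
```

Here class 0 has F1 = 0.75 and class 1 has F1 = 0.5, so the weighted score is (4 × 0.75 + 2 × 0.5) / 6 ≈ 0.667 while the macro score is 0.625. On imbalanced data the weighted score leans toward the majority class.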
        
        

In [78]:
# Train the chosen classifier on each fold with the chosen feature extractor and evaluate the performance metrics

classifier= svm.SVC(kernel='linear') # choose the desired classifier; any scikit-learn estimator works, e.g. LogisticRegression()
cv_accuracy_score =[]
cv_f1_score=[]

# training and performance evaluation is performed with N-splits (from stratified cross validation)   
for index in range (i):
        
        # To use another feature extractor, change only the function called here,
        # e.g. word_tfidf(xtr[index],xvl[index]) for word-level TF-IDF.
        # vector is the fitted vectorizer; it is kept so the test set can be transformed later.
        vector,train_X_feature, test_X_feature= character_tfidf(xtr[index],xvl[index])

        model, accuracy, F1_Score = model_trainer(classifier,  train_X_feature, test_X_feature, ytr[index], yvl[index])
        cv_accuracy_score.append(accuracy)
        cv_f1_score.append(F1_Score)
        print ("For the ", index, "Fold, Accuracy: ", accuracy)
        print ("For the ", index, "Fold, F1_score: ", F1_Score)

Mean_accuracy= np.mean(cv_accuracy_score)
Mean_f1_score= np.mean(cv_f1_score)
        
print ("Mean Accuracy : ", Mean_accuracy)
print ("Mean F1 score : ", Mean_f1_score)

For the  0 Fold, Accuracy:  0.9382402707275804
For the  0 Fold, F1_score:  0.93842726084675
For the  1 Fold, Accuracy:  0.9398814563928873
For the  1 Fold, F1_score:  0.9397518251834864
For the  2 Fold, Accuracy:  0.9358594411515665
For the  2 Fold, F1_score:  0.9358490029502204
For the  3 Fold, Accuracy:  0.9377250582503707
For the  3 Fold, F1_score:  0.9377846368033108
For the  4 Fold, Accuracy:  0.9446799491309877
For the  4 Fold, F1_score:  0.944733062185404
Mean Accuracy :  0.9392772351306785
Mean F1 score :  0.9393091575938344
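
For comparison, the same evaluation loop can be written more compactly with a scikit-learn `Pipeline` and `cross_validate`, which refits the vectorizer inside every fold automatically. This is only a sketch under stated assumptions: the ten toy sentences and their labels are made up so the block runs on its own, and it is not the code that produced the scores above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data: five positive and five negative snippets
toy_texts = ["great film", "loved it", "wonderful acting", "really good", "superb plot",
             "awful film", "hated it", "terrible acting", "really bad", "boring plot"]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Character-level TF-IDF followed by a linear-kernel SVM, as in the run above
pipeline = make_pipeline(
    TfidfVectorizer(analyzer='char', ngram_range=(2, 3)),
    SVC(kernel='linear'),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(pipeline, toy_texts, toy_labels, cv=cv,
                        scoring=['accuracy', 'f1_weighted'])

print("Mean Accuracy : ", scores['test_accuracy'].mean())
print("Mean F1 score : ", scores['test_f1_weighted'].mean())
```

Swapping the feature extractor or the classifier then means changing one argument of `make_pipeline`.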


In [63]:
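# Predict labels for test_set.csv
# Note: vector and model come from the last cross-validation fold, so they were fitted on 4/5 of the training data

test_texts= test_data['text']
test_features = vector.transform(test_texts) # reuse the vectorizer fitted during training
output_class= model.predict(test_features)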
print (output_class)
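
One possible refinement, once cross-validation has picked a feature extractor and classifier, is to refit that combination on the full training set before predicting the test set. The sketch below assumes the same `text` and `label` column names as above but uses small made-up data frames in place of `train_set.csv` and `test_set.csv`; the `predicted_label` column name is also an assumption.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical stand-ins for train_set.csv and test_set.csv
toy_train = pd.DataFrame({
    'text': ["great film", "loved it", "really good",
             "awful film", "hated it", "really bad"],
    'label': [1, 1, 1, 0, 0, 0],
})
toy_test = pd.DataFrame({'text': ["good film", "bad plot"]})

# Refit the chosen vectorizer and classifier on all training rows
final_model = make_pipeline(
    TfidfVectorizer(analyzer='char', ngram_range=(2, 3)),
    SVC(kernel='linear'),
)
final_model.fit(toy_train['text'], toy_train['label'])

# Store the predictions next to the test texts
toy_test['predicted_label'] = final_model.predict(toy_test['text'])
print(toy_test)
```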