# Text Classification and Topic Modelling with Neural Networks for Academic Articles

## Table of Contents
1. [Text Classification](#1)
    - 1.1 [Logistic Regression](#1.1)
    - 1.2 [Recurrent Neural Network](#1.2)
    - 1.3 [Comparison and Evaluation](#1.3)
2. [Topic Modelling](#2)

In [3]:
#import libraries
import pandas as pd
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score,recall_score,f1_score,confusion_matrix
import torch.nn as nn
import torch 
from torchtext.legacy import data
from torchtext.legacy.data import TabularDataset
import torch.optim as optim
import time
import warnings
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve,auc
import logging
from nltk.tokenize import RegexpTokenizer
from gensim.models import Phrases
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from pprint import pprint
import pyLDAvis.gensim

In [5]:
#Load data
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
train.head(5)

|   | ID | URL | Date | Title | InfoTheory | CompVis | Math | Abstract |
|---|---|---|---|---|---|---|---|---|
| 0 | cs-9301111 | arxiv.org/abs/cs/9301111 | 1989-12-31 | Nested satisfiability | 0 | 0 | 0 | Nested satisfiability A special case of the s... |
| 1 | cs-9301112 | arxiv.org/abs/cs/9301112 | 1990-03-31 | A note on digitized angles | 0 | 0 | 0 | A note on digitized angles We study the confi... |
| 2 | cs-9301113 | arxiv.org/abs/cs/9301113 | 1991-07-31 | Textbook examples of recursion | 0 | 0 | 0 | Textbook examples of recursion We discuss pro... |
| 3 | cs-9301114 | arxiv.org/abs/cs/9301114 | 1991-10-31 | Theory and practice | 0 | 0 | 0 | Theory and practice The author argues to Sili... |
| 4 | cs-9301115 | arxiv.org/abs/cs/9301115 | 1991-11-30 | Context-free multilanguages | 0 | 0 | 0 | Context-free multilanguages This article is a... |


In [6]:
test.head(5)

|   | ID | URL | Date | Title | InfoTheory | CompVis | Math | Abstract |
|---|---|---|---|---|---|---|---|---|
| 0 | no-150100335 | arxiv.org/abs/1501.00335 | 2015-01-01 | A Data Transparency Framework for Mobile Appli... | 0 | 0 | 0 | A Data Transparency Framework for Mobile Appl... |
| 1 | no-14024178 | arxiv.org/abs/1402.4178 | 2015-01-01 | A reclaimer scheduling problem arising in coal... | 0 | 0 | 0 | A reclaimer scheduling problem arising in coa... |
| 2 | no-150100263 | arxiv.org/abs/1501.00263 | 2015-01-01 | Communication-Efficient Distributed Optimizati... | 0 | 0 | 1 | Communication-Efficient Distributed Optimizat... |
| 3 | no-150100287 | arxiv.org/abs/1501.00287 | 2015-01-01 | Consistent Classification Algorithms for Multi... | 0 | 0 | 0 | Consistent Classification Algorithms for Mult... |
| 4 | no-11070586 | arxiv.org/abs/1107.0586 | 2015-01-01 | Managing key multicasting through orthogonal s... | 0 | 0 | 0 | Managing key multicasting through orthogonal ... |


We only need the text and its class labels, so we retain only the relevant columns in `train` and `test`.

In [7]:
train=train[['ID','InfoTheory','CompVis','Math','Abstract']]
test=test[['ID','InfoTheory','CompVis','Math','Abstract']]
train.head(5)

|   | ID | InfoTheory | CompVis | Math | Abstract |
|---|---|---|---|---|---|
| 0 | cs-9301111 | 0 | 0 | 0 | Nested satisfiability A special case of the s... |
| 1 | cs-9301112 | 0 | 0 | 0 | A note on digitized angles We study the confi... |
| 2 | cs-9301113 | 0 | 0 | 0 | Textbook examples of recursion We discuss pro... |
| 3 | cs-9301114 | 0 | 0 | 0 | Theory and practice The author argues to Sili... |
| 4 | cs-9301115 | 0 | 0 | 0 | Context-free multilanguages This article is a... |


In [8]:
test.head(5)

|   | ID | InfoTheory | CompVis | Math | Abstract |
|---|---|---|---|---|---|
| 0 | no-150100335 | 0 | 0 | 0 | A Data Transparency Framework for Mobile Appl... |
| 1 | no-14024178 | 0 | 0 | 0 | A reclaimer scheduling problem arising in coa... |
| 2 | no-150100263 | 0 | 0 | 1 | Communication-Efficient Distributed Optimizat... |
| 3 | no-150100287 | 0 | 0 | 0 | Consistent Classification Algorithms for Mult... |
| 4 | no-11070586 | 0 | 0 | 0 | Managing key multicasting through orthogonal ... |


In [9]:
#Check the dimensions of the training and test data
print(train.shape)  #54731 rows, 5 columns
print(test.shape)   #19678 rows, 5 columns

(54731, 5)
(19678, 5)


Next, we check whether `train` and `test` suffer from class imbalance.

In [14]:
print("The proportion of Articles that belong to InfoTheory, in the training set, is: ",
      train.loc[train.InfoTheory==1].shape[0]/train.shape[0])
print("The proportion of Articles that belong to CompVis, in the training set, is: ",
      train.loc[train.CompVis==1].shape[0]/train.shape[0])
print("The proportion of Articles that belong to Math, in the training set, is: ",
      train.loc[train.Math==1].shape[0]/train.shape[0])
print('----------------------------------------------------------------------------------------------------')
print("The proportion of Articles that belong to InfoTheory, in the testing set, is: ",
      test.loc[test.InfoTheory==1].shape[0]/test.shape[0])
print("The proportion of Articles that belong to CompVis, in the testing set, is: ",
      test.loc[test.CompVis==1].shape[0]/test.shape[0])
print("The proportion of Articles that belong to Math, in the testing set, is: ",
      test.loc[test.Math==1].shape[0]/test.shape[0])

The proportion of Articles that belong to InfoTheory, in the training set, is:  0.1925417039703276
The proportion of Articles that belong to CompVis, in the training set, is:  0.04063510624691674
The proportion of Articles that belong to Math, in the training set, is:  0.30562204235259727
----------------------------------------------------------------------------------------------------
The proportion of Articles that belong to InfoTheory, in the testing set, is:  0.1837585120439069
The proportion of Articles that belong to CompVis, in the testing set, is:  0.1093607073889623
The proportion of Articles that belong to Math, in the testing set, is:  0.3013517633905885
The proportion of Articles that belongs to Compvis, in the testing set, is:  0.1093607073889623
The proportion of Articles that belongs to Math, in the testing set, is:  0.3013517633905885


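All three response variables are imbalanced in both the training and testing sets: the positive class is the minority in every case. The imbalance is most severe for `CompVis` (about 4% of training articles and 11% of testing articles), followed by `InfoTheory` (about 19% and 18%) and `Math` (about 31% and 30%). The `CompVis` rate also differs noticeably between the two sets.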
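
The proportions printed above can also be obtained in one step, because the mean of a 0/1 column is its positive rate. The following is a minimal sketch, assuming a small made-up DataFrame (`toy`) in place of `train`; `label_prevalence` is a hypothetical helper name that does not appear in the notebook.

```python
import pandas as pd

LABELS = ["InfoTheory", "CompVis", "Math"]

def label_prevalence(df, labels=LABELS):
    """Return the fraction of rows equal to 1 for each binary label column."""
    return df[labels].mean()

# Toy stand-in for `train` (made-up values, for illustration only)
toy = pd.DataFrame({
    "InfoTheory": [1, 0, 0, 0],
    "CompVis":    [0, 0, 0, 1],
    "Math":       [1, 1, 0, 0],
})

prevalence = label_prevalence(toy)
print(prevalence)
```

Calling `label_prevalence(train)` and `label_prevalence(test)` should give the same six numbers as the print statements, under the assumption that the label columns contain only 0 and 1.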

## 1. Text Classification <a class="anchor" id="1"></a>

In this part of the project, I compare two types of text classification models: a statistical classifier (`Logistic Regression`) and a deep learning approach (a `Recurrent Neural Network`).

### Text Preprocessing

Both models share the same text preprocessing stage:
- Use `TfidfVectorizer` with:
    - `lowercase=True`
    - Unigrams and bigrams (`ngram_range=(1,2)`)
    - A minimum document frequency of 2 (`min_df=2`)
    - NLTK's `word_tokenize` for tokenization and the WordNet lemmatizer for lemmatization
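
To make the effect of these settings concrete, here is a small sketch on a made-up three-document corpus. The documents and variable names are illustrative assumptions, and scikit-learn's default tokenizer is used instead of the NLTK one so that the example needs no downloads.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, for illustration only
docs = [
    "Neural network model",
    "Neural network training",
    "Graph model",
]

toy_vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 2), min_df=2)
toy_matrix = toy_vectorizer.fit_transform(docs)

# Only unigrams/bigrams that occur in at least 2 documents survive min_df=2
kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
print(toy_matrix.shape)
```

Terms such as `graph` or `network training` occur in a single document and are dropped, which is how `min_df=2` limits the growth in vocabulary size that bigrams would otherwise cause.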

In [26]:
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to C:\Users\Daniel
[nltk_data]     Tran\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Daniel
[nltk_data]     Tran\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [27]:
#Define tokenizer and lemmatizer
class LemmaTokenizerWordNet(object):
    def __init__(self):
        self.wnl=WordNetLemmatizer()
    def __call__(self,doc):
        #Tokenize the document, then lemmatize each token (stopwords are not removed here)
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
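
The tokenizer above does not filter stopwords. If stopword removal is wanted, one possible variant is sketched below. It is an assumption-laden illustration, not the notebook's method: `StopwordFilteringTokenizer` is a hypothetical class, it uses a simple regular expression instead of `word_tokenize`, and it uses scikit-learn's built-in `ENGLISH_STOP_WORDS` instead of the NLTK stopword list so that it runs without downloads. A lemmatizing function such as `WordNetLemmatizer().lemmatize` can be passed in as `lemmatize`.

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

class StopwordFilteringTokenizer:
    """Hypothetical tokenizer: lowercase, split on non-alphanumerics, drop stopwords."""

    def __init__(self, lemmatize=None, stop_words=ENGLISH_STOP_WORDS):
        # Identity function by default; pass a real lemmatizer if available
        self.lemmatize = lemmatize or (lambda token: token)
        self.stop_words = stop_words

    def __call__(self, doc):
        tokens = re.findall(r"[a-z0-9]+", doc.lower())
        return [self.lemmatize(t) for t in tokens if t not in self.stop_words]

tokens = StopwordFilteringTokenizer()("We study the configurations of pixels")
print(tokens)
```

An instance of this class could be passed to `TfidfVectorizer` through its `tokenizer` argument in the same way as `LemmaTokenizerWordNet()`.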

In [28]:
#TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(analyzer='word',input='content',
                                lowercase=True,
                                min_df=2, 
                                ngram_range=(1,2),
                                tokenizer=LemmaTokenizerWordNet())

In [29]:
#Vectorize the text data to get the features
x_train=tfidf_vectorizer.fit_transform(train.Abstract.tolist())

#Number of features (vocabulary size)
len(tfidf_vectorizer.vocabulary_)

504439

In [30]:
#Get all training data labels as arrays
y_train_IT=np.asarray(train.InfoTheory.tolist())
y_train_CV=np.asarray(train.CompVis.tolist())
y_train_MA=np.asarray(train.Math.tolist())

### 1.1 Logistic Regression <a class="anchor" id="1.1"></a>

#### Model for InfoTheory

In [35]:
warnings.filterwarnings("ignore", category=DeprecationWarning) 

IT_model=LogisticRegression().fit(x_train,y_train_IT)

#Model performance on training set
IT_predict_train=IT_model.predict(x_train)
print(confusion_matrix(y_train_IT,IT_predict_train))

[[43839   354]
 [ 1656  8882]]
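
The confusion matrix can be summarised as precision, recall and F1 for the positive class. The sketch below works directly from the numbers printed above; `binary_metrics` is a hypothetical helper, and it assumes scikit-learn's layout in which rows are true labels and columns are predicted labels.

```python
import numpy as np

def binary_metrics(cm):
    """Precision, recall and F1 for the positive class of a 2x2 confusion matrix.

    Assumes the scikit-learn layout [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = np.asarray(cm)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Training confusion matrix printed above for the InfoTheory model
cm_it = [[43839, 354], [1656, 8882]]
precision, recall, f1 = binary_metrics(cm_it)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

On the training set the InfoTheory model reaches a precision of about 0.96 and a recall of about 0.84, so most of its errors are missed positives. These are training-set figures and say nothing yet about performance on `test`.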


#### Model for CompVis

In [36]:
CV_model=LogisticRegression().fit(x_train,y_train_CV)

#Model performance on training set
CV_predict_train=CV_model.predict(x_train)
print(confusion_matrix(y_train_CV,CV_predict_train)) 

[[52412    95]
 [  876  1348]]


#### Model for Math

In [37]:
MA_model=LogisticRegression(max_iter=200).fit(x_train,y_train_MA)

#Model performance on training set

MA_predict_train=MA_model.predict(x_train)
print(confusion_matrix(y_train_MA,MA_predict_train)) 

[[37005   999]
 [ 3212 13515]]
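
All three confusion matrices above are computed on the training data. To evaluate on held-out data, the fitted vectorizer has to be applied with `transform` (not `fit_transform`) so that the test features use the training vocabulary. The following is a minimal sketch on made-up toy texts; every text, label and variable name in it is an illustrative assumption, and in the notebook the same pattern would apply to `tfidf_vectorizer`, `test.Abstract` and the fitted models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up toy data, for illustration only
train_texts = [
    "channel coding capacity bound",
    "image segmentation with convolution",
    "entropy and channel capacity",
    "object detection in image data",
]
train_labels = [1, 0, 1, 0]
test_texts = ["capacity of a noisy channel", "image classification benchmark"]
test_labels = [1, 0]

vectorizer = TfidfVectorizer()
x_tr = vectorizer.fit_transform(train_texts)  # learn vocabulary and IDF on train only
x_te = vectorizer.transform(test_texts)       # reuse them on the held-out texts

model = LogisticRegression().fit(x_tr, train_labels)
pred = model.predict(x_te)

# zero_division=0 avoids warnings when a class is never predicted
scores = {
    "precision": precision_score(test_labels, pred, zero_division=0),
    "recall": recall_score(test_labels, pred, zero_division=0),
    "f1": f1_score(test_labels, pred, zero_division=0),
}
print(scores)
```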


### 1.2 Recurrent Neural Network <a class="anchor" id="1.2"></a>

### 1.3 Comparison and Evaluation <a class="anchor" id="1.3"></a>

## 2. Topic Modelling <a class="anchor" id="2"></a>