## Multi-Class Classification Problem

## Overview and Abstract

- The aim of this notebook is to predict from the text of the document what industry the document is related to.

- This notebook includes a multinomial Naive Bayes machine learning model, a dense deep neural network, a Convolutional Neural Network (CNN), a Long Short-Term Memory (LSTM) neural network and a Gated Recurrent Unit (GRU) neural network to deal with the multi-class classification task.

- The Convolutional Neural Network outperforms the other neural networks with a macro F1 score of 0.94.

## Method
- **Data processing:** The train and test datasets contain 13 and 11 attributes, respectively. The ‘label’ column is the target variable; it is based on the category column and exists only in the training dataset. The label column holds 176 distinct values, because a contract can belong to different combinations of the 9 industries. We want our predictions to be based on the 9 single-industry labels and not on the combinations. A contract that belongs to a single industry has exactly one '1' in its label, so we dropped every row whose label contains more than one '1'. This removed almost 8K of the roughly 98K rows in the given dataset (a loss of about 8%), which we consider acceptable. To complete the data processing, we concatenated the two datasets into one. We dropped the attribute value because of its number of null values, and removed the attribute category because it is not present in the test set. The dataset has both categorical and text features. The categorical attributes were encoded using Label Encoding. For the text features, we wrote cleaning functions with the help of the Natural Language Toolkit (NLTK) and spaCy. The text features are the title, description and awarding authority columns; the title is in English, whereas the other two columns are in German. In the first preprocessing step, the convert_string function casts each value to a string and converts capital letters to lower case. In the second step, a further function removes punctuation, digits and some additional punctuation marks that do not occur in English.
To remove stopwords, which carry little information for the downstream models, we used the remove_stopwards function, with a separate function for each language. With the help of the SnowballStemmer we then reduced inflected (or sometimes derived) words to their word stem, base or root form (e.g., ‘walks’ and ‘walking’ become ‘walk’), again with a separate function for each language. Because German text contains symbols, punctuation and letters that do not occur in English, we also created the clean_text function, which removes everything unnecessary from the two German attributes. Finally, in order to feed and build the models, we created a new column ('text') in the preprocessed dataset that joins the three text attributes into one.
 - Data processing code can be found in the following notebook:
https://colab.research.google.com/drive/15ZSh7QNpZBmAUUBeFDw8BOYjyxsOXpaX?usp=sharing
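
The filtering and basic cleaning steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `normalise_text` and `is_single_industry` and the three sample rows are hypothetical, and the notebook's own functions (which also handle stopwords, stemming and German-specific characters) are not reproduced here.

```python
import string

# Hypothetical helper names; the notebook's own functions are not shown here.
_DROP = str.maketrans({ch: " " for ch in string.punctuation + string.digits})

def normalise_text(value):
    """Cast to str, lower-case, and replace ASCII punctuation and digits with spaces."""
    text = str(value).lower().translate(_DROP)
    return " ".join(text.split())  # collapse repeated whitespace

def is_single_industry(label):
    """True when a 9-character 0/1 label marks exactly one industry."""
    return label.count("1") == 1

# Made-up rows: (title, label)
rows = [
    ("Germany-Dresden: Engineering design services 2020", "000001000"),
    ("Cleaning services (3 lots)", "000100000"),
    ("Mixed works and supplies", "000101000"),  # two industries, so it is dropped
]
kept = [(normalise_text(title), label) for title, label in rows if is_single_industry(label)]
print(kept)
```

Only the first two rows survive the filter, because the third label contains two '1' digits.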

- **Multinomial Naive Bayes Baseline Machine Learning Model:** The analysis of categorical data, specifically text data, is one of the most common applications of machine learning. Common machine learning classifiers for this kind of classification dataset include Logistic Regression, Random Forest and Multinomial Naïve Bayes. In our first attempt we used Logistic Regression, but the Colab session crashed because it ran out of RAM. In our second attempt, the Random Forest algorithm took too long to execute, since the dataset has almost 122K rows. We therefore chose Multinomial Naïve Bayes as the baseline machine learning model, as it was the most practical of the three on the provided dataset. The algorithm is well known for multi-class prediction, since it can estimate the probability of each class of the target variable. Naïve Bayes classifiers are commonly used in text classification, for example in spam filtering and sentiment analysis, where they often perform well. Multinomial Naïve Bayes is a classification method based on Bayes' Theorem and the assumption that the predictors are independent given the class. The most important reason for using it is that it is fast and makes it easy to predict the class of a test dataset, including for multi-class datasets. When the independence assumption holds, a Naïve Bayes classifier can outperform models such as logistic regression and needs less training data. Finally, it suits our dataset because it tends to work better with categorical input variables than with numerical ones.
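
To make the Bayes' Theorem and independence assumptions concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with Laplace smoothing. It is not the scikit-learn `MultinomialNB` used in the notebook; the function names, the toy documents and the class names "cleaning" and "engineering" are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token lists."""
    vocab = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(token_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_lik": {t: math.log((token_counts[label][t] + alpha) / denom) for t in vocab},
        }
    return model

def predict(model, doc):
    """Return the class with the highest posterior log score."""
    def score(label):
        params = model[label]
        # Independence assumption: per-token log likelihoods are simply summed.
        # Tokens outside the training vocabulary are ignored.
        return params["log_prior"] + sum(
            params["log_lik"][t] for t in doc if t in params["log_lik"]
        )
    return max(model, key=score)

# Made-up, already stemmed toy documents
docs = [["clean", "servic"], ["clean", "glass"], ["engin", "design"], ["design", "traffic"]]
labels = ["cleaning", "cleaning", "engineering", "engineering"]
model = train_multinomial_nb(docs, labels)
print(predict(model, ["clean", "design", "clean"]))
print(predict(model, ["traffic"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.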

- **Dense Deep Neural Network:**
For the dense deep neural network, we use the Keras ‘Tokenizer’ to vectorize our text into lists of integers. Because of the large number of unique tokens, we keep only the 50K most common words. To deal with the different number of words in each text sequence, we use padding, which pads each sequence with zeros to a common length. Three hidden layers use the ReLU activation function, and batch normalisation is applied to reduce training time and improve the performance of the neural network. We then use dropout at a rate of 25%, which randomly deactivates 25% of the neurons of the previous layer during training. The 9 output layer neurons correspond to the classes we want to predict. The softmax function is used in the output layer to assign a probability to each class, and an L2 kernel regularizer is applied to limit overfitting. We use the widely used ‘sparse_categorical_crossentropy’ loss function and the ‘Adam’ optimiser. We re-ran the model after oversampling via SMOTE and, because the performance improved, we kept the oversampling technique.
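
The vectorisation and padding steps can be illustrated without Keras. The sketch below is a simplified stand-in, not the Keras `Tokenizer` or `pad_sequences` implementation: the function names and the three sample texts are hypothetical, and details such as text filtering or the exact meaning of the vocabulary limit may differ from the library.

```python
from collections import Counter

def fit_word_index(texts, num_words):
    """Map the most frequent words to integer ids starting at 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(num_words)]
    return {word: idx for idx, word in enumerate(most_common, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its id; words outside the vocabulary are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_left(sequences, maxlen=None):
    """Left-pad with zeros (keeping the last maxlen ids) so all rows share one length."""
    maxlen = maxlen or max(len(seq) for seq in sequences)
    return [[0] * (maxlen - len(seq[-maxlen:])) + seq[-maxlen:] for seq in sequences]

texts = ["clean servic unterhalt", "clean servic", "clean board"]
word_index = fit_word_index(texts, num_words=2)   # keep only the 2 most common words
sequences = texts_to_sequences(texts, word_index)
padded = pad_left(sequences)
print(word_index)
print(padded)
```

Rare words such as "unterhalt" and "board" fall outside the vocabulary limit and are dropped, which is why the third row needs a leading zero.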

- **Additional Neural Networks:**
  - **KIM Convolutional Neural Network:**
A Convolutional Neural Network, also known as CNN or ConvNet, is a class of neural networks that specializes in processing image data. However, CNNs have also performed well in a variety of text classification tasks. To feed the text data into the model, we have to represent each text as an array of vectors (each word is mapped to a specific vector in a vector space built from the entire vocabulary). Using the tokenizer, we vectorize our text into lists of integers. Because of the large number of unique tokens, we keep only the 10K most common words. To deal with the different number of words in each text sequence, we use padding, which pads each sequence with zeros to a common length. The neural network architecture includes the following layers:
    * Embedding layer, which provides a dense representation of the tokens and their relative meanings. We specify the 3 arguments needed for this layer: the size of the vocabulary in the training dataset, the size of the vector space in which words will be embedded and the length of the input sequences. For the vector dimension, after experimenting with several values, we keep 20 dimensions.
    * One-dimensional convolutional layer, which finds patterns in the sentences by applying filters and generating feature maps. It is composed of 32 filters of size 3 with ReLU as the activation function. Because ReLU is used, we select the ‘He initializer’.
    * Max pooling layer, which subsamples the most important features from the convolutional layer with a pool size of 3. The stride is left at its default, which equals the pool size, so the sequence length shrinks by a factor of about 3. This reduces the computational load and helps to limit the risk of overfitting.
    * Dropout layer, which is used to improve the generalization of the model. The dropout rate is 20%, meaning that 20% of the neurons of the previous layer are randomly deactivated during training.
    * A second one-dimensional convolutional layer with the same settings but 64 filters, in order to capture more features in the dataset, followed by the same max pooling and dropout layers.
    * A final one-dimensional convolutional layer with 128 filters, followed by a global max pooling layer and a dropout layer.
    * Output layer (dense) with 9 neurons, one per class, and the softmax function to assign a probability to each class. We use an L2 kernel regularizer with a regularization factor of 0.01.

    For the loss function, we use sparse categorical crossentropy, which is widely used for multi-class classification problems, together with the Adam optimizer. Finally, we re-ran the model after oversampling via SMOTE and, since the performance improved, we kept the oversampling technique.
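
The layer sizes of this architecture can be checked with a little arithmetic. The sketch below assumes 'valid' padding, a convolution stride of 1 and a pooling stride equal to the pool size; the helper names are hypothetical. The input length 454, the embedding dimension 20 and the token count 109021 are taken from the notebook output in the code section.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a 'valid' 1D convolution with stride 1."""
    return length - kernel_size + 1

def maxpool1d_out_len(length, pool_size, stride=None):
    """Output length of 'valid' 1D max pooling; the stride defaults to the pool size."""
    stride = pool_size if stride is None else stride
    return (length - pool_size) // stride + 1

def conv1d_params(in_channels, filters, kernel_size):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

T, D = 454, 20            # padded sequence length and embedding dimension
vocab_rows = 109021 + 1   # number of tokens plus the padding index

lengths = [T]
for layer in ("conv", "pool", "conv", "pool", "conv"):
    prev = lengths[-1]
    lengths.append(conv1d_out_len(prev, 3) if layer == "conv" else maxpool1d_out_len(prev, 3))

print(lengths)                    # [454, 452, 150, 148, 49, 47]
print(vocab_rows * D)             # 2180440 embedding parameters
print(conv1d_params(D, 32, 3))    # 1952
print(conv1d_params(32, 64, 3))   # 6208
```

These values agree with the model summary printed in the code section (452, 150, 148 and 49 time steps; 2180440, 1952 and 6208 parameters). With a pooling stride of 1, the first pooling layer would instead output 450 time steps.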

  - **Long Short-Term Memory Neural Network:**
Our second model is a simple Long Short-Term Memory (LSTM) neural network. To vectorize the text, we convert each text into a sequence of integers. First, we limit the vocabulary to the 10000 most frequent words, and we set the maximum number of words in each sequence to 250. Then we use Tokenizer() to create unique tokens (105650 are found in our case). An LSTM model needs input sequences of the same length, so we truncate and pad the input sequences and obtain the shape of the data tensor. We then convert the categorical labels into numbers to get a label tensor. Finally, we split the data into train and test sets and create the neural network layers.
    * The embedding layer is the first layer and represents each word as a vector of length 100. It is followed by a spatial 1D dropout layer, which randomly drops entire embedding channels during training in order to limit overfitting. We use a 20% dropout rate, which is a decent compromise between preventing overfitting and retaining the accuracy of the model.
    * The second layer is the LSTM, which has 100 memory units. Its dropout rate is 20%, meaning that 20% of the units feeding into the layer are randomly dropped during training. The recurrent dropout rate is also 20% and drops connections between the recurrent units.
    * The output layer needs to produce 9 output values, since we have 9 classes. We use softmax as the activation function so that each class is assigned a decimal probability and the probabilities add up to 1, which pairs naturally with the cross-entropy loss.
    * We use Adam as the optimization algorithm for the network because it copes well with noisy problems. Kernel_regularizer applies the default L2=0.01 regularization penalty to each layer’s kernel; the penalties are added to the loss function that the network optimizes. Categorical_crossentropy is used as the loss function because we have more than two classes. The model is trained for 5 epochs with a batch size (the number of samples fed to the model at each step) of 128. Finally, the ‘Early Stopping’ callback monitors training and stops it once the performance of the model stops improving.
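
As a small illustration of the output layer and loss, the sketch below implements softmax and both cross-entropy variants in plain Python. The nine logits are made-up values and the function names only mirror the Keras loss names; this is not the Keras implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Loss for a one-hot encoded target."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

def sparse_categorical_crossentropy(class_index, probs):
    """Same loss when the target is given as an integer class index."""
    return -math.log(probs[class_index])

logits = [0.2, -1.0, 0.5, 3.0, 0.0, 0.1, -0.5, -2.0, 0.3]  # made-up scores for 9 classes
probs = softmax(logits)
target = 3
one_hot = [1 if k == target else 0 for k in range(9)]

print(round(sum(probs), 6))
print(categorical_crossentropy(one_hot, probs))
print(sparse_categorical_crossentropy(target, probs))
```

The two losses give the same value; they differ only in how the target is encoded, which is why the LSTM and GRU models (one-hot labels) use categorical cross-entropy while the dense and convolutional models (integer labels) use the sparse variant.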

  - **Gated Recurrent Unit Neural Network:**
For the Gated Recurrent Unit (GRU) network we follow a similar method as for the LSTM model and use the same embedding layer. Thus, the maximum number of words used for processing and the embedding dimension are 10000 and 100, respectively. We apply two kinds of dropout around the GRU layers, spatial dropout and recurrent dropout, both at 30%. The following two dense layers each have 1024 neurons with the ReLU activation function and a dropout of 0.8 in order to reduce overfitting. The last layer has 9 neurons, since we have 9 categories, and uses the softmax activation function to assign a decimal probability to each class. The model is compiled with the ‘Adam’ optimizer and categorical_crossentropy as the loss function, since this is a multi-class classification problem. Finally, the ‘Early Stopping’ callback monitors training and stops it once the performance of the model stops improving.
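
The early stopping behaviour can be sketched as a simple loop over validation losses. This is a simplified illustration of the idea rather than the Keras `EarlyStopping` callback itself; the function name, the patience value and the loss values are hypothetical, since the notebook's settings are not stated in the text.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None

val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]  # made-up values
print(early_stopping_epoch(val_losses, patience=3))
print(early_stopping_epoch([0.9, 0.8, 0.7], patience=3))
```

In the first run the loss fails to improve in epochs 4, 5 and 6, so training stops at epoch 6 and never reaches the better value in epoch 7, which shows the trade-off behind choosing the patience.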

# Results and Discussion
- The Multinomial Naive Bayes baseline performs far worse than the neural networks, with a macro F1 score below 0.1.
- The best scores come from the additional neural networks: the Convolutional Neural Network scores highest (0.94), followed by the LSTM (0.90).


Model | Macro F1 score
--- | ---
Multinomial NB | 0.084
Dense Deep NN | 0.45
Convolutional NN | 0.94
LSTM NN | 0.90
GRU NN | 0.69
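
The very low baseline score can be reproduced by hand. The sketch below computes the macro F1 score from a confusion matrix, assuming, as the confusion matrix in the code section shows, that the Naive Bayes model predicts class 3 for every validation row. The per-class counts are the validation supports from that output; the function name `macro_f1` is hypothetical.

```python
def macro_f1(confusion):
    """Macro F1 from a square confusion matrix (rows = true class, columns = predicted class)."""
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[r][k] for r in range(n)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n

# Validation rows per class, taken from the notebook output
support = [1202, 527, 1245, 10906, 1027, 1342, 516, 215, 1097]
majority = 3
# Every row is predicted as class 3, so only column 3 is non-zero.
confusion = [[count if col == majority else 0 for col in range(9)] for count in support]

print(round(macro_f1(confusion), 4))  # 0.0836
```

Only class 3 has a non-zero F1 score (about 0.75), and averaging it with eight zeros gives roughly 0.084. Macro averaging weights every class equally, so a model that ignores the eight smaller classes is penalised heavily.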

# Summary and Recommendation

- Regarding the LSTM neural network, ideally we would initialise the embedding layer with pre-trained weights, using an embedding matrix derived from Global Vectors for Word Representation (GloVe).

# References
- Ray, S., 2017. 6 Easy steps to Learn Naive Bayes Algorithm with codes in Python and R. [online] Analytics Vidhya. Available at: <https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/>
- Kaggle. 2019. Intro to Recurrent Neural Networks LSTM | GRU. [online] Available at: <https://www.kaggle.com/thebrownviking20/intro-to-recurrent-neural-networks-lstm-gru>
- Brownlee, J., 2019. Difference Between Return Sequences and Return States for LSTMs in Keras. [online] Machine Learning Mastery. Available at: <https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/>
- Chaubard, F. and Socher, R., 2019. Natural Language Processing with Deep Learning. p.CNNs (Convolutional Neural Networks) chapter
- Kim, J., 2017. Understanding how Convolutional Neural Network (CNN) perform text classification with word embeddings. [online] Towards Data Science. Available at: <https://towardsdatascience.com/understanding-how-convolutional-neural-network-cnn-perform-text-classification-with-word-d2ee64b9dd0b>
- Brownlee, J., 2018. How to Develop a Multichannel CNN Model for Text Classification. [online] Machine Learning Mastery. Available at: <https://machinelearningmastery.com/develop-n-gram-multichannel-convolutional-neural-network-sentiment-analysis/>
- Kaggle. 2019. NLP with CNN. [online] Available at: <https://www.kaggle.com/adhamsuliman1993/nlp-with-cnn/code>
- Kaggle. 2020. Text Classification using CNN. [online] Available at: <https://www.kaggle.com/au1206/text-classification-using-cnn>
- GeeksforGeeks. 2020. An introduction to MultiLabel classification. [online] Available at: <https://www.geeksforgeeks.org/an-introduction-to-multilabel-classification/>
- Artiwise. 2020. Multi-label Text Classification with Machine Learning and Deep Learning. [online] Available at: <https://medium.com/@artiwise_en/multi-label-text-classification-with-machine-learning-and-deep-learning-1a0565ee98c8>
- Scikit-learn.org. n.d. Multiclass and Multioutput algorithms. [online] Available at: <https://scikit-learn.org/stable/modules/multiclass.html>
- Nabi, J., 2018. Machine Learning - Text processing. [online] Towards Data Science. Available at: <https://towardsdatascience.com/machine-learning-text-processing-1d5a2d638958>
- Machine Learning. 2017. Text Classification using Neural Networks. [online] Available at: <https://machinelearnings.co/text-classification-using-neural-networks-f5cd7b8765c6>
- Janakiev, N., n.d. Practical Text Classification With Python and Keras. [online] Real Python. Available at: <https://realpython.com/python-keras-text-classification/>
- Shaikh, J., 2017. Machine Learning, NLP: Text Classification using scikit-learn, python and NLTK. [online] Towards Data Science. Available at: <https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a>
- Kumar, S., 2019. Getting started with Text Preprocessing. [online] Kaggle. Available at: <https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing>
- Chapter 4. Text Vectorization and Transformation Pipelines. [online] O'Reilly. Available at: <https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html>


# Code


In [1]:
import tensorflow as tf
tf.test.gpu_device_name()

'/device:GPU:0'

In [2]:
%tensorflow_version 2.x
import tensorflow as tf
import timeit

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  print(
      '\n\nThis error most likely means that this notebook is not '
      'configured to use a GPU.  Change this in Notebook Settings via the '
      'command palette (cmd/ctrl-shift-P) or the Edit menu.\n\n')
  raise SystemError('GPU device not found')

def cpu():
  with tf.device('/cpu:0'):
    random_image_cpu = tf.random.normal((100, 100, 100, 3))
    net_cpu = tf.keras.layers.Conv2D(32, 7)(random_image_cpu)
    return tf.math.reduce_sum(net_cpu)

def gpu():
  with tf.device('/device:GPU:0'):
    random_image_gpu = tf.random.normal((100, 100, 100, 3))
    net_gpu = tf.keras.layers.Conv2D(32, 7)(random_image_gpu)
    return tf.math.reduce_sum(net_gpu)
  
# We run each op once to warm up; see: https://stackoverflow.com/a/45067900
cpu()
gpu()

# Run the op several times.
print('Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images '
      '(batch x height x width x channel). Sum of ten runs.')
print('CPU (s):')
cpu_time = timeit.timeit('cpu()', number=10, setup="from __main__ import cpu")
print(cpu_time)
print('GPU (s):')
gpu_time = timeit.timeit('gpu()', number=10, setup="from __main__ import gpu")
print(gpu_time)
print('GPU speedup over CPU: {}x'.format(int(cpu_time/gpu_time)))

Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images (batch x height x width x channel). Sum of ten runs.
CPU (s):
2.904276865
GPU (s):
0.0386751899999922
GPU speedup over CPU: 75x


In [201]:
# Include your packages/imports here
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from google.colab import files
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input,Conv1D,MaxPooling1D,Dense,GlobalMaxPooling1D,Embedding
from tensorflow.keras.models import Model
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, GRU
from keras.utils.np_utils import to_categorical
from keras.callbacks import EarlyStopping
from keras.layers import Dropout
from sklearn.metrics import f1_score
from keras.regularizers import l1,l2,l1_l2
from imblearn.over_sampling import SMOTE
from tqdm.notebook import tqdm

%matplotlib inline
# Add your models here

# Add your functions for training here

In [33]:
#Read the full processed dataset
df = pd.read_csv('dataset_task2.csv')

In [34]:
#adding the text of the three columns into one
df['text']= df['title']+' '+df['description']+' '+df['awarding_authority']

In [35]:
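#splitting the dataset into train and test
#(.copy() makes independent frames, so later column assignments do not act on slices of df)
df_train = df[:90384].copy()
df_test = df[90384:].copy()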

In [36]:
df.head()

Unnamed: 0.1,Unnamed: 0,index,docid,publication_date,contract_type,nature_of_contract,title,description,awarding_authority,label,text
0,0,0,2493527426,114,0,1,germani wilhelmshaven clean servic,unterhalt glasrein,staatlich baumanag em wes,100000.0,germani wilhelmshaven clean servic unterhalt g...
1,1,1,2538215982,131,1,1,germani dresden engin design servic traffic in...,ab karlsruh stuttgart nurnberg leipzig dresd b...,db netz ag,1000.0,germani dresden engin design servic traffic in...
2,2,2,2204943443,100,1,3,germani germer heat ventil air condit instal work,fertigstell erst bauabschnitt erfolgt zweit ba...,gross kreisstadt germ,1000.0,germani germer heat ventil air condit instal w...
3,3,3,2417769175,96,1,2,germani limbach board,einricht tafelsyst,gemeind limbach,100000000.0,germani limbach board einricht tafelsyst gemei...
4,4,4,2242098706,93,0,3,germani frankfurt main landscap work green area,projekt neubau filial dortmund gewerk galabau ...,deutsch bundesbank beschaffungszentrum,1000.0,germani frankfurt main landscap work green are...


In [37]:
#removing the unexpected column: unnamed:0
df = df.iloc[: , 1:]

In [38]:
df_train.shape
df_train = df_train.iloc[: , 1:]

In [39]:
#removing the decimal places
df_train['label'] = df_train['label'].astype(float)

In [40]:
df_train['label'] = df_train['label'].astype(int)

In [41]:
df_train['label'] = df_train['label'].apply(str)

le = LabelEncoder()
df_train['label'] = le.fit_transform(df_train['label'])
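
The label conversion above can be illustrated in plain Python. The sketch assumes that all nine single-industry labels occur in the training data and that the encoder assigns indices in sorted order of the label strings, which is consistent with the encoded values shown in the notebook output (for example 1000 becomes 3 and 100000 becomes 5). The function name `fit_label_mapping` is hypothetical.

```python
def fit_label_mapping(labels):
    """Assign consecutive integers to the sorted distinct labels."""
    classes = sorted(set(labels))
    return classes, {label: idx for idx, label in enumerate(classes)}

# Labels after the float -> int -> str conversion (leading zeros are lost)
raw = ["100000", "1000", "1000", "100000000", "1", "10", "100",
       "10000", "1000000", "10000000"]
classes, to_index = fit_label_mapping(raw)

encoded = [to_index[label] for label in raw[:4]]
print(encoded)               # [5, 3, 3, 8]

# Decoding a predicted class back to the original 9-digit label
print(classes[3].zfill(9))   # 000001000
```

Padding with `zfill(9)` restores the leading zeros that were dropped when the labels were read as numbers.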

In [42]:
df_test.shape
df_test = df_test.iloc[: , 1:]

In [43]:
df_train.head()

Unnamed: 0,index,docid,publication_date,contract_type,nature_of_contract,title,description,awarding_authority,label,text
0,0,2493527426,114,0,1,germani wilhelmshaven clean servic,unterhalt glasrein,staatlich baumanag em wes,5,germani wilhelmshaven clean servic unterhalt g...
1,1,2538215982,131,1,1,germani dresden engin design servic traffic in...,ab karlsruh stuttgart nurnberg leipzig dresd b...,db netz ag,3,germani dresden engin design servic traffic in...
2,2,2204943443,100,1,3,germani germer heat ventil air condit instal work,fertigstell erst bauabschnitt erfolgt zweit ba...,gross kreisstadt germ,3,germani germer heat ventil air condit instal w...
3,3,2417769175,96,1,2,germani limbach board,einricht tafelsyst,gemeind limbach,8,germani limbach board einricht tafelsyst gemei...
4,4,2242098706,93,0,3,germani frankfurt main landscap work green area,projekt neubau filial dortmund gewerk galabau ...,deutsch bundesbank beschaffungszentrum,3,germani frankfurt main landscap work green are...


# Multinomial Naive Bayes Baseline Machine Learning Model

In [44]:
y = df_train['label']

In [45]:
X = df_train.drop(columns=['index','label','title','description','awarding_authority'])

In [46]:
X.head()

Unnamed: 0,docid,publication_date,contract_type,nature_of_contract,text
0,2493527426,114,0,1,germani wilhelmshaven clean servic unterhalt g...
1,2538215982,131,1,1,germani dresden engin design servic traffic in...
2,2204943443,100,1,3,germani germer heat ventil air condit instal w...
3,2417769175,96,1,2,germani limbach board einricht tafelsyst gemei...
4,2242098706,93,0,3,germani frankfurt main landscap work green are...


In [47]:
#trail1 = X.copy()

In [52]:
X_train, X_val, y_train, y_val = train_test_split(X['text'], y, random_state=42, test_size=0.2, shuffle=True)

**TfidfVectorizer**

In [55]:
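# Note: a float min_df is a proportion of documents, so min_df=0.3 keeps only terms
# that occur in at least 30% of the documents, which can leave a very small vocabulary.
tfv = TfidfVectorizer(min_df=0.3, max_features=None, strip_accents='unicode', analyzer='word',ngram_range=(1, 3), use_idf=1,smooth_idf=1,sublinear_tf=1)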
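
The effect of a fractional `min_df` can be sketched with a small document-frequency filter. This is a simplified illustration for single words only (the vectorizer above also builds 2-grams and 3-grams); the function name and the toy documents are hypothetical.

```python
from collections import Counter

def vocabulary_with_min_df(docs, min_df):
    """Keep terms whose document frequency is at least min_df (a fraction of all documents)."""
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    threshold = min_df * len(docs)
    return sorted(term for term, df in doc_freq.items() if df >= threshold)

docs = [
    "germani clean servic",
    "germani engin design servic",
    "germani board",
    "germani landscap work",
]
print(vocabulary_with_min_df(docs, min_df=0.3))   # ['germani', 'servic']
print(len(vocabulary_with_min_df(docs, min_df=0.0)))
```

With a threshold of 30%, only terms shared by many documents survive, and such terms tend to say little about which class a document belongs to.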

In [57]:
X_train = (X_train).values.astype('U')
X_val = (X_val).values.astype('U')

In [58]:
tfv.fit(list(X_train) + list(X_val))

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=0.3, ngram_range=(1, 3), norm='l2', preprocessor=None,
                smooth_idf=1, stop_words=None, strip_accents='unicode',
                sublinear_tf=1, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=1, vocabulary=None)

In [59]:
xtrain_tfv =  tfv.transform(X_train) 
xvalid_tfv = tfv.transform(X_val)

In [60]:
#Naive Bayes on TFIDF
clf = MultinomialNB()
clf.fit(xtrain_tfv, y_train)

y_pred = clf.predict(xvalid_tfv)

In [67]:
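# Note: scikit-learn metrics take (y_true, y_pred). The arguments are swapped here, so the
# confusion matrix is transposed and precision and recall trade places in the report;
# the macro F1 score is unaffected.
print(confusion_matrix(y_pred,y_val))
print(f1_score(y_pred,y_val, average = 'macro'))
print(classification_report(y_pred,y_val))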

[[    0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0]
 [ 1202   527  1245 10906  1027  1342   516   215  1097]
 [    0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0]]
0.08361989978799833
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.00      0.00      0.00         0
           2       0.00      0.00      0.00         0
           3       1.00      0.60      0.75     18077
           4       0.00      0.00      0.00         0
           5       0.00      0.00      0.00         0
           6       0.00      0.00      0.00         0
           7       0.00      0.00

  _warn_prf(average, modifier, msg_start, len(result))


# Dense Deep Neural Network

In [68]:
df['text'].shape

(114965,)

In [70]:
max_words = 50000
tokenizer = Tokenizer(max_words)

df['text'] = df['text'].apply(str)
tokenizer.fit_on_texts(df['text'])

In [71]:
#sequences_to_matrix(sequences, mode='binary')- crashes due to RAM required 
sequence_train = tokenizer.texts_to_sequences(X_train)

In [72]:
sequence_test = tokenizer.texts_to_sequences(X_val)

In [73]:
word_2_vec = tokenizer.word_index
V = len(word_2_vec)

print('Dataset has {} number of independent tokens'.format(V))

Dataset has 109021 number of independent tokens


In [74]:
data_train = pad_sequences(sequence_train)
data_train.shape

(72307, 563)

In [75]:
T = data_train.shape[1]

data_test = pad_sequences(sequence_test, maxlen=T)
data_test.shape

(18077, 563)

In [86]:
smote = SMOTE(sampling_strategy='minority')
# fit_resample replaces the deprecated fit_sample in current imbalanced-learn releases
X_sm, y_sm = smote.fit_resample(data_train,y_train)
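
The core idea of SMOTE is to create synthetic minority samples by interpolating between a sample and one of its nearest neighbours. The sketch below shows only that interpolation step, with hypothetical names and made-up vectors; the imbalanced-learn implementation also performs the neighbour search and decides how many samples to generate.

```python
import random

def smote_sample(point, neighbour, rng):
    """Create one synthetic sample on the segment between a point and one of its neighbours."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [p + lam * (q - p) for p, q in zip(point, neighbour)]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
a = [0, 0, 12, 7]        # two made-up padded token-id sequences of the same class
b = [0, 5, 12, 9]
synthetic = smote_sample(a, b, rng)
print(synthetic)
```

Each synthetic value lies between the corresponding values of the two originals, so for integer token ids the result is generally not an integer.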



In [87]:
model = Sequential()
model.add(tf.keras.layers.Dense(128,input_shape=(data_train[0].shape)))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(Dropout(0.25))
model.add(tf.keras.layers.Dense(9, activation='softmax',kernel_regularizer=l2(0.01)))

In [88]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [90]:
nn = model.fit(X_sm, y_sm, validation_data=(data_test,y_val), batch_size=128, epochs=75)

Epoch 1/75
...
Epoch 75/75


In [91]:
model.predict(data_test)

array([[1.5177588e-05, 2.2882006e-05, 2.6565665e-04, ..., 8.7775406e-06,
        4.9202589e-07, 1.6864638e-05],
       [5.6620096e-03, 1.2726370e-03, 1.0472906e-02, ..., 1.9940909e-04,
        4.1122660e-03, 6.0702078e-03],
       [2.1551408e-02, 3.3901431e-02, 3.4394044e-02, ..., 4.4942098e-03,
        1.9417939e-04, 2.7550308e-02],
       ...,
       [1.8840066e-08, 6.2170458e-10, 2.8192071e-08, ..., 6.5559003e-09,
        9.0935520e-16, 5.0390845e-06],
       [4.0643636e-02, 3.1042235e-02, 3.3931188e-02, ..., 1.9132821e-02,
        5.7891579e-03, 5.9540384e-02],
       [2.7357565e-02, 3.0128075e-02, 2.8876664e-02, ..., 1.1134259e-03,
        1.9256456e-03, 7.2576173e-02]], dtype=float32)

In [92]:
y_pred=model.predict(data_test)

In [93]:
y_pred_final=np.argmax(y_pred,axis=1)
y_pred_final

array([3, 3, 3, ..., 3, 3, 3])

In [94]:
print(classification_report(y_val,y_pred_final))
print(f1_score(y_val,y_pred_final, average = 'macro'))

              precision    recall  f1-score   support

           0       0.50      0.30      0.37      1202
           1       0.55      0.26      0.35       527
           2       0.60      0.31      0.41      1245
           3       0.74      0.89      0.81     10906
           4       0.66      0.48      0.55      1027
           5       0.50      0.30      0.37      1342
           6       0.64      0.48      0.55       516
           7       0.18      0.54      0.27       215
           8       0.50      0.30      0.37      1097

    accuracy                           0.68     18077
   macro avg       0.54      0.43      0.45     18077
weighted avg       0.66      0.68      0.65     18077

0.451587741602686


In [96]:
T = data_train.shape[1]

df_test['text'] = df_test['text'].apply(str)
sequence_actual = tokenizer.texts_to_sequences(df_test['text'])
data_actual = pad_sequences(sequence_actual, maxlen=T)
data_actual.shape

(24581, 563)

In [97]:
y_actual = model.predict(data_actual)

In [98]:
y_actual = np.argmax(y_actual, axis=1)
y_actual

array([3, 3, 4, ..., 3, 3, 7])

In [99]:
y_actual.shape

(24581,)

In [138]:
y_actual2 = le.inverse_transform(y_actual)

y_actual2 = pd.DataFrame(y_actual2, columns=['label'])
y_actual2['label'] = y_actual2['label'].apply(lambda x: x.zfill(9))
y_actual2

Unnamed: 0,label
0,000001000
1,000001000
2,000010000
3,000001000
4,000001000
...,...
24576,000001000
24577,000001000
24578,000001000
24579,000001000


In [164]:
pd.DataFrame(y_actual2).set_index(df_test['docid']).rename(columns={0:'label'}).to_csv('NN_1.csv')

# Additional Neural Networks

## 1. Convolutional Neural Network

In [141]:
max_words=10000

tokenizer=Tokenizer(max_words)
tokenizer.fit_on_texts(df['text'])
sequence_train=tokenizer.texts_to_sequences(X_train)
sequence_test=tokenizer.texts_to_sequences(X_val)

In [142]:
word_2_vec=tokenizer.word_index
V=len(word_2_vec)

print('Dataset has {} number of independent tokens'.format(V))

Dataset has 109021 number of independent tokens


In [143]:
data_train=pad_sequences(sequence_train)
data_train.shape

(72307, 454)

In [144]:
T = data_train.shape[1]

data_test = pad_sequences(sequence_test, maxlen = T)
data_test.shape

(18077, 454)

In [145]:
smote = SMOTE(sampling_strategy = 'minority')
# fit_resample replaces the deprecated fit_sample in current imbalanced-learn releases
X_sm, y_sm = smote.fit_resample(data_train, y_train)



In [146]:
D = 20
i = Input((T,))
x = Embedding(V+1, D)(i)
x = Conv1D(32, 3, kernel_initializer='he_uniform', activation='relu')(x)
x = MaxPooling1D(3)(x)
x = (Dropout(0.2))(x)
x = Conv1D(64, 3, kernel_initializer='he_uniform', activation='relu')(x)
x = MaxPooling1D(3)(x)
x = (Dropout(0.2))(x)
x = Conv1D(128, 3, kernel_initializer='he_uniform', activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = (Dropout(0.2))(x)
x = Dense(9, activation='softmax', kernel_regularizer = l2(0.01))(x)
model = Model(i, x)
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 454)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 454, 20)           2180440   
_________________________________________________________________
conv1d (Conv1D)              (None, 452, 32)           1952      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 150, 32)           0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 150, 32)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 148, 64)           6208      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 49, 64)            0     
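The output shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the Keras defaults used in the model cell (`'valid'` padding, stride 1, pool stride equal to pool size); the helper names are hypothetical. The last length it computes (47) belongs to a layer that is cut off in the output above, so it is derived from the formulas rather than read from the summary.

```python
def conv1d_len(n, kernel):
    """Output length of a Conv1D with 'valid' padding and stride 1."""
    return n - kernel + 1


def pool1d_len(n, pool):
    """Output length of MaxPooling1D with non-overlapping windows."""
    return n // pool


def conv1d_params(in_ch, out_ch, kernel):
    """Kernel weights plus one bias per filter."""
    return kernel * in_ch * out_ch + out_ch


T, V, D = 454, 109021, 20
lengths = [T]
for op, size in [(conv1d_len, 3), (pool1d_len, 3), (conv1d_len, 3),
                 (pool1d_len, 3), (conv1d_len, 3)]:
    lengths.append(op(lengths[-1], size))

embedding_params = (V + 1) * D
print(lengths)
print(embedding_params, conv1d_params(20, 32, 3), conv1d_params(32, 64, 3))
```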

In [147]:
model.compile(loss = 'sparse_categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

In [149]:
cnn = model.fit(X_sm, y_sm, validation_data=(data_test, y_val), batch_size=128, epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [150]:
model.predict(data_test)

array([[7.7424983e-11, 1.6726409e-16, 7.6045019e-14, ..., 8.1680433e-11,
        6.7368694e-10, 1.9670496e-11],
       [7.3270867e-21, 1.2751454e-13, 1.0000000e+00, ..., 5.1040306e-17,
        7.0419924e-26, 1.2759795e-21],
       [6.5742526e-08, 3.3495923e-12, 3.1139355e-10, ..., 5.0553957e-08,
        3.2786907e-08, 6.0013129e-08],
       ...,
       [7.1521881e-12, 3.4724356e-18, 2.1706329e-11, ..., 1.4108348e-14,
        7.0709310e-18, 1.4247949e-09],
       [1.0631021e-12, 6.8982680e-23, 8.3002782e-19, ..., 3.0251588e-16,
        7.4019099e-14, 3.1318399e-15],
       [6.3762022e-09, 8.0994866e-15, 2.5312768e-12, ..., 2.2533728e-10,
        1.1912837e-08, 7.1001782e-10]], dtype=float32)

In [151]:
y_pred = model.predict(data_test)

In [152]:
y_pred_final=np.argmax(y_pred, axis=1)
y_pred_final

array([3, 2, 3, ..., 5, 3, 3])

In [153]:
print(classification_report(y_val, y_pred_final))
print(f1_score(y_val, y_pred_final, average='macro'))

              precision    recall  f1-score   support

           0       0.97      0.95      0.96      1202
           1       0.98      0.94      0.96       527
           2       0.98      0.97      0.98      1245
           3       0.98      1.00      0.99     10906
           4       0.96      0.97      0.96      1027
           5       0.98      0.95      0.97      1342
           6       0.97      0.97      0.97       516
           7       0.72      0.87      0.79       215
           8       0.97      0.90      0.93      1097

    accuracy                           0.98     18077
   macro avg       0.95      0.94      0.95     18077
weighted avg       0.98      0.98      0.98     18077

0.9450903441794445
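The macro F1 printed above is the unweighted mean of the per-class F1 scores, so the small class 7 counts as much as the dominant class 3. The sketch below recomputes both averages from the rounded figures in the report; because the inputs are rounded to two decimals, the results only approximate the printed values.

```python
# (f1, support) per class, copied from the rounded report above
report = [(0.96, 1202), (0.96, 527), (0.98, 1245), (0.99, 10906), (0.96, 1027),
          (0.97, 1342), (0.97, 516), (0.79, 215), (0.93, 1097)]

macro_f1 = sum(f1 for f1, _ in report) / len(report)       # every class weighs the same
total = sum(n for _, n in report)
weighted_f1 = sum(f1 * n for f1, n in report) / total      # classes weighted by support

print(round(macro_f1, 3), round(weighted_f1, 3))
```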


In [158]:
T = X_sm.shape[1]

sequence_actual = tokenizer.texts_to_sequences(df_test['text'])
data_actual = pad_sequences(sequence_actual, maxlen=T)
data_actual.shape

(24581, 454)

In [159]:
y_actual_cnn = model.predict(data_actual)

In [161]:
y_actual_cnn = np.argmax(y_actual_cnn, axis=1)
y_actual_cnn

array([2, 3, 7, ..., 3, 3, 5])

In [162]:
y_actual2 = le.inverse_transform(y_actual_cnn)

y_actual2 = pd.DataFrame(y_actual2, columns=['label'])
y_actual2['label'] = y_actual2['label'].apply(lambda x: x.zfill(9))
y_actual2

| | label |
|---|---|
| 0 | 000000100 |
| 1 | 000001000 |
| 2 | 010000000 |
| 3 | 000001000 |
| 4 | 000001000 |
| ... | ... |
| 24576 | 000001000 |
| 24577 | 000001000 |
| 24578 | 000001000 |
| 24579 | 000001000 |


In [163]:
# y_actual2 is already a DataFrame with a 'label' column, so no wrapping or renaming is needed.
y_actual2.set_index(df_test['docid']).to_csv('CNN_1.csv')

## 2. Long Short-Term Memory Neural Network

In [165]:
# Maximum number of words kept by the tokenizer
MAX_NB_WORDS = 10000
# Maximum number of tokens per sequence
MAX_SEQUENCE_LENGTH = 250
# Embedding vector size (fixed)
EMBEDDING_DIM = 100

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(df['text'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 109021 unique tokens.


In [168]:
df_train['text'] = df_train['text'].apply(str)
X1 = tokenizer.texts_to_sequences(df_train['text'].values)
X1 = pad_sequences(X1, maxlen=MAX_SEQUENCE_LENGTH)

print('Shape of data tensor:', X1.shape)

Shape of data tensor: (90384, 250)


In [169]:
Y1 = pd.get_dummies(y).values

print('Shape of label tensor:', Y1.shape)

Shape of label tensor: (90384, 9)
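The CNN was trained on integer class IDs with `sparse_categorical_crossentropy`, while the recurrent models use one-hot rows with `categorical_crossentropy`. Both encodings carry the same information. The sketch below shows the round trip in plain Python, assuming `y` holds integer class IDs from 0 to 8; the helper names are hypothetical.

```python
def one_hot(labels, num_classes):
    """Turn integer class IDs into one-hot rows."""
    rows = []
    for k in labels:
        row = [0] * num_classes
        row[k] = 1
        rows.append(row)
    return rows


def argmax(row):
    """Index of the largest entry in a row."""
    return max(range(len(row)), key=row.__getitem__)


y_sparse = [3, 2, 7]                      # toy class IDs
y_onehot = one_hot(y_sparse, 9)
recovered = [argmax(r) for r in y_onehot]
print(y_onehot[0], recovered)
```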


In [170]:
X_train, X_test, Y_train, Y_test = train_test_split(X1, Y1, test_size = 0.20, random_state = 42)

print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(72307, 250) (72307, 9)
(18077, 250) (18077, 9)


In [171]:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X1.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(9, activation='softmax', kernel_regularizer=l2(0.01)))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 250, 100)          1000000   
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 250, 100)          0         
_________________________________________________________________
lstm (LSTM)                  (None, 100)               80400     
_________________________________________________________________
dense_15 (Dense)             (None, 9)                 909       
Total params: 1,081,309
Trainable params: 1,081,309
Non-trainable params: 0
_________________________________________________________________
None
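The parameter counts in this summary follow from the layer sizes. The sketch below reproduces them; an LSTM layer has four gate blocks, each with input weights, recurrent weights and a bias. The helper names are hypothetical.

```python
def embedding_params(vocab, dim):
    """One vector of size dim per vocabulary entry."""
    return vocab * dim


def lstm_params(input_dim, units):
    """Four gate blocks, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(inputs, outputs):
    """Weight matrix plus one bias per output."""
    return inputs * outputs + outputs


counts = [embedding_params(10000, 100), lstm_params(100, 100), dense_params(100, 9)]
print(counts, sum(counts))
```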


In [172]:
epochs = 5
batch_size = 128

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.2,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
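All five epochs ran, so the early-stopping callback did not interrupt training. The sketch below is a simplified version of the patience rule (it ignores other `EarlyStopping` options such as `baseline` or `restore_best_weights`), applied to hypothetical validation losses. With `patience=3`, training can stop after epoch 4 at the earliest, so in a 5-epoch run the callback can save at most one epoch.

```python
def stop_epoch(val_losses, patience=3, min_delta=0.0001):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:      # counts as an improvement
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


hypothetical_losses = [0.40, 0.30, 0.31, 0.305, 0.32]   # made-up values
print(stop_epoch(hypothetical_losses))
```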


In [173]:
accr = model.evaluate(X_test, Y_test)



In [174]:
model.predict(X_test)

array([[4.5826574e-04, 1.4135386e-04, 3.7207917e-04, ..., 5.0347921e-04,
        5.6711980e-04, 6.3459558e-04],
       [9.4510510e-04, 3.7779568e-03, 9.7149110e-01, ..., 3.0791103e-03,
        1.0260203e-03, 6.3530793e-03],
       [5.5233983e-04, 1.9333945e-04, 4.4227808e-04, ..., 6.3904299e-04,
        7.3623046e-04, 8.3656330e-04],
       ...,
       [6.8511521e-03, 1.0113606e-03, 4.2883945e-03, ..., 2.6862510e-03,
        1.8898442e-03, 2.7827939e-03],
       [4.7273119e-04, 1.5216056e-04, 3.6970977e-04, ..., 5.4716144e-04,
        6.2427961e-04, 7.2582468e-04],
       [6.8233197e-04, 2.8164641e-04, 7.0552021e-04, ..., 7.7690277e-04,
        9.3062932e-04, 1.1100937e-03]], dtype=float32)

In [175]:
y_pred = model.predict(X_test)

In [176]:
y_pred_final = np.argmax(y_pred, axis=1)
y_pred_final

array([3, 2, 3, ..., 5, 3, 3])

In [177]:
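# Score against the labels of this split (Y_test is one-hot, so take the argmax).
y_true = np.argmax(Y_test, axis=1)
print(classification_report(y_true, y_pred_final))
print(f1_score(y_true, y_pred_final, average='macro'))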

              precision    recall  f1-score   support

           0       0.93      0.97      0.95      1202
           1       0.88      0.93      0.90       527
           2       0.97      0.97      0.97      1245
           3       1.00      0.99      0.99     10906
           4       0.93      0.97      0.95      1027
           5       0.95      0.97      0.96      1342
           6       0.92      0.93      0.93       516
           7       0.69      0.54      0.61       215
           8       0.90      0.90      0.90      1097

    accuracy                           0.97     18077
   macro avg       0.91      0.91      0.91     18077
weighted avg       0.97      0.97      0.97     18077

0.9060204407029785


In [187]:
act = tokenizer.texts_to_sequences(df_test['text'].values)
act = pad_sequences(act, maxlen = MAX_SEQUENCE_LENGTH)

print('Shape of data tensor:', act.shape)

Shape of data tensor: (24581, 250)


In [188]:
y_actual_lstm = model.predict(act)

In [189]:
y_actual_lstm = np.argmax(y_actual_lstm, axis=1)
y_actual_lstm

array([2, 3, 7, ..., 3, 3, 5])

In [190]:
y_actual2 = le.inverse_transform(y_actual_lstm)

y_actual2 = pd.DataFrame(y_actual2, columns=['label'])
y_actual2['label'] = y_actual2['label'].apply(lambda x: x.zfill(9))
y_actual2

| | label |
|---|---|
| 0 | 000000100 |
| 1 | 000001000 |
| 2 | 010000000 |
| 3 | 000001000 |
| 4 | 000001000 |
| ... | ... |
| 24576 | 000001000 |
| 24577 | 000001000 |
| 24578 | 000001000 |
| 24579 | 000001000 |


In [191]:
# y_actual2 is already a DataFrame with a 'label' column, so no wrapping or renaming is needed.
y_actual2.set_index(df_test['docid']).to_csv('LSTM_1.csv')

## 3. Gated Recurrent Unit Neural Network

In [192]:
X_train, X_test, Y_train, Y_test = train_test_split(X1,Y1, test_size = 0.20, random_state = 42)

In [200]:
model = Sequential()
# Note: the embedding is randomly initialised and frozen (trainable=False),
# so the recurrent layers receive fixed random word vectors.
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X1.shape[1], trainable=False))

model.add(SpatialDropout1D(0.3))
model.add(GRU(300, dropout=0.3, recurrent_dropout=0.3, return_sequences=True))
model.add(GRU(300, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(9, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Stop when val_loss has not improved by at least 0.0001 for 3 consecutive epochs.
earlystop = EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=3)

model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_split=0.2, callbacks=[earlystop])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fc39cb6c050>

In [203]:
y_pred = model.predict(X_test)

In [204]:
y_pred_final = np.argmax(y_pred, axis=1)
y_pred_final

array([3, 2, 3, ..., 5, 3, 3])

In [205]:
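# Score against the labels of this split (Y_test is one-hot, so take the argmax).
y_true = np.argmax(Y_test, axis=1)
print(classification_report(y_true, y_pred_final))
print(f1_score(y_true, y_pred_final, average='macro'))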

              precision    recall  f1-score   support

           0       0.86      0.85      0.85      1202
           1       0.94      0.46      0.62       527
           2       0.68      0.94      0.79      1245
           3       0.99      0.96      0.97     10906
           4       0.96      0.60      0.74      1027
           5       0.91      0.78      0.84      1342
           6       0.90      0.79      0.84       516
           7       0.00      0.00      0.00       215
           8       0.46      0.87      0.60      1097

    accuracy                           0.88     18077
   macro avg       0.74      0.69      0.70     18077
weighted avg       0.90      0.88      0.88     18077

0.6957060612979331
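Class 7 scores 0.00 on precision, recall and F1, which is consistent with the GRU never predicting that class on this split. In that case precision is undefined (no predicted samples) and is reported as 0, which also drags down the macro average. The sketch below shows the arithmetic from raw counts, assuming undefined ratios are treated as 0; the function name `f1_from_counts` and the second set of counts are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score with undefined precision or recall treated as 0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


never_predicted = f1_from_counts(tp=0, fp=0, fn=215)   # 215 true samples, none predicted
balanced = f1_from_counts(tp=90, fp=10, fn=10)         # made-up counts for comparison
print(never_predicted, round(balanced, 2))
```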




In [206]:
act = tokenizer.texts_to_sequences(df_test['text'].values)
act = pad_sequences(act, maxlen = MAX_SEQUENCE_LENGTH)

print('Shape of data tensor:', act.shape)

Shape of data tensor: (24581, 250)


In [207]:
y_actual_gru = model.predict(act)

In [208]:
y_actual_gru = np.argmax(y_actual_gru, axis=1)
y_actual_gru

array([2, 3, 8, ..., 3, 3, 5])

In [209]:
y_actual2 = le.inverse_transform(y_actual_gru)

y_actual2 = pd.DataFrame(y_actual2, columns=['label'])
y_actual2['label'] = y_actual2['label'].apply(lambda x: x.zfill(9))
y_actual2

| | label |
|---|---|
| 0 | 000000100 |
| 1 | 000001000 |
| 2 | 100000000 |
| 3 | 000001000 |
| 4 | 000001000 |
| ... | ... |
| 24576 | 000001000 |
| 24577 | 000001000 |
| 24578 | 000001000 |
| 24579 | 000001000 |


In [210]:
# y_actual2 is already a DataFrame with a 'label' column, so no wrapping or renaming is needed.
y_actual2.set_index(df_test['docid']).to_csv('GRU_1.csv')