# A Comprehensive Guide to Understand and Implement Text Classification in Python

The content of this file originates from an article by Shivam Bansal, published on April 23, 2018, at: https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/. The code examples in the article contain some errors (see the comments at the bottom of the article), but the article is still a good resource for learning various ML and DL methods for text classification in Python.

The original content and code were modified by Yuen-Hsien Tseng between 2018/09/29 and 2020/02/10 and are used in the paper:

Yuen-Hsien Tseng, "[The Feasibility of Automated Topic Analysis: An Empirical Evaluation of Deep Learning Techniques Applied to Skew-Distributed Chinese Text Classification](http://joemls.dils.tku.edu.tw/fulltext/57104fullText.pdf)," Journal of Educational Media & Library Sciences, Vol. 57, No. 1 (March 2020).

### Note: 

1. To run the deep learning methods, you need Word2Vec files on your local disk. I downloaded them from:
 https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.zh_classical.vec
 for Chinese Word2Vec, and from:
 https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.simple.vec
 for English Word2Vec.
 See the function Read_Word_Embedding() below for an example of where to save the Word2Vec files on your local disk.

2. The classifiers here are only for multi-class, single-label text classification problems. For texts (documents) that belong to multiple categories (multi-labeled), the methods shown here have to be modified.

# Introduction

Text classification is an example of a supervised machine learning task, since a labelled dataset containing text documents and their labels is used to train a classifier. An end-to-end text classification pipeline is composed of three main components, followed by a fourth topic on improving the results:

1. Dataset Preparation: The first step is the Dataset Preparation step, which includes loading a dataset and performing basic pre-processing. The dataset is then split into train and validation sets.

2. Feature Engineering: The next step is Feature Engineering, in which the raw dataset is transformed into flat features that can be used in a machine learning model. This step also includes creating new features from the existing data.

3. Model Training: The final step is the Model Training step, in which a machine learning model is trained on a labelled dataset.

4. Improve Performance of Text Classifier: In this article, we will also look at different ways to improve the performance of text classifiers.

## Getting your machine ready
Let's implement the basic components step by step in order to create a text classification framework in Python. To start with, import all the required libraries.

You need the following libraries to run this code; each can be installed from its official site:
Pandas, 
Scikit-learn, 
XGBoost, 
TextBlob, 
Keras,
jieba, etc.

In [1]:
# -*- coding: UTF-8 -*-
# libraries for dataset preparation, feature engineering, model training 
import time
time_Start = time.time()

from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

from nltk.stem import PorterStemmer, WordNetLemmatizer

import pandas, xgboost, numpy, textblob, string

## 1. Dataset preparation
To prepare the dataset, load the downloaded data into a pandas dataframe containing two columns – text and label. (Here are more text classification datasets: https://drive.google.com/drive/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M)
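
As a minimal sketch of the expected input: the loader used in this notebook, Load_Data(), reads one document per line, with the label and the text separated by a tab. The sample lines and the helper name `parse_line` below are made up for illustration and do not come from the datasets used here.

```python
# Hypothetical sample in the "label<TAB>text" layout that Load_Data() reads
sample = "sports\tThe team won the final.\npolitics\tThe bill passed today.\n"

def parse_line(line):
    """Split one line into (label, text); maxsplit=1 keeps any tabs inside the text."""
    label, text = line.split("\t", 1)
    return label, text

# Skip empty lines, exactly one (label, text) pair per remaining line
rows = [parse_line(line) for line in sample.split("\n") if line != ""]
labels = [label for label, _ in rows]
texts = [text for _, text in rows]

print(labels)
print(texts)
```

Note that this sketch splits on the first tab only, whereas Load_Data() splits on every tab and therefore expects exactly one tab per line.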

In [2]:
import re, sys
import jieba # to deal with Chinese text dataset
jieba.load_userdict("TermFreq-utf8.txt") # added on 2019/02/04

print("It takes %4.2f seconds to import packages."%(time.time()-time_Start))

print("sys.argv:", sys.argv)
Trainable = True # indicate if word embedding layer is trainable
min_df=2
# Raw strings (r'...') keep the backslash in \d from being read as a string escape
if len(sys.argv) == 5 and re.match(r'\d+', sys.argv[2]):
    prog, data_file, TrainSize, Trainable, min_df = sys.argv
elif len(sys.argv) == 4 and re.match(r'\d+', sys.argv[2]): 
    prog, data_file, TrainSize, Trainable = sys.argv
elif len(sys.argv) == 3 and re.match(r'\d+', sys.argv[2]):
    prog, data_file, TrainSize = sys.argv
else:
    data_file = input("Enter the dataset file: ")
    TrainSize = input("Enter the number of training examples: ")
    Trainable = input("Enter if word embedding layer is trainable (yes/no): ")
    min_df = input("min_df (terms whose tf lower than min_df is ignored): ")

min_df = int(min_df)
TrainSize = int(TrainSize)
Trainable = False if Trainable == 'no' else True

#data_file = 'Datasets/PCWeb_All.txt'
#TrainSize = 1190

#data_file = 'Datasets/PCNews_All.txt'
#TrainSize = 644

#data_file = 'Datasets/joke_All.txt'
#TrainSize = 2389

#data_file = 'Datasets/CTC_All_sl.txt'
#TrainSize = 19901

#data_file = 'Datasets/Reuters_All_sl.txt'
#TrainSize = 6561

#data_file = '20news-bydate/20news-bydate_All.txt'
#TrainSize = 11270

#data_file = 'Datasets/CnonC_All.txt'
#TrainSize = 232


Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/kg/jcdj05xn20144cv9kwywp26r0000gn/T/jieba.cache
Loading model cost 0.687 seconds.
Prefix dict has been built succesfully.


It takes 6.54 seconds to import packages.
sys.argv: ['/Users/sam/anaconda3/envs/py3.6/lib/python3.6/site-packages/ipykernel_launcher.py', '-f', '/Users/sam/Library/Jupyter/runtime/kernel-e98056d7-3645-4bc7-8885-724c596b84f8.json']
Enter the dataset file: Datasets/CnonC_All.txt
Enter the number of training examples: 232
Enter if word embedding layer is trainable (yes/no): yes
min_df (terms whose tf lower than min_df is ignored): 1


In [3]:

def clean_text(text): 
    '''
    Given a raw text string, return a clean text string.
    Example: 
        input:  "Years  passed. 多少   年过 去 了 。  "
        output: "years passed.多少年过去了。"
    '''
# The next 4 lines are copied from: https://github.com/ahmedbesbes/overview-and-benchmark-of-traditional-and-deep-learning-models-in-text-classification
    text = re.sub(r'http\S+', '', text) # URL
    text = re.sub(r"#(\w+)", '', text) # hashtag in tweets
    text = re.sub(r"@(\w+)", '', text) # @domain
    #text = re.sub(r'[^\w\s]', '', text) # remove non-word or non-space

    text = text.lower() # 'years  passed. 多少   年过 去 了 。'
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"\'scuse", " excuse ", text)

    # The next (commented-out) line would remove punctuation. \w matches Chinese characters
    #text = re.sub('\W', ' ', text) # 'Years passed  多少 年过 去 了  '
    # The next line removes redundant white space before jieba cuts the text
    text = re.sub(r'\s+([^a-zA-Z0-9.])', r'\1', text) # years passed.多少年过去了。
# see: https://stackoverflow.com/questions/16720541/python-string-replace-regular-expression
    text = text.strip(' ')
    return text
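
The whitespace handling in clean_text() is the least obvious part, so here is a small standalone sketch of just that step. The helper name `squeeze_spaces` is an assumption for illustration; the pattern and the sample string are the ones from clean_text() above.

```python
import re

def squeeze_spaces(text):
    # Same pattern as in clean_text(): delete whitespace that precedes
    # any character other than an ASCII letter, a digit, or '.'
    text = re.sub(r'\s+([^a-zA-Z0-9.])', r'\1', text)
    return text.strip(' ')

mixed = "years  passed. 多少   年过 去 了 。  "
print(squeeze_spaces(mixed))
```

The effect is that spaces between English words collapse to a single space, while spaces between Chinese characters disappear, which gives jieba a contiguous Chinese string to segment.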


In [4]:

# This code block sets 3 variables: PunctuationStr, Punctuations, StopWords

# The English punctuations are from: https://keras.io/preprocessing/text/
PunctuationStr = '''
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~
！＃＄％＆\、（）＊＋，。/：；「」『』　■．．・…’’“”〝〞‵′。
''' # there is a Chinese white space before '■'
Punctuations = [x for x in PunctuationStr]

# Combine Chinese and English stop words
StopWords = '''
的 是 了 和 與 及 或 於 也 並 之 以 在 另 又 該 由 但 仍 就
都 讓 要 把 上 來 說 從 等 
我 你 他 妳 她 它 您 我們 你們 妳們 他們 她們 
並有 並可 可以 可供 提供 以及 包括 另有 另外 此外 除了 目前 現在 仍就 就是 
'''.split()
# StopWords.extend(['　', '■']) # these punctuations belong to r'\W'
# https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py
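# Import from the public location; the sklearn.feature_extraction.stop_words module
# was removed in later scikit-learn releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS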
StopWords.extend(list(ENGLISH_STOP_WORDS))
StopWords.extend('''said told '''.split())
#print(StopWords)

In [5]:
ps = PorterStemmer()
wnl = WordNetLemmatizer()

def count_words(words):
    # Count the tokens that do not start with white space
    WL = [w for w in words if not re.match(r'\s', w)]
#    print("After remove white space, this list is:", WL)
    return len(WL)

def clean_words(words):
#    print("After jieba.lcut():", words)
#    WL = [ w 
#    WL = [ ps.stem(w)
    # Note: wnl.lemmatize() needs the NLTK WordNet data, e.g., nltk.download('wordnet')
    WL = [ wnl.lemmatize(w)
            for w in words if (not re.match(r'\s', w)) and 
                (w not in StopWords) and
                (w not in Punctuations) and 
                (not re.match(r'^\W$', w)) and # can replace above line to remove single non-word
                (not re.match(r'^\d*\.?\d*%?$', w)) and # skip if decimal numbers or percentage
                (not re.match(r'^\.+$', w)) and # skip if '.', added on 2019/02/04
                (not re.match(r'^[a-z_]$', w)) # skip if single lower case term
         ]
    return WL
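
To see which tokens the pattern-based conditions in clean_words() discard, the following sketch applies only those regular expressions (no stop word list, no lemmatizer) to a made-up token list. The helper name `keep_token` and the tokens are assumptions for illustration.

```python
import re

def keep_token(w):
    """Return True if the token survives the pattern filters used in clean_words()."""
    return not (re.match(r'\s', w)                 # white space
                or re.match(r'^\W$', w)            # a single non-word character
                or re.match(r'^\d*\.?\d*%?$', w)   # integers, decimals, percentages
                or re.match(r'^\.+$', w)           # runs of dots
                or re.match(r'^[a-z_]$', w))       # a single lower-case letter or '_'

# Made-up tokens, in the shape jieba.lcut() might return them
tokens = ["工程", " ", "，", "3.14", "50%", "...", "a", "ab", "2018"]
kept = [w for w in tokens if keep_token(w)]
print(kept)
```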

In [6]:

CharList, WordList = [], []

def Load_Data(file): # load the dataset
    labels, texts = [], []
    i = 0
    with open(file, encoding='UTF-8') as f: # close the file as soon as it is read
        lines = f.read().split("\n")
    for line in lines:
        if line == '': continue
        (label, text) = line.split("\t") # load my own data
        labels.append(label) # assume single label classification
        CharList.append(len(text))
#        print("Before text=", text)
#        texts.append(text); continue # 2019/02/03 for joke corpus, used only once for saving train and test files
        words_list = jieba.lcut(clean_text(text)) # https://github.com/fxsjy/jieba
        words = clean_words(words_list)
        WordList.append(count_words(words_list))
        texts.append(" ".join(words)) # should be a string of words
#        print("After clean_words(), texts[{}]='{}'".format(i, texts[i]))
#        print("word:", words)
#        i += 1
#        if i>=5: break

# create a dataframe using texts and labels
    DF = pandas.DataFrame()
    DF['text'] = texts
    DF['label'] = labels
    return DF

time_LoadData = time.time()

All_DF = Load_Data(data_file)

#CharList = [len(c) for c in All_DF['text']]
#WordList = [len(x.split()) for x in All_DF['text']]
TextCharsLen = max(CharList)
TextWordsLen = max(WordList)
TextCharsAvg = sum(CharList) / len(CharList)
TextWordsAvg = sum(WordList) / len(WordList)

print("TextCharsLen =", TextCharsLen, "\nTextWordsLen =", TextWordsLen)
print("Avg Text Chars = %3d"%TextCharsAvg)
print("Avg Text Words = %3d"%TextWordsAvg)

print("It takes %4.2f seconds to load, segment, and clean data."%(time.time()-time_LoadData))


TextCharsLen = 53 
TextWordsLen = 30
Avg Text Chars =  19
Avg Text Words =   8
It takes 1.28 seconds to load, segment, and clean data.


## 2. Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. (https://en.wikipedia.org/wiki/Exploratory_data_analysis)

EDA is a very important step which takes place after feature engineering and acquiring data and it should be done before any modeling. This is because it is very important for a data scientist to be able to understand the nature of the data without making assumptions. (https://datascienceguide.github.io/exploratory-data-analysis)
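
For a classification dataset, a first EDA step is usually to look at how the documents are distributed over the classes. The sketch below does this with the standard library only; the label list is made up, and the "imbalance ratio" is just one simple summary chosen here for illustration.

```python
from collections import Counter

labels = ["A", "B", "A", "A", "C", "B", "A"]  # made-up labels

counts = Counter(labels)
total = sum(counts.values())
# One line per class: name, number of documents, share of the dataset
for label, n in counts.most_common():
    print("%-3s %3d  %5.1f%%" % (label, n, 100.0 * n / total))

# Ratio between the largest and the smallest class; 1.0 means perfectly balanced
imbalance = max(counts.values()) / min(counts.values())
print("imbalance ratio:", imbalance)
```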

In [7]:
cat2num = pandas.Series(All_DF['label']).value_counts()
print(cat2num.sort_values(ascending=False))

# The next function produces a similar report, but assumes multiple label columns,
#   where each column (category) holds 0 or 1 values
# See: https://towardsdatascience.com/multi-label-text-classification-with-scikit-learn-30714b7819c5
def show_stats(All_DF):
    df_label = All_DF.drop(['text'], axis=1)
    counts = []
    categories = list(df_label.columns.values)
    print(categories)
    for c in categories:
        counts.append((c, df_label[c].sum()))
    df_stats = pandas.DataFrame(counts, columns=['category', 'number_of_texts'])
    print(df_stats)


002-非營建類    166
001-營建類     166
Name: label, dtype: int64


Next, we will split the dataset into training and validation sets so that we can train and test a classifier.

Also, we will encode our target column so that it can be used in machine learning models.
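
The two steps can be sketched without any library as follows. The texts, label names, and `train_size` are made up; the point is that the split keeps the file order (no shuffling) and that one label-to-integer mapping, built from the sorted label names, is shared by the training and test sets.

```python
labels = ["002-b", "001-a", "002-b", "001-a", "001-a"]  # made-up label names
texts = ["t0", "t1", "t2", "t3", "t4"]                   # made-up documents
train_size = 3

# Ordered (non-shuffled) split: the first train_size rows are the training set
train_x, test_x = texts[:train_size], texts[train_size:]
train_labels, test_labels = labels[:train_size], labels[train_size:]

# Build ONE mapping from the sorted label names and reuse it for both sets
classes = sorted(set(labels))
label2id = {name: i for i, name in enumerate(classes)}
train_y = [label2id[name] for name in train_labels]
test_y = [label2id[name] for name in test_labels]

print(label2id)
print(train_y, test_y)
```

Sharing the mapping matters when a class is missing from one of the two sets: separate mappings would then assign different integers to the same label name.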

In [8]:
# split the dataset into training and validation datasets 
# For shuffle=False to be effective, scikit-learn 0.19 or later is needed
# pip install --upgrade scikit-learn # but on my laptop, this does not work
# The next call does not work because my scikit-learn is version 0.18.1
#trainText_x, testText_x, train_yL, test_yL = model_selection.train_test_split(
#    All_DF['text'], All_DF['label'], train_size=TrainSize, shuffle = False, stratify = None)


# https://stackoverflow.com/questions/43838052/how-to-get-a-non-shuffled-train-test-split-in-sklearn
def non_shuffling_train_test_split(X, y, train_size):
    # train_size is an integer: the first train_size rows become the training set
    # and the remaining rows become the test set, in the original order.
    i = int(train_size)
    X_train, X_test = X.iloc[:i], X.iloc[i:]
    y_train, y_test = y.iloc[:i], y.iloc[i:]
    return X_train, X_test, y_train, y_test


trainText_x, testText_x, train_yL, test_yL = non_shuffling_train_test_split(
    All_DF['text'], All_DF['label'], train_size=TrainSize)

# label encode the target variable 
# http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
LabEncoder = preprocessing.LabelEncoder() # convert label name to label int
LabEncoder.fit(All_DF['label']) # fit once so that train and test share the same label-to-int mapping
train_y = LabEncoder.transform(train_yL)
test_y = LabEncoder.transform(test_yL)
Num_Classes = len(LabEncoder.classes_)

In [None]:
# On 2019/02/03: This function is used only once to obtain the train and test files. 
def Save_Split_Train_Test(All_DF):
# https://stackoverflow.com/questions/34842405/parameter-stratify-from-method-train-test-split-scikit-learn
    trainText_x, testText_x, train_yL, test_yL = model_selection.train_test_split(
    All_DF['text'], All_DF['label'], test_size=0.3, stratify=All_DF['label'], random_state=42)
    #X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# https://stackoverflow.com/questions/34318141/zip-pandas-dataframes-into-a-new-dataframe
    #train_DF = pandas.concat([train_yL, trainText_x.str.split().str.join('')], axis=1)
    #test_DF = pandas.concat([test_yL, testText_x.str.split().str.join('')], axis=1)
    # Because of the original texts in Load_Data(), use next 2 lines rather than the above 2 lines
    train_DF = pandas.concat([train_yL, trainText_x], axis=1)
    test_DF = pandas.concat([test_yL, testText_x], axis=1)
# https://stackoverflow.com/questions/16923281/pandas-writing-dataframe-to-csv-file
    train_DF.to_csv('Datasets/CnonC_train.txt', sep='\t', encoding='utf-8', index=False, header=False)
    test_DF.to_csv('Datasets/Cnonc_test.txt', sep='\t', encoding='utf-8', index=False, header=False)
    return trainText_x, testText_x, train_yL, test_yL

# No need to call this function, if you have the train and test files already
# trainText_x, testText_x, train_yL, test_yL = Save_Split_Train_Test(All_DF)

In [9]:
print("All data: ", type(All_DF['label']), ", shape:", All_DF['label'].shape, "\n", All_DF['label'][:3])
#print("train_yL: ", type(train_yL), ", shape:", train_yL.shape, "\n", train_yL[:3])
#print("train_y:  ", type(train_y),  ", shape:", train_y.shape, "\n", "Label ID:", train_y[:5])
# See: https://stackoverflow.com/questions/38309729/count-unique-values-with-pandas-per-groups/38309823
print("train_yL: ", type(train_yL), ", shape:", train_yL.shape, ", unique:", train_yL.nunique(), "\n", train_yL.value_counts())
print("test_yL:  ", type(test_yL),  ", shape:", test_yL.shape, ", unique:", test_yL.nunique(), "\n", test_yL.value_counts())

print("Num of Classes (Categories or Labels):", Num_Classes)
print("Label Names [:5]:", LabEncoder.classes_[:5]) # print label names
print("Label Names transformed[:5]:", LabEncoder.transform(LabEncoder.classes_[:5]))
print("Label inverse transform [0, 1]:", LabEncoder.inverse_transform([0, 1]))

All data:  <class 'pandas.core.series.Series'> , shape: (332,) 
 0     001-營建類
1    002-非營建類
2    002-非營建類
Name: label, dtype: object
train_yL:  <class 'pandas.core.series.Series'> , shape: (232,) , unique: 2 
 001-營建類     116
002-非營建類    116
Name: label, dtype: int64
test_yL:   <class 'pandas.core.series.Series'> , shape: (100,) , unique: 2 
 001-營建類     50
002-非營建類    50
Name: label, dtype: int64
Num of Classes (Categories or Labels): 2
Label Names [:5]: ['001-營建類' '002-非營建類']
Label Names transformed[:5]: [0 1]
Label inverse transform [0, 1]: ['001-營建類' '002-非營建類']


In [10]:
print(type(LabEncoder), LabEncoder)
print(type(train_y), type(test_y))
m, n = len(train_y), len(test_y)
print("train={}, test={}, sum={}, ratio={}%".format(m, n, m+n, int(0.5+(n/(m+n)*100))))
print(train_y[:10], test_y[:10])

<class 'sklearn.preprocessing.label.LabelEncoder'> LabelEncoder()
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
train=232, test=100, sum=332, ratio=30%
[0 1 1 1 0 1 0 0 0 0] [1 1 1 1 0 0 1 1 1 0]


## 2. Feature Engineering
The next step is the feature engineering step. In this step, raw text data will be transformed into feature vectors and new features will be created using the existing dataset. 

### 2.1 Count Vectors as features
Count Vector is a document-to-term matrix notation of the dataset (corpus) in which:
1. every row represents a document from the corpus, 
2. every column represents a term from the corpus, and 
3. every cell represents the frequency count of a particular term in a particular document.

A similar term-to-document matrix is illustrated in the figure below:
![](https://ahmedbesbes.com/images/article_5/tfidf.jpg)
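
To make the matrix layout concrete, here is a hand-rolled sketch that builds a document-to-term count matrix for three made-up, already segmented documents. It is only an illustration of the idea, not what CountVectorizer does internally (there is no stop word removal or document-frequency filtering here).

```python
from collections import Counter

docs = ["工程 改善 工程", "設備 改善", "工程 設備 設備"]  # made-up, already segmented

tokenized = [d.split() for d in docs]
# One column per distinct term, in sorted order
vocabulary = sorted({w for doc in tokenized for w in doc})
term2col = {w: j for j, w in enumerate(vocabulary)}

# One row per document; each cell is the count of that term in that document
matrix = []
for doc in tokenized:
    row = [0] * len(vocabulary)
    for w, n in Counter(doc).items():
        row[term2col[w]] = n
    matrix.append(row)

print(vocabulary)
for row in matrix:
    print(row)
```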

In [11]:

time_CountVector = time.time()

def Create_CountVector():

# Create a count vectorizer object.
# It takes care of the steps of preprocessing, tokenization, stop word removal, ...
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
    count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}', 
        stop_words=StopWords, max_df=0.98, min_df=min_df)
    count_vect.fit(All_DF['text'])

# transform the training and validation data using count vectorizer object
    xtrain_count = count_vect.transform(trainText_x)
    xtest_count = count_vect.transform(testText_x)

    print("It takes %4.2f seconds to convert count vectors."%(time.time()-time_CountVector))

    return(xtrain_count, xtest_count, count_vect)

(xtrain_count, xtest_count, count_vect) = Create_CountVector()

It takes 0.02 seconds to convert count vectors.


In [12]:
def Print_count_vect():
    print(type(count_vect), count_vect)
    print(type(xtrain_count), type(xtest_count))
    print("xtrain_count.shape:", xtrain_count.shape)
    print("xtest_count.shape :", xtest_count.shape)
# https://stackoverflow.com/questions/36967666/transform-scipy-sparse-csr-to-pandas
# from scipy.sparse.csr import csr_matrix
    # A = csr_matrix([[1, 0, 2], [0, 3, 0]]); print(A)
    # df = pd.DataFrame(A.toarray()); print(df)
    #print(xtrain_count)
    #print(xtest_count[0, 0:10])
    print("\nUsed stop words: ", count_vect.get_stop_words())
    
Print_count_vect()

<class 'sklearn.feature_extraction.text.CountVectorizer'> CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.98, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['的', '是', '了', '和', '與', '及', '或', '於', '也', '並', '之', '以', '在', '另', '又', '該', '由', '但', '仍', '就', '都', '讓', '要', '把', '上', '來', '說', '從', '等', '我', '你', '他', '妳', '她', '它', '您', '我們', '你們', '妳們', '他們', '她們', '並有', '並可', '可以', '可供', '提供', '以及', '包括', '另有', '另外', '此外', '除了', '目前', '現在', ...ds', 'anyone', 'at', 'give', 'of', 'only', 'hers', 'from', 'take', 'had', 'perhaps', 'said', 'told'],
        strip_accents=None, token_pattern='\\w{1,}', tokenizer=None,
        vocabulary=None)
<class 'scipy.sparse.csr.csr_matrix'> <class 'scipy.sparse.csr.csr_matrix'>
xtrain_count.shape: (232, 1304)
xtest_count.shape : (100, 1304)

Used stop words:  frozenset({'her', 'em

### 2.2 TF-IDF Vectors as features
The TF-IDF score represents the relative importance of a term in a document and in the entire corpus. It is the product of two terms: the first is the normalized Term Frequency (TF), and the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
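
The two formulas above can be worked through on a tiny made-up corpus. The sketch below implements them literally; it assumes every queried term occurs in at least one document. Note that scikit-learn's TfidfVectorizer, with its default settings, uses raw counts for TF, a smoothed IDF, and L2 normalization of each row, so its numbers will differ from this textbook version.

```python
import math

docs = [["工程", "改善"], ["工程", "設備", "設備"]]  # made-up, already tokenized

def tf(term, doc):
    # (number of times the term appears in the document) / (number of terms in the document)
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log_e(total number of documents / number of documents containing the term)
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("設備", docs[1], docs))  # (2/3) * ln(2/1)
print(tfidf("工程", docs[1], docs))  # the term is in every document, so IDF is ln(1) = 0
```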

TF-IDF vectors can be generated at different levels of input tokens (words, characters, n-grams):

1. Word Level TF-IDF : Matrix representing the tf-idf scores of every term in different documents
2. N-gram Level TF-IDF : N-grams are combinations of N consecutive terms. This matrix represents the tf-idf scores of N-grams
3. Character Level TF-IDF : Matrix representing the tf-idf scores of character level n-grams in the corpus

A word level n-gram can be illustrated in the figure below (from https://ahmedbesbes.com/images/article_5/ngrams.png):
![](https://ahmedbesbes.com/images/article_5/ngrams.png)

A character level n-gram can be illustrated in the figure below (from https://ahmedbesbes.com/images/article_5/ngrams_char.jpg):
![](https://ahmedbesbes.com/images/article_5/ngrams_char.jpg)
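
As a plain-Python illustration of the two kinds of n-grams, the sketch below enumerates word level and character level n-grams for n from 2 to 3, the same range used in this notebook. The function names and the sample inputs are assumptions for illustration.

```python
def word_ngrams(tokens, n_min, n_max):
    """All runs of n consecutive tokens, for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def char_ngrams(text, n_min, n_max):
    """All runs of n consecutive characters, for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams

tokens = "text classification in python".split()
print(word_ngrams(tokens, 2, 3))
print(char_ngrams("工程改善", 2, 3))
```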

In [13]:
time_TfidfVector = time.time()

def Create_TFxIDF():

# word level tf-idf
    #tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=10000)
    tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', 
        stop_words=StopWords, max_df=0.98, min_df=min_df, max_features=10000)
    tfidf_vect.fit(All_DF['text'])
    xtrain_tfidf = tfidf_vect.transform(trainText_x)
    xtest_tfidf = tfidf_vect.transform(testText_x)
    print("xtrain_tfidf.shape:", xtrain_tfidf.shape)
    print("xtest_tfidf.shape :", xtest_tfidf.shape)

# word level ngram tf-idf 
    tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', 
                    stop_words=StopWords, max_df=0.98, min_df=min_df,
                    ngram_range=(2,3), max_features=10000)
    tfidf_vect_ngram.fit(All_DF['text'])
    xtrain_tfidf_ngram =  tfidf_vect_ngram.transform(trainText_x)
    xtest_tfidf_ngram =  tfidf_vect_ngram.transform(testText_x)
    print("xtrain_tfidf_ngram.shape:", xtrain_tfidf_ngram.shape)
    print("xtest_tfidf_ngram.shape :", xtest_tfidf_ngram.shape)


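# character level ngram tf-idf
# Note: token_pattern and stop_words only apply when analyzer='word',
#   so they have no effect on the character analyzer below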
    tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', token_pattern=r'\w{1,}', 
                    stop_words=StopWords, max_df=0.98, min_df=min_df,
                    ngram_range=(2,3), max_features=10000)
    tfidf_vect_ngram_chars.fit(All_DF['text'])
    xtrain_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(trainText_x) 
    xtest_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(testText_x) 
    print("xtrain_tfidf_ngram_chars.shape:", xtrain_tfidf_ngram_chars.shape)
    print("xtest_tfidf_ngram_chars.shape :", xtest_tfidf_ngram_chars.shape)

    print("It takes %4.2f seconds to convert 3 TFxIDF vectors."%(time.time()-time_TfidfVector))

    return (xtrain_tfidf, xtest_tfidf, 
             xtrain_tfidf_ngram, xtest_tfidf_ngram,
             xtrain_tfidf_ngram_chars, xtest_tfidf_ngram_chars,
            tfidf_vect, tfidf_vect_ngram, tfidf_vect_ngram_chars)

(xtrain_tfidf, xtest_tfidf, 
 xtrain_tfidf_ngram, xtest_tfidf_ngram,
 xtrain_tfidf_ngram_chars, xtest_tfidf_ngram_chars,
 tfidf_vect, tfidf_vect_ngram, tfidf_vect_ngram_chars) = Create_TFxIDF()

xtrain_tfidf.shape: (232, 1304)
xtest_tfidf.shape : (100, 1304)
xtrain_tfidf_ngram.shape: (232, 3482)
xtest_tfidf_ngram.shape : (100, 3482)
xtrain_tfidf_ngram_chars.shape: (232, 6583)
xtest_tfidf_ngram_chars.shape : (100, 6583)
It takes 0.08 seconds to convert 3 TFxIDF vectors.


### 2.3 Word Embeddings
A word embedding is a way of representing words and documents with dense vectors. The position of a word within the vector space is learned from text and is based on the words that surround it when it is used. Word embeddings can be trained on the input corpus itself, or pre-trained word embeddings such as GloVe, FastText, and Word2Vec can be used. Any one of them can be downloaded and used for transfer learning. One can read more about word embeddings at: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/.

The following snippet shows how to use pre-trained word embeddings in the model. There are four essential steps:

1. Loading the pre-trained word embeddings
2. Creating a tokenizer object
3. Transforming text documents into sequences of tokens and padding them
4. Creating a mapping between tokens and their respective embeddings

You can download the pre-trained word embeddings from: https://fasttext.cc/docs/en/pretrained-vectors.html.
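
Before the real implementation, the four steps can be traced on toy data with the standard library only. Everything below is made up for illustration: the 3-dimensional vectors, the two texts, and the helper `pad`. The word ids are assigned by descending frequency starting at 1, and the padding is added at the front, which is also the default behaviour of Keras `pad_sequences`; id 0 is reserved for padding.

```python
import io
from collections import Counter

# Made-up 3-dimensional vectors in the ".vec" text layout: a header line, then "word v1 v2 v3"
vec_file = io.StringIO("2 3\n工程 0.1 0.2 0.3\n設備 0.4 0.5 0.6\n")

# Step 1: load the pre-trained vectors into a dictionary
embeddings_index = {}
for i, line in enumerate(vec_file):
    if i == 0:
        continue  # skip the header line "2 3"
    values = line.split()
    embeddings_index[values[0]] = [float(v) for v in values[1:]]
dim = 3

# Step 2: build word ids, most frequent word first
texts = ["工程 改善 工程", "設備 改善 工程"]
freq = Counter(w for t in texts for w in t.split())
word_index = {w: k for k, (w, _) in enumerate(freq.most_common(), start=1)}

# Step 3: convert texts to id sequences and pad them at the front
def pad(seq, maxlen):
    seq = seq[-maxlen:]  # keep the last maxlen ids if the sequence is too long
    return [0] * (maxlen - len(seq)) + seq

sequences = [pad([word_index[w] for w in t.split()], 4) for t in texts]

# Step 4: row k of the matrix holds the vector of the word with id k;
# words without a pre-trained vector keep an all-zero row
embedding_matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
for w, k in word_index.items():
    if w in embeddings_index:
        embedding_matrix[k] = embeddings_index[w]

print(word_index)
print(sequences)
print(embedding_matrix)
```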

In [14]:
time_ReadWordEmbed = time.time()

def Read_Word_Embedding():
# load the pre-trained word-embedding vectors 
    embeddings_index = {}
# https://stackoverflow.com/questions/47118678/difference-between-fasttext-vec-and-bin-file
# For Chinese Word2Vec, it is now (2020/02/07) available 
#  at: https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.zh_classical.vec
    for i, line in enumerate(open('/Users/sam/data_exp/Corpora/TextClassification/wiki.zh_classical/wiki.zh_classical.vec', encoding='utf-8')):
        if i==0: continue # The first line is: "10696 300"
        values = line.split()
        embeddings_index[values[0]] = numpy.asarray(values[1:], dtype='float32')
    print("Total Chinese embedding words: {}, embedding vector size: {}".format(i,len(values)-1))
# Chinese Embedding words: 10696, embedding vector size: 300

# For English Word2Vec, it is now (2020/02/07) available 
#  at: https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.simple.vec
    for i, line in enumerate(open('/Users/sam/data_exp/Corpora/TextClassification/wiki.simple/wiki.simple.vec', encoding='utf-8')):
        if i==0: continue # The first line is: "111051 300"
        values = line.split()
        try:
            embeddings_index[values[0]] = numpy.asarray(values[1:], dtype='float32')
        # ValueError: could not convert string to float: 'united'
        except ValueError:
            print("Warning: fail to read line at line", i)
        # Warning: fail to read line at line 62776
        # Warning: fail to read line at line 105579
        # Warning: fail to read line at line 110602
    embedding_vector_size = len(values)-1
    print("Total English embedding words: {}, embedding vector size: {}".format(i, embedding_vector_size))


    print("It takes %4.2f seconds to load embedding vectors."%(time.time()-time_ReadWordEmbed))

    return (embeddings_index, embedding_vector_size)

(embeddings_index, embedding_vector_size) = Read_Word_Embedding()

Total Chinese embedding words: 10696, embedding vector size: 300
Total English embedding words: 111051, embedding vector size: 300
It takes 6.84 seconds to load embedding vectors.


In [15]:
from keras.preprocessing import text, sequence

time_BuildWordEmbed = time.time()

# Do not call Remove_word_index() to change word_index, 
#   because the index has been assigned when calling token.fit_on_texts().
#   (word_index is also local to Create_Word_Embedding(), so it is not visible here.)
def Remove_word_index():
    print("Before removing stop words, word_index length:", len(word_index))
#https://stackoverflow.com/questions/11277432/how-to-remove-a-key-from-a-python-dictionary
    for w in StopWords: word_index.pop(w, None)
# English punctuations can be found in print(string.punctuation)
    for w in Punctuations: word_index.pop(w, None)
    print("After  removing stop words, word_index length:", len(word_index))

def Create_Word_Embedding():
# create a tokenizer 
# for details, see: https://keras.io/preprocessing/text/ and
#   Tokenizer API at: https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/
# And: https://faroit.github.io/keras-docs/1.2.2/preprocessing/text/
    token = text.Tokenizer(filters=PunctuationStr)
    token.fit_on_texts(All_DF['text']) # split on ' ', lowercasing English
# fit_on_texts() does not provide stop words removal, what a pity!
# fit_on_texts() create 4 attributes, see https://faroit.github.io/keras-docs/1.2.2/preprocessing/text/
    word_index = token.word_index # a dictionary with ('word', integer_id) 
    word_index_len = len(word_index)
    print("Number of words in TextWordsLen:", TextWordsLen)
    print("Number of (word, id) pairs in word_index:", word_index_len)
    print("word_index.items()[:5]:", list(word_index.items())[:5]) 

# convert text to sequence of tokens and pad them to ensure equal length vectors 
    train_seq_x = sequence.pad_sequences(token.texts_to_sequences(trainText_x), maxlen=TextWordsLen)
    test_seq_x = sequence.pad_sequences(token.texts_to_sequences(testText_x), maxlen=TextWordsLen)

    print("train_seq_x.shape:", train_seq_x.shape)
    print("test_seq_x.shape: ", test_seq_x.shape)
    print("train_seq_x:\n", train_seq_x[:2])

# create token-embedding mapping
    OOV, s = 0, 0
    embedding_matrix = numpy.zeros( (word_index_len + 1, embedding_vector_size) )
    for word, i in word_index.items():
        # Let the deep NN learn the features,
        # so the next 2 lines were commented out on 2018/09/07
        #if word in StopWords: s+=1; continue # added by Sam Tseng on 2018/09/05
        #if word in Punctuations: s+=1; continue # added by Sam Tseng on 2018/09/05
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
        else:
            OOV += 1

# Number of (key, value) pairs in PCWeb dataset's word_index: 7028
# Number of terms not in the Chinese pretrained word embedding: 5502
    print("Number of terms not in the pretrained word embedding:", OOV)
    print("Number of StopWords and Punctuations being removed:", s)
    print("embedding_matrix.shape:", embedding_matrix.shape)

    print("\nIt takes %4.2f seconds to build Chinese and English embedding vectors."%(time.time()-time_BuildWordEmbed))

    return (train_seq_x, test_seq_x, word_index_len, embedding_matrix) 

(train_seq_x, test_seq_x, word_index_len, embedding_matrix) = Create_Word_Embedding()

Using TensorFlow backend.


Number of words in TextWordsLen: 30
Number of (word, id) pairs in word_index: 1304
word_index.items()[:5]: [('工程', 1), ('改善', 2), ('設備', 3), ('八十七年度', 4), ('八十八年度', 5)]
train_seq_x.shape: (232, 30)
test_seq_x.shape:  (100, 30)
train_seq_x:
 [[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0 356 357  32 358 359 360]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0 163 164 165 166 167 168 361 362  46  33]]
Number of terms not in the pretrained word embedding: 937
Number of StopWords and Punctuations being removed: 0
embedding_matrix.shape: (1305, 300)

It takes 0.02 seconds to build Chinese and English embedding vectors.


### 2.4 Text / NLP based features
A number of extra text-based features can also be created, which are sometimes helpful for improving text classification models. Some examples are:

1. Word Count of the documents – total number of words in the documents
2. Character Count of the documents – total number of characters in the documents
3. Average Word Density of the documents – average length of the words used in the documents
4. Punctuation Count of the documents – total number of punctuation marks in the documents
5. Upper Case Count of the documents – total number of upper case words in the documents
6. Title Word Count of the documents – total number of proper case (title) words in the documents
7. Frequency distribution of Part of Speech Tags:
   1. Noun Count
   2. Verb Count
   3. Adjective Count
   4. Adverb Count
   5. Pronoun Count

These features are highly experimental and should be used only when they suit the problem at hand.


time_NLPstats = time.time()

All_DF['char_count'] = All_DF['text'].apply(len)

All_DF['word_count'] = All_DF['text'].apply(lambda x: len(x.split()))

All_DF['word_density'] = All_DF['char_count'] / (All_DF['word_count']+1)

All_DF['punctuation_count'] = All_DF['text'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation))) 

All_DF['title_word_count'] = All_DF['text'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))

All_DF['upper_case_word_count'] = All_DF['text'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))

print("\nIt takes %4.2f seconds to build NLP features."%(time.time()-time_NLPstats))


#print("char_count:", All_DF['char_count'][:2])
#print(All_DF.iloc[0:2, 1:])
print(All_DF.head())


time_NLPfeatures = time.time()

pos_family = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

# Count the words in a given text whose part-of-speech tag belongs to the family `flag`
def check_pos_tag(x, flag):
    cnt = 0
    try:
        wiki = textblob.TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_family[flag]:
                cnt += 1
    except Exception:
        # if tagging fails, fall back to the count obtained so far
        pass
    return cnt

All_DF['noun_count'] = All_DF['text'].apply(lambda x: check_pos_tag(x, 'noun'))
All_DF['verb_count'] = All_DF['text'].apply(lambda x: check_pos_tag(x, 'verb'))
All_DF['adj_count'] = All_DF['text'].apply(lambda x: check_pos_tag(x, 'adj'))
All_DF['adv_count'] = All_DF['text'].apply(lambda x: check_pos_tag(x, 'adv'))
All_DF['pron_count'] = All_DF['text'].apply(lambda x: check_pos_tag(x, 'pron'))

print("\nIt takes %4.2f seconds to build part-of-speech features."%(time.time()-time_NLPfeatures))


print(All_DF.head())

### 2.5 Topic Models as features
Topic Modelling is a technique for identifying groups of words (called topics) in a collection of documents that best capture the information in the collection.

Here Latent Dirichlet Allocation (LDA) is used for generating Topic Modelling Features. LDA is an iterative model which starts from a fixed number of topics. Each topic is represented as a distribution over words, and each document is then represented as a distribution over topics. Although the tokens themselves are meaningless, the probability distributions over words provided by the topics provide a sense of the different ideas contained in the documents. One can read more about topic modelling at: https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/.

In [16]:
time_TopicModel = time.time()

def Create_Topic_Model():
    # use global var: count_vect, xtrain_count
    # train a LDA Model
    # The next line causes an error:
    #lda_model = decomposition.LatentDirichletAllocation(n_components=20, learning_method='online', max_iter=20)
    lda_model = decomposition.LatentDirichletAllocation(learning_method='online', max_iter=20)

    X_topics = lda_model.fit_transform(xtrain_count)
    topic_word = lda_model.components_ 
    # Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
    vocab = count_vect.get_feature_names()

# view the topic models
    n_top_words = 10
    topic_summaries = []
    for i, topic_dist in enumerate(topic_word):
        topic_words = numpy.array(vocab)[numpy.argsort(topic_dist)][:-(n_top_words+1):-1]
        topic_summaries.append(' '.join(topic_words))
    print(topic_summaries[:20])

    print("\nIt takes %4.2f seconds to train a topic model"%(time.time()-time_TopicModel))

Create_Topic_Model()

['工程 改善 農路 設備 路面 城市 土 道路 拓寬 路', '八十七年度 第二次 工程 採購 用 公告 遴選 雲林縣 宣導 防制', '工程 公告 設備 八十七年度 線 中 69kv 購置 暨 軟體', '工程 整修 設備 八十七年度 技 裝 製 硝酸 儲槽 貨車', '型 件 西裝 長褲 上衣 內視鏡 中 採購案 工程 八十七年度', '工程 計畫 公告 地 用 技術 開發 公司 第二 段', '工程 更新 漁港 里 國小 興建 新建 小型 第四次 吋', '工程 改善 土 城市 登山 步道 八十八年度 附屬 清潔 電話', '維護 工程 路 345kv 電線 輸 明潭 二廠 大觀 鳳林', '保險 研究船 八十八年度 工程 硬質 一號 貴儀 聚乙烯 現期 一九九九年']

It takes 0.40 seconds to train a topic model
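
The function above only prints the top words of each topic. To use topics as features, the per-document topic distribution returned by `fit_transform` can be fed to a classifier. The sketch below illustrates this on a toy English corpus; it assumes a scikit-learn version that accepts `n_components`, and the corpus and the names `docs`, `count_vect_demo` and `doc_topics` are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical); the notebook uses xtrain_count instead.
docs = [
    "road repair project road widening",
    "computer system purchase",
    "road surface improvement project",
    "computer equipment purchase announcement",
]
count_vect_demo = CountVectorizer()
X_counts = count_vect_demo.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                max_iter=20, random_state=0)
# Each row is one document expressed as a distribution over the 2 topics.
doc_topics = lda.fit_transform(X_counts)

print(doc_topics.shape)          # (4, 2)
print(doc_topics.sum(axis=1))    # every row sums to 1 (up to rounding)
```

The rows of `doc_topics` could then be used as a dense feature matrix in place of, or alongside, the count and TF-IDF vectors.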


## 3. Model Building
The final step in the text classification framework is to train a classifier using the features created in the previous step. Many machine learning models can be used to train the final model. We will implement the following classifiers for this purpose:

1. Naive Bayes Classifier
2. Linear Classifier
3. Support Vector Machine
4. Bagging Models
5. Boosting Models
6. Shallow Neural Networks
7. Deep Neural Networks
  1. Convolutional Neural Network (CNN)
  2. Long Short-Term Memory (LSTM)
  3. Gated Recurrent Unit (GRU)
  4. Bidirectional RNN
  5. Recurrent Convolutional Neural Network (RCNN)
  6. Other Variants of Deep Neural Networks

Let's implement these models and understand their details. The following utility function trains a model. It accepts the classifier, the feature vectors of the training data, the labels of the training data, and the feature vectors of the test data as inputs. Using these inputs, the model is trained, and the predictions on the test data are returned together with the trained classifier.

In [17]:
def train_predict(classifier, feature_vector_train, label, feature_vector_test):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    # predict the labels on test dataset
    return classifier.predict(feature_vector_test), classifier

In [18]:
def tcfunc(x, n=4): # truncate a number to have n decimal digits
    d = '0' * n
    d = int('1' + d)
# https://stackoverflow.com/questions/4541155/check-if-a-number-is-int-or-float
    if isinstance(x, (int, float)): return int(x * d) / d
    return x
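
As a small worked example of the truncation idea above, the sketch below contrasts cutting off digits with rounding, and shows that non-numeric values such as `None` pass through unchanged. The function name `truncate` is hypothetical; it mirrors the logic of the helper above rather than replacing it.

```python
def truncate(x, n=4):
    """Cut (not round) a number to n decimal digits; pass other values through."""
    factor = 10 ** n
    if isinstance(x, (int, float)):
        return int(x * factor) / factor
    return x

print(truncate(0.99999))     # 0.9999 (digits are cut off)
print(round(0.99999, 4))     # 1.0    (rounding goes up)
print(truncate(None))        # None
```

This is why the result tables further down can show `None` in the Support column while the other columns are cut to four decimals.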

In [19]:
# http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
# itertools ships with the Python standard library (including Python 3.6),
# so it is used here instead of more_itertools.
import itertools
import matplotlib.pyplot as plt
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, numpy.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    #print(cm) # print out confusion matrix

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = numpy.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    # write the value of every cell on top of the image
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [20]:
# use global variables:
#  test_y
#  LabEncoder.classes_
def show_confusion_matrix(predictions):
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(test_y, predictions)
    numpy.set_printoptions(precision=2)

    # Plot non-normalized confusion matrix
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=LabEncoder.classes_ ,
                      title='Confusion matrix, without normalization')
    # Plot normalized confusion matrix
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=LabEncoder.classes_ , normalize=True,
                      title='Normalized confusion matrix')

    plt.show()
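
To make the `normalize=True` option concrete, here is a minimal sketch in plain Python that divides every row of a confusion matrix by its row sum, which is what the NumPy expression in `plot_confusion_matrix` does. The counts are made up for illustration.

```python
# Toy counts (hypothetical): rows = true class, columns = predicted class
cm = [[40, 10],
      [3, 47]]

# Divide each cell by the total of its row
normalized = [[cell / sum(row) for cell in row] for row in cm]
for row in normalized:
    print(["%.2f" % v for v in row])
# ['0.80', '0.20']
# ['0.06', '0.94']
```

After row normalization the diagonal holds the recall of each class. A class with no true samples would give a zero row sum, so real code may need to guard against division by zero.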

In [21]:
# http://scikit-learn.org/stable/modules/model_evaluation.html
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# use a global variable: test_y
def show_Result(predictions):
    print(predictions[:10])

    # http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
#    print("MicroF1 = %0.4f, MacroF1=%0.4f" %
#       (metrics.f1_score(test_y, predictions, average='micro'),
#        metrics.f1_score(test_y, predictions, average='macro')))
# https://stackoverflow.com/questions/455612/limiting-floats-to-two-decimal-points

# http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
    print("\tPrecision\tRecall\tF1\tSupport")
    (Precision, Recall, F1, Support) = list(map(tcfunc, 
        precision_recall_fscore_support(test_y, predictions, average='micro')))
    print("Micro\t{}\t{}\t{}\t{}".format(Precision, Recall, F1, Support))
    (Precision, Recall, F1, Support) = list(map(tcfunc, 
        precision_recall_fscore_support(test_y, predictions, average='macro')))
    print("Macro\t{}\t{}\t{}\t{}".format(Precision, Recall, F1, Support))
    
#    if True:
    if False:
        print(confusion_matrix(test_y, predictions))
        try: 
            print(classification_report(test_y, predictions, digits=4))
        except ValueError:
            print('May be some category has no predicted samples')
        show_confusion_matrix(predictions)
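
The difference between the micro and macro rows printed by `show_Result` can be checked by hand. The sketch below computes both averages in plain Python on made-up labels; the helper `prf` and the toy data are hypothetical. Micro averaging pools the counts of all classes before computing the scores, while macro averaging computes the scores per class and then takes their unweighted mean.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # toy gold labels
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]   # toy predictions
classes = sorted(set(y_true))

counts = {}
for c in classes:
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    counts[c] = (tp, fp, fn)

# Micro: pool the counts over classes, then compute the scores once.
micro = prf(*[sum(v[i] for v in counts.values()) for i in range(3)])
# Macro: compute the scores per class, then take the unweighted mean.
per_class = [prf(*counts[c]) for c in classes]
macro = tuple(sum(m[i] for m in per_class) / len(classes) for i in range(3))

print("Micro\t%.4f\t%.4f\t%.4f" % micro)   # Micro 0.9000 0.9000 0.9000
print("Macro\t%.4f\t%.4f\t%.4f" % macro)   # Macro 0.9286 0.8750 0.8901
```

In single-label classification every wrong prediction is one false positive for the predicted class and one false negative for the true class, so micro precision, recall and F1 coincide, as they do in the outputs below.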


### 3.1 Naive Bayes
Implementing a Naive Bayes model with scikit-learn on different feature sets

Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. One can read more about it at: https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/.
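
To illustrate the independence assumption, the following is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with Laplace smoothing. It is not the scikit-learn implementation used in this notebook; the toy training set, the label names and the function `predict` are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy training data (hypothetical): (text, label)
train = [
    ("road repair project", "works"),
    ("road widening project", "works"),
    ("computer system purchase", "goods"),
    ("computer equipment purchase", "goods"),
]

class_docs = Counter(label for _, label in train)   # documents per class
word_counts = defaultdict(Counter)                  # label -> word -> count
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for c in word_counts.values() for w in c}

def predict(text, alpha=1.0):
    """Return the label with the highest Laplace-smoothed log posterior."""
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / len(train))       # log prior
        for w in text.split():
            if w not in vocab:
                continue                                        # skip unseen words
            # Independence assumption: per-word log likelihoods simply add up.
            score += math.log((word_counts[label][w] + alpha) /
                              (total + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("road project"))        # works
print(predict("purchase computer"))   # goods
```

Because each word contributes its own term to the sum, the order and co-occurrence of words play no role in the decision.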

In [22]:
# The next function does not work
def no_use_most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]
    for coef, feat in topn:
        print(classlabel, feat, coef)

In [23]:
# This function is modified from: https://gist.github.com/bbengfort/044682e76def583a12e6c09209c664a1
# and from: https://stackoverflow.com/questions/26976362/how-to-get-most-informative-features-for-scikit-learn-classifier-for-different-c
# This function only works for binary classes
def most_informative_feature_for_class(vectorizer, classifier, labels, n=10):
    coefs = sorted( # Zip the feature names with the coefs and sort
        zip(classifier.coef_[0], vectorizer.get_feature_names()))
    topn  = zip(coefs[:n], coefs[:-(n+1):-1])
    # Create two columns with most negative and most positive features.
    for (cp, fnp), (cn, fnn) in topn:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (cp, fnp, cn, fnn))

# nltk.classify.NaiveBayesClassifier has a show_most_informative_features()
# You may compare the result here with those at: https://www.twilio.com/blog/2017/09/sentiment-analysis-python-messy-data-nltk.html
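
The slicing in `most_informative_feature_for_class` can be hard to read, so here is a small sketch of the same pairing logic on made-up coefficients and feature names. It sorts the (coefficient, feature) pairs in ascending order and then zips the `n` smallest with the `n` largest.

```python
coef = [-2.8, 1.1, 0.8, -0.8, 0.1, -0.1]                            # made-up weights
names = ["works", "equipment", "system", "harbor", "type", "line"]  # made-up features

n = 2
ranked = sorted(zip(coef, names))            # ascending by coefficient
# ranked[:n] are the n smallest; ranked[:-(n + 1):-1] are the n largest, largest first
pairs = list(zip(ranked[:n], ranked[:-(n + 1):-1]))
for (c_low, f_low), (c_high, f_high) in pairs:
    print("%.4f\t%-10s\t%.4f\t%-10s" % (c_low, f_low, c_high, f_high))
# -2.8000  works       1.1000  equipment
# -0.8000  harbor      0.8000  system
```

For linear models the left column lists features that push towards one class and the right column features that push towards the other. For Naive Bayes the values appear to be log probabilities, which are all negative, so the left column there is better read as the rarest features rather than as negative evidence.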


In [24]:
def Run_NaiveBayes():
    
    time_NaiveBayes = time.time()

# Naive Bayes on Count Vectors   
    predict, clf = train_predict(naive_bayes.MultinomialNB(), xtrain_count, train_y, xtest_count)
    print("\nNB, Count Vectors: ")
    show_Result(predict)
    most_informative_feature_for_class(count_vect, clf, train_yL, n=10)

# Naive Bayes on Word Level TF IDF Vectors
    predict, clf = train_predict(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xtest_tfidf)
    print("\nNB, WordLevel TF-IDF: ")
    show_Result(predict)
    #most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10)
    most_informative_feature_for_class(tfidf_vect, clf, train_y, n=10)

# Naive Bayes on Ngram Level TF IDF Vectors
    predict, clf = train_predict(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, train_y, xtest_tfidf_ngram)
    print("\nNB, N-Gram Vectors: ")
    show_Result(predict)
    most_informative_feature_for_class(tfidf_vect_ngram, clf, train_y, n=10)

# Naive Bayes on Character Level TF IDF Vectors
    predict, clf = train_predict(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xtest_tfidf_ngram_chars)
    print("NB, CharLevel Vectors: ")
    show_Result(predict)
    most_informative_feature_for_class(tfidf_vect_ngram_chars, clf, train_y, n=10)

    print("\nIt takes %4.2f seconds for Naive Bayes."%(time.time()-time_NaiveBayes))

Run_NaiveBayes()


NB, Count Vectors: 
[1 1 1 1 0 0 1 1 1 0]
	Precision	Recall	F1	Support
Micro	0.87	0.87	0.87	None
Macro	0.8713	0.87	0.8698	None
	-7.6760	18k            		-4.5405	設備             
	-7.6760	20k            		-4.6315	工程             
	-7.6760	21k            		-5.0370	八十七年度          
	-7.6760	22k            		-5.1111	公告             
	-7.6760	264k           		-5.3734	電腦             
	-7.6760	266k           		-5.3734	採購             
	-7.6760	29k            		-5.4788	型              
	-7.6760	3              		-5.5966	系統             
	-7.6760	32k            		-5.7301	計畫             
	-7.6760	345kv          		-5.7301	工作             

NB, WordLevel TF-IDF: 
[1 1 1 1 0 0 1 1 1 0]
	Precision	Recall	F1	Support
Micro	0.87	0.87	0.87	None
Macro	0.8701	0.87	0.8699	None
	-7.3784	18k            		-5.5366	設備             
	-7.3784	20k            		-5.9175	八十七年度          
	-7.3784	21k            		-5.9228	採購             
	-7.3784	22k            		-5.9875	電腦             
	-7.3784	264k           		-6.0224	公告     

### 3.2 Linear Classifier
Implementing a Linear Classifier (Logistic Regression)

Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic/sigmoid function. One can read more about logistic regression at: https://www.analyticsvidhya.com/blog/2015/10/basics-logistic-regression/.
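
As a minimal sketch of the logistic/sigmoid function mentioned above, the code below turns a weighted sum of features into a probability for class 1. The weights, the bias and the helper `prob_class1` are made up for illustration and are not the coefficients learned in this notebook.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

weights = {"equipment": 1.1, "works": -2.8}   # made-up coefficients
bias = 0.2                                    # made-up intercept

def prob_class1(tokens):
    # Linear score: bias plus the weight of every known token
    z = bias + sum(weights.get(t, 0.0) for t in tokens)
    return sigmoid(z)

print(sigmoid(0.0))                        # 0.5
print(prob_class1(["equipment"]) > 0.5)    # True
print(prob_class1(["works"]) < 0.5)        # True
```

A positive weight raises the probability of class 1 and a negative weight lowers it, which is how the two columns of features printed after each run can be read.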

In [25]:
def Run_LogisticRegret():
    
    time_LogisticRegret = time.time()

# Linear Classifier on Count Vectors
    predict, clf = train_predict(linear_model.LogisticRegression(), xtrain_count, train_y, xtest_count)
    print("\nLR, Count Vectors: ")
    show_Result(predict)
    most_informative_feature_for_class(count_vect, clf, train_y, n=10)

# Linear Classifier on Word Level TF IDF Vectors
    predict, clf = train_predict(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xtest_tfidf)
    print("\nLR, WordLevel TF-IDF: ")
    show_Result(predict)
    most_informative_feature_for_class(tfidf_vect, clf, train_y, n=10)

# Linear Classifier on Ngram Level TF IDF Vectors
    predict, clf = train_predict(linear_model.LogisticRegression(), xtrain_tfidf_ngram, train_y, xtest_tfidf_ngram)
    print("\nLR, N-Gram Vectors: ")
    show_Result(predict)
    most_informative_feature_for_class(tfidf_vect_ngram, clf, train_y, n=10)

# Linear Classifier on Character Level TF IDF Vectors
    predict, clf = train_predict(linear_model.LogisticRegression(), xtrain_tfidf_ngram_chars, train_y, xtest_tfidf_ngram_chars)
    print("\nLR, CharLevel Vectors: ")
    show_Result(predict)
    most_informative_feature_for_class(tfidf_vect_ngram_chars, clf, train_y, n=10)

    print("\nIt takes %4.2f seconds for Logistic Regression."%(time.time()-time_LogisticRegret))

Run_LogisticRegret()


LR, Count Vectors: 
[1 1 1 1 0 0 1 1 1 0]
	Precision	Recall	F1	Support
Micro	0.91	0.91	0.91	None
Macro	0.9141	0.91	0.9097	None
	-2.8382	工程             		1.1360	設備             
	-0.8465	漁港             		0.8256	系統             
	-0.7653	改善             		0.7800	公告             
	-0.7046	中寮             		0.7002	第三次            
	-0.6529	線              		0.6797	購置             
	-0.6470	里              		0.6727	油庫             
	-0.5590	中城蝕溝           		0.6727	五股             
	-0.5590	控制工程           		0.5623	更新             
	-0.5351	大武崙            		0.5491	型              
	-0.5301	嘉民二路           		0.5286	儀控             

LR, WordLevel TF-IDF: 
[1 1 1 1 0 0 1 1 1 0]
	Precision	Recall	F1	Support
Micro	0.89	0.89	0.89	None
Macro	0.8914	0.89	0.8899	None
	-3.1718	工程             		1.2851	設備             
	-1.6856	改善             		0.8183	採購             
	-0.7672	農路             		0.7652	公告             
	-0.7254	路面             		0.7602	型              
	-0.6905	里              		0.6811	系統             
	-0.68



### 3.3 Implementing a SVM Model
Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression problems. The model finds the best possible hyperplane / line that separates the two classes. One can read more about it at: https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/.
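
A minimal sketch of how a separating hyperplane classifies points, assuming the hyperplane (weights `w` and offset `b`) has already been learned. The values below are made up; the sketch shows only the decision rule and the distance to the hyperplane, not the training procedure of `LinearSVC`.

```python
def decision(w, b, x):
    """Signed score w . x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w, b = [1.0, -1.0], 0.0                        # made-up hyperplane: x1 - x2 = 0
points = {"A": [2.0, 0.5], "B": [0.5, 2.0]}    # made-up 2-D points

norm = sum(wi * wi for wi in w) ** 0.5
results = {}
for name, x in points.items():
    score = decision(w, b, x)
    label = 1 if score >= 0 else 0
    distance = abs(score) / norm               # geometric distance to the hyperplane
    results[name] = (label, round(distance, 3))
    print(name, label, round(distance, 3))
# A 1 1.061
# B 0 1.061
```

Training an SVM amounts to choosing `w` and `b` so that this distance is as large as possible for the training points closest to the hyperplane.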

In [26]:
def Run_SVM():
    
    time_LinearSVM = time.time()

    # Using class_weight='balanced' decreases accuracy, although PCWeb is unbalanced
    #accuracy = train_model(svm.SVC(class_weight='balanced'), xtrain_count, train_y, xtest_count)
    # LinearSVC() performs much better than SVC() here
    predict, clf = train_predict(svm.LinearSVC(), xtrain_count, train_y, xtest_count)
    print("\nSVM, Count Vectors: ")
    show_Result(predict)
    most_informative_feature_for_class(count_vect, clf, train_y, n=10)

    predict, clf = train_predict(svm.LinearSVC(), xtrain_tfidf, train_y, xtest_tfidf)
    print("\nSVM, WordLevel TF-IDF: ")
    show_Result(predict)
    most_informative_feature_for_class(tfidf_vect, clf, train_y, n=10)

    predict, clf = train_predict(svm.LinearSVC(), xtrain_tfidf_ngram, train_y, xtest_tfidf_ngram)
    print("\nSVM, N-Gram Vectors: ")
    show_Result(predict)
    most_informative_feature_for_class(tfidf_vect_ngram, clf, train_y, n=10)

    predict, clf = train_predict(svm.LinearSVC(), xtrain_tfidf_ngram_chars, train_y, xtest_tfidf_ngram_chars)
    print("\nSVM, CharLevel Vectors: ")
    show_Result(predict)
    most_informative_feature_for_class(tfidf_vect_ngram_chars, clf, train_y, n=10)

    print("\nIt takes %4.2f seconds for Linear SVM."%(time.time()-time_LinearSVM))

Run_SVM()


SVM, Count Vectors: 
[1 1 1 1 0 0 1 1 1 0]
	Precision	Recall	F1	Support
Micro	0.91	0.91	0.91	None
Macro	0.9141	0.91	0.9097	None
	-1.1119	工程             		0.4263	第三次            
	-0.6617	中寮             		0.4025	油庫             
	-0.6317	中城蝕溝           		0.4025	五股             
	-0.6317	控制工程           		0.3940	購置             
	-0.6117	嘉民二路           		0.3928	系統             
	-0.5758	漁港             		0.3689	活性碳            
	-0.4376	線              		0.3568	儀控             
	-0.4133	大武崙            		0.3495	設備             
	-0.3934	整建工程           		0.3439	更新             
	-0.3444	里              		0.3086	車三輛            

SVM, WordLevel TF-IDF: 
[1 1 1 1 0 0 1 1 1 0]
	Precision	Recall	F1	Support
Micro	0.89	0.89	0.89	None
Macro	0.8914	0.89	0.8899	None
	-3.2724	工程             		1.0027	設備             
	-1.2323	改善             		0.7850	系統             
	-0.7990	里              		0.7375	公告             
	-0.7938	中寮             		0.7019	型              
	-0.7566	漁港             		0.6825	購置             
	-0.

### 3.4 Bagging Model
Implementing a Random Forest Model

Random Forest models are a type of ensemble model, specifically a bagging model. They are part of the tree-based model family. One can read more about bagging and random forests at: https://www.analyticsvidhya.com/blog/2014/06/introduction-random-forest-simplified/.
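
The two ingredients of bagging, bootstrap sampling and majority voting, can be sketched in a few lines of plain Python. This is an illustration only: the helpers `bootstrap` and `majority_vote` are hypothetical, the tree predictions are made up, and a real random forest additionally samples a subset of features when growing each tree.

```python
import random
from collections import Counter

def bootstrap(rows, rng):
    """Draw len(rows) items with replacement (one training set per tree)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the most frequent label among the individual predictions."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
data = list(range(10))                    # stand-in for 10 training documents
samples = [bootstrap(data, rng) for _ in range(3)]
print([len(s) for s in samples])          # [10, 10, 10]

# Made-up predictions of three trees for one document:
print(majority_vote([1, 0, 1]))           # 1
```

Each bootstrap sample repeats some documents and omits others, so the trees differ from one another, and averaging their votes reduces the variance of the final prediction.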

In [27]:
def Run_RdnForest():
    
    time_RdnForest = time.time()

# RF on Count Vectors
    predict, clf = train_predict(ensemble.RandomForestClassifier(), xtrain_count, train_y, xtest_count)
    print("\nRF, Count Vectors: ")
    show_Result(predict)
    #most_informative_feature_for_class(count_vect, clf, train_y, n=10)
    #'RandomForestClassifier' object has no attribute 'coef_'

# RF on Word Level TF IDF Vectors
    predict, clf = train_predict(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xtest_tfidf)
    print("\nRF, WordLevel TF-IDF: ")
    show_Result(predict)

    predict, clf = train_predict(ensemble.RandomForestClassifier(), xtrain_tfidf_ngram, train_y, xtest_tfidf_ngram)
    print("\nRF, N-Gram Vectors: ")
    show_Result(predict)

    predict, clf = train_predict(ensemble.RandomForestClassifier(), xtrain_tfidf_ngram_chars, train_y, xtest_tfidf_ngram_chars)
    print("\nRF, CharLevel Vectors: ")
    show_Result(predict)

    print("\nIt takes %4.2f seconds for Random Forest."%(time.time()-time_RdnForest))

Run_RdnForest()


RF, Count Vectors: 
[1 1 1 1 0 0 1 1 1 0]
	Precision	Recall	F1	Support
Micro	0.9	0.9	0.9	None
Macro	0.9058	0.9	0.8996	None

RF, WordLevel TF-IDF: 
[1 1 1 1 0 0 1 1 1 0]
	Precision	Recall	F1	Support
Micro	0.9	0.9	0.9	None
Macro	0.9025	0.9	0.8998	None

RF, N-Gram Vectors: 
[1 1 1 1 1 1 1 1 1 1]
	Precision	Recall	F1	Support
Micro	0.65	0.65	0.65	None
Macro	0.7657	0.65	0.6072	None

RF, CharLevel Vectors: 
[1 1 1 1 0 0 1 1 1 0]
	Precision	Recall	F1	Support
Micro	0.92	0.92	0.92	None
Macro	0.9261	0.92	0.9197	None

It takes 0.08 seconds for Random Forest.




### 3.5 Boosting Model
Implementing an Extreme Gradient Boosting Model

Boosting models are another type of ensemble model in the tree-based family. Boosting is an ensemble meta-algorithm that primarily reduces bias, and also variance, in supervised learning by converting weak learners into strong ones. A weak learner is a classifier that is only slightly correlated with the true classification (it labels examples only somewhat better than random guessing). Read more about these models at: https://www.analyticsvidhya.com/blog/2016/01/xgboost-algorithm-easy-steps/.
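
To show how weak learners can be combined into a stronger one, the sketch below implements a very small gradient boosting loop with squared loss and one-split "stumps" on one-dimensional toy data. It is a simplified illustration under these assumptions and not the algorithm implemented by XGBoost; the data, the learning rate and the helper `fit_stump` are made up.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regressor on 1-D inputs (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:             # thresholds that leave both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lm if x <= t else rm)) ** 2 for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]                        # toy 0/1 targets
lr = 0.5                                       # shrinkage / learning rate
pred = [sum(ys) / len(ys)] * len(xs)           # start from the mean
errors = []
for _ in range(5):
    residuals = [y - p for y, p in zip(ys, pred)]
    errors.append(sum(r * r for r in residuals) / len(xs))
    stump = fit_stump(xs, residuals)           # weak learner fitted to the residuals
    pred = [p + lr * stump(x) for x, p in zip(xs, pred)]
errors.append(sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(xs))

print([round(e, 3) for e in errors])           # training error after each round
```

Each round fits a new stump to what the current ensemble still gets wrong, so the training error does not increase from one round to the next.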

In [None]:
def Run_XGboost():
# The XGboost takes 2330 seconds for the 20NG datasets. 
# So we do not use it.

    time_XGboost = time.time()

# Extreme Gradient Boosting on Count Vectors
    predict, clf = train_predict(xgboost.XGBClassifier(), xtrain_count.tocsc(), train_y, xtest_count.tocsc())
    print("\nXgb, Count Vectors: ")
    show_Result(predict)
    #most_informative_feature_for_class(count_vect, clf, train_y, n=10)
    #'XGBClassifier' object has no attribute 'coef_'

# Extreme Gradient Boosting on Word Level TF IDF Vectors
    predict, clf = train_predict(xgboost.XGBClassifier(), xtrain_tfidf.tocsc(), train_y, xtest_tfidf.tocsc())
    print("\nXgb, WordLevel TF-IDF: ")
    show_Result(predict)

    predict, clf = train_predict(xgboost.XGBClassifier(), xtrain_tfidf_ngram.tocsc(), train_y, xtest_tfidf_ngram.tocsc())
    print("\nXgb, N-Gram Vectors: ")
    show_Result(predict)

# Extreme Gradient Boosting on Character Level TF IDF Vectors
    predict, clf = train_predict(xgboost.XGBClassifier(), xtrain_tfidf_ngram_chars.tocsc(), train_y, xtest_tfidf_ngram_chars.tocsc())
    print("\nXgb, CharLevel Vectors: ")
    show_Result(predict)

    print("\nIt takes %4.2f seconds for XGboost."%(time.time()-time_XGboost))

#Run_XGboost() # may require 100 times of clock time more than SVM

### 3.6 Shallow Neural Networks
A neural network is a mathematical model designed to behave similarly to biological neurons and the nervous system. These models are used to recognize complex patterns and relationships that exist within labelled data. A shallow neural network mainly contains three types of layers: an input layer, a hidden layer, and an output layer:
![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/04/OH3gI-1.png)
(The figure above is from: https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/04/OH3gI-1.png.)

Read more about neural networks at: https://www.analyticsvidhya.com/blog/2017/05/neural-network-from-scratch-in-python-and-r/.

In [28]:
from keras import layers, models, optimizers
# to_categorical can be imported directly from keras.utils
# (the np_utils module is not available in newer Keras releases)
from keras.utils import to_categorical

# Convert label ID into one-hot-encoding for all training and testing data
y_Train_OneHot = to_categorical(train_y)
y_Test_OneHot = to_categorical(test_y)
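
For readers unfamiliar with one-hot encoding, here is a minimal sketch in plain Python of what the conversion above produces, together with the inverse argmax step that is used later to turn predicted probabilities back into label ids. The helpers `one_hot` and `argmax` are hypothetical stand-ins, not Keras functions.

```python
def one_hot(label_ids, num_classes=None):
    """Turn integer label ids into rows with a single 1.0 at the label position."""
    num_classes = num_classes or max(label_ids) + 1
    return [[1.0 if j == y else 0.0 for j in range(num_classes)] for y in label_ids]

def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

labels = [1, 0, 1]
encoded = one_hot(labels)
print(encoded)                          # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print([argmax(r) for r in encoded])     # [1, 0, 1]
print(argmax([0.3, 0.7]))               # 1
```

The softmax output layer of the networks below has one unit per class, which is why the labels must have this shape for the `categorical_crossentropy` loss.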

In [29]:
from matplotlib import pyplot as plt
# https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/
def plotHistory(train_history):
    history = train_history.history
    print("train history keys:", history.keys())
    # train history keys: dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])
    # Newer Keras versions name these metrics 'accuracy' / 'val_accuracy' instead
    acc_key = 'acc' if 'acc' in history else 'accuracy'
    val_acc_key = 'val_' + acc_key

    # summarize history for accuracy
    plt.plot(history[acc_key])
    if val_acc_key in history:
        plt.plot(history[val_acc_key])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    if val_acc_key in history:
        plt.legend(['train', 'validation'], loc='lower right')
    else:
        plt.legend(['train'], loc='upper left')

    plt.show()

    # summarize history for loss
    plt.plot(train_history.history['loss'])
    if 'val_loss' in train_history.history.keys():
        plt.plot(train_history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    if 'val_loss' in train_history.history.keys():
        plt.legend(['train', 'validation'], loc='upper right')
    else:
        plt.legend(['train'], loc='upper right')

    plt.show()

In [30]:
def train_NN_predict(classifier, feature_vector_train, label, feature_vector_valid, 
            batch_size=32,
            epochs=10,
            validation_split=0.0): # or validation_split=0.1): 
    
    # fit the training dataset on the classifier
    history=classifier.fit(feature_vector_train, label, verbose=1,
            epochs=epochs, batch_size=batch_size, validation_split=validation_split)

    # plotHistory(history)
    
    # https://stackoverflow.com/questions/38971293/get-class-labels-from-keras-functional-model
    # Note: predict_classes() exists in Sequential, not in Functional model 
    # predict the labels on validation dataset
    y_prob = classifier.predict(feature_vector_valid)
    #print("y_prob:\n", y_prob[:3])
    predictions = y_prob.argmax(axis=-1)
    # or equivalently, use next line:
    # predictions = y_classes = keras.np_utils.probas_to_classes(y_prob)
    print("predictions:\n", predictions[:3])
    return predictions

In [31]:
def create_SimpleNN(input_size, output_size):
    # create input layer 
    print("input_size:", input_size, ", output_size:", output_size)
    input_layer = layers.Input((input_size, ), sparse=True)
    
    # create hidden layer
    hidden_layer = layers.Dense(100, activation="tanh")(input_layer)
    hidden_layer = layers.Dropout(0.2)(hidden_layer)
    
    # create output layer
    output_layer = layers.Dense(output_size, activation="softmax")(hidden_layer)

    classifier = models.Model(inputs = input_layer, outputs = output_layer)
    classifier.compile(optimizer=optimizers.Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    
    return classifier 

In [32]:
def Run_SimpleNN():
    
    time_SimpleNN = time.time()

    classifier = create_SimpleNN(xtrain_count.shape[1], Num_Classes)
    predictions = train_NN_predict(classifier, xtrain_count, y_Train_OneHot, 
                          xtest_count, epochs=20)
    print("\nNN, Count Vectors: ")
    print(classifier.summary())
    show_Result(predictions)

    classifier = create_SimpleNN(xtrain_tfidf.shape[1], Num_Classes)
    predictions = train_NN_predict(classifier, xtrain_tfidf, y_Train_OneHot, 
                          xtest_tfidf, epochs=20)
    print("\nNN, WordLevel TF-IDF: ")
    print(classifier.summary())
    show_Result(predictions)

    classifier = create_SimpleNN(xtrain_tfidf_ngram.shape[1], Num_Classes)
    predictions = train_NN_predict(classifier, xtrain_tfidf_ngram, y_Train_OneHot, 
                          xtest_tfidf_ngram, epochs=20)
    print("\nNN, N-Gram Vectors: ")
    print(classifier.summary())
    show_Result(predictions)

    classifier = create_SimpleNN(xtrain_tfidf_ngram_chars.shape[1], Num_Classes)
    predictions = train_NN_predict(classifier, xtrain_tfidf_ngram_chars, y_Train_OneHot, 
                          xtest_tfidf_ngram_chars, epochs=20)
    print("\nNN, CharLevel Vectors: ")
    print(classifier.summary())
    show_Result(predictions)

    print("\nIt takes %4.2f seconds for Simple NN."%(time.time()-time_SimpleNN))

Run_SimpleNN()

input_size: 1304 , output_size: 2
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
predictions:
 [1 1 1]

NN, Count Vectors: 
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 1304)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 100)               130500    
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 202       
Total params: 130,702
Trainable params: 130,702
Non-trainable params: 0
__________________________

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
predictions:
 [1 1 1]

NN, CharLevel Vectors: 
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         (None, 6583)              0         
_________________________________________________________________
dense_7 (Dense)              (None, 100)               658400    
_________________________________________________________________
dropout_4 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 2)                 202       
Total params: 658,602
Trainable params: 658,602
Non-trainable params: 0
________________________________________________________
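
The parameter counts in the model summaries above can be verified by hand: a dense layer has one weight per input-output pair plus one bias per output unit. The helper `dense_params` below is a hypothetical illustration of that arithmetic.

```python
def dense_params(n_in, n_out):
    """Number of trainable parameters of a dense layer: weights + biases."""
    return n_in * n_out + n_out

for n_features in (1304, 6583):                # count vectors, char-level vectors
    hidden = dense_params(n_features, 100)     # input -> hidden layer of 100 units
    output = dense_params(100, 2)              # hidden -> 2 output classes
    print(n_features, hidden, output, hidden + output)
# 1304 130500 202 130702
# 6583 658400 202 658602
```

The totals agree with the `Total params` lines reported for the count-vector and character-level models.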

### 3.7 Deep Neural Networks
Deep Neural Networks use a cascade of multiple layers of nonlinear processing units for feature extraction and transformation. The figure below is from: https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/04/OH3gI.png.
![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/04/OH3gI.png)
They are more complex neural networks in which the hidden layers perform much more complex operations than simple sigmoid or relu activations. Different types of deep learning models can be applied in text classification problems.

### 3.7.1 Convolutional Neural Network
In convolutional neural networks, convolutions over the input layer are used to compute the output. This results in local connections, where each region of the input is connected to a neuron in the output. Each layer applies different filters and combines their results. The figure below is from: https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/04/cnnimage.png.
![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/04/cnnimage.png)

Read more about Convolutional Neural Networks at: https://www.analyticsvidhya.com/blog/2017/06/architecture-of-convolutional-neural-networks-simplified-demystified/.
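
The following is a minimal sketch of what a one-dimensional convolution followed by global max pooling computes. It is simplified to a single input channel and a single filter with made-up values, whereas a Keras `Conv1D` layer applies many filters, each spanning all embedding dimensions. The helpers `conv1d_valid` and `relu` are hypothetical.

```python
def conv1d_valid(seq, kernel):
    """Slide the kernel over seq without padding; output length = len(seq) - k + 1."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def relu(values):
    """Replace negative values by 0."""
    return [max(0.0, v) for v in values]

seq = [0.0, 1.0, 2.0, -1.0, 0.5, 3.0]     # toy 1-channel input of length 6
kernel = [1.0, 0.0, -1.0]                  # toy filter of width 3

feature_map = relu(conv1d_valid(seq, kernel))
print(feature_map)             # [0.0, 2.0, 1.5, 0.0]
print(max(feature_map))        # 2.0, global max pooling keeps one value per filter
print(30 - 3 + 1)              # 28 window positions for a length-30 input and width-3 filter
```

Each output value depends only on a local window of the input, which is the "local connection" described above, and global max pooling keeps the strongest response of each filter regardless of where in the text it occurred.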

In [39]:

time_CNN = time.time()

def create_cnn(input_size, output_size):
    # to receive sequences of TextWordsLen integers, between 1 and word_index_len.
    input_layer = layers.Input((input_size, ))

    # Add the word embedding Layer
    # see: https://keras.io/getting-started/functional-api-guide/
    embedding_layer = layers.Embedding(input_dim=(word_index_len + 1),
                output_dim=embedding_vector_size, 
#                weights=[embedding_matrix], trainable=False)(input_layer)
# Making embedding vectors trainable improve microF1 from 0.58 to 0.79 !
                weights=[embedding_matrix], trainable=Trainable)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.25)(embedding_layer)

    # Add the convolutional Layer
    conv_layer = layers.Convolution1D(100, 3, activation="relu")(embedding_layer)

    # Add the pooling Layer
    pooling_layer = layers.GlobalMaxPool1D()(conv_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(pooling_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    # output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)
    output_layer2 = layers.Dense(output_size, activation="softmax")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), 
                  loss='categorical_crossentropy', metrics=['accuracy'])
    
    return model

def Run_CNN():
    classifier = create_cnn(TextWordsLen, Num_Classes)
    print(classifier.summary())

    predictions = train_NN_predict(classifier, train_seq_x, y_Train_OneHot, 
#                               test_seq_x, epochs=10) # for CnonC dataset, 10 is better than 20
                               test_seq_x, epochs=20)
    print("CNN, Word Embeddings")
    show_Result(predictions)

    print("\nIt takes %4.2f seconds for Convolutional NN."%(time.time()-time_CNN))

Run_CNN()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_9 (InputLayer)         (None, 30)                0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 30, 300)           391500    
_________________________________________________________________
spatial_dropout1d_5 (Spatial (None, 30, 300)           0         
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 28, 100)           90100     
_________________________________________________________________
global_max_pooling1d_4 (Glob (None, 100)               0         
_________________________________________________________________
dense_17 (Dense)             (None, 50)                5050      
_________________________________________________________________
dropout_9 (Dropout)          (None, 50)                0         
__________
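
The output shapes and parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for embedding, 1D convolution, and dense layers. The vocabulary size of 1305 rows is not stated anywhere; it is inferred here by dividing the embedding parameter count (391,500) by the vector size (300), so treat it as an assumption.

```python
def embedding_params(vocab_rows, embed_dim):
    """One vector of length embed_dim per row of the embedding matrix."""
    return vocab_rows * embed_dim


def conv1d_output_length(input_length, kernel_size):
    """Number of positions for an unpadded convolution with stride 1."""
    return input_length - kernel_size + 1


def conv1d_params(kernel_size, in_channels, filters):
    """Each filter has kernel_size * in_channels weights plus one bias."""
    return (kernel_size * in_channels + 1) * filters


def dense_params(in_units, out_units):
    """Each output unit has in_units weights plus one bias."""
    return (in_units + 1) * out_units


print(embedding_params(1305, 300))   # 391500, the Embedding row
print(conv1d_output_length(30, 3))   # 28, the Conv1D output length
print(conv1d_params(3, 300, 100))    # 90100, the Conv1D row
print(dense_params(100, 50))         # 5050, the Dense(50) row
```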

### 3.7.2 Recurrent Neural Network – LSTM
Unlike feed-forward neural networks, in which activation outputs are propagated in only one direction, recurrent neural networks propagate the activation outputs of neurons in both directions (from inputs to outputs and from outputs back to inputs). This creates loops in the network architecture, which act as a ‘memory state’ of the neurons. This state gives the neurons the ability to remember what has been learned so far.
![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/04/bptt-768x313.png)

The memory state in RNNs gives them an advantage over traditional neural networks, but they suffer from a problem called the vanishing gradient. In this problem, when learning with a large number of layers, it becomes really hard for the network to learn and tune the parameters of the earlier layers. To address this problem, a new type of RNN called the LSTM (Long Short-Term Memory) model has been developed.

Read more about LSTMs at: https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/.
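
The vanishing gradient can be illustrated with a toy example. In an RNN unrolled over a sequence, each time step plays the role of a layer, so the gradient that reaches the first step is a product of one factor per step. The sketch below assumes a scalar RNN of the form h_t = tanh(w * h_{t-1} + x_t), with a made-up weight and made-up inputs; it is not the LSTM used in this notebook.

```python
import math


def gradient_through_time(weight, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(weight * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(weight * h + x)
        # Chain rule: d h_t / d h_{t-1} = weight * (1 - h_t ** 2)
        grad *= weight * (1.0 - h * h)
    return grad


# The same weight and inputs, unrolled over more and more time steps.
grad_5 = gradient_through_time(0.5, [0.1] * 5)
grad_20 = gradient_through_time(0.5, [0.1] * 20)
grad_50 = gradient_through_time(0.5, [0.1] * 50)
print(grad_5, grad_20, grad_50)
```

With a weight of 0.5 every factor is at most 0.5, so the gradient shrinks at least geometrically with the number of steps, and the earliest steps receive almost no learning signal. The gates of an LSTM are designed to let information and gradients pass through many steps with less shrinkage.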

In [34]:

time_LSTM = time.time()

def create_rnn_lstm(input_size, output_size):
    # Add an Input Layer
    input_layer = layers.Input((input_size, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(input_dim=(word_index_len + 1),
                output_dim=embedding_vector_size, 
                weights=[embedding_matrix], trainable=Trainable)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.25)(embedding_layer)

    # Add the LSTM Layer
    lstm_layer = layers.LSTM(100)(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(lstm_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    # output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)
    output_layer2 = layers.Dense(output_size, activation="softmax")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), 
                  loss='categorical_crossentropy', metrics=['accuracy'])
    
    return model

def Run_LSTM():
    classifier = create_rnn_lstm(TextWordsLen, Num_Classes)
    print(classifier.summary())

    predictions = train_NN_predict(classifier, train_seq_x, y_Train_OneHot, 
                               test_seq_x, epochs=20)
    print("RNN-LSTM, Word Embeddings")
    show_Result(predictions)

    print("\nIt takes %4.2f seconds for LSTM."%(time.time()-time_LSTM))

#Run_LSTM()

### 3.7.3 Recurrent Neural Network – GRU
Gated Recurrent Units (GRUs) are another form of recurrent neural network. Let's use a GRU layer instead of the LSTM layer in our network.
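
As a rough sketch of what a GRU computes at each time step, the scalar version below uses an update gate and a reset gate to decide how much of the previous state to keep. All weights in the dictionary `p` are hypothetical, and published formulations differ on whether the update gate multiplies the old state or the candidate state; this sketch lets it multiply the old state.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One step of a scalar GRU; `p` holds hypothetical weights and biases."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # Candidate state: the reset gate scales how much old state is used.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Blend the old state and the candidate state using the update gate.
    return z * h_prev + (1.0 - z) * h_cand


params = {"wz": 0.5, "uz": 0.5, "bz": 0.0,
          "wr": 0.5, "ur": 0.5, "br": 0.0,
          "wh": 1.0, "uh": 1.0, "bh": 0.0}

h = 0.0
for x in [1.0, -0.5, 0.25]:   # made-up input sequence
    h = gru_step(x, h, params)
print(h)
```

Compared with an LSTM, a GRU has two gates instead of three and no separate cell state, so it has fewer parameters for the same number of units.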

In [None]:

time_GRU = time.time()

def create_rnn_gru(input_size, output_size):
    # Add an Input Layer
    input_layer = layers.Input((input_size, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(input_dim=(word_index_len + 1),
                output_dim=embedding_vector_size, 
                weights=[embedding_matrix], trainable=Trainable)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.25)(embedding_layer)

    # Add the GRU Layer
    gru_layer = layers.GRU(100)(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(gru_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    # output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)
    output_layer2 = layers.Dense(output_size, activation="softmax")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), 
                  loss='categorical_crossentropy', metrics=['accuracy'])
    
    return model

def Run_GRU():
    classifier = create_rnn_gru(TextWordsLen, Num_Classes)
    print(classifier.summary())

    predictions = train_NN_predict(classifier, train_seq_x, y_Train_OneHot, 
                               test_seq_x, epochs=20)
    print("RNN-GRU, Word Embeddings")
    show_Result(predictions)

    print("\nIt takes %4.2f seconds for GRU."%(time.time()-time_GRU))

#Run_GRU()

### 3.7.4 Bidirectional RNN
RNN layers can be wrapped in Bidirectional layers as well. Let's wrap our GRU layer in a bidirectional layer.
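
Conceptually, a bidirectional wrapper runs one recurrent layer over the sequence from left to right, a second one from right to left, and combines their outputs. The sketch below assumes concatenation as the way of combining, and uses a toy scalar RNN with made-up weights in place of a real GRU.

```python
import math


def run_rnn(sequence, weight):
    """Final state of a toy scalar RNN: h_t = tanh(weight * h_{t-1} + x_t)."""
    h = 0.0
    for x in sequence:
        h = math.tanh(weight * h + x)
    return h


def bidirectional(sequence, forward_weight=0.5, backward_weight=0.8):
    """Concatenate the final states of a forward pass and a backward pass.

    Each direction has its own (hypothetical) weight, as the two directions
    are separate layers.
    """
    forward_state = run_rnn(sequence, forward_weight)
    backward_state = run_rnn(list(reversed(sequence)), backward_weight)
    return [forward_state, backward_state]


toy_inputs = [1.0, 0.0, -1.0]
state = bidirectional(toy_inputs)
print(len(state), state)
```

Because the two directions are concatenated, the output has twice as many features as one direction alone, and the layer has twice as many parameters, which is consistent with the longer training time.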

In [36]:

time_BiGRU = time.time()

def create_bidirectional_rnn(input_size, output_size):
    # Add an Input Layer
    input_layer = layers.Input((input_size, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(input_dim=(word_index_len + 1),
                output_dim=embedding_vector_size, 
                weights=[embedding_matrix], trainable=Trainable)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.25)(embedding_layer)

    # Add the bidirectional GRU Layer
    bigru_layer = layers.Bidirectional(layers.GRU(100))(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(bigru_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    # output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)
    output_layer2 = layers.Dense(output_size, activation="softmax")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), 
                  loss='categorical_crossentropy', metrics=['accuracy'])
    
    return model

def Run_BiGRU():
    classifier = create_bidirectional_rnn(TextWordsLen, Num_Classes)
    print(classifier.summary())

    predictions = train_NN_predict(classifier, train_seq_x, y_Train_OneHot, 
                               test_seq_x, epochs=20)
    print("Bidirectional-GRU, Word Embeddings")
    show_Result(predictions)

    print("\nIt takes %4.2f seconds for Bidirectional GRU."%(time.time()-time_BiGRU))

#Run_BiGRU() # may take about twice the wall-clock time of GRU

### 3.7.5 Recurrent Convolutional Neural Network (RCNN)

The architecture of the RCNN below is very similar to:
![](https://www.researchgate.net/publication/329109642/figure/fig3/AS:695607987535873@1542857276692/The-architecture-of-the-BLSTM-C-model_W640.jpg)
which is from Yue Li, Xutao Wang, Pengjian Xu, "Chinese Text Classification Model Based on Deep Learning", Future Internet, 10(11), 2018, doi:10.3390/fi10110113 (www.mdpi.com/journal/futureinternet).

In [37]:

time_RCNN = time.time()

def create_rcnn(input_size, output_size):
    # Add an Input Layer
    input_layer = layers.Input((input_size, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(input_dim=(word_index_len + 1),
                output_dim=embedding_vector_size, 
                weights=[embedding_matrix], trainable=Trainable)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.25)(embedding_layer)
    
    # Add the recurrent layer
    rnn_layer = layers.Bidirectional(layers.GRU(50, return_sequences=True))(embedding_layer)
    
    # Add the convolutional Layer
    #conv_layer = layers.Convolution1D(100, 3, activation="relu")(embedding_layer)
    # The above line is replaced by the next line on 2019/02/03
    conv_layer = layers.Convolution1D(100, 3, activation="relu")(rnn_layer)

    # Add the pooling Layer
    pooling_layer = layers.GlobalMaxPool1D()(conv_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(pooling_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    # output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)
    output_layer2 = layers.Dense(output_size, activation="softmax")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), 
                  loss='categorical_crossentropy', metrics=['accuracy'])
    
    return model

def Run_RCNN():
    classifier = create_rcnn(TextWordsLen, Num_Classes)
    print(classifier.summary())

    predictions = train_NN_predict(classifier, train_seq_x, y_Train_OneHot, 
                               test_seq_x, epochs=20)
    print("RCNN, Word Embeddings")
    show_Result(predictions)

    print("\nIt takes %4.2f seconds for RCNN."%(time.time()-time_RCNN))

Run_RCNN()

print("\nIt takes %4.2f seconds for all the experiments."%(time.time()-time_Start))


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_7 (InputLayer)         (None, 30)                0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 30, 300)           391500    
_________________________________________________________________
spatial_dropout1d_3 (Spatial (None, 30, 300)           0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 30, 100)           105300    
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 28, 100)           30100     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 100)               0         
_________________________________________________________________
dense_13 (Dense)             (None, 50)                5050      
__________
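
The parameter counts in the RCNN summary above can also be reproduced by hand. The GRU formula below assumes one bias vector per gate, which matches the 105,300 parameters reported here; Keras versions that keep separate input and recurrent biases would report 105,600 for the same layer. The helper names are made up for this sketch.

```python
def gru_params(input_dim, units, biases_per_gate=1):
    """Three weight blocks (update gate, reset gate, candidate state).

    Each block has input weights, recurrent weights, and bias terms.
    """
    return 3 * (units * (input_dim + units) + biases_per_gate * units)


def bidirectional_params(one_direction_params):
    """Forward and backward layers each have their own weights."""
    return 2 * one_direction_params


def conv1d_params(kernel_size, in_channels, filters):
    """Each filter has kernel_size * in_channels weights plus one bias."""
    return (kernel_size * in_channels + 1) * filters


def dense_params(in_units, out_units):
    """Each output unit has in_units weights plus one bias."""
    return (in_units + 1) * out_units


print(bidirectional_params(gru_params(300, 50)))  # 105300, the Bidirectional row
print(conv1d_params(3, 100, 100))                 # 30100, the Conv1D row
print(dense_params(100, 50))                      # 5050, the Dense(50) row
```

The convolution here has far fewer parameters than in the plain CNN (30,100 versus 90,100) because it reads the 100 features produced by the bidirectional GRU instead of the 300-dimensional word vectors.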

## 4. Improving Text Classification Models
While the above framework can be applied to a number of text classification problems, some improvements to the overall framework are needed to achieve good accuracy. The following are some tips for improving the performance of text classification models and of this framework.

1. Text Cleaning : text cleaning can help to reduce the noise present in text data in the form of stopwords, punctuation marks, suffix variations etc. This article can help to understand how to implement text classification in detail.

2. Hstacking Text / NLP features with text feature vectors : In the feature engineering section, we generated a number of different feature vectors; combining them can help to improve the accuracy of the classifier.

3. Bidirectional Recurrent Convolutional Neural Networks
4. CNNs and RNNs with more layers
5. Sequence to Sequence Models with Attention
6. Hierarchical Attention Networks

7. Hyperparameter Tuning in modelling : Tuning the parameters is an important step. A number of parameters, such as tree length, leaves, and network parameters, can be fine-tuned to get a best-fit model.

8. Ensemble Models : Stacking different models and blending their outputs can help to further improve the results. Read more about ensemble models here
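
As a small illustration of the last tip, one simple way to blend models is to average the class probabilities they predict for a document and pick the class with the highest average (often called soft voting). The sketch below is only one possible blending scheme; the three probability vectors are made-up stand-ins for the softmax outputs of three trained classifiers.

```python
def blend_probabilities(prob_lists, weights=None):
    """Weighted average of class-probability vectors, one vector per model."""
    if weights is None:
        weights = [1.0] * len(prob_lists)
    total = sum(weights)
    num_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(num_classes)
    ]


def predict_label(probabilities):
    """Index of the class with the highest probability."""
    return max(range(len(probabilities)), key=lambda c: probabilities[c])


# Made-up softmax outputs of three hypothetical models for one document.
cnn_probs = [0.6, 0.3, 0.1]
gru_probs = [0.2, 0.5, 0.3]
rcnn_probs = [0.3, 0.6, 0.1]

blended = blend_probabilities([cnn_probs, gru_probs, rcnn_probs])
print(predict_label(cnn_probs))  # 0: the first model alone picks class 0
print(predict_label(blended))    # 1: the blended prediction picks class 1
```

The optional `weights` argument lets a stronger model count for more, for example with weights proportional to each model's validation score.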
