# Introduction

The goal of text classification is to automatically assign text documents to one or more predefined categories. Some examples of text classification are:
- Understanding audience sentiment from social media,
- Detection of spam and non-spam emails,
- Auto tagging of customer queries, and
- Categorization of news articles into defined topics. <br> <br>

Text classification is a supervised machine learning task, since a labelled dataset of text documents and their labels is used to train a classifier. We will follow four steps:
- Dataset Preparation (Preprocessing Data)
- Feature Engineering (Preprocessing Data)
- Model Training
- Improve Performance 


In this tutorial, we will implement a text classifier for Vietnamese newspaper articles. <br>
There are 10 classes in total in the dataset.

# Preprocessing Data

The dataset was downloaded from https://github.com/duyvuleo/VNTC

In [2]:
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

import pandas, xgboost, numpy, textblob, string
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers
from keras.layers import *

## Dataset preparation

In [3]:
from pyvi import ViTokenizer, ViPosTagger
from tqdm import tqdm
import numpy as np
import gensim

In [None]:
import os 
dir_path = os.path.dirname(os.path.realpath(os.getcwd()))
dir_path = os.path.join(dir_path, 'Data')

def get_data(folder_path):
    """Read every document under folder_path; each sub-folder name is used as the class label."""
    X = []
    y = []
    dirs = os.listdir(folder_path)
    for path in dirs:
        file_paths = os.listdir(os.path.join(folder_path, path))
        for file_path in tqdm(file_paths):
            with open(os.path.join(folder_path, path, file_path), 'r', encoding="utf-16") as f:
                lines = f.readlines()
                lines = ' '.join(lines)
                # lowercase and tokenize, dropping punctuation
                lines = gensim.utils.simple_preprocess(lines)
                lines = ' '.join(lines)
                # Vietnamese word segmentation: syllables of one word are joined with '_'
                lines = ViTokenizer.tokenize(lines)
                X.append(lines)
                y.append(path)

    return X, y

train_path = os.path.join(dir_path, 'E:/CodeDatabase/Web/Natual-Language-Processing/Text-Classifier/data/10Topics/Ver1.1/Train_Full')
X_data, y_data = get_data(train_path)


100%|██████████████████████████████████████████████████████████████████████████████| 5219/5219 [01:32<00:00, 56.28it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 3159/3159 [01:01<00:00, 51.49it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 1820/1820 [00:31<00:00, 56.91it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 2552/2552 [00:43<00:00, 58.75it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 3868/3868 [01:01<00:00, 63.35it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 3384/3384 [00:59<00:00, 57.03it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 2898/2898 [00:49<00:00, 58.79it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 5298/5298 [01:37<00:00, 54.48it/s]
 38%|█████████████████████████████▎     

In [2]:
import pickle

# cache the preprocessed training data; the context manager closes each file
with open('E:/CodeDatabase/Web/Natual-Language-Processing/Text-Classifier/data/X_data.pkl', 'wb') as f:
    pickle.dump(X_data, f)
with open('E:/CodeDatabase/Web/Natual-Language-Processing/Text-Classifier/data/y_data.pkl', 'wb') as f:
    pickle.dump(y_data, f)


In [None]:
test_path = os.path.join(dir_path, 'E:/CodeDatabase/Web/Natual-Language-Processing/Text-Classifier/data/10Topics/Ver1.1/Test_Full')
X_test, y_test = get_data(test_path)


  4%|▎         | 231/6250 [00:22<09:34, 10.48it/s]

In [None]:
# cache the preprocessed test data; the context manager closes each file
with open('E:/CodeDatabase/Web/Natual-Language-Processing/Text-Classifier/data/X_test.pkl', 'wb') as f:
    pickle.dump(X_test, f)
with open('E:/CodeDatabase/Web/Natual-Language-Processing/Text-Classifier/data/y_test.pkl', 'wb') as f:
    pickle.dump(y_test, f)

## Feature Engineering

In this step, raw text data is transformed into feature vectors, and new features are created from the existing dataset. We will implement the following ideas:
1. Count Vectors as features
2. TF-IDF Vectors as features<br>
    2.1. Word level<br>
    2.2. N-Gram level<br>
    2.3. Character level
3. Word Embeddings as features
4. Text / NLP based features
5. Topic Models as features

In [4]:
import pickle

X_data = pickle.load(open('E:/CodeDatabase/Web/Natual-Language-Processing/Text-Classifier/data/X_data.pkl', 'rb'))
y_data = pickle.load(open('E:/CodeDatabase/Web/Natual-Language-Processing/Text-Classifier/data/y_data.pkl', 'rb'))

X_test = pickle.load(open('E:/CodeDatabase/Web/Natual-Language-Processing/Text-Classifier/data/X_test.pkl', 'rb'))
y_test = pickle.load(open('E:/CodeDatabase/Web/Natual-Language-Processing/Text-Classifier/data/y_test.pkl', 'rb'))

### Count Vectors as features
Count Vector is a matrix notation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell represents the frequency count of a particular term in a particular document.

In [6]:
# create a count vectorizer object 
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(X_data)

# transform the training and validation data using count vectorizer object
X_data_count = count_vect.transform(X_data)
X_test_count = count_vect.transform(X_test)
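
To make the row/column/cell description concrete, here is a minimal pure-Python sketch of a count matrix on a toy corpus. The helper `build_count_matrix` and the toy documents are illustrative assumptions, not part of the notebook; `CountVectorizer` does the same job with sparse matrices and its own tokenization rules.

```python
from collections import Counter

def build_count_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [doc.split() for doc in documents]
    # one column per distinct term, in sorted order
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# hypothetical, already word-segmented toy documents
toy_docs = [
    "bong_da viet_nam thang",
    "kinh_te viet_nam tang_truong",
    "bong_da bong_da",
]
toy_vocabulary, toy_count_matrix = build_count_matrix(toy_docs)
for row in toy_count_matrix:
    print(row)
```

Each printed row is one document and each position in the row is one vocabulary term, so the third document has a 2 in the `bong_da` column and zeros elsewhere.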

### TF-IDF Vectors

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)<br>
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)<br>
TF-IDF Vectors can be generated at different levels of input tokens (words, characters, n-grams)

a. Word Level TF-IDF : A matrix representing the tf-idf scores of every term in different documents

b. N-gram Level TF-IDF : N-grams are combinations of N consecutive terms. This matrix represents the tf-idf scores of N-grams

c. Character Level TF-IDF : A matrix representing the tf-idf scores of character-level n-grams in the corpus
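
The two formulas above can be worked through by hand on a toy corpus. The sketch below is a minimal pure-Python version of exactly those definitions; the function `tf_idf` and the toy documents are assumptions made for illustration. Note that `TfidfVectorizer` does not produce these exact numbers by default, because it smooths the IDF term and L2-normalizes each row.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document using TF(t) * IDF(t) as defined above."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total_terms = len(tokens)
        scores.append({
            term: (count / total_terms) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# hypothetical, already word-segmented toy documents
toy_docs = ["bong_da viet_nam", "kinh_te viet_nam", "bong_da bong_da the_thao"]
toy_tfidf = tf_idf(toy_docs)
for doc_scores in toy_tfidf:
    print(doc_scores)
```

In the third document, `the_thao` occurs once but only in that document, so it scores higher than `bong_da`, which occurs twice but also appears elsewhere in the corpus.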



In [7]:
# word level - keep only the 30000 most frequent words instead of the full vocabulary (100k+ words)
tfidf_vect = TfidfVectorizer(analyzer='word', max_features=30000)
tfidf_vect.fit(X_data) # learn vocabulary and idf from training set
X_data_tfidf =  tfidf_vect.transform(X_data)
# the test set is treated as unseen data, so it is only transformed
X_test_tfidf =  tfidf_vect.transform(X_test)

In [None]:
tfidf_vect.get_feature_names()

In [8]:
# ngram level - keep only the 30000 most frequent 2-grams and 3-grams
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', max_features=30000, ngram_range=(2, 3))
tfidf_vect_ngram.fit(X_data)
X_data_tfidf_ngram =  tfidf_vect_ngram.transform(X_data)
# the test set is treated as unseen data, so it is only transformed
X_test_tfidf_ngram =  tfidf_vect_ngram.transform(X_test)

MemoryError: 

In [None]:
tfidf_vect_ngram.get_feature_names()

In [4]:
# ngram-char level - keep only the 30000 most frequent character 2-grams and 3-grams
tfidf_vect_ngram_char = TfidfVectorizer(analyzer='char', max_features=30000, ngram_range=(2, 3))
tfidf_vect_ngram_char.fit(X_data)
X_data_tfidf_ngram_char =  tfidf_vect_ngram_char.transform(X_data)
# the test set is treated as unseen data, so it is only transformed
X_test_tfidf_ngram_char =  tfidf_vect_ngram_char.transform(X_test)

#### Transform by SVD to decrease number of dimensions

##### Word Level

In [9]:
from sklearn.decomposition import TruncatedSVD

In [10]:
svd = TruncatedSVD(n_components=300, random_state=42)
svd.fit(X_data_tfidf)

In [11]:
X_data_tfidf_svd = svd.transform(X_data_tfidf)
X_test_tfidf_svd = svd.transform(X_test_tfidf)

##### ngram Level

In [12]:
svd_ngram = TruncatedSVD(n_components=300, random_state=42)
svd_ngram.fit(X_data_tfidf_ngram)

NameError: name 'X_data_tfidf_ngram' is not defined

In [13]:
X_data_tfidf_ngram_svd = svd_ngram.transform(X_data_tfidf_ngram)
X_test_tfidf_ngram_svd = svd_ngram.transform(X_test_tfidf_ngram)

NameError: name 'X_data_tfidf_ngram' is not defined

##### ngram Char Level

In [7]:
svd_ngram_char = TruncatedSVD(n_components=300, random_state=42)
svd_ngram_char.fit(X_data_tfidf_ngram_char)

TruncatedSVD(algorithm='randomized', n_components=300, n_iter=5,
       random_state=42, tol=0.0)

In [14]:
X_data_tfidf_ngram_char_svd = svd_ngram_char.transform(X_data_tfidf_ngram_char)
X_test_tfidf_ngram_char_svd = svd_ngram_char.transform(X_test_tfidf_ngram_char)


### Word Embeddings

We will convert each word in a document to an embedding vector, using a pretrained model for Vietnamese. The model can be downloaded from https://github.com/Kyubyong/wordvectors

Assume that a document has $n$ words and each word is represented by a 300-dimensional vector. The document is then represented by a 2-dimensional matrix of size $ n \times 300 $. DNN, RNN, and CNN models can be applied to this type of data.

In [16]:
import os
from gensim.models import KeyedVectors

dir_path = os.path.dirname(os.path.realpath(os.getcwd()))
word2vec_model_path = os.path.join(dir_path, "E:/CodeDatabase/Web/Natual-Language-Processing/Text-Classifier/data/vi/vi.vec")

# load the pretrained vectors (word2vec text format)
w2v = KeyedVectors.load_word2vec_format(word2vec_model_path)
# note: `.wv` and `.vocab` follow the older gensim 3.x API;
# gensim 4.x exposes the vocabulary as `w2v.key_to_index` instead
vocab = w2v.wv.vocab
wv = w2v.wv

In [45]:
def get_word2vec_data(X):
    word2vec_data = []
    for x in X:
        sentence = []
        for word in x.split(" "):
            # skip words that are missing from the pretrained vocabulary
            if word in vocab:
                sentence.append(wv[word])

        word2vec_data.append(sentence)
    return word2vec_data

X_data_w2v = get_word2vec_data(X_data)
X_test_w2v = get_word2vec_data(X_test)
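
Because every document has a different number of words, the lists built above have different lengths, while most neural network layers expect a fixed input shape. One common way to handle this, shown here only as a sketch, is to truncate or zero-pad each document to a fixed number of rows. The helper `document_to_matrix`, the 4-dimensional toy embeddings, and the choice of `max_len` are all assumptions for illustration; the notebook itself does not pad.

```python
def document_to_matrix(document, embeddings, max_len, dim):
    """Map a space-separated document to a max_len x dim matrix (list of lists)."""
    # look up each known word; out-of-vocabulary words are skipped
    rows = [embeddings[word] for word in document.split() if word in embeddings]
    rows = rows[:max_len]  # truncate long documents
    padding = [[0.0] * dim for _ in range(max_len - len(rows))]
    return rows + padding  # zero-pad short documents

# hypothetical toy embeddings with 4 dimensions instead of 300
toy_embeddings = {
    "bong_da": [0.1, 0.2, 0.3, 0.4],
    "viet_nam": [0.5, 0.6, 0.7, 0.8],
}
doc_matrix = document_to_matrix("bong_da khong_co viet_nam", toy_embeddings, max_len=3, dim=4)
print(len(doc_matrix), len(doc_matrix[0]))
```

The unknown word `khong_co` is dropped, so the two known words fill the first two rows and the third row is padding.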



### Text / NLP based features
Idea from https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/

A number of extra text based features can also be created which sometimes are helpful for improving text classification models. Some examples are:

1. Word Count of the documents – total number of words in the documents
2. Character Count of the documents – total number of characters in the documents
3. Average Word Density of the documents – average length of the words used in the documents
4. Punctuation Count of the documents – total number of punctuation marks in the documents
5. Upper Case Count of the documents – total number of upper case words in the documents
6. Title Word Count of the documents – total number of proper case (title) words in the documents
7. Frequency distribution of Part of Speech Tags:
    - Noun Count
    - Verb Count
    - Adjective Count
    - Adverb Count
    - Pronoun Count
    
These features are highly experimental ones and should be used according to the problem statement only.
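
As a rough illustration of features 1 to 6, the sketch below computes them with the standard library only. The function `text_features` and the sample sentence are hypothetical. It should be applied to raw text: the documents produced by `get_data` have already been lowercased and stripped of punctuation by `simple_preprocess`, so the last three counts would always be zero on them.

```python
import string

def text_features(document):
    """Compute simple surface statistics for one raw (unprocessed) document."""
    words = document.split()
    word_count = len(words)
    return {
        "word_count": word_count,
        "char_count": len(document),
        # average length of the whitespace-separated words
        "avg_word_length": sum(len(w) for w in words) / word_count if word_count else 0.0,
        "punctuation_count": sum(1 for ch in document if ch in string.punctuation),
        "upper_case_word_count": sum(1 for w in words if w.isupper()),
        "title_word_count": sum(1 for w in words if w.istitle()),
    }

sample_features = text_features("Ha Noi thang 2-0, WOW!")
print(sample_features)
```

The part-of-speech counts in item 7 need a tagger and are left out of this sketch.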

### Topic Models as features

Topic modelling is a technique for identifying groups of words (called topics) in a collection of documents that carry the most information about the collection. I have used Latent Dirichlet Allocation (LDA) for generating topic modelling features. LDA is an iterative model which starts from a fixed number of topics. Each topic is represented as a distribution over words, and each document is then represented as a distribution over topics. Although the tokens themselves are meaningless, the probability distributions over words provided by the topics give a sense of the different ideas contained in the documents.

### Convert y to categorical

In [21]:
encoder = preprocessing.LabelEncoder()
y_data_n = encoder.fit_transform(y_data)
# reuse the mapping learned on the training labels instead of refitting on the test labels
y_test_n = encoder.transform(y_test)

In [22]:
encoder.classes_

array(['Chinh tri Xa hoi', 'Doi song', 'Khoa hoc', 'Kinh doanh',
       'Phap luat', 'Suc khoe', 'The gioi', 'The thao', 'Van hoa',
       'Vi tinh'], dtype='<U16')
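
Conceptually, a label encoder is a sorted list of class names plus a lookup table. The pure-Python sketch below (the helpers `fit_labels` and `transform_labels` and the toy labels are assumptions for illustration) shows why the mapping should be learned once on the training labels and then reused: fitting again on a label set that happens to miss a class assigns different integers to the same class.

```python
def fit_labels(labels):
    """Learn the class list and the class -> integer mapping."""
    classes = sorted(set(labels))
    return classes, {label: index for index, label in enumerate(classes)}

def transform_labels(labels, mapping):
    """Apply an already learned mapping."""
    return [mapping[label] for label in labels]

toy_train_labels = ["The thao", "Kinh doanh", "The thao", "Vi tinh"]
toy_test_labels = ["Vi tinh", "The thao"]  # "Kinh doanh" is missing here

toy_classes, toy_mapping = fit_labels(toy_train_labels)
train_ids = transform_labels(toy_train_labels, toy_mapping)
test_ids = transform_labels(toy_test_labels, toy_mapping)

# refitting on the test labels alone shifts the integer ids
refit_ids = transform_labels(toy_test_labels, fit_labels(toy_test_labels)[1])
print(test_ids, refit_ids)
```

When both sets contain exactly the same classes, as with the 10 topics here, refitting happens to give the same mapping, but reusing the fitted encoder does not depend on that.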

# Model

In this tutorial, we will implement several models and compare them to find the most effective one for this text classification problem. We will implement these models:
1. Naive Bayes Classifier
2. Linear Classifier
3. Support Vector Machine
4. Bagging Models
5. Boosting Models
6. Shallow Neural Networks
7. Deep Neural Networks
    - Convolutional Neural Network (CNN)
    - Long Short-Term Memory (LSTM)
    - Gated Recurrent Unit (GRU)
    - Bidirectional RNN
    - Recurrent Convolutional Neural Network (RCNN)
    - Other Variants of Deep Neural Networks
8. Doc2Vec model

We use the following helper function to train and evaluate each classifier: <br>
(Because of limited memory on my machine, I only test on word-level TF-IDF features, with and without SVD.)

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
def train_model(classifier, X_data, y_data, X_test, y_test, is_neuralnet=False, n_epochs=3):
    # hold out 10% of the training data for validation
    X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.1, random_state=42)

    if is_neuralnet:
        classifier.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=n_epochs, batch_size=512)

        # Keras models output class probabilities; take the most likely class
        val_predictions = classifier.predict(X_val).argmax(axis=-1)
        test_predictions = classifier.predict(X_test).argmax(axis=-1)
    else:
        classifier.fit(X_train, y_train)

        train_predictions = classifier.predict(X_train)
        val_predictions = classifier.predict(X_val)
        test_predictions = classifier.predict(X_test)
        print("Train accuracy: ", metrics.accuracy_score(y_train, train_predictions))

    print("Validation accuracy: ", metrics.accuracy_score(y_val, val_predictions))
    print("Test accuracy: ", metrics.accuracy_score(y_test, test_predictions))
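
For readers who want to see what the split and the metric amount to, here is a small standard-library sketch. The helpers `holdout_split` and `accuracy` are hypothetical stand-ins, not the scikit-learn implementations, and the shuffle order differs from `train_test_split`. The sketch rounds the validation size up, which reproduces the 30383 / 3376 sample counts reported in the training logs for the 33759 training documents.

```python
import math
import random

def holdout_split(samples, labels, test_size=0.1, seed=42):
    """Shuffle indices once, then carve off a validation part of size ceil(test_size * n)."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(len(indices) * test_size)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    X_train = [samples[i] for i in train_idx]
    X_val = [samples[i] for i in val_idx]
    y_train = [labels[i] for i in train_idx]
    y_val = [labels[i] for i in val_idx]
    return X_train, X_val, y_train, y_val

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# dummy data with the same number of documents as the training set
dummy_X = list(range(33759))
dummy_y = [i % 10 for i in dummy_X]
X_tr, X_va, y_tr, y_va = holdout_split(dummy_X, dummy_y)
print(len(X_tr), len(X_va))
print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))
```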

## Naive Bayes

In [13]:
train_model(naive_bayes.MultinomialNB(), X_data_tfidf, y_data, X_test_tfidf, y_test, is_neuralnet=False)

Train accuracy:  0.880031596616529
Validation accuracy:  0.8690758293838863
Test accuracy:  0.8650666031405714


In [None]:
# note: MultinomialNB expects non-negative features, and SVD output can contain negative values
train_model(naive_bayes.MultinomialNB(), X_data_tfidf_ngram_svd, y_data, X_test_tfidf_ngram_svd, y_test, is_neuralnet=False)

In [None]:
train_model(naive_bayes.MultinomialNB(), X_data_tfidf_ngram_char_svd, y_data, X_test_tfidf_ngram_char_svd, y_test, is_neuralnet=False)

### Other type Naive Bayes

In [58]:
# use too much memory
# train_model(naive_bayes.GaussianNB(), X_data_tfidf.todense(), y_data, X_test_tfidf.todense(), y_test, is_neuralnet=False)

In [15]:
train_model(naive_bayes.BernoulliNB(), X_data_tfidf, y_data, X_test_tfidf, y_test, is_neuralnet=False)

Train accuracy:  0.8485995457986374
Validation accuracy:  0.8293838862559242
Test accuracy:  0.8531554602664125


In [16]:
train_model(naive_bayes.BernoulliNB(), X_data_tfidf_svd, y_data, X_test_tfidf_svd, y_test, is_neuralnet=False)

Train accuracy:  0.8087746437152354
Validation accuracy:  0.8033175355450237
Test accuracy:  0.8143449864014453


## Linear Classifier

In [17]:
train_model(linear_model.LogisticRegression(), X_data_tfidf, y_data, X_test_tfidf, y_test, is_neuralnet=False)

Train accuracy:  0.9473060593094823
Validation accuracy:  0.9167654028436019
Test accuracy:  0.9207511960772636


In [18]:
train_model(linear_model.LogisticRegression(), X_data_tfidf_svd, y_data, X_test_tfidf_svd, y_test, is_neuralnet=False)

Train accuracy:  0.9023467070401211
Validation accuracy:  0.8927725118483413
Test accuracy:  0.9046314493875688


## SVM Model

In [11]:
train_model(svm.SVC(), X_data_tfidf_svd, y_data, X_test_tfidf_svd, y_test, is_neuralnet=False)

Train accuracy:  0.43359773557581544
Validation accuracy:  0.4277251184834123
Test accuracy:  0.3908840053203105


## Bagging Model

In [12]:
train_model(ensemble.RandomForestClassifier(), X_data_tfidf_svd, y_data, X_test_tfidf_svd, y_test, is_neuralnet=False)

Train accuracy:  0.9962479017871836
Validation accuracy:  0.8311611374407583
Test accuracy:  0.834435114049193


## Boosting Model

In [13]:
train_model(xgboost.XGBClassifier(), X_data_tfidf_svd, y_data, X_test_tfidf_svd, y_test, is_neuralnet=False)



Train accuracy:  0.9042885824309647
Validation accuracy:  0.8696682464454977
Test accuracy:  0.8786850098266928




## Deep Neural Network

In [112]:
from keras.layers import *

In [24]:
def create_dnn_model():
    input_layer = Input(shape=(300,))
    layer = Dense(1024, activation='relu')(input_layer)
    layer = Dense(1024, activation='relu')(layer)
    layer = Dense(512, activation='relu')(layer)
    output_layer = Dense(10, activation='softmax')(layer)
    
    classifier = models.Model(input_layer, output_layer)
    classifier.compile(optimizer=optimizers.Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return classifier

In [48]:
classifier = create_dnn_model()
train_model(classifier=classifier, X_data=X_data_tfidf_svd, y_data=y_data_n, X_test=X_test_tfidf_svd, y_test=y_test_n, is_neuralnet=True)

Train on 30383 samples, validate on 3376 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Validation accuracy:  0.8969194312796208
Test accuracy:  0.9038770770055387


## Convolutional Neural Network 

In [49]:
def create_cnn_model():
    pass

## Recurrent Neural Network  

### LSTM 

In [68]:
def create_lstm_model():
    input_layer = Input(shape=(300,))
    
    layer = Reshape((10, 30))(input_layer)
    layer = LSTM(128, activation='relu')(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(128, activation='relu')(layer)
    
    output_layer = Dense(10, activation='softmax')(layer)
    
    classifier = models.Model(input_layer, output_layer)
    
    classifier.compile(optimizer=optimizers.Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return classifier
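
It may help to spell out what `Reshape((10, 30))` does to a 300-dimensional SVD vector. The sketch below assumes the usual row-major order, so the vector is cut into 10 consecutive chunks of 30 values and each chunk becomes one "time step". The helper `reshape_vector` is hypothetical. These time steps are groups of SVD components, not words in reading order, which is worth keeping in mind when comparing the recurrent models with the plain dense network.

```python
def reshape_vector(vector, timesteps, features):
    """Split a flat vector into `timesteps` consecutive rows of `features` values."""
    if len(vector) != timesteps * features:
        raise ValueError("vector length must equal timesteps * features")
    return [vector[t * features:(t + 1) * features] for t in range(timesteps)]

# stand-in for one 300-dimensional SVD document vector
dummy_vector = list(range(300))
steps = reshape_vector(dummy_vector, timesteps=10, features=30)
print(len(steps), len(steps[0]))
print(steps[1][:3])
```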

In [69]:
classifier = create_lstm_model()
train_model(classifier=classifier, X_data=X_data_tfidf_svd, y_data=y_data_n, X_test=X_test_tfidf_svd, y_test=y_test_n, is_neuralnet=True, n_epochs=10)

Train on 30383 samples, validate on 3376 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Validation accuracy:  0.7058649289099526
Test accuracy:  0.7002163857622139


### GRU 

In [70]:
def create_gru_model():
    input_layer = Input(shape=(300,))
    
    layer = Reshape((10, 30))(input_layer)
    layer = GRU(128, activation='relu')(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(128, activation='relu')(layer)
    
    output_layer = Dense(10, activation='softmax')(layer)
    
    classifier = models.Model(input_layer, output_layer)
    
    classifier.compile(optimizer=optimizers.Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return classifier

In [81]:
classifier = create_gru_model()
train_model(classifier=classifier, X_data=X_data_tfidf_svd, y_data=y_data_n, X_test=X_test_tfidf_svd, y_test=y_test_n, is_neuralnet=True, n_epochs=10)

Train on 30383 samples, validate on 3376 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Validation accuracy:  0.6519549763033176
Test accuracy:  0.6368689575764794


### Bidirectional RNN 

In [76]:
def create_brnn_model():
    input_layer = Input(shape=(300,))
    
    layer = Reshape((10, 30))(input_layer)
    layer = Bidirectional(GRU(128, activation='relu'))(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(128, activation='relu')(layer)
    
    output_layer = Dense(10, activation='softmax')(layer)
    
    classifier = models.Model(input_layer, output_layer)
    
    classifier.compile(optimizer=optimizers.Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return classifier

In [80]:
classifier = create_brnn_model()
train_model(classifier=classifier, X_data=X_data_tfidf_svd, y_data=y_data_n, X_test=X_test_tfidf_svd, y_test=y_test_n, is_neuralnet=True, n_epochs=20)

Train on 30383 samples, validate on 3376 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Validation accuracy:  0.8957345971563981
Test accuracy:  0.9022095170031564


## Recurrent Convolutional Neural Network 

In [13]:
def create_rcnn_model():
    input_layer = Input(shape=(300,))
    
    layer = Reshape((10, 30))(input_layer)
    layer = Bidirectional(GRU(128, activation='relu', return_sequences=True))(layer)    
    layer = Convolution1D(100, 3, activation="relu")(layer)
    layer = Flatten()(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(512, activation='relu')(layer)
    layer = Dense(128, activation='relu')(layer)
    
    output_layer = Dense(10, activation='softmax')(layer)
    
    classifier = models.Model(input_layer, output_layer)
    classifier.summary()
    classifier.compile(optimizer=optimizers.Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return classifier

In [110]:
classifier = create_rcnn_model()
train_model(classifier=classifier, X_data=X_data_tfidf_svd, y_data=y_data_n, X_test=X_test_tfidf_svd, y_test=y_test_n, is_neuralnet=True, n_epochs=20)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_35 (InputLayer)        (None, 300)               0         
_________________________________________________________________
reshape_31 (Reshape)         (None, 10, 30)            0         
_________________________________________________________________
bidirectional_16 (Bidirectio (None, 10, 256)           122112    
_________________________________________________________________
conv1d_10 (Conv1D)           (None, 8, 100)            76900     
_________________________________________________________________
flatten_1 (Flatten)          (None, 800)               0         
_________________________________________________________________
dense_105 (Dense)            (None, 512)               410112    
_________________________________________________________________
dense_106 (Dense)            (None, 512)               262656    
__________
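
The output shapes and parameter counts in the summary above can be checked by hand. The sketch below recomputes them with plain arithmetic; the helper functions are hypothetical, and the GRU formula assumes a single bias vector per gate, which is what the reported count of 122112 implies (newer Keras versions may use two bias vectors per gate by default and would report a larger number).

```python
def gru_params(input_dim, units):
    # 3 gates, each with an input kernel, a recurrent kernel and one bias vector
    return 3 * (units * (input_dim + units) + units)

def conv1d_params(kernel_size, in_channels, filters):
    # one kernel_size x in_channels weight block per filter, plus one bias per filter
    return kernel_size * in_channels * filters + filters

def dense_params(inputs, units):
    # weight matrix plus one bias per unit
    return inputs * units + units

bidirectional_gru = 2 * gru_params(30, 128)   # forward and backward GRU
conv_steps = 10 - 3 + 1                       # 'valid' convolution over 10 steps, kernel size 3
conv = conv1d_params(3, 2 * 128, 100)         # input channels = 2 * 128 GRU outputs
flattened = conv_steps * 100
dense_1 = dense_params(flattened, 512)
dense_2 = dense_params(512, 512)

print(bidirectional_gru, conv_steps, conv, flattened, dense_1, dense_2)
```

These values match the `Param #` and `Output Shape` columns for the layers that are visible in the summary.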

## Doc2Vec Model 

In [5]:
def get_corpus(documents):
    corpus = []
    
    for i in tqdm(range(len(documents))):
        doc = documents[i]
        
        words = doc.split(' ')
        tagged_document = gensim.models.doc2vec.TaggedDocument(words, [i])
        
        corpus.append(tagged_document)
        
    return corpus

In [6]:
train_corpus = get_corpus(X_data)


100%|██████████| 33759/33759 [00:01<00:00, 22059.46it/s]


In [7]:
test_corpus = get_corpus(X_test)

100%|██████████| 50373/50373 [00:02<00:00, 17527.29it/s]


#### Build Doc2Vec model 

In [8]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=300, min_count=2, epochs=40)
model.build_vocab(train_corpus)

In [9]:
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 27min 55s, sys: 14.6 s, total: 28min 9s
Wall time: 15min 9s


#### Get vector 

In [10]:
X_data_vectors = []
for x in train_corpus:
    vector = model.infer_vector(x.words)
    X_data_vectors.append(vector)

In [11]:
X_test_vectors = []
for x in test_corpus:
    vector = model.infer_vector(x.words)
    X_test_vectors.append(vector)

In [None]:
classifier = create_dnn_model()
train_model(classifier=classifier, X_data=np.array(X_data_vectors), y_data=y_data_n, X_test=np.array(X_test_vectors), y_test=y_test_n, is_neuralnet=True, n_epochs=5)