<a href="https://colab.research.google.com/github/MohanSuresh36/NLP-and-text-classification/blob/main/New_Category_predictions_MachineHack.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis - Machine Learning and Basic Deep Neural Network Models

As discussed earlier, sentiment analysis, also known as opinion analysis or opinion mining, is one of the most important applications of NLP. The key idea is to predict the sentiment of a body of text from its content. In this sub-unit, we explore supervised learning models.

![](https://github.com/dipanjanS/nlp_workshop_dhs18/blob/master/Unit%2012%20-%20Project%209%20-%20Sentiment%20Analysis%20-%20Supervised%20Learning/sentiment_cover.png?raw=1)

Another way to build a model that understands text content and predicts the sentiment of text-based reviews is supervised machine learning. More specifically, we will use classification models to solve this problem and build an automated sentiment text classification system in the subsequent sections. The major steps are as follows.

1.	Prepare train and test datasets (optionally a validation dataset)
2.	Pre-process and normalize text documents
3.	Feature Engineering 
4.	Model training
5.	Model prediction and evaluation

These are the major steps for building our system. Optionally, the last step is to deploy the model on your own server or in the cloud. The following figure shows a detailed workflow for building a standard text classification system with supervised learning (classification) models.

![](https://github.com/dipanjanS/nlp_workshop_dhs18/blob/master/Unit%2012%20-%20Project%209%20-%20Sentiment%20Analysis%20-%20Supervised%20Learning/sentiment_classifier_workflow.png?raw=1)


In our scenario, documents are the movie reviews and classes are the review sentiments, which can be either positive or negative, making this a binary classification problem. We will build models using both traditional machine learning methods and newer deep learning methods in the subsequent sections.
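
The five steps listed earlier can be sketched end to end on a toy corpus. This is a minimal illustration, not the pipeline built in the rest of the notebook: the documents, the names `toy_docs` and `toy_labels`, and the split ratio are all made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# 1. Prepare a tiny labelled corpus (hypothetical data; 1 = positive, 0 = negative)
toy_docs = [
    "a wonderful and moving film", "great acting and a great story",
    "wonderful story great film", "moving and wonderful acting",
    "a dull and boring film", "terrible acting and a boring story",
    "boring story terrible film", "dull and terrible acting",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

train_docs, test_docs, train_y, test_y = train_test_split(
    toy_docs, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42)

# 2-4. Normalize (the vectorizer lowercases), build TF-IDF features, train a classifier
clf = make_pipeline(TfidfVectorizer(lowercase=True), LogisticRegression())
clf.fit(train_docs, train_y)

# 5. Predict and evaluate on the held-out documents
toy_predictions = clf.predict(test_docs)
print("held-out accuracy:", clf.score(test_docs, test_y))
```

Wrapping the vectorizer and the classifier in one pipeline keeps the feature vocabulary fitted on the training documents only, which mirrors the `fit_transform` / `transform` split used later.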

In [None]:
!pip install contractions
!pip install textsearch
!pip install tqdm
import nltk
nltk.download('punkt')

Collecting contractions
  Downloading https://files.pythonhosted.org/packages/0a/04/d5e0bb9f2cef5d15616ebf68087a725c5dbdd71bd422bcfb35d709f98ce7/contractions-0.0.48-py2.py3-none-any.whl
Collecting textsearch>=0.0.21
  Downloading https://files.pythonhosted.org/packages/d3/fe/021d7d76961b5ceb9f8d022c4138461d83beff36c3938dc424586085e559/textsearch-0.0.21-py2.py3-none-any.whl
Collecting pyahocorasick
  Downloading https://files.pythonhosted.org/packages/7f/c2/eae730037ae1cbbfaa229d27030d1d5e34a1e41114b21447d1202ae9c220/pyahocorasick-1.4.2.tar.gz (321kB)
Collecting anyascii
  Downloading https://files.pythonhosted.org/packages/09/c7/61370d9e3c349478e89a5554c1e5d9658e1e3116cc4f2528f568909ebdf1/anyascii-0.1.7-py3-none-any.whl (260kB)
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... done

True

# Load and View Dataset

In [None]:
import pandas as pd

dataset = pd.read_excel('/content/Data_Train.xlsx')
test_dataset = pd.read_excel('/content/Data_Test.xlsx')
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7628 entries, 0 to 7627
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   STORY    7628 non-null   object
 1   SECTION  7628 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 119.3+ KB


In [None]:
dataset.head()

| | STORY | SECTION |
|---|---|---|
| 0 | But the most painful was the huge reversal in ... | 3 |
| 1 | How formidable is the opposition alliance amon... | 0 |
| 2 | Most Asian currencies were trading lower today... | 3 |
| 3 | If you want to answer any question, click on ‘... | 1 |
| 4 | In global markets, gold prices edged up today ... | 3 |


# Build Train and Test Datasets

In [None]:
# build train and test datasets
sentiments= dataset['SECTION'].values
reviews = dataset['STORY'].values

train_reviews = dataset['STORY']
train_sentiments = dataset['SECTION']

test_reviews = test_dataset['STORY']

In [None]:
from tensorflow.keras.utils import to_categorical

# one-hot encode the integer section labels
to_categorical(train_sentiments)

array([[0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.]], dtype=float32)
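
One-hot encoding of integer class ids can be reproduced with plain NumPy, which makes the output above easier to read. A minimal sketch; the `labels` array is a made-up sample, and four classes are assumed because the array above has four columns.

```python
import numpy as np

labels = np.array([3, 0, 3, 1])      # hypothetical class ids in the range 0..3
num_classes = 4

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype="float32")[labels]
print(one_hot)

# argmax along the class axis recovers the original ids
recovered = one_hot.argmax(axis=1)
```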

# Text Wrangling & Normalization

In [None]:
import contractions
from bs4 import BeautifulSoup
import numpy as np
import re
import tqdm
import unicodedata


def strip_html_tags(text):
  soup = BeautifulSoup(text, "html.parser")
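  # drop embedded iframe/script elements before extracting the text
  for s in soup(['iframe', 'script']):
    s.extract()
  stripped_text = soup.get_text()
  # collapse runs of carriage returns and newlines into a single newline
  stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)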
  return stripped_text

def remove_accented_chars(text):
  text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
  return text

def pre_process_corpus(docs):
  norm_docs = []
  for doc in tqdm.tqdm(docs):
    doc = strip_html_tags(doc)
    doc = doc.translate(doc.maketrans("\n\t\r", "   "))
    doc = doc.lower()
    doc = remove_accented_chars(doc)
    doc = contractions.fix(doc)
    # remove special characters; flags go by keyword because the fourth
    # positional argument of re.sub is count
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = re.sub(' +', ' ', doc)
    doc = doc.strip()  
    norm_docs.append(doc)
  
  return norm_docs

In [None]:
%%time

norm_train_reviews = pre_process_corpus(train_reviews)
norm_test_reviews = pre_process_corpus(test_reviews)

100%|██████████| 7628/7628 [00:01<00:00, 4132.47it/s]
100%|██████████| 2748/2748 [00:00<00:00, 4114.66it/s]

CPU times: user 2.46 s, sys: 49.9 ms, total: 2.51 s
Wall time: 2.52 s
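
One detail of `re.sub` is worth keeping in mind when reading the cleaning step: its signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag value passed as the fourth positional argument is interpreted as `count`. The sketch below uses a made-up string to show the difference; `re.I | re.A` has the integer value 258.

```python
import re

text = "!" * 300 + "abc"          # hypothetical input with 300 special characters
pattern = r'[^a-zA-Z0-9\s]'

flag_value = int(re.I | re.A)     # 2 | 256 = 258

# What a positional fourth argument amounts to: at most 258 replacements
limited = re.sub(pattern, '', text, count=flag_value)

# Passing the flags by keyword removes every match
cleaned = re.sub(pattern, '', text, flags=re.I | re.A)

print(len(limited), cleaned)
```

On short documents the two calls give the same result, which is why the difference is easy to miss.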





# Traditional Supervised Machine Learning Models

## Feature Engineering

In [None]:
%%time

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# build BOW features on train reviews
cv = CountVectorizer(binary=False, min_df=5, max_df=1.0, ngram_range=(1,2))
cv_train_features = cv.fit_transform(norm_train_reviews)


# build TFIDF features on train reviews
tv = TfidfVectorizer(use_idf=True, min_df=5, max_df=1.0, ngram_range=(1,2),
                     sublinear_tf=True)
tv_train_features = tv.fit_transform(norm_train_reviews)

CPU times: user 4.5 s, sys: 158 ms, total: 4.65 s
Wall time: 4.66 s


In [None]:
%%time

# transform test reviews into features
cv_test_features = cv.transform(norm_test_reviews)
tv_test_features = tv.transform(norm_test_reviews)

CPU times: user 847 ms, sys: 2.56 ms, total: 849 ms
Wall time: 850 ms


In [None]:
print('BOW model:> Train features shape:', cv_train_features.shape, ' Test features shape:', cv_test_features.shape)
print('TFIDF model:> Train features shape:', tv_train_features.shape, ' Test features shape:', tv_test_features.shape)

BOW model:> Train features shape: (7628, 31941)  Test features shape: (2748, 31941)
TFIDF model:> Train features shape: (7628, 31941)  Test features shape: (2748, 31941)
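
To see what the two vectorizers produce, here is a minimal sketch on a made-up three-document corpus (`toy_corpus` is not part of the dataset). It uses the same `ngram_range=(1, 2)` as above but leaves `min_df` at its default so that no term is filtered out of such a small corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_corpus = ["gold prices rise", "gold prices fall", "elections this week"]

bow = CountVectorizer(ngram_range=(1, 2))
bow_matrix = bow.fit_transform(toy_corpus)

tfidf = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
tfidf_matrix = tfidf.fit_transform(toy_corpus)

# Both vectorizers learn the same vocabulary of unigrams and bigrams
vocab = sorted(bow.vocabulary_)
print(vocab)
print(bow_matrix.shape, tfidf_matrix.shape)

# New documents are only transformed; terms not seen during fit are dropped
new_doc = bow.transform(["gold elections tomorrow"])
print(new_doc.sum())
```

The matching column counts are the reason the train and test feature matrices above share their second dimension: the vocabulary is fixed at `fit` time.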


## Model Training, Prediction and Performance Evaluation

### Try out Logistic Regression

Logistic regression is a statistical model developed by the statistician David Cox in
1958. It is also known as the logit or logistic model because it passes a weighted sum
of the features through the logistic (also called sigmoid) function to produce a class
probability. The parameters to estimate are the coefficients of all our features, chosen
so that the overall loss is minimized when predicting the outcome.
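
The logistic function and its multi-class generalization, softmax, can be sketched in a few lines of NumPy. The weights, bias, and feature values below are invented for illustration and are not coefficients learned by the model in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """Multi-class generalization: turns a score vector into probabilities."""
    shifted = scores - np.max(scores)       # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical coefficients and one feature vector
weights = np.array([0.8, -1.2, 0.5])
bias = 0.1
x = np.array([1.0, 0.0, 2.0])

p_positive = sigmoid(weights @ x + bias)                  # binary case
class_probs = softmax(np.array([2.0, 0.5, 0.1, -1.0]))    # four-class case
print(p_positive, class_probs)
```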

In [None]:
%%time

# Logistic Regression model on BOW features
from sklearn.linear_model import LogisticRegression

# instantiate model
lr = LogisticRegression(penalty='l2', max_iter=500, C=1, solver='lbfgs', random_state=42)

# train model
lr.fit(cv_train_features, train_sentiments)

# predict on test data
lr_bow_predictions = lr.predict(cv_test_features)

CPU times: user 4min 37s, sys: 1min 40s, total: 6min 18s
Wall time: 1min




In [None]:
from sklearn.metrics import confusion_matrix, classification_report

labels = ['negative', 'positive']
print(classification_report(test_sentiments, lr_bow_predictions))
pd.DataFrame(confusion_matrix(test_sentiments, lr_bow_predictions), index=labels, columns=labels)

              precision    recall  f1-score   support

    negative       0.90      0.90      0.90      7490
    positive       0.90      0.91      0.90      7510

    accuracy                           0.90     15000
   macro avg       0.90      0.90      0.90     15000
weighted avg       0.90      0.90      0.90     15000



| | negative | positive |
|---|---|---|
| negative | 6754 | 736 |
| positive | 711 | 6799 |
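
`classification_report` and `confusion_matrix` need true labels for the documents being predicted. When a test file comes without labels, one option is to hold out a stratified part of the labelled training data. A minimal sketch under that assumption; the random features and the names `X_demo` and `y_demo` are placeholders, not the notebook's data. With four classes the confusion matrix is 4x4.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_classes = 4
y_demo = np.repeat(np.arange(n_classes), 50)
# placeholder features: each class is clustered around its own centre
X_demo = rng.normal(size=(200, 5)) + y_demo[:, None] * 3.0

# keep 20% of the labelled data aside, preserving the class proportions
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

model = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
val_predictions = model.predict(X_val)

print(classification_report(y_val, val_predictions))
class_ids = list(range(n_classes))
cm = pd.DataFrame(confusion_matrix(y_val, val_predictions, labels=class_ids),
                  index=class_ids, columns=class_ids)
print(cm)
```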


In [None]:
%%time

# Logistic Regression model on TF-IDF features

# train model
lr.fit(tv_train_features, train_sentiments)

# predict on test data
lr_tfidf_predictions = lr.predict(tv_test_features)

CPU times: user 11.6 s, sys: 4.67 s, total: 16.2 s
Wall time: 2.84 s


In [None]:
labels = ['negative', 'positive']
print(classification_report(test_sentiments, lr_tfidf_predictions))
pd.DataFrame(confusion_matrix(test_sentiments, lr_tfidf_predictions), index=labels, columns=labels)

              precision    recall  f1-score   support

    negative       0.91      0.89      0.90      7490
    positive       0.90      0.91      0.90      7510

    accuracy                           0.90     15000
   macro avg       0.90      0.90      0.90     15000
weighted avg       0.90      0.90      0.90     15000



| | negative | positive |
|---|---|---|
| negative | 6694 | 796 |
| positive | 665 | 6845 |


### Try out Random Forest

Decision trees are a family of supervised machine learning algorithms that learn sets of
interpretable rules automatically from the underlying data. They use metrics like
information gain and the Gini index to build the tree. However, a major drawback of
decision trees is that, since they are non-parametric, the more data there is, the greater
the depth of the tree. We can end up with very large and deep trees that are prone to
overfitting. The model might work well on training data, but instead of learning general
patterns, it memorizes the training samples and builds rules that are very specific to
them. Hence, it performs poorly on the test data. Random forests try to tackle this problem.

A random forest is a meta-estimator, or ensemble model, that fits a number of decision
tree classifiers on various sub-samples of the dataset and uses averaging to improve
predictive accuracy and control over-fitting. By default in scikit-learn, each sub-sample
has the same size as the original input sample, but the samples are drawn with
replacement (bootstrap samples). The trees do not depend on one another, so they can
all be trained in parallel (bagging, or bootstrap aggregation). In addition, when
splitting a node during the construction of a tree, the split that is chosen is no longer
the best split among all features. Instead, it is the best split among a random subset of
the features.
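
The two sources of randomness described above, bootstrap samples and random feature subsets at each split, can be sketched by hand with individual decision trees. This is an illustrative approximation on synthetic data from `make_classification`, not how `RandomForestClassifier` is implemented internally; the number of trees (25) is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=42)
rng = np.random.default_rng(42)
n_samples = X.shape[0]

trees = []
for i in range(25):
    # bootstrap sample: same size as the data, drawn with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    # max_features='sqrt' makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# aggregate: average the per-tree class probabilities, then pick the most likely class
avg_proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)
print("training accuracy of the hand-built ensemble:", (ensemble_pred == y).mean())
```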

In [None]:
%%time

# Random Forest model on BOW features
from sklearn.ensemble import RandomForestClassifier

# instantiate model
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

# train model
rf.fit(cv_train_features, train_sentiments)

# predict on test data
rf_bow_predictions = rf.predict(cv_test_features)

CPU times: user 3min 33s, sys: 1.18 s, total: 3min 34s
Wall time: 31.8 s


In [None]:
labels = ['negative', 'positive']
print(classification_report(test_sentiments, rf_bow_predictions))
pd.DataFrame(confusion_matrix(test_sentiments, rf_bow_predictions), index=labels, columns=labels)

              precision    recall  f1-score   support

    negative       0.86      0.86      0.86      7490
    positive       0.86      0.86      0.86      7510

    accuracy                           0.86     15000
   macro avg       0.86      0.86      0.86     15000
weighted avg       0.86      0.86      0.86     15000



| | negative | positive |
|---|---|---|
| negative | 6406 | 1084 |
| positive | 1080 | 6430 |


In [None]:
%%time

# Random Forest model on TF-IDF features

# train model
rf.fit(tv_train_features, train_sentiments)

# predict on test data
rf_tfidf_predictions = rf.predict(tv_test_features)

CPU times: user 3min 21s, sys: 1.02 s, total: 3min 22s
Wall time: 30.6 s


In [None]:
labels = ['negative', 'positive']
print(classification_report(test_sentiments, rf_tfidf_predictions))
pd.DataFrame(confusion_matrix(test_sentiments, rf_tfidf_predictions), index=labels, columns=labels)

              precision    recall  f1-score   support

    negative       0.85      0.86      0.85      7490
    positive       0.86      0.84      0.85      7510

    accuracy                           0.85     15000
   macro avg       0.85      0.85      0.85     15000
weighted avg       0.85      0.85      0.85     15000



| | negative | positive |
|---|---|---|
| negative | 6458 | 1032 |
| positive | 1175 | 6335 |


# Newer Supervised Deep Learning Models

In [None]:
import gensim
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, Activation, Dense
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.utils import to_categorical

## Prediction class label encoding

In [None]:
# converting y data into categorical (one-hot encoding)
y_train = to_categorical(train_sentiments)

# tokenize train reviews
tokenized_train = [nltk.word_tokenize(text)
                       for text in norm_train_reviews]
# tokenize test reviews (no test labels are encoded in this notebook)
tokenized_test = [nltk.word_tokenize(text)
                      for text in norm_test_reviews]

In [None]:
# print a sample of the original and one-hot encoded train labels
print('Sample train label transformation:\n'+'-'*35,
      '\nActual Labels:', train_sentiments[:3].values, '\nEncoded Labels:\n', y_train[:3])

## Feature Engineering with word embeddings

In [None]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
%%time
# build word2vec model (gensim 3.x argument names; gensim 4 renamed
# size to vector_size and iter to epochs)
w2v_num_features = 300
w2v_model = gensim.models.Word2Vec(tokenized_train, size=w2v_num_features, window=150,
                                   min_count=10, workers=4, iter=5)    

2021-04-11 06:18:06,632 : INFO : collecting all words and their counts
2021-04-11 06:18:06,634 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-04-11 06:18:06,791 : INFO : collected 40121 word types from a corpus of 821529 raw words and 7628 sentences
2021-04-11 06:18:06,792 : INFO : Loading a fresh vocabulary
2021-04-11 06:18:06,829 : INFO : effective_min_count=10 retains 6846 unique words (17% of original 40121, drops 33275)
2021-04-11 06:18:06,830 : INFO : effective_min_count=10 leaves 746491 word corpus (90% of original 821529, drops 75038)
2021-04-11 06:18:06,852 : INFO : deleting the raw counts dictionary of 40121 items
2021-04-11 06:18:06,854 : INFO : sample=0.001 downsamples 36 most-common words
2021-04-11 06:18:06,856 : INFO : downsampling leaves estimated 572095 word corpus (76.6% of prior 746491)
2021-04-11 06:18:06,872 : INFO : estimated required memory for 6846 words and 300 dimensions: 19853400 bytes
2021-04-11 06:18:06,873 : INFO : resettin

CPU times: user 42.9 s, sys: 113 ms, total: 43 s
Wall time: 22.8 s


In [None]:
def averaged_word2vec_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index2word)
    
    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        
        for word in words:
            if word in vocabulary: 
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)

        return feature_vector

    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)
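
The averaging idea is easier to follow on a tiny example. The sketch below replaces the trained word2vec model with a hypothetical dictionary of 3-dimensional vectors (`toy_vectors`), so the arithmetic can be checked by hand; tokens missing from the vocabulary are skipped, and a document with no known tokens maps to a zero vector.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained word2vec model
toy_vectors = {
    "gold":   np.array([1.0, 0.0, 2.0]),
    "prices": np.array([3.0, 2.0, 0.0]),
}

def average_vectors(tokens, vectors, num_features):
    """Mean of the vectors of known tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(num_features)
    return np.mean(known, axis=0)

doc_vec = average_vectors(["gold", "prices", "surged"], toy_vectors, 3)
empty_vec = average_vectors(["unknown"], toy_vectors, 3)
print(doc_vec, empty_vec)
```

Averaging discards word order, so every document is reduced to a single fixed-length vector regardless of its length.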

In [None]:
# generate averaged word vector features from word2vec model
avg_wv_train_features = averaged_word2vec_vectorizer(corpus=tokenized_train, model=w2v_model,
                                                     num_features=w2v_num_features)
avg_wv_test_features = averaged_word2vec_vectorizer(corpus=tokenized_test, model=w2v_model,
                                                    num_features=w2v_num_features)

In [None]:
print('Word2Vec model:> Train features shape:', avg_wv_train_features.shape, ' Test features shape:', avg_wv_test_features.shape)

Word2Vec model:> Train features shape: (7628, 300)  Test features shape: (2748, 300)


## Modeling with deep neural networks 

### Building Deep neural network architecture

In [None]:
# import from tensorflow.keras to match the other layers used in this model
from tensorflow.keras.layers import BatchNormalization

In [None]:
def construct_deepnn_architecture(num_input_features):
    dnn_model = Sequential()
    dnn_model.add(Dense(512, input_shape=(num_input_features,), kernel_initializer='he_normal'))
    dnn_model.add(BatchNormalization())
    dnn_model.add(Activation('elu'))
    dnn_model.add(Dropout(0.2))
    
    dnn_model.add(Dense(256, kernel_initializer='he_normal'))
    dnn_model.add(BatchNormalization())
    dnn_model.add(Activation('elu'))
    dnn_model.add(Dropout(0.2))
    
    dnn_model.add(Dense(256, kernel_initializer='he_normal'))
    dnn_model.add(BatchNormalization())
    dnn_model.add(Activation('elu'))
    dnn_model.add(Dropout(0.2))
    
    dnn_model.add(Dense(4))
    dnn_model.add(Activation('softmax'))

    dnn_model.compile(loss='categorical_crossentropy', optimizer='adam',                 
                      metrics=['accuracy'])
    return dnn_model

In [None]:
w2v_dnn = construct_deepnn_architecture(num_input_features=w2v_num_features)

### Visualize sample deep architecture

In [None]:
w2v_dnn.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_8 (Dense)              (None, 512)               154112    
_________________________________________________________________
batch_normalization_3 (Batch (None, 512)               2048      
_________________________________________________________________
activation_8 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 256)               131328    
_________________________________________________________________
batch_normalization_4 (Batch (None, 256)               1024      
_________________________________________________________________
activation_9 (Activation)    (None, 256)              
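
The parameter counts in the summary can be checked by hand. A small sketch, assuming the usual formulas: a dense layer has one weight per input-output pair plus one bias per output, and batch normalization keeps four values per feature (scale, shift, moving mean, moving variance). The input size of 300 is `w2v_num_features`.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    """gamma, beta, moving mean and moving variance for each feature."""
    return 4 * n_features

first_dense = dense_params(300, 512)
first_bn = batchnorm_params(512)
second_dense = dense_params(512, 256)
second_bn = batchnorm_params(256)

print(first_dense, first_bn, second_dense, second_bn)
```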

### Model Training, Prediction and Performance Evaluation

In [None]:
batch_size = 100
w2v_dnn.fit(avg_wv_train_features, to_categorical(train_sentiments), epochs=50, batch_size=batch_size, 
            shuffle=True, validation_split=0.1, verbose=1)

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7f607fc30a50>

In [None]:
np.array(y_train )

array(<7628x8 sparse matrix of type '<class 'numpy.float64'>'
	with 30512 stored elements in Compressed Sparse Row format>, dtype=object)

In [None]:
# predict_classes is not available in recent Keras releases;
# take the argmax of the softmax output instead
y_pred = np.argmax(w2v_dnn.predict(avg_wv_test_features), axis=-1)
y_pred



In [None]:
submission = pd.read_excel('/content/Sample_submission.xlsx')
submission['SECTION'] = y_pred
submission.to_excel('/content/Sample_submission_w2v_CNN.xlsx',index=False)

In [None]:
y_pred.max()

3
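
A maximum of 3 is consistent with class ids 0 to 3: each predicted class is the index of the largest softmax probability in its row. A minimal sketch with made-up probabilities (`probs` is not model output):

```python
import numpy as np

# Hypothetical softmax outputs for three documents and four classes
probs = np.array([
    [0.10, 0.70, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.80],
    [0.60, 0.20, 0.10, 0.10],
])

# index of the most probable class in each row
class_ids = np.argmax(probs, axis=-1)
print(class_ids, class_ids.max())
```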

In [None]:
labels = ['negative', 'positive']
print(classification_report(test_sentiments, predictions))
pd.DataFrame(confusion_matrix(test_sentiments, predictions), index=labels, columns=labels)

              precision    recall  f1-score   support

    negative       0.87      0.88      0.88      7490
    positive       0.88      0.87      0.88      7510

    accuracy                           0.88     15000
   macro avg       0.88      0.88      0.88     15000
weighted avg       0.88      0.88      0.88     15000



| | negative | positive |
|---|---|---|
| negative | 6628 | 862 |
| positive | 984 | 6526 |


In [None]:
import tensorflow as tf

t = tf.keras.preprocessing.text.Tokenizer(oov_token='<UNK>')
# fit the tokenizer on the documents
t.fit_on_texts(norm_train_reviews)
t.word_index['<PAD>'] = 0

In [None]:
VOCAB_SIZE = len(t.word_index)

In [None]:
train_sequences = t.texts_to_sequences(norm_train_reviews)
test_sequences = t.texts_to_sequences(norm_test_reviews)
X_train = tf.keras.preprocessing.sequence.pad_sequences(train_sequences, maxlen=1000)
X_test = tf.keras.preprocessing.sequence.pad_sequences(test_sequences, maxlen=1000)
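
What `pad_sequences` does with its defaults (`padding='pre'`, `truncating='pre'`) can be sketched in plain NumPy: short sequences get zeros on the left, and long sequences keep only their last `maxlen` tokens. The helper `pad_pre` and the toy sequences are hypothetical and only imitate that behaviour.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value`; longer sequences keep their last `maxlen` tokens."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue                      # an empty sequence stays all padding
        trimmed = seq[-maxlen:]
        padded[row, -len(trimmed):] = trimmed
    return padded

toy_sequences = [[5, 8, 2], [7, 1, 4, 9, 3, 6]]
toy_padded = pad_pre(toy_sequences, maxlen=5)
print(toy_padded)
```

Index 0 is reserved for padding, which is why the tokenizer's word index is given a `<PAD>` entry mapped to 0.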

In [None]:
EMBEDDING_DIM = 300 # dimension for dense embeddings for each token
LSTM_DIM = 128 # total LSTM units

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM, input_length=1000))
model.add(tf.keras.layers.SpatialDropout1D(0.1))
model.add(tf.keras.layers.LSTM(LSTM_DIM, return_sequences=False))
model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dense(4, activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1000, 300)         12036900  
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 1000, 300)         0         
_________________________________________________________________
lstm (LSTM)                  (None, 128)               219648    
_________________________________________________________________
dense_12 (Dense)             (None, 256)               33024     
_________________________________________________________________
dense_13 (Dense)             (None, 4)                 1028      
Total params: 12,290,600
Trainable params: 12,290,600
Non-trainable params: 0
_________________________________________________________________
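
The embedding and LSTM parameter counts above follow from simple formulas. A sketch of the arithmetic: the vocabulary size of 40123 is inferred from the embedding row of the summary (12,036,900 / 300), and the LSTM formula assumes the standard four gates, each with input weights, recurrent weights, and a bias.

```python
vocab_size, embedding_dim, lstm_units = 40123, 300, 128

# one embedding_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# four gates, each with (input + recurrent) weights and a bias per unit
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# dense layers: weights plus biases
dense_hidden_params = lstm_units * 256 + 256
dense_output_params = 256 * 4 + 4

total_params = (embedding_params + lstm_params
                + dense_hidden_params + dense_output_params)
print(embedding_params, lstm_params, total_params)
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the model size.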


In [None]:
batch_size = 100
model.fit(X_train, to_categorical(train_sentiments), epochs=10, batch_size=batch_size, 
          shuffle=True, validation_split=0.1, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f602beeb5d0>

In [None]:
train_sequences[1]

[128,
 5069,
 8,
 2,
 362,
 185,
 168,
 44,
 1910,
 11116,
 6851,
 13107,
 4,
 1910,
 3464,
 6851,
 13108]

In [None]:
X_train[1]


array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,

In [None]:
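# predict_classes is not available in recent Keras releases;
# take the argmax of the softmax output instead
predictions = np.argmax(model.predict(X_test), axis=-1)
predictions[:10]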



array([1, 2, 1, 0, 1, 1, 1, 2, 1, 2])

In [None]:
submission = pd.read_excel('/content/Sample_submission.xlsx')
submission['SECTION'] = predictions
submission.to_excel('/content/Sample_submission_w2v_LSTM.xlsx',index=False)