keras-2.15.0 tensorflow-2.15.0

## **Introduction**

Classify on Stack Overflow into 3 categories depending on their quality.

This Case Study outlines 2 techniques to achieve the task of text classification:

1.   [Training Word Embedding](https://github.com/shraddha-an/nlp/blob/main/word_embedding_classification.ipynb)
2.   Pretrained GloVe Word Embeddings

This Colab Notebook focusses on the second task, Using the GloVe pre-trained Embedding.


**Dataset**: [Stack Overflow Questions](https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate)


## 1) **Data Preparation**

In [2]:
# Importing libraries
# Data Manipulation/ Handling
import pandas as pd, numpy as np

# Visualization
import seaborn as sb, matplotlib.pyplot as plt

# NLP libraries
import re
from nltk.corpus import stopwords
from gensim.utils import simple_preprocess

stop_words = set(stopwords.words('english'))

In [3]:
# Importing training & testing datasets
dataset = pd.read_csv('data/train.csv')[['Body', 'Y']].rename(columns = {'Body': 'question', 'Y': 'category'})
ds = pd.read_csv('data/valid.csv')[['Body', 'Y']].rename(columns = {'Body': 'question', 'Y': 'category'})

## **2) Preprocessing**

In [4]:
# Removing symbols, stopwords, punctuation
symbols = re.compile(pattern = '[/<>(){}\[\]\|@,;]')
tags = ['href', 'http', 'https', 'www']

def text_clean(s: str) -> str:
    """
    Removes unwanted symbols, punctuation and stop words from a given string.
    """
    s = symbols.sub(' ', s)
    for i in tags:
        s = s.replace(i, ' ')
    cleaned_text = ' '.join(word for word in simple_preprocess(s, deacc = True) if not word in stop_words)
    return cleaned_text

# Applying the function on the questions column
dataset.iloc[:, 0] = dataset.iloc[:, 0].apply(text_clean)
ds.iloc[:, 0] = ds.iloc[:, 0].apply(text_clean)

# Train & Test subsets
X_train, y_train = dataset.iloc[:, 0].values, dataset.iloc[:, 1].values.reshape(-1, 1)
X_test, y_test = ds.iloc[:, 0].values, ds.iloc[:, 1].values.reshape(-1, 1)

## **3) Categorical Encoding**

In [5]:
# One Hot Encoding the Categories Column
from sklearn.preprocessing import OneHotEncoder as ohe
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(transformers = [('one_hot_encoder', ohe(categories = 'auto'), [0])],
                       remainder = 'passthrough')

y_train = ct.fit_transform(y_train)
y_test = ct.transform(y_test)

## **4) Tokenization**

In [6]:
# Vectorizing our text corpus of questions
# Setting some paramters
vocab_size = 2100
glove_dim = 50
sequence_length = 300

# Tokenization with keras
from keras.preprocessing.text import Tokenizer

tk = Tokenizer(num_words = vocab_size)
tk.fit_on_texts(X_train)

X_train = tk.texts_to_sequences(X_train)
X_test = tk.texts_to_sequences(X_test)

# Padding all questions with zeros
from keras.preprocessing.sequence import pad_sequences

X_train_seq = pad_sequences(X_train, maxlen = sequence_length, padding = 'post')
X_test_seq = pad_sequences(X_test, maxlen = sequence_length, padding = 'post')


## **5) Building the Embedding Matrix**

https://github.com/stanfordnlp/GloVe

In [7]:
# Importing the 50-dimensional pre-trained embedding text file
path = 'data/glove.6B.50d.txt'

embeddings = {}

with open(path, 'r', encoding = 'utf-8') as f:
    for line in f:
      values = line.split()                                          # Each line in the file is a word + 50 integers denoting its vector.
      embeddings[values[0]] = np.array(values[1:], 'float32')        # The first element of every line is a word & the rest 50 are its array of integers.


# Building the embeddings matrix out of words present in our corpus
embedding_matrix = np.zeros((vocab_size, glove_dim))            # glove_dim = 50 as I chose to use the 50-D embedding; replace it with the one you choose.

word_index = tk.word_index
for word, index in word_index.items():
    if index < vocab_size:
        try:
          embedding_matrix[index] = embeddings[word]                  # If the embedding for the given word exists, retrieve it and map it to the word.
        except:
            pass

In [10]:
embedding_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.44411999, -0.67868   , -0.094546  , ...,  0.50032997,
         0.47428   ,  0.040882  ],
       [-0.27987   , -0.22764   , -0.061538  , ..., -0.21626   ,
         0.53468001,  0.15719   ],
       ...,
       [ 0.73847997,  0.049798  ,  0.80636001, ...,  0.2411    ,
        -0.89866   , -0.44416001],
       [ 0.98005998,  0.30903   , -0.87260002, ...,  0.085949  ,
        -0.77061999,  0.117     ],
       [-0.22017001,  0.34727001,  1.15690005, ...,  0.38277   ,
         0.64181   , -0.78127998]])

## **6) Embedding Model**


In [8]:
# Buidling & Training the NN + Embedding layer
from keras.models import Sequential
from keras.layers import Embedding, Dense, Flatten

model = Sequential()
model.add(Embedding(input_dim = vocab_size,
                    output_dim = glove_dim,
                    input_length = sequence_length))
model.add(Flatten())
model.add(Dense(units = 3, activation = 'softmax'))
model.compile(optimizer = 'adam', metrics = ['accuracy'], loss = 'categorical_crossentropy')

# Loading our pre-trained embedding matrix in the Embedding layer
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False  # Weights won't be updated while training.

model.summary()

# Training the model
history = model.fit(X_train_seq, y_train, epochs = 20, batch_size = 512, verbose = 1)

# Save the model
#model.save('model.h5')


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 300, 50)           105000    
                                                                 
 flatten (Flatten)           (None, 15000)             0         
                                                                 
 dense (Dense)               (None, 3)                 45003     
                                                                 
Total params: 150003 (585.95 KB)
Trainable params: 150003 (585.95 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 300, 50)           105000    
                                                         

## **6) Evaluating Performance**

In [9]:
# Evaluating model performance on test set
loss, accuracy = model.evaluate(X_test_seq, y_test, verbose = 1)
print("\nAccuracy: {}\nLoss: {}".format(accuracy, loss))


Accuracy: 0.7688666582107544
Loss: 0.6589033603668213
