Name: Aparna Iyer

PRN: 22070126017

Batch: 2022-2026

Branch: AI-ML A1


###**1. Title:** LSTM for Text Classification

###**2. Objectives:**

a. To study the architecture and functioning of Long Short-Term Memory (LSTM) networks.

b. To implement LSTM for a text classification dataset from Kaggle.

###**3. Theory:**

Long Short-Term Memory (LSTM) networks are a type of Recurrent Neural Network (RNN) specifically designed to overcome the vanishing gradient problem faced by traditional RNNs.

They are particularly well-suited for sequential data, such as text, due to their ability to retain important information over long sequences.

LSTM units contain gates (input, forget, and output) that control the flow of information, enabling the network to maintain and update memory over time.
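The gate mechanism described above can be made concrete with a single LSTM time step written in NumPy. This is a minimal sketch of the standard LSTM formulation, not the Keras implementation: the function name `lstm_step`, the weight layout (gates stacked in the order input, forget, candidate, output), and the toy dimensions are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (hypothetical helper, for illustration only).

    W: (4*hidden, input_dim), U: (4*hidden, hidden), b: (4*hidden,)
    Rows are stacked in the order: input, forget, candidate, output.
    """
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * hidden:1 * hidden])   # input gate: how much new information to write
    f = sigmoid(z[1 * hidden:2 * hidden])   # forget gate: how much old memory to keep
    g = np.tanh(z[2 * hidden:3 * hidden])   # candidate values for the cell state
    o = sigmoid(z[3 * hidden:4 * hidden])   # output gate: how much memory to expose
    c_t = f * c_prev + i * g                # additive update of the cell state
    h_t = o * np.tanh(c_t)                  # hidden state passed to the next step
    return h_t, c_t

# Toy dimensions (assumed): 8-dimensional inputs, 4 hidden units, 5 time steps
rng = np.random.default_rng(0)
input_dim, hidden = 8, 4
W = rng.normal(scale=0.1, size=(4 * hidden, input_dim))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
c = np.zeros(hidden)
toy_sequence = rng.normal(size=(5, input_dim))
for x_t in toy_sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)
```

The forget gate `f` scales the previous cell state and the input gate `i` scales the new candidate, which is how the cell keeps or overwrites its memory. Because the cell state is updated additively, gradients can pass across many time steps more easily than in a plain RNN.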

For text classification, LSTMs are effective in capturing word dependencies and context, helping to improve the accuracy of classification tasks, such as spam detection, sentiment analysis, and topic categorization.

In [None]:
!pip install contractions -qq

In [None]:
# NLP
import string, re, nltk
from string import punctuation
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
import spacy
import contractions

In [None]:
nltk.download("all")
!python -m spacy download en_core_web_sm

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import AdamW
from tensorflow.keras.utils import to_categorical

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
data = pd.read_csv("/content/abbrevations (2).csv",
                   names=['Labels','Description'])
data.head()

Unnamed: 0,Labels,Description
0,?,I have a question
1,?,I don’t understand what you mean
2,?4U,I have a question for you
3,;S,GeHmm? What did you say?
4,^^,read message


In [None]:
len(data)

data.dropna(inplace = True)
data.drop_duplicates(inplace = True)
data.reset_index(drop = True, inplace = True)

In [None]:
len(data)

1548

In [None]:
#Regular Expression

regexp = RegexpTokenizer(r"[\w']+")  # raw string avoids the invalid-escape warning for \w

#Lowercase
def text_lower(text):
  text = text.lower()
  return text

#Remove Whitespace
def remove_whitespace(text):
  text = text.strip()
  return text

#Remove Punctuation
def remove_punctuation(text):
  punct = string.punctuation
  punct = punct.replace("'","")
  text = text.translate(str.maketrans("", "",punct))
  return text

#Remove HTML
def remove_html(text):
  html = re.compile(r'<.*?>')
  text = html.sub(r'',text)
  return text

#Removing emojis

def remove_emoji(text):
  emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags
    u"\U00002702-\U000027B0"
    u"\U000024C2-\U0001F251"
    "]+", flags=re.UNICODE)

  text = emoji_pattern.sub(r'', text)
  return text  # the original omitted this return, so the function returned None

#Remove URLS
def remove_http_links(text):
  text = re.sub(r'https?://\S+', '', text)  # matches both http:// and https:// links
  return text

#Convert Abbreviated Words
abbrev = pd.read_csv('/content/abbrevations (2).csv',
                     names=['SMF','FF'])
abbrev.head()


Unnamed: 0,SMF,FF
0,?,I have a question
1,?,I don’t understand what you mean
2,?4U,I have a question for you
3,;S,GeHmm? What did you say?
4,^^,read message


In [None]:
abbrev_lower = pd.DataFrame()
abbrev_lower['SMF'] = abbrev['SMF'].apply(text_lower)
abbrev_lower['FF'] = abbrev['FF'].apply(text_lower)
abbrev_dict = dict(zip(list(abbrev_lower.SMF), list(abbrev_lower.FF)))
abbrev_words = list(abbrev_dict.keys())

def convert_abbrev(text):
  words = []
  for word in regexp.tokenize(text):
    if word in abbrev_words:
      words = words + abbrev_dict[word].split()
    else:
      words = words + word.split()
  text_converted = " ".join(words)
  return text_converted

#Convert Contractions like you're

def convert_contractions(text):
  text = contractions.fix(text)
  return text

#Remove Stopwords
stop_words = set(stopwords.words('english'))  # build the set once, not once per word

def remove_stopwords(text):
  text = " ".join([word for word in word_tokenize(text)
                   if word not in stop_words])
  return text

#Lemmatization

nlp = spacy.load("en_core_web_sm",
                 disable = ['parser', 'ner'])

def lemmatize(text):
  text = " ".join([token.lemma_ for token in nlp(text)])
  return text

#Remove Non-Alphabetic Characters
def discard_non_alpha(text):
  # Keep only the tokens that consist entirely of alphabetic characters
  word_list_non_alpha = [word for word in regexp.tokenize(text) if word.isalpha()]
  text = " ".join(word_list_non_alpha)
  return text

In [None]:
#Aggregating All definitions
def text_clean(text):
  text = text_lower(text)
  text = remove_whitespace(text)
  text = re.sub(r'\n', '', text)
  text = re.sub(r'\[.*?\]', '', text)
  text = remove_http_links(text)
  text = remove_html(text)         # strip tags before punctuation removal deletes '<' and '>'
  text = remove_punctuation(text)
  text = remove_emoji(text)
  text = convert_abbrev(text)
  text = convert_contractions(text)
  text = remove_stopwords(text)
  text = discard_non_alpha(text)
  text = lemmatize(text)
  return text

In [None]:
# Preprocessing: Tokenizing and padding sequences
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(data['Description'])  # The column 'Description' contains the text data
X = tokenizer.texts_to_sequences(data['Description'])
X = pad_sequences(X, maxlen=100)  # Padding sequences to ensure uniform length

# Prepare labels
y = pd.get_dummies(data['Labels']).values  # One-hot encode the 'Labels' column (one output unit per distinct label)
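To make the padding step concrete, the sketch below reproduces in plain NumPy what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): short sequences are left-padded with zeros, and long sequences keep only their last `maxlen` tokens. The helper name `pad_pre` and the toy token ids are assumptions for illustration. This is not the Keras source code.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad and left-truncate integer sequences to a fixed length (hypothetical helper)."""
    out = np.full((len(seqs), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(seqs):
        if not seq:
            continue                      # an empty sequence stays all padding
        trimmed = seq[-maxlen:]           # keep the last maxlen tokens
        out[row, -len(trimmed):] = trimmed
    return out

# Toy token ids (assumed), standing in for the output of texts_to_sequences
toy_sequences = [[4, 7], [1, 2, 3, 4, 5, 6], []]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
```

With `maxlen=100` and short descriptions, most positions in each row are padding zeros, so a smaller `maxlen` chosen from the observed length distribution may be worth considering.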



In [None]:
# Split data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Import the model class (TensorFlow and the Keras layers are already imported above)
from tensorflow.keras.models import Sequential


model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=128, input_length=100))
model.add(LSTM(units=128, return_sequences=False))
model.add(Dropout(0.5))
model.add(Dense(units=len(np.unique(data['Labels'])), activation='softmax'))  # Adjust units for output

# Use appropriate loss function based on your labels
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])



In [None]:
# Train the model
history = model.fit(X_train, y_train, epochs=25, batch_size=64, validation_data=(X_test, y_test))



Epoch 1/25
20/20 ━━━━━━━━━━━━━━━━━━━━ 10s 259ms/step - accuracy: 0.0153 - loss: 6.4274 - val_accuracy: 0.0000e+00 - val_loss: 10.2199
Epoch 2/25
20/20 ━━━━━━━━━━━━━━━━━━━━ 12s 330ms/step - accuracy: 0.0162 - loss: 6.4356 - val_accuracy: 0.0000e+00 - val_loss: 10.7967
Epoch 3/25
20/20 ━━━━━━━━━━━━━━━━━━━━ 5s 234ms/step - accuracy: 0.0224 - loss: 6.3638 - val_accuracy: 0.0000e+00 - val_loss: 10.6393
Epoch 4/25
20/20 ━━━━━━━━━━━━━━━━━━━━ 8s 387ms/step - accuracy: 0.0207 - loss: 6.2418 - val_accuracy: 0.0000e+00 - val_loss: 11.5944
Epoch 5/25
20/20 ━━━━━━━━━━━━━━━━━━━━ 8s 252ms/step - accuracy: 0.0351 - loss: 6.0942 - val_accuracy: 0.0000e+00 - val_loss: 11.4473
Epoch 6/25
20/20 ━━━━━━━━━━━━━━━━━━━━ 7s 379ms/step - accuracy: 0.0518 - loss: 5.9716 - val_accuracy: 0.0000e+00 - val_loss: 11.3

In [None]:
# Evaluate the model on test data
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {accuracy:.2f}')

10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 61ms/step - accuracy: 0.0148 - loss: 13.9106
Test Accuracy: 0.01


###**4. Conclusion:**
LSTMs are designed for sequential data in which context over long sequences matters, and in this experiment the model did fit patterns in the training text.

That fit did not carry over to unseen data. Training accuracy is 76.40%, while test accuracy is only about 1% (0.0148 in the evaluation run), and validation accuracy stayed at 0 in every logged epoch.

The reasons for this could be:

a. Overfitting: The model memorizes training data but fails to generalize to unseen data.

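b. Data Leakage: The tokenizer was fitted on the full dataset before the split, so test vocabulary influenced training. Leakage of this kind usually inflates test scores rather than lowering them, so it is unlikely to be the main cause here.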

c. Class Imbalance: Uneven class distribution can lead to poor performance on minority classes.

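d. Insufficient Representation: The test set might not reflect the training set's diversity. If many labels occur only once, a random split places classes in the test set that never appear in training, which would be consistent with the validation accuracy of exactly 0 in the training log.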

e. Hyperparameter Issues: Poorly tuned parameters can impair learning and generalization.

f. Data Quality: Noisy or incorrect data can mislead the model.
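One way to check reasons (c) and (d) is to count how many test labels never occur in the training split. The sketch below uses only the standard library. The function name `unseen_label_report` and the toy labels are assumptions for illustration, not results from this dataset.

```python
from collections import Counter

def unseen_label_report(train_labels, test_labels):
    """Summarise how well the training labels cover the test labels (hypothetical helper)."""
    train_counts = Counter(train_labels)
    unseen = [lab for lab in test_labels if lab not in train_counts]
    singletons = sum(1 for n in train_counts.values() if n == 1)
    return {
        "test_size": len(test_labels),
        "test_unseen": len(unseen),            # test samples whose class was never trained on
        "unseen_fraction": len(unseen) / len(test_labels) if test_labels else 0.0,
        "train_singletons": singletons,        # classes with exactly one training example
    }

# Toy labels (assumed) standing in for the real train/test label lists
toy_train = ["brb", "lol", "lol", "idk"]
toy_test = ["lol", "gtg", "afk"]
report = unseen_label_report(toy_train, toy_test)
print(report)
```

In this notebook the class indices could be recovered as plain lists with `y_train.argmax(axis=1).tolist()` and `y_test.argmax(axis=1).tolist()`. An unseen fraction close to 1 would mean the model cannot score above chance on the test set, whatever the hyperparameters.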



**Improvements:**

a. Use regularization techniques to prevent overfitting.

b. Ensure consistent preprocessing for both training and test sets.

c. Employ stratified sampling for balanced class representation.

d. Gather more diverse training data.

e. Tune hyperparameters for better model performance.

Addressing these issues can improve test accuracy and overall model performance.

Further tuning of hyperparameters and use of techniques such as regularization or pre-trained embeddings (like Word2Vec or GloVe) can improve model performance on more complex datasets.
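Improvement (c) can be sketched as a stratified split that keeps every test class represented in the training set. The version below is a standard-library illustration. The function name `stratified_split`, the rule that single-example classes stay entirely in training, and the toy labels are assumptions. In practice, `train_test_split` in scikit-learn offers a `stratify` argument, which is expected to raise an error when a class has only one member, so very rare classes would need to be merged or filtered out first.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split indices per class so each test class also appears in training (hypothetical helper)."""
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for lab, idxs in by_label.items():
        rng.shuffle(idxs)
        # Classes with a single example stay entirely in the training set
        n_test = int(round(len(idxs) * test_fraction)) if len(idxs) > 1 else 0
        n_test = min(n_test, len(idxs) - 1)    # always leave at least one training example
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (assumed): one common class, one smaller class, one singleton
toy_labels = ["a"] * 10 + ["b"] * 5 + ["c"]
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

The returned index lists can be used to slice `X` and `y` in place of a purely random split.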


In [None]:
!apt-get install texlive texlive-xetex texlive-latex-extra pandoc

In [None]:
!pip install pypandoc

In [None]:
!apt-get update
!apt-get install -y pandoc

In [None]:
!apt-get install -y texlive-xetex texlive-fonts-recommended texlive-plain-generic

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!jupyter nbconvert --to PDF "/content/drive/MyDrive/Colab Notebooks/DL_Lab_Experiment_9_AparnaIyer.ipynb"