<h1 style='color:green;'><center>Long short-term memory (LSTM)</center></h1>

<em>In this notebook, we train a Long Short-Term Memory (LSTM) model on the BBC news classification problem. The goal is to assign each news article to one of five predefined categories (business, entertainment, politics, sport, tech). We preprocess the text, using the article titles as model input, then build and train the LSTM model and evaluate its performance on a held-out test set.</em>

In [40]:
import pandas as pd 
import numpy as np 

df = pd.read_csv("../Assets/bbc-news-data.csv")
df.head()

   category  title                              content
0  business  Ad sales boost Time Warner profit  Quarterly profits at US media giant TimeWarne...
1  business  Dollar gains on Greenspan speech   The dollar has hit its highest level against ...
2  business  Yukos unit buyer faces loan claim  The owners of embattled Russian oil giant Yuk...
3  business  High fuel prices hit BA's profits  British Airways has blamed high fuel prices f...
4  business  Pernod takeover talk lifts Domecq  Shares in UK drinks and food firm Allied Dome...


In [41]:

df['category'].value_counts()


category
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64

In [73]:

import re

import keras as kr
# Import from the public Keras namespaces rather than the private keras.src modules
from keras.layers import SimpleRNN, Dense, Embedding, LSTM
from keras.models import Sequential
from keras.utils import pad_sequences
# Legacy text Tokenizer, imported via the TF-Keras compatibility namespace
from keras._tf_keras.keras.preprocessing.text import Tokenizer as tok
from sklearn.model_selection import train_test_split
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords


In [None]:
## Method to clean the text data in the DataFrame

# Build the stop-word set and the lemmatizer once, not once per word
STOP_WORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def text_preprocess(text):
    """Lowercase, keep letters only, drop stop words, and lemmatize each token."""
    if not text:
        return ""  # empty input gives an empty string rather than None
    clean_text = re.sub("[^a-zA-Z]", " ", text.lower())
    tokens = [lemmatizer.lemmatize(word) for word in clean_text.split() if word not in STOP_WORDS]
    return " ".join(tokens)
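
To make the cleaning step concrete, the sketch below runs only the lowercase and regex stages of `text_preprocess` on one of the titles from the first `df.head()` output, using nothing but the standard library. The helper name `letters_only_tokens` is hypothetical and not part of the notebook; stop-word removal and lemmatization are left out because they need the NLTK corpora.

```python
import re

def letters_only_tokens(text):
    """Lowercase the text, replace every non-letter with a space, and split on whitespace."""
    return re.sub("[^a-zA-Z]", " ", text.lower()).split()

tokens = letters_only_tokens("High fuel prices hit BA's profits")
print(tokens)
# ['high', 'fuel', 'prices', 'hit', 'ba', 's', 'profits']
```

The apostrophe splits `BA's` into `ba` and a stray `s`. The cleaned title in the notebook reads `high fuel price hit ba profit`, which is consistent with the stray `s` being removed by the stop-word filter (assuming NLTK's English stop-word list includes `s`) and the plurals being lemmatized.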
    
        



In [61]:
df['title'] = df['title'].apply(text_preprocess)

In [62]:
df.head()

   category  title                             content
0  business  ad sale boost time warner profit  Quarterly profits at US media giant TimeWarne...
1  business  dollar gain greenspan speech      The dollar has hit its highest level against ...
2  business  yukos unit buyer face loan claim  The owners of embattled Russian oil giant Yuk...
3  business  high fuel price hit ba profit     British Airways has blamed high fuel prices f...
4  business  pernod takeover talk lift domecq  Shares in UK drinks and food firm Allied Dome...


In [64]:
## Maximum title length in tokens; this value is used as the padding length below
df['title'].str.split().str.len().max()

7
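
Taking the maximum is the simplest way to pick a padding length, but it helps to look at the whole length distribution to see whether the maximum is representative or driven by a few outliers. A minimal standard-library sketch, using three cleaned titles from the `df.head()` output as stand-in data (the `titles` list and the `length_summary` helper are illustrative, not part of the notebook):

```python
from collections import Counter

def length_summary(texts):
    """Return (max_len, histogram) of whitespace-token counts for the given texts."""
    lengths = [len(t.split()) for t in texts]
    return max(lengths), Counter(lengths)

titles = [
    "ad sale boost time warner profit",
    "dollar gain greenspan speech",
    "high fuel price hit ba profit",
]
max_len, hist = length_summary(titles)
print(max_len)               # 6
print(sorted(hist.items()))  # [(4, 1), (6, 2)]
```

On the full dataset the same summary would be computed over `df['title']`; the numbers printed here describe only the three sample titles.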

In [67]:
X = df['title']
y = df['category']

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [68]:
x_train.shape, x_test.shape

((1557,), (668,))

In [69]:
y_train.shape, y_test.shape

((1557,), (668,))

In [72]:
y_train = pd.get_dummies(y_train)
y_test = pd.get_dummies(y_test)

y_test.head()

      business  entertainment  politics  sport  tech
414   True      False          False     False  False
420   True      False          False     False  False
1644  False     False          False     True   False
416   True      False          False     False  False
1232  False     False          True      False  False
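
`pd.get_dummies` derives its columns from the labels it sees, so encoding `y_train` and `y_test` separately only yields matching columns when every category occurs in both splits. The plain-Python sketch below shows the alternative of encoding against a fixed category list, which makes the column order explicit; `CATEGORIES` and `one_hot` are illustrative names, not part of the notebook.

```python
CATEGORIES = ["business", "entertainment", "politics", "sport", "tech"]

def one_hot(labels, categories=CATEGORIES):
    """Encode each label as a 0/1 row whose column order follows `categories`."""
    index = {name: i for i, name in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1  # raises KeyError on an unknown label
        rows.append(row)
    return rows

encoded = one_hot(["business", "sport", "politics"])
print(encoded)
# [[1, 0, 0, 0, 0], [0, 0, 0, 1, 0], [0, 0, 1, 0, 0]]
```

With pandas, a comparable effect can be had by converting the labels to a `pd.Categorical` with an explicit `categories` list before calling `pd.get_dummies`.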


In [88]:
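def tokenize_and_padding(input_text, max_length, tokenizer):
    """Map each text to a sequence of word indices and zero-pad it on the right to max_length."""
    text_seq = tokenizer.texts_to_sequences(input_text)
    return pad_sequences(text_seq, maxlen=max_length, padding="post")


# Fit the vocabulary once, on the training titles only, so nothing from the test set leaks into it
text_tok = tok()
text_tok.fit_on_texts(x_train)

x_train_tokens = tokenize_and_padding(input_text=x_train, max_length=7, tokenizer=text_tok)
# Encode the held-out titles (x_test, not x_train) with the same fitted tokenizer
x_test_tokens = tokenize_and_padding(input_text=x_test, max_length=7, tokenizer=text_tok)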
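
To show what tokenizing and padding do to a title, here is a dependency-free sketch of the same idea: build a word index from the training texts (the most frequent word gets index 1, and 0 is reserved for padding), map each text to a sequence of indices, and pad on the right. It is a simplified stand-in, not the Keras implementation; `build_word_index`, `to_sequences` and `pad_post` are hypothetical helpers, and words missing from the training vocabulary are simply dropped, which is assumed to mirror the Tokenizer's behavior when no `oov_token` is set.

```python
from collections import Counter

def build_word_index(texts):
    """Assign indices 1..N by descending word frequency (ties keep first-seen order)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace each known word with its index; words outside the vocabulary are dropped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_post(sequences, max_length):
    """Right-pad with zeros; longer sequences are cut at the end."""
    return [(seq + [0] * max_length)[:max_length] for seq in sequences]

train = ["dollar gain greenspan speech", "dollar hit high"]
test = ["dollar speech surprise"]

# The index is built from the training texts only and then reused for the test texts
word_index = build_word_index(train)
train_tokens = pad_post(to_sequences(train, word_index), max_length=5)
test_tokens = pad_post(to_sequences(test, word_index), max_length=5)
print(train_tokens)  # [[1, 2, 3, 4, 0], [1, 5, 6, 0, 0]]
print(test_tokens)   # [[1, 4, 0, 0, 0]]
```

The test title loses `surprise` because that word never appears in the training texts. Note that Keras' `pad_sequences` truncates from the front by default (`truncating='pre'`), whereas this sketch cuts at the end; the two agree whenever `max_length` is at least as long as the longest sequence.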

In [87]:
x_train_tokens


array([[1135,  191,  124, ...,    0,    0,    0],
       [ 452,  453,  656, ...,    0,    0,    0],
       [1138,  453, 1139, ...,    0,    0,    0],
       ...,
       [ 148,  180,    2, ..., 2683,  104,    0],
       [  15,   77,  389, ...,   27,    0,    0],
       [ 155,    8, 2685, ...,    0,    0,    0]])

In [89]:
x_test_tokens
