
## Implementation of Simple RNN



## Aim: To classify the news articles into product news and stock news.

### Dataset : https://www.kaggle.com/datasets/sulphatet/twitter-financial-news
### News Classification using RNN
### Labels:
* Product News(0) and Stock Commentary(1)


## Importing the libraries

In [None]:
import pandas as pd
import tensorflow as tf
import nltk
import re
import numpy as np

In [None]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [None]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [None]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import SimpleRNN, Dense, Dropout

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Loading the data

In [None]:
data =pd.read_csv("/content/drive/MyDrive/Colab Notebooks/DEEP LEARNING/classification_train_data.csv")
data.head()

Unnamed: 0,text,label
0,$HOUR flagging here below the squeeze level to...,1
1,$SPY closed just above 2 mo channel &amp; 10d ...,1
2,$VLCN going green.....,1
3,$QQQ - QQQ: It's Make It Or Break It For The S...,1
4,Nike college apparel will have 'faster speed t...,0


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5663 entries, 0 to 5662
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5663 non-null   object
 1   label   5663 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 88.6+ KB


In [None]:
data.isnull().sum()

text     0
label    0
dtype: int64

## Defining the X and Y features

In [None]:
X = data.drop("label", axis =1)
y = data["label"]

In [None]:
X.shape, y.shape

((5663, 1), (5663,))

## Defining the vocabulary size

In [None]:
vocab_size = 5000

In [None]:
news = X.copy()

## Data Preprocessing

In [None]:
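corpus = []
def data_preprocess(news):
  ps = PorterStemmer()
  # Build the stopword set once; calling stopwords.words() for every word is very slow.
  stop_words = set(stopwords.words("english"))

  for i in range(0, len(news)):
    # Replace every non-letter character with a space, then lowercase and tokenize
    text = re.sub("[^a-zA-Z]", " ", news["text"][i])
    text = text.lower()
    text = text.split()
    # Drop stopwords and stem the remaining tokens
    text = [ps.stem(word) for word in text if word not in stop_words]
    text = " ".join(text)
    corpus.append(text)
data_preprocess(news)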
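
To make the cleaning step concrete, here is a minimal sketch of what the regex and stopword filter do to a single string. The input sentence and the five-word stopword set are made up for illustration (the notebook uses NLTK's full English list), and stemming is left out so the sketch needs only the standard library.

```python
import re

# Hypothetical, tiny stopword set used only for this illustration.
STOP_WORDS = {"the", "is", "a", "to", "of"}

def clean_text(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

# Made-up example sentence, not taken from the dataset.
cleaned = clean_text("The $ABC stock is up 5% &amp; rising!")
print(cleaned)
```

Digits, `$` and `%` disappear, while the HTML entity `&amp;` leaves the token `amp` behind, so unescaping HTML before cleaning may be worth considering.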

## One-Hot Encoding

In [None]:
one_hot_rep =[]
def One_Hot(corpus):
  for words in corpus:
    onehot = one_hot(words, vocab_size)
    one_hot_rep.append(onehot)
  # one_hot_rep = [one_hot(words, vocab_size) for words in corpus]
One_Hot(corpus)
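
Despite its name, Keras `one_hot` does not return one-hot vectors: it hashes every word to an integer index between 1 and `vocab_size - 1` (index 0 is reserved), so two different words can end up with the same index. The sketch below imitates that idea with MD5 as an assumed stand-in hash so the result is reproducible; `hashed_index` and `encode` are hypothetical helper names, and this is not Keras's own implementation.

```python
import hashlib

def hashed_index(word, n):
    """Map a word to an integer in [1, n - 1] using a stable hash (MD5)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def encode(sentence, n):
    """Turn a whitespace-separated sentence into a list of hashed indices."""
    return [hashed_index(word, n) for word in sentence.split()]

# The repeated word receives the same index both times.
encoded = encode("vlcn go green vlcn", 5000)
print(encoded)
```

Because the mapping is a hash rather than a learned vocabulary, training and test text must be encoded with the same hash function and the same `vocab_size`.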

## Generating the padded sequences

In [None]:
# The one hot encoded sequences are of unequal length so we do padding.
sent_length = 20
embedded_docs = pad_sequences(one_hot_rep, padding="pre", maxlen =sent_length)
embedded_docs[0]

array([   0,    0,    0,    0,    0, 1522,  365, 3872, 1797, 1748, 3314,
       1471,  956, 2220, 3461, 1222, 4734, 4523, 3872, 2212], dtype=int32)
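
The effect of `padding="pre"` can be sketched in plain Python. The helper `pad_pre` is a hypothetical name; it assumes the Keras default `truncating="pre"`, under which a sequence longer than `maxlen` keeps its last `maxlen` items.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`; if the sequence is too long, keep the last `maxlen` items."""
    trimmed = seq[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

short = pad_pre([7, 8, 9], 5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], 5)
print(short)  # [0, 0, 7, 8, 9]
print(long)   # [3, 4, 5, 6, 7]
```

Padding on the left keeps the real tokens at the end of the sequence, closest to the final RNN state that feeds the output layer.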

## Model Building

In [None]:
embedding_vector_features = 40
model =Sequential()
model.add(Embedding(vocab_size,embedding_vector_features, input_length =sent_length ))
model.add(SimpleRNN(100))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["accuracy"])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 40)            200000    
                                                                 
 simple_rnn (SimpleRNN)      (None, 100)               14100     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 214,201
Trainable params: 214,201
Non-trainable params: 0
_________________________________________________________________
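
The parameter counts in the summary can be checked by hand. A SimpleRNN layer computes h_t = tanh(x_t W + h_(t-1) U + b), so it has an input kernel, a recurrent kernel and a bias. The variable names below are chosen for this sketch only.

```python
vocab_size = 5000
embedding_dim = 40
rnn_units = 100

# Embedding: one 40-dimensional vector per vocabulary index.
embedding_params = vocab_size * embedding_dim

# SimpleRNN: input kernel (40 x 100) + recurrent kernel (100 x 100) + bias (100).
rnn_params = rnn_units * (embedding_dim + rnn_units + 1)

# Dense: 100 weights + 1 bias for the single sigmoid output.
dense_params = rnn_units * 1 + 1

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
# 200000 14100 101 214201
```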


In [None]:
# Converting the padded sequences into an array
X_final = np.array(embedded_docs)
y_final = np.array(y)

In [None]:
X_final.shape, y_final.shape

((5663, 20), (5663,))

In [None]:
X_train,X_test, y_train,y_test = train_test_split(X_final, y_final, test_size=0.20, random_state=42)

In [None]:
model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs = 10, batch_size =50)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f97949884f0>

## Model Evaluation

In [None]:
y_pred= model.predict(X_test)



In [None]:
loss, accuracy = model.evaluate(X_test, y_test)
print("Loss: ", round(loss,2), "Accuracy: ","%.2f"%accuracy)

Loss:  0.36 Accuracy:  0.91
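
For reference, the two numbers reported by `evaluate` can be reproduced from predicted probabilities. The sketch below computes binary cross-entropy and accuracy at a 0.5 threshold in plain Python; the labels and probabilities are made-up values, not outputs of the model above.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Hypothetical labels and sigmoid outputs.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.1]
bce = binary_cross_entropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(round(bce, 4), acc)
```

Accuracy only looks at which side of the threshold a prediction falls on, while the loss also penalizes low-confidence correct answers, which is why the two metrics can move differently.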


## Adding Dropout Layer

In [None]:
model1 =Sequential()
model1.add(Embedding(vocab_size,embedding_vector_features, input_length =sent_length ))
model1.add(Dropout(0.3))
model1.add(SimpleRNN(100))
model1.add(Dropout(0.3))
model1.add(Dense(1, activation="sigmoid"))
model1.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["accuracy"])
model1.fit(X_train, y_train, validation_data = (X_test, y_test), epochs = 10, batch_size =50)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f97901737f0>

In [None]:
loss, accuracy = model1.evaluate(X_test, y_test)
print("Loss: ", round(loss,2), "Accuracy: ","%.2f"%accuracy)

Loss:  0.23 Accuracy:  0.93
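
What `Dropout(0.3)` does during training can be sketched as follows: each activation is zeroed with probability 0.3 and the survivors are scaled by 1 / (1 - 0.3), so the expected sum is unchanged. The function `dropout` and the seeded generator are illustrative assumptions, not the Keras implementation; at inference time the layer passes its input through unchanged.

```python
import random

def dropout(values, rate, rng):
    """Training-time (inverted) dropout: zero a fraction `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10
dropped = dropout(activations, 0.3, rng)
print(dropped)
```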


## Model Testing

In [None]:
test_data=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/DEEP LEARNING/classification_test_data.csv")
test_data.head()

Unnamed: 0,text,label
0,https://t.co/G0nmNITjcy is cutting 25% of its...,0
1,https://t.co/XNB7m39H8H Launches Fixed-term S...,0
2,#ALSI future monthly putting in a reversal. St...,1
3,#ALSI major constituent weightings within Top4...,1
4,#Consumerdiscretionary outperforms $XLY $TSLA ...,1


In [None]:
test_data.isnull().sum()

text     0
label    0
dtype: int64

In [None]:
info =test_data.drop("label", axis=1)
info

Unnamed: 0,text
0,https://t.co/G0nmNITjcy is cutting 25% of its...
1,https://t.co/XNB7m39H8H Launches Fixed-term S...
2,#ALSI future monthly putting in a reversal. St...
3,#ALSI major constituent weightings within Top4...
4,#Consumerdiscretionary outperforms $XLY $TSLA ...
...,...
1375,ZapBatt Partners with Toshiba to Unlock Proven...
1376,ZEDEDA Closes $26M Series B Funding Round as D...
1377,Zultys Receives 2022 Unified Communications Pr...
1378,Zymeworks Announces Plan to Become a Delaware ...


In [None]:
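corpus1 = []
def data_preprocess(news):
  ps = PorterStemmer()
  # Build the stopword set once; calling stopwords.words() for every word is very slow.
  stop_words = set(stopwords.words("english"))

  for i in range(0, len(news)):
    # Same cleaning as for the training data: letters only, lowercase, tokenize
    text = re.sub("[^a-zA-Z]", " ", news["text"][i])
    text = text.lower()
    text = text.split()
    # Drop stopwords and stem the remaining tokens
    text = [ps.stem(word) for word in text if word not in stop_words]
    text = " ".join(text)
    corpus1.append(text)
data_preprocess(info)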

In [None]:
one_hot_repr = [one_hot(words, vocab_size) for words in corpus1]
embedded_docu = pad_sequences(one_hot_repr, padding="pre", maxlen =sent_length)
embedded_docu[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,  798,
       1035, 1588, 2094, 3090, 2867, 1419, 4430, 4145, 4438], dtype=int32)

In [None]:
y_pred =model.predict(embedded_docu)



In [None]:
np.round(y_pred)

array([[0.],
       [0.],
       [1.],
       ...,
       [0.],
       [0.],
       [0.]], dtype=float32)

In [None]:
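# Note: no labels are passed here, so Keras has nothing to compare the predictions with
# and the returned loss and accuracy are not meaningful. Pass test_data["label"] as the
# second argument to score the model on this set.
model.evaluate(embedded_docu)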



[0.0, 0.0]
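
Scoring predictions requires the true labels. As a minimal sketch, the rounded predictions can be compared with the labels to get a confusion count and an accuracy. The labels and probabilities below are hypothetical, and `confusion_counts` is a helper written for this example; the `confusion_matrix` and `classification_report` functions imported from scikit-learn at the top of the notebook serve the same purpose.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for y, p in zip(y_true, y_prob):
        pred = int(p > threshold)
        key = ("t" if pred == y else "f") + ("p" if pred == 1 else "n")
        counts[key] += 1
    return counts

# Hypothetical labels (0 = product news, 1 = stock commentary) and model outputs.
labels = [0, 0, 1, 1, 1]
probs = [0.1, 0.7, 0.8, 0.3, 0.9]
counts = confusion_counts(labels, probs)
test_accuracy = (counts["tp"] + counts["tn"]) / len(labels)
print(counts, test_accuracy)
```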

# Conclusion:
* We used the Twitter financial news classification dataset to train a simple RNN. The text was preprocessed and each word was mapped to an integer index with Keras `one_hot`, using a vocabulary size of 5000. Because the model expects inputs of equal length, the sequences were padded to a length of 20. The embedding layer maps the 5000-word vocabulary into 40 dimensions (`embedding_vector_features`). The model consists of one embedding layer, one SimpleRNN layer and a sigmoid output layer. On the held-out 20% split, the model reached an accuracy of 0.91 (loss 0.36), and 0.93 (loss 0.23) after adding Dropout layers.

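* The model was also run on a separate test dataset, where it predicted both classes. However, `model.evaluate` was called on that set without the labels, so the returned `[0.0, 0.0]` does not measure how accurate those predictions are.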
