# Personalized News Aggregator Machine Learning Model
## Title of the project: Personalized News Aggregator and Sentiment Analyzer
### This is my final-year university project.
This model is the backbone of a website that aggregates news articles from various sources based on user preferences and performs sentiment analysis on them to provide users with personalized news content. The sentiment classifier in this notebook is trained on the IMDB dataset of 50,000 movie reviews.
#### Importing the necessary libraries:

In [1]:
# The following libraries are for data representation, manipulation and for numerical calculations.
import numpy as np 
import pandas as pd 

# Deep learning library used to build and train the recurrent neural network (RNN).
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Dropout

# To split the dataset into train and test sets and also to calculate the accuracy.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv


### Loading the dataset into a pandas DataFrame.

In [2]:
df = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')

### Preprocessing of the dataset.

In [3]:
df.describe()

| | review | sentiment |
|---|---|---|
| count | 50000 | 50000 |
| unique | 49582 | 2 |
| top | Loved today's show!!! It was a variety and not... | positive |
| freq | 5 | 25000 |


In [4]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [5]:
df.shape

(50000, 2)

In [6]:
df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [7]:
sentences = df['review'].values # Saving the Reviews separately.
labels = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0).values # Converting the label into binary values by applying a lambda function.
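
The sample rows printed by `df.head()` contain raw `<br />` tags, and the notebook passes the reviews to the tokenizer unchanged. One optional cleanup step, sketched below under the assumption that the tags carry no sentiment information, is to strip HTML before tokenizing. The helper name `strip_html` and the sample string are hypothetical and not part of the original notebook.

```python
import html
import re

# Hypothetical helper, not part of the original notebook.
_TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text):
    """Replace HTML tags with spaces, unescape entities, and collapse whitespace."""
    no_tags = _TAG_RE.sub(" ", text)
    unescaped = html.unescape(no_tags)
    return " ".join(unescaped.split())

# Made-up sample text in the style of the dataset.
sample = "Great film.<br /><br />Loved it &amp; would watch again."
cleaned = strip_html(sample)
print(cleaned)
```

In the notebook this could be applied as `sentences = [strip_html(s) for s in sentences]` before the tokenizer is fitted. Whether it changes the accuracy is untested here.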

#### Tokenization and Padding.

In [8]:
tokenizer = Tokenizer(num_words = 5000) # Initializing the tokenizer so that only the most frequent words (those with an index below 5000) are converted into integers.
tokenizer.fit_on_texts(sentences) # Fitting the tokenizer on the reviews to build the word index.
sequences = tokenizer.texts_to_sequences(sentences) # Converting each review into a sequence of integers; rarer words are dropped.
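
To make the tokenization step concrete, here is a minimal pure-Python sketch of the idea: rank words by frequency, give the most frequent word index 1, and drop any word whose index is not below `num_words`. It is an illustration only, not the Keras implementation; the regular expression, the function names and the toy sentences are assumptions.

```python
from collections import Counter
import re

WORD_RE = re.compile(r"[a-z0-9']+")  # simplified word pattern (assumption)

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping words whose index is >= num_words."""
    result = []
    for text in texts:
        words = WORD_RE.findall(text.lower())
        result.append([word_index[w] for w in words
                       if w in word_index and word_index[w] < num_words])
    return result

toy_texts = ["the movie was good", "the movie was bad", "the plot"]
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index, num_words=4)
print(toy_index)      # {'the': 1, 'movie': 2, 'was': 3, 'good': 4, 'bad': 5, 'plot': 6}
print(toy_sequences)  # [[1, 2, 3], [1, 2, 3], [1]]
```

With `num_words=4` only indices 1 to 3 survive, so the sentiment-bearing words "good" and "bad" disappear from the toy sequences. The vocabulary size is a trade-off between coverage and model size.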

In [9]:
# Padding the sequences so that all of them have the same length (200): shorter reviews are zero-padded at the front and longer ones are truncated.
X = pad_sequences(sequences, maxlen = 200) 
y = np.array(labels)
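
With its default settings, `pad_sequences` pads and truncates at the start of each sequence. The sketch below mirrors that behaviour for a single list in plain Python; `pad_pre` is a hypothetical name used only for illustration.

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short_padded = pad_pre([7, 8, 9], maxlen=5)
long_truncated = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short_padded)    # [0, 0, 7, 8, 9]
print(long_truncated)  # [3, 4, 5, 6]
```

Because word indices start at 1, the value 0 is free to act as the padding marker. Truncating from the front means that only the final 200 tokens of a long review are kept.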

### Splitting the data.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 21) # 80 - 20 split of the dataset.
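
The split above produces only a training set and a test set. If a separate validation set is wanted, for example so that early stopping is not tuned on the same data that is later used to report test accuracy, a three-way split is one option. The following is a stdlib-only sketch of the idea; the function name and the 70/10/20 proportions are assumptions, and calling `train_test_split` a second time on the training portion would achieve the same with scikit-learn.

```python
import random

def three_way_split(n_samples, val_fraction=0.1, test_fraction=0.2, seed=21):
    """Return disjoint, shuffled index lists for train, validation and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(50000)
print(len(train_idx), len(val_idx), len(test_idx))  # 35000 5000 10000
```

The index lists can then be used to slice the arrays, for example `X[train_idx]` and `y[train_idx]`.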

### Building the RNN.

The recurrent layer is a SimpleRNN with 64 units, and L2 (ridge) regularization is applied to the layer kernels to help prevent overfitting.
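
As a rough illustration of what L2 regularization adds to the loss, the sketch below computes the penalty `factor * sum(w^2)` for a small made-up kernel. The function name and the toy weights are assumptions; in the notebook Keras computes this term internally for each regularized kernel.

```python
def l2_penalty(weights, factor=0.01):
    """Return factor * sum of squared weights for a 2-D list of weights."""
    return factor * sum(w * w for row in weights for w in row)

toy_kernel = [[0.5, -1.0], [2.0, 0.0]]
penalty = l2_penalty(toy_kernel)
print(round(penalty, 4))  # 0.01 * (0.25 + 1.0 + 4.0 + 0.0) = 0.0525
```

Large weights are penalized quadratically, which pushes the optimizer toward smaller weights.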

In [11]:
model = Sequential() # Initializing an empty sequential model.
model.add(Embedding(input_dim = 5000, output_dim = 128)) # Mapping each of the 5000 vocabulary indices to a dense 128-dimensional vector.
model.add(SimpleRNN(64, return_sequences = False, kernel_regularizer = regularizers.l2(0.01))) # Adding a 64-unit simple RNN that returns only its last output.
model.add(Dropout(0.7)) # Drops 70% of the units randomly during training to prevent overfitting.
model.add(Dense(32, activation = 'relu', kernel_regularizer = regularizers.l2(0.01))) # Adding a 32-unit layer with the ReLU activation function.
model.add(Dropout(0.7)) # Same 70% dropout after the dense layer.
model.add(Dense(1, activation = 'sigmoid')) # Output layer with a single unit and the sigmoid activation function.
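
As a sanity check on the architecture, the number of trainable parameters can be worked out by hand from the layer sizes in the cell above. The arithmetic below assumes the standard layouts: an embedding matrix, an RNN with input kernel, recurrent kernel and bias, and dense layers with biases. The variable names are illustrative only.

```python
vocab_size, embed_dim, rnn_units, dense_units = 5000, 128, 64, 32

embedding_params = vocab_size * embed_dim                # one 128-d vector per vocabulary index
rnn_params = rnn_units * (embed_dim + rnn_units + 1)     # input kernel + recurrent kernel + bias
dense_params = rnn_units * dense_units + dense_units     # weights + bias
output_params = dense_units * 1 + 1                      # weights + bias for the single sigmoid unit
total_params = embedding_params + rnn_params + dense_params + output_params

print(embedding_params, rnn_params, dense_params, output_params)  # 640000 12352 2080 33
print(total_params)  # 654465
```

Dropout layers have no parameters. Almost all of the weights sit in the embedding layer, so the vocabulary size and embedding dimension dominate the model size.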

#### Compiling the model using the Adam optimizer.

In [12]:
model.compile(optimizer = Adam(learning_rate = 0.0001), loss = 'binary_crossentropy', metrics = ['accuracy'])
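
For reference, binary cross-entropy for a label `y` and predicted probability `p` is `-(y*log(p) + (1-y)*log(1-p))`, averaged over the batch. The sketch below is a plain-Python illustration with made-up inputs. The clipping constant `eps` is an assumption and this is not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

uninformed = binary_crossentropy([1, 0], [0.5, 0.5])
confident = binary_crossentropy([1, 0], [0.9, 0.1])
print(round(uninformed, 4))  # 0.6931
print(round(confident, 4))   # 0.1054
```

A model that always predicts 0.5 scores about 0.693. The training log further down starts at a loss of about 1.9 with roughly 50% accuracy, so the excess is presumably the L2 penalty terms, which Keras adds to the reported loss.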

### Training the model.

In [13]:
early_stopping = EarlyStopping(monitor = 'val_loss', patience = 3, restore_best_weights = True)

history = model.fit(X_train, y_train, 
                    epochs = 15, batch_size = 128, 
                    validation_data = (X_test, y_test),
                    callbacks = [early_stopping])

Epoch 1/15
313/313 ━━━━━━━━━━━━━━━━━━━━ 16s 32ms/step - accuracy: 0.4933 - loss: 1.9010 - val_accuracy: 0.4979 - val_loss: 1.6095
Epoch 2/15
313/313 ━━━━━━━━━━━━━━━━━━━━ 6s 20ms/step - accuracy: 0.5036 - loss: 1.5438 - val_accuracy: 0.5144 - val_loss: 1.3330
Epoch 3/15
313/313 ━━━━━━━━━━━━━━━━━━━━ 6s 20ms/step - accuracy: 0.5125 - loss: 1.2773 - val_accuracy: 0.5535 - val_loss: 1.1209
Epoch 4/15
313/313 ━━━━━━━━━━━━━━━━━━━━ 6s 20ms/step - accuracy: 0.5563 - loss: 1.0754 - val_accuracy: 0.6928 - val_loss: 0.9038
Epoch 5/15
313/313 ━━━━━━━━━━━━━━━━━━━━ 6s 20ms/step - accuracy: 0.6367 - loss: 0.9409 - val_accuracy: 0.6948 - val_loss: 0.8663
Epoch 6/15
313/313 ━━━━━━━━━━━━━━━━━━━━ 6s 20ms/step - accuracy: 0.6951 - loss: 0.8283 - val_accuracy: 0.7536 - val_loss: 0.7736
Epoch 7/15
313/313

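The `EarlyStopping` callback used above stops training once `val_loss` has failed to improve for `patience` consecutive epochs and, with `restore_best_weights = True`, rolls the model back to the best epoch. The sketch below reproduces only that bookkeeping in plain Python. The function name and the loss values are made up for illustration and are not taken from the training run.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

toy_losses = [0.90, 0.80, 0.75, 0.78, 0.77, 0.76, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(toy_losses)
print(stop_epoch, best_epoch)  # 6 3
```

In the toy run, training stops at epoch 6 and the weights from epoch 3 would be restored, even though epoch 7 would have been better. A larger `patience` trades extra training time for a lower chance of stopping too early.
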
### Evaluating the model.

In [14]:
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {accuracy:.4f}')

313/313 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.8725 - loss: 0.3969
Test Accuracy: 0.8709


#### Making Predictions.

In [15]:
y_pred = (model.predict(X_test) > 0.5).astype("int32")
print(f'Accuracy Score: {accuracy_score(y_test, y_pred):.4f}')

313/313 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step
Accuracy Score: 0.8709
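
Thresholding the sigmoid output at 0.5 and calling `accuracy_score` gives the same 0.8709 as `model.evaluate`, as expected. Accuracy alone does not show which kind of error the model makes, so precision and recall can be a useful complement. The sketch below computes all three from probabilities in plain Python; `binary_metrics` and the toy data are hypothetical, and no such figures were reported for the actual model.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Threshold probabilities and return accuracy, precision and recall."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

toy_true = [1, 0, 1, 1, 0, 0]
toy_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]
toy_metrics = binary_metrics(toy_true, toy_prob)
print({name: round(value, 4) for name, value in toy_metrics.items()})
# {'accuracy': 0.6667, 'precision': 0.6667, 'recall': 0.6667}
```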


### Saving the model and tokenizer.

In [16]:
import joblib

# Making sure the output directory exists before saving (os is imported in the first cell).
os.makedirs('models', exist_ok = True)

# Saving the model in .h5 format with Keras.
model.save('models/rnn_model.h5')

# Saving the tokenizer using joblib.
joblib.dump(tokenizer, 'models/tokenizer.pkl')

['models/tokenizer.pkl']
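
Once the two files are saved, the website backend would need to reload them and apply the same preprocessing to new text. The sketch below shows one way such an inference helper could look. `predict_sentiment` and the stub classes are hypothetical: the stubs exist only so that the example runs without TensorFlow, and the commented lines show how the real objects would presumably be loaded.

```python
def predict_sentiment(text, tokenizer, model, pad, maxlen=200, threshold=0.5):
    """Tokenize, pad and classify one text; returns (label, probability)."""
    sequences = tokenizer.texts_to_sequences([text])
    padded = pad(sequences, maxlen=maxlen)
    probability = float(model.predict(padded)[0][0])
    label = "positive" if probability > threshold else "negative"
    return label, probability

# With the real artifacts this would presumably be used as:
#   model = tf.keras.models.load_model('models/rnn_model.h5')
#   tokenizer = joblib.load('models/tokenizer.pkl')
#   label, prob = predict_sentiment("some article text", tokenizer, model, pad_sequences)

# Stand-ins so the sketch runs on its own; they are NOT the real objects.
class StubTokenizer:
    def texts_to_sequences(self, texts):
        return [[len(word) for word in text.split()] for text in texts]

class StubModel:
    def predict(self, batch):
        return [[0.8] for _ in batch]

def stub_pad(sequences, maxlen):
    return [([0] * maxlen + seq)[-maxlen:] for seq in sequences]

result = predict_sentiment("a fine film", StubTokenizer(), StubModel(), stub_pad, maxlen=5)
print(result)  # ('positive', 0.8)
```

The `maxlen` used at inference has to match the value used during training (200), otherwise the model receives inputs of a different length than it was trained on.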