# News Classification Notebook

### Built with: Python, Numpy, Pandas and Tensorflow

[Fake vs Real News Dataset](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset/data)

In [1]:
import numpy as np
import pandas as pd
from tensorflow.keras import layers, models, optimizers, losses
import tensorflow as tf

Set Random Seed so we can replicate the results.

In [2]:
np.random.seed(2)
tf.random.set_seed(2)

Read data from CSV files

In [4]:
fake = pd.read_csv('Fake.csv')
real = pd.read_csv('True.csv')

Next, we extract the top 10000 rows to optimize space.

In [5]:
fake = fake.iloc[:10000]
real = real.iloc[:10000]

In [6]:
fake.shape, real.shape

((10000, 4), (10000, 4))

In [7]:
fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


Next, we drop the title, subject and date fields.

In [8]:
fake.drop(['title', 'subject', 'date'], axis=1, inplace=True)
real.drop(['title', 'subject', 'date'], axis=1, inplace=True)

Next, we add the appropriate labels. 0 for Fake news and 1 for Real news.

In [9]:
fake['label'] = 0
real['label'] = 1

In [10]:
real.head()

Unnamed: 0,text,label
0,WASHINGTON (Reuters) - The head of a conservat...,1
1,WASHINGTON (Reuters) - Transgender people will...,1
2,WASHINGTON (Reuters) - The special counsel inv...,1
3,WASHINGTON (Reuters) - Trump campaign adviser ...,1
4,SEATTLE/WASHINGTON (Reuters) - President Donal...,1


Next, we concatenate both real and fake news into one.

In [11]:
ds = pd.concat([real, fake])

In [12]:
fake = np.nan
real = np.nan

In [13]:
ds.reset_index(drop=True, inplace=True)

In [14]:
ds

Unnamed: 0,text,label
0,WASHINGTON (Reuters) - The head of a conservat...,1
1,WASHINGTON (Reuters) - Transgender people will...,1
2,WASHINGTON (Reuters) - The special counsel inv...,1
3,WASHINGTON (Reuters) - Trump campaign adviser ...,1
4,SEATTLE/WASHINGTON (Reuters) - President Donal...,1
...,...,...
19995,Here s a 1999 video of Jesse Jackson praising ...,0
19996,Check out what s happening in Texas! President...,0
19997,Jesse Jackson thinks he s Heaven s gatekeeper ...,0
19998,The media will lose it again because Melania T...,0


In [15]:
ds = np.array(ds)

Next, we shuffle the data

In [16]:
np.random.shuffle(ds)

In [17]:
ds

array([['WASHINGTON (Reuters) - Two senior U.S. House of Representatives Republicans said on Friday they have agreed on terms for the reauthorization of the deeply indebted National Flood Insurance Program. House Majority Whip Steve Scalise of Louisiana and House Financial Services Committee Chairman Jeb Hensarling of Texas said in a statement: “The bill we support will begin to make the flood insurance program more stable and sustainable for the people who count on it. We look forward to bringing this legislation to the House soon and urge our colleagues to support it.” They did not provide any details of the agreement. Lawmakers are wrestling with how to handle the flood insurance program’s expiration on Dec. 8. It is at least $24.6 billion in debt to the U.S. Treasury and likely to face billions of dollars in additional costs due to Hurricanes Harvey and Irma, which struck Texas and Florida in recent weeks. The program was extended 17 times between 2008 and 2012 and lapsed four time

Next, we split the data into train, validation and test sets.

In [18]:
num_of_samples = len(ds)

In [19]:
train_size = int(0.8 * num_of_samples)
valid_size = int(0.1 * num_of_samples)
test_size = num_of_samples - train_size - valid_size

In [20]:
print(train_size, valid_size, test_size)

16000 2000 2000


In [21]:
text_ds = np.array([obv[0] for obv in ds])
label_ds = np.array([obv[1] for obv in ds])

In [22]:
ds = np.nan

In [23]:
train_text = text_ds[:train_size]
train_label = label_ds[:train_size]

valid_text = text_ds[train_size:train_size+valid_size]
valid_label = label_ds[train_size:train_size+valid_size]

test_text = text_ds[train_size+valid_size:]
test_label = label_ds[train_size+valid_size:]

In [24]:
text_ds = np.nan
label_ds = np.nan

Next, we batch the data so the model is fed the data in batches.

In [25]:
train_ds = tf.data.Dataset.from_tensor_slices((train_text, train_label))
valid_ds = tf.data.Dataset.from_tensor_slices((valid_text, valid_label))

In [26]:
BATCH_SIZE = 16

train_ds = train_ds.cache().batch(batch_size=BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
valid_ds = valid_ds.cache().batch(batch_size=BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

Next, we vectorize the data. We train the vectorizer on the train and validation texts.

In [27]:
VOCAB_SIZE = 20000

encoder = layers.TextVectorization(max_tokens=VOCAB_SIZE)
encoder.adapt(train_ds.map(lambda text, label: text))
encoder.adapt(valid_ds.map(lambda text, label: text))

Next, we construct the model.

In [38]:
model = models.Sequential()

model.add(encoder)

model.add(layers.Embedding(input_dim=VOCAB_SIZE, output_dim=30))

model.add(layers.Bidirectional(layers.LSTM(20)))

model.add(layers.Dense(20, activation='relu'))

model.add(layers.Dense(1))

In [39]:
model.compile(
    optimizer = optimizers.Adam(),
    loss = losses.BinaryCrossentropy(from_logits=True),
    metrics = ['accuracy']
)

In [40]:
epochs = 5

model.fit(
    train_ds,
    epochs = epochs,
    validation_data = valid_ds,
    verbose = 2
)

Epoch 1/5


1000/1000 - 516s - 516ms/step - accuracy: 0.9729 - loss: 0.0462 - val_accuracy: 0.9990 - val_loss: 0.0040
Epoch 2/5
1000/1000 - 502s - 502ms/step - accuracy: 0.9999 - loss: 0.0013 - val_accuracy: 0.9995 - val_loss: 0.0031
Epoch 3/5
1000/1000 - 497s - 497ms/step - accuracy: 0.9999 - loss: 6.2773e-04 - val_accuracy: 0.9995 - val_loss: 0.0033
Epoch 4/5
1000/1000 - 502s - 502ms/step - accuracy: 0.9999 - loss: 2.7534e-04 - val_accuracy: 0.9995 - val_loss: 0.0039
Epoch 5/5
1000/1000 - 495s - 495ms/step - accuracy: 0.9999 - loss: 3.8688e-04 - val_accuracy: 0.9990 - val_loss: 0.0052


<keras.src.callbacks.history.History at 0x2319944aed0>

After 5 epochs, we have a training accuracy of 99.99% and a validation accuracy of 99.90%. These are extremely good numbers. Training loss is 0.0037 and validation loss is 0.0052.

In [41]:
model.summary()

In [42]:
model.save('model.keras')

In [43]:
test_ds = tf.data.Dataset.from_tensor_slices((test_text, test_label))
test_ds = test_ds.cache().batch(batch_size=BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

In [44]:
accuracy, accuracy = model.evaluate(test_ds)

[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 102ms/step - accuracy: 0.9992 - loss: 0.0045


After testing the data on the test set, we have an accuracy of 99.92% and a loss of 0.0045.