<a href="https://colab.research.google.com/github/Sandesh816/Deep-Learning-Project/blob/main/News_Bias_Detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**NEWS BIAS DETECTOR**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

**Load Data**

In [24]:
filename = "allsides_balanced_news_headlines-texts.csv"

import requests
url = 'https://raw.githubusercontent.com/irgroup/Qbias/refs/heads/main/allsides_balanced_news_headlines-texts.csv'
res = requests.get(url, allow_redirects=True)
with open(filename,'wb') as file:
    file.write(res.content)

df = pd.read_csv(filename)
print("Shape:", df.shape)
print(df.head(10))
print(f"Columns: {df.columns}")

Shape: (21754, 7)
   Unnamed: 0                                              title  \
0           0           Gun Violence Over Fourth of July Weekend   
1           1           Gun Violence Over Fourth of July Weekend   
2           2           Gun Violence Over Fourth of July Weekend   
3           3  Yellen Warns Congress of 'Economic Recession' ...   
4           4  Yellen Warns Congress of 'Economic Recession' ...   
5           5  Yellen Warns Congress of 'Economic Recession' ...   
6           6                       Night 2: Christie on Hillary   
7           7                       Night 2: Christie on Hillary   
8           8                       Night 2: Christie on Hillary   
9           9  Denying Abortion Medication Could Violate Civi...   

                                                tags  \
0  ['Protests', 'Fourth Of July', 'Gun Control An...   
1  ['Protests', 'Fourth Of July', 'Gun Control An...   
2  ['Protests', 'Fourth Of July', 'Gun Control An...   
3  ['Jane

**Drop 'Unnamed: 0' column and reset index**

In [25]:
df.drop(columns= ["Unnamed: 0"], inplace = True)
df.reset_index(drop = True, inplace = True)
print(df.columns)

Index(['title', 'tags', 'heading', 'source', 'text', 'bias_rating'], dtype='object')


In [26]:
print(df.head())

                                               title  \
0           Gun Violence Over Fourth of July Weekend   
1           Gun Violence Over Fourth of July Weekend   
2           Gun Violence Over Fourth of July Weekend   
3  Yellen Warns Congress of 'Economic Recession' ...   
4  Yellen Warns Congress of 'Economic Recession' ...   

                                                tags  \
0  ['Protests', 'Fourth Of July', 'Gun Control An...   
1  ['Protests', 'Fourth Of July', 'Gun Control An...   
2  ['Protests', 'Fourth Of July', 'Gun Control An...   
3  ['Janet Yellen', 'Debt Ceiling', 'Economic Pol...   
4  ['Janet Yellen', 'Debt Ceiling', 'Economic Pol...   

                                             heading                 source  \
0  Chicago Gun Violence Spikes and Increasingly F...  New York Times (News)   
1  ‘Bullets just came from nowhere’: Fourth of Ju...        Chicago Tribune   
2  Dozens of shootings across US mark bloody July...   New York Post (News)   
3  Federal

**Data Exploration**

In [27]:
# Explore number of labeled articles in each bucket
left_df = df[df["bias_rating"] == "left"]
right_df = df[df["bias_rating"] == "right"]
center_df = df[df["bias_rating"] == "center"]

print(f"Left: {left_df.shape}")
print(f"Right: {right_df.shape}")
print(f"Center: {center_df.shape}")

Left: (10275, 6)
Right: (7226, 6)
Center: (4253, 6)


In [28]:
# Explore word counts across articles
left_word_count = sum(left_df["text"].fillna("").str.split().apply(len))
right_word_count = sum(right_df["text"].fillna("").str.split().apply(len))
center_word_count = sum(center_df["text"].fillna("").str.split().apply(len))

print("Left Leaning Articles Word Count:", left_word_count)
print("Right Leaning Articles Word Count:", right_word_count)
print("Center Leaning Articles Word Count:", center_word_count)

Left Leaning Articles Word Count: 658761
Right Leaning Articles Word Count: 481681
Center Leaning Articles Word Count: 301266


The total word counts differ considerably across classes (about 659k for left, 482k for right, and 301k for center), mostly because the classes differ in size.

In [29]:
# Analyze the average lengths of left, right, and center leaning articles to check quality of dataset
average_length_left = left_word_count / left_df.shape[0]
average_length_right = right_word_count / right_df.shape[0]
average_length_center = center_word_count / center_df.shape[0]

print("Average length of left-leaning articles:", average_length_left)
print("Average length of right-leaning articles:", average_length_right)
print("Average length of center-leaning articles:", average_length_center)

Average length of left-leaning articles: 64.11299270072993
Average length of right-leaning articles: 66.65942430113479
Average length of center-leaning articles: 70.83611568304725
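The three per-class cells above could also be folded into a single grouped summary. The sketch below is only an illustration: `word_count_summary` is a hypothetical helper, and the toy frame stands in for the real `df` with its `text` and `bias_rating` columns.

```python
import pandas as pd

def word_count_summary(frame: pd.DataFrame,
                       text_col: str = "text",
                       label_col: str = "bias_rating") -> pd.DataFrame:
    """Per-class article count, total words, and mean/median words per article."""
    # Missing texts become empty strings, which split into zero words.
    words = frame[text_col].fillna("").str.split().str.len()
    return words.groupby(frame[label_col]).agg(
        articles="count",
        total_words="sum",
        mean_words="mean",
        median_words="median",
    )

# Toy rows standing in for the real dataset (hypothetical data).
toy = pd.DataFrame({
    "text": ["a b c", "d e", None, "f g h i"],
    "bias_rating": ["left", "left", "right", "center"],
})
summary = word_count_summary(toy)
print(summary)
```

Reporting the median next to the mean would show whether a few long articles are pulling a class average up.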


In [30]:
# Explore tag lengths across articles
left_tags_count = sum(left_df["tags"].fillna("").str.split(",").apply(len))
right_tags_count = sum(right_df["tags"].fillna("").str.split(",").apply(len))
center_tags_count = sum(center_df["tags"].fillna("").str.split(",").apply(len))

print("Left Leaning Articles Tags Count:", left_tags_count)
print("Right Leaning Articles Tags Count:", right_tags_count)
print("Center Leaning Articles Tags Count:", center_tags_count)

Left Leaning Articles Tags Count: 35898
Right Leaning Articles Tags Count: 27676
Center Leaning Articles Tags Count: 19801


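The total tag counts are closer across classes than the word counts: left-leaning articles carry about 1.8 times as many tags as center articles, versus about 2.2 times as many words.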
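Splitting the `tags` string on commas only approximates the number of tags, because each cell is a stringified Python list: an empty cell still counts as one, and a tag containing a comma counts as two. A minimal sketch of a stricter count, assuming every cell parses as a list literal; `count_tags` is a hypothetical helper, and returning 0 for unparseable cells is an assumption, not something the dataset guarantees.

```python
import ast
import pandas as pd

def count_tags(cell) -> int:
    """Count tags in a cell holding a stringified list such as "['A', 'B']"."""
    if not isinstance(cell, str) or not cell.strip():
        return 0  # missing or empty cell
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return 0  # assumption: treat unparseable cells as having no tags
    return len(parsed) if isinstance(parsed, (list, tuple)) else 0

# Toy cells standing in for df["tags"] (hypothetical data).
toy_tags = pd.Series([
    "['Protests', 'Fourth Of July']",
    "[]",
    None,
    "['Taxes, Federal']",  # one tag that contains a comma
])
tag_counts = toy_tags.apply(count_tags)
print(tag_counts.tolist())  # [2, 0, 0, 1]
```

On the real data this would be applied as `df["tags"].apply(count_tags)` and summed per class.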

**Data Preparation**

In [31]:
# Convert words to lowercase in all columns
df = df.map(lambda x: x.lower() if isinstance(x, str) else x)
print(df.head())

                                               title  \
0           gun violence over fourth of july weekend   
1           gun violence over fourth of july weekend   
2           gun violence over fourth of july weekend   
3  yellen warns congress of 'economic recession' ...   
4  yellen warns congress of 'economic recession' ...   

                                                tags  \
0  ['protests', 'fourth of july', 'gun control an...   
1  ['protests', 'fourth of july', 'gun control an...   
2  ['protests', 'fourth of july', 'gun control an...   
3  ['janet yellen', 'debt ceiling', 'economic pol...   
4  ['janet yellen', 'debt ceiling', 'economic pol...   

                                             heading                 source  \
0  chicago gun violence spikes and increasingly f...  new york times (news)   
1  ‘bullets just came from nowhere’: fourth of ju...        chicago tribune   
2  dozens of shootings across us mark bloody july...   new york post (news)   
3  federal

In [None]:
# Shuffle df and split into X and Y
# all_indices = np.arange(df.shape[0])
# np.random.shuffle(all_indices)

# test_size = int(0.2 * df.shape[0])
# validation_size = int(0.1 * df.shape[0])

# test_indices = all_indices[: test_size]
# validation_indices = all_indices[test_size: test_size + validation_size]
# train_indices = all_indices[test_size + validation_size: ]

# X_train = df.iloc[train_indices].drop(columns = ["bias_rating"]).reset_index(drop = True)
# X_validation = df.iloc[validation_indices].drop(columns = ["bias_rating"]).reset_index(drop = True)
# X_test = df.iloc[test_indices].drop(columns = ["bias_rating"]).reset_index(drop = True)

# y_train = df.iloc[train_indices]["bias_rating"].reset_index(drop = True)
# y_validation = df.iloc[validation_indices]["bias_rating"].reset_index(drop = True)
# y_test = df.iloc[test_indices]["bias_rating"].reset_index(drop = True)

# print(f"X_train shape: {X_train.shape}")
# print(f"X_validation shape: {X_validation.shape}")
# print(f"X_test shape: {X_test.shape}")
# print(f"y_train shape: {y_train.shape}")
# print(f"y_validation shape: {y_validation.shape}")
# print(f"y_test shape: {y_test.shape}")

# print(X_train.head())
# print(y_train.head())

In [32]:
# Shuffle df and split into X and Y
# df = df.sample(frac = 1) ## don't really need it as train_test_split will shuffle all rows
X = df[['title', 'heading', 'text']]
y = df["bias_rating"]
print(X.head())
print(y.head())

                                               title  \
0           gun violence over fourth of july weekend   
1           gun violence over fourth of july weekend   
2           gun violence over fourth of july weekend   
3  yellen warns congress of 'economic recession' ...   
4  yellen warns congress of 'economic recession' ...   

                                             heading  \
0  chicago gun violence spikes and increasingly f...   
1  ‘bullets just came from nowhere’: fourth of ju...   
2  dozens of shootings across us mark bloody july...   
3  federal government will run out of cash on oct...   
4  yellen tells congress that u.s. will run out o...   

                                                text  
0  as yasmin miller drove home from a laundromat ...  
1  as many chicagoans were celebrating the fourth...  
2  the nation’s 4th of july weekend was marred by...  
3  treasury secretary janet yellen on tuesday war...  
4  treasury secretary janet yellen on tuesday tol..

In [33]:
# Split into train (70%), test (15%), and validation (15%) sets, stratified so each split keeps the same class ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42, shuffle= True)
X_test, X_valid, y_test, y_valid = train_test_split(X_test, y_test, test_size=0.5, stratify=y_test, random_state=42, shuffle= True)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"X_valid shape: {X_valid.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
print(f"y_valid shape: {y_valid.shape}")

print(X_train.head(2))
print(y_train.head(2))

X_train shape: (15227, 3)
X_test shape: (3263, 3)
X_valid shape: (3264, 3)
y_train shape: (15227,)
y_test shape: (3263,)
y_valid shape: (3264,)
                                  title  \
9300           clinton, greenpeace spat   
18177  trump taps rick perry for energy   

                                                 heading  \
9300   fact checking the clinton-sanders spat over bi...   
18177         trump picks rick perry as energy secretary   

                                                    text  
9300   “i have money from people who work for fossil-...  
18177  president-elect donald trump will pick former ...  
9300      left
18177    right
Name: bias_rating, dtype: object
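The printed shapes follow from the two chained splits: 30% is held out first, then that holdout is halved into test and validation. The sketch below reproduces the arithmetic, assuming scikit-learn rounds the held-out size up to the next integer (which is consistent with the shapes above); `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (first part, second part) sizes for a fractional test_size."""
    n_second = math.ceil(test_size * n_samples)  # held-out part, rounded up
    return n_samples - n_second, n_second

n_total = 21754
n_train, n_holdout = split_sizes(n_total, 0.3)   # first train_test_split call
n_test, n_valid = split_sizes(n_holdout, 0.5)    # second call on the holdout
print(n_train, n_test, n_valid)  # 15227 3263 3264
```

The result is roughly a 70/15/15 split, with the validation set one row larger than the test set.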


Resetting indices

In [34]:
X_train = X_train.reset_index(drop = True)
X_test = X_test.reset_index(drop = True)
X_valid = X_valid.reset_index(drop = True)
y_train = y_train.reset_index(drop = True)
y_test = y_test.reset_index(drop = True)
y_valid = y_valid.reset_index(drop = True)

print(X_train.head(2))
print(y_train.head(2))

                              title  \
0          clinton, greenpeace spat   
1  trump taps rick perry for energy   

                                             heading  \
0  fact checking the clinton-sanders spat over bi...   
1         trump picks rick perry as energy secretary   

                                                text  
0  “i have money from people who work for fossil-...  
1  president-elect donald trump will pick former ...  
0     left
1    right
Name: bias_rating, dtype: object




**Milestone 2**



In [23]:
import tensorflow as tf
import huggingface_hub
from transformers import AutoTokenizer, TFBertModel

In [35]:
# We will use the BERT model as our baseline model
model_name = "bert-base-uncased"
model = TFBertModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [36]:
# Tokenize the first article's text for sanity check
test_tokens = tokenizer.tokenize(X_train["text"][0])
test_token_ids = tokenizer.convert_tokens_to_ids(test_tokens)
print(test_tokens)
print(test_token_ids)

['“', 'i', 'have', 'money', 'from', 'people', 'who', 'work', 'for', 'fossil', '-', 'fuel', 'companies', '.', 'i', 'am', 'so', 'sick', '—', 'i', 'am', 'so', 'sick', 'of', 'the', 'sanders', 'campaign', 'lying', 'about', 'me', '.', '”', '—', 'hillary', 'clinton', ',', 'to', 'a', 'green', '##pe', '##ace', 'activist', ',', 'march', '31', ',', '2016', '“', 'the', 'fact', 'of', 'the', 'matter', 'is', 'secretary', 'clinton', 'has', 'taken', 'significant', 'money', 'from', 'the', 'fossil', 'fuel', 'industry', '.', 'she', 'raises', 'her', 'money', 'with', 'a', 'super', 'pac', '.', 'she', 'gets', 'a', 'lot', 'of', 'money', 'from', 'wall', 'street', ',', 'from', 'the', 'drug', 'companies', 'and', 'fossil', 'fuel', 'industry', '.', '”']
[1523, 1045, 2031, 2769, 2013, 2111, 2040, 2147, 2005, 10725, 1011, 4762, 3316, 1012, 1045, 2572, 2061, 5305, 1517, 1045, 2572, 2061, 5305, 1997, 1996, 12055, 3049, 4688, 2055, 2033, 1012, 1524, 1517, 18520, 7207, 1010, 2000, 1037, 2665, 5051, 10732, 7423, 1010, 223

In [37]:
# Analyze the token count across the articles after tokenizing all text
left_token_count = sum(len(tokenizer.tokenize(str(text))) for text in left_df["text"] if pd.notna(text))
right_token_count = sum(len(tokenizer.tokenize(str(text))) for text in right_df["text"] if pd.notna(text))
center_token_count = sum(len(tokenizer.tokenize(str(text))) for text in center_df["text"] if pd.notna(text))

print("Left-leaning articles token count:", left_token_count)
print("Right-leaning articles token count:", right_token_count)
print("Center-leaning articles token count:", center_token_count)

Left-leaning articles token count: 837527
Right-leaning articles token count: 620797
Center-leaning articles token count: 388367
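Combining these totals with the article and word counts from the Data Exploration section gives a rough sense of how much the WordPiece tokenizer expands the text and how long a typical article is in tokens. This is a back-of-the-envelope sketch; the per-article figure is approximate because rows with missing text are skipped in the token count but included in the article count.

```python
# Totals copied from the notebook outputs above.
counts = {
    "left":   {"articles": 10275, "words": 658761, "tokens": 837527},
    "right":  {"articles": 7226,  "words": 481681, "tokens": 620797},
    "center": {"articles": 4253,  "words": 301266, "tokens": 388367},
}

ratios = {}
for label, c in counts.items():
    ratios[label] = {
        "tokens_per_word": c["tokens"] / c["words"],
        "tokens_per_article": c["tokens"] / c["articles"],
    }
    print(f"{label}: {ratios[label]['tokens_per_word']:.2f} tokens/word, "
          f"{ratios[label]['tokens_per_article']:.1f} tokens/article")
```

If the averages stay well under 100 tokens per article, as these totals suggest, a much shorter maximum sequence length than BERT's 512-token limit may be enough for the `text` column.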


**Trying BERT Baseline Model**

We follow the Keras Hub `BertTextClassifier` example to train on a 1,000-article subset of our training data: https://keras.io/keras_hub/api/models/bert/bert_text_classifier/

In [38]:
import keras_hub
import tensorflow as tf
from tensorflow import keras
import numpy as np

# Grab our train data
features = X_train.copy()
features = features[:1000]
features = list(features["text"].astype(str))

# Grab our train labels and map string labels to numerical labels
label_mapping = {'left': 0, 'center': 1, 'right': 2}
labels = np.array([label_mapping[label] for label in y_train])
labels = labels[:1000]

# Pretrained classifier.
classifier = keras_hub.models.BertTextClassifier.from_preset(
    "bert_base_en_uncased",
    num_classes=3,
)
classifier.fit(x=features, y=labels, batch_size=2)
classifier.predict(x=features, batch_size=2)

# Re-compile (e.g., with a new learning rate).
classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer="adam",
    jit_compile=True,
)
# Access backbone programmatically (e.g., to change `trainable`).
classifier.backbone.trainable = False
# Fit again.
classifier.fit(x=features, y=labels, batch_size=2)


500/500 ━━━━━━━━━━━━━━━━━━━━ 198s 235ms/step - loss: 1.0893 - sparse_categorical_accuracy: 0.4129
500/500 ━━━━━━━━━━━━━━━━━━━━ 40s 65ms/step
500/500 ━━━━━━━━━━━━━━━━━━━━ 90s 78ms/step - loss: 1.2157 - sparse_categorical_accuracy: 0.4100


<keras.src.callbacks.history.History at 0x7f2363310910>

**BERT Baseline Model Evaluation**

In [39]:
# Prepare data for evaluation
X_test_list = list(X_test["text"].astype(str))  # Convert X_test to a list of strings
y_test_mapped = np.array([label_mapping[label] for label in y_test])  # Map y_test labels

# Create a tf.data.Dataset for evaluation
eval_dataset = tf.data.Dataset.from_tensor_slices((X_test_list, y_test_mapped)).batch(2)

# Evaluate the model
loss, accuracy = classifier.evaluate(eval_dataset) # Use classifier instead of model
print(f"Loss: {loss}, Accuracy: {accuracy}")

from sklearn.metrics import classification_report

# Get predictions
preds = classifier.predict(eval_dataset) # Use classifier instead of model
y_pred = np.argmax(preds, axis=1)  # Get predicted labels

# Get true labels
y_true = np.concatenate([y for x, y in eval_dataset], axis=0)

print(classification_report(y_true, y_pred, target_names=['left', 'center', 'right']))

1632/1632 ━━━━━━━━━━━━━━━━━━━━ 115s 66ms/step - loss: 1.3414 - sparse_categorical_accuracy: 0.1927
Loss: 1.3385035991668701, Accuracy: 0.19552558660507202
1632/1632 ━━━━━━━━━━━━━━━━━━━━ 113s 67ms/step
              precision    recall  f1-score   support

        left       0.00      0.00      0.00      1541
      center       0.20      1.00      0.33       638
       right       0.00      0.00      0.00      1084

    accuracy                           0.20      3263
   macro avg       0.07      0.33      0.11      3263
weighted avg       0.04      0.20      0.06      3263
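The report shows a recall of 1.00 for center and 0.00 for the other two classes, meaning the classifier predicted center for every test article. A quick way to put the 0.20 accuracy in context is to compare it with constant-prediction baselines computed from the support column. This is a small sketch using only the numbers printed above.

```python
# Support counts copied from the classification report above.
support = {"left": 1541, "center": 638, "right": 1084}
total = sum(support.values())

# Accuracy of always predicting one fixed class equals that class's share.
constant_accuracy = {label: n / total for label, n in support.items()}
majority_label = max(support, key=support.get)

for label, acc in constant_accuracy.items():
    print(f"always '{label}': accuracy {acc:.4f}")
print("majority class:", majority_label)
```

Always predicting center reproduces the reported accuracy, while always predicting the majority class (left) would score roughly 0.47, so this baseline run is below a trivial classifier.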





**Trying an LSTM custom model**

We also experimented with a custom bidirectional LSTM trained on our data.

In [49]:
from sklearn.model_selection import train_test_split
import keras_hub
import tensorflow as tf
from tensorflow import keras
import numpy as np
from collections import Counter
import huggingface_hub
from transformers import AutoTokenizer, TFBertModel
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

In [46]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

label_map = {"left": 0, "right": 1, "center": 2}
# Join title, heading, and text with spaces so words at the boundaries do not merge
X_text = (df["title"].fillna("") + " " + df["heading"].fillna("") + " " + df["text"].fillna("")).to_list() # now, we get a list of strings as X_text
y = df["bias_rating"].str.lower().map(label_map)

max_length = 600
X = [
    tokenizer.encode(x, max_length=max_length, truncation=True, add_special_tokens= True) for x in X_text
]

X_pad = pad_sequences(X, maxlen=max_length, padding='post', truncating='post').astype("int32") # padding will allow us to send batches as tensor
y_np  = y.to_numpy(dtype="int32")

X_train, X_val, y_train, y_val = train_test_split(X_pad, y_np, test_size=0.2, stratify=y_np)

In [47]:
# creating the train and validation dataset
BATCH = 32
train_ds = (
    tf.data.Dataset.from_tensor_slices((X_train, y_train))
      .shuffle(10_000)
      .batch(BATCH)
      .prefetch(tf.data.AUTOTUNE)
)

val_ds = (
    tf.data.Dataset.from_tensor_slices((X_val, y_val))
      .batch(BATCH)
      .prefetch(tf.data.AUTOTUNE)
)

In [50]:
# building our bidirectional LSTM model (embedding -> dropout -> biD -> dropout -> biD -> Dense)
vocab_size = tokenizer.vocab_size
num_classes = 3

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=128))
model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Bidirectional(LSTM(64, return_sequences=True)))
model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Bidirectional(LSTM(64)))
model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
es = tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,
    callbacks=[es]
)

Epoch 1/10




544/544 ━━━━━━━━━━━━━━━━━━━━ 98s 84ms/step - accuracy: 0.4718 - loss: 1.0396 - val_accuracy: 0.4810 - val_loss: 1.0203
Epoch 2/10
544/544 ━━━━━━━━━━━━━━━━━━━━ 44s 82ms/step - accuracy: 0.5143 - loss: 0.9742 - val_accuracy: 0.4620 - val_loss: 1.0296
Epoch 3/10
544/544 ━━━━━━━━━━━━━━━━━━━━ 44s 81ms/step - accuracy: 0.5908 - loss: 0.8611 - val_accuracy: 0.4599 - val_loss: 1.0785
Epoch 4/10
544/544 ━━━━━━━━━━━━━━━━━━━━ 44s 81ms/step - accuracy: 0.6876 - loss: 0.7057 - val_accuracy: 0.4240 - val_loss: 1.2675


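Above are the results of our initial LSTM test. Training accuracy climbs from 0.47 to 0.69 over four epochs, but validation accuracy falls from 0.48 to 0.42 and validation loss rises from 1.02 to 1.27, so the model is overfitting rather than generalizing. Early stopping (patience of 3) halted training after epoch 4 and restored the weights from epoch 1. Training for more epochs is therefore unlikely to help on its own; stronger regularization, a smaller model, or pretrained embeddings are likely better next steps.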
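The stopping point in the log can be traced by hand from the validation losses. The sketch below is a simplified re-implementation of the patience rule, not the Keras `EarlyStopping` callback itself; `early_stopping_trace` is a hypothetical helper and the losses are copied from the log above.

```python
def early_stopping_trace(losses, patience):
    """Return (best_epoch, stopped_epoch) under a simple patience rule.

    stopped_epoch is None if training would run through all given epochs.
    """
    best_epoch, best_loss, wait = 0, float("inf"), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss, wait = epoch, loss, 0  # new best, reset patience
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None

val_loss = [1.0203, 1.0296, 1.0785, 1.2675]  # epochs 1 to 4 from the log
best_epoch, stopped_epoch = early_stopping_trace(val_loss, patience=3)
print(best_epoch, stopped_epoch)  # 1 4
```

With `restore_best_weights=True`, the model evaluated afterwards is the one from the best epoch, not from the last epoch that ran.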

**Evaluate on Validation Set**

In [52]:
# Get predictions on the validation split (this section holds out no separate test set)
pred_probs = model.predict(X_val, batch_size=32)

y_pred = np.argmax(pred_probs, axis=1)
y_true = y_val

# target_names must follow the label ids in label_map: left=0, right=1, center=2
print(classification_report(y_true, y_pred,
                            target_names=['left', 'right', 'center']))

136/136 ━━━━━━━━━━━━━━━━━━━━ 6s 35ms/step
              precision    recall  f1-score   support

        left       0.51      0.87      0.64      2055
       right       0.36      0.21      0.27      1445
      center       0.00      0.00      0.00       851

    accuracy                           0.48      4351
   macro avg       0.29      0.36      0.30      4351
weighted avg       0.36      0.48      0.39      4351
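`classification_report` assigns `target_names` to the label ids in ascending order. The `label_map` in the LSTM cell orders the classes as left, right, center, while the `label_mapping` in the BERT cell uses left, center, right, so it may be safer to derive the names from the mapping than to type them by hand. A minimal sketch:

```python
# Mapping copied from the LSTM data-preparation cell.
label_map = {"left": 0, "right": 1, "center": 2}

# Order the class names by their integer id so they line up with the report rows.
target_names = [name for name, _ in sorted(label_map.items(), key=lambda kv: kv[1])]
print(target_names)  # ['left', 'right', 'center']
```

The resulting list can be passed directly as `target_names=target_names`. The support column offers a sanity check, since 20% of the 7,226 right-leaning articles is about 1,445 and 20% of the 4,253 center articles is about 851.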





**Create requirements.txt**

In [53]:
!pip freeze > requirements.txt