# Fake News Detection

## Overview
I trained three machine learning models (a multinomial naive Bayes classifier, a random forest classifier, and a feedforward neural network) and compared their accuracy and AUC on a held-out split of the training data. I then used the most accurate model, the feedforward neural network, to classify the unlabeled news articles as reliable or unreliable.

## Data
### Fake news
- train.csv: A full training dataset with the following attributes:
    - id: unique id for a news article
    - title: the title of a news article
    - author: author of the news article
    - text: the text of the article; could be incomplete
    - label: a label that marks the article as potentially unreliable
        - 1: unreliable
        - 0: reliable
- test.csv: A test dataset with the same attributes as train.csv, but without the label.

- submit.csv: A sample submission that you can

The data is publicly available on Kaggle (Fake News): https://www.kaggle.com/competitions/fake-news/data?select=submit.csv


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [2]:
df = pd.read_csv('E:/2024计算社会科学夏校/hw/day8/train.csv')
df.head()

| | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


In [3]:
df.shape

(20800, 5)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [5]:
# Replace missing values with spaces
df = df.fillna(' ')

In [6]:
df['title'][0]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'

In [7]:
df['text'][0]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It By Darrell Lucus on October 30, 2016 Subscribe Jason Chaffetz on the stump in American Fork, Utah ( image courtesy Michael Jolley, available under a Creative Commons-BY license) \nWith apologies to Keith Olbermann, there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide, it looks like we also know who the second-worst person is as well. It turns out that when Comey sent his now-infamous letter announcing that the FBI was looking into emails that may be related to Hillary Clinton’s email server, the ranking Democrats on the relevant committees didn’t hear about it from Comey. They found out via a tweet from one of the Republican committee chairmen. \nAs we now know, Comey notified the Republican chairmen and Democratic ranking members of the House Intelligence, Judiciary, and Oversight committees that his agency was reviewing emai

## Feature Extraction

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

# Merge title, author, and text into a single string per article
df['total'] = df['title'] + '| ' + df['author'] + '| ' + df['text']

# Configure CountVectorizer: convert each document to a vector of word counts,
# keeping at most the 20,000 most frequent words and excluding English stop words
count_vectorizer = CountVectorizer(ngram_range=(1, 1),
                                   stop_words='english',
                                   max_features=20000)

# Call CountVectorizer to generate word frequency matrix
total = df['total'].values
counts = count_vectorizer.fit_transform(total)

In [9]:
counts

<20800x20000 sparse matrix of type '<class 'numpy.int64'>'
	with 4903378 stored elements in Compressed Sparse Row format>

In [10]:
# Set up TfidfTransformer: reweight the word counts into TF-IDF values,
# which reflect how important each word is within the document collection
transformer = TfidfTransformer(smooth_idf=True)

# Call TfidfTransformer to generate TF-IDF numerical matrix
tfidf = transformer.fit_transform(counts)

In [11]:
# Check the dimension
print("TF-IDF shape:", tfidf.shape)

TF-IDF shape: (20800, 20000)


In [12]:
print("TF-IDF of first document:\n", tfidf[0].toarray())

TF-IDF of first document:
 [[0. 0. 0. ... 0. 0. 0.]]


In [13]:
tfidf

<20800x20000 sparse matrix of type '<class 'numpy.float64'>'
	with 4903378 stored elements in Compressed Sparse Row format>
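
To make the weighting concrete, here is a minimal pure-Python sketch of the smoothed TF-IDF formula that scikit-learn documents for `TfidfTransformer(smooth_idf=True)` with its default L2 row normalization: idf(t) = ln((1 + n) / (1 + df(t))) + 1. The helper name `tfidf_rows` and the toy count matrix are assumptions made for illustration; they are not part of the notebook.

```python
import math


def tfidf_rows(counts):
    """Turn a documents-by-terms list of raw counts into L2-normalized TF-IDF rows."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents each term appears
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


# Toy counts (hypothetical): 3 documents, 3 terms; term 0 occurs in every document
toy_counts = [
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
weights = tfidf_rows(toy_counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the second toy document both terms occur once, yet the rarer term receives the larger weight, which is the effect the transformer is used for here.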

In [14]:
y = df['label']

In [15]:
from sklearn.model_selection import train_test_split

# Split into a training set (70%) and a held-out test set (30%)
Xtrain, Xtest, ytrain, ytest = train_test_split(tfidf, y, 
                                                random_state = 1, 
                                                train_size = 0.7)

In [16]:
print(f"Training data shape: {Xtrain.shape}, Test data shape: {Xtest.shape}")

Training data shape: (14559, 20000), Test data shape: (6241, 20000)
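
Note that the TF-IDF weights above are fitted on all 20,800 rows before the split, so the IDF statistics also see the held-out rows. One way to rule that out is to split the raw text first and fit the vectorizer inside a pipeline. The sketch below assumes a tiny made-up corpus and uses `TfidfVectorizer`, which combines `CountVectorizer` and `TfidfTransformer` in one step; it is an illustration, not the procedure used for the results reported in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical, for illustration only); 1 = unreliable, 0 = reliable
texts = [
    "shocking secret cure doctors hate",
    "you will not believe this shocking trick",
    "secret plot revealed by anonymous insider",
    "miracle cure banned by government",
    "senate passes annual budget bill",
    "city council approves new transit plan",
    "central bank holds interest rates steady",
    "court upholds ruling on transit funding",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so nothing is fitted on the held-out rows
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, train_size=0.75, random_state=1, stratify=labels
)

# The pipeline fits the vocabulary and IDF weights on the training rows only
pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(len(X_train), len(X_test), len(predictions))
```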


## Multinomial naive Bayes classifier

In [17]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, roc_auc_score  # roc_curve and auc were unused, and auc was shadowed by the variable below

# Train
nbmodel = MultinomialNB() # Initialize the MultinomialNB classifier
nbmodel.fit(Xtrain, ytrain)

# Predict the label (0/1)
y_test_pred = nbmodel.predict(Xtest)
# Predict the probability of label = 1
y_test_prob = nbmodel.predict_proba(Xtest)[:, 1] 

# Accuracy
accu = accuracy_score(ytest, y_test_pred)

# AUC (uses the predicted probabilities)
auc = roc_auc_score(ytest, y_test_prob)

print('The test accuracy is %.4f' % accu)
print('The test AUC score is %.4f' % auc)

The test accuracy is 0.9143
The test AUC score is 0.9806
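
For intuition about what `MultinomialNB` estimates, the following is a minimal pure-Python sketch of a multinomial naive Bayes classifier with additive (Laplace) smoothing, `alpha=1.0`, which is also scikit-learn's default. The function names and the toy term counts are assumptions for illustration only.

```python
import math


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log term likelihoods per class."""
    n_terms = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Smoothed total count of each term within class c
        totals = [sum(r[j] for r in rows) + alpha for j in range(n_terms)]
        denom = sum(totals)
        log_likelihood[c] = [math.log(t / denom) for t in totals]
    return log_prior, log_likelihood


def predict_nb(x, log_prior, log_likelihood):
    """Return the class with the highest posterior log score for one count vector."""
    scores = {
        c: log_prior[c] + sum(v * w for v, w in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy counts (hypothetical); columns: "shocking", "report", "senate"
X_toy = [[3, 0, 0], [2, 1, 0], [0, 2, 1], [0, 1, 3]]
y_toy = [1, 1, 0, 0]  # 1 = unreliable, 0 = reliable

log_prior, log_likelihood = fit_multinomial_nb(X_toy, y_toy)
print(predict_nb([2, 0, 0], log_prior, log_likelihood))
print(predict_nb([0, 1, 2], log_prior, log_likelihood))
```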


## Random forest classifier

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Initialize the random forest classifier
rfmodel = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# Train the model
rfmodel.fit(Xtrain, ytrain)

# Predict on the test set
y_test_pred = rfmodel.predict(Xtest)  # Predicted labels
y_test_prob = rfmodel.predict_proba(Xtest)[:, 1]  # Probability of class 1

# Evaluate model performance
accu = accuracy_score(ytest, y_test_pred)
auc = roc_auc_score(ytest, y_test_prob)

# Print the results
print('The test accuracy is %.4f' % accu)
print('The test AUC score is %.4f' % auc)

The test accuracy is 0.9135
The test AUC score is 0.9769
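
Accuracy and AUC answer different questions: accuracy depends on the 0.5 decision threshold, while AUC only measures how well the scores rank unreliable articles above reliable ones. A minimal sketch of both, assuming made-up scores and hypothetical helper names `roc_auc` and `accuracy_at`:

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_at(y_true, scores, threshold=0.5):
    """Accuracy after thresholding the scores into 0/1 labels."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


# Toy example (hypothetical): perfectly ranked, but every score is below 0.5
y_true = [0, 0, 1, 1]
scores = [0.2, 0.3, 0.4, 0.45]
print(roc_auc(y_true, scores), accuracy_at(y_true, scores))
```

Here the ranking is perfect while half of the thresholded labels are wrong, which is why the two metrics are reported side by side.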


## Feedforward neural network

### Data preprocessing

In [19]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import accuracy_score, roc_auc_score

# Convert data to a tensor in PyTorch
Xtrain_tensor = torch.tensor(Xtrain.toarray(), dtype=torch.float32)
Xtest_tensor = torch.tensor(Xtest.toarray(), dtype=torch.float32)
ytrain_tensor = torch.tensor(ytrain.values, dtype=torch.long)  # NLLLoss expects class indices as int64 (long)
ytest_tensor = torch.tensor(ytest.values, dtype=torch.long)


### Build a neural network model

In [20]:
class FNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(FNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)  # Input-to-hidden layer
        self.relu = nn.ReLU()                        # Activation function
        self.fc2 = nn.Linear(hidden_size, output_size)  # Hidden-to-output layer
        self.softmax = nn.LogSoftmax(dim=1)          # Log-probabilities over the classes

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.softmax(x)
        return x

# Define model parameters
input_size = Xtrain.shape[1]  # number of features
hidden_size = 50             # Hidden layer size
output_size = 2              # binary classification problem

fnnmodel = FNN(input_size, hidden_size, output_size)


### Define loss function and optimizer

In [21]:
criterion = nn.NLLLoss()  # Negative log-likelihood loss, suitable for LogSoftmax
optimizer = optim.Adam(fnnmodel.parameters(), lr=0.001)  # Adam optimizer
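
The pairing of `LogSoftmax` with `NLLLoss` amounts to the usual cross-entropy on raw logits. The pure-Python sketch below checks this for a single example; the logit values and helper names are assumptions chosen for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


logits = [1.2, -0.3]  # hypothetical raw outputs of the last linear layer
log_probs = log_softmax(logits)
probs = [math.exp(lp) for lp in log_probs]  # exponentiating recovers probabilities

loss = nll_loss(log_probs, target=0)
# Cross-entropy computed directly from the softmax probability of the target class
cross_entropy = -math.log(math.exp(logits[0]) / sum(math.exp(z) for z in logits))
print(round(loss, 6), round(cross_entropy, 6), [round(p, 4) for p in probs])
```

This is also why the evaluation step applies `torch.exp` to the model output before reading off the probability of class 1.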

### Train the model

In [22]:
num_epochs = 10
batch_size = 64

# Training data is divided into batches
train_data = torch.utils.data.TensorDataset(Xtrain_tensor, ytrain_tensor)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)

# Train
for epoch in range(num_epochs):
    fnnmodel.train()  # Set training mode
    for batch_X, batch_y in train_loader:
        # Forward Propagation
        outputs = fnnmodel(batch_X)
        loss = criterion(outputs, batch_y)

        # Backpropagation and Optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}")


Epoch [1/10], Loss: 0.1594
Epoch [2/10], Loss: 0.0300
Epoch [3/10], Loss: 0.0207
Epoch [4/10], Loss: 0.0096
Epoch [5/10], Loss: 0.0066
Epoch [6/10], Loss: 0.0044
Epoch [7/10], Loss: 0.0040
Epoch [8/10], Loss: 0.0030
Epoch [9/10], Loss: 0.0006
Epoch [10/10], Loss: 0.0006
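
The loop prints `loss.item()` after the inner loop finishes, so each value above is the loss of the final mini-batch of that epoch, not an epoch average. With 14,559 training rows and a batch size of 64, that final batch holds only 31 samples, which makes the figure noisy. A minimal sketch of a size-weighted epoch mean, assuming a hypothetical helper `epoch_mean_loss` and made-up batch losses:

```python
import math


def epoch_mean_loss(batch_losses, batch_sizes):
    """Average per-sample loss over an epoch, weighting each batch by its size."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


n_samples, batch_size = 14559, 64
n_batches = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (n_batches - 1) * batch_size
print(n_batches, last_batch_size)

# Hypothetical batch losses, for illustration only
print(epoch_mean_loss([0.2, 0.1, 0.0], [64, 64, 32]))
```

In the training loop this would mean accumulating `loss.item() * batch_X.size(0)` per batch and dividing by the number of training rows at the end of each epoch.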


### Evaluate the model

In [23]:
# Testing mode
fnnmodel.eval()

# Predict with test data
with torch.no_grad():
    y_test_pred_prob = fnnmodel(Xtest_tensor)  # log-probabilities for each class
    y_test_pred = torch.argmax(y_test_pred_prob, dim=1)  # predicted label
    y_test_prob = torch.exp(y_test_pred_prob)[:, 1]  # P(label = 1)

accu = accuracy_score(ytest_tensor.numpy(), y_test_pred.numpy())
auc = roc_auc_score(ytest_tensor.numpy(), y_test_prob.numpy())

print('The test accuracy is %.4f' % accu)
print('The test AUC score is %.4f' % auc)

The test accuracy is 0.9696
The test AUC score is 0.9962


## Use the trained FNN model to predict the credibility of news

In [24]:
df1 = pd.read_csv('E:/2024计算社会科学夏校/hw/day8/test.csv')
df1

| | id | title | author | text |
|---|---|---|---|---|
| 0 | 20800 | Specter of Trump Loosens Tongues, if Not Purse... | David Streitfeld | PALO ALTO, Calif. — After years of scorning... |
| 1 | 20801 | Russian warships ready to strike terrorists ne... | | Russian warships ready to strike terrorists ne... |
| 2 | 20802 | #NoDAPL: Native American Leaders Vow to Stay A... | Common Dreams | Videos #NoDAPL: Native American Leaders Vow to... |
| 3 | 20803 | Tim Tebow Will Attempt Another Comeback, This ... | Daniel Victor | If at first you don’t succeed, try a different... |
| 4 | 20804 | Keiser Report: Meme Wars (E995) | Truth Broadcast Network | 42 mins ago 1 Views 0 Comments 0 Likes 'For th... |
| ... | ... | ... | ... | ... |
| 5195 | 25995 | The Bangladeshi Traffic Jam That Never Ends - ... | Jody Rosen | Of all the dysfunctions that plague the world’... |
| 5196 | 25996 | John Kasich Signs One Abortion Bill in Ohio bu... | Sheryl Gay Stolberg | WASHINGTON — Gov. John Kasich of Ohio on Tu... |
| 5197 | 25997 | California Today: What, Exactly, Is in Your Su... | Mike McPhate | Good morning. (Want to get California Today by... |
| 5198 | 25998 | 300 US Marines To Be Deployed To Russian Borde... | | « Previous - Next » 300 US Marines To Be Deplo... |


In [25]:
# Fill missing values in the test set (df1, not the training frame df)
df1 = df1.fillna(' ')

df1['total'] = df1['title'] + '| ' + df1['author'] + '| ' + df1['text']

# Reuse the vectorizer and transformer fitted on the training data (In [8] and In [10]),
# so the 20,000 feature columns match the ones the model was trained on
counts = count_vectorizer.transform(df1['total'].values)

tfidf = transformer.transform(counts)

print("TF-IDF shape:", tfidf.shape)


In [26]:
label_pred = []

In [27]:
X = tfidf
X_tensor = torch.tensor(X.toarray(), dtype=torch.float32)

In [28]:
fnnmodel.eval()  # Evaluation mode
with torch.no_grad():
    label_pred_prob = fnnmodel(X_tensor)  # log-probabilities for each class
    label_pred = torch.argmax(label_pred_prob, dim=1)  # predicted label

In [29]:
print(label_pred.shape)


In [30]:
label_pred = label_pred.numpy()  # Convert to NumPy Array

In [31]:
# Add to the DataFrame 
df1['predicted_label'] = label_pred

In [32]:
df1.head()


In [33]:
submission = df1[['id', 'predicted_label']]
submission.to_csv('submission.csv', index=False)
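
When scoring new articles, the feature columns have to line up with the ones the model saw during training. In scikit-learn terms this means calling `transform` on the already fitted `CountVectorizer` and `TfidfTransformer` instead of `fit_transform`, which would build a new vocabulary with a different column order. The pure-Python sketch below illustrates the idea with hypothetical helpers `build_vocabulary` and `vectorize` and made-up sentences.

```python
from collections import Counter


def build_vocabulary(train_docs, max_features=None):
    """Map the most frequent training words to fixed column indices."""
    counts = Counter(word for doc in train_docs for word in doc.lower().split())
    kept = [word for word, _ in counts.most_common(max_features)]
    return {word: index for index, word in enumerate(sorted(kept))}


def vectorize(doc, vocabulary):
    """Count words using the training vocabulary; unseen words are ignored."""
    row = [0] * len(vocabulary)
    for word in doc.lower().split():
        index = vocabulary.get(word)
        if index is not None:
            row[index] += 1
    return row


# Toy documents (hypothetical)
train_docs = ["senate passes bill", "shocking secret revealed"]
vocabulary = build_vocabulary(train_docs)

# A new document is mapped onto the same columns; "shocks" was never seen, so it is dropped
new_row = vectorize("shocking bill shocks senate", vocabulary)
print(vocabulary)
print(new_row)
```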