# **Final Project**

## **Problem statement:**

The widespread dissemination of fake news and propaganda presents serious societal risks, including the erosion of public trust, political polarization, manipulation of elections, and the spread of harmful misinformation during crises such as pandemics or conflicts. From an NLP perspective, detecting fake news is fraught with challenges. Linguistically, fake news often mimics the tone and structure of legitimate journalism, making it difficult to distinguish using surface-level features. The absence of reliable and up-to-date labeled datasets, especially across multiple languages and regions, hampers the effectiveness of supervised learning models. Additionally, the dynamic and adversarial nature of misinformation means that malicious actors constantly evolve their language and strategies to bypass detection systems. Cultural context, sarcasm, satire, and implicit bias further complicate automated analysis. Moreover, NLP models risk amplifying biases present in training data, leading to unfair classifications and potential censorship of legitimate content. These challenges underscore the need for cautious, context-aware approaches, as the failure to address them can inadvertently contribute to misinformation, rather than mitigate it.



Use the datasets at the following link to complete the requirements: https://drive.google.com/drive/folders/1mrX3vPKhEzxG96OCPpCeh9F8m_QKCM4z?usp=sharing

## **About the dataset:**

* **True Articles**:

  * **File**: `MisinfoSuperset_TRUE.csv`
  * **Sources**:

    * Reputable media outlets like **Reuters**, **The New York Times**, **The Washington Post**, etc.

* **Fake/Misinformation/Propaganda Articles**:

  * **File**: `MisinfoSuperset_FAKE.csv`
  * **Sources**:

    * **American right-wing extremist websites** (e.g., Redflag Newsdesk, Breitbart, Truth Broadcast Network)
    * **Public dataset** from:

      * Ahmed, H., Traore, I., & Saad, S. (2017): "Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques" *(Springer LNCS 10618)*



## **Requirement**

A team consisting of three members must complete a project that involves applying the methods learned from the beginning of the course up to the present. The team is expected to follow and document the entire machine learning workflow, which includes the following steps:

1. **Data Preprocessing**: Clean and prepare the dataset, etc.

2. **Exploratory Data Analysis (EDA)**: Explore and visualize the data.

3. **Model Building**: Select and build one or more machine learning models suitable for the problem at hand.

4. **Hyperparameter Setup**: Set and adjust the model's hyperparameters using appropriate methods to improve performance.

5. **Model Training**: Train the model(s) on the training dataset.

6. **Performance Evaluation**: Evaluate the trained model(s) using appropriate metrics (e.g., accuracy, precision, recall, F1-score, confusion matrix, etc.) and validate their performance on unseen data.

7. **Conclusion**: Summarize the results, discuss the model's strengths and weaknesses, and suggest possible improvements or future work.
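
The metrics named in step 6 all follow from the four cells of a binary confusion matrix. The sketch below shows one way to compute them by hand; the function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration, not taken from this dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of everything predicted positive, how much was correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=70)
print(metrics)
```

In practice `sklearn.metrics.classification_report` reports the same quantities per class, so a hand computation like this is mainly useful as a sanity check.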





In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [13]:
df_true = pd.read_csv("DataSet_Misinfo_TRUE.csv")
df_fake = pd.read_csv("DataSet_Misinfo_FAKE.csv")

print(f"True dataset:\n {df_true.head()} \n {df_true.shape}")
print(f"Fake dataset:\n {df_fake.head()} \n {df_fake.shape}")

True dataset:
    Unnamed: 0                                               text
0           0  The head of a conservative Republican faction ...
1           1  Transgender people will be allowed for the fir...
2           2  The special counsel investigation of links bet...
3           3  Trump campaign adviser George Papadopoulos tol...
4           4  President Donald Trump called on the U.S. Post... 
 (34975, 2)
Fake dataset:
    Unnamed: 0                                               text
0           0  Donald Trump just couldn t wish all Americans ...
1           1  House Intelligence Committee Chairman Devin Nu...
2           2  On Friday, it was revealed that former Milwauk...
3           3  On Christmas day, Donald Trump announced that ...
4           4  Pope Francis used his annual Christmas Day mes... 
 (43642, 2)


In [14]:
df_true = df_true.drop(columns=["Unnamed: 0"], errors='ignore')
df_fake = df_fake.drop(columns=["Unnamed: 0"], errors='ignore')

In [21]:
print(f"True dataset null: {df_true.isnull().sum()}")
print(f"Fake dataset null: {df_fake.isnull().sum()}")
df_true = df_true.dropna()
df_fake = df_fake.dropna()

True dataset null: text    0
dtype: int64
Fake dataset null: text    0
dtype: int64


In [23]:
print(f"True dataset duplicates: {df_true.duplicated().sum()}")
print(f"Fake dataset duplicates: {df_fake.duplicated().sum()}")
df_true = df_true.drop_duplicates()
df_fake = df_fake.drop_duplicates()

True dataset duplicates: 0
Fake dataset duplicates: 0
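
Exact-match deduplication can miss articles that differ only in casing or whitespace. As a possible extra cleaning step, the sketch below normalizes each text before comparing; the helpers `normalize_text` and `count_near_duplicates` are hypothetical names introduced here, and the sample strings are made up.

```python
import re


def normalize_text(text):
    """Collapse whitespace runs, strip the ends and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def count_near_duplicates(texts):
    """Count texts whose normalized form has already been seen."""
    seen = set()
    duplicates = 0
    for text in texts:
        key = normalize_text(text)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


# Made-up sample: the first two strings differ only in case and spacing
sample = ["Breaking  news today", "breaking news today ", "Another story"]
n_dupes = count_near_duplicates(sample)
print(n_dupes)
```

On the notebook's frames, the same check could be written as `df_true["text"].map(normalize_text).duplicated().sum()`, assuming the `text` column holds only strings after `dropna()`.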


In [26]:
df_fake["label"] = 0
df_true["label"] = 1

# Combine the two datasets
df = pd.concat([df_fake, df_true], ignore_index=True)

# Shuffle the data
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

df.head()

                                                text  label
0  At a campaign event Hillary Clinton talked to ...      0
1  Officials in the U.S. military’s Central Comma...      1
2  Five migrants were wounded on Saturday when sh...      1
3  The continuing refusal by the San Francisco 49...      1
4  Human Rights Watch is a dangerous threat to th...      0
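
For the EDA step, two quick checks are the class balance and the article length per class. The following is a minimal sketch on a toy frame that mimics the `text` and `label` columns of `df`; the rows and the `n_words` column are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the notebook's `df` (label 0 = fake, 1 = true)
toy = pd.DataFrame({
    "text": [
        "one two three",
        "four five",
        "six",
        "seven eight nine ten eleven",
    ],
    "label": [0, 0, 1, 1],
})

# Share of each class in the data
class_share = toy["label"].value_counts(normalize=True).sort_index()

# Word count per article, then the mean per class
toy["n_words"] = toy["text"].str.split().str.len()
length_by_label = toy.groupby("label")["n_words"].mean()

print(class_share)
print(length_by_label)
```

Running the same lines on `df` would show whether one class dominates and whether fake and true articles differ noticeably in length; `length_by_label.plot(kind="bar")` is one way to visualize the result.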


In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Assume the column containing the text is named 'text'
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
X = vectorizer.fit_transform(df['text'])  # Convert the text into TF-IDF vectors
y = df['label']

In [28]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
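
Note that the vectorizer above is fitted on the full dataset before the split, so the TF-IDF vocabulary and document frequencies also see the test articles. One alternative, sketched below under the assumption that a scikit-learn pipeline is acceptable for this project, is to split the raw text first and let the pipeline fit the vectorizer on the training portion only. The toy corpus is made up; `stratify` and `max_iter=1000` are additions not used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up corpus standing in for df["text"] and df["label"] (0 = fake, 1 = true)
texts = [
    "Senate committee approves annual budget bill",
    "Central bank raises interest rates by quarter point",
    "City council votes to expand public transit",
    "Court upholds ruling on voting districts",
    "Shocking miracle cure that doctors hide from you",
    "Secret plot exposed by anonymous insider",
    "You will not believe this outrageous scandal",
    "Aliens control the government says insider",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping the class ratio in both parts
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then the classifier
clf = make_pipeline(
    TfidfVectorizer(stop_words="english", max_df=0.7),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train_txt, y_train)
predictions = clf.predict(X_test_txt)
print(predictions)
```

With this layout, `clf.predict` can also be called directly on new raw articles, since the fitted vectorizer travels with the model.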

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.9346986371255739
              precision    recall  f1-score   support

           0       0.93      0.94      0.94      6894
           1       0.94      0.93      0.93      6827

    accuracy                           0.93     13721
   macro avg       0.93      0.93      0.93     13721
weighted avg       0.93      0.93      0.93     13721
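
The model above uses the default hyperparameters of `LogisticRegression`. For the hyperparameter step of the workflow, one common option is a cross-validated grid search. The sketch below runs on a made-up toy corpus; the grid values, `cv=3` and the `f1_macro` scoring are illustrative assumptions, not settings from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up corpus (0 = fake, 1 = true), six articles per class
texts = [
    "Senate committee approves annual budget bill",
    "Central bank raises interest rates by quarter point",
    "City council votes to expand public transit",
    "Court upholds ruling on voting districts",
    "Ministry publishes quarterly employment figures",
    "Parliament debates new trade agreement",
    "Shocking miracle cure that doctors hide from you",
    "Secret plot exposed by anonymous insider",
    "You will not believe this outrageous scandal",
    "Aliens control the government says insider",
    "Outrageous hoax destroys corrupt elites",
    "Miracle pill melts fat overnight",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Candidate values: n-gram range for TF-IDF, regularization strength C
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "logreg__C": [0.1, 1.0, 10.0],
}

# Each candidate is scored with 3-fold stratified cross-validation
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_macro")
search.fit(texts, labels)
best_params = search.best_params_
print(best_params, search.best_score_)
```

On the real data, the search would be fitted on the training split only, and `search.best_estimator_` would then be evaluated once on the held-out test set.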

