# Fake News Detection

## Team Members

- Muhammed Talha KAYA, kayamu16@itu.edu.tr, 090160339
- Eymen Gül, guley17@itu.edu.tr, 090190746

## Description of the Problem

This project aims to address the challenges faced during the recent earthquake in Turkey and the impact of the new social media law, by developing a machine learning-based model capable of accurately detecting and classifying fake news on social media during natural disasters. By focusing on real-time analysis, our model will empower users and authorities to make informed decisions based on verified information, improving the overall disaster response process and enabling a more responsible approach to sharing information in light of the new regulatory landscape.

### Key Objectives

1. Investigate state-of-the-art techniques in fake news detection and adapt them for real-time analysis during natural disasters and in response to recent social media regulations.
2. Collect and preprocess a dataset of genuine and fake news items related to natural disasters, with a focus on earthquakes, as well as news items influenced by the social media law.
3. Develop a robust and efficient machine learning model for detecting fake news and assessing the credibility of information shared on social media platforms, taking into account the impact of recent regulatory changes.
4. Evaluate the performance of the model using appropriate metrics, such as accuracy, precision, recall, and F1-score.
5. Deploy the model as a user-friendly tool that can be easily integrated into existing social media platforms or utilized as a standalone application, while respecting the legal requirements of the social media law.

By implementing an effective news verification model, our project aims to enhance the credibility of information shared on social media during natural disasters and in the context of the new social media law, leading to better decision-making, optimized resource allocation, and ultimately, a safer and more informed environment for all.

## Methodologies

Our methodology can be broken down into the following key stages:

1. *Literature Review*: We will conduct a comprehensive review of existing research papers, articles, and state-of-the-art techniques in the field of fake news detection. This will help us understand the current landscape, identify relevant approaches, and determine areas for potential improvements or novel contributions.


2. *Data Collection*: We will gather a dataset containing genuine and fake news items, ensuring that it includes examples from various subjects, such as politics, sports, and world news. The dataset should have information on the news title, text, subject, date, and a binary label indicating the news item's truthfulness.


3. *Data Preprocessing*: In this stage, we will clean and preprocess the dataset. The data comes as separate CSV files for fake and true news items. We will add a 'label' column to each, assigning 1 to true news and 0 to fake news, and then merge the two with the concat method into a single dataset. Next, we will remove irrelevant columns, check the remaining columns for missing values, and handle any that are found. We will clean the text by removing special characters and other unwanted elements. To ensure that the data is well shuffled, we will randomize the order of the entries and reset the indices. Finally, we will check the column names and perform additional tasks such as removing duplicate records and ensuring data consistency, so that the resulting dataset can be used effectively for model development.


4. *Feature Engineering*: We will extract and generate relevant features from the raw data that can be used as input for our machine learning model. In particular, the text will be converted into vectors using the TF-IDF method. This transformation allows us to represent the importance of words in each news item relative to the entire dataset. We may also explore additional techniques, such as sentiment analysis and keyword extraction. We will perform feature selection and dimensionality reduction to retain only the most informative features for our model.


5. *Model Selection and Development*: We will explore various machine learning algorithms, including Logistic Regression, Decision Tree Classification, Gradient Boosting Classifier, and Random Forest Classifier, to determine the most suitable model for our problem. We will then develop and implement the chosen model, fine-tuning its hyperparameters as necessary.


6. *Model Training, Evaluation, and Validation*: We will train the selected model on our preprocessed dataset and evaluate its performance using appropriate metrics, such as accuracy, precision, recall, and F1-score. We will iteratively refine the model based on these evaluation results and validate its performance on unseen data.


7. *Deployment and Integration*: Upon achieving satisfactory results, we will deploy the best-performing model to a production environment and integrate it with a web service or application for real-time fake news detection.


8. *Documentation and Presentation*: Throughout the project, we will document our methodology, results, and lessons learned. We will also prepare a presentation to communicate our findings and the impact of our work to relevant stakeholders and the wider community.


By following this methodology, we aim to develop a robust and effective fake news detection model that can contribute to a safer and more informed social media environment during natural disasters and in light of the new social media law.

## Data Collection and Description

The task is to classify incoming news items as true news or fake news.

Source of dataset: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset?select=True.csv

The dataset consists of two separate files, True.csv and Fake.csv, which share the same four columns. We combine them and add a new 'label' column to obtain the dataset used in this project. Its columns are as follows:

- title : The title of the article

- text : The text of the article

- subject : The subject of the article

- date : The date on which the article was posted

- label : 1 if it's true news, 0 if it's fake news


## Data Preprocessing

- Import data with true and fake news.

In [108]:
import pandas as pd
import re
import string
from sklearn.model_selection import train_test_split

true_df = pd.read_csv('datasets/True.csv')
fake_df = pd.read_csv('datasets/Fake.csv')
true_df.index+=1
fake_df.index+=1

- Inspect the true and fake dataframes.

In [109]:
true_df

Unnamed: 0,title,text,subject,date
1,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
2,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
3,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
4,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
5,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"
...,...,...,...,...
21413,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017"
21414,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017"
21415,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017"
21416,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017"


In [110]:
fake_df

Unnamed: 0,title,text,subject,date
1,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
2,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
3,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
4,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
5,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"
...,...,...,...,...
23477,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016"
23478,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016"
23479,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016"
23480,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016"


- Added a new column named 'label' to both dataframes.

In [111]:
fake_df["label"] = 0
true_df["label"] = 1

- Held out the last 5 fake and the last 5 true news items, so that the trained models can later be tested manually in the web application.

In [112]:
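# Hold out the last 5 rows of each frame for manual testing, then drop exactly
# those rows (by index label) so that none of them remain in the training data.
fake_df_application_testing = fake_df.tail(5)
fake_df = fake_df.drop(fake_df_application_testing.index)

true_df_application_testing = true_df.tail(5)
true_df = true_df.drop(true_df_application_testing.index)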

In [113]:
fake_df_application_testing = fake_df_application_testing.drop(["title", "subject","date"], axis = 1)
true_df_application_testing = true_df_application_testing.drop(["title", "subject","date"], axis = 1)

In [114]:
fake_df.shape

(23476, 5)

In [115]:
true_df.shape

(21412, 5)

In [116]:
df_application_testing = pd.concat([fake_df_application_testing,true_df_application_testing], axis = 0)
df_application_testing.to_csv("FakeNewsSite/application_testing.csv")

- Merged the two dataframes.

In [117]:
merge_df = pd.concat([fake_df, true_df], axis =0 )
merge_df

Unnamed: 0,title,text,subject,date,label
1,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
2,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
3,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
4,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
5,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0
...,...,...,...,...,...
21408,"Mata Pires, owner of embattled Brazil builder ...","SAO PAULO (Reuters) - Cesar Mata Pires, the ow...",worldnews,"August 22, 2017",1
21409,"U.S., North Korea clash at U.N. forum over nuc...",GENEVA (Reuters) - North Korea and the United ...,worldnews,"August 22, 2017",1
21410,"U.S., North Korea clash at U.N. arms forum on ...",GENEVA (Reuters) - North Korea and the United ...,worldnews,"August 22, 2017",1
21411,Headless torso could belong to submarine journ...,COPENHAGEN (Reuters) - Danish police said on T...,worldnews,"August 22, 2017",1


- New dataframe shuffled and reindexed.

In [118]:
news_df = merge_df.sample(frac = 1)
news_df.reset_index(inplace = True)
news_df.index+=1
news_df

Unnamed: 0,index,title,text,subject,date,label
1,11990,BREAKING: Paul Ryan Makes A HUGE Announcement ...,,politics,"Jan 5, 2017",0
2,6023,Rights advocates slam Trump plans on Muslim im...,NEW YORK (Reuters) - Immigrant and refugee adv...,politicsNews,"January 25, 2017",1
3,16171,WATCH: TSA’S PAT-DOWN At Dallas Airport Leaves...,TSA allows for a pat-down of a teenage passen...,Government News,"Mar 29, 2017",0
4,18752,DON’T TAKE YOUR KIDS TO New Orleans To Learn A...,It s a sad day in America when we allow the le...,left-news,"Apr 25, 2017",0
5,8938,The Hidden Root Of White Rage,Paul Krugman asks a simple question about why ...,News,"January 6, 2016",0
...,...,...,...,...,...,...
44884,937,The Internet EVISCERATES Whiny Nikki Haley Af...,The Donald Trump Administration has to be the ...,News,"July 4, 2017",0
44885,9685,SHERIFF CLARKE CALLS OUT NFL For Latest Move T...,Sheriff Clarke has it right! The NFL wasn t li...,politics,"Oct 14, 2017",0
44886,997,"Trump releases some JFK files, blocks others u...",WASHINGTON (Reuters) - U.S. President Donald T...,politicsNews,"October 26, 2017",1
44887,4367,Russian PM says U.S. Syria strikes 'one step a...,MOSCOW (Reuters) - Russian Prime Minister Dmit...,politicsNews,"April 7, 2017",1


- Checked column names.

In [119]:
news_df.columns

Index(['index', 'title', 'text', 'subject', 'date', 'label'], dtype='object')

- Dropped unnecessary columns.

In [120]:
df = news_df.drop(["title", "subject","date"], axis = 1)
df

Unnamed: 0,index,text,label
1,11990,,0
2,6023,NEW YORK (Reuters) - Immigrant and refugee adv...,1
3,16171,TSA allows for a pat-down of a teenage passen...,0
4,18752,It s a sad day in America when we allow the le...,0
5,8938,Paul Krugman asks a simple question about why ...,0
...,...,...,...
44884,937,The Donald Trump Administration has to be the ...,0
44885,9685,Sheriff Clarke has it right! The NFL wasn t li...,0
44886,997,WASHINGTON (Reuters) - U.S. President Donald T...,1
44887,4367,MOSCOW (Reuters) - Russian Prime Minister Dmit...,1


- Checked for null values.

In [121]:
df.isnull().sum()

index    0
text     0
label    0
dtype: int64

In [122]:
df.shape

(44888, 3)

In [123]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44888 entries, 1 to 44888
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   index   44888 non-null  int64 
 1   text    44888 non-null  object
 2   label   44888 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.0+ MB


- Data cleaning function has been created and applied.

In [124]:
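def wordopt(text):
    """Lowercase a news text and strip brackets, punctuation, and digit-bearing tokens."""
    text = text.lower()
    # Remove bracketed segments such as "[video]".
    text = re.sub(r'\[.*?\]', '', text)
    # Replace every non-word character (punctuation, newlines, symbols) with a space.
    text = re.sub(r"\W", " ", text)
    # Note: the URL, HTML-tag, and newline patterns below find nothing to match
    # at this point, because the characters they rely on were replaced above.
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    # Only underscores survive the \W step, so this removes them.
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    # Drop any token that contains a digit.
    text = re.sub(r'\w*\d\w*', '', text)
    return text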

In [125]:
df["text"] = df["text"].apply(wordopt)

df

Unnamed: 0,index,text,label
1,11990,,0
2,6023,new york reuters immigrant and refugee adv...,1
3,16171,tsa allows for a pat down of a teenage passen...,0
4,18752,it s a sad day in america when we allow the le...,0
5,8938,paul krugman asks a simple question about why ...,0
...,...,...,...
44884,937,the donald trump administration has to be the ...,0
44885,9685,sheriff clarke has it right the nfl wasn t li...,0
44886,997,washington reuters u s president donald t...,1
44887,4367,moscow reuters russian prime minister dmit...,1
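
As a quick check of what the cleaning step does to a single string, the sketch below re-declares the same substitutions in the same order and runs them on a made-up sentence (the sentence and the `example.com` URL are illustrative and not taken from the dataset). It suggests that a URL is not removed as a unit: because non-word characters are replaced with spaces before the URL pattern runs, fragments such as `https`, `example`, and `com` remain as ordinary tokens.

```python
import re
import string

def wordopt(text):
    # Same substitutions, in the same order, as the cleaning function in this notebook.
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

# Illustrative input only; not a row from the dataset.
sample = "WASHINGTON (Reuters) - Visit https://example.com [VIDEO] in 2017!"
tokens = wordopt(sample).split()
print(tokens)
```

If whole URLs should be dropped, one option would be to run the URL substitution before the `\W` step. That would change the cleaned text and possibly the reported scores, so it is offered only as a suggestion.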


- Split the data reproducibly into a training set (75%) and a test set (25%) by fixing `random_state`. Used `stratify` so that both sets keep the same proportion of true and fake news as the full dataset.

In [86]:
x = df["text"]
y = df["label"]

In [130]:
y.value_counts()

0    23476
1    21412
Name: label, dtype: int64

In [87]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25,random_state=1773,stratify=y)

In [88]:
x_train.shape

(33666,)

In [89]:
y_train.shape

(33666,)

In [90]:
x_test.shape

(11222,)

In [91]:
y_test.shape

(11222,)
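
Whether the stratification had the intended effect can be checked with simple arithmetic on numbers already printed in this notebook: the class counts from `y.value_counts()` and the per-class `support` values of the test set in the classification reports. The sketch below only recomputes ratios from those printed counts; it does not re-run the split.

```python
# Class counts printed by y.value_counts() (0 = fake, 1 = true).
full_counts = {0: 23476, 1: 21412}
# Per-class test-set sizes, taken from the "support" column of the classification reports.
test_counts = {0: 5869, 1: 5353}

def class_shares(counts):
    """Return each class's share of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Training counts follow by subtraction.
train_counts = {label: full_counts[label] - test_counts[label] for label in full_counts}

for name, counts in [("full", full_counts), ("train", train_counts), ("test", test_counts)]:
    shares = class_shares(counts)
    print(f"{name:5s} fake={shares[0]:.4f} true={shares[1]:.4f} n={sum(counts.values())}")
```

All three rows should show nearly identical shares, roughly 52.3% fake and 47.7% true, which is what a stratified split is expected to produce.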

## Feature Extraction

We use the TF-IDF (Term Frequency-Inverse Document Frequency) method, a statistical measure of how important a word is to a document within a collection of documents. The score of a word is its term frequency in the document (TF) multiplied by its inverse document frequency (IDF) across all documents, so words that are frequent in one news item but rare elsewhere receive the highest weights.

- In this way, numerical data is obtained from the text, that is, a new feature is produced from the existing feature.
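
To make the TF × IDF calculation concrete, here is a minimal hand-rolled sketch on a made-up three-sentence corpus. It assumes the smoothed IDF variant, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which scikit-learn documents as the default behaviour of `TfidfVectorizer`. The corpus, the whitespace tokenization, and the helper name `tfidf` are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration, already lowercased and cleaned).
docs = [
    "earthquake relief reaches city",
    "fake earthquake video spreads",
    "relief fund announced",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    term_freq = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s}{weight:.3f}")
```

Terms that occur in only one document (`reaches`, `city`) end up with a higher weight than terms shared with other documents (`earthquake`, `relief`).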

In [126]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorization = TfidfVectorizer()
x_train_feature_extract = vectorization.fit_transform(x_train)
x_test_feature_extract = vectorization.transform(x_test)

In [133]:
pd.DataFrame(x_train_feature_extract).head()

Unnamed: 0,0
0,"(0, 57166)\t0.02105224061158265\n (0, 10382..."
1,"(0, 58372)\t0.035711623162124284\n (0, 7443..."
2,"(0, 12431)\t0.025324550730060805\n (0, 7438..."
3,"(0, 4894)\t0.0215539966222589\n (0, 62237)\..."
4,"(0, 47364)\t0.05865207966596597\n (0, 62587..."


In [135]:
pd.DataFrame(x_train_feature_extract).shape

(33666, 1)
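
As the output above shows, wrapping the sparse TF-IDF matrix in `pd.DataFrame` produces a single column holding one sparse row per news item, so the reported shape does not reveal the number of features. A more direct way to inspect the result, sketched here on a toy corpus (the three sentences are made up), is to read the matrix's own `shape` and `nnz` attributes and the vectorizer's fitted `vocabulary_`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration only; not taken from the dataset.
toy_docs = [
    "earthquake relief reaches city",
    "fake earthquake video spreads",
    "relief fund announced",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # SciPy sparse matrix

n_docs, n_terms = toy_matrix.shape
print("documents:", n_docs)
print("vocabulary size:", n_terms)
print("non-zero entries:", toy_matrix.nnz)
print("density:", toy_matrix.nnz / (n_docs * n_terms))
# Column index assigned to a given term.
print("column of 'earthquake':", toy_vectorizer.vocabulary_["earthquake"])
```

On the notebook's data, the same attributes could be read from `x_train_feature_extract` and `vectorization` directly.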

## Model Training, Evaluation, and Validation

In this section, three models are trained on the TF-IDF features and their performance metrics are examined: Logistic Regression, Decision Tree, and Random Forest. Each model is fitted on the training set and evaluated on the held-out test set.

### Logistic Regression

In [93]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

LogReg = LogisticRegression(random_state=1773)
LogReg.fit(x_train_feature_extract,y_train)

LogisticRegression(random_state=1773)

In [94]:
prediction_LogReg =LogReg.predict(x_test_feature_extract)

In [95]:
LogReg.score(x_test_feature_extract, y_test)

0.9870789520584566

In [96]:
print(classification_report(y_test, prediction_LogReg))

              precision    recall  f1-score   support

           0       0.99      0.98      0.99      5869
           1       0.98      0.99      0.99      5353

    accuracy                           0.99     11222
   macro avg       0.99      0.99      0.99     11222
weighted avg       0.99      0.99      0.99     11222
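
The columns of this report can be reproduced from a confusion matrix. The sketch below uses the test-set supports from the report (5869 fake, 5353 true) and the 145 misclassifications implied by the accuracy score, since (1 - 0.98708) × 11222 ≈ 145. How those 145 errors are divided between the two classes is not shown in the notebook; the 90/55 split used here is an assumption chosen purely for illustration.

```python
# Test-set supports from the classification report (0 = fake, 1 = true).
support = {0: 5869, 1: 5353}
# ASSUMPTION: hypothetical split of the 145 errors; not taken from the notebook.
fake_predicted_true = 90
true_predicted_fake = 55

# Correct predictions per class.
correct = {0: support[0] - fake_predicted_true, 1: support[1] - true_predicted_fake}
# Total number of items predicted as each class.
predicted = {0: correct[0] + true_predicted_fake, 1: correct[1] + fake_predicted_true}

def report_row(label):
    """Precision, recall, and F1 for one class."""
    precision = correct[label] / predicted[label]
    recall = correct[label] / support[label]
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(correct.values()) / sum(support.values())
for label in (0, 1):
    p, r, f = report_row(label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Precision divides the correct predictions for a class by everything predicted as that class, recall divides them by the true size of the class, and F1 is the harmonic mean of the two.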



### Decision Tree Classification

In [97]:
from sklearn.tree import DecisionTreeClassifier

DecTree = DecisionTreeClassifier(random_state=1773)
DecTree.fit(x_train_feature_extract, y_train)

DecisionTreeClassifier(random_state=1773)

In [98]:
prediction_DecTree = DecTree.predict(x_test_feature_extract)

In [99]:
DecTree.score(x_test_feature_extract, y_test)

0.9942969167706291

In [100]:
print(classification_report(y_test, prediction_DecTree))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      5869
           1       0.99      0.99      0.99      5353

    accuracy                           0.99     11222
   macro avg       0.99      0.99      0.99     11222
weighted avg       0.99      0.99      0.99     11222



### Random Forest Classifier

In [101]:
from sklearn.ensemble import RandomForestClassifier

RandForest = RandomForestClassifier(random_state=1773)
RandForest.fit(x_train_feature_extract, y_train)

RandomForestClassifier(random_state=1773)

In [102]:
prediction_RandForest = RandForest.predict(x_test_feature_extract)

In [103]:
RandForest.score(x_test_feature_extract, y_test)

0.9878809481375869

In [104]:
print(classification_report(y_test, prediction_RandForest))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      5869
           1       0.99      0.99      0.99      5353

    accuracy                           0.99     11222
   macro avg       0.99      0.99      0.99     11222
weighted avg       0.99      0.99      0.99     11222



### Conclusion

Based on these results, all three models demonstrated high precision, recall, and F1-score on the test set, with every per-class value at 0.98 or 0.99. Test accuracy was about 98.7% for Logistic Regression, 99.4% for the Decision Tree, and 98.8% for Random Forest. These outcomes indicate that TF-IDF feature extraction works well on this dataset and that all three models are suitable for this text classification task.
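
Because every classification report rounds to 0.99, the raw accuracy scores are easier to compare when converted into counts of misclassified test items. The sketch below does that arithmetic with the three scores and the test-set size (11222) printed in this notebook; nothing is retrained.

```python
TEST_SIZE = 11222  # rows in x_test

# Accuracy values returned by each model's .score() call in this notebook.
accuracies = {
    "Logistic Regression": 0.9870789520584566,
    "Decision Tree": 0.9942969167706291,
    "Random Forest": 0.9878809481375869,
}

# Misclassified items = (1 - accuracy) * test size, rounded to the nearest integer.
errors = {name: round((1 - acc) * TEST_SIZE) for name, acc in accuracies.items()}

for name, n_wrong in sorted(errors.items(), key=lambda kv: kv[1]):
    print(f"{name:20s} {n_wrong:4d} wrong of {TEST_SIZE} ({n_wrong / TEST_SIZE:.2%})")
```

By this count the Decision Tree misclassifies 64 test items, against 136 for Random Forest and 145 for Logistic Regression.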

## Deployment and Integration

We deployed these models on a simple website that runs locally, built with Python's Flask library. Users enter a news text on the site, the text is sent to the trained models, and each model's prediction (real or fake) is displayed. To make the models available to the website, we serialized the trained models and the fitted TF-IDF vectorizer with Python's pickle library and loaded them in the web application.

In [105]:
import pickle

# Serialize each trained model; the with-block closes the file after writing.
for name, model in [("LogReg", LogReg), ("DecTree", DecTree), ("RandForest", RandForest)]:
    with open(f"FakeNewsSite/{name}.pkl", "wb") as f:
        pickle.dump(model, f)

In [107]:
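# The fitted vectorizer is saved too, so the website can transform new text the same way.
with open("FakeNewsSite/vectorizer.pkl", "wb") as f:
    pickle.dump(vectorization, f)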
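
On the website side, a prediction needs the same three steps used in training: clean the text, transform it with the already fitted vectorizer, and call the model. The sketch below is one possible, framework-agnostic way to arrange that; the helper names `load_artifact` and `predict_news` and the `LABELS` mapping are hypothetical and are not taken from the project's Flask code.

```python
import pickle

# Label encoding used in this notebook: 0 = fake news, 1 = true news.
LABELS = {0: "fake", 1: "true"}

def load_artifact(path):
    """Load one pickled object. Only unpickle files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)

def predict_news(text, clean, vectorizer, model):
    """Clean one news text, vectorize it with the fitted vectorizer, and classify it."""
    # transform() only: the vectorizer must not be re-fitted on new text.
    features = vectorizer.transform([clean(text)])
    return LABELS[int(model.predict(features)[0])]

# Hypothetical usage inside the web application:
# vectorizer = load_artifact("FakeNewsSite/vectorizer.pkl")
# model = load_artifact("FakeNewsSite/LogReg.pkl")
# print(predict_news("some news text", wordopt, vectorizer, model))
```

Passing the cleaning function, vectorizer, and model as arguments keeps the helper independent of any one classifier, so the same call could serve all three pickled models.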