In an age of rapidly disseminated digital information, distinguishing authentic news from misinformation has become increasingly difficult. The prevalence of fake news and propaganda across online platforms calls for automated systems that can assess the veracity of textual content in real time. Drawing on Natural Language Processing (NLP), Machine Learning (ML), and Deep Learning, this project develops a model that automatically classifies texts as either fake/propaganda or legitimate news.
The dataset, 'Corpus Proppy 1.0', comprises 52,000 articles sourced from over 100 diverse news outlets. It is split into three subsets:
- Training Dataset: 35,986 articles for model learning.
- Validation Dataset: 5,125 articles for model evaluation.
- Test Dataset: 5,125 articles for final model testing.
The dataset arrives pre-labeled as 'propaganda' or 'non-propaganda', which simplifies preprocessing. Key steps include:
- Handling Missing Data: Dropping or imputing incomplete records.
- Text Cleaning: Removing irrelevant characters, punctuation, and symbols.
- Tokenization: Splitting text into individual words or tokens.
- Lemmatization/Stemming: Reducing words to their base or root form.
- Stopword Removal: Eliminating common, non-significant words.
- Class Imbalance: Thoroughly exploring and addressing the skewed label distribution.
- SMOTE: Over-sampling the minority class with the Synthetic Minority Over-sampling Technique.
- Class Weights: Experimenting with class-weighted losses in the machine learning models.
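The cleaning, tokenization, and stopword-removal steps above can be sketched in plain Python. This is an illustrative toy, not the project's actual pipeline: the regex, the tiny stopword list, and the `preprocess` helper are all hypothetical stand-ins for what a real run would do with NLTK or spaCy.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use NLTK's or spaCy's full list.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "in", "on", "of", "and", "to"}

def preprocess(text: str) -> list[str]:
    """Clean, tokenize, and filter a raw article string."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # strip punctuation, digits, and symbols
    tokens = text.split()                  # simple whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The Senate passed the bill on Tuesday, 2021!"))
# → ['senate', 'passed', 'bill', 'tuesday']
```

Lemmatization/stemming is omitted here for brevity; in practice it would slot in after tokenization.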
Various embedding techniques were employed, including:
- N-gram analysis
- Bag of Words (BoW)
- TF-IDF (Term Frequency-Inverse Document Frequency)
- Word2Vec (Google News)
- ELMo (Embeddings from Language Models)
- Latent Dirichlet Allocation (LDA)
- Non-Negative Matrix Factorization (NMF)
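As a minimal sketch of the TF-IDF and n-gram features above, scikit-learn's `TfidfVectorizer` can produce both at once via `ngram_range`. The three-document corpus here is a toy stand-in for the Proppy articles, not the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the Proppy articles.
docs = [
    "the economy is collapsing and only we can save it",
    "the central bank raised interest rates today",
    "they are lying to you about the economy",
]

# Unigrams + bigrams, covering both the n-gram analysis and TF-IDF weighting.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(docs)

print(X.shape)  # (3, number of distinct uni-/bi-grams)
```

Swapping `TfidfVectorizer` for `CountVectorizer` with the same arguments would yield the plain Bag-of-Words counts instead.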
A spectrum of models, from traditional ML algorithms to advanced deep learning architectures, was explored:
- Traditional ML: Logistic Regression, Naive Bayes, XGBoost, LightGBM
- Deep Learning: Bidirectional RNN, LSTM, GRU
- Transformer Models: BERT, DistilBERT, RoBERTa, ALBERT, XLNet
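A hedged sketch of one of the traditional ML baselines, combining TF-IDF features with a class-weighted Logistic Regression (one of the imbalance strategies listed earlier). The six texts and labels are invented toy data, not the Proppy corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, imbalanced stand-in data: 4 non-propaganda vs 2 propaganda examples.
texts = [
    "quarterly earnings beat analyst expectations",
    "city council approves new transit budget",
    "researchers publish peer reviewed climate study",
    "local team wins championship game",
    "the corrupt elites are hiding the truth from you",
    "only our glorious leader can save the nation",
]
labels = [0, 0, 0, 0, 1, 1]  # 1 = propaganda

# class_weight='balanced' reweights the loss by inverse class frequency,
# counteracting the label skew without resampling.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(texts, labels)

print(model.predict(["the elites are hiding the truth"]))
```

The deep learning and Transformer models follow the same fit/predict shape conceptually, but operate on token sequences or contextual embeddings rather than a single TF-IDF matrix.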
Bidirectional LSTM and the Transformer-based models emerged as the top performers, delivering the highest accuracy and the most robust classification results.
The propaganda detection model was deployed as a Streamlit app hosted on Streamlit Cloud, making it easy to access and maintain. The project's code is available on GitHub for version control and collaboration.
Stay tuned for updates and enhancements!
