Text Classification for Fake News Detection 📰

Distinguishing authentic news from misinformation has become increasingly difficult as digital information spreads faster than it can be verified. The prevalence of fake news and propaganda across online platforms calls for automated systems that can assess the veracity of textual content in real time. Combining Natural Language Processing (NLP), Machine Learning (ML), and Deep Learning, this project develops a model that automatically classifies texts as either fake/propaganda or legitimate news.

Fake News Classifier

Dataset Overview

The dataset, 'Corpus Proppy 1.0', comprises 52,000 articles sourced from more than 100 news outlets. It is split into three subsets:

  • Training Dataset: 35,986 articles for model learning.
  • Validation Dataset: 5,125 articles for model evaluation.
  • Test Dataset: 5,125 articles for final model testing.

Preprocessing

Text Preprocessing

The dataset arrives pre-labeled as 'propaganda' or 'non-propaganda', which simplifies preprocessing. Key steps include:

  • Handling Missing Data
  • Text Cleaning: Removing irrelevant characters, punctuation, and symbols.
  • Tokenization: Splitting text into individual words or tokens.
  • Lemmatization/Stemming: Reducing words to their base or root form.
  • Stopword Removal: Eliminating common, non-significant words.
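The steps above can be sketched in plain Python. This is an illustrative stand-in, not the project's actual code: the stopword list is deliberately tiny (a real pipeline would use e.g. NLTK's English stopwords), and the suffix-stripping rule is a naive substitute for proper lemmatization/stemming.

```python
import re

# Minimal stopword list for illustration only; the real pipeline would
# use a much fuller list (e.g. NLTK's English stopwords).
STOPWORDS = {"the", "a", "an", "is", "are", "in", "on", "of", "and", "to"}

def preprocess(text):
    """Clean, tokenize, and filter a raw article string."""
    # Text cleaning: lowercase and keep only letters and whitespace
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenization: split on whitespace
    tokens = text.split()
    # Stopword removal
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Naive plural stripping as a stand-in for lemmatization/stemming
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return tokens

print(preprocess("The senators are voting on the new bills!"))
# → ['senator', 'voting', 'new', 'bill']
```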

Handling Imbalanced Data

  • Thorough exploration and addressing of class imbalance.
  • Application of Synthetic Minority Over-sampling Technique (SMOTE).
  • Experimentation with class weights in machine learning models.
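The class-weight experiments amount to giving minority-class errors a larger penalty during training. A minimal sketch of the standard inverse-frequency ("balanced") weighting — the same formula scikit-learn uses for `class_weight='balanced'` — with an illustrative label distribution, not the corpus's actual one:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: w_c = n_samples / (n_classes * n_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

# Imbalance resembling the task: far more 'non-propaganda' articles
labels = ["non-propaganda"] * 9 + ["propaganda"] * 1
print(balanced_class_weights(labels))
# → {'non-propaganda': 0.555..., 'propaganda': 5.0}
```

SMOTE takes the opposite approach, synthesizing new minority-class samples (via `imblearn.over_sampling.SMOTE`) instead of reweighting existing ones.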

Feature Extraction

Various embedding techniques were employed, including:

  • N-gram analysis
  • Bag of Words (BOW)
  • TF-IDF (Term Frequency-Inverse Document Frequency)
  • Word2Vec (Google News)
  • ELMo (Embeddings from Language Models)
  • Latent Dirichlet Allocation (LDA)
  • Non-Negative Matrix Factorization (NMF)
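Of these, TF-IDF over word n-grams is the most common baseline and can be sketched with scikit-learn. The toy documents and vectorizer settings below are illustrative assumptions, not the project's actual configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "breaking news shocks the nation",
    "government confirms the report",
    "shocking secret they do not want you to know",
]

# Unigrams + bigrams combine the N-gram and TF-IDF steps in one vectorizer
vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(docs)
print(X.shape)  # one row per document, one column per uni/bi-gram
```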

Model Development

A spectrum of models, from traditional ML algorithms to advanced deep learning architectures, was explored:

  • Traditional ML: Logistic Regression, Naive Bayes, XGBoost, LightGBM
  • Deep Learning: Bidirectional RNN, LSTM, GRU
  • Transformer Models: BERT, DistilBERT, RoBERTa, ALBERT, XLNet

Bidirectional LSTM and Transformer-based models emerged as top performers, showcasing superior accuracy and robustness in classifying text.
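While the top performers were BiLSTM and transformer models, the overall classification setup is easiest to see with a simple baseline. A sketch of a TF-IDF + logistic regression pipeline on a tiny invented corpus (the texts, labels, and hyperparameters are all illustrative, not the project's):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny toy corpus standing in for 'Corpus Proppy 1.0' (illustrative only)
texts = [
    "shocking truth the media hides from you",
    "you will not believe this outrageous conspiracy",
    "they are lying to you wake up now",
    "the council approved the budget on tuesday",
    "quarterly earnings rose three percent this year",
    "the committee published its annual report",
]
labels = ["propaganda"] * 3 + ["non-propaganda"] * 3

# class_weight='balanced' mirrors the class-weight experiments above
clf = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced"),
)
clf.fit(texts, labels)
print(clf.predict(["you will not believe what they are hiding"]))
```

The deep-learning models replace the TF-IDF features with learned embeddings and the linear classifier with recurrent or transformer layers, but the fit/predict framing is the same.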

Model Deployment

The propaganda detection model was deployed with Streamlit and hosted on Streamlit Cloud, making it accessible through a simple web interface. The project's code is available on GitHub for version control and collaboration.

Check out the app here!

Stay tuned for updates and enhancements!
