In an age of rapidly disseminated digital information, distinguishing authentic news from misinformation has become increasingly difficult. The prevalence of fake news and propaganda across online platforms calls for automated systems that can assess the veracity of textual content in real time. Drawing on Natural Language Processing (NLP), Machine Learning (ML), and Deep Learning, this project develops a model that automatically classifies texts as either fake/propaganda or legitimate news.
The dataset, 'Corpus Proppy 1.0', comprises 52,000 articles sourced from over 100 diverse news outlets. It is split into three subsets:
- Training Dataset: 35,986 articles for model learning.
- Validation Dataset: 5,125 articles for model evaluation.
- Test Dataset: 5,125 articles for final model testing.
The dataset arrives pre-labeled as 'propaganda' or 'non-propaganda', which simplifies preprocessing. Key steps include:
- Handling Missing Data: Dropping or imputing incomplete records.
- Text Cleaning: Removing irrelevant characters, punctuation, and symbols.
- Tokenization: Splitting text into individual words or tokens.
- Lemmatization/Stemming: Reducing words to their base or root form.
- Stopword Removal: Eliminating common, non-significant words.
- Class Imbalance: Thoroughly exploring and addressing the skewed label distribution.
- SMOTE: Over-sampling the minority class with the Synthetic Minority Over-sampling Technique.
- Class Weights: Experimenting with class-weighted losses in the machine learning models.
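The cleaning, tokenization, and stopword-removal steps above can be sketched in plain Python. This is an illustrative toy, not the project's actual pipeline: the regex, the tiny stopword list, and the `preprocess` helper are all hypothetical stand-ins for what a real run would do with NLTK or spaCy.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use NLTK's or spaCy's full list.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "in", "on", "of", "and", "to"}

def preprocess(text: str) -> list[str]:
    """Clean, tokenize, and filter a raw article string."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # strip punctuation, digits, and symbols
    tokens = text.split()                  # simple whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The Senate passed the bill on Tuesday, 2021!"))
# → ['senate', 'passed', 'bill', 'tuesday']
```

Lemmatization/stemming is omitted here for brevity; in practice it would slot in after tokenization.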
Various embedding techniques were employed, including:
- N-gram analysis
- Bag of Words (BoW)
- TF-IDF (Term Frequency-Inverse Document Frequency)
- Word2Vec (Google News)
- ELMo (Embeddings from Language Models)
- Latent Dirichlet Allocation (LDA)
- Non-Negative Matrix Factorization (NMF)
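As a minimal sketch of the TF-IDF and n-gram features above, scikit-learn's `TfidfVectorizer` can produce both at once via `ngram_range`. The three-document corpus here is a toy stand-in for the Proppy articles, not the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the Proppy articles.
docs = [
    "the economy is collapsing and only we can save it",
    "the central bank raised interest rates today",
    "they are lying to you about the economy",
]

# Unigrams + bigrams, covering both the n-gram analysis and TF-IDF weighting.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(docs)

print(X.shape)  # (3, number of distinct uni-/bi-grams)
```

Swapping `TfidfVectorizer` for `CountVectorizer` with the same arguments would yield the plain Bag-of-Words counts instead.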
A spectrum of models, from traditional ML algorithms to advanced deep learning architectures, was explored:
- Traditional ML: Logistic Regression, Naive Bayes, XGBoost, LightGBM
- Deep Learning: Bidirectional RNN, LSTM, GRU
- Transformer Models: BERT, DistilBERT, RoBERTa, ALBERT, XLNet
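A hedged sketch of one of the traditional ML baselines, combining TF-IDF features with a class-weighted Logistic Regression (one of the imbalance strategies listed earlier). The six texts and labels are invented toy data, not the Proppy corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, imbalanced stand-in data: 4 non-propaganda vs 2 propaganda examples.
texts = [
    "quarterly earnings beat analyst expectations",
    "city council approves new transit budget",
    "researchers publish peer reviewed climate study",
    "local team wins championship game",
    "the corrupt elites are hiding the truth from you",
    "only our glorious leader can save the nation",
]
labels = [0, 0, 0, 0, 1, 1]  # 1 = propaganda

# class_weight='balanced' reweights the loss by inverse class frequency,
# counteracting the label skew without resampling.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(texts, labels)

print(model.predict(["the elites are hiding the truth"]))
```

The deep learning and Transformer models follow the same fit/predict shape conceptually, but operate on token sequences or contextual embeddings rather than a single TF-IDF matrix.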
Bidirectional LSTM and the Transformer-based models emerged as the top performers, delivering the highest accuracy and the most robust classification results.
The propaganda detection model was deployed as a Streamlit app hosted on Streamlit Cloud, making it easy to access and maintain. The project's code is available on GitHub for version control and collaboration.
Stay tuned for updates and enhancements!
