This project contains a machine learning pipeline to classify news articles as "REAL" or "FAKE". It includes a script to train a Logistic Regression
model and a separate script to use that model for predicting the authenticity of new, user-provided text.
- Text Preprocessing: Cleans and prepares text data using stemming and stopword removal.
- TF-IDF Vectorization: Converts text articles into a numerical format suitable for machine learning.
- Model Training: Trains a Logistic Regression classifier and saves the trained components.
- Real-time Prediction: Allows a user to input any news text and get an instant prediction.
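The first two features can be sketched roughly as follows. This is a minimal illustration, not the code in train_model.py: the `preprocess` helper, the regular expression, and the sample sentences are assumptions. A tiny hard-coded stopword set stands in for the NLTK stopwords corpus so that the sketch runs without a download.

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: a small hard-coded set replaces
# nltk.corpus.stopwords.words("english") to avoid a corpus download.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on", "to", "and"}

stemmer = PorterStemmer()


def preprocess(text: str) -> str:
    """Keep letters only, lowercase, drop stopwords, and stem the remaining words."""
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return " ".join(stemmer.stem(w) for w in words if w not in STOPWORDS)


# Hypothetical sample articles, used only for illustration.
docs = [
    "The senators are voting on the new budget.",
    "Aliens landed in the capital and voted!",
]
cleaned = [preprocess(d) for d in docs]

# Fit TF-IDF on the cleaned text; each row of `features` represents one article.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(cleaned)
print(cleaned)
print(features.shape)
```

The resulting sparse matrix has one row per article and one column per distinct stemmed term, which is the numerical format the classifier consumes.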
.
├── news.csv
├── news_classifier_model.pkl
├── tfidf_vectorizer.pkl
├── predictor.py
└── train_model.py
- Python 3.7+
- pip package manager
If the project is hosted in a Git repository, clone it. Otherwise, make sure all project files are in a single folder.
git clone <your-repository-url>
cd <your-repository-name>
It's recommended to use a virtual environment.
# Create and activate a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
# Install the required libraries
pip install pandas numpy scikit-learn nltk joblib
The script requires the NLTK stopwords corpus. Run the following commands in a Python interpreter to download it:
import nltk
nltk.download('stopwords')
The process is divided into two main steps: training the model and then using it for prediction.
First, you must run the training script. This script will process the news.csv file, train the classifier, and save the model and vectorizer to disk as .pkl files.
python train_model.py
When the script finishes, it prints the model's accuracy on the training and test data to the console and creates two new files:
news_classifier_model.pkl
tfidf_vectorizer.pkl
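The training step described above might look roughly like the sketch below. It is an assumption-laden outline rather than the contents of train_model.py: the toy dataset, the choice to concatenate `title` and `text`, the 80/20 split, and the use of joblib for saving are all illustrative. Text preprocessing is omitted for brevity, and the files are written to a temporary directory so that real artifacts are not overwritten.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: a toy in-memory dataset with the same columns as news.csv.
data = pd.DataFrame({
    "title": ["Budget vote", "Alien landing", "Rate decision", "Miracle cure"] * 5,
    "text": [
        "senate passes budget after long debate",
        "aliens secretly control the city council",
        "central bank holds interest rates steady",
        "doctors hate this one miracle cure trick",
    ] * 5,
    "label": ["REAL", "FAKE", "REAL", "FAKE"] * 5,
})

# Assumption: title and text are combined into a single input string.
content = data["title"] + " " + data["text"]

X_train, X_test, y_train, y_test = train_test_split(
    content, data["label"], test_size=0.2, stratify=data["label"], random_state=42
)

# Fit the vectorizer on the training text only, then reuse it for the test split.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression()
model.fit(X_train_vec, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train_vec))
test_acc = accuracy_score(y_test, model.predict(X_test_vec))
print(f"Training accuracy: {train_acc:.2f}")
print(f"Test accuracy: {test_acc:.2f}")

# Save both components so prediction can reuse the exact same vocabulary.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "news_classifier_model.pkl")
vectorizer_path = os.path.join(out_dir, "tfidf_vectorizer.pkl")
joblib.dump(model, model_path)
joblib.dump(vectorizer, vectorizer_path)
```

Saving the vectorizer alongside the model matters because the classifier's weights are tied to the column order of the fitted TF-IDF vocabulary.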
Once the model and vectorizer are saved, you can use the predictor.py script to classify new articles.
Run the script from your terminal:
python predictor.py
The script will prompt you to enter the news text. Paste the article content and press Enter. The model will then output its prediction.
Example Interaction:
Loading model and vectorizer...
Files loaded successfully.
Enter the news text to check:
<...paste your news article text here...>
--- Prediction ---
🚨 The model predicts that this news is FAKE.
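Under the hood, a prediction like the one above amounts to loading the two saved files, transforming the input with the saved vectorizer, and calling the model. The sketch below illustrates that flow under stated assumptions: `predict_label` is a hypothetical helper, joblib is assumed as the loading mechanism, and a toy model is trained inline only so that the example is self-contained. Any preprocessing applied during training would also need to be applied to the input text; it is left out here.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Assumption: a toy model is trained and saved here only to make the sketch
# runnable; in the project these files are produced by train_model.py.
texts = ["senate passes budget after debate", "aliens secretly control city council"] * 5
labels = ["REAL", "FAKE"] * 5
toy_vectorizer = TfidfVectorizer().fit(texts)
toy_model = LogisticRegression().fit(toy_vectorizer.transform(texts), labels)

tmp = tempfile.mkdtemp()
model_path = os.path.join(tmp, "news_classifier_model.pkl")
vectorizer_path = os.path.join(tmp, "tfidf_vectorizer.pkl")
joblib.dump(toy_model, model_path)
joblib.dump(toy_vectorizer, vectorizer_path)


def predict_label(text: str, model, vectorizer) -> str:
    """Transform raw text with the saved vectorizer and return the predicted label."""
    # Use transform (not fit_transform) so the training vocabulary is preserved.
    features = vectorizer.transform([text])
    return str(model.predict(features)[0])


model = joblib.load(model_path)
vectorizer = joblib.load(vectorizer_path)
print(predict_label("aliens control the council", model, vectorizer))
```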
- Algorithm: Logistic Regression
- Feature Extraction: Term Frequency-Inverse Document Frequency (TF-IDF)
- Core Libraries: Scikit-learn, Pandas, NLTK
- train_model.py: The main script for training the model and saving the pipeline components.
- predictor.py: A script that loads the saved model to make real-time predictions on user input.
- news.csv: The dataset used for training the model. It must contain title, text, and label columns.
- news_classifier_model.pkl: The saved, trained Logistic Regression model object.
- tfidf_vectorizer.pkl: The saved TF-IDF vectorizer object, necessary to transform new text correctly.
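Since news.csv must contain the title, text, and label columns, it can be useful to verify this before training. The following is a small sketch of such a check; the `check_columns` helper and the in-memory sample are assumptions for illustration and are not part of the project scripts.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = {"title", "text", "label"}


def check_columns(df: pd.DataFrame) -> None:
    """Raise ValueError if any required column is missing from the dataset."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"news.csv is missing columns: {sorted(missing)}")


# Assumption: an in-memory CSV stands in for the real news.csv file.
sample = io.StringIO("title,text,label\nBudget vote,Senate passes budget,REAL\n")
df = pd.read_csv(sample)
check_columns(df)  # passes silently when all three columns are present
print(df.shape)
```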