# News Text Classification Pipeline with spaCy & scikit-learn

## Table of Contents

- [1. Introduction](#1-introduction)
- [2. Importing Required Libraries](#2-importing-required-libraries)
- [3. Loading and Exploring the Dataset](#3-loading-and-exploring-the-dataset)
  - [3.1 Importing the Dataset from Kaggle](#31-importing-the-dataset-from-kaggle)
  - [3.2 Loading the Dataset](#32-loading-the-dataset)
  - [3.3 Checking for Missing Values](#33-checking-for-missing-values)
  - [3.4 Balancing the Dataset](#34-balancing-the-dataset)
- [4. Feature Extraction with spaCy](#4-feature-extraction-with-spacy)
  - [4.1 Importing spaCy Large Language model](#41-importing-spacy-large-language-model)
  - [4.2 Vectorizing the text column](#42-vectorizing-the-text-column)
- [5. Model Training and Evaluation](#5-model-training-and-evaluation)
  - [5.1 Splitting the Data](#51-splitting-the-data)
  - [5.2 Training the Model](#52-training-the-model)
  - [5.3 Model Prediction](#53-model-prediction)
  - [5.4 Model Evaluation](#54-model-evaluation)
- [6. Model Selection and Hyperparameter Tuning](#6-model-selection-and-hyperparameter-tuning)
  - [6.1 Setting Up and Running GridSearchCV](#61-setting-up-and-running-gridsearchcv)
  - [6.2 Analyzing Grid Search Results](#62-analyzing-grid-search-results)
  - [6.3 Extracting Best Results for Each Classifier](#63-extracting-best-results-for-each-classifier)
- [7. Final Model Evaluation](#7-final-model-evaluation)


## 1. Introduction

This notebook provides a comprehensive guide to classifying news articles as Fake or Real using modern machine learning techniques. We walk through each step of the workflow, including data loading, preprocessing, feature extraction with spaCy, model training, evaluation, and hyperparameter tuning. By the end, you will have a clear understanding of how to build and assess a robust news classification model.

## 2. Importing Required Libraries

We begin by importing all the necessary libraries for data manipulation, visualization, and machine learning. Libraries like pandas and numpy help with data handling, matplotlib and seaborn are used for visualization, and scikit-learn provides tools for model building and evaluation. We also use spaCy for advanced natural language processing.

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning utilities
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Natural Language Processing
import spacy

# File and Directory operations
import os
import shutil
import zipfile

## 3. Loading and Exploring the Dataset

In this section, we load the news dataset and perform initial exploration to understand its structure, check for missing values, and get a sense of the data distribution.


### 3.1 Importing the Dataset from Kaggle

The code cell below automates downloading and extracting the news dataset from Kaggle:

1. **Creates a directory** called `NEWS DATA` if it doesn't exist.
2. **Configures Kaggle API credentials** by copying `kaggle.json` to the correct location (`~/.kaggle`) and setting secure permissions.
3. **Downloads the dataset** (`fake-and-real-news-dataset`) from Kaggle into the `NEWS DATA` directory using the Kaggle CLI.
4. **Unzips the downloaded dataset** into the `NEWS DATA` directory for further use. 


In [None]:
# Ensure NEWS DATA directory exists
os.makedirs("NEWS DATA", exist_ok=True)

# Set Kaggle API credentials (assumes kaggle.json is in the current directory)
kaggle_json_path = os.path.expanduser("kaggle.json")
kaggle_dir = os.path.expanduser("~/.kaggle")
os.makedirs(kaggle_dir, exist_ok=True)
if not os.path.exists(os.path.join(kaggle_dir, "kaggle.json")):
    shutil.copy(kaggle_json_path, os.path.join(kaggle_dir, "kaggle.json"))

os.chmod(os.path.join(kaggle_dir, "kaggle.json"), 0o600)

# Download and unzip dataset to NEWS DATA directory
!kaggle datasets download -d clmentbisaillon/fake-and-real-news-dataset -p "NEWS DATA"

with zipfile.ZipFile("NEWS DATA/fake-and-real-news-dataset.zip", "r") as zip_ref:
    zip_ref.extractall("NEWS DATA")

### 3.2 Loading the Dataset


The dataset is split into two files:
- `Fake.csv` (23502 fake news articles)
- `True.csv` (21417 true news articles)

Dataset columns:

- `title`: title of the news article
- `text`: body text of the news article
- `subject`: subject of the news article
- `date`: publication date of the news article

For the classification task we only need the `text` column from each file. We add a `label` column to each DataFrame (`Fake` for rows from `Fake.csv`, `Real` for rows from `True.csv`), concatenate the two into a single DataFrame, and shuffle the rows so that the classes are interleaved. The result is exported to `news_data.csv` for further processing.




In [None]:
real_data = pd.read_csv("NEWS DATA/True.csv")
real_data["label"] = "Real"
real_df = real_data[["text", "label"]]

fake_data = pd.read_csv("NEWS DATA/Fake.csv")
fake_data["label"] = "Fake"
fake_df = fake_data[["text", "label"]]

real_fake_df = pd.concat([real_df, fake_df], axis=0)
# Shuffle all rows; a fixed random_state keeps the shuffle reproducible
news_data = real_fake_df.sample(frac=1, random_state=2024).reset_index(drop=True)
news_data.to_csv("NEWS DATA/news_data.csv", index=False)

We reload the combined data from `news_data.csv` and display the first few rows to get an overview of its structure and columns.

In [None]:
news_data = pd.read_csv("NEWS DATA/news_data.csv")
news_data.head()

In [None]:
news_data.shape

### 3.3 Checking for Missing Values

It is important to check for missing or empty values in the dataset to ensure data quality. Here, we check for nulls and for blank entries in the `text` column.

In [None]:
news_data.isnull().sum()

In [None]:
news_data["text"].isnull().sum()

We found that there are no null values, but some rows have a `text` value that consists of a single space. We will exclude these rows from our analysis.

In [None]:
empty_text = news_data[news_data["text"] == " "]
print(empty_text.shape)
empty_text.head()

In [None]:
empty_text["label"].value_counts()

The final dataset will be the one with no empty strings in the 'text' column, which we will use for further processing and model training.

In [None]:
new_data_final = news_data[~(news_data["text"] == " ")]
new_data_final.shape

In [None]:
new_data_final["label"].value_counts()
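
The filter above matches only texts that are exactly one space. A slightly more defensive variant, sketched here on a made-up toy frame (the `toy` DataFrame and the `is_blank` mask are illustrative names, not part of the notebook), treats any missing or whitespace-only text as blank:

```python
import pandas as pd

# Toy data standing in for news_data (illustrative values only)
toy = pd.DataFrame(
    {
        "text": ["Some article body", " ", "", "   \n", None, "Another body"],
        "label": ["Real", "Fake", "Fake", "Fake", "Real", "Fake"],
    }
)

# A row is blank if the text is missing or contains only whitespace
is_blank = toy["text"].isna() | (toy["text"].fillna("").str.strip() == "")

toy_clean = toy[~is_blank].reset_index(drop=True)
print(int(is_blank.sum()), "blank rows removed;", len(toy_clean), "rows kept")
```

Whether the stricter check changes the row counts on the real data depends on the dataset, so it is worth comparing both filters before swapping one for the other.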

### 3.4 Balancing the Dataset

To ensure fair model training, we balance the dataset by sampling an equal number of examples from each class. This helps prevent bias towards the majority class.

In [None]:
# Draw 5000 rows per class; GroupBy.sample keeps every column, including "label"
news_df = (
    new_data_final.groupby("label")
    .sample(n=5000, random_state=2024)
    .reset_index(drop=True)
)
news_df

In [None]:
news_df["label"].value_counts()
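
The cell above draws a fixed 5,000 rows per class. If the class sizes are not known in advance, one option is to downsample every class to the size of the smallest one. The sketch below shows this on a made-up toy frame (`toy`, `n_per_class`, and `balanced` are illustrative names); it assumes pandas 1.1 or later, where `GroupBy.sample` is available.

```python
import pandas as pd

# Toy imbalanced data: 7 "Fake" rows and 3 "Real" rows
toy = pd.DataFrame(
    {
        "text": [f"article {i}" for i in range(10)],
        "label": ["Fake"] * 7 + ["Real"] * 3,
    }
)

# The smallest class determines how many rows each class keeps
n_per_class = int(toy["label"].value_counts().min())

balanced = (
    toy.groupby("label")
    .sample(n=n_per_class, random_state=2024)
    .reset_index(drop=True)
)
print(balanced["label"].value_counts())
```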

We map the labels to integers to facilitate model training: `Fake` becomes 0 and `Real` becomes 1.

In [None]:
news_df["label"] = news_df["label"].map({"Fake": 0, "Real": 1})
news_df["label"].value_counts()

In [None]:
print(news_df.loc[0, "text"])

## 4. Feature Extraction with spaCy

We use spaCy's large English language model to convert news text into numerical vectors. These vectors capture semantic information from the text, which can be used as features for machine learning models.

### 4.1 Importing spaCy Large Language model

In [None]:
nlp = spacy.load("en_core_web_lg")

In [None]:
text = news_df.iloc[0, 0]
text_vector = nlp(text).vector
print(text_vector.shape)
text_vector
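
For a model with static word vectors such as `en_core_web_lg`, `doc.vector` defaults to the average of the token vectors, which is why an article of any length maps to a vector of the same size. The NumPy sketch below illustrates that kind of mean pooling with a made-up 4-dimensional embedding table (`toy_embeddings` and `mean_pool` are hypothetical names; the real model uses 300 dimensions).

```python
import numpy as np

# Made-up embedding table; real word vectors come from the spaCy model
toy_embeddings = {
    "markets": np.array([0.2, -0.1, 0.4, 0.0]),
    "rally": np.array([0.6, 0.3, -0.2, 0.1]),
    "today": np.array([-0.2, 0.1, 0.1, 0.5]),
}


def mean_pool(tokens, embeddings, dim=4):
    """Average the token vectors; in this sketch unknown tokens count as zeros."""
    vectors = [embeddings.get(tok, np.zeros(dim)) for tok in tokens]
    return np.mean(vectors, axis=0)


doc_vector = mean_pool(["markets", "rally", "today"], toy_embeddings)
print(doc_vector.shape)
```

Averaging discards word order, so two articles that use the same words in a different order receive the same vector.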

In [None]:
# Collect the lemma (base form) of every token in the text
lemma_word_list = [token.lemma_ for token in nlp(text)]


print(text)
print(" ".join(lemma_word_list))


### 4.2 Vectorizing the `text` column

Now, we convert each news article's text into a vector using spaCy. This allows us to use these vectors as features for our machine learning models.

In [None]:
news_df["text_vector"] = news_df["text"].apply(lambda x: nlp(x).vector)
news_df.head()
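
Calling `nlp(x)` once per row works but can be slow on thousands of articles. spaCy's `nlp.pipe` processes texts in batches, and a helper along the following lines is one way to use it. This is a sketch: `vectorize_texts` is a hypothetical name, and the batch size of 64 is an arbitrary assumption rather than a tuned value.

```python
import numpy as np


def vectorize_texts(nlp, texts, batch_size=64):
    """Return a 2D array with one document vector per input text."""
    # nlp.pipe streams the texts through the pipeline in batches
    docs = nlp.pipe(texts, batch_size=batch_size)
    return np.stack([doc.vector for doc in docs])


# Example usage (assumes `nlp` and `news_df` from the cells above):
# X_all = vectorize_texts(nlp, news_df["text"].tolist())
# X_all has shape (number_of_articles, vector_width)
```

Because the helper already returns a 2D array, features built this way would not need a separate stacking step later.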

## 5. Model Training and Evaluation

We split the data into training and test sets, then train a Multinomial Naive Bayes classifier using the extracted features. The model's performance is evaluated using accuracy, confusion matrix, and classification report.

### 5.1 Splitting the Data

Splitting the dataset into training and test sets is crucial for evaluating the model's performance on data it has not seen. We use an 80-20 split, where 80% of the data is used for training and 20% for testing. We also stratify on the label so that both sets keep the same class proportions.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    news_df["text_vector"].values,
    news_df["label"],
    test_size=0.2,
    random_state=2024,
    stratify=news_df["label"],
)


At this point `X_train` is a 1D array whose elements are themselves 1D vectors, one per article. `np.stack()` combines such a sequence of equal-length arrays into a single NumPy array with an extra dimension:

```python
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)
```

**Why is this required?**

- **Consistent Shape:** Machine learning models (especially in scikit-learn, TensorFlow, or PyTorch) expect input data as a 2D array: shape `(num_samples, num_features)`.
- **List to Array:** If `X_train` is a list of 1D arrays (each representing a sample), stacking turns it into a 2D array where each row is a sample.
- **Efficient Computation:** NumPy arrays are faster and more memory-efficient than Python lists for numerical operations.


In [None]:
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)
print(
    f"Shape of training data before stacking: {X_train.shape}, after stacking: {X_train_2d.shape}"
)
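
To see the shape change in isolation, the sketch below builds a small object array that mimics what `news_df["text_vector"].values` returns and stacks it. The names `column` and `stacked` and the 4-dimensional vectors are illustrative assumptions.

```python
import numpy as np

# A 1D object array holding three 4-dimensional vectors
column = np.empty(3, dtype=object)
for i in range(3):
    column[i] = np.arange(4, dtype=float) + i

# Stacking turns the array of arrays into a regular 2D float array
stacked = np.stack(column)
print(column.shape, "->", stacked.shape)
```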

### 5.2 Training the Model

We train a Multinomial Naive Bayes classifier on the training set. `MultinomialNB` requires non-negative feature values, and spaCy vectors contain negative components, so the pipeline first rescales every feature to the [0, 1] range with `MinMaxScaler`.

In [None]:
pipe = Pipeline([("scaler", MinMaxScaler()), ("clf", MultinomialNB())])
pipe.fit(X_train_2d, y_train)
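
The scaler in this pipeline is not optional: `MultinomialNB` rejects input with negative values at fit time. The sketch below demonstrates this on random toy data (the arrays `X` and `y` and the name `scaled_pipe` are made up for illustration and are unrelated to the news vectors).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(2024)
X = rng.normal(size=(40, 5))  # contains negative values, like word vectors
y = np.array([0, 1] * 20)

# Fitting directly on data with negative entries is rejected
try:
    MultinomialNB().fit(X, y)
    raised = False
except ValueError as err:
    raised = True
    print("Without scaling:", err)

# Rescaling each feature to [0, 1] first makes the input acceptable
scaled_pipe = Pipeline([("scaler", MinMaxScaler()), ("clf", MultinomialNB())])
scaled_pipe.fit(X, y)
print("With scaling: fit succeeded")
```

Min-max scaling is used here only to satisfy that constraint; a classifier without the non-negativity requirement could work with the raw vectors.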


### 5.3 Model Prediction

In [None]:
y_pred = pipe.predict(X_test_2d)
acc_score = accuracy_score(y_test, y_pred)
print(f"The accuracy score of the classifier is {acc_score * 100:.2f}%")

### 5.4 Model Evaluation

We evaluate the model's performance using accuracy, confusion matrix, and classification report. The accuracy score gives us a quick overview of how well the model is performing, while the confusion matrix and classification report provide detailed insights into the model's predictions across different classes.

In [None]:
# confusion_matrix orders classes by sorted label value: 0 = Fake, 1 = Real
class_names = ["Fake", "Real"]
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index=class_names, columns=class_names)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_df, annot=True, fmt="d")
plt.title(
    "Confusion Matrix for the Predicted Classes\n"
    f"Classifier: MultinomialNB() ; Accuracy: {acc_score * 100:.2f}%",
    fontsize=14,
    pad=10,
)
# Rows of a scikit-learn confusion matrix are true labels, columns are predictions
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
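
When reading the heatmap, it helps to remember that scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, with classes in sorted order (here 0 = Fake, 1 = Real). The toy example below uses made-up labels (`y_true`, `y_hat`) to make the orientation explicit.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels: 3 true Fake (0) and 4 true Real (1) samples
y_true = [0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 1, 0]

class_names = ["Fake", "Real"]  # index 0 -> Fake, index 1 -> Real
cm_toy = confusion_matrix(y_true, y_hat, labels=[0, 1])
cm_toy_df = pd.DataFrame(
    cm_toy,
    index=[f"true {name}" for name in class_names],
    columns=[f"pred {name}" for name in class_names],
)
print(cm_toy_df)
```

In this toy matrix the first row says that, of the three truly fake samples, two were predicted Fake and one was predicted Real.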

In [None]:
print(classification_report(y_test, y_pred))
report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
report_df = pd.DataFrame(report).transpose().fillna(0)

plt.figure(figsize=(8, 6))
sns.heatmap(report_df.iloc[:-1, :-1], annot=True, cmap="Blues")
plt.title("Classification Report Heatmap")
plt.show()


## 6. Model Selection and Hyperparameter Tuning

To find the best performing model, we use GridSearchCV to test different classifiers and hyperparameters. This helps us identify the optimal model configuration for our dataset.

### 6.1 Setting Up and Running GridSearchCV

Here, we set up GridSearchCV with the pipeline and parameter grid defined in the cell below. GridSearchCV runs 5-fold cross-validation over every combination of classifier and hyperparameters in the grid, helping us find the best model configuration based on accuracy.

In [None]:
# Define the pipeline
pipe = Pipeline(
    [
        ("scaler", MinMaxScaler()),
        (
            "clf",
            MultinomialNB(),
        ),  # Placeholder, to be replaced by different classifiers
    ]
)

# Define the parameter grid
param_grid = [
    {
        "clf": [MultinomialNB()],
        "clf__alpha": [0.1, 0.5, 1.0],  # Example hyperparameters for MultinomialNB
    },
    {
        "clf": [RandomForestClassifier(random_state=2024)],  # fixed seed for reproducible scores
        "clf__n_estimators": [100, 200],
        "clf__max_depth": [None, 10, 20],
    },
    {"clf": [SVC()], "clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]},
]

# Set up GridSearchCV
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)

# Fit GridSearchCV to the data
grid_search.fit(X_train_2d, y_train)

# Display the best parameters and best score
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)


### 6.2 Analyzing Grid Search Results

This cell extracts the results from GridSearchCV and creates a DataFrame to display the mean and standard deviation of test scores for each parameter combination. The results are sorted to highlight the best-performing configurations.

In [None]:
results = grid_search.cv_results_

# Create a DataFrame with the relevant information
df_results = pd.DataFrame(
    {
        "mean_test_score": results["mean_test_score"],
        "std_test_score": results["std_test_score"],
        "params": results["params"],
    }
)

# Sort the DataFrame by the mean test score in descending order
df_results = df_results.sort_values(by="mean_test_score", ascending=False)

pd.set_option("display.max_colwidth", None)
os.makedirs("RESULTS", exist_ok=True)  # to_csv does not create missing directories
df_results.to_csv("RESULTS/grid_search_results.csv", index=False)
df_results


### 6.3 Extracting Best Results for Each Classifier

After this, we loop through the results of GridSearchCV to find and display the best accuracy score and corresponding parameters for each classifier type (MultinomialNB, RandomForestClassifier, SVC). This helps us compare the top-performing models side by side.

In [None]:
results = grid_search.cv_results_

# Initialize an empty list to store the best results
best_results = []

# Loop through the parameter grid to find the best results for each classifier
for classifier in ["MultinomialNB", "RandomForestClassifier", "SVC"]:
    # Filter results for the current classifier
    classifier_results = [
        (mean_score, params)
        for mean_score, params in zip(results["mean_test_score"], results["params"])
        if params["clf"].__class__.__name__ == classifier
    ]

    # Get the best score and corresponding parameters
    if classifier_results:
        best_score, best_params = max(classifier_results, key=lambda x: x[0])
        best_results.append(
            {
                "classifier": classifier,
                "best_accuracy_score": best_score,
                "best_params": best_params,
            }
        )

df_best_results = pd.DataFrame(best_results).sort_values(
    by="best_accuracy_score", ascending=False
)

pd.set_option("display.max_colwidth", None)
df_best_results
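
The same per-classifier summary can also be obtained with a pandas `groupby`. The sketch below uses a small hand-written stand-in for `grid_search.cv_results_` (the scores in `toy_results` are made up, not results from this notebook) and picks the top row for each classifier type.

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Hand-written stand-in for grid_search.cv_results_ (scores are made up)
toy_results = {
    "mean_test_score": [0.90, 0.92, 0.95, 0.97],
    "params": [
        {"clf": MultinomialNB(), "clf__alpha": 0.1},
        {"clf": MultinomialNB(), "clf__alpha": 1.0},
        {"clf": SVC(), "clf__C": 1},
        {"clf": SVC(), "clf__C": 10},
    ],
}

df = pd.DataFrame(toy_results)
df["classifier"] = df["params"].apply(lambda p: type(p["clf"]).__name__)

# idxmax returns the row label of the top score within each classifier group
best_rows = df.loc[df.groupby("classifier")["mean_test_score"].idxmax()]
best_rows = best_rows.sort_values("mean_test_score", ascending=False)
print(best_rows[["classifier", "mean_test_score"]])
```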


## 7. Final Model Evaluation

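After selecting the best model and hyperparameters, we retrain the model on the full training set and evaluate it on the held-out test set. The cell below uses an RBF-kernel SVC with `C=10`. Because the test set played no part in training or tuning, this gives a realistic estimate of how well the model will perform on unseen data.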

In [None]:
pipe_SVC = Pipeline([("scaler", MinMaxScaler()), ("clf", SVC(kernel="rbf", C=10))])

pipe_SVC.fit(X_train_2d, y_train)
y_pred_SVC = pipe_SVC.predict(X_test_2d)
acc_score_SVC = accuracy_score(y_test, y_pred_SVC)

print(f"The accuracy score of the SVM model is {acc_score_SVC * 100:.2f}%\n")

print(f"The classification report:\n {classification_report(y_test, y_pred_SVC)}")

# confusion_matrix orders classes by sorted label value: 0 = Fake, 1 = Real
class_names = ["Fake", "Real"]
cm = confusion_matrix(y_test, y_pred_SVC)
cm_df = pd.DataFrame(cm, index=class_names, columns=class_names)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_df, annot=True, fmt="d")
plt.title(
    "Confusion Matrix for the Predicted Classes\n"
    f"Classifier: SVC() ; Accuracy: {acc_score_SVC * 100:.2f}%",
    fontsize=14,
    pad=10,
)
# Rows of a scikit-learn confusion matrix are true labels, columns are predictions
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
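
Rebuilding the winning pipeline by hand works, but it means copying hyperparameters from the results table. With the default `refit=True`, `GridSearchCV` already refits the best configuration on the whole training set and exposes it as `best_estimator_`. The sketch below shows the pattern on synthetic data from `make_classification`; all names (`toy_pipe`, `toy_search`, `best_model`) and the tiny grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the news vectors
X, y = make_classification(n_samples=200, n_features=10, random_state=2024)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=2024, stratify=y
)

toy_pipe = Pipeline([("scaler", MinMaxScaler()), ("clf", SVC())])
toy_search = GridSearchCV(toy_pipe, {"clf__C": [1, 10]}, cv=3, scoring="accuracy")
toy_search.fit(X_tr, y_tr)

# best_estimator_ is the winning pipeline, already refitted on all of X_tr
best_model = toy_search.best_estimator_
test_accuracy = best_model.score(X_te, y_te)
print(toy_search.best_params_, f"test accuracy: {test_accuracy:.2f}")
```

In the notebook, the equivalent would be `grid_search.best_estimator_.predict(X_test_2d)`, assuming the grid search cell has been run in the same session.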