# News categorization

In this notebook we build a machine learning model for news categorisation.

Our dataset is the one we preprocessed earlier. It has two columns:

- **description_filtered**: the news description after cleaning, tokenisation, lemmatization and stopword removal
- **category_label**: a numeric value that encodes the category of the news item.

We converted the dataset from CSV to Parquet.

We are going to study two classification models, **Naive Bayes** and **Logistic Regression**, both using HashingTF followed by IDF (hashed TF-IDF) as the feature engineering method.
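To make the feature engineering step concrete, here is a minimal pure-Python sketch of hashed TF-IDF. It is an illustration under stated assumptions, not Spark's implementation: `zlib.crc32` stands in for the MurmurHash3 function that Spark's `HashingTF` uses, the helper names (`hashing_tf`, `idf_weights`, `tf_idf`) and the toy documents are hypothetical, and the IDF uses the smoothed form `log((m + 1) / (df + 1))` documented for Spark MLlib.

```python
import math
import zlib
from collections import Counter


def hashing_tf(tokens, num_features):
    """Map each token to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for token in tokens:
        # Stand-in hash (assumption): Spark's HashingTF uses MurmurHash3 instead.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


def idf_weights(tf_vectors):
    """Smoothed IDF per bucket: log((m + 1) / (df + 1)), with m documents."""
    m = len(tf_vectors)
    doc_freq = Counter(index for vec in tf_vectors for index in vec)
    return {index: math.log((m + 1) / (df + 1)) for index, df in doc_freq.items()}


def tf_idf(tf_vector, idf):
    """Rescale raw bucket counts by the IDF weight of each bucket."""
    return {index: tf * idf[index] for index, tf in tf_vector.items()}


# Toy documents (hypothetical), already tokenised like description_filtered
docs = [
    ["vatican", "palace", "say"],
    ["palace", "garden", "palace"],
    ["fortnite", "update"],
]
num_features_demo = 2 ** 10
tf_vectors = [hashing_tf(doc, num_features_demo) for doc in docs]
idf = idf_weights(tf_vectors)
features = [tf_idf(vec, idf) for vec in tf_vectors]
print(features)
```

Because tokens are hashed rather than looked up in a dictionary, no vocabulary has to be stored, at the cost of possible collisions between words that land in the same bucket.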

## I- Modules import

Let us import the modules we need.

In [5]:
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.feature import IDF, HashingTF
from pyspark.ml import Pipeline
from math import ceil, log2
from pyspark.ml.classification import LogisticRegression, NaiveBayes
from pyspark.sql.functions import col, explode, split
import numpy as np

## II- Spark context and session creation

Let us create a Spark session.

In [6]:
spark = (
    SparkSession.builder
    .master('local[*]')
    .appName("NewsCategorization")
    .config("spark.driver.memory", '320g')
    .config("spark.driver.memoryOverhead", "500g")
    .getOrCreate()
)
spark

## III- DataFrame preparation

### 1. Load the data

In [12]:
# Load the data (Parquet files carry their own schema, so CSV-style options such as header/inferSchema are not needed)

df = spark.read.parquet("hashingTF-IDF/nb_lr/input/news.parquet")

                                                                                

### 2. Partition and cache the dataframe

In [13]:
# Get the current number of RDD partitions
df.rdd.getNumPartitions()

7

In [14]:
# Repartitioning: Use 4 partitions per core
num_partitions=4*42
df= df.repartition(num_partitions).cache()

In [15]:
df.rdd.getNumPartitions()



168
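The cell above hard-codes 42, presumably the number of cores on the machine used here. As a hedged sketch, the same "4 partitions per core" rule of thumb can be derived from the core count instead; `partitions_for` is a hypothetical helper, and `os.cpu_count()` is only an approximation of the cores Spark actually uses in `local[*]` mode.

```python
import os

PARTITIONS_PER_CORE = 4  # same rule of thumb as in the notebook cell


def partitions_for(cores, per_core=PARTITIONS_PER_CORE):
    """Number of partitions for a given core count."""
    return per_core * cores


cores = os.cpu_count() or 1  # cpu_count() may return None
num_partitions_derived = partitions_for(cores)
print(cores, num_partitions_derived)
```

With 42 cores this gives the 168 partitions reported above.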

### 3. Preview the data

In [16]:
# Count the number of observations
df.count()

                                                                                

650028

In [17]:
# Show the dataframe
df.show()

+--------------+--------------------+
|category_label|description_filtered|
+--------------+--------------------+
|          10.0|vatican palace sa...|
|          10.0|u mho secular com...|
|           1.0|ciara kelly tampo...|
|           0.0|vepa kamesam 8217...|
|           6.0|reward help kid g...|
|           7.0|inside city light...|
|           4.0|hillary clinton t...|
|          11.0|new return new yo...|
|           6.0|parent ask google...|
|           5.0|fortnite entropy ...|
|           7.0|discovery channel...|
|           2.0|homegrown stem ac...|
|           5.0|fact checking hil...|
|           6.0|dad show incredib...|
|           9.0|savanna georgia w...|
|           1.0|six bristol city ...|
|           2.0|      red admiral 20|
|           6.0| proposal antarctica|
|          10.0|troubled anna mar...|
|           1.0|kanye west make s...|
+--------------+--------------------+
only showing top 20 rows



In [18]:
# Print the schema of the dataframe
df.printSchema()

root
 |-- category_label: double (nullable = true)
 |-- description_filtered: string (nullable = true)



### 4. Convert filtered descriptions to arrays

In [19]:
# Create a new DataFrame with description_filtered as arrays
df= df.withColumn('description_filtered', split(col('description_filtered'), ' '))
# Show the new DataFrame
df.show(truncate=False)

+--------------+---------------------------------------------------------------------------------------------------------+
|category_label|description_filtered                                                                                     |
+--------------+---------------------------------------------------------------------------------------------------------+
|10.0          |[vatican, palace, say, transgender, valet, de, chambre, become, godparent]                               |
|10.0          |[u, mho, secular, community, won, significant, legal, victory]                                           |
|1.0           |[ciara, kelly, tampon, ad, protest, hundred, box, received, already]                                     |
|0.0           |[vepa, kamesam, 8217s, term, rbi, extended, three, month]                                                |
|6.0           |[reward, help, kid, get, active, dont, necessarily, lead, better, health, study]                         |
|7.0           |

## IV- Feature Engineering


### 1. Explode the filtered descriptions to get the words

In [20]:
# explode() yields one row per word; with no alias on the column itself, Spark names it `col`
exploded_df = df.select(explode(df.description_filtered))
exploded_df.show()

+-----------+
|        col|
+-----------+
|    vatican|
|     palace|
|        say|
|transgender|
|      valet|
|         de|
|    chambre|
|     become|
|  godparent|
|          u|
|        mho|
|    secular|
|  community|
|        won|
|significant|
|      legal|
|    victory|
|      ciara|
|      kelly|
|     tampon|
+-----------+
only showing top 20 rows



### 2. Get the unique words in `description_filtered`

In [21]:
unique_words=exploded_df.distinct()

### 3. Cache and show the unique words dataframe

In [22]:
unique_words=unique_words.cache()
unique_words.show()



+-------------+
|          col|
+-------------+
|    godparent|
|        still|
|       travel|
|         hope|
|       voyage|
|intermarriage|
|infinitesimal|
|       online|
|     mushball|
| transference|
|       harder|
|          art|
|       outfit|
|        spoil|
|       biting|
|     cautious|
|      elevate|
|     incoming|
|       poetry|
|   hoverboard|
+-------------+
only showing top 20 rows



                                                                                

### 4. Get the vocabulary size

In [23]:
vocabulary_size=unique_words.count()
vocabulary_size

114967
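The split, explode, distinct and count steps used to obtain the vocabulary size can be mirrored in plain Python on a toy sample. This is only an illustration of what those DataFrame operations compute; the three rows below are made up.

```python
# Toy rows (hypothetical), in the same space-separated form as description_filtered
rows = [
    "vatican palace say transgender valet",
    "palace garden open say",
    "red admiral 20",
]

token_lists = [row.split(" ") for row in rows]                   # split(col, ' ')
all_tokens = [tok for tokens in token_lists for tok in tokens]   # explode
vocabulary = set(all_tokens)                                     # distinct
vocabulary_size_demo = len(vocabulary)                           # count

print(len(all_tokens), vocabulary_size_demo)  # prints: 12 10
```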

### 5. Unpersist the unique words dataframe (not needed anymore)

In [24]:
unique_words=unique_words.unpersist()

### 6. Get the smallest `n` such that $2^n$ is greater than or equal to `vocabulary_size`

In [25]:
n=ceil(log2(vocabulary_size))
n

17

### 7. Get the number of features for HashingTF

In [26]:
num_features=2**n
num_features

131072
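The sketch below repeats the power-of-two computation and adds a rough collision estimate. Note that choosing `num_features` at least as large as the vocabulary does not make hashing collision-free. The estimate assumes an idealised, uniformly distributed hash, which real hash functions only approximate; `hashing_dimension` and `expected_collisions` are hypothetical helper names.

```python
from math import ceil, log2


def hashing_dimension(vocabulary_size):
    """Smallest n with 2**n >= vocabulary_size, and the matching feature count."""
    n = ceil(log2(vocabulary_size))
    return n, 2 ** n


def expected_collisions(vocabulary_size, num_features):
    """Expected number of words that land in an already occupied bucket,
    assuming each word is hashed uniformly at random (idealised assumption)."""
    occupied = num_features * (1 - (1 - 1 / num_features) ** vocabulary_size)
    return vocabulary_size - occupied


n_demo, num_features_demo = hashing_dimension(114967)
print(n_demo, num_features_demo)  # prints: 17 131072
print(round(expected_collisions(114967, num_features_demo)))
```

Under that assumption a sizeable share of the vocabulary still shares a bucket with another word, so a larger `numFeatures` could be worth trying if memory allows.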

### 8. Define the HashingTF and IDF stages

In [27]:
# Define the HashingTF and IDF stages
hashingTF = HashingTF(inputCol="description_filtered", outputCol="rawFeatures", numFeatures=num_features)
idf = IDF(inputCol="rawFeatures", outputCol="features")

## V- Models set up, training and evaluation

### 1. Set up Naive Bayes and Logistic regression classifiers

In [28]:
# Define the classifiers

# Logistic regression classifier
lr = LogisticRegression(labelCol="category_label", featuresCol="features")

# Naive Bayes classifier
nb = NaiveBayes(labelCol="category_label", featuresCol="features")

### 2. Set up pipelines

We will set up one pipeline for Naive Bayes and one for Logistic Regression, each with the following stages:

- HashingTF
- IDF
- 3-fold cross-validation of the classifier, without grid search

In [29]:
# Define parameter grids
paramGrid_nb=paramGrid_lr=ParamGridBuilder().build()


# Cross-validation for Naive Bayes
cv_nb = CrossValidator(estimator=nb, estimatorParamMaps=paramGrid_nb,
                        evaluator=MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy"),
                        numFolds=3, parallelism=1)
# Cross-validation for Logistic Regression
cv_lr = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid_lr,
                        evaluator=MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy"),
                        numFolds=3, parallelism=1)


# Create pipelines
# Pipeline for Naive Bayes
pipeline_nb = Pipeline(stages=[hashingTF, idf, cv_nb])
# Pipeline for Logistic Regression
pipeline_lr = Pipeline(stages=[hashingTF, idf, cv_lr])
model_pipelines=pipeline_nb, pipeline_lr
model_pipelines

(Pipeline_ad342620aa8a, Pipeline_5bd3c633f9e6)
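As a reminder of what 3-fold cross-validation does, here is a simplified pure-Python sketch of the fold logic. It is not how Spark's `CrossValidator` assigns rows (Spark assigns folds at random); the strided assignment and the helper name `k_fold_indices` are assumptions made for readability.

```python
def k_fold_indices(num_rows, num_folds):
    """Yield (train_idx, validation_idx) pairs; each row is validated exactly once."""
    # Simplified, deterministic fold assignment: row i goes to fold i % num_folds
    folds = [list(range(i, num_rows, num_folds)) for i in range(num_folds)]
    for k in range(num_folds):
        validation = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, validation


splits = list(k_fold_indices(num_rows=9, num_folds=3))
for train_idx, validation_idx in splits:
    print("train:", train_idx, "validation:", validation_idx)
```

The model is fitted once per fold on the training indices and scored on the held-out indices; the reported metric is the average over the folds.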

### 3. Split the data

Let us split the data into train and test set: 80% for train and 20% for test

In [30]:
# Split data
(train_set, test_set) = df.randomSplit([0.80, 0.20], seed=0)
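Note that `randomSplit` samples row by row, so the 80/20 proportions are approximate rather than exact. The following pure-Python sketch imitates that behaviour under simplifying assumptions; `random_split` is a hypothetical helper and does not reproduce Spark's sampler or its seed semantics.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one part by drawing a uniform number per row."""
    rng = random.Random(seed)
    total = sum(weights)  # weights are normalised, as randomSplit does
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last part catches any floating-point remainder
            if u < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts


train_demo, test_demo = random_split(range(10_000), [0.80, 0.20], seed=0)
print(len(train_demo), len(test_demo))
```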

### 4. Create a function for model training

Let us create a function which takes as argument a model that it trains and then returns the trained model.

In [31]:
def train_model(model):
    return model.fit(train_set)

### 5. Define a function to evaluate the model

The function takes a fitted model as a parameter, evaluates it on the train and test splits, and returns the train and test performance. Accuracy is the metric used.

In [32]:
# Initialize the evaluator
evaluator = MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy")

# Function to evaluate model and get best parameters
def evaluate_model(fitted_model):

    print('Making predictions on the training set')

    train_predictions = fitted_model.transform(train_set)

    print('Making predictions on the test set')
    test_predictions = fitted_model.transform(test_set)

    print('Evaluating the model on training set')
    train_accuracy = evaluator.evaluate(train_predictions)

    print('Evaluating the model on test set')
    test_accuracy = evaluator.evaluate(test_predictions)

    print('Train accuracy:',train_accuracy)
    print('Test accuracy:',test_accuracy)
    return train_accuracy, test_accuracy
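For reference, the accuracy metric is simply the fraction of rows whose predicted label equals the true label. A minimal sketch, with a hypothetical helper name and made-up labels:

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


labels = [10.0, 1.0, 0.0, 6.0, 7.0]
predictions = [10.0, 1.0, 2.0, 6.0, 7.0]
demo_accuracy = accuracy(labels, predictions)
print(demo_accuracy)  # prints: 0.8
```

Accuracy treats every category equally per row, so it can hide poor performance on rare categories if the classes are imbalanced.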

### 6. Create a function that takes pipelines, trains and evaluates the models, and returns the results

In [33]:
def train_and_evaluate_models(model_pipelines, model_names=("Naive Bayes", "Logistic Regression")):
    """Train and evaluate each pipeline; return a dict of results keyed by position."""

    # Initialize the results dictionary
    results = {}

    # Loop over the pipelines and their display names together
    for idx, (model_pipeline, model_name) in enumerate(zip(model_pipelines, model_names)):
        print(f"Training {model_name} model")

        # Fit the model pipeline to the training set
        fitted_model = train_model(model_pipeline)

        print("Done")
        print(f"Evaluating {model_name} model")

        # Evaluate the fitted model
        train_accuracy, test_accuracy = evaluate_model(fitted_model)
        print("Done")

        # Store the results
        results[idx] = {
            'model_name': model_name,
            'fitted_model': fitted_model,
            "train_accuracy": train_accuracy,
            "test_accuracy": test_accuracy
        }

    # The result is always keyed by index, even for a single pipeline,
    # so callers select an entry with results[idx].
    return results

### 7. Call the function and interpret the results

#### a. Training and evaluation

In [34]:
results = train_and_evaluate_models(model_pipelines)
results

Training Naive Bayes model



Done
Evaluating Naive Bayes model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set


24/06/28 17:59:41 WARN DAGScheduler: Broadcasting large task binary with size 14.1 MiB
                                                                                

Evaluating the model on test set


24/06/28 17:59:42 WARN DAGScheduler: Broadcasting large task binary with size 14.1 MiB
                                                                                

Train accuracy: 0.8501056907598002
Test accuracy: 0.8040579487376551
Done
Training Logistic Regression model



Done
Evaluating Logistic Regression model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set


24/06/28 18:08:05 WARN DAGScheduler: Broadcasting large task binary with size 12.0 MiB
24/06/28 18:08:21 WARN DAGScheduler: Broadcasting large task binary with size 12.0 MiB


Evaluating the model on test set




Train accuracy: 0.9611550394008228
Test accuracy: 0.7972101602428621
Done


                                                                                

{0: {'model_name': 'Naive Bayes',
  'fitted_model': PipelineModel_0317d4b4065a,
  'train_accuracy': 0.8501056907598002,
  'test_accuracy': 0.8040579487376551},
 1: {'model_name': 'Logistic Regression',
  'fitted_model': PipelineModel_aee599021530,
  'train_accuracy': 0.9611550394008228,
  'test_accuracy': 0.7972101602428621}}

In [37]:
# Results of the fitted Naive Bayes classifier
results[0]

{'model_name': 'Naive Bayes',
 'fitted_model': PipelineModel_0317d4b4065a,
 'train_accuracy': 0.8501056907598002,
 'test_accuracy': 0.8040579487376551}

In [38]:
# Results of the fitted Logistic regression classifier
results[1]

{'model_name': 'Logistic Regression',
 'fitted_model': PipelineModel_aee599021530,
 'train_accuracy': 0.9611550394008228,
 'test_accuracy': 0.7972101602428621}

#### b. Results interpretation

We observe that:
- **Naive Bayes** reaches an accuracy of **85%** on the train set and **80%** on the test set.
- **Logistic Regression** reaches an accuracy of **96%** on the train set and **80%** on the test set.

We can conclude that:

- Both models perform reasonably well on the training and test sets. However, the Logistic Regression model shows signs of overfitting: its training accuracy is about 16 points higher than its test accuracy.
- The **Naive Bayes** model performs more consistently across the two sets (a gap of about 5 points), which indicates that it generalizes better than the untuned Logistic Regression model in this context.
- Although the Logistic Regression model has a higher training accuracy, its test accuracy (79.7%) is essentially the same as that of the Naive Bayes model (80.4%), and in fact slightly lower.

Given the overfitting observed with the Logistic Regression model, careful tuning of its parameters is essential to improve its generalization performance.
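The overfitting argument can be made explicit by computing the gap between train and test accuracy for each model. The numbers below are the accuracies reported in the results above; the dictionary layout is just for illustration.

```python
# Accuracies copied from the results printed by the notebook
reported = {
    "Naive Bayes": {"train": 0.8501056907598002, "test": 0.8040579487376551},
    "Logistic Regression": {"train": 0.9611550394008228, "test": 0.7972101602428621},
}

# Generalization gap: how much accuracy is lost when moving from train to test
gaps = {name: acc["train"] - acc["test"] for name, acc in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.3f}")
```

The gap is roughly 0.046 for Naive Bayes and 0.164 for Logistic Regression, which is what motivates regularising the latter.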

In the next section, we will tune the parameters of the **Logistic Regression** model to get the best parameters.

Before moving forward, let us save the trained Naive Bayes model.

In [39]:
fitted_nb_pipeline=results[0]['fitted_model']
fitted_nb_pipeline

PipelineModel_0317d4b4065a

In [40]:
# Saving the NB pipeline model
fitted_nb_pipeline.save('hashingTF-IDF/nb_lr/output/models/naive_bayes/pipeline')

24/06/28 18:10:47 WARN TaskSetManager: Stage 1337 contains a task of very large size (2097 KiB). The maximum recommended task size is 1000 KiB.
24/06/28 18:10:48 WARN TaskSetManager: Stage 1344 contains a task of very large size (12548 KiB). The maximum recommended task size is 1000 KiB.


In [41]:
# Saving the NB model without pipeline
fitted_nb_pipeline.stages[-1].bestModel.save('hashingTF-IDF/nb_lr/output/models/naive_bayes/simple')

24/06/28 18:11:10 WARN TaskSetManager: Stage 1348 contains a task of very large size (12548 KiB). The maximum recommended task size is 1000 KiB.


## VI- Logistic regression regularisation tuning

Let us use grid search with cross-validation to find the best regularisation parameter. We will try 50 values of the regularisation parameter, spaced logarithmically between $10^{-4}$ and $10^{4}$.

### 1. Pipeline creation

In [42]:
# Define the parameter grid for the Logistic Regression grid search

reg_values = np.logspace(-4, 4, num=50)

paramGrid_lr= ParamGridBuilder().addGrid(lr.regParam, reg_values).build()

# Create Cross-validation for Logistic Regression
cv_lr = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid_lr,
                        evaluator=MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy"),
                        numFolds=3, parallelism=3)


# Create pipeline for Logistic Regression
pipeline_lr = Pipeline(stages=[hashingTF, idf, cv_lr])

pipeline_lr

Pipeline_789a42acc152
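It helps to estimate how many model fits this grid search triggers. `CrossValidator` fits every parameter map on every fold and then refits the best map on the full training data. The helper below (a hypothetical name) just does that arithmetic; an empty `ParamGridBuilder` yields a single, empty parameter map.

```python
def cross_validation_fits(num_param_maps, num_folds):
    """Fits performed by k-fold cross-validation plus the final refit of the best map."""
    return num_param_maps * num_folds + 1


baseline_fits = cross_validation_fits(num_param_maps=1, num_folds=3)   # no grid search
tuned_fits = cross_validation_fits(num_param_maps=50, num_folds=3)     # 50 regParam values

print(baseline_fits, tuned_fits)  # prints: 4 151
```

With `parallelism=3`, up to three of these fits can run concurrently, but the tuning run should still be expected to take much longer than the baseline.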

### 2. Hyperparameters tuning

In [43]:
results=train_and_evaluate_models(model_pipelines=[pipeline_lr],model_names=["Logistic Regression"])
results

Training Logistic Regression model



Done
Evaluating Logistic Regression model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set


24/06/28 19:38:05 WARN DAGScheduler: Broadcasting large task binary with size 12.0 MiB
24/06/28 19:38:21 WARN DAGScheduler: Broadcasting large task binary with size 12.0 MiB


Evaluating the model on test set




Train accuracy: 0.9202328081813688
Test accuracy: 0.8479268339545787
Done


                                                                                

{0: {'model_name': 'Logistic Regression',
  'fitted_model': PipelineModel_7dd97d07ea3b,
  'train_accuracy': 0.9202328081813688,
  'test_accuracy': 0.8479268339545787}}

### 3. Interpreting the results

In [44]:
results=results[0]
results

{'model_name': 'Logistic Regression',
 'fitted_model': PipelineModel_7dd97d07ea3b,
 'train_accuracy': 0.9202328081813688,
 'test_accuracy': 0.8479268339545787}

From these results, we observe the following:

- The training set accuracy decreased from 96% to 92%, which is expected because regularisation limits how closely the model can fit the training data.
- The test set accuracy improved from 80% to 85%, showing better generalization than in the initial evaluation. The gap between train and test accuracy narrowed from about 16 points to about 7.

In [45]:
fitted_lr_pipeline=results['fitted_model']
fitted_lr_pipeline

PipelineModel_7dd97d07ea3b

### 4. Get the best parameters

In [46]:
# Get the best model
best_model = fitted_lr_pipeline.stages[-1].bestModel

# Print the best parameters
print("Best parameters for Logistic regression:")

for param, value in best_model.extractParamMap().items():
    print(f"  {param.name}: {value}")

Best parameters for Logistic regression:
  aggregationDepth: 2
  elasticNetParam: 0.0
  family: auto
  featuresCol: features
  fitIntercept: True
  labelCol: category_label
  maxBlockSizeInMB: 0.0
  maxIter: 100
  predictionCol: prediction
  probabilityCol: probability
  rawPredictionCol: rawPrediction
  regParam: 0.013257113655901081
  standardization: True
  threshold: 0.5
  tol: 1e-06
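The selected `regParam` can be located on the search grid to check that it is not at an edge of the range. The sketch below rebuilds the 50-point log-spaced grid in pure Python; `logspace` is a hypothetical stand-in for `np.logspace(-4, 4, num=50)`, and the best value is copied from the output above.

```python
import math


def logspace(start_exp, stop_exp, num):
    """Pure-Python stand-in for a base-10 log-spaced grid with both endpoints included."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10 ** (start_exp + i * step) for i in range(num)]


grid = logspace(-4, 4, 50)
best_reg_param = 0.013257113655901081  # regParam printed for the best model

# Find the grid point closest to the selected value (compared on a log scale)
best_index = min(
    range(len(grid)),
    key=lambda i: abs(math.log10(grid[i]) - math.log10(best_reg_param)),
)
print(best_index, grid[best_index])
```

The best value sits at index 13 of 50, well inside the grid, so the searched range brackets the optimum; a finer grid around this value could be a reasonable next step.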


### 5. Save the best model

Now let us save the best model.

In [47]:
# Saving the pipeline model
fitted_lr_pipeline.save('hashingTF-IDF/nb_lr/output/models/logistic_regression/pipeline')

24/06/28 19:47:36 WARN TaskSetManager: Stage 15748 contains a task of very large size (2097 KiB). The maximum recommended task size is 1000 KiB.
24/06/28 19:47:37 WARN TaskSetManager: Stage 15755 contains a task of very large size (10474 KiB). The maximum recommended task size is 1000 KiB.


In [48]:
# Saving the model without pipeline
fitted_lr_pipeline.stages[-1].bestModel.save('hashingTF-IDF/nb_lr/output/models/logistic_regression/simple')

24/06/28 19:47:57 WARN TaskSetManager: Stage 15759 contains a task of very large size (10474 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

## VII- Summary

In this notebook, we have studied two models for our news categorization task: **Naive Bayes** and **Logistic Regression**.

Our study reveals the following:
- The **Logistic Regression** model initially showed the best performance on the training set but exhibited signs of overfitting when evaluated on the test set.
- The **Naive Bayes** model demonstrated more consistent performance between the training and test sets, indicating better generalization in the initial evaluation.

To address the overfitting issue with the Logistic Regression model, we tuned its hyperparameters using grid search with cross-validation. This process led us to identify the best model, which we subsequently saved.

The tuned Logistic Regression model showed significant improvement:
- Training accuracy: **92%**
- Test accuracy: **85%**

In conclusion, after hyperparameter tuning, the **Logistic Regression** model emerged as the best-performing model for our news categorization task.


In [49]:
# Remove the cache
df.unpersist()

DataFrame[category_label: double, description_filtered: array<string>]

In [33]:
# Stop the spark session
spark.stop()
