# News categorization

In this notebook, we build a machine learning model for news categorization.

Our dataset is the one we preprocessed before, which has two columns:

- **description_filtered**: the news description after cleaning, tokenization, lemmatization and stopword removal
- **category_label**: a numeric value that encodes the news category.

We converted the dataset from CSV to Parquet.

We are going to study one classification model: **Random Forest**.

## I- Modules import

Let us import the modules we need.

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.feature import IDF, HashingTF
from pyspark.ml import Pipeline
from math import ceil, log2
from pyspark.ml.classification import RandomForestClassifier
from pyspark.sql.functions import col, explode, split


## II- Spark context and session creation

Let us create a Spark session.

In [2]:
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("NewsCategorization")
    .config("spark.driver.memory", "320g")
    .config("spark.driver.memoryOverhead", "1t")
    .getOrCreate()
)
spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/06/29 18:40:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## III- Dataframe preparing

### 1. Load the data

In [3]:
# Load the data. Parquet files carry their own schema, so the CSV-style
# header/inferSchema options are not needed.
df = spark.read.parquet("rf/input/news.parquet")

                                                                                

### 2. Partition and cache the dataframe

In [4]:
# Get the current number of RDD partitions
df.rdd.getNumPartitions()

7

In [5]:
# Repartitioning: use 4 partitions per core (42 cores assumed)
num_partitions = 4 * 42
df = df.repartition(num_partitions).cache()

In [6]:
df.rdd.getNumPartitions()



168
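The hard-coded `4*42` above only fits a machine with 42 cores. A minimal sketch of deriving the count from the core count instead is given here; `target_partitions` and `partitions_per_core` are hypothetical names, and `os.cpu_count()` reports logical cores, which may not match what Spark's `local[*]` actually uses in a containerised environment.

```python
import os


def target_partitions(cores, partitions_per_core=4):
    """Return the number of partitions for a given number of cores."""
    return partitions_per_core * max(1, cores)


# With the 42 cores assumed in the cell above
print(target_partitions(42))  # 168

# On the current machine (os.cpu_count() may return None)
print(target_partitions(os.cpu_count() or 1))
```

The result could then be passed to `df.repartition(...)` in place of the literal value.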

### 3. Preview the data

In [7]:
# Count the number of observations
df.count()

                                                                                

650028

In [8]:
# Show the dataframe
df.show()

+--------------+--------------------+
|category_label|description_filtered|
+--------------+--------------------+
|          10.0|vatican palace sa...|
|          10.0|u mho secular com...|
|           1.0|ciara kelly tampo...|
|           0.0|vepa kamesam 8217...|
|           6.0|reward help kid g...|
|           7.0|inside city light...|
|           4.0|hillary clinton t...|
|          11.0|new return new yo...|
|           6.0|parent ask google...|
|           5.0|fortnite entropy ...|
|           7.0|discovery channel...|
|           2.0|homegrown stem ac...|
|           5.0|fact checking hil...|
|           6.0|dad show incredib...|
|           9.0|savanna georgia w...|
|           1.0|six bristol city ...|
|           2.0|      red admiral 20|
|           6.0| proposal antarctica|
|          10.0|troubled anna mar...|
|           1.0|kanye west make s...|
+--------------+--------------------+
only showing top 20 rows



In [9]:
# Print the schema of the dataframe
df.printSchema()

root
 |-- category_label: double (nullable = true)
 |-- description_filtered: string (nullable = true)



### 4. Convert filtered descriptions to arrays

In [10]:
# Create a new DataFrame with description_filtered as arrays
df = df.withColumn('description_filtered', split(col('description_filtered'), ' '))
# Show the new DataFrame
df.show(truncate=False)

+--------------+---------------------------------------------------------------------------------------------------------+
|category_label|description_filtered                                                                                     |
+--------------+---------------------------------------------------------------------------------------------------------+
|10.0          |[vatican, palace, say, transgender, valet, de, chambre, become, godparent]                               |
|10.0          |[u, mho, secular, community, won, significant, legal, victory]                                           |
|1.0           |[ciara, kelly, tampon, ad, protest, hundred, box, received, already]                                     |
|0.0           |[vepa, kamesam, 8217s, term, rbi, extended, three, month]                                                |
|6.0           |[reward, help, kid, get, active, dont, necessarily, lead, better, health, study]                         |
|7.0           |

## IV- Feature Engineering


### 1. Explode the filtered descriptions to get the words

In [11]:
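# Explode each word array into one row per word.
# Note: .alias('words') is applied to the DataFrame, not to the column,
# so the output column keeps its default name `col`.
exploded_df = df.select(explode(df.description_filtered)).alias('words')
exploded_df.show()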

+-----------+
|        col|
+-----------+
|    vatican|
|     palace|
|        say|
|transgender|
|      valet|
|         de|
|    chambre|
|     become|
|  godparent|
|          u|
|        mho|
|    secular|
|  community|
|        won|
|significant|
|      legal|
|    victory|
|      ciara|
|      kelly|
|     tampon|
+-----------+
only showing top 20 rows



### 2. Get unique words in description_filtered

In [12]:
unique_words=exploded_df.distinct()

### 3. Cache and show the unique words dataframe

In [13]:
unique_words=unique_words.cache()
unique_words.show()



+-------------+
|          col|
+-------------+
|    godparent|
|        still|
|       travel|
|         hope|
|       voyage|
|intermarriage|
|infinitesimal|
|       online|
|     mushball|
| transference|
|       harder|
|          art|
|       outfit|
|        spoil|
|       biting|
|     cautious|
|      elevate|
|     incoming|
|       poetry|
|   hoverboard|
+-------------+
only showing top 20 rows



                                                                                

### 4. Get the vocabulary size

In [14]:
vocabulary_size=unique_words.count()
vocabulary_size

114967
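The explode / distinct / count steps above amount to flattening the word lists and counting the distinct words. A plain-Python sketch of the same idea on a tiny sample follows: the first two rows are taken from the preview above, and the third row is hypothetical, added only to show duplicates being removed.

```python
rows = [
    ["vatican", "palace", "say", "transgender", "valet", "de", "chambre", "become", "godparent"],
    ["u", "mho", "secular", "community", "won", "significant", "legal", "victory"],
    ["vatican", "legal", "victory"],  # hypothetical row with repeated words
]

# "Explode": one entry per word occurrence
exploded = [word for row in rows for word in row]

# "Distinct": keep each word once
unique_words = set(exploded)

# "Count": the vocabulary size
vocabulary_size = len(unique_words)
print(len(exploded), vocabulary_size)  # 20 17
```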

### 5. Unpersist the unique words dataframe (not needed anymore)

In [15]:
unique_words=unique_words.unpersist()

### 6. Get the smallest `n` such that $2^n$ is greater than or equal to `vocabulary_size`

In [16]:
n=ceil(log2(vocabulary_size))
n

17

### 7. Get the number of features for HashingTF

In [17]:
num_features=2**n
num_features

131072
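The arithmetic behind these two cells can be checked directly: $2^{16} = 65536 < 114967 \le 131072 = 2^{17}$. The sketch below also estimates how many hash buckets would actually be occupied, under the assumption (not verified in this notebook) that the hash spreads words uniformly over the buckets. Even with more buckets than words, some words are expected to share a bucket.

```python
from math import ceil, exp, log2

vocabulary_size = 114967  # distinct words counted above

n = ceil(log2(vocabulary_size))
num_features = 2 ** n
assert 2 ** (n - 1) < vocabulary_size <= num_features
print(n, num_features)  # 17 131072

# Ratio of distinct words to buckets
print(round(vocabulary_size / num_features, 2))  # 0.88

# Expected number of non-empty buckets, assuming uniform hashing:
# m * (1 - exp(-k / m)) for k words hashed into m buckets
expected_used = num_features * (1 - exp(-vocabulary_size / num_features))
print(expected_used < vocabulary_size)  # True: collisions are expected
```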

### 8. Define the HashingTF and IDF stages

In [18]:
# Define the HashingTF and IDF stages
hashingTF = HashingTF(inputCol="description_filtered", outputCol="rawFeatures", numFeatures=num_features)
idf = IDF(inputCol="rawFeatures", outputCol="features")
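To make the two stages concrete, here is a plain-Python sketch of what they compute. It is an illustration, not Spark's implementation: `crc32` stands in for the hash function Spark uses internally, so the bucket indices will differ, and the helper names `hashing_tf` and `idf_weights` are hypothetical. The IDF weight follows the smoothed formula given in the Spark ML documentation, $\log\frac{m+1}{df+1}$ for $m$ documents.

```python
from collections import Counter
from math import log
from zlib import crc32


def hashing_tf(tokens, num_features):
    """Map tokens to {bucket index: term count} (crc32 is a stand-in hash)."""
    return dict(Counter(crc32(t.encode("utf-8")) % num_features for t in tokens))


def idf_weights(tf_docs):
    """Smoothed IDF per bucket: log((m + 1) / (df + 1)) for m documents."""
    m = len(tf_docs)
    doc_freq = Counter(idx for doc in tf_docs for idx in doc)
    return {idx: log((m + 1) / (df + 1)) for idx, df in doc_freq.items()}


docs = [["legal", "victory", "legal"], ["legal", "community"]]
tf_docs = [hashing_tf(d, 2 ** 17) for d in docs]
idf = idf_weights(tf_docs)
tfidf_docs = [{i: tf * idf[i] for i, tf in doc.items()} for doc in tf_docs]

# "legal" occurs in every document, so its weight is log(3 / 3) = 0
legal_idx = crc32(b"legal") % 2 ** 17
print(tfidf_docs[0][legal_idx])  # 0.0
```

A word that appears in every document carries no discriminating information, which is why its TF-IDF weight vanishes.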

## V- Models set up, training and evaluation

### 1. Set up Random Forest classifier

In [19]:
# Define the Random Forest classifier
# (Spark ML defaults are kept: numTrees=20, maxDepth=5)
rf = RandomForestClassifier(labelCol="category_label", featuresCol="features", seed=0)


### 2. Set up pipelines

We will set up a pipeline with the following stages for Random Forest:

- HashingTF
- IDF
- 3-fold cross-validation without grid search

In [20]:
# Empty parameter grid: only the default hyperparameters are cross-validated
paramGrid_rf = ParamGridBuilder().build()

# Cross-validation for Random Forest
cv_rf = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid_rf,
                       evaluator=MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy"),
                       numFolds=3, parallelism=1)

# Pipeline for Random Forest: HashingTF -> IDF -> cross-validated classifier
pipeline_rf = Pipeline(stages=[hashingTF, idf, cv_rf])
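As a reminder of what 3-fold cross-validation does, the sketch below splits row indices into three folds and uses each fold once for validation while training on the other two. It is a simplification: the helper `k_fold_indices` is hypothetical and assigns rows round-robin, whereas Spark's `CrossValidator` assigns rows to folds randomly.

```python
def k_fold_indices(n_rows, k=3):
    """Yield (train_idx, valid_idx) pairs; row i is placed in fold i % k."""
    folds = [[i for i in range(n_rows) if i % k == f] for f in range(k)]
    for f in range(k):
        valid = folds[f]
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, valid


for train, valid in k_fold_indices(6, k=3):
    print("train:", train, "validation:", valid)
```

With an empty parameter grid there is a single candidate (the default hyperparameters), so cross-validation here only produces an averaged validation accuracy for that one setting.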



### 3. Split the data

Let us split the data into training and test sets: 80% for training and 20% for testing.

In [21]:
# Split data
(train_set, test_set) = df.randomSplit([0.80, 0.20], seed=0)
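Note that a random split of this kind does not guarantee exact 80/20 sizes: conceptually, each row is assigned independently according to the weights, so the split sizes are only approximately proportional. The plain-Python sketch below illustrates the idea; `random_split` is a hypothetical helper and does not reproduce Spark's sampling or its row assignment for a given seed.

```python
import random


def random_split(rows, weights, seed=0):
    """Assign each row independently to a split, with probability
    proportional to the split's weight."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for i, w in enumerate(weights):
            cumulative += w
            if draw < cumulative:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding at the upper edge
            splits[-1].append(row)
    return splits


train, test = random_split(range(1000), [0.80, 0.20])
print(len(train), len(test))  # close to 800 and 200, but not exactly
```

For the 650028 rows of this dataset, the expected sizes are roughly 520,000 training rows and 130,000 test rows.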

### 4. Create a function for model training

Let us create a function that fits a model (or pipeline) on the training set and returns the fitted model.

In [22]:
def train_model(model):
    return model.fit(train_set)

### 5. Define a function to evaluate the model

The function takes a fitted model as its parameter, evaluates it on the training and test sets, and returns the training and test performance. Accuracy is the metric used.

In [23]:
# Initialize the evaluator
evaluator = MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy")

# Function to evaluate a fitted model on the training and test sets
def evaluate_model(fitted_model):

    print('Making predictions on the training set')

    train_predictions = fitted_model.transform(train_set)

    print('Making predictions on the test set')
    test_predictions = fitted_model.transform(test_set)

    print('Evaluating the model on training set')
    train_accuracy = evaluator.evaluate(train_predictions)

    print('Evaluating the model on test set')
    test_accuracy = evaluator.evaluate(test_predictions)

    print('Train accuracy:',train_accuracy)
    print('Test accuracy:',test_accuracy)
    return train_accuracy, test_accuracy
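For reference, accuracy is simply the fraction of rows whose predicted label equals the true label. A minimal plain-Python version is sketched below; the helper `accuracy` and the four sample predictions are hypothetical.

```python
def accuracy(labels, predictions):
    """Return the fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one row is required")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Three of the four hypothetical predictions are correct
print(accuracy([10.0, 1.0, 0.0, 6.0], [10.0, 1.0, 6.0, 6.0]))  # 0.75
```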

### 6. Create a function that trains a pipeline, evaluates it and returns the results

In [24]:
def train_and_evaluate_model(model_pipeline=pipeline_rf,model_name="Random Forest"):
    
    print(f"Training {model_name} model")

    # Fit the model pipeline to the training set
    fitted_model = train_model(model_pipeline)

    print("Done")
    print(f"Evaluating {model_name} model")

    # Evaluate the fitted model
    train_accuracy, test_accuracy = evaluate_model(fitted_model)
    print("Done")
    # Store the results
    results= {
            'fitted_model': fitted_model,
            "train_accuracy": train_accuracy,
            "test_accuracy": test_accuracy
        }


    return results

### 7. Call the function and interpret the results

#### a. Training and evaluation

In [25]:
import time
start=time.time()
results = train_and_evaluate_model()
end=time.time()
print('Duration:',end-start,'seconds')
results

Training Random Forest model


24/06/29 18:42:05 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/29 18:42:07 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/29 18:42:07 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/29 18:42:07 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/29 18:42:08 WARN DAGScheduler: Broadcasting large task binary with size 3.4 MiB
24/06/29 18:42:12 WARN DAGScheduler: Broadcasting large task binary with size 4.0 MiB
24/06/29 18:42:42 WARN DAGScheduler: Broadcasting large task binary with size 4.0 MiB
24/06/29 18:42:50 WARN DAGScheduler: Broadcasting large task binary with size 4.1 MiB
24/06/29 18:42:56 WARN DAGScheduler: Broadcasting large task binary with size 4.1 MiB
24/06/29 18:43:01 WARN DAGScheduler: Broadcasting large task binary with size 4.2 MiB
24/06/29 18:43:06 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/29 18:43:07 WARN DAGScheduler: Broadcasting larg

Done
Evaluating Random Forest model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set


24/06/29 18:45:57 WARN DAGScheduler: Broadcasting large task binary with size 2.2 MiB
24/06/29 18:45:58 WARN DAGScheduler: Broadcasting large task binary with size 2.2 MiB


Evaluating the model on test set




Train accuracy: 0.2393535072214005
Test accuracy: 0.23688275756061944
Done
Duration: 244.44258093833923 seconds


                                                                                

{'fitted_model': PipelineModel_485696d14a3b,
 'train_accuracy': 0.2393535072214005,
 'test_accuracy': 0.23688275756061944}

#### b. Results interpretation

We remark that:
- The Random Forest model performs very poorly on both the training and test sets, with accuracies below 24%. The two accuracies are nearly identical, so the model is underfitting rather than overfitting: it is not capturing the patterns in the data effectively.

Given the very low performance of the Random Forest model, it is not worthwhile to spend significant effort on tuning its hyperparameters. Instead, it is better to try other feature engineering methods.
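One way to put an accuracy of about 24% into perspective is to compare it with a majority-class baseline, i.e. the accuracy obtained by always predicting the most frequent category. The class distribution is not shown in this notebook, so the labels below are hypothetical; on the real data the counts could be obtained with `df.groupBy("category_label").count()`.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical label sample: 5 rows of class 6.0, 3 of 1.0, 2 of 10.0
hypothetical_labels = [6.0] * 5 + [1.0] * 3 + [10.0] * 2
print(majority_baseline_accuracy(hypothetical_labels))  # 0.5
```

A model whose accuracy is close to this baseline has learned little beyond the class frequencies.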

For completeness, let us still define a grid search with cross-validation over the Random Forest hyperparameters: three values of the maximum tree depth (`maxDepth`) and three values of the number of trees (`numTrees`).

### 1. Pipeline creation

In [29]:

# Define parameter grids for Random Forest
paramGrid_rf = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [20, 25, 30]) \
    .addGrid(rf.numTrees, [50, 100, 150]) \
    .build()

# Create Cross-validation for Random Forest
cv_rf = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid_rf,
                       evaluator=MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy"),
                       numFolds=3, parallelism=1)

# Create pipeline for Random Forest
pipeline_rf = Pipeline(stages=[hashingTF,idf, cv_rf])

pipeline_rf


Pipeline_1aed6fc0d3d6
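The cost of this grid can be estimated before running it. The sketch below enumerates the combinations with `itertools.product` and counts the model fits: one per combination and fold, plus the final refit of the best combination on the whole training set that `CrossValidator` performs. The variable names are hypothetical.

```python
from itertools import product

max_depths = [20, 25, 30]
num_trees = [50, 100, 150]
num_folds = 3

# Every (maxDepth, numTrees) combination in the grid
grid = [{"maxDepth": d, "numTrees": t} for d, t in product(max_depths, num_trees)]

fits_during_cv = len(grid) * num_folds
total_fits = fits_during_cv + 1  # final refit of the best combination

print(len(grid), fits_during_cv, total_fits)  # 9 27 28
```

With `parallelism=1` these fits run one after another, and each uses deeper and larger forests than the default model trained earlier.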

## VII - Summary

### Random Forest Model Performance

- **Random Forest with HashingTF**:
  - Train Accuracy: 23.9%
  - Test Accuracy: 23.7%

### Analysis

The Random Forest model shows very poor performance on both the training and test sets, indicating it does not effectively capture data patterns. It's advisable to explore alternative feature engineering methods or more complex models for improvement.

In [26]:
# Remove the cache
df.unpersist()

DataFrame[category_label: double, description_filtered: array<string>]

In [27]:
# Stop the spark session
#spark.stop()