# News categorization

In this notebook we are going to build a Machine Learning model for news categorisation.

Our dataset is the one we preprocessed before, which has two columns:

- **description_filtered**, the filtered description obtained after cleaning, tokenisation, lemmatization and stopword removal on the description of the news
- **category_label**, a numeric value that encodes the category of the news item.

We are going to study a Decision Tree classifier with HashingTF-IDF features.

## I- Modules import

Let us import the modules we need.

In [29]:
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.feature import  IDF, HashingTF
from pyspark.ml import  Pipeline
from math import ceil,log2
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.sql.functions import col,explode,split
import numpy as np

## II- Spark context and session creation

Let us create a Spark session.

In [30]:
spark = (SparkSession.builder
    .master('local[*]')
    .appName("NewsCategorization")
    .config("spark.driver.memory", "320g")
    .config("spark.driver.memoryOverhead", "1t")
    .getOrCreate()
)
spark

## III- Dataframe preparing

### 1. Load the data

In [31]:
# Load the data

# Parquet files store their own schema, so the CSV-style options
# header/inferSchema are not needed here.
df = spark.read.parquet("dt/input/news.parquet")

### 2. Partition and cache the dataframe

In [32]:
# Get the current number of RDD partitions
df.rdd.getNumPartitions()

7

In [33]:
# Repartitioning: use 4 partitions per core
num_partitions = 4 * 42
df = df.repartition(num_partitions).cache()

In [34]:
df.rdd.getNumPartitions()

168
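
The factor `4*42` above presumably corresponds to a 42-core machine. As a minimal sketch, the same rule can be written as a small helper; `partitions_for` is a hypothetical name introduced here for illustration, and `os.cpu_count()` is only one possible way to detect the number of local cores.

```python
import os

def partitions_for(cores: int, per_core: int = 4) -> int:
    """Return a partition count of `per_core` partitions for each CPU core."""
    if cores < 1 or per_core < 1:
        raise ValueError("cores and per_core must be positive")
    return cores * per_core

# os.cpu_count() may return None, hence the fallback to 1.
detected_cores = os.cpu_count() or 1

print(partitions_for(42))              # the rule used in this notebook
print(partitions_for(detected_cores))  # the same rule on the current machine
```

Deriving the count from the detected cores keeps the repartitioning step valid if the notebook is run on a different machine.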

### 3. Preview the data

In [35]:
# Count the number of observations
df.count()

650028

In [36]:
# Show the dataframe
df.show()

+--------------+--------------------+
|category_label|description_filtered|
+--------------+--------------------+
|          10.0|vatican palace sa...|
|          10.0|u mho secular com...|
|           1.0|ciara kelly tampo...|
|           0.0|vepa kamesam 8217...|
|           6.0|reward help kid g...|
|           7.0|inside city light...|
|           4.0|hillary clinton t...|
|          11.0|new return new yo...|
|           6.0|parent ask google...|
|           5.0|fortnite entropy ...|
|           7.0|discovery channel...|
|           2.0|homegrown stem ac...|
|           5.0|fact checking hil...|
|           6.0|dad show incredib...|
|           9.0|savanna georgia w...|
|           1.0|six bristol city ...|
|           2.0|      red admiral 20|
|           6.0| proposal antarctica|
|          10.0|troubled anna mar...|
|           1.0|kanye west make s...|
+--------------+--------------------+
only showing top 20 rows



In [37]:
# Print the schema of the dataframe
df.printSchema()

root
 |-- category_label: double (nullable = true)
 |-- description_filtered: string (nullable = true)



### 4. Convert filtered descriptions to arrays

In [38]:
# Create a new DataFrame with description_filtered as arrays
df= df.withColumn('description_filtered', split(col('description_filtered'), ' '))
# Show the new DataFrame
df.show(truncate=False)

+--------------+---------------------------------------------------------------------------------------------------------+
|category_label|description_filtered                                                                                     |
+--------------+---------------------------------------------------------------------------------------------------------+
|10.0          |[vatican, palace, say, transgender, valet, de, chambre, become, godparent]                               |
|10.0          |[u, mho, secular, community, won, significant, legal, victory]                                           |
|1.0           |[ciara, kelly, tampon, ad, protest, hundred, box, received, already]                                     |
|0.0           |[vepa, kamesam, 8217s, term, rbi, extended, three, month]                                                |
|6.0           |[reward, help, kid, get, active, dont, necessarily, lead, better, health, study]                         |
|7.0           |
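
Splitting on a single space keeps empty tokens whenever a description contains consecutive or trailing spaces. The plain-Python sketch below illustrates this with `str.split(" ")`, which behaves similarly to the Spark `split` call above for this simple pattern. The helper names are hypothetical, and whether the preprocessed data actually contains such spaces is not known from this notebook.

```python
from typing import List

def split_on_space(text: str) -> List[str]:
    """Split on every single space, keeping empty tokens."""
    return text.split(" ")

def split_non_empty(text: str) -> List[str]:
    """Split on single spaces and drop the empty tokens."""
    return [token for token in text.split(" ") if token]

sample = "red  admiral 20 "  # hypothetical text with a double and a trailing space
print(split_on_space(sample))
print(split_non_empty(sample))
```

If empty tokens do occur, they would be hashed like any other word, so filtering them out before feature extraction may be worth considering.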

## IV- Feature Engineering


### 1. Explode the filtered descriptions to get the words

In [39]:
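# Note: .alias('words') is applied to the DataFrame, not to the exploded column, so the
# column keeps its default name `col`. Use explode(...).alias('words') to rename the column.
exploded_df=df.select(explode(df.description_filtered)).alias('words')
exploded_df.show()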

+-----------+
|        col|
+-----------+
|    vatican|
|     palace|
|        say|
|transgender|
|      valet|
|         de|
|    chambre|
|     become|
|  godparent|
|          u|
|        mho|
|    secular|
|  community|
|        won|
|significant|
|      legal|
|    victory|
|      ciara|
|      kelly|
|     tampon|
+-----------+
only showing top 20 rows



### 2. Get unique words in `description_filtered`

In [40]:
unique_words=exploded_df.distinct()

### 3. Cache and show the unique words dataframe

In [41]:
unique_words=unique_words.cache()
unique_words.show()



+-------------+
|          col|
+-------------+
|    godparent|
|        still|
|       travel|
|         hope|
|       voyage|
|intermarriage|
|infinitesimal|
|       online|
|     mushball|
| transference|
|       harder|
|          art|
|       outfit|
|        spoil|
|       biting|
|     cautious|
|      elevate|
|     incoming|
|       poetry|
|   hoverboard|
+-------------+
only showing top 20 rows

### 4. Get the vocabulary size

In [42]:
vocabulary_size=unique_words.count()
vocabulary_size

114967

### 5. Unpersist the unique words dataframe (no longer needed)

In [43]:
unique_words=unique_words.unpersist()

### 6. Get the smallest `n` such that $2^n$ is greater than or equal to `vocabulary_size`

In [44]:
n=ceil(log2(vocabulary_size))
n

17
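
As a sketch of an alternative, the same exponent can be obtained with integer arithmetic only, which avoids any floating-point rounding in `log2` for very large sizes. The function name `smallest_power_exponent` is hypothetical.

```python
from math import ceil, log2

def smallest_power_exponent(size: int) -> int:
    """Return the smallest n such that 2**n >= size, using integers only."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # (size - 1).bit_length() is the number of bits needed to represent size - 1,
    # which equals the smallest n with 2**n >= size.
    return (size - 1).bit_length()

vocabulary_size = 114967  # value reported by the notebook
n = smallest_power_exponent(vocabulary_size)
print(n, 2 ** n)
print(n == ceil(log2(vocabulary_size)))  # agrees with the notebook's formula
```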

### 7. Get the number of features for HashingTF

In [45]:
num_features=2**n
num_features

131072
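
Choosing more buckets than there are distinct words does not by itself prevent collisions, because the hash function assigns buckets without regard to the vocabulary. The sketch below estimates how many buckets would be occupied on average, under the assumption that the hash spreads words uniformly and independently over the buckets. This is an idealisation, not a measurement on this dataset, and `expected_occupied_buckets` is a hypothetical helper.

```python
def expected_occupied_buckets(vocabulary_size: int, num_buckets: int) -> float:
    """Expected number of non-empty buckets for a uniform, independent hash."""
    # A given bucket stays empty with probability (1 - 1/num_buckets)**vocabulary_size.
    return num_buckets * (1.0 - (1.0 - 1.0 / num_buckets) ** vocabulary_size)

vocabulary_size = 114967   # value reported by the notebook
num_features = 2 ** 17     # value chosen by the notebook

occupied = expected_occupied_buckets(vocabulary_size, num_features)
print(f"expected non-empty buckets: {occupied:.0f}")
print(f"expected words sharing an already used bucket: {vocabulary_size - occupied:.0f}")
```

Under this assumption a sizeable share of the vocabulary ends up sharing a bucket with another word, so a larger `numFeatures` (or an exact vocabulary-based vectoriser) may be worth trying.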

### 8. Define the HashingTF and IDF stages

In [46]:
# Define the HashingTF and IDF stages
hashingTF = HashingTF(inputCol="description_filtered", outputCol="rawFeatures", numFeatures=num_features)
idf = IDF(inputCol="rawFeatures", outputCol="features")
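
To make the two stages concrete, here is a minimal pure-Python sketch of hashed term frequencies followed by IDF weighting. It is an illustration under assumptions, not Spark's implementation: `zlib.crc32` stands in for the MurmurHash3 function used by Spark's `HashingTF`, so bucket indices will differ from Spark's, and the helper names are hypothetical. The IDF formula $\log\frac{m+1}{d_t+1}$, with $m$ the number of documents and $d_t$ the number of documents containing term $t$, is the one documented for Spark's `IDF`.

```python
import math
import zlib
from collections import Counter
from typing import Dict, List

def bucket_of(token: str, num_features: int) -> int:
    """Map a token to a bucket index (crc32 is a stand-in for Spark's hash)."""
    return zlib.crc32(token.encode("utf-8")) % num_features

def hashing_tf(tokens: List[str], num_features: int) -> Dict[int, float]:
    """Sparse term-frequency vector: bucket index -> number of occurrences."""
    counts = Counter(bucket_of(token, num_features) for token in tokens)
    return {index: float(count) for index, count in counts.items()}

def idf_weights(tf_vectors: List[Dict[int, float]]) -> Dict[int, float]:
    """IDF per bucket: log((m + 1) / (doc_count + 1))."""
    num_docs = len(tf_vectors)
    doc_counts = Counter(index for vector in tf_vectors for index in vector)
    return {index: math.log((num_docs + 1) / (count + 1))
            for index, count in doc_counts.items()}

num_features = 2 ** 17
docs = [["vatican", "palace", "say"], ["palace", "victory"]]  # toy documents

tf_vectors = [hashing_tf(doc, num_features) for doc in docs]
idf = idf_weights(tf_vectors)
tfidf_vectors = [{i: tf * idf[i] for i, tf in vector.items()} for vector in tf_vectors]
print(tfidf_vectors)
```

In this toy example the bucket of `palace` appears in every document, so its IDF weight is zero and it contributes nothing to the final features.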

## V- Models set up, training and evaluation

### 1. Set up the Decision Tree classifier

In [47]:
# Define the Decision Tree classifier
dt = DecisionTreeClassifier(labelCol="category_label", featuresCol="features", seed=0)


### 2. Set up pipelines

We will set up a pipeline with the following stages for the Decision Tree:

- HashingTF
- IDF
- 3-fold cross-validation without grid search

In [48]:
# Empty parameter grid: cross-validation only uses the default hyperparameters
paramGrid_dt = ParamGridBuilder().build()

# Cross-validation for the Decision Tree
cv_dt = CrossValidator(estimator=dt, estimatorParamMaps=paramGrid_dt,
                       evaluator=MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy"),
                       numFolds=3, parallelism=1)

# Pipeline for the Decision Tree
pipeline_dt = Pipeline(stages=[hashingTF, idf, cv_dt])
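
Even with an empty grid, cross-validation is not free: each parameter map is fitted once per fold, and the best map is then refitted on the full training data. The sketch below counts these fits in plain Python. `build_param_grid` and `count_fits` are hypothetical helpers, and the candidate values for `maxDepth` and `maxBins` are purely illustrative, not settings used in this notebook.

```python
from itertools import product
from typing import Dict, List

def build_param_grid(grid: Dict[str, list]) -> List[dict]:
    """Cartesian product of the candidate values, one dict per combination."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

def count_fits(num_folds: int, param_maps: List[dict]) -> int:
    """Fits per fold and parameter map, plus one final refit of the best map."""
    return num_folds * len(param_maps) + 1

empty_grid = build_param_grid({})  # a single empty map: defaults only
print(len(empty_grid), count_fits(3, empty_grid))

# Illustrative grid only; these values are assumptions.
tuned_grid = build_param_grid({"maxDepth": [5, 10, 20], "maxBins": [32, 64]})
print(len(tuned_grid), count_fits(3, tuned_grid))
```

With 3 folds, the empty grid therefore costs four fits in total, which helps explain a long training time even without any tuning.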

### 3. Split the data

Let us split the data into train and test sets: 80% for training and 20% for testing.

In [49]:
# Split data
(train_set, test_set) = df.randomSplit([0.80, 0.20], seed=0)
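
A weighted random split assigns each row to a part independently, so the resulting sizes are only approximately 80% and 20%. The following pure-Python sketch mimics that idea. It is not Spark's sampler, and `random_split` is a hypothetical helper whose output for a given seed will not match `randomSplit`.

```python
import random
from typing import List, Sequence

def random_split(rows: Sequence, weights: Sequence[float], seed: int) -> List[list]:
    """Assign each row to one part with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total  # cumulative, normalised weights
        bounds.append(running)
    parts: List[list] = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(bounds):
            if draw < upper:
                parts[index].append(row)
                break
        else:
            # Guard against rounding in the last cumulative bound.
            parts[-1].append(row)
    return parts

train, test = random_split(range(10_000), [0.80, 0.20], seed=0)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, while the exact sizes still vary slightly around the requested proportions.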

### 4. Create a function for model training

Let us create a function which takes a model (or pipeline) as argument, trains it on the training set and returns the trained model.

In [50]:
def train_model(model):
    return model.fit(train_set)

### 5. Define a function to evaluate the model

The function takes a fitted model as parameter, evaluates it on the train and test splits and then returns the train and test performance. Accuracy is the metric used.

In [51]:
# Initialize the evaluator
evaluator = MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy")

# Function to evaluate a fitted model on the train and test sets
def evaluate_model(fitted_model):

    print('Making predictions on the training set')

    train_predictions = fitted_model.transform(train_set)

    print('Making predictions on the test set')
    test_predictions = fitted_model.transform(test_set)

    print('Evaluating the model on training set')
    train_accuracy = evaluator.evaluate(train_predictions)

    print('Evaluating the model on test set')
    test_accuracy = evaluator.evaluate(test_predictions)

    print('Train accuracy:',train_accuracy)
    print('Test accuracy:',test_accuracy)
    return train_accuracy, test_accuracy
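
For reference, accuracy is simply the fraction of rows whose predicted label equals the true label. A minimal plain-Python sketch, with a hypothetical helper name and made-up labels:

```python
from typing import Sequence

def accuracy(labels: Sequence[float], predictions: Sequence[float]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one observation is required")
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)

# Made-up example: three of the four predictions are correct.
print(accuracy([10.0, 1.0, 0.0, 6.0], [10.0, 1.0, 6.0, 6.0]))
```

Because accuracy ignores which classes are confused, it can look acceptable on imbalanced data while some classes are never predicted; metrics such as `f1` are also available in `MulticlassClassificationEvaluator`.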

### 6. Create a function which trains the pipeline, evaluates it and returns the results

In [52]:
def train_and_evaluate_model(model_pipeline=pipeline_dt,model_name="Decision Tree"):
    
    print(f"Training {model_name} model")

    # Fit the model pipeline to the training set
    #fitted_model = model_pipeline.fit(train_set)
    fitted_model = train_model(model_pipeline)

    print("Done")
    print(f"Evaluating {model_name} model")

    # Evaluate the fitted model
    train_accuracy, test_accuracy = evaluate_model(fitted_model)
    print("Done")
    # Store the results
    results= {
            'fitted_model': fitted_model,
            "train_accuracy": train_accuracy,
            "test_accuracy": test_accuracy
        }


    return results

### 7. Call the function and interpret the results

#### a. Training and evaluation

In [53]:
import time
start=time.time()
results = train_and_evaluate_model()
end=time.time()
print('Duration:',end-start,'seconds')
results

Training Decision Tree model


24/06/29 18:43:49 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/29 18:43:50 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/29 18:43:51 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/29 18:43:51 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/29 18:43:51 WARN DAGScheduler: Broadcasting large task binary with size 3.4 MiB
24/06/29 18:43:55 WARN DAGScheduler: Broadcasting large task binary with size 3.9 MiB
24/06/29 18:45:50 WARN DAGScheduler: Broadcasting large task binary with size 3.9 MiB
24/06/29 18:47:19 WARN DAGScheduler: Broadcasting large task binary with size 3.9 MiB
24/06/29 18:48:56 WARN DAGScheduler: Broadcasting large task binary with size 3.9 MiB
24/06/29 18:50:36 WARN DAGScheduler: Broadcasting large task binary with size 3.9 MiB
24/06/29 18:52:14 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/29 18:52:14 WARN DAGScheduler: Broadcasting larg

Done
Evaluating Decision Tree model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set


24/06/29 19:21:32 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/29 19:21:33 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB


Evaluating the model on test set




Train accuracy: 0.17962043649610607
Test accuracy: 0.17977942589247972
Done
Duration: 2267.4518296718597 seconds

{'fitted_model': PipelineModel_92228022e1d2,
 'train_accuracy': 0.17962043649610607,
 'test_accuracy': 0.17977942589247972}

In [54]:
# Results of the fitted Decision Tree classifier
results

{'fitted_model': PipelineModel_92228022e1d2,
 'train_accuracy': 0.17962043649610607,
 'test_accuracy': 0.17977942589247972}

#### b. Results interpretation

We remark that:
- The Decision Tree model performs very poorly on both the training and test sets, with accuracies below 18%. This indicates that the model is not capturing the patterns in the data effectively.
- The training and test accuracies are almost identical (about 17.96% and 17.98%), which points to underfitting rather than overfitting.

Given the very low performance of the Decision Tree model, it is not worthwhile to spend significant effort on tuning its hyperparameters. Instead, it is better to try other feature engineering methods.
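
To judge how poor an accuracy of about 18% really is, it helps to compare it with trivial baselines such as always predicting the most frequent class. The sketch below computes such baselines in plain Python. The label list is made up for illustration, since the class distribution of the dataset is not reported in this notebook, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Sequence

def majority_class_accuracy(labels: Sequence[float]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("at least one label is required")
    _, top_count = Counter(labels).most_common(1)[0]
    return top_count / len(labels)

def uniform_guess_accuracy(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing when classes are balanced."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return 1.0 / num_classes

toy_labels = [6.0, 6.0, 1.0, 10.0]  # made-up labels, not taken from the dataset
print(majority_class_accuracy(toy_labels))
print(uniform_guess_accuracy(len(set(toy_labels))))
```

If the majority-class baseline on the real training labels is close to the reported accuracy, the tree is adding little beyond predicting the most frequent category.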

## VI- Summary

### Decision Tree Model Performance

- **Decision Tree with HashingTF-IDF**:
  - Train Accuracy: about 18% (0.1796)
  - Test Accuracy: about 18% (0.1798)

### Analysis

The Decision Tree model shows very poor performance on both the training and test sets, indicating it does not effectively capture data patterns. It's advisable to explore alternative feature engineering methods or more complex models for improvement.

In [55]:
# Remove the cache
df.unpersist()

DataFrame[category_label: double, description_filtered: array<string>]

In [56]:
# Stop the spark session
spark.stop()