# News categorization

In this notebook we are going to build a Machine Learning model for news categorization.

Our dataset is the one we preprocessed earlier. It has two columns:

- **description_filtered**: the news description after cleaning, tokenization, lemmatization and stopword removal
- **category_label**: a numeric value that encodes the category of the news item

We converted the dataset from CSV to Parquet.

We are going to study **Logistic regression** with word embeddings.

## I- Modules import

Let us import the modules we need.

In [12]:
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml import Pipeline
from pyspark.ml.feature import Word2Vec
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import col, split
import numpy as np

## II- Spark context and session creation

Let us create a Spark session.

In [2]:
spark = (
    SparkSession.builder
    .master('local[*]')
    .appName("NewsCategorization")
    .config("spark.driver.memory", "320g")
    .config("spark.driver.memoryOverhead", "500g")
    .getOrCreate()
)
spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/06/28 22:05:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/06/28 22:05:38 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## III- Dataframe preparing

### 1. Load the data

In [4]:
# Load the data (Parquet files embed their schema, so no header/inferSchema options are needed)
df = spark.read.parquet("models/word_embedding/logistic_regression/news.parquet")

                                                                                

### 2. Partition and cache the dataframe

In [5]:
# Get the current number of RDD partitions
df.rdd.getNumPartitions()

7

In [6]:
# Repartitioning: use 4 partitions per core (4 x 40)
num_partitions = 4 * 40
df = df.repartition(num_partitions).cache()

In [7]:
df.rdd.getNumPartitions()



160

### 3. Preview the data

In [8]:
# Count the number of observations
df.count()

                                                                                

650028

In [9]:
# Show the dataframe
df.show()

+--------------+--------------------+
|category_label|description_filtered|
+--------------+--------------------+
|           9.0|9 dead c one hund...|
|          11.0|devos endure vill...|
|           6.0|college hitch tip...|
|          11.0|jonathan kozol de...|
|           6.0|reward help kid g...|
|           4.0|donald trump say ...|
|           7.0|universal orlando...|
|           4.0|north carolina bo...|
|           4.0|bernie sander hit...|
|           6.0|decade way failin...|
|           8.0|workmanship war r...|
|           7.0|ten strange attra...|
|          11.0|bridge divide uph...|
|           8.0|kathleen hanna re...|
|           4.0|dennis hastert ca...|
|           1.0|worker dy minneso...|
|           0.0|staterun lender c...|
|           3.0|normality l repor...|
|           5.0|nintendo q1 profi...|
|           0.0|association provi...|
+--------------+--------------------+
only showing top 20 rows



In [10]:
# Print the schema of the dataframe
df.printSchema()

root
 |-- category_label: double (nullable = true)
 |-- description_filtered: string (nullable = true)



### 4. Convert filtered descriptions to arrays

In [11]:
# Replace description_filtered with an array of tokens (split on spaces)
df = df.withColumn('description_filtered', split(col('description_filtered'), ' '))
# Show the new DataFrame
df.show(truncate=False)

+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|category_label|description_filtered                                                                                                                                   |
+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|9.0           |[9, dead, c, one, hundred, thirty, military, cargo, carpenter, plane, clangor, georgia]                                                                |
|11.0          |[devos, endure, villain, non, student, part, 2]                                                                                                        |
|6.0           |[college, hitch, tip, parent]                                                                                                              

## IV- Feature Engineering


We are going to use word embeddings (Word2Vec) to represent our filtered descriptions.

We will try several vector sizes (100, 200, 300, ...) to find the one that works best.
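To make the representation concrete: a Word2Vec model maps each word to a vector, and Spark's `Word2Vec` turns a whole description into a single vector by averaging the vectors of its words. The sketch below reproduces that averaging in plain Python. The three toy vectors and the helper name `document_vector` are illustrative assumptions, not values learned from our dataset, and the handling of out-of-vocabulary tokens (they add nothing to the sum but still count in the denominator) reflects our understanding of Spark's behaviour rather than a guarantee.

```python
# Toy word vectors (hypothetical values, vector size 3 for readability)
toy_vectors = {
    "college": [0.2, 0.4, -0.1],
    "tip": [0.0, 0.1, 0.3],
    "parent": [0.4, -0.2, 0.1],
}


def document_vector(tokens, vectors, size):
    """Average the word vectors of a tokenized document.

    Tokens missing from `vectors` contribute nothing to the sum but are
    still counted in the denominator (assumption, see the text above).
    """
    if not tokens:
        return [0.0] * size
    total = [0.0] * size
    for token in tokens:
        for i, value in enumerate(vectors.get(token, [0.0] * size)):
            total[i] += value
    return [component / len(tokens) for component in total]


# "hitch" is out of vocabulary in this toy example
doc = ["college", "hitch", "tip", "parent"]
doc_vec = document_vector(doc, toy_vectors, size=3)
print(doc_vec)
```

Every description ends up with a vector of the same length (the `vectorSize`), whatever its number of words, which is what the classifier needs as input.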

## V- Models set up, training and evaluation

### 1. Set up Logistic regression classifier

In [14]:
# Define Logistic Regression classifier
lr = LogisticRegression(labelCol="category_label", featuresCol="word_embeddings")

### 2. Cross validation

3-fold cross-validation without grid search.

In [20]:
# Empty parameter grid: a single parameter map with the default settings
# (add addGrid(...) calls here to search over hyperparameters)
paramGrid_lr = ParamGridBuilder().build()

# Cross-validation for Logistic Regression
cv_lr = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid_lr,
                       evaluator=MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy"),
                       numFolds=3, parallelism=1)
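As a reminder of what k-fold cross-validation does, here is a minimal plain-Python sketch: the rows are shuffled and cut into k folds, and each fold serves once as the validation set while the other folds are used for training. The helper `k_fold_indices` is a hypothetical name used for illustration; it is not how Spark implements `CrossValidator` internally.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


for fold_number, (training, validation) in enumerate(k_fold_indices(9, k=3), start=1):
    print(fold_number, sorted(training), sorted(validation))
```

With 3 folds, the model is fitted three times, each time on two thirds of the training data, and the reported metric is the average over the three validation folds.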


### 3. Split the data

Let us split the data into training and test sets: 80% for training and 20% for testing.

In [16]:
# Split data
(train_set, test_set) = df.randomSplit([0.80, 0.20], seed=0)
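Note that a weighted random split assigns each row to a split at random, so the 80/20 proportions are only approximate. The sketch below illustrates the idea in plain Python; `random_split` is a hypothetical helper written for this illustration and is not Spark's implementation.

```python
import random


def random_split(rows, weights, seed=0):
    """Assign each row to one split at random, proportionally to `weights`."""
    total = sum(weights)
    # Cumulative upper bounds of each split on the [0, 1) interval
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        # First split whose upper bound exceeds the draw (last split as fallback)
        target = next((i for i, upper in enumerate(bounds) if draw < upper),
                      len(bounds) - 1)
        splits[target].append(row)
    return splits


train_rows, test_rows = random_split(range(1000), [0.80, 0.20])
print(len(train_rows), len(test_rows))  # close to 800 and 200, but not exactly
```

Fixing the seed, as done with `seed=0` in the cell above, makes the split reproducible from one run to the next.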

### 4. Create a function for model training

Let us create a function which takes a model (or a pipeline) as argument, fits it on the training set and returns the fitted model.

In [17]:
def train_model(model):
    return model.fit(train_set)

### 5. Define the evaluator and a function to evaluate the model

The function takes a fitted model as parameter, evaluates it on the train and test splits and returns the train and test performance. Accuracy is the metric used.

In [38]:
evaluator=MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy")

In [39]:

# Function to evaluate a fitted model on the train and test sets
def evaluate_model(fitted_model):

    print('Making predictions on the training set')

    train_predictions = fitted_model.transform(train_set)

    print('Making predictions on the test set')
    test_predictions = fitted_model.transform(test_set)

    print('Evaluating the model on training set')
    train_accuracy = evaluator.evaluate(train_predictions)

    print('Evaluating the model on test set')
    test_accuracy = evaluator.evaluate(test_predictions)

    print('Train accuracy:',train_accuracy)
    print('Test accuracy:',test_accuracy)
    return train_accuracy, test_accuracy
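For reference, accuracy is simply the share of rows whose predicted label equals the true label. A minimal plain-Python sketch follows; the function name `accuracy` and the five example labels are made up for illustration.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)


# Hypothetical example: 3 of the 5 predictions are correct
labels = [9.0, 11.0, 6.0, 11.0, 6.0]
predictions = [9.0, 11.0, 4.0, 11.0, 7.0]
print(accuracy(labels, predictions))  # 0.6
```

Accuracy treats all categories alike, so on an imbalanced dataset it is mostly driven by the largest categories.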

In [46]:
# Create the pipeline

# Define Word2Vec
word2Vec = Word2Vec(vectorSize=100, minCount=5, inputCol="description_filtered", outputCol="word_embeddings")

# Pipeline for Logistic Regression
pipeline_lr = Pipeline(stages=[word2Vec, cv_lr])
pipeline_lr

Pipeline_dbcb36ef3008

### 6. Create a function that trains a pipeline, evaluates it and returns the results

In [47]:
def train_and_evaluate_model(model_pipeline, model_name="Logistic Regression"):

    print(f"Training {model_name} model")

    # Fit the model pipeline to the training set
    fitted_model = train_model(model_pipeline)

    print("Done")
    print(f"Evaluating {model_name} model")

    # Evaluate the fitted model
    train_accuracy, test_accuracy = evaluate_model(fitted_model)
    print("Done")
    # Store the results
    results = {
        'model_name': model_name,
        'fitted_model': fitted_model,
        'train_accuracy': train_accuracy,
        'test_accuracy': test_accuracy,
    }

    return results

### 7. Call the function and interpret the results

#### a. Training and evaluation

In [48]:
results = train_and_evaluate_model(pipeline_lr)
results

Training Logistic Regression model



Done
Evaluating Logistic Regression model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set


                                                                                

Evaluating the model on test set
Train accuracy: 0.67924459331285
Test accuracy: 0.6786917968600054
Done


{'model_name': 'Logistic Regression',
 'fitted_model': PipelineModel_14c3f5949044,
 'train_accuracy': 0.67924459331285,
 'test_accuracy': 0.6786917968600054}

In [53]:
# Create the pipeline with vector size 200

# Define Word2Vec
word2Vec = Word2Vec(vectorSize=200, minCount=5, inputCol="description_filtered", outputCol="word_embeddings")

# Pipeline for Logistic Regression
pipeline_lr = Pipeline(stages=[word2Vec, cv_lr])
results = train_and_evaluate_model(pipeline_lr)
results

Training Logistic Regression model



Done
Evaluating Logistic Regression model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set


                                                                                

Evaluating the model on test set
Train accuracy: 0.694435840261037
Test accuracy: 0.6936854631300142
Done


{'model_name': 'Logistic Regression',
 'fitted_model': PipelineModel_11f8cf869f6a,
 'train_accuracy': 0.694435840261037,
 'test_accuracy': 0.6936854631300142}

In [54]:
# Create the pipeline with vector size 300

# Define Word2Vec
word2Vec = Word2Vec(vectorSize=300, minCount=5, inputCol="description_filtered", outputCol="word_embeddings")

# Pipeline for Logistic Regression
pipeline_lr = Pipeline(stages=[word2Vec, cv_lr])
results = train_and_evaluate_model(pipeline_lr)
results

Training Logistic Regression model



Done
Evaluating Logistic Regression model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set


                                                                                

Evaluating the model on test set
Train accuracy: 0.7014773757005462
Test accuracy: 0.7003262830601512
Done


{'model_name': 'Logistic Regression',
 'fitted_model': PipelineModel_f40dc6bf3e23,
 'train_accuracy': 0.7014773757005462,
 'test_accuracy': 0.7003262830601512}

In [55]:
# Create the pipeline with vector size 400

# Define Word2Vec
word2Vec = Word2Vec(vectorSize=400, minCount=5, inputCol="description_filtered", outputCol="word_embeddings")

# Pipeline for Logistic Regression
pipeline_lr = Pipeline(stages=[word2Vec, cv_lr])
results = train_and_evaluate_model(pipeline_lr)
results

Training Logistic Regression model



Done
Evaluating Logistic Regression model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set


                                                                                

Evaluating the model on test set
Train accuracy: 0.7032916292304525
Test accuracy: 0.7014701930827991
Done


{'model_name': 'Logistic Regression',
 'fitted_model': PipelineModel_416b8c8d6d14,
 'train_accuracy': 0.7032916292304525,
 'test_accuracy': 0.7014701930827991}

#### b. Results interpretation

Across vector sizes 100, 200, 300 and 400, test accuracy rises from 67.9% to 69.4%, 70.0% and 70.1%. The gain from 300 to 400 is only about 0.1 percentage point, so we take 300 as the best trade-off for the vector size.

The accuracy for a vector size of 300 is about 70%. We will keep that value in what follows and tune the hyperparameters to try to improve the model.
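The diminishing returns can be checked directly from the test accuracies printed above. The short sketch below computes the gain in percentage points between consecutive vector sizes; the dictionary simply copies the four test accuracies reported by the previous cells, and the variable names are ours.

```python
# Test accuracies reported above, keyed by Word2Vec vector size
test_accuracy = {
    100: 0.6786917968600054,
    200: 0.6936854631300142,
    300: 0.7003262830601512,
    400: 0.7014701930827991,
}

sizes = sorted(test_accuracy)
gains = []
for previous, current in zip(sizes, sizes[1:]):
    # Gain expressed in percentage points
    gain = (test_accuracy[current] - test_accuracy[previous]) * 100
    gains.append(gain)
    print(f"{previous} -> {current}: {gain:+.2f} points")
```

Each extra 100 dimensions buys less than the previous step, while the cost of training Word2Vec and the classifier keeps growing with the vector size.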

## VI- Logistic regression hyperparameter tuning

Let us use grid search with cross-validation to find the best regularisation parameter. We will try 50 values of the regularisation parameter, spaced on a log scale from 1e-4 to 1e4.
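A log-scale grid spaces the candidate values evenly in the exponent rather than in the value itself, which is the usual choice when the useful range spans several orders of magnitude. The sketch below builds such a grid in plain Python, mirroring what `np.logspace` computes; the helper name `log_spaced` is hypothetical, and only 5 values are generated here to keep the output readable.

```python
def log_spaced(start_exp, stop_exp, num):
    """Return `num` values from 10**start_exp to 10**stop_exp,
    evenly spaced in the exponent (both endpoints included)."""
    if num < 2:
        raise ValueError("num must be at least 2")
    step = (stop_exp - start_exp) / (num - 1)
    return [10.0 ** (start_exp + i * step) for i in range(num)]


grid = log_spaced(-4, 4, num=5)
print(grid)  # values near 1e-4, 1e-2, 1, 1e2 and 1e4
```

Consecutive values differ by a constant factor (100 in this small example), so small and large regularisation strengths are explored equally densely.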

### 1. Pipeline creation

In [56]:
# Create the pipeline

# Define Word2Vec
word2Vec = Word2Vec(vectorSize=300, minCount=5, inputCol="description_filtered", outputCol="word_embeddings")

# Define the parameter grid for Logistic Regression: regParam on a log scale
reg_values = np.logspace(-4, 4, num=50)

paramGrid_lr = ParamGridBuilder() \
    .addGrid(lr.regParam, reg_values) \
    .build()

# Create Cross-validation for Logistic Regression
cv_lr = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid_lr,
                       evaluator=evaluator,
                       numFolds=3, parallelism=3)

# Create pipeline for Logistic Regression
pipeline_lr = Pipeline(stages=[word2Vec, cv_lr])

pipeline_lr


Pipeline_8b37f1944bc9

### 2. Hyperparameters tuning

In [57]:
results = train_and_evaluate_model(pipeline_lr)
results

Training Logistic Regression model



Done
Evaluating Logistic Regression model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set


                                                                                

Evaluating the model on test set
Train accuracy: 0.7014773757005462
Test accuracy: 0.7003262830601512
Done


{'model_name': 'Logistic Regression',
 'fitted_model': PipelineModel_2706c8daf1d0,
 'train_accuracy': 0.7014773757005462,
 'test_accuracy': 0.7003262830601512}

### 3. Interpreting the results

In [58]:
results

{'model_name': 'Logistic Regression',
 'fitted_model': PipelineModel_2706c8daf1d0,
 'train_accuracy': 0.7014773757005462,
 'test_accuracy': 0.7003262830601512}

Despite tuning the regularisation parameter, the accuracy of the Logistic Regression model stays at about 70% on both the training and test sets, identical to the scores obtained with the default settings at vector size 300. The small gap between train and test accuracy indicates good generalization, but it also suggests that this approach has reached its performance limit. Further improvements may require additional feature engineering or more complex models.

We will save our model.

### 4. Get the best parameters

In [59]:
fitted_model=results['fitted_model']

# Get the best model
best_model = fitted_model.stages[-1].bestModel

# Print the best parameters
print("Best parameters for Logistic regression:")

for param, value in best_model.extractParamMap().items():
    print(f"  {param.name}: {value}")

Best parameters for Logistic regression:
  aggregationDepth: 2
  elasticNetParam: 0.0
  family: auto
  featuresCol: word_embeddings
  fitIntercept: True
  labelCol: category_label
  maxBlockSizeInMB: 0.0
  maxIter: 100
  predictionCol: prediction
  probabilityCol: probability
  rawPredictionCol: rawPrediction
  regParam: 9.999999999999999e-05
  standardization: True
  threshold: 0.5
  tol: 1e-06


### 5. Save the best model

Now let us save the model

In [60]:
# Save the whole fitted pipeline (Word2Vec + cross-validated Logistic Regression)
fitted_model.save("pipeline")

24/06/29 02:27:59 WARN TaskSetManager: Stage 142942 contains a task of very large size (1091 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

In [61]:
# Save only the best Logistic Regression model
best_model.save("simple")

## VII- Summary

In this notebook we studied **Logistic regression** with word embeddings.

Our study shows that the model, even after tuning, does not exceed about 70% accuracy.

In [62]:
# Remove the cache
df.unpersist()

DataFrame[category_label: double, description_filtered: array<string>]

In [63]:
# Stop the spark session
spark.stop()