# News categorization

In this notebook we are going to build a Machine Learning model for news categorisation.

Our dataset is the one we preprocessed earlier. It has two columns:

- **description_filtered**: the filtered description obtained after cleaning, tokenisation, lemmatization and stopword removal on the news description.
- **category_label**: a numeric value that represents the category of the news item.

We converted the dataset from CSV to Parquet.

We are going to study two classification models: **Naive Bayes** and **Logistic regression**.

## I- Modules import

Let us import the modules we need.

In [98]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.feature import  IDF, HashingTF
from pyspark.ml import  Pipeline
from math import ceil,log2
from pyspark.ml.classification import LogisticRegression,NaiveBayes,LogisticRegressionModel
from pyspark.sql.functions import col,explode,split
import numpy as np

## II- Spark context and session creation

Let us create a Spark session.

In [99]:
spark = (SparkSession.builder
    .appName("NewsCategorization")
    .getOrCreate()
        )
spark

## III- Dataframe preparation

### 1. Load the data

In [8]:
# Load the data (Parquet files store their own schema, so the CSV options header/inferSchema are not needed)
df = spark.read.parquet("input/news.parquet")

### 2. Partition and cache the dataframe

In [9]:
df.rdd.getNumPartitions()

9

In [10]:
# Repartitioning: use 5 partitions per core (40 cores)
num_partitions = 5 * 40
df = df.repartition(num_partitions).cache()

In [11]:
df.rdd.getNumPartitions()



200

### 3. Preview the data

In [12]:
# Count the number of observations
df.count()

                                                                                

1716608

In [13]:
# Show the dataframe
df.show()

+--------------+--------------------+
|category_label|description_filtered|
+--------------+--------------------+
|           2.0|creating reserve ...|
|           0.0|bryan fite missou...|
|           1.0|seth meyers roy m...|
|           1.0|zac efron doesnt ...|
|           1.0|bigg bos 14 promo...|
|           0.0|government ha app...|
|           2.0|resident union hi...|
|           2.0|australian conser...|
|           0.0|govt ha plan inje...|
|           1.0|rap authentic eth...|
|           2.0|roger putnam say ...|
|           0.0|clear key member ...|
|           0.0|selling back publ...|
|           2.0|equitable elephan...|
|           2.0|appeal court reje...|
|           0.0|alex wade lucrati...|
|           1.0|kim kardashian tr...|
|           1.0|jimmy fallon rip ...|
|           1.0|freshly bring int...|
|           1.0|report arsenal ga...|
+--------------+--------------------+
only showing top 20 rows



In [14]:
# Print the schema of the dataframe
df.printSchema()

root
 |-- category_label: double (nullable = true)
 |-- description_filtered: string (nullable = true)



### 4. Convert filtered descriptions to arrays

In [15]:
# Create a new DataFrame with description_filtered as arrays
df= df.withColumn('description_filtered', split(col('description_filtered'), ' '))
# Show the new DataFrame
df.show(truncate=False)

+--------------+-------------------------------------------------------------------------------------------------------------------------------------+
|category_label|description_filtered                                                                                                                 |
+--------------+-------------------------------------------------------------------------------------------------------------------------------------+
|2.0           |[creating, reserve, time, drought]                                                                                                   |
|0.0           |[bryan, fite, missouri, man, find, centuryold, whiskey, bottle, attic, video]                                                        |
|1.0           |[seth, meyers, roy, moore, unfit, office, colicky, manbaby, trump]                                                                   |
|1.0           |[zac, efron, doesnt, actually, hate, high, school, musical]                   

## IV- Feature Engineering


### 1. Explode the filtered descriptions to get the words

In [16]:
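# Note: .alias('words') is applied to the DataFrame here, not to the exploded column,
# so the column keeps its default name `col`. Use explode(...).alias('words') to rename it.
exploded_df=df.select(explode(df.description_filtered)).alias('words')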
exploded_df.show()

+----------+
|       col|
+----------+
|  creating|
|   reserve|
|      time|
|   drought|
|     bryan|
|      fite|
|  missouri|
|       man|
|      find|
|centuryold|
|   whiskey|
|    bottle|
|     attic|
|     video|
|      seth|
|    meyers|
|       roy|
|     moore|
|     unfit|
|    office|
+----------+
only showing top 20 rows



In [17]:
#df=df.unpersist()

### 2. Get unique words in the filtered_description

In [18]:
unique_words=exploded_df.distinct()

### 3. Cache and show the unique words dataframe

In [19]:
unique_words=unique_words.cache()
unique_words.show()



+------------+
|         col|
+------------+
|  stateowned|
|      cowell|
|      teigen|
|   connected|
|    bottomed|
|     moreand|
|transference|
|      online|
|       still|
|    vladimir|
|      filing|
| transaction|
|    tripping|
|       trail|
|        earl|
|    chattman|
|   diabolico|
|       1970s|
|      gloria|
|   viewpoint|
+------------+
only showing top 20 rows



                                                                                

### 4. Get the vocabulary size

In [20]:
vocabulary_size=unique_words.count()
vocabulary_size

128622
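
The explode, distinct and count steps above amount to building the set of all tokens and measuring its size. As a rough plain-Python illustration of the same idea (the first three token lists are copied from the preview rows; the fourth is a hypothetical extra document that repeats two words):

```python
# Toy corpus: three token lists from the preview above plus one hypothetical document.
documents = [
    ["creating", "reserve", "time", "drought"],
    ["seth", "meyers", "roy", "moore", "unfit", "office", "colicky", "manbaby", "trump"],
    ["zac", "efron", "doesnt", "actually", "hate", "high", "school", "musical"],
    ["time", "office"],  # hypothetical, only here to show deduplication
]

# "explode": one entry per token occurrence
all_tokens = [token for tokens in documents for token in tokens]

# "distinct": keep each word once
vocabulary = set(all_tokens)

# "count": size of the vocabulary
toy_vocabulary_size = len(vocabulary)
print(len(all_tokens), toy_vocabulary_size)
```

On the full dataset the same operations run distributed in Spark, which is why the distinct words are cached before being counted.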

### 5. Unpersist the unique words dataframe (not needed anymore)

In [21]:
unique_words=unique_words.unpersist()

### 6. Get the smallest `n` such that $2^n$ is greater than or equal to `vocabulary_size`

In [23]:
n=ceil(log2(vocabulary_size))
n

17

### 7. Get the number of features for HashingTF

In [24]:
num_features=2**n
num_features

131072
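
The two steps above can be checked with a small standalone sketch. The helper names `hashing_dimension` and `hashing_dimension_int` are hypothetical; the second variant is an integer-only alternative that avoids any floating-point rounding in `log2`.

```python
from math import ceil, log2

def hashing_dimension(vocabulary_size):
    """Return (n, 2**n), where 2**n is the smallest power of two >= vocabulary_size."""
    n = ceil(log2(vocabulary_size))
    return n, 2 ** n

def hashing_dimension_int(vocabulary_size):
    """Same result using integer arithmetic only."""
    n = (vocabulary_size - 1).bit_length()
    return n, 2 ** n

n, num_features = hashing_dimension(128622)
print(n, num_features)

# The chosen dimension brackets the vocabulary size: 2**(n-1) < size <= 2**n
assert 2 ** (n - 1) < 128622 <= 2 ** n
```

A power of two is a common choice for the `HashingTF` dimension because the hashed value is reduced modulo the number of features.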

### 8. Define the HashingTF and IDF stages

In [25]:
# Define the HashingTF and IDF stages
hashingTF = HashingTF(inputCol="description_filtered", outputCol="rawFeatures", numFeatures=num_features)
idf = IDF(inputCol="rawFeatures", outputCol="features")
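
To make the two stages more concrete, here is a minimal plain-Python sketch of the hashing trick followed by IDF weighting. It is an illustration under assumptions, not Spark's implementation: Spark's `HashingTF` uses MurmurHash3, while this sketch uses `zlib.crc32` as a stand-in hash, and the function names are hypothetical. The IDF formula `log((m + 1) / (df + 1))` is the one given in the Spark MLlib documentation.

```python
import zlib
from collections import Counter
from math import log

def hashing_tf(tokens, num_features):
    """Map each token to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for token in tokens:
        # Stand-in hash (assumption): Spark uses MurmurHash3 instead of CRC32.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)

def idf_weights(tf_vectors):
    """IDF per bucket: log((m + 1) / (df + 1)), with m documents and document frequency df."""
    m = len(tf_vectors)
    doc_freq = Counter()
    for vector in tf_vectors:
        doc_freq.update(vector.keys())
    return {index: log((m + 1) / (df + 1)) for index, df in doc_freq.items()}

def tf_idf(tf_vector, idf):
    """Scale raw term frequencies by their IDF weights."""
    return {index: count * idf[index] for index, count in tf_vector.items()}

docs = [["trump", "office", "trump"], ["office", "video"]]
tf_vectors = [hashing_tf(doc, 2 ** 17) for doc in docs]
idf = idf_weights(tf_vectors)
features = [tf_idf(vector, idf) for vector in tf_vectors]
print(features)
```

A word that appears in every document (here "office") gets an IDF of zero, so it contributes nothing to the final features, while rarer words keep a positive weight.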

## V- Models set up, training and evaluation

### 1. Set up Naive Bayes and Logistic regression classifiers

In [26]:
# Define the classifiers

# Logistic regression classifier
lr = LogisticRegression(labelCol="category_label", featuresCol="features")

# Naive Bayes classifier
nb = NaiveBayes(labelCol="category_label", featuresCol="features")

### 2. Set up pipelines

We will set up pipelines with the following stages for Naive Bayes and Logistic regression:

- HashingTF
- IDF
- 3-Fold Cross-validation  without grid search
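
As a rough sketch of what 3-fold cross-validation does, the plain-Python code below splits row indices into three folds, uses each fold once for validation, and averages the scores. The names `k_fold_indices`, `cross_validate` and `fit_and_score` are hypothetical, and the shuffled round-robin fold assignment is an assumption made for illustration; Spark's `CrossValidator` assigns rows to folds randomly in its own way. With an empty parameter grid, `ParamGridBuilder().build()` yields a single empty parameter map, so cross-validation only estimates the accuracy of the default settings.

```python
import random

def k_fold_indices(num_rows, num_folds=3, seed=0):
    """Yield (training_indices, validation_indices) for each fold."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment of shuffled rows to folds (illustrative choice).
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for i in range(num_folds):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

def cross_validate(num_rows, fit_and_score, num_folds=3, seed=0):
    """Average the score returned by fit_and_score(training, validation) over all folds."""
    scores = [
        fit_and_score(training, validation)
        for training, validation in k_fold_indices(num_rows, num_folds, seed)
    ]
    return sum(scores) / len(scores)

# Dummy scorer (hypothetical): returns the share of rows held out for validation.
print(cross_validate(10, lambda training, validation: len(validation) / 10))
```

Each row is used for validation exactly once and for training in the other two folds.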

In [44]:
# Define parameter grids
paramGrid_nb=paramGrid_lr=ParamGridBuilder().build()

# Create cross validators

# Cross-validation for Naive Bayes
cv_nb = CrossValidator(estimator=nb, estimatorParamMaps=paramGrid_nb,
                        evaluator=MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy"),
                        numFolds=3, parallelism=1)
# Cross-validation for Logistic Regression
cv_lr = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid_lr,
                        evaluator=MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy"),
                        numFolds=3, parallelism=1)


# Create pipelines
# Pipeline for Naive Bayes
pipeline_nb = Pipeline(stages=[hashingTF, idf, cv_nb])
# Pipeline for Logistic Regression
pipeline_lr = Pipeline(stages=[hashingTF, idf, cv_lr])
model_pipelines=pipeline_nb, pipeline_lr
model_pipelines

(Pipeline_8424d794f2c2, Pipeline_1053dd41d4f4)

### 3. Split the data

Let us split the data into training and test sets: 80% for training and 20% for testing.

In [45]:
# Split data
(train_set, test_set) = df.randomSplit([0.8, 0.2], seed=0)

### 4. Create a function for model training

Let us create a function which takes as argument a model that it trains and then returns the trained model.

In [46]:
def train_model(model):    
    return model.fit(train_set)

### 5. Define a function to evaluate the model

The function takes a fitted model as its parameter, evaluates it on the training and test sets, and returns the training and test accuracy. Accuracy is the metric used.
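
For reference, accuracy is simply the fraction of rows whose predicted label equals the true label. A minimal plain-Python sketch (the `accuracy` helper and the sample labels are hypothetical, not part of the notebook):

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for label, prediction in zip(labels, predictions) if label == prediction)
    return correct / len(labels)

# Hypothetical labels in the same encoding as category_label (0.0, 1.0, 2.0)
true_labels = [2.0, 0.0, 1.0, 1.0]
predicted_labels = [2.0, 0.0, 1.0, 0.0]
print(accuracy(true_labels, predicted_labels))
```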

In [47]:
# Initialize the evaluator
evaluator = MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy")

# Function to evaluate model and get best parameters
def evaluate_model(fitted_model):

    print('Making predictions on the training set')

    train_predictions = fitted_model.transform(train_set)

    print('Making predictions on the test set')
    test_predictions = fitted_model.transform(test_set)

    print('Evaluating the model on training set')
    train_accuracy = evaluator.evaluate(train_predictions)

    print('Evaluating the model on test set')
    test_accuracy = evaluator.evaluate(test_predictions)
    return train_accuracy, test_accuracy

In [58]:
# Function to evaluate model and get best parameters
def evaluate_model(fitted_model,model_name):
    

    # Get the best model from cross-validation
    #best_model = fitted_model.stages[-1].bestModel

    print('Making predictions on the training set')
    # Make predictions on the training set
    train_predictions = fitted_model.transform(train_set)

    print('Making predictions on the test set')
    # Make predictions on the test set
    test_predictions = fitted_model.transform(test_set)

    # Initialize the evaluator
    evaluator = MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy")

    print('Evaluating the model on training set')
    # Evaluate the model on the training set
    train_accuracy = evaluator.evaluate(train_predictions)

    print('Evaluating the model on test set')
    # Evaluate the model on the test set
    test_accuracy = evaluator.evaluate(test_predictions)

    print(f"{model_name} Train Accuracy: {train_accuracy}")
    print(f"{model_name} Test Accuracy: {test_accuracy}")

    # Print the best parameters
    #print(f"Best parameters for {model_name}:")

    #for param, value in best_model.extractParamMap().items():
        #print(f"  {param.name}: {value}")

    return train_accuracy, test_accuracy

### 6. Create a function that takes pipelines, trains the models, evaluates them and returns the results

In [64]:
def train_and_evaluate_models(model_pipelines,model_names=["Naive Bayes", "Logistic Regression"]):

    # Initialize the results dictionary
    results = {}

    # Loop over the indices and model names simultaneously
    for idx, (model_pipeline, model_name) in enumerate(zip(model_pipelines, model_names)):
        print(f"Training {model_name} model")

        # Fit the model pipeline to the training set
        #fitted_model = model_pipeline.fit(train_set)
        fitted_model = train_model(model_pipeline)

        print("Done")
        print(f"Evaluating {model_name} model")

        # Evaluate the fitted model
        train_accuracy, test_accuracy = evaluate_model(fitted_model,model_name)

        # Store the results
        results[idx] = {
            'model_name': model_name,
            'fitted_model': fitted_model,
            "train_accuracy": train_accuracy,
            "test_accuracy": test_accuracy
        }

    # Results stay keyed by position (0, 1, ...), even when a single pipeline is passed

    return results

### 7. Call the function and interpret the results

#### a. Training and evaluation

In [65]:
results = train_and_evaluate_models(model_pipelines)
results

Training Naive Bayes model


24/06/05 10:52:20 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB

Done
Evaluating Naive Bayes model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set


24/06/05 10:52:59 WARN DAGScheduler: Broadcasting large task binary with size 34.1 MiB
                                                                                

Evaluating the model on test set


24/06/05 10:53:02 WARN DAGScheduler: Broadcasting large task binary with size 34.1 MiB
                                                                                

Naive Bayes Train Accuracy: 0.786215323314107
Naive Bayes Test Accuracy: 0.7531130303074436
Training Logistic Regression model


24/06/05 10:53:05 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB

Done
Evaluating Logistic Regression model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set


24/06/05 12:01:44 WARN DAGScheduler: Broadcasting large task binary with size 31.0 MiB
                                                                                

Evaluating the model on test set


24/06/05 12:04:10 WARN DAGScheduler: Broadcasting large task binary with size 31.0 MiB

Logistic Regression Train Accuracy: 0.9337228107912878
Logistic Regression Test Accuracy: 0.8204098277092466


                                                                                

In [85]:
# Results stored at index 0 (here Logistic Regression: this cell was run after the single-pipeline tuning run of section VI)
results[0]

{'model_name': 'Logistic Regression',
 'fitted_model': PipelineModel_4869f65c43bf,
 'train_accuracy': 0.9337686859249993,
 'test_accuracy': 0.8198680512066179}

In [86]:
# Index 1 exists only when two pipelines were passed; after a single-pipeline run it raises a KeyError
results[1]

KeyError: 1

#### b. Results interpretation

We observe that:
- **Naive Bayes** reaches an accuracy of about **79%** on the training set and **75%** on the test set.
- **Logistic regression** reaches an accuracy of about **93%** on the training set and **82%** on the test set.

We can conclude that:
- Both models achieve relatively good accuracy on the training and test sets, although the gap between training and test accuracy is larger for **Logistic regression** (about 11 points) than for **Naive Bayes** (about 3 points), which suggests some overfitting.
- The **Logistic regression** model outperforms the **Naive Bayes** model on the test set.

In what follows, we therefore use the **Logistic regression** classifier.

In the next section, we tune the hyperparameters of the **Logistic regression** model to look for the best settings.
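
The comparison can be made explicit by computing the gap between training and test accuracy for each model. The numbers below are copied from the training output above; the variable names are only illustrative.

```python
# (train_accuracy, test_accuracy) as printed in the training output above
reported = {
    "Naive Bayes": (0.786215323314107, 0.7531130303074436),
    "Logistic Regression": (0.9337228107912878, 0.8204098277092466),
}

# Gap between training and test accuracy: a larger gap hints at more overfitting
gaps = {name: train - test for name, (train, test) in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.3f}")

# Model with the best test accuracy
best_model = max(reported, key=lambda name: reported[name][1])
print(best_model)
```

Logistic regression has the better test accuracy but also the wider gap, which is one reason to look at its regularization settings next.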

## VI- Logistic regression hyperparameters tuning

### 1. Pipeline creation

In [80]:
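# Define the parameter grid for the Logistic regression grid search
#reg_values = np.logspace(-4, 4, num=50)
#l1_ratios = np.linspace(0, 1, num=20)
l1_ratios = np.linspace(0, 1, num=5)

# Note: elasticNetParam only changes the model when regParam > 0;
# regParam stays at its default (0.0) here because its grid is commented out.
paramGrid_lr = (ParamGridBuilder()
               #.addGrid(lr.regParam, reg_values)
               .addGrid(lr.elasticNetParam, l1_ratios)
               .build())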

# Create Cross-validation for Logistic Regression
cv_lr = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid_lr,
                        evaluator=MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy"),
                        numFolds=3, parallelism=3)


# Create pipeline for Logistic Regression
pipeline_lr = Pipeline(stages=[hashingTF, idf, cv_lr])

pipeline_lr

Pipeline_f6c22c6bc506
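
To see how the two regularization parameters interact, here is a small plain-Python sketch of the elastic-net penalty as described in the Spark MLlib documentation: `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)`. The helper name and the sample weights are hypothetical. The sketch also counts how many model fits the grid search above requires.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: a mix of L1 and squared L2 norms, scaled by reg_param."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1 + (1 - elastic_net_param) / 2 * l2_squared)

l1_ratios = [0.0, 0.25, 0.5, 0.75, 1.0]  # same values as np.linspace(0, 1, num=5)
weights = [0.5, -1.0, 2.0]               # hypothetical coefficients

# With reg_param = 0 the penalty vanishes, whatever the L1 ratio
penalties_without_reg = [elastic_net_penalty(weights, 0.0, ratio) for ratio in l1_ratios]

# With reg_param > 0 the L1 ratio changes the penalty
penalties_with_reg = [elastic_net_penalty(weights, 0.1, ratio) for ratio in l1_ratios]

# Cost of the search: one fit per grid point per fold (plus one final refit of the best setting)
num_folds = 3
fits_during_search = len(l1_ratios) * num_folds

print(penalties_without_reg, penalties_with_reg, fits_during_search)
```

Since the penalty is zero for every ratio when `regParam` is 0, a grid over `elasticNetParam` alone is only informative once `regParam` is also set to a positive value.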

### 2. Hyperparameters tuning

In [81]:
results=train_and_evaluate_models(model_pipelines=[pipeline_lr],model_names=["Logistic Regression"])
results

Training Logistic Regression model


24/06/05 19:07:19 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB

24/06/05 19:39:26 WARN TaskSetManager: Lost task 0.1 in stage 8352.0 (TID 621734) (172.16.1.2 executor 2): FetchFailed(null, shuffleId=2803, mapIndex=-1, mapId=-1, reduceId=0, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 2803 partition 0
	at org.apache.spark.MapOutputTracker$.validateStatus(MapOutputTracker.scala:1739)
	at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11(MapOutputTracker.scala:1686)
	at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11$adapted(MapOutputTracker.scala:1685)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:1685)
	at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorIdImpl(MapOutputTracker.scala:1327)
	at org.apache.spark.MapOutputTrackerWorker

24/06/05 19:39:26 WARN TaskSetManager: Lost task 15.1 in stage 8359.0 (TID 621738) (172.16.1.2 executor 1): FetchFailed(null, shuffleId=0, mapIndex=-1, mapId=-1, reduceId=15, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 partition 15
	at org.apache.spark.MapOutputTracker$.validateStatus(MapOutputTracker.scala:1739)
	at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11(MapOutputTracker.scala:1686)
	at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11$adapted(MapOutputTracker.scala:1685)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:1685)
	at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorIdImpl(MapOutputTracker.scala:1327)
	at org.apache.spark.MapOutputTrackerWorker.ge

24/06/05 20:00:36 WARN TaskSetManager: Lost task 3.1 in stage 8469.0 (TID 633854) (172.16.1.5 executor 4): FetchFailed(BlockManagerId(2, 172.16.1.2, 43252, None), shuffleId=2860, mapIndex=159, mapId=633808, reduceId=3, message=
org.apache.spark.shuffle.FetchFailedException
	at org.apache.spark.errors.SparkCoreErrors$.fetchFailedError(SparkCoreErrors.scala:437)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1233)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:971)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:86)
	at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.util.Complet

24/06/05 20:00:55 WARN TaskSetManager: Lost task 6.0 in stage 8469.0 (TID 633838) (172.16.1.2 executor 3): TaskKilled (Stage finished)
24/06/05 20:00:55 WARN TaskSetManager: Lost task 4.0 in stage 8469.0 (TID 633846) (172.16.1.2 executor 3): TaskKilled (Stage finished)
24/06/05 20:00:55 WARN TaskSetManager: Lost task 10.0 in stage 8469.0 (TID 633850) (172.16.1.2 executor 3): TaskKilled (Stage finished)
24/06/05 20:00:56 WARN TaskSetManager: Lost task 8.0 in stage 8469.0 (TID 633848) (172.16.1.2 executor 1): TaskKilled (Stage finished)
24/06/05 20:00:56 WARN TaskSe

24/06/05 20:29:35 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:29:36 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:29:39 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:29:44 WARN DAGScheduler: Broadcasting larg

24/06/05 20:38:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:38:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:38:59 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:39:01 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:39:03 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:39:07 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:39:08 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:39:18 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:39:19 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:39:21 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:39:25 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:39:26 WARN DAGScheduler: Broadcasting larg

24/06/05 20:48:34 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:48:35 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:48:38 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:48:46 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:48:50 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:48:51 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:48:53 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:48:54 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:48:58 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:48:59 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:49:01 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 20:49:11 WARN DAGScheduler: Broadcasting larg

24/06/05 21:00:14 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:00:18 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:00:22 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:00:25 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:00:26 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:00:31 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:00:33 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:00:37 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:00:40 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:00:44 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:00:44 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:00:50 WARN DAGScheduler: Broadcasting larg

24/06/05 21:10:25 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:10:30 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:10:31 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:10:36 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:10:38 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:10:42 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:10:43 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:10:48 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:10:50 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:10:54 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:10:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:11:00 WARN DAGScheduler: Broadcasting larg

24/06/05 21:21:40 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:21:41 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:21:46 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:21:52 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB]]
24/06/05 21:21:52 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:21:52 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:22:15 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:22:16 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:22:16 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:22:17 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:22:20 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB]
24/06/05 21:22:29 WARN DAGScheduler: Broadcasting l

24/06/05 21:33:06 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:33:08 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:33:11 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:33:11 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:33:13 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:33:24 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:33:28 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:33:29 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:33:32 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:33:32 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:33:36 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB]
24/06/05 21:33:38 WARN DAGScheduler: Broadcasting lar

24/06/05 21:43:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:43:57 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:44:00 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB]
24/06/05 21:44:01 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:44:04 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:44:14 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:44:18 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:44:18 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:44:21 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:44:22 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:44:26 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB]
24/06/05 21:44:34 WARN DAGScheduler: Broadcasting la

24/06/05 21:55:33 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:55:37 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:55:39 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:55:39 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:55:47 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:55:52 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:55:52 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:55:55 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB]
24/06/05 21:55:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:55:59 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:56:05 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 21:56:09 WARN DAGScheduler: Broadcasting lar

24/06/05 22:05:53 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:05:55 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:05:59 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:06:02 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:06:04 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:06:08 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:06:11 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:06:13 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:06:17 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:06:20 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:06:23 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:06:26 WARN DAGScheduler: Broadcasting larg

24/06/05 22:16:19 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:16:22 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:16:25 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:16:28 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:16:31 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:16:35 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:16:38 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:16:39 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:16:44 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:16:46 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:16:49 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:16:52 WARN DAGScheduler: Broadcasting larg

24/06/05 22:32:29 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:32:37 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:32:39 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:32:47 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:32:49 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:32:57 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:33:00 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:33:07 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:33:10 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:33:17 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:33:20 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/05 22:33:27 WARN DAGScheduler: Broadcasting larg

Done
Evaluating Logistic Regression model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set


24/06/05 22:35:29 WARN DAGScheduler: Broadcasting large task binary with size 31.0 MiB
                                                                                

Evaluating the model on test set


24/06/05 22:37:51 WARN DAGScheduler: Broadcasting large task binary with size 31.0 MiB

Logistic Regression Train Accuracy: 0.9337686859249993
Logistic Regression Test Accuracy: 0.8198680512066179


                                                                                

{0: {'model_name': 'Logistic Regression',
  'fitted_model': PipelineModel_4869f65c43bf,
  'train_accuracy': 0.9337686859249993,
  'test_accuracy': 0.8198680512066179}}

### 3. Interpreting the results

In [88]:
results[0]

{'model_name': 'Logistic Regression',
 'fitted_model': PipelineModel_4869f65c43bf,
 'train_accuracy': 0.9337686859249993,
 'test_accuracy': 0.8198680512066179}
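
The main thing to read from this result is the difference between the two accuracies. The lines below are a minimal sketch in plain Python, with the two values copied from the output above; the variable names are illustrative and not part of the notebook.

```python
# Accuracies copied from the output of results[0] above.
train_accuracy = 0.9337686859249993
test_accuracy = 0.8198680512066179

# Difference between training and test accuracy (the generalization gap).
accuracy_gap = train_accuracy - test_accuracy

print(f"Train accuracy: {train_accuracy:.3f}")
print(f"Test accuracy:  {test_accuracy:.3f}")
print(f"Gap:            {accuracy_gap:.3f}")
```

A gap of roughly 0.11 suggests that the model fits the training set more closely than it generalizes to unseen news, i.e. some degree of overfitting.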

### 4. Get the best parameters

In [90]:
fitted_model = results[0]['fitted_model']

# Get the best model: the last pipeline stage exposes it through bestModel
best_model = fitted_model.stages[-1].bestModel

# Print the best parameters
print("Best parameters for Logistic regression:")

for param, value in best_model.extractParamMap().items():
    print(f"  {param.name}: {value}")

Best parameters for Logistic regression:
  aggregationDepth: 2
  elasticNetParam: 0.0
  family: auto
  featuresCol: features
  fitIntercept: True
  labelCol: category_label
  maxBlockSizeInMB: 0.0
  maxIter: 100
  predictionCol: prediction
  probabilityCol: probability
  rawPredictionCol: rawPrediction
  regParam: 0.0
  standardization: True
  threshold: 0.5
  tol: 1e-06
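
To read the two regularization values: Spark's logistic regression adds the penalty `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)` to the loss. The sketch below plugs in the selected values; the function name `penalty_weights` is illustrative and not part of the notebook or of Spark.

```python
def penalty_weights(reg_param, elastic_net_param):
    """Return the (L1, L2) weights of the elastic-net penalty."""
    l1_weight = reg_param * elastic_net_param
    l2_weight = reg_param * (1.0 - elastic_net_param)
    return l1_weight, l2_weight


# Values taken from the printed parameters above.
l1_weight, l2_weight = penalty_weights(reg_param=0.0, elastic_net_param=0.0)
print(f"L1 weight: {l1_weight}, L2 weight: {l2_weight}")
```

With `regParam` equal to 0.0 both weights are zero, so the selected model is trained without any regularization.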


### 5. Save the best model

In [93]:
best_model.save('output/news_categorization_model')

24/06/05 22:56:05 WARN TaskSetManager: Stage 11290 contains a task of very large size (30402 KiB). The maximum recommended task size is 1000 KiB.
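
A plain `save` call fails if the output directory already exists, for example when the notebook is re-run. The helpers below are a hedged sketch of how the model could be saved with overwrite and reloaded later; `save_model`, `load_model` and `MODEL_PATH` are illustrative names, and loading assumes an active Spark session.

```python
MODEL_PATH = "output/news_categorization_model"  # same path as in the cell above


def save_model(model, path=MODEL_PATH):
    """Save a fitted Spark ML model, replacing any existing directory at `path`."""
    # write().overwrite() avoids the error raised when the path already exists.
    model.write().overwrite().save(path)


def load_model(path=MODEL_PATH):
    """Reload the saved logistic regression model (needs an active SparkSession)."""
    # Imported here so the helpers can be defined before Spark is available.
    from pyspark.ml.classification import LogisticRegressionModel
    return LogisticRegressionModel.load(path)
```

As far as the code above shows, `best_model` is only the last stage of the fitted pipeline, so new text would still have to go through the same feature stages before being passed to the reloaded model.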
                                                                                

## VII- Summary

In this notebook we studied two models for our news categorization task: **Naive Bayes** and **Logistic regression**.

Our study shows that **Logistic regression** performed best.

We then tuned the Logistic regression hyperparameters using grid search with cross-validation and saved the best model found (train accuracy 0.934, test accuracy 0.820).

The next step of our work will be to perform topic modeling on our news dataset.

In [None]:
# Remove the cache
df.unpersist()

In [None]:
# Stop the application
spark.stop()