# News categorization

In this notebook we are going to build a Machine Learning model for news categorisation.

Our dataset is the one we preprocessed before, which has two columns:

- **description_filtered**: the news description after cleaning, tokenisation, lemmatisation and stopword removal
- **category_label**: a numeric value that encodes the category of the news item.

We converted the dataset from CSV to Parquet.

We are going to study two classification models: **Naive Bayes** and **Logistic regression**.

## I- Modules import

Let us import the modules we need.

In [2]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.feature import  IDF, HashingTF
from pyspark.ml import  Pipeline
from math import ceil,log2
from pyspark.ml.classification import LogisticRegression,NaiveBayes,LogisticRegressionModel
from pyspark.sql.functions import col,explode,split
import numpy as np

## II- Spark context and session creation

Let us create a Spark session.

In [3]:
spark = (SparkSession.builder
    .master("spark://node15:7077")
    .appName("NewsCategorization")
    .getOrCreate()
        )
spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/06/06 03:53:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/06/06 03:53:52 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## III- Dataframe preparing

### 1. Load the data

In [4]:
# Load the data (Parquet files carry their own schema, so no CSV-style options are needed)
df = spark.read.parquet("input/news.parquet")

                                                                                

### 2. Partition and cache the dataframe

In [5]:
df.rdd.getNumPartitions()

9

In [8]:
# Repartitioning: use 4 partitions per core (36 cores)
num_partitions = 4 * 36
df = df.repartition(num_partitions).cache()

In [9]:
df.rdd.getNumPartitions()



144
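The partition count above hard-codes the number of cores. As a minimal sketch, the same rule of thumb can be written as a small helper; the name `target_partitions` is hypothetical and not part of the notebook. On a live cluster, `spark.sparkContext.defaultParallelism` is one possible source for the core count, although what it reports depends on the cluster manager and its configuration.

```python
def target_partitions(total_cores: int, partitions_per_core: int = 4) -> int:
    """Return the partition count for a 'k partitions per core' rule of thumb."""
    if total_cores < 1 or partitions_per_core < 1:
        raise ValueError("both arguments must be positive")
    return total_cores * partitions_per_core


# 36 cores is the value implied by `4*36` in the cell above.
num_partitions = target_partitions(36)
print(num_partitions)
```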

### 3. Preview the data

In [10]:
# Count the number of observations
df.count()

                                                                                

1716608

In [11]:
# Show the dataframe
df.show()

+--------------+--------------------+
|category_label|description_filtered|
+--------------+--------------------+
|           4.0|u sanction tagwir...|
|           7.0|surface duet imag...|
|          30.0|stray kitten max ...|
|          19.0|     valley prospect|
|          21.0|preservation chem...|
|          26.0|ultimate demise c...|
|          20.0|chance charwoman ...|
|          28.0|finding lose piec...|
|           8.0|exposure hon sabi...|
|          14.0|life redefined em...|
|          18.0|marriage job fare...|
|          24.0|powerless iron ho...|
|          13.0|kevin mchale thin...|
|          11.0|pageant limelight...|
|          31.0|gina rodriguez la...|
|          12.0|confect cane pop ...|
|          28.0|eye ukrayina hist...|
|          31.0|shakira postpones...|
|          24.0|egyptian mankind ...|
|          13.0|twitter follow dr...|
+--------------+--------------------+
only showing top 20 rows



In [12]:
# Print the schema of the dataframe
df.printSchema()

root
 |-- category_label: double (nullable = true)
 |-- description_filtered: string (nullable = true)



### 4. Convert filtered descriptions to arrays

In [13]:
# Create a new DataFrame with description_filtered as arrays
df= df.withColumn('description_filtered', split(col('description_filtered'), ' '))
# Show the new DataFrame
df.show(truncate=False)

+--------------+-----------------------------------------------------------------------------------------------+
|category_label|description_filtered                                                                           |
+--------------+-----------------------------------------------------------------------------------------------+
|4.0           |[u, sanction, tagwirei, wrong, devoid, logic]                                                  |
|7.0           |[surface, duet, image, express, triiodothyronine, website, suggesting, launching, imminent]    |
|30.0          |[stray, kitten, max, born, without, palpebra, get, sec, probability, thanks, stranger]         |
|19.0          |[valley, prospect]                                                                             |
|21.0          |[preservation, chemical, group, mobile, river, headphone, new, result, rural, woman]           |
|26.0          |[ultimate, demise, common, sum, part, politics]                                 

## IV- Feature Engineering


### 1. Explode the filtered descriptions to get the words

In [14]:
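# One row per word; the exploded column keeps Spark's default name `col`
# (calling .alias() on the DataFrame, rather than on the column, does not rename it)
exploded_df = df.select(explode(df.description_filtered))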
exploded_df.show()

+----------------+
|             col|
+----------------+
|               u|
|        sanction|
|        tagwirei|
|           wrong|
|          devoid|
|           logic|
|         surface|
|            duet|
|           image|
|         express|
|triiodothyronine|
|         website|
|      suggesting|
|       launching|
|        imminent|
|           stray|
|          kitten|
|             max|
|            born|
|         without|
+----------------+
only showing top 20 rows



In [15]:
#df=df.unpersist()

### 2. Get the unique words in `description_filtered`

In [16]:
unique_words=exploded_df.distinct()

### 3. Cache and show the unique words dataframe

In [17]:
unique_words=unique_words.cache()
unique_words.show()



+------------+
|         col|
+------------+
|        pant|
|      online|
|   recognize|
|       still|
|        hope|
|    medicare|
|   traveling|
|         art|
|      travel|
|       oscar|
|      gloria|
|      voyage|
| requirement|
|oleochemical|
|       dilma|
|      marrow|
|     melodic|
|       inner|
|      bazaar|
|       mammy|
+------------+
only showing top 20 rows



                                                                                

### 4. Get the vocabulary size

In [18]:
vocabulary_size=unique_words.count()
vocabulary_size

128622
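To make the explode / distinct / count sequence concrete, here is a plain-Python analogue on a handful of rows. The first two rows come from the preview above; the third is made up to show the effect of deduplication. This only illustrates the logic, it is not how Spark executes it.

```python
rows = [
    ["u", "sanction", "tagwirei", "wrong", "devoid", "logic"],  # from the preview
    ["valley", "prospect"],                                     # from the preview
    ["valley", "logic"],                                        # made-up row with repeated words
]

# "explode": one entry per token, across all rows
exploded = [token for tokens in rows for token in tokens]

# "distinct" then "count": the vocabulary and its size
vocabulary = set(exploded)
print(len(exploded), len(vocabulary))
```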

### 5. Unpersist the unique words dataframe (no longer needed)

In [19]:
unique_words=unique_words.unpersist()

### 6. Get the smallest `n` such that $2^n$ is greater than or equal to `vocabulary_size`

In [20]:
n=ceil(log2(vocabulary_size))
n

17

### 7. Get the number of features for HashingTF

In [21]:
num_features=2**n
num_features

131072
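The arithmetic of the last two steps can be checked in isolation. The sketch below reuses the vocabulary size reported above and verifies that `n` really is the smallest suitable exponent. Note that choosing at least as many buckets as distinct words reduces hash collisions but does not rule them out, since the hash function is not a one-to-one mapping.

```python
from math import ceil, log2

vocabulary_size = 128622  # value reported by the notebook

n = ceil(log2(vocabulary_size))
num_features = 2 ** n

# n is the smallest exponent with 2**n >= vocabulary_size
assert 2 ** (n - 1) < vocabulary_size <= num_features

print(n, num_features)
print(f"buckets per distinct word: {num_features / vocabulary_size:.3f}")
```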

### 8. Define the HashingTF and IDF stages

In [22]:
# Define the HashingTF and IDF stages
hashingTF = HashingTF(inputCol="description_filtered", outputCol="rawFeatures", numFeatures=num_features)
idf = IDF(inputCol="rawFeatures", outputCol="features")
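To give an intuition of what these two stages compute, here is a small plain-Python sketch. It is an illustration under stated assumptions, not Spark's implementation: the hash is CRC32 as a stand-in (Spark's `HashingTF` uses MurmurHash3), the vector has 16 columns instead of $2^{17}$, and the helper names are hypothetical. The IDF weight follows the smoothed formula documented for Spark, $\log\frac{m+1}{df+1}$, where $m$ is the number of documents and $df$ the number of documents containing the term.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # tiny on purpose; the notebook uses 2**17


def bucket(term: str, num_features: int = NUM_FEATURES) -> int:
    """Map a term to a column index (CRC32 here, as a stand-in hash)."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def hashed_tf(tokens, num_features: int = NUM_FEATURES) -> dict:
    """Sparse term-frequency vector: {column index: count}."""
    return dict(Counter(bucket(t, num_features) for t in tokens))


def idf_weights(tf_vectors) -> dict:
    """Smoothed IDF per column: log((m + 1) / (df + 1))."""
    m = len(tf_vectors)
    doc_freq = Counter(idx for vec in tf_vectors for idx in vec)
    return {idx: math.log((m + 1) / (df + 1)) for idx, df in doc_freq.items()}


docs = [
    ["valley", "prospect"],
    ["u", "sanction", "tagwirei", "wrong", "devoid", "logic"],
]
tf_vectors = [hashed_tf(d) for d in docs]
idf = idf_weights(tf_vectors)
tfidf = [{i: c * idf[i] for i, c in vec.items()} for vec in tf_vectors]
print(tfidf)
```

Two words that land in the same bucket simply add up in that column, which is why the number of features is chosen above the vocabulary size.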

## V- Models set up, training and evaluation

### 1. Set up Naive Bayes and Logistic regression classifiers

In [23]:
# Define the classifiers

# Logistic regression classifier
lr = LogisticRegression(labelCol="category_label", featuresCol="features")

# Naive Bayes classifier
nb = NaiveBayes(labelCol="category_label", featuresCol="features")

### 2. Set up pipelines

We will set up pipelines with the following stages for Naive Bayes and Logistic regression:

- HashingTF
- IDF
- the classifier, wrapped in 3-fold cross-validation with an empty parameter grid (no grid search)

In [24]:
# Define parameter grids
paramGrid_nb=paramGrid_lr=ParamGridBuilder().build()

# Create cross validators

# Cross-validation for Naive Bayes
cv_nb = CrossValidator(estimator=nb, estimatorParamMaps=paramGrid_nb,
                        evaluator=MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy"),
                        numFolds=3, parallelism=1)
# Cross-validation for Logistic Regression
cv_lr = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid_lr,
                        evaluator=MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy"),
                        numFolds=3, parallelism=1)


# Create pipelines
# Pipeline for Naive Bayes
pipeline_nb = Pipeline(stages=[hashingTF, idf, cv_nb])
# Pipeline for Logistic Regression
pipeline_lr = Pipeline(stages=[hashingTF, idf, cv_lr])
model_pipelines=pipeline_nb, pipeline_lr
model_pipelines

(Pipeline_985aa63992d1, Pipeline_f319c4693ffa)
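The 3-fold cross-validation configured above can be pictured with the plain-Python sketch below. It assigns rows to folds in round-robin order purely for readability, whereas Spark's `CrossValidator` assigns rows to folds at random; the helper name `k_fold_indices` is hypothetical. With an empty parameter grid, each model is fitted once per fold and then refitted on the whole training set.

```python
def k_fold_indices(n_rows: int, k: int = 3):
    """Yield (train_idx, validation_idx) pairs for k folds over range(n_rows)."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        # Training rows are all rows that are not in the current validation fold
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


for train_idx, val_idx in k_fold_indices(9, k=3):
    print("train:", sorted(train_idx), "validation:", val_idx)
```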

### 3. Split the data

Let us split the data into train and test set: 80% for train and 20% for test

In [25]:
# Split data
(train_set, test_set) = df.randomSplit([0.8, 0.2], seed=0)
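As a rough picture of what a weighted random split does, the sketch below assigns each row independently to one of the splits with probability proportional to its weight. This is a plain-Python illustration with a hypothetical helper name, not Spark's code; it mainly shows why the 80/20 proportions are approximate rather than exact.

```python
import random


def random_split(rows, weights=(0.8, 0.2), seed=0):
    """Assign each row independently to a split, proportionally to the weights."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for split, weight in zip(splits, weights):
            cumulative += weight
            if draw < cumulative:
                split.append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point round-off
    return splits


train, test = random_split(range(10_000))
print(len(train), len(test))
```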

### 4. Create a function for model training

Let us create a function that fits a given model (here, a pipeline) on the training set and returns the fitted model.

In [26]:
def train_model(model):
    return model.fit(train_set)

### 5. Define a function to evaluate the model

The function takes a fitted model as parameter, evaluates it on the train and test splits, and returns the train and test performance. Accuracy is the metric used.

In [32]:
# Initialize the evaluator
evaluator = MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy")

# Function to evaluate a fitted model on the train and test sets
def evaluate_model(fitted_model):

    print('Making predictions on the training set')

    train_predictions = fitted_model.transform(train_set)

    print('Making predictions on the test set')
    test_predictions = fitted_model.transform(test_set)

    print('Evaluating the model on training set')
    train_accuracy = evaluator.evaluate(train_predictions)

    print('Evaluating the model on test set')
    test_accuracy = evaluator.evaluate(test_predictions)
    
    print('Train accuracy:',train_accuracy)
    print('Test accuracy:',test_accuracy)
    return train_accuracy, test_accuracy
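
For reference, the accuracy metric used here is simply the fraction of rows whose predicted label equals the true label. A minimal plain-Python sketch follows; the labels are made-up values in the same numeric format as `category_label`, and the helper name is hypothetical.

```python
def accuracy(labels, predictions):
    """Fraction of rows whose predicted label equals the true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)


# Toy example: four of the five predictions are correct
labels = [4.0, 7.0, 30.0, 19.0, 21.0]
predictions = [4.0, 7.0, 30.0, 19.0, 26.0]
print(accuracy(labels, predictions))
```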


### 6. Create a function that trains and evaluates each pipeline and returns the results

In [30]:
def train_and_evaluate_models(model_pipelines, model_names=("Naive Bayes", "Logistic Regression")):
    """Fit and evaluate each pipeline; return a dict of results keyed by position."""

    # Initialize the results dictionary
    results = {}

    # Loop over the pipelines and their names together
    for idx, (model_pipeline, model_name) in enumerate(zip(model_pipelines, model_names)):
        print(f"Training {model_name} model")

        # Fit the model pipeline to the training set
        fitted_model = train_model(model_pipeline)

        print("Done")
        print(f"Evaluating {model_name} model")

        # Evaluate the fitted model
        train_accuracy, test_accuracy = evaluate_model(fitted_model)
        print("Done")
        # Store the results
        results[idx] = {
            'model_name': model_name,
            'fitted_model': fitted_model,
            "train_accuracy": train_accuracy,
            "test_accuracy": test_accuracy
        }

    # Results are always keyed by index, even when a single pipeline is passed
    return results

### 7. Call the function and interpret the results

#### a. Training and evaluation

In [33]:
results = train_and_evaluate_models(model_pipelines)
results

Training Naive Bayes model


24/06/06 04:05:49 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB

Done
Evaluating Naive Bayes model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set



Evaluating the model on test set



Train accuracy: 0.7864410655081743
Test accuracy: 0.7535687343626667
Done
Training Logistic Regression model


Done
Evaluating Logistic Regression model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set



Evaluating the model on test set



Train accuracy: 0.933455619724761
Test accuracy: 0.8194035458699964
Done


                                                                                

{0: {'model_name': 'Naive Bayes',
  'fitted_model': PipelineModel_54bb4d9d300d,
  'train_accuracy': 0.7864410655081743,
  'test_accuracy': 0.7535687343626667},
 1: {'model_name': 'Logistic Regression',
  'fitted_model': PipelineModel_73727468e5f1,
  'train_accuracy': 0.933455619724761,
  'test_accuracy': 0.8194035458699964}}

In [34]:
# Results of the fitted Naive Bayes classifier
results[0]

{'model_name': 'Naive Bayes',
 'fitted_model': PipelineModel_54bb4d9d300d,
 'train_accuracy': 0.7864410655081743,
 'test_accuracy': 0.7535687343626667}

In [35]:
# Results of the fitted Logistic regression classifier
results[1]

{'model_name': 'Logistic Regression',
 'fitted_model': PipelineModel_73727468e5f1,
 'train_accuracy': 0.933455619724761,
 'test_accuracy': 0.8194035458699964}

#### b. Results interpretation

We observe that
- **Naive Bayes** reaches an accuracy of **79%** on the train set and **75%** on the test set.
- **Logistic regression** reaches an accuracy of **93%** on the train set and **82%** on the test set.

We can then conclude that
- Both models perform reasonably well on the training and test sets.
- **Naive Bayes** generalises closely (a gap of about 3 points between train and test), whereas **Logistic regression** shows a gap of about 11 points, which suggests some overfitting.
- The **Logistic regression** model nevertheless outperforms the **Naive Bayes** model on the test set.

In what follows, we will therefore use the **Logistic regression** classifier.

In the next section, we will tune the regularisation parameter of the **Logistic regression** with a grid search.

## VI- Logistic regression hyperparameters tuning

### 1. Pipeline creation

In [37]:
# Define the parameter grid for the Logistic regression grid search

reg_values = np.logspace(-4, 4, num=10)

paramGrid_lr= ParamGridBuilder().addGrid(lr.regParam, reg_values).build()

# Create Cross-validation for Logistic Regression
cv_lr = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid_lr,
                        evaluator=MulticlassClassificationEvaluator(labelCol="category_label", predictionCol="prediction", metricName="accuracy"),
                        numFolds=3, parallelism=3)


# Create pipeline for Logistic Regression
pipeline_lr = Pipeline(stages=[hashingTF, idf, cv_lr])

pipeline_lr

Pipeline_02bd9cb85ce5
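
To see what the grid above contains and how much training it implies, the sketch below rebuilds the same ten values in plain Python (exponents evenly spaced from -4 to 4, which is the spacing `np.logspace(-4, 4, num=10)` uses) and counts the fits for 3 folds. The upper end of this range corresponds to very strong regularisation, so those candidates will most likely underfit; that expectation is an assumption, not something the notebook has shown.

```python
num_values = 10

# Exponents evenly spaced from -4 to 4, then raised to base 10
exponents = [-4 + 8 * i / (num_values - 1) for i in range(num_values)]
reg_values = [10 ** e for e in exponents]

for value in reg_values:
    print(f"{value:.3g}")

num_folds = 3
# One fit per (candidate, fold), plus a final refit of the best candidate
total_fits = len(reg_values) * num_folds + 1
print(total_fits)
```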

### 2. Hyperparameters tuning

In [38]:
results=train_and_evaluate_models(model_pipelines=[pipeline_lr],model_names=["Logistic Regression"])
results

Training Logistic Regression model



24/06/06 06:29:34 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:29:38 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:29:48 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:29:51 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:29:51 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:29:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:29:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:30:00 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:30:11 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:30:13 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:30:14 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:30:18 WARN DAGScheduler: Broadcasting larg

24/06/06 06:39:11 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:39:24 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:39:27 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:39:28 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:39:34 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:39:36 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:39:37 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:39:47 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:39:54 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:39:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:39:58 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:40:01 WARN DAGScheduler: Broadcasting larg

24/06/06 06:45:42 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:45:42 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:46:00 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:46:02 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:46:48 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:46:54 WARN BlockManager: Asked to remove block broadcast_2794_piece3, which does not exist
24/06/06 06:46:54 WARN BlockManager: Asked to remove block broadcast_2794_piece2, which does not exist
24/06/06 06:46:54 WARN BlockManager: Asked to remove block broadcast_2794_piece1, which does not exist
24/06/06 06:46:54 WARN BlockManager: Asked to remove block broadcast_2794_piece0, which does not exist
24/06/06 06:46:54 WARN BlockManager: Asked to remove block broadcast_2794_piece4, which does not exist
24/06/06 06:46:54 WARN BlockManagerMaster: Failed to re

24/06/06 06:48:03 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:48:04 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:48:08 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:48:10 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:48:14 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:48:15 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:48:25 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:48:25 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:48:28 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:48:38 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:48:38 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:48:48 WARN DAGScheduler: Broadcasting larg

24/06/06 06:54:18 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:54:21 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:54:21 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:54:27 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:54:29 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:54:38 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:54:40 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:54:41 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:54:45 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:54:47 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:54:48 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 06:55:02 WARN DAGScheduler: Broadcasting larg

24/06/06 07:06:28 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:06:29 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:06:30 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:06:38 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:06:39 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:06:46 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:06:46 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:06:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:07:03 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:07:06 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:07:07 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:07:11 WARN DAGScheduler: Broadcasting larg

24/06/06 07:12:29 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:12:33 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:12:35 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:12:35 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:12:47 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:12:47 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:12:48 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:12:54 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:12:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:13:05 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:13:10 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:13:11 WARN DAGScheduler: Broadcasting larg

24/06/06 07:18:34 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:18:36 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:18:42 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:18:42 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:18:42 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:18:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:18:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:19:03 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:19:04 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:19:07 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:19:09 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:19:18 WARN DAGScheduler: Broadcasting larg

24/06/06 07:24:42 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:24:44 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:24:52 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:24:52 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:24:55 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:24:58 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:25:02 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:25:04 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:25:08 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:25:15 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:25:20 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:25:21 WARN DAGScheduler: Broadcasting larg

24/06/06 07:30:32 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:30:34 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:30:39 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:30:40 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:30:42 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:30:54 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:30:54 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:31:02 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:31:03 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:31:06 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:31:06 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:31:16 WARN DAGScheduler: Broadcasting larg

24/06/06 07:36:49 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:36:55 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:37:06 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:37:07 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:37:08 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:37:09 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:37:09 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:37:11 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:37:11 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:37:13 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:37:26 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:37:32 WARN DAGScheduler: Broadcasting lar

24/06/06 07:44:03 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:44:07 WARN BlockManagerMaster: Failed to remove broadcast 3939 with removeFromMaster = true - Block broadcast_3939_piece1 does not exist
org.apache.spark.SparkException: Block broadcast_3939_piece1 does not exist
	at org.apache.spark.errors.SparkCoreErrors$.blockDoesNotExistError(SparkCoreErrors.scala:318)
	at org.apache.spark.storage.BlockInfoManager.blockInfo(BlockInfoManager.scala:269)
	at org.apache.spark.storage.BlockInfoManager.removeBlock(BlockInfoManager.scala:547)
	at org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:2091)
	at org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:2057)
	at org.apache.spark.storage.BlockManager.$anonfun$removeBroadcast$3(BlockManager.scala:2029)
	at org.apache.spark.storage.BlockManager.$anonfun$removeBroadcast$3$adapted(BlockManager.scala:2029)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at 

24/06/06 07:46:10 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:46:14 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:46:31 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:46:34 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:46:40 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:46:42 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:46:43 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:46:47 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:46:53 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:46:59 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:47:02 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:47:03 WARN DAGScheduler: Broadcasting larg

24/06/06 07:48:54 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:49:43 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:49:51 WARN BlockManager: Asked to remove block broadcast_4030_piece0, which does not exist
24/06/06 07:49:51 WARN BlockManagerMaster: Failed to remove broadcast 4030 with removeFromMaster = true - Block broadcast_4030_piece0 does not exist
org.apache.spark.SparkException: Block broadcast_4030_piece0 does not exist
	at org.apache.spark.errors.SparkCoreErrors$.blockDoesNotExistError(SparkCoreErrors.scala:318)
	at org.apache.spark.storage.BlockInfoManager.blockInfo(BlockInfoManager.scala:269)
	at org.apache.spark.storage.BlockInfoManager.removeBlock(BlockInfoManager.scala:547)
	at org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:2091)
	at org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:2057)
	at org.apache.spark.storage.BlockManager.$anonfun$removeBroadcast$3(

24/06/06 07:51:28 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:51:31 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:51:34 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:51:37 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:51:41 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:51:47 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:51:52 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:51:53 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:51:55 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:52:01 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:52:03 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:52:07 WARN DAGScheduler: Broadcasting larg

24/06/06 07:57:47 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:57:53 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:57:54 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:57:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:58:02 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:58:03 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:58:04 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:58:17 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:58:19 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:58:25 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:58:27 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 07:58:30 WARN DAGScheduler: Broadcasting larg

24/06/06 08:05:30 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:05:39 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:05:41 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:05:42 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:05:45 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:05:47 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:05:49 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:05:53 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:05:53 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:05:58 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:06:06 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:06:08 WARN DAGScheduler: Broadcasting larg

24/06/06 08:13:13 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:13:15 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:13:22 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:13:31 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:13:36 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:13:37 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:13:38 WARN DAGScheduler: Broadcasting large task binary with size 29.0 MiB
24/06/06 08:13:38 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:13:48 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:14:44 WARN DAGScheduler: Broadcasting large task binary with size 29.0 MiB
24/06/06 08:14:49 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:14:58 WARN DAGScheduler: Broadcasting la

24/06/06 08:24:32 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:24:34 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:24:36 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:24:38 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:24:41 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:24:43 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:24:49 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:24:54 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:24:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:24:57 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:25:01 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:25:02 WARN DAGScheduler: Broadcasting larg

24/06/06 08:29:53 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:29:58 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:30:06 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:30:09 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:30:11 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:30:13 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:30:14 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:30:15 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:30:27 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:30:29 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:30:33 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:30:34 WARN DAGScheduler: Broadcasting larg

24/06/06 08:35:34 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:35:40 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:35:41 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:35:49 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:35:51 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:35:52 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:35:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:35:57 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:35:58 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:36:08 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:36:11 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:36:15 WARN DAGScheduler: Broadcasting larg

24/06/06 08:41:28 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:41:34 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:41:35 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:41:38 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:41:39 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:41:44 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:41:46 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:41:53 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:41:58 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:41:58 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:42:01 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:42:02 WARN DAGScheduler: Broadcasting larg

24/06/06 08:47:06 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:47:06 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:47:12 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:47:18 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:47:19 WARN DAGScheduler: Broadcasting large task binary with size 28.9 MiB
24/06/06 08:47:21 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:47:28 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:48:17 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:48:21 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:48:30 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:48:31 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:48:34 WARN DAGScheduler: Broadcasting lar

24/06/06 08:54:31 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:54:32 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:54:37 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:54:38 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:54:39 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:54:48 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:54:50 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:54:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:54:56 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:55:01 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:55:03 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 08:55:06 WARN DAGScheduler: Broadcasting larg

24/06/06 09:00:59 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:01:13 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:01:20 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:01:25 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:01:25 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:01:27 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:01:38 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:01:39 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:01:40 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:01:53 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:01:58 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:02:02 WARN DAGScheduler: Broadcasting larg

24/06/06 09:09:31 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:09:34 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:09:44 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:09:44 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:09:53 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:10:13 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:10:14 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:10:20 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:10:31 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:10:31 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:10:36 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:10:53 WARN DAGScheduler: Broadcasting larg

24/06/06 09:20:10 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:20:12 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:20:12 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:20:14 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:20:25 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:20:25 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:20:28 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:20:29 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:20:38 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:20:39 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:20:42 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:20:45 WARN DAGScheduler: Broadcasting larg

24/06/06 09:27:16 WARN DAGScheduler: Broadcasting large task binary with size 28.9 MiB
24/06/06 09:27:35 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:27:35 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:28:17 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:28:24 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:28:25 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:28:30 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:28:39 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:28:42 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:28:42 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
24/06/06 09:28:46 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
Done
Evaluating Logistic Regression model
Making predictions on the training set
Making predictions on the test set
Evaluating the model on training set


24/06/06 09:53:01 WARN DAGScheduler: Broadcasting large task binary with size 31.0 MiB

Evaluating the model on test set


24/06/06 09:56:25 WARN DAGScheduler: Broadcasting large task binary with size 31.0 MiB

Train accuracy: 0.9126280966941168
Test accuracy: 0.840298955229532
Done



{0: {'model_name': 'Logistic Regression',
  'fitted_model': PipelineModel_2650357d9142,
  'train_accuracy': 0.9126280966941168,
  'test_accuracy': 0.840298955229532}}

### 3. Interpreting the results

In [42]:
# Keep only the entry stored under key 0 (the tuned Logistic Regression model)
results = results[0]
results

{'model_name': 'Logistic Regression',
 'fitted_model': PipelineModel_2650357d9142,
 'train_accuracy': 0.9126280966941168,
 'test_accuracy': 0.840298955229532}
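
One quick way to read these two numbers is to look at the difference between them. The sketch below copies the accuracies reported above into a plain dictionary (the name `reported` and the helper `accuracy_gap` are illustrative, not part of the notebook) and computes the train/test gap. A gap of about 7 points suggests the model fits the training set somewhat better than it generalizes, although what counts as acceptable depends on the task.

```python
# Accuracies copied from the cell output above; `reported` is an illustrative name
reported = {
    "model_name": "Logistic Regression",
    "train_accuracy": 0.9126280966941168,
    "test_accuracy": 0.840298955229532,
}


def accuracy_gap(train_accuracy, test_accuracy):
    """Return how much higher the training accuracy is than the test accuracy."""
    return train_accuracy - test_accuracy


gap = accuracy_gap(reported["train_accuracy"], reported["test_accuracy"])
print(f"Train/test accuracy gap: {gap:.4f}")
```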

### 4. Get the best parameters

In [43]:
fitted_model = results['fitted_model']

# Get the best model: the last pipeline stage is the fitted cross-validator
best_model = fitted_model.stages[-1].bestModel

# Print the best parameters
print("Best parameters for Logistic regression:")

for param, value in best_model.extractParamMap().items():
    print(f"  {param.name}: {value}")

Best parameters for Logistic regression:
  aggregationDepth: 2
  elasticNetParam: 0.0
  family: auto
  featuresCol: features
  fitIntercept: True
  labelCol: category_label
  maxBlockSizeInMB: 0.0
  maxIter: 100
  predictionCol: prediction
  probabilityCol: probability
  rawPredictionCol: rawPrediction
  regParam: 0.000774263682681127
  standardization: True
  threshold: 0.5
  tol: 1e-06


### 5. Save the best model

In [44]:
best_model.save('output/news_categorization_model')

24/06/06 11:41:16 WARN TaskSetManager: Stage 5308 contains a task of very large size (30404 KiB). The maximum recommended task size is 1000 KiB.
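
The saved artifact is the best Logistic Regression model only, so it expects input that already has a `features` column (the `featuresCol` printed above). As a minimal sketch, assuming the path used in the save call and a working PySpark installation, the model could be reloaded in a later session as follows. The helper name `load_saved_model` and the application name are hypothetical.

```python
def load_saved_model(path="output/news_categorization_model"):
    """Reload the saved Logistic Regression model (sketch; needs PySpark at call time)."""
    # Imported lazily so that defining the helper does not itself require Spark
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegressionModel

    # Reuses the active session if there is one, otherwise starts one with
    # default settings (the cluster master used in this notebook is not set here)
    spark = SparkSession.builder.appName("NewsCategorizationInference").getOrCreate()
    model = LogisticRegressionModel.load(path)
    return spark, model
```

A call such as `spark, model = load_saved_model()` followed by `model.transform(df_features)` would then add a `prediction` column, provided `df_features` was built with the same feature extraction steps as the training data.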

## VII- Summary

In this notebook we studied two models for our news categorization task: **Naive Bayes** and **Logistic regression**.

Our study shows that **Logistic regression** performed best.

We then tuned the Logistic regression hyperparameters using grid search with cross-validation and saved the best model, which reaches an accuracy of about 0.913 on the training set and 0.840 on the test set.

The next step of our work will be to perform topic modeling on our news dataset.

In [45]:
# Remove the cache
df.unpersist()

DataFrame[category_label: double, description_filtered: array<string>]

In [46]:
# Stop the spark session
spark.stop()