# 4. Modelling

![picture](img/picture10.png)

The training notebook will cover the following:
#### Content
* Creation of pipeline
* Training and saving
* Varying training parameters


In [1]:
%_do_not_call_change_endpoint --username  --password  --server https://lighter-staging.vbrani.aisingapore.net/lighter/api  

In [None]:
%%configure -f
{"conf": {
        "spark.sql.warehouse.dir" : "s3a://dataops-example/justin",
        "spark.hadoop.fs.s3a.access.key":"",
        "spark.hadoop.fs.s3a.secret.key": "",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer", 
        "spark.kryoserializer.buffer.max": "2000M",
        "spark.driver.maxResultSize": "0",
        "spark.kubernetes.container.image": "justinljg/dep:1.08",
        "spark.kubernetes.container.image.pullPolicy" : "Always",
        "spark.jsl.settings.pretrained.cache_folder": "/opt/spark/work-dir",
        "spark.kubernetes.driver.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.options.claimName": "lighter-sparknlptest-pvc",
        "spark.kubernetes.driver.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.mount.path": "/opt/spark/work-dir",
        "spark.kubernetes.executor.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.options.claimName": "lighter-sparknlptest-pvc",
        "spark.kubernetes.executor.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.mount.path": "/opt/spark/work-dir",
        "spark.jsl.settings.annotator.log_folder": "/opt/spark/work-dir/logs",
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "2",
        "spark.dynamicAllocation.maxExecutors": "10"
    },
 "executorMemory": "3G",
 "executorCores": 1,
 "driverMemory": "16G",
 "driverCores": 1
}

In [3]:
from pyspark.ml import Pipeline
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.ml.classification import *
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

import time

import tensorflow as tf

import sparknlp
from sparknlp import DocumentAssembler
from sparknlp.annotator import *

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
0766a0c6-369d-44b9-8b57-ef0fd56690e5,spark-7991fa742e704511aece66d564b1f2f8,pyspark,idle,,,,✔


SparkSession available as 'spark'.


In this example a native model is called to train on the dataset.

In [4]:
%%sql

USE SparkNLP

In [5]:
df= spark.read.table("greview_model")


In [6]:
train_data, test_data = df.select("text", "label").orderBy(rand()).randomSplit([0.8, 0.2], seed=42)

# Check the number of rows in each set
print(f"Train set size: {train_data.count()} rows")
print(f"Test set size: {test_data.count()} rows")

Train set size: 348906 rows
Test set size: 87274 rows

## Creation of pipeline

#### Model used
<b>sent_small_bert_L8_512</b>

This is a compact BERT model produced through knowledge distillation.

The "sent" in its name indicates that the model is optimized for sentence-level tasks, as opposed to other language models that may be optimized for word-level or document-level tasks.

"L8" refers to the number of transformer layers in the model. The model has eight layers, fewer than the twelve layers of BERT-base, which keeps it smaller and faster at the cost of some capacity for modelling complex relationships between words and sentences.

Finally, "512" refers to the size of the hidden layer in the model. This is the size of the vector representation that the model produces for each word or sentence. A larger hidden layer size generally allows the model to capture more fine-grained details in the input data, but also increases the computational requirements for training and inference.
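
As a small illustration of the naming convention described above, the sketch below splits a model name into its parts. The helper `parse_small_bert_name` is hypothetical and not part of Spark NLP; it only assumes that names follow the `[sent_]small_bert_L<layers>_<hidden size>` pattern.

```python
import re


def parse_small_bert_name(name: str) -> dict:
    """Split a small-BERT style model name into its parts (hypothetical helper)."""
    match = re.fullmatch(r"(sent_)?small_bert_L(\d+)_(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognised model name: {name}")
    return {
        "sentence_level": match.group(1) is not None,  # "sent_" prefix present
        "layers": int(match.group(2)),                 # number after "L"
        "hidden_size": int(match.group(3)),            # trailing number
    }


info = parse_small_bert_name("sent_small_bert_L8_512")
print(info)
```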

<br>

#### Defining the Pipeline

In [7]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

bert = BertSentenceEmbeddings.pretrained('sent_small_bert_L8_512')\
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLApproach()\
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("class")\
  .setLabelColumn("label")\
  .setMaxEpochs(15)\
  .setEnableOutputLogs(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    bert,
    classifierdl
])



sent_small_bert_L8_512 download started this may take some time.
Approximate size to download 149.1 MB
[OK!]

In [8]:
train_data= train_data.limit(16000)
test_data= test_data.limit(4000)

## Training and saving

This section covers training, evaluating and saving the Spark NLP model.

In [9]:
t0 = time.time()
model = pipeline.fit(train_data)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 7.99min
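
The same `t0 = time.time()` timing pattern is repeated in many later cells. One possible way to avoid the repetition is a small context manager, sketched below. The name `timed` is hypothetical, and the summation is only a stand-in workload; in the notebook the body would be a call such as `pipeline.fit(train_data)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str = "Training time"):
    """Print the wall-clock time of the enclosed block in minutes."""
    t0 = time.time()
    try:
        yield
    finally:
        print(f"{label}: {(time.time() - t0) / 60:.2f}min")


# Stand-in workload; replace with the training call when used in the notebook.
with timed():
    total = sum(range(1000))
```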

In [10]:
train_pred = model.transform(test_data)

In [11]:
train_pred_processed = train_pred.select("label","class.result")
train_pred_processed.show(truncate=False)

+-----+------+
|label|result|
+-----+------+
|5    |[5]   |
|1    |[5]   |
|5    |[5]   |
|5    |[5]   |
|3    |[5]   |
|1    |[5]   |
|5    |[5]   |
|5    |[5]   |
|4    |[5]   |
|1    |[5]   |
|1    |[5]   |
|4    |[5]   |
|5    |[5]   |
|5    |[5]   |
|5    |[5]   |
|5    |[5]   |
|4    |[5]   |
|2    |[5]   |
|5    |[5]   |
|5    |[5]   |
+-----+------+
only showing top 20 rows

In [12]:
# Take the first element of the "result" array column and cast it to DoubleType
train_pred_processed = train_pred_processed.withColumn("result", col("result")[0].cast("double"))

In [None]:
# Create an instance of MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="result")

# Calculate the evaluation metric(s)
accuracy = evaluator.evaluate(train_pred_processed, {evaluator.metricName: "accuracy"})
weightedPrecision = evaluator.evaluate(train_pred_processed, {evaluator.metricName: "weightedPrecision"})
weightedRecall = evaluator.evaluate(train_pred_processed, {evaluator.metricName: "weightedRecall"})
f1 = evaluator.evaluate(train_pred_processed, {evaluator.metricName: "f1"})

In [None]:
# Print the evaluation metric(s)
print("Accuracy:", accuracy)
print("Weighted Precision:", weightedPrecision)
print("Weighted Recall:", weightedRecall)
print("F1 Score:", f1)
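
To make the four metrics more concrete, here is a plain-Python sketch of accuracy and support-weighted precision, recall and F1. It is an illustration of the definitions, not Spark's implementation. The toy labels are the first ten rows of the prediction table shown earlier, where every prediction is 5; treating the precision of a never-predicted class as 0 is an assumption of this sketch.

```python
from collections import Counter


def weighted_metrics(labels, predictions):
    """Accuracy plus support-weighted precision, recall and F1 (illustrative only)."""
    n = len(labels)
    pairs = list(zip(labels, predictions))
    accuracy = sum(y == p for y, p in pairs) / n
    totals = {"weightedPrecision": 0.0, "weightedRecall": 0.0, "f1": 0.0}
    for cls, support in Counter(labels).items():
        tp = sum(y == cls and p == cls for y, p in pairs)
        predicted = sum(p == cls for _, p in pairs)
        # Assumption: a class that is never predicted gets precision 0.
        precision = tp / predicted if predicted else 0.0
        recall = tp / support
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weight = support / n  # each class is weighted by its share of the true labels
        totals["weightedPrecision"] += weight * precision
        totals["weightedRecall"] += weight * recall
        totals["f1"] += weight * f1
    return {"accuracy": accuracy, **totals}


toy_labels = [5, 1, 5, 5, 3, 1, 5, 5, 4, 1]
toy_predictions = [5] * len(toy_labels)
metrics = weighted_metrics(toy_labels, toy_predictions)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these definitions, weighted recall always equals accuracy, so weighted precision and F1 are the more informative numbers when a model predicts a single class.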

In [None]:
model.write().overwrite().save("s3a://dataops-example/nlp/models/greview_bert")

## Summary of the training runtime with varying configs

These are the runtimes recorded for this notebook. The distributed setup ran about 4 times faster than a local context. The best partition count grows with the size of the data, but beyond a certain point adding more partitions slows training down. Using Intel NumPy did not make training faster. The right configuration depends on many factors, but these results may serve as a guide.
![picture](img/picture17.png)
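
As one way to read such timings, the sketch below picks the fastest partition count from a set of runs. The numbers are copied from the full-dataset runs recorded later in this notebook (`executorCores` 4, `driverCores` 2); the variable names are illustrative only.

```python
# Training time in minutes, keyed by the number of partitions used.
full_data_minutes = {20: 32.23, 40: 33.29, 80: 30.43, 160: 30.43, 200: 29.98, 800: 32.27}

# Partition count with the shortest recorded training time.
best_partitions = min(full_data_minutes, key=full_data_minutes.get)

# Gap between the slowest and fastest run, to show how much the setting matters.
spread = max(full_data_minutes.values()) - min(full_data_minutes.values())

print(f"Fastest: {best_partitions} partitions ({full_data_minutes[best_partitions]:.2f}min)")
print(f"Slowest minus fastest: {spread:.2f}min")
```

These are single runs, so small differences between neighbouring settings may be noise rather than a real effect.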

#### Spark Data Distribution<a id="Spark"></a>
A Spark application consists of a driver process and a set of executor processes, each running in its own JVM, with resources allocated by the cluster manager.

![picture](img/picture11.png)

### Architecture of the Driver and Executors
The driver runs the application's main function and sits on a node in the cluster. It maintains information about the Spark application, responds to the user's program or input, and analyzes, distributes, and schedules the work across the executors.

Each executor executes the code that the driver assigns to it and reports the state of the computation back to the driver.

To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sit on one physical machine in the cluster. A DataFrame's partitions represent how the data is physically distributed across your cluster of machines during execution.
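
The idea of splitting rows into partitions can be sketched in plain Python. This is a simplified model of an even, round-robin style split and not Spark's actual implementation, which also moves the data between machines. The 16,000 rows and the partition counts 20 to 160 mirror the values used in this notebook; `round_robin_partition` is a hypothetical helper.

```python
def round_robin_partition(rows, num_partitions):
    """Deal rows out to partitions one at a time (simplified model)."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


rows = list(range(16000))  # stand-in for the 16,000 training rows

sizes_by_n = {}
for n in (20, 40, 80, 160):
    sizes_by_n[n] = [len(p) for p in round_robin_partition(rows, n)]
    print(f"{n} partitions: {min(sizes_by_n[n])} to {max(sizes_by_n[n])} rows each")
```

More partitions mean smaller chunks per task, which helps parallelism up to the point where scheduling overhead starts to dominate.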

![picture](img/picture12.jpeg)
![picture](img/picture13.jpeg)

### Cores, Slots, and Threads
Spark splits the work from the driver to the executors. An executor has a number of slots, each of which is assigned a task. A slot is commonly referred to as a CPU core in Spark and is implemented as a thread running on a physical core. The number of slots does not need to match the number of physical CPU cores on the machine.
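
A rough way to count the available slots for the configurations in this notebook is sketched below. It assumes one slot per executor core (Spark's default of one CPU per task) and uses the dynamic allocation bounds from the `%%configure` cells (2 to 10 executors); `slot_range` is a hypothetical helper.

```python
def slot_range(min_executors, max_executors, executor_cores, task_cpus=1):
    """Return the (minimum, maximum) number of tasks that can run at once."""
    slots_per_executor = executor_cores // task_cpus
    return min_executors * slots_per_executor, max_executors * slots_per_executor


slots = {}
for cores in (1, 2, 4):  # executorCores values used in this notebook
    slots[cores] = slot_range(2, 10, cores)
    print(f"executorCores={cores}: {slots[cores][0]} to {slots[cores][1]} slots")
```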

With this in mind, the driver breaks a given command into tasks and partitions sized to fit the cluster configuration. For example, with 4 nodes (1 driver and 3 executors), 8 cores per node and 2 threads per core, the executors provide 3 × 8 × 2 = 48 threads, so 48 tasks over 48 partitions (1 partition per task) can run in a single parallel pass.

If we instead assign 49 tasks and 49 partitions, the first pass executes 48 tasks in parallel across the executors' threads (say in 10 minutes). The one remaining task then runs alone for another 10 minutes while the other 47 threads sit idle, so the whole job takes 20 minutes, double the time. This wastes available resources and can be avoided by setting the number of tasks/partitions to a multiple of the number of available threads (in this setup 48, 96, etc.).
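
The arithmetic in this example can be written out as a short sketch. It assumes every task takes the same time and that a pass only finishes when all of its tasks are done; `job_minutes` is a hypothetical helper.

```python
import math


def job_minutes(num_tasks, num_slots, minutes_per_task):
    """Total job time when tasks run in passes of at most num_slots tasks."""
    passes = math.ceil(num_tasks / num_slots)
    return passes * minutes_per_task


slots = 3 * 8 * 2  # 3 executors x 8 cores x 2 threads per core

for tasks in (48, 49, 96):
    print(f"{tasks} tasks on {slots} slots: {job_minutes(tasks, slots, 10)} minutes")
```

Going from 48 to 49 tasks doubles the run time in this model, while 96 tasks take the same 20 minutes but keep every slot busy.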


![picture](img/picture14.jpeg)

To understand the performance of Spark NLP further, please refer to this [webinar](https://www.johnsnowlabs.com/watch-webinar-speed-optimization-benchmarks-in-spark-nlp-3-making-the-most-of-modern-hardware/) by the founder of John Snow Labs.

In [25]:
train_data_conf1= train_data.repartition(20)
train_data_conf2= train_data.repartition(40)
train_data_conf3= train_data.repartition(80)
train_data_conf4= train_data.repartition(160)

In [11]:
t0 = time.time()
model = pipeline.fit(train_data_conf1)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 3.16min

In [12]:
t0 = time.time()
model = pipeline.fit(train_data_conf2)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 2.38min

In [13]:
t0 = time.time()
model = pipeline.fit(train_data_conf3)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 2.50min

In [14]:
t0 = time.time()
model = pipeline.fit(train_data_conf4)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 2.67min

In [None]:
%%configure -f
{"conf": {
        "spark.sql.warehouse.dir" : "s3a://dataops-example/justin",
        "spark.hadoop.fs.s3a.access.key":"",
        "spark.hadoop.fs.s3a.secret.key": "",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer", 
        "spark.kryoserializer.buffer.max": "2000M",
        "spark.driver.maxResultSize": "0",
        "spark.kubernetes.container.image": "justinljg/dep:1.08",
        "spark.kubernetes.container.image.pullPolicy" : "Always",
        "spark.jsl.settings.pretrained.cache_folder": "/opt/spark/work-dir",
        "spark.kubernetes.driver.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.options.claimName": "lighter-sparknlptest-pvc",
        "spark.kubernetes.driver.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.mount.path": "/opt/spark/work-dir",
        "spark.kubernetes.executor.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.options.claimName": "lighter-sparknlptest-pvc",
        "spark.kubernetes.executor.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.mount.path": "/opt/spark/work-dir",
        "spark.jsl.settings.annotator.log_folder": "/opt/spark/work-dir/logs",
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "2",
        "spark.dynamicAllocation.maxExecutors": "10"
    },
 "executorMemory": "3G",
 "executorCores": 2,
 "driverMemory": "16G",
 "driverCores": 1
}

In [9]:
train_data_conf1= train_data.repartition(20)
train_data_conf2= train_data.repartition(40)
train_data_conf3= train_data.repartition(80)
train_data_conf4= train_data.repartition(160)

In [10]:
t0 = time.time()
model = pipeline.fit(train_data_conf1)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 2.09min

In [11]:
t0 = time.time()
model = pipeline.fit(train_data_conf2)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 1.65min

In [12]:
t0 = time.time()
model = pipeline.fit(train_data_conf3)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 1.52min

In [13]:
t0 = time.time()
model = pipeline.fit(train_data_conf4)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 1.69min

In [14]:
train_data, test_data = df.select("text", "label").orderBy(rand()).randomSplit([0.8, 0.2], seed=42)
train_data= train_data.limit(32000)
test_data= test_data.limit(8000)

In [15]:
train_data_conf1= train_data.repartition(20)
train_data_conf2= train_data.repartition(40)
train_data_conf3= train_data.repartition(80)
train_data_conf4= train_data.repartition(160)

In [16]:
t0 = time.time()
model = pipeline.fit(train_data_conf1)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 3.06min

In [17]:
t0 = time.time()
model = pipeline.fit(train_data_conf2)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 3.27min

In [18]:
t0 = time.time()
model = pipeline.fit(train_data_conf3)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 3.18min

In [19]:
t0 = time.time()
model = pipeline.fit(train_data_conf4)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 3.18min

In [None]:
%%configure -f
{"conf": {
        "spark.sql.warehouse.dir" : "s3a://dataops-example/justin",
        "spark.hadoop.fs.s3a.access.key":"",
        "spark.hadoop.fs.s3a.secret.key": "",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer", 
        "spark.kryoserializer.buffer.max": "2000M",
        "spark.driver.maxResultSize": "0",
        "spark.kubernetes.container.image": "justinljg/dep:1.08",
        "spark.kubernetes.container.image.pullPolicy" : "Always",
        "spark.jsl.settings.pretrained.cache_folder": "/opt/spark/work-dir",
        "spark.kubernetes.driver.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.options.claimName": "lighter-sparknlptest-pvc",
        "spark.kubernetes.driver.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.mount.path": "/opt/spark/work-dir",
        "spark.kubernetes.executor.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.options.claimName": "lighter-sparknlptest-pvc",
        "spark.kubernetes.executor.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.mount.path": "/opt/spark/work-dir",
        "spark.jsl.settings.annotator.log_folder": "/opt/spark/work-dir/logs",
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "2",
        "spark.dynamicAllocation.maxExecutors": "10"
    },
 "executorMemory": "3G",
 "executorCores": 2,
 "driverMemory": "16G",
 "driverCores": 2
}

In [43]:
train_data_conf1= train_data.repartition(20)
train_data_conf2= train_data.repartition(40)
train_data_conf3= train_data.repartition(80)
train_data_conf4= train_data.repartition(160)

In [44]:
t0 = time.time()
model = pipeline.fit(train_data_conf1)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 1.78min

In [45]:
t0 = time.time()
model = pipeline.fit(train_data_conf2)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 1.71min

In [46]:
t0 = time.time()
model = pipeline.fit(train_data_conf3)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 1.57min

In [47]:
t0 = time.time()
model = pipeline.fit(train_data_conf4)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 1.79min

In [48]:
train_data, test_data = df.select("text", "label").orderBy(rand()).randomSplit([0.8, 0.2], seed=42)
train_data= train_data.limit(32000)
test_data= test_data.limit(8000)

In [49]:
train_data_conf1= train_data.repartition(20)
train_data_conf2= train_data.repartition(40)
train_data_conf3= train_data.repartition(80)
train_data_conf4= train_data.repartition(160)

In [50]:
t0 = time.time()
model = pipeline.fit(train_data_conf1)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 3.28min

In [51]:
t0 = time.time()
model = pipeline.fit(train_data_conf2)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 3.57min

In [52]:
t0 = time.time()
model = pipeline.fit(train_data_conf3)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 2.93min

In [53]:
t0 = time.time()
model = pipeline.fit(train_data_conf4)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 3.11min

In [None]:
%%configure -f
{"conf": {
        "spark.sql.warehouse.dir" : "s3a://dataops-example/justin",
        "spark.hadoop.fs.s3a.access.key":"",
        "spark.hadoop.fs.s3a.secret.key": "",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer", 
        "spark.kryoserializer.buffer.max": "2000M",
        "spark.driver.maxResultSize": "0",
        "spark.kubernetes.container.image": "justinljg/dep:1.08",
        "spark.kubernetes.container.image.pullPolicy" : "Always",
        "spark.jsl.settings.pretrained.cache_folder": "/opt/spark/work-dir",
        "spark.kubernetes.driver.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.options.claimName": "lighter-sparknlptest-pvc",
        "spark.kubernetes.driver.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.mount.path": "/opt/spark/work-dir",
        "spark.kubernetes.executor.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.options.claimName": "lighter-sparknlptest-pvc",
        "spark.kubernetes.executor.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.mount.path": "/opt/spark/work-dir",
        "spark.jsl.settings.annotator.log_folder": "/opt/spark/work-dir/logs",
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "2",
        "spark.dynamicAllocation.maxExecutors": "10"
    },
 "executorMemory": "3G",
 "executorCores": 4,
 "driverMemory": "16G",
 "driverCores": 2
}

In [9]:
train_data_conf1= train_data.repartition(20)
train_data_conf2= train_data.repartition(40)
train_data_conf3= train_data.repartition(80)
train_data_conf4= train_data.repartition(160)

In [10]:
t0 = time.time()
model = pipeline.fit(train_data_conf1)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 2.16min

In [11]:
t0 = time.time()
model = pipeline.fit(train_data_conf2)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 1.73min

In [12]:
t0 = time.time()
model = pipeline.fit(train_data_conf3)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 1.82min

In [13]:
t0 = time.time()
model = pipeline.fit(train_data_conf4)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 2.31min

In [14]:
train_data, test_data = df.select("text", "label").orderBy(rand()).randomSplit([0.8, 0.2], seed=42)
train_data= train_data.limit(32000)
test_data= test_data.limit(8000)

In [15]:
train_data_conf1= train_data.repartition(20)
train_data_conf2= train_data.repartition(40)
train_data_conf3= train_data.repartition(80)
train_data_conf4= train_data.repartition(160)

In [16]:
t0 = time.time()
model = pipeline.fit(train_data_conf1)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 3.72min

In [17]:
t0 = time.time()
model = pipeline.fit(train_data_conf2)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 3.68min

In [18]:
t0 = time.time()
model = pipeline.fit(train_data_conf3)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 3.40min

In [19]:
t0 = time.time()
model = pipeline.fit(train_data_conf4)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 3.31min

In [20]:
train_data, test_data = df.select("text", "label").orderBy(rand()).randomSplit([0.8, 0.2], seed=42)
train_data= train_data.limit(64000)
test_data= test_data.limit(16000)

In [21]:
train_data_conf1= train_data.repartition(20)
train_data_conf2= train_data.repartition(40)
train_data_conf3= train_data.repartition(80)
train_data_conf4= train_data.repartition(160)

In [22]:
t0 = time.time()
model = pipeline.fit(train_data_conf1)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 6.49min

In [23]:
t0 = time.time()
model = pipeline.fit(train_data_conf2)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 6.33min

In [24]:
t0 = time.time()
model = pipeline.fit(train_data_conf3)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 6.31min

In [25]:
t0 = time.time()
model = pipeline.fit(train_data_conf4)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 5.92min

In [26]:
train_data, test_data = df.select("text", "label").orderBy(rand()).randomSplit([0.8, 0.2], seed=42)

In [42]:
train_data_conf1= train_data.repartition(20)
train_data_conf2= train_data.repartition(40)
train_data_conf3= train_data.repartition(80)
train_data_conf4= train_data.repartition(160)
train_data_conf5= train_data.repartition(200)
train_data_conf6= train_data.repartition(800)

In [28]:
t0 = time.time()
model = pipeline.fit(train_data_conf1)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 32.23min

In [29]:
t0 = time.time()
model = pipeline.fit(train_data_conf2)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 33.29min

In [30]:
t0 = time.time()
model = pipeline.fit(train_data_conf3)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 30.43min

In [38]:
t0 = time.time()
model = pipeline.fit(train_data_conf4)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 30.43min

In [39]:
t0 = time.time()
model = pipeline.fit(train_data_conf5)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 29.98min

In [43]:
t0 = time.time()
model = pipeline.fit(train_data_conf6)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 32.27min

In [None]:
%%configure -f
{"conf": {
        "spark.sql.warehouse.dir" : "s3a://dataops-example/justin",
        "spark.hadoop.fs.s3a.access.key":"",
        "spark.hadoop.fs.s3a.secret.key": "",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer", 
        "spark.kryoserializer.buffer.max": "2000M",
        "spark.driver.maxResultSize": "0",
        "spark.kubernetes.container.image": "justinljg/dep:1.06",
        "spark.kubernetes.container.image.pullPolicy" : "Always",
        "spark.jsl.settings.pretrained.cache_folder": "/opt/spark/work-dir",
        "spark.kubernetes.driver.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.options.claimName": "lighter-sparknlptest-pvc",
        "spark.kubernetes.driver.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.mount.path": "/opt/spark/work-dir",
        "spark.kubernetes.executor.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.options.claimName": "lighter-sparknlptest-pvc",
        "spark.kubernetes.executor.volumes.persistentVolumeClaim.lighter-sparknlptest-pvc.mount.path": "/opt/spark/work-dir",
        "spark.jsl.settings.annotator.log_folder": "/opt/spark/work-dir/logs",
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "2",
        "spark.dynamicAllocation.maxExecutors": "10"
    },
 "executorMemory": "3G",
 "executorCores": 4,
 "driverMemory": "16G",
 "driverCores": 2
}

In [7]:
train_data, test_data = df.select("text", "label").orderBy(rand()).randomSplit([0.8, 0.2], seed=42)

In [8]:
train_data_conf4= train_data.repartition(160)

In [9]:
t0 = time.time()
model = pipeline.fit(train_data_conf4)
print(f"Training time: {(time.time() - t0)/60:.2f}min")

Training time: 51.18min