### 4.10.1 Starting the Spark session

In [1]:
# Running this cell starts the Spark application

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
1,application_1679072896953_0002,pyspark,idle,Link,Link,,✔



SparkSession available as 'spark'.



Display information about the current session, with links to the Spark UI:

In [2]:
%%info

ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
1,application_1679072896953_0002,pyspark,idle,Link,Link,,✔


### 4.10.2 Installing packages

In [None]:
The required packages were installed by the **bootstrap** step when the server was instantiated.

### 4.10.3 Importing libraries

In [3]:
import pandas as pd
import numpy as np
import io
import os
import tensorflow as tf
from PIL import Image
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras import Model
from pyspark.sql.functions import col, pandas_udf, PandasUDFType, element_at, split


### 4.10.5 Data processing

In [4]:
PATH = 's3://p8-data-fruit-cbb'
PATH_Data = PATH+'/Test'
PATH_Result = PATH+'/Results'
print('PATH:        '+\
      PATH+'\nPATH_Data:   '+\
      PATH_Data+'\nPATH_Result: '+PATH_Result)


PATH:        s3://p8-data-fruit-cbb
PATH_Data:   s3://p8-data-fruit-cbb/Test
PATH_Result: s3://p8-data-fruit-cbb/Results

#### 4.10.5.1 Loading the data

In [None]:
The images are loaded in binary format, which offers
more flexibility in how they are preprocessed.

Before loading the images, we specify that we only want to load
files with the jpg extension.
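As a rough illustration of what a `*.jpg` glob keeps and drops, here is a minimal pure-Python sketch. It only mimics the idea of matching on the file name; Spark performs the real matching itself, and the paths and the `keep_jpg` helper are hypothetical. It also assumes case-sensitive matching, so a `.JPG` file would be skipped.

```python
from fnmatch import fnmatchcase

# Hypothetical object keys, used only for illustration.
candidate_paths = [
    "Test/Pineapple/99_100.jpg",
    "Test/Pineapple/readme.txt",
    "Test/Walnut/18_100.JPG",
    "Test/Walnut/28_100.jpg",
]

def keep_jpg(paths, pattern="*.jpg"):
    """Keep the paths whose file name (last path component) matches the glob."""
    return [p for p in paths if fnmatchcase(p.rsplit("/", 1)[-1], pattern)]

selected = keep_jpg(candidate_paths)
print(selected)
```

If the dataset mixed `.jpg` and `.JPG` files, the filter pattern would need to account for both.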


In [5]:
# data handling
from pyspark.sql.functions import element_at, split
from pyspark.sql.functions import pandas_udf, PandasUDFType
# from pyspark.sql.functions import col
from pyspark.sql.functions import udf
from pyspark.sql.types import *
from typing import Iterator


In [6]:
# ml tasks
from pyspark.ml.image import ImageSchema
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import PCA


In [7]:
# transform
from pyspark.ml.linalg import Vectors, VectorUDT


In [8]:
images = spark.read.format("binaryFile") \
  .option("pathGlobFilter", "*.jpg") \
  .option("recursiveFileLookup", "true") \
  .load(PATH_Data)


In [None]:
<u>Display of the first 5 rows, each containing</u>:
 - the image path
 - the date and time of its last modification
 - its length in bytes
 - its content, shown as hexadecimal byte values

In [9]:
images.show(5)


+--------------------+-------------------+------+--------------------+
|                path|   modificationTime|length|             content|
+--------------------+-------------------+------+--------------------+
|s3://p8-data-frui...|2023-03-15 13:40:47|  6574|[FF D8 FF E0 00 1...|
|s3://p8-data-frui...|2023-03-15 13:41:47|  6574|[FF D8 FF E0 00 1...|
|s3://p8-data-frui...|2023-03-15 13:40:47|  6572|[FF D8 FF E0 00 1...|
|s3://p8-data-frui...|2023-03-15 13:41:47|  6572|[FF D8 FF E0 00 1...|
|s3://p8-data-frui...|2023-03-15 13:40:47|  6571|[FF D8 FF E0 00 1...|
+--------------------+-------------------+------+--------------------+
only showing top 5 rows

In [None]:
<u>I add a column containing the **label** of each image, taken from its parent folder name, <br />
    and display only the **path** and **label** columns</u>:

In [10]:
images = images.withColumn('label', element_at(split(images['path'], '/'),-2))
# printSchema() and show() print directly and return None, so they are not wrapped in print()
images.printSchema()
images.select('path','label').show(5,False)

root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- content: binary (nullable = true)
 |-- label: string (nullable = true)

+------------------------------------------------------+--------------+
|path                                                  |label         |
+------------------------------------------------------+--------------+
|s3://p8-data-fruit-cbb/Test/Pineapple/99_100.jpg      |Pineapple     |
|s3://p8-data-fruit-cbb/Test/Pineapple Mini/99_100.jpg |Pineapple Mini|
|s3://p8-data-fruit-cbb/Test/Pineapple/143_100.jpg     |Pineapple     |
|s3://p8-data-fruit-cbb/Test/Pineapple Mini/143_100.jpg|Pineapple Mini|
|s3://p8-data-fruit-cbb/Test/Pineapple/144_100.jpg     |Pineapple     |
+------------------------------------------------------+--------------+
only showing top 5 rows
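The label extraction relies on the folder layout `.../<label>/<file>.jpg`. A plain-Python equivalent of `element_at(split(path, '/'), -2)` could look like the sketch below; `label_from_path` is a hypothetical helper, not part of the notebook.

```python
def label_from_path(path):
    """Return the parent folder name, i.e. the second-to-last '/'-separated part."""
    return path.split("/")[-2]

label = label_from_path("s3://p8-data-fruit-cbb/Test/Pineapple Mini/99_100.jpg")
print(label)
```

Folder names containing spaces, such as `Pineapple Mini`, are preserved because the split is on `/` only.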

#### 4.10.5.2 Preparing the model

In [None]:
I will use transfer learning to extract features from the images.
I chose the MobileNetV2 model for its fast execution.

In [None]:
In order:

    We load the MobileNetV2 model with weights pretrained
    on ImageNet, specifying the input format of our images
    We create a new model with:
        as input: the input of the MobileNetV2 model
        as output: the second-to-last layer of the MobileNetV2 model


In [11]:
model = MobileNetV2(weights='imagenet',
                    include_top=True,
                    input_shape=(224, 224, 3))


In [12]:
new_model = Model(inputs=model.input,
                  outputs=model.layers[-2].output)


In [14]:
# Display the summary of our new model, where we can check
# that the output is a vector of dimension 1280 (output shape (None, 1280)):


In [13]:
new_model.summary()


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 224, 224, 3) 0                                            
__________________________________________________________________________________________________
Conv1 (Conv2D)                  (None, 112, 112, 32) 864         input_1[0][0]                    
__________________________________________________________________________________________________
bn_Conv1 (BatchNormalization)   (None, 112, 112, 32) 128         Conv1[0][0]                      
__________________________________________________________________________________________________
Conv1_relu (ReLU)               (None, 112, 112, 32) 0           bn_Conv1[0][0]                   
______________________________________________________________________________________________
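To the best of my knowledge, the second-to-last layer of Keras's MobileNetV2 (with `include_top=True`) is a global average pooling layer, which is why one flat vector per image comes out. The sketch below shows that operation on a toy 2 x 2 x 3 feature map in plain Python; the values and the `global_average_pool` helper are made up for illustration.

```python
def global_average_pool(feature_map):
    """Average a height x width x channels nested list over its spatial positions."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    sums = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for c, value in enumerate(pixel):
                sums[c] += value
    return [s / (height * width) for s in sums]

# Toy 2 x 2 x 3 feature map (hypothetical values); the real one is much larger.
toy_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 0.0, 2.0], [7.0, 4.0, 2.0]],
]
pooled = global_average_pool(toy_map)
print(pooled)
```

The result has one value per channel, so a feature map with 1280 channels yields a vector of length 1280.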

In [None]:
All workers must be able to access the model and its weights.
A good practice is to load the model on the driver and then
broadcast the weights to the workers.
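Conceptually, the pattern is: serialize the weights once on the driver, then have each worker rebuild the same architecture and restore those weights. The sketch below imitates this with `pickle` and a hypothetical `TinyModel` class standing in for a Keras model; it does not use Spark and only illustrates the `get_weights` / `set_weights` round trip.

```python
import pickle

class TinyModel:
    """Hypothetical stand-in for a Keras model exposing get_weights/set_weights."""
    def __init__(self):
        self._weights = [[0.0, 0.0], [0.0]]

    def get_weights(self):
        return [list(w) for w in self._weights]

    def set_weights(self, weights):
        self._weights = [list(w) for w in weights]

# Driver side: a model with "trained" weights (made-up values), serialized once.
driver_model = TinyModel()
driver_model.set_weights([[0.5, -1.25], [2.0]])
payload = pickle.dumps(driver_model.get_weights())

# Worker side: rebuild the same architecture, then restore the shipped weights.
worker_model = TinyModel()
worker_model.set_weights(pickle.loads(payload))
print(worker_model.get_weights())
```

Shipping only the weights keeps the payload to plain arrays, which are simpler to serialize than a full model object.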

In [14]:
brodcast_weights = sc.broadcast(new_model.get_weights())


In [None]:
Let's wrap this in a function:

In [15]:
def model_fn():
    """
    Returns a MobileNetV2 model with top layer removed 
    and broadcasted pretrained weights.
    """
    model = MobileNetV2(weights='imagenet',
                        include_top=True,
                        input_shape=(224, 224, 3))
    for layer in model.layers:
        layer.trainable = False
    new_model = Model(inputs=model.input,
                  outputs=model.layers[-2].output)
    new_model.set_weights(brodcast_weights.value)
    return new_model


#### 4.10.5.3 Defining the image loading process and featurizing the images with a pandas UDF

In [None]:
This notebook defines the logic step by step, up to the pandas UDF.

The call stack is as follows:

    Pandas UDF
        featurize a pd.Series of images
            preprocess one image
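The benefit of the iterator form is that the model is loaded once and reused for every batch. A minimal plain-Python sketch of that pattern, with a hypothetical `load_model` that simply doubles values, might look like this:

```python
load_count = 0

def load_model():
    """Hypothetical expensive model loader; here the 'model' just doubles values."""
    global load_count
    load_count += 1
    return lambda batch: [2 * x for x in batch]

def featurize_batches(batch_iter):
    """Load the model once, then reuse it for every incoming batch."""
    model = load_model()
    for batch in batch_iter:
        yield model(batch)

batches = [[1, 2], [3], [4, 5, 6]]
results = list(featurize_batches(iter(batches)))
print(results, load_count)
```

Three batches are processed but `load_model` runs only once, which is the amortization the Scalar Iterator pandas UDF aims for.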


In [16]:
def preprocess(content):
    """
    Preprocesses raw image bytes for prediction.
    """
    img = Image.open(io.BytesIO(content)).resize([224, 224])
    arr = img_to_array(img)
    return preprocess_input(arr)

def featurize_series(model, content_series):
    """
    Featurize a pd.Series of raw images using the input model.
    :return: a pd.Series of image features
    """
    input = np.stack(content_series.map(preprocess))
    preds = model.predict(input)
    # For some layers, output features will be multi-dimensional tensors.
    # We flatten the feature tensors to vectors for easier storage in Spark DataFrames.
    output = [p.flatten() for p in preds]
    return pd.Series(output)

@pandas_udf('array<float>', PandasUDFType.SCALAR_ITER)
def featurize_udf(content_series_iter):
    '''
    This method is a Scalar Iterator pandas UDF wrapping our featurization function.
    The decorator specifies that this returns a Spark DataFrame column of type ArrayType(FloatType).

    :param content_series_iter: This argument is an iterator over batches of data, where each batch
                              is a pandas Series of image data.
    '''
    # With Scalar Iterator pandas UDFs, we can load the model once and then re-use it
    # for multiple data batches.  This amortizes the overhead of loading big models.
    model = model_fn()
    for content_series in content_series_iter:
        yield featurize_series(model, content_series)




In [None]:
Pandas UDFs can run into Out Of Memory (OOM) errors
on large records (for example, very large images).
If you encounter such errors in the cell below,
try reducing the Arrow batch size via 'maxRecordsPerBatch'.

In [None]:
# spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1024")
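To get a feel for why the batch size matters, the sketch below estimates the memory needed to stack one batch of preprocessed images. It assumes each image becomes a 224 x 224 x 3 float32 array, and that the default Arrow batch size is 10000 records (to my knowledge, the Spark default); `batch_bytes` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4

def batch_bytes(records_per_batch, height=224, width=224, channels=3):
    """Bytes needed for one stacked float32 batch of preprocessed images."""
    return records_per_batch * height * width * channels * BYTES_PER_FLOAT32

for n in (10000, 1024, 256):
    print(f"{n:>6} records -> {batch_bytes(n) / 2**20:,.0f} MiB")
```

Under these assumptions a 10000-record batch needs roughly 5.6 GiB for the stacked input alone, against 588 MiB for 1024 records, not counting the raw bytes or the model's activations.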

#### 4.10.5.4 Running the feature extraction

In [None]:
We can now run the featurization on our entire Spark DataFrame.
NOTE: This can take a long time, depending on the volume of data to process.

Our Test dataset contains 22819 images.
However, when running in local mode,
we will process a reduced set of 300 images.


In [17]:
features_df = images.repartition(24).select(col("path"),
                                            col("label"),
                                            featurize_udf("content").alias("features")
                                           )


In [18]:
features_df.printSchema()


root
 |-- path: string (nullable = true)
 |-- label: string (nullable = true)
 |-- features: array (nullable = true)
 |    |-- element: float (containsNull = true)

In [19]:
# get the number of partitions
print(features_df.rdd.getNumPartitions())


24
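As a back-of-the-envelope check, the sketch below computes how many rows each partition would hold if the rows were spread as evenly as possible. This is an assumption about `repartition(24)`; actual partition sizes may differ slightly. `partition_sizes` is a hypothetical helper.

```python
def partition_sizes(n_rows, n_partitions):
    """Sizes obtained when rows are spread as evenly as possible over partitions."""
    base, extra = divmod(n_rows, n_partitions)
    return [base + 1 if i < extra else base for i in range(n_partitions)]

sizes = partition_sizes(300, 24)
print(min(sizes), max(sizes), sum(sizes))
```

With 300 images this gives 12 or 13 images per partition, so each partition stays far below the sizes where memory per batch becomes a concern.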

In [20]:
#start = time.perf_counter()
features_df.show()
#stop = time.perf_counter()
#print(f'data load with spark.read, elapsed time: {stop - start:0.2f}s')


+--------------------+-----------------+--------------------+
|                path|            label|            features|
+--------------------+-----------------+--------------------+
|s3://p8-data-frui...|   Pineapple Mini|[0.0, 3.8690548, ...|
|s3://p8-data-frui...|   Pineapple Mini|[0.0, 4.6625853, ...|
|s3://p8-data-frui...|           Walnut|[0.04131617, 0.03...|
|s3://p8-data-frui...|           Walnut|[0.15421982, 0.0,...|
|s3://p8-data-frui...|            Peach|[0.68647873, 0.25...|
|s3://p8-data-frui...|          Avocado|[0.4176806, 0.0, ...|
|s3://p8-data-frui...|          Avocado|[0.47646114, 0.0,...|
|s3://p8-data-frui...|    Passion Fruit|[0.065396085, 0.0...|
|s3://p8-data-frui...|          Avocado|[0.52295953, 0.0,...|
|s3://p8-data-frui...|Tomato Cherry Red|[0.0, 2.183555, 0...|
|s3://p8-data-frui...|         Physalis|[0.0, 0.8341802, ...|
|s3://p8-data-frui...|         Physalis|[0.0, 0.5979265, ...|
|s3://p8-data-frui...|   Pineapple Mini|[0.0, 3.5926228, ...|
|s3://p8

In [None]:
Reminder of the PATH where the Parquet files containing our results
will be written, namely a DataFrame with 3 columns:

    Image path
    Image label
    Image feature vector



In [21]:
print(PATH_Result)


s3://p8-data-fruit-cbb/Results

In [None]:
Saving the processed data in Parquet format:

In [22]:
features_df.write.mode("overwrite").parquet(PATH_Result)


### 4.10.6 Loading the saved data and validating the result

In [None]:
We load the freshly saved data into a pandas DataFrame:

In [23]:
# read local results from parquet file
#start = time.perf_counter()
import pyarrow.parquet as pq
pd_final_df = pq.read_table(source=PATH_Result).to_pandas()
#stop = time.perf_counter()
#print(f'read local, elapsed time: {stop - start:0.2f}s')


In [26]:
# this way of reading does not work
#df = pd.read_parquet(PATH_Result, engine='pyarrow')


In [None]:
We display the first 5 rows of the DataFrame:

In [24]:
pd_final_df.head()


                                                path  ...                                           features
0  s3://p8-data-fruit-cbb/Test/Pineapple Mini/25_...  ...  [0.0, 3.8690548, 0.0, 0.0, 0.0, 0.0, 0.0072968...
1  s3://p8-data-fruit-cbb/Test/Pineapple Mini/17_...  ...  [0.0, 4.6625853, 0.15033168, 0.0, 0.0002589650...
2      s3://p8-data-fruit-cbb/Test/Walnut/18_100.jpg  ...  [0.04131617, 0.039828468, 0.0, 0.0, 0.64292824...
3      s3://p8-data-fruit-cbb/Test/Walnut/28_100.jpg  ...  [0.15421982, 0.0, 0.0, 0.0, 0.13024212, 0.0, 1...
4       s3://p8-data-fruit-cbb/Test/Peach/36_100.jpg  ...  [0.68647873, 0.25965068, 0.0, 0.0, 0.26959205,...

[5 rows x 3 columns]

In [25]:
# size of the results df
pd_final_df.info(verbose=False, memory_usage="deep")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Columns: 3 entries, path to features
dtypes: object(3)
memory usage: 83.5 KB

In [None]:
We check that the image feature vectors do have dimension 1280:

In [26]:
pd_final_df.loc[0,'features'].shape


(1280,)
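The check above looks at the first row only. A small helper such as the hypothetical `check_feature_dims` below could verify every row; it is a sketch on toy data, not part of the original notebook.

```python
def check_feature_dims(vectors, expected=1280):
    """Return the indices of vectors whose length differs from `expected`."""
    return [i for i, v in enumerate(vectors) if len(v) != expected]

# Toy data (hypothetical): two well-formed vectors and one truncated vector.
toy_vectors = [[0.0] * 1280, [1.0] * 1280, [0.5] * 10]
bad = check_feature_dims(toy_vectors)
print(bad)
```

Applied to the real data, `check_feature_dims(pd_final_df['features'])` would be expected to return an empty list.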

### 4.10.7 Dimensionality reduction

In [27]:
# from Array to Vectors for PCA
array_to_vector_udf = udf(lambda l: Vectors.dense(l), VectorUDT())


In [28]:
vectorized_df = features_df.withColumn('netV2_vectors', array_to_vector_udf('features'))


In [29]:
vectorized_df.show(5, True)


+--------------------+--------------+--------------------+--------------------+
|                path|         label|            features|       netV2_vectors|
+--------------------+--------------+--------------------+--------------------+
|s3://p8-data-frui...|Pineapple Mini|[0.0, 3.8690548, ...|[0.0,3.8690547943...|
|s3://p8-data-frui...|Pineapple Mini|[0.0, 4.6625853, ...|[0.0,4.6625852584...|
|s3://p8-data-frui...|        Walnut|[0.04131617, 0.03...|[0.04131617024540...|
|s3://p8-data-frui...|        Walnut|[0.15421982, 0.0,...|[0.15421982109546...|
|s3://p8-data-frui...|         Peach|[0.68647873, 0.25...|[0.68647873401641...|
+--------------------+--------------+--------------------+--------------------+
only showing top 5 rows

#### 4.10.7.1 Initializing the PCA

In [30]:
# reduce with PCA - set k Max to determine the adequate nb of principal components
#start = time.perf_counter()
pca = PCA(k=20, inputCol='netV2_vectors', outputCol='pca_vectors')
model_pca = pca.fit(vectorized_df)
#stop = time.perf_counter()
#print(f'pca - fit best k nb, elapsed time: {stop - start:0.2f}s')


In [31]:
# apply pca reduction
#start = time.perf_counter()
reduced_df = model_pca.transform(vectorized_df)
#stop = time.perf_counter()
#print(f'pca - application, elapsed time: {stop - start:0.2f}s')


In [32]:
reduced_df.show(5, True)


+--------------------+--------------+--------------------+--------------------+--------------------+
|                path|         label|            features|       netV2_vectors|         pca_vectors|
+--------------------+--------------+--------------------+--------------------+--------------------+
|s3://p8-data-frui...|Pineapple Mini|[0.0, 3.8690548, ...|[0.0,3.8690547943...|[15.2829704533087...|
|s3://p8-data-frui...|Pineapple Mini|[0.0, 4.6625853, ...|[0.0,4.6625852584...|[14.9522592809902...|
|s3://p8-data-frui...|        Walnut|[0.04131617, 0.03...|[0.04131617024540...|[1.55715939618987...|
|s3://p8-data-frui...|        Walnut|[0.15421982, 0.0,...|[0.15421982109546...|[0.77984547046876...|
|s3://p8-data-frui...|         Peach|[0.68647873, 0.25...|[0.68647873401641...|[-3.8481653083153...|
+--------------------+--------------+--------------------+--------------------+--------------------+
only showing top 5 rows
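The PCA cell's comment mentions finding an adequate number of principal components. One common way to do that is to pick the smallest k whose cumulative explained variance reaches a target. The sketch below shows the idea on made-up ratios; `smallest_k` and the 0.9 threshold are illustrative assumptions, not the notebook's method.

```python
def smallest_k(explained_ratios, threshold=0.9):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = 0.0
    for k, ratio in enumerate(explained_ratios, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return k
    return None  # threshold not reached with the available components

# Made-up ratios for illustration; real values would come from the fitted model.
ratios = [0.5, 0.25, 0.125, 0.0625, 0.03125]
best_k = smallest_k(ratios, threshold=0.9)
print(best_k)
```

With the fitted model, the ratios could be taken from `model_pca.explainedVariance` (for instance via `.toArray()`); a `None` result would suggest refitting with a larger `k`.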

In [None]:
# Inverse transform: from Vectors to Array - i.e. Pandas readability

In [33]:
# from Vectors back to Array after PCA
vector_to_array_udf = udf(lambda v: v.toArray().tolist(), ArrayType(FloatType()))


In [34]:
final_df_pca = reduced_df.withColumn('pca_features', vector_to_array_udf('pca_vectors'))


In [35]:
final_df_pca.show(5, True)


+--------------------+--------------+--------------------+--------------------+--------------------+--------------------+
|                path|         label|            features|       netV2_vectors|         pca_vectors|        pca_features|
+--------------------+--------------+--------------------+--------------------+--------------------+--------------------+
|s3://p8-data-frui...|Pineapple Mini|[0.0, 3.8690548, ...|[0.0,3.8690547943...|[15.2829704533087...|[15.28297, 0.4833...|
|s3://p8-data-frui...|Pineapple Mini|[0.0, 4.6625853, ...|[0.0,4.6625852584...|[14.9522592809902...|[14.952259, 0.412...|
|s3://p8-data-frui...|        Walnut|[0.04131617, 0.03...|[0.04131617024540...|[1.55715939618987...|[1.5571594, 0.628...|
|s3://p8-data-frui...|        Walnut|[0.15421982, 0.0,...|[0.15421982109546...|[0.77984547046876...|[0.7798455, 0.205...|
|s3://p8-data-frui...|         Peach|[0.68647873, 0.25...|[0.68647873401641...|[-3.8481653083153...|[-3.8481653, 0.19...|
+--------------------+--

### 4.10.8 Results storage

In [None]:
# The write is an action; it could be the only action that triggers the whole computation

In [36]:
# write local results on parquet file
#start = time.perf_counter()
final_df_pca.write.mode('overwrite').parquet(PATH_Result)
#stop = time.perf_counter()
#print(f'write local, elapsed time: {stop - start:0.2f}s')


In [37]:
# convert to csv
#final_df_pca.write.mode('overwrite').csv("PATH_Result+/zipcodes.csv") does not work directly because of the array-type columns

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'

array_to_string_udf = udf(array_to_string, StringType())
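Since the CSV stores each vector as a bracketed string, whoever reads the file will need to parse it back. The sketch below shows a round trip in plain Python. `string_to_array` is a hypothetical helper, and it assumes all values are finite numbers (a `nan` or `inf` would not parse this way).

```python
import json

def array_to_string(values):
    """Same formatting as the notebook's helper: '[v1,v2,...]'."""
    return "[" + ",".join(str(v) for v in values) + "]"

def string_to_array(text):
    """Parse the bracketed string back into a list of floats."""
    return [float(x) for x in json.loads(text)]

encoded = array_to_string([0.0, 3.8690548, 15.28297])
decoded = string_to_array(encoded)
print(encoded, decoded)
```

Because the string contains commas, the CSV writer has to quote the field for the file to stay readable by standard CSV parsers.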



In [38]:
final_df_pca = final_df_pca.withColumn('features_string', array_to_string_udf(final_df_pca["features"]))


In [39]:
final_df_pca = final_df_pca.withColumn('netV2_vectors_string', array_to_string_udf(final_df_pca["netV2_vectors"]))


In [40]:
final_df_pca = final_df_pca.withColumn('pca_vectors_string', array_to_string_udf(final_df_pca["pca_vectors"]))


In [41]:
final_df_pca = final_df_pca.withColumn('pca_features_string', array_to_string_udf(final_df_pca["pca_features"]))


In [42]:
# drop the array-type columns
#final_df_pca.drop...
final_df_pca.drop("features","netV2_vectors","pca_vectors","pca_features").write.csv("s3://p8-data-fruit-cbb/Results/p8fruits.csv")


In [44]:
spark.stop()
