# Sending data to a Spark cluster from a local instance

Suppose you would like to send a pretrained scikit-learn model to your Spark cluster (e.g. for further use with other packages such as `spark-sklearn`).

**Warning:** this example assumes that the (Py)Spark cluster and your local machine have the same versions of the Python packages involved.

The following code requires numpy, scipy, pandas and scikit-learn to be installed.
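One way to check the version assumption is to print the installed versions on each side, once in a `%%local` cell and once in a Spark cell, and compare the two outputs. The sketch below uses only the standard library (Python 3.8 or later); the helper name `package_versions` and the package list are illustrative and not part of sparkmagic.

```python
from importlib.metadata import PackageNotFoundError, version


def package_versions(names):
    """Return {distribution name: installed version, or None if missing}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Hypothetical list: the packages this example relies on.
for name, ver in package_versions(["numpy", "scipy", "pandas", "scikit-learn"]).items():
    print(name, ver)
```

If the two outputs differ, unpickling the model on the cluster may fail or behave differently.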

Train a decision tree locally and print its weighted precision on the test set.

In [1]:
%%local
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import pandas as pd
import pickle
from sklearn import tree
from sklearn.metrics import precision_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=1)
X_test_pd = pd.DataFrame(X_test, columns=['a','b','c','d'])
Y_test_pd = pd.DataFrame(y_test, columns=['pred'])

decision_tree = tree.DecisionTreeClassifier()
decision_tree_model = decision_tree.fit(X_train, y_train)

y_pred = decision_tree_model.predict(X_test)
precision_score(y_test, y_pred, average='weighted')

0.97631578947368425
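For reference, `average='weighted'` computes precision per class and averages the results, weighting each class by its number of true samples. The following is a plain-Python sketch of that definition on toy labels, not scikit-learn's implementation; the function name `weighted_precision` is made up for illustration.

```python
from collections import Counter


def weighted_precision(y_true, y_pred):
    """Per-class precision averaged with weights equal to each class's true count."""
    support = Counter(y_true)       # number of true samples per class
    predicted = Counter(y_pred)     # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = 0.0
    for label, n_true in support.items():
        # A class that is never predicted contributes precision 0 here.
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        total += n_true * precision
    return total / len(y_true)


# Toy labels, not the iris data.
print(weighted_precision([0, 0, 1, 1, 2], [0, 1, 1, 1, 2]))
```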

Send the test sets to `%spark`.

In [2]:
%%send_to_spark -i X_test_pd -t df -n X_test

Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | | pyspark3 | idle | | | ✔ |


SparkSession available as 'spark'.
Successfully passed 'X_test_pd' as 'X_test' to Spark kernel

In [3]:
%%send_to_spark -i Y_test_pd -t df -n y_test

Successfully passed 'Y_test_pd' as 'y_test' to Spark kernel

Because `pickle.dumps` returns `bytes`, and the model is sent below as a string (`-t str`), we first encode it to base64 text.

In [4]:
%%local
import codecs
decision_tree_pickled = codecs.encode(pickle.dumps(decision_tree_model), "base64").decode()

In [5]:
%%send_to_spark -i decision_tree_pickled -t str -n decision_tree_pickled

Successfully passed 'decision_tree_pickled' as 'decision_tree_pickled' to Spark kernel

Decode from base64 in `%spark`.

In [6]:
import pickle, codecs

decision_tree_model = pickle.loads(codecs.decode(decision_tree_pickled.encode(), "base64"))
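The encode and decode steps can be tried together outside Spark. The sketch below is a minimal round trip using the standard `base64` module in place of `codecs`; the helper names `to_text` and `from_text` and the dictionary standing in for a fitted model are hypothetical.

```python
import base64
import pickle


def to_text(obj):
    """Pickle obj and return the bytes as base64 ASCII text."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def from_text(text):
    """Inverse of to_text: decode the base64 text and unpickle."""
    return pickle.loads(base64.b64decode(text.encode("ascii")))


# Hypothetical stand-in for a fitted model.
payload = {"name": "decision_tree", "depth": 3}
text = to_text(payload)
print(type(text).__name__, from_text(text) == payload)
```

Unlike `codecs.encode(..., "base64")`, `base64.b64encode` does not insert line breaks into its output, which may be convenient when the value travels as a single string.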

Convert the PySpark DataFrames into numpy arrays.

In [7]:
import numpy as np
# collect() pulls every row to the driver, which is fine for a small test set
y_test = np.array(y_test.select('pred').collect())
X_test = np.array(X_test.collect())
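Note that collecting a single column gives a two-dimensional array with one column rather than a flat vector. The sketch below illustrates this with plain tuples standing in for collected rows (an assumption made so the example runs without Spark) and shows how `ravel()` flattens the result if a one-dimensional array is preferred.

```python
import numpy as np

# Hypothetical stand-in for the result of df.select('pred').collect():
# each collected row behaves like a one-element tuple.
rows = [(0,), (2,), (1,)]

column = np.array(rows)   # 2-D column, one row per record
flat = column.ravel()     # 1-D array of labels

print(column.shape, flat.shape)
```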

Run the pretrained classifier and check whether its precision matches that of the `%local` model.

In [8]:
from sklearn.metrics import precision_score

y_pred = decision_tree_model.predict(X_test)
precision_score(y_test, y_pred, average='weighted')

0.97631578947368425

It does! We have successfully passed both a string and pandas DataFrames from `%local` to `%spark`.