# Tutorial: Sending PySpark DataFrame to Arize

The current version of the Arize Python SDK supports only pandas DataFrames. Spark DataFrames are backed by RDDs, so to log one we use `mapPartitions` to convert each partition to a pandas DataFrame and log it to Arize.

Install Dependencies in Colab

In [None]:
!pip install pyspark
!pip install arize



# Parallelizing PySpark DataFrame
We first create a dummy PySpark DataFrame to send.


In [None]:
import pyspark
from pyspark.sql import Row, SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Read some dummy data for logging to Arize later
data = pd.read_csv('https://storage.googleapis.com/arize-assets/fixtures/compare-model-a.csv')
# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
df_pandas = data[['loan_amount', 'interest_rate', 'grade']].copy()
df_pandas['prediction_labels'] = data['prediction']

df_pandas = pd.concat([df_pandas] * 5)

print("This is a pandas DataFrame:")
display(df_pandas)

# Create a PySpark DataFrame from the pandas DataFrame
df_spark = spark.createDataFrame(df_pandas)

print("\nThis is the corresponding spark DataFrame")
df_spark.printSchema()

This is a pandas DataFrame:

| | loan_amount | interest_rate | grade | prediction_labels |
|---|---|---|---|---|
| 0 | 10000.0 | 10.99 | B | fraud |
| 1 | 8200.0 | 15.61 | D | not_fraud |
| 2 | 14000.0 | 9.17 | B | fraud |
| 3 | 5400.0 | 24.50 | F | fraud |
| 4 | 1500.0 | 13.18 | C | fraud |
| ... | ... | ... | ... | ... |
| 1131 | 25475.0 | 12.99 | C | not_fraud |
| 1132 | 5200.0 | 7.90 | A | not_fraud |
| 1133 | 18000.0 | 7.90 | A | fraud |
| 1134 | 28000.0 | 16.99 | D | fraud |



This is the corresponding spark DataFrame
root
 |-- loan_amount: double (nullable = true)
 |-- interest_rate: double (nullable = true)
 |-- grade: string (nullable = true)
 |-- prediction_labels: string (nullable = true)



# Using `mapPartitions` to log each partition to Arize
The mapping function (`map_send_arize` below) is applied to each partition of `df_spark`, so each partition is converted to a local `pd.DataFrame` and logged to Arize.

For each partition, the function returns the status code and response text of every logging request. A request that succeeded appears as `Status 200: {}`; anything else carries the error code and error message for that request.

`map_send_arize` must send at least one of `shap`, `prediction_labels`, or `actual_labels` to Arize. This tutorial sends `prediction_labels`.

## How To Log to Arize:
`your_spark_df.rdd.mapPartitions(map_func).collect()`
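
To make the `mapPartitions` contract concrete, here is a minimal plain-Python sketch. It is not Spark: partitions are simulated as lists, and the helper `simulate_map_partitions` and the mapping function `count_rows` are hypothetical names introduced only for illustration. The point is the shape of the call: the mapping function receives an iterator over one partition's rows and returns an iterable of results, and the per-partition results are flattened into one list, as `collect()` does.

```python
from itertools import chain
from typing import Callable, Iterable, Iterator, List


def simulate_map_partitions(
    partitions: List[List[dict]],
    map_func: Callable[[Iterator[dict]], Iterable[str]],
) -> List[str]:
    """Apply map_func to each partition and flatten the results.

    This only imitates the shape of rdd.mapPartitions(map_func).collect();
    it runs sequentially in one process and is not Spark.
    """
    per_partition = (map_func(iter(partition)) for partition in partitions)
    return list(chain.from_iterable(per_partition))


def count_rows(partition: Iterator[dict]) -> Iterator[str]:
    # Hypothetical mapping function: returns one status string per partition
    n_rows = sum(1 for _ in partition)
    return iter([f"partition logged {n_rows} rows"])


# Three simulated partitions holding 2, 0, and 1 rows
partitions = [
    [{"grade": "B"}, {"grade": "D"}],
    [],
    [{"grade": "A"}],
]
results = simulate_map_partitions(partitions, count_rows)
print(results)
```

Note that the mapping function is also called for the empty partition, so a real mapping function should handle a partition with no rows.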

You will also need to update `API_KEY` and `ORGANIZATION_KEY` in the mapping function.

### Setting up Arize Client:
First, copy your Arize `API_KEY` and `ORGANIZATION_KEY` from the admin page linked below!


[![Button_Open.png](https://storage.googleapis.com/arize-assets/fixtures/Button_Open.png)](https://app.arize.com/admin)

In [None]:
import itertools
from arize.api import Client
from arize.types import ModelTypes
import concurrent.futures as cf
import uuid

def map_send_arize(
    pyspark_df_partition: itertools.chain
):  
    """
    Mapping function to be used to log to Arize
    """
    # Step 1: collect the partition's Rows into a single pd.DataFrame
    rows = [row.asDict() for row in pyspark_df_partition]
    if not rows:
        # Empty partition: nothing to log
        return iter([])
    pandas_df = pd.DataFrame(rows)

    # Step 2: We keep prediction labels and features in the same PySpark DataFrame
    features = pandas_df.drop(columns=['prediction_labels'])
    prediction_labels = pandas_df['prediction_labels']

    # Step 3: Log to Arize (replace the placeholder keys below with your own)
    ORGANIZATION_KEY = 'ORGANIZATION_KEY'
    API_KEY = 'API_KEY'
    arize = Client(organization_key=ORGANIZATION_KEY, api_key=API_KEY)
    
    responses = arize.bulk_log(
        model_id='model-test-id-1',
        model_version='1.0',
        prediction_ids=pd.Series([str(uuid.uuid4()) for _ in range(len(pandas_df))]),
        # features and prediction passed in here
        features=features,
        prediction_labels=prediction_labels
    )
    
    # Step 4: Check for errors when logging
    res = []
    for response in cf.as_completed(responses):
        response_result = response.result()
        res.append(f'Status {response_result.status_code}: {response_result.text}')
    return iter(res)
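
`map_send_arize` materializes an entire partition as one pandas DataFrame before logging. If partitions are large, one possible variation (an assumption, not part of the original tutorial) is to consume the partition iterator in fixed-size batches and log each batch separately. A minimal sketch of the batching step alone, using a hypothetical helper named `iter_batches`:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items from rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        # islice pulls up to batch_size items without reading further ahead
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Seven items in batches of three: the last batch is shorter
batches = list(iter_batches(range(7), batch_size=3))
print(batches)
```

Inside a mapping function, each batch of rows could be turned into its own DataFrame and logged, which bounds memory per request at the cost of more requests per partition.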

# Logging Example
Each partition returns one status string per `bulk_log` response from Arize, and `collect()` gathers these strings into a single list on the driver once the mapping function has run on every partition.

In [None]:
%%time
log_output = df_spark.rdd.mapPartitions(map_send_arize).collect()
print(f'number of records sent: {len(df_pandas)}')
print(f'number of arize responses: {len(log_output)}')
print(f"all log requests are successful: {[response == 'Status 200: {}' for response in log_output]}")

number of records sent: 5680
number of arize responses: 8
all log requests are successful: [True, True, True, True, True, True, True, True]
CPU times: user 108 ms, sys: 12.2 ms, total: 120 ms
Wall time: 19.2 s
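
A list of booleans becomes hard to read as the number of responses grows. As a small sketch, the status strings can instead be tallied by status code and the failures listed on their own. The helper `summarize_responses` is a hypothetical name, the example responses are made up for illustration, and the code assumes the `Status <code>: <text>` format produced by `map_send_arize`.

```python
from collections import Counter
from typing import Dict, List


def summarize_responses(log_output: List[str]) -> Dict[str, int]:
    """Count responses per status prefix, e.g. 'Status 200'."""
    # Each entry looks like 'Status <code>: <text>'; keep the part before ':'
    return dict(Counter(entry.split(":", 1)[0] for entry in log_output))


# Made-up responses for illustration only
example_output = ["Status 200: {}", "Status 200: {}", "Status 401: unauthorized"]
summary = summarize_responses(example_output)
failures = [entry for entry in example_output if not entry.startswith("Status 200")]
print(summary)
print(failures)
```

In the notebook, `log_output` from the cell above would be passed in place of `example_output`.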


# **Overview**

[![Button_Open.png](https://storage.googleapis.com/arize-assets/fixtures/Button_Open.png)](https://arize.com/)

Arize is an end-to-end ML observability and model monitoring platform. The platform is designed to help ML engineers and data science practitioners surface and fix issues with ML models in production faster with:
- Automated ML monitoring and model monitoring
- Workflows to troubleshoot model performance
- Real-time visualizations for model performance monitoring, data quality monitoring, and drift monitoring
- Model prediction cohort analysis
- Pre-deployment model validation
- Integrated model explainability

### Website
Visit Us At: https://arize.com/model-monitoring/

### Additional Resources
- [What is ML observability?](https://arize.com/what-is-ml-observability/)
- [Playbook to model monitoring in production](https://arize.com/the-playbook-to-monitor-your-models-performance-in-production/)
- [Using statistical distance metrics for ML monitoring and observability](https://arize.com/using-statistical-distance-metrics-for-machine-learning-observability/)
- [ML infrastructure tools for data preparation](https://arize.com/ml-infrastructure-tools-for-data-preparation/)
- [ML infrastructure tools for model building](https://arize.com/ml-infrastructure-tools-for-model-building/)
- [ML infrastructure tools for production](https://arize.com/ml-infrastructure-tools-for-production-part-1/)
- [ML infrastructure tools for model deployment and model serving](https://arize.com/ml-infrastructure-tools-for-production-part-2-model-deployment-and-serving/)
- [ML infrastructure tools for ML monitoring and observability](https://arize.com/ml-infrastructure-tools-ml-observability/)

Visit the [Arize Blog](https://arize.com/blog) and [Resource Center](https://arize.com/resource-hub/) for more resources on ML observability and model monitoring.