# Debugging and Analyzing Data from Arize Platform



Use this template to explore, analyze, and debug data exported from the Arize platform. It takes the data export URL, which you enter below, and produces a clean pandas DataFrame ready for analysis.


***Note: Make a copy of this notebook to enable editing.***


## Setting up the dataframe

Import libraries and define some helper functions.

In [1]:
import json
import pandas as pd
import urllib.request

def get_value_from_dict(single_item_dict):
    """Return the only value of a single-entry dict; report anything larger."""
    if len(single_item_dict) > 1:
        print("FORMAT ERROR")
        print(single_item_dict)
        return None
    return next(iter(single_item_dict.values()))

def clean_up_dict_values(dict_to_clean):
    """Replace each nested dict value, in place, with its single inner value."""
    for key in dict_to_clean:
        if isinstance(dict_to_clean[key], dict):
            dict_to_clean[key] = get_value_from_dict(dict_to_clean[key])
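
To make the helpers' effect concrete, here is a minimal sketch run on a hypothetical feature payload. The wrapper key `doubleValue` and the sample values are assumptions for illustration, not taken from an actual Arize export. The helpers are repeated so the snippet runs on its own.

```python
def get_value_from_dict(single_item_dict):
    # Unwrap a one-entry dict; anything larger is treated as a format error.
    if len(single_item_dict) > 1:
        return None
    return next(iter(single_item_dict.values()))

def clean_up_dict_values(dict_to_clean):
    # Flatten nested one-entry dicts in place; other values are left alone.
    for key in dict_to_clean:
        if isinstance(dict_to_clean[key], dict):
            dict_to_clean[key] = get_value_from_dict(dict_to_clean[key])

# Hypothetical payload: the "doubleValue" wrapper key is an assumption.
sample_features = {
    "HouseAge": {"doubleValue": 25.0},
    "Population": {"doubleValue": 1392.0},
    "region": "west",
}
clean_up_dict_values(sample_features)
print(sample_features)
# {'HouseAge': 25.0, 'Population': 1392.0, 'region': 'west'}
```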


**Edit parameters** with your export URL and desired file preferences.

In [2]:
# Add the URL to your file (provided by Arize) here
arize_ui_url = 'YOUR_DATA_EXPORT_URL'
file_name = "downloaded_data.json"

Download the export from the URL and save it as `file_name`. If the file is already stored locally, you can skip this cell. Follow any authorization prompts that appear.



In [3]:
urllib.request.urlretrieve(arize_ui_url, file_name)

('downloaded_data.json', <http.client.HTTPMessage at 0x7f72af20a7d0>)
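
The cell above downloads the file every time it runs. If you would rather reuse a local copy when one exists, a sketch along these lines may help. `fetch_export` is a hypothetical helper name, not part of any Arize tooling.

```python
import os
import urllib.request

def fetch_export(url, path):
    """Download url to path unless a file already exists there; return path."""
    if os.path.exists(path):
        # A local copy is present, so skip the network call entirely.
        return path
    urllib.request.urlretrieve(url, path)
    return path

# Usage in this notebook (uses the variables defined in the parameters cell):
# fetch_export(arize_ui_url, file_name)
```

Delete the local file first if you need a fresh export, since this sketch never re-downloads over an existing one.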

Set up dataframe with the exported data.

In [4]:
#construct the formatted dataframe in this dictionary
data_frame_dict = {}

#open up the json file
with open(file_name) as fp:

  # read the data point into a dictionary
  line = fp.readline()
  index = 0

  while line:

    formatted_data_point = {}
    data_point = json.loads(line)

    prediction_dict = data_point["prediction"]

    formatted_data_point["timestamp"] = prediction_dict["timestamp"]
    formatted_data_point["modelVersion"] = prediction_dict["modelVersion"]
    formatted_data_point["predictionId"] = data_point["predictionId"]

    #features
    features = prediction_dict["features"]
    clean_up_dict_values(features)
    for k in features:
      formatted_data_point[k] = features[k]

    #prediction
    del prediction_dict["features"]
    # score categorical models are structured differently
    if ("scoreCategorical" in prediction_dict["label"]):
      if ("score" in prediction_dict["label"]["scoreCategorical"]):
        score = prediction_dict["label"]["scoreCategorical"]["score"]
      else:
        score = None
      prediction = prediction_dict["label"]["scoreCategorical"]["categorical"]
      formatted_data_point["score"] = score
      formatted_data_point["prediction"] = prediction
    else:
      clean_up_dict_values(prediction_dict)
      prediction = prediction_dict["label"]
      formatted_data_point["prediction"] = prediction
    
    #actual
    actual_dict = data_point["actual"]
    # score categorical models are structured differently
    if ("scoreCategorical" in actual_dict["label"]):
      clean_up_dict_values(actual_dict["label"])
    
    clean_up_dict_values(actual_dict)
    actual = actual_dict["label"]
    formatted_data_point["actual"] = actual

    #add to new dataframe dict
    data_frame_dict[index] = formatted_data_point

    line = fp.readline()
    index += 1


prediction_df = pd.DataFrame(data_frame_dict)
prediction_df = prediction_df.transpose()
# Clean up: convert the timestamp column to a datetime type
prediction_df['timestamp'] = pd.to_datetime(prediction_df['timestamp'])
prediction_df['date_string'] = prediction_df.timestamp.dt.strftime('%Y-%m-%d')
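
The loop above implies that each line of the export is one JSON object with `predictionId`, `prediction`, and `actual` keys. The sketch below builds two hypothetical records in that shape and flattens them. The wrapper keys (`doubleValue`, `numeric`) and all values are assumptions, and only the non-`scoreCategorical` case is covered. It also collects rows in a list rather than a dict that is later transposed, which lets pandas infer a dtype per column.

```python
import json
import pandas as pd

def make_line(pred_id, house_age, predicted, actual):
    # Hypothetical export line; wrapper keys and values are assumptions.
    return json.dumps({
        "predictionId": pred_id,
        "prediction": {
            "timestamp": "2021-04-13T22:23:05Z",
            "modelVersion": "1.0",
            "features": {"HouseAge": {"doubleValue": house_age}},
            "label": {"numeric": predicted},
        },
        "actual": {"label": {"numeric": actual}},
    })

lines = [make_line("id-1", 25.0, 0.70, 0.48), make_line("id-2", 30.0, 1.79, 0.46)]

def unwrap(value):
    """Return the inner value of a one-entry dict, else the value unchanged."""
    if isinstance(value, dict) and len(value) == 1:
        return next(iter(value.values()))
    return value

rows = []
for line in lines:
    point = json.loads(line)
    pred = point["prediction"]
    row = {
        "timestamp": pred["timestamp"],
        "modelVersion": pred["modelVersion"],
        "predictionId": point["predictionId"],
    }
    row.update({name: unwrap(v) for name, v in pred["features"].items()})
    row["prediction"] = unwrap(pred["label"])
    row["actual"] = unwrap(point["actual"]["label"])
    rows.append(row)

df = pd.DataFrame(rows)  # one row per record, no transpose needed
df["timestamp"] = pd.to_datetime(df["timestamp"])
print(df.dtypes)
```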

Now the data is ready to be explored. Take a look at how it's formatted in the dataframe.

In [5]:
prediction_df.head()

Unnamed: 0,timestamp,modelVersion,predictionId,AveBedrms,AveOccup,AveRooms,HouseAge,Latitude,Longitude,MedInc,Population,prediction,actual,date_string
0,2021-04-13 22:23:05.047025724+00:00,1.0,07c463f6-1285-465b-ad36-2ef91e002371,1.02228,3.87744,4.1922,25,36.06,-119.01,1.6812,1392,0.704822,0.477,2021-04-13
1,2021-04-13 22:23:05.047025724+00:00,1.0,7524054d-0cb4-4315-91ad-ee7678d7c5c9,1.19349,2.67979,5.03938,30,35.14,-119.46,2.5313,1565,1.78588,0.458,2021-04-13
2,2021-04-13 22:23:05.047025724+00:00,1.0,6a0f4d4b-f584-405e-9c7d-ae224ff7ce41,1.18588,1.36033,3.97715,52,37.8,-122.44,3.4801,1310,2.76428,5.00001,2021-04-13
3,2021-04-13 22:23:05.047025724+00:00,1.0,818e5a07-14cf-43d9-8890-0b34b49216af,1.0202,3.44444,6.16364,17,34.28,-118.72,5.7376,1705,2.82127,2.186,2021-04-13
4,2021-04-13 22:23:05.047025724+00:00,1.0,a9865b1d-f76d-48f2-b85e-9c60dbed0575,1.02804,2.48364,5.49299,34,36.62,-121.93,3.725,1063,2.62435,2.78,2021-04-13


## Examples of breaking down the data

### Mean of predictions and actuals

In [6]:
# Compute the mean of predictions and actuals
# If you slice on features in the platform, these examples show how to apply the same kind of slice here

# Note: this does not work for classification models where the predictions are True/False
"""
print(prediction_df['actual'].mean())
print(prediction_df[(prediction_df['modelVersion'] == '1.0') ]['prediction'].mean())
print(prediction_df[(prediction_df['modelVersion'] == '1.0') & (prediction_df.date_string > "2021-03-20")]['prediction'].mean())
""";
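
The examples above slice on model version and date. Slicing on a feature follows the same pattern, as this sketch on a small synthetic frame shows. The values are made up, and `HouseAge` is used only because it appears in the sample output. Because a transposed frame can leave columns with `object` dtype, the sketch converts them with `pd.to_numeric` first.

```python
import pandas as pd

# Synthetic stand-in for prediction_df; object dtype mimics a transposed frame.
df = pd.DataFrame(
    {
        "HouseAge": [25, 30, 52, 17],
        "prediction": [0.70, 1.79, 2.76, 2.82],
        "actual": [0.48, 0.46, 5.00, 2.19],
    },
    dtype=object,
)

# Convert every column to a numeric dtype before filtering and aggregating.
df = df.apply(pd.to_numeric)

# Slice on a feature: rows where the house is at least 30 years old.
older = df[df["HouseAge"] >= 30]
print(older["prediction"].mean())
print(older["actual"].mean())
```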

### MSE and other metrics

In [7]:
# Note: this does not work for classification models
"""
from sklearn.metrics import mean_absolute_error, mean_squared_error
print(mean_absolute_error(prediction_df['actual'], prediction_df['prediction']))
print(mean_squared_error(prediction_df['actual'], prediction_df['prediction']))
# Same metric on a date slice
recent_slice = prediction_df[prediction_df.date_string > "2021-03-20"]
print(mean_absolute_error(recent_slice['actual'], recent_slice['prediction']))
""";
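
As a cross-check on the library metrics, the same quantities can be computed directly with pandas. This is a sketch on synthetic values, not the exported data; the column names match the dataframe built earlier.

```python
import pandas as pd

# Synthetic regression results for illustration only.
df = pd.DataFrame({
    "actual": [1.0, 2.0, 3.0],
    "prediction": [1.5, 2.0, 2.0],
})

error = df["actual"] - df["prediction"]
mae = error.abs().mean()      # mean absolute error
mse = (error ** 2).mean()     # mean squared error
rmse = mse ** 0.5             # root mean squared error

print(mae, mse, rmse)
```

For these three rows the absolute errors are 0.5, 0, and 1, so the MAE is 0.5 and the MSE is 1.25 / 3.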

### Grouping data

In [8]:
# Group all the prediction data by the day they were made
"""
prediction_df.groupby(['date_string']).count()['prediction'].head()
""";
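
Grouping can also carry a metric alongside the count. The sketch below, again on synthetic data, reports the number of predictions and the mean absolute error per day using named aggregation. The output column names `n_predictions` and `mae` are arbitrary choices.

```python
import pandas as pd

# Synthetic stand-in for prediction_df with the columns used below.
df = pd.DataFrame({
    "date_string": ["2021-04-13", "2021-04-13", "2021-04-14"],
    "prediction": [0.5, 2.0, 3.0],
    "actual": [1.0, 2.0, 2.0],
})

# Per-row absolute error, then one summary row per day.
df["abs_error"] = (df["actual"] - df["prediction"]).abs()
daily = df.groupby("date_string").agg(
    n_predictions=("prediction", "count"),
    mae=("abs_error", "mean"),
)
print(daily)
```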

## Workspace

Expand this notebook as needed for your own analysis.



In [9]:
# Begin work here