
docs: Adding notebook for ICE explainer #1318

Merged
merged 12 commits into microsoft:master on Jan 14, 2022

Conversation

ezherdeva
Contributor

No description provided.

@ezherdeva
Contributor Author

@memoryz
You didn't appear as a reviewer for some reason.

@mhamilton723
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@@ -0,0 +1 @@
{"cells":[{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"b7488bd3-b1a1-4b4b-a3be-52e447c4a46c","showTitle":false,"title":""}},"source":["## Partial Dependence (PDP) and Individual Conditional Expectation (ICE) plots"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"6d7a6880-7982-41f0-b768-893b50a5fc96","showTitle":false,"title":""}},"source":["In this example, we train a classification model with the Adult Census Income dataset. Then we treat the model as a blackbox model and calculate the PDP and ICE plots for some selected categorical and numeric features. \n","\n","This dataset can be used to predict whether annual income exceeds $50,000/year or not based on demographic data from the 1994 U.S. Census. The dataset we're reading contains 32,561 rows and 14 columns/features.\n","\n","[More info on the dataset here](https://archive.ics.uci.edu/ml/datasets/Adult)\n","\n","We will train a classification model with a target - income >= 50K.\n","\n","---\n","\n","**Partial Dependence Plot (PDP) ** function at a particular feature value represents the average prediction if we force all data points to assume that feature value.\n","\n","**Individual Conditional Expectation (ICE)** plots display one line per instance that shows how the instance’s prediction changes when a feature changes. One line represents the predictions for one instance if we vary the feature of interest.\n","\n","PDP and ICE plots visualize and help to analyze the interaction between the target response and a set of input features of interest. It is essential when you are building a Machine Learning model to understand model behavior and how certain features influences overall prediction. 
One of the most popular use-cases is analyzing feature importance.\n","\n","---\n","Python dependencies:\n","\n","matplotlib==3.2.2\n","\n","numpy==1.19.2"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"90785380-3bea-4b75-81ea-60203b3daf8c","showTitle":false,"title":""}},"outputs":[],"source":["from pyspark.ml import Pipeline\n","from pyspark.ml.classification import GBTClassifier\n","from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder\n","import pyspark.sql.functions as F\n","from pyspark.ml.evaluation import BinaryClassificationEvaluator\n","\n","from synapse.ml.explainers import ICETransformer\n","\n","import matplotlib.pyplot as plt"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"74344307-ac58-4e04-8d80-9fb9d24a5803","showTitle":false,"title":""}},"source":["### Read and prepare the dataset"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"f42818e8-2361-4a46-aa6b-f94367f91dd1","showTitle":false,"title":""}},"outputs":[],"source":["df = spark.read.parquet(\"wasbs://publicwasb@mmlspark.blob.core.windows.net/AdultCensusIncome.parquet\")\n","display(df)"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"073cae48-beea-4a24-9c3a-2dde9815a4fb","showTitle":false,"title":""}},"outputs":[],"source":["categorical_features = [\"race\", \"workclass\", \"marital-status\", \"education\", \"occupation\", \"relationship\", \"native-country\", \"sex\"]\n","numeric_features = [\"age\", \"education-num\", \"capital-gain\", \"capital-loss\", \"hours-per-week\"]\n","string_indexer_outputs = [feature + \"_idx\" for feature in categorical_features]\n","one_hot_encoder_outputs = [feature + \"_enc\" for feature in categorical_features]\n","\n","pipeline = Pipeline(stages=[\n"," StringIndexer().setInputCol(\"income\").setOutputCol(\"label\").setStringOrderType(\"alphabetAsc\"),\n"," StringIndexer().setInputCols(categorical_features).setOutputCols(string_indexer_outputs),\n"," OneHotEncoder().setInputCols(string_indexer_outputs).setOutputCols(one_hot_encoder_outputs),\n"," VectorAssembler(inputCols=one_hot_encoder_outputs+numeric_features, outputCol=\"features\"),\n"," GBTClassifier(weightCol=\"fnlwgt\")])"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"8258a1fd-0725-421f-8506-20cd256f28c6","showTitle":false,"title":""}},"outputs":[],"source":["display(df.groupBy(\"education-num\").count())"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"81d6bb74-96e9-42ee-b3db-e42c0ea88376","showTitle":false,"title":""}},"source":["### Fit the model and view the predictions"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"94b1aca1-c46e-4ad5-9f80-19016ad99e49","showTitle":false,"title":""}},"outputs":[],"source":["model = pipeline.fit(df)"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"85fcf25c-b3ad-4afd-a455-cef1c9b21cb4","showTitle":false,"title":""}},"source":["Check that model makes sense and has reasonable output. 
For this, we will check the model performance by calculating the ROC-AUC score."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"bd5b14c1-990d-48b4-b611-46787beed673","showTitle":false,"title":""}},"outputs":[],"source":["data = model.transform(df)\n","display(data.select('income', 'probability', 'prediction'))"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"45d0782a-dc16-4a37-977c-b3d426d650c8","showTitle":false,"title":""}},"outputs":[],"source":["eval_auc = BinaryClassificationEvaluator(labelCol=\"label\", rawPredictionCol=\"prediction\")\n","eval_auc.evaluate(data)"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"8471ec63-9a87-4169-b35e-84f3a54c143a","showTitle":false,"title":""}},"source":["## PDP\n","\n","\\\\(X_S\\\\) - set of input features of interest, \\\\(X_C\\\\) - its complement.\n","\n","The partial dependence of the response \\\\(f\\\\) at a point \\\\(x_S\\\\) is defined as:\n","\n","$$ pd_{X_S}(x_S) = \\mathsf{E} X_C [f(x_S, X_C)] = \\int f(x_S, x_C) p(x_C)dx_C$$\n","\n","where \\\\(f(x_S, x_C)\\\\) is the response function for a given sample whose values are defined by \\\\(x_S\\\\) for the features in \\\\(X_S\\\\) (i.e. the features you want to explain), and by \\\\(x_C\\\\) for the features in \\\\(X_C\\\\) (i.e. features that are not being analyzed).\n","\n","The compuation method estimates the above integaral by computing an average over the dataset \\\\(X\\\\):\n","\n","$$pd_{X_S}(x_S) \\approx \\frac{1}{n_{samples}} \\sum_{i=1}^n f(x_S, x_C^{(i)}) $$\n","\n","where \\\\(x_C^{(i)}\\\\) is the value of the i-th sample for the features in \\\\(X_C\\\\). 
For each value of \\\\(x_S\\\\), this method requires a full pass over the dataset \\\\(X\\\\).\n","\n","---\n","\n","We will show how features \"sex\", \"education\", \"worklass\", \"occupation\" (categorical feautures) and \"education-num\" and \"age\" (numeric features) affect the prediction of the income exceeds $50,000/year.\n","\n","--- \n","\n","Source: https://christophm.github.io/interpretable-ml-book/pdp.html"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"422b70ee-483d-470b-9105-7da2cefb8fec","showTitle":false,"title":""}},"source":["### Setup the transformer for PDP"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"df3b3215-5418-4c3e-8931-300cb2481884","showTitle":false,"title":""}},"outputs":[],"source":["pdp = ICETransformer(model=model, targetCol=\"probability\", kind=\"average\", targetClasses=[1]).\\\n"," setCategoricalFeatures(categorical_features).\\\n"," setNumericFeatures(numeric_features).setNumSamples(50)"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"68774ede-f4ee-4437-bc40-16c7a2192098","showTitle":false,"title":""}},"source":["PDP is a spark transformer, the function **transform** returns the schema of (1 row * number features to explain) which contains dependence for the given feature in a format: feature_value -> dependence (in our case probability)."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"d79ebaea-69ef-4c07-8bbb-157d6b86ee9b","showTitle":false,"title":""}},"outputs":[],"source":["output_pdp = pdp.transform(df)\n","display(output_pdp)"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"1b025798-2650-4d68-b5cd-136596d343cd","showTitle":false,"title":""}},"source":["### Visualization"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"e160c8bf-c4ff-4442-9e6c-f7615212b151","showTitle":false,"title":""}},"outputs":[],"source":["# Helper functions for visualization\n","\n","def get_pandas_df_from_column(df, col_name):\n"," keys_df = df.select(F.explode(F.map_keys(F.col(col_name)))).distinct()\n"," keys = list(map(lambda row: row[0], keys_df.collect()))\n"," key_cols = list(map(lambda f: F.col(col_name).getItem(f).alias(str(f)), keys))\n"," final_cols = key_cols\n"," pandas_df = df.select(final_cols).toPandas()\n"," return pandas_df\n","\n","def plot_dependence_for_categorical(df, col, col_int=True, figsize=(20, 5)):\n"," dict_values = {}\n"," col_names = list(df.columns)\n","\n"," for col_name in col_names:\n"," dict_values[col_name] = df[col_name][0].toArray()[0]\n"," marklist= sorted(dict_values.items(), key=lambda x: int(x[0]) if col_int else x[0]) \n"," sortdict=dict(marklist)\n","\n"," fig = plt.figure(figsize = figsize)\n"," plt.bar(sortdict.keys(), sortdict.values())\n","\n"," plt.xlabel(col, size=13)\n"," plt.ylabel(\"Dependence\")\n"," plt.title(\"\")\n"," plt.show()\n"," \n","def plot_dependence_for_numeric(df, col, col_int=True, figsize=(20, 5)):\n"," dict_values = {}\n"," col_names = list(df.columns)\n","\n"," for col_name in col_names:\n"," dict_values[col_name] = df[col_name][0].toArray()[0]\n"," marklist= sorted(dict_values.items(), key=lambda x: int(x[0]) if col_int else x[0]) \n"," sortdict=dict(marklist)\n","\n"," fig = plt.figure(figsize = figsize)\n","\n"," \n"," 
plt.plot(list(sortdict.keys()), list(sortdict.values()))\n","\n"," plt.xlabel(col, size=13)\n"," plt.ylabel(\"Dependence\")\n"," plt.ylim(0.0)\n"," plt.title(\"\")\n"," plt.show()\n"," "]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"2ebe88c8-0176-47da-9c9a-88c1dcfef123","showTitle":false,"title":""}},"source":["#### Example 1: \"Age\"\n","\n","We can observe non-linear dependency. Income rapidly grows from 24-38 age, after 58 it slightly drops and from 68 remains stable."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"332ea73e-d47f-4559-9c21-b9e7ccf6ded6","showTitle":false,"title":""}},"outputs":[],"source":["display(output_pdp)"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"7f1386da-bd06-4dd6-9368-a031c982650c","showTitle":false,"title":""}},"outputs":[],"source":["df_education_num = get_pandas_df_from_column(output_pdp, 'age_dependence')\n","plot_dependence_for_numeric(df_education_num, 'age')"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"d1da8c5b-b4cb-455f-9289-f646bff56c1d","showTitle":false,"title":""}},"source":["#### Example 2: \"marital-status\"\n","\n","According to the result, the model treats \"married-cv-spouse\" as one category and all others as a second category. It looks reasonable, taking into account that GBT has a tree structure.\n","\n","If the model picks \"divorced\" as one category and the rest features as the second category- then most likely there is an error and some bias in data."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"4bbc72bf-307a-4d20-98f1-73673d81bb03","showTitle":false,"title":""}},"outputs":[],"source":["df_occupation = get_pandas_df_from_column(output_pdp, 'workclass_dependence')\n","plot_dependence_for_categorical(df_occupation, 'marital-status', False, figsize=(30, 5))"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"16b524bf-7587-461d-aa50-a88972cf2bb3","showTitle":false,"title":""}},"source":["#### Example 3: \"capital-gain\"\n","\n","Firstly we run PDP with default parameters for rangeMin and rangeMax. We can see that this representation is not useful, it is not granulated enough, because it was dynamically computed from the data. 
That is why we set rangeMin = 0 and rangeMax = 10000 to visualize more granulated interpretations for the part we're interested in.\n","\n","On the second graph we can observe how capital-gain affects the dependence."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"2d076473-9760-4634-8465-aa90cb2a40dc","showTitle":false,"title":""}},"outputs":[],"source":["df_education_num = get_pandas_df_from_column(output_pdp, 'capital-gain_dependence')\n","plot_dependence_for_numeric(df_education_num, 'capital-gain_dependence')"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"36853ee2-2b55-4cd5-ab53-b2090d8cb2c5","showTitle":false,"title":""}},"outputs":[],"source":["pdp_cap_gain = ICETransformer(model=model, targetCol=\"probability\", kind=\"average\", targetClasses=[1]).\\\n"," setNumericFeatures([{\"name\": \"capital-gain\", \"numSplits\": 20, \"rangeMin\": 0.0, \"rangeMax\": 10000.0}]).\\\n"," setNumSamples(50)\n","\n","output_pdp_cap_gain = pdp_cap_gain.transform(df)"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"0854f823-00a4-428a-bb0e-9457ca010774","showTitle":false,"title":""}},"outputs":[],"source":["df_education_num = get_pandas_df_from_column(output_pdp_cap_gain, 'capital-gain_dependence')\n","plot_dependence_for_numeric(df_education_num, 'capital-gain_dependence')"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"773be71d-25ba-45ea-bf8f-98a9722e7da6","showTitle":false,"title":""}},"source":["### Conclusions\n","\n","**Advantages:**\n","\n","1) Plots is intuitive.\n","\n","2) PDPs perfectly represent how the feature influences the prediction on average (for not correlated features).\n","\n","3) Plots are easy to implement.\n","\n","**Disadvantages:**\n","\n","1) The realistic maximum number of features in a partial dependence function is two.\n","\n","2) Some PD plots do not show the feature distribution.\n","\n","3) The assumption of independence is the biggest issue with PD plots.\n","\n","4) PD plots only show the average marginal effects."]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"e65a4bca-e276-4ecd-b3ed-8171b4b41867","showTitle":false,"title":""}},"source":["## ICE\n","\n","\\\\(X_S\\\\) - set of input features of interest, \\\\(X_C\\\\) - its complement.\n","\n","\n","The equivalent to a PDP for individual data instances is called individual conditional expectation (ICE) plot. A PDP is the average of the lines of an ICE plot.\n","\n","The values for a line (and one instance) can be computed by keeping all other features the same, creating variants of this instance by replacing the feature’s value with values from a grid and making predictions with the black box model for these newly created instances. 
\n","\n","For each instance in $$ \\{ (x_{S}^{(i)},x_{C}^{(i)}) \\}_{i=1}^N$$ the curve \\\\(\\hat{f}_S^{(i)}\\\\) is plotted against \\\\(x_S^{(i)} \\\\), while \\\\( x_C^{(i)}\\\\) remains fixed.\n","\n","---\n","\n","\n","We will show the same features as for PDP to show a difference: \"sex\", \"education\", \"worklass\", \"occupation\" (categorical feautures) and \"education-num\" and \"age\" (numeric features)\n","\n","---\n","Source: https://christophm.github.io/interpretable-ml-book/ice.html"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"1fe87d15-5a3f-4c40-b28d-3e751da22900","showTitle":false,"title":""}},"source":["### Setup the transformer for ICE"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"b2c4d0a7-f7ab-4afa-b6fd-4eac4fc8b27b","showTitle":false,"title":""}},"outputs":[],"source":["ice = ICETransformer(model=model, targetCol=\"probability\", targetClasses=[1]).\\\n"," setCategoricalFeatures(categorical_features).setNumericFeatures(numeric_features).setNumSamples(50)"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"a2623d89-237f-4057-8057-9c2075d0c4fd","showTitle":false,"title":""}},"outputs":[],"source":["output = ice.transform(df)\n","display(output)"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"b6272501-44f2-4fff-9abe-b96c8129f943","showTitle":false,"title":""}},"source":["### Visualization"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"6f3dfaa4-a1e0-4308-a702-973325d4c58f","showTitle":false,"title":""}},"outputs":[],"source":["# Helper functions for visualization\n","from math import pi\n","\n","from collections import defaultdict\n","\n","def plot_ice_numeric(df, col, col_int=True, figsize=(20, 10)):\n"," dict_values = defaultdict(list)\n"," col_names = list(df.columns)\n"," num_instances = df.shape[0]\n"," \n"," instances_y = {}\n"," i = 0\n","\n"," for col_name in col_names:\n"," for i in range(num_instances):\n"," dict_values[i].append(df[col_name][i].toArray()[0])\n"," \n"," fig = plt.figure(figsize = figsize)\n"," for i in range(num_instances):\n"," plt.plot(col_names, dict_values[i], \"k\")\n"," \n"," \n"," plt.xlabel(col, size=13)\n"," plt.ylabel(\"Dependence\")\n"," plt.ylim(0.0)\n"," \n"," \n"," \n","def plot_ice_categorical(df, col, col_int=True, figsize=(20, 10)):\n"," dict_values = defaultdict(list)\n"," col_names = list(df.columns)\n"," num_instances = df.shape[0]\n"," \n"," angles = [n / float(df.shape[1]) * 2 * pi for n in range(df.shape[1])]\n"," angles += angles [:1]\n"," \n"," instances_y = {}\n"," i = 0\n","\n"," for col_name in col_names:\n"," for i in range(num_instances):\n"," dict_values[i].append(df[col_name][i].toArray()[0])\n"," \n"," fig = plt.figure(figsize = figsize)\n"," ax = plt.subplot(111, polar=True)\n"," plt.xticks(angles[:-1], col_names)\n"," \n"," for i in range(num_instances):\n"," values = dict_values[i]\n"," values += values[:1]\n"," ax.plot(angles, values, \"k\")\n"," ax.fill(angles, values, 'teal', alpha=0.1)\n","\n"," plt.xlabel(col, size=13)\n"," plt.show()\n"," "]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"856f3cbb-fc02-4279-9b16-059afa79e8d2","showTitle":false,"title":""}},"source":["#### Example 1: Numeric feature: \"Age\"\n","\n","All 
curves seem to follow the same course, so there are no obvious interactions. That means that the PDP is already a good summary of the relationships between the displayed features and the predicted income >=50K"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"8d794aca-2a9e-4791-a2a3-95f294731ebc","showTitle":false,"title":""}},"outputs":[],"source":["col_name = 'age_dependence'\n","age_dep = get_pandas_df_from_column(output, col_name)\n","\n","plot_ice_numeric(age_dep, col_name, figsize=(30, 10))"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"5cf03cf9-b753-4a45-92ac-7d1ac04449a2","showTitle":false,"title":""}},"source":["Helper function"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"9599a289-c8e7-4643-9f37-7134b185d391","showTitle":false,"title":""}},"outputs":[],"source":["def overlay_ice_with_pdp(df_ice, df_pdp, col, col_int=True, figsize=(20, 5)):\n"," dict_values = defaultdict(list)\n"," col_names_ice = list(df_ice.columns)\n"," num_instances = df_ice.shape[0]\n"," \n"," instances_y = {}\n"," i = 0\n","\n"," for col_name in col_names_ice:\n"," for i in range(num_instances):\n"," dict_values[i].append(df_ice[col_name][i].toArray()[0])\n"," \n"," fig = plt.figure(figsize = figsize)\n"," for i in range(num_instances):\n"," plt.plot(col_names_ice, dict_values[i], \"k\")\n"," \n"," dict_values = {}\n"," col_names = list(df_pdp.columns)\n","\n"," for col_name in col_names:\n"," dict_values[col_name] = df_pdp[col_name][0].toArray()[0]\n"," marklist= sorted(dict_values.items(), key=lambda x: int(x[0]) if col_int else x[0]) \n"," sortdict=dict(marklist)\n"," \n"," plt.plot(col_names_ice, list(sortdict.values()), \"r\", linewidth=5)\n"," \n"," \n"," \n"," plt.xlabel(col, size=13)\n"," plt.ylabel(\"Dependence\")\n"," plt.ylim(0.0)\n"," plt.show()\n"," "]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"97996bd4-074b-4e05-bc11-10620415480e","showTitle":false,"title":""}},"source":["This shows how PDP visualizes the average dependence. Red line - PDP plot, black lines - ICE plots"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"36bfc955-b6c8-4e9d-a28e-42c7e9889692","showTitle":false,"title":""}},"outputs":[],"source":["col_name = 'age_dependence'\n","overlay_ice_with_pdp(age_dep, df_education_num, col=col_name, figsize=(30, 10))"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"d0a80081-a3f6-41d0-8f6d-b686220dbeaa","showTitle":false,"title":""}},"source":["#### Example 2: Categorical feature: \"occupation\""]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"1fee7eb9-6e89-480c-b864-fed2cfbf33f5","showTitle":false,"title":""}},"outputs":[],"source":["col_name = 'occupation_dependence'\n","occupation_dep = get_pandas_df_from_column(output, col_name)\n","\n","\n","plot_ice_categorical(occupation_dep, col_name, figsize=(30, 10))"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"4f6494ac-6bf5-408c-9ac2-c44b54afba38","showTitle":false,"title":""}},"source":["### Conclusions\n","\n","\n","**Advantages:**\n","\n","1) Plots are intuitive to understand. 
One line represents the predictions for one instance if we vary the feature of interest.\n","\n","2) ICE curves can uncover more complex relationships.\n","\n","**Disadvantages:**\n","\n","1) ICE curves can only display one feature meaningfully - otherwise you should overlay multiple surfaces.\n","\n","2) Some points in the lines might be invalid data points according to the joint feature distribution. It causes by correlations between features.\n","\n","3) In ICE plots it might not be easy to see the average."]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"b974545e-371b-41b0-b9a4-adb3036d9eb8","showTitle":false,"title":""}},"source":["## Summary\n","\n","Partial dependence plots (PDP) and Individual Conditional Expectation (ICE) plots can be used to visualize and analyze interaction between the target response and a set of input features of interest.\n","\n","Both PDPs and ICEs assume that the input features of interest are independent from the complement features, and this assumption is often violated in practice.\n","\n","ICE shows the dependence on average, but if you want to observe features individually - you can use ICE.\n","\n","Using examples above we showed how it can be usefull to draw such plots to analyze how machine learning model made their predictions, what was important and how we can interpret the results."]}],"metadata":{"application/vnd.databricks.v1+notebook":{"dashboards":[],"language":"python","notebookMetadata":{"pythonIndentUnit":2},"notebookName":"PDP-ICE-tutorial-new","notebookOrigID":2416290700869370,"widgets":{}},"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":0}
Collaborator


nit: Perhaps pump this through a JSON formatter so the diff lines going forward will be nicer

Contributor Author


Can't find how to do it :( But you can hit "View file" and it displays the file nicely. I've fixed the rest of the comments.
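For reference, one way to pretty-print a notebook's JSON with just the standard library (a sketch; `notebook.ipynb` is a placeholder path, not the actual file in this PR):

```python
# Hypothetical example: pretty-print a notebook's JSON in place.
import json

with open("notebook.ipynb") as f:
    nb = json.load(f)

with open("notebook.ipynb", "w") as f:
    json.dump(nb, f, indent=1)  # one-space indent, as nbformat typically writes
    f.write("\n")
```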

@mhamilton723
Collaborator

mhamilton723 commented Dec 23, 2021

pdp = ICETransformer(model=model, targetCol="probability", kind="average", targetClasses=[1]).\
    setCategoricalFeatures(categorical_features).\
    setNumericFeatures(numeric_features).setNumSamples(50)

^ Might want to use either setters or init args for consistency here and elsewhere
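For example, the constructor-only form might look like this (a sketch; it assumes ICETransformer accepts the same parameter names as constructor keyword arguments, which I haven't verified here):

```python
# Sketch of an init-args-only setup, assuming these keyword arguments exist.
pdp = ICETransformer(
    model=model,
    targetCol="probability",
    kind="average",
    targetClasses=[1],
    categoricalFeatures=categorical_features,
    numericFeatures=numeric_features,
    numSamples=50,
)
```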

@mhamilton723
Collaborator

plt.title("")

Might not need this line if you don't want titles

@mhamilton723
Collaborator

col_name = 'age_dependence'
age_dep = get_pandas_df_from_column(output, col_name)

Could remove some duplication with
age_dep = get_pandas_df_from_column(output, 'age_dependence')

@codecov-commenter

codecov-commenter commented Dec 23, 2021

Codecov Report

Merging #1318 (e83ab9a) into master (906b408) will increase coverage by 0.08%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #1318      +/-   ##
==========================================
+ Coverage   84.69%   84.77%   +0.08%     
==========================================
  Files         284      287       +3     
  Lines       13916    14231     +315     
  Branches      673      732      +59     
==========================================
+ Hits        11786    12065     +279     
- Misses       2130     2166      +36     
| Impacted Files | Coverage Δ |
| --- | --- |
| ...rosoft/azure/synapse/ml/train/TrainRegressor.scala | 86.53% <0.00%> (-0.97%) ⬇️ |
| ...osoft/azure/synapse/ml/train/TrainClassifier.scala | 82.57% <0.00%> (-0.76%) ⬇️ |
| ...oft/azure/synapse/ml/core/schema/SparkSchema.scala | 82.60% <0.00%> (-0.73%) ⬇️ |
| ...oft/azure/synapse/ml/io/http/HTTPTransformer.scala | 93.47% <0.00%> (-0.14%) ⬇️ |
| ...ala/org/apache/spark/ml/param/DataFrameParam.scala | 70.83% <0.00%> (ø) |
| ...osoft/azure/synapse/ml/core/contracts/Params.scala | 95.65% <0.00%> (ø) |
| ...azure/synapse/ml/geospatial/AzureMapsHelpers.scala | |
| .../azure/synapse/ml/geospatial/AzureMapsSearch.scala | |
| ...rosoft/azure/synapse/ml/geospatial/Geocoders.scala | 95.65% <0.00%> (ø) |
| ...synapse/ml/cognitive/TextAnalyticsSDKSchemas.scala | 81.19% <0.00%> (ø) |

... and 9 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8b64094...e83ab9a. Read the comment docs.

@mhamilton723
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@memoryz
Contributor

memoryz commented Jan 13, 2022

@ezherdeva please reformat the notebook in jupyter lab so it's easier to code review and comment.

@memoryz
Contributor

memoryz commented Jan 13, 2022

Also, please move the notebook to the notebooks/features/responsible_ai folder.

@ezherdeva
Contributor Author

ezherdeva commented Jan 14, 2022

I've changed everything according to the comments. Also, I reformatted the JSON view, so it's easier to comment on now. Please have a look @memoryz

@ezherdeva
Contributor Author

/azp run

@azure-pipelines

Commenter does not have sufficient privileges for PR 1318 in repo microsoft/SynapseML

@mhamilton723
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723 mhamilton723 merged commit 059732a into microsoft:master Jan 14, 2022