<a id="top"></a>
# AntakIA tutorial
***
### Using AntakIA with no GUI!

AntakIA helps you understand and explain your _black-box_ machine-learning models, by identifying the most relevant way of segregating your dataset and the best surrogate models to apply on these freshly created regions. In this notebook, we will show you how to use the automatic dyadic-clustering algorithm of AntakIA.

> This notebook is a tutorial on how to use AntakIA without the GUI. If you want to use the GUI, please refer to the [AntakIA GUI tutorial](antakia_gui.ipynb).
> 
> For more information about AntakIA, please refer to the [AntakIA documentation](https://ai-vidence.github.io/antakia/) or go to [AI-vidence's website](https://ai-vidence.com/).

__In this notebook, you will learn how to:__
- Create a dataset object from a CSV file
- Instantiate an AntakIA object
- Manually define regions and apply sub-models
- Run the automatic dyadic-clustering algorithm
- Visualize the results

## Context

__Let's pretend that we are a real estate agent and that we want to predict the price of a house based on its characteristics.__ We have a dataset of more than 20,000 blocks of houses, each block being described by 8 features (e.g. median income of the owners, number of rooms, etc.). We also have the price of each block of houses. We train a machine-learning model (in our case, a simple gradient-boosting regressor from scikit-learn) that predicts the price of a house based on its characteristics. This is very helpful for estimating the price of a house that we want to sell!

__The main issue is the following:__ we want to explain to our customers why their house is worth a certain price. We can't just show them the machine-learning model, because it is a _black-box_ model. We need to find a way to explain the price of a house based on its characteristics. This is where AntakIA comes in handy!

We start by importing the necessary libraries.

In [1]:
import pandas as pd 

Then, our dataset. Ours is [this one](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html); it can be found in the `data` folder of this repository.

Note that we already computed some explanatory values (in our case, SHAP values) and saved them in the CSV file. This is not necessary, as AntakIA can do it, but it will save us some computation time!

In [2]:
df = pd.read_csv('../data/california_housing/california_housing.csv').drop(['Unnamed: 0'], axis=1)

X = df.iloc[:,0:8] # the dataset
Y = df.iloc[:,9] # the target variable
SHAP = df.iloc[:,[10,11,12,13,14,15,16,17]] # the SHAP values
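
Positional slicing depends on the exact column order of the CSV. As a minimal sketch, assuming the SHAP columns carry a `_shap` suffix (as the explanation-space rules later in this notebook suggest) and using a small made-up frame rather than the real file, the same kind of split can be made by name:

```python
import pandas as pd

# Toy frame standing in for the real CSV; column names and values are illustrative only.
toy = pd.DataFrame({
    "MedInc": [8.3, 7.2],
    "Latitude": [37.88, 37.86],
    "MedInc_shap": [1.9, 1.4],
    "Latitude_shap": [-0.3, -0.2],
})

# Explanation columns are those ending in "_shap" (assumed naming convention).
shap_cols = [c for c in toy.columns if c.endswith("_shap")]
# The matching feature names are obtained by stripping the suffix.
feature_cols = [c[: -len("_shap")] for c in shap_cols]

X_toy = toy[feature_cols]
SHAP_toy = toy[shap_cols]
print(list(X_toy.columns), list(SHAP_toy.columns))
```

Selecting by name keeps working if columns are later added or reordered in the file.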

We also need the model we want to explain. Here we fit scikit-learn's `GradientBoostingRegressor` to predict the price of a house.

In [3]:
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(random_state = 9)
model.fit(X, Y)
"model fitted"

'model fitted'

__Let's now import `antakia`!__

In [4]:
import antakia

## 1. Creating the dataset object

We first use the [`Dataset`](https://ai-vidence.github.io/antakia/documentation/dataset/) class to create a dataset object. This object will be used to store the data and the machine-learning model.

In [5]:
dataset = antakia.Dataset(X, model = model, y=Y)
print(f'Size of the original dataset: {len(dataset)} lines')
dataset.frac(0.1) # 10% of the original dataset is enough to explore
print(f'Size of the data we want to explore: {len(dataset)} lines')

Size of the original dataset: 20640 lines
Size of the data we want to explore: 2064 lines
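
The printed sizes are consistent with keeping one row in ten: 20640 × 0.1 = 2064. How `frac` picks the rows is not shown here. As a rough illustration only, a comparable random subsample in plain pandas (the `random_state` value is an arbitrary choice) looks like this:

```python
import pandas as pd

# Placeholder frame with as many rows as the original dataset.
full = pd.DataFrame({"row_id": range(20640)})

# Keep a random 10% of the rows; random_state makes the draw repeatable.
subset = full.sample(frac=0.1, random_state=0)
print(len(full), len(subset))
```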


## 2. Creating the AntakIA object
We then use the [`AntakIA`](https://ai-vidence.github.io/antakia/documentation/antakia/) class to create an AntakIA object. This is the main object of the package!
This is where we import our explanatory values (in our case, SHAP values).

In [6]:
atk = antakia.AntakIA(dataset, import_explanation = SHAP)

## 3. Creating custom regions
We can now create our own regions and apply sub-models on them.
Doing it this way means that we already know how to segregate (at least partly) our dataset. This is not always the case, which is why AntakIA has a GUI and an automatic dyadic-clustering algorithm (see next section). Here, we pretend that a business expert told us that a particular region is very interesting to look at. They sent it to us in a [`JSON` file](../data/business.json), as a list of the indexes of the data points that belong to this region.

A region (a set of points) is called a [`Potato`](https://ai-vidence.github.io/antakia/documentation/potato/) in AntakIA. Let's create one!



In [7]:
potato = antakia.Potato(atk, json_path="../data/business")

print(potato)

Potato:
 ------------------
       State: json importation 
       Number of points: 928 
       Percentage of the dataset: 44.96% 
       Sub-model: NoneType
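
The percentage in this summary is simply the region size divided by the size of the explored dataset (928 / 2064). A minimal sketch of that bookkeeping, assuming the JSON file holds a flat list of row indexes as described above; the indexes below are made up, not the contents of the real file:

```python
import json

# Made-up stand-in for the JSON file: a flat list of 928 row indexes.
raw = json.dumps(list(range(0, 1856, 2)))

region_indexes = json.loads(raw)
dataset_size = 2064  # size of the explored dataset printed earlier

share = 100 * len(region_indexes) / dataset_size
print(f"Number of points: {len(region_indexes)}")
print(f"Percentage of the dataset: {share:.2f}%")
```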


A `Potato` object takes as input the `AntakIA` object. __You will discover that the `AntakIA` object is the main object of the package: it links all the other objects together!__

This region is a fraction of the dataset. But what does it represent? We could use data-visualization tools to find out, such as plotting the points on a map or a scatter plot. AntakIA chooses to use the [SkopeRules](https://github.com/scikit-learn-contrib/skope-rules) tool! This tool gives us, with a precision and recall that we can define, simple rules on the features that best describe the region. Let's use it with the `applySkope` method of the Potato!

_The rules must be found both in the Values Space and in the Explanation Space. We need to specify which explanation space to consider! That's why this method takes as an argument one of the explanation spaces of the AntakIA object._

In [8]:
potato.applySkope('Imported')

Let's see what we got!

In [9]:
potato.getVSrules()

[[0.5, '<=', 'MedInc', '<=', 12.448],
 [0.444, '<=', 'AveBedrms', '<=', 1.963],
 [34.915, '<=', 'Latitude', '<=', 41.86]]
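
Each rule reads as `low <= feature <= high`, and a point matches the rule set when it satisfies all of them. As a sketch of that reading (the helper `rules_to_mask` is hypothetical, not part of AntakIA, and the three sample rows are invented), the rules can be turned into a boolean mask with pandas:

```python
import pandas as pd

rules = [
    [0.5, "<=", "MedInc", "<=", 12.448],
    [0.444, "<=", "AveBedrms", "<=", 1.963],
    [34.915, "<=", "Latitude", "<=", 41.86],
]

def rules_to_mask(frame, rules):
    """Return True for rows satisfying every `low <= feature <= high` rule."""
    mask = pd.Series(True, index=frame.index)
    for low, _, feature, _, high in rules:
        mask &= frame[feature].between(low, high)  # inclusive on both ends
    return mask

# Invented rows: the first satisfies all rules, the other two each break one.
points = pd.DataFrame({
    "MedInc": [3.2, 14.0, 5.0],
    "AveBedrms": [1.05, 1.0, 1.1],
    "Latitude": [37.4, 36.0, 32.7],
})
print(rules_to_mask(points, rules).tolist())
```

Comparing such a mask with the original region is what the scores in the next cell summarize.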

These rules define the region we are creating right now, with the following score:

_(Precision, Recall, Extract from the decision tree)_

In [27]:
potato.getVSscore()

(0.999, 1.0, 10)
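
For a rule-based description of a region, precision is usually the share of rule-matching points that lie in the region, and recall the share of region points that the rules capture. Assuming the first two entries of the score follow these standard definitions, they can be computed from two sets of indexes (toy values below):

```python
def precision_recall(region, selected):
    """Precision and recall of `selected` (points matching the rules) against `region`."""
    region, selected = set(region), set(selected)
    hits = len(region & selected)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(region) if region else 0.0
    return precision, recall

region = {1, 2, 3, 4}        # toy indexes of the points in the region
selected = {2, 3, 4, 5, 6}   # toy indexes of the points matching the rules

precision, recall = precision_recall(region, selected)
print(precision, recall)
```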

The same can be done for the __explanation space:__

In [26]:
print(potato.getESrules())
print(potato.getESscore())

[[-0.394, '<=', 'AveBedrms_shap', '<=', 0.062], [-1.282, '<=', 'Latitude_shap', '<=', 0.08], [-0.171, '<=', 'Longitude_shap', '<=', 1.307]]
(0.979, 0.921, 10)


With the rules in hand, we want to apply a linear sub-model on this region. Let's find out which sub-model fits this region best!

In [11]:
import sklearn.linear_model  # import the submodule explicitly so sklearn.linear_model is available

sub_models = [sklearn.linear_model.LinearRegression(), sklearn.linear_model.SGDRegressor(), sklearn.linear_model.Ridge()] # list of sub-models to choose from

for sub_model in sub_models:
    sub_model.fit(potato.data, potato.y)
    print(f"Score for {sub_model.__class__.__name__} : {round(sub_model.score(potato.data, potato.y),4)}")

Score for LinearRegression : 0.6796
Score for SGDRegressor : -5.967526598835133e+29
Score for Ridge : 0.6795
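
The huge negative score for `SGDRegressor` is typical of stochastic gradient descent run on unscaled features; scikit-learn's documentation recommends standardizing the inputs first. A sketch of that fix on synthetic data (the features, coefficients, and noise level below are invented for illustration, not taken from the housing dataset):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales, loosely mimicking income and population.
income = rng.uniform(0.5, 12.0, size=500)
population = rng.uniform(100.0, 5000.0, size=500)
X_demo = np.column_stack([income, population])
y_demo = 0.4 * income + 0.0001 * population + rng.normal(0.0, 0.1, size=500)

# Standardize the features before SGD so the gradient steps stay well-behaved.
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
scaled_sgd.fit(X_demo, y_demo)
scaled_score = scaled_sgd.score(X_demo, y_demo)
print(f"Scaled SGDRegressor R^2: {scaled_score:.4f}")
```

As in the loop above, the score is computed on the same data used for fitting, so it is an in-sample R².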


Let's say that the most relevant sub-model for this region is the `Ridge` sub-model. We can now apply it on this region.

In [12]:
potato.setSubModel(sklearn.linear_model.Ridge())

We are now happy with this region, so we can add it to the list of regions stored in the `AntakIA` object.

In [13]:
atk.newRegion(potato)

The regions are a list of `Potato` objects. Let's see what we have so far.

In [14]:
print(atk.regions) # 1 potato!
print()
print(atk.regions[0])

[<antakia.potato.Potato object at 0x2846d1690>]

Potato:
 ------------------
       State: lasso 
       Number of points: 928 
       Percentage of the dataset: 44.96% 
       Sub-model: Ridge


## 4. Using the automatic dyadic-clustering algorithm
We now want to ask AntakIA to find automatically the best way to segregate our dataset. We will use the automatic dyadic-clustering algorithm to do so. Let's first reset our regions, and then run the algorithm.

In [15]:
atk.resetRegions()
atk.computeDyadicClustering(sub_models = True) # compute the dyadic clustering and ask it to find the best sub-model for each region

Let's see what we have now.

In [16]:
regions = atk.getRegions()
print(f'Number of regions created: {len(regions)}')

Number of regions created: 3


Let's explore the first one as an example.

In [17]:
my_region = atk.getRegions()[0]

print(f"Size of the region: {len(my_region)}")

print('First glance at the data:')
display(my_region.getVSdata().head())

print('Which sub-model is the most appropriate for this region?') # the sub-models are chosen from a list in the AntakIA object. There are default sub-models, but you can add your own (see the documentation).
print(my_region.getSubModel())

Size of the region: 355
First glance at the data:


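|    | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|----|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 3  | 1.5933 | 37.0 | 3.998331 | 1.046745 | 2489.0 | 4.155259 | 32.69 | -117.11 |
| 5  | 5.7695 | 3.0  | 4.826263 | 1.057576 | 2522.0 | 2.547475 | 37.38 | -121.89 |
| 6  | 3.1975 | 24.0 | 4.218609 | 1.070461 | 2478.0 | 2.238482 | 34.1  | -118.13 |
| 15 | 5.2898 | 25.0 | 6.722008 | 1.028958 | 3240.0 | 3.127413 | 33.86 | -117.56 |
| 21 | 1.7514 | 34.0 | 2.660137 | 1.091087 | 2991.0 | 2.929481 | 34.09 | -118.29 |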


Which sub-model is the most appropriate for this region?
{'model': 'RandomForestRegressor', 'score': 0.9575}


Now, as a data scientist, you can do pretty much whatever you want with this data, such as visualizing it on a map!

In [18]:
import plotly.graph_objects as go

df = my_region.data

fig = go.Figure(data=go.Scattergeo(
        lon = df['Longitude'],
        lat = df['Latitude'],
        mode = 'markers',
        marker_color = df['MedInc'],
        ))
fig.update_layout(title = 'Visualisation of the region on a map!')
fig.show()

That's it! You now know how to use AntakIA without the GUI. If you want to use the GUI, please refer to the [AntakIA GUI tutorial](antakia_gui.ipynb).
***

## List of useful links

- [AntakIA documentation](https://ai-vidence.github.io/antakia/) - The official documentation of AntakIA
- [AntakIA GitHub repository](https://github.com/AI-vidence/antakia/tree/main) - The GitHub repository of AntakIA. Do not forget to __star__ it if you like it!
- [AntakIA video tutorials](https://www.youtube.com/@AI-vidence) - The YouTube channel of AI-vidence, with video tutorials on AntakIA!
- [AI-vidence's website](https://ai-vidence.com/) - The website of AI-vidence, the company behind AntakIA

[Top of Page](#top)
<img style="float: right;" src="https://raw.githubusercontent.com/AI-vidence/antakia/main/docs/img/Logo-AI-vidence.png" alt="AI-vidence" width="200px"/> 