# Multi-Table Classifier with the core API

In this notebook, we will learn how to train a classifier on a simple multi-table dataset. We recommend reading the [single table tutorial](../single_table_classifier_core) first and understanding the basics of [Khiops dictionary files](../kdic_intro).


In [1]:
import warnings
import pandas as pd
from khiops import core as kh
from khiops.tools import download_datasets

# Download the sample datasets from GitHub if not available
warnings.filterwarnings("ignore", message="Download.*") # Ignore dataset download warning
download_datasets()

## The _AccidentsSummary_ Dataset
We'll train a multi-table classifier on the `AccidentsSummary` dataset. This dataset describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:

```
Accidents
|
+----1:n----Vehicles
```
Let's first check the content of the tables:

- The main table `Accidents`
- The secondary table `Vehicles` in a `1:n` relation with `Accidents`

In [8]:
# Store the locations of the `AccidentsSummary` dataset
accidents_table_path = f"{kh.get_samples_dir()}/AccidentsSummary/Accidents.txt"
vehicles_table_path = f"{kh.get_samples_dir()}/AccidentsSummary/Vehicles.txt"

# Print the first lines of the data files
print("Accidents table:")
display(pd.read_csv(accidents_table_path, sep="\t", encoding="latin1").head(5))
print("Vehicles table:")
display(pd.read_csv(vehicles_table_path, sep="\t", encoding="latin1").head(5))

Accidents table:


| | AccidentId | Gravity | Date | Hour | Light | Department | Commune | InAgglomeration | IntersectionType | Weather | CollisionType | PostalAddress |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201800000001 | NonLethal | 2018-01-24 | 15:05:00 | Daylight | 590 | 5 | No | Y-type | Normal | 2Vehicles-BehindVehicles-Frontal | route des Ansereuilles |
| 1 | 201800000002 | NonLethal | 2018-02-12 | 10:15:00 | Daylight | 590 | 11 | Yes | Square | VeryGood | NoCollision | Place du général de Gaul |
| 2 | 201800000003 | NonLethal | 2018-03-04 | 11:35:00 | Daylight | 590 | 477 | Yes | T-type | Normal | NoCollision | Rue nationale |
| 3 | 201800000004 | NonLethal | 2018-05-05 | 17:35:00 | Daylight | 590 | 52 | Yes | NoIntersection | VeryGood | 2Vehicles-Side | 30 rue Jules Guesde |
| 4 | 201800000005 | NonLethal | 2018-06-26 | 16:05:00 | Daylight | 590 | 477 | Yes | NoIntersection | Normal | 2Vehicles-Side | 72 rue Victor Hugo |


Vehicles table:


| | AccidentId | VehicleId | Direction | Category | PassengerNumber | FixedObstacle | MobileObstacle | ImpactPoint | Maneuver |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 201800000001 | A01 | Unknown | Car<=3.5T | 0 | | Vehicle | RightFront | TurnToLeft |
| 1 | 201800000001 | B01 | Unknown | Car<=3.5T | 0 | | Vehicle | LeftFront | NoDirectionChange |
| 2 | 201800000002 | A01 | Unknown | Car<=3.5T | 0 | | Pedestrian | | NoDirectionChange |
| 3 | 201800000003 | A01 | Unknown | Motorbike>125cm3 | 0 | StationaryVehicle | Vehicle | Front | NoDirectionChange |
| 4 | 201800000003 | B01 | Unknown | Car<=3.5T | 0 | | Vehicle | LeftSide | TurnToLeft |


To train a classifier with the Khiops core API on a multi-table dataset, we must specify the schema of the dataset.
The schema is specified in a Khiops dictionary file. Let's look at its contents for the `AccidentsSummary` dataset:

In [3]:
accidents_kdic_path = f"{kh.get_samples_dir()}/AccidentsSummary/Accidents.kdic"
with open(accidents_kdic_path) as accidents_kdic_file:
    print(accidents_kdic_file.read())

Root Dictionary Accident(AccidentId)
{
  Categorical AccidentId;
  Categorical	Gravity;
  Date Date;
  Time Hour;
  Categorical Light;
  Categorical Department;
  Categorical Commune;
  Categorical InAgglomeration;
  Categorical IntersectionType;
  Categorical Weather;
  Categorical CollisionType;
  Categorical PostalAddress;
  Table(Vehicle) Vehicles;
};

Dictionary Vehicle(AccidentId, VehicleId)
{
 Categorical AccidentId;
 Categorical VehicleId;
 Categorical Direction;
 Categorical Category;
 Numerical PassengerNumber;
 Categorical FixedObstacle;
 Categorical MobileObstacle;
 Categorical ImpactPoint;
 Categorical Maneuver;
};



Note that the `Accident` dictionary contains a special `Table` variable. This variable creates a `1:n` relation: the dictionary of the secondary table is given as its argument, between parentheses (`Vehicle`).
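To make the `1:n` relation concrete, here is a minimal plain-Python sketch, independent of Khiops. It copies a few rows from the table previews above (only some columns are kept) and groups the vehicles under their accident through the shared `AccidentId` key. The variable names are illustrative.

```python
from collections import defaultdict

# Rows copied from the table previews above (only a few columns kept)
accidents = [
    {"AccidentId": "201800000001", "Gravity": "NonLethal"},
    {"AccidentId": "201800000002", "Gravity": "NonLethal"},
    {"AccidentId": "201800000003", "Gravity": "NonLethal"},
]
vehicles = [
    {"AccidentId": "201800000001", "VehicleId": "A01", "Category": "Car<=3.5T"},
    {"AccidentId": "201800000001", "VehicleId": "B01", "Category": "Car<=3.5T"},
    {"AccidentId": "201800000002", "VehicleId": "A01", "Category": "Car<=3.5T"},
    {"AccidentId": "201800000003", "VehicleId": "A01", "Category": "Motorbike>125cm3"},
    {"AccidentId": "201800000003", "VehicleId": "B01", "Category": "Car<=3.5T"},
]

# Group the secondary records by the key of the main table
vehicles_by_accident = defaultdict(list)
for vehicle in vehicles:
    vehicles_by_accident[vehicle["AccidentId"]].append(vehicle["VehicleId"])

# Each accident (1) is linked to a list of vehicles (n)
for accident in accidents:
    accident_id = accident["AccidentId"]
    print(accident_id, vehicles_by_accident[accident_id])
```

This is why the key of `Vehicle`, `(AccidentId, VehicleId)`, starts with the key of `Accident`: the first field identifies the parent record.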

## Training the Classifier

While the dictionary file specifies the table schemas and their relations, it does not contain any information about the data files. In a single-table task, the third mandatory parameter of `train_predictor` specifies the data table file. In a multi-table task, this parameter still specifies the main table file; the files of the other tables are given with the optional parameter `additional_data_tables`.

The `additional_data_tables` parameter is a Python `dict` whose keys are the data paths of the secondary tables and whose values are the corresponding file paths (in our case, just a single pair). For more information about data paths, see the basics of [Khiops dictionary files](../kdic_intro).
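In this notebook the data-path keys have the form `` Accident`Vehicles ``: the name of the main dictionary, a backtick, then the name of the `Table` variable. The sketch below builds such a `dict` in plain Python. The helper `build_data_path` and the file path are hypothetical and exist only for illustration; they are not part of the Khiops API.

```python
def build_data_path(root_dictionary_name, *table_variable_names):
    """Hypothetical helper: join the names with a backtick, as in this notebook."""
    return "`".join((root_dictionary_name,) + table_variable_names)


# Placeholder file path, for illustration only
vehicles_file = "AccidentsSummary/Vehicles.txt"

# One {"data-path": "file-path"} pair per secondary table
additional_data_tables = {
    build_data_path("Accident", "Vehicles"): vehicles_file,
}
print(additional_data_tables)
```

The key depends on the name of the main dictionary, so a call that uses a different main dictionary needs a different key.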

By default, Khiops constructs at most 100 multi-table variables (`max_constructed_variables`) and 10 random decision trees (`max_trees`). For this example, we raise the first limit to 1000 and disable the trees:

In [4]:
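model_report_path, model_kdic_path = kh.train_predictor(
    accidents_kdic_path,          # Path of the dictionary file
    "Accident",                   # Name of the main dictionary
    accidents_table_path,         # Path of the main data table file
    "Gravity",                    # Name of the target variable
    "./mt_results",               # Directory where the results are written
    additional_data_tables={      # Pairs of {"data-path": "file-path"} describing the other tables
        "Accident`Vehicles": vehicles_table_path,
    },
    max_constructed_variables=1000,  # Construct at most 1000 multi-table variables
    max_trees=0,                     # Do not build any random decision trees
)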

## Printing the accuracy and AUC of the model

To obtain the performance metrics, we load the model report file into the variable `model_report`, as in the previous "Single Table Classifier" tutorial.

In [5]:
model_report = kh.read_analysis_results_file(model_report_path)
accidents_train_performance = model_report.train_evaluation_report.get_snb_performance()
accidents_test_performance = model_report.test_evaluation_report.get_snb_performance()

print(f"Accidents train accuracy: {accidents_train_performance.accuracy}")
print(f"Accidents train auc     : {accidents_train_performance.auc}")
print(f"Accidents test accuracy : {accidents_test_performance.accuracy}")
print(f"Accidents test auc      : {accidents_test_performance.auc}")

Accidents train accuracy: 0.944848
Accidents train auc     : 0.816594
Accidents test accuracy : 0.945303
Accidents test auc      : 0.817946
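Accuracy and AUC measure different things: accuracy counts correct class predictions, while AUC measures how well the predicted probabilities rank one class above the other. The following plain-Python sketch shows how each could be computed. The labels and probabilities are made up for illustration and are not taken from `AccidentsSummary`, and this is not necessarily how Khiops computes its metrics internally.

```python
# Hypothetical labels and Lethal-class probabilities (not taken from the dataset)
labels = ["Lethal"] + ["NonLethal"] * 9
prob_lethal = [0.30, 0.10, 0.05, 0.40, 0.02, 0.08, 0.03, 0.07, 0.01, 0.06]

# Accuracy: predict the most probable class and count the correct predictions
predictions = ["Lethal" if p > 0.5 else "NonLethal" for p in prob_lethal]
accuracy = sum(pred == label for pred, label in zip(predictions, labels)) / len(labels)

# AUC: fraction of (Lethal, NonLethal) pairs where the Lethal record gets the
# higher Lethal probability; ties count as one half
positives = [p for p, label in zip(prob_lethal, labels) if label == "Lethal"]
negatives = [p for p, label in zip(prob_lethal, labels) if label == "NonLethal"]
correct_pairs = sum(
    1.0 if pos > neg else 0.5 if pos == neg else 0.0
    for pos in positives
    for neg in negatives
)
auc = correct_pairs / (len(positives) * len(negatives))

print(f"accuracy: {accuracy:.3f}, auc: {auc:.3f}")
```

In this toy data every record is predicted `NonLethal`, so the accuracy only reflects the share of the majority class, whereas the AUC still depends on how the single `Lethal` record is ranked.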


## Deploying the Classifier

We are now going to deploy the `Accidents` classifier that we have just trained.

To this end, we use the model dictionary file that the `train_predictor` function created, in conjunction with the `deploy_model` core API function. Note that the name of the model dictionary is `SNB_Accident`.

As in the model training, we must set the `additional_data_tables` parameter to take the secondary table into account.

For simplicity, we'll just deploy on the whole data table file (one usually would do this on new data):

In [6]:
accidents_deployed_path = "./mt_results/accidents_deployed.txt"
kh.deploy_model(
    model_kdic_path,             # Path of the model dictionary file
    "SNB_Accident",              # Name of the model dictionary
    accidents_table_path,        # Path of the table to deploy the model
    accidents_deployed_path,     # Path of the output (deployed) file
    additional_data_tables = {   # Pairs of {"data-path": "file-path"} describing the other tables
        "SNB_Accident`Vehicles": vehicles_table_path,
    },
)

The deployed output is written to the path stored in the variable `accidents_deployed_path`. Let's have a look at it:

In [7]:
display(pd.read_csv(accidents_deployed_path, sep="\t").head(10))

| | AccidentId | PredictedGravity | ProbGravityLethal | ProbGravityNonLethal |
|---|---|---|---|---|
| 0 | 201800000001 | NonLethal | 0.048355 | 0.951645 |
| 1 | 201800000002 | NonLethal | 0.073891 | 0.926109 |
| 2 | 201800000003 | NonLethal | 0.059189 | 0.940811 |
| 3 | 201800000004 | NonLethal | 0.03227 | 0.96773 |
| 4 | 201800000005 | NonLethal | 0.014231 | 0.985769 |
| 5 | 201800000006 | NonLethal | 0.107437 | 0.892563 |
| 6 | 201800000007 | NonLethal | 0.092529 | 0.907471 |
| 7 | 201800000008 | NonLethal | 0.06683 | 0.93317 |
| 8 | 201800000009 | NonLethal | 0.230061 | 0.769939 |
| 9 | 201800000010 | NonLethal | 0.043399 | 0.956601 |


Besides the key `AccidentId`, the deployed data table file contains three columns:
- `PredictedGravity`: the predicted class
- `ProbGravityLethal`, `ProbGravityNonLethal`: the predicted probability of each class of `Gravity`
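As a quick sanity check of these columns, the sketch below takes the first five rows of the deployed output shown above and verifies two things in plain Python: the two probabilities sum to one, and `PredictedGravity` matches the more probable class. The second check assumes the prediction is simply the most probable class, which holds for the rows shown here.

```python
# First five rows of the deployed output shown above:
# (AccidentId, PredictedGravity, ProbGravityLethal, ProbGravityNonLethal)
rows = [
    ("201800000001", "NonLethal", 0.048355, 0.951645),
    ("201800000002", "NonLethal", 0.073891, 0.926109),
    ("201800000003", "NonLethal", 0.059189, 0.940811),
    ("201800000004", "NonLethal", 0.03227, 0.96773),
    ("201800000005", "NonLethal", 0.014231, 0.985769),
]

checks = []
for accident_id, predicted, prob_lethal, prob_non_lethal in rows:
    # The class probabilities should sum to one (up to rounding)
    sums_to_one = abs(prob_lethal + prob_non_lethal - 1.0) < 1e-6
    # Assumption: the predicted class is the most probable one
    most_probable = "Lethal" if prob_lethal > prob_non_lethal else "NonLethal"
    checks.append(sums_to_one and predicted == most_probable)

print(f"rows passing both checks: {sum(checks)} of {len(rows)}")
```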