# Core Basics 3: Train a Classifier on a Snowflake Multi-Table Dataset

In this notebook, we learn how to train a classifier on more complex multi-table data, where a secondary table is itself the parent of another table (i.e. a snowflake schema). If you are not familiar with Khiops, we highly recommend going through the _Basics 1_ and _Basics 2_ lessons first.

Make sure you have installed [Khiops](https://khiops.org/setup/) and [Khiops Visualization](https://khiops.org/setup/visualization/).

We start by importing Khiops, checking its installation and defining some helper functions:

In [None]:
import os
import platform
import subprocess
from khiops import core as kh

# Define helper functions
def peek(file_path, n=10):
    """Shows the first n lines of a file"""
    with open(file_path, encoding="utf8", errors="replace") as file:
        for line in file.readlines()[:n]:
            print(line, end="")
    print("")


def os_open(path):
    """Opens a file or directory with its default application"""
    if platform.system() == "Windows":
        os.startfile(path)
    elif platform.system() == "Darwin":
        subprocess.call(["open", path])
    else:
        subprocess.call(["xdg-open", path])


# If there are any issues, you may print the Khiops status with the following command
# kh.get_runner().print_status()

### Training a Multi-Table Classifier

We'll train a multi-table classifier on an extension of the `AccidentsSummary` dataset that we used in the previous notebook, _Basics 2_. This dataset, `Accidents`, contains two additional tables, `Place` and `User`, and is organized in the following relational snowflake schema:

```
Accident
|
| -- 1:n -- Vehicle
|             |
|             |-- 1:n -- User
|
| -- 1:1 -- Place
```

Note that the target variable is `Gravity`.

To train the classifier for this setup, this schema must be encoded in the dictionary file. Let's check the contents of the `Accidents` dictionary file:

In [None]:
accidents_dataset_dir = os.path.join(kh.get_samples_dir(), "Accidents")
accidents_kdic = os.path.join(accidents_dataset_dir, "Accidents.kdic")

print(f"Accidents dictionary file location: {accidents_kdic}")
print("")
peek(accidents_kdic, n=45)

Note the following differences from the dictionary of the `AccidentsSummary` dataset:

- The schema of the main table contains one extra special variable, defined with the statement `Entity(Place) Place`, which indicates a `1:1` relationship between the `Accident` and `Place` tables.
- The main table `Accident` and the entity `Place` have the same key, `AccidentId`. Table `Vehicle` and its child table `User` both have a key made of two fields: `AccidentId` and `VehicleId`.
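To make the role of the keys concrete, here is a minimal sketch in plain Python. The rows are made up for illustration and are not taken from the dataset. It shows that a vehicle attaches to its accident through `AccidentId` alone, while a user attaches to its vehicle through the pair `(AccidentId, VehicleId)`:

```python
from collections import defaultdict

# Made-up key values for illustration only; the real tables have more columns
vehicles = [
    {"AccidentId": "A1", "VehicleId": "V1"},
    {"AccidentId": "A1", "VehicleId": "V2"},
    {"AccidentId": "A2", "VehicleId": "V1"},
]
users = [
    {"AccidentId": "A1", "VehicleId": "V1"},
    {"AccidentId": "A1", "VehicleId": "V1"},
    {"AccidentId": "A1", "VehicleId": "V2"},
    {"AccidentId": "A2", "VehicleId": "V1"},
]

# 1:n link Accident -> Vehicle uses the single key field AccidentId
vehicles_per_accident = defaultdict(int)
for vehicle in vehicles:
    vehicles_per_accident[vehicle["AccidentId"]] += 1

# 1:n link Vehicle -> User needs both key fields: in this toy example
# "V1" appears in two different accidents, so VehicleId alone is ambiguous
users_per_vehicle = defaultdict(int)
for user in users:
    users_per_vehicle[(user["AccidentId"], user["VehicleId"])] += 1

print(dict(vehicles_per_accident))
print(dict(users_per_vehicle))
```

Khiops performs this kind of matching itself from the keys declared in the dictionary file; the sketch only illustrates why the deeper tables carry the longer key.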

Now let's store the location of the tables and peek at their contents:

In [None]:
accidents_data_file = os.path.join(accidents_dataset_dir, "Accidents.txt")
print(f"Accidents data table: {accidents_data_file}")
print("")
peek(accidents_data_file)

vehicles_data_file = os.path.join(accidents_dataset_dir, "Vehicles.txt")
print(f"Vehicles data table: {vehicles_data_file}")
print("")
peek(vehicles_data_file)

places_data_file = os.path.join(accidents_dataset_dir, "Places.txt")
print(f"Places data table: {places_data_file}")
print("")
peek(places_data_file)

users_data_file = os.path.join(accidents_dataset_dir, "Users.txt")
print(f"Users data table: {users_data_file}")
print("")
peek(users_data_file)

#### Train a classifier for the `Accidents` database with 1000 variables

The call to `train_predictor` is exactly the same as in the exercise of the previous notebook, _Basics 2_. The only difference is that the `additional_data_tables` dictionary, which maps the data path of each additional table to its file, gets two new entries:

- Path of entity `Place` is ``Accident`Place``.
- Path of table `User` is ``Accident`Vehicles`Users``.
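The data paths above follow a simple pattern: start from the root dictionary name and append, separated by a backtick, the name of each table variable on the way down to the target table. A minimal sketch of that pattern follows; the `schema` mapping and the `data_paths` helper are hypothetical illustrations and not part of the Khiops API:

```python
def data_paths(root, children):
    """Return the backtick-separated data path of every secondary table.

    `children` maps a table-variable name to the nested mapping of its own
    children (an empty dict for a leaf table).
    """
    paths = []
    for name, grandchildren in children.items():
        path = f"{root}`{name}"
        paths.append(path)
        # The current path becomes the prefix of the tables below it
        paths.extend(data_paths(path, grandchildren))
    return paths


# Table-variable names as they appear in the data paths of this notebook
schema = {"Vehicles": {"Users": {}}, "Place": {}}
accident_paths = data_paths("Accident", schema)
for path in accident_paths:
    print(path)
```

Each printed path is one key of the `additional_data_tables` dictionary.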


As before, we'll ask Khiops to create 1000 additional features with its multi-table AutoML mode.

Do not forget:
- The target variable is `Gravity`
- Set `max_trees=0`

With these considerations, let's now train the classifier:

In [None]:
accidents_results_dir = os.path.join("exercises", "Accidents")
accidents_report, accidents_model_kdic = kh.train_predictor(
    accidents_kdic,
    dictionary_name="Accident",
    data_table_path=accidents_data_file,
    target_variable="Gravity",
    results_dir=accidents_results_dir,
    additional_data_tables={
        "Accident`Vehicles": vehicles_data_file,
        "Accident`Place": places_data_file,
        "Accident`Vehicles`Users": users_data_file,
    },
    max_constructed_variables=1000,
    max_trees=0,
)
print(f"Accidents report file: {accidents_report}")
print(f"Accidents modeling dictionary file: {accidents_model_kdic}")

#### Take a look at the report
Which variables best predict the gravity of an accident?

In [None]:
# To visualize uncomment the line below
# os_open(accidents_report)
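The same question can also be approached programmatically. The sketch below assumes that the report is a JSON file and that it stores per-variable statistics in a `variablesStatistics` list under `preparationReport`, each entry having `name` and `level` fields. This layout is an assumption worth checking against the report produced by your Khiops version. `top_variables` is a hypothetical helper, and `toy_report` is a hand-written stand-in with made-up names and values:

```python
import json


def top_variables(report, k=10):
    """Return the k (name, level) pairs with the highest level."""
    # Assumed layout: report["preparationReport"]["variablesStatistics"]
    stats = report.get("preparationReport", {}).get("variablesStatistics", [])
    ranked = sorted(stats, key=lambda s: s.get("level", 0.0), reverse=True)
    return [(s["name"], s.get("level", 0.0)) for s in ranked[:k]]


# Tiny hand-written stand-in for a report, for illustration only
toy_report = {
    "preparationReport": {
        "variablesStatistics": [
            {"name": "VarA", "level": 0.02},
            {"name": "VarB", "level": 0.15},
            {"name": "VarC", "level": 0.0},
        ]
    }
}
best = top_variables(toy_report, k=2)
print(best)

# With the real report (path returned by kh.train_predictor above):
# with open(accidents_report, encoding="utf8", errors="replace") as report_file:
#     print(top_variables(json.load(report_file)))
```

A higher level means the variable carries more information about `Gravity`; variables with a level of 0 are not informative.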