# Core Basics 1: Train, Evaluate and Deploy a Classifier
In this lesson we will learn how to train, evaluate and deploy classifiers with Khiops.

Make sure you have installed [Khiops](https://khiops.org/setup/) and [Khiops Visualization](https://khiops.org/setup/visualization/).

We start by importing Khiops and defining some helper functions:

In [None]:
import os
from itertools import islice

from khiops import core as kh


# Define peek helper function
def peek(file_path, n=10):
    """Shows the first n lines of a file"""
    with open(file_path, encoding="utf8", errors="replace") as file:
        # islice stops after n lines instead of reading the whole file
        for line in islice(file, n):
            print(line, end="")
    print("")


# If there are any issues, you may print Khiops status with the following command:
# kh.get_runner().print_status()

## Training a Classifier
We'll train a classifier for the `Iris` dataset. This classic dataset describes plants belonging to the genus _Iris_. It contains 150 records, 50 for each of three variants of _Iris_: _Setosa_, _Virginica_ and _Versicolor_. Each record contains the length and width of the plant's petal and sepal. The standard task for this dataset is to build a classifier that predicts the type of _Iris_ from these length and width measurements.

To train a classifier with Khiops, we need two types of files:
- A plain-text delimited data file (for example a `csv` file)
- A _dictionary_ file which describes the schema of the above data table (`.kdic` file extension)
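
As a rough illustration of the first file type, the sketch below writes a tiny tab-delimited table to a temporary file and reads it back with Python's `csv` module. The column names and the two records are made-up stand-ins, not the contents of the Khiops samples, and the tab separator is an assumption.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for a plain-text delimited data file:
# a header line followed by one record per line, fields separated by tabs
sample_rows = [
    ["Length", "Width", "Label"],
    ["5.1", "3.5", "a"],
    ["6.2", "2.9", "b"],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf8", newline="") as file:
        csv.writer(file, delimiter="\t").writerows(sample_rows)

    # Read the file back: the first line gives the column names
    with open(sample_path, encoding="utf8", newline="") as file:
        reader = csv.reader(file, delimiter="\t")
        header = next(reader)
        records = list(reader)

print(f"Columns: {header}")
print(f"Number of records: {len(records)}")
```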


Let's store the locations of these files for the `Iris` dataset in variables and then take a look at their contents:

In [None]:
iris_kdic = os.path.join(kh.get_samples_dir(), "Iris", "Iris.kdic")
iris_data_file = os.path.join(kh.get_samples_dir(), "Iris", "Iris.txt")

print(f"Iris dictionary file: {iris_kdic}")
peek(iris_kdic)
print(f"Iris data file: {iris_data_file}\n")
peek(iris_data_file)

Note that the _Iris_ variant information is in the column `Class`. Now let's specify the path to the analysis report file.

In [None]:
analysis_report_file_path_Iris = os.path.join("exercises", "Iris", "AnalysisReport.khj")

print(f"Iris analysis report file path: {analysis_report_file_path_Iris}")

We are now ready to train the classifier with the Khiops function `train_predictor`. This function returns a tuple containing the locations of two files:
- the modeling report (`AnalysisReport.khj`): A JSON file containing information such as the informativeness of each variable, the variables selected for the model and performance metrics. It is saved at the path stored in the `analysis_report_file_path_Iris` variable that we just defined.
- the model's _dictionary_ file (`AnalysisReport.model.kdic`): An enriched version of the initial dictionary file that contains the model. It can be used to make predictions on new data.

In [None]:
iris_report, iris_model_kdic = kh.train_predictor(
    iris_kdic,
    dictionary_name="Iris",
    data_table_path=iris_data_file,
    target_variable="Class",
    analysis_report_file_path=analysis_report_file_path_Iris,
    max_trees=0,  # disable tree variables (by default Khiops constructs 10)
)
print(f"Iris report file: {iris_report}")
print(f"Iris modeling dictionary: {iris_model_kdic}")

Note that `iris_report` (the first element of the tuple returned by `train_predictor`) is identical to `analysis_report_file_path_Iris`.
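
The file names quoted above suggest that the model dictionary sits next to the report and shares its base name. Assuming that naming convention holds (it is inferred from the names in this lesson, not a documented guarantee), the relationship can be sketched with `os.path`. The helper `expected_model_dictionary_path` is hypothetical and not part of Khiops; the value returned by `train_predictor` remains the authoritative location.

```python
import os


def expected_model_dictionary_path(report_path):
    """Guesses the model dictionary path from the report path (assumed naming)"""
    report_dir, report_name = os.path.split(report_path)
    stem, _ = os.path.splitext(report_name)
    return os.path.join(report_dir, f"{stem}.model.kdic")


report_path = os.path.join("exercises", "Iris", "AnalysisReport.khj")
print(expected_model_dictionary_path(report_path))
```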

In the next sections, we'll use the file at `iris_report` to assess the model's performance and the file at `iris_model_kdic` to deploy it. Now we can have a look at the report with the Khiops Visualization app:

In [None]:
# To visualize uncomment the line below
# kh.visualize_report(iris_report)
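
Since the report is a JSON file, it can also be opened with Python's standard `json` module when the visualization app is not at hand. The following is a minimal sketch that runs on a made-up stand-in file, because the real report's keys are not described in this lesson. `list_top_level_keys` is a hypothetical helper, not a Khiops function.

```python
import json
import os
import tempfile


def list_top_level_keys(json_file_path):
    """Returns the sorted top-level keys of a JSON file"""
    with open(json_file_path, encoding="utf8") as file:
        contents = json.load(file)
    return sorted(contents)


with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in file with invented keys; replace with a real report path
    stand_in_path = os.path.join(tmp_dir, "StandIn.khj")
    with open(stand_in_path, "w", encoding="utf8") as file:
        json.dump({"sectionB": {}, "sectionA": {}}, file)
    keys = list_top_level_keys(stand_in_path)

print(keys)
```

After training, calling `list_top_level_keys(iris_report)` would list the sections of the actual report.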

### Exercise

We'll repeat the previous steps on the `Adult` dataset. This dataset contains characteristics of the adult population in the USA such as age, gender and education. The task is to predict the variable `class`, which indicates whether the individual earns `more` or `less` than 50,000 dollars.

Let's start by storing the paths for the `Adult` dataset in variables:

In [None]:
adult_kdic = os.path.join(kh.get_samples_dir(), "Adult", "Adult.kdic")
adult_data_file = os.path.join(kh.get_samples_dir(), "Adult", "Adult.txt")

#### Print the file locations and use the function `peek` to list their contents

In [None]:
print(f"Adult dictionary file: {adult_kdic}")
peek(adult_kdic)
print(f"Adult data file: {adult_data_file}\n")
peek(adult_data_file)

We now specify the path to the analysis report file for this exercise:

In [None]:
analysis_report_file_path_Adult = os.path.join(
    "exercises", "Adult", "AnalysisReport.khj"
)

print(f"Adult analysis report file path: {analysis_report_file_path_Adult}")

#### Train a classifier for the `Adult` dataset
Note that the name of the target variable is `class` (**in lower case!**). Do not forget to set `max_trees=0`. Save the resulting file locations into the variables `adult_report` and `adult_model_kdic` and print them.

In [None]:
adult_report, adult_model_kdic = kh.train_predictor(
    adult_kdic,
    dictionary_name="Adult",
    data_table_path=adult_data_file,
    target_variable="class",
    analysis_report_file_path=analysis_report_file_path_Adult,
    max_trees=0,
)
print(f"Adult report file: {adult_report}")
print(f"Adult modeling dictionary file: {adult_model_kdic}")

#### Inspect the results with the Khiops Visualization app

In [None]:
# To visualize uncomment the line below
# kh.visualize_report(adult_report)

## Accessing a Classifier's Basic Evaluation Metrics

We access the classifier's evaluation metrics by loading the file at `iris_report` with the Khiops function `read_analysis_results_file`:

In [None]:
iris_results = kh.read_analysis_results_file(iris_report)
print(type(iris_results))

The resulting object is an instance of the `AnalysisResults` class. The model evaluation reports are stored in its `train_evaluation_report` and `test_evaluation_report` attributes, which are of class `EvaluationReport`.

In [None]:
iris_train_eval = iris_results.train_evaluation_report
iris_test_eval = iris_results.test_evaluation_report
print(type(iris_train_eval))
print(type(iris_test_eval))

We access the metrics of the default predictor (SNB, short for Selective Naive Bayes) with the `get_snb_performance` method of the evaluation report objects:

In [None]:
iris_train_performance = iris_train_eval.get_snb_performance()
iris_test_performance = iris_test_eval.get_snb_performance()

These objects are of class `PredictorPerformance`. They expose the `accuracy` and `auc` attributes:

In [None]:
print(f"Iris train accuracy: {iris_train_performance.accuracy}")
print(f"Iris test accuracy:  {iris_test_performance.accuracy}")
print("")
print(f"Iris train AUC: {iris_train_performance.auc}")
print(f"Iris test AUC:  {iris_test_performance.auc}")
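
To make the two metrics concrete, here is a minimal pure-Python sketch of how accuracy and a two-class AUC can be computed from labels and scores. It is only an illustration: the data is made up, the helper names are hypothetical, and it does not claim to reproduce how Khiops computes these values, in particular for problems with more than two classes such as `Iris`.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of records whose predicted label equals the true label"""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


def binary_auc(true_labels, scores, positive_label):
    """Probability that a random positive record outscores a random negative one

    Ties between a positive and a negative score count as one half.
    """
    positives = [s for t, s in zip(true_labels, scores) if t == positive_label]
    negatives = [s for t, s in zip(true_labels, scores) if t != positive_label]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, for illustration only
true_labels = ["more", "less", "less", "more", "less"]
scores = [0.9, 0.2, 0.6, 0.5, 0.1]  # hypothetical P(label == "more")
predicted_labels = ["more" if s >= 0.5 else "less" for s in scores]

print(f"accuracy: {accuracy(true_labels, predicted_labels)}")
print(f"AUC:      {binary_auc(true_labels, scores, 'more')}")
```

Accuracy depends on the decision threshold (0.5 here), whereas AUC only depends on how the scores rank the records.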

### Exercise
#### Read the contents of the file at `adult_report` for the Adult analysis and print its type

In [None]:
adult_results = kh.read_analysis_results_file(adult_report)
print(type(adult_results))

#### Save the evaluation reports of the `Adult` classification to the variables `adult_train_eval` and `adult_test_eval`

In [None]:
adult_train_eval = adult_results.train_evaluation_report
adult_test_eval = adult_results.test_evaluation_report

#### Show the model's train and test accuracies and AUCs

In [None]:
adult_train_performance = adult_train_eval.get_snb_performance()
adult_test_performance = adult_test_eval.get_snb_performance()
print(f"Adult train accuracy: {adult_train_performance.accuracy}")
print(f"Adult test accuracy:  {adult_test_performance.accuracy}")
print("")
print(f"Adult train AUC: {adult_train_performance.auc}")
print(f"Adult test AUC:  {adult_test_performance.auc}")

## Deploying a Classifier
We are going to deploy the `Iris` classifier we have just trained, applying it to the same dataset (normally we would do this on new data). The model is saved in the file at `iris_model_kdic`. This file is usually large and hard to read, so you should know what you are doing before editing it. Let's take a quick look at its contents:

In [None]:
peek(iris_model_kdic, 25)

Note that the modeling dictionary contains four used variables:
- `PredictedClass`: The class with the highest probability according to the model
- `ProbClassIris-setosa`, `ProbClassIris-versicolor`, `ProbClassIris-virginica`: The probabilities of each class according to the model

These will be the columns of the table obtained after deploying the model. This table will be saved at `iris_deployment_file`.

In [None]:
iris_deployment_file = os.path.join("exercises", "Iris", "iris_deployment.txt")
kh.deploy_model(
    iris_model_kdic,
    dictionary_name="SNB_Iris",
    data_table_path=iris_data_file,
    output_data_table_path=iris_deployment_file,
)

peek(iris_deployment_file)
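
The relationship between the deployed columns (the predicted class is the one with the highest probability) can be checked with a short sketch. It runs on a made-up excerpt rather than the real deployment file: the probability values are invented and the tab separator is an assumption.

```python
import csv
import io

# Made-up excerpt mimicking the deployed columns (invented probabilities)
deployed_text = "\n".join(
    [
        "\t".join(
            [
                "PredictedClass",
                "ProbClassIris-setosa",
                "ProbClassIris-versicolor",
                "ProbClassIris-virginica",
            ]
        ),
        "Iris-setosa\t0.98\t0.01\t0.01",
        "Iris-virginica\t0.02\t0.18\t0.80",
    ]
)

prefix = "ProbClass"
mismatches = 0
for row in csv.DictReader(io.StringIO(deployed_text), delimiter="\t"):
    # Collect the probability of each class from the ProbClass* columns
    probabilities = {
        column[len(prefix):]: float(value)
        for column, value in row.items()
        if column.startswith(prefix)
    }
    most_probable = max(probabilities, key=probabilities.get)
    if most_probable != row["PredictedClass"]:
        mismatches += 1

print(f"Rows where PredictedClass is not the most probable class: {mismatches}")
```

To run the same check on the real output, the `io.StringIO` object could be replaced by the file opened at `iris_deployment_file`, provided the separator assumption holds.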

### Exercise
#### Use the `deploy_model` function to deploy the model stored in the file at `adult_model_kdic`
Which columns are deployed?

In [None]:
adult_deployment_file = os.path.join("exercises", "Adult", "adult_deployment.txt")
kh.deploy_model(
    adult_model_kdic,
    dictionary_name="SNB_Adult",
    data_table_path=adult_data_file,
    output_data_table_path=adult_deployment_file,
)
peek(adult_deployment_file)