# Simple Classifier
In this lesson, we will learn how to train, evaluate and deploy a classifier with pyKhiops sklearn.

We start by importing the pyKhiops sklearn classifier `KhiopsClassifier` and saving the location of the Khiops `Samples` directory into a variable:

In [None]:
from os import path
import pandas as pd

from khiops import core as kh
from khiops.sklearn import KhiopsClassifier

samples_dir = kh.get_runner().samples_dir
print(f"Khiops samples directory located at {samples_dir}")

## Training a Classifier

We'll train a classifier for the `Iris` dataset. This is a classical dataset describing plants of the genus _Iris_. It contains 150 records, 50 for each of three _Iris_ variants: _Setosa_, _Virginica_ and _Versicolor_. Each record contains the length and the width of both the petal and the sepal of the plant. The standard task with this dataset is to build a classifier that predicts the _Iris_ variant from the petal and sepal measurements.

To train a classifier with Khiops, we only need a dataframe, which we will load from a file.

Let's first save the location of this file into a variable `iris_data_file`, load it and take a look at its content:

In [None]:
iris_data_file = path.join(samples_dir, "Iris", "Iris.txt")
print("")
print("Iris data: first 10 records")
iris_df = pd.read_csv(iris_data_file, sep="\t")
iris_df.head(10)  # head() alone would show only 5 records
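
To see what `sep="\t"` does without relying on the Khiops samples directory, here is a minimal sketch that parses a small tab-separated text held in memory. The column names and values are made up for illustration and are not taken from `Iris.txt`.

```python
import io

import pandas as pd

# Hypothetical tab-separated content: a header line followed by two records.
toy_text = (
    "SepalLength\tPetalLength\tClass\n"
    "5.1\t1.4\tIris-setosa\n"
    "7.0\t4.7\tIris-versicolor\n"
)

# sep="\t" tells pandas that fields are separated by tabs instead of commas.
toy_loaded_df = pd.read_csv(io.StringIO(toy_text), sep="\t")

print(toy_loaded_df.dtypes)
toy_loaded_df.head()
```

Numeric columns are parsed as floats and the label column is kept as text, which is the layout we need to build a feature matrix and a target vector.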

Before training the classifier, we split the data into the feature matrix (sepal length, width, etc.) and the target vector containing the labels (the `Class` column).

In [None]:
X_iris_train = iris_df.drop("Class", axis=1)
y_iris_train = iris_df["Class"]
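
The same split can be tried on a toy dataframe to check that it behaves as expected. This is a small sketch with made-up values; `toy_df`, `X_toy` and `y_toy` are illustrative names only.

```python
import pandas as pd

# Toy stand-in for the Iris dataframe; the values are made up for illustration.
toy_df = pd.DataFrame(
    {
        "SepalLength": [5.1, 7.0, 6.3],
        "PetalLength": [1.4, 4.7, 6.0],
        "Class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    }
)

X_toy = toy_df.drop("Class", axis=1)  # every column except the target
y_toy = toy_df["Class"]  # the target column, as a pandas Series

# The split keeps one label per row and leaves the target out of the features.
assert len(X_toy) == len(y_toy)
assert "Class" not in X_toy.columns
print(X_toy.shape, y_toy.shape)
```

Note that `drop` returns a new dataframe, so the original `toy_df` still contains the `Class` column afterwards.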

Let's check the contents of the feature matrix and the target vector:

In [None]:
print("Features of the Iris dataset:")
display(X_iris_train.head())
print("")
print("Label of the Iris dataset:")
display(y_iris_train.head())

Let's now train the classifier. `KhiopsClassifier` is a pyKhiops estimator class: we create an instance and call its `fit` method, which leaves us with a model ready to classify new Iris plants.

*Note: By default Khiops builds 10 decision trees. This is not necessary for this tutorial so we set `n_trees=0`*

In [None]:
pkc_iris = KhiopsClassifier(n_trees=0)
pkc_iris.fit(X_iris_train, y_iris_train)

### Exercise


We'll repeat the same steps with the `Adult` dataset. It contains characteristics of an adult population in the USA such as age, gender and education. The task here is to predict the variable `class`, which indicates whether the individual earns `more` or `less` than 50,000 dollars.

Let's start by loading the `Adult` dataframe and checking its contents:

#### Load the adult dataset and take a look at its content

In [None]:
adult_data_file = path.join(samples_dir, "Adult", "Adult.txt")
print("")
print("Adult data: first 10 records")
adult_df = pd.read_csv(adult_data_file, sep="\t")
adult_df.head(10)  # head() alone would show only 5 records

#### Build the feature matrix and the target vector to train the `Adult` classifier
Note that the name of the target variable is `class` (**in lower case!**).

In [None]:
X_adult_train = adult_df.drop(["class"], axis=1)
y_adult_train = adult_df["class"]
print("Adult dataset feature matrix (first 10 rows):")
display(X_adult_train.head(10))
print("")
print("Adult dataset target vector (first 10 values):")
display(y_adult_train.head(10))

#### Train a classifier for the `Adult` dataset
Do not forget to set `n_trees=0`

In [None]:
pkc_adult = KhiopsClassifier(n_trees=0)
pkc_adult.fit(X_adult_train, y_adult_train)

## Accessing the Classifier's Basic Train Evaluation Metrics

Khiops calculates evaluation metrics for the training dataset. We access them via the model's attribute `model_report_`, which is an instance of the `AnalysisResults` class. Let's check this out:

In [None]:
iris_results = pkc_iris.model_report_
print(type(iris_results))

The model evaluation report is stored in the `train_evaluation_report` attribute of `iris_results`.

In [None]:
iris_train_eval = iris_results.train_evaluation_report
print(type(iris_train_eval))

We access the default predictor's metrics with the `get_snb_performance` method of `iris_train_eval`:

In [None]:
iris_train_performance = iris_train_eval.get_snb_performance()
print(type(iris_train_performance))

This object `iris_train_performance` is of class `PredictorPerformance` and has `accuracy` and `auc` attributes:

In [None]:
print(f"Iris train accuracy: {iris_train_performance.accuracy}")
print(f"Iris train AUC     : {iris_train_performance.auc}")
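
As a reminder of what the accuracy value means, here is a minimal sketch that computes it by hand on made-up labels. It does not use Khiops; `y_true` and `y_pred` are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical true labels and predictions, made up for illustration.
y_true = pd.Series(["setosa", "setosa", "versicolor", "virginica", "virginica"])
y_pred = pd.Series(["setosa", "setosa", "versicolor", "versicolor", "virginica"])

# Accuracy is the fraction of records whose prediction matches the true label.
toy_accuracy = (y_true == y_pred).mean()
print(f"Toy accuracy: {toy_accuracy}")
```

Four of the five toy predictions are correct, so the sketch reports an accuracy of 0.8.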

The `PredictorPerformance` object also has a confusion matrix attribute:

In [None]:
iris_classes = iris_train_performance.confusion_matrix.values
iris_confusion_matrix = pd.DataFrame(
    iris_train_performance.confusion_matrix.matrix,
    columns=iris_classes,
    index=iris_classes,
)
print("Iris train confusion matrix:")
iris_confusion_matrix
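
To make the link between a confusion matrix and the accuracy explicit, the following sketch builds a confusion matrix from made-up labels with `pd.crosstab`. The row and column orientation used here is a choice of this sketch, not a statement about how Khiops lays out its own matrix.

```python
import pandas as pd

# Hypothetical labels, made up for illustration.
actual = pd.Series(
    ["setosa", "setosa", "versicolor", "virginica", "virginica"], name="actual"
)
predicted = pd.Series(
    ["setosa", "setosa", "versicolor", "versicolor", "virginica"], name="predicted"
)

# In this sketch rows are predicted classes and columns are actual classes.
toy_confusion = pd.crosstab(predicted, actual)

# Diagonal cells count the correct predictions; dividing by the total number
# of records gives the accuracy.
correct = sum(toy_confusion.loc[c, c] for c in toy_confusion.index)
total = toy_confusion.to_numpy().sum()
print(toy_confusion)
print(f"Accuracy from the matrix: {correct / total}")
```

The only off-diagonal count is the single `virginica` record predicted as `versicolor`.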

### Exercise
#### Access the adult modeling report and print its type

In [None]:
adult_results = pkc_adult.model_report_
type(adult_results)

#### Save the evaluation report of the `Adult` classification into the variable `adult_train_eval`

In [None]:
adult_train_eval = adult_results.train_evaluation_report

#### Show the model's train accuracy, AUC and confusion matrix

In [None]:
adult_train_performance = adult_train_eval.get_snb_performance()
print(f"Adult train accuracy: {adult_train_performance.accuracy}")
print(f"Adult train AUC     : {adult_train_performance.auc}")

adult_classes = adult_train_performance.confusion_matrix.values
adult_confusion_matrix = pd.DataFrame(
    adult_train_performance.confusion_matrix.matrix,
    columns=adult_classes,
    index=adult_classes,
)
print("Adult train confusion matrix:")
adult_confusion_matrix

## Deploying a Classifier
We are now going to deploy the `Iris` classifier `pkc_iris` that we have just trained. For simplicity we deploy it on the training dataset; normally this is done on new data.

The learned classifier can be deployed in two different ways:

- to predict a class for each record, using the `predict` method of the model.
- to predict the probability of each class for each record, using the `predict_proba` method of the model.

Let's first predict the `Iris` labels:

In [None]:
iris_predictions = pkc_iris.predict(X_iris_train)
print("Iris model predictions (first 10 values):")
iris_predictions[:10]
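
The predictions above are made on the very records used for training. One simple way to obtain records the model has not seen is to hold out part of the dataframe before training. The sketch below shows such a split with pandas only, on a made-up dataframe; the 70/30 ratio, the seed and all names are illustrative assumptions.

```python
import pandas as pd

# Toy dataframe with 10 records; the values are made up for illustration.
toy_full_df = pd.DataFrame({"feature": range(10), "Class": ["a", "b"] * 5})

# Draw 70% of the records at random for training (fixed seed for repeatability).
toy_train_df = toy_full_df.sample(frac=0.7, random_state=0)

# The remaining records form the held-out set used for deployment.
toy_test_df = toy_full_df.drop(toy_train_df.index)

print(f"Train records: {len(toy_train_df)}, held-out records: {len(toy_test_df)}")
```

A model would then be fitted on the training part only and deployed on the held-out part.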

Let's now predict the probabilities for each `Iris` type.
Note that the column order of this matrix is given by the estimator attribute `classes_` (here `pkc_iris.classes_`):

In [None]:
iris_probas = pkc_iris.predict_proba(X_iris_train)
print(f"Iris classes {pkc_iris.classes_}")
print("Iris model probabilities for each class (first 10 rows):")
iris_probas[:10]
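
The role of the class order can be illustrated without Khiops. In this sketch a made-up probability matrix and a made-up class array stand in for the output of `predict_proba` and for `classes_`; it only shows how a column index maps back to a class label and makes no claim about how `predict` is computed internally.

```python
import numpy as np

# Hypothetical class order, standing in for the estimator's classes_ attribute.
toy_classes = np.array(["setosa", "versicolor", "virginica"])

# Hypothetical probabilities: one row per record, one column per class.
toy_probas = np.array(
    [
        [0.90, 0.07, 0.03],
        [0.10, 0.60, 0.30],
        [0.05, 0.25, 0.70],
    ]
)

# Each row is a probability distribution over the classes.
assert np.allclose(toy_probas.sum(axis=1), 1.0)

# Column j corresponds to toy_classes[j], so the index of the largest
# probability in a row gives that record's most probable class.
most_probable = toy_classes[toy_probas.argmax(axis=1)]
print(most_probable)
```

Reading the columns in a different order than the class array would attach the probabilities to the wrong labels, which is why the order should always be checked.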

### Exercise
#### Use the `predict` and `predict_proba` methods to deploy the `Adult` model `pkc_adult`
Which columns are deployed in each case?

In [None]:
adult_predictions = pkc_adult.predict(X_adult_train)
print("Adult model predictions (first 10 values):")
display(adult_predictions[:10])

adult_probas = pkc_adult.predict_proba(X_adult_train)
print(f"Adult classes {pkc_adult.classes_}")
print("Adult model probabilities for each class (first 10 rows):")
display(adult_probas[:10])