# Sklearn Basics 4: Train a Coclustering
The steps to train a coclustering model with Khiops are very similar to what we have already seen in the basic classifier tutorials.

We start by importing the sklearn estimator `KhiopsCoclustering`:

In [None]:
import os
import platform
import subprocess
import pandas as pd
from khiops import core as kh
from khiops.sklearn import KhiopsCoclustering

# If there are any issues, you may print the Khiops status with the following command
# kh.get_runner().print_status()

For this tutorial, we use the dataset `CountriesByOrganization`, which contains the country-organization relation for a large number of countries and organizations (*it is a bit outdated though*). The objective is to build a coclustering between Country and Organization and see which countries are most similar in terms of the organizations they belong to.

Let's first load this dataset and check its content:

In [None]:
countries_data_file = os.path.join(
    "data", "CountriesByOrganization", "CountriesByOrganization.csv"
)
X_countries = pd.read_csv(countries_data_file, sep=";")
print("CountriesByOrganization dataset:")
display(X_countries)

Now, let's build the coclustering model.

Note that a coclustering model is learned in an unsupervised way and aims to jointly cluster the rows and columns of a matrix. To deploy the model later, we need to tell it which column identifies the entities to be assigned to clusters. We do this by setting the `fit` parameter `id_column`:

In [None]:
khcc_countries = KhiopsCoclustering()
khcc_countries.fit(X_countries, id_column="Country")
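
To make the "matrix" view concrete, the sketch below builds country-by-organization co-occurrence counts from a handful of made-up pairs, using only the standard library. The sample pairs and variable names are illustrative assumptions, not values from the dataset, and this is not how Khiops computes its model; it only shows the kind of table whose rows and columns get grouped jointly.

```python
from collections import Counter

# Hypothetical (country, organization) pairs, one relation per record
pairs = [
    ("France", "EU"),
    ("France", "NATO"),
    ("Germany", "EU"),
    ("Germany", "NATO"),
    ("Japan", "G7"),
    ("France", "G7"),
]

# Count how often each (row, column) cell occurs
cell_counts = Counter(pairs)

# Row and column labels of the implied matrix
countries = sorted({country for country, _ in pairs})
organizations = sorted({organization for _, organization in pairs})

# Dense matrix view: one row per country, one column per organization
matrix = [
    [cell_counts[(country, organization)] for organization in organizations]
    for country in countries
]

print("columns:", organizations)
for country, row in zip(countries, matrix):
    print(country, row)
```

In this toy table, France and Germany share most of their columns, which is the kind of similarity a coclustering is meant to pick up.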

Now let's access the coclustering training report to obtain the cluster information of the `Country` dimension. Since the clusters of each dimension form a hierarchy, we only access the leaf clusters:

In [None]:
countries_clusters = khcc_countries.model_report_.coclustering_report.get_dimension(
    "Country"
).clusters
countries_leaf_clusters = [cluster for cluster in countries_clusters if cluster.is_leaf]
print(f"Number of leaf clusters: {len(countries_leaf_clusters)}:")
for index, cluster in enumerate(countries_leaf_clusters, start=1):
    print(f"cluster {index:02d}: {cluster.name}")
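
The leaf filtering can be tried in isolation on a toy hierarchy. In the sketch below, `ToyCluster` is a hypothetical stand-in with only a `name` and an `is_leaf` attribute; it is not the Khiops cluster class, and the node names are made up.

```python
from dataclasses import dataclass


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a cluster entry; not the Khiops class."""

    name: str
    is_leaf: bool


# A flat list holding every node of a small hierarchy, inner nodes included
toy_clusters = [
    ToyCluster("root", False),
    ToyCluster("A+B", False),
    ToyCluster("A", True),
    ToyCluster("B", True),
    ToyCluster("C", True),
]

# Keep only the nodes that have no children
toy_leaves = [cluster for cluster in toy_clusters if cluster.is_leaf]
leaf_names = [cluster.name for cluster in toy_leaves]
print(leaf_names)
```

The inner nodes (`root`, `A+B`) are merges of the leaves, so counting them as well would overstate the number of clusters.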

The composition of the clusters is also available. For the first one we have:

In [None]:
print(f"Members of the cluster {countries_leaf_clusters[0].name}:")
for value_obj in countries_leaf_clusters[0].leaf_part.values:
    print(value_obj.value)

The coclustering is a complex model, so it is better to visualize it with the Khiops Co-visualization app. Let's export the report to a `.khcj` file and open it:

In [None]:
countries_report = os.path.join("exercises", "countries.khcj")
khcc_countries.export_report_file(countries_report)
# To visualize it, open the exported file with the Khiops Co-visualization app
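
The report path points into an `exercises` subdirectory. Whether the export creates missing directories is not stated here, so a defensive option is to create the directory beforehand. The helper name `ensure_parent_dir` below is hypothetical, and the demonstration runs in a temporary directory.

```python
import os
import tempfile


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if it is missing (hypothetical helper)."""
    parent = os.path.dirname(file_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return file_path


# Demonstrate in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    report_path = ensure_parent_dir(
        os.path.join(tmp_dir, "exercises", "countries.khcj")
    )
    parent_exists = os.path.isdir(os.path.dirname(report_path))
    print(parent_exists)
```

The helper returns the path unchanged, so it can wrap the `os.path.join(...)` call that builds the report path.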

Finally, let's deploy the coclustering model on the training data `X_countries`:

In [None]:
countries_predictions = khcc_countries.predict(X_countries)
print("Predicted clusters (first 10)")
display(countries_predictions[:10])
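
A quick way to inspect a deployment is to count how many rows land in each cluster. The sketch below assumes the predictions can be reduced to one cluster label per row; the labels are made up and do not reproduce the actual output format of `predict`.

```python
from collections import Counter

# Hypothetical cluster labels, one per deployed row (not real Khiops output)
toy_labels = ["{FRA, DEU}", "{FRA, DEU}", "{JPN}", "{FRA, DEU}", "{JPN}", "{BRA}"]

# Count rows per cluster and list clusters from largest to smallest
cluster_sizes = Counter(toy_labels)
ranked = cluster_sizes.most_common()
for label, size in ranked:
    print(f"{label}: {size}")
```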

### Exercise
We'll build a coclustering model for the `Tokyo2021` dataset. It is extracted from the `Athletes` table of the [Tokyo 2021 Kaggle dataset](https://www.kaggle.com/arjunprasadsarkhel/2021-olympics-in-tokyo) and each record contains three variables:
- `Name`: the name of a competing athlete
- `Country`: the country (or organization) the athlete represents
- `Discipline`: the athlete's discipline

The objective of this exercise is to build a coclustering between `Country` and `Discipline` and see which countries are most similar in terms of the athletes they bring to the Olympics. We start by loading the contents into a dataframe:

In [None]:
tokyo_data_file = os.path.join("data", "Tokyo2021", "Athletes.csv")
X_tokyo = pd.read_csv(tokyo_data_file, encoding="latin1")
print("Tokyo2021 dataset (first 10 rows):")
display(X_tokyo.head(10))

#### Train the coclustering for the variables `Country` and `Discipline`

Call `fit` with the following parameters:
- `X=X_tokyo[["Country", "Discipline"]]`
- `id_column="Country"`

In [None]:
khcc_tokyo = KhiopsCoclustering()
khcc_tokyo.fit(X_tokyo[["Country", "Discipline"]], id_column="Country")

#### Obtain the number and names of the clusters of the `Country` dimension

In [None]:
tokyo_clusters = khcc_tokyo.model_report_.coclustering_report.get_dimension(
    "Country"
).clusters
tokyo_leaf_clusters = [cluster for cluster in tokyo_clusters if cluster.is_leaf]
print(f"Number of leaf clusters: {len(tokyo_leaf_clusters)}:")
for index, cluster in enumerate(tokyo_leaf_clusters):
    print(f"cluster {index:02d}: {cluster.name}")

#### Print the members of one of the clusters

In [None]:
print(f"Members of the cluster {tokyo_leaf_clusters[29].name}:")
for value_obj in tokyo_leaf_clusters[29].leaf_part.values:
    print(value_obj.value)

#### Check the results with the Co-visualization app

In [None]:
tokyo_report = os.path.join("exercises", "tokyo.khcj")
khcc_tokyo.export_report_file(tokyo_report)

# To visualize it, open the exported file with the Khiops Co-visualization app

#### Deploy the learned coclustering model on the training data and check the obtained clusters

In [None]:
tokyo_predictions = khcc_tokyo.predict(X_tokyo[["Country", "Discipline"]])
print("Predicted clusters (first 10)")
display(tokyo_predictions[:10])