
# Core Basics 4: Train a Coclustering
The steps to train a coclustering model with Khiops are very similar to what we have already seen in the basic classifier tutorials.

We now execute the tutorial setup:

In [None]:
from os import path
from khiops import core as kh
from helper_functions import explorer_open, peek

As stated before, an unsupervised analysis sometimes benefits from a better-adapted visualization. We illustrate this point with the `CountriesByOrganization` dataset, which contains the country-organization relation for a large number of organizations and countries (*it is a bit outdated though*).

In [None]:
countries_kdic = path.join(
    "data", "CountriesByOrganization", "CountriesByOrganization.kdic"
)
countries_data_file = path.join(
    "data", "CountriesByOrganization", "CountriesByOrganization.csv"
)

print("")
print(f"CountriesByOrganization dictionary file location: {countries_kdic}")
print("")
peek(countries_kdic, n=15)

print("")
print(f"CountriesByOrganization data table file location: {countries_data_file}")
print("")
peek(countries_data_file)
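
To make the input of a coclustering more concrete: a table of (country, organization) pairs can be viewed as a contingency table that counts how often each pair occurs. Roughly speaking, a coclustering groups the values of both variables so that the resulting grid of cells summarizes this table. The sketch below only builds such a table with the standard library; the pairs are toy values invented for illustration and are not taken from `CountriesByOrganization.csv`.

```python
from collections import Counter

# Toy (country, organization) pairs, invented for illustration only;
# they are NOT taken from CountriesByOrganization.csv
pairs = [
    ("A", "Org1"),
    ("A", "Org2"),
    ("B", "Org1"),
    ("B", "Org2"),
    ("C", "Org3"),
]

# Count how often each (country, organization) cell occurs
cell_counts = Counter(pairs)

countries = sorted({country for country, _ in pairs})
organizations = sorted({organization for _, organization in pairs})

# Print the contingency table, one row per country
print("\t".join([""] + organizations))
for country in countries:
    row = [str(cell_counts[(country, org)]) for org in organizations]
    print("\t".join([country] + row))
```

In this toy table, countries `A` and `B` have identical rows, so a coclustering would be expected to place them in the same country cluster.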

We now create a coclustering model for this dataset:

In [None]:
countries_results_dir = path.join("exercises", "CountriesByOrganization")

countries_cc_report = kh.train_coclustering(
    countries_kdic,
    dictionary_name="CountriesByOrganization",
    data_table_path=countries_data_file,
    coclustering_variables=["Country", "Organization"],
    results_dir=countries_results_dir,
    field_separator=";",
)

We can now browse the results with the Khiops Covisualization app:

In [None]:
# explorer_open(path.dirname(countries_cc_report))

We can now dump the country clusters and their metrics to a file with the `extract_clusters` function:

In [None]:
country_clusters_file = path.join(
    "exercises", "CountriesByOrganization", "CountryClusters.txt"
)
kh.extract_clusters(
    countries_cc_report,
    cluster_variable="Country",
    clusters_file_path=country_clusters_file,
)
peek(country_clusters_file, n=100)
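
Beyond peeking at it, the clusters file can also be loaded for further processing. The sketch below assumes the file is tab-separated with a header row (check this against the `peek` output above); `read_clusters_table` is a hypothetical helper name, not part of Khiops.

```python
import csv


def read_clusters_table(file_path):
    """Reads a tab-separated file with a header row into a list of dicts.

    Assumption: the clusters file is tab-separated and its first line
    holds the column names.
    """
    with open(file_path, newline="", encoding="utf-8") as clusters_file:
        return list(csv.DictReader(clusters_file, delimiter="\t"))
```

For example, `rows = read_clusters_table(country_clusters_file)` would give one dictionary per line, keyed by whatever column names the file header contains.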

### Exercise
We'll build a coclustering for the `Tokyo2021` dataset. It is extracted from the `Athletes` table of the [Tokyo 2021 Kaggle dataset](https://www.kaggle.com/arjunprasadsarkhel/2021-olympics-in-tokyo) and each record contains three variables:
- `Name`: the name of a competing athlete
- `Country`: the country (or organization) the athlete represents
- `Discipline`: the athlete's discipline

The idea for this exercise is to build a coclustering between `Country` and `Discipline` and see which countries resemble each other the most in terms of the athletes they bring to the Olympics.

We start by saving the dataset dictionary file and data table location into variables:

In [None]:
tokyo_kdic = path.join("data", "Tokyo2021", "Athletes.kdic")
tokyo_data_file = path.join("data", "Tokyo2021", "Athletes.csv")
tokyo_results_dir = path.join("exercises", "Tokyo2021")

#### `peek` the contents of the dictionary and data files

In [None]:
print("")
print(f"Tokyo2021 dictionary file location: {tokyo_kdic}")
print("")
peek(tokyo_kdic, n=15)

print("")
print(f"Tokyo2021 data table file location: {tokyo_data_file}")
print("")
peek(tokyo_data_file)
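
A `peek` at the data file usually makes the field separator obvious. As a cross-check, one could count candidate separators in the header line. The sketch below assumes the file starts with a header line and is UTF-8 readable; `count_separators` and `guess_separator` are hypothetical helper names, not part of Khiops.

```python
def count_separators(header_line, candidates=(",", ";", "\t")):
    """Counts the occurrences of each candidate separator in a header line."""
    return {separator: header_line.count(separator) for separator in candidates}


def guess_separator(file_path):
    """Returns the candidate separator that occurs most often in the first line.

    Assumption: the first line of the file is a header line.
    """
    with open(file_path, encoding="utf-8") as data_file:
        header_line = data_file.readline()
    counts = count_separators(header_line)
    return max(counts, key=counts.get)
```

For example, `guess_separator(tokyo_data_file)` returns the most frequent candidate, which can then be compared with the value passed as `field_separator`.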

#### Train the coclustering for the variables `Country` and `Discipline`
Do not forget that the field separator of this data file is `,`.

In [None]:
tokyo_cc_report = kh.train_coclustering(
    tokyo_kdic,
    dictionary_name="Athletes",
    coclustering_variables=["Country", "Discipline"],
    data_table_path=tokyo_data_file,
    results_dir=tokyo_results_dir,
    field_separator=",",
)

You can view the coclustering with the Khiops Covisualization app:

In [None]:
# explorer_open(path.dirname(tokyo_cc_report))

#### Use `extract_clusters` to extract the country clusters and `peek` its contents

In [None]:
tokyo_country_clusters_file = path.join("exercises", "Tokyo2021", "CountryClusters.txt")

kh.extract_clusters(
    tokyo_cc_report,
    cluster_variable="Country",
    clusters_file_path=tokyo_country_clusters_file,
)
peek(tokyo_country_clusters_file, n=200)

#### Use `extract_clusters` to extract the discipline clusters and `peek` its contents

In [None]:
tokyo_discipline_clusters_file = path.join(
    "exercises", "Tokyo2021", "DisciplineClusters.txt"
)

kh.extract_clusters(
    tokyo_cc_report,
    cluster_variable="Discipline",
    clusters_file_path=tokyo_discipline_clusters_file,
)
peek(tokyo_discipline_clusters_file, n=200, l=100)