## Purpose
This example demonstrates how to:
   - Build a coreset tree from file(s):
       - Build from a single file
       - Build from a list of files
       - Build from all files in a folder
       - Build from a list of folders
       - Build when the targets and features are in different files
   - Build from a pandas DataFrame, and from a list of DataFrames
   - Split the data into categories with the `coreset_by` parameter
   - Build from dataset(s) in the form of numpy arrays

In this example we'll be using the well-known Covertype Dataset (https://archive.ics.uci.edu/ml/datasets/covertype).


In [1]:
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_covtype
import numpy as np

from dataheroes.services import CoresetTreeServiceLG

## Prepare datasets

In [2]:
# Load Covertype dataset as a pandas data frame.
# In the output data frame all columns are features except the last column.
# The last column (Cover_Type) is the target
df = fetch_covtype(as_frame=True).frame

# Split dataframe: df1 = 50%, df2=25%, df3=25%
df1, df2 = train_test_split(df, test_size=0.5, random_state=42)
df2, df3 = train_test_split(df2, test_size=0.5, random_state=42)

# Prepare data directory and set the file names.
data1_dir = Path("data1_dir")
data2_dir = Path("data2_dir")
data1_dir.mkdir(parents=True, exist_ok=True)
data2_dir.mkdir(parents=True, exist_ok=True)
data1_file_path = data1_dir / "data1.csv"
data2_file_path = data1_dir / "data2.csv"
data3_file_path = data2_dir / "data3.csv"

# Store data as CSV.
# After that we have the following structure:
#   data1_dir
#       data1.csv (~290,000 samples)
#       data2.csv (~145,000 samples)
#   data2_dir
#       data3.csv (~145,000 samples)
df1.to_csv(data1_file_path, index=False)
df2.to_csv(data2_file_path, index=False)
df3.to_csv(data3_file_path, index=False)

### 1. Build the tree from file or files
Run `build_from_file` on the first file, which contains ~290K samples. We use a `sample_size` of 10K and a `coreset_size` of 2K. Formats other than CSV can be used by setting the `reader_f` and `reader_kwargs` parameters.
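
As a rough sketch of how `reader_f` and `reader_kwargs` fit together, the snippet below reads a semicolon-separated file with `pd.read_csv`. It assumes the service invokes the reader as `reader_f(path, **reader_kwargs)`. That calling convention, the file name `sample.txt`, and the sample values are illustrative assumptions, not taken from the library documentation.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical semicolon-separated file standing in for a non-default format.
tmp_dir = Path(tempfile.mkdtemp())
sample_path = tmp_dir / "sample.txt"
sample_path.write_text("Elevation;Slope;Cover_Type\n2596;3;5\n2804;9;2\n")

# Assumption: the reader is called as reader_f(path, **reader_kwargs).
reader_f = pd.read_csv
reader_kwargs = {"sep": ";"}
preview = reader_f(sample_path, **reader_kwargs)

print(preview.columns.tolist())
print(len(preview))
```

If the preview looks right, the same two objects could then be passed to `build_from_file` through the `reader_f` and `reader_kwargs` parameters mentioned above.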

In [3]:
# Tell the tree how data is structured.
# In this example we have one target column, all other columns are features.
data_params = {'target': {'name': 'Cover_Type'}}
# Initialize the service and build the tree.
# The tree uses the local file system to store its data.
# After this step you will have a new directory .dataheroes_cache
service_obj = CoresetTreeServiceLG(coreset_size=2_000, data_params=data_params)

### 1.1 Build the coreset tree from a single file
The build method returns a reference to `service_obj`; we suppress this unnecessary output with `%%capture`.

In [4]:
%%capture
service_obj.build_from_file(data1_file_path, sample_size=10_000)

### 1.2 Build the coreset tree with directory (that contains two files)

In [5]:
%%capture
# To build the tree from scratch, initialize a new service object
service_obj = CoresetTreeServiceLG(coreset_size=2_000, data_params=data_params)
service_obj.build_from_file(data1_dir, sample_size=10_000)

### 1.3 Build the coreset tree from a list of files
(Any iterator can be used, not only lists.)

In [6]:
%%capture
service_obj = CoresetTreeServiceLG(coreset_size=2_000, data_params=data_params)
service_obj.build_from_file([data1_file_path, data3_file_path], sample_size=10_000)

### 1.4 Build the coreset tree with list of directories (all 3 files should be used)

In [7]:
%%capture
service_obj = CoresetTreeServiceLG(coreset_size=2_000, data_params=data_params)
service_obj.build_from_file([data1_dir, data2_dir], sample_size=10_000)
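
To check which files a directory-based build is expected to pick up, the directories can be listed before building. The sketch below uses only `pathlib`. The helper `list_data_files` is hypothetical, and it assumes that only files directly inside each folder are read, which may differ from the service's actual behavior.

```python
import tempfile
from pathlib import Path


def list_data_files(paths):
    """Expand a mix of files and directories into a sorted list of files."""
    files = []
    for path in paths:
        path = Path(path)
        if path.is_dir():
            # Assumption: only files directly inside the directory count.
            files.extend(p for p in path.iterdir() if p.is_file())
        else:
            files.append(path)
    return sorted(files)


# Recreate the layout used in this example inside a temporary directory.
root = Path(tempfile.mkdtemp())
for rel in ["data1_dir/data1.csv", "data1_dir/data2.csv", "data2_dir/data3.csv"]:
    target = root / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("Elevation,Cover_Type\n2596,5\n")

found = list_data_files([root / "data1_dir", root / "data2_dir"])
print([p.name for p in found])
```

With the layout from this example, the listing contains all three CSV files.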

### 2. Build when the targets and features are in different files

In [14]:
%%capture
# Split target (last column) and features (all other columns)
df1_X = df1.iloc[:, :-1]
df1_y = df1.iloc[:, -1]
# Prepare directory
data3_dir = Path("data3_dir")
data3_dir.mkdir(parents=True, exist_ok=True)
# Store features and targets in two files
data1_X_file_path = data3_dir / "data1_X.csv"
data1_y_file_path = data3_dir / "data1_y.csv"
df1_X.to_csv(data1_X_file_path, index=False)
df1_y.to_csv(data1_y_file_path, index=False)

service_obj = CoresetTreeServiceLG(coreset_size=2_000, data_params=data_params)
service_obj.build_from_file(data1_X_file_path, target_file_path=data1_y_file_path, sample_size=10_000)
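
When features and targets live in separate files, the two files must describe the same rows in the same order. A minimal sanity check, sketched on a small synthetic frame (the column values and the file names `toy_X.csv` and `toy_y.csv` are made up for illustration), could look like this:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small synthetic frame; the last column plays the role of Cover_Type.
toy = pd.DataFrame(
    {"Elevation": [2596, 2804, 3100], "Slope": [3, 9, 12], "Cover_Type": [5, 2, 1]}
)
toy_X = toy.iloc[:, :-1]
toy_y = toy.iloc[:, -1]

out_dir = Path(tempfile.mkdtemp())
toy_X.to_csv(out_dir / "toy_X.csv", index=False)
toy_y.to_csv(out_dir / "toy_y.csv", index=False)

# Read both files back and confirm they describe the same number of rows.
X_back = pd.read_csv(out_dir / "toy_X.csv")
y_back = pd.read_csv(out_dir / "toy_y.csv")
print(X_back.shape, y_back.shape)
```

Because both files are written with `index=False`, row order is the only link between a feature row and its target, so neither file should be shuffled or filtered on its own.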

### 3. Build with `coreset_by` on the Elevation feature
We need a function that assigns each sample to a tree node according to its elevation interval:
 Elevation < 2400, 2400-2449, 2450-2499, 2500-..., 3250-3299, >= 3300.

In [15]:
%%capture
def coreset_by_elevation(X):
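    # List of boundaries [2400, 2450, 2500, ..., 3300]
    boundaries = [2400 + i * 50 for i in range(19)]
    # X[0]: Elevation is the first feature in the dataset.
    # Return the index of the interval; side="right" puts a value equal to a
    # boundary (e.g. 2400) into the interval that starts at that boundary.
    return np.searchsorted(boundaries, X[0], side="right")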

service_obj = CoresetTreeServiceLG(coreset_size=2_000, data_params=data_params)
service_obj.build_from_file(data1_file_path, sample_size=10_000, coreset_by=coreset_by_elevation)
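
To see how `np.searchsorted` turns an elevation into an interval index, it helps to run it on a few hand-picked values. The sketch below is independent of the service and uses made-up elevations. It passes `side="right"` so that a value equal to a boundary falls into the interval that starts at that boundary.

```python
import numpy as np

# Boundaries 2400, 2450, ..., 3300 (19 values, so 20 intervals).
boundaries = [2400 + i * 50 for i in range(19)]

elevations = np.array([2000, 2399, 2400, 2449, 2450, 3299, 3300, 3500])
# side="right": a value equal to a boundary belongs to the interval it opens.
groups = np.searchsorted(boundaries, elevations, side="right")

print(groups.tolist())  # [0, 0, 1, 1, 2, 18, 19, 19]
```

Index 0 collects everything below 2400, and index 19 collects everything from 3300 upward. With the default `side="left"`, a value of exactly 2400 would land in index 0 instead.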

### 4. Build with pandas DataFrame

In [16]:
%%capture
service_obj = CoresetTreeServiceLG(coreset_size=2_000, data_params=data_params)
service_obj.build_from_df(df1, sample_size=10_000)

### 5. Build with list of pandas DataFrames

In [17]:
%%capture
service_obj = CoresetTreeServiceLG(coreset_size=2_000, data_params=data_params)
service_obj.build_from_df([df1, df2], sample_size=10_000)

### 6. Build with dataset

In [18]:
%%capture
service_obj = CoresetTreeServiceLG(coreset_size=2_000, data_params=data_params)
# Prepare the dataset in the form of numpy arrays, features and targets separately
X = df1.iloc[:, :-1].to_numpy()
y = df1.iloc[:, -1].to_numpy()
# Build
service_obj.build(X, y, sample_size=10_000)

### 7. Build with list of datasets

In [19]:
%%capture
service_obj = CoresetTreeServiceLG(coreset_size=2_000, data_params=data_params)
# Prepare dataset from first dataframe
X1 = df1.iloc[:, :-1].to_numpy()
y1 = df1.iloc[:, -1].to_numpy()
# Same for second dataframe
X2 = df2.iloc[:, :-1].to_numpy()
y2 = df2.iloc[:, -1].to_numpy()
# Build with two datasets
service_obj.build([X1, X2], [y1, y2], sample_size=10_000)
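
When the lists of arrays come from one large array rather than from separate DataFrames, the feature and target chunks have to be cut at the same positions. One possible way to do that, sketched here on random toy data (the names `X_toy` and `y_toy` and the chunk count are illustrative), is `np.array_split`:

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(10, 3))      # 10 samples, 3 features
y_toy = rng.integers(1, 8, size=10)   # labels in 1..7, like Cover_Type

# np.array_split cuts along axis 0, so X and y chunks stay aligned.
X_parts = np.array_split(X_toy, 3)
y_parts = np.array_split(y_toy, 3)

for X_part, y_part in zip(X_parts, y_parts):
    assert len(X_part) == len(y_part)

print([len(part) for part in X_parts])  # [4, 3, 3]
```

The resulting `X_parts` and `y_parts` have the same shape as the `[X1, X2]` and `[y1, y2]` lists used above: one feature array and one target array per chunk, in matching order.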