## Purpose
In this example we will demonstrate how to:
   - Build a Coreset tree from file(s):
       - Build from a single file
       - Build from a list of files
       - Build from all files in a folder
       - Build from a list of folders
       - Build when the target and features are in different files
   - Build from a pandas DataFrame, and from a list of DataFrames
   - Build while splitting the data into a few categories with the `coreset_by` parameter
   - Build from dataset(s) in the form of numpy arrays

In this example we'll be using the well-known Covertype Dataset (https://archive.ics.uci.edu/ml/datasets/covertype).


In [1]:
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_covtype
import numpy as np

from dataheroes import CoresetTreeServiceLG

## Prepare datasets

In [2]:
# Load the Covertype dataset as a pandas DataFrame.
# In the resulting DataFrame, every column except the last is a feature.
# The last column (Cover_Type) is the target.
df = fetch_covtype(as_frame=True).frame

# Split the DataFrame: df1 = 50%, df2 = 25%, df3 = 25%
df1, df2 = train_test_split(df, test_size=0.5, random_state=42)
df2, df3 = train_test_split(df2, test_size=0.5, random_state=42)

# Prepare data directory and set the file names.
data1_dir = Path("data1_dir")
data2_dir = Path("data2_dir")
data1_dir.mkdir(parents=True, exist_ok=True)
data2_dir.mkdir(parents=True, exist_ok=True)
data1_file_path = data1_dir / "data1.csv"
data2_file_path = data1_dir / "data2.csv"
data3_file_path = data2_dir / "data3.csv"

# Store data as CSV.
# After that we will have the following structure:
#   data1_dir
#       data1.csv (~290,000 samples)
#       data2.csv (~145,000 samples)
#   data2_dir
#       data3.csv (~145,000 samples)
df1.to_csv(data1_file_path, index=False)
df2.to_csv(data2_file_path, index=False)
df3.to_csv(data3_file_path, index=False)
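
The split sizes, and the `n_instances` values used in the builds that follow, can be checked with a little arithmetic. The sketch below assumes the Covertype dataset has 581,012 rows and that `train_test_split` with a float `test_size` puts `ceil(test_size * n)` rows in its second output and the remainder in its first. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a float test_size (hypothetical helper)."""
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second

n_total = 581_012  # assumed row count of the Covertype dataset
n_df1, n_rest = split_sizes(n_total, 0.5)
n_df2, n_df3 = split_sizes(n_rest, 0.5)

print(n_df1, n_df2, n_df3)    # 290506 145253 145253
print(n_df1 + n_df2)          # rows in data1_dir: 435759
print(n_df1 + n_df2 + n_df3)  # rows in both directories: 581012
```

These totals are where the rounded values 290_000, 435_000 and 580_000 come from.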

### 1. Build the Coreset tree from a file or multiple files
Run `build_from_file` on the first file, which contains ~290K samples.

Formats other than CSV can also be used by setting the `reader_f` and `reader_kwargs` parameters.
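
As an illustration, a tab-separated file could be read by reusing pandas' own reader with different keyword arguments. This is only a sketch: it assumes `reader_f` is a pandas-style read function that is invoked roughly as `reader_f(path, **reader_kwargs)` and returns a DataFrame, so check the library documentation for the exact contract. The inline sample data is made up.

```python
import io
import pandas as pd

# Assumed contract: reader_f(path_or_buffer, **reader_kwargs) -> DataFrame.
reader_f = pd.read_csv
reader_kwargs = {"sep": "\t"}

# A tiny made-up TSV, standing in for a file on disk.
sample_tsv = io.StringIO("Elevation\tSlope\tCover_Type\n2596\t3\t5\n2590\t2\t5\n")

df_sample = reader_f(sample_tsv, **reader_kwargs)
print(df_sample.shape)          # (2, 3)
print(list(df_sample.columns))  # ['Elevation', 'Slope', 'Cover_Type']
```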

We pass `n_classes` and `n_instances` to help the tree calculate an optimal Coreset size. Depending on the task type, `optimized_for` can be `cleaning` or `training`.

In [3]:
# Tell the tree how data is structured.
# In this example we have one target column, all other columns are features.
data_params = {'target': {'name': 'Cover_Type'}}
# Initialize the service and build the tree.
# The tree uses the local file system to store its data.
# After this step you will have a new directory .dataheroes_cache
service_obj = CoresetTreeServiceLG(data_params=data_params, 
                                   optimized_for='training',
                                   n_classes=7,
                                   n_instances=290_000
                                  )

### 1.1 Build the coreset tree with a single file

In [4]:
service_obj.build_from_file(data1_file_path)

<dataheroes.services.tree_services.CoresetTreeServiceLG at 0x164a99d00>

### 1.2 Build the coreset tree with a directory (containing two files)

In [5]:
# To build a tree from scratch, initialize a new service.
service_obj = CoresetTreeServiceLG(data_params=data_params, 
                                   optimized_for='cleaning',
                                   n_classes=7,
                                   n_instances=435_000
                                  )
service_obj.build_from_file(data1_dir)

<dataheroes.services.tree_services.CoresetTreeServiceLG at 0x1694f5cd0>

### 1.3 Build the coreset tree with a list of files
(Any iterator can be used, not only a list.)

In [6]:
service_obj = CoresetTreeServiceLG(data_params=data_params, 
                                   optimized_for='cleaning',
                                   n_classes=7,
                                   n_instances=435_000
                                  )
service_obj.build_from_file([data1_file_path, data3_file_path])

<dataheroes.services.tree_services.CoresetTreeServiceLG at 0x169942cd0>
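
Since iterators are accepted, the file list does not have to be written out by hand. The following sketch uses a hypothetical helper, `iter_csv_files`, that lazily yields the CSV paths found in one or more directories. It is demonstrated on a temporary directory rather than on the data written earlier.

```python
import tempfile
from pathlib import Path

def iter_csv_files(*directories):
    """Yield CSV paths from each directory, in sorted order (hypothetical helper)."""
    for directory in directories:
        yield from sorted(Path(directory).glob("*.csv"))

# Demonstrate on a throwaway directory with two CSV files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for name in ("b.csv", "a.csv", "notes.txt"):
        (tmp_dir / name).write_text("x\n1\n")
    found = [p.name for p in iter_csv_files(tmp_dir)]

print(found)  # ['a.csv', 'b.csv']
```

A generator such as `iter_csv_files(data1_dir, data2_dir)` could then be passed to `build_from_file` in place of a list.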

### 1.4 Build the coreset tree with a list of directories (all 3 files should be used)

In [7]:
service_obj = CoresetTreeServiceLG(data_params=data_params, 
                                   optimized_for='cleaning',
                                   n_classes=7,
                                   n_instances=580_000
                                  )
service_obj.build_from_file([data1_dir, data2_dir])

<dataheroes.services.tree_services.CoresetTreeServiceLG at 0x16964d6d0>

### 2. Build when the target and features are in different files
This build is optimized for training.

In [8]:
# Split target (last column) and features (all other columns)
df1_X = df1.iloc[:, :-1]
df1_y = df1.iloc[:, -1]
# Prepare directory
data3_dir = Path("data3_dir")
data3_dir.mkdir(parents=True, exist_ok=True)
# Store features and target in two separate files
data1_X_file_path = data3_dir / "data1_X.csv"
data1_y_file_path = data3_dir / "data1_y.csv"
df1_X.to_csv(data1_X_file_path, index=False)
df1_y.to_csv(data1_y_file_path, index=False)

service_obj = CoresetTreeServiceLG(data_params=data_params, 
                                   optimized_for='training',
                                   n_classes=7,
                                   n_instances=290_000)
service_obj.build_from_file(data1_X_file_path, target_file_path=data1_y_file_path)

<dataheroes.services.tree_services.CoresetTreeServiceLG at 0x164b6fa00>
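
When features and target live in separate files, the two files are presumably matched row by row (an assumption, not something this example verifies), so it is worth confirming that they hold the same number of data rows before building. A minimal standard-library sketch, where `count_data_rows` is a hypothetical helper and the sample files are made up:

```python
import csv
import tempfile
from pathlib import Path

def count_data_rows(csv_path):
    """Count rows in a CSV file, excluding the header line (hypothetical helper)."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header
        return sum(1 for _ in reader)

with tempfile.TemporaryDirectory() as tmp:
    x_path = Path(tmp) / "X.csv"
    y_path = Path(tmp) / "y.csv"
    x_path.write_text("Elevation,Slope\n2596,3\n2590,2\n2804,9\n")
    y_path.write_text("Cover_Type\n5\n5\n2\n")
    n_X = count_data_rows(x_path)
    n_y = count_data_rows(y_path)

print(n_X, n_y)  # 3 3
assert n_X == n_y, "features and target files must have the same number of rows"
```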

### 3. Build with `coreset_by` on the Elevation feature
We want a function that splits the data into tree nodes by Elevation range:
 Elevation < 2400, 2400-2449, 2450-2499, ..., 3250-3299, >= 3300.

In [9]:
def coreset_by_elevation(X):
    # List of boundaries [2400, 2450, 2500, ..., 3300]
    boundaries = [2400 + i * 50 for i in range(19)]
    # X[0]: Elevation is the first feature in the dataset.
    # Return the index of the interval that contains it.
    # side="right" places a value equal to a boundary in the upper interval,
    # so 2400 falls into 2400-2449 rather than into "< 2400".
    return np.searchsorted(boundaries, X[0], side="right")

service_obj = CoresetTreeServiceLG(data_params=data_params, 
                                   optimized_for='cleaning',
                                   coreset_size=2_000
                                  )
service_obj.build_from_file(data1_file_path, coreset_by=coreset_by_elevation)

<dataheroes.services.tree_services.CoresetTreeServiceLG at 0x1699426a0>
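
How a value that sits exactly on a boundary is bucketed depends on the `side` argument of `np.searchsorted`. The sketch below uses the same 19 boundaries and a few hand-picked elevations to compare the default `side="left"` with `side="right"`.

```python
import numpy as np

boundaries = [2400 + i * 50 for i in range(19)]  # 2400, 2450, ..., 3300
elevations = np.array([2399, 2400, 2449, 2450, 3300, 3301])

left = np.searchsorted(boundaries, elevations)                 # default side="left"
right = np.searchsorted(boundaries, elevations, side="right")

print(left.tolist())   # [0, 0, 1, 1, 18, 19]
print(right.tolist())  # [0, 1, 1, 2, 19, 19]
```

With `side="left"`, an elevation of exactly 2400 shares index 0 with everything below 2400. With `side="right"`, it opens the 2400-2449 interval. Either way there are 20 possible indices, 0 through 19.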

### 4. Build with a pandas DataFrame

In [10]:
service_obj = CoresetTreeServiceLG(data_params=data_params, 
                                   optimized_for='cleaning',
                                   n_classes=7,
                                   n_instances=290_000
                                  )
service_obj.build_from_df(df1)

<dataheroes.services.tree_services.CoresetTreeServiceLG at 0x169947d90>

### 5. Build with a list of pandas DataFrames

In [11]:
service_obj = CoresetTreeServiceLG(data_params=data_params, 
                                   optimized_for='cleaning',
                                   n_classes=7,
                                   n_instances=435_000
                                  )
service_obj.build_from_df([df1, df2])

<dataheroes.services.tree_services.CoresetTreeServiceLG at 0x1694a5370>
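
A list of DataFrames often comes from reading a large file in pieces. As a sketch with made-up inline data, `pd.read_csv` with `chunksize` yields DataFrames of at most that many rows, which can be collected into a list of the kind passed to `build_from_df` above:

```python
import io
import pandas as pd

csv_text = "Elevation,Cover_Type\n2596,5\n2590,5\n2804,2\n2785,2\n2595,5\n"

# chunksize makes read_csv return an iterator of DataFrames; list() collects them.
chunks = list(pd.read_csv(io.StringIO(csv_text), chunksize=2))

print([len(chunk) for chunk in chunks])  # [2, 2, 1]
```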

### 6. Build with a dataset

In [12]:
service_obj = CoresetTreeServiceLG(data_params=data_params, 
                                   optimized_for='cleaning',
                                   n_classes=7,
                                   n_instances=290_000
                                  )
# Prepare the dataset as numpy arrays, with features and target kept separate
X = df1.iloc[:, :-1].to_numpy()
y = df1.iloc[:, -1].to_numpy()
# Build
service_obj.build(X, y)

<dataheroes.services.tree_services.CoresetTreeServiceLG at 0x169d75850>
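
Positional slicing with `iloc` relies on the target being the last column. If that is not guaranteed, the arrays can be split by column name instead. A sketch on a tiny made-up DataFrame, where `to_xy` is a hypothetical helper:

```python
import pandas as pd

def to_xy(df, target):
    """Split a DataFrame into (features, target) numpy arrays by column name."""
    X = df.drop(columns=[target]).to_numpy()
    y = df[target].to_numpy()
    return X, y

df_small = pd.DataFrame({
    "Cover_Type": [5, 2, 2],          # target deliberately placed first
    "Elevation": [2596, 2804, 2785],
    "Slope": [3, 9, 18],
})
X_small, y_small = to_xy(df_small, "Cover_Type")

print(X_small.shape, y_small.tolist())  # (3, 2) [5, 2, 2]
```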

### 7. Build with a list of datasets

In [13]:
service_obj = CoresetTreeServiceLG(data_params=data_params, 
                                   optimized_for='cleaning',
                                   n_classes=7,
                                   n_instances=435_000
                                  )
# Prepare dataset from first dataframe
X1 = df1.iloc[:, :-1].to_numpy()
y1 = df1.iloc[:, -1].to_numpy()
# Same for second dataframe
X2 = df2.iloc[:, :-1].to_numpy()
y2 = df2.iloc[:, -1].to_numpy()
# Build with two datasets
service_obj.build([X1,X2], [y1,y2])

<dataheroes.services.tree_services.CoresetTreeServiceLG at 0x169d56580>

### 8. Build using `coreset_size` and `chunk_size` directly
Instead of passing `n_classes` and `n_instances` so that the optimizer can calculate `coreset_size` and `chunk_size`, we can pass these parameters directly. Here we use a `chunk_size` of 10K and a `coreset_size` of 2K.

In [14]:
service_obj = CoresetTreeServiceLG(data_params=data_params, 
                                   optimized_for='cleaning',
                                   chunk_size=10_000,
                                   coreset_size=2_000,
                                  )
service_obj.build_from_file(data1_file_path)

<dataheroes.services.tree_services.CoresetTreeServiceLG at 0x1694a50d0>
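
To get a feel for these numbers, here is a rough back-of-the-envelope sketch. It assumes the ~290K samples are cut into consecutive chunks of `chunk_size` samples and that each chunk is summarized by a coreset of `coreset_size` samples. How the library actually arranges chunks into tree nodes is not covered here.

```python
import math

n_instances = 290_000  # approximate size of data1.csv
chunk_size = 10_000
coreset_size = 2_000

n_chunks = math.ceil(n_instances / chunk_size)  # chunks needed to cover the data
reduction = coreset_size / chunk_size           # fraction of each chunk that is kept

print(n_chunks)   # 29
print(reduction)  # 0.2
```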