## Imports

In [8]:
import os
import sys

import numpy as np
import pandas as pd

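# Locate the repository root ('ML-B') starting from the current working directory
dir_parts = os.getcwd().split(os.path.sep)
root_index = dir_parts.index('ML-B')
root_path = os.path.sep.join(dir_parts[:root_index + 1])
# Make the project packages under 'code/' importable
sys.path.append(os.path.join(root_path, 'code'))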
from data.data_config import Dataset
from data.data_utils import load_monk, load_cup

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Path

In [2]:
# Directories
monk_dir = root_path + '/data/monk/'
cup_dir = root_path + '/data/cup/'

# Filepaths (MONK)
m1_dev_path, m1_test_path = Dataset.MONK_1.dev_path, Dataset.MONK_1.test_path # MONK 1
m2_dev_path, m2_test_path = Dataset.MONK_2.dev_path, Dataset.MONK_2.test_path # MONK 2
m3_dev_path, m3_test_path = Dataset.MONK_3.dev_path, Dataset.MONK_3.test_path # MONK 3

# Filepaths (CUP)
cup_dev_path, cup_test_path = Dataset.CUP.dev_path, Dataset.CUP.test_path

# Introduction
In this notebook we take a look at the datasets at hand: the three MONK's problems and the CUP dataset used for the ML challenge.

## MONK
The MONK's problems are a set of three artificial domains over the same attribute space. Note that one of the MONK's problems has some noise added.

Each problem has the same characteristics:
- **Number of instances**: 432;
- **Number of attributes**: 8 (including the class/target attribute);
- **Missing attribute values**: none.

The attribute format is the following:
1. **class**: 0, 1 
2. **a1**:    1, 2, 3
3. **a2**:    1, 2, 3
4. **a3**:    1, 2
5. **a4**:    1, 2, 3
6. **a5**:    1, 2, 3, 4
7. **a6**:    1, 2
8. **Id**:    (A unique symbol for each instance)

where each $ai$ with $i \in \{1, \dots, 6\}$ corresponds to an attribute/feature of the dataset.

Target concepts associated with the MONK's problems:
- **MONK-1**: $(a1 = a2) \text{ or } (a5 = 1)$;
- **MONK-2**: exactly two of $\{a1 = 1, a2 = 1, a3 = 1, a4 = 1, a5 = 1, a6 = 1\}$ hold;
- **MONK-3**: $(a5 = 3 \text{ and } a4 = 1) \text{ or } (a5 \neq 4 \text{ and } a2 \neq 3)$, with $5\%$ class noise added to the training set.
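
The target concepts listed above can be written as plain Python predicates. The following is a minimal sketch, not the code used to generate the datasets: the function names `monk1`, `monk2`, `monk3` and the constant `DOMAINS` are hypothetical, and the MONK-3 predicate is the noise-free concept (the 5% class noise is not modeled). Enumerating every combination of attribute values also shows where the 432 instances come from ($3 \cdot 3 \cdot 2 \cdot 3 \cdot 4 \cdot 2 = 432$).

```python
from itertools import product

# Value ranges of a1..a6, taken from the attribute list above.
DOMAINS = (range(1, 4), range(1, 4), range(1, 3), range(1, 4), range(1, 5), range(1, 3))


def monk1(a1, a2, a3, a4, a5, a6):
    """MONK-1: (a1 = a2) or (a5 = 1)."""
    return a1 == a2 or a5 == 1


def monk2(a1, a2, a3, a4, a5, a6):
    """MONK-2: exactly two attributes take the value 1."""
    return sum(a == 1 for a in (a1, a2, a3, a4, a5, a6)) == 2


def monk3(a1, a2, a3, a4, a5, a6):
    """MONK-3 without noise: (a5 = 3 and a4 = 1) or (a5 != 4 and a2 != 3)."""
    return (a5 == 3 and a4 == 1) or (a5 != 4 and a2 != 3)


# Enumerate the full instance space shared by the three problems.
instances = list(product(*DOMAINS))
print(len(instances))

# Count how many instances each concept labels as positive.
positives = {f.__name__: sum(f(*x) for x in instances) for f in (monk1, monk2, monk3)}
print(positives)
```

The positive counts give a quick idea of the class balance of each noise-free concept over the whole attribute space.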


For simplicity, since the three MONK datasets share the same format, the following steps are performed only for the MONK-1 problem.

## Data
The MONK dataset is loaded into memory in one-hot encoded format, with the features (x) separated from the labels/classes (y).

In [3]:
# Load MONK-1
x_dev_m1, y_dev_m1, x_test_m1, y_test_m1 = load_monk(m1_dev_path, m1_test_path)

In [4]:
x_dev_m1.shape, y_dev_m1.shape

((124, 17), (124,))
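
The 17 feature columns are consistent with one-hot encoding the six attributes: $3 + 3 + 2 + 3 + 4 + 2 = 17$. The implementation of `load_monk` is not shown in this notebook, so the snippet below is only a sketch of how such an encoding could be produced with NumPy; `CARDINALITIES` and `one_hot_encode` are hypothetical names introduced for illustration.

```python
import numpy as np

# Number of distinct values of a1..a6, as listed in the attribute format above.
CARDINALITIES = (3, 3, 2, 3, 4, 2)


def one_hot_encode(samples: np.ndarray) -> np.ndarray:
    """One-hot encode rows holding the 1-based categorical values a1..a6."""
    samples = np.asarray(samples)
    blocks = []
    for column, n_values in enumerate(CARDINALITIES):
        # Comparing against 1..n_values yields one boolean column per possible value.
        values = np.arange(1, n_values + 1)
        blocks.append(samples[:, column, None] == values)
    return np.concatenate(blocks, axis=1)


# A single made-up sample: a1=1, a2=1, a3=1, a4=1, a5=3, a6=1.
example = np.array([[1, 1, 1, 1, 3, 1]])
encoded = one_hot_encode(example)
print(encoded.shape)
print(encoded.astype(int))
```

Each row of the encoded array has exactly six `True` entries, one per original attribute.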

In [8]:
x_test_m1.shape, y_test_m1.shape

((124, 17), (124,))

In [9]:
x_dev_m1

array([[ True, False, False, ..., False,  True, False],
       [ True, False, False, ..., False, False,  True],
       [ True, False, False, ..., False,  True, False],
       ...,
       [False, False,  True, ..., False, False,  True],
       [False, False,  True, ..., False, False,  True],
       [False, False,  True, ...,  True, False,  True]])

In [10]:
y_dev_m1

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

# CUP

Dataset for the ML competition, consisting of a training set and a **blind** test set (i.e., samples without target values).

**Regression** on 3 target variables: *x*, *y*, and *z*.

The attribute format is the following:
- First column is a pattern name (ID);
- Central 10 columns are 10 attributes with continuous values;
- Last 3 columns are the 3 labels.
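
The column layout above can be turned into a simple split of a parsed table. The sketch below assumes a comma-separated file and uses a small synthetic table in place of the real CUP files; the helper `split_cup_columns` is hypothetical and is not the `load_cup` function imported earlier, whose implementation is not shown here.

```python
import io

import numpy as np


def split_cup_columns(table: np.ndarray):
    """Split a parsed CUP-style table into IDs, features, and targets."""
    ids = table[:, 0]          # first column: pattern ID
    features = table[:, 1:11]  # central 10 columns: continuous attributes
    targets = table[:, 11:]    # last 3 columns: labels (no columns left in a blind test set)
    return ids, features, targets


# Synthetic stand-in for a development file: 5 rows, ID + 10 attributes + 3 targets.
rng = np.random.default_rng(0)
synthetic = np.column_stack([np.arange(1, 6), rng.normal(size=(5, 13))])
text = "\n".join(",".join(f"{value:.6f}" for value in row) for row in synthetic)

# Assumption: comma-separated values, with '#' marking comment lines.
table = np.loadtxt(io.StringIO(text), delimiter=",", comments="#")
ids, x, y = split_cup_columns(table)
print(ids.shape, x.shape, y.shape)
```

For a blind test table with only 11 columns, the same slicing returns an empty target array of shape `(n, 0)`.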

## Data

In [3]:
# Load CUP
x_dev_cup, y_dev_cup, x_test_cup = load_cup(cup_dev_path, cup_test_path)

In [4]:
x_dev_cup.shape, y_dev_cup.shape

((1000, 10), (1000, 3))

In [5]:
x_test_cup.shape

(900, 10)

In [9]:
type(x_dev_cup)

numpy.ndarray

## Feature and label scaling
Let's check whether the features/labels are already properly scaled and/or normalized.

In [12]:
def feature_analysis(data: np.ndarray):
    """Compute the per-column mean and standard deviation of <data>.

    Args:
        - data: the data to analyze, with one sample per row

    Returns:
        The mean and the standard deviation computed along axis 0.
    """
    return np.mean(data, axis=0), np.std(data, axis=0)

In [17]:
print('--- DEVELOPMENT ---')
features_mean_dev, features_std_dev = feature_analysis(x_dev_cup)
print(f'Features mean:\n {features_mean_dev}\n')
print(f'Features std:\n {features_std_dev}')

--- DEVELOPMENT ---
Features mean:
 [-2.08721929e-17 -1.24344979e-17  2.48689958e-17  2.32702746e-16
  1.70530257e-16  3.19744231e-17 -1.77635684e-18  1.77635684e-17
 -7.10542736e-18  3.19744231e-17]

Features std:
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
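
The two arrays printed above (per-feature means on the order of 1e-16 and standard deviations of 1) suggest that the development features are already standardized. A check of this kind can be automated with `np.allclose`. The sketch below runs on synthetic data rather than the CUP arrays; the helpers `is_standardized` and `standardize` and the tolerance `atol=1e-8` are assumptions made for illustration.

```python
import numpy as np


def is_standardized(data: np.ndarray, atol: float = 1e-8) -> bool:
    """Return True if every column has mean close to 0 and std close to 1."""
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    return bool(np.allclose(mean, 0.0, atol=atol) and np.allclose(std, 1.0, atol=atol))


def standardize(data: np.ndarray) -> np.ndarray:
    """Rescale each column to zero mean and unit standard deviation."""
    return (data - np.mean(data, axis=0)) / np.std(data, axis=0)


# Synthetic data that is clearly not standardized (mean 5, std 3).
rng = np.random.default_rng(42)
raw = rng.normal(loc=5.0, scale=3.0, size=(1000, 10))

print(is_standardized(raw))
print(is_standardized(standardize(raw)))
```

When a dataset does need scaling, the mean and standard deviation are typically computed on the development set only and then reused to transform the test set.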
