# FS4MT: Feature selection for multi-table data
## Practical exercise on the *Accident* dataset

This tool filters variables (columns of secondary tables) in a multi-table schema before supervised classification.

Current tools only take into account variables in the main table. We propose a method for exploring the data in secondary tables in order to extract the parameters relevant to classification. Selecting only the parameters that carry information simplifies the classification task and reduces processing time. Two kinds of parameters are considered here: native secondary variables (variables present in the secondary tables) and construction primitives (the mathematical rules used to construct variables).

The work first focused on defining a measure of the importance of variables and primitives with respect to the target variable. The ultimate goal is to reduce the space of primitives and variables so that only the relevant elements are kept.


**Method used**

Importance is estimated with a univariate method. The Count variable is discretised to limit the effect of noise in the secondary tables.
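
To make the idea of discretising a count more concrete, here is a minimal sketch. It counts the secondary-table records attached to each main-table key, then splits the keys into groups of similar count. This is an illustration only, not the fs4mt or Khiops implementation: the function names `count_records` and `equal_frequency_groups`, the equal-frequency strategy, and the toy vehicle rows are all assumptions made for the example.

```python
from collections import Counter


def count_records(secondary_keys):
    """Count how many secondary-table records reference each main-table key."""
    return Counter(secondary_keys)


def equal_frequency_groups(counts, n_groups):
    """Assign each key to one of n_groups groups of similar record count.

    Keys are sorted by count and cut into consecutive groups of roughly
    equal size; keys sharing the same count always land in the same group.
    """
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    groups = {}
    previous_count, previous_group = None, 0
    for position, (key, count) in enumerate(ordered):
        group = position * n_groups // len(ordered)
        if count == previous_count:
            group = previous_group  # never split identical counts
        groups[key] = group
        previous_count, previous_group = count, group
    return groups


# Hypothetical toy data: one entry per vehicle record, holding its AccidentId.
vehicle_keys = ["A1", "A1", "A2", "A3", "A3", "A3", "A4"]
counts = count_records(vehicle_keys)
groups = equal_frequency_groups(counts, n_groups=2)
print(groups)
```

With this toy input, accidents with a single vehicle fall into the first group and accidents with two or three vehicles into the second.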


**Organisation** 

The **fs4mt** library consists of the following directories:
* DATA: 2 datasets:
  - Accident: taken from the road accident database; the tables are used here in star schema form
  - synth1: Orange synthetic data
* library: 2 Python classes 
* notebook: notebooks analysing each of the 2 datasets
* readme


In [1]:
# Import packages
import os
from khiops import core as kh
import warnings
import sys

In [3]:
# Import FS4MT
from fs4mt.class_UnivariateMultitableAnalysis import UnivariateMultitableAnalysis
from fs4mt.class_VariableSelectionStatistics import VariableSelectionStatistics
from fs4mt.class_UnivariateMultitableAnalysis import add_noise

<div style="background-color:MediumSeaGreen; text-align:center; vertical-align: middle; padding:40px 0; color:white">

## For the dataset accident_star
## Load data and initialize dictionary with 10 noise variables

In [4]:
# Set data information for the dataset “Accident”

data_path = os.path.join("..", "DATA", "Accident") # data directory
dictionary_file_path = os.path.join(data_path, "Accidents_etoile.kdic") # Khiops dictionary
data_table_path = os.path.join(data_path, "Accidents.txt") # main table
vehicle_table_path = os.path.join(data_path, "Vehicles.txt") # secondary table
user_table_path = os.path.join(data_path, "Users.txt") # secondary table
place_table_path = os.path.join(data_path, "Places.txt") # secondary table
main_dictionary_name = "Accident" # name in the Khiops dictionary
# Maps each secondary table's data path to the path of its data file
additional_data_tables = {
    main_dictionary_name + "`Place": place_table_path,
    main_dictionary_name + "`Vehicles": vehicle_table_path,
    main_dictionary_name + "`Users": user_table_path,
}
target = "Gravity" # target variable to predict
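
Before launching a long analysis, it can help to check that every input file is actually present. The sketch below is a hedged example using only the standard library; the helper `find_missing_files` is hypothetical and not part of fs4mt, and the paths simply repeat those defined in the cell above.

```python
import os


def find_missing_files(paths):
    """Return the entries of `paths` that are not existing files."""
    return [path for path in paths if not os.path.isfile(path)]


# Same layout as in the notebook cell above
data_path = os.path.join("..", "DATA", "Accident")
input_files = [
    os.path.join(data_path, "Accidents_etoile.kdic"),  # Khiops dictionary
    os.path.join(data_path, "Accidents.txt"),  # main table
    os.path.join(data_path, "Vehicles.txt"),  # secondary table
    os.path.join(data_path, "Users.txt"),  # secondary table
    os.path.join(data_path, "Places.txt"),  # secondary table
]

missing = find_missing_files(input_files)
if missing:
    print("Missing input files:", missing)
```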

In [5]:
# Create dictionary with 10 noise variables
# If the file exists it is not overwritten

dictionary_file_path_noise = os.path.join(data_path, "noisy_dictionary.kdic")
if not os.path.exists(dictionary_file_path_noise):
    dictionary_domain_10, _ = add_noise(
        kh.read_dictionary_file(dictionary_file_path), 10
    )
    dictionary_domain_10.export_khiops_dictionary_file(dictionary_file_path_noise)

<div style="background-color:LightGreen; text-align:center; vertical-align: middle; padding:40px 0;">
    
## Analysis of the accident_star dataset

<div style="background-color:LightSalmon; text-align:center; vertical-align: middle; padding:10px 0;">
exploration_type = "Variable" with count_effect_reduction = True

In [6]:
# Names of the output and results directories
output_khiops_dir = "output_khiops_discretization" # Khiops analysis output directory, defaults to "output_khiops"
results_dir = "results_discretization" # results directory, defaults to "results"

# Analysis parameters (the default values are used here)
exploration_type = "Variable" # what to analyze: 'Variable', 'Primitive', or 'All' for both; defaults to 'Variable'
count_effect_reduction = True # whether to discretize the Count variable; defaults to True

<div style="background-color:LightGreen; text-align:center; vertical-align: middle; padding:10px 0;">
univariate analysis of secondary table variables

In [7]:
%%time

# Create the multi-table analysis object UnivariateMultitableAnalysis
obj_discret = UnivariateMultitableAnalysis(
    dictionary_file_path_noise, # path of the Khiops dictionary file
    main_dictionary_name, # name of the dictionary to analyze
    data_table_path, # path of the main data table file
    additional_data_tables, # maps each secondary table's data path to the path of its data file
    target, # name of the target variable
    exploration_type=exploration_type, # what to analyze: 'Variable', 'Primitive', or 'All' for both; defaults to 'Variable'
    count_effect_reduction=count_effect_reduction, # whether to discretize the Count variable; defaults to True
    output_khiops_dir=output_khiops_dir, # Khiops analysis output directory, defaults to "output_khiops"
    results_dir=results_dir, # results directory, defaults to "results"
)

# Compute the importance of each variable; returns a Python dictionary
variable_importance_dictionary_discret = obj_discret.variables_analysis()

table : Place


100%|██████████████████████████████████████████████████████████████████████████████████| 28/28 [00:25<00:00,  1.09it/s]


table : Vehicles
discretization : group 1/2


100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [00:29<00:00,  1.54s/it]


discretization : group 2/2


100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [00:34<00:00,  1.79s/it]


table : Users
discretization : group 1/3


100%|██████████████████████████████████████████████████████████████████████████████████| 23/23 [00:32<00:00,  1.41s/it]


discretization : group 2/3


100%|██████████████████████████████████████████████████████████████████████████████████| 23/23 [00:38<00:00,  1.67s/it]


discretization : group 3/3


100%|██████████████████████████████████████████████████████████████████████████████████| 23/23 [00:35<00:00,  1.53s/it]


Khiops reports saved in "output_khiops_discretization" directory

Variables exploration report in json format saved : results_discretization\variable_exploration.json


CPU times: total: 281 ms
Wall time: 3min 21s
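
The log above reports that a JSON report was saved in the results directory. As a hedged sketch, the file can be opened with the standard `json` module to inspect it outside fs4mt. The helper `load_json_report` is hypothetical, UTF-8 encoding is an assumption, and the internal layout of the report is not documented here, so only its top-level shape is printed.

```python
import json
import os


def load_json_report(path):
    """Load a JSON report, or return None if the file does not exist."""
    if not os.path.exists(path):
        return None
    # Assumption: the report is encoded in UTF-8.
    with open(path, encoding="utf-8") as report_file:
        return json.load(report_file)


report_path = os.path.join("results_discretization", "variable_exploration.json")
report = load_json_report(report_path)
if report is not None:
    # Show only the top-level type and size of the loaded content.
    print(type(report).__name__, len(report))
```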





<div style="background-color:LightGreen; text-align:center; vertical-align: middle; padding:10px 0;">
statistics from univariate analysis of secondary table variables

In [8]:
# Create the variable statistics object VariableSelectionStatistics
obj_stat_discret = VariableSelectionStatistics(
    dictionary_file_path_noise, # path of the Khiops dictionary file
    exploration_type=exploration_type, # what to analyze: 'Variable', 'Primitive', or 'All' for both; defaults to 'Variable'
    count_effect_reduction=count_effect_reduction, # whether to discretize the Count variable; defaults to True
    results_dir=results_dir, # results directory, defaults to "results"
)

# Read the variables analysis; returns a dataframe of variable importance
df_variable_exploration_discret = obj_stat_discret.get_variables_analysis()

    rank     table            variable         type   levelMT   
0      1     Place          RoadNumber  Categorical  0.059002  \
1      2     Place            RoadType  Categorical  0.053591   
2      3     Place  RoadMarkerDistance    Numerical  0.050187   
3      4     Place        RoadMarkerId  Categorical  0.049178   
4      5     Users  PedestrianLocation  Categorical  0.047309   
..   ...       ...                 ...          ...       ...   
64    65     Place                 N_3    Numerical  0.000000   
65    66     Place                 N_4    Numerical  0.000000   
66    67  Vehicles     PassengerNumber    Numerical  0.000000   
67    68     Place       RoadSecNumber  Categorical  0.000000   
68    69     Place    SurfaceCondition  Categorical  0.000000   

                                 levelMT aggregate  real number of aggregates   
0                                 Place.RoadNumber                          1  \
1                                   Place.RoadType       

In [9]:
# Write the tabulated report to a file
obj_stat_discret.write_to_csv_report(df_variable_exploration_discret)

Report file saved : results_discretization\variable_exploration.txt


In [10]:
# Print the number of variables with level zero
nb_zero_discret = obj_stat_discret.get_variable_number_zero_level(
    df_variable_exploration_discret
)

# Filter the Khiops dictionary: generate and save a filtered copy
# Here, filter the 10 variables with the lowest level
obj_stat_discret.filter_variables_in_dictionary(df_variable_exploration_discret, 10)

# With no number given (None), filter all variables whose level is zero
obj_stat_discret.filter_variables_in_dictionary(df_variable_exploration_discret)

Number of variables with level zero : 28

Filtering 10 variables with low level :
Dictionary saved : ..\DATA\Accident\reduced_dictionary_filter_10.kdic

Filtering all variables with level zero :
The variable 'AccidentId' in the table 'Place' is a key, it is not set to Unused
The variable 'AccidentId' in the table 'User' is a key, it is not set to Unused
The variable 'AccidentId' in the table 'Vehicle' is a key, it is not set to Unused
Dictionary saved : ..\DATA\Accident\reduced_dictionary_level_0.kdic
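
The two filtering modes used above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the code of `filter_variables_in_dictionary`: the function `select_variables_to_filter` is hypothetical, the levels are copied from the dataframe shown earlier, and the zero level given to `AccidentId` is assumed from the log messages. Whether the library also protects keys in the "n lowest" mode is not confirmed by the source.

```python
def select_variables_to_filter(levels, keys=(), n_lowest=None):
    """Pick the variables to mark as unused.

    levels: dict mapping (table, variable) to its level.
    keys: (table, variable) pairs that are table keys and must be kept.
    n_lowest: number of lowest-level variables to select; if None,
        select every variable whose level is zero.
    """
    protected = set(keys)
    candidates = [name for name in levels if name not in protected]
    if n_lowest is None:
        return [name for name in candidates if levels[name] == 0]
    ranked = sorted(candidates, key=lambda name: levels[name])
    return ranked[:n_lowest]


# A few rows taken from the dataframe output; the AccidentId level is assumed.
levels = {
    ("Place", "RoadNumber"): 0.059002,
    ("Place", "RoadType"): 0.053591,
    ("Place", "N_3"): 0.0,
    ("Vehicles", "PassengerNumber"): 0.0,
    ("Place", "AccidentId"): 0.0,
}
keys = [("Place", "AccidentId")]

zero_level = select_variables_to_filter(levels, keys)
lowest_one = select_variables_to_filter(levels, keys, n_lowest=1)
print(zero_level)
print(lowest_one)
```

Keeping key variables out of the selection mirrors the log messages above, where `AccidentId` is reported as a key and is not set to Unused.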
