# FS4MT: Feature selection for multi-table data
## Practical exercise on the *Accident* dataset

This tool filters the variables (columns of the secondary tables) of a multi-table schema before supervised classification.

Current tools only take into account the variables of the main table. We propose a method that explores the data in the secondary tables to extract the parameters relevant to classification. Keeping only the parameters that carry information simplifies the classification task and reduces processing time. Two kinds of parameters are considered here: the native secondary variables (variables present in the secondary tables) and the construction primitives (the mathematical rules used to build aggregates from them).
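
To make the notion of a construction primitive concrete, the minimal sketch below aggregates the records of a secondary table into one value per main-table record. The sample records, the `aggregate` helper and the three primitives shown (count, mean, mode) are illustrative assumptions; this is not how fs4mt or Khiops implement them.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical secondary table: several vehicle records per accident.
vehicles = [
    {"AccidentId": 1, "PassengerNumber": 2, "Category": "Car"},
    {"AccidentId": 1, "PassengerNumber": 1, "Category": "Truck"},
    {"AccidentId": 2, "PassengerNumber": 4, "Category": "Car"},
    {"AccidentId": 2, "PassengerNumber": 0, "Category": "Car"},
    {"AccidentId": 2, "PassengerNumber": 2, "Category": "Bike"},
]


def aggregate(records, key, variable, primitive):
    """Apply one primitive to `variable`, grouped by the main-table key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[variable])
    return {k: primitive(values) for k, values in groups.items()}


def mode(values):
    """Most frequent value (the first one seen wins on ties)."""
    return Counter(values).most_common(1)[0][0]


# One (variable, primitive) pair yields one candidate feature of the main table.
count_per_accident = aggregate(vehicles, "AccidentId", "Category", len)
mean_passengers = aggregate(vehicles, "AccidentId", "PassengerNumber", mean)
mode_category = aggregate(vehicles, "AccidentId", "Category", mode)
```

Each (variable, primitive) pair produces a candidate feature, which is why pruning uninformative variables and primitives shrinks the search space quickly.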

The work first focused on defining a measure of the importance of variables and primitives with respect to the target variable. The ultimate goal is to reduce the space of primitives and variables so that only the relevant elements are kept.


**Method used**

The importance of each variable is estimated with a univariate method. The Count variable (the number of secondary records attached to a main record) is discretised to limit the effect of noise in the secondary tables.
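
As a rough illustration of what discretising the Count variable can look like, the sketch below splits main-table keys into groups with a similar number of secondary records. The source does not say how the intervals are chosen, so the equal-frequency split, the `count_groups` helper and the sample keys are all assumptions. The `discretization : group i/n` messages in the run logs further down suggest the analysis is then run per group, but that reading is also an assumption.

```python
from collections import Counter


def count_groups(secondary_keys, n_groups):
    """Split main-table keys into groups of similar secondary-record count.

    Equal-frequency splitting is an assumption made for this sketch.
    """
    counts = Counter(secondary_keys)  # key -> number of secondary records
    ordered = sorted(counts, key=lambda k: (counts[k], k))
    size, extra = divmod(len(ordered), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        # The first `extra` groups absorb the remainder, one key each.
        end = start + size + (1 if i < extra else 0)
        groups.append(ordered[start:end])
        start = end
    return counts, groups


# Hypothetical foreign keys of a secondary table (one entry per record).
user_keys = [1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6]
counts, groups = count_groups(user_keys, n_groups=2)
print(groups)  # [[2, 5, 3], [1, 6, 4]]
```

Within a group the record counts are close, so an aggregate is less likely to look informative merely because it reflects how many secondary records exist.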


**Organisation** 

The **fs4mt** library consists of the following directories:
* DATA: 2 datasets:
  - Accident: taken from the road accident database; the tables are used here in star schema form
  - synth1: Orange synthetic data
* library: 2 Python classes 
* notebook: notebooks analysing each of the 2 datasets
* readme


In [1]:
# Import python packages
import os
from khiops import core as kh
import warnings
import sys

In [2]:
os.environ["KHIOPS_PROC_NUMBER"] = "8"
warnings.filterwarnings("ignore")

In [3]:
os.chdir("..")

In [4]:
pwd

'C:\\Users\\QRBS5531\\Mes dossiers\\OF\\ScoringFutur\\fs4mt_tool'

In [5]:
from fs4mt.class_UnivariateMultitableAnalysis import UnivariateMultitableAnalysis
from fs4mt.class_VariableSelectionStatistics import VariableSelectionStatistics
from fs4mt.class_UnivariateMultitableAnalysis import get_dataset
from fs4mt.class_UnivariateMultitableAnalysis import add_noise

In [13]:
DataPath = os.path.join(".")
DataLibPath = os.path.join(DataPath, "fs4mt")
if DataLibPath not in sys.path:
    sys.path.append(DataLibPath)

<div style="background-color:MediumSeaGreen; text-align:center; vertical-align: middle; padding:40px 0; color:white">

## dataset accident_star
## initialisation with noise variables

In [6]:
# Set information
rule = kh.all_construction_rules  # construction rules to use
k = 50  # number of aggregates per variable

# Set data information
(
    data_path,
    dictionary_file_path,
    data_table_path,
    secondary_table_list,
    Additional_data_tables,
    main_dictionary_name,
    target,
) = get_dataset("Accident_star")

In [7]:
# Create dictionary with 10 noise variables if it doesn't already exist

dictionary_file_path_noise = os.path.join(data_path, "noisy_dictionary.kdic")
if not os.path.exists(dictionary_file_path_noise):
    dictionary_domain_10, _ = add_noise(
        kh.read_dictionary_file(dictionary_file_path), 10
    )
    dictionary_domain_10.export_khiops_dictionary_file(dictionary_file_path_noise)
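
The role of the noise variables can be illustrated without Khiops. The sketch below appends numerical (`N_i`) and categorical (`C_i`) columns drawn at random, independently of any target, so any importance attributed to them is spurious. The `add_noise_columns` helper, the value ranges and the sample rows are hypothetical; the actual `add_noise` function works on a Khiops dictionary and may generate its variables differently.

```python
import random


def add_noise_columns(rows, n_pairs, seed=0):
    """Return copies of `rows` with n_pairs random N_i / C_i columns added."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    noisy_rows = []
    for row in rows:
        noisy = dict(row)  # leave the input rows untouched
        for i in range(n_pairs):
            noisy[f"N_{i}"] = rng.random()  # numerical noise in [0, 1)
            noisy[f"C_{i}"] = rng.choice(["a", "b", "c"])  # categorical noise
        noisy_rows.append(noisy)
    return noisy_rows


# Hypothetical rows of a secondary table.
rows = [{"AccidentId": 1, "Slope": "Flat"}, {"AccidentId": 2, "Slope": "Hill"}]
noisy_rows = add_noise_columns(rows, n_pairs=5)  # 5 pairs = 10 noise variables
print(sorted(noisy_rows[0]))
```

A selection method that behaves well should assign these columns no importance, which makes them a convenient sanity check.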

<div style="background-color:LightBlue; text-align:center; vertical-align: middle; padding:10px 0;">
Create results and output directories

In [8]:
# Create results and output directories
def create_directories(results_dir_name, output_dir_name):
    """Create the results and output directories under DataPath if missing."""
    for dir_name in (results_dir_name, output_dir_name):
        os.makedirs(os.path.join(DataPath, dir_name), exist_ok=True)

<div style="background-color:LightBlue; text-align:center; vertical-align: middle; padding:10px 0;">
Univariate Multitable Analysis

In [9]:
def create_object_multitable_analysis(
    exploration_type="Variable",
    count_effect_reduction=True,
    results_dir="",
    output_dir="",
):
    # Dictionary with 10 noise variables + filtering
    # Create the variable analysis object
    obj = UnivariateMultitableAnalysis(
        dictionary_file_path_noise,
        main_dictionary_name,
        data_table_path,
        Additional_data_tables,
        target,
        exploration_type=exploration_type,
        count_effect_reduction=count_effect_reduction,
        max_constructed_variables_per_variable=k,
        results_dir=results_dir,
        output_dir=output_dir,
    )
    return obj

<div style="background-color:LightBlue; text-align:center; vertical-align: middle; padding:10px 0;">
Variable Selection Statistics

In [10]:
def create_object_statistics(
    exploration_type="Variable",
    count_effect_reduction=True,
    output_dir="",
):
    # Create Variable Statistics object
    obj = VariableSelectionStatistics(
        dictionary_file_path_noise,
        main_dictionary_name,
        exploration_type=exploration_type,
        count_effect_reduction=count_effect_reduction,
        output_dir=output_dir,
    )
    return obj

<div style="background-color:LightGreen; text-align:center; vertical-align: middle; padding:40px 0;">
    
## analysis of the accident_star dataset

<div style="background-color:LightSalmon; text-align:center; vertical-align: middle; padding:10px 0;">
exploration_type = "Variable", count_effect_reduction = True

In [11]:
results_dir_var_true = "output_khiops_variable"
output_dir_var_true = "results_variable"

<div style="background-color:LightGreen; text-align:center; vertical-align: middle; padding:10px 0;">
univariate analysis of secondary table variables

In [14]:
# Create results and output directories
create_directories(results_dir_var_true, output_dir_var_true)

In [15]:
%%time

obj_var_true = create_object_multitable_analysis(
    results_dir=results_dir_var_true,
    output_dir=output_dir_var_true,
)

# Compute the variable importance list
variable_importance_dictionary_var_true = obj_var_true.variables_analysis()

table : Place


100%|██████████████████████████████████████████████████████████████████████████████████| 28/28 [00:25<00:00,  1.10it/s]


table : Vehicles
discretization : group 1/2


100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [00:27<00:00,  1.46s/it]


discretization : group 2/2


100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [00:31<00:00,  1.65s/it]


table : Users
discretization : group 1/3


100%|██████████████████████████████████████████████████████████████████████████████████| 23/23 [00:29<00:00,  1.30s/it]


discretization : group 2/3


100%|██████████████████████████████████████████████████████████████████████████████████| 23/23 [00:33<00:00,  1.46s/it]


discretization : group 3/3


100%|██████████████████████████████████████████████████████████████████████████████████| 23/23 [00:30<00:00,  1.32s/it]

CPU times: total: 312 ms
Wall time: 3min 5s





<div style="background-color:LightGreen; text-align:center; vertical-align: middle; padding:10px 0;">
statistics from univariate analysis of secondary table variables

In [15]:
# Create Variable Statistics object
obj_stat_var_true = create_object_statistics(
    output_dir=output_dir_var_true,
)

# Get the variable importance dataframe
df_variable_exploration_var_true = obj_stat_var_true.get_variables_analysis()

file "results_variable\primitive_exploration.json" doesn't exist


In [16]:
# write text report
obj_stat_var_true.write_txt_report(df_variable_exploration_var_true)

In [17]:
# statistics from variable analysis

list_var_var_true = obj_stat_var_true.get_list_variable(
    df_variable_exploration_var_true
)
print(list_var_var_true)

nb_zero_var_true = obj_stat_var_true.get_variable_number_zero_level(
    df_variable_exploration_var_true
)
print("number of variables with level zero : " + str(nb_zero_var_true))

['AccidentId', 'RoadType', 'RoadNumber', 'RoadSecNumber', 'RoadLetter', 'Circulation', 'LaneNumber', 'SpecialLane', 'Slope', 'RoadMarkerId', 'RoadMarkerDistance', 'Layout', 'StripWidth', 'LaneWidth', 'SurfaceCondition', 'Infrastructure', 'Localization', 'SchoolNear', 'N_0', 'C_0', 'N_1', 'C_1', 'N_2', 'C_2', 'N_3', 'C_3', 'N_4', 'C_4', 'AccidentId', 'VehicleId', 'Direction', 'Category', 'PassengerNumber', 'FixedObstacle', 'MobileObstacle', 'ImpactPoint', 'Maneuver', 'N_10', 'C_10', 'N_11', 'C_11', 'N_12', 'C_12', 'N_13', 'C_13', 'N_14', 'C_14', 'AccidentId', 'VehicleId', 'Seat', 'Category', 'Gender', 'TripReason', 'SafetyDevice', 'SafetyDeviceUsed', 'PedestrianLocation', 'PedestrianAction', 'PedestrianCompany', 'BirthYear', 'N_5', 'C_5', 'N_6', 'C_6', 'N_7', 'C_7', 'N_8', 'C_8', 'N_9', 'C_9']
number of variables with level zero : 28


<div style="background-color:LimeGreen; text-align:center; vertical-align: middle; padding:40px 0;">

## analysis of the accident_star dataset without discretisation

<div style="background-color:LightSalmon; text-align:center; vertical-align: middle; padding:10px 0;">
exploration_type = "Variable", count_effect_reduction = False

In [18]:
count_effect_reduction_false = False
results_dir_var_false = "output_khiops_variable_no_discretisation"
output_dir_var_false = "results_variable_no_discretisation"

<div style="background-color:LightGreen; text-align:center; vertical-align: middle; padding:10px 0;">
univariate analysis of secondary table variables

In [19]:
# Create results and output directories
create_directories(results_dir_var_false, output_dir_var_false)

In [20]:
%%time

obj_var_false = create_object_multitable_analysis(
    count_effect_reduction=count_effect_reduction_false,
    results_dir=results_dir_var_false,
    output_dir=output_dir_var_false,
)

# Compute the variable importance list
variable_importance_dictionary_var_false = obj_var_false.variables_analysis()

table : Place


100%|██████████████████████████████████████████████████████████████████████████████████| 28/28 [00:30<00:00,  1.11s/it]


table : Vehicles


100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [00:40<00:00,  2.12s/it]


table : Users


100%|██████████████████████████████████████████████████████████████████████████████████| 23/23 [00:42<00:00,  1.86s/it]

CPU times: total: 125 ms
Wall time: 1min 54s





<div style="background-color:LightGreen; text-align:center; vertical-align: middle; padding:10px 0;">
statistics from univariate analysis of secondary table variables

In [21]:
# Create Variable Statistics object
obj_stat_var_false = create_object_statistics(
    count_effect_reduction=count_effect_reduction_false,
    output_dir=output_dir_var_false,
)

# Get the variable importance dataframe
df_variable_exploration_var_false = obj_stat_var_false.get_variables_analysis()

file "results_variable_no_discretisation\primitive_exploration.json" doesn't exist


In [22]:
# write text report
obj_stat_var_false.write_txt_report(df_variable_exploration_var_false)

In [23]:
# statistics from variable analysis

list_var_var_false = obj_stat_var_false.get_list_variable(
    df_variable_exploration_var_false
)
print(list_var_var_false)

nb_zero_var_false = obj_stat_var_false.get_variable_number_zero_level(
    df_variable_exploration_var_false
)
print("number of variables with level zero : " + str(nb_zero_var_false))

['AccidentId', 'RoadType', 'RoadNumber', 'RoadSecNumber', 'RoadLetter', 'Circulation', 'LaneNumber', 'SpecialLane', 'Slope', 'RoadMarkerId', 'RoadMarkerDistance', 'Layout', 'StripWidth', 'LaneWidth', 'SurfaceCondition', 'Infrastructure', 'Localization', 'SchoolNear', 'N_0', 'C_0', 'N_1', 'C_1', 'N_2', 'C_2', 'N_3', 'C_3', 'N_4', 'C_4', 'AccidentId', 'VehicleId', 'Direction', 'Category', 'PassengerNumber', 'FixedObstacle', 'MobileObstacle', 'ImpactPoint', 'Maneuver', 'N_10', 'C_10', 'N_11', 'C_11', 'N_12', 'C_12', 'N_13', 'C_13', 'N_14', 'C_14', 'AccidentId', 'VehicleId', 'Seat', 'Category', 'Gender', 'TripReason', 'SafetyDevice', 'SafetyDeviceUsed', 'PedestrianLocation', 'PedestrianAction', 'PedestrianCompany', 'BirthYear', 'N_5', 'C_5', 'N_6', 'C_6', 'N_7', 'C_7', 'N_8', 'C_8', 'N_9', 'C_9']
number of variables with level zero : 15
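
One way to compare the two runs is to check how many noise variables receive a non-zero level in each. The sketch below does this on plain dictionaries, because the layout of the dataframe returned by `get_variables_analysis` is not shown here. The `summarise_levels` helper is hypothetical, and the level values are made up for illustration; they are not results from the Accident dataset.

```python
import re

# Matches the noise variable names seen in the lists above (N_0, C_12, ...).
NOISE_PATTERN = re.compile(r"^[NC]_\d+$")


def summarise_levels(levels):
    """Count zero-level variables and list noise variables with a positive level."""
    zero = [name for name, level in levels.items() if level == 0]
    noise_kept = [
        name
        for name, level in levels.items()
        if level > 0 and NOISE_PATTERN.match(name)
    ]
    return len(zero), noise_kept


# Made-up levels, keyed by variable name for simplicity.
levels_with_reduction = {"Gender": 0.03, "N_5": 0.0, "C_5": 0.0, "BirthYear": 0.01}
levels_without_reduction = {"Gender": 0.04, "N_5": 0.002, "C_5": 0.0, "BirthYear": 0.02}

print(summarise_levels(levels_with_reduction))     # (2, [])
print(summarise_levels(levels_without_reduction))  # (1, ['N_5'])
```

In the real data several tables share variable names such as `AccidentId` and `Category`, so a practical version would key the levels by (table, variable) rather than by name alone.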
