# FS4MT: Feature selection for multi-table data
## Practical exercise on the *Accident* dataset

This tool allows variables (columns in secondary tables) to be filtered in a multi-table schema prior to supervised classification.

Current tools only take into account variables in the main table. We propose a method for exploring data in secondary tables in order to extract parameters relevant to classification. By selecting only the parameters that carry information, the classification task is simplified, reducing processing time. The parameters taken into account here are the native secondary variables (variables present in the secondary tables) and the construction primitives (mathematical rules).
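To make the notion of a construction primitive concrete, here is a minimal sketch in plain Python. It is not the Khiops implementation: the toy table, the three primitives (`Count`, `Mean`, `Max`) and the helper `build_aggregates` are all assumptions chosen for illustration. Each primitive summarises the secondary records attached to one main-table key into a single value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy secondary table: one row per vehicle, keyed by accident.
vehicles = [
    {"AccidentId": "A1", "PassengerNumber": 1},
    {"AccidentId": "A1", "PassengerNumber": 3},
    {"AccidentId": "A2", "PassengerNumber": 2},
]

# Illustrative primitives: each maps a list of values to a single value.
primitives = {"Count": len, "Mean": mean, "Max": max}


def build_aggregates(rows, key, variable):
    """Group `variable` by `key`, then apply every primitive to each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[variable])
    return {
        k: {f"{name}({variable})": fn(values) for name, fn in primitives.items()}
        for k, values in groups.items()
    }


aggregates = build_aggregates(vehicles, "AccidentId", "PassengerNumber")
print(aggregates["A1"])
```

Every (variable, primitive) pair produces one candidate aggregate, so discarding an uninformative variable or primitive removes a whole family of candidates at once.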

This system initially focused on creating a measure of the importance of variables and primitives in relation to the target variable. The ultimate goal is to reduce the space of primitives and variables in order to select only the relevant elements.
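Once an importance measure is available, the selection step itself can be very simple. The sketch below assumes a mapping from element name to importance and keeps the elements strictly above a threshold. The function `select_relevant`, the names and the values are hypothetical and are not taken from the library or from the results of this notebook.

```python
def select_relevant(importance, threshold=0.0):
    """Return the names whose importance exceeds `threshold`, most important first."""
    return sorted(
        (name for name, level in importance.items() if level > threshold),
        key=lambda name: importance[name],
        reverse=True,
    )


# Hypothetical importance values, for illustration only.
importance = {"var_a": 0.03, "var_b": 0.12, "noise_1": 0.0, "noise_2": 0.0}
selected = select_relevant(importance)
print(selected)
```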


**Method used**

Importance is estimated with a univariate method. The Count variable is discretised to limit the effect of noise in the secondary tables.
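One possible reading of this step is sketched below: count the secondary records attached to each main-table key, then assign each key to a count interval so that variables can be evaluated within groups of similar count. This is an assumption made for illustration. The helper `group_by_count` and the fixed cut points are hypothetical, and the library may derive its intervals differently.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def group_by_count(secondary_keys, cut_points):
    """Assign each main-table key to a count interval.

    `secondary_keys` lists the foreign key of every secondary record.
    `cut_points` is a sorted list of (hypothetical) interval limits: a key
    with n records goes to group i, where i is the number of cut points <= n.
    """
    counts = Counter(secondary_keys)
    groups = defaultdict(list)
    for key, n in counts.items():
        groups[bisect_right(cut_points, n)].append(key)
    return dict(groups)


# Toy foreign keys: A1 has 2 records, A2 has 1, A3 has 3.
keys = ["A1", "A1", "A2", "A3", "A3", "A3"]
groups = group_by_count(keys, cut_points=[2])
print(groups)
```

With a single cut point at 2, keys with fewer than two records fall in group 0 and the others in group 1.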


**Organisation** 

The **fs4mt** library consists of the following directories:
* DATA: 2 datasets:
  - Accident: from the road accident database, we consider here the tables in star schema form
  - synth1: Orange synthetic data
* library: 2 Python classes 
* notebook: notebooks analysing each of the 2 datasets
* readme


In [1]:
# Import packages
import os
from khiops import core as kh
import warnings
import sys

In [2]:
# Initialization of the number of processors for Khiops
os.environ["KHIOPS_PROC_NUMBER"] = "8"
warnings.filterwarnings("ignore")

In [3]:
os.chdir("..")

In [4]:
pwd

'C:\\Users\\QRBS5531\\Mes dossiers\\git\\feature-selection-for-multi-table-data'

In [5]:
# Import FS4MT
from fs4mt.class_UnivariateMultitableAnalysis import UnivariateMultitableAnalysis
from fs4mt.class_VariableSelectionStatistics import VariableSelectionStatistics
from fs4mt.class_UnivariateMultitableAnalysis import get_dataset
from fs4mt.class_UnivariateMultitableAnalysis import add_noise

In [6]:
DataPath = os.path.join(".")
DataLibPath = os.path.join(DataPath, "fs4mt")
if DataLibPath not in sys.path:
    sys.path.append(DataLibPath)

In [7]:
%load_ext autoreload
%autoreload 2

<div style="background-color:MediumSeaGreen; text-align:center; vertical-align: middle; padding:40px 0; color:white">

## For the dataset accident_star
## Initialization of data with 10 noise variables

In [8]:
# Set data information
# Here we load the provided dataset “Accident”
(
    data_path,
    dictionary_file_path,
    data_table_path,
    secondary_table_list,
    Additional_data_tables,
    main_dictionary_name,
    target,
) = get_dataset("Accident_star")

The dataset "Accident_star" has been loaded successfully.


In [9]:
# Set information

rule = kh.all_construction_rules  # construction rules to use
k = 50  # number of aggregates per variable

In [10]:
# Create dictionary with 10 noise variables
# If the file exists it is not overwritten

dictionary_file_path_noise = os.path.join(data_path, "noisy_dictionary.kdic")
if not os.path.exists(dictionary_file_path_noise):
    dictionary_domain_10, _ = add_noise(
        kh.read_dictionary_file(dictionary_file_path), 10
    )
    dictionary_domain_10.export_khiops_dictionary_file(dictionary_file_path_noise)

<div style="background-color:LightBlue; text-align:center; vertical-align: middle; padding:10px 0;">
Create results and output directories

In [11]:
# The function to create results and output directories
def create_directories(results_dir_name, output_dir_name):
    # exist_ok=True makes the call a no-op when the directory already exists
    for dir_name in (results_dir_name, output_dir_name):
        os.makedirs(os.path.join(DataPath, dir_name), exist_ok=True)

<div style="background-color:LightBlue; text-align:center; vertical-align: middle; padding:10px 0;">
Univariate Multitable Analysis

In [12]:
# The function to create the multi-table analysis object
def create_object_multitable_analysis(
    exploration_type="Variable",
    count_effect_reduction=True,
    results_dir="",
    output_dir="",
):
    # Dictionary with 10 noise variables + filtering
    # Create Analysis Variable object
    obj = UnivariateMultitableAnalysis(
        dictionary_file_path_noise, # Path of a Khiops dictionary file.
        main_dictionary_name, # Name of the dictionary to be analyzed.
        data_table_path, # Path of the data table file.
        Additional_data_tables, # A dictionary containing the data paths and file paths for a multi-table dictionary file.
        target, # Name of the target variable.
        exploration_type=exploration_type, # What to analyze: 'Variable', 'Primitive', or 'All' for both; defaults to 'Variable'.
        count_effect_reduction=count_effect_reduction, # Whether to apply discretization (count effect reduction); defaults to True.
        max_constructed_variables_per_variable=k, # Maximum number of variables to construct per native variable, defaults to 10.
        results_dir=results_dir, # Path of the results directory, defaults to "Results".
        output_dir=output_dir, # Path of the output directory, defaults to "".
    )
    return obj

<div style="background-color:LightBlue; text-align:center; vertical-align: middle; padding:10px 0;">
Variable Selection Statistics

In [13]:
# The function to create the statistics object
def create_object_statistics(
    exploration_type="Variable",
    count_effect_reduction=True,
    output_dir="",
):
    # Create Variable Statistics object
    obj = VariableSelectionStatistics(
        dictionary_file_path_noise, # Path of a Khiops dictionary file.
        exploration_type=exploration_type, # What to analyze: 'Variable', 'Primitive', or 'All' for both; defaults to 'Variable'.
        count_effect_reduction=count_effect_reduction, # Whether to apply discretization (count effect reduction); defaults to True.
        output_dir=output_dir, # Path of the results directory, defaults to "Results".
    )
    return obj

<div style="background-color:LightGreen; text-align:center; vertical-align: middle; padding:40px 0;">
    
## Analysis of the accident_star dataset

<div style="background-color:LightSalmon; text-align:center; vertical-align: middle; padding:10px 0;">
exploration_type = "Variable" with count_effect_reduction = True

In [14]:
# Initialization of results and output directories names
results_dir_discret = "output_khiops_discretization"
output_dir_discret = "results_discretization"

<div style="background-color:LightGreen; text-align:center; vertical-align: middle; padding:10px 0;">
univariate analysis of secondary table variables

In [15]:
# Create results and output directories specified above
create_directories(results_dir_discret, output_dir_discret)

In [16]:
%%time

# Initialize the multi-table analysis object
obj_discret = create_object_multitable_analysis(
    results_dir=results_dir_discret,
    output_dir=output_dir_discret,
)

# Compute the variable importance list
variable_importance_dictionary_discret = obj_discret.variables_analysis()

table : Place


100%|██████████████████████████████████████████████████████████████████████████████████| 28/28 [00:32<00:00,  1.17s/it]


table : Vehicles
discretization : group 1/2


100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [00:36<00:00,  1.91s/it]


discretization : group 2/2


100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [00:37<00:00,  1.99s/it]


table : Users
discretization : group 1/3


100%|██████████████████████████████████████████████████████████████████████████████████| 23/23 [00:37<00:00,  1.62s/it]


discretization : group 2/3


100%|██████████████████████████████████████████████████████████████████████████████████| 23/23 [00:41<00:00,  1.81s/it]


discretization : group 3/3


100%|██████████████████████████████████████████████████████████████████████████████████| 23/23 [00:40<00:00,  1.78s/it]


Khiops reports are written in "output_khiops_discretization" directory

Variables exploration report in json format is written in "results_discretization" directory


CPU times: total: 609 ms
Wall time: 3min 54s





<div style="background-color:LightGreen; text-align:center; vertical-align: middle; padding:10px 0;">
statistics from univariate analysis of secondary table variables

In [17]:
# Create Variable Statistics object
obj_stat_discret = create_object_statistics(
    output_dir=output_dir_discret,
)

# Get the dataframe of variable importances
df_variable_exploration_discret = obj_stat_discret.get_variables_analysis()

Variables results loaded


In [18]:
# Write the text report
obj_stat_discret.write_txt_report(df_variable_exploration_discret)

Report file written : results_discretization\variable_exploration.txt


In [19]:
# statistics from variable analysis

# Get and print variables list
list_var_discret = obj_stat_discret.get_list_variable(
    df_variable_exploration_discret
)

List of variables analyzed :
['AccidentId', 'RoadType', 'RoadNumber', 'RoadSecNumber', 'RoadLetter', 'Circulation', 'LaneNumber', 'SpecialLane', 'Slope', 'RoadMarkerId', 'RoadMarkerDistance', 'Layout', 'StripWidth', 'LaneWidth', 'SurfaceCondition', 'Infrastructure', 'Localization', 'SchoolNear', 'N_0', 'C_0', 'N_1', 'C_1', 'N_2', 'C_2', 'N_3', 'C_3', 'N_4', 'C_4', 'AccidentId', 'VehicleId', 'Direction', 'Category', 'PassengerNumber', 'FixedObstacle', 'MobileObstacle', 'ImpactPoint', 'Maneuver', 'N_10', 'C_10', 'N_11', 'C_11', 'N_12', 'C_12', 'N_13', 'C_13', 'N_14', 'C_14', 'AccidentId', 'VehicleId', 'Seat', 'Category', 'Gender', 'TripReason', 'SafetyDevice', 'SafetyDeviceUsed', 'PedestrianLocation', 'PedestrianAction', 'PedestrianCompany', 'BirthYear', 'N_5', 'C_5', 'N_6', 'C_6', 'N_7', 'C_7', 'N_8', 'C_8', 'N_9', 'C_9']


In [20]:
# Get and print the number of variables with level zero
nb_zero_discret = obj_stat_discret.get_variable_number_zero_level(
    df_variable_exploration_discret
)

Number of variables with level zero : 28


<div style="background-color:LimeGreen; text-align:center; vertical-align: middle; padding:40px 0;">

## Analysis of the accident_star dataset without discretization

<div style="background-color:LightSalmon; text-align:center; vertical-align: middle; padding:10px 0;">
exploration_type = "Variable" with count_effect_reduction = False

In [21]:
# Initialization of results and output directories names
results_dir_no_discret = "output_khiops_variable_no_discretization"
output_dir_no_discret = "results_variable_no_discretization"

# Disable the count effect reduction
count_effect_reduction_false = False

<div style="background-color:LightGreen; text-align:center; vertical-align: middle; padding:10px 0;">
univariate analysis of secondary table variables

In [22]:
# Create results and output directories specified above
create_directories(results_dir_no_discret, output_dir_no_discret)

In [23]:
%%time

# Initialize the multi-table analysis object
obj_no_discret = create_object_multitable_analysis(
    count_effect_reduction=count_effect_reduction_false,
    results_dir=results_dir_no_discret,
    output_dir=output_dir_no_discret,
)

# Compute the variable importance list
variable_importance_dictionary_no_discret = obj_no_discret.variables_analysis()

table : Place


100%|██████████████████████████████████████████████████████████████████████████████████| 28/28 [00:34<00:00,  1.24s/it]


table : Vehicles


100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [00:44<00:00,  2.36s/it]


table : Users


100%|██████████████████████████████████████████████████████████████████████████████████| 23/23 [00:48<00:00,  2.11s/it]


Khiops reports are written in "output_khiops_variable_no_discretization" directory

Variables exploration report in json format is written in "results_variable_no_discretization" directory


CPU times: total: 297 ms
Wall time: 2min 8s





<div style="background-color:LightGreen; text-align:center; vertical-align: middle; padding:10px 0;">
statistics from univariate analysis of secondary table variables

In [24]:
# Create Variable Statistics object
obj_stat_no_discret = create_object_statistics(
    count_effect_reduction=count_effect_reduction_false,
    output_dir=output_dir_no_discret,
)

# Get the dataframe of variable importances
df_variable_exploration_no_discret = obj_stat_no_discret.get_variables_analysis()

Variables results loaded


In [25]:
# Write the text report
obj_stat_no_discret.write_txt_report(df_variable_exploration_no_discret)

Report file written : results_variable_no_discretization\variable_exploration.txt


In [26]:
# statistics from variable analysis

# Get and print variables list
list_var_no_discret = obj_stat_no_discret.get_list_variable(
    df_variable_exploration_no_discret
)

List of variables analyzed :
['AccidentId', 'RoadType', 'RoadNumber', 'RoadSecNumber', 'RoadLetter', 'Circulation', 'LaneNumber', 'SpecialLane', 'Slope', 'RoadMarkerId', 'RoadMarkerDistance', 'Layout', 'StripWidth', 'LaneWidth', 'SurfaceCondition', 'Infrastructure', 'Localization', 'SchoolNear', 'N_0', 'C_0', 'N_1', 'C_1', 'N_2', 'C_2', 'N_3', 'C_3', 'N_4', 'C_4', 'AccidentId', 'VehicleId', 'Direction', 'Category', 'PassengerNumber', 'FixedObstacle', 'MobileObstacle', 'ImpactPoint', 'Maneuver', 'N_10', 'C_10', 'N_11', 'C_11', 'N_12', 'C_12', 'N_13', 'C_13', 'N_14', 'C_14', 'AccidentId', 'VehicleId', 'Seat', 'Category', 'Gender', 'TripReason', 'SafetyDevice', 'SafetyDeviceUsed', 'PedestrianLocation', 'PedestrianAction', 'PedestrianCompany', 'BirthYear', 'N_5', 'C_5', 'N_6', 'C_6', 'N_7', 'C_7', 'N_8', 'C_8', 'N_9', 'C_9']


In [27]:
# Get and print the number of variables with level zero
nb_zero_no_discret = obj_stat_no_discret.get_variable_number_zero_level(
    df_variable_exploration_no_discret
)

Number of variables with level zero : 15
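To check how a filter treats the injected noise, it helps to separate noise variables from native ones by name. The sketch below assumes that the variables named `N_<i>` and `C_<i>` in the lists printed earlier are the ones added by `add_noise`. This is inferred from the naming only. The helper `split_noise` and the sample list are hypothetical.

```python
import re

# Assumed naming convention for noise variables: N_<index> or C_<index>.
NOISE_PATTERN = re.compile(r"^[NC]_\d+$")


def split_noise(variable_names):
    """Split a list of variable names into (noise, native) lists."""
    noise = [v for v in variable_names if NOISE_PATTERN.match(v)]
    native = [v for v in variable_names if not NOISE_PATTERN.match(v)]
    return noise, native


# Hypothetical sample of variable names, for illustration only.
names = ["RoadType", "N_0", "C_0", "Gender", "N_10"]
noise, native = split_noise(names)
print(len(noise), "noise variables,", len(native), "native variables")
```

Applied to the list of zero-level variables from each run, this split would show how many of them are noise and how many are native variables.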
