Prepare V11 version : internal release notes #223

marcboulle · 2024-04-05T15:03:20Z

Linked to the global steering issue Prepare V11 version
#6

The goal is to make public the internal release notes for Khiops V11, which may impact the whole ecosystem:

steering: sharing information about the new features
pykhiops
- take into account as early as possible (starting with versions 10.2.x) what is deprecated and will be removed in V11, in order to warn users
- evolution of pykhiops core to take into account the changes in parameters (added or removed) and the new scenarios
visualization tools: evolution according to the new features (well advanced, to be finalized)
automl
documentation
...
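
As an illustration of the first pykhiops item, a parameter can keep working in 10.2.x while warning that it will disappear in V11. This is only a minimal sketch based on Python's standard `warnings` module: the function `train_predictor` and the parameter `results_prefix` are hypothetical names chosen for the example, not taken from the pykhiops API.

```python
import warnings


def train_predictor(dictionary_name, results_prefix=None):
    """Hypothetical wrapper: `results_prefix` stands for a parameter removed in V11."""
    if results_prefix is not None:
        # Warn now (10.2.x) so that callers can adapt before the removal in V11
        warnings.warn(
            "'results_prefix' is deprecated and will be removed in Khiops V11",
            DeprecationWarning,
            stacklevel=2,
        )
    return dictionary_name
```

With `stacklevel=2`, the warning is attributed to the caller's line rather than to the wrapper itself, which makes it easier for users to locate the code to update.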

The internal release notes are complete and in line with version 10.5.0-a1, and there will be almost no further changes.
Proposal for discussion:

these internal release notes are published on a wiki page of the khiops repo
- if there are changes, sub-sections describing them are added, in sync with khiops tags
each "client" repo creates an issue if needed to take these changes into account

alexisbondu · 2024-04-10T09:18:52Z

This document will be written directly as a comment on this issue; in addition, an issue is to be created in pyKhiops to warn about the deprecated features

marcboulle · 2024-04-11T08:18:15Z

Preparation of Khiops V11

New features of Khiops V11: see the next comment, Khiops 11.0 internal release notes

Remaining work for V11

Khiops
- SNB with sparse data: integrated, to be finalized
- trees for regression: developed, to be integrated
- finalization of the interpretability feature
- collection of the most frequent tokens for the construction of text features
- finalization of the instances x variables coclustering
- taking into account the feedback from a beta-test release, as soon as the sparse SNB is integrated
Khiops visualization
- histograms: visualize the series of simplified histograms
- unsupervised: take a "Parts" column into account
Khiops covisualization
- fix existing bugs
- take the instances x variables coclustering into account
pykhiops:
- take the new features into account
documentation
- take the new features into account

Potentially included in V10.2.x

The following evolutions developed for V11 may be backported to the V10.2.0 branch:

New option in khiops executables: -s to obtain system information
Khiops covisualization: fix existing bugs

In that case, they will have to be removed from the V11 release notes.
Some bug fixes have already been backported to V10.2.0 in this way (see "Bug fix" in WHATSNEW.txt)

Update of the Khiops 11.0 internal release notes

Reference: the next comment, titled Khiops 11.0 internal release notes
Update:

as needed, as the remaining work is completed
by describing what is new in the history below

Update history

initialization: populated by reviewing the commit notes
- sources
  - former version.txt file
    - LearningDoc\ProjectManagement\KhiopsHistoricalProject2023\Learning\Doc\version.txt
    - from V10.2.0i to V10.4.2i, excluded
  - git log of the KhiopsML/khiops GitHub repository
    - since V10.4.2i
dd/mm/2024: details of what is new
11/04/2024: initialization, from 10.2.0i to 10.5.0-a1
up to the commit "Merge pull request 196 assertion violated in kwprobabilitytabletest #227 from KhiopsML/196-assertion-violated-in-kwprobabilitytabletest"
24/05/2024: 10.5.0-a1 to 10.5.0-b1
- the return code is now always 0 if OK, 1 otherwise (no more return code 2)
- what will already be available in a 10.2.x version is now specified

Beta-test release

30/05/2024
- version 10.5.0-b.1: https://github.com/KhiopsML/khiops/releases/tag/10.5.0-b.1
  - installeur Windows: https://github.com/KhiopsML/khiops/releases/download/10.5.0-b.1/khiops-10.5.0-b.1-setup.exe
- released internally only
24/07/2024: version 10.5.2-b.1: https://github.com/KhiopsML/khiops/releases/tag/10.5.2-b.0

marcboulle · 2024-04-11T08:20:36Z

Khiops 11.0 internal release notes

The purpose of the internal release notes is:

to give all detailed evolutions and corrections potentially useful for the Khiops ecosystem
to allow pykhiops and AutoML to adapt in advance to the functional parts of these evolutions
to be the base for the file whatsnewV10.0.txt, the "official" release note (quick summary)

These release notes follow the last version of Khiops, described in the Khiops 10.2 release notes.

Khiops 11.0 is a major version, with several major functional improvements.

Major improvements

Text data

new Text type for variables in tabular or multi-table schema
Automatic feature construction from Text variables

SNB classifier for sparse data

extension to sparse data

Random forests for regression

Khiops interpretation

Instance-based interpretation of scores
Exact computation of Shapley values
Build an interpretation dictionary
- To deploy interpretation values
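
For readers unfamiliar with the term, the sketch below computes exact Shapley values from their generic definition, by enumerating every coalition of variables. It is given for illustration only: these notes do not describe how Khiops performs the exact computation for its predictor, and the toy additive value function and variable names are assumptions made for the example.

```python
from itertools import combinations
from math import factorial


def exact_shapley(players, value):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    n = len(players)
    shapley = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        shapley[p] = total
    return shapley


# Toy additive value function: each variable contributes a fixed amount
contrib = {"age": 0.5, "income": 1.5, "city": -0.25}
values = exact_shapley(list(contrib), lambda s: sum(contrib[v] for v in s))
```

For an additive value function such as this one, each Shapley value equals the variable's own contribution, which gives a simple sanity check of the enumeration.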

Histograms

Optimal histograms for univariate data exploration

Coclustering instances x variables

extension of existing variable x variable coclustering, for joint density estimation
to instances x variables coclustering, for exploratory analysis

New visualization tools

visualization
- new panel to visualize histograms
covisualization
- accounting for the case of instances x variables coclustering

Simplified ergonomy

simplification of panels and fields, everywhere, as much as possible
fast path: to train a model without a dictionary
results visualization and edition of dictionaries from the graphical interface

Detailed evolutions

Functional improvements

Text data

new type Text available in Khiops dictionaries
- Text variables can contain up to 1000000 bytes
- Categorical variables are now limited to 1000 bytes
type detected in automatic "build dictionary" feature
automatic feature construction
- parameter "number of text features", with default value 10000
- text features:
  - words: default automatic tokenization
  - ngrams: black-box using ngrams of bytes, for blob-like variables
  - tokens: open to user defined tokenization
new derivation rules for Text variables
- TextLoadFile: load a Text variable from a text file, up to 1000000 chars, replacing end of lines by whitespaces
- FromText, ToText: conversion with categorical variables
- rules similar to those related to categorical variables:
  - TextLength, TextLeft, TextRight, TextMiddle
  - TextTokenLength, TextTokenLeft, TextTokenRight, TextTokenMiddle
  - TextTranslate, TextSearch, TextReplace, TextReplaceAll
  - TextRegexMatch, TextRegexSearch, TextRegexReplace, TextRegexReplaceAll
  - TextToUpper, TextToLower
  - TextConcat, TextHash, TextEncrypt
  - GetText(Entity, Text)
new type TextList: list of Text variables, to avoid scalability problems when concatenating Text variables from a corpus
- dedicated derivation rules
  - creation
    - TextList(text1, text2, …)
    - TextListConcat(textList1, textList2, …)
  - Inspection
    - TextListSize, TextListAt
  - extract from sub-tables
    - GetTextList
    - TableAllTexts
    - TableAllTextLists

Optimal histograms

by default in unsupervised learning (no target variable), the new MODL preprocessing methods are activated
- numerical variables: optimal histograms are built for accurate density estimation and useful exploratory analysis
- categorical variables: an optimal number of frequent values is kept, with the rare values in a default group
former unsupervised preprocessing methods can still be used if specified
- discretization method: MODL (optimal), EqualWidth, EqualFrequency, None
  - EqualWidth: bounds are now computed as exact equal-width bounds, without discarding empty intervals
- grouping method: MODL (optimal), Basic grouping, None

Preprocessing

in supervised learning, MODL is now the only available method
- all other alternative methods are removed
max part number is now the only constraint that can be specified
- it is a "universal" constraint that applies to all preprocessing methods
  - discretization/grouping
  - supervised/unsupervised
  - univariate/bivariate

Extend max year from 4000 to 9999 in timestamps

allow better automatic type recognition when year 9999 is used in databases
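
The effect of the maximum year on type recognition can be illustrated with a simplified detector that accepts a column as a timestamp only if every value parses and its year does not exceed the limit. This is a sketch under assumptions: the function name, the single timestamp format and the sample values are made up for the example and do not reflect the actual Khiops detection logic.

```python
from datetime import datetime


def looks_like_timestamp(values, max_year):
    """Return True if every value parses as a timestamp with year <= max_year."""
    for value in values:
        try:
            parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            return False
        if parsed.year > max_year:
            return False
    return True


column = ["2024-04-11 08:18:15", "9999-12-31 00:00:00"]
accepted_before = looks_like_timestamp(column, 4000)
accepted_now = looks_like_timestamp(column, 9999)
```

With a limit of 4000, a single value in year 9999 is enough to reject the whole column as a timestamp; with the limit raised to 9999, the column is recognized.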

Khiops visualization

See Khiops visualization release notes

Khiops covisualization

See Khiops covisualization release notes

Khiops reports files .khj

Extensions of json format

section "variable statistics"
- new field "parts" in the case of unsupervised learning
- field "missingNumber" is now also available for categorical variables
- new field "sparseMissingNumber" to count the number of present values in sparse data blocks (technical field, not visualized)
"variablesDetailedStatistics"
- new sub-section "modlHistograms" in the case of unsupervised learning with MODL optimal histograms for numerical variables
  - "histogramNumber": number of available histograms, sorted by increasing granularities
  - "interpretableHistogramNumber": number of interpretable histograms (potentially one histogram less)
  - "truncationEpsilon": truncation epsilon used by the TMH (Truncation Management Heuristic) (0 if no truncation detected in data)
  - "removedSingularIntervalNumber": number of singular intervals removed from the finest histogram to obtain the first interpretable histogram
  - "granularities": vector of histogram granularities
  - "intervalNumbers": vector of histogram interval numbers
  - "peakIntervalNumbers": vector of histogram peak interval numbers
  - "spikeIntervalNumbers": vector of histogram spike interval numbers
  - "emptyIntervalNumbers": vector of histogram empty interval numbers
  - "levels": vector of histogram levels
  - "informationRates": vector of histogram information rates (between 0 and 100 for interpretable histograms)
  - "histograms": array of histograms
    - each histogram is a sub-object described by the following vectors
      - "bounds": interval bounds
      - "frequencies": interval frequencies

Khiops coclustering reports files .khcj

Simplified ergonomy

Khiops

simplified management of dictionaries
- removed pane "Data dictionary"
- extended menu "Data dictionary"
  - new menu item "Reload"
  - new menu item "Dictionary management": open a dialog box similar to former "Data dictionary" pane
- new dialog box "Dictionary management"
  - similar to a simplified version of former "Data dictionary" pane
  - new button "Edit dictionary file", to open the dictionary file using a text editor
simplified pane "Train database"
- new fields "Analysis dictionary" and "Dictionary file", replacing the related fields in former "Data dictionary" pane
- simplified layout: sub-panes for "Sampling" and "Selection" specifications
fast path for first analysis of database without specifying a dictionary
- just fill in the "Data base file" field, then click on "Train model" to
  detect the file format, automatically build the dictionary and train a model
extended menu "Help"
- new sub-menu "Quick start"
simplified pane "Parameters"
- sub-pane "Predictors/Feature engineering"
  - new field "Keep selected variables only": to keep in reports only the constructed variables selected by the SNB predictor
  - new field "Max number of text features": maximum number of features constructed from Text variables (default: 10000)
  - field "Max number of constructed variables": default number of variables constructed from multi-table schema is now 1000
- sub-pane "Predictors/Advanced predictor parameters"
  - new field "Do data preparation only"
    - removed fields:
      - "Selective Naive Bayes": trained, except if "Do data preparation only" is triggered
      - "Baseline predictor": never used in classification, always provided in regression
      - "Number of univariate predictors": suppressed
  - new button "Text feature parameters"
    - open a Dialog box "Text feature parameters"
      - field "Text features", with three choices: words, ngrams, tokens
  - removed former button "Selective Naive Bayes parameters"
    - former "Selective Naive Bayes" dialog box now directly in the layout
- sub-pane "Preprocessing"
  - removed sub-pane "Discretization" (4 fields)
  - removed sub-pane "Value grouping" (4 fields)
  - new field "Max part number": universal constraint on all preprocessings, univariate/bivariate discretization/value grouping
  - new button "Advanced parameters"
    - open a dialog box "Unsupervised parameters", with 2 fields (only remaining parameters)
      - "Discretization method": among "MODL", "EqualWidth", "EqualFrequency", "None"
      - "Grouping method": among "MODL", "BasicGrouping", "None"
- sub-pane "System parameters"
  - removed field "Max number of items in reports"
simplified pane "Results"
- now only two fields
  - "Analysis report": replace former fields "Results files directory" and "Result files prefix"
  - "Short description"
- and two buttons
  - "Export as xls": replace all former .xls reports fields
  - "Visualize results": new button to open the visualization tool directly
menu "Tool"
- new sub-menu "Interpret models"
  - open a dialog box "Interpret model"
    - allows building an interpretation dictionary, used to compute the Shapley values
simplified tool dialog boxes
- "Deploy model"
  - simplified layout with "Sampling" and "Selection" sub-panes, as in the "Train database" pane
- "Evaluate model"
  - simplified layout with "Sampling" and "Selection" sub-panes, as in the "Train database" pane
  - the evaluation report is now with format .khj, with a button "Export as xls"

Khiops coclustering

simplifications, similar to those of the Khiops tools
- simplified management of dictionaries
- fast path for first analysis of database without specifying a dictionary
- simplified pane "Database"
- extended menu "Help"
- simplified pane "Results"
new options to build instances x variables coclusterings in pane "Parameters"
- field "Coclustering type", to choose between "Variable coclustering" and "Instances x Variables coclustering"
- new sub-pane "Parameters/Instances x variables parameters"
simplified tool dialog boxes
- "Simplify coclustering"
  - removed pane "Results"
    - new field "Simplified coclustering report" added at the top of the dialog box
- "Extract clusters"
  - removed panes "Cluster parameters" and "Results"
    - new fields "Coclustering variable" and "Cluster table file" added at the top of the dialog box
- "Prepare deployment"
  - removed pane "Results"
    - new field "Coclustering variable" added at the top of the dialog box

Integration improvements

A new environment variable KHIOPS_API_MODE is available for better integration with pykhiops API

by default, the variable is not set, as in the Khiops desktop tool:
- result files are stored in the directory of the input database if their path is relative
- suffixes are imposed where necessary
if KHIOPS_API_MODE is set to true (e.g. in pykhiops), result file names are used as is
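
The two behaviours can be summarized by the following sketch of a path resolution function. It is an illustration of the rules stated above, not the Khiops code: the function name is hypothetical, only POSIX-style paths are handled, only the value "true" is treated as enabling the API mode, and the imposed suffixes of the default mode are not modelled.

```python
import os
import posixpath


def resolve_result_path(result_name, database_path, environ=os.environ):
    """Illustrative path resolution following the behaviour described above."""
    if environ.get("KHIOPS_API_MODE", "").lower() == "true":
        # API mode: the name given by the caller is used as is
        return result_name
    if posixpath.isabs(result_name):
        return result_name
    # Default mode: a relative name is stored next to the input database
    return posixpath.join(posixpath.dirname(database_path), result_name)
```

For example, with the database `/data/adult/Adult.txt`, the relative name `report.khj` resolves to `/data/adult/report.khj` in the default mode, and stays `report.khj` when the API mode is set.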

Already in v10.2.x

New option in khiops executables: -s to obtain system information
The return code is now only 0 (success) or 1 (failure)
- the old return code 2 is removed: user errors in the log are considered normal behaviour, with return code 0
- return code 1 (failure) is reserved for fatal errors, such as segmentation fault or memory overflow
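
For a caller such as pykhiops, one consequence is that a return code of 0 no longer guarantees the absence of user errors, so the log may still need to be inspected. The sketch below shows one way a wrapper might handle the two codes; the function name is hypothetical and the Python interpreter is used as a stand-in command, since the actual Khiops command line is not given here.

```python
import subprocess
import sys


def run_and_check(command):
    """Run a command and map its return code: 0 is success, anything else a failure."""
    completed = subprocess.run(command, capture_output=True, text=True)
    if completed.returncode != 0:
        raise RuntimeError(f"fatal error, return code {completed.returncode}")
    # Return code 0 may still come with user errors in the log, so return the output
    return completed.stdout


# Stand-in command: a real call would pass the Khiops executable and its arguments
output = run_and_check([sys.executable, "-c", "print('ok')"])
```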

Parallelization of new algorithms

Performance improvement

I/O performance improvement

Reliability improvement

The modeling results have been stabilised and are now independent of the platform.

New internal derivation rules

Impact in KhiopsGuide, section "8. Appendix: variable blocks and sparse data management"

New internal derivation rules

DataGridBlock
DataGridStatsBlock

Bug fixes

Many minor fixes

alexisbondu assigned marcboulle Apr 10, 2024

alexisbondu added Priority/1 To do after P0 Priority/0 To do NOW and removed Priority/1 To do after P0 labels Apr 10, 2024

folmos-at-orange mentioned this issue Apr 10, 2024

Add deprecation warnings for removals in Khiops V11 KhiopsML/khiops-python#176

Closed
