Prepare V11 version : internal release notes #223

Open
marcboulle opened this issue Apr 5, 2024 · 3 comments
Labels
Priority/0 To do NOW


@marcboulle
Collaborator

Linked to the overall steering issue: Prepare V11 version #6

The goal is to make public the internal release notes for Khiops V11, which may affect the whole ecosystem:

  • steering: sharing information about the new features
  • pykhiops
    • take into account as early as possible (starting with versions 10.2.x) what is deprecated and will be removed in V11, to warn users
    • evolution of pykhiops core to take into account the parameter changes (added or removed) and the new scenarios
  • visualization tools: evolution to follow the new features (well advanced, to be finalized)
  • automl
  • documentation
  • ...

The internal release notes are complete and in line with version 10.5.0-a1, and there will be almost no further changes.
Proposal for discussion:

  • we publish these internal release notes on a wiki page of the khiops repo
    • if there are changes, we add sub-sections describing them, in sync with khiops tags
  • each "client" repo creates an issue if needed to take these changes into account
@alexisbondu
Collaborator

This document will be written directly as a comment on this issue. Also create an issue in pyKhiops to warn about the deprecated features.

@marcboulle
Collaborator Author

marcboulle commented Apr 11, 2024

Preparation of Khiops V11

New features of Khiops V11: see the next comment, Khiops 11.0 internal release notes

Remaining work for V11

  • Khiops
    • SNB with sparse data: integrated, to be finalized
    • trees for regression: developed, to be integrated
    • finalization of the interpretability feature
    • collection of the most frequent tokens for the construction of text features
    • finalization of the instances x variables coclustering
    • taking into account feedback from a beta-test release, as soon as the sparse SNB is integrated
  • Khiops visualization
    • histograms: visualize the series of simplified histograms
    • unsupervised: support for a "Parts" column
  • Khiops covisualization
    • fixing existing bugs
    • support for the instances x variables coclustering
  • pykhiops:
    • support for the new features
  • documentation
    • support for the new features

Potentially included in V10.2.x

The following evolutions, developed for V11, may be backported to the V10.2.0 branch:

  • New option in khiops executables: -s to obtain system information
  • Khiops covisualization: fixing existing bugs

In that case, they will have to be removed from the V11 release notes.
Some bug fixes have already been backported to V10.2.0 this way (see "Bug fix" in WHATSNEW.txt).

Update of the Khiops 11.0 internal release notes

Reference: the next comment, titled Khiops 11.0 internal release notes
Updates:

  • as needed, as the remaining work is completed
  • by describing the new items in the history below

Update history

  • initialization: populated by reviewing the commit notes
    • sources
      • former version.txt file
        • LearningDoc\ProjectManagement\KhiopsHistoricalProject2023\Learning\Doc\version.txt
        • from V10.2.0i up to, but not including, V10.4.2i
      • git log of the KhiopsML/khiops GitHub repository
        • since V10.4.2i
  • dd/mm/2024: details of the new items
  • 11/04/2024: initialization, from 10.2.0i to 10.5.0-a1
    up to the commit "Merge pull request 196 assertion violated in kwprobabilitytabletest #227 from KhiopsML/196-assertion-violated-in-kwprobabilitytabletest"
  • 24/05/2024: 10.5.0-a1 to 10.5.0-b1
    • the return code is now always 0 if OK, 1 otherwise (no more return code 2)
    • what will already be in a 10.2.x version is now specified

Beta-test release

@marcboulle
Collaborator Author

marcboulle commented Apr 11, 2024

Khiops 11.0 internal release notes

The purpose of the internal release notes is:

  • to give all detailed evolutions and corrections potentially useful for the Khiops ecosystem
  • to allow pykhiops and AutoML to adapt in advance to the functional parts of these evolutions
  • to be the base for the file whatsnewV10.0.txt, the "official" release note (quick summary)

These release notes follow the last version of Khiops, described in the Khiops 10.2 release notes.

Khiops 11.0 is a major version, with several major functional improvements.

Major improvements

Text data

  • new Text type for variables in tabular or multi-table schema
  • Automatic feature construction from Text variables

SNB classifier for sparse data

  • extension to sparse data

Random forests for regression

Khiops interpretation

  • Instance-based interpretation of scores
  • Exact computation of Shapley values
  • Build an interpretation dictionary
    • To deploy interpretation values

Histograms

  • Optimal histograms for univariate data exploration

Coclustering instances x variables

  • extension of existing variable x variable coclustering, for joint density estimation
  • to instances x variables coclustering, for exploratory analysis

New visualization tools

  • visualization
    • new panel to visualize histograms
  • covisualization
    • accounting for the case of instances x variables coclustering

Simplified ergonomy

  • simplification of panels and fields, everywhere, as much as possible
  • fast path: to train a model without a dictionary
  • results visualization and edition of dictionaries from the graphical interface

Detailed evolutions

Functional improvements

Text data

  • new type Text available in Khiops dictionaries
    • Text variables can contain up to 1000000 bytes
    • Categorical variables are now limited to 1000 bytes
  • type detected in automatic "build dictionary" feature
  • automatic feature construction
    • parameter "number of text features ", with default value 10000
    • text features:
      • words: default automatic tokenization
      • ngrams: black-box using ngrams of bytes, for blob-like variables
      • tokens: open to user defined tokenization
  • new derivations rules for Text variables
    • TextLoadFile: load a Text variable from a text file, up to 1000000 chars, replacing end of lines with whitespace
    • FromText, ToText: conversion to and from categorical variables
    • rules similar to those related to categorical variables:
      • TextLength, TextLeft, TextRight, TextMiddle
      • TextTokenLength, TextTokenLeft, TextTokenRight, TextTokenMiddle
      • TextTranslate, TextSearch, TextReplace, TextReplaceAll
      • TextRegexMatch, TextRegexSearch, TextRegexReplace, TextRegexReplaceAll
      • TextToUpper, TextToLower
      • TextConcat, TextHash, TextEncrypt
      • GetText(Entity, Text)
  • new type TextList: list of Text variables, to avoid scalability problems when concatenating Text variables from a corpus
    • dedicated derivation rules
      • creation
        • TextList(text1, text2, …)
        • TextListConcat(textList1, textList2, …)
      • Inspection
        • TextListSize, TextListAt
      • extract from sub-tables
        • GetTextList
        • TableAllTexts
        • TableAllTextLists
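
As a rough illustration of the TextLoadFile behaviour listed above (end of lines replaced by whitespace, length capped at 1000000), here is a minimal Python sketch. It is not Khiops code: the helper name `normalize_loaded_text`, the choice to truncate before replacing, the treatment of `\r\n` as a single end of line, and counting characters rather than bytes are all assumptions made for the example.

```python
MAX_TEXT_LENGTH = 1_000_000  # limit quoted in the release notes


def normalize_loaded_text(raw: str, max_length: int = MAX_TEXT_LENGTH) -> str:
    """Hypothetical helper mimicking the documented TextLoadFile behaviour."""
    # Assumption: the content is truncated first, then end of lines are replaced
    truncated = raw[:max_length]
    # Assumption: a Windows end of line (CR LF) becomes a single space
    return truncated.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")


if __name__ == "__main__":
    print(normalize_loaded_text("first line\nsecond line\r\nthird line"))
```

A real dictionary would call the TextLoadFile rule itself; the sketch only helps when preparing test data that should look like what the rule produces.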

Optimal histograms

  • by default in unsupervised learning (no target variable), the new MODL preprocessing methods are activated
    • numerical variables: optimal histograms are built for accurate density estimation and useful exploratory analysis
    • categorical variables: an optimal number of frequent values is kept, with the rare values in a default group
  • former unsupervised preprocessing methods can still be used if specified
    • discretization method: MODL (optimal), EqualWidth, EqualFrequency, None
      • EqualWidth: bounds are now computed as exact equal-width bounds, without discarding empty intervals
    • grouping method: MODL (optimal), Basic grouping, None
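
To make the EqualWidth change concrete, the sketch below computes exact equal-width bounds and keeps empty intervals in the result. It is an illustration in Python, not the Khiops implementation; the function name `equal_width_histogram` and the convention that the maximum value falls in the last interval are assumptions.

```python
def equal_width_histogram(values, interval_number):
    """Illustrative equal-width discretization that keeps empty intervals."""
    if interval_number < 1 or not values:
        raise ValueError("need at least one value and one interval")
    lower, upper = min(values), max(values)
    width = (upper - lower) / interval_number
    # Exact equal-width bounds, regardless of where the data actually falls
    bounds = [lower + i * width for i in range(interval_number + 1)]
    bounds[-1] = upper  # avoid floating-point drift on the last bound
    frequencies = [0] * interval_number
    for value in values:
        if width == 0:
            index = 0  # all values are identical
        else:
            # Assumption: the maximum value belongs to the last interval
            index = min(int((value - lower) / width), interval_number - 1)
        frequencies[index] += 1
    return bounds, frequencies


if __name__ == "__main__":
    print(equal_width_histogram([0, 1, 2, 10], 5))
```

With the values 0, 1, 2 and 10 split into 5 intervals, the three intervals between 4 and 10 that contain no data still appear with a frequency of 0, instead of being merged away.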

Preprocessing

  • in supervised learning, MODL is now the only available method
    • all other alternative methods are removed
  • max part number is now the only constraint that can be specified
    • it is a "universal" constraint that applies to all preprocessing methods
      • discretization/grouping
      • supervised/unsupervised
      • univariate/bivariate

Extend max year from 4000 to 9999 in timestamps

  • allow better automatic type recognition when year 9999 is used in databases

Khiops visualization

See Khiops visualization release notes

Khiops covisualization

See Khiops covisualization release notes

Khiops reports files .khj

Extensions of json format

  • section "variable statistics"
    • new field "parts" in the case of unsupervised learning
    • field "missingNumber" is now also available for categorical variables
    • new field "sparseMissingNumber" to count the number of present values in sparse data blocks (technical field, not visualized)
  • "variablesDetailedStatistics"
    • new sub-section "modlHistograms" in the case of unsupervised learning with MODL optimal histograms for numerical variables
      • "histogramNumber": number of available histograms, sorted by increasing granularities
      • "intrepretableHistogramNumber": number of interpretable histograms (potentially one histogram less)
      • "truncationEpsilon": truncation epsilon used by the TMH (Truncation Management Heuristic) (0 if no truncation detected in data)
      • "removedSingularIntervalNumber": number of singular intervals removed from the finest histogram to obtain the first interpretable histogram
      • "granularities": vector of histogram granularities
      • "intervalNumbers": vector of histogram interval numbers
      • "peakIntervalNumbers": vector of histogram peak interval numbers
      • "spikeIntervalNumbers": vector of histogram spike interval numbers
      • "emptyIntervalNumbers": vector of histogram empty interval numbers
      • "levels": vector of histogram levels
      • "informationRates": vector of histogram information rates (between 0 and 100 for interpretable histograms)
      • "histograms": array of histograms
        • each histogram is a sub-object described by the following vectors
          • "bounds": interval bounds
          • "frequencies": interval frequencies
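
The "bounds" and "frequencies" vectors described above are enough to derive a density per interval. The Python sketch below shows one way a client tool might do it. The numbers in the sample are made up for illustration, the helper name `histogram_densities` is hypothetical, and the sketch assumes that "bounds" holds one more entry than "frequencies"; the nesting of "modlHistograms" inside a full .khj file is not reproduced here.

```python
import json

# Made-up fragment using only field names listed in the release notes
SAMPLE = json.loads("""
{
  "histogramNumber": 1,
  "histograms": [
    {"bounds": [0.0, 1.0, 3.0], "frequencies": [2, 2]}
  ]
}
""")


def histogram_densities(histogram):
    """Density of each interval: frequency / (total frequency * interval length)."""
    bounds = histogram["bounds"]
    frequencies = histogram["frequencies"]
    # Assumption: n intervals are described by n + 1 bounds
    if len(bounds) != len(frequencies) + 1:
        raise ValueError("expected one more bound than frequencies")
    total = sum(frequencies)
    return [
        frequency / (total * (upper - lower))
        for frequency, lower, upper in zip(frequencies, bounds, bounds[1:])
    ]


if __name__ == "__main__":
    for histogram in SAMPLE["histograms"]:
        print(histogram_densities(histogram))
```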

Khiops coclustering reports files .khcj

Simplified ergonomy

Khiops

  • simplified management of dictionaries
    • removed pane "Data dictionary"
    • extended menu "Data dictionary"
      • new menu item "Reload"
      • new menu item "Dictionary management": open a dialog box similar to former "Data dictionary" pane
    • new dialog box "Dictionary management"
      • similar to a simplified version of former "Data dictionary" pane
      • new button "Edit dictionary file", to open the dictionary file using a text editor
  • simplified pane "Train database"
    • new fields "Analysis dictionary" and "Dictionary file", replacing the related fields in former "Data dictionary pane"
    • simplified layout: sub-panes for "Sampling" and "Selection" specifications
  • fast path for first analysis of database without specifying a dictionary
    • just fill in the "Data base file" field, then click on "Train model" to
      detect the file format, automatically build the dictionary and train a model
  • extended menu "Help"
    • new sub-menu "Quick start"
  • simplified pane "Parameters"
    • sub-pane "Predictors/Feature engineering"
      • new field "Keep selected variables only": to keep in reports only the constructed variables selected by the SNB predictor
      • new field "Max number of text features": maximum number of features constructed from Text variables (default: 10000)
      • field "Max number of constructed variables": the default number of variables constructed from a multi-table schema is now 1000
    • sub-pane "Predictors/Advanced predictor parameters"
      • new field "Do data preparation only"
        • removed fields:
          • "Selective Naive Bayes": trained, except if "Do data preparation only" is triggered
          • "Baseline predictor": never used in classification, always provided in regression
          • "Number of univariate predictors": suppressed
      • new button "Text feature parameters"
        • open a Dialog box "Text feature parameters"
          • field "Text features", with three choices: words, ngrams, tokens
      • removed former button "Selective Naive Bayes parameters"
        • former "Selective Naive Bayes" dialog box now directly in the layout
    • sub-pane "Preprocessing"
      • removed sub-pane "Discretization" (4 fields)
      • removed sub-pane "Value grouping" (4 fields)
      • new field "Max part number": universal constraint on all preprocessings, univariate/bivariate discretization/value grouping
      • new button "Advanced parameters"
        • open a dialog box "Unsupervised parameters", with 2 fields (only remaining parameters)
          • "Discretization method": among "MODL", "EqualWidth", "EqualFrequency", "None"
          • "Grouping method": among "MODL", "BasicGrouping", "None"
    • sub-pane "System parameters"
      • removed field "Max number of items in reports"
  • simplified pane "Results"
    • now only two fields
      • "Analysis report": replace former fields "Results files directory" and "Result files prefix"
      • "Short description"
    • and two buttons
      • "Export as xls": replace all former .xls reports fields
      • "Visualize results": new button to open the visualization tool directly
  • menu "Tool"
    • new sub-menu "Interpret models"
      • open a dialog box "Interpret model"
        • allows building an interpretation dictionary, to compute the Shapley values
  • simplified tool dialog boxes
    • "Deploy model"
      • simplified layout with "Sampling" and "Selection" sub-panes, as in the "Train database" pane
    • "Evaluate model"
      • simplified layout with "Sampling" and "Selection" sub-panes, as in the "Train database" pane
      • the evaluation report is now with format .khj, with a button "Export as xls"

Khiops coclustering

  • simplifications, similar to those of the Khiops tools
    • simplified management of dictionaries
    • fast path for first analysis of database without specifying a dictionary
    • simplified pane "Database"
    • extended menu "Help"
    • simplified pane "Results"
  • new options to build instances x variables coclusterings in pane "Parameters"
    • field "Coclustering type", to choose between "Variable coclustering" and "Instances x Variables coclustering"
    • new sub-pane "Parameters/Instances x variables parameters"
  • simplified tool dialog boxes
    • "Simplify coclustering"
      • removed pane "Results"
        • new field "Simplified coclustering report" added at the top of the dialog box
    • "Extract clusters"
      • removed panes "Cluster parameters" and "Results"
        • new fields "Coclustering variable" and "Cluster table file" added at the top of the dialog box
    • "Prepare deployment"
      • removed pane "Results"
        • new field "Coclustering variable" added at the top of the dialog box

Integration improvements

A new environment variable KHIOPS_API_MODE is available for better integration with pykhiops API

  • default behavior when the variable is not set, as in the Khiops desktop tool:
    • result files are stored in the directory of the input database if their path is relative
    • suffixes are imposed where necessary
  • if KHIOPS_API_MODE is set to true (e.g. in pykhiops), result file names are used as is
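
The path rule above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the Khiops implementation: the function name `resolve_result_path` is hypothetical, the comparison with the string "true" is an assumption about how the variable is read, and the suffix handling mentioned above is deliberately left out.

```python
import os
import posixpath


def resolve_result_path(result_file, database_file, environ=None):
    """Hypothetical sketch of where a result file ends up."""
    env = os.environ if environ is None else environ
    # Assumption: the API mode is active when the variable equals "true"
    api_mode = env.get("KHIOPS_API_MODE", "").lower() == "true"
    if api_mode or posixpath.isabs(result_file):
        # API mode: the name is used as is; absolute paths are never rewritten
        return result_file
    # Desktop behaviour: a relative path is stored next to the input database
    return posixpath.join(posixpath.dirname(database_file), result_file)


if __name__ == "__main__":
    print(resolve_result_path("report.khj", "/data/train.txt", {}))
    print(resolve_result_path("report.khj", "/data/train.txt", {"KHIOPS_API_MODE": "true"}))
```

The sketch uses POSIX-style paths throughout so that it behaves the same on every platform.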

Already in v10.2.x

  • New option in khiops executables: -s to obtain system information
  • The return code is now only 0 (success) or 1 (failure)
    • the old return code 2 is removed: user errors in the log are considered normal behaviour, with return code 0
    • return code 1 (failure) is reserved for fatal errors, such as segmentation fault or memory overflow
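
Since a return code of 0 no longer guarantees that the log is free of user errors, a calling tool may want to look at both the return code and the log. The Python sketch below shows one possible approach. The helper name `summarize_run`, the status labels, and the convention that user errors appear as log lines starting with "error" are assumptions made for the example, not a documented Khiops log format.

```python
def summarize_run(return_code, log_lines):
    """Hypothetical helper: classify a run from its return code and its log."""
    if return_code not in (0, 1):
        raise ValueError(f"unexpected return code: {return_code}")
    # Assumption: user errors are logged on lines starting with "error"
    user_errors = [
        line for line in log_lines if line.strip().lower().startswith("error")
    ]
    if return_code == 1:
        status = "fatal"  # reserved for crashes such as memory overflow
    elif user_errors:
        status = "completed_with_user_errors"  # return code 0, errors in the log
    else:
        status = "ok"
    return status, user_errors


if __name__ == "__main__":
    print(summarize_run(0, ["Train model", "error : missing database file"]))
```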

Parallelization of new algorithms

Performance improvement

I/O performance improvement

Reliability improvement

The modeling results have been stabilised and are now independent of the platform.

New internal derivation rules

Impact in KhiopsGuide, section "8. Appendix: variable blocks and sparse data management"

New internal derivation rules

  • DataGridBlock
  • DataGridStatsBlock

Bug fixes

Many minor fixes
