ClearScore/diver

DIVER

Diver is the Dataset Inspector, Visualiser and Encoder library, automating and codifying common data science project steps as standardised and reusable methods.

See example-notebooks/house-price-demo.ipynb for a full walkthrough or follow this link: https://tinyurl.com/ye9hfbzp.

dataset_inspector

A set of functions that check for common dataset issues which can degrade machine learning model performance.

[Figure: inspector flow chart]

dataset_conditioner

A scikit-learn-formatted module which can perform various data-type encodings in a single go, and save the associated attributes from a train-set encoding to reuse on a test-set encoding:

  • The .fit_transform method learns the encoding parameters from the feature train set (feature means and variances, and the elements of each categorical feature; shown in yellow in the flow chart below) and then applies those encodings to the train set
  • The .transform method applies train-set encodings to a test set

[Figure: fit_transform flow chart]
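
The fit-on-train, apply-to-test pattern described above can be sketched with plain scikit-learn. This is not Diver's own API: the column names and toy data are hypothetical, and ColumnTransformer, StandardScaler and OneHotEncoder stand in for whatever dataset_conditioner does internally.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are illustrative only.
train = pd.DataFrame({"area": [50.0, 70.0, 90.0], "town": ["A", "B", "A"]})
test = pd.DataFrame({"area": [70.0, 110.0], "town": ["B", "C"]})

encoder = ColumnTransformer(
    [
        # Learns the mean and variance of each numeric feature.
        ("numeric", StandardScaler(), ["area"]),
        # Learns the set of categories; unseen test categories become all zeros.
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["town"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Learn the encodings on the train set and apply them to it.
train_encoded = encoder.fit_transform(train)

# Reuse the train-set encodings on the test set; nothing is re-learned here.
test_encoded = encoder.transform(test)

print(train_encoded)
print(test_encoded)
```

Because the test set is only ever passed to .transform, its statistics never leak into the encoding, and the unseen town "C" is encoded as a row of zeros rather than raising an error.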

dataset_visualiser

Functions for visualising aspects of the dataset

Correlation analysis

  • Display the correlation matrix of the n features most strongly correlated with the dependent variable (n is specified by the user), with the dependent variable in the bottom row of the matrix

[Figure: correlation matrix]
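
A minimal pandas sketch of the top-n selection described above. The function name top_n_correlation_matrix and the toy data are hypothetical, not part of Diver, and ranking by absolute Pearson correlation is an assumption about what "top correlating" means.

```python
import pandas as pd


def top_n_correlation_matrix(df, target, n):
    """Correlation matrix of the n strongest correlates, target in the last row."""
    corr = df.select_dtypes("number").corr()
    # Rank features by absolute correlation with the target (assumed criterion).
    strength = corr[target].drop(target).abs().sort_values(ascending=False)
    ordered = list(strength.index[:n]) + [target]
    return corr.loc[ordered, ordered]


# Hypothetical toy data: price is an exact linear function of area.
houses = pd.DataFrame(
    {
        "area": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
        "rooms": [1, 2, 1, 3, 2, 4, 3, 5, 4, 6],
        "noise": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
        "price": [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0],
    }
)

result = top_n_correlation_matrix(houses, "price", 2)
print(result)
```

The returned frame can be passed to any heatmap plotting function; keeping the target last in both the row and column order places it in the bottom row of the plot.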

Latest PyPI Version

  • MAJOR: 0. -
  • MINOR: 2. - New scikit-learn single-feature missing-value imputers (mean, median, zero, most frequent) replace the previous manual implementations
  • BUGFIX: 0. -
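
For reference, the four imputation strategies named in the MINOR entry can be reproduced with scikit-learn's SimpleImputer. This is a sketch of the underlying scikit-learn calls only, under the assumption that "zero" corresponds to a constant fill of 0; how Diver wires them together is not shown, and the toy column is hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One imputer per strategy; "zero" is assumed to be a constant fill of 0.
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "zero": SimpleImputer(strategy="constant", fill_value=0),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
}

# Hypothetical single feature: the fill value is learned from the train column.
train_col = np.array([[1.0], [1.0], [4.0], [np.nan], [10.0]])
test_col = np.array([[np.nan], [3.0]])

filled = {}
for name, imputer in imputers.items():
    imputer.fit(train_col)
    # The value substituted for the missing entry in the test column.
    filled[name] = imputer.transform(test_col)[0, 0]

print(filled)
```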

Future Work

categorical_excess_cardinality_flagger_and_reducer

  • Option for instances where there are no categorical features

missing_value_conditioner

ordinal_encoder

  • Create a function to do this

timestamp_encoder

dataset_inspector as class

  • Memorise training set settings (cardinality reductions, cut features) as attributes in order to apply the same settings to test set
  • fit_transform/transform format as with dataset_conditioner

Unit test all functions

Extreme values

Check infer_useful_cols

  • Seems to be missing timestamps

useful_cols dtype and filltype dataframe within inspector

PCA option?

Verbose progress bars

  • Inspector
  • Conditioner

Label balanced class checker (for classification problems)

Distribution and correlation analysis

  • Display correlation matrix for top n correlates alongside target at the bottom
  • Display pairplot for top n correlates alongside target at the bottom
  • Or, instead of the top n correlates, select correlates by a cumulative variance threshold
  • Option to drop lower correlates (those below a threshold) if desired

Useful reading