## Feature clustering and feature impact de-collinearisation

In this notebook, we'll look at a few ways to get cleaner feature impact and reduce the number of features by clustering collinear features (using DBSCAN or affinity propagation).

We start with some standard imports.  For plotting we use `bokeh`, as it offers good interactivity and embeds cleanly into notebooks.

In [1]:
!pip install bokeh


In [1]:
import os
from datetime import datetime
import pickle
import urllib3

import numpy as np
import pandas as pd
import datarobot as dr

from sklearn.cluster import DBSCAN, AffinityPropagation
from sklearn.manifold import MDS
from sklearn.preprocessing import minmax_scale

from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, output_file, output_notebook, show
from bokeh.palettes import Dark2_7, Category20_20

# get rid of annoying SSL warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)  # suppresses warnings!

# wider .head()s
pd.options.display.width = 240
pd.options.display.max_columns = 200
pd.options.display.max_rows = 2000

output_notebook()

In [4]:
client = dr.Client()  # ssl_verify=False)
client.endpoint


'https://app.datarobot.com/api/v2'

For demo mode, we are going to build a simple project, using a modified version of the Zillow dataset.  To avoid demo mode, simply ensure that `GET_PROJECT_ID` and `GET_MODEL_ID` are populated.

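Let's set up a few globals.  They are commented out here because this run uses a pre-built project; uncomment them (in particular `CREATE_PROJ_NAME`) before building a project from scratch.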

In [4]:
# DATA_PATH = u"."

# CREATE_PROJ_NAME = "ZPCN - collinear feature impact demo"
# TRAIN_DATA = "ZPCN.csv"

# CALCULATE_OTHER_FIS = False

# # Feature derivation settings
# USE_RATIO = True
# USE_BOTH = False

# # Settings for add-one-in feature impact and retraining of model
# SCORING_TYPE = 'crossValidation'
# MAX_WORKERS = -1

To work with an existing project/model, enter the project ID and model ID below.  To build the project from scratch, set both to `None`.
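
Both IDs can be read off a model's leaderboard URL, as the comments in the next cell illustrate. As a minimal sketch, the extraction could be automated with a regular expression. The helper name `ids_from_url` is hypothetical, and the 24-hex-character ID format is an assumption based on the example IDs in this notebook.

```python
import re

# URL layout as in this notebook's example; the 24-hex-character ID format
# is an assumption based on the example IDs.
URL_PATTERN = re.compile(
    r"/projects/(?P<project_id>[0-9a-f]{24})/models/(?P<model_id>[0-9a-f]{24})"
)


def ids_from_url(url):
    """Return (project_id, model_id) parsed from a leaderboard model URL."""
    match = URL_PATTERN.search(url)
    if match is None:
        raise ValueError(f"No project/model IDs found in {url!r}")
    return match.group("project_id"), match.group("model_id")


example_url = (
    "https://app.eu.datarobot.com/projects/5fa257053a267b0a2995ae14"
    "/models/5fa3aa1c99469d1d5a47b0a9/blueprint"
)
project_id, model_id = ids_from_url(example_url)
print(project_id, model_id)
```

The two values could then be assigned to `GET_PROJECT_ID` and `GET_MODEL_ID`.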

In [4]:
# Build project from scratch
# GET_PROJECT_ID = None
# GET_MODEL_ID = None

# Name of the target variable if creating project from scratch
TARGET_FEAT = "Churned"

# Or, use pre-built project/model — example:

# extract project ID and model ID from the relevant URL:
# https://app.eu.datarobot.com/projects/5fa257053a267b0a2995ae14/models/5fa3aa1c99469d1d5a47b0a9/blueprint
#                       GET_PROJECT_ID>>^^^^^^^^^^^^^^^^^^^^^^^^        ^^^^^^^^^^^^^^^^^^^^^^^^<<GET_MODEL_ID

GET_PROJECT_ID = '67a648b75bfcc8d4c9a7d3c4'
GET_MODEL_ID = '67a66bf36bf0f970f4d61fae'

We ingest the source data — we will need it even if we are using a pre-built project.  If we don't have that project, we'll also build it now.

In [14]:
# we'll need the data either way...
print('Reading data...')
# model_data = pd.read_csv(os.path.join(DATA_PATH, TRAIN_DATA), encoding='latin-1')

model_data = dr.Dataset.get('67a640a67200381a9991d65f').get_as_dataframe(low_memory=True)

# build the project if we don't already have one
if GET_PROJECT_ID is None:
    print('No project ID given.  Creating project...')
    project = dr.Project.create(model_data, project_name=CREATE_PROJ_NAME)

    print('Setting target variable...')
    project.set_worker_count(-1)  # maximum workers available
    project.set_target(target=TARGET_FEAT)

    print('Running autopilot...')
    project.wait_for_autopilot()
    
    GET_PROJECT_ID = project.id

else:
    # get existing project
    print('Getting project', GET_PROJECT_ID)
    project = dr.Project.get(project_id=GET_PROJECT_ID)
    TARGET_FEAT = project.target
    print('Target is', TARGET_FEAT)

Reading data...


In [15]:
model_data.shape

We also want to get the best *non-blender* model if we build the project from scratch.  (We'll avoid blenders, as they're slow to train and analyse.)
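
The selection logic can be sketched on a toy leaderboard before wiring it up to DataRobot. The model names, scores and the helper `pick_best_non_blender` below are made up for illustration. The column name follows the `"{metric} | {dataset}"` pattern that the `get_leaderboard` helper in this notebook produces.

```python
import pandas as pd

# Toy leaderboard; model names and scores are invented for illustration.
leaderboard = pd.DataFrame(
    {
        "model_type": [
            "AVG Blender",
            "Gradient Boosted Trees",
            "Elastic-Net",
            "Gradient Boosted Trees",
        ],
        "sample_pct": [64.0, 64.0, 64.0, 100.0],
        "RMSE | crossValidation": [0.90, 0.95, 1.10, 0.85],
    },
    index=pd.Index(["m1", "m2", "m3", "m4"], name="id"),
)


def pick_best_non_blender(lb, score_column, lower_is_better):
    """Drop blenders and 100%-sample models, then return the best remaining ID."""
    candidates = lb[~lb["model_type"].str.contains("Blender")]
    candidates = candidates[candidates["sample_pct"] < 100]
    ranked = candidates.sort_values(by=score_column, ascending=lower_is_better)
    return ranked.index[0]


best_id = pick_best_non_blender(
    leaderboard, "RMSE | crossValidation", lower_is_better=True
)
print(best_id)
```

The `lower_is_better` flag plays the role of the metric's `ascending` attribute: for error metrics the smallest score wins, for metrics such as AUC the largest does.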

In [10]:
def get_leaderboard(models, omit_dsets=(), verbose=False):
    values = []
    for model in models:
        if verbose:
            print('Fetching metrics for', model.id, model.model_type)
        
        s = {
            f"{metric} | {dataset}": value
            for metric, scores in model.metrics.items()
            for dataset, value in scores.items()
            if dataset not in omit_dsets
        }
        try:
            s.update({
                "model_type": model.model_type,
                "id": model.id,
                "model_link": model.get_leaderboard_ui_permalink(),
                "model_object": model,
                "sample_pct": model.sample_pct,
            })
        except AttributeError:
            s.update({
                "model_type": model.model_type,
                "id": model.id,
                "model_link": model.get_uri(),
                "model_object": model,
                "sample_pct": model.sample_pct,
            })
        values.append(s)
    
    return pd.DataFrame(values).set_index("id")


In [11]:
if GET_MODEL_ID is None:
    # get the best non-blender model
    proj_metric = project.metric
    if project.is_datetime_partitioned:
        models = project.get_datetime_models(order_by=['-metric', 'sample_pct'])
        sort_column = 'backtesting'
    else:   
        models = project.get_models(order_by=['-metric', 'sample_pct'])
        sort_column = 'crossValidation'

    leaderboard = get_leaderboard(models, verbose=True)

    # remove the blenders
    leaderboard = leaderboard[~leaderboard['model_type'].str.contains('Blender')]

    # sort on the metric
    proj_metrics = project.get_metrics(project.target)
    sort_ascending = [d for d in proj_metrics['metric_details'] if d['metric_name'] == proj_metric][0]['ascending']
    leaderboard.sort_values(by=f'{proj_metric} | {sort_column}', ascending=sort_ascending, inplace=True)

    # remove frozen models
    leaderboard = leaderboard[leaderboard['sample_pct'] < 100]

    best_model = leaderboard.iloc[0, :]['model_object']

else:
    best_model = dr.Model.get(project=GET_PROJECT_ID, model_id=GET_MODEL_ID)

We need a helper function that creates a feature list if it doesn't exist yet, and returns the existing feature list if it does.

In [12]:
def get_or_create_featurelist(project, fl_name, features):
    try:
        flist = project.create_featurelist(fl_name, features)
    except dr.errors.ClientError as e:
        # a 422 is expected when the feature list already exists; re-raise anything else
        if e.status_code != 422:
            raise
        # this is horrible syntax, but works — we can't access featurelists by name, so we iterate
        # over all fl in the project until we find one that matches the name of what we want
        scratch_fl = project.get_featurelists()
        flist = [fl for fl in scratch_fl if fl.name == fl_name][0]
    return flist

Next, we define a function to generate the feature clusters and plot them as a correlation map.
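
Before the full function, the core clustering step can be tried in isolation. The sketch below uses synthetic data (all column names are made up) and the same idea as the function: turn the correlation matrix into a distance matrix (1 - correlation) and hand it to DBSCAN as a precomputed metric. Clipping the distances to be non-negative is an extra safeguard added here against floating-point noise on the diagonal, not something the notebook's function does.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n)

# Three near-collinear features plus two independent ones (synthetic data).
toy = pd.DataFrame(
    {
        "a": base,
        "a_noise1": base + 0.1 * rng.normal(size=n),
        "a_noise2": base + 0.1 * rng.normal(size=n),
        "b": rng.normal(size=n),
        "c": rng.normal(size=n),
    }
)

# Distance = 1 - correlation; clip so rounding error cannot produce negatives.
corr = toy.corr()
dist = np.clip((1 - corr).to_numpy(), 0.0, 2.0)

# Features within max distance 0.4 of at least 3 features (self included) form a cluster.
labels = DBSCAN(eps=0.4, min_samples=3, metric="precomputed").fit(dist).labels_

# DBSCAN labels noise as -1; adding 1 makes 0 mean "no cluster", as in the function below.
clusters = pd.Series(labels + 1, index=corr.index, name="cluster")
print(clusters)
```

With this setup the three `a*` features should land in one cluster, while `b` and `c` stay unclustered.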

In [13]:
def make_feature_clusters_map(feats_df, distances=None, max_dist=0.4, min_feats=3, blob_size=10,
                              target_col='target', allow_unclustered_feats=True,
                              use_feature_importance=False, project=None):
    """
    
       Clusters the numeric feature space and renders a correlation map of the features, colour-coded by cluster number.

           Parameters:
           
               feats_df (pd.DataFrame):         The features to be clustered; features in columns, observations in rows.
               distances (pd.DataFrame):        Optional precomputed distance matrix.  Will compute distance matrix 
                   from features if None.
               max_dist (float):                The maximum distance (1 - correlation) between two features for the
                   relationship to not be 'noisy' in DBSCAN terms.  Ignored if using affinity propagation.
               min_feats (int):                 The minimum number of features to be included in a single DBSCAN
                   cluster.  Ignored if using affinity propagation.
               blob_size (float):               The size of the markers in the correlation map.
               target_col (str):                The name of the target column in feats_df.
               allow_unclustered_feats (bool):  whether to run a clustering algorithm that allows certain features to be
                   labelled as 'noise' and not belonging to a particular cluster (DBSCAN). If False, assign all features
                   to a cluster, and allow affinity propagation clustering to automatically determine the optimal
                   number of clusters/features.
               use_feature_importance (bool):   whether to use feature importance from DR for sizing blobs (True) or
                   pairwise correlation (False)
               project (dr.Project):            Project object for feature importance.  Ignored otherwise.

           Returns:
           
               feats_sct_data (pd.DataFrame):   The source data for the correlation map, including the cluster
                   numbers of the various features.

    """

    # let's make a distance matrix
    if distances is None:
        corr_mtrx = feats_df.corr()
        print('Correlation matrix shape:', corr_mtrx.shape)
        dist_mtrx = 1 - corr_mtrx
    else:
        dist_mtrx = distances.copy()

    dist_mtrx = dist_mtrx.dropna(axis=0, how='all').dropna(axis=1, how='all')
    print('distance matrix shape:', dist_mtrx.shape)

    print('Clustering features.')
    dist = dist_mtrx.values

    if allow_unclustered_feats:
        # use DBSCAN to cluster if we allow 'noise' clusters, i.e. feats that don't belong to any cluster
        print('max_dist', max_dist)
        clusterer = DBSCAN(eps=max_dist, min_samples=min_feats, metric='precomputed').fit(dist)
    else:
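        # use affinity propagation
        # NOTE: with the default affinity='euclidean', each row of the distance matrix is treated as a
        # feature vector; the matrix is not used as a precomputed similarity matrix
        # TODO: add cosine similarity calculation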
        clusterer = AffinityPropagation(verbose=True, damping=0.8, max_iter=1000).fit(dist)

    # use MDS to calculate layout for scatter plot
    print('Calculating MDS embedding for scatter plot.')
    embedding = MDS(n_components=2, dissimilarity='precomputed', verbose=1, n_init=4, max_iter=250)
    feats_plot = embedding.fit_transform(dist)
    feats_sct_data = pd.DataFrame(feats_plot, index=dist_mtrx.index, columns=['x', 'y'])
    feats_sct_data['cluster'] = 1 + clusterer.labels_
    print(feats_sct_data.cluster.max(), 'clusters found.')

    # get feature importances for blob size
    if use_feature_importance:
        print('Getting feature importances.')
        for f in feats_sct_data.index:
            print(f)
            feats_sct_data.loc[f, 'importance'] = dr.Feature.get(project_id=project.id, feature_name=f).importance
    # or use correlation
    else:
        feats_sct_data['importance'] = 1 - dist_mtrx[target_col]
        feats_sct_data = feats_sct_data.loc[feats_sct_data.index != target_col, :]

    # and scale
    feats_sct_data['importance_scaled'] = minmax_scale(feats_sct_data.importance.abs(), feature_range=(0.2, 1.2))

    # build a custom palette that can take up to 100 clusters. Start with the colour for `no cluster`
    custom_palette = ['#aaaaaa'] + list(Category20_20) * 5

    # assign colours by cluster number
    feats_sct_data['color'] = [custom_palette[c] for c in feats_sct_data.cluster]

    feats_sct_data.index.name = 'feature_name'

    print('Building original features correlation map.')

    # draw scatter
    TOOLS = "hover,pan,wheel_zoom,box_zoom,reset,save"

    feats_sct_data['blob_size'] = feats_sct_data.importance_scaled * blob_size
    source = ColumnDataSource(feats_sct_data.reset_index())

    # set up canvas
    # figure should be square
    p = figure(width=800, height=800, tools=TOOLS,
               x_range=(-1., 1.), y_range=(-1., 1.))
    p.xaxis.visible = False
    p.yaxis.visible = False

    # set up blobs
    renderer = p.circle(x='x', y='y', size='blob_size', source=source, color='color', fill_alpha=0.7,
                        line_color='color', line_width=2, line_alpha=0.3)
    p.hover.tooltips = [('feature', '@feature_name'), ('cluster', '@cluster'), ('importance', '@importance')]

    show(p)

    return feats_sct_data

Let's test the cluster map.

In [14]:
fsd = make_feature_clusters_map(model_data, blob_size=20, target_col=TARGET_FEAT, allow_unclustered_feats=False)

Correlation matrix shape: (41, 41)
distance matrix shape: (41, 41)
Clustering features.
Converged after 38 iterations.
Calculating MDS embedding for scatter plot.
breaking at iteration 83 with stress 75.42062851269415
breaking at iteration 133 with stress 64.73191004226335
breaking at iteration 111 with stress 69.98370849746614
breaking at iteration 148 with stress 65.65954654593774
5 clusters found.
Building original features correlation map.




In [15]:
fsd = make_feature_clusters_map(model_data, blob_size=20, max_dist=.5, target_col=TARGET_FEAT, allow_unclustered_feats=True)

Correlation matrix shape: (41, 41)
distance matrix shape: (41, 41)
Clustering features.
max_dist 0.5
Calculating MDS embedding for scatter plot.
breaking at iteration 125 with stress 66.66830434432921
breaking at iteration 107 with stress 71.68465967125965
breaking at iteration 150 with stress 69.3446786279916
breaking at iteration 122 with stress 69.09720022051837
2 clusters found.
Building original features correlation map.




Here's the main body of the code.  We define a function that takes the raw data, project and best model IDs, and then does the following:

- Arrange the features into clusters (either comprehensive or noisy) and plot a correlation map.

- For each cluster:
    - Remove all the features in the cluster from the feature list.
    - For each feature in the cluster:
        - Add that one feature back to the list of features not in the current cluster.
        - Retrain the best model on this new feature list.
    - Pick the feature whose add-back model performs best.  This becomes our new 'lead feature' for the cluster.
    - _Optionally, to save time,_ skip the retraining and take the feature with the strongest correlation to the target variable as the 'lead feature' instead.
    - _Optionally,_ take the difference and/or ratio of the other features in the cluster against the new lead feature.
    
- Build a new project with the lead features plus the 'new' features 
    - We can now look at cleaner, _per-cluster_ feature impact on the best model in the new project.
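
The add-one-back step above can be sketched locally, without DataRobot. Here a scikit-learn `LinearRegression` scored by cross-validated R² stands in for retraining the best model. That substitution, the synthetic data and the helper name `pick_lead_feature` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def pick_lead_feature(X, y, cluster_feats, cv=5):
    """Add each cluster feature back to the stub list and return the best one."""
    # Stub list: every feature that is not in the current cluster.
    stub = [c for c in X.columns if c not in cluster_feats]
    scores = {}
    for f in cluster_feats:
        cols = stub + [f]
        # Stand-in for "retrain the best model on this feature list".
        scores[f] = cross_val_score(
            LinearRegression(), X[cols], y, cv=cv, scoring="r2"
        ).mean()
    lead = max(scores, key=scores.get)
    return lead, scores


rng = np.random.default_rng(1)
n = 500
signal = rng.normal(size=n)
other = rng.normal(size=n)

X = pd.DataFrame(
    {
        "signal": signal,
        "signal_noisy1": signal + 0.5 * rng.normal(size=n),
        "signal_noisy2": signal + 1.0 * rng.normal(size=n),
        "other": other,
    }
)
y = 3 * signal + other + 0.1 * rng.normal(size=n)

lead, scores = pick_lead_feature(
    X, y, cluster_feats=["signal", "signal_noisy1", "signal_noisy2"]
)
print("Lead feature:", lead)
```

Because R² is a higher-is-better score, the sketch takes the maximum; with an error metric such as deviance, the minimum would be the right choice.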

In [41]:
dd

['1stFlrSF',
 '2ndFlrSF',
 '3SsnPorch',
 'Alley',
 'BedroomAbvGr',
 'BldgType',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtFinType1',
 'BsmtFinType2',
 'BsmtFullBath',
 'BsmtHalfBath',
 'BsmtQual',
 'BsmtUnfSF',
 'CentralAir',
 'Condition1',
 'Condition2',
 'Electrical',
 'EnclosedPorch',
 'ExterCond',
 'ExterQual',
 'Exterior1st',
 'Exterior2nd',
 'Fence',
 'FireplaceQu',
 'Fireplaces',
 'Foundation',
 'FullBath',
 'Functional',
 'GarageArea',
 'GarageCars',
 'GarageCond',
 'GarageFinish',
 'GarageQual',
 'GarageType',
 'GarageYrBlt',
 'GrLivArea',
 'HalfBath',
 'Heating',
 'HeatingQC',
 'HouseStyle',
 'Id',
 'KitchenAbvGr',
 'KitchenQual',
 'LandContour',
 'LandSlope',
 'LotArea',
 'LotAreaNoise1',
 'LotAreaNoise2',
 'LotAreaNoise3',
 'LotConfig',
 'LotFrontage',
 'LotShape',
 'LowQualFinSF',
 'MSSubClass',
 'MSZoning',
 'MasVnrArea',
 'MasVnrType',
 'MiscFeature',
 'MiscVal',
 'MoSold',
 'Neighborhood',
 'OpenPorchSF',
 'OverallCond',
 'OverallQual',
 'PavedDr

In [48]:
def feature_clustering(model_data, project_id, best_model_id, target_col='target',
                       scoring_type='validation', calculate_other_fis=False, dist_matrix=None, use_dr_fam=False,
                       use_ratio=False, use_diff=False,  max_workers=4,
                       bypass_dr_retrains=True, **kwargs):

    """

    :param model_data: The source dataset
    :param project_id: DataRobot project ID of the base project
    :param best_model_id: DataRobot model ID of the best model/the one we will use.
    :param target_col: Target variable.
    :param scoring_type: What kind of scoring to use for the retrained models.
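    :param calculate_other_fis: Whether to calculate feature impacts on the retrained models.
    :param dist_matrix: Optional precomputed distance matrix (pd.DataFrame) to pass to the clusterer.
    :param use_dr_fam: If no distance matrix is given, whether to build one from the DataRobot feature association
    matrix (True) or from pairwise correlations of the numeric features (False).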
    :param use_ratio: When building the new 'clean' features, whether to take the ratio against the cluster's anchor feature
    :param use_diff: When building the new 'clean' features, whether to take the difference against the cluster's
    anchor feature
    :param max_workers: For retraining the project at the end.
    :param bypass_dr_retrains: Accelerate things by just picking the feature in each cluster with the highest
    correlation to the target variable.
    :return:
        dr_project_nocoll: retrained dr.Project object, ex collinear features.
    """

    # init a few objects
    project = dr.Project.get(project_id)
    best_model = dr.Model.get(project=project_id, model_id=best_model_id)
    is_otv = project.is_datetime_partitioned
    
    optimisation_metric = project.metric
    print('Project metric:', optimisation_metric)

    # get list of features used in best model
    feat_list = best_model.get_features_used()

    # have the clusterer make the distance matrix?
    if dist_matrix is None:
        if use_dr_fam:
            # retrieve feature association matrix
            feature_association_matrix = dr.FeatureAssociationMatrix.get(project.id, featurelist_id=best_model.featurelist_id)
            features = [f['feature'] for f in feature_association_matrix.features]

            fam_df = pd.DataFrame(index=features, columns=features)

            for row in feature_association_matrix.strengths:
                fam_df.loc[row['feature1'], row['feature2']] = row['statistic']
                fam_df.loc[row['feature2'], row['feature1']] = row['statistic']

            # we'll pass this to the clusterer
            dist_matrix = 1 - fam_df
            non_clustering_feats = model_data.columns.difference(fam_df.columns)
            new_df = model_data.loc[:, dist_matrix.columns]

        else:

            # for now, we focus on numeric data types only.  let's get just those out of our modelling data
            numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
            new_df = model_data.loc[:, feat_list].select_dtypes(include=numerics)

            # we'll need the non-clustering stuff later, so let's make a note of those
            non_clustering_feats = model_data.columns.difference(new_df.columns)

    else:
        non_clustering_feats = model_data.columns.difference(dist_matrix.columns)
        new_df = dist_matrix.copy()

    # do the feature clustering and present the correlation map
    feats_sct_data = make_feature_clusters_map(new_df, target_col=target_col, distances=dist_matrix, **kwargs)

    # make dataframe of data for new model
    no_cluster_feats = list(non_clustering_feats) + list(feats_sct_data.loc[feats_sct_data.cluster == 0, :].index)

    new_model_data = model_data.loc[:, no_cluster_feats].copy()
    new_model_data[target_col] = model_data[target_col]

    # get baseline feature impact and format as a DF
    print('Getting baseline feature impact.')

    fi = best_model.get_or_request_feature_impact(max_wait=1200)
    base_fi = pd.DataFrame(fi).set_index('featureName')

    # and we'll store a few things in dicts, starting with our baseline
    impacts = {'baseline': base_fi}
    retrained_models = {'baseline': best_model}
    feat_lists = {'baseline': get_or_create_featurelist(project, 'baseline', feat_list)}
    job_ids = {}
    
    
    # and store the unnormalized impacts in a df for later comparison
    unn_imps = pd.DataFrame(base_fi.impactUnnormalized).rename(
            columns={'impactUnnormalized': 'baseline'})

    # iterate over feature clusters
    for clus in range(1, feats_sct_data.cluster.max() + 1):

        featlist_stem = 'clust{:d}_'.format(clus)

        cluster_output = {}

        # build a list of features in the cluster
        coll_feats = list(feats_sct_data.loc[feats_sct_data.cluster == clus, :].index.values)
        print('\nProcessing cluster', clus)
        print('Features:', coll_feats)
        print()

        # doing this the hard way: build a bunch of add-one-back models for each cluster
        if not bypass_dr_retrains:

            # get the stub feature list excluding the collinear features in this cluster
            stub_feats = list(set(feat_list) - set(coll_feats))

            # we'll work through the individual features - first, build the models
            print('Iterating over features, training new models')
            for f in coll_feats:
                # Make a feature list of stub + f
                featl_name = featlist_stem + f
                # check feat list hasn't been created yet --> create
                feat_lists[f] = get_or_create_featurelist(project, featl_name, stub_feats + [f])
                # train the model
                if is_otv:
                    job_ids[f] = best_model.train_datetime(featurelist_id=feat_lists[f].id)
                else:
                    job_ids[f] = best_model.train(featurelist_id=feat_lists[f].id,
                                                  scoring_type=scoring_type)
                retrained_models[f] = None

            # then, collect the models once created (it's faster to split this way, can go asynchronously)
            print('Getting new models')
            for f in coll_feats:
                # get the feature list name
                featl_name = featlist_stem + f
                # and the model
                try:
                    retrained_models[f] = dr.models.modeljob.wait_for_async_model_creation(project_id=project.id,
                                                                                           model_job_id=job_ids[f])
                    print('Retrained model on list', featl_name, 'as Model.id', retrained_models[f].id)
                except dr.errors.ClientError as e:
                    if e.status_code == 404:
                        retrained_models[f] = job_ids[f].get_result()
                
                # optionally get (or request and wait for) feature impact on the retrained model
                if calculate_other_fis:
                    impacts[f] = retrained_models[f].get_or_request_feature_impact(max_wait=1200)
                    impacts[f] = pd.DataFrame(impacts[f]).set_index('featureName')
                    featl_name = featlist_stem + f
                    unn_imps[featl_name] = impacts[f].impactUnnormalized

            # collect the metrics for our retrained models            
            metrics = {}

            for rmk in retrained_models.keys():
                metrics[rmk] = {mname: retrained_models[rmk].metrics[mname][scoring_type]
                                for mname in retrained_models[rmk].metrics.keys()}

            metr_DF = pd.DataFrame(metrics)

            # let's cross-check the most impactful feature -- this time on the full data
            # (we want the feature which gives us the best-performing model stand-alone)
            USE_MIN = True

            if USE_MIN:
                bpfsm = metr_DF.loc[optimisation_metric, coll_feats].idxmin()
            else:
                bpfsm = metr_DF.loc[optimisation_metric, coll_feats].idxmax()

            print('Best feature in cluster {:}:'.format(clus), bpfsm)
            print(metr_DF)

            cluster_output[clus] = metr_DF

            fl_bpfsm = featlist_stem + bpfsm

            if calculate_other_fis:
                unn_imps = unn_imps.sort_values(by=fl_bpfsm, ascending=False)
                print(unn_imps)

        else:
            bpfsm = feats_sct_data.loc[feats_sct_data.cluster == clus, 'importance_scaled'].idxmax()
            print('Best feature in cluster {:}:'.format(clus), bpfsm)

        # now let's difference the other features in the cluster against the best feature
        print('Deriving features.')
        new_model_data['{:} (cluster {:})'.format(bpfsm, clus)] = model_data[bpfsm]
        for f in coll_feats:
            if f != bpfsm:
                if use_diff:
                    new_model_data['d_' + f + '_' + bpfsm] = model_data[f] - model_data[bpfsm]
                if use_ratio:
                    new_model_data['r_' + f + '_' + bpfsm] = model_data[f] / model_data[bpfsm]

    print('\n\nCluster processing complete.')

    # and build a new project with the reshaped data
    print('Creating project with de-collinearised and differenced data...')
    dr_proj_nocoll = dr.Project.create(new_model_data, project_name=project.project_name + '_coll. removed')

    print('Setting target variable...')
    dr_proj_nocoll.set_worker_count(max_workers)
    dr_proj_nocoll.set_target(target=project.target)

    print('Opening leaderboard.')
    dr_proj_nocoll.open_leaderboard_browser()

    return dr_proj_nocoll
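
The feature derivation step inside `feature_clustering` can also be tried on its own. The sketch below reuses the `d_`/`r_` column naming from the function, but the helper name `derive_cluster_features` and the toy values are made up. Replacing zero denominators with NaN is an assumption added here to avoid infinite ratios; the function above divides directly.

```python
import numpy as np
import pandas as pd


def derive_cluster_features(df, lead, others, use_diff=True, use_ratio=True):
    """Keep the lead feature and express the other cluster features relative to it."""
    out = pd.DataFrame(index=df.index)
    out[lead] = df[lead]
    for f in others:
        if use_diff:
            out["d_" + f + "_" + lead] = df[f] - df[lead]
        if use_ratio:
            # Assumption: a zero lead value yields NaN rather than +/-inf.
            out["r_" + f + "_" + lead] = df[f] / df[lead].replace(0, np.nan)
    return out


toy = pd.DataFrame(
    {
        "LotArea": [100.0, 200.0, 0.0],
        "LotAreaNoise1": [110.0, 190.0, 5.0],
    }
)

derived = derive_cluster_features(toy, lead="LotArea", others=["LotAreaNoise1"])
print(derived)
```

The difference column keeps the units of the original features, while the ratio is scale-free but undefined wherever the lead feature is zero.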

Here's a full run of this, using comprehensive clustering (`allow_unclustered_feats=False`), so that every feature is assigned to a cluster.

In [21]:
GET_PROJECT_ID, GET_MODEL_ID

('66584e795a2f8c0ab7b08579', None)

In [23]:
proj_nocoll = feature_clustering(model_data, project.id, best_model.id, target_col=TARGET_FEAT,
                       scoring_type='validation', calculate_other_fis=False,
                       use_ratio=True, use_diff=True, 
                       bypass_dr_retrains=False, allow_unclustered_feats=False, max_dist=0.4)

Project metric: Gamma Deviance
Correlation matrix shape: (41, 41)
distance matrix shape: (41, 41)
Clustering features.
Converged after 38 iterations.
Calculating MDS embedding for scatter plot.
breaking at iteration 93 with stress 66.36414489078493
breaking at iteration 154 with stress 68.262436849818
breaking at iteration 63 with stress 68.82817977195428
breaking at iteration 118 with stress 70.05089592939066
5 clusters found.
Building original features correlation map.




Getting baseline feature impact.

Processing cluster 1
Features: ['1stFlrSF', 'GarageArea', 'GarageCars', 'GarageYrBlt', 'MasVnrArea', 'OverallQual', 'TotalBsmtSF', 'YearBuilt', 'YearRemodAdd']

Iterating over features, training new models
Getting new models
Retrained model on list clust1_1stFlrSF as Model.id 66586bdbab1641896ab0857c
Retrained model on list clust1_GarageArea as Model.id 66586bdc0355ef3053b0862c
Retrained model on list clust1_GarageCars as Model.id 66586bdd0355ef3053b08637
Retrained model on list clust1_GarageYrBlt as Model.id 66586bdee9448eadaa695544
Retrained model on list clust1_MasVnrArea as Model.id 66586be0ab1641896ab08590
Retrained model on list clust1_OverallQual as Model.id 66586be1e9448eadaa695559
Retrained model on list clust1_TotalBsmtSF as Model.id 66586be2e9448eadaa695564
Retrained model on list clust1_YearBuilt as Model.id 66586be318a87ae99fb086e8
Retrained model on list clust1_YearRemodAdd as Model.id 66586be518a87ae99fb086f3
Best feature in cluster 1: G

In [51]:
proj_nocoll_2 = feature_clustering(model_data, project.id, best_model.id, target_col=TARGET_FEAT,
                       scoring_type='validation', calculate_other_fis=False, use_dr_fam=True,
                       use_ratio=False, use_diff=False, 
                       bypass_dr_retrains=True, allow_unclustered_feats=False, max_dist=0.5)

Project metric: Gamma Deviance
distance matrix shape: (50, 50)
Clustering features.
Converged after 57 iterations.
Calculating MDS embedding for scatter plot.
breaking at iteration 136 with stress 144.93188663225834
breaking at iteration 127 with stress 143.342659258474
breaking at iteration 139 with stress 144.9994916134004
breaking at iteration 98 with stress 146.0629691386569
13 clusters found.
Building original features correlation map.




Getting baseline feature impact.

Processing cluster 1
Features: ['Exterior1st', 'Exterior2nd']

Best feature in cluster 1: Exterior2nd
Deriving features.

Processing cluster 2
Features: ['FireplaceQu', 'Fireplaces']

Best feature in cluster 2: FireplaceQu
Deriving features.

Processing cluster 3
Features: ['LotArea', 'LotAreaNoise1', 'LotAreaNoise2', 'LotAreaNoise3']

Best feature in cluster 3: LotArea
Deriving features.

Processing cluster 4
Features: ['SaleCondition', 'SaleType']

Best feature in cluster 4: SaleType
Deriving features.

Processing cluster 5
Features: ['MasVnrArea', 'MasVnrType']

Best feature in cluster 5: MasVnrArea
Deriving features.

Processing cluster 6
Features: ['GarageYrBlt', 'Neighborhood', 'YearBuilt', 'YearRemodAdd', 'Foundation']

Best feature in cluster 6: Neighborhood
Deriving features.

Processing cluster 7
Features: ['1stFlrSF', 'GrLivArea', 'TotalBsmtSF', 'TotRmsAbvGrd']

Best feature in cluster 7: GrLivArea
Deriving features.

Processing cluster 8
Fe

In [None]:
# TODO: iteratively add back differenced features to each cluster and see whether an individual feature adds value