<img src="https://relevance.ai/wp-content/uploads/2021/11/logo.79f303e-1.svg" width="150" alt="Relevance AI" />
<h5> Developer-first vector platform for ML teams </h5>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RelevanceAI/workflows/blob/main/workflows/impact-analysis/impact-analysis_form.ipynb)

# 😄 Feature Analysis Workflow

Feature analysis lets you explore how individual vector dimensions, tags, or labels affect a chosen metric.
Specify the vector field you want to analyse and the key metric to map it to.
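
Before running the workflow, it can help to confirm that the chosen fields have the shape the classifier expects. The sketch below is illustrative only: `extract_field`, `check_inputs`, the sample documents, and the field names `example_vector_` and `converted` are hypothetical stand-ins, not part of the Relevance AI SDK. It assumes every document stores the vector as a fixed-length list and that the key metric is binary, since the notebook trains a `CatBoostClassifier` with `loss_function='Logloss'`.

```python
def extract_field(docs, field):
    """Collect one field's value from every document (hypothetical helper)."""
    return [doc[field] for doc in docs]


def check_inputs(docs, vector_field, key_metric):
    """Return (number of vector dimensions, sorted class labels) or raise."""
    vectors = extract_field(docs, vector_field)
    labels = extract_field(docs, key_metric)

    # Every vector must have the same length to form a feature matrix.
    lengths = {len(v) for v in vectors}
    if len(lengths) != 1:
        raise ValueError(f"Inconsistent vector lengths: {sorted(lengths)}")

    # A Logloss objective is a binary one, so expect exactly two classes.
    classes = set(labels)
    if len(classes) != 2:
        raise ValueError(f"Expected 2 classes, found {len(classes)}")

    return lengths.pop(), sorted(classes)


# Made-up documents, for illustration only.
sample_docs = [
    {"_id": "a", "example_vector_": [0.1, 0.9, 0.3], "converted": 1},
    {"_id": "b", "example_vector_": [0.4, 0.2, 0.8], "converted": 0},
    {"_id": "c", "example_vector_": [0.7, 0.5, 0.1], "converted": 1},
]

n_dims, classes = check_inputs(sample_docs, "example_vector_", "converted")
print(n_dims, classes)
```

If the key metric is continuous rather than binary, one option is to threshold it into two classes before training.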


token = "<copy paste from https://cloud.relevance.ai/sdk/api>"#@param  {type:"string"}
dataset_id = "<your dataset ID here>" #@param {type:"string"}
vector_field = "<your vector field here>" #@param  {type:"string"} 
key_metric = "<your key metric here>" #@param  {type:"string"}

# ID of the document to explain with SHAP; leave as None to pick one automatically
document_id = None #@param  {type:"string"}

!pip install -q -U RelevanceAI==2.3.2
!pip install -q catboost
!pip install -q shap

# Instantiating the client.
from relevanceai import Client 
client = Client(token=token)


ds = client.Dataset(dataset_id)
docs = ds.get_all_documents(select_fields=[vector_field, key_metric])

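import numpy as np
from catboost import CatBoostClassifier, Pool
# Feature matrix: one row per document, one column per vector dimension
X = ds.get_field_across_documents(vector_field, docs)
train_data = np.array(X)
# Target: the key metric, which must be binary for the Logloss objective below
label = ds.get_field_across_documents(key_metric, docs)

# Assumption: vector dimensions have no names of their own, so name them by position
column_values = [f"{vector_field}_{i}" for i in range(train_data.shape[1])]

# Pool bundles features and labels; it is reused for feature importance below
catboost_pool = Pool(train_data, label)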
# Modify this if you want to play around with more parameters
model = CatBoostClassifier(
    iterations=2,
    depth=2,
    learning_rate=1,
    loss_function='Logloss',
    verbose=True
)

model.fit(train_data, label)


import pandas as pd
pd.DataFrame({'feature_importance': model.get_feature_importance(catboost_pool), 
              'feature_names': column_values}).sort_values(by=['feature_importance'], ascending=False)

pd.DataFrame({'feature_importance': model.get_feature_importance(catboost_pool), 
              'feature_names': column_values}).sort_values(by=['feature_importance'], 
                                ascending=False).set_index("feature_names").plot(kind='bar')

import shap
shap_df = pd.DataFrame(train_data)
shap_df.columns = column_values
shap_df.index = [d['_id'] for d in docs]
if document_id is None:
  # No ID given: fall back to the 21st document, or the last one in smaller datasets
  document_id = docs[min(20, len(docs) - 1)]['_id']

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(shap_df)
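for i, d in enumerate(docs):
    # Store each SHAP row as a plain list so the document stays JSON-serializable
    d['shap_vector_'] = np.asarray(shap_values[i]).tolist()
ds.upsert_documents(docs)
doc = ds.get(document_id)
# Cast to a plain float so the value can be stored in the dataset metadata
expected_value = float(explainer.expected_value)
ds.insert_metadata({"shap_expected_value": expected_value})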
print("Now we are ready!")

# Load SHAP's JavaScript first so the force plot can render in the notebook
shap.initjs()
metadata = ds.metadata
expected_value = metadata['shap_expected_value']
shap.force_plot(expected_value, np.array(doc['document']['shap_vector_']), feature_names=column_values)
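
To read the force plot, it may help to recall SHAP's additivity property: the expected value plus the sum of a document's SHAP values gives the model's raw output for that document. The sketch below works through that arithmetic on made-up numbers. `explain_prediction` and all `example_*` values are hypothetical, and it assumes the explainer reports values in raw log-odds space, which is the usual default for a tree classifier, so a sigmoid turns the raw output into a probability.

```python
import math


def explain_prediction(expected_value, shap_vector, feature_names, top_k=3):
    """Rebuild the raw model output from SHAP values and rank the contributions."""
    # Additivity: base value + sum of contributions = raw (log-odds) output.
    raw_output = expected_value + sum(shap_vector)
    # Assumption: raw output is in log-odds, so a sigmoid gives a probability.
    probability = 1.0 / (1.0 + math.exp(-raw_output))
    # Largest absolute contribution first, whatever its sign.
    ranked = sorted(
        zip(feature_names, shap_vector),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    return raw_output, probability, ranked[:top_k]


# Made-up values, for illustration only.
example_expected_value = -0.2
example_shap_vector = [0.5, -0.1, 0.0, 0.3]
example_names = [f"dim_{i}" for i in range(len(example_shap_vector))]

raw, prob, top = explain_prediction(
    example_expected_value, example_shap_vector, example_names
)
print(f"raw output: {raw:.3f}, probability: {prob:.3f}")
for name, contribution in top:
    print(f"{name}: {contribution:+.2f}")
```

Positive contributions push the prediction towards the positive class and negative ones push it away, which is what the two colours in the force plot represent.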

# 🌇 Next Steps

This is just a quick tutorial on Relevance AI. Many more applications are possible, such as zero-shot labelling, recommendations, anomaly detection, the projector and more:

- Explore our platform and check out new workflows at https://cloud.relevance.ai
- There are more in-depth tutorials and guides at https://docs.relevance.ai
- There are detailed library references at https://relevanceai.readthedocs.io/
- Join our Slack community at https://join.slack.com/t/relevance-ai/shared_invite/zt-11fo8oush-dHPd57wamhoQ7J5arNv1mg