# Breast Cancer Example

This notebook shows some basic use cases for Aladdin as a feature store.

It is based on a classic dataset for detecting breast cancer.
We will walk through a potential thought process.

## Scenario
We have been given a dataset of breast cancer scans, and want to see if we can detect malignant cancer cells.

### Load the feature store
The first thing we need is to load the feature store. This can be done by referencing a local file or a remote file; here we use a remote file from the [aladdin-example repo](https://raw.githubusercontent.com/otovo/aladdin-example/main/feature-store.json). The feature store can also be loaded from a private S3 bucket or, in a more developer-friendly way, generated automatically by reading the feature views from the current directory with `FeatureStore.from_dir()`.

Let's see how it can be done.

In [4]:
from aladdin import FeatureStore, FileSource


In [None]:
# The online store, which uses the online source
online_store = await FileSource("https://raw.githubusercontent.com/otovo/aladdin-example/main/feature-store.json", {}).feature_store()

# The offline store, which uses the batch sources
offline_store = online_store.offline_store()


In [None]:
# The available feature views / grouped features
list(offline_store.feature_views.keys())


## Get the dataset

A good place to start is to get the raw data in some format.
The notebook cell above lists the available feature views, in other words, the groups of features.

We can see that there are two views, `breast_scans_raw` and `breast_scans_transformed`.
Each feature view contains features, data sources, etc.

So let's take a closer look at `breast_scans_raw` first.

In [1]:
all_data = await offline_store.feature_view("breast_scans_raw").all().to_df()
all_data


We can see that the data source contains 569 entries of breast cancer scans, and a wide range of metrics.

If you want to see all the features in detail, you can go to the `breast_scan_raw.py` file.

Now here is where we would do some EDA, and hopefully craft some fancy features, while also selecting the features of interest.

So after our "hypothetical" analysis, we figured out that the standard-scaled values make good features, and we selected a subset of the features.
We then stored them in a new feature view, `breast_scans_transformed`.

Let's have a quick look.

In [5]:
new_features = await offline_store.feature_view("breast_scans_transformed").all().to_df()
new_features

Unnamed: 0,scan_id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,...,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst,area_mean_scaled,perimeter_mean_scaled,radius_mean_scaled,compactness_mean_scaled,smoothness_mean_scaled,is_malignant
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,0.7119,0.2654,0.4601,0.11890,0.983510,1.268817,1.096100,3.280628,1.567087,True
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,0.2416,0.1860,0.2750,0.08902,1.907030,1.684473,1.828212,-0.486643,-0.826235,True
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,0.4504,0.2430,0.3613,0.08758,1.557513,1.565126,1.578499,1.052000,0.941382,True
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,0.6869,0.2575,0.6638,0.17300,-0.763792,-0.592166,-0.768233,3.399917,3.280667,True
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,0.4000,0.1625,0.2364,0.07678,1.824624,1.775011,1.748758,0.538866,0.280125,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,0.4107,0.2216,0.2060,0.07115,2.341795,2.058974,2.109139,0.218868,1.040926,True
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,0.3215,0.1628,0.2572,0.06637,1.722326,1.614511,1.703356,-0.017817,0.102368,True
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,0.3403,0.1418,0.2218,0.07820,0.577445,0.672084,0.701667,-0.038646,-0.839745,True
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,0.9387,0.2650,0.4087,0.12400,1.733693,1.980781,1.836725,3.269267,1.524426,True


Now we can see that many of the columns are the same. However, a lot of them are also new!

We have a new `is_malignant` column, a boolean indicating whether the cancer cells are malignant, and a number of standard-scaled features such as `radius_mean_scaled`.

Both the `breast_scans_raw` and `breast_scans_transformed` feature views use the same source. The interesting thing is that `breast_scans_transformed` transforms the raw values while knowing which features depend on which.
Therefore, all transformations can be optimized for extra-fast computation.

A simple example is `is_malignant`, which depends on `diagnosis`.
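
To make the idea of dependency-aware transformations concrete, here is a minimal sketch in plain Python. It is not how `aladdin` is implemented: the `DEPENDENCIES` and `TRANSFORMATIONS` tables, the `is_benign` feature, and the `compute_row` helper are hypothetical, and the rule that `"M"` means malignant is inferred from the table above.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each derived feature lists the features it reads.
DEPENDENCIES = {
    "is_malignant": {"diagnosis"},
    "is_benign": {"is_malignant"},  # hypothetical second-level feature
}

# Hypothetical transformations, keyed by the feature they produce.
# "M" -> True is inferred from the table above, not taken from aladdin's code.
TRANSFORMATIONS = {
    "is_malignant": lambda row: row["diagnosis"] == "M",
    "is_benign": lambda row: not row["is_malignant"],
}


def compute_row(raw_row):
    """Compute all derived features for one row, dependencies first."""
    row = dict(raw_row)
    # static_order() yields every feature after the features it depends on.
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        if name in TRANSFORMATIONS:
            row[name] = TRANSFORMATIONS[name](row)
    return row


result = compute_row({"scan_id": 842302, "diagnosis": "M"})
print(result)
```

Because the order is derived from the dependency map, `is_benign` is never computed before `is_malignant`, no matter how the tables are written down.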

## Data set generation

With our new transformed features, the next natural step would be to train a model. However, in order to train a model, we need to select a train, a test, and preferably a validation set.

This is often done using `sklearn`. However, `aladdin` also provides a similar and flexible way of abstracting our split strategy.
The default here is a `StrategicSplitStrategy`, which makes sure that the distribution of the different classes is the same across train, test, and validation.

Selecting a train, test, validation set is possible on any data job that is run from our feature store.

Let's see how it is done.

In [6]:
from aladdin.split_strategy import StrategicSplitStrategy

train_set = await offline_store.feature_view("breast_scans_transformed")\
    .all(limit=300)\
    .split_with(
        # Train + test = 0.9 => validation = 0.1
        strategy=StrategicSplitStrategy(
            train_size_percentage=0.8,
            test_size_percentage=0.1
        ),
        target_column="is_malignant"
    )\
    .use_pandas()



The `train_set` now contains both inputs and targets for train, test, and validation.

In [7]:
# Train Set
train_set.train_output.value_counts(normalize=True), train_set.train_output.value_counts()

(False    0.514644
 True     0.485356
 Name: is_malignant, dtype: float64,
 False    123
 True     116
 Name: is_malignant, dtype: int64)

In [8]:
# Test Set
train_set.test_output.value_counts(normalize=True), train_set.test_output.value_counts()

(True     0.5
 False    0.5
 Name: is_malignant, dtype: float64,
 True     15
 False    15
 Name: is_malignant, dtype: int64)

In [9]:
# Validation Set
train_set.develop_output.value_counts(normalize=True), train_set.develop_output.value_counts()

(False    0.516129
 True     0.483871
 Name: is_malignant, dtype: float64,
 False    16
 True     15
 Name: is_malignant, dtype: int64)

## Real time data

One of the hard problems `aladdin` tries to solve is keeping all features consistent across the offline and online feature stores.

Let's simulate how it could look.

In [10]:
[req.all_required_feature_names for req in online_store.feature_view("breast_scans_transformed").view.request_all.needed_requests]

[{'area_mean',
  'compactness_mean',
  'diagnosis',
  'perimeter_mean',
  'radius_mean',
  'smoothness_mean'}]
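
A set like the one above can be used to check a payload before it is written. The following is a plain-Python sketch, not part of `aladdin`: the `check_payload` helper, the `id_column` parameter, and the example payload are hypothetical.

```python
# The required feature names listed in the output above.
REQUIRED_FEATURES = {
    "area_mean", "compactness_mean", "diagnosis",
    "perimeter_mean", "radius_mean", "smoothness_mean",
}


def check_payload(payload, required=REQUIRED_FEATURES, id_column="scan_id"):
    """Return the required feature names missing from a column-oriented payload."""
    if id_column not in payload:
        raise ValueError(f"missing id column {id_column!r}")
    lengths = {len(values) for values in payload.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of rows")
    return required - set(payload)


# Hypothetical payload with one required feature left out.
payload = {
    "scan_id": [1, 2],
    "area_mean": [1005, 1002],
    "compactness_mean": [0.23, 0.10],
    "perimeter_mean": [78, 90],
    "radius_mean": [20, 18],
    "smoothness_mean": [0.10, 0.2],
}
missing = check_payload(payload)
print(missing)
```

Any feature reported as missing cannot be used to compute the features that depend on it, for example `is_malignant` when `diagnosis` is absent.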

In [11]:
import os
# Set the online redis url
os.environ['REDIS_URL'] = "redis://localhost:6379"

In [12]:
await online_store.feature_view("breast_scans_transformed")\
    .write({
        'scan_id': [1, 2],
        'area_mean': [1005, 1002],
        'compactness_mean': [0.23, 0.10],
        'perimeter_mean': [78, 90],
        'radius_mean': [20, 18],
        'smoothness_mean': [0.10, 0.2]
    })

### Fetch the new values

The above code simulated a write to the online database. This can originate from any stream process you want.

It could be a Kafka stream, or maybe more simply an HTTP POST request.
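
Stream events usually arrive one record at a time, while the write above takes one list per column. A small sketch of bridging the two in plain Python is shown here. The `rows_to_columns` helper and the example events are hypothetical and not part of `aladdin`.

```python
def rows_to_columns(events):
    """Turn a list of per-scan records into a dict of equally long columns."""
    if not events:
        return {}
    columns = {name: [] for name in events[0]}
    for event in events:
        # Reject records that would leave the columns with different lengths.
        if event.keys() != columns.keys():
            raise ValueError("every event must carry the same fields")
        for name, value in event.items():
            columns[name].append(value)
    return columns


# Hypothetical events, e.g. decoded from a stream or from POST bodies.
events = [
    {"scan_id": 1, "radius_mean": 20, "perimeter_mean": 78},
    {"scan_id": 2, "radius_mean": 18, "perimeter_mean": 90},
]
batch = rows_to_columns(events)
print(batch)
```

The resulting dictionary has the same column-oriented shape as the payload passed to `.write(...)` above.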

However, there is no value in storing the values if we can't fetch them. 

So let's get our new values for `scan_id` 1 and 2.

In [13]:
data_subset = await online_store.features_for({
    "scan_id": [1, 2]
}, features=[
    "breast_scans_transformed:radius_mean",
    "breast_scans_transformed:radius_mean_scaled",
    "breast_scans_transformed:perimeter_mean",
    "breast_scans_transformed:perimeter_mean_scaled",
    "breast_scans_transformed:area_mean",
    "breast_scans_transformed:area_mean_scaled",
    "breast_scans_transformed:compactness_mean",
    "breast_scans_transformed:compactness_mean_scaled",
    "breast_scans_transformed:smoothness_mean",
    "breast_scans_transformed:smoothness_mean_scaled",
]).to_df()
data_subset

Unnamed: 0,area_mean_scaled,perimeter_mean_scaled,area_mean,radius_mean_scaled,smoothness_mean_scaled,compactness_mean_scaled,radius_mean,compactness_mean,perimeter_mean,scan_id,smoothness_mean
0,0.994876,-0.574881,1005.0,1.666466,0.258794,2.379331,20.0,0.23,78.0,1,0.1
1,0.986351,-0.081034,1002.0,1.098937,7.369082,-0.082196,18.0,0.1,90.0,2,0.2


Look at that! We now have our processed values, which are scaled using a `StandardScaler`, i.e. expressed as the number of standard deviations from the mean.
We also have the original values as reference.
All this without implementing a single line of transformation logic ourselves.
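
As a sanity check on the offline/online consistency, the scaling can be reproduced by hand. The sketch below assumes the usual standard-scaling formula `scaled = (raw - mean) / std`. The notebook does not show the scaler's parameters, so they are recovered from two offline rows of the `breast_scans_transformed` table and then applied to the online value `radius_mean = 20`. The helper names are hypothetical.

```python
# Two offline rows from the transformed table: (radius_mean, radius_mean_scaled).
offline_rows = [(17.99, 1.096100), (20.57, 1.828212)]


def recover_scaler(rows):
    """Solve scaled = (raw - mean) / std for mean and std from two rows."""
    (raw_a, scaled_a), (raw_b, scaled_b) = rows
    std = (raw_b - raw_a) / (scaled_b - scaled_a)
    mean = raw_a - scaled_a * std
    return mean, std


def scale(raw, mean, std):
    """Express a raw value as a number of standard deviations from the mean."""
    return (raw - mean) / std


mean, std = recover_scaler(offline_rows)
online_scaled = scale(20.0, mean, std)

# Compare with the value the online store returned for scan_id 1.
matches_online = abs(online_scaled - 1.666466) < 1e-4
print(matches_online)
```

If the check prints `True`, the online store applied the same mean and standard deviation as the offline data, which is the consistency this section is about.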

However, we still have a small problem: it is kind of painful to write out all the features we want to include each time. So how could we make this even simpler?

That's where Model Services come in.

## Model Service


In [14]:
online_store.all_models

['breast_cancer_model_v1']

In [15]:
model_data = await online_store.model("breast_cancer_model_v1").features_for({
    'scan_id': [1, 2]
}).to_df()
model_data

Unnamed: 0,area_mean_scaled,smoothness_mean_scaled,compactness_mean_scaled,scan_id,perimeter_mean_scaled,radius_mean_scaled
0,0.994876,0.258794,2.379331,1,-0.574881,1.666466
1,0.986351,7.369082,-0.082196,2,-0.081034,1.098937
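
Conceptually, a model service can be thought of as a named list of feature references. The sketch below illustrates that idea in plain Python. It is not how `aladdin` stores models: the `MODEL_FEATURES` registry and the `feature_references` helper are hypothetical, and the feature names are the columns returned for `breast_cancer_model_v1` above.

```python
# Hypothetical registry: model name -> (feature view, feature names).
MODEL_FEATURES = {
    "breast_cancer_model_v1": (
        "breast_scans_transformed",
        [
            "radius_mean_scaled",
            "perimeter_mean_scaled",
            "area_mean_scaled",
            "compactness_mean_scaled",
            "smoothness_mean_scaled",
        ],
    ),
}


def feature_references(model_name):
    """Expand a model name into the 'view:feature' strings it needs."""
    view, names = MODEL_FEATURES[model_name]
    return [f"{view}:{name}" for name in names]


references = feature_references("breast_cancer_model_v1")
print(references)
```

With the list kept in one place, callers only pass the model name and the entity ids, instead of repeating every feature reference in each request.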
