# Breast Cancer Example

This notebook will show some basic use cases for how Aladdin can be used as a feature store.

This is based on a classic dataset for detecting breast cancer.
We will try to walk through a potential thought process.

## Scenario
We have been given a dataset of breast cancer scans, and want to see if we can detect malignant cancer cells.

### Load the feature store
First, we need to load the feature store. This can be done by referencing a local file, which is what we will do here. It can also reference a private S3 bucket or, as a more developer-friendly option, be generated automatically by reading the feature views from the current directory with `FeatureStore.from_dir()`.

Let's see how it can be done.

In [1]:
from aligned import FeatureStore, FileSource
from aligned.validation.pandera import PanderaValidator

In [2]:
# The online store, which uses the online source
online_store = await FileSource.from_path("feature-store.json").feature_store()

# The offline store, which uses the batch sources
feature_store: FeatureStore = online_store.offline_store()

# Explore a bit

Before we begin, let's see which feature views we have available.

In [3]:
# The available feature views / grouped features
list(feature_store.feature_views.keys())

['breast_scans_radius',
 'breast_scans_diagnosis',
 'breast_scans_smoothness',
 'breast_scans_area',
 'breast_scans_compactness',
 'breast_scans_raw',
 'titanic']

# Load some data

Let's load all the data from the feature view `titanic`, i.e. the Titanic dataset, slightly modified in a few places.

Furthermore, this will also fetch all the transformed features that we want!
E.g. we have an `ordinal_sex` feature which encodes `male` as `0` and `female` as `1`. The same values are also available as one-hot encodings, and the `age` variable is standard scaled in `scaled_age`, etc.

In [4]:
dataset = await feature_store.feature_view("titanic")\
    .all()\
    .to_df()

In [5]:
dataset.head(10)

Unnamed: 0,passenger_id,survived,Pclass,name,sex,age,sibsp,Parch,Ticket,Fare,...,is_female,is_male,floored_age,ordinal_sex,ratio,logical_and,logical_or,floor_ratio,abs_scaled_age,inverted_is_mr
0,1,False,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,...,False,True,22.0,0.0,-0.016261,False,True,-0.016261,0.357734,False
1,2,True,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,...,True,False,38.0,1.0,0.018144,True,True,0.018144,0.689464,False
2,3,True,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,...,True,False,26.0,1.0,-0.00369,False,True,-0.00369,0.095934,True
3,4,True,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,...,True,False,35.0,1.0,0.014089,True,True,0.014089,0.493114,False
4,5,False,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,...,False,True,35.0,0.0,0.014089,False,True,0.014089,0.493114,False
5,6,False,3,"Moran, Mr. James",male,,0,0,330877,8.4583,...,False,True,,0.0,,False,True,,,False
6,7,False,1,"McCarthy, Mr. Timothy J",other,54.0,0,0,17463,51.8625,...,False,False,54.0,,0.03216,False,True,0.03216,1.736661,False
7,8,False,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,...,False,True,2.0,0.0,-0.833365,False,False,-0.833365,1.66673,True
8,9,True,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,...,True,False,27.0,1.0,-0.001129,True,True,-0.001129,0.030485,False
9,10,True,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,...,True,False,14.0,1.0,-0.062952,True,True,-0.062952,0.881332,False
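The `ordinal_sex`, `is_male` and `is_female` columns in the table above can be reproduced by hand. The following is a minimal sketch in plain Python of what these encodings compute, assuming the mapping described earlier (`male` as `0`, `female` as `1`). The helper names `ordinal_sex` and `one_hot_sex` are hypothetical, and this is not how `aligned` implements the transformations internally.

```python
from typing import Optional

# Assumed mapping, taken from the text: male -> 0, female -> 1.
ORDINAL_SEX = {"male": 0, "female": 1}


def ordinal_sex(sex: Optional[str]) -> Optional[int]:
    """Return the ordinal code, or None for values outside the mapping."""
    return ORDINAL_SEX.get(sex)


def one_hot_sex(sex: Optional[str]) -> dict:
    """Return the two one-hot flags shown in the table."""
    return {"is_male": sex == "male", "is_female": sex == "female"}


for value in ["male", "female", "other"]:
    print(value, ordinal_sex(value), one_hot_sex(value))
```

A value such as `other` falls outside the mapping, so it gets no ordinal code and both one-hot flags are `False`, which matches row 6 in the table.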


# Validation

The eagle-eyed will notice that one row has `other` in the `sex` column, leading to an NA value in `ordinal_sex`. This may reduce the performance of our model, so we can declare that feature as required. We can do the same thing for the `age` feature, making that one required as well.

In [6]:
validated_dataset = await feature_store.feature_view("titanic")\
    .all()\
    .validate(PanderaValidator())\
    .to_df()

In [7]:
validated_dataset.head(10)

Unnamed: 0,level_0,index,passenger_id,survived,Pclass,name,sex,age,sibsp,Parch,...,is_female,is_male,floored_age,ordinal_sex,ratio,logical_and,logical_or,floor_ratio,abs_scaled_age,inverted_is_mr
0,0,0,1,False,3,"Braund, Mr. Owen Harris",male,22.0,1,0,...,False,True,22,0,-0.016261,False,True,-0.016261,0.357734,False
1,1,1,2,True,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,...,True,False,38,1,0.018144,True,True,0.018144,0.689464,False
2,2,2,3,True,3,"Heikkinen, Miss. Laina",female,26.0,0,0,...,True,False,26,1,-0.00369,False,True,-0.00369,0.095934,True
3,3,3,4,True,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,...,True,False,35,1,0.014089,True,True,0.014089,0.493114,False
4,4,4,5,False,3,"Allen, Mr. William Henry",male,35.0,0,0,...,False,True,35,0,0.014089,False,True,0.014089,0.493114,False
5,6,7,8,False,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,...,False,True,2,0,-0.833365,False,False,-0.833365,1.66673,True
6,7,8,9,True,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,...,True,False,27,1,-0.001129,True,True,-0.001129,0.030485,False
7,8,9,10,True,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,...,True,False,14,1,-0.062952,True,True,-0.062952,0.881332,False
8,9,10,11,True,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,...,True,False,4,1,-0.383958,False,True,-0.383958,1.535831,True
9,10,11,12,True,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,...,True,False,58,1,0.034456,False,True,0.034456,1.99846,True


Notice that we now have no missing values! 🚀
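Judging from the output above, validating required features amounts to dropping the rows where such a feature is missing (passengers 6 and 7 are gone). A minimal sketch of that idea in plain Python follows. The function names are hypothetical, and this is not the Pandera-based validation that `aligned` actually runs.

```python
import math


def is_missing(value) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_rows_missing_required(rows, required):
    """Keep only the rows where every required feature has a value."""
    return [
        row for row in rows
        if not any(is_missing(row.get(name)) for name in required)
    ]


# A few rows mirroring passengers 5 to 8 from the first table.
rows = [
    {"passenger_id": 5, "age": 35.0, "ordinal_sex": 0},
    {"passenger_id": 6, "age": None, "ordinal_sex": 0},     # missing age
    {"passenger_id": 7, "age": 54.0, "ordinal_sex": None},  # sex was "other"
    {"passenger_id": 8, "age": 2.0, "ordinal_sex": 0},
]
valid_rows = drop_rows_missing_required(rows, required=["age", "ordinal_sex"])
print([row["passenger_id"] for row in valid_rows])  # [5, 8]
```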

# Data set generation

We have now loaded some data. However, a very common workflow is to create train, test and validation sets.

This can also be done for supervised datasets as follows:

In [8]:
dataset = await feature_store.feature_view("titanic")\
    .all()\
    .validate(PanderaValidator())\
    .test_size(0.2, target_column="survived")\
    .validation_size(0.2)\
    .use_df()

Notice that we do not get six different variables back, as in `X_train, X_test, X_validate, y_train, y_test, y_validate = await ...`, but rather one variable that keeps them organized.
Not only that, but we also have a reference to the combined dataset if we need it for some reason.

By default it will use a stratified split, so each dataset has a similar distribution between the classes.
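To illustrate what a stratified split does, here is a small sketch in plain Python: each class is shuffled and split separately, so every subset keeps roughly the same class ratio. The function `stratified_split` and its parameters are assumptions made for this illustration and do not describe how `aligned` performs the split.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, validation_size, seed=0):
    """Return (train, test, validation) index lists with similar class ratios."""
    rng = random.Random(seed)

    # Group the row indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test, validation = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        n_val = round(len(indices) * validation_size)
        test.extend(indices[:n_test])
        validation.extend(indices[n_test:n_test + n_val])
        train.extend(indices[n_test + n_val:])
    return train, test, validation


# Toy labels: 40 % positive, 60 % negative.
labels = [True] * 40 + [False] * 60
train, test, validation = stratified_split(labels, test_size=0.2, validation_size=0.2)

for name, subset in [("train", train), ("test", test), ("validation", validation)]:
    positives = sum(labels[i] for i in subset)
    print(name, len(subset), positives / len(subset))
```

Every subset ends up with the same share of positive labels as the full toy dataset, which is the property the stratified split is meant to preserve.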

In [9]:
dataset.train_input.describe()

Unnamed: 0,sibsp,rounded_age,constant_filled_age,ceiled_age,age,mean_filled_age,adding,subtracting,ordinal_sex,ratio,scaled_age,floor_ratio,abs_scaled_age,floored_age
count,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0
mean,0.530374,29.189252,29.193925,29.21028,29.193925,29.193925,29.724299,-28.663551,0.385514,-0.06815,0.113108,-0.06815,0.72618,29.172897
std,0.963752,14.382991,14.385073,14.391732,14.385073,14.385073,14.076627,14.750149,0.487286,0.287047,0.941501,0.287047,0.60883,14.388325
min,0.0,1.0,0.75,1.0,0.75,0.75,0.83,-71.0,0.0,-2.33139,-1.748543,-2.33139,0.030485,0.0
25%,0.0,20.75,20.875,21.0,20.875,20.875,21.0,-37.0,0.0,-0.020674,-0.431365,-0.020674,0.231315,20.75
50%,0.0,28.0,28.0,28.0,28.0,28.0,28.0,-28.0,0.0,0.001249,0.034965,0.001249,0.554083,28.0
75%,1.0,37.0,37.0,37.0,37.0,37.0,38.0,-20.0,1.0,0.016865,0.624014,0.016865,1.082163,37.0
max,5.0,71.0,71.0,71.0,71.0,71.0,71.0,4.0,1.0,0.040131,2.849308,0.040131,2.849308,71.0


In [10]:
dataset.test_input.describe()

Unnamed: 0,sibsp,rounded_age,constant_filled_age,ceiled_age,age,mean_filled_age,adding,subtracting,ordinal_sex,ratio,scaled_age,floor_ratio,abs_scaled_age,floored_age
count,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0,143.0
mean,0.41958,28.237762,28.251189,28.272727,28.251189,28.251189,28.670769,-27.831608,0.370629,-0.111199,0.051406,-0.111199,0.710618,28.223776
std,0.834237,14.215622,14.219054,14.207309,14.219054,14.219054,14.043557,14.440685,0.484671,0.494982,0.930635,0.494982,0.600174,14.243071
min,0.0,0.0,0.42,1.0,0.42,0.42,0.42,-74.0,0.0,-4.214621,-1.770141,-4.214621,0.030485,0.0
25%,0.0,19.0,19.0,19.0,19.0,19.0,19.5,-35.0,0.0,-0.029162,-0.554083,-0.029162,0.231315,19.0
50%,0.0,28.0,28.0,28.0,28.0,28.0,28.0,-27.0,0.0,0.001249,0.034965,0.001249,0.554083,28.0
75%,1.0,35.5,35.5,35.5,35.5,35.5,36.0,-19.0,1.0,0.014802,0.525839,0.014802,1.079922,35.5
max,4.0,74.0,74.0,74.0,74.0,74.0,74.0,2.0,1.0,0.041158,3.045658,0.041158,3.045658,74.0


In [11]:
dataset.validate_input.describe()

Unnamed: 0,sibsp,rounded_age,constant_filled_age,ceiled_age,age,mean_filled_age,adding,subtracting,ordinal_sex,ratio,scaled_age,floor_ratio,abs_scaled_age,floored_age
count,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0,142.0
mean,0.556338,32.507042,32.508803,32.514085,32.508803,32.508803,33.065141,-31.952465,0.302817,-0.027846,0.330066,-0.027846,0.796597,32.5
std,0.91887,14.889544,14.891344,14.885728,14.891344,14.891344,14.672962,15.162358,0.461103,0.214152,0.974636,0.214152,0.648509,14.904781
min,0.0,1.0,0.75,1.0,0.75,0.75,2.75,-80.0,0.0,-2.33139,-1.748543,-2.33139,0.030485,0.0
25%,0.0,21.25,21.25,21.25,21.25,21.25,21.25,-41.75,0.0,-0.019179,-0.406821,-0.019179,0.296765,21.25
50%,0.0,31.0,31.0,31.0,31.0,31.0,32.0,-30.0,0.0,0.007462,0.231315,0.007462,0.558564,31.0
75%,1.0,42.0,42.0,42.0,42.0,42.0,42.0,-21.0,1.0,0.022649,0.951263,0.022649,1.193339,42.0
max,5.0,80.0,80.0,80.0,80.0,80.0,80.0,1.25,1.0,0.042979,3.438357,0.042979,3.438357,80.0


## Real time data

One of the hard problems `aligned` tries to solve is keeping all features consistent across offline and online feature stores.

Let's simulate how it could look.

The following code will only work if you can connect to a Redis instance, either one running locally (installable with `brew install redis` on macOS) or one running in a Docker container.

In [12]:
import os
# Set the online redis url
os.environ['REDIS_URL'] = "redis://localhost:6379"
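Before writing anything, it can help to check that something is actually listening at that address. The sketch below uses only the Python standard library. The helpers `redis_host_port` and `is_reachable` are hypothetical names, and the check only confirms that a TCP connection can be opened, not that the server is really Redis.

```python
import socket
from urllib.parse import urlparse


def redis_host_port(url: str) -> tuple:
    """Split a redis:// URL into host and port, defaulting to port 6379."""
    parsed = urlparse(url)
    return parsed.hostname or "localhost", parsed.port or 6379


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


host, port = redis_host_port("redis://localhost:6379")
print(f"Reachable at {host}:{port}:", is_reachable(host, port))
```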

We need to populate the store with some data, so why not send in some "new" passenger information, and see how it handles it.

Also be aware that any feature can currently be left out, in which case it will be filled with `None`. This may change in the future in order to enforce better quality control.

In [13]:
online_store.feature_view("titanic").write_input

{'age', 'cabin', 'name', 'passenger_id', 'sex', 'sibsp', 'survived'}

In [14]:
await online_store.feature_view("titanic")\
    .write({
        'passenger_id': [10001, 10002, 10003, 10004],
        'age': [25, 54, None, None],
        'cabin': [None, "A40", "C53", None],
        'sex': ["male", "male", "female", "other"],
        'sibsp': [2, 0, 0, 3]
    })
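The payload above omits `name` and `survived`, two of the expected inputs. The fill-with-`None` behaviour described earlier can be pictured with the following sketch. It is a plain-Python illustration with a hypothetical helper, `fill_missing_columns`, and is not a description of what `aligned` does internally.

```python
# Expected inputs, copied from the `write_input` output above.
EXPECTED_INPUTS = {"age", "cabin", "name", "passenger_id", "sex", "sibsp", "survived"}


def fill_missing_columns(payload: dict, expected: set) -> dict:
    """Return a copy of a column-oriented payload with absent columns set to None."""
    n_rows = len(next(iter(payload.values()))) if payload else 0
    filled = dict(payload)
    for name in sorted(expected - set(payload)):
        filled[name] = [None] * n_rows
    return filled


payload = {
    "passenger_id": [10001, 10002, 10003, 10004],
    "age": [25, 54, None, None],
    "cabin": [None, "A40", "C53", None],
    "sex": ["male", "male", "female", "other"],
    "sibsp": [2, 0, 0, 3],
}
filled = fill_missing_columns(payload, EXPECTED_INPUTS)
print(sorted(set(filled) - set(payload)))  # ['name', 'survived']
```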

### Fetch the new values

The code above simulated a write to the online database. Such a write can originate from any stream process you want.

That could be a Kafka stream, a Redis stream, or maybe more simply an HTTP POST request.

However, there is no value in storing the values if we can't fetch them.

So let's get our new values for `passenger_id = [10001, 10002, 10003, 10004]`. We also ask for `10005`, which was never written, to see how an unknown entity is returned.

In [15]:
data_subset = await online_store.features_for({
    "passenger_id": [10001, 10002, 10003, 10004, 10005]
}, features=[
    "titanic:scaled_age",
    "titanic:age",
    "titanic:mean_filled_age",
    "titanic:ordinal_sex",
    "titanic:is_male",
    "titanic:is_female",
    "titanic:abs_scaled_age",
    "titanic:cabin"
]).to_df()
data_subset

Unnamed: 0,passenger_id,age,mean_filled_age,scaled_age,cabin,abs_scaled_age,is_female,is_male,ordinal_sex
0,10001,25.0,25.0,-0.161384,,0.161384,False,True,0.0
1,10002,54.0,54.0,1.736661,A40,1.736661,False,True,0.0
2,10003,,27.465769,,C53,,True,False,1.0
3,10004,,27.465769,,,,False,False,
4,10005,,,,,,,,


Look at that! We now have our processed values, which are scaled using a `StandardScaler`, i.e. expressed as the number of standard deviations from the mean.
We also have the original values as a reference.
All this without implementing a single line of transformation logic ourselves.

However, we still have a small problem: it is rather painful to write out all the features we want to include each time. So how could we make this even simpler?

That's where Model Services come in.
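As a rough check of what standard scaling means here, the `scaled_age` values can be approximated by hand. The sketch below assumes the mean used for scaling equals the `mean_filled_age` value in the table (27.465769), and uses a standard deviation of about 15.2789 that is back-calculated from the table rather than read from the feature store. Both constants are assumptions for illustration only.

```python
# Assumptions: MEAN_AGE is the mean_filled_age value from the table above;
# STD_AGE is back-calculated from the table, not read from the feature store.
MEAN_AGE = 27.465769
STD_AGE = 15.2789


def standard_scale(value, mean=MEAN_AGE, std=STD_AGE):
    """Return the number of standard deviations `value` lies from `mean`."""
    if value is None:
        return None  # missing inputs stay missing
    return (value - mean) / std


for age in [25, 54, None]:
    print(age, standard_scale(age))
```

With these constants, ages 25 and 54 land close to the `-0.161384` and `1.736661` shown in the table, and a missing age stays missing.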

## Model Service


In [16]:
online_store.all_models

['cancer_detection', 'titanic_model']

In [17]:
model_data = await online_store.model("titanic_model").features_for({
    'passenger_id': [10001, 10002, 10003, 10004, 10005]
}).to_df()
model_data

Unnamed: 0,rounded_age,sibsp,constant_filled_age,sex,ceiled_age,passenger_id,is_mr,age,mean_filled_age,survived,...,logical_and,scaled_age,logical_or,has_siblings,floor_ratio,abs_scaled_age,inverted_is_mr,is_female,is_male,floored_age
0,25.0,2.0,25.0,male,25.0,10001,True,25.0,25.0,True,...,True,-0.161384,True,True,-0.006455,0.161384,False,False,True,25.0
1,54.0,0.0,54.0,male,54.0,10002,True,54.0,54.0,True,...,True,1.736661,True,False,0.03216,1.736661,False,False,True,54.0
2,,0.0,0.0,female,,10003,True,,27.465769,True,...,True,,True,False,,,False,True,False,
3,,3.0,0.0,other,,10004,True,,27.465769,True,...,True,,True,True,,,False,False,False,
4,,,,,,10005,,,,,...,,,,,,,,,,
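One way to picture a model service, purely as a mental model and not as a description of `aligned` internals, is a named list of feature references that is looked up when you ask for features. The registry and helper below are hypothetical, and the feature list is a shortened illustration rather than the real definition of `titanic_model`.

```python
# Hypothetical registry: model name -> feature references ("view:feature").
# The names mirror this notebook; the feature list is a shortened illustration.
MODEL_FEATURES = {
    "titanic_model": [
        "titanic:scaled_age",
        "titanic:is_male",
        "titanic:is_female",
    ],
}


def features_for_model(model_name: str) -> list:
    """Look up the feature references registered for a model."""
    try:
        return MODEL_FEATURES[model_name]
    except KeyError:
        raise ValueError(f"Unknown model: {model_name!r}") from None


print(features_for_model("titanic_model"))
```

Declaring the list once per model means callers only need to supply the entity ids, as in the `features_for` call above.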
