# The Data Station - BlindML and DoD components working together

This notebook will use a couple of Data Station components to train a simple machine learning model based on examples, without revealing the training data to the end user.

First, the demo code:

In [1]:
import test_dod_integ



I'm going to ask the Data Station to make me a predictive model for converting US ZIP codes to State names.

I'll start by giving a small example of the mappings I want the ML model to give me:

In [3]:
examples = [["98011", "California"],
            ["32804", "Florida"]]

The Data Station already knows about the Adventure Works dataset, which describes the operations of a company in about 60 different spreadsheets. Unknown to the user, it's possible to combine some of the customer data from various sheets in that dataset to give ZIP code / State associations.

I will ask the Data Station to generate a model for me based on the above examples. The following call will take around a minute to run, and will generate a lot of console output showing the inner workings of the DoD and BlindML components. These workings would not normally be accessible to an end user in this use case.

In [4]:
model = test_dod_integ.get_model_by_example(examples)

INFO:elasticsearch:GET http://localhost:9200/text/_search?filter_path=hits.hits._source.id%2Chits.hits._score%2Chits.total%2Chits.hits._source.dbName%2Chits.hits._source.sourceName%2Chits.hits._source.columnName%2Chits.hits.highlight.text [status:200 request:0.193s]
INFO:elasticsearch:GET http://localhost:9200/text/_search?filter_path=hits.hits._source.id%2Chits.hits._score%2Chits.total%2Chits.hits._source.dbName%2Chits.hits._source.sourceName%2Chits.hits._source.columnName%2Chits.hits.highlight.text [status:200 request:0.013s]
INFO:elasticsearch:GET http://localhost:9200/text/_search?filter_path=hits.hits._source.id%2Chits.hits._score%2Chits.total%2Chits.hits._source.dbName%2Chits.hits._source.sourceName%2Chits.hits._source.columnName%2Chits.hits.highlight.text [status:200 request:0.057s]
INFO:elasticsearch:GET http://localhost:9200/text/_search?filter_path=hits.hits._source.id%2Chits.hits._score%2Chits.total%2Chits.hits._source.dbName%2Chits.hits._source.sourceName%2Chits.hits._sourc

total views: 1
Dataframe is    PostalCode            Name
0       98011      Washington
1       97205          Oregon
2       55802       Minnesota
3       75201           Texas
4       94109      California
..        ...             ...
67      29577  South Carolina
68      21201        Maryland
69      57000         Moselle
70       7001        Tasmania
71      78000         Yveline

[72 rows x 2 columns]
Dataframe types are PostalCode    category
Name          category
dtype: object
getting training data
starting classification
done with classification
classifier is built
Classifier accuracy: 0.017543859649122806
test finished with accuracy 0.017543859649122806


When this has finished, you should see that the model was trained, but quite poorly: the accuracy will probably be somewhere between 0.01 and 0.02 (i.e. 1-2%), and the run above reports about 0.0175.
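To put the printed accuracy in perspective, the short sketch below converts the value from the log into a simple fraction. The result is consistent with a single correct prediction out of 57 evaluated rows, but the notebook does not state the size of the evaluation set, so treat that reading as an inference from the number rather than a documented fact.

```python
from fractions import Fraction

# Accuracy value copied from the console output above.
reported_accuracy = 0.017543859649122806

# Closest fraction with a small denominator
# (the bound of 1000 is an arbitrary choice for this illustration).
as_fraction = Fraction(reported_accuracy).limit_denominator(1000)

print(as_fraction)                 # 1/57
print(f"{reported_accuracy:.2%}")  # 1.75%
```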

Printing it out, you can see it's an auto-sklearn classifier, which follows the scikit-learn estimator interface.

In [6]:
model

AutoSklearnClassifier(per_run_time_limit=6, time_left_for_this_task=60)

You can interact with it in the same way as you would interact with any scikit-learn model.

This code asks for a prediction of which state the ZIP code 90210 lies in.

In [7]:
import pandas as pd
df = pd.DataFrame(data={"PostalCode": ["90210"]})
df["PostalCode"] = df["PostalCode"].astype('category')
model.predict(df)

array(['Alabama'], dtype=object)

The model probably got that answer wrong (90210 is in California). But it probably named a state that wasn't in the examples we supplied: it has learned other state names from related data.
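A quick way to confirm this is to compare the predicted label with the states that appeared in the examples. This is only a minimal sketch: the `predicted` value is copied by hand from the output above rather than taken from a live model, and the variable names are illustrative.

```python
# The examples passed to get_model_by_example earlier in the notebook.
examples = [["98011", "California"],
            ["32804", "Florida"]]

# State names the user explicitly supplied.
example_states = {state for _, state in examples}

# Label copied from the model.predict output shown above.
predicted = "Alabama"

# False means the model produced a label it could only have learned
# from the data discovered by the Data Station.
print(predicted in example_states)  # False
```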

If you scroll through the "secret" output above, you should see somewhere in the middle a table like this:

```
total views: 1
Dataframe is    PostalCode            Name
0       98011      Washington
1       97205          Oregon
2       55802       Minnesota
3       75201           Texas
4       94109      California
..        ...             ...
67      29577  South Carolina
68      21201        Maryland
69      57000         Moselle
70       7001        Tasmania
71      78000         Yveline
```

which shows that the DoD component discovered 1 view of the data that looked like our examples. You can see two factors which contribute to poor training:
* in addition to US ZIP codes, DoD found Australian and French postal codes
* the discovered joined dataset only had 72 examples of training data
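The first problem can be illustrated with a small filter that keeps only rows whose label is a US state name. This is a hedged sketch, not part of the Data Station: the rows are the ones visible in the table above, `US_STATES` and the variable names are assumptions made for illustration, and in the real use case any such cleaning would have to happen inside the Data Station, since the end user never sees the data. Note that a length check on the postal code would not be enough, because the French codes also have five digits.

```python
# Names of the 50 US states, used as an allow-list (illustrative helper).
US_STATES = {name.strip() for name in """
Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut,
Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas,
Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota,
Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey,
New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon,
Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas,
Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming
""".replace("\n", " ").split(",")}

# The rows visible in the discovered view above (the elided rows are omitted).
rows = [("98011", "Washington"), ("97205", "Oregon"), ("55802", "Minnesota"),
        ("75201", "Texas"), ("94109", "California"),
        ("29577", "South Carolina"), ("21201", "Maryland"),
        ("57000", "Moselle"), ("7001", "Tasmania"), ("78000", "Yveline")]

# Split the rows by whether the label is a US state.
us_rows = [(code, name) for code, name in rows if name in US_STATES]
dropped = [(code, name) for code, name in rows if name not in US_STATES]

print(len(us_rows))  # 7
print(dropped)       # [('57000', 'Moselle'), ('7001', 'Tasmania'), ('78000', 'Yveline')]
```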

If the Data Station had more data and was given more time, it would be able to produce a much better model: for example, when loaded with some census data and given 15 minutes to train, it produces a model with >90% accuracy.