In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

from pandas_ml_utils import pd, np, SkModel, FeaturesAndLabels, stratified_random_splitter
from pandas_ml_utils_test.config import DF_NOTES
import matplotlib.pyplot as plt

The following data set contains variables used to determine whether a note (banknote) is authentic or not.

In [2]:
DF_NOTES.tail()

Unnamed: 0,variance,skewness,kurtosis,entropy,authentic
1367,0.40614,1.3492,-1.4501,-0.55949,1
1368,-1.3887,-4.8773,6.4774,0.34179,1
1369,-3.7503,-13.4586,17.5932,-2.7771,1
1370,-3.5637,-8.3827,12.393,-1.2823,1
1371,-2.5419,-0.65804,2.6842,1.1952,1


Now let's estimate which features might be useful to predict the label, i.e. whether a note is authentic (1) or not (0).
But before we do that, we add some redundancy and some random data. The feature selection should obviously be able to
get rid of such useless data. Since we do not know whether the data is sorted in some way, we use a
`stratified_random_splitter` to make sure that both classes are evenly represented in the training and test sets.
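
To illustrate why stratification matters for data that may be sorted by class, here is a minimal sketch of a stratified
random split in plain Python. It is not the implementation of `stratified_random_splitter`; the function name
`stratified_split` and the toy labels are assumptions made for this example only.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, test_idx) with the same fraction of every class in the test set."""
    rng = random.Random(seed)

    # group row positions by class label
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # randomise within the class
        n_test = round(len(indices) * test_size)  # same fraction for every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    return sorted(train_idx), sorted(test_idx)


# hypothetical toy labels, sorted by class (this is not DF_NOTES)
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels, test_size=0.3)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both sets keep the 60:40 class ratio of the toy labels, whereas simply cutting off the last 30% of the sorted rows
would leave the test set with a single class.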

In [3]:
# make the experiment reproducible
np.random.seed(42)

res = DF_NOTES.model.feature_selection(
     features_and_labels=FeaturesAndLabels(
        features=["variance", "skewness", "kurtosis", "entropy",
                  lambda df: (df["skewness"] * 0.5).rename("skewness red"),       # add correlated data
                  lambda df: pd.Series(np.random.random(len(df)), name='noise'),  # add random noise
                  ],
        labels=["authentic"],
        label_type=int
    ),
    training_data_splitter=stratified_random_splitter(0.3),
    correlated_features_th=0.8                                                    # keep kurtosis
)

res

drop redundant features: {'skewness': 1.0}


Data was not in RNN shape
Data was not in RNN shape
Data was not in RNN shape
Data was not in RNN shape
Data was not in RNN shape
Data was not in RNN shape
Data was not in RNN shape


Training Data

Unnamed: 0,ranking,importance,importance_std
variance,1,0.429736,0.451354
kurtosis,1,0.22821,0.315722
entropy,2,,
skewness red,1,0.342054,0.371517
noise,3,,

Test Data

Unnamed: 0,ranking,importance,importance_std,ranking (Test),importance (Test),importance_std (Test)
variance,1,0.429736,0.451354,1,0.523644,0.406844
kurtosis,1,0.22821,0.315722,1,0.042252,0.104006
entropy,2,,,1,0.04812,0.120008
skewness red,1,0.342054,0.371517,1,0.294965,0.306785
noise,3,,,1,0.091019,0.193292


As we can see, the feature selection process got rid of the two unnecessary variables: the redundant `skewness` column
was dropped in favour of its perfectly correlated copy `skewness red`, and `noise` was ranked last. It is also worth
noting that the remaining features stay important on both the training and the test data set. Checking whether the
features are equally informative on out-of-sample data is a pretty useful heuristic, because if they are not, the best
we can hope for is an overfitted model.
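
The redundancy message in the output (`drop redundant features: {'skewness': 1.0}`) can be checked by hand. The sketch
below computes the Pearson correlation between the last five `skewness` values shown by `DF_NOTES.tail()` and their
scaled copy. It assumes that `correlated_features_th` is compared against the absolute correlation; the helper
`pearson` is written for this example and is not part of `pandas_ml_utils`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# skewness values of the last five rows shown by DF_NOTES.tail()
skewness = [1.3492, -4.8773, -13.4586, -8.3827, -0.65804]
skewness_red = [s * 0.5 for s in skewness]  # the scaled copy added as "skewness red"

threshold = 0.8                             # same value as correlated_features_th
r = pearson(skewness, skewness_red)
is_redundant = abs(r) > threshold           # assumption: absolute correlation is compared
print(round(r, 6), is_redundant)
```

Scaling a column by a constant does not change its correlation with the original, so one of the two columns carries
no additional information and can be dropped.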

Now we can just re-use the best estimated model (which is a RandomForest classifier / regressor):
``` python
skmodel = res.model.sk_model.best_estimator_.estimator_
```

Or we use a different type of estimator, e.g. a multi-layer perceptron. To get a very rough idea of the network size,
we could just use the number of nodes from the regression trees.

```python
from sklearn.neural_network import MLPClassifier
network_size = res.training_summary.nr_of_nodes
nr_layers = 2

hidden = [int(network_size ** (1 / float(nr_layers)))] * nr_layers
print("hidden size", hidden)

skmodel = MLPClassifier(hidden_layer_sizes=hidden, activation='tanh')
```
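
The sizing heuristic above takes the `nr_layers`-th root of the node count, so that the layer widths multiplied
together roughly reproduce the tree's node budget. A self-contained sketch, where the helper name
`hidden_layer_sizes` and the node counts are hypothetical and not taken from `res.training_summary`:

```python
def hidden_layer_sizes(total_nodes, nr_layers):
    """Spread a node budget over `nr_layers` equally wide hidden layers."""
    # nr_layers-th root of the budget; int() truncates towards zero
    width = int(total_nodes ** (1 / float(nr_layers)))
    return [width] * nr_layers


# hypothetical node counts, chosen only to illustrate the arithmetic
sizes_600 = hidden_layer_sizes(600, 2)
sizes_625 = hidden_layer_sizes(625, 2)
print(sizes_600, sizes_625)
```

Because `int()` truncates, any node count from 576 up to 624 yields two layers of 24 units each.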

For a quick experiment we would be good to go and could just invoke `df.model.fit(SkModel(skmodel, FeaturesAndLabels(...)))`.
But we want to save the model and load it painlessly in any context we like, e.g. in a Flask application. This means
we should not have to care about which modules need to be imported, nor know whether the raw data needs to be
pre-processed in some form. Therefore we re-assign the model in a model context.

In [5]:
with DF_NOTES.model("notes-$V.model") as m:  # note: the $V macro allows saving multiple versions of the model
    from pandas_ml_utils import SkModel, FeaturesAndLabels, ClassificationSummary, stratified_random_splitter
    from sklearn.neural_network import MLPClassifier
    
    network_size = res.training_summary.nr_of_nodes
    nr_layers = 2

    hidden = [int(network_size ** (1 / float(nr_layers)))] * nr_layers
    print("hidden size", hidden)

    fit = m.fit(
        SkModel(
            MLPClassifier(hidden_layer_sizes=hidden, activation='tanh'),
            FeaturesAndLabels(
                features=["variance", "skewness", "kurtosis"],
                labels=["authentic"]
            ),
            summary_provider=ClassificationSummary
        ),
        training_data_splitter=stratified_random_splitter(0.25)
    )

fit

hidden size [24, 24]
Data was not in RNN shape
Data was not in RNN shape
Data was not in RNN shape
saved model to: /home/kic/sources/private/projects/pandas-quant/pandas-ml-utils/examples/notes-202010-15-59240.model


Training Data

Unnamed: 0_level_0,prediction,label,feature,feature,feature
Unnamed: 0_level_1,authentic,authentic,variance,skewness,kurtosis
1363,0.977014,1,-1.1667,-1.4237,2.9241
1364,0.978962,1,-2.8391,-6.63,10.4849
1365,0.996073,1,-4.5046,-5.8126,10.8867
1366,0.998724,1,-2.41,3.7433,-0.40215
1367,0.996661,1,0.40614,1.3492,-1.4501

Test Data

Unnamed: 0_level_0,prediction,label,feature,feature,feature
Unnamed: 0_level_1,authentic,authentic,variance,skewness,kurtosis
1327,0.917208,1,-0.24037,-1.7837,2.135
1331,0.998506,1,0.22432,-0.52147,-0.40386
1084,0.989635,1,-1.48,-10.5244,9.9176
1049,0.999775,1,-3.9594,4.0289,-0.35845
916,0.997573,1,-0.539,-5.167,3.4399


Here we are with a nicely fitted model :tada:

We should also note that our model has been saved. So we can create any new script, app, etc. and just load it:
```python
from pandas_ml_utils import Model

df = load_a_notes_data_frame()
model = Model.load('notes-202010-15-59240.model')  # file name as printed by the fit above
prediction = df.model.predict(model)
```