## Building a simple classifier to predict the type of Iris & publishing it in Kusto

Open dataset from UCI Repository: __[Iris](http://archive.ics.uci.edu/ml/datasets/Iris)__

A well-known, simple data set for classification. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. Each sample has 4 attributes: sepal length, sepal width, petal length and petal width.
 
Predicted attribute: class of iris plant.
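As a quick sanity check of the description above, independent of Kusto, the class balance and attribute count can be inspected on the copy of Iris bundled with scikit-learn. This is only a sketch: it assumes the bundled copy matches the table queried later in this notebook, and its column naming differs from the Kusto table.

```python
from collections import Counter

from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris stands in for the Kusto 'Iris' table.
iris = load_iris()

# Count how many samples belong to each class name.
class_counts = Counter(str(iris.target_names[i]) for i in iris.target)

print("Samples x attributes:", iris.data.shape)
print("Samples per class:", dict(class_counts))
```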

In [None]:
import pandas as pd
import datetime
import pickle
import binascii

In [None]:
%reload_ext Kqlmagic

### Retrieving the table for classification from Kusto

In [None]:
%kql kusto://code;cluster='help';database='Samples'

In [None]:
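# NOTE: the built-in hash() of a str changes between interpreter sessions, and
# setting PYTHONHASHSEED with %env after the kernel has started does not change that.
# A hashlib digest of the query text gives a stable cache file name instead.
import hashlib

q = '''
Iris
'''

fn = "df" + hashlib.md5(q.encode("utf-8")).hexdigest() + ".pkl"
print("Cache file name: ", fn)

In [None]: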
try:
    df = pd.read_pickle(fn)
    print("Load df from " + fn)
except Exception:  # cache file missing or unreadable: fall back to the query
    print("Execute query...")
    try:
        %kql res << -query q
        df = res.to_dataframe()
        print("Save df to " + fn)
        df.to_pickle(fn)
        print("\n", df.shape, "\n", df.columns)
    except Exception as ex:
        print(ex)
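The load-or-query pattern used in the cell above can be sketched without Kusto as follows. The helper names `cache_path`, `load_or_compute` and `fake_fetch` are hypothetical and exist only for this illustration; `fake_fetch` stands in for the real query call.

```python
import hashlib
import os
import pickle
import tempfile


def cache_path(query, directory):
    """Build a file name that is stable across sessions from the query text."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]
    return os.path.join(directory, "df" + digest + ".pkl")


def load_or_compute(query, fetch, directory):
    """Return (result, was_cached); call fetch(query) only on a cache miss."""
    path = cache_path(query, directory)
    try:
        with open(path, "rb") as f:
            return pickle.load(f), True
    except FileNotFoundError:
        result = fetch(query)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result, False


calls = []


def fake_fetch(query):
    # Hypothetical stand-in for running the query against the cluster.
    calls.append(query)
    return {"rows": 150}


with tempfile.TemporaryDirectory() as tmp:
    first, first_cached = load_or_compute("Iris", fake_fetch, tmp)
    second, second_cached = load_or_compute("Iris", fake_fetch, tmp)

print(first_cached, second_cached, len(calls))
```

The second call is served from the pickle file, so the stand-in fetch function runs only once.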

In [None]:
print(df.shape, "\n")
print(df[-4:])

In [None]:
df.groupby(['Species']).size()

## Train Model

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train, test = train_test_split(df, test_size=0.2, random_state=0)

In [None]:
train_x = train[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]
train_y = train['Species']
test_x = test[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]
test_y = test['Species']

print(train_x.shape, train_y.shape, test_x.shape, test_y.shape)

In [None]:
from sklearn import tree
from sklearn import neighbors
from sklearn import naive_bayes
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Four classifier types
clf1 = tree.DecisionTreeClassifier()
clf2 = LogisticRegression(max_iter=1000)  # higher iteration cap to avoid convergence warnings
clf3 = neighbors.KNeighborsClassifier()
clf4 = naive_bayes.GaussianNB()

In [None]:
clf1 = clf1.fit(train_x, train_y)
clf2 = clf2.fit(train_x, train_y)
clf3 = clf3.fit(train_x, train_y)
clf4 = clf4.fit(train_x, train_y)

#### Cross-validated accuracy on the training set

In [None]:
for clf, label in zip([clf1, clf2, clf3, clf4], ['Decision Tree', 'Logistic Regression', 'K Nearest Neighbour', 'Naive Bayes']):
    scores = cross_val_score(clf, train_x, train_y, cv=5, scoring='accuracy')
    print("Accuracy: %0.4f (+/- %0.4f) [%s]" % (scores.mean(), scores.std(), label))

#### Accuracy on the testing set

In [None]:
# Score the models fitted above on the held-out test set.
# cross_val_score would refit clones on the test data instead of evaluating the trained models.
for clf, label in zip([clf1, clf2, clf3, clf4], ['Decision Tree', 'Logistic Regression', 'K Nearest Neighbour', 'Naive Bayes']):
    acc = clf.score(test_x, test_y)
    print("Accuracy: %0.4f [%s]" % (acc, label))
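One detail worth keeping in mind when scoring: `cross_val_score` works on clones of the estimator it receives and refits them on each fold, so it does not measure a model that was already trained. The sketch below illustrates this and contrasts it with scoring a fitted model on held-out data. It assumes scikit-learn's bundled copy of Iris as a stand-in for the Kusto table.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: the bundled dataset replaces the Kusto 'Iris' table here.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = GaussianNB()

# Cross-validation fits clones of clf; clf itself stays unfitted.
cv_scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy")
fitted_by_cv = hasattr(clf, "classes_")

# Fit once on the training set, then score that fitted model on held-out data.
clf.fit(X_train, y_train)
held_out_accuracy = clf.score(X_test, y_test)

print("Fitted by cross_val_score:", fitted_by_cv)
print("CV accuracy: %0.4f (+/- %0.4f)" % (cv_scores.mean(), cv_scores.std()))
print("Held-out accuracy: %0.4f" % held_out_accuracy)
```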

## Export the model to Kusto

In [None]:
models_tbl = 'ML_Models'
model_name = 'Iris'

#### Create a dataframe containing model name, timestamp & the serialized model

In [None]:
bmodel = pickle.dumps(clf4)
# hexlify() returns bytes; decode to a plain str so the model is stored as text
smodel = binascii.hexlify(bmodel).decode('ascii')

now = datetime.datetime.now()
dfm = pd.DataFrame({'name':[model_name], 'timestamp':[now], 'model':[smodel]})
dfm
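The serialization step can be checked in isolation with a round trip. In this minimal sketch a plain dictionary named `toy_model` (hypothetical) takes the place of the trained classifier; the same `pickle` and `binascii` calls apply to any picklable object.

```python
import binascii
import pickle

# Hypothetical stand-in for a trained classifier object.
toy_model = {"name": "Iris", "classes": ["setosa", "versicolor", "virginica"]}

# Serialize to bytes, then encode as a hex string that fits in a text column.
hex_text = binascii.hexlify(pickle.dumps(toy_model)).decode("ascii")

# Reverse both steps to recover the original object.
restored = pickle.loads(binascii.unhexlify(hex_text))

print(type(hex_text).__name__, len(hex_text))
print(restored == toy_model)
```

Hex encoding doubles the size of the pickled bytes but keeps the value limited to the characters `0-9` and `a-f`, which is convenient for a string column.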

#### Store it in the table of models

In [None]:
set_query = '''
.set-or-append {0} <|
let tbl = dfm;
tbl
'''.format(models_tbl)
print(set_query)

In [None]:
%kql -query set_query

## Test Model

#### Extract the latest version of the named model from the table of models

In [None]:
get_query = '''
let tbl_name = models_tbl;
let m_name = model_name;
table(tbl_name)
| where name == m_name
| top 1 by timestamp desc
'''
print(get_query)
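The selection logic of the query above (filter by name, then keep the row with the most recent timestamp) can be sketched in plain Python. The rows and the helper `latest_model` are made up for illustration and do not reflect the actual contents of the models table.

```python
from datetime import datetime

# Hypothetical rows mimicking the name / timestamp / model columns.
rows = [
    {"name": "Iris", "timestamp": datetime(2019, 1, 1, 12, 0), "model": "aa"},
    {"name": "Iris", "timestamp": datetime(2019, 3, 1, 9, 30), "model": "bb"},
    {"name": "Other", "timestamp": datetime(2019, 6, 1, 8, 0), "model": "cc"},
]


def latest_model(rows, m_name):
    """Mimic `where name == m_name | top 1 by timestamp desc`."""
    matches = [r for r in rows if r["name"] == m_name]
    if not matches:
        return None
    return max(matches, key=lambda r: r["timestamp"])


latest = latest_model(rows, "Iris")
print(latest["model"], latest["timestamp"])
```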

In [None]:
%kql res << -query get_query
model_df = res.to_dataframe()
qmodel = model_df.loc[0, 'model']

#### Create the trained model object and test it

In [None]:
import pickle
import binascii

bmodel = binascii.unhexlify(qmodel)
clfp = pickle.loads(bmodel)
print(clfp)

In [None]:
# Score the restored model directly; cross_val_score would refit clones on the test data
pscore = clfp.score(test_x, test_y)
print("Accuracy: %0.4f" % pscore)

In [None]:
clfp.predict(test_x[:5])