# Iris Prediction

We load the iris dataset, train and evaluate an XGBoost classifier, and use the fitted model to predict the class of a single observation.  
The goal is then to serve predictions on demand from a Flask server running in a Docker container.

### Imports

In [1]:
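import numpy as np
import pandas as pd
import pickle

import xgboost
import sklearn
# Import the submodules used below explicitly; `import sklearn` alone does not guarantee they are loaded.
import sklearn.metrics
import sklearn.model_selection
from sklearn.datasets import load_iris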

In [27]:
print(np.__version__)
print(pd.__version__)
print(sklearn.__version__)
print(xgboost.__version__)

1.19.3
1.1.4
0.24.1
1.3.3


# Load Data

In [2]:
data = load_iris()
data.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [3]:
X, y = load_iris(return_X_y=True, as_frame=True)
df = pd.concat([X, y], axis=1)

### Train Test split

In [4]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(df.drop("target", axis=1), df["target"])
X_train

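| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 83 | 6.0 | 2.7 | 5.1 | 1.6 |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 143 | 6.8 | 3.2 | 5.9 | 2.3 |
| 47 | 4.6 | 3.2 | 1.4 | 0.2 |
| 67 | 5.8 | 2.7 | 4.1 | 1.0 |
| ... | ... | ... | ... | ... |
| 55 | 5.7 | 2.8 | 4.5 | 1.3 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 |
| 121 | 5.6 | 2.8 | 4.9 | 2.0 |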
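
The split above is reshuffled on every run, so the rows shown will vary between executions. A minimal sketch, assuming a reproducible and class-balanced split is wanted: `random_state` and `stratify` are standard `train_test_split` parameters, while the seed value 42 is an arbitrary choice made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load features and labels as pandas objects, as in the notebook.
X, y = load_iris(return_X_y=True, as_frame=True)

# random_state fixes the shuffle; stratify keeps the three classes balanced
# in both subsets. The seed 42 is an arbitrary value chosen for illustration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# With 150 rows and test_size=0.25 this gives 112 training and 38 test rows.
print(X_train.shape, X_test.shape)
```

Re-running the cell with the same seed returns the same rows, which makes the evaluation score comparable across runs.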


# Modelling

### XGB

In [5]:
model_xgb = xgboost.sklearn.XGBClassifier()

### Train

In [6]:
model_xgb.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [7]:
pred = model_xgb.predict(X_test)
pred_proba = model_xgb.predict_proba(X_test)

### Evaluation

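The model probably overfits this small dataset, since it uses default hyperparameters with no tuning or cross-validation. That is acceptable here: model quality is not the focus of this project.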

In [8]:
sklearn.metrics.roc_auc_score(y_test, pred_proba, multi_class='ovr')

0.9959554334554334
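
A test-set score alone does not show overfitting; the usual check is to compare it with the score on the training data. The sketch below is one way to do that. `train_test_gap` is a hypothetical helper, and a decision tree stands in for `XGBClassifier` only to keep the example dependency-light; any fitted estimator with a `predict` method can be passed instead.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_test_gap(model, X_train, X_test, y_train, y_test):
    """Return (train accuracy, test accuracy, gap) for a fitted classifier."""
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc, train_acc - test_acc


X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Stand-in estimator (assumption): an unpruned decision tree replaces
# XGBClassifier here; the helper itself does not depend on the model type.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc, test_acc, gap = train_test_gap(model, X_train, X_test, y_train, y_test)
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```

A large positive gap suggests the model has memorized the training rows; a gap near zero suggests it has not.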

# Serialization of model

In [9]:
with open("model.pkl", "wb") as file:
    pickle.dump(model_xgb, file)
    

# Prediction for 1 observation

In [10]:
with open("model.pkl", "rb") as file:
    model_loaded = pickle.load(file)
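
A pickled model is tied to the library versions that produced it, which matters once the file is loaded inside a separate container. One possible safeguard, sketched here under assumptions: `save_bundle` and `load_bundle` are hypothetical helpers, and a plain dictionary stands in for the trained model.

```python
import os
import pickle
import sys
import tempfile
import warnings


def save_bundle(model, path, versions):
    """Pickle the model together with the library versions used to train it."""
    with open(path, "wb") as file:
        pickle.dump({"model": model, "versions": dict(versions)}, file)


def load_bundle(path, current_versions):
    """Load the model and warn about every library whose version has changed."""
    with open(path, "rb") as file:
        bundle = pickle.load(file)
    for name, saved in bundle["versions"].items():
        current = current_versions.get(name)
        if current != saved:
            warnings.warn(f"{name}: saved with {saved}, now running {current}")
    return bundle["model"]


# Stand-in object (assumption): any picklable model is handled the same way.
model = {"kind": "placeholder"}
versions = {"python": sys.version.split()[0]}

path = os.path.join(tempfile.mkdtemp(), "model_bundle.pkl")
save_bundle(model, path, versions)
restored = load_bundle(path, versions)
print(restored == model)
```

In this notebook the `versions` dictionary would hold values such as `xgboost.__version__` and `sklearn.__version__`, matching the versions printed in the Imports section.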

In [11]:
observation = X_test.sample(n=1)
observation

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 107 | 7.3 | 2.9 | 6.3 | 1.8 |


In [16]:
arg = list(observation.iloc[0])
arg

[7.3, 2.9, 6.3, 1.8]

In [18]:
pd.DataFrame([arg], columns=["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"])

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 7.3 | 2.9 | 6.3 | 1.8 |


In [12]:
pred_obs = model_loaded.predict(observation)
# predict returns a one-element array; take the element before converting to int
data.target_names[int(pred_obs[0])]

'virginica'
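
The stated goal is to serve such predictions from a Flask server. The following is only a sketch of what that could look like, not the project's actual server: the `/predict` route, the JSON payload keyed by feature name, and the `create_app` factory are all assumptions made for illustration. It rebuilds a one-row DataFrame with the training column names, as done in the cell above, before calling the model.

```python
import pandas as pd
from flask import Flask, jsonify, request

# Column order must match the training data; class names follow load_iris().
FEATURES = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]
TARGET_NAMES = ["setosa", "versicolor", "virginica"]


def create_app(model):
    """Build a Flask app that serves predictions from an already loaded model."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Hypothetical payload: a JSON object with one entry per feature name.
        payload = request.get_json(silent=True) or {}
        missing = [name for name in FEATURES if name not in payload]
        if missing:
            return jsonify({"error": f"missing features: {missing}"}), 400
        # One-row frame carrying the column names the model was trained on.
        values = [float(payload[name]) for name in FEATURES]
        row = pd.DataFrame([values], columns=FEATURES)
        label = int(model.predict(row)[0])
        return jsonify({"label": label, "species": TARGET_NAMES[label]})

    return app
```

To run it, the model would be loaded from `model.pkl` with `pickle.load` and passed to `create_app(model).run(host="0.0.0.0", port=5000)`; the host and port are placeholder values. Passing the model in as an argument keeps the route testable with a stub model and Flask's `test_client()`.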