### Part 3: Modeling
<br>
The goal of this project is to build a model that can predict which common user-interface element (button, toggle, window, etc.) a hand-drawn sketch depicts.
<br>
<br>
This is part 3 of that project, and covers building models.
<br>
<br>

In [4]:
# Import necessary packages
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score,classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
import numpy as np
from sklearn import svm
from sklearn.linear_model import LogisticRegression
import cv2
import pandas as pd
import glob
import os
import pyarrow.parquet as pq
import seaborn as sns
from matplotlib import rcParams
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from matplotlib import pyplot
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from statistics import stdev

In [4]:
table = pq.read_table('/Users/grahamsmith/Documents/SpringboardWork/UIsketch.parquet')
df = table.to_pandas()

<br>
Convert the labels to numeric codes to make modeling easier (note: this should have been done in the preprocessing step, but I forgot).
<br>
<br>

In [5]:
labels = set(df['label'])
dummies = list(range(len(labels)))
labeldf = pd.DataFrame([labels, dummies]).T
df['label'] = df['label'].replace(list(labeldf.iloc[:,0]),labeldf.iloc[:,1])

<br>
This table maps the numeric codes back to the original labels for interpretation.
<br>
<br>

In [6]:
labeldf

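| index | 0 (label name) | 1 (numeric code) |
|---|---|---|
| 0 | checkbox_unchecked | 0 |
| 1 | card | 1 |
| 2 | data_table | 2 |
| 3 | alert | 3 |
| 4 | chip | 4 |
| 5 | floating_action_button | 5 |
| 6 | radio_button_unchecked | 6 |
| 7 | grid_list | 7 |
| 8 | switch_disabled | 8 |
| 9 | tooltip | 9 |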
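
Because the iteration order of a `set` of strings can change between Python sessions, the label-to-number mapping built above is not guaranteed to be the same on every run. The following is a minimal sketch of a reproducible alternative, assuming the `label` column holds strings. The toy frame and the `label_id`, `label_to_id`, and `id_to_label` names are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the real data; the values are illustrative only.
toy = pd.DataFrame({"label": ["card", "alert", "chip", "card", "alert"]})

# Sort the unique labels so the mapping is identical on every run.
classes = sorted(toy["label"].unique())
label_to_id = {name: i for i, name in enumerate(classes)}
id_to_label = {i: name for name, i in label_to_id.items()}

# Map each string label to its numeric code.
toy["label_id"] = toy["label"].map(label_to_id)
print(label_to_id)
```

Keeping `id_to_label` around serves the same purpose as the lookup table above: it turns a predicted number back into a readable class name.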



<br>Perform the train/test split that we worked out in the preprocessing notebook.
<br>
<br>

In [7]:
# Sample 80% of the images within each class (label) for the training set.
train = df.groupby('label', group_keys=False).apply(lambda x: x.sample(frac=0.8))
# Get the indices of the images not in the training set and assign those images to the test set.
# Select by index label (.loc) rather than by position (.iloc) so this also works when df's index is not 0..n-1.
testind = df.index.difference(train.index)
test = df.loc[testind]
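
The same per-class 80/20 split can also be sketched with scikit-learn's `train_test_split`, whose `stratify` argument keeps class proportions equal in both halves. This is an alternative under stated assumptions, not the method used in this notebook: the toy frame, the `pixel_0` column, and the `train_toy` / `test_toy` names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in: 3 classes with 10 rows each and one fake feature column.
toy = pd.DataFrame({
    "pixel_0": range(30),
    "label": [0] * 10 + [1] * 10 + [2] * 10,
})

# stratify keeps the class proportions equal in both splits;
# random_state makes the split repeatable.
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, stratify=toy["label"], random_state=42
)

print(train_toy["label"].value_counts().sort_index().tolist())
print(test_toy["label"].value_counts().sort_index().tolist())
```

Fixing `random_state` means later model comparisons are made on the same rows each time the notebook is rerun.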

In [13]:
# My computer kept crashing, so everything below uses a small subset of the training data (frac=0.01 samples 1% of each class).
train_small = train.groupby('label', group_keys=False).apply(lambda x: x.sample(frac=0.01))
x_train = train_small.loc[:, train_small.columns != 'label']
y_train = train_small['label']
x_test = test.loc[:, test.columns != 'label']
y_test = test['label']

<br>
The first model I attempted was XGBoost, because it has regularization built in (an L2, ridge-style penalty by default, with an optional L1 penalty), and I figured that with so many features, many would end up contributing little or nothing. I also expected it to be faster than multinomial logistic regression on its own.
<br>
<br>
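
As a side note on the regularization reasoning above: an L2 (ridge) penalty shrinks weights toward zero but rarely makes them exactly zero, while an L1 (lasso) penalty can. The sketch below illustrates that difference with plain linear models on synthetic data. It is a stand-in, not XGBoost itself, where the corresponding penalties apply to leaf weights through the `reg_lambda` (L2) and `reg_alpha` (L1) parameters. The dataset and the `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 50 features, only 5 of which carry signal.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty

# Count how many coefficients each model set to exactly zero.
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print("coefficients exactly zero: ridge =", ridge_zeros, ", lasso =", lasso_zeros)
```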

In [23]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

numeric_pipeline = Pipeline(
    steps=[("impute", SimpleImputer(strategy="mean")), 
           ("scale", StandardScaler())]
)

num_cols = x_train.select_dtypes(include="number").columns

from sklearn.compose import ColumnTransformer

full_processor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_pipeline, num_cols),
    ]
)
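
Note that the cells that follow fit the classifier on the raw `x_train`, so `full_processor` is defined but never applied. A minimal sketch of how the preprocessing could be chained in front of a model is shown here. The toy data, the column names, and the `LogisticRegression` final step are assumptions made so the example runs on its own; any scikit-learn compatible classifier could take that last slot.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for x_train / y_train, with one missing value to impute.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(40, 3)), columns=["p0", "p1", "p2"])
X_toy.iloc[0, 0] = np.nan
y_toy = np.array([0, 1] * 20)

numeric_pipeline = Pipeline(
    steps=[("impute", SimpleImputer(strategy="mean")),
           ("scale", StandardScaler())]
)
processor = ColumnTransformer(
    transformers=[("numeric", numeric_pipeline, list(X_toy.columns))]
)

# The final step is a stand-in classifier chosen only for illustration.
clf = Pipeline(steps=[("preprocess", processor),
                      ("model", LogisticRegression(max_iter=1000))])
clf.fit(X_toy, y_toy)
toy_preds = clf.predict(X_toy)
print(toy_preds.shape)
```

Wrapping both stages in one `Pipeline` means the imputer and scaler are fitted on the training data only and then reused unchanged at prediction time.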

In [24]:
import xgboost as xgb

xgb_cl = xgb.XGBClassifier()
print(type(xgb_cl))

<class 'xgboost.sklearn.XGBClassifier'>


In [25]:
from sklearn.metrics import accuracy_score

# Init classifier
xgb_cl = xgb.XGBClassifier()

# Fit
xgb_cl.fit(x_train, y_train)

# Predict
preds = xgb_cl.predict(x_test)





In [28]:
accuracy_score(y_test, preds)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.18      0.11      0.14       205
           1       0.14      0.13      0.14       194
           2       0.11      0.11      0.11       178
           3       0.10      0.12      0.11       170
           4       0.11      0.19      0.14       181
           5       0.18      0.11      0.14       184
           6       0.24      0.28      0.25       189
           7       0.07      0.05      0.05       175
           8       0.23      0.17      0.19       181
           9       0.09      0.15      0.11       169
          10       0.12      0.12      0.12       177
          11       0.35      0.33      0.34       173
          12       0.07      0.05      0.06       208
          13       0.10      0.07      0.09       175
          14       0.18      0.17      0.17       171
          15       0.15      0.22      0.18       222
          16       0.06      0.04      0.05       166
          17       0.17    

In [8]:
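# What is the standard deviation of the per-class precision?
# The values are typed in by hand from the report above; note that class 7 is entered as .01 here, while the report shows 0.07.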
stdev([.18,.14,.11,.10,.11,.18,.24,.01,.23,.09,.12,.35,.07,.10,.18,.15,.06,.17,.14,.29,.25])

0.08170504443248461
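
Typing the precision values in by hand is easy to get wrong. A sketch of computing the same statistic directly from the report is given below, assuming scikit-learn's `classification_report` with `output_dict=True`. The toy labels and the `summary_keys` and `precisions` names are illustrative only.

```python
from statistics import stdev
from sklearn.metrics import classification_report

# Toy labels and predictions; illustrative values only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

report = classification_report(y_true, y_pred, output_dict=True)

# Per-class rows are keyed by the class label as a string; the summary
# rows are skipped so only the per-class precisions remain.
summary_keys = {"accuracy", "micro avg", "macro avg", "weighted avg"}
precisions = [row["precision"] for key, row in report.items()
              if key not in summary_keys]
print(precisions, stdev(precisions))
```

In the notebook, `y_true` and `y_pred` would be replaced by `y_test` and `preds`.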

<br>
This initial XGBoost model does not reach the ~40% accuracy we were getting with KNN in the preprocessing step: over the 17 classes fully shown in the report above, support-weighted recall comes to about 14%. What it does offer is much less variance in precision between classes, with a standard deviation of only about 0.08, so it is more consistent.
<br>
<br>
Even after cutting down my data dramatically, this model still took more than 12 hours to run, so I decided against tuning the parameters and instead moved on to my second method, LightGBM. I chose it primarily because it is much faster than XGBoost while offering many of the same benefits.
<br>
<br>

In [14]:
import optuna  # pip install optuna
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
import lightgbm as lgb
from sklearn import metrics

In [15]:
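# In LightGBM, max_depth <= 0 means "no depth limit", so -5 behaves like the default of -1.
model = lgb.LGBMClassifier(learning_rate=0.09, max_depth=-5, random_state=42)
# Track the loss on both the test and training sets, logging every 20 iterations.
# (The verbose argument to fit() was removed in LightGBM 4.0; the lgb.log_evaluation callback replaces it.)
model.fit(x_train,y_train, eval_set=[(x_test,y_test), (x_train,y_train)],
          verbose=20, eval_metric='logloss')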

[20]	training's multi_logloss: 0.315811	valid_0's multi_logloss: 2.89903
[40]	training's multi_logloss: 0.0285181	valid_0's multi_logloss: 3.0573
[60]	training's multi_logloss: 0.00252844	valid_0's multi_logloss: 3.33069
[80]	training's multi_logloss: 0.000270244	valid_0's multi_logloss: 3.64147
[100]	training's multi_logloss: 0.000120762	valid_0's multi_logloss: 3.75902


LGBMClassifier(learning_rate=0.09, max_depth=-5, random_state=42)
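
The log above shows the training loss falling toward zero while the validation loss rises after iteration 20, which is a sign of overfitting. The short sketch below picks out the logged iteration with the lowest validation loss, using only the numbers printed above. The `valid_logloss` and `best_iter` names are made up for illustration.

```python
# Validation multi_logloss values copied from the training log above,
# keyed by boosting iteration.
valid_logloss = {20: 2.89903, 40: 3.0573, 60: 3.33069, 80: 3.64147, 100: 3.75902}

# Pick the logged iteration with the lowest validation loss.
best_iter = min(valid_logloss, key=valid_logloss.get)
print(best_iter, valid_logloss[best_iter])
```

Because the log only prints every 20 iterations, the true minimum may lie earlier than iteration 20. LightGBM's `lgb.early_stopping` callback is one way to stop training automatically once the validation loss stops improving, though ideally against a separate validation set rather than the test set.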

In [22]:
result = model.predict(x_test)

#preds = predict(model, x_test)
print(classification_report(y_test, result))

              precision    recall  f1-score   support

           0       0.18      0.12      0.15       205
           1       0.12      0.13      0.12       194
           2       0.12      0.11      0.11       178
           3       0.08      0.09      0.09       170
           4       0.13      0.27      0.17       181
           5       0.18      0.09      0.12       184
           6       0.23      0.29      0.25       189
           7       0.04      0.03      0.03       175
           8       0.33      0.20      0.25       181
           9       0.08      0.10      0.09       169
          10       0.15      0.14      0.14       177
          11       0.30      0.38      0.34       173
          12       0.10      0.08      0.09       208
          13       0.09      0.07      0.08       175
          14       0.13      0.16      0.15       171
          15       0.17      0.23      0.19       222
          16       0.06      0.04      0.05       166
          17       0.30    

#### Summary
<br>
In this notebook we tried both XGBoost and LightGBM as algorithms to predict image class. Their performance was nearly identical: both had lower overall accuracy (bad) and lower variance between classes (good) than the baseline K-Nearest-Neighbors algorithm, and both struggled in the same places (classifying drop-down menus and labels). However, LightGBM took only a breezy 4 hours to run, which makes it superior in efficiency at the very least.
<br>
<br>
While I'm going to leave the project here for the sake of expediency, I will note that, given more time, a Convolutional Neural Network would almost certainly be more accurate than the models I've used here. Sadly, all of these models are still a far cry from good enough to replace a human at the task of identifying the images. About all we can say is that they are better than random chance.
<br>
<br>
