### Part 3: Modeling

The goal of this project is to build a model that predicts which common user-interface element (button, toggle, window, etc.) a hand-drawn sketch depicts.

This is part 3 of that project, and covers building models.

In [1]:
# Import necessary packages
import glob
import os

import cv2
import numpy as np
import pandas as pd
import pyarrow.parquet as pq
import seaborn as sns
import xgboost as xgb
from matplotlib import pyplot, rcParams
from sklearn import svm
from sklearn.datasets import fetch_openml, make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, mean_squared_error)
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

In [5]:
table = pq.read_table('/Users/grahamsmith/Documents/SpringboardWork/Springboard/UIsketch.parquet')
df = table.to_pandas()

Convert the labels to numeric codes to make modeling easier (note: this should have been done in the preprocessing step, but I forgot).

In [7]:
labels = set(df['label'])
dummies = list(range(len(labels)))
labeldf = pd.DataFrame([labels, dummies]).T
df['label'] = df['label'].replace(list(labeldf.iloc[:,0]),labeldf.iloc[:,1])

This table lets the numeric codes be mapped back to their label names for interpretation.

In [8]:
labeldf

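| | 0 (label) | 1 (code) |
|---|---|---|
| 0 | slider | 0 |
| 1 | label | 1 |
| 2 | alert | 2 |
| 3 | image | 3 |
| 4 | menu | 4 |
| 5 | radio_button_unchecked | 5 |
| 6 | text_field | 6 |
| 7 | dropdown_menu | 7 |
| 8 | chip | 8 |
| 9 | switch_disabled | 9 |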
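One caveat with building the mapping from a bare `set`: the iteration order of a set of strings can differ between Python sessions, so the label-to-code assignment may not be the same on every run. A minimal sketch of a repeatable alternative is below, assuming it is acceptable to assign codes in alphabetical order. The `raw_labels` list is a hypothetical stand-in for `df['label']`, and the dictionary names are made up for illustration.

```python
# Hypothetical label values standing in for df['label'].
raw_labels = ["slider", "alert", "chip", "alert", "slider"]

# Sorting the unique labels makes the label -> code assignment identical on every run.
classes = sorted(set(raw_labels))
label_to_code = {name: code for code, name in enumerate(classes)}
# The inverse mapping turns predicted codes back into readable label names.
code_to_label = {code: name for name, code in label_to_code.items()}

encoded = [label_to_code[name] for name in raw_labels]
decoded = [code_to_label[code] for code in encoded]
```

With a real DataFrame, the same dictionaries could be applied with `df['label'].map(label_to_code)`.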


Do the train/test split that we worked out in the preprocessing notebook.

In [33]:
# Sample 80% of the images within each class (label) for the training set.
train = df.groupby('label', group_keys=False).apply(lambda x: x.sample(frac=0.8))
# Every image whose index is not in the training set goes to the test set.
# drop() selects by index label, so this also works if the index is not 0..n-1.
test = df.drop(train.index)
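For comparison, a per-class 80/20 split can also be sketched with scikit-learn's `train_test_split` and its `stratify` argument. This is an alternative, not what the notebook ran; the `toy` frame, its column names, and the `random_state` value are assumptions for illustration. Fixing `random_state` makes the split repeatable between runs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the real df: two pixel columns and a label.
toy = pd.DataFrame({
    "px0": range(20),
    "px1": range(20, 40),
    "label": [0] * 10 + [1] * 10,
})

# stratify keeps the 80/20 ratio inside every class; random_state fixes the shuffle.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["label"], random_state=42
)
```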

Unfortunately, my computer kept crashing, so the only way I could get any of the following models to run was to drop 70% of the training data (again sampling within each class).

In [None]:
train_small = train.groupby('label', group_keys=False).apply(lambda x: x.sample(frac=0.3))
x_train = train_small.loc[:, train_small.columns != 'label']
y_train = train_small['label']
x_test = test.loc[:, test.columns != 'label']
y_test = test['label']

The first model I attempted was XGBoost, because it has L1 and L2 regularization built in (the `reg_alpha` and `reg_lambda` parameters), and I figured that with so many features, regularization would help suppress the uninformative ones. I also expected it to be faster than multinomial logistic regression by itself.
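As a side note on regularization: an L2 (ridge-style) penalty shrinks a weight toward zero but does not make it exactly zero, while an L1 (lasso-style) penalty can. The sketch below illustrates this for a single weight with closed-form solutions. It is a simplified one-dimensional illustration, not XGBoost's actual update, and the function names are made up. In XGBoost the penalties apply to leaf weights rather than to per-feature coefficients.

```python
def ridge_shrink(b, lam):
    """Minimizer of 0.5*(w - b)**2 + 0.5*lam*w**2: scales b, never exactly zero for b != 0."""
    return b / (1.0 + lam)


def lasso_shrink(b, lam):
    """Minimizer of 0.5*(w - b)**2 + lam*abs(w): soft-thresholding, zero when abs(b) <= lam."""
    magnitude = max(abs(b) - lam, 0.0)
    return magnitude if b >= 0 else -magnitude


small_weight = 0.3
penalty = 0.5
ridge_result = ridge_shrink(small_weight, penalty)  # shrunk, but still non-zero
lasso_result = lasso_shrink(small_weight, penalty)  # driven all the way to zero
```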

In [45]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

numeric_pipeline = Pipeline(
    steps=[("impute", SimpleImputer(strategy="mean")), 
           ("scale", StandardScaler())]
)

num_cols = x_train.select_dtypes(include="number").columns

from sklearn.compose import ColumnTransformer

full_processor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_pipeline, num_cols),
    ]
)
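The preprocessor above is not attached to a model in the cells shown here. As a rough sketch of how it could be chained with a classifier, the example below wraps an imputing and scaling step together with `LogisticRegression` in a single `Pipeline`. The toy data and all `toy_` names are hypothetical. Logistic regression is used only because it benefits from scaling; tree-based models such as XGBoost generally do not need it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features with a couple of missing values.
toy_X = pd.DataFrame({
    "px0": [0.0, 1.0, np.nan, 3.0, 4.0, 5.0],
    "px1": [5.0, 4.0, 3.0, 2.0, np.nan, 0.0],
})
toy_y = pd.Series([0, 0, 0, 1, 1, 1])

# Same two steps as the notebook's numeric pipeline: mean-impute, then standardize.
toy_numeric = Pipeline(
    steps=[("impute", SimpleImputer(strategy="mean")), ("scale", StandardScaler())]
)
toy_processor = ColumnTransformer(
    transformers=[("numeric", toy_numeric, list(toy_X.columns))]
)

# Chaining preprocessing and model means both are fit together on the training data.
toy_model = Pipeline(
    steps=[("preprocess", toy_processor), ("classify", LogisticRegression())]
)
toy_model.fit(toy_X, toy_y)
toy_preds = toy_model.predict(toy_X)
```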

In [50]:
import xgboost as xgb

xgb_cl = xgb.XGBClassifier()
print(type(xgb_cl))

<class 'xgboost.sklearn.XGBClassifier'>


In [52]:
from sklearn.metrics import accuracy_score

# Init classifier
xgb_cl = xgb.XGBClassifier()

# Fit
xgb_cl.fit(x_train, y_train)

# Predict
preds = xgb_cl.predict(x_test)

In [53]:
accuracy_score(y_test, preds)

0.41496598639455784

This initial XGBoost model has an overall accuracy of about 41%. This is only slightly better than the ~40% we were getting with KNN in the preprocessing step.
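Overall accuracy hides which classes the model confuses. A minimal sketch of a per-class breakdown using `confusion_matrix` is shown below; the toy label lists are hypothetical and are not results from this project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted codes for three classes.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)

# Rows are true classes, columns are predicted classes.
toy_cm = confusion_matrix(y_true_toy, y_pred_toy)

# Recall per class: correct predictions on the diagonal divided by each row total.
toy_recall = toy_cm.diagonal() / toy_cm.sum(axis=1)
```

On the real data, the same calls would take `y_test` and `preds`, and the class codes could be translated back to names with `labeldf`.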

Even after cutting down my data dramatically, this model still took more than 12 hours to run, so I decided against tuning the parameters and instead moved on to my second method, LightGBM. I chose it primarily because it is much faster than XGBoost and has many of the same benefits.

In [59]:
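import lightgbm as lgbm
import optuna  # pip install optuna
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

def objective(trial, X, y):
    # Placeholder search space, to be filled in later (e.g. via trial.suggest_* calls)
    param_grid = {}
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1121218)

    cv_scores = np.empty(5)
    for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        # cv.split yields positional indices, so use iloc for both X and y
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        # The labels have more than two classes, so use the multiclass objective
        model = lgbm.LGBMClassifier(objective="multiclass", **param_grid)
        model.fit(
            X_train,
            y_train,
            eval_set=[(X_test, y_test)],
            eval_metric="multi_logloss",
            # Early stopping as a callback; the early_stopping_rounds
            # fit argument was removed in LightGBM 4.0
            callbacks=[lgbm.early_stopping(100)],
        )
        preds = model.predict_proba(X_test)
        # Score each fold with a single number: the multiclass log loss
        cv_scores[idx] = log_loss(y_test, preds, labels=model.classes_)

    return np.mean(cv_scores)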

In [61]:
param_grid = {
    "max_depth": [3, 4, 5, 7],
    "learning_rate": [0.1, 0.01, 0.05],
    "gamma": [0, 0.25, 1],
    "reg_lambda": [0, 1, 10],
    # scale_pos_weight is omitted: it is intended for binary classification.
    "subsample": [0.8],
    "colsample_bytree": [0.5],
}

In [None]:
from sklearn.model_selection import GridSearchCV

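# Init classifier; the labels are multiclass, so use a softmax objective
xgb_cl = xgb.XGBClassifier(objective="multi:softprob")

# Init Grid Search; "roc_auc_ovr" is the one-vs-rest multiclass form of ROC AUC
grid_cv = GridSearchCV(xgb_cl, param_grid, n_jobs=-1, cv=3, scoring="roc_auc_ovr")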

# Fit
_ = grid_cv.fit(x_train, y_train)
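Before launching a grid search, it helps to count how many model fits it implies, since a single XGBoost fit took more than 12 hours here. The sketch below does that arithmetic for a hypothetical grid (`example_grid` is illustrative, not the grid from this notebook), assuming 3-fold cross-validation and roughly 12 hours per fit.

```python
from math import prod

# Hypothetical grid used only to illustrate the arithmetic.
example_grid = {
    "max_depth": [3, 4, 5, 7],
    "learning_rate": [0.1, 0.01, 0.05],
    "reg_lambda": [0, 1, 10],
}
cv_folds = 3
hours_per_fit = 12  # assumed from the single-fit runtime reported above

# A grid search tries every combination, so the counts multiply.
n_combinations = prod(len(values) for values in example_grid.values())
n_fits = n_combinations * cv_folds
total_days = n_fits * hours_per_fit / 24
```

Even this reduced grid works out to 108 fits, or about 54 days at 12 hours each, which is why an exhaustive search was not practical on this hardware.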

In [55]:
import lightgbm as lgb
from sklearn import metrics

In [56]:
# Note: LightGBM treats any max_depth <= 0 as "no depth limit".
model = lgb.LGBMClassifier(learning_rate=0.09, max_depth=-5, random_state=42)
model.fit(x_train,y_train, eval_set=[(x_test,y_test), (x_train,y_train)],
          verbose=20, eval_metric='logloss')

[20]	training's multi_logloss: 0.37064	valid_0's multi_logloss: 2.18407
[40]	training's multi_logloss: 0.0915149	valid_0's multi_logloss: 2.09388
[60]	training's multi_logloss: 0.0471361	valid_0's multi_logloss: 2.15989
[80]	training's multi_logloss: 0.0383709	valid_0's multi_logloss: 2.26604
[100]	training's multi_logloss: 0.0364348	valid_0's multi_logloss: 2.3888


LGBMClassifier(learning_rate=0.09, max_depth=-5, random_state=42)

In [9]:
print('Training accuracy {:.4f}'.format(model.score(x_train,y_train)))
print('Testing accuracy {:.4f}'.format(model.score(x_test,y_test)))

Training accuracy 0.9744
Testing accuracy 0.6340


Oh ho! Now we're getting somewhere. LightGBM was both considerably faster (it took only a breezy 4 hours to run!) and considerably more accurate, hitting 63% on the test set.
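The gap between 97% training accuracy and 63% testing accuracy, together with the training log, suggests the model is overfitting: the training loss keeps falling while the `valid_0` loss bottoms out and then rises. A small sketch that reads the best logged iteration from the values printed above:

```python
# (iteration, valid_0 multi_logloss) pairs copied from the training log above.
valid_log = [
    (20, 2.18407),
    (40, 2.09388),
    (60, 2.15989),
    (80, 2.26604),
    (100, 2.3888),
]

# The lowest validation loss among the logged iterations.
best_iter, best_loss = min(valid_log, key=lambda pair: pair[1])
final_iter, final_loss = valid_log[-1]

# How much worse the final model is than the best logged checkpoint.
loss_increase = final_loss - best_loss
```

Among the logged checkpoints the validation loss is lowest at iteration 40, so stopping earlier might help. Since `valid_0` is the test set here, choosing the stopping point this way would leak test information; a separate validation split would be the safer assumption.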

While I'm going to leave the project here for the sake of expediency, I will note that if I had more time, a Convolutional Neural Network would almost certainly be more accurate than the models I've used here. Sadly, 63% is still a far cry from good enough to replace a human at the task of identifying the images.