# Your First Image Classifier: Using k-NN to Classify Images
# Data Segregation

The goal is to classify each image in the Animals dataset as containing a dog, a cat, or a panda.
With only 3,000 images, Animals is another **introductory** dataset:
we can quickly train a k-NN model on it and obtain initial results (with modest accuracy) that can serve as a baseline.

Let's take the following steps:

1. Data segregation
2. Split clean data into train and test

<center><img width="800" src="https://drive.google.com/uc?export=view&id=1fKGuR5U5ECf7On6Zo1UWzAIWZrMmZnGc"></center>

## Step 01: Setup

Start out by installing the experiment tracking library and setting up your free W&B account:


*   **pip install wandb** – Install the W&B library
*   **import wandb** – Import the wandb library
*   **wandb login** – Log in to your W&B account so you can log all your metrics in one place

In [1]:
!pip install wandb -qU


In [2]:
import wandb
wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 

··········


wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

### Import Packages

In [3]:
# import the necessary packages
from imutils import paths
import logging
import os
import cv2
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [4]:
# configure logging
# reference for a logging obj
logger = logging.getLogger()

# set level of logging
logger.setLevel(logging.INFO)

# create handlers
c_handler = logging.StreamHandler()
c_format = logging.Formatter(fmt="%(asctime)s %(message)s",datefmt='%d-%m-%Y %H:%M:%S')
c_handler.setFormatter(c_format)

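# attach the handler, replacing any existing ones
# (indexing handlers[0] fails when the logger has no handler yet)
logger.handlers.clear()
logger.addHandler(c_handler)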

## Step 02: Data Segregation

In [5]:
# since we are using Jupyter Notebooks we can replace our argument
# parsing code with *hard coded* arguments and values
args = {
  "project_name": "first_image_classifier",
  "artifact_name_feature": "clean_data:latest",
  "artifact_name_target": "label:latest",
  "train_feature_artifact": "train_x",
  "train_target_artifact": "train_y",
  "test_feature_artifact": "test_x",
  "test_target_artifact": "test_y"
}

In [6]:
# open the W&B project created in the Fetch step
run = wandb.init(entity="morsinaldo",project=args["project_name"], job_type="data_segregation")

logger.info("Downloading and reading clean data artifact")
clean_data = run.use_artifact(args["artifact_name_feature"])
clean_data_path = clean_data.file()

logger.info("Downloading and reading label data artifact")
label_data = run.use_artifact(args["artifact_name_target"])
label_data_path = label_data.file()

# unpacking the artifacts
data = joblib.load(clean_data_path)
label = joblib.load(label_data_path)

wandb: Currently logged in as: morsinaldo. Use `wandb login --relogin` to force relogin


15-10-2022 04:08:49 Downloading and reading clean data artifact
15-10-2022 04:08:50 Downloading and reading label data artifact


In [7]:
# partition the data into training and testing splits using 75% of
# the data for training and the remaining 25% for testing
(train_x, test_x, train_y, test_y) = train_test_split(data, label, test_size=0.25, random_state=42)
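
To see what this split does without the W&B artifacts, here is a minimal sketch on synthetic data. The random pixel values, the 3,072-value feature width, and the balanced 1,000-images-per-class labels are placeholders chosen for illustration. The `stratify` argument is an optional addition that the cell above does not use; it keeps the class proportions equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder features: 3,000 flattened "images" with 3,072 values each (assumed shape)
data = rng.integers(0, 256, size=(3000, 3072), dtype=np.uint8)

# Placeholder labels: assumes 1,000 images per class
label = np.repeat(np.array(["cat", "dog", "panda"]), 1000)

# Same 75/25 split as above, plus stratification on the labels (an addition)
train_x, test_x, train_y, test_y = train_test_split(
    data, label, test_size=0.25, random_state=42, stratify=label
)

# Count how many test samples fall into each class
classes, counts = np.unique(test_y, return_counts=True)
test_counts = {str(c): int(n) for c, n in zip(classes, counts)}

print(train_x.shape, test_x.shape)
print(test_counts)
```

Without `stratify`, the per-class counts in the test set can drift slightly from an even split because the samples are drawn purely at random.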

In [8]:
logger.info("Train x: {}".format(train_x.shape))
logger.info("Train y: {}".format(train_y.shape))
logger.info("Test x: {}".format(test_x.shape))
logger.info("Test y: {}".format(test_y.shape))

15-10-2022 04:08:51 Train x: (2250, 3072)
15-10-2022 04:08:51 Train y: (2250,)
15-10-2022 04:08:51 Test x: (750, 3072)
15-10-2022 04:08:51 Test y: (750,)
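
The feature width of 3,072 is consistent with each image having been resized to 32x32 pixels with 3 color channels and then flattened, since 32 * 32 * 3 = 3,072. That preprocessing happens in an earlier step and is not shown in this notebook, so the 32x32 size is an assumption. The sketch below only checks the arithmetic, together with the row counts that a 25% test split of 3,000 samples produces.

```python
import math
import numpy as np

# Assumption: images were resized to 32x32 RGB before flattening
# (the preprocessing step is not part of this notebook)
height, width, channels = 32, 32, 3
image = np.zeros((height, width, channels), dtype=np.uint8)

# Flattening turns the 3D pixel grid into one feature vector
features = image.flatten()
n_features = features.shape[0]

# train_test_split rounds the test size up: ceil(test_size * n_samples)
n_samples = 3000
n_test = math.ceil(0.25 * n_samples)
n_train = n_samples - n_test

print(n_features, n_train, n_test)  # 3072 2250 750
```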


In [9]:
logger.info("Dumping the train and test data artifacts to the disk")

# Save the artifacts using joblib
joblib.dump(train_x, args["train_feature_artifact"])
joblib.dump(train_y, args["train_target_artifact"])
joblib.dump(test_x, args["test_feature_artifact"])
joblib.dump(test_y, args["test_target_artifact"])

15-10-2022 04:08:51 Dumping the train and test data artifacts to the disk


['test_y']
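
The `['test_y']` shown above is the return value of the last `joblib.dump` call, which returns the list of file names it wrote. As a minimal sketch of the dump and load round trip, the example below writes a small placeholder array to a temporary directory (both are illustrative, not part of the pipeline) and reads it back.

```python
import os
import tempfile

import joblib
import numpy as np

# Small placeholder array standing in for train_x
array = np.arange(12, dtype=np.uint8).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_x")

    # dump returns the list of file names written to disk
    written = joblib.dump(array, path)

    # load reads the object back into memory
    restored = joblib.load(path)

print(written == [path])
print(np.array_equal(array, restored))
```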

In [10]:
# train_x artifact
artifact = wandb.Artifact(args["train_feature_artifact"],
                          type="TRAIN_DATA",
                          description="A joblib file containing the train_x features"
                          )

logger.info("Logging train_x artifact")
artifact.add_file(args["train_feature_artifact"])
run.log_artifact(artifact)

15-10-2022 04:08:51 Logging train_x artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7f4586cdcc50>

In [11]:
# train_y artifact
artifact = wandb.Artifact(args["train_target_artifact"],
                          type="TRAIN_DATA",
                          description="A joblib file containing the train_y labels"
                          )

logger.info("Logging train_y artifact")
artifact.add_file(args["train_target_artifact"])
run.log_artifact(artifact)

15-10-2022 04:08:51 Logging train_y artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7f4586c7f3d0>

In [12]:
# test_x artifact
artifact = wandb.Artifact(args["test_feature_artifact"],
                          type="TEST_DATA",
                          description="A joblib file containing the test_x features"
                          )

logger.info("Logging test_x artifact")
artifact.add_file(args["test_feature_artifact"])
run.log_artifact(artifact)

15-10-2022 04:08:51 Logging test_x artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7f4586c851d0>

In [13]:
# test_y artifact
artifact = wandb.Artifact(args["test_target_artifact"],
                          type="TEST_DATA",
                          description="A joblib file containing the test_y labels"
                          )

logger.info("Logging test_y artifact")
artifact.add_file(args["test_target_artifact"])
run.log_artifact(artifact)

15-10-2022 04:08:52 Logging test_y artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7f4586c9a850>
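
The four artifact cells above repeat the same three calls: create the artifact, add the file, log it. One way to express that pattern once is a small helper, sketched below. The function name `log_file_artifacts` and the `specs` list are hypothetical, and the helper takes the run and the artifact class as arguments so it can be read and exercised without an active W&B session. It assumes, as in this notebook, that each file on disk has the same name as its artifact.

```python
def log_file_artifacts(run, artifact_cls, specs):
    """Create, fill, and log one artifact per (name, type, description) spec.

    run          -- object with a log_artifact(artifact) method, e.g. a W&B run
    artifact_cls -- class called as artifact_cls(name, type=..., description=...)
    specs        -- iterable of (name, artifact_type, description) tuples
    """
    logged = []
    for name, artifact_type, description in specs:
        artifact = artifact_cls(name, type=artifact_type, description=description)
        # Assumption: the file on disk shares the artifact's name
        artifact.add_file(name)
        run.log_artifact(artifact)
        logged.append(name)
    return logged


# Hypothetical specs mirroring the four artifacts logged in this notebook
specs = [
    ("train_x", "TRAIN_DATA", "Training features"),
    ("train_y", "TRAIN_DATA", "Training labels"),
    ("test_x", "TEST_DATA", "Test features"),
    ("test_y", "TEST_DATA", "Test labels"),
]

# In the notebook, with an active run, the call would look like:
# log_file_artifacts(run, wandb.Artifact, specs)
```

Keeping one cell per artifact, as the notebook does, is easier to follow step by step; the helper mainly pays off when the list of artifacts grows.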

In [14]:
run.finish()