<a href="https://colab.research.google.com/github/MiguelEuripedes/embedded_AI/blob/main/Projects/first_image_classifier/knn_classifier/DataSegregation_KNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Your First Image Classifier: Using k-NN to Classify Images

## Data Segregation

The goal of this dataset is to classify each image as containing a dog, a cat, or a panda. With only 3,000 images, the Animals dataset is an introductory dataset on which we can quickly train a k-NN model and obtain initial results. The accuracy will not be very good, but it can serve as a baseline.

This notebook takes the following steps:

1. Setup
2. Data segregation: split the clean data into train and test sets

### Step 01: Setup

Start out by installing the experiment tracking library and setting up your free W&B account:

* **pip install wandb** – Install the W&B library
* **import wandb** – Import the wandb library
* **wandb login** – Log in to your W&B account so you can log all your metrics in one place

In [None]:
!pip install wandb -qU



In [None]:
import wandb
wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

#### Import Packages

Import the necessary packages

In [None]:
from imutils import paths
import logging
import os
import cv2
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

Configure logging: get a reference to the root logger object

In [None]:
logger = logging.getLogger()

Set the logging level

In [None]:
logger.setLevel(logging.INFO)

Create handlers

In [None]:
c_handler = logging.StreamHandler()
c_format = logging.Formatter(fmt="%(asctime)s %(message)s",datefmt='%d-%m-%Y %H:%M:%S')
c_handler.setFormatter(c_format)

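Attach the handler to the logger, replacing any handler that is already installed

In [None]:
# Use only our handler; this also works when no handler was installed yet
logger.handlers = [c_handler]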
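As a minimal, self-contained sketch of the same configuration, the block below builds a named logger and writes to an in-memory buffer so the resulting format can be inspected. The helper `build_logger` and the logger name `data_segregation_demo` are hypothetical and do not appear in the notebook. The sketch removes any existing handlers before adding the new one, so re-running the cell does not duplicate log lines.

```python
import io
import logging


def build_logger(name, stream):
    """Return a logger that writes timestamped INFO messages to `stream`."""
    log = logging.getLogger(name)
    log.setLevel(logging.INFO)

    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(fmt="%(asctime)s %(message)s", datefmt="%d-%m-%Y %H:%M:%S")
    )

    # Drop handlers left over from earlier runs, then attach the new one.
    for old_handler in list(log.handlers):
        log.removeHandler(old_handler)
    log.addHandler(handler)
    log.propagate = False  # keep demo messages out of the root logger
    return log


buffer = io.StringIO()
demo_logger = build_logger("data_segregation_demo", buffer)
demo_logger.info("Downloading and reading clean data artifact")
demo_logger.debug("below the INFO level, so this is dropped")

captured = buffer.getvalue()
print(captured, end="")
```

Each captured line starts with a `dd-mm-YYYY HH:MM:SS` timestamp followed by the message, which is the format of the log lines shown in the cell outputs of this notebook.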

### Step 02: Data Segregation

Since we are using a Jupyter notebook, we can replace the argument-parsing code with *hard-coded* arguments and values.

In [None]:
args = {
  "project_name": "first_image_classifier",
  "artifact_name_feature": "clean_data:latest",
  "artifact_name_target": "label:latest",
  "train_feature_artifact": "train_x",
  "train_target_artifact": "train_y",
  "test_feature_artifact": "test_x",
  "test_target_artifact": "test_y"
}
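For comparison, here is a rough sketch of what the replaced argument-parsing code might look like in a standalone script. It is an assumption, not the author's script: the helper `build_parser` and the help text are hypothetical, and only the argument names and default values are taken from the `args` dictionary above.

```python
import argparse

# Defaults copied from the hard-coded `args` dictionary in the notebook.
DEFAULTS = {
    "project_name": "first_image_classifier",
    "artifact_name_feature": "clean_data:latest",
    "artifact_name_target": "label:latest",
    "train_feature_artifact": "train_x",
    "train_target_artifact": "train_y",
    "test_feature_artifact": "test_x",
    "test_target_artifact": "test_y",
}


def build_parser():
    """Build a parser exposing one optional string flag per entry in DEFAULTS."""
    parser = argparse.ArgumentParser(
        description="Split the clean data into train and test sets"
    )
    for name, default in DEFAULTS.items():
        parser.add_argument(f"--{name}", type=str, default=default)
    return parser


# An explicit empty list stops argparse from reading the notebook kernel's
# own command-line flags, so every argument falls back to its default.
cli_args = vars(build_parser().parse_args([]))
print(cli_args["artifact_name_feature"])

# A single value can still be overridden, as it would be on the command line.
custom_args = vars(build_parser().parse_args(["--project_name", "demo_project"]))
print(custom_args["project_name"])
```

With no flags supplied, `cli_args` holds the same keys and values as the hard-coded dictionary, so the rest of the notebook could read from either one.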

Open the W&B project created in the Fetch step

In [None]:
run = wandb.init(entity="euripedes",project=args["project_name"], job_type="data_segregation")

wandb: Currently logged in as: euripedes. Use `wandb login --relogin` to force relogin


In [None]:
logger.info("Downloading and reading clean data artifact")
clean_data = run.use_artifact(args["artifact_name_feature"])
clean_data_path = clean_data.file()

13-10-2022 01:26:46 Downloading and reading clean data artifact


In [None]:
logger.info("Downloading and reading label data artifact")
label_data = run.use_artifact(args["artifact_name_target"])
label_data_path = label_data.file()

13-10-2022 01:26:54 Downloading and reading label data artifact


Unpacking the artifacts

In [None]:
data = joblib.load(clean_data_path)

In [None]:
label = joblib.load(label_data_path)

Partition the data into training and testing splits using 75% of the data for training and the remaining 25% for testing

In [None]:
(train_x, test_x, train_y, test_y) = train_test_split(data, label, test_size=0.25, random_state=42)

In [None]:
logger.info("Train x: {}".format(train_x.shape))
logger.info("Train y: {}".format(train_y.shape))
logger.info("Test x: {}".format(test_x.shape))
logger.info("Test y: {}".format(test_y.shape))

13-10-2022 01:27:07 Train x: (2250, 3072)
13-10-2022 01:27:07 Train y: (2250,)
13-10-2022 01:27:07 Test x: (750, 3072)
13-10-2022 01:27:07 Test y: (750,)
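To see where these shapes come from without downloading the artifacts, the sketch below repeats the split on synthetic data of the same size. The arrays `demo_data` and `demo_label` are made-up stand-ins, and the assumption of 1,000 samples per class (encoded as 0, 1, 2) is for illustration only. Just the `train_test_split` call mirrors the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 3,000 flattened "images" with 3,072 features each,
# and integer class labels 0, 1, 2 (assumed to be 1,000 per class).
rng = np.random.default_rng(0)
demo_data = rng.integers(0, 256, size=(3000, 3072), dtype=np.uint8)
demo_label = np.repeat(np.arange(3), 1000)

# Same call as in the notebook: 25% of the rows go to the test split.
demo_train_x, demo_test_x, demo_train_y, demo_test_y = train_test_split(
    demo_data, demo_label, test_size=0.25, random_state=42
)

for name, array in [
    ("Train x", demo_train_x),
    ("Train y", demo_train_y),
    ("Test x", demo_test_x),
    ("Test y", demo_test_y),
]:
    print(name, array.shape)

# The split is random, so per-class counts in the test set need not be
# exactly 250 each; printing them shows how balanced the split turned out.
classes, test_counts = np.unique(demo_test_y, return_counts=True)
print(dict(zip(classes.tolist(), test_counts.tolist())))
```

With 3,000 rows and `test_size=0.25`, 750 rows land in the test split and the remaining 2,250 in the training split, matching the logged shapes.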


In [None]:
logger.info("Dumping the train and test data artifacts to the disk")

13-10-2022 01:27:11 Dumping the train and test data artifacts to the disk


Save the artifacts using joblib

In [None]:
joblib.dump(train_x, args["train_feature_artifact"])
joblib.dump(train_y, args["train_target_artifact"])
joblib.dump(test_x, args["test_feature_artifact"])
joblib.dump(test_y, args["test_target_artifact"])

['test_y']
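The `['test_y']` output is the return value of the last `joblib.dump` call, which is the list of file names it wrote. The sketch below shows the round trip on a small made-up array. The temporary directory and the array contents are assumptions for illustration; the notebook itself writes to the current working directory.

```python
import os
import tempfile

import joblib
import numpy as np

# Small stand-in for one of the split arrays.
original = np.arange(12, dtype=np.uint8).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_x")

    # dump() returns the list of files it wrote (a single file here).
    written = joblib.dump(original, path)

    # load() restores the array with its original shape and dtype.
    restored = joblib.load(path)

print(written == [path])
print(np.array_equal(original, restored), restored.dtype)
```

Because the files are written without an extension, the same name (`train_x`, for example) can later be passed unchanged to `artifact.add_file`.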

**Train_x artifact**

In [None]:
artifact = wandb.Artifact(args["train_feature_artifact"],
                          type="TRAIN_DATA",
                          description="A joblib file representing the train_x"
                          )

logger.info("Logging train_x artifact")
artifact.add_file(args["train_feature_artifact"])
run.log_artifact(artifact)

13-10-2022 01:27:26 Logging train_x artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7fd6d3bd3550>

**Train_y artifact**

In [None]:
artifact = wandb.Artifact(args["train_target_artifact"],
                          type="TRAIN_DATA",
                          description="A joblib file representing the train_y"
                          )

logger.info("Logging train_y artifact")
artifact.add_file(args["train_target_artifact"])
run.log_artifact(artifact)

13-10-2022 01:27:29 Logging train_y artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7fd6d3bdce90>

**Test_x artifact**

In [None]:
artifact = wandb.Artifact(args["test_feature_artifact"],
                          type="TEST_DATA",
                          description="A joblib file representing the test_x"
                          )

logger.info("Logging test_x artifact")
artifact.add_file(args["test_feature_artifact"])
run.log_artifact(artifact)

13-10-2022 01:27:32 Logging test_x artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7fd6d3be39d0>

**Test_y artifact**

In [None]:
artifact = wandb.Artifact(args["test_target_artifact"],
                          type="TEST_DATA",
                          description="A joblib file representing the test_y"
                          )

logger.info("Logging test_y artifact")
artifact.add_file(args["test_target_artifact"])
run.log_artifact(artifact)

13-10-2022 01:27:35 Logging test_y artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7fd6d3bf36d0>
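The four cells above differ only in the artifact name and type, so they could be folded into one loop. The sketch below is one possible way to do that, not the author's code. The helper `log_split_artifacts`, the `SPLIT_ARTIFACTS` list, the description text, and the `FakeArtifact` and `FakeRun` stand-ins are all hypothetical. The stand-ins exist only so the sketch runs without a W&B account, while the real path uses `wandb.Artifact`, `add_file`, and `run.log_artifact` as in the cells above.

```python
# (file name, artifact type) pairs, as used in the four cells above.
SPLIT_ARTIFACTS = [
    ("train_x", "TRAIN_DATA"),
    ("train_y", "TRAIN_DATA"),
    ("test_x", "TEST_DATA"),
    ("test_y", "TEST_DATA"),
]


def log_split_artifacts(run, artifact_cls=None):
    """Create, fill, and log one artifact per (file name, type) pair."""
    if artifact_cls is None:
        import wandb  # lazy import: only needed when talking to W&B

        artifact_cls = wandb.Artifact

    logged = []
    for name, artifact_type in SPLIT_ARTIFACTS:
        artifact = artifact_cls(
            name, type=artifact_type, description=f"joblib dump of {name}"
        )
        artifact.add_file(name)  # the file saved earlier under the same name
        run.log_artifact(artifact)
        logged.append(artifact)
    return logged


# Dry run with hypothetical stand-ins that only record what they receive.
class FakeArtifact:
    def __init__(self, name, type, description):
        self.name, self.type, self.description = name, type, description
        self.files = []

    def add_file(self, local_path):
        self.files.append(local_path)


class FakeRun:
    def __init__(self):
        self.logged = []

    def log_artifact(self, artifact):
        self.logged.append(artifact)


fake_run = FakeRun()
log_split_artifacts(fake_run, artifact_cls=FakeArtifact)
print([(a.name, a.type) for a in fake_run.logged])
```

In the notebook, the equivalent call would be `log_split_artifacts(run)`, made after the four files have been saved and before the run is finished.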

In [None]:
run.finish()
