<a href="https://colab.research.google.com/github/MiguelEuripedes/embedded_AI/blob/main/Projects/first_image_classifier/knn_classifier/Preprocessing_KNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Your First Image Classifier: Using k-NN to Classify Images

## Exploratory Data Analysis and Pre-processing

The goal is to classify each image as containing a dog, a cat, or a panda. With only 3,000 images, the Animals dataset is another introductory dataset on which we can quickly train a k-NN model. The initial accuracy will not be very good, but the result can serve as a baseline.

Let's take the following steps:
1. Exploratory Data Analysis (EDA)
2. Pre-processing

### Step 01: Setup

Start out by installing the experiment tracking library and setting up your free W&B account:

* **pip install wandb** – Install the W&B library
* **import wandb** – Import the wandb library
* **wandb login** – Log in to your W&B account so you can log all your metrics in one place

In [None]:
!pip install wandb -qU


In [None]:
import wandb
wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

True

#### Import Packages

Import the necessary packages

In [None]:
from imutils import paths
import logging
import os

In [None]:
import cv2
import numpy as np
import joblib

Configure logging:

Reference to a logging object

In [None]:
logger = logging.getLogger()

Set the level of logging

In [None]:
logger.setLevel(logging.INFO)

Create handlers

In [None]:
c_handler = logging.StreamHandler()
c_format = logging.Formatter(fmt="%(asctime)s %(message)s",datefmt='%d-%m-%Y %H:%M:%S')
c_handler.setFormatter(c_format)

Add handler to the Logger

In [None]:
# replace any existing handlers; plain indexing fails when the list is empty
logger.handlers = [c_handler]
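The logging steps above can also be gathered into one helper. The following is a minimal sketch rather than the notebook's own code: `build_logger` and the logger name `"preprocessing_demo"` are hypothetical, and a `StringIO` stream stands in for the console so the formatted line can be inspected.

```python
import io
import logging


def build_logger(stream, name="preprocessing_demo"):
    """Return a logger that writes timestamped messages to `stream`.

    `build_logger` and the default name are hypothetical; the notebook
    configures the root logger directly.
    """
    demo_logger = logging.getLogger(name)
    demo_logger.setLevel(logging.INFO)
    demo_logger.propagate = False  # keep the demo output off the root logger

    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(fmt="%(asctime)s %(message)s", datefmt="%d-%m-%Y %H:%M:%S")
    )
    # replace any handlers left over from a previous call
    demo_logger.handlers = [handler]
    return demo_logger


buffer = io.StringIO()
demo_logger = build_logger(buffer)
demo_logger.info("Path: ./artifacts")
demo_logger.debug("not shown: DEBUG is below the INFO threshold")
logged_lines = buffer.getvalue().splitlines()
print(logged_lines)
```

Only the `INFO` message reaches the stream, prefixed with a `day-month-year hour:minute:second` timestamp, which matches the format of the log lines in the cell outputs of this notebook.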

### Step 02: EDA

Since we are using a Jupyter Notebook, we can replace the argument parsing code with *hard-coded* arguments and values.

In [None]:
args = {
  "dataset": "animals",
  "project_name": "first_image_classifier",
  "artifact_name": "animals_raw_data:latest",
  "eda_name": "eda_animals"
}
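For comparison, outside a notebook the same values would usually come from the command line. The sketch below shows one way such a dictionary could be produced with `argparse`; the flag names and help strings are assumptions made for illustration, not the interface of any original script.

```python
import argparse

# Hypothetical command-line interface mirroring the hard-coded dictionary;
# the flag names are assumptions, not taken from an original script.
ap = argparse.ArgumentParser(description="EDA step for the Animals dataset")
ap.add_argument("--dataset", default="animals", help="dataset name")
ap.add_argument("--project_name", default="first_image_classifier", help="W&B project")
ap.add_argument("--artifact_name", default="animals_raw_data:latest", help="raw data artifact")
ap.add_argument("--eda_name", default="eda_animals", help="name of the EDA artifact")

# Passing an explicit list keeps the example runnable inside a notebook,
# where sys.argv holds the kernel's own arguments.
cli_args = vars(ap.parse_args(["--dataset", "animals"]))
print(cli_args["artifact_name"])
```

`vars()` turns the parsed namespace into a plain dictionary, so the rest of the code can index it exactly like the hard-coded `args`.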

Open the W&B project created in the Fetch step

In [None]:
run = wandb.init(entity="euripedes",project=args["project_name"], job_type="preprocessing")

wandb: Currently logged in as: euripedes. Use `wandb login --relogin` to force relogin


Download the raw data from W&B

In [None]:
raw_data = run.use_artifact(args["artifact_name"])
data_dir = raw_data.download()
logger.info("Path: {}".format(data_dir))

wandb: Downloading large artifact animals_raw_data:latest, 187.97MB. 3000 files... 
wandb:   3000 of 3000 files downloaded.  
Done. 0:0:28.1
13-10-2022 00:54:38 Path: ./artifacts/animals_raw_data:v0


Create a table with columns we want to track/compare

In [None]:
preview_dt = wandb.Table(columns=["id", "image", "label","size"])

Create a **new artifact** to store the EDA data

In [None]:
eda_data = wandb.Artifact(args["eda_name"], type="eda_data")

Grab the list of images that we'll be describing

In [None]:
imagePaths = list(paths.list_images(data_dir))

Add one row per image to the table (file name, image, label, and size)

In [None]:
for img in imagePaths:
  # img example: ./artifacts/animals_raw_data:v0/dogs/dogs_00892.jpg
  # after splitting, [-1] is the file name and [-2] is the class folder
  label = img.split(os.path.sep)
  image = cv2.imread(img)
  # image.shape is (height, width, channels)
  preview_dt.add_data(label[-1], wandb.Image(img), label[-2], str(image.shape[0]) + " X " + str(image.shape[1]))
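The label extraction in the loop relies only on the folder layout. A small standalone sketch, using the example path shown in the loop and a hypothetical helper name `split_label`:

```python
import os


def split_label(image_path):
    """Return (file_name, class_label) for a path shaped like
    .../{class}/{image}.jpg. `split_label` is a hypothetical helper."""
    parts = image_path.split(os.path.sep)
    return parts[-1], parts[-2]


# build the example path with the platform separator so the sketch
# behaves the same on every operating system
example = os.path.sep.join(
    [".", "artifacts", "animals_raw_data:v0", "dogs", "dogs_00892.jpg"]
)
file_name, class_label = split_label(example)
print(file_name, class_label)
```

Because the class name is read from the parent folder, this approach assumes every image sits directly inside a folder named after its class.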

Save artifact to W&B

In [None]:
eda_data.add(preview_dt, "EDA_Table")
run.log_artifact(eda_data)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f391afb2750>

---

### Step 03: Clean Data

Define a new args dictionary for cleaning the data

In [None]:
args = {
  "dataset": "clean_data",
  "label": "label",
  "project_name": "first_image_classifier",
  "artifact_name": "animals_raw_data:latest"
}

Download the raw data from W&B

In [None]:
raw_data = run.use_artifact(args["artifact_name"])
data_dir = raw_data.download()
logger.info("Path: {}".format(data_dir))

wandb: Downloading large artifact animals_raw_data:latest, 187.97MB. 3000 files... 
wandb:   3000 of 3000 files downloaded.  
Done. 0:0:0.4
13-10-2022 00:56:15 Path: ./artifacts/animals_raw_data:v0


**A simple preprocessor:**

In [None]:
class SimplePreprocessor:
	def __init__(self, width, height, inter=cv2.INTER_AREA):
		# store the target image width, height, and interpolation
		# method used when resizing
		self.width = width
		self.height = height
		self.inter = inter

	def preprocess(self, image):
		# resize the image to a fixed size, ignoring the aspect
		# ratio
		return cv2.resize(image, (self.width, self.height),interpolation=self.inter)
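Ignoring the aspect ratio means the horizontal and vertical scale factors generally differ, so the image content is stretched. The arithmetic can be sketched without OpenCV; the 500 x 375 source size below is a made-up example, not a measurement from the dataset, and `scale_factors` is a hypothetical helper.

```python
def scale_factors(src_width, src_height, dst_width, dst_height):
    """Return the (horizontal, vertical) scale applied by a fixed-size resize."""
    return dst_width / src_width, dst_height / src_height


# hypothetical 500 x 375 landscape image squeezed into 32 x 32
fx, fy = scale_factors(500, 375, 32, 32)
distorted = fx != fy  # unequal factors mean the content is stretched
print("fx={:.3f} fy={:.3f} distorted={}".format(fx, fy, distorted))
```

A square source image would give equal factors and therefore no distortion.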

**Building an image loader:**

In [None]:
class SimpleDatasetLoader:
  def __init__(self, preprocessors=None, logger=None):
    # store the image preprocessors and the logger; fall back to
    # the root logger when none is given
    self.preprocessors = preprocessors
    self.logger = logger if logger is not None else logging.getLogger()

    # if the preprocessors are None, initialize them as an
    # empty list
    if self.preprocessors is None:
      self.preprocessors = []

  def load(self, imagePaths, verbose=-1):
    # initialize the list of features and labels
    data = []
    labels = []

    # loop over the input images
    for (i, imagePath) in enumerate(imagePaths):
      # load the image and extract the class label assuming
      # that our path has the following format:
      # /path/to/dataset/{class}/{image}.jpg
      # e.g. ./artifacts/animals_raw_data:v0/dogs/dogs_00892.jpg
      # imagePath.split(os.path.sep)[-2] will return "dogs"
      image = cv2.imread(imagePath)
      label = imagePath.split(os.path.sep)[-2]

      # loop over the preprocessors (possibly an empty list) and
      # apply each to the image
      for p in self.preprocessors:
        image = p.preprocess(image)

      # treat our processed image as a "feature vector"
      # by updating the data list followed by the labels
      data.append(image)
      labels.append(label)

      # show an update every `verbose` images, using the logger
      # stored on the instance rather than a global one
      if verbose > 0 and i > 0 and (i + 1) % verbose == 0:
        self.logger.info("[INFO] processed {}/{}".format(i + 1, len(imagePaths)))

    # return a tuple of the data and labels
    return (np.array(data), np.array(labels))

Grab the list of images that we'll be describing

In [None]:
logger.info("[INFO] preprocessing images...")
imagePaths = list(paths.list_images(data_dir))

13-10-2022 00:56:29 [INFO] preprocessing images...


Initialize the ***image preprocessor***, load the dataset from disk, and reshape the data matrix so that each 32 x 32 x 3 image becomes a flat vector of 3,072 values

In [None]:
sp = SimplePreprocessor(32, 32)
sdl = SimpleDatasetLoader(preprocessors=[sp],logger=logger)
(data, labels) = sdl.load(imagePaths, verbose=500)
# 32 x 32 x 3 = 3072
data = data.reshape((data.shape[0], 3072))

13-10-2022 00:56:36 [INFO] processed 500/3000
13-10-2022 00:56:40 [INFO] processed 1000/3000
13-10-2022 00:56:41 [INFO] processed 1500/3000
13-10-2022 00:56:43 [INFO] processed 2000/3000
13-10-2022 00:56:44 [INFO] processed 2500/3000
13-10-2022 00:56:46 [INFO] processed 3000/3000


Show some information on memory consumption of the images

In [None]:
logger.info("[INFO] features matrix: {:.1f}MB".format(data.nbytes / (1024 * 1024)))

13-10-2022 00:56:51 [INFO] features matrix: 8.8MB
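The reported figure can be checked by hand. A short sketch of the arithmetic, assuming each pixel value is stored as one byte (the `uint8` arrays that `cv2.imread` returns for ordinary 8-bit images):

```python
# dataset dimensions taken from the steps above
num_images = 3000
width, height, channels = 32, 32, 3
bytes_per_value = 1  # assumption: uint8 pixel values

features_per_image = width * height * channels
total_bytes = num_images * features_per_image * bytes_per_value
size_mb = total_bytes / (1024 * 1024)
print("{} features per image, {:.1f}MB in total".format(features_per_image, size_mb))
```

Converting the matrix to a wider type such as `float32` would multiply this footprint by four.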


Show some information about the images and the data:

In [None]:
logger.info("Data shape: {}".format(data.shape))

13-10-2022 00:56:53 Data shape: (3000, 3072)


In [None]:
logger.info("Label shape: {}".format(labels.shape))

13-10-2022 00:56:55 Label shape: (3000,)


In [None]:
logger.info("Dumping the clean data artifacts to disk")

13-10-2022 00:56:56 Dumping the clean data artifacts to disk


Save the feature artifacts using joblib

In [None]:
joblib.dump(data, args["dataset"])

['clean_data']

Save the target using joblib

In [None]:
joblib.dump(labels, args["label"])

['label']

Clean data artifact

In [None]:
artifact = wandb.Artifact(args["dataset"],
                          type="CLEAN_DATA",
                          description="A joblib file containing the clean, preprocessed feature matrix"
                          )

In [None]:
logger.info("Logging clean data artifact")
artifact.add_file(args["dataset"])
run.log_artifact(artifact)

13-10-2022 00:57:07 Logging clean data artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7f392b2bf910>

Clean label artifact

In [None]:
artifact = wandb.Artifact(args["label"],
                          type="CLEAN_DATA",
                          description="A joblib file containing the clean labels"
                          )

In [None]:
logger.info("Logging clean label artifact")
artifact.add_file(args["label"])
run.log_artifact(artifact)

13-10-2022 00:57:12 Logging clean label artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7f392b297210>

In [None]:
run.finish()
