<a href="https://colab.research.google.com/github/MiguelEuripedes/embedded_AI/blob/main/Projects/first_image_classifier/mlp_classifier/dataSegregation_MLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Your First Image Classifier: Using MLP to Classify Images

## Data Segregation

The goal is to classify each image in the Animals dataset as containing a dog, a cat, or a panda. With only 3,000 images, Animals is a small introductory dataset on which we can quickly train a Multilayer Perceptron (MLP) model and compare its results with the previously trained KNN model and the future CNN model.

Now let's take the following steps:

1. Data segregation
2. Split clean data into train, validation and test

### Step 01: Setup

Start out by installing the experiment tracking library and setting up your free W&B account:

* **pip install wandb** – Install the W&B library
* **import wandb** – Import the wandb library
* **wandb login** – Log in to your W&B account so you can log all your metrics in one place

In [None]:
!pip install wandb -qU


In [None]:
import wandb
wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

True

**Import the packages**

In [None]:
# import the necessary packages
import logging
import joblib
from sklearn.model_selection import train_test_split
import wandb

Get a reference to the root logger:

In [None]:
logger = logging.getLogger()

Set level of logging:

In [None]:
logger.setLevel(logging.INFO)

Create handler:

In [None]:
c_handler = logging.StreamHandler()
c_format = logging.Formatter(fmt="%(asctime)s %(message)s",datefmt='%d-%m-%Y %H:%M:%S')
c_handler.setFormatter(c_format)

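Attach the handler to the logger, replacing any handlers that are already installed:

In [None]:
# Assigning the whole list also works when the logger starts with no handlers
logger.handlers = [c_handler]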

### Step 02: Data Segregation

We'll use the same strategy as before and replace the argument-parsing code with hard-coded arguments and values:

In [None]:
args = {
  "project_name": "mlp_classifier",
  "artifact_name_feature": "clean_features:latest",
  "artifact_name_target": "labels:latest",
  "train_feature_artifact": "train_x",
  "train_target_artifact": "train_y",
  "val_feature_artifact": "val_x",
  "val_target_artifact": "val_y",
  "test_feature_artifact": "test_x",
  "test_target_artifact": "test_y",
}
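
The hard-coded dictionary stands in for command-line parsing. For comparison, here is a minimal sketch of what an `argparse` equivalent might look like. The flag names simply mirror the dictionary keys, and `build_parser` and `DEFAULTS` are hypothetical names chosen for illustration, not taken from the original script.

```python
import argparse

# Defaults copied from the hard-coded dictionary above.
DEFAULTS = {
    "project_name": "mlp_classifier",
    "artifact_name_feature": "clean_features:latest",
    "artifact_name_target": "labels:latest",
    "train_feature_artifact": "train_x",
    "train_target_artifact": "train_y",
    "val_feature_artifact": "val_x",
    "val_target_artifact": "val_y",
    "test_feature_artifact": "test_x",
    "test_target_artifact": "test_y",
}


def build_parser():
    """Build a parser with one optional flag per key (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Data segregation step")
    for key, default in DEFAULTS.items():
        parser.add_argument(f"--{key}", type=str, default=default)
    return parser


# Parsing an empty argument list yields the defaults, i.e. the same
# dictionary that the notebook hard-codes.
args = vars(build_parser().parse_args([]))
```

Inside a notebook there is no command line to parse, which is why the plain dictionary is the simpler choice here.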

Let's open the W&B project created in the previous step:

In [None]:
run = wandb.init(entity="euripedes", project=args["project_name"], job_type="data_segregation")

wandb: Currently logged in as: euripedes. Use `wandb login --relogin` to force relogin


Now we can download the clean data from the previous notebook. Let's get the features and the labels from it.

In [None]:
logger.info("Downloading and reading clean data artifact")
clean_data = run.use_artifact(args["artifact_name_feature"])
clean_data_path = clean_data.file()

logger.info("Downloading and reading label data artifact")
label_data = run.use_artifact(args["artifact_name_target"])
label_data_path = label_data.file()

16-10-2022 03:06:46 Downloading and reading clean data artifact
16-10-2022 03:06:47 Downloading and reading label data artifact


Now unpack the artifacts:

In [None]:
data = joblib.load(clean_data_path)
label = joblib.load(label_data_path)

Now that we have the data, let's partition it into training, validation and test sets.

First we'll set aside 75% of the data for training and validation, and keep the remaining 25% for testing later on.

In [None]:
(train_x, test_x, train_y, test_y) = train_test_split(data, label, test_size=0.25, random_state=26)

Next, we take 25% of the remaining training data (train_x and train_y) as the validation set.

In [None]:
(train_x, val_x, train_y, val_y) = train_test_split(train_x, train_y, test_size=0.25, random_state=26)

Let's look at the shape of each partition:

In [None]:
logger.info("Train x: {}".format(train_x.shape))
logger.info("Train y: {}".format(train_y.shape))
logger.info("Validation x: {}".format(val_x.shape))
logger.info("Validation y: {}".format(val_y.shape))
logger.info("Test x: {}".format(test_x.shape))
logger.info("Test y: {}".format(test_y.shape))

16-10-2022 03:07:01 Train x: (1687, 3072)
16-10-2022 03:07:01 Train y: (1687,)
16-10-2022 03:07:01 Validation x: (563, 3072)
16-10-2022 03:07:01 Validation y: (563,)
16-10-2022 03:07:01 Test x: (750, 3072)
16-10-2022 03:07:01 Test y: (750,)


So the training set holds about 56% of the total data (1,687 images), the validation set about 19% (563 images), and the test set the remaining 25% (750 images).
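
The partition sizes logged above follow from applying two successive 25% hold-outs to 3,000 samples. The sketch below reproduces that arithmetic with the standard library only. `holdout_sizes` is a hypothetical helper, and it assumes the held-out count is rounded up while the remainder is kept, which matches the shapes reported in this notebook.

```python
import math


def holdout_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional hold-out.

    Assumption: the held-out count is rounded up and everything else is kept.
    """
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out


total = 3000
n_train_val, n_test = holdout_sizes(total, 0.25)   # first split: test set
n_train, n_val = holdout_sizes(n_train_val, 0.25)  # second split: validation set

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {n} samples ({n / total:.2%} of the dataset)")
```

The counts come out as 1687, 563 and 750, so the second 25% is taken from the 2,250 remaining samples rather than from the full dataset.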

### Dumping the train, validation and test data artifacts to disk and uploading them to W&B

Now that the split is done, we can save each partition to disk using joblib:

In [None]:
joblib.dump(train_x, args["train_feature_artifact"])
joblib.dump(train_y, args["train_target_artifact"])
joblib.dump(val_x, args["val_feature_artifact"])
joblib.dump(val_y, args["val_target_artifact"])
joblib.dump(test_x, args["test_feature_artifact"])
joblib.dump(test_y, args["test_target_artifact"])

logger.info("Dumping the train and validation data artifacts to the disk")

16-10-2022 03:08:30 Dumping the train and validation data artifacts to the disk
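
Before uploading, it can be worth confirming that each file actually landed on disk. The sketch below uses only the standard library. `missing_files` and `ARTIFACT_FILES` are hypothetical names, and the demonstration writes placeholder files to a temporary directory rather than touching the notebook's real outputs.

```python
import tempfile
from pathlib import Path

# File names taken from the values in the args dictionary.
ARTIFACT_FILES = ["train_x", "train_y", "val_x", "val_y", "test_x", "test_y"]


def missing_files(names, directory="."):
    """Return the names that do not exist as files inside `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ARTIFACT_FILES[:-1]:  # deliberately skip "test_y"
        (Path(tmp) / name).write_bytes(b"placeholder")
    not_found = missing_files(ARTIFACT_FILES, tmp)
```

In the notebook itself, calling `missing_files(ARTIFACT_FILES)` after the dump cell should return an empty list.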


**Train X Artifact:**

In [None]:
# train_x artifact
artifact = wandb.Artifact(args["train_feature_artifact"],
                          type="TRAIN_DATA",
                          description="A joblib file representing the train_x"
                          )

logger.info("Logging train_x artifact")
artifact.add_file(args["train_feature_artifact"])
run.log_artifact(artifact)

16-10-2022 03:08:37 Logging train_x artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7fa9a31b31d0>

**Train Y Artifact:**

In [None]:
# train_y artifact
artifact = wandb.Artifact(args["train_target_artifact"],
                          type="TRAIN_DATA",
                          description="A joblib file representing the train_y"
                          )

logger.info("Logging train_y artifact")
artifact.add_file(args["train_target_artifact"])
run.log_artifact(artifact)

16-10-2022 03:08:56 Logging train_y artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7fa9a31ac750>

**Validation X Artifact:**

In [None]:
# val_x artifact
artifact = wandb.Artifact(args["val_feature_artifact"],
                          type="VAL_DATA",
                          description="A joblib file representing the val_x"
                          )

logger.info("Logging val_x artifact")
artifact.add_file(args["val_feature_artifact"])
run.log_artifact(artifact)

16-10-2022 03:08:59 Logging val_x artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7fa9a31ac410>

**Validation Y Artifact:**

In [None]:
# val_y artifact
artifact = wandb.Artifact(args["val_target_artifact"],
                          type="VAL_DATA",
                          description="A joblib file representing the val_y"
                          )

logger.info("Logging val_y artifact")
artifact.add_file(args["val_target_artifact"])
run.log_artifact(artifact)

16-10-2022 03:09:06 Logging val_y artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7fa9a31acad0>

**Test X Artifact:**

In [None]:
# test_x artifact
artifact = wandb.Artifact(args["test_feature_artifact"],
                          type="TEST_DATA",
                          description="A joblib file representing the test_x"
                          )

logger.info("Logging test_x artifact")
artifact.add_file(args["test_feature_artifact"])
run.log_artifact(artifact)

16-10-2022 03:09:08 Logging test_x artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7fa9a31d41d0>

**Test Y Artifact:**

In [None]:
# test_y artifact
artifact = wandb.Artifact(args["test_target_artifact"],
                          type="TEST_DATA",
                          description="A joblib file representing the test_y"
                          )

logger.info("Logging test_y artifact")
artifact.add_file(args["test_target_artifact"])
run.log_artifact(artifact)

16-10-2022 03:09:12 Logging test_y artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7fa9a31e5450>
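
The six cells above repeat the same three calls with different names, so the pattern can be folded into a loop. The following is only a sketch: `log_split_artifacts` and `SPLITS` are hypothetical names, and the artifact constructor is passed in as an argument (in the notebook it would be `wandb.Artifact`) so that the example runs offline against small stand-in classes. The description string says "joblib" because the files are written with `joblib.dump`.

```python
# (key in args, artifact type) pairs, in the order used by the cells above.
SPLITS = [
    ("train_feature_artifact", "TRAIN_DATA"),
    ("train_target_artifact", "TRAIN_DATA"),
    ("val_feature_artifact", "VAL_DATA"),
    ("val_target_artifact", "VAL_DATA"),
    ("test_feature_artifact", "TEST_DATA"),
    ("test_target_artifact", "TEST_DATA"),
]


def log_split_artifacts(run, make_artifact, args, splits=SPLITS):
    """Create, fill and log one artifact per split file.

    `make_artifact` is expected to behave like `wandb.Artifact` and `run`
    like the object returned by `wandb.init`.
    """
    logged = []
    for key, artifact_type in splits:
        name = args[key]
        artifact = make_artifact(
            name,
            type=artifact_type,
            description=f"A joblib file representing the {name}",
        )
        artifact.add_file(name)      # the file on disk has the same name
        run.log_artifact(artifact)
        logged.append(name)
    return logged


# Stand-ins used only to exercise the loop without a W&B connection.
class FakeArtifact:
    def __init__(self, name, type, description):
        self.name, self.type, self.description = name, type, description
        self.files = []

    def add_file(self, path):
        self.files.append(path)


class FakeRun:
    def __init__(self):
        self.artifacts = []

    def log_artifact(self, artifact):
        self.artifacts.append(artifact)


demo_args = {
    "train_feature_artifact": "train_x",
    "train_target_artifact": "train_y",
    "val_feature_artifact": "val_x",
    "val_target_artifact": "val_y",
    "test_feature_artifact": "test_x",
    "test_target_artifact": "test_y",
}
demo_run = FakeRun()
logged_names = log_split_artifacts(demo_run, FakeArtifact, demo_args)
```

In the notebook, the equivalent call would be `log_split_artifacts(run, wandb.Artifact, args)`.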

Finish this part of the project by closing the run:

In [None]:
run.finish()
