# Very Basic Image Classifier in notebook

This notebook contains a very basic image classifier to classify images of apples and oranges. This model is only for educational purposes and is not intended to be used in production. The model is trained on a very small dataset of 10 images of apples and 10 images of oranges, so the model is not expected to perform well on unseen data.

### Imports

This code cell imports `tensorflow` under the conventional alias `tf`. The Keras utilities used later (`image_dataset_from_directory` and the layer classes) are imported in the cells that need them.

The cell then prints the TensorFlow version, which is useful for verifying that you are using the intended version.

In [None]:
# TensorFlow and tf.keras
import tensorflow as tf

print(tf.__version__)

### Constants

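This code cell defines the configuration constants used in the rest of the notebook: `DATASET` is the root directory of the image dataset, `IMAGE_SIZE` is the height and width (in pixels) to which every image is resized, `BATCH_SIZE` is the number of images per batch, and `EPOCHS` is the number of passes over the training data.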

In [None]:
DATASET = "dataset"
IMAGE_SIZE = (150, 150)
BATCH_SIZE = 8
EPOCHS = 10

### Creating the dataset

This shell command removes the `dataset` directory and its contents using `rm` with the `-rf` flags, so that the download cell that follows starts from a clean state. The `-r` flag stands for "recursive": it removes not only the directory itself but also all files and subdirectories within it. The `-f` flag stands for "force": it suppresses confirmation prompts and ignores a missing directory, so the command also succeeds on a first run.

In [None]:
!rm -rf dataset
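
The `!rm` shell escape assumes a Unix-like shell. As a portable alternative, the following is a minimal sketch using `shutil.rmtree` from the standard library. The helper name `reset_directory` is hypothetical and not part of the original notebook, and the demonstration runs on a throwaway temporary directory rather than on the real dataset.

```python
import shutil
import tempfile
from pathlib import Path


def reset_directory(path):
    """Delete `path` and everything below it; do nothing if it is absent."""
    # ignore_errors=True mirrors the -f flag: a missing directory is not an error.
    shutil.rmtree(path, ignore_errors=True)
    return not Path(path).exists()


# Demonstrate on a temporary directory with the same layout as the dataset.
demo_root = Path(tempfile.mkdtemp())
(demo_root / "dataset" / "train" / "apples").mkdir(parents=True)
removed = reset_directory(demo_root / "dataset")
print(removed)
```

In the notebook itself, `reset_directory("dataset")` would play the role of the shell command.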

This code cell downloads the image data and organizes it into a structured directory hierarchy, a common step in image classification projects. It downloads ten images of apples and ten images of oranges from the HOGENT-MLOps `mlops-labs` GitHub repository and, per class, places the first eight in the training set, the ninth in the validation set, and the tenth in the test set.


The dataset provided serves as a useful illustration of machine learning concepts, showcasing how to organize and prepare data for a model. However, it's important to note that this example dataset is exceptionally small in scale, containing just a handful of images of apples and oranges. In practice, real-world datasets can be significantly larger and more complex.

Collecting real datasets for machine learning tasks often presents substantial challenges. Here are a few key considerations:

- **Size:** Real datasets may consist of thousands or even millions of samples, necessitating extensive storage and computational resources for handling and processing.
- **Labeling:** In many cases, each data point must be labeled or categorized correctly. Manual labeling can be a time-consuming and labor-intensive process, especially for large datasets.
- **Diversity:** Real datasets often exhibit a wide range of variations, noise, and complexities, making them more representative of the challenges encountered in real-world applications.
- **Bias and Fairness:** Ensuring that a dataset is unbiased and fairly represents diverse demographics and scenarios is crucial for ethical and accurate machine learning.
- **Privacy and Compliance:** Handling sensitive or personal data requires strict adherence to privacy regulations, adding legal and ethical dimensions to dataset collection.

In [None]:
import os
import requests

for path in [
    "dataset/train/apples",
    "dataset/val/apples",
    "dataset/test/apples",
    "dataset/train/oranges",
    "dataset/val/oranges",
    "dataset/test/oranges",
]:
    os.makedirs(path)


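def download_from_list(img_names, fruit):
    """Download the images of one class and sort them into train/val/test."""
    for i, img_name in enumerate(img_names):
        response = requests.get(
            f"https://github.com/HOGENT-MLOps/mlops-labs/blob/main/resources/03-ml-workflow/img-lab3/{fruit}s/{img_name}?raw=true"
        )
        # Fail loudly on an HTTP error instead of saving an error page as an image.
        response.raise_for_status()

        # Images 0-7 go to training, image 8 to validation, image 9 to test.
        ml_split = "train"
        if i == 9:
            ml_split = "test"
        elif i == 8:
            ml_split = "val"

        with open(f"dataset/{ml_split}/{fruit}s/{fruit}{i}.jpeg", "wb") as file:
            file.write(response.content)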


download_from_list(
    (f"apple-{i}.jpeg" for i in range(1, 11)),
    "apple",
)

download_from_list(
    (f"orange-{i}.jpeg" for i in range(1, 11)),
    "orange",
)
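
To make the resulting split explicit, the sketch below mirrors the `if`/`elif` logic of the download function in isolation and counts how many images of one class land in each split. The helper name `split_for_index` is hypothetical and only serves this illustration; no files are downloaded.

```python
from collections import Counter


def split_for_index(i):
    """Return the split for the i-th (0-based) downloaded image of a class."""
    if i == 9:
        return "test"
    if i == 8:
        return "val"
    return "train"


# Ten images are downloaded per class, indexed 0 to 9.
counts = Counter(split_for_index(i) for i in range(10))
print(dict(counts))  # {'train': 8, 'val': 1, 'test': 1}
```

With two classes, this gives 16 training images, 2 validation images, and 2 test images in total.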

### Preprocessing

In the machine learning workflow, preprocessing is a crucial step that focuses on preparing and enhancing the raw data before it's fed into a model. It plays a pivotal role in shaping the success of a machine learning algorithm. Data preprocessing encompasses various tasks, including cleaning, transformation, and feature engineering.

In the case of image data, preprocessing often includes resizing images, rescaling pixel values, and organizing data into batches. The following code cell uses `image_dataset_from_directory` to resize every image to `IMAGE_SIZE`, group the images into batches of `BATCH_SIZE`, and infer a binary label from the subdirectory name (`apples` or `oranges`). It does not rescale pixel values, which stay in the range 0 to 255.

In [None]:
from keras.utils import image_dataset_from_directory

train_dataset = image_dataset_from_directory(
    directory=f"{DATASET}/train",
    image_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
    label_mode='binary',
    shuffle=True
)

validation_dataset = image_dataset_from_directory(
    directory=f"{DATASET}/val",
    image_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
    label_mode='binary',
)

test_dataset = image_dataset_from_directory(
    directory=f"{DATASET}/test",
    image_size=IMAGE_SIZE,
    batch_size=BATCH_SIZE,
    label_mode='binary',
)
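
When labels are inferred from the directory structure, `image_dataset_from_directory` sorts the class directory names alphanumerically and numbers them from 0. The sketch below reproduces that mapping in plain Python without calling Keras; the variable names `class_dirs` and `label_of` are made up for this illustration.

```python
# Subdirectories of dataset/train, listed in arbitrary order.
class_dirs = ["oranges", "apples"]

# Sorting the names determines which class gets which integer label.
class_names = sorted(class_dirs)
label_of = {name: index for index, name in enumerate(class_names)}
print(label_of)  # {'apples': 0, 'oranges': 1}
```

In the notebook, `train_dataset.class_names` should report the same order, which is worth checking before interpreting predictions.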

### Build the model

This code cell defines and builds a Convolutional Neural Network (CNN) model using the Keras library. CNNs are a class of deep learning models commonly used for image classification and computer vision tasks. The model architecture is deliberately simple: one convolutional layer, ReLU activations, and two dense layers.

> **Challenge**: Can you think of changes we could make that could improve our final result?

In [None]:
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import Activation, Flatten, Dense


model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(150, 150, 3)))
model.add(Activation('relu'))

model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))

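# The final layer outputs two raw scores (logits), one per class. No activation
# is applied because the loss is configured with from_logits=True.
model.add(Dense(2))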
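
As a rough check on the size of this architecture, the arithmetic below estimates the output shape of the convolutional layer and the number of trainable parameters per layer. It is a sketch that assumes the Keras defaults for `Conv2D` (stride 1, no padding); the helper `conv2d_output_size` is hypothetical, and `model.summary()` remains the authoritative source.

```python
def conv2d_output_size(size, kernel):
    """Spatial size after a stride-1 convolution without padding."""
    return size - kernel + 1


height = width = 150
channels, filters, kernel = 3, 32, 3

out_h = conv2d_output_size(height, kernel)
out_w = conv2d_output_size(width, kernel)

# Each filter has kernel * kernel * channels weights plus one bias.
conv_params = kernel * kernel * channels * filters + filters
# Flatten turns the feature maps into one long vector.
flat_features = out_h * out_w * filters
# A dense layer has inputs * units weights plus one bias per unit.
dense1_params = flat_features * 64 + 64
dense2_params = 64 * 2 + 2

total_params = conv_params + dense1_params + dense2_params
print(out_h, out_w, flat_features)  # 148 148 700928
print(total_params)                 # 44860482
```

Almost all parameters sit in the first dense layer, because the flattened feature map is very large.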

This code cell compiles the model using the Keras library. Compilation is an essential step in preparing the model for training: it specifies the optimizer, the loss function, and the evaluation metrics. Here the model uses the Adam optimizer, sparse categorical cross-entropy computed from logits as the loss, and accuracy as the metric.

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
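
With `from_logits=True`, the loss treats the model outputs as raw scores and applies softmax itself before taking the negative log-probability of the true class. The following is a minimal per-sample sketch of that computation in plain Python; the function name is made up for this illustration, and Keras additionally averages the result over the batch.

```python
import math


def sparse_categorical_crossentropy_from_logits(logits, label):
    """Return -log(softmax(logits)[label]) for a single sample."""
    # Subtracting the maximum keeps the exponentials numerically stable.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Equal scores: the model is undecided, so the loss equals log(2).
loss_undecided = sparse_categorical_crossentropy_from_logits([0.0, 0.0], 0)
# A confident, correct prediction gives a loss close to zero.
loss_correct = sparse_categorical_crossentropy_from_logits([4.0, -4.0], 0)
# A confident, wrong prediction is penalized heavily.
loss_wrong = sparse_categorical_crossentropy_from_logits([4.0, -4.0], 1)
print(loss_undecided, loss_correct, loss_wrong)
```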

### Training the model

This code cell trains the compiled model on the training dataset and evaluates it on the validation dataset after every epoch. It uses the `fit` method in Keras, which returns a `History` object that stores the loss and metric values per epoch for later analysis and visualization.

> **Note**: Training doesn't take long at all since we only have a toy dataset to work with. In real cases this can take hours or days to complete.

> **Challenge**: Can you think of changes we could make that could speed up training?

In [None]:
history = model.fit(
    train_dataset,
    epochs=EPOCHS,
    validation_data=validation_dataset
)
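
To see why training is so quick, the arithmetic below counts the weight updates performed in this notebook. It assumes the 16 training images produced by the download cell (8 apples and 8 oranges) and repeats the values of `BATCH_SIZE` and `EPOCHS` from the constants cell so that the snippet runs on its own.

```python
import math

train_images = 16  # 8 apples + 8 oranges
BATCH_SIZE = 8
EPOCHS = 10

# One step processes one batch; the last batch may be smaller than BATCH_SIZE.
steps_per_epoch = math.ceil(train_images / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
print(steps_per_epoch, total_steps)  # 2 20
```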

### Evaluation

This code cell evaluates the trained model on the test dataset. It uses the `evaluate` method in Keras to compute the model's test loss and accuracy.

In [None]:
test_loss, test_accuracy = model.evaluate(
    test_dataset,
    steps=len(test_dataset)
)

print('Test accuracy:', test_accuracy)
print('Test Loss:', test_loss)
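
When reading the reported accuracy, keep in mind how small the test set is. Assuming the two test images produced by the download cell (one apple and one orange), the sketch below lists every accuracy value the evaluation can possibly report.

```python
test_images = 2  # 1 apple + 1 orange

# Accuracy is the number of correct predictions divided by the number of images.
possible_accuracies = [correct / test_images for correct in range(test_images + 1)]
print(possible_accuracies)  # [0.0, 0.5, 1.0]
```

A single misclassified image therefore moves the test accuracy by 50 percentage points, so the number is a very coarse indicator of model quality.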