<a href="https://colab.research.google.com/github/Joana-Mansa/image_classifier_extractor/blob/main/image_classification_feature_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Image Feature Extraction and Classification

- on the weather dataset

This notebook was created by following a tutorial by Computer Vision Engineer. The video can be found [here](https://www.youtube.com/watch?v=oEKg_jiV1Ng).

During the tutorial, you will:

 1. prepare the data
 2. train the model
 3. test performance
 4. save the model

The dataset can be found here; however, you need to partition the data into training and validation sets, which I did while following the tutorial.

### Installing Packages and Importing Libraries

In [2]:
pip install img2vec-pytorch

Collecting img2vec-pytorch
  Downloading img2vec_pytorch-1.0.1-py3-none-any.whl (6.9 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->img2vec-pytorch)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->img2vec-pytorch)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->img2vec-pytorch)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch->img2vec-pytorch)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch->img2vec-pytorch)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch->img2vec-pytorch)
  Using cached nvidia_cufft_cu12-11.0.2.54-p

In [3]:
pip install pillow



In [46]:
import torch
from img2vec_pytorch import Img2Vec


In [47]:

# instantiating the image to vector model
# to extract features from the data
img2vec_model = Img2Vec()



### 1. Preparing the Data

In [48]:

import os
import shutil
from sklearn.model_selection import train_test_split
from PIL import Image

In [49]:
from google.colab import drive
drive.mount('/content/drive')

data_dir = "/content/drive/My Drive/weather_dataset"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [20]:
train_dir = "/content/drive/My Drive/weather_dataset/training_set"
val_dir = "/content/drive/My Drive/weather_dataset/validation_set"


The dataset I downloaded didn't have the classes assembled in separate training and validation folders. The script below creates the training and validation folders and splits each class's images between them.

In [22]:
# list the folders currently inside the dataset directory

class_folders = os.listdir(data_dir)

print(class_folders)

['cloudy', 'sunrise', 'rain', 'shine', 'training_set', 'validation_set']


In [23]:
# creates the training and validation folders
# if they do not already exist
os.makedirs(train_dir, exist_ok=True)
os.makedirs(val_dir, exist_ok=True)

In [25]:
# iterates over each class folder
for class_folder in class_folders:
  # skip the split folders themselves, which also appear in class_folders
  if class_folder in ("training_set", "validation_set"):
    continue

  class_path = os.path.join(data_dir, class_folder)
  images = os.listdir(class_path)

  # split the images into training and validation sets
  train_images, val_images = train_test_split(images, test_size=0.2, random_state=42)

  # create class folders in the training and validation directories
  os.makedirs(os.path.join(train_dir, class_folder), exist_ok=True)
  os.makedirs(os.path.join(val_dir, class_folder), exist_ok=True)

  # copy the images to their respective training and validation folders
  for image in train_images:
    src = os.path.join(class_path, image)
    dest = os.path.join(train_dir, class_folder, image)
    shutil.copy(src, dest)

  for image in val_images:
    src = os.path.join(class_path, image)
    dest = os.path.join(val_dir, class_folder, image)
    shutil.copy(src, dest)


print("Dataset split into training and validation sets")


Without the skip at the top of the loop, the run failed with the error below, most likely because `class_folders` also contained `training_set` and `validation_set`:

IsADirectoryError: [Errno 21] Is a directory: '/content/drive/My Drive/weather_dataset/training_set/training_set'
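
A more general way to guard against this error is to build the class list explicitly, keeping only real directories and leaving out the split folders. The sketch below is a minimal illustration: the helper name `list_class_folders` and the temporary directory layout are assumptions made for the example, not part of the original notebook.

```python
import os
import tempfile

# Folder names used for the split in this notebook.
SPLIT_FOLDERS = {"training_set", "validation_set"}

def list_class_folders(data_dir, exclude=SPLIT_FOLDERS):
    """Return sorted names of sub-directories of data_dir that are not in exclude."""
    return sorted(
        name
        for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name)) and name not in exclude
    )

# Demo on a throwaway directory that mimics the dataset layout.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["cloudy", "sunrise", "rain", "shine", "training_set", "validation_set"]:
        os.makedirs(os.path.join(demo_dir, name))
    # A stray file at the top level is ignored as well.
    open(os.path.join(demo_dir, "notes.txt"), "w").close()
    demo_classes = list_class_folders(demo_dir)

print(demo_classes)
```

In the notebook, `class_folders = list_class_folders(data_dir)` could then replace the plain `os.listdir(data_dir)` call, so re-running the split cell would not pick up the folders it created earlier.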

Running the feature extraction code raised an error because some images have a single channel, whereas img2vec expects a 3-channel (RGB) image as input.
The mode `L` printed below means the image is grayscale.

In [33]:
img2 = Image.open("/content/drive/My Drive/weather_dataset/validation_set/cloudy/cloudy18.jpg")
# mode is the pixel format ('L' = single-channel grayscale), not a numeric count
channels = img2.mode
print("Number of channels:", channels)

Number of channels: L
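
To make the grayscale issue concrete, the sketch below builds a small in-memory grayscale image (a stand-in for `cloudy18.jpg`, so no file is needed), inspects its mode and bands, and converts it to RGB. `Image.new`, `getbands`, and `convert` are standard Pillow calls; the variable names are illustrative only.

```python
from PIL import Image

# In-memory stand-in for a grayscale photo (mode "L" = 8-bit, single channel).
gray = Image.new("L", (8, 8), color=128)
gray_bands = gray.getbands()   # one band name per channel

# Converting to RGB copies the single channel into three channels.
rgb = gray.convert("RGB")
rgb_bands = rgb.getbands()

print(gray.mode, len(gray_bands))
print(rgb.mode, len(rgb_bands))
```

`len(img.getbands())` gives the actual channel count, which is why calling `convert("RGB")` before feature extraction avoids the error.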


#### Feature Extraction

In [36]:
data = {}

# enumerate over the directories containing training and validation data
for j, dir_ in enumerate([train_dir, val_dir]):

  # initialise empty lists to store features and labels
  features = []
  labels = []

  # iterate over each category within the directory
  for category in os.listdir(dir_):

    # iterate over each image file within the category
    for img_path in os.listdir(os.path.join(dir_, category)):

      # gets the full path of the image
      img_path_ = os.path.join(dir_, category, img_path)
      # opens the image using PIL (Python Imaging Library)
      img = Image.open(img_path_)

      # resizes the image to 224x224
      img = img.resize((224, 224))
      # converts the image into RGB
      img = img.convert("RGB")

      # extract the features using img2vec (image to vector model)
      img_features = img2vec_model.get_vec(img)

      # appends the features and labels to the empty lists that were created
      features.append(img_features)
      labels.append(category)
  data[['training_data', 'validation_data'][j]] = features
  data[['training_labels', 'validation_labels'][j]] = labels

print(data.keys())


dict_keys(['training_data', 'training_labels', 'validation_data', 'validation_labels'])
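
The same folder walk can be written as a reusable function. The sketch below is only an illustration: `collect_features` and `fake_extract` are hypothetical names, and `fake_extract` stands in for the real extractor so the example runs without a model or images. In the notebook, the extractor would open the file, resize it, convert it to RGB, and call `img2vec_model.get_vec`.

```python
import os
import tempfile

def collect_features(root_dir, extract):
    """Walk root_dir/<category>/<file> and return (features, labels).

    extract is any callable that maps a file path to a feature vector.
    """
    features, labels = [], []
    for category in sorted(os.listdir(root_dir)):
        category_dir = os.path.join(root_dir, category)
        if not os.path.isdir(category_dir):
            continue  # ignore stray files next to the category folders
        for file_name in sorted(os.listdir(category_dir)):
            features.append(extract(os.path.join(category_dir, file_name)))
            labels.append(category)
    return features, labels

def fake_extract(path):
    # Hypothetical stand-in for a real feature extractor: a 1-element "vector".
    return [len(os.path.basename(path))]

# Demo on a throwaway directory with two categories and empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    for category, files in {"cloudy": ["a.jpg", "bb.jpg"], "rain": ["c.jpg"]}.items():
        os.makedirs(os.path.join(root, category))
        for file_name in files:
            open(os.path.join(root, category, file_name), "w").close()
    demo_features, demo_labels = collect_features(root, fake_extract)

print(demo_labels)
print(demo_features)
```

Each entry in the features list lines up with the label at the same index, which is the pairing the classifier relies on during training.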


NB: Difference between a feature and a label

Features are the measurable, quantifiable properties extracted from the data that are relevant to the task at hand; here, they are the vectors that img2vec returns for each image.

Labels are the classes we are trying to predict; here, they are the weather category folder names.

### 2. Training the Model

In [44]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pickle

In [40]:
model = RandomForestClassifier()
model.fit(data['training_data'], data['training_labels'])

### 3. Testing the Model Performance

In [43]:
y_pred = model.predict(data['validation_data'])
# accuracy_score expects the true labels first, then the predictions
score = accuracy_score(data['validation_labels'], y_pred)
print(score)

0.9515418502202643
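
Accuracy here is the fraction of validation images whose predicted label matches the true label. The plain-Python sketch below shows that calculation, plus a per-class breakdown, on made-up labels (they are not the notebook's results); `accuracy` and `per_class_accuracy` are hypothetical helper names.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_accuracy(y_true, y_pred):
    """Map each true class to the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Made-up labels for illustration only.
true_labels = ["cloudy", "cloudy", "rain", "shine", "sunrise"]
pred_labels = ["cloudy", "rain", "rain", "shine", "sunrise"]

overall = accuracy(true_labels, pred_labels)
by_class = per_class_accuracy(true_labels, pred_labels)
print(overall)
print(by_class)
```

A per-class view like this can reveal whether a high overall score hides one weak category, which a single accuracy number does not show.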


### 4. Saving the Model

In [45]:
# the with-block closes the file automatically
with open('./model.p', 'wb') as f:
  pickle.dump(model, f)
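
A quick way to check that a saved file is usable is to load it back and compare. The sketch below does this round trip with a small dictionary as a hypothetical stand-in for the trained model, written to a temporary directory rather than `./model.p`. Only unpickle files you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained model object.
stand_in_model = {"classes": ["cloudy", "rain", "shine", "sunrise"], "n_estimators": 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.p")

    # The with-block closes the file automatically, so no explicit close() is needed.
    with open(model_path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load the object back from disk.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```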

### Testing with a random image


I tested with images from the dataset and a random image from Google, and both gave the right prediction.

In [52]:
with open('./model.p', 'rb') as f:
  model = pickle.load(f)

In [56]:

image_path1 = '/content/cloud_example.png'

img = Image.open(image_path1)
# resizes the image to 224x224
img = img.resize((224, 224))
# converts the image into RGB
img = img.convert("RGB")

# extract features with the Img2Vec instance created earlier (img2vec_model)
features = img2vec_model.get_vec(img)

In [57]:
pred1 = model.predict([features])
print(pred1)

['cloudy']
