<a href="https://colab.research.google.com/github/4101swarna/BharatIntern2024_DS/blob/main/Copy_of_Dog_vs_Cat_Classifier_sdv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'dogs-vs-cats:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-competitions-data%2Fkaggle-v2%2F3362%2F31148%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240414%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240414T132925Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D4a9aec5b80b36bc6e874622bacf32fc61e00d966f10462d9934f8384921f894f67cd399071e08944b9b6cd4b96c61b10716ff4f09b0db4d4867042e2b495d975cb762f6514e4b420441c8e3e0c4eff67a3b41b8466c982fee2076603069a1fa78147cb6fa3176d28b5606e45538f38670ff98798718a87dfdcc4befa9c1882e7866dd5f94ab31e1e65ad9123ad68fafada82354e3a110b466818ede9c039f40bad2d21f09464f6d75e82f1aff58182b5599f2c32da6de65f9b1a8632bbe4b8ca32d822dbd5921c7ec7a18ea51187511649fd6b3f049d9570efabdf49ef18922fa664466684db33bfd2e6747045cc4f4db6e9c5e8c13edeaf00563f1923ac30e3'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              # Use a distinct name so the tarfile module is not shadowed on later iterations
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


Downloading dogs-vs-cats, 851576689 bytes compressed
[=                                                 ] 30679040 bytes downloaded

Import libraries

In [None]:
import cv2 as cv
import numpy as np
import matplotlib.pyplot as plt
import os
import zipfile
from collections import Counter
from scipy.cluster.vq import kmeans, vq
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
import seaborn

The function get_image_names returns all file names in the dataset directory. This list is later used to split the data into train and test sets.

In [None]:
def get_image_names(root_path):
    names = os.listdir(root_path)
    return names

The function get_images loads the images listed in img_name_list from the given root path and derives each label from the file name (0 for cat, 1 for dog).

In [None]:
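def get_images(root_path, img_name_list):
    img_list = []
    img_class_list = []

    for img_name in img_name_list:
        # Derive the label from the file name: 0 = cat, 1 = dog
        if 'cat' in img_name:
            label = 0
        elif 'dog' in img_name:
            label = 1
        else:
            continue  # skip other files so images and labels stay aligned

        img = cv.imread(os.path.join(root_path, img_name))
        if img is None:
            continue  # cv.imread returns None for unreadable files

        img_list.append(img)
        img_class_list.append(label)

    return img_list, img_class_list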
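
The substring check above works as long as no file name contains both words. As a minimal sketch of a stricter alternative, the label can be read from the file name prefix, assuming the training files are named like cat.<id>.jpg and dog.<id>.jpg (the helper name label_from_filename is hypothetical and not part of the notebook):

```python
from pathlib import Path


def label_from_filename(name):
    """Return 0 for cat, 1 for dog, or None if the prefix is unrecognized."""
    # Assumption: file names look like 'cat.12.jpg' or 'dog.7.jpg'
    prefix = Path(name).name.split('.')[0].lower()
    return {'cat': 0, 'dog': 1}.get(prefix)


print(label_from_filename('cat.12.jpg'))  # 0
print(label_from_filename('dog.7.jpg'))   # 1
print(label_from_filename('notes.txt'))   # None
```

Returning None for unknown prefixes lets the caller skip such files explicitly instead of silently misaligning images and labels.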

**Bag of Visual Words (BoVW)**

Consists of 3 parts:
1. Feature detection and description (we used SIFT)
2. Dictionary/codewords generation (perform k-means clustering over all descriptor vectors. The resulting cluster centers are the codewords/visual words, each of which represents a group of similar image patches.)
3. Vector quantization (compare the distances between the visual words and each image's feature descriptors using scipy.cluster.vq. If a feature descriptor is closest to centroid/visual word i, it belongs to cluster i. This produces a histogram (the bag of words) for each image, which represents the frequency of each visual word in the image.)
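
To make steps 2 and 3 concrete, here is a toy sketch with NumPy only. The 2-dimensional "descriptors" and the hand-picked codebook are illustrative assumptions (real SIFT descriptors have 128 dimensions and the codebook comes from k-means); the helper names assign_words and bovw_histogram are hypothetical.

```python
import numpy as np


def assign_words(descriptors, codebook):
    """Index of the nearest codeword (visual word) for each descriptor."""
    # Squared Euclidean distances, shape (n_descriptors, n_words)
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def bovw_histogram(descriptors, codebook):
    """Count how often each visual word occurs among the descriptors."""
    words = assign_words(descriptors, codebook)
    return np.bincount(words, minlength=len(codebook)).astype(np.float32)


# Toy codebook with 3 visual words and 4 descriptors from one "image"
codebook = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
descriptors = np.array([[1.0, 1.0], [9.0, 9.0], [0.5, 0.0], [1.0, 9.0]])

hist = bovw_histogram(descriptors, codebook)
print(hist)  # [2. 1. 1.]
```

Two descriptors fall closest to word 0 and one each to words 1 and 2, so the image is represented by the histogram [2, 1, 1] regardless of how many keypoints it has.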

Extract image descriptors using SIFT

In [None]:
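def get_sift_descriptors(img_list):
    sift = cv.SIFT_create()
    descriptor_list = []

    for img in img_list:
        _, descriptors = sift.detectAndCompute(img, None)
        if descriptors is None:
            # No keypoints found: use an empty array (SIFT descriptors have 128 dimensions)
            descriptors = np.zeros((0, 128), dtype=np.float32)
        descriptor_list.append(descriptors)

    return descriptor_list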

Elbow method (wcss)

In [None]:
def elbow_method_cluster(descriptor_list):
    # Stack all descriptors once; skip images for which SIFT found no keypoints
    stacked_descriptors = np.vstack([d for d in descriptor_list if d is not None and len(d) > 0])
    stacked_descriptors = np.float32(stacked_descriptors)

    wcss = []
    k_values = list(range(1, 10))
    for k in k_values:
        model = KMeans(n_clusters=k, init='k-means++', random_state=42)
        model.fit(stacked_descriptors)
        wcss.append(model.inertia_)  # inertia_ is the within-cluster sum of squares

    plt.plot(k_values, wcss, marker='o')
    plt.xticks(k_values)
    plt.title('Elbow Method: WCSS vs K (number of clusters)')
    plt.xlabel('K')
    plt.ylabel('Inertia')
    plt.show()
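
The elbow is read from the plot by eye in this notebook. As a rough cross-check, one possible heuristic (an assumption, not the method used here) is to pick the k where the WCSS curve bends most sharply, measured by the largest second difference. The WCSS values below are made up for illustration and are not results from this dataset.

```python
import numpy as np


def elbow_by_second_difference(k_values, wcss):
    """Return the k at which the discrete second difference of WCSS is largest."""
    wcss = np.asarray(wcss, dtype=float)
    # second_diff[i] measures the bend at k_values[i + 1]
    second_diff = wcss[:-2] - 2 * wcss[1:-1] + wcss[2:]
    return k_values[int(np.argmax(second_diff)) + 1]


k_values = [1, 2, 3, 4, 5, 6]
wcss = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]  # illustrative values only

print(elbow_by_second_difference(k_values, wcss))  # 3
```

The heuristic cannot select the first or last k, and on noisy curves it may disagree with visual inspection, so it is best treated as a sanity check.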

Clustering (creating centroids) using k-means. We experiment with two values of k (number of clusters): 2 (matching the number of classes, dog and cat) and a custom value (the optimal number according to the elbow method/WCSS).

In [None]:
def clustering(descriptor_list, k):
    # Stack all descriptors once; skip images for which SIFT found no keypoints
    stacked_descriptors = np.vstack([d for d in descriptor_list if d is not None and len(d) > 0])
    stacked_descriptors = np.float32(stacked_descriptors)

    # scipy.cluster.vq.kmeans: run k-means 20 times and keep the codebook with the lowest distortion
    centroids, _ = kmeans(stacked_descriptors, k, 20)

    return centroids

Vector quantization (Bag of Words)


In [None]:
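def vector_quantization(descriptor_list, number_of_images, centroids):
    image_features = np.zeros((number_of_images, len(centroids)), "float32")

    for i in range(number_of_images):
        descriptors = descriptor_list[i]
        if descriptors is None or len(descriptors) == 0:
            continue  # no keypoints: leave the histogram at zero
        # Assign each descriptor to its nearest centroid (visual word)
        words, _ = vq(descriptors, centroids)
        # Count how often each visual word occurs in the image
        image_features[i] = np.bincount(words, minlength=len(centroids))

    return image_features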

Histogram normalization using StandardScaler, which rescales each feature (the count of one visual word) to a mean of 0 and a standard deviation of 1, so that no single visual word dominates the distance computation in KNN.

In [None]:
def normalization(img_feature_list):
    stdscaler = StandardScaler().fit(img_feature_list)
    img_feature_list = stdscaler.transform(img_feature_list)

    return img_feature_list
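
Note that the function above fits a new scaler on whatever array it receives. A common alternative, sketched below with NumPy only, is to compute the mean and standard deviation on the training histograms and reuse them for the test histograms, so both sets are scaled identically. The helper names and the toy histograms are assumptions for illustration; this is not the code used for the results in this notebook.

```python
import numpy as np


def fit_standardizer(train):
    """Per-feature mean and standard deviation, computed on training data only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses
    std[std == 0] = 1.0      # avoid division by zero for constant features
    return mean, std


def apply_standardizer(x, mean, std):
    return (x - mean) / std


# Toy histograms: 3 training images, 1 test image, 2 visual words
train_hist = np.array([[2.0, 8.0], [4.0, 6.0], [6.0, 4.0]])
test_hist = np.array([[4.0, 6.0]])

mean, std = fit_standardizer(train_hist)
train_scaled = apply_standardizer(train_hist, mean, std)
test_scaled = apply_standardizer(test_hist, mean, std)

print(train_scaled.mean(axis=0))  # approximately [0. 0.]
print(test_scaled)                # [[0. 0.]] because the test image equals the training mean
```

With scikit-learn the same idea corresponds to calling fit on the training features once and transform on both sets.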

**Classification using nearest neighbors**

Elbow method to determine the optimal number of nearest neighbors (k) in KNN by plotting the error rate against k. Only odd values of k are considered so that the two-class majority vote cannot end in a tie.

In [None]:
def elbow_method_neighbor(train_feature_list, train_class_list, test_feature_list, test_class_list):
    k_values = list(range(1, 300, 2))  # odd k only
    error_rate = []
    min_error = float('inf')
    min_idx = -1
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(train_feature_list, train_class_list)
        pred = knn.predict(test_feature_list)
        err = np.mean(pred != np.asarray(test_class_list))
        error_rate.append(err)

        if err < min_error:
            min_error = err
            min_idx = k

    print("Minimum error rate is at k = ", min_idx, "with error rate of ", min_error)
    plt.figure(figsize=(20, 5))
    # Plot against the actual k values so the tick labels match the data points
    plt.plot(k_values, error_rate, marker='o')
    plt.xticks([1] + list(range(10, 301, 10)))
    plt.title('Elbow Method: Error rate vs K (number of neighbors)')
    plt.xlabel('K')
    plt.ylabel('Error Rate')
    plt.show()

    return min_idx, min_error

KNN

K-nearest neighbors (KNN) is a supervised machine learning algorithm used for classification and prediction. Since this is a classification task, the algorithm must determine whether an object is a 'Dog' or a 'Cat'. To do so, KNN uses 'majority voting': it looks up the K nearest training points and predicts the class that occurs most frequently among them.

In [None]:
def KNN(train_feature_list, train_class_list, test_feature_list, test_class_list, k):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_feature_list, train_class_list)

    results = knn.predict(test_feature_list)
    return results
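
The majority vote described above can be sketched in a few lines of NumPy. This is a simplified illustration with toy 2-dimensional points, not a replacement for KNeighborsClassifier; the function name knn_predict and the data are assumptions.

```python
from collections import Counter

import numpy as np


def knn_predict(train_x, train_y, query, k):
    """Predict the label of one query point by majority vote of its k nearest neighbors."""
    dists = np.linalg.norm(train_x - query, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: three "cat" points (label 0) near the origin, two "dog" points (label 1) near (5, 5)
train_x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
train_y = [0, 0, 0, 1, 1]

print(knn_predict(train_x, train_y, np.array([4.5, 5.0]), k=3))  # 1
print(knn_predict(train_x, train_y, np.array([0.2, 0.2]), k=3))  # 0
```

For the first query, the three nearest neighbors carry the labels 1, 1 and 0, so the vote is 2 to 1 for class 1. With an odd k and two classes the vote can never tie.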

**Main code**

Our KNN experiment consists of 2 models:
1. KNN with 2 clusters/centroids and elbow method observation-based number of neighbors
2. KNN with elbow method observation-based number of clusters and also elbow method observation-based number of neighbors

In [None]:
with zipfile.ZipFile("../input/dogs-vs-cats/train.zip",'r') as z:
    z.extractall(".")

Read and split train and test images

In [None]:
root_path = "./train"
img_name_list = get_image_names(root_path)
train_name_list, test_name_list = train_test_split(img_name_list, test_size=0.1, random_state=42)

train_name_list = train_name_list[:4200]
test_name_list = test_name_list[:1800]

train_image_list, train_class_list = get_images(root_path, train_name_list)
test_image_list, test_class_list = get_images(root_path, test_name_list)

print("Number of train images = ", len(train_image_list))
print(Counter(train_class_list))
print("Number of test images = ", len(test_image_list))
print(Counter(test_class_list))

Number of train images =  4200
Counter({1: 2112, 0: 2088})
Number of test images =  1800
Counter({1: 919, 0: 881})


Due to memory limits, only 6,000 of the 25,000 images are used: 4,200 for training and 1,800 for testing, a ratio of 7:3 (the 9:1 split produced by test_size=0.1 is truncated afterwards). The counters above show the class distribution in the train and test sets, with 0 for cat and 1 for dog. The numbers show that both sets are roughly balanced.

In [None]:
train_descriptor_list = get_sift_descriptors(train_image_list)
test_descriptor_list = get_sift_descriptors(test_image_list)

(1) KNN with 2 clusters/centroids and elbow method observation-based number of neighbors

Following the simple reasoning that this dataset has 2 categories (dog and cat), our first attempt uses k (number of clusters) = 2.

In [None]:
centroids = clustering(train_descriptor_list, 2)
train_feature_list = vector_quantization(train_descriptor_list, len(train_image_list), centroids)
train_feature_list = normalization(train_feature_list)

test_feature_list = vector_quantization(test_descriptor_list, len(test_image_list), centroids)
test_feature_list = normalization(test_feature_list)

In [None]:
min_error_idx, min_error_val = elbow_method_neighbor(train_feature_list, train_class_list, test_feature_list, test_class_list)

Using the elbow method, we can observe the error rates visualized in the plot above. The global minimum is found at k = 41, with an error rate of 0.39. Visually, the error rate decreases on average only until approximately k = 40. Therefore, we conclude that the elbow is located at the same point as the global minimum, which is k = 41.
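
Because the error curve is noisy, the "decrease on average" is easier to judge on a smoothed curve. A minimal sketch using a moving average follows; the window size and the error values are illustrative assumptions, not results from this notebook.

```python
import numpy as np


def moving_average(values, window):
    """Simple moving average; the output has len(values) - window + 1 entries."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode='valid')


error_rate = [0.50, 0.44, 0.46, 0.40, 0.42, 0.41, 0.43]  # illustrative values only
smoothed = moving_average(error_rate, window=3)

print(len(smoothed))  # 5
print(smoothed)
```

Plotting the smoothed values next to the raw error rates makes it clearer where the downward trend flattens out.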

In [None]:
results = KNN(train_feature_list, train_class_list, test_feature_list, test_class_list, 41)

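conf_matrix = confusion_matrix(test_class_list, results)
# confusion_matrix puts actual classes on the rows (y-axis) and predicted classes on the columns (x-axis)
ax = seaborn.heatmap(conf_matrix, xticklabels=['Cat', 'Dog'], yticklabels=['Cat', 'Dog'], annot=True, cmap='Blues', fmt='g')
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
plt.show()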

In [None]:
print(classification_report(test_class_list, results, target_names=['Cat', 'Dog']))
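
The accuracy, precision and recall in the classification report can be derived directly from the confusion matrix. The sketch below uses scikit-learn's convention (rows are actual classes, columns are predicted classes); the counts are made up for illustration and are not the results of this notebook.

```python
import numpy as np

# Rows = actual class, columns = predicted class (0 = cat, 1 = dog); illustrative counts only
conf = np.array([[50, 10],
                 [20, 20]])

accuracy = np.trace(conf) / conf.sum()        # correct predictions / all predictions
precision = np.diag(conf) / conf.sum(axis=0)  # per class: correct / everything predicted as that class
recall = np.diag(conf) / conf.sum(axis=1)     # per class: correct / everything actually in that class

print(accuracy)   # 0.7
print(precision)  # approximately [0.714 0.667]
print(recall)     # approximately [0.833 0.5]
```

Here 50 of the 60 actual cats are recognized (recall 0.833), while only 20 of the 40 actual dogs are (recall 0.5), which a single accuracy figure of 0.7 would hide.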

(2) KNN with elbow method observation-based number of clusters and also elbow method observation-based number of neighbors

However, when observing the dataset, it is apparent that the images are diverse, with many different types of dogs and cats. It would make sense to use more than 2 clusters, so that the clusters not only separate cats from dogs but might also help differentiate subtypes of cats and dogs. To find this optimal number of clusters, we use the elbow method with WCSS.

In [None]:
elbow_method_cluster(train_descriptor_list)

The WCSS plot shows that the elbow (or bend) is at either k = 3 or k = 4 clusters. Since the curve after k = 3 forms an almost straight, flat line, the optimal number of clusters we will use is 3.

In [None]:
centroids = clustering(train_descriptor_list, 3)
train_feature_list = vector_quantization(train_descriptor_list, len(train_image_list), centroids)
train_feature_list = normalization(train_feature_list)

test_feature_list = vector_quantization(test_descriptor_list, len(test_image_list), centroids)
test_feature_list = normalization(test_feature_list)

In [None]:
min_error_idx, min_error_val = elbow_method_neighbor(train_feature_list, train_class_list, test_feature_list, test_class_list)

This plot also doesn't show a proper curve, which implies that the clusters in this model most likely have irregular shapes. Although the global minimum is located at k = 73 neighbors, we can observe that the error rate decreases on average only until approximately k = 60.
Therefore, we conclude that the elbow is located at k = 61.

In [None]:
results = KNN(train_feature_list, train_class_list, test_feature_list, test_class_list, 61)

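conf_matrix = confusion_matrix(test_class_list, results)
# confusion_matrix puts actual classes on the rows (y-axis) and predicted classes on the columns (x-axis)
ax = seaborn.heatmap(conf_matrix, xticklabels=['Cat', 'Dog'], yticklabels=['Cat', 'Dog'], annot=True, cmap='Blues', fmt='g')
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
plt.show()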

In [None]:
print(classification_report(test_class_list, results, target_names=['Cat', 'Dog']))

In conclusion, on the Dogs vs Cats dataset, KNN with 2 clusters obtained a test accuracy of 61%, while KNN with the optimal number of clusters (k = 3) obtained a test accuracy of 59%. Choosing a custom value of k (clusters) with the elbow method (WCSS) therefore brought no improvement. This is likely due to the clusters' irregular shapes, which is suggested by the irregular curves of the error rate plots.

In [None]:
plt.figure(figsize=(10, 30))
labels = ["Cat", "Dog"]
for i in range(30):
    plt.subplot(6, 5, i + 1)  # select the subplot first so the tick settings apply to it
    plt.xticks([])
    plt.yticks([])
    plt.imshow(cv.cvtColor(test_image_list[i], cv.COLOR_BGR2RGB))
    plt.title(labels[results[i]])
print("Preview of 30 Random KNN Classification Results (with 2nd model)")
plt.show()