# HW 3

# Task 3.1

# Cat and Dog Image Classification

This project involves building a binary classifier to distinguish between images of cats and dogs using logistic regression. Below are the steps followed in this project:

## 1. Dataset
- The project uses the [Cat and Dog dataset from Kaggle](https://www.kaggle.com/datasets/tongpython/cat-and-dog).
- This dataset consists of images of cats and dogs, which are used to train and test the model for binary classification.

## 2. Prepare Dataset and Train Model
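- The images are first preprocessed into a form the logistic regression model accepts. This includes resizing each image to a fixed size (40×40 pixels in the code below), flattening it into a 1D array, and scaling pixel values to the range [0, 1]. All images must share one color mode (grayscale or RGB) so that the flattened vectors have equal length.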
- We use the Logistic Regression model from the scikit-learn library. More details can be found on the [LogisticRegression documentation page](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
- The Kaggle archive already ships separate `training_set` and `test_set` folders, so no additional split is needed. The model is trained on the training set and evaluated on the test set.
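
The preprocessing and training steps above can be sketched end to end on synthetic images, so the example runs without the Kaggle files. The helper name `image_to_vector`, the random dark/bright stand-in images, the explicit RGB conversion, and `max_iter=1000` are assumptions made for illustration, not settings taken from this notebook.

```python
import numpy as np
from PIL import Image
from sklearn.linear_model import LogisticRegression


def image_to_vector(image, size=(40, 40)):
    """Force RGB, resize, flatten to 1D, and scale pixels to [0, 1]."""
    resized = image.convert("RGB").resize(size)
    return np.asarray(resized, dtype=np.float64).flatten() / 255.0


# Synthetic stand-ins for the photos: dark images for class 0, bright for class 1.
rng = np.random.default_rng(0)
images, targets = [], []
for label, (low, high) in enumerate([(0, 100), (156, 256)]):
    for _ in range(20):
        pixels = rng.integers(low, high, size=(64, 64, 3), dtype=np.uint8)
        images.append(Image.fromarray(pixels))
        targets.append(label)

X = np.array([image_to_vector(img) for img in images])  # one row per image
y = np.array(targets)                                   # 1D label vector

model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print(X.shape, model.score(X, y))
```

Each row of `X` has 40 × 40 × 3 = 4800 features, which is why a consistent color mode matters: a grayscale image would flatten to only 1600 values and could not be stacked with the RGB rows.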

## 3. Model Evaluation
- The accuracy of the model is calculated by comparing the model’s predictions on the test set with the actual labels.
- Accuracy is determined using the formula: $$\frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$
- This metric gives us an understanding of how well our model performs in distinguishing between images of cats and dogs.
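
As a small worked example of the accuracy formula, the sketch below counts correct predictions by hand and compares the result with scikit-learn's `accuracy_score`. The label arrays are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 0 = cat, 1 = dog
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

correct = int(np.sum(y_true == y_pred))  # number of correct predictions
total = len(y_true)                      # total number of predictions
manual_accuracy = correct / total

print(manual_accuracy)
print(accuracy_score(y_true, y_pred))
```

Four of the six hypothetical predictions match, so both lines print the same value, 4/6.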





In [1]:
from pathlib import Path
import pprint
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from PIL import Image

In [2]:
# data paths
data_dir = Path('data')
data_path = data_dir / 'cats_and_dogs'
train_data_path = data_path / 'training_set' / 'training_set'
test_data_path = data_path / 'test_set' / 'test_set'
train_cats_path = train_data_path / 'cats'
train_dogs_path = train_data_path / 'dogs'
test_cats_path = test_data_path / 'cats'
test_dogs_path = test_data_path / 'dogs'

In [3]:
# lazy iterators over the train/test image files (each generator can be consumed only once)
train_cats = train_cats_path.glob('*.jpg')
train_dogs = train_dogs_path.glob('*.jpg')
test_cats = test_cats_path.glob('*.jpg')
test_dogs = test_dogs_path.glob('*.jpg')

# values of classes
labels = {'cat': 0, 'dog': 1}

In [4]:
# preprocess data
from typing import Iterable


def preprocess(image_path: Iterable[Path]):
    """Preprocess images into a numpy array given their paths."""
    sizes = (40, 40)
    image_data = []
    for i, p in enumerate(image_path):
        # only the first 5 images of each iterator are loaded
        if i == 5:
            break
        with Image.open(p) as image:
            # force RGB so every flattened vector has the same length
            image = image.convert('RGB').resize(sizes)
            # flatten to 1D and scale pixel values to [0, 1]
            image_data.append(np.array(image).flatten() / 255.0)
    return np.array(image_data)


def create_dataset(image_path: Iterable[Path], label: int = 0, test: bool = False):
    """Create a dataset from the given image paths, with the label as the last column."""
    image_processed = preprocess(image_path)
    if test:
        # unlabeled mode: return the features only
        return image_processed
    label_column = np.full((len(image_processed), 1), label, dtype=float)
    return np.hstack((image_processed, label_column))

In [5]:
# Create our datasets
cat_train_set = create_dataset(train_cats, label=0)
print('Cats train data:\n', cat_train_set[:5],'\n')
dog_train_set = create_dataset(train_dogs, label=1)
print('Dogs train data:\n', dog_train_set[:5],'\n')
train_dataset = np.random.permutation(np.concatenate((cat_train_set, dog_train_set)))
print('Shuffled train data:\n', train_dataset[:5],'\n')
cat_test_set = create_dataset(test_cats, label=0)
print('Cats test data:\n', cat_test_set[:5],'\n')
dog_test_set = create_dataset(test_dogs, label=1)
print('Dogs test data:\n', dog_test_set[:5],'\n')
test_dataset = np.random.permutation(np.concatenate((cat_test_set, dog_test_set)))
print('Shuffled test data:\n', test_dataset[:5],'\n')

Cats train data:
 [[0.15294118 0.16862745 0.17647059 ... 0.16862745 0.14901961 0.        ]
 [0.15294118 0.18823529 0.21568627 ... 0.72941176 0.02352941 0.        ]
 [0.87058824 0.8745098  0.85490196 ... 0.83529412 0.81960784 0.        ]
 [0.49803922 0.39215686 0.25098039 ... 0.3254902  0.21960784 0.        ]
 [0.21960784 0.18039216 0.05490196 ... 0.56470588 0.36078431 0.        ]] 

Dogs train data:
 [[0.56470588 0.41568627 0.29019608 ... 0.92941176 0.84705882 1.        ]
 [0.50196078 0.37254902 0.17647059 ... 0.36862745 0.31764706 1.        ]
 [0.75686275 0.71372549 0.69019608 ... 0.25098039 0.21176471 1.        ]
 [0.09411765 0.06666667 0.0627451  ... 0.43921569 0.40392157 1.        ]
 [0.5372549  0.52156863 0.47843137 ... 0.58039216 0.57647059 1.        ]] 

Shuffled train data:
 [[0.75686275 0.71372549 0.69019608 ... 0.25098039 0.21176471 1.        ]
 [0.21960784 0.18039216 0.05490196 ... 0.56470588 0.36078431 0.        ]
 [0.15294118 0.18823529 0.21568627 ... 0.72941176 0.02352941

In [6]:
# Separate features (all columns but the last) from labels (last column)
X_train, y_train = train_dataset[:, :-1], train_dataset[:, -1:]
X_test, y_test = test_dataset[:, :-1], test_dataset[:, -1:]
print('X_train:\n', X_train,'\n')
print('y_train:\n', y_train,'\n')
print('X_test:\n', X_test,'\n')
print('y_test:\n', y_test,'\n')

X_train:
 [[0.75686275 0.71372549 0.69019608 ... 0.32156863 0.25098039 0.21176471]
 [0.21960784 0.18039216 0.05490196 ... 0.09411765 0.56470588 0.36078431]
 [0.15294118 0.18823529 0.21568627 ... 0.7372549  0.72941176 0.02352941]
 ...
 [0.50196078 0.37254902 0.17647059 ... 0.38823529 0.36862745 0.31764706]
 [0.87058824 0.8745098  0.85490196 ... 0.84705882 0.83529412 0.81960784]
 [0.56470588 0.41568627 0.29019608 ... 0.92941176 0.92941176 0.84705882]] 

y_train:
 [[1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]] 

X_test:
 [[0.36078431 0.29019608 0.22745098 ... 0.63529412 0.54117647 0.51764706]
 [0.46666667 0.3372549  0.24705882 ... 0.65098039 0.52156863 0.40392157]
 [0.09803922 0.09411765 0.08627451 ... 0.12156863 0.09803922 0.08627451]
 ...
 [0.50980392 0.47843137 0.49019608 ... 0.30980392 0.33333333 0.34901961]
 [0.14509804 0.14901961 0.14509804 ... 0.45490196 0.38039216 0.2745098 ]
 [0.02745098 0.02745098 0.02745098 ... 0.07843137 0.07058824 0.0627451 ]] 

y_test:
 [[1.]
 

In [7]:
# Train model
logr = LogisticRegression()
# ravel() turns the (n, 1) label column into the 1D array scikit-learn expects
logr.fit(X_train, y_train.ravel())
y_pred = logr.predict(X_test)


In [11]:
y_true = y_test.reshape(-1)  # flatten the (n, 1) label column to 1D

In [17]:
# Metrics
accuracy = np.mean(y_pred == y_true)  # correct predictions / total predictions
cls_report = classification_report(y_true=y_true, y_pred=y_pred)
print(f'Accuracy: {accuracy}\n')
print(cls_report)

Accuracy: 0.5

              precision    recall  f1-score   support

         0.0       0.50      0.20      0.29         5
         1.0       0.50      0.80      0.62         5

    accuracy                           0.50        10
   macro avg       0.50      0.50      0.45        10
weighted avg       0.50      0.50      0.45        10
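
The per-class numbers in the report can be checked by hand. The sketch below rebuilds one label arrangement that is consistent with the reported support, precision, and recall, then recomputes the metrics from the confusion matrix. The order of the actual test predictions is not shown in the output, so the two arrays are an assumption; only the counts matter here.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Assumed arrangement consistent with the report:
# 1 of 5 cats (class 0) and 4 of 5 dogs (class 1) are classified correctly.
y_true = np.array([0] * 5 + [1] * 5)
y_pred = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
precision = np.diag(cm) / cm.sum(axis=0)  # correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / actually that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()

print(cm)
print(precision.round(2), recall.round(2), f1.round(2), accuracy)
```

Under this arrangement the model labels eight of the ten test images as dogs, which gives the reported recall of 0.80 for dogs and 0.20 for cats. With only five images loaded per class, an accuracy of 0.5 is what random guessing would achieve on balanced classes, so this run says little about how logistic regression performs on the full dataset.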
