## Extracting and Preparing the Dataset

### Explanation:
In this notebook cell, we import the necessary module, `zipfile`, to work with zip files. The dataset for our image classification project, "Dogs vs Cats," is stored in a zip file located at the path specified by `zip_file_path`.

We use `zipfile.ZipFile()` to open the archive in read mode (`'r'`), then loop over its members with `zip_ref.infolist()` and extract each one with `zip_ref.extract()`. Because no target path is passed, the files are written to the current working directory, preserving the folder structure stored in the archive.

After successfully extracting all files from the zip archive, we print a "Successfully extracted files" message to indicate that the dataset is ready for further processing and analysis in our image classification project.
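As a minimal sketch of the extraction step, the snippet below builds a tiny throwaway archive and extracts it into an explicit target directory with `ZipFile.extractall()`. The `extract_dataset` helper and the file names inside the demo archive are hypothetical and exist only to make the example self-contained; they are not part of the original notebook.

```python
import os
import tempfile
import zipfile


def extract_dataset(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(target_dir)
        return zip_ref.namelist()


# Build a small stand-in archive (hypothetical file names) so the sketch runs anywhere
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, 'demo.zip')
with zipfile.ZipFile(demo_zip, 'w') as zf:
    zf.writestr('train/cats/cat.0.jpg', b'fake image bytes')
    zf.writestr('train/dogs/dog.0.jpg', b'fake image bytes')

extracted = extract_dataset(demo_zip, os.path.join(work_dir, 'dataset'))
print("Successfully extracted", len(extracted), "files.")
```

Passing an explicit target directory keeps the extracted files out of the notebook's working directory, which can be easier to clean up afterwards.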

In [None]:
# # Import the required module to work with zip files
# import zipfile

# # Set the path to the zip file containing the dataset
# zip_file_path = r'E:\Computer vision\Dogs vs Cats\dataset.zip'

# # Open the zip file in read mode
# with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
#     # Extract all files one by one without saving them in a specific path
#     for file_info in zip_ref.infolist():
#         zip_ref.extract(file_info)

# # Print a success message after all files are extracted
# print("Successfully extracted files.")


## Dataset Structure Overview

### Explanation:
In this notebook cell, we explore the structure of the "Dogs vs Cats" dataset, and it appears as follows:

1. **Contents of the Base Directory:**
   The top-level directory, `base_dir`, contains three subdirectories: `validation`, `test`, and `train`. These subdirectories are crucial for organizing and dividing the dataset into respective sets for training, testing, and validation.

2. **Contents of the Train Directory:**
   The `train` directory contains two subdirectories: `dogs` and `cats`. These subdirectories likely contain the training images of dogs and cats, respectively. Each image category is placed in a separate subdirectory, facilitating the image classification model's training process.

3. **Contents of the Test Directory:**
   The `test` directory also contains two subdirectories: `dogs` and `cats`. These subdirectories presumably hold the test images of dogs and cats, respectively. The model will be evaluated on these images to assess its performance on unseen data.

4. **Contents of the Validation Directory:**
   Similarly, the `validation` directory consists of two subdirectories: `dogs` and `cats`. It is likely that the validation images of dogs and cats are placed in these subdirectories. Validation is essential for tuning hyperparameters and ensuring the model generalizes well on new data.

By organizing the dataset into these directories, it becomes easier to load, preprocess, and feed the images into the image classification model during the various stages of the project.
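Before relying on this layout, it can help to check that every expected folder is actually present. The following is a small sketch of such a check; the `missing_dirs` helper is hypothetical, and the demo runs against a temporary directory rather than the real dataset.

```python
import os
import tempfile

EXPECTED_SPLITS = ('train', 'validation', 'test')
EXPECTED_CLASSES = ('cats', 'dogs')


def missing_dirs(base_dir):
    """Return the split/class folders that are absent under base_dir."""
    missing = []
    for split in EXPECTED_SPLITS:
        for cls in EXPECTED_CLASSES:
            if not os.path.isdir(os.path.join(base_dir, split, cls)):
                missing.append(os.path.join(split, cls))
    return missing


# Demo on a temporary directory that deliberately lacks test/dogs
demo_base = tempfile.mkdtemp()
for split in EXPECTED_SPLITS:
    for cls in EXPECTED_CLASSES:
        if (split, cls) != ('test', 'dogs'):
            os.makedirs(os.path.join(demo_base, split, cls))

print(missing_dirs(demo_base))
```

An empty list means the directory tree matches the structure described above.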

In [None]:
import os

base_dir = r'data/dogs-vs-cats'

print("Contents of base directory:")
print(os.listdir(base_dir))

print("\nContents of train directory:")
print(os.listdir(f'{base_dir}/train'))

print("\nContents of test directory:")
print(os.listdir(f'{base_dir}/test'))

print("\nContents of validation directory:")
print(os.listdir(f'{base_dir}/validation'))

## Setting Up Image Directories

In [None]:
train_dir = os.path.join(base_dir, 'train')
test_dir = os.path.join(base_dir, 'test')
validation_dir = os.path.join(base_dir, 'validation')

# Directory with training cat/dog pictures
train_cats_dir = os.path.join(train_dir, 'cats')
train_dogs_dir = os.path.join(train_dir, 'dogs')

test_cats_dir = os.path.join(test_dir, 'cats')
test_dogs_dir = os.path.join(test_dir, 'dogs')

# Directory with validation cat/dog pictures
validation_cats_dir = os.path.join(validation_dir, 'cats')
validation_dogs_dir = os.path.join(validation_dir, 'dogs')


In [None]:
train_cat_fnames = os.listdir( train_cats_dir )
train_dog_fnames = os.listdir( train_dogs_dir )

print(train_cat_fnames[:10])
print(train_dog_fnames[:10])

Let's find out the total number of cat and dog images in the `train`, `validation`, and `test` directories:

## Distribution of Cat and Dog Images

### Explanation:
In this notebook cell, we visualize the distribution of cat and dog images across the different data splits, namely training, validation, and test sets. We utilize the Plotly library to create a bar plot that provides a visual representation of the image counts.

1. **Counting Images:**
   We use the `os.listdir()` function combined with the `len()` function to count the number of images in each category (cats and dogs) for the training, validation, and test sets. The variables `train_cat_count`, `train_dog_count`, `validation_cat_count`, `validation_dog_count`, `test_cat_count`, and `test_dog_count` store these counts for their respective directories.

2. **Bar Plot:**
   We create a bar plot using Plotly's `go.Bar()` function. The x-axis represents the data split categories ('Training', 'Validation', and 'Test'), and the y-axis displays the number of images for each category. We create two bars for each data split: one for cats and another for dogs. The height of each bar corresponds to the number of images in the corresponding category.

3. **Layout and Labels:**
   We update the layout of the plot using the `update_layout()` function to add a title and axis labels. The title of the plot is set to 'Distribution of Cat and Dog Images'. The x-axis label is 'Data Split', indicating the different sets, and the y-axis label is 'Number of Images', representing the count of images.

By visualizing the distribution of cat and dog images across the different data splits, we can gain insights into the balance of the dataset and ensure that each category has a sufficient number of images for effective model training and evaluation.
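One caveat with `len(os.listdir(...))` is that it counts every directory entry, including any stray non-image files. The sketch below filters by file extension and reports the share of cats in a split. The helper names and the extension list are assumptions made for illustration, not part of the original notebook.

```python
import os
import tempfile

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')  # assumed set of image extensions


def count_images(directory):
    """Count files in directory whose extension looks like an image."""
    return sum(
        1 for name in os.listdir(directory)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )


def class_balance(cat_count, dog_count):
    """Return the fraction of images that are cats (0.5 means perfectly balanced)."""
    total = cat_count + dog_count
    return cat_count / total if total else 0.0


# Demo with stand-in files: two images and one unrelated file
demo_dir = tempfile.mkdtemp()
for name in ('cat.0.jpg', 'cat.1.JPG', 'Thumbs.db'):
    open(os.path.join(demo_dir, name), 'w').close()

print(count_images(demo_dir))   # counts the two image files and skips Thumbs.db
print(class_balance(3, 1))      # 0.75
```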

In [None]:
import plotly.graph_objects as go


# Count the number of images in each directory
train_cat_count = len(os.listdir(train_cats_dir))
train_dog_count = len(os.listdir(train_dogs_dir ))
validation_cat_count = len(os.listdir(validation_cats_dir))
validation_dog_count = len(os.listdir(validation_dogs_dir))
test_cat_count = len(os.listdir(test_cats_dir))
test_dog_count = len(os.listdir(test_dogs_dir))

# Create a bar plot using Plotly
categories = ['Training', 'Validation', 'Test']
cats_counts = [train_cat_count, validation_cat_count, test_cat_count]
dogs_counts = [train_dog_count, validation_dog_count, test_dog_count]

fig = go.Figure(data=[
    go.Bar(name='Cats', x=categories, y=cats_counts),
    go.Bar(name='Dogs', x=categories, y=dogs_counts)
])

# Update the layout to add titles and labels
fig.update_layout(title='Distribution of Cat and Dog Images',
                  xaxis_title='Data Split',
                  yaxis_title='Number of Images')

fig.show()


In [None]:
print('total training cat images :', train_cat_count)
print('total training dog images :', train_dog_count)

print('total validation cat images :', validation_cat_count)
print('total validation dog images :', validation_dog_count)

print('total test cat images :', test_cat_count)
print('total test dog images :', test_dog_count)

## Displaying a Subset of Cat and Dog Images

In [None]:
%matplotlib inline

import matplotlib.image as mpimg
import matplotlib.pyplot as plt

# Parameters for our graph; we'll output images in a 4x4 configuration
nrows = 4
ncols = 4

pic_index = 0 # Index for iterating over images

In [None]:
# Set up matplotlib fig, and size it to fit 4x4 pics
fig = plt.gcf()
fig.set_size_inches(ncols*4, nrows*4)

pic_index+=8

next_cat_pix = [os.path.join(train_cats_dir, fname) 
                for fname in train_cat_fnames[ pic_index-8:pic_index] 
               ]

next_dog_pix = [os.path.join(train_dogs_dir, fname) 
                for fname in train_dog_fnames[ pic_index-8:pic_index]
               ]

for i, img_path in enumerate(next_cat_pix+next_dog_pix):
  # Set up subplot; subplot indices start at 1
  sp = plt.subplot(nrows, ncols, i + 1)
  sp.axis('Off') # Don't show axes (or gridlines)

  img = mpimg.imread(img_path)
  plt.imshow(img)

plt.show()


## Loading Pre-trained Xception Model

### Explanation:
In this notebook cell, we use TensorFlow and Keras to load a pre-trained Xception model for transfer learning in our "Dogs vs Cats" image classification project.

1. **Importing Required Modules:**
   We import the necessary modules from TensorFlow and Keras to work with the Xception model and image data.

2. **Xception Model Loading:**
   We load the pre-trained Xception model using the `Xception` class from `tensorflow.keras.applications.xception`. The model is initialized with the weights set to 'imagenet', indicating that we want to use the weights pre-trained on the ImageNet dataset. We also set `include_top=False` to exclude the top classification layer of the Xception model. By excluding the top layer, we can add our own classification layer tailored to the "Dogs vs Cats" image classification task.

3. **Input Shape:**
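   The `input_shape` parameter is set to (299, 299, 3), representing the shape of the images that will be fed into the Xception model: 299x299 pixels with three color channels (RGB). This is the default input size for Xception. With `include_top=False`, Keras also accepts other sizes (per the Keras documentation, width and height no smaller than 71), but we keep the default here.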

By loading the pre-trained Xception model, we can leverage the model's feature extraction capabilities and then train a custom classification layer on top of it to achieve high accuracy in the "Dogs vs Cats" image classification task.

In [None]:
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import layers, models, optimizers

# Load the pre-trained Xception model without the top classification layer
base_model = Xception(weights='imagenet', include_top=False, input_shape=(299, 299, 3))

## Building the Transfer Learning Model

### Explanation:
In this notebook cell, we build a transfer learning model for the "Dogs vs Cats" image classification task by combining a pre-trained Xception base model with a custom classification head.

1. **Freezing Base Model Layers:**
   We set `base_model.trainable = False` to freeze the layers of the pre-trained Xception model, so their weights are not updated while the new classification head is trained. This is beneficial because the pre-trained weights already capture useful features for our classification task, and we do not want to distort them by further training at this stage.

2. **Creating the Custom Classification Head:**
   We create a new Keras Sequential model (`model`) by stacking layers on top of the frozen Xception base model. The custom classification head includes the following layers:
   - `base_model`: This is the frozen pre-trained Xception base model that acts as a feature extractor.
   - `GlobalAveragePooling2D()`: This layer performs global average pooling, reducing the spatial dimensions of the extracted features.
   - `Dense(256, activation='relu')`: A fully connected (dense) layer with 256 units and ReLU activation. This layer helps in capturing higher-level representations from the global average pooled features.
   - `Dropout(0.5)`: A dropout layer with a rate of 0.5, which helps in regularizing the model and preventing overfitting during training.
   - `Dense(1, activation='sigmoid')`: The final dense layer with a single unit and a sigmoid activation function. This layer outputs a probability score, indicating whether the input image is a cat or a dog (binary classification).

By combining the pre-trained Xception base model with the custom classification head, we create an effective transfer learning model for the "Dogs vs Cats" image classification task. The pre-trained Xception model brings valuable learned features, and the custom classification head tailors the model to our specific classification problem.
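To make the pooling and output layers more concrete, here is a NumPy-only sketch of what global average pooling followed by a single sigmoid unit computes. It assumes the base model emits a 10x10 grid of 2048-channel features for a 299x299 input. The random features and weights are stand-ins, not values from the trained model, and the intermediate `Dense(256)` and `Dropout` layers are left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the base model's output: batch of 2, 10x10 spatial grid, 2048 channels
features = rng.normal(size=(2, 10, 10, 2048))

# Global average pooling: average over the two spatial axes, one value per channel
pooled = features.mean(axis=(1, 2))

# A single-unit sigmoid layer with random (untrained) weights
weights = rng.normal(scale=0.01, size=(2048, 1))
bias = 0.0
probabilities = 1.0 / (1.0 + np.exp(-(pooled @ weights + bias)))

print(pooled.shape)         # (2, 2048)
print(probabilities.shape)  # (2, 1)
```

Each image ends up with one number between 0 and 1, which is why the model's predictions have shape `(N, 1)`.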

In [None]:
# Freeze the base model's layers so their weights are not updated during training
base_model.trainable = False

# Add a new classification head for dogs and cats classification
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])


## Data Generators for Training and Validation

### Explanation:
In this notebook cell, we set up data generators for the training and validation sets of the "Dogs vs Cats" image classification task. Data generators are useful for efficiently loading and augmenting images in batches during model training, which helps prevent memory overflow and allows for real-time data augmentation.

1. **Data Directories:**
   We specify the paths to the training and validation data directories using `train_data_dir` and `validation_data_dir`, respectively. These directories contain the organized dataset images for their respective sets.

2. **Batch Size and Image Size:**
   We set the `batch_size` to 32, which means the data generator will load and process 32 images at a time during training and validation. The `image_size` is set to (299, 299) to match the input size required by the Xception model.

3. **Training Data Generator:**
   We create an `ImageDataGenerator` object for the training set using `train_datagen`. This generator performs data augmentation on the training images, including shear, zoom, and horizontal flip, to increase the diversity of the data. The `preprocess_input` function is applied to each image, ensuring that the input is preprocessed according to the requirements of the Xception model.

4. **Train Generator Flow:**
   We use the `flow_from_directory` method of the training data generator to create a `train_generator`. This generator will yield batches of training data, where each batch contains images and their corresponding labels (binary values: 0 for cats, 1 for dogs).

5. **Validation Data Generator:**
   Similarly, we create an `ImageDataGenerator` object for the validation set using `validation_datagen`. This generator is not configured for data augmentation and only applies the `preprocess_input` function for preprocessing.

6. **Validation Generator Flow:**
   We use the `flow_from_directory` method of the validation data generator to create a `validation_generator`. Like the training generator, this generator will yield batches of validation data with their respective labels.

By using data generators, the model reads and preprocesses images batch by batch instead of holding the whole dataset in memory, which makes better use of system resources. Additionally, data augmentation in the training generator helps improve the model's generalization ability by exposing it to a diverse range of image variations.
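For intuition about the preprocessing step: the Xception `preprocess_input` function scales pixel values from the 0 to 255 range into the -1 to 1 range. The sketch below reproduces that scaling in plain NumPy. The `scale_to_unit_range` helper is a hypothetical illustration, not the Keras implementation itself.

```python
import numpy as np


def scale_to_unit_range(pixels):
    """Map pixel values from [0, 255] to [-1, 1]."""
    return pixels.astype('float32') / 127.5 - 1.0


demo_pixels = np.array([0, 127.5, 255])
scaled = scale_to_unit_range(demo_pixels)
print(scaled)  # black maps to -1, mid-gray to 0, white to 1
```

Whatever scaling is used during training needs to be applied in the same way at prediction time, otherwise the model sees inputs in a range it was never trained on.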

In [None]:
# Set up data generators for training and validation sets
train_data_dir = r'data/dogs-vs-cats/train'
validation_data_dir = r'data/dogs-vs-cats/validation'

batch_size = 32
image_size = (299, 299)

train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=image_size,
    batch_size=batch_size,
    class_mode='binary'
)

validation_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

validation_generator = validation_datagen.flow_from_directory(
    validation_data_dir,
    target_size=image_size,
    batch_size=batch_size,
    class_mode='binary'
)


## Transfer Learning with Large Dataset and Limited Epochs

### Explanation:
In this notebook cell, we employ transfer learning with a large dataset of "Dogs vs Cats" images, and we train the model for only one epoch.

1. **Transfer Learning:**
   Transfer learning is used in this scenario to leverage the features the pre-trained Xception model learned on the ImageNet dataset. By setting `base_model.trainable = False`, we freeze the base model's layers so their weights are not updated during training; only the new classification head is trained. This approach is particularly useful when working with a limited amount of labeled data for a specific task.

2. **Large Dataset:**
   Our dataset contains a substantial number of images, which could take a considerable amount of time to process and train. Using data generators helps mitigate memory limitations, enabling us to train efficiently with large datasets without loading all images into memory simultaneously.

3. **Limited Epochs:**
   Due to the time and computational resources required for training on a large dataset, we set `epochs` to 1 for demonstration purposes. This means we perform only one pass through the entire training dataset during training. In practice, you may need to train the model for more epochs to achieve better convergence and improve accuracy.

4. **Accuracy After One Epoch:**
   While one epoch is insufficient for optimal convergence and high accuracy, transfer learning allows the model to start with pre-learned features, giving it an advantage over training from scratch. Therefore, even with just one epoch, the model may achieve reasonable accuracy levels. However, it is essential to note that a single epoch is not sufficient for capturing the full potential of the model and may not generalize well to unseen data.

5. **Fine-Tuning:**
   For further improvements in accuracy, consider fine-tuning the model by unfreezing some of the top layers of the base model. Fine-tuning allows the model to adapt to the specific characteristics of the "Dogs vs Cats" classification task. You can continue the training process with a smaller learning rate to fine-tune the model's weights to the target domain while preserving the pre-trained knowledge.

In summary, while transfer learning and data generators are valuable tools for efficiently training on large datasets, achieving high accuracy requires tuning hyperparameters, including the number of epochs, and potentially performing fine-tuning to adapt the model to the task at hand.
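The training cell derives the number of steps with floor division (`n // batch_size`), which leaves out any final partial batch. The sketch below shows the arithmetic with a hypothetical sample count; the `steps_per_pass` helper is an illustration and not part of the notebook.

```python
import math


def steps_per_pass(num_samples, batch_size, keep_partial_batch=False):
    """Number of batches needed for one pass over num_samples items."""
    if keep_partial_batch:
        return math.ceil(num_samples / batch_size)
    return num_samples // batch_size


# Hypothetical sample count, chosen only to make the remainder visible
num_samples = 1000
batch_size = 32

full_batches = steps_per_pass(num_samples, batch_size)
all_batches = steps_per_pass(num_samples, batch_size, keep_partial_batch=True)

print(full_batches)                             # 31
print(all_batches)                              # 32
print(num_samples - full_batches * batch_size)  # 8 samples fall outside the full batches
```

With floor division, a few samples are left out of each pass. Rounding up instead includes them at the cost of one smaller final batch.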

In [None]:
# Compile the model
model.compile(optimizer=optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train the model
epochs = 1
steps_per_epoch = train_generator.n // batch_size
validation_steps = validation_generator.n // batch_size

history = model.fit(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=validation_steps
)
model.save("model.h5")


## Predicting Some Random Images

In this notebook cell, we use the trained model to predict the labels of some random images from the test directory. The images are randomly selected from both the dog and cat categories, and their predicted labels (either "dog" or "cat") are displayed.
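One detail worth keeping in mind when mixing OpenCV with Keras: `cv2.imread` returns channels in BGR order, while the Keras image generators load images as RGB. The NumPy sketch below shows the channel swap on a made-up two-pixel image. It uses plain array slicing rather than OpenCV so that it runs without extra dependencies.

```python
import numpy as np

# A 1x2 "image" in BGR order: first pixel pure blue, second pixel pure red
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps the B and R channels
rgb = bgr[..., ::-1]

print(rgb[0, 0])  # the blue pixel now reads [0, 0, 255] in RGB order
print(rgb[0, 1])  # the red pixel now reads [255, 0, 0] in RGB order
```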

In [None]:
import os
import random
import cv2
import numpy as np
import matplotlib.pyplot as plt
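from tensorflow.keras.applications.xception import preprocess_input


def predict_image(img):
    # img is a BGR uint8 array as returned by cv2.imread
    x = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # the generators fed the model RGB images
    x = cv2.resize(x, (299, 299), interpolation=cv2.INTER_AREA)
    x = preprocess_input(x.astype('float32'))  # same scaling as used during training
    x = np.expand_dims(x, axis=0)  # add the batch dimension
    prediction = model.predict(x)
    # Sigmoid output: values above 0.5 map to class 1 (dogs)
    predicted_label = "dog" if prediction[0][0] > 0.5 else "cat"
    return predicted_label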

def load_images_from_directory(directory, num_images=10):
    images = []
    filenames = random.sample(os.listdir(directory), num_images)
    for filename in filenames:
        img_path = os.path.join(directory, filename)
        img = cv2.imread(img_path)
        images.append(img)
    return images

dog_directory = 'data/dogs-vs-cats/test/dogs'
cat_directory = 'data/dogs-vs-cats/test/cats'

dog_images = load_images_from_directory(dog_directory)
cat_images = load_images_from_directory(cat_directory)

# Create a 2x10 grid to display the images
fig, axes = plt.subplots(2, 10, figsize=(15, 5))
fig.suptitle('Dog vs. Cat Classification', fontsize=16)

# Loop through the dog images
for i in range(10):
    img = dog_images[i]
    label = predict_image(img)
    axes[0, i].imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    axes[0, i].axis('off')
    axes[0, i].set_title(label)

# Loop through the cat images
for i in range(10):
    img = cat_images[i]
    label = predict_image(img)
    axes[1, i].imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    axes[1, i].axis('off')
    axes[1, i].set_title(label)

plt.show()


## Evaluating Test Set Accuracy 

### Explanation:
In this notebook cell, we evaluate the test set accuracy for the "Dogs vs Cats" image classification model. However, instead of calculating accuracy on the entire test set, we evaluate it on a 25% subset: the first 25% of the test images in the order the generator yields them (the generator is created with `shuffle=False`, so the subset is not random).

1. **Test Data Directory:**
   We set `test_data_dir` to the path of the folder containing the test images.

2. **Test Data Generator:**
   We create a data generator for the test set using `test_datagen`. The test data generator preprocesses the images but does not shuffle them, ensuring that images are predicted in the same order as they are loaded.

3. **Batch Size and Image Size:**
   The `batch_size` is set to 32, and `image_size` is set to (299, 299) to match the input size required by the Xception model.

4. **Test Data Generator Flow:**
   We use the `flow_from_directory` method of the test data generator to create a `test_generator`. This generator will yield batches of test data with their respective labels (binary values: 0 for cats, 1 for dogs).

5. **Calculating Steps:**
   To evaluate accuracy on a subset, we calculate the number of steps needed to predict 25% of the test set. We use `total_test_samples` to get the total number of test samples and `predict_percentage` to specify the percentage of the test set to be used for evaluation. We then calculate `steps_per_epoch` based on the batch size.

6. **Predicting Classes:**
   We use the trained model to predict the classes of the test images. The predictions are stored in the `predictions` array, which contains the probability scores for each image.

7. **Converting to Class Labels:**
   The predicted probabilities are converted to class labels (1 for dog, 0 for cat) using a threshold of 0.5. The `predicted_labels` array contains the predicted class labels for each image.

8. **True Labels:**
   We obtain the true class labels from the test generator and store them in the `true_labels` array.

9. **Calculating Accuracy:**
   The accuracy is calculated by comparing the predicted labels to the true labels. The accuracy is the ratio of correctly predicted samples to the total number of samples.

By evaluating accuracy on the first 25% of the test images, we can get a quick estimate of the model's performance without waiting for the evaluation on the entire test set. This is especially useful when dealing with large test sets, as calculating accuracy on the full test set could be time-consuming. Note that because the generator is not shuffled and lists files class by class, the first 25% of a balanced test set is likely to contain only cat images, so this figure mostly reflects accuracy on cats. It is an approximation of the model's performance, and for a comprehensive evaluation, the entire test set should eventually be used.
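A shape detail matters when comparing predictions with labels: a single-unit sigmoid model returns predictions of shape `(N, 1)`, while the generator's `classes` array has shape `(N,)`. The NumPy sketch below uses made-up numbers to show how comparing the two directly broadcasts to an `N x N` matrix, and how flattening the predictions first gives one comparison per image.

```python
import numpy as np

# Stand-in model outputs with shape (4, 1), as a single-unit sigmoid layer produces
predictions = np.array([[0.9], [0.2], [0.6], [0.1]])
true_labels = np.array([1, 0, 0, 0])

predicted_labels = (predictions > 0.5).astype(int)

# Comparing (4, 1) with (4,) broadcasts to a (4, 4) matrix, not 4 pairwise checks
broadcast_shape = (predicted_labels == true_labels).shape

# Flattening first gives one comparison per image
accuracy = (predicted_labels.ravel() == true_labels).mean()

print(broadcast_shape)  # (4, 4)
print(accuracy)         # 0.75
```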

In [None]:
# Path to the folder containing the test images 
test_data_dir = r"data/dogs-vs-cats/test"

# Create a data generator for the test set
test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

batch_size = 32
image_size = (299, 299)

test_generator = test_datagen.flow_from_directory(
    test_data_dir,
    target_size=image_size,
    batch_size=batch_size,
    class_mode='binary',  # Use 'binary' class_mode for binary classification (cats and dogs)
    shuffle=False         # Set shuffle to False to ensure images are predicted in the same order as they are loaded
)

# Calculate the number of steps to predict 25% of the test set
total_test_samples = len(test_generator.filenames)
predict_percentage = 0.25
steps_per_epoch = int(total_test_samples * predict_percentage) // batch_size

# Predict the classes of the test images
predictions = model.predict(test_generator, steps=steps_per_epoch, verbose=1)

# Convert predicted probabilities to class labels (1: dog, 0: cat)
# model.predict returns shape (N, 1); flatten so it lines up with the 1-D true labels
predicted_labels = (predictions > 0.5).astype(int).ravel()

# Get the true labels for the images that were actually predicted
true_labels = test_generator.classes[:len(predicted_labels)]

# Calculate accuracy
accuracy = (predicted_labels == true_labels).mean()
print("Test accuracy:", accuracy)


**Model Deployment**: I successfully deployed my Dogs vs. Cats classification model using Xception on Hugging Face. You can check it out here: https://huggingface.co/spaces/moazx/Dogs-vs-Cats-classification-with-Xception