In [None]:
# Download and extract the dataset from the Google Drive URL
# URL - https://drive.google.com/file/d/1BmKDGjy23SC07IZxnXiF9lJZob4gIkA9/view?usp=share_link

# Step 1: Download the zip file from Google Drive using gdown
# (recent gdown releases take the file ID directly; the old --id flag is deprecated)
!gdown 1BmKDGjy23SC07IZxnXiF9lJZob4gIkA9 --output eAuto_photos.zip

# Step 2: Unzip the downloaded zip file
!unzip -q eAuto_photos.zip

# Step 3: Delete the zip file
!rm eAuto_photos.zip

**1. We scraped the web for the same number of images, but are all the images usable? If you are wondering, we consider a usable image to be the front of a car with the car’s logo clearly visible. If there are images that are not usable, please remove them from the set.**

  Answer:
  
  Many images are not usable: some do not show the front of the car, and others do not show the logo clearly.

**2. After dataset pruning, is the dataset big enough? Is there any imbalance in our dataset?**

Answer:

To check if the dataset is big enough and balanced after pruning, I would load the images, extract the car brand from the folder name, and count the number of images per brand.

This would show the distribution of images per brand. I'd want at least a few hundred images per brand to train a model. If any brands have significantly fewer images, I may want to try to collect more data for those brands to balance the dataset.

In [None]:
# Check for imbalance in the dataset by counting the images in each brand folder.
# The folder names must match the directory names on disk exactly.

import os

brand_folders = ['Volkswagen', 'Toyota', 'Tata', 'Suzuki', 'Nissan', 'Renault', 'Hyundai', 'Honday', 'Ford']
for folder in brand_folders:
  images = os.listdir(os.path.join('/content/photos', folder))
  print(folder, ":", len(images))
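
The per-folder counts can also be summarised into a single imbalance figure. The following is a minimal sketch, assuming one sub-folder per brand under a root directory; the helper names, the extension list, and the `min_per_class=200` threshold are illustrative assumptions (the threshold is only a stand-in for "a few hundred images per brand").

```python
import os
from collections import Counter

# Assumption: only files with these extensions count as images.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def count_images_per_class(root_dir):
    """Return a Counter mapping each sub-folder name to its number of image files."""
    counts = Counter()
    for entry in sorted(os.listdir(root_dir)):
        class_dir = os.path.join(root_dir, entry)
        if not os.path.isdir(class_dir):
            continue  # ignore stray files next to the brand folders
        counts[entry] = sum(
            1
            for name in os.listdir(class_dir)
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
        )
    return counts


def imbalance_report(counts, min_per_class=200):
    """Summarise class balance; min_per_class is an illustrative threshold."""
    if not counts:
        raise ValueError("no classes found")
    largest = max(counts.values())
    smallest = min(counts.values())
    return {
        "total": sum(counts.values()),
        # Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
        "imbalance_ratio": largest / smallest if smallest else float("inf"),
        "under_represented": sorted(
            name for name, n in counts.items() if n < min_per_class
        ),
    }


if __name__ == "__main__":
    root = "/content/photos"  # dataset location used elsewhere in this notebook
    if os.path.isdir(root):
        print(imbalance_report(count_images_per_class(root)))
```

A ratio close to 1.0 suggests the brands are roughly balanced; any brand listed under `under_represented` is a candidate for collecting more images.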

**3. Is the dataset good quality?**
  - Do many images require editing (cropping, zooming, sharpening, etc.)?
  - Are there other issues in our dataset you can see?
  - What are some challenges that you see with the current dataset?

Answer:
  1. I would manually inspect a sample of images and see if many require cropping, zooming, etc. This would give me a sense of how much editing is needed.
  2. I would check for duplicate or near-duplicate images, as those would not add much value. I can use a perceptual hash to find duplicates.
  3. I would see if images are distorted, blurry, or have artifacts that could make the car logo unclear. This may require excluding some images.
  4. Variety in angles, lighting, backgrounds etc. may make the dataset more challenging to work with. Images taken in consistent conditions are easier to handle.
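
The duplicate check in point 2 could be sketched as follows. This uses an average hash, which is one simple kind of perceptual hash and not necessarily the one intended above; the function names, `hash_size=8`, and `max_distance=5` are assumptions chosen for illustration.

```python
import os


def average_hash_from_pixels(pixels):
    """Compute an average hash from a flat sequence of grayscale values.

    Each bit is 1 where the pixel is brighter than the mean, else 0.
    """
    pixels = list(pixels)
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")


def average_hash(image_path, hash_size=8):
    """Load an image with Pillow and hash its downscaled grayscale version."""
    from PIL import Image  # imported here so the helpers above need no extra packages

    with Image.open(image_path) as img:
        small = img.convert("L").resize((hash_size, hash_size), Image.LANCZOS)
        # For mode "L", tobytes() yields one byte (0-255) per pixel.
        return average_hash_from_pixels(small.tobytes())


def find_near_duplicates(image_dir, max_distance=5):
    """Return pairs of file names whose hashes differ by at most max_distance bits."""
    names = sorted(os.listdir(image_dir))
    hashes = {name: average_hash(os.path.join(image_dir, name)) for name in names}
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if hamming_distance(hashes[first], hashes[second]) <= max_distance:
                pairs.append((first, second))
    return pairs
```

Calling `find_near_duplicates('/content/photos/Toyota')` would list candidate pairs for one brand folder. The pairwise comparison is quadratic in the number of images, which should be acceptable for folders of a few hundred files, and flagged pairs are best confirmed by eye before deleting anything.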

**4. Based on your evaluation, please lay out your plan for correcting any issues you’ve identified.**
  - Do you plan on manually editing some images?
  - Can you do any standardization to the images?
  - Is there any data augmentation you can do?

Answer:
1. Yes, I'll manually review and edit images if needed - cropping, rotating, zooming etc. This is time-consuming but can improve quality.
2. We can apply image pre-processing techniques such as resizing all images to a specific resolution or normalizing pixel values to a consistent scale.
3. We can apply different data augmentation techniques to increase the dataset size and improve model generalization. Augmentation methods can include rotation, flipping, translation, and brightness adjustments.
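
The augmentation methods in point 3 could look roughly like the sketch below, written with Pillow. The function name `augment`, the parameter ranges, and the `allow_flip` switch are illustrative assumptions rather than tuned values. Flipping is off by default here because mirroring reverses any lettering in a logo, which may or may not be acceptable for this task.

```python
import random

from PIL import Image, ImageEnhance, ImageOps


def augment(img, rng=random, allow_flip=False):
    """Return a randomly augmented RGB copy of a PIL image.

    The ranges below are illustrative assumptions, not tuned values.
    """
    out = img.convert("RGB")

    # Optional horizontal flip (mirrors the logo, so it is disabled by default).
    if allow_flip and rng.random() < 0.5:
        out = ImageOps.mirror(out)

    # Small rotation plus a translation of up to 5% of the width and height.
    angle = rng.uniform(-10, 10)
    max_dx = int(0.05 * out.width)
    max_dy = int(0.05 * out.height)
    shift = (rng.randint(-max_dx, max_dx), rng.randint(-max_dy, max_dy))
    out = out.rotate(angle, resample=Image.BILINEAR, translate=shift)

    # Brightness factor: 1.0 leaves the image unchanged.
    out = ImageEnhance.Brightness(out).enhance(rng.uniform(0.8, 1.2))
    return out
```

Passing a seeded generator, for example `augment(img, rng=random.Random(0))`, makes the result reproducible. The output keeps the input size, so it can be applied before or after resizing.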

In [None]:
# The images come in different sizes, so they need to be resized to a standard size.
# This cell resizes every image to 224 by 224 pixels.

from PIL import Image
import os

def resize_images(input_dir, output_dir, target_size=(224, 224)):
    # Create the output folder if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)

    for filename in os.listdir(input_dir):
        input_path = os.path.join(input_dir, filename)
        output_path = os.path.join(output_dir, filename)

        # Open the image
        with Image.open(input_path) as img:
            # Resize to exactly target_size. This stretches the image: the aspect
            # ratio is not preserved unless the source is already square.
            # Image.LANCZOS replaces Image.ANTIALIAS, which was removed in Pillow 10.
            img_resized = img.resize(target_size, Image.LANCZOS)

            # Save the resized image
            img_resized.save(output_path)

# Resize the images in every brand folder:
for folder in ['Volkswagen','Toyota', 'Tata', 'Suzuki', 'Nissan', 'Renault', 'Hyundai', 'Honday', 'Ford']:
  input_directory = os.path.join('/content/photos', folder)
  output_directory = os.path.join('/content/resized_photos', folder)
  resize_images(input_directory, output_directory)

  print("Images in the", folder, "folder have been resized.")

print("All images resized!")
