In [1]:
import cv2
import os
import numpy as np
from glob import glob

## Preprocessing

In [2]:
def preprocess_image(image_path, output_size=(256, 256)):
    # Load the image in grayscale mode
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        print("Error loading image:", image_path)
        return None

    # Binarize the image using Otsu's thresholding
    # This converts the image to a binary image (0 and 255)
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    
    # Optionally, you can perform additional morphological operations here
    # For example, if you want to remove noise or fill small gaps:
    # kernel = np.ones((3, 3), np.uint8)
    # binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    
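    # Normalize the image size by resizing.
    # Note: when downscaling, INTER_AREA averages neighbouring pixels, so the
    # result can contain gray levels between 0 and 255 along stroke edges.
    # Re-apply cv2.threshold afterwards if a strictly binary output is required.
    binary_resized = cv2.resize(binary, output_size, interpolation=cv2.INTER_AREA)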
    
    return binary_resized

In [3]:
def process_dataset(input_folder, output_folder, output_size=(256, 256)):
    # Create the output directory if it does not exist
    os.makedirs(output_folder, exist_ok=True)

    # Get list of image files in the input folder (assuming PNG images)
    image_files = glob(os.path.join(input_folder, "**", "*.png"), recursive=True)
    
    print(f"Found {len(image_files)} images in {input_folder}.")
    
    # Process each image
    for image_path in image_files:
        preprocessed = preprocess_image(image_path, output_size)
        if preprocessed is not None:
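            # Only the base name is kept, so files with the same name in
            # different subfolders will overwrite each other in output_folder.
            filename = os.path.basename(image_path)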
            output_path = os.path.join(output_folder, filename)
            cv2.imwrite(output_path, preprocessed)
    
    print("Preprocessing complete. Preprocessed images saved to:", output_folder)

In [4]:

if __name__ == "__main__":
    # Set the paths to your dataset directory and the directory to save processed images.
    input_folder = "./sketches"       
    output_folder = "./processed_sketches"  
    
    output_size = (256, 256)
    
    process_dataset(input_folder, output_folder, output_size)


Found 20000 images in ./sketches.
Preprocessing complete. Preprocessed images saved to: ./processed_sketches


## Normalisation of pixels



In [1]:
import os
import cv2
import numpy as np

# Paths of the input and output folders
FOLDER_PATH = "processed_sketches"
OUTPUT_FOLDER = "normalized_sketches" 

# ensures output folder exists
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

def normalize_images(folder_path, output_folder):
    for filename in os.listdir(folder_path):
        img_path = os.path.join(folder_path, filename)

        # Read image in grayscale
        img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
        if img is None:
            print(f"Skipping {filename} (not a valid image)")
            continue

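        # Normalize pixel values to range [0,1]
        img_normalized = img.astype("float32") / 255.0
        # Convert back to 8-bit for saving as PNG/JPG. Rounding first avoids
        # truncating a value down by one through floating-point error.
        # Note: the saved file holds 0-255 values again, so the [0,1] scaling
        # has to be reapplied when the images are loaded.
        img_uint8 = np.rint(img_normalized * 255).astype("uint8")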
        # Save the normalized image
        output_path = os.path.join(output_folder, filename)
        cv2.imwrite(output_path, img_uint8)

    print(f"All images normalized and saved in '{output_folder}'")

# Run the function
normalize_images(FOLDER_PATH, OUTPUT_FOLDER)


All images normalized and saved in 'normalized_sketches'
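
Image files such as PNG store 8-bit integers, so the files written above contain values in the range 0-255. One way to obtain inputs in [0, 1] is to scale at load time instead. The snippet below is a minimal sketch of that idea; the helper name `to_unit_range` and the tiny sample array are illustrative assumptions, not part of the pipeline above.

```python
import numpy as np

def to_unit_range(img_uint8):
    """Scale an 8-bit image array to float32 values in [0, 1]."""
    return img_uint8.astype(np.float32) / 255.0

# Tiny stand-in for a loaded sketch (0 and 255, as produced by binarization).
sample = np.array([[0, 255], [255, 0]], dtype=np.uint8)
scaled = to_unit_range(sample)
print(scaled.dtype, scaled.min(), scaled.max())
```

In practice the array passed in would come from `cv2.imread(path, cv2.IMREAD_GRAYSCALE)`.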


# **Mean Shift Clustering Algorithm**

Mean Shift is a **centroid-based** clustering algorithm. It iteratively shifts candidate centroids towards denser regions of the data, so it can identify clusters **without predefining the number of clusters**. This is useful when the number of groups in an image dataset is not known in advance. The density is modelled with **Kernel Density Estimation (KDE)**.

### **1. Steps of Mean Shift**
1. **Initialize Centroids:**  
   Each data point is considered a potential cluster center.
2. **Compute Mean Shift Vector:**  
   - For each centroid, find all points within a given bandwidth (a neighborhood around the centroid).  
   - Compute the mean (center of mass) of these points.  
   - Shift the centroid towards this mean.
3. **Repeat Until Convergence:**  
   - Continue shifting centroids until their movement is below a threshold.  
   - Merge centroids that converge to the same location.
4. **Assign Clusters:**  
   - After convergence, each data point is assigned to the nearest centroid.
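
The four steps above can be sketched in a few lines of NumPy. This is an illustrative sketch only, assuming a flat (uniform) kernel, a merge radius of half the bandwidth, and synthetic 2-D points in place of image features; the function name `mean_shift_flat` and its parameters are hypothetical.

```python
import numpy as np

def mean_shift_flat(points, bandwidth, tol=1e-4, max_iter=100):
    """Cluster `points` (n, d) with a flat kernel of radius `bandwidth`."""
    points = np.asarray(points, dtype=float)
    # Step 1: every data point starts as a candidate centroid.
    centroids = points.copy()
    for _ in range(max_iter):
        shifted = centroids.copy()
        for i, c in enumerate(centroids):
            # Step 2: move the centroid to the mean of the points in its window.
            in_window = np.linalg.norm(points - c, axis=1) <= bandwidth
            if in_window.any():
                shifted[i] = points[in_window].mean(axis=0)
        # Step 3: stop once the largest shift falls below the threshold.
        largest_shift = np.linalg.norm(shifted - centroids, axis=1).max()
        centroids = shifted
        if largest_shift < tol:
            break
    # Step 3 (continued): merge centroids that ended up close together.
    merged = []
    for c in centroids:
        if not any(np.linalg.norm(c - m) < bandwidth / 2 for m in merged):
            merged.append(c)
    merged = np.array(merged)
    # Step 4: assign each point to its nearest merged centroid.
    distances = np.linalg.norm(points[:, None, :] - merged[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    return merged, labels

# Two synthetic, well-separated blobs (assumed data for illustration).
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2)),
])
centers, labels = mean_shift_flat(data, bandwidth=1.5)
print(len(centers), "clusters found")
```

The nested loop costs O(n²) per iteration, which is fine for a demonstration but slow for thousands of images; library implementations typically use spatial indexing or seed binning to reduce this.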

---

### **2. Working of Mean Shift**
- **Kernel Density Estimation (KDE):** Mean Shift estimates the density distribution of data points to determine cluster centers.
- **Bandwidth Selection:** The **bandwidth parameter** defines how large the search neighborhood is.  
  - A **small bandwidth** results in more clusters.  
  - A **large bandwidth** merges clusters together.  
- **Adaptive Clustering:** Unlike K-Means, Mean Shift does not assume clusters are spherical and can detect arbitrarily shaped clusters.
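
The effect of the bandwidth can be illustrated by counting the peaks of the estimated density, since each peak corresponds to one cluster. The following is a rough sketch on assumed 1-D data with a Gaussian kernel; `count_density_modes` is a hypothetical helper that simply counts local maxima on a grid.

```python
import numpy as np

def count_density_modes(samples, bandwidth, grid):
    """Count strict local maxima of a 1-D Gaussian KDE evaluated on `grid`."""
    # Scaled differences between every grid position and every sample.
    u = (grid[:, None] - samples[None, :]) / bandwidth
    density = np.exp(-0.5 * u**2).sum(axis=1)  # unnormalised density
    interior = density[1:-1]
    is_peak = (interior > density[:-2]) & (interior > density[2:])
    return int(is_peak.sum())

# Two tight groups of points, centred at 0 and at 3.
offsets = np.array([-0.2, -0.1, 0.0, 0.1, 0.2])
samples = np.concatenate([offsets, offsets + 3.0])
grid = np.linspace(-3.0, 6.0, 901)

modes_small = count_density_modes(samples, bandwidth=0.5, grid=grid)
modes_large = count_density_modes(samples, bandwidth=3.0, grid=grid)
print("small bandwidth:", modes_small, "peaks; large bandwidth:", modes_large, "peak(s)")
```

With the small bandwidth the two groups stay separate peaks; with the large one they blur into a single peak, which is the merging behaviour described above.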

---

# **Mathematics Behind Mean Shift Clustering**

Mean Shift is a **mode-seeking** procedure: it moves points towards regions of higher density, where the density is estimated with **Kernel Density Estimation (KDE)**.

---

## **1. Kernel Density Estimation (KDE)**  
The Mean Shift algorithm uses **KDE** to estimate the **density of data points**. The density function at a point \( x \) is given by:

$$
f(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K \left( \frac{x - x_i}{h} \right)
$$

where:  
- \( f(x) \) is the estimated density at \( x \).  
- \( n \) is the number of data points.  
- \( h \) is the **bandwidth** (window size).  
- \( d \) is the number of dimensions in the dataset.  
- \( K(\cdot) \) is the **kernel function** that defines the weight of nearby points.
- \( x_i \) are the data points in the dataset.  

### **Common Kernel Functions**
Mean Shift typically uses a **Gaussian kernel**, defined (up to a normalizing constant) as:

$$
K(x) = e^{-\frac{||x||^2}{2}}
$$

Other kernels include **Epanechnikov, Uniform, and Triangular kernels**.
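
As a small illustration, the density formula and the Gaussian kernel above translate almost directly into NumPy. This is a sketch under assumptions: the function names and sample points are made up, and because the kernel is used without its normalizing constant, the returned values are proportional to a density rather than a properly normalized one.

```python
import numpy as np

def gaussian_kernel(u):
    """K(u) = exp(-||u||^2 / 2), applied row-wise to an (n, d) array."""
    return np.exp(-0.5 * np.sum(u**2, axis=1))

def kde(x, data, h):
    """Evaluate f(x) = 1 / (n h^d) * sum_i K((x - x_i) / h) at one point x."""
    n, d = data.shape
    return gaussian_kernel((x - data) / h).sum() / (n * h**d)

# Three points close together and one far away (assumed sample data).
data = np.array([[0.0, 0.0], [0.2, -0.1], [-0.1, 0.1], [4.0, 4.0]])
dense = kde(np.array([0.0, 0.0]), data, h=0.5)
sparse = kde(np.array([2.0, 2.0]), data, h=0.5)
print(dense, sparse)
```

The estimate is much larger near the group of three points than in the empty region between the groups, which is the signal Mean Shift follows.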

---

## **2. Mean Shift Vector**
The Mean Shift update replaces a point with the **kernel-weighted mean of the data points around it**; the Mean Shift vector is the resulting displacement \( m(x) - x \). The weighted mean is defined as:

$$
m(x) = \frac{\sum_{i=1}^{n} x_i K \left( \frac{x - x_i}{h} \right)}{\sum_{i=1}^{n} K \left( \frac{x - x_i}{h} \right)}
$$

where:  
- \( m(x) \) is the new mean (shifted point).  
- The numerator computes a weighted sum of the data points; points farther than a few bandwidths from \( x \) receive almost no weight.  
- The denominator normalizes the weights.

Each point \( x \) is shifted iteratively using:

$$
x_{t+1} = m(x_t)
$$

until convergence (i.e., when the shifts become negligible).
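
A single application of the update can be sketched as follows, assuming the Gaussian kernel from the previous section. The function name `shifted_point` and the four sample points are hypothetical.

```python
import numpy as np

def shifted_point(x, data, h):
    """Return m(x): the Gaussian-weighted mean of `data` around x."""
    weights = np.exp(-0.5 * np.sum(((x - data) / h) ** 2, axis=1))
    return weights @ data / weights.sum()

# Four points on a unit square; the start point lies well outside it.
data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x0 = np.array([3.0, 3.0])
x1 = shifted_point(x0, data, h=1.0)   # one step of x_{t+1} = m(x_t)
centre = data.mean(axis=0)
print("before:", np.linalg.norm(x0 - centre), "after:", np.linalg.norm(x1 - centre))
```

Because \( m(x) \) is a weighted average with positive weights, the shifted point always lies inside the convex hull of the data, so one step already pulls a distant start point into the region occupied by the samples.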

---

## **3. Bandwidth Selection**
The choice of **bandwidth \( h \)** is crucial for Mean Shift.  
- A **small bandwidth** creates **more clusters** (over-segmentation).  
- A **large bandwidth** merges clusters (under-segmentation).  
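- A common starting point is a rule of thumb such as **Scott’s Rule** or **Silverman’s Rule**. Silverman’s Rule gives:

$$
h = \left( \frac{4}{d+2} \right)^{\frac{1}{d+4}} n^{-\frac{1}{d+4}} \, \sigma
$$

where \( d \) is the number of dimensions, \( n \) is the number of samples, and \( \sigma \) is the standard deviation of the data (for standardized data, \( \sigma = 1 \)). Scott’s Rule drops the leading factor and uses \( h = n^{-\frac{1}{d+4}} \, \sigma \).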
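
The rule-of-thumb factors are straightforward to compute. The sketch below assumes that each factor is multiplied by the per-feature standard deviation of the data, which is the usual way these rules are applied; the function names and the random placeholder features are illustrative only.

```python
import numpy as np

def silverman_factor(n, d):
    """Dimensionless factor (4 / (d + 2))^(1 / (d + 4)) * n^(-1 / (d + 4))."""
    return (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4))

def scott_factor(n, d):
    """Scott's variant: n^(-1 / (d + 4))."""
    return n ** (-1.0 / (d + 4))

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 3))     # placeholder feature vectors
n, d = features.shape
sigma = features.std(axis=0, ddof=1)     # per-feature spread
h_silverman = silverman_factor(n, d) * sigma
h_scott = scott_factor(n, d) * sigma
print(h_silverman, h_scott)
```

Both factors shrink as \( n \) grows, so larger datasets get narrower windows. These rules target density estimation rather than clustering, so the result is best treated as a starting value to tune.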

---

## **4. Convergence Criteria**
The algorithm stops when:

$$
||m(x_t) - x_t|| < \epsilon
$$

where **\( \epsilon \)** is a small threshold.

This ensures that centroids **do not move significantly** between iterations.
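
The stopping rule fits naturally into the update loop. The following sketch assumes a Gaussian kernel and adds a `max_iter` cap as a safety net; the name `seek_mode`, the default values, and the sample data are assumptions for illustration.

```python
import numpy as np

def seek_mode(x, data, h, eps=1e-6, max_iter=500):
    """Iterate x <- m(x) until ||m(x) - x|| < eps.

    Returns the final point and the number of iterations used.
    """
    for iteration in range(1, max_iter + 1):
        weights = np.exp(-0.5 * np.sum(((x - data) / h) ** 2, axis=1))
        m = weights @ data / weights.sum()
        if np.linalg.norm(m - x) < eps:   # convergence criterion
            return m, iteration
        x = m
    return x, max_iter

# Three nearby points and a distant pair (assumed sample data).
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
mode, steps = seek_mode(np.array([0.5, 0.5]), data, h=0.5)
print(mode, steps)
```

A smaller `eps` gives a more precise mode at the cost of extra iterations; the cap guards against slow convergence in flat regions of the density.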

---


