<h1>Part 0: Feature Extraction</h1>
<h2>Why extract features instead of reading the raw pixels?</h2>
<p>In image clustering, feature extraction is a crucial step. Raw pixel values are an inefficient and often ineffective way to capture the information that clustering depends on. The main reasons are:</p>
<ul>
    <li><b>Dimensionality Reduction:</b> Images contain a large amount of data: a single 224×224 RGB image already has 150,528 pixel values. Working on pixels directly means clustering in a very high-dimensional space. Feature extraction reduces this dimensionality, making the process more manageable and computationally efficient.</li>
    <li><b>Robustness to Noise:</b> Images can contain noise or variations in lighting that change the raw pixel values without changing the content. Feature extraction helps filter out such variation and focus on the essential characteristics of the images, making clustering more robust and accurate.</li>
    <li><b>Generalization:</b> Extracted features let the algorithm recognize patterns shared across different images, so images that are not identical but have similar characteristics can end up in the same cluster.</li>
</ul>
<h2>Techniques</h2>
<p>
There are various techniques for feature extraction in image processing. Some of the most common methods include:</p>
<ol>
    <li><b>Principal Component Analysis (PCA):</b> PCA is a linear technique that transforms the original image data into a new coordinate system whose axes are the principal components of the data. It reduces dimensionality while retaining most of the variance in the images.</li>
    <li><b>Singular Value Decomposition (SVD):</b> SVD is a linear matrix factorization closely related to PCA. It decomposes the image matrix into three matrices (U, Σ, and V). The columns of U and V are the singular vectors, which act as the features, and the diagonal elements of Σ (the singular values) indicate the importance of each one. SVD is often used for dimensionality reduction in image processing.</li>
    <li><b>Convolutional Neural Networks (CNNs):</b> CNNs are a deep learning-based approach to feature extraction. They learn hierarchical feature representations by stacking convolution and pooling layers. CNNs can automatically extract relevant features from raw image data, making them powerful tools for computer vision tasks such as image classification, object detection, and segmentation.</li>
</ol>
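<p>As a rough illustration of the SVD item above, the sketch below factorizes a small made-up matrix with NumPy and rebuilds it from only the largest singular values. The matrix <code>A</code> and the helper <code>truncate</code> are assumptions introduced for this example; they are not part of the notebook's pipeline.</p>

```python
import numpy as np

# Toy 4x3 "image" matrix (values are made up for illustration).
# Row 0 equals row 2 + 2 * row 3, and row 1 is 2 * row 0, so the rank is 2.
A = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
])

# Thin SVD: A = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)


def truncate(U, s, Vt, k):
    """Rank-k approximation built from the k largest singular values."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


A_rank1 = truncate(U, s, Vt, 1)
A_rank2 = truncate(U, s, Vt, 2)

# Frobenius-norm error of each approximation.
error_rank1 = np.linalg.norm(A - A_rank1)
error_rank2 = np.linalg.norm(A - A_rank2)
```

<p>Because this toy matrix has rank 2, two components are enough to rebuild it, while the rank-1 error is determined by the singular values that were dropped. On real images the singular values usually decay gradually, so the choice of how many to keep is a trade-off.</p>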
<h2>What preprocessing is needed before images are fed to the model?</h2>
<p>Before feeding images to a model, they must be preprocessed so that they are suitable for the machine learning or deep learning algorithm. Some important preprocessing steps for image data are:</p>
<ul>
    <li><b>Resizing:</b> Resize the images to a uniform size that matches the input requirements of your model. A fixed input size is required by most models, and smaller images reduce the computational cost.</li>
    <li><b>Normalization:</b> Normalize the pixel values to a fixed range, typically 0 to 1 or -1 to 1. This puts all images on a common scale, so that no image dominates the learning process because of its high intensity or contrast.</li>
    <li><b>Label Encoding (for supervised learning):</b> If your images have associated labels or categories, encode these labels as numerical values. Many supervised learning algorithms require numerical targets.</li>
</ul>
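<p>The three steps above can be sketched with NumPy alone. The helper names <code>resize_nearest</code>, <code>normalize</code>, and <code>encode_labels</code>, the random test image, and the flower names are all assumptions made for this example. In practice a library routine such as <code>cv2.resize</code> would normally do the resizing, with better interpolation than the nearest-neighbour lookup used here.</p>

```python
import numpy as np


def resize_nearest(image, target_size):
    """Nearest-neighbour resize of an H x W x C array to (height, width)."""
    target_h, target_w = target_size
    src_h, src_w = image.shape[:2]
    # For every output row/column, pick the source row/column it maps to.
    rows = np.arange(target_h) * src_h // target_h
    cols = np.arange(target_w) * src_w // target_w
    return image[rows[:, None], cols]


def normalize(image):
    """Scale 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return image.astype(np.float32) / 255.0


def encode_labels(labels):
    """Map each distinct label to an integer id (ids follow sorted label order)."""
    classes, encoded = np.unique(labels, return_inverse=True)
    return classes, encoded


# Made-up 6x8 RGB image and labels, for illustration only.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(6, 8, 3), dtype=np.uint8)

prepared = normalize(resize_nearest(raw, (4, 4)))
classes, encoded = encode_labels(["rose", "tulip", "rose", "daisy"])
```

<p>Note that this sketch takes the target size as (height, width), whereas <code>cv2.resize</code> expects (width, height).</p>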

<h1>Part 1: K-Means vs. DBSCAN</h1>
<p>K-Means and DBSCAN are both popular clustering algorithms used in machine learning and data analysis for partitioning data into distinct groups based on their similarities. Here's an overview of each method along with their advantages and disadvantages:
</p>
<h2>K-Means</h2>
<ol>
    <li><b>Algorithm Overview:</b> K-Means is a centroid-based clustering algorithm where the data points are assigned to the nearest centroid. The algorithm iteratively updates the centroids until convergence, minimizing the within-cluster variance.</li>
   <li>
   <b>Advantages:</b>
    <ul>
        <li><b>Simplicity:</b> K-Means is straightforward to understand and implement.</li>
        <li><b>Scalability:</b> It works well with large datasets, because the cost of each iteration grows linearly with the number of data points (for a fixed K and number of dimensions).</li>
        <li><b>Efficiency:</b> K-Means converges relatively quickly, especially with well-separated clusters.</li>
    </ul>
   </li>
   <li>
   <b>Disadvantages:</b>
    <ul>
        <li><b>Sensitive to Initial Centroid Selection:</b> Different initializations can lead to different final clusters, impacting the algorithm's performance.</li>
        <li><b>Assumes Spherical Clusters:</b> K-Means assumes that clusters are spherical and of similar size, which may not hold true for all datasets.</li>
        <li><b>Need to Specify the Number of Clusters (K):</b> Determining the optimal number of clusters (K) can be challenging and may require domain knowledge or heuristics.</li>
    </ul>
    </li>
</ol>
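<p>The assign-then-update loop described in the overview can be sketched in plain Python. This is a minimal illustration and not scikit-learn's implementation: the function <code>kmeans</code>, its <code>initial_centroids</code> argument, and the sample points are assumptions for this example. The starting centroids are passed in explicitly so the run is reproducible. Real implementations usually choose them randomly or with k-means++, which is where the sensitivity to initialization comes from.</p>

```python
import math


def kmeans(points, initial_centroids, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep as is
        if new_centroids == centroids:
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return labels, centroids


# Two well-separated groups of made-up 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
```

<p>With one starting centroid in each group, the first three points end up in cluster 0 and the last three in cluster 1. Starting both centroids inside the same group can take more iterations or, on less clean data, settle on a different partition.</p>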
<h2>DBSCAN</h2>
<ol>
    <li><b>Algorithm Overview:</b> DBSCAN is a density-based clustering algorithm that groups together data points that are closely packed, marking outliers as noise. It does not require the number of clusters to be specified in advance.</li>
   <li>
   <b>Advantages:</b>
    <ul>
        <li><b>Ability to Detect Arbitrary Shaped Clusters:</b> DBSCAN can identify clusters of various shapes and sizes, unlike K-Means, which assumes spherical clusters.</li>
        <li><b>Robust to Noise:</b> DBSCAN automatically identifies and ignores outliers as noise, making it robust to noise in the dataset.</li>
        <li><b>No Need to Specify the Number of Clusters:</b> DBSCAN determines the number of clusters automatically based on the data density.</li>
    </ul>
   </li>
   <li>
   <b>Disadvantages:</b>
    <ul>
        <li><b>Sensitivity to Distance Metric and Epsilon Parameter:</b> The performance of DBSCAN can be influenced by the choice of distance metric and the epsilon parameter, which defines the neighborhood size.</li>
        <li><b>Difficulty in Handling Varying Density:</b> DBSCAN may struggle with datasets containing clusters of varying densities, as it relies on a single epsilon parameter for density estimation.</li>
        <li><b>Computationally Intensive for Large Datasets:</b> DBSCAN is typically more expensive than K-Means on large datasets. Without a spatial index, its neighborhood queries compare every pair of data points, which is quadratic in the number of points.</li>
    </ul>
    </li>
</ol>
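<p>To make the density idea concrete, here is a minimal brute-force DBSCAN sketch in plain Python. The function <code>dbscan</code>, the constant <code>NOISE</code>, and the sample points are assumptions for this example, and the quadratic neighbour search is kept only for readability. As in scikit-learn, <code>min_samples</code> counts the point itself.</p>

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or NOISE if it is in no dense region."""

    def neighbors(i):
        # Brute-force neighbourhood query; the result includes point i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        labels[i] = cluster_id  # i is a core point: start a new cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable but not core
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster_id += 1
    return labels


# Two dense squares of made-up points plus one isolated outlier.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (5.0, 20.0),
]
labels = dbscan(points, eps=1.5, min_samples=3)
```

<p>The two squares become clusters 0 and 1 without the number of clusters being given, and the isolated point keeps the noise label -1. Shrinking <code>eps</code> below 1.0 would turn every point into noise, which illustrates the parameter sensitivity listed above.</p>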
<h2>Comparison:</h2>
<ul>
    <li><b>Flexibility:</b> DBSCAN is more flexible in identifying clusters of arbitrary shapes and sizes, while K-Means assumes spherical clusters.</li>
    <li><b>Noise Handling:</b> DBSCAN automatically identifies and ignores outliers, whereas K-Means may assign noise points to the nearest centroid.</li>
    <li><b>Parameter Sensitivity:</b> DBSCAN requires tuning of parameters like epsilon and minPts, while K-Means primarily requires specifying the number of clusters (K).</li>
    <li><b>Scalability:</b> K-Means tends to be more scalable and efficient for large datasets compared to DBSCAN.</li>
</ul>
<p>Choosing between K-Means and DBSCAN depends on the dataset characteristics, the desired cluster shapes, and the presence of noise. For datasets with well-defined, spherical clusters and a known or easily determinable K, K-Means may be preferable. For datasets with complex cluster shapes or significant noise, DBSCAN might yield better results. Clusters of widely varying density remain difficult for DBSCAN, as noted in its disadvantages above.</p>
<h2>PCA</h2>
<p>PCA stands for Principal Component Analysis. It's a statistical technique used for dimensionality reduction in data analysis and machine learning. The primary goal of PCA is to reduce the dimensionality of a dataset while retaining as much of the variation present in the original dataset as possible. This reduction in dimensionality helps in simplifying the dataset, making it easier to visualize, analyze, and process, while still capturing the essential features of the data.</p>
<b>How does it work?</b>
<ol>
    <li><b>Data Standardization:</b> PCA typically starts by standardizing the data so that each feature has a mean of 0 and a standard deviation of 1.</li>
    <li><b>Covariance Matrix Computation:</b> PCA computes the covariance matrix of the standardized data.</li>
    <li><b>Eigenvalue Decomposition:</b> The covariance matrix is then decomposed into its eigenvectors and eigenvalues.</li>
    <li><b>Selection of Principal Components:</b> The eigenvectors are ranked in descending order of their corresponding eigenvalues, and the top ones are kept as the principal components.</li>
    <li><b>Projection:</b> Finally, the original data is projected onto the subspace spanned by the selected principal components.</li>
</ol>
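<p>The five steps above map almost line by line onto NumPy. The sketch below is an illustration under stated assumptions: the function <code>pca</code> and the synthetic data are made up for this example. It also standardizes the features as described in step 1, whereas scikit-learn's <code>PCA</code>, used later in this notebook, centers the data but does not scale it.</p>

```python
import numpy as np


def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    # 1. Standardize each feature to zero mean and unit standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features.
    cov = np.cov(Z, rowvar=False)
    # 3. Eigen-decomposition; eigh is intended for symmetric matrices.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Rank components by eigenvalue, largest first, and keep the top ones.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    explained_ratio = eigenvalues[order] / eigenvalues.sum()
    # 5. Project the standardized data onto the selected components.
    return Z @ components, explained_ratio


# Synthetic data: the second feature is almost a multiple of the first,
# so two components should capture nearly all of the variance.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])
projected, explained_ratio = pca(X, n_components=2)
```

<p>The returned <code>explained_ratio</code> gives the share of total variance carried by each kept component, which is a common way to judge how many components are needed.</p>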

In [18]:
import os
from keras.applications import VGG16
from keras.models import Model
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
import cv2
import numpy as np
import pandas as pd

image_folder = "flower_images"
csv_file = "flower_labels.csv"
true_labels = pd.read_csv(csv_file)['label'].values

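# os.listdir returns files in arbitrary order; the rows of flower_labels.csv are
# assumed to follow the same order, otherwise the label-based metrics are invalid.
image_files = os.listdir(image_folder)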
image_size = (224,224)

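def image_preprocess(path, target_size):
    """Load an image, resize it to target_size, and scale pixels to [0, 1]."""
    image = cv2.imread(path)  # OpenCV returns the channels in BGR order
    image = cv2.resize(image, target_size)  # target_size is (width, height)
    image = image / 255.0
    return image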

images = []

for image_file in image_files :
    image_path = os.path.join(image_folder,image_file)
    image = image_preprocess(image_path,image_size)
    images.append(image)

images = np.array(images)

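# include_top must be the boolean False: the string "False" is truthy,
# so it would also load the dense classification head.
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(image_size[0], image_size[1], 3))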
conv_output = base_model.get_layer('block5_conv3').output
features_extractor = Model(inputs=base_model.input,outputs=conv_output)
features = features_extractor.predict(images)
features = features.reshape(features.shape[0],-1)

pca = PCA(n_components=2)
reduced_features = pca.fit_transform(features)


kmeans = KMeans(n_clusters=10)
kmeans_clusters = kmeans.fit_predict(reduced_features)

dbscan = DBSCAN(eps=3.2,min_samples=3)
dbscan_clusters = dbscan.fit_predict(reduced_features)

cluster_index = np.where(dbscan_clusters == dbscan_clusters[0])[0][0]
cluster_index_c = np.where(kmeans_clusters == kmeans_clusters[0])[0][0]


# print(kmeans_clusters[:30])
# print(true_labels[:30])
print(dbscan_clusters[:30])




7/7 ━━━━━━━━━━━━━━━━━━━━ 15s 2s/step
[ 0  0  1  0  0  0  0  0  0  0  0  0  0 -1  2  0  0  0  0  0  3  4  2  4
  0  0  1  0 -1  0]


<h1>An appropriate K is approximately the number of true classes. It can be set by trial and error, evaluating the resulting cluster assignments against the real labels.</h1>

<h1>Validating</h1>
<p>Homogeneity and silhouette are both validation metrics commonly used to assess the quality of clustering algorithms, including those applied to image clustering.</p>
<ol>
    <li><b>Homogeneity: </b>Homogeneity measures the degree to which each cluster contains only members of a single class. In the context of image clustering, homogeneity assesses whether the clusters formed by the algorithm represent distinct and homogeneous groups of similar images. A high homogeneity score indicates that the clusters are composed mostly of images from the same class or category, while a low score suggests that the clusters contain mixed or heterogeneous images.</li>
    <li><b>Silhouette Score: </b>Silhouette analysis measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). For each sample, the silhouette score ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. A score close to 1 suggests that the sample is appropriately clustered, a score near 0 means it lies between two clusters, and a score near -1 indicates that it was probably assigned to the wrong cluster. In the context of image clustering, silhouette analysis helps to evaluate the compactness and separation of the clusters formed by the algorithm.</li>
</ol>
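<p>Both metrics are short enough to compute by hand, which can help when interpreting the scores. The sketch below follows the usual definitions: homogeneity is 1 minus the conditional entropy of the classes given the clusters, divided by the class entropy, and the silhouette of a sample is (b - a) / max(a, b). The functions <code>homogeneity</code> and <code>silhouette</code> and the toy data are assumptions for this example. In the notebook itself the scikit-learn functions <code>homogeneity_score</code> and <code>silhouette_score</code> are used.</p>

```python
import math
from collections import Counter


def homogeneity(true_labels, cluster_labels):
    """1 - H(class | cluster) / H(class); 1.0 means every cluster is pure."""
    n = len(true_labels)
    class_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_labels)
    joint_counts = Counter(zip(true_labels, cluster_labels))
    class_entropy = -sum(c / n * math.log(c / n) for c in class_counts.values())
    if class_entropy == 0:
        return 1.0  # a single class is trivially homogeneous
    conditional_entropy = -sum(
        count / n * math.log(count / cluster_counts[cluster])
        for (_, cluster), count in joint_counts.items()
    )
    return 1.0 - conditional_entropy / class_entropy


def silhouette(points, cluster_labels):
    """Mean of (b - a) / max(a, b) over all samples; needs at least 2 clusters."""
    clusters = set(cluster_labels)
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the other members of every cluster.
        mean_dist = {}
        for c in clusters:
            dists = [math.dist(p, q) for j, q in enumerate(points)
                     if cluster_labels[j] == c and j != i]
            if dists:
                mean_dist[c] = sum(dists) / len(dists)
        own = cluster_labels[i]
        if own not in mean_dist:
            scores.append(0.0)  # singleton cluster: score is defined as 0
            continue
        a = mean_dist[own]  # cohesion: mean distance within the own cluster
        b = min(d for c, d in mean_dist.items() if c != own)  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Made-up example: two tight pairs of points, clustered correctly.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
clusters = [0, 0, 1, 1]
pure_score = homogeneity(["a", "a", "b", "b"], clusters)
mixed_score = homogeneity(["a", "b", "a", "b"], clusters)
silhouette_value = silhouette(points, clusters)
```

<p>Here the clusters that match the classes get a homogeneity of 1, while the clusters that each mix both classes equally get 0. The silhouette is high because the two pairs are far apart relative to their internal spacing. The sketch does not handle duplicate points whose a and b are both zero.</p>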

In [19]:
from sklearn.metrics import homogeneity_score, silhouette_score
homogeneity_kmeans = homogeneity_score(true_labels, kmeans_clusters)
print("Homogeneity score for KMeans:", homogeneity_kmeans)

silhouette_kmeans = silhouette_score(reduced_features, kmeans_clusters)
print("Silhouette score for KMeans:", silhouette_kmeans)

homogeneity_dbscan = homogeneity_score(true_labels, dbscan_clusters)
print("Homogeneity score for DBSCAN:", homogeneity_dbscan)

# Note: DBSCAN's noise points (label -1) are treated here as one more cluster.
silhouette_dbscan = silhouette_score(reduced_features, dbscan_clusters)
print("Silhouette score for DBSCAN:", silhouette_dbscan)

Homogeneity score for KMeans: 0.3769877861494049
Silhouette score for KMeans: 0.35088867
Homogeneity score for DBSCAN: 0.20004280943208474
Silhouette score for DBSCAN: 0.11665531
