## Import Necessary Libraries
The os module in Python exposes operating-system-dependent functionality, such as listing directories, building file paths, and reading environment variables. This notebook uses it to walk the image directories and build the path to each file.

NumPy is a Python library for numerical computing that provides large, multi-dimensional arrays and matrices, along with a collection of mathematical functions that operate on them. It is commonly used for scientific computing, data analysis, and machine learning.

Scikit-learn is a machine learning library for Python built on top of NumPy, SciPy, and matplotlib. It provides tools for data mining and data analysis behind a consistent interface, which makes it easy to swap between machine learning algorithms. It is widely used for classification, regression, clustering, and dimensionality reduction.

Scikit-image (imported as skimage) supplies the imread and resize functions used below to load each image and scale it to a fixed size.

In [2]:
import os

In [3]:
from skimage.io import imread
from skimage.transform import resize
import numpy as np


## Step 1: Prepare Data
The data lives in two directories, one per category (empty and not_empty). This step iterates over the list of categories, reads every image in each category directory, resizes it to a fixed size (15x15 pixels), flattens it into a 1D array, and appends that array to data, with the corresponding category index appended to labels. Finally, both lists are converted to NumPy arrays. This is a common pattern in machine learning tasks where images are the input for training a model.

In [4]:
input_dir = r'C:\Users\Hrushi\OneDrive\Desktop\picture classification\clf-data'
categories = ['empty','not_empty']

In [5]:
data = []
labels = []

In [None]:
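for category_idx, category in enumerate(categories):
    category_dir = os.path.join(input_dir, category)
    for file in os.listdir(category_dir):
        img_path = os.path.join(category_dir, file)
        img = imread(img_path)
        # Resize to a fixed 15x15 so every image yields the same number of features
        img = resize(img, (15, 15))
        # Flatten to 1D and record the category index as the label
        data.append(img.flatten())
        labels.append(category_idx)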

In [None]:
data = np.asarray(data)
labels = np.asarray(labels)
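
The shapes produced by this step can be checked with a minimal sketch. It assumes every image is already a 15x15 RGB array, so synthetic random arrays stand in for the output of resize. The helper make_fake_image and the count of four images per category are hypothetical and not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_fake_image():
    # Hypothetical stand-in for resize(imread(path), (15, 15)) on an RGB image
    return rng.random((15, 15, 3))


categories = ['empty', 'not_empty']
data, labels = [], []
for category_idx, category in enumerate(categories):
    for _ in range(4):  # pretend each category directory holds 4 images
        img = make_fake_image()
        data.append(img.flatten())   # 15 * 15 * 3 = 675 features per image
        labels.append(category_idx)  # 0 for 'empty', 1 for 'not_empty'

data = np.asarray(data)
labels = np.asarray(labels)
print(data.shape, labels.shape)
```

Under these assumptions data has shape (8, 675) and labels has shape (8,). If the real directories mix grayscale, RGB, and RGBA files, the flattened arrays would have different lengths (225, 675, and 900) and would not stack into a single 2D array, so it may be worth checking the channel count of each image.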

## Step 2: Train / Test Split
This step uses scikit-learn's train_test_split function to split the data and labels into training and testing sets. The arguments are:

- data: The input data, an array with one flattened image per row.
- labels: The corresponding label for each image.
- test_size: The proportion of the dataset to include in the test split (here 0.2, i.e. 20%).
- shuffle: Whether to shuffle the data before splitting.
- stratify: Ensures that the train and test sets have approximately the same proportion of class labels as the input dataset.

The call returns x_train (training data), x_test (testing data), y_train (training labels), and y_test (testing labels), ready for training and evaluating a machine learning model.
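
The effect of stratify can be seen in a small sketch on synthetic data. The 70/30 class imbalance and the random_state argument are assumptions added for illustration; the notebook itself does not fix a seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced dataset: 70 samples of class 0 and 30 of class 1
data = np.arange(100 * 4, dtype=float).reshape(100, 4)
labels = np.array([0] * 70 + [1] * 30)

x_train, x_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, shuffle=True, stratify=labels,
    random_state=0,  # assumption: fixed seed so the split is repeatable
)

print(x_train.shape, x_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```

With these inputs the split is 80 training and 20 testing samples, and the 70/30 class ratio carries over to both sets (56/24 in training, 14/6 in testing). Without stratify, a small test set could end up with a noticeably different class balance.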

In [1]:
from sklearn.model_selection import train_test_split

In [9]:
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, shuffle=True, stratify=labels)

## Step 3: Train Classifier
This step uses GridSearchCV to perform a grid search for hyperparameter tuning on a Support Vector Classifier (SVC). Here's a breakdown of the code:

- SVC(): Creates an instance of the Support Vector Classifier.
- parameters: Defines a dictionary of hyperparameters to search over. In this case, it lists three values for gamma and four values for C.
- GridSearchCV: Initializes a grid search with the classifier and the hyperparameter dictionary.
- grid_search.fit(x_train, y_train): Fits the grid search to the training data (x_train and y_train). This trains and cross-validates a classifier for every combination of the listed values and selects the combination with the best cross-validated performance.
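
To make the size of the search concrete, here is a minimal sketch that counts the candidate combinations and inspects the result of a fit. It runs on synthetic two-cluster data rather than the image features, so the arrays x and y are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

parameters = [{'gamma': [0.01, 0.001, 0.0001], 'C': [1, 10, 100, 1000]}]
n_candidates = len(ParameterGrid(parameters))
print(n_candidates)  # 3 gamma values x 4 C values

# Synthetic two-class data standing in for the flattened images
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, (40, 5)), rng.normal(3.0, 1.0, (40, 5))])
y = np.array([0] * 40 + [1] * 40)

grid_search = GridSearchCV(SVC(), parameters)
grid_search.fit(x, y)

print(grid_search.best_params_)  # the winning gamma / C combination
print(grid_search.best_score_)   # its mean cross-validated accuracy
print(len(grid_search.cv_results_['params']))  # one entry per candidate
```

The grid holds 12 candidates. In recent scikit-learn versions the default is 5-fold cross-validation, so the search fits 60 models plus one final refit of the best combination on all the training data. That cost grows quickly as more values are added to the grid.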

In [10]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

In [11]:
classifier = SVC()

In [12]:
parameters = [{'gamma': [0.01, 0.001, 0.0001], 'C': [1, 10, 100, 1000]}]

grid_search = GridSearchCV(classifier, parameters)

In [13]:
grid_search.fit(x_train, y_train)

## Step 4: Test Performance
This step retrieves the best estimator found by the grid search (best_estimator_) and uses it to predict labels for the test data (x_test). It then computes the accuracy score by comparing the predicted labels with the actual labels (y_test) and prints it as a percentage.

In [34]:
from sklearn.metrics import accuracy_score

In [35]:
best_estimator = grid_search.best_estimator_

y_prediction = best_estimator.predict(x_test)

In [36]:
score = accuracy_score(y_test, y_prediction)

In [37]:
print('{}% of samples were correctly classified'.format(score * 100))

100.0% of samples were correctly classified
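
Accuracy is simply the fraction of test samples whose predicted label matches the true label. The sketch below checks that by hand against accuracy_score, using hypothetical labels and predictions rather than the notebook's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions for ten test images
y_test = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
y_prediction = np.array([0, 0, 1, 1, 1, 1, 0, 0, 1, 0])

manual = np.mean(y_prediction == y_test)      # fraction of matching labels
score = accuracy_score(y_test, y_prediction)  # arguments are (y_true, y_pred)

print(manual, score)
```

Two of the ten hypothetical predictions are wrong, so both values come out to 0.8. Accuracy is the same whichever way round the two arrays are passed, but other scikit-learn metrics such as precision and recall depend on the (y_true, y_pred) order.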


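## Step 5: Save the Model
This step serializes the best estimator to disk with pickle so that it can be reloaded later without retraining.

In [38]:
import pickle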

In [39]:
# Use a context manager so the file is flushed and closed after writing
with open('./image_classification_model.p', 'wb') as f:
    pickle.dump(best_estimator, f)
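
A saved model can later be loaded back with pickle.load. The following is a minimal sketch of that round trip. It trains a tiny SVC on synthetic points as a stand-in for best_estimator and writes to a hypothetical path in a temporary directory, not to the notebook's file.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import SVC

# Tiny synthetic model standing in for best_estimator
x = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
model = SVC().fit(x, y)

# Hypothetical path in a temporary directory, not the notebook's file
model_path = os.path.join(tempfile.mkdtemp(), 'model.p')

with open(model_path, 'wb') as f:
    pickle.dump(model, f)

with open(model_path, 'rb') as f:
    loaded = pickle.load(f)

# The reloaded model should reproduce the original predictions
same = np.array_equal(model.predict(x), loaded.predict(x))
print(same)
```

When reusing the notebook's classifier this way, new images need the same preprocessing as the training data (resize to 15x15, then flatten). Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.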