1. Load the MNIST dataset: You can load the MNIST dataset using the mnist.load_data() function from tensorflow.keras.datasets. This dataset contains 28x28 grayscale images of handwritten digits (0 through 9).

2. Preprocess the data: Normalize the pixel values of the images to be between 0 and 1. Reshape the images to a suitable format for the autoencoder model.

3. Split the dataset: mnist.load_data() already returns separate training and testing sets (60,000 and 10,000 images), so no further split is required.

4. Build the autoencoder model:
    1. Define an encoder model that reduces the dimensionality of the input images.
    2. Define a decoder model that reconstructs the original input from the encoded representation.
    3. Combine the encoder and decoder to form the autoencoder model.
5. Compile the model: Compile the autoencoder model with an appropriate loss function and optimizer.

6. Train the model: Train the autoencoder model using the training data.

7. Evaluate the model: Evaluate the performance of the autoencoder using the testing data.

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, adjusted_rand_score
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, losses
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Model

In [None]:
# Step 1: Load the MNIST dataset
(x_train, _), (x_test, _) = mnist.load_data()

# Step 2: Preprocess the data
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255

# Reshape the images to 28x28x1 (single channel)
x_train = np.reshape(x_train, (len(x_train), 28, 28, 1))
x_test = np.reshape(x_test, (len(x_test), 28, 28, 1))
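Step 3 has no cell of its own because the data arrives already split. If a validation subset is also wanted for monitoring training, one option is to carve it out of the training array with train_test_split, which is imported above but otherwise unused. The sketch below runs on a random stand-in array so it does not need the MNIST download; the 10% fraction and the names x_train_demo, x_fit, and x_val are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the preprocessed x_train: 1,000 fake images with values in [0, 1).
rng = np.random.default_rng(0)
x_train_demo = rng.random((1000, 28, 28, 1), dtype=np.float32)

# Hold out 10% of the training images for validation (the fraction is an assumption).
x_fit, x_val = train_test_split(x_train_demo, test_size=0.1, random_state=42)

print(x_fit.shape, x_val.shape)
```

With the real arrays, the held-out part could be passed to training as validation_data=(x_val, x_val), since an autoencoder uses its input as its target.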

In [None]:
# Step 4: Build the autoencoder model
encoder_input = layers.Input(shape=(28, 28, 1))
x = layers.Conv2D(16, (3, 3), activation='relu', padding='same')(encoder_input)  # 28x28x16
x = layers.MaxPooling2D((2, 2), padding='same')(x)  # 14x14x16
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)  # 14x14x8
x = layers.MaxPooling2D((2, 2), padding='same')(x)  # 7x7x8
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)  # 7x7x8
encoded = layers.MaxPooling2D((2, 2), padding='same')(x)  # 4x4x8 bottleneck
encoder = Model(encoder_input, encoded, name='encoder')

In [None]:
decoder_input = layers.Input(shape=(4, 4, 8))
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(decoder_input)
x = layers.UpSampling2D((2, 2))(x)
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = layers.UpSampling2D((2, 2))(x)
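# No padding='same' here on purpose: the default 'valid' convolution shrinks
# 16x16 to 14x14, so the final upsampling lands exactly on 28x28.
x = layers.Conv2D(16, (3, 3), activation='relu')(x)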
x = layers.UpSampling2D((2, 2))(x)
decoded = layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

In [None]:
decoder = Model(decoder_input, decoded, name='decoder')
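One decoder convolution is the only layer in the model without padding='same'. A quick way to see why is to trace the spatial size through both halves of the network. The sketch below is plain Python; the helper names (pool_same, conv_valid, upsample) are made up for illustration and only reproduce the usual output-size arithmetic for stride-1 convolutions, 2x2 pooling, and 2x upsampling. They are not Keras functions.

```python
import math

def pool_same(size, pool=2):
    """Output size of 2x2 max pooling with padding='same' (rounds up)."""
    return math.ceil(size / pool)

def conv_valid(size, kernel=3):
    """Output size of a stride-1 convolution with padding='valid'."""
    return size - kernel + 1

def upsample(size, factor=2):
    """Output size of nearest-neighbour upsampling."""
    return size * factor

# Encoder: 'same' convolutions keep the size, each pooling layer halves it.
encoder_sizes = [28]
for _ in range(3):
    encoder_sizes.append(pool_same(encoder_sizes[-1]))

# Decoder: upsample, upsample, 'valid' convolution, upsample.
size = encoder_sizes[-1]
decoder_sizes = [size]
for step in (upsample, upsample, conv_valid, upsample):
    size = step(size)
    decoder_sizes.append(size)

print("encoder:", encoder_sizes)
print("decoder:", decoder_sizes)
```

Because 7 rounds up to 4 in the last pooling step, three plain upsamplings would give 32 rather than 28. The 'valid' convolution removes the extra two pixels per axis at the 16x16 stage.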

In [None]:
autoencoder_input = layers.Input(shape=(28, 28, 1))
encoded_img = encoder(autoencoder_input)
decoded_img = decoder(encoded_img)
autoencoder = Model(autoencoder_input, decoded_img, name='autoencoder')

In [None]:
# Step 5: Compile the model
autoencoder.compile(optimizer='adam', loss='mse')

In [None]:
# Step 6: Train the model
autoencoder.fit(x_train, x_train,
                epochs=90,
                batch_size=128,
                shuffle=True)

In [None]:
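# Step 7: Evaluate the model
test_loss = autoencoder.evaluate(x_test, x_test)  # mean squared reconstruction error on the test set
# Reconstruct the test images for visual comparison
decoded_imgs = autoencoder.predict(x_test)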
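Beyond a single aggregate loss, a per-image error shows which test digits the autoencoder reconstructs poorly. The following is a minimal NumPy sketch that uses small stand-in arrays in place of x_test and decoded_imgs; per_image_mse is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import numpy as np

def per_image_mse(originals, reconstructions):
    """Mean squared error per image, averaged over height, width, and channel."""
    diff = originals.astype(np.float64) - reconstructions.astype(np.float64)
    return np.mean(diff ** 2, axis=(1, 2, 3))

# Stand-ins for x_test and decoded_imgs: three blank 28x28x1 images.
originals = np.zeros((3, 28, 28, 1), dtype=np.float32)
reconstructions = originals.copy()
reconstructions[1] += 0.5  # make the second reconstruction wrong by 0.5 everywhere

errors = per_image_mse(originals, reconstructions)
worst = int(np.argmax(errors))
print("per-image MSE:", errors)
print("worst reconstruction:", worst)
```

Applied to the real arrays, sorting the errors would give a shortlist of digits worth plotting next to their reconstructions.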

In [None]:
""" n = 10
plt.figure(figsize=(20, 4))
for i in range(n):
  # display original
  ax = plt.subplot(2, n, i + 1)
  plt.imshow(x_test[i])
  plt.title("original")
  plt.gray()
  ax.get_xaxis().set_visible(False)
  ax.get_yaxis().set_visible(False)

  # display reconstruction
  ax = plt.subplot(2, n, i + 1 + n)
  plt.imshow(decoded_imgs[i])
  plt.title("reconstructed")
  plt.gray()
  ax.get_xaxis().set_visible(False)
  ax.get_yaxis().set_visible(False)
plt.show() """

To encode new data with the trained autoencoder and then cluster the encoded representations, follow these steps:

1. Load the trained autoencoder model: Reuse the encoder trained above, which is still in memory in this notebook.

2. Load and preprocess the new data: Load the new data from the CSV file, scale the pixel values to between 0 and 1, and reshape each row to 28x28x1.

3. Encode the new data: Use the encoder part of the autoencoder to map each image into a lower-dimensional space (4x4x8 = 128 values instead of 784 pixels).

4. Perform clustering on the encoded representations: Apply a clustering algorithm, such as K-means, to the flattened encoded representations.

5. Generate the submission file: Write each sample's ID and its cluster label to a CSV file.

In [None]:

# Step 2: Load and preprocess the new data
Dr_Ahmads_Data = pd.read_csv('data.csv')

# Drop the ID column, reshape the pixel columns to 28x28x1 images, and scale to [0, 1]
X = Dr_Ahmads_Data.drop("ID", axis=1).values.reshape(-1, 28, 28, 1) / 255.0
print(X.shape)
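Note that reshape(-1, 28, 28, 1) does not check that each row holds exactly 784 pixels. If the CSV had a different number of pixel columns, the reshape could either fail or silently mix values from neighbouring rows. A guarded version might look like the sketch below. The helper name to_image_batch is hypothetical, and the synthetic frame assumes the layout implied by the cell above (one ID column followed by 784 pixel columns with values from 0 to 255).

```python
import numpy as np
import pandas as pd

def to_image_batch(df, id_column="ID", side=28):
    """Convert rows of flattened pixels into a (N, side, side, 1) batch in [0, 1]."""
    pixels = df.drop(columns=[id_column]).to_numpy(dtype=np.float32)
    expected = side * side
    if pixels.shape[1] != expected:
        raise ValueError(f"expected {expected} pixel columns, found {pixels.shape[1]}")
    return pixels.reshape(-1, side, side, 1) / 255.0

# Tiny synthetic frame standing in for data.csv (assumed layout, random pixels).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 256, size=(5, 784)))
demo.insert(0, "ID", np.arange(5))

batch = to_image_batch(demo)
print(batch.shape, float(batch.min()), float(batch.max()))
```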

In [None]:
# X already matches the encoder's input shape: (batch_size, 28, 28, 1)

# Step 3: Encode the new data
encoded_data = encoder.predict(X)
# Flatten each 4x4x8 encoding into a 128-dimensional feature vector for K-means
encoded_data_flat = encoded_data.reshape(encoded_data.shape[0], -1)
print(encoded_data_flat.shape)

In [None]:
# Grid search to find the best KMeans params (disabled; kept for reference)
""" from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_clusters': [8, 9, 10, 11, 12],  
    'init': ['k-means++', 'random'],   
    'n_init': [10, 20, 30],             
    'max_iter': [100, 200, 300]         
}

# Create a KMeans estimator
kmeans = KMeans(random_state=52)

# Perform GridSearch
grid_search = GridSearchCV(estimator=kmeans, param_grid=param_grid, cv=5)
grid_search.fit(encoded_data_flat)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Fit the data using the best parameters
best_kmeans = KMeans(**best_params, random_state=52)
best_clusters = best_kmeans.fit_predict(encoded_data_flat) """
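One caveat about the search above: without a scoring argument, GridSearchCV falls back to KMeans.score, which is the negative of the K-means objective (inertia). Inertia generally keeps falling as n_clusters grows, so such a search tends to favour the largest value in the grid. A commonly used alternative is to compare candidate values of k with the silhouette score. The sketch below is a swapped-in technique, not the author's method, and it runs on synthetic blobs rather than encoded_data_flat; the blob centres and the range of k are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for encoded_data_flat: three well-separated blobs in 8 dimensions.
centers = np.array([
    [0.0] * 8,
    [10.0] * 8,
    [10.0, -10.0] * 4,
])
features, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

# Score each candidate k by the mean silhouette coefficient (higher is better).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(features)
    scores[k] = silhouette_score(features, labels)

best_k = max(scores, key=scores.get)
print("silhouette by k:", {k: round(v, 3) for k, v in scores.items()})
print("best k:", best_k)
```

On real encodings the silhouette curve is usually much flatter than on these blobs, so it is better read as a hint than as a definitive answer.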

In [None]:
# Fit K-means with the parameters chosen from the grid search
best_kmeans = KMeans(max_iter=300, n_clusters=12, random_state=42)
best_clusters = best_kmeans.fit_predict(encoded_data_flat)

In [None]:
# Step 5: Generate Submission File
submission = pd.DataFrame({'ID': Dr_Ahmads_Data['ID'], 'Label': best_clusters})
submission.to_csv('AutoEncoderKmeans_submission.csv', index=False)
# Result: a score of 0.43, which is disappointing.
# A VGG16-based variant was also tried and also scored poorly.
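The imports at the top include adjusted_rand_score and accuracy_score, but neither is used in the cells above. When labeled data is available (the MNIST labels discarded in Step 1 would be one option), these metrics allow a rough offline check of clustering quality before submitting. The toy example below uses made-up label arrays and shows one property worth keeping in mind: the adjusted Rand index ignores how clusters are numbered, while plain accuracy does not. Which metric produced the 0.43 score is not stated here, so this is only a sketch of the distinction.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Made-up ground truth: three classes with three samples each.
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# The same grouping, but with cluster IDs permuted (0 -> 2, 1 -> 0, 2 -> 1),
# which is what K-means may return since its cluster numbering is arbitrary.
cluster_ids = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])

ari = adjusted_rand_score(true_labels, cluster_ids)
acc = accuracy_score(true_labels, cluster_ids)
print("adjusted Rand index:", ari)
print("accuracy:", acc)
```

If a leaderboard compares cluster IDs to class labels directly, the clusters would first need to be mapped to labels, and a run with 12 clusters cannot map one-to-one onto 10 digits.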