### Data Preparation and Normalization



In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the cleaned dataset
train_data_cleaned = pd.read_csv("C:/project/fashion-recommender-system/data/processed/fashion-mnist_train_cleaned.csv")

# Separate features and labels
X = train_data_cleaned.iloc[:, 1:].values  # Pixel values
y = train_data_cleaned.iloc[:, 0].values  # Labels (assuming the first column is the label)

# Normalize pixel values (0-255 to 0-1)
X = X / 255.0

# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Optionally standardize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

print(f"Training set size: {X_train.shape}")
print(f"Validation set size: {X_val.shape}")


Training set size: (47965, 784)
Validation set size: (11992, 784)


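1. Pixel values are first scaled from 0-255 to 0-1, and the dataset is then split 80/20 into stratified training (47,965 rows) and validation (11,992 rows) sets.
2. StandardScaler is fit on the training set only and applied to both sets, so each pixel feature has zero mean and unit variance on the training data. The values passed to the models are therefore no longer confined to 0-1. Standardization typically helps gradient-based solvers converge.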
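
One consequence of the two scaling steps can be checked directly: StandardScaler subtracts each feature's mean and divides by its standard deviation, so dividing by 255 beforehand should not change its output. The sketch below illustrates this on synthetic data. The array names and sizes are invented for the illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic "pixel" data (hypothetical stand-in for the real images)
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(200, 16)).astype(np.float64)
train_raw, val_raw = raw[:160], raw[160:]

# Fit one scaler on raw 0-255 values and one on 0-1 values,
# in both cases using the training split only
scaler_raw = StandardScaler().fit(train_raw)
scaler_unit = StandardScaler().fit(train_raw / 255.0)

val_from_raw = scaler_raw.transform(val_raw)
val_from_unit = scaler_unit.transform(val_raw / 255.0)

# Standardization removes any constant positive rescaling of the inputs
same_result = np.allclose(val_from_raw, val_from_unit)
train_means = scaler_raw.transform(train_raw).mean(axis=0)
print("Identical after standardization:", same_result)
```

Under these assumptions the division by 255 is harmless but redundant for models trained on the standardized data. It would still matter for a model fed the 0-1 values directly.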


### Model Selection and Training (Logistic Regression)

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize Logistic Regression model
log_reg = LogisticRegression(max_iter=100, solver='saga', multi_class='multinomial')

# Train the model
log_reg.fit(X_train, y_train)

# Predict on the validation set
y_pred_log_reg = log_reg.predict(X_val)

# Evaluate the model
log_reg_accuracy = accuracy_score(y_val, y_pred_log_reg)
print(f"Logistic Regression Validation Accuracy: {log_reg_accuracy:.4f}")




Logistic Regression Validation Accuracy: 0.8568




1. Logistic Regression, a simple linear classifier, reached a validation accuracy of 0.8568.
2. This serves as the baseline against which the more complex models are compared.
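
With the saga solver and max_iter=100, it is worth checking whether the solver stopped because it converged or because it ran out of iterations. The sketch below shows one way to inspect this, using a small synthetic dataset. The helper name `fit_and_check` and the data are hypothetical; the notebook itself does not record whether a ConvergenceWarning was raised.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem (hypothetical stand-in for the real data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=0,
)

def fit_and_check(max_iter):
    """Fit with saga; return the epochs used and whether a ConvergenceWarning was raised."""
    model = LogisticRegression(solver="saga", max_iter=max_iter, random_state=0)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X_demo, y_demo)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return int(model.n_iter_.max()), hit_limit

for limit in (5, 2000):
    epochs, hit_limit = fit_and_check(limit)
    print(f"max_iter={limit}: epochs used={epochs}, hit limit={hit_limit}")
```

If the limit is reached on the real data, raising max_iter may change the baseline accuracy, so the 0.8568 figure is best read as the result for this specific setting.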


### Model Training (Support Vector Machine - SVM)

In [3]:
from sklearn.svm import SVC

# Initialize Support Vector Machine model
svm_model = SVC(kernel='linear')

# Train the model
svm_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_svm = svm_model.predict(X_val)

# Evaluate the model
svm_accuracy = accuracy_score(y_val, y_pred_svm)
print(f"SVM Validation Accuracy: {svm_accuracy:.4f}")


SVM Validation Accuracy: 0.8439


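1. The SVM with a linear kernel reached a validation accuracy of 0.8439, slightly below Logistic Regression (0.8568), so it did not separate the classes better on this data.
2. Training SVC is considerably more computationally expensive than Logistic Regression on a dataset of this size, so this step mainly tests whether the extra cost pays off. Here it does not.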
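
Since the cost of the kernel-based SVC grows quickly with the number of training rows, a common alternative for a linear decision boundary is LinearSVC. It is not the model used above: it optimizes a different objective by default (squared hinge loss, one-vs-rest), so its accuracy can differ. The following is a minimal sketch on synthetic data, with all names and sizes chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic 3-class data (hypothetical stand-in for the real images)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=10,
    n_classes=3, random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Standardize using training statistics only, as in the notebook
scaler = StandardScaler().fit(X_tr)
X_tr, X_va = scaler.transform(X_tr), scaler.transform(X_va)

# Same kind of model as in the cell above
kernel_svm = SVC(kernel="linear").fit(X_tr, y_tr)
# Alternative linear SVM implementation (a swapped-in technique, not the author's)
linear_svm = LinearSVC(max_iter=5000, random_state=0).fit(X_tr, y_tr)

kernel_acc = kernel_svm.score(X_va, y_va)
linear_acc = linear_svm.score(X_va, y_va)
print(f"SVC(kernel='linear'): {kernel_acc:.4f}")
print(f"LinearSVC:            {linear_acc:.4f}")
```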


### Model Training (Random Forest Classifier)

In [4]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_rf = rf_model.predict(X_val)

# Evaluate the model
rf_accuracy = accuracy_score(y_val, y_pred_rf)
print(f"Random Forest Validation Accuracy: {rf_accuracy:.4f}")


Random Forest Validation Accuracy: 0.8797


1. Random Forest, an ensemble of decision trees, reached a validation accuracy of 0.8797.
2. This is the highest of the three models so far, ahead of Logistic Regression (0.8568) and the linear SVM (0.8439).
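
The comparison can be made explicit with the accuracies printed in the cell outputs above. The sketch below ranks the three models and attaches a rough standard error to each figure, assuming the 11,992 validation examples behave like independent trials. That assumption is a simplification made for this illustration.

```python
import math

# Validation accuracies reported in the cell outputs above
validation_accuracy = {
    "Logistic Regression": 0.8568,
    "SVM (linear kernel)": 0.8439,
    "Random Forest": 0.8797,
}
n_val = 11992  # validation set size printed by the first cell

def standard_error(acc, n):
    """Binomial standard error of an accuracy estimate from n independent examples."""
    return math.sqrt(acc * (1.0 - acc) / n)

ranking = sorted(validation_accuracy.items(), key=lambda item: item[1], reverse=True)
best_name, best_acc = ranking[0]

for name, acc in ranking:
    se = standard_error(acc, n_val)
    print(f"{name:<22} {acc:.4f} +/- {se:.4f}  (gap to best: {best_acc - acc:.4f})")
```

The gaps between the three models are several times larger than the standard error of roughly 0.003, so the ordering is unlikely to be an artifact of the particular validation split.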



### Model Training (Convolutional Neural Network - CNN)

In [5]:
"""import tensorflow as tf
from tensorflow import keras

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
# Reshape the data for CNN input (28x28 pixels, 1 channel)
# Assuming you have X_train and X_val defined
X_train_cnn = X_train.reshape(-1, 28, 28, 1)
X_val_cnn = X_val.reshape(-1, 28, 28, 1)

# Build a CNN model
cnn_model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),  # Added activation and input_shape
    MaxPooling2D(pool_size=(2, 2)),  # Corrected to MaxPooling2D
    Flatten(),
    Dense(10, activation='softmax')  # Assuming 10 output classes
])

# Compile the CNN model
cnn_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
cnn_model.fit(X_train_cnn, y_train, epochs=10, batch_size=128, validation_data=(X_val_cnn, y_val))

# Evaluate the CNN model
cnn_eval = cnn_model.evaluate(X_val_cnn, y_val)
print(f"CNN Validation Accuracy: {cnn_eval[1]:.4f}")"""




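1. The CNN is intended to improve classification accuracy by using convolutional layers for feature extraction, which suits image data such as these 28x28 grayscale images.
2. The cell above is wrapped in a string literal, so the model was not trained in this run and no CNN accuracy is reported. CNNs often outperform traditional machine learning models on image classification, but that is not verified here.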
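
The layer sizes implied by the architecture above can be worked out by hand. The sketch below does the arithmetic, assuming the Keras defaults that the cell relies on: no padding and stride 1 for Conv2D, and a pooling stride equal to the pool size for MaxPooling2D. The helper name `valid_output_size` is made up for this example.

```python
def valid_output_size(size, window, stride=1):
    """Output length along one axis for an unpadded ('valid') sliding window."""
    return (size - window) // stride + 1

height = 28                 # the images are 28x28 with 1 channel
channels_in, filters = 1, 32
n_classes = 10

conv_side = valid_output_size(height, window=3)               # 3x3 convolution
pool_side = valid_output_size(conv_side, window=2, stride=2)  # 2x2 max pooling
flat_features = pool_side * pool_side * filters               # length after Flatten

# Each filter has 3*3*channels_in weights plus one bias
conv_params = (3 * 3 * channels_in + 1) * filters
# Each output unit has one weight per flattened feature plus one bias
dense_params = (flat_features + 1) * n_classes
total_params = conv_params + dense_params

print(f"Conv2D output:  {conv_side}x{conv_side}x{filters}")
print(f"Pooling output: {pool_side}x{pool_side}x{filters}")
print(f"Flatten length: {flat_features}")
print(f"Trainable parameters: {total_params}")
```

Almost all of the parameters sit in the final Dense layer, which is typical for a network with a single convolutional block.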


### Hyperparameter Tuning (Using Grid Search)

In [7]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize Grid Search
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, verbose=1, n_jobs=-1)

# Fit the model using grid search
grid_search.fit(X_train, y_train)

# Best parameters and accuracy
print(f"Best Parameters: {grid_search.best_params_}")
best_rf_accuracy = grid_search.score(X_val, y_val)
print(f"Best Random Forest Validation Accuracy: {best_rf_accuracy:.4f}")


Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best Parameters: {'max_depth': 30, 'min_samples_split': 2, 'n_estimators': 200}
Best Random Forest Validation Accuracy: 0.8793


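1. Grid Search tunes the Random Forest by fitting all 27 parameter combinations with 3-fold cross-validation (81 fits) on the training set.
2. The best combination is chosen by mean cross-validation accuracy, not by accuracy on the validation set. The tuned model scored 0.8793 on the validation set, essentially the same as the untuned Random Forest (0.8797), so tuning did not improve performance here.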
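
To make the distinction between the cross-validation score and the held-out score concrete, the sketch below counts the candidates in the grid used above and then runs a much smaller search on synthetic data. The small grid, the data, and the variable names are assumptions made to keep the example fast. They are not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split

# The grid from the cell above: 3 x 3 x 3 candidates, each fit once per fold
full_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, 30],
    "min_samples_split": [2, 5, 10],
}
n_candidates = len(ParameterGrid(full_grid))
n_fits = n_candidates * 3  # cv=3
print(f"{n_candidates} candidates, {n_fits} fits")

# Hypothetical small problem and reduced grid so the sketch runs quickly
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)
small_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(X_tr, y_tr)

# Mean accuracy across the CV folds of the training data (used to pick the winner)
cv_accuracy = search.best_score_
# Accuracy of the refit best model on data the search never saw
holdout_accuracy = search.score(X_va, y_va)
print(f"Best params: {search.best_params_}")
print(f"CV accuracy: {cv_accuracy:.4f}, held-out accuracy: {holdout_accuracy:.4f}")
```

Passing a fixed random_state to the estimator, as done here, makes the search repeatable. The notebook's search uses RandomForestClassifier() without one, so its results may vary slightly between runs.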


### Model Evaluation (Confusion Matrix and Classification Report)

In [8]:
from sklearn.metrics import confusion_matrix, classification_report

# Confusion matrix and classification report for the Random Forest model
# predict() already returns class labels (a 1-D array), so no argmax is needed
rf_model_predictions = rf_model.predict(X_val)

# Generate confusion matrix
conf_matrix = confusion_matrix(y_val, rf_model_predictions)
print("Confusion Matrix:\n", conf_matrix)

# Classification report
class_report = classification_report(y_val, rf_model_predictions)
print("Classification Report:\n", class_report)

1. The confusion matrix and classification report show how well the model performs in classifying each label, providing insight into misclassifications.
2. This is crucial for understanding model performance beyond accuracy.
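
A detail that matters when building these reports: scikit-learn classifiers return class labels directly from predict, as a 1-D array. An argmax over axis 1 only applies to per-class scores, such as the output of predict_proba or a Keras softmax layer. The sketch below shows both routes on synthetic data and how accuracy can be read off the confusion matrix. The data and names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class data (hypothetical stand-in for the real images)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)

labels = clf.predict(X_va)        # shape (n_samples,): class labels
proba = clf.predict_proba(X_va)   # shape (n_samples, n_classes): class probabilities

# argmax gives a column index, which classes_ maps back to a label
labels_from_proba = clf.classes_[proba.argmax(axis=1)]

# Rows are true classes, columns are predicted classes
conf = confusion_matrix(y_va, labels)
accuracy_from_matrix = np.trace(conf) / conf.sum()
print(conf)
print(f"Accuracy from the diagonal: {accuracy_from_matrix:.4f}")
```

Off-diagonal entries show which pairs of classes are confused with each other, which is the information a single accuracy number hides.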


### Save the Best Model

In [10]:
import joblib

# Save the tuned Random Forest found by grid search using joblib
# (a Keras model such as the CNN would be saved with its own save() method instead)
joblib.dump(grid_search.best_estimator_, 'best_rf_model.pkl')


['best_rf_model.pkl']
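
The saved file contains only the Random Forest. The StandardScaler fitted in the first cell is not part of it, so new data would need the same standardization before calling predict on the reloaded model. One possible way to keep the two together, offered here as a suggestion rather than what the notebook does, is to wrap them in a scikit-learn Pipeline and save that. The sketch uses synthetic data and a temporary, hypothetical file name.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical stand-in for the real images)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Scaler and classifier travel together as one object
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=20, random_state=0),
)
pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline applies the stored scaling before predicting
same_predictions = np.array_equal(pipeline.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

Files written by joblib are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that created them.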