### QUESTION 1

Use Multiple Linear Regression for this task. Display the coefficients of the model and calculate the MAE (Mean Absolute Error) and MSE (Mean Squared Error). Research RMSE (Root Mean Squared Error) and explain the trade-offs between these metrics. Finally, report the RMSE score of your model.
Perform this task using both LinearRegression and SGDRegressor.
Additionally, study the MAPE (Mean Absolute Percentage Error) metric using this link, and apply it to evaluate your model.
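
To make the four metrics named in the task concrete, here is a minimal sketch that computes them by hand in plain Python. The helper name `regression_metrics` and the toy prices are hypothetical and are not taken from the housing dataset. MAPE is returned as a fraction, the same convention as `mean_absolute_percentage_error`.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and MAPE for two equal-length sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE divides by the true value, so it is undefined when a target is 0.
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}

# Hypothetical prices: every prediction is off by 10, then one is off by 100.
y_true = [100.0, 200.0, 300.0, 400.0]
even = regression_metrics(y_true, [110.0, 190.0, 310.0, 390.0])
outlier = regression_metrics(y_true, [110.0, 190.0, 310.0, 300.0])

print(even)
print(outlier)
```

When all errors have the same size, MAE and RMSE coincide (both 10 here). With a single large error, MAE rises to 32.5 while RMSE rises to about 50.7, which illustrates why squared-error metrics react more strongly to outliers.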


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error

# ===============================
# Step 1: Load and Prepare Data
# ===============================

# Load housing dataset
df = pd.read_csv(r"C:\Users\rtape\Downloads\Seneca\CVI620NSB_Summer2025\codes\Assignment2\Q1\house_price.csv")

# Select features (size, bedroom) and target (price)
X = df[['size', 'bedroom']]
y = df['price']

# Split into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ===============================
# Step 2: Linear Regression Model
# ===============================

print("LINEAR REGRESSION")
print("-" * 40)

# Initialize and train Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Display learned coefficients
coeff_df = pd.DataFrame(lr_model.coef_, X.columns, columns=['Coefficient'])
print("Coefficients:")
print(coeff_df)
print(f"Intercept: {lr_model.intercept_:.2f}")

# Predict on test data and evaluate performance
y_pred_lr = lr_model.predict(X_test)
lr_mae = mean_absolute_error(y_test, y_pred_lr)
lr_mse = mean_squared_error(y_test, y_pred_lr)
lr_rmse = np.sqrt(lr_mse)
lr_mape = mean_absolute_percentage_error(y_test, y_pred_lr)

print("\nMetrics:")
print(f"MAE:  {lr_mae:.2f}")
print(f"MSE:  {lr_mse:.2f}")
print(f"RMSE: {lr_rmse:.2f}")
print(f"MAPE: {lr_mape:.4f} ({lr_mape*100:.2f}%)")

# ===============================
# Step 3: SGD Regressor Model
# ===============================

print("\n\nSGD REGRESSOR")
print("-" * 40)

# Standardize features (SGD is sensitive to feature scale and can diverge on raw values)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train SGD Regressor
sgd_model = SGDRegressor(max_iter=1000, random_state=42)
sgd_model.fit(X_train_scaled, y_train)

# Display learned coefficients (note: scaled features)
sgd_coeff_df = pd.DataFrame(sgd_model.coef_, X.columns, columns=['Coefficient'])
print("Coefficients (scaled features):")
print(sgd_coeff_df)
print(f"Intercept: {sgd_model.intercept_[0]:.2f}")

# Predict and evaluate performance
y_pred_sgd = sgd_model.predict(X_test_scaled)
sgd_mae = mean_absolute_error(y_test, y_pred_sgd)
sgd_mse = mean_squared_error(y_test, y_pred_sgd)
sgd_rmse = np.sqrt(sgd_mse)
sgd_mape = mean_absolute_percentage_error(y_test, y_pred_sgd)

print("\nMetrics:")
print(f"MAE:  {sgd_mae:.2f}")
print(f"MSE:  {sgd_mse:.2f}")
print(f"RMSE: {sgd_rmse:.2f}")
print(f"MAPE: {sgd_mape:.4f} ({sgd_mape*100:.2f}%)")

# ===============================
# Step 4: Model Comparison
# ===============================

print("\n\nMODEL COMPARISON")
print("-" * 40)

# Create a comparison table of evaluation metrics
comparison = pd.DataFrame({
    'Metric': ['MAE', 'MSE', 'RMSE', 'MAPE'],
    'LinearRegression': [lr_mae, lr_mse, lr_rmse, lr_mape],
    'SGDRegressor': [sgd_mae, sgd_mse, sgd_rmse, sgd_mape]
})
print(comparison.round(4))

# ===============================
# Step 5: Metrics Explanation
# ===============================

print("\n\nMETRICS TRADE-OFFS")
print("-" * 40)
print("MAE  - Mean absolute error; same units as target, less sensitive to outliers.")
print("MSE  - Mean squared error; squaring penalizes large errors, units are squared.")
print("RMSE - Square root of MSE; same units as target, still outlier-sensitive.")
print("MAPE - Mean absolute % error; scale-independent, unstable near zero targets.")

print("\nNote: RMSE is often preferred because it:")
print("- Uses the same unit as the target variable")
print("- Penalizes larger errors more than MAE")
print("- Is more interpretable than MSE in most cases")


LINEAR REGRESSION
----------------------------------------
Coefficients:
          Coefficient
size       143.218532
bedroom -13512.564426
Intercept: 84763.62

Metrics:
MAE:  72334.75
MSE:  8610424544.78
RMSE: 92792.37
MAPE: 0.1746 (17.46%)


SGD REGRESSOR
----------------------------------------
Coefficients (scaled features):
           Coefficient
size     106535.910237
bedroom  -10274.951289
Intercept: 323155.83

Metrics:
MAE:  72124.61
MSE:  8595003325.39
RMSE: 92709.24
MAPE: 0.1740 (17.40%)


MODEL COMPARISON
----------------------------------------
  Metric  LinearRegression  SGDRegressor
0    MAE      7.233475e+04  7.212461e+04
1    MSE      8.610425e+09  8.595003e+09
2   RMSE      9.279237e+04  9.270924e+04
3   MAPE      1.746000e-01  1.740000e-01


METRICS TRADE-OFFS
----------------------------------------
MAE  - Mean absolute error; same units as target, less sensitive to outliers.
MSE  - Mean squared error; squaring penalizes large errors, units are squared.
RMSE - Square root of MSE; same units as target, still outlier-sensitive.
MAPE - Mean absolute % error; scale-independent, unstable near zero targets.

Note: RMSE is often preferred because it:
- Uses the same unit as the target variable
- Penalizes larger errors more than MAE
- Is more interpretable than MSE in most cases
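
The two coefficient tables above are not directly comparable: LinearRegression was fit on raw features, while SGDRegressor was fit on standardized ones, so its weights are per standard deviation rather than per unit. The sketch below shows one way to convert scaled-space weights back to raw units, assuming standardization of the form z = (x - mean) / scale. The function names and all numbers are hypothetical; with a fitted `StandardScaler` the statistics would come from `scaler.mean_` and `scaler.scale_`.

```python
def unscale_linear_model(coef_scaled, intercept_scaled, means, scales):
    """Convert a linear model fit on z = (x - mean) / scale to raw-x units."""
    coef_raw = [w / s for w, s in zip(coef_scaled, scales)]
    intercept_raw = intercept_scaled - sum(
        w * m / s for w, m, s in zip(coef_scaled, means, scales)
    )
    return coef_raw, intercept_raw

def predict(coef, intercept, x):
    """Evaluate intercept + sum(coef_j * x_j) for one sample."""
    return intercept + sum(w * v for w, v in zip(coef, x))

# Hypothetical scaler statistics and scaled-space weights (not from the dataset).
means, scales = [2000.0, 3.0], [800.0, 1.0]
coef_scaled, intercept_scaled = [100000.0, -10000.0], 320000.0

coef_raw, intercept_raw = unscale_linear_model(
    coef_scaled, intercept_scaled, means, scales
)

# Both parameterizations should give the same prediction for the same house.
x = [2400.0, 4.0]
z = [(v - m) / s for v, m, s in zip(x, means, scales)]
print(coef_raw, intercept_raw)
print(predict(coef_scaled, intercept_scaled, z), predict(coef_raw, intercept_raw, x))
```

After this conversion, the SGD weights can be read in the same units as the LinearRegression coefficients (price per unit of size, price per bedroom).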

### QUESTION 2

For the Cat and Dog dataset provided in the Q2 folder, perform classification using all the methods you know and try to achieve the best possible result. Compare the algorithms carefully and tune the parameters so that the best result can be obtained.
Save the trained model and test it on several images from the internet. Was the model able to correctly predict the images?


In [6]:
# Import necessary libraries
import cv2
import numpy as np
import os
import glob
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
import joblib

# Path to the dataset directory
base_path = r"C:\Users\rtape\Downloads\Seneca\CVI620NSB_Summer2025\codes\Assignment2\Q2"

# -------------------
# Step 1: Load and preprocess data
# -------------------

data = []
labels = []

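# Loop through all image files in the dataset.
# Note: the pattern matches <split>/<class>/<file>, so images in the 'test'
# folder are pooled with the rest before the random split in Step 2.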
for address in glob.glob(os.path.join(base_path, '*', '*', '*')):
    img = cv2.imread(address)  # Read image
    if img is None:
        continue  # Skip if image couldn't be loaded
    
    img = cv2.resize(img, (32, 32))  # Resize to 32x32 pixels
    img = img.flatten() / 255.0      # Flatten to 1D array and normalize pixel values
    data.append(img)
    
    # Assign label based on folder name: 0 = Cat, 1 = Dog
    labels.append(0 if 'Cat' in address else 1)

# Convert to NumPy arrays
X = np.array(data)
y = np.array(labels)

# -------------------
# Step 2: Split and scale data
# -------------------

# Split dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features to have 0 mean and unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# -------------------
# Step 3: Train models using GridSearchCV
# -------------------

models = {}

# K-Nearest Neighbors
knn_params = {'n_neighbors': [1, 3, 5, 7], 'weights': ['uniform', 'distance']}
knn_grid = GridSearchCV(KNeighborsClassifier(), knn_params, cv=3)
knn_grid.fit(X_train_scaled, y_train)
models['KNN'] = (knn_grid.best_estimator_, accuracy_score(y_test, knn_grid.predict(X_test_scaled)))

# Logistic Regression
lr_params = {'C': [0.1, 1, 10], 'solver': ['lbfgs', 'liblinear']}
lr_grid = GridSearchCV(LogisticRegression(max_iter=1000), lr_params, cv=3)
lr_grid.fit(X_train_scaled, y_train)
models['Logistic Regression'] = (lr_grid.best_estimator_, accuracy_score(y_test, lr_grid.predict(X_test_scaled)))

# SGD Classifier (Stochastic Gradient Descent)
sgd_params = {'alpha': [0.0001, 0.001, 0.01], 'loss': ['hinge', 'log_loss']}
sgd_grid = GridSearchCV(SGDClassifier(max_iter=1000), sgd_params, cv=3)
sgd_grid.fit(X_train_scaled, y_train)
models['SGD'] = (sgd_grid.best_estimator_, accuracy_score(y_test, sgd_grid.predict(X_test_scaled)))

# -------------------
# Step 4: Compare model performances
# -------------------

print("RESULTS:")
for name, (model, acc) in models.items():
    print(f"{name}: {acc:.4f}")  # Print accuracy of each model

# -------------------
# Step 5: Save the best model
# -------------------

# Identify the best performing model
best_name = max(models.keys(), key=lambda x: models[x][1])
best_model, best_acc = models[best_name]

# Save best model and scaler to disk
joblib.dump(best_model, 'best_cat_dog_model.pkl')
joblib.dump(scaler, 'best_cat_dog_scaler.pkl')

print(f"\nBest Model: {best_name} ({best_acc:.4f})")
print("Model saved as best_cat_dog_model.pkl")

# -------------------
# Step 6: Test new image function
# -------------------

# Predict label for new image
def test_image(image_path):
    # Load saved model and scaler
    model = joblib.load('best_cat_dog_model.pkl')
    scaler = joblib.load('best_cat_dog_scaler.pkl')
    
    # Load and preprocess image
    img = cv2.imread(image_path)
    if img is None:
        return "Could not load image"
    
    img = cv2.resize(img, (32, 32))
    img = img.flatten() / 255.0
    img_scaled = scaler.transform(img.reshape(1, -1))
    
    # Predict label (0 = Cat, 1 = Dog)
    prediction = model.predict(img_scaled)[0]
    label = "Cat" if prediction == 0 else "Dog"
    
    # If model supports confidence scoring
    if hasattr(model, "predict_proba"):
        confidence = max(model.predict_proba(img_scaled)[0]) * 100
        return f"{label} ({confidence:.1f}%)"
    else:
        return label

# -------------------
# Step 7: Test on sample dataset images
# -------------------

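# Sample images from the dataset's test folder.
# Caveat: Step 1 pooled this folder with the rest of the data, so these
# images may have been part of the training split and are not an independent check.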
test_paths = [
    os.path.join(base_path, "test", "Cat", "Cat (1).jpg"),
    os.path.join(base_path, "test", "Dog", "Dog (1).jpg")
]

print("\nTest Results:")
for path in test_paths:
    if os.path.exists(path):
        result = test_image(path)
        actual = "Cat" if "Cat" in path else "Dog"
        print(f"{os.path.basename(path)}: {result} (Actual: {actual})")

# -------------------
# Final Note
# -------------------

print("\nTo test internet images: test_image('path_to_image.jpg')")


RESULTS:
KNN: 0.5556
Logistic Regression: 0.5920
SGD: 0.5755

Best Model: Logistic Regression (0.5920)
Model saved as best_cat_dog_model.pkl

Test Results:
Cat (1).jpg: Cat (83.2%) (Actual: Cat)
Dog (1).jpg: Dog (77.8%) (Actual: Dog)

To test internet images: test_image('path_to_image.jpg')
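
The loader above labels an image by checking whether `'Cat'` occurs anywhere in its path, and it pools every subfolder before splitting. A minimal sketch of an alternative follows, assuming the layout is `<base>/<split>/<class>/<file>` and that the training folder is named `train` (only `test` is confirmed by the paths used above). The names `CLASS_IDS` and `describe` are hypothetical.

```python
from pathlib import PurePath

CLASS_IDS = {"cat": 0, "dog": 1}  # same encoding as above: 0 = Cat, 1 = Dog

def describe(path):
    """Return (split, label) for a path shaped like <base>/<split>/<class>/<file>."""
    parts = PurePath(path).parts
    split, class_name = parts[-3], parts[-2]
    return split, CLASS_IDS[class_name.lower()]

# Hypothetical paths following the assumed layout.
paths = [
    "Q2/train/Cat/Cat (1).jpg",
    "Q2/train/Dog/Dog (1).jpg",
    "Q2/test/Cat/Cat (1).jpg",
    "Catalog/test/Dog/Dog (7).jpg",  # 'Cat' appears in the base path only
]

# Keep the two splits apart instead of pooling and re-splitting them.
train = [(p, describe(p)[1]) for p in paths if describe(p)[0] == "train"]
test = [(p, describe(p)[1]) for p in paths if describe(p)[0] == "test"]
print(train)
print(test)
```

Reading the label from the class folder avoids mislabeling when the base path happens to contain the substring, and keeping the `test` folder out of training makes results on those images a fairer check.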


### QUESTION 3

The MNIST dataset is one of the most well-known datasets in the field of image processing. It contains 60,000 images of handwritten digits from 0 to 9 and is provided as a CSV file in the Q3 folder. In this file, each image is represented as a flattened vector. Classify this dataset using different methods and try to achieve at least 90% accuracy.

In [7]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# ===============================
# Step 1: Load and Prepare the Data
# ===============================

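# Load MNIST training and testing data from CSV files.
# Note: read_csv treats the first row as a header by default; if these files
# have no header row, pass header=None so the first sample is not dropped.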
train_data = pd.read_csv(r"C:\Users\rtape\Downloads\Seneca\CVI620NSB_Summer2025\codes\Assignment2\Q3\mnist_train.csv")
test_data = pd.read_csv(r"C:\Users\rtape\Downloads\Seneca\CVI620NSB_Summer2025\codes\Assignment2\Q3\mnist_test.csv")

# Split into features (X) and labels (y)
X_train = train_data.iloc[:, 1:].values / 255.0  # Normalize pixel values to [0, 1]
y_train = train_data.iloc[:, 0].values

X_test = test_data.iloc[:, 1:].values / 255.0
y_test = test_data.iloc[:, 0].values

print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")

# ===============================
# Step 2: Feature Scaling
# ===============================

# Standardize features: mean = 0, std = 1
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ===============================
# Step 3: KNN Classifier
# ===============================

print("\nKNN CLASSIFIER")
print("-" * 30)

# Initialize and train K-Nearest Neighbors (k=3)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)

# Predict on test data and evaluate accuracy
knn_pred = knn.predict(X_test_scaled)
knn_accuracy = accuracy_score(y_test, knn_pred)
print(f"Accuracy: {knn_accuracy:.4f}")

# ===============================
# Step 4: Logistic Regression
# ===============================

print("\nLOGISTIC REGRESSION")
print("-" * 30)

# Initialize and train Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_scaled, y_train)

# Predict on test data and evaluate accuracy
lr_pred = lr.predict(X_test_scaled)
lr_accuracy = accuracy_score(y_test, lr_pred)
print(f"Accuracy: {lr_accuracy:.4f}")

# ===============================
# Step 5: Model Comparison
# ===============================

print("\nMODEL COMPARISON")
print("-" * 30)
print(f"KNN Accuracy:              {knn_accuracy:.4f}")
print(f"Logistic Regression Accuracy: {lr_accuracy:.4f}")

# Check if either model achieved ≥ 90% accuracy
if max(knn_accuracy, lr_accuracy) >= 0.90:
    print("✓ Target Achieved (≥ 90%)")
else:
    print("✗ Target Not Met (< 90%)")


Train shape: (59999, 784), Test shape: (9999, 784)

KNN CLASSIFIER
------------------------------
Accuracy: 0.9452

LOGISTIC REGRESSION
------------------------------
Accuracy: 0.9219

MODEL COMPARISON
------------------------------
KNN Accuracy:              0.9452
Logistic Regression Accuracy: 0.9219
✓ Target Achieved (≥ 90%)
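
The printed training shape has 59,999 rows, one fewer than the 60,000 images mentioned in the task. That would be consistent with the first data row being read as a header, although the files themselves are not shown here. The sketch below is one way to check this with the standard library. The helper name `first_row_is_data` and the tiny CSV strings are hypothetical stand-ins for the real files.

```python
import csv
import io

def first_row_is_data(csv_text):
    """Return True if every field in the first row parses as a number."""
    first_row = next(csv.reader(io.StringIO(csv_text)))
    try:
        [float(field) for field in first_row]
    except ValueError:
        return False
    return True

# Tiny hypothetical stand-ins for the real files (label followed by pixels).
no_header = "5,0,0,255\n0,0,128,0\n"
with_header = "label,pixel1,pixel2,pixel3\n5,0,0,255\n0,0,128,0\n"

print(first_row_is_data(no_header))    # numeric first row: header=None is needed
print(first_row_is_data(with_header))  # named columns: the default is fine
```

If the check returns `True` for the real files, loading them with `pd.read_csv(path, header=None)` keeps every row, and the positional slicing with `iloc[:, 1:]` and `iloc[:, 0]` continues to work unchanged.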
