Note: because pre-processing 90k+ images is time-consuming, I use a subset that I created to complete this assignment. The subset has a balanced label distribution across gender-race combinations to avoid potential bias.
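
For reference, a balanced subset like this could be drawn along the following lines. This is only a sketch on a synthetic label table, not the exact procedure used to build fairface_subset.csv: the `race` column name, the placeholder category values, and the group size `n_per_group` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the full label file. The 'race' column and the
# category values below are assumptions, not the real label set.
rng = np.random.default_rng(0)
n_rows = 5000
full_df = pd.DataFrame({
    'file': [f'train/{i}.jpg' for i in range(1, n_rows + 1)],
    'gender': rng.choice(['Male', 'Female'], size=n_rows, p=[0.6, 0.4]),
    'race': rng.choice(['A', 'B', 'C'], size=n_rows, p=[0.5, 0.3, 0.2]),
})

# Draw the same number of rows from every gender-race group.
n_per_group = 100  # hypothetical group size
subset_df = (
    full_df
    .groupby(['gender', 'race'])
    .sample(n=n_per_group, random_state=0)
    .reset_index(drop=True)
)

# Every gender-race combination should now have the same count.
group_counts = subset_df.groupby(['gender', 'race']).size()
print(group_counts)
```

Sampling a fixed number per group only works if the smallest group has at least `n_per_group` rows; otherwise `sample` raises an error unless `replace=True` is passed.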

In [3]:
import numpy as np
import pandas as pd
import os
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [3]:
# Unzip the image dataset

# Just run once

'''
%cd /content
!gdown --id '13shxKy6WSeAa7dPhccnSG9aoFZ76lVPT' --output fairface-img-margin025-trainval.zip
!unzip fairface-img-margin025-trainval.zip
'''

/content
Downloading...
From (original): https://drive.google.com/uc?id=13shxKy6WSeAa7dPhccnSG9aoFZ76lVPT
From (redirected): https://drive.google.com/uc?id=13shxKy6WSeAa7dPhccnSG9aoFZ76lVPT&confirm=t&uuid=5cd3b74e-90fa-4b2e-82b9-803607455ee4
To: /content/fairface-img-margin025-trainval.zip
100% 578M/578M [00:06<00:00, 83.7MB/s]
Archive:  fairface-img-margin025-trainval.zip
replace train/1346.jpg? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [4]:
# Download the csv label file for the subset

# Just run once

'''
%cd /content
!gdown --id '1bwKY_aVMRIQ_IcrFnpsTG-E3ZE7iAM8t' --output fairface_subset.csv
'''

/content
Downloading...
From: https://drive.google.com/uc?id=1bwKY_aVMRIQ_IcrFnpsTG-E3ZE7iAM8t
To: /content/fairface_subset.csv
100% 279k/279k [00:00<00:00, 87.2MB/s]


In [4]:
# Load labels
file_path = 'fairface_subset.csv'
labels_df = pd.read_csv(file_path)

labels_df.shape

(6300, 5)

In [5]:
# Preprocess images
def load_and_preprocess_image(image_path):
  # Load image
  image = Image.open(image_path)
  image = image.resize((64, 64))

  # Convert to grayscale
  image = image.convert('L')

  # Convert to numpy array and flatten
  image_array = np.array(image).flatten()
  return image_array


images = [load_and_preprocess_image(fname) for fname in labels_df['file']]

In [6]:
# Create feature matrix X and labels y
X = np.stack(images, axis=0)
y = labels_df['gender'].values
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
# Initialize the logistic regression model (with l1 penalty)
model_l1 = LogisticRegression(random_state=0, solver='liblinear', penalty='l1', C=1.0)

# Fit the model to the training data
model_l1.fit(X_train, y_train)

# Get the score
model_l1.score(X_test, y_test)

0.6158730158730159

In [7]:
# Initialize the logistic regression model (with l2 penalty)
model_l2 = LogisticRegression(random_state=0, solver='liblinear', penalty='l2', C=1.0)

# Fit the model to the training data
model_l2.fit(X_train, y_train)

# Get the score
model_l2.score(X_test, y_test)



0.6047619047619047

In [8]:
# Try another solver
model_l2_lbfgs = LogisticRegression(random_state=0, solver='lbfgs', penalty='l2', C=1.0)

# Fit the model to the training data
model_l2_lbfgs.fit(X_train, y_train)

# Get the score
model_l2_lbfgs.score(X_test, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.6849206349206349

In [10]:
# Try another solver
model_l1_saga = LogisticRegression(random_state=0, solver='saga', penalty='l1', C=1.0)

# Fit the model to the training data
model_l1_saga.fit(X_train, y_train)

# Get the score
model_l1_saga.score(X_test, y_test)



0.7277777777777777

Clearly, 'saga' with the L1 penalty has a reasonable running time and the highest test score (0.7278). However, there is a convergence warning about the iteration limit, so I will first try different max_iter values and then different C values.
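
To check convergence programmatically instead of reading the warning text, one option is to record warnings during fitting and inspect `n_iter_`. The sketch below uses synthetic data as a stand-in for the image features, and the helper name `fit_and_check` is hypothetical.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the flattened image features (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=50, random_state=0)
X_demo = X_demo * 100.0  # deliberately large feature scale, similar to raw pixel values


def fit_and_check(max_iter):
    """Fit a saga model; return (iterations used, whether a ConvergenceWarning was raised)."""
    model = LogisticRegression(solver='saga', penalty='l1', C=1.0,
                               max_iter=max_iter, random_state=0)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')  # record every warning raised during fit
        model.fit(X_demo, y_demo)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return int(model.n_iter_[0]), warned


for max_iter in [5, 50]:
    n_iter, warned = fit_and_check(max_iter)
    print(f"max_iter={max_iter}: iterations used={n_iter}, convergence warning={warned}")
```

If the number of iterations used equals `max_iter`, the solver stopped at the limit rather than at its tolerance.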

In [8]:
# Try different max_iter
max_iter_values = [10, 20, 50, 100, 200, 500]

for max_iter in max_iter_values:
    model = LogisticRegression(random_state=0, solver='saga', penalty='l1',
            C=1.0, max_iter=max_iter)

    model.fit(X_train, y_train)

    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    print(f"max_iter: {max_iter}, Training score: {train_score:.4f}, Test score: {test_score:.4f}")



max_iter: 10, Training score: 0.7548, Test score: 0.7278
max_iter: 20, Training score: 0.7673, Test score: 0.7349
max_iter: 50, Training score: 0.7913, Test score: 0.7325
max_iter: 100, Training score: 0.8024, Test score: 0.7278
max_iter: 200, Training score: 0.8177, Test score: 0.7190
max_iter: 500, Training score: 0.8437, Test score: 0.7032

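An interesting outcome: even though the model still has not converged, the test score decreases once max_iter exceeds 20, while the training score keeps rising, which points to overfitting. I will therefore explore max_iter=20 and max_iter=50 further.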

In [9]:
# Try different regularization strengths
for C_value in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C_value, penalty='l1', solver='saga', max_iter=20)
    model.fit(X_train, y_train)
    print(f"C={C_value}, Training score: {model.score(X_train, y_train):.4f}, Test score: {model.score(X_test, y_test):.4f}")



C=0.01, Training score: 0.7669, Test score: 0.7341
C=0.1, Training score: 0.7661, Test score: 0.7333
C=1, Training score: 0.7688, Test score: 0.7317
C=10, Training score: 0.7683, Test score: 0.7357
C=100, Training score: 0.7675, Test score: 0.7341

In [10]:
# Try different regularization strengths
for C_value in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C_value, penalty='l1', solver='saga', max_iter=50)
    model.fit(X_train, y_train)
    print(f"C={C_value}, Training score: {model.score(X_train, y_train):.4f}, Test score: {model.score(X_test, y_test):.4f}")



C=0.01, Training score: 0.7865, Test score: 0.7365
C=0.1, Training score: 0.7911, Test score: 0.7341
C=1, Training score: 0.7917, Test score: 0.7310
C=10, Training score: 0.7919, Test score: 0.7325
C=100, Training score: 0.7927, Test score: 0.7325

The best-performing logistic regression model on this dataset uses an L1 penalty with regularization strength C=0.01, the 'saga' solver, and max_iter=50, reaching a test score of 0.7365. This suggests that the data benefits from stronger regularization, which reduces overfitting by penalizing less important features. However, the test scores across all C values tried differ by less than one percentage point, so the preference for C=0.01 is tentative.

The 'saga' solver was effective for my needs, likely due to its efficiency with large datasets and its ability to handle L1 penalties well.
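
Because the comparison above selects hyperparameters by their test-set score, a possible refinement is to choose C and max_iter by cross-validation on the training split and keep the test split for one final check. This is a minimal sketch, assuming synthetic data in place of the image features. The grid values mirror the ones tried above, but the resulting numbers are not comparable to the notebook's results.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Small iteration budgets are expected to hit the limit; silence those warnings here.
warnings.filterwarnings('ignore', category=ConvergenceWarning)

# Synthetic stand-in for the flattened image features (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=30,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'max_iter': [20, 50]}
search = GridSearchCV(
    LogisticRegression(solver='saga', penalty='l1', random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training split only
    scoring='accuracy',
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Mean CV accuracy: {search.best_score_:.4f}")
# The held-out test split is touched once, after selection.
test_accuracy = search.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.4f}")
```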

# Discussion of the pros and cons of the models

## Solvers:

### liblinear:
+ Pros:
 + It's a good choice for small to medium datasets.
 + It supports both L1 and L2 regularization.
 + It's easy to use and interpret.
+ Cons:
 + Not suitable for very large datasets because it can be slow.
 + It does not support multinomial logistic regression; it handles multiclass using a one-vs-rest approach, which can be less efficient.
 + It may have convergence issues with L1 regularization on non-separable data or with insufficient iterations.

### saga:
+ Pros:
 + Designed for large datasets with many samples.
 + Faster for large datasets due to incremental, gradient-based optimization.
 + saga supports both L1 and L2 regularization, making it versatile.
+ Cons:
 + May take longer to converge on smaller datasets.
 + Fast convergence is only guaranteed when features are on roughly the same scale, so scaling (e.g., using StandardScaler) is recommended.
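
The scaling point for saga can be sketched with a pipeline that standardizes the features before fitting. This is an illustration on synthetic data with pixel-like magnitudes, not a rerun of the experiments above, and `max_iter=1000` is an arbitrary generous budget.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the image features, shifted and stretched to
# mimic unscaled pixel-like magnitudes (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=40, random_state=0)
X_demo = X_demo * 50.0 + 128.0

# The scaler is fitted inside the pipeline, so it only ever sees the
# data passed to fit().
scaled_saga = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='saga', penalty='l1', C=1.0,
                       max_iter=1000, random_state=0),
)
scaled_saga.fit(X_demo, y_demo)

n_iter_used = int(scaled_saga.named_steps['logisticregression'].n_iter_[0])
train_accuracy = scaled_saga.score(X_demo, y_demo)
print(f"Iterations used: {n_iter_used}, training accuracy: {train_accuracy:.4f}")
```

Wrapping the scaler in the pipeline also keeps test data out of the scaling statistics when the pipeline is fitted on a training split.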

## Regularization (Penalty) Techniques:

### L1 Regularization (Lasso):
+ Pros:
 + Can lead to sparse models where some coefficients can become zero.
 + Useful for feature selection because it can eliminate some features entirely.
+ Cons:
 + Can give unstable coefficients: when features are highly correlated, it tends to keep one and zero out the others somewhat arbitrarily.
 + Not supported by all solvers.

### L2 Regularization (Ridge):
+ Pros:
 + Tends to work well when many features each carry some predictive signal, since it shrinks coefficients rather than discarding features.
 + The model is less likely to fit noise in the data.
 + Supported by most solvers.
+ Cons:
 + Does not reduce coefficients to zero, which means it does not perform feature selection.
 + Shrinks every coefficient toward zero, penalizing large coefficients more heavily than small ones rather than shrinking them all equally.
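
The sparsity difference between the two penalties can be illustrated by counting coefficients that are exactly zero. The sketch below assumes synthetic data with only a few informative features and a fairly strong penalty (C=0.05, chosen for illustration). It uses liblinear because that solver supports both penalties.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 5 informative features among 50 (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=50, n_informative=5,
                                     n_redundant=0, random_state=0)

zero_counts = {}
for penalty in ['l1', 'l2']:
    model = LogisticRegression(solver='liblinear', penalty=penalty,
                               C=0.05, random_state=0)
    model.fit(X_demo, y_demo)
    # Count coefficients that the penalty drove exactly to zero.
    zero_counts[penalty] = int(np.sum(model.coef_ == 0))
    print(f"{penalty}: {zero_counts[penalty]} of {model.coef_.size} coefficients are exactly zero")
```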
