Note: because pre-processing 90k+ images is time-consuming, I use a subset that I created to complete this assignment. The subset has an even label distribution across gender-race combinations to avoid potential bias.
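The notebook does not show how the subset was built, so the following is only a minimal sketch of one way to draw an evenly distributed sample with pandas. The helper name `balanced_subset`, the toy label table, and the group size are assumptions for illustration; only the `gender` and `race` column names come from the notebook.

```python
import pandas as pd


def balanced_subset(df, group_cols, n_per_group, seed=42):
    """Draw the same number of rows from every combination of group_cols."""
    return (
        df.groupby(group_cols, group_keys=False)
        .sample(n=n_per_group, random_state=seed)
        .reset_index(drop=True)
    )


# Toy label table standing in for the full label file (hypothetical data).
toy = pd.DataFrame({
    "file": [f"img_{i}.jpg" for i in range(40)],
    "gender": ["Male", "Female"] * 20,
    "race": (["White"] * 2 + ["Black"] * 2) * 10,
})

subset = balanced_subset(toy, ["gender", "race"], n_per_group=3)
counts = subset.groupby(["gender", "race"]).size()
print(counts)
```

If the 6,300 rows loaded below are spread evenly over 2 genders and 7 race labels, that would correspond to 450 rows per combination, but that figure is inferred from the row count, not stated in the notebook.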

In [16]:
import numpy as np
import pandas as pd
import os
from PIL import Image
from sklearn.model_selection import train_test_split

In [None]:
# Unzip the image dataset

%cd /content
!gdown --id '13shxKy6WSeAa7dPhccnSG9aoFZ76lVPT' --output fairface-img-margin025-trainval.zip
!unzip fairface-img-margin025-trainval.zip

In [11]:
# Load labels
file_path = 'fairface_subset.csv'
labels_df = pd.read_csv(file_path)

labels_df.shape

(6300, 5)

In [14]:
# Preprocess images
def load_and_preprocess_image(image_path):
  # Load image
  image = Image.open(image_path)
  image = image.resize((64, 64))

  # Convert to grayscale
  image = image.convert('L')

  # Convert to numpy array and flatten
  image_array = np.array(image).flatten()
  return image_array


images = [load_and_preprocess_image(fname) for fname in labels_df['file']]
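As a quick sanity check on the preprocessing step, the sketch below runs the same resize, grayscale, and flatten pipeline on a synthetic image and confirms that each image becomes a vector of 64 × 64 = 4096 pixel values. The function name `load_features` and the temporary PNG file are assumptions for illustration. Opening the file in a `with` block is a small variation that closes each file handle promptly, which can help when looping over thousands of images.

```python
import os
import tempfile

import numpy as np
from PIL import Image


def load_features(image_path, size=(64, 64)):
    """Open an image, convert it to grayscale, resize it, and flatten it."""
    with Image.open(image_path) as img:
        gray = img.convert("L").resize(size)
    return np.asarray(gray).flatten()


# Synthetic RGB image standing in for a real dataset file (hypothetical input).
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fake_face.png")
    Image.fromarray(pixels).save(path)
    features = load_features(path)

print(features.shape, features.dtype)
```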

## Labelling Gender (major)

In [21]:
# Create feature matrix X and labels y
X = np.array(images)
y = labels_df['gender'].values

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [46]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Decision Tree Model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)

# Evaluate Decision Tree
print("Decision Tree Performance: " + str(np.round(dt_model.score(X_test,y_test),3)))

Decision Tree Performance: 0.608


In [47]:
# Change parameters of Decision Tree Model

dt_model = DecisionTreeClassifier(max_depth=5, random_state=42, criterion="entropy")
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)

# Evaluate Decision Tree
print("Decision Tree Performance: " + str(np.round(dt_model.score(X_test,y_test),3)))

Decision Tree Performance: 0.634


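Interesting! Limiting the tree depth reduces overfitting and improves test accuracy from 0.608 to 0.634. I thought image data would always be more complex than text data, but it looks like the tree doesn't need to be very deep. Note that this cell also switches the criterion to entropy, so the gain may not come from the depth limit alone.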
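To check whether a shallower tree really generalizes better, one option is to sweep `max_depth` with cross-validation instead of relying on a single train/test split. The sketch below assumes synthetic data from `make_classification` as a stand-in for the flattened image features; the depth values and variable names are illustrative choices, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flattened image features (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=50, n_informative=8, random_state=42
)

depths = [2, 5, 10, None]  # None lets the tree grow until its leaves are pure
mean_scores = {}
for depth in depths:
    tree = DecisionTreeClassifier(
        max_depth=depth, criterion="entropy", random_state=42
    )
    # 5-fold cross-validated accuracy, averaged over the folds
    mean_scores[depth] = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

best_depth = max(mean_scores, key=mean_scores.get)
for depth, score in mean_scores.items():
    print(f"max_depth={depth}: mean CV accuracy {score:.3f}")
print("best depth:", best_depth)
```

Holding `criterion` fixed across the sweep isolates the effect of depth, which the two cells above do not do.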

In [45]:
# Random Forest Model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)

# Evaluate Random Forest
print("Random Forest Performance: " + str(np.round(rf_model.score(X_test,y_test),3)))

Random Forest Performance: 0.686


In [51]:
# Change parameters of Random Forest Model

rf_model = RandomForestClassifier(n_estimators=100, random_state=42, bootstrap=False)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)

# Evaluate Random Forest
print("Random Forest Performance:" + str(np.round(rf_model.score(X_test,y_test),3)))

Random Forest Performance:0.703


Also interesting! Setting bootstrap to False improves the model's performance, which is really counterintuitive to me.
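A difference of 0.017 on one test split could be noise, so it may be worth repeating the comparison over several random splits. The sketch below uses synthetic data and a smaller forest to keep it fast; the helper `accuracy_over_seeds` and all data here are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the image features (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=6, random_state=0
)


def accuracy_over_seeds(bootstrap, seeds=range(5)):
    """Test accuracy of a forest for several train/test splits."""
    scores = []
    for seed in seeds:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_demo, y_demo, test_size=0.2, random_state=seed
        )
        forest = RandomForestClassifier(
            n_estimators=25, bootstrap=bootstrap, random_state=seed
        )
        scores.append(forest.fit(X_tr, y_tr).score(X_te, y_te))
    return np.array(scores)


with_bootstrap = accuracy_over_seeds(True)
without_bootstrap = accuracy_over_seeds(False)
print(f"bootstrap=True:  {with_bootstrap.mean():.3f} +/- {with_bootstrap.std():.3f}")
print(f"bootstrap=False: {without_bootstrap.mean():.3f} +/- {without_bootstrap.std():.3f}")
```

If the gap between the two means is smaller than the spread across seeds, the effect of `bootstrap` is probably not meaningful for that dataset.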

## Labelling Race (minor)

In [19]:
# Create feature matrix X and labels y
X = np.array(images)
y = labels_df['race'].values

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [20]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Decision Tree Model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)

# Evaluate Decision Tree
print("Decision Tree Performance:")
print(classification_report(y_test, dt_predictions))

# Random Forest Model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)

# Evaluate Random Forest
print("Random Forest Performance:")
print(classification_report(y_test, rf_predictions))

Decision Tree Performance:
                 precision    recall  f1-score   support

          Black       0.26      0.27      0.27       180
     East Asian       0.30      0.25      0.27       201
         Indian       0.20      0.22      0.21       166
Latino_Hispanic       0.15      0.16      0.15       179
 Middle Eastern       0.12      0.14      0.13       163
Southeast Asian       0.22      0.19      0.20       189
          White       0.21      0.20      0.20       182

       accuracy                           0.20      1260
      macro avg       0.21      0.20      0.20      1260
   weighted avg       0.21      0.20      0.21      1260

Random Forest Performance:
                 precision    recall  f1-score   support

          Black       0.42      0.57      0.48       180
     East Asian       0.38      0.39      0.39       201
         Indian       0.31      0.37      0.34       166
Latino_Hispanic       0.23      0.18      0.20       179
 Middle Eastern       0.20    
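To put the race scores in context: with seven roughly balanced classes, a model with no signal would score about 1/7 ≈ 0.14, so the Decision Tree's 0.20 accuracy is only slightly above chance. A minimal sketch of such a baseline with scikit-learn's `DummyClassifier` follows; the balanced toy labels and random features are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(42)
n_classes = 7

# Balanced toy labels and uninformative features (hypothetical data).
y_demo = np.repeat(np.arange(n_classes), 100)
X_demo = rng.normal(size=(y_demo.size, 5))

# Always predicts a single class, so it is right for one class out of seven.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
chance_accuracy = baseline.score(X_demo, y_demo)

print(f"baseline accuracy: {chance_accuracy:.3f}")
```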

## Some thoughts

### Performance and Overfitting
+ The Random Forest outperformed the Decision Tree on gender (0.686 vs. 0.608 accuracy with default settings). This aligns with expectations, as Random Forests usually generalize better by combining multiple trees. Limiting the Decision Tree to a depth of 5 also raised its accuracy from 0.608 to 0.634, so reducing overfitting helped, if modestly. Even for image data, which is often perceived as more complex than text data, a highly complex model (like a very deep tree) is not always necessary.

### Impact of Bootstrap Parameter
+ Interestingly, setting bootstrap to False improved the Random Forest's accuracy from 0.686 to 0.703. Typically, bootstrap sampling introduces diversity between trees, enhancing robustness. In this case, allowing each tree to train on the full training set may have reduced bias, while the random feature selection at each split still kept the trees different from one another. The gain is small and comes from a single train/test split, so it may not hold for other splits or datasets.

### Efficiency
+ A single Decision Tree is cheaper to train than a Random Forest, which fits 100 trees here, so it suits smaller datasets or scenarios where rapid training is crucial. The extra cost of a Random Forest becomes a significant factor with larger datasets.
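The notebook does not record training times, so the sketch below shows one way to measure them with `time.perf_counter`. It assumes synthetic data, and the helper `fit_seconds` is a hypothetical name; actual timings depend on the machine and the dataset.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the image features (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=100, random_state=42
)


def fit_seconds(model):
    """Return the wall-clock time needed to fit the model."""
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    return time.perf_counter() - start


timings = {
    "decision tree": fit_seconds(DecisionTreeClassifier(random_state=42)),
    "random forest": fit_seconds(
        RandomForestClassifier(n_estimators=100, random_state=42)
    ),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```

Passing `n_jobs=-1` to `RandomForestClassifier` trains the trees in parallel, which can narrow the gap on multi-core machines.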

### Pros and Cons
+ Decision Trees:
 + Pros: Simple, interpretable, faster training.
 + Cons: Prone to overfitting, less effective with complex patterns.
+ Random Forests:
 + Pros: Better accuracy, handles overfitting well, good for complex data.
 + Cons: Computationally intensive, less interpretable.
