## Pseudo-labing

The goal of this python file is to get a dataset that is a merge of both the two datasets that we have and use it to train our model.

We have:
* Dataset A → images + age + gender + race
* Dataset B → images + emotion only

We want to:
1. Train a demographics model on Dataset A
2. Use it to predict age/gender/race on Dataset B
3. Save those predictions as pseudo-labels with confidence
4. Merge the datasets safely

This is called **pseudo-labeling**

Important: these are not true labels, so we store them separately and track confidence.


In [1]:
# Import necessary libraries
import os
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
from tqdm import tqdm



### STEP 1 -Load Dataset A and Dataset B

In [2]:
# Load the datasets
dfA = pd.read_csv("utk_face_labels.csv")   # age, gender, race
dfB = pd.read_csv("raf_labels.csv")   # emotion only

print(dfA.head())
print(dfB.head())

                                          image_path  age  gender  Race
0  source_data/UTK-Face/part3/27_0_1_201701201338...   27       0     1
1  source_data/UTK-Face/part3/24_0_3_201701191655...   24       0     3
2  source_data/UTK-Face/part3/8_1_0_2017011715460...    8       1     0
3  source_data/UTK-Face/part3/85_1_0_201701202226...   85       1     0
4  source_data/UTK-Face/part3/26_1_0_201701191929...   26       1     0
                                          image_path  emotion
0  source_data/raf/DATASET/train/7/train_11651_al...        7
1  source_data/raf/DATASET/train/7/train_10043_al...        7
2  source_data/raf/DATASET/train/7/train_11301_al...        7
3  source_data/raf/DATASET/train/7/train_10513_al...        7
4  source_data/raf/DATASET/train/7/train_11148_al...        7


### STEP 2 — Convert Age to Bins

We are doing this step because the exact age prediction is noisy and difficult. Age classificiation into bins is more stable.

In [3]:
# Create age bins
age_bins = [0,10,20,30,40,50,60,200]

def age_to_bin(age):
    return np.digitize(age, age_bins) - 1

# Convert age to bins and add as a new column in dfA
dfA["age_bin"] = dfA["age"].apply(age_to_bin)

# printing the unique age bins to verify
print("Unique age bins in dfA:", dfA["age_bin"].unique())

print(dfA[["age", "age_bin"]].head(10))


Unique age bins in dfA: [2 0 6 5 3 4 1]
   age  age_bin
0   27        2
1   24        2
2    8        0
3   85        6
4   26        2
5   57        5
6   33        3
7   78        6
8   45        4
9   34        3


### STEP 3 — Converting Gender to Integers

Neural networks need numeric labels. Therefore we need to convert them accordingly. 

In [4]:
# Drop rows where gender class is 3 since its meaning is unclear and it may be an outlier or error in the dataset
print(f"\nOriginal dataset shape: {dfA.shape}")
dfA = dfA[dfA['gender'] != 3]
print(f"Dataset shape after dropping gender=3: {dfA.shape}")

# Convert gender to categorical if it's not already numeric
if dfA["gender"].dtype == "object":
    dfA["gender"] = dfA["gender"].astype("category").cat.codes

num_age = dfA["age_bin"].nunique()
num_gender = dfA["gender"].nunique()

print(f"Number of age bins: {num_age}")
print(f"Number of gender classes: {num_gender}")


Original dataset shape: (24102, 5)
Dataset shape after dropping gender=3: (24102, 5)
Number of age bins: 7
Number of gender classes: 2


### STEP 4 — Train/Validation Split

Both shuffle=True and stratify=dfA["age_bin"] are important for proper model training and evaluation. 

* Benefits of shuffle=True:
    - Prevents order bias: Without shuffling, if your data is sorted by age, the model might learn patterns based on the order rather than actual features
    - Better generalization: Random mixing ensures the model sees diverse examples in each batch
    - Prevents overfitting to data patterns: Shuffling breaks any inherent ordering that might exist in your dataset

* Benefits of stratify=dfA["age_bin"]:
    - Balanced age distribution: Ensures both training and validation sets have the same proportion of each age group
    - Prevents bias: Without stratification, some age groups might be underrepresented in validation, leading to unreliable performance metrics
    - More accurate evaluation: Your validation set will better represent the real-world age distribution
    - Stable training: Prevents scenarios where certain age groups are only in training or only in validation

In [5]:
# Train/validation split for dataset A with stratification on age bins
trainA, valA = train_test_split(
    dfA,
    test_size=0.2,
    random_state=42,
    shuffle=True,
    stratify=dfA["age_bin"])

### STEP 5 — Create TensorFlow Data Pipeline

TensorFlow works best with tf.data.Dataset.

In [6]:

# Create a Tensforflow dataset for training
IMG_SIZE = 224
BATCH_SIZE = 32

# Function to preprocess images to a standard size and normalize pixel values
def preprocess_image(path):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (IMG_SIZE, IMG_SIZE))
    img = img / 255.0
    return img

# Build the dataset - Dataset A Loader:

def create_datasetA(df, training=True):
    image_paths = df["image_path"].values
    age = df["age_bin"].values
    gender = df["gender"].values
    
    # Create a TensorFlow dataset from the image paths and labels
    ds = tf.data.Dataset.from_tensor_slices((image_paths, age, gender))

    # Map the dataset to load and preprocess images and return labels
    def load_data(path,age,gender) :
        img = preprocess_image(path)
        return img, {
            "age": age,
            "gender": gender
        }
    
    # Map the dataset to load and preprocess images and return labels
    ds = ds.map(load_data, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
    return ds

# Create datasets for training and validation
train_ds = create_datasetA(trainA, training=True)
val_ds = create_datasetA(valA, training=False)


### STEP 6 — Build Multi-Output Model

This is a very important step

We use:
- Pretrained MobileNetV2
- 2 output heads:
    - age
    - gender
Shared feature extractor → multiple tasks

In [7]:
# Build the model - MobileNetV2 as the base model for feature extraction
base_mode = keras.applications.MobileNetV2(
    input_shape=(IMG_SIZE, IMG_SIZE, 3),
    include_top=False,
    weights="imagenet",
)

base_mode.trainable = False  # Freeze the base model first

# Add custom layers on top of the base model for age and gender
inputs = keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
x = base_mode(inputs, training=False)
x = layers.GlobalAveragePooling2D()(x)

# Output layers for age and gender
age_output = layers.Dense(num_age, activation="softmax", name="age")(x)
gender_output = layers.Dense(num_gender, activation="softmax", name="gender")(x)

# Create the model with two outputs
model = keras.Model(inputs=inputs, outputs=[age_output, gender_output])


### STEP 7 — Compile Model (Updated)

In [8]:
model.compile(
    optimizer=keras.optimizers.Adam(1e-4),
    loss={
        "age": keras.losses.SparseCategoricalCrossentropy(),
        "gender": keras.losses.SparseCategoricalCrossentropy()
    },
    metrics={
        "age": "accuracy",
        "gender": "accuracy"
    }
)

### STEP 8 — Train on Dataset A

In [None]:
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10
)

Epoch 1/10
[1m100/603[0m [32m━━━[0m[37m━━━━━━━━━━━━━━━━━[0m [1m1:31[0m 182ms/step - age_accuracy: 0.2517 - age_loss: 1.8930 - gender_accuracy: 0.6237 - gender_loss: 0.6615 - loss: 2.5546



[1m120/603[0m [32m━━━[0m[37m━━━━━━━━━━━━━━━━━[0m [1m1:27[0m 181ms/step - age_accuracy: 0.2621 - age_loss: 1.8761 - gender_accuracy: 0.6320 - gender_loss: 0.6505 - loss: 2.5266

### STEP 9 — Predict Pseudo-Labels for Dataset B

Create loader for B

In [None]:
# Generate pseudo-labels for dataset B using the trained model

IMG_SIZE = 224  # must match model input size

# Function to load and preprocess images for dataset B
def load_image(path):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (IMG_SIZE, IMG_SIZE))
    
    img = keras.applications.mobilenet_v2.preprocess_input(img)
    
    return img

# Create a TensorFlow dataset for dataset B
def create_datasetB(df):
    image_paths = df["image_path"].values
    
    # Create a TensorFlow dataset from the image paths
    ds = tf.data.Dataset.from_tensor_slices((image_paths))

    # Map the dataset to load and preprocess images
    def process(path) :
        img = load_image(path)
        return img, path
    
    # Map the dataset to load and preprocess images and return labels
    ds = ds.map(process, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE) # Prefetch for performance
    return ds


dsB = create_datasetB(dfB)

In [None]:
# Generate pseudo-labels predictions for dataset B using the trained model

# Store predictions and confidence scores
age_preds = []
gender_preds = []

age_conf = []
gender_conf = []

# Iterate through dataset B and get predictions from the model
for images, paths in dsB:
    age_p, gender_p = model.predict(images, verbose=0)
    
    age_preds.extend(np.argmax(age_p, axis=1))
    gender_preds.extend(np.argmax(gender_p, axis=1))
    
    age_conf.extend(np.max(age_p, axis=1))
    gender_conf.extend(np.max(gender_p, axis=1))


2026-02-17 23:52:46.980177: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


### STEP 10 — Create Augmented Dataset B

In [None]:
# Add the pseudo-labels and confidence scores to the original dataframe for dataset B
dfB["age_pseudo"] = age_preds
dfB["gender_pseudo"] = gender_preds

dfB["age_conf"] = age_conf
dfB["gender_conf"] = gender_conf

print(dfB.head())

                                          image_path  emotion  age_pseudo  \
0  source_data/raf/DATASET/train/7/train_11651_al...        7           4   
1  source_data/raf/DATASET/train/7/train_10043_al...        7           1   
2  source_data/raf/DATASET/train/7/train_11301_al...        7           2   
3  source_data/raf/DATASET/train/7/train_10513_al...        7           5   
4  source_data/raf/DATASET/train/7/train_11148_al...        7           2   

   gender_pseudo  age_conf  gender_conf  
0              1  0.263394     0.504591  
1              1  0.512256     0.882156  
2              1  0.687731     0.572668  
3              1  0.410027     0.861015  
4              0  0.552722     0.724074  


### STEP 11 — Confidence Filtering

We don’t trust low-confidence predictions. Therefore it is necessary for us to drop those predictions

* -1 = unknown
* Others = pseudo-label

In [None]:
THRESHOLD = 0.50 # Set a confidence threshold for accepting pseudo-labels

# Set pseudo-labels to -1 for samples where confidence is below the threshold
dfB.loc[dfB["age_conf"] < THRESHOLD, "age_pseudo"] = -1
dfB.loc[dfB["gender_conf"] < THRESHOLD, "gender_pseudo"] = -1

# Save the updated dataframe with pseudo-labels to a new CSV file`
dfB.to_csv("B_with_pseudo_labels.csv", index=False)

### STEP 12 — Merge Datasets

Keep true vs pseudo separate!

In [None]:
print(dfA.columns)

Index(['image_path', 'age', 'gender', 'emotion'], dtype='object')


In [None]:
# Merge datasets A and B using the true labels from dataset A and the pseudo-labels from dataset B.

dfA["emotion"] = -1  # no emotion label in A

dfA["age"] = dfA["age_bin"]        # true age
dfA["gender"] = dfA["gender"]  # true gender

dfA = dfA[["image_path", "age", "gender", "emotion"]]


# Since dataset B does not have true labels, we will use the pseudo-labels as the "true" labels for merging. 
# We will also keep the original columns for clarity, but they will be filled with NaN since we don't have true labels for dataset B.
dfB["age"] = dfB["age_pseudo"]
dfB["gender"] = dfB["gender_pseudo"]
dfB["emotion"] = dfB["emotion"]

dfB = dfB[["image_path", "age", "gender", "emotion"]]

# Merge the two datasets
merged = pd.concat([dfA, dfB], ignore_index=True)
merged.to_csv("merged_dataset.csv", index=False)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfA["emotion"] = -1  # no emotion label in A


KeyError: 'age_bin'