## Creating Custom Dataset

Source : https://academictorrents.com/details/4b9b7e449aa732842aea1a7d4e6413f4507aea99

CSV file used : person.csv

Images : front and side images of inmates

The online dataset is customized to create a dataset for the project, with the following columns:
- `id` column for identifying the inmates.

- `sex` column for ground truth values in Gender Classification.

- `bmi` column, calculated using the formula `BMI = weight(kg) / height(m)^2`, for ground truth values in BMI Classification.

- `image_front_data` and `image_side_data` columns for the images of the inmates.

The custom dataset is saved as a pickle file for further use.
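
As a worked example of the BMI formula above, the sketch below converts a weight and a height to metric units and computes the BMI. It assumes the raw values are in pounds and inches, which is inferred from the conversion factors used later in this notebook; `bmi_from_imperial` is a hypothetical helper name, not part of the dataset or of any library.

```python
LB_TO_KG = 0.453592  # pounds -> kilograms
IN_TO_M = 0.0254     # inches -> meters


def bmi_from_imperial(weight_lb: float, height_in: float) -> float:
    """Return BMI = weight(kg) / height(m)^2 for inputs in pounds and inches."""
    weight_kg = weight_lb * LB_TO_KG
    height_m = height_in * IN_TO_M
    return weight_kg / height_m ** 2


# 185 lb and 67 in correspond to about 83.91 kg and 1.70 m
print(round(bmi_from_imperial(185, 67), 2))  # 28.97
```

These inputs correspond to the first row of the dataframe computed later in the notebook (id `A00147`), whose `bmi` is 28.97.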

## Loading sex, weight, height and bmi

In [1]:
import pandas as pd
import numpy as np

Loading person dataset to include weight, height and sex.

In [2]:
df = pd.read_csv('illinois_doc_dataset/illinois_doc_dataset/csv/person.csv', sep=';')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61110 entries, 0 to 61109
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              61110 non-null  object 
 1   name                            61110 non-null  object 
 2   date_of_birth                   61096 non-null  object 
 3   weight                          60716 non-null  float64
 4   hair                            61109 non-null  object 
 5   sex                             61109 non-null  object 
 6   height                          60728 non-null  float64
 7   race                            61106 non-null  object 
 8   eyes                            61109 non-null  object 
 9   admission_date                  61109 non-null  object 
 10  projected_parole_date           33932 non-null  object 
 11  last_paroled_date               8474 non-null   object 
 12  projected_discharge_date        

Creating custom dataframe using existing dataset.<br>
Converting weight and height to kilogram and meter respectively.

In [3]:
custom_df = df[['id', 'sex', 'weight', 'height']]
custom_df.eval('weight = weight * 0.453592', inplace=True)
custom_df.eval('height = height * 0.0254', inplace=True)
custom_df.head()

|  | id | sex | weight | height |
|---|---|---|---|---|
| 0 | A00147 | Male | 83.91452 | 1.7018 |
| 1 | A00220 | Male | 70.30676 | 1.8542 |
| 2 | A00360 | Male | 75.749864 | 1.7526 |
| 3 | A00367 | Male | 111.13004 | 1.8288 |
| 4 | A01054 | Male | 75.296272 | 1.7018 |


Calculating BMI.

In [4]:
custom_df.eval('bmi = weight / height ** 2', inplace=True)
custom_df = custom_df.round(2)
custom_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  custom_df.eval('bmi = weight / height ** 2', inplace=True)


|  | id | sex | weight | height | bmi |
|---|---|---|---|---|---|
| 0 | A00147 | Male | 83.91 | 1.7 | 28.97 |
| 1 | A00220 | Male | 70.31 | 1.85 | 20.45 |
| 2 | A00360 | Male | 75.75 | 1.75 | 24.66 |
| 3 | A00367 | Male | 111.13 | 1.83 | 33.23 |
| 4 | A01054 | Male | 75.3 | 1.7 | 26.0 |
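
The warning in the output above appears because `custom_df` was created by selecting columns from `df`, so pandas cannot tell whether the in-place assignment is meant to affect the original frame. A minimal sketch of one way to avoid it is to take an explicit `.copy()` and assign columns directly. The toy values below are illustrative stand-ins for `person.csv`, not real records.

```python
import pandas as pd

# Toy stand-in for person.csv (weights in pounds, heights in inches).
df = pd.DataFrame({
    "id": ["X1", "X2"],
    "sex": ["Male", "Female"],
    "weight": [185.0, 140.0],
    "height": [67.0, 64.0],
})

# .copy() makes custom_df an independent frame, so later assignments are unambiguous.
custom_df = df[["id", "sex", "weight", "height"]].copy()
custom_df["weight"] = custom_df["weight"] * 0.453592  # pounds -> kilograms
custom_df["height"] = custom_df["height"] * 0.0254    # inches -> meters
custom_df["bmi"] = custom_df["weight"] / custom_df["height"] ** 2
custom_df = custom_df.round(2)

print(custom_df)
```

The original `df` keeps its pound and inch values, and the computed columns match what the `eval` calls produce.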


Dropping the `weight` and `height` columns now that `bmi` has been computed.

In [5]:
custom_df = custom_df.drop(['weight', 'height'], axis=1)
custom_df.head()

|  | id | sex | bmi |
|---|---|---|---|
| 0 | A00147 | Male | 28.97 |
| 1 | A00220 | Male | 20.45 |
| 2 | A00360 | Male | 24.66 |
| 3 | A00367 | Male | 33.23 |
| 4 | A01054 | Male | 26.0 |
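
The `df.info()` output earlier shows fewer non-null `weight` and `height` values than rows, so some `bmi` values will be NaN, and a height of zero would give an infinite value. The notebook does not say how such rows are handled. The sketch below, on toy data, simply drops them, which is an assumption and not necessarily what the project does.

```python
import numpy as np
import pandas as pd

# Toy frame with one valid, one missing, and one infinite BMI value.
toy = pd.DataFrame({
    "id": ["X1", "X2", "X3"],
    "sex": ["Male", "Female", "Male"],
    "bmi": [28.97, np.nan, np.inf],
})

# Treat infinities as missing, then drop rows without a usable BMI.
clean = toy.replace([np.inf, -np.inf], np.nan).dropna(subset=["bmi"])

print(len(toy) - len(clean), "rows dropped")
```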


## Loading face images

In [6]:
import os
import cv2
import pickle

Counting the number of females and males in the dataset.

In [7]:
male_count = df[df['sex'] == 'Male'].shape[0]
female_count = df[df['sex'] == 'Female'].shape[0]

print(f"Male count: {male_count}")
print(f"Female count: {female_count}")

Male count: 57341
Female count: 3768
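
The same counts can be obtained in one call with `Series.value_counts`, sketched here on toy data. Passing `dropna=False` also reports rows whose `sex` is missing; the `df.info()` output shows 61109 non-null `sex` values out of 61110 rows.

```python
import pandas as pd

# Toy column with one missing value.
toy = pd.DataFrame({"sex": ["Male", "Male", "Female", None, "Male"]})

# Count each category, including the missing one.
counts = toy["sex"].value_counts(dropna=False)
print(counts)

# Ratio of males to females, a rough measure of class imbalance.
male_to_female = counts["Male"] / counts["Female"]
print(male_to_female)
```

With the real counts above, the ratio is 57341 / 3768, or roughly 15 males per female, which is presumably why the number of male images is capped in the next step.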


Loading front-face images.

Front-face data is added only if the image file can be read and its id is present in the custom dataframe.

All females are taken, and males are capped at 4000 (`male_limit`).

In [8]:
image_folder = "illinois_doc_dataset/illinois_doc_dataset/front/front"
image_size = (512, 512)  # Resize all images for consistency
image_ids = []
image_front_data = []

male_count = 0
male_limit = 4000  # Cap on male images; all female images are kept

for filename in os.listdir(image_folder):
    # Only consider JPEG and PNG files
    if filename.endswith((".jpg", ".png")):
        file_path = os.path.join(image_folder, filename)

        # The file name without its extension is the inmate id
        image_id = os.path.splitext(filename)[0]

        # Skip images that have no matching row in custom_df
        if image_id in custom_df['id'].values:
            sex = custom_df[custom_df['id'] == image_id]['sex'].values[0]
            # Stop adding males once the cap is reached
            if sex == 'Male' and male_count >= male_limit:
                continue

            image = cv2.imread(file_path)

            # cv2.imread returns None when the file cannot be decoded
            if image is None:
                print(f"Warning: Unable to read image file {file_path}")
                continue

            try:
                # Resize, convert to grayscale, and store as a flat 1-D array
                image = cv2.resize(image, image_size)
                image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

                image_ids.append(image_id)
                image_front_data.append(image.flatten())

                # Count a male only after his image was processed successfully
                if sex == 'Male':
                    male_count += 1
            except Exception as e:
                print(f"Error processing {file_path}: {e}")
                continue
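
The selection rule in the loop above (keep every female, keep males until `male_limit` is reached) can be separated from the image decoding. The sketch below is a hypothetical helper, `select_ids`, that works on file names only and uses a dictionary for the id-to-sex lookup instead of scanning `custom_df['id'].values` for every file. It is an illustration on toy data, not the notebook's code.

```python
import os


def select_ids(filenames, sex_by_id, male_limit):
    """Return ids to load: every non-male id, and at most male_limit males."""
    selected = []
    male_count = 0
    for filename in filenames:
        # Only consider JPEG and PNG files
        if not filename.endswith((".jpg", ".png")):
            continue
        image_id = os.path.splitext(filename)[0]
        sex = sex_by_id.get(image_id)
        if sex is None:  # id not present in the dataframe
            continue
        if sex == "Male":
            if male_count >= male_limit:
                continue
            male_count += 1
        selected.append(image_id)
    return selected


# Toy data: four known ids, one unknown id, and one non-image file.
sex_by_id = {"X1": "Male", "X2": "Female", "X3": "Male", "X4": "Male"}
files = ["X1.jpg", "X2.jpg", "X3.png", "X4.jpg", "X9.jpg", "notes.txt"]
print(select_ids(files, sex_by_id, male_limit=2))  # ['X1', 'X2', 'X3']
```

In the notebook, the lookup table could be built with `dict(zip(custom_df['id'], custom_df['sex']))`. Unlike the notebook loop, this sketch counts a male as soon as he is selected, not after his image has been decoded successfully.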



Creating a dataframe of front images.

Merging it with the custom dataframe on `id` and keeping only the rows that have a front image.


In [9]:
image_df = pd.DataFrame({
    "image_id": image_ids,
    "image_front_data": image_front_data
})

custom_df['id'] = custom_df['id'].str.strip() 
image_df['image_id'] = image_df['image_id'].str.strip()

merged_df = pd.merge(custom_df, image_df, how='left', left_on='id', right_on='image_id')
merged_df.drop(columns=['image_id'], inplace=True)
merged_df.dropna(subset=['image_front_data'], inplace=True)
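
Because rows without a front image are dropped right after the left merge, the result matches an inner join on the id columns. A small sketch on toy data, assuming ids are unique in both frames (the frame names and values here are illustrative):

```python
import pandas as pd

custom = pd.DataFrame({"id": ["X1", "X2", "X3"], "bmi": [28.97, 20.45, 24.66]})
images = pd.DataFrame({
    "image_id": ["X1", "X3"],
    "image_front_data": [[0, 1], [2, 3]],  # stand-ins for flattened images
})

# Left merge followed by dropna, as in the cell above.
left = pd.merge(custom, images, how="left", left_on="id", right_on="image_id")
left = left.drop(columns=["image_id"]).dropna(subset=["image_front_data"])

# Inner merge keeps only ids present in both frames in a single step.
inner = pd.merge(custom, images, how="inner", left_on="id", right_on="image_id")
inner = inner.drop(columns=["image_id"])

print(left["id"].tolist() == inner["id"].tolist())
```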

Loading side-face images.

Side-face data is added only if the image file can be read and its id is present in the custom dataframe, using the same cap of 4000 males.

In [10]:
image_folder = "illinois_doc_dataset/illinois_doc_dataset/side/side"
image_size = (512, 512)  # Same size as the front images
image_ids = []
image_side_data = []

male_count = 0
male_limit = 4000  # Cap on male images; all female images are kept

for filename in os.listdir(image_folder):
    # Only consider JPEG and PNG files
    if filename.endswith((".jpg", ".png")):
        file_path = os.path.join(image_folder, filename)

        # The file name without its extension is the inmate id
        image_id = os.path.splitext(filename)[0]

        # Skip images that have no matching row in custom_df
        if image_id in custom_df['id'].values:
            sex = custom_df[custom_df['id'] == image_id]['sex'].values[0]
            # Stop adding males once the cap is reached
            if sex == 'Male' and male_count >= male_limit:
                continue

            image = cv2.imread(file_path)

            # cv2.imread returns None when the file cannot be decoded
            if image is None:
                print(f"Warning: Unable to read image file {file_path}")
                continue

            try:
                # Resize, convert to grayscale, and store as a flat 1-D array
                image = cv2.resize(image, image_size)
                image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

                image_ids.append(image_id)
                image_side_data.append(image.flatten())

                if sex == 'Male':
                    male_count += 1
            except Exception as e:
                print(f"Error processing {file_path}: {e}")
                continue



KeyboardInterrupt: 
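
Only rows that already have a front image survive the first merge, so one option is to decode side images only for those ids. That avoids the per-file dataframe scan and also avoids applying a second, independent male cap, which could pick different males than the front loop if the directory listing order differs. This is a hypothetical variation, not what the notebook does; `kept_ids` stands for something like `set(merged_df['id'])`, and `side_files_to_load` is an illustrative name.

```python
import os


def side_files_to_load(filenames, kept_ids):
    """Return side-image file names whose id already has a front image."""
    wanted = []
    for filename in filenames:
        stem, ext = os.path.splitext(filename)
        # Keep JPEG/PNG files whose id was retained after the front merge
        if ext in (".jpg", ".png") and stem in kept_ids:
            wanted.append(filename)
    return wanted


kept_ids = {"X1", "X2"}
side_files = ["X1.jpg", "X2.png", "X3.jpg", "readme.txt"]
print(side_files_to_load(side_files, kept_ids))  # ['X1.jpg', 'X2.png']
```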

Creating a dataframe of side images.

Merging it with the previously merged dataframe on `id` and keeping only the rows that have both a front and a side image.

In [None]:
image_df = pd.DataFrame({
    "image_id": image_ids,
    "image_side_data": image_side_data
})

custom_df['id'] = custom_df['id'].str.strip()
image_df['image_id'] = image_df['image_id'].str.strip()

merged_df = pd.merge(merged_df, image_df, how='left', left_on='id', right_on='image_id')
merged_df.drop(columns=['image_id'], inplace=True)
merged_df.dropna(subset=['image_side_data'], inplace=True)

Saving the DataFrame to a pickle file

In [None]:
with open('custom_dataset.pkl', 'wb') as f:
    pickle.dump(merged_df, f)

print(f"Final DataFrame saved to 'custom_dataset.pkl' with {len(merged_df)} records.")

Final DataFrame saved to 'custom_dataset.pkl' with 7663 records.
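
pandas also provides `DataFrame.to_pickle` and `pd.read_pickle`, which perform the same kind of round trip without opening the file manually. The sketch below uses a toy frame and a temporary directory; the file name `toy_dataset.pkl` is illustrative.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame whose image column holds flat uint8 arrays, as in the notebook.
toy = pd.DataFrame({
    "id": ["X1", "X2"],
    "image_front_data": [
        np.zeros(4, dtype=np.uint8),
        np.ones(4, dtype=np.uint8),
    ],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_dataset.pkl")
    toy.to_pickle(path)               # write the frame
    restored = pd.read_pickle(path)   # read it back

print(restored["image_front_data"].iloc[1].dtype)  # uint8
```

Pickle files can execute arbitrary code when loaded, so they should only be read from trusted sources.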



Reading the pickle file

In [None]:
# Read the DataFrame from the pickle file
with open('custom_dataset.pkl', 'rb') as f:
    loaded_df = pickle.load(f)

# Reshape each flattened image back to a 2D array of shape (512, 512)
loaded_df['image_front_data'] = loaded_df['image_front_data'].apply(lambda x: np.array(x).reshape(512, 512))
loaded_df['image_side_data'] = loaded_df['image_side_data'].apply(lambda x: np.array(x).reshape(512, 512))

# Check the shape of one image from each column to verify
print(loaded_df['image_front_data'].iloc[120].shape)
print(loaded_df['image_side_data'].iloc[120].shape)
loaded_df.head()
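
The flatten and reshape steps are exact inverses as long as the target shape matches the size used when the images were resized. A minimal sketch with a random stand-in image, assuming 8-bit grayscale data:

```python
import numpy as np

# Random 512x512 grayscale image as a stand-in for a real mugshot.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(512, 512), dtype=np.uint8)

flat = image.flatten()              # 1-D array, as stored in the dataframe
restored = flat.reshape(512, 512)   # back to a 2-D grayscale image

print(flat.shape, restored.shape)       # (262144,) (512, 512)
print(np.array_equal(image, restored))  # True
print(flat.nbytes)                      # 262144
```

At one byte per pixel, each image takes 262144 bytes, so 7663 records with two images each come to about 4.0 GB in memory under the same 8-bit assumption.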