#### **Foundation of Machine Learning**
##### **Final Project**

#### **Step 1: Install the Hugging Face CLI**

- We found a celebrity image dataset on Hugging Face.
- Install the Hugging Face CLI to access the dataset:
```
pip install -U "huggingface_hub[cli]"
huggingface-cli --help
```
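
Step 2 imports `datasets` and `pandas`, which the commands above do not install. A minimal sketch of checking from Python which of these packages are present; `installed_version` is a hypothetical helper name, and the package names passed to it are assumed to match the names used with `pip`:

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Packages used in this notebook (names assumed to match the pip package names)
for name in ("huggingface_hub", "datasets", "pandas", "pillow"):
    print(name, "->", installed_version(name))
```

Any package reported as `None` can be installed with `pip install` before continuing.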



Reference: The celebrity image dataset comes from the following Hugging Face repository.
```
https://huggingface.co/datasets/ares1123/celebrity_dataset
```
Thanks to user **https://huggingface.co/ares1123**


#### **Step 2: Download the dataset**

In [6]:
from datasets import load_dataset

dataset = load_dataset("ares1123/celebrity_dataset")
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 18184
    })
})


* View the dataset loaded as a pandas DataFrame

In [11]:
import pandas as pd
df = pd.read_parquet("hf://datasets/ares1123/celebrity_dataset/data/train-00000-of-00001.parquet")
display(df)

| | image | label |
|---|---|---|
| 0 | `{'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...` | 0 |
| 1 | `{'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...` | 0 |
| 2 | `{'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...` | 0 |
| 3 | `{'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...` | 0 |
| 4 | `{'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...` | 0 |
| ... | ... | ... |
| 18179 | `{'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...` | 996 |
| 18180 | `{'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...` | 996 |
| 18181 | `{'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...` | 996 |
| 18182 | `{'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x...` | 996 |
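
The `bytes` values above begin with `\xff\xd8\xff\xe0...JFIF`, which is the standard JPEG/JFIF file header. A minimal sketch of checking that signature before decoding an image; `looks_like_jpeg` is a hypothetical helper, and the sample payload is a truncated stand-in copied from the preview rather than a decodable image:

```python
# JPEG start-of-image marker (FF D8) plus the first byte of the next marker (FF)
JPEG_PREFIX = b"\xff\xd8\xff"

def looks_like_jpeg(data: bytes) -> bool:
    """Return True if the payload starts with the JPEG signature bytes."""
    return data[:3] == JPEG_PREFIX

# Truncated header taken from the DataFrame preview (not a full image)
sample = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
print(looks_like_jpeg(sample))            # True
print(looks_like_jpeg(b"\x89PNG\r\n"))    # False
```

In the notebook, this could be applied to each row with something like `df['image'].map(lambda cell: looks_like_jpeg(cell['bytes']))`, assuming every cell is a dict with a `bytes` key as the preview suggests.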


In [12]:
print(dataset['train'][0])

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=256x256 at 0x20A605A2BD0>, 'label': 0}
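
The record above pairs a 256x256 RGB image with an integer label, so a quick sanity check is to count how many images each label has. A minimal sketch, assuming the labels are available as a plain iterable of integers (for example from `dataset['train']['label']`); `images_per_label` is a hypothetical helper name and the sample labels are made up:

```python
from collections import Counter

def images_per_label(labels):
    """Return a dict mapping each integer label to its image count."""
    return dict(Counter(labels))

# Made-up sample; in the notebook this could be dataset['train']['label']
sample_labels = [0, 0, 0, 1, 1, 996]
counts = images_per_label(sample_labels)

print(counts)
print("labels:", len(counts),
      "min per label:", min(counts.values()),
      "max per label:", max(counts.values()))
```

Labels with very few images may be worth reviewing before the images are used for similarity matching.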


* Save the dataset images as JPG files

In [21]:
import os
import io
import pandas as pd
from PIL import Image

def save_dataset_images(dataset, label_mapping_csv, output_base_dir='dataset_images'):
    """
    Save images from a Hugging Face dataset to folders named using a CSV label mapping.
    
    Parameters:
    - dataset: a single Hugging Face dataset split (e.g. dataset['train']), not the DatasetDict
    - label_mapping_csv: Path to CSV file with label mapping
    - output_base_dir: Base directory to save images
    
    Returns:
    - Path to the created output directory
    """
    # Read the label mapping CSV
    try:
        label_mapping_df = pd.read_csv(label_mapping_csv)
    except Exception as e:
        raise ValueError(f"Error reading label mapping CSV: {e}") from e
    
    # Create a dictionary mapping integer labels to names.
    # Assumes the CSV has 'Label' (integer) and 'Name' columns;
    # adjust the column names if your CSV uses different ones.
    try:
        label_map = dict(zip(label_mapping_df['Label'], label_mapping_df['Name']))
    except KeyError as e:
        raise ValueError("CSV must contain 'Label' and 'Name' columns") from e
    
    # Ensure output directory exists
    os.makedirs(output_base_dir, exist_ok=True)
    
    # Create a list to track saved images
    saved_images = []
    
    # Iterate through the dataset
    for idx, item in enumerate(dataset):
        # Get image and integer label
        img = item['image']
        label = item['label']
        
        # Get the name from the label mapping
        try:
            label_name = label_map[label]
        except KeyError:
            print(f"Warning: No mapping found for label {label}. Skipping.")
            continue
        
        # Create label-specific folder using the mapped name
        label_dir = os.path.join(output_base_dir, label_name)
        os.makedirs(label_dir, exist_ok=True)
        
        # Generate unique filename
        filename = f'image_{idx}_{label_name}.jpg'
        filepath = os.path.join(label_dir, filename)
        
        # Save image as RGB JPG
        try:
            img.convert('RGB').save(filepath, 'JPEG')
        except Exception as e:
            print(f"Error saving image {filename}: {e}")
            continue
        
        # Track saved image details
        saved_images.append({
            'original_label': label,
            'label_name': label_name,
            'filename': filename,
            'full_path': filepath
        })
    
    # Optional: Create a log of saved images
    log_path = os.path.join(output_base_dir, 'saved_images_log.csv')
    pd.DataFrame(saved_images).to_csv(log_path, index=False)
    
    print(f"Images saved to {output_base_dir}")
    print(f"Saved images log: {log_path}")
    
    return output_base_dir

# Example usage (pass a single split, not the whole DatasetDict)
# save_dataset_images(dataset['train'], 'label_mapping.csv')
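
The function above expects the mapping CSV to have a `Label` column (integer) and a `Name` column. A minimal sketch of that layout and of building the same kind of lookup with the standard library; the file contents and the `load_label_map` helper are hypothetical, and the real `label_names.csv` may contain additional columns:

```python
import csv
import io

# Hypothetical file contents illustrating the expected columns
csv_text = "Label,Name\n0,Person A\n1,Person B\n"

def load_label_map(csv_file):
    """Build {integer label: name} from a CSV with 'Label' and 'Name' columns."""
    reader = csv.DictReader(csv_file)
    return {int(row["Label"]): row["Name"] for row in reader}

label_map = load_label_map(io.StringIO(csv_text))
print(label_map)  # {0: 'Person A', 1: 'Person B'}
```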

In [22]:
save_dataset_images(dataset=dataset['train'], label_mapping_csv='label_names.csv')

Images saved to dataset_images
Saved images log: dataset_images\saved_images_log.csv


'dataset_images'
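
To confirm the export, the saved files can be counted per folder and compared with the dataset's row count. A minimal sketch, assuming the layout produced above (one subfolder per name, `.jpg` files inside); `count_images_per_folder` is a hypothetical helper name:

```python
from pathlib import Path

def count_images_per_folder(base_dir):
    """Return {folder name: number of .jpg files} for each subfolder of base_dir."""
    base = Path(base_dir)
    return {
        sub.name: sum(1 for _ in sub.glob("*.jpg"))
        for sub in sorted(base.iterdir())
        if sub.is_dir()
    }

# Only runs if the export folder exists in the working directory
if Path("dataset_images").is_dir():
    counts = count_images_per_folder("dataset_images")
    print(len(counts), "folders,", sum(counts.values()), "images")
```

A total below the dataset's `num_rows` would point to skipped labels or failed saves, both of which the function reports while running.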

#### We now have all the images saved in the dataset_images folder

- Some findings
    - Some celebrities, such as Zoe Zaldana and ZoE Zaldana, are the same person but are assigned two separate folders.
    - The dataset does not include gender information. This is a challenge: if a user uploads an image and selects a gender, and the app then returns a match of a different gender, it could hurt the user's perception of the result and, consequently, the app experience.
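
Duplicate identities like the pair in the first finding can be surfaced by grouping names after normalizing case and whitespace. A minimal sketch, assuming the duplicates differ only in capitalization or spacing; `find_case_duplicates` is a hypothetical helper and `Person A` is a made-up entry:

```python
from collections import defaultdict

def find_case_duplicates(names):
    """Group names that match after collapsing whitespace and casefolding."""
    groups = defaultdict(list)
    for name in names:
        key = " ".join(name.split()).casefold()
        groups[key].append(name)
    # Keep only groups that contain more than one distinct spelling
    return [variants for variants in groups.values() if len(set(variants)) > 1]

names = ["Zoe Zaldana", "ZoE Zaldana", "Person A"]
print(find_case_duplicates(names))  # [['Zoe Zaldana', 'ZoE Zaldana']]
```

Each reported group could then be merged into one folder by mapping all variants to a single canonical name in the label CSV.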