## Deepfake Dataset Generator: 

This notebook is a simple implementation of a deepfake dataset generator built on the ArtBench-10 dataset. Source images come from ArtBench-10, and target images are generated from them with a Stable Diffusion image-variation model. For each datapoint, the notebook performs the following steps:
- Load the ArtBench-10 dataset
- Obtain metadata for the current datapoint
- Generate a variant image using the stable diffusion model
- Save the source and target images in the dataset
- Save the metadata for the generated datapoint

In [1]:
!pip install numpy
!pip install matplotlib
!pip install pandas


### Load dataset:

In [2]:
# imports used to load and inspect the ArtBench-10 data
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
# import torch

In [13]:
# Given a datapoint, return the path to its image, or None if it cannot be found
def get_image_path(data, index):
    root = './artbench-10-imagefolder-split/artbench-10-imagefolder-split'
    row = data.iloc[index]
    # Look in the train folder first, then fall back to the test folder
    for split in ('train', 'test'):
        image_path = os.path.join(root, split, row['label'], row['name'])
        # Check for the file directly instead of opening it and leaving the handle open
        if os.path.isfile(image_path):
            return image_path
    return None
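
The lookup above searches the train folder and then the test folder. The CSV loaded later in the notebook also has a `split` column, so the path could be built directly from it. The following is only a sketch under that assumption: the helper name `image_path_from_row` and the `root` parameter are hypothetical, and it assumes the `split` values match the folder names on disk.

```python
import os

def image_path_from_row(row, root):
    """Build <root>/<split>/<label>/<name> from one metadata row.

    `row` is any mapping with 'split', 'label' and 'name' keys
    (for example a pandas Series from data.iloc[i]).
    Assumption: the 'split' value is also the folder name.
    """
    return os.path.join(root, row["split"], row["label"], row["name"])

example_row = {
    "split": "train",
    "label": "impressionism",
    "name": "goldstein-grigoriy_morning.jpg",
}
example_path = image_path_from_row(example_row, "artbench-root")
```

This version does not confirm that the file exists, so a caller would still check `os.path.isfile(example_path)` before opening it.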

In [14]:
def populate_metadata(data, metadata):
    # Fill the path, title, artist, and style columns for every row of `data`
    for i in range(len(data)):
        image_path = get_image_path(data, i)
        metadata.at[i, 'image'] = image_path
        # File names follow "<artist>_<title>.jpg"; keep only the title part
        metadata.at[i, 'original_artwork'] = data.iloc[i]['name'].split("_")[1].split(".jpg")[0]
        metadata.at[i, 'artist'] = data.iloc[i]['artist']
        metadata.at[i, 'original_style'] = data.iloc[i]['label']


To save metadata for the generated dataset, we populate a JSON file with the following fields, taken from the metadata_example.json file:

| Field | Description |
| --- | --- |
| original_artwork | The title of the original artwork it's based upon |
| artist | The artist of the original artwork |
| date | The period in which the original artwork was created |
| description | A brief description of the original artwork |
| image | The URL of the original artwork |
| original_style | The style of the original artwork |
| medium | The medium used to create the original artwork |
| AI model | The AI model used to generate the target image |
| deepfake_image | The URL of the generated target image |
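
As a rough illustration of the fields above, the sketch below builds one metadata record and serializes it with the standard `json` module. The layout of metadata_example.json is not shown in this notebook, so a list of flat objects is an assumption here. The field names follow the DataFrame columns used later in the notebook, and `make_record` is a hypothetical helper.

```python
import json

# Field names as used for the metadata DataFrame columns in this notebook
FIELDS = ["original_artwork", "artist", "date", "description", "image",
          "original_style", "medium", "AI model", "deepfake_image"]

def make_record(values):
    """Return a dict with every metadata field; missing fields become None."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise KeyError(f"unknown metadata fields: {sorted(unknown)}")
    return {field: values.get(field) for field in FIELDS}

record = make_record({
    "original_artwork": "towards-night-and-winter",
    "artist": "frank-omeara",
    "original_style": "impressionism",
})

# Assumed file layout: a JSON list with one object per datapoint
json_text = json.dumps([record], indent=2)
```

Fields that are not yet known, such as `date` or `medium`, are written as `null`, so they can be filled in later without changing the structure.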



In [15]:
# load the ArtBench-10 metadata CSV
data = pd.read_csv('./ArtBench-10.csv')
print("Number of samples: ", len(data))
# take only the first 1000 samples
data = data[:1000]
data.head()

Number of samples:  60000


| | name | artist | url | is_public_domain | length | width | label | split | cifar_index |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | frank-omeara_towards-night-and-winter.jpg | frank-omeara | https://uploads5.wikiart.org/00316/images/fran... | True | 800 | 657 | impressionism | train | 43186 |
| 1 | goldstein-grigoriy_morning.jpg | goldstein-grigoriy | https://uploads5.wikiart.org/images/grigoriy-g... | True | 521 | 499 | impressionism | train | 41151 |
| 2 | georges-lemmen_man-reading.jpg | georges-lemmen | https://uploads6.wikiart.org/images/georges-le... | True | 800 | 612 | impressionism | train | 9754 |
| 3 | theodor-aman_port-of-constantza-1882.jpg | theodor-aman | https://uploads6.wikiart.org/images/theodor-am... | True | 560 | 336 | impressionism | train | 44244 |
| 4 | niccolo-cannicci_il-passo-della-futa-1914.jpg | niccolo-cannicci | https://uploads3.wikiart.org/images/niccolo-ca... | True | 2400 | 2322 | impressionism | train | 46885 |


In [16]:
data.iloc[0]['name'].split("_")[1].split(".jpg")[0]

'towards-night-and-winter'
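
The expression above relies on file names of the form `<artist>_<title>.jpg`. The sketch below is a slightly more defensive variant that splits only on the first underscore and removes the extension with `os.path.splitext`. The function name `parse_artbench_name` is hypothetical, and the naming pattern is an assumption based on the rows shown in this notebook.

```python
import os

def parse_artbench_name(filename):
    """Split '<artist>_<title>.<ext>' into (artist, title).

    Only the first underscore separates artist from title, so a title
    that itself contains underscores is kept intact.
    """
    stem, _ext = os.path.splitext(filename)
    artist, _sep, title = stem.partition("_")
    return artist, title

artist, title = parse_artbench_name("frank-omeara_towards-night-and-winter.jpg")
# artist == 'frank-omeara', title == 'towards-night-and-winter'
```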

In [18]:
# create an empty DataFrame for the metadata, with one column per metadata field
metadata = pd.DataFrame(columns=["original_artwork", "artist", "date", "description", "image", "original_style", "medium", "AI model", "deepfake_image"])
populate_metadata(data, metadata)
metadata.head()


| | original_artwork | artist | date | description | image | original_style | medium | AI model | deepfake_image |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | towards-night-and-winter | frank-omeara | | | ./artbench-10-imagefolder-split/artbench-10-im... | impressionism | | | |
| 1 | morning | goldstein-grigoriy | | | ./artbench-10-imagefolder-split/artbench-10-im... | impressionism | | | |
| 2 | man-reading | georges-lemmen | | | ./artbench-10-imagefolder-split/artbench-10-im... | impressionism | | | |
| 3 | port-of-constantza-1882 | theodor-aman | | | ./artbench-10-imagefolder-split/artbench-10-im... | impressionism | | | |
| 4 | il-passo-della-futa-1914 | niccolo-cannicci | | | ./artbench-10-imagefolder-split/artbench-10-im... | impressionism | | | |


### Generate deepfake variants of ArtBench-10 dataset:

In [None]:
from diffusers import StableDiffusionImageVariationPipeline
from PIL import Image
from torchvision import transforms

In [None]:
# Load the pipeline and build the preprocessing transform once, outside the loop
device = "cuda:0"
sd_pipe = StableDiffusionImageVariationPipeline.from_pretrained(
    "lambdalabs/sd-image-variations-diffusers",
    revision="v2.0",
)
sd_pipe = sd_pipe.to(device)
tform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize(
        (224, 224),
        interpolation=transforms.InterpolationMode.BICUBIC,
        antialias=False,
    ),
    transforms.Normalize(
        [0.48145466, 0.4578275, 0.40821073],
        [0.26862954, 0.26130258, 0.27577711]),
])

for idx, elem in enumerate(metadata['image']):
    if elem is None:
        print("None found")
        break
    # Convert to RGB so the 3-channel normalization always applies
    img = Image.open(elem).convert("RGB")
    inp = tform(img).to(device).unsqueeze(0)

    # Round each dimension down to a multiple of 8, as the pipeline requires
    width = img.width - img.width % 8
    height = img.height - img.height % 8
    out = sd_pipe(inp, guidance_scale=3, width=width, height=height)
    # Include the row index in the file name so earlier results are not overwritten
    for i, image in enumerate(out["images"]):
        image.save(f"result_{idx}_{i}.jpg")
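
The generation cell rounds each image dimension down to a multiple of 8 and writes the generated images to disk. The sketch below isolates those two small steps so they can be checked without a GPU. Both helper names (`floor_to_multiple`, `variant_filename`) and the file-name pattern are hypothetical, not something the notebook or the diffusers library defines.

```python
def floor_to_multiple(value, base=8):
    """Return the largest multiple of `base` that is <= value."""
    return value - value % base

def variant_filename(row_index, variant_index=0, prefix="deepfake"):
    """Build a distinct file name per source row and per generated variant."""
    return f"{prefix}_{row_index:05d}_{variant_index}.jpg"

# Dimensions taken from the first CSV row shown earlier (length 800, width 657)
target_size = (floor_to_multiple(800), floor_to_multiple(657))
first_name = variant_filename(0)
```

A name built this way could then be stored in the `deepfake_image` column for the corresponding row, so that each generated image stays linked to its source artwork.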

