# CS 4644: Final Project Dataset Setup

This notebook pulls the ["Human Faces" by Ashwin Gupta](https://www.kaggle.com/datasets/ashwingupta3012/human-faces/data), ["Fake-Vs-Real-Faces (Hard)" by Hamza Boulahi](https://www.kaggle.com/datasets/hamzaboulahia/hardfakevsrealfaces), and ["deepfake and real images" by Manjil Karki](https://www.kaggle.com/datasets/manjilkarki/deepfake-and-real-images) datasets from Kaggle, removes duplicate images, and stores the results as zip files so that our models can load them easily.

Copyright (c) 2025 Ethan Nguyen-Tu

## Part 1: Setup

##### STEP 1: Mount Google Drive for Google Colab

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [30]:

drive_path = "drive/MyDrive" # NOTE: Separated so that colab can access the '.kaggle' folder in your Google Drive for Kaggle API authentication
project_folder = drive_path + "/CS4644_FinalProject"

##### STEP 2: Basic Imports

In [19]:
import kagglehub
import os
import zipfile
import pandas as pd

##### STEP 3: Package Installations (If Needed)

In [4]:
# ! pip install -q kaggle # Install kaggle if needed
! pip install imagehash

Collecting imagehash
  Downloading ImageHash-4.3.2-py2.py3-none-any.whl.metadata (8.4 kB)
Collecting PyWavelets (from imagehash)
  Downloading pywavelets-1.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.0 kB)
Downloading ImageHash-4.3.2-py2.py3-none-any.whl (296 kB)
Downloading pywavelets-1.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.5 MB)
Installing collected packages: PyWavelets, imagehash
Successfully installed PyWavelets-1.8.0 imagehash-4.3.2


##### STEP 4: Helper Functions

In [48]:
def zip_files(image_files_path, zip_filename, whitelist=None):
  count = 0
  with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
    for root, dirs, files in os.walk(image_files_path):
        for file in files:
          if whitelist and file not in whitelist:
            continue
          file_path = os.path.join(root, file)
          zipf.write(file_path, os.path.relpath(file_path, image_files_path))
          count += 1
  print(f"{count} files from folder {image_files_path} have been zipped as {zip_filename}")
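
Two details of the whitelist check above are easy to miss: `if whitelist and ...` treats an empty whitelist the same as no whitelist (everything gets zipped), and a list whitelist is scanned once per file. The sketch below is a hypothetical variant, `zip_selected`, that uses a set and an explicit `is not None` test. It is not the helper used in the rest of this notebook, and the file names in the demo are made up.

```python
import os
import tempfile
import zipfile


def zip_selected(folder, zip_path, whitelist=None):
    """Zip files under `folder`; if `whitelist` is given, keep only those names."""
    # A set gives fast membership checks; `is not None` makes an empty
    # whitelist mean "zip nothing" instead of "zip everything".
    allowed = set(whitelist) if whitelist is not None else None
    count = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                if allowed is not None and name not in allowed:
                    continue
                full_path = os.path.join(root, name)
                # Store paths relative to `folder`, as zip_files does.
                zipf.write(full_path, os.path.relpath(full_path, folder))
                count += 1
    return count


# Demo on throwaway files (names are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "images")
    os.makedirs(src)
    for name in ("a.jpg", "b.jpg", "c.jpg"):
        with open(os.path.join(src, name), "wb") as fh:
            fh.write(b"placeholder bytes")
    zip_path = os.path.join(tmp, "subset.zip")
    n_zipped = zip_selected(src, zip_path, whitelist=["a.jpg", "c.jpg"])
    with zipfile.ZipFile(zip_path) as zf:
        members = sorted(zf.namelist())
    print(n_zipped, members)
```

Returning the count instead of only printing it also makes the helper easier to check in a later cell.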

In [6]:
def zip_to_colab(zip_file_path, extract_dir_name):
  extract_dir = '/content/' + extract_dir_name + "/"
  os.makedirs(extract_dir, exist_ok=True)

  with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
      zip_ref.extractall(extract_dir)

  print(f"Files from {zip_file_path} extracted to: {extract_dir}")
  print("Number of files extracted:", len(os.listdir(extract_dir)))
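
Note that `os.listdir(extract_dir)` only counts top-level entries, so an archive that unpacks into a single subfolder reports `1` no matter how many images it holds. A minimal sketch of a recursive count follows; `count_files_recursive` is a hypothetical helper, not part of the notebook.

```python
import os
import tempfile


def count_files_recursive(folder):
    """Count files in `folder` and all of its subfolders."""
    return sum(len(files) for _root, _dirs, files in os.walk(folder))


# Demo: three files inside one nested folder.
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "Humans")
    os.makedirs(nested)
    for i in range(3):
        open(os.path.join(nested, f"img_{i}.jpg"), "wb").close()
    top_level_entries = len(os.listdir(tmp))    # counts only the subfolder
    total_files = count_files_recursive(tmp)    # counts the files inside it
    print(top_level_entries, total_files)
```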

In [7]:
def get_file_extensions(folder_path):
    files = os.listdir(folder_path)

    extensions = set()

    for file in files:
        if os.path.isfile(os.path.join(folder_path, file)):
            extensions.add(os.path.splitext(file)[1])

    return extensions
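
Because `get_file_extensions` returns the raw strings, `.jpg` and `.JPG` show up as separate entries, and the set does not say how many files use each one. The sketch below is a hypothetical companion, `count_extensions`, that tallies extensions and optionally folds case. The demo file names are invented.

```python
import os
import tempfile
from collections import Counter


def count_extensions(folder, normalize_case=True):
    """Tally file extensions in `folder` (top level only)."""
    counts = Counter()
    for entry in os.listdir(folder):
        if os.path.isfile(os.path.join(folder, entry)):
            ext = os.path.splitext(entry)[1]
            counts[ext.lower() if normalize_case else ext] += 1
    return counts


# Demo with mixed-case extensions.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1.jpg", "2.JPG", "3.png", "4.jpeg"):
        open(os.path.join(tmp, name), "wb").close()
    extension_counts = count_extensions(tmp)
    print(dict(extension_counts))
```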

In [8]:
# Source: https://github.com/cw-somil/Duplicate-Remover

from PIL import Image
import imagehash
import numpy as np

class DuplicateRemover:
    def __init__(self,dirname,hash_size = 8):
        self.dirname = dirname
        self.hash_size = hash_size

    def find_duplicates(self):
        """
        Find and Delete Duplicates
        """

        fnames = os.listdir(self.dirname)
        hashes = {}
        duplicates = []
        print("Finding Duplicates Now!\n")
        for image in fnames:
            with Image.open(os.path.join(self.dirname,image)) as img:
                temp_hash = imagehash.average_hash(img, self.hash_size)
                if temp_hash in hashes:
                    print("Duplicate {} \nfound for Image {}!\n".format(image,hashes[temp_hash]))
                    duplicates.append(image)
                else:
                    hashes[temp_hash] = image

        if len(duplicates) != 0:
            a = input("Do you want to delete these {} Images? Press Y or N:  ".format(len(duplicates)))
            space_saved = 0
            if(a.strip().lower() == "y"):
                for duplicate in duplicates:
                    space_saved += os.path.getsize(os.path.join(self.dirname,duplicate))

                    os.remove(os.path.join(self.dirname,duplicate))
                    print("{} Deleted Successfully!".format(duplicate))

                print("\n\nYou saved {} MB of space!".format(round(space_saved/1000000, 2)))
            else:
                print("Thank you for Using Duplicate Remover")
        else:
            print("No Duplicates Found :(")

    def find_similar(self,location,similarity=95):
        fnames = os.listdir(self.dirname)
        threshold = 1 - similarity/100
        diff_limit = int(threshold*(self.hash_size**2))

        with Image.open(location) as img:
            hash1 = imagehash.average_hash(img, self.hash_size).hash

        print("Finding Similar Images to {} Now!\n".format(location))
        for image in fnames:
            with Image.open(os.path.join(self.dirname,image)) as img:
                hash2 = imagehash.average_hash(img, self.hash_size).hash

                if np.count_nonzero(hash1 != hash2) <= diff_limit:
                    print("{} image found {}% similar to {}".format(image,similarity,location))
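
For intuition, here is a rough pure-Python sketch of the idea behind the average hash and of the `diff_limit` arithmetic in `find_similar`. It assumes the image has already been converted to grayscale and shrunk to a small grid, which the `imagehash` library does internally before thresholding. The pixel values are made up, and the function names are hypothetical.

```python
def average_hash_bits(pixels):
    """One bit per pixel: True where the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [value > mean for value in flat]


def hamming_distance(bits_a, bits_b):
    """Number of positions where the two hashes disagree."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


def diff_limit(similarity, hash_size=8):
    """Same arithmetic as find_similar: how many bits may differ."""
    threshold = 1 - similarity / 100
    return int(threshold * (hash_size ** 2))


# Two tiny 2x2 "images" with hypothetical grayscale values (0-255).
image_a = [[10, 200], [30, 220]]
image_b = [[12, 198], [220, 30]]

bits_a = average_hash_bits(image_a)
bits_b = average_hash_bits(image_b)
print(hamming_distance(bits_a, bits_b))  # 2: the bottom row is swapped
print(diff_limit(95), diff_limit(100))   # 3 0
```

With the default 8x8 hash, a 95% similarity setting tolerates 3 differing bits out of 64. `find_duplicates` is stricter in one sense, since it only flags images whose hashes are exactly equal, but two visibly different images can still share the same 64-bit hash.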

## Part 2: "Human Faces" Dataset

Dataset Source: ["Human Faces" by Ashwin Gupta](https://www.kaggle.com/datasets/ashwingupta3012/human-faces/data)

Loads the dataset from Kaggle through the Kaggle API before removing duplicate images and zipping the files to the drive.

##### STEP 1: Import the data from Kaggle using Kaggle API

1. Load the raw dataset files from Kaggle

In [None]:
human_faces_path = kagglehub.dataset_download("ashwingupta3012/human-faces")

print("Path to dataset files:", human_faces_path)

Path to dataset files: /kaggle/input/human-faces


2. Store the raw dataset files to the drive (necessary if Kaggle stored them in the /input/ folder)

In [None]:
zip_files(human_faces_path, drive_path + '/HumanFacesDataset.zip')

Folder /kaggle/input/human-faces zipped as drive/MyDrive/HumanFacesDataset.zip


##### STEP 2: Basic File Exploration

1. Check what was downloaded from Kaggle

In [None]:
os.listdir(human_faces_path)

['Humans']

2. Check the Humans Folder

In [None]:
human_faces_image_folder = human_faces_path + '/Humans/'
human_faces_image_folder_filenames = os.listdir(human_faces_image_folder)
HUMANFACES_IMG_EXTENSIONS = get_file_extensions(human_faces_image_folder)


print("Number of image files:", len(human_faces_image_folder_filenames))
print("First 10 Files:", human_faces_image_folder_filenames[:10]) # Check the first 10 files
print("Image file extensions:", HUMANFACES_IMG_EXTENSIONS)

Number of image files: 7219
First 10 Files: ['1 (2916).jpg', '1 (607).jpg', '1 (3767).jpg', '1 (576).jpg', '1 (1856).jpg', '1 (1464).jpg', '1 (1290).jpg', '1 (1341).jpg', '1 (2598).jpg', '1 (789).jpg']
Image file extensions: {'.jpeg', '.jpg', '.JPG', '.png'}


##### STEP 3: Clean the Data

References:
1. ["Deleting duplicated images" by saworz](https://www.kaggle.com/code/saworz/deleting-duplicated-images?scriptVersionId=108951245)

1. Load the raw dataset files from the drive (necessary if Kaggle stored them in the /input/ folder)

In [None]:
zip_to_colab(drive_path + "/HumanFacesDataset.zip", "HumanFacesImages")

Files from drive/MyDrive/HumanFacesDataset.zip extracted to: /content/HumanFacesImages/
Number of files extracted: 1


In [None]:
# Reset stored folder, filenames, and extensions

human_faces_image_folder = "/content/HumanFacesImages/Humans/"
human_faces_image_folder_filenames = os.listdir(human_faces_image_folder)
HUMANFACES_IMG_EXTENSIONS = get_file_extensions(human_faces_image_folder)

print("Number of image files:", len(human_faces_image_folder_filenames))
print("First 10 Files:", human_faces_image_folder_filenames[:10]) # Check the first 10 files
print("Image file extensions:", HUMANFACES_IMG_EXTENSIONS)

Number of image files: 7219
First 10 Files: ['1 (3266).jpg', '1 (3452).jpg', '1 (5516).jpg', '1 (4032).jpg', '1 (3295).jpg', '1 (1016).jpg', '1 (2832).jpg', '1 (5160).jpg', '1 (1369).jpg', '1 (3636).jpg']
Image file extensions: {'.jpeg', '.jpg', '.JPG', '.png'}


2. Rename the files

In [None]:
for file_name in human_faces_image_folder_filenames:
    filetype = file_name[file_name.index("."):]
    file_number = file_name[file_name.index("(") + 1: file_name.index(")")]
    new_name = "HF_" + file_number + filetype
    os.rename(os.path.join(human_faces_image_folder, file_name),
              os.path.join(human_faces_image_folder, new_name))
    print(file_name, "was renamed to", new_name)

1 (3266).jpg was renamed to HF_3266.jpg
1 (3452).jpg was renamed to HF_3452.jpg
1 (5516).jpg was renamed to HF_5516.jpg
1 (4032).jpg was renamed to HF_4032.jpg
1 (3295).jpg was renamed to HF_3295.jpg
1 (1016).jpg was renamed to HF_1016.jpg
1 (2832).jpg was renamed to HF_2832.jpg
1 (5160).jpg was renamed to HF_5160.jpg
1 (1369).jpg was renamed to HF_1369.jpg
1 (3636).jpg was renamed to HF_3636.jpg
1 (2147).jpg was renamed to HF_2147.jpg
1 (3184).jpg was renamed to HF_3184.jpg
1 (3999).jpg was renamed to HF_3999.jpg
1 (1065).jpg was renamed to HF_1065.jpg
1 (4141).jpg was renamed to HF_4141.jpg
1 (4725).jpg was renamed to HF_4725.jpg
1 (94).jpg was renamed to HF_94.jpg
1 (3658).jpg was renamed to HF_3658.jpg
1 (1081).jpg was renamed to HF_1081.jpg
1 (5502).jpg was renamed to HF_5502.jpg
1 (1934).jpg was renamed to HF_1934.jpg
1 (1909).jpg was renamed to HF_1909.jpg
1 (1427).jpg was renamed to HF_1427.jpg
1 (4441).jpg was renamed to HF_4441.jpg
1 (844).jpg was renamed to HF_844.jpg
1 (49)
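
The rename loop relies on every name having the form `1 (N).ext` with a single dot. A slightly more defensive way to build the new name is sketched below, using `os.path.splitext` and a regular expression. `build_new_name` is a hypothetical helper, and returning `None` for names that do not match is an assumption about how such files should be handled.

```python
import os
import re

# Digits inside parentheses, e.g. the "2916" in "1 (2916).jpg".
_NUMBER_IN_PARENS = re.compile(r"\((\d+)\)")


def build_new_name(file_name, prefix="HF_"):
    """Return e.g. 'HF_2916.jpg' for '1 (2916).jpg', or None if no number is found."""
    stem, ext = os.path.splitext(file_name)  # splits at the last dot only
    match = _NUMBER_IN_PARENS.search(stem)
    if match is None:
        return None
    return f"{prefix}{match.group(1)}{ext}"


print(build_new_name("1 (2916).jpg"))      # HF_2916.jpg
print(build_new_name("photo.v2 (7).png"))  # HF_7.png
print(build_new_name("notes.txt"))         # None
```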

3. Remove all duplicate images

In [None]:
dr = DuplicateRemover(human_faces_image_folder)
dr.find_duplicates()

Finding Duplicates Now!

Duplicate HF_764.jpg 
found for Image HF_918.jpg!

Duplicate HF_4427.jpg 
found for Image HF_5045.jpg!

Duplicate HF_1488.jpg 
found for Image HF_2066.jpg!

Duplicate HF_2259.jpg 
found for Image HF_1394.jpg!

Duplicate HF_3432.jpg 
found for Image HF_3049.jpg!

Duplicate HF_2785.jpg 
found for Image HF_3836.jpg!

Duplicate HF_2107.jpg 
found for Image HF_1454.jpg!

Duplicate HF_1401.jpg 
found for Image HF_286.jpg!

Duplicate HF_1164.jpg 
found for Image HF_1983.jpg!

Duplicate HF_4898.jpg 
found for Image HF_5123.jpg!

Duplicate HF_4109.jpg 
found for Image HF_4331.jpg!

Duplicate HF_4010.jpg 
found for Image HF_3786.jpg!

Duplicate HF_2542.jpg 
found for Image HF_2014.jpg!

Duplicate HF_4620.jpg 
found for Image HF_5097.jpg!

Duplicate HF_3095.jpg 
found for Image HF_5046.jpg!

Duplicate HF_2881.jpg 
found for Image HF_4358.jpg!

Duplicate HF_915.jpg 
found for Image HF_2066.jpg!

Duplicate HF_2484.jpg 
found for Image HF_1422.jpg!

Duplicate HF_2632.jpg 
fo

Streaming output truncated to the last 5000 lines.
Duplicate HF_2479.jpg 
found for Image HF_457.jpg!

Duplicate HF_4060.jpg 
found for Image HF_4176.jpg!

Duplicate HF_3869.jpg 
found for Image HF_4345.jpg!

Duplicate HF_4405.jpg 
found for Image HF_4930.jpg!

Duplicate HF_743.jpg 
found for Image HF_2601.jpg!

Duplicate HF_1723.jpg 
found for Image HF_75.jpg!

Duplicate HF_3018.jpg 
found for Image HF_4175.jpg!

Duplicate HF_48.png 
found for Image HF_21.png!

Duplicate HF_4433.jpg 
found for Image HF_3542.jpg!

Duplicate HF_4220.jpg 
found for Image HF_3728.jpg!

Duplicate HF_3483.jpg 
found for Image HF_4282.jpg!

Duplicate HF_4815.jpg 
found for Image HF_5120.jpg!

Duplicate HF_4826.jpg 
found for Image HF_5142.jpg!

Duplicate HF_6976.jpg 
found for Image HF_5451.jpg!

Duplicate HF_2008.jpg 
found for Image HF_2187.jpg!

Duplicate HF_1572.jpg 
found for Image HF_759.jpg!

Duplicate HF_1098.jpg 
found for Image HF_2440.jpg!

Duplicate HF_141.jpg 
found for Image HF_20

In [None]:
print("Number of image files after duplicates removed:", len(os.listdir(human_faces_image_folder)))

Number of image files after duplicates removed: 3273
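
The count drops from 7219 to 3273, so more than half of the images shared an average hash with an earlier image. Since a perceptual hash can also match images that are merely similar, it may be worth checking how many files are byte-for-byte identical before deleting. The sketch below groups files by SHA-256 digest. `group_exact_duplicates` is a hypothetical helper shown on throwaway files, not something that was run on this dataset.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def group_exact_duplicates(folder):
    """Return lists of file names in `folder` whose contents are byte-identical."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        groups[digest].append(name)
    # Keep only digests that occur more than once.
    return [names for names in groups.values() if len(names) > 1]


# Demo: a.jpg and b.jpg share contents, c.jpg does not.
with tempfile.TemporaryDirectory() as tmp:
    contents = {"a.jpg": b"same", "b.jpg": b"same", "c.jpg": b"different"}
    for name, data in contents.items():
        with open(os.path.join(tmp, name), "wb") as fh:
            fh.write(data)
    exact_groups = group_exact_duplicates(tmp)
    print(exact_groups)
```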


##### STEP 4: Save Cleaned Dataset

In [None]:
zip_files(human_faces_image_folder, drive_path + '/HumanFacesCleaned.zip')

Folder /content/HumanFacesImages/Humans/ zipped as drive/MyDrive/HumanFacesCleaned.zip


## Part 3: "Fake-Vs-Real-Faces (Hard)" Dataset

Dataset Source: ["Fake-Vs-Real-Faces (Hard)" by Hamza Boulahi](https://www.kaggle.com/datasets/hamzaboulahia/hardfakevsrealfaces)

Loads the dataset from Kaggle through the Kaggle API before removing duplicate images and zipping the files to the drive.

##### STEP 1: Import the data from Kaggle using Kaggle API

1. Load the raw dataset files from Kaggle

In [None]:
fake_v_hard_path = kagglehub.dataset_download("hamzaboulahia/hardfakevsrealfaces")

print("Path to dataset files:", fake_v_hard_path)

Path to dataset files: /kaggle/input/hardfakevsrealfaces


##### STEP 2: Basic File Exploration

1. Check what was downloaded from Kaggle

In [None]:
os.listdir(fake_v_hard_path)

['fake', 'real', 'data.csv']

1. Check Real Images

In [None]:
real_images_path = fake_v_hard_path + '/real/'
real_image_filenames = os.listdir(real_images_path)
REAL_IMG_EXTENSIONS = get_file_extensions(real_images_path)

print("Number of image files:", len(real_image_filenames))
print("Image file extensions:", REAL_IMG_EXTENSIONS)

Number of image files: 589
Image file extensions: {'.jpg'}


2. Check Fake Images

In [None]:
fake_images_path = fake_v_hard_path + '/fake/'
fake_image_filenames = os.listdir(fake_images_path)
FAKE_IMG_EXTENSIONS = get_file_extensions(fake_images_path)

print("Number of image files:", len(fake_image_filenames))
print("Image file extensions:", FAKE_IMG_EXTENSIONS)

Number of image files: 700
Image file extensions: {'.jpg'}


3. Check the data.csv file

In [None]:
pd.read_csv(fake_v_hard_path + "/data.csv")

Unnamed: 0,images_id,label
0,real_1,real
1,real_10,real
2,real_100,real
3,real_101,real
4,real_102,real
...,...,...
1284,fake_95,fake
1285,fake_96,fake
1286,fake_97,fake
1287,fake_98,fake
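
The `images_id` column holds names without an extension and the labels are strings, while the test labels built in Part 4 use `0` for fake and `1` for real. A small sketch of converting rows into `(filename, integer label)` pairs follows. It assumes each file is named `images_id + ".jpg"`, which this notebook does not verify. `parse_label_rows` and the sample text are illustrative only.

```python
import csv
import io

# Same convention as the test labels: 0 = fake, 1 = real.
LABEL_TO_INT = {"fake": 0, "real": 1}


def parse_label_rows(csv_text, extension=".jpg"):
    """Turn CSV text with images_id/label columns into (filename, int) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["images_id"] + extension, LABEL_TO_INT[row["label"]])
        for row in reader
    ]


# Sample rows in the same shape as data.csv (unnamed index column first).
sample_csv = ",images_id,label\n0,real_1,real\n1284,fake_95,fake\n"
pairs = parse_label_rows(sample_csv)
print(pairs)  # [('real_1.jpg', 1), ('fake_95.jpg', 0)]
```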


##### STEP 3: Clean the Data

References:
1. ["Deleting duplicated images" by saworz](https://www.kaggle.com/code/saworz/deleting-duplicated-images?scriptVersionId=108951245)

1. Check the real images for duplicates

In [None]:
dr = DuplicateRemover(real_images_path)
dr.find_duplicates()

Finding Duplicates Now!

No Duplicates Found :(


2. Check the fake images for duplicates

In [None]:
dr = DuplicateRemover(fake_images_path)
dr.find_duplicates()

Finding Duplicates Now!

No Duplicates Found :(


##### STEP 4: Store the Files in Drive

1. Store the fake images and the real images separately as zip files

In [None]:
zip_files(real_images_path, 'RealImages.zip')
zip_files(fake_images_path, 'FakeImages.zip')

Folder /root/.cache/kagglehub/datasets/hamzaboulahia/hardfakevsrealfaces/versions/1/real/ zipped as RealImages.zip
Folder /root/.cache/kagglehub/datasets/hamzaboulahia/hardfakevsrealfaces/versions/1/fake/ zipped as FakeImages.zip


2. Download the .csv label file

In [None]:
from google.colab import files
files.download(fake_v_hard_path + '/data.csv')


## Part 4: "deepfake and real images" Dataset

Dataset Source: ["deepfake and real images" by Manjil Karki](https://www.kaggle.com/datasets/manjilkarki/deepfake-and-real-images)

Upstream Source: ["OpenForensics: Multi-Face Forgery Detection And Segmentation In-The-Wild Dataset \[V.1.0.0\]" by Trung-Nghia Le, Huy H Nguyen, Junichi Yamagishi, and Isao Echizen](https://zenodo.org/record/5528418#.YpdlS2hBzDd)

Loads the dataset from Kaggle through the Kaggle API, removes duplicate images from its Test split, and zips a subset of those files to the drive for use as our test set.

##### STEP 1: Import the data from Kaggle using Kaggle API

1. Load the raw dataset files from Kaggle

In [9]:
deepfake_and_real_images_path = kagglehub.dataset_download("manjilkarki/deepfake-and-real-images")

print("Path to dataset files:", deepfake_and_real_images_path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/manjilkarki/deepfake-and-real-images?dataset_version_number=1...


100%|██████████| 1.68G/1.68G [00:12<00:00, 140MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/manjilkarki/deepfake-and-real-images/versions/1


##### STEP 2: Basic File Exploration

1. Check what was downloaded from Kaggle

In [10]:
os.listdir(deepfake_and_real_images_path)

['Dataset']

In [12]:
os.listdir(deepfake_and_real_images_path + '/Dataset')

['Test', 'Train', 'Validation']

In [13]:
os.listdir(deepfake_and_real_images_path + '/Dataset/Test')

['Real', 'Fake']

In [14]:
deepfake_and_real_images_path += '/Dataset/Test'
deepfake_and_real_images_path

'/root/.cache/kagglehub/datasets/manjilkarki/deepfake-and-real-images/versions/1/Dataset/Test'

1. Check Real Images

In [17]:
real_images_path = deepfake_and_real_images_path + '/Real/'
real_image_filenames = os.listdir(real_images_path)
REAL_IMG_EXTENSIONS = get_file_extensions(real_images_path)

print("Number of image files:", len(real_image_filenames))
print("Image file extensions:", REAL_IMG_EXTENSIONS)

Number of image files: 5413
Image file extensions: {'.jpg'}


2. Check Fake Images

In [18]:
fake_images_path = deepfake_and_real_images_path + '/Fake/'
fake_image_filenames = os.listdir(fake_images_path)
FAKE_IMG_EXTENSIONS = get_file_extensions(fake_images_path)

print("Number of image files:", len(fake_image_filenames))
print("Image file extensions:", FAKE_IMG_EXTENSIONS)

Number of image files: 5492
Image file extensions: {'.jpg'}


##### STEP 3: Clean the Data

References:
1. ["Deleting duplicated images" by saworz](https://www.kaggle.com/code/saworz/deleting-duplicated-images?scriptVersionId=108951245)

1. Check the real images for duplicates

In [34]:
dr = DuplicateRemover(real_images_path)
dr.find_duplicates()

Finding Duplicates Now!

Duplicate real_4155.jpg 
found for Image real_937.jpg!

Duplicate real_2303.jpg 
found for Image real_4588.jpg!

Duplicate real_2756.jpg 
found for Image real_1917.jpg!

Duplicate real_4951.jpg 
found for Image real_5241.jpg!

Duplicate real_3549.jpg 
found for Image real_2638.jpg!

Duplicate real_1627.jpg 
found for Image real_4680.jpg!

Duplicate real_3676.jpg 
found for Image real_347.jpg!

Duplicate real_4747.jpg 
found for Image real_2485.jpg!

Duplicate real_2815.jpg 
found for Image real_4284.jpg!

Duplicate real_2634.jpg 
found for Image real_1813.jpg!

Duplicate real_557.jpg 
found for Image real_4519.jpg!

Duplicate real_2441.jpg 
found for Image real_3368.jpg!

Duplicate real_233.jpg 
found for Image real_3583.jpg!

Duplicate real_827.jpg 
found for Image real_2491.jpg!

Duplicate real_3974.jpg 
found for Image real_2387.jpg!

Duplicate real_1763.jpg 
found for Image real_3500.jpg!

Duplicate real_271.jpg 
found for Image real_2730.jpg!

Duplicate re

2. Check the fake images for duplicates

In [35]:
dr = DuplicateRemover(fake_images_path)
dr.find_duplicates()

Finding Duplicates Now!

Duplicate fake_856.jpg 
found for Image fake_3274.jpg!

Duplicate fake_4100.jpg 
found for Image fake_1626.jpg!

Duplicate fake_1516.jpg 
found for Image fake_4005.jpg!

Duplicate fake_4984.jpg 
found for Image fake_2634.jpg!

Duplicate fake_4217.jpg 
found for Image fake_1764.jpg!

Duplicate fake_1514.jpg 
found for Image fake_4003.jpg!

Duplicate fake_731.jpg 
found for Image fake_5389.jpg!

Duplicate fake_5007.jpg 
found for Image fake_2650.jpg!

Duplicate fake_1763.jpg 
found for Image fake_4216.jpg!

Duplicate fake_3499.jpg 
found for Image fake_1102.jpg!

Duplicate fake_5006.jpg 
found for Image fake_2649.jpg!

Duplicate fake_641.jpg 
found for Image fake_5303.jpg!

Duplicate fake_1931.jpg 
found for Image fake_4376.jpg!

Duplicate fake_2503.jpg 
found for Image fake_4852.jpg!

Duplicate fake_3460.jpg 
found for Image fake_2546.jpg!

Duplicate fake_2949.jpg 
found for Image fake_548.jpg!

Duplicate fake_3655.jpg 
found for Image fake_1213.jpg!

Duplicate 

##### STEP 4: Create a .csv file labeling fake and real images

1. Get the updated files for real and fake images.

In [42]:
real_image_filenames = os.listdir(real_images_path)
fake_image_filenames = os.listdir(fake_images_path)

print("Number of real image files:", len(real_image_filenames))
print("Number of fake image files:", len(fake_image_filenames))

Number of real image files: 5309
Number of fake image files: 5440


2. Take a subset of 258 total images for a test dataset, 129 from the fake folder and 129 from the real folder.

In [50]:
realSubset = real_image_filenames[:129]
fakeSubset = fake_image_filenames[:129]
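
`os.listdir` returns names in arbitrary order, so slicing the first 129 entries may pick a different subset on another machine or run. If a reproducible subset matters, one option is to sort the names and draw a seeded random sample, as sketched below. `sample_filenames` and the seed value are hypothetical and were not used to produce the files in this notebook.

```python
import random


def sample_filenames(filenames, k, seed=0):
    """Pick k names reproducibly, regardless of the input order."""
    ordered = sorted(filenames)   # remove dependence on listdir order
    rng = random.Random(seed)     # local generator, leaves global state alone
    return rng.sample(ordered, k)


# Demo with made-up names presented in two different orders.
names = [f"real_{i}.jpg" for i in range(1000)]
subset_1 = sample_filenames(names, 129, seed=4644)
subset_2 = sample_filenames(names[::-1], 129, seed=4644)
print(len(subset_1), subset_1 == subset_2)
```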

3. Create the dataframe labeling the fake images as fake and the real images as real.

In [44]:
zero_labels = [0 for _ in range(len(fakeSubset))]
one_labels = [1 for _ in range(len(realSubset))]
filenames = pd.DataFrame({"images_id": fakeSubset + realSubset, "label": zero_labels + one_labels})
filenames

Unnamed: 0,images_id,label
0,fake_5234.jpg,0
1,fake_366.jpg,0
2,fake_4479.jpg,0
3,fake_2599.jpg,0
4,fake_577.jpg,0
...,...,...
253,real_4180.jpg,1
254,real_3319.jpg,1
255,real_4011.jpg,1
256,real_4527.jpg,1


4. Save the labels of the test data to the drive.

In [45]:
filenames.to_csv(project_folder + "/test_data.csv", index=False)

5. Confirm the dataset labeling was saved.

In [51]:
pd.read_csv(project_folder + "/test_data.csv")

Unnamed: 0,images_id,label
0,fake_5234.jpg,0
1,fake_366.jpg,0
2,fake_4479.jpg,0
3,fake_2599.jpg,0
4,fake_577.jpg,0
...,...,...
253,real_4180.jpg,1
254,real_3319.jpg,1
255,real_4011.jpg,1
256,real_4527.jpg,1


##### STEP 5: Store the Files in Drive

1. Store the fake images and the real images separately as zip files

In [53]:
zip_files(real_images_path, project_folder + '/Test_RealImages.zip', whitelist=realSubset)
zip_files(fake_images_path, project_folder + '/Test_FakeImages.zip', whitelist=fakeSubset)

129 files from folder /root/.cache/kagglehub/datasets/manjilkarki/deepfake-and-real-images/versions/1/Dataset/Test/Real/ have been zipped as drive/MyDrive/CS4644_FinalProject/Test_RealImages.zip
129 files from folder /root/.cache/kagglehub/datasets/manjilkarki/deepfake-and-real-images/versions/1/Dataset/Test/Fake/ have been zipped as drive/MyDrive/CS4644_FinalProject/Test_FakeImages.zip
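
Since the label file and the two archives are written separately, a quick consistency check can confirm that every labeled image was zipped and that nothing unlabeled was included. The sketch below compares a list of labeled names against the members of one or more zip files. `find_label_mismatches` is a hypothetical helper, and the demo archives are built in a temporary folder rather than on the drive.

```python
import os
import tempfile
import zipfile


def archive_names(zip_path):
    """Base names of all files stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return {os.path.basename(n) for n in zf.namelist() if not n.endswith("/")}


def find_label_mismatches(labeled_ids, zip_paths):
    """Return (labeled but not archived, archived but not labeled)."""
    archived = set()
    for path in zip_paths:
        archived |= archive_names(path)
    labeled = set(labeled_ids)
    return sorted(labeled - archived), sorted(archived - labeled)


# Demo: one labeled file is missing from the archives, one archived file is unlabeled.
with tempfile.TemporaryDirectory() as tmp:
    real_zip = os.path.join(tmp, "real.zip")
    fake_zip = os.path.join(tmp, "fake.zip")
    with zipfile.ZipFile(real_zip, "w") as zf:
        zf.writestr("real_1.jpg", b"x")
    with zipfile.ZipFile(fake_zip, "w") as zf:
        zf.writestr("fake_1.jpg", b"x")
        zf.writestr("fake_2.jpg", b"x")
    missing, unlabeled = find_label_mismatches(
        ["real_1.jpg", "real_2.jpg", "fake_1.jpg"], [real_zip, fake_zip]
    )
    print(missing, unlabeled)
```

Both lists being empty would indicate that the `images_id` column and the archives agree.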


## Part 5: Test Load Datasets from Drive .zip Files

##### STEP 1: Load "Human Faces" Dataset

In [56]:
zip_to_colab(project_folder + '/HumanFacesCleaned.zip', "HumanFacesImages")
HUMANFACES_IMAGE_EXTENSIONS = {'.JPG', '.jpeg', '.jpg', '.png'}

Files from drive/MyDrive/CS4644_FinalProject/HumanFacesCleaned.zip extracted to: /content/HumanFacesImages/
Number of files extracted: 3273


##### STEP 2: Load "Fake-Vs-Real-Faces (Hard)" Dataset

In [54]:
zip_to_colab(project_folder + '/RealImages.zip', 'TrainValRealImages')
zip_to_colab(project_folder + '/FakeImages.zip', 'TrainValFakeImages')
FAKEVREAL_IMAGE_EXTENSIONS = {'.jpg'}

Files from drive/MyDrive/CS4644_FinalProject/RealImages.zip extracted to: /content/TrainValRealImages/
Number of files extracted: 589
Files from drive/MyDrive/CS4644_FinalProject/FakeImages.zip extracted to: /content/TrainValFakeImages/
Number of files extracted: 700


##### STEP 3: Load "deepfake and real images" Dataset

In [55]:
zip_to_colab(project_folder + '/Test_RealImages.zip', 'TestRealImages')
zip_to_colab(project_folder + '/Test_FakeImages.zip', 'TestFakeImages')
FAKEVREAL_IMAGE_EXTENSIONS = {'.jpg'}

Files from drive/MyDrive/CS4644_FinalProject/Test_RealImages.zip extracted to: /content/TestRealImages/
Number of files extracted: 129
Files from drive/MyDrive/CS4644_FinalProject/Test_FakeImages.zip extracted to: /content/TestFakeImages/
Number of files extracted: 129


##### STEP 4: Check files

In [57]:
os.listdir()

['.config',
 'TrainValFakeImages',
 'drive',
 'TestRealImages',
 'TestFakeImages',
 'Test_RealImages.zip',
 'TrainValRealImages',
 'HumanFacesImages',
 'Test_FakeImages.zip',
 'sample_data']

In [58]:
print("HumanFacesImages Extensions:", get_file_extensions('/content/HumanFacesImages'))
print("TrainValRealImages Extensions:", get_file_extensions('/content/TrainValRealImages'))
print("TrainValFakeImages Extensions:", get_file_extensions('/content/TrainValFakeImages'))
print("TestFakeImages Extensions:", get_file_extensions('/content/TestFakeImages'))
print("TestRealImages Extensions:", get_file_extensions('/content/TestRealImages'))

HumanFacesImages Extensions: {'.jpeg', '.jpg', '.png', '.JPG'}
TrainValRealImages Extensions: {'.jpg'}
TrainValFakeImages Extensions: {'.jpg'}
TestFakeImages Extensions: {'.jpg'}
TestRealImages Extensions: {'.jpg'}
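
The extension sets recorded above can be used to collect image paths from an extracted folder while skipping any stray files. The sketch below is one way to do that with a case-insensitive comparison. `list_images` is a hypothetical helper, and the demo folder contents are made up.

```python
import os
import tempfile


def list_images(folder, extensions):
    """Return paths under `folder` whose extension is in `extensions` (case-insensitive)."""
    allowed = {ext.lower() for ext in extensions}
    paths = []
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            if os.path.splitext(name)[1].lower() in allowed:
                paths.append(os.path.join(root, name))
    return paths


# Demo: two images and one non-image file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("HF_1.jpg", "HF_2.JPG", "notes.txt"):
        open(os.path.join(tmp, name), "wb").close()
    found = [os.path.basename(p) for p in list_images(tmp, {".jpg", ".png"})]
    print(found)
```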
