<a href="https://colab.research.google.com/github/KaifAhmad1/code-test/blob/main/Deepfake_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### **Deepfake and Manipulated Media Analysis Data Download**

In [1]:
!pip install -qU kaggle pandas requests tqdm

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 2.2.3 which is incompatible.

In [11]:
import os
import requests
from pathlib import Path
from tqdm import tqdm

In [12]:
def download_file(url, dest_path):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
    }
    try:
        # A timeout keeps one unresponsive host from stalling the whole run
        response = requests.get(url, stream=True, headers=headers, timeout=30)
        response.raise_for_status()

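        # Handle dynamic filenames for certain sources.
        # Every request to this site uses the same URL, so hash(url) yields the
        # same name each time and later downloads overwrite earlier ones.
        # Random bytes give each response its own filename.
        if "thispersondoesnotexist" in url:
            filename = f"generated_face_{os.urandom(8).hex()}.jpg"
            dest_path = dest_path.parent / filename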

        total_size = int(response.headers.get('content-length', 0))

        dest_path.parent.mkdir(parents=True, exist_ok=True)

        with open(dest_path, 'wb') as f, tqdm(
            desc=f"Downloading {dest_path.name}",
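            total=total_size or None,  # size is unknown when Content-Length is missing
            unit='iB',
            unit_scale=True,
            unit_divisor=1024,  # binary prefixes (KiB, MiB) to match the 'iB' unit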
        ) as pbar:
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
                    pbar.update(len(chunk))
        return True
    except (requests.RequestException, OSError) as e:
        # Network/HTTP failures and local file errors are reported, not raised
        print(f"Error downloading {url}: {e}")
        return False

def download_multimodal_subset():
    base_dir = Path("./multimodal_deepfake_data")

    # Candidate sample URLs. Several returned 404 or 403 in the run recorded
    # below, so re-check them before relying on this list.
    datasets = {
        "images": {
            "real": [
                # From Wikimedia Commons (check each file page for its license)
                "https://upload.wikimedia.org/wikipedia/commons/6/6e/Thomas_Edison_1880.jpg",
                "https://upload.wikimedia.org/wikipedia/commons/b/b4/Steve_Jobs_1976_crop.jpg",
            ],
            "fake": [
                # ThisPersonDoesNotExist with proper headers
                "https://thispersondoesnotexist.com" for _ in range(5)
            ]
        },
        "videos": {
            "real": [
                # From Wikimedia Commons sample videos
                "https://upload.wikimedia.org/wikipedia/commons/transcoded/c/c0/Big_Buck_Bunny_4K.webm/Big_Buck_Bunny_4K.webm.360p.vp9.webm",
            ],
            "fake": [
                # Intended DFDC sample videos (these URLs returned 404 in the recorded run)
                "https://github.com/microsoft/DFD/raw/master/resources/release_samples/dfdc_fake_00.mp4",
                "https://github.com/microsoft/DFD/raw/master/resources/release_samples/dfdc_fake_01.mp4",
            ]
        },
        "audio": {
            "real": [
                # From Common Voice dataset
                "https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-15.0-2024-02-05/en/clips/common_voice_en_38308318.mp3",
            ],
            "fake": [
                # Synthetic audio samples from ESPNet
                "https://github.com/espnet/espnet/raw/master/egs2/ljspeech/tts1/audio.wav",
            ]
        }
    }

    results = {"images": 0, "videos": 0, "audio": 0}

    for modality, categories in datasets.items():
        print(f"\n{'='*40}\nDownloading {modality.upper()} samples\n{'='*40}")
        for category, urls in categories.items():
            print(f"\n{category.capitalize()} samples:")
            modality_dir = base_dir / modality / category

            for url in urls:
                filename = url.split("/")[-1].split("?")[0]  # Clean URL parameters
                dest_path = modality_dir / filename
                if download_file(url, dest_path):
                    results[modality] += 1

    print("\nFinal Report:")
    print(f"Images downloaded: {results['images']}")
    print(f"Videos downloaded: {results['videos']}")
    print(f"Audio downloaded: {results['audio']}")
    print(f"Total dataset size: {sum(results.values())} files")
    print(f"Data location: {base_dir.absolute()}")
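
The download loop derives each filename with `url.split("/")[-1]`, which returns the host name for a bare URL such as `https://thispersondoesnotexist.com` and the same name for every repeat of one URL. Below is a minimal sketch of one alternative. The helper `filename_from_url` and its `index`, `default_stem`, and `default_suffix` parameters are hypothetical and not part of the notebook. It parses the URL with the standard library and falls back to an indexed name when the path has no file-like last segment.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, index, default_stem="sample", default_suffix=""):
    """Derive a local filename from a URL (hypothetical helper).

    The query string and fragment are ignored. If the last path segment is
    empty or has no extension, return '<default_stem>_<index><default_suffix>'
    so that repeated requests to the same URL get distinct names.
    """
    path = unquote(urlsplit(url).path)
    name = PurePosixPath(path).name
    if not name or "." not in name:
        return f"{default_stem}_{index:03d}{default_suffix}"
    return name


if __name__ == "__main__":
    # Five requests to the same bare-host URL get five different names.
    for i in range(5):
        print(filename_from_url("https://thispersondoesnotexist.com", i,
                                default_stem="generated_face",
                                default_suffix=".jpg"))
```

The caller passes the position of the URL in its list as `index`, for example via `enumerate(urls)`, so five requests to the same address produce `generated_face_000.jpg` through `generated_face_004.jpg` instead of one file that is overwritten.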

In [13]:
if __name__ == "__main__":
    download_multimodal_subset()


Downloading IMAGES samples

Real samples:
Error downloading https://upload.wikimedia.org/wikipedia/commons/6/6e/Thomas_Edison_1880.jpg: 404 Client Error: Not Found for url: https://upload.wikimedia.org/wikipedia/commons/6/6e/Thomas_Edison_1880.jpg
Error downloading https://upload.wikimedia.org/wikipedia/commons/b/b4/Steve_Jobs_1976_crop.jpg: 404 Client Error: Not Found for url: https://upload.wikimedia.org/wikipedia/commons/b/b4/Steve_Jobs_1976_crop.jpg

Fake samples:


Downloading generated_face_-5473839726246319828.jpg: 100%|██████████| 523k/523k [00:00<00:00, 1.02MiB/s]
Downloading generated_face_-5473839726246319828.jpg: 100%|██████████| 518k/518k [00:00<00:00, 978kiB/s] 
Downloading generated_face_-5473839726246319828.jpg: 100%|██████████| 512k/512k [00:00<00:00, 940kiB/s] 
Downloading generated_face_-5473839726246319828.jpg: 100%|██████████| 635k/635k [00:00<00:00, 1.12MiB/s]
Downloading generated_face_-5473839726246319828.jpg: 100%|██████████| 552k/552k [00:00<00:00, 1.04MiB/s]



Downloading VIDEOS samples

Real samples:


Downloading Big_Buck_Bunny_4K.webm.360p.vp9.webm: 100%|██████████| 57.0M/57.0M [00:02<00:00, 27.3MiB/s]



Fake samples:
Error downloading https://github.com/microsoft/DFD/raw/master/resources/release_samples/dfdc_fake_00.mp4: 404 Client Error: Not Found for url: https://github.com/microsoft/DFD/raw/master/resources/release_samples/dfdc_fake_00.mp4
Error downloading https://github.com/microsoft/DFD/raw/master/resources/release_samples/dfdc_fake_01.mp4: 404 Client Error: Not Found for url: https://github.com/microsoft/DFD/raw/master/resources/release_samples/dfdc_fake_01.mp4

Downloading AUDIO samples

Real samples:
Error downloading https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-15.0-2024-02-05/en/clips/common_voice_en_38308318.mp3: 403 Client Error: Forbidden for url: https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-15.0-2024-02-05/en/clips/common_voice_en_38308318.mp3

Fake samples:
Error downloading https://github.com/espnet/espnet/raw/master/egs2/ljspeech/tts1/audio.wav: 404 Client Error: Not Found for url: https