# Deepfake Unmask - Data Collection
## Matthew Reed

The intent of this project is to create a home-grown neural network that can outperform humans ([reference](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8872790/)) in identifying computer-generated images of people.

This notebook handles the collection of similar synthetic and real images, and their subsequent processing for use in a neural-net model.

In [1]:
import requests
import cv2

from bs4 import BeautifulSoup
import re
import os
import time
import datetime
import pandas as pd
import numpy as np

#### Establishing path to be scraped for synthetic images, and generating directory if it does not exist

In [2]:
url = 'https://this-person-does-not-exist.com'

path = './data/generated/High Res/'

# Create the output directory (and any missing parents) if it does not exist
os.makedirs(path, exist_ok=True)

#### Demonstrate retrieval of full resolution images

In [17]:
time_0 = time.time()

retrieve_n = 50
retrieved = 0

for image in range(retrieve_n):
    
    res = requests.get(url)
    good_status = res.status_code == 200
    wait_cycle = 0
    failed_connection = False
    
    while not good_status:
        wait_cycle += 1
        print('Site did not respond. Retrying...')
        if wait_cycle > 5:
            # Adjacent f-strings avoid embedding the line-continuation whitespace in the message
            print(f'Could not reach site. Terminating request. '
                  f'Loop collected {retrieved} out of {retrieve_n} image(s) prior to lost connection.')
            failed_connection = True
            break
        time.sleep(3)
        res = requests.get(url)
        good_status = res.status_code == 200

    if failed_connection:
        break

    soup = BeautifulSoup(res.text, 'lxml')

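    # Create the output directory (and any missing parents) if needed
    os.makedirs(path, exist_ok=True)
    
    image_json = soup.find('img', {'id': 'avatar'})
    image_link = url + image_json['src']
    # Raw string with an escaped dot, so only a literal ".jpg" matches
    image_name = re.findall(r"avatar\S+\.jpg", image_link)[0]
    # os.path.join avoids a doubled slash, since path already ends with '/'
    image_path = os.path.join(path, image_name)
    image_res = requests.get(image_link)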

    if image_res.status_code != 200:
        continue

    # The context manager closes the file even if the write fails
    with open(image_path, 'wb') as file:
        file.write(image_res.content)
    
    retrieved += 1
    
    time.sleep(1)

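    # Report progress roughly every 20%; max() guards against a zero divisor when retrieve_n < 5
    if (image + 1) % max(1, retrieve_n // 5) == 0: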
        print(f'At {round(((image + 1) / retrieve_n) * 100, 1)}% Completion: Retrieved {retrieved} of targeted {retrieve_n} images.')
        print(f'Current Execution Time: {round((time.time() - time_0), 0)} seconds. \n')

print(f'Final Results: Retrieved {retrieved} of targeted {retrieve_n} images.')
print(f'Current Execution Time: {round((time.time() - time_0), 0)} seconds.')

At 20.0% Completion: Retrieved 10 of targeted 50 images.
Current Execution Time: 40.0 seconds. 

At 40.0% Completion: Retrieved 20 of targeted 50 images.
Current Execution Time: 74.0 seconds. 

At 60.0% Completion: Retrieved 30 of targeted 50 images.
Current Execution Time: 105.0 seconds. 

At 80.0% Completion: Retrieved 40 of targeted 50 images.
Current Execution Time: 144.0 seconds. 

At 100.0% Completion: Retrieved 50 of targeted 50 images.
Current Execution Time: 180.0 seconds. 

Final Results: Retrieved 50 of targeted 50 images.
Current Execution Time: 180.0 seconds.
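
The retry-and-wait pattern in the loop above could be factored into a small helper, so the same logic need not be repeated for every request. The following is a minimal sketch rather than the notebook's actual code: `fetch_with_retry`, its parameters, and the injected `fetch` and `sleep` callables are hypothetical names chosen for illustration. In this notebook, `fetch` would be something like `lambda: requests.get(url)`.

```python
import time


def fetch_with_retry(fetch, max_retries=5, wait_seconds=3, sleep=time.sleep):
    """Call fetch() until it returns a response with status_code 200.

    Hypothetical helper: `fetch` is any zero-argument callable returning an
    object with a `status_code` attribute (for example a requests.Response).
    Returns the successful response, or None once max_retries retries fail.
    """
    response = fetch()
    retries = 0
    while response.status_code != 200:
        if retries >= max_retries:
            return None
        retries += 1
        sleep(wait_seconds)
        response = fetch()
    return response


class FakeResponse:
    """Stand-in for a real response so the sketch runs without network access."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate a site that fails twice and then responds normally.
statuses = iter([503, 503, 200])
result = fetch_with_retry(lambda: FakeResponse(next(statuses)), sleep=lambda s: None)
print(result.status_code)
```

Passing `sleep` as a parameter keeps the waiting behaviour out of the way when trying the helper on simulated responses like the ones above.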


#### Select a single harvested image for inspection and exploration
(If running this yourself, update the path to point to a file in your own directory.)

In [3]:
img1 = cv2.imread('./data/generated/High Res/avatar-008390911eb28a25b6dcbbea9cb607ac.jpg', cv2.IMREAD_UNCHANGED)
img1.shape

(1024, 1024, 3)

#### Resizing image

In [5]:
width = 252
height = 252
dim = (width, height)
resized = cv2.resize(img1, dim, interpolation = cv2.INTER_AREA)

#### Image Preview

In [10]:
cv2.imshow('Image 1', resized)
cv2.waitKey(0)
cv2.destroyAllWindows()

#### Pulling in a fresh image from the website and resizing, then viewing the resulting image
#### !!IMPORTANT!! Do not 'x' out of image popup; with the popup window selected, press any key to close.

In [None]:
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')

image_json = soup.find('img', {'id': 'avatar'})
image_link = url + image_json['src']
image_name = re.findall(r"avatar\S+\.jpg", image_link)[0]
image_path = os.path.join(path, image_name)
image_res = requests.get(image_link)

# Decode the downloaded bytes directly into a BGR image array
image = np.frombuffer(image_res.content, dtype="uint8")
image = cv2.imdecode(image, cv2.IMREAD_COLOR)

width = 256
height = 256
dim = (width, height)
resized = cv2.resize(image, dim, interpolation = cv2.INTER_AREA)

cv2.imshow('Image 1', resized)
cv2.waitKey(0)
cv2.destroyAllWindows()
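
The link and file name handling above can also be done with the standard library's URL tools instead of string concatenation and a regular expression. This is a sketch under assumptions: the `src` value below is an invented example of the form the notebook's regex expects (`avatar-<hash>.jpg`), and `build_image_link` is a hypothetical helper name.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def build_image_link(base_url, src):
    """Return (absolute_link, file_name) for an <img> src attribute.

    Hypothetical helper: urljoin resolves the src value against base_url,
    and the file name is taken as the last segment of the URL path.
    """
    link = urljoin(base_url + '/', src)
    file_name = posixpath.basename(urlsplit(link).path)
    return link, file_name


# The src value here is an invented example, not a real file on the site.
link, name = build_image_link(
    'https://this-person-does-not-exist.com',
    '/img/avatar-0123456789abcdef.jpg',
)
print(link)
print(name)
```

Taking the name from the parsed URL path means a query string on the image link would not end up in the saved file name.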

#### See the difference in downscaling results based on interpolation type

In [11]:
res = requests.get(url)
good_status = res.status_code == 200
wait_cycle = 0
failed_connection = False

while not good_status:
    wait_cycle += 1
    print('Site did not respond. Retrying...')
    if wait_cycle > 5:
        print('Could not reach site. Terminating request.')
        failed_connection = True
        break
    time.sleep(3)
    res = requests.get(url)
    good_status = res.status_code == 200

# Stop here rather than parsing an error page if the site never responded
if failed_connection:
    raise ConnectionError(f'No valid response from {url}')

soup = BeautifulSoup(res.text, 'lxml')

image_json = soup.find('img', {'id': 'avatar'})
image_link = url + image_json['src']
image_name = re.findall(r"avatar\S+\.jpg", image_link)[0]
image_path = os.path.join(path, image_name)
image_res = requests.get(image_link)

if image_res.status_code != 200:
    raise ConnectionError(f'Could not download {image_link}')

# Decode the downloaded bytes directly into a BGR image array
image = np.frombuffer(image_res.content, dtype="uint8")
image = cv2.imdecode(image, cv2.IMREAD_COLOR)

width = 128
height = 128
dim = (width, height)

# Downscale the same source image with each interpolation method
resized_nearest = cv2.resize(image, dim, interpolation = cv2.INTER_NEAREST)
resized_linear = cv2.resize(image, dim, interpolation = cv2.INTER_LINEAR)
resized_area = cv2.resize(image, dim, interpolation = cv2.INTER_AREA)
resized_cubic = cv2.resize(image, dim, interpolation = cv2.INTER_CUBIC)
resized_lanczos = cv2.resize(image, dim, interpolation = cv2.INTER_LANCZOS4)

horizontal = np.concatenate((resized_nearest, resized_linear, resized_area, resized_cubic, resized_lanczos), axis=1)

cv2.imshow('Image 1', horizontal)
cv2.waitKey(0)
cv2.destroyAllWindows() # !!IMPORTANT!! Do not 'x' out of image popup; press any key to close. Will break jupyter!

#### Function for quickly creating new paths without destroying existing paths

In [5]:
def generate_path(path_string):
    """Create path_string (and any missing parents) if needed, then return it."""
    
    os.makedirs(path_string, exist_ok=True)

    return path_string

#### Low-res (128 x 128px) image collection

In [6]:
def collect_low_res(path, retrieve_n=500, interp='INTER_AREA'):
    
    time_0 = time.time()

    retrieved = 0

    for image in range(retrieve_n):

        res = requests.get(url)
        good_status = res.status_code == 200
        wait_cycle = 0
        failed_connection = False

        while not good_status:
            wait_cycle += 1
            print('Site did not respond. Retrying...')
            if wait_cycle > 5:
                # Adjacent f-strings avoid embedding the line-continuation whitespace in the message
                print(f'Could not reach site. Terminating request. '
                      f'Loop collected {retrieved} out of {retrieve_n} image(s) prior to lost connection.')
                failed_connection = True
                break
            time.sleep(3)
            res = requests.get(url)
            good_status = res.status_code == 200

        if failed_connection:
            break

        soup = BeautifulSoup(res.text, 'lxml')

        # Create the output directory (and any missing parents) if needed
        os.makedirs(path, exist_ok=True)

        image_json = soup.find('img', {'id': 'avatar'})
        image_link = url + image_json['src']
        image_name = re.findall(r"avatar\S+\.jpg", image_link)[0]
        image_path = os.path.join(path, image_name)
        image_res = requests.get(image_link)

        if image_res.status_code != 200:
            print('Failed to collect image')
            continue

        # Decode the downloaded bytes into a BGR image array
        image = np.frombuffer(image_res.content, dtype="uint8")
        image = cv2.imdecode(image, cv2.IMREAD_COLOR)

        # cv2.imdecode returns None when the bytes are not a valid image
        if image is None or image.size == 0:
            print('Failed to collect image')
            continue

        width = 128
        height = 128
        dim = (width, height)
        resized = cv2.resize(image, dim, interpolation = getattr(cv2, interp))

        cv2.imwrite(image_path, resized)

        time.sleep(3)
        retrieved += 1

    print(f'Retrieved {retrieved} of targeted {retrieve_n} images.')
    print(f'Execution Time: {round((time.time() - time_0), 0)} seconds')
    
    return None

In [7]:
path_low_res = generate_path('./data/generated/Low Res/')

In [10]:
collect_low_res(path_low_res, retrieve_n=301, interp='INTER_AREA')

Retrieved 301 of targeted 301 images.
Execution Time: 1414.0 seconds


## Experiment: Test whether preprocessing/interpolation of the synthetic images is 'giving away' the synthetics in the model

The 128 x 128px data set from [Kaggle](https://www.kaggle.com/datasets/dullaz/flickrfaces-dataset-nvidia-128x128), while very convenient and deeply appreciated, does not specify how the full resolution images from the FFHQ database were downconverted. This poses a problem because I downconvert the raw 1024 x 1024px synthetic images upon retrieval, and must therefore select an interpolation method of my own. If the downconversion methods used for the real and synthetic images differ, each process may leave its own artifacts in the images, and the neural net could learn to distinguish the two classes from those preprocessing artifacts rather than from the content of the images. Here, I bring in a small set of full resolution real images ([from here](https://drive.google.com/drive/folders/1tZUcXDBeOibC6jcMCtgRRz67pzrAHeHL)) and downconvert them with the same interpolation methods that I applied to the raw synthetic images, so that a comparison model can be built and relative performance can be evaluated.
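
To make the concern concrete, the toy sketch below downsamples the same tiny grayscale grid in two ways: by keeping one pixel per block (the idea behind nearest-neighbour sampling) and by averaging each block (the idea behind area interpolation). It is plain Python written for illustration, not OpenCV's actual implementation, and the function names are hypothetical. The point is only that the two methods produce different pixel values from identical input, which is the kind of systematic difference a classifier could pick up on.

```python
def downscale_nearest(pixels, factor):
    """Keep the top-left pixel of every factor x factor block."""
    return [row[::factor] for row in pixels[::factor]]


def downscale_area(pixels, factor):
    """Replace every factor x factor block with the mean of its pixels."""
    out = []
    for r in range(0, len(pixels), factor):
        out_row = []
        for c in range(0, len(pixels[0]), factor):
            block = [pixels[r + i][c + j] for i in range(factor) for j in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# A 4 x 4 checkerboard of black (0) and white (255) pixels.
checkerboard = [
    [0, 255, 0, 255],
    [255, 0, 255, 0],
    [0, 255, 0, 255],
    [255, 0, 255, 0],
]

nearest = downscale_nearest(checkerboard, 2)
area = downscale_area(checkerboard, 2)
print(nearest)  # [[0, 0], [0, 0]]
print(area)     # [[127.5, 127.5], [127.5, 127.5]]
```

On this extreme input, one method yields a solid black image and the other a uniform grey one. Real photographs differ far less dramatically, but the differences are still systematic.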

#### Downconvert full resolution real images

In [2]:
def downscale_imgs(full_res_dir, destination_dir, interp, width=128, height=128):
    
    # Each interpolation method gets its own subdirectory, e.g. .../INTER_AREA/
    destination_dir = os.path.join(destination_dir, interp)
    os.makedirs(destination_dir, exist_ok=True)

    dim = (width, height)

    for image_name in os.listdir(full_res_dir):

        image_path = os.path.join(full_res_dir, image_name)
        image = cv2.imread(image_path)

        # cv2.imread returns None for anything it cannot decode (e.g. non-image files)
        if image is None:
            continue

        resized = cv2.resize(image, dim, interpolation = getattr(cv2, interp))
        cv2.imwrite(os.path.join(destination_dir, interp + '_' + image_name), resized)

In [50]:
full_res_dir = './data/real/FFHQ-high-res/69000/'
destination_dir = './data/real/FFHQ-high-res/'

In [51]:
downscale_imgs(full_res_dir, destination_dir, 'INTER_AREA')

In [52]:
downscale_imgs(full_res_dir, destination_dir, 'INTER_LANCZOS4')

In [3]:
full_res_dir = './data/real/FFHQ-high-res/68000/'
destination_dir = './data/real/FFHQ-high-res/'

In [4]:
downscale_imgs(full_res_dir, destination_dir, 'INTER_AREA')
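
Before running a batch like the ones above, it can help to preview which files will be read and where the results will be written. The sketch below mirrors the naming scheme of `downscale_imgs` (`<destination>/<interp>/<interp>_<file name>`) but only builds the path pairs. The helper name `plan_downscale`, the extension filter, and the file names in the demonstration are assumptions added for illustration.

```python
import os
import tempfile

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')  # assumed set of image types


def plan_downscale(full_res_dir, destination_dir, interp):
    """Return sorted (source, target) path pairs without touching any images."""
    target_dir = os.path.join(destination_dir, interp)
    pairs = []
    for name in sorted(os.listdir(full_res_dir)):
        if name.lower().endswith(IMAGE_EXTENSIONS):
            source = os.path.join(full_res_dir, name)
            target = os.path.join(target_dir, interp + '_' + name)
            pairs.append((source, target))
    return pairs


# Demonstrate on a throwaway directory containing two images and one stray file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('69000.png', '69001.png', 'notes.txt'):
        open(os.path.join(tmp, name), 'wb').close()
    plan = plan_downscale(tmp, tmp, 'INTER_AREA')

planned_names = [os.path.basename(target) for _, target in plan]
print(planned_names)  # ['INTER_AREA_69000.png', 'INTER_AREA_69001.png']
```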