## Transferring images from the Postgres DB into the Digital Ocean Space

Edit: this approach ended up being throttled somehow (ISP issues?). I'm deprecating this notebook in favor of getting the images from a database dump in the next notebook.

Here's what we need to do to transfer over the raw image data so it's no longer in the database.

1. Determine that the size of the DO Space we've chosen will be big enough to last a long time. What is the current size of the images in the DB? How fast do we expect the space usage to grow, given reasonable assumptions about how many keywords we're using?
2. Get a list of all images we're going to query for
3. Update the API so that saving all new images will save to the DO Space rather than the database
4. Get the image binary data of an image, save it with the new API image save endpoint, then check that any query that includes that image metadata links to the new DO Space image URL
5. Run this for all images

### 1. size of Space

I'll check back on this after transferring all the images over from the DB.
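Once the images are in the Space, the size check could look something like the sketch below. It sums object sizes over `list_objects_v2`-style response pages. The function name `total_size_bytes` and the `images/` prefix in the commented usage are my own assumptions, not something the API or this notebook defines.

```python
def total_size_bytes(pages):
    """Sum object sizes over list_objects_v2-style response pages."""
    total = 0
    for page in pages:
        # a page with no matching keys has no 'Contents' entry
        for obj in page.get('Contents', []):
            total += obj['Size']
    return total

# Hypothetical usage with a boto3 S3 client like the one set up in step 5:
# paginator = client.get_paginator('list_objects_v2')
# pages = paginator.paginate(Bucket=config['bucket'], Prefix='images/')
# print(round(total_size_bytes(pages) / 1e9, 2), 'GB')
```

Dividing the total by the number of objects gives an average image size, which is probably the most useful number for estimating growth per keyword.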

### 2. list of all images to query

I have node running the server.js file in the api folder on my test server, which has the same database as prod.

In [1]:
import json
import os
import requests

# base = 'http://159.89.80.47'
base = 'http://api.firewallcafe.com'
folder = 'images'

# don't fail if the folder already exists from a previous run
os.makedirs(folder, exist_ok=True)

In [2]:
images = requests.get(base + '/images?page_size=200&page=2').json()
len(images)

200

In [3]:
images[-1]

{'image_id': 626520,
 'image_search_engine': 'baidu',
 'image_href': 'https://firewall-cafe-space.nyc3.digitaloceanspaces.com/images/hashed/de2ece85a092791e.jpg',
 'image_rank': '1',
 'image_mime_type': None,
 'wordpress_attachment_post_id': None,
 'wordpress_attachment_file_path': None}
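To walk every page rather than one, a small generator is enough. This is a sketch under two assumptions I haven't verified against the API: that pages start at 0 and that a page past the end comes back as an empty list. `iter_images` and `fetch_page` are names made up for illustration.

```python
def iter_images(fetch_page, page_size=200, start_page=0):
    """Yield image records page by page until an empty page comes back."""
    page = start_page
    while True:
        batch = fetch_page(page, page_size)
        if not batch:
            # assumption: an empty list means we are past the last page
            break
        yield from batch
        page += 1

# Hypothetical usage against the /images endpoint used above:
# def fetch_page(page, page_size):
#     url = f'{base}/images?page_size={page_size}&page={page}'
#     return requests.get(url).json()
# all_images = list(iter_images(fetch_page))
```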

### 3. update API to save images to DO

I have this running on a branch of my dovinmu fork of firewall cafe's codebase.

### ~~4. get binary and save to API~~ setup for putting images on DO

In [4]:
from PIL import Image

In [5]:
from io import BytesIO
# image data is binary, so it needs BytesIO rather than StringIO
# Image.open(BytesIO(img[0]['image_data']['data']))
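If the API returns the `image_data` column the way Node serializes a `Buffer` to JSON, that is, as `{'type': 'Buffer', 'data': [list of ints]}`, the list has to be turned into bytes before PIL can read it. I haven't confirmed that this is the shape of the response, so treat the sketch below as an assumption. `buffer_json_to_file` is a hypothetical helper name.

```python
import io

def buffer_json_to_file(image_data):
    """Turn a Node-style Buffer JSON object into an in-memory binary file."""
    # assumption: image_data looks like {'type': 'Buffer', 'data': [255, 216, ...]}
    raw = bytes(image_data['data'])
    return io.BytesIO(raw)

# Hypothetical usage with PIL:
# im = Image.open(buffer_json_to_file(img[0]['image_data']))
```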

Sidenote: I just discovered that Postman can generate code for a given request. My life could have been so much easier.

In [None]:
import random
import imagehash

def hash_image(img):
    # phash takes a PIL image; str() gives the hex digest, which is used as a filename
    return str(imagehash.phash(img))

def image_fname(url):
#     print(url)
    def reduce(char):
        return char if char not in set('.`\'"()[]{}\\;&%@,-=+$:/<>~ ?') else '_'
    if '://' in url: url = url.split('://')[1]
    if url[-4:] == '.jpg': url = url[:-4]
    url = ''.join([reduce(char) for char in url])
    if len(url) > 150:
        url = url[:150] + '__' + str(random.randint(100,1000))
    return url
image_fname("https://google.com/test-worked.jpg")
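One catch with `image_fname` is that the random suffix makes long URLs map to a different filename on every call, so a later lookup by URL can't find the file. A possible variant, sketched here, replaces the random number with a short hash of the URL. `stable_image_fname` is a name I made up, and it splits on the first `://` only, which differs slightly from the original for URLs that contain `://` more than once.

```python
import hashlib

# same characters that image_fname replaces with '_'
UNSAFE = set('.`\'"()[]{}\\;&%@,-=+$:/<>~ ?')

def stable_image_fname(url, max_len=150):
    """Like image_fname, but long URLs get a deterministic suffix."""
    if '://' in url:
        url = url.split('://', 1)[1]
    if url.endswith('.jpg'):
        url = url[:-4]
    name = ''.join('_' if ch in UNSAFE else ch for ch in url)
    if len(name) > max_len:
        # hash of the URL instead of a random number, so repeated calls agree
        digest = hashlib.sha1(url.encode('utf-8')).hexdigest()[:8]
        name = name[:max_len] + '__' + digest
    return name

print(stable_image_fname("https://google.com/test-worked.jpg"))
# → google_com_test_worked
```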

### 5. run for all images

Set up Digital Ocean

In [7]:
with open('digital-ocean-config.json') as f:
    config = json.loads(f.read())

import boto3
# import S3Transfer explicitly rather than relying on boto3.s3.transfer being loaded
from boto3.s3.transfer import S3Transfer

bucket_endpoint = f'https://{config["bucket"]}.{config["region"]}.digitaloceanspaces.com'
# Initialize a session using DigitalOcean Spaces.
session = boto3.session.Session()
client = session.client('s3',
    region_name = config['region'],
    endpoint_url = f'https://{config["region"]}.digitaloceanspaces.com',
    aws_access_key_id = config['access_key_id'],
    aws_secret_access_key = config['secret_access_key'])
transfer = S3Transfer(client)

def write_image_to_do(fname, new_fname):
    transfer.upload_file(fname, config['bucket'], new_fname)
    # make that file public
    r = client.put_object_acl(ACL='public-read', Bucket=config['bucket'], Key=new_fname)
    if r['ResponseMetadata']['HTTPStatusCode'] == 200:
        return bucket_endpoint + '/' + new_fname
    raise Exception("error from DO:", r)
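The upload and the ACL change can probably be folded into one request by passing `ExtraArgs` to `upload_file`, which boto3 supports for `ACL` and `ContentType`. I'm assuming here that Spaces honors the ACL on upload the same way S3 does. `upload_public` is a hypothetical name, and the client is passed in so the function doesn't depend on notebook globals.

```python
import mimetypes

def upload_public(client, bucket, fname, key):
    """Upload a file and make it publicly readable in a single call."""
    extra = {'ACL': 'public-read'}
    # set the content type so browsers display the image instead of downloading it
    content_type, _ = mimetypes.guess_type(fname)
    if content_type:
        extra['ContentType'] = content_type
    client.upload_file(fname, bucket, key, ExtraArgs=extra)
    return key

# Hypothetical usage with the client configured above:
# key = upload_public(client, config['bucket'], 'images/abc.jpg', 'images/hashed/abc.jpg')
# url = bucket_endpoint + '/' + key
```

This halves the number of requests per image, which matters when running over the whole table.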

In [8]:
import imghdr
import os

In [9]:
count = 0
failures = {}

This is insanely inefficient code which gets the images from their original URLs and puts them on DO, but only needs to be run once.

In [10]:
import queue
import threading
import time
import io

class Storage:
    def __init__(self):
        self.start = time.time()
        self._lock = threading.Lock()
        self.failures = {}
        self.errors = {}
        self.count = 0
        
    def add_failure(self, image_id, url, e):
        with self._lock:
            self.failures[image_id] = url
            self.errors[image_id] = e

    def increment_count(self):
        with self._lock:
            self.count += 1
    
    def print_status(self):
        with self._lock:
            elapsed = time.time() - self.start
            n = max(self.count, 1)  # avoid dividing by zero before anything is processed
            print(f"processing {self.count} took {round(elapsed/60, 1)} minutes. "
                  f"{round(100*(len(self.failures)/n), 1)}% failure rate & {len(self.failures)} failures, {round(elapsed/n, 1)} s per")

storage = Storage()
def worker():
    while True:
        image_id, url = q.get()
        try:
            if url is None:
                # sentinel: no more work for this thread
                print('x', end=' ')
                break

            storage.increment_count()
            # previously: fname = image_fname(url)

            # download the image and hash it to get a filename
            try:
                r = requests.get(url, timeout=30)  # timeout so a dead host can't hang the thread
                im = Image.open(io.BytesIO(r.content))
                fname = hash_image(im)
            except Exception as e:
                storage.add_failure(image_id, url, e)
                continue

            # save to the local folder as .jpg, falling back to the image's own format
            try:
                im.save(folder + '/' + fname + '.jpg')
            except Exception:
                try:
                    im.save(folder + '/' + fname + '.' + im.format.lower())
                except Exception as e:
                    storage.add_failure(image_id, url, e)
            if random.randint(0, 100) == 42:
                storage.print_status()
        finally:
            # always mark the item done, including on failure and on the sentinel
            q.task_done()
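The same fan-out could also be written with `concurrent.futures`, which takes care of the queue, the sentinels, and the joins. This is a sketch rather than what the notebook runs. `run_all` and `process` are hypothetical names, and `process` stands in for the download-hash-save steps of the worker.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(jobs, process, max_workers=50):
    """Run process(image_id, url) for every job and collect the failures.

    jobs is an iterable of (image_id, url) pairs.
    """
    failures = {}
    done = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process, image_id, url): (image_id, url)
                   for image_id, url in jobs}
        for fut in as_completed(futures):
            image_id, url = futures[fut]
            try:
                fut.result()  # re-raises any exception from process
                done += 1
            except Exception as e:
                failures[image_id] = (url, e)
    return done, failures
```

Because results are collected in the main thread, no lock is needed around the counters.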

Cache the /images responses so we don't hammer that poor DO Droplet.

In [11]:
# image_pages = []
# for page_num in range(0, 201):
#     ts = time.time()
#     j = requests.get(base + '/images?page_size=1000&page=' + str(page_num)).json()
#     image_pages.append(j)
#     print(page_num, round(time.time()-ts), len(j))


In [12]:
# with open('image_pages.json', 'w') as f:
#     f.write(json.dumps(image_pages))

In [13]:
with open('image_pages.json', 'r') as f:
    image_pages = json.loads(f.read())
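The three cells above (fetch, write, read) can be collapsed into one helper so there's nothing to comment in and out between runs. A minimal sketch, where `load_or_fetch` is a name I'm introducing:

```python
import json
import os

def load_or_fetch(path, fetch):
    """Return the JSON cached at path, calling fetch() only if the file is missing."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    data = fetch()
    with open(path, 'w') as f:
        json.dump(data, f)
    return data

# Hypothetical usage, mirroring the commented-out loop above:
# def fetch_all_pages():
#     return [requests.get(base + '/images?page_size=1000&page=' + str(n)).json()
#             for n in range(0, 201)]
# image_pages = load_or_fetch('image_pages.json', fetch_all_pages)
```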

In [None]:
num_worker_threads = 50

# set up threading and queueing infra
q = queue.Queue()
threads = []
for i in range(num_worker_threads):
    t = threading.Thread(target=worker)
    t.start()
    threads.append(t)

# iterating over all images in DB
for j in image_pages:
#     print("skipping image downloading")
#     break
    for img in j:
        count += 1
        # make filename from url
        if img['image_href'] is not None:
            url = img['image_href']
        elif img['wordpress_attachment_file_path'] is not None:
            url = 'http://firewallcafe.com' + img['wordpress_attachment_file_path']
        else:
            continue
        # put the url into the queue for the workers
        q.put((img['image_id'], url))
    # stops after the first page (test run); remove this break to process every page
    break
# upload to DO manually
# url = write_image_to_do('temp/img'+ext, 'images/url_based/'+fname+ext)
print("\ndone with /image queries\ntelling threads to stop")
for i in range(num_worker_threads):
    q.put((None,None))
print('waiting')
for t in threads:
    t.join()


done with /image queries
telling threads to stop
waiting
processing 72 took 0.2 minutes. 1.4% failure rate & 1 failures, 0.2 s per
processing 117 took 0.3 minutes. 6.0% failure rate & 7 failures, 0.1 s per
processing 296 took 0.6 minutes. 10.8% failure rate & 32 failures, 0.1 s per
processing 446 took 1.1 minutes. 11.9% failure rate & 53 failures, 0.1 s per
processing 701 took 1.5 minutes. 11.3% failure rate & 79 failures, 0.1 s per
processing 740 took 1.6 minutes. 11.5% failure rate & 85 failures, 0.1 s per
processing 769 took 1.6 minutes. 11.3% failure rate & 87 failures, 0.1 s per
processing 782 took 1.7 minutes. 11.4% failure rate & 89 failures, 0.1 s per
x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 

In [None]:
storage.print_status()

In [None]:
print("ok")

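Check all image URLs in the list against their corresponding filenames in the folder. This check assumes the files were named with `image_fname`; anything the worker saved under a perceptual-hash name will show up as missing.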

In [None]:
image_pages[0][0]

In [None]:
fn = image_fname(image_pages[10][5]['image_href'])
fn

In [None]:
images = set(os.listdir(folder))

In [None]:
f'{fn}.jpg' in images

In [None]:
found_count = 0
missing = []
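for image_page in image_pages:
    for image in image_page:
        # some rows only have a WordPress path, so image_href can be None
        if image['image_href'] is None:
            missing.append(image)
            continue
        fn = image_fname(image['image_href'])
        if f'{fn}.jpg' in images:
            found_count += 1
        else:
            missing.append(image)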

In [None]:
non_jpgs = [fname for fname in images if fname[-4:] != '.jpg']

In [None]:
len(non_jpgs)/len(images)

In [None]:
print(round(100*found_count/(found_count+len(missing)), 2), '% found')

In [None]:
found_count

In [None]:
len(missing)