## Multithreaded prototype

Rewriting the functions that take a list of image URLs and save them into the Firewall data systems would solve several issues. First, it can substantially cut the time a run takes to complete. Second, we can (try to) fix Google images only being saved as thumbnails. Third, we can address the image requests that sometimes hang. Finally, we can check whether an image [already exists](https://developers.digitalocean.com/documentation/spaces/#get-object-info) before uploading it (ideally we would also check before downloading it, but that would require more code).
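A minimal sketch of the existence check, assuming each stored image is readable at a known public URL. The name `object_exists` and its `opener` parameter are hypothetical, and the sketch uses the standard library rather than the Spaces API client. A HEAD request transfers only headers, and the timeout keeps a stalled connection from blocking the run.

```python
import urllib.error
import urllib.request


def object_exists(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if a HEAD request to `url` gets a 2xx response.

    `opener` is injectable (hypothetical parameter) so the check can be
    exercised without network access.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError (e.g. 404), and socket timeouts all derive
        # from OSError; treat any of them as "not confirmed present".
        return False
```

Treating errors as "not present" means a flaky network causes a redundant upload rather than a skipped image, which seems like the safer failure mode here.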

Proposal: have an image_handler that black-boxes the multithreading (or subprocesses). Image URLs go in and the new data lake URLs come out, in a way that can still be associated with the term, search engine, and ordering.
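One way the proposal could look, sketched with a thread pool. The job dictionary shape (`term`, `engine`, `rank`, `url`) and the `fetch_and_store` callable are assumptions for illustration, not an existing interface; `fetch_and_store` stands in for whatever downloads one image and returns its data lake URL.

```python
from concurrent.futures import ThreadPoolExecutor


def image_handler(jobs, fetch_and_store, max_workers=8):
    """Process image jobs concurrently, keeping each result tied to its job.

    jobs: list of dicts with 'term', 'engine', 'rank', and 'url' keys
        (assumed shape).
    fetch_and_store: callable taking a URL and returning the data lake URL,
        or None on failure (hypothetical interface).
    """
    def run(job):
        try:
            lake_url = fetch_and_store(job["url"])
        except Exception as e:
            # One failed image should not abort the rest of the batch.
            print(job["url"], e)
            lake_url = None
        # Copy the job so term, engine, and rank travel with the result.
        return {**job, "lake_url": lake_url}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so the ordering is preserved.
        return list(pool.map(run, jobs))
```

Note that request_and_write_image as written below stores every download in a single file named 'temp', so it would need a per-call temporary file before it could safely run in several threads at once.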

### Google images only saving as thumbnails

In [7]:
from bs4 import BeautifulSoup
import ipyplot
import json
import requests
import time

In [8]:
def request_and_write_image(url, spaces_folder):
    """Download one image, upload it to the data lake, and return its filename (None on failure)."""
    print("getting image:", url, end=' ')
    try:
        # A timeout keeps a stalled connection from hanging the whole run.
        r = requests.get(url, stream=True, timeout=10)
    except Exception as e:
        print(url, e)
        return
    print(r.status_code)
    if not r.ok:
        return
    # write locally
    with open('temp', 'wb') as f:
        for block in r.iter_content(1024):
            if not block:
                break
            f.write(block)
    # image_fname and _write_public are helpers not shown in this section
    spaces_fname = image_fname('temp')
    print("uploading", spaces_fname, "to data lake", end=' ')
    status = _write_public('temp', f'{spaces_folder}/{spaces_fname}')
    print(status)
    if status < 400:
        return spaces_fname
    
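def query_google(term):
    """Return the absolute image URLs found on a Google image search page for term."""
    google_template = 'https://www.google.com/search?q={}&tbm=isch'
    r = requests.get(google_template.format(term))
    soup = BeautifulSoup(r.text, features="html.parser")
    # Some <img> tags have no src attribute, so guard against None before testing the prefix.
    urls = [tag.get('src') for tag in soup.find_all('img') if (tag.get('src') or '').startswith('http')]
    return urls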

In [9]:
term = "apricot"

In [10]:
urls = query_google(term)

In [13]:
ipyplot.plot_images(urls[:5])

In [19]:
'imgurl' in r.text

False

In [17]:
soup.find_all('a')

[<a class="lXLRf" href="/?output=images&amp;ie=UTF-8&amp;tbm=isch&amp;sa=X&amp;ved=0ahUKEwiZxPewrp3vAhWSsJ4KHcAmAFUQPAgC"><img alt="Google" class="kgJEQe" src="/images/branding/searchlogo/1x/googlelogo_desk_heirloom_color_150x55dp.gif"/></a>,
 <a class="CsQyDc" href="/search?q=apricot&amp;ie=UTF-8&amp;source=lnms&amp;sa=X&amp;ved=0ahUKEwiZxPewrp3vAhWSsJ4KHcAmAFUQ_AUIBCgA">ALL</a>,
 <a class="CsQyDc" href="/search?q=apricot&amp;ie=UTF-8&amp;tbm=nws&amp;source=lnms&amp;sa=X&amp;ved=0ahUKEwiZxPewrp3vAhWSsJ4KHcAmAFUQ_AUIBigC">NEWS</a>,
 <a class="CsQyDc" href="/search?q=apricot&amp;ie=UTF-8&amp;tbm=vid&amp;source=lnms&amp;sa=X&amp;ved=0ahUKEwiZxPewrp3vAhWSsJ4KHcAmAFUQ_AUIBygD">VIDEOS</a>,
 <a class="TwVfHd" href="/search?ie=UTF-8&amp;tbm=isch&amp;q=apricot&amp;chips=q:apricot,online_chips:dried+fruit&amp;sa=X&amp;ved=0ahUKEwiZxPewrp3vAhWSsJ4KHcAmAFUQ4lYICygA">dried fruit</a>,
 <a class="TwVfHd" href="/search?ie=UTF-8&amp;tbm=isch&amp;q=apricot&amp;chips=q:apricot,online_chips:jam&amp;sa=X&

In [14]:
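# Same steps as query_google, run inline so that r and soup can be inspected directly.
google_template = 'https://www.google.com/search?q={}&tbm=isch'
r = requests.get(google_template.format(term))
soup = BeautifulSoup(r.text, features="html.parser")
# Some <img> tags have no src attribute, so guard against None before testing the prefix.
urls = [tag.get('src') for tag in soup.find_all('img') if (tag.get('src') or '').startswith('http')]
ipyplot.plot_images(urls[:5])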