
The goal of this exercise is to work through the whole workflow of an example big data DH project: scraping the web for data, downloading the data, putting the data in appropriate storage, analysing it, and presenting the results in a human-accessible format.  In this example, the data will be a set of old photographs, and the analysis will consist of trying to identify images which contain objects with text on them, for example advertising billboards and street signs.

You can work through the exercise at your own pace.  The instructors are available to help you and will answer your questions.  If any of the steps are beyond your current capabilities, please let the instructors know and we will provide tools which can help you proceed further.

# Dataset

The dataset we will look at is the Farm Security Administration/Office of War Information (FSA-OWI) photographs archived on the Library of Congress website (http://www.loc.gov/pictures/collection/fsa/). The photographs were taken in the 1930s and 1940s, during the Great Depression and World War II.  This massive collection contains over 170,000 images, documenting all aspects of life in the United States in those years.  One of the most iconic photographs taken during the Depression, ["Migrant Mother"](http://www.loc.gov/pictures/collection/fsa/item/fsa1998021539/PP/), is part of this collection.

One of the key benefits of using this dataset is that all the photographs are in the Public Domain, since they were made by employees of the US government.

However, the documentation of the collection is incomplete.  69,000 of the photographs are untitled, and for some of these no information at all is available.  Even for photographs with accompanying information, the title and caption might not fully indicate the interesting aspects of a photograph.  The goal for this exercise is to analyse the untitled photographs in the collection and identify those which contain text.  The text can then be used to deduce more information about the photograph.

# Scraping the web

The Library of Congress website provides a search interface to its collection.  However, this search interface has not been updated for over a decade and, while very useful, does not provide all the functionality that a researcher might need.  For example, while the search can be limited to return only images which are digitized, it cannot be set to return only images which have been digitized in high resolution.  Images returned only in low resolution are usually not useful for analysis, so we want to modify our search process to discard them.
Also, the search results are only available as a webpage and cannot easily be downloaded as a dataset for further processing, so we need to write a program which collects them into a convenient format for future use.

An example program for scraping the web which you can build on is below. You should write a program that scans over successive pages of a search for a term ("untitled" in this case):

`http://www.loc.gov/pictures/search/?q=untitled&fa=displayed%3Aanywhere&sp=1&co=fsa&st=grid`

`http://www.loc.gov/pictures/search/?q=untitled&fa=displayed%3Aanywhere&sp=2&co=fsa&st=grid`

`...`

`http://www.loc.gov/pictures/search/?q=untitled&fa=displayed%3Aanywhere&sp=696&co=fsa&st=grid`

and extracts the addresses of each of the 100 images in each page, for example:

`http://www.loc.gov/pictures/collection/fsa/item/fsa1997000003/PP/`
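
The search page addresses above differ only in the `sp` parameter, so they can be generated rather than typed. The following is a minimal sketch; the function names `search_page_url` and `all_search_page_urls` are made up for illustration, and the last page number 696 is taken from the example above and may change as the collection is updated.

```python
from urllib.parse import urlencode

SEARCH_BASE = "http://www.loc.gov/pictures/search/"


def search_page_url(page, term="untitled"):
    """Build the address of one page of search results (hypothetical helper)."""
    params = {
        "q": term,
        "fa": "displayed:anywhere",  # urlencode turns ':' into %3A
        "sp": page,                  # page number within the search results
        "co": "fsa",                 # restrict the search to the FSA-OWI collection
        "st": "grid",
    }
    return SEARCH_BASE + "?" + urlencode(params)


def all_search_page_urls(last_page=696):
    """List the addresses of pages 1..last_page (696 is an assumption from the text)."""
    return [search_page_url(page) for page in range(1, last_page + 1)]
```

Each address produced this way can be passed to a page-fetching function such as the `getpage` helper in the snippets section.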

For each of these item pages, you should check whether a high resolution TIFF version of the image is available. If it is, you want to save the image link and title.
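
One possible way to perform this check is sketched below with regular expressions. It assumes that a high resolution version shows up in the item page as a link whose address ends in `.tif` (as in the `cdn.loc.gov/master/...` address used later in this exercise); inspect the real page source to confirm this before relying on it. The names `find_tiff_links` and `page_title` are hypothetical.

```python
import re

# Assumption: TIFF files appear as href attributes ending in .tif or .tiff
TIFF_HREF = re.compile(r"""href=["']([^"']+\.tiff?)["']""", re.IGNORECASE)
TITLE_TAG = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def find_tiff_links(html):
    """Return every link target in the HTML text that points at a TIFF file."""
    return TIFF_HREF.findall(html)


def page_title(html):
    """Return the text of the <title> element, or None if the page has none."""
    match = TITLE_TAG.search(html)
    return match.group(1).strip() if match else None
```

A page for which `find_tiff_links` returns an empty list would be skipped; otherwise the first link and the title are what you would store.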

The natural way to store this information is a database.  A Python example with the details of setting up and using a simple database is provided below.  If you need more information on how to work with an sqlite database, you can find it on [this page](http://www.tutorialspoint.com/sqlite/index.htm).
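
As a starting point, the scraped links and titles could be stored as sketched here. The table name `photos`, its columns, and the helper names are assumptions chosen for illustration, not a prescribed layout.

```python
import sqlite3


def open_photo_db(path="photos.db"):
    """Open the database, creating the (hypothetical) photos table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS photos ("
        "item_url TEXT PRIMARY KEY, title TEXT, tiff_url TEXT)"
    )
    conn.commit()
    return conn


def save_photo(conn, item_url, title, tiff_url):
    """Store one photograph; a repeated item_url is ignored rather than duplicated."""
    conn.execute(
        "INSERT OR IGNORE INTO photos VALUES (?, ?, ?)",
        (item_url, title, tiff_url),
    )
    conn.commit()
```

Using the item address as the primary key means the scraper can be rerun over the same search pages without creating duplicate rows.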

The HTML code of any web page can be retrieved with Python's urllib module or with the requests library.  An example using requests is provided below, with parsing via BeautifulSoup.  You can search the extracted HTML text using the `str.find` method or regular expressions.

# Downloading and storing the data

The scraping of the web should be separated from the downloading stage since the downloads may take a long time, and the downloading program may have to be run intermittently over an extended time period.

Take the existing database and add a field indicating download status.  Then write a program which will download the files which have not yet been downloaded, and update the database as they come in.  You may want to apply a transformation to the .tiff files to reduce their size, converting them to .jpg format.  You may also want to rescale the resolution.  The imagemagick command line program is the most convenient tool for this.
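
The download bookkeeping described above might look like the following sketch. It assumes a table called `photos` with a `tiff_url` column (hypothetical names), and it takes the actual download step as a `fetch` function argument, so you can plug in your own code that saves the file and converts it with imagemagick.

```python
import sqlite3


def add_download_flag(conn):
    """Add a 'downloaded' column (0 = not yet, 1 = done) unless it already exists."""
    columns = [row[1] for row in conn.execute("PRAGMA table_info(photos)")]
    if "downloaded" not in columns:
        conn.execute("ALTER TABLE photos ADD COLUMN downloaded INTEGER DEFAULT 0")
        conn.commit()


def pending_downloads(conn):
    """Return the TIFF addresses which have not been downloaded yet."""
    rows = conn.execute("SELECT tiff_url FROM photos WHERE downloaded = 0")
    return [row[0] for row in rows]


def mark_downloaded(conn, tiff_url):
    """Record that one file has arrived."""
    conn.execute("UPDATE photos SET downloaded = 1 WHERE tiff_url = ?", (tiff_url,))
    conn.commit()


def download_pending(conn, fetch):
    """Call fetch(url) for every pending file, committing after each one."""
    for url in pending_downloads(conn):
        fetch(url)  # supplied by you, e.g. saves requests.get(url).content to disk
        mark_downloaded(conn, url)
```

Because the status is committed after every file, the program can be stopped and restarted at any time and will continue with the files that are still pending.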

Once the program is done, set it running to download a set of files.  If that is too time consuming, a previously prepared set of images is also available (ask an instructor).

# Detecting text in images
Detecting text in photographic images (as opposed to scans of pages with text) is still a developing field, and the problem is rather challenging.  The difficulty often lies in detecting that text is present somewhere in the image in the first place.

For this exercise we will use a standard OCR tool called Tesseract. This software is conveniently available in all commonly used distributions of Linux.  Applying it to photographs is not its typical use, but it works remarkably well for this exercise.  In any future project one would use better tools as they become available.

First, try to detect text in a single [image](http://loc.gov/pictures/resource/fsa.8a04355/) (first download the [high resolution .tif version of the image](http://cdn.loc.gov/master/pnp/fsa/8a04000/8a04300/8a04355a.tif) ), for example:

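`tesseract 8a04355a.tif out`

Tesseract treats its second argument as the base name of the output file and appends `.txt` itself, so the result is stored in the file out.txt.  The text in the image is very clear, and the program does a relatively good job of recognising it and producing understandable output.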

Then try a more challenging [image](http://www.loc.gov/pictures/collection/fsa/item/fsa1997003919/PP/). Here no text is detected.  Try rotating the image slightly with imagemagick and see if detection improves.

`convert 8a03931u.tif -rotate 10 output_rotated10.JPG`

`tesseract output_rotated10.JPG out`

Finally, write a script which tries to reliably detect text in an image. You might want to try a few rotations of the image in your attempts.  Your script should count the characters in the output: once some threshold is exceeded, you can have some confidence that the photograph contains text.
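
A minimal sketch of such a script is given below. It assumes that the `convert` (imagemagick) and `tesseract` programs are installed and on the path; the rotation angles and the threshold of 20 characters are arbitrary starting values to tune on your own images, and all function names are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def count_text_chars(ocr_output):
    """Count letters and digits, ignoring whitespace and punctuation noise."""
    return sum(ch.isalnum() for ch in ocr_output)


def ocr_at_angle(image_path, angle):
    """Rotate the image with imagemagick, then return the text tesseract finds."""
    with tempfile.TemporaryDirectory() as tmp:
        rotated = Path(tmp) / "rotated.jpg"
        subprocess.run(
            ["convert", str(image_path), "-rotate", str(angle), str(rotated)],
            check=True,
        )
        # "stdout" as the output base makes tesseract print the text instead of writing a file
        result = subprocess.run(
            ["tesseract", str(rotated), "stdout"],
            capture_output=True, text=True, check=True,
        )
    return result.stdout


def contains_text(image_path, angles=(0, -10, 10), threshold=20, ocr=ocr_at_angle):
    """Return True if any rotation yields at least `threshold` letters or digits."""
    best = max(count_text_chars(ocr(image_path, angle)) for angle in angles)
    return best >= threshold
```

The `ocr` parameter exists so that the decision logic can be tried out with a stand-in function before running the external programs on real files.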

# Analysing the data
Write a Spark script which will apply the image analysis script to all the files in your data set.  You could add a field to the database to indicate whether text has been detected in the image or not.  Run the script on the set of images you gathered to detect those which contain text.
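
One way to structure this step is sketched here. The functions take an existing SparkContext `sc` and any `detector` function that maps a file path to True or False; the table name `photos` and its columns `local_path` and `has_text` are assumptions for illustration.

```python
import sqlite3


def detect_all(sc, image_paths, detector):
    """Run detector(path) on every path in parallel; return (path, flag) pairs."""
    rdd = sc.parallelize(image_paths)
    return rdd.map(lambda path: (path, bool(detector(path)))).collect()


def record_results(conn, results):
    """Write the flags to the database on the driver, after the Spark job is done."""
    conn.executemany(
        "UPDATE photos SET has_text = ? WHERE local_path = ?",
        [(int(flag), path) for path, flag in results],
    )
    conn.commit()
```

The database is updated only after `collect()` returns, because an sqlite connection belongs to the driver process and cannot be shipped to the Spark workers. With a context created via `pyspark.SparkContext()`, a call such as `record_results(conn, detect_all(sc, paths, my_detector))` would run the whole step, where `my_detector` is your own text detection function.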

# Presenting the data

Organize the output in some useful way so that humans can easily scan through it.  Jupyter Notebook can be used to do that.  One could also make a webpage with images containing text embedded.  
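
For the webpage option, a simple gallery can be generated as in the sketch below. The function name `build_gallery`, the image width, and the page layout are illustrative choices only.

```python
from html import escape


def build_gallery(photos):
    """photos: iterable of (image_src, caption) pairs; returns an HTML page as a string."""
    figures = []
    for src, caption in photos:
        figures.append(
            '<figure><img src="{}" width="400"><figcaption>{}</figcaption></figure>'.format(
                escape(src, quote=True),  # escape so odd characters cannot break the markup
                escape(caption),
            )
        )
    return "<!DOCTYPE html>\n<html><body>\n" + "\n".join(figures) + "\n</body></html>\n"
```

Using the detected text as the caption lets a reader scan the page quickly and spot false positives; the returned string can be written to a file such as `gallery.html` and opened in a browser.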





# Useful code snippets

Get HTML source code of webpage.

In [None]:
import urllib
from bs4 import BeautifulSoup

In [None]:
! pip install requests

In [None]:
import requests
def getpage(urladdress):
    return requests.get(urladdress).text

In [None]:
html_source_code = getpage("http://www.dhsi.org/")
soup = BeautifulSoup(html_source_code, "lxml")

Download image from web


In [None]:
from PIL import Image
from io import BytesIO

In [None]:
img = Image.open(BytesIO(requests.get("https://pbs.twimg.com/media/CkxPtG6UYAEXH8r.jpg").content))
img.save('twitter_deer.jpg')

In [None]:
soup.find_all('a')[4].attrs['href']

sqlite3 database in Python

In [None]:
import sqlite3
import os

data_file = 'mydatabase.db'

# check before connecting: connect() creates the file if it does not exist
db_exists = os.path.isfile(data_file)
conn = sqlite3.connect(data_file)
c = conn.cursor()
if db_exists:
    print("found existing data file")
else:
    print("creating new database file")
    # create table in database
    c.execute("CREATE TABLE mytable (birthyear int, first_name text, last_name text)")
    conn.commit()

# enter data into table created
update = (1812, "Charles", "Dickens")
print(update)
c.execute("INSERT INTO mytable VALUES (?,?,?)", update)
conn.commit()

# extract data from table and print it row by row
c.execute("SELECT * FROM mytable")
for row in c.fetchall():
    print(row)

# modify data in table; a single UPDATE changes every matching row
new_name = "Charles John Huffam"
c.execute("UPDATE mytable SET first_name=? WHERE last_name=?", (new_name, "Dickens"))
conn.commit()

# Workshop collective solution

In [None]:
from bs4 import BeautifulSoup

In [None]:
import requests

In [None]:
def getpage(urladdress):
    return requests.get(urladdress).text

In [None]:
html_source_code = getpage('http://www.loc.gov/pictures/search/?q=untitled&fa=displayed%3Aanywhere&sp=1&co=fsa&st=grid')

In [None]:
soup = BeautifulSoup(html_source_code, "lxml")

In [None]:
all_links = soup.find_all('a')

In [None]:
def keep_preview_link(link_object):
    return 'preview' in link_object.attrs.get('class', [])

In [None]:
all_links[1]

In [None]:
keep_preview_link(all_links[100])

In [None]:
keep_preview_link(all_links[1])

In [None]:
preview_links = list(filter(keep_preview_link, all_links))

In [None]:
preview_links_text = list(map(lambda x: 'http:' + x.attrs['href'], preview_links))

In [None]:
import pyspark

In [None]:
sc = pyspark.SparkContext()

In [None]:
preview_link_rdd = sc.parallelize(preview_links_text)

In [None]:
html_page_rdd = preview_link_rdd.map(lambda x : (x, getpage(x)))

In [None]:
html_page_rdd.first()

In [167]:
def extract_jpg_image_url(html):
    # collect the addresses of all <link> elements which point at a JPEG image
    links = BeautifulSoup(html, "lxml").find_all('link')
    jpg_urls = []
    for link in links:
        if link.attrs.get('type', None) == 'image/jpg':
            jpg_urls.append('http:' + link.attrs['href'])
    return jpg_urls

In [168]:
photo_link = html_page_rdd.mapValues(extract_jpg_image_url).mapValues(lambda link_list: link_list[0])

In [169]:
photo_link.first()

('http://www.loc.gov/pictures/collection/fsa/item/2017713829/',
 'http://cdn.loc.gov/service/pnp/fsa/8a00000/8a00000/8a00016r.jpg')

In [170]:
from PIL import Image
from io import BytesIO
def getimage(link):
    img_bytes = requests.get(link).content
    return Image.open(BytesIO(img_bytes))

In [171]:
photos_rdd = photo_link.mapValues(getimage)

# Data collection is done

# Let's do some object recognition

In [174]:
! pip install google-cloud-vision


Go to https://console.cloud.google.com/apis/credentials/serviceaccountkey
- From the Service account drop-down list, select New service account.
- Enter a name into the Service account name field.
- Don't select a value from the Role drop-down list. No role is required to access this service.
- Click Create. A note appears, warning that this service account has no role.
- Click Create without role. A JSON file that contains your key downloads to your computer.

In [175]:
import os

In [244]:
from google.cloud import vision
from google.cloud.vision import types

In [278]:
bytes_rdd = photo_link.mapValues(lambda url : requests.get(url).content)

In [295]:
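def detect_image_content(img_bytes):
    # set the credentials inside the function so they are also set on the Spark workers
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/user10/DHSI-BigData/Day_4_Exercise/gcreds.json'
    client = vision.ImageAnnotatorClient()
    # use the function argument (the original referred to an undefined name img_bytes_1)
    image = types.Image(content=img_bytes)
    response = client.web_detection(image=image)
    return response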

In [296]:
web_entities_rdd = bytes_rdd.mapValues(detect_image_content)