# Data integrity audit

We want guarantees that:
 - the scraper is saving the correct images to searches, with correct terms
 - the API returns the same images that were saved for a search
 
Additionally, we need answers to these questions:
 - why do some terms on dataserve appear to be a random mishmash of images?
 - some terms visible on the Search Archive don't seem to match the results being shown by dataserve. Is there a mismatch?

In [25]:
from datetime import datetime
import ipyplot
import json
from pandas import DataFrame, read_excel
import requests
from scrape import run

In [6]:
base_url = "http://api.firewallcafe.com"

### Saving correct images to searches with correct terms

Let's run an example search and check the images it finds. Is it outputting to the API correctly?

The details of each search are saved in a JSON file and uploaded to the Space before they're POSTed to the API. Let's check the JSON files that we're creating first.

In [54]:
terms = ["backpacking", "verner vinge"]
df = DataFrame([{"english":term, "chinese":""} for term in terms])
df

Unnamed: 0,english,chinese
0,backpacking,
1,verner vinge,


In [55]:
df.to_excel('data_integrity_test.xlsx')

Running even a very small, one-query scraper query produces a lot of print output, most of it unnecessary for our purposes.

In [56]:
run(termlist=read_excel('data_integrity_test.xlsx'))

could not make directory [WinError 183] Cannot create a file when that file already exists: 'search_results'
querying 2 terms for 24 seconds
_write_file: https://firewall-cafe-space.nyc3.digitaloceanspaces.com/log.json
wrote to log: querying 2 terms for 24 seconds
request 0, term idx 0: "backpacking", "nan"
	Google got 20 images
	Baidu fail
done querying search engines for term backpacking
writing search results
_write_file: https://firewall-cafe-space.nyc3.digitaloceanspaces.com/search_results/google_searches_2021-02-25.json
200 getting image https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTRAKP_JTGKg1r9mn_3g3MXv496gKnKOXU-IbvvKF7_8uVuYtWFXHjWEBAPEdI&s
_write_file: https://firewall-cafe-space.nyc3.digitaloceanspaces.com/images/hashed/9986a4de875c2d59.jpg
200 getting image https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQxIhlzqCl1_n7zepOg-_GvDH7jsfs2F6Ogf86RSXhrF4GozZhQ34Ml3Xj1vA&s
_write_file: https://firewall-cafe-space.nyc3.digitaloceanspaces.com/images/hashed/b3a3b2ce4c

(10, 0, 2)

As we can see from the output, we just wrote to `https://firewall-cafe-space.nyc3.digitaloceanspaces.com/search_results/google_searches_2021-02-25.json`, which means the local file is `search_results/google_searches_2021-02-25.json`. Let's take a look at the JSON file we wrote as output.

In [57]:
fname = 'search_results/google_searches_2021-02-25.json'
with open(fname) as f:
    j = json.loads(f.read())

In [58]:
result = [item for item in j if item['english_term'] in terms]
urls = []
for item in j:
    if item['english_term'] in terms:
        urls += item['urls']

In [59]:
ipyplot.plot_images(urls, img_width=150)

Okay, let's check that the JSON on DO is the same thing.

In [60]:
j = requests.get(f'https://firewall-cafe-space.nyc3.digitaloceanspaces.com/{fname}').json()
result = [item for item in j if item['english_term'] in terms]
urls = []
for item in j:
    if item['english_term'] == term:
        urls += item['urls']

In [64]:
urls

[]

In [61]:
ipyplot.plot_images(urls, img_width=150)

Now check that that corresponds to what you get when browsing Google.

### Checking that the API returns the same images and other metadata

Let's check the search object it creates in the database, and see if the image objects associated with it are the same.

Again, check the output of the scraper run above. It'll give you the ID of the search object in the database (search for "search IDs"). First, let's make sure that the first search ID we grabbed matches the first term that we searched.

In [65]:
search_id = 8577
r = requests.get("http://api.firewallcafe.com/searches/search_id/"+str(search_id))
r

<Response [200]>

In [66]:
j = r.json()
j

[{'search_id': 8577,
  'search_timestamp': '1614211381057',
  'search_location': 'automated_scraper',
  'search_ip_address': '192.168.0.1',
  'search_client_name': 'automated_scraper',
  'search_engine_initial': None,
  'search_engine_translation': None,
  'search_term_initial': 'verner vinge',
  'search_term_initial_language_code': 'EN',
  'search_term_initial_language_confidence': '1.0',
  'search_term_initial_language_alternate_code': None,
  'search_term_translation': 'nan',
  'search_term_translation_language_code': 'zh-CN',
  'search_term_status_banned': False,
  'search_term_status_sensitive': False,
  'search_schema_initial': None,
  'wordpress_search_term_popularity': None,
  'wordpress_copyright_takedown': None,
  'wordpress_unflattened': None,
  'wordpress_regular_post_id': None,
  'wordpress_search_result_post_id': None,
  'wordpress_search_result_post_slug': None}]

Great. Now we'll request all the images that are associated with the search term.

In [67]:
r = requests.get("http://api.firewallcafe.com/images/search_id/"+str(search_id))
j = r.json()
j[0]

{'image_id': 577709,
 'image_search_engine': 'google',
 'image_href': 'https://firewall-cafe-space.nyc3.digitaloceanspaces.com/images/hashed/8c81b7c65c7b087d.jpg',
 'image_rank': '1',
 'image_mime_type': None,
 'wordpress_attachment_post_id': None,
 'wordpress_attachment_file_path': None}

In [68]:
urls = [item['image_href'] for item in j]

In [69]:
ipyplot.plot_images(urls, img_width=150)

We've shown that this is all being handled properly, and the results are showing up accurately.

## Do API endpoints agree on what images match which search IDs?

Search objects are treated as a unit and given an ID. There are also image objects, which have image IDs, the ID of the search they belong to, and the URL of the actual image file. An image *object* can only belong to one search object, though any number of image objects can refer to the same actual file (that means that that particular image has shown up in lots of searches). 

Let's make sure that a few endpoints that we're using agree on which image objects belong to which searches.

`/searches/term` is a way of getting all search objects for a given term. Since it was written for the dataserve, it also returns a compact list of image file URLs instead of all the image object data.

`/images/search_id` is a way of getting all that image object data, but you can only return the image object data from one search object at a time.

This test starts from a term, and uses the first endpoint to request all search objects for that term. Then it iterates through each search ID, using the second endpoint to request the image object data, comparing the URLs for both.

In [70]:
term = "backpacking"

In [71]:
r = requests.get(base_url + "/searches/terms?term=" + term)
j = r.json()

first_endpoint = {}
second_endpoint = {}
print("id     len1  len2   date")
for item in j:
    first_endpoint[item['search_id']] = item
    r = requests.get(base_url + "/images/search_id/"+str(item['search_id']))
    j = r.json()
    second_endpoint[item['search_id']] = j
    print(f"{item['search_id']:5} {len(item['image_hrefs']):5} {len(j):5}   {datetime.fromtimestamp(int(item['search_timestamp'])/1000)}")

id     len1  len2   date
  120    20    20   2016-02-10 13:56:31
 6480   125   125   2021-02-05 17:19:31.396000
 6505   125   125   2021-02-05 17:19:33.774000
 6757   125   125   2021-02-06 16:40:27.015000
 6782   125   125   2021-02-06 16:40:29.382000
 7384   125   125   2021-02-07 17:03:31.260000
 7409   125   125   2021-02-07 17:03:33.634000
 7640   245   245   2021-02-08 16:59:47.226000
 7739   245   245   2021-02-08 16:59:47.226000
 7913   120   120   2021-02-08 17:57:42.972000
 8039   370   370   2021-02-17 13:04:07.246000
 8113   370   370   2021-02-17 13:04:08.483000
 8291   120   120   2021-02-22 13:37:32.803000
 8315   120   120   2021-02-22 13:37:35.589000
 8518   120   120   2021-02-23 16:37:22.436000
 8542   120   120   2021-02-23 16:37:22.436000
 8566   120   120   2021-02-23 16:37:24.596000
 8576     5     5   2021-02-24 16:02:42.140000


The search we just did above shows up with only 5, as expected. However, the one that happened yesterday has 120 photos. This hints at what is possibly the real cuprit for this results soup: each search only receives 5 image results, and the searches are POSTed to the API in batches of 25 (or possibly 24 in some cases).

Let's take a look at one of these results soups.

In [75]:
search_id = 8566
print(first_endpoint[search_id]['search_location'], len(first_endpoint[search_id]['image_hrefs']),
      datetime.fromtimestamp(int(first_endpoint[search_id]['search_timestamp'])/1000))

automated_scraper 120 2021-02-23 16:37:24.596000


In [76]:
ipyplot.plot_images(first_endpoint[search_id]['image_hrefs'][:10], img_width=150)

In [77]:
urls = [item['image_href'] for item in second_endpoint[search_id]]
ipyplot.plot_images(urls[:10], img_width=150)

For this term, we can see that both endpoints agree that these are the images in the database. Let's look at one more thing before we move on. The ranks should always be 1-5, are they?

In [83]:
', '.join(sorted([item['image_rank'] for item in second_endpoint[search_id]]))

'1, 10, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 11, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 12, 120, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 3, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 4, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 5, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 6, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 7, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 8, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 9, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99'

It appears that the scraper is incorrectly lumping the image results from a number of searches into one search object on the DB side.

In [93]:
# given a search id, print out the images
search_id = 8821
urls = [item['image_href'] for item in requests.get(base_url + "/images/search_id/"+str(search_id)).json()]
ipyplot.plot_images(urls[:10], img_width=150)

## Issue: mismatch of search archive and dataserve?