In [1]:
from bs4 import BeautifulSoup
from datetime import datetime
import dateutil.parser  # plain `import dateutil` does not reliably load the parser submodule
import ipyplot
import json
import requests
import time
api_base = 'http://api.firewallcafe.com'

In [2]:
r = requests.get('https://firewallcafe.com/wp-json/wp/v2/search-result?per_page=25&page=1')

In [3]:
wp = r.json()

In [4]:
wp[0]['slug']

'coronavirus-1582773676'

In [5]:
ipyplot.plot_images(wp[0]['galleries'][0]['src'], img_width=150)

In [6]:
wp[0]['date']

'2020-02-27T03:21:16'

In [7]:
dateutil.parser.parse(wp[0]['date'])

datetime.datetime(2020, 2, 27, 3, 21, 16)

Since there isn't an API endpoint to get searches by Wordpress ID, let's just scoop up all the searches first. 

In [8]:
searches = []
ts = time.time()
for i in range(10):
    j = requests.get(api_base + f"/searches?page={i}&page_size=1000").json()
    searches += j
    print(i, round(time.time()-ts,1), "seconds")
    ts = time.time()

0 5.2 seconds
1 3.5 seconds
2 3.4 seconds
3 4.0 seconds
4 3.4 seconds
5 3.4 seconds
6 3.4 seconds
7 3.4 seconds
8 2.3 seconds
9 0.4 seconds
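
The loop above hardcodes 10 pages, which will silently miss data once the DB grows past that. A minimal sketch of an alternative that keeps going until a page comes back empty. It assumes (we haven't confirmed this) that the API returns an empty list past the last page; `fetch_all_pages` and its `fetch_page` argument are names made up for this sketch.

```python
from typing import Callable, List


def fetch_all_pages(fetch_page: Callable[[int], list], max_pages: int = 1000) -> List[dict]:
    """Collect results page by page until a page comes back empty.

    fetch_page is any callable that takes a page number and returns a list.
    max_pages is a safety cap in case the API never returns an empty page.
    """
    results = []
    for page in range(max_pages):
        batch = fetch_page(page)
        if not batch:  # assumption: an empty list means we ran out of pages
            break
        results.extend(batch)
    return results


# Possible usage against the real API (not run here):
# searches = fetch_all_pages(
#     lambda i: requests.get(api_base + f"/searches?page={i}&page_size=1000").json()
# )
```

Passing the fetcher in as a function keeps the paging logic testable without hitting the network.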


In [9]:
with open('all_searches.json', 'w') as f:
    json.dump(searches, f)

In [10]:
searches[0]

{'search_id': 10060,
 'search_timestamp': '1614731797243',
 'search_location': 'automated_scraper',
 'search_ip_address': '192.168.0.1',
 'search_client_name': 'automated_scraper',
 'search_engine_initial': None,
 'search_engine_translation': None,
 'search_term_initial': 'clubbing',
 'search_term_initial_language_code': 'EN',
 'search_term_initial_language_confidence': '1.0',
 'search_term_initial_language_alternate_code': None,
 'search_term_translation': '泡吧',
 'search_term_translation_language_code': 'zh-CN',
 'search_term_status_banned': False,
 'search_term_status_sensitive': False,
 'search_schema_initial': None,
 'wordpress_search_term_popularity': None,
 'wordpress_copyright_takedown': None,
 'wordpress_unflattened': None,
 'wordpress_regular_post_id': None,
 'wordpress_search_result_post_id': None,
 'wordpress_search_result_post_slug': None}
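
Side note: `search_timestamp` comes back as a string, and it looks like milliseconds since the Unix epoch. A small sketch for turning it into a `datetime`, assuming that reading is right and that the values are UTC (neither is documented anywhere we've seen). `parse_search_timestamp` is a helper name made up here.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_search_timestamp(value: str) -> datetime:
    """Convert an epoch-milliseconds string to a timezone-aware UTC datetime."""
    # Integer arithmetic avoids float rounding on the millisecond part.
    return EPOCH + timedelta(milliseconds=int(value))


print(parse_search_timestamp('1614731797243'))
# 2021-03-03 00:36:37.243000+00:00
```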

In [11]:
wp_ids = {search['wordpress_regular_post_id'] for search in searches}
wp_ids2 = {search['wordpress_search_result_post_id'] for search in searches}

In [12]:
wp[0]['id']

282427

In [13]:
wp[0]['id'] in wp_ids2

True

In [14]:
{item['id'] for item in wp} - wp_ids2

set()

Looks like all the IDs of what we got back from the Wordpress API are in the DB; the ID field we care about is called "wordpress_search_result_post_id" there.

Now, how would we check that they have the same images?

In [15]:
def get_search(wp_id):
    results = [search for search in searches if search['wordpress_search_result_post_id'] == wp_id]
    if len(results) != 1:
        raise Exception("hmm, this doesn't seem to be the right length", len(results), wp_id)
    return results[0]
get_search(wp[0]['id'])

{'search_id': 5579,
 'search_timestamp': '1582773676000',
 'search_location': 'poughkeepsie',
 'search_ip_address': None,
 'search_client_name': 'Anonymous',
 'search_engine_initial': 'google',
 'search_engine_translation': 'baidu',
 'search_term_initial': 'coronavirus',
 'search_term_initial_language_code': 'en',
 'search_term_initial_language_confidence': '0.5703125',
 'search_term_initial_language_alternate_code': '',
 'search_term_translation': '冠状病毒',
 'search_term_translation_language_code': 'zh-CN',
 'search_term_status_banned': False,
 'search_term_status_sensitive': False,
 'search_schema_initial': 2,
 'wordpress_search_term_popularity': None,
 'wordpress_copyright_takedown': None,
 'wordpress_unflattened': None,
 'wordpress_regular_post_id': None,
 'wordpress_search_result_post_id': 282427,
 'wordpress_search_result_post_slug': 'coronavirus-1582773676'}
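
`get_search` scans the whole `searches` list on every call. That's fine for a handful of lookups, but it gets slow if we want to check every Wordpress post. A minimal sketch of an alternative: build a dict index once, then look up by ID. It keeps the same "exactly one match" check. `build_wp_index` and `lookup_search` are names made up for this sketch.

```python
from collections import defaultdict


def build_wp_index(searches):
    """Group searches by wordpress_search_result_post_id, skipping rows where it is None."""
    index = defaultdict(list)
    for search in searches:
        wp_id = search.get('wordpress_search_result_post_id')
        if wp_id is not None:
            index[wp_id].append(search)
    return index


def lookup_search(index, wp_id):
    """Return the single search for a Wordpress ID, or raise if there are zero or several."""
    matches = index.get(wp_id, [])
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one search for Wordpress ID {wp_id}, found {len(matches)}"
        )
    return matches[0]
```

Storing a list per ID (rather than a single search) means duplicates are still detected instead of being silently overwritten.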

Next step: request the images from the DB using that search ID, and see if they match the images from the Wordpress results. We're going to limit ourselves to Google for simplicity.

In [16]:
for search in wp[:5]:
    search_db = get_search(search['id'])
    if search_db['wordpress_search_result_post_slug'] != search['slug']:
        raise Exception("we seem to have an incorrect correspondence", search_db['wordpress_search_result_post_slug'], search['slug'])
    r = requests.get(api_base + '/images/search_id/' + str(search_db['search_id']))
    j = r.json()
    print("plotting Wordpress images")
    ipyplot.plot_images(search['galleries'][0]['src'], img_width=100)
    print("plotting DB images")
    db_imgs = [item['image_href'] for item in j if item['image_search_engine'] == 'google']
    ipyplot.plot_images(db_imgs, img_width=100)

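plotting Wordpress images
[image grid not shown]
plotting DB images
[image grid not shown]

plotting Wordpress images
[image grid not shown]
plotting DB images
[image grid not shown]

plotting Wordpress images
[image grid not shown]
plotting DB images
[image grid not shown]

plotting Wordpress images
[image grid not shown]
plotting DB images
[image grid not shown]

plotting Wordpress images
[image grid not shown]
plotting DB images
[image grid not shown]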
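
Eyeballing the grids works for five searches, but it won't scale. A rough sketch of a programmatic comparison, assuming the Wordpress gallery URLs and the DB `image_href` values are directly comparable as strings. We haven't verified that; if the two sides use different hosts or paths for the same image, the comparison would need to normalize them first. `compare_image_urls` is a name made up here.

```python
def compare_image_urls(wp_urls, db_urls):
    """Report which URLs appear on only one side and which are shared."""
    wp_set = {url for url in wp_urls if url}
    db_set = {url for url in db_urls if url}  # drop None/empty hrefs
    return {
        'only_wordpress': sorted(wp_set - db_set),
        'only_db': sorted(db_set - wp_set),
        'shared': sorted(wp_set & db_set),
    }


# Possible usage inside the loop above (not run here):
# diff = compare_image_urls(search['galleries'][0]['src'], db_imgs)
# if diff['only_wordpress'] or diff['only_db']:
#     print(search['slug'], diff)
```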


## Empty search results

We noticed that there are sometimes empty results being served by both Wordpress and Dataserve.

Wordpress example: search for "censorship" on the archive, and find the search from 2016-02-11 ("press censorship in China").

Dataserve: 2016-2-10 frequently shows up as an empty date, e.g. for "backpacking" (ID 120), "clubbing" (ID 121), and others.

### Dataserve empty results

```
search_id, term, date, empty on archive?
106, court, 2016-2-10, true
118, shaolin, 2016-2-10, true
119, car trip, 2016-2-10, true
120, backpacking, 2016-2-10, true
121, clubbing, 2016-2-10, true
2599, dog, 2016-12-2, true
```

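It appears that there's a 1-to-1 correspondence between these problems: every search in this sample that comes back empty from Dataserve is also empty on the Wordpress archive. Let's check what the API is returning for some of these IDs.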

In [17]:
requests.get(api_base + "/searches/search_id/2599").json()

[{'search_id': 2599,
  'search_timestamp': '1480742148000',
  'search_location': 'st_polten',
  'search_ip_address': None,
  'search_client_name': 'Client 462',
  'search_engine_initial': None,
  'search_engine_translation': None,
  'search_term_initial': 'dog',
  'search_term_initial_language_code': None,
  'search_term_initial_language_confidence': None,
  'search_term_initial_language_alternate_code': None,
  'search_term_translation': '狗',
  'search_term_translation_language_code': None,
  'search_term_status_banned': False,
  'search_term_status_sensitive': False,
  'search_schema_initial': 0,
  'wordpress_search_term_popularity': 2,
  'wordpress_copyright_takedown': None,
  'wordpress_unflattened': None,
  'wordpress_regular_post_id': 3401,
  'wordpress_search_result_post_id': 240110,
  'wordpress_search_result_post_slug': 'dog-1480742148'}]

In [18]:
requests.get(api_base + "/images/search_id/2599").json()

[{'image_id': None,
  'image_search_engine': None,
  'image_href': None,
  'image_rank': None,
  'image_mime_type': None,
  'wordpress_attachment_post_id': None,
  'wordpress_attachment_file_path': None}]
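
Note that the "empty" response isn't an empty list: it's a single row with every field set to `None`. A small sketch of a check that treats both shapes as empty, so we could scan many search IDs for this problem. `has_no_images` is a name made up here.

```python
def has_no_images(rows):
    """True if the images response is an empty list or contains only all-None rows."""
    return all(
        all(value is None for value in row.values())
        for row in rows
    )


# Possible usage (not run here):
# rows = requests.get(api_base + "/images/search_id/2599").json()
# print(has_no_images(rows))
```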

In [19]:
j = requests.get("https://firewallcafe.com/wp-json/wp/v2/search-result?per_page=25&page=1&search=dog").json()

In [20]:
for i,search in enumerate(j):
    print(i, search['slug'], '\t', search['date'])

0 places-that-eat-dogs-1582767619 	 2020-02-27T01:40:19
1 dog-meals-1582767601 	 2020-02-27T01:40:01
2 dog-food-1582767590 	 2020-02-27T01:39:50
3 dogs-1582767551 	 2020-02-27T01:39:11
4 recep-tayyip-erdogan-1582756694 	 2020-02-26T22:38:14
5 hot-dog-1581546915 	 2020-02-12T22:35:15
6 dogs-1578684699 	 2020-01-10T19:31:39
7 dog-walker-1569071811 	 2019-09-21T13:16:51
8 dog-1532191736 	 2018-07-21T16:48:56
9 erdogan-1495616624 	 2017-05-24T09:03:44
10 erdogan-1495546901 	 2017-05-23T13:41:41
11 recep-tayyip-erdogan-1495539716 	 2017-05-23T11:41:56
12 dog-1495384485 	 2017-05-21T16:34:45
13 short-dog-food-1495268548 	 2017-05-20T08:22:28
14 dogs-1495123914 	 2017-05-18T16:11:54
15 hund-1481367467 	 2016-12-10T10:57:47
16 hunde-1481367443 	 2016-12-10T10:57:23
17 dog-1480742148 	 2016-12-03T05:15:48
18 american-bully-dog-1457126172 	 2016-03-04T21:16:12
19 xochi-the-dog-1456106340 	 2016-02-22T01:59:00
20 dog-print-1456100190 	 2016-02-22T00:16:30
21 %e7%8b%97-1455924265 	 2016-02-19T23:2

Here's what the Wordpress site is getting from the old API:

In [21]:
j[17]

{'id': 240110,
 'date': '2016-12-03T05:15:48',
 'date_gmt': '2016-12-03T05:15:48',
 'guid': {'rendered': 'https://staging.firewallcafe.com/archive/dog-1480742148/'},
 'modified': '2016-12-03T05:15:48',
 'modified_gmt': '2016-12-03T05:15:48',
 'slug': 'dog-1480742148',
 'status': 'publish',
 'type': 'search-result',
 'link': 'https://firewallcafe.com/archive/dog-1480742148/',
 'title': {'rendered': 'dog'},
 'content': {'rendered': '<p>Google<br />\n<br />\nBaidu<br />\n</p>\n',
  'protected': False},
 'excerpt': {'rendered': '<p>狗</p>\n', 'protected': False},
 'template': '',
 'meta': [],
 'tags': [{'term_id': 149,
   'name': '2016',
   'slug': 'has_search_year_2016',
   'term_group': 0,
   'term_taxonomy_id': 149,
   'taxonomy': 'post_tag',
   'description': '',
   'parent': 0,
   'count': 2691,
   'filter': 'raw'},
  {'term_id': 148,
   'name': 'St. Polten',
   'slug': 'has_search_location_st_polten',
   'term_group': 0,
   'term_taxonomy_id': 148,
   'taxonomy': 'post_tag',
   'descr

We would expect there to be a bunch of images under "galleries", but there aren't any.
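
To find other posts like this without reading the JSON by hand, we could count gallery images per item. A minimal sketch, assuming every item follows the shape used in the plotting cells earlier (`galleries` is a list of dicts, each with a `src` list of URLs). The output above is truncated, so we can't see exactly what `galleries` looks like for this post; the helper below treats a missing key, an empty list, and an empty `src` all as zero images. `gallery_image_count` is a name made up here.

```python
def gallery_image_count(wp_item):
    """Count image URLs across all galleries of one Wordpress search-result item."""
    count = 0
    for gallery in wp_item.get('galleries') or []:
        src = gallery.get('src') or []
        # 'src' is expected to be a list of URLs; a bare string counts as one image.
        count += 1 if isinstance(src, str) else len(src)
    return count


# Possible usage (not run here):
# empty_posts = [item['slug'] for item in j if gallery_image_count(item) == 0]
```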

Here's another of the searches, abbreviated for readability.

In [22]:
j = requests.get(api_base + "/images/search_id/119").json()
j[:3]

[{'image_id': 2052,
  'image_search_engine': 'baidu',
  'image_href': None,
  'image_rank': '07',
  'image_mime_type': 'image/jpeg',
  'wordpress_attachment_post_id': None,
  'wordpress_attachment_file_path': '/wp-content/uploads/2016/02/baidu-car trip-07.jpg'},
 {'image_id': 2046,
  'image_search_engine': 'baidu',
  'image_href': None,
  'image_rank': '01',
  'image_mime_type': 'image/jpeg',
  'wordpress_attachment_post_id': None,
  'wordpress_attachment_file_path': '/wp-content/uploads/2016/02/baidu-car trip-01.jpg'},
 {'image_id': 2047,
  'image_search_engine': 'baidu',
  'image_href': None,
  'image_rank': '02',
  'image_mime_type': 'image/jpeg',
  'wordpress_attachment_post_id': None,
  'wordpress_attachment_file_path': '/wp-content/uploads/2016/02/baidu-car trip-02.jpg'}]

Every image for this search has `image_href` set to `None`. But each one does have a `wordpress_attachment_file_path`, so the images may still be available. Let's take a look.

In [23]:
# The file paths already start with '/', so don't add another slash after the host.
hrefs = [f"https://firewallcafe.com{img['wordpress_attachment_file_path']}" for img in j]

In [24]:
ipyplot.plot_images(hrefs, img_width=100)

The images are there. Not sure why they aren't showing up on the Wordpress site, but we can at least give access to them on Dataserve.

### What we need to do

Right now, we can modify the code in Dataserve to fall back to the Wordpress image URL when `image_href` is empty.
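
A minimal sketch of what that fallback could look like, using the field names from the images API output above. This isn't the actual Dataserve code; `resolve_image_url` and `WP_BASE` are names made up here, and percent-encoding the path is our own choice because some file names contain spaces (e.g. `baidu-car trip-07.jpg`).

```python
from urllib.parse import quote

WP_BASE = 'https://firewallcafe.com'  # assumption: attachments are served from the main site


def resolve_image_url(img):
    """Prefer image_href; otherwise build a URL from the Wordpress attachment path."""
    if img.get('image_href'):
        return img['image_href']
    path = img.get('wordpress_attachment_file_path')
    if path:
        # quote() leaves '/' alone and encodes spaces as %20
        return WP_BASE + quote(path)
    return None  # nothing we can show for this image
```

Returning `None` when both fields are empty lets the caller decide whether to skip the image or flag it.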

For the new API, we have some URL management to do. We need a new `original_href` field so we can keep track of where we first found the image. For any images with nothing in `image_href`, we need to make sure the image is in the image data lake and record its URL. Plus, there are still a lot of images where `image_href` points to the original location, so we'll need to switch those URLs out for the new location that we control.
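
To scope that work, a first pass could sort every image record into one of three buckets. This is only a sketch under stated assumptions: `LAKE_HOST` is a placeholder (we haven't settled on a host for the data lake), and `classify_image` and `with_original_href` are hypothetical helpers, not part of any existing API.

```python
from urllib.parse import urlparse

LAKE_HOST = 'lake.example.org'  # placeholder host, not a real decision


def classify_image(img):
    """Bucket an image record by the state of its href."""
    href = img.get('image_href')
    if not href:
        return 'missing_href'  # needs the image located, put in the lake, and its URL recorded
    if urlparse(href).netloc == LAKE_HOST:
        return 'in_lake'  # already points at storage we control
    return 'points_to_original'  # needs copying to the lake and the href swapped


def with_original_href(img):
    """Return a copy of the record with original_href preserved where it applies."""
    record = dict(img)
    if classify_image(img) == 'points_to_original':
        # Keep the first-seen location before the href is rewritten.
        record.setdefault('original_href', img['image_href'])
    return record
```

Counting records per bucket would tell us how much of the work is copying images versus just rewriting URLs.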