## Comparing the old WordPress API and the newer Node API

Do the two APIs hold the same data? How do queries work on each? Let's check:
 - get all searches on both
 - get all images
 - get one search and all of its images
 - get voting data
 - do all searches on WordPress exist on the Node API?
 - do all images on WordPress exist on the Node API?


This is the endpoint that firewallcafe.com/archive hits to fetch its results.

In [1]:
url = 'https://firewallcafe.com/wp-json/wp/v2/search-result?per_page=25&page=1'

In [2]:
import requests

In [3]:
wp_res = requests.get(url).json()

In [4]:
len(wp_res)

25

In [5]:
wp_res[0].keys()

dict_keys(['id', 'date', 'date_gmt', 'guid', 'modified', 'modified_gmt', 'slug', 'status', 'type', 'link', 'title', 'content', 'excerpt', 'template', 'meta', 'tags', 'galleries', '_links'])

In [6]:
wp_res[0]['id']

282427

This ID belongs to the topmost entry under Search Results in the admin panel: 282427, "coronavirus".

In [7]:
wp_res[0]['title']

{'rendered': 'coronavirus'}

In [8]:
[item['title']['rendered'] for item in wp_res]

['coronavirus',
 'kim jong un',
 'hitler',
 'why communism is evil',
 'red square',
 'trump',
 'hong kong',
 'tiananmen square',
 'hong kong protest flag',
 'hong kong',
 'chinese internment camps',
 'united states',
 'propaganda',
 'missles',
 'burger king',
 'red bull',
 'doritos',
 'aliens',
 'nascar',
 'fox news',
 'martin luther king jr',
 'hot cheetos',
 'nba',
 'hamburgers',
 'pizza']

In [9]:
new_api_res = requests.get('http://api.firewallcafe.com/searches').json()

In [10]:
len(new_api_res)

5579

The new API returns searches oldest-first, the reverse of the WordPress order, so the corresponding search is the last item in the list.

In [11]:
new_api_res[-1]

{'search_id': 5579,
 'search_timestamp': '1582773676000',
 'search_location': 'poughkeepsie',
 'search_ip_address': None,
 'search_client_name': 'Anonymous',
 'search_engine_initial': 'google',
 'search_engine_translation': 'baidu',
 'search_term_initial': 'coronavirus',
 'search_term_initial_language_code': 'en',
 'search_term_initial_language_confidence': '0.5703125',
 'search_term_initial_language_alternate_code': '',
 'search_term_translation': '冠状病毒',
 'search_term_translation_language_code': 'zh-CN',
 'search_term_status_banned': False,
 'search_term_status_sensitive': False,
 'search_schema_initial': 2,
 'wordpress_search_term_popularity': None,
 'wordpress_copyright_takedown': None,
 'wordpress_unflattened': None,
 'wordpress_regular_post_id': None,
 'wordpress_search_result_post_id': 282427,
 'wordpress_search_result_post_slug': 'coronavirus-1582773676'}
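Relying on list position works here, but the record can also be found through its `wordpress_search_result_post_id` field, which holds the WordPress post ID. The following is a minimal sketch of that lookup; `index_by_wp_post_id` and the two-item `sample` list are hypothetical stand-ins for illustration, with field names copied from the output above.

```python
def index_by_wp_post_id(node_searches):
    """Map WordPress post ID -> Node API search record."""
    index = {}
    for record in node_searches:
        wp_id = record.get('wordpress_search_result_post_id')
        if wp_id is not None:  # skip records with no WordPress counterpart
            index[wp_id] = record
    return index


# Hypothetical stand-in for new_api_res, trimmed to the relevant fields.
sample = [
    {'search_id': 1, 'search_term_initial': 'football',
     'wordpress_search_result_post_id': 241408},
    {'search_id': 5579, 'search_term_initial': 'coronavirus',
     'wordpress_search_result_post_id': 282427},
]

by_wp_id = index_by_wp_post_id(sample)
print(by_wp_id[282427]['search_term_initial'])
```

With the real data, `index_by_wp_post_id(new_api_res)[wp_res[0]['id']]` should return the same record as `new_api_res[-1]`.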

However, the WordPress result contains a lot that is missing here. In particular, there is no list of images; for those we need to call a separate endpoint:

In [12]:
r = requests.get('http://api.firewallcafe.com/searches/images/search_id/' + str(5579))

Let's look at the result minus the raw image data.

In [13]:
imageset = r.json()
for img in imageset:
    img['image_data']['data'] = []

In [14]:
imageset[0]

{'search_id': 5579,
 'search_timestamp': '1582773676000',
 'search_location': 'poughkeepsie',
 'search_ip_address': None,
 'search_client_name': 'Anonymous',
 'search_engine_initial': 'google',
 'search_engine_translation': 'baidu',
 'search_term_initial': 'coronavirus',
 'search_term_initial_language_code': 'en',
 'search_term_initial_language_confidence': '0.5703125',
 'search_term_initial_language_alternate_code': '',
 'search_term_translation': '冠状病毒',
 'search_term_translation_language_code': 'zh-CN',
 'search_term_status_banned': False,
 'search_term_status_sensitive': False,
 'search_schema_initial': 2,
 'wordpress_search_term_popularity': None,
 'wordpress_copyright_takedown': None,
 'wordpress_unflattened': None,
 'wordpress_regular_post_id': None,
 'wordpress_search_result_post_id': 282427,
 'wordpress_search_result_post_slug': 'coronavirus-1582773676',
 'image_id': 197428,
 'image_search_engine': 'google',
 'image_href': 'https://www.bloomberg.com/graphics/2020-wuhan-novel-c

Looks like we can grab the images with the URLs.

In [15]:
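import ipyplot

def get_full_img_url(img):
    # The stored path is appended to the WordPress host to form a full URL.
    return 'http://firewallcafe.com' + img['wordpress_attachment_file_path']

# Preview the first ten images of this search.
ipyplot.plot_images([get_full_img_url(img) for img in imageset[:10]], img_width=150)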

## Do all searches on the WordPress API exist on the Node API?

In [17]:
# get all wordpress searches
# search_by_wp_id = {}
# for page in range(1, 100):
#     print("getting page", page)
#     r = requests.get(f'https://firewallcafe.com/wp-json/wp/v2/search-result?per_page=100&page={page}')
#     j = r.json()
#     if 'data' in j and j['data']['status'] == 400:
#         break
#     print('got', len(j), 'first search:', j[0]['title']['rendered'])
#     for search in j:
#         search_by_wp_id[search['id']] = search

import json
with open('temp/wp_results.json') as f:
    search_by_wp_id = json.loads(f.read())
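The commented-out loop above pages through the WordPress endpoint until it gets an error response. The sketch below restates that idea with the HTTP call factored out, so the paging logic can be tried without touching the network. `fetch_wp_page` and `fetch_all_wp_searches` are hypothetical helper names, and stopping on any non-200 status or empty page is an assumption rather than documented behaviour of this site.

```python
WP_URL = 'https://firewallcafe.com/wp-json/wp/v2/search-result'


def fetch_wp_page(page, per_page=100):
    """Return (status_code, parsed JSON body) for one page of results."""
    import requests  # imported here so the paging logic runs without it
    r = requests.get(WP_URL, params={'per_page': per_page, 'page': page})
    return r.status_code, r.json()


def fetch_all_wp_searches(get_page=fetch_wp_page, max_pages=100):
    """Collect every search result, keyed by WordPress post ID."""
    search_by_wp_id = {}
    for page in range(1, max_pages + 1):
        status, body = get_page(page)
        # Assumption: a non-200 status or an empty page means we ran past the end.
        if status != 200 or not body:
            break
        for search in body:
            search_by_wp_id[search['id']] = search
    return search_by_wp_id
```

Note that after a round trip through `json.dumps` and `json.loads`, the integer keys of `search_by_wp_id` come back as strings, because JSON object keys are always strings.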

In [29]:
# get all node searches
# new_api_res = requests.get('http://api.firewallcafe.com/searches').json()

with open('temp/node_results.json') as f:
    new_api_res = json.loads(f.read())

In [30]:
# with open('temp/wp_results.json', 'w') as f:
#     f.write(json.dumps(search_by_wp_id))
# with open('temp/node_results.json', 'w') as f:
#     f.write(json.dumps(new_api_res))

In [31]:
new_api_res[0]

{'search_id': 1,
 'search_timestamp': '1454979377000',
 'search_location': 'new_york_city',
 'search_ip_address': None,
 'search_client_name': 'Dan',
 'search_engine_initial': None,
 'search_engine_translation': None,
 'search_term_initial': 'football',
 'search_term_initial_language_code': None,
 'search_term_initial_language_confidence': None,
 'search_term_initial_language_alternate_code': None,
 'search_term_translation': '足球',
 'search_term_translation_language_code': None,
 'search_term_status_banned': False,
 'search_term_status_sensitive': False,
 'search_schema_initial': 0,
 'wordpress_search_term_popularity': 1,
 'wordpress_copyright_takedown': None,
 'wordpress_unflattened': None,
 'wordpress_regular_post_id': 223,
 'wordpress_search_result_post_id': 241408,
 'wordpress_search_result_post_slug': 'football-1454979377'}

In [36]:
wp_ids = {str(_id) for _id in search_by_wp_id}
node_ids = {str(item['wordpress_search_result_post_id']) for item in new_api_res}

In [37]:
len(wp_ids - node_ids)

0

In [38]:
len(node_ids - wp_ids)

0

In [39]:
len(wp_ids)

5579

Perfect: both differences are empty, so the two sets of IDs are equal.
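Converting to sets hides repeated IDs, so it can be worth checking for duplicates as part of the same comparison. A minimal sketch, assuming the IDs are compared as strings as in the cells above; `compare_ids` is a hypothetical helper, not part of either API.

```python
from collections import Counter


def compare_ids(wp_ids, node_ids):
    """Report IDs missing on either side and IDs repeated on the Node side."""
    wp = [str(i) for i in wp_ids]
    node = [str(i) for i in node_ids]
    return {
        'only_in_wp': sorted(set(wp) - set(node)),
        'only_in_node': sorted(set(node) - set(wp)),
        'duplicated_in_node': sorted(i for i, n in Counter(node).items() if n > 1),
    }


# Made-up IDs for illustration only.
report = compare_ids(['10', '11', '12'], [10, 11, 11, 13])
print(report)
```

On the real data this would be called as `compare_ids(search_by_wp_id.keys(), [item['wordpress_search_result_post_id'] for item in new_api_res])`; three empty lists would confirm a one-to-one match.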

## Get a list of all image files on WordPress

In [48]:
wp_imgset = []
for wp_id, search in search_by_wp_id.items():
    for gallery in search['galleries']:
        wp_imgset += gallery['src']

In [49]:
len(wp_imgset)

187132

In [50]:
wp_imgset[:10]

['https://firewallcafe.com/wp-content/uploads/2020/02/26/282427/google-1582773676-01.jpg',
 'https://firewallcafe.com/wp-content/uploads/2020/02/26/282427/google-1582773676-02.jpg',
 'https://firewallcafe.com/wp-content/uploads/2020/02/26/282427/google-1582773676-03.jpg',
 'https://firewallcafe.com/wp-content/uploads/2020/02/26/282427/google-1582773676-04-220x220.jpg',
 'https://firewallcafe.com/wp-content/uploads/2020/02/26/282427/google-1582773676-05.jpg',
 'https://firewallcafe.com/wp-content/uploads/2020/02/26/282427/google-1582773676-06.jpg',
 'https://firewallcafe.com/wp-content/uploads/2020/02/26/282427/google-1582773676-07-194x220.jpg',
 'https://firewallcafe.com/wp-content/uploads/2020/02/26/282427/google-1582773676-08.jpg',
 'https://firewallcafe.com/wp-content/uploads/2020/02/26/282427/google-1582773676-09.jpg',
 'https://firewallcafe.com/wp-content/uploads/2020/02/26/282427/google-1582773676-10.jpg']
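Some of these URLs end in a `-220x220` or `-194x220` suffix, which looks like WordPress's naming for resized copies of an upload. If the comparison against the Node API needs the original filenames, the suffix can be stripped first. This is a sketch under that assumption; `strip_size_suffix` is a hypothetical helper, and whether the un-suffixed file actually exists on the server has not been checked here.

```python
import re

# Matches a trailing "-<width>x<height>" that sits right before the file extension.
SIZE_SUFFIX = re.compile(r'-\d+x\d+(?=\.[A-Za-z0-9]+$)')


def strip_size_suffix(url):
    """Remove a '-WxH' size suffix from an image URL, if present."""
    return SIZE_SUFFIX.sub('', url)


base = 'https://firewallcafe.com/wp-content/uploads/2020/02/26/282427/'
print(strip_size_suffix(base + 'google-1582773676-04-220x220.jpg'))
print(strip_size_suffix(base + 'google-1582773676-01.jpg'))
```

URLs without a size suffix pass through unchanged, so the function can be mapped over all of `wp_imgset`.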

## Get a list of all image files on the Node API

This is currently infeasible because the Node API returns the raw image data with every image record: I would have to download essentially the entire contents of the multi-GB database. I'm going to investigate removing that field from the results, since we're not going to be using it soon anyway.

## Do all images on WordPress exist on the Node API?

In [None]:
# for item in new_api_res:
list(search_by_wp_id.items())[0][1]['galleries']
# r = requests.get('http://api.firewallcafe.com/searches/images/search_id/' + str(5579))
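One way this check could go for a single search is to compare filenames from the WordPress `galleries` against the `wordpress_attachment_file_path` values returned by the Node image endpoint. The sketch below assumes that each gallery's `src` is a list of URLs (as the earlier cell suggests) and that matching on the final path component is good enough; both helpers and the sample records are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def filename(path_or_url):
    """Return the final path component of a URL or a file path."""
    return posixpath.basename(urlparse(path_or_url).path)


def missing_on_node(wp_search, node_images):
    """Filenames present in the WordPress galleries but absent from the Node images."""
    wp_files = {filename(src)
                for gallery in wp_search['galleries']
                for src in gallery['src']}
    node_files = {filename(img['wordpress_attachment_file_path'])
                  for img in node_images}
    return sorted(wp_files - node_files)


# Made-up records for illustration; real ones come from the two APIs.
wp_search = {'galleries': [{'src': ['https://example.com/uploads/a-01.jpg',
                                    'https://example.com/uploads/a-02.jpg']}]}
node_images = [{'wordpress_attachment_file_path': '/uploads/a-01.jpg'}]
print(missing_on_node(wp_search, node_images))
```

Running this per search would still require one image-endpoint request for each of the 5579 searches, so it inherits the raw-image-data cost described earlier.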