## Load sensitive words dataset

China Digital Times makes its sensitive-words dataset publicly available on its site and in this [Google Sheet](https://docs.google.com/spreadsheets/d/1UTP9MU80r_N5WPhQ5-4AjM0ebW1eMxyDlRe_vaYy9IM/edit#gid=0).

In [1]:
import pandas

In [2]:
df=pandas.read_csv('CDT_sensitive_words.csv', skiprows=1, header=0, usecols=[0,1], names=['chinese', 'english'])
df.head()

| | chinese | english |
|---|---|---|
| 0 | 新闻联播 | |
| 1 | john oliver | |
| 2 | last week tonight | |
| 3 | 上周今夜秀 | |
| 4 | 蓝衣女记者 | blue-clothed female reporter |


In [3]:
len(df)

3497

In [4]:
english_terms = list(df.dropna().english)

## Testing the rate limits of Google and Baidu

Here are some rates to test:

- 6 / hour
- 60 / hour
- 300 / hour

If we can get to 300 / hour, that's 7,200 / day. That's probably plenty for our purposes. 
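
To make the arithmetic concrete, here is a minimal sketch that converts each candidate rate into a delay between requests and a daily total. The helper names `seconds_between_requests` and `requests_per_day` are illustrative and not part of the notebook, and the calculation assumes requests are spread evenly across each hour.

```python
# Illustrative helpers (names are assumptions, not from the notebook).
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24

def seconds_between_requests(rate_per_hour):
    """Delay needed between requests to hit the given hourly rate."""
    return SECONDS_PER_HOUR / rate_per_hour

def requests_per_day(rate_per_hour):
    """Total requests in 24 hours at a steady hourly rate."""
    return rate_per_hour * HOURS_PER_DAY

for rate in (6, 60, 300):
    delay = seconds_between_requests(rate)
    print(f"{rate:>3} / hour -> one request every {delay:g} s, "
          f"{requests_per_day(rate):,} / day")
```

At 300 / hour this works out to one request every 12 seconds.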

In [5]:
import time

In [17]:
import requests
from bs4 import BeautifulSoup
import re
import random
import json

In [7]:
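def query_google(term):
    """Return the image URLs found on a Google Images results page for `term`."""
    google_template = 'https://www.google.com/search?q={}&tbm=isch'
    r = requests.get(google_template.format(term))
    # Name the parser explicitly so BeautifulSoup does not have to guess one.
    soup = BeautifulSoup(r.text, 'html.parser')
    # Keep absolute URLs only; skip <img> tags that have no src attribute.
    urls = [tag.get('src') for tag in soup.find_all('img')
            if (tag.get('src') or '').startswith('http')]
    return urls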

In [8]:
def query_baidu(term):
    """Return the image URLs found on a Baidu Images results page for `term`."""
    url = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word='+term+'&ct=201326592&v=flip'
    # Bypass any configured proxies and send a browser-like User-Agent.
    r = requests.get(url, timeout=10,
                proxies={'https':None, 'http':None},
                headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'})
    # Collect every "objURL" value from the page source.
    urls = re.findall(r'"objURL":"(.*?)",', r.text, re.S)
    return urls
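
The regular expression in `query_baidu` can be checked offline. The sketch below runs the same pattern on a made-up fragment shaped like the `"objURL"` entries the function looks for. The sample text, the example.com URLs, and the helper name `extract_obj_urls` are assumptions for illustration, not real Baidu output or part of the notebook.

```python
import re

# Hypothetical fragment imitating the entries embedded in the results page.
sample = (
    '{"objURL":"http://example.com/a.jpg","fromURL":"x"},'
    '{"objURL":"http://example.com/b.png","fromURL":"y"}'
)

# Non-greedy capture stops at the first '",' after each "objURL" value.
OBJ_URL_PATTERN = re.compile(r'"objURL":"(.*?)",', re.S)

def extract_obj_urls(text):
    """Return every objURL value found in the given text."""
    return OBJ_URL_PATTERN.findall(text)

print(extract_obj_urls(sample))
```

Because the capture is non-greedy, each match ends at the closing quote of its own URL rather than running on to a later entry.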

## Test at 300 / hour for two hours

In [19]:
google_fails = []
baidu_fails = []

google_urls = {}
baidu_urls = {}

total_time = 3600 * 2
total_requests = int(300 * 2)
wait_time = total_time / total_requests

start_ts = time.time()
for i in range(0, total_requests):
    start_iter_ts = time.time()
    term = random.choice(english_terms)
    print(f'{i}: "{term}"')
    try:
        urls = query_google(term)
        print(f"\tgoogle got {len(urls)} images")
        google_urls[term] = urls
    except Exception as e:
        google_fails.append(e)
        print("\tgoogle fail", e)
    try:
        urls = query_baidu(term)
        print(f"\tbaidu got {len(urls)} images")
        baidu_urls[term] = urls
    except Exception as e:
        baidu_fails.append(e)
        print("\tbaidu fail", e)
    
    # account for the time the calls took
    took = time.time() - start_iter_ts
    time.sleep(max(0, wait_time - took))
print("took", round((time.time() - start_ts)/60, 2), "minutes")
with open('google_searches.json', 'w') as f:
    f.write(json.dumps(google_urls))
with open('baidu_searches.json', 'w') as f:
    f.write(json.dumps(baidu_urls))

0: "Lung Ying-tai"
	google got 20 images
	baidu got 60 images
1: "Maoming+aromatic hydrocarbon"
	google got 20 images
	baidu got 60 images
2: "Yihong house (from novel Dream of the Red Chamber)"
	google got 20 images
	baidu got 60 images
3: "Ai Xiaoming"
	google got 20 images
	baidu got 60 images
4: "Internet Management Office"
	google got 20 images
	baidu got 60 images
5: "[Zhou] Yongkang + Jiang Jiemin"
	google got 20 images
	baidu got 10 images
6: "Beijing + chaos"
	google got 20 images
	baidu got 60 images
7: "Fan Chengxiu"
	google got 20 images
	baidu got 60 images
8: "Syria + emergency declaration"
	google got 20 images
	baidu got 60 images
9: "cultural revolution"
	google got 20 images
	baidu got 60 images
10: "cao"
	google got 20 images
	baidu got 60 images
11: "decide behind closed doors"
	google got 20 images
	baidu got 60 images
12: "stone eight big"
	google got 20 images
	baidu got 60 images
13: "Yuyao+assemble"
	google got 20 images
	baidu got 60 images
14: "Mengxue"
	goog