## What's doable and how to do it

Here's the text for the first milestone as I wrote it in the contract (broken up for convenience):

I’ll make some notebooks that show what we need to do in order to do any automated queries to the two search engines. 

Then I’ll take a measurement of how much we can scale that before we run into bottlenecks, and think a bit about how we could get past those bottlenecks if we expect we’ll want to. 

By the time we’re done here, we’ll have 
 - the number of terms we want to start searching, 
 - how often we want to search them, and 
 - an estimate for the number we want to scale to in the near future. 
 
I’ll also do a pro/con for deploying the production script to Digital Ocean vs AWS.

Action items:
1. code: make a single call to Baidu and Google Images (that's not rejected)
1. code: try several different methods of making calls
2. code: check that you can collect all the data that is currently stored in the Firewall Cafe database
3. research: estimate the limits on automated searching, based on what blog posts report
4. code: try scraping all keywords from China Digital Times banned words list daily and document success/failure
5. research: outline options for using different IP addresses in order to scale further
6. research: find out what happens once you have exceeded a search engine's limits
7. code: test options for getting around #6 (or decide it's unnecessary)

In [1]:
import requests

## #1: make a single call to Google

When I search for "kitten" on Google images, this is the URL that Google takes me to:
https://www.google.com/search?tbm=isch&sxsrf=ALeKk01Rh17RAmLm5VpTJIuFfglrj1ILDQ%3A1600836824011&source=hp&biw=958&bih=923&ei=19RqX7q9O_e90PEPl6q0-AQ&q=kitten&oq=kitten&gs_lcp=CgNpbWcQAzIECCMQJzIFCAAQsQMyBQgAELEDMgUIABCxAzIFCAAQsQMyBQgAELEDMgUIABCxAzICCAAyBQgAELEDMgUIABCxAzoHCCMQ6gIQJzoICAAQsQMQgwFQ5iFY7yZg4idoAXAAeACAAVyIAfQDkgEBNpgBAKABAaoBC2d3cy13aXotaW1nsAEK&sclient=img&ved=0ahUKEwj6hPLxvf7rAhX3HjQIHRcVDU8Q4dUDCAc&uact=5

Breaking it down:

```
https://www.google.com/search?
tbm=isch&
sxsrf=ALeKk01Rh17RAmLm5VpTJIuFfglrj1ILDQ%3A1600836824011&
source=hp&
biw=958&
bih=923&
ei=19RqX7q9O_e90PEPl6q0-AQ&
q=kitten&
oq=kitten&
gs_lcp=CgNpbWcQAzIECCMQJzIFCAAQsQMyBQgAELEDMgUIABCxAzIFCAAQsQMyBQgAELEDMgUIABCxAzICCAAyBQgAELEDMgUIABCxAzoHCCMQ6gIQJzoICAAQsQMQgwFQ5iFY7yZg4idoAXAAeACAAVyIAfQDkgEBNpgBAKABAaoBC2d3cy13aXotaW1nsAEK&
sclient=img&
ved=0ahUKEwj6hPLxvf7rAhX3HjQIHRcVDU8Q4dUDCAc&
uact=5
```

Building a minimal search URL:

https://www.google.com/search?q=kitten

This works, but it doesn't take me directly to the image results.

Removing every parameter that holds base64-looking data (`sxsrf`, `ei`, `gs_lcp`, `ved`) does the trick:

https://www.google.com/search?q=kitten&sclient=img&tbm=isch&source=hp&biw=958&bih=923&oq=kitten&uact=5

Manually removing every other parameter that the search still works without leaves:

https://www.google.com/search?q=kitten&tbm=isch
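The manual trimming above can also be reproduced with the standard library. This is a minimal sketch, assuming `q` and `tbm` are the only parameters worth keeping (which is what the experiment above suggests); `browser_url`, `KEEP`, and `minimal_search_url` are names made up for illustration.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Shortened copy of the browser URL above; the long opaque values
# (sxsrf, gs_lcp, ved) are left out only to keep the lines readable.
browser_url = (
    "https://www.google.com/search?tbm=isch&source=hp&biw=958&bih=923"
    "&ei=19RqX7q9O_e90PEPl6q0-AQ&q=kitten&oq=kitten&sclient=img&uact=5"
)

# Parameters the manual experiment found to be sufficient (an assumption
# that may stop holding if Google changes its URL scheme).
KEEP = ("q", "tbm")


def minimal_search_url(url, keep=KEEP):
    """Return `url` with every query parameter dropped except those in `keep`."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    kept = {name: params[name] for name in keep if name in params}
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(minimal_search_url(browser_url))
```

Because `urlencode` percent-encodes its values, the same approach should also work for non-ASCII search terms.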

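A forensic visualization of a Google search URL of this kind, using Unfurl (note that this link breaks down a search for "hindsight", not the "kitten" URL above):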

https://dfir.blog/unfurl/?url=https://www.google.com/search?source=hp&ei=BYgfX8rWKfPT9APswJc4&q=hindsight&oq=hindsight&gs_lcp=CgZwc3ktYWIQAzIFCAAQsQMyBQgAELEDMgUIABCxAzICCAAyBQgAELEDMgIIADICCAAyAggAMggILhDHARCvATICCAA6CAgAELEDEIMBOgsILhCxAxDHARCjAjoICC4QxwEQowI6CAguELEDEIMBOgUILhCxAzoHCAAQsQMQCjoICC4QsQMQkwI6AgguOgYIABAWEB46BQghEKABOg4ILhCxAxDHARCjAhCTAjoLCC4QsQMQxwEQrwE6BAgAEANQ5wZYpVhgiWZoAXAAeAKAAX6IAY4VkgEEMjMuN5gBAKABAaoBB2d3cy13aXqwAQA&sclient=psy-ab&ved=0ahUKEwiK7aKK7u7qAhXzKX0KHWzgBQcQ4dUDCAk&uact=5

In [2]:
search_term_english = 'kitten'
search_term_mandarin = '小猫'
google_template = 'https://www.google.com/search?q={}&tbm=isch'
r = requests.get(google_template.format(search_term_english))

In [3]:
r.status_code

200
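A 200 on its own doesn't prove the search wasn't rejected, since a block page can also come back with a success status. Below is a rough heuristic sketch for the "not rejected" check in action item #1. The status codes and the `/sorry` path are assumptions about how blocking might show up, not documented behavior, and `looks_rejected` is a hypothetical helper.

```python
from urllib.parse import urlsplit

# Status codes commonly used for throttling or blocking (an assumption;
# neither search engine documents what it sends to automated clients).
BLOCK_STATUS_CODES = {403, 429, 503}


def looks_rejected(status_code, final_url, body):
    """Heuristic: True if a response looks like a block page rather than results."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    # Assumption: suspected bots get redirected to a /sorry/ CAPTCHA page.
    if urlsplit(final_url).path.startswith("/sorry"):
        return True
    # An empty body is not a usable result page either.
    return not body.strip()


print(looks_rejected(200, "https://www.google.com/search?q=kitten&tbm=isch", "<html>...</html>"))
print(looks_rejected(429, "https://www.google.com/sorry/index", ""))
```

With the response from the cell above, the call would be `looks_rejected(r.status_code, r.url, r.text)`.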

Searching for the term in Google Images and rendering the result:

In [4]:
def display_html_inline(html):
    """Render an HTML string inline in the notebook output."""
    from IPython.display import HTML, display
    display(HTML(html, metadata=dict(isolated=True)))


display_html_inline(r.text)

Rendered output: navigation tabs (ALL, IMAGES, NEWS, VIDEOS), then 20 image results (truncated caption followed by source domain), then footer links (Settings, Privacy, Terms). The results:

- Letting Your Kitten Outside... vets4pets.com
- Bringing Home a New Kitten... vets4pets.com
- Report: USDA Forced Kittens... time.com
- Avoid heartache with The... icatcare.org
- USDA Fed Cats and Dogs to... livescience.com
- Helping your new Cat /... icatcare.org
- WHEN IT COMES TO KITTENS,... chuckanddons.com
- How to Prepare for a Kitten... catadoptionteam.org
- I Got a Kitten...Now What?... lomsnesvet.ca
- Kitten Development From 3... thesprucepets.com
- Kitten - Wikipedia en.wikipedia.org
- Your Kitten's Development... thesprucepets.com
- What to Expect When... advantagepetcare.com.au
- When Should I Microchip my... heartlandvets.com
- I Found Kittens What Do I... fresnohumane.org
- Keeping Kittens Healthy —... kittenrescue.org
- How to Choose the Right... hillspet.com
- Adopt a Pet | Kittens... napavalleyregister.com
- How Hannah Shaw, The... hawaiipublicradio.org
- Jeff Merkley Wants to Make... pdxmonthly.com
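Rendering the page is fine for eyeballing, but collecting the data means pulling fields out of the HTML. Here is a minimal standard-library sketch that gathers the `src` and `alt` of every `<img>` tag. The sample markup is made up for illustration; the tags and attributes Google actually uses haven't been checked here and will likely differ.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src and alt attributes of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            self.images.append((attributes.get("src"), attributes.get("alt", "")))


# Hypothetical markup, made up for illustration; real result pages will differ.
sample_html = """
<table><tr>
  <td><img src="https://example.com/a.jpg" alt="Kitten - Wikipedia"></td>
  <td><img src="https://example.com/b.jpg" alt="Adopt a Pet"></td>
</tr></table>
"""

collector = ImageCollector()
collector.feed(sample_html)
for src, alt in collector.images:
    print(alt, src)
```

On a real response, `collector.feed(r.text)` would replace the sample string.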


And, just for fun, searching for the translated term:

In [5]:
r = requests.get(google_template.format(search_term_mandarin))
display_html_inline(r.text)

Rendered output: navigation tabs (ALL, IMAGES, VIDEOS, NEWS), then 20 image results (truncated caption followed by source domain), then footer links (Settings, Privacy, Terms). The results:

- 小猫出售，价格实惠，血统纯正，英短蓝白，全国发货-... maomipuzi.com
- 小猫和大猫哪个好喂养？__凤凰网 ishare.ifeng.com
- 刚出生小猫怎么人工喂养|幼猫饲养-波奇网百科大全 boqii.com
- 新生小猫的7个照顾技巧，如何照顾刚出生的幼猫|... dawangmao.com
- 猫妈妈生下小猫咪为什么会把小猫吃掉以下禁区你别再犯了！... aboluowang.com
- 幼猫拉稀不吃饭怎么办，小猫拉肚子不吃饭怎么办 k.sina.cn
- 养宠知识分享：小猫拉肚子一直睡觉，小猫拉肚子不愿意动_... sohu.com
- 刚出生的小猫被人摸过后，为什么母猫就会吃掉它？今天算长... k.sina.com.cn
- PS调整边缘抠出可爱的小猫- 设计之家 sj33.cn
- 照顾小猫须知幼猫饲养很简单_好主人 v.163.com
- 小猫喂养全攻略！养一只健康可爱的小猫咪需要知道的6件事... kknews.cc
- 猫生小猫为什么不能看母猫为什么吃小猫- 致富热 zhifure.com
- 专家讲解刚出生的小猫吃什么-新闻频道-手机搜狐 m.sohu.com
- 小猫不拉粑粑怎么办？ - 知乎 zhuanlan.zhihu.com
- 幼猫怎么洗澡？帮小猫洗澡的正确方法- 雪花新闻 xuehua.us
- 小猫的名字_新闻_蛋蛋赞 twoeggz.com
- 小猫几个月打疫苗，注意别错过最佳时间-热备资讯 hotbak.net
- 棕色小猫图片-白色背景下的棕色小猫素材-高清图片-摄影... 52112.com
- 想养猫？教你如何挑选健康的小猫咪- 每日头条 kknews.cc
- 小猫换牙征兆突然脸肿掉牙还流血?_小可爱宠物网 xiaokeai.com


Does this say something about how kittens are perceived differently in China vs the English-speaking world? :shrug:

## #1: make a single call to Baidu

When I translate "kitten" into Chinese (Simplified) with Google Translate, search Baidu for the result, and click over to the images tab, I get the URL below. (Google Translate doesn't list "Mandarin", presumably because its options name the written script rather than the spoken language.)

https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E5%B0%8F%E7%8C%AB&ie=utf-8&ie=utf-8

Breaking it down: 
```
https://image.baidu.com/search/index?
tn=baiduimage&
ps=1&
ct=201326592&
lm=-1&
cl=2&
nc=1&
ie=utf-8&
word=%E5%B0%8F%E7%8C%AB&
ie=utf-8&
ie=utf-8
```
A bit of forensics on this URL suggests that there aren't any tracking parameters:

https://dfir.blog/unfurl/?url=https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E5%B0%8F%E7%8C%AB&ie=utf-8&ie=utf-8

Minimal URL that works when pasting into a browser:

https://image.baidu.com/search/index?tn=baiduimage&word=%E5%B0%8F%E7%8C%AB
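The percent-encoded `word` value and the repeated `ie` parameter are easier to see after decoding. A small sketch using only the standard library (`baidu_url`, `params`, and `minimal` are illustrative names):

```python
from urllib.parse import urlsplit, parse_qs, urlencode

baidu_url = (
    "https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592"
    "&lm=-1&cl=2&nc=1&ie=utf-8&word=%E5%B0%8F%E7%8C%AB&ie=utf-8&ie=utf-8"
)

# parse_qs percent-decodes values and groups repeated keys into lists.
params = parse_qs(urlsplit(baidu_url).query)
print(params["word"])  # the decoded search term
print(params["ie"])    # `ie` appears three times in the browser URL

# Rebuild the minimal URL; urlencode percent-encodes the UTF-8 bytes of the term.
minimal = "https://image.baidu.com/search/index?" + urlencode(
    {"tn": "baiduimage", "word": params["word"][0]}
)
print(minimal)
```

The rebuilt URL carries the same `word` encoding as the one the browser produced, so the Mandarin term survives the round trip unchanged.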

Note that without a browser-like `User-Agent` header and/or the `proxies` override used in the next cell, the request fails with a `TooManyRedirects` error. Also, when I'm connected through my VPN (TunnelBear), my request gets ignored.

In [7]:
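baidu_template = 'https://image.baidu.com/search/index?tn=baiduimage&word={}'
r = requests.get(
    baidu_template.format(search_term_mandarin),
    timeout=10,
    # None values are meant to override any proxy picked up from the environment.
    proxies={'https': None, 'http': None},
    # A browser-like User-Agent; without it the request ends in too many redirects.
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'},
)

# Rendering the real response is disabled because the page keeps loading more images.
# display_html_inline(r.text)
display_html_inline('<html><body><h1>hi</h1></body></html>')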

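Something in the returned HTML keeps loading more and more images once it is rendered inline, which sucks and is why the real output is commented out above. Closing the connection does nothing, because the behavior comes from the HTML itself rather than from the request.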
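If the extra loading is driven by scripts embedded in the page (an assumption, not verified here), one thing to try is stripping `<script>` elements before passing the HTML to `display_html_inline`. This regex-based sketch is crude and can be fooled by unusual markup. `strip_scripts` is a hypothetical helper, and `loadMoreImages` in the sample is a made-up function name.

```python
import re

# Matches <script ...> ... </script>, across newlines and in any letter case.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Return `html` with all <script> elements removed."""
    return SCRIPT_RE.sub("", html)


# Made-up sample page; the real Baidu markup has not been inspected here.
sample = "<html><body><h1>hi</h1><script>loadMoreImages();</script></body></html>"
print(strip_scripts(sample))
```

In the notebook this would be used as `display_html_inline(strip_scripts(r.text))`. Inline event handlers and ordinary `<img>` tags are left untouched, so this may not be enough on its own.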