# Let's review the YouBike scraping code

In [1]:
import requests
import json
url  = 'http://data.taipei/youbike'
response = requests.get(url)
data = json.loads(response.text)
print("response code:", response.status_code)

response code: 200


## The library to get web data: Requests
* http://docs.python-requests.org/en/master/ 
* Quickstart http://docs.python-requests.org/en/master/user/quickstart/

```
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}
```

## The library to convert JSON to Python objects
* https://docs.python.org/3/library/json.html

In [4]:
# Use Response.json() to parse the JSON body into Python objects (dict or list)
import requests
res = requests.get(url).json()
type(res)

dict
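As a small offline illustration of what `.json()` does to the response body, the sketch below parses a hand-written JSON string with `json.loads`. The sample string and its field names are made up for illustration and are not real YouBike data.

```python
import json

# A made-up JSON string standing in for response.text (not real API output).
sample_text = '{"status": "ok", "stations": [{"name": "A", "bikes": 12}]}'

parsed = json.loads(sample_text)        # JSON object -> Python dict
print(type(parsed))
print(parsed["stations"][0]["bikes"])   # nested JSON array -> Python list

# A top-level JSON array becomes a Python list instead of a dict.
print(type(json.loads('[1, 2, 3]')))
```

Whether you get a dict or a list depends only on the top-level JSON value the server returns, which is why the later cells check the type before traversing.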

# Practice: Fetch the following data and traverse it

In [7]:
url_AQX = "https://taqm.epa.gov.tw/taqm/aqs.ashx?lang=tw&act=aqi-epa&ts=1538961147679"
url_dcard = "https://www.dcard.tw/_api/forums/girl/posts?popular=true"
url_pchome = "http://ecshweb.pchome.com.tw/search/v3.3/all/results?q=X100F&page=1&sort=rnk/dc"
url_104 = "https://www.104.com.tw/jobs/search/list?ro=0&keyword=%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90&area=6001001000&order=1&asc=0&kwop=7&page=2&mode=s&jobsource=n104bank1"

In [None]:
res = requests.get(url_AQX).json()
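One possible way to approach the traversal is a small recursive helper that lists the shape of whatever `.json()` returned. This is only a sketch: `summarize` is a hypothetical helper name, and the sample data is invented so the example runs without network access.

```python
def summarize(obj, depth=0, max_depth=2):
    """Return lines describing the nested structure of dicts and lists."""
    indent = "  " * depth
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append("{}{}: {}".format(indent, key, type(value).__name__))
            if depth < max_depth:
                lines.extend(summarize(value, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        lines.append("{}[list of {}; showing the first item]".format(indent, len(obj)))
        if depth < max_depth:
            lines.extend(summarize(obj[0], depth + 1, max_depth))
    return lines

# Invented sample shaped like a typical API response.
sample = {"status": 1, "data": [{"id": 1, "title": "a"}, {"id": 2, "title": "b"}]}
print("\n".join(summarize(sample)))
```

Calling `summarize(res)` on a real response should give a quick overview of which keys hold lists worth looping over.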

# Write a function to load JSON data
* What is a status code? https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

```
2xx Success
200 OK
Standard response for successful HTTP requests. The actual response will depend on the request method used. In a GET request, the response will contain an entity corresponding to the requested resource. In a POST request, the response will contain an entity describing or containing the result of the action.
3xx Redirection
4xx Client errors
400 Bad Request
401 Unauthorized
403 Forbidden
404 Not Found
405 Method Not Allowed
```
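The classes listed above follow from the first digit of the code. The sketch below shows that mapping using the standard library's `http.HTTPStatus`; `status_class` is a hypothetical helper written for this example.

```python
from http import HTTPStatus

def status_class(code):
    """Map a numeric status code to its class based on the first digit."""
    classes = {1: "Informational", 2: "Success", 3: "Redirection",
               4: "Client error", 5: "Server error"}
    return classes.get(code // 100, "Unknown")

for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Not Found".
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```

In requests, `response.ok` is True for any status code below 400, so it covers the 2xx and 3xx classes in one check.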

In [8]:
def get_web_json(url):
    response = requests.get(url)
    print("response code:", response.status_code)
    if not response.ok:
        return None
    data = response.json()
    return data


In [22]:
data = get_web_json(url_dcard)
if isinstance(data, dict):
    print("A dict with", data.keys())
elif isinstance(data, list):
    print("A list with the first data entry\n", data[0])

response code: 200
A list with the first data entry
 {'id': 229819770, 'title': '#抽脂 我不是網美更不是業配', 'excerpt': '謝謝大家那麼關心那麼心疼我，本人我感到十分受寵若驚，最多人問的是怎麼有勇氣？梁靜茹給我的吧，有時這種事也需要一點衝動，想太多就會畏畏縮縮不敢行動，原本想等到月底再來發保養文，但有個好心的朋友讓我很想發這篇文章', 'anonymousSchool': False, 'anonymousDepartment': True, 'pinned': False, 'forumId': 'f72e3b1d-3c9a-4fec-8a61-41c76cc317af', 'replyId': 229814246, 'createdAt': '2018-10-09T17:14:16.989Z', 'updatedAt': '2018-10-09T23:23:51.767Z', 'commentCount': 54, 'likeCount': 545, 'withNickname': False, 'tags': [], 'topics': ['抽脂'], 'forumName': '女孩', 'forumAlias': 'girl', 'gender': 'F', 'school': '健行科技大學', 'replyTitle': '#抽脂 算是送給自己的禮物吧', 'supportedReactions': [], 'reactions': [{'id': '286f599c-f86a-4932-82f0-f5a06f1eca03', 'count': 545}], 'hidden': False, 'customStyle': None, 'withImages': True, 'withVideos': False, 'media': [{'url': 'https://i.imgur.com/q4DXWXe.jpg'}, {'url': 'https://i.imgur.com/gw7Yrmd.jpg'}]}


In [27]:
# Practice: Add the type-checking code to the function
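One possible shape for the practice is to keep the type check in its own function so it can be called at the end of `get_web_json` or on its own. This is a sketch rather than the intended solution; `describe_json` is a hypothetical name and the inputs are invented so the cell runs offline.

```python
def describe_json(data):
    """Return a short description of the object returned by response.json()."""
    if data is None:
        return "No data (the request failed)"
    if isinstance(data, dict):
        return "A dict with keys {}".format(list(data.keys()))
    if isinstance(data, list):
        if not data:
            return "An empty list"
        return "A list of {} items; first entry: {}".format(len(data), data[0])
    return "Unexpected type: {}".format(type(data).__name__)

# Invented inputs standing in for real responses.
print(describe_json({"status": 1, "data": []}))
print(describe_json([{"id": 1}, {"id": 2}]))
print(describe_json(None))
```

Handling `None` and the empty list explicitly avoids an `IndexError` on `data[0]` when a request fails or returns nothing.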




## Using the pandas library to handle a list of dicts
* What is pandas? 10-minute tutorial: http://pandas.pydata.org/pandas-docs/stable/10min.html

In [23]:
import pandas as pd
pd.DataFrame(data).head()

Unnamed: 0,anonymousDepartment,anonymousSchool,commentCount,createdAt,customStyle,department,excerpt,forumAlias,forumId,forumName,...,replyTitle,school,supportedReactions,tags,title,topics,updatedAt,withImages,withNickname,withVideos
0,True,False,54,2018-10-09T17:14:16.989Z,,,謝謝大家那麼關心那麼心疼我，本人我感到十分受寵若驚，最多人問的是怎麼有勇氣？梁靜茹給我的吧，...,girl,f72e3b1d-3c9a-4fec-8a61-41c76cc317af,女孩,...,#抽脂 算是送給自己的禮物吧,健行科技大學,[],[],#抽脂 我不是網美更不是業配,[抽脂],2018-10-09T23:23:51.767Z,True,False,False
1,True,True,14,2018-10-09T17:42:57.358Z,,,近期很流行簡單不單調的飾品，且我很喜歡小巧精緻的首飾，剛好得知這個日系品牌且也注意很久了，微...,girl,f72e3b1d-3c9a-4fec-8a61-41c76cc317af,女孩,...,,,[],[],#精品 #輕珠寶 #agete 開箱,[精品],2018-10-09T17:42:57.358Z,True,False,False
2,True,True,10,2018-10-10T07:23:30.360Z,,,大愛這種風格的，希望大家喜歡,girl,f72e3b1d-3c9a-4fec-8a61-41c76cc317af,女孩,...,,,[],[],分享我喜歡的桌布！,"[女孩, 桌布, 圖]",2018-10-10T07:23:30.360Z,True,False,False
3,True,False,20,2018-10-10T08:06:28.717Z,,fish_fishhhhh,有瀏海的女孩都有一項強項！，那就是自己修剪瀏海，（應該很少女孩會為了修瀏海常常跑外面花錢剪吧...,girl,f72e3b1d-3c9a-4fec-8a61-41c76cc317af,女孩,...,,生活在陸地上的魚🐡,[],[],#圖 瀏海女孩的困擾,[],2018-10-10T08:06:28.717Z,True,True,False
4,True,False,27,2018-10-09T17:27:03.645Z,,,最不會下標題，跟男友算是遠距，每個週末我都會去他家找他，我以為遠距離的情侶見到面都會更珍惜相...,girl,f72e3b1d-3c9a-4fec-8a61-41c76cc317af,女孩,...,,亞洲大學,[],[],我想單身了,[],2018-10-09T17:27:03.645Z,False,False,False


# Practice: Problematic URL? See the requests docs
* Try to fetch the URL https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1
* You will get a 404 status_code

In [25]:
url_591 = "https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1"
response = requests.get(url_591)
print(response.ok)
print(response.status_code)

False
404


* Solution: https://stackoverflow.com/questions/10606133/sending-user-agent-using-requests-library-in-python

## Parameters of a request

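* The problem could be the "User-Agent" header. Request headers that servers commonly check:
    * User-Agent: which browser or operating system you are using
    * Referer: the page you clicked through or were redirected from
    * Cookies: information the server gave you after the connection was established, so that you can stay on the page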

* Check the Quickstart of the requests library: http://docs.python-requests.org/en/master/user/quickstart/
* The request headers a browser sends to rent.591.com.tw look like this:
```
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,zh-TW;q=0.8,zh;q=0.7,zh-CN;q=0.6
Cache-Control: max-age=0
Connection: keep-alive
Cookie: webp=1; PHPSESSID=74131f35fb441ebacf685a6552f2629b; T591_TOKEN=74131f35fb441ebacf685a6552f2629b; urlJumpIp=1; urlJumpIpByTxt=%E5%8F%B0%E5%8C%97%E5%B8%82; 591_new_session=eyJpdiI6IjBqSkVXb04zTTkzV2FEeXZWUFlmc3c9PSIsInZhbHVlIjoiblZpOCtGcXpBS1gyeWJcL3dWM3N3c1BRbytFSEVIaXNud0VyMjA1ZmFBZVhXSHpLbnlmMW41YzQ0UU5GMXhMYTJoMGNOblJucGo1YXpzbDhPdUpBNEJnPT0iLCJtYWMiOiJlMDdjMjM2Y2Y3NDVjOTFkOWU5ZmMyNjY4YmZmNzRjOWY4MDBkZTY3ZTlhMzg2YzQ4ZGQ1ODI4MzI4NTM2ZWJiIn0%3D
Host: rent.591.com.tw
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36
```
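Header lines like the ones above can be copied out of the browser and turned into the dict that `requests.get(url, headers=...)` expects. The sketch below does this with plain string handling; `parse_raw_headers` is a hypothetical helper, and the raw text is a shortened copy of the listing above.

```python
def parse_raw_headers(raw):
    """Turn 'Name: value' lines into a dict, skipping blank lines."""
    headers = {}
    for line in raw.strip().splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, because values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers

raw = """
Accept-Language: en-US,en;q=0.9,zh-TW;q=0.8
Host: rent.591.com.tw
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36
"""
headers = parse_raw_headers(raw)
print(headers["Host"])
print(headers["User-Agent"])
```

Which of these headers a given site actually requires has to be found by experiment; starting with `User-Agent` alone is a reasonable first attempt.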


In [30]:
headers = {
    'User-Agent': 'Chrome/69.0.3497.100'
} # headers are passed as a dict
response = requests.get(url_591, headers=headers).json()


# Use Chrome DevTools to find JSON files
* Slide: https://docs.google.com/presentation/d/e/2PACX-1vRW84XoB5sFRT1Eg-GrK4smX23qoNkFffz_h8oRU4AIvJAgrrxBn8059_0UeHv_pFBks_Z37vNbLGai/pub?start=false&loop=false&delayms=3000
```
https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1
```

In [31]:
headers = {
    'User-Agent': 'Chrome/69.0.3497.100'
} # headers are passed as a dict
url_591 = "https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1"
res = requests.get(url_591, headers=headers).json()
all_data = res["data"]["data"]
res.keys()
res["records"]

'2,422'

## Get the URLs of the 2nd, 3rd, 4th, ... pages
```
https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1
https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1&firstRow=30&totalRows=2421
https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1&firstRow=60&totalRows=2421
https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1&firstRow=90&totalRows=2421
```
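The pattern in these URLs is that each page adds `firstRow` in steps of 30 plus a `totalRows` value. As a sketch, the URLs can be built with `urllib.parse.urlencode` instead of a hand-written string; `page_url` is a hypothetical helper, and the page size of 30 is inferred from the URLs above.

```python
from urllib.parse import urlencode

BASE = "https://rent.591.com.tw/home/search/rsList"
QUERY = {"is_new_list": 1, "type": 1, "kind": 2, "searchtype": 1, "region": 1}

def page_url(page, total_rows, page_size=30):
    """Build the URL of a zero-based page; page 0 has no paging parameters."""
    params = dict(QUERY)
    if page > 0:
        params["firstRow"] = page * page_size
        params["totalRows"] = total_rows
    return BASE + "?" + urlencode(params)

for page in range(4):
    print(page_url(page, 2421))
```

Keeping the query in a dict makes it easier to change one parameter, such as `region`, without editing a long string.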

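## Convert the record count to an integer
* `str.replace(",", "")` replaces every `,` with an empty string, which removes the thousands separator.
* `int()` converts a value to an integer.
* Please check the help for the other type-conversion functions: `float()`, `str()`.
* Note that `total_page` in the cells below holds the total number of records, not the number of pages.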
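A short illustration of these conversions on the same kind of string the API returns for `records`:

```python
records = "2,422"                      # the record count arrives as a string

cleaned = records.replace(",", "")     # remove the thousands separator
print(int(cleaned))                    # string -> integer
print(float(cleaned))                  # string -> floating-point number
print(str(2422) + " records")          # integer -> string, so it can be concatenated

# int() refuses a string that still contains the comma.
try:
    int(records)
except ValueError as err:
    print("ValueError:", err)
```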

In [34]:
print(res["records"])
total_page = int(res["records"].replace(",", ""))
print(total_page)

2,422
2422


In [35]:
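# Each assignment below overwrites the previous one, so only the last URL (firstRow=90) is requested.
url = "https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1&firstRow={}&totalRows={}".format(0*30, total_page)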
url = "https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1&firstRow={}&totalRows={}".format(1*30, total_page)
url = "https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1&firstRow={}&totalRows={}".format(2*30, total_page)
url = "https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1&firstRow={}&totalRows={}".format(3*30, total_page)
res = requests.get(url, headers=headers).json()
res.keys()

dict_keys(['status', 'data', 'records', 'is_recom', 'deal_recom', 'online_social_user'])

## Convert the total record count to the number of pages to crawl

In [40]:
print(int(total_page/30))
for i in range(0, int(total_page/30)):
    print(i, end=", ")

80
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 
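Note that `int(total_page/30)` rounds down, so the last partial page (index 80 here) is not covered by the range above. A sketch of an alternative that rounds up with `math.ceil`, assuming 30 records per page as the URLs suggest; `page_count` is a hypothetical helper:

```python
import math

def page_count(total_records, page_size=30):
    """Number of pages needed to cover total_records rows."""
    return math.ceil(total_records / page_size)

print(page_count(2422))   # 2422 records need page indices 0..80
print(page_count(2400))   # an exact multiple needs no extra page
```

Compared with `int(total/30) + 1`, rounding up avoids requesting one extra, empty page when the record count is an exact multiple of 30.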

## Python 3 print formatting
* https://www.python-course.eu/python3_formatted_output.php
```
print(i, len(all_data))
print("{:3d}\t{:4d}".format(i, len(all_data)))
```
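To see what the format specs do, the sketch below prints a few (page index, cumulative count) pairs with `str.format` and with the equivalent f-string. The pairs are sample values chosen for illustration.

```python
rows = [(0, 30), (9, 300), (80, 2423)]   # (page index, cumulative count) pairs

for i, n in rows:
    # {:3d} right-aligns an integer in 3 columns, {:4d} in 4 columns.
    print("{:3d}\t{:4d}".format(i, n))
    # The same spec works inside an f-string (Python 3.6+).
    print(f"{i:3d}\t{n:4d}")

line = "{:3d}\t{:4d}".format(9, 300)
```

The fixed widths keep the columns aligned as the numbers grow from one digit to four.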

In [44]:
all_data = []
for i in range(0, int(total_page/30)+1):
    url = "https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1&firstRow={}&totalRows={}".format(i*30, total_page)
    res = requests.get(url, headers=headers).json()
    all_data.extend(res["data"]["data"])
#     print(i, len(all_data))
    print("{:3d}\t{:4d}".format(i, len(all_data)))

  0	  30
  1	  60
  2	  90
  3	 120
  4	 150
  5	 180
  6	 210
  7	 240
  8	 270
  9	 300
 10	 330
 11	 360
 12	 390
 13	 420
 14	 450
 15	 480
 16	 510
 17	 540
 18	 570
 19	 600
 20	 630
 21	 660
 22	 690
 23	 720
 24	 750
 25	 780
 26	 810
 27	 840
 28	 870
 29	 900
 30	 930
 31	 960
 32	 990
 33	1020
 34	1050
 35	1080
 36	1110
 37	1140
 38	1170
 39	1200
 40	1230
 41	1260
 42	1290
 43	1320
 44	1350
 45	1380
 46	1410
 47	1440
 48	1470
 49	1500
 50	1530
 51	1560
 52	1590
 53	1620
 54	1650
 55	1680
 56	1710
 57	1740
 58	1770
 59	1800
 60	1830
 61	1860
 62	1890
 63	1920
 64	1950
 65	1980
 66	2010
 67	2040
 68	2070
 69	2100
 70	2130
 71	2160
 72	2190
 73	2220
 74	2250
 75	2280
 76	2310
 77	2340
 78	2370
 79	2400
 80	2423


In [45]:
import pandas as pd
rent591_df = pd.DataFrame(all_data)
rent591_df.head()

Unnamed: 0,addInfo,addition2,addition3,addition4,addr_number_name,address,address_img,address_img_title,alley_name,allfloor,...,storeprice,street_name,streetid,type,unit,updatetime,user_id,vipBorder,vipimg,vipstyle
0,,0,0,1,,民權東路三段103巷..,(復北民權)中山國中捷運站全新短租套房,(復北民權)中山國中捷運站全新短租套房,103巷,10,...,0,民權東路三段,26167,1,元/月,1536760172,129673,vipStyle,,isvip
1,,0,0,1,,成功路二段193巷晨陽租屋..,晨陽租屋-溫馨美裝潢套房,晨陽租屋-溫馨美裝潢套房,193巷,7,...,0,成功路二段,25877,1,元/月,1537266440,1777778,vipStyle,,isvip
2,,0,0,1,,同安街80巷古亭捷運..,古亭捷運English頂加陽台新套房,古亭捷運English頂加陽台新套房,80巷,4,...,0,同安街,25778,1,元/月,1538917639,266409,vipStyle,,isvip
3,,0,0,1,,辛亥路二段台灣大學..,台灣大學後門、捷運科技大樓、雙人豪華套房,台灣大學後門、捷運科技大樓、雙人豪華套房,,8,...,0,辛亥路二段,25630,1,元/月,1538302558,1527404,vipStyle,,isvip
4,,0,0,1,23號,成都路27巷新裝潢.優質套..,新裝潢.優質套房(獨立洗衣機)密碼鎖,新裝潢.優質套房(獨立洗衣機)密碼鎖,27巷,4,...,0,成都路,26302,1,元/月,1538530682,2229308,vipStyle,,isvip


# Dump files for backup

## Dump one variable to JSON with the json library
* https://docs.python.org/3/library/json.html

In [47]:
import json
with open('rent_591.json', 'w') as outfile:
    json.dump(all_data, outfile)
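A backup is only useful if it loads back, so here is a sketch of the full round trip with `json.dump` and `json.load`. The records and the temporary file path are invented for the example; `ensure_ascii=False` is an optional choice that keeps Chinese text readable in the file.

```python
import json
import os
import tempfile

# Invented records standing in for all_data.
records = [{"id": 1, "address": "民權東路三段"}, {"id": 2, "address": "同安街"}]

path = os.path.join(tempfile.mkdtemp(), "backup.json")

# ensure_ascii=False writes the characters themselves instead of \uXXXX escapes.
with open(path, "w", encoding="utf-8") as outfile:
    json.dump(records, outfile, ensure_ascii=False)

with open(path, encoding="utf-8") as infile:
    restored = json.load(infile)

print(restored == records)
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform's default encoding when the data contains non-ASCII text.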

## Dump and load JSON with the pandas library

In [50]:
with open('df_to_json.json', 'w') as f:
    f.write(rent591_df.to_json())

In [52]:
with open("df_to_json.json") as fin:
    data2 = pd.read_json(fin)
data2.head()

Unnamed: 0,addInfo,addition2,addition3,addition4,addr_number_name,address,address_img,address_img_title,alley_name,allfloor,...,storeprice,street_name,streetid,type,unit,updatetime,user_id,vipBorder,vipimg,vipstyle
0,,0,0,1,,民權東路三段103巷..,(復北民權)中山國中捷運站全新短租套房,(復北民權)中山國中捷運站全新短租套房,103巷,10,...,0,民權東路三段,26167,1,元/月,1536760172,129673,vipStyle,,isvip
1,,0,0,1,,成功路二段193巷晨陽租屋..,晨陽租屋-溫馨美裝潢套房,晨陽租屋-溫馨美裝潢套房,193巷,7,...,0,成功路二段,25877,1,元/月,1537266440,1777778,vipStyle,,isvip
10,,0,0,1,330號,信義路五段150巷象山旁新..,象山旁新裝潢工業風小套房，雜費全包,象山旁新裝潢工業風小套房，雜費全包,150巷,13,...,0,信義路五段,26234,1,元/月,1537977490,1702300,vipStyle,,isvip
100,"<img src=""./images/index/userCenter/list_vip_v...",0,0,0,,中山北路一段#京站南西商..,#京站南西商圈#3米6浴室對外收納大空間,#京站南西商圈#3米6浴室對外收納大空間,,13,...,0,中山北路一段,25687,1,元/月,1537674505,1131629,vipStyle,"<img src=""./images/index/userCenter/list_vip_v...",isvip
1000,,0,0,0,,建國北路一段建國北路一段南京..,建國北路一段南京金融商圈旁套房,建國北路一段南京金融商圈旁套房,,14,...,0,建國北路一段,25726,1,元/月,1538561958,764665,,,


## Dump multiple variables to pickle

In [95]:
import pickle
with open('rent591.pkl', 'wb') as f:  # Python 3: open(..., 'wb')
    pickle.dump([all_data, rent591_df], f)

## Load multiple variables back to objects

In [97]:
with open('rent591.pkl', "rb") as f:  # Python 3: open(..., 'rb')
    rent591_list, rent591_df2 = pickle.load(f)
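The same dump-and-load pattern can be tried offline with small stand-in objects. In this sketch the data and the temporary path are invented; the point is that several objects go into one list and are unpacked in the same order on load.

```python
import os
import pickle
import tempfile

# Invented stand-ins for all_data and a second object.
listings = [{"id": 1, "price": "23,500"}, {"id": 2, "price": "9,500"}]
summary = {"count": len(listings), "source": "example"}

path = os.path.join(tempfile.mkdtemp(), "backup.pkl")

with open(path, "wb") as f:               # pickle files are binary
    pickle.dump([listings, summary], f)

with open(path, "rb") as f:
    listings2, summary2 = pickle.load(f)  # unpack in the same order as dumped

print(listings2 == listings, summary2 == summary)
```

Only unpickle files you trust, because `pickle.load` can execute arbitrary code embedded in a malicious file.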

In [99]:
len(rent591_list)

2414

In [101]:
rent591_df2.describe()

Unnamed: 0,addition2,addition3,addition4,allfloor,area,browsenum,browsenum_all,checkstatus,closed,comment_ltime,...,refreshtime,regionid,room,sectionid,shape,social_house,storeprice,streetid,updatetime,user_id
count,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,...,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0
mean,0.001243,0.0,0.032312,8.335128,10.194118,60.53604,850.626346,0.762635,0.0,761069000.0,...,1538912000.0,1.0,2.543496,5.552196,1.714167,0.008285,0.0,26796.160315,1538395000.0,1043620.0
std,0.035238,0.0,0.176863,5.474421,4.271879,58.292908,1004.820779,0.9371,0.0,755282700.0,...,747403.7,0.0,15.663882,3.047009,0.70012,0.090663,0.0,5449.111388,950197.4,784746.4
min,0.0,0.0,0.0,1.0,3.0,0.0,1.0,0.0,0.0,0.0,...,1511853000.0,1.0,0.0,1.0,1.0,0.0,0.0,25425.0,1511853000.0,1258.0
25%,0.0,0.0,0.0,4.0,7.0,21.0,229.0,0.0,0.0,0.0,...,1538996000.0,1.0,0.0,3.0,1.0,0.0,0.0,25687.0,1538022000.0,307443.2
50%,0.0,0.0,0.0,7.0,9.5,44.0,516.0,1.0,0.0,1291179000.0,...,1539136000.0,1.0,0.0,5.0,2.0,0.0,0.0,25787.0,1538635000.0,925525.5
75%,0.0,0.0,0.0,12.0,13.0,81.0,1099.5,1.0,0.0,1536105000.0,...,1539168000.0,1.0,0.0,8.0,2.0,0.0,0.0,26172.75,1538993000.0,1708152.0
max,1.0,0.0,1.0,121.0,38.0,574.0,13206.0,5.0,0.0,1539188000.0,...,1539189000.0,1.0,99.0,12.0,6.0,1.0,0.0,69947.0,1539189000.0,2482651.0


# Full review


In [1]:
import requests
headers = {'User-Agent': 'Chrome/69.0.3497.100'}
url_591 = "https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1"
res = requests.get(url_591, headers=headers).json()
total_page = int(res["records"].replace(",", ""))

all_data = []
for i in range(0, int(total_page/30)+1):
    url = "https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=2&searchtype=1&region=1&firstRow={}&totalRows={}".format(i*30, total_page)
    res = requests.get(url, headers=headers).json()
    all_data.extend(res["data"]["data"])
    print("{:3d}\t{:4d}".format(i, len(all_data)))

  0	  30
  1	  60
  2	  90
  3	 120
  4	 150
  5	 180
  6	 210
  7	 240
  8	 270
  9	 300
 10	 330
 11	 360
 12	 390
 13	 420
 14	 450
 15	 480
 16	 510
 17	 540
 18	 570
 19	 600
 20	 630
 21	 660
 22	 690
 23	 720
 24	 750
 25	 780
 26	 810
 27	 840
 28	 870
 29	 900
 30	 930
 31	 960
 32	 990
 33	1020
 34	1050
 35	1080
 36	1110
 37	1140
 38	1170
 39	1200
 40	1230
 41	1260
 42	1290
 43	1320
 44	1350
 45	1380
 46	1410
 47	1440
 48	1470
 49	1500
 50	1530
 51	1560
 52	1590
 53	1620
 54	1650
 55	1680
 56	1710
 57	1740
 58	1770
 59	1800
 60	1830
 61	1860
 62	1890
 63	1920
 64	1950
 65	1980
 66	2010
 67	2040
 68	2070
 69	2100
 70	2130
 71	2160
 72	2190
 73	2220
 74	2250
 75	2280
 76	2310
 77	2340
 78	2370
 79	2400
 80	2429


## CS style: counting with a dict

In [None]:
udict = {}
print(all_data[1].keys())
for data in all_data:
    if data["user_id"] not in udict:
        udict[data["user_id"]] = 1
    else:
        udict[data["user_id"]] += 1
sorted_by_value = sorted(udict.items(), key=lambda kv: kv[1], reverse = True)
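The count-then-sort loop above can also be written with `collections.Counter` from the standard library. A minimal sketch on invented rows, where only the `user_id` field matters:

```python
from collections import Counter

# Invented rows standing in for all_data.
rows = [{"user_id": 7}, {"user_id": 3}, {"user_id": 7},
        {"user_id": 7}, {"user_id": 3}, {"user_id": 9}]

counts = Counter(row["user_id"] for row in rows)
top = counts.most_common(2)   # list of (user_id, count), largest count first
print(top)
```

`most_common(n)` replaces both the manual dict update and the `sorted(..., reverse=True)` call.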

## pandas

In [29]:
import pandas as pd
df = pd.DataFrame(all_data)
df.groupby("user_id")['id'].count().reset_index(name='count').sort_values(['count'], ascending=False).head()

Unnamed: 0,user_id,count
1093,1318377,37
553,440589,29
1428,2210658,25
764,752838,22
1209,1611062,20


In [30]:
df.price = pd.to_numeric(df.price.str.replace(",", ""))
df["avg"] = df.price/df.area

In [31]:
df[(df.user_id == 1318377)].filter(items=['address', 'area', "price", "avg"])

Unnamed: 0,address,area,price,avg
370,和平東路三段捷運六張犁站..,13.3,23500,1766.917293
903,內湖路一段47巷近捷運劍南站【..,12.0,21500,1791.666667
905,復興南路一段捷運大安站3分【..,10.0,19000,1900.0
906,通化街近捷運信義安和站【超值..,15.0,28000,1866.666667
907,長春路捷運南京復興站5分【電..,8.0,15000,1875.0
908,泰順街近捷運台電大樓站【師大..,8.0,15000,1875.0
909,農安街166巷捷運行天宮5分／..,6.0,9500,1583.333333
910,林森北路85巷捷運善導寺站【五..,7.8,12000,1538.461538
911,內湖路一段47巷近捷運劍南站【..,10.0,18500,1850.0
912,松江路16巷捷運忠孝新生【光華..,6.0,12000,2000.0
