### Problem:
Selenium fails to start on Google Colab. Error: Service chromedriver unexpectedly exited. Status code was: 1
https://ithelp.ithome.com.tw/questions/10211773

### Solution:

In [None]:
%%shell
# Ubuntu no longer distributes chromium-browser outside of snap
#
# Proposed solution: https://askubuntu.com/questions/1204571/how-to-install-chromium-without-snap

# Add debian buster
cat > /etc/apt/sources.list.d/debian.list <<'EOF'
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
EOF

# Add keys
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A

apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg

# Prefer debian repo for chromium* packages only
# Note the double-blank lines between entries
cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
Package: *
Pin: release a=eoan
Pin-Priority: 500


Package: *
Pin: origin "deb.debian.org"
Pin-Priority: 300


Package: chromium*
Pin: origin "deb.debian.org"
Pin-Priority: 700
EOF

# Install chromium and chromium-driver
apt-get update

apt-get install -y chromium chromium-driver

# Install selenium
pip install selenium

Executing: /tmp/apt-key-gpghome.j8pXzISrIG/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
gpg: key DCC9EFBF77E11517: public key "Debian Stable Release Key (10/buster) <debian-release@lists.debian.org>" imported
gpg: Total number processed: 1
gpg:               imported: 1
Executing: /tmp/apt-key-gpghome.19g3FtzwV9/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
gpg: key DC30D7C23CBBABEE: public key "Debian Archive Automatic Signing Key (10/buster) <ftpmaster@debian.org>" imported
gpg: Total number processed: 1
gpg:               imported: 1
Executing: /tmp/apt-key-gpghome.bvoXSx5bmo/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A
gpg: key 4DFAB270CAA96DFA: public key "Debian Security Archive Automatic Signing Key (10/buster) <ftpmaster@debian.org>" imported
gpg: Total number processed: 1
gpg:               imported: 1
Get:1 http://deb.debian.org/debian buster InRelease [122 kB]
Get:2 http://deb.debian.org/debian bust
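
Once the install cell finishes, a quick check can confirm that the binaries are visible on `PATH` before Selenium is started. The sketch below uses only the standard library. The helper name `find_executable` is made up for illustration, and the binary names `chromedriver` and `chromium` are assumptions about what the Debian packages install.

```python
import shutil

def find_executable(name):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)

# Binary names are assumed; adjust them if the packages install something else.
for tool in ("chromedriver", "chromium"):
    path = find_executable(tool)
    print(tool, "->", path if path else "not found on PATH")
```

If either name is reported as not found, the install step most likely did not complete, and Selenium will not be able to launch the browser.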



In [None]:
import pandas as pd
import re, time, requests
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from google.colab import drive
import json

### 104 search URL after applying the filters
https://www.104.com.tw/jobs/search/?ro=1&jobcat=2007000000&isnew=3&expansionType=area%2Cspec%2Ccom%2Cjob%2Cwf%2Cwktm&area=6001006000&order=17&asc=0&page=1&mode=l&jobsource=2018indexpoc&langFlag=0&langStatus=0&recommendJob=1&hotJob=1

The query string contains these parameters:
* ro=1
* jobcat=2007000000
* isnew=3
* area=6001006000
* mode=l
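
The parameters listed above can also be read out of the URL programmatically rather than by eye. This is a minimal sketch with the standard library; `search_url` is a shortened copy of the URL above, and the variable names are illustrative only.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the search URL (only some of its parameters are kept).
search_url = ("https://www.104.com.tw/jobs/search/?ro=1&jobcat=2007000000&isnew=3"
              "&area=6001006000&order=17&asc=0&page=1&mode=l")

# parse_qs maps each parameter name to a list of values.
query = parse_qs(urlsplit(search_url).query)

wanted = ["ro", "jobcat", "isnew", "area", "mode"]
picked = {key: query[key][0] for key in wanted}
print(picked)
```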


In [None]:
# Send browser-like request headers (browser, operating system, and so on) so the request looks like a real page visit
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'}

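# Search parameters
my_params = {'ro':'1', # full-time jobs only; use 0 for no restriction
        'jobcat':'2007000000', # job category: software / information systems
        'area':'6001006000', # jobs in Hsinchu County and Hsinchu City only
        'isnew':'3', # only postings updated within the last N days, e.g. 3 = three days, 7 = one week
        'mode':'l'} # list view mode (lowercase letter L, not the digit 1)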
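
To see how a dictionary like this becomes a query string without sending any request, the encoding step can be reproduced with the standard library. This is only a sketch: `base_url` and `example_params` are illustrative names, and the values are copied from the cell above.

```python
from urllib.parse import urlencode

base_url = "https://www.104.com.tw/jobs/search/"
example_params = {"ro": "1", "jobcat": "2007000000", "area": "6001006000",
                  "isnew": "3", "mode": "l"}

# urlencode joins the pairs as key=value separated by "&", in dictionary order.
search_url = base_url + "?" + urlencode(example_params)
print(search_url)
# https://www.104.com.tw/jobs/search/?ro=1&jobcat=2007000000&area=6001006000&isnew=3&mode=l
```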

In [None]:
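# Let requests build the encoded search URL from the parameters (this also sends one GET request)
url = requests.get('https://www.104.com.tw/jobs/search/', params=my_params, headers=headers).url
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# Recent Selenium 4 releases no longer accept the driver path as a positional argument;
# chromedriver is expected to be found on PATH
driver = webdriver.Chrome(options=chrome_options)
driver.get(url)

# The page loads more results automatically when scrolled to the bottom, so run JavaScript to scroll down for us
for i in range(20):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(0.6)

# Auto-loading only happens 15 times; after that, the "手動載入" (manual load) button must be clicked to load more data
k = 1
while k != 0:
    try:
        # Each manual load adds a new "more page" button and the old one stops working, so click the last one
        driver.find_elements(By.CLASS_NAME,"js-more-page")[-1].click()
        print('Click 手動載入，' + '載入第' + str(15 + k) + '頁') # "Clicked manual load, loading page N"
        k = k+1
        time.sleep(1) # if this wait is too short, the next click fails because the new data has not loaded yet

    # To see the error message, use: except Exception as msg: print(msg)
    except Exception:
        # No clickable button is left (for example IndexError on an empty list), so stop
        k = 0
        print('No more Job')

# Parse the rendered page with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
List = soup.find_all('a',{'class':'js-job-link'})
print('共有 ' + str(len(List)) + ' 筆資料') # "N records in total"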

No more Job
共有 429 筆資料
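
The cell above scrolls a fixed 20 times. An alternative is to keep scrolling until the page height stops growing. The sketch below is a hedged illustration, not the notebook's method: `scroll_until_stable` is a hypothetical helper that takes two callables, so it can be tried without a browser. A slow network could still make the height look stable too early, so the pause may need tuning.

```python
import time

def scroll_until_stable(scroll, get_height, max_rounds=20, pause=0.6):
    """Call scroll() until get_height() stops changing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = get_height()
    for round_no in range(1, max_rounds + 1):
        scroll()
        time.sleep(pause)  # give the page time to append new results
        new_height = get_height()
        if new_height == last_height:
            return round_no  # nothing new was loaded by this scroll
        last_height = new_height
    return max_rounds

# Possible use with the Selenium driver from the cell above (not executed here):
# scroll_until_stable(
#     lambda: driver.execute_script('window.scrollTo(0, document.body.scrollHeight);'),
#     lambda: driver.execute_script('return document.body.scrollHeight'))
```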


In [None]:
List

[<a class="js-job-link" href="//www.104.com.tw/job/7toho?jobsource=hotjob_chr" target="_blank" title="系統設計師 System Designer">系統設計師 System Designer</a>,
 <a class="js-job-link" href="//www.104.com.tw/job/7r7t5?jobsource=hotjob_chr" target="_blank" title="資訊部-MIS工程師">資訊部-MIS工程師</a>,
 <a class="js-job-link" href="//www.104.com.tw/job/7itwv?jobsource=hotjob_chr" target="_blank" title="iOS / Android App工程師(薪優)">iOS / Android App工程師(薪優)</a>,
 <a class="js-job-link" href="//www.104.com.tw/job/7xxht?jobsource=hotjob_chr" target="_blank" title="React Natvie Engineer 工程師">React Natvie Engineer 工程師</a>,
 <a class="js-job-link" href="//www.104.com.tw/job/7r7nk?jobsource=hotjob_chr" target="_blank" title="系統分析師 System Analyst">系統分析師 System Analyst</a>,
 <a class="js-job-link" href="//www.104.com.tw/job/7i8c9?jobsource=hotjob_chr" target="_blank" title="MIS 資訊工程師 (新竹)">MIS 資訊工程師 (新竹)</a>,
 <a class="js-job-link" href="//www.104.com.tw/job/7xkgk?jobsource=jolist_c_date" target="_blank" title="【知名日商半導

In [None]:
import threading

JobList = pd.DataFrame()
lock = threading.Lock() # lock that guards JobList

def crawl_job_info(i):
    global JobList
    content = List[i]
    try:
        resp = requests.get('https://' + content.attrs['href'].strip('//'))
        soup2 = BeautifulSoup(resp.text,'html.parser')
        df = pd.DataFrame(
            data = [{
                  "model":"mainsite.baseinfo",
                  "pk":str(i),
                  "fields":{
                  'company_name':soup2.find('a', {'class':'btn-link t3 mr-6'}).text.strip(),
                  'job_title':content.attrs['title'].strip(),
                  'job_cate':','.join([item.text.replace('<u data-v-71fba476="">', '').replace('</u>', '') for item in soup2.select('div.trigger u')]),
                  'job_salary':soup2.find('p', {'class':'t3 mb-0 mr-2 text-primary font-weight-bold align-top d-inline-block'}).text.strip(),
                  'job_location':soup2.select('div.job-address span')[0].text.strip(),
                  'job_work_experience':soup2.select('div.job-requirement-table div.t3')[0].text.strip(),
                  'job_edu_require':soup2.select('div.job-requirement-table div.t3')[1].text.strip(),
                  'job_require_major':soup2.select('div.job-requirement-table div.t3')[2].text.strip(),
                  'job_tool_require':soup2.select('div.job-requirement-table div.t3')[4].text.strip(),
                  'job_applicant':soup2.find('a', {'class':'font-weight-bold d-inline-block pl-2 align-middle'}).text.strip().strip('人應徵'),
                  'date':soup2.select('div.job-header__title span')[0].text.strip().strip('更新'),
                  'job_link':'https://' + content.attrs['href'].strip('//')}
                 }]
           )
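        with lock: # hold the lock so that threads do not modify JobList at the same time
            JobList = pd.concat([JobList, df], ignore_index=True)
        print("Success and Crawl Next 目前正在爬第" + str(i) + "個職缺資訊") # "now crawling job posting number i"
        time.sleep(0.5) # pause 0.5 s after each job to reduce load on the server
    # A selector that matches nothing raises AttributeError (find() returned None) or IndexError (empty select()),
    # so catch Exception rather than only AssertionError
    except Exception as msg:
        print(msg)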

ThreadList = []
for i in range(len(List)):
    t = threading.Thread(target=crawl_job_info, args=(i,))
    ThreadList.append(t)

for t in ThreadList:
    t.start()

for t in ThreadList:
    t.join()

print('Finished!')


Success and Crawl Next 目前正在爬第66個職缺資訊
Success and Crawl Next 目前正在爬第75個職缺資訊
Success and Crawl Next 目前正在爬第4個職缺資訊
Success and Crawl Next 目前正在爬第58個職缺資訊
Success and Crawl Next 目前正在爬第29個職缺資訊
Success and Crawl Next 目前正在爬第9個職缺資訊Success and Crawl Next 目前正在爬第17個職缺資訊

Success and Crawl Next 目前正在爬第117個職缺資訊
Success and Crawl Next 目前正在爬第72個職缺資訊Success and Crawl Next 目前正在爬第13個職缺資訊
Success and Crawl Next 目前正在爬第86個職缺資訊
Success and Crawl Next 目前正在爬第36個職缺資訊

Success and Crawl Next 目前正在爬第96個職缺資訊
Success and Crawl Next 目前正在爬第106個職缺資訊
Success and Crawl Next 目前正在爬第61個職缺資訊
Success and Crawl Next 目前正在爬第55個職缺資訊
Success and Crawl Next 目前正在爬第41個職缺資訊
Success and Crawl Next 目前正在爬第110個職缺資訊Success and Crawl Next 目前正在爬第113個職缺資訊
Success and Crawl Next 目前正在爬第76個職缺資訊
Success and Crawl Next 目前正在爬第48個職缺資訊Success and Crawl Next 目前正在爬第14個職缺資訊Success and Crawl Next 目前正在爬第115個職缺資訊
Success and Crawl Next 目前正在爬第34個職缺資訊Success and Crawl Next 目前正在爬第24個職缺資訊



Success and Crawl Next 目前正在爬第27個職缺資訊
Success and Crawl Next 目前正在爬第111個職缺資
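
The interleaved lines above come from starting one thread per job all at once. A bounded worker pool is a gentler option: it limits how many requests run concurrently and returns the rows in input order, so no shared DataFrame or lock is needed. The following is a sketch under stated assumptions. `crawl_all` and `fake_fetch` are hypothetical names, and `fake_fetch` merely stands in for the real per-job scraping function.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def crawl_all(items, fetch_one, max_workers=8):
    """Run fetch_one(index, item) on a bounded thread pool.

    Returns (results, errors); results keep the input order.
    """
    def task(pair):
        index, item = pair
        try:
            return index, fetch_one(index, item), None
        except Exception as exc:  # record the failure instead of losing it in the thread
            return index, None, exc

    results, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        for index, value, exc in pool.map(task, enumerate(items)):
            if exc is None:
                results.append(value)
            else:
                errors.append((index, exc))
    return results, errors

# Stand-in for the real scraping function (hypothetical, no network access)
def fake_fetch(index, item):
    if item is None:
        raise AttributeError("selector matched nothing")
    time.sleep(0.01)
    return {"pk": str(index), "title": item}

rows, failed = crawl_all(["job A", None, "job C"], fake_fetch, max_workers=2)
print(rows)
print(failed)
```

The returned list of dictionaries can then be passed to `pd.DataFrame` once, which avoids repeated `pd.concat` calls.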

In [None]:
JobList

Unnamed: 0,model,pk,fields
0,mainsite.baseinfo,66,"{'company_name': '巨匠電腦股份有限公司(巨匠電腦)(巨匠美語)', 'jo..."
1,mainsite.baseinfo,75,"{'company_name': '愛爾蘭商益華科技股份有限公司台灣分公司', 'job_t..."
2,mainsite.baseinfo,4,"{'company_name': '巨匠電腦股份有限公司(巨匠電腦)(巨匠美語)', 'jo..."
3,mainsite.baseinfo,58,"{'company_name': '東佑達自動化科技股份有限公司', 'job_title'..."
4,mainsite.baseinfo,29,"{'company_name': '偉宸電子股份有限公司', 'job_title': '專..."
...,...,...,...
389,mainsite.baseinfo,389,"{'company_name': 'Cosen Mechatronics Co., Ltd_..."
390,mainsite.baseinfo,390,"{'company_name': '思霈科股份有限公司', 'job_title': '5G..."
391,mainsite.baseinfo,391,"{'company_name': '星通資訊股份有限公司', 'job_title': '防..."
392,mainsite.baseinfo,393,"{'company_name': '星通資訊股份有限公司', 'job_title': '通..."


In [None]:
JobList = pd.DataFrame()

MAX_RETRIES = 3 # give up on a job after this many failed attempts
retries = 0
i = 0
while i < len(List):
    # if (i == 78) or (i == 106) or (i == 126) or (i == 274) or (i == 356):
    #   i += 1
    #   continue
    # print('正在處理第' + str(i) + '筆，共 ' + str(len(List)) + ' 筆資料') # "processing record i of N"
    content = List[i]
    # try/except is used because crawling too fast sometimes gets blocked by the site;
    # when an error occurs, the same job is fetched again
    try:
        resp = requests.get('https://' + content.attrs['href'].strip('//'))
        soup2 = BeautifulSoup(resp.text,'html.parser')
        df = pd.DataFrame(
            data = [{
                  "model":"mainsite.baseinfo",
                  "pk":str(i),
                  "fields":{
                  'company_name':soup2.find('a', {'class':'btn-link t3 mr-6'}).text.strip(),
                  'job_title':content.attrs['title'].strip(),
                  'job_cate':','.join([item.text.replace('<u data-v-71fba476="">', '').replace('</u>', '') for item in soup2.select('div.trigger u')]),
                  'job_salary':soup2.find('p', {'class':'t3 mb-0 mr-2 text-primary font-weight-bold align-top d-inline-block'}).text.strip(),
                  'job_location':soup2.select('div.job-address span')[0].text.strip(),
                  'job_work_experience':soup2.select('div.job-requirement-table div.t3')[0].text.strip(),
                  'job_edu_require':soup2.select('div.job-requirement-table div.t3')[1].text.strip(),
                  'job_require_major':soup2.select('div.job-requirement-table div.t3')[2].text.strip(),
                  'job_tool_require':soup2.select('div.job-requirement-table div.t3')[4].text.strip(),
                  'job_applicant':soup2.find('a', {'class':'font-weight-bold d-inline-block pl-2 align-middle'}).text.strip().strip('人應徵'),
                  'date':soup2.select('div.job-header__title span')[0].text.strip().strip('更新'),
                  'job_link':'https://' + content.attrs['href'].strip('//')}
                 }]
           )
        JobList = pd.concat([JobList, df], ignore_index=True)
        i += 1
        retries = 0
        print("Success and Crawl Next 目前正在爬第" + str(i) + "個職缺資訊") # "now crawling job posting number i"
        time.sleep(0.5) # pause 0.5 s after each job to reduce load on the server

    # Catch Exception rather than only AssertionError: a selector that matches nothing
    # raises AttributeError (find() returned None) or IndexError (empty select())
    except Exception as msg:
        print(msg)
        retries += 1
        if retries >= MAX_RETRIES: # skip a job that keeps failing instead of retrying forever
            i += 1
            retries = 0

Success and Crawl Next 目前正在爬第1個職缺資訊
Success and Crawl Next 目前正在爬第2個職缺資訊
Success and Crawl Next 目前正在爬第3個職缺資訊
Success and Crawl Next 目前正在爬第4個職缺資訊
Success and Crawl Next 目前正在爬第5個職缺資訊
Success and Crawl Next 目前正在爬第6個職缺資訊
Success and Crawl Next 目前正在爬第7個職缺資訊
Success and Crawl Next 目前正在爬第8個職缺資訊
Success and Crawl Next 目前正在爬第9個職缺資訊
Success and Crawl Next 目前正在爬第10個職缺資訊
Success and Crawl Next 目前正在爬第11個職缺資訊
Success and Crawl Next 目前正在爬第12個職缺資訊
Success and Crawl Next 目前正在爬第13個職缺資訊
Success and Crawl Next 目前正在爬第14個職缺資訊
Success and Crawl Next 目前正在爬第15個職缺資訊
Success and Crawl Next 目前正在爬第16個職缺資訊
Success and Crawl Next 目前正在爬第17個職缺資訊
Success and Crawl Next 目前正在爬第18個職缺資訊
Success and Crawl Next 目前正在爬第19個職缺資訊
Success and Crawl Next 目前正在爬第20個職缺資訊
Success and Crawl Next 目前正在爬第21個職缺資訊
Success and Crawl Next 目前正在爬第22個職缺資訊
Success and Crawl Next 目前正在爬第23個職缺資訊
Success and Crawl Next 目前正在爬第24個職缺資訊
Success and Crawl Next 目前正在爬第25個職缺資訊
Success and Crawl Next 目前正在爬第26個職缺資訊
Success and Crawl Next 目前正在爬第27個職缺資訊
Success an

AttributeError: ignored
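
An `AttributeError` like the one recorded above is most likely raised when `.text` is read from `None`, which is what BeautifulSoup's `find` returns when no element matches. One way to guard each lookup is a small helper that falls back to a default value. The sketch below is hypothetical: `text_or_default` is not part of BeautifulSoup, and `FakeTag` only imitates the `.text` attribute so that the example runs without the library.

```python
def text_or_default(node, default=""):
    """Return node.text stripped, or `default` when the lookup found nothing."""
    if node is None:
        return default
    return node.text.strip()

class FakeTag:
    """Minimal stand-in for a BeautifulSoup Tag; only .text is modelled."""
    def __init__(self, text):
        self.text = text

print(repr(text_or_default(FakeTag("  待遇面議 "))))    # '待遇面議'
print(repr(text_or_default(None, default="N/A")))  # 'N/A'
```

In the crawler this would be used as, for example, `text_or_default(soup2.find('a', {'class': 'btn-link t3 mr-6'}))`, so that one missing field does not discard the whole job record.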

In [None]:
result = JobList.to_json(orient="records")
parsed = json.loads(result)
json.dumps(parsed, indent=4, ensure_ascii=False)

'[\n    {\n        "model": "mainsite.baseinfo",\n        "pk": "0",\n        "fields": {\n            "company_name": "奔馳科技股份有限公司",\n            "job_title": "Windows 驅動工程師",\n            "job_cate": "Internet程式設計師",\n            "job_salary": "待遇面議",\n            "job_location": "新竹縣竹北市縣政五街32巷8號7樓之3",\n            "job_work_experience": "1年以上",\n            "job_edu_require": "專科、大學、碩士",\n            "job_require_major": "不拘",\n            "job_tool_require": "不拘",\n            "job_applicant": "0~5",\n            "date": "03/10",\n            "job_link": "https://www.104.com.tw/job/76191?jobsource=hotjob_chr"\n        }\n    },\n    {\n        "model": "mainsite.baseinfo",\n        "pk": "1",\n        "fields": {\n            "company_name": "旺宏電子股份有限公司",\n            "job_title": "資訊工程類 - 網路管理工程師(MR160)",\n            "job_cate": "網路管理工程師,系統維護／操作人員,MIS／網管主管",\n            "job_salary": "待遇面議",\n            "job_location": "新竹市力行路16號",\n            "job_work_experience": "4年以上",\n  

In [None]:
# cp950 (Big5) cannot encode every character in the scraped text and raises UnicodeEncodeError, so write UTF-8 instead
with open('/content/data.json', 'w', encoding='utf-8') as f:
  json.dump(parsed, f, ensure_ascii=False, indent=4)
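
The cp950 (Big5) codec covers only part of Unicode, so writing scraped text with it can raise `UnicodeEncodeError`, while UTF-8 can encode any such text. The sketch below illustrates the difference with a made-up record. The `note` field and its emoji are invented for the demonstration, and a temporary directory is used instead of `/content`.

```python
import json
import os
import tempfile

# Made-up sample record; the emoji is there because cp950 has no code for it.
record = [{"job_title": "Windows 驅動工程師", "note": "emoji 😀"}]
text = json.dumps(record, ensure_ascii=False, indent=4)

# Encoding with cp950 fails for characters outside Big5.
try:
    text.encode("cp950")
    cp950_ok = True
except UnicodeEncodeError:
    cp950_ok = False
print("cp950 can encode the text:", cp950_ok)

# UTF-8 round-trips the same text without loss.
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded == record)
```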