# Web Scraping Example 1: nowcoder.com 

In this example, we scrape nowcoder.com to collect the title, link, and posting date of each shared interview-experience post.

In [1]:
from selenium import webdriver
import re
import bs4
import pandas as pd
import time
import datetime

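# Download chromedriver and pass your own path to it below.
# Selenium 4 deprecated and later removed the `executable_path` argument,
# so the path is passed through a Service object instead.
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service('/Users/leslietang/Documents/development/chromedriver'))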
url = r'https://www.nowcoder.com/discuss/experience?tagId=894&order=1&companyId=0'
driver.get(url)

# The page lazy-loads its content, so the initial HTML does not contain
# every post. Use Selenium to keep scrolling until the page height stops
# growing, which indicates that the bottom has been reached.
all_window_height = []  # page heights recorded after each scroll
all_window_height.append(driver.execute_script("return document.body.scrollHeight;"))  # initial page height
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # jump to the current bottom
    time.sleep(1)  # wait for the newly loaded content
    check_height = driver.execute_script("return document.body.scrollHeight;")
    if check_height == all_window_height[-1]:  
        print("I am at the bottom of the page.")
        break
    else:
        all_window_height.append(check_height)
        print("I am scrolling down the page.")

response = driver.page_source
soup = bs4.BeautifulSoup(response, 'html.parser')

I am scrolling down the page.
I am scrolling down the page.
I am scrolling down the page.
I am scrolling down the page.
I am scrolling down the page.
I am scrolling down the page.
I am scrolling down the page.
I am scrolling down the page.
I am scrolling down the page.
I am scrolling down the page.
I am scrolling down the page.
I am scrolling down the page.
I am scrolling down the page.
I am scrolling down the page.
I am scrolling down the page.
I am at the bottom of the page.
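
The scroll-until-stable loop above can also be packaged as a reusable function. The following is a minimal sketch, not the notebook's own code: `scroll_to_bottom`, its `max_rounds` safety cap, and the `FakeDriver` stand-in are hypothetical names introduced here so the logic can be exercised without a browser. With a real Selenium driver, only `scroll_to_bottom(driver)` would be needed.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=200):
    """Scroll until document.body.scrollHeight stops growing.

    Returns the list of page heights observed, starting with the initial one.
    `max_rounds` is an assumed safety cap so the loop cannot run forever.
    """
    get_height = "return document.body.scrollHeight;"
    heights = [driver.execute_script(get_height)]
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script(get_height)
        if new_height == heights[-1]:
            break  # height unchanged: assume the bottom was reached
        heights.append(new_height)
    return heights


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver, used only to test the loop."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self._heights[self._index]
        # Each scroll "loads" the next chunk until no more content remains.
        self._index = min(self._index + 1, len(self._heights) - 1)
        return None


print(scroll_to_bottom(FakeDriver([1000, 2500, 4000]), pause=0))
# prints [1000, 2500, 4000]
```

Comparing consecutive heights is a heuristic: if the site takes longer than `pause` seconds to load the next batch, the loop may stop early, so a longer pause may be needed on slow connections.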


In [2]:
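# Post titles are rendered as <h4> elements.
titles = [i.text.strip() for i in soup.find_all('h4')]

# Capture the part after /discuss/ in each link. The first three matches
# are skipped; presumably they are not posts (e.g. navigation links).
links = re.findall('<a href="/discuss/(.*?)">', response)[3:]
prefix = 'https://www.nowcoder.com/discuss/'
full_links = [prefix + l for l in links]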

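# Dates are in <span> elements with the CSS class "time".
dates = [i.text.strip() for i in soup.find_all('span', class_='time')]
# Entries that are not in YYYY-MM-DD form (10 characters), presumably relative
# timestamps for recent posts, are replaced with today's date.
today = datetime.date.today().isoformat()
dates = [d if len(d) == 10 else today for d in dates]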
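
The link and date clean-up in this cell relies on a positional slice and a length check. As an alternative sketch, both steps can be made pattern-based. Everything below is an assumption for illustration: `sample_html` is a made-up fragment (the real markup may differ), `extract_post_links` and `normalize_date` are hypothetical helpers, and the idea that post links always end in a numeric ID is inferred only from the links shown in the output table.

```python
import datetime
import re

# Hypothetical fragment imitating the listing markup; the real page may differ.
sample_html = '''
<a href="/discuss/experience?tagId=894">Experiences</a>
<a href="/discuss/425837">Post A</a>
<span class="time">2020-05-12</span>
<a href="/discuss/425872">Post B</a>
<span class="time">3 hours ago</span>
'''

# Assumption: a post link is /discuss/ followed only by digits.
POST_LINK = re.compile(r'<a href="/discuss/(\d+)">')
ISO_DATE = re.compile(r'\d{4}-\d{2}-\d{2}')


def extract_post_links(html, prefix='https://www.nowcoder.com/discuss/'):
    """Return full URLs for links whose path after /discuss/ is numeric."""
    return [prefix + post_id for post_id in POST_LINK.findall(html)]


def normalize_date(text, today=None):
    """Keep YYYY-MM-DD strings; map anything else to today's ISO date."""
    text = text.strip()
    if ISO_DATE.fullmatch(text):
        return text
    return (today or datetime.date.today()).isoformat()


print(extract_post_links(sample_html))
# prints ['https://www.nowcoder.com/discuss/425837', 'https://www.nowcoder.com/discuss/425872']
print(normalize_date('3 hours ago', today=datetime.date(2020, 5, 12)))
# prints 2020-05-12
```

Matching on the numeric ID avoids depending on how many non-post links precede the list, and returning an ISO string keeps every entry in the date column the same type.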

In [3]:
pd.DataFrame({'title':titles,'link':full_links, 'date': dates})

,title,link,date
0,实习 虎牙数分、贝壳机器学习/数据挖掘、阿里算法（凉经）面经,https://www.nowcoder.com/discuss/425837,2020-05-12
1,数分实习面经，攒人品许愿offer,https://www.nowcoder.com/discuss/425872,2020-05-12
2,【阿里菜鸟】数据分析实习生 面经,https://www.nowcoder.com/discuss/426035,2020-05-12
3,阿里 数据分析师 暑期实习（非技术） 四轮面试完整面经,https://www.nowcoder.com/discuss/425854,2020-05-12
4,快手数分二面面经+许愿HR面,https://www.nowcoder.com/discuss/425444,2020-05-11
...,...,...,...
465,这最近一个月的感想 烫,https://www.nowcoder.com/discuss/95058,2018-08-17
466,西安 中国银联数据分析岗2019内推面试,https://www.nowcoder.com/discuss/92555,2018-08-10
467,中国银联上海（专业类）数据分析(云闪付)的面试经历？,https://www.nowcoder.com/discuss/91268,2018-08-07
468,顺丰科技，一面技术面,https://www.nowcoder.com/discuss/89124,2018-07-30
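
Because titles, links, and dates are extracted independently, it can help to confirm that the three lists have the same length before combining them. The sketch below uses a hypothetical helper, `build_rows`, with made-up titles; the resulting list of dictionaries could then be passed to `pd.DataFrame(rows)`. Equal lengths do not prove that the entries line up row by row, so this is only a first sanity check.

```python
def build_rows(titles, links, dates):
    """Combine three parallel lists into row dictionaries.

    Raises ValueError if the lists differ in length, since that means
    at least one extraction step picked up extra or missing items.
    """
    lengths = {'title': len(titles), 'link': len(links), 'date': len(dates)}
    if len(set(lengths.values())) != 1:
        raise ValueError(f'column lengths differ: {lengths}')
    return [
        {'title': t, 'link': l, 'date': d}
        for t, l, d in zip(titles, links, dates)
    ]


rows = build_rows(
    ['Post A', 'Post B'],  # made-up titles for illustration
    ['https://www.nowcoder.com/discuss/425837',
     'https://www.nowcoder.com/discuss/425872'],
    ['2020-05-12', '2020-05-12'],
)
print(rows[0]['link'])
# prints https://www.nowcoder.com/discuss/425837
```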
