# **Crawl Data**

### **Build the list of URLs for the listing-card pages**

Searching for property listings for sale in Ho Chi Minh City returns the results as a list of cards. Each card holds basic information about the property, such as price, area, posting date, and a link to the detailed listing. \
The search results are paginated, with 604 pages in total.

In [7]:
url_list = ['https://batdongsan.vn/ban-nha-dat-ho-chi-minh']
for i in range(2, 605):  # pages 2..604; range() excludes the stop value
    url_list.append('https://batdongsan.vn/ban-nha-dat-ho-chi-minh/p' + str(i))
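The same list can be produced by a small helper that takes the page count as an explicit parameter. This is only a sketch: `build_page_urls` is an illustrative name, and it assumes the URL scheme used in the cell above, where the first page has no suffix and later pages end in `/p<n>`.

```python
def build_page_urls(base_url, n_pages):
    """Return the URLs of result pages 1..n_pages (inclusive)."""
    urls = [base_url]  # page 1 has no "/p<n>" suffix
    # range() excludes its stop value, so stop at n_pages + 1 to include the last page
    urls.extend(f"{base_url}/p{i}" for i in range(2, n_pages + 1))
    return urls


url_list = build_page_urls("https://batdongsan.vn/ban-nha-dat-ho-chi-minh", 604)
print(len(url_list))   # 604
print(url_list[-1])    # https://batdongsan.vn/ban-nha-dat-ho-chi-minh/p604
```

Keeping the page count as an argument makes it easy to rerun the crawl when the number of result pages changes.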

### **Extract the basic information with Selenium**

To install Selenium on Google Colab, follow this guide: https://medium.com/@MinatoNamikaze02/running-selenium-on-google-colab-a118d10ca5f8

In [None]:
%%shell
# Setup when running on Colab (the %%shell cell magic must be the first line of the cell)
sudo apt -y update
sudo apt install -y wget curl unzip
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb

wget -N https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/118.0.5993.70/linux64/chromedriver-linux64.zip -P /tmp/
unzip -o /tmp/chromedriver-linux64.zip -d /tmp/
chmod +x /tmp/chromedriver-linux64/chromedriver
mv /tmp/chromedriver-linux64/chromedriver /usr/local/bin/chromedriver
pip install selenium chromedriver_autoinstaller

In [None]:
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

from selenium import webdriver
import chromedriver_autoinstaller

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')  # required on Colab, which has no display
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chromedriver_autoinstaller.install()


In [8]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

In [29]:
from selenium.common.exceptions import NoSuchElementException

def find_or_none(card, xpath, attribute=None):
    # Return the element's text (or the given attribute); None if the element is missing
    try:
        element = card.find_element(By.XPATH, xpath)
    except NoSuchElementException:
        return None
    return element.get_attribute(attribute) if attribute else element.text

def get_data(driver, url):
    columns = ['title', 'price', 'area', 'time', 'page_link']
    driver.get(url)
    cards = driver.find_elements(By.XPATH, "//div[@class='uk-grid uk-grid-small uk-grid-width-1-1']/div")

    rows = []
    for card in cards:
        rows.append({
            'title': card.find_element(By.XPATH, ".//div[@class='name']").text,
            # price, area and time are optional: missing elements become None
            'price': find_or_none(card, ".//span[@class='price']"),
            'area': find_or_none(card, ".//span[@class='acreage']"),
            'time': find_or_none(card, ".//time[@class='timeago']", attribute='datetime'),
            'page_link': card.find_element(By.XPATH, ".//div[@class='name']/a").get_attribute('href'),
        })
    # Build the DataFrame once instead of appending row by row
    return pd.DataFrame(rows, columns=columns)

In [30]:

# Collect one DataFrame per results page, then concatenate them
driver = webdriver.Chrome()
# If running on Colab, use the following line instead
#driver = webdriver.Chrome(options=chrome_options)

try:
    frames = [get_data(driver, url) for url in url_list]
finally:
    driver.quit()  # always close the browser, even if a page fails
raw_data = pd.concat(frames, ignore_index=True)

In [31]:
# Save data to csv file
raw_data.to_csv('raw_data.csv', index=False)

# Save file to Google Drive when running on Colab
#from google.colab import drive
#drive.mount('/content/drive')
#raw_data.to_csv('/content/drive/My Drive/raw_data_p201_p604.csv', index=False)

In [32]:
raw_data.head()

Unnamed: 0,title,price,area,time,page_link
0,Chưa tới 30tr/m2 - Hàng ngộp bank BAO ĐẦU TƯ ...,3899000000 tỷ,150m2,2023-12-10 17:11:02,https://batdongsan.vn/chua-toi-30trm2-hang-ngo...
1,"Bán nhà HXH Âu Cơ Phường 9 Tân Bình, 51m2 3 Tầ...",5.5 tỷ,51m2,2023-12-10 18:40:26,https://batdongsan.vn/ban-nha-hxh-au-co-phuong...
2,"SÁT MẶT TIỀN PHAN ĐĂNG LƯU, PHƯỜNG 7, PHÚ NHUẬ...",4.6 tỷ,45m2,2023-12-10 18:56:17,https://batdongsan.vn/sat-mat-tien-phan-dang-l...
3,CHỦ GẤP BÁN TRƯỚC TẾT LÊ HỒNG PHONG QUẬN 5 RA ...,7.35 tỷ,41m2,2023-12-10 20:49:28,https://batdongsan.vn/chu-gap-ban-truoc-tet-le...
4,"LŨY BÁN BÍCH,TÂN PHÚ-DIỆN TÍCH KHỦNG 96M2 ( 4....",Thỏa thuận,96m2,2023-12-07 14:13:40,https://batdongsan.vn/luy-ban-bichtan-phu-dien...


The results are stored at: https://raw.githubusercontent.com/KhiemDangLe/Final-Project/main/grid_list_raw_data.csv

However, Selenium uses more resources and takes longer to run than the requests and bs4 libraries.
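As a rough illustration of the lighter-weight approach, the sketch below parses listing cards with bs4 alone. It assumes the cards appear in the static HTML under the same class names used in the XPath queries above, which has not been verified here. `SAMPLE_HTML`, `text_or_none`, and `parse_cards` are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment that mimics the structure targeted by the XPath queries
SAMPLE_HTML = """
<div class="uk-grid uk-grid-small uk-grid-width-1-1">
  <div>
    <div class="name"><a href="https://example.com/post-1">Sample listing 1</a></div>
    <span class="price">5.5 tỷ</span>
    <span class="acreage">51m2</span>
    <time class="timeago" datetime="2023-12-10 18:40:26"></time>
  </div>
  <div>
    <div class="name"><a href="https://example.com/post-2">Sample listing 2</a></div>
    <span class="acreage">96m2</span>
  </div>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_cards(html):
    """Extract one dict per listing card from a results page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Direct <div> children of the grid container are the cards
    for card in soup.select("div.uk-grid.uk-grid-small.uk-grid-width-1-1 > div"):
        name = card.find("div", class_="name")
        if name is None:
            continue  # not a listing card
        link = name.find("a")
        time_tag = card.find("time", class_="timeago")
        rows.append({
            "title": text_or_none(name),
            "price": text_or_none(card.find("span", class_="price")),
            "area": text_or_none(card.find("span", class_="acreage")),
            "time": time_tag.get("datetime") if time_tag is not None else None,
            "page_link": link.get("href") if link is not None else None,
        })
    return rows


cards = parse_cards(SAMPLE_HTML)
print(cards[1])
```

On a real page, the HTML string would come from something like `requests.get(url, timeout=30).text`. If the site renders its cards with JavaScript, this approach returns an empty list and Selenium is still needed.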

#### **Fetch data from the detailed listing pages**

In [3]:
import pandas as pd
raw_data = pd.read_csv('https://raw.githubusercontent.com/KhiemDangLe/Final-Project/main/grid_list_raw_data.csv', header=0)

In [39]:
raw_data.head()

Unnamed: 0,title,price,area,time,page_link
0,Chưa tới 30tr/m2 - Hàng ngộp bank BAO ĐẦU TƯ ...,3899000000 tỷ,150m2,2023-12-10 17:11:02,https://batdongsan.vn/chua-toi-30trm2-hang-ngo...
1,"Bán nhà HXH Âu Cơ Phường 9 Tân Bình, 51m2 3 Tầ...",5.5 tỷ,51m2,2023-12-10 18:40:26,https://batdongsan.vn/ban-nha-hxh-au-co-phuong...
2,"SÁT MẶT TIỀN PHAN ĐĂNG LƯU, PHƯỜNG 7, PHÚ NHUẬ...",4.6 tỷ,45m2,2023-12-10 18:56:17,https://batdongsan.vn/sat-mat-tien-phan-dang-l...
3,CHỦ GẤP BÁN TRƯỚC TẾT LÊ HỒNG PHONG QUẬN 5 RA ...,7.35 tỷ,41m2,2023-12-10 20:49:28,https://batdongsan.vn/chu-gap-ban-truoc-tet-le...
4,"LŨY BÁN BÍCH,TÂN PHÚ-DIỆN TÍCH KHỦNG 96M2 ( 4....",Thỏa thuận,96m2,2023-12-07 14:13:40,https://batdongsan.vn/luy-ban-bichtan-phu-dien...


One cause of slow data collection with BS4 is parsing speed. To speed it up, following the BS4 documentation, we use two libraries: lxml and cchardet. We also use multithreading to crawl the data faster.
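Because lxml is an optional dependency, it can help to choose the parser at runtime. The sketch below is one possible way to do that; `pick_parser` is an illustrative helper, not part of BS4.

```python
import importlib.util


def pick_parser():
    """Return 'lxml' when the lxml package is installed, else the built-in 'html.parser'."""
    return "lxml" if importlib.util.find_spec("lxml") is not None else "html.parser"


PARSER = pick_parser()
print(PARSER)
```

The result can then be passed as the second argument to `BeautifulSoup(markup, PARSER)`, so the code still runs (more slowly) on machines where lxml is not available.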

In [5]:
!pip install beautifulsoup4
!pip install lxml
!pip install pyproject-toml
!pip install cython
!pip install cchardet

Collecting pyproject-toml
  Downloading pyproject_toml-0.0.10-py3-none-any.whl.metadata (642 bytes)
Collecting wheel (from pyproject-toml)
  Using cached wheel-0.43.0-py3-none-any.whl.metadata (2.2 kB)
Collecting toml (from pyproject-toml)
  Downloading toml-0.10.2-py2.py3-none-any.whl.metadata (7.1 kB)
Collecting jsonschema (from pyproject-toml)
  Downloading jsonschema-4.22.0-py3-none-any.whl.metadata (8.2 kB)
Collecting jsonschema-specifications>=2023.03.6 (from jsonschema->pyproject-toml)
  Using cached jsonschema_specifications-2023.12.1-py3-none-any.whl.metadata (3.0 kB)
Collecting referencing>=0.28.4 (from jsonschema->pyproject-toml)
  Downloading referencing-0.35.1-py3-none-any.whl.metadata (2.8 kB)
Collecting rpds-py>=0.7.1 (from jsonschema->pyproject-toml)
  Downloading rpds_py-0.18.1-cp311-none-win_amd64.whl.metadata (4.2 kB)
Downloading pyproject_toml-0.0.10-py3-none-any.whl (6.9 kB)
Downloading jsonschema-4.22.0-py3-none-any.whl (88 kB)
   ---------------------------------

  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [22 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-cpython-311
      creating build\lib.win-amd64-cpython-311\cchardet
      copying src\cchardet\version.py -> build\lib.win-amd64-cpython-311\cchardet
      copying src\cchardet\__init__.py -> build\lib.win-amd64-cpython-311\cchardet
      running build_ext
      building 'cchardet._cchardet' extension
      creating build\temp.win-amd64-cpython-311
      creating build\temp.win-amd64-cpython-311\Release
      creating build\temp.win-amd64-cpython-311\Release\src
      creating build\temp.win-amd64-cpython-311\Release\src\cchardet
      creating build\temp.win-amd64-cpython-311\Release\src\ext
      creating build\temp.win-amd64-cpython-311\Release\src\ext\uchardet
      creating build\temp.win-amd64-cpython-311\Release\src

In [58]:
import requests
from bs4 import BeautifulSoup
import lxml
import cchardet
import re
import pandas as pd

def search_group(pattern, text):
    # First capture group of `pattern` in `text`, or None when text is missing or nothing matches
    match = re.search(pattern, text) if text else None
    return match.group(1) if match else None

def get_detail_data(url_list):
    columns = ['page_link', 'category', 'district', 'article_id', 'bedrom', 'wc', 'direction', 'balcony_direction', 'description']
    rows = []
    for loop, url in enumerate(url_list):
        print(loop)  # progress indicator
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'lxml')

        # header (breadcrumb): category and district
        try:
            header = soup.find('ul', class_='uk-breadcrumb').find_all('li')
            category = header[1].text[4:]
            district = header[3].text
        except Exception:
            category = None
            district = None

        # panel: plain text of the summary panel
        try:
            panel = soup.find('div', class_='landtech-container').find('div', class_='uk-panel').get_text()
        except Exception:
            panel = None

        # description: collapse whitespace and punctuation runs into single spaces
        try:
            description = soup.find_all('div', class_='landtech-container')[1].find('div', class_='content').get_text()
            description = re.sub(r'[\n \r \+\-#,.]+', ' ', description)
        except Exception:
            description = None

        rows.append({
            'page_link': url,
            'category': category,
            'district': district,
            'article_id': search_group(r'Mã tin:\s([^\s]+)', panel),
            'bedrom': search_group(r'(\d+) PN', panel),
            'wc': search_group(r'(\d+) WC', panel),
            'direction': search_group(r'Hướng nhà:\s([^\s]+)', panel),
            'balcony_direction': search_group(r'Hướng ban công:\s([^\s]+)', panel),
            'description': description,
        })
    # Build the DataFrame once instead of appending row by row
    return pd.DataFrame(rows, columns=columns)
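To see how the regular expressions in `get_detail_data` behave, the sketch below applies them to a made-up panel string. `sample_panel` is hypothetical, and the real page text may be laid out differently. `PATTERNS` and `extract_fields` are illustrative names.

```python
import re

# Hypothetical panel text; the layout of the real page may differ
sample_panel = "Mã tin: 123456\n3 PN\n2 WC\nHướng nhà: Đông\nHướng ban công: Tây"

# Same patterns as in get_detail_data, written as raw strings
PATTERNS = {
    "article_id": r"Mã tin:\s([^\s]+)",
    "bedrom": r"(\d+) PN",
    "wc": r"(\d+) WC",
    "direction": r"Hướng nhà:\s([^\s]+)",
    "balcony_direction": r"Hướng ban công:\s([^\s]+)",
}


def extract_fields(panel_text):
    """Return the first capture group of each pattern, or None when it is absent."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, panel_text) if panel_text else None
        fields[name] = match.group(1) if match else None
    return fields


fields = extract_fields(sample_panel)
print(fields)
```

Note that `[^\s]+` stops at the first whitespace character, so a two-word value such as "Đông Nam" would be captured as "Đông" only.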

In [60]:
# Save data to csv file
raw_detail_data.to_csv('raw_detail_data.csv', index=False)

# Save file to Google Drive when running on Colab
#from google.colab import drive
#drive.mount('/content/drive')
#raw_data.to_csv('/content/drive/My Drive/raw_data_p201_p604.csv', index=False)

In [61]:
df = pd.read_csv('raw_detail_data.csv') 

Fetching the data takes a very long time. To avoid starting over when a problem occurs, the work can be split into several runs.\
```python
raw_detail_data = get_detail_data(raw_data['page_link'][:3000])
#Save to file
raw_detail_data = get_detail_data(raw_data['page_link'][3000:6000])
#Save to file
raw_detail_data = get_detail_data(raw_data['page_link'][6000:9000])
#Save to file
raw_detail_data = get_detail_data(raw_data['page_link'][9000:])
#Save to file
```
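The slice boundaries can also be generated instead of typed by hand, which rules out gaps or overlaps between runs. The sketch below is one way to do it; `chunk_bounds` is an illustrative helper, and the total of 10000 links is a placeholder, not the real row count.

```python
def chunk_bounds(n_items, chunk_size):
    """Yield (start, stop) pairs that cover range(n_items) with no gaps or overlaps."""
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)


# 10000 is a placeholder for len(raw_data)
bounds = list(chunk_bounds(10000, 3000))
print(bounds)  # [(0, 3000), (3000, 6000), (6000, 9000), (9000, 10000)]
```

Each pair can be used as `raw_data['page_link'][start:stop]`, with the result of every run saved to its own file before the next one starts.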

#### **Update**

In [2]:
import requests
from bs4 import BeautifulSoup
import lxml
import cchardet
import re
import pandas as pd
import multiprocessing as mp

def get_detail_data(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')

    # header
    try:
        header = soup.find('ul', class_='uk-breadcrumb').find_all('li')
        category = header[1].text[4:]
        district = header[3].text
    except:
        category = None
        district = None

    # panel
    # Defaults, so these names are defined even when the panel cannot be parsed
    title = price = date_posted = None
    try:
        panel = soup.find('div', class_='landtech-container').find('div', class_='uk-panel')
        title = panel.find('h1', class_='uk-panel-title').get_text()
        price = panel.find('strong', class_='price').get_text()
        price = re.sub(r'[\n\t]+', '', price)
        # price unit: Nghìn, Triệu, Tỷ, Nghìn/m2, Triệu/m2, Tỷ/m2 (thousand, million, billion VND)
        date_posted = panel.find('time', class_='timeago').get('datetime')
        date_posted = re.search(r'(\d{4}-\d{2}-\d{2})', date_posted).group(1)
        panel = panel.get_text()
    except Exception:
        panel = None

    # Each field falls back to None when the panel is missing or the pattern does not match
    try:
        area = re.search(r'(\d+) m2', panel).group(1)
    except Exception:
        area = None
    try:
        bedroom = re.search(r'(\d+) PN', panel).group(1)
    except Exception:
        bedroom = None
    try:
        wc = re.search(r'(\d+) WC', panel).group(1)
    except Exception:
        wc = None
    try:
        direction = re.search(r'Hướng nhà:\s([^\s]+)', panel).group(1)
    except Exception:
        direction = None
    try:
        balcony_direction = re.search(r'Hướng ban công:\s([^\s]+)', panel).group(1)
    except Exception:
        balcony_direction = None
    try:
        article_id = re.search(r'Mã tin:\s([^\s]+)', panel).group(1)
    except Exception:
        article_id = None
    try:
        description = soup.find_all('div', class_='landtech-container')[1].find('div', class_='content').get_text()
        description = re.sub(r'[\n \t \r \+\-#,]+', ' ', description)
    except Exception:
        description = None

    return pd.DataFrame([{
        'page_link': url,
        'title': title,
        'article_id': article_id,
        'category': category,
        'district': district,
        'date_posted': date_posted,
        'price': price,
        'area': area,
        'bedrom': bedroom,
        'wc': wc,
        'direction': direction,
        'balcony_direction': balcony_direction,
        'description': description
    }])

def parallel_get_detail_data(url_list):
    with mp.Pool(processes=20) as pool:
        results = pool.map(get_detail_data, url_list)
    return pd.concat(results, ignore_index=True)

In [4]:
raw_detail_data = pd.DataFrame(columns=['page_link', 'article_id', 'title', 'category', 'district', 'date_posted', 'price', 'area', 'bedrom', 'wc', 'direction', 'balcony_direction', 'description'])
raw_detail_data = parallel_get_detail_data(raw_data['page_link'][:20])
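Since fetching pages is network-bound, a thread pool is a possible alternative to `mp.Pool`. Threads do not require the worker function to be sent to separate processes, a step that can fail or hang for functions defined inside a notebook on platforms that start processes with spawn, such as Windows. The sketch below uses a stand-in function in place of real requests; `fake_fetch` and `threaded_map` are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_fetch(url):
    """Stand-in for a network-bound function such as get_detail_data (hypothetical)."""
    time.sleep(0.05)  # simulate waiting on the network
    return {"page_link": url, "length": len(url)}


def threaded_map(func, urls, max_workers=20):
    """Apply func to every URL using a pool of threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(func, urls))


urls = [f"https://example.com/post-{i}" for i in range(40)]
results = threaded_map(fake_fetch, urls)
print(len(results))
```

With the function defined above, the equivalent call would be `pd.concat(threaded_map(get_detail_data, url_list), ignore_index=True)`.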

In [None]:
raw_detail_data.to_csv('raw_detail_data.csv', index=False)