# Web Scraping with Python

Python 2 uses urllib2, while Python 3 uses urllib.
The main difference is that the module was renamed and its submodule paths changed.
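
Code that has to run under both interpreters can try the Python 3 import path first and fall back to the Python 2 module. This is only a minimal sketch of that pattern. The rest of this notebook assumes Python 3.

```python
try:
    # Python 3: urlopen and the error classes live in submodules of urllib
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError
except ImportError:
    # Python 2 fallback: the same names were exposed directly by urllib2
    from urllib2 import urlopen, HTTPError, URLError

# Shows which module the function was actually loaded from
print(urlopen.__module__)
```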

In [1]:
from urllib.request import urlopen

In [2]:
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
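
The `b'...'` prefix in the output above shows that `read()` returns a bytes object, which is why line breaks are printed as `\n`. As a small sketch, the bytes can be decoded into a string before printing. The shortened sample and the UTF-8 encoding are assumptions made here so the example runs without a network connection.

```python
# A shortened stand-in for the bytes returned by html.read()
raw = b"<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n"

# Assumption: the page is UTF-8 encoded
text = raw.decode("utf-8")

# The decoded string prints with real line breaks
print(text)
```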


## BeautifulSoup4 Tutorial

In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [4]:
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bs = BeautifulSoup(html.read(), "html.parser")
print(bs.h1)

<h1>An Interesting Title</h1>


## Error Handling
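If you wake up to find that your scraper died overnight, blame yourself for not handling exceptions. Two things can go wrong when fetching a page:
1. The page cannot be found, for example because the URL path is wrong (urlopen raises HTTPError)
2. The server cannot be found (urlopen raises URLError)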

In [6]:
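from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    # The server responded with an error status such as 404 or 500
    print(e)
    # Return None or break out of the loop here
else:
    # No exception was raised, so processing can continue
    print("Keep going")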

Keep going
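
The cell above only catches HTTPError, which covers the first case in the list. As a rough sketch, both cases could be handled in one helper. The name `fetch` and its `opener` parameter are hypothetical and not part of the original notebook. `opener` exists only so the function can be exercised without network access.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch(url, opener=urlopen):
    """Return the response object, or None if the page or server is unavailable.

    `fetch` and `opener` are hypothetical names used for illustration.
    """
    try:
        return opener(url)
    except HTTPError as e:
        # Case 1: the server answered, but with an error status (404, 500, ...)
        print("HTTP error:", e.code)
    except URLError as e:
        # Case 2: the server could not be reached at all
        print("Server not found:", e.reason)
    return None
```

HTTPError is a subclass of URLError, so its `except` clause has to come first. Otherwise the URLError branch would catch both cases.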


- Make use of HTTPError as much as possible
- Every time you access a tag, check that it actually exists (None check, AttributeError)

In [8]:
# Accessing a tag that does not exist returns None
print(bs.nonExistentTag)

None




In [9]:
# Accessing a child of a missing tag raises AttributeError
print(bs.nonExistentTag.someTag)



AttributeError: 'NoneType' object has no attribute 'someTag'

The following cell checks explicitly for both situations.

In [10]:
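try:
    badContent = bs.nonExistingTag.anotherTag
except AttributeError:
    # bs.nonExistingTag was None, so .anotherTag could not be looked up
    print("Tag was not found")
else:
    # The outer tag exists, but the inner tag may still be missing
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)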

Tag was not found
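
The same two checks can also be folded into a small helper that stops as soon as one tag in the chain is missing. This is a sketch only: `find_nested` and `SAMPLE_HTML` are hypothetical names, and the inline HTML stands in for a downloaded page so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical inline page, used instead of a network request
SAMPLE_HTML = "<html><body><h1>An Interesting Title</h1></body></html>"


def find_nested(node, *tag_names):
    """Follow a chain of tags, returning None as soon as one is missing."""
    for tag_name in tag_names:
        if node is None:
            return None
        # find() returns the first matching descendant, or None
        node = node.find(tag_name)
    return node


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(find_nested(soup, "body", "h1"))
print(find_nested(soup, "nonExistingTag", "anotherTag"))
```

Because the helper never accesses an attribute on None, the caller only needs a single `is None` check on the result.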




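- Think about the overall pattern of your code when deciding where to handle exceptions
- Write general-purpose functions with thorough exception handling so that they can be reused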

In [11]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

In [14]:
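def getTitle(url):
    # Step 1: fetch the page; give up if the server returns an HTTP error
    try:
        html = urlopen(url)
    except HTTPError:
        return None
    # Step 2: parse the page; bs.body is None when <body> is missing,
    # so accessing .h1 on it raises AttributeError
    try:
        bs = BeautifulSoup(html.read(), "html.parser")
        title = bs.body.h1
    except AttributeError:
        return None
    return title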

In [13]:
title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

<h1>An Interesting Title</h1>
