# 2019.08.09. Class
## Web Scraper
- Web Browser
- Requests a specific page from the web server (a GET request)

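## Connection (urllib)
- Python
    - urllib (URL: Uniform Resource Locator): a standard Python library
        - Functions for requesting data over the web
        - Functions for handling cookies
        - Functions for changing metadata (headers, user agent, etc.)
    - urlopen: a general-purpose function in urllib.request for opening remote objects
        - HTML files, image files, and other file streams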
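
The metadata point above can be sketched with `urllib.request.Request`, which lets a request carry custom headers before it is handed to `urlopen`. This is a minimal sketch: the User-Agent string is a made-up value for illustration, and nothing is sent over the network here.

```python
from urllib.request import Request

url = "http://pythonscraping.com/pages/page1.html"

# Hypothetical User-Agent value, chosen only for illustration.
request = Request(url, headers={"User-Agent": "class-notes-scraper/0.1"})

# With no request body attached, urllib will issue a GET request.
print(request.get_method())

# Request stores header names via str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
```

Passing `request` to `urlopen(request)` instead of a plain URL string would send the request with that header.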

In [1]:
from urllib.request import urlopen

In [9]:
url = "http://pythonscraping.com/pages/page1.html"
#url = "http://www.naver.com"
html = urlopen(url)
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
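
The `b'...'` prefix in the output shows that `read()` returns bytes, not `str`. A minimal sketch of turning such bytes into text, assuming the page is UTF-8 encoded (the sample bytes are a shortened copy of the output above):

```python
# Shortened copy of the bytes printed above.
raw = b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n'

# decode() converts bytes to str; UTF-8 is an assumption about the page encoding.
text = raw.decode("utf-8")

print(type(raw).__name__, type(text).__name__)
print(text.splitlines()[2])
```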


## urllib.request.urlopen
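- Methods of the response object returned by urlopen
    - read([nbytes]): reads the data as a byte string (at most nbytes bytes if given)
    - readline(): reads one line of text as a byte string
    - info(): returns a mapping object holding the meta-information (headers) associated with the URL
    - getcode(): returns the HTTP response code as an integer
    - close(): closes the connection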
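
The methods listed above can be tried without touching a real site. The sketch below starts a throwaway local HTTP server and reads from it. `PAGE`, `DemoHandler`, and `inspect_response` are hypothetical names introduced only for this illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Tiny stand-in page served by the local demo server.
PAGE = b"<html>\n<head><title>Demo</title></head>\n</html>\n"


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer every GET request with the same small HTML page.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def inspect_response():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        response = urlopen(f"http://127.0.0.1:{server.server_port}/")
        status = response.getcode()                     # integer status code
        content_type = response.info()["Content-Type"]  # header lookup
        first_line = response.readline()                # bytes up to the first newline
        rest = response.read()                          # remaining bytes
        response.close()
    finally:
        server.shutdown()
        server.server_close()
    return status, content_type, first_line, rest


status, content_type, first_line, rest = inspect_response()
print(status, content_type, first_line)
```

Note that `readline()` consumes part of the stream, so the later `read()` returns only what is left.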

## BeautifulSoup

In [29]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

In [11]:
url = "http://pythonscraping.com/pages/page1.html"
html = urlopen(url)
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj)

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>



In [13]:
print(bsObj.title)

<title>A Useful Page</title>


In [15]:
print(bsObj.h1)

<h1>An Interesting Title</h1>


In [18]:
print(bsObj.div)

<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>


In [21]:
print(bsObj.h1.get_text())

An Interesting Title
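
Attribute access such as `bsObj.h1` returns the first matching tag anywhere in the document, so it is shorthand for spelling out the full path. A small sketch on an inline HTML string (a shortened stand-in for page1.html), so that it runs without a network connection:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page fetched above.
html_doc = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1><div>Lorem ipsum</div></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# The shortcut and the explicit path reach the same tag object.
same_tag = soup.h1 is soup.html.body.h1
print(same_tag)

# .name gives the tag name, get_text() gives the text without markup.
print(soup.h1.name, "|", soup.h1.get_text())
```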


## Exception handling
- Check whether the URL exists

In [33]:
try:
    url = "http://pythonscraping.com/pages/error.html"
    html = urlopen(url)  # raises HTTPError because the page does not exist
    bsObj = BeautifulSoup(html.read(), "html.parser")
except HTTPError as e:
    print(e)
else:
    # runs only when no exception was raised
    print("Terminated!!")

HTTP Error 404: Not Found
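
`HTTPError` covers the case where the server answers with an error status. When the server cannot be reached at all, urllib raises `URLError`, of which `HTTPError` is a subclass. A hedged sketch of handling both follows. `safe_open` and its `opener` parameter are hypothetical names, and the parameter exists only so the function can be exercised without a network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def safe_open(url, opener=urlopen):
    """Return the response, or None if the request failed."""
    try:
        return opener(url)
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        print(f"HTTP error {e.code} for {url}")
    except URLError as e:
        print(f"Could not reach {url}: {e.reason}")
    return None
```

Usage would look like `html = safe_open(url)` followed by an `if html is None:` check before parsing.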


- Check whether the tag exists

In [34]:
try:
    url = "http://pythonscraping.com/pages/page1.html"
    html = urlopen(url)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    # <head> exists but contains no <div>, so this is None rather than an error
    badContent = bsObj.head.div
except HTTPError as e:
    print(e)
except AttributeError as e:
    # raised only when a tag is looked up on None, e.g. if bsObj.head were missing
    print('Attribute Error > tag was not found')
else:
    if badContent is None:
        print('badContent was not found')
    else:
        print(badContent)

badContent was not found
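
The output above comes from the `else` branch, not from the `AttributeError` handler. Looking up a missing tag returns `None`; the exception appears only when a further lookup is made on that `None`. A minimal sketch on an inline HTML string (made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>T</title></head><body><h1>H</h1></body></html>",
    "html.parser",
)

# <head> exists but has no <div>: the lookup quietly returns None.
missing = soup.head.div
print(missing)

# One step further means attribute access on None, which raises.
try:
    soup.head.div.span
    outcome = "no error"
except AttributeError:
    outcome = "AttributeError"
print(outcome)
```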


In [37]:
def getLink(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        # e is an exception object, so convert it before concatenating
        print('HTTPError>' + str(e))
        return None  # without this, html would be undefined below
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        print(bsObj)
        # <title> sits inside <head>, so looking under <body> yields None
        title = bsObj.body.title
    except AttributeError as e:
        print('AttributeError>' + str(e))
        return None
    return title

url = "http://pythonscraping.com/pages/page1.html"
title = getLink(url)

if title is None:
    print('return None>title could not found')
else:
    print('Normal output:' + str(title))

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

return None>title could not found
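
The function returns `None` here because `<title>` lives under `<head>`, while the code searches under `<body>`. A hedged sketch of a variant that searches the whole document follows. `get_title` is a hypothetical helper, and it takes markup as a string so that it runs offline.

```python
from bs4 import BeautifulSoup


def get_title(markup):
    """Return the text of the first <title> tag, or None if there is none."""
    soup = BeautifulSoup(markup, "html.parser")
    title = soup.title  # searches the whole document, not only <body>
    return title.get_text() if title is not None else None


# Shortened stand-in for page1.html.
page = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1></body></html>"
)
print(get_title(page))
print(get_title("<html><body><p>no title here</p></body></html>"))
```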


In [39]:
url = "http://pythonscraping.com/pages/warandpeace.html"
html = urlopen(url)
bsObj = BeautifulSoup(html.read(), "html.parser")
tag = bsObj.find('h1')
print('------Tag output------')
print(tag)
print('------Output without tag------')
print(tag.get_text())
# strip takes a boolean; the string 'True' only worked because it is truthy
print(tag.get_text(strip=True))

------Tag output------
<h1>War and Peace</h1>
------Output without tag------
War and Peace
War and Peace
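
Both `get_text()` calls print the same line here because this `<h1>` has no surrounding whitespace. The difference shows up when the tag text is padded. The sketch below uses made-up markup for illustration:

```python
from bs4 import BeautifulSoup

# Made-up markup with whitespace padding inside the tag.
soup = BeautifulSoup("<h1>\n   War and Peace   \n</h1>", "html.parser")
tag = soup.find("h1")

raw_text = tag.get_text()              # whitespace preserved
clean_text = tag.get_text(strip=True)  # leading/trailing whitespace removed

print(repr(raw_text))
print(repr(clean_text))
```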


In [43]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

url = "http://pythonscraping.com/pages/warandpeace.html"
try:
    html = urlopen(url)
except HTTPError as e:
    print(e)
else:
    # parse only when the request succeeded; html is undefined otherwise
    bsObj = BeautifulSoup(html.read(), "html.parser")
    spans = bsObj.find_all('span')

    # .string is the text of a tag that has exactly one child
    for span in spans:
        print(span.string)

Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
If you have nothing better to do, Count [or Prince], and if the
prospect of spending an evening with a poor invalid is not too
terrible, I shall be very charmed to see you tonight between 7 and 10-
Annette Scherer.
Heavens! what a virulent attack!
the prince
Anna Pavlovna
First of all, dear friend, tell me how you are. Set your friend's
mind at rest,
Can one be well while suffering morally? Can one be calm in times
like these if one
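
`.string` works in the loop above because each `<span>` holds a single piece of text. When a tag has more than one child, `.string` is `None`, while `get_text()` still joins all the text. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup: one plain span, one span with a nested tag.
soup = BeautifulSoup(
    "<span>Anna</span><span>Anna <i>Pavlovna</i></span>",
    "html.parser",
)
simple, nested = soup.find_all("span")

print(simple.string)       # single text child
print(nested.string)       # two children, so no single string
print(nested.get_text())   # concatenates all nested text
```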

In [46]:
url = "http://pythonscraping.com/pages/warandpeace.html"
html = urlopen(url)
bsObj = BeautifulSoup(html.read(), "html.parser")
tag = bsObj.find('h1')
print('------Tag output------')
print(tag)
print('------Output without tag------')
print(tag.get_text())
print(tag.get_text(strip=True))

# collect every <span class="green">; on this page they hold the names
nameList = bsObj.find_all("span", {"class":"green"})
print('-------------------------')
print('nameList count : ',len(nameList))
for name in nameList:
    print(name.get_text())

------Tag output------
<h1>War and Peace</h1>
------Output without tag------
War and Peace
War and Peace
-------------------------
nameList count :  41
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
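
The attribute filter used above can be written either as a dictionary or with the `class_` keyword (the trailing underscore avoids the reserved word `class`). A minimal sketch on made-up markup, where the `red` class is an assumption added only for contrast:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html_doc = (
    '<p><span class="red">Heavens!</span> said '
    '<span class="green">Anna Pavlovna</span> to '
    '<span class="green">the prince</span></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

by_dict = soup.find_all("span", {"class": "green"})
by_keyword = soup.find_all("span", class_="green")

names = [span.get_text() for span in by_dict]
print(names)
print(len(by_dict) == len(by_keyword))
```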


In [55]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import re

url = 'https://en.wikipedia.org/wiki/Kevin_Bacon'
try:
    html = urlopen(url)
except HTTPError as e:
    print(e)
else:
    bsObj = BeautifulSoup(html, 'html.parser')
    # keep only article links: href starts with /wiki/ and contains no colon
    articleLink = re.compile(r"^(/wiki/)((?!:).)*$")
    for link in bsObj.find('div', {'id': 'bodyContent'}).find_all('a', href=articleLink):
        if 'href' in link.attrs:
            print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/The_Guardian
/wiki/Academy_Award
/wiki/Hollywood_Walk_of_Fame
/wiki/Social_networks
/wiki/Six_Degrees_of_Kevin_Bacon
/wiki/SixDegrees.org
/wiki/Philadelphia
/wiki/Edmund_Bacon_(architect)
/wiki/Julia_R._Masterman_High_School
/wiki/Pennsylvania_Governor%27s_School_for_the_Arts
/wiki/Bucknell_University
/wiki/Glory_Van_Scott
/wiki/Circle_in_the_Square
/wiki/Nancy_Mills
/wiki/Cosmopolitan_(magazine)
/wiki/Fraternities_and_sororities
/wiki/Animal_House
/wiki/Search_for_Tomorrow
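
The regular expression is what keeps this list limited to article pages: the href must start with `/wiki/`, and the negative lookahead `(?!:)` rejects any later colon, which drops namespaced pages such as `Category:` or `File:`. A small sketch that applies the same pattern to a few sample hrefs (the sample list is made up for illustration):

```python
import re

article_link = re.compile(r"^(/wiki/)((?!:).)*$")

# Made-up sample hrefs of the kinds that appear on a Wikipedia page.
candidates = [
    "/wiki/Kevin_Bacon",
    "/wiki/Category:American_actors",
    "/wiki/File:Kevin_Bacon.jpg",
    "//wikimediafoundation.org/",
    "#cite_note-1",
    "/wiki/Footloose_(1984_film)",
]

kept = [href for href in candidates if article_link.search(href)]
print(kept)
```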