## Scraping a Page with BeautifulSoup

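### 1. Parsing HTML and Making the Soup
- Parsing is the first step in extracting the information contained in a website.
- BeautifulSoup is a Python library for pulling data out of HTML and XML files.
- Its advantages are that it is easy to use and lets you scrape data from a website quickly.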

In [5]:
from bs4 import BeautifulSoup
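# lxml does not need to be imported here; install it only if you switch to the "lxml" parser below

# Load the HTML file (the page declares utf-8, so read it with that encoding)
with open('./data/website.html', encoding='utf-8') as file:
    contents = file.read()

# Parse it
soup = BeautifulSoup(contents, "html.parser")
# soup = BeautifulSoup(contents, "lxml")  # alternative parser; requires the lxml package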
soup

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<title>Angela's Personal Site</title>
</head>
<body>
<h1 id="name">Angela Yu</h1>
<p><em>Founder of <strong><a href="https://www.appbrewery.co/">The App Brewery</a></strong>.</em></p>
<p>I am an iOS and Web Developer. I ❤️ coffee and motorcycles.</p>
<hr/>
<h3 class="heading">Books and Teaching</h3>
<ul>
<li>The Complete iOS App Development Bootcamp</li>
<li>The Complete Web Development Bootcamp</li>
<li>100 Days of Code - The Complete Python Bootcamp</li>
</ul>
<hr/>
<h3 class="heading">Other Pages</h3>
<a href="https://angelabauer.github.io/cv/hobbies.html">My Hobbies</a>
<a href="https://angelabauer.github.io/cv/contact-me.html">Contact Me</a>
</body>
</html>

In [16]:
# Get the contents of the title tag
soup.title.string

"Angela's Personal Site"

In [9]:
# Indent the HTML code
soup.prettify()

'<!DOCTYPE html>\n<html>\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   Angela\'s Personal Site\n  </title>\n </head>\n <body>\n  <h1 id="name">\n   Angela Yu\n  </h1>\n  <p>\n   <em>\n    Founder of\n    <strong>\n     <a href="https://www.appbrewery.co/">\n      The App Brewery\n     </a>\n    </strong>\n    .\n   </em>\n  </p>\n  <p>\n   I am an iOS and Web Developer. I ❤️ coffee and motorcycles.\n  </p>\n  <hr/>\n  <h3 class="heading">\n   Books and Teaching\n  </h3>\n  <ul>\n   <li>\n    The Complete iOS App Development Bootcamp\n   </li>\n   <li>\n    The Complete Web Development Bootcamp\n   </li>\n   <li>\n    100 Days of Code - The Complete Python Bootcamp\n   </li>\n  </ul>\n  <hr/>\n  <h3 class="heading">\n   Other Pages\n  </h3>\n  <a href="https://angelabauer.github.io/cv/hobbies.html">\n   My Hobbies\n  </a>\n  <a href="https://angelabauer.github.io/cv/contact-me.html">\n   Contact Me\n  </a>\n </body>\n</html>'
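
The output above is hard to read because a notebook shows the raw string returned by `prettify()`, with the line breaks displayed as `\n`. A minimal sketch of the difference, using a tiny made-up snippet (the `<p>Hi</p>` markup and the variable names are assumptions for illustration only):

```python
from bs4 import BeautifulSoup

# Tiny illustrative document (not the website.html file used above)
tiny = BeautifulSoup("<p>Hi</p>", "html.parser")

# prettify() returns a plain string with one tag or text node per line
pretty = tiny.prettify()

# print() renders the line breaks instead of showing them as \n
print(pretty)

# Splitting on line breaks gives one entry per tag or text node
lines = [line.strip() for line in pretty.splitlines()]
```

Wrapping the call in `print()` is usually what you want when inspecting a page by eye.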

In [10]:
# Get the first anchor tag
soup.a

<a href="https://www.appbrewery.co/">The App Brewery</a>

In [17]:
# Get the first p tag
soup.p

<p><em>Founder of <strong><a href="https://www.appbrewery.co/">The App Brewery</a></strong>.</em></p>
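
Dot access such as `soup.a` or `soup.p` returns only the first matching tag, and `.string` only works when a tag wraps a single piece of text. The sketch below illustrates both points on the first paragraph of the page, rebuilt here as a standalone string so the example runs on its own (the variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# The first <p> of the page, copied into a standalone string
html = (
    '<p><em>Founder of <strong><a href="https://www.appbrewery.co/">'
    'The App Brewery</a></strong>.</em></p>'
)
snippet = BeautifulSoup(html, "html.parser")

first_p = snippet.p

# .string is None here because the <em> inside holds several children
p_string = first_p.string

# get_text() joins all the text found anywhere inside the tag
p_text = first_p.get_text()

# The <a> tag wraps a single string, so .string works
a_string = snippet.a.string

# Dot access for a tag that does not exist returns None rather than raising
missing_tag = snippet.h2

print(p_string, p_text, a_string, missing_tag, sep="\n")
```

When a tag may contain nested markup, `get_text()` is the safer choice.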

### 2. Finding and Selecting Specific Elements with BeautifulSoup

In [21]:
# Find all anchor tags
all_anchor_tags = soup.find_all(name="a")
all_anchor_tags

[<a href="https://www.appbrewery.co/">The App Brewery</a>,
 <a href="https://angelabauer.github.io/cv/hobbies.html">My Hobbies</a>,
 <a href="https://angelabauer.github.io/cv/contact-me.html">Contact Me</a>]

In [22]:
# Find all p tags
all_p_tags = soup.find_all(name="p")
all_p_tags

[<p><em>Founder of <strong><a href="https://www.appbrewery.co/">The App Brewery</a></strong>.</em></p>,
 <p>I am an iOS and Web Developer. I ❤️ coffee and motorcycles.</p>]

In [23]:
# Extract the text of each anchor tag
for tag in all_anchor_tags:
    print(tag.getText())

The App Brewery
My Hobbies
Contact Me


In [24]:
# Extract the URL of each anchor tag
for tag in all_anchor_tags:
    print(tag.get("href"))

https://www.appbrewery.co/
https://angelabauer.github.io/cv/hobbies.html
https://angelabauer.github.io/cv/contact-me.html
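
Every anchor on this page has an `href`, but real pages often contain anchors without one. The sketch below shows how `get()` and square-bracket access behave in that case, and how to ask `find_all` for only the anchors that carry an `href`. The second anchor in the snippet is made up for illustration.

```python
from bs4 import BeautifulSoup

# One real link from the page plus a hypothetical anchor with no href
html = (
    '<a href="https://www.appbrewery.co/">The App Brewery</a>'
    '<a id="top">Top</a>'
)
links = BeautifulSoup(html, "html.parser")
anchors = links.find_all("a")

# get() returns None when the attribute is missing
hrefs = [tag.get("href") for tag in anchors]

# Square-bracket access raises KeyError for a missing attribute
try:
    anchors[1]["href"]
    missing_raises = False
except KeyError:
    missing_raises = True

# href=True keeps only the tags that have an href attribute at all
with_href = [tag["href"] for tag in links.find_all("a", href=True)]

print(hrefs, missing_raises, with_href, sep="\n")
```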


In [25]:
# Find the h1 by its id
heading = soup.find(name="h1", id="name")
heading

<h1 id="name">Angela Yu</h1>

In [28]:
# Find an element by its class attribute (class_ is used because class is a reserved word in Python)
heading = soup.find(name="h3", class_="heading")
heading.getText()

'Books and Teaching'
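
Note that `find` returned only the first of the two `h3.heading` elements. A short sketch of how `find` and `find_all` differ, and of what happens when nothing matches (the `footer` class is a made-up name used only to force a miss):

```python
from bs4 import BeautifulSoup

# The two headings from the page, copied into a standalone string
html = (
    '<h3 class="heading">Books and Teaching</h3>'
    '<h3 class="heading">Other Pages</h3>'
)
page = BeautifulSoup(html, "html.parser")

# find() stops at the first match
first_heading = page.find("h3", class_="heading").get_text()

# find_all() returns every match as a list
all_headings = [h.get_text() for h in page.find_all("h3", class_="heading")]

# find() returns None when nothing matches, so check before calling methods on it
missing = page.find("h3", class_="footer")
missing_text = missing.get_text() if missing is not None else ""

print(first_heading, all_headings, repr(missing_text), sep="\n")
```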

In [29]:
# To reach a tag nested inside another tag, use a CSS selector with select_one()
company_url = soup.select_one(selector="p a")
company_url

<a href="https://www.appbrewery.co/">The App Brewery</a>

In [30]:
# Select an element by its id with the "#" CSS selector
name_heading = soup.select_one(selector="#name")
name_heading

<h1 id="name">Angela Yu</h1>

In [32]:
# Select every element with a given class using select()
headings = soup.select(".heading")
headings

[<h3 class="heading">Books and Teaching</h3>,
 <h3 class="heading">Other Pages</h3>]
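
CSS selectors can also be combined. As a rough sketch, the example below pulls the book titles with a descendant selector, narrows the class selector to `h3` tags, and filters links by the start of their `href`. The HTML is a trimmed copy of the page above, and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the page body so the example runs on its own
html = """
<h3 class="heading">Books and Teaching</h3>
<ul>
<li>The Complete iOS App Development Bootcamp</li>
<li>The Complete Web Development Bootcamp</li>
</ul>
<h3 class="heading">Other Pages</h3>
<a href="https://www.appbrewery.co/">The App Brewery</a>
<a href="https://angelabauer.github.io/cv/hobbies.html">My Hobbies</a>
"""
page = BeautifulSoup(html, "html.parser")

# "ul li" matches every <li> nested anywhere inside a <ul>
books = [li.get_text() for li in page.select("ul li")]

# "h3.heading" matches only <h3> tags that carry the heading class
heading_texts = [h.get_text() for h in page.select("h3.heading")]

# [href^="..."] matches links whose href starts with the given prefix
cv_links = [a["href"] for a in page.select('a[href^="https://angelabauer.github.io/"]')]

# select_one() returns None when the selector matches nothing
no_table = page.select_one("table")

print(books, heading_texts, cv_links, no_table, sep="\n")
```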