## Scraping a Page with BeautifulSoup

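### 1. Parsing HTML and Creating the Soup
- Parsing is the first step in extracting the information contained in a website.
- BeautifulSoup is a Python library for pulling data out of HTML and XML files.
- Its main advantages are that it is easy to use and lets you scrape data from a website quickly.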

In [5]:
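from bs4 import BeautifulSoup

# Load the HTML file (the page declares UTF-8, so read it as UTF-8)
with open('./data/website.html', encoding='utf-8') as file:
    contents = file.read()

# Parse the document
soup = BeautifulSoup(contents, "html.parser")
# Alternative parser; requires the lxml package to be installed:
# soup = BeautifulSoup(contents, "lxml")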
soup

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<title>Angela's Personal Site</title>
</head>
<body>
<h1 id="name">Angela Yu</h1>
<p><em>Founder of <strong><a href="https://www.appbrewery.co/">The App Brewery</a></strong>.</em></p>
<p>I am an iOS and Web Developer. I ❤️ coffee and motorcycles.</p>
<hr/>
<h3 class="heading">Books and Teaching</h3>
<ul>
<li>The Complete iOS App Development Bootcamp</li>
<li>The Complete Web Development Bootcamp</li>
<li>100 Days of Code - The Complete Python Bootcamp</li>
</ul>
<hr/>
<h3 class="heading">Other Pages</h3>
<a href="https://angelabauer.github.io/cv/hobbies.html">My Hobbies</a>
<a href="https://angelabauer.github.io/cv/contact-me.html">Contact Me</a>
</body>
</html>
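The cell above keeps the `lxml` parser as a commented-out alternative. A minimal sketch of choosing between the two at runtime, assuming you prefer `lxml` when it is installed and want to fall back to the built-in `html.parser` otherwise. The helper name `make_soup` and the sample markup are hypothetical and not part of the original notebook.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if available, else with the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


# Hypothetical sample document, used only for illustration.
soup = make_soup("<html><head><title>Demo</title></head><body><p>Hello</p></body></html>")
print(soup.title.string)
```

Both parsers build the same kind of tree for well-formed pages like this one; they can differ in how they repair broken markup.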

In [16]:
# Get the contents of the title tag
soup.title.string

"Angela's Personal Site"
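`.string` works for the title because that tag contains a single piece of text. A short sketch of what happens when a tag has several children, and how `get_text()` behaves in that case. The markup is a trimmed copy of the sample page; the title text is a stand-in.

```python
from bs4 import BeautifulSoup

html = (
    "<title>Demo page</title>"
    '<p><em>Founder of <strong><a href="https://www.appbrewery.co/">'
    "The App Brewery</a></strong>.</em></p>"
)
soup = BeautifulSoup(html, "html.parser")

# A tag with exactly one text child: .string returns that text.
title_text = soup.title.string
print(title_text)  # Demo page

# The <em> inside <p> has three children (text, <strong>, text),
# so .string cannot pick one and returns None.
p_string = soup.p.string
print(p_string)  # None

# get_text() concatenates all nested text instead.
p_text = soup.p.get_text()
print(p_text)  # Founder of The App Brewery.
```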

In [9]:
# Indent the HTML (prettify() returns a string; wrap it in print() to see the line breaks)
soup.prettify()

'<!DOCTYPE html>\n<html>\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   Angela\'s Personal Site\n  </title>\n </head>\n <body>\n  <h1 id="name">\n   Angela Yu\n  </h1>\n  <p>\n   <em>\n    Founder of\n    <strong>\n     <a href="https://www.appbrewery.co/">\n      The App Brewery\n     </a>\n    </strong>\n    .\n   </em>\n  </p>\n  <p>\n   I am an iOS and Web Developer. I ❤️ coffee and motorcycles.\n  </p>\n  <hr/>\n  <h3 class="heading">\n   Books and Teaching\n  </h3>\n  <ul>\n   <li>\n    The Complete iOS App Development Bootcamp\n   </li>\n   <li>\n    The Complete Web Development Bootcamp\n   </li>\n   <li>\n    100 Days of Code - The Complete Python Bootcamp\n   </li>\n  </ul>\n  <hr/>\n  <h3 class="heading">\n   Other Pages\n  </h3>\n  <a href="https://angelabauer.github.io/cv/hobbies.html">\n   My Hobbies\n  </a>\n  <a href="https://angelabauer.github.io/cv/contact-me.html">\n   Contact Me\n  </a>\n </body>\n</html>'

In [10]:
# Get the first anchor tag
soup.a

<a href="https://www.appbrewery.co/">The App Brewery</a>

In [17]:
# Get the first p tag
soup.p

<p><em>Founder of <strong><a href="https://www.appbrewery.co/">The App Brewery</a></strong>.</em></p>

### 2. Finding and Selecting Specific Elements with BeautifulSoup

In [21]:
# Find all anchor tags
all_anchor_tags = soup.find_all(name="a")
all_anchor_tags

[<a href="https://www.appbrewery.co/">The App Brewery</a>,
 <a href="https://angelabauer.github.io/cv/hobbies.html">My Hobbies</a>,
 <a href="https://angelabauer.github.io/cv/contact-me.html">Contact Me</a>]

In [22]:
# Find all p tags
all_p_tags = soup.find_all(name="p")
all_p_tags

[<p><em>Founder of <strong><a href="https://www.appbrewery.co/">The App Brewery</a></strong>.</em></p>,
 <p>I am an iOS and Web Developer. I ❤️ coffee and motorcycles.</p>]

In [23]:
# Extract the text of each anchor tag
for tag in all_anchor_tags:
    print(tag.getText())

The App Brewery
My Hobbies
Contact Me


In [24]:
# Extract the URL (href attribute) of each anchor tag
for tag in all_anchor_tags:
    print(tag.get("href"))

https://www.appbrewery.co/
https://angelabauer.github.io/cv/hobbies.html
https://angelabauer.github.io/cv/contact-me.html
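The loop above reads attributes with `tag.get("href")`. A small sketch of how that differs from dictionary-style access when an attribute is missing. The second anchor without an `href` is a made-up case added for illustration.

```python
from bs4 import BeautifulSoup

# The second <a> deliberately has no href attribute (hypothetical example).
html = '<a href="https://www.appbrewery.co/">The App Brewery</a><a>No link here</a>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

# .get() returns None (or a default you supply) when the attribute is absent.
missing = second.get("href")
fallback = second.get("href", "#")

# Indexing raises KeyError for a missing attribute.
try:
    second["href"]
    raised = False
except KeyError:
    raised = True

# href=True keeps only the tags that actually carry the attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(missing, fallback, raised, hrefs)
```

Using `.get()` is the more forgiving choice when scraping pages whose markup you do not control.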


In [25]:
# Find the h1 element by its id
heading = soup.find(name="h1", id="name")
heading

<h1 id="name">Angela Yu</h1>

In [28]:
# Find an element by its class attribute (find() returns only the first match)
heading = soup.find(name="h3", class_="heading")
heading.getText()

'Books and Teaching'

In [29]:
# How do you pick out a nested tag? -> use a CSS selector with select_one()
company_url = soup.select_one(selector="p a")
company_url

<a href="https://www.appbrewery.co/">The App Brewery</a>

In [30]:
# How do you select the h1 by its id? -> use the "#name" selector
name_heading = soup.select_one(selector="#name")
name_heading

<h1 id="name">Angela Yu</h1>

In [32]:
# Select every element with a given class using select()
headings = soup.select(".heading")
headings

[<h3 class="heading">Books and Teaching</h3>,
 <h3 class="heading">Other Pages</h3>]
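The cells in this section use two families of lookups: `find` / `find_all` with keyword filters, and `select_one` / `select` with CSS selectors. The sketch below puts them side by side on a trimmed copy of the sample page to show that they reach the same elements.

```python
from bs4 import BeautifulSoup

html = """
<h1 id="name">Angela Yu</h1>
<p><em>Founder of <strong><a href="https://www.appbrewery.co/">The App Brewery</a></strong>.</em></p>
<h3 class="heading">Books and Teaching</h3>
<h3 class="heading">Other Pages</h3>
"""
soup = BeautifulSoup(html, "html.parser")

# Lookup by id: keyword filter vs. "#id" selector.
by_find = soup.find("h1", id="name").get_text()
by_select = soup.select_one("#name").get_text()

# Lookup by class: class_ keyword (class is reserved in Python) vs. ".class" selector.
find_headings = [h.get_text() for h in soup.find_all("h3", class_="heading")]
select_headings = [h.get_text() for h in soup.select("h3.heading")]

# Nested lookup: "p a" matches any <a> that sits somewhere inside a <p>.
nested_link = soup.select_one("p a").get("href")

print(by_find, by_select)
print(find_headings == select_headings)
print(nested_link)
```

Keyword filters tend to read well for simple lookups, while CSS selectors are convenient once the path to an element involves several levels of nesting.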

### 3. Scraping a Live Website
- Scrape a real website
    - https://news.ycombinator.com/ -> Hacker News site

In [1]:
from bs4 import BeautifulSoup 
import requests

url = "https://news.ycombinator.com/"
response = requests.get(url)
print(response.text)

<html lang="en" op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?R8kUq8d9jHbN7kt0kRJT">
        <link rel="shortcut icon" href="favicon.ico">
          <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">
        <title>Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">
        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img src="y18.svg" width="18" height="18" style="border:1px white solid; display:block"></a></td>
                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
                            <a href="newest">new</a> | <a href
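The request above assumes the server answers promptly and successfully. A hedged sketch of a slightly more defensive fetch, with a timeout and an explicit status check. The helper name `fetch_html`, the 10-second default, and the `User-Agent` string are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the body of url as text, raising on network or HTTP errors."""
    # Hypothetical User-Agent value; identify your own script as you see fit.
    headers = {"User-Agent": "bs4-tutorial-example/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Turn 4xx/5xx responses into an exception instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

It could be used as `soup = BeautifulSoup(fetch_html("https://news.ycombinator.com/"), "html.parser")`. Any failure surfaces as a subclass of `requests.exceptions.RequestException`, which the caller can catch in one place.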

In [19]:
# Scrape the title of the first article
yc_web_page = response.text

soup = BeautifulSoup(yc_web_page, "html.parser")
article_tag = soup.select_one(".titleline")
article_text = article_tag.getText()
article_text

'LÖVE: a framework to make 2D games in Lua (love2d.org)'

In [32]:
# Scrape the link and upvote count of the first article
yc_web_page = response.text

soup = BeautifulSoup(yc_web_page, "html.parser")
article_tag = soup.select_one("td > span.titleline > a")
article_text = article_tag.getText()
article_link = article_tag.get("href")
article_upvote = soup.select_one(".score").getText()

print(article_text, article_link, article_upvote)

LÖVE: a framework to make 2D games in Lua https://love2d.org/ 116 points


In [41]:
# Scrape every link and upvote count
yc_web_page = response.text

soup = BeautifulSoup(yc_web_page, "html.parser")
articles = soup.select("td > span.titleline > a")

article_texts = []
article_links = []

for article_tag in articles:
    text = article_tag.getText()
    article_texts.append(text)
    
    link = article_tag.get("href")
    article_links.append(link)
    
    
article_upvote = [int(score.getText().split()[0]) for score in soup.select(".score")]

In [42]:
# Check the results
print(article_texts)
print(article_links)
print(article_upvote)

['LÖVE: a framework to make 2D games in Lua', 'Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion', 'Any sufficiently advanced uninstaller is indistinguishable from malware', 'Openmoji', '70B Llama 2 at 35tokens/second on 4090', 'Credit card debt collection', 'iPhone 12 withdrawn from French market for non-compliance with EU regulation', 'A CD Spectrometer', 'What Is Wrong with TOML?', 'Interactive Map of Linux Kernel', "Amazon's Union-Busting Training Video", 'S32 Unix Clock', 'How long it took different companies to find product-market fit', 'Rubber hose animation', 'Typing on Any Surface: A Deep Learning Method for Keystroke Detection in AR', 'The Pirate Preservationists', 'Lessons from YC AI Startups', "Google has been rolling out Chrome's “Enhanced Ad Privacy” via a popup", 'French regulators order Apple to halt sales of the iPhone 12', 'iPhone 15 and iPhone 15 Plus', "Pablo Fanque's Fair (2011)", 'Some notes on local-first development', 'Fine-tune your own Llama 2 to re

In [53]:
# Find the article with the most upvotes
# (assumes the titles and scores line up one-to-one)
largest_number = max(article_upvote)
largest_index = article_upvote.index(largest_number)

print(article_texts[largest_index])
print(article_links[largest_index])

Fine-tune your own Llama 2 to replace GPT-3.5/4
item?id=37484135
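The cells above collect titles and scores into separate lists and match them by index. That only works if every article has a score. If some row has none (job postings may not, for example), the lists drift apart and the index points at the wrong article. The sketch below pairs each title with the score in the row that follows it, and resolves relative links such as `item?id=...` against the site URL. The sample markup is a hypothetical, simplified stand-in for the Hacker News table, and the `tr.athing` / `.subtext` structure is an assumption about that page which may change.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://news.ycombinator.com/"

# Hypothetical markup shaped like the front page: a title row, then a metadata row.
SAMPLE_HTML = """
<table>
  <tr class="athing" id="1">
    <td class="title"><span class="titleline"><a href="https://example.com/a">Story A</a></span></td>
  </tr>
  <tr><td class="subtext"><span class="score" id="score_1">116 points</span></td></tr>
  <tr class="athing" id="2">
    <td class="title"><span class="titleline"><a href="item?id=2">Job posting</a></span></td>
  </tr>
  <tr><td class="subtext"><span class="age">1 hour ago</span></td></tr>
  <tr class="athing" id="3">
    <td class="title"><span class="titleline"><a href="item?id=3">Story C</a></span></td>
  </tr>
  <tr><td class="subtext"><span class="score" id="score_3">250 points</span></td></tr>
</table>
"""


def parse_stories(html, base_url=BASE_URL):
    """Return one dict per story, keeping title, link and points together."""
    soup = BeautifulSoup(html, "html.parser")
    stories = []
    for row in soup.select("tr.athing"):
        link_tag = row.select_one("span.titleline > a")
        if link_tag is None:
            continue
        # The score, when present, lives in the next table row.
        meta_row = row.find_next_sibling("tr")
        score_tag = meta_row.select_one(".score") if meta_row else None
        points = int(score_tag.get_text().split()[0]) if score_tag else 0
        stories.append({
            "title": link_tag.get_text(),
            # urljoin leaves absolute URLs alone and completes relative ones.
            "link": urljoin(base_url, link_tag.get("href", "")),
            "points": points,
        })
    return stories


stories = parse_stories(SAMPLE_HTML)
top_story = max(stories, key=lambda story: story["points"])
print(top_story["title"], top_story["link"], top_story["points"])
```

With the two-list approach, the same sample would yield scores `[116, 250]` and index 1 would select "Job posting" instead of "Story C".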
