# IR Homework #1
- Building a web crawler
- 1. Static crawling with BeautifulSoup
- 2. Dynamic crawling with Selenium
- 3. Using an API provided by the website
- 4. Crawling with natural language through an LLM

## Target Page
- Wikipedia English

## 1. Static Crawling with BeautifulSoup
- Crawl the content of the target page
- Source: [Wiki to Excel: how to crawl the web using 'Beautiful Soup'](https://free-eunb.tistory.com/15)

In [39]:
!pip install notebook
!pip install beautifulsoup4


- Fetch the HTML document of the page from its URL
- Target: [Wikipedia En - 'George Washington' Page](https://en.wikipedia.org/wiki/George_Washington)

In [40]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://en.wikipedia.org/wiki/George_Washington")  
bs4Object = BeautifulSoup(html, "html.parser")
bs4Object

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-disabled vector-feature-client-preferences-disabled" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>George Washington - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled
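
`urlopen` sends Python's default user agent and returns a response object rather than decoded text. The following is a minimal sketch, assuming you want to identify the client explicitly and surface network errors. The `USER_AGENT` value and the helpers `build_request` and `fetch_html` are hypothetical names that do not appear in the original notebook.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Hypothetical identifier; replace it with your own project name and contact.
USER_AGENT = "MyWikiCrawlingHW/0.1 (contact: you@example.com)"


def build_request(url: str) -> Request:
    """Create a request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its body decoded as text."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (HTTPError, URLError) as exc:
        raise RuntimeError(f"Could not fetch {url}: {exc}") from exc


# Usage (requires network access):
# html = fetch_html("https://en.wikipedia.org/wiki/George_Washington")
# bs4Object = BeautifulSoup(html, "html.parser")
```

The decoded string can be passed to `BeautifulSoup` in the same way as the response object used above.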

### Selecting Content with CSS Selectors
- Elements can be selected by tag name, class name, or id, as shown below.
```
bs4Object.select('tag')
bs4Object.select('.class_name')
bs4Object.select('#id_name')
bs4Object.select('parent > child > grandchild')
bs4Object.select('parent.class_name > child.class_name')  # '>' selects direct children only
bs4Object.select('parent.class_name descendant')          # a space selects any descendant
bs4Object.select('parent > child descendant')
bs4Object.select('tag.class_name')
bs4Object.select('#id_name > tag.class_name')
bs4Object.select('tag[attribute=value]')
```
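
The selector patterns can be tried out on a small document before running them against a live page. This is a sketch on a hand-written HTML fragment; the markup is an assumption made for illustration and is much simpler than the real Wikipedia infobox.

```python
from bs4 import BeautifulSoup

# A small hand-written HTML fragment (not the real Wikipedia markup).
html = """
<div id="content">
  <table class="infobox">
    <tr><th class="infobox-header">President</th></tr>
    <tr><td class="infobox-full-data" colspan="2">1789 - 1797</td></tr>
    <tr><td class="infobox-label"><span>Born</span></td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("th")                       # tag name
by_class = soup.select(".infobox-full-data")     # class name
by_id = soup.select("#content")                  # id
direct_child = soup.select("tr > td")            # '>' matches direct children only
descendant = soup.select("table.infobox span")   # a space matches any descendant
by_attribute = soup.select('td[colspan="2"]')    # attribute equals value

print([tag.get_text() for tag in by_class])
```

`select` always returns a list, so an empty list (rather than an exception) signals that a selector matched nothing.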

In [41]:
office_time_code = bs4Object.select('.infobox-full-data')
office_time_code

[<td class="infobox-full-data" colspan="2"><link href="mw-data:TemplateStyles:r1066479718" rel="mw-deduplicated-inline-style"/></td>,
 <td class="infobox-full-data" colspan="2" style="border-bottom:none"><span class="nowrap"><b>In office</b></span><br/>April 30, 1789 – March 4, 1797</td>,
 <td class="infobox-full-data" colspan="2" style="border-bottom:none"><span class="nowrap"><b>In office</b></span><br/>July 13, 1798 – December 14, 1799</td>,
 <td class="infobox-full-data" colspan="2" style="border-bottom:none"><span class="nowrap"><b>In office</b></span><br/>June 19, 1775 – December 23, 1783</td>,
 <td class="infobox-full-data" colspan="2" style="border-bottom:none"><span class="nowrap"><b>In office</b></span><br/>April 30, 1788 – December 14, 1799</td>,
 <td class="infobox-full-data" colspan="2" style="border-bottom:none"><span class="nowrap"><b>In office</b></span><br/>September 5, 1774 – June 16, 1775</td>,
 <td class="infobox-full-data" colspan="2" style="border-bottom:none"><sp

- To get only the text of the selected elements, use `get_text()`

In [42]:
office_time_contents = [code.get_text() for code in office_time_code][1:]
office_time_contents

['In officeApril 30, 1789\xa0– March 4, 1797',
 'In officeJuly 13, 1798\xa0– December 14, 1799',
 'In officeJune 19, 1775\xa0– December 23, 1783',
 'In officeApril 30, 1788\xa0– December 14, 1799',
 'In officeSeptember 5, 1774\xa0– June 16, 1775',
 'In officeJuly 24, 1758\xa0– June 24, 1775']
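
The strings above still contain the glued-on "In office" label and a non-breaking space (`\xa0`). One possible cleanup is sketched below; the helper name `parse_term` is hypothetical, and the code assumes every entry has the form "In office<start> – <end>" with an en dash, as in the output shown.

```python
import re


def parse_term(raw: str):
    """Split one infobox string into a (start, end) pair of date strings."""
    text = raw.replace("\xa0", " ")          # non-breaking space -> regular space
    text = re.sub(r"^In office", "", text)   # drop the label that get_text() glued on
    if "–" not in text:
        raise ValueError(f"No en dash found in: {raw!r}")
    start, end = (part.strip() for part in text.split("–", 1))
    return start, end


sample = [
    "In officeApril 30, 1789\xa0– March 4, 1797",
    "In officeJuly 13, 1798\xa0– December 14, 1799",
]
for item in sample:
    print(parse_term(item))
```

Alternatively, `get_text(separator=" ")` inserts a space between the pieces of text, which avoids the missing space after "In office" in the first place.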

## 2. Dynamic Crawling with Selenium
- Crawl the content of the page by driving a real web browser
- Selenium code sources: [Implementing a crawler with the automation tool 'selenium' and collecting data from three providers](https://blog.bizspring.co.kr/%ED%85%8C%ED%81%AC/selenium-%ED%81%AC%EB%A1%A4%EB%9F%AC-%EA%B5%AC%ED%98%84-3%EC%82%AC-%EB%8D%B0%EC%9D%B4%ED%84%B0/) | [Building your own web crawler (3): building an invincible crawler with Selenium](https://beomi.github.io/2017/02/27/HowToMakeWebCrawler-With-Selenium/)
- Webdriver source: [How to install Chrome on Ubuntu Linux (command line)](https://jinseongsoft.tistory.com/430)

In [43]:
!pip install selenium==3.141


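### Installing the Web Driver
- Selenium automates crawling through a web browser, so the browser and its driver program must be installed first
- To drive Google Chrome, the code below installs Chrome itself and `webdriver-manager`, which is used later to obtain chromedriver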

In [44]:
!curl -LO https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
!apt-get install -y ./google-chrome-stable_current_amd64.deb
!pip install webdriver-manager
!google-chrome --version

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 99.2M  100 99.2M    0     0  56.2M      0  0:00:01  0:00:01 --:--:-- 56.1M
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Note, selecting 'google-chrome-stable' instead of './google-chrome-stable_current_amd64.deb'
google-chrome-stable is already the newest version (117.0.5938.132-1).
0 upgraded, 0 newly installed, 0 to remove and 86 not upgraded.
Google Chrome 117.0.5938.132


- Install the ChromeDriver that matches Google Chrome 117.0.5938.132 -> [Link](https://chromedriver.chromium.org/downloads)
- After installation, load a page for the first time
- Source for locating ChromeDriver automatically: [Fixing the Python selenium WebDriverException error](https://playground.naragara.com/674/)
- Source for the Chrome options: [How to fix the 'DevToolsActivePort file doesn't exist' error](https://study-grow.tistory.com/entry/DevToolsActivePort-file-doesnt-exist-error-%ED%95%B4%EA%B2%B0%EB%B2%95)

In [45]:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument("--single-process")
chrome_options.add_argument("--disable-dev-shm-usage")

# chromedriver: pass the options via `options=` (`chrome_options=` is deprecated)
driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
driver.implicitly_wait(3)

# Load Page
driver.get(url='https://en.wikipedia.org/wiki/George_Washington')
print(driver.current_url)


https://en.wikipedia.org/wiki/George_Washington


- Selenium provides a variety of methods through the driver object.
- Method for opening a URL
```
get('http://url.com')
```
- Methods for accessing a single element on the page
```
find_element_by_name('HTML_name')
find_element_by_id('HTML_id')
find_element_by_xpath('/html/body/some/xpath')
find_element_by_css_selector('#css > div.selector')
find_element_by_class_name('some_class_name')
find_element_by_tag_name('h1')
```
- Methods for accessing multiple elements on the page
- (In most cases, simply change `element` to `elements`.)
```
find_elements_by_css_selector('#css > div.selector')
```
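
The `find_element_by_*` helpers listed above belong to Selenium 3 (pinned in this notebook as 3.141) and were removed in later Selenium 4 releases, whereas the generic `find_elements(by, value)` call is available in both. The sketch below wraps that generic call. The helper names `find_all` and `texts_by_class` are hypothetical, and the locator strings are written out by hand on the assumption that they mirror the constants on `selenium.webdriver.common.by.By`.

```python
# Locator strategy strings; these mirror the constants on
# selenium.webdriver.common.by.By (By.CLASS_NAME, By.CSS_SELECTOR, By.XPATH).
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
XPATH = "xpath"


def find_all(driver, by, value):
    """Return every element matching the locator via the generic find_elements call."""
    return driver.find_elements(by, value)


def texts_by_class(driver, class_name):
    """Collect the visible text of all elements that carry the given class."""
    return [element.text for element in find_all(driver, CLASS_NAME, class_name)]


# Usage with the driver created earlier in this notebook:
# job_titles = texts_by_class(driver, "infobox-header")
# rows = find_all(driver, CSS_SELECTOR, "#css > div.selector")
```

In real code, importing `By` from Selenium is preferable to repeating the strings; they are spelled out here only to keep the sketch free of a browser dependency.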

- Retrieve George Washington's term of office for each position

In [46]:
# Terms of office by position
print("============= Terms of office by position =============")
job_elements = driver.find_elements_by_class_name('infobox-header')
time_elements = driver.find_elements_by_class_name('infobox-full-data')[1:]
for time, job in zip(time_elements, job_elements):
    print(f"Position: {job.text} \nTerm: {time.text}")
    print("=========================================")

Position: 1st President of the United States 
Term: In office
April 30, 1789 – March 4, 1797
Position: 7th Senior Officer of the United States Army 
Term: In office
July 13, 1798 – December 14, 1799
Position: Commander in Chief of the Continental Army 
Term: In office
June 19, 1775 – December 23, 1783
Position: 14th Chancellor of the College of William & Mary 
Term: In office
April 30, 1788 – December 14, 1799
Position: Delegate from Virginia to the Continental Congress 
Term: In office
September 5, 1774 – June 16, 1775
Position: Member of the Virginia House of Burgesses 
Term: In office
July 24, 1758 – June 24, 1775


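### Searching and Crawling with Selenium
- Suppose that, after crawling George Washington, we want information about George III (King of Great Britain during the American Revolutionary War)
- Navigate to the corresponding Wikipedia page and crawl the reign in the same way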

In [49]:
driver.get(url='https://en.wikipedia.org/wiki/George_III')

In [57]:
job_elements = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div[1]/table[1]/tbody/tr[3]/th')
time_elements = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div[1]/table[1]/tbody/tr[4]/td')

for j in job_elements:
    print(j.text)

King of Great Britain and Ireland[a]
Elector/King of Hanover[b]


In [58]:
# Terms of office by position
print("============= Terms of office by position =============")
job_elements = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div[1]/table[1]/tbody/tr[3]/th')
time_elements = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div[1]/table[1]/tbody/tr[4]/td')
for time, job in zip(time_elements, job_elements):
    print(f"Position: {job.text} \nTerm: {time.text}")
    print("=========================================")

Position: King of Great Britain and Ireland[a]
Elector/King of Hanover[b] 
Term: 25 October 1760 – 29 January 1820
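
The XPath expressions used for George III address table rows by position (`tr[3]`, `tr[4]`), so they stop matching if the page inserts or removes a row. The sketch below contrasts a positional path with an attribute-based one. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, and a hand-written, well-formed fragment that is an assumption for illustration, not the real Wikipedia markup.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed stand-in for the page structure (not real Wikipedia markup).
doc = ET.fromstring("""
<body>
  <div id="mw-content-text">
    <div>
      <table>
        <tbody>
          <tr><th>George III</th></tr>
          <tr><td>portrait</td></tr>
          <tr><th class="infobox-header">King of Great Britain and Ireland</th></tr>
          <tr><td class="infobox-data">25 October 1760 – 29 January 1820</td></tr>
        </tbody>
      </table>
    </div>
  </div>
</body>
""".strip())

# Positional path, same shape as the XPath used above: breaks if a row is inserted.
positional = doc.findall(".//*[@id='mw-content-text']/div[1]/table[1]/tbody/tr[3]/th")

# Attribute-based path: keeps matching as long as the class name stays the same.
by_class = doc.findall(".//th[@class='infobox-header']")

print(positional[0].text, "|", by_class[0].text)
```

With Selenium the same idea applies: a class-based or attribute-based locator tends to survive layout changes better than a long positional path copied from the browser's developer tools.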


## 3. Using the API Provided by the Site
- Content the provider does not want accessed (for example, for copyright reasons) is not exposed through the API, so no special handling is needed on our side
- This example uses the Wikipedia API provided by Wikipedia
- Sources: [Collecting text from Wikipedia with Python](https://brunch.co.kr/@ueber/198) | [Python data crawling: Wikipedia](https://cromboltz.tistory.com/7) | [Wikipedia-API 0.6.0 Docs in pypi](https://pypi.org/project/Wikipedia-API/)

- Install wikipedia-api

In [59]:
!pip install wikipedia-api

Collecting wikipedia-api
  Downloading Wikipedia_API-0.6.0-py3-none-any.whl (14 kB)
Installing collected packages: wikipedia-api
Successfully installed wikipedia-api-0.6.0

In [63]:
import wikipediaapi  # import the API wrapper used below
wiki = wikipediaapi.Wikipedia('MyWikiCrawlingHW (lyhthy6@naver.com)', 'en')

In [67]:
search_word = 'George Washington'
page_py = wiki.page(search_word)
print("Page - Exists: %s" % page_py.exists())
print(f"{search_word}: {page_py.fullurl}")

search_word = 'George III'
page_py = wiki.page(search_word)
print("Page - Exists: %s" % page_py.exists())
print(f"{search_word}: {page_py.fullurl}")

Page - Exists: True
George Washington: https://en.wikipedia.org/wiki/George_Washington
Page - Exists: True
George III: https://en.wikipedia.org/wiki/George_III


- The library provides a method for listing the title of each section of the page

In [68]:
def print_sections(sections, level=0):
    for s in sections:
        print("%s: %s - %s" % ("*" * (level + 1), s.title, s.text[0:40]))
        print_sections(s.sections, level + 1)


print_sections(page_py.sections)

*: Early life - George was born in Norfolk House in St J
*: Accession and marriage - In 1759, George was smitten with Lady Sa
*: Early reign - 
**: Early regnal years - George, in his accession speech to Parli
**: Legislation and politics - In May 1762, the incumbent Whig governme
**: Family issues and discontent in America - George was deeply devout and spent hours
*: American War of Independence - The American War of Independence was the
*: Mid reign - 
**: Government - With the collapse of Lord North's minist
**: Signs of illness - By this time, George's health was deteri
*: Later reign - 
**: War in Europe - After George's recovery, his popularity,
**: Final years - In late 1810, at the height of his popul
*: Slavery - Over the course of George's reign, a coa
*: Legacy - George was succeeded in turn by two of h
*: Titles, styles, honours and arms - 
**: Titles and styles - 4 June 1738 – 31 March 1751: His Royal H
**: Honours - Great Britain: Royal Knight of the Garte
**: Arms - Bef
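
When the section outline is needed as data rather than printed text, the same recursion can return a flat list. This is a sketch: `collect_sections` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only imitate the three attributes used above (`title`, `text`, `sections`).

```python
from types import SimpleNamespace


def collect_sections(sections, level=1):
    """Return a flat list of (level, title, text_length) tuples, depth-first."""
    rows = []
    for section in sections:
        rows.append((level, section.title, len(section.text)))
        rows.extend(collect_sections(section.sections, level + 1))
    return rows


# Stand-in objects with the same attributes used above (title, text, sections).
demo = [
    SimpleNamespace(title="Early reign", text="", sections=[
        SimpleNamespace(title="Early regnal years",
                        text="George, in his accession speech", sections=[]),
    ]),
    SimpleNamespace(title="Legacy", text="George was succeeded", sections=[]),
]

for level, title, length in collect_sections(demo):
    print("*" * level, title, length)
```

With a real page, `collect_sections(page_py.sections)` would be the corresponding call, and a text length of 0 marks sections that only contain subsections.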

- The text of a specific section can be retrieved cleanly

In [71]:
section_history = page_py.section_by_title('Government')
print("%s - %s" % (section_history.title, section_history.text[0:50]))

Government - With the collapse of Lord North's ministry in 1782


- The summary of the page can be retrieved as well

In [72]:
page_py.summary

'George III (George William Frederick; 4 June 1738 – 29 January 1820) was King of Great Britain and Ireland from 25 October 1760 until his death in 1820. The Acts of Union 1800 unified Great Britain and Ireland into the United Kingdom of Great Britain and Ireland, with George as its king. He was concurrently Duke and Prince-elector of Brunswick-Lüneburg ("Hanover") in the Holy Roman Empire before becoming King of Hanover on 12 October 1814. He was a monarch of the House of Hanover who, unlike his two predecessors, was born in Great Britain, spoke English as his first language, and never visited Hanover.George was born during the reign of his paternal grandfather, King George II, as the first son of Frederick, Prince of Wales, and Princess Augusta of Saxe-Gotha. Following his father\'s death in 1751, Prince George became heir apparent and Prince of Wales. He succeeded to the throne on George II\'s death in 1760. The following year, he married Princess Charlotte of Mecklenburg-Strelitz, 
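
A comparable summary can also be requested over plain HTTP. The sketch below assumes the MediaWiki Action API with the `prop=extracts` query; it is not necessarily how the `wikipedia-api` package builds its requests, and `build_summary_url` and `fetch_summary` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_summary_url(title: str) -> str:
    """Build a query URL asking for the plain-text intro of one page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,       # only the section before the first heading
        "explaintext": 1,   # plain text instead of HTML
        "titles": title,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def fetch_summary(title: str, user_agent: str) -> str:
    """Call the API and return the extract of the first page in the response."""
    request = Request(build_summary_url(title), headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        data = json.load(response)
    pages = data["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")


print(build_summary_url("George III"))
# Usage (requires network access):
# print(fetch_summary("George III", "MyWikiCrawlingHW (you@example.com)")[:200])
```

The wrapper library remains the more convenient choice; the raw request mainly helps when debugging or when only a single field is needed.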

## 4. Natural-Language Crawling with an LLM
- Up to this point, crawling has been approached as a programming task
- This section introduces crawling with natural language using a Large Language Model (LLM)
- Source: [Langchain Docs](https://python.langchain.com/docs/get_started/introduction)

In [83]:
!pip install wikipedia
!pip install langchain
!pip install openai

Collecting openai
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.0/77.0 kB 4.5 MB/s eta 0:00:00
Installing collected packages: openai
Successfully installed openai-0.28.1

- Wikipedia content can be retrieved through the API.

In [78]:
from langchain.retrievers import WikipediaRetriever

retriever = WikipediaRetriever()
docs = retriever.get_relevant_documents(query=search_word)
docs[0].metadata

{'title': 'George III',
 'summary': 'George III (George William Frederick; 4 June 1738 – 29 January 1820) was King of Great Britain and Ireland from 25 October 1760 until his death in 1820. The Acts of Union 1800 unified Great Britain and Ireland into the United Kingdom of Great Britain and Ireland, with George as its king. He was concurrently Duke and Prince-elector of Brunswick-Lüneburg ("Hanover") in the Holy Roman Empire before becoming King of Hanover on 12 October 1814. He was a monarch of the House of Hanover who, unlike his two predecessors, was born in Great Britain, spoke English as his first language, and never visited Hanover.George was born during the reign of his paternal grandfather, King George II, as the first son of Frederick, Prince of Wales, and Princess Augusta of Saxe-Gotha. Following his father\'s death in 1751, Prince George became heir apparent and Prince of Wales. He succeeded to the throne on George II\'s death in 1760. The following year, he married Princess

In [79]:
docs[0].page_content[:400]

'George III (George William Frederick; 4 June 1738 – 29 January 1820) was King of Great Britain and Ireland from 25 October 1760 until his death in 1820. The Acts of Union 1800 unified Great Britain and Ireland into the United Kingdom of Great Britain and Ireland, with George as its king. He was concurrently Duke and Prince-elector of Brunswick-Lüneburg ("Hanover") in the Holy Roman Empire before b'

- Uses OpenAI's GPT-3.5-turbo (paid)

In [84]:
# get a token: https://platform.openai.com/account/api-keys
from getpass import getpass

OPENAI_API_KEY = getpass()

In [85]:
import os

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [86]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model_name="gpt-3.5-turbo")  # switch to 'gpt-4'
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)

In [88]:
questions = [
    "Tell me how George III was involved in the American Revolutionary War.",
    "Tell me the length of George III's reign.",
    "Tell me the duration of the American Revolutionary War.",
]
chat_history = []

for question in questions:
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: Tell me how George III was involved in the American Revolutionary War. 

**Answer**: George III was the King of Great Britain during the American Revolutionary War. He played a significant role in the conflict as the leader of the British government and commander-in-chief of British forces. King George III strongly opposed the colonists' demands for greater autonomy and resisted their push for independence.

Under George III's rule, British policies such as the Stamp Act, Townshend Acts, and the Tea Act were implemented, which imposed taxes and regulations on the American colonies. These measures sparked widespread protests and resistance among the colonists, leading to increased tensions between Britain and its American colonies.

As the war escalated, George III maintained a firm stance against the colonists' rebellion and authorized military actions to suppress the rebellion. He viewed the American colonists as traitors and sought to quell the uprising by force. The

- As shown above, a natural-language query can also be used to crawl the relevant content of a page.