# Fetching Chicago Restaurant Data

### 🔰 Overview

- https://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/ 

- Chicago magazine: The 50 Best Sandwiches

	The page lists the shop names and menu items for Chicago's 50 best sandwiches.

- Final goal
	<p>
	Collect each shop's information from 51 pages in total (1 main page and 50 subpages).
	</p>

	- Main page
		- Rank
		- Shop name
		- Signature menu item
		- URL

	- Subpages
		- Price of the signature menu item
		- Shop address

### 🔰 Main page

> Web-crawl the data on the main page

- Get the rank, signature menu item, shop name, and URL.

In [9]:
from urllib.request import urlopen, Request
import ssl

In [10]:

url_base = "https://www.chicagomag.com/"
url_sub = "Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/"
url = url_base + url_sub

ca_filepath = "../chicagomag.crt"
ctxt = ssl.create_default_context(cafile=ca_filepath)

response = urlopen(url, context=ctxt)
response

HTTPError: HTTP Error 403: Forbidden

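👆 HTTP Error 403: the server treated the request as suspicious and refused it.

- The fix is to set the `headers=` argument of `Request` so that the request carries a `User-Agent`.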
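As a minimal sketch of why the header matters (no request is sent here): by default `urllib` identifies itself as `Python-urllib/<version>`, which is a likely reason for the 403, so a browser-like `User-Agent` is attached through `headers=`. The `"Chrome"` value is only a placeholder.

```python
from urllib.request import Request, build_opener

url = "https://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/"

# The default opener announces itself as "Python-urllib/<version>".
default_headers = dict(build_opener().addheaders)
print(default_headers["User-agent"])

# Attach a browser-like User-Agent.
# Request stores header names capitalized, i.e. as "User-agent".
req = Request(url, headers={"User-Agent": "Chrome"})
print(req.get_header("User-agent"))
```

A header set on the `Request` takes precedence over the opener's default, which is why the call succeeds once `headers=` is supplied.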

In [11]:
# ca : Certificate Authority
ca_filepath = "../chicagomag.crt"
ctxt = ssl.create_default_context(cafile=ca_filepath)

req = Request(url, headers={"user-agent": "Chrome"})

response = urlopen(req, context=ctxt)
response, response.status

(<http.client.HTTPResponse at 0x1b79bd9eb20>, 200)

- You can also pass the exact `User-Agent` value shown under Network > Request Headers.

	- Copied from the Network tab of Chrome DevTools.
	- {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}

In [12]:
req = Request(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"})

response = urlopen(req, context=ctxt)
response, response.status

(<http.client.HTTPResponse at 0x1b79bd081f0>, 200)

- `fake-useragent`

	- !pip install fake-useragent

	- Generates a random value for the `User-Agent` field shown under Network > Request Headers.

In [13]:
from fake_useragent import UserAgent

ua = UserAgent()
# ua.ie
# Error occurred during getting browser: ie, but was suppressed with fallback.
ua.fallback

'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'

In [None]:
import ssl
from urllib.request import urlopen, Request
from fake_useragent import UserAgent
from bs4 import BeautifulSoup

ca_filepath = "../chicagomag.crt"
ctxt = ssl.create_default_context(cafile=ca_filepath)

url_base = "https://www.chicagomag.com/"
url_sub = "Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/"
url = url_base + url_sub

ua = UserAgent()
req = Request(url, headers={"user-agent": ua.fallback})

html = urlopen(req, context=ctxt)

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())

- Locating the data to collect

![Chicago top 50 sandwiches](https://github.com/ElaYJ/Study_EDA/assets/153154981/815e5ee1-ae46-40c6-833f-7b0072c08773)

In [None]:
soup.select(".sammy"), len(soup.select(".sammy"))

In [16]:
# Equivalent to soup.select(".sammy"), len(soup.select(".sammy"))
soup.find_all("div", "sammy"), len(soup.find_all("div", "sammy"))

([<div class="sammy" style="position: relative;">
  <div class="sammyRank">1</div>
  <div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/"><b>BLT</b><br/>
  Old Oak Tap<br/>
  <em>Read more</em> </a></div>
  </div>,
  <div class="sammy" style="position: relative;">
  <div class="sammyRank">2</div>
  <div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Au-Cheval-Fried-Bologna/"><b>Fried Bologna</b><br/>
  Au Cheval<br/>
  <em>Read more</em> </a></div>
  </div>,
  <div class="sammy" style="position: relative;">
  <div class="sammyRank">3</div>
  <div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Xoco-Woodland-Mushroom/"><b>Woodland Mushroom</b><br/>
  Xoco<br/>
  <em>Read more</em> </a></div>
  </div>,
  <div class="sammy" style="position: relative;">
  <div class="sammyRank">4</div>
  <div class="sammyListing"><a href="/Chicago-Magazine/November-2

In [17]:
tmp_one = soup.find_all("div", "sammy")[0]
tmp_one, type(tmp_one)

(<div class="sammy" style="position: relative;">
 <div class="sammyRank">1</div>
 <div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/"><b>BLT</b><br/>
 Old Oak Tap<br/>
 <em>Read more</em> </a></div>
 </div>,
 bs4.element.Tag)

👆 The type is `bs4.element.Tag`, which means methods such as `find()` can be called on it.

- Getting the rank

In [18]:
tmp_one.find(class_="sammyRank")

<div class="sammyRank">1</div>

In [19]:
tmp_one.find(class_="sammyRank").get_text()

'1'

In [20]:
tmp_one.find(class_="sammyRank").text

'1'

- Getting the signature menu item and the shop name

In [21]:
tmp_one

<div class="sammy" style="position: relative;">
<div class="sammyRank">1</div>
<div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/"><b>BLT</b><br/>
Old Oak Tap<br/>
<em>Read more</em> </a></div>
</div>

In [22]:
tmp_one.find("div", {"class": "sammyListing"})

<div class="sammyListing"><a href="/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/"><b>BLT</b><br/>
Old Oak Tap<br/>
<em>Read more</em> </a></div>

In [23]:
tmp_one.find("div", {"class": "sammyListing"}).text

'BLT\nOld Oak Tap\nRead more '

In [24]:
tmp_one.select_one(".sammyListing").get_text()

'BLT\nOld Oak Tap\nRead more '

In [25]:
tmp_one.select_one(".sammyListing").text.split("\n")

['BLT', 'Old Oak Tap', 'Read more ']

- `re` module: regular expressions

	- `|` means OR, so `"\n|\r\n"` splits on either kind of line break.

In [26]:
import re

tmp_string = tmp_one.find(class_="sammyListing").get_text()
re.split(("\n|\r\n"), tmp_string)

['BLT', 'Old Oak Tap', 'Read more ']

In [27]:
print(re.split(("\n|\r\n"), tmp_string)[0]) # menu
print(re.split(("\n|\r\n"), tmp_string)[1]) # cafe

BLT
Old Oak Tap
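The OR pattern can be checked on a small string without touching the network. This is a sketch on a hypothetical sample that mixes both line-ending styles; the values mirror the listing text above.

```python
import re

# Hypothetical sample mixing Windows (\r\n) and Unix (\n) line endings.
sample = "BLT\r\nOld Oak Tap\nRead more "

# "\n|\r\n" matches either a bare newline or a carriage return + newline.
parts = re.split("\n|\r\n", sample)
menu, cafe = parts[0], parts[1]
print(parts)
print(menu, "/", cafe)
```

For this particular input, `sample.splitlines()` gives the same list, so it is a reasonable alternative when only line breaks are involved.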


- Getting the page URL

	- Some of the URLs are relative paths and some are absolute.

In [28]:
tmp_one.find("a")["href"]

'/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/'

In [29]:
tmp_one.select_one("a").get("href")

'/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/'

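- `urljoin()`

	- Prepends url_base when it is missing and leaves the URL alone when it is already there.

	- When relative and absolute paths are mixed, the relative ones get url_base prepended and become absolute.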
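The behaviour described above can be sketched as follows. The relative path is the one from the first listing; the absolute link is a hypothetical example added only to show the pass-through case.

```python
from urllib.parse import urljoin

url_base = "http://www.chicagomag.com"

# Relative path taken from the first listing.
relative = "/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/"
# Hypothetical absolute link, used only for illustration.
absolute = "https://www.chicagomag.com/Chicago-Magazine/example/"

joined_relative = urljoin(url_base, relative)  # base is prepended
joined_absolute = urljoin(url_base, absolute)  # returned unchanged

print(joined_relative)
print(joined_absolute)
```

Because an absolute URL keeps its own scheme, the resulting list can end up mixing `http://` and `https://` entries.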

In [30]:
from urllib.parse import urljoin

url_base = "http://www.chicagomag.com"

# Empty lists to hold the fields we need
# Each list becomes one column; they are combined into a DataFrame later
rank = [] 
main_menu = [] 
cafe_name = [] 
url_add = [] 

list_soup = soup.find_all("div", "sammy") # soup.select(".sammy")

for item in list_soup: 
    rank.append(item.find(class_="sammyRank").get_text())
    tmp_string = item.find(class_="sammyListing").get_text()
    main_menu.append(re.split(("\n|\r\n"), tmp_string)[0])
    cafe_name.append(re.split(("\n|\r\n"), tmp_string)[1])
    url_add.append(urljoin(url_base, item.find("a")["href"]))
    
len(rank), len(main_menu), len(cafe_name), len(url_add)

(50, 50, 50, 50)

In [31]:
rank[:5]

['1', '2', '3', '4', '5']

In [32]:
main_menu[:5]

['BLT', 'Fried Bologna', 'Woodland Mushroom', 'Roast Beef', 'PB&L']

In [33]:
cafe_name[:5]

['Old Oak Tap', 'Au Cheval', 'Xoco', 'Al’s Deli', 'Publican Quality Meats']

In [34]:
url_add[:5]

['http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/',
 'http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Au-Cheval-Fried-Bologna/',
 'http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Xoco-Woodland-Mushroom/',
 'http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Als-Deli-Roast-Beef/',
 'http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Publican-Quality-Meats-PB-L/']

- Creating the DataFrame

In [35]:
import pandas as pd

data = {
	"Rank": rank,
	"Menu": main_menu,
	"Cafe": cafe_name,
	"URL": url_add
}

df = pd.DataFrame(data)
df.tail(2)

|    | Rank | Menu | Cafe | URL |
|---|---|---|---|---|
| 48 | 49 | Le Végétarien | Toni Patisserie | https://www.chicagomag.com/Chicago-Magazine/No... |
| 49 | 50 | The Gatsby | Phoebe’s Bakery | https://www.chicagomag.com/Chicago-Magazine/No... |


- Reordering the columns

In [36]:
df = pd.DataFrame(data, columns=["Rank", "Cafe", "Menu", "URL"])
df.tail()

|    | Rank | Cafe | Menu | URL |
|---|---|---|---|---|
| 45 | 46 | Chickpea | Kufta | https://www.chicagomag.com/Chicago-Magazine/No... |
| 46 | 47 | The Goddess and Grocer | Debbie’s Egg Salad | https://www.chicagomag.com/Chicago-Magazine/No... |
| 47 | 48 | Zenwich | Beef Curry | https://www.chicagomag.com/Chicago-Magazine/No... |
| 48 | 49 | Toni Patisserie | Le Végétarien | https://www.chicagomag.com/Chicago-Magazine/No... |
| 49 | 50 | Phoebe’s Bakery | The Gatsby | https://www.chicagomag.com/Chicago-Magazine/No... |


- Saving the data

In [37]:
df.to_csv("./result_data/01_Chicago_mainpage_data.csv", encoding="utf-8")

### 🔰 Subpages

> Web-crawl the data on the subpages

- Follow each URL to its subpage and collect the price and the address.

In [38]:
# requirements
import pandas as pd
import ssl
from urllib.request import urlopen, Request
from fake_useragent import UserAgent
from bs4 import BeautifulSoup

ca_filepath = "../chicagomag.crt"
ctxt = ssl.create_default_context(cafile=ca_filepath)

In [39]:
df = pd.read_csv("./result_data/01_Chicago_mainpage_data.csv", index_col=0)
df.tail()

|    | Rank | Cafe | Menu | URL |
|---|---|---|---|---|
| 45 | 46 | Chickpea | Kufta | https://www.chicagomag.com/Chicago-Magazine/No... |
| 46 | 47 | The Goddess and Grocer | Debbie’s Egg Salad | https://www.chicagomag.com/Chicago-Magazine/No... |
| 47 | 48 | Zenwich | Beef Curry | https://www.chicagomag.com/Chicago-Magazine/No... |
| 48 | 49 | Toni Patisserie | Le Végétarien | https://www.chicagomag.com/Chicago-Magazine/No... |
| 49 | 50 | Phoebe’s Bakery | The Gatsby | https://www.chicagomag.com/Chicago-Magazine/No... |


In [40]:
df["URL"][0]

'http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/'

- Collecting the data

	- Extract the price and the address

![Chicago top 50 sandwiches, subpage](https://github.com/ElaYJ/Study_EDA/assets/153154981/e831ecb4-0c13-4c5b-94ae-f6ee0cd69d8a)

In [41]:
ua = UserAgent()
req = Request(df["URL"][0], headers={"user-agent": ua.fallback})
html = urlopen(req, context=ctxt).read()

soup_tmp = BeautifulSoup(html, "html.parser")
soup_tmp.find("p", "addy") # soup_tmp.select_one(".addy")

<p class="addy">
<em>$10. 2109 W. Chicago Ave., 773-772-0406, <a href="http://www.theoldoaktap.com/">theoldoaktap.com</a></em></p>

- regular expression

In [42]:
price_tmp = soup_tmp.select_one(".addy").text
price_tmp

'\n$10. 2109 W. Chicago Ave., 773-772-0406, theoldoaktap.com'

In [43]:
import re

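# Escape the dot so that only a literal "." followed by "," acts as the separator.
# An unescaped "." matches any character and would also eat the last digit of the phone number.
re.split(r"\.,", price_tmp)

['\n$10. 2109 W. Chicago Ave', ' 773-772-0406, theoldoaktap.com']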

In [44]:
price_address = re.split(r"\.,", price_tmp)[0]
price_address

'\n$10. 2109 W. Chicago Ave'

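- `"\$\d+\.(\d+)?"`

	- The match starts with a literal `$` (`\$`), followed by one or more digits (`\d+`), then a literal `.` (`\.`),

		 and finally an optional run of digits (`(\d+)?`), which may or may not be present.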
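A quick way to check the pattern is to run it on two strings, one without cents and one with. This is only a sketch: the first sample is the text scraped above, while the second is reconstructed from the later results and should be treated as an assumption.

```python
import re

# Raw string, so the backslashes reach the regex engine unchanged.
price_pattern = r"\$\d+\.(\d+)?"

samples = [
    "\n$10. 2109 W. Chicago Ave",  # price without cents
    "\n$9.50. 445 N. Clark St",    # assumed form of a price with cents
]

prices = [re.search(price_pattern, s).group() for s in samples]
print(prices)
```

When there are no cents, the trailing `.` that ends the price sentence becomes part of the match, which is why values such as `'$10.'` keep the period.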

In [45]:
price = re.search(r"\$\d+\.(\d+)?", price_tmp).group()
price

'$10.'

In [46]:
# Treat everything after the price as the address; the +2 skips the leading "\n" and the space after the price.
address = price_address[len(price)+2:]
address

'2109 W. Chicago Ave'

- Collecting all the data

In [47]:
price = []
address = []

for n in df.index[:3]:
    req = Request(df["URL"][n], headers={"user-agent": ua.fallback})
    html = urlopen(req, context=ctxt).read()
    soup_tmp = BeautifulSoup(html, "html.parser")
    
    gettings = soup_tmp.find("p", "addy").get_text()
    tmp = re.split(r"\.,", gettings)[0]
    price_tmp = re.search(r"\$\d+\.(\d+)?", tmp).group()
    
    price.append(price_tmp)
    address.append(tmp[len(price_tmp)+2:])
    print(n)

0
1
2


In [48]:
price, address

(['$10.', '$9.', '$9.50'],
 ['2109 W. Chicago Ave', '800 W. Randolph St', ' 445 N. Clark St'])
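The leading space in `' 445 N. Clark St'` comes from the fixed `+2` offset: when the price has cents, one more separator character sits in front of the address. One possible alternative, sketched here under assumptions, is to capture price and address with a single pattern. The helper name `parse_addy`, the second sample string, and its phone and site placeholders are hypothetical, not scraped output.

```python
import re

# Price (with optional cents), an optional trailing ".", whitespace,
# then the address up to the first "," (dropping a "." right before it).
ADDY_PATTERN = re.compile(r"(\$\d+(?:\.\d+)?)\.?\s+(.+?)\.?,")

def parse_addy(text):
    """Return (price, address) from an 'addy' paragraph, or None if it does not match."""
    match = ADDY_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()

samples = [
    "\n$10. 2109 W. Chicago Ave., 773-772-0406, theoldoaktap.com",
    "\n$9.50. 445 N. Clark St., 000-000-0000, example.com",  # hypothetical
]

results = [parse_addy(s) for s in samples]
print(results)
```

Unlike the slicing approach, this sketch drops the trailing period from the price (`'$10'` instead of `'$10.'`) and returns `None` instead of raising when a page does not follow the expected format.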

In [49]:
price = []
address = []

for idx, row in df[:3].iterrows():
    req = Request(row["URL"], headers={"user-agent": ua.fallback})
    html = urlopen(req, context=ctxt).read()
    soup_tmp = BeautifulSoup(html, "html.parser")
    
    gettings = soup_tmp.find("p", "addy").get_text()
    tmp = re.split(r"\.,", gettings)[0]
    price_tmp = re.search(r"\$\d+\.(\d+)?", tmp).group()
    
    price.append(price_tmp)
    address.append(tmp[len(price_tmp)+2:])
    print(idx)

0
1
2


In [50]:
price, address

(['$10.', '$9.', '$9.50'],
 ['2109 W. Chicago Ave', '800 W. Randolph St', ' 445 N. Clark St'])

- Installing the `tqdm` module

	- conda install -c conda-forge tqdm

In [51]:
# tqdm is not used when running in VSCode ❌
# from tqdm import tqdm 

price = []
address = []

for idx, row in df.iterrows(): # tqdm(df.iterrows()):
    req = Request(row["URL"], headers={"user-agent": ua.fallback})
    html = urlopen(req, context=ctxt).read()
    soup_tmp = BeautifulSoup(html, "html.parser")
    
    gettings = soup_tmp.find("p", "addy").get_text()
    tmp = re.split(r"\.,", gettings)[0]
    price_tmp = re.search(r"\$\d+\.(\d+)?", tmp).group()
    
    price.append(price_tmp)
    address.append(tmp[len(price_tmp)+2:])
    print(idx)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49


In [52]:
len(price), len(address)

(50, 50)

In [53]:
price[:5]

['$10.', '$9.', '$9.50', '$9.40', '$10.']

In [54]:
address[:5]

['2109 W. Chicago Ave',
 '800 W. Randolph St',
 ' 445 N. Clark St',
 ' 914 Noyes St',
 '825 W. Fulton Mkt']

- Adding to the DataFrame

	- Merge the subpage data into the DataFrame

In [55]:
df.tail(2)

|    | Rank | Cafe | Menu | URL |
|---|---|---|---|---|
| 48 | 49 | Toni Patisserie | Le Végétarien | https://www.chicagomag.com/Chicago-Magazine/No... |
| 49 | 50 | Phoebe’s Bakery | The Gatsby | https://www.chicagomag.com/Chicago-Magazine/No... |


In [56]:
df["Price"] = price
df["Address"] = address
df

|    | Rank | Cafe | Menu | URL | Price | Address |
|---|---|---|---|---|---|---|
| 0 | 1 | Old Oak Tap | BLT | http://www.chicagomag.com/Chicago-Magazine/Nov... | $10. | 2109 W. Chicago Ave |
| 1 | 2 | Au Cheval | Fried Bologna | http://www.chicagomag.com/Chicago-Magazine/Nov... | $9. | 800 W. Randolph St |
| 2 | 3 | Xoco | Woodland Mushroom | http://www.chicagomag.com/Chicago-Magazine/Nov... | $9.50 | 445 N. Clark St |
| 3 | 4 | Al’s Deli | Roast Beef | http://www.chicagomag.com/Chicago-Magazine/Nov... | $9.40 | 914 Noyes St |
| 4 | 5 | Publican Quality Meats | PB&L | http://www.chicagomag.com/Chicago-Magazine/Nov... | $10. | 825 W. Fulton Mkt |
| 5 | 6 | Hendrickx Belgian Bread Crafter | Belgian Chicken Curry Salad | https://www.chicagomag.com/Chicago-Magazine/No... | $7.25 | 100 E. Walton St |
| 6 | 7 | Acadia | Lobster Roll | http://www.chicagomag.com/Chicago-Magazine/Nov... | $16. | 1639 S. Wabash Ave |
| 7 | 8 | Birchwood Kitchen | Smoked Salmon Salad | http://www.chicagomag.com/Chicago-Magazine/Nov... | $10. | 2211 W. North Ave |
| 8 | 9 | Cemitas Puebla | Atomica Cemitas | http://www.chicagomag.com/Chicago-Magazine/Nov... | $9. | 3619 W. North Ave |
| 9 | 10 | Nana | Grilled Laughing Bird Shrimp and Fried Po’ Boy | http://www.chicagomag.com/Chicago-Magazine/Nov... | $17. | 3267 S. Halsted St |


In [57]:
# Drop the URL column
df = df.loc[:, ["Rank", "Cafe", "Menu", "Price", "Address"]]
df.head()

|    | Rank | Cafe | Menu | Price | Address |
|---|---|---|---|---|---|
| 0 | 1 | Old Oak Tap | BLT | $10. | 2109 W. Chicago Ave |
| 1 | 2 | Au Cheval | Fried Bologna | $9. | 800 W. Randolph St |
| 2 | 3 | Xoco | Woodland Mushroom | $9.50 | 445 N. Clark St |
| 3 | 4 | Al’s Deli | Roast Beef | $9.40 | 914 Noyes St |
| 4 | 5 | Publican Quality Meats | PB&L | $10. | 825 W. Fulton Mkt |


In [58]:
# Use Rank as the index
df.set_index("Rank", inplace=True)
df.head()

| Rank | Cafe | Menu | Price | Address |
|---|---|---|---|---|
| 1 | Old Oak Tap | BLT | $10. | 2109 W. Chicago Ave |
| 2 | Au Cheval | Fried Bologna | $9. | 800 W. Randolph St |
| 3 | Xoco | Woodland Mushroom | $9.50 | 445 N. Clark St |
| 4 | Al’s Deli | Roast Beef | $9.40 | 914 Noyes St |
| 5 | Publican Quality Meats | PB&L | $10. | 825 W. Fulton Mkt |


In [62]:
# Save the data
df.to_csv("./result_data/01_Chicago_subpage_add_data.csv", encoding="utf-8")

In [63]:
pd.read_csv("./result_data/01_Chicago_subpage_add_data.csv", index_col=0)

| Rank | Cafe | Menu | Price | Address |
|---|---|---|---|---|
| 1 | Old Oak Tap | BLT | $10. | 2109 W. Chicago Ave |
| 2 | Au Cheval | Fried Bologna | $9. | 800 W. Randolph St |
| 3 | Xoco | Woodland Mushroom | $9.50 | 445 N. Clark St |
| 4 | Al’s Deli | Roast Beef | $9.40 | 914 Noyes St |
| 5 | Publican Quality Meats | PB&L | $10. | 825 W. Fulton Mkt |
| 6 | Hendrickx Belgian Bread Crafter | Belgian Chicken Curry Salad | $7.25 | 100 E. Walton St |
| 7 | Acadia | Lobster Roll | $16. | 1639 S. Wabash Ave |
| 8 | Birchwood Kitchen | Smoked Salmon Salad | $10. | 2211 W. North Ave |
| 9 | Cemitas Puebla | Atomica Cemitas | $9. | 3619 W. North Ave |
| 10 | Nana | Grilled Laughing Bird Shrimp and Fried Po’ Boy | $17. | 3267 S. Halsted St |
