### - Seogwipo Accommodation Data

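- The accommodation data also lacks entries for the Seogwipo area, so they are collected and added with the same method as before.
- Keyword search
  - Returns place search results for a query string
  - category_group_name, category_name, place_url
- Keyword place search quota: 100,000 requests per day
- Category to search: accommodation
  - AD5 (accommodation)
- Search by legal dong/ri (법정동·리): <a href = 'https://ko.wikipedia.org/wiki/%EC%84%9C%EA%B7%80%ED%8F%AC%EC%8B%9C%EC%9D%98_%ED%96%89%EC%A0%95_%EA%B5%AC%EC%97%AD' target='_blank'>Wikipedia</a>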

#### 1. Data Acquisition

##### ◽Category, keyword, and area variables

In [3]:
keywords = ['호텔', '리조트', '콘도', '게스트하우스', '민박', '펜션']
categorys = ['AD5']
categorys_info = {'AD5' : '숙박'}

In [4]:
jeju_range = ['법환동', '서호동', '호근동', '강정동', '도순동', '영남동', '월평동', '동홍동', '서홍동', '보목동','서귀동','토평동', '상효동', '상예동', '색달동', '하예동', '대포동', '중문동', '하원동', '회수동', '신효동', '하효동']

##### ◽Kakao API helper functions: reused from earlier

- search_result(keyword, category, jeju_name)
  - Searches for one (category, legal dong/ri) pair

In [5]:
import json
import requests

def search_result(keyword, category, jeju_name):
    result = []

    # REST API key
    rest_api_key = '63d0926cf9b14de298157081ba8a8d02'
    # Request headers
    headers = {"Authorization" : "KakaoAK {}".format(rest_api_key)}
    # Query parameters
    params = {"query" : f"제주특별자치도 {jeju_name} {keyword}", "page" : 1, "category_group_code" : f"{category}"}
    url = "https://dapi.kakao.com/v2/local/search/keyword.json"

    while True:
        # Fetch one page of results with a GET request
        res = requests.get(url, headers=headers, params=params)
        if res.status_code != 200:
            # Stop on an error response; otherwise the loop would never end
            break
        # Parse the JSON body
        doc = json.loads(res.text)
        result.extend(doc['documents'])
        if doc['meta']['is_end']:
            break
        params['page'] += 1
    return result
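
The paging logic in `search_result` can be exercised without calling the API by separating the page-fetching step from the loop. The sketch below is only an illustration: `collect_documents`, `fetch_page`, and `max_pages` are hypothetical names, and the default cap of 45 pages is an assumption about the API's page limit that should be checked against the Kakao Local API reference.

```python
from typing import Callable, Dict, List, Optional


def collect_documents(
    fetch_page: Callable[[int], Optional[Dict]],
    max_pages: int = 45,  # assumed upper bound on the page parameter
) -> List[Dict]:
    """Collect `documents` from consecutive pages until `is_end` is true."""
    documents: List[Dict] = []
    for page in range(1, max_pages + 1):
        doc = fetch_page(page)
        if doc is None:  # the request failed; stop instead of looping forever
            break
        documents.extend(doc["documents"])
        if doc["meta"]["is_end"]:
            break
    return documents


# A stand-in for the real HTTP call: three pages of fake results.
fake_pages = {
    1: {"documents": [{"id": "1"}, {"id": "2"}], "meta": {"is_end": False}},
    2: {"documents": [{"id": "3"}], "meta": {"is_end": False}},
    3: {"documents": [{"id": "4"}], "meta": {"is_end": True}},
}

collected = collect_documents(fake_pages.get)
print([d["id"] for d in collected])
```

In real use, `fetch_page` would wrap the `requests.get` call and return `None` for any non-200 response.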

- search_df()
  - Returns a DataFrame with all accommodation search results

In [6]:
import pandas as pd
from tqdm.notebook import tqdm

def search_df():
    results = []
    for idx, jeju in tqdm(enumerate(jeju_range)):
        for category in categorys:
            for key in keywords:
                r = pd.DataFrame(search_result(key, category, jeju))
                r['keyword'] = key
                results.append(r.copy())
    return pd.concat(results).reset_index(drop=True)

##### ◽Data acquisition with the Kakao API

In [7]:
accommodation_poi_add = search_df()

0it [00:00, ?it/s]

In [9]:
# accommodation_poi_add.to_excel('./data/220121/서귀포_숙박_POI.xlsx',index=False)

In [10]:
accommodation_poi_add.head(2)

Unnamed: 0,address_name,category_group_code,category_group_name,category_name,distance,id,phone,place_name,place_url,road_address_name,x,y,keyword
0,제주특별자치도 서귀포시 법환동 1513,AD5,숙박,여행 > 숙박 > 호텔,,17017429,064-800-7200,더그랜드섬오름,http://place.map.kakao.com/17017429,제주특별자치도 서귀포시 막숙포로 114,126.51042079683762,33.23307218540021,호텔
1,제주특별자치도 서귀포시 법환동 745-1,AD5,숙박,여행 > 숙박 > 호텔,,27224641,064-802-7000,비스타케이호텔 월드컵,http://place.map.kakao.com/27224641,제주특별자치도 서귀포시 김정문화로41번길 10-6,126.509787880491,33.251897501317295,호텔


##### ◽Data check (accommodation POI from the API)

- Duplicates on (keyword, id) were removed in Excel
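
The same de-duplication could also be done in pandas rather than in Excel. A minimal sketch, assuming a duplicate means two rows that share both `keyword` and `id`; the sample rows are made up for illustration.

```python
import pandas as pd

# Made-up rows: the same place returned twice for the same keyword.
sample = pd.DataFrame(
    {
        "id": [101, 101, 102, 101],
        "place_name": ["A", "A", "B", "A"],
        "keyword": ["호텔", "호텔", "펜션", "펜션"],
    }
)

# Keep the first row of every (keyword, id) pair.
deduplicated = sample.drop_duplicates(subset=["keyword", "id"]).reset_index(drop=True)
print(deduplicated)
```

A place that appears under two different keywords is kept twice here; that case is handled separately by comparing the keyword with the category.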

In [23]:
import pandas as pd

accommodation_poi_add = pd.read_excel('./data/220121/서귀포_숙박_POI.xlsx', index_col=False)

In [24]:
accommodation_poi_add.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1076 entries, 0 to 1075
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   address_name         1076 non-null   object 
 1   category_group_code  1076 non-null   object 
 2   category_group_name  1076 non-null   object 
 3   category_name        1076 non-null   object 
 4   distance             0 non-null      float64
 5   id                   1076 non-null   int64  
 6   phone                754 non-null    object 
 7   place_name           1076 non-null   object 
 8   place_url            1076 non-null   object 
 9   road_address_name    1066 non-null   object 
 10  x                    1076 non-null   float64
 11  y                    1076 non-null   float64
 12  keyword              1076 non-null   object 
dtypes: float64(3), int64(1), object(9)
memory usage: 109.4+ KB


In [25]:
pd.DataFrame(accommodation_poi_add['keyword'].value_counts())

Unnamed: 0,keyword
펜션,446
민박,185
호텔,168
게스트하우스,168
리조트/콘도,109


#### 2. Handling duplicate ids: cleaning up keywords

##### ◽Checking duplicate ids

In [27]:
import pandas as pd

accommodation_poi_add = pd.read_excel('./data/220121/서귀포_숙박_POI.xlsx', index_col=False)

- Look up rows by id
  - Compare keyword with category and delete the rows where they do not match

In [28]:
del_index = []
for idx, row in accommodation_poi_add.iterrows():
    key = row['keyword']
    if '리조트' in key:
        key = '리조트'
    if key not in row['category_name']:
        del_index.append(idx)
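
The row-by-row comparison can also be expressed as a boolean mask. This sketch follows the same rule as the loop (the keyword, with `리조트/콘도` shortened to `리조트`, must appear in `category_name`); `keyword_matches_category` is a hypothetical helper and the rows are made up.

```python
import pandas as pd


def keyword_matches_category(df: pd.DataFrame) -> pd.Series:
    """Return True for rows whose keyword appears in category_name."""
    # Collapse any keyword containing '리조트' (e.g. '리조트/콘도') to '리조트'.
    keys = df["keyword"].where(~df["keyword"].str.contains("리조트"), "리조트")
    return pd.Series(
        [key in category for key, category in zip(keys, df["category_name"])],
        index=df.index,
    )


sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "keyword": ["호텔", "리조트/콘도", "펜션"],
        "category_name": [
            "여행 > 숙박 > 호텔",
            "여행 > 숙박 > 콘도,리조트",
            "여행 > 숙박 > 호텔",
        ],
    }
)

mask = keyword_matches_category(sample)
kept = sample[mask]
print(kept["id"].tolist())
```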

In [29]:
# Number of indices to delete
len(del_index)

46

In [30]:
# Drop the rows at the indices collected for duplicate ids
accommodation_poi_del = accommodation_poi_add.drop(del_index, axis=0)

In [31]:
accommodation_poi_del.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1030 entries, 0 to 1075
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   address_name         1030 non-null   object 
 1   category_group_code  1030 non-null   object 
 2   category_group_name  1030 non-null   object 
 3   category_name        1030 non-null   object 
 4   distance             0 non-null      float64
 5   id                   1030 non-null   int64  
 6   phone                717 non-null    object 
 7   place_name           1030 non-null   object 
 8   place_url            1030 non-null   object 
 9   road_address_name    1021 non-null   object 
 10  x                    1030 non-null   float64
 11  y                    1030 non-null   float64
 12  keyword              1030 non-null   object 
dtypes: float64(3), int64(1), object(9)
memory usage: 112.7+ KB


In [32]:
len(accommodation_poi_del['id'].unique())

1030

In [33]:
# accommodation_poi_del.to_excel('./data/220121/서귀포_숙박_POI_최종RAW.xlsx', index=False)

#### 3. Selenium: collecting image, rating, and hotel grade data

##### ◽Data check

In [34]:
import pandas as pd

accommodation = pd.read_excel('./data/220121/서귀포_숙박_POI_최종RAW.xlsx', index_col=False)

In [35]:
accommodation.columns

Index(['address_name', 'category_group_code', 'category_group_name',
       'category_name', 'distance', 'id', 'phone', 'place_name', 'place_url',
       'road_address_name', 'x', 'y', 'keyword'],
      dtype='object')

In [36]:
accommodation.shape

(1030, 13)

In [37]:
accommodation.head(2)

Unnamed: 0,address_name,category_group_code,category_group_name,category_name,distance,id,phone,place_name,place_url,road_address_name,x,y,keyword
0,제주특별자치도 서귀포시 법환동 1513,AD5,숙박,여행 > 숙박 > 호텔,,17017429,064-800-7200,더그랜드섬오름,http://place.map.kakao.com/17017429,제주특별자치도 서귀포시 막숙포로 114,126.510421,33.233072,호텔
1,제주특별자치도 서귀포시 법환동 745-1,AD5,숙박,여행 > 숙박 > 호텔,,27224641,064-802-7000,비스타케이호텔 월드컵,http://place.map.kakao.com/27224641,제주특별자치도 서귀포시 김정문화로41번길 10-6,126.509788,33.251898,호텔


##### ◽Selenium function

- Rating : #mArticle > div.cont_essential > div:nth-child(1) > div.place_details > div > div > a:nth-child(3) > span.color_b
- Hotel grade : span.txt_location
- Image : #mArticle > div.cont_photo.no_category > div.photo_area > ul > li.size_l > a

In [38]:
import json
import requests
import time
from tqdm.notebook import tqdm
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from fake_useragent import UserAgent

def selenium_result(url):    
    rate = 0
    grade = False
    image = False
    ua = UserAgent()
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--incognito")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-setuid-sandbox")
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument(f'user-agent={ua.ie}')
    # options.add_argument('--proxy-server=socks5://127.0.0.1:9150')
    options.add_experimental_option('excludeSwitches', ['enable-logging'])

    driver = webdriver.Chrome('./driver/chromedriver.exe', options=options)
    time.sleep(0.5)
    driver.implicitly_wait(8)
    driver.get(url)
    try:
        rate = driver.find_element_by_css_selector('''#mArticle > div.cont_essential > div:nth-child(1) > div.place_details > div > div > a:nth-child(3) > span.color_b''').text
    except:
        pass
    try:
        grade = driver.find_element_by_css_selector('''span.txt_location''').text
    except:
        pass
    try :
        image = driver.find_element_by_css_selector('''#mArticle > div.cont_photo.no_category > div.photo_area > ul > li.size_l > a''')
    except:
        pass
    else:
        image = 'https:'+image.get_attribute('style')[23:-3]
    driver.quit()

    return rate, grade, image
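
The slice `[23:-3]` in `selenium_result` depends on the exact length of the `style` string. A regular expression is less sensitive to spacing and quoting. The sketch assumes the attribute looks like `background-image: url("//host/path");`, which is inferred from the slice offsets rather than confirmed; `extract_image_url` is a hypothetical helper and the URL is a placeholder.

```python
import re
from typing import Optional

# Matches url(...) with optional single or double quotes around the address.
_URL_PATTERN = re.compile(r"url\(\s*['\"]?(.*?)['\"]?\s*\)")


def extract_image_url(style: str) -> Optional[str]:
    """Pull the image address out of an inline background-image style."""
    match = _URL_PATTERN.search(style)
    if match is None:
        return None
    url = match.group(1)
    # Protocol-relative addresses start with '//' and need a scheme.
    return "https:" + url if url.startswith("//") else url


style = 'background-image: url("//img1.example.com/photo.jpg");'
print(extract_image_url(style))
```

Returning `None` when nothing matches makes a missing image explicit instead of producing a malformed URL.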

- Extract the information for each row through its place_url

In [None]:
accommodation['rating'] = 0
accommodation['grade'] = accommodation['keyword']
accommodation['image'] = 'Not Image'

In [45]:
from tqdm.notebook import tqdm

for idx in tqdm(range(757, len(accommodation))):
    url = accommodation.loc[idx, 'place_url']
    rate, grade, image = selenium_result(url)
    accommodation.loc[idx, 'rating'] = rate
    if grade != False:
        accommodation.loc[idx, 'grade'] = grade
    if image != False:
        accommodation.loc[idx, 'image'] = image

  0%|          | 0/273 [00:00<?, ?it/s]

- Errors occurred partway through the crawl, so each partial result was saved separately and the parts were merged afterward.
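
One way to cope with a crawl that fails partway through is to record how far it got and resume from that row, which is what the `range(757, ...)` start index above does by hand. A minimal sketch with hypothetical names (`crawl_from`, `scrape`); the scraper here is a stand-in that fails on purpose, not the Selenium function.

```python
from typing import Callable, Dict, List, Tuple


def crawl_from(
    urls: List[str],
    scrape: Callable[[str], str],
    results: Dict[int, str],
    start: int = 0,
) -> Tuple[int, bool]:
    """Scrape urls[start:], storing results by index.

    Returns (next_index, finished). On an error the loop stops and
    next_index is the position to resume from.
    """
    for idx in range(start, len(urls)):
        try:
            results[idx] = scrape(urls[idx])
        except Exception:
            return idx, False
    return len(urls), True


calls = {"count": 0}


def flaky_scrape(url: str) -> str:
    """Stand-in scraper that fails once, on its third call."""
    calls["count"] += 1
    if calls["count"] == 3:
        raise RuntimeError("simulated browser crash")
    return url.upper()


urls = ["a", "b", "c", "d"]
results: Dict[int, str] = {}
next_index, finished = crawl_from(urls, flaky_scrape, results)
while not finished:
    # Resume at the row that failed; earlier results are kept.
    next_index, finished = crawl_from(urls, flaky_scrape, results, start=next_index)
print(results)
```

A real run would also want a retry limit so that a row that always fails does not repeat indefinitely.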

In [46]:
# accommodation.to_excel('./data/220121/서귀포_숙박_selenium.xlsx', index=False)

##### ◽Checking the Selenium data

- Check the final merged data

In [53]:
import pandas as pd

accommodation_all = pd.read_excel('./data/220121/_서귀포_숙박_selenium_final.xlsx', index_col=False)

In [54]:
accommodation_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   address_name         1030 non-null   object 
 1   category_group_code  1030 non-null   object 
 2   category_group_name  1030 non-null   object 
 3   category_name        1030 non-null   object 
 4   distance             0 non-null      float64
 5   id                   1030 non-null   int64  
 6   phone                717 non-null    object 
 7   place_name           1030 non-null   object 
 8   place_url            1030 non-null   object 
 9   road_address_name    1021 non-null   object 
 10  x                    1030 non-null   float64
 11  y                    1030 non-null   float64
 12  keyword              1030 non-null   object 
 13  rating               1030 non-null   float64
 14  grade                1030 non-null   object 
 15  image                1030 non-null   o

In [55]:
accommodation_all[['id', 'rating', 'grade', 'image']].head(3)

Unnamed: 0,id,rating,grade,image
0,17017429,4.0,호텔,Not Image
1,27224641,2.8,호텔,https://img1.kakaocdn.net/relay/local/R640x320...
2,1971607879,4.1,호텔,https://img1.kakaocdn.net/relay/local/R640x320...


- Confirmed that 369 of the 1,030 accommodations have no image

In [56]:
(accommodation_all['image'] != 'Not Image').value_counts()

True     661
False    369
Name: image, dtype: int64

In [57]:
accommodation_all['keyword'].value_counts()

펜션        446
민박        178
호텔        166
게스트하우스    162
리조트/콘도     78
Name: keyword, dtype: int64

In [58]:
accommodation_all['grade'].value_counts()

펜션        446
민박        178
게스트하우스    162
호텔        150
콘도,리조트     78
특급호텔       16
Name: grade, dtype: int64

##### ◽Downloading image data

- Write an image download function
  - image_download(place_name, place_id, place_image_url):

In [59]:
import requests

def image_download(place_name, place_id, place_image_url):    
    response = requests.get(place_image_url)
    name = f'{place_name}_{place_id}'
    # A slash ('/') in the name would be read as a directory separator,
    # so replace it with a hyphen.
    if '/' in name:
        name = name.replace('/', '-')
    with open("./data/220121/서귀포_숙박_이미지/{}.png".format(name), "wb") as f:
        f.write(response.content)
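
Slashes are not the only characters that can break a file name. The sketch below shows a stricter naming helper; `safe_image_name` is hypothetical, and the set of replaced characters is an assumption based on the characters Windows rejects in file names.

```python
import re

# Characters that are not allowed in Windows file names, plus the slash.
_UNSAFE = re.compile(r'[\\/:*?"<>|]')


def safe_image_name(place_name: str, place_id: int) -> str:
    """Build '<place_name>_<place_id>.png' with unsafe characters replaced by '-'."""
    cleaned = _UNSAFE.sub("-", place_name).strip()
    return f"{cleaned}_{place_id}.png"


print(safe_image_name("리조트/콘도 A", 12345))
```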

- Load the data

In [60]:
import pandas as pd

accommodation_all = pd.read_excel('./data/220121/_서귀포_숙박_selenium_final.xlsx', index_col=False)

In [61]:
accommodation_all.head(2)

Unnamed: 0,address_name,category_group_code,category_group_name,category_name,distance,id,phone,place_name,place_url,road_address_name,x,y,keyword,rating,grade,image
0,제주특별자치도 서귀포시 법환동 1513,AD5,숙박,여행 > 숙박 > 호텔,,17017429,064-800-7200,더그랜드섬오름,http://place.map.kakao.com/17017429,제주특별자치도 서귀포시 막숙포로 114,126.510421,33.233072,호텔,4.0,호텔,Not Image
1,제주특별자치도 서귀포시 법환동 745-1,AD5,숙박,여행 > 숙박 > 호텔,,27224641,064-802-7000,비스타케이호텔 월드컵,http://place.map.kakao.com/27224641,제주특별자치도 서귀포시 김정문화로41번길 10-6,126.509788,33.251898,호텔,2.8,호텔,https://img1.kakaocdn.net/relay/local/R640x320...


- Save the images

In [62]:
import pandas as pd
import time
from tqdm.notebook import tqdm

not_image = pd.DataFrame({'place_name' : [], 'id' : [], 'idx' : []})
cnt = 0
for idx in tqdm(range(len(accommodation_all))):
    p_name, p_id, p_image = accommodation_all.loc[idx, ['place_name', 'id', 'image']]
    if p_image != 'Not Image':
        image_download(p_name, p_id, p_image)
        time.sleep(0.5)
    else:
        not_image.loc[cnt] = {'place_name' : p_name, 'id' : str(p_id), 'idx' : idx}
        cnt += 1

not_image.loc[:, 'idx'] = not_image.loc[:, 'idx'].astype('int')

  0%|          | 0/1030 [00:00<?, ?it/s]

In [63]:
not_image.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 369 entries, 0 to 368
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   place_name  369 non-null    object
 1   id          369 non-null    object
 2   idx         369 non-null    int32 
dtypes: int32(1), object(2)
memory usage: 10.1+ KB


In [64]:
# not_image.to_excel('./data/220121/서귀포_숙박_not_image.xlsx', index=False)

#### 4. Image removal: delete logos and food photos

- Check each image and delete it if it is not a suitable photo.

##### ◽Checking image IDs

- Data check

In [97]:
import pandas as pd

seogipo_accom = pd.read_excel('./data/220121/_서귀포_숙박_selenium_del.xlsx', index_col=False)
jeju_accom = pd.read_excel('./data/220117/_숙박_selenium_del.xlsx', index_col=False)

In [98]:
seogipo_accom.head(1)

Unnamed: 0,address_name,category_group_code,category_group_name,category_name,distance,id,phone,place_name,place_url,road_address_name,x,y,keyword,rating,grade,image
0,제주특별자치도 서귀포시 법환동 745-1,AD5,숙박,여행 > 숙박 > 호텔,,27224641,064-802-7000,비스타케이호텔 월드컵,http://place.map.kakao.com/27224641,제주특별자치도 서귀포시 김정문화로41번길 10-6,126.509788,33.251898,호텔,2.8,호텔,https://img1.kakaocdn.net/relay/local/R640x320...


In [100]:
jeju_accom.head(1)

Unnamed: 0,address_name,category_group_code,category_group_name,category_name,distance,id,phone,place_name,place_url,road_address_name,x,y,keyword,rating,grade,image
0,제주특별자치도 제주시 구좌읍 월정리 699-3,AD5,숙박,여행 > 숙박 > 펜션,,907075,010-6858-2257,월정힐펜션,http://place.map.kakao.com/907075,제주특별자치도 제주시 구좌읍 월정중길 19-9,126.791441,33.557925,펜션,5.0,펜션,https://img1.kakaocdn.net/relay/local/R640x320...


In [101]:
del_index = []
del_names = []
for idx, row in seogipo_accom.iterrows():
    s_id = row['id']
    s_name = row['place_name']
    if int(s_id) in list(map(int, jeju_accom['id'])):
        del_index.append(idx)
        del_names.append((s_id, s_name))

In [102]:
del_index[:5], len(del_index), del_names[:2]

([213, 238, 239, 240, 241],
 167,
 [(24838363, '루체빌리조트'), (26598426, '라마다제주시티호텔')])

##### ◽Deleting duplicate rows

- Rows that already exist in jeju_accom are deleted.

In [105]:
seogipo_accom_del = seogipo_accom.drop(del_index, axis=0)
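
The loop that collects `del_index` can be replaced by a membership test on the `id` column. A sketch with made-up frames; `seogwipo` and `jeju` below stand in for `seogipo_accom` and `jeju_accom`.

```python
import pandas as pd

seogwipo = pd.DataFrame({"id": [1, 2, 3, 4], "place_name": ["a", "b", "c", "d"]})
jeju = pd.DataFrame({"id": [2, 4, 9], "place_name": ["b", "d", "z"]})

# True where the Seogwipo row's id already exists in the Jeju data.
already_present = seogwipo["id"].isin(jeju["id"])

# Keep only the rows that are new.
seogwipo_new = seogwipo[~already_present].reset_index(drop=True)
print(seogwipo_new["id"].tolist())
```

This avoids rebuilding the list of Jeju ids on every iteration, which the loop version does once per row.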

In [106]:
seogipo_accom_del.head(1)

Unnamed: 0,address_name,category_group_code,category_group_name,category_name,distance,id,phone,place_name,place_url,road_address_name,x,y,keyword,rating,grade,image
0,제주특별자치도 서귀포시 법환동 745-1,AD5,숙박,여행 > 숙박 > 호텔,,27224641,064-802-7000,비스타케이호텔 월드컵,http://place.map.kakao.com/27224641,제주특별자치도 서귀포시 김정문화로41번길 10-6,126.509788,33.251898,호텔,2.8,호텔,https://img1.kakaocdn.net/relay/local/R640x320...


In [107]:
# seogipo_accom_del.to_excel('./data/220121/_서귀포_숙박_final.xlsx', index=False)

##### ◽Deleting images

- Check the ids of the images to delete

In [None]:
import os

del_id = []
for file in os.listdir('./data/220121/서귀포_숙박_삭제'):
    p_id = file.split('_')[1][:-4]
    del_id.append(p_id)

In [None]:
del_id[:5]

['417849940', '456438925', '1292047314', '20258222', '26549993']
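
`file.split('_')[1]` assumes the place name contains no underscore. Splitting from the right and using `os.path.splitext` for the extension avoids that assumption. `id_from_filename` is a hypothetical helper and the file names are invented.

```python
import os


def id_from_filename(filename: str) -> str:
    """Return the id part of '<place_name>_<place_id>.<ext>'."""
    stem, _ext = os.path.splitext(filename)
    # rsplit keeps underscores inside the place name intact.
    return stem.rsplit("_", 1)[-1]


print(id_from_filename("sample_hotel_417849940.png"))
print(id_from_filename("펜션A_20258222.png"))
```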

- Delete the matching rows

In [None]:
import pandas as pd

accommodation_all = pd.read_excel('./data/220121/_서귀포_숙박_final.xlsx', index_col=False)

In [None]:
accommodation_all.head(1)

Unnamed: 0,address_name,category_group_code,category_group_name,category_name,distance,id,phone,place_name,place_url,road_address_name,x,y,keyword,rating,grade,image
0,제주특별자치도 서귀포시 법환동 745-1,AD5,숙박,여행 > 숙박 > 호텔,,27224641,064-802-7000,비스타케이호텔 월드컵,http://place.map.kakao.com/27224641,제주특별자치도 서귀포시 김정문화로41번길 10-6,126.509788,33.251898,호텔,2.8,호텔,https://img1.kakaocdn.net/relay/local/R640x320...


In [None]:
del_idx = []

for idx, row in accommodation_all.iterrows():
    if str(row['id']) in del_id:
        del_idx.append(idx)

In [None]:
len(del_idx), del_idx[:4]

(59, [0, 5, 7, 11])

In [None]:
accom_del = accommodation_all.drop(del_idx, axis=0)

In [None]:
# accom_del.to_excel('./data/220121/_서귀포_숙박_최종.xlsx', index=False)

##### ◽Re-downloading image data

- Write an image download function
  - image_download(place_name, place_id, place_image_url):

In [None]:
import requests

def image_download(place_name, place_id, place_image_url):    
    response = requests.get(place_image_url)
    name = f'{place_name}_{place_id}'
    # A slash ('/') in the name would be read as a directory separator,
    # so replace it with a hyphen.
    if '/' in name:
        name = name.replace('/', '-')
    with open("./data/220121/서귀포_숙박_이미지/{}.png".format(name), "wb") as f:
        f.write(response.content)

- Load the data

In [None]:
import pandas as pd

accommodation_all = pd.read_excel('./data/220121/_서귀포_숙박_최종.xlsx', index_col=False)

In [None]:
accommodation_all.head(2)

Unnamed: 0,address_name,category_group_code,category_group_name,category_name,distance,id,phone,place_name,place_url,road_address_name,x,y,keyword,rating,grade,image
0,제주특별자치도 서귀포시 법환동 745-2,AD5,숙박,여행 > 숙박 > 호텔,,1971607879,064-738-0009,타마라 제주,http://place.map.kakao.com/1971607879,제주특별자치도 서귀포시 김정문화로41번길 10-8,126.510122,33.251846,호텔,4.1,호텔,https://img1.kakaocdn.net/relay/local/R640x320...
1,제주특별자치도 서귀포시 법환동 745-6,AD5,숙박,여행 > 숙박 > 호텔,,137464688,064-738-7077,브릿지레지던스호텔,http://place.map.kakao.com/137464688,제주특별자치도 서귀포시 김정문화로 49,126.51,33.25151,호텔,4.0,호텔,https://img1.kakaocdn.net/relay/local/R640x320...


- Save the images

In [None]:
import pandas as pd
import time
from tqdm.notebook import tqdm

for idx in tqdm(range(len(accommodation_all))):
    p_name, p_id, p_image = accommodation_all.loc[idx, ['place_name', 'id', 'image']]
    
    image_download(p_name, p_id, p_image)
    time.sleep(0.5)

  0%|          | 0/435 [00:00<?, ?it/s]