### - Feature vector extraction test

- Uses VGGNet
- Reduces the features to N dimensions with PCA
- Running with `tensorflow` raised an error; running with `keras` worked
  - The cause is unknown
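
The PCA step listed above is not shown in this section. The following is a minimal sketch of how it might look, assuming scikit-learn's `PCA`; the random array stands in for stacked feature vectors, and `n_dims` and the array sizes are illustrative assumptions rather than values from these notes. PCA needs several samples at once, because the number of components cannot exceed `min(n_samples, n_features)`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for stacked feature vectors: 10 images x 512 features.
# (Random data is used only so the sketch runs without the model.)
rng = np.random.default_rng(0)
features = rng.random((10, 512)).astype(np.float32)

# Reduce to N dimensions; N must not exceed min(n_samples, n_features).
n_dims = 4  # illustrative choice, not a value from the notes
pca = PCA(n_components=n_dims)
reduced = pca.fit_transform(features)

print(reduced.shape)
```

With real data, `features` would be the per-image vectors stacked row by row, for example with `np.vstack`.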

#### 1. Building the model

##### ◽Loading the model: pretrained model (VGG16)

In [None]:
import cv2
from matplotlib import pyplot
from keras.models import Model
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import img_to_array

# Load the pretrained model
base_model = VGG16(weights='imagenet')
# Print the model summary
base_model.summary()

Model: "vgg16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 224, 224, 3)]     0         
                                                                 
 block1_conv1 (Conv2D)       (None, 224, 224, 64)      1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 224, 224, 64)      36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 112, 112, 64)      0         
                                                                 
 block2_conv1 (Conv2D)       (None, 112, 112, 128)     73856     
                                                                 
 block2_conv2 (Conv2D)       (None, 112, 112, 128)     147584    
                                                                 
 block2_pool (MaxPooling2D)  (None, 56, 56, 128)       0     

##### ◽Truncating the model: keep layers up to the feature vector

In [None]:
# Build the feature-vector extraction model
model = Model(inputs = base_model.input, outputs = base_model.get_layer('flatten').output)
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 224, 224, 3)]     0         
                                                                 
 block1_conv1 (Conv2D)       (None, 224, 224, 64)      1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 224, 224, 64)      36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 112, 112, 64)      0         
                                                                 
 block2_conv1 (Conv2D)       (None, 112, 112, 128)     73856     
                                                                 
 block2_conv2 (Conv2D)       (None, 112, 112, 128)     147584    
                                                                 
 block2_pool (MaxPooling2D)  (None, 56, 56, 128)       0     

#### 2. Feature vector test

In [None]:
image1 = cv2.imread('1.png')
image2 = cv2.imread('2.png')
image3 = cv2.imread('3.png')
image4 = cv2.imread('4.png')

image1 = img_to_array(image1)
image2 = img_to_array(image2)
image3 = img_to_array(image3)
image4 = img_to_array(image4)

image1 = image1.reshape((1, image1.shape[0], image1.shape[1], image1.shape[2]))
image2 = image2.reshape((1, image2.shape[0], image2.shape[1], image2.shape[2]))
image3 = image3.reshape((1, image3.shape[0], image3.shape[1], image3.shape[2]))
image4 = image4.reshape((1, image4.shape[0], image4.shape[1], image4.shape[2]))

In [None]:
image1 = preprocess_input(image1)
feature_vector1 = model.predict(image1)

In [None]:
feature_vector1

array([[0.       , 0.       , 0.       , ..., 0.       , 0.       ,
        3.0114553]], dtype=float32)

In [None]:
image2 = preprocess_input(image2)
feature_vector2 = model.predict(image2)

In [None]:
feature_vector2

array([[ 0.       ,  0.       ,  0.       , ...,  0.       , 12.084741 ,
         0.8450042]], dtype=float32)

In [None]:
image3 = preprocess_input(image3)
feature_vector3 = model.predict(image3)

In [None]:
feature_vector3

array([[0.       , 0.       , 0.       , ..., 0.       , 7.9911184,
        0.       ]], dtype=float32)

In [None]:
image4 = preprocess_input(image4)
feature_vector4 = model.predict(image4)

In [None]:
feature_vector4

array([[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

#### 3. Cosine similarity test

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(feature_vector1, [feature_vector1[0], feature_vector2[0], feature_vector3[0], feature_vector4[0]])

array([[1.0000002 , 0.15940237, 0.10766608, 0.4710593 ]], dtype=float32)
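
For reference, the quantity computed above is the dot product of two vectors divided by the product of their norms. A minimal NumPy sketch of that formula follows; `cosine_sim` is a hypothetical helper, not part of scikit-learn, and it does not guard against zero-length vectors.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors: a.b / (|a| |b|)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_sim([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])
print(same, orthogonal)
```

A vector compared with itself should give 1; a value such as `1.0000002` in the output above is consistent with float32 rounding.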

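### - Urgent) Acquiring Seogwipo data

- Reviewing the collected data showed that the 'Seogwipo' data is missing
- Collect only the Seogwipo portion anew

#### 1. Data acquisition

##### ◽Category, keyword, and area variables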

In [None]:
keywords = ['맛집', '분위기 좋은', '테마파크' ,'오션뷰', '감성', '가족여행', '체험', '휴식', '레포츠', '가볼만한 곳']
categorys = ['CT1', 'AT4', 'FD6', 'CE7']
categorys_info = {'CT1' : '문화시설', 'AT4' : '관광명소', 'FD6' : '음식점', 'CE7' : '카페'}

In [None]:
jeju_range = ['법환동', '서호동', '호근동', '강정동', '도순동', '영남동', '월평동', '동홍동', '서홍동', '보목동','서귀동','토평동', '상효동', '상예동', '색달동', '하예동', '대포동', '중문동', '하원동', '회수동', '신효동', '하효동']

##### ◽Functions using the Kakao API: reused functions

- search_result(keyword, category, jeju_name)
  - Search function for a (category, legal dong/ri) combination

In [None]:
import json
import requests

def search_result(keyword, category, jeju_name):
    result = []

    # REST API key
    rest_api_key = '63d0926cf9b14de298157081ba8a8d02'
    # Request headers
    headers = {"Authorization" : "KakaoAK {}".format(rest_api_key)}
    # Query parameters
    params = {"query" : f"제주특별자치도 {jeju_name} {keyword}", "page" : 1, "category_group_code" : f"{category}"}
    url = "https://dapi.kakao.com/v2/local/search/keyword.json"

    while True:
        # Fetch one page of results with GET
        res = requests.get(url, headers=headers, params=params)
        if res.status_code != 200:
            # Stop instead of repeating the same failing request forever
            break
        # Parse the JSON response body
        doc = json.loads(res.text)
        result.extend(doc['documents'])
        if doc['meta']['is_end']:
            break
        params['page'] += 1
    return result
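
The pagination in `search_result` relies on the `is_end` flag in the response metadata and has no upper bound on the page number. The sketch below isolates that loop and adds a cap so it can be exercised without network access. `collect_pages`, `fetch_page`, and `max_pages` are hypothetical names introduced for illustration, and the cap value is an assumption.

```python
def collect_pages(fetch_page, max_pages=45):
    """Collect documents page by page until the API reports the last page.

    fetch_page(page) is a hypothetical callable returning a dict shaped like the
    response used above: {'documents': [...], 'meta': {'is_end': bool}}.
    max_pages is an assumed safety cap, not a value taken from the notes.
    """
    results = []
    for page in range(1, max_pages + 1):
        doc = fetch_page(page)
        results.extend(doc['documents'])
        if doc['meta']['is_end']:
            break
    return results

# Fake three-page response used only to exercise the loop
def fake_fetch(page):
    return {'documents': [f'place-{page}'], 'meta': {'is_end': page == 3}}

places = collect_pages(fake_fetch)
print(places)
```

Passing the fetch step in as a function keeps the stopping logic testable separately from the HTTP call.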

- search_df()
  - Function that returns a single DataFrame with all search results

In [None]:
import pandas as pd
from tqdm.notebook import tqdm

def search_df():
    results = []
    for idx, jeju in tqdm(enumerate(jeju_range)):
        for category in categorys:
            for key in keywords:
                r = pd.DataFrame(search_result(key, category, jeju))
                r['keyword'] = key
                results.append(r.copy())
    return pd.concat(results).reset_index(drop=True)
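
The three nested loops in `search_df` visit every (area, category, keyword) combination; with the lists defined earlier (22 areas, 4 categories, 10 keywords) that is 880 searches. A minimal sketch of the same iteration with `itertools.product` follows, using short stand-in lists (`areas`, `codes`, `words` are hypothetical names).

```python
from itertools import product

# Short stand-in lists; the notebook's jeju_range, categorys, and keywords
# would be used in their place.
areas = ['법환동', '서호동']
codes = ['CT1', 'AT4']
words = ['맛집', '오션뷰', '체험']

# One flat list of (area, category, keyword) combinations instead of three nested loops
combos = list(product(areas, codes, words))
print(len(combos))  # number of searches that would be issued

for area, code, word in combos[:2]:
    print(area, code, word)
```

Because `combos` is a list with a known length, wrapping it in `tqdm` would let the progress bar show a total, which `tqdm(enumerate(...))` cannot do.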

##### ◽Acquiring data with the Kakao API

In [None]:
jeju_poi_additional = search_df()

0it [00:00, ?it/s]

In [None]:
# jeju_poi_additional.to_excel('./data/220119/제주_POI_서귀포.xlsx',index=False)

In [None]:
jeju_poi_additional.head(2)

Unnamed: 0,keyword,address_name,category_group_code,category_group_name,category_name,distance,id,phone,place_name,place_url,road_address_name,x,y
0,테마파크,제주특별자치도 서귀포시 법환동 877-3,AT4,관광명소,"여행 > 관광,명소 > 테마파크",,10472331,064-739-8254,세리월드,http://place.map.kakao.com/10472331,제주특별자치도 서귀포시 법환상로2번길 97-17,126.511874293757,33.2470161809819
1,테마파크,제주특별자치도 서귀포시 법환동 914,AT4,관광명소,"여행 > 관광,명소 > 테마파크 > 워터테마파크",,17150892,064-739-1930,제주워터월드,http://place.map.kakao.com/17150892,제주특별자치도 서귀포시 월드컵로 33,126.50854558896376,33.24550727132407


##### ◽Checking the data (제주_POI_서귀포)

- Duplicates on (keyword, id) were removed in Excel
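
The deduplication was done by hand in Excel. An equivalent step in pandas might look like the sketch below, which uses a toy DataFrame with the same column names; it is an alternative under those assumptions, not the procedure actually used here.

```python
import pandas as pd

# Toy rows with the same column names as the collected data
df = pd.DataFrame({
    'keyword': ['맛집', '맛집', '오션뷰', '맛집'],
    'id': [101, 101, 101, 202],
    'place_name': ['A', 'A', 'A', 'B'],
})

# Keep the first row for each (keyword, id) pair
deduped = df.drop_duplicates(subset=['keyword', 'id'], keep='first').reset_index(drop=True)
print(deduped)
```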

In [None]:
import pandas as pd

jeju_poi_additional = pd.read_excel('./data/220119/제주_POI_서귀포.xlsx', index_col=False)

In [None]:
jeju_poi_additional.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4157 entries, 0 to 4156
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   keyword              4157 non-null   object 
 1   address_name         4157 non-null   object 
 2   category_group_code  4157 non-null   object 
 3   category_group_name  4157 non-null   object 
 4   category_name        4157 non-null   object 
 5   distance             0 non-null      float64
 6   id                   4157 non-null   int64  
 7   phone                2510 non-null   object 
 8   place_name           4157 non-null   object 
 9   place_url            4157 non-null   object 
 10  road_address_name    3461 non-null   object 
 11  x                    4157 non-null   float64
 12  y                    4157 non-null   float64
dtypes: float64(3), int64(1), object(9)
memory usage: 422.3+ KB


In [None]:
pd.DataFrame(jeju_poi_additional['keyword'].value_counts())

Unnamed: 0,keyword
맛집,1980
가볼만한 곳,990
분위기 좋은,843
테마파크,175
오션뷰,76
감성,56
가족여행,25
체험,10
레포츠,2


#### 2. Handling duplicate ids: merging keywords

##### ◽Checking for duplicate ids

In [None]:
import pandas as pd

jeju_poi_additional = pd.read_excel('./data/220119/제주_POI_서귀포.xlsx', index_col=False)

In [None]:
# Compare the row count with the number of unique ids
len(jeju_poi_additional), len(jeju_poi_additional['id'].unique())

(4157, 1708)
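
The two numbers above imply 4157 - 1708 = 2449 rows beyond the first occurrence of each id. A minimal sketch of how those counts can be read directly from a DataFrame follows, using a toy `id` column.

```python
import pandas as pd

df = pd.DataFrame({'id': [101, 101, 101, 202, 303, 303]})

n_rows = len(df)
n_unique = df['id'].nunique()                          # distinct ids
n_extra = int(df['id'].duplicated().sum())             # rows beyond the first per id
n_dup_ids = int((df['id'].value_counts() > 1).sum())   # ids that occur more than once

print(n_rows, n_unique, n_extra, n_dup_ids)
```

By construction, `n_extra` always equals `n_rows - n_unique`.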

In [None]:
jeju_poi_additional.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4157 entries, 0 to 4156
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   keyword              4157 non-null   object 
 1   address_name         4157 non-null   object 
 2   category_group_code  4157 non-null   object 
 3   category_group_name  4157 non-null   object 
 4   category_name        4157 non-null   object 
 5   distance             0 non-null      float64
 6   id                   4157 non-null   int64  
 7   phone                2510 non-null   object 
 8   place_name           4157 non-null   object 
 9   place_url            4157 non-null   object 
 10  road_address_name    3461 non-null   object 
 11  x                    4157 non-null   float64
 12  y                    4157 non-null   float64
dtypes: float64(3), int64(1), object(9)
memory usage: 422.3+ KB


##### ◽Handling duplicate ids: merging keywords

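- When a lookup by id returns two or more rows:
  - Merge the keywords of those rows and keep only one row.
  - included : a dict that records each duplicated id once it has been processed, so it is not handled again
  - del_index : the indices to delete, since only one row per duplicated id is kept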
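
The same merge can also be expressed with `groupby`, shown here as a minimal sketch on a toy DataFrame. It is an alternative formulation, not the code used in this notebook: it removes repeated keywords by exact match, whereas the loop below uses a substring check.

```python
import pandas as pd

df = pd.DataFrame({
    'keyword': ['테마파크', '맛집', '가볼만한 곳', '테마파크'],
    'id': [1, 2, 1, 1],
    'place_name': ['A', 'B', 'A', 'A'],
})

# Join the distinct keywords of each id, preserving first-seen order
merged_keywords = (
    df.groupby('id', sort=False)['keyword']
      .agg(lambda s: ','.join(dict.fromkeys(s)))
)

# Keep the first row per id and attach the merged keyword string
merged = df.drop_duplicates(subset='id', keep='first').copy()
merged['keyword'] = merged['id'].map(merged_keywords)
merged = merged.reset_index(drop=True)
print(merged)
```

This avoids filtering the whole DataFrame once per row, which matters as the row count grows.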

In [None]:
included = {}
del_index = []
for idx, row in jeju_poi_additional.iterrows():
    id = row['id']
    temp = jeju_poi_additional[jeju_poi_additional['id'] == id].copy()
    cnt = len(temp)
    if id not in included and cnt > 1:
        included[id] = True
        for i, r in temp.iterrows():
            if idx == i:
                continue
            else:
                if r['keyword'] not in jeju_poi_additional.loc[idx, 'keyword']:
                    jeju_poi_additional.loc[idx, 'keyword'] = jeju_poi_additional.loc[idx, 'keyword'] + ',' + r['keyword']
                del_index.append(i)

In [None]:
# Number of duplicated ids, number of indices to delete
len(included.keys()), len(del_index)

(652, 2449)

In [None]:
# Drop the extra rows of duplicated ids
jeju_poi_additional_del = jeju_poi_additional.drop(del_index, axis=0)

- The number of rows equals the number of unique ids, which confirms that no duplicate ids remain.

In [None]:
len(jeju_poi_additional_del['id'].unique()), len(jeju_poi_additional_del)

(1708, 1708)

In [None]:
# jeju_poi_additional_del.to_excel('./data/220119/제주_POI_서귀포_키워드묶음.xlsx', index=False)

##### ◽Checking the data after removing duplicate ids

In [None]:
import pandas as pd

jeju_poi_add = pd.read_excel('./data/220119/제주_POI_서귀포_키워드묶음.xlsx', index_col=False)

In [None]:
jeju_poi_add.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1708 entries, 0 to 1707
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   keyword              1708 non-null   object 
 1   address_name         1708 non-null   object 
 2   category_group_code  1708 non-null   object 
 3   category_group_name  1708 non-null   object 
 4   category_name        1708 non-null   object 
 5   distance             0 non-null      float64
 6   id                   1708 non-null   int64  
 7   phone                1143 non-null   object 
 8   place_name           1708 non-null   object 
 9   place_url            1708 non-null   object 
 10  road_address_name    1498 non-null   object 
 11  x                    1708 non-null   float64
 12  y                    1708 non-null   float64
dtypes: float64(3), int64(1), object(9)
memory usage: 173.6+ KB


In [None]:
pd.DataFrame(jeju_poi_add['category_group_name'].value_counts())

Unnamed: 0,category_group_name
음식점,866
카페,559
관광명소,279
문화시설,4


#### 3. Handling already-collected ids

- The data collected earlier through the API may already contain some of these rows.
- Identify those rows and remove them.

##### ◽Checking the data

In [None]:
import pandas as pd

jeju_poi_add = pd.read_excel('./data/220119/제주_POI_서귀포_키워드묶음.xlsx', index_col=False)
jeju_poi = pd.read_excel('./data/220119/_제주도_POI_컬럼 정리.xlsx', index_col = 0)

- Seogwipo data

In [None]:
jeju_poi_add.head(1)

Unnamed: 0,keyword,address_name,category_group_code,category_group_name,category_name,distance,id,phone,place_name,place_url,road_address_name,x,y
0,"테마파크,가볼만한 곳",제주특별자치도 서귀포시 법환동 877-3,AT4,관광명소,"여행 > 관광,명소 > 테마파크",,10472331,064-739-8254,세리월드,http://place.map.kakao.com/10472331,제주특별자치도 서귀포시 법환상로2번길 97-17,126.511874,33.247016


- Jeju Island data

In [None]:
jeju_poi.head(1)

Unnamed: 0,keyword,address_name,category_group_name,category_name,id,place_name,x,y,rating
0,테마파크,제주특별자치도 제주시 연동 1320,관광명소,"문화,예술 > 문화시설 > 박물관",26388484,수목원테마파크 아이스뮤지엄,126.488398,33.470777,1.0


##### ◽Removing already-collected data

- Rows whose id is already present in jeju_poi are removed from the new data.

In [None]:
included_id = list(jeju_poi['id'].unique())
add_id = list(jeju_poi_add['id'].unique())

for ID in add_id:
    if ID in included_id:
        jeju_poi_add = jeju_poi_add[jeju_poi_add['id'] != ID].copy()
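
The loop above re-filters the DataFrame once per matching id. A minimal sketch of the same filter as a single vectorized step with `isin` follows; `existing` and `new_rows` are toy stand-ins for `jeju_poi` and `jeju_poi_add`.

```python
import pandas as pd

existing = pd.DataFrame({'id': [1, 2, 3]})  # stand-in for jeju_poi
new_rows = pd.DataFrame({'id': [2, 4, 5], 'place_name': ['B', 'D', 'E']})  # stand-in for jeju_poi_add

# Keep only rows whose id is not already present in the existing data
filtered = new_rows[~new_rows['id'].isin(existing['id'])].reset_index(drop=True)
removed = len(new_rows) - len(filtered)
print(filtered['id'].tolist(), removed)
```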

- 128 rows were removed (1708 → 1580).

In [None]:
len(jeju_poi_add), len(add_id)

(1580, 1708)

In [None]:
pd.DataFrame(jeju_poi_add['category_group_name'].value_counts())

Unnamed: 0,category_group_name
음식점,832
카페,514
관광명소,231
문화시설,3


In [None]:
# jeju_poi_add.to_excel('./data/220119/_서귀포_POI_최종RAW.xlsx', index=False)

### - Natural language processing test

- Natural language processing
- In progress

#### 1. Reading the data

In [None]:
import pandas as pd

jeju_poi = pd.read_excel('./data/220119/_제주도_POI_컬럼 정리.xlsx', index_col=0)

In [None]:
jeju_poi.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2306 entries, 0 to 5740
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   keyword              2306 non-null   object 
 1   address_name         2306 non-null   object 
 2   category_group_name  2306 non-null   object 
 3   category_name        2306 non-null   object 
 4   id                   2306 non-null   int64  
 5   place_name           2306 non-null   object 
 6   x                    2306 non-null   float64
 7   y                    2306 non-null   float64
 8   rating               2306 non-null   float64
dtypes: float64(3), int64(1), object(5)
memory usage: 180.2+ KB


In [None]:
jeju_poi.head()

Unnamed: 0,keyword,address_name,category_group_name,category_name,id,place_name,x,y,rating
0,테마파크,제주특별자치도 제주시 연동 1320,관광명소,"문화,예술 > 문화시설 > 박물관",26388484,수목원테마파크 아이스뮤지엄,126.488398,33.470777,1.0
1,"테마파크,가볼만한 곳",제주특별자치도 제주시 애월읍 신엄리 2880-12,관광명소,"여행 > 관광,명소 > 테마파크",1129394481,고스트타운,126.356936,33.476195,3.4
2,"테마파크,가볼만한 곳",제주특별자치도 제주시 애월읍 유수암리 1083,관광명소,"여행 > 관광,명소 > 테마파크",891104398,제주불빛정원,126.409179,33.422294,4.1
3,"테마파크,가볼만한 곳",제주특별자치도 제주시 애월읍 어음리 산 131-3,관광명소,"여행 > 관광,명소 > 테마파크",1868828759,9.81파크,126.366664,33.39029,3.6
4,"테마파크,가볼만한 곳",제주특별자치도 제주시 애월읍 광령리 2698,관광명소,"여행 > 관광,명소 > 테마파크",9401924,제주공룡랜드,126.433869,33.441255,2.2


#### 2. Selecting the text fields

- Use keyword and category_group_name

In [None]:
keywords = set()
for keyword in jeju_poi['keyword'].unique():
    for k in keyword.split(','):
        keywords.add(k)
keywords = list(keywords)

In [None]:
keywords

['가볼만한 곳', '테마파크', '감성', '맛집', '분위기 좋은', '오션뷰', '레포츠', '가족여행', '체험']
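
The set-based loop above yields the keywords in arbitrary order. A minimal pandas sketch that produces the same distinct values in first-seen order is shown below on a toy `keyword` column.

```python
import pandas as pd

df = pd.DataFrame({'keyword': ['테마파크', '테마파크,가볼만한 곳', '맛집,오션뷰', '맛집']})

# Split on commas, expand to one keyword per row, then take distinct values
unique_keywords = df['keyword'].str.split(',').explode().unique().tolist()
print(unique_keywords)
```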

In [None]:
groups = []
for category in jeju_poi['category_group_name'].unique():
    groups.append(category)

In [None]:
groups

['관광명소', '음식점', '카페']

In [None]:
jeju_poi[['id', 'place_name', 'keyword', 'category_group_name']].head(2)

Unnamed: 0,id,place_name,keyword,category_group_name
0,26388484,수목원테마파크 아이스뮤지엄,테마파크,관광명소
1,1129394481,고스트타운,"테마파크,가볼만한 곳",관광명소

