## 3. Tourist Attraction Recommendation

### 3.1 Input the New User's Information Vector

User Information Vector Description

- libsvm format file: USER_ID column1:value, …, columnN:value

| USER_ID | GENDER | AGE_GRP | TRAVEL_STYL_1 | TRAVEL_STYL_5 | TRAVEL_STYL_6 |
| --- | --- | --- | --- | --- | --- |
| 1 | 1:0 | 2:40 | 3:4 | 4:2 | 5:6 |

- Input: libsvm format file

- Output: all_user numpy array of shape [user_number, feature_num], e.g. (11689 × 5)
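
The mapping from a libsvm line to a feature row can be sketched in plain Python. This is only an illustration: the helper names `parse_libsvm_line` and `to_dense_rows` are hypothetical, and the sketch assumes feature indices start at 1, as in the table above. The notebook itself loads the file with scikit-learn's `load_svmlight_file`.

```python
def parse_libsvm_line(line):
    """Split one 'USER_ID idx:value ...' line into (user_id, {idx: value})."""
    tokens = line.split()
    user_id = int(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return user_id, features


def to_dense_rows(lines, feature_num):
    """Build a [user_number, feature_num] list of rows; missing indices stay 0.0."""
    rows = []
    for line in lines:
        _, features = parse_libsvm_line(line)
        # Assumed 1-based feature indices, shifted to 0-based columns.
        rows.append([features.get(i + 1, 0.0) for i in range(feature_num)])
    return rows


sample = ["1 1:0 2:40 3:4 4:2 5:6"]  # the example row from the table above
all_user = to_dense_rows(sample, feature_num=5)
print(all_user)  # [[0.0, 40.0, 4.0, 2.0, 6.0]]
```

Wrapping the result in `np.asarray(all_user)` would give the numpy array described as the output of this step.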

### 3.2 Extract the Top K Existing Users Most Similar to the New User

- Input: all_user numpy array

- Output: indices (USER_ID) of the top K similar users
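
A minimal sketch of this step, assuming cosine similarity is the similarity measure (as in the notebook's code). The function name `top_k_similar` and the toy data are hypothetical. Unlike the notebook's method, this sketch does not skip the single best match.

```python
import numpy as np


def top_k_similar(new_user, all_user, k):
    """Return indices and scores of the k rows of all_user most cosine-similar to new_user."""
    all_user = np.asarray(all_user, dtype=float)
    new_user = np.asarray(new_user, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(all_user, axis=1) * np.linalg.norm(new_user)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    similarities = all_user @ new_user / norms
    # argsort is ascending, so take the last k entries and reverse them.
    order = np.argsort(similarities, kind="stable")[-k:][::-1]
    return order, similarities[order]


# Hypothetical toy data: three existing users with two features each.
all_user = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indices, scores = top_k_similar([1.0, 0.1], all_user, k=2)
print(indices)  # [0 2]
```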

### 3.3 Extract the Top K Ratings of the Similar Users

- Input: similar USER_IDs

- Output: VISIT_AREA_IDs for the top K ratings
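
The step can be sketched with numpy, assuming each rating is the dot product of a user embedding and an item embedding (the notebook's code computes it the same way with torch). The function name `top_k_items` and the embedding values are hypothetical.

```python
import numpy as np


def top_k_items(user_embeddings, item_embeddings, k):
    """Score every item for each user and return the k best item indices per user."""
    U = np.asarray(user_embeddings, dtype=float)
    V = np.asarray(item_embeddings, dtype=float)
    ratings = U @ V.T  # shape: [num_users, num_items]
    # Sort each row by descending score and keep the first k columns.
    top_idx = np.argsort(-ratings, axis=1, kind="stable")[:, :k]
    top_val = np.take_along_axis(ratings, top_idx, axis=1)
    return top_idx, top_val


# Hypothetical embeddings: 2 similar users, 3 visit areas, 2 latent factors.
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
top_idx, top_val = top_k_items(U, V, k=2)
print(top_idx)
print(top_val)
```

Row `i` of `top_idx` holds the item indices recommended by similar user `i`. The same area can appear in several rows, so the combined list may contain duplicates.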

### 3.4 Output Information for the Final VISIT_AREA_IDs

- Input: VISIT_AREA_IDs

- Output: place name, coordinates (X, Y), and address (province/city, county/district) for each VISIT_AREA_ID
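
A small sketch of the lookup with pandas, using an in-memory table instead of the CSV file. The column names follow the notebook's output, and the two rows reuse values from it. The id list is hypothetical, and the de-duplication is an optional extra that the notebook's method does not perform.

```python
import pandas as pd

# In-memory stand-in for the area information file (values taken from the notebook's output).
area_info = pd.DataFrame(
    {
        "ITEM_ID": [907, 1770],
        "VISIT_AREA_NM": ["광주미디어아트플랫폼 GMAP", "노량진컵밥거리"],
        "X_COORD": [126.909293, 126.9453904],
        "Y_COORD": [35.14843757, 37.51360586],
    }
)

# Hypothetical id list; duplicates can occur when several similar users share an area.
visit_area_ids = [907, 4529, 907]
unique_ids = list(dict.fromkeys(visit_area_ids))  # de-duplicate, keeping first-seen order

# Keep only rows whose ITEM_ID was recommended, one row per area.
result = area_info[area_info["ITEM_ID"].isin(unique_ids)].drop_duplicates("ITEM_ID")
print(result)
```

Ids with no row in the table (4529 here) are dropped silently by `isin`, so it can be worth comparing `unique_ids` with `result["ITEM_ID"]` to see which ones had no match.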

In [274]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [286]:
# Libraries shared by all methods
import numpy as np

# Libraries for the 3.1 method
from sklearn.utils.extmath import randomized_svd
from sklearn import preprocessing as prep
from sklearn import datasets
import scipy.sparse as sp
import scipy

# Libraries for the 3.2 method
from sklearn.metrics.pairwise import cosine_similarity

# Libraries for the 3.3 method
import torch

# Libraries for the 3.4 method
import pandas as pd

In [287]:
def tfidf(x):
  """
  compute tfidf of numpy array x
  :param x: input array, document by terms
  :return:
  """
  x_idf = np.log(x.shape[0] - 1) - np.log(1 + np.asarray(np.sum(x > 0, axis=0)).ravel())
  x_idf = np.asarray(x_idf)
  x_idf_diag = scipy.sparse.lil_matrix((len(x_idf), len(x_idf)))
  x_idf_diag.setdiag(x_idf)
  x_tf = x.tocsr()
  x_tf.data = np.log(x_tf.data + 1)
  x_tfidf = x_tf * x_idf_diag
  return x_tfidf

def prep_standardize(x):
  """
  takes sparse input and compute standardized version

  Note:
    cap at 5 std

  :param x: 2D scipy sparse data array to standardize (column-wise), must support row indexing
  :return: the object to perform scale (stores mean/std) for inference, as well as the scaled x
  """
  x_nzrow = x.any(axis=1)
  scaler = prep.StandardScaler().fit(x[x_nzrow, :])
  x_scaled = np.copy(x)
  x_scaled[x_nzrow, :] = scaler.transform(x_scaled[x_nzrow, :])
  x_scaled[x_scaled > 5] = 5
  x_scaled[x_scaled < -5] = -5
  x_scaled[np.absolute(x_scaled) < 1e-5] = 0
  return scaler, x_scaled

In [288]:
# 3.0.1 libsvm data --> vector
def convert_libsvm_to_vector(libsvm_data):
  libsvm_data = tfidf(libsvm_data)

  u, s, _ = randomized_svd(libsvm_data, n_components=300, n_iter=5)
  libsvm_data = u * s
  _, libsvm_data = prep_standardize(libsvm_data)

  if sp.issparse(libsvm_data):
    vectors = libsvm_data.tolil(copy=False)
  else:
    vectors = libsvm_data

  return vectors

In [289]:
# 3.0.2 new user array --> libsvm format file
def convert_new_user_to_libsvm(new_user, save_dir):
  """
  Convert a new user's raw values into a libsvm-format line and save it,
  together with the existing users, to new_user_libsvm.txt

  Input
  - new_user: new_user ndarray --> ex) [UserID, GENDER, AGE_GRP, TRAVEL_STYL_1, TRAVEL_STYL_5, TRAVEL_STYL_6]
  - save_dir: directory in which to save the new_user_libsvm.txt file

  Output: the new user's line as written to new_user_libsvm.txt --> ex) "1 1:0 2:40 3:4 4:2 5:6"
  """

  # Split the label from the feature values and start the libsvm_format string
  label = new_user[0]
  features = new_user[1:]
  libsvm_format = f"{label} "

  # Append index:value pairs in libsvm format
  for i, feature_value in enumerate(features):
    libsvm_format += f"{i+1}:{feature_value} "
  libsvm_format = libsvm_format.strip()

  output_dir = save_dir + "/new_user_libsvm.txt" # output file path

  origin_libsvm_dir = "/content/drive/MyDrive/Colab Notebooks/Capstone Design 2023 2nd/DropoutNet/Dataset/all_travel_log/user_features_0based.txt"

  # Write the result: a "0" line, the new user, then every original line except the first
  with open(output_dir, "w") as output_file:
    output_file.write("0\n")
    output_file.write(f"{libsvm_format}\n")
    with open(origin_libsvm_dir) as origin_libsvm_file:
      for i, line in enumerate(origin_libsvm_file):
        if i != 0:
          output_file.write(f"{line.strip()}\n")

  return libsvm_format

In [290]:
# 3.1 create_standardized_user_vectors method
def create_standardized_user_vectors(libsvm_dir):
  """
  Convert user information stored in a libsvm-format file
  into a NumPy array

  Input: path to the libsvm format file
  Output: array of vectorized old_user values
  """
  # 3.1.1 Load the libsvm format file
  user_content, _ = datasets.load_svmlight_file(libsvm_dir, zero_based=True, dtype=np.float32)

  # 3.1.2 Convert libsvm format --> standardized vectors
  old_user_vectors = convert_libsvm_to_vector(user_content)

  return old_user_vectors

In [291]:
# 3.2 get_top_K_similar_user method
def get_top_K_similar_user(new_user_raw_vector, old_user_vectors, K):
  """
  Compute the similarity between new_user and every old_user,
  then return the top K most similar old users

  Input
  - new_user_raw_vector: raw values of the new user (list or numpy array)
  - old_user_vectors: output of create_standardized_user_vectors
  - K: number of similar users to extract

  Output
  - top_K_similar_user_index: indices of the top K old_users most similar to new_user
  """
  # 3.2.1 Convert new_user to libsvm format
  new_user_libsvm = convert_new_user_to_libsvm(new_user_raw_vector, "./")

  # 3.2.2 Convert new_user_libsvm to standardized vectors
  new_user_vectors = create_standardized_user_vectors("./new_user_libsvm.txt")

  # 3.2.3 Extract only the new_user row
  new_user_vector = new_user_vectors[0].reshape(1, -1)

  # 3.2.4 Compute cosine similarity between new_user and the old users
  similarities = cosine_similarity(new_user_vector, old_user_vectors)
  print("3.2.4")
  print(similarities)
  print(similarities.shape)

  # 3.2.5 Extract the top K similar users:
  # take the K+1 highest scores in descending order, then drop the single highest one
  top_K_similar_user_index = similarities.argsort()[0][-(K+1):][::-1][1:]

  # 3.2.6 Print the method result
  print("[Method: get_similar_users()] - Top K개 Similar User Index")
  for idx in top_K_similar_user_index:
    print(f"- Similar User {idx}: {similarities[0][idx]}")

  return top_K_similar_user_index

In [292]:
# 3.3 get_top_K_travel_area method
def get_top_K_travel_area(embedding_dir, top_K_similar_user_index, K):
  """
  Return the indices and ratings of the top K recommended tourist areas

  Input
  - embedding_dir: directory of the saved embedding vectors used to compute the rating matrix
  - top_K_similar_user_index: list of the top K similar user indices
  - K: number of top-rated areas to consider per similar user

  Output
  - top_K_recommend_area_index: list of the extracted recommended area indices
  - top_K_recommend_area_value: ratings of those areas, in the same order
  """
  # 3.3.0 Result containers
  top_K_recommend_area_value = []
  top_K_recommend_area_index = []

  # 3.3.1 Load U_embedding and V_embedding from embedding_dir
  U_embedding = np.loadtxt(embedding_dir + "/U_embedding.txt")
  V_embedding = np.loadtxt(embedding_dir + "/V_embedding.txt")

  top_K_U_embedding = U_embedding[top_K_similar_user_index]
  print("3.3.1")
  print(U_embedding)
  print(V_embedding)
  print(U_embedding.shape)
  print(V_embedding.shape)

  # 3.3.2 Compute the rating matrix
  top_K_U_embedding = torch.tensor(top_K_U_embedding)
  V_embedding = torch.tensor(V_embedding)

  rating_matrix = torch.matmul(top_K_U_embedding, V_embedding.t())
  print("3.3.2")
  print(rating_matrix.shape)

  # 3.3.3 Take the ratings of each user in top_K_similar_user_index
  print("3.3.3")
  print(top_K_similar_user_index)
  for user_ratings in rating_matrix:
    print(user_ratings)

    top_K_value, top_K_index = torch.topk(user_ratings, K)
    print(top_K_index)
    print(top_K_value)

    # 3.3.4 For each similar user, keep the top K visit area ids whose rating exceeds the threshold
    for idx, value in enumerate(top_K_value):
      threshold = 0.5
      if value > threshold:
        top_K_recommend_area_value.append(value.item())
        top_K_recommend_area_index.append(top_K_index[idx].item())

  # 3.3.5 Print the visit_area_ids
  print("[Method: get_top_K_travel_area()] - Top K Recommended Area Index")
  for idx, value in enumerate(top_K_recommend_area_value):
    print(f"- Area {top_K_recommend_area_index[idx]}: {value}")

  return top_K_recommend_area_index, top_K_recommend_area_value

In [293]:
# 3.4 combine_area_index_and_information method
def combine_area_index_and_information(top_K_recommend_area_index, area_information_dir):
  """
  Look up the area information for each recommended area index

  Input
  - top_K_recommend_area_index: list of area indices to recommend
  - area_information_dir: path to the file containing the area information

  Output
  - top_K_area_info_df: dataframe of the area information matching top_K_recommend_area_index
  """
  # 3.4.1 Load the area_information file as a dataframe
  all_area_info_df = pd.read_csv(area_information_dir)

  # 3.4.2 Keep only the rows that match top_K_recommend_area_index
  top_K_area_info_df = all_area_info_df[all_area_info_df['ITEM_ID'].isin(top_K_recommend_area_index)]

  # 3.4.3 Return the final result
  return top_K_area_info_df

# Method Test Code

## 3.1 Input the New User's Information Vector

In [294]:
libsvm_dir = "/content/drive/MyDrive/Colab Notebooks/Capstone Design 2023 2nd/DropoutNet/Dataset/all_travel_log/user_features_0based.txt"
old_user_vectors = create_standardized_user_vectors(libsvm_dir)

In [295]:
print(old_user_vectors)

[[-1.27688529 -0.77142342 -0.02870358 -0.00257505 -0.01138389  0.        ]
 [-1.27688266  1.44403268 -0.10397885  1.47710307 -0.45107858  0.        ]
 [ 0.78315763 -0.39246277  0.25201573  0.23166496  0.49774332  0.        ]
 ...
 [ 0.78315734 -0.63310286 -0.90533309  1.8745665   0.0489453   0.        ]
 [-1.27688286  1.27387922  1.2606129  -1.70811783 -0.05915442  0.        ]
 [ 0.78315699 -0.93170867 -0.37266036 -0.59898242  0.91814445  0.        ]]


In [296]:
old_user_vectors[0].shape

(6,)

## 3.2 Extract the Top K Existing Users Most Similar to the New User

In [297]:
new_user_raw_vector = [100000, 0, 20, 2, 3, 2]
top_K_similar_user_index = get_top_K_similar_user(new_user_raw_vector, old_user_vectors, 10)

3.2.4
[[ 1.          0.14110448 -0.44756665 ... -0.1422205   0.14824793
  -0.11142396]]
(1, 11687)
[Method: get_similar_users()] - Top K개 Similar User Index
- Similar User 1908: 0.3411688915574379
- Similar User 11482: 0.3411688915574379
- Similar User 10678: 0.3411688915574379
- Similar User 10640: 0.3411688915574379
- Similar User 10588: 0.3411688915574379
- Similar User 2728: 0.3411688915574379
- Similar User 338: 0.3411688915574379
- Similar User 5844: 0.3411688915574379
- Similar User 11553: 0.3411688915574379
- Similar User 2830: 0.3411688915574379


In [298]:
print(top_K_similar_user_index)

[ 1908 11482 10678 10640 10588  2728   338  5844 11553  2830]


In [300]:
origin_libsvm_dir = "/content/drive/MyDrive/Colab Notebooks/Capstone Design 2023 2nd/DropoutNet/Dataset/all_travel_log/user_features_0based.txt"

with open(origin_libsvm_dir) as origin_libsvm_file:
  for i, line in enumerate(origin_libsvm_file):
    if i in top_K_similar_user_index:
      print(f"USER_ID-{i} : {line}")

USER_ID-338 : 338 1:0 2:20 3:2 4:2 5:2

USER_ID-1908 : 1908 1:0 2:20 3:2 4:2 5:2

USER_ID-2728 : 2728 1:0 2:20 3:2 4:2 5:2

USER_ID-2830 : 2830 1:0 2:20 3:2 4:2 5:2

USER_ID-5844 : 5844 1:0 2:20 3:2 4:2 5:2

USER_ID-10588 : 10588 1:0 2:20 3:2 4:2 5:2

USER_ID-10640 : 10640 1:0 2:20 3:2 4:2 5:2

USER_ID-10678 : 10678 1:0 2:20 3:2 4:2 5:2

USER_ID-11482 : 11482 1:0 2:20 3:2 4:2 5:2

USER_ID-11553 : 11553 1:0 2:20 3:2 4:2 5:2



## 3.3 Extract the Top K Ratings of the Similar Users


In [301]:
embedding_dir = "/content/drive/MyDrive/Colab Notebooks/Capstone Design 2023 2nd/DropoutNet/checkpoint/Embedded-latent-factor"

top_K_recommend_area_index, top_K_recommend_area_value = get_top_K_travel_area(embedding_dir, top_K_similar_user_index, 10)

3.3.1
[[-0.00166771 -0.00149262 -0.00738923 ... -0.03102724  0.02404879
   0.01434502]
 [ 0.08449753 -0.1312623  -0.04081517 ... -0.01829614 -0.17351918
   0.11992285]
 [-0.00965555  0.00996899 -0.10817692 ... -0.08915778  0.02496714
   0.08714376]
 ...
 [-0.00932654 -0.04615263  0.01209629 ...  0.04153654  0.04533443
   0.06278741]
 [ 0.04123104  0.00228094  0.02100438 ... -0.01011199 -0.01210243
   0.02754359]
 [-0.06034773 -0.11625741 -0.00487246 ... -0.10381892 -0.04234476
  -0.00914139]]
[[ 0.00619511 -0.02021328 -0.02306933 ... -0.01821989 -0.00984344
   0.005422  ]
 [-0.03601351 -0.01275318 -0.00903387 ...  0.0181077   0.06085781
   0.03763666]
 [-0.02464574 -0.10712692 -0.1197194  ... -0.03343737 -0.05107053
  -0.00976741]
 ...
 [-0.05003117 -0.01565205  0.03319466 ... -0.04806304 -0.08997317
   0.03709236]
 [ 0.12005668  0.03387028 -0.04771087 ...  0.04418553  0.04866258
  -0.04382639]
 [-0.01790223 -0.01784261 -0.02293376 ... -0.05624481 -0.03051585
   0.01734006]]
(11687, 20

## 3.4 Output Information for the Final VISIT_AREA_IDs

In [302]:
area_information_dir = "/content/drive/MyDrive/Colab Notebooks/Capstone Design 2023 2nd/DropoutNet/Dataset/all_travel_log/all-area-info/all-area-info.csv"

top_K_area_info_df = combine_area_index_and_information(top_K_recommend_area_index, area_information_dir)

In [303]:
top_K_recommend_area_index

[4529,
 4562,
 4529,
 6389,
 4562,
 907,
 4258,
 9209,
 1770,
 9115,
 5620,
 4640,
 4130,
 6877,
 2623,
 6698,
 4872,
 4847,
 9252,
 3220,
 7150,
 7473,
 3211,
 4958,
 5338,
 6626]

In [304]:
top_K_area_info_df

| Unnamed: 0 | ITEM_ID | VISIT_AREA_NM | X_COORD | Y_COORD |
| --- | --- | --- | --- | --- |
| 1137 | 907 | 광주미디어아트플랫폼 GMAP | 126.909293 | 35.14843757 |
| 1138 | 907 | 광주미디어아트플랫폼 GMAP | 126.909293 | 35.14843757 |
| 2215 | 1770 | 노량진컵밥거리 | 126.9453904 | 37.51360586 |
| 2216 | 1770 | 노량진컵밥거리 | 126.4222454 | 33.25228709 |
| 3270 | 2623 | 래미안갤러리 | 127.1201403 | 37.47926869 |
| 3992 | 3211 | 미포철길 입구 | 129.1708017 | 35.16002883 |
| 4004 | 3220 | 민락수변공원 | 129.1343329 | 35.15588298 |
| 4005 | 3220 | 민락수변공원 | 129.1343329 | 35.15588298 |
| 4006 | 3220 | 민락수변공원 | 129.1343329 | 35.15588298 |
| 5182 | 4130 | 상도문 돌담마을 | 128.5531224 | 38.16517664 |
