

다방 결과 해석:

- 가장 중요한 단어는 **"허위"**로, 중요도가 약 0.397으로 매우 높다.
- 이는 리뷰 데이터에서 "허위"라는 단어가 등장하면 별점 예측에 큰 영향을 준다는 것을 의미함.
  - 예: "허위 매물", "허위 광고" 같은 부정적 맥락에서 별점이 낮게 평가되었을 가능성이 높습니다.
- 앱 버전(예: "2.1", "2.4.1")도 일부 중요 피처로 포함되었습니다.
- 특정 앱 버전에서 사용자 경험이 좋지 않아 별점에 영향을 미쳤을 수 있습니다.
- 단어 "네트워크", "업데이트", "오류" 등은 리뷰에서 문제가 언급될 때 부정적 평가로 이어졌을 가능성이 있습니다.
- "삭제"는 리뷰에서 앱을 삭제하려는 사용자 경험과 관련된 부정적 별점으로 이어질 가능성이 있습니다.

--------------------------------------
 밑에는 다른 방법
  -------------------------------------------------

## 주요단계 설명

* (1)  버전 그룹화
major.minor 형태로 버전을 묶음 (2.1.0, 2.1.1 → 2.1).
* (2) 버전별 단어 빈도 분석
각 버전에서 등장한 단어를 집계하여 명사 빈도를 계산.
버전별로 많이 등장한 단어를 비교.
* (3) 별점과 단어 관계 분석
특정 단어가 별점에 미치는 영향을 분석.
예: "허위", "오류" 등의 단어가 부정적 별점에 기여.
* (4) 버전별 데이터로 모델링
버전별 데이터를 입력으로 별점 예측 모델을 학습.
텍스트 데이터를 버전별로 그룹화하여 모델의 입력 변수로 사용.

In [1]:
import pandas as pd
import os
import ast
from collections import Counter
from google.colab import drive

# Google Drive 마운트
drive.mount('/content/drive')

# 데이터 경로 설정
base_path = '/content/drive/MyDrive/2024/TextMining'
data_path = os.path.join(base_path, 'reviews_data')
output_path = os.path.join(base_path, 'reviewmodel')  # 결과 저장 폴더
os.makedirs(output_path, exist_ok=True)  # 폴더가 없으면 생성

app_names = ['다방', '직방', '네이버부동산', '피터팬', '호갱노노']
files = [os.path.join(data_path, f"preprocessed2_reviews_data_{app}_별점.csv") for app in app_names]

# 버전별 단어 빈도 저장용 딕셔너리
version_word_results = {}

Mounted at /content/drive


1. 버전 그룹화
major.minor 형태로 버전을 묶음 (2.1.0, 2.1.1 → 2.1).
2. 버전별 단어 빈도 분석
각 버전에서 등장한 단어를 집계하여 명사 빈도를 계산.

In [2]:


# 파일별 데이터 처리
for app, file in zip(app_names, files):
    if os.path.exists(file):  # 파일 존재 여부 확인
        print(f"Processing {app}...")

        # 데이터 로드
        data = pd.read_csv(file)

        # 결측치 처리
        data_cleaned = data.dropna(subset=['reviewCreatedVersion'])
        data_cleaned['reviewCreatedVersion'] = data_cleaned['reviewCreatedVersion'].astype(str)

        # 버전을 'major.minor' 형태로 그룹화
        data_cleaned['grouped_version'] = data_cleaned['reviewCreatedVersion'].apply(lambda x: '.'.join(x.split('.')[:2]))

        # 버전별 단어 빈도 계산
        version_word_counts = {}
        for version, group in data_cleaned.groupby('grouped_version'):
            # 명사 추출 및 빈도 계산
            words = group['nouns_without_stopwords'].apply(lambda x: eval(x)).explode()
            word_counts = Counter(words)
            version_word_counts[version] = word_counts.most_common(10)

        print(f"{app}: 버전별 주요 단어:")
        for version, words in version_word_counts.items():
            print(f"Version {version}: {words}")

        # 저장
        version_word_results[app] = version_word_counts

    else:
        print(f"{app}: 파일을 찾을 수 없음")


Processing 다방...


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_cleaned['reviewCreatedVersion'] = data_cleaned['reviewCreatedVersion'].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_cleaned['grouped_version'] = data_cleaned['reviewCreatedVersion'].apply(lambda x: '.'.join(x.split('.')[:2]))


다방: 버전별 주요 단어:
Version 1.0: [('매물', 9), ('직거래', 5), ('어플', 5), ('부동산', 4), ('정보', 4), ('볼', 4), ('사진', 3), ('사용', 3), ('최고', 2), ('번창', 2)]
Version 1.1: [('방', 5), ('설정', 5), ('어플', 5), ('알림', 4), ('위치', 4), ('최고', 4), ('보기', 4), ('매물', 4), ('디자인', 4), ('앱', 3)]
Version 1.2: [('앱', 2), ('방이', 1), ('안', 1), ('보급', 1), ('사무실', 1), ('정보', 1), (nan, 1), ('매우', 1), ('얼른', 1), ('유저', 1)]
Version 1.3: [('지역', 3), ('설정', 2), ('별로', 2), ('원룸', 2), ('지도', 2), ('살', 1), ('모든', 1), ('구분', 1), ('은', 1), ('초반', 1)]
Version 1.4: [('원룸', 4), ('검색', 3), ('방', 3), ('지도', 2), ('안', 2), ('제주도', 2), ('별', 2), ('식', 2), ('개선', 1), ('노트', 1)]
Version 1.5: [('어플', 62), ('방', 35), ('집', 28), ('앱', 22), ('정보', 17), ('대박', 16), ('볼', 15), ('원룸', 12), ('완전', 11), ('아주', 9)]
Version 1.6: [('피터', 1), ('팬', 1), ('글', 1), ('앱', 1), ('설치', 1), ('스팸', 1), ('뭐', 1), (nan, 1), ('굿', 1)]
Version 1.7: [('방', 9), ('검색', 8), ('어플', 8), ('확인', 5), ('찾기', 4), ('안', 4), ('사진', 4), ('비번', 3), ('추천', 3), ('방도', 3)]
Version 1.8: [

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_cleaned['reviewCreatedVersion'] = data_cleaned['reviewCreatedVersion'].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_cleaned['grouped_version'] = data_cleaned['reviewCreatedVersion'].apply(lambda x: '.'.join(x.split('.')[:2]))


직방: 버전별 주요 단어:
Version 1.0: [('사진', 1), ('확인', 1)]
Version 1.1: [('허위', 2), ('방', 2), ('검색', 2), ('물건', 1), ('명', 1), ('등록', 1), ('사기', 1), ('운영', 1), ('어플', 1), ('위치', 1)]
Version 3.0: [('방', 68), ('어플', 65), ('지역', 35), ('관악구', 21), ('앱', 21), ('직방', 20), ('안', 18), ('볼', 14), ('정보', 13), ('보고', 13)]
Version 4.0: [('사진', 9), ('방', 5), ('서울', 4), ('집', 4), ('볼', 4), ('번', 3), ('방이', 3), ('안', 2), ('정보', 2), ('지도', 2)]
Version 4.1: [('안', 25), ('방', 23), ('어플', 21), ('사진', 20), ('집', 19), ('매물', 14), ('원룸', 10), ('볼', 9), ('개', 8), ('정보', 8)]
Version 4.10: [('방', 816), ('매물', 613), ('직방', 443), ('안', 433), ('허위', 411), ('집', 301), ('부동산', 283), ('앱', 260), ('어플', 249), ('사진', 233)]
Version 4.11: [('방', 20), ('매물', 19), ('안', 12), ('직방', 10), ('허위', 9), ('집', 9), ('부동산', 9), ('광고', 8), ('앱', 7), ('검색', 7)]
Version 4.12: [('매물', 165), ('방', 136), ('허위', 114), ('직방', 109), ('안', 81), ('부동산', 61), ('앱', 56), ('집', 49), ('사진', 36), ('직거래', 35)]
Version 4.13: [('매물', 504), ('허위', 377), ('방',

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_cleaned['reviewCreatedVersion'] = data_cleaned['reviewCreatedVersion'].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_cleaned['grouped_version'] = data_cleaned['reviewCreatedVersion'].apply(lambda x: '.'.join(x.split('.')[:2]))


네이버부동산: 버전별 주요 단어:
Version  '등록': [(nan, 1)]
Version 1.0: [('앱', 4), ('안드로이드', 2), ('부동산', 2), ('어플', 2), ('찾기', 2), ('일조', 2), ('기분', 2), ('집', 2), ('드뎌', 1), ('아이폰', 1)]
Version 1.1: [('세상', 1), ('자주', 1), ('업', 1), ('부탁', 1)]
Version 1.11: [('다킬', 1), ('업데이트', 1)]
Version 1.12: [('네이버', 2), ('사이트', 1), ('기능', 1), ('메인', 1), ('어플', 1), ('메뉴', 1), ('추가', 1)]
Version 1.13: [('화면', 2), ('터치', 1), ('이동', 1), ('주황색', 1), ('정', 1), ('지금', 1), ('상태', 1), ('론', 1), ('앱', 1), ('전혀', 1)]
Version 1.16: [(nan, 1)]
Version 1.17: [('부동산', 3), ('어플', 1), ('참고자료', 1), ('사용', 1), ('정보', 1), ('가장', 1), ('적', 1), ('확함', 1), ('매물', 1), ('아주', 1)]
Version 1.2: [('최고', 2), ('별루', 1), (nan, 1), ('안드로이드', 1), ('버젼', 1), ('최강', 1), ('부동산', 1), ('앱', 1), (nan, 1), (nan, 1)]
Version 1.22: [('폰', 2), ('매물', 2), ('안', 2), ('머', 1), ('보', 1), ('말', 1), ('먹통', 1), ('업', 1), ('뎃', 1), ('네이버', 1)]
Version 1.24: [('넥서스', 1), ('지원', 1), ('바람', 1), ('매물', 1), ('강종', 1), ('평면도', 1), ('보고', 1), ('뒤', 1), ('튕기네', 1)]
Vers

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_cleaned['reviewCreatedVersion'] = data_cleaned['reviewCreatedVersion'].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_cleaned['grouped_version'] = data_cleaned['reviewCreatedVersion'].apply(lambda x: '.'.join(x.split('.')[:2]))


피터팬: 버전별 주요 단어:
Version 1.0: [('앱', 326), ('대박', 300), ('집', 298), ('어플', 262), ('방', 196), ('정보', 169), ('최고', 116), ('보기', 100), ('볼', 95), ('추천', 83)]
Version 1.1: [('매물', 21), ('어플', 12), ('안', 9), ('집', 8), ('등록', 8), ('방', 6), ('전', 6), ('계속', 6), ('연락', 5), ('앱', 5)]
Version 1.2: [('매물', 21), ('안', 18), ('삭제', 9), ('집', 8), ('주소', 6), ('허위', 6), ('방', 6), ('전화', 5), ('처리', 5), ('등록', 5)]
Version 2.0: [('매물', 5), ('앱', 3), ('때문', 2), ('비', 2), ('세상', 2), ('만', 2), ('연락', 2), ('삭제', 2), ('구미', 1), ('정투', 1)]
Version 2.1: [('직거래', 18), ('피터팬', 11), ('안심', 11), ('앱', 11), ('사용', 8), ('방', 8), ('어플', 7), ('정보', 7), ('안', 5), ('매물', 5)]
Version 2.10: [('매물', 20), ('허위', 10), ('방', 7), ('사용', 5), ('직거래', 5), ('부동산', 5), ('앱', 5), ('안', 4), ('안심', 4), ('보고', 4)]
Version 2.11: [('매물', 76), ('허위', 38), ('방', 36), ('앱', 34), ('안', 23), ('부동산', 17), ('어플', 16), ('피터팬', 14), ('연락', 13), ('직거래', 13)]
Version 2.12: [('매물', 82), ('허위', 50), ('안', 49), ('어플', 24), ('방', 24), ('부동산', 22), ('앱', 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_cleaned['reviewCreatedVersion'] = data_cleaned['reviewCreatedVersion'].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_cleaned['grouped_version'] = data_cleaned['reviewCreatedVersion'].apply(lambda x: '.'.join(x.split('.')[:2]))


## 코드 설명

데이터 그룹화:

- 데이터를 grouped_version 기준으로 그룹화하여 각 버전별로 독립적으로 모델링.

버전별 모델 학습:

각 버전에 대해:
- 텍스트 데이터를 TF-IDF로 벡터화.
- 별점(score)을 종속 변수로 사용하여 RandomForestRegressor 학습.
- MSE와 R2 Score로 성능 평가.

중요 피처 분석:

- 각 버전에서 예측에 기여한 상위 중요한 단어를 분석.
데이터가 부족한 버전 스킵:

- 10개 미만의 리뷰를 가진 버전은 분석에서 제외하여 안정적인 결과를 보장.

결과 저장 및 출력:

- model_results에 앱별로 버전별 결과를 저장.
- 모든 버전에 대해 MSE, R2 Score, 상위 중요한 단어를 출력.


네이버 파일에 문제가 있는거같음
- column 확인결과 column에 텍스트토큰화한게 포함된게 2 행 있어서 제거함.

In [3]:
import pandas as pd

# 파일 경로 설정
file_path = os.path.join(data_path, "preprocessed2_reviews_data_네이버부동산_별점.csv")

# 파일 읽기
if os.path.exists(file_path):
    print("Reading the file...")
    data = pd.read_csv(file_path)

    # reviewCreatedVersion 열의 고유 값 확인
    if 'reviewCreatedVersion' in data.columns:
        unique_versions = data['reviewCreatedVersion'].dropna().unique()
        print(f"총 {len(unique_versions)}개의 고유 버전이 있습니다:")
        for version in sorted(unique_versions):
            print(version)
    else:
        print("'reviewCreatedVersion' 열이 데이터에 없습니다.")
else:
    print(f"파일을 찾을 수 없습니다: {file_path}")


Reading the file...
총 80개의 고유 버전이 있습니다:
 '등록'
1.0.0
1.0.1
1.1.0
1.11.0
1.12.0
1.13.0
1.16.0
1.17.0
1.2.0
1.22.0
1.24.0
1.25.0
1.26.0
1.27.0
1.28.0
1.29.0
1.3.0
1.30.0
1.32.0
1.33.0
1.34.0
1.36.0
1.37.0
1.38.0
1.39.0
1.4.0
1.40.0
1.42.0
1.43.0
1.44.0
1.45.0
1.46.0
1.47.0
1.48.0
1.49.0
1.5.0
1.50.0
1.52.0
1.53.0
1.54.0
1.55.0
1.56.0
1.57.0
1.58.0
1.59.0
1.6.0
1.60.0
1.61.0
1.62.0
1.63.0
1.7.0
1.8.0
1.9.0
2.0.1
2.0.2
2.0.3
2.0.4
2.0.5
2.0.6
2.0.7
2.0.8
2.0.9
2.1.0
2.1.1
2.2.0
2.2.1
2.3.1
2.4.0
2.4.1
2.4.11
2.4.12
2.4.13
2.4.2
2.4.3
2.4.4
2.4.7
2.4.8
2.4.9
[('520', 'Number'), ('가구', 'Noun'), ('인데', 'Josa'), ('5200', 'Number'), ('가구', 'Noun'), ('로', 'Josa'), ('뜨고', 'Verb'), ('다', 'Adverb'), ('0', 'Number'), ('이', 'Noun'), ('하나', 'Noun'), ('씩', 'Suffix'), ('붙어서', 'Verb'), ('나와', 'Verb'), ('요', 'Noun'), ('빨리', 'Adverb'), ('오류', 'Noun'), ('수정', 'Noun'), ('해주세요', 'Verb')]


In [4]:
if 'reviewCreatedVersion' in data.columns:
    # reviewCreatedVersion 열에 문자열이 아닌 데이터가 있는지 확인
    print(f"'reviewCreatedVersion' 열의 데이터 타입: {data['reviewCreatedVersion'].dtype}")
    print("열 내용 샘플:")
    print(data['reviewCreatedVersion'].head())

    # 결측치와 고유 값 확인
    unique_versions = data['reviewCreatedVersion'].dropna().unique()
    print(f"총 {len(unique_versions)}개의 고유 버전이 있습니다:")
    for version in sorted(unique_versions):
        print(version)
else:
    print("'reviewCreatedVersion' 열이 데이터에 없습니다.")


'reviewCreatedVersion' 열의 데이터 타입: object
열 내용 샘플:
0    2.4.13
1     2.4.8
2    2.4.11
3     2.4.9
4     2.2.0
Name: reviewCreatedVersion, dtype: object
총 80개의 고유 버전이 있습니다:
 '등록'
1.0.0
1.0.1
1.1.0
1.11.0
1.12.0
1.13.0
1.16.0
1.17.0
1.2.0
1.22.0
1.24.0
1.25.0
1.26.0
1.27.0
1.28.0
1.29.0
1.3.0
1.30.0
1.32.0
1.33.0
1.34.0
1.36.0
1.37.0
1.38.0
1.39.0
1.4.0
1.40.0
1.42.0
1.43.0
1.44.0
1.45.0
1.46.0
1.47.0
1.48.0
1.49.0
1.5.0
1.50.0
1.52.0
1.53.0
1.54.0
1.55.0
1.56.0
1.57.0
1.58.0
1.59.0
1.6.0
1.60.0
1.61.0
1.62.0
1.63.0
1.7.0
1.8.0
1.9.0
2.0.1
2.0.2
2.0.3
2.0.4
2.0.5
2.0.6
2.0.7
2.0.8
2.0.9
2.1.0
2.1.1
2.2.0
2.2.1
2.3.1
2.4.0
2.4.1
2.4.11
2.4.12
2.4.13
2.4.2
2.4.3
2.4.4
2.4.7
2.4.8
2.4.9
[('520', 'Number'), ('가구', 'Noun'), ('인데', 'Josa'), ('5200', 'Number'), ('가구', 'Noun'), ('로', 'Josa'), ('뜨고', 'Verb'), ('다', 'Adverb'), ('0', 'Number'), ('이', 'Noun'), ('하나', 'Noun'), ('씩', 'Suffix'), ('붙어서', 'Verb'), ('나와', 'Verb'), ('요', 'Noun'), ('빨리', 'Adverb'), ('오류', 'Noun'), ('수정', 'Noun'), ('해주세요', '

## processed2 네이버 파일에 이상한 값들이 있어요,,,, 이게 대체 어디서 나온걸지 확인을 해봐야할듯 합니다.

In [5]:
import pandas as pd

# 파일 경로 설정
file_path = os.path.join(data_path, "preprocessed2_reviews_data_네이버부동산_별점.csv")

# 검색할 데이터 (리스트 형태로 입력된 데이터)
target_data = "[('520', 'Number'), ('가구', 'Noun'), ('인데', 'Josa'), ('5200', 'Number'), ('가구', 'Noun'), ('로', 'Josa'), ('뜨고', 'Verb'), ('다', 'Adverb'), ('0', 'Number'), ('이', 'Noun'), ('하나', 'Noun'), ('씩', 'Suffix'), ('붙어서', 'Verb'), ('나와', 'Verb'), ('요', 'Noun'), ('빨리', 'Adverb'), ('오류', 'Noun'), ('수정', 'Noun'), ('해주세요', 'Verb')]"

# 파일 읽기
if os.path.exists(file_path):
    print("Reading the file...")
    data = pd.read_csv(file_path)

    # 열 이름 확인
    print("Columns:", data.columns)

    # 검색 대상 열 설정 (가장 가능성 높은 열 지정)
    search_column = 'reviewCreatedVersion'  # or 다른 열 이름 (예: 'nouns_without_stopwords')

    if search_column in data.columns:
        # 해당 데이터가 포함된 행 필터링
        matching_rows = data[data[search_column].astype(str) == target_data]

        if not matching_rows.empty:
            print(f"총 {len(matching_rows)}개의 행에서 일치하는 데이터가 발견되었습니다.")
            print("행 인덱스:")
            print(matching_rows.index.tolist())
            print("\n일치하는 행 데이터:")
            print(matching_rows)
        else:
            print("일치하는 데이터가 없습니다.")
    else:
        print(f"'{search_column}' 열이 데이터에 없습니다.")
else:
    print(f"파일을 찾을 수 없습니다: {file_path}")


Reading the file...
Columns: Index(['reviewId', 'userName', 'userImage', 'content', 'score',
       'thumbsUpCount', 'reviewCreatedVersion', 'at', 'replyContent',
       'repliedAt', 'appVersion', 'content_preprocessed', 'content_token',
       'content_pos', 'nouns_only', 'nouns_without_stopwords'],
      dtype='object')
총 1개의 행에서 일치하는 데이터가 발견되었습니다.
행 인덱스:
[1895]

일치하는 행 데이터:
       reviewId                  userName            userImage content  score  \
1895  문제점을 수정하여   현재는 정상적으로 이용하실 수 있습니다."  2014-05-08 11:18:10  1.52.0    NaN   

      thumbsUpCount                               reviewCreatedVersion   at  \
1895            NaN  [('520', 'Number'), ('가구', 'Noun'), ('인데', 'Jo...  NaN   

     replyContent repliedAt appVersion content_preprocessed content_token  \
1895          NaN       NaN        NaN                  NaN           NaN   

     content_pos nouns_only nouns_without_stopwords  
1895          []         []                      []  


아,, 실제로 데이터 확인해보니 진짜 이상한게 포함되어있기는 합니다.
전처리하는과정에서 그 부분만 잠시 오류가 있었던것 가타요. 결측치 제거해야겟다.



In [6]:
import pandas as pd
import os

# 파일 경로 설정
file_path = os.path.join(data_path, "preprocessed2_reviews_data_네이버부동산_별점.csv")
output_file_path = os.path.join(output_path, "cleaned_네이버부동산_별점.csv")

# 제거할 값 정의
values_to_remove = [
    "[('520', 'Number'), ('가구', 'Noun'), ('인데', 'Josa'), ('5200', 'Number'), ('가구', 'Noun'), ('로', 'Josa'), ('뜨고', 'Verb'), ('다', 'Adverb'), ('0', 'Number'), ('이', 'Noun'), ('하나', 'Noun'), ('씩', 'Suffix'), ('붙어서', 'Verb'), ('나와', 'Verb'), ('요', 'Noun'), ('빨리', 'Adverb'), ('오류', 'Noun'), ('수정', 'Noun'), ('해주세요', 'Verb')]",
    " '등록'"
]

# 파일 읽기
if os.path.exists(file_path):
    print("Reading the file...")
    data = pd.read_csv(file_path)

    # 열 확인
    if 'reviewCreatedVersion' in data.columns:
        print("Cleaning reviewCreatedVersion...")

        # Null 값과 특정 값을 제거
        cleaned_data = data[~data['reviewCreatedVersion'].isnull()]  # Null 값 제거
        cleaned_data = cleaned_data[~cleaned_data['reviewCreatedVersion'].astype(str).isin(values_to_remove)]  # 특정 값 제거

        # 결과 저장
        cleaned_data.to_csv(output_file_path, index=False)
        print(f"Cleaned data saved to: {output_file_path}")
    else:
        print("'reviewCreatedVersion' 열이 데이터에 없습니다.")
else:
    print(f"파일을 찾을 수 없습니다: {file_path}")


Reading the file...
Cleaning reviewCreatedVersion...
Cleaned data saved to: /content/drive/MyDrive/2024/TextMining/reviewmodel/cleaned_네이버부동산_별점.csv


In [7]:
import pandas as pd
import os

# 데이터 경로 설정
data_path = '/content/drive/MyDrive/2024/TextMining/reviews_data'
output_path = '/content/drive/MyDrive/2024/TextMining/reviewmodel'
os.makedirs(output_path, exist_ok=True)  # 결과 저장 폴더 생성

# 앱 이름 및 파일 경로
selected_apps = ['피터팬', '호갱노노']
files = {app: os.path.join(data_path, f"preprocessed2_reviews_data_{app}_별점.csv") for app in selected_apps}

# 파일 처리
for app, file_path in files.items():
    if os.path.exists(file_path):  # 파일 존재 여부 확인
        print(f"Processing {app}...")

        # 파일 읽기
        data = pd.read_csv(file_path)

        # Null 값 제거
        if 'reviewCreatedVersion' in data.columns:
            cleaned_data = data.dropna(subset=['reviewCreatedVersion'])  # NaN 값 제거

            # 결과 저장
            output_file_path = os.path.join(output_path, f"cleaned_{app}_별점.csv")
            cleaned_data.to_csv(output_file_path, index=False)
            print(f"Cleaned data for {app} saved to: {output_file_path}")
        else:
            print(f"'reviewCreatedVersion' 열이 {app} 데이터에 없습니다.")
    else:
        print(f"File for {app} not found: {file_path}")


Processing 피터팬...
Cleaned data for 피터팬 saved to: /content/drive/MyDrive/2024/TextMining/reviewmodel/cleaned_피터팬_별점.csv
Processing 호갱노노...
Cleaned data for 호갱노노 saved to: /content/drive/MyDrive/2024/TextMining/reviewmodel/cleaned_호갱노노_별점.csv


In [8]:
import pandas as pd
import os

# 데이터 경로 설정
output_path = '/content/drive/MyDrive/2024/TextMining/reviewmodel'

# 앱 이름 및 파일 경로
selected_apps = ['다방', '직방', '네이버부동산', '피터팬', '호갱노노']
files = {app: os.path.join(output_path, f"cleaned_{app}_별점.csv") for app in selected_apps}

# 파일별 고유 값 확인
for app, file_path in files.items():
    if os.path.exists(file_path):  # 파일 존재 여부 확인
        print(f"Checking 'reviewCreatedVersion' for {app}...")

        # 파일 읽기
        data = pd.read_csv(file_path)

        # reviewCreatedVersion 열의 고유 값 확인
        if 'reviewCreatedVersion' in data.columns:
            unique_versions = data['reviewCreatedVersion'].dropna().unique()
            print(f"{app} - reviewCreatedVersion에 포함된 고유 값:")
            for version in sorted(unique_versions):
                print(version)
            print(f"총 {len(unique_versions)}개의 고유 값이 있습니다.\n")
        else:
            print(f"'reviewCreatedVersion' 열이 {app} 데이터에 없습니다.\n")
    else:
        print(f"File for {app} not found: {file_path}")


Checking 'reviewCreatedVersion' for 다방...
다방 - reviewCreatedVersion에 포함된 고유 값:
1.0
1.0.1
1.0.2
1.1.1
1.2.0
1.3.0
1.4.1
1.5
1.6
1.7
1.8
1.9
2.0
2.1
2.10.0
2.11.3
2.11.4
2.12.2
2.12.3
2.13.4
2.13.5
2.13.6
2.14.4
2.14.5
2.15.5
2.16.2
2.17.2
2.17.3
2.18.1
2.18.2
2.18.3
2.2
2.2.1
2.2.2
2.3
2.3.1
2.3.2
2.3.3
2.3.4
2.4
2.4.1
2.4.10
2.4.10.1
2.4.11
2.4.2
2.4.3
2.4.4
2.4.5
2.4.6
2.4.7
2.4.7.1
2.4.8
2.4.9
2.5
2.5.1
2.5.2
2.6
2.6.1
2.6.3
2.6.4
2.6.6
2.7.1
2.8.0
2.8.4
2.9.6
3.0.0
3.0.1
3.0.2
3.0.3
3.0.4
3.1.0
3.1.1
3.1.2
3.1.3
3.10.0
3.10.1
3.10.2
3.10.3
3.10.4
3.10.5
3.10.6
3.10.7
3.10.8
3.2.0
3.2.1
3.2.2
3.2.3
3.2.4
3.2.5
3.2.6
3.2.7
3.3.0
3.3.1
3.3.2
3.3.3
3.3.4
3.3.5
3.3.6
3.3.7
3.3.8
3.4.0
3.4.1
3.5.0
3.5.2
3.6.0
3.6.1
3.7.0
3.8.0
3.9.0
3.9.1
3.9.2
3.9.3
3.9.4
4.0.0
4.0.1
4.0.2
4.1.0
4.10.0
4.10.1
4.10.2
4.11.0
4.11.1
4.12.0
4.12.1
4.12.2
4.13.2
4.13.4
4.13.5
4.13.6
4.14.2
4.14.3
4.14.4
4.14.5
4.14.6
4.15.0
4.15.1
4.15.2
4.15.3
4.15.5
4.16.0
4.16.1
4.16.2
4.17.1
4.17.2
4.18.0
4.18.1
4.18.2
4.

## 직방** 버전별 모델링 MSE, R2score 계산

In [9]:
import pandas as pd
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 데이터 경로 설정
output_path = '/content/drive/MyDrive/2024/TextMining/reviewmodel'

# 앱 이름 및 파일 경로
selected_apps = ['직방']
files = {app: os.path.join(output_path, f"cleaned_{app}_별점.csv") for app in selected_apps}

# 결과 저장용 딕셔너리
model_results = {}

# 파일별 데이터 처리
for app, file_path in files.items():
    if os.path.exists(file_path):  # 파일 존재 여부 확인
        print(f"Processing {app}...")

        # 데이터 읽기
        data = pd.read_csv(file_path)

        # 소버전 기준 그룹화 (예: 1.0, 1.1)
        data['grouped_version'] = data['reviewCreatedVersion'].apply(lambda x: '.'.join(x.split('.')[:2]))

        # 그룹별 결과 저장
        version_results = {}

        # 소버전별 데이터 그룹화
        for version, group in data.groupby('grouped_version'):
            print(f"  Processing version group: {version}")

            # 비어 있는 데이터 제거
            group = group[group['nouns_without_stopwords'].notnull() & (group['nouns_without_stopwords'].str.strip() != "[]")]
            if group.empty:
                print(f"    Skipping version group {version}: No valid text data.")
                continue

            # 텍스트 데이터 벡터화
            texts = group['nouns_without_stopwords'].apply(lambda x: ' '.join(eval(x)))
            tfidf = TfidfVectorizer(max_features=1000)
            try:
                X_text = tfidf.fit_transform(texts)
            except ValueError as e:
                print(f"    Error processing version group {version}: {e}")
                continue

            # 독립 변수와 종속 변수
            X = pd.DataFrame(X_text.toarray())
            y = group['score']

            # 데이터가 충분한지 확인 (최소 10개의 데이터 필요)
            if len(group) < 10:
                print(f"    Skipping version group {version}: Not enough data ({len(group)} rows)")
                continue

            # 데이터 분리
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

            # 모델 학습
            model = RandomForestRegressor(random_state=42)
            model.fit(X_train, y_train)

            # 평가
            y_pred = model.predict(X_test)
            mse = mean_squared_error(y_test, y_pred)
            r2 = r2_score(y_test, y_pred)

            print(f"    Version group {version}: MSE = {mse}, R2 Score = {r2}")

            # 중요 피처 분석
            feature_importances = pd.Series(model.feature_importances_, index=tfidf.get_feature_names_out()).sort_values(ascending=False)
            print(f"    {version} 상위 중요한 피처:")
            print(feature_importances.head(5))

            # 저장
            version_results[version] = {
                'MSE': mse,
                'R2 Score': r2,
                'Important Features': feature_importances.head(10).to_dict()
            }

        # 앱 결과 저장
        model_results[app] = version_results

# 결과 요약 출력
print("\n모델링 결과 요약:")
for app, version_data in model_results.items():
    print(f"App: {app}")
    for version, result in version_data.items():
        print(f"  Version group {version}: MSE = {result['MSE']}, R2 Score = {result['R2 Score']}")
        print("  상위 중요한 피처:")
        for feature, importance in result['Important Features'].items():
            print(f"    - {feature}: {importance}")

Processing 직방...
  Processing version group: 1.0
    Skipping version group 1.0: Not enough data (1 rows)
  Processing version group: 1.1
    Skipping version group 1.1: Not enough data (6 rows)
  Processing version group: 3.0
    Version group 3.0: MSE = 0.9342962326973244, R2 Score = -0.12207145593551205
    3.0 상위 중요한 피처:
관악구    0.111044
별로     0.093192
몰랏네    0.079022
사실     0.071540
가짜     0.055537
dtype: float64
  Processing version group: 4.0
    Version group 4.0: MSE = 3.2302500000000003, R2 Score = -0.43566666666666687
    4.0 상위 중요한 피처:
서울     0.376113
지역     0.294412
사진     0.083603
설명     0.065989
그대로    0.056421
dtype: float64
  Processing version group: 4.1
    Version group 4.1: MSE = 2.9812639021329645, R2 Score = -0.012583669215687765
    4.1 상위 중요한 피처:
방이    0.081174
그냥    0.065750
대해    0.058531
집도    0.053811
사진    0.053651
dtype: float64
  Processing version group: 4.10
    Version group 4.10: MSE = 1.7457319876748174, R2 Score = 0.358197261505833
    4.10 상위 중요한 

### 다방 버전별 모델링 MSE, R2score 계산

In [10]:
import pandas as pd
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 데이터 경로 설정
output_path = '/content/drive/MyDrive/2024/TextMining/reviewmodel'

# 앱 이름 및 파일 경로
selected_apps = ['다방']
files = {app: os.path.join(output_path, f"cleaned_{app}_별점.csv") for app in selected_apps}

# 결과 저장용 딕셔너리
model_results = {}

# 파일별 데이터 처리
for app, file_path in files.items():
    if os.path.exists(file_path):  # 파일 존재 여부 확인
        print(f"Processing {app}...")

        # 데이터 읽기
        data = pd.read_csv(file_path)

        # 소버전 기준 그룹화 (예: 1.0, 1.1)
        data['grouped_version'] = data['reviewCreatedVersion'].apply(lambda x: '.'.join(x.split('.')[:2]))

        # 그룹별 결과 저장
        version_results = {}

        # 소버전별 데이터 그룹화
        for version, group in data.groupby('grouped_version'):
            print(f"  Processing version group: {version}")

            # 비어 있는 데이터 제거
            group = group[group['nouns_without_stopwords'].notnull() & (group['nouns_without_stopwords'].str.strip() != "[]")]
            if group.empty:
                print(f"    Skipping version group {version}: No valid text data.")
                continue

            # 텍스트 데이터 벡터화
            texts = group['nouns_without_stopwords'].apply(lambda x: ' '.join(eval(x)))
            tfidf = TfidfVectorizer(max_features=1000)
            try:
                X_text = tfidf.fit_transform(texts)
            except ValueError as e:
                print(f"    Error processing version group {version}: {e}")
                continue

            # 독립 변수와 종속 변수
            X = pd.DataFrame(X_text.toarray())
            y = group['score']

            # 데이터가 충분한지 확인 (최소 10개의 데이터 필요)
            if len(group) < 10:
                print(f"    Skipping version group {version}: Not enough data ({len(group)} rows)")
                continue

            # 데이터 분리
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

            # 모델 학습
            model = RandomForestRegressor(random_state=42)
            model.fit(X_train, y_train)

            # 평가
            y_pred = model.predict(X_test)
            mse = mean_squared_error(y_test, y_pred)
            r2 = r2_score(y_test, y_pred)

            print(f"    Version group {version}: MSE = {mse}, R2 Score = {r2}")

            # 중요 피처 분석
            feature_importances = pd.Series(model.feature_importances_, index=tfidf.get_feature_names_out()).sort_values(ascending=False)
            print(f"    {version} 상위 중요한 피처:")
            print(feature_importances.head(5))

            # 저장
            version_results[version] = {
                'MSE': mse,
                'R2 Score': r2,
                'Important Features': feature_importances.head(10).to_dict()
            }

        # 앱 결과 저장
        model_results[app] = version_results

# 결과 요약 출력
print("\n모델링 결과 요약:")
for app, version_data in model_results.items():
    print(f"App: {app}")
    for version, result in version_data.items():
        print(f"  Version group {version}: MSE = {result['MSE']}, R2 Score = {result['R2 Score']}")
        print("  상위 중요한 피처:")
        for feature, importance in result['Important Features'].items():
            print(f"    - {feature}: {importance}")

Processing 다방...
  Processing version group: 1.0
    Version group 1.0: MSE = 0.2, R2 Score = -0.2500000000000002
    1.0 상위 중요한 피처:
거래     0.0
준비     0.0
안드네    0.0
어플     0.0
완전     0.0
dtype: float64
  Processing version group: 1.1
    Version group 1.1: MSE = 1.05064, R2 Score = -0.6416250000000001
    1.1 상위 중요한 피처:
위치    0.237112
조건    0.153146
검색    0.090757
지금    0.072076
개인    0.068511
dtype: float64
  Processing version group: 1.2
    Skipping version group 1.2: Not enough data (4 rows)
  Processing version group: 1.3
    Skipping version group 1.3: Not enough data (9 rows)
  Processing version group: 1.4
    Skipping version group 1.4: Not enough data (7 rows)
  Processing version group: 1.5
    Version group 1.5: MSE = 0.22439963315087932, R2 Score = -0.4727536671047201
    1.5 상위 중요한 피처:
필요    0.129233
자꾸    0.096957
최고    0.085826
완전    0.084886
다시    0.084597
dtype: float64
  Processing version group: 1.6
    Skipping version group 1.6: Not enough data (2 rows)
  Process

### 네이버부동산 버전별 모델링 MSE, R2score 계산 (네이버 대버전 기준)

In [None]:
import pandas as pd
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 데이터 경로 설정
output_path = '/content/drive/MyDrive/2024/TextMining/reviewmodel'

# 앱 이름 및 파일 경로
selected_apps = ['네이버부동산']
files = {app: os.path.join(output_path, f"cleaned_{app}_별점.csv") for app in selected_apps}

# 결과 저장용 딕셔너리
model_results = {}

# 파일별 데이터 처리
for app, file_path in files.items():
    if os.path.exists(file_path):  # 파일 존재 여부 확인
        print(f"Processing {app}...")

        # 데이터 읽기
        data = pd.read_csv(file_path)

        # 버전 그룹화 (1.x → 그룹 1, 2.x → 그룹 2)
        data['grouped_version'] = data['reviewCreatedVersion'].apply(lambda x: x.split('.')[0])

        # 그룹별 결과 저장
        version_results = {}

        # 버전별 데이터 그룹화
        for version, group in data.groupby('grouped_version'):
            print(f"  Processing version group: {version}")

            # 비어 있는 데이터 제거
            group = group[group['nouns_without_stopwords'].notnull() & (group['nouns_without_stopwords'].str.strip() != "[]")]
            if group.empty:
                print(f"    Skipping version group {version}: No valid text data.")
                continue

            # 텍스트 데이터 벡터화
            texts = group['nouns_without_stopwords'].apply(lambda x: ' '.join(eval(x)))
            tfidf = TfidfVectorizer(max_features=1000)
            try:
                X_text = tfidf.fit_transform(texts)
            except ValueError as e:
                print(f"    Error processing version group {version}: {e}")
                continue

            # 독립 변수와 종속 변수
            X = pd.DataFrame(X_text.toarray())
            y = group['score']

            # 데이터가 충분한지 확인 (최소 10개의 데이터 필요)
            if len(group) < 10:
                print(f"    Skipping version group {version}: Not enough data ({len(group)} rows)")
                continue

            # 데이터 분리
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

            # 모델 학습
            model = RandomForestRegressor(random_state=42)
            model.fit(X_train, y_train)

            # 평가
            y_pred = model.predict(X_test)
            mse = mean_squared_error(y_test, y_pred)
            r2 = r2_score(y_test, y_pred)

            print(f"    Version group {version}: MSE = {mse}, R2 Score = {r2}")

            # 중요 피처 분석
            feature_importances = pd.Series(model.feature_importances_, index=tfidf.get_feature_names_out()).sort_values(ascending=False)
            important_features = feature_importances[feature_importances > 0.05]  # 중요도 > 0.05만 포함
            print(f"    {version} 상위 중요한 피처:")
            print(important_features.head(5))

            # 저장
            version_results[version] = {
                'MSE': mse,
                'R2 Score': r2,
                'Important Features': important_features.head(10).to_dict()
            }

        # 앱 결과 저장
        model_results[app] = version_results

# 결과 요약 출력
print("\n모델링 결과 요약:")
for app, version_data in model_results.items():
    print(f"App: {app}")
    for version, result in version_data.items():
        print(f"  Version group {version}: MSE = {result['MSE']}, R2 Score = {result['R2 Score']}")
        print("  상위 중요한 피처:")
        for feature, importance in result['Important Features'].items():
            print(f"    - {feature}: {importance}")


Processing 네이버부동산...
  Processing version group: 1
    Version group 1: MSE = 2.642605123467446, R2 Score = 0.10324044673030985
    1 상위 중요한 피처:
부동산    0.067575
버전     0.063554
dtype: float64
  Processing version group: 2


KeyboardInterrupt: 

### 네이버부동산 버전별 모델링 MSE, R2score 계산 (네이버 세부버전 기준)


In [None]:
import pandas as pd
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 데이터 경로 설정
output_path = '/content/drive/MyDrive/2024/TextMining/reviewmodel'

# 앱 이름 및 파일 경로
selected_apps = ['네이버부동산']
files = {app: os.path.join(output_path, f"cleaned_{app}_별점.csv") for app in selected_apps}

# 결과 저장용 딕셔너리
model_results = {}

# 파일별 데이터 처리
for app, file_path in files.items():
    if os.path.exists(file_path):  # 파일 존재 여부 확인
        print(f"Processing {app}...")

        # 데이터 읽기
        data = pd.read_csv(file_path)

        # 소버전 기준 그룹화 (예: 1.0, 1.1)
        data['grouped_version'] = data['reviewCreatedVersion'].apply(lambda x: '.'.join(x.split('.')[:2]))

        # 그룹별 결과 저장
        version_results = {}

        # 소버전별 데이터 그룹화
        for version, group in data.groupby('grouped_version'):
            print(f"  Processing version group: {version}")

            # 비어 있는 데이터 제거
            group = group[group['nouns_without_stopwords'].notnull() & (group['nouns_without_stopwords'].str.strip() != "[]")]
            if group.empty:
                print(f"    Skipping version group {version}: No valid text data.")
                continue

            # 텍스트 데이터 벡터화
            texts = group['nouns_without_stopwords'].apply(lambda x: ' '.join(eval(x)))
            tfidf = TfidfVectorizer(max_features=1000)
            try:
                X_text = tfidf.fit_transform(texts)
            except ValueError as e:
                print(f"    Error processing version group {version}: {e}")
                continue

            # 독립 변수와 종속 변수
            X = pd.DataFrame(X_text.toarray())
            y = group['score']

            # 데이터가 충분한지 확인 (최소 10개의 데이터 필요)
            if len(group) < 10:
                print(f"    Skipping version group {version}: Not enough data ({len(group)} rows)")
                continue

            # 데이터 분리
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

            # 모델 학습
            model = RandomForestRegressor(random_state=42)
            model.fit(X_train, y_train)

            # 평가
            y_pred = model.predict(X_test)
            mse = mean_squared_error(y_test, y_pred)
            r2 = r2_score(y_test, y_pred)

            print(f"    Version group {version}: MSE = {mse}, R2 Score = {r2}")

            # 중요 피처 분석
            feature_importances = pd.Series(model.feature_importances_, index=tfidf.get_feature_names_out()).sort_values(ascending=False)
            print(f"    {version} 상위 중요한 피처:")
            print(feature_importances.head(5))

            # 저장
            version_results[version] = {
                'MSE': mse,
                'R2 Score': r2,
                'Important Features': feature_importances.head(10).to_dict()
            }

        # 앱 결과 저장
        model_results[app] = version_results

# 결과 요약 출력
print("\n모델링 결과 요약:")
for app, version_data in model_results.items():
    print(f"App: {app}")
    for version, result in version_data.items():
        print(f"  Version group {version}: MSE = {result['MSE']}, R2 Score = {result['R2 Score']}")
        print("  상위 중요한 피처:")
        for feature, importance in result['Important Features'].items():
            print(f"    - {feature}: {importance}")


### 피터팬 버전별 모델링 MSE, R2score 계산

In [None]:
import pandas as pd
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 데이터 경로 설정
output_path = '/content/drive/MyDrive/2024/TextMining/reviewmodel'

# 앱 이름 및 파일 경로
selected_apps = ['피터팬']
files = {app: os.path.join(output_path, f"cleaned_{app}_별점.csv") for app in selected_apps}

# 결과 저장용 딕셔너리
model_results = {}

# 파일별 데이터 처리
for app, file_path in files.items():
    if os.path.exists(file_path):  # 파일 존재 여부 확인
        print(f"Processing {app}...")

        # 데이터 읽기
        data = pd.read_csv(file_path)

        # 소버전 기준 그룹화 (예: 1.0, 1.1)
        data['grouped_version'] = data['reviewCreatedVersion'].apply(lambda x: '.'.join(x.split('.')[:2]))

        # 그룹별 결과 저장
        version_results = {}

        # 소버전별 데이터 그룹화
        for version, group in data.groupby('grouped_version'):
            print(f"  Processing version group: {version}")

            # 비어 있는 데이터 제거
            group = group[group['nouns_without_stopwords'].notnull() & (group['nouns_without_stopwords'].str.strip() != "[]")]
            if group.empty:
                print(f"    Skipping version group {version}: No valid text data.")
                continue

            # 텍스트 데이터 벡터화
            texts = group['nouns_without_stopwords'].apply(lambda x: ' '.join(eval(x)))
            tfidf = TfidfVectorizer(max_features=1000)
            try:
                X_text = tfidf.fit_transform(texts)
            except ValueError as e:
                print(f"    Error processing version group {version}: {e}")
                continue

            # 독립 변수와 종속 변수
            X = pd.DataFrame(X_text.toarray())
            y = group['score']

            # 데이터가 충분한지 확인 (최소 10개의 데이터 필요)
            if len(group) < 10:
                print(f"    Skipping version group {version}: Not enough data ({len(group)} rows)")
                continue

            # 데이터 분리
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

            # 모델 학습
            model = RandomForestRegressor(random_state=42)
            model.fit(X_train, y_train)

            # 평가
            y_pred = model.predict(X_test)
            mse = mean_squared_error(y_test, y_pred)
            r2 = r2_score(y_test, y_pred)

            print(f"    Version group {version}: MSE = {mse}, R2 Score = {r2}")

            # 중요 피처 분석
            feature_importances = pd.Series(model.feature_importances_, index=tfidf.get_feature_names_out()).sort_values(ascending=False)
            print(f"    {version} 상위 중요한 피처:")
            print(feature_importances.head(5))

            # 저장
            version_results[version] = {
                'MSE': mse,
                'R2 Score': r2,
                'Important Features': feature_importances.head(10).to_dict()
            }

        # 앱 결과 저장
        model_results[app] = version_results

# 결과 요약 출력
print("\n모델링 결과 요약:")
for app, version_data in model_results.items():
    print(f"App: {app}")
    for version, result in version_data.items():
        print(f"  Version group {version}: MSE = {result['MSE']}, R2 Score = {result['R2 Score']}")
        print("  상위 중요한 피처:")
        for feature, importance in result['Important Features'].items():
            print(f"    - {feature}: {importance}")


### 호갱노노 버전별 모델링 MSE, R2score 계산

In [None]:
import pandas as pd
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 데이터 경로 설정
output_path = '/content/drive/MyDrive/2024/TextMining/reviewmodel'

# 앱 이름 및 파일 경로
selected_apps = ['호갱노노']
files = {app: os.path.join(output_path, f"cleaned_{app}_별점.csv") for app in selected_apps}

# 결과 저장용 딕셔너리
model_results = {}

# 파일별 데이터 처리
for app, file_path in files.items():
    if os.path.exists(file_path):  # 파일 존재 여부 확인
        print(f"Processing {app}...")

        # 데이터 읽기
        data = pd.read_csv(file_path)

        # 소버전 기준 그룹화 (예: 1.0, 1.1)
        data['grouped_version'] = data['reviewCreatedVersion'].apply(lambda x: '.'.join(x.split('.')[:2]))

        # 그룹별 결과 저장
        version_results = {}

        # 소버전별 데이터 그룹화
        for version, group in data.groupby('grouped_version'):
            print(f"  Processing version group: {version}")

            # 비어 있는 데이터 제거
            group = group[group['nouns_without_stopwords'].notnull() & (group['nouns_without_stopwords'].str.strip() != "[]")]
            if group.empty:
                print(f"    Skipping version group {version}: No valid text data.")
                continue

            # 텍스트 데이터 벡터화
            texts = group['nouns_without_stopwords'].apply(lambda x: ' '.join(eval(x)))
            tfidf = TfidfVectorizer(max_features=1000)
            try:
                X_text = tfidf.fit_transform(texts)
            except ValueError as e:
                print(f"    Error processing version group {version}: {e}")
                continue

            # 독립 변수와 종속 변수
            X = pd.DataFrame(X_text.toarray())
            y = group['score']

            # 데이터가 충분한지 확인 (최소 10개의 데이터 필요)
            if len(group) < 10:
                print(f"    Skipping version group {version}: Not enough data ({len(group)} rows)")
                continue

            # 데이터 분리
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

            # 모델 학습
            model = RandomForestRegressor(random_state=42)
            model.fit(X_train, y_train)

            # 평가
            y_pred = model.predict(X_test)
            mse = mean_squared_error(y_test, y_pred)
            r2 = r2_score(y_test, y_pred)

            print(f"    Version group {version}: MSE = {mse}, R2 Score = {r2}")

            # 중요 피처 분석
            feature_importances = pd.Series(model.feature_importances_, index=tfidf.get_feature_names_out()).sort_values(ascending=False)
            print(f"    {version} 상위 중요한 피처:")
            print(feature_importances.head(5))

            # 저장
            version_results[version] = {
                'MSE': mse,
                'R2 Score': r2,
                'Important Features': feature_importances.head(10).to_dict()
            }

        # 앱 결과 저장
        model_results[app] = version_results

# 결과 요약 출력
print("\n모델링 결과 요약:")
for app, version_data in model_results.items():
    print(f"App: {app}")
    for version, result in version_data.items():
        print(f"  Version group {version}: MSE = {result['MSE']}, R2 Score = {result['R2 Score']}")
        print("  상위 중요한 피처:")
        for feature, importance in result['Important Features'].items():
            print(f"    - {feature}: {importance}")


## 피터팬에 대해

In [11]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 데이터 경로 설정
output_path = '/content/drive/MyDrive/2024/TextMining/reviewmodel'

# 앱 이름 및 파일 경로
selected_apps = ['피터팬']
files = {app: os.path.join(output_path, f"cleaned_{app}_별점.csv") for app in selected_apps}

# 파일 처리 및 분석
for app, file_path in files.items():
    print(f"Analyzing data for app: {app}")

    # 데이터 로드
    data = pd.read_csv(file_path)

    # 필요한 열만 선택
    if 'nouns_without_stopwords' not in data.columns or 'score' not in data.columns:
        print(f"Error: Missing required columns in file {file_path}")
        continue

    data = data[['nouns_without_stopwords', 'score']].dropna()

    # TF-IDF 벡터화
    vectorizer = TfidfVectorizer(max_features=500)  # 상위 500개의 단어만 사용
    X = vectorizer.fit_transform(data['nouns_without_stopwords']).toarray()
    y = data['score']

    # 학습 및 테스트 데이터 분리
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # 모델 학습
    model = RandomForestRegressor(random_state=42)
    model.fit(X_train, y_train)

    # 예측
    y_pred = model.predict(X_test)

    # 평가
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"Results for {app}:")
    print(f"Mean Squared Error: {mse}")
    print(f"R-squared: {r2}\n")


Analyzing data for app: 피터팬
Results for 피터팬:
Mean Squared Error: 1.120017749173
R-squared: 0.5221009646695718



In [12]:
# 특정 단어가 포함된 경우 예상 평점 계산 함수
def predict_score_for_word(word, vectorizer, model):
    """
    특정 단어의 예상 평점을 계산합니다.
    :param word: 예상 평점을 확인할 단어 (str)
    :param vectorizer: 학습에 사용된 TfidfVectorizer
    :param model: 학습된 모델 (RandomForestRegressor)
    :return: 예상 평점 (float)
    """
    # 입력 단어를 TF-IDF 벡터로 변환
    word_vector = vectorizer.transform([word]).toarray()

    # 모델을 통해 예상 평점 예측
    predicted_score = model.predict(word_vector)[0]

    return predicted_score

# 단어 별 예상 평점 예측
word_to_test = '짜증'  # 확인할 단어
predicted_score = predict_score_for_word(word_to_test, vectorizer, model)
print(f"단어 '{word_to_test}'에 대한 예상 평점: {predicted_score:.2f}")


단어 '짜증'에 대한 예상 평점: 1.96


### 전체 모든 앱에 대해 한번에 실행해버림
def process_app(app_name, file_path, test_word):

'test_word' 칸에 이런 단어 들어가있을때 대충 이 점수일것이다 를 알고싶은 단어 집어넣기.

In [13]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 데이터 로드 함수
def load_data(file_path):
    data = pd.read_csv(file_path)
    if 'nouns_without_stopwords' not in data.columns or 'score' not in data.columns:
        raise ValueError(f"Required columns missing in {file_path}")
    return data[['nouns_without_stopwords', 'score']].dropna()

# TF-IDF 벡터화 및 모델 학습 함수
def train_model(data):
    vectorizer = TfidfVectorizer(max_features=500)
    X = vectorizer.fit_transform(data['nouns_without_stopwords']).toarray()
    y = data['score']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"Mean Squared Error: {mse}")
    print(f"R-squared: {r2}")
    return vectorizer, model

# 특정 단어 예상 평점 계산 함수
def predict_score_for_word(word, vectorizer, model):
    word_vector = vectorizer.transform([word]).toarray()
    predicted_score = model.predict(word_vector)[0]
    return predicted_score

# 앱 데이터 처리
def process_app(app_name, file_path, test_word):
    print(f"Processing app: {app_name}")
    data = load_data(file_path)
    vectorizer, model = train_model(data)
    predicted_score = predict_score_for_word(test_word, vectorizer, model)
    print(f"Predicted score for '{test_word}' in {app_name}: {predicted_score:.2f}\n")

# 앱별 파일 경로 설정
output_path = '/content/drive/MyDrive/2024/TextMining/reviewmodel'
apps = {
    '피터팬': f"{output_path}/cleaned_피터팬_별점.csv",
    '다방': f"{output_path}/cleaned_다방_별점.csv",
    '직방': f"{output_path}/cleaned_직방_별점.csv",
    '호갱노노': f"{output_path}/cleaned_호갱노노_별점.csv",
    '네이버부동산': f"{output_path}/cleaned_네이버부동산_별점.csv"
}

# 앱별 독립적 분석
process_app('피터팬', apps['피터팬'], '사기')
process_app('다방', apps['다방'], '사기')
process_app('직방', apps['직방'], '사기')
process_app('호갱노노', apps['호갱노노'], '사기')
process_app('네이버부동산', apps['네이버부동산'], '사기')

Processing app: 피터팬
Mean Squared Error: 1.120017749173
R-squared: 0.5221009646695718
Predicted score for '사기' in 피터팬: 1.24

Processing app: 다방
Mean Squared Error: 1.069345658571159
R-squared: 0.57218462385512
Predicted score for '사기' in 다방: 1.00

Processing app: 직방
Mean Squared Error: 1.549488268641228
R-squared: 0.40939370520237206
Predicted score for '사기' in 직방: 1.94

Processing app: 호갱노노
Mean Squared Error: 1.8844002539901519
R-squared: 0.3646737711082856
Predicted score for '사기' in 호갱노노: 4.60

Processing app: 네이버부동산
Mean Squared Error: 1.656851364777668
R-squared: 0.21276960386111832
Predicted score for '사기' in 네이버부동산: 3.40



In [14]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import pickle

# 데이터 로드 함수
def load_data(file_path):
    data = pd.read_csv(file_path)
    if 'nouns_without_stopwords' not in data.columns or 'score' not in data.columns:
        raise ValueError(f"Required columns missing in {file_path}")
    return data[['nouns_without_stopwords', 'score']].dropna()

# 모델 학습 및 저장 함수
def train_and_save_model(app_name, file_path):
    data = load_data(file_path)
    vectorizer = TfidfVectorizer(max_features=500)
    X = vectorizer.fit_transform(data['nouns_without_stopwords']).toarray()
    y = data['score']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"{app_name} - Mean Squared Error: {mse}")
    print(f"{app_name} - R-squared: {r2}")

    # 모델과 벡터라이저 저장
    with open(f"{app_name}_vectorizer.pkl", 'wb') as vec_file:
        pickle.dump(vectorizer, vec_file)
    with open(f"{app_name}_model.pkl", 'wb') as model_file:
        pickle.dump(model, model_file)
    print(f"Model and vectorizer saved for {app_name}")

# 앱별 파일 경로 설정
output_path = '/content/drive/MyDrive/2024/TextMining/reviewmodel'
apps = {
    '피터팬': f"{output_path}/cleaned_피터팬_별점.csv",
    '다방': f"{output_path}/cleaned_다방_별점.csv",
    '직방': f"{output_path}/cleaned_직방_별점.csv",
    '호갱노노': f"{output_path}/cleaned_호갱노노_별점.csv",
    '네이버부동산': f"{output_path}/cleaned_네이버부동산_별점.csv"
}

# 모든 앱에 대해 모델 학습 및 저장
for app_name, file_path in apps.items():
    try:
        train_and_save_model(app_name, file_path)
    except Exception as e:
        print(f"Error processing {app_name}: {e}")


피터팬 - Mean Squared Error: 1.120017749173
피터팬 - R-squared: 0.5221009646695718
Model and vectorizer saved for 피터팬
다방 - Mean Squared Error: 1.069345658571159
다방 - R-squared: 0.57218462385512
Model and vectorizer saved for 다방
직방 - Mean Squared Error: 1.549488268641228
직방 - R-squared: 0.40939370520237206
Model and vectorizer saved for 직방
호갱노노 - Mean Squared Error: 1.8844002539901519
호갱노노 - R-squared: 0.3646737711082856
Model and vectorizer saved for 호갱노노
네이버부동산 - Mean Squared Error: 1.656851364777668
네이버부동산 - R-squared: 0.21276960386111832
Model and vectorizer saved for 네이버부동산


In [15]:
# 모델 로드 함수
def load_model_and_vectorizer(app_name):
    with open(f"{app_name}_vectorizer.pkl", 'rb') as vec_file:
        vectorizer = pickle.load(vec_file)
    with open(f"{app_name}_model.pkl", 'rb') as model_file:
        model = pickle.load(model_file)
    print(f"Loaded vectorizer and model for {app_name}")
    return vectorizer, model

# 특정 단어 예상 점수 계산 함수
def predict_scores_for_words(app_name, test_words):
    vectorizer, model = load_model_and_vectorizer(app_name)
    for word in test_words:
        word_vector = vectorizer.transform([word]).toarray()
        predicted_score = model.predict(word_vector)[0]
        print(f"Predicted score for '{word}' in {app_name}: {predicted_score:.2f}")
    print("\n")

# 테스트할 단어
test_words = ['사기', '좋아요', '최악', '친절']

# 모든 앱에 대해 단어 예측 실행
for app_name in apps.keys():
    try:
        predict_scores_for_words(app_name, test_words)
    except Exception as e:
        print(f"Error processing {app_name}: {e}")


Loaded vectorizer and model for 피터팬
Predicted score for '사기' in 피터팬: 1.24
Predicted score for '좋아요' in 피터팬: 4.83
Predicted score for '최악' in 피터팬: 4.41
Predicted score for '친절' in 피터팬: 4.83


Loaded vectorizer and model for 다방
Predicted score for '사기' in 다방: 1.00
Predicted score for '좋아요' in 다방: 4.68
Predicted score for '최악' in 다방: 2.29
Predicted score for '친절' in 다방: 4.68


Loaded vectorizer and model for 직방
Predicted score for '사기' in 직방: 1.94
Predicted score for '좋아요' in 직방: 4.40
Predicted score for '최악' in 직방: 1.00
Predicted score for '친절' in 직방: 4.40


Loaded vectorizer and model for 호갱노노
Predicted score for '사기' in 호갱노노: 4.60
Predicted score for '좋아요' in 호갱노노: 4.60
Predicted score for '최악' in 호갱노노: 2.69
Predicted score for '친절' in 호갱노노: 4.60


Loaded vectorizer and model for 네이버부동산
Predicted score for '사기' in 네이버부동산: 3.40
Predicted score for '좋아요' in 네이버부동산: 2.83
Predicted score for '최악' in 네이버부동산: 1.00
Predicted score for '친절' in 네이버부동산: 2.83




In [16]:
!pip uninstall -y tensorflow
!pip install tensorflow-cpu

Found existing installation: tensorflow 2.17.1
Uninstalling tensorflow-2.17.1:
  Successfully uninstalled tensorflow-2.17.1
Collecting tensorflow-cpu
  Downloading tensorflow_cpu-2.18.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting tensorboard<2.19,>=2.18 (from tensorflow-cpu)
  Downloading tensorboard-2.18.0-py3-none-any.whl.metadata (1.6 kB)
Downloading tensorflow_cpu-2.18.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (230.0 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m230.0/230.0 MB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading tensorboard-2.18.0-py3-none-any.whl (5.5 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.5/5.5 MB[0m [31m104.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: tensorboard, tensorflow-cpu
  Attempting uninstall: tensorboard
    Found existing installation: tensorboard 2.17.1
    Uninstalling tensorboard-2.17.1:
      Successfull

In [17]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, BertModel
import torch.nn as nn

# 데이터 로드 함수
def load_data(file_path):
    data = pd.read_csv(file_path)
    if 'nouns_without_stopwords' not in data.columns or 'score' not in data.columns:
        raise ValueError(f"Required columns missing in {file_path}")
    return data[['nouns_without_stopwords', 'score']].dropna()

# 랜덤포레스트 모델 학습 및 평가
def random_forest_model(data):
    vectorizer = TfidfVectorizer(max_features=500)
    X = vectorizer.fit_transform(data['nouns_without_stopwords']).toarray()
    y = data['score']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = mean_squared_error(y_test, y_pred, squared=False)  # RMSE
    r2 = r2_score(y_test, y_pred)

    return vectorizer, model, mse, rmse, r2

# BERT 데이터셋 클래스
class ReviewDataset(Dataset):
    def __init__(self, texts, targets, tokenizer, max_length):
        self.texts = texts
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        target = self.targets[idx]
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'target': torch.tensor(target, dtype=torch.float),
        }

# BERT 모델 정의
class BERTRegressor(nn.Module):
    def __init__(self, bert_model_name):
        super(BERTRegressor, self).__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        output = self.drop(pooled_output)
        return self.out(output)

# BERT 모델 학습 및 평가
def bert_model(data, bert_model_name='bert-base-uncased', max_length=128, batch_size=16, epochs=3):
    tokenizer = BertTokenizer.from_pretrained(bert_model_name)
    dataset = ReviewDataset(
        texts=data['nouns_without_stopwords'].tolist(),
        targets=data['score'].tolist(),
        tokenizer=tokenizer,
        max_length=max_length,
    )
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = BERTRegressor(bert_model_name).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    criterion = nn.MSELoss()

    # 모델 학습
    model.train()
    for epoch in range(epochs):
        epoch_loss = 0
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            targets = batch['target'].to(device)

            outputs = model(input_ids, attention_mask).squeeze(-1)
            loss = criterion(outputs, targets)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()

        print(f"Epoch {epoch + 1}/{epochs}, Loss: {epoch_loss / len(dataloader):.4f}")

    # 모델 평가
    model.eval()
    y_true = []
    y_pred = []
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            targets = batch['target'].to(device)

            outputs = model(input_ids, attention_mask).squeeze(-1)
            y_true.extend(targets.tolist())
            y_pred.extend(outputs.tolist())

    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)

    return mse, rmse, r2

# 앱별 독립적 분석
def process_app(app_name, file_path):
    print(f"\nProcessing app: {app_name}")
    data = load_data(file_path)

    # 랜덤포레스트 모델 학습 및 결과 출력
    print("Training Random Forest Model...")
    vectorizer, rf_model, rf_mse, rf_rmse, rf_r2 = random_forest_model(data)
    print(f"Random Forest Results for {app_name}:")
    print(f"{'MSE':<10}: {rf_mse:.2f}")
    print(f"{'RMSE':<10}: {rf_rmse:.2f}")
    print(f"{'R-squared':<10}: {rf_r2:.2f}\n")

    # BERT 모델 학습 및 결과 출력
    print("Training BERT Model...")
    bert_mse, bert_rmse, bert_r2 = bert_model(data)
    print(f"BERT Results for {app_name}:")
    print(f"{'MSE':<10}: {bert_mse:.2f}")
    print(f"{'RMSE':<10}: {bert_rmse:.2f}")
    print(f"{'R-squared':<10}: {bert_r2:.2f}\n")

# 앱별 파일 경로 설정
output_path = '/content/drive/MyDrive/2024/TextMining/reviewmodel'
apps = {
    '피터팬': f"{output_path}/cleaned_피터팬_별점.csv",
    '다방': f"{output_path}/cleaned_다방_별점.csv",
    '직방': f"{output_path}/cleaned_직방_별점.csv",
    '호갱노노': f"{output_path}/cleaned_호갱노노_별점.csv",
    '네이버부동산': f"{output_path}/cleaned_네이버부동산_별점.csv"
}

# 앱별 독립적 분석 실행
for app_name, file_path in apps.items():
    process_app(app_name, file_path)



Processing app: 피터팬
Training Random Forest Model...




Random Forest Results for 피터팬:
MSE       : 1.12
RMSE      : 1.06
R-squared : 0.52

Training BERT Model...


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Epoch 1/3, Loss: 2.8208
Epoch 2/3, Loss: 1.1889
Epoch 3/3, Loss: 1.0785
BERT Results for 피터팬:
MSE       : 0.96
RMSE      : 0.98
R-squared : 0.56


Processing app: 다방
Training Random Forest Model...




Random Forest Results for 다방:
MSE       : 1.07
RMSE      : 1.03
R-squared : 0.57

Training BERT Model...
Epoch 1/3, Loss: 2.0171
Epoch 2/3, Loss: 1.1876
Epoch 3/3, Loss: 1.0308
BERT Results for 다방:
MSE       : 0.87
RMSE      : 0.93
R-squared : 0.65


Processing app: 직방
Training Random Forest Model...




Random Forest Results for 직방:
MSE       : 1.55
RMSE      : 1.24
R-squared : 0.41

Training BERT Model...
Epoch 1/3, Loss: 1.9626
Epoch 2/3, Loss: 1.5530
Epoch 3/3, Loss: 1.4258
BERT Results for 직방:
MSE       : 1.23
RMSE      : 1.11
R-squared : 0.53


Processing app: 호갱노노
Training Random Forest Model...




Random Forest Results for 호갱노노:
MSE       : 1.88
RMSE      : 1.37
R-squared : 0.36

Training BERT Model...
Epoch 1/3, Loss: 4.3334
Epoch 2/3, Loss: 2.2638
Epoch 3/3, Loss: 2.0583
BERT Results for 호갱노노:
MSE       : 1.82
RMSE      : 1.35
R-squared : 0.37


Processing app: 네이버부동산
Training Random Forest Model...




Random Forest Results for 네이버부동산:
MSE       : 1.66
RMSE      : 1.29
R-squared : 0.21

Training BERT Model...
Epoch 1/3, Loss: 2.2488
Epoch 2/3, Loss: 1.9019
Epoch 3/3, Loss: 1.7964
BERT Results for 네이버부동산:
MSE       : 1.62
RMSE      : 1.27
R-squared : 0.21



In [18]:
from transformers import AutoModel, AutoTokenizer
AutoModel.from_pretrained("bert-base-uncased")
AutoTokenizer.from_pretrained("bert-base-uncased")


BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [19]:
!pip install transformers tqdm




In [20]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, BertModel
import torch
import torch.nn as nn
from tqdm import tqdm

# 데이터 로드 함수
def load_data(file_path):
    data = pd.read_csv(file_path)
    if 'nouns_without_stopwords' not in data.columns or 'score' not in data.columns:
        raise ValueError(f"Required columns missing in {file_path}")
    data = data[['nouns_without_stopwords', 'score']].dropna()

    # 데이터 크기 줄이기 (10%만 샘플링)
    data = data.sample(frac=0.1, random_state=42)
    return data

# 랜덤포레스트 모델 학습 및 평가
def random_forest_model(data):
    vectorizer = TfidfVectorizer(max_features=500)
    X = vectorizer.fit_transform(data['nouns_without_stopwords']).toarray()
    y = data['score']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

    return vectorizer, model, mse, rmse, r2

# BERT 데이터셋 클래스
class ReviewDataset(Dataset):
    def __init__(self, texts, targets, tokenizer, max_length):
        self.texts = texts
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        target = self.targets[idx]
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'target': torch.tensor(target, dtype=torch.float),
        }

# BERT 모델 정의
class BERTRegressor(nn.Module):
    def __init__(self, bert_model_name):
        super(BERTRegressor, self).__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        output = self.drop(pooled_output)
        return self.out(output)

# BERT 모델 학습 및 평가
def bert_model(data, bert_model_name='bert-base-uncased', max_length=64, batch_size=8, epochs=3):
    tokenizer = BertTokenizer.from_pretrained(bert_model_name)

    # Train-Test Split
    X_train, X_test, y_train, y_test = train_test_split(
        data['nouns_without_stopwords'], data['score'], test_size=0.2, random_state=42
    )

    train_dataset = ReviewDataset(
        texts=X_train.tolist(),
        targets=y_train.tolist(),
        tokenizer=tokenizer,
        max_length=max_length,
    )
    test_dataset = ReviewDataset(
        texts=X_test.tolist(),
        targets=y_test.tolist(),
        tokenizer=tokenizer,
        max_length=max_length,
    )

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = BERTRegressor(bert_model_name).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    criterion = nn.MSELoss()

    # 모델 학습
    model.train()
    for epoch in range(epochs):
        epoch_loss = 0
        loop = tqdm(train_loader, leave=True, desc=f'Epoch {epoch + 1}/{epochs}')
        for batch in loop:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            targets = batch['target'].to(device)

            outputs = model(input_ids, attention_mask).squeeze(-1)
            loss = criterion(outputs, targets)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            loop.set_postfix(loss=loss.item())

        print(f"Epoch {epoch + 1} Loss: {epoch_loss / len(train_loader):.4f}")

    # 모델 평가
    model.eval()
    y_true = []
    y_pred = []
    with torch.no_grad():
        for batch in test_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            targets = batch['target'].to(device)

            outputs = model(input_ids, attention_mask).squeeze(-1)
            y_true.extend(targets.tolist())
            y_pred.extend(outputs.tolist())

    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)

    return mse, rmse, r2

# 앱별 독립적 분석
def process_app(app_name, file_path):
    print(f"\nProcessing app: {app_name}")

    # GPU 확인
    if torch.cuda.is_available():
        print(f"Using GPU: {torch.cuda.get_device_name(0)}")
    else:
        print("GPU is not available. Using CPU.")

    data = load_data(file_path)

    # 랜덤포레스트 모델 학습 및 결과 출력
    print("Training Random Forest Model...")
    vectorizer, rf_model, rf_mse, rf_rmse, rf_r2 = random_forest_model(data)
    print(f"Random Forest Results for {app_name}:")
    print(f"{'MSE':<10}: {rf_mse:.2f}")
    print(f"{'RMSE':<10}: {rf_rmse:.2f}")
    print(f"{'R-squared':<10}: {rf_r2:.2f}\n")

    # BERT 모델 학습 및 결과 출력
    print("Training BERT Model...")
    bert_mse, bert_rmse, bert_r2 = bert_model(data)
    print(f"BERT Results for {app_name}:")
    print(f"{'MSE':<10}: {bert_mse:.2f}")
    print(f"{'RMSE':<10}: {bert_rmse:.2f}")
    print(f"{'R-squared':<10}: {bert_r2:.2f}\n")

# 앱별 파일 경로 설정
output_path = '/content/drive/MyDrive/2024/TextMining/reviewmodel'
apps = {
    '피터팬':f"{output_path}/cleaned_피터팬_별점.csv",
    '다방':f"{output_path}/cleaned_피터팬_별점.csv",
    '직방': f"{output_path}/cleaned_직방_별점.csv",
    '호갱노노': f"{output_path}/cleaned_호갱노노_별점.csv",
    '네이버부동산': f"{output_path}/cleaned_네이버부동산_별점.csv"
}

# 앱별 독립적 분석 실행
for app_name, file_path in apps.items():
    process_app(app_name, file_path)



Processing app: 피터팬
Using GPU: Tesla T4
Training Random Forest Model...
Random Forest Results for 피터팬:
MSE       : 1.52
RMSE      : 1.23
R-squared : 0.41

Training BERT Model...


Epoch 1/3: 100%|██████████| 45/45 [00:05<00:00,  8.13it/s, loss=4.94]


Epoch 1 Loss: 9.8275


Epoch 2/3: 100%|██████████| 45/45 [00:05<00:00,  8.25it/s, loss=1.61]


Epoch 2 Loss: 3.0149


Epoch 3/3: 100%|██████████| 45/45 [00:05<00:00,  8.18it/s, loss=0.911]


Epoch 3 Loss: 1.6876
BERT Results for 피터팬:
MSE       : 1.52
RMSE      : 1.23
R-squared : 0.40


Processing app: 다방
Using GPU: Tesla T4
Training Random Forest Model...
Random Forest Results for 다방:
MSE       : 1.52
RMSE      : 1.23
R-squared : 0.41

Training BERT Model...


Epoch 1/3: 100%|██████████| 45/45 [00:05<00:00,  8.12it/s, loss=3.7]


Epoch 1 Loss: 8.0063


Epoch 2/3: 100%|██████████| 45/45 [00:05<00:00,  8.12it/s, loss=2.66]


Epoch 2 Loss: 2.4808


Epoch 3/3: 100%|██████████| 45/45 [00:05<00:00,  8.16it/s, loss=0.239]


Epoch 3 Loss: 1.6621
BERT Results for 다방:
MSE       : 1.59
RMSE      : 1.26
R-squared : 0.38


Processing app: 직방
Using GPU: Tesla T4
Training Random Forest Model...
Random Forest Results for 직방:
MSE       : 1.87
RMSE      : 1.37
R-squared : 0.33

Training BERT Model...


Epoch 1/3: 100%|██████████| 221/221 [00:27<00:00,  8.12it/s, loss=2.36]


Epoch 1 Loss: 3.9071


Epoch 2/3: 100%|██████████| 221/221 [00:27<00:00,  8.11it/s, loss=2.65]


Epoch 2 Loss: 2.1359


Epoch 3/3: 100%|██████████| 221/221 [00:26<00:00,  8.21it/s, loss=2.32]


Epoch 3 Loss: 1.8979
BERT Results for 직방:
MSE       : 1.81
RMSE      : 1.34
R-squared : 0.35


Processing app: 호갱노노
Using GPU: Tesla T4
Training Random Forest Model...
Random Forest Results for 호갱노노:
MSE       : 2.06
RMSE      : 1.44
R-squared : 0.24

Training BERT Model...


Epoch 1/3: 100%|██████████| 21/21 [00:02<00:00,  8.39it/s, loss=11.7]


Epoch 1 Loss: 9.8174


Epoch 2/3: 100%|██████████| 21/21 [00:02<00:00,  8.27it/s, loss=4.64]


Epoch 2 Loss: 4.4287


Epoch 3/3: 100%|██████████| 21/21 [00:02<00:00,  8.26it/s, loss=2.76]


Epoch 3 Loss: 2.9462
BERT Results for 호갱노노:
MSE       : 2.40
RMSE      : 1.55
R-squared : 0.12


Processing app: 네이버부동산
Using GPU: Tesla T4
Training Random Forest Model...
Random Forest Results for 네이버부동산:
MSE       : 1.97
RMSE      : 1.40
R-squared : -0.17

Training BERT Model...


Epoch 1/3: 100%|██████████| 21/21 [00:02<00:00,  8.19it/s, loss=5.47]


Epoch 1 Loss: 4.0685


Epoch 2/3: 100%|██████████| 21/21 [00:02<00:00,  8.20it/s, loss=2.6]


Epoch 2 Loss: 2.9803


Epoch 3/3: 100%|██████████| 21/21 [00:02<00:00,  8.17it/s, loss=0.984]


Epoch 3 Loss: 2.6608
BERT Results for 네이버부동산:
MSE       : 1.79
RMSE      : 1.34
R-squared : -0.07



In [21]:
import pandas as pd
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import torch
from torch.utils.data import DataLoader, Dataset
import pickle

# 데이터 로드 함수
def load_data(file_path):
    data = pd.read_csv(file_path)
    if 'nouns_without_stopwords' not in data.columns or 'score' not in data.columns:
        raise ValueError(f"Required columns missing in {file_path}")
    return data[['nouns_without_stopwords', 'score']].dropna()

# 커스텀 데이터셋
class ReviewDataset(Dataset):
    def __init__(self, reviews, scores, tokenizer, max_len=128):
        self.reviews = reviews
        self.scores = scores
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, idx):
        review = self.reviews.iloc[idx]
        score = self.scores.iloc[idx]

        encoding = self.tokenizer.encode_plus(
            review,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True
        )

        return {
            'input_ids': torch.tensor(encoding['input_ids'], dtype=torch.long),
            'attention_mask': torch.tensor(encoding['attention_mask'], dtype=torch.long),
            'score': torch.tensor(score, dtype=torch.float)
        }

# 모델 학습 및 저장 함수
def train_and_save_model(app_name, file_path):
    data = load_data(file_path)

    # Load tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        data['nouns_without_stopwords'], data['score'], test_size=0.2, random_state=42
    )

    # Create datasets and dataloaders
    train_dataset = ReviewDataset(X_train, y_train, tokenizer)
    test_dataset = ReviewDataset(X_test, y_test, tokenizer)

    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=16)

    # Load BERT model
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=1)
    model = model.to(torch.device('cuda' if torch.cuda.is_available() else 'cpu'))

    optimizer = AdamW(model.parameters(), lr=2e-5)

    # Training loop
    epochs = 3
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for batch in train_loader:
            input_ids = batch['input_ids'].to(model.device)
            attention_mask = batch['attention_mask'].to(model.device)
            scores = batch['score'].to(model.device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=scores.unsqueeze(1))
            loss = outputs.loss
            total_loss += loss.item()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print(f"{app_name} - Epoch {epoch + 1}/{epochs}, Loss: {total_loss / len(train_loader)}")

    # Evaluation
    model.eval()
    y_pred, y_true = [], []
    with torch.no_grad():
        for batch in test_loader:
            input_ids = batch['input_ids'].to(model.device)
            attention_mask = batch['attention_mask'].to(model.device)
            scores = batch['score'].to(model.device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits.squeeze().cpu().numpy()
            y_pred.extend(logits)
            y_true.extend(scores.cpu().numpy())

    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)

    print(f"{app_name} - Mean Squared Error: {mse}")
    print(f"{app_name} - R-squared: {r2}")

    # Save model and tokenizer
    model.save_pretrained(f"{app_name}_model")
    tokenizer.save_pretrained(f"{app_name}_tokenizer")
    print(f"Model and tokenizer saved for {app_name}")

# 모델 로드 함수
def load_model_and_tokenizer(app_name):
    model = BertForSequenceClassification.from_pretrained(f"{app_name}_model")
    tokenizer = BertTokenizer.from_pretrained(f"{app_name}_tokenizer")
    print(f"Loaded model and tokenizer for {app_name}")
    return model, tokenizer

# 특정 단어 예상 점수 계산 함수
def predict_scores_for_words(app_name, test_words):
    model, tokenizer = load_model_and_tokenizer(app_name)
    model.eval()

    predictions = []
    for word in test_words:
        encoding = tokenizer(
            word,
            add_special_tokens=True,
            max_length=128,
            return_tensors="pt",
            padding='max_length',
            truncation=True
        )
        input_ids = encoding['input_ids'].to(model.device)
        attention_mask = encoding['attention_mask'].to(model.device)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            predicted_score = outputs.logits.squeeze().item()
            predictions.append((word, predicted_score))
            print(f"Predicted score for '{word}' in {app_name}: {predicted_score:.2f}")

    print("\n")
    return predictions

# 앱별 파일 경로 설정
output_path = '/content/drive/MyDrive/2024/TextMining/reviewmodel'
apps = {
    '피터팬': f"{output_path}/cleaned_피터팬_별점.csv",
    '다방': f"{output_path}/cleaned_다방_별점.csv",
    '직방': f"{output_path}/cleaned_직방_별점.csv",
    '호갱노노': f"{output_path}/cleaned_호갱노노_별점.csv",
    '네이버부동산': f"{output_path}/cleaned_네이버부동산_별점.csv"
}

# 모든 앱에 대해 모델 학습 및 저장
for app_name, file_path in apps.items():
    try:
        train_and_save_model(app_name, file_path)
    except Exception as e:
        print(f"Error processing {app_name}: {e}")

# 테스트할 단어
test_words = ['사기', '좋아요', '최악', '친절']

# 모든 앱에 대해 단어 예측 실행
for app_name in apps.keys():
    try:
        predict_scores_for_words(app_name, test_words)
    except Exception as e:
        print(f"Error processing {app_name}: {e}")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


피터팬 - Epoch 1/3, Loss: 2.5641950728495915
피터팬 - Epoch 2/3, Loss: 1.268910451663865
피터팬 - Epoch 3/3, Loss: 1.1265691538320648
피터팬 - Mean Squared Error: 1.1251959800720215
피터팬 - R-squared: 0.5198914663228766
Model and tokenizer saved for 피터팬


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


다방 - Epoch 1/3, Loss: 1.7968297883509152
다방 - Epoch 2/3, Loss: 1.1397271380426894
다방 - Epoch 3/3, Loss: 0.9859451511987003
다방 - Mean Squared Error: 1.1104443073272705
다방 - R-squared: 0.5557422279915839
Model and tokenizer saved for 다방


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


직방 - Epoch 1/3, Loss: 1.9749312913720158
직방 - Epoch 2/3, Loss: 1.5539025291395576
직방 - Epoch 3/3, Loss: 1.4192694938350199
직방 - Mean Squared Error: 1.4907722473144531
직방 - R-squared: 0.4317740388417002
Model and tokenizer saved for 직방


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


호갱노노 - Epoch 1/3, Loss: 3.5301436799414017
호갱노노 - Epoch 2/3, Loss: 2.1774538267476884
호갱노노 - Epoch 3/3, Loss: 2.0679040294067534
호갱노노 - Mean Squared Error: 1.7941641807556152
호갱노노 - R-squared: 0.39509691801723357
Model and tokenizer saved for 호갱노노


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


네이버부동산 - Epoch 1/3, Loss: 2.17677721906276
네이버부동산 - Epoch 2/3, Loss: 1.8801715044748215
네이버부동산 - Epoch 3/3, Loss: 1.756026738598233
네이버부동산 - Mean Squared Error: 1.7803014516830444
네이버부동산 - R-squared: 0.15411402245606276
Model and tokenizer saved for 네이버부동산
Loaded model and tokenizer for 피터팬
Predicted score for '사기' in 피터팬: 3.65
Predicted score for '좋아요' in 피터팬: 3.89
Predicted score for '최악' in 피터팬: 3.89
Predicted score for '친절' in 피터팬: 3.96


Loaded model and tokenizer for 다방
Predicted score for '사기' in 다방: 3.70
Predicted score for '좋아요' in 다방: 4.52
Predicted score for '최악' in 다방: 4.52
Predicted score for '친절' in 다방: 4.88


Loaded model and tokenizer for 직방
Predicted score for '사기' in 직방: 2.04
Predicted score for '좋아요' in 직방: 4.40
Predicted score for '최악' in 직방: 4.40
Predicted score for '친절' in 직방: 4.78


Loaded model and tokenizer for 호갱노노
Predicted score for '사기' in 호갱노노: 3.11
Predicted score for '좋아요' in 호갱노노: 4.47
Predicted score for '최악' in 호갱노노: 4.47
Predicted score for '친절' in 호