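### Web crawling
- Structure of a web service: server - client
- Three ways to crawl
    - requests : JSON : dynamic pages
        - Crawl stock price data from the Naver stock site
            - Visualization (normalization), correlation coefficient
    - requests : HTML : static pages
    - selenium : web driver
- Crawling procedure
    - Analyze the web service : URL
    - Request, response : str
    - Preprocessing : parse the str data (dict, bs object (css-selector), etc.) and build a DataFrame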
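
The last two steps of the procedure (a response string, then parsing it into a DataFrame) can be sketched without a network call. The payload below is made-up sample data shaped like the review endpoint used later in this notebook, and `parse_reviews` is a hypothetical helper name, not part of any library.

```python
import json

import pandas as pd

# Made-up response body; in a real crawl this would be requests.get(url).text
sample_text = json.dumps({
    "htReturnValue": {"pagedResult": {"content": [
        {"reviewScore": 5, "contents": "Great"},
        {"reviewScore": 3, "contents": "Okay"},
    ]}}
})


def parse_reviews(text):
    """Parse a JSON response string into a DataFrame of reviews."""
    payload = json.loads(text)  # str -> dict
    rows = payload["htReturnValue"]["pagedResult"]["content"]  # list of dicts
    return pd.DataFrame(rows)  # one row per review


reviews = parse_reviews(sample_text)
```

Keeping the parsing step in its own function makes it possible to check it against saved responses before sending any requests.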

#### 1. Analyze the web service: find the URL

In [1]:
import pandas as pd
import requests
from fake_useragent import UserAgent
import time

In [3]:
frames = []  # one DataFrame per page
for page in range(1, 106):
    url = "https://smartstore.naver.com/lgycompany/products/4817186601/reviews/page.json?page={}&size=20&sortType=REVIEW_SCORE_DESC&contentType=ALL&topicCode".format(page)
    headers = {
        'accept': '*/*',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.9,ko;q=0.8',
        'cache-control': 'no-cache',
        'charset': 'utf-8',
        'content-type': 'application/x-www-form-urlencoded; charset=utf-8',
        'dnt': '1',
        'pragma': 'no-cache',
        'referer': 'https://smartstore.naver.com/lgycompany/products/4817186601',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': str(UserAgent().random),  # random User-Agent per request
        'x-requested-with': 'XMLHttpRequest'
    }
    response = requests.get(url, headers=headers)
    time.sleep(3)  # pause between requests
    # Raises KeyError when the server returns an error payload instead of reviews
    data = response.json()['htReturnValue']['pagedResult']['content']
    frames.append(pd.DataFrame(data))
# DataFrame.append was removed in pandas 2.0; concatenate once at the end
result_df = pd.concat(frames, ignore_index=True)
result_df = result_df[['reviewScore', 'contents']]
result_df

KeyError: 'htReturnValue'

In [116]:
frames = []  # one DataFrame per page
total_review = 345
url_input = input("Enter a Naver Smart Store link:\n e.g. https://smartstore.naver.com/happyjiggu/products/4920780040\n")
# Ceiling division so the last, partially filled page is included
for page in range(1, (total_review + 19) // 20 + 1):
    url = url_input + ("/reviews/page.json?page={}&size=20&sortType=REVIEW_SCORE_DESC&contentType=ALL&topicCode").format(page)
    headers = {
        'accept': '*/*',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.9,ko;q=0.8',
        'cache-control': 'no-cache',
        'charset': 'utf-8',
        'content-type': 'application/x-www-form-urlencoded; charset=utf-8',
        'dnt': '1',
        'pragma': 'no-cache',
        'referer': url_input,
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': str(UserAgent().random),  # random User-Agent per request
        'x-requested-with': 'XMLHttpRequest',
    }
    response = requests.get(url, headers=headers)
    time.sleep(3)  # pause between requests
    # Raises KeyError when the server returns an error payload instead of reviews
    data = response.json()['htReturnValue']['pagedResult']['content']
    frames.append(pd.DataFrame(data))
# DataFrame.append was removed in pandas 2.0; concatenate once at the end
result_df = pd.concat(frames, ignore_index=True)
result_df = result_df[['reviewScore', 'contents']]
result_df

Enter a Naver Smart Store link:
 e.g. https://smartstore.naver.com/happyjiggu/products/4920780040
https://smartstore.naver.com/cheongnyeon2/products/4658712762


KeyError: 'htReturnValue'

In [115]:
# Inspect the raw payload to see why 'htReturnValue' is missing
data = response.json()
data

{'bSuccess': False,
 'sErrorCode': 'InvalidRequest',
 'sErrorMessage': '잘못된 요청입니다.'}
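
The payload above explains the `KeyError`: the server rejected the request (the message reads "Invalid request.") and returned an error object with no `htReturnValue` key. A minimal sketch of a guard follows, assuming failed responses always look like this one; `extract_reviews` is a hypothetical helper name.

```python
def extract_reviews(payload):
    """Return the review list, or raise a readable error for a failed response."""
    if "htReturnValue" not in payload:
        code = payload.get("sErrorCode", "unknown")
        message = payload.get("sErrorMessage", "")
        raise ValueError(f"request rejected: {code} {message}".strip())
    return payload["htReturnValue"]["pagedResult"]["content"]


# The error payload recorded in this notebook
error_payload = {
    "bSuccess": False,
    "sErrorCode": "InvalidRequest",
    "sErrorMessage": "잘못된 요청입니다.",
}

try:
    extract_reviews(error_payload)
except ValueError as exc:
    error_text = str(exc)  # carries the server's own error code and message
```

Raising an error that includes `sErrorCode` makes the cause visible right away, instead of a `KeyError` that only names the missing key.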

In [77]:
# Save result_df to an Excel file
# (the `encoding` argument was removed from to_excel in pandas 2.0)
result_df.to_excel("review data.xlsx", sheet_name=data[0]['purchasedProductName'])
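
Using a product name as the sheet name can fail, because Excel limits sheet names to 31 characters and disallows the characters `[ ] : * ? / \`. The sketch below cleans a name before it is passed as `sheet_name`; `safe_sheet_name` and its `fallback` parameter are hypothetical, introduced only for illustration.

```python
import re


def safe_sheet_name(name, fallback="reviews"):
    """Make a string usable as an Excel sheet name."""
    # Replace characters Excel does not allow in sheet names
    cleaned = re.sub(r"[\[\]:*?/\\]", "_", str(name)).strip()
    # Excel sheet names are limited to 31 characters
    return cleaned[:31] or fallback


sheet = safe_sheet_name("a/b:c")
```

It could then be used as `sheet_name=safe_sheet_name(data[0]['purchasedProductName'])`.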

In [90]:
url_input = input("Enter a Naver Smart Store link:\n e.g. https://smartstore.naver.com/happyjiggu/products/4920780040\n")
page = 1
url = url_input + ("/reviews/page.json?page={}&size=20&sortType=REVIEW_SCORE_DESC&contentType=ALL&topicCode").format(page)
url

Enter a Naver Smart Store link:
 e.g. https://smartstore.naver.com/happyjiggu/products/4920780040
https://smartstore.naver.com/happyjiggu/products/4920780040


'https://smartstore.naver.com/happyjiggu/products/4920780040/reviews/page.json?page=1&size=20&sortType=REVIEW_SCORE_DESC&contentType=ALL&topicCode'
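
The same URL can be assembled from named query parameters instead of one long format string, which makes the page number and page size easier to change. This is a sketch using the standard library's `urlencode`; `build_review_url` is a hypothetical helper, and the bare `topicCode` suffix is appended as-is to match the URL above.

```python
from urllib.parse import urlencode


def build_review_url(product_url, page, size=20):
    """Build the review endpoint URL for one page of a product's reviews."""
    params = {
        "page": page,
        "size": size,
        "sortType": "REVIEW_SCORE_DESC",
        "contentType": "ALL",
    }
    base = product_url.rstrip("/")  # tolerate a trailing slash in the input
    return f"{base}/reviews/page.json?{urlencode(params)}&topicCode"


url = build_review_url(
    "https://smartstore.naver.com/happyjiggu/products/4920780040", page=1
)
```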

In [127]:
# Wrap the crawl in a function
def crawl_review(total_review=5000):
    frames = []  # one DataFrame per page
    # Ceiling division so the last, partially filled page is included
    for page in range(1, (total_review + 19) // 20 + 1):
        url = ("https://smartstore.naver.com/kiroman/products/4839221779/reviews/page.json?page={}&size=20&sortType=REVIEW_SCORE_DESC&contentType=ALL&topicCode").format(page)
        headers = {
            'accept': '*/*',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'en-US,en;q=0.9,ko;q=0.8',
            'cache-control': 'no-cache',
            'charset': 'utf-8',
            'content-type': 'application/x-www-form-urlencoded; charset=utf-8',
            'dnt': '1',
            'pragma': 'no-cache',
            'referer': 'https://smartstore.naver.com/kiroman/products/4839221779',
            'sec-fetch-dest': 'empty',
            'sec-fetch-mode': 'cors',
            'sec-fetch-site': 'same-origin',
            'user-agent': str(UserAgent().random),  # random User-Agent per request
            'x-requested-with': 'XMLHttpRequest'
        }
        response = requests.get(url, headers=headers)
        time.sleep(3)  # pause between requests
        # Raises KeyError when the server returns an error payload instead of reviews
        data = response.json()['htReturnValue']['pagedResult']['content']
        frames.append(pd.DataFrame(data))
    # DataFrame.append was removed in pandas 2.0; concatenate once at the end
    result_df = pd.concat(frames, ignore_index=True)
    result_df = result_df[['reviewScore', 'contents']]
    # The `encoding` argument was removed from to_excel in pandas 2.0
    result_df.to_excel("review data.xlsx", sheet_name=data[0]['purchasedProductName'])
    return "Crawling complete! Check the Excel file for this product's reviews."

In [128]:
crawl_review(1563)

KeyError: 'htReturnValue'
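
Because one failed page raises `KeyError` and discards everything collected so far, it may help to separate the paging loop from the network call and stop cleanly at the first error payload. The sketch below assumes 20 reviews per page as in the URLs above. `page_numbers`, `collect_reviews`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real `requests.get(...).json()` call.

```python
import math


def page_numbers(total_review, size=20):
    """1-based page numbers covering total_review items, last partial page included."""
    return range(1, math.ceil(total_review / size) + 1)


def collect_reviews(fetch_page, total_review, size=20):
    """Call fetch_page(page) for each page; stop at the first failed response."""
    rows = []
    for page in page_numbers(total_review, size):
        payload = fetch_page(page)
        if "htReturnValue" not in payload:
            break  # keep what was collected instead of raising KeyError
        rows.extend(payload["htReturnValue"]["pagedResult"]["content"])
    return rows


def fake_fetch(page):
    """Stand-in for the network call: two good pages, then an error payload."""
    if page > 2:
        return {"bSuccess": False, "sErrorCode": "InvalidRequest"}
    content = [{"reviewScore": 5, "contents": f"page {page}"}]
    return {"htReturnValue": {"pagedResult": {"content": content}}}


rows = collect_reviews(fake_fetch, total_review=1563)
```

Passing the fetch function in as an argument lets the loop be exercised offline, and `math.ceil` keeps the final, partially filled page in the range.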