### scrapy
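- A web data collection framework written in Python
    - Framework vs. library (package)
    - A framework ships with pre-built code for a specific purpose; you write your part by filling in the blanks it leaves
    - A package is code written by someone else that you import and call from your own code
- Install scrapy
    - pip install scrapy
- Install tree
    - sudo apt install tree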
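
The framework/package distinction can be sketched in plain Python. This is only an illustration, not Scrapy code: `parse_price`, `MiniFramework`, and `PriceSpider` are hypothetical names made up for this example. With a library you decide when to call the code; with a framework you fill in a method and the framework decides when to call it.

```python
# Library style: you call someone else's code when you decide to.
def parse_price(text):
    """Hypothetical library function: turn '19,900' into 19900."""
    return int(text.replace(",", ""))


# Framework style: the framework owns the control flow and calls
# the methods that you fill in.
class MiniFramework:
    """Hypothetical stand-in for a framework base class."""

    start_values = []

    def run(self):
        # The framework, not the user, decides when handle() is called.
        return [self.handle(value) for value in self.start_values]

    def handle(self, value):
        raise NotImplementedError("fill in this blank")


class PriceSpider(MiniFramework):
    # "Fill in the blanks": provide the data and one callback.
    start_values = ["19,900", "30,000"]

    def handle(self, value):
        return parse_price(value)


results = PriceSpider().run()
print(results)
```

A Scrapy spider follows the second pattern: you supply `start_urls` and a `parse` method, and Scrapy drives the requests.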

#### 2. Scrapy Project
- Create a scrapy project
     - `!scrapy startproject crawler` : creates the crawler folder
- scrapy project structure
- Collect the gmarket best-seller product links, then the detail information behind each link

In [8]:
import requests
import scrapy
from scrapy.http import TextResponse

In [1]:
!scrapy startproject crawler

Error: scrapy.cfg already exists in C:\Code_dss15\220_영상강의_실습\06_scrapy\crawler


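#### scrapy project structure
- By default, the folder created by startproject contains another folder with the same name
    - startproject crawler
        - the files described below live in crawler/crawler
    - tutorial/ (layout from the Scrapy tutorial; in this project the name is crawler)
        - scrapy.cfg # deploy configuration file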
        - tutorial/ # project's Python module, you'll import your code from here
            - `__init__.py`
            - items.py # project items definition file
            - middlewares.py # project middlewares file
            - pipelines.py # project pipelines file
            - settings.py # project settings file
            - spiders/ # a directory where you'll later put your spiders
                - `__init__.py`
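- crawler/crawler/spiders (folder)
    - Code that defines which web service to crawl and how
- crawler/crawler/items.py
    - The model code: defines the data structure of the scraped data to be stored
- crawler/crawler/pipelines.py
    - Code that processes the scraped results once they have been built as items
- crawler/crawler/settings.py
    - Configuration values for the scraping environment
    - robots.txt: whether to obey it or not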
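
To make the robots.txt choice concrete: the `ROBOTSTXT_OBEY` setting in settings.py controls whether Scrapy checks requests against a site's robots.txt. The sketch below uses Python's standard `urllib.robotparser` rather than Scrapy itself, and the robots.txt content and example.com URLs are made-up assumptions, so it shows only what such a check means in principle.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it.
robots_lines = [
    "User-agent: *",
    "Disallow: /private",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# A path outside the Disallow rule may be fetched, a path inside may not.
allowed = parser.can_fetch("*", "http://example.com/public/page")
blocked = parser.can_fetch("*", "http://example.com/private/page")
print(allowed, blocked)
```

With `ROBOTSTXT_OBEY = False`, as used later in this project, no such check is applied.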


#### Collecting gmarket best-seller products
- Product name, detail page URL, original price, sale price, discount rate
- Check the xpath
- items.py
- spider.py
- Run the crawler

In [9]:
url = 'http://corners.gmarket.co.kr/Bestsellers'
req = requests.get(url)
resp = TextResponse(req.url, body=req.text, encoding='utf-8')

In [11]:
items = resp.xpath('//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li')
len(items)

200

In [14]:
links = resp.xpath('//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li/div/a/@href').extract()
len(links)

200

In [2]:
req = requests.get(links[1])
resp = TextResponse(req.url, body=req.text, encoding='utf-8')
title = resp.xpath('//*[@id="itemcase_basic"]/h1/text()')[0].extract()
s_price = resp.xpath('//*[@id="itemcase_basic"]/p/span/strong[@class="price_real"]/text()')[0].extract()
o_price = resp.xpath('//*[@id="itemcase_basic"]/p/span/span[@class="price_original"]/text()')[0].extract()
discount_rate = str(round((1 - int(s_price.replace(',', '')) / int(o_price.replace(',', ''))) * 100, 2)) + '%'
title, s_price, o_price, discount_rate

NameError: name 'links' is not defined
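
The discount-rate expression in the cell above is dense, so here is the same arithmetic split into small functions. The helper names `to_int` and `discount_rate` are introduced here only for illustration, and the sample prices are illustrative values.

```python
def to_int(price_text):
    """Convert a price string such as '19,900' to the integer 19900."""
    return int(price_text.replace(",", ""))


def discount_rate(s_price, o_price):
    """Return the discount as a percentage string, rounded to 2 decimals."""
    rate = (1 - to_int(s_price) / to_int(o_price)) * 100
    return str(round(rate, 2)) + "%"


example = discount_rate("19,900", "30,000")
print(example)  # 33.67%
```

When the sale price equals the original price the result is `0.0%`, which is the case the spider relies on when a page has no original price.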

In [28]:
!cat crawler/crawler/items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


In [3]:
%%writefile crawler/crawler/items.py
import scrapy


class CrawlerItem(scrapy.Item):
    title = scrapy.Field()
    s_price = scrapy.Field()
    o_price = scrapy.Field()
    discount_rate = scrapy.Field()
    link = scrapy.Field()

Overwriting crawler/crawler/items.py
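
The item class is essentially a declaration of which fields a scraped record carries. As a plain-Python analogy, the same five fields can be written as a dataclass. `ProductRecord` and the sample values are hypothetical, and this is not the class the spider below uses. Recent Scrapy versions document dataclass objects as a supported item type, but that has not been checked against the version used in this notebook.

```python
from dataclasses import dataclass, asdict


@dataclass
class ProductRecord:
    """Hypothetical dataclass with the same fields as CrawlerItem."""
    title: str
    s_price: str
    o_price: str
    discount_rate: str
    link: str


# Placeholder values, only to show the shape of one record.
record = ProductRecord(
    title="sample product",
    s_price="19,900",
    o_price="30,000",
    discount_rate="33.67%",
    link="http://example.com/item",
)
row = asdict(record)
print(list(row))
```

The keys of `row` are the fields that end up as columns when the items are exported.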


#### 3. Writing spider.py

In [22]:
%%writefile crawler/crawler/spiders/spider.py
import scrapy
from crawler.items import CrawlerItem

class Spider(scrapy.Spider):

    # How to connect: spider name, allowed domains, and start URLs.
    name = 'GmarketBestsellers'
    allowed_domains = ['gmarket.co.kr'] # only crawl URLs under this domain
    start_urls = ['http://corners.gmarket.co.kr/Bestsellers']

    # How to handle the response: collect the product links from the best-seller page.
    def parse(self, resp):
        links = resp.xpath('//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li/div/a/@href').extract()
        for link in links[:50]:
            yield scrapy.Request(link, callback=self.page_content)

    # Extract the detail fields from each product page.
    def page_content(self, resp):
        item = CrawlerItem()
        item['title'] = resp.xpath('//*[@id="itemcase_basic"]/h1/text()')[0].extract()
        item['s_price'] = resp.xpath('//*[@id="itemcase_basic"]/p/span/strong[@class="price_real"]/text()')[0].extract()
        try:
            item['o_price'] = resp.xpath('//*[@id="itemcase_basic"]/p/span/span[@class="price_original"]/text()')[0].extract()
        except IndexError:
            # No original price on the page: fall back to the sale price.
            item['o_price'] = item['s_price']
        item['discount_rate'] = str(round((1 - int(item['s_price'].replace(',', '')) 
                                           / int(item['o_price'].replace(',', ''))) * 100, 2)) + '%'
        item['link'] = resp.url
        yield item

Overwriting crawler/crawler/spiders/spider.py


#### 4. Running scrapy
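
The notebook does not show the command that produced the CSV file. A likely candidate, stated here as an assumption, is `scrapy crawl GmarketBestsellers -o items.csv` run from the crawler folder that contains scrapy.cfg. The sketch below builds that command in Python; `build_crawl_command` and `run_crawl` are hypothetical helper names.

```python
import subprocess


def build_crawl_command(spider_name, output_path):
    """Build the argument list for `scrapy crawl <name> -o <file>`."""
    return ["scrapy", "crawl", spider_name, "-o", output_path]


def run_crawl(spider_name, output_path, project_dir="crawler"):
    """Run the crawl from the folder assumed to contain scrapy.cfg."""
    command = build_crawl_command(spider_name, output_path)
    return subprocess.run(command, cwd=project_dir, check=True)


command = build_crawl_command("GmarketBestsellers", "items.csv")
print(" ".join(command))
# To actually start the crawl: run_crawl("GmarketBestsellers", "items.csv")
```

Note that `-o` appends to an existing output file, so rerunning the crawl without deleting items.csv can leave rows from earlier runs in the file.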

In [40]:
import pandas as pd
pd.read_csv('crawler/items.csv')

Unnamed: 0,discount_rate,link,o_price,s_price,title
0,33.67%,http://item.gmarket.co.kr/Item?goodscode=15033...,30000,19900,[오뚜기밥] 맛있는 오뚜기밥 210g 24개
1,40.0%,http://item.gmarket.co.kr/Item?goodscode=19345...,199330,119600,[빈폴스포츠] 롱패딩/경량다운 外 시즌맞춤 인기아우터 이너 모음
2,7.15%,http://item.gmarket.co.kr/Item?goodscode=15773...,95000,88210,[CASEY KIM] 이퓨쳐 Smart Phonics 1~5권(전5권) / 워크북 ...
3,31.96%,http://item.gmarket.co.kr/Item?goodscode=18240...,21900,14900,당일바리 생물 갑오징어 1kg(7~9미) 태안 산지직송
4,56.86%,http://item.gmarket.co.kr/Item?goodscode=19353...,29900,12900,[에잇세컨즈] 에잇세컨즈 니트/가디건/티셔츠/후드/스커트+20%
...,...,...,...,...,...
195,45.23%,http://item.gmarket.co.kr/Item?goodscode=16899...,19900,10900,우리네농산물 달콤한 왕만쥬빵 60gx30개입
196,0.0%,http://item.gmarket.co.kr/Item?goodscode=11714...,32900,32900,[크리넥스] 데코 앤 소프트 화장지 27M30롤X2팩/휴지 +증정
197,50.0%,http://item.gmarket.co.kr/Item?goodscode=17260...,31000,15500,[슈퍼대디] 슈퍼대디x미피 제로 물티슈 캡형 80매10팩(55g)
198,50.0%,http://item.gmarket.co.kr/Item?goodscode=19176...,27800,13900,[스파클] 스파클생수 2L 30병(무료배송) 쿠폰가11820


#### 5. Configuring pipelines
- Defines code that runs on each item before it is output

In [11]:
import requests
import json

def send_slack(msg):
    WEBHOOK_URL = 'https://hooks.slack.com/services/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'
    payload = {
        'channel': '#project',
        'username': 'GIGI',
        'text': msg,
    }
    requests.post(WEBHOOK_URL, json.dumps(payload))

In [20]:
send_slack('ㅗ')

In [25]:
%%writefile crawler/crawler/pipelines.py
# %load crawler/crawler/pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import requests
import json

class CrawlerPipeline:
    def __send_slack(self, msg):
        WEBHOOK_URL = 'https://hooks.slack.com/services/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'
        payload = {
            'channel': '#project',
            'username': 'GIGI',
            'text': msg,
        }
        requests.post(WEBHOOK_URL, json.dumps(payload))
        
    def process_item(self, item, spider):
        keyword = '마스크'
        if keyword in item['title']:            
            print('='*100)
            print(item['title'])
            print('='*100)
            self.__send_slack('{}, {}, {}'.format(item['title'], item['s_price'], item['link']))
            
        return item

Overwriting crawler/crawler/pipelines.py


- Enable the pipeline in settings.py

```
ITEM_PIPELINES = {
    'crawler.pipelines.CrawlerPipeline': 300,
}
```
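
The number 300 is the pipeline's priority: when several pipelines are registered, Scrapy passes each item through them from the lowest number to the highest. The sketch below imitates that ordering in plain Python without Scrapy. `StripTitlePipeline`, `KeywordFlagPipeline`, and `run_pipelines` are hypothetical names, and the sample title is made up.

```python
class StripTitlePipeline:
    """Hypothetical pipeline: trims whitespace around the title."""

    def process_item(self, item, spider):
        item["title"] = item["title"].strip()
        return item


class KeywordFlagPipeline:
    """Hypothetical pipeline: marks items whose title contains a keyword."""

    def process_item(self, item, spider):
        item["matched"] = "마스크" in item["title"]
        return item


# Same shape as ITEM_PIPELINES: pipeline -> priority number.
pipelines = {KeywordFlagPipeline: 300, StripTitlePipeline: 100}


def run_pipelines(item, pipelines, spider=None):
    # Lower numbers run first, mirroring how ITEM_PIPELINES is ordered.
    for cls, _priority in sorted(pipelines.items(), key=lambda pair: pair[1]):
        item = cls().process_item(item, spider)
    return item


result = run_pipelines({"title": "  KF94 마스크 50매  "}, pipelines)
print(result)
```

Each `process_item` returns the item so that the next pipeline in the chain receives it, which is why `CrawlerPipeline.process_item` above ends with `return item`.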

In [24]:
%%writefile crawler/crawler/settings.py
# %load crawler/crawler/settings.py
# Scrapy settings for crawler project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html


BOT_NAME = 'crawler'

SPIDER_MODULES = ['crawler.spiders']
NEWSPIDER_MODULE = 'crawler.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'crawler (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'crawler.middlewares.CrawlerSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'crawler.middlewares.CrawlerDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'crawler.pipelines.CrawlerPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

ITEM_PIPELINES = {
    'crawler.pipelines.CrawlerPipeline': 300,
}

Overwriting crawler/crawler/settings.py
