### Scrapy
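- A web data collection framework written in Python
    - Framework vs. library or package
    - A framework ships with pre-built code for a specific purpose, so you write your own code in a fill-in-the-blanks fashion
    - A package is code written by someone else that you import and call from your own code
- scrapy
    - pip install scrapy
- tree
    - sudo apt install tree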

#### Index
- xpath: a syntax for selecting elements, playing the same role as a CSS selector
- The structure of Scrapy
- Crawling Gmarket best-seller product data

In [1]:
import scrapy
import requests
from scrapy.http import TextResponse  # used to practice xpath

#### 1. How to use xpath
- Real-time search keyword data from Naver and Daum
- xpath of a Naver search keyword

```
//*[@id="PM_ID_ct"]/div[1]/div[2]/div[2]/div[1]/div/ul/li[20]/a/span[2]
```

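- `//` : searches all descendants at any depth, starting from the document root (like the descendant combinator in "div .txt")
- `*` : matches an element with any tag name
- `[@id="PM_ID_ct"]` : condition: the element whose id is PM_ID_ct
- `/` : looks only at direct child elements (like "div > .txt")
- `div[1]` : selects the first div element (xpath indexes start at 1)
- `.` : selects the current element
- `not` : not(condition), negates a condition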
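
The syntax rules above can be tried out without any network access. The following is a minimal sketch that uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (no `not()` or `text()`), rather than Scrapy's selectors. The markup, the `ranking` id, and the class names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modeled on a keyword ranking list.
HTML = """
<html>
  <body>
    <div id="ranking">
      <ul>
        <li><span class="rank">1</span><span class="keyword">alpha</span></li>
        <li><span class="rank">2</span><span class="keyword">beta</span></li>
      </ul>
    </div>
    <div id="footer"><span class="keyword">ignored</span></div>
  </body>
</html>
"""

root = ET.fromstring(HTML)

# `.//*[@id='ranking']`: any element, at any depth, whose id is "ranking".
ranking = root.find(".//*[@id='ranking']")

# `ul/li/span[2]`: direct children only; span[2] is the second span (1-based).
keywords = [span.text for span in ranking.findall("ul/li/span[2]")]

# `li[1]`: the first li element under ul.
first_rank = ranking.find("ul/li[1]/span[1]").text

print(keywords)
print(first_rank)
```

Leaving out the index on `li` selects the matching span from every list item, which is the same trick the notebook uses to fetch all keywords at once.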

In [2]:
# connect to the web page
req = requests.get("https://www.naver.com/")

# create a response object
response = TextResponse(req.url, body=req.text, encoding="utf-8")

In [3]:
# fetch the Naver keyword ranking data
# xpath : the xpath selector that was used
# data : the element selected by the xpath selector
response.xpath('//*[@id="PM_ID_ct"]\
/div[1]/div[2]/div[2]/div[1]/div/ul/li[20]/a/span')

[<Selector xpath='//*[@id="PM_ID_ct"]/div[1]/div[2]/div[2]/div[1]/div/ul/li[20]/a/span' data='<span class="ah_r">20</span>'>,
 <Selector xpath='//*[@id="PM_ID_ct"]/div[1]/div[2]/div[2]/div[1]/div/ul/li[20]/a/span' data='<span class="ah_k">강지환</span>'>]

In [4]:
# append /text() so that data holds the element's text
response.xpath('//*[@id="PM_ID_ct"]\
/div[1]/div[2]/div[2]/div[1]/div/ul/li[20]/a/span/text()')

[<Selector xpath='//*[@id="PM_ID_ct"]/div[1]/div[2]/div[2]/div[1]/div/ul/li[20]/a/span/text()' data='20'>,
 <Selector xpath='//*[@id="PM_ID_ct"]/div[1]/div[2]/div[2]/div[1]/div/ul/li[20]/a/span/text()' data='강지환'>]

In [5]:
# extract() returns only the data values, as a list of strings
response.xpath('//*[@id="PM_ID_ct"]\
/div[1]/div[2]/div[2]/div[1]/div/ul/li[20]/a/span/text()').extract()

['20', '강지환']

In [6]:
response.xpath('//*[@id="PM_ID_ct"]\
/div[1]/div[2]/div[2]/div[1]/div/ul/li/a/span[2]/text()').extract()[:3]

['에이스톱 완판세트', '용이매니저 다이어트', '테라픽 헤어토닉']

#### 2. Scrapy Project
- Creating a scrapy project
- The structure of scrapy
- Collecting Gmarket best-seller product links, then the details behind each link

In [7]:
# create the project

In [8]:
!rm -rf crawler

In [9]:
!scrapy startproject crawler

New Scrapy project 'crawler', using template directory '/home/ubuntu/.pyenv/versions/3.6.9/envs/python3/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /home/ubuntu/python3/notebook/scrapy/crawler

You can start your first spider with:
    cd crawler
    scrapy genspider example example.com


In [10]:
!tree crawler

crawler
├── crawler
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

4 directories, 7 files


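#### The structure of scrapy
- spiders
    - Code that defines which web service to crawl and how to crawl it (written as .py files)
- items.py
    - Code that corresponds to the model; defines the data structure of the stored data
- pipelines.py
    - Code that defines how the scraped results, structured as items, are processed
- settings.py
    - Configuration values used while scraping
    - robots.txt: whether to obey it or not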
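
As a rough illustration of what settings.py holds, the sketch below lists a few entries. `ROBOTSTXT_OBEY`, `DOWNLOAD_DELAY`, and `CONCURRENT_REQUESTS` are existing Scrapy setting names, but the last two values are assumptions chosen for the example and are not taken from this project.

```python
# Sketch of entries in crawler/crawler/settings.py (values are illustrative).

# Obey robots.txt; the crawl logs in this notebook show this set to True.
ROBOTSTXT_OBEY = True

# Wait this many seconds between requests to the same site (assumed value).
DOWNLOAD_DELAY = 1

# Limit how many requests run in parallel (assumed value).
CONCURRENT_REQUESTS = 8
```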

#### Collecting Gmarket best-seller products
- Product name, detail page URL, original price, sale price, discount rate
- Check the xpath
- items.py
- spider.py
- Run the crawler

#### 1. Check the xpath

In [11]:
req = requests.get("http://corners.gmarket.co.kr/Bestsellers")
response = TextResponse(req.url, body=req.text, encoding="utf-8")

In [12]:
links = response.xpath(
    '//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li/div[1]/a/@href').extract()
len(links)

200

In [13]:
links[0]

'http://item.gmarket.co.kr/Item?goodscode=1561295696&ver=637100331507061177'

In [14]:
req = requests.get(links[1])
response = TextResponse(req.url, body=req.text, encoding="utf-8")
title = response.xpath('//*[@id="itemcase_basic"]/h1/text()')[0].extract()
s_price = response.xpath(
    '//*[@id="itemcase_basic"]/p/span/strong/text()')[0]\
.extract().replace(",", "")
o_price = response.xpath(
    '//*[@id="itemcase_basic"]/p/span/span/text()')[0]\
.extract().replace(",", "")
discount_rate = str(round((1 - int(s_price) / int(o_price))*100, 2)) + "%"
title, s_price, o_price, discount_rate

('프롬유 전상품균일가/기모롱원피스/후드원피스 70종 ', '8900', '29600', '69.93%')
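
The discount rate arithmetic in the cell above can be isolated in a small helper, which makes it easy to check on its own. This is a sketch only: the function name `discount_rate` and the guard against a non-positive original price are additions for illustration, and the prices are the ones that appear in this notebook's output.

```python
def discount_rate(sale_price: str, original_price: str) -> str:
    """Return the discount as a percentage string such as '69.93%'."""
    # Prices arrive as text and may contain thousands separators.
    sale = int(sale_price.replace(",", ""))
    original = int(original_price.replace(",", ""))
    if original <= 0:
        raise ValueError("original price must be positive")
    return str(round((1 - sale / original) * 100, 2)) + "%"


print(discount_rate("8,900", "29,600"))   # 69.93%
print(discount_rate("31,700", "36,900"))  # 14.09%
```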

#### 2. Write items.py

In [15]:
!cat crawler/crawler/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


In [16]:
%%writefile crawler/crawler/items.py
import scrapy

class CrawlerItem(scrapy.Item):
    title = scrapy.Field()
    s_price = scrapy.Field()
    o_price = scrapy.Field()
    discount_rate = scrapy.Field()
    link = scrapy.Field()

Overwriting crawler/crawler/items.py


#### 3. Write spider.py

In [18]:
%%writefile crawler/crawler/spiders/spider.py
import scrapy
from crawler.items import CrawlerItem

class Spider(scrapy.Spider):
    name = "GmarketBestsellers"
    # Scrapy reads `allowed_domains`; requests to other domains are filtered out
    allowed_domains = ["gmarket.co.kr"]
    start_urls = ["http://corners.gmarket.co.kr/Bestsellers"]

    def parse(self, response):
        # collect the detail page links from the best-seller list
        links = response.xpath('//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li/div[1]/a/@href').extract()
        # request only the first 10 detail pages
        for link in links[:10]:
            yield scrapy.Request(link, callback=self.page_content)

    def page_content(self, response):
        item = CrawlerItem()
        item["title"] = response.xpath('//*[@id="itemcase_basic"]/h1/text()')[0].extract()
        item["s_price"] = response.xpath('//*[@id="itemcase_basic"]/p/span/strong/text()')[0].extract().replace(",", "")
        try:
            item["o_price"] = response.xpath('//*[@id="itemcase_basic"]/p/span/span/text()')[0].extract().replace(",", "")
        except IndexError:
            # no original price on the page: use the sale price as the original price
            item["o_price"] = item["s_price"]
        item["discount_rate"] = str(round((1 - int(item["s_price"]) / int(item["o_price"]))*100, 2)) + "%"
        item["link"] = response.url
        yield item

Overwriting crawler/crawler/spiders/spider.py


#### 4. Run Scrapy

In [19]:
%%writefile run.sh
cd crawler
scrapy crawl GmarketBestsellers

Overwriting run.sh


In [20]:
!chmod +x run.sh

In [21]:
!./run.sh

2019-11-22 06:26:37 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: crawler)
2019-11-22 06:26:37 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.9 (default, Oct 24 2019, 05:23:48) - [GCC 7.4.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-1054-aws-x86_64-with-debian-buster-sid
2019-11-22 06:26:37 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'crawler', 'NEWSPIDER_MODULE': 'crawler.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['crawler.spiders']}
2019-11-22 06:26:37 [scrapy.extensions.telnet] INFO: Telnet Password: 30c77c23812956cb
2019-11-22 06:26:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2019-11-22 06:26:37 [scrapy.middleware] INFO: Enabled downloader middlewar

2019-11-22 06:26:39 [scrapy.core.scraper] DEBUG: Scraped from <200 http://item.gmarket.co.kr/Item?goodscode=1538061341&ver=637100331980374279>
{'discount_rate': '14.09%',
 'link': 'http://item.gmarket.co.kr/Item?goodscode=1538061341&ver=637100331980374279',
 'o_price': '36900',
 's_price': '31700',
 'title': '[큐씨와이] 11/25(월) 출고/QCY T5 블루투스 이어폰/파우치 증정 '}
2019-11-22 06:26:39 [scrapy.core.engine] INFO: Closing spider (finished)
2019-11-22 06:26:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4253,
 'downloader/request_count': 13,
 'downloader/request_method_count/GET': 13,
 'downloader/response_bytes': 337447,
 'downloader/response_count': 13,
 'downloader/response_status_count/200': 13,
 'elapsed_time_seconds': 1.785422,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 11, 22, 6, 26, 39, 709608),
 'item_scraped_count': 10,
 'log_count/DEBUG': 26,
 'log_count/INFO': 10,
 'memusage/max': 53202944,
 'memusage/sta

- Save the results to csv

In [23]:
%%writefile run.sh
cd crawler
scrapy crawl GmarketBestsellers -o GmarketBestsellers.csv

Overwriting run.sh


In [24]:
!./run.sh

2019-11-22 06:39:26 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: crawler)
2019-11-22 06:39:26 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.9 (default, Oct 24 2019, 05:23:48) - [GCC 7.4.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-1054-aws-x86_64-with-debian-buster-sid
2019-11-22 06:39:26 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'crawler', 'FEED_FORMAT': 'csv', 'FEED_URI': 'GmarketBestsellers.csv', 'NEWSPIDER_MODULE': 'crawler.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['crawler.spiders']}
2019-11-22 06:39:26 [scrapy.extensions.telnet] INFO: Telnet Password: ca21e31a0a386e9d
2019-11-22 06:39:26 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy

2019-11-22 06:39:28 [scrapy.core.scraper] DEBUG: Scraped from <200 http://item.gmarket.co.kr/Item?goodscode=1221737792&ver=637100339668604194>
{'discount_rate': '69.93%',
 'link': 'http://item.gmarket.co.kr/Item?goodscode=1221737792&ver=637100339668604194',
 'o_price': '29600',
 's_price': '8900',
 'title': '프롬유 전상품균일가/기모롱원피스/후드원피스 70종 '}
2019-11-22 06:39:28 [scrapy.core.engine] INFO: Closing spider (finished)
2019-11-22 06:39:28 [scrapy.extensions.feedexport] INFO: Stored csv feed (10 items) in: GmarketBestsellers.csv
2019-11-22 06:39:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4253,
 'downloader/request_count': 13,
 'downloader/request_method_count/GET': 13,
 'downloader/response_bytes': 337440,
 'downloader/response_count': 13,
 'downloader/response_status_count/200': 13,
 'elapsed_time_seconds': 1.904346,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 11, 22, 6, 39, 28, 654251),
 'item_scraped_count': 10,
 'log_count/D

In [25]:
!ls crawler/

GmarketBestsellers.csv	crawler  scrapy.cfg


In [26]:
import pandas as pd

In [32]:
files = !ls crawler/
files

['GmarketBestsellers.csv', 'crawler', 'scrapy.cfg']

In [35]:
"crawler/{}".format(files[0])

'crawler/GmarketBestsellers.csv'

In [34]:
df = pd.read_csv("crawler/{}".format(files[0]))
df.tail(2)

| | discount_rate | link | o_price | s_price | title |
|---|---|---|---|---|---|
| 8 | 14.09% | http://item.gmarket.co.kr/Item?goodscode=15380... | 36900 | 31700 | [큐씨와이] 11/25(월) 출고/QCY T5 블루투스 이어폰/파우치 증정 |
| 9 | 69.93% | http://item.gmarket.co.kr/Item?goodscode=12217... | 29600 | 8900 | 프롬유 전상품균일가/기모롱원피스/후드원피스 70종 |


#### 5. Configure pipelines
- Define code that runs on each item before it is output

In [36]:
import requests
import json

def send_slack(msg):
    WEBHOOK_URL = "https://hooks.slack.com/services/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"
    payload = {
        "channel": "#rada",
        "username": "PDJ",
        "text": msg,
    }
    requests.post(WEBHOOK_URL, json.dumps(payload))

In [37]:
send_slack("test")
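
The `send_slack` function above keeps the webhook URL in the source code. One possible alternative, sketched here, reads it from an environment variable so the URL stays out of the notebook. The variable name `SLACK_WEBHOOK_URL`, the helper name `build_slack_request`, and the placeholder URL are all assumptions for this example.

```python
import json
import os


def build_slack_request(msg, channel="#rada", username="PDJ"):
    """Return (url, body) for a Slack incoming-webhook POST."""
    # SLACK_WEBHOOK_URL is an assumed variable name, not defined by the notebook.
    url = os.environ.get("SLACK_WEBHOOK_URL")
    if not url:
        raise RuntimeError("SLACK_WEBHOOK_URL is not set")
    body = json.dumps({"channel": channel, "username": username, "text": msg})
    return url, body


# Placeholder URL for demonstration only; a real one would be set outside the code.
os.environ.setdefault("SLACK_WEBHOOK_URL", "https://hooks.slack.com/services/XXX/YYY/ZZZ")
url, body = build_slack_request("test")
print(json.loads(body)["text"])
```

The returned pair can then be passed to `requests.post(url, body)` in the same way `send_slack` does.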

In [38]:
!cat crawler/crawler/pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class CrawlerPipeline(object):
    def process_item(self, item, spider):
        return item


In [39]:
%%writefile crawler/crawler/pipelines.py
import requests
import json

class CrawlerPipeline(object):
    
    def __send_slack(self, msg):
        WEBHOOK_URL = "https://hooks.slack.com/services/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"
        payload = {
            "channel": "#rada",
            "username": "PDJ",
            "text": msg,
        }
        requests.post(WEBHOOK_URL, json.dumps(payload))
        
    def process_item(self, item, spider):
        keyword = "세트"  # "세트" is Korean for "set" (a product bundle)
        print("="*100)
        print(item["title"], keyword)
        print("="*100)
        if keyword in item["title"]:
            self.__send_slack("{},{},{}".format(
                item["title"], item["s_price"], item["link"]))
        return item

Overwriting crawler/crawler/pipelines.py
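
A pipeline is an ordinary class with a `process_item(item, spider)` method that returns the item, so its logic can be exercised with plain dictionaries. The sketch below is a hypothetical variant, not the project's pipeline: the class name `KeywordAlertPipeline` and the injected `notify` callable are assumptions made to keep the example free of network calls. Scrapy itself creates pipelines without constructor arguments, so a real version would need defaults or a `from_crawler` class method.

```python
class KeywordAlertPipeline:
    """Pass every item through; notify when the title contains a keyword."""

    def __init__(self, keyword, notify):
        self.keyword = keyword
        self.notify = notify  # any callable that takes one message string

    def process_item(self, item, spider):
        if self.keyword in item["title"]:
            self.notify("{},{},{}".format(item["title"], item["s_price"], item["link"]))
        # Returning the item hands it on to the next pipeline or the exporter.
        return item


sent = []
pipeline = KeywordAlertPipeline(keyword="set", notify=sent.append)
pipeline.process_item(
    {"title": "pan set", "s_price": "13900", "link": "http://example.com/1"}, spider=None)
pipeline.process_item(
    {"title": "earphones", "s_price": "31700", "link": "http://example.com/2"}, spider=None)
print(sent)
```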


- Register the pipeline in settings.py
```
ITEM_PIPELINES = {
    'crawler.pipelines.CrawlerPipeline': 300,
}
```
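
The number next to each pipeline sets its order: Scrapy runs pipelines from the lowest value to the highest, and values are customarily kept in the 0 to 1000 range. The sketch below only simulates that ordering with a sort. `CleanupPipeline` is a hypothetical second pipeline added for illustration; this project defines only `CrawlerPipeline`.

```python
# Hypothetical second pipeline added to show ordering.
ITEM_PIPELINES = {
    "crawler.pipelines.CrawlerPipeline": 300,
    "crawler.pipelines.CleanupPipeline": 100,
}

# Lower numbers run first, so sort the class paths by their value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```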

In [40]:
!echo "ITEM_PIPELINES = {" >> crawler/crawler/settings.py
!echo "    'crawler.pipelines.CrawlerPipeline': 300,"  >> crawler/crawler/settings.py
!echo "}"  >> crawler/crawler/settings.py

In [41]:
!tail -n 3 crawler/crawler/settings.py

ITEM_PIPELINES = {
    'crawler.pipelines.CrawlerPipeline': 300,
}


In [42]:
!./run.sh

2019-11-22 07:04:52 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: crawler)
2019-11-22 07:04:52 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.6.9 (default, Oct 24 2019, 05:23:48) - [GCC 7.4.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-1054-aws-x86_64-with-debian-buster-sid
2019-11-22 07:04:52 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'crawler', 'FEED_FORMAT': 'csv', 'FEED_URI': 'GmarketBestsellers.csv', 'NEWSPIDER_MODULE': 'crawler.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['crawler.spiders']}
2019-11-22 07:04:52 [scrapy.extensions.telnet] INFO: Telnet Password: f5785e6f42be8610
2019-11-22 07:04:52 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy

2019-11-22 07:04:54 [urllib3.connectionpool] DEBUG: https://hooks.slack.com:443 "POST /services/TNKEL1KJR/BQHDMJ9TM/NkAc2UDpQemyH2oCkSbYie10 HTTP/1.1" 200 22
2019-11-22 07:04:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://item.gmarket.co.kr/Item?goodscode=610206086&ver=637100354925718753>
{'discount_rate': '69.78%',
 'link': 'http://item.gmarket.co.kr/Item?goodscode=610206086&ver=637100354925718753',
 'o_price': '46000',
 's_price': '13900',
 'title': '[키친아트] 찜통기1단 세트 '}
2019-11-22 07:04:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://item.gmarket.co.kr/Item?goodscode=1538061341&ver=637100354925718753> (referer: http://corners.gmarket.co.kr/Bestsellers)
2019-11-22 07:04:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://item.gmarket.co.kr/Item?goodscode=1221737792&ver=637100354925718753> (referer: http://corners.gmarket.co.kr/Bestsellers)
[큐씨와이] 11/25(월) 출고/QCY T5 블루투스 이어폰/파우치 증정  세트
2019-11-22 07:04:54 [scrapy.core.scraper] DEBUG: Scraped from <200 http://item