# Web scraping with Scrapy

Scrapy is a framework for extracting the data you want from a website.

## Installation

$ pip install scrapy

## Creating a new project

$ scrapy startproject crowdfunding

$ cd crowdfunding

## Creating a default spider


$ scrapy genspider my_spider 'https://fundrazr.com/find?category=Travel'

## The NEXT button

Clicking NEXT on the listing page loads a URL ending in:

find?category=Travel&page=2

In [None]:
npages = 46

# Append the following listing pages (2 through npages + 1) to the spider's start_urls.
for i in range(2, npages + 2):
    start_urls.append("https://fundrazr.com/find?category=Travel&page=" + str(i))
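The loop above can be wrapped in a small function to check which URLs it produces. This is a minimal sketch, assuming page 1 is the base URL itself and later pages use the `page` query parameter. `build_start_urls` is a hypothetical helper name, not part of Scrapy.

```python
BASE_URL = "https://fundrazr.com/find?category=Travel"


def build_start_urls(npages):
    """Return the listing URLs, mirroring the loop above.

    Hypothetical helper for illustration only.
    """
    urls = [BASE_URL]
    # Same bounds as the original loop: pages 2 through npages + 1.
    for page in range(2, npages + 2):
        urls.append(f"{BASE_URL}&page={page}")
    return urls


urls = build_start_urls(46)
print(len(urls))
print(urls[1])
print(urls[-1])
```

With these bounds the last URL requested is page `npages + 1`, so it is worth checking that value against the number of pages the site actually shows.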

## Getting the links to each campaign's home page

$ scrapy shell 'https://fundrazr.com/find?category=Travel'

In [None]:
response.xpath("//h2[contains(@class, 'title headline-font')]/a[contains(@class, 'campaign-link')]//@href").extract()

In [None]:
def parse(self, response):
    for href in response.xpath("//h2[contains(@class, 'title headline-font')]/a[contains(@class, 'campaign-link')]//@href"):
        url = "https:" + href.extract()
        yield scrapy.Request(url, callback=self.parse_dir_contents)
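The `"https:" +` prefix in `parse` suggests that the extracted `href` values are protocol-relative, that is, they start with `//`. As a sketch of an alternative, the standard library's `urljoin` resolves such links against the page URL. The `href` values below are made up for illustration.

```python
from urllib.parse import urljoin

page_url = "https://fundrazr.com/find?category=Travel"

# Hypothetical href values, as they might come out of the XPath query.
hrefs = ["//fundrazr.com/example-campaign", "/another-campaign"]

# urljoin fills in whatever the href is missing (scheme, host) from page_url.
resolved = [urljoin(page_url, href) for href in hrefs]

for url in resolved:
    print(url)
```

Inside a spider, `response.urljoin(href)` does the same job using the response URL as the base, which also covers relative links without a host.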

In [None]:
def parse_dir_contents(self, response):
    pass

With this scheme of requests and callbacks, where each callback can generate new requests (with their own callbacks), you can program the navigation through a site: generate requests for the links to follow until you reach the pages that contain the items you are interested in. For example, a spider that extracts products from an online store by browsing its category pages could use a structure like the following:

In [None]:
import scrapy

class SkeletonSpider(scrapy.Spider):
    name = 'spider-mummy'
    start_urls = ['http://www.someonlinewebstore.com']

    def parse(self, response):
        # Home page: follow each category link.
        for c in [...]:
            url_category = ...
            yield scrapy.Request(url_category, self.parse_category_page)

    def parse_category_page(self, response):
        # Category page: follow each product link.
        for p in [...]:
            url_product = ...
            yield scrapy.Request(url_product, self.parse_product)

    def parse_product(self, response):
        # Product page: extract the fields and yield the item here.
        ...
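To make the request/callback chain concrete, here is a toy model in plain Python with no Scrapy involved. It is a simplified, synchronous sketch: the `SITE` dictionary, the URLs, and the `crawl` function are all hypothetical. Scrapy itself schedules requests asynchronously and does not guarantee this order.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    "store/": ["store/books", "store/games"],
    "store/books": ["store/books/1", "store/books/2"],
    "store/games": ["store/games/1"],
}


def parse(url):
    # Home page: one (url, callback) request per category link.
    for link in SITE[url]:
        yield (link, parse_category_page)


def parse_category_page(url):
    # Category page: one (url, callback) request per product link.
    for link in SITE[url]:
        yield (link, parse_product)


def parse_product(url):
    # Product page: yield the scraped item instead of a new request.
    yield {"product": url}


def crawl(start_url):
    """Run callbacks until no requests remain and collect the yielded items."""
    queue = deque([(start_url, parse)])
    items = []
    while queue:
        url, callback = queue.popleft()
        for result in callback(url):
            if isinstance(result, dict):
                items.append(result)   # a scraped item
            else:
                queue.append(result)   # a new request to process later
    return items


items = crawl("store/")
print(items)
```

Each callback only decides what to do with one kind of page. The loop in `crawl` plays the role of the engine that keeps feeding responses to whichever callback the request named.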

## Getting the information inside each campaign

### Title

In [None]:
item['campaignTitle'] = response.xpath("//div[contains(@id, 'campaign-title')]/descendant::text()").extract()[0].strip()

### Raised

In [None]:
item['amountRaised'] = response.xpath("//span[contains(@class, 'stat')]/span[contains(@class, 'amount-raised')]/descendant::text()").extract()

### Goal

In [None]:
item['goal'] = " ".join(response.xpath("//div[contains(@class, 'stats-primary with-goal')]//span[contains(@class, 'stats-label hidden-phone')]/text()").extract()).strip()

### Currency type

In [None]:
item['currencyType'] = response.xpath("//div[contains(@class, 'stats-primary with-goal')]/@title").extract()

### Contributors

In [None]:
item['numberContributors'] = response.xpath("//div[contains(@class, 'stats-secondary with-goal')]//span[contains(@class, 'donation-count stat')]/text()").extract()

### Story

In [None]:
story_list = response.xpath("//div[contains(@id, 'full-story')]/descendant::text()").extract()
story_list = [x.strip() for x in story_list if len(x.strip()) > 0]
item['story']  = " ".join(story_list)
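The two clean-up lines above can be tried on a plain list. The text nodes below are invented to resemble what `descendant::text()` might return for a story with line breaks and indentation.

```python
# Hypothetical text nodes, including whitespace-only ones between tags.
raw_nodes = ["\n   ", "We are five friends", "\n", "  driving to Mongolia. ", "\t"]

# Strip each node and drop the ones that contained only whitespace.
story_list = [x.strip() for x in raw_nodes if len(x.strip()) > 0]

# Join the remaining fragments into a single string.
story = " ".join(story_list)
print(story)
```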

### URL

In [None]:
item['url'] = response.xpath("//meta[@property='og:url']/@content").extract()

## Items

In [None]:
class CrowdfundingItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    campaignTitle = scrapy.Field()
    amountRaised = scrapy.Field()
    goal = scrapy.Field()
    currencyType = scrapy.Field()
    numberContributors = scrapy.Field()
    story = scrapy.Field()
    url = scrapy.Field()

### My Spider

In [None]:
from crowdfunding.items import CrowdfundingItem
import re

In [None]:
def parse_dir_contents(self, response):
    item = CrowdfundingItem()

    item['campaignTitle'] = response.xpath("//div[contains(@id, 'campaign-title')]/descendant::text()").extract()[0].strip()

    item['amountRaised'] = response.xpath("//span[contains(@class, 'stat')]/span[contains(@class, 'amount-raised')]/descendant::text()").extract()

    item['goal'] = " ".join(response.xpath("//div[contains(@class, 'stats-primary with-goal')]//span[contains(@class, 'stats-label hidden-phone')]/text()").extract()).strip()

    item['currencyType'] = response.xpath("//div[contains(@class, 'stats-primary with-goal')]/@title").extract()

    item['numberContributors'] = response.xpath("//div[contains(@class, 'stats-secondary with-goal')]//span[contains(@class, 'donation-count stat')]/text()").extract()

    story_list = response.xpath("//div[contains(@id, 'full-story')]/descendant::text()").extract()
    story_list = [x.strip() for x in story_list if len(x.strip()) > 0]
    item['story']  = " ".join(story_list)

    item['url'] = response.xpath("//meta[@property='og:url']/@content").extract()

    yield item
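Note that `.extract()` returns a list, so fields such as `amountRaised`, `currencyType` and `url` are stored as lists, and `extract()[0]` raises `IndexError` when nothing matches. One way to handle both cases is sketched below with plain lists. `first_or_default` is a hypothetical helper, not a Scrapy function.

```python
def first_or_default(values, default=None):
    """Return the first non-empty, stripped string in `values`, else `default`."""
    for value in values:
        text = value.strip()
        if text:
            return text
    return default


print(first_or_default(["  570 "]))
print(first_or_default([], default="n/a"))
```

Scrapy selectors also provide `.get()`, which returns the first match or `None`, so a similar effect may be achievable without a helper.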

## Running

$ scrapy crawl my_spider -o information.csv

$ scrapy crawl my_spider -o information.json

$ scrapy crawl my_spider -o information.xml

## Pandas

In [2]:
%matplotlib inline
import pandas as pd  # For creating DataFrames
import matplotlib.pyplot as plt  # For plotting
import random

In [6]:
# Read the dataset
df = pd.read_csv('information.csv')

In [7]:
# Show the first 5 rows
df.head()

|   | story | amountRaised | goal | url | campaignTitle | numberContributors | currencyType |
|---|---|---|---|---|---|---|---|
| 0 | We are Alice, Brianna, Megan, Paula, and Tabit... | 570 | of $10k goal | https://fundrazr.com/team-we-live-on-the-road | Team #WeLive Mongol Rally 2016 - On The Road | 15.0 | U.S. Dollar |
| 1 | The Victoria/Vancouver area is looking to brin... | 315 | of $500 goal | https://fundrazr.com/e18mV4 | Bring Kage to Victoria/Vancouver! | 4.0 | Canadian Dollar |
| 2 | Hello My name is Slobodan, from Belgrade,Serbi... | 170 | of €1.5k goal | https://fundrazr.com/d14k7a | Help me visit my grandfathers grave in UK | 3.0 | Euro |
| 3 | I am Wallace Peterson,i m 37 years old with a ... | 1150 | of $1.5k goal | https://fundrazr.com/711NYd | Please help us get Wally rolling! | 18.0 | U.S. Dollar |
| 4 | The Mongol Charity Rally July 19th 2015 TEAM: ... | 283 | of £1k goal | https://fundrazr.com/8s6ca | THE MONGOL CHARITY RALLY 2015 FUNDRAISER | 10.0 | Pound Sterling |


In [8]:
# Number of rows and columns
df.shape

(534, 7)
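In the `df.head()` output above, the `goal` column holds text such as `of $10k goal` rather than numbers. The sketch below shows one possible way to convert such strings before any numeric analysis. `parse_goal` is a hypothetical helper, and the `m` suffix is an assumption, since only `k` appears in the sample rows.

```python
import re

# Suffix multipliers: "k" is seen in the sample rows; "m" is an assumption.
MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}

# A number with an optional decimal part, optionally followed by k or m.
GOAL_PATTERN = re.compile(r"(\d+(?:\.\d+)?)([km]?)", re.IGNORECASE)


def parse_goal(text):
    """Convert a string such as 'of $10k goal' to a float, or None if no number is found."""
    if not isinstance(text, str):
        return None  # e.g. missing values read from the CSV
    match = GOAL_PATTERN.search(text.replace(",", ""))
    if match is None:
        return None
    number, suffix = match.groups()
    return float(number) * MULTIPLIERS[suffix.lower()]


for sample in ["of $10k goal", "of $500 goal", "of €1.5k goal"]:
    print(sample, "->", parse_goal(sample))
```

With pandas this could be applied as `df['goal'].apply(parse_goal)`. The currency is ignored here, so amounts are only comparable within the same `currencyType`.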