## S16 T02: Web scraping task - Level 3 - Eduardo Baffi
#### Description
Learn to perform web scraping.

### Level 3
#### - Exercise 3
Choose any web page you like and scrape it using the Scrapy library.

In [1]:
# Settings for notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Show Python version
import platform
platform.python_version()

In [2]:
# Import Scrapy
import scrapy
from scrapy.crawler import CrawlerProcess

In [3]:
# Set up a pipeline that writes each scraped item to a JSON Lines file (one JSON object per line).

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('quoteresult.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
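
The pipeline above writes one JSON object per line, which is the JSON Lines layout suggested by the `.jl` extension. A minimal sketch of reading such a file back with the standard library could look like this; the helper name `read_json_lines`, the temporary file, and the sample records are illustrative assumptions, not part of the notebook.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Return a list of dicts, one per non-empty line of a JSON Lines file."""
    items = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items


# Made-up records written the same way the pipeline writes them (hypothetical data).
sample = [
    {"text": "quote one", "author": "Author A"},
    {"text": "quote two", "author": "Author B"},
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jl")
with open(demo_path, "w", encoding="utf-8") as fh:
    for item in sample:
        fh.write(json.dumps(item) + "\n")

loaded = read_json_lines(demo_path)
print(loaded)
```

Reading line by line means a partially written file can still be parsed up to the last complete line, which is one reason the format suits crawler output.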

The objective is to retrieve quotes from the Goodreads website (goodreads.com).
The link is: https://www.goodreads.com/quotes/tag/inspirational?page=0

Title: "Inspirational Quotes"

In [4]:
# Define the spider
# The SpiderQuotes class defines which URLs to start crawling from and which values to retrieve
import logging

class SpiderQuotes(scrapy.Spider):
    name = "quotes"
    start_urls = [                               # URLs from which the crawl starts
        'https://www.goodreads.com/quotes/tag/inspirational?page=1',
        'https://www.goodreads.com/quotes/tag/inspirational?page=2',
        'https://www.goodreads.com/quotes/tag/inspirational?page=3',
        'https://www.goodreads.com/quotes/tag/inspirational?page=4',
        'https://www.goodreads.com/quotes/tag/inspirational?page=5',
        'https://www.goodreads.com/quotes/tag/inspirational?page=6',
        'https://www.goodreads.com/quotes/tag/inspirational?page=7',
        'https://www.goodreads.com/quotes/tag/inspirational?page=8',
        'https://www.goodreads.com/quotes/tag/inspirational?page=9',
        'https://www.goodreads.com/quotes/tag/inspirational?page=10',
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,              # Log only warnings and above to avoid too many DEBUG messages
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},  # Also write each item to quoteresult.jl
        'FEED_FORMAT': 'json',                     # Deprecated since Scrapy 2.1 in favor of the FEEDS setting
        'FEED_URI': 'quoteresult.json'             # File that receives the exported items
    }
    
    def parse(self, response):                     # Defines which values to retrieve
        for quote in response.css('div.quoteDetails'):
            yield {
                'text': quote.css('div.quoteText::text').extract_first(),
                'author': quote.css('span.authorOrTitle::text').extract_first(),
            }
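
Listing page URLs by hand is easy to get wrong: a missing comma between two adjacent string literals makes Python silently join them into a single, invalid URL. As a sketch, the same list could be generated instead; `BASE_URL` and `build_start_urls` are hypothetical names introduced here, and the page range 1 to 10 is taken from the spider above.

```python
# Hypothetical helper: build the paginated URLs instead of typing them out.
BASE_URL = "https://www.goodreads.com/quotes/tag/inspirational?page={}"


def build_start_urls(first_page=1, last_page=10):
    """Return one URL per page number, both bounds inclusive."""
    return [BASE_URL.format(page) for page in range(first_page, last_page + 1)]


start_urls = build_start_urls()
print(len(start_urls))
print(start_urls[0])
print(start_urls[-1])

# The pitfall this avoids: adjacent literals without a comma are concatenated.
broken = ["a" "b", "c"]
print(broken)
```

Inside a spider, the result would presumably be assigned to the `start_urls` class attribute in place of the literal list.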

In [5]:
# Start the crawler
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(SpiderQuotes)
process.start()

2021-07-27 17:01:47 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: scrapybot)
2021-07-27 17:01:47 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19041-SP0
2021-07-27 17:01:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-07-27 17:01:47 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 30,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
  exporter = cls(crawler)



<Deferred at 0x2c045ff02e0>

In [6]:
# Load the exported items into a DataFrame
import pandas as pd
result = pd.read_json("quoteresult.json")

In [7]:
pd.options.display.max_rows=1000
result

Unnamed: 0,text,author
0,\n “Be yourself; everyone else is already...,\n Oscar Wilde\n
1,\n “You've gotta dance like there's nobod...,\n William W. Purkey\n
2,\n “Be the change that you wish to see in...,\n Mahatma Gandhi\n
3,\n “Live as if you were to die tomorrow. ...,\n Mahatma Gandhi\n
4,\n “Darkness cannot drive out darkness: o...,"\n Martin Luther King Jr.,\n"
5,"\n “Without music, life would be a mistak...","\n Friedrich Nietzsche,\n"
6,\n “We accept the love we think we deserv...,"\n Stephen Chbosky,\n"
7,"\n “Imperfection is beauty, madness is ge...",\n Marilyn Monroe\n
8,\n “There are only two ways to live your ...,\n Albert Einstein\n
9,"\n “We are all in the gutter, but some of...","\n Oscar Wilde,\n"


In [8]:
# Remove newline characters from every string cell
result = result.replace('\n','', regex=True)
result

Unnamed: 0,text,author
0,“Be yourself; everyone else is already t...,Oscar Wilde
1,“You've gotta dance like there's nobody ...,William W. Purkey
2,“Be the change that you wish to see in t...,Mahatma Gandhi
3,“Live as if you were to die tomorrow. Le...,Mahatma Gandhi
4,“Darkness cannot drive out darkness: onl...,"Martin Luther King Jr.,"
5,"“Without music, life would be a mistake.”","Friedrich Nietzsche,"
6,“We accept the love we think we deserve.”,"Stephen Chbosky,"
7,"“Imperfection is beauty, madness is geni...",Marilyn Monroe
8,“There are only two ways to live your li...,Albert Einstein
9,"“We are all in the gutter, but some of u...","Oscar Wilde,"
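
Removing `\n` alone can leave surrounding spaces in place and, for some authors, a trailing comma (for example `Oscar Wilde,`). A small sketch of a stricter cleaner in plain Python follows; `clean_field` is a hypothetical helper, and the raw strings are modeled on the output above rather than copied from the scraped file.

```python
def clean_field(value):
    """Collapse whitespace runs and drop a trailing comma from a scraped string."""
    if not isinstance(value, str):
        return value  # leave missing values (None, NaN) untouched
    collapsed = " ".join(value.split())  # also removes newlines and edge spaces
    return collapsed.rstrip(",").strip()


# Raw values modeled on the scraped output (assumed shape).
raw_author = "\n    Martin Luther King Jr.,\n  "
raw_text = "\n      “Be yourself; everyone else is already taken.”\n  "

print(clean_field(raw_author))
print(clean_field(raw_text))
```

With the DataFrame from this notebook, it could presumably be applied per column, for example `result['author'] = result['author'].map(clean_field)`.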


In [9]:
pd.set_option('display.max_colwidth', 200)
result

Unnamed: 0,text,author
0,“Be yourself; everyone else is already taken.”,Oscar Wilde
1,"“You've gotta dance like there's nobody watching,",William W. Purkey
2,“Be the change that you wish to see in the world.”,Mahatma Gandhi
3,“Live as if you were to die tomorrow. Learn as if you were to live forever.”,Mahatma Gandhi
4,“Darkness cannot drive out darkness: only light can do that. Hate cannot drive out hate: only love can do that.”,"Martin Luther King Jr.,"
5,"“Without music, life would be a mistake.”","Friedrich Nietzsche,"
6,“We accept the love we think we deserve.”,"Stephen Chbosky,"
7,"“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",Marilyn Monroe
8,“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”,Albert Einstein
9,"“We are all in the gutter, but some of us are looking at the stars.”","Oscar Wilde,"


In [10]:
# Export the table to a CSV file
result.to_csv('quoteScrapy.csv')