# Master Data Science for Business - Data Science Consulting - Session 2 

# Notebook 1: 

# Introduction to Web Scraping with Scrapy

This notebook explains the basics of scraping a website with the Python package Scrapy. <br>
 <br>
Useful resources: <br>
- The official tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html <br>
- A tutorial on using Scrapy within a Jupyter Notebook: https://www.jitsejan.com/using-scrapy-in-jupyter-notebook.html <br>
 <br>

**To Do**: Run all the cells, and prepare a short explanation of what the Scrapy code is doing in the 3rd part, "Our first spider". Don't spend time looking at the other parts today, since these are mainly scripts to use Scrapy inside a Jupyter Notebook. 

## 1. Importing packages

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess
import json
import logging
import pandas as pd

## 2. Creating the JSON pipeline

In [2]:
# JSON Lines pipeline: writes each scraped item as one JSON object per line.
# You can rename "trustpilot.jl" to the file name of your choice.
class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('trustpilot.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
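
The pipeline above writes one JSON object per line, a layout usually called JSON Lines. As a minimal sketch, assuming a small hand-written file in that layout (the file name `sample_reviews.jl` and its two records are invented for illustration and are not scraped data), such a file can be read back with pandas by passing `lines=True`:

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical records, invented for illustration (not real scraped reviews)
sample_items = [
    {"title": "Great stay", "stars": 5},
    {"title": "Too noisy", "stars": 2},
]

# Write them the same way process_item does: one JSON object per line
sample_path = os.path.join(tempfile.mkdtemp(), "sample_reviews.jl")
with open(sample_path, "w") as f:
    for item in sample_items:
        f.write(json.dumps(dict(item)) + "\n")

# lines=True tells pandas to parse each line as a separate JSON object
df_lines = pd.read_json(sample_path, lines=True)
print(df_lines)
```

The same call should work on `trustpilot.jl` once the spider has run, since that file is produced line by line in the same way.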

## 3. Our first spider 

Scrapy uses spiders, which crawl the web to collect the data we want. A spider is a class that defines which websites to scrape and which elements to extract from them, such as titles, authors, content, or images.

In [3]:
# Define the spider as a subclass of scrapy.Spider
class TrustSpider(scrapy.Spider):
    name = "trust"
    # URLs to scrape
    start_urls = [
        'https://fr.trustpilot.com/review/www.centerparcs.fr/fr-fr',
    ]
    # Custom settings that override those usually found in the settings.py file
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1}, # Output 1: JSON Lines file written by JsonWriterPipeline
        'FEED_FORMAT':'json',                                 # Output 2: Scrapy's built-in feed export
        'FEED_URI': 'trustpilot.json'                         # Output 2: destination file of the feed export
    }
    # The following method parses the downloaded page and extracts the elements we want
    # This is the main function we want you to understand
    def parse(self, response):
        stars = 'article.review section.review__content div.review-content div.review-content__header div.review-content-header'
        date = 'article.review section.review__content div.review-content div.review-content__header div.review-content-header__dates script::text'
        for review in response.css('article.review'):
            # Extract each raw string once; fall back to -1 when the element is missing
            stars_html = review.css(stars).extract_first()
            date_json = review.css(date).extract_first()
            nb_stars = stars_html[77] if stars_html is not None else -1
            pub_date = date_json[19:38] if date_json is not None else -1
            yield {
                'title': review.css('a.link.link--large.link--dark::text').extract_first(),
                'content': review.css('p.review-content__text::text').extract_first(),
                'author': review.css('div.consumer-information__name::text').extract_first(),
                'date': pub_date,
                'stars': nb_stars,
            }

## 4. Crawling

The following code launches the spider. <br> 
**Warning**: You can run the process only once per kernel session, and for one spider per notebook. To relaunch it, use "Restart and run all"; otherwise Twisted raises a `ReactorNotRestartable` error. 

In [4]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(TrustSpider)
process.start()

2019-01-14 18:50:51 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2019-01-14 18:50:51 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.2 (default, Jan  2 2019, 17:07:39) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1a  20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.16299-SP0
2019-01-14 18:50:51 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'trustpilot.json', 'LOG_LEVEL': 30, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


## 5. Loading the data into a pandas DataFrame

In [5]:
dfjson = pd.read_json('trustpilot.json')
#Preview of the dataframe
dfjson.head()

|   | author | content | date | stars | title |
|---|---|---|---|---|---|
| 0 | \n Cor Boonen\n | \n Nous avons passé un excellent we... | 2018-11-22 17:23:07 | 4 | Week-end |
| 1 | \n Alain \n | \n Parc très agréable, difficile de... | 2018-08-08 07:52:09 | 3 | Les TROIS FORETS |
| 2 | \n Manuele Civico\n | \n Pas grand chose ne marche, ni l’... | 2018-05-12 10:02:58 | 1 | Pas grand chose ne marche ! |
| 3 | \n Sophie Duhamel\n | \n Moi je vais parler aujourd'hui ... | 2017-09-15 11:17:50 | 1 | Non professionnel |
| 4 | \n jerome\n | \n calme, reposant, confortable, dé... | 2016-10-11 07:54:28 | 5 | bon séjour |
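
The preview shows that `author` and `content` still carry the line breaks and padding from the page's HTML. Below is a minimal sketch of one way to clean them, using a small hand-made DataFrame (its values imitate the preview, and the exact whitespace and the content strings are assumptions):

```python
import pandas as pd

# Hand-made sample imitating the preview; exact whitespace and text are assumed
df_sample = pd.DataFrame({
    "author": ["\n            Cor Boonen\n        ", "\n            Alain \n        "],
    "content": ["\n            First sample text\n        ", "\n            Second sample text\n        "],
})

# Strip leading and trailing whitespace (including newlines) from each text column
for column in ["author", "content"]:
    df_sample[column] = df_sample[column].str.strip()

print(df_sample)
```

Applying the same loop to `dfjson` would be the natural next step, provided both columns contain strings only.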
