# Scraping Yelp

The aim of this exercise is to let a user run an automated search on [Yelp](https://www.yelp.fr/) and store the results in a .json file. You will be guided through the steps: submitting a form request with search keywords, parsing the search results, crawling all the result pages, and storing the results in a file.

**Scrapy is not designed to launch several crawler processes in the same script (the Twisted reactor it relies on cannot be restarted), so you will have to restart your notebook's kernel before completing each question!**

1. Create a class `YelpSpider(scrapy.Spider)` with `start_urls = ['https://www.yelp.fr/']`. In this class, define a `parse(self, response)` method that automatically fills Yelp's homepage form with "restaurant japonais" as search keywords and "Paris" as search location. Then, define another method `after_search(self, response)` that parses the first page of results and yields the name and URL of each search result. Finally, declare a `CrawlerProcess` that stores the results in a file named `"restaurant_japonais-paris.json"`.
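To make the form step more concrete, here is a minimal standard-library sketch of what submitting the two fields amounts to. The field names `find_desc` and `find_loc` are the ones used in this exercise; the `/search` path and the assumption that the form is sent with GET are illustrative guesses, not values read from the live page. `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_search_url(base_url, keywords, location):
    """Return the URL a GET form submission would request (illustrative only)."""
    # "/search" is an assumed form action, not read from the real page
    query = urlencode({"find_desc": keywords, "find_loc": location})
    return f"{base_url.rstrip('/')}/search?{query}"


search_url = build_search_url("https://www.yelp.fr/", "restaurant japonais", "Paris")
print(search_url)
# https://www.yelp.fr/search?find_desc=restaurant+japonais&find_loc=Paris
```

In the spider itself, `scrapy.FormRequest.from_response` reads the form's real action and method from the page, so nothing needs to be hard-coded.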

In [None]:
import os
import logging

import scrapy
from scrapy.crawler import CrawlerProcess

In [None]:
class YelpSpider(scrapy.Spider):
    # Name of your spider
    name = "yelp"

    # Starting URL
    start_urls = ['https://www.yelp.fr/']

    # Parse function for form request
    def parse(self, response):
        # FormRequest used to make a search in Paris
        return scrapy.FormRequest.from_response(
            response,
            formdata={'find_desc': 'restaurant japonais', 'find_loc': 'paris'},
            callback=self.after_search
        )

    # Callback run on the search results page
    def after_search(self, response):
        # Each result title is a link nested inside an <h4> heading
        results = response.css('h4 a')
        
        for r in results:
            yield {
                'name': r.css('::text').get(),
                'url': "https://www.yelp.fr" + r.attrib["href"]
            }

In [None]:
# Name of the file where the results will be saved
filename = "restaurant_japonais-paris.json"

# If the file already exists, delete it before crawling (otherwise Scrapy appends the new results to the old ones)
if os.path.exists('results/' + filename):
    os.remove('results/' + filename)

# Declare a new CrawlerProcess with some settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'results/' + filename : {"format": "json"},
    }
})

# Start the crawling using the spider you defined above
process.crawl(YelpSpider)
process.start()


2020-09-01 14:19:36 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2020-09-01 14:19:36 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 | packaged by conda-forge | (default, Jul 31 2020, 02:39:48) - [GCC 7.5.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.0, Platform Linux-4.19.112+-x86_64-with-glibc2.10
2020-09-01 14:19:36 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2020-09-01 14:19:36 [scrapy.extensions.telnet] INFO: Telnet Password: 07c0818057376a18
2020-09-01 14:19:36 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-09-01 14:19:37 [scrapy.middleware] INFO: Enabl
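The `after_search` callback builds each result URL by prefixing the site root to the `href` attribute, which works as long as every `href` is a root-relative path. A slightly more defensive option, sketched here with the standard library, is to resolve the link against a base URL. The `/biz/...` paths are made-up examples, and `absolute_url` is a hypothetical helper.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.yelp.fr/"


def absolute_url(href, base=BASE_URL):
    """Resolve a link found in the page against the site root."""
    return urljoin(base, href)


# A root-relative link is completed with the scheme and host
joined = absolute_url("/biz/example-restaurant-paris")
# A link that is already absolute is returned unchanged
already_absolute = absolute_url("https://www.yelp.fr/biz/another-example")

print(joined)
print(already_absolute)
```

Inside a spider, `response.urljoin(href)` does the same resolution against the URL of the page being parsed.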

2. Once the first page's results are stored in "restaurant_japonais-paris.json", extend the `after_search(self, response)` method to crawl the remaining result pages, so that all the search results end up in the file `"restaurant_japonais-paris.json"`. Restart your notebook's kernel, run the new `CrawlerProcess` and check that all the search results (not only the first page) are now stored in the file.
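The pagination idea can be pictured with a toy model in plain Python, with no network and no Scrapy involved. `FAKE_PAGES` and its URLs are invented for illustration: each page holds some results and, possibly, a link to the next page.

```python
# Toy site: each "page" lists result names and an optional link to the next page
FAKE_PAGES = {
    "/search?start=0": {"results": ["A", "B"], "next": "/search?start=10"},
    "/search?start=10": {"results": ["C", "D"], "next": "/search?start=20"},
    "/search?start=20": {"results": ["E"], "next": None},
}


def crawl(url, pages=FAKE_PAGES):
    """Yield every result, following 'next' links until none is left."""
    while url is not None:
        page = pages[url]
        yield from page["results"]
        url = page["next"]


all_results = list(crawl("/search?start=0"))
print(all_results)  # ['A', 'B', 'C', 'D', 'E']
```

In a Scrapy spider there is no explicit loop: the callback yields its items, then yields a new request for the next page with itself as the callback, and the crawl stops when no next link is found.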

In [None]:
import os
import logging

import scrapy
from scrapy.crawler import CrawlerProcess

In [None]:
class YelpSpider(scrapy.Spider):
    # Name of your spider
    name = "yelp"

    # Starting URL
    start_urls = ['https://www.yelp.fr/']

    # Parse function for form request
    def parse(self, response):
        # FormRequest used to make a search in Paris
        return scrapy.FormRequest.from_response(
            response,
            formdata={'find_desc': 'restaurant japonais', 'find_loc': 'paris'},
            callback=self.after_search
        )

    # Callback run on each search results page
    def after_search(self, response):
        # Each result title is a link nested inside an <h4> heading
        results = response.css('h4 a')
        
        for r in results:
            yield {
                'name': r.css('::text').get(),
                'url': "https://www.yelp.fr" + r.attrib["href"]
            }
            
        # Select the NEXT button and store its link in next_page
        # (.attrib is an empty dict when nothing matches, hence the KeyError)
        try:
            next_page = response.css('a.next-link').attrib["href"]
        except KeyError:
            logging.info('No next page. Terminating crawling process.')
        else:
            yield response.follow(next_page, callback=self.after_search)

In [None]:
# Name of the file where the results will be saved
filename = "restaurant_japonais-paris.json"

# If the file already exists, delete it before crawling (otherwise Scrapy appends the new results to the old ones)
if os.path.exists('results/' + filename):
    os.remove('results/' + filename)

# Declare a new CrawlerProcess with some settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'results/' + filename : {"format": "json"},
    }
})

# Start the crawling using the spider you defined above
process.crawl(YelpSpider)
process.start()


2020-09-01 14:24:22 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2020-09-01 14:24:22 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 | packaged by conda-forge | (default, Jul 31 2020, 02:39:48) - [GCC 7.5.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.0, Platform Linux-4.19.112+-x86_64-with-glibc2.10
2020-09-01 14:24:22 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2020-09-01 14:24:22 [scrapy.extensions.telnet] INFO: Telnet Password: d744b107c4cf10b2
2020-09-01 14:24:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-09-01 14:24:23 [scrapy.middleware] INFO: Enabl
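One way to check that more than the first page was stored is to load the file and count its entries. The sketch below assumes the feed is a JSON array of objects with `name` and `url` keys, as yielded by the spider. It runs on a small hand-written sample file so that it works without a crawl; `summarize_results` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def summarize_results(path):
    """Load a JSON feed and return (number of items, first item or None)."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    first = items[0] if items else None
    return len(items), first


# Demo on a made-up file with the same shape as the spider's output
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.json"
    sample.write_text(
        json.dumps([
            {"name": "Example A", "url": "https://www.yelp.fr/biz/example-a"},
            {"name": "Example B", "url": "https://www.yelp.fr/biz/example-b"},
        ]),
        encoding="utf-8",
    )
    count, first = summarize_results(sample)

print(count, first["name"])  # 2 Example A
```

After a real crawl, calling `summarize_results("results/restaurant_japonais-paris.json")` should report clearly more items than a single results page contains.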

Congrats, you've just built a proof of concept of an automated Yelp search with Scrapy! Now, let's improve the script so that the user can run any search at any location 😎

3. Use Python's `input()` function to ask the user which keywords and location they would like to use, and save them in two variables: `search_keywords` and `search_location`. Then, change the `parse(self, response)` method so that it fills Yelp's form with the user-defined keywords and location. Finally, change the `CrawlerProcess` so that it stores the results in a file named with the following format: "search_keywords-location.json".

Try your search engine with different keywords and locations ✌️
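Since the file name now comes from free text typed by the user, it can help to normalise it before use. Here is a minimal sketch, assuming that whitespace and path separators should become underscores. `build_filename` is a hypothetical helper and this is only one possible convention.

```python
import re


def build_filename(keywords, location):
    """Return 'keywords-location.json' with unsafe characters replaced by '_'."""
    def slug(text):
        # Collapse runs of whitespace, '/' and '\' into a single underscore
        return re.sub(r"[\s/\\]+", "_", text.strip())

    return f"{slug(keywords)}-{slug(location)}.json"


filename = build_filename("restaurant japonais", " Paris ")
print(filename)  # restaurant_japonais-Paris.json
```

With this helper, a location such as "Aix en Provence" gives a file name without spaces, and a stray `/` in the keywords cannot point the output to another folder.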

In [None]:
import os
import logging

import scrapy
from scrapy.crawler import CrawlerProcess

In [None]:
print("Welcome to the automated Yelp search engine!")

search_keywords = input("Please enter your search keywords: ")
search_location = input("Please enter the name of the city: ")

Welcome to the automated Yelp search engine!


Please enter your search keywords:  spa
Please enter the name of the city:  Grenoble


In [None]:
class YelpSpider(scrapy.Spider):
    # Name of your spider
    name = "yelp"

    # Starting URL
    start_urls = ['https://www.yelp.fr/']

    # Parse function for form request
    def parse(self, response):
        # FormRequest used to make a search
        return scrapy.FormRequest.from_response(
            response,
            formdata={'find_desc': search_keywords, 'find_loc': search_location},
            callback=self.after_search
        )

    # Callback run on each search results page
    def after_search(self, response):
        # Each result title is a link nested inside an <h4> heading
        results = response.css('h4 a')
        
        for r in results:
            yield {
                'name': r.css('::text').get(),
                'url': "https://www.yelp.fr" + r.attrib["href"]
            }
            
        # Select the NEXT button and store its link in next_page
        # (.attrib is an empty dict when nothing matches, hence the KeyError)
        try:
            next_page = response.css('a.next-link').attrib["href"]
        except KeyError:
            logging.info('No next page. Terminating crawling process.')
        else:
            yield response.follow(next_page, callback=self.after_search)

In [None]:
# Name of the file where the results will be saved
filename = search_keywords.replace(" ", "_") + "-" + search_location + ".json"

# If the file already exists, delete it before crawling (otherwise Scrapy appends the new results to the old ones)
if os.path.exists('results/' + filename):
    os.remove('results/' + filename)

# Declare a new CrawlerProcess with some settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'results/' + filename : {"format": "json"},
    }
})

# Start the crawling using the spider you defined above
process.crawl(YelpSpider)
process.start()


2020-09-01 14:31:45 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2020-09-01 14:31:45 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 | packaged by conda-forge | (default, Jul 31 2020, 02:39:48) - [GCC 7.5.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.0, Platform Linux-4.19.112+-x86_64-with-glibc2.10
2020-09-01 14:31:45 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2020-09-01 14:31:45 [scrapy.extensions.telnet] INFO: Telnet Password: f7dddfab2d16ca6e
2020-09-01 14:31:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-09-01 14:31:46 [scrapy.middleware] INFO: Enabl