# Scraping Yelp

The aim of this exercise is to let a user run an automated search on <a href="https://www.yelp.fr/" target="_blank">Yelp</a> and store the results in a `.json` file. You will be guided through the different steps: submitting a form request with search keywords, parsing the search results, crawling all the result pages, and storing the results in a file.

⚠ **Scrapy is not designed to launch several crawler processes in the same script (the underlying Twisted reactor cannot be restarted), so you will have to restart your notebook's kernel before completing each question!**

1. Create a class `YelpSpider(scrapy.Spider)` with `start_urls = ['https://www.yelp.fr/']`. In this class, define a `parse(self, response)` method that automatically fills Yelp's homepage form with: "restaurant japonais" as search keywords and "Paris" as search location. Then, define another method `after_search(self, response)` that parses the first page of results, and yields the name and url of each search result. Finally, declare a `CrawlerProcess` that will store the results in a file named `"restaurant_japonais-paris.json"`.
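
Before writing the Scrapy selectors, it can help to look at the parsing step in isolation. The following is a minimal sketch that uses only Python's standard library; it is not the expected solution. It extracts a name and a URL from each matching link in a made-up HTML snippet. The class `BusinessLinkParser`, the sample markup, and the assumption that business links start with `/biz/` are illustrative and not taken from the exercise.

```python
from html.parser import HTMLParser


class BusinessLinkParser(HTMLParser):
    """Collect {'name', 'url'} pairs from <a> tags whose href has a given prefix."""

    def __init__(self, href_prefix="/biz/"):  # "/biz/" is an assumption
        super().__init__()
        self.href_prefix = href_prefix
        self.results = []
        self._current = None  # item being filled while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(self.href_prefix):
            self._current = {"name": "", "url": href}

    def handle_data(self, data):
        # Accumulate the link text only while inside a matching <a>
        if self._current is not None:
            self._current["name"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["name"] = self._current["name"].strip()
            self.results.append(self._current)
            self._current = None


# Made-up markup standing in for one page of search results
sample_html = """
<div><a href="/biz/sushi-demo-paris">Sushi Demo</a></div>
<div><a href="/search?start=10">Suivant</a></div>
<div><a href="/biz/ramen-demo-paris">Ramen Demo</a></div>
"""

parser = BusinessLinkParser()
parser.feed(sample_html)
parser.close()
for item in parser.results:
    print(item)
```

In a spider, the same idea could be written with an attribute selector such as `response.css('a[href^="/biz/"]')`, which may be less brittle than Yelp's generated class names, assuming the `/biz/` prefix holds on the live site.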

In [None]:
# Import your libraries here
import os 
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

In [None]:
# Define your class YelpSpider(scrapy.Spider) with all methods needed
class YelpSpider(scrapy.Spider):
    name = "yelp"
    allowed_domains = ['www.yelp.fr']
    start_urls = ['https://www.yelp.fr/']

    # Default callback, called with the response for each URL in start_urls
    # It fills in Yelp's search form and sends the result page to after_search
    def parse(self, response):
        yield scrapy.FormRequest.from_response(response,
                                               formdata = {'find_desc':'restaurant japonais', 'find_loc':'Paris'},
                                               callback = self.after_search)
            
    def after_search(self, response):
        text_block = response.css('div.businessName__09f24__3Wql2.display--inline-block__09f24__FsgS4.border-color--default__09f24__R1nRO')
        for text in text_block:
            yield {
                'name': text.css('a.link__09f24__1kwXV.link-color--inherit__09f24__3PYlA.link-size--inherit__09f24__2Uj95::text').get(),
                'url': text.css('a.link__09f24__1kwXV.link-color--inherit__09f24__3PYlA.link-size--inherit__09f24__2Uj95::attr(href)').get()
            }

In [None]:
# CrawlerProcess and settings go here

filename = "restaurant_japonais-paris.json"

# If file already exists, delete it before crawling (because Scrapy will 
# concatenate the last and new results otherwise)
if not os.path.exists('./saving'):
    os.mkdir('./saving')
if filename in os.listdir('saving/'):
    os.remove('saving/' + filename)

# Declare a new CrawlerProcess with some settings
## USER_AGENT => Makes requests look like they come from a browser on a given OS
## LOG_LEVEL => Minimum level of the messages to log
## FEEDS => Where, and in which format, the results will be stored
## More info on built-in settings => https://docs.scrapy.org/en/latest/topics/settings.html?highlight=settings#settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'saving/' + filename : {"format": "json"},
    },
    "AUTOTHROTTLE_ENABLED": True,
    "COOKIES_ENABLED": True
})

# Start the crawling using the spider you defined above
process.crawl(YelpSpider)
process.start()

2020-12-13 15:20:58 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2020-12-13 15:20:58 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.6 | packaged by conda-forge | (default, Oct  7 2020, 19:08:05) - [GCC 7.5.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Linux-4.19.112+-x86_64-with-glibc2.10
2020-12-13 15:20:58 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 '
               'Firefox/48.0'}
2020-12-13 15:20:58 [scrapy.extensions.telnet] INFO: Telnet Password: 4273b84827f6dcdc
2020-12-13 15:20:58 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.lo

2. Once you've managed to get the first page's results in `restaurant_japonais-paris.json`, complete the `after_search(self, response)` method to crawl the remaining result pages, so that all the search results are stored in the file `"restaurant_japonais-paris.json"`. Restart your notebook's kernel, execute the new `CrawlerProcess` and check that all the search results (and not only the first page) are now stored in the file.
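
The pagination logic boils down to two rules: resolve the "next" link against the current page's URL, and stop when there is no such link. Below is a minimal sketch of that loop using only the standard library, run against a fake two-page site. The helpers `next_page_url` and `crawl_all_pages`, the `example.com` URLs, and the `pages` dictionary are all hypothetical and only serve the illustration.

```python
from urllib.parse import urljoin


def next_page_url(current_url, next_href):
    """Return the absolute URL of the next results page, or None on the last page."""
    if not next_href:
        return None
    return urljoin(current_url, next_href)


def crawl_all_pages(first_url, pages):
    """Walk a fake site (dict of url -> (items, next_href)) page by page."""
    url, seen = first_url, set()
    # Stop when there is no next link, or if a page links back to one already visited
    while url is not None and url not in seen:
        seen.add(url)
        items, next_href = pages[url]
        yield from items
        url = next_page_url(url, next_href)


# Fake site: the first page links to the second, the second has no next link
pages = {
    "https://example.com/search?q=demo": (["A", "B"], "/search?q=demo&start=10"),
    "https://example.com/search?q=demo&start=10": (["C"], None),
}

all_items = list(crawl_all_pages("https://example.com/search?q=demo", pages))
print(all_items)
```

In a Scrapy callback, the equivalent step is usually written as an explicit `if next_page is not None:` check before calling `response.follow(next_page, callback=...)`, since `response.follow` resolves relative links itself.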

In [None]:
# Import your libraries here
import os 
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

In [None]:
# Define a new class YelpSpider based on the previous one 
# but complete after_search method
class YelpSpider(scrapy.Spider):
    name = "yelp"
    allowed_domains = ['www.yelp.fr']
    start_urls = ['https://www.yelp.fr/']

    # Default callback, called with the response for each URL in start_urls
    # It fills in Yelp's search form and sends the result page to after_search
    def parse(self, response):
        yield scrapy.FormRequest.from_response(response,
                                               formdata = {'find_desc':'restaurant japonais', 'find_loc':'Paris'},
                                               callback = self.after_search)
            
    def after_search(self, response):
        text_block = response.css('div.businessName__09f24__3Wql2.display--inline-block__09f24__FsgS4.border-color--default__09f24__R1nRO')
        for text in text_block:
            yield {
                'name': text.css('a.link__09f24__1kwXV.link-color--inherit__09f24__3PYlA.link-size--inherit__09f24__2Uj95::text').get(),
                'url': text.css('a.link__09f24__1kwXV.link-color--inherit__09f24__3PYlA.link-size--inherit__09f24__2Uj95::attr(href)').get()
            }
        # Select the NEXT button and store it in next_page
        next_page = response.css('a.link__09f24__1kwXV.next-link.navigation-button__09f24__3F7Pt.link-color--inherit__09f24__3PYlA.link-size--inherit__09f24__2Uj95::attr(href)').get()
        # If a next page is found, run after_search again on that page
        if next_page is not None:
            yield response.follow(next_page, callback=self.after_search)
        else:
            logging.info('No next page. Terminating crawling process.')

In [None]:
# CrawlerProcess and settings go here
filename = "restaurant_japonais-paris.json"

# If file already exists, delete it before crawling (because Scrapy will 
# concatenate the last and new results otherwise)
if not os.path.exists('./saving'):
    os.mkdir('./saving')
if filename in os.listdir('saving/'):
    os.remove('saving/' + filename)

# Declare a new CrawlerProcess with some settings
## USER_AGENT => Makes requests look like they come from a browser on a given OS
## LOG_LEVEL => Minimum level of the messages to log
## FEEDS => Where, and in which format, the results will be stored
## More info on built-in settings => https://docs.scrapy.org/en/latest/topics/settings.html?highlight=settings#settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'saving/' + filename : {"format": "json"},
    },
    "AUTOTHROTTLE_ENABLED": True,
    "COOKIES_ENABLED": True
})

# Start the crawling using the spider you defined above
process.crawl(YelpSpider)
process.start()

2020-12-13 15:18:10 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2020-12-13 15:18:10 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.6 | packaged by conda-forge | (default, Oct  7 2020, 19:08:05) - [GCC 7.5.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Linux-4.19.112+-x86_64-with-glibc2.10
2020-12-13 15:18:10 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 '
               'Firefox/48.0'}
2020-12-13 15:18:10 [scrapy.extensions.telnet] INFO: Telnet Password: c8fe3194fe365ec9
2020-12-13 15:18:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.lo

Congrats, you've just built a proof of concept for an automated Yelp search with Scrapy! Now, let's improve the script so that the user can run any search at any location 😎

3. Use Python's `input()` function to ask the user which keywords and location they would like to use, and save them into two variables: `search_keywords` and `search_location`. Then, change the `parse(self, response)` method so that it fills Yelp's form with the user-defined keywords and location. Finally, change the `CrawlerProcess` so that it stores the results in a file named with the following format: `search_keywords-location.json`.

Try your search engine with different keywords and locations ✌️
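
Since the file name now comes from free-form user input, it is worth normalizing it before handing it to Scrapy. Here is a minimal sketch of one way to do that; the helpers `slugify` and `build_feed_path` are hypothetical names, and the `saving` folder is the one used in the cells of this notebook.

```python
import re
from pathlib import Path


def slugify(text):
    """Lowercase text and replace each run of non-word characters with '-'."""
    return re.sub(r"[^\w]+", "-", text.strip().lower()).strip("-")


def build_feed_path(keywords, location, folder="saving"):
    """Build '<folder>/<keywords>-<location>.json' from raw user input."""
    return Path(folder) / f"{slugify(keywords)}-{slugify(location)}.json"


feed_path = build_feed_path(" Restaurants  cubains", "Paris")
print(feed_path.as_posix())  # saving/restaurants-cubains-paris.json
```

Collapsing runs of spaces and punctuation avoids names such as `restaurants--cubains-paris.json`, and also keeps characters like `/` out of the path. On Python 3.8+, `feed_path.unlink(missing_ok=True)` could replace the manual existence check before each crawl.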

In [None]:
# Import your libraries here
import os 
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

In [None]:
# Use input() function here to define keywords
search_keywords = input("Enter search keywords: ")
search_location = input("Enter the location: ")

Enter search keywords:  restaurants cubains
Enter the location:  Paris


In [None]:
# Declare YelpSpider class here
class YelpSpider(scrapy.Spider):
    name = "yelp"
    allowed_domains = ['www.yelp.fr']
    start_urls = ['https://www.yelp.fr/']

    # Default callback, called with the response for each URL in start_urls
    # It fills in Yelp's search form with the user-defined keywords and location
    def parse(self, response):
        yield scrapy.FormRequest.from_response(response,
                                formdata = {'find_desc':search_keywords, 'find_loc':search_location},
                                callback = self.after_search)
            
    def after_search(self, response):
        text_block = response.css('div.businessName__09f24__3Wql2.display--inline-block__09f24__FsgS4.border-color--default__09f24__R1nRO')
        for text in text_block:
            yield {
                'name': text.css('a.link__09f24__1kwXV.link-color--inherit__09f24__3PYlA.link-size--inherit__09f24__2Uj95::text').get(),
                'url': text.css('a.link__09f24__1kwXV.link-color--inherit__09f24__3PYlA.link-size--inherit__09f24__2Uj95::attr(href)').get()
            }
        # Select the NEXT button and store it in next_page
        next_page = response.css('a.link__09f24__1kwXV.next-link.navigation-button__09f24__3F7Pt.link-color--inherit__09f24__3PYlA.link-size--inherit__09f24__2Uj95::attr(href)').get()
        # If a next page is found, run after_search again on that page
        if next_page is not None:
            yield response.follow(next_page, callback=self.after_search)
        else:
            logging.info('No next page. Terminating crawling process.')

In [None]:
# CrawlerProcess and settings go here
filename = (search_keywords + "-"+ search_location + ".json").lower()
filename = filename.replace(" ", "-")

# If file already exists, delete it before crawling (because Scrapy will 
# concatenate the last and new results otherwise)
if not os.path.exists('./saving'):
    os.mkdir('./saving')
if filename in os.listdir('saving/'):
    os.remove('saving/' + filename)

# Declare a new CrawlerProcess with some settings
## USER_AGENT => Makes requests look like they come from a browser on a given OS
## LOG_LEVEL => Minimum level of the messages to log
## FEEDS => Where, and in which format, the results will be stored
## More info on built-in settings => https://docs.scrapy.org/en/latest/topics/settings.html?highlight=settings#settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'saving/' + filename : {"format": "json"},
    },
    "AUTOTHROTTLE_ENABLED": True,
    "COOKIES_ENABLED": True
})

# Start the crawling using the spider you defined above
process.crawl(YelpSpider)
process.start()

2020-12-13 15:11:43 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2020-12-13 15:11:43 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.6 | packaged by conda-forge | (default, Oct  7 2020, 19:08:05) - [GCC 7.5.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Linux-4.19.112+-x86_64-with-glibc2.10
2020-12-13 15:11:43 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 '
               'Firefox/48.0'}
2020-12-13 15:11:43 [scrapy.extensions.telnet] INFO: Telnet Password: 75f51826b1d82716
2020-12-13 15:11:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.lo