# Exercise: Web Scraping with Scrapy 🌐

<img src="./assets/scrapy.png" alt="scrapy" width="800"/>

[Scrapy](https://scrapy.org/) is "an open source and collaborative framework for extracting the data you need from websites". It is written in Python and runs on Linux, Windows, Mac and BSD. Scrapy is one of the main tools for web scraping in Python, alongside Beautiful Soup and Selenium (which automates a browser and can also extract data).

This exercise covers the basics of Scrapy, including how to install the library, create a project and a spider, explore a web page in the Scrapy shell, extract data, and save the data as a .csv file. While this notebook is designed to get you started on Scrapy, the [docs](https://docs.scrapy.org/en/latest/) should be the first place you look for documentation and additional Scrapy functionality.

**Credit:** This notebook follows the example presented [here](https://github.com/ifrankandrade/data-collection) by Frank Andrade.

<hr style="border-top: 0.2px solid gray; margin-top: 12pt; margin-bottom: 0pt"></hr>

## Create an environment on Anaconda
[Anaconda](https://www.anaconda.com/) simplifies package management and deployment. Here, we will install the Scrapy library in a new environment.
<ol>
<li>Navigate to the <b>Environments</b> section on the left panel</li>
<li>Clone your <b>eds-217</b> environment and assign it a new name. For example, <b>eds-217-scrapy</b></li>
<li>Open the terminal via your new environment. Click on the green arrow next to <b>eds-217-scrapy</b> on the left panel</li>
</ol>

On the terminal, run the line:

In [1]:
conda install -c conda-forge scrapy

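
Once the installation finishes, it can help to confirm that Scrapy is visible from the environment you are working in. The following is a minimal sketch using only the Python standard library; the helper name `package_version` is our own and is not part of Scrapy or conda.

```python
from importlib import metadata, util


def package_version(name):
    """Return the installed version of `name`, or None if it cannot be imported."""
    if util.find_spec(name) is None:
        return None
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        # Importable, but without distribution metadata (e.g. a standard library module)
        return "unknown"


# Report whether Scrapy is available in the current environment
print("scrapy:", package_version("scrapy") or "not installed in this environment")
```

If the message says Scrapy is not installed, check that the notebook or terminal is using the cloned environment rather than the base one.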


## Set up Scrapy on VS Code
We work in <a href="https://code.visualstudio.com/">Visual Studio (VS) Code</a>, but other integrated development environments (IDEs), like PyCharm, work as well. Here, we set up our workspace with a GitHub repository and select our environment.
<ol>
<li>Create a new repository on GitHub called <b>eds-217-scrapy</b></li>
<li>Clone the repository to create a version-controlled workspace in VS Code</li>
<li>Create a Jupyter notebook: <i>New text file</i> > <i>Save as</i> > <i>Assign a name and add the file extension ".ipynb"</i></li>
<li>Navigate to the kernel picker in the upper right corner of the Jupyter notebook and change the kernel to <b>eds-217-scrapy</b></li>
<li>Next, if you are on macOS, activate your new environment in the terminal. Run the line:</li>
</ol>


In [None]:
conda activate eds-217-scrapy
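
To double-check which environment is active, you can read the `CONDA_DEFAULT_ENV` environment variable, which conda sets when an environment is activated. This is a small sketch; the function name `active_conda_env` is our own, and the expected name assumes you called your environment <b>eds-217-scrapy</b> as in the example above.

```python
import os


def active_conda_env(environ=os.environ):
    """Return the name of the active conda environment, or None if none is active."""
    return environ.get("CONDA_DEFAULT_ENV")


expected = "eds-217-scrapy"  # example name used in this exercise; yours may differ
current = active_conda_env()
if current == expected:
    print(f"Active environment: {current}")
else:
    print(f"Active environment is {current!r}, expected {expected!r}")
```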

## Get to know Scrapy
Let's learn more about Scrapy commands. To view usage information and the available commands, run this line on your terminal:

In [None]:
scrapy

In [7]:
from tabulate import tabulate

data = [["startproject", "creates a new Scrapy project", "global command"], 
        ["genspider", "creates a new spider on the current folder", "global command"], 
        ["settings", "gets the value of a Scrapy setting", "global command"], 
        ["runspider", "run a spider self-contained in a Python file, without having to create a project", "global command"],
        ["shell", "starts the Scrapy shell for the given URL", "global command"],
        ["fetch", "downloads the given URL and writes the contents to standard output", "global command"],
        ["view", "Opens the given URL in a browser, as your Scrapy spider would see it", "global command"],
        ["version", "prints the Scrapy version", "global command"],
        ["crawl", "starts crawling with spider", "local command"],
        ["check", "runs contract checks", "local command"],
        ["list", "returns all available spiders in the current project", "local command"],
        ["edit", "edits the given spider", "local command"],
        ["parse", "fetches the given URL and parses it ", "local command"],
        ["bench", "runs a quick benchmark test", "local command"]]
  
#define header names
col_names = ["Command", "Use", "Type"]
  
#display table
print(tabulate(data, headers = col_names))

Command       Use                                                                               Type
------------  --------------------------------------------------------------------------------  --------------
startproject  creates a new Scrapy project                                                      global command
genspider     creates a new spider on the current folder                                        global command
settings      gets the value of a Scrapy setting                                                global command
runspider     run a spider self-contained in a Python file, without having to create a project  global command
shell         starts the Scrapy shell for the given URL                                         global command
fetch         downloads the given URL and writes the contents to standard output                global command
view          Opens the given URL in a browser, as your Scrapy spider would see it              global command
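version       prints the Scrapy version                                                         global command
crawl         starts crawling with spider                                                       local command
check         runs contract checks                                                              local command
list          returns all available spiders in the current project                              local command
edit          edits the given spider                                                            local command
parse         fetches the given URL and parses it                                               local command
bench         runs a quick benchmark test                                                       local command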

## Create our project and spider
Now, we can start on our project and spider. We scrape data from [worldometers](https://www.worldometers.info/world-population/population-by-country/) to extract and save populations by country.

First, open the terminal and run the line to create a project, <b>spider_worldometer</b>:

In [8]:
scrapy startproject spider_worldometer

Look around the folder you just created! And remember to update your working directory. Run the line:

In [None]:
cd spider_worldometer/

In this project folder, create a spider, <b>worldometer</b>, and assign it a URL. Run the line:

In [None]:
scrapy genspider worldometer https://www.worldometers.info/world-population/population-by-country/

This creates a file, <i>worldometer.py</i>, in the <i>spiders</i> folder for your spider <b>worldometer</b>, and it will look something like this:

In [None]:
import scrapy

class WorldometersSpider(scrapy.Spider):
    name = 'worldometer'
    # allowed_domains takes bare domain names, without a scheme or path
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        pass
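
The `allowed_domains` attribute is meant to hold domain names only, while `start_urls` holds full URLs. As a small illustration of the difference, the sketch below uses the standard library to pull the domain out of the start URL. The helper name `domain_of` is our own and is not part of Scrapy.

```python
from urllib.parse import urlparse

START_URL = "https://www.worldometers.info/world-population/population-by-country/"


def domain_of(url):
    """Return only the host part of a URL, dropping the scheme and the path."""
    return urlparse(url).netloc


# The value printed here is what belongs in allowed_domains
print(domain_of(START_URL))  # prints www.worldometers.info
```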

## Check out www.worldometers.info
In order to scrape the data from this website, we need to learn a little more about how it stores information in HTML. In web design, if a website is a human body, HTML is the bones, CSS is the skin, and JavaScript is the movement. Here, we want to focus on the bones: the text!

First, visit [worldometers](https://www.worldometers.info/world-population/population-by-country/), right-click on the page, and select <i>Inspect</i>.

<p style="text-align:center;"><img src="./assets/inspect.png" alt="scrapy" width="800"/></p>

On Google Chrome, click on the cursor icon on the upper left corner of the right panel to inspect by element. 
<p style="text-align:center;"><img src="./assets/inspect-by-element.png" alt="scrapy" width="800"/></p>
<p style="text-align:center;"><img src="./assets/inspect-td-a.png" alt="scrapy" width="800"/></p>
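
The inspector shows that each country name is a link (`<a>`) nested inside a table cell (`<td>`). To make that nesting concrete, here is a minimal sketch that walks a tiny, made-up HTML table with Python's built-in `html.parser` and keeps only the text found inside `<td><a>`. The sample markup, the placeholder values, and the class name `LinkInCellParser` are our own illustration and are much simpler than the real page.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for two rows of the table (placeholder values)
SAMPLE_HTML = """
<table>
  <tr><td>1</td><td><a href="/a/">Country A</a></td><td>1,000</td></tr>
  <tr><td>2</td><td><a href="/b/">Country B</a></td><td>2,000</td></tr>
</table>
"""


class LinkInCellParser(HTMLParser):
    """Collect the text of every <a> element nested directly inside a <td>."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of the tags that are currently open
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to (and including) the matching opening tag
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Keep text only when the two innermost open tags are <td><a>
        if self.open_tags[-2:] == ["td", "a"] and data.strip():
            self.names.append(data.strip())


parser = LinkInCellParser()
parser.feed(SAMPLE_HTML)
country_names = parser.names
print(country_names)
```

Only the linked text is collected; the rank and population cells are skipped because their text sits directly inside `<td>` rather than inside `<td><a>`.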


## Start the shell for URL
Let's explore what the elements on the page look like and how you can access them. On the terminal, run the lines:

In [None]:
scrapy shell
>>> r = scrapy.Request(url='https://www.worldometers.info/world-population/population-by-country/')
>>> fetch(r)
>>> response.body
>>> response.xpath('//h1/text()').get() # returns the text of the first <h1> (the page title)
>>> response.xpath('//td/a/text()').getall() # returns all country names
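
In the shell, `.get()` returns the first match (or `None`), while `.getall()` returns every match as a list. If you want to experiment with that difference without hitting the website, the sketch below mimics it on a made-up snippet using the standard library's `xml.etree.ElementTree`. Note the assumptions: ElementTree only supports a subset of XPath, so `.//td/a` stands in for `//td/a/text()`, and the helpers `first_text` and `all_texts` are our own rough analogues, not Scrapy functions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (placeholder values)
SAMPLE = """
<html><body>
  <h1>Sample population table</h1>
  <table>
    <tr><td>1</td><td><a href="/a/">Country A</a></td><td>1,000</td></tr>
    <tr><td>2</td><td><a href="/b/">Country B</a></td><td>2,000</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())


def first_text(node, path):
    """Rough analogue of .get(): text of the first match, or None."""
    match = node.find(path)
    return None if match is None else match.text


def all_texts(node, path):
    """Rough analogue of .getall(): text of every match, as a list."""
    return [match.text for match in node.findall(path)]


print(first_text(root, ".//h1"))
print(all_texts(root, ".//td/a"))
```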

## Build the spider
Update your spider with the XPath expressions you tried in the shell. Your file, <i>worldometer.py</i>, will now look like:

In [None]:
import scrapy

class WorldometersSpider(scrapy.Spider):
    name = 'worldometer'
    # allowed_domains takes bare domain names, without a scheme or path
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        # Extract the page title and all country names
        title = response.xpath('//h1/text()').get()
        countries = response.xpath('//td/a/text()').getall()

        # Return the extracted data
        yield {
            'titles': title,
            'countries': countries,
        }

And now you can start crawling with your spider, <b>worldometer</b>. On the terminal, run the line to get a list of all the countries:

In [None]:
scrapy crawl worldometer

## Export the data extracted as a .csv file
Finally, let's update the spider to save the extracted data. Notice where the data for each row is contained. 

<p style="text-align:center;"><img src="./assets/inspect-by-element-row.png" alt="scrapy" width="800"/></p>

Your file, <i>worldometer.py</i>, will now look like:

In [None]:
import scrapy

class WorldometersSpider(scrapy.Spider):
    name = 'worldometer'
    # allowed_domains takes bare domain names, without a scheme or path
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        rows = response.xpath('//tr')

        for row in rows:
            countries = row.xpath('./td/a/text()').get()
            population = row.xpath('./td[3]/text()').get()
        
            yield {
                'countries': countries,
                'population': population,
            }
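
The spider loops over every `<tr>` and uses paths relative to the row (`./td/a` and `./td[3]`), so each yielded item pairs one country with its own population. The sketch below reproduces that per-row logic on a made-up table with the standard library's `xml.etree.ElementTree`, as a stand-in for Scrapy's selectors. The sample markup, placeholder values, and helper names are our own assumptions. It also shows that a header row, which contains `<th>` cells instead of `<td>` cells, produces an item whose values are both `None`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample table with a header row (placeholder values)
SAMPLE_TABLE = """
<table>
  <tr><th>#</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td><a href="/a/">Country A</a></td><td>1,000</td></tr>
  <tr><td>2</td><td><a href="/b/">Country B</a></td><td>2,000</td></tr>
</table>
"""


def text_or_none(node):
    """Return the text of a matched element, or None if nothing matched."""
    return None if node is None else node.text


def extract_rows(table_xml):
    """Yield one dict per table row, using paths relative to that row."""
    root = ET.fromstring(table_xml.strip())
    for row in root.iter("tr"):
        yield {
            "countries": text_or_none(row.find("./td/a")),
            "population": text_or_none(row.find("./td[3]")),
        }


items = list(extract_rows(SAMPLE_TABLE))
for item in items:
    print(item)
```

Rows like the first one, where nothing matched, are worth filtering out when you analyze the exported data.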

Lastly, on the terminal, run the line:

In [None]:
scrapy crawl worldometer -o population.csv
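
The population values are scraped as text, typically with commas as thousands separators, so they need a little cleaning before you can do arithmetic with them. Here is a minimal sketch using Python's built-in `csv` module. It assumes the file has the two columns yielded by the spider (`countries` and `population`) and uses a small made-up sample in place of the real file; the helper names `parse_population` and `load_population` are our own.

```python
import csv
import io

# Hypothetical sample in the shape of population.csv (placeholder values),
# including one row with no country name
SAMPLE_CSV = '''countries,population
,
Country A,"1,000"
Country B,"2,000"
'''


def parse_population(text):
    """Convert a string such as '1,000' to the integer 1000."""
    return int(text.replace(",", ""))


def load_population(file_obj):
    """Read rows from a CSV file object into a {country: population} dict."""
    result = {}
    for row in csv.DictReader(file_obj):
        if not row["countries"]:
            continue  # skip rows without a country name
        result[row["countries"]] = parse_population(row["population"])
    return result


population = load_population(io.StringIO(SAMPLE_CSV))
print(population)
```

To use it on the exported file, open the file and pass the handle instead of the sample, for example `with open("population.csv", newline="", encoding="utf-8") as f: population = load_population(f)`.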

## …and now we make it pretty!
Learn more about data visualization libraries and visit these notebooks: 
<ul>
<li><a href="https://github.com/adelaiderobinson/Group_4_Plotly">Plotly</a>, an open source interactive graphing library</li>
<li><a href="https://github.com/gabriellensmith/eds-217-scipy-tutorial">SciPy</a>, a library for fundamental algorithms for scientific computing</li>
</ul>