# STA 141B Lecture 10

The class website is <https://github.com/2019-winter-ucdavis-sta141b/notes>

### Announcements

### Topics

* Web Scraping

### Datasets

* [CUESA's Vegetable Seasons Chart](https://cuesa.org/eat-seasonally/charts/vegetables)
* [Craigslist Apartments](https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa)

### References

* [__requests__ documentation](http://docs.python-requests.org/en/master/)
* [__requests-html__ documentation](https://html.python-requests.org/)
* [MDN HTML Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
* [XPath Diner](http://www.topswagcode.com/xpath/) -- an interactive XPath tutorial
* [CSS Diner](https://flukeout.github.io/) -- an interactive CSS Selector tutorial
* Python for Data Analysis, Ch. 6
* Python for Data Analysis, Ch. 7.3 (to review string processing)


## Web Scraping

```python
# Our usual data science tools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

# Web scraping tools
import lxml.html as lx
import requests
import requests_cache

requests_cache.install_cache("../mycache")
```

## CSS Selectors

To use CSS selectors in __lxml__, you also need to install the __cssselect__ package. To install for Anaconda, run

```shell
conda install -c anaconda cssselect
```

in an Anaconda Prompt (Windows) or Terminal (macOS).

If you don't use Anaconda, install the package with `pip install cssselect`.

### Example: CUESA Vegetable Seasons

CUESA (Center for Urban Education about Sustainable Agriculture) provides [a chart](https://cuesa.org/eat-seasonally/charts/vegetables) that shows when vegetables are in season. Let's scrape the chart.

```python
# Download the page
response = requests.get("https://cuesa.org/eat-seasonally/charts/vegetables")
response.raise_for_status()

# Parse the HTML
html = lx.fromstring(response.text)
html

# Find the table with the vegetable seasons
tab = html.xpath("//table")[0]

rows = tab.xpath(".//tr")
rows = rows[1:]

# Extract the header
header = rows[0]

header = [x.text for x in header.xpath(".//li")]

# To be continued in the next lecture...
```

Can we generalize our scraper to the [chart](https://cuesa.org/eat-seasonally/charts/fruit) for fruit and nuts?
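
One possible starting point is to wrap the header-extraction steps in a function that takes the page text, so the same code can be tried on either chart. This is only a sketch: the function name `extract_header` is hypothetical, the sample HTML is made up, and it assumes the fruit chart uses the same table layout as the vegetable chart, which has not been checked here.

```python
import lxml.html as lx


def extract_header(page_text):
    """Return the header labels from the first table in an HTML page.

    Mirrors the steps used for the vegetable chart: take the first table,
    drop its first row, and read the <li> text from the next row.
    """
    html = lx.fromstring(page_text)
    tab = html.xpath("//table")[0]
    rows = tab.xpath(".//tr")[1:]
    return [li.text for li in rows[0].xpath(".//li")]


# Made-up stand-in for a downloaded page; the real chart's markup may differ.
sample = """
<html><body><table>
  <tr><td>Title row</td></tr>
  <tr><td><ul><li>Jan</li><li>Feb</li><li>Mar</li></ul></td></tr>
  <tr><td>Apples</td></tr>
</table></body></html>
"""

header = extract_header(sample)
print(header)
```

With a live page, the argument would be `response.text` from a `requests.get(...)` call on either chart's URL.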

### Example: Craigslist Apartments

[Craigslist](https://www.craigslist.org/) is a popular website where people can post advertisements for free. We can use data from Craigslist to analyze the local rental market for apartments.

Craigslist doesn't provide an API, so we have to scrape the data ourselves. Scraping Craigslist is the biggest challenge we've faced yet, since each ad is on a separate page.

We can start by scraping the front page of the [apartments section](https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa) for links to individual ads.
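
Collecting the links could look roughly like the sketch below. The function name `extract_ad_links`, the `ad-link` class, and the sample HTML are all hypothetical. Craigslist's actual markup is not shown in these notes, so the XPath expression has to be chosen by inspecting the real page.

```python
import lxml.html as lx


def extract_ad_links(page_text, base_url, link_xpath):
    """Return the unique absolute URLs selected by link_xpath, in page order."""
    html = lx.fromstring(page_text)
    # Turn relative hrefs such as "/ads/1.html" into full URLs.
    html.make_links_absolute(base_url)
    links = [str(href) for href in html.xpath(link_xpath)]
    # Drop duplicates while keeping the first occurrence of each link.
    return list(dict.fromkeys(links))


# Made-up listing page; the class name "ad-link" is an assumption.
sample = """
<html><body>
  <a class="ad-link" href="/ads/1.html">One</a>
  <a class="ad-link" href="/ads/2.html">Two</a>
  <a class="ad-link" href="/ads/1.html">One again</a>
  <a href="/about">About</a>
</body></html>
"""

links = extract_ad_links(
    sample,
    base_url="https://example.org/search",
    link_xpath="//a[@class='ad-link']/@href",
)
print(links)
```

Each collected URL can then be downloaded and parsed in turn, since each ad is on a separate page.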