# STA 141B Lecture 10

The class website is <https://github.com/2019-winter-ucdavis-sta141b/notes>

### Announcements

* Participation grade estimates will be posted this weekend
* Assignment 2 grades as well
* Feedback on project proposals as well

### Topics

* Web Scraping

### Datasets

* [CUESA's Vegetable Seasons Chart](https://cuesa.org/eat-seasonally/charts/vegetables)
* [Craigslist Apartments](https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa)

### References

* [__requests__ documentation](http://docs.python-requests.org/en/master/)
* [__requests-html__ documentation](https://html.python-requests.org/)
* [MDN HTML Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
* [XPath Diner](http://www.topswagcode.com/xpath/) -- an interactive XPath tutorial
* [CSS Diner](https://flukeout.github.io/) -- an interactive CSS Selector tutorial
* Python for Data Analysis, Ch. 6
* Python for Data Analysis, Ch. 7.3 (to review string processing)


## Web Scraping

In [59]:
# Our usual data science tools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp # other science tools
# statsmodels -- "traditional" statistical models
# scikit-learn -- machine learning models
import seaborn as sns
#from plotnine import *

%matplotlib inline

# Web scraping tools
import lxml.html as lx
import requests
import requests_cache

requests_cache.install_cache("../craigslist")

## CSS Selectors

To use CSS selectors in __lxml__, you also need to install the __cssselect__ package. To install for Anaconda, run

```shell
conda install -c anaconda cssselect
```

in an Anaconda Prompt (Windows) or Terminal (macOS).

If you don't use Anaconda, install the package with `pip install cssselect` instead.

### Example: CUESA Vegetable Seasons

CUESA (Center for Urban Education about Sustainable Agriculture) provides [a chart](https://cuesa.org/eat-seasonally/charts/vegetables) that shows when vegetables are in season. Let's scrape the chart.
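
Before the full scraper, here is a minimal sketch of the XPath calls it relies on, run against a small made-up fragment. The markup is an assumption for illustration only (the real CUESA page may be structured differently); the `season` and `market` class names are the ones the scraper below checks for.

```python
import lxml.html as lx

# Hypothetical fragment; the real CUESA markup may differ.
snippet = """
<table>
  <tbody>
    <tr>
      <td><a href="/food/artichokes">Artichokes</a></td>
      <td><ul>
        <li class="month"></li>
        <li class="month season"></li>
        <li class="month market"></li>
      </ul></td>
    </tr>
  </tbody>
</table>
"""

doc = lx.fromstring(snippet)
# Turn relative links such as "/food/artichokes" into full URLs.
doc.make_links_absolute("https://cuesa.org/eat-seasonally/charts/vegetables")

row = doc.xpath("//tr")[0]          # "//" searches the whole document
link = row.xpath(".//a")[0]         # ".//" searches only inside this row
name = link.text_content()
url = link.get("href")              # .get() returns None if href is missing
classes = row.xpath(".//li/@class") # "@class" selects attribute values

print(name, url, list(classes))
```

The leading `.` in `.//a` matters: without it, the search would start from the top of the document rather than from the current row.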

In [57]:
def extract_row(row):
    """Return [name, url, one True/False flag per month] for one table row."""
    # The first link in the row holds the food's name and its page URL.
    link = row.xpath(".//a")[0]
    name = link.text_content()
    # One way to get href. Indexing raises an error if the attribute is
    # missing; link.get("href", "") returns a default value instead.
    url = link.xpath("@href")[0]

    # Get the class of each li tag in this row (one li per month).
    seasons = row.xpath(".//li/@class")
    # A month is flagged when its class contains "season" or "market".
    # Vectorized alternative: pd.Series(seasons).str.contains("season|market")
    return [name, url] + ["season" in x or "market" in x for x in seasons]

def scrape_cuesa_chart(base_url, food = None):
    """Scrape a CUESA seasons chart into a data frame, one row per food."""
    # By default, name the first column after the last part of the URL,
    # e.g. "vegetables" for ".../charts/vegetables".
    if food is None:
        food = base_url.rsplit("/", 1)[-1]

    # Download the page
    response = requests.get(base_url)
    response.raise_for_status()

    # Parse the HTML and convert relative links to absolute ones
    html = lx.fromstring(response.text)
    html.make_links_absolute(base_url)

    # Find the table with the seasons
    tab = html.xpath("//table")
    print(tab)
    tab = tab[0]

    # The table has exactly two children: the header and the body
    h, b = tab.getchildren()

    # Extract the header (one li tag per month)
    header = [food, "url"] + [x.text for x in h.xpath(".//li")]

    # Extract one list of values per table row
    rows = b.xpath(".//tr")

    return pd.DataFrame([extract_row(row) for row in rows], columns = header)

scrape_cuesa_chart("https://cuesa.org/eat-seasonally/charts/vegetables")

[<Element table at 0x7f0ab1c04e58>]


|   | vegetables | url | Jan | Feb | Mar | Apr | May | June | July | Aug | Sept | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Artichokes | https://cuesa.org/food/artichokes | False | False | True | True | True | True | False | False | True | True | True | True |
| 1 | Arugula | https://cuesa.org/food/arugula | True | True | True | True | True | True | True | True | True | True | True | True |
| 2 | Asparagus | https://cuesa.org/food/asparagus | False | True | True | True | True | True | False | False | False | False | False | False |
| 3 | Beets | https://cuesa.org/food/beets | True | True | True | True | True | True | True | True | True | True | True | True |
| 4 | Bok choy | https://cuesa.org/food/bok-choy | True | True | True | True | True | True | True | True | True | True | True | True |
| 5 | Broccoli | https://cuesa.org/food/broccoli | True | True | True | True | True | True | True | True | True | True | True | True |
| 6 | Broccoli rabe | https://cuesa.org/food/broccoli-rabe | True | True | True | True | True | True | False | False | True | True | True | True |
| 7 | Brussels sprouts | https://cuesa.org/food/brussels-sprouts | True | True | True | True | True | False | False | False | True | True | True | True |
| 8 | Burdock | https://cuesa.org/food/burdock | False | False | False | False | False | False | True | True | True | True | True | True |
| 9 | Cabbage | https://cuesa.org/food/cabbage | True | True | True | True | True | True | True | True | True | True | True | True |


In [43]:
# Check if "season" or "market" is in "month market"
"season" in "month market" or "market" in "month market"

True
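
The `in` test above matches substrings, which is what `extract_row` relies on. As a sketch of the trade-off, the block below compares it with a stricter variant that splits the class attribute into whole class names first. The function names are made up for this example, and `"month offseason"` is a hypothetical class value that is not taken from the CUESA page; it is only there to show where the two tests disagree.

```python
def in_season_substring(css_class):
    """Substring test, as used in extract_row."""
    return "season" in css_class or "market" in css_class

def in_season_token(css_class):
    """Stricter variant: compare whole class names, not substrings."""
    return not {"season", "market"}.isdisjoint(css_class.split())

# "month offseason" is a hypothetical class used only to show the difference.
examples = ["month", "month season", "month market", "month offseason"]

substring_flags = [in_season_substring(c) for c in examples]
token_flags = [in_season_token(c) for c in examples]

print(substring_flags)  # [False, True, True, True]
print(token_flags)      # [False, True, True, False]
```

For this chart the substring test is enough, but the token version is safer if a page uses class names that happen to contain one another.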

Can we generalize our scraper to the [chart](https://cuesa.org/eat-seasonally/charts/fruit) for fruit and nuts?

In [50]:
scrape_cuesa_chart("https://cuesa.org/eat-seasonally/charts/fruit", "fruit")

[<Element table at 0x7f0ab1bebb38>]


|   | fruit | url | Jan | Feb | Mar | Apr | May | June | July | Aug | Sept | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Almonds | https://cuesa.org/food/almonds | True | True | True | True | True | True | True | True | True | True | True | True |
| 1 | Apples | https://cuesa.org/food/apples | True | True | True | True | True | True | True | True | True | True | True | True |
| 2 | Apricots | https://cuesa.org/food/apricots | False | False | False | False | True | True | True | False | False | False | False | False |
| 3 | Apriums | https://cuesa.org/food/apriums | False | False | False | False | True | True | False | False | False | False | False | False |
| 4 | Asian pears | https://cuesa.org/food/asian-pears | True | True | True | True | False | False | False | False | True | True | True | True |
| 5 | Avocados | https://cuesa.org/food/avocados | False | True | True | True | True | True | True | True | True | True | True | True |
| 6 | Blackberries | https://cuesa.org/food/blackberries | False | False | False | False | True | True | True | True | True | True | False | False |
| 7 | Blueberries | https://cuesa.org/food/blueberries | False | False | False | False | True | True | True | True | False | False | False | False |
| 8 | Boysenberries | https://cuesa.org/food/boysenberries | False | False | False | False | False | True | True | False | False | False | False | False |
| 9 | Cactus pears | https://cuesa.org/food/cactus-pears | False | False | False | False | True | True | True | True | True | True | True | False |


In [54]:
"https://cuesa.org/eat-seasonally/charts/fruit".rsplit("/", 1)[-1]

'fruit'
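
`rsplit("/", 1)[-1]` works for the two chart URLs used here, but it returns an empty string if the URL ends with a slash, and it keeps any query string. A slightly more defensive sketch, using the standard library's `urllib.parse`, is shown below. The helper name `last_path_segment` and the variant URLs are made up for this example.

```python
from urllib.parse import urlsplit

def last_path_segment(url):
    """Return the last non-empty part of the URL's path."""
    # urlsplit separates the path from the query string and fragment.
    path = urlsplit(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]

base = "https://cuesa.org/eat-seasonally/charts/fruit"

plain = last_path_segment(base)
with_slash = last_path_segment(base + "/")
with_query = last_path_segment(base + "?page=2")  # hypothetical query string

# For comparison: the plain rsplit on a URL with a trailing slash
naive_with_slash = (base + "/").rsplit("/", 1)[-1]

print(plain, with_slash, with_query, repr(naive_with_slash))
```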

### Example: Craigslist Apartments

[Craigslist](https://www.craigslist.org/) is a popular website where people can post advertisements for free. We can use data from Craigslist to analyze the local rental market for apartments.

Craigslist doesn't provide an API, so we have to scrape the data ourselves. This is the biggest scraping challenge we've faced yet, because each ad is on a separate page: we first have to collect the links to the ads, then download and parse each ad individually.

We can start by scraping the front page of the [apartments section](https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa) for links to individual ads.

In [71]:
start_url = "https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa"

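def scrape_front_page(url):
    """Return the next page's URL and the ad links from one listing page."""
    response = requests.get(url)
    response.raise_for_status()
    html = lx.fromstring(response.text)
    html.make_links_absolute(url)

    # Get the href of every <a> tag whose class contains "result-title"
    links = html.xpath("//a[contains(@class, 'result-title')]/@href")

    # Get the link to the next page of results.
    # Note: [0] raises an IndexError if the page has no "next" link.
    next_page = html.xpath("//a[contains(@class, 'next')]/@href")[0]

    return next_page, links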

next_page, links = scrape_front_page(start_url)
scrape_front_page(next_page)

('https://sacramento.craigslist.org/search/apa?s=240',
 ['https://sacramento.craigslist.org/apa/d/sacramento-spacious-floorplans-great/6813774585.html',
  'https://sacramento.craigslist.org/apa/d/pool-service-pool-repairs-green-cleans/6813750027.html',
  'https://sacramento.craigslist.org/apa/d/sacramento-half-off-rent-deposit/6813774301.html',
  'https://sacramento.craigslist.org/apa/d/sacramento-spacious-pet-friendly/6813773972.html',
  'https://sacramento.craigslist.org/apa/d/davis-luxurious-2-bdrm-2-bath-available/6813773951.html',
  'https://sacramento.craigslist.org/apa/d/sacramento-screening-room-stainless/6813773785.html',
  'https://sacramento.craigslist.org/apa/d/sacramento-renovated-granite-stainless/6813773548.html',
  'https://sacramento.craigslist.org/apa/d/move-in-immediately-rosemont-park/6813773522.html',
  'https://sacramento.craigslist.org/apa/d/sacramento-upgraded-apartments-granite/6813772697.html',
  'https://sacramento.craigslist.org/apa/d/sacramento-this-2-bedro
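
Since `scrape_front_page` returns the next page's URL along with the links, it can be called in a loop to walk through several pages of results. The block below is a rough sketch of such a loop, not a tested Craigslist scraper. The name `scrape_all_pages` and its parameters are made up for this example, and it assumes the page-scraping function returns `None` as the next page when there are no more results (the version above would raise an `IndexError` instead, so it would need a small change). A dictionary of fake pages stands in for the website so the sketch runs offline.

```python
import time

def scrape_all_pages(start_url, scrape_page, max_pages=5, delay=0.0):
    """Follow next-page links, collecting ad links along the way.

    scrape_page(url) must return (next_page, links), with next_page set
    to None when there is no further page (an assumption of this sketch).
    """
    all_links = []
    seen = set()
    url = start_url
    # Stop at the last page, on a repeated URL, or after max_pages pages.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        url, links = scrape_page(url)
        all_links.extend(links)
        if url is not None and delay > 0:
            time.sleep(delay)  # pause between requests to be polite
    return all_links

# Stand-in for scrape_front_page: maps each page to (next_page, links).
fake_site = {
    "page1": ("page2", ["ad1", "ad2"]),
    "page2": ("page3", ["ad3"]),
    "page3": (None, ["ad4"]),
}

collected = scrape_all_pages("page1", fake_site.get)
first_two_pages = scrape_all_pages("page1", fake_site.get, max_pages=2)
print(collected)
print(first_two_pages)
```

The `max_pages` cap and the `seen` set guard against downloading far more than intended, or looping forever if a page links back to an earlier one.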

In [87]:
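link = links[0]

response = requests.get(link)
try:
    response.raise_for_status()
except requests.HTTPError:
    # Catch only download errors; a bare except would hide every other bug.
    print("The url couldn't be downloaded!")

html = lx.fromstring(response.text)

# Find the first element whose class contains "price"
price = html.xpath("//*[contains(@class, 'price')]")[0]

# Alternative using CSS selectors:
# html.cssselect(".price")

# The element with id "titletextonly" holds the ad's title
title = html.cssselect("#titletextonly")[0].text_content()

# Get the text of each attribute tag.
# Alternative using CSS selectors: html.cssselect("p.attrgroup span")
[x.text_content() for x in html.xpath("//p[contains(@class, 'attrgroup')]/span")]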

['1BR / 1Ba',
 '625ft2',
 'available feb 7',
 'cats are OK - purrr',
 'dogs are OK - wooof',
 'apartment',
 'laundry on site',
 'no smoking',
 'off-street parking']
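
To analyze the ads, these strings eventually have to become numbers. The block below is one possible sketch, using regular expressions from the standard library on the output shown above. The function name `parse_attributes` and the patterns are assumptions based only on this single ad; other ads may format their attributes differently, so the patterns would likely need adjusting.

```python
import re

def parse_attributes(attrs):
    """Pull bedrooms, bathrooms, and square footage out of attribute strings."""
    info = {"bedrooms": None, "bathrooms": None, "sqft": None, "other": []}
    for attr in attrs:
        text = attr.strip()
        # Patterns guessed from one ad, e.g. "1BR / 1Ba" and "625ft2"
        rooms = re.fullmatch(r"(\d+)BR\s*/\s*(\d+(?:\.\d+)?)Ba", text)
        area = re.fullmatch(r"(\d+)ft2", text)
        if rooms:
            info["bedrooms"] = int(rooms.group(1))
            info["bathrooms"] = float(rooms.group(2))
        elif area:
            info["sqft"] = int(area.group(1))
        else:
            # Keep anything unrecognized for later inspection
            info["other"].append(text)
    return info

attrs = ['1BR / 1Ba', '625ft2', 'available feb 7', 'cats are OK - purrr',
         'dogs are OK - wooof', 'apartment', 'laundry on site',
         'no smoking', 'off-street parking']

info = parse_attributes(attrs)
print(info["bedrooms"], info["bathrooms"], info["sqft"])
print(info["other"])
```

Keeping the unmatched strings in `"other"` makes it easy to see which attributes the patterns miss once the function is run on more ads.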