# Lecture 9

February 03, 2022

### Announcements

* HW 3 due next week
* HW 1 grade available this Friday

### Topics

* Web Scraping

### Datasets

* [CUESA's Vegetable Seasons Chart](https://cuesa.org/eat-seasonally/charts/vegetables)
* [Craigslist Apartments](https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa)

### References

* [__requests__ documentation](http://docs.python-requests.org/en/master/)
* [__requests-html__ documentation](https://html.python-requests.org/)
* [MDN HTML Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
* [XPath Diner](http://www.topswagcode.com/xpath/) -- an interactive XPath tutorial
* [CSS Diner](https://flukeout.github.io/) -- an interactive CSS Selector tutorial
* Python for Data Analysis, Ch. 6
* Python for Data Analysis, Ch. 7.3 (to review string processing)


## A remark on `<span>`


The `<span>` tag is an inline container used to mark up a part of a text, or a part of a document.
    
For example, the code
```
<p>My hat is <span style="color:blue">blue</span>.</p>
```
is displayed as

<p>My hat is <span style="color:blue">blue</span>.</p>
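
In a scraper, a `<span>` can be selected like any other element. The following is a minimal sketch using `lxml.html` (the library used later in these notes) on the snippet above; the variable names are only illustrative.

```python
import lxml.html as lx

# Parse the one-paragraph snippet; for a single element,
# fromstring returns that element (here the <p>).
p = lx.fromstring('<p>My hat is <span style="color:blue">blue</span>.</p>')

# Select the <span> inside the paragraph
span = p.xpath(".//span")[0]

span_text = span.text_content()   # text inside the span only
span_style = span.get("style")    # value of the style attribute
full_text = p.text_content()      # text of the paragraph, tags removed

print(span_text, span_style, full_text, sep=" | ")
```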

## A remark on XML

In the last lecture, we discussed HTML and XML. 

__XML__ (Extensible Markup Language) is a markup language similar to _HTML_, but without predefined tags to use. Instead, you define your own tags designed specifically for your needs. 

What is a markup language?

Markup is information added to a document that enhances its meaning in certain ways, in that it identifies the parts and how they relate to each other. More specifically, a markup language is a set of symbols that can be placed in the text of a document to demarcate and label the parts of that document. [Reference](https://www.tutorialspoint.com/xml/xml_overview.htm)


In many web applications, XML is used to store or transport data, while HTML is used to format and display the same data.

An online XML editor can be found [here](https://jsonformatter.org/xml-editor).

For example, 
```
<message>
   <text>Hello, world!</text>
</message>
```

This snippet includes markup symbols, or tags, such as `<message>...</message>` and `<text>...</text>`. The tags `<message>` and `</message>` mark the start and the end of the XML fragment. The tags `<text>` and `</text>` surround the text "Hello, world!".



Another example:

```
<?xml version = "1.0" encoding = "UTF-8"?>
<contact-info>
   <name>Tanmay Patil</name>
   <company>TutorialsPoint</company>
   <phone>(011) 123-4567</phone>
</contact-info>
```

* XML Declaration: `<?xml version = "1.0" encoding = "UTF-8"?>`

This part is optional. If the document contains an XML declaration, it must be the first statement of the XML document.

* Element Syntax: Each XML element must be closed, either with a matching end tag (as in `<name>...</name>`) or, for an empty element, with a self-closing tag such as `<contact-info/>`.

* Nesting of Elements: XML elements can be nested inside other elements, e.g., `name`, `company`, and `phone` are nested inside `contact-info`.

* The names of XML elements are case-sensitive!

* Each XML document has exactly one root element. It encloses all the other elements. In this example, the root element is `contact-info`.

Sample XML examples: [here](https://docs.oracle.com/cd/E19857-01/819-6518/abxii/index.html)
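
The rules above can be checked from Python. The following is a small sketch using the standard library's `xml.etree.ElementTree` (not the `lxml` package used later in these notes); the helper `is_well_formed` is a hypothetical name introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# The declaration must be the very first thing in the document,
# so the bytes literal starts with it directly.
doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<contact-info>
   <name>Tanmay Patil</name>
   <company>TutorialsPoint</company>
   <phone>(011) 123-4567</phone>
</contact-info>"""

root = ET.fromstring(doc)
root_tag = root.tag                          # the single root element
child_tags = [child.tag for child in root]   # the nested elements
name = root.find("name").text
wrong_case = root.find("Name")               # None: names are case-sensitive


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(root_tag, child_tags, name, wrong_case)
print(is_well_formed("<contact-info/>"))                     # self-closing element
print(is_well_formed("<message><text>Hello</message>"))      # <text> is never closed
```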

## Writing Scrapers

In the last lecture, we scraped data from Wikipedia.



The pandas `read_html` function searches a page for `<table>` elements. Within each table, it reads only the rows (`<tr>`) and the header and data cells (`<th>` and `<td>`) inside them. `<td>` stands for “table data” and `<th>` for “table header”.

Let's inspect the Wikipedia list of [US cities by area][wiki].

[wiki]: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area

When the data we want isn't in a `table` element, or when the request is refused (for example with a 403 Forbidden error), we have to write our own scraper.

The workflow for writing a scraper is the same regardless of the language you use:

1. Download pages with an HTTP request (usually `GET`)
2. Parse pages to extract text
3. Clean up extracted text with string methods or regex
4. Save cleaned results
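
The steps above can be sketched in a few lines. To keep the example runnable offline, step 1 is replaced by an inline HTML string (in a real scraper this text would come from `requests.get(url).text`); the page content and the function names `parse`, `clean`, and `save` are made up for illustration.

```python
import csv
import io
import re

import lxml.html as lx

# Step 1 (stand-in): the text of a downloaded page
PAGE = """
<html><body>
  <ul>
    <li>  Artichokes </li>
    <li>Bitter
        melon</li>
  </ul>
</body></html>
"""


def parse(page_text):
    """Step 2: parse the page and extract the raw text of each <li>."""
    html = lx.fromstring(page_text)
    return [li.text_content() for li in html.xpath("//li")]


def clean(text):
    """Step 3: collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def save(items, handle):
    """Step 4: write the cleaned items as a one-column CSV."""
    writer = csv.writer(handle)
    writer.writerow(["name"])
    for item in items:
        writer.writerow([item])


names = [clean(x) for x in parse(PAGE)]
buffer = io.StringIO()   # stands in for an open file
save(names, buffer)
print(buffer.getvalue())
```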

### Example: CUESA Vegetable Seasons



CUESA (Center for Urban Education about Sustainable Agriculture) provides [a chart](https://cuesa.org/eat-seasonally/charts/vegetables) that shows when vegetables are in season. Let's scrape the chart.

In [12]:
import pandas as pd
result = pd.read_html("https://cuesa.org/eat-seasonally/charts/vegetables")

# does not work 

HTTPError: HTTP Error 403: Forbidden

In [13]:
import requests
import lxml.html as lx

# lxml: processing XML and HTML with Python
# Note: there is no lxml.xml module to import; lxml.html is the HTML interface

In [14]:
# Download the page
response = requests.get("https://cuesa.org/eat-seasonally/charts/vegetables")

In [15]:
# Raise an exception if the download failed (note the parentheses)
response.raise_for_status()
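
Note that `raise_for_status` is a method: written without parentheses it is only displayed, not called, so no check takes place. The sketch below illustrates the difference without using the network by building `requests.Response` objects by hand, which is done here only for demonstration; real responses come from `requests.get`.

```python
import requests

# A hand-made response that mimics a refused request
forbidden = requests.Response()
forbidden.status_code = 403

# Without parentheses nothing is checked; this is just the method object
method = forbidden.raise_for_status

# With parentheses, a 4xx or 5xx status raises an exception
try:
    forbidden.raise_for_status()
    raised = False
except requests.HTTPError as err:
    raised = True
    print("Download failed:", err)

# For a successful status, the call returns None and execution continues
ok = requests.Response()
ok.status_code = 200
result = ok.raise_for_status()
```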

In [16]:
# Parse the HTML
html = lx.fromstring(response.text)
html

<Element html at 0x7f81ea3c3130>

In [18]:
# Find the table with the vegetable seasons
tab = html.xpath("//table")[0]
tab

<Element table at 0x7f81ea3cbae0>

In [20]:
rows = tab.xpath(".//tr")
rows = rows[1:]
rows

[<Element tr at 0x7f81ea3cb950>,
 <Element tr at 0x7f81ea3cbc70>,
 <Element tr at 0x7f81ea3cb770>,
 <Element tr at 0x7f81ea3cbd10>,
 <Element tr at 0x7f81ea3cb860>,
 <Element tr at 0x7f81ea3cb8b0>,
 <Element tr at 0x7f81ea3cb9a0>,
 <Element tr at 0x7f81ea3cbdb0>,
 <Element tr at 0x7f81ea3cbe00>,
 <Element tr at 0x7f81ea3cbe50>,
 <Element tr at 0x7f81ea3cbea0>,
 <Element tr at 0x7f81ea3cbef0>,
 <Element tr at 0x7f81ea3cbf40>,
 <Element tr at 0x7f81ea3cbf90>,
 <Element tr at 0x7f81ea3d2040>,
 <Element tr at 0x7f81ea3d2090>,
 <Element tr at 0x7f81ea3d20e0>,
 <Element tr at 0x7f81ea3d2130>,
 <Element tr at 0x7f81ea3d2180>,
 <Element tr at 0x7f81ea3d21d0>,
 <Element tr at 0x7f81ea3d2220>,
 <Element tr at 0x7f81ea3d2270>,
 <Element tr at 0x7f81ea3d22c0>,
 <Element tr at 0x7f81ea3d2310>,
 <Element tr at 0x7f81ea3d2360>,
 <Element tr at 0x7f81ea3d23b0>,
 <Element tr at 0x7f81ea3d2400>,
 <Element tr at 0x7f81ea3d2450>,
 <Element tr at 0x7f81ea3d24a0>,
 <Element tr at 0x7f81ea3d24f0>,
 ...
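
The leading dot in `.//tr` matters: it makes the search relative to the element it is called on. The sketch below shows the difference on a made-up page with two tables (the HTML is purely illustrative).

```python
import lxml.html as lx

page = lx.fromstring("""
<html><body>
  <table id="first"><tr><td>a</td></tr><tr><td>b</td></tr></table>
  <table id="second"><tr><td>c</td></tr></table>
</body></html>
""")

second = page.xpath("//table")[1]

# ".//tr" searches only below the element the method is called on
relative = second.xpath(".//tr")

# "//tr" always searches the whole document, even when called on a table
absolute = second.xpath("//tr")

print(len(relative), len(absolute))
```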

In [21]:
# Extract the header
header = rows[0]
header = [x.text for x in header.xpath(".//li")]
header

['Jan',
 'Feb',
 'Mar',
 'Apr',
 'May',
 'June',
 'July',
 'Aug',
 'Sept',
 'Oct',
 'Nov',
 'Dec']

Next, we want to extract each vegetable's name and URL.
Eventually, we want a function that returns the whole vegetable table.

In [51]:
base_url = "https://cuesa.org/eat-seasonally/charts/vegetables"
response = requests.get(base_url)
response.raise_for_status()

In [36]:
# Parse the HTML
html = lx.fromstring(response.text)
html.make_links_absolute(base_url)
html

<Element html at 0x7f81ec77ed10>

In [39]:
# Find the table with the vegetable seasons
tab = html.xpath("//table")
print(tab)
tab = tab[0]
tab

[<Element table at 0x7f81ec77eb30>]


<Element table at 0x7f81ec77eb30>

In [42]:
h, b = tab.getchildren()
print(h)
print(b)

<Element thead at 0x7f81ec754900>
<Element tbody at 0x7f81ec7549a0>


In [45]:
# Use the last part of the URL as the name of the first column
food = base_url.rsplit("/", 1)[-1]
food

'vegetables'

In [47]:
# Extract the header
header = [food, "url"] + [x.text for x in h.xpath(".//li")] # prepend the food and url column names
header # now the header looks good

['vegetables',
 'url',
 'Jan',
 'Feb',
 'Mar',
 'Apr',
 'May',
 'June',
 'July',
 'Aug',
 'Sept',
 'Oct',
 'Nov',
 'Dec']

In [56]:
rows = b.xpath(".//tr") # get the rows of the table body
rows

[<Element tr at 0x7f81ec713f40>,
 <Element tr at 0x7f81ec713cc0>,
 <Element tr at 0x7f81ec7180e0>,
 <Element tr at 0x7f81ea3c30e0>,
 <Element tr at 0x7f81ec726c70>,
 <Element tr at 0x7f81ec726bd0>,
 <Element tr at 0x7f81ec726b80>,
 <Element tr at 0x7f81ec71c6d0>,
 <Element tr at 0x7f81ea3d1e50>,
 <Element tr at 0x7f81ec77eb80>,
 <Element tr at 0x7f81ec77e4f0>,
 <Element tr at 0x7f81ec77e950>,
 <Element tr at 0x7f81ec77e0e0>,
 <Element tr at 0x7f81ec77e400>,
 <Element tr at 0x7f81ec77e450>,
 <Element tr at 0x7f81ec77e630>,
 <Element tr at 0x7f81ec77e810>,
 <Element tr at 0x7f81ec77e860>,
 <Element tr at 0x7f81ec77e090>,
 <Element tr at 0x7f81ec6fb1d0>,
 <Element tr at 0x7f81ec6fbae0>,
 <Element tr at 0x7f81ec6fb950>,
 <Element tr at 0x7f81ec6fb680>,
 <Element tr at 0x7f81ec7720e0>,
 <Element tr at 0x7f81ec772130>,
 <Element tr at 0x7f81ec772c20>,
 <Element tr at 0x7f81ec772f40>,
 <Element tr at 0x7f81ec7723b0>,
 <Element tr at 0x7f81ec772220>,
 <Element tr at 0x7f81ec772b80>,
 ...

In [None]:
row = rows[0]
link = row.xpath(".//a")[0]
name = link.text_content()
# One way to get href:
url = link.xpath("@href")[0]
url
# Use .get() to get a default value if the attribute doesn't exist.
#link.attrib.get("style", "")

# Get li tags for this row
seasons = row.xpath(".//li/@class")

In [58]:
row = rows[0]
link = row.xpath(".//a")[0]
link

<Element a at 0x7f81ec77da90>

In [60]:
name = link.text_content() # name of the link
name

'Artichokes'

In [61]:
# One way to get href:
url = link.xpath("@href")[0]
url

'https://cuesa.org/food/artichokes'
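
The call to `make_links_absolute` earlier is why the `href` above is a full URL. Assuming the raw `href` in the page is site-relative (the value `/food/artichokes` below is a guess, not taken from the page source), the effect on a single link can be sketched with the standard library's `urljoin`:

```python
from urllib.parse import urljoin

base_url = "https://cuesa.org/eat-seasonally/charts/vegetables"

# Hypothetical raw href as it might appear in the page source
raw_href = "/food/artichokes"

# Resolve the relative link against the URL of the page it came from
absolute = urljoin(base_url, raw_href)

# A link that is already absolute is left unchanged
unchanged = urljoin(base_url, "https://cuesa.org/food/arugula")

print(absolute)
print(unchanged)
```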

In [63]:
# Get li tags for this row
seasons = row.xpath(".//li/@class")
seasons

['month',
 'month',
 'month season',
 'month season',
 'month season',
 'month season',
 'month',
 'month',
 'month season',
 'month season',
 'month season',
 'month season']

In [66]:
tab_line = [name, url] + ["season" in x or "market" in x for x in seasons]
tab_line

['Artichokes',
 'https://cuesa.org/food/artichokes',
 False,
 False,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 True]

Now we want to do this for all rows:

In [23]:
def extract_row(row):
    """Convert one <tr> element of the chart into a list of values."""
    # The first link in the row holds the name and URL of the food
    link = row.xpath(".//a")[0]
    name = link.text_content()
    # One way to get href:
    url = link.xpath("@href")[0]
    # Use .get() to get a default value if the attribute doesn't exist.
    #link.attrib.get("style", "")

    # Get the class of each li tag (one per month) for this row
    seasons = row.xpath(".//li/@class")
    # A month counts as in season if its class contains "season" or "market"
    return [name, url] + ["season" in x or "market" in x for x in seasons]

def scrape_cuesa_chart(base_url, food = None):
    """Scrape a CUESA seasonality chart into a data frame."""
    # By default, name the first column after the last part of the URL
    if food is None:
        food = base_url.rsplit("/", 1)[-1]

    # Download the page
    response = requests.get(base_url)
    response.raise_for_status()

    # Parse the HTML
    html = lx.fromstring(response.text)
    html.make_links_absolute(base_url)

    # Find the table with the seasons
    tab = html.xpath("//table")
    print(tab)
    tab = tab[0]

    # The children of the table are its <thead> and <tbody>
    h, b = tab.getchildren()

    # Extract the header
    header = [food, "url"] + [x.text for x in h.xpath(".//li")]

    rows = b.xpath(".//tr")

    return pd.DataFrame([extract_row(row) for row in rows], columns = header)

scrape_cuesa_chart("https://cuesa.org/eat-seasonally/charts/vegetables")

[<Element table at 0x7f81ea3f6400>]


| | vegetables | url | Jan | Feb | Mar | Apr | May | June | July | Aug | Sept | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Artichokes | https://cuesa.org/food/artichokes | False | False | True | True | True | True | False | False | True | True | True | True |
| 1 | Arugula | https://cuesa.org/food/arugula | True | True | True | True | True | True | True | True | True | True | True | True |
| 2 | Asparagus | https://cuesa.org/food/asparagus | False | True | True | True | True | True | False | False | False | False | False | False |
| 3 | Beets | https://cuesa.org/food/beets | True | True | True | True | True | True | True | True | True | True | True | True |
| 4 | Bitter melon | https://cuesa.org/food/bitter-melon | False | False | False | False | False | True | True | True | True | True | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 65 | Tatsoi | https://cuesa.org/food/tatsoi | True | True | True | True | False | False | False | False | False | True | True | True |
| 66 | Tomatillos | https://cuesa.org/food/tomatillos | False | False | False | False | False | True | True | True | True | True | True | False |
| 67 | Tomatoes | https://cuesa.org/food/tomatoes | False | False | False | True | True | True | True | True | True | True | True | False |
| 68 | Turnips | https://cuesa.org/food/turnips | True | True | True | True | True | True | True | True | True | True | True | True |


In [24]:
# Check if "season" or "market" is in "month market"
"season" in "month market" or "market" in "month market"

True

In [68]:
"season" in "month market"

False
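
The checks above are substring tests on the whole `class` attribute, so they would also match a class name that merely contains the word. A slightly stricter variant, sketched below, splits the attribute into its individual class names first. The function name `in_season` and the class `preseason` are hypothetical and only serve to show the difference.

```python
def in_season(class_attr):
    """Return True if the class attribute lists 'season' or 'market' as a class."""
    tokens = class_attr.split()   # "month season" -> ["month", "season"]
    return "season" in tokens or "market" in tokens


examples = ["month", "month season", "month market", "month preseason"]
strict = [in_season(x) for x in examples]
substring = ["season" in x or "market" in x for x in examples]

print(strict)
print(substring)
```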

Can we generalize our scraper to the [chart](https://cuesa.org/eat-seasonally/charts/fruit) for fruit and nuts?

In [25]:
scrape_cuesa_chart("https://cuesa.org/eat-seasonally/charts/fruit", "fruit")

[<Element table at 0x7f81ea3d5e50>]


| | fruit | url | Jan | Feb | Mar | Apr | May | June | July | Aug | Sept | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Almonds | https://cuesa.org/food/almonds | True | True | True | True | True | True | True | True | True | True | True | True |
| 1 | Apples | https://cuesa.org/food/apples | True | True | True | True | True | True | True | True | True | True | True | True |
| 2 | Apricots | https://cuesa.org/food/apricots | False | False | False | False | True | True | True | False | False | False | False | False |
| 3 | Apriums | https://cuesa.org/food/apriums | False | False | False | False | True | True | False | False | False | False | False | False |
| 4 | Asian pears | https://cuesa.org/food/asian-pears | True | True | True | True | False | False | False | False | True | True | True | True |
| 5 | Avocados | https://cuesa.org/food/avocados | False | True | True | True | True | True | True | True | True | True | True | True |
| 6 | Blackberries | https://cuesa.org/food/blackberries | False | False | False | False | True | True | True | True | True | True | False | False |
| 7 | Blueberries | https://cuesa.org/food/blueberries | False | False | False | False | True | True | True | True | False | False | False | False |
| 8 | Boysenberries | https://cuesa.org/food/boysenberries | False | False | False | False | False | True | True | False | False | False | False | False |
| 9 | Cactus pears | https://cuesa.org/food/cactus-pears | False | False | False | False | True | True | True | True | True | True | True | False |


In [26]:
"https://cuesa.org/eat-seasonally/charts/fruit".rsplit("/", 1)[-1]

'fruit'

### Example: Craigslist Apartments

[Craigslist](https://www.craigslist.org/) is a popular website where people can post advertisements for free. We can use data from Craigslist to analyze the local rental market for apartments.

Craigslist doesn't provide an API, so we have to scrape the data ourselves. Scraping Craigslist is the biggest challenge we've faced yet, since each ad is on a separate page.

We can start by scraping the front page of the [apartments section](https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa) for links to individual ads.

In [27]:
# Our usual data science tools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp # other science tools
# statsmodels -- "traditional" statistical models
# scikit-learn -- machine learning models
import seaborn as sns
#from plotnine import *

%matplotlib inline

# Web scraping tools
import lxml.html as lx
import requests
import requests_cache

requests_cache.install_cache("../craigslist")

In [28]:
start_url = "https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa"

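def scrape_front_page(url):
    """Return the URL of the next results page and the ad links on this page."""
    response = requests.get(url)
    response.raise_for_status()
    html = lx.fromstring(response.text)
    html.make_links_absolute(url)

    # Get the href of all <a> tags with class "result-title"
    links = html.xpath("//a[contains(@class, 'result-title')]/@href")

    # Get the link to the next page of results
    # (this raises an IndexError if the page has no "next" link)
    next_page = html.xpath("//a[contains(@class, 'next')]/@href")[0]

    return next_page, links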

next_page, links = scrape_front_page(start_url)
scrape_front_page(next_page)

('https://sacramento.craigslist.org/d/apartments-housing-for-rent/search/apa?s=240',
 ['https://sacramento.craigslist.org/apa/d/sacramento-move-in-special-the-core/7407010198.html',
  'https://sacramento.craigslist.org/apa/d/great-location-limited-time-only-apply/7407018440.html',
  'https://sacramento.craigslist.org/apa/d/roseville-bedroom-bathroom-duplex/7405747529.html',
  'https://sacramento.craigslist.org/apa/d/sacramento-plush-carpet-woodgrain/7407017329.html',
  'https://sacramento.craigslist.org/apa/d/sacramento-luxurious-bedroom-units/7393782386.html',
  'https://sacramento.craigslist.org/apa/d/woodland-welcome-to-your-new-home/7407023660.html',
  'https://sacramento.craigslist.org/apa/d/sacramento-1bd-1ba-home-for-rent/7407021134.html',
  'https://sacramento.craigslist.org/apa/d/sacramento-1st-floor-studio-with/7406958058.html',
  'https://sacramento.craigslist.org/apa/d/west-sacramento-microwave-ceiling-fan/7407020161.html',
  'https://sacramento.craigslist.org/apa/d/antelop
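
To collect links from several result pages, the "next" link can be followed in a loop. The following is only a sketch: `collect_links` is a hypothetical helper, and it assumes a page-scraping function that returns `None` for the next page when there is none, which differs from `scrape_front_page` as written above. The demonstration uses a fake scraper so that it runs without the network.

```python
import time


def collect_links(start_url, scrape_page, max_pages=3, delay=1.0):
    """Follow 'next' links for at most max_pages pages and gather all ad links.

    scrape_page(url) must return (next_url_or_None, list_of_links).
    """
    links = []
    url = start_url
    for page_number in range(max_pages):
        if page_number > 0:
            time.sleep(delay)   # pause between requests to be polite
        next_url, page_links = scrape_page(url)
        links.extend(page_links)
        if not next_url:
            break               # no further pages
        url = next_url
    return links


# Fake site with three pages, used instead of real requests
FAKE_PAGES = {
    "page1": ("page2", ["ad1", "ad2"]),
    "page2": ("page3", ["ad3"]),
    "page3": (None, ["ad4"]),
}


def fake_scrape(url):
    return FAKE_PAGES[url]


all_links = collect_links("page1", fake_scrape, delay=0)
first_two = collect_links("page1", fake_scrape, max_pages=2, delay=0)
print(all_links)
print(first_two)
```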

In [29]:
link = links[0]

response = requests.get(link)
try:
    response.raise_for_status()
except requests.HTTPError:
    print("The url couldn't be downloaded!")

html = lx.fromstring(response.text)

# Price of the apartment
price = html.xpath("//*[contains(@class, 'price')]")[0]

# Alternative using CSS selectors:
# html.cssselect(".price") 

# Title of the ad
title = html.cssselect("#titletextonly")[0].text_content()

# Attributes of the apartment (bedrooms, size, pets, ...)
#html.cssselect("p.attrgroup span")
[x.text_content() for x in html.xpath("//p[contains(@class, 'attrgroup')]/span")]

['3BR / 2.5Ba',
 '1366ft2',
 'available nov 15',
 'cats are OK - purrr',
 'dogs are OK - wooof',
 'flooring: tile',
 'house',
 'w/d hookups',
 'no smoking',
 'attached garage',
 'rent period: monthly']
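
The extracted attributes are still plain strings, so the next step of the workflow is to clean them. The sketch below pulls numbers out of the first two entries with regular expressions. The function `parse_attrs` and its output keys are hypothetical, and the patterns only cover the formats seen in the output above; other ads may use different formats.

```python
import re

attrs = ['3BR / 2.5Ba', '1366ft2', 'available nov 15', 'cats are OK - purrr',
         'dogs are OK - wooof', 'flooring: tile', 'house', 'w/d hookups',
         'no smoking', 'attached garage', 'rent period: monthly']


def parse_attrs(attrs):
    """Extract bedrooms, bathrooms, and square footage from attribute strings."""
    record = {"bedrooms": None, "bathrooms": None, "sqft": None}
    for text in attrs:
        # e.g. "3BR / 2.5Ba"
        match = re.fullmatch(r"(\d+)BR / ([\d.]+)Ba", text)
        if match:
            record["bedrooms"] = int(match.group(1))
            record["bathrooms"] = float(match.group(2))
            continue
        # e.g. "1366ft2"
        match = re.fullmatch(r"(\d+)ft2", text)
        if match:
            record["sqft"] = int(match.group(1))
    return record


record = parse_attrs(attrs)
print(record)
```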