# STA 141B Lecture 10

February 08, 2022

### Announcements

* HW3 due next week

### Topics

* Web Scraping
* Debugging
* (If time permits) Text Mining / Natural Language Processing

### Datasets

* [Craigslist Apartments](https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa)

### References

+ Web Scraping
    * [MDN HTML Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
    * [XPath Diner](http://www.topswagcode.com/xpath/) -- an interactive XPath tutorial
    * [CSS Diner](https://flukeout.github.io/) -- an interactive CSS Selector tutorial
+ Natural Language Processing
    * [Natural Language Processing with Python][nlpp], chapters 1-3. Beware: the print version is for Python 2.
    * [Applied Text Analysis with Python][atap], chapters 1, 3.

[PDSH]: https://jakevdp.github.io/PythonDataScienceHandbook/
[ProGit]: https://git-scm.com/book/
[nlpp]: https://www.nltk.org/book/
[atap]: https://search.library.ucdavis.edu/primo-explore/fulldisplay?docid=01UCD_ALMA51320822340003126&context=L&vid=01UCD_V1&search_scope=everything_scope&tab=default_tab&lang=en_US

## Homework 1 graded

#### Summary

* Average 8.85
* 4 students got 10
* 2 students received the extra credit (posted on Piazza)

#### Remarks

* 1.4: Common issues:
    - didn't check whether the input is empty first;
    - didn't exclude non-numeric types other than `str`;
    - didn't accept `float` values;
    
    ```
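    from statistics import mean  # assumption: any mean() for a list of numbers works here

    def better_mean(x):
        # An empty input has no mean.
        if len(x) == 0:
            return None

        # Reject the input if any element is not numeric.
        for elt in x:
            if not isinstance(elt, (int, float)):
                return None

        return mean(x)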
    ```

* 1.6: Both `sort()` and `sorted()` can be used to sort a list:
    - `sort()` will modify the list it is called on
    - `sorted()` will create a new list containing a sorted version of the list it is given
    
* 1.7: Can you find initial __guesses__ to get both roots? Start from a positive value to get the positive root and from a negative value to get the negative root.

* 1.9: Sample code:
```
def fib(n):
    if n == 0:
        return "0"
    
    prev = "0"
    curr = "01"
    
    while n > 1:
        prev, curr = curr, curr + prev
        n -= 1
    
    return curr
```
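To make the difference noted in remark 1.6 concrete, here is a small sketch. The list `scores` is made up for illustration.

```python
scores = [3, 1, 2]

# sorted() builds a new list and leaves the original untouched.
ordered = sorted(scores)
print(ordered)  # [1, 2, 3]
print(scores)   # [3, 1, 2]

# list.sort() reorders the list in place and returns None.
result = scores.sort()
print(scores)   # [1, 2, 3]
print(result)   # None
```

A common mistake is writing `scores = scores.sort()`, which replaces the list with `None`.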


## Web Scraping

Recall that last week we scraped [CUESA's Vegetable Seasons Chart](https://cuesa.org/eat-seasonally/charts/vegetables).

In [1]:
# Our usual data science tools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Web scraping tools
import lxml.html as lx
import requests
import requests_cache

# requests_cache.install_cache("../craigslist1")  #use cache

### Example: Craigslist Apartments

[Craigslist](https://www.craigslist.org/) is a popular website where people can post advertisements for free. We can use data from Craigslist to analyze the local rental market for apartments.

Craigslist doesn't provide an API, so we have to scrape the data ourselves. Scraping Craigslist is the biggest challenge we've faced yet, since each ad is on a separate page.

We can start by scraping the front page of the [apartments section](https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa) for links to individual ads.

In [None]:
craigslist_url = "https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa"

response = requests.get(craigslist_url)
response.raise_for_status()
html = lx.fromstring(response.text)
html.make_links_absolute(craigslist_url)

# html.text_content()

`make_links_absolute(base_href)` makes all links in the document absolute, assuming that `base_href` is the URL of the document. For example, if you pass `base_href="http://localhost/foo/bar.html"` and the document contains a link to `baz.html`, that link is rewritten as `http://localhost/foo/baz.html`.

More explanation: [here](https://linuxtut.com/en/e03431c718b94d6304ff/)
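The same kind of link resolution can be tried without lxml, using the standard library's `urllib.parse.urljoin`. This is only a sketch of the idea, not a claim about how lxml is implemented; the second and third links are made-up examples.

```python
from urllib.parse import urljoin

base_href = "http://localhost/foo/bar.html"

# A relative link is resolved against the directory of the base document.
relative = urljoin(base_href, "baz.html")
print(relative)       # http://localhost/foo/baz.html

# A root-relative link replaces the whole path of the base URL.
root_relative = urljoin(base_href, "/search/apa")
print(root_relative)  # http://localhost/search/apa

# A link that is already absolute is left unchanged.
absolute = urljoin(base_href, "https://example.com/page.html")
print(absolute)       # https://example.com/page.html
```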

In [None]:
# Get all <a> tags with class "result-title"
links = html.xpath("//a[contains(@class, 'result-title')]/@href")
links

In [None]:
next_page = html.xpath("//a[contains(@class, 'next')]/@href")[0]
next_page

In [None]:
# make it a function

def scrape_front_page(url):
    response = requests.get(url)
    response.raise_for_status()
    html = lx.fromstring(response.text)
    html.make_links_absolute(url)

    # html

    # Get all <a> tags with class "result-title"
    links = html.xpath("//a[contains(@class, 'result-title')]/@href")
    
    next_page = html.xpath("//a[contains(@class, 'next')]/@href")[0]
    
    return next_page, links

next_page, links = scrape_front_page(craigslist_url)
#scrape_front_page(next_page)
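Since the function returns the next page's URL, one possible way to gather links from several result pages is a loop like the sketch below. The names `collect_links`, `fake_scrape`, `FAKE_SITE`, and `max_pages` are hypothetical, and the fake pages stand in for real requests. The sketch also assumes the scraping function returns `None` when there is no next page, whereas `scrape_front_page` as written would raise an `IndexError` there.

```python
def collect_links(start_url, scrape_page, max_pages=5):
    """Follow 'next' links, gathering ad links from each results page."""
    all_links = []
    url = start_url
    pages_seen = 0
    # Stop when there is no next page or the page limit is reached.
    while url is not None and pages_seen < max_pages:
        next_url, links = scrape_page(url)
        all_links.extend(links)
        url = next_url
        pages_seen += 1
    return all_links

# Hypothetical stand-in for a website: url -> (next page, links on the page).
FAKE_SITE = {
    "page1": ("page2", ["ad1", "ad2"]),
    "page2": ("page3", ["ad3"]),
    "page3": (None, ["ad4"]),
}

def fake_scrape(url):
    return FAKE_SITE[url]

print(collect_links("page1", fake_scrape))  # ['ad1', 'ad2', 'ad3', 'ad4']
```

Passing the scraping function as an argument makes the loop easy to test without touching the network. With real requests, it is polite to cap the number of pages, as `max_pages` does here.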

In [None]:
len(links)

In [None]:
response = requests.get(links[0])
try:
    response.raise_for_status()
except requests.HTTPError:
    # raise_for_status() raises HTTPError for 4xx and 5xx responses
    print("The url couldn't be downloaded!")

In [None]:
html = lx.fromstring(response.text) # Parses an HTML document or fragment from a string.
html.text_content()

In [None]:
# get the price
price = html.xpath("//*[contains(@class, 'price')]")[0]
price.text_content()

html.cssselect(".price")[0].text_content()

In [None]:
# using cssselect https://www.w3schools.com/cssref/css_selectors.asp
html.cssselect("#titletextonly")[0].text_content()
html.cssselect("#postingbody")[0].text_content()

In [None]:
html.xpath("//p[contains(@class, 'attrgroup')]/span")[1].text_content()

In [None]:
coords = html.cssselect("#map")[0]
lon = coords.attrib.get("data-longitude")
lat = coords.attrib.get("data-latitude")
(lat, lon)

In [None]:
time = html.cssselect("time.timeago")[0]
time = time.attrib.get("datetime")
time

In [None]:
def scrape_one_post(link):
    response = requests.get(link)
    try:
        response.raise_for_status()
    except requests.HTTPError:
        print("The url couldn't be downloaded!")

    html = lx.fromstring(response.text) # Parses an HTML document or fragment from a string.

    #if len(html.cssselect(".removed")):
        # Deleted post
    #    return {"price": None}
    
    #try:
    price = html.xpath("//*[contains(@class, 'price')]")[0]
    price = price.text_content()
    # except IndexError:
    #    price = None

    # Alternative using CSS selectors:
    # html.cssselect(".price") 

    title = html.cssselect("#titletextonly")[0].text_content()

    #html.cssselect("p.attrgroup span")
    attribs = [x.text_content() for x in html.xpath("//p[contains(@class, 'attrgroup')]/span")]

    text = html.cssselect("#postingbody")[0].text_content()

    #img = html.cssselect("div .first img")[0]
    #img_url = img.attrib.get("src")
    #img_resp = requests.get(img_url)
    #img_resp.raise_for_status()
    #img_resp.content

    #img_url

    # Next step: save image to file with open() and .write()
    # Or we could use the wget package

    coords = html.cssselect("#map")[0]
    lon = coords.attrib.get("data-longitude")
    lat = coords.attrib.get("data-latitude")

    time = html.cssselect("time.timeago")[0]
    time = time.attrib.get("datetime")

    return {"text": text, "attribs": attribs, "lat": lat, "lon": lon, "time": time, "title": title, "price": price}

scrape_one_post(links[0])
# scrape_one_post(links[1])
# scrape_one_post(links[20])

In [None]:
posts = [scrape_one_post(u) for u in links]

In [None]:
response = requests.get(links[54])
response.raise_for_status()
html = lx.fromstring(response.text)
price = html.xpath("//*[contains(@class, 'price')]")
len(price) # no matches, so price[0] raises IndexError (index out of range)
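One way to guard against an empty result like this is a small helper that returns `None` instead of raising an error. The name `first_or_none` is hypothetical. Because `xpath()` and `cssselect()` return lists, plain lists with made-up values stand in for them in this sketch.

```python
def first_or_none(matches):
    """Return the first item of a list of matches, or None if it is empty."""
    if len(matches) == 0:
        return None
    return matches[0]

print(first_or_none([]))                    # None
print(first_or_none(["$1,500", "$1,500"]))  # $1,500
```

The caller can then check for `None` before calling `.text_content()` on the result.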

In [None]:
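# Scrape the posts one at a time, printing a counter after each success.
# The last number printed is the 0-based index of the post that fails.
for i, link in enumerate(links, start=1):
    post = scrape_one_post(link)
    print(i)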
    

In [None]:
# How to improve our function?
def scrape_one_post(link):
    response = requests.get(link)
    try:
        response.raise_for_status()
    except requests.HTTPError:
        print("The url couldn't be downloaded!")

    html = lx.fromstring(response.text) # Parses an HTML document or fragment from a string.

    #if len(html.cssselect(".removed")):
        # Deleted post
    #    return {"price": None}
    
    #try:
    prices = html.xpath("//*[contains(@class, 'price')]")
    if len(prices) == 0:
        # No price element on the page, so stop early.
        return {"price": None}
    price = prices[0].text_content()
    # except IndexError:
    #    price = None

    # Alternative using CSS selectors:
    # html.cssselect(".price") 

    title = html.cssselect("#titletextonly")[0].text_content()

    #html.cssselect("p.attrgroup span")
    attribs = [x.text_content() for x in html.xpath("//p[contains(@class, 'attrgroup')]/span")]

    text = html.cssselect("#postingbody")[0].text_content()

    #img = html.cssselect("div .first img")[0]
    #img_url = img.attrib.get("src")
    #img_resp = requests.get(img_url)
    #img_resp.raise_for_status()
    #img_resp.content

    #img_url

    # Next step: save image to file with open() and .write()
    # Or we could use the wget package

    coords = html.cssselect("#map")[0]
    lon = coords.attrib.get("data-longitude")
    lat = coords.attrib.get("data-latitude")

    time = html.cssselect("time.timeago")[0]
    time = time.attrib.get("datetime")

    return {"text": text, "attribs": attribs, "lat": lat, "lon": lon, "time": time, "title": title, "price": price}

scrape_one_post(links[54])
# scrape_one_post(links[1])
# scrape_one_post(links[20])

In [None]:
posts = [scrape_one_post(u) for u in links]

In [None]:
pd.DataFrame(posts)

## Natural Language Processing

A _natural language_ is a language people use to communicate, like English, Spanish, or Mandarin. These languages evolved over thousands of years and do not have simple, explicit rules.

_Natural language processing_ (NLP) means using a computer to analyze, manipulate, or synthesize natural language. Some examples of NLP tasks are:
* Translating from one language to another
* Recognizing speech or handwriting
* Tagging sentences with metadata, such as parts of speech (verbs, nouns, etc) or sentiment
* Extracting information or computing statistics from text

Compared to artificial languages like Python and XML, it's much more difficult to extract information from natural languages. NLP is a wide field; we only have time to learn the absolute basics. If you want to learn more, consider reading the entire [Natural Language Processing with Python][nlpp] book or taking a class in computational linguistics.

[nlpp]: https://www.nltk.org/book/


### The Python NLP Ecosystem

There are lots of Python packages for NLP (try searching online)! A few popular ones are:

* [Natural Language Toolkit][nltk] (__nltk__) is the most popular. It's designed for learning and research, so it's well-documented and has lots of features.
* [TextBlob][textblob] is a "simplified" package. It has a nicer interface than NLTK, but fewer features.
* [SpaCy][spacy] is a "production-ready" package, and the fastest of all the packages listed here. Useful for working with large natural language datasets.
* [gensim][gensim] is a package for creating topic models, which are a kind of statistical model that predicts the topics of a text.

We're going to learn __nltk__, but you might want to try some of the others if your project involves NLP.

[Stanford's Core NLP][CoreNLP] library is at the cutting edge of NLP research. It's developed in Java, but several Python packages provide an interface (such as [pynlp][] and [stanford-corenlp][]).

[nltk]: https://www.nltk.org/
[spacy]: https://spacy.io/
[textblob]: https://textblob.readthedocs.io/en/dev/
[gensim]: https://radimrehurek.com/gensim/
[CoreNLP]: https://stanfordnlp.github.io/CoreNLP/
[pynlp]: https://github.com/sina-al/pynlp
[stanford-corenlp]: https://github.com/Lynten/stanford-corenlp

### Installing NLTK

In an Anaconda Prompt (Win) or Terminal (MacOS & Linux), run:

```shell
conda install -c anaconda nltk
```

Then try:

In [None]:
import nltk

### Corpora and Documents

A _document_ is a single body of text. When working with natural language data, documents are the unit of observation.

What you choose as a document depends on the purpose of your analysis. If you're studying how people react to news on Twitter, it makes sense to use individual tweets as documents. If you're studying how animals are portrayed in 19th-century literature, you could use individual novels as documents.

A _corpus_ is a collection of documents. In other words, a corpus is a dataset.

__nltk__ provides some example corpora in the `nltk.corpus` submodule. The documentation gives a [complete list](http://www.nltk.org/nltk_data/). Most have to be downloaded with `nltk.download()` before use.

In [None]:
import nltk.corpus

# Download books from Project Gutenberg
nltk.download("gutenberg")

The `.fileids()` method lists the documents in a corpus.

In [None]:
nltk.corpus.gutenberg.fileids()

The `.raw()` method returns the raw text for a single document. Specify the document by its file ID.

In [None]:
alice = nltk.corpus.gutenberg.raw("carroll-alice.txt")

In [None]:
alice

### Tokenization

A _token_ is a sequence of characters to be treated as a group. Tokens are the unit of analysis for an individual document.

Tokens can represent paragraphs, sentences, words, or something else. Most of the time, tokens will be words.

When you analyze a document, the first step will usually be to split the document into tokens. Functions that do this are called _tokenizers_, and this process is called _tokenization_.

The `nltk.sent_tokenize()` function splits a document into sentences, and the `nltk.word_tokenize()` function splits a document into words.

In [None]:
nltk.sent_tokenize(alice)[0]

In [None]:
nltk.word_tokenize(alice)[:10]

Corpora also have `.sents()` and `.words()` methods for tokenization. These methods are specialized to the corpus, so they sometimes use different strategies than `sent_tokenize()` and `word_tokenize()`.

In [None]:
nltk.corpus.gutenberg.sents("carroll-alice.txt")[0]

In [None]:
nltk.corpus.gutenberg.words("carroll-alice.txt")[:10]

### Strings, String Methods, and Regular Expressions

How does word tokenization actually work?

The simplest strategy is to split at whitespace. You can do this with Python's built-in string methods:

In [None]:
alice.split()

Splitting on whitespace doesn't handle punctuation. You can use regular expressions to split on more complex patterns. Python's built-in __re__ module provides regular expression functions.

In [None]:
import re

re.split("[ ,.:;!']", alice)

What if we also want to split at newlines?
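One possible answer, sketched on a short made-up string rather than the full text of `alice`: add `\n` to the character class, or use `\s`, which matches any whitespace character. The variable names here are for illustration only.

```python
import re

sample = "Oh dear!\nI shall be late, said the Rabbit."

# Adding \n to the character class splits at newlines too. The trailing +
# treats a run of separators (such as ", " or "!\n") as one split point.
tokens = re.split(r"[ ,.:;!'\n]+", sample)

# The final "." leaves an empty string at the end, so drop empty tokens.
words = [t for t in tokens if t]
print(words)

# \s matches spaces, tabs, and newlines, so this gives the same result here.
words_ws = [t for t in re.split(r"[\s,.:;!']+", sample) if t]
print(words_ws)
```

Without the `+`, every separator character becomes its own split point, so adjacent separators produce empty strings in the result.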