# STA 141B Lecture 11

The class website is <https://github.com/2019-winter-ucdavis-sta141b/notes>

### Announcements

* Project feedback starts arriving today
* Assignment 4 will be posted this afternoon

### Topics

* Web Scraping
* Text Mining / Natural Language Processing

### Datasets

* [Craigslist Apartments](https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa)

### References

+ Web Scraping
    * [MDN HTML Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
    * [XPath Diner](http://www.topswagcode.com/xpath/) -- an interactive XPath tutorial
    * [CSS Diner](https://flukeout.github.io/) -- an interactive CSS Selector tutorial
+ Natural Language Processing
    * [Natural Language Processing with Python][nlpp], chapters 1-3. Beware: the print version is for Python 2.
    * [Applied Text Analysis with Python][atap], chapters 1, 3.

[PDSH]: https://jakevdp.github.io/PythonDataScienceHandbook/
[ProGit]: https://git-scm.com/book/
[nlpp]: https://www.nltk.org/book/
[atap]: https://search.library.ucdavis.edu/primo-explore/fulldisplay?docid=01UCD_ALMA51320822340003126&context=L&vid=01UCD_V1&search_scope=everything_scope&tab=default_tab&lang=en_US

## Being Efficient

In [8]:
sum([1, 2, 3])

6

In [9]:
# NOT LIKE THIS:
sum([1, 2, 3]) * sum([1, 2, 3])

36

In [10]:
# LIKE THIS:
x = sum([1, 2, 3])
x * x

36

In [14]:
response = requests.get("https://swapi.co/api/planets/1/")
response.raise_for_status()

# Use a variable for the parsed JSON, to avoid parsing twice.
result = response.json()
if "my_key" in result:
    # do something
    x = 3

result

{'name': 'Tatooine',
 'rotation_period': '23',
 'orbital_period': '304',
 'diameter': '10465',
 'climate': 'arid',
 'gravity': '1 standard',
 'terrain': 'desert',
 'surface_water': '1',
 'population': '200000',
 'residents': ['https://swapi.co/api/people/1/',
  'https://swapi.co/api/people/2/',
  'https://swapi.co/api/people/4/',
  'https://swapi.co/api/people/6/',
  'https://swapi.co/api/people/7/',
  'https://swapi.co/api/people/8/',
  'https://swapi.co/api/people/9/',
  'https://swapi.co/api/people/11/',
  'https://swapi.co/api/people/43/',
  'https://swapi.co/api/people/62/'],
 'films': ['https://swapi.co/api/films/5/',
  'https://swapi.co/api/films/4/',
  'https://swapi.co/api/films/6/',
  'https://swapi.co/api/films/3/',
  'https://swapi.co/api/films/1/'],
 'created': '2014-12-09T13:50:49.641000Z',
 'edited': '2014-12-21T20:48:04.175778Z',
 'url': 'https://swapi.co/api/planets/1/'}

In [15]:
# NOT LIKE THIS:
#if ("mykey" in result) == True:
# LIKE THIS:
#if "mykey" in result:

## Web Scraping

In [16]:
# Our usual data science tools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp # other science tools
# statsmodels -- "traditional" statistical models
# scikit-learn -- machine learning models
import seaborn as sns
#from plotnine import *

%matplotlib inline

# Web scraping tools
import lxml.html as lx
import requests
import requests_cache

requests_cache.install_cache("../craigslist1")

### Example: Craigslist Apartments

[Craigslist](https://www.craigslist.org/) is a popular website where people can post advertisements for free. We can use data from Craigslist to analyze the local rental market for apartments.

Craigslist doesn't provide an API, so we have to scrape the data ourselves. Scraping Craigslist is the biggest challenge we've faced yet, since each ad is on a separate page.

We can start by scraping the front page of the [apartments section](https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa) for links to individual ads.

In [17]:
start_url = "https://sacramento.craigslist.org/d/apts-housing-for-rent/search/apa"

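def scrape_front_page(url):
    response = requests.get(url)
    response.raise_for_status()
    html = lx.fromstring(response.text)
    # Convert relative links to absolute URLs so they can be requested later.
    html.make_links_absolute(url)

    # Get all <a> tags with class "result-title"
    links = html.xpath("//a[contains(@class, 'result-title')]/@href")

    # Link to the next page of results; None if there isn't one (last page).
    next_page = html.xpath("//a[contains(@class, 'next')]/@href")
    next_page = next_page[0] if next_page else None

    return next_page, links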

next_page, links = scrape_front_page(start_url)
#scrape_front_page(next_page)
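
Because `scrape_front_page()` returns the URL of the next results page along with the links, the same function can be called repeatedly to walk through several pages. The loop below is a minimal sketch of that idea, not code from the lecture: `scrape_all_pages`, its `scrape_page` argument, and the `max_pages` limit are hypothetical names, and it assumes the page scraper returns `None` as the next-page URL on the last page. A small fake "site" stands in for Craigslist so the example runs without network access.

```python
def scrape_all_pages(start_url, scrape_page, max_pages=5):
    """Follow "next page" links, collecting ad links from each page.

    scrape_page is any function that takes a URL and returns a
    (next_page_url, links) pair, like scrape_front_page().
    """
    all_links = []
    url = start_url
    for _ in range(max_pages):
        if url is None:
            # No next page, so we've reached the end of the results.
            break
        url, links = scrape_page(url)
        all_links.extend(links)
    return all_links


# Stand-in for a real scraper (an assumption for illustration only).
fake_site = {
    "page1": ("page2", ["ad1", "ad2"]),
    "page2": (None, ["ad3"]),
}
print(scrape_all_pages("page1", fake_site.get))
```

With the real scraper, the call would look like `scrape_all_pages(start_url, scrape_front_page)`. Keeping `max_pages` small limits how many requests are sent to the site.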

In [56]:
def scrape_one_post(link):
    response = requests.get(link)
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        print("The url couldn't be downloaded!")
        # Nothing to parse, so treat it like a deleted post.
        return {"price": None}

    html = lx.fromstring(response.text)

    if len(html.cssselect(".removed")):
        # Deleted post
        return {"price": None}

    try:
        price = html.xpath("//*[contains(@class, 'price')]")[0]
        price = price.text_content()
    except IndexError:
        # Some posts don't list a price.
        price = None

    # Alternative using CSS selectors:
    # html.cssselect(".price")

    title = html.cssselect("#titletextonly")[0].text_content()

    #html.cssselect("p.attrgroup span")
    attribs = [x.text_content() for x in html.xpath("//p[contains(@class, 'attrgroup')]/span")]

    text = html.cssselect("#postingbody")[0].text_content()

    #img = html.cssselect("div .first img")[0]
    #img_url = img.attrib.get("src")
    #img_resp = requests.get(img_url)
    #img_resp.raise_for_status()
    #img_resp.content

    # Next step: save image to file with open() and .write()
    # Or we could use the wget package

    coords = html.cssselect("#map")[0]
    lon = coords.attrib.get("data-longitude")
    lat = coords.attrib.get("data-latitude")

    time = html.cssselect("time.timeago")[0]
    time = time.attrib.get("datetime")

    # `price` is already a string (or None) at this point.
    return {"text": text, "attribs": attribs, "lat": lat, "lon": lon, "time": time, "title": title, "price": price}

scrape_one_post(links[0])
scrape_one_post(links[1])
scrape_one_post(links[20])

{'price': None}
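
The commented-out image code in `scrape_one_post()` stops at `img_resp.content`, which holds the raw bytes of the image. As a rough sketch of the "save image to file" step mentioned in that comment, the bytes can be written with `open()` in binary mode. The helper name `save_bytes` and the demo file name are made up for illustration, and made-up bytes stand in for a real download.

```python
import os
import tempfile


def save_bytes(content, path):
    """Write raw bytes (for example, img_resp.content) to a file."""
    # "wb" opens the file for writing in binary mode, which images need.
    with open(path, "wb") as f:
        f.write(content)


# Demonstration with placeholder bytes instead of a downloaded image.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jpg")
save_bytes(b"not really an image", demo_path)

# Read the file back to check that the bytes were written unchanged.
with open(demo_path, "rb") as f:
    print(f.read() == b"not really an image")
```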

In [57]:
posts = [scrape_one_post(u) for u in links]

In [58]:
pd.DataFrame(posts)

|   | attribs | lat | lon | price | text | time | title |
|---|---|---|---|---|---|---|---|
| 0 | [1BR / 1Ba, 550ft2, available sep 18, 2018, ca... | 38.567457 | -121.491328 | \$1300 | \n \n QR Code Link to This P... | 2019-02-12T08:54:19-0800 | 1Bed/1Bath in Downtown Sacramento --Free Febru... |
| 1 | [2BR / 1Ba, 824ft2, available mar 20, cats are... | 38.574860 | -121.303779 | \$1100 | \n \n QR Code Link to This P... | 2019-02-12T08:54:00-0800 | Upstairs 2x1 Coming Available In March! Apply ... |
| 2 | [1BR / 1Ba, 550ft2, available dec 14, 2018, ca... | 38.567457 | -121.491328 | \$1300 | \n \n QR Code Link to This P... | 2019-02-12T08:53:58-0800 | Upgraded 1Bed/1Bath in Downtown Sacramento --F... |
| 3 | [1BR / 1Ba, 975ft2, cats are OK - purrr, dogs ... | 38.656002 | -121.512051 | \$1666 | \n \n QR Code Link to This P... | 2019-02-01T15:01:10-0800 | Lush Landscaping, Garden Style Oval Soaking Tu... |
| 4 | [2BR / 2Ba, available feb 12, cats are OK - pu... | 38.610017 | -121.344677 | \$1350 | \n \n QR Code Link to This P... | 2019-02-12T08:53:19-0800 | Awesome 2 bed 2 bath at Woodbridge Place Apart... |
| 5 | [2BR / 1Ba, 815ft2, available feb 13, \n ... | 38.665255 | -121.345901 | \$1295 | \n \n QR Code Link to This P... | 2019-02-12T08:53:10-0800 | Move in specials!!! Interior upgrades!!! Washe... |
| 6 | [2BR / 2Ba, 912ft2, available feb 12, cats are... | 38.673261 | -121.262212 | \$1480 | \n \n QR Code Link to This P... | 2019-02-12T08:52:57-0800 | \*\*\*Move in READY! Come see your NEW HOME!\*\*\* |
| 7 | [2BR / 2Ba, 952ft2, available feb 7, cats are ... | 38.678231 | -121.287165 | \$1499 | \n \n QR Code Link to This P... | 2019-02-07T09:38:23-0800 | Move NOW\*FULL SIZE W/D\*Remodeled\*Garage Availa... |
| 8 | [2BR / 1Ba, 850ft2, available nov 30, 2018, ca... | 38.743599 | -121.280619 | \$1500 | \n \n QR Code Link to This P... | 2019-02-12T08:52:46-0800 | Two Bedroom, One Bathroom Apartment in Rosevil... |
| 9 | [2BR / 1Ba, 814ft2, available feb 13, cats are... | 38.591785 | -121.401072 | \$1375 | \n \n QR Code Link to This P... | 2019-02-12T08:52:30-0800 | \*HURRY IN TODAY\* Last One Left, Downstairs w/ ... |
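
Every scraped value is still a string, so the prices and attributes need cleaning before they can be used in an analysis. The sketch below shows one way this might be done with regular expressions. It assumes the formats visible in the output above (`$1300`, `1BR / 1Ba`, `550ft2`); the function names `parse_price` and `parse_attribs` are hypothetical, and real ads may use formats this does not handle.

```python
import re


def parse_price(price):
    """Convert a price string like '$1300' to a float, or None if missing."""
    if price is None:
        return None
    # Keep only digits and the decimal point.
    digits = re.sub(r"[^\d.]", "", price)
    return float(digits) if digits else None


def parse_attribs(attribs):
    """Pull bedrooms, bathrooms, and square footage out of attribute strings."""
    info = {"bedrooms": None, "bathrooms": None, "sqft": None}
    for attrib in attribs:
        # Matches strings such as "2BR / 1Ba" or "2BR / 1.5Ba".
        rooms = re.search(r"(\d+)BR\s*/\s*(\d+(?:\.\d+)?)Ba", attrib)
        if rooms:
            info["bedrooms"] = int(rooms.group(1))
            info["bathrooms"] = float(rooms.group(2))
        # Matches strings such as "550ft2".
        sqft = re.search(r"(\d+)ft2", attrib)
        if sqft:
            info["sqft"] = int(sqft.group(1))
    return info


print(parse_price("$1300"))
print(parse_attribs(["1BR / 1Ba", "550ft2", "available sep 18"]))
```

Fields that are missing from an ad stay `None`, which matches how `scrape_one_post()` reports a missing price.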


## Natural Language Processing

A _natural language_ is a language people use to communicate, like English, Spanish, or Mandarin. These languages evolved over thousands of years and do not have simple, explicit rules.

_Natural language processing_ (NLP) means using a computer to analyze, manipulate, or synthesize natural language. Some examples of NLP tasks are:
* Translating from one language to another
* Recognizing speech or handwriting
* Tagging sentences with metadata, such as parts of speech (verbs, nouns, etc.) or sentiment
* Extracting information or computing statistics from text

Compared to artificial languages like Python and XML, it's much more difficult to extract information from natural languages. NLP is a wide field; we only have time to learn the absolute basics. If you want to learn more, consider reading the entire [Natural Language Processing with Python][nlpp] book or taking a class in computational linguistics.

[nlpp]: https://www.nltk.org/book/


### The Python NLP Ecosystem

There are lots of Python packages for NLP (try searching online)! A few popular ones are:

* [Natural Language Tool Kit][nltk] (__nltk__) is the most popular. It's designed for learning and research, so it's well-documented and has lots of features.
* [TextBlob][textblob] is a "simplified" package. It has a nicer interface than NLTK, but fewer features.
* [SpaCy][spacy] is a "production-ready" package, and the fastest of all the packages listed here. Useful for working with large natural language datasets.
* [gensim][gensim] is a package for creating topic models, which are statistical models that predict the topics of a text.

We're going to learn __nltk__, but you might want to try some of the others if your project involves NLP.

[Stanford's Core NLP][CoreNLP] library is at the cutting edge of NLP research. It's developed in Java, but several Python packages provide an interface (such as [pynlp][] and [stanford-corenlp][]).

[nltk]: https://www.nltk.org/
[spacy]: https://spacy.io/
[textblob]: https://textblob.readthedocs.io/en/dev/
[gensim]: https://radimrehurek.com/gensim/
[CoreNLP]: https://stanfordnlp.github.io/CoreNLP/
[pynlp]: https://github.com/sina-al/pynlp
[stanford-corenlp]: https://github.com/Lynten/stanford-corenlp

### Installing NLTK

In an Anaconda Prompt (Windows) or Terminal (macOS & Linux), run:

```shell
conda install -c anaconda nltk
```

Then try:

In [59]:
import nltk

### Corpora and Documents

A _document_ is a single body of text. When working with natural language data, documents are the unit of observation.

What you choose as a document depends on the purpose of your analysis. If you're studying how people react to news on Twitter, it makes sense to use individual tweets as documents. If you're studying how animals are portrayed in 19th-century literature, you could use individual novels as documents.

A _corpus_ is a collection of documents. In other words, a corpus is a dataset.

__nltk__ provides some example corpora in the `nltk.corpus` submodule. The documentation gives a [complete list](http://www.nltk.org/nltk_data/). Most have to be downloaded with `nltk.download()` before use.

In [60]:
import nltk.corpus

# Download books from Project Gutenberg
nltk.download("gutenberg")

[nltk_data] Downloading package gutenberg to /home/nick/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

The `.fileids()` method lists the documents in a corpus.

In [62]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

The `.raw()` method returns the raw text for a single document. Specify the document by its file ID.

In [64]:
alice = nltk.corpus.gutenberg.raw("carroll-alice.txt")

### Tokenization

A _token_ is a sequence of characters to be treated as a group. Tokens are the unit of analysis for an individual document.

Tokens can represent paragraphs, sentences, words, or something else. Most of the time, tokens will be words.

When you analyze a document, the first step will usually be to split the document into tokens. Functions that do this are called _tokenizers_, and this process is called _tokenization_.

The `nltk.sent_tokenize()` function splits a document into sentences, and the `nltk.word_tokenize()` function splits a document into words.

In [68]:
nltk.sent_tokenize(alice)[0]

"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I."

In [69]:
nltk.word_tokenize(alice)[:10]

['[',
 'Alice',
 "'s",
 'Adventures',
 'in',
 'Wonderland',
 'by',
 'Lewis',
 'Carroll',
 '1865']

Corpora also have `.sents()` and `.words()` methods for tokenization. These methods are specialized to the corpus, so they sometimes use different strategies than `sent_tokenize()` and `word_tokenize()`.

In [71]:
nltk.corpus.gutenberg.sents("carroll-alice.txt")[0]

['[',
 'Alice',
 "'",
 's',
 'Adventures',
 'in',
 'Wonderland',
 'by',
 'Lewis',
 'Carroll',
 '1865',
 ']']

In [73]:
nltk.corpus.gutenberg.words("carroll-alice.txt")[:10]

['[',
 'Alice',
 "'",
 's',
 'Adventures',
 'in',
 'Wonderland',
 'by',
 'Lewis',
 'Carroll']

### Strings, String Methods, and Regular Expressions

How does word tokenization actually work?

The simplest strategy is to split at whitespace. You can do this with Python's built-in string methods:

In [75]:
alice.split()

["[Alice's",
 'Adventures',
 'in',
 'Wonderland',
 'by',
 'Lewis',
 'Carroll',
 '1865]',
 'CHAPTER',
 'I.',
 'Down',
 'the',
 'Rabbit-Hole',
 'Alice',
 'was',
 'beginning',
 'to',
 'get',
 'very',
 'tired',
 'of',
 'sitting',
 'by',
 'her',
 'sister',
 'on',
 'the',
 'bank,',
 'and',
 'of',
 'having',
 'nothing',
 'to',
 'do:',
 'once',
 'or',
 'twice',
 'she',
 'had',
 'peeped',
 'into',
 'the',
 'book',
 'her',
 'sister',
 'was',
 'reading,',
 'but',
 'it',
 'had',
 'no',
 'pictures',
 'or',
 'conversations',
 'in',
 'it,',
 "'and",
 'what',
 'is',
 'the',
 'use',
 'of',
 'a',
 "book,'",
 'thought',
 'Alice',
 "'without",
 'pictures',
 'or',
 "conversation?'",
 'So',
 'she',
 'was',
 'considering',
 'in',
 'her',
 'own',
 'mind',
 '(as',
 'well',
 'as',
 'she',
 'could,',
 'for',
 'the',
 'hot',
 'day',
 'made',
 'her',
 'feel',
 'very',
 'sleepy',
 'and',
 'stupid),',
 'whether',
 'the',
 'pleasure',
 'of',
 'making',
 'a',
 'daisy-chain',
 'would',
 'be',
 'worth',
 'the',
 'troubl

Splitting on whitespace doesn't handle punctuation: it stays attached to the neighboring word, as in `'bank,'` and `'do:'`. You can use regular expressions to split on more complex patterns. Python's built-in __re__ module provides regular expression functions.

In [78]:
import re

re.split("[ ,.:;!']", alice)

['[Alice',
 's',
 'Adventures',
 'in',
 'Wonderland',
 'by',
 'Lewis',
 'Carroll',
 '1865]\n\nCHAPTER',
 'I',
 '',
 'Down',
 'the',
 'Rabbit-Hole\n\nAlice',
 'was',
 'beginning',
 'to',
 'get',
 'very',
 'tired',
 'of',
 'sitting',
 'by',
 'her',
 'sister',
 'on',
 'the\nbank',
 '',
 'and',
 'of',
 'having',
 'nothing',
 'to',
 'do',
 '',
 'once',
 'or',
 'twice',
 'she',
 'had',
 'peeped',
 'into',
 'the\nbook',
 'her',
 'sister',
 'was',
 'reading',
 '',
 'but',
 'it',
 'had',
 'no',
 'pictures',
 'or',
 'conversations',
 'in\nit',
 '',
 '',
 'and',
 'what',
 'is',
 'the',
 'use',
 'of',
 'a',
 'book',
 '',
 '',
 'thought',
 'Alice',
 '',
 'without',
 'pictures',
 'or\nconversation?',
 '\n\nSo',
 'she',
 'was',
 'considering',
 'in',
 'her',
 'own',
 'mind',
 '(as',
 'well',
 'as',
 'she',
 'could',
 '',
 'for',
 'the\nhot',
 'day',
 'made',
 'her',
 'feel',
 'very',
 'sleepy',
 'and',
 'stupid)',
 '',
 'whether',
 'the',
 'pleasure\nof',
 'making',
 'a',
 'daisy-chain',
 'would',
 '

What if we also want to split at newlines?
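
One possible answer, offered as a sketch rather than the only approach: add `\s` to the character class. In a regular expression, `\s` matches any whitespace character, including spaces, tabs, and newlines. The short `sample` string below is an excerpt used for illustration instead of the full text of `alice`.

```python
import re

sample = "sitting by her sister on the\nbank, and of having nothing to do: once"

# \s matches any whitespace character (spaces, tabs, and newlines).
# The trailing + treats a run of separators as a single split point,
# which avoids most of the empty strings seen above.
tokens = re.split(r"[\s,.:;!']+", sample)

# A separator at the very start or end still leaves an empty string,
# so filter those out.
tokens = [t for t in tokens if t]
print(tokens)
```

The pattern is written as a raw string (`r"..."`) so that Python passes the backslash in `\s` through to the regular expression engine unchanged.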