## BeautifulSoup4

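* To match tags by attribute, pass a dict of attributes: `.findAll("span", {"class": ["green", "red"]})`. Alternatives go in a list under one key; a dict literal that repeats `"class"` keeps only the last value.
* `find()` is `findAll()` that stops at the first match: it returns that single tag (or `None` if nothing matches) rather than a list.
* `nameList = bsObj.findAll(text="the prince")` searches the text content instead of the tags; it returns the strings that are exactly `the prince`.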
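* `.children` are the nodes immediately below the current tag, while `.descendants` goes all the way down the family tree.
* `.next_siblings` (a property, not a method) iterates through the siblings that come after the current tag. Whitespace between tags counts as a sibling too.
* `get_text()` strips all tags and returns just the text. Awesome. Should be the last step though, since a plain string can't be searched for tags anymore!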
* `.parent` goes to the parent tag!
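* `re` can bang pots with `bs4`: a compiled pattern works wherever a string value would, e.g. `images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})`. Note the raw string.
* `find` and `findAll` also accept a function (a lambda works) that takes a tag and returns `True` or `False`, e.g. `bsObj.findAll(lambda tag: len(tag.attrs) == 2)`.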
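
A minimal sketch of the points above, run against a made-up HTML snippet (the markup, class names, and image paths are all hypothetical). It uses `find_all` and `string=`, which are the current spellings of `findAll` and `text=`, and it assumes `bs4` is installed and that the built-in `html.parser` is good enough for the job.

```python
import re
from bs4 import BeautifulSoup, Tag

# Hypothetical HTML, invented for this sketch.
html = """
<div id="story">
  <span class="green">the prince</span>
  <span class="red">Anna</span>
  <span class="blue">the tsar</span>
  <p>Gifts: <img src="../img/gifts/img1.jpg"><img src="../img/logo.png"></p>
</div>
"""
bsObj = BeautifulSoup(html, "html.parser")

# Several acceptable classes go in a list under a single "class" key.
spans = bsObj.find_all("span", {"class": ["green", "red"]})
names = [s.get_text() for s in spans]

# find() hands back the first matching tag itself, not a list.
first = bsObj.find("span")

# string= (formerly text=) searches the text nodes instead of the tags.
princes = bsObj.find_all(string="the prince")

# .children is one level deep, .descendants is everything below.
# Both also yield text nodes, so keep only the real tags here.
div = bsObj.find("div")
child_tags = [c.name for c in div.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in div.descendants if isinstance(d, Tag)]

# .next_siblings is a property; .parent walks one level up.
later_tags = [s.name for s in first.next_siblings if isinstance(s, Tag)]
parent_id = first.parent["id"]

# A compiled regex can stand in for an attribute value.
gift_imgs = bsObj.find_all("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})

# A function that takes a tag and returns True/False also works as a filter.
blue = bsObj.find_all(lambda tag: tag.name == "span" and "blue" in tag.get("class", []))

print(names, child_tags, descendant_tags, later_tags, parent_id)
```

The `isinstance(..., Tag)` checks are there because the whitespace between tags shows up as text nodes in `.children`, `.descendants`, and `.next_siblings`.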

## Scrapy

It's very interesting seeing how a web scraper toolkit is built and applied, but I've held off on using it for now because I don't have an immediate need for heavy-duty web scrapers.

That being said, I ought to eventually rewrite the Baruch webscraper using Scrapy and Regex.

## Regex

About time I learned regex...

In [1]:
import requests
import re

In [2]:
dat = requests.get('http://en.wikipedia.org/wiki/Wikipedia').text

* The `re` module handles all of this.
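* `[a]` is a character class: it matches a single `a`. Nothing about it says "starting with"; anchoring comes from `^` or from using `re.match()`.
* `[a-z]` matches any single lowercase letter.
* `[abc]` will match *any* one of `a`, `b`, or `c`.
* `[^a]` means "any single character that is not `a`".
* `\s` matches any whitespace character, `\S` matches any non-whitespace character. Both also work inside brackets, e.g. `[\s,]`.
* `.` matches anything that isn't a newline. Inside brackets it loses that meaning: `[.]` matches only a literal dot.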
* `*` after a character asks to match 0 or more times.
* `+` after a character asks to match 1 or more times.
* `?` after a character marks it as optional, matching 1 or 0 times.
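* For the most general of cases, `{m,n}` will match at least `m` times and up to `n` times.
* Python string literals and regex both use `\` as an escape character, so a pattern written as a normal string needs doubled backslashes (`"\\d"`). Use raw strings to avoid that, e.g. `r"\d"`.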
* To get a result use *either* `re.compile('[a-z]+').search('apple')` *or* `re.search('[a-z]+', 'apple')`.

In [12]:
re.compile('[a-z]+').search('apple').group()

'apple'

In [13]:
re.search('[a-z]+', 'apple').group()

'apple'
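
A quick sketch to check the character-class and quantifier notes above against the standard `re` module. The test strings are made up; the expected results are in the comments.

```python
import re

# A character class matches ONE character, wherever it occurs.
first_a = re.search(r"[a]", "banana").start()     # 1: the match is not pinned to the start
not_vowels = re.findall(r"[^aeiou]", "banana")    # ['b', 'n', 'n']

# "." is a wildcard outside brackets and a literal dot inside them.
wild = re.findall(r".", "a.b\nc")                 # ['a', '.', 'b', 'c'] (newline skipped)
dots = re.findall(r"[.]", "a.b\nc")               # ['.']

# Quantifiers apply to whatever comes right before them.
star = re.findall(r"ab*", "a ab abbb")            # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")            # ['ab', 'abbb']
optional = re.findall(r"colou?r", "color colour") # ['color', 'colour']
counted = re.findall(r"\d{2,3}", "1 22 333 4444") # ['22', '333', '444']

# A raw string spells the same pattern without doubling the backslash.
same_pattern = "\\d+" == r"\d+"                   # True

print(first_a, not_vowels, wild, dots, star, plus, optional, counted, same_pattern)
```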

* Other methods on a match object: `group()` (returns the text that matched, not the whole input string), `start()`, `end()`, and `span()`.
* If no match is found you get `None` instead of a match object, so check before calling `.group()`.
* `findall()` returns a list of all matches as strings (tuples, if the pattern has more than one group); `finditer()` returns an iterator of match objects.
* `|` is alternation, a logical or: `a|b` matches either `a` or `b`.
* The most useful of the flags: passing `re.MULTILINE` enables multiline mode. `^` and `$` normally anchor to the start and end of the whole string; with the flag they match at the start and end of every line.
* `\b` matches a word boundary. `\B` does the opposite!
* Groupings are made via e.g. `(ab)`. If these are present, `group()` takes an index: `group(0)` (same as `group()`) is the entire match, `group(1)` is the text matched by the first group, and so on. Groups can be nested; they are numbered left to right by where their opening parenthesis appears.
* `groups()`, obviously, fetches all of the groups at once, as a tuple.

* There is more but this is enough for the moment.
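
To pin down the match-object and grouping notes above, here is a small sketch using only the standard `re` module. The sample text and the patterns are invented for illustration.

```python
import re

# Hypothetical two-line input.
text = "alice@example.com\nbob@example.org"

# Nested groups are numbered by the position of their opening parenthesis.
m = re.search(r"((\w+)@(\w+))\.(com|org)", text)
whole = m.group(0)      # 'alice@example.com' (the match, not the whole input)
outer = m.group(1)      # 'alice@example'
user = m.group(2)       # 'alice'
everything = m.groups() # ('alice@example', 'alice', 'example', 'com')
where = m.span()        # (0, 17), i.e. (m.start(), m.end())

# No match means None, so guard before calling .group().
missing = re.search(r"\d", "apple")

# findall() gives strings, or tuples when the pattern has several groups.
pairs = re.findall(r"(\w+)@(\w+)", text)

# finditer() yields match objects, so positions are still available.
spans = [hit.span() for hit in re.finditer(r"\w+@", text)]

# "|" is alternation.
tlds = re.findall(r"com|org", text)

# Without MULTILINE, ^ only matches at the very start of the string.
one_line = re.findall(r"^\w+", text)
per_line = re.findall(r"^\w+", text, re.MULTILINE)

# \b keeps "cat" from matching inside longer words.
cats = re.findall(r"\bcat\b", "cat concat cats")

print(whole, everything, pairs, spans, per_line, cats)
```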