## Introduction

Every data problem begins with a question and ends with a narrative that provides a clear answer. Once you have the question, the next step is getting your data. As a data scientist, you'll spend an incredible amount of time and skill on acquiring, preparing, cleaning, and normalizing your data. In this tutorial, we'll review some of the best tools used in the realm of data acquisition.

But first, let's go into the differences between Data Acquisition, Preparation, and Cleaning. 

### Data Acquisition

Data Acquisition is the process of getting your data, hence the term <i>acquisition</i>. Data doesn't come out of nowhere, so the very first step of any data science problem is going to be getting the data in the first place. 

### Data Preparation

Once you have the data, it might not be in the best format to work with. You might have scraped a bunch of data from a website, but need it in the form of a DataFrame to work with it more easily. This process is called data preparation: putting your data in the format that's easiest to work with.

### Data Cleaning

Even once your data is stored and handled properly, that might still not be enough. You might have missing values or values that need normalizing. Fixing these inconsistencies before analysis is called data cleaning.
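
As a rough illustration of the kind of fixes involved, the sketch below handles missing values and rescales a column in plain Python. The sample values, the `"NA"` missing-value marker, and the choice of mean filling are assumptions made for the example, not a prescribed workflow.

```python
# Hypothetical raw column: numbers stored as strings, with "NA" marking missing values.
raw_values = ["12", "NA", "30", "18", "NA"]

# Step 1: convert to floats, using None for missing entries.
parsed = [None if v == "NA" else float(v) for v in raw_values]

# Step 2: fill missing entries with the mean of the observed values.
observed = [v for v in parsed if v is not None]
mean = sum(observed) / len(observed)
filled = [mean if v is None else v for v in parsed]

# Step 3: min-max normalize so every value falls between 0 and 1.
low, high = min(filled), max(filled)
normalized = [(v - low) / (high - low) for v in filled]

print(filled)
print(normalized)
```

Whether to fill, drop, or flag missing values depends on the analysis; mean filling is used here only because it is easy to follow.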


## Reading, Writing, and Handling Data Files

The simplest way of acquiring data is downloading a file - either from a website, straight from your desktop, or elsewhere. Once the data is downloaded, you'll open the files for reading and possible writing. 

### CSV files

Very often, you'll have to work with CSV files. A CSV (comma-separated values) file stores tabular data in plain text.

In the following examples, we'll be working with NBA data, which you can download from [here](https://github.com/ByteAcademyCo/data-acq/blob/master/nba.csv).

#### CSV 

Python has a csv module, which you can utilize to work with CSV files.

In [4]:
import csv

Then, with the following 4 lines, you can print each row


In [5]:
with open('nba.csv', 'rt') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

['player', 'pos', 'age', 'bref_team_id', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts', 'season', 'season_end']
['Quincy Acy', 'SF', '23', 'TOT', '63', '0', '847', '66', '141', '0.468', '4', '15', '0.266666666666667', '62', '126', '0.492063492063492', '0.482', '35', '53', '0.66', '72', '144', '216', '28', '23', '26', '30', '122', '171', '2013-2014', '2013']
['Steven Adams', 'C', '20', 'OKC', '81', '20', '1197', '93', '185', '0.503', '0', '0', 'NA', '93', '185', '0.502702702702703', '0.503', '79', '136', '0.581', '142', '190', '332', '43', '40', '57', '71', '203', '265', '2013-2014', '2013']
['Jeff Adrien', 'PF', '27', 'TOT', '53', '12', '961', '143', '275', '0.52', '0', '0', 'NA', '143', '275', '0.52', '0.52', '76', '119', '0.639', '102', '204', '306', '38', '24', '36', '39', '108', '362', '2013-2014', '2013']
['Arron Afflalo', 'SG', '28', 'ORL', '73', '73', '25

Fairly straightforward, but let's see how else we can accomplish this. 
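
One option that stays within the `csv` module is `csv.DictReader`, which lets you read fields by column name instead of by position. The sketch below uses a small in-memory sample (three of the columns shown above) in place of `nba.csv` so that it runs on its own.

```python
import csv
import io

# A small in-memory sample standing in for nba.csv (only three of its columns).
sample = io.StringIO(
    "player,pos,age\n"
    "Quincy Acy,SF,23\n"
    "Steven Adams,C,20\n"
)

# DictReader uses the header row as keys, so fields can be read by name.
reader = csv.DictReader(sample)
players = [(row["player"], int(row["age"])) for row in reader]
print(players)
```

Note that every value still arrives as a string, which is why `age` is converted with `int` here.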


#### Pandas

Alternatively, you can use pandas. Pandas is great for working with CSV files because it loads them directly into DataFrames, its tabular data structure.

We begin by importing the needed libraries: pandas.


In [6]:
import pandas as pd

Then we use pandas to read the CSV file and show the first few rows. 

In [7]:
task_data = pd.read_csv("nba.csv")
task_data.head()

Unnamed: 0,player,pos,age,bref_team_id,g,gs,mp,fg,fga,fg.,...,drb,trb,ast,stl,blk,tov,pf,pts,season,season_end
0,Quincy Acy,SF,23,TOT,63,0,847,66,141,0.468,...,144,216,28,23,26,30,122,171,2013-2014,2013
1,Steven Adams,C,20,OKC,81,20,1197,93,185,0.503,...,190,332,43,40,57,71,203,265,2013-2014,2013
2,Jeff Adrien,PF,27,TOT,53,12,961,143,275,0.52,...,204,306,38,24,36,39,108,362,2013-2014,2013
3,Arron Afflalo,SG,28,ORL,73,73,2552,464,1011,0.459,...,230,262,248,35,3,146,136,1330,2013-2014,2013
4,Alexis Ajinca,C,25,NOP,56,30,951,136,249,0.546,...,183,277,40,23,46,63,187,328,2013-2014,2013


As you can see, pandas speeds up the process of viewing our data and shows it in a much more readable format.


#### R Programming

We've just gone through how to read CSV files in Python. But how do you do this in R? Pretty simply, actually. R has built-in functions to handle CSV files, so you don't even need a library to accomplish what we just did with Python.


In [2]:
data <- read.csv("nba.csv")

### JSON

Because HTTP is a protocol for transferring text, the data you request through a web API (which we'll go through soon enough) needs to be serialized into a string format, usually in JavaScript Object Notation (JSON). JavaScript objects look quite similar to Python dicts, which makes their string representations easy to interpret:

```
{ 
 "name" : "Lesley Cordero",
 "job" : "Data Scientist",
 "topics" : [ "data", "science", "data science"] 
}
```

Python has a module specifically for working with JSON, called `json`, which we can use as follows:

In [1]:
import json
serialized = """ { 
 "name" : "Lesley Cordero",
 "job" : "Data Scientist",
 "topics" : [ "data", "science", "data science"] 
} """

Next, we parse the JSON to create a Python dict, using the json module: 


In [2]:
deserialized = json.loads(serialized)
print(deserialized)

{'name': 'Lesley Cordero', 'job': 'Data Scientist', 'topics': ['data', 'science', 'data science']}
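
Going the other way, `json.dumps` serializes a Python object back into a JSON string. The short sketch below repeats the parsing step so that it is self-contained, then round-trips the data.

```python
import json

serialized = (
    '{"name": "Lesley Cordero", "job": "Data Scientist", '
    '"topics": ["data", "science", "data science"]}'
)

# JSON text -> Python dict; values are then accessed like any dict or list.
deserialized = json.loads(serialized)
print(deserialized["topics"][0])

# Python dict -> JSON text; indent=2 pretty-prints the result.
round_tripped = json.dumps(deserialized, indent=2)
print(round_tripped)
```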


#### jsonlite

Now, in R, working with JSON can be a bit more complicated. Unlike Python, R doesn't have a data type that closely resembles JSON (as dictionaries do in Python). So we have to work with what we do have, which is lists, vectors, and matrices.

Working with the same data from the Python example, we have:


In [3]:
serialized = '{ 
 "name" : "Lesley Cordero",
 "job" : "Data Scientist",
 "topics" : [ "data", "science", "data science"] 
} '

Now, if we want to properly load this into R, we'll be using the `jsonlite` library. 

In [1]:
library("jsonlite")

Once we've loaded the library, we'll use the `fromJSON` function to convert this into a data type R is more familiar with: <b>lists</b>.


In [4]:
l <- fromJSON(serialized, simplifyVector=TRUE)

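Notice that `simplifyVector` is set to `TRUE`. When `simplifyVector` is enabled, JSON arrays containing only primitives (strings, numbers, booleans, or null) are simplified into an atomic vector. A related option, `simplifyMatrix`, simplifies JSON arrays containing equal-length sub-arrays into a matrix.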

And to convert this back to JSON, we type:

In [5]:
toJSON(l, pretty=TRUE)


{
  "name": ["Lesley Cordero"],
  "job": ["Data Scientist"],
  "topics": ["data", "science", "data science"]
} 

## APIs

There are several ways to extract information from the web. Using APIs (Application Programming Interfaces) is probably the best way to extract data from a website. APIs are especially great if your data is constantly changing. Many websites have public APIs providing data feeds via JSON or some other format.

There are a number of ways to access these APIs from Python. To get the data, we make a request to a web server, and an easy way to do that is the `requests` package.

### GET request

There are many different types of requests. The simplest is a GET request, which is used to retrieve data. In Python, you can make a GET request to get the latest position of the International Space Station from the `OpenNotify` API.


In [3]:
import requests
response = requests.get("http://api.open-notify.org/iss-now.json")
print(response.status_code)

200


Which brings us to status codes. 

### Status Codes

What we just printed was a status code of `200`. Status codes are returned with every request made to a web server and indicate what happened with a request. The following are the most common types of status codes:

- `200` - everything worked as planned!
- `301` - the server is redirecting you to another endpoint.
- `400` - you made a bad request, for example by not sending the right data.
- `401` - you're not authenticated, which means you don't have access to the server.
- `403` - this means access is forbidden. 
- `404` - whatever you tried to access wasn't found. 
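
If you want the standard name for a code without looking it up, Python's built-in `http.HTTPStatus` enum has it. The sketch below prints the reason phrase for each code in the list above; the `describe` helper is a hypothetical convenience function that classifies a code by its range.

```python
from http import HTTPStatus

# Look up the standard reason phrase for each code discussed above.
for code in (200, 301, 400, 401, 403, 404):
    print(code, HTTPStatus(code).phrase)


def describe(code):
    """Roughly classify a status code by the range it falls in (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    return "other"


print(describe(404))
```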

Notice that if we try to access something that doesn't exist, we'll get a `404` error:

In [4]:
response = requests.get("http://api.open-notify.org/iss-pass")
print(response.status_code)

404


Let's try a GET request where the status code returned is `400`.


In [5]:
response = requests.get("http://api.open-notify.org/iss-pass.json")
print(response.status_code)

400


As we mentioned before, this indicates a bad request. That's because this endpoint requires two parameters, as you can see [here](http://open-notify.org/Open-Notify-API/ISS-Pass-Times/).

We set these with the optional `params` argument. You can make a dictionary and then pass it into the `requests.get` function, as follows:


In [6]:
parameters = {"lat": 40.71, "lon": -74}
response = requests.get("http://api.open-notify.org/iss-pass.json", params=parameters)
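
To get a rough picture of what `params` does, you can build the query string yourself with the standard library. This is only a sketch of the idea using `urllib.parse.urlencode`; it is not the code `requests` runs internally.

```python
from urllib.parse import urlencode

base_url = "http://api.open-notify.org/iss-pass.json"
parameters = {"lat": 40.71, "lon": -74}

# urlencode turns the dictionary into a key=value&key=value query string.
query = urlencode(parameters)
full_url = base_url + "?" + query
print(full_url)
```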

Alternatively, you can skip the dictionary and put the parameters directly in the URL:


In [8]:
response = requests.get("http://api.open-notify.org/iss-pass.json?lat=40.71&lon=-74")
print(response.content)

b'{\n  "message": "success", \n  "request": {\n    "altitude": 100, \n    "datetime": 1496864501, \n    "latitude": 40.71, \n    "longitude": -74.0, \n    "passes": 5\n  }, \n  "response": [\n    {\n      "duration": 645, \n      "risetime": 1496866724\n    }, \n    {\n      "duration": 591, \n      "risetime": 1496872564\n    }, \n    {\n      "duration": 550, \n      "risetime": 1496878434\n    }, \n    {\n      "duration": 609, \n      "risetime": 1496884248\n    }, \n    {\n      "duration": 638, \n      "risetime": 1496890035\n    }\n  ]\n}\n'


This is pretty messy, but luckily, we can parse the JSON into a Python dictionary with:


In [10]:
data = response.json()
print(data)

{'message': 'success', 'request': {'altitude': 100, 'datetime': 1496864501, 'latitude': 40.71, 'longitude': -74.0, 'passes': 5}, 'response': [{'duration': 645, 'risetime': 1496866724}, {'duration': 591, 'risetime': 1496872564}, {'duration': 550, 'risetime': 1496878434}, {'duration': 609, 'risetime': 1496884248}, {'duration': 638, 'risetime': 1496890035}]}
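
Once the response is a dictionary, ordinary Python is enough to work with it. The sketch below copies part of the output above into a literal so that it runs offline, and assumes (as the Open Notify documentation describes) that `risetime` is a Unix timestamp and `duration` is in seconds.

```python
from datetime import datetime, timezone

# Part of the parsed response shown above, copied in so the example runs offline.
data = {
    "message": "success",
    "response": [
        {"duration": 645, "risetime": 1496866724},
        {"duration": 591, "risetime": 1496872564},
    ],
}

# Convert each Unix timestamp to a UTC datetime and each duration to minutes.
passes = [
    (datetime.fromtimestamp(p["risetime"], tz=timezone.utc), p["duration"] / 60)
    for p in data["response"]
]
for rise, minutes in passes:
    print(rise.isoformat(), round(minutes, 1))
```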


### APIs with R

So far we've seen APIs with Python. Let's take a look at how you can use R to make some simple API calls. We'll be working with the `httr` library and the EPDB API, which we set up in the next three lines:


In [1]:
library("httr")
url  <- "http://api.epdb.eu"
path <- "eurlex/directory_code"

With `httr`, you can make GET requests, like this:


In [3]:
raw.result <- GET(url=url, path=path)
print(raw.result)

Response [http://api.epdb.eu/eurlex/directory_code/]
  Date: 2017-06-07 19:44
  Status: 200
  Content-Type: application/json
  Size: 121 kB



Now let's list the names of the elements in this response:


In [4]:
print(names(raw.result))

 [1] "url"         "status_code" "headers"     "all_headers" "cookies"    
 [6] "content"     "date"        "times"       "request"     "handle"     


You can extract each of the elements above with the `$` operator, like this:

In [5]:
raw.result$status_code

## Web Scraping

Web scraping tools are developed specifically for extracting information from websites. Web scraping mostly focuses on transforming unstructured data on the web (HTML) into structured data (a database or spreadsheet).

### HTML

While performing web scraping, we deal with HTML tags, so we need a good understanding of them. Below is the basic structure of an HTML document:

``` html
<!DOCTYPE html> 
<html>
	<body>
		<h1> First Heading </h1>
		<p> First Paragraph </p>
	</body>
</html>
```
Let's break down each of these tags:

1. `<!DOCTYPE html>`: This is the initial HTML declaration.
2. `<html>`: The HTML document is going to be contained within this tag.
3. `<body>`: The visible portion of the HTML document goes between the opening and closing `<body>` tags.
4. `<h1>`: This is an HTML heading.
5. `<p>`: HTML paragraphs are defined here. 

We've also got the following tags:

- `<a>`: This tag defines an HTML link, such as:
``` HTML
<a href="http://byteacademy.co">This is Byte Academy's website!</a>
```
- `<table>`: HTML tables are defined with this tag. Note that `<tr>` defines a row and `<td>` defines a cell within that row. For example:
``` HTML
<table style="width:100%">
	<tr>
		<td>Lesley</td>
		<td>Cordero</td>
		<td>24</td>
	</tr>
	<tr>
		<td>Helen</td>
		<td>Chen</td>
		<td>22</td>
	</tr>
</table>
```

This will yield the following:

```
Lesley		Cordero		24
Helen		Chen		22
```
- `<li>`: This tag defines a single list item. The enclosing `<ul>` or `<ol>` tag determines whether the list is unordered or ordered.
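
To connect the `<table>`, `<tr>`, and `<td>` tags to code, here is a minimal sketch that pulls the rows out of the table above using only Python's standard library. The `TableParser` class is a hypothetical helper written for this example; the scraping libraries covered next do this kind of work for you.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])   # a new row starts
        elif tag == "td":
            self.in_cell = True    # text that follows belongs to a cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Keep only non-blank text found inside a cell.
        if self.in_cell and data.strip():
            self.rows[-1].append(data.strip())


html = """
<table style="width:100%">
  <tr><td>Lesley</td><td>Cordero</td><td>24</td></tr>
  <tr><td>Helen</td><td>Chen</td><td>22</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
print(parser.rows)
```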

### BeautifulSoup

BeautifulSoup is a Python library for parsing HTML. Here we download a Wikipedia page with `requests` and hand its HTML to BeautifulSoup:

In [9]:
import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States"
page = requests.get(wiki)
html = page.content

soup = BeautifulSoup(html, "html5lib")

Passing the parser name explicitly (here `html5lib`, which must be installed separately) avoids the warning BeautifulSoup raises when it has to pick a parser on its own.



Now, let's explore this webpage! 

This outputs the title of the wikipedia page: 


In [10]:
print(soup.title)

<title>List of states and territories of the United States - Wikipedia</title>


The string version of this can be obtained with:

In [11]:
print(soup.title.string)

List of states and territories of the United States - Wikipedia


With `soup.a`, we get the first `<a>` tag in the document:

In [12]:
print(soup.a)

<a id="top"></a>


But this only gives us one result. If we want to extract the links from every `<a>` tag, we use `find_all()`, as follows:


In [13]:
all_links = soup.find_all("a")
for i in all_links:
	print(i.get("href"))

None
/wiki/Wikipedia:Featured_lists
#mw-head
#p-search
/wiki/List_of_sovereign_states_and_dependent_territories_in_the_Americas
/wiki/File:Map_of_USA_with_state_names_2.svg
/wiki/File:Map_of_USA_with_state_names_2.svg
/wiki/United_States_of_America
/wiki/Federal_republic
#cite_note-1
/wiki/U.S._state
/wiki/Commonwealth_(U.S._state)
#cite_note-2
#cite_note-3
#cite_note-4
#cite_note-5
/wiki/Capital_districts_and_territories#United_States
/wiki/Seat_of_government
/wiki/Washington,_D.C.
/wiki/Territories_of_the_United_States
/wiki/United_States_Minor_Outlying_Islands
#cite_note-6
#cite_note-7
/wiki/Contiguous_United_States
/wiki/North_America
/wiki/Canada
/wiki/Mexico
/wiki/Alaska
/wiki/Hawaii
/wiki/Archipelago
/wiki/Pacific_Ocean
/wiki/Caribbean_Sea
/wiki/Administrative_division
/wiki/United_States_Constitution
/wiki/Elections_in_the_United_States
/wiki/Local_government_in_the_United_States
/wiki/Constitutional_amendment#United_States
/wiki/State_constitution_(United_States)
/wiki/Republi

/wiki/Category:United_States
/wiki/Portal:United_States
/wiki/Template:Articles_on_first-level_administrative_divisions_of_North_American_countries
/wiki/Template_talk:Articles_on_first-level_administrative_divisions_of_North_American_countries
//en.wikipedia.org/w/index.php?title=Template:Articles_on_first-level_administrative_divisions_of_North_American_countries&action=edit
/wiki/Administrative_division
/wiki/Parishes_and_dependencies_of_Antigua_and_Barbuda
/wiki/Local_government_in_the_Bahamas
/wiki/Parishes_of_Barbados
/wiki/Districts_of_Belize
/wiki/Provinces_and_territories_of_Canada
/wiki/Provinces_of_Costa_Rica
/wiki/Provinces_of_Cuba
/wiki/Parishes_of_Dominica
/wiki/Provinces_of_the_Dominican_Republic
/wiki/Departments_of_El_Salvador
/wiki/Parishes_of_Grenada
/wiki/Departments_of_Guatemala
/wiki/Departments_of_Haiti
/wiki/Departments_of_Honduras
/wiki/Parishes_of_Jamaica
/wiki/Administrative_divisions_of_Mexico
/wiki/Departments_of_Nicaragua
/wiki/Provinces_of_Panama
/wiki/Pa

Our goal is to extract the capital of each state, so let's use `find_all` to retrieve the table tags:


In [14]:
all_tables = soup.find_all('table')


Now we have to identify the right table. We filter by the table's class attribute. In Chrome, you can find the class name by right-clicking the table on the web page, clicking "Inspect", and copying the class name. You can also look through the output of the command above to find the class name of the right table.


In [15]:
right_table = soup.find('table', class_='wikitable sortable plainrowheaders')

We're now going to store the data from the website. We'll grab the first few columns, so we'll initialize a list for each of these here:


In [16]:
A = []
B = []
C = []

Next, we grab the needed data and add it to each list. We iterate through the scraped table, row by row:


In [18]:
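for row in right_table.find_all("tr"):
    cells = row.find_all("td")    # data cells in this row
    states = row.find_all("th")   # header cell holding the state name
    # Keep only rows with 8 or 9 data cells, skipping header and other rows.
    if len(cells) in (8, 9):
        A.append(cells[0].find(string=True))
        B.append(states[0].find(string=True))
        C.append(cells[1].find(string=True))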

Here, we actually create the DataFrame with pandas: 


In [19]:
import pandas as pd
df = pd.DataFrame(A, columns=['Number'])
df['State/UT'] = B
df['Admin_Capital'] = C

### rvest

Now we'll try scraping a website with R. R has a library called `rvest` which allows you to scrape the HTML from any webpage. In the following two lines, we load this library and fetch the HTML with the `read_html` function.

In [1]:
library(rvest)
movie <- read_html("http://www.imdb.com/title/tt1490017/")

Loading required package: xml2


Let's now scrape some information from the website. `html_nodes` extracts pieces of an HTML document using CSS selectors, while `html_text` extracts the text those pieces contain. Using these two functions, we can extract the rating for this movie.


In [3]:
rating <- movie %>%
    html_nodes("strong span") %>%
    html_text() %>%
    as.numeric()
print(rating)

[1] 7.8


Next, let's get the cast of the movie:


In [5]:
cast <- movie %>%
    html_nodes("#titleCast .itemprop span") %>%
    html_text()
print(cast)

 [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"     "Alison Brie"    
 [5] "David Burrows"   "Anthony Daniels" "Charlie Day"     "Amanda Farinos" 
 [9] "Keith Ferguson"  "Will Ferrell"    "Will Forte"      "Dave Franco"    
[13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"     


And lastly, we extract the first movie review on the site:


In [6]:
review <- movie %>%
    html_nodes("#titleUserReviewsTeaser p") %>%
    html_text()
print(review)

[1] "This film has great animation and a great story, it has an all star voice cast example Morgan freeman and will Ferrell. This film will be up on the shelve as one of the greatest films ever animated, ever thought about and ever written. When I was a kid playing with Lego I never thought to my self that they will make a film on it now that they have all my Christmases have come at once. Cant wait for the special features on blue ray. This movie will be up there with toy story 1 2 and 3 , the lion king , frozen and wreck it Ralph. The power of good films are in Lego hands people are genius congrats to all the Oscars for the road ahead"


## Advanced Web Scraping


### Sitemaps

The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs in the site. This allows search engines to crawl the site more intelligently. Sitemaps are a URL inclusion protocol and complement robots.txt, a URL exclusion protocol.

An example of what a sample XML sitemap might look like is:

``` 
<?xml version="1.0" encoding="UTF-8"?>

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

   <url>

      <loc>http://www.example.com/</loc>

      <lastmod>2005-01-01</lastmod>

      <changefreq>monthly</changefreq>

      <priority>0.8</priority>

   </url>

</urlset> 
```
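
As a sketch of how a crawler might read such a file, the example below parses the sample sitemap with Python's built-in `xml.etree.ElementTree`. The sitemap is embedded as a string so that the code runs without a network connection; a real crawler would download it first.

```python
import xml.etree.ElementTree as ET

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>"""

# Sitemap elements live in an XML namespace, so map a prefix to it for lookups.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap.encode("utf-8"))

# Collect the location and last-modified date of every listed URL.
entries = [
    (url.find("sm:loc", ns).text, url.find("sm:lastmod", ns).text)
    for url in root.findall("sm:url", ns)
]
print(entries)
```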

### Estimating Website Size

The size of the website will affect how you crawl it. If the website has just a few hundred URLs, efficiency is not important. However, if the website has over a million web pages, downloading each one sequentially could take months.
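
A quick back-of-the-envelope calculation shows the scale of the difference. The five seconds per page used below is an assumed figure (download time plus a polite delay between requests), and `crawl_days` is a hypothetical helper; plug in your own numbers.

```python
def crawl_days(num_pages, seconds_per_page):
    """Estimate the days needed to download pages one after another."""
    return num_pages * seconds_per_page / 86400  # 86,400 seconds in a day


# seconds_per_page is an assumption, not a measured value.
small_site = crawl_days(300, 5)
large_site = crawl_days(1_000_000, 5)
print(round(small_site, 3), round(large_site, 1))
```

Under that assumption, a few hundred pages finish in well under an hour, while a million pages take roughly two months of uninterrupted crawling.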

### Regular Expressions

A regular expression is a sequence of characters that defines a search pattern, that is, a set of strings it can match.

#### Simplest Form

The simplest form of a regular expression is a plain sequence of characters, which matches exactly that sequence. For example, <i>python</i> would be

```
python
```

#### Case Sensitivity

Regular expressions are <b>case sensitive</b>, which means

```
p and P
```
are distinguishable from each other. This means <i>python</i> and <i>Python</i> would have to be represented differently, as follows:

```
python and Python
```

We can check these are different by running:

In [1]:
import re
re1 = re.compile('python')
print(bool(re1.match('Python')))

False
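
If case should not matter, Python's `re` module accepts the `re.IGNORECASE` flag. A short sketch:

```python
import re

# Default matching is case sensitive.
sensitive = re.compile("python")
print(bool(sensitive.match("Python")))

# re.IGNORECASE relaxes that for the whole pattern.
case_insensitive = re.compile("python", re.IGNORECASE)
print(bool(case_insensitive.match("Python")))
```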


#### Disjunctions

If you want a regular expression to represent both <i>python</i> and <i>Python</i>, however, you can use <b>brackets</b> or the <b>pipe</b> symbol as the disjunction of the two forms. For example,
```
[Pp]ython or Python|python
```
could represent either <i>python</i> or <i>Python</i>. Likewise,

```
[0123456789]
```
would represent a single digit. The pipe symbol is typically used for interchangeable strings, such as in the following example:

```
dog|cat
```
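
These disjunctions can be checked with Python's `re` module. The sketch below uses `fullmatch`, which succeeds only when the whole string matches the pattern; the word lists are made up for the example.

```python
import re

bracket = re.compile("[Pp]ython")
pipe = re.compile("Python|python")
pets = re.compile("dog|cat")

words = ["python", "Python", "PYTHON"]

# Both forms accept the two capitalizations and nothing else.
bracket_hits = [w for w in words if bracket.fullmatch(w)]
pipe_hits = [w for w in words if pipe.fullmatch(w)]
pet_hits = [w for w in ["dog", "cat", "cow"] if pets.fullmatch(w)]

print(bracket_hits, pipe_hits, pet_hits)
```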

#### Ranges

If we want a regular expression to express the disjunction of a range of characters, we can use a <b>dash</b>. For example, instead of the previous example, we can write 

``` 
[0-9]
```
Similarly, we can represent any lowercase letter of the alphabet with

``` 
[a-z]
```

#### Exclusions

Brackets can also be used to represent what an expression <b>cannot</b> be if you combine it with the <b>caret</b> sign. For example, the expression 

``` 
[^p]
```
represents any single character, special characters included, except <i>p</i>.
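
Ranges and exclusions are easy to try out with `re.findall`, which returns every non-overlapping match. The sample strings below are made up for the example.

```python
import re

text = "Room 42, floor p3"

digits = re.findall("[0-9]", text)      # every single digit
lowercase = re.findall("[a-z]", text)   # every lowercase letter
not_p = re.findall("[^p]", "pop up")    # every character except p

print(digits)
print(lowercase)
print(not_p)
```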

#### Question Marks 

A question mark represents zero or one instances of the previous character. For example,

```
colou?r
```
represents either <i>color</i> or <i>colour</i>. Question marks are often used in cases of plurality. For example,

```
computers?
```
can be either <i>computers</i> or <i>computer</i>. If you want to extend this to more than one character, you can put the sequence within parentheses, like this:

```
Feb(ruary)?
```
This would match either <i>February</i> or <i>Feb</i>.
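
A quick check of the optional patterns in Python, again using `fullmatch` so that partial matches do not count. The candidate words are made up for the example.

```python
import re

colour = re.compile("colou?r")
month = re.compile("Feb(ruary)?")

# The "u" is optional, but it may appear at most once.
colour_hits = [w for w in ["color", "colour", "colouur"] if colour.fullmatch(w)]

# The parenthesized group is optional as a whole, not letter by letter.
month_hits = [w for w in ["Feb", "February", "Febru"] if month.fullmatch(w)]

print(colour_hits, month_hits)
```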

#### Kleene Star

To represent zero or <b>more</b> instances of the previous character, we use an <b>asterisk</b>, known as the Kleene star. To represent the set of strings <i>a, ab, abb, abbb, ...</i>, the following regular expression would be used:
```
ab*
```

#### Wildcards

Wildcards represent any single character and are symbolized with a <b>period</b>. For example,

```
beg.n
```
This regular expression matches strings such as <i>begun, begin, began,</i> etc.

#### Kleene+

To represent <b>one</b> or more instances of the previous character, we use a <b>plus</b> sign. To represent the set of strings <i>ab, abb, abbb, ...</i>, the following regular expression would be used:

```
ab+
```
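
The difference between the Kleene star, the plus sign, and the wildcard is easiest to see side by side. The sketch below tests each pattern against a few made-up candidate strings.

```python
import re

candidates = ["a", "ab", "abbb", "b"]

# Zero or more b's after the a, so a bare "a" qualifies.
star_hits = [s for s in candidates if re.fullmatch("ab*", s)]

# One or more b's after the a, so a bare "a" does not qualify.
plus_hits = [s for s in candidates if re.fullmatch("ab+", s)]

# The period stands for exactly one character, so "begn" is too short.
wild_hits = [w for w in ["begin", "began", "begun", "begn"] if re.fullmatch("beg.n", w)]

print(star_hits, plus_hits, wild_hits)
```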

### REs & BeautifulSoup 

If the previous section on regular expressions seemed a little disjointed, here's where it all ties together. BeautifulSoup and regular expressions go hand in hand when it comes to scraping the Web. In fact, most functions that take in a string argument will also take in a regular expression.

Looking at [this](http://www.pythonscraping.com/pages/page3.html) webpage, we'll see that there are product images of the following form:

``` HTML
<img src="../img/gifts/img3.jpg">
```
If we wanted to grab URLs to all of the product images, it might seem fairly straightforward: just grab all the image tags using `.findAll("img")`. Unfortunately, there's a problem. Pages often contain "extra" images (e.g., logos), hidden images, blank images used for spacing and aligning elements, and other random image tags you might not be aware of.

The solution is to look for another approach. In this case, we can look at the file path of the product images. You've seen this part before:

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html5lib")  # name the parser explicitly to avoid a warning



This prints out only the relative image paths that start with ../img/gifts/img and end in `.jpg`:


In [3]:
# A raw string (r"...") keeps the backslashes intact for the regex engine.
images = bsObj.find_all("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg


A regular expression can be inserted as any argument in a BeautifulSoup expression, allowing you a great deal of flexibility in finding target elements.
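
To see what the pattern itself accepts, independent of BeautifulSoup, you can run it against a few paths directly. The list below reuses image paths that appear on the example page, including the logo we want to exclude.

```python
import re

# Image paths like those on the example page; logo.jpg is the one to exclude.
sources = [
    "../img/gifts/logo.jpg",
    "../img/gifts/img1.jpg",
    "../img/gifts/img2.jpg",
]

# The literal dots are escaped; ".*" allows anything between "img" and ".jpg".
product_image = re.compile(r"\.\./img/gifts/img.*\.jpg")
product_sources = [src for src in sources if product_image.search(src)]
print(product_sources)
```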


### Lambda Expressions

Recall that a lambda expression is a small anonymous function, often passed into another function as an argument. Instead of defining a function as f(x, y), you may define a function as f(g(x), y), or even f(g(x), h(x)).
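
As a quick refresher in plain Python, the sketch below defines the same function twice, once with `def` and once as a lambda, and then passes lambdas straight into `filter` and `sorted`. The names and numbers are made up for the example.

```python
# A named function and an equivalent lambda.
def square(x):
    return x * x


square_lambda = lambda x: x * x

# Lambdas are most useful when passed straight into another function.
numbers = [3, 1, 2]
evens = list(filter(lambda n: n % 2 == 0, numbers))
by_descending = sorted(numbers, key=lambda n: -n)

print(square(4), square_lambda(4), evens, by_descending)
```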

BeautifulSoup allows us to pass certain types of functions as parameters into the findAll function. The only restriction is that these functions must take a tag object as an argument and return a boolean. Every tag object that BeautifulSoup encounters is evaluated in this function, and tags that evaluate to "true" are returned while the rest are discarded.

For example, the following retrieves all tags that have exactly two attributes:

In [5]:
bsObj.findAll(lambda tag: len(tag.attrs) == 2)

[<img src="../img/gifts/logo.jpg" style="float:left;"/>,
 <tr class="gift" id="gift1"><td>
 Vegetable Basket
 </td><td>
 This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
 <span class="excitingNote">Now with super-colorful bell peppers!</span>
 </td><td>
 $15.00
 </td><td>
 <img src="../img/gifts/img1.jpg"/>
 </td></tr>,
 <tr class="gift" id="gift2"><td>
 Russian Nesting Dolls
 </td><td>
 Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
 </td><td>
 $10,000.52
 </td><td>
 <img src="../img/gifts/img2.jpg"/>
 </td></tr>,
 <tr class="gift" id="gift3"><td>
 Fish Painting
 </td><td>
 If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
 </td><td>
 $10,005.00
 </td><td>
 <img src="../img/gifts/img3.jpg"/>
 </td>

Lambda functions used as BeautifulSoup selectors can act as a great substitute for writing a regular expression, if you're comfortable with writing a little code.

