# Python BeautifulSoup Web Scraping Tutorial
Learn to scrape data from the web using the Python BeautifulSoup bs4 library.  
BeautifulSoup makes it easy to parse useful data out of an HTML page.  
First install the bs4 library on your system by running this at the command line:  
*pip install beautifulsoup4*  
The examples also use the requests library (*pip install requests*).  
See [BeautifulSoup official documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for the complete set of functions.

### Import requests so we can fetch the html content of the webpage
You can see our example page has about 42k characters.

In [10]:
import requests
r = requests.get('https://www.usclimatedata.com/climate/united-states/us')
print(len(r.text))

42402
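
Network requests can fail or hang, so it can help to set a timeout and check the HTTP status before parsing. The sketch below wraps the fetch in a small helper. The name `fetch_html` and the 10-second default are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text (hypothetical helper, not a requests API)."""
    # timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://www.usclimatedata.com/climate/united-states/us')` would then return the same text as `r.text` above, or raise an exception instead of silently returning an error page.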


### Import BeautifulSoup, and convert your HTML into a bs4 object
Now we can access specific HTML tags on the page using dot notation, much like properties of a JavaScript object.

In [11]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
print(soup.title)
print(soup.title.string)

<title>Climate United States - Normals and averages</title>
Climate United States - Normals and averages


### Drill into the bs4 object to access page contents
soup.p gives you the first paragraph tag on the page.  
soup.a gives you the first anchor (link) on the page.  
Get the value of an attribute inside an HTML tag using square brackets, as in soup.a['title'], or with the .get() method.  
Use .parent to get the parent object, and .next_sibling to get the next peer object.  
**Use your browser's *inspect element* feature to find the tag for the data you want.**

In [12]:
print(soup.p)
print(soup.p.text)
print(soup.a)
print(soup.a['title'])
print()
print(soup.p.parent)

<p class="selection_title">Select a state by name</p>
Select a state by name
<a class="navbar-brand" href="/" title="Temperature - Precipitation - Sunshine - Snowfall"><img alt="Temperature - Precipitation - Sunshine - Snowfall" data-src="https://www.usclimatedata.com/assets/images/us-climate-data.png" height="34" src="https://www.usclimatedata.com/assets/images/us-climate-data.png" srcset="https://www.usclimatedata.com/assets/images/us-climate-data.png 1x, https://www.usclimatedata.com/assets/images/us-climate-data-2.png 2x" width="31"/><span class="white ml-2">U.S. Climate Data</span></a>
Temperature - Precipitation - Sunshine - Snowfall

<div class="float-left mb-4 mt-2"><p class="selection_title">Select a state by name</p></div>
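
The navigation attributes described above can be tried on a small inline document without fetching anything. This is a minimal sketch: the HTML snippet is hypothetical and only loosely modelled on the page used in this tutorial.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, not copied from the real page.
html = """
<div id="intro"><p class="selection_title">Select a state by name</p><p>Second paragraph</p></div>
<a href="/climate/alabama/united-states/3170" title="Alabama">Alabama</a>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                    # first <p> tag in the document
print(first_p.text)                 # text inside the tag
print(first_p["class"])             # class is multi-valued, so bs4 returns a list
print(soup.a["title"])              # square brackets raise KeyError if the attribute is missing
print(soup.a.get("rel"))            # .get() returns None if the attribute is missing
print(first_p.parent["id"])         # the enclosing <div>
print(first_p.next_sibling.text)    # the next node at the same level
```

On real pages the next sibling is often a whitespace string rather than a tag, so `.find_next_sibling()` is usually the safer choice when you want the next element.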


### Prettify() is handy for formatted printing   
but note this works only on bs4 objects, not on strings, dicts or lists. For those, use the pprint module from the standard library.

In [13]:
print(soup.p.parent.prettify())

<div class="float-left mb-4 mt-2">
 <p class="selection_title">
  Select a state by name
 </p>
</div>
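
For plain Python containers, the standard library's pprint module plays the role that prettify() plays for bs4 objects. A minimal sketch follows; the `sample` dict is an illustrative structure assembled from values printed elsewhere in this tutorial.

```python
from pprint import pformat, pprint

# Illustrative structure: values taken from outputs shown in this tutorial.
sample = {
    "Colorado": ["45", "46", "54", "61", "72", "82"],
    "links": [
        "/climate/alabama/united-states/3170",
        "/climate/alaska/united-states/3171",
        "/climate/arizona/united-states/3172",
    ],
}

pprint(sample, width=60)                # prints the dict wrapped to 60 columns
formatted = pformat(sample, width=60)   # same layout, returned as a string
```

`pformat` is handy when you want the formatted text for logging rather than printing it directly.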



### We need all the state links on this page
First we find_all anchor tags, and print out the href attribute, which is the actual link url.   
But we see the result includes some links we don't want, so we need to filter those out.

In [14]:
for link in soup.find_all('a'):
    print(link.get('href'))

/
#
/
/climate/united-states/us
/
/climate/united-states/us
/climate/alabama/united-states/3170
/climate/alaska/united-states/3171
/climate/arizona/united-states/3172
/climate/arkansas/united-states/3173
/climate/california/united-states/3174
/climate/colorado/united-states/3175
/climate/connecticut/united-states/3176
/climate/delaware/united-states/3177
/climate/district-of-columbia/united-states/3178
/climate/florida/united-states/3179
/climate/georgia/united-states/3180
/climate/hawaii/united-states/3181
/climate/idaho/united-states/3182
/climate/illinois/united-states/3183
/climate/indiana/united-states/3184
/climate/iowa/united-states/3185
/climate/kansas/united-states/3186
/climate/kentucky/united-states/3187
/climate/louisiana/united-states/3188
/climate/maine/united-states/3189
/climate/maryland/united-states/1872
/climate/massachusetts/united-states/3191
/climate/michigan/united-states/3192
/climate/minnesota/united-states/3193
/climate/mississippi/united-states/3194
/climate/

### Filter urls using string functions
We just add an *if* to check conditions, then add the good ones to a list.  
We expect 51 state links, including Washington DC; the run below returns 53, so a couple of extra links still pass this filter.

In [15]:
base_url = 'https://www.usclimatedata.com'
state_links = []
for link in soup.find_all('a'):
    url = link.get('href')
    if url and '/climate/' in url and '/climate/united-states/us' not in url:
        state_links.append(url)
print(len(state_links))

53
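
A stricter filter can be sketched with a regular expression that matches only the shape of the state links, and that also skips duplicates. The pattern is an assumption inferred from the hrefs printed earlier, so treat it as a starting point rather than a guarantee about the site.

```python
import re

# Sample hrefs in the shapes seen in the find_all('a') output above.
hrefs = [
    "/", "#", "/climate/united-states/us",
    "/climate/alabama/united-states/3170",
    "/climate/alabama/united-states/3170",   # duplicate on purpose
    "/climate/district-of-columbia/united-states/3178",
    "/climate/", None,
]

# Assumed pattern: /climate/<state-slug>/united-states/<numeric id>
STATE_LINK = re.compile(r"^/climate/[a-z-]+/united-states/\d+$")

state_links = []
for href in hrefs:
    # skip missing hrefs, non-matching paths, and links already collected
    if href and STATE_LINK.match(href) and href not in state_links:
        state_links.append(href)
print(state_links)
```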


### Test getting the data for one state
then print the title for that page.

In [16]:
r = requests.get(base_url + state_links[5])
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
print(soup.title.string)

Climate Colorado - Temperature, Rainfall and Averages
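
String concatenation works here because every state link starts with a slash. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with hrefs that are already absolute. The second URL below is a made-up example.

```python
from urllib.parse import urljoin

base_url = "https://www.usclimatedata.com"

# A relative path is resolved against the base URL.
state_url = urljoin(base_url, "/climate/colorado/united-states/3175")
print(state_url)

# An absolute URL is returned unchanged (example.com is a placeholder).
other_url = urljoin(base_url, "https://example.com/page")
print(other_url)
```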


### The data we need is in *tr* tags
But look, there are 12 tr tags on the page, and we only want 2 of them - the *Average high* rows.

In [17]:
rows = soup.find_all('tr')
print(len(rows))

12


### Filter rows, and add temp data to a list
We use a list comprehension to filter the rows.  
Then we have only 2 rows left.  
We iterate through those 2 rows, and add all the temps from data cells (td) into a list.  
The row printed below has only 6 td cells, so indexing tds[1] through tds[6] raises *IndexError: list index out of range*; looping over the cells directly avoids that.

In [18]:
rows = [row for row in rows if 'Average high' in str(row)]
print(len(rows))

high_temps = []
for row in rows:
    tds = row.find_all('td')
    print(tds)
    # take every data cell instead of assuming a fixed number of cells
    for td in tds:
        high_temps.append(td.text)
print(high_temps)

2
[<td class="high text-right">45</td>, <td class="high text-right">46</td>, <td class="high text-right">54</td>, <td class="high text-right">61</td>, <td class="high text-right">72</td>, <td class="high text-right">82</td>]
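
The row-filtering step can be checked offline on a small table. This is a sketch under assumptions: the markup below is hypothetical (the real page may not use th for row labels), and the low temperatures are placeholder values.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the highs match the first values printed above,
# the lows are placeholders for illustration only.
html = """
<table>
  <tr><th>Month</th><td>Jan</td><td>Feb</td><td>Mar</td></tr>
  <tr><th>Average high in F</th><td class="high">45</td><td class="high">46</td><td class="high">54</td></tr>
  <tr><th>Average low in F</th><td class="low">0</td><td class="low">0</td><td class="low">0</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

high_temps = []
for row in soup.find_all("tr"):
    label = row.find("th")
    # match on the row label, then take every data cell in that row
    if label and "Average high" in label.get_text():
        high_temps.extend(int(td.get_text(strip=True)) for td in row.find_all("td"))
print(high_temps)
```

Iterating over the cells, rather than over a fixed index range, means the loop keeps working whether a row has 6 cells or 12.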

### Get the name of the State
On the first attempt we just split the title string into a list, and grab the second word.  
But that doesn't work for 2-word states like New York and North Carolina.   
So instead we slice the string from the first blank to the hyphen. 

In [None]:
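# first attempt: second word of the title (breaks on 2-word states)
state = soup.title.string.split()[1]
print(state)
# better: everything between the first space and the hyphen
s = soup.title.string
state = s[s.find(' '):s.find('-')].strip()
print(state)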
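
To see why the slice is needed, the two approaches can be compared side by side. The helper name `state_from_title` is hypothetical, and the New York title is an assumed example that follows the same format as the Colorado title shown earlier.

```python
def state_from_title(title):
    """Return the text between the first space and the first hyphen (hypothetical helper)."""
    start = title.find(" ")
    end = title.find("-")
    return title[start:end].strip()


titles = [
    "Climate Colorado - Temperature, Rainfall and Averages",
    "Climate New York - Temperature, Rainfall and Averages",  # assumed title format
]
for title in titles:
    naive = title.split()[1]          # second word only
    print(naive, "|", state_from_title(title))
```

This mirrors the slice used above, so it shares the same limitation: a state name that itself contained a hyphen would be cut short.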

### Add state name and temp list to the data dictionary
For a single state, this is what our scraped data looks like.  
In this example we only got monthly highs by state, but you could drill into cities, and could get lows and precipitation. 

In [None]:
data = {}
data[state] = high_temps
print(data)

### Put it all together and iterate over the states
We loop through our list of state links, and get high temp data for each state, and add it to the data dict.  
This combines all our work above into a single for loop.  
The result is a dict keyed by state name, with a list of monthly highs for each.

In [None]:
data = {}
for state_link in state_links:
    url = base_url + state_link
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # keep only the table rows that hold average high temperatures
    rows = soup.find_all('tr')
    rows = [row for row in rows if 'Average high' in str(row)]
    high_temps = []
    for row in rows:
        # take every data cell instead of assuming a fixed number of cells
        for td in row.find_all('td'):
            high_temps.append(td.text)
    # the state name sits between the first space and the hyphen in the title
    s = soup.title.string
    state = s[s.find(' '):s.find('-')].strip()
    data[state] = high_temps
print(data)
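
When looping over many pages, it may be worth separating parsing from fetching, pausing between requests, and skipping pages that fail instead of stopping the whole run. The sketch below shows one way to arrange that. `parse_state_page`, `scrape_states`, the `fetch` parameter, and the one-second delay are all illustrative assumptions, and the demo HTML is hypothetical.

```python
import time

from bs4 import BeautifulSoup


def parse_state_page(html):
    """Return (state, high_temps) parsed from one state page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = [row for row in soup.find_all("tr") if "Average high" in str(row)]
    high_temps = [td.text for row in rows for td in row.find_all("td")]
    title = soup.title.string
    state = title[title.find(" "):title.find("-")].strip()
    return state, high_temps


def scrape_states(urls, fetch, delay=1.0):
    """Build the data dict; fetch is any callable mapping a URL to HTML text."""
    data = {}
    for url in urls:
        try:
            state, temps = parse_state_page(fetch(url))
        except Exception as exc:  # broad on purpose for a sketch; narrow it in real code
            print(f"skipping {url}: {exc}")
            continue
        data[state] = temps
        time.sleep(delay)  # pause between pages to go easy on the server
    return data


# Offline demo with hypothetical HTML and a fake fetch function.
sample_html = """<html><head><title>Climate Colorado - Temperature, Rainfall and Averages</title></head>
<body><table><tr><th>Average high in F</th><td>45</td><td>46</td></tr></table></body></html>"""
print(scrape_states(["demo"], fetch=lambda url: sample_html, delay=0))
```

For a real run, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, with `urls` built from `base_url` and `state_links`.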

### Save to CSV file
Lastly, we might want to write all this data to a CSV file.  
Here's a quick easy way to do that.

In [None]:
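import csv

# newline='' stops the csv module from writing blank lines between rows on Windows
with open('high_temps.csv', 'w', newline='') as f:
    w = csv.writer(f)
    # one row per state: the name, then each temperature in its own column
    for state, temps in data.items():
        w.writerow([state] + temps)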
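
A header row makes the file easier to load later. The sketch below writes one and reads the result back as a check. It uses an in-memory buffer instead of a file so it leaves nothing on disk, and the column names `high_1`, `high_2`, ... are assumptions, since the tutorial does not name the columns.

```python
import csv
import io

# Sample data: the six Colorado values printed earlier in this tutorial.
data = {"Colorado": ["45", "46", "54", "61", "72", "82"]}

buffer = io.StringIO()  # swap for open('high_temps.csv', 'w', newline='') to write a file
writer = csv.writer(buffer)

# Header: one generic column name per temperature value (assumed naming).
n_values = max(len(temps) for temps in data.values())
writer.writerow(["state"] + [f"high_{i}" for i in range(1, n_values + 1)])
for state, temps in data.items():
    writer.writerow([state, *temps])

# Read the CSV text back to confirm the layout.
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows)
```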