# Web Scraping: Practice

Helpful link: 
- https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe

In [1]:
# Imports
import requests
from bs4 import BeautifulSoup

## List of San Diego Communities

In [2]:
# San Diego Communities Webpage
sd_communities = 'https://en.wikipedia.org/wiki/List_of_communities_and_neighborhoods_of_San_Diego'

In [3]:
### Setup: First, scrape the web page above, and use BeautifulSoup to parse it

In [4]:
# YOUR CODE HERE
page = requests.get(sd_communities)
soup = BeautifulSoup(page.content, 'html.parser')

In [5]:
# What is the title of the webpage?

# YOUR CODE HERE
title = soup.title.string
print(title)

List of communities and neighborhoods of San Diego - Wikipedia


Goal: we would like a dictionary of all the communities listed on the Wikipedia page, with their links.

We want all the community names as keys, and the (relative) links as values. 

In [6]:
# It should look something like this:
example = {
    'University Heights' : '/wiki/University_Heights/',
    'La Jolla' : '/wiki/La_Jolla/'
}

print(example)

{'University Heights': '/wiki/University_Heights/', 'La Jolla': '/wiki/La_Jolla/'}


### Communities - Part 1

Create a dictionary called `communities`, and fill it with the communities information, as above. 

From your `soup` object, use the `find_all` method to find all the links.

You can then loop through the links and collect them into a dictionary.

For a first pass, don't worry about sub-selecting links; just get all the links on the page.
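To see what `find_all` returns before running it on the live page, here is a minimal sketch on a small hand-written HTML snippet. The snippet and its links are made up for illustration and are not taken from Wikipedia. It shows that `.get(...)` returns `None` when an attribute is missing, which is one reason a first pass over every link can produce odd entries.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written by hand for this illustration
html = """
<div>
  <a href="/wiki/La_Jolla" title="La Jolla">La Jolla</a>
  <a href="#top">Back to top</a>
  <a name="anchor-only">No href here</a>
</div>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag, in document order
demo_links = demo_soup.find_all('a')

# .get returns None instead of raising when the attribute is absent
pairs = [(a.get('title'), a.get('href')) for a in demo_links]
print(pairs)
```

Note that if several links share the same title (or all lack one), later entries overwrite earlier ones when the pairs are stored in a dictionary.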

In [7]:
# YOUR CODE HERE
communities = dict()
for link in soup.find_all('a'):
    # Use separate names so the loop variable and the page title are not overwritten
    name = link.get('title')
    href = link.get('href')
    communities[name] = href

In [8]:
# Check the resulting dictionary
#communities

### Communities - Part 2

If you did the part above, extracting links, you probably realized that you extracted a whole bunch of links you don't really want, for example, links from the sidebar.

Figure out how to sub-select the part of the page that includes the table with all the links, and then run the same link extraction on that specific part of the page. This should allow you to extract only the relevant links.
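The idea of sub-selecting can be sketched on a toy page: calling `find_all` on a tag, rather than on the whole soup, searches only inside that tag. The HTML below is hypothetical and much simpler than the real Wikipedia page, where the right table may not be the first one.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one sidebar link and a table with two community links
html = """
<div id="sidebar"><a href="/wiki/Main_Page" title="Main Page">Main page</a></div>
<table>
  <tr><td><a href="/wiki/La_Jolla" title="La Jolla">La Jolla</a></td></tr>
  <tr><td><a href="/wiki/Ocean_Beach,_San_Diego" title="Ocean Beach, San Diego">Ocean Beach</a></td></tr>
</table>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

# Narrow the search to the first <table>, then look for links only inside it
demo_table = demo_soup.find('table')
table_links = {a.get('title'): a.get('href') for a in demo_table.find_all('a')}

# For comparison: searching the whole document also picks up the sidebar link
all_links = demo_soup.find_all('a')
print(len(all_links), len(table_links))
```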

In [9]:
# YOUR CODE HERE
table = soup.find('table')

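communities = dict()
for link in table.find_all('a'):
    name = link.get('title')
    href = link.get('href')
    # Skip links that lack a title or an href; they cannot be turned into a URL later
    if name is None or href is None:
        continue
    communities[name] = href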

In [10]:
# Check out the results
#communities

### Communities - Part 3

You now have a dictionary of neighbourhoods in San Diego, and links to their respective pages on Wikipedia.

See if you can loop through the links you have, and collect latitude and longitude data from each one (if available).

In [11]:
# YOUR CODE HERE

In [12]:
# First - figure out how to do this with one example page
community_page = 'https://en.wikipedia.org/wiki/Ocean_Beach,_San_Diego'

In [13]:
# Get the page
page = requests.get(community_page)
soup = BeautifulSoup(page.content, 'html.parser')

In [14]:
# Get the infobox table
table = soup.select_one("table.infobox")

In [15]:
# Extract lat & lon from the table
lat = table.select_one("span.latitude").contents[0]
lon = table.select_one("span.longitude").contents[0]
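The extracted values are strings, not numbers. As a hedged sketch, the helper below converts a degrees/minutes/seconds string to signed decimal degrees. It assumes the text uses the `°`, `′` and `″` symbols followed by a hemisphere letter; `dms_to_decimal` is a hypothetical helper name, and the sample values are made up rather than read from the page.

```python
import re

# Degrees are required; minutes and seconds are optional; then N, S, E or W
DMS_PATTERN = re.compile(
    r"\s*(\d+(?:\.\d+)?)°(?:(\d+(?:\.\d+)?)′)?(?:(\d+(?:\.\d+)?)″)?([NSEW])\s*"
)

def dms_to_decimal(dms):
    """Convert a string such as '32°44′55″N' to signed decimal degrees.

    Returns None if the string does not match the assumed format.
    """
    match = DMS_PATTERN.fullmatch(dms)
    if match is None:
        return None
    degrees, minutes, seconds, hemisphere = match.groups()
    value = float(degrees) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
    # South and west are negative by convention
    return -value if hemisphere in "SW" else value

print(dms_to_decimal("32°45′N"))
print(dms_to_decimal("117°15′W"))
```

Returning `None` for unrecognised input keeps the helper consistent with the `None` placeholders used for pages without coordinates.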

In [17]:
# Loop across all community pages
base_url = 'https://en.wikipedia.org'

lats, lons = {}, {}
for name, link in communities.items():
    url = base_url + link
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    try:
        table = soup.select_one("table.infobox")
        lat = table.select("span.latitude")[0].contents[0]
        lon = table.select("span.longitude")[0].contents[0]
    except (AttributeError, IndexError):
        # No infobox on the page, or no coordinates inside it
        lat, lon = None, None
    
    lats[name] = lat
    lons[name] = lon

In [18]:
# Check out the results
#lats
#lons

## San Diego Crime Stats Page

Check out the San Diego crime stats page below. Let's try to get some data from it.

From the landing page, pull all the table data, storing it in a dictionary that maps each type of crime to its count.

Hints:
- Look for the HTML tag that holds the table data, and loop through all of those labels. 
- Using this approach, you can get all the table data by looping across one tag. 

In [19]:
# SD Crime stats page
crime_stats_link = "http://crimestats.arjis.org/default.aspx"

In [20]:
# YOUR CODE HERE

In [21]:
page = requests.get(crime_stats_link)
soup = BeautifulSoup(page.content, 'html.parser')

In [22]:
dat = {}
label = None
for tag in soup.find_all('nobr'):
    text = tag.get_text(strip=True)
    if not text:
        continue

    # Check if value is a label
    if text[0].isalpha():
        label = text
        dat[label] = 0

    # If it's not a label (then it's a number) - add to the most recent label
    elif label is not None:
        dat[label] = text

In [23]:
dat

{'Aggravated Assault**': '156',
 'Armed Robbery': '26',
 'Crime Index Total**': '1029',
 'Motor Vehicle Theft': '133',
 'Murder': '3',
 'Non-Residential Burglary': '51',
 'Rape**': '14',
 'Residential Burglary': '89',
 'Strong Arm Robbery': '24',
 'Theft < $400': '332',
 'Theft >= $400': '201',
 'Total Burglary': '140',
 'Total Property Crime': '806',
 'Total Thefts': '533',
 'Total Violent Crime**': '223'}
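The scraped counts are strings. A minimal sketch of cleaning them up, using a few entries copied from the output above: convert each count to an integer and, as an optional design choice, strip the footnote asterisks from the labels. The names `counts_raw` and `counts` are introduced here for illustration.

```python
# A few entries copied from the scraped output above
counts_raw = {
    'Non-Residential Burglary': '51',
    'Residential Burglary': '89',
    'Total Burglary': '140',
    'Rape**': '14',
}

# Strip trailing footnote asterisks from labels and convert counts to integers
counts = {label.rstrip('*'): int(value) for label, value in counts_raw.items()}
print(counts)

# With integers, totals can be sanity-checked against their parts
parts = counts['Non-Residential Burglary'] + counts['Residential Burglary']
print(parts == counts['Total Burglary'])
```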

### Discussion

The crime page above takes inputs to select dates and places. How could we programmatically enter queries into it, and get the results?
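One possible approach, sketched under assumptions: pages ending in `.aspx` are often ASP.NET forms that expect hidden state fields (such as `__VIEWSTATE`) to be sent back together with the user's selections. Whether this particular page works that way has not been checked here. The form HTML, the `ddlYear` field name, and the `collect_form_fields` helper below are all hypothetical.

```python
from bs4 import BeautifulSoup

def collect_form_fields(html):
    """Return {name: value} for every named <input> in the first <form>."""
    form = BeautifulSoup(html, 'html.parser').find('form')
    if form is None:
        return {}
    fields = {}
    for tag in form.find_all('input'):
        name = tag.get('name')
        if name:
            fields[name] = tag.get('value', '')
    return fields

# Hypothetical form, standing in for what the real page might contain
demo_html = """
<form method="post" action="default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="text" name="ddlYear" value="2017" />
  <input type="submit" value="Go" />
</form>
"""

payload = collect_form_fields(demo_html)
payload['ddlYear'] = '2018'  # override a (hypothetical) query field
print(payload)

# The payload could then be posted back to the page, for example:
#   response = requests.post(crime_stats_link, data=payload)
```

If the page builds its results with JavaScript instead, a browser automation tool would be needed rather than a plain POST request.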