# APIs and Web Scraping

This notebook demonstrates several ways of gathering data from the internet: calling JSON APIs with `requests` and scraping HTML pages with BeautifulSoup. The examples below extract data from Wordnik, OkCupid, Google News, Yahoo Stocks, and Yelp.

***Wordnik:*** the definitions provided for the word "chair"

In [1]:
import requests

WORDNIK_URL = 'http://api.wordnik.com/v4/word.json/chair/definitions?limit=200&includeRelated=true&useCanonical=false&includeTags=false&api_key=a2a73e7b926c924fad7001ca3111acd55af2ffabf50eb4ae5'
response = requests.get(WORDNIK_URL)
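
The query string above can also be assembled from a dictionary, which keeps each parameter readable and keeps the API key out of the notebook. The following is a minimal sketch using only the standard library; `build_definitions_url` is a hypothetical helper and `WORDNIK_API_KEY` is an assumed environment variable name, neither of which appears in the original notebook.

```python
import os
from urllib.parse import quote, urlencode

WORDNIK_BASE = 'http://api.wordnik.com/v4/word.json'


def build_definitions_url(word, api_key, limit=200):
    """Build the same definitions URL as WORDNIK_URL (hypothetical helper)."""
    params = {
        'limit': limit,
        'includeRelated': 'true',
        'useCanonical': 'false',
        'includeTags': 'false',
        'api_key': api_key,
    }
    # quote() escapes the word; urlencode() escapes and joins the parameters.
    return '{}/{}/definitions?{}'.format(WORDNIK_BASE, quote(word), urlencode(params))


# Assumed environment variable; falls back to a placeholder when unset.
api_key = os.environ.get('WORDNIK_API_KEY', 'YOUR_API_KEY')
print(build_definitions_url('chair', api_key))
```

The resulting string can be passed to `requests.get` in the same way as `WORDNIK_URL`.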

In [2]:
response.headers

{'Content-Type': 'application/json; charset=utf-8', 'Wordnik-API-Version': '4.12.20', 'Connection': 'close', 'Access-Control-Allow-Headers': 'Origin, X-Atmosphere-tracking-id, X-Atmosphere-Framework, X-Cache-Date, Content-Type, X-Atmosphere-Transport, X-Remote, api_key, auth_token, *', 'Date': 'Sat, 04 Jun 2016 19:51:37 GMT', 'Access-Control-Request-Headers': 'Origin, X-Atmosphere-tracking-id, X-Atmosphere-Framework, X-Cache-Date, Content-Type, X-Atmosphere-Transport,  X-Remote, api_key, *', 'Access-Control-Allow-Methods': 'POST, GET, OPTIONS, PUT, DELETE', 'Access-Control-Allow-Origin': '*'}

In [3]:
definitions = response.json()
print(type(definitions))

<class 'list'>


In [4]:
definitions[0]['text']

'A piece of furniture consisting of a seat, legs, back, and often arms, designed to accommodate one person.'

In [5]:
for definition in definitions:
    print('+', definition['text'])

+ A piece of furniture consisting of a seat, legs, back, and often arms, designed to accommodate one person.
+ A seat of office, authority, or dignity, such as that of a bishop.
+ An office or position of authority, such as a professorship.
+ A person who holds an office or a position of authority, such as one who presides over a meeting or administers a department of instruction at a college; a chairperson.
+ The position of a player in an orchestra.
+ Slang   The electric chair.
+ A seat carried about on poles; a sedan chair.
+ Any of several devices that serve to support or secure, such as a metal block that supports and holds railroad track in position.
+ To install in a position of authority, especially as a presiding officer.
+ To preside over as chairperson:  chair a meeting. 


***OkCupid:*** location information returned for the postal code 90024

In [6]:
OKC_ZIP_URL = 'https://www.okcupid.com/1/apitun/location/query?q=90024'

response = requests.get(OKC_ZIP_URL)
zipcode = response.json()
zipcode

{'lang': 'en',
 'message': 'Ahh, Los Angeles.',
 'query': '90024',
 'results': [{'city_name': 'Los Angeles',
   'country_code': 'US',
   'country_iso_code': 'US',
   'country_name': 'United States',
   'display_state': 1,
   'latitude': 34.06298,
   'locid': 4233989,
   'longitude': -118.43632,
   'metro_area': 4480,
   'nameid': 141547,
   'popularity': 32321,
   'postal_code': '90024',
   'state_code': 'CA',
   'state_name': 'California'}]}
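
The response nests the location under `results`, so a small helper can pull out just the fields of interest. This is a sketch under the assumption that `results` may be empty or missing for an unknown postal code (not verified against the API); `first_location` is a hypothetical name, and the sample payload is trimmed from the output above.

```python
def first_location(payload):
    """Return (city, state, latitude, longitude) for the first result, or None."""
    results = payload.get('results') or []
    if not results:
        return None
    loc = results[0]
    return (loc['city_name'], loc['state_code'], loc['latitude'], loc['longitude'])


# Trimmed copy of the response shown above.
sample = {
    'query': '90024',
    'results': [{'city_name': 'Los Angeles',
                 'state_code': 'CA',
                 'latitude': 34.06298,
                 'longitude': -118.43632,
                 'postal_code': '90024'}],
}

print(first_location(sample))
print(first_location({'query': '00000', 'results': []}))
```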

***Google News:*** titles of the top headlines of the day

In [7]:
import requests
from bs4 import BeautifulSoup

GOOGLE_URL = 'http://news.google.com'
response = requests.get(GOOGLE_URL)

In [8]:
bs = BeautifulSoup(response.text, 'lxml')

In [9]:
first_title = bs.select('span.titletext')[0]
print(type(first_title))
print(first_title)
first_title.get_text()

<class 'bs4.element.Tag'>
<span class="titletext">The Latest: Family spokesman says Ali died of septic shock</span>


'The Latest: Family spokesman says Ali died of septic shock'

In [10]:
for title in bs.select('span.titletext')[0:15]:
    print('+', title.get_text())

+ The Latest: Family spokesman says Ali died of septic shock
+ Muhammad Ali's Hometown of Louisville Honors the Late Boxer as 'Our Inspiration'
+ 5 great Muhammad Ali pop-culture moments
+ Muhammad Ali Dies at 74: Titan of Boxing and the 20th Century
+ Live blog: Muhammad Ali dies age 74
+ Trump on black supporter: 'Look at my African-American over here'
+ Anti-Trump Voices Amplify on Internet, With Violent Results
+ AP News in Brief at 11:04 pm EDT
+ San Jose police chief defends officers accused of failing to protect Trump supporters from violence
+ Mayor Liccardo, San Jose Police Issue New Statements Regarding Violence at Donald Trump Rally
+ San Jose welcomes illegal immigrants but not Trump supporters
+ Citizen Obama: How the president is campaigning for a philosophy, not just a candidate
+ In France, are soldiers outside the Eiffel Tower and the Louvre really worth it?
+ France floods claim three more lives as massive mop-up begins
+ France Flooding Death Toll Rises to Four as Se
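
The selection logic does not depend on a live page, so it can be tried on a small HTML string. This is a sketch assuming the `span.titletext` markup shown above (the live Google News markup may have changed since); `headline_texts` is a hypothetical helper, and `html.parser` is used instead of `lxml` only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup


def headline_texts(html, limit=15):
    """Return the text of up to `limit` span.titletext elements."""
    soup = BeautifulSoup(html, 'html.parser')
    return [tag.get_text(strip=True) for tag in soup.select('span.titletext')[:limit]]


# Hand-written snippet that mimics the markup printed above.
sample_html = '''
<div><span class="titletext">Live blog: Muhammad Ali dies age 74</span></div>
<div><span class="titletext">AP News in Brief at 11:04 pm EDT</span></div>
<div><span class="other">not a headline</span></div>
'''

for text in headline_texts(sample_html):
    print('+', text)
```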

***Yahoo Stocks:*** the after-hours trading price of Yahoo Inc. (YHOO) stock, scraped from nasdaq.com.<br> After-hours trades are posted from 4:15 pm ET until 3:30 pm ET the next day.

In [11]:
YAHOO_STOCKS_URL = 'http://www.nasdaq.com/symbol/yhoo/after-hours'
response = requests.get(YAHOO_STOCKS_URL)

bs = BeautifulSoup(response.text, 'lxml')

In [12]:
last_sale = bs.select('#qwidget_lastsale')[0].get_text()
net_change = bs.select('#qwidget_netchange')[0].get_text()
percent = bs.select('#qwidget_percent')[0].get_text()
print(last_sale, '***', net_change, '***', percent)

$36.60 *** unch *** 
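
The scraped values are strings, so they need converting before any arithmetic. This is a sketch, assuming that `unch` stands for "unchanged" and that an empty string means no value was posted; `parse_price` and `parse_change` are hypothetical helpers, not part of the original notebook.

```python
from decimal import Decimal


def parse_price(text):
    """Convert a string such as '$36.60' to a Decimal."""
    return Decimal(text.strip().replace('$', '').replace(',', ''))


def parse_change(text):
    """Convert a net-change string to a Decimal; 'unch' is assumed to mean zero."""
    cleaned = text.strip()
    if not cleaned:
        return None  # nothing was posted on the page
    if cleaned.lower() == 'unch':
        return Decimal('0')
    return parse_price(cleaned)


print(parse_price('$36.60'), parse_change('unch'), parse_change(''))
```

`Decimal` is used rather than `float` so that prices keep their exact two-digit representation.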


In [13]:
bs.select('.qwidget-dollar')

[<div class="qwidget-dollar" id="qwidget_lastsale">$36.60</div>,
 <div class="qwidget-dollar"><div>*  </div></div>]

***Yelp:*** the page for General Assembly in Santa Monica, with the intent of extracting the title, location, hours, description, photos, and rating values.

In [14]:
YELP_GA_URL = 'http://www.yelp.com/biz/general-assembly-santa-monica-santa-monica'
response = requests.get(YELP_GA_URL)

bs = BeautifulSoup(response.text, 'lxml')

In [15]:
title = bs.select('h1.biz-page-title')[0].get_text().strip()
title

'General Assembly Santa Monica'

In [16]:
# Guard against a missing title: select() returns an empty list when nothing matches
title_tags = bs.select('h1.biz-page-title')
title1 = title_tags[0].get_text().strip() if title_tags else ''
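
A guard like the one above can be wrapped in a small function so that every field is extracted the same way. This is a sketch on a hand-written snippet rather than the live Yelp page; `select_text` is a hypothetical helper, and `html.parser` is used here only to avoid the `lxml` dependency.

```python
from bs4 import BeautifulSoup


def select_text(soup, selector, default=''):
    """Return the stripped text of the first match for `selector`, or `default`."""
    matches = soup.select(selector)
    return matches[0].get_text().strip() if matches else default


# Hand-written snippet standing in for the Yelp page.
sample = BeautifulSoup(
    '<h1 class="biz-page-title">\n  General Assembly Santa Monica\n</h1>',
    'html.parser',
)

print(select_text(sample, 'h1.biz-page-title'))
print(repr(select_text(sample, 'span.hour-range')))
```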

In [17]:
address_whatever = bs.select('address')
address_str_w = address_whatever[0].get_text()
print(address_str_w)


        1520 2nd St, Santa Monica, CA 90401
    


In [18]:
address = bs.select('address > span')
address

[<span itemprop="streetAddress">1520 2nd St</span>,
 <span itemprop="addressLocality">Santa Monica</span>,
 <span itemprop="addressRegion">CA</span>,
 <span itemprop="postalCode">90401</span>]

In [19]:
address_str = ''
for x in address:
    address_str += (x.get_text() + ' ')
print(address_str)

1520 2nd St Santa Monica CA 90401 


In [20]:
"".join([x.get_text() + ' ' for x in address])

'1520 2nd St Santa Monica CA 90401 '

In [21]:
address_tag = bs.select('address')
print(address_tag)
print(' ')
print(' ')
print(address_tag[0])

[<address>
        1520 2nd St, Santa Monica, CA 90401
    </address>, <address itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">1520 2nd St</span><br/><span itemprop="addressLocality">Santa Monica</span>, <span itemprop="addressRegion">CA</span> <span itemprop="postalCode">90401</span><br/><meta content="US" itemprop="addressCountry"/>
</address>]
 
 
<address>
        1520 2nd St, Santa Monica, CA 90401
    </address>


In [22]:
address_tag = bs.select('address')
better_address = address_tag[0].get_text().strip()
better_address

'1520 2nd St, Santa Monica, CA 90401'

In [23]:
range_tag = bs.select('span.hour-range')
hours = range_tag[0].get_text() if range_tag else ''  # an empty list is falsy, so '' is used when nothing matches
hours

'10:00 am - 5:00 pm'

In [24]:
if range_tag:
    hours = range_tag[0].get_text()
else:
    hours = ''
    
print(hours)

10:00 am - 5:00 pm


In [25]:
business_desc = bs.select('div.js-from-biz-owner > p')
desc = business_desc[0].get_text()
print(desc)


                

    General Assembly's main campus in Santa Monica holds immersive courses, classes, workshops and events specializing in business, tech, and design with expert instructors from the top of their …

            


In [26]:
#bs.select('body')[0].get_text(strip=True)

In [27]:
img = bs.select('img.photo-box-img')[0]
img

<img alt="General Assembly Santa Monica - Santa Monica, CA, United States. SEO Class" class="photo-box-img" height="250" src="https://s3-media2.fl.yelpcdn.com/bphoto/SDUTsNrXSPBlkTxv0_ZTsA/ls.jpg" width="250"/>

In [28]:
img.attrs

{'alt': 'General Assembly Santa Monica - Santa Monica, CA, United States. SEO Class',
 'class': ['photo-box-img'],
 'height': '250',
 'src': 'https://s3-media2.fl.yelpcdn.com/bphoto/SDUTsNrXSPBlkTxv0_ZTsA/ls.jpg',
 'width': '250'}

In [29]:
img.attrs['src']

'https://s3-media2.fl.yelpcdn.com/bphoto/SDUTsNrXSPBlkTxv0_ZTsA/ls.jpg'

In [30]:
imgs = bs.select('img.photo-box-img')[0:15]
for img in imgs:
    print('+', img.attrs['src'])

+ https://s3-media2.fl.yelpcdn.com/bphoto/SDUTsNrXSPBlkTxv0_ZTsA/ls.jpg
+ //s3-media2.fl.yelpcdn.com/photo/CxPXFeaelpIa6Z1u_eBDvw/30s.jpg
+ https://s3-media4.fl.yelpcdn.com/bphoto/UAu1R-HTsRPiquF9RvoZng/ls.jpg
+ //s3-media2.fl.yelpcdn.com/photo/CxPXFeaelpIa6Z1u_eBDvw/30s.jpg
+ https://s3-media1.fl.yelpcdn.com/bphoto/cQyAvTG-bSRs3kuU5_Ds-g/ls.jpg
+ //s3-media2.fl.yelpcdn.com/photo/CxPXFeaelpIa6Z1u_eBDvw/30s.jpg
+ //s3-media3.fl.yelpcdn.com/photo/HPMM2p2n8v5UkUHtEXQRfw/60s.jpg
+ //s3-media3.fl.yelpcdn.com/photo/svf8jo0t0l_EQoWtmzv2zQ/60s.jpg
+ //s3-media3.fl.yelpcdn.com/photo/FMpISV9yKMxrUhVcV-Ktiw/60s.jpg
+ //s3-media2.fl.yelpcdn.com/assets/srv0/yelp_styleguide/978c1bee49d7/assets/img/1x1.png
+ //s3-media1.fl.yelpcdn.com/bphoto/-io-_fGRA5zHSbYF2G7hCg/348s.jpg
+ //s3-media4.fl.yelpcdn.com/photo/X-TsbgRc48AwUXEX8iWORA/60s.jpg
+ //s3-media2.fl.yelpcdn.com/assets/srv0/yelp_styleguide/978c1bee49d7/assets/img/1x1.png
+ //s3-media3.fl.yelpcdn.com/bphoto/1OKu_UkYmoBkM85ACAGBrA/348s.jpg
+ //s3-m
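
Several of the `src` values above are protocol-relative (they start with `//`) and some repeat. This is a sketch of one way to normalise them with the standard library; `absolute_unique` is a hypothetical helper, and using an `https` base URL is an assumption (the notebook requested the page over `http`).

```python
from urllib.parse import urljoin

# Assumed base: the Yelp page URL, with https chosen for the scheme.
PAGE_URL = 'https://www.yelp.com/biz/general-assembly-santa-monica-santa-monica'


def absolute_unique(srcs, base=PAGE_URL):
    """Resolve each src against `base` and drop repeats, keeping first-seen order."""
    resolved = (urljoin(base, src) for src in srcs)
    return list(dict.fromkeys(resolved))


# A few src values copied from the output above.
srcs = [
    'https://s3-media2.fl.yelpcdn.com/bphoto/SDUTsNrXSPBlkTxv0_ZTsA/ls.jpg',
    '//s3-media2.fl.yelpcdn.com/photo/CxPXFeaelpIa6Z1u_eBDvw/30s.jpg',
    '//s3-media2.fl.yelpcdn.com/photo/CxPXFeaelpIa6Z1u_eBDvw/30s.jpg',
]

for url in absolute_unique(srcs):
    print('+', url)
```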

In [31]:
from IPython.display import Image

imgs = bs.select('img.photo-box-img')
for img in imgs:
    print('+', img.attrs['alt'])

Image(url=imgs[0].attrs['src'])

+ General Assembly Santa Monica - Santa Monica, CA, United States. SEO Class
+ David P.
+ General Assembly Santa Monica - Santa Monica, CA, United States. Growth Hacking Class !
+ David P.
+ General Assembly Santa Monica - Santa Monica, CA, United States. Growth Hacking Class !
+ David P.
+ Ben P.
+ Cody A.
+ Rumala S.
+ General Assembly Santa Monica - Santa Monica, CA, United States
+ General Assembly Santa Monica - Santa Monica, CA, United States
+ Lisandra M.
+ General Assembly Santa Monica - Santa Monica, CA, United States. The best teachers ever! #GALA #UXDI #UXdesign
+ General Assembly Santa Monica - Santa Monica, CA, United States. The best teachers ever! #GALA #UXDI #UXdesign
+ General Assembly Santa Monica - Santa Monica, CA, United States. Fun activities at GA: team work, collaboration, plan, strategy...
+ General Assembly Santa Monica - Santa Monica, CA, United States. Fun activities at GA: team work, collaboration, plan, strategy...
+ General Assembly Santa Monica - Santa M

In [32]:
bs.select('meta[itemprop="ratingValue"]')

[<meta content="4.5" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="1.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="4.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>,
 <meta content="1.0" itemprop="ratingValue"/>,
 <meta content="5.0" itemprop="ratingValue"/>]

In [33]:
rating_tag = bs.select('div.biz-main-info meta[itemprop="ratingValue"]')
rating_tag

[<meta content="4.5" itemprop="ratingValue"/>]

In [34]:
rating_tag[0].attrs['content']

'4.5'
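
The `content` attributes are strings, so they need converting to numbers before they can be summarised. This is a sketch using the values printed by `In [32]`; treating every value after the first as an individual review rating is an assumption, since the notebook only confirms that the first value belongs to `div.biz-main-info`.

```python
from collections import Counter

# content values copied from the output of In [32], in page order.
contents = (['4.5'] + ['5.0'] * 6 + ['1.0'] + ['5.0'] * 5 + ['4.0']
            + ['5.0'] * 5 + ['1.0', '5.0'])

ratings = [float(value) for value in contents]
business_rating = ratings[0]   # matches the div.biz-main-info result
other_ratings = ratings[1:]    # assumed to be individual review ratings

mean_other = sum(other_ratings) / len(other_ratings)
print(business_rating, round(mean_other, 2), Counter(other_ratings).most_common())
```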

Being able to obtain data through an API, web scraping, or open data sources (keeping in mind that some sites restrict access or require permission) is significant: it shapes how we work today and how we can improve life through innovation, growth, and change within our society.