# Big Data Colab Week 3: Data Collection

In the past two weeks, we've looked at Python and computational thinking, as well as manipulating and displaying data using pandas and matplotlib. This week we'll focus on how to collect data.

In real life, data doesn't come pre-packaged in a pandas dataframe. Instead, we have to go collect and clean it from some source. This week we'll look at two ways to collect data: web scraping and APIs.

# Web Scraping
Web scraping is the process of collecting information or data from sites on the internet. Copying and pasting a quote or picture from a website is technically web scraping, but we're concerned with automating this process to deal with large volumes of data.

## Website Structure:
Websites are built and rendered using [HTML](https://www.w3schools.com/html/) (HyperText Markup Language) tags. Other helper languages, such as CSS and JavaScript, may be present on the site, but ultimately the browser reads the HTML to decide what visuals and text to render. For example, there are tags for text, images, lists, and headers, as well as custom tags.

You can use tools like Chrome's DevTools to inspect (and change) the HTML for any page you visit.

As mentioned earlier, JavaScript is often used to help generate HTML for sites. Since the vast majority of HTML is generated programmatically (and not by hand), patterns often appear in the structure of the page. Developers rely on these patterns in the HTML when web scraping.
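
To make the idea of repeated patterns concrete, here is a minimal sketch that counts which tag and class combinations occur in a small page fragment. The HTML is a made-up example (not taken from any real site), and `TagCounter` is a hypothetical helper built on Python's standard-library `html.parser`.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical page fragment: three listings generated from one template.
SAMPLE_HTML = """
<ul id="listings">
  <li class="card"><h2 class="title">Item A</h2><span class="price">$10</span></li>
  <li class="card"><h2 class="title">Item B</h2><span class="price">$12</span></li>
  <li class="card"><h2 class="title">Item C</h2><span class="price">$9</span></li>
</ul>
"""


class TagCounter(HTMLParser):
    """Counts (tag, class) pairs for every opening tag it sees."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("class", "card")]
        css_class = dict(attrs).get("class")
        self.counts[(tag, css_class)] += 1


counter = TagCounter()
counter.feed(SAMPLE_HTML)
for (tag, css_class), n in counter.counts.items():
    print(f"<{tag} class={css_class!r}> appears {n} time(s)")
```

Every listing repeats the same `li.card` / `h2.title` / `span.price` shape, which is the kind of regularity a scraper can loop over.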


## Packages:
### [Requests](https://requests.readthedocs.io/en/master/):

The requests library is "an elegant and simple HTTP library for Python." Much like typing a URL into a browser and getting a webpage in response, you can provide a URL to requests and receive the HTML response for that page.

Below, we'll follow the example of a web scraper for jobs on the site Monster. You'll write code to scrape info about housing prices from Zillow.

In [28]:
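import requests

# The search terms are interpolated into the query string of the search URL
city = 'Baltimore'
job = 'Data-Scientist'
jobs_url = f'https://www.monster.com/jobs/search/?q={job}&where={city}'
jobs_page = requests.get(jobs_url)
jobs_page.content  # raw bytes of the HTML response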

b'<!DOCTYPE html>\r\n<html xmlns="https://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\r\n<head>\r\n    \r\n            <link rel="preconnect" href="https://coda.newjobs.com" />\r\n            <link rel="preconnect" href="https://js-seeker.newjobs.com" />\r\n            <link rel="preconnect" href="https://css-seeker.newjobs.com" />\r\n            <link rel="preconnect" href="https://securemedia.newjobs.com" />\r\n            <link rel="preconnect" href="https://logs2.jobs.com" />\r\n            <link rel="preconnect" href="https://job-openings.monster.com" />\r\n            <link rel="preconnect" href="https://apis.google.com" />\r\n            <link rel="preconnect" href="https://www.google.com" />\r\n            <link rel="preconnect" href="https://accounts.google.com" />\r\n            <link rel="preconnect" href="https://content.googleapis.com" />\r\n            <link rel="preconnect" href="https://ssl.gstatic.com" />\r\n            <link rel="preconnect" href="https://www.drop
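
The f-string above works because both search terms are already URL-safe. When a value contains spaces or punctuation, it needs to be encoded first. The sketch below shows one way to do that with the standard library's `urllib.parse.urlencode`; `build_search_url` is a hypothetical helper, and the example only illustrates the encoding, not the exact format any particular site expects.

```python
from urllib.parse import urlencode


def build_search_url(base_url, **params):
    """Hypothetical helper: append URL-encoded query parameters to base_url."""
    return f"{base_url}?{urlencode(params)}"


url = build_search_url(
    "https://www.monster.com/jobs/search/",
    q="Data Scientist",  # the space is encoded as '+'
    where="Baltimore",
)
print(url)
```

`requests.get` also accepts a `params` dictionary and performs this kind of encoding for you, which is usually the more convenient option.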

Below, use requests and the Zillow URL **https://www.zillow.com/homes/ZIPCODEHERE_rb/** to get the HTML for listings within a zip code. Define the zip code as a variable and pass it into the URL. Inspect the content of the response.


In [None]:
#### YOUR CODE HERE #######

### [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/):

You'll notice that the response from the requests library isn't very pretty. You might be able to convert the string response to a dictionary with patience, time, and a lot of loops, but that would be dreadful.

Thankfully, BeautifulSoup is a library that does all of that for you. It reads in HTML data, parses it into a custom data structure, and then allows you to do things like iterate over the HTML or search for specific keywords or tags.

In [27]:
from bs4 import BeautifulSoup
jobs_soup = BeautifulSoup(jobs_page.content, 'html.parser')


As mentioned earlier, websites are built from HTML tags. As sites become more complex, they use more custom tags. It's helpful to switch back and forth between your code and Chrome's DevTools to determine which HTML tags contain the data you're looking for.

In [None]:
# Getting element that contains data
jobs_results = jobs_soup.find(id="ResultsContainer")
print(jobs_results.prettify())


Your turn. Find the HTML tag that encapsulates the results of the Zillow search using Chrome's DevTools and extract it into a variable.

In [None]:
##### YOUR CODE HERE #########




Now we have the HTML tag that contains all of the information we want. Unfortunately, there's still a lot of junk and unnecessary information in the tag. Never fear! BeautifulSoup makes it easy to extract just the information that we need.

In [None]:
# Gets the HTML for each card object. Still too much info!!!
job_elems = jobs_results.find_all('section', class_='card-content')
for elem in job_elems:
  print(elem, end='\n'*2)

In [None]:
# Gets the HTML tags for the desired information.
# Still need to extract the text from the tag.
for elem in job_elems:
    title_elem = elem.find('h2', class_='title')
    company_elem = elem.find('div', class_='company')
    location_elem = elem.find('div', class_='location')
    print(title_elem)
    print(company_elem)
    print(location_elem)
    print()

In [None]:
for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    # Skip cards (e.g. ads) that are missing a field; calling .text on None would raise an AttributeError
    if None in (title_elem, company_elem, location_elem):
        continue
    print(title_elem.text.strip())
    print(company_elem.text.strip())
    print(location_elem.text.strip())
    print()
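
Live pages change over time, so the selectors above may stop matching. The sketch below runs the same `find` / `find_all` / `.text.strip()` steps on a small made-up HTML string so the behavior can be checked offline. The markup only mimics the card layout used above; it is not real Monster HTML, and the job data is invented.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the card layout used above; not real Monster HTML.
SAMPLE_HTML = """
<div id="ResultsContainer">
  <section class="card-content">
    <h2 class="title"> Data Scientist </h2>
    <div class="company">Acme Corp</div>
    <div class="location">Baltimore, MD</div>
  </section>
  <section class="card-content">
    <!-- an ad card with no job fields -->
  </section>
  <section class="card-content">
    <h2 class="title">ML Engineer</h2>
    <div class="company">Example Labs</div>
    <div class="location">Towson, MD</div>
  </section>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
results = soup.find(id="ResultsContainer")

jobs = []
for card in results.find_all("section", class_="card-content"):
    title = card.find("h2", class_="title")
    company = card.find("div", class_="company")
    location = card.find("div", class_="location")
    # find() returns None when nothing matches, so skip incomplete cards
    if None in (title, company, location):
        continue
    jobs.append((title.text.strip(), company.text.strip(), location.text.strip()))

print(jobs)
```

The middle card has none of the expected tags, so it is skipped rather than raising an error.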

Your turn. Starting with the high-level HTML tag, extract all relevant information about the houses for sale (address, price, bathroom count, bedroom count, square footage). Finally, import pandas and save all of your extracted information into a DataFrame.

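Hint: Loop over the list as above. Either build a list of dictionaries, one per property, and create the DataFrame from that list, OR create an empty DataFrame with the appropriate columns and add rows to it. The first option is usually simpler and faster; note that `DataFrame.append` was removed in pandas 2.0, so adding rows now requires `pd.concat`.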
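
As a rough illustration of the list-of-dictionaries approach, the sketch below builds a DataFrame from placeholder records and converts the scraped strings to numbers. The column names and values are invented for the example and are not real listing data.

```python
import pandas as pd

# Placeholder rows standing in for whatever your scraping loop extracts.
scraped_rows = [("Listing A", "$100,000", "3"), ("Listing B", "$250,500", "4")]

records = []
for name, price, beds in scraped_rows:
    records.append({"name": name, "price": price, "beds": beds})

df = pd.DataFrame(records)

# Scraped values arrive as strings; convert numeric columns before analysis.
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(int)
df["beds"] = df["beds"].astype(int)
print(df)
```

Building the list first and constructing the DataFrame once avoids repeatedly copying the frame inside the loop.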

In [None]:
### YOUR CODE HERE ####

Now you have the code to create a dataframe for housing prices for a certain zip code. What if you want more data than just one zip code? Could you wrap your code in a loop that creates a larger dataframe that includes info from many zip codes?

In [None]:
#### YOUR CODE HERE ####

# APIs

An [API](https://www.freecodecamp.org/news/what-is-an-api-in-english-please-b880a3214a82/) is an application programming interface. It allows a user to interact with the data or functionality of a software system through simple endpoints. An API endpoint hides all the heavy lifting of data collection, cleaning, or other functionality. Instead of having to scrape a site for data, we can access the data directly.

## RESTful APIs
A [RESTful](https://searchapparchitecture.techtarget.com/definition/RESTful-API) API is one that lets a user read and modify data with the HTTP methods GET, PUT, POST, and DELETE. As you'll see below, the endpoints look just like URLs you'd type into a browser's address bar. We'll use the requests library from earlier to query the API.

In [None]:
import requests

r = requests.get('https://disease.sh/v3/covid-19/all?yesterday=true&twoDaysAgo=false&allowNull=true')
r = r.json()  # parse the JSON (JavaScript Object Notation) body into a Python dict
r['active']
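
The `.json()` call above does roughly what `json.loads` does to the response text: it turns JSON into ordinary Python dictionaries, lists, numbers, and strings. The sketch below shows that conversion on a made-up, trimmed-down body. The values are invented and the `nested` field is hypothetical; the real response has different values and more fields.

```python
import json

# Made-up response body; the "nested" field exists only for illustration.
raw_body = '{"cases": 1000, "active": 123, "nested": {"ok": true, "missing": null}}'

data = json.loads(raw_body)  # JSON object -> dict, true -> True, null -> None
print(type(data).__name__)
print(data["active"])
print(data["nested"])
```

Once parsed, the data can be indexed like any other dictionary, which is why `r['active']` works in the cell above.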

In [None]:
# What is this code doing?
import matplotlib.pyplot as plt
import pandas as pd

r = requests.get('https://disease.sh/v3/covid-19/historical/usa?lastdays=all')
r = r.json()
foo = dict(r['timeline']['cases'])

new_foo = {}
for key, val in foo.items():
  new_foo[pd.to_datetime(key)]= val

plt.plot(list(new_foo.keys()), list(new_foo.values()))
# new_foo = {pd.to_datetime(k): val for k, val in foo.items()}


Your turn. Pick an endpoint from disease.sh that returns time-series data for a region. Make a GET request to the API and plot the data using matplotlib.

In [None]:
#### YOUR CODE HERE ####

## Auth
In the example above, the API did not require authentication. However, many APIs, especially those with more interesting data, require authentication. This helps prevent abuse of the API. Below we'll see how to authenticate with the Spotify API.

After seeing the workflow, you should be able to adapt it to other APIs that use a similar authentication flow, including many social media APIs.
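
Many authenticated APIs follow the same general pattern: you register an application to obtain a client ID and client secret, exchange them for a short-lived access token, and then send that token in an `Authorization` header with each request. The sketch below only builds the two headers using the standard library and sends no request. The credentials and token are placeholders, and the helper names are hypothetical. Libraries such as spotipy are designed to handle these steps for you.

```python
import base64


def basic_auth_header(client_id, client_secret):
    """Header for the token request: 'Basic ' + base64('id:secret')."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def bearer_auth_header(access_token):
    """Header for later API calls once a token has been issued."""
    return {"Authorization": f"Bearer {access_token}"}


# Placeholder values; real ones come from the API provider's developer dashboard.
token_request_headers = basic_auth_header("my-client-id", "my-client-secret")
api_request_headers = bearer_auth_header("example-access-token")

print(token_request_headers)
print(api_request_headers)
```

In practice, keep real credentials out of the notebook itself, for example by reading them from environment variables.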

In [None]:
#### YOUR CODE HERE ####
!pip install spotipy
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Project

The offline work this week involves building a web scraper or using an API to collect data.

**Web scraper**:
Pick a website that has a simple structure (think lists that contain data, like Zillow or Monster). Scrape the site and load the data into a DataFrame. Use some of pandas' built-in tools to do a basic analysis of the data. Finally, use matplotlib to create some visualizations.

**API**:
Find an API you think is interesting and gather its data using the requests library. It could be a public API or one requiring some sort of auth. Use some of pandas' built-in tools to do a basic analysis of the data. Finally, use matplotlib to create some visualizations.

Both of these options require a bit of research to find a suitable API/site. 