Developed by Roger Wang (rq.wang@rutgers.edu)

Reference: https://www.pythonforbeginners.com/api/how-to-use-the-hacker-news-api

Data Cleaning using Pandas
--

In [None]:
#import modules

import pandas as pd
import numpy as np

In [None]:
# Create dataframe with missing values

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'], 
        'age': [42, np.nan, 36, 24, 73], 
        'sex': ['m', np.nan, 'f', 'm', 'f'], 
        'preTestScore': [4, np.nan, np.nan, 2, 3],
        'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df

In [None]:
# Drop missing observations

df_no_missing = df.dropna()
df_no_missing

In [None]:
# Drop rows where all cells in that row are NA

df_cleaned = df.dropna(how='all')
df_cleaned

In [None]:
# Create a new column full of missing values

df['location'] = np.nan
df

In [None]:
# Drop columns that contain only missing values

df.dropna(axis=1, how='all')

In [None]:
# Drop rows that contain fewer than four non-missing values

df.dropna(thresh=4)
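
The `thresh` argument counts non-missing values rather than missing ones: a row is kept only if it has at least `thresh` non-NA cells. A minimal sketch on a small made-up frame (the name `toy`, the column names `a`, `b`, `c`, and the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 has 3 non-missing values, row 1 has 2, row 2 has 0.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [4.0, np.nan, np.nan],
    "c": [7.0, 8.0, np.nan],
})

# Count the non-missing values in each row.
non_missing = toy.notna().sum(axis=1)

# Keep only the rows with at least 2 non-missing values.
kept = toy.dropna(thresh=2)

print(non_missing.tolist())
print(kept.index.tolist())
```

Comparing the per-row counts with the surviving index labels is a quick way to check that a chosen threshold does what was intended.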

In [None]:
# Fill in missing data with zeros

df.fillna(0)

In [None]:
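# Fill in missing values in preTestScore with the mean value of preTestScore

# Assign the result back: fillna(inplace=True) on a selected column may not update df in recent pandas versions
df["preTestScore"] = df["preTestScore"].fillna(df["preTestScore"].mean())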
df

In [None]:
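# Fill in missing values in postTestScore with each sex’s mean value of postTestScore

# Assign the result back: fillna(inplace=True) on a selected column may not update df in recent pandas versions
df["postTestScore"] = df["postTestScore"].fillna(df.groupby("sex")["postTestScore"].transform("mean"))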
df
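
The group-wise fill works because `transform("mean")` returns a Series with the same index as the original column, so `fillna` can line up each gap with the mean of its own group. A small sketch of that idea, using a made-up frame (the name `scores`, its columns, and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing score in each group.
scores = pd.DataFrame({
    "sex": ["m", "m", "f", "f", "f"],
    "score": [60.0, np.nan, 70.0, 90.0, np.nan],
})

# transform("mean") returns one value per row: the mean of that row's group.
group_mean = scores.groupby("sex")["score"].transform("mean")

# fillna aligns on the index, so each gap receives its own group's mean.
scores["score"] = scores["score"].fillna(group_mean)

print(scores["score"].tolist())
```

The missing "m" score is filled with the "m" mean and the missing "f" score with the "f" mean, rather than both receiving the overall mean.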

In [None]:
# Select some rows but ignore the missing data points

# Select the rows of df where age is not NaN and sex is not NaN
df[df['age'].notnull() & df['sex'].notnull()]

Challenge:
--
1. Import the raritan_river_data.csv file (http://mosaic.njaes.rutgers.edu/raritan-river/download/ boathouse site) into a pandas dataframe and show its first 5 rows.
2. First, drop all the rows that contain fewer than 8 non-null values.
3. Then, remove the columns that contain NaN values.
4. Replace the column names “buoy” and “ox” with “site” and “oxygen”.
5. Re-order the data by depth.
6. Show the top 5 rows with the highest oxygen value.
7. Plot a histogram of the frequency of depth with 20 bins.
8. Plot a scatter plot of oxygen versus temperature, with a size of 12 in by 6 in and a title of “oxygen v.s. temperature”.


API examples:
--

geopy is a Python 2 and 3 client for several popular geocoding web services.

geopy makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources.

geopy includes geocoder classes for the OpenStreetMap Nominatim, Google Geocoding API (V3), and many other geocoding services. The full list is available on the Geocoders doc section. Geocoder classes are located in geopy.geocoders.

geopy is tested against CPython (versions 2.7, 3.4, 3.5, 3.6, 3.7), PyPy, and PyPy3. geopy does not and will not support CPython 2.6.

© geopy contributors 2006-2018 (see AUTHORS) under the MIT License.



## Installation
Install using pip with:

pip install geopy --user

In [None]:
# To geolocate a query to an address and coordinates:

from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here")
location = geolocator.geocode("175 5th Avenue NYC")
print(location.address)
print('*'*50)
print((location.latitude, location.longitude))
print('*'*50)
print(location.raw)

In [None]:
# To find the address corresponding to a set of coordinates:

from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here")
location = geolocator.reverse("52.509669, 13.376294")
print(location.address)
print('*'*50)
print((location.latitude, location.longitude))
print('*'*50)
print(location.raw)

## Measuring Distance

geopy can calculate the distance between two points using either the geodesic distance or the great-circle distance. The geodesic distance is the default and is available as the function geopy.distance.distance.

Here's an example usage of the geodesic distance:

In [None]:
from geopy.distance import geodesic
newport_ri = (41.49008, -71.312796)
cleveland_oh = (41.499498, -81.695391)
print(geodesic(newport_ri, cleveland_oh).miles)

In [None]:
# Using great-circle distance:

from geopy.distance import great_circle
newport_ri = (41.49008, -71.312796)
cleveland_oh = (41.499498, -81.695391)
print(great_circle(newport_ri, cleveland_oh).miles)
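
For intuition, a great-circle distance can be approximated by hand with the haversine formula, which treats the Earth as a perfect sphere. The sketch below is not geopy's implementation; the helper name `haversine_miles` is made up here, and the mean Earth radius of 6371.009 km is an assumption.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.009  # assumed mean Earth radius
KM_PER_MILE = 1.609344

def haversine_miles(point_a, point_b):
    """Great-circle distance in miles between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    # Haversine term: squared half-chord length between the two points.
    half_chord = (sin((lat2 - lat1) / 2) ** 2
                  + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    # Central angle between the points, in radians.
    central_angle = 2 * asin(sqrt(half_chord))
    return EARTH_RADIUS_KM * central_angle / KM_PER_MILE

newport_ri = (41.49008, -71.312796)
cleveland_oh = (41.499498, -81.695391)
print(round(haversine_miles(newport_ri, cleveland_oh), 1))
```

The geodesic result differs slightly from a spherical calculation like this one because it models the Earth as an ellipsoid.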

A further look inside
--
Source: app.dataquest.io

Organizations host their APIs on Web servers. When you type www.google.com in your browser's address bar, your computer is actually asking the www.google.com server for a Web page, which it then returns to your browser.

APIs work much the same way, except instead of your Web browser asking for a Web page, your program asks for data. The API usually returns this data in JavaScript Object Notation (JSON) format. We'll discuss JSON more later on in this mission.

We make an API request to the Web server we want to get data from. The server then replies and sends it to us. In Python, we use the requests library to do this.

There are many different types of requests. The most common is a GET request, which we use to retrieve data. We'll explore the other types in later missions.

We can use a simple GET request to retrieve information from the OpenNotify API.

OpenNotify has several API endpoints. An endpoint is a server route for retrieving specific data from an API. For example, the /comments endpoint on the reddit API might retrieve information about comments, while the /users endpoint might retrieve data about users.

The first endpoint we'll look at on OpenNotify is the iss-now.json endpoint. This endpoint gets the current latitude and longitude position of the ISS. A data set wouldn't be a great fit for this task because the information changes often, and involves some calculation on the server.

Check out the complete list of OpenNotify endpoints.

# Instruction

The server will send a status code indicating the success or failure of your request. You can get the status code of the response from response.status_code.
Assign the status code to the variable status_code.

In [None]:
import requests
# Make a get request to get the latest position of the ISS from the OpenNotify API.
response = requests.get("http://api.open-notify.org/iss-now.json")
status_code = response.status_code
print(status_code)


The request we just made returned a status code of 200. Web servers return status codes every time they receive an API request. A status code provides information about what happened with a request. Here are some codes that are relevant to GET requests:

200 - Everything went okay, and the server returned a result (if any).<br>
301 - The server is redirecting you to a different endpoint. This can happen when a company switches domain names, or an endpoint's name has changed.<br>
400 - The server thinks you made a bad request. This can happen when you don't send the information the API requires to process your request, among other things.<br>
401 - The server thinks you're not authenticated. This happens when you don't send the right credentials to access an API (we'll talk about this in a later mission).<br>
403 - The resource you're trying to access is forbidden; you don't have the right permissions to see it.<br>
404 - The server didn't find the resource you tried to access.<br>
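
The codes above fall into broad ranges: 2xx for success, 3xx for redirects, and 4xx for client errors. A small sketch of how a program might label a status code before deciding what to do with a response; `describe_status` is a hypothetical helper written for this example, and the reason phrases come from Python's standard `http.HTTPStatus` enum.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label for an HTTP status code (hypothetical helper)."""
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    else:
        category = "other"
    # HTTPStatus supplies the standard reason phrase, e.g. "Not Found".
    return f"{code} {HTTPStatus(code).phrase} ({category})"

for code in (200, 301, 400, 401, 403, 404):
    print(describe_status(code))
```

In practice, the same check could be applied to `response.status_code` before trying to read the body of a response.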

# Instruction
Make a GET request to http://api.open-notify.org/iss-pass.

Assign the status code of the response to status_code.

In [None]:
# Enter your answer below.
response = requests.get("http://api.open-notify.org/iss-pass")
status_code = response.status_code
print(status_code)

iss-pass wasn't a valid endpoint, so the API's server sent us a 404 status code in response. We forgot to add .json at the end, like the API documentation tells us to do.

# Instruction
Make a GET request to http://api.open-notify.org/iss-pass.json.

Assign the status code of the response to status_code.


In [None]:
# Enter your answer below.
response = requests.get("http://api.open-notify.org/iss-pass.json")
status_code = response.status_code
print(status_code)

In the last example, we got a 400 status code, which indicates a bad request. The documentation for the OpenNotify API shows that the ISS Pass endpoint requires two parameters.

This endpoint returns the next time the ISS will pass over a given location on the Earth.

To request this information, we'll need to pass the coordinates for a specific location to the API. We do this by passing in two parameters, latitude and longitude.

To accomplish this, we can add an optional keyword argument, params, to our request. In this case, we need to pass in two parameters:

lat - The latitude of the location
lon - The longitude of the location

We can make a dictionary that contains these parameters, and then pass it into the function.

We can also do the same thing directly by adding the query parameters to the url, like this:

http://api.open-notify.org/iss-pass.json?lat=40.71&lon=-74

It's almost always preferable to set up the parameters as a dictionary, because the requests library we mentioned earlier takes care of certain issues, like properly formatting the query parameters.
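
To see what "properly formatting the query parameters" involves, the sketch below builds a query string with the standard library's `urllib.parse.urlencode`. This is only a stand-in for illustration: requests performs its own encoding internally, and the `q` parameter in the second call is hypothetical, not part of the OpenNotify API.

```python
from urllib.parse import urlencode

base_url = "http://api.open-notify.org/iss-pass.json"
parameters = {"lat": 40.71, "lon": -74}

# Encode the dictionary as a query string and append it to the base URL.
query_string = urlencode(parameters)
full_url = f"{base_url}?{query_string}"
print(full_url)

# Characters such as spaces and "&" are escaped automatically
# (the "q" parameter is hypothetical, used only to show escaping).
escaped = urlencode({"q": "New York & NJ"})
print(escaped)
```

Building the URL by hand would require doing this escaping manually, which is easy to get wrong once values contain spaces or reserved characters.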

In [None]:
# Set up the parameters we want to pass to the API.
# This is the latitude and longitude of New York City.
parameters = {"lat": 40.71, "lon": -74}

# Make a get request with the parameters.
response = requests.get("http://api.open-notify.org/iss-pass.json", params=parameters)

# Print the content of the response (the data the server returned)
print(response.content)

# This gets the same data as the command above
response = requests.get("http://api.open-notify.org/iss-pass.json?lat=40.71&lon=-74")
print(response.content)
# Repeat the request with the latitude and longitude of San Francisco.
parameters = {"lat": 37.78, "lon": -122.41}
response = requests.get("http://api.open-notify.org/iss-pass.json", params=parameters)
content = response.content



You may have noticed that the content of the API response we received earlier was a string (strictly speaking, response.content is a bytes object in Python 3). Strings are the way we pass information back and forth through APIs, but it's hard to get the information we want out of them. How do we know how to decode the string we receive and work with it in Python?

Luckily, there's a format we call JSON. We mentioned it earlier in the mission. This format encodes data structures like lists and dictionaries as strings to ensure that machines can read them easily. JSON is the primary format for sending and receiving data through APIs.

Python offers great support for JSON through its json library. We can convert lists and dictionaries to JSON, and vice versa. Our ISS Pass data, for example, is a dictionary encoded as a string in JSON format.

The json library has two main functions:

dumps -- Takes in a Python object, and converts it to a string
loads -- Takes a JSON string, and converts it to a Python object

# Instruction

Use the JSON function loads to convert fast_food_franchise_string to a Python object.

Assign the resulting Python object to fast_food_franchise_2.

In [None]:
# Make a list of fast food chains.
best_food_chains = ["Taco Bell", "Shake Shack", "Chipotle"]
print(type(best_food_chains))

# Import the JSON library.
import json

# Use json.dumps to convert best_food_chains to a string.
best_food_chains_string = json.dumps(best_food_chains)
print(type(best_food_chains_string))

# Convert best_food_chains_string back to a list.
print(type(json.loads(best_food_chains_string)))

# Make a dictionary
fast_food_franchise = {
    "Subway": 24722,
    "McDonalds": 14098,
    "Starbucks": 10821,
    "Pizza Hut": 7600
}

# We can also dump a dictionary to a string and load it.
fast_food_franchise_string = json.dumps(fast_food_franchise)
print(type(fast_food_franchise_string))
fast_food_franchise_2 = json.loads(fast_food_franchise_string)



We can get the content of a response as a Python object by using the .json() method on the response.

# Instruction

Get the duration value of the ISS' first pass over San Francisco and assign the value to first_pass_duration.

In [None]:
# Make the same request we did two screens ago.
parameters = {"lat": 37.78, "lon": -122.41}
response = requests.get("http://api.open-notify.org/iss-pass.json", params=parameters)

# Get the response data as a Python object.  Verify that it's a dictionary.
json_data = response.json()
print(type(json_data))
print(json_data)
first_pass_duration = json_data["response"][0]["duration"]
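
Once the response is a dictionary, the rest is ordinary Python indexing. The sketch below works offline on a made-up dictionary shaped like the data used above; the numbers are not real data, and the `risetime` field (a Unix timestamp for the start of each pass) is an assumption about the response format.

```python
from datetime import datetime, timezone

# Made-up sample shaped like an ISS Pass response; the numbers are not real data.
sample = {
    "message": "success",
    "response": [
        {"duration": 600, "risetime": 1500000000},
        {"duration": 450, "risetime": 1500005800},
    ],
}

# Convert each pass into a (UTC start time, duration in seconds) pair.
passes = [
    (datetime.fromtimestamp(p["risetime"], tz=timezone.utc), p["duration"])
    for p in sample["response"]
]

for start, duration in passes:
    print(start.isoformat(), duration)
```

Looping over the whole list, rather than indexing only element 0, gives every upcoming pass instead of just the first one.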

The server sends more than a status code and the data when it generates a response. It also sends metadata containing information on how it generated the data and how to decode it. This information appears in the response headers. We can access it using the .headers property that responses have.

The headers will appear as a dictionary. For now, the content-type within the headers is the most important key. It tells us the format of the response, and how to decode it. For the OpenNotify API, the format is JSON, which is why we could decode it with JSON earlier.

# Instruction
Get content-type from response.headers.

Assign the content type to the content_type variable.

In [None]:
# Headers is a dictionary
print(response.headers)
content_type = response.headers["content-type"]

OpenNotify has one more API endpoint, astros.json. It tells us how many people are currently in space. The format of the responses is described in the OpenNotify documentation.

# Instruction
Find how many people are currently in space.

Assign the result to in_space_count.

In [None]:
# Call the API here.
response = requests.get("http://api.open-notify.org/astros.json")
json_data = response.json()

in_space_count = json_data["number"]
in_space_count

# Takeaways
Syntax
Accessing the content of the data the server returns:

response.content

Importing the JSON library:

import json

Getting the content of a response as a Python object:

response.json()

Accessing the information on how the server generated the data, and how to decode the data:

response.headers



Challenge:
--
1. Read the tutorial: http://abdulbaqi.io/2017/09/13/Wdi/
2. Retrieve the USA average annual temperatures of 1901-2012
3. Plot the temperature trend

Python Web Scraping Tutorial using BeautifulSoup
--
source: dataquest.io

When performing data science tasks, it's common to want to use data found on the internet. You'll usually be able to access this data in csv format, or via an Application Programming Interface (API). However, there are times when the data you want can only be accessed as part of a web page. In cases like this, you'll want to use a technique called web scraping to get the data from the web page into a format you can work with in your analysis.

In this tutorial, we'll show you how to perform web scraping using Python 3 and the BeautifulSoup library. We'll be scraping weather forecasts from the National Weather Service, and then analyzing them using the Pandas library.

Before we get started, if you're looking for more background on APIs or the csv format, you might want to check out our Dataquest courses on APIs or data analysis.

## The components of a web page
When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:

HTML - contains the main content of the page.
CSS - adds styling to make the page look nicer.
JS - JavaScript files add interactivity to web pages.
Images - image formats, such as JPG and PNG, allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look at the HTML.

## HTML
HyperText Markup Language (HTML) is the language that web pages are written in. HTML isn’t a programming language like Python; instead, it’s a markup language that tells a browser how to lay out content. HTML allows you to do similar things to what you do in a word processor like Microsoft Word: make text bold, create paragraphs, and so on. Because HTML isn’t a programming language, it isn’t nearly as complex as Python.

Let’s take a quick tour through HTML so we know enough to scrape effectively. HTML consists of elements called tags. The most basic tag is the <html> tag. This tag tells the web browser that everything inside of it is HTML. We can make a simple HTML document just using this tag:

In [None]:
from IPython.display import HTML
HTML('''
<html>
    
</html>''')

We haven’t added any content to our page yet, so if we viewed our HTML document in a web browser, we wouldn’t see anything.

Right inside an html tag, we put two other tags, the head tag, and the body tag. The main content of the web page goes into the body tag. The head tag contains data about the title of the page, and other information that generally isn’t useful in web scraping:

In [None]:
HTML('''
<html>
<head>
</head>
<body>
</body>
</html>''')

We still haven’t added any content to our page (that goes inside the body tag), so we again won’t see anything.

You may have noticed above that we put the head and body tags inside the html tag. In HTML, tags are nested, and can go inside other tags.

We’ll now add our first content to the page, in the form of the p tag. The p tag defines a paragraph, and any text inside the tag is shown as a separate paragraph:

In [None]:
HTML('''
<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
        </p>
        <p>
            Here's a second paragraph of text!
        </p>
    </body>
</html>''')

Here’s how this will look:

Here’s a paragraph of text!

Here’s a second paragraph of text!

Tags have commonly used names that depend on their position in relation to other tags:

child - a child is a tag inside another tag. So the two p tags above are both children of the body tag.
parent - a parent is the tag another tag is inside. Above, the html tag is the parent of the body tag.
sibling - a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they’re both inside html. Both p tags are siblings, since they’re both inside body.

We can also add properties to HTML tags that change their behavior:

In [None]:
HTML('''
<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
            <a href="https://www.dataquest.io">Learn Data Science Online</a>
        </p>
        <p>
            Here's a second paragraph of text!
            <a href="https://www.python.org">Python</a>
        </p>
    </body>
</html>''')

Here’s how this will look:

Here’s a paragraph of text! Learn Data Science Online

Here’s a second paragraph of text! Python

In the above example, we added two a tags. a tags are links, and tell the browser to render a link to another web page. The href property of the tag determines where the link goes.

a and p are extremely common html tags. Here are a few others:

div - indicates a division, or area, of the page.
b - bolds any text inside.
i - italicizes any text inside.
table - creates a table.
form - creates an input form.

For a full list of tags, consult an HTML element reference.

Before we move into actual web scraping, let’s learn about the class and id properties. These special properties give HTML elements names, and make them easier to interact with when we’re scraping. One element can have multiple classes, and a class can be shared between elements. Each element can only have one id, and an id can only be used once on a page. Classes and ids are optional, and not all elements will have them.

We can add classes and ids to our example:

In [None]:
HTML('''
<html>
    <head>
    </head>
    <body>
        <p class="bold-paragraph">
            Here's a paragraph of text!
            <a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
        </p>
        <p class="bold-paragraph extra-large">
            Here's a second paragraph of text!
            <a href="https://www.python.org" class="extra-large">Python</a>
        </p>
    </body>
</html>''')

Here’s how this will look:

Here’s a paragraph of text! Learn Data Science Online

Here’s a second paragraph of text! Python

As you can see, adding classes and ids doesn’t change how the tags are rendered at all.
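
Although classes and ids don't affect rendering, they give a scraper convenient handles for picking out elements. A minimal sketch using BeautifulSoup on the example markup above, parsed from a string rather than fetched from a live page (the variable names are chosen for this example):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <p class="bold-paragraph">
    Here's a paragraph of text!
    <a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
  </p>
  <p class="bold-paragraph extra-large">
    Here's a second paragraph of text!
    <a href="https://www.python.org" class="extra-large">Python</a>
  </p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches any element that carries the class, even among several classes.
bold_paragraphs = soup.find_all("p", class_="bold-paragraph")
extra_large = soup.find_all(class_="extra-large")

# An id is unique on the page, so find() returns a single element.
learn_link = soup.find(id="learn-link")

print(len(bold_paragraphs), len(extra_large))
print(learn_link["href"], learn_link.get_text())
```

Here the class search returns both paragraphs, the `extra-large` search returns one paragraph and one link, and the id search returns exactly one link.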



In [None]:
# Web Scraping Example

# Import Beautiful Soup and the Requests library
from bs4 import BeautifulSoup
import requests

# The URL that we've already determined is safe and legal to scrape from
page_link = 'https://cee.rutgers.edu/'
# Fetch the content from the URL using the requests library
page_response = requests.get(page_link, timeout=5)
# Parse the page content with the HTML parser and store it in a variable
page_content = BeautifulSoup(page_response.content, "html.parser")
textContent = []

In [None]:
print(page_content.prettify())

In [None]:
soup=page_content

soup.title

In [None]:
soup.title.name

In [None]:
soup.title.string

In [None]:
soup.title.parent.name

In [None]:
soup.p

In [None]:
soup.a

In [None]:
soup.find_all('a')

In [None]:
soup.find(id="menu-696-1")

In [None]:
# One common task is extracting all the URLs found within a page’s <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))

In [None]:
# Another common task is extracting all the text from a page:

print(soup.get_text())


Challenge:
--
1. Open the website at https://airnow.gov/
2. Find the webpage showing the local air quality at Rutgers
3. Scrape the air quality data