### Task 1

Name: Sai Srujan<br>
API: USGS Earthquake Catalog

This notebook covers Task 1 - Data Collection.<br>
In this task we collect earthquake data from the USGS Earthquake Catalog for the last three years (2019, 2020, 2021).
The API limits how much data a single request can return, so the complete dataset cannot be fetched in one call. Instead, we call the API once per month, which normally keeps each response within the limit.

In [1]:
import json
import urllib.parse
import urllib.request
from pathlib import Path
import pandas as pd
from datetime import datetime

Common prefix of the URL used to invoke the API, plus the other parameters required to collect the data.

In [2]:
# Common Prefix for API URLs
api_prefix = "https://earthquake.usgs.gov/fdsnws/event/1"
# Start and End Dates of various months to get data
api_dates = [{"starttime":"01-01", "endtime":"01-31", "month":"Jan"},
             {"starttime":"02-01", "endtime":"02-29", "month":"Feb"},
             {"starttime":"03-01", "endtime":"03-31", "month":"Mar"},
             {"starttime":"04-01", "endtime":"04-30", "month":"Apr"},
             {"starttime":"05-01", "endtime":"05-31", "month":"May"},
             {"starttime":"06-01", "endtime":"06-30", "month":"Jun"},
             {"starttime":"07-01", "endtime":"07-31", "month":"Jul"},
             {"starttime":"08-01", "endtime":"08-31", "month":"Aug"},
             {"starttime":"09-01", "endtime":"09-30", "month":"Sep"},
             {"starttime":"10-01", "endtime":"10-31", "month":"Oct"},
             {"starttime":"11-01", "endtime":"11-30", "month":"Nov"},
             {"starttime":"12-01", "endtime":"12-31", "month":"Dec"}]
# The various years for which the data has to be collected
api_years = [2019, 2020, 2021]
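
The table above hard-codes `02-29` as the end of February for every year, although 2019 and 2021 are not leap years. As a hedged alternative, the month boundaries could be derived from the standard library's `calendar` module. This is a minimal sketch, not part of the original notebook: `month_ranges` is a hypothetical helper name, and it returns dictionaries in the same shape that the collection loop builds, with the year already filled in.

```python
import calendar

def month_ranges(year):
    """Return one {"starttime", "endtime", "month", "year"} dict per month of `year`."""
    ranges = []
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month)
        last_day = calendar.monthrange(year, month)[1]
        ranges.append({
            "starttime": "%d-%02d-01" % (year, month),
            "endtime": "%d-%02d-%02d" % (year, month, last_day),
            "month": calendar.month_abbr[month],  # e.g. "Jan" in the default locale
            "year": year,
        })
    return ranges

feb_2019 = month_ranges(2019)[1]
feb_2020 = month_ranges(2020)[1]
print(feb_2019["endtime"], feb_2020["endtime"])  # 2019-02-28 2020-02-29
```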

Create a directory for the raw earthquake data files if it does not already exist.

In [3]:
dir_eq_raw = Path("./raw_eq_files")
dir_eq_raw.mkdir(parents=True, exist_ok=True)

Retrieves the required data from the API for the specified endpoint and parameters. <br>
If the API call fails, the method prints the URL that failed and returns None instead of raising an exception.

In [4]:
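def getData(endpoint, params):
    # Build the full URL from the endpoint and the URL-encoded query parameters
    api_url = endpoint
    api_url += "?" + urllib.parse.urlencode(params)
    try:
        response = urllib.request.urlopen(api_url)
        api_data = response.read().decode()
        return api_data
    except Exception:
        # Report the failing URL; callers treat a None result as a failed call
        print("Failed to retrieve %s" % api_url)
        return None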
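
To see exactly which URL `getData` requests, the query string can be built separately, without any network call. This is a minimal sketch using the same `urllib.parse.urlencode` call as above; `build_url` is a hypothetical helper introduced only for illustration.

```python
import urllib.parse

def build_url(endpoint, params):
    """Join an endpoint and a dict of query parameters into a full URL."""
    return endpoint + "?" + urllib.parse.urlencode(params)

params = {"format": "geojson", "starttime": "2019-07-01",
          "endtime": "2019-07-31", "eventtype": "earthquake"}
url = build_url("https://earthquake.usgs.gov/fdsnws/event/1/query", params)
print(url)
```

The parameters are encoded in dictionary order, so the printed URL has the same form as the one reported in a "Failed to retrieve" message.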

Checks whether the data retrieved from the /query endpoint is correct by comparing the month of the first record with the month for which the API call was made. It also checks that the data is complete by comparing the number of records with the count returned by the /count endpoint.

In [5]:
def check_month(data, count, date):
    earthquake_data_df = pd.DataFrame(data['features'])
    # Comparing the length of data retrieved with the count value obtained from the API
    if(len(earthquake_data_df['properties']) == json.loads(count)['count']):
        earthquake_prop_df = pd.DataFrame(list(earthquake_data_df['properties']))
        # Comparing to check if the data retrieved was for the correct month
        if(pd.to_datetime(earthquake_prop_df['time'], unit='ms')[0].month == datetime.strptime(date,"%Y-%m-%d").month):
            return True
    return False
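
The month comparison relies on the `time` property being a Unix timestamp in milliseconds, which is how `check_month` interprets it (`unit='ms'`). Below is a minimal sketch of the same comparison using only the standard library, on a made-up record shaped like one entry of `features`. The timestamp value and the helper name `first_event_month` are assumptions for illustration.

```python
from datetime import datetime, timezone

def first_event_month(data):
    """Return the UTC month (1-12) of the first feature's 'time' property."""
    ms = data["features"][0]["properties"]["time"]
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).month

# Made-up sample: 1562025600000 ms is 2019-07-02 00:00:00 UTC
sample = {"features": [{"properties": {"time": 1562025600000}}]}
requested = datetime.strptime("2019-07-01", "%Y-%m-%d")
month_matches = first_event_month(sample) == requested.month
print(month_matches)
```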

This method retrieves the earthquake data for a specified period and, if the data was retrieved correctly, stores it in the directory created earlier. To verify the data, we call check_month() and store the data only if it returns True.

In [6]:
def getMonthlyData(date):
    # Params for the url to retrieve the data accordingly
    params = {"format":"geojson", "starttime":date['starttime'], "endtime":date['endtime'], "eventtype":"earthquake"}
    # To retrieve the earthquake data for the specified params
    url = api_prefix + "/query"
    earthquake_data = getData(url, params)
    # If the API call is successful and data is retrieved, only then we check if data is correct and store it.
    if(pd.notnull(earthquake_data)):
        data = json.loads(earthquake_data)
        # To retrieve the count of the data present for the specified params
        url = api_prefix + "/count"
        earthquake_data_count = getData(url, params)
        
        # To check if the data we are fetching has the same count and belongs to the same month of the year
        if(check_month(data, earthquake_data_count, date['starttime'])):
            # The filename for the data would be earthquake-(month)-(year). Month and Year will be set based on data retrieved.
            file_name = "earthquake-%s-%s.json" % (date['month'],date['year'])
            # Saves the file in the directory created
            try:
                path = dir_eq_raw / file_name
                print("Writing %s-%s data to %s" % (date['month'],date['year'],path))
                # The context manager closes the file even if json.dump raises
                with open(path, "w") as file:
                    json.dump(data, file, indent=4)
            except IOError:
                print("Error in writing to file")
        else:
            print("Data is not retrieved properly for %s, %s" % (date['month'], date['year']))

In [7]:
"""
Retrieve the data for all the years specified by substituting the year in the date period defined in api_dates.
The params for the API to retrieve data would be created accordingly and the call would be made.
"""
for year in api_years:
    for dates in api_dates:
        api_date_params = dates.copy()
        api_date_params['starttime'] = str(year) +  "-" + api_date_params['starttime']
        api_date_params['endtime'] = str(year) +  "-" + api_date_params['endtime']
        api_date_params['year'] = year
        getMonthlyData(api_date_params)
    

Writing Jan-2019 data to raw_eq_files\earthquake-Jan-2019.json
Writing Feb-2019 data to raw_eq_files\earthquake-Feb-2019.json
Writing Mar-2019 data to raw_eq_files\earthquake-Mar-2019.json
Writing Apr-2019 data to raw_eq_files\earthquake-Apr-2019.json
Writing May-2019 data to raw_eq_files\earthquake-May-2019.json
Writing Jun-2019 data to raw_eq_files\earthquake-Jun-2019.json
Failed to retrieve https://earthquake.usgs.gov/fdsnws/event/1/query?format=geojson&starttime=2019-07-01&endtime=2019-07-31&eventtype=earthquake
Writing Aug-2019 data to raw_eq_files\earthquake-Aug-2019.json
Writing Sep-2019 data to raw_eq_files\earthquake-Sep-2019.json
Writing Oct-2019 data to raw_eq_files\earthquake-Oct-2019.json
Writing Nov-2019 data to raw_eq_files\earthquake-Nov-2019.json
Writing Dec-2019 data to raw_eq_files\earthquake-Dec-2019.json
Writing Jan-2020 data to raw_eq_files\earthquake-Jan-2020.json
Writing Feb-2020 data to raw_eq_files\earthquake-Feb-2020.json
Writing Mar-2020 data to raw_eq_files
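
A failed month produces no file, so one way to find the months that need to be fetched again is to compare the expected file names with the directory contents. This is a minimal sketch, assuming the `earthquake-(month)-(year).json` naming used in this notebook; `missing_files` is a hypothetical helper, and the demonstration runs against a temporary directory rather than the real `raw_eq_files`.

```python
import tempfile
from pathlib import Path

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def missing_files(directory, years, months=MONTHS):
    """List expected earthquake-(month)-(year).json names absent from `directory`."""
    directory = Path(directory)
    expected = ["earthquake-%s-%s.json" % (m, y) for y in years for m in months]
    return [name for name in expected if not (directory / name).exists()]

# Demonstration on a temporary directory where only July 2019 is absent
with tempfile.TemporaryDirectory() as tmp:
    for m in MONTHS:
        if m != "Jul":
            (Path(tmp) / ("earthquake-%s-2019.json" % m)).touch()
    demo_missing = missing_files(tmp, [2019])
print(demo_missing)  # ['earthquake-Jul-2019.json']
```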

If the amount of data for a particular month exceeds the maximum that the API can return in a single call, the call fails, as seen above for July 2019. For these cases we create a method that takes the month split into shorter date periods (e.g. every 10 days) and calls the API once per period.
This method combines the data retrieved for the periods of the month into a single JSON object and then stores it as the file for that month.

In [8]:
def combineMonthlyData(dates):
    total_data = {}
    month = ""
    year = ""
    for date in dates:
        month = date['month']
        year = date['year']
        # Params for the url to retrieve the data accordingly
        params = {"format":"geojson", "starttime":date['starttime'], "endtime":date['endtime'], "eventtype":"earthquake"}
        # To retrieve the earthquake data for the specified params
        url = api_prefix + "/query"
        earthquake_data = getData(url, params)
        if(pd.notnull(earthquake_data)):
            # To retrieve the count of the data present for the specified params
            url = api_prefix + "/count"
            earthquake_data_count = getData(url, params)
            data = json.loads(earthquake_data)
            # To check if the data we are fetching has the same count and belongs to the same month of the year
            if(check_month(data, earthquake_data_count, date['starttime'])):    
                if(not total_data):
                    total_data = data
                else:
                    # Combine the data into a single json for a particular month
                    for key, value in data.items():
                        if key in total_data and key in data:
                            if(type(total_data[key]) == list and type(value) == list):
                                total_data[key] = total_data[key] + value
                            else:
                                total_data[key] = [total_data[key]] + [value]
            else:
                print("Data is not retrieved properly for %s, %s, %s" % (month,year,date['starttime']))
    if(total_data):
        # The filename for the data would be earthquake-(month)-(year). Month and Year will be set based on data retrieved.
        file_name = "earthquake-%s-%s.json" % (month, year)
        # Saves the file in the directory created
        try:
            path = dir_eq_raw / file_name
            print("Writing %s-%s data to %s" % (month,year,path))
            # The context manager closes the file even if json.dump raises
            with open(path, "w") as file:
                json.dump(total_data, file, indent=4)
        except IOError:
            print("Error in writing to file")
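
The merge above treats every top-level key the same way, so a non-list value such as `type` ends up wrapped in a list once a second period is combined. As a hedged alternative (not what the cell above does), the combined file could keep the first response as the base and concatenate only the `features` lists. This is a minimal sketch on made-up data; `merge_feature_collections` is a hypothetical helper name.

```python
import copy

def merge_feature_collections(collections):
    """Merge GeoJSON-like dicts by concatenating their 'features' lists."""
    if not collections:
        return {}
    # Deep-copy the first response so the inputs are left unmodified
    merged = copy.deepcopy(collections[0])
    merged["features"] = list(merged.get("features", []))
    for extra in collections[1:]:
        merged["features"].extend(extra.get("features", []))
    return merged

# Made-up responses standing in for three date periods of one month
part1 = {"type": "FeatureCollection", "features": [{"id": "a"}, {"id": "b"}]}
part2 = {"type": "FeatureCollection", "features": [{"id": "c"}]}
part3 = {"type": "FeatureCollection", "features": []}
merged = merge_feature_collections([part1, part2, part3])
print(merged["type"], len(merged["features"]))  # FeatureCollection 3
```

Other top-level fields are copied from the first response unchanged, so any per-response counts they contain would still describe only the first period.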

In [9]:
# Split the dates for July 2019 into every 10 days and then call the method to get the combined data for the month
api_dates_july = [{"starttime":"2019-07-01", "endtime":"2019-07-10", "month":"Jul", "year":"2019"},
            {"starttime":"2019-07-11", "endtime":"2019-07-20", "month":"Jul","year":"2019"},
            {"starttime":"2019-07-21", "endtime":"2019-07-31", "month":"Jul","year":"2019"}]
combineMonthlyData(api_dates_july)

Writing Jul-2019 data to raw_eq_files\earthquake-Jul-2019.json
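
The three July periods above were written by hand. If more months exceed the limit, the periods could be generated instead. This is a minimal sketch, assuming the same dictionary shape that `combineMonthlyData` expects; `split_month` and its `step` parameter are hypothetical names, not part of the original notebook.

```python
import calendar

def split_month(year, month, step=10):
    """Split a month into consecutive periods of about `step` days each."""
    last_day = calendar.monthrange(year, month)[1]
    starts = list(range(1, last_day + 1, step))
    # Fold a short trailing period (under half a step) into the previous one,
    # so July becomes 1-10, 11-20, 21-31 instead of ending with a 1-day period
    if len(starts) > 1 and last_day - starts[-1] + 1 < step / 2:
        starts.pop()
    windows = []
    for i, start in enumerate(starts):
        end = starts[i + 1] - 1 if i + 1 < len(starts) else last_day
        windows.append({
            "starttime": "%d-%02d-%02d" % (year, month, start),
            "endtime": "%d-%02d-%02d" % (year, month, end),
            "month": calendar.month_abbr[month],
            "year": str(year),
        })
    return windows

for window in split_month(2019, 7):
    print(window["starttime"], window["endtime"])
```

For July 2019 this reproduces the three hand-written periods, so the result could be passed on directly, for example `combineMonthlyData(split_month(2019, 7))`.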
