## <span style=color:blue>This notebook illustrates how to extract ag data from USDA NASS, and weather data from NASA POWER   </span>

### <span style=color:blue>This notebook is very preliminary and cuts a lot of corners; I plan to improve it in the coming days... </span>

In [1]:
# This is useful if I want to give unique names to directories or files
import datetime
def curr_timestamp():
    current_datetime = datetime.datetime.now()
    formatted_datetime = current_datetime.strftime("%Y-%m-%d_%H-%M-%S")
    return formatted_datetime

### <span style=color:blue> Accessing USDA NASS, following code from https://towardsdatascience.com/harvest-and-analyze-agricultural-data-with-the-usda-nass-api-python-and-tableau-a6af374b8138.  In the first cell below we define a class for interacting with the NASS QuickStats API, and in the second cell we illustrate how to invoke that class </span>

In [10]:
# from https://towardsdatascience.com/harvest-and-analyze-agricultural-data-with-the-usda-nass-api-python-and-tableau-a6af374b8138
# with edits

#   Name:           c_usda_quick_stats.py
#   Author:         Randy Runtsch
#   Date:           March 29, 2022
#   Project:        Query USDA QuickStats API
#   Author:         Randall P. Runtsch
#
#   Description:    Query the USDA QuickStats api_GET API with a specified set of 
#                   parameters. Write the retrieved data, in CSV format, to a file.
#
#   See Quick Stats (NASS) API user guide:  https://quickstats.nass.usda.gov/api
#   Request a QuickStats API key here:      https://quickstats.nass.usda.gov/api#param_define
#
#   Attribution: This product uses the NASS API but is not endorsed or certified by NASS.
#
#   Changes
#

import urllib.request
from requests.utils import requote_uri
import requests

# One has to get a NASS API key - please get your own
my_NASS_API_key = '8932BB3A-66B1-3140-8FCF-9EF701D75C7B'

class c_usda_quick_stats:

    def __init__(self):

        # Set the USDA QuickStats API key, API base URL, and output file path where CSV files will be written. 

        # self.api_key = 'PASTE_YOUR_API_KEY_HERE'
        self.api_key = my_NASS_API_key

        self.base_url_api_get = 'http://quickstats.nass.usda.gov/api/api_GET/?key=' + self.api_key + '&'

        # self.output_file_path = r'c:\\usda_quickstats_files\\'
        self.output_file_path = '/Users/rick/AG-CODE--v03/USDA-NASS--v01/OUTPUTS/'

    def get_data(self, parameters, file_name):

        # Call the api_GET api with the specified parameters. 
        # Write the CSV data to the specified output file.

        # Create the full URL and retrieve the data from the Quick Stats server.
        
        full_url = self.base_url_api_get + parameters
        
        print(full_url)
        try:
            # urlopen raises urllib.error.HTTPError / URLError on failure
            with urllib.request.urlopen(full_url) as s_result:
                print(s_result.status, s_result.reason)
                s_text = s_result.read().decode('utf-8')

            # Create the output file and write the CSV data records to the file.

            s_file_name = self.output_file_path + file_name + ".csv"
            with open(s_file_name, "w", encoding="utf8") as o_file:
                o_file.write(s_text)
        except urllib.error.URLError as e:
            # The request is made with urllib (not requests), so catch urllib's
            # exceptions; HTTPError is a subclass of URLError.
            print(f"An error occurred while fetching the data (perhaps the USDA NASS site is down): {e}")
        except UnicodeDecodeError as e:
            print(f"Failed to decode the response data: {e}")
        except OSError as e:
            print(f"Failed to write the output file: {e}")


In [11]:
# from https://towardsdatascience.com/harvest-and-analyze-agricultural-data-with-the-usda-nass-api-python-and-tableau-a6af374b8138
# with edits

#   Date:           March 29, 2022
#   Project:        Program controller to query USDA QuickStats API
#   Author:         Randall P. Runtsch
#
#   Description:    Create an instance of the c_usda_quick_stats class. Call it with
#                   the desired search parameter and output file name.
#
#   Attribution: This product uses the NASS API but is not endorsed or certified by NASS.
#
#   Changes
#

import sys

# sys.path.append('/Users/rick/AG-CODE--v03/USDA-NASS--v01/')
# from c_usda_quick_stats import c_usda_quick_stats
import urllib.parse

# Create an instance of the c_usda_quick_stats class. Call it with search parameters
# and the output file to write the returned CSV data into.



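# The QuickStats site is very sensitive to how the full URL is built up.
# For example, the following spec for the parameters works.
# But if you replace the line "'&unit_desc=ACRES' + \" with
# the line "'&' + urllib.parse.quote('unit_desc=ACRES') + \"
# then the site responds saying that you have exceeded the 50,000 record limit for one query.
# A likely cause: quote() also encodes '=' (as '%3D'), so the server no longer sees
# a 'unit_desc' filter. The quoted lines below are encoded the same way (see the
# printed URL), so those filters may not be applied either; quoting only the value,
# e.g. '&unit_desc=' + urllib.parse.quote('ACRES'), avoids this.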

parameters =    'source_desc=SURVEY' +  \
                '&' + urllib.parse.quote('sector_desc=FARMS & LANDS & ASSETS') + \
                '&' + urllib.parse.quote('commodity_desc=FARM OPERATIONS') + \
                '&' + urllib.parse.quote('statisticcat_desc=AREA OPERATED') + \
                '&unit_desc=ACRES' + \
                '&freq_desc=ANNUAL' + \
                '&reference_period_desc=YEAR' + \
                '&year__GE=1997' + \
                '&agg_level_desc=NATIONAL' + \
                '&' + urllib.parse.quote('state_name=US TOTAL') + \
                '&format=CSV'

stats = c_usda_quick_stats()

# Including curr_timestamp() into file name to keep outputs separated during development/exploration
# get_data() writes the CSV file and returns nothing, so there is no result to capture
stats.get_data(parameters, 'national_farm_survey_acres_ge_1997_' + curr_timestamp())

http://quickstats.nass.usda.gov/api/api_GET/?key=8932BB3A-66B1-3140-8FCF-9EF701D75C7B&source_desc=SURVEY&sector_desc%3DFARMS%20%26%20LANDS%20%26%20ASSETS&commodity_desc%3DFARM%20OPERATIONS&statisticcat_desc%3DAREA%20OPERATED&unit_desc=ACRES&freq_desc=ANNUAL&reference_period_desc=YEAR&year__GE=1997&agg_level_desc=NATIONAL&state_name%3DUS%20TOTAL&format=CSV
<class 'http.client.HTTPResponse'>
200 OK
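
The URL sensitivity noted in the cell above can be sidestepped by encoding keys and values separately. The following is a minimal sketch, not part of the original article's code: `build_nass_query` and `nass_params` are hypothetical names, and the filter values are simply the ones used above. It relies on `urllib.parse.urlencode`, which percent-encodes each key and value on its own and joins them with `=` and `&`.

```python
import urllib.parse


def build_nass_query(params):
    """Encode a dict of Quick Stats filters as 'key=value&key=value'.

    Keys and values are percent-encoded separately, so the '=' and '&'
    separators themselves are never encoded.
    """
    # quote_via=quote encodes spaces as %20 (the default quote_plus would use '+')
    return urllib.parse.urlencode(params, quote_via=urllib.parse.quote)


# Hypothetical parameter dict mirroring the filters used in the cell above
nass_params = {
    'source_desc': 'SURVEY',
    'sector_desc': 'FARMS & LANDS & ASSETS',
    'commodity_desc': 'FARM OPERATIONS',
    'statisticcat_desc': 'AREA OPERATED',
    'unit_desc': 'ACRES',
    'freq_desc': 'ANNUAL',
    'reference_period_desc': 'YEAR',
    'year__GE': 1997,
    'agg_level_desc': 'NATIONAL',
    'state_name': 'US TOTAL',
    'format': 'CSV',
}

parameters = build_nass_query(nass_params)
print(parameters)
```

The resulting string could be passed to `get_data` in place of the hand-built one. Whether the extra filters change the rows returned by the server has not been checked here.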


###  <span style=color:blue>Now obtaining NASA POWER data for one point within each county of Indiana (should do it for more states!) </span>

<span style=color:blue>First, I got a list of all counties in Indiana, and then used geopy to get a lat/lon point for each county.  (I cheated and asked ChatGPT for the list of counties; really I should have accessed the list using geopy -- will do that in a next iteration.)    </span>

In [12]:
from geopy.geocoders import Nominatim
import pandas as pd

working_dir = '/Users/rick/AG-CODE--v03/ML/'
target_file = 'county_lat_long.csv'


# List of Indiana counties
counties = [
    "Adams", "Allen", "Bartholomew", "Benton", "Blackford", "Boone", "Brown", "Carroll",
    "Cass", "Clark", "Clay", "Clinton", "Crawford", "Daviess", "Dearborn", "Decatur",
    "DeKalb", "Delaware", "Dubois", "Elkhart", "Fayette", "Floyd", "Fountain", "Franklin",
    "Fulton", "Gibson", "Grant", "Greene", "Hamilton", "Hancock", "Harrison", "Hendricks",
    "Henry", "Howard", "Huntington", "Jackson", "Jasper", "Jay", "Jefferson", "Jennings",
    "Johnson", "Knox", "Kosciusko", "LaGrange", "Lake", "LaPorte", "Lawrence", "Madison",
    "Marion", "Marshall", "Martin", "Miami", "Monroe", "Montgomery", "Morgan", "Newton",
    "Noble", "Ohio", "Orange", "Owen", "Parke", "Perry", "Pike", "Porter", "Posey", "Pulaski",
    "Putnam", "Randolph", "Ripley", "Rush", "St. Joseph", "Scott", "Shelby", "Spencer", "Starke",
    "Steuben", "Sullivan", "Switzerland", "Tippecanoe", "Tipton", "Union", "Vanderburgh",
    "Vermillion", "Vigo", "Wabash", "Warren", "Warrick", "Washington", "Wayne", "Wells",
    "White", "Whitley"
]

# Geocoding function to retrieve coordinates for a county
def geocode_county(county):
    geolocator = Nominatim(user_agent="county_geocoder")
    location = geolocator.geocode(county + ", Indiana, USA")
    if location:
        return location.latitude, location.longitude
    else:
        return None, None

# Create a DataFrame to store the county data
df = pd.DataFrame(counties, columns=["County"])
df["Latitude"], df["Longitude"] = zip(*df["County"].map(geocode_county))
df['State'] = 'Indiana'

# Save the data to a CSV file
save_path = working_dir + target_file  # Replace with the desired save path
df.to_csv(save_path, index=False)
print(f"Data saved to: {save_path}")


Data saved to: /Users/rick/AG-CODE--v03/ML/county_lat_long.csv
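
The geocoding cell above sends one Nominatim request per county with no pause between calls, and it queries "Adams, Indiana, USA" rather than "Adams County, Indiana, USA". Nominatim's public usage policy asks for at most one request per second, so a slower loop is safer. Below is a hedged sketch of one way to do that; `geocode_counties` is a hypothetical helper, and it accepts any geocoding callable so it can be tried without network access.

```python
import time


def geocode_counties(counties, geocode, state="Indiana", pause_s=1.0):
    """Return {county: (latitude, longitude)} for each county name.

    `geocode` is any callable that takes a query string and returns an object
    with .latitude and .longitude attributes, or None when nothing is found
    (for example, the .geocode method of a geopy Nominatim instance).
    """
    results = {}
    for i, county in enumerate(counties):
        if i > 0:
            time.sleep(pause_s)  # pause between requests to respect rate limits
        # Assumption: adding "County" steers the match away from same-named towns
        location = geocode(f"{county} County, {state}, USA")
        if location is not None:
            results[county] = (location.latitude, location.longitude)
        else:
            results[county] = (None, None)
    return results


# Possible usage with geopy (not executed here because it needs network access):
# from geopy.geocoders import Nominatim
# geolocator = Nominatim(user_agent="county_geocoder")
# coords = geocode_counties(counties, geolocator.geocode)
```

geopy also ships `geopy.extra.rate_limiter.RateLimiter`, which could wrap `geolocator.geocode` instead of the manual `time.sleep`.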


In [13]:
# looking at the contents of the csv

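# same directory as in the previous cell ('ML', not 'ml'; the case matters on case-sensitive file systems)
working_dir = '/Users/rick/AG-CODE--v03/ML/'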
filename = 'county_lat_long.csv'

df_ind_cty = pd.read_csv(working_dir + filename)
print(df_ind_cty.head())

        County   Latitude  Longitude    State
0        Adams  40.737167 -84.934730  Indiana
1        Allen  41.097557 -85.056208  Indiana
2  Bartholomew  39.191416 -85.820460  Indiana
3       Benton  40.603428 -87.323516  Indiana
4    Blackford  40.469283 -85.330553  Indiana


## <span style=color:blue>The next two cells illustrate how to obtain NASA POWER weather data for the lat/lon point inside each county in Indiana.   </span>

<span style=color:blue>The corn growing season is from April to October, so data retrieval is restricted to those months.</span>

<span style=color:blue>Note: After experiencing the nuances of the USDA NASS API, I am conservatively building the full NASA POWER URL by hand, rather than passing a parameter dictionary to requests.</span>

In [21]:
# setting up a URL template for making requests to NASA POWER

# growing season from April to October

import json

working_dir = '/Users/rick/AG-CODE--v03/ML/'
county_file = 'county_lat_long.csv'

dfcty = pd.read_csv(working_dir + county_file)
# print(dfcty.head())

# see https://gist.github.com/abelcallejo/d68e70f43ffa1c8c9f6b5e93010704b8
#   for available parameters
weather_params = ['T2M_MAX','T2M_MIN', 'PRECTOTCORR', 'GWETROOT', 'EVPTRNS', 'ALLSKY_SFC_PAR_TOT']
'''
   T2M_MAX: The maximum hourly air (dry bulb) temperature at 2 meters above the surface of the 
             earth in the period of interest.
   T2M_MIN: The minimum hourly air (dry bulb) temperature at 2 meters above the surface of the 
            earth in the period of interest.
   PRECTOTCORR: The bias corrected average of total precipitation at the surface of the earth 
                in water mass (includes water content in snow)
   GWETROOT: Root zone soil wetness (values in the output below are fractions between 0 and 1)
   EVPTRNS: The evapotranspiration energy flux at the surface of the earth
   ALLSKY_SFC_PAR_TOT: The total Photosynthetically Active Radiation (PAR) incident 
         on a horizontal plane at the surface of the earth under all sky conditions
'''



# focused on growing season, which is April to October
base_url = r"https://power.larc.nasa.gov/api/temporal/daily/point?"
# build the parameter list from weather_params so the two cannot drift apart
base_url += 'parameters=' + ','.join(weather_params) + '&'
base_url += 'community=RE&longitude={longitude}&latitude={latitude}&start={year}0401&end={year}1031&format=JSON'
# print(base_url)
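
As an alternative to a format-string template, the same URL can be assembled from a dictionary. This is only a sketch: `build_power_url` and `POWER_DAILY_POINT_URL` are hypothetical names, and the query fields are copied from the template above. Passing `safe=","` to `urllib.parse.urlencode` keeps the commas in the parameter list unencoded, so the result matches the hand-built URL.

```python
import urllib.parse

# Endpoint taken from the template above
POWER_DAILY_POINT_URL = "https://power.larc.nasa.gov/api/temporal/daily/point"


def build_power_url(latitude, longitude, year, parameters, community="RE"):
    """Build a NASA POWER daily point URL for the April-October growing season."""
    query = {
        "parameters": ",".join(parameters),
        "community": community,
        "longitude": longitude,
        "latitude": latitude,
        "start": f"{year}0401",
        "end": f"{year}1031",
        "format": "JSON",
    }
    # safe="," leaves the commas between parameter names as-is
    return POWER_DAILY_POINT_URL + "?" + urllib.parse.urlencode(query, safe=",")


example_url = build_power_url(40.737167, -84.93473, 2014, ["T2M_MAX", "T2M_MIN"])
print(example_url)
```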

In [24]:
def fetch_weather_county_year(county, year):
    row = dfcty.loc[dfcty['County'] == county]
    # .iloc[0] picks the first matching row by position; plain [0] is a label
    # lookup and only works for the county whose row label happens to be 0
    lon = row['Longitude'].iloc[0]
    lat = row['Latitude'].iloc[0]
    api_request_url = base_url.format(longitude=lon, latitude=lat, year=str(year))

    response = requests.get(url=api_request_url, verify=True, timeout=30.00)
    # print(response.status_code)
    response.raise_for_status()  # raise on HTTP error codes instead of parsing an error page
    content = json.loads(response.content.decode('utf-8'))
    print(type(content))
    print(content.keys())
    weather = content['properties']['parameter']

    df = pd.DataFrame(weather)
    
    return df, weather

df, weather = fetch_weather_county_year('Adams', 2014)

# examining the output
print(df.head())

print()
    
print(weather['T2M_MAX']['20140401'])

<class 'dict'>
dict_keys(['type', 'geometry', 'properties', 'header', 'messages', 'parameters', 'times'])
          T2M_MAX  T2M_MIN  PRECTOTCORR  GWETROOT  EVPTRNS  ALLSKY_SFC_PAR_TOT
20140401    14.27     5.53         1.01      0.80     0.18              117.24
20140402    12.88     3.26         4.37      0.80     0.07               69.87
20140403    17.19     3.97        52.73      0.85     0.00               18.46
20140404    14.93     0.83         2.50      0.87     0.04               37.51
20140405     9.79    -1.97         0.00      0.85     0.08              116.94

14.27
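
Before aggregating, it may be worth guarding against missing values. NASA POWER responses generally mark missing data with a fill value, commonly -999, reported in the `header` part of the response. The sketch below rests on that assumption, so the `fill_value` key and the -999 default should be checked against an actual response. `weather_json_to_frame` is a hypothetical helper that turns the parsed JSON into a DataFrame with a datetime index and NaN in place of the fill value.

```python
import pandas as pd


def weather_json_to_frame(content, default_fill=-999.0):
    """Convert a parsed NASA POWER daily response into a cleaned DataFrame."""
    # Assumption: the header carries a 'fill_value' entry; fall back to -999 if absent
    fill_value = content.get("header", {}).get("fill_value", default_fill)
    df = pd.DataFrame(content["properties"]["parameter"])
    df.index = pd.to_datetime(df.index, format="%Y%m%d")
    df = df.rename_axis(index="DATE")
    # Replace the fill value with NaN so means and sums skip missing days
    return df.replace(fill_value, float("nan"))


# Tiny hand-made stand-in for a response (not real data)
sample_content = {
    "header": {"fill_value": -999.0},
    "properties": {
        "parameter": {
            "T2M_MAX": {"20140401": 14.27, "20140402": -999.0},
            "T2M_MIN": {"20140401": 5.53, "20140402": 3.26},
        }
    },
}
df_clean = weather_json_to_frame(sample_content)
print(df_clean)
```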


## <span style=color:blue>Grouping the data into months     </span>

<span style=color:blue>Continuing to work with just Adams County.  Should package as a function... </span>

In [25]:
# Want to have a name for the index of my dataframe
df = df.rename_axis(index='DATE')
print(df.head(5))


          T2M_MAX  T2M_MIN  PRECTOTCORR  GWETROOT  EVPTRNS  ALLSKY_SFC_PAR_TOT
DATE                                                                          
20140401    14.27     5.53         1.01      0.80     0.18              117.24
20140402    12.88     3.26         4.37      0.80     0.07               69.87
20140403    17.19     3.97        52.73      0.85     0.00               18.46
20140404    14.93     0.83         2.50      0.87     0.04               37.51
20140405     9.79    -1.97         0.00      0.85     0.08              116.94


In [26]:
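# convert index to datetime format
df.index = pd.to_datetime(df.index, format='%Y%m%d')
# temperatures and soil wetness are averaged; precipitation and PAR are summed.
# EVPTRNS is not aggregated here, so it drops out of the monthly table.
# Note: pandas 2.2+ deprecates the 'M' alias in favor of 'ME' (month end).
df_monthly = df.resample('M').agg({'T2M_MAX':'mean',
                                       'T2M_MIN':'mean',
                                       'PRECTOTCORR':'sum',
                                       'GWETROOT':'mean',
                                       'ALLSKY_SFC_PAR_TOT':'sum'})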



'''
dfTMAX = df[['T2M_MAX']].copy()
print(dfTMAX.head(5))


# convert index to datetime format
dfTMAX.index = pd.to_datetime(dfTMAX.index, format='%Y%m%d')
print(dfTMAX.head(5))

# resample to monthly frequency and calculate sum of T2M values
df_monthly = dfTMAX.resample('M').mean()
'''

# convert index back to string format YYYYMM
df_monthly.index = df_monthly.index.strftime('%Y%m')

# rename column to T2M_SUM
# df_monthly = df_monthly.rename(columns={'T2M': 'T2M_SUM'})

print(df_monthly.head(6))


          T2M_MAX    T2M_MIN  PRECTOTCORR  GWETROOT  ALLSKY_SFC_PAR_TOT
DATE                                                                   
201404  16.305333   3.215000       125.82  0.833000             2840.03
201405  21.613871  10.206129        92.08  0.795161             3515.73
201406  26.442667  16.716000       143.19  0.741000             3441.92
201407  25.886452  14.771935        48.41  0.640645             3526.86
201408  28.519032  16.899677       107.01  0.576129             3229.78
201409  23.733667  11.329667        83.94  0.606667             2631.40
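
The monthly grouping above could be packaged as a function, as suggested earlier. The following is a sketch under a few assumptions: `monthly_summary` and `MONTHLY_AGG` are hypothetical names, the aggregation rules are copied from the cell above, and the `'MS'` (month start) alias is used only because it is accepted by both older and newer pandas versions. The bins are the same calendar months and only the internal label differs.

```python
import pandas as pd

# Aggregation rule per parameter, copied from the cell above
MONTHLY_AGG = {
    'T2M_MAX': 'mean',
    'T2M_MIN': 'mean',
    'PRECTOTCORR': 'sum',
    'GWETROOT': 'mean',
    'ALLSKY_SFC_PAR_TOT': 'sum',
}


def monthly_summary(df_daily, agg=MONTHLY_AGG):
    """Aggregate a daily weather DataFrame (index 'YYYYMMDD') to months ('YYYYMM')."""
    out = df_daily.copy()
    out.index = pd.to_datetime(out.index, format='%Y%m%d')
    # Only aggregate the columns that are actually present
    rules = {col: how for col, how in agg.items() if col in out.columns}
    monthly = out.resample('MS').agg(rules)
    monthly.index = monthly.index.strftime('%Y%m')
    return monthly.rename_axis(index='DATE')


# Tiny made-up daily frame to show the shape of the result (not real data)
demo_daily = pd.DataFrame(
    {'T2M_MAX': [10.0, 20.0, 30.0], 'PRECTOTCORR': [1.0, 2.0, 4.0]},
    index=['20140401', '20140402', '20140501'],
)
demo_monthly = monthly_summary(demo_daily)
print(demo_monthly)
```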


### <span style=color:blue>As a next step, I will create a df with a single row and 36 columns, named "2014-04__T2M_MAX", "2014-05__T2M_MAX", etc.    </span>

<span style=color:blue>Is there an elegant way to do that?  I am imagining a nested loop - which would get time-consuming for doing numerous states and counties</span>
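
One possible answer that avoids an explicit nested loop is to let pandas flatten the monthly table. The sketch below uses `DataFrame.unstack()`, which turns a frame with a plain index into a Series keyed by (column, index) pairs. `monthly_to_single_row` is a hypothetical helper, and the column names follow the `YYYYMM` index used above (giving "201404__T2M_MAX" rather than "2014-04__T2M_MAX").

```python
import pandas as pd


def monthly_to_single_row(df_monthly, row_label=0, sep="__"):
    """Flatten a (month x parameter) table into one row of 'MONTH__PARAM' columns."""
    flat = df_monthly.unstack()  # Series indexed by (parameter, month)
    flat.index = [f"{month}{sep}{param}" for param, month in flat.index]
    # to_frame() makes a single column; transposing turns it into a single row
    return flat.to_frame(name=row_label).T


# Small made-up monthly table to show the idea (values are illustrative only)
demo_monthly = pd.DataFrame(
    {'T2M_MAX': [16.3, 21.6], 'PRECTOTCORR': [125.82, 92.08]},
    index=pd.Index(['201404', '201405'], name='DATE'),
)

row_adams = monthly_to_single_row(demo_monthly, row_label='Adams')
row_allen = monthly_to_single_row(demo_monthly * 2, row_label='Allen')

# Rows for several counties can then be stacked into one table
all_rows = pd.concat([row_adams, row_allen])
print(all_rows)
```

Because the reshaping is vectorized, the remaining loop is only over counties (and years), with one `pd.concat` at the end.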