## Notebook goal
The agriculture data was collected from the USDA National Agricultural Statistics Service (NASS) through Quick Stats (https://quickstats.nass.usda.gov/).

The bee dataset is quarterly, whereas the USDA data is annual.

To make the USDA data comparable to the bee data, we need to disaggregate the annual USDA data into quarters. This is done by weighting each annual crop value by the expected seasonal relevance of each quarter.

To avoid data leakage, future annual data will not be used in earlier quarters.
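
As a minimal sketch of the weighting idea, assume an annual value of 386 acres and a hypothetical set of quarterly weights that sum to 1 (the names `annual_value` and `weights` are illustrative only). Each quarter receives its share of the annual total, and the shares add back up to the annual value:

```python
# Hypothetical quarterly weights (assumed for illustration); they should sum to 1
weights = {1: 0.15, 2: 0.60, 3: 0.20, 4: 0.05}
annual_value = 386  # illustrative annual figure, e.g. acres bearing

# Share of the annual value attributed to each quarter
quarterly = {q: annual_value * w for q, w in weights.items()}

# The quarterly shares should add back up to the annual total
total = sum(quarterly.values())
print(quarterly)
print(round(total, 6))
```

If the weights did not sum to 1, the quarterly values would over- or under-state the annual figure, so checking the sum before applying them is a cheap safeguard.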

In [1]:
import pandas as pd
import os

In [2]:
# directory containing the raw import files (relative to the current working directory)
ITM_DIR = os.path.join(os.getcwd(), '../data/import')

In [3]:
# read in the data
apples = pd.read_csv(os.path.join(ITM_DIR, 'apples acres bearing.csv'))
avocados = pd.read_csv(os.path.join(ITM_DIR, 'avocado acres bearing.csv'))
pears = pd.read_csv(os.path.join(ITM_DIR, 'pears acres bearing.csv'))
cucumbers = pd.read_csv(os.path.join(ITM_DIR, 'cucumbers acres planted.csv'))
honey = pd.read_csv(os.path.join(ITM_DIR, 'honey production in LB.csv'))
sunflower = pd.read_csv(os.path.join(ITM_DIR, 'sunflower acres planted.csv'))
squach = pd.read_csv(os.path.join(ITM_DIR, 'squach acres planted.csv'))

In [4]:
# Keep only the required columns for all datasets.
# .copy() gives each frame its own data, so the assignments below
# do not trigger pandas' SettingWithCopyWarning.
datasets = [apples, avocados, pears, cucumbers, honey, sunflower, squach]
datasets = [df[['Year', 'State', 'Value']].copy() for df in datasets]

# Remove thousands separators from the Value column where it was read as text
for df in datasets:
    if df['Value'].dtype == 'object':
        df['Value'] = df['Value'].str.replace(',', '', regex=False)

# Convert Value to integer; non-numeric entries are coerced to NaN and then set to 0
for df in datasets:
    df['Value'] = pd.to_numeric(df['Value'], errors='coerce').fillna(0).astype(int)

# Group by State and Year and sum the Value column
datasets = [df.groupby(['State', 'Year'], as_index=False).sum() for df in datasets]

# Unpack the datasets back into individual variables
apples, avocados, pears, cucumbers, honey, sunflower, squach = datasets

In [5]:
# Rename each dataset's Value column to a descriptive, crop-specific name
# so the datasets can be joined on Year and State without column clashes
apples = apples.rename(columns={'Value': 'apples_acres_bearing'})
avocados = avocados.rename(columns={'Value': 'avocados_acres_bearing'})
pears = pears.rename(columns={'Value': 'pears_acres_bearing'})
cucumbers = cucumbers.rename(columns={'Value': 'cucumbers_acres_planted'})
honey = honey.rename(columns={'Value': 'honey_production_in_lb'})
sunflower = sunflower.rename(columns={'Value': 'sunflower_acres_planted'})
squach = squach.rename(columns={'Value': 'squash_acres_planted'})

# Perform a left join on Year and State for all datasets and drop duplicate columns
df = apples
for dataset in [avocados, pears, cucumbers, honey, sunflower, squach]:
    df = pd.merge(df, dataset, on=['Year', 'State'], how='left')

# Drop duplicate columns if they exist
df = df.loc[:, ~df.columns.duplicated()]
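
Because the joins above are left joins anchored on `apples`, any state and year that appears only in another dataset is dropped. A small sketch of how this could be checked, using toy frames with made-up values and the `indicator` option of `pd.merge` (the names `left`, `right`, `audit`, and `lost` are illustrative):

```python
import pandas as pd

# Toy stand-ins for two of the crop tables (values are made up)
left = pd.DataFrame({'Year': [2002, 2007], 'State': ['ALABAMA', 'ALABAMA'],
                     'apples_acres_bearing': [386, 307]})
right = pd.DataFrame({'Year': [2002, 2012], 'State': ['ALABAMA', 'ALABAMA'],
                      'pears_acres_bearing': [65, 70]})

# An outer join with indicator=True labels each row by where it came from
audit = pd.merge(left, right, on=['Year', 'State'], how='outer', indicator=True)

# Rows marked 'right_only' would be lost in a left join
lost = audit[audit['_merge'] == 'right_only']
print(lost[['Year', 'State']])
```

If `lost` is non-empty for the real data, an outer join (or a different anchor table) may be worth considering.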

In [6]:
# change State name to only have first letter capitalized
df['State'] = df['State'].str.title()

# replace missing values with 0
df = df.fillna(0)

In [7]:
# Example weight distribution over quarters
quarter_weights = {
    1: 0.15,
    2: 0.6,
    3: 0.2,
    4: 0.05
}

# Crop columns to distribute
crop_cols = [
    'apples_acres_bearing',
    'avocados_acres_bearing',
    'pears_acres_bearing',
    'cucumbers_acres_planted',
    'honey_production_in_lb',
    'sunflower_acres_planted',
    'squash_acres_planted'
]

# Create a DataFrame with all quarters for each state/year
quarters_df = df[['State', 'Year']].copy()
quarters_df = quarters_df.loc[quarters_df.index.repeat(4)].reset_index(drop=True)
quarters_df['quarter'] = [1, 2, 3, 4] * (len(df))

# Merge back the crop values
quarters_df = quarters_df.merge(df, on=['State', 'Year'], how='left')

# Apply weights to crop columns (vectorized: map each quarter to its weight)
weights = quarters_df['quarter'].map(quarter_weights)
for col in crop_cols:
    quarters_df[col + '_weighted'] = quarters_df[col] * weights

# Drop the original crop columns
quarters_df = quarters_df.drop(columns=crop_cols)

In [8]:
quarters_df

Unnamed: 0,State,Year,quarter,apples_acres_bearing_weighted,avocados_acres_bearing_weighted,pears_acres_bearing_weighted,cucumbers_acres_planted_weighted,honey_production_in_lb_weighted,sunflower_acres_planted_weighted,squash_acres_planted_weighted
0,Alabama,2002,1,57.90,0.0,9.75,0.0,352491.30,0.0,0.0
1,Alabama,2002,2,231.60,0.0,39.00,0.0,1409965.20,0.0,0.0
2,Alabama,2002,3,77.20,0.0,13.00,0.0,469988.40,0.0,0.0
3,Alabama,2002,4,19.30,0.0,3.25,0.0,117497.10,0.0,0.0
4,Alabama,2007,1,46.05,0.0,13.05,0.0,195340.80,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
2095,Wyoming,2017,4,1.95,0.0,0.00,0.0,188214.80,0.0,0.0
2096,Wyoming,2022,1,7.65,0.0,0.15,0.0,386214.45,0.0,0.0
2097,Wyoming,2022,2,30.60,0.0,0.60,0.0,1544857.80,0.0,0.0
2098,Wyoming,2022,3,10.20,0.0,0.20,0.0,514952.60,0.0,0.0
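
One way to sanity-check a table like this is to confirm that every state and year has exactly four quarters and that the weighted values sum back to a single annual figure. The sketch below uses a toy frame copied from the first four rows shown above; `toy` and `annual_check` are illustrative names:

```python
import pandas as pd

# Toy frame copied from the first four rows shown above (Alabama, 2002)
toy = pd.DataFrame({
    'State': ['Alabama'] * 4,
    'Year': [2002] * 4,
    'quarter': [1, 2, 3, 4],
    'apples_acres_bearing_weighted': [57.90, 231.60, 77.20, 19.30],
})

# Every state/year pair should carry exactly four distinct quarters
has_four_quarters = toy.groupby(['State', 'Year'])['quarter'].nunique().eq(4).all()

# Sum the four quarters back to one value per state and year
annual_check = (
    toy.groupby(['State', 'Year'], as_index=False)['apples_acres_bearing_weighted']
       .sum()
)
print(has_four_quarters)
print(annual_check)
```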


In [9]:
# change all columns except State to integer (astype(int) truncates the fractional part, e.g. 57.90 becomes 57)
for col in quarters_df.columns[1:]:
    quarters_df[col] = quarters_df[col].astype(int)
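
Casting with `astype(int)` drops the fractional part rather than rounding, so the quarterly integers may no longer add up to the annual value. A small sketch of the difference, using the Alabama 2002 apple values shown earlier; rounding before the cast is an alternative to consider, not what the notebook does:

```python
import pandas as pd

# Weighted apple values from the Alabama 2002 rows shown earlier
s = pd.Series([57.90, 231.60, 77.20, 19.30])

truncated = s.astype(int)          # drops the fractional part
rounded = s.round().astype(int)    # rounds to the nearest integer first

print(truncated.tolist())  # [57, 231, 77, 19]
print(rounded.tolist())    # [58, 232, 77, 19]
print(truncated.sum(), rounded.sum())
```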

In [10]:
# change State and Year column names to lowercase
quarters_df = quarters_df.rename(columns={'State': 'state', 'Year': 'year'})

In [11]:
# Save the quarterly summary to a CSV file
# OUT_DIR = os.path.join(os.getcwd(), '../data/intermediate')

# quarters_df.to_csv(os.path.join(OUT_DIR, 'crops_quarterly.csv'), index=False)

In [16]:
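# read in qs.environmental_20250412.txt (tab-separated, latin1-encoded)

environmental_data = pd.read_csv(os.path.join(ITM_DIR, 'qs.environmental_20250412.txt'), sep='\t', encoding='latin1')
# NOTE: the columns shown in the output below are upper-case (e.g. YEAR),
# so this rename appears to match nothing and leaves the frame unchanged
environmental_data = environmental_data.rename(columns={'State': 'state', 'Year': 'year'})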


In [17]:
environmental_data

Unnamed: 0,SOURCE_DESC,SECTOR_DESC,GROUP_DESC,COMMODITY_DESC,CLASS_DESC,PRODN_PRACTICE_DESC,UTIL_PRACTICE_DESC,STATISTICCAT_DESC,UNIT_DESC,SHORT_DESC,...,LOCATION_DESC,YEAR,FREQ_DESC,BEGIN_CODE,END_CODE,REFERENCE_PERIOD_DESC,WEEK_ENDING,LOAD_TIME,VALUE,CV_%
0,CENSUS,ENVIRONMENTAL,FARMS & LAND & ASSETS,AG LAND,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,TREATED,OPERATIONS,AG LAND - OPERATIONS WITH TREATED,...,"KENTUCKY, MIDWESTERN, MCLEAN",2007,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,5,
1,SURVEY,ENVIRONMENTAL,FIELD CROPS,WHEAT,WINTER,ORGANIC,ALL UTILIZATION PRACTICES,PEST MGMT,PCT OF OPERATIONS,"WHEAT, WINTER, ORGANIC - PEST MGMT, MEASURED I...",...,SOUTH DAKOTA,2009,ANNUAL,0,0,YEAR,,2012-01-01 00:00:00,100,
2,SURVEY,ENVIRONMENTAL,VEGETABLES,POTATOES,FALL,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,APPLICATIONS,LB,"POTATOES, FALL - APPLICATIONS, MEASURED IN LB",...,WASHINGTON,2005,ANNUAL,0,0,YEAR,,2005-06-11 10:11:36,21000,
3,SURVEY,ENVIRONMENTAL,VEGETABLES,VEGETABLE TOTALS,"(EXCL POTATOES), INCL STRAWBERRIES",IN THE OPEN,ALL UTILIZATION PRACTICES,PEST MGMT,PCT OF AREA PLANTED,"VEGETABLE TOTALS, (EXCL POTATOES), INCL STRAWB...",...,PROGRAM STATES,2016,ANNUAL,0,0,YEAR,,2016-08-31 13:58:03,29,
4,SURVEY,ENVIRONMENTAL,FRUIT & TREE NUTS,PEACHES,ALL CLASSES,ALL PRODUCTION PRACTICES,BEARING,APPLICATIONS,"LB / ACRE / APPLICATION, AVG","PEACHES, BEARING - APPLICATIONS, MEASURED IN L...",...,CALIFORNIA,2017,ANNUAL,0,0,YEAR,,2017-08-31 14:07:35,(D),
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1693945,SURVEY,ENVIRONMENTAL,FIELD CROPS,OATS,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,PEST MGMT,PCT OF AREA PLANTED,"OATS - PEST MGMT, MEASURED IN PCT OF AREA PLANTED",...,TEXAS,2023,ANNUAL,0,0,YEAR,,2025-02-03 15:00:00,0,
1693946,SURVEY,ENVIRONMENTAL,FIELD CROPS,OATS,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,PEST MGMT,PCT OF AREA PLANTED,"OATS - PEST MGMT, MEASURED IN PCT OF AREA PLANTED",...,TEXAS,2023,ANNUAL,0,0,YEAR,,2025-02-03 15:00:00,0,
1693947,SURVEY,ENVIRONMENTAL,FIELD CROPS,OATS,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,PEST MGMT,PCT OF AREA PLANTED,"OATS - PEST MGMT, MEASURED IN PCT OF AREA PLANTED",...,TEXAS,2023,ANNUAL,0,0,YEAR,,2025-02-03 15:00:00,7,
1693948,SURVEY,ENVIRONMENTAL,FIELD CROPS,PEANUTS,ALL CLASSES,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,PEST MGMT,PCT OF OPERATIONS,"PEANUTS - PEST MGMT, MEASURED IN PCT OF OPERAT...",...,TEXAS,2023,ANNUAL,0,0,YEAR,,2025-02-03 15:00:00,0,
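
The `VALUE` column above mixes numbers with suppression codes such as `(D)`. A minimal sketch of one way to convert it to numeric, on toy values; treating `(D)` as missing rather than 0 is an assumption here, not something the notebook does, and `raw` and `clean` are illustrative names:

```python
import pandas as pd

# Toy values mimicking the VALUE column shown above;
# '21,000' is a hypothetical comma-formatted entry
raw = pd.Series(['5', '100', '21,000', '29', '(D)'], name='VALUE')

# Strip thousands separators, then coerce anything non-numeric (e.g. '(D)') to NaN
clean = pd.to_numeric(raw.str.replace(',', '', regex=False), errors='coerce')

print(clean.tolist())
print(int(clean.isna().sum()), 'suppressed or non-numeric value(s)')
```

Keeping suppressed entries as NaN makes them distinguishable from genuine zeros in later aggregation.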
