<div style="background-color:#FCE205; padding:10px; border-radius:5px; color:black; font-weight:bold;">
    <h2>Combining the bees data with the drought and weather data</h2>
</div>

In [14]:
# import libraries
import os
import pandas as pd

In [15]:
# path to the intermediate data directory
ITM_DIR = os.path.join(os.getcwd(), '../data/intermediate')

In [16]:
# import datasets to combine

drought = pd.read_csv(os.path.join(ITM_DIR, 'drought_quarterly.csv'))
weather = pd.read_csv(os.path.join(ITM_DIR, 'quarterly_weather_summary.csv'))
bees = pd.read_csv(os.path.join(ITM_DIR, 'bees.csv'))

In [17]:
# left-join the drought and bees data onto the weather data by state, year and quarter
df = weather.merge(drought, how='left', on=['state', 'year', 'quarter'])
bees_full = df.merge(bees, how='left', on=['state', 'year', 'quarter'])
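One way to sanity-check left joins like the two above is to pass `validate` and `indicator` to `merge`. The sketch below uses small made-up frames (the `toy_` names and all values are illustrative, not taken from the project data) and assumes each table has at most one row per state, year and quarter.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are invented for illustration.
toy_weather = pd.DataFrame({
    'state': ['Alabama', 'Alabama', 'Alaska'],
    'year': [2019, 2019, 2019],
    'quarter': [1, 2, 1],
    'temp_mean': [10.5, 21.0, -8.2],
})
toy_bees = pd.DataFrame({
    'state': ['Alabama'],
    'year': [2019],
    'quarter': [1],
    'num_colonies': [7000],
})

keys = ['state', 'year', 'quarter']

# validate='one_to_one' raises MergeError if either side has duplicate keys;
# indicator=True adds a '_merge' column showing where each row came from.
checked = toy_weather.merge(
    toy_bees, how='left', on=keys, validate='one_to_one', indicator=True
)

# Rows tagged 'left_only' exist in the weather table but have no bees match.
unmatched = checked[checked['_merge'] == 'left_only'][keys]
print(unmatched)
```

Rows flagged `left_only` are the ones that end up with missing bee columns after the join, so this gives an early view of the gaps inspected in the next section.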

<div style="background-color:#FCE205; padding:10px; border-radius:5px; color:black; font-weight:bold;">
    <h3>Check for missing data</h3>
</div>

In [19]:
# which states are in weather but not in the bees dataset?
weather_states = weather.state.unique()
bees_states = bees.state.unique()
missing_states = set(weather_states) - set(bees_states)
missing_states

{'Alaska', 'Delaware', 'Nevada', 'New Hampshire', 'Rhode Island'}

In [20]:
# remove rows for states that do not appear in the bees dataset
bees_full = bees_full[~bees_full['state'].isin(missing_states)]

In [21]:
bees_full.isna().sum().sort_values(ascending=False).head(50)

other_or_unknown                 94
renovated_colonies               94
state_code                       94
max_colonies                     94
lost_colonies                    94
percent_lost                     94
added_colonies                   94
num_colonies                     94
percent_renovated                94
other_pests_and_parasites        94
diseases                         94
pesticides                       94
varroa_mites                     94
D4_mean                           0
D0_mean                           0
D1_mean                           0
D2_mean                           0
D3_mean                           0
D2_max                            0
D0_max                            0
D1_max                            0
D3_max                            0
D4_max                            0
moderate_snow_sum                 0
year                              0
state                             0
latitude                          0
relative_humidity_2m_minmin 

In [22]:
# check which states, years and quarters have missing data
missing = bees_full[bees_full.isna().any(axis=1)]
missing = missing[['state', 'year', 'quarter']].drop_duplicates()
missing = missing.sort_values(by=['state', 'year', 'quarter']).reset_index(drop=True)
missing

,state,year,quarter
0,Alabama,2019,2
1,Alabama,2023,1
2,Arizona,2019,2
3,Arizona,2023,1
4,Arkansas,2019,2
...,...,...,...
89,West Virginia,2023,1
90,Wisconsin,2019,2
91,Wisconsin,2023,1
92,Wyoming,2019,2


On inspection, the missing bee data falls into three groups: 2019 quarter 2 and 2023 quarter 1 for all 45 remaining states, and all four quarters of 2022 for Hawaii.
This accounts for every incomplete row: 45 states × 2 quarters = 90, plus 4 Hawaii quarters = 94.

2023 is not available in the original bees dataset, so those rows were carried over from the weather data by the left join.

The 2019 quarter 2 gap affects every state and may be troublesome for further analysis.
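The breakdown above can be checked by grouping the incomplete rows by year and quarter. A minimal sketch on made-up data (two states instead of the real 45; the `toy` frame and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame: 2 states x 3 periods; values are invented.
toy = pd.DataFrame({
    'state': ['Alabama', 'Alabama', 'Alabama', 'Hawaii', 'Hawaii', 'Hawaii'],
    'year': [2019, 2022, 2023, 2019, 2022, 2023],
    'quarter': [2, 1, 1, 2, 1, 1],
    'num_colonies': [np.nan, 7000.0, np.nan, np.nan, np.nan, np.nan],
})

# Keep rows with at least one missing value, then count them per period.
incomplete = toy[toy.isna().any(axis=1)]
per_period = incomplete.groupby(['year', 'quarter']).size()
print(per_period)

# States with missing rows outside the two periods that are empty for everyone.
gaps = [(2019, 2), (2023, 1)]
in_gap = incomplete.set_index(['year', 'quarter']).index.isin(gaps)
extra_states = incomplete.loc[~in_gap, 'state'].unique()
print(extra_states)
```

On the real `bees_full`, the same grouping should show 45 rows for each of the two all-state gaps and single rows for the Hawaii 2022 quarters, if the description above is right.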

In [24]:
# Drop rows with missing values, but keep the 2019 quarter 2 rows
has_missing = bees_full.isna().any(axis=1)
is_2019_q2 = (bees_full['year'] == 2019) & (bees_full['quarter'] == 2)
bees_full = bees_full[~has_missing | is_2019_q2]
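The filter above keeps a row when it is complete or when it belongs to 2019 quarter 2. A small sketch on invented data (the `toy` frame is hypothetical) shows which rows survive, and that the two ways of writing the mask agree:

```python
import numpy as np
import pandas as pd

# Toy frame with one complete row and three incomplete ones; values are invented.
toy = pd.DataFrame({
    'state': ['Alabama', 'Alabama', 'Alabama', 'Hawaii'],
    'year': [2019, 2021, 2023, 2022],
    'quarter': [2, 3, 1, 4],
    'num_colonies': [np.nan, 6500.0, np.nan, np.nan],
})

has_missing = toy.isna().any(axis=1)
is_2019_q2 = (toy['year'] == 2019) & (toy['quarter'] == 2)

# Form 1: drop rows that have gaps AND are not 2019 Q2.
kept_a = toy[~(has_missing & ~is_2019_q2)]
# Form 2 (De Morgan): keep rows that are complete OR are 2019 Q2.
kept_b = toy[~has_missing | is_2019_q2]

print(kept_b[['state', 'year', 'quarter']])
```

Only the incomplete 2019 quarter 2 row and the complete 2021 row remain; the 2023 and Hawaii 2022 rows are dropped.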

In [25]:
bees_full.isna().sum().sort_values(ascending=False).head(50)

other_or_unknown                 45
renovated_colonies               45
state_code                       45
max_colonies                     45
lost_colonies                    45
percent_lost                     45
added_colonies                   45
num_colonies                     45
percent_renovated                45
other_pests_and_parasites        45
diseases                         45
pesticides                       45
varroa_mites                     45
D4_mean                           0
D0_mean                           0
D1_mean                           0
D2_mean                           0
D3_mean                           0
D2_max                            0
D0_max                            0
D1_max                            0
D3_max                            0
D4_max                            0
moderate_snow_sum                 0
year                              0
state                             0
latitude                          0
relative_humidity_2m_minmin 

In [26]:
# validate that the missing values are indeed for year = 2019 and quarter = 2
missing = bees_full[bees_full.isna().any(axis=1)]
missing = missing[['state', 'year', 'quarter']].drop_duplicates()
missing = missing.sort_values(by=['state', 'year', 'quarter']).reset_index(drop=True)
missing

,state,year,quarter
0,Alabama,2019,2
1,Arizona,2019,2
2,Arkansas,2019,2
3,California,2019,2
4,Colorado,2019,2
5,Connecticut,2019,2
6,Florida,2019,2
7,Georgia,2019,2
8,Hawaii,2019,2
9,Idaho,2019,2


In [14]:
# save csv file
OUT_DIR = os.path.join(os.getcwd(), '../data/cleaned')
os.makedirs(OUT_DIR, exist_ok=True)  # create the output folder if it does not exist yet

bees_full.to_csv(os.path.join(OUT_DIR, 'bees_full_cleaned.csv'), index=False)
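A quick way to confirm an export like this is to read the file back and compare its shape and columns. The sketch below writes to a temporary directory and uses a made-up frame in place of `bees_full`; both are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Invented stand-in for bees_full, including one retained missing value.
toy = pd.DataFrame({
    'state': ['Alabama', 'Hawaii'],
    'year': [2019, 2021],
    'quarter': [2, 3],
    'num_colonies': [np.nan, 7000.0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, 'cleaned')
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, 'bees_full_cleaned.csv')
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded file should match the frame that was written.
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

Because `index=False` is used, no extra index column appears on reload, and missing values come back as `NaN`.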