## Notebook goal

Clean up the bees dataset. This dataset was collected from Kaggle at https://www.kaggle.com/datasets/m000sey/save-the-honey-bees.

In [None]:
# import libraries (only those used in this notebook)
import os

import pandas as pd

In [2]:
# directory that holds the raw input data
ITM_DIR = os.path.join(os.getcwd(), '../data/import')

In [25]:
# read in the data
bees = pd.read_csv(os.path.join(ITM_DIR, 'save_the_bees.csv'))

**`state:`** state within the USA. Note: `Other` groups several states together for privacy reasons, and `United States` is the average across all states.

**`num_colonies:`** number of honey bee colonies

**`max_colonies:`** max number of honey bee colonies for that quarter

**`lost_colonies:`** number of colonies that were lost during that quarter

**`percent_lost:`** percentage of honey bee colonies lost during that quarter

**`renovated_colonies:`** colonies that were 'requeened' or received new bees

**`percent_renovated:`** percentage of honey bee colonies that were renovated

**`quarter:`** Q1 is Jan to March, Q2 is April to June, Q3 is July to September, and Q4 is October to December

**`year:`** year between 2015 and 2022

**`varroa_mites:`** Percentage of colonies affected by a species of mite that affects honey bee populations

**`other_pests_and_parasites:`** Percentage of colonies affected by a collection of other harmful critters

**`diseases:`** Percentage of colonies affected by certain diseases

**`pesticides:`** Percentage of colonies affected by the use of certain pesticides

**`other:`** Percentage of colonies affected by an unlisted cause

**`unknown:`** Percentage of colonies affected by an unknown cause
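A quick way to compare the data dictionary above with the file is to diff the two sets of column names. The sketch below is illustrative only: the `documented` and `actual` lists are typed in by hand from the descriptions above and from the preview of `bees` (leaving out the index column), and the variable names are assumptions, not part of the original notebook.

```python
# Columns described in the data dictionary above
documented = {
    "state", "num_colonies", "max_colonies", "lost_colonies", "percent_lost",
    "renovated_colonies", "percent_renovated", "quarter", "year",
    "varroa_mites", "other_pests_and_parasites", "diseases", "pesticides",
    "other", "unknown",
}

# Columns shown in the preview of `bees` (index column left out);
# in the notebook this could be list(bees.columns) instead
actual = [
    "state", "state_code", "num_colonies", "max_colonies", "lost_colonies",
    "percent_lost", "added_colonies", "renovated_colonies",
    "percent_renovated", "quarter", "year", "varroa_mites",
    "other_pests_and_parasites", "diseases", "pesticides", "other", "unknown",
]

# Columns present in the data but not described, in file order
undocumented = [col for col in actual if col not in documented]

# Columns described but absent from the data
missing = sorted(documented - set(actual))

print("undocumented:", undocumented)
print("missing:", missing)
```

With the names listed here, `state_code` and `added_colonies` come out as undocumented, and nothing is missing.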

In [26]:
bees

Unnamed: 0,state,state_code,num_colonies,max_colonies,lost_colonies,percent_lost,added_colonies,renovated_colonies,percent_renovated,quarter,year,varroa_mites,other_pests_and_parasites,diseases,pesticides,other,unknown
0,Alabama,AL,7000,7000,1800,26,2800,250,4,1,2015,10.0,5.4,0.0,2.2,9.1,9.4
1,Arizona,AZ,35000,35000,4600,13,3400,2100,6,1,2015,26.9,20.5,0.1,0.0,1.8,3.1
2,Arkansas,AR,13000,14000,1500,11,1200,90,1,1,2015,17.6,11.4,1.5,3.4,1.0,1.0
3,California,CA,1440000,1690000,255000,15,250000,124000,7,1,2015,24.7,7.2,3.0,7.5,6.5,2.8
4,Colorado,CO,3500,12500,1500,12,200,140,1,1,2015,14.6,0.9,1.8,0.6,2.6,5.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1448,West Virginia,WV,7500,8000,1100,14,0,220,3,4,2022,33.4,3.8,0.8,0.0,6.4,0.5
1449,Wisconsin,WI,26000,47000,3500,7,140,380,1,4,2022,23.2,21.4,19.4,17.5,9.9,11.7
1450,Wyoming,WY,19500,21000,3200,15,640,0,0,4,2022,22.9,5.9,4.2,0.0,0.0,7.4
1451,Other,OT,30030,30030,480,2,1190,130,0,4,2022,22.4,18.5,0.0,0.0,0.0,0.7


## Combining the other and unknown columns

It is not clear whether the `other` and `unknown` columns are mutually exclusive, so they are combined by taking the row-wise maximum of the two. This avoids the overcounting that summing them would risk.

In [27]:
bees['other_or_unknown'] = bees[['other', 'unknown']].max(axis=1)

# drop the original other and unknown columns
bees.drop(columns=['other', 'unknown'], inplace=True)
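To make the trade-off concrete, the sketch below applies both the row-wise maximum and the row-wise sum to two rows copied from the preview above (Alabama and Arizona, Q1 2015). The `toy` frame and the `other_plus_unknown` column are hypothetical and exist only for this comparison. If the two causes can overlap, the true combined share lies somewhere between the two results: the maximum is a lower bound, and the sum is an upper bound that is exact only when the causes never overlap.

```python
import pandas as pd

# Two rows of `other` and `unknown` copied from the preview above
toy = pd.DataFrame({"other": [9.1, 1.8], "unknown": [9.4, 3.1]})

# Row-wise maximum: the approach used in this notebook (lower bound)
toy["other_or_unknown"] = toy[["other", "unknown"]].max(axis=1)

# Row-wise sum: shown only for comparison (upper bound)
toy["other_plus_unknown"] = toy[["other", "unknown"]].sum(axis=1)

print(toy)
```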

In [28]:
bees.isnull().sum()

state                        0
state_code                   0
num_colonies                 0
max_colonies                 0
lost_colonies                0
percent_lost                 0
added_colonies               0
renovated_colonies           0
percent_renovated            0
quarter                      0
year                         0
varroa_mites                 0
other_pests_and_parasites    0
diseases                     0
pesticides                   0
other_or_unknown             0
dtype: int64

In [None]:
# save csv file
OUT_DIR = os.path.join(os.getcwd(), '../data/intermediate')

bees.to_csv(os.path.join(OUT_DIR, 'bees.csv'), index=False)
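`DataFrame.to_csv` does not create missing parent directories, so the save step fails if `../data/intermediate` does not exist yet. A minimal sketch of one way to guard against that uses `os.makedirs(..., exist_ok=True)` before writing. The `demo` frame and the temporary directory are stand-ins so the example runs on its own; in the notebook, the same call could be made on `OUT_DIR` before saving `bees`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned `bees` frame
demo = pd.DataFrame({"state": ["Alabama"], "num_colonies": [7000]})

with tempfile.TemporaryDirectory() as tmp:
    # Nested output directory that does not exist yet
    out_dir = os.path.join(tmp, "data", "intermediate")

    # Create the directory (and any parents); no error if it already exists
    os.makedirs(out_dir, exist_ok=True)

    out_path = os.path.join(out_dir, "bees.csv")
    demo.to_csv(out_path, index=False)

    # Read the file back to confirm that it was written as expected
    round_trip = pd.read_csv(out_path)

print(round_trip)
```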
