# Importing and Processing Bee Data into Pandas DataFrames and CSV Files

Bee-related data comes from the United States Department of Agriculture, National Agricultural Statistics Service, accessed through [data.world](https://data.world/finley/bee-colony-statistical-data-from-1987-2017).
The data is imported and aggregated across the reporting periods listed in the 'Period' column, producing one Value (number of bee colonies) per state for each calendar year from 1987 to 2017.

Columns removed include:

- Week Ending: all data in this column is NaN
- State ANSI: a numeric code to identify states
- Watershed: all data in this column is NaN
- Data Item: descriptor of the origin of the value (dropped after aggregation)
- CV (%): coefficient of variation

The resulting data frame has three columns: Year, State, and Value (number of bee colonies).
The cleaned data is stored as 'state_survey_processed.csv'.

---
*citations:* [aggregating columns](https://stackoverflow.com/questions/49783178/keep-other-columns-when-using-sum-with-groupby)
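The cleaning and aggregation described above can be sketched on a few made-up rows. This is only an illustration: the toy values and the `toy` and `yearly` names are assumptions, not taken from the survey file.

```python
import pandas as pd

# Hypothetical toy rows; the real survey file has more columns and rows.
toy = pd.DataFrame({
    'Year': [2016, 2016, 2016, 2017],
    'State': ['ALABAMA', 'ALABAMA', 'ALASKA', 'ALABAMA'],
    'Period': ['JAN THRU MAR', 'APR THRU JUN', 'JAN THRU MAR', 'JAN THRU MAR'],
    'Value': ['7,000', '7,500', '500', '8,000'],
})

# Strip thousands separators, then convert the strings to integers.
toy['Value'] = toy['Value'].str.replace(',', '', regex=False).astype(int)

# Collapse the per-period rows into one row per Year and State.
yearly = toy.groupby(['Year', 'State'], as_index=False)['Value'].sum()
print(yearly)
```

Passing `as_index=False` keeps 'Year' and 'State' as ordinary columns, which has the same effect as calling `.reset_index()` after the sum.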

In [1]:
# imports
import pandas as pd
import os

### Creating a directory with the `os` library to store processed data frames

*citation:* [Using `os` to make directories](https://www.w3schools.com/python/ref_os_makedirs.asp)

In [2]:
directory = '../Project_4/Data_Bees/processed_dfs/'
os.makedirs(directory, exist_ok = True)
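As a small illustration of why `exist_ok = True` is used, the sketch below calls `os.makedirs` twice on the same nested path. It runs inside a temporary folder, and the `target` path is a hypothetical stand-in rather than the project's real directory.

```python
import os
import tempfile

# Work inside a throwaway temporary folder so nothing in the project is touched.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'Data_Bees', 'processed_dfs')  # hypothetical path
    os.makedirs(target, exist_ok=True)  # creates both nested folders
    os.makedirs(target, exist_ok=True)  # second call does nothing and raises no error
    created = os.path.isdir(target)

print(created)
```

Without `exist_ok=True`, the second call would raise `FileExistsError`, so re-running the notebook cell would fail once the folder exists.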

---
## Importing and Preparing 'Bee Colony Survey Data by State'


In [None]:
# reading in csv to a pandas data frame
df_3 = pd.read_csv('./Data_Bees/raw_bee_data/Bee Colony Survey Data by State.csv')

# dropping columns that won't be used 
df_3.drop(columns = ['CV (%)', 'Watershed', 'Week Ending', 'State ANSI'], inplace = True)

# Converting the 'Value' column from dtype object to dtype int.
# Removing commas from values first.
df_3['Value'] = df_3['Value'].str.replace(',', '').astype(int)


# The month ranges in the 'Period' column are combined into calendar years.
# First: split the 'Period' string into start and end date strings that pd.to_datetime() can parse

def convert_period_to_dates(entry):
    # With axis=1, .apply() passes each row to this function as a Series
    year = entry['Year']
    period = entry['Period']

    if 'THRU' in period:
        # strip the whitespace left around 'THRU' after splitting
        start_mo, end_mo = [part.strip() for part in period.split('THRU')]
        start_date = f'{start_mo} {year}'
        end_date = f'{end_mo} {year}'
    else:
        # single-period entry: start and end are the same
        start_date = f'{period} {year}'
        end_date = f'{period} {year}'

    return pd.Series([start_date, end_date])

# Next, replace 'Period' with two new columns, date_start and date_end,
# by applying the function to each row of the data frame
df_3[['date_start', 'date_end']] = df_3.apply(convert_period_to_dates, axis = 1)
df_3 = df_3.drop(columns = ['Period'])

# Finally, group the DataFrame by 'Year', 'State', and 'Data Item', then sum the 'Value' column
# (columns not listed here, including date_start and date_end, are not carried into the result)
df_3 = df_3.groupby(['Year', 'State', 'Data Item'])['Value'].sum().reset_index()

# Make states lower case
df_3['State'] = df_3['State'].str.lower()


# Drop the column 'Data Item'
df_3.drop(columns = ['Data Item'], inplace = True)


# saving to .csv; index=False keeps the file to the three columns Year, State, and Value
df_3.to_csv(directory + 'state_survey_processed.csv', index = False)
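The start and end strings built from 'Period' are meant to be parsed by `pd.to_datetime()`. A minimal sketch of that parsing step follows, assuming every period label is made of three-letter English month abbreviations (for example 'JAN THRU MAR') and that the locale is English. The helper name `period_to_dates` is hypothetical, and other label formats in the real file would need separate handling.

```python
import pandas as pd

def period_to_dates(period, year):
    """Return (start, end) Timestamps for a 'Period' label and a year.

    Assumes labels use three-letter English month abbreviations.
    """
    # Split on 'THRU' when present; a single-month label yields one part.
    parts = [p.strip() for p in period.split('THRU')]
    start_mo, end_mo = parts[0], parts[-1]
    # Title-case the abbreviation ('JAN' -> 'Jan') to match the '%b' directive.
    start = pd.to_datetime(f'{start_mo.title()} {year}', format='%b %Y')
    # Note: this is the first day of the end month, not the last day.
    end = pd.to_datetime(f'{end_mo.title()} {year}', format='%b %Y')
    return start, end

start, end = period_to_dates('JAN THRU MAR', 2017)
print(start, end)
```

Both values fall on the first day of their month because the label carries no day. If the true end of the period matters, adding `pd.offsets.MonthEnd(0)` to `end` is one way to move it to the last day of that month.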