# Importing and Processing Bee Data into Pandas DataFrames and CSV Files

Bee-related data is sourced from various datasets stored in both .xlsx and .csv formats. Some .xlsx files contain multiple sheets, each representing data for a specific year.

This notebook aims to import and process each data file, saving the processed data as .csv files. Subsequently, the saved files will undergo cleaning and exploratory data analysis (EDA).

In [1]:
# imports
import pandas as pd
import os

### Creating a directory using os library to store processed data frames

*citation:* [Using `os` to make directories](https://www.w3schools.com/python/ref_os_makedirs.asp)

In [2]:
directory = '../Project_4/Data_Bees/processed_dfs/'
os.makedirs(directory, exist_ok = True)

---
## 1) Function to import and process data from all sheets/years in `BIP Bee Colony Loss Clean.xlsx`
*citation:* [Looking up sheet names in an .xlsx workbook](https://stackoverflow.com/questions/17977540/pandas-looking-up-the-list-of-sheets-in-an-excel-file)

In [3]:
# Creating the function
def process_data(file): # each sheet name is a year range in the format 2007-08
    
    file = str(file) 
    dfs = [] #store each data frame, for concatenation

    file_for_import = f'../Project_4/Data_Bees/{file}.xlsx'
    sheet_names = pd.ExcelFile(file_for_import).sheet_names #see citation
    
    for sheet_name in sheet_names:
        df = pd.read_excel(file_for_import, sheet_name = sheet_name)
        df['year'] = str(sheet_name[0:4]) #make a column for the year
        dfs.append(df) #append df to the list for concatenation

    #concatenate
    df_combo = pd.concat(dfs, ignore_index = True)

    return df_combo

# saving the output df as a variable
colony_loss_df = process_data('BIP Bee Colony Loss Clean')

# saving to .csv
colony_loss_df.to_csv(directory + 'bip_colony_loss_processed.csv')
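
As an alternative sketch, `pd.read_excel` accepts `sheet_name=None`, which returns a dict mapping each sheet name to its DataFrame, so the sheet list does not have to be looked up separately. The example below uses a hand-built dict (`sheets`, with made-up values) as a stand-in for that return value so it runs without the Excel file. The sheet names are assumed to follow the `2007-08` pattern noted in the function above.

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None), which returns {sheet_name: DataFrame}.
# The sheet names and values here are illustrative, not taken from the real workbook.
sheets = {
    '2007-08': pd.DataFrame({'State': ['Maryland', 'Maine'], 'Colonies': [10, 20]}),
    '2008-09': pd.DataFrame({'State': ['Maryland'], 'Colonies': [30]}),
}

# Tag each sheet with the first four characters of its name, then stack the sheets
dfs = [df.assign(year=name[:4]) for name, df in sheets.items()]
combined = pd.concat(dfs, ignore_index=True)

print(combined)
```

Using `assign` returns a tagged copy of each sheet rather than modifying it in place, which keeps the loop to a single expression.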

---
### Looking at colony_loss_df information

**Observations:** 
- There is data for 12 years
- The data for 2007 and 2008 covers only about half as many states (27 and 26 rows, versus 48 to 52 rows for the other years).
- The `year` values span 2007 - 2018
- The data set was pre-cleaned by the source

In [4]:
colony_loss_df.head()

Unnamed: 0,State,Total Winter All Loss,Beekeepers,Beekeepers Exclusive to State,Colonies,Colonies Exclusive to State,year
0,Maryland,7.6%,14,100%,4013,100%,2007
1,Washington,13.7%,5,0%,21870,0%,2007
2,New Jersey,15.1%,15,80%,22622,12%,2007
3,Arkansas,17.4%,20,100%,16955,100%,2007
4,Maine,18%,6,16.7%,45937,0.1%,2007


In [5]:
colony_loss_df['year'].describe()

count      557
unique      12
top       2013
freq        52
Name: year, dtype: object

In [6]:
colony_loss_df['year'].sort_values().nunique()

12

In [7]:
colony_loss_df['year'].sort_values().value_counts()

year
2013    52
2015    52
2011    51
2012    51
2014    51
2016    51
2010    50
2017    50
2009    48
2018    48
2007    27
2008    26
Name: count, dtype: int64
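
To see which states are absent from the sparse years, one option is to compare the set of states per year. This is a minimal sketch on a toy frame (`toy`, with illustrative rows only) standing in for `colony_loss_df`. The column names `State` and `year` match the output above, but the comparison year is an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for colony_loss_df; the rows are illustrative only
toy = pd.DataFrame({
    'State': ['Maine', 'Ohio', 'Texas', 'Maine', 'Ohio'],
    'year':  ['2013', '2013', '2013', '2007', '2007'],
})

# Collect the set of states reported in each year
states_by_year = toy.groupby('year')['State'].apply(set)

# States present in a well-covered year but absent from a sparse one
missing_2007 = sorted(states_by_year['2013'] - states_by_year['2007'])
print(missing_2007)
```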

---
## 2) Import and process data from `Bee Colony Census Data by State.csv`

In [8]:
df_2 = pd.read_csv('./Data_Bees/Bee Colony Census data by State.csv')

In [9]:
df_2.describe()

Unnamed: 0,Year,Week Ending,State ANSI,Ag District,Ag District Code,County,County ANSI,Zip Code,Region,watershed_code,Watershed,CV (%)
count,200.0,0.0,200.0,0.0,0.0,0.0,0.0,0.0,0.0,200.0,0.0,50.0
mean,2004.5,,29.32,,,,,,,0.0,,21.09
std,5.604198,,15.662829,,,,,,,0.0,,17.735198
min,1997.0,,1.0,,,,,,,0.0,,2.7
25%,2000.75,,17.0,,,,,,,0.0,,6.775
50%,2004.5,,29.5,,,,,,,0.0,,14.15
75%,2008.25,,42.0,,,,,,,0.0,,33.65
max,2012.0,,56.0,,,,,,,0.0,,74.9


In [10]:
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Program           200 non-null    object 
 1   Year              200 non-null    int64  
 2   Period            200 non-null    object 
 3   Week Ending       0 non-null      float64
 4   Geo Level         200 non-null    object 
 5   State             200 non-null    object 
 6   State ANSI        200 non-null    int64  
 7   Ag District       0 non-null      float64
 8   Ag District Code  0 non-null      float64
 9   County            0 non-null      float64
 10  County ANSI       0 non-null      float64
 11  Zip Code          0 non-null      float64
 12  Region            0 non-null      float64
 13  watershed_code    200 non-null    int64  
 14  Watershed         0 non-null      float64
 15  Commodity         200 non-null    object 
 16  Data Item         200 non-null    object 
 ...

In [11]:
df_2.isna().sum()

Program               0
Year                  0
Period                0
Week Ending         200
Geo Level             0
State                 0
State ANSI            0
Ag District         200
Ag District Code    200
County              200
County ANSI         200
Zip Code            200
Region              200
watershed_code        0
Watershed           200
Commodity             0
Data Item             0
Domain                0
Domain Category       0
Value                 0
CV (%)              150
dtype: int64

---
#### Dropping Columns

Several columns will not be used in our analysis, and several others are entirely blank.

* blank columns = 'Week Ending', 'Ag District', 'Ag District Code', 'County', 'County ANSI', 'Zip Code', 'Region', 'Watershed'

* columns with all the same value:
    - 'Domain Category' = 'NOT SPECIFIED'
    - 'Domain' = 'TOTAL'
    - 'Data Item' = 'HONEY, BEE COLONY'
    - 'Commodity' = 'HONEY'
    - 'watershed_code' = 0
    - 'Program' = 'Census'
    - 'Period' = 'END OF DEC'

* unusable column = 'CV (%)' is missing 150 values out of 200. 'CV (%)' is the coefficient of variation, which is available for the 2012 Census of Agriculture only.
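
The two groups above can also be found programmatically rather than by inspection. The sketch below uses a small made-up frame (`toy`) in place of `df_2`. `isna().all()` flags columns where every value is missing, and `nunique() == 1` flags columns holding a single distinct non-missing value.

```python
import pandas as pd

# Toy frame standing in for df_2; the values are illustrative only
toy = pd.DataFrame({
    'Year': [2012, 2007, 2012],
    'State': ['ALABAMA', 'ALASKA', 'ARIZONA'],
    'Week Ending': [float('nan')] * 3,
    'Domain': ['TOTAL', 'TOTAL', 'TOTAL'],
})

# Columns in which every value is missing
blank_cols = toy.columns[toy.isna().all()].tolist()

# Columns holding exactly one distinct non-missing value
constant_cols = toy.columns[toy.nunique() == 1].tolist()

print(blank_cols)
print(constant_cols)
```

Because `nunique` ignores missing values by default, a fully blank column counts zero distinct values and is not reported as constant.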

In [12]:
df_2.drop(columns = ['Week Ending', 'Ag District', 'Ag District Code', 'County', 'County ANSI', 'Zip Code', 'Region', 'Domain Category', 'Domain', 'Data Item', 'Commodity', 'watershed_code', 'State ANSI', 'Geo Level', 'Period', 'Program', 'Watershed', 'CV (%)'], inplace = True)

In [13]:
df_2.head()

Unnamed: 0,Year,State,Value
0,2012,ALABAMA,11628
1,2012,ALASKA,546
2,2012,ARIZONA,58461
3,2012,ARKANSAS,23259
4,2012,CALIFORNIA,945589


**OBSERVATION:** There is data for only 4 census years: 1997, 2002, 2007, and 2012

In [14]:
# Copy 'Value' into a descriptively named column so it stays identifiable after merging or concatenating with other dataframes.
df_2['census_value'] = df_2['Value']

In [15]:
# saving to .csv
df_2.to_csv(directory + 'state_census_processed.csv')
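
A note on the saved files: by default `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. Whether that matters depends on the later cleaning steps. The sketch below uses a small made-up frame (`toy`) and an in-memory buffer instead of a real file to show the difference `index=False` makes.

```python
import io
import pandas as pd

# Illustrative frame; not the real census data
toy = pd.DataFrame({'Year': [2012, 2012], 'State': ['ALABAMA', 'ALASKA']})

# Default behaviour: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False leaves the index out of the file
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```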

---
## 3) Import and process data from `Bee Colony Survey Data by State.csv`


In [16]:
df_3 = pd.read_csv('./Data_Bees/Bee Colony Survey Data by State.csv')
df_3.head()

Unnamed: 0,Year,Period,Week Ending,State,State ANSI,Watershed,Data Item,Value,CV (%)
0,2017,JAN THRU MAR,,ALABAMA,1,,ADDED & REPLACED,570,
1,2017,JAN THRU MAR,,ARIZONA,4,,ADDED & REPLACED,2900,
2,2017,JAN THRU MAR,,ARKANSAS,5,,ADDED & REPLACED,430,
3,2017,JAN THRU MAR,,CALIFORNIA,6,,ADDED & REPLACED,215000,
4,2017,JAN THRU MAR,,COLORADO,8,,ADDED & REPLACED,100,


In [17]:
df_3.drop(columns = ['CV (%)', 'Watershed', 'Week Ending'], inplace = True)

In [18]:
df_3.sort_values(by= ['State'])

Unnamed: 0,Year,Period,State,State ANSI,Data Item,Value
0,2017,JAN THRU MAR,ALABAMA,1,ADDED & REPLACED,570
2307,2011,MARKETING YEAR,ALABAMA,1,INVENTORY,9000
405,2015,OCT THRU DEC,ALABAMA,1,ADDED & REPLACED,80
2676,2002,MARKETING YEAR,ALABAMA,1,INVENTORY,12000
1528,2015,JUL THRU SEP,ALABAMA,1,"LOSS, DEADOUT",1400
...,...,...,...,...,...,...
1617,2015,OCT THRU DEC,WYOMING,56,"LOSS, DEADOUT",1300
3155,1992,MARKETING YEAR,WYOMING,56,INVENTORY,41000
584,2016,JAN THRU MAR,WYOMING,56,"INVENTORY, MAX",6500
1527,2015,APR THRU JUN,WYOMING,56,"LOSS, DEADOUT",3200


In [19]:
df_3['Period'].unique()

array(['JAN THRU MAR', 'APR THRU JUN', 'JUL THRU SEP', 'OCT THRU DEC',
       'MARKETING YEAR', 'FIRST OF JAN', 'FIRST OF APR', 'FIRST OF JUL',
       'FIRST OF OCT'], dtype=object)

In [37]:
# The quarterly ranges in the 'Period' column (e.g. 'JAN THRU MAR') need to be combined into yearly data.
# As a first step, split a range into its start and end dates.

def convert_period_to_dates(period, year):
    # Only 'X THRU Y' periods are handled; values such as 'MARKETING YEAR' or 'FIRST OF JAN' raise a ValueError
    start_mo, end_mo = (part.strip() for part in period.split('THRU'))  # strip the spaces left around each month
    start_date = f'{start_mo} {year}'
    end_date = f'{end_mo} {year}'

    return (start_date, end_date)  # e.g. ('JAN 2007', 'MAR 2007')

In [40]:
convert_period_to_dates('JAN THRU MAR', 2007)

('JAN 2007', 'MAR 2007')
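
The `Period` column also contains values that are not `X THRU Y` ranges. One possible way to cover them is sketched below in plain Python. The helper name `period_to_month_range` is hypothetical, and treating `FIRST OF JAN` as a single-month range and `MARKETING YEAR` as having no month range are assumptions, not decisions taken from the notebook.

```python
MONTHS = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
          'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']

def period_to_month_range(period):
    """Return (start_month, end_month) as numbers 1-12, or None if there is no month range."""
    period = period.strip()
    if 'THRU' in period:
        # e.g. 'JAN THRU MAR' -> (1, 3)
        start, end = (part.strip() for part in period.split('THRU'))
        return (MONTHS.index(start) + 1, MONTHS.index(end) + 1)
    if period.startswith('FIRST OF '):
        # e.g. 'FIRST OF OCT' -> (10, 10); assumption: treat it as a single month
        month = MONTHS.index(period[len('FIRST OF '):].strip()) + 1
        return (month, month)
    # e.g. 'MARKETING YEAR': no month information to extract
    return None

for p in ['JAN THRU MAR', 'FIRST OF OCT', 'MARKETING YEAR']:
    print(p, '->', period_to_month_range(p))
```

Returning month numbers instead of strings makes it straightforward to group or order the periods later, for example when aggregating quarters into a year.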