# Importing and Processing Bee Data into Pandas DataFrames and CSV Files

Bee-related data is sourced from various datasets stored in both .xlsx and .csv formats. Some .xlsx files contain multiple sheets, each representing data for a specific year.

This notebook aims to import and process each data file, saving the processed data as .csv files. Subsequently, the saved files will undergo cleaning and exploratory data analysis (EDA).

In [1]:
# imports
import os
import pandas as pd

### Creating a directory with the `os` library to store processed DataFrames

*citation:* [Using `os` to make directories](https://www.w3schools.com/python/ref_os_makedirs.asp)

In [3]:
directory = '../Project_4/Data_Bees/processed_dfs/'
os.makedirs(directory, exist_ok = True)

---
### Function to import and process data from all sheets/years in `BIP Bee Colony Loss Clean.xlsx`
*citation:* [Looking up sheet names in an .xlsx workbook](https://stackoverflow.com/questions/17977540/pandas-looking-up-the-list-of-sheets-in-an-excel-file)

In [5]:
# Creating the function
def process_data(file): # sheet names are year ranges in the format 2007-08
    
    file = str(file) 
    dfs = [] #store each data frame, for concatenation

    file_for_import = f'../Project_4/Data_Bees/{file}.xlsx'
    sheet_names = pd.ExcelFile(file_for_import).sheet_names #see citation
    
    for sheet_name in sheet_names:
        df = pd.read_excel(file_for_import, sheet_name = sheet_name)
        df['year'] = str(sheet_name[0:4]) #make a column for the year
        dfs.append(df) #append df to the list for concatenation

    #concatenate
    df_combo = pd.concat(dfs, ignore_index = True)

    return df_combo

# saving the output df as a variable
colony_loss_df = process_data('BIP Bee Colony Loss Clean')
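
As a possible alternative to looping over `sheet_names`, `pd.read_excel(path, sheet_name=None)` returns a dictionary that maps each sheet name to its DataFrame. The sketch below shows only the concatenation step on such a dictionary. The helper name `combine_sheets` and the toy frames are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def combine_sheets(sheets):
    """Concatenate a {sheet_name: DataFrame} mapping into one frame.

    Each row is tagged with the first four characters of its sheet name,
    which is the starting year when sheets are named like '2007-08'.
    """
    frames = [df.assign(year=str(name)[:4]) for name, df in sheets.items()]
    return pd.concat(frames, ignore_index=True)

# With a real workbook, the mapping would come from:
#   sheets = pd.read_excel(path, sheet_name=None)
# Toy stand-in for illustration only (values are made up):
sheets = {
    '2007-08': pd.DataFrame({'State': ['A', 'B'], 'Colonies': [10, 20]}),
    '2008-09': pd.DataFrame({'State': ['A'], 'Colonies': [15]}),
}

combined = combine_sheets(sheets)
print(combined)
```

This reads the workbook once instead of once per sheet, at the cost of holding every sheet in memory at the same time.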

In [8]:
colony_loss_df.head()

Unnamed: 0,State,Total Winter All Loss,Beekeepers,Beekeepers Exclusive to State,Colonies,Colonies Exclusive to State,year
0,Maryland,7.6%,14,100%,4013,100%,2007
1,Washington,13.7%,5,0%,21870,0%,2007
2,New Jersey,15.1%,15,80%,22622,12%,2007
3,Arkansas,17.4%,20,100%,16955,100%,2007
4,Maine,18%,6,16.7%,45937,0.1%,2007


In [9]:
# saving to .csv (the file name needs the .csv extension)
colony_loss_df.to_csv(directory + 'bip_colony_loss_processed.csv')
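
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The following is a minimal sketch of a round trip that avoids this, using an in-memory buffer instead of a file on disk. Passing `index=False` and reading `year` back as a string are suggestions, not choices made in the original notebook.

```python
import io
import pandas as pd

# Two rows taken from the head() output above, reduced to two columns
df = pd.DataFrame({'State': ['Maryland', 'Washington'],
                   'year': ['2007', '2007']})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # skip the row index when writing
buffer.seek(0)

# Read 'year' back as text so it keeps the dtype it had before saving
round_trip = pd.read_csv(buffer, dtype={'year': str})
print(round_trip)
```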

---
### Import and inspect data from `Bee Colony Census Data by County.csv`


In [24]:
df_2 = pd.read_csv('./Data_Bees/Bee Colony Census Data by County.csv')
df_2.head(30)

Unnamed: 0,Year,Period,State,State ANSI,Ag District,Ag District Code,County,County ANSI,Value,CV (%)
0,2012,END OF DEC,ALABAMA,1,BLACK BELT,40,AUTAUGA,1.0,119,27.7
1,2012,END OF DEC,ALABAMA,1,BLACK BELT,40,DALLAS,47.0,65,27.7
2,2012,END OF DEC,ALABAMA,1,BLACK BELT,40,ELMORE,51.0,190,27.7
3,2012,END OF DEC,ALABAMA,1,BLACK BELT,40,GREENE,63.0,14,27.7
4,2012,END OF DEC,ALABAMA,1,BLACK BELT,40,HALE,65.0,10,27.7
5,2012,END OF DEC,ALABAMA,1,BLACK BELT,40,LOWNDES,85.0,(D),(D)
6,2012,END OF DEC,ALABAMA,1,BLACK BELT,40,MACON,87.0,22,27.7
7,2012,END OF DEC,ALABAMA,1,BLACK BELT,40,MARENGO,91.0,(D),(D)
8,2012,END OF DEC,ALABAMA,1,BLACK BELT,40,MONTGOMERY,101.0,(D),(D)
9,2012,END OF DEC,ALABAMA,1,BLACK BELT,40,PERRY,105.0,(D),(D)


In [14]:
df_2.describe()

Unnamed: 0,Year,State ANSI,Ag District Code,County ANSI
count,7830.0,7830.0,7830.0,7821.0
mean,2007.212005,30.292976,48.379949,92.702084
std,4.06547,15.004933,25.675393,84.042373
min,2002.0,1.0,10.0,1.0
25%,2002.0,19.0,30.0,33.0
50%,2007.0,29.0,50.0,75.0
75%,2012.0,45.0,70.0,127.0
max,2012.0,56.0,97.0,810.0


In [16]:
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7830 entries, 0 to 7829
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Year              7830 non-null   int64  
 1   Period            7830 non-null   object 
 2   State             7830 non-null   object 
 3   State ANSI        7830 non-null   int64  
 4   Ag District       7830 non-null   object 
 5   Ag District Code  7830 non-null   int64  
 6   County            7830 non-null   object 
 7   County ANSI       7821 non-null   float64
 8   Value             7830 non-null   object 
 9   CV (%)            2761 non-null   object 
dtypes: float64(1), int64(3), object(6)
memory usage: 611.8+ KB


In [22]:
df_2.isna().sum()

Year                   0
Period                 0
State                  0
State ANSI             0
Ag District            0
Ag District Code       0
County                 0
County ANSI            9
Value                  0
CV (%)              5069
dtype: int64

**To do:**
- drop the `CV (%)` column (the coefficient of variation of each estimate), which is missing in 5,069 of 7,830 rows
- investigate the 9 rows with missing 'County ANSI' values
- import, process and merge ANSI data from national_county2020.txt to identify states and counties by name
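
The to-do items above could be sketched roughly as follows. The census rows are a toy stand-in, and the layout of the lookup text (pipe-delimited, with `STATEFP`, `COUNTYFP` and `COUNTYNAME` columns) is an assumption about national_county2020.txt that should be checked against the actual file. The helper `to_code` is hypothetical.

```python
import io
import pandas as pd

# Toy stand-in for df_2; the third row is made up to show a missing code
census = pd.DataFrame({
    'State ANSI': [1, 1, 1],
    'County': ['AUTAUGA', 'DALLAS', 'HYPOTHETICAL ROW'],
    'County ANSI': [1.0, 47.0, None],
    'CV (%)': ['27.7', '27.7', None],
})

# 1. Drop the CV column
census = census.drop(columns='CV (%)')

# 2. Inspect rows with a missing county code
missing = census[census['County ANSI'].isna()]

# 3. Build zero-padded string codes so they can match the lookup keys
def to_code(value, width):
    """Return a zero-padded code string, or None for a missing value."""
    return None if pd.isna(value) else str(int(value)).zfill(width)

census['STATEFP'] = census['State ANSI'].map(lambda v: to_code(v, 2))
census['COUNTYFP'] = census['County ANSI'].map(lambda v: to_code(v, 3))

# ASSUMED layout of the lookup file; read as text to keep leading zeros
lookup_text = """STATE|STATEFP|COUNTYFP|COUNTYNAME
AL|01|001|Autauga County
AL|01|047|Dallas County
"""
lookup = pd.read_csv(io.StringIO(lookup_text), sep='|', dtype=str)

# Left merge keeps every census row, including those without a match
merged = census.merge(lookup[['STATEFP', 'COUNTYFP', 'COUNTYNAME']],
                      on=['STATEFP', 'COUNTYFP'], how='left')
print(merged)
```

A left merge is used so that the rows with missing county codes stay in the result and can be examined rather than silently dropped.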
  

In [32]:
# some entries in the Value column are not numbers
df_2['Value'].value_counts()

Value
 (D)      2306
6           76
9           68
18          62
10          61
          ... 
34,383       1
8,314        1
6,935        1
895          1
10,012       1
Name: count, Length: 1406, dtype: int64
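
One way to list exactly which entries are not plain numbers is to strip whitespace and thousands separators and then test each string against a digits-only pattern. This is a sketch on a small sample that mirrors the kinds of entries shown above; the variable names are illustrative.

```python
import pandas as pd

# Sample of entries like those in df_2['Value']
values = pd.Series([' (D)', '6', '9', '34,383', '895'], name='Value')

# Remove surrounding spaces and thousands separators before testing
cleaned = values.str.strip().str.replace(',', '', regex=False)

# True where the cleaned string consists only of digits
is_numeric = cleaned.str.fullmatch(r'\d+')

# Tally the distinct entries that are still not numeric
non_numeric = values[~is_numeric].str.strip().value_counts()
print(non_numeric)
```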

---
#### Changing Data Types

Some columns are not stored in appropriate data types.

The following columns should be stored as objects, because they are identifier codes rather than quantities: 'State ANSI', 'Ag District Code', 'County ANSI'

The Value column needs to be changed from object to int.


In [30]:
# Numeric codes to Objects
cols_int_to_object = ['State ANSI', 'Ag District Code', 'County ANSI']

for col in cols_int_to_object:
    df_2[col] = df_2[col].astype(object)

# Objects to Integers

df_2['Value'] = df_2['Value'].astype(int)

ValueError: invalid literal for int() with base 10: ' (D)'
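
The error arises because `astype(int)` cannot parse ' (D)', and it would also fail on comma-separated entries such as '34,383'. A possible workaround, sketched here on a small sample, is to strip the commas, coerce anything unparseable to missing with `pd.to_numeric`, and store the result in the nullable `Int64` dtype. Treating '(D)' as missing is an assumption about how those entries should be handled.

```python
import pandas as pd

# Sample of entries like those in df_2['Value']
values = pd.Series([' (D)', '6', '34,383', '895'], name='Value')

# Remove surrounding spaces and thousands separators
cleaned = values.str.strip().str.replace(',', '', regex=False)

# Unparseable entries such as '(D)' become NaN, then <NA> in Int64
as_int = pd.to_numeric(cleaned, errors='coerce').astype('Int64')
print(as_int)
```

The nullable `Int64` dtype keeps the parsed values as integers while still allowing missing entries, which a plain `int` column cannot hold.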