## Python Data Preparation

🐝 Time to merge the buzz! We’re joining bee data with scientific names to build one vibrant hive of insights—revealing where the buzz is and who's doing the pollinating. 🌸📊

First, read the CSV files into pandas DataFrames. As a quick review: a pandas DataFrame is like a spreadsheet in Python. It is a two-dimensional table where you store and work with data in rows and columns, just as you would in Excel or a SAS dataset.

In [2]:
import pandas as pd

df1 = pd.read_csv('/workspaces/myfolder/SASInnovate25/pattern_decline_N_American_Bumblebees.csv', dtype={6: str, 16: str}, encoding='latin-1')
df2 = pd.read_csv('/workspaces/myfolder/SASInnovate25/pattern_decline_Mexican_Bumblebees.csv', encoding='latin-1')
df3 = pd.read_csv('/workspaces/myfolder/SASInnovate25/Bumblebee_Others_Scientific_Common_Names.csv', encoding='latin-1')
df4 = pd.read_csv('/workspaces/myfolder/SASInnovate25/native_vs_nonnative_bumblebee_sighting_pollinators_of_farm_data_for_publication.csv', encoding='latin-1')
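To see what the `dtype` and `encoding` arguments do, here is a minimal sketch on made-up, in-memory data rather than the real bumblebee files. The column names and values are illustrative assumptions. The notebook's `dtype={6: str, 16: str}` appears to pick columns by position; the sketch uses a column name instead, which is easier to read.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for one of the bumblebee files.
csv_text = (
    "id,catalogNumber,scientificName\n"
    "1,00123,Bombus impatiens\n"
    "2,00456,Bombus occidentalis\n"
)

# Without a dtype hint, pandas infers integers and the leading zeros are lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, the column is kept as text exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"catalogNumber": str})

print(inferred["catalogNumber"].tolist())  # [123, 456]
print(as_text["catalogNumber"].tolist())   # ['00123', '00456']

# encoding matters when the bytes on disk are not UTF-8.
# Here the bytes are deliberately written as Latin-1 and read back the same way.
raw_bytes = "id,locality\n1,Yucatán\n".encode("latin-1")
localities = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(localities["locality"].iloc[0])      # Yucatán
```

Reading identifier-like columns as strings is a common way to avoid mixed-type surprises in large files.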

Before concatenating the two DataFrames to combine the North American (excluding Alaska) and Mexican bumblebee records, take a quick look at the dimensions of each.

In [2]:
# North American bumblebee decline dataframe
df1.shape

(66907, 26)

In [3]:
# Mexican bumblebee decline dataframe
df2.shape

(24, 26)

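Concatenation stitches DataFrames together along an axis, either the row axis or the column axis. Use pd.concat() and pass it a list of the DataFrames you want to concatenate. By default it stacks rows, so the 66,907 and 24 rows above should add up to 66,931 rows with the same 26 columns.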

In [4]:
dfconc = pd.concat([df1, df2])
dfconc.shape

(66931, 26)
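A small sketch of how `pd.concat` behaves on two toy frames (the data is made up for illustration). It shows that row counts add up, that the original index labels are kept unless `ignore_index=True` is passed, and that `axis=1` places frames side by side instead.

```python
import pandas as pd

# Toy stand-ins for the two bumblebee tables (hypothetical values).
north = pd.DataFrame({"id": [1, 2, 3], "genus": ["Bombus"] * 3})
mexico = pd.DataFrame({"id": [1, 2], "genus": ["Bombus"] * 2})

# Default: stack rows and keep each frame's own index labels.
stacked = pd.concat([north, mexico])

# ignore_index=True renumbers the rows 0..n-1 in the result.
renumbered = pd.concat([north, mexico], ignore_index=True)

# axis=1 aligns on the index and places the frames side by side.
side_by_side = pd.concat([north, mexico], axis=1)

print(stacked.shape)              # (5, 2)
print(stacked.index.tolist())     # [0, 1, 2, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
print(side_by_side.shape)         # (3, 4)
```

Whether to renumber the index is a design choice; duplicate labels are harmless for a later merge on a column but can surprise label-based lookups.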

### Merging data frames

We're buzzing into bumblebee data with Python by merging common names and nesting habits—giving each bee its name tag at the hive party! 🐝 This makes connecting Latin and everyday names a breeze for sweet, streamlined analysis. 🍯

The command list(df3) returns a list of the column names in the DataFrame df3. It is a quick way to see which variables (columns) the DataFrame contains.

In [6]:
list(df3)

['ScientificName',
 'Species',
 'specificEpithet',
 'CommonName',
 'Description',
 'Source']

The command print(df3) displays the contents of the DataFrame df3 in the console or output window. For larger DataFrames pandas abbreviates the display: it shows the first and last rows with `..` in between, and it wraps wide tables into blocks of columns that end in `\`, as in the output below.

In [None]:
print(df3)

            ScientificName      Species specificEpithet  \
0              Agapostemon  Agapostemon             NaN   
1     Agapostemon sericeus  Agapostemon        sericeus   
2    Agapostemon splendens  Agapostemon       splendens   
3      Agapostemon texanus  Agapostemon         texanus   
4    Agapostemon virescens  Agapostemon       virescens   
..                     ...          ...             ...   
157        Osmia bucephala        Osmia       bucephala   
158       Osmia collinsiae        Osmia      collinsiae   
159        Osmia distincta        Osmia       distincta   
160         Osmia georgica        Osmia        georgica   
161           Osmia pumila        Osmia          pumila   

                      CommonName  \
0             Metallic Green Bee   
1              Silky Agapostemon   
2           Splendid Agapostemon   
3              Texas Agapostemon   
4    Bicolored Striped Sweat Bee   
..                           ...   
157       Large-headed Mason Bee   
158

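If the abbreviated display hides rows you need to see, two options are sketched below on a made-up 100-row frame. The sketch assumes pandas' default display settings, under which a frame of this length is truncated when printed.

```python
import pandas as pd

# Hypothetical frame that is long enough to be truncated when printed.
df = pd.DataFrame({"n": range(100)})

# What print(df) shows under the default display options (abbreviated).
short = str(df)

# to_string() renders every row regardless of the display options.
full = df.to_string()

# option_context lifts the row limit temporarily, only inside the with block.
with pd.option_context("display.max_rows", None):
    untruncated = str(df)

print(len(short.splitlines()) < len(full.splitlines()))  # True
print(len(full.splitlines()))                            # 101 (header + 100 rows)
```

For a table as small as df3 (162 rows), either option is practical; for tens of thousands of rows, the abbreviated view is usually the better choice.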

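The command df3.describe() generates summary statistics for the DataFrame df3. By default it summarizes numeric columns, but df3 contains only text columns, so pandas reports count, unique, top (the most frequent value), and freq (how often that value occurs) for each column instead.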

In [11]:
df3.describe()

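|        | ScientificName | Species | specificEpithet | CommonName | Description | Source |
|--------|----------------|---------|-----------------|------------|-------------|--------|
| count  | 162 | 162 | 156 | 162 | 161 | 161 |
| unique | 162 | 23 | 151 | 161 | 122 | 7 |
| top    | Osmia pumila | Bombus | texana | Modest Masked Bee | Found in western U.S.; enjoys wildflowers and ... | Discover Life |
| freq   | 1 | 55 | 2 | 2 | 6 | 65 |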
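The sketch below shows how `describe()` changes its output with the column types, using a tiny made-up frame (the `sightings` column is a hypothetical addition for illustration).

```python
import pandas as pd

# Toy frame with one text column and one numeric column (hypothetical values).
bees = pd.DataFrame({
    "genus": ["Bombus", "Bombus", "Osmia"],
    "sightings": [10, 20, 30],
})

# Text-only data: count, unique, top, freq.
text_only = bees[["genus"]].describe()

# Mixed data: by default only the numeric column is summarized.
numeric_default = bees.describe()

# include="all" summarizes every column, filling non-applicable cells with NaN.
everything = bees.describe(include="all")

print(text_only.index.tolist())        # ['count', 'unique', 'top', 'freq']
print(text_only.loc["top", "genus"])   # Bombus
print(numeric_default.columns.tolist())  # ['sightings']
print(everything.columns.tolist())     # ['genus', 'sightings']
```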


Take a quick look at the dimensions of the tables we are about to merge

In [6]:
dfconc.shape

(66931, 26)

In [7]:
df3.shape

(162, 6)
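Beyond the shapes, it can help to check the join key before merging. The sketch below uses toy frames with made-up rows, and the common names are placeholders. It checks that the lookup table has one row per scientific name and counts how many records will find a match.

```python
import pandas as pd

# Toy stand-ins: many occurrence records, one lookup row per species.
records = pd.DataFrame({
    "scientificname": ["Bombus impatiens", "Bombus impatiens", "Bombus sonorus"],
})
names = pd.DataFrame({
    "scientificname": ["Bombus impatiens", "Osmia pumila"],
    "commonname": ["placeholder name A", "placeholder name B"],
})

# A unique key in the lookup table means the merge cannot multiply record rows.
key_is_unique = names["scientificname"].is_unique

# How many records have a matching name in the lookup table?
matched = records["scientificname"].isin(names["scientificname"])

print(key_is_unique)                 # True
print(int(matched.sum()), len(records))  # 2 3
```

In the describe() output above, ScientificName has 162 unique values out of 162 rows, which suggests df3 satisfies the uniqueness check.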

In pandas, DataFrames have a merge() method with functionality similar to a SAS merge. There is no need to sort ahead of time, and the how keyword selects the type of join (for example "inner", "left", "right", or "outer"). It’s like a hive of possibilities for your data!

In [8]:
inner_join = dfconc.merge(df3, on=["SCIENTIFICNAME"], how="inner")

KeyError: 'SCIENTIFICNAME'

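DataFrame column names are strings, and strings are case sensitive in Python. That is why the merge above fails with a KeyError: no column is named SCIENTIFICNAME in exactly that case. Take care whenever you use column names, whether you are renaming a column, selecting columns, or applying functions to them. Lowercasing the column names in both tables gives the merge a consistent key.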

In [9]:
dfconc.columns = dfconc.columns.str.lower()

In [11]:
list(dfconc)

['id',
 'institutioncode',
 'collectioncode',
 'basisofrecord',
 'occurrenceid',
 'catalognumber',
 'recordedby',
 'year',
 'month',
 'day',
 'country',
 'stateprovince',
 'county',
 'locality',
 'verbatimlatitude',
 'verbatimlongitude',
 'identifiedby',
 'scientificname',
 'kingdom',
 'phylum',
 'class',
 'order',
 'family',
 'genus',
 'specificepithet',
 'scientificnameauthorship']

In [12]:
df3.columns = df3.columns.str.lower()

In [13]:
list(df3)

['scientificname',
 'species',
 'specificepithet',
 'commonname',
 'description',
 'source']
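The case-sensitivity issue and the lowercase fix can be reproduced on a one-row toy frame (the data is illustrative only):

```python
import pandas as pd

# Toy frame with mixed-case column names, like the original df3.
df = pd.DataFrame({
    "ScientificName": ["Bombus impatiens"],
    "CommonName": ["Common Eastern Bumble Bee"],
})

# Looking up the column with the wrong case raises KeyError.
try:
    df["scientificname"]
    lookup_failed = False
except KeyError as err:
    lookup_failed = True
    print("KeyError:", err)

# .str.lower() applies to every column label at once.
df.columns = df.columns.str.lower()

print(list(df))                      # ['scientificname', 'commonname']
print(df["scientificname"].iloc[0])  # Bombus impatiens
```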

Use an inner join to merge dfconc and df3 on the scientificname column, keeping only the rows where there's a match in both tables—like inviting only the bees who appear on both guest lists! 🐝

In [15]:
df_inner = dfconc.merge(df3, on=["scientificname"], how="inner")
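A sketch of what the `how` keyword changes, on toy frames with made-up rows (the second common name is a placeholder). It also shows a detail that applies here: after lowercasing, both tables contain a `specificepithet` column, and pandas by default adds the suffixes `_x` and `_y` to overlapping columns that are not part of the key.

```python
import pandas as pd

# Toy stand-ins for dfconc and df3 (hypothetical rows).
records = pd.DataFrame({
    "scientificname": ["Bombus impatiens", "Bombus sonorus"],
    "specificepithet": ["impatiens", "sonorus"],
})
names = pd.DataFrame({
    "scientificname": ["Bombus impatiens", "Osmia pumila"],
    "specificepithet": ["impatiens", "pumila"],
    "commonname": ["Common Eastern Bumble Bee", "placeholder name"],
})

# inner: only keys present in both tables.
inner = records.merge(names, on="scientificname", how="inner")

# left: every record is kept; indicator=True adds a _merge column with the match status.
left = records.merge(names, on="scientificname", how="left", indicator=True)

# outer: every key from either table.
outer = records.merge(names, on="scientificname", how="outer")

print(inner.columns.tolist())
# ['scientificname', 'specificepithet_x', 'specificepithet_y', 'commonname']
print(len(inner), len(left), len(outer))  # 1 2 3
print(left["_merge"].tolist())            # ['both', 'left_only']
```

A left join with `indicator=True` is one way to find records that have no entry in the names table before deciding whether an inner join is appropriate.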

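The command df_inner.head() shows the first 5 rows of the df_inner DataFrame by default.

🐝 Think of it as peeking at the top of the hive: just a quick glance to see what kind of data is buzzing inside! If you want to see more or fewer rows, you can pass a number, as in df_inner.head(10). Mind the parentheses: the cell below was run as df_inner.head without them, so its output is the bound method's representation (which embeds an abbreviated view of the whole DataFrame) rather than the first five rows.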
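A quick sketch of `head` with and without a row count, on a made-up 20-row frame. It also shows that `head` written without parentheses is only a reference to the method and returns no rows.

```python
import pandas as pd

# Hypothetical 20-row frame.
df = pd.DataFrame({"id": range(1, 21)})

first_five = df.head()    # default: first 5 rows
first_ten = df.head(10)   # explicit row count
method_ref = df.head      # no parentheses: the method itself, not a DataFrame

print(len(first_five))                       # 5
print(len(first_ten))                        # 10
print(callable(method_ref))                  # True
print(isinstance(method_ref, pd.DataFrame))  # False
```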

In [16]:
df_inner.head

<bound method NDFrame.head of           id institutioncode collectioncode      basisofrecord  occurrenceid  \
0          1        USDA-ARS           BBSL  PreservedSpecimen   699384987.0   
1          2        USDA-ARS           BBSL  PreservedSpecimen   699384988.0   
2          3        USDA-ARS           BBSL  PreservedSpecimen   699384989.0   
3          4        USDA-ARS           BBSL  PreservedSpecimen   699384990.0   
4          5        USDA-ARS           BBSL  PreservedSpecimen   699384991.0   
...      ...             ...            ...                ...           ...   
66926  66927        USDA-ARS           BBSL  PreservedSpecimen           NaN   
66927  66928        USDA-ARS           BBSL  PreservedSpecimen           NaN   
66928  66929        USDA-ARS           BBSL  PreservedSpecimen           NaN   
66929  66930        USDA-ARS           BBSL  PreservedSpecimen           NaN   
66930  66931        USDA-ARS           BBSL  PreservedSpecimen           NaN   

      cat