Queen Bee’s Pattern Parade: 
👑🐝 The Queen Bee is on a mission to find the sweetest bee names with the best patterns! While SAS flexed its data-handling muscles to reveal hidden patterns, watch as Python weaves its web of regex wizardry. Which tool will uncover the juiciest insights, or are they both equal? Let the name hunt begin!

To obey the queen, we will find all bee names that contain a word ending in 'ern' or 'ed', or that contain a hyphen ('-'), using Perl-style regular expressions.

In [1]:
import pandas as pd
import re

# Read the scientific and common name lookup csv file into a DataFrame
df3=pd.read_csv('/workspaces/myfolder/SASInnovate25/Bumblebee_Others_Scientific_Common_Names.csv' , encoding='latin-1')

re is a standard library module (or "package") in Python that provides support for regular expressions—a powerful way to search, match, and manipulate strings based on patterns.

Since it's built into Python, you don't need to install it separately—just import re, and you're ready to start buzzing through text with regex! 🐝
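As a quick illustration of the module in action, here is a minimal sketch that uses re.search on a few sample names. The names in the list are made up for the example and are not read from the CSV file.

```python
import re

# Hypothetical common names, used only for illustration
names = [
    "Western bumble bee",
    "Red-belted bumble bee",
    "Common eastern bumble bee",
]

# re.search returns a match object if the pattern occurs anywhere
# in the string, and None otherwise
hyphenated = [name for name in names if re.search(r"-", name)]

print(hyphenated)  # ['Red-belted bumble bee']
```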

In [2]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162 entries, 0 to 161
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ScientificName   162 non-null    object
 1   Species          162 non-null    object
 2   specificEpithet  156 non-null    object
 3   CommonName       162 non-null    object
 4   Description      161 non-null    object
 5   Source           161 non-null    object
dtypes: object(6)
memory usage: 7.7+ KB


DataFrames can be filtered in several ways. The most intuitive is boolean indexing: build a Series of True/False values and use it to select the matching rows.
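A minimal sketch of boolean indexing on a toy DataFrame may help before applying it to the real data. The rows below are invented for illustration; only the column names mirror the lookup file.

```python
import pandas as pd

# Toy data, invented for this example
toy = pd.DataFrame({
    "CommonName": ["Western bumble bee", "Red-belted bumble bee", "Yellow bumble bee"],
    "Species": ["Bombus", "Bombus", "Bombus"],
})

# Step 1: build a boolean Series, one True/False value per row
mask = toy["CommonName"].str.contains("-", regex=False)
print(mask.tolist())  # [False, True, False]

# Step 2: use the mask to keep only the rows where it is True
filtered = toy[mask]
print(filtered)
```

If the column contained missing values, passing na=False to str.contains would keep the mask strictly boolean.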

In [6]:
# Regex pattern (case-insensitive): the standalone word "ed" or "ern", or any hyphen
pattern = r'(?i)\b(?:ed|ern)\b|-'  # (?i) = ignore case, \b = word boundary

# Apply the filter
df_regex = df3[df3['CommonName'].str.contains(pattern, regex=True)]

# Display the filtered DataFrame
print(df_regex)

               ScientificName       Species specificEpithet  \
29           Bombus appositus        Bombus       appositus   
33           Bombus balteatus        Bombus       balteatus   
34            Bombus bifarius        Bombus        bifarius   
35         Bombus bimaculatus        Bombus     bimaculatus   
47          Bombus flavifrons        Bombus      flavifrons   
51        Bombus griseocollis        Bombus    griseocollis   
57             Bombus lucorum        Bombus         lucorum   
58         Bombus melanopygus        Bombus     melanopygus   
59              Bombus mixtus        Bombus          mixtus   
67         Bombus rotundiceps        Bombus     rotundiceps   
68         Bombus rufocinctus        Bombus     rufocinctus   
74           Bombus ternarius        Bombus       ternarius   
75          Bombus terrestris        Bombus      terrestris   
76           Bombus terricola        Bombus       terricola   
77              Bombus vagans        Bombus          va

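Breakdown
This regex has two alternatives, separated by |:

\b(?:ed|ern)\b matches "ed" or "ern" only as a whole word, because \b requires a word boundary on both sides. A name such as "Western" does not match this branch, since "ern" is preceded by the letter "t". To match words that merely end in "ed" or "ern", the leading \b would have to be dropped, giving (?:ed|ern)\b.

- matches a hyphen anywhere in the string.

(?i) makes the pattern case-insensitive.

\b is a zero-width word boundary: the position between a word character (letter, digit, or underscore) and a non-word character, or the start or end of the string.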
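To see exactly what the pattern accepts, here is a small sketch that runs it against a few sample strings, next to a variant with the leading \b removed. The variant and the sample strings are assumptions made for this illustration; they are not part of the original analysis. Because of the leading \b, the original pattern only matches "ed" or "ern" as whole words, so a name like "Western bumble bee" is picked up only by the variant.

```python
import re

original = re.compile(r"(?i)\b(?:ed|ern)\b|-")  # pattern used in the cell above
suffix = re.compile(r"(?i)(?:ed|ern)\b|-")      # assumed variant: leading \b removed

# Hypothetical strings chosen to exercise each branch of the pattern
samples = [
    "Western bumble bee",     # word ending in "ern", no hyphen
    "Red-belted bumble bee",  # contains a hyphen
    "Ed ern",                 # "ed" and "ern" as whole words
    "Yellow bumble bee",      # no match expected for either pattern
]

results = {
    s: (bool(original.search(s)), bool(suffix.search(s)))
    for s in samples
}

for name, (orig_hit, suffix_hit) in results.items():
    print(f"{name!r}: original={orig_hit}, suffix variant={suffix_hit}")
```

Only the first sample differs between the two patterns: the original rejects it and the variant accepts it.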

In [4]:
df_regex.shape

(33, 6)

df_regex.shape is a pandas DataFrame attribute that returns a tuple with the dimensions of the DataFrame: the number of rows and the number of columns.

Since df_regex.shape returns (33, 6), the filtered DataFrame df_regex has:

33 rows (bee records that matched the regex pattern), and

6 columns (ScientificName, Species, and so on).
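Because shape is a plain tuple, it can be unpacked directly into two variables. A minimal sketch on a toy DataFrame (invented for the example):

```python
import pandas as pd

# Toy DataFrame with 3 rows and 2 columns
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)       # 3 2

# len() on a DataFrame gives the row count only
print(len(toy) == n_rows)   # True
```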

### Grouping and Aggregating Data

In [7]:
import pandas as pd

# Read the North American bumblebee CSV file into a DataFrame, forcing columns 6 and 16 to be read as strings
df1=pd.read_csv('/workspaces/myfolder/SASInnovate25/pattern_decline_N_American_Bumblebees.csv', dtype={6: str, 16: str}, encoding='latin-1')
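The effect of forcing a column to str can be sketched with a tiny in-memory CSV. Everything in this example is hypothetical: the data, and the column name siteCode, are invented, and the dtype dictionary is keyed by column name rather than by position as in the cell above.

```python
import io
import pandas as pd

# Hypothetical CSV text; "siteCode" is an invented column name
csv_text = "siteCode,count\n007,3\n042,5\n"

# Without dtype, pandas infers integers and the leading zeros are lost
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["siteCode"].tolist())  # [7, 42]

# With dtype=str for that column, the values are kept exactly as written
forced = pd.read_csv(io.StringIO(csv_text), dtype={"siteCode": str})
print(forced["siteCode"].tolist())    # ['007', '042']
```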

In [11]:
df1.groupby(['scientificName','stateProvince']).size().reset_index(name='count').sort_values(by='count', ascending=False).head(20)

          scientificName  stateProvince  count
629  Bombus vosnesenskii     California   8982
85       Bombus bifarius     California   2950
94       Bombus bifarius           Utah   2392
557     Bombus terricola       Michigan   2185
306     Bombus impatiens       Illinois   1723
410  Bombus occidentalis     California   1712
86       Bombus bifarius       Colorado   1594
634  Bombus vosnesenskii         Oregon   1588
92       Bombus bifarius         Oregon   1555
95       Bombus bifarius     Washington   1189


This line of code performs a grouped count summary in Pandas and returns the top 20 combinations of scientific name and state/province by frequency, sorted in descending order.

🐝 In bee-speak: this is like tallying up how many times each bee species shows up in each state, ranking them from the most spotted to the least — the top 20 buzziest combos!

Explanation
(
    df1.groupby(['scientificName', 'stateProvince'])  # Group the DataFrame by both scientific name and state/province
    .size()                                           # Count the number of rows (i.e., bee observations) in each group
    .reset_index(name='count')                        # Convert the result to a DataFrame and name the count column 'count'
    .sort_values(by='count', ascending=False)         # Sort the counts from highest to lowest
    .head(20)                                         # Show only the top 20 results
)
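The same chain can be tried on a handful of rows to see each step at a small scale. The observations below are invented for illustration; only the column names match the bumblebee file. The sketch also shows DataFrame.value_counts as a possible shorter route to the same counts.

```python
import pandas as pd

# Invented observations, one row per sighting
obs = pd.DataFrame({
    "scientificName": [
        "Bombus bifarius", "Bombus bifarius", "Bombus bifarius",
        "Bombus impatiens", "Bombus impatiens", "Bombus bifarius",
    ],
    "stateProvince": ["Utah", "Utah", "Oregon", "Illinois", "Illinois", "Utah"],
})

counts = (
    obs.groupby(["scientificName", "stateProvince"])
    .size()                                    # rows per (species, state) pair
    .reset_index(name="count")                 # back to a DataFrame
    .sort_values(by="count", ascending=False)  # largest groups first
    .head(2)                                   # keep the top 2
)
print(counts)

# Alternative: value_counts on the same two columns, sorted descending by default
alt = obs.value_counts(["scientificName", "stateProvince"]).head(2)
print(alt)
```

The index labels on the left of the sorted result are the row positions from before the sort, which is why they are not sequential in the output above.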