## Python Data Exploration

In [20]:
import pandas as pd

Metadata time: let’s see what’s buzzing under the surface! 🍯🐝

In [21]:
#metadata
df2.info()

NameError: name 'df2' is not defined

Why do I have to import the same data for each notebook in Workbench? Why can't I just import it in one notebook and use it across others?

In SAS Viya Workbench (and most notebook environments, such as Jupyter), each notebook runs its own kernel: a separate process with its own memory.<br>
🔹 When you import data into one notebook, it lives only in that notebook’s memory.<br>
🔹 Other notebooks can’t see that memory unless you explicitly save the data somewhere both can access, such as a file (CSV, SAS7BDAT, etc.) or a shared database.<br><br>

Think of it like this:<br>
🐝 Each notebook is like its own private hive — it doesn't know what's buzzing in the next hive unless you share the honey (data) in a common place.
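
Here is a minimal sketch of "sharing the honey" through a file. The folder, file name, and the tiny `bees` DataFrame are made up for illustration; in Workbench you would point `shared_file` at a location both notebooks can reach.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical shared location (a temp folder here so the sketch runs anywhere)
shared_dir = Path(tempfile.mkdtemp())
shared_file = shared_dir / "bees_shared.csv"

# --- Notebook A: build (or import) the data, then save it to the shared place
bees = pd.DataFrame({
    "scientificName": ["Bombus impatiens", "Bombus pensylvanicus"],
    "year": [1965, 1928],
})
bees.to_csv(shared_file, index=False)  # index=False avoids an extra index column

# --- Notebook B: its kernel has no `bees` variable, so it reads the file
bees_again = pd.read_csv(shared_file)
print(bees_again)
```

Each notebook still loads its own copy into memory; the file is simply the common place where the honey is stored.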

In [2]:
import pandas as pd

df1=pd.read_csv('/workspaces/myfolder/SASInnovate25/pattern_decline_N_American_Bumblebees.csv', dtype={6: str, 16: str}, encoding='latin-1')
df2=pd.read_csv('/workspaces/myfolder/SASInnovate25/pattern_decline_Mexican_Bumblebees.csv' , encoding='latin-1')
df3=pd.read_csv('/workspaces/myfolder/SASInnovate25/Bumblebee_Others_Scientific_Common_Names.csv' , encoding='latin-1')
df4=pd.read_csv('/workspaces/myfolder/SASInnovate25/native_vs_nonnative_bumblebee_sighting_pollinators_of_farm_data_for_publication.csv' , encoding='latin-1')

🐝 Why doesn’t Python leave as many 'honey trails' of progress as SAS does?

Python stays quiet unless you ask it to speak (with print(), logging, or verbose settings), while SAS automatically logs every step to meet strict audit needs in industries like healthcare and finance. If you want more buzz in Python, you can add print() calls, use the logging module, or turn on verbose options where a library offers them!

🐝✨ Let’s create a tiny SAS-style log in Python to show you how it can feel more "chatty" during program execution.

In [24]:
import pandas as pd
import logging

# Set up a basic logger
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

logging.info('Starting program execution...')

# Step 1: Read data
try:
    df = pd.read_csv('/workspaces/myfolder/SASInnovate25/pattern_decline_N_American_Bumblebees.csv', encoding='latin-1')
    logging.info('CSV file successfully read into a DataFrame.')
except Exception as e:
    logging.error(f'Error reading CSV file: {e}')
    raise  # stop here: the steps below need df

# Step 2: Check basic information
logging.info('Displaying dataset structure:')
df.info()  # info() prints its report itself and returns None

# Step 3: Calculate basic statistics
summary = df.describe()
logging.info('Calculated summary statistics.')

# Step 4: Preview data
logging.info('Here are the first 5 rows of the dataset:')
print(df.head())

logging.info('Program execution completed successfully.')

2025-04-29 01:53:49,659 - INFO - Starting program execution...
  df = pd.read_csv('/workspaces/myfolder/SASInnovate25/pattern_decline_N_American_Bumblebees.csv', encoding='latin-1')
2025-04-29 01:53:49,913 - INFO - CSV file successfully read into a DataFrame.
2025-04-29 01:53:49,914 - INFO - Displaying dataset structure:
2025-04-29 01:53:50,011 - INFO - Calculated summary statistics.
2025-04-29 01:53:50,011 - INFO - Here are the first 5 rows of the dataset:
2025-04-29 01:53:50,026 - INFO - Program execution completed successfully.


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66907 entries, 0 to 66906
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   id                        66907 non-null  int64  
 1   institutionCode           66907 non-null  object 
 2   collectionCode            66907 non-null  object 
 3   basisOfRecord             66907 non-null  object 
 4   occurrenceID              66907 non-null  int64  
 5   catalogNumber             66907 non-null  object 
 6   recordedBy                25350 non-null  object 
 7   year                      65778 non-null  float64
 8   month                     66368 non-null  float64
 9   day                       63897 non-null  float64
 10  country                   66818 non-null  object 
 11  stateProvince             66818 non-null  object 
 12  county                    59648 non-null  object 
 13  locality                  62342 non-null  object 
 14  verbat

What this does:

logging.info() shows friendly notes as you move through each step (just like SAS NOTES).

If something goes wrong, logging.error() prints an error (just like SAS ERRORS).

Because the format string includes %(asctime)s, each message is timestamped automatically!
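
To make the log feel even more like SAS, you could wrap the row and column counts in a NOTE-style message. This is only a sketch: the helper `note_shape`, the logger name `bee_log`, and the small `demo` DataFrame are hypothetical names introduced for illustration, not part of pandas or logging.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger('bee_log')  # hypothetical logger name


def note_shape(df, name):
    """Log a SAS-like NOTE with row and column counts and return the message."""
    rows, cols = df.shape
    message = f'NOTE: {name} has {rows} observations and {cols} variables.'
    logger.info(message)
    return message


# Small made-up DataFrame so the example runs on its own
demo = pd.DataFrame({'year': [1928, 1965, 1984], 'month': [8, 8, 7]})
msg = note_shape(demo, 'demo')
```

Calling `note_shape` after each read or transformation gives a running trail of how the data changes, which is the part of the SAS log many people miss most.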

In [25]:
# Now that we have re-read the df2 DataFrame, we can look at the metadata
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   id                        24 non-null     int64  
 1   institutionCode           24 non-null     object 
 2   collectionCode            24 non-null     object 
 3   basisOfRecord             24 non-null     object 
 4   occurrenceID              0 non-null      float64
 5   catalogNumber             24 non-null     object 
 6   recordedBy                0 non-null      float64
 7   year                      24 non-null     int64  
 8   month                     24 non-null     int64  
 9   day                       24 non-null     int64  
 10  country                   24 non-null     object 
 11  stateProvince             24 non-null     object 
 12  county                    0 non-null      float64
 13  locality                  23 non-null     object 
 14  verbatimLati

The method info() provides technical information about a DataFrame, so let’s view the output in more detail:

df2 is a DataFrame.

There are 24 entries, i.e. 24 rows.

Each row has a row label (the index) with values ranging from 0 to 23.

The table has 26 columns. Most columns have a value in every row (24 non-null), but the output shows exceptions: occurrenceID, recordedBy, and county are entirely empty (0 non-null), and locality is missing one value (23 non-null).

Some columns hold text (strings, shown as object). The others are numeric: whole numbers (int64) or real numbers (float64). The entirely empty columns show up as float64 because pandas stores their missing values as NaN, which is a float.

The kinds of data (text, integers, floats, …) in the different columns are summarized in the dtypes line at the end of the output.

The approximate amount of RAM used to hold the DataFrame is provided as well.
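
If you want the individual pieces of the info() report as values you can reuse, the sketch below pulls them out one at a time. The `mini` DataFrame and its values are made up to mimic the pattern in df2: one complete column, one partly missing, and one entirely empty.

```python
import numpy as np
import pandas as pd

# Made-up miniature with hypothetical values
mini = pd.DataFrame({
    "id": [66908, 66909, 66910],
    "locality": ["site A", None, "site B"],
    "county": [np.nan, np.nan, np.nan],
})

n_rows, n_cols = mini.shape                         # number of rows and columns
dtypes = mini.dtypes                                # one dtype per column
non_null = mini.notna().sum()                       # the "Non-Null Count" column
missing = mini.isna().sum()                         # missing values per column
memory_bytes = mini.memory_usage(deep=True).sum()   # approximate RAM in bytes

print(n_rows, n_cols)
print(non_null)
print(dtypes)
```

Note that the all-empty `county` column comes out as float64, the same pattern seen for the empty columns in df2.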

In [None]:
# Read the first 5 rows of df2 using the head method, similar to PROC PRINT
df2.head()

Unnamed: 0,id,institutionCode,collectionCode,basisOfRecord,occurrenceID,catalogNumber,recordedBy,year,month,day,...,identifiedBy,scientificName,kingdom,phylum,class,order,family,genus,specificEpithet,scientificNameAuthorship
0,66908,USDA-ARS,BBSL,PreservedSpecimen,,BOMBUS1055,,1965,8,11,...,,Bombus pensylvanicus,Animalia,Arthropoda,Insecta,Hymenoptera,Apidae,Bombus,pensylvanicus,(DeGeer 1773)
1,66909,USDA-ARS,BBSL,PreservedSpecimen,,BOMBUS1062,,1928,8,26,...,,Bombus pensylvanicus,Animalia,Arthropoda,Insecta,Hymenoptera,Apidae,Bombus,pensylvanicus,(DeGeer 1773)
2,66910,USDA-ARS,BBSL,PreservedSpecimen,,BOMBUS1063,,1928,8,21,...,,Bombus pensylvanicus,Animalia,Arthropoda,Insecta,Hymenoptera,Apidae,Bombus,pensylvanicus,(DeGeer 1773)
3,66911,USDA-ARS,BBSL,PreservedSpecimen,,BOMBUS1064,,1928,8,5,...,,Bombus pensylvanicus,Animalia,Arthropoda,Insecta,Hymenoptera,Apidae,Bombus,pensylvanicus,(DeGeer 1773)
4,66912,USDA-ARS,BBSL,PreservedSpecimen,,BOMBUS1065,,1928,8,19,...,,Bombus pensylvanicus,Animalia,Arthropoda,Insecta,Hymenoptera,Apidae,Bombus,pensylvanicus,(DeGeer 1773)


In [32]:
# Understand hive activity by generating descriptive statistics for one column, year, just like PROC MEANS
df2['year'].describe()

count      24.00000
mean     1940.12500
std        20.28774
min      1908.00000
25%      1928.00000
50%      1928.00000
75%      1962.00000
max      1984.00000
Name: year, dtype: float64
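
describe() returns a fixed set of statistics. If you only want a few of them, or want them per group (roughly what a CLASS statement does in PROC MEANS), agg() and groupby() are one way to get there. The sketch below uses a small made-up `obs` DataFrame with placeholder group labels, not the bumblebee data.

```python
import pandas as pd

# Made-up observations for illustration only
obs = pd.DataFrame({
    "stateProvince": ["A", "A", "B", "B", "B"],
    "year": [1908, 1928, 1928, 1962, 1984],
})

# Pick only the statistics you want for one column
stats = obs["year"].agg(["count", "mean", "min", "max"])
print(stats)

# The same statistics per group, similar in spirit to a CLASS statement
by_state = obs.groupby("stateProvince")["year"].agg(["count", "mean", "min", "max"])
print(by_state)
```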

Learn where the bumblebees like to buzz around the most by getting frequency counts, similar to PROC FREQ.

In [31]:
stateProvince_freq = df2['stateProvince'].value_counts()
print(stateProvince_freq)
scientificName_freq = df2['scientificName'].value_counts()
print(scientificName_freq)

stateProvince
Mexico          20
Quintana Roo     2
Durango          1
Tamaulipas       1
Name: count, dtype: int64
scientificName
Bombus pensylvanicus    23
Bombus impatiens         1
Name: count, dtype: int64
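
PROC FREQ also reports percentages and cumulative counts, which value_counts() does not show by default. One possible way to build a similar table is sketched below; the `states` Series is made up, and the column labels are chosen only to resemble PROC FREQ output.

```python
import pandas as pd

# Made-up values for illustration only
states = pd.Series(["Mexico", "Mexico", "Mexico", "Durango"], name="stateProvince")

counts = states.value_counts()                        # frequency per value
percents = states.value_counts(normalize=True) * 100  # share of rows, in percent

freq_table = pd.DataFrame({
    "Frequency": counts,
    "Percent": percents.round(1),
})
freq_table["Cumulative Frequency"] = freq_table["Frequency"].cumsum()
print(freq_table)
```

By default value_counts() skips missing values; passing `dropna=False` would include them as their own row.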


In [4]:
# List of the columns to display
columns_to_display = [
    'year', 'month', 'stateProvince',  
    'scientificName', 'kingdom', 'phylum', 'class', 'genus'
]

# Display only the specified columns
print(df1[columns_to_display])

         year  month stateProvince       scientificName   kingdom      phylum  \
0      1970.0    7.0       Arizona  Bombus occidentalis  Animalia  Arthropoda   
1      1970.0    7.0       Arizona  Bombus occidentalis  Animalia  Arthropoda   
2      1989.0    6.0       Arizona      Bombus bifarius  Animalia  Arthropoda   
3      1970.0    9.0       Arizona  Bombus occidentalis  Animalia  Arthropoda   
4      1961.0    8.0       Arizona      Bombus bifarius  Animalia  Arthropoda   
...       ...    ...           ...                  ...       ...         ...   
66902  1917.0    NaN      Colorado      Bombus bifarius  Animalia  Arthropoda   
66903  1940.0    NaN          Utah      Bombus bifarius  Animalia  Arthropoda   
66904  1940.0    NaN          Utah      Bombus bifarius  Animalia  Arthropoda   
66905  1940.0    NaN          Utah      Bombus bifarius  Animalia  Arthropoda   
66906  1940.0    NaN          Utah      Bombus bifarius  Animalia  Arthropoda   

         class   genus  
0 