# Python workshop - 2025

<div>
    <img src="../images/qcbs_logo_v2.svg" style="background-color: #f0f0f0; padding: 20px;"/>
</div>

<div>
    <img src="../images/python_logo_generic.svg" style="background-color: #f0f0f0; padding: 20px;"/>
</div>

**Last update**: 2025-05-19  
**Author**: El-Amine Mimouni  
**Affiliation**: Québec Centre for Biodiversity Science

**Overview**: In this notebook, we will see how to use Pandas.

---

# Pandas

The name Pandas comes from "panel data" (so not the animal 🐼). It is built on top of NumPy and adds labeled rows and columns.

If you want to learn more about it, visit [https://pandas.pydata.org/](https://pandas.pydata.org/).

In [None]:
# Import dependencies
import pandas as pd

# Creating a pandas dataframe

In [None]:
# Create a Series
my_series = pd.Series(data=[10, 20, 30, 40, 180])

print(my_series)
print(type(my_series))

In [None]:
# Create a Series holding a qualitative (categorical) variable
my_series2 = pd.Series(data=["Biscoe", "Biscoe", "Dream", "Dream", "Dream"])

print(my_series2)
print(type(my_series2))
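
A Series of strings like the one above is stored as plain text by default. As a minimal sketch, assuming you want something closer to a factor, the `category` dtype can be requested explicitly. The variable name `islands` is only illustrative.

```python
import pandas as pd

# Illustrative data: the same island labels as above, stored as a categorical
islands = pd.Series(
    data=["Biscoe", "Biscoe", "Dream", "Dream", "Dream"],
    dtype="category",
)

# The distinct levels are available through the .cat accessor
print(islands.cat.categories.tolist())

# Counting observations per level
print(islands.value_counts())
```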

In [None]:
# Create a small dictionary
# (a structure you should know by now)
penguins_dico = {
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0]
    }

print(penguins_dico)

In [None]:
penguins_df = pd.DataFrame(data=penguins_dico)

print(type(penguins_df))
penguins_df

In [None]:
# But look under the hood
print(penguins_df.values)
print(type(penguins_df.values))
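
The link with NumPy can be sketched on a toy table. This example assumes the small hypothetical DataFrame `toy_df` defined below, and uses `.to_numpy()`, the accessor that the pandas documentation recommends over `.values`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, mirroring the small dictionary above
toy_df = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
})

# .to_numpy() returns the data as a NumPy array
toy_array = toy_df.to_numpy()
print(type(toy_array))
print(toy_array.shape)

# NumPy functions also accept a DataFrame column directly
print(np.mean(toy_df["bill_length_mm"]))
```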

In [None]:
# So you still have access to familiar methods
penguins_df.sum(axis=0)

In [None]:
# It has additional mathematical methods
print(my_series.skew())
print(my_series.kurtosis())

# Better indexing

Pandas is more than just syntactic sugar over NumPy.

It distinguishes between label-based and integer position-based indexing.

Because Pandas rows and columns carry labels, you can index them by name.

In [None]:
# Read the entire Palmer penguins dataset
palmer = pd.read_csv(filepath_or_buffer="../data/penguins.csv")

# See it!
print(type(palmer))
print(type(palmer.values))
palmer

In [None]:
# Columns are labeled, which means that you can select individual
# columns by their name rather than by their number.
# The double brackets return a DataFrame with a single column.
palmer[["island"]]
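
The number of brackets matters when selecting a column. The sketch below, on a hypothetical toy table, shows that single brackets give a Series while double brackets give a one-column DataFrame.

```python
import pandas as pd

# Hypothetical toy table standing in for the penguins dataset
toy = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Dream"],
    "bill_length_mm": [39.1, 39.5, 40.3],
})

as_series = toy["island"]    # single brackets: a Series
as_frame = toy[["island"]]   # double brackets: a DataFrame

print(type(as_series))
print(type(as_frame))
print(as_series.shape, as_frame.shape)
```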

In [None]:
# If you want to choose several columns, supply them as a list.
palmer[["island", "species", "bill_length_mm"]]

In [None]:
# Again the most general indexing method is:
palmer.loc[:, :]

# So that your previous command was:
palmer.loc[:, ["island", "species", "bill_length_mm"]]

In [None]:
# In addition, you can also create boolean indices
palmer[["species"]] == "Adelie"

In [None]:
# You can use this boolean filter for row indexing
palmer.loc[palmer["island"] == "Dream", ["island", "species", "bill_length_mm"]]

In [None]:
# Again, several boolean filters can be chosen
# Note the parentheses
palmer.loc[(palmer["island"] == "Dream") & (palmer["species"] == "Adelie"), ["island", "species", "bill_length_mm"]]
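
Boolean filters can also be combined with "or" (`|`) and negated with "not" (`~`). The following is a minimal sketch on a hypothetical toy table. The `.isin()` method is shown as a shorter way to test membership in a list of values.

```python
import pandas as pd

# Hypothetical toy table standing in for the penguins dataset
toy = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Dream", "Torgersen"],
    "species": ["Adelie", "Adelie", "Chinstrap", "Adelie"],
    "bill_length_mm": [39.1, 39.5, 46.5, 38.9],
})

# "or" between two filters (again, note the parentheses)
dream_or_biscoe = toy.loc[(toy["island"] == "Dream") | (toy["island"] == "Biscoe"), :]

# The same selection written with .isin()
same_with_isin = toy.loc[toy["island"].isin(["Dream", "Biscoe"]), :]

# "not": every row that is not an Adelie
not_adelie = toy.loc[~(toy["species"] == "Adelie"), :]

print(dream_or_biscoe)
print(not_adelie)
```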

# IMPORTANT
ESPECIALLY FOR NEWCOMERS

In [None]:
# In Pandas, the .loc[] indexer selects rows and columns by label,
# so that
palmer.loc[0:10, :]

# will return all observations labeled 0 through 10, with 10 INCLUDED (11 rows)!
# This differs from what you saw with Python lists and NumPy arrays,
# where the end of a slice is excluded

In [None]:
# Because .loc[] is label-based, it is unsuited for integer
# position based selection of columns: the column labels here
# are names (strings), not integers

# The following code will give you an error
#palmer.loc[0:10, 0:3]

In [None]:
# If you need integer position based subsetting, the .iloc[] indexer
# can be used.
# It behaves like we saw for Python lists and NumPy arrays:
# the end of a slice is excluded.

# The following code will NOT give you an error
palmer.iloc[0:10, 0:3]
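
The difference between the two indexers is easiest to see when the row labels do not start at 0. The sketch below assumes a hypothetical toy table whose index runs from 10 to 14.

```python
import pandas as pd

# Hypothetical toy table with row labels 10 to 14
toy = pd.DataFrame(
    {"bill_length_mm": [39.1, 39.5, 40.3, 36.7, 39.3]},
    index=[10, 11, 12, 13, 14],
)

# .loc[] uses labels and includes the end of the slice
by_label = toy.loc[10:12, :]

# .iloc[] uses positions and excludes the end of the slice
by_position = toy.iloc[0:2, :]

print(by_label)
print(by_position)
```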

# Quick summaries

In [None]:
# You can also get quick summaries of variables
palmer.describe(include="number")

In [None]:
# If you want to save this table (or any other for
# that matter), you can use the .to_csv() method.

# Store the table returned by .describe(), then write it
# to a .csv file.
summary_tab = palmer.describe(include="number")
summary_tab.to_csv(path_or_buf="../data/penguins_summary.csv", sep=",")
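
When a summary table is read back, its row labels ("count", "mean", ...) come back as an ordinary first column unless told otherwise. As a sketch under the assumption of a toy table, and using an in-memory buffer instead of a file on disk, `index_col=0` restores them as the index.

```python
import io

import pandas as pd

# Hypothetical toy table standing in for the penguins dataset
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
})
summary = toy.describe()

# With no path, .to_csv() returns the CSV content as a string
csv_text = summary.to_csv()

# index_col=0 turns the first column back into the row labels
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```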

In [None]:
# Qualitative variables use the argument "object"
palmer.describe(include="object")

# Data cleaning

Data cleaning is efficient: you can remove NaN values easily and precisely.

In [None]:
palmer.head()

In [None]:
palmer.dropna(inplace=False).head()

In [None]:
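# Removing NaN on only a particular variable:
# start from a small piece of the dataset that contains missing values.
# By default, .dropna() removes every row that contains at least one NaN
palmer_piece = palmer.loc[267:272, :].copy()
palmer_piece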

In [None]:
# If you use .dropna(), you will remove both rows 268 and 271
palmer_piece.dropna()

In [None]:
# But you can specify on which columns you want to drop rows if they contain NaN values
palmer_piece.dropna(subset=["bill_depth_mm"])
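
Before dropping anything, it can help to count how many values are missing. The sketch below uses a hypothetical toy table and also shows `how="all"`, which drops a row only when every value in it is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with missing values
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, np.nan],
    "bill_depth_mm": [18.7, 17.4, np.nan, np.nan],
    "sex": ["male", "female", None, None],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Default behaviour: drop rows with at least one missing value
any_dropped = toy.dropna()

# how="all": drop only the rows where everything is missing
all_dropped = toy.dropna(how="all")

print(any_dropped)
print(all_dropped)
```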

# Constructing new variables

In [None]:
# You can build new variables from existing ones
print(palmer["bill_length_mm"] / palmer["bill_depth_mm"])

# Adding a column is done simply by assigning the result to a new column name
palmer["bill_ratio"] = palmer["bill_length_mm"] / palmer["bill_depth_mm"]

In [None]:
# You can check to make sure it is there.
print(palmer.columns)
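
If you would rather leave the original table untouched, the `.assign()` method returns a new DataFrame with the extra column. This is a minimal sketch on a hypothetical toy table.

```python
import pandas as pd

# Hypothetical toy table standing in for the penguins dataset
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
})

# .assign() returns a copy with the new column; toy itself is not modified
with_ratio = toy.assign(bill_ratio=toy["bill_length_mm"] / toy["bill_depth_mm"])

print(with_ratio.columns)
print(toy.columns)
```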

# Summarizing variables using .groupby

In [None]:
# You can use the .groupby() method to group dataframes by a qualitative variable
print(palmer.groupby(by="island"))
print(type(palmer.groupby(by="island")))

In [None]:
# You can apply the aggregate function you want by calling it as a method afterwards
# .count()
# .mean()
# .min()
# .max()
print("Number of non-missing values per variable, by island")
palmer.groupby(by="island").count()

In [None]:
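# The following code will fail in recent Pandas versions if you
# uncomment it, because .mean() cannot be computed on non-numeric
# columns such as "species". Read the error message to understand.
#
# Here, only the aggregate function changed, from .count() to .mean()
print("Average of each variable by island")
#palmer.groupby(by="island").mean()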
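
One way around this failure, sketched here on a hypothetical toy table, is to pass `numeric_only=True` so that only the numeric columns are averaged.

```python
import pandas as pd

# Hypothetical toy table standing in for the penguins dataset
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream", "Dream"],
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "bill_length_mm": [38.0, 46.0, 39.0, 49.0],
})

# The text column "species" is left out of the result
means = toy.groupby(by="island").mean(numeric_only=True)
print(means)
```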

In [None]:
# You can consider a single variable by appending it at the end
print("Average penguin bill length by island")
palmer.groupby(by="island")[["bill_length_mm"]].mean()

In [None]:
# You can submit more than one variable in the "by" parameter
# These are considered in succession
print("Average penguin bill length by island, then by species:")
palmer.groupby(by=["island", "species"])[["bill_length_mm"]].mean()

In [None]:
# More than one variable can be summarized, in which case they must be supplied as a list
print("Average penguin bill length and depth by island, then by species:")
palmer.groupby(by=["island", "species"])[["bill_length_mm", "bill_depth_mm"]].mean()

In [None]:
# You can even go full complexity by applying several aggregate
# functions at once, supplied as a list to the .agg() method
print("Minimum and maximum penguin bill length and depth by island, then by species:")
palmer.groupby(by=["island", "species"])[["bill_length_mm", "bill_depth_mm"]].agg(["min", "max"])
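
When several aggregates are computed, the resulting column names can become hard to read. The sketch below uses named aggregation on a hypothetical toy table. The output names (`min_length`, `max_length`, `n_birds`) are illustrative choices.

```python
import pandas as pd

# Hypothetical toy table standing in for the penguins dataset
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream", "Dream"],
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "bill_length_mm": [38.0, 46.0, 39.0, 49.0],
})

# Each keyword is an output column: (input column, aggregate function)
result = toy.groupby(by="island").agg(
    min_length=("bill_length_mm", "min"),
    max_length=("bill_length_mm", "max"),
    n_birds=("bill_length_mm", "count"),
)
print(result)
```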

# Pivot, stack et al.

In [None]:
# Group by island, then by species, and keep the result.
# It is a Series with a two-level (island, species) index
toto = (palmer.groupby(by=["island", "species"])["bill_length_mm"].mean())

print("Average penguin bill length by island, then by species")
print(toto)

In [None]:
# Unstack: move the innermost index level (species) to the columns
print(toto.unstack())
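
The same wide table can also be built in one step with `.pivot_table()`. The following is a sketch on a hypothetical toy table. Combinations that do not occur in the data (for example, Gentoo on Dream in this toy table) appear as NaN.

```python
import pandas as pd

# Hypothetical toy table standing in for the penguins dataset
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream", "Dream"],
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "bill_length_mm": [38.0, 46.0, 39.0, 49.0],
})

# Route 1: group, then unstack the species level into columns
grouped = toy.groupby(by=["island", "species"])["bill_length_mm"].mean()
wide = grouped.unstack()

# Route 2: pivot directly from the long table
pivoted = toy.pivot_table(
    index="island",
    columns="species",
    values="bill_length_mm",
    aggfunc="mean",
)

print(wide)
print(pivoted)
```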