# Introduction to pandas

Welcome to pandas! This powerful Python library is used for:

1. Reading and cleaning experimental data  
2. Organizing results like an Excel sheet  
3. Filtering, sorting, and grouping data  
4. Making analysis easier and reproducible

> Think of pandas as Excel for Python — but faster and more flexible.



---



In this chapter, you’ll learn how to:

- Work with tables using **DataFrames**
- Load CSV data from an experiment
- Calculate group statistics
- Apply it to a real example: an ELISA protein concentration assay


---


## Quick Introduction to Useful pandas Syntax

Here are the most common pandas tools you'll use when working with lab data:

- **Create a DataFrame**  
  `pd.DataFrame(data)` → turns a dictionary into a table; use `pd.read_csv('file.csv')` to load a CSV file  

- **Look at the data**  
  `df.head()` → first 5 rows  
  `df.shape` → (rows, columns)  
  `df.describe()` → quick summary statistics  

- **Select columns or rows**  
  `df['Protein_conc']` → select one column  
  `df.loc[0]` → select row by index label  
  `df.iloc[0]` → select row by position  

- **Filter rows based on a condition**  
  `df[df['Protein_conc'] > 2.0]` → samples with high concentration  

- **Group by and calculate statistics**  
  `df.groupby('Group')['Protein_conc'].mean()` → mean per group  
  `df.groupby('Group')['Protein_conc'].agg(['mean', 'std'])` → multiple stats for one numeric column  

- **Sort and export**  
  `df.sort_values(by='Protein_conc')` → sort by column  
  `df.to_csv('results.csv')` → save as CSV  

These are the everyday tools in pandas. Once you are comfortable with them, you will be able to handle most tabular lab datasets.

## Importing pandas

In [1]:
# First, import the pandas library
import pandas as pd
print(pd.__version__)

2.2.3


## Creating DataFrames

In [2]:
# From dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
df

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
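A DataFrame can also be built row by row, which is often how lab records arrive. The sketch below assumes a list of dictionaries (one per row) and checks that it gives the same table as the column-wise dictionary above; the names `records`, `df_rows`, and `df_cols` are illustrative only.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names.
records = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
    {'Name': 'Charlie', 'Age': 35},
]
df_rows = pd.DataFrame(records)

# The column-wise dictionary used in the cell above
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df_cols = pd.DataFrame(data)

# Both constructions should describe the same table
print(df_rows.equals(df_cols))
```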


## Inspecting Data

In [3]:
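df.head()      # First few rows (not shown below: a cell displays only its last expression)
df.info()      # Prints column names, non-null counts, and dtypes
df.describe()  # Summary statistics for numeric columns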

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 180.0+ bytes


        Age
count   3.0
mean   30.0
std     5.0
min    25.0
25%    27.5
50%    30.0
75%    32.5
max    35.0


## Indexing and Selecting Data

In [4]:
df['Name']           # Single column (returns a Series)
df[['Name', 'Age']]  # Multiple columns (returns a DataFrame)
df.loc[0]            # Row by index label
df.iloc[0]           # Row by integer position (only this last result is displayed)

Name    Alice
Age        25
Name: 0, dtype: object
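With the default index, `loc[0]` and `iloc[0]` return the same row, which hides the difference between them. The following sketch uses a made-up index (`10, 20, 30`, chosen only so that labels and positions differ) to show that `loc` looks up labels while `iloc` counts positions.

```python
import pandas as pd

people = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=[10, 20, 30],  # hypothetical labels, deliberately different from positions
)

by_label = people.loc[10]     # the row whose index label is 10
by_position = people.iloc[0]  # the first row, whatever its label is
print(by_label['Name'], by_position['Name'])

# No row carries the label 0, so a label lookup for 0 fails
try:
    people.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False
print(label_zero_found)
```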

## Filtering and Boolean Indexing

In [5]:
df[df['Age'] > 28]  # People older than 28

      Name  Age
1      Bob   30
2  Charlie   35
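Conditions can be combined. A minimal sketch, assuming the same three-person table: use `&` for "and", `|` for "or", wrap each condition in parentheses, and use `isin` to match against a list of values.

```python
import pandas as pd

df_people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

# AND: older than 28 and younger than 35
between = df_people[(df_people['Age'] > 28) & (df_people['Age'] < 35)]

# OR: younger than 28, or named Charlie
either = df_people[(df_people['Age'] < 28) | (df_people['Name'] == 'Charlie')]

# isin: keep rows whose Name appears in the list
selected = df_people[df_people['Name'].isin(['Alice', 'Bob'])]

print(between['Name'].tolist())
print(either['Name'].tolist())
print(selected['Name'].tolist())
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.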


## Adding and Removing Columns

In [6]:
df['Country'] = ['UK', 'USA', 'Canada']  # Add a column in place
df.drop('Age', axis=1)  # Returns a copy without 'Age'; df itself is unchanged

      Name  Country
0    Alice       UK
1      Bob      USA
2  Charlie   Canada
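Note that `drop` returns a new DataFrame rather than modifying the original. The sketch below makes this explicit by storing the result under a new, illustrative name (`without_age`) and comparing the column lists.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
df_demo['Country'] = ['UK', 'USA', 'Canada']

# drop(columns='Age') is equivalent to drop('Age', axis=1)
without_age = df_demo.drop(columns='Age')

print(list(df_demo.columns))      # the original still contains Age
print(list(without_age.columns))  # the copy does not
```

To remove the column permanently, assign the result back, for example `df_demo = df_demo.drop(columns='Age')`.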


## Handling Missing Values

In [7]:
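df.loc[1, 'Country'] = None  # Insert a missing value
df.fillna('Unknown')         # Returns a copy with missing values filled; df is unchanged
df.dropna()                  # Returns a copy without rows that contain missing values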

      Name  Age  Country
0    Alice   25       UK
2  Charlie   35   Canada
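Before filling or dropping, it usually helps to count what is missing. A small sketch, assuming the same table with Bob's country missing; the names `df_missing`, `n_missing`, and `filled` are illustrative.

```python
import pandas as pd

df_missing = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Country': ['UK', None, 'Canada'],
})

# Number of missing values in each column
n_missing = df_missing.isna().sum()
print(n_missing)

# Fill only the Country column, and keep the result in a new variable
filled = df_missing.fillna({'Country': 'Unknown'})
print(filled['Country'].tolist())
```

Passing a dictionary to `fillna` restricts the fill value to the named columns, so a numeric column is not filled with text by accident.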


## Grouping and Aggregation

In [8]:
df.groupby('Country').size()  # Rows per country; Bob's missing Country is excluded by default

Country
Canada    1
UK        1
dtype: int64
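The output above lists only two countries because `groupby` skips rows whose grouping key is missing. As a sketch, passing `dropna=False` keeps those rows as their own group (this parameter is available in recent pandas versions, including the 2.2.3 used here).

```python
import pandas as pd

df_missing = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Country': ['UK', None, 'Canada'],
})

# Default behaviour: the row with a missing Country is left out
counts_default = df_missing.groupby('Country').size()

# dropna=False keeps the missing key as a separate group
counts_with_missing = df_missing.groupby('Country', dropna=False).size()

print(counts_default.sum(), counts_with_missing.sum())
```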

## Reading and Writing Files

In [9]:
# df.to_csv('mydata.csv', index=False)
# df = pd.read_csv('mydata.csv')
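The two commented lines can be tried without touching the disk. The sketch below writes to an in-memory text buffer (`io.StringIO`) instead of a file, and shows why `index=False` is worth using: without it, the row index is saved as an unnamed column, which reappears as `Unnamed: 0` when the file is read back.

```python
import io
import pandas as pd

df_small = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write without the index, then read the text back
buffer = io.StringIO()
df_small.to_csv(buffer, index=False)
buffer.seek(0)
clean = pd.read_csv(buffer)
print(list(clean.columns))

# Write with the index (the default), then read the text back
buffer_idx = io.StringIO()
df_small.to_csv(buffer_idx)
buffer_idx.seek(0)
with_index = pd.read_csv(buffer_idx)
print(list(with_index.columns))
```

With a real file, replace the buffer with a path such as `'mydata.csv'`.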

## Example: ELISA Plate Data (Protein Concentration)

You ran an ELISA experiment to measure protein concentration in 3 groups of samples:
- Control
- Treated_A
- Treated_B

Each sample has a measured absorbance and calculated protein concentration (in µg/mL).

Let’s build this sample dataset as a pandas DataFrame.

In [10]:
# Let's create a simple DataFrame manually for this example
data = {
    'Sample_ID': ['C1', 'C2', 'C3', 'A1', 'A2', 'A3', 'B1', 'B2', 'B3'],
    'Group': ['Control', 'Control', 'Control', 'Treated_A', 'Treated_A', 'Treated_A', 'Treated_B', 'Treated_B', 'Treated_B'],
    'Absorbance': [0.12, 0.13, 0.11, 0.25, 0.27, 0.26, 0.20, 0.22, 0.21],
    'Protein_conc': [1.2, 1.3, 1.1, 2.5, 2.7, 2.6, 2.0, 2.2, 2.1]
}

# Create a pandas DataFrame
df = pd.DataFrame(data)

# Show the first few rows of the table
df.head()

  Sample_ID      Group  Absorbance  Protein_conc
0        C1    Control        0.12           1.2
1        C2    Control        0.13           1.3
2        C3    Control        0.11           1.1
3        A1  Treated_A        0.25           2.5
4        A2  Treated_A        0.27           2.7
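In this toy dataset every `Protein_conc` value is ten times the absorbance. The sketch below shows how such a column could be computed, assuming a purely hypothetical linear standard curve (slope 10 µg/mL per absorbance unit, intercept 0). These constants are not from a real assay; in practice the curve is fitted to measured standards.

```python
import numpy as np
import pandas as pd

elisa_raw = pd.DataFrame({
    'Sample_ID': ['C1', 'C2', 'C3', 'A1', 'A2', 'A3', 'B1', 'B2', 'B3'],
    'Absorbance': [0.12, 0.13, 0.11, 0.25, 0.27, 0.26, 0.20, 0.22, 0.21],
})

# Hypothetical standard-curve parameters, chosen only to match the toy data
SLOPE = 10.0      # µg/mL per absorbance unit (assumption)
INTERCEPT = 0.0   # µg/mL (assumption)

# Column arithmetic is applied to every row at once
elisa_raw['Protein_conc'] = (SLOPE * elisa_raw['Absorbance'] + INTERCEPT).round(2)

expected = [1.2, 1.3, 1.1, 2.5, 2.7, 2.6, 2.0, 2.2, 2.1]
print(np.allclose(elisa_raw['Protein_conc'], expected))
```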


## What is a DataFrame?

A **DataFrame** is like a spreadsheet or lab table — each **row** is a sample, and each **column** is a measurement or label.

Let’s explore how to analyze this table.

In [11]:
# Show the shape (rows, columns)
print("Shape:", df.shape)

# Summary statistics for numeric columns
df.describe()

Shape: (9, 4)


       Absorbance  Protein_conc
count         9.0           9.0
mean     0.196667      1.966667
std      0.062048      0.620484
min          0.11           1.1
25%          0.13           1.3
50%          0.21           2.1
75%          0.25           2.5
max          0.27           2.7


## Filtering and Selection

You can select specific rows or columns, just like filtering in Excel.

Let’s:
- Select only the "Protein_conc" column
- Filter samples where protein concentration > 2.0

In [12]:
# Select just the protein concentration column
df['Protein_conc']

0    1.2
1    1.3
2    1.1
3    2.5
4    2.7
5    2.6
6    2.0
7    2.2
8    2.1
Name: Protein_conc, dtype: float64

In [13]:
# Filter rows where Protein_conc > 2.0 (strictly greater, so B1 at exactly 2.0 is excluded)
high_conc = df[df['Protein_conc'] > 2.0]
high_conc

  Sample_ID      Group  Absorbance  Protein_conc
3        A1  Treated_A        0.25           2.5
4        A2  Treated_A        0.27           2.7
5        A3  Treated_A        0.26           2.6
7        B2  Treated_B        0.22           2.2
8        B3  Treated_B        0.21           2.1
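Filtering and column selection can be done in one step with `.loc[rows, columns]`. A minimal sketch on the same ELISA table, which also shows how `>=` changes the result for the sample sitting exactly at 2.0; the variable names are illustrative.

```python
import pandas as pd

df_elisa = pd.DataFrame({
    'Sample_ID': ['C1', 'C2', 'C3', 'A1', 'A2', 'A3', 'B1', 'B2', 'B3'],
    'Group': ['Control'] * 3 + ['Treated_A'] * 3 + ['Treated_B'] * 3,
    'Absorbance': [0.12, 0.13, 0.11, 0.25, 0.27, 0.26, 0.20, 0.22, 0.21],
    'Protein_conc': [1.2, 1.3, 1.1, 2.5, 2.7, 2.6, 2.0, 2.2, 2.1],
})

# Rows: Treated_B samples above 2.0; columns: only the two we care about
mask = (df_elisa['Group'] == 'Treated_B') & (df_elisa['Protein_conc'] > 2.0)
treated_b_high = df_elisa.loc[mask, ['Sample_ID', 'Protein_conc']]
print(treated_b_high)

# Using >= keeps B1, whose concentration is exactly 2.0
at_least_2 = df_elisa[df_elisa['Protein_conc'] >= 2.0]
print(len(at_least_2))
```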


## Group Statistics

Let’s calculate the average protein concentration **for each treatment group**.

This is useful for comparing experimental conditions.

In [14]:
# Group by the "Group" column and calculate the mean
group_means = df.groupby('Group')['Protein_conc'].mean()
group_means

Group
Control      1.2
Treated_A    2.6
Treated_B    2.1
Name: Protein_conc, dtype: float64
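One common way to compare conditions is to express each group mean relative to the control. The sketch below divides the group means by the Control mean; treating `Control` as the baseline is an assumption made for this illustration.

```python
import pandas as pd

df_elisa = pd.DataFrame({
    'Group': ['Control'] * 3 + ['Treated_A'] * 3 + ['Treated_B'] * 3,
    'Protein_conc': [1.2, 1.3, 1.1, 2.5, 2.7, 2.6, 2.0, 2.2, 2.1],
})

group_means = df_elisa.groupby('Group')['Protein_conc'].mean()

# Divide every group mean by the Control mean (a single number)
fold_change = group_means / group_means['Control']
print(fold_change.round(2))
```

The result is a Series indexed by group, so `fold_change['Treated_A']` gives the value for one condition.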

In [15]:
# Sample standard deviation per group (pandas uses ddof=1 by default)
group_std = df.groupby('Group')['Protein_conc'].std()
group_std

Group
Control      0.1
Treated_A    0.1
Treated_B    0.1
Name: Protein_conc, dtype: float64
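The mean and standard deviation can be collected into a single summary table with `agg`. The sketch below also adds the sample count and a standard error of the mean, computed here as std divided by the square root of n; the column name `sem` is just a label chosen for this example.

```python
import pandas as pd

df_elisa = pd.DataFrame({
    'Group': ['Control'] * 3 + ['Treated_A'] * 3 + ['Treated_B'] * 3,
    'Protein_conc': [1.2, 1.3, 1.1, 2.5, 2.7, 2.6, 2.0, 2.2, 2.1],
})

# One row per group, one column per statistic
summary = df_elisa.groupby('Group')['Protein_conc'].agg(['mean', 'std', 'count'])

# Standard error of the mean: std / sqrt(n)
summary['sem'] = summary['std'] / summary['count'] ** 0.5
print(summary.round(3))
```

Selecting `['Protein_conc']` before `agg` keeps text columns such as `Sample_ID` out of the calculation.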

## Optional: Sorting and Exporting

You can sort your results, or save them to a CSV file to share with colleagues.

In [16]:
# Sort by Protein_conc, descending
df_sorted = df.sort_values(by='Protein_conc', ascending=False)
df_sorted

  Sample_ID      Group  Absorbance  Protein_conc
4        A2  Treated_A        0.27           2.7
5        A3  Treated_A        0.26           2.6
3        A1  Treated_A        0.25           2.5
7        B2  Treated_B        0.22           2.2
8        B3  Treated_B        0.21           2.1
6        B1  Treated_B        0.20           2.0
1        C2    Control        0.13           1.3
0        C1    Control        0.12           1.2
2        C3    Control        0.11           1.1
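Sorting keeps the original row labels, which is why the index above reads 4, 5, 3, and so on. A short sketch of two common follow-ups: sorting by more than one column, and renumbering the rows with `reset_index(drop=True)`.

```python
import pandas as pd

df_elisa = pd.DataFrame({
    'Sample_ID': ['C1', 'C2', 'C3', 'A1', 'A2', 'A3', 'B1', 'B2', 'B3'],
    'Group': ['Control'] * 3 + ['Treated_A'] * 3 + ['Treated_B'] * 3,
    'Protein_conc': [1.2, 1.3, 1.1, 2.5, 2.7, 2.6, 2.0, 2.2, 2.1],
})

# Group name A to Z, then highest concentration first within each group
by_group = df_elisa.sort_values(
    by=['Group', 'Protein_conc'],
    ascending=[True, False],
)

# Replace the shuffled labels with a fresh 0..n-1 index
by_group = by_group.reset_index(drop=True)
print(by_group)
```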


In [17]:
# Export to CSV (if needed)
df.to_csv("elisa_results.csv", index=False)

## Summary

In this chapter, you learned how to:

- Create and explore a pandas **DataFrame**
- Select and filter experimental data
- Group by condition (e.g., Control vs Treated)
- Calculate statistics like **mean** and **std**
- Sort and export your results

