### **Week 4: Basic Data Manipulation with Pandas**
**Objective**: Teach students how to filter, select, sort, group, and apply functions to biology- and chemistry-related datasets, preparing them for practical data analysis in the life sciences.

### **1. Filtering and Selecting Data in Pandas**
#### **Concept**: Filtering rows and selecting columns
- **Filtering** involves selecting rows that meet specific conditions.
- **Selecting** allows you to choose columns from the DataFrame to work with.

#### **Examples**:
1. **Filtering Rows by Condition**:
   - Using a dataset of patient blood tests with columns like `Patient_ID`, `Cholesterol`, `Blood_Pressure`, and `Blood_Sugar`:

In [1]:
import pandas as pd

data = {
    "Patient_ID": [101, 102, 103, 104, 105],
    "Cholesterol": [180, 220, 150, 190, 240],
    "Blood_Pressure": [120, 130, 110, 125, 140],
    "Blood_Sugar": [90, 110, 95, 100, 130]
}
df = pd.DataFrame(data)

# Filtering for patients with cholesterol above 200
high_cholesterol = df[df["Cholesterol"] > 200]
print(high_cholesterol)

   Patient_ID  Cholesterol  Blood_Pressure  Blood_Sugar
1         102          220             130          110
4         105          240             140          130


2. **Selecting Specific Columns**:
   - Selecting only the `Patient_ID` and `Blood_Pressure` columns:

In [2]:
selected_columns = df[["Patient_ID", "Blood_Pressure"]]
print(selected_columns)

   Patient_ID  Blood_Pressure
0         101             120
1         102             130
2         103             110
3         104             125
4         105             140
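
Filtering and selecting are often combined in one step. The sketch below rebuilds the same `df` and assumes an illustrative rule that is not part of the lesson (cholesterol above 200 and blood pressure of at least 130). The names `mask` and `at_risk` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Patient_ID": [101, 102, 103, 104, 105],
    "Cholesterol": [180, 220, 150, 190, 240],
    "Blood_Pressure": [120, 130, 110, 125, 140],
    "Blood_Sugar": [90, 110, 95, 100, 130],
})

# Combine two conditions with & (and); each condition needs its own parentheses.
# The thresholds are illustrative assumptions, not clinical cut-offs.
mask = (df["Cholesterol"] > 200) & (df["Blood_Pressure"] >= 130)

# .loc takes a row selector and a column selector in a single call.
at_risk = df.loc[mask, ["Patient_ID", "Cholesterol"]]
print(at_risk)
```

Use `|` instead of `&` to keep rows that meet either condition.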


#### **Hands-On Exercise**:
- Filter the data to show only rows where Blood Sugar is above 100.
- Select only the Patient_ID and Cholesterol columns from the DataFrame.

### **2. Sorting Data in Pandas**
#### **Concept**: Sorting by column values
- Sorting can be done in **ascending** or **descending** order to organize the data based on specific column values.

#### **Examples**:
1. **Sorting by Cholesterol in Ascending Order**:

In [3]:
sorted_cholesterol = df.sort_values(by="Cholesterol")
print(sorted_cholesterol)

   Patient_ID  Cholesterol  Blood_Pressure  Blood_Sugar
2         103          150             110           95
0         101          180             120           90
3         104          190             125          100
1         102          220             130          110
4         105          240             140          130


2. **Sorting by Blood Pressure in Descending Order**:

In [4]:
sorted_bp = df.sort_values(by="Blood_Pressure", ascending=False)
print(sorted_bp)

   Patient_ID  Cholesterol  Blood_Pressure  Blood_Sugar
4         105          240             140          130
1         102          220             130          110
3         104          190             125          100
0         101          180             120           90
2         103          150             110           95
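
Note that the sorted outputs above keep the original index labels, and `sort_values` leaves `df` itself unchanged. A minimal sketch of taking the two highest blood pressure readings and renumbering the rows follows. The name `top_two` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Patient_ID": [101, 102, 103, 104, 105],
    "Cholesterol": [180, 220, 150, 190, 240],
    "Blood_Pressure": [120, 130, 110, 125, 140],
    "Blood_Sugar": [90, 110, 95, 100, 130],
})

# sort_values returns a new DataFrame; df keeps its original row order.
sorted_bp = df.sort_values(by="Blood_Pressure", ascending=False)

# Keep the two highest readings and renumber the rows starting from 0.
top_two = sorted_bp.head(2).reset_index(drop=True)
print(top_two)
```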


#### **Hands-On Exercise**:
- Sort the DataFrame by Blood Pressure in descending order.
- Sort by Patient_ID in ascending order.

### **3. Grouping and Aggregating Data**
#### **Concept**: Grouping data to analyze specific subsets
- **Grouping** splits the rows into categories and computes summary statistics for each one, such as the average blood sugar within each blood pressure category.

#### **Examples**:
1. **Grouping by Blood Pressure Category and Summing Cholesterol**:
   - First, categorize patients by blood pressure (`Low` below 120, `Normal` from 120 to 130, `High` above 130), then sum cholesterol within each category:

In [18]:
def bp_category(bp):
    if bp < 120:
        return "Low"
    elif bp <= 130:
        return "Normal"
    else:
        return "High"

# Label each patient, then total the cholesterol values within each label
df["BP_Category"] = df["Blood_Pressure"].apply(bp_category)
total_cholesterol_by_bp = df.groupby("BP_Category")["Cholesterol"].sum()
print(total_cholesterol_by_bp)

BP_Category
High      240
Low       150
Normal    590
Name: Cholesterol, dtype: int64


2. **Grouping by Blood Pressure Category and Calculating Average Blood Sugar**:

In [19]:
avg_blood_sugar_by_bp = df.groupby("BP_Category")["Blood_Sugar"].mean()
print(avg_blood_sugar_by_bp)

BP_Category
High      130.0
Low        95.0
Normal    100.0
Name: Blood_Sugar, dtype: float64


3. **Multiple Aggregations**:
   - Applying multiple aggregations (e.g., sum and average) on grouped data:

In [20]:
summary = df.groupby("BP_Category").agg({"Cholesterol": "sum", "Blood_Sugar": "mean"})
print(summary)

             Cholesterol  Blood_Sugar
BP_Category                          
High                 240        130.0
Low                  150         95.0
Normal               590        100.0
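
The dictionary form above reuses the original column names in the result. As an alternative sketch, pandas also supports named aggregation, where each output column gets its own name. The output names `total_cholesterol`, `mean_blood_sugar`, and `n_patients` are hypothetical choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Patient_ID": [101, 102, 103, 104, 105],
    "Cholesterol": [180, 220, 150, 190, 240],
    "Blood_Pressure": [120, 130, 110, 125, 140],
    "Blood_Sugar": [90, 110, 95, 100, 130],
})

def bp_category(bp):
    # Same thresholds as the lesson's bp_category function.
    if bp < 120:
        return "Low"
    elif bp <= 130:
        return "Normal"
    else:
        return "High"

df["BP_Category"] = df["Blood_Pressure"].apply(bp_category)

# Named aggregation: new_column=(source_column, aggregation).
summary = df.groupby("BP_Category").agg(
    total_cholesterol=("Cholesterol", "sum"),
    mean_blood_sugar=("Blood_Sugar", "mean"),
    n_patients=("Patient_ID", "count"),
)
print(summary)
```

Including a count next to each mean makes it clear how many patients each summary value is based on.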


#### **Hands-On Exercise**:
- Group the data by BP_Category and calculate the average Blood Pressure for each category.
- Group by BP_Category and find the average Cholesterol level for each group.
- Use multiple aggregations to find the sum of Cholesterol and mean of Blood Sugar by BP_Category.

### **4. Applying Functions to Datasets**
#### **Concept**: Using custom and built-in functions on columns
- **Apply functions** to transform or calculate new values for columns in a DataFrame.

#### **Examples**:
1. **Using `apply` to Calculate BMI based on Height and Weight**:
   - Add `Height` and `Weight` columns, then compute BMI (weight in kg divided by height in meters squared) with a lambda function applied to each row (`axis=1`).

In [21]:
df["Height"] = [1.75, 1.60, 1.82, 1.70, 1.65]  # Height in meters
df["Weight"] = [70, 65, 85, 75, 68]  # Weight in kg

# axis=1 passes each row to the lambda: BMI = weight (kg) / height (m) squared
df["BMI"] = df.apply(lambda row: row["Weight"] / (row["Height"] ** 2), axis=1)
print(df)

   Patient_ID  Cholesterol  Blood_Pressure  Blood_Sugar BP_Category  Height  \
0         101          180             120           90      Normal    1.75   
1         102          220             130          110      Normal    1.60   
2         103          150             110           95         Low    1.82   
3         104          190             125          100      Normal    1.70   
4         105          240             140          130        High    1.65   

   Weight        BMI  
0      70  22.857143  
1      65  25.390625  
2      85  25.661152  
3      75  25.951557  
4      68  24.977043  
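
The row-wise `apply` above is useful for learning how `apply` works. For simple arithmetic like BMI, the same result can usually be obtained with column operations, which tend to be shorter and faster on large datasets. A minimal sketch using the same height and weight values (the name `row_wise` is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Patient_ID": [101, 102, 103, 104, 105],
    "Height": [1.75, 1.60, 1.82, 1.70, 1.65],  # Height in meters
    "Weight": [70, 65, 85, 75, 68],            # Weight in kg
})

# Column arithmetic works element by element, so no lambda is needed.
df["BMI"] = df["Weight"] / df["Height"] ** 2

# Row-wise version from the lesson, kept only to compare the two results.
row_wise = df.apply(lambda row: row["Weight"] / (row["Height"] ** 2), axis=1)
print(df[["Patient_ID", "BMI"]])
```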


2. **Applying a Custom Function to Categorize Cholesterol Levels**:
   - Create a function to label cholesterol as “High” if greater than 200, else “Normal”:

In [22]:
def categorize_cholesterol(chol):
    return "High" if chol > 200 else "Normal"

df["Cholesterol_Level"] = df["Cholesterol"].apply(categorize_cholesterol)
print(df)

   Patient_ID  Cholesterol  Blood_Pressure  Blood_Sugar BP_Category  Height  \
0         101          180             120           90      Normal    1.75   
1         102          220             130          110      Normal    1.60   
2         103          150             110           95         Low    1.82   
3         104          190             125          100      Normal    1.70   
4         105          240             140          130        High    1.65   

   Weight        BMI Cholesterol_Level  
0      70  22.857143            Normal  
1      65  25.390625              High  
2      85  25.661152            Normal  
3      75  25.951557            Normal  
4      68  24.977043              High  
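
For a two-way label like this one, `numpy.where` is a common alternative to a custom function. The sketch below assumes NumPy is available (it is installed alongside pandas) and uses the same 200 threshold as `categorize_cholesterol`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Patient_ID": [101, 102, 103, 104, 105],
    "Cholesterol": [180, 220, 150, 190, 240],
})

# np.where(condition, value_if_true, value_if_false) labels the whole column at once.
df["Cholesterol_Level"] = np.where(df["Cholesterol"] > 200, "High", "Normal")
print(df)
```

A custom function with `apply` remains the more flexible choice when the labeling logic has several branches.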


#### **Hands-On Exercise**:
- Calculate a new column `Adjusted_Blood_Sugar` by adding 10% to each Blood Sugar value.
- Create a custom function that labels BMI as “Overweight” if above 25 and “Healthy” otherwise, and use it to create a new column.

### **Recap of Week 4**:
- **Key Concepts**: Filtering rows, selecting columns, sorting, grouping with aggregation, and applying functions to DataFrame columns.
- **Practice**: By the end of the week, students should feel comfortable manipulating a DataFrame to explore a dataset, calculate summary statistics such as averages, and build custom analyses.