
# Data 101 — Module 5, Session 2
## Pandas Basics (Demo Notebook)

This notebook follows the Session 2 slide outline:
- Importing Pandas
- Series and DataFrames
- Loading a CSV
- Selecting columns and rows (`loc` vs `iloc`)
- Column slicing with `loc`
- Boolean filtering
- Modifying data
- Summary statistics and grouping


## Import Pandas

In [1]:

import pandas as pd
import numpy as np

print("pandas:", pd.__version__)
print("numpy:", np.__version__)


pandas: 2.3.1
numpy: 2.3.1


## Series

In [2]:
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
display(s)

a    10
b    20
c    30
dtype: int64

In [3]:
print("Access by label:", s["b"])

Access by label: 20


In [4]:
print("Access by position:", s.iloc[1])  # use .iloc for positions; s[1] on a labelled Series is deprecated

Access by position: 20
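The two lookups above can be written with the explicit accessors, which keeps label and position access unambiguous. This is a minimal sketch; the second Series `t` is a hypothetical example (not part of the session data) chosen so that label `1` and position `1` point at different entries.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
label_value = s.loc["b"]      # label-based: the entry labelled "b"
position_value = s.iloc[1]    # position-based: the second entry (0-based)

# Hypothetical Series with integer labels, where label 1 and position 1 differ.
t = pd.Series([10, 20, 30], index=[1, 2, 3])
t_label = t.loc[1]       # label 1 is the first entry
t_position = t.iloc[1]   # position 1 is the second entry

print(label_value, position_value, t_label, t_position)
```

With integer labels, plain `t[1]` is easy to misread, which is one reason to prefer `.loc` and `.iloc`.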




## DataFrame

In [5]:

data = {"Name": ["Alex", "Jamie"], "GPA": [3.4, 3.8]}
df_small = pd.DataFrame(data)
display(df_small)


    Name  GPA
0   Alex  3.4
1  Jamie  3.8
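A quick way to see how a DataFrame relates to a Series is to inspect its shape, its column dtypes, and the type of a single column. The sketch below rebuilds `df_small` from the cell above so it runs on its own.

```python
import pandas as pd

data = {"Name": ["Alex", "Jamie"], "GPA": [3.4, 3.8]}
df_small = pd.DataFrame(data)

n_rows, n_cols = df_small.shape     # (rows, columns)
gpa_column = df_small["GPA"]        # a single column is a Series

print("shape:", df_small.shape)
print(df_small.dtypes)              # one dtype per column
print(type(gpa_column).__name__)
```

Each dictionary key becomes a column name, and each list becomes that column's values.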


## Load Data from CSV

In [6]:
# A sample CSV is saved at: ./data/students.csv
df = pd.read_csv("./data/students.csv")
display(df.head())

     Name  Gender    Major  GPA  Hours_Studied
0    Alex       F       CS  3.4             10
1   Jamie       M     Math  3.8             15
2  Jordan       F       CS  3.1              8
3  Taylor       M  Physics  2.9             12
4   Riley       F     Math  3.6              9


In [10]:
df

     Name  Gender    Major  GPA  Hours_Studied
0    Alex       F       CS  3.4             10
1   Jamie       M     Math  3.8             15
2  Jordan       F       CS  3.1              8
3  Taylor       M  Physics  2.9             12
4   Riley       F     Math  3.6              9
5   Casey       M       CS  2.5              5
6  Morgan       F  Physics  3.2             11
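If `./data/students.csv` is not available, the same table can be loaded from an in-memory string. This sketch assumes the file contains exactly the seven rows displayed above (the file itself is not shown in the notebook); `csv_text` is a name introduced here for illustration.

```python
import io

import pandas as pd

# Assumed file contents, reconstructed from the displayed DataFrame.
csv_text = """Name,Gender,Major,GPA,Hours_Studied
Alex,F,CS,3.4,10
Jamie,M,Math,3.8,15
Jordan,F,CS,3.1,8
Taylor,M,Physics,2.9,12
Riley,F,Math,3.6,9
Casey,M,CS,2.5,5
Morgan,F,Physics,3.2,11
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```

`pd.read_csv` infers column types from the text, so `GPA` is read as a float column and `Hours_Studied` as an integer column.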


In [7]:
print("\nInfo:")
df.info()  # info() prints its report and returns None, so it needs no display() wrapper


Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Name           7 non-null      object 
 1   Gender         7 non-null      object 
 2   Major          7 non-null      object 
 3   GPA            7 non-null      float64
 4   Hours_Studied  7 non-null      int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 412.0+ bytes

In [8]:
print("\nDescribe:")
display(df.describe())


Describe:


            GPA  Hours_Studied
count  7.000000       7.000000
mean   3.214286      10.000000
std    0.437526       3.162278
min    2.500000       5.000000
25%    3.000000       8.500000
50%    3.200000      10.000000
75%    3.500000      11.500000
max    3.800000      15.000000


In [9]:
print("\nDescribe:")
display(df.describe(include="all"))


Describe:


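        Name  Gender  Major       GPA  Hours_Studied
count      7       7      7  7.000000       7.000000
unique     7       2      3       NaN            NaN
top     Alex       F     CS       NaN            NaN
freq       1       4      3       NaN            NaN
mean     NaN     NaN    NaN  3.214286      10.000000
std      NaN     NaN    NaN  0.437526       3.162278
min      NaN     NaN    NaN  2.500000       5.000000
25%      NaN     NaN    NaN  3.000000       8.500000
50%      NaN     NaN    NaN  3.200000      10.000000
75%      NaN     NaN    NaN  3.500000      11.500000
max      NaN     NaN    NaN  3.800000      15.000000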


## Selecting Columns

In [11]:
gpa_series = df["GPA"]
display(gpa_series.head())

0    3.4
1    3.8
2    3.1
3    2.9
4    3.6
Name: GPA, dtype: float64

In [12]:
name_gpa_df = df[["Name", "GPA"]]
display(name_gpa_df.head())

     Name  GPA
0    Alex  3.4
1   Jamie  3.8
2  Jordan  3.1
3  Taylor  2.9
4   Riley  3.6
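The difference between the two cells above is the return type: single brackets with one column name give a Series, while a list of names gives a DataFrame, even when the list holds only one name. A minimal sketch on a three-row subset of the session data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Jamie", "Jordan"],
    "GPA": [3.4, 3.8, 3.1],
})

as_series = df["GPA"]      # one column name -> Series (1-D)
as_frame = df[["GPA"]]     # list of column names -> DataFrame (2-D)

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

This matters when a later step expects a 2-D table, for example when passing columns to another function.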


## Selecting Rows with `.loc` (label-based)

In [14]:

# Make sure the row labels are the default 0..n-1 integers, so label 0 is the first row
df = df.reset_index(drop=True)

row_label_0 = df.loc[0]
display(row_label_0)


Name             Alex
Gender              F
Major              CS
GPA               3.4
Hours_Studied      10
Name: 0, dtype: object

In [17]:
rows_0_to_2 = df.loc[0:2]
display(rows_0_to_2)

     Name  Gender  Major  GPA  Hours_Studied
0    Alex       F     CS  3.4             10
1   Jamie       M   Math  3.8             15
2  Jordan       F     CS  3.1              8


In [18]:
gpa_all_rows = df.loc[:, "GPA"]
display(gpa_all_rows.head())

0    3.4
1    3.8
2    3.1
3    2.9
4    3.6
Name: GPA, dtype: float64

## Selecting Rows with `.iloc` (position-based)

In [19]:

first_row = df.iloc[0]
first_three_rows = df.iloc[0:3]
second_column_all_rows = df.iloc[:, 1]

display(first_row)
display(first_three_rows)
display(second_column_all_rows.head())


Name             Alex
Gender              F
Major              CS
GPA               3.4
Hours_Studied      10
Name: 0, dtype: object

     Name  Gender  Major  GPA  Hours_Studied
0    Alex       F     CS  3.4             10
1   Jamie       M   Math  3.8             15
2  Jordan       F     CS  3.1              8


0    F
1    M
2    F
3    M
4    F
Name: Gender, dtype: object

## `loc` vs `iloc` example

In [20]:

example_loc = df.loc[2, "GPA"]   # row label 2, column "GPA"
example_iloc = df.iloc[2, 3]     # third row, fourth column (0-based indexing)
print("df.loc[2, 'GPA']  ->", example_loc)
print("df.iloc[2, 3]     ->", example_iloc)


df.loc[2, 'GPA']  -> 3.1
df.iloc[2, 3]     -> 3.1
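With the default 0..n-1 index, `.loc` and `.iloc` often return the same thing, which hides two differences: `.loc` slices include the end label while `.iloc` slices exclude the end position, and the two diverge once row labels no longer match row positions. The sketch below uses a four-row subset of the session data; `sorted_df` is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Jamie", "Jordan", "Taylor"],
    "GPA": [3.4, 3.8, 3.1, 2.9],
})

loc_rows = df.loc[0:2]     # labels 0, 1, 2 (end label included)
iloc_rows = df.iloc[0:2]   # positions 0, 1 (end position excluded)

# After sorting, the labels travel with their rows, so label 0 is no longer first.
sorted_df = df.sort_values("GPA")
by_label = sorted_df.loc[0, "Name"]     # row whose label is 0
by_position = sorted_df.iloc[0, 0]      # first row, first column

print(len(loc_rows), len(iloc_rows), by_label, by_position)
```

Here `by_label` is still Alex, while `by_position` is Taylor, the student with the lowest GPA in this subset.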


## Column Slicing with `loc` and a common pitfall

In [22]:

one_col = df.loc[:, "GPA"]
multi_cols = df.loc[:, ["Name", "GPA"]]
range_cols = df.loc[:, "Name":"GPA"]

display(one_col.head())
display(multi_cols.head())
display(range_cols.head())


0    3.4
1    3.8
2    3.1
3    2.9
4    3.6
Name: GPA, dtype: float64

     Name  GPA
0    Alex  3.4
1   Jamie  3.8
2  Jordan  3.1
3  Taylor  2.9
4   Riley  3.6


     Name  Gender    Major  GPA
0    Alex       F       CS  3.4
1   Jamie       M     Math  3.8
2  Jordan       F       CS  3.1
3  Taylor       M  Physics  2.9
4   Riley       F     Math  3.6


In [24]:
bad = df[:, "GPA"]

InvalidIndexError: (slice(None, None, None), 'GPA')

In [25]:
try:
    bad = df[:, "GPA"]
except Exception as e:
    print("Using df[:, 'GPA'] raises ->", repr(e))

Using df[:, 'GPA'] raises -> InvalidIndexError((slice(None, None, None), 'GPA'))
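The error arises because plain `df[...]` expects a column key (or a row slice or boolean mask), not a `rows, columns` pair. The sketch below shows the failing call next to three working equivalents. It catches a broad `Exception` on purpose, since the exact exception class may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Jamie", "Jordan"],
    "GPA": [3.4, 3.8, 3.1],
})

try:
    df[:, "GPA"]            # a (rows, columns) pair is not valid for plain []
    raised = False
except Exception as e:      # broad on purpose; the class may vary by version
    raised = True
    print("raised:", type(e).__name__)

by_label = df.loc[:, "GPA"]     # all rows, column labelled "GPA"
by_position = df.iloc[:, 1]     # all rows, second column
shorthand = df["GPA"]           # column selection only

print(by_label.equals(by_position), by_label.equals(shorthand))
```

All three working forms return the same Series, so the choice is mostly about whether rows also need to be restricted.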


## Boolean Filtering

In [26]:

high_gpa = df[df["GPA"] > 3.5]
display(high_gpa)

    Name  Gender  Major  GPA  Hours_Studied
1  Jamie       M   Math  3.8             15
4  Riley       F   Math  3.6              9


In [27]:
cs_or_math = df[(df["Major"] == "CS") | (df["Major"] == "Math")]
display(cs_or_math)

     Name  Gender  Major  GPA  Hours_Studied
0    Alex       F     CS  3.4             10
1   Jamie       M   Math  3.8             15
2  Jordan       F     CS  3.1              8
4   Riley       F   Math  3.6              9
5   Casey       M     CS  2.5              5
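Conditions combine with `|` (or), `&` (and), and `~` (not), and each comparison needs its own parentheses because these operators bind more tightly than `==` or `>`. For membership tests, `isin` is a shorter alternative to chaining `|`. A minimal sketch using three columns of the session data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Jamie", "Jordan", "Taylor", "Riley", "Casey", "Morgan"],
    "Major": ["CS", "Math", "CS", "Physics", "Math", "CS", "Physics"],
    "GPA": [3.4, 3.8, 3.1, 2.9, 3.6, 2.5, 3.2],
})

# Same rows as the (== "CS") | (== "Math") filter, written with isin.
cs_or_math = df[df["Major"].isin(["CS", "Math"])]

# AND: both conditions must hold for a row to be kept.
strong_cs = df[(df["Major"] == "CS") & (df["GPA"] > 3.0)]

# NOT: invert a boolean mask with ~.
not_cs = df[~(df["Major"] == "CS")]

print(cs_or_math["Name"].tolist())
print(strong_cs["Name"].tolist())
print(not_cs["Name"].tolist())
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the symbol operators are used here.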


## Modifying Data

In [29]:

df["Passed"] = df["GPA"] >= 2.0
df["GPA_rounded"] = df["GPA"].round(0)
display(df.head())

     Name  Gender    Major  GPA  Hours_Studied  Passed  GPA_rounded
0    Alex       F       CS  3.4             10    True          3.0
1   Jamie       M     Math  3.8             15    True          4.0
2  Jordan       F       CS  3.1              8    True          3.0
3  Taylor       M  Physics  2.9             12    True          3.0
4   Riley       F     Math  3.6              9    True          4.0
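One detail of `round` is worth knowing: values exactly halfway between two results are rounded to the nearest even number, so a GPA of 2.5 (Casey's, in the full table) rounds to 2.0 rather than 3.0. The sketch below uses a small hypothetical frame to show this, along with rounding to one decimal place.

```python
import pandas as pd

# Hypothetical mini-frame; 2.5 and 3.5 sit exactly halfway between integers.
df = pd.DataFrame({
    "Name": ["Casey", "Jamie", "Pat"],
    "GPA": [2.5, 3.8, 3.5],
})

df["GPA_rounded"] = df["GPA"].round(0)   # halves go to the nearest even number
df["GPA_1dp"] = df["GPA"].round(1)       # keep one decimal place
df["Passed"] = df["GPA"] >= 2.0          # element-wise comparison -> boolean column

print(df)
```

If conventional "round half up" behaviour is needed, it has to be implemented explicitly; `round` alone will not do it.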


## Summary Statistics

In [30]:

print("mean:", df["GPA"].mean())
print("median:", df["GPA"].median())
print("mode:", df["GPA"].mode().tolist())
print("min:", df["GPA"].min())
print("max:", df["GPA"].max())
print("std:", df["GPA"].std())

display(df.describe())


mean: 3.2142857142857144
median: 3.2
mode: [2.5, 2.9, 3.1, 3.2, 3.4, 3.6, 3.8]
min: 2.5
max: 3.8
std: 0.4375255094603872


            GPA  Hours_Studied  GPA_rounded
count  7.000000       7.000000     7.000000
mean   3.214286      10.000000     3.142857
std    0.437526       3.162278     0.690066
min    2.500000       5.000000     2.000000
25%    3.000000       8.500000     3.000000
50%    3.200000      10.000000     3.000000
75%    3.500000      11.500000     3.500000
max    3.800000      15.000000     4.000000
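The individual calls above can be collected into a single `agg` call, which returns one labelled Series. The sketch also shows why `mode()` listed every GPA: each value occurs exactly once, so all seven tie for most frequent. It uses the seven GPA values from the session data.

```python
import pandas as pd

gpa = pd.Series([3.4, 3.8, 3.1, 2.9, 3.6, 2.5, 3.2], name="GPA")

# Several statistics at once; the result is indexed by statistic name.
stats = gpa.agg(["mean", "median", "min", "max", "std"])
print(stats.round(3))

modes = gpa.mode()            # every value that ties for the highest count
all_unique = gpa.is_unique    # True when no value repeats

print(len(modes), all_unique)
```

Note that `std` here is the sample standard deviation (dividing by n - 1), which is the pandas default.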


## Grouping

In [31]:

avg_gpa_by_major = df.groupby("Major")["GPA"].mean().reset_index(name="Avg_GPA")
display(avg_gpa_by_major)

     Major  Avg_GPA
0       CS     3.00
1     Math     3.70
2  Physics     3.05
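When more than one summary per group is useful, named aggregation computes several in one pass. This is a sketch using the `Major` and `GPA` columns of the session data; the output column names `Max_GPA` and `Students` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Major": ["CS", "Math", "CS", "Physics", "Math", "CS", "Physics"],
    "GPA": [3.4, 3.8, 3.1, 2.9, 3.6, 2.5, 3.2],
})

# Each keyword is an output column: name=(input column, aggregation).
summary = (
    df.groupby("Major")
      .agg(
          Avg_GPA=("GPA", "mean"),
          Max_GPA=("GPA", "max"),
          Students=("GPA", "count"),
      )
      .reset_index()
)

print(summary)
```

Groups are sorted by key by default, so the rows appear in the order CS, Math, Physics.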



## Summary
- `pd.Series` and `pd.DataFrame` are core pandas objects.
- Use `.loc` for label-based and `.iloc` for position-based selection.
- Use `[rows, columns]` indexing only with `.loc` or `.iloc`; plain `df[...]` does not accept a row/column pair.
- Filter rows with boolean conditions.
- Create or transform columns for analysis.
- Compute descriptive statistics and group summaries.
