# Lab 02: Pandas Fundamentals

**ING3513 - Introduction to Artificial Intelligence and Machine Learning**

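Pandas builds on NumPy to provide labeled, tabular data structures. Think of it as a programmable spreadsheet: the same rows and columns you know from Excel, but scriptable, reproducible, and able to handle much larger datasets.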

**What you'll learn:**

- Creating and exploring DataFrames
- Selecting and filtering data
- Modifying and transforming data
- Aggregations and grouping
- Handling missing data


In [None]:
import pandas as pd
import numpy as np

print(f"Pandas version: {pd.__version__}")

## 1. Creating DataFrames

A DataFrame is a 2D labeled data structure with columns that can have different types.


In [None]:
# From a dictionary
data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "Age": [25, 30, 35, 28, 32],
    "City": ["Oslo", "Bergen", "Trondheim", "Oslo", "Bergen"],
    "Salary": [50000, 60000, 75000, 55000, 65000],
}

df = pd.DataFrame(data)
print("DataFrame from dictionary:")
df
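
A dictionary of columns is only one possible starting point. As a minimal sketch, the cell below builds DataFrames from a list of row dictionaries and from a 2D NumPy array. The variable names and the column labels `x` and `y` are made up for illustration.

```python
import numpy as np
import pandas as pd

# One dictionary per row; the keys become column names
records = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
]
df_records = pd.DataFrame(records)

# A 2D NumPy array plus explicit column labels (the labels are arbitrary)
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_array = pd.DataFrame(arr, columns=["x", "y"])

print(df_records)
print(df_array.shape)
```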

### The NumPy Connection

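Under the hood, pandas stores most column data in NumPy arrays by default. You can get the whole table as a single array with `.values` (newer code usually prefers the equivalent `.to_numpy()`). Because this DataFrame mixes strings and integers, the resulting array has the generic `object` dtype: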


In [None]:
# The underlying NumPy array
print("Underlying values (NumPy array):")
print(df.values)
print(f"\nType: {type(df.values)}")
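
The dtype of the extracted array depends on which columns you include. The sketch below, using a shortened copy of the lab's data, compares the full table with a numeric-only selection.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [25, 30], "Salary": [50000, 60000]}
)

# Strings and integers together: NumPy falls back to a generic object array
mixed = df.to_numpy()

# Integers only: the result is a regular integer array
numeric = df[["Age", "Salary"]].to_numpy()

print(mixed.dtype)
print(numeric.dtype)
```

Selecting only the numeric columns first is usually what you want before passing data to NumPy-based code.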

## 2. Exploring Data

Get a quick overview of your data with these essential methods.


In [None]:
# Basic information
print("Shape (rows, columns):", df.shape)
print("\nColumn names:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)

# Comprehensive info (memory usage, non-null counts)
print("\n" + "=" * 40)
df.info()

In [None]:
# Preview data
print("First 3 rows:")
print(df.head(3))

print("\nLast 2 rows:")
print(df.tail(2))

In [None]:
# Statistical summary (numeric columns only)
df.describe()
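
`describe()` skips text columns by default. For a categorical column such as `City`, one option is to count how often each value occurs. This is a small sketch on a reduced copy of the data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
        "City": ["Oslo", "Bergen", "Trondheim", "Oslo", "Bergen"],
    }
)

# Number of rows per distinct city
city_counts = df["City"].value_counts()
print(city_counts)

# Number of distinct cities
n_cities = df["City"].nunique()
print("Distinct cities:", n_cities)
```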

## 3. Selecting Data

Pandas provides multiple ways to select data: by column name, by position, or by condition.


In [None]:
# Select a single column (returns a Series)
print("Names (Series):")
print(df["Name"])
print(f"\nType: {type(df['Name'])}")

In [None]:
# Select multiple columns (returns a DataFrame)
print("Name and Salary:")
df[["Name", "Salary"]]

In [None]:
# Select a row by its index label using .loc
print("Row with index label 2:")
print(df.loc[2])

# Select by position using .iloc
print("\nFirst 2 rows, columns 0-2:")
df.iloc[:2, :3]
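
With the default integer index, labels and positions happen to be the same numbers, which hides the difference between `.loc` and `.iloc`. The sketch below uses the names as row labels (an assumption made only for this example) so the two can be told apart.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "Diana"], "Age": [25, 30, 35, 28]}
)

# Use names as row labels so labels and positions no longer coincide
by_name = df.set_index("Name")

print(by_name.loc["Charlie", "Age"])  # lookup by label
print(by_name.iloc[2, 0])             # lookup of the same cell by position

# Label slices include the end label; position slices exclude the end position
label_slice = by_name.loc["Alice":"Charlie"]
position_slice = by_name.iloc[0:2]
print(len(label_slice), len(position_slice))
```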

## 4. Filtering Data

Boolean filtering works just like in NumPy, but with labeled columns.


In [None]:
# Filter by condition
print("People older than 28:")
df[df["Age"] > 28]

In [None]:
# Multiple conditions: use & for AND, | for OR, and wrap each condition in parentheses
print("People in Oslo with salary > 52000:")
df[(df["City"] == "Oslo") & (df["Salary"] > 52000)]

In [None]:
# Filter using .isin() for multiple values
print("People in Oslo or Trondheim:")
df[df["City"].isin(["Oslo", "Trondheim"])]
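
Two more filtering tools are worth knowing: `~` inverts a boolean mask, and `query()` lets you write a condition as a string. The sketch below applies both, along with `|`, to a copy of the lab's data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
        "City": ["Oslo", "Bergen", "Trondheim", "Oslo", "Bergen"],
        "Salary": [50000, 60000, 75000, 55000, 65000],
    }
)

# ~ inverts a boolean mask: everyone NOT in Oslo
not_oslo = df[~(df["City"] == "Oslo")]

# | is OR: salary below 52000 or above 70000
extremes = df[(df["Salary"] < 52000) | (df["Salary"] > 70000)]

# query() expresses the same kind of condition as a string
oslo_high = df.query("City == 'Oslo' and Salary > 52000")

print(not_oslo["Name"].tolist())
print(extremes["Name"].tolist())
print(oslo_high["Name"].tolist())
```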

## 5. Modifying Data

Add, rename, update, or remove columns and rows. Methods such as `rename` return a new DataFrame and leave the original unchanged.


In [None]:
# Make a copy to avoid modifying the original
df_modified = df.copy()

# Add a new column
df_modified["Bonus"] = df_modified["Salary"] * 0.1
print("Added Bonus column:")
df_modified

In [None]:
# Rename columns
df_renamed = df.rename(columns={"Salary": "Annual_Salary", "City": "Location"})
print("Renamed columns:")
df_renamed.columns.tolist()
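
Updating values and removing columns or rows follow the same pattern. This is a minimal sketch on a three-row copy of the data. The new salary of 62000 is an arbitrary example value.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],
        "City": ["Oslo", "Bergen", "Trondheim"],
        "Salary": [50000, 60000, 75000],
    }
)
df_edit = df.copy()

# Update: set a new salary for the rows where Name is "Bob"
df_edit.loc[df_edit["Name"] == "Bob", "Salary"] = 62000

# Remove a column, then remove the row with index label 0
df_edit = df_edit.drop(columns=["City"])
df_edit = df_edit.drop(index=0)

print(df_edit)
```

Because the edits are made on a copy, `df` itself still holds the original values.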

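## 6. Sorting and Aggregations

Sort rows by one or more columns, and summarize data with aggregations, either over the whole DataFrame or per group.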

In [None]:
# Sort by a column
print("Sorted by Age (descending):")
df.sort_values("Age", ascending=False)
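
`sort_values` also accepts a list of columns, with a separate direction for each. The sketch below sorts by city first and then by salary within each city.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
        "City": ["Oslo", "Bergen", "Trondheim", "Oslo", "Bergen"],
        "Salary": [50000, 60000, 75000, 55000, 65000],
    }
)

# City ascending (A-Z), then Salary descending within each city
sorted_df = df.sort_values(["City", "Salary"], ascending=[True, False])

# Optionally renumber the rows 0..n-1 after sorting
sorted_df = sorted_df.reset_index(drop=True)
print(sorted_df)
```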

In [None]:
# Basic aggregations
print("Salary statistics:")
print(f"  Mean: {df['Salary'].mean():,.0f}")
print(f"  Sum: {df['Salary'].sum():,}")
print(f"  Max: {df['Salary'].max():,}")
print(f"  Min: {df['Salary'].min():,}")

In [None]:
# GroupBy - aggregate by category
print("Average salary by city:")
df.groupby("City")["Salary"].mean()

In [None]:
# Multiple aggregations
print("Statistics by city:")
df.groupby("City").agg({"Salary": ["mean", "count"], "Age": "mean"})
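
The dictionary form above produces two-level column names. If you prefer flat, self-describing columns, named aggregation is one alternative. The output column names below (`mean_salary`, `headcount`, `mean_age`) are chosen for this sketch only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
        "Age": [25, 30, 35, 28, 32],
        "City": ["Oslo", "Bergen", "Trondheim", "Oslo", "Bergen"],
        "Salary": [50000, 60000, 75000, 55000, 65000],
    }
)

# Each keyword is an output column: (input column, aggregation function)
summary = df.groupby("City").agg(
    mean_salary=("Salary", "mean"),
    headcount=("Name", "count"),
    mean_age=("Age", "mean"),
)
print(summary)
```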

## 7. Handling Missing Data

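Real-world datasets often contain missing values. Pandas usually represents these as `NaN` (Not a Number) and provides tools to detect and handle them. Because `NaN` is a floating-point value, a column of whole numbers that contains it is stored as `float64` by default.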


In [None]:
# Create a DataFrame with missing values
df_missing = pd.DataFrame(
    {"A": [1, 2, np.nan, 4], "B": [np.nan, 2, 3, 4], "C": [1, 2, 3, np.nan]}
)
print("DataFrame with missing values:")
df_missing

In [None]:
# Detect missing values
print("Missing values per column:")
print(df_missing.isna().sum())
print("\nTotal missing values:", df_missing.isna().sum().sum())

In [None]:
# Drop rows with any missing values
print("After dropping rows with NaN:")
df_missing.dropna()

In [None]:
# Fill missing values with a specific value
print("Fill NaN with 0:")
print(df_missing.fillna(0))

print("\nFill NaN with column mean:")
df_missing.fillna(df_missing.mean(numeric_only=True))
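
Dropping or filling does not have to treat every column the same way. The sketch below drops rows based on a single column and fills two columns with different values. Using 0 for `A` and the median for `B` is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame(
    {"A": [1, 2, np.nan, 4], "B": [np.nan, 2, 3, 4], "C": [1, 2, 3, np.nan]}
)

# Drop a row only if column "A" is missing
dropped = df_missing.dropna(subset=["A"])

# Fill per column: a constant for A, the column median for B; C is left as-is
filled = df_missing.fillna({"A": 0, "B": df_missing["B"].median()})

print(dropped)
print(filled)
```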

## 8. Reading Data from Files

In practice, you'll usually load data from files rather than creating DataFrames manually. CSV (Comma-Separated Values) is one of the most common formats.


In [None]:
# Common file reading functions:
# pd.read_csv("data.csv")       - Read CSV file
# pd.read_excel("data.xlsx")    - Read Excel file (needs an Excel engine such as openpyxl installed)
# pd.read_json("data.json")     - Read JSON file

# You can also write DataFrames to files:
# df.to_csv("output.csv", index=False)
# df.to_excel("output.xlsx", index=False)

# Example: Save our DataFrame to CSV and read it back
df.to_csv("sample_data.csv", index=False)
df_from_csv = pd.read_csv("sample_data.csv")
print("Data loaded from CSV:")
df_from_csv
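
Real CSV files often need a few extra options. The sketch below reads semicolon-separated text with a custom missing-value marker. The text and the marker `unknown` are made up for this example, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Stand-in for a file on disk; the contents are made up for illustration
csv_text = """Name;Age;City
Alice;25;Oslo
Bob;unknown;Bergen
"""

df_csv = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                  # field delimiter used in this text
    na_values=["unknown"],    # treat this marker as missing
    usecols=["Name", "Age"],  # load only these columns
)
print(df_csv)
print(df_csv.dtypes)
```

With a real file, you would pass the file path instead of the `io.StringIO` object.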