# Module 03: Pandas Basics

**Estimated Time**: 60 minutes

## Learning Objectives

By the end of this module, you will:
- Understand what Pandas is and why it's essential
- Create and manipulate Series and DataFrames
- Load data from CSV, Excel, and other formats
- Select, filter, and sort data efficiently
- Perform group-by operations and aggregations
- Add, modify, and delete columns
- Handle basic date/time data
- Export data to various formats

## Prerequisites

- Modules 00-02 completed
- NumPy fundamentals
- Basic Python knowledge

---

## 1. Introduction to Pandas

**Pandas** is Python's most widely used library for manipulating and analyzing tabular data.

### Why Pandas?

- **Tabular data**: Works like Excel or SQL, but with programming power
- **Data cleaning**: Easily handle missing values, duplicates, and transformations
- **Time series**: Built-in support for dates and times
- **I/O**: Read/write CSV, Excel, SQL, JSON, and more
- **Integration**: Works seamlessly with NumPy, Matplotlib, and scikit-learn

### Real-World Analogy

Think of Pandas as:
- **Excel**: But with Python's automation and scalability
- **SQL**: But written in Python, so filters, joins, and aggregations mix freely with regular code
- **Spreadsheet + Programming**: The best of both worlds

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

print(f"Pandas version: {pd.__version__}")
print("Ready to start!")

## 2. Pandas Series

A **Series** is a one-dimensional labeled array (like a column in Excel).

In [None]:
# Create Series from a list
temperatures = pd.Series([72, 68, 75, 71, 69])
print("Temperatures Series:")
print(temperatures)
print(f"\nType: {type(temperatures)}")

In [None]:
# Series with custom index (labels)
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
temperatures = pd.Series([72, 68, 75, 71, 69], index=days)

print("Labeled Series:")
print(temperatures)

# Access by label
print(f"\nWednesday temperature: {temperatures['Wednesday']}°F")
# Access by position (use .iloc; plain [0] on a labeled Series is deprecated)
print(f"First temperature: {temperatures.iloc[0]}°F")

In [None]:
# Series attributes and methods
print("Series Information:")
print(f"Values: {temperatures.values}")
print(f"Index: {temperatures.index.tolist()}")
print(f"Shape: {temperatures.shape}")
print(f"Size: {temperatures.size}")
print(f"\nStatistics:")
print(f"Mean: {temperatures.mean():.1f}°F")
print(f"Max: {temperatures.max()}°F")
print(f"Min: {temperatures.min()}°F")

In [None]:
# Series operations (vectorized)
print("Temperature conversions:")
celsius = (temperatures - 32) * 5 / 9
print("\nCelsius:")
print(celsius.round(1))

# Boolean indexing
print("\nWarm days (>70°F):")
warm_days = temperatures[temperatures > 70]
print(warm_days)
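
The operations above combine a Series with a single number. When two Series are combined, Pandas matches values by index label rather than by position. The sketch below illustrates this with made-up readings; the city names and values are hypothetical.

```python
import pandas as pd

# Hypothetical readings keyed by city (illustrative values only)
morning = pd.Series({"Austin": 70, "Boston": 55, "Chicago": 50})
evening = pd.Series({"Boston": 60, "Chicago": 58, "Denver": 65})

# Subtraction aligns on index labels, not on position.
# A label present in only one Series produces NaN.
change = evening - morning
print(change)

# Keep only the cities that appear in both Series
print(change.dropna())
```

This alignment is why a meaningful index matters: it prevents values from being paired with the wrong row by accident.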

## 3. DataFrames: The Core of Pandas

A **DataFrame** is a 2D labeled data structure (think spreadsheet or SQL table).

In [None]:
# Create DataFrame from dictionary
data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "Age": [25, 30, 35, 28, 32],
    "City": ["New York", "London", "Paris", "Tokyo", "Berlin"],
    "Salary": [70000, 80000, 75000, 85000, 72000],
}

df = pd.DataFrame(data)
print("Employee DataFrame:")
print(df)
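
A dictionary of columns is one common way to build a DataFrame. Another, sketched below with invented inventory data, is a list of row dictionaries. The example also shows that each column of a DataFrame is a Series sharing the same row index.

```python
import pandas as pd

# Each dictionary is one row; the keys become column names (made-up data)
records = [
    {"product": "Widget", "price": 2.50, "in_stock": True},
    {"product": "Gadget", "price": 10.00, "in_stock": False},
    {"product": "Gizmo", "price": 7.25, "in_stock": True},
]
inventory = pd.DataFrame(records)
print(inventory)

# Selecting one column returns a Series with the DataFrame's row index
price_column = inventory["price"]
print(type(price_column).__name__)
print(price_column.index.equals(inventory.index))
```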

In [None]:
# DataFrame information
print("DataFrame Info:")
print(f"Shape: {df.shape} (rows x columns)")
print(f"Columns: {df.columns.tolist()}")
print(f"Data types:\n{df.dtypes}")
# deep=True also counts the string data held in object columns
print(f"\nMemory usage: {df.memory_usage(deep=True).sum()} bytes")

In [None]:
# Quick look at data
print("First 3 rows:")
print(df.head(3))

print("\nLast 2 rows:")
print(df.tail(2))

print("\nRandom 2 rows:")
print(df.sample(2))

In [None]:
# Summary statistics
print("Statistical Summary:")
print(df.describe())

print("\nDetailed info:")
df.info()

## 4. Loading Data from Files

Pandas can read data from many formats. Let's practice with our sample datasets!

In [None]:
# Load sales data
sales_df = pd.read_csv("../data/sales_data.csv")

print("Sales Data:")
print(sales_df.head())
print(f"\nShape: {sales_df.shape}")

In [None]:
# Load customer data
customers_df = pd.read_csv("../data/customer_data.csv")

print("Customer Data:")
print(customers_df.head())
print(f"\nColumns: {customers_df.columns.tolist()}")

In [None]:
# Explore the data
print("Data Overview:")
print(f"Total sales records: {len(sales_df)}")
print(f"Total customers: {len(customers_df)}")
print(f"\nSales columns: {sales_df.columns.tolist()}")
print(f"\nFirst sale: {sales_df['Date'].iloc[0]}")
print(f"Last sale: {sales_df['Date'].iloc[-1]}")
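
If the sample files are not at hand, `pd.read_csv` can be tried on in-memory text instead. The sketch below assumes column names like those used in this module (`Date`, `Region`, `Sales`); the rows themselves are invented. It also shows the `parse_dates` option and how an empty field is read as a missing value.

```python
import io

import pandas as pd

# Invented CSV text standing in for a file on disk
csv_text = """Date,Region,Sales
2024-01-05,North,120.50
2024-01-06,South,89.99
2024-01-07,North,
"""

# io.StringIO lets read_csv treat a string like an open file
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(sample.dtypes)                 # Date is parsed as a datetime column
print(sample["Sales"].isna().sum())  # the empty last field becomes NaN
```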

## 5. Selecting Data

Learn different ways to select columns and rows.

In [None]:
# Select single column (returns Series)
names = df["Name"]
print("Names (Series):")
print(names)
print(f"Type: {type(names)}")

In [None]:
# Select multiple columns (returns DataFrame)
subset = df[["Name", "Salary"]]
print("Name and Salary:")
print(subset)
print(f"Type: {type(subset)}")

In [None]:
# Select rows by position with iloc
print("First row (iloc):")
print(df.iloc[0])

print("\nFirst 3 rows, first 2 columns:")
print(df.iloc[:3, :2])

In [None]:
# Select rows by label/condition with loc
print("Row at index 0 (loc):")
print(df.loc[0])

# Select specific rows and columns
print("\nSpecific selection:")
print(df.loc[0:2, ["Name", "City"]])
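
With the default index, labels and positions are the same numbers, so `loc` and `iloc` can look interchangeable. The sketch below uses a small made-up DataFrame whose labels differ from its positions to show the difference, including how each one treats the end of a slice.

```python
import pandas as pd

scores = pd.DataFrame(
    {"score": [88, 92, 79, 85]},
    index=[10, 20, 30, 40],  # labels deliberately differ from positions
)

# iloc slices by position and excludes the end, like Python lists
by_position = scores.iloc[0:2]

# loc slices by label and includes the end label
by_label = scores.loc[10:30]

print(by_position)
print(by_label)
```

Here `iloc[0:2]` returns two rows, while `loc[10:30]` returns three, because the label `30` is included.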

## 6. Filtering Data

Filter rows based on conditions (Boolean indexing).

In [None]:
# Simple filter
high_earners = df[df["Salary"] >= 75000]
print("High earners (Salary >= $75,000):")
print(high_earners)

In [None]:
# Multiple conditions (AND)
young_high_earners = df[(df["Age"] < 30) & (df["Salary"] > 70000)]
print("Young high earners (Age < 30 AND Salary > $70,000):")
print(young_high_earners)

In [None]:
# OR condition
special_cases = df[(df["Age"] > 32) | (df["Salary"] < 72000)]
print("Special cases (Age > 32 OR Salary < $72,000):")
print(special_cases)

In [None]:
# Filter by list (isin)
selected_cities = df[df["City"].isin(["New York", "Tokyo"])]
print("Employees in New York or Tokyo:")
print(selected_cities)

In [None]:
# Filter by exact string match (using sales data)
electronics = sales_df[sales_df["Category"] == "Electronics"]
print(f"Electronics sales: {len(electronics)} transactions")
print(electronics.head())
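
Each filter above works the same way: the comparison produces a Series of `True`/`False` values (a mask), and indexing with that mask keeps the `True` rows. The sketch below, on invented order data, stores the mask in a variable and inverts it with `~`.

```python
import pandas as pd

# Made-up orders for illustration
orders = pd.DataFrame({
    "item": ["pen", "book", "lamp", "desk"],
    "price": [2, 15, 40, 150],
    "shipped": [True, False, True, False],
})

# The comparison yields one boolean per row
mask = orders["price"] > 10
print(mask)

expensive = orders[mask]    # rows where the mask is True
cheap = orders[~mask]       # ~ flips True and False

# Masks combine with & and |; here: expensive AND not yet shipped
pending = orders[mask & ~orders["shipped"]]
print(pending)
```

Naming the mask can make long conditions easier to read and reuse.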

## 7. Sorting Data

In [None]:
# Sort by single column
sorted_by_salary = df.sort_values("Salary", ascending=False)
print("Sorted by Salary (highest first):")
print(sorted_by_salary)

In [None]:
# Sort by multiple columns
sorted_multi = df.sort_values(["City", "Age"])
print("Sorted by City, then Age:")
print(sorted_multi)

In [None]:
# Top N values
top_3_earners = df.nlargest(3, "Salary")
print("Top 3 earners:")
print(top_3_earners[["Name", "Salary"]])

youngest_2 = df.nsmallest(2, "Age")
print("\nYoungest 2 employees:")
print(youngest_2[["Name", "Age"]])

## 8. Adding and Modifying Data

In [None]:
# Add new column
df["Bonus"] = df["Salary"] * 0.10
print("With Bonus column:")
print(df[["Name", "Salary", "Bonus"]])

In [None]:
# Add calculated column
df["Total_Compensation"] = df["Salary"] + df["Bonus"]
print("With Total Compensation:")
print(df[["Name", "Salary", "Bonus", "Total_Compensation"]])

In [None]:
# Conditional column
df["Experience_Level"] = df["Age"].apply(lambda x: "Senior" if x >= 30 else "Junior")
print("With Experience Level:")
print(df[["Name", "Age", "Experience_Level"]])
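
The `apply` call above runs a Python function once per row, which is fine for small data. For a simple two-way condition, a vectorized alternative is `np.where`. This is a sketch on a small stand-in DataFrame (the `staff` name and its rows are illustrative), not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Small stand-in DataFrame with the same Age threshold as above
staff = pd.DataFrame({"Name": ["Alice", "Bob", "Diana"], "Age": [25, 30, 28]})

# np.where(condition, value_if_true, value_if_false) works on the whole column
staff["Experience_Level"] = np.where(staff["Age"] >= 30, "Senior", "Junior")
print(staff)
```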

In [None]:
# Modify existing values
df_copy = df.copy()
df_copy.loc[df_copy["City"] == "Paris", "City"] = "Paris, France"
print("Modified City values:")
print(df_copy[["Name", "City"]])

In [None]:
# Delete column
df_copy = df.copy()
df_copy = df_copy.drop("Bonus", axis=1)  # axis=1 for columns
print("After dropping Bonus:")
print(df_copy.columns.tolist())

## 9. Group-By Operations

Group data by categories and perform aggregations.

In [None]:
# Group sales by category
category_sales = sales_df.groupby("Category")["Sales"].sum()
print("Total Sales by Category:")
print(category_sales)
print(f"\nType: {type(category_sales)}")

In [None]:
# Multiple aggregations
category_stats = sales_df.groupby("Category")["Sales"].agg(["sum", "mean", "count"])
print("Sales Statistics by Category:")
print(category_stats)

In [None]:
# Group by multiple columns
region_category = sales_df.groupby(["Region", "Category"])["Sales"].sum()
print("Sales by Region and Category:")
print(region_category)

In [None]:
# Group and aggregate multiple columns
region_summary = sales_df.groupby("Region").agg({"Sales": ["sum", "mean"], "Units": "sum"})
print("Region Summary:")
print(region_summary)
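
The dictionary form of `agg` above produces a two-level column header. Named aggregation is an alternative that gives each result a flat, descriptive column name. The sketch below uses made-up data; the output column names (`total_sales`, `avg_sales`, `n_orders`) are chosen for illustration.

```python
import pandas as pd

# Invented sales rows
toy_sales = pd.DataFrame({
    "Category": ["A", "B", "A", "B", "A"],
    "Sales": [100, 200, 150, 50, 250],
})

# Each keyword is an output column: name=(source column, aggregation)
summary = toy_sales.groupby("Category").agg(
    total_sales=("Sales", "sum"),
    avg_sales=("Sales", "mean"),
    n_orders=("Sales", "count"),
)
print(summary)
```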

## 10. Basic Date/Time Handling

In [None]:
# Convert string to datetime
sales_df["Date"] = pd.to_datetime(sales_df["Date"])
print("Date column info:")
print(sales_df["Date"].dtype)
print("\nFirst few dates:")
print(sales_df["Date"].head())
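
By default, `pd.to_datetime` raises an error when a value cannot be parsed. With messy input, passing `errors="coerce"` turns unparseable values into `NaT` (the datetime equivalent of a missing value) instead. A minimal sketch on invented strings:

```python
import pandas as pd

# One valid date, one impossible date, one non-date string
raw_dates = pd.Series(["2024-01-15", "2024-02-30", "not a date"])

# errors="coerce" replaces anything unparseable with NaT
parsed = pd.to_datetime(raw_dates, errors="coerce")
print(parsed)
print(f"Unparseable values: {parsed.isna().sum()}")
```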

In [None]:
# Extract date components
sales_df["Year"] = sales_df["Date"].dt.year
sales_df["Month"] = sales_df["Date"].dt.month
sales_df["Day"] = sales_df["Date"].dt.day
sales_df["DayOfWeek"] = sales_df["Date"].dt.day_name()

print("Date components:")
print(sales_df[["Date", "Year", "Month", "Day", "DayOfWeek"]].head())

In [None]:
# Filter by date
jan_15_onwards = sales_df[sales_df["Date"] >= "2024-01-15"]
print(f"Sales from Jan 15 onwards: {len(jan_15_onwards)} records")
print(jan_15_onwards.head())

## 11. Exporting Data

In [None]:
# Export to CSV
df.to_csv("employees_export.csv", index=False)
print("Exported to employees_export.csv")

# Verify by loading
loaded = pd.read_csv("employees_export.csv")
print("\nLoaded back:")
print(loaded.head())

# Clean up
import os

os.remove("employees_export.csv")
print("\nTemp file removed")
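
CSV is one of several export formats. As a sketch on a small stand-in DataFrame (the `report` name and rows are illustrative), the example below exports to a CSV string and a JSON string without touching the disk: `to_csv` returns text when no path is given, and `to_json` does the same.

```python
import io
import json

import pandas as pd

report = pd.DataFrame({"Name": ["Alice", "Bob"], "Salary": [70000, 80000]})

# With no file path, to_csv returns the CSV text
csv_text = report.to_csv(index=False)
print(csv_text)

# Round trip: read the text back and compare with the original
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(report))

# orient="records" writes one JSON object per row
json_text = report.to_json(orient="records")
rows = json.loads(json_text)
print(rows)
```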

## 12. Practical Example: Sales Analysis

In [None]:
# Load and prepare data
sales = pd.read_csv("../data/sales_data.csv")
sales["Date"] = pd.to_datetime(sales["Date"])

print("SALES ANALYSIS REPORT")
print("=" * 50)

# Overall statistics
total_sales = sales["Sales"].sum()
avg_sales = sales["Sales"].mean()
total_transactions = len(sales)

print(f"Total Sales: ${total_sales:,.2f}")
print(f"Average Transaction: ${avg_sales:,.2f}")
print(f"Total Transactions: {total_transactions}")

In [None]:
# Best performing category
category_performance = (
    sales.groupby("Category")
    .agg({"Sales": "sum", "Units": "sum"})
    .sort_values("Sales", ascending=False)
)

print("\nPerformance by Category:")
print(category_performance)

In [None]:
# Best region
region_sales = sales.groupby("Region")["Sales"].sum().sort_values(ascending=False)
print("\nSales by Region:")
print(region_sales)

best_region = region_sales.index[0]
print(f"\nBest Region: {best_region} (${region_sales.iloc[0]:,.2f})")

In [None]:
# Top products
product_sales = sales.groupby("Product")["Sales"].sum().sort_values(ascending=False)
print("\nTop 3 Products:")
print(product_sales.head(3))

## 13. Exercises

In [None]:
# Exercise 1: Customer Analysis
# TODO: Load customer_data.csv
# TODO: Find the average Total_Spent by Membership_Level
# TODO: Identify which State has the most customers
# TODO: Calculate the total revenue from all customers

# Your code here

In [None]:
# Exercise 2: Housing Data
# TODO: Load housing_prices.csv
# TODO: Find the average price by City
# TODO: Calculate the price per square foot for each house
# TODO: Find houses with 4+ bedrooms and price < $500,000

# Your code here

In [None]:
# Exercise 3: Create Your Own DataFrame
# TODO: Create a DataFrame with information about 5 books
# Columns: Title, Author, Year, Pages, Rating (1-5)
# TODO: Add a column for 'Category' (Fiction/Non-Fiction)
# TODO: Find the highest-rated book
# TODO: Calculate average pages by Category

# Your code here

## 14. Key Takeaways

Excellent work! You've mastered Pandas basics:

✓ **Series and DataFrames**: Core Pandas data structures  
✓ **Loading data**: Read from CSV and other formats  
✓ **Selecting data**: loc, iloc, boolean indexing  
✓ **Filtering**: Conditional selection with boolean operators  
✓ **Sorting**: Single and multi-column sorting  
✓ **Adding columns**: Create calculated and conditional columns  
✓ **Group-by**: Aggregate data by categories  
✓ **Date/time**: Basic datetime operations  
✓ **Exporting**: Save processed data  

## Next Steps

**Next Module**: `04_data_cleaning.ipynb`

In Module 04, you'll learn to handle messy real-world data: missing values, duplicates, and data quality issues.

---

**Great job!** You're building a strong foundation in data science. Keep practicing! 📊