# Pandas Crash Course for Data Science Assessments

**Date Created:** 20 January 2026

This notebook covers the Pandas concepts most commonly tested in data science interviews and assessments. Each section pairs a short explanation with runnable examples, and the final section provides practice questions with solutions.

## Table of Contents

1. [DataFrame Creation and Basic Operations](#1-dataframe-creation-and-basic-operations)
2. [Indexing and Selection](#2-indexing-and-selection)
3. [Merging and Joining](#3-merging-and-joining)
4. [GroupBy Operations and Aggregations](#4-groupby-operations-and-aggregations)
5. [Pivot Tables and Reshaping](#5-pivot-tables-and-reshaping)
6. [Apply, Map, and Lambda Functions](#6-apply-map-and-lambda-functions)
7. [Handling Missing Data](#7-handling-missing-data)
8. [String Operations](#8-string-operations)
9. [Date/Time Operations](#9-datetime-operations)
10. [Practice Questions](#10-practice-questions)

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

---

## 1. DataFrame Creation and Basic Operations

A **DataFrame** is a two-dimensional, size-mutable, heterogeneous tabular data structure with labelled axes (rows and columns). It is the primary data structure in Pandas.

### Creating DataFrames

There are multiple ways to create a DataFrame:

In [None]:
# Method 1: From a dictionary
data_dict = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'city': ['London', 'Manchester', 'Birmingham', 'London', 'Leeds'],
    'salary': [50000, 60000, 75000, 55000, 65000]
}
df_from_dict = pd.DataFrame(data_dict)
print("DataFrame from dictionary:")
print(df_from_dict)

In [None]:
# Method 2: From a list of dictionaries
data_list = [
    {'product': 'Laptop', 'price': 999.99, 'quantity': 50},
    {'product': 'Mouse', 'price': 29.99, 'quantity': 200},
    {'product': 'Keyboard', 'price': 79.99, 'quantity': 150}
]
df_from_list = pd.DataFrame(data_list)
print("DataFrame from list of dictionaries:")
print(df_from_list)

In [None]:
# Method 3: From a NumPy array with custom index and columns
np_array = np.random.randint(1, 100, size=(4, 3))
df_from_numpy = pd.DataFrame(
    np_array,
    index=['row1', 'row2', 'row3', 'row4'],
    columns=['A', 'B', 'C']
)
print("DataFrame from NumPy array:")
print(df_from_numpy)

### Basic DataFrame Operations

In [None]:
# Create a sample DataFrame for demonstrations
employees = pd.DataFrame({
    'employee_id': [101, 102, 103, 104, 105, 106, 107, 108],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales', 'Engineering', 'HR', 'Sales'],
    'salary': [75000, 55000, 80000, 60000, 52000, 90000, 58000, 61000],
    'years_experience': [5, 3, 7, 4, 2, 10, 3, 5],
    'hire_date': pd.to_datetime(['2019-03-15', '2021-06-01', '2017-09-20', '2020-01-10', 
                                  '2022-05-25', '2014-11-08', '2021-08-15', '2019-12-01'])
})
print("Sample Employee DataFrame:")
print(employees)

In [None]:
# View first n rows (default 5)
print("First 3 rows:")
print(employees.head(3))

In [None]:
# View last n rows (default 5)
print("Last 2 rows:")
print(employees.tail(2))

In [None]:
# Get shape (rows, columns)
print(f"Shape: {employees.shape}")
print(f"Number of rows: {employees.shape[0]}")
print(f"Number of columns: {employees.shape[1]}")

In [None]:
# Get column names and data types
print("Column names:")
print(employees.columns.tolist())
print("\nData types:")
print(employees.dtypes)

In [None]:
# Get summary information
print("DataFrame info:")
employees.info()

In [None]:
# Descriptive statistics for numerical columns
print("Descriptive statistics:")
print(employees.describe())

In [None]:
# Unique values and value counts
print("Unique departments:")
print(employees['department'].unique())

print("\nDepartment value counts:")
print(employees['department'].value_counts())

---

## 2. Indexing and Selection

Pandas offers three main ways to select and filter data: `.loc[]` (by label), `.iloc[]` (by integer position), and boolean indexing (by condition).

### `.loc[]` - Label-based Selection

Use `.loc[]` when you want to select by **row/column labels**.

In [None]:
# Set employee_id as index for better demonstration
df_indexed = employees.set_index('employee_id')
print("DataFrame with employee_id as index:")
print(df_indexed)

In [None]:
# Select a single row by label
print("Row for employee 103:")
print(df_indexed.loc[103])

In [None]:
# Select multiple rows by labels
print("Rows for employees 101, 103, 105:")
print(df_indexed.loc[[101, 103, 105]])

In [None]:
# Select specific rows and columns (label slices include both endpoints, so 104 is returned)
print("Name and salary for employees 102-104:")
print(df_indexed.loc[102:104, ['name', 'salary']])

### `.iloc[]` - Integer Position-based Selection

Use `.iloc[]` when you want to select by **integer position** (0-indexed).

In [None]:
# Select first row
print("First row (position 0):")
print(employees.iloc[0])

In [None]:
# Select rows 1-3 and columns 0-2 (position slices exclude the end, hence 1:4 and 0:3)
print("Rows 1-3, columns 0-2:")
print(employees.iloc[1:4, 0:3])

In [None]:
# Select specific rows and columns by position
print("Rows 0, 2, 4 and columns 1, 3:")
print(employees.iloc[[0, 2, 4], [1, 3]])
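
The difference in slice endpoints between `.loc[]` and `.iloc[]` is a common source of off-by-one errors. The sketch below uses a small hypothetical frame named `demo` (not part of the notebook's datasets) to show the two behaviours side by side.

```python
import pandas as pd

# Hypothetical frame, used only for this illustration
demo = pd.DataFrame(
    {'name': ['Alice', 'Bob', 'Charlie', 'Diana']},
    index=[101, 102, 103, 104],
)

by_label = demo.loc[101:103]     # label slice: the end label 103 is included
by_position = demo.iloc[0:2]     # position slice: the end position 2 is excluded

print(by_label)       # Alice, Bob, Charlie
print(by_position)    # Alice, Bob
```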

### Boolean Indexing

Filter rows based on conditions.

In [None]:
# Single condition
print("Employees with salary > 60000:")
print(employees[employees['salary'] > 60000])

In [None]:
# Multiple conditions with AND (&); each condition needs its own parentheses
print("Engineering employees with salary > 70000:")
print(employees[(employees['department'] == 'Engineering') & (employees['salary'] > 70000)])

In [None]:
# Multiple conditions with OR (|)
print("Employees in HR or with experience > 5 years:")
print(employees[(employees['department'] == 'HR') | (employees['years_experience'] > 5)])

In [None]:
# Using isin() for multiple values
print("Employees in Engineering or Sales:")
print(employees[employees['department'].isin(['Engineering', 'Sales'])])

In [None]:
# Using query() method - more readable for complex conditions
print("Using query():")
print(employees.query('salary > 60000 and years_experience >= 4'))
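
`query()` can also read Python variables from the surrounding scope by prefixing them with `@`. The following is a minimal sketch; the frame `staff` and the threshold `min_salary` are hypothetical names chosen for illustration.

```python
import pandas as pd

staff = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'salary': [75000, 55000, 80000],
})

min_salary = 60000  # hypothetical threshold

# '@min_salary' refers to the Python variable, not to a column
high_earners = staff.query('salary > @min_salary')
print(high_earners)
```

This keeps the filter expression readable while avoiding string formatting to inject values.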

---

## 3. Merging and Joining

Pandas provides several methods to combine DataFrames: `merge()`, `join()`, and `concat()`.

### Creating Sample DataFrames for Merging

In [None]:
# Left DataFrame - Orders
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005],
    'customer_id': ['C001', 'C002', 'C001', 'C003', 'C004'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
    'amount': [999.99, 29.99, 79.99, 299.99, 149.99]
})
print("Orders DataFrame:")
print(orders)

In [None]:
# Right DataFrame - Customers
customers = pd.DataFrame({
    'customer_id': ['C001', 'C002', 'C003', 'C005'],
    'customer_name': ['John Smith', 'Jane Doe', 'Bob Wilson', 'Alice Brown'],
    'city': ['London', 'Manchester', 'Birmingham', 'Leeds']
})
print("Customers DataFrame:")
print(customers)

### `merge()` - SQL-style Joins

The `merge()` function provides flexibility with different join types: `inner`, `left`, `right`, and `outer`.

In [None]:
# Inner join - only matching rows
inner_merged = pd.merge(orders, customers, on='customer_id', how='inner')
print("Inner Join (only matching customer_ids):")
print(inner_merged)

In [None]:
# Left join - all rows from left DataFrame
left_merged = pd.merge(orders, customers, on='customer_id', how='left')
print("Left Join (all orders, matching customers):")
print(left_merged)

In [None]:
# Right join - all rows from right DataFrame
right_merged = pd.merge(orders, customers, on='customer_id', how='right')
print("Right Join (all customers, matching orders):")
print(right_merged)

In [None]:
# Outer join - all rows from both DataFrames
outer_merged = pd.merge(orders, customers, on='customer_id', how='outer')
print("Outer Join (all rows from both):")
print(outer_merged)
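
When checking a join, it often helps to see which side each row came from. A minimal sketch using the `indicator=True` option of `merge()`; the frames `left` and `right` are hypothetical stand-ins for the orders and customers tables.

```python
import pandas as pd

left = pd.DataFrame({
    'customer_id': ['C001', 'C002', 'C004'],
    'amount': [10.0, 20.0, 30.0],
})
right = pd.DataFrame({
    'customer_id': ['C001', 'C002', 'C005'],
    'city': ['London', 'Leeds', 'York'],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
audit = pd.merge(left, right, on='customer_id', how='outer', indicator=True)
print(audit[['customer_id', '_merge']])

# Keys that appear in the left frame but have no match on the right
unmatched = audit.loc[audit['_merge'] == 'left_only', 'customer_id'].tolist()
print(unmatched)
```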

In [None]:
# Merge on different column names
orders_renamed = orders.rename(columns={'customer_id': 'cust_id'})
merged_diff_cols = pd.merge(orders_renamed, customers, left_on='cust_id', right_on='customer_id')
print("Merge with different column names:")
print(merged_diff_cols)

### `concat()` - Concatenating DataFrames

Use `concat()` to stack DataFrames vertically (`axis=0`) or horizontally (`axis=1`).

In [None]:
# Create two DataFrames to concatenate
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Vertical concatenation (stacking rows)
concat_vertical = pd.concat([df1, df2], ignore_index=True)
print("Vertical concatenation:")
print(concat_vertical)

In [None]:
# Horizontal concatenation (stacking columns; rows are aligned on the index)
df3 = pd.DataFrame({'C': [9, 10], 'D': [11, 12]})
concat_horizontal = pd.concat([df1, df3], axis=1)
print("Horizontal concatenation:")
print(concat_horizontal)

---

## 4. GroupBy Operations and Aggregations

The `groupby()` function follows a **split-apply-combine** strategy:
1. **Split** the data into groups based on criteria
2. **Apply** a function to each group independently
3. **Combine** the results into a data structure
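
The three steps above can be sketched by hand to show what `groupby()` does internally. This is an illustration only, using a hypothetical frame `toy`; in practice the built-in aggregation is the one to use.

```python
import pandas as pd

toy = pd.DataFrame({
    'store': ['London', 'Manchester', 'London', 'Manchester'],
    'revenue': [1500, 800, 2000, 1200],
})

# Split: iterating a groupby yields (key, sub-DataFrame) pairs
# Apply: sum the revenue inside each sub-DataFrame
# Combine: collect the per-group results into a single Series
manual = pd.Series(
    {store: group['revenue'].sum() for store, group in toy.groupby('store')}
)

# The built-in equivalent performs the same three steps in one call
built_in = toy.groupby('store')['revenue'].sum()

print(manual)
print(built_in)
```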

In [None]:
# Create a sales DataFrame for groupby examples
sales = pd.DataFrame({
    'date': pd.to_datetime(['2025-01-15', '2025-01-15', '2025-01-16', '2025-01-16',
                            '2025-01-17', '2025-01-17', '2025-01-18', '2025-01-18']),
    'store': ['London', 'Manchester', 'London', 'Manchester',
              'London', 'Manchester', 'London', 'Manchester'],
    'product': ['Electronics', 'Clothing', 'Electronics', 'Electronics',
                'Clothing', 'Clothing', 'Furniture', 'Furniture'],
    'revenue': [1500, 800, 2000, 1200, 600, 900, 3000, 2500],
    'units_sold': [10, 20, 15, 8, 15, 25, 5, 4]
})
print("Sales DataFrame:")
print(sales)

In [None]:
# Basic groupby with single aggregation
print("Total revenue by store:")
print(sales.groupby('store')['revenue'].sum())

In [None]:
# Groupby with multiple aggregations using agg()
print("Multiple aggregations by store:")
store_stats = sales.groupby('store').agg({
    'revenue': ['sum', 'mean', 'max'],
    'units_sold': ['sum', 'mean']
})
print(store_stats)

In [None]:
# Named aggregations (cleaner column names)
print("Named aggregations:")
named_agg = sales.groupby('store').agg(
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean'),
    total_units=('units_sold', 'sum'),
    transaction_count=('revenue', 'count')
)
print(named_agg)

In [None]:
# Groupby multiple columns
print("Revenue by store and product:")
multi_group = sales.groupby(['store', 'product'])['revenue'].sum().reset_index()
print(multi_group)

In [None]:
# Custom aggregation function
def revenue_range(x: pd.Series) -> float:
    """Calculate the range of revenue values.
    
    Args:
        x: Series of revenue values.
    
    Returns:
        The difference between max and min values.
    """
    return x.max() - x.min()

print("Revenue range by store:")
print(sales.groupby('store')['revenue'].agg(revenue_range))

In [None]:
# Transform - apply a function per group and return a result aligned to the original rows
# (std() is the sample standard deviation, ddof=1)
print("Adding normalised revenue column:")
sales['revenue_normalised'] = sales.groupby('store')['revenue'].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(sales[['store', 'revenue', 'revenue_normalised']])
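
Because `transform()` returns one value per original row, it is also a convenient way to compute each row's share of its group total. A minimal sketch with a hypothetical frame `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    'store': ['London', 'London', 'Manchester', 'Manchester'],
    'revenue': [1500, 500, 800, 1200],
})

# transform('sum') broadcasts each group's total back onto its rows
store_total = toy.groupby('store')['revenue'].transform('sum')
toy['share_of_store'] = toy['revenue'] / store_total

print(toy)
```

By contrast, `agg('sum')` would return one row per store, which could not be assigned back as a column directly.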

---

## 5. Pivot Tables and Reshaping

Pivot tables provide a way to summarise and reorganise data, similar to pivot tables in Excel.

In [None]:
# Remove the normalised column for cleaner examples
sales = sales.drop(columns=['revenue_normalised'])

# Basic pivot table
print("Pivot table - Revenue by store and product:")
pivot_basic = pd.pivot_table(
    sales,
    values='revenue',
    index='store',
    columns='product',
    aggfunc='sum'
)
print(pivot_basic)

In [None]:
# Pivot table with multiple aggregations
print("Pivot table with sum and mean:")
pivot_multi = pd.pivot_table(
    sales,
    values='revenue',
    index='store',
    columns='product',
    aggfunc=['sum', 'mean'],
    fill_value=0
)
print(pivot_multi)

In [None]:
# Pivot table with margins (totals)
print("Pivot table with totals:")
pivot_margins = pd.pivot_table(
    sales,
    values='revenue',
    index='store',
    columns='product',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)
print(pivot_margins)

### Reshaping with `melt()` and `stack()/unstack()`

In [None]:
# Create wide format data
wide_data = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'maths': [85, 90, 78],
    'english': [88, 82, 95],
    'science': [92, 88, 85]
})
print("Wide format:")
print(wide_data)

In [None]:
# Melt - wide to long format
long_data = pd.melt(
    wide_data,
    id_vars=['name'],
    value_vars=['maths', 'english', 'science'],
    var_name='subject',
    value_name='score'
)
print("Long format (melted):")
print(long_data)

In [None]:
# Pivot - long to wide format
back_to_wide = long_data.pivot(index='name', columns='subject', values='score')
print("Back to wide format (pivoted):")
print(back_to_wide)
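
`stack()` and `unstack()` reshape data by moving labels between the column axis and the row index. The sketch below uses a small hypothetical frame `scores`, indexed by name, to show the round trip.

```python
import pandas as pd

scores = pd.DataFrame(
    {'maths': [85, 90], 'english': [88, 82]},
    index=pd.Index(['Alice', 'Bob'], name='name'),
)

# stack(): column labels become an inner index level, giving a Series
stacked = scores.stack()

# unstack(): the inner index level moves back out to the columns
restored = stacked.unstack()

print(stacked)
print(restored)
```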

---

## 6. Apply, Map, and Lambda Functions

`apply()`, `map()`, and lambda functions let you run custom transformations on a Series or DataFrame when no built-in vectorised method fits.

In [None]:
# Sample DataFrame
df_apply = pd.DataFrame({
    'name': ['alice smith', 'bob jones', 'charlie brown'],
    'salary': [50000, 60000, 75000],
    'bonus_pct': [0.10, 0.15, 0.12]
})
print("Original DataFrame:")
print(df_apply)

In [None]:
# apply() on a Series - applies function to each element
df_apply['name_title'] = df_apply['name'].apply(str.title)
print("After applying str.title:")
print(df_apply)

In [None]:
# apply() with lambda function
df_apply['total_compensation'] = df_apply.apply(
    lambda row: row['salary'] * (1 + row['bonus_pct']),
    axis=1
)
print("With total compensation:")
print(df_apply)
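
Row-wise `apply(axis=1)` calls a Python function once per row, so simple arithmetic is usually better written directly on the columns. A minimal sketch comparing the two approaches on a hypothetical frame `pay`:

```python
import pandas as pd

pay = pd.DataFrame({
    'salary': [50000, 60000, 75000],
    'bonus_pct': [0.10, 0.15, 0.12],
})

# Row-wise: one Python function call per row
row_wise = pay.apply(lambda row: row['salary'] * (1 + row['bonus_pct']), axis=1)

# Vectorised: the same arithmetic applied to whole columns at once
vectorised = pay['salary'] * (1 + pay['bonus_pct'])

print(vectorised)
```

Both produce the same values; the vectorised form tends to be considerably faster on large frames.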

In [None]:
# map() - for Series, maps values using a dictionary or function
salary_bands = {
    50000: 'Junior',
    60000: 'Mid',
    75000: 'Senior'
}
df_apply['band'] = df_apply['salary'].map(salary_bands)
print("With salary bands mapped:")
print(df_apply)
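
A dictionary passed to `map()` is an exact-match lookup, so any value that is not a key becomes `NaN`. For banding by range, `pd.cut()` is one alternative. The sketch below is illustrative; the bin edges and the series `salaries` are hypothetical.

```python
import numpy as np
import pandas as pd

salaries = pd.Series([50000, 60000, 75000, 62000])

# Exact-match lookup: 62000 is not a key, so it maps to NaN
exact = salaries.map({50000: 'Junior', 60000: 'Mid', 75000: 'Senior'})

# Range-based banding with hypothetical bin edges
banded = pd.cut(
    salaries,
    bins=[0, 55000, 70000, np.inf],
    labels=['Junior', 'Mid', 'Senior'],
    right=False,  # intervals are [lower, upper)
)

print(exact)
print(banded)
```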

In [None]:
# Custom function with apply
def categorise_compensation(total_comp: float) -> str:
    """Categorise total compensation into bands.
    
    Args:
        total_comp: The total compensation value.
    
    Returns:
        Category string based on compensation level.
    """
    if total_comp < 60000:
        return 'Low'
    elif total_comp < 75000:
        return 'Medium'
    else:
        return 'High'

df_apply['comp_category'] = df_apply['total_compensation'].apply(categorise_compensation)
print("With compensation category:")
print(df_apply)

In [None]:
# DataFrame.map() applies a function to every element (available from pandas 2.1)
# Note: applymap() is the older name for the same operation and is deprecated since pandas 2.1
numeric_df = pd.DataFrame({'A': [1.5, 2.3, 3.7], 'B': [4.2, 5.8, 6.1]})
rounded_df = numeric_df.map(lambda x: round(x))
print("Rounded DataFrame:")
print(rounded_df)

---

## 7. Handling Missing Data

Missing data is common in real-world datasets. Pandas represents missing values with `NaN` (Not a Number) in numeric columns, `None` or `NaN` in object columns, and `NaT` in datetime columns.

In [None]:
# Create DataFrame with missing values
df_missing = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, np.nan, 35, 28, np.nan],
    'city': ['London', 'Manchester', None, 'London', 'Leeds'],
    'salary': [50000, 60000, np.nan, np.nan, 65000]
})
print("DataFrame with missing values:")
print(df_missing)

In [None]:
# Check for missing values
print("Is null (boolean mask):")
print(df_missing.isnull())

print("\nCount of missing values per column:")
print(df_missing.isnull().sum())

print("\nPercentage of missing values:")
print((df_missing.isnull().sum() / len(df_missing)) * 100)
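
A frequent mistake is testing for missing values with `==`. `NaN` never compares equal to anything, including itself, which is why `isnull()` / `isna()` are needed. A short sketch with a hypothetical series `ages`:

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, np.nan, 35])

# Comparing with == never finds NaN: every element is False
equality_check = (ages == np.nan)

# isna() (an alias of isnull()) is True exactly where the value is missing
proper_check = ages.isna()

print(equality_check)
print(proper_check)
```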

In [None]:
# Drop rows with any missing values
print("Drop rows with any NaN:")
print(df_missing.dropna())

In [None]:
# Drop rows where specific column is missing
print("Drop rows where 'age' is NaN:")
print(df_missing.dropna(subset=['age']))

In [None]:
# Fill missing values with a constant
print("Fill all NaN with 'Unknown':")
print(df_missing.fillna('Unknown'))

In [None]:
# Fill with different values per column
fill_values = {
    'age': df_missing['age'].mean(),
    'city': 'Unknown',
    'salary': df_missing['salary'].median()
}
df_filled = df_missing.fillna(fill_values)
print("Fill with column-specific values:")
print(df_filled)

In [None]:
# Forward fill (propagate last valid value)
print("Forward fill:")
print(df_missing.ffill())

In [None]:
# Backward fill (propagate next valid value)
print("Backward fill:")
print(df_missing.bfill())

In [None]:
# Interpolate missing values (for numeric columns)
df_numeric = pd.DataFrame({'values': [1.0, np.nan, np.nan, 4.0, 5.0, np.nan, 7.0]})
print("Original:")
print(df_numeric)
print("\nInterpolated:")
print(df_numeric.interpolate())
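
By default `interpolate()` is linear: each gap is filled with evenly spaced values between its neighbours. The sketch below works through the same numbers as the cell above and also shows the `limit` parameter, which caps how many consecutive missing values are filled. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0, np.nan, 7.0])

# Linear interpolation: the gap between 1 and 4 becomes 2 and 3,
# and the gap between 5 and 7 becomes 6
filled = s.interpolate()

# limit=1 fills at most one consecutive NaN per gap (working forwards)
partial = s.interpolate(limit=1)

print(filled)
print(partial)
```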

---

## 8. String Operations

Pandas provides vectorised string operations through the `.str` accessor.

In [None]:
# Create DataFrame with string data
df_str = pd.DataFrame({
    'full_name': ['  John Smith  ', 'jane doe', 'BOB WILSON', 'Alice Brown-Jones'],
    'email': ['john@example.com', 'jane@test.co.uk', 'bob@sample.org', 'alice@demo.com'],
    'phone': ['020-1234-5678', '0161-987-6543', '0121-555-1234', '0113-222-3333']
})
print("Original DataFrame:")
print(df_str)

In [None]:
# Case transformations
print("Lowercase:")
print(df_str['full_name'].str.lower())

print("\nUppercase:")
print(df_str['full_name'].str.upper())

print("\nTitle case:")
print(df_str['full_name'].str.title())

In [None]:
# Strip whitespace
print("Stripped whitespace:")
print(df_str['full_name'].str.strip())

In [None]:
# Split strings
df_str['first_name'] = df_str['full_name'].str.strip().str.split().str[0]
df_str['last_name'] = df_str['full_name'].str.strip().str.split().str[-1]
print("After splitting names:")
print(df_str[['full_name', 'first_name', 'last_name']])

In [None]:
# Contains check (returns boolean)
print("Emails containing 'example':")
print(df_str[df_str['email'].str.contains('example')])
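
If the column contains missing values, the result of `str.contains()` can itself contain missing entries, and using it as a boolean mask can then fail. Passing `na=False` treats missing entries as non-matches. A minimal sketch with a hypothetical series `emails`:

```python
import pandas as pd

emails = pd.Series(['john@example.com', None, 'bob@sample.org'])

# na=False: rows with a missing email count as "does not contain"
mask = emails.str.contains('example', na=False)

# regex=False treats the pattern as literal text, so '.' matches only a real dot
literal = emails.str.contains('.com', regex=False, na=False)

print(emails[mask])
```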

In [None]:
# Replace strings
print("Replace '-' with space in phone:")
print(df_str['phone'].str.replace('-', ' ', regex=False))

In [None]:
# Extract using regex
print("Extract email domain:")
df_str['domain'] = df_str['email'].str.extract(r'@(.+)$')
print(df_str[['email', 'domain']])

In [None]:
# String length
print("Email lengths:")
print(df_str['email'].str.len())

In [None]:
# Starts with (str.endswith() works the same way for suffixes)
print("Names starting with 'J':")
print(df_str[df_str['full_name'].str.strip().str.lower().str.startswith('j')])

---

## 9. Date/Time Operations

Pandas has excellent support for working with dates and times through the `datetime64` dtype and `.dt` accessor.

In [None]:
# Create DataFrame with datetime data
df_datetime = pd.DataFrame({
    'event': ['Meeting', 'Conference', 'Workshop', 'Presentation', 'Review'],
    'date_str': ['2025-01-15', '2025-02-20', '2025-03-10', '2025-04-05', '2025-05-25'],
    'timestamp': pd.to_datetime(['2025-01-15 09:30:00', '2025-02-20 14:00:00',
                                  '2025-03-10 10:00:00', '2025-04-05 15:30:00',
                                  '2025-05-25 11:00:00'])
})
print("DataFrame with datetime:")
print(df_datetime)
print("\nData types:")
print(df_datetime.dtypes)

In [None]:
# Convert string to datetime
df_datetime['date'] = pd.to_datetime(df_datetime['date_str'])
print("After conversion:")
print(df_datetime.dtypes)
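
Real data rarely arrives in ISO format. Supplying an explicit `format` removes ambiguity between day-first and month-first dates, and `errors='coerce'` turns unparseable entries into `NaT` instead of raising. A sketch with a hypothetical series `raw`:

```python
import pandas as pd

raw = pd.Series(['15/01/2025', '20/02/2025', 'not a date'])

# Explicit day/month/year format; invalid strings become NaT
parsed = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')

print(parsed)
```

Counting `parsed.isna()` afterwards is a quick way to see how many values failed to parse.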

In [None]:
# Extract datetime components
df_datetime['year'] = df_datetime['timestamp'].dt.year
df_datetime['month'] = df_datetime['timestamp'].dt.month
df_datetime['day'] = df_datetime['timestamp'].dt.day
df_datetime['hour'] = df_datetime['timestamp'].dt.hour
df_datetime['day_name'] = df_datetime['timestamp'].dt.day_name()
df_datetime['month_name'] = df_datetime['timestamp'].dt.month_name()

print("Extracted components:")
print(df_datetime[['event', 'timestamp', 'year', 'month', 'day', 'hour', 'day_name']])

In [None]:
# Date arithmetic (negative values mean the timestamp is in the past; output depends on the run date)
df_datetime['days_from_now'] = (df_datetime['timestamp'] - pd.Timestamp.now()).dt.days
print("Days from now:")
print(df_datetime[['event', 'timestamp', 'days_from_now']])
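
For reproducible results, date arithmetic can be done against a fixed reference instead of `pd.Timestamp.now()`. The sketch below assumes a hypothetical reference date and also shows adding a `pd.Timedelta`.

```python
import pandas as pd

events = pd.Series(pd.to_datetime(['2025-01-15 09:30:00', '2025-02-20 14:00:00']))
reference = pd.Timestamp('2025-02-01')  # hypothetical fixed reference date

# Subtracting timestamps gives timedeltas; .dt.days keeps the whole-day part,
# rounding towards negative infinity for dates before the reference
days_until = (events - reference).dt.days

# Shifting every timestamp forward by one week
one_week_later = events + pd.Timedelta(days=7)

print(days_until)
print(one_week_later)
```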

In [None]:
# Filter by date range
start_date = '2025-02-01'
end_date = '2025-04-30'
mask = (df_datetime['timestamp'] >= start_date) & (df_datetime['timestamp'] <= end_date)
print(f"Events between {start_date} and {end_date}:")
print(df_datetime.loc[mask, ['event', 'timestamp']])

In [None]:
# Create date range
date_range = pd.date_range(start='2025-01-01', periods=7, freq='D')
print("Date range (7 days):")
print(date_range)

In [None]:
# Resample time series data ('W' bins are weeks ending on Sunday; sales are random, so output varies per run)
ts_data = pd.DataFrame({
    'date': pd.date_range(start='2025-01-01', periods=30, freq='D'),
    'sales': np.random.randint(100, 500, 30)
})
ts_data.set_index('date', inplace=True)

print("Weekly sum of sales:")
print(ts_data.resample('W').sum())

In [None]:
# Rolling window calculations
ts_data['rolling_mean_7d'] = ts_data['sales'].rolling(window=7).mean()
print("With 7-day rolling mean:")
print(ts_data.head(10))
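
The first rows of a rolling calculation are `NaN` because the window is not yet full. The `min_periods` parameter relaxes this. A minimal sketch on a hypothetical series `daily`:

```python
import pandas as pd

daily = pd.Series([10, 20, 30, 40])

# Default: NaN until three observations are available
strict = daily.rolling(window=3).mean()

# min_periods=1: average whatever observations exist so far
lenient = daily.rolling(window=3, min_periods=1).mean()

print(strict)
print(lenient)
```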

---

## 10. Practice Questions

Test your Pandas skills with these practice questions. Each question has a hidden solution that you can reveal after attempting it yourself.

In [None]:
# Create sample datasets for practice questions
np.random.seed(42)

# Employee dataset
employees_practice = pd.DataFrame({
    'employee_id': range(1, 21),
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry', 
             'Ivy', 'Jack', 'Kate', 'Leo', 'Mia', 'Noah', 'Olivia', 'Peter', 
             'Quinn', 'Rose', 'Sam', 'Tina'],
    'department': np.random.choice(['Engineering', 'Sales', 'HR', 'Marketing', 'Finance'], 20),
    'salary': np.random.randint(40000, 100000, 20),
    'years_experience': np.random.randint(1, 15, 20),
    'hire_date': pd.date_range(start='2015-01-01', periods=20, freq='120D'),
    'performance_score': np.random.choice([np.nan, 3.0, 3.5, 4.0, 4.5, 5.0], 20)
})

# Transactions dataset
transactions = pd.DataFrame({
    'transaction_id': range(1001, 1031),
    'employee_id': np.random.choice(range(1, 21), 30),
    'amount': np.random.uniform(50, 5000, 30).round(2),
    'category': np.random.choice(['Travel', 'Equipment', 'Training', 'Supplies', 'Software'], 30),
    'transaction_date': pd.date_range(start='2025-01-01', periods=30, freq='D')
})

print("Employees Dataset:")
print(employees_practice.head())
print(f"\nShape: {employees_practice.shape}")

print("\n" + "="*50)
print("\nTransactions Dataset:")
print(transactions.head())
print(f"\nShape: {transactions.shape}")

### Question 1: Basic Filtering

Find all employees in the Engineering department who have more than 5 years of experience. Display only the `name`, `department`, `salary`, and `years_experience` columns.

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
result = employees_practice[
    (employees_practice['department'] == 'Engineering') & 
    (employees_practice['years_experience'] > 5)
][['name', 'department', 'salary', 'years_experience']]
print(result)
```

</details>

### Question 2: GroupBy with Multiple Aggregations

Calculate the following statistics for each department:
- Total number of employees
- Average salary
- Maximum years of experience
- Minimum salary

Sort the results by average salary in descending order.

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
dept_stats = employees_practice.groupby('department').agg(
    employee_count=('employee_id', 'count'),
    avg_salary=('salary', 'mean'),
    max_experience=('years_experience', 'max'),
    min_salary=('salary', 'min')
).sort_values('avg_salary', ascending=False)

print(dept_stats)
```

</details>

### Question 3: Merging DataFrames

Merge the `employees_practice` and `transactions` DataFrames to show each transaction along with the employee's name and department. Include all transactions even if the employee information is missing.

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
merged_data = pd.merge(
    transactions,
    employees_practice[['employee_id', 'name', 'department']],
    on='employee_id',
    how='left'
)
print(merged_data.head(10))
```

</details>

### Question 4: Pivot Table

Create a pivot table showing the total transaction amount for each department and category combination. Include row and column totals.

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
# First merge to get department info
merged_for_pivot = pd.merge(
    transactions,
    employees_practice[['employee_id', 'department']],
    on='employee_id'
)

# Create pivot table
pivot_result = pd.pivot_table(
    merged_for_pivot,
    values='amount',
    index='department',
    columns='category',
    aggfunc='sum',
    fill_value=0,
    margins=True,
    margins_name='Total'
)
print(pivot_result.round(2))
```

</details>

### Question 5: Handling Missing Data

In the `employees_practice` DataFrame:
1. Count how many employees have missing `performance_score` values
2. Fill the missing `performance_score` values with the average score of their respective department
3. Show the employees who had missing scores (before and after filling)

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
# Count missing values
missing_count = employees_practice['performance_score'].isnull().sum()
print(f"Number of missing performance scores: {missing_count}")

# Identify employees with missing scores
missing_mask = employees_practice['performance_score'].isnull()
print("\nEmployees with missing scores:")
print(employees_practice.loc[missing_mask, ['name', 'department', 'performance_score']])

# Fill with department average using transform
# (a department with no recorded scores at all keeps NaN, since its mean is NaN)
employees_filled = employees_practice.copy()
employees_filled['performance_score'] = employees_filled.groupby('department')['performance_score'].transform(
    lambda x: x.fillna(x.mean())
)

print("\nAfter filling with department averages:")
print(employees_filled.loc[missing_mask, ['name', 'department', 'performance_score']])
```

</details>

### Question 6: Apply and Lambda Functions

Create a new column called `salary_band` in the `employees_practice` DataFrame that categorises employees as:
- 'Junior' if salary < 50000
- 'Mid' if salary >= 50000 and < 70000
- 'Senior' if salary >= 70000 and < 85000
- 'Lead' if salary >= 85000

Then show the count of employees in each salary band.

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
def assign_salary_band(salary: int) -> str:
    """Assign a salary band based on salary value.
    
    Args:
        salary: The employee's salary.
    
    Returns:
        The salary band category.
    """
    if salary < 50000:
        return 'Junior'
    elif salary < 70000:
        return 'Mid'
    elif salary < 85000:
        return 'Senior'
    else:
        return 'Lead'

employees_practice['salary_band'] = employees_practice['salary'].apply(assign_salary_band)

print("Salary band distribution:")
print(employees_practice['salary_band'].value_counts())

print("\nSample of results:")
print(employees_practice[['name', 'salary', 'salary_band']].head(10))
```

</details>

### Question 7: String Operations

Using the transactions DataFrame, create a new column `category_code` that contains the first 3 letters of the category in uppercase. Then filter for transactions where the category code starts with 'TR'.

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
transactions['category_code'] = transactions['category'].str[:3].str.upper()

print("All transactions with category codes:")
print(transactions[['transaction_id', 'category', 'category_code']].head(10))

print("\nTransactions where category code starts with 'TR':")
tr_transactions = transactions[transactions['category_code'].str.startswith('TR')]
print(tr_transactions[['transaction_id', 'category', 'category_code', 'amount']])
```

</details>

### Question 8: Date/Time Operations

Using the `employees_practice` DataFrame:
1. Calculate how many days each employee has been with the company (from their hire date to today)
2. Find employees who were hired in the first quarter (January-March) of any year
3. Calculate the average tenure (in years) by department

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
# Calculate days with company
today = pd.Timestamp.now()
employees_practice['days_employed'] = (today - employees_practice['hire_date']).dt.days
employees_practice['years_employed'] = employees_practice['days_employed'] / 365.25

print("Employee tenure:")
print(employees_practice[['name', 'hire_date', 'days_employed', 'years_employed']].head())

# Find Q1 hires
q1_hires = employees_practice[employees_practice['hire_date'].dt.month.isin([1, 2, 3])]
print("\nEmployees hired in Q1:")
print(q1_hires[['name', 'hire_date']])

# Average tenure by department
avg_tenure = employees_practice.groupby('department')['years_employed'].mean().round(2)
print("\nAverage tenure by department (years):")
print(avg_tenure.sort_values(ascending=False))
```

</details>

### Question 9: Complex Aggregation

For each employee, calculate:
1. Total transaction amount
2. Number of transactions
3. Average transaction amount
4. Most frequent transaction category

Then join this with the employee information to create a comprehensive view.

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
# Aggregate transaction data
transaction_summary = transactions.groupby('employee_id').agg(
    total_amount=('amount', 'sum'),
    transaction_count=('transaction_id', 'count'),
    avg_amount=('amount', 'mean'),
    most_frequent_category=('category', lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
).reset_index()

# Round the monetary values
transaction_summary['total_amount'] = transaction_summary['total_amount'].round(2)
transaction_summary['avg_amount'] = transaction_summary['avg_amount'].round(2)

# Merge with employee data
employee_transactions = pd.merge(
    employees_practice[['employee_id', 'name', 'department', 'salary']],
    transaction_summary,
    on='employee_id',
    how='left'
)

print("Employee Transaction Summary:")
print(employee_transactions)
```

</details>

### Question 10: Data Transformation Challenge

Create a summary report that shows:
1. For each department and salary band combination, show the count of employees and average performance score
2. Add a column showing what percentage of the department falls into each salary band
3. Sort by department and then by salary band

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
# Ensure salary_band column exists
if 'salary_band' not in employees_practice.columns:
    employees_practice['salary_band'] = employees_practice['salary'].apply(
        lambda x: 'Junior' if x < 50000 else ('Mid' if x < 70000 else ('Senior' if x < 85000 else 'Lead'))
    )

# Fill missing performance scores for this analysis
df_analysis = employees_practice.copy()
df_analysis['performance_score'] = df_analysis.groupby('department')['performance_score'].transform(
    lambda x: x.fillna(x.mean())
)

# Group by department and salary band
summary = df_analysis.groupby(['department', 'salary_band']).agg(
    employee_count=('employee_id', 'count'),
    avg_performance=('performance_score', 'mean')
).reset_index()

# Calculate department totals for percentage
dept_totals = summary.groupby('department')['employee_count'].transform('sum')
summary['pct_of_department'] = (summary['employee_count'] / dept_totals * 100).round(1)

# Round performance score
summary['avg_performance'] = summary['avg_performance'].round(2)

# Sort by department and salary band
band_order = ['Junior', 'Mid', 'Senior', 'Lead']
summary['salary_band'] = pd.Categorical(summary['salary_band'], categories=band_order, ordered=True)
summary = summary.sort_values(['department', 'salary_band'])

print("Department and Salary Band Summary:")
print(summary.to_string(index=False))
```

</details>
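
The nested lambda in the answer above can also be written with `pd.cut`, which is vectorised and returns an ordered categorical directly, so the bands sort in a meaningful order without a separate `pd.Categorical` step. A minimal sketch, assuming the same band boundaries; the `demo` frame and its values are hypothetical and are not the notebook's `employees_practice` data:

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({
    'department': ['Eng', 'Eng', 'Eng', 'Sales', 'Sales'],
    'salary': [45000, 50000, 90000, 69999, 85000],
})

# right=False makes each bin closed on the left, e.g. [50000, 70000),
# which matches the `x < 50000`, `x < 70000`, `x < 85000` tests in the lambda
demo['salary_band'] = pd.cut(
    demo['salary'],
    bins=[0, 50000, 70000, 85000, float('inf')],
    labels=['Junior', 'Mid', 'Senior', 'Lead'],
    right=False,
)

# The result is an ordered categorical, so sorting follows band order
demo_sorted = demo.sort_values(['department', 'salary_band'])
print(demo_sorted)
```

One design difference to keep in mind: `pd.cut` returns `NaN` for values outside the bins (here, negative salaries), whereas the lambda would label them `'Junior'`.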

### Question 11: Window Functions

Using the transactions DataFrame:
1. Add a column showing the running total of transaction amounts (cumulative sum)
2. Add a column showing the 3-day rolling average of transaction amounts
3. Add a column showing the rank of each transaction by amount within each category

In [None]:
# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
# Sort by date first
transactions_sorted = transactions.sort_values('transaction_date').copy()

# Running total (cumulative sum)
transactions_sorted['running_total'] = transactions_sorted['amount'].cumsum().round(2)

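# 3-period rolling average: window=3 counts rows, so this is a true 3-day
# average only if there is exactly one transaction per day
transactions_sorted['rolling_avg_3d'] = transactions_sorted['amount'].rolling(window=3).mean().round(2)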

# Rank within each category
transactions_sorted['rank_in_category'] = transactions_sorted.groupby('category')['amount'].rank(
    method='dense', ascending=False
).astype(int)

print("Transactions with window functions:")
print(transactions_sorted[['transaction_date', 'category', 'amount', 
                           'running_total', 'rolling_avg_3d', 'rank_in_category']].head(15))
```

</details>
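
`rolling(window=3)` averages the current row and the two rows before it. If several transactions share a date, or some dates have none, a time-based window is closer to the wording "3-day rolling average". The sketch below compares the two, assuming the date column has a datetime dtype and is sorted; the `tx` frame is hypothetical and is not the notebook's `transactions` data:

```python
import pandas as pd

# Hypothetical data: two transactions on one day, then a gap
tx = pd.DataFrame({
    'transaction_date': pd.to_datetime(
        ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-05']
    ),
    'amount': [100.0, 50.0, 30.0, 20.0],
}).sort_values('transaction_date', kind='stable')

# Row-based window: always the last 3 rows, regardless of their dates
tx['rolling_3_rows'] = tx['amount'].rolling(window=3).mean()

# Time-based window: rows whose date falls within the 3 days up to and
# including the current row's date (requires a sorted datetime column)
tx['rolling_3_days'] = tx.rolling('3D', on='transaction_date')['amount'].mean()

print(tx)
```

The last row shows the difference: the row-based window averages three rows that span five calendar days, while the time-based window contains only the transaction from 5 January.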

### Question 12: Data Cleaning Challenge

Given the following messy DataFrame, clean it by:
1. Standardising the name format (title case, stripped whitespace)
2. Fixing the salary column (remove currency symbols and commas, convert to numeric)
3. Standardising the department names (consistent capitalisation)
4. Converting the date column to datetime format

In [None]:
# Messy DataFrame to clean
messy_data = pd.DataFrame({
    'name': ['  john smith  ', 'JANE DOE', 'bob Wilson', '  alice BROWN  '],
    'salary': ['£50,000', '£65,000', '45000', '£72,500'],
    'department': ['engineering', 'SALES', 'Engineering', 'sales'],
    'start_date': ['01/03/2020', '15-06-2021', '2022/01/10', '01-12-2019']
})
print("Messy data:")
print(messy_data)

# Your solution here


<details>
<summary>Click to reveal answer</summary>

```python
clean_data = messy_data.copy()

# Standardise names
clean_data['name'] = clean_data['name'].str.strip().str.title()

# Fix salary - remove £ and commas, convert to numeric
clean_data['salary'] = clean_data['salary'].str.replace('£', '', regex=False)
clean_data['salary'] = clean_data['salary'].str.replace(',', '', regex=False)
clean_data['salary'] = pd.to_numeric(clean_data['salary'])

# Standardise department names
clean_data['department'] = clean_data['department'].str.strip().str.title()

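# Convert dates - handle mixed formats
# format='mixed' (pandas 2.0+) parses each value on its own; without it, the
# format inferred from the first value is applied to every row and the
# differently formatted rows raise a ValueError.
# dayfirst=True reads ambiguous values such as 01/03/2020 as 1 March.
clean_data['start_date'] = pd.to_datetime(
    clean_data['start_date'], format='mixed', dayfirst=True
)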

print("Cleaned data:")
print(clean_data)
print("\nData types:")
print(clean_data.dtypes)
```

</details>
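
When the set of date formats is known in advance, an alternative is to parse each format explicitly and combine the results, which avoids relying on inference for ambiguous strings. A minimal sketch: the list of formats is an assumption based on the four sample strings above, and `raw` simply repeats those strings.

```python
import pandas as pd

raw = pd.Series(['01/03/2020', '15-06-2021', '2022/01/10', '01-12-2019'])

# Assumed formats: day-first with slashes, day-first with dashes, year-first
formats = ['%d/%m/%Y', '%d-%m-%Y', '%Y/%m/%d']

# errors='coerce' turns values that do not match a format into NaT
attempts = [pd.to_datetime(raw, format=fmt, errors='coerce') for fmt in formats]

# Keep the first successful parse for each row
parsed = attempts[0]
for attempt in attempts[1:]:
    parsed = parsed.fillna(attempt)

print(parsed)
```

Any value that matches none of the listed formats stays `NaT`, so `parsed.isna()` flags rows that need manual attention.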

---

## Summary

This notebook covered the essential Pandas concepts commonly tested in data science assessments:

1. **DataFrame Creation** - Multiple methods to create DataFrames from dictionaries, lists, and arrays
2. **Indexing and Selection** - Using `.loc[]`, `.iloc[]`, and boolean indexing for data access
3. **Merging and Joining** - Combining DataFrames with `merge()`, `join()`, and `concat()`
4. **GroupBy Operations** - Split-apply-combine pattern with aggregations
5. **Pivot Tables** - Reshaping data for analysis and reporting
6. **Apply and Lambda** - Custom transformations on data
7. **Missing Data** - Detection, removal, and imputation strategies
8. **String Operations** - Vectorised string manipulation with `.str` accessor
9. **Date/Time Operations** - Working with temporal data using `.dt` accessor

### Key Tips for Assessments

- Always check data types with `.dtypes` before operations
- Use `.shape` and `.info()` to understand your data structure
- Prefer vectorised operations over loops for better performance
- Use `.copy()` when you need to modify a DataFrame without affecting the original
- Remember that `.loc[]` uses labels and `.iloc[]` uses integer positions
- Use parentheses around each condition in boolean indexing with `&` and `|`
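
Three of the tips above (parentheses in boolean indexing, `.loc[]` versus `.iloc[]`, and `.copy()`) can be seen in a short sketch. The `df` frame here is hypothetical and is used only for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [25, 30, 35], 'city': ['London', 'Leeds', 'London']},
    index=['a', 'b', 'c'],
)

# Parentheses are required because & binds more tightly than > and ==
mask = (df['age'] > 26) & (df['city'] == 'London')
filtered = df[mask]

# .loc selects by label, .iloc by integer position
by_label = df.loc['b', 'age']
by_position = df.iloc[1, 0]

# .copy() returns an independent object, so the original is unchanged
df_copy = df.copy()
df_copy.loc['a', 'age'] = 99

print(filtered)
print(by_label, by_position)
print(df.loc['a', 'age'], df_copy.loc['a', 'age'])
```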

### Additional Resources

- [Pandas Official Documentation](https://pandas.pydata.org/docs/)
- [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)