# 10 - Pandas Exercise Solutions

## Introduction

This notebook contains complete solutions to all exercises from `09_exercise.ipynb`.

**Important:** Try solving the exercises yourself before looking at these solutions!

## How to Use

1. Attempt each exercise in `09_exercise.ipynb` first
2. Compare your solution with the solutions here
3. Understand the approach and logic
4. Try alternative solutions if possible


## Exercise 1: Basic Operations - Solution


In [1]:
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000]
})

print("DataFrame:")
print(df)
print()

# 1. Display the first 2 rows
print("First 2 rows:")
print(df.head(2))
print()

# 2. Get the shape of the DataFrame
print(f"Shape: {df.shape}")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
print()

# 3. Calculate the average age
avg_age = df['Age'].mean()
print(f"Average age: {avg_age}")
print()

# 4. Find the person with the highest salary
highest_salary_person = df.loc[df['Salary'].idxmax(), 'Name']
highest_salary = df['Salary'].max()
print(f"Person with highest salary: {highest_salary_person} (${highest_salary:,})")


DataFrame:
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    Diana   28   55000

First 2 rows:
    Name  Age  Salary
0  Alice   25   50000
1    Bob   30   60000

Shape: (4, 3)
Rows: 4, Columns: 3

Average age: 29.5

Person with highest salary: Charlie ($70,000)
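
As an alternative to `idxmax()`, the highest earner can be looked up with `nlargest()` or `sort_values()`. The sketch below rebuilds the same DataFrame so it runs on its own; the names `top_row` and `top_two` are illustrative and not part of the original exercise.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000]
})

# nlargest() returns whole rows, so the name and salary come back together
top_row = df.nlargest(1, 'Salary').iloc[0]
print(f"Person with highest salary: {top_row['Name']} (${top_row['Salary']:,})")

# Sorting gives the same answer and extends naturally to a top-N ranking
top_two = df.sort_values('Salary', ascending=False).head(2)
print(top_two[['Name', 'Salary']])
```

`idxmax()` is the most direct choice when only one label is needed; `nlargest()` tends to read better once the question becomes "top N".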


## Exercise 2: Filtering - Solution


In [2]:
# First, recreate the DataFrame from Exercise 1
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000]
})

# Add Department column
df['Department'] = ['IT', 'Sales', 'IT', 'Marketing']
print("DataFrame with Department:")
print(df)
print()

# 1. Filter people older than 28
older_than_28 = df[df['Age'] > 28]
print("People older than 28:")
print(older_than_28)
print()

# 2. Filter people with salary between 55000 and 65000
salary_range = df[(df['Salary'] >= 55000) & (df['Salary'] <= 65000)]
print("People with salary between 55000 and 65000:")
print(salary_range)
print()

# 3. Select only Name and Salary columns for people in IT department
it_people = df[df['Department'] == 'IT'][['Name', 'Salary']]
print("IT department - Name and Salary:")
print(it_people)


DataFrame with Department:
      Name  Age  Salary Department
0    Alice   25   50000         IT
1      Bob   30   60000      Sales
2  Charlie   35   70000         IT
3    Diana   28   55000  Marketing

People older than 28:
      Name  Age  Salary Department
1      Bob   30   60000      Sales
2  Charlie   35   70000         IT

People with salary between 55000 and 65000:
    Name  Age  Salary Department
1    Bob   30   60000      Sales
3  Diana   28   55000  Marketing

IT department - Name and Salary:
      Name  Salary
0    Alice   50000
2  Charlie   70000
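
The same three filters can also be written with `between()`, `.loc`, and `query()`. This is a sketch of equivalent spellings, not a replacement for the solution above; it recreates the DataFrame so the cell is self-contained.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000],
    'Department': ['IT', 'Sales', 'IT', 'Marketing']
})

# query() expresses the condition as a string
older_than_28 = df.query('Age > 28')

# between() is inclusive on both ends by default, matching >= and <=
salary_range = df[df['Salary'].between(55000, 65000)]

# .loc filters rows and selects columns in a single indexing step
it_people = df.loc[df['Department'] == 'IT', ['Name', 'Salary']]

print(older_than_28)
print(salary_range)
print(it_people)
```

Using `.loc` with a row mask and a column list avoids the chained `df[...][...]` indexing, which matters mostly when you later assign to the result.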


## Exercise 3: Data Cleaning - Solution


In [3]:
# Create DataFrame with missing values and duplicates
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Eve'],
    'Age': [25, 30, None, 30, 32],
    'City': ['New York', 'London', None, 'London', 'Sydney']
})

print("Original DataFrame:")
print(df)
print()

# 1. Count missing values in each column
print("Missing values per column:")
print(df.isnull().sum())
print()

# 2. Fill missing Age with the mean age
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)
print(f"After filling missing Age with mean ({mean_age:.2f}):")
print(df)
print()

# 3. Fill missing City with 'Unknown'
df['City'] = df['City'].fillna('Unknown')
print("After filling missing City with 'Unknown':")
print(df)
print()

# 4. Remove duplicate rows
df_cleaned = df.drop_duplicates()
print("After removing duplicates:")
print(df_cleaned)
print(f"\nOriginal shape: {df.shape}, Cleaned shape: {df_cleaned.shape}")


Original DataFrame:
      Name   Age      City
0    Alice  25.0  New York
1      Bob  30.0    London
2  Charlie   NaN      None
3      Bob  30.0    London
4      Eve  32.0    Sydney

Missing values per column:
Name    0
Age     1
City    1
dtype: int64

After filling missing Age with mean (29.25):
      Name    Age      City
0    Alice  25.00  New York
1      Bob  30.00    London
2  Charlie  29.25      None
3      Bob  30.00    London
4      Eve  32.00    Sydney

After filling missing City with 'Unknown':
      Name    Age      City
0    Alice  25.00  New York
1      Bob  30.00    London
2  Charlie  29.25   Unknown
3      Bob  30.00    London
4      Eve  32.00    Sydney

After removing duplicates:
      Name    Age      City
0    Alice  25.00  New York
1      Bob  30.00    London
2  Charlie  29.25   Unknown
4      Eve  32.00    Sydney

Original shape: (5, 3), Cleaned shape: (4, 3)
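
The mean is only one way to fill the missing `Age`. The sketch below compares it with the median and a forward fill on the same data; the `comparison` DataFrame is a hypothetical helper added here only to show the three results side by side.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Eve'],
    'Age': [25, 30, None, 30, 32],
    'City': ['New York', 'London', None, 'London', 'Sydney']
})

# Each column holds Age filled with a different strategy
comparison = pd.DataFrame({
    'mean': df['Age'].fillna(df['Age'].mean()),
    'median': df['Age'].fillna(df['Age'].median()),
    'ffill': df['Age'].ffill(),  # copies the previous row's value forward
})

# Row 2 (Charlie) is the only row that was missing
print(comparison.loc[[2]])
```

Here the mean gives 29.25 while the median and forward fill both give 30.0. Forward fill depends on row order, so it is usually a better fit for time-ordered data than for a table of people.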


## Exercise 4: GroupBy and Aggregation - Solution


In [4]:
# Create sales data
df = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Sales': [100, 150, 200, 180, 300, 250],
    'Region': ['North', 'South', 'North', 'South', 'North', 'South']
})

print("Sales Data:")
print(df)
print()

# 1. Calculate total sales by product
total_sales_by_product = df.groupby('Product')['Sales'].sum()
print("Total sales by product:")
print(total_sales_by_product)
print()

# 2. Calculate average sales by region
avg_sales_by_region = df.groupby('Region')['Sales'].mean()
print("Average sales by region:")
print(avg_sales_by_region)
print()

# 3. Find the product with highest total sales
top_product = total_sales_by_product.idxmax()
top_sales = total_sales_by_product.max()
print(f"Product with highest total sales: {top_product} (${top_sales})")


Sales Data:
  Product  Sales Region
0       A    100  North
1       A    150  South
2       B    200  North
3       B    180  South
4       C    300  North
5       C    250  South

Total sales by product:
Product
A    250
B    380
C    550
Name: Sales, dtype: int64

Average sales by region:
Region
North    200.000000
South    193.333333
Name: Sales, dtype: float64

Product with highest total sales: C ($550)
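
Several aggregations can be computed in one pass with `agg()`. This is a minimal sketch on the same sales data; the output column names `total_sales` and `avg_sales` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Sales': [100, 150, 200, 180, 300, 250],
    'Region': ['North', 'South', 'North', 'South', 'North', 'South']
})

# A list of function names yields one output column per aggregation
summary = df.groupby('Product')['Sales'].agg(['sum', 'mean', 'max'])
print(summary)

# Named aggregation lets you pick the output column names explicitly
by_region = df.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    avg_sales=('Sales', 'mean'),
)
print(by_region)
```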


## Exercise 5: Merging DataFrames - Solution


In [5]:
# Create df1: Name, Age, City
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Paris']
})

# Create df2: Name, Salary, Department
df2 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Eve', 'Frank'],
    'Salary': [50000, 60000, 70000, 55000],
    'Department': ['IT', 'Sales', 'IT', 'Marketing']
})

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
print()

# 1. Perform an inner join
inner_join = pd.merge(df1, df2, on='Name', how='inner')
print("Inner Join (only matching names):")
print(inner_join)
print()

# 2. Perform a left join
left_join = pd.merge(df1, df2, on='Name', how='left')
print("Left Join (all from df1):")
print(left_join)
print()

# 3. Compare the results
print("Comparison:")
print(f"Inner join rows: {len(inner_join)}")
print(f"Left join rows: {len(left_join)}")
print(f"\nInner join has only matching records: {inner_join['Name'].tolist()}")
print(f"Left join has all records from df1: {left_join['Name'].tolist()}")


DataFrame 1:
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Tokyo
3    Diana   28     Paris

DataFrame 2:
    Name  Salary Department
0  Alice   50000         IT
1    Bob   60000      Sales
2    Eve   70000         IT
3  Frank   55000  Marketing

Inner Join (only matching names):
    Name  Age      City  Salary Department
0  Alice   25  New York   50000         IT
1    Bob   30    London   60000      Sales

Left Join (all from df1):
      Name  Age      City   Salary Department
0    Alice   25  New York  50000.0         IT
1      Bob   30    London  60000.0      Sales
2  Charlie   35     Tokyo      NaN        NaN
3    Diana   28     Paris      NaN        NaN

Comparison:
Inner join rows: 2
Left join rows: 4

Inner join has only matching records: ['Alice', 'Bob']
Left join has all records from df1: ['Alice', 'Bob', 'Charlie', 'Diana']
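
The exercise only asks for inner and left joins, but right and outer joins use the same `how` parameter. The sketch below also passes `indicator=True`, which adds a `_merge` column showing where each row came from. Row order in the outer join may vary between pandas versions, so the example counts rows rather than relying on order.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Paris']
})
df2 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Eve', 'Frank'],
    'Salary': [50000, 60000, 70000, 55000],
    'Department': ['IT', 'Sales', 'IT', 'Marketing']
})

# Right join: every row from df2, with NaN where df1 has no match
right_join = pd.merge(df1, df2, on='Name', how='right')
print(right_join)

# Outer join: every row from both sides; _merge records the source
outer_join = pd.merge(df1, df2, on='Name', how='outer', indicator=True)
print(outer_join)
print(outer_join['_merge'].value_counts())
```

The `_merge` column is a quick way to verify a join: rows marked `left_only` or `right_only` are the ones an inner join would have dropped.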


## Exercise 6: Complete Data Engineering Task - Solution


In [6]:
# 1. Create a CSV file with sales data
sales_data = pd.DataFrame({
    'Date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19'],
    'Product': ['Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Mouse'],
    'Quantity': [10, 5, 8, None, 12],
    'Price': [999, 25, 999, 75, 25]
})

sales_data.to_csv('sales_exercise.csv', index=False)
print("1. CSV file created:")
print(sales_data)
print()


1. CSV file created:
         Date   Product  Quantity  Price
0  2024-01-15    Laptop      10.0    999
1  2024-01-16     Mouse       5.0     25
2  2024-01-17    Laptop       8.0    999
3  2024-01-18  Keyboard       NaN     75
4  2024-01-19     Mouse      12.0     25



In [7]:
# 2. Read the CSV file
df = pd.read_csv('sales_exercise.csv')
print("2. Data read from CSV:")
print(df)
print(f"Shape: {df.shape}")
print()


2. Data read from CSV:
         Date   Product  Quantity  Price
0  2024-01-15    Laptop      10.0    999
1  2024-01-16     Mouse       5.0     25
2  2024-01-17    Laptop       8.0    999
3  2024-01-18  Keyboard       NaN     75
4  2024-01-19     Mouse      12.0     25
Shape: (5, 4)



In [8]:
# 3. Calculate Revenue (Quantity * Price)
df['Revenue'] = df['Quantity'] * df['Price']
print("3. After calculating Revenue:")
print(df)
print()


3. After calculating Revenue:
         Date   Product  Quantity  Price  Revenue
0  2024-01-15    Laptop      10.0    999   9990.0
1  2024-01-16     Mouse       5.0     25    125.0
2  2024-01-17    Laptop       8.0    999   7992.0
3  2024-01-18  Keyboard       NaN     75      NaN
4  2024-01-19     Mouse      12.0     25    300.0



In [9]:
# 4. Handle any missing values
print("4. Missing values before handling:")
print(df.isnull().sum())
print()

# Fill missing Quantity with 0 (the mean or median are alternatives, depending on what a missing value means)
df['Quantity'] = df['Quantity'].fillna(0)

# Recalculate Revenue after filling missing values
df['Revenue'] = df['Quantity'] * df['Price']

print("After handling missing values:")
print(df)
print(f"Missing values after: {df.isnull().sum().sum()}")
print()


4. Missing values before handling:
Date        0
Product     0
Quantity    1
Price       0
Revenue     1
dtype: int64

After handling missing values:
         Date   Product  Quantity  Price  Revenue
0  2024-01-15    Laptop      10.0    999   9990.0
1  2024-01-16     Mouse       5.0     25    125.0
2  2024-01-17    Laptop       8.0    999   7992.0
3  2024-01-18  Keyboard       0.0     75      0.0
4  2024-01-19     Mouse      12.0     25    300.0
Missing values after: 0



In [10]:
# 5. Calculate total revenue by product
revenue_by_product = df.groupby('Product')['Revenue'].sum().sort_values(ascending=False)
print("5. Total revenue by product:")
print(revenue_by_product)
print()


5. Total revenue by product:
Product
Laptop      17982.0
Mouse         425.0
Keyboard        0.0
Name: Revenue, dtype: float64



In [11]:
# 6. Save the processed data to a new CSV file
df.to_csv('processed_sales_exercise.csv', index=False)
print("6. Processed data saved to 'processed_sales_exercise.csv'")
print("\nFinal processed DataFrame:")
print(df)
print()


6. Processed data saved to 'processed_sales_exercise.csv'

Final processed DataFrame:
         Date   Product  Quantity  Price  Revenue
0  2024-01-15    Laptop      10.0    999   9990.0
1  2024-01-16     Mouse       5.0     25    125.0
2  2024-01-17    Laptop       8.0    999   7992.0
3  2024-01-18  Keyboard       0.0     75      0.0
4  2024-01-19     Mouse      12.0     25    300.0



In [12]:
# Bonus: Add date operations (convert Date to datetime; extract month, month name, and year)
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month
df['Month_Name'] = df['Date'].dt.strftime('%B')
df['Year'] = df['Date'].dt.year

print("Bonus: After adding date operations:")
print(df[['Date', 'Product', 'Quantity', 'Price', 'Revenue', 'Month', 'Month_Name', 'Year']])
print()

# Revenue by month
revenue_by_month = df.groupby('Month_Name')['Revenue'].sum()
print("Revenue by month:")
print(revenue_by_month)


Bonus: After adding date operations:
        Date   Product  Quantity  Price  Revenue  Month Month_Name  Year
0 2024-01-15    Laptop      10.0    999   9990.0      1    January  2024
1 2024-01-16     Mouse       5.0     25    125.0      1    January  2024
2 2024-01-17    Laptop       8.0    999   7992.0      1    January  2024
3 2024-01-18  Keyboard       0.0     75      0.0      1    January  2024
4 2024-01-19     Mouse      12.0     25    300.0      1    January  2024

Revenue by month:
Month_Name
January    18407.0
Name: Revenue, dtype: float64
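
The `.dt` accessor offers more than month and year. The sketch below adds day-of-week and quarter columns for the same five dates; the column names `Day_Name`, `Day_Of_Week`, and `Quarter` are illustrative choices, not part of the exercise.

```python
import pandas as pd

dates = pd.DataFrame({
    'Date': pd.to_datetime([
        '2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19'
    ])
})

dates['Day_Name'] = dates['Date'].dt.day_name()     # e.g. 'Monday'
dates['Day_Of_Week'] = dates['Date'].dt.dayofweek   # Monday=0 ... Sunday=6
dates['Quarter'] = dates['Date'].dt.quarter         # 1 to 4

print(dates)
```

Once the data spans several months, grouping by `Month_Name` sorts the labels alphabetically; grouping by the numeric `Month` column keeps calendar order.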


## Summary


**Key Learning Points:**
- ✅ Basic operations: `head()`, `shape`, `mean()`, `idxmax()`
- ✅ Filtering: Boolean indexing with conditions
- ✅ Data cleaning: `isnull()`, `fillna()`, `drop_duplicates()`
- ✅ GroupBy: `groupby()` with aggregations
- ✅ Merging: Inner and left joins with `pd.merge()` (right and outer joins use the same `how` parameter)
- ✅ Complete workflow: Reading, cleaning, transforming, aggregating, saving

**Alternative Approaches:**
- You could use different methods for filling missing values (mean, median, forward fill)
- You could use different join types depending on your requirements
- You could add more date operations (day of week, quarter, etc.)

**Practice Tips:**
- Try modifying the solutions to see what happens
- Experiment with different parameters
- Chain multiple operations into one expression where it stays readable
- Always verify your results make sense!
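
As an example of combining operations, the Exercise 6 workflow can be written as a single method chain. This is a sketch under one assumption: the CSV content is held in a string (`csv_text`) and read through `io.StringIO`, so the cell runs without touching files on disk.

```python
import io
import pandas as pd

# Same rows as sales_exercise.csv, kept in memory for this sketch
csv_text = """Date,Product,Quantity,Price
2024-01-15,Laptop,10,999
2024-01-16,Mouse,5,25
2024-01-17,Laptop,8,999
2024-01-18,Keyboard,,75
2024-01-19,Mouse,12,25
"""

revenue_by_product = (
    pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])
    .fillna({'Quantity': 0})                                # handle missing values
    .assign(Revenue=lambda d: d['Quantity'] * d['Price'])   # add Revenue column
    .groupby('Product')['Revenue']
    .sum()
    .sort_values(ascending=False)
)

print(revenue_by_product)
```

Wrapping the chain in parentheses allows one step per line, which keeps it readable and makes it easy to comment out a step while debugging.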
