# 09 - Pandas Exercises

## Introduction

Practice makes perfect! This notebook contains exercises to reinforce what you've learned. Try to solve each exercise before looking at solutions.

## Instructions

1. Read each exercise carefully
2. Write your solution in the code cell
3. Run your code to verify it works
4. Compare with the solution (if provided)


## Exercise 1: Basic Operations

Create a DataFrame with the following data:
- Name: ['Alice', 'Bob', 'Charlie', 'Diana']
- Age: [25, 30, 35, 28]
- Salary: [50000, 60000, 70000, 55000]

Then:
1. Display the first 2 rows
2. Get the shape of the DataFrame
3. Calculate the average age
4. Find the person with the highest salary


In [20]:
# Your solution here
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000],
})

# 1. First two rows
print(df.head(2))
# 2. Shape as (rows, columns)
print(df.shape)
# 3. Average age
print(df['Age'].mean())
# 4. Row of the person with the highest salary
print(df.loc[df['Salary'].idxmax()])


    Name  Age  Salary
0  Alice   25   50000
1    Bob   30   60000
(4, 3)
29.5
Name      Charlie
Age            35
Salary      70000
Name: 2, dtype: object
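
As a possible alternative for task 4, `nlargest` returns the top rows as a DataFrame instead of a single Series, which can be convenient when more than one top earner is of interest. This is a minimal sketch that rebuilds the exercise data; the variable names `top_earner` and `top_name` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000],
})

# nlargest(1, 'Salary') keeps the single row with the largest Salary
top_earner = df.nlargest(1, 'Salary')

# Pull the name out of that one-row DataFrame
top_name = top_earner['Name'].iloc[0]
print(top_name)
```

Passing a larger first argument, for example `df.nlargest(2, 'Salary')`, would return the two highest-paid people in descending order.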


## Exercise 2: Filtering

Using the DataFrame from Exercise 1:
1. Filter people older than 28
2. Filter people with salary between 55000 and 65000
3. Select only the Name and Salary columns for people in the IT department (add a Department column first)


In [None]:
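# 1. People older than 28
print(df[df['Age'] > 28])
# 2. Salary strictly between 55000 and 65000
#    (use df['Salary'].between(55000, 65000) to include both endpoints)
print(df[(df['Salary'] > 55000) & (df['Salary'] < 65000)])
# 3. Add a Department column first; these values are made up for the exercise
df['Department'] = ['IT', 'HR', 'IT', 'Finance']
print(df.loc[df['Department'] == 'IT', ['Name', 'Salary']])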
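
The same filters can also be written with `Series.between` and `DataFrame.query`, which some people find easier to read. The sketch below rebuilds the Exercise 1 data and assumes a hypothetical Department column, since the exercise does not specify its values.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000],
})
# Assumed values: the exercise leaves the departments up to you
df['Department'] = ['IT', 'HR', 'IT', 'Finance']

# between() includes both endpoints by default, so 55000 itself matches
mid_salary = df[df['Salary'].between(55000, 65000)]
print(mid_salary)

# query() takes the condition as a string; column selection follows as usual
it_people = df.query("Department == 'IT'")[['Name', 'Salary']]
print(it_people)
```

Whether the salary bounds should be inclusive depends on how you read "between"; the comparison operators give the strict version and `between` the inclusive one.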


## Exercise 3: Data Cleaning

Create a DataFrame with missing values and duplicates:
- Name: ['Alice', 'Bob', 'Charlie', 'Bob', 'Eve']
- Age: [25, 30, None, 30, 32]
- City: ['New York', 'London', None, 'London', 'Sydney']

Tasks:
1. Count missing values in each column
2. Fill missing Age with the mean age
3. Fill missing City with 'Unknown'
4. Remove duplicate rows


In [None]:
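# Your solution here
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Eve'],
    'Age': [25, 30, None, 30, 32],
    'City': ['New York', 'London', None, 'London', 'Sydney'],
})

# 1. Missing values per column
print(df.isnull().sum())
# 2. fillna returns a new DataFrame, so assign the result back
df = df.fillna({'Age': df['Age'].mean()})
print(df['Age'])
# 3. Fill missing City with 'Unknown'
df = df.fillna({'City': 'Unknown'})
print(df['City'])
# 4. drop_duplicates removes repeated rows (dropna would drop rows with missing values)
df = df.drop_duplicates()
print(df)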
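
One way to check a cleaning step is to chain the operations and then confirm that no missing values or duplicates remain. The following is a sketch under the exercise's data; the name `cleaned` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Eve'],
    'Age': [25, 30, None, 30, 32],
    'City': ['New York', 'London', None, 'London', 'Sydney'],
})

# duplicated() flags rows that repeat an earlier row (here the second Bob)
print(df.duplicated())

# Fill both columns in one call, drop repeated rows, and renumber the index
cleaned = (
    df.fillna({'Age': df['Age'].mean(), 'City': 'Unknown'})
      .drop_duplicates()
      .reset_index(drop=True)
)
print(cleaned)

# Sanity checks: nothing missing, nothing duplicated
print(cleaned.isnull().sum().sum())
print(cleaned.duplicated().sum())
```

The mean is computed on the original column, so the missing age is filled with the average of the four known ages.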

## Exercise 4: GroupBy and Aggregation

Create sales data:
- Product: ['A', 'A', 'B', 'B', 'C', 'C']
- Sales: [100, 150, 200, 180, 300, 250]
- Region: ['North', 'South', 'North', 'South', 'North', 'South']

Tasks:
1. Calculate total sales by product
2. Calculate average sales by region
3. Find the product with highest total sales


In [39]:
# Your solution here
sales_data = {
    'Product': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Sales': [100, 150, 200, 180, 300, 250],
    'Region': ['North', 'South', 'North', 'South', 'North', 'South']
}
df = pd.DataFrame(sales_data)

# 1. Total sales by product (computed once and reused in task 3)
total_by_product = df.groupby('Product')['Sales'].sum()
print(total_by_product)
# 2. Average sales by region
print(df.groupby('Region')['Sales'].mean())
# 3. Product with the highest total sales
print(f"Product with highest sales is {total_by_product.idxmax()} and total amount is {total_by_product.max()}")



Product
A    250
B    380
C    550
Name: Sales, dtype: int64
Region
North    200.000000
South    193.333333
Name: Sales, dtype: float64
Product with highest sales is C and total amount is 550
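
When several statistics per group are needed, named aggregation with `agg` can produce them in one pass. This is a small sketch on the same sales data; the output column names `total_sales`, `average_sales`, and `rows` are illustrative choices, not part of the exercise.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Sales': [100, 150, 200, 180, 300, 250],
    'Region': ['North', 'South', 'North', 'South', 'North', 'South'],
})

# Each keyword is an output column: (source column, aggregation function)
summary = df.groupby('Product').agg(
    total_sales=('Sales', 'sum'),
    average_sales=('Sales', 'mean'),
    rows=('Sales', 'count'),
)
print(summary)

# The best-selling product is the index label of the largest total
best_product = summary['total_sales'].idxmax()
print(best_product)
```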


## Exercise 5: Merging DataFrames

Create two DataFrames:
- df1: Name, Age, City
- df2: Name, Salary, Department

Merge them:
1. Perform an inner join
2. Perform a left join
3. Compare the results


In [48]:
# Your solution here
df1 = pd.DataFrame({
    'Name': ['Brijesh', 'Kartikey', 'Kshtiz', 'Manu'],
    'Age': [24, 23, 25, 22],
    'City': ['Pratapgarh', 'Bijnor', 'Meerut', 'Saharanpur']
})
df2 = pd.DataFrame({
    'Name': ['Harsh', 'Kartikey', 'Kalash', 'Manu'],
    'Salary': [24000, 23000, 25000, 22000],
    'Department': ['IT', 'IT', 'CS', 'ECE']
})

# 1. Inner join: only names present in both DataFrames
inner_join = pd.merge(df1, df2, on="Name", how="inner")

# 2. Left join: every row of df1, with NaN where df2 has no match
left_join = pd.merge(df1, df2, on="Name", how="left")

# 3. Compare the results: the left join keeps Brijesh and Kshtiz with NaN,
#    and the NaN values turn Salary into a float column
print(f"Inner Join:\n{inner_join}")
print(f"Left Join:\n{left_join}")


Inner Join:
       Name  Age        City  Salary Department
0  Kartikey   23      Bijnor   23000         IT
1      Manu   22  Saharanpur   22000        ECE
Left Join:
       Name  Age        City   Salary Department
0   Brijesh   24  Pratapgarh      NaN        NaN
1  Kartikey   23      Bijnor  23000.0         IT
2    Kshtiz   25      Meerut      NaN        NaN
3      Manu   22  Saharanpur  22000.0        ECE
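
To see exactly which rows each join keeps or drops, one option is an outer join with `indicator=True`, which adds a `_merge` column naming the source of every row. The sketch below reuses the two DataFrames from this exercise; `outer` and `left_only_names` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Brijesh', 'Kartikey', 'Kshtiz', 'Manu'],
    'Age': [24, 23, 25, 22],
    'City': ['Pratapgarh', 'Bijnor', 'Meerut', 'Saharanpur'],
})
df2 = pd.DataFrame({
    'Name': ['Harsh', 'Kartikey', 'Kalash', 'Manu'],
    'Salary': [24000, 23000, 25000, 22000],
    'Department': ['IT', 'IT', 'CS', 'ECE'],
})

# _merge is 'both', 'left_only', or 'right_only' for each row
outer = pd.merge(df1, df2, on='Name', how='outer', indicator=True)
print(outer[['Name', '_merge']])

# Rows an inner join would drop but a left join would keep
left_only_names = outer.loc[outer['_merge'] == 'left_only', 'Name'].tolist()
print(left_only_names)
```

Rows marked `both` are what the inner join returns; `both` plus `left_only` is what the left join returns.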


## Exercise 6: Complete Data Engineering Task

**Scenario:** You have sales data that needs processing.

**Tasks:**
1. Create a CSV file with sales data (Date, Product, Quantity, Price)
2. Read the CSV file
3. Calculate Revenue (Quantity * Price)
4. Handle any missing values
5. Calculate total revenue by product
6. Save the processed data to a new CSV file

**Bonus:** Add date operations (convert Date to datetime, extract month)


In [53]:
# Your solution here
# 1. Create the CSV; index=False keeps the row index out of the file
data = {
    'Date': ['2025-12-25', '2025-12-26', '2025-12-27', '2025-12-28', '2025-12-29'],
    'Product': ['A', 'B', 'C', 'A', 'B'],
    'Quantity': [10, 5, 8, 12, 7],
    'Price': [100, 200, 150, 100, 200]
}
df = pd.DataFrame(data)
df.to_csv("first.csv", index=False)
print("CSV file has been created!")

# 2. Read the CSV back and continue with the loaded data
print("--------------------------------------------------------------------------------")
df = pd.read_csv('first.csv')
print(df)

# 3. Revenue per row
df['Revenue'] = df['Quantity'] * df['Price']

# 4. fillna returns a new DataFrame, so assign the result back
df = df.fillna(0)

# 5. Total revenue by product
print("--------------------------------------------------------------------------------")
print(df.groupby('Product')['Revenue'].sum())

# 6. Save the processed data
df.to_csv('processed_sales_data.csv', index=False)


CSV file has been created!
--------------------------------------------------------------------------------
         Date Product  Quantity  Price
0  2025-12-25       A        10    100
1  2025-12-26       B         5    200
2  2025-12-27       C         8    150
3  2025-12-28       A        12    100
4  2025-12-29       B         7    200
--------------------------------------------------------------------------------
Product
A    2200
B    2400
C    1200
Name: Revenue, dtype: int64
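
For the bonus task, a possible approach is to convert Date with `pd.to_datetime` and read the month through the `.dt` accessor. This sketch builds the data in memory rather than reading the CSV so that it runs on its own; the `Month` column and the `monthly_revenue` name are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2025-12-25', '2025-12-26', '2025-12-27', '2025-12-28', '2025-12-29'],
    'Product': ['A', 'B', 'C', 'A', 'B'],
    'Quantity': [10, 5, 8, 12, 7],
    'Price': [100, 200, 150, 100, 200],
})

# Convert the text dates to datetime64 values
df['Date'] = pd.to_datetime(df['Date'])

# The .dt accessor exposes date parts such as the month number
df['Month'] = df['Date'].dt.month

# Revenue per row, then total revenue per month
df['Revenue'] = df['Quantity'] * df['Price']
monthly_revenue = df.groupby('Month')['Revenue'].sum()
print(monthly_revenue)
```

When loading from a file, `pd.read_csv('first.csv', parse_dates=['Date'])` should perform the same conversion during the read.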


## Summary

Great job completing the exercises! 

**Key Takeaways:**
- Practice is essential for mastering pandas
- Real-world data engineering combines multiple concepts
- Always test your code with sample data
- Data cleaning is often the most time-consuming step
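
Testing with sample data can be as simple as running a function on a tiny DataFrame whose answer is easy to work out by hand. The sketch below uses a hypothetical helper, `revenue_by_product`, that is not part of the exercises.

```python
import pandas as pd

def revenue_by_product(sales):
    """Hypothetical helper: total Quantity * Price for each Product."""
    revenue = sales['Quantity'] * sales['Price']
    return revenue.groupby(sales['Product']).sum()

# Small sample where the totals can be checked by hand:
# A = 1*10 + 3*10 = 40, B = 2*20 = 40
sample = pd.DataFrame({
    'Product': ['A', 'B', 'A'],
    'Quantity': [1, 2, 3],
    'Price': [10, 20, 10],
})

result = revenue_by_product(sample)
assert result.to_dict() == {'A': 40, 'B': 40}
print("Sample check passed")
```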

**Next Steps:**
- Try working with real datasets
- Explore more advanced pandas features
- Learn about performance optimization
- Practice with larger datasets
