# Advanced Pandas Exercises: Online Learning Platform

This notebook contains 15 exercises to practice advanced Pandas operations and plotting in the context of an online learning platform. You are provided with two DataFrames:
- **df_courses**: Contains course enrollment records (e.g., EnrollmentID, EnrollmentDate, UserID, CourseFee, PaymentType, CourseSubject).
- **df_users**: Contains user details (e.g., UserID, UserName, ExperienceLevel, Region).

The first code cell simulates realistic datasets for you to work with. Run it to load `df_courses` and `df_users` into your environment.

Exercises 1–10 focus on data manipulation, while Exercises 11–15 introduce plotting with Pandas and Matplotlib. Complete each exercise by writing the necessary code in the provided cells.

**Note**: The plotting exercises need Matplotlib (`import matplotlib.pyplot as plt`). The setup cell imports it, so re-run that cell if you restart the kernel.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import random

# Set random seed for reproducibility
np.random.seed(42)
random.seed(42)

# Simulate df_users
n_users = 120
user_ids = [f'U{str(i).zfill(4)}' for i in range(1, n_users + 1)]
first_names = ['Alex', 'Priya', 'Jordan', 'Sofia', 'Kai', 'Maya', 'Ryan', 'Aisha', 'Lucas', 'Zara']
last_names = ['Patel', 'Nguyen', 'Kim', 'Lopez', 'Singh', 'Chen', 'Garcia', 'Ali', 'Rossi', 'Kumar']
user_names = [f'{random.choice(first_names)} {random.choice(last_names)}' for _ in range(n_users)]
experience_levels = np.random.randint(1, 11, size=n_users)
regions = ['North America', 'Europe', 'Asia', 'South America', 'Africa', 'Oceania']
user_regions = [random.choice(regions) for _ in range(n_users)]

df_users = pd.DataFrame({
    'UserID': user_ids,
    'UserName': user_names,
    'ExperienceLevel': experience_levels,
    'Region': user_regions
})

# Simulate df_courses
n_enrollments = 600
enrollment_ids = [f'E{str(i).zfill(5)}' for i in range(1, n_enrollments + 1)]
base_date = datetime(2025, 1, 1)
enrollment_dates = [base_date + timedelta(days=random.randint(0, 90)) for _ in range(n_enrollments)]
user_ids_enroll = [random.choice(user_ids) for _ in range(n_enrollments)]
course_fees = np.random.uniform(5, 200, size=n_enrollments).round(2)
payment_types = ['Credit Card', 'PayPal', 'Bank Transfer', 'Voucher', 'Paypal']
payment_types_enroll = [random.choice(payment_types) for _ in range(n_enrollments)]
course_subjects = ['Programming', 'Data Science', 'Design', 'Business', 'Languages']
course_subjects_enroll = [random.choice(course_subjects) for _ in range(n_enrollments)]

df_courses = pd.DataFrame({
    'EnrollmentID': enrollment_ids,
    'EnrollmentDate': enrollment_dates,
    'UserID': user_ids_enroll,
    'CourseFee': course_fees,
    'PaymentType': payment_types_enroll,
    'CourseSubject': course_subjects_enroll
})

# Display sample data
print('=== Sample df_users ===')
print(df_users.head())
print('\n=== Sample df_courses ===')
print(df_courses.head())
```

## Exercise 1: Merge Enrollments with User Details
Merge `df_courses` and `df_users` using an **inner join** on `UserID`. Display the first 5 rows of the resulting DataFrame.
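
As a warm-up, here is a minimal sketch of an inner join on two tiny, hypothetical tables (`enrollments` and `users` are made-up stand-ins, not the notebook's simulated data). It only illustrates how `how='inner'` drops keys that appear in one table but not the other.

```python
import pandas as pd

# Hypothetical toy tables, not the notebook's simulated data
enrollments = pd.DataFrame({
    'UserID': ['U1', 'U2', 'U2', 'U9'],
    'CourseFee': [10.0, 20.0, 30.0, 40.0],
})
users = pd.DataFrame({
    'UserID': ['U1', 'U2', 'U3'],
    'Region': ['Asia', 'Europe', 'Africa'],
})

# An inner join keeps only the UserIDs present in both tables:
# 'U9' (no user record) and 'U3' (no enrollment) are dropped.
inner = enrollments.merge(users, on='UserID', how='inner')
print(inner)
```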

## Exercise 2: Left Join with Missing User Data
Perform a **left join** between `df_courses` and `df_users` on `UserID`. How many rows in the resulting DataFrame have missing user names (`UserName` is NaN)? Use `.isna()` to check.
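
The sketch below shows the idea on hypothetical toy tables: a left join keeps every row of the left table and fills unmatched right-hand columns with NaN, which `.isna().sum()` can then count. The table names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy tables: 'U9' has an enrollment but no user record
enrollments = pd.DataFrame({
    'UserID': ['U1', 'U2', 'U9'],
    'CourseFee': [10.0, 20.0, 40.0],
})
users = pd.DataFrame({
    'UserID': ['U1', 'U2'],
    'UserName': ['Alex Kim', 'Maya Chen'],
})

# A left join keeps all enrollment rows; unmatched users become NaN
left = enrollments.merge(users, on='UserID', how='left')

# Count the rows where no user name was found
missing_names = left['UserName'].isna().sum()
print(missing_names)
```

Note that in the setup cell every enrollment's `UserID` is drawn from `user_ids`, so the count you get on the real data may differ from this toy case.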

## Exercise 3: Average Course Fee by Payment Type
Group `df_courses` by `PaymentType` and calculate the average `CourseFee` for each type. Display the results.

## Exercise 4: Binning Course Fees
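Create a new column `FeeCategory` in `df_courses` by binning `CourseFee` into 4 categories: 'Low' (0-25), 'Medium' (25-75), 'High' (75-150), and 'Premium' (above 150). Decide how values that fall exactly on a boundary are handled; `pd.cut` includes the right edge of each bin by default. Display the first 10 rows showing `CourseFee` and `FeeCategory`.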
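
One possible way to express these bins is `pd.cut` with an open-ended last edge. The sketch below uses a hypothetical `fees` Series rather than `df_courses`, and assumes the default right-inclusive intervals, so a fee of exactly 25 lands in 'Low'.

```python
import numpy as np
import pandas as pd

# Hypothetical fees chosen to sit on and around the bin boundaries
fees = pd.Series([5.0, 25.0, 25.01, 75.0, 149.99, 150.0, 199.5])

# Edges define the intervals (0, 25], (25, 75], (75, 150], (150, inf)
bins = [0, 25, 75, 150, np.inf]
labels = ['Low', 'Medium', 'High', 'Premium']

# right=True is the default: each bin includes its right edge
categories = pd.cut(fees, bins=bins, labels=labels)
print(pd.DataFrame({'CourseFee': fees, 'FeeCategory': categories}))
```

Passing `right=False` instead would make each bin include its left edge, which moves boundary values such as 25 into the next category.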

## Exercise 5: Enrollment Count and Total Fees by Course Subject
Group `df_courses` by `CourseSubject` and calculate both the number of enrollments (`EnrollmentID` count) and the total `CourseFee`. Rename the aggregated columns to `EnrollmentCount` and `TotalFees`. Display the results.
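
Named aggregation is one way to compute several statistics and name the output columns in a single step. The following sketch applies it to a hypothetical four-row frame; the data are made up for illustration.

```python
import pandas as pd

# Hypothetical toy enrollments
df = pd.DataFrame({
    'EnrollmentID': ['E1', 'E2', 'E3', 'E4'],
    'CourseSubject': ['Design', 'Design', 'Business', 'Design'],
    'CourseFee': [10.0, 20.0, 50.0, 30.0],
})

# Each keyword becomes an output column: name=(source column, function)
summary = df.groupby('CourseSubject').agg(
    EnrollmentCount=('EnrollmentID', 'count'),
    TotalFees=('CourseFee', 'sum'),
)
print(summary)
```

An alternative is to aggregate with a dictionary and call `.rename(columns=...)` afterwards; named aggregation simply avoids the second step.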

## Exercise 6: Add a Discounted Fee Column
Create a new column `DiscountedFee` in `df_courses` by applying a 10% discount to `CourseFee`. Show the first 5 rows with both `CourseFee` and `DiscountedFee`.

## Exercise 7: Vertical Concatenation of Enrollments
Simulate receiving a new batch of 25 enrollments by sampling from `df_courses` (with replacement). Concatenate this batch vertically with the original `df_courses` and reset the index. Display the shape of the resulting DataFrame.
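
A small sketch of the same pattern on a hypothetical three-row frame: sample with replacement, stack the result under the original, and rebuild the index. The `random_state` value is an arbitrary choice to make the sample repeatable.

```python
import pandas as pd

# Hypothetical toy enrollments
original = pd.DataFrame({
    'EnrollmentID': ['E1', 'E2', 'E3'],
    'CourseFee': [10.0, 20.0, 30.0],
})

# Sampling with replacement allows more rows than the source has
new_batch = original.sample(n=5, replace=True, random_state=0)

# ignore_index=True discards the old labels and numbers rows 0..n-1,
# which has the same effect as calling reset_index(drop=True) afterwards
combined = pd.concat([original, new_batch], ignore_index=True)
print(combined.shape)
```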

## Exercise 8: Custom Aggregation - Fee Range
Define a custom function to calculate the range (max - min) of `CourseFee` for each `PaymentType`. Apply this function using `groupby` and display the results.
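
Any function that reduces a Series to a single value can be passed to `.agg()`. The sketch below defines a hypothetical helper, `fee_range`, and applies it to made-up data.

```python
import pandas as pd


def fee_range(values):
    """Return the spread (max - min) of a numeric Series."""
    return values.max() - values.min()


# Hypothetical toy enrollments
df = pd.DataFrame({
    'PaymentType': ['Voucher', 'Voucher', 'PayPal', 'PayPal', 'PayPal'],
    'CourseFee': [10.0, 45.0, 20.0, 80.0, 50.0],
})

# groupby passes each group's CourseFee values to fee_range in turn
ranges = df.groupby('PaymentType')['CourseFee'].agg(fee_range)
print(ranges)
```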

## Exercise 9: Sort Enrollments by Fee
Sort `df_courses` by `CourseFee` in descending order and display the top 5 most expensive enrollments.

## Exercise 10: Quantile Binning of Experience Levels
Add a column `ExperienceQuantile` to `df_users` by binning `ExperienceLevel` into 4 quantiles using `pd.qcut`. Label the bins as 'Q1', 'Q2', 'Q3', and 'Q4'. Show the first 10 rows with `ExperienceLevel` and `ExperienceQuantile`.
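
To see what `pd.qcut` does before applying it to `df_users`, here is a sketch on a hypothetical Series of eight evenly spread values, where each quartile receives exactly two of them.

```python
import pandas as pd

# Hypothetical experience levels, evenly spread so the quartiles are clean
levels = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# qcut chooses bin edges from the data's quantiles, not from fixed values
quartiles = pd.qcut(levels, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(pd.DataFrame({'ExperienceLevel': levels, 'ExperienceQuantile': quartiles}))
```

With integer data that has many ties, the bins are usually not equal in size. If ties make two bin edges coincide, `pd.qcut` raises a `ValueError` unless `duplicates='drop'` is passed, in which case fewer bins result and the label list would need adjusting.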

## Exercise 11: Bar Plot of Average Course Fee by Payment Type
Create a bar plot showing the average `CourseFee` for each `PaymentType`. Use `groupby` to calculate the averages, then plot with `.plot.bar()`. Add a title and labels for the axes.

## Exercise 12: Pie Chart of Enrollment Distribution by Course Subject
Create a pie chart showing the proportion of enrollments by `CourseSubject`. Use `value_counts()` to get the counts, then plot with `.plot.pie()`. Include percentage labels (`autopct='%1.1f%%'`) and a title.

## Exercise 13: Histogram of Course Fees
Create a histogram of `CourseFee` values from `df_courses` with 20 bins. Use `.plot.hist()` and add a title and axis labels.

## Exercise 14: Line Plot of Cumulative Fees Over Time
Create a line plot showing the cumulative sum of `CourseFee` over time (`EnrollmentDate`). Sort `df_courses` by `EnrollmentDate` first (the simulated rows are not in date order), set `EnrollmentDate` as the index, calculate the cumulative sum with `.cumsum()`, and plot with `.plot.line()`. Add a title and axis labels.
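
The order of the rows matters for a running total, so the sketch below sorts a hypothetical three-row frame by date before calling `.cumsum()`. The `matplotlib.use('Agg')` line is an assumption that lets the sketch run without a display; leave it out inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy enrollments, deliberately out of date order
df = pd.DataFrame({
    'EnrollmentDate': pd.to_datetime(['2025-01-03', '2025-01-01', '2025-01-02']),
    'CourseFee': [30.0, 10.0, 20.0],
})

# Sort by date first so the running total follows the calendar
cumulative = (
    df.sort_values('EnrollmentDate')
      .set_index('EnrollmentDate')['CourseFee']
      .cumsum()
)

ax = cumulative.plot.line()
ax.set_title('Cumulative course fees (toy data)')
ax.set_xlabel('Enrollment date')
ax.set_ylabel('Cumulative fee')
plt.close(ax.figure)  # release the figure; use plt.show() in a notebook
```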

## Exercise 15: Box Plot of Course Fees by Course Subject
Create a box plot showing the distribution of `CourseFee` for each `CourseSubject`. Use `df_courses.boxplot()` with `column='CourseFee'` and `by='CourseSubject'`. Add a title and axis labels, and remove the automatic "Boxplot grouped by CourseSubject" suptitle (for example with `plt.suptitle('')`).
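
A minimal sketch of a grouped box plot on hypothetical data follows. It assumes a non-interactive backend (`matplotlib.use('Agg')`) so that it runs without a display; leave that line out inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy enrollments: three fees per subject
df = pd.DataFrame({
    'CourseSubject': ['Design'] * 3 + ['Business'] * 3,
    'CourseFee': [10.0, 20.0, 30.0, 50.0, 70.0, 90.0],
})

# One box per CourseSubject; with a single column this returns one Axes
ax = df.boxplot(column='CourseFee', by='CourseSubject')
ax.set_title('Course fee by subject (toy data)')
ax.set_xlabel('Course subject')
ax.set_ylabel('Course fee')

# pandas adds a figure-level "Boxplot grouped by ..." suptitle; clear it
ax.get_figure().suptitle('')
plt.close(ax.get_figure())  # release the figure; use plt.show() in a notebook
```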