# Python for Data Analysis - Week 3
## Minor Assignment: Pandas Fundamentals II

**Due Date:** Wednesday, April 23, 2025

### Overview
In this assignment, you will practice the core Pandas concepts covered in today's lecture: indexing and selection, filtering data, and handling missing values. You'll work with a customer purchase dataset to clean, transform, and extract insights from the data.

### Learning Objectives
By completing this assignment, you will be able to:
- Use different methods for indexing and selecting data in Pandas
- Apply filtering operations to extract specific subsets of data
- Identify and handle missing values using various techniques
- Apply these techniques to solve real-world data cleaning challenges

### Dataset
You will be working with a customer purchase dataset (`customer_purchase_data.csv`) containing information about customers, their demographics, and their purchase transactions.

### Submission Guidelines
- Submit your completed notebook via the course portal
- Include your name and student ID in the notebook
- Ensure all code cells are executed and outputs are visible
- Add comments to explain your code and reasoning

Let's begin!

## Student Information

**Name:**  
**Student ID:**  

## Setup

First, let's import the necessary libraries and load the dataset.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 1000)

# For plotting in the notebook
%matplotlib inline

In [5]:
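# Upload the CSV from the local machine (Google Colab only)
from google.colab import files
uploaded = files.upload()

# Read the uploaded file; pandas is already imported as pd in the setup cell
df = pd.read_csv(list(uploaded.keys())[0])
df.head()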

Saving customer_purchase_data.csv to customer_purchase_data (2).csv


Unnamed: 0,CustomerID,Gender,Age,Income,Occupation,Education,Region,MaritalStatus,PurchaseDate,ProductID,ProductName,Category,Subcategory,Price,Quantity,PaymentMethod
0,1001,Male,34,72000,Engineer,Bachelor,East,Married,2024-03-05,P001,Laptop HP Elite,Electronics,Computers,1200.5,1.0,Credit Card
1,1001,Male,34,72000,Engineer,Bachelor,East,Married,2024-02-15,P045,External Hard Drive,Electronics,Accessories,89.99,2.0,Credit Card
2,1002,Female,28,65000,Teacher,Master,West,Single,2024-03-10,P012,Yoga Mat,Sports,Fitness,35.5,1.0,PayPal
3,1003,Female,45,95000,Doctor,PhD,Central,Married,2024-03-07,P023,Coffee Maker,Home,Kitchen,149.99,,Debit Card
4,1003,Female,45,95000,Doctor,PhD,Central,Married,2024-03-15,P056,Professional Blender,Home,Kitchen,299.95,1.0,Credit Card


## Part 1: Exploring the Dataset (10 points)

Before we dive into the main tasks, let's first explore the dataset to understand its structure and content.

### 1.1 Dataset Information

Examine the basic information about the dataset by answering the following questions:

1. How many rows and columns does the dataset have?
2. What are the column names and their data types?
3. Are there any missing values in the dataset? If so, in which columns?

In [6]:
# Check the shape of the dataset
rows, cols = df.shape
print(f"Number of rows: {rows}")
print(f"Number of columns: {cols}")

Number of rows: 57
Number of columns: 16


In [7]:
# Display information about the dataset
print("\nColumn names and data types:")
print(df.dtypes)


Column names and data types:
CustomerID         int64
Gender            object
Age                int64
Income             int64
Occupation        object
Education         object
Region            object
MaritalStatus     object
PurchaseDate      object
ProductID         object
ProductName       object
Category          object
Subcategory       object
Price            float64
Quantity         float64
PaymentMethod     object
dtype: object


In [8]:
# Check for missing values
print("\nMissing values in each column:")
print(df.isna().sum())


Missing values in each column:
CustomerID       0
Gender           0
Age              0
Income           0
Occupation       0
Education        0
Region           0
MaritalStatus    0
PurchaseDate     0
ProductID        0
ProductName      0
Category         0
Subcategory      0
Price            0
Quantity         2
PaymentMethod    0
dtype: int64


## Part 2: Indexing and Selection (30 points)

In this section, you will practice various methods for selecting and indexing data in Pandas.

### 2.1 Basic Indexing

Use different indexing methods to extract the following from the dataset:

1. Select the 'CustomerID', 'Age', 'Income', and 'Region' columns using bracket notation
2. Select the same columns using dot notation
3. Select rows 10 through 20 (inclusive) using iloc
4. Select the first 5 customers who made purchases in the 'Electronics' category using loc
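
As a rough illustration of how the four selection styles differ, here is a minimal sketch on a made-up DataFrame. The name `toy` and its values are hypothetical; only the column names echo the assignment's dataset.

```python
import pandas as pd

# Toy data: column names mirror the assignment, values are invented.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 4, 5],
    "Age": [34, 34, 28, 45, 52, 21],
    "Category": ["Electronics", "Home", "Sports", "Electronics", "Books", "Electronics"],
})

# Bracket notation with a list of names returns a DataFrame.
by_bracket = toy[["CustomerID", "Age"]]

# Dot notation returns one column (a Series) at a time,
# so several columns have to be stitched back together.
by_dot = pd.concat([toy.CustomerID, toy.Age], axis=1)

# iloc is position-based and excludes the stop position: 1:4 gives positions 1, 2, 3.
rows_1_to_3 = toy.iloc[1:4]

# loc is label-based and also accepts a boolean mask plus a list of columns.
is_electronics = toy["Category"] == "Electronics"
first_two_electronics = toy.loc[is_electronics, ["CustomerID", "Category"]].head(2)
```

Because `iloc` excludes the stop position, "rows 10 through 20 (inclusive)" corresponds to the slice `10:21`.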

In [9]:
# 1. Select columns using bracket notation
df[['CustomerID', 'Age', 'Income', 'Region']]

Unnamed: 0,CustomerID,Age,Income,Region
0,1001,34,72000,East
1,1001,34,72000,East
2,1002,28,65000,West
3,1003,45,95000,Central
4,1003,45,95000,Central
...,...,...,...,...
52,1046,32,70000,Central
53,1047,48,112000,West
54,1048,27,59000,South
55,1049,38,84000,North


In [10]:
# 2. Select columns using dot notation
# Dot notation returns one column (a Series) at a time, so the four Series are combined with pd.concat
pd.concat([df.CustomerID, df.Age, df.Income, df.Region], axis=1)

Unnamed: 0,CustomerID,Age,Income,Region
0,1001,34,72000,East
1,1001,34,72000,East
2,1002,28,65000,West
3,1003,45,95000,Central
4,1003,45,95000,Central
...,...,...,...,...
52,1046,32,70000,Central
53,1047,48,112000,West
54,1048,27,59000,South
55,1049,38,84000,North


In [11]:
# 3. Select rows 10 through 20 using iloc
df.iloc[10:21]

Unnamed: 0,CustomerID,Gender,Age,Income,Occupation,Education,Region,MaritalStatus,PurchaseDate,ProductID,ProductName,Category,Subcategory,Price,Quantity,PaymentMethod
10,1007,Female,31,67000,Analyst,Bachelor,North,Single,2024-03-09,P067,Running Shoes,Sports,Footwear,120.5,1.0,Debit Card
11,1008,Male,26,55000,Designer,Bachelor,East,Single,2024-03-11,P078,Smart Watch,Electronics,Wearables,249.99,1.0,Credit Card
12,1009,Female,48,110000,Director,Master,South,Married,2024-03-06,P089,Dining Table,Furniture,Dining,899.99,1.0,Bank Transfer
13,1010,Male,33,78000,Developer,Master,West,Married,2024-03-08,P090,Office Chair,Furniture,Office,349.5,2.0,Credit Card
14,1011,Female,29,59000,Nurse,Bachelor,Central,Single,2024-02-25,P101,First Aid Kit,Health,Emergency,45.99,1.0,Debit Card
15,1012,Male,41,92000,Consultant,PhD,East,Married,2024-03-13,P112,Business Laptop,Electronics,Computers,1599.99,1.0,Credit Card
16,1012,Male,41,92000,Consultant,PhD,East,Married,2024-01-15,P089,Dining Table,Furniture,Dining,899.99,1.0,Bank Transfer
17,1013,Female,36,81000,Pharmacist,Master,North,Divorced,2024-03-04,P123,Prescription Glasses,Health,Vision,199.5,1.0,Health Insurance
18,1014,Male,23,48000,Accountant,Bachelor,West,Single,2024-03-01,P134,Tax Software,Software,Finance,79.99,1.0,PayPal
19,1015,Female,37,76000,Marketing,Master,South,Married,2024-02-22,P145,Digital Camera,Electronics,Photography,699.95,1.0,Credit Card


In [12]:
# 4. Select the first 5 customers who made purchases in the 'Electronics' category
df.loc[df['Category'] == 'Electronics'].head(5)

Unnamed: 0,CustomerID,Gender,Age,Income,Occupation,Education,Region,MaritalStatus,PurchaseDate,ProductID,ProductName,Category,Subcategory,Price,Quantity,PaymentMethod
0,1001,Male,34,72000,Engineer,Bachelor,East,Married,2024-03-05,P001,Laptop HP Elite,Electronics,Computers,1200.5,1.0,Credit Card
1,1001,Male,34,72000,Engineer,Bachelor,East,Married,2024-02-15,P045,External Hard Drive,Electronics,Accessories,89.99,2.0,Credit Card
5,1003,Female,45,95000,Doctor,PhD,Central,Married,2024-01-22,P078,Smart Watch,Electronics,Wearables,249.99,1.0,Debit Card
7,1005,Female,21,35000,Student,High School,West,Single,2024-02-28,P045,External Hard Drive,Electronics,Accessories,89.99,1.0,PayPal
8,1005,Female,21,35000,Student,High School,West,Single,2024-03-14,P098,Wireless Earbuds,Electronics,Audio,129.95,1.0,PayPal


### 2.2 Advanced Indexing

Now, let's explore more advanced indexing techniques:

1. Set the 'CustomerID' column as the index of the DataFrame
2. Select all purchase information for customer with ID 1003 using the index
3. Multi-level indexing: Create a MultiIndex using 'Region' and 'Category' as index levels
4. Select all purchases in the 'East' region for the 'Electronics' category using the MultiIndex
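
The steps above can be sketched on a small, made-up DataFrame (the name `toy` and its values are hypothetical). The sketch assumes that sorting the MultiIndex is acceptable; an unsorted MultiIndex still works, but pandas may emit a `PerformanceWarning` when it is indexed.

```python
import pandas as pd

toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "Region": ["East", "East", "West", "East"],
    "Category": ["Electronics", "Home", "Electronics", "Electronics"],
    "Price": [100.0, 20.0, 50.0, 75.0],
})

# set_index returns a new DataFrame by default, so re-running this line is safe.
by_customer = toy.set_index("CustomerID")
customer_1 = by_customer.loc[[1]]  # the list form always returns a DataFrame

# Two-level index; sorting it lets pandas look labels up efficiently.
by_region_cat = toy.set_index(["Region", "Category"]).sort_index()

# The tuple names one label per index level; the trailing ":" selects all columns.
east_electronics = by_region_cat.loc[("East", "Electronics"), :]
```

Assigning the result of `set_index` to a new name, instead of using `inplace=True`, leaves the original DataFrame intact, so the cell can be re-executed without a `KeyError`.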

In [25]:
# 1. Set 'CustomerID' as the index
# Guarded so that re-running the cell does not raise a KeyError once the column has become the index
if df.index.name != 'CustomerID':
    df = df.set_index('CustomerID')

In [18]:
# 2. Select all purchase information for customer 1003
df.loc[1003]

CustomerID,Gender,Age,Income,Occupation,Education,Region,MaritalStatus,PurchaseDate,ProductID,ProductName,Category,Subcategory,Price,Quantity,PaymentMethod
1003,Female,45,95000,Doctor,PhD,Central,Married,2024-03-07,P023,Coffee Maker,Home,Kitchen,149.99,,Debit Card
1003,Female,45,95000,Doctor,PhD,Central,Married,2024-03-15,P056,Professional Blender,Home,Kitchen,299.95,1.0,Credit Card
1003,Female,45,95000,Doctor,PhD,Central,Married,2024-01-22,P078,Smart Watch,Electronics,Wearables,249.99,1.0,Debit Card


In [23]:
# 3. Create a MultiIndex using 'Region' and 'Category'
df_multi = df.set_index(['Region', 'Category'])

In [22]:
# 4. Select all purchases in the 'East' region for the 'Electronics' category
df_multi.loc[('East', 'Electronics')]



Region,Category,Gender,Age,Income,Occupation,Education,MaritalStatus,PurchaseDate,ProductID,ProductName,Subcategory,Price,Quantity,PaymentMethod
East,Electronics,Male,34,72000,Engineer,Bachelor,Married,2024-03-05,P001,Laptop HP Elite,Computers,1200.5,1.0,Credit Card
East,Electronics,Male,34,72000,Engineer,Bachelor,Married,2024-02-15,P045,External Hard Drive,Accessories,89.99,2.0,Credit Card
East,Electronics,Male,26,55000,Designer,Bachelor,Single,2024-03-11,P078,Smart Watch,Wearables,249.99,1.0,Credit Card
East,Electronics,Male,41,92000,Consultant,PhD,Married,2024-03-13,P112,Business Laptop,Computers,1599.99,1.0,Credit Card
East,Electronics,Female,32,69000,HR Manager,Master,Married,2024-02-17,P190,Ergonomic Keyboard,Accessories,129.99,1.0,Credit Card
East,Electronics,Female,25,52000,Graphic Designer,Bachelor,Single,2024-03-19,P256,Drawing Tablet,Design,299.0,1.0,PayPal
East,Electronics,Male,24,51000,Customer Service,Associate,Single,2024-03-01,P412,Headset,Audio,129.0,1.0,Debit Card


### 2.3 Practical Application: Creating Customer Profiles

Now, use your indexing skills to create a customer profile DataFrame that contains the following information for each unique customer:
- CustomerID
- Gender
- Age
- Income
- Education
- Region
- MaritalStatus

Hint: You'll need to remove duplicate customer entries since the same customer may have made multiple purchases.
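
One way to apply the hint, sketched on a few invented rows (the names `purchases`, `profile_cols`, and `profiles` are hypothetical). It assumes a customer's demographic fields are identical on every one of their purchase rows, so keeping the first row per `CustomerID` loses nothing.

```python
import pandas as pd

purchases = pd.DataFrame({
    "CustomerID": [1001, 1001, 1002, 1003, 1003],
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Age": [34, 34, 28, 45, 45],
    "ProductID": ["P001", "P045", "P012", "P023", "P056"],
})

profile_cols = ["CustomerID", "Gender", "Age"]
profiles = (
    purchases[profile_cols]
    .drop_duplicates(subset="CustomerID")  # keeps the first row seen per customer
    .set_index("CustomerID")
)
```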

In [26]:
# Create customer profile DataFrame
# Select the required columns and drop duplicates based on CustomerID (index)
customer_profile = df[['Gender', 'Age', 'Income', 'Education', 'Region', 'MaritalStatus']].reset_index().drop_duplicates(subset='CustomerID').set_index('CustomerID')

# Display the customer profile DataFrame
customer_profile

CustomerID,Gender,Age,Income,Education,Region,MaritalStatus
1001,Male,34,72000,Bachelor,East,Married
1002,Female,28,65000,Master,West,Single
1003,Female,45,95000,PhD,Central,Married
1004,Male,52,120000,Master,East,Divorced
1005,Female,21,35000,High School,West,Single
1006,Male,39,85000,Bachelor,Central,Married
1007,Female,31,67000,Bachelor,North,Single
1008,Male,26,55000,Bachelor,East,Single
1009,Female,48,110000,Master,South,Married
1010,Male,33,78000,Master,West,Married


## Part 3: Filtering Data (30 points)

In this section, you will practice applying filters to extract specific subsets of data.

### 3.1 Basic Filtering

Apply filters to find the following information:

1. Customers who are younger than 30 years old
2. Purchases made in the 'Electronics' category with a price greater than $500
3. Female customers who have made purchases in the 'Books' category
4. Customers from the 'West' region who are married
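
Filters like these are built from boolean masks. The following is a minimal sketch on invented data (the names `toy`, `is_female`, and `is_books` are hypothetical), together with an equivalent `DataFrame.query` call.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Female"],
    "Age": [28, 41, 46, 25],
    "Category": ["Books", "Books", "Electronics", "Books"],
})

# Each comparison yields a boolean Series; & (and), | (or), ~ (not) combine them.
# When comparisons are written inline, wrap each one in parentheses,
# because & binds more tightly than == or <.
is_female = toy["Gender"] == "Female"
is_books = toy["Category"] == "Books"
female_book_buyers = toy[is_female & is_books]

# DataFrame.query expresses the same filter as a string.
same_rows = toy.query("Gender == 'Female' and Category == 'Books'")
```

Naming the masks first tends to make long conditions easier to read and to reuse in later cells.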

In [27]:
# Reset index if needed
if df.index.name == 'CustomerID':
    df = df.reset_index()

# 1. Customers younger than 30
df[df['Age'] < 30]

Unnamed: 0,CustomerID,Gender,Age,Income,Occupation,Education,Region,MaritalStatus,PurchaseDate,ProductID,ProductName,Category,Subcategory,Price,Quantity,PaymentMethod
2,1002,Female,28,65000,Teacher,Master,West,Single,2024-03-10,P012,Yoga Mat,Sports,Fitness,35.5,1.0,PayPal
7,1005,Female,21,35000,Student,High School,West,Single,2024-02-28,P045,External Hard Drive,Electronics,Accessories,89.99,1.0,PayPal
8,1005,Female,21,35000,Student,High School,West,Single,2024-03-14,P098,Wireless Earbuds,Electronics,Audio,129.95,1.0,PayPal
11,1008,Male,26,55000,Designer,Bachelor,East,Single,2024-03-11,P078,Smart Watch,Electronics,Wearables,249.99,1.0,Credit Card
14,1011,Female,29,59000,Nurse,Bachelor,Central,Single,2024-02-25,P101,First Aid Kit,Health,Emergency,45.99,1.0,Debit Card
18,1014,Male,23,48000,Accountant,Bachelor,West,Single,2024-03-01,P134,Tax Software,Software,Finance,79.99,1.0,PayPal
22,1017,Female,27,63000,Journalist,Bachelor,Central,Single,2024-02-19,P178,Voice Recorder,Electronics,Audio,175.5,1.0,Debit Card
30,1025,Female,25,52000,Graphic Designer,Bachelor,East,Single,2024-03-19,P256,Drawing Tablet,Electronics,Design,299.0,1.0,PayPal
35,1030,Male,29,64000,Fitness Trainer,Bachelor,East,Single,2024-03-11,P301,Fitness Equipment,Sports,Training,399.99,1.0,Credit Card
36,1030,Male,29,64000,Fitness Trainer,Bachelor,East,Single,2024-01-25,P312,Protein Powder,Health,Supplements,65.0,2.0,Debit Card


In [28]:
# 2. Electronics purchases with price > $500
df[(df['Category'] == 'Electronics') & (df['Price'] > 500)]

Unnamed: 0,CustomerID,Gender,Age,Income,Occupation,Education,Region,MaritalStatus,PurchaseDate,ProductID,ProductName,Category,Subcategory,Price,Quantity,PaymentMethod
0,1001,Male,34,72000,Engineer,Bachelor,East,Married,2024-03-05,P001,Laptop HP Elite,Electronics,Computers,1200.5,1.0,Credit Card
15,1012,Male,41,92000,Consultant,PhD,East,Married,2024-03-13,P112,Business Laptop,Electronics,Computers,1599.99,1.0,Credit Card
19,1015,Female,37,76000,Marketing,Master,South,Married,2024-02-22,P145,Digital Camera,Electronics,Photography,699.95,1.0,Credit Card
38,1032,Male,22,42000,Social Media Specialist,Associate,Central,Single,2024-02-20,P334,Smartphone,Electronics,Mobile,899.0,1.0,Credit Card


In [29]:
# 3. Female customers who purchased books
df[(df['Gender'] == 'Female') & (df['Category'] == 'Books')]

Unnamed: 0,CustomerID,Gender,Age,Income,Occupation,Education,Region,MaritalStatus,PurchaseDate,ProductID,ProductName,Category,Subcategory,Price,Quantity,PaymentMethod
34,1029,Female,46,105000,Psychologist,PhD,North,Divorced,2024-02-21,P290,Psychology Books,Books,Professional,175.0,2.0,Credit Card
37,1031,Female,39,88000,Attorney,JD,West,Married,2024-03-08,P323,Legal Reference Books,Books,Professional,220.0,1.0,Bank Transfer
47,1041,Female,44,102000,Marketing Director,MBA,West,Married,2024-02-23,P423,Marketing Strategy Books,Books,Business,110.0,2.0,Bank Transfer
53,1047,Female,48,112000,Human Resources Director,Master,West,Married,2024-02-25,P489,HR Management Books,Books,Professional,120.0,2.0,Credit Card
55,1049,Female,38,84000,Nutritionist,Master,North,Married,2024-03-10,P501,Nutrition Books,Books,Health,95.0,2.0,Debit Card


In [30]:
# 4. Married customers from West region
df[(df['Region'] == 'West') & (df['MaritalStatus'] == 'Married')]

Unnamed: 0,CustomerID,Gender,Age,Income,Occupation,Education,Region,MaritalStatus,PurchaseDate,ProductID,ProductName,Category,Subcategory,Price,Quantity,PaymentMethod
13,1010,Male,33,78000,Developer,Master,West,Married,2024-03-08,P090,Office Chair,Furniture,Office,349.5,2.0,Credit Card
31,1026,Male,43,99000,Dentist,Doctorate,West,Married,2024-02-29,P267,Water Flosser,Health,Dental,89.95,1.0,Health Insurance
37,1031,Female,39,88000,Attorney,JD,West,Married,2024-03-08,P323,Legal Reference Books,Books,Professional,220.0,1.0,Bank Transfer
42,1036,Male,47,108000,Pilot,Bachelor,West,Married,2024-03-13,P378,Travel Luggage,Travel,Bags,199.5,2.0,Credit Card
47,1041,Female,44,102000,Marketing Director,MBA,West,Married,2024-02-23,P423,Marketing Strategy Books,Books,Business,110.0,2.0,Bank Transfer
53,1047,Female,48,112000,Human Resources Director,Master,West,Married,2024-02-25,P489,HR Management Books,Books,Professional,120.0,2.0,Credit Card


### 3.2 Advanced Filtering

Now let's apply more complex filtering conditions:

1. Find high-value customers (Income > $90,000) who have made purchases in the 'Furniture' or 'Electronics' categories
2. Find customers who made purchases in January 2024 (hint: extract month and year from the PurchaseDate)
3. Find customers who have made multiple purchases (more than one transaction)
4. Find the top 5 most expensive products purchased using 'Credit Card' as the payment method
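
The date, repeat-purchase, and top-N tasks can be sketched as follows on a small invented table (`toy` and its values are hypothetical). The sketch counts purchases on the `CustomerID` column rather than on the DataFrame index, which is what matters when the index is a plain row number.

```python
import pandas as pd

toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "PurchaseDate": ["2024-01-15", "2024-03-05", "2024-01-22",
                     "2024-02-01", "2024-02-10", "2023-01-09"],
    "Price": [900.0, 1200.5, 250.0, 45.0, 80.0, 300.0],
})
toy["PurchaseDate"] = pd.to_datetime(toy["PurchaseDate"])

# Check year and month together so that January of another year is excluded.
dates = toy["PurchaseDate"]
january_2024 = toy[(dates.dt.year == 2024) & (dates.dt.month == 1)]

# Count rows per customer and broadcast that count back onto every row.
n_purchases = toy.groupby("CustomerID")["CustomerID"].transform("size")
repeat_customers = toy[n_purchases > 1]

# nlargest orders by the named column and keeps the top n rows.
top_two = toy.nlargest(2, "Price")
```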

In [31]:
# 1. High-value customers who purchased Furniture or Electronics
df[(df['Income'] > 90000) & (df['Category'].isin(['Furniture', 'Electronics']))]

Unnamed: 0,CustomerID,Gender,Age,Income,Occupation,Education,Region,MaritalStatus,PurchaseDate,ProductID,ProductName,Category,Subcategory,Price,Quantity,PaymentMethod
5,1003,Female,45,95000,Doctor,PhD,Central,Married,2024-01-22,P078,Smart Watch,Electronics,Wearables,249.99,1.0,Debit Card
12,1009,Female,48,110000,Director,Master,South,Married,2024-03-06,P089,Dining Table,Furniture,Dining,899.99,1.0,Bank Transfer
15,1012,Male,41,92000,Consultant,PhD,East,Married,2024-03-13,P112,Business Laptop,Electronics,Computers,1599.99,1.0,Credit Card
16,1012,Male,41,92000,Consultant,PhD,East,Married,2024-01-15,P089,Dining Table,Furniture,Dining,899.99,1.0,Bank Transfer
50,1044,Male,53,135000,CEO,MBA,South,Married,2024-02-16,P456,Executive Chair,Furniture,Office,899.0,1.0,Bank Transfer


In [32]:
# Convert PurchaseDate to datetime if not already
if not pd.api.types.is_datetime64_dtype(df['PurchaseDate']):
    df['PurchaseDate'] = pd.to_datetime(df['PurchaseDate'])

# 2. Customers who made purchases in January 2024
# PurchaseDate is already datetime here, so the .dt accessor can be used directly
df[df['PurchaseDate'].dt.to_period('M') == '2024-01']

Unnamed: 0,CustomerID,Gender,Age,Income,Occupation,Education,Region,MaritalStatus,PurchaseDate,ProductID,ProductName,Category,Subcategory,Price,Quantity,PaymentMethod
5,1003,Female,45,95000,Doctor,PhD,Central,Married,2024-01-22,P078,Smart Watch,Electronics,Wearables,249.99,1.0,Debit Card
16,1012,Male,41,92000,Consultant,PhD,East,Married,2024-01-15,P089,Dining Table,Furniture,Dining,899.99,1.0,Bank Transfer
36,1030,Male,29,64000,Fitness Trainer,Bachelor,East,Single,2024-01-25,P312,Protein Powder,Health,Supplements,65.0,2.0,Debit Card


In [33]:
# 3. Customers with multiple purchases
# After reset_index() the index is a unique row number, so df.index.duplicated() never matches.
# Duplicates have to be checked on the CustomerID column instead.
df[df['CustomerID'].duplicated(keep=False)]


In [34]:
# 4. Top 5 most expensive products purchased with Credit Card
df[df['PaymentMethod'] == 'Credit Card'][['ProductName', 'Price']].nlargest(5, 'Price')

Unnamed: 0,ProductName,Price
15,Business Laptop,1599.99
0,Laptop HP Elite,1200.5
51,Lab Equipment,975.0
38,Smartphone,899.0
19,Digital Camera,699.95


### 3.3 Practical Application: Customer Segmentation

Use filtering to segment customers based on the following criteria:

1. Create a 'CustomerValue' column that categorizes customers as follows:
   - 'High': Income > $90,000
   - 'Medium': Income between $60,000 and $90,000
   - 'Low': Income < $60,000
   
2. Create an 'AgeGroup' column that categorizes customers as follows:
   - 'Young': Age < 30
   - 'Middle-aged': Age between 30 and 45
   - 'Senior': Age > 45
   
3. Create a 'PurchaseFrequency' column that categorizes customers as follows:
   - 'Frequent': More than 2 purchases
   - 'Occasional': 1-2 purchases

4. Create a cross-tabulation of CustomerValue and AgeGroup to see the distribution of customers
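
A possible shape for these four steps, shown on invented data (`toy`, `counts`, `per_customer`, and `table` are hypothetical names). The sketch assumes ages are whole numbers, so "< 30" can be written as a bin ending at 29, and that the cross-tabulation should count each customer once rather than once per purchase.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 3, 3],
    "Age": [29, 29, 29, 30, 46, 46],
    "Income": [95000, 95000, 95000, 60000, 59000, 59000],
})

# np.select uses the first condition that matches; `default` covers the rest.
toy["CustomerValue"] = np.select(
    [toy["Income"] > 90000, toy["Income"] >= 60000],
    ["High", "Medium"],
    default="Low",
)

# pd.cut bins are right-inclusive by default: (-inf, 29], (29, 45], (45, inf).
# With whole-number ages this matches "< 30", "30 to 45", and "> 45".
toy["AgeGroup"] = pd.cut(
    toy["Age"],
    bins=[-np.inf, 29, 45, np.inf],
    labels=["Young", "Middle-aged", "Senior"],
)

# Purchases per customer, mapped back onto each row.
counts = toy["CustomerID"].map(toy["CustomerID"].value_counts())
toy["PurchaseFrequency"] = np.where(counts > 2, "Frequent", "Occasional")

# One row per customer before cross-tabulating, so repeat buyers count once.
per_customer = toy.drop_duplicates(subset="CustomerID")
table = pd.crosstab(per_customer["CustomerValue"], per_customer["AgeGroup"])
```

If the distribution of purchases (not customers) is wanted, `pd.crosstab` can be applied to `toy` directly.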

In [41]:
# Create customer profile DataFrame if not already created
if 'customer_profiles' not in locals():
    customer_profiles = df[['CustomerID', 'Gender', 'Age', 'Income', 'Education', 'Region', 'MaritalStatus']].drop_duplicates(subset=['CustomerID'])

# 1. Create CustomerValue column
df['CustomerValue'] = np.where(df['Income'] > 90000, 'High',
                              np.where(df['Income'] >= 60000, 'Medium', 'Low'))


In [None]:
# 2. Create AgeGroup column


In [None]:
# 3. Create PurchaseFrequency column


In [None]:
# 4. Create cross-tabulation


## Part 4: Handling Missing Values (30 points)

In this section, you will identify and handle missing values in the dataset.

### 4.1 Identifying Missing Values

Let's first identify all missing values in the dataset:

1. Calculate the number of missing values in each column
2. Calculate the percentage of missing values in each column
3. Create a visualization to illustrate the missing values pattern
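
Counts and percentages both come from the same boolean mask. This is a minimal sketch on invented data (`toy`, `missing_count`, `missing_pct`, and `mask` are hypothetical names).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Price": [10.0, 20.0, 30.0, 40.0],
    "Quantity": [1.0, np.nan, 2.0, np.nan],
})

# isna() marks each missing cell as True.
mask = toy.isna()

missing_count = mask.sum()        # number of missing cells per column
missing_pct = mask.mean() * 100   # share of rows that are missing, in percent
```

For the visualization, one option is to pass `mask.to_numpy()` to matplotlib's `plt.imshow(..., aspect="auto")`, or to draw `missing_count` as a bar chart with `missing_count.plot(kind="bar")`.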

In [None]:
# 1. Count missing values in each column
# Your code here

In [None]:
# 2. Calculate percentage of missing values
# Your code here

In [None]:
# 3. Visualize missing values
# Your code here

### 4.2 Handling Missing Values

Now, let's apply different techniques to handle missing values:

1. Create a new DataFrame with rows that have missing values
2. Create a new DataFrame with rows that have no missing values
3. Fill missing Quantity values with the median quantity for that product category
4. Fill missing Price values with the mean price for that product category
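
Group-wise filling can be sketched with `groupby(...).transform(...)`, shown here on invented data (`toy`, `category_median`, and `filled` are hypothetical names).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Category": ["Home", "Home", "Home", "Sports", "Sports"],
    "Quantity": [1.0, np.nan, 3.0, 2.0, np.nan],
})

# Rows with at least one missing value, and rows with none.
rows_with_missing = toy[toy.isna().any(axis=1)]
rows_complete = toy.dropna()

# transform returns a Series aligned with the original rows, holding each
# row's category median (NaNs are skipped when the median is computed).
category_median = toy.groupby("Category")["Quantity"].transform("median")
filled = toy.assign(Quantity=toy["Quantity"].fillna(category_median))
```

The same pattern with `"mean"` applies to Price. In the Part 1 output, Price has no missing values, so that fill would leave the column unchanged for this dataset.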

In [None]:
# 1. Rows with missing values
# Your code here

In [None]:
# 2. Rows with no missing values
# Your code here

In [None]:
# 3. Fill missing Quantity with median by category
# Your code here

In [None]:
# 4. Fill missing Price with mean by category
# Your code here

### 4.3 Practical Application: Creating a Clean Dataset

Create a clean version of the dataset by applying the following steps:

1. Fill missing Quantity values with the median quantity for that product category
2. Fill missing Price values with the mean price for that product category
3. Create a 'TotalAmount' column that multiplies Price by Quantity
4. Convert PurchaseDate to datetime format if not already
5. Create a 'PurchaseMonth' and 'PurchaseYear' column
6. Create a 'CustomerSpend' DataFrame that shows the total amount spent by each customer
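
Steps 3 to 6 can be sketched on a small invented table (`toy`, `clean`, `customer_spend`, and the `TotalSpend` column name are hypothetical). Steps 1 and 2 are omitted because the sample rows have no gaps.

```python
import pandas as pd

toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "PurchaseDate": ["2024-03-05", "2024-02-15", "2024-03-10"],
    "Price": [100.0, 50.0, 35.5],
    "Quantity": [1.0, 2.0, 1.0],
})

clean = toy.copy()  # leave the original untouched
clean["TotalAmount"] = clean["Price"] * clean["Quantity"]
clean["PurchaseDate"] = pd.to_datetime(clean["PurchaseDate"])
clean["PurchaseMonth"] = clean["PurchaseDate"].dt.month
clean["PurchaseYear"] = clean["PurchaseDate"].dt.year

# One row per customer with the summed TotalAmount.
customer_spend = (
    clean.groupby("CustomerID", as_index=False)["TotalAmount"]
    .sum()
    .rename(columns={"TotalAmount": "TotalSpend"})
)
```

Filling missing Quantity before computing `TotalAmount` matters: a NaN quantity would otherwise propagate into the product and be dropped from the per-customer sum.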

In [None]:
# Create a copy of the DataFrame to work with
clean_df = df.copy()

# 1. Fill missing Quantity values
# Your code here

In [None]:
# 2. Fill missing Price values
# Your code here

In [None]:
# 3. Create TotalAmount column
# Your code here

In [None]:
# 4. Convert PurchaseDate to datetime
# Your code here

In [None]:
# 5. Create PurchaseMonth and PurchaseYear columns
# Your code here

In [None]:
# 6. Create CustomerSpend DataFrame
# Your code here

## Bonus Challenge (10 extra points)

Analyze the purchasing patterns of customers based on demographic factors:

1. For each age group, determine the most popular product category (by number of purchases)
2. For each income level ('High', 'Medium', 'Low'), calculate the average spend per purchase
3. Compare spending patterns between male and female customers across different product categories
4. Identify which regions have the highest average purchase amounts
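
A possible starting point for these group-wise comparisons, sketched on invented data. The names `toy`, `top_category`, `avg_by_region`, and `spend_by_gender` are hypothetical, and the sketch assumes the `AgeGroup` and `TotalAmount` columns from earlier sections already exist.

```python
import pandas as pd

toy = pd.DataFrame({
    "AgeGroup": ["Young", "Young", "Young", "Senior", "Senior"],
    "Gender": ["Female", "Male", "Female", "Male", "Male"],
    "Region": ["West", "East", "West", "East", "North"],
    "Category": ["Books", "Books", "Sports", "Home", "Home"],
    "TotalAmount": [20.0, 30.0, 50.0, 200.0, 100.0],
})

# Most popular category per age group: count purchases per pair,
# sort by count, and keep the first (largest) row for each group.
counts = toy.groupby(["AgeGroup", "Category"]).size().reset_index(name="n")
top_category = (
    counts.sort_values("n", ascending=False)
    .drop_duplicates(subset="AgeGroup")
    .set_index("AgeGroup")["Category"]
)

# Average purchase amount by region, highest first.
avg_by_region = toy.groupby("Region")["TotalAmount"].mean().sort_values(ascending=False)

# Gender-by-category totals; combinations with no purchases become 0.
spend_by_gender = toy.pivot_table(
    index="Category", columns="Gender", values="TotalAmount",
    aggfunc="sum", fill_value=0,
)
```

The average spend per income level follows the same `groupby(...).mean()` pattern, grouping on `CustomerValue` instead of `Region`.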

In [None]:
# 1. Most popular product category by age group
# Your code here

In [None]:
# 2. Average spend per purchase by income level
# Your code here

In [None]:
# 3. Spending patterns by gender across product categories
# Your code here

In [None]:
# 4. Regions with highest average purchase amounts
# Your code here

## Summary

In this assignment, you've practiced various techniques for indexing, selecting, and filtering data in Pandas, as well as identifying and handling missing values. These skills are essential for any data analysis project and form the foundation for more advanced data manipulation operations.

Summarize what you've learned from this assignment and how you might apply these techniques in future data analysis tasks.

*Your summary here*