# Setup environment and load data

Dataset: https://s3.hothienlac.com/yomitoon/sales_data.csv

In [2]:
import pandas as pd

# 🟢 LEVEL 1: Basic Data Understanding (Score 1–3)

## **Q1. Load and inspect the dataset**

### Task

Load the CSV file and:

1. Display the first 5 rows
2. Show column names and data types

### 💡 Hint

Use:

* `pd.read_csv`
* `.head()`
* `.info()`

### 📚 Reference

* [https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
* [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html)

### 🧠 Explanation

You are learning how to **inspect unfamiliar data** quickly and verify that Pandas interpreted types correctly (dates, numbers, strings).


In [3]:
dataset = pd.read_csv('https://s3.hothienlac.com/yomitoon/sales_data.csv')

In [5]:
dataset.head()

,order_id,order_date,customer_id,customer_name,city,product,category,quantity,unit_price,payment_method
0,1001,2024-01-02,C001,Alice,New York,Laptop,Electronics,1,1200,Credit Card
1,1002,2024-01-02,C002,Bob,Los Angeles,Headphones,Electronics,2,150,PayPal
2,1003,2024-01-03,C003,Charlie,New York,Office Chair,Furniture,1,350,Credit Card
3,1004,2024-01-03,C001,Alice,New York,Mouse,Electronics,3,25,Debit Card
4,1005,2024-01-04,C004,Diana,Chicago,Desk,Furniture,1,500,Bank Transfer


In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   order_id        10 non-null     int64 
 1   order_date      10 non-null     object
 2   customer_id     10 non-null     object
 3   customer_name   10 non-null     object
 4   city            10 non-null     object
 5   product         10 non-null     object
 6   category        10 non-null     object
 7   quantity        10 non-null     int64 
 8   unit_price      10 non-null     int64 
 9   payment_method  10 non-null     object
dtypes: int64(3), object(7)
memory usage: 932.0+ bytes
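The `info()` output above reports `order_date` as `object`, meaning the dates were read as plain strings. As a minimal sketch of how that can be checked and corrected, the snippet below uses a few rows copied from the output into an in-memory CSV (the names `csv_text`, `sample`, and `parsed` are illustrative, not part of the exercise), so it runs without network access.

```python
import io

import pandas as pd

# A few rows copied from the output above, kept in memory for the sketch.
csv_text = """order_id,order_date,customer_id,quantity,unit_price
1001,2024-01-02,C001,1,1200
1002,2024-01-02,C002,2,150
1003,2024-01-03,C003,1,350
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Without parse_dates, the date column is not a datetime column.
is_datetime_before = pd.api.types.is_datetime64_any_dtype(sample["order_date"])

# Convert the column explicitly; assign() returns a new DataFrame.
parsed = sample.assign(order_date=pd.to_datetime(sample["order_date"]))
is_datetime_after = pd.api.types.is_datetime64_any_dtype(parsed["order_date"])

print(is_datetime_before, is_datetime_after)
print(parsed.dtypes)
```

Checking dtypes right after loading is cheap, and it catches columns that later date-based operations would otherwise treat as text.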


## **Q2. Create a new column for total order value**

### Task

Create a new column called `total_amount`, using the formula:

```
quantity × unit_price
```

### 💡 Hint

Use:

* `.assign()`

### 📚 Reference

* [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html)

### 🧠 Explanation

This teaches **feature engineering** and the **functional Pandas mindset**: `.assign()` leaves the original DataFrame untouched and returns a new one with the extra column.


In [8]:
q2_data = dataset.assign(total_amount=dataset['quantity'] * dataset['unit_price'])
q2_data

,order_id,order_date,customer_id,customer_name,city,product,category,quantity,unit_price,payment_method,total_amount
0,1001,2024-01-02,C001,Alice,New York,Laptop,Electronics,1,1200,Credit Card,1200
1,1002,2024-01-02,C002,Bob,Los Angeles,Headphones,Electronics,2,150,PayPal,300
2,1003,2024-01-03,C003,Charlie,New York,Office Chair,Furniture,1,350,Credit Card,350
3,1004,2024-01-03,C001,Alice,New York,Mouse,Electronics,3,25,Debit Card,75
4,1005,2024-01-04,C004,Diana,Chicago,Desk,Furniture,1,500,Bank Transfer,500
5,1006,2024-01-04,C005,Eve,Chicago,Laptop,Electronics,1,1100,Credit Card,1100
6,1007,2024-01-05,C002,Bob,Los Angeles,Monitor,Electronics,2,300,Debit Card,600
7,1008,2024-01-05,C003,Charlie,New York,Desk Lamp,Furniture,2,45,PayPal,90
8,1009,2024-01-06,C006,Frank,Miami,Tablet,Electronics,1,600,Credit Card,600
9,1010,2024-01-06,C001,Alice,New York,Keyboard,Electronics,1,80,Bank Transfer,80
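The "return a new DataFrame" behaviour mentioned in the explanation can be checked directly. The sketch below uses three orders copied from the output above and passes a callable to `.assign()`, which is one way to keep the expression usable inside a method chain; the names `orders` and `with_total` are illustrative.

```python
import pandas as pd

# Three orders copied from the output above.
orders = pd.DataFrame(
    {
        "order_id": [1001, 1002, 1004],
        "quantity": [1, 2, 3],
        "unit_price": [1200, 150, 25],
    }
)

# The callable receives the DataFrame that assign() is operating on.
with_total = orders.assign(
    total_amount=lambda df: df["quantity"] * df["unit_price"]
)

print(with_total["total_amount"].tolist())
# The original frame has not gained the new column.
print("total_amount" in orders.columns)
```

Because the original frame is unchanged, earlier cells can be re-run in any order without accumulating side effects.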


# 🟡 LEVEL 2: Filtering & Simple Analysis (Score 4–6)

## **Q3. Filter high-value orders**

### Task

Select only orders where:

* `total_amount > 500`

### 💡 Hint

Use **one** of:

* `.query()`
* Boolean indexing (`df[condition]`)

### 📚 Reference

* [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html)

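### 🧠 Explanation

Most analysis starts by narrowing a dataset to the rows that matter.
A condition such as `total_amount > 500` produces a **boolean mask** (one `True`/`False` value per row), and Pandas keeps only the rows where the mask is `True`.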

In [9]:
q3_data = q2_data.query('total_amount > 500')
q3_data

,order_id,order_date,customer_id,customer_name,city,product,category,quantity,unit_price,payment_method,total_amount
0,1001,2024-01-02,C001,Alice,New York,Laptop,Electronics,1,1200,Credit Card,1200
5,1006,2024-01-04,C005,Eve,Chicago,Laptop,Electronics,1,1100,Credit Card,1100
6,1007,2024-01-05,C002,Bob,Los Angeles,Monitor,Electronics,2,300,Debit Card,600
8,1009,2024-01-06,C006,Frank,Miami,Tablet,Electronics,1,600,Credit Card,600
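The hint also mentions boolean indexing. As a sketch of that alternative, the snippet below filters five orders copied from the data above with a mask, then combines two conditions. Each condition needs its own parentheses because `&` binds more tightly than the comparison operators. The variable names are illustrative.

```python
import pandas as pd

# Five orders copied from the data above.
orders = pd.DataFrame(
    {
        "order_id": [1001, 1005, 1006, 1007, 1009],
        "payment_method": [
            "Credit Card",
            "Bank Transfer",
            "Credit Card",
            "Debit Card",
            "Credit Card",
        ],
        "total_amount": [1200, 500, 1100, 600, 600],
    }
)

# A boolean mask: one True/False value per row.
mask = orders["total_amount"] > 500
high_value = orders[mask]

# Combine conditions with & (and) or | (or), each wrapped in parentheses.
credit_high = orders[
    (orders["total_amount"] > 500) & (orders["payment_method"] == "Credit Card")
]

# The same filter written with query().
credit_high_query = orders.query(
    "total_amount > 500 and payment_method == 'Credit Card'"
)

print(high_value["order_id"].tolist())
print(credit_high["order_id"].tolist())
```

Note that the order worth exactly 500 is excluded, since the comparison is strict.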


## **Q4. Count how many orders each customer made**

### Task

Create a table showing:

* `customer_id`
* number of orders per customer

### 💡 Hint

Use:

* `.groupby()`
* `.count()` **or** `.size()`

### 📚 Reference

* [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)

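### 🧠 Explanation

You learn how Pandas **splits data into groups** and applies calculations per group. Note that `.count()` counts the non-null values in every column, while `.size()` returns a single row count per group.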

In [13]:
q4_data = q2_data.groupby('customer_id').count()
q4_data

customer_id,order_id,order_date,customer_name,city,product,category,quantity,unit_price,payment_method,total_amount
C001,3,3,3,3,3,3,3,3,3,3
C002,2,2,2,2,2,2,2,2,2,2
C003,2,2,2,2,2,2,2,2,2,2
C004,1,1,1,1,1,1,1,1,1,1
C005,1,1,1,1,1,1,1,1,1,1
C006,1,1,1,1,1,1,1,1,1,1
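The `.count()` result above repeats the same number in every column. If a two-column table is what the task is after, one possible approach is `.size()` followed by `reset_index(name=...)`. The sketch below rebuilds the `order_id`/`customer_id` pairs from the data above; `order_count` is an illustrative column name.

```python
import pandas as pd

# order_id / customer_id pairs copied from the data above.
orders = pd.DataFrame(
    {
        "order_id": list(range(1001, 1011)),
        "customer_id": [
            "C001", "C002", "C003", "C001", "C004",
            "C005", "C002", "C003", "C006", "C001",
        ],
    }
)

# size() yields one number per group; reset_index() turns the result
# back into a regular two-column DataFrame.
order_counts = (
    orders.groupby("customer_id")
    .size()
    .reset_index(name="order_count")
)

print(order_counts)
```

Unlike `.count()`, `.size()` includes rows with missing values, which matters once the data is less clean than this sample.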


# 🔴 LEVEL 3: Aggregation & Time Awareness (Score 7–10)

## **Q5. Calculate total spending per customer**

### Task

For each `customer_id`, compute:

* Total money spent

### 💡 Hint

Use:

* `.groupby()`
* `.agg()`

### 📚 Reference

* [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html)

### 🧠 Explanation

This introduces **aggregation pipelines** and prepares you for more complex analytics.

In [16]:
q5_data = q2_data.groupby('customer_id').agg({'total_amount': 'sum'})
q5_data

customer_id,total_amount
C001,1355
C002,900
C003,440
C004,500
C005,1100
C006,600
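As a sketch of where `.agg()` leads next, named aggregation computes several statistics in one pass and names each output column explicitly. The amounts below are the `total_amount` values from Q2; the output names `total_spent`, `order_count`, and `largest_order` are illustrative choices.

```python
import pandas as pd

# customer_id and total_amount for all ten orders, copied from Q2.
orders = pd.DataFrame(
    {
        "customer_id": [
            "C001", "C002", "C003", "C001", "C004",
            "C005", "C002", "C003", "C006", "C001",
        ],
        "total_amount": [1200, 300, 350, 75, 500, 1100, 600, 90, 600, 80],
    }
)

# Named aggregation: output_column=(input_column, function).
summary = orders.groupby("customer_id").agg(
    total_spent=("total_amount", "sum"),
    order_count=("total_amount", "count"),
    largest_order=("total_amount", "max"),
)

print(summary)
```

The `total_spent` column should match the Q5 result above, with the extra columns giving context for each total.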


## **Q6. Daily revenue analysis**

### Task

Do the following:

1. Calculate total revenue per day
2. Sort the results by date

### 💡 Hint

Use:

* `parse_dates` in `read_csv`
* `.groupby()`
* `.sort_values()` or `.sort_index()` (after grouping, the dates become the index)

### 📚 Reference

* [https://pandas.pydata.org/docs/user_guide/timeseries.html](https://pandas.pydata.org/docs/user_guide/timeseries.html)

### 🧠 Explanation

Time-based grouping is essential for **business dashboards and reports**.

In [18]:
q6_data = pd.read_csv('https://s3.hothienlac.com/yomitoon/sales_data.csv', parse_dates=['order_date'])
q6_data


,order_id,order_date,customer_id,customer_name,city,product,category,quantity,unit_price,payment_method
0,1001,2024-01-02,C001,Alice,New York,Laptop,Electronics,1,1200,Credit Card
1,1002,2024-01-02,C002,Bob,Los Angeles,Headphones,Electronics,2,150,PayPal
2,1003,2024-01-03,C003,Charlie,New York,Office Chair,Furniture,1,350,Credit Card
3,1004,2024-01-03,C001,Alice,New York,Mouse,Electronics,3,25,Debit Card
4,1005,2024-01-04,C004,Diana,Chicago,Desk,Furniture,1,500,Bank Transfer
5,1006,2024-01-04,C005,Eve,Chicago,Laptop,Electronics,1,1100,Credit Card
6,1007,2024-01-05,C002,Bob,Los Angeles,Monitor,Electronics,2,300,Debit Card
7,1008,2024-01-05,C003,Charlie,New York,Desk Lamp,Furniture,2,45,PayPal
8,1009,2024-01-06,C006,Frank,Miami,Tablet,Electronics,1,600,Credit Card
9,1010,2024-01-06,C001,Alice,New York,Keyboard,Electronics,1,80,Bank Transfer


In [20]:
q6_data = q6_data.assign(total_amount=q6_data['quantity'] * q6_data['unit_price'])
q6_data


,order_id,order_date,customer_id,customer_name,city,product,category,quantity,unit_price,payment_method,total_amount
0,1001,2024-01-02,C001,Alice,New York,Laptop,Electronics,1,1200,Credit Card,1200
1,1002,2024-01-02,C002,Bob,Los Angeles,Headphones,Electronics,2,150,PayPal,300
2,1003,2024-01-03,C003,Charlie,New York,Office Chair,Furniture,1,350,Credit Card,350
3,1004,2024-01-03,C001,Alice,New York,Mouse,Electronics,3,25,Debit Card,75
4,1005,2024-01-04,C004,Diana,Chicago,Desk,Furniture,1,500,Bank Transfer,500
5,1006,2024-01-04,C005,Eve,Chicago,Laptop,Electronics,1,1100,Credit Card,1100
6,1007,2024-01-05,C002,Bob,Los Angeles,Monitor,Electronics,2,300,Debit Card,600
7,1008,2024-01-05,C003,Charlie,New York,Desk Lamp,Furniture,2,45,PayPal,90
8,1009,2024-01-06,C006,Frank,Miami,Tablet,Electronics,1,600,Credit Card,600
9,1010,2024-01-06,C001,Alice,New York,Keyboard,Electronics,1,80,Bank Transfer,80


In [21]:
daily_revenue = q6_data.groupby('order_date')['total_amount'].sum().sort_index()
daily_revenue

order_date,total_amount
2024-01-02,1500
2024-01-03,425
2024-01-04,1600
2024-01-05,690
2024-01-06,680
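Re-reading the CSV is one way to get real dates. As an alternative sketch, `pd.to_datetime` can convert a column on a frame that is already loaded. The rows below are a deliberately shuffled subset of the data above with two days left out, to show both the sorting step and how `resample("D")` fills calendar gaps; the variable names are illustrative.

```python
import pandas as pd

# A shuffled subset of the orders above; 2024-01-04 and 2024-01-05 are
# left out on purpose to show how missing days are handled.
orders = pd.DataFrame(
    {
        "order_date": [
            "2024-01-03", "2024-01-02", "2024-01-02",
            "2024-01-03", "2024-01-06",
        ],
        "total_amount": [350, 1200, 300, 75, 600],
    }
)

# Convert the string column to datetime64 without re-reading the file.
orders = orders.assign(order_date=pd.to_datetime(orders["order_date"]))

# Revenue per day that has at least one order, sorted by date.
daily = orders.groupby("order_date")["total_amount"].sum().sort_index()

# resample("D") also emits the days without orders, with a sum of 0.
daily_full = orders.set_index("order_date")["total_amount"].resample("D").sum()

print(daily)
print(daily_full)
```

Which form is preferable depends on the report: `groupby` lists only active days, while `resample` keeps a continuous calendar, which is usually what a chart needs.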


## **Q7. Rank customers by spending**

### Task

Rank customers from highest to lowest total spending.

### 💡 Hint

Use:

* `.rank()`

### 📚 Reference

* [https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html)

### 🧠 Explanation

Ranking teaches you how Pandas handles **ordering, ties, and numeric comparisons**.

---

In [23]:
q7_data = q5_data.rank(ascending=False)
q7_data

customer_id,total_amount
C001,1.0
C002,3.0
C003,6.0
C004,5.0
C005,2.0
C006,4.0
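The ranks above are floats because the default `method="average"` can produce fractional ranks when values tie. The sketch below keeps the spending next to an integer rank using the Q5 totals, then shows how three tie-breaking methods differ on a small hypothetical series (`tied` is made-up data; the real totals contain no ties).

```python
import pandas as pd

# Total spending per customer, copied from the Q5 result.
totals = pd.Series(
    {"C001": 1355, "C002": 900, "C003": 440, "C004": 500, "C005": 1100, "C006": 600},
    name="total_amount",
)

# Keep the amount and its rank side by side, ordered from top spender down.
ranked = (
    totals.to_frame()
    .assign(spend_rank=totals.rank(ascending=False, method="min").astype(int))
    .sort_values("spend_rank")
)
print(ranked)

# Hypothetical data with a tie, to compare tie-breaking methods.
tied = pd.Series({"A": 600, "B": 600, "C": 500})
average_ranks = tied.rank(ascending=False).tolist()
min_ranks = tied.rank(ascending=False, method="min").tolist()
dense_ranks = tied.rank(ascending=False, method="dense").tolist()
print(average_ranks, min_ranks, dense_ranks)
```

With `method="min"`, tied customers share the best rank and the next rank is skipped; `method="dense"` never skips a rank.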


# 🎯 Learning Outcome by Level

| Level | You can now…                             |
| ----- | ---------------------------------------- |
| 1     | Load, inspect, create columns            |
| 2     | Filter and group data                    |
| 3     | Aggregate, rank, and analyze time series |