
-----

# Homework 7: Time Series Data

## Mission

Using the Beijing PM2.5 dataset, perform two types of feature engineering: creating **lag features (shifting)** and **cyclical time features (encoding)**.

## Task

Your task is to use the [Beijing PM2.5 dataset](https://archive.ics.uci.edu/dataset/381/beijing+pm2+5+data) and implement the data preprocessing and feature engineering steps within the `homework()` function.

1.  **Combining Columns and Creating Index**: Combine the `year`, `month`, `day`, and `hour` columns to create a `datetime` index for a new DataFrame `pm25` containing only the `pm2.5` values.
2.  **Resample Daily**: Resample the data to a daily frequency by taking the last observed value (23:00) for each day, saving it as `pm25_daily`.
3.  **Create Lag Features (add as new columns)**: **Add new columns to `pm25_daily`** for the `pm2.5` value 1, 2, and 3 days prior, named `lag_1`, `lag_2`, and `lag_3`.
4.  **Create Cyclical Time Features (add as new columns)**: From the index, extract the day of month (1-31) and encode it cyclically with $\theta = 2\pi \cdot \text{day} / 31$. **Add new columns** `day_cos` and `day_sin` to `pm25_daily`. \
**Note:** Although months have 28/29/30/31 days, we fix the period at **31** as a simple, consistent approximation of within-month seasonality.
5.  **Return Value**: Return the final DataFrame `pm25_daily` with the following columns **(in this order)**: `pm2.5`, `lag_1`, `lag_2`, `lag_3`, `day_cos`, `day_sin`.
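
As a rough illustration of steps 3 and 4, the sketch below shifts a tiny daily series and encodes the day of month on a circle. The `toy` frame and its values are made up for this example and are not taken from the dataset; the period of 31 comes from the task statement.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: five daily readings spanning a month boundary.
toy = pd.DataFrame(
    {"pm2.5": [10.0, 20.0, 30.0, 40.0, 50.0]},
    index=pd.date_range("2010-01-29", periods=5, freq="D"),
)

# Lag features: shift(k) moves values k rows down, so each row
# holds the reading from k days earlier (the first k rows become NaN).
toy["lag_1"] = toy["pm2.5"].shift(1)
toy["lag_2"] = toy["pm2.5"].shift(2)

# Cyclical encoding of the day of month with a fixed period of 31.
theta = 2 * np.pi * toy.index.day / 31
toy["day_cos"] = np.cos(theta)
toy["day_sin"] = np.sin(theta)

print(toy.round(3))
```

Day 31 maps to $\theta = 2\pi$, that is (cos, sin) = (1, 0), and day 1 lands right next to it on the circle, so the end of a 31-day month connects smoothly to the start of the next one.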

## Inputs/Outputs of `homework()`

* Inputs:
    * `beijing_pm25` (`pd.DataFrame`)
* Outputs:
    * `pm25_daily` (`pd.DataFrame`): DataFrame with the created features
        * The columns should be **(in this order)**: `pm2.5`, `lag_1`, `lag_2`, `lag_3`, `day_cos`, `day_sin`

Note: **Do not include import statements or code for downloading and loading the Beijing PM2.5 dataset.**

## Submission Guidelines
When submitting your solution, submit only the complete `homework()` function. Select this week's assignment in the Omnicampus homework section, paste the function into the submission area, and then click [Submit Python Code].

Please pay attention to the following points when submitting.
- Remove the `!!WRITE ME!!` placeholder before submitting
- Write your answer as a single function
- When the instructions say "create feature(s)", it means **add new columns to the `pd.DataFrame`**.

-----

## Deadline

Wed, Nov 12th, 20:00 JST (GMT+9)

## 1. Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import math

## 2. Download and load Beijing PM2.5 dataset

In [2]:
import requests
import io

# URL of the Beijing PM2.5 CSV file
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00381/PRSA_data_2010.1.1-2014.12.31.csv'

# Fetch the data and fail early if the request returns an HTTP error
response = requests.get(url, timeout=30)
response.raise_for_status()
# Load the data into a DataFrame
beijing_pm25 = pd.read_csv(io.BytesIO(response.content))

## 3. Solution

#### **NOTE: Do not include import statements or code for downloading and loading the Beijing PM2.5 dataset, as explained above.**

In [3]:
# beijing_pm25 is passed in as a pd.DataFrame
def homework(beijing_pm25):
    # 1. Combine the year, month, day, and hour columns into a datetime index
    #    (built as a separate Series so the caller's DataFrame is not modified)
    timestamps = pd.to_datetime(
        beijing_pm25[["year", "month", "day", "hour"]]
    ).rename("datetime")
    pm25 = beijing_pm25[["pm2.5"]].set_index(timestamps)

    # 2. Resample to daily frequency, keeping each day's last observed value (nominally the 23:00 reading)
    pm25_daily = pm25.resample("D").last()

    # 3. Create lag features (values from 1, 2, and 3 days earlier)
    pm25_daily["lag_1"] = pm25_daily["pm2.5"].shift(1)
    pm25_daily["lag_2"] = pm25_daily["pm2.5"].shift(2)
    pm25_daily["lag_3"] = pm25_daily["pm2.5"].shift(3)

    # 4. Create cyclical time features from the day of month
    day = pm25_daily.index.day
    theta = 2 * np.pi * day / 31
    pm25_daily["day_cos"] = np.cos(theta)
    pm25_daily["day_sin"] = np.sin(theta)

    # 5. Order the columns as the instructions require
    pm25_daily = pm25_daily[["pm2.5", "lag_1", "lag_2", "lag_3", "day_cos", "day_sin"]]

    # 6. Return the final DataFrame
    return pm25_daily
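
One detail of step 2 is worth checking. To the best of our knowledge, `resample("D").last()` skips missing values by default, so a day whose 23:00 reading is `NaN` receives the last non-missing reading of that day rather than `NaN`. The sketch below shows this on made-up hourly data; the `hourly` frame and its values are hypothetical and only stand in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings for two days; the values are simply 0..47.
idx = pd.Timestamp("2010-01-01") + pd.to_timedelta(np.arange(48), unit="h")
hourly = pd.DataFrame({"pm2.5": np.arange(48, dtype=float)}, index=idx)

# Remove the 23:00 reading on the first day.
hourly.loc[pd.Timestamp("2010-01-01 23:00"), "pm2.5"] = np.nan

# Last non-missing value per day (the 22:00 reading is used for day 1).
daily_last = hourly.resample("D").last()

# Literal 23:00 rows, for comparison (day 1 stays NaN).
at_2300 = hourly[hourly.index.hour == 23]

print(daily_last)
print(at_2300)
```

The task describes the daily value as the "last observed value", which is consistent with the default behavior. If the literal 23:00 reading were wanted instead, filtering on `index.hour == 23` as above would be one way to get it.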


### Try your code output

Run the cell below to test your code's output.

**Note:** Your `homework()` function must return a DataFrame with **all** of the following columns **in this exact order**:
`pm2.5`, `lag_1`, `lag_2`, `lag_3`, `day_cos`, `day_sin`.
Before submitting, **visually confirm** that these columns exist and appear in this order (no extra/missing columns).


In [4]:
expected = ["pm2.5", "lag_1", "lag_2", "lag_3", "day_cos", "day_sin"]
out = homework(beijing_pm25)
print("Columns:", list(out.columns))
assert list(out.columns) == expected, "Column order or names are incorrect."
print("✅ Passed: columns exist and are in the correct order.")

Columns: ['pm2.5', 'lag_1', 'lag_2', 'lag_3', 'day_cos', 'day_sin']
✅ Passed: columns exist and are in the correct order.
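
Beyond column names, a few value-level checks can catch subtler mistakes. The helper below is a hypothetical sketch: `check_features` is not part of the assignment or the grader. It is demonstrated on a small hand-built frame so that it runs on its own, and it could be pointed at `homework(beijing_pm25)` instead.

```python
import numpy as np
import pandas as pd

def check_features(out: pd.DataFrame) -> None:
    """Hypothetical sanity checks for the engineered features."""
    # The index should hold timestamps.
    assert isinstance(out.index, pd.DatetimeIndex)
    # Each lag_k column should equal pm2.5 shifted down by k rows
    # (NaNs in matching positions are treated as equal).
    for k in (1, 2, 3):
        pd.testing.assert_series_equal(
            out[f"lag_{k}"], out["pm2.5"].shift(k), check_names=False
        )
    # Every (cos, sin) pair should lie on the unit circle.
    assert np.allclose(out["day_cos"] ** 2 + out["day_sin"] ** 2, 1.0)

# Small hand-built frame standing in for the real output.
idx = pd.date_range("2010-01-01", periods=6, freq="D")
demo = pd.DataFrame({"pm2.5": [5.0, 7.0, np.nan, 9.0, 4.0, 6.0]}, index=idx)
for k in (1, 2, 3):
    demo[f"lag_{k}"] = demo["pm2.5"].shift(k)
theta = 2 * np.pi * idx.day / 31
demo["day_cos"] = np.cos(theta)
demo["day_sin"] = np.sin(theta)

check_features(demo)
print("All value-level checks passed.")
```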
