In [37]:
# Initialize Otter
import otter
grader = otter.Notebook("hw8.ipynb")

# CPSC 330 - Applied Machine Learning

## Homework 8: Introduction to Computer Vision and Time Series

**Due date: see the [Calendar](https://htmlpreview.github.io/?https://github.com/UBC-CS/cpsc330/blob/master/docs/calendar.html).**

## Imports

In [38]:
from hashlib import sha1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder

from sklearn.linear_model import Ridge, LinearRegression
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor

from sklearn.metrics import r2_score

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
## Instructions
rubric={points}

You will earn points for following these instructions and successfully submitting your work on Gradescope.  

### Group work instructions

**You may work with a partner on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2.
  
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).
- If you would like to use late tokens for the homework, all group members must have the necessary late tokens available. Please note that the late tokens will be counted for all members of the group.   


### General submission instructions

- Please **carefully read the [Use of Generative AI policy](https://ubc-cs.github.io/cpsc330-2025W1/syllabus.html#use-of-generative-ai-in-the-course)** before starting the homework assignment. 
- **Run all cells before submitting:** Go to `Kernel -> Restart Kernel and Clear All Outputs`, then select `Run -> Run All Cells`. This ensures your notebook runs cleanly from start to finish without errors.
  
- **Submit your files on Gradescope.**  
   - Upload only your `.ipynb` file **with outputs displayed** and any required output files.
     
   - Do **not** submit other files from your repository.  
   - If you need help, see the [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/).  
- **Check that outputs render properly.**  
   - Make sure all plots and outputs appear in your submission.
     
   - If your `.ipynb` file is too large and doesn't render on Gradescope, also upload a PDF or HTML version so the TAs can view your work.  
- **Keep execution order clean.**  
   - Execution numbers must start at "1" and increase in order.
     
   - Notebooks without visible outputs may not be graded.  
   - Out-of-order or missing execution numbers may result in mark deductions.  
- **Follow course submission guidelines:** Review the [CPSC 330 homework instructions](https://ubc-cs.github.io/cpsc330-2025W1/docs/homework_instructions.html) for detailed guidance on completing and submitting assignments. 
   
</div>

_Points:_ 2

<!-- END QUESTION -->

<br><br>

## Exercise 1: time series prediction

In this exercise we'll be looking at a [dataset of avocado prices](https://www.kaggle.com/neuromusic/avocado-prices). You should start by downloading the dataset and storing it under the `data` folder. We will be forecasting the average avocado price for the next week. 

In [39]:
df = pd.read_csv("data/avocado.csv", parse_dates=["Date"], index_col=0)
df.head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


In [40]:
df.shape

(18249, 13)

In [41]:
df["Date"].min()

Timestamp('2015-01-04 00:00:00')

In [42]:
df["Date"].max()

Timestamp('2018-03-25 00:00:00')

It looks like the data ranges from the start of 2015 to March 2018 (~2 years ago), for a total of 3.25 years or so. Let's split the data so that we have 6 months of test data.

In [43]:
split_date = '20170925'
df_train = df[df["Date"] <= split_date]
df_test  = df[df["Date"] >  split_date]

In [44]:
assert len(df_train) + len(df_test) == len(df)
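
The date-based split above can be illustrated on a toy frame. This is a minimal sketch with made-up values; the names `toy`, `cutoff`, `toy_train`, and `toy_test` are hypothetical and not part of the assignment.

```python
import pandas as pd

# Toy weekly series (hypothetical values, not the avocado data).
toy = pd.DataFrame({
    "Date": pd.date_range("2017-09-03", periods=6, freq="7D"),
    "price": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
})

# Everything up to and including the cutoff is training data; the rest is test data.
cutoff = pd.Timestamp("2017-09-25")
toy_train = toy[toy["Date"] <= cutoff]
toy_test = toy[toy["Date"] > cutoff]

# Every training date precedes every test date, so no future rows leak into training.
assert toy_train["Date"].max() < toy_test["Date"].min()
print(len(toy_train), len(toy_test))
```

Splitting on a cutoff date rather than at random keeps the test set strictly in the future of the training set, which matches how a forecast would be used.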

<br><br>

<!-- BEGIN QUESTION -->

### 1.1 How many time series? 
rubric={points:4}

In the [Rain in Australia](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package) dataset from the lecture demo, we had different measurements for each Location. 

We want you to consider this for the avocado prices dataset. For which categorical feature(s), if any, do we have separate measurements? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.1
    
</div>

_Points:_ 4

We have separate measurements for each `region` and `type`.
Each row in the dataset corresponds to a unique (`Date`, `region`, `type`) combination, meaning that `AveragePrice` and `Total Volume` are tracked over time separately for every (`region`, `type`) pair.
Thus, the categorical features defining separate time series are `region` and `type`: 54 regions × 2 types gives 108 series.

In [45]:
print("Unique regions:", df["region"].nunique())
print("Unique types:", df["type"].nunique())
num_ts = df.groupby(["region", "type"]).ngroups
print("Number of separate time series:", num_ts)


Unique regions: 54
Unique types: 2
Number of separate time series: 108


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.2 Equally spaced measurements? 
rubric={points:4}

In the Rain in Australia dataset, the measurements were generally equally spaced but with some exceptions. How about with this dataset? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.2
    
</div>

_Points:_ 4

The avocado dataset is almost, but not perfectly, equally spaced.
The measurements are weekly, and within each (region, type) series nearly all consecutive dates are exactly 7 days apart. The check below finds only two exceptions, both in the (WestTexNewMexico, organic) series: a 14-day gap ending on 2015-12-13 and a 21-day gap ending on 2017-07-02. So the spacing is generally equal, with a few missing weeks in one series.

In [48]:
df = df.sort_values("Date")

df["date_diff"] = df.groupby(["region", "type"])["Date"].diff().dt.days
not_weekly = df[df["date_diff"].notna() & (df["date_diff"] != 7)]

print("Number of non-weekly gaps:", len(not_weekly))
print("Sample of non-weekly gaps:")
print(not_weekly[["region", "type", "Date", "date_diff"]].head())

Number of non-weekly gaps: 2
Sample of non-weekly gaps:
              region     type       Date  date_diff
2   WestTexNewMexico  organic 2015-12-13       14.0
26  WestTexNewMexico  organic 2017-07-02       21.0


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.3 Interpreting regions 
rubric={points:4}

In the Rain in Australia dataset, each location was a different place in Australia. For this dataset, look at the names of the regions. Do you think the regions are also all distinct, or are there overlapping regions? Justify your answer by referencing the data.

<div class="alert alert-warning">

Solution_1.3
    
</div>

_Points:_ 4

The regions in the avocado dataset are not strictly distinct places. Many region names represent large geographic or market areas rather than unique locations. For example, the dataset contains regions such as “West”, “Northeast”, “GreatLakes”, and “TotalUS”, which clearly overlap with smaller city-based regions like “Albany”, “LosAngeles”, or “Chicago”. Since some regions represent broad areas and others represent individual cities, the regions are not mutually exclusive. Therefore, unlike the Rain in Australia dataset, these regions overlap.

In [55]:
regions = sorted(df["region"].unique())
print("Number of regions:", len(regions))
print(regions)

Number of regions: 54
['Albany', 'Atlanta', 'BaltimoreWashington', 'Boise', 'Boston', 'BuffaloRochester', 'California', 'Charlotte', 'Chicago', 'CincinnatiDayton', 'Columbus', 'DallasFtWorth', 'Denver', 'Detroit', 'GrandRapids', 'GreatLakes', 'HarrisburgScranton', 'HartfordSpringfield', 'Houston', 'Indianapolis', 'Jacksonville', 'LasVegas', 'LosAngeles', 'Louisville', 'MiamiFtLauderdale', 'Midsouth', 'Nashville', 'NewOrleansMobile', 'NewYork', 'Northeast', 'NorthernNewEngland', 'Orlando', 'Philadelphia', 'PhoenixTucson', 'Pittsburgh', 'Plains', 'Portland', 'RaleighGreensboro', 'RichmondNorfolk', 'Roanoke', 'Sacramento', 'SanDiego', 'SanFrancisco', 'Seattle', 'SouthCarolina', 'SouthCentral', 'Southeast', 'Spokane', 'StLouis', 'Syracuse', 'Tampa', 'TotalUS', 'West', 'WestTexNewMexico']


<!-- END QUESTION -->

<br><br>

We will use the entire dataset despite any location-based weirdness uncovered in the previous part.

We will be trying to forecast the avocado price. The function below is adapted from the lecture.

In [58]:
import pandas as pd


def create_lag_feature(
    df: pd.DataFrame,
    orig_feature: str,
    lag: int,
    groupby: list[str],
    new_feature_name: str | None = None,
    clip: bool = False,
) -> pd.DataFrame:
    """
    Create a lagged (or ahead) version of a feature, optionally per group.

    Assumes df is already sorted by time within each group and has unique indices.

    Parameters
    ----------
    df : pd.DataFrame
        The dataset.
    orig_feature : str
        Name of the column to lag.
    lag : int
        The lag:
          - negative → values from the past (t-1, t-2, ...)
          - positive → values from the future (t+1, t+2, ...)
    groupby : list of str
        Column(s) to group by if df contains multiple time series.
    new_feature_name : str, optional
        Name of the new column. If None, a name is generated automatically.
    clip : bool, default False
        If True, drop rows where the new feature is NaN.

    Returns
    -------
    pd.DataFrame
        A new dataframe with the additional column added.
    """
    if lag == 0:
        raise ValueError("lag cannot be 0 (no shift). Use the original feature instead.")

    # Default name if not provided
    if new_feature_name is None:
        if lag < 0:
            new_feature_name = f"{orig_feature}_lag{abs(lag)}"
        else:
            new_feature_name = f"{orig_feature}_ahead{lag}"

    df = df.copy()

    # Map your convention (negative=past, positive=future) to pandas shift
    # pandas: shift(+k) → past, shift(-k) → future
    periods = abs(lag) if lag < 0 else -lag

    df[new_feature_name] = (
        df.groupby(groupby, sort=False)[orig_feature]
          .shift(periods)
    )

    if clip:
        df = df.dropna(subset=[new_feature_name])

    return df
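
To see what the grouped shift inside `create_lag_feature` does, here is a minimal sketch on a toy frame. The frame and the column names `price_next` and `price_prev` are hypothetical; only `groupby(...).shift(...)` is taken from the function above.

```python
import pandas as pd

# Two short series stacked in one frame, already sorted by time within each region.
toy = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "price": [1.0, 1.1, 1.2, 2.0, 2.1, 2.2],
})

grouped = toy.groupby("region", sort=False)["price"]

# shift(-1) pulls the next row's value up (a value from the future).
toy["price_next"] = grouped.shift(-1)
# shift(1) pushes the previous row's value down (a value from the past).
toy["price_prev"] = grouped.shift(1)

print(toy)
```

Because the shift is applied per group, the last row of region A gets `NaN` for `price_next` rather than the first price of region B. Those `NaN` rows are the ones that `clip=True` drops.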


We first sort our dataframe properly:

In [59]:
df_sort = df.sort_values(by=["region", "type", "Date"]).reset_index(drop=True)
df_sort

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region,date_diff
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany,
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany,7.0
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany,7.0
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany,7.0
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico,7.0
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico,7.0
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico,7.0
18247,2018-03-18,1.56,15896.38,2055.35,1499.55,0.00,12341.48,12114.81,226.67,0.0,organic,2018,WestTexNewMexico,7.0


We then call `create_lag_feature`. This creates a new column in the dataset `AveragePriceNextWeek`, which is the following week's `AveragePrice`. We have set `clip=True` which means it will remove rows where the target would be missing.

In [60]:
df_hastarget = create_lag_feature(df_sort, "AveragePrice", +1, ["region", "type"], "AveragePriceNextWeek", clip=True)
df_hastarget

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region,date_diff,AveragePriceNextWeek
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany,,1.24
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany,7.0,1.17
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany,7.0,1.06
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany,7.0,0.99
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany,7.0,0.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18243,2018-02-18,1.56,17597.12,1892.05,1928.36,0.00,13776.71,13553.53,223.18,0.0,organic,2018,WestTexNewMexico,7.0,1.57
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico,7.0,1.54
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico,7.0,1.56
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico,7.0,1.56


Our goal is to predict `AveragePriceNextWeek`. 

Let's split the data:

In [61]:
df_train = df_hastarget[df_hastarget["Date"] <= split_date]
df_test  = df_hastarget[df_hastarget["Date"] >  split_date]

<br><br>

<!-- BEGIN QUESTION -->

### 1.4 `AveragePrice` baseline 
rubric={points}

Soon we will want to build some models to forecast the average avocado price a week in advance. Before we start with any ML though, let's try a baseline. Previously we used `DummyClassifier` or `DummyRegressor` as a baseline. This time, we'll do something else as a baseline: we'll assume the price stays the same from this week to next week. So, we'll set our prediction of "AveragePriceNextWeek" exactly equal to "AveragePrice", assuming no change. That is kind of like saying, "If it's raining today then I'm guessing it will be raining tomorrow". This simplistic approach will not get a great score but it's a good starting point for reference. If our model does worse than this, it must not be very good. 

Using this baseline approach, what $R^2$ do you get on the train and test data?

<div class="alert alert-warning">

Solution_1.4
    
</div>

_Points:_ 4

Train R2: 0.8285800937261841
Test R2: 0.7631780188583048

In [62]:
y_train_true = df_train["AveragePriceNextWeek"]
y_test_true  = df_test["AveragePriceNextWeek"]
y_train_pred = df_train["AveragePrice"]  
y_test_pred  = df_test["AveragePrice"]


train_r2 = r2_score(y_train_true, y_train_pred)

In [63]:
test_r2 = r2_score(y_test_true, y_test_pred)

In [64]:
print("Train R2:", train_r2)
print("Test R2:", test_r2)

Train R2: 0.8285800937261841
Test R2: 0.7631780188583048
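
For reference, the $R^2$ of a "no change" baseline can be computed by hand as $1 - SS_{res}/SS_{tot}$. The sketch below uses a short made-up price series; the array `prices` and its values are hypothetical and unrelated to the avocado data.

```python
import numpy as np

# Hypothetical prices for five consecutive weeks.
prices = np.array([1.0, 1.1, 1.3, 1.6, 2.0])

y_true = prices[1:]   # next week's price
y_pred = prices[:-1]  # baseline prediction: this week's price

ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the baseline
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
r2 = 1 - ss_res / ss_tot
print(r2)
```

$R^2$ compares the baseline against always predicting the mean of the targets, so a baseline that tracks the series closely scores near 1, while one that lags a fast-moving series can score low or even negative.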


In [66]:
assert train_r2 is not None, "Are you using the correct variable name?"
assert test_r2 is not None, "Are you using the correct variable name?"
assert sha1(str(round(train_r2, 3)).encode('utf8')).hexdigest() == 'b1136fe2a8918904393ab6f40bfb3f38eac5fc39', "Your training score is not correct. Are you using the right features?"
assert sha1(str(round(test_r2, 3)).encode('utf8')).hexdigest() == 'cc24d9a9b567b491a56b42f7adc582f2eefa5907', "Your test score is not correct. Are you using the right features?"

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.5 Forecasting average avocado price
rubric={points:10}

Now that the baseline is done, let's build some models to forecast the average avocado price a week later. Experiment with a few approaches for encoding the date. Justify the decisions you make. Which approach worked best? Report your test score and briefly discuss your results.

Benchmark: you should be able to achieve $R^2$ of at least 0.79 on the test set. I got to 0.80, but not beyond that. Let me know if you do better!

Note: because we only have 2 splits here, we need to be a bit wary of overfitting on the test set. Try not to test on it a ridiculous number of times. If you are interested in some proper ways of dealing with this, see for example sklearn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), which is like cross-validation for time series data.
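
As a rough illustration of the `TimeSeriesSplit` idea mentioned in the note, the sketch below prints the folds for eight time-ordered placeholder samples. The array `X` and the choice of `n_splits=3` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # 8 time-ordered samples (placeholder data)

tscv = TimeSeriesSplit(n_splits=3)
folds = [
    (train_idx.tolist(), val_idx.tolist())
    for train_idx, val_idx in tscv.split(X)
]

# In every fold the validation indices come strictly after the training indices.
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each fold trains on an expanding window of earlier samples and validates on the block that follows, so model selection can be done without touching the held-out test period.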

<div class="alert alert-warning">

Solution_1.5
    
</div>

_Points:_ 10

To forecast the average avocado price one week ahead, I experimented with three ways of encoding the date, each paired with several regression models. The date encodings were:
1. Ordinal time: days elapsed since the earliest date
2. Calendar parts: year, month, week-of-year, and day-of-week
3. Cyclical encoding: sin/cos transformations of week-of-year to capture seasonality

For each encoding, I trained Linear Regression, Ridge Regression, Random Forest, and Gradient Boosting, and evaluated each model using $R^2$ on both the training and test sets.

Across all combinations, the best performance came from:
- Date encoding: cyclical (seasonal sin/cos)
- Model: Ridge Regression (alpha = 1.0)
- Train $R^2$: 0.8465
- Test $R^2$: 0.8056

This result exceeds the benchmark of $R^2 \ge 0.79$ and improves on the baseline test $R^2$ of 0.763. The cyclical encoding worked best, although its margin over ordinal time was small (test $R^2$ of 0.806 versus 0.803), which is consistent with avocado prices following a yearly pattern. Linear and Ridge Regression performed well with these cyclical features, suggesting that the relationship between the features and next week's price is fairly smooth and close to linear. Tree-based models tended to overfit more: Random Forest with the ordinal encoding reached a train $R^2$ of 0.980 but a test $R^2$ of only 0.783.

Overall, encoding the date using cyclical seasonality and applying a regularized linear model produced the most stable and accurate forecasts.

In [67]:
target_col = "AveragePriceNextWeek"

X_train = df_train.drop(columns=[target_col])
y_train = df_train[target_col]

X_test = df_test.drop(columns=[target_col])
y_test = df_test[target_col]

num_feats = [
    "AveragePrice", "Total Volume", "4046", "4225", "4770",
    "Total Bags", "Small Bags", "Large Bags", "XLarge Bags"
]
cat_feats = ["type", "region"]
date_feat = "Date"

In [68]:
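def add_date_ordinal(df, col="Date"):
    """Add days elapsed since the earliest date in `df` as `<col>_ordinal`."""
    df = df.copy()
    # The origin is the minimum date of the frame passed in, so calling this
    # separately on train and test measures days from different start dates.
    df[col + "_ordinal"] = (df[col] - df[col].min()).dt.days
    return df


def add_date_parts(df, col="Date"):
    """Add year, month, ISO week-of-year, and day-of-week columns."""
    df = df.copy()
    df[col + "_year"] = df[col].dt.year
    df[col + "_month"] = df[col].dt.month
    df[col + "_week"] = df[col].dt.isocalendar().week.astype(int)
    df[col + "_dow"] = df[col].dt.dayofweek
    return df


def add_date_cyclical(df, col="Date"):
    """Add sin/cos of the ISO week-of-year, using a 52-week period."""
    df = df.copy()
    # ISO weeks run from 1 to 53; with a period of 52, week 53 maps to the same point as week 1.
    week = df[col].dt.isocalendar().week.astype(int)
    df[col + "_sin"] = np.sin(2 * np.pi * week / 52)
    df[col + "_cos"] = np.cos(2 * np.pi * week / 52)
    return df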

In [69]:
def run_model_with_encoding(encoder_fn, models, label):
    Xtr = encoder_fn(X_train)
    Xte = encoder_fn(X_test)

    new_num_feats = num_feats.copy()
    new_cat_feats = cat_feats.copy()

    for c in Xtr.columns:
        if c.startswith("Date_") and c not in new_num_feats:
            new_num_feats.append(c)

    preproc = ColumnTransformer(
        transformers=[
            ("num", Pipeline([
                ("impute", SimpleImputer(strategy="median")),
                ("scale", StandardScaler())
            ]), new_num_feats),
            ("cat", OneHotEncoder(handle_unknown="ignore"), new_cat_feats),
        ],
        remainder="drop"
    )

    results = []
    for name, model in models.items():
        pipe = Pipeline([
            ("preprocess", preproc),
            ("model", model)
        ])

        pipe.fit(Xtr, y_train)

        pred_tr = pipe.predict(Xtr)
        pred_te = pipe.predict(Xte)

        tr_r2 = r2_score(y_train, pred_tr)
        te_r2 = r2_score(y_test, pred_te)

        results.append((label, name, tr_r2, te_r2))

    return results

In [70]:
models = {
    "LinearRegression": LinearRegression(),
    "Ridge(alpha=1.0)": Ridge(alpha=1.0),
    "RandomForest": RandomForestRegressor(
        n_estimators=300, random_state=0, n_jobs=-1, max_depth=None
    ),
    "GradientBoosting": GradientBoostingRegressor(random_state=0)
}

all_results = []
all_results += run_model_with_encoding(add_date_ordinal, models, "Date Ordinal")
all_results += run_model_with_encoding(add_date_parts, models, "Date Parts")
all_results += run_model_with_encoding(add_date_cyclical, models, "Date Cyclical")

results_df = pd.DataFrame(
    all_results, columns=["DateEncoding", "Model", "Train_R2", "Test_R2"]
).sort_values("Test_R2", ascending=False)

results_df

Unnamed: 0,DateEncoding,Model,Train_R2,Test_R2
9,Date Cyclical,Ridge(alpha=1.0),0.846459,0.805552
8,Date Cyclical,LinearRegression,0.846459,0.805541
1,Date Ordinal,Ridge(alpha=1.0),0.845578,0.80264
0,Date Ordinal,LinearRegression,0.845579,0.802625
3,Date Ordinal,GradientBoosting,0.862717,0.800138
11,Date Cyclical,GradientBoosting,0.861562,0.798209
7,Date Parts,GradientBoosting,0.863439,0.79319
5,Date Parts,Ridge(alpha=1.0),0.845828,0.783535
4,Date Parts,LinearRegression,0.845828,0.7835
2,Date Ordinal,RandomForest,0.979886,0.782934


In [71]:
best_row = results_df.iloc[0]
best_row

DateEncoding       Date Cyclical
Model           Ridge(alpha=1.0)
Train_R2                0.846459
Test_R2                 0.805552
Name: 9, dtype: object

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 2: Short answer questions

<!-- BEGIN QUESTION -->

### 2.1 Time series

rubric={points:6}

The following questions pertain to Lecture 20 on time series data:

1. Sometimes a time series has missing time points or, worse, time points that are unequally spaced in general. Give an example of a real world situation where the time series data would have unequally spaced time points.
2. In class we discussed two approaches to using temporal information: encoding the date as one or more features, and creating lagged versions of features. Which of these (one/other/both/neither) two approaches would struggle with unequally spaced time points? Briefly justify your answer.
3. When studying time series modeling, we explored several ways to encode date information as a feature for the citibike dataset. When we used time of day as a numeric feature, the Ridge model was not able to capture the periodic pattern. Why? How did we tackle this problem? Briefly explain.

<div class="alert alert-warning">

Solution_2.1
    
</div>

_Points:_ 6

1. Emergency room arrivals - patients arrive at irregular times, producing an unequally spaced time series of arrival timestamps.
2. Lag features struggle with unequally spaced time points. Lag features assume a fixed temporal gap (e.g., last week, last hour, last observation). When time intervals are irregular, “lag 1” does not correspond to a consistent amount of time, making the meaning of the lag ambiguous.
3. We encoded time of day as a numeric variable like:

    - $\text{hour of day} = 0,1,2,\dots,23$

    This creates a linear scale, but the true pattern is cyclical: hour 23 is next to hour 0.
    A linear model (like Ridge) cannot naturally learn circular relationships, so it fails to model the repeating daily pattern.

    How we fixed it:
    We used cyclical encoding:

    - $\sin(2\pi \tfrac{\text{hour}}{24}), \quad \cos(2\pi \tfrac{\text{hour}}{24})$

    This turns the hour into a point on a circle, correctly capturing periodicity.
    Once we used sin/cos features, Ridge successfully learned the repeating 24-hour pattern.
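
The cyclical encoding described in item 3 can be checked numerically. This is a minimal sketch; the helper `encode_hour` is a hypothetical name introduced only for this illustration.

```python
import numpy as np


def encode_hour(hour):
    """Map an hour of day (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])


# On the raw numeric scale, hour 23 and hour 0 look far apart.
raw_gap = abs(23 - 0)

# On the circle, 23 -> 0 is the same one-hour step as 0 -> 1.
cyc_gap_23_0 = np.linalg.norm(encode_hour(23) - encode_hour(0))
cyc_gap_0_1 = np.linalg.norm(encode_hour(0) - encode_hour(1))
# Hours 0 and 12 sit on opposite sides of the circle.
cyc_gap_0_12 = np.linalg.norm(encode_hour(0) - encode_hour(12))

print(raw_gap, cyc_gap_23_0, cyc_gap_0_1, cyc_gap_0_12)
```

Both sin and cos are needed: with only one of them, two different hours would share the same feature value.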

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.2 Computer vision 
rubric={points:6}

The following questions pertain to the lecture on multiclass classification and introduction to computer vision. 

1. How many parameters (coefficients and intercepts) will `sklearn`’s `LogisticRegression()` model learn for a four-class classification problem, assuming that you have 10 features? Briefly explain your answer.
2. In Lecture 19, we briefly discussed how neural networks are sort of like `sklearn`'s pipelines, in the sense that they involve multiple sequential transformations of the data, finally resulting in the prediction. Why was this property useful when it came to transfer learning?
3. Imagine that you have a small dataset with ~1000 images containing pictures and names of 50 different Computer Science faculty members from UBC. Your goal is to develop a reasonably accurate multi-class classification model for this task. Describe which model/technique you would use and briefly justify your choice in one to three sentences.

<div class="alert alert-warning">

Solution_2.2
    
</div>

_Points:_ 6

1. For a multiclass problem, `LogisticRegression()` learns one weight vector and one intercept per class.

    Each class has:
    - 10 coefficients (one per feature)
    - 1 intercept

    So for 4 classes, total parameters =
    4 * (10 + 1) = 44
2. Neural networks build representations layer-by-layer, where early layers learn general visual features (edges, shapes, textures) and later layers learn task-specific features.
    This modular structure allows us to reuse (transfer) the early layers learned on a large dataset and only retrain the final layers for a new task, saving time and greatly improving performance on small datasets.
3. I would use transfer learning with a pretrained convolutional neural network such as ResNet, MobileNet, or EfficientNet, and fine-tune only the last layers.
This works well because pretrained CNNs already encode strong general-purpose visual features, and fine-tuning requires far less data - ideal for a small dataset like 1000 images.
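
The parameter count from item 1 can be confirmed by fitting a model on synthetic data and inspecting `coef_` and `intercept_`. This is a sketch under assumed data: the random features, the label pattern, and `max_iter=1000` are choices made only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # 10 synthetic features
y = np.arange(200) % 4          # 4 classes, each with 50 samples

clf = LogisticRegression(max_iter=1000).fit(X, y)

# One row of coefficients and one intercept per class.
n_params = clf.coef_.size + clf.intercept_.size
print(clf.coef_.shape, clf.intercept_.shape, n_params)
```

The labels here carry no signal, so the fitted values are meaningless; only the shapes of the learned arrays matter for the count.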

<!-- END QUESTION -->

<br><br>

Before submitting your assignment, please make sure you have followed all the instructions in the Submission Instructions section at the top. 

Here is a quick checklist before submitting: 

- [ ] Restart kernel, clear outputs, and run all cells from top to bottom.  
- [ ] `.ipynb` file runs without errors and contains all outputs.  
- [ ] Only `.ipynb` and required output files are uploaded (no extra files).  
- [ ] Execution numbers start at **1** and are in order.  
- [ ] If `.ipynb` is too large and doesn't render on Gradescope, also upload a PDF/HTML version.  
- [ ] Reviewed the [CPSC 330 homework instructions](https://ubc-cs.github.io/cpsc330-2025W1/docs/homework_instructions.html).  

![](img/eva-well-done.png)