In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("hw8.ipynb")

# CPSC 330 - Applied Machine Learning

## Homework 8: Introduction to Computer vision and Time Series (Lectures 19 and 20) 

**Due date: see the [Calendar](https://htmlpreview.github.io/?https://github.com/UBC-CS/cpsc330/blob/master/docs/calendar.html).**

## Imports

In [2]:
from hashlib import sha1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import r2_score

<div class="alert alert-info">
    
## Submission instructions
<hr>
rubric={points:2}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330-2023W1/blob/main/docs/homework_instructions.md). 

**You may work in a group on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2. 
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).


When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission.
4. Make sure that the plots and output are rendered properly in your submitted file. 
5. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb.

<br><br>

## Exercise 1: time series prediction

In this exercise we'll be looking at a [dataset of avocado prices](https://www.kaggle.com/neuromusic/avocado-prices). You should start by downloading the dataset and storing it under the `data` folder. We will be forecasting the average avocado price for the next week.

In [3]:
df = pd.read_csv("data/avocado.csv", parse_dates=["Date"], index_col=0)
df.head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


In [4]:
df.shape

(18249, 13)

In [5]:
df["Date"].min()

Timestamp('2015-01-04 00:00:00')

In [6]:
df["Date"].max()

Timestamp('2018-03-25 00:00:00')

It looks like the data ranges from the start of 2015 to March 2018 (~2 years ago), for a total of 3.25 years or so. Let's split the data so that we have 6 months of test data.

In [7]:
split_date = '20170925'
df_train = df[df["Date"] <= split_date]
df_test  = df[df["Date"] >  split_date]

In [8]:
assert len(df_train) + len(df_test) == len(df)

<br><br>

<!-- BEGIN QUESTION -->

### 1.1 How many time series? 
rubric={points:4}

In the [Rain in Australia](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package) dataset from the lecture demo, we had different measurements for each Location.

We want you to consider this for the avocado prices dataset. For which categorical feature(s), if any, do we have separate measurements? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.1
    
</div>

_Points:_ 4


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 18249 entries, 0 to 11
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          18249 non-null  datetime64[ns]
 1   AveragePrice  18249 non-null  float64       
 2   Total Volume  18249 non-null  float64       
 3   4046          18249 non-null  float64       
 4   4225          18249 non-null  float64       
 5   4770          18249 non-null  float64       
 6   Total Bags    18249 non-null  float64       
 7   Small Bags    18249 non-null  float64       
 8   Large Bags    18249 non-null  float64       
 9   XLarge Bags   18249 non-null  float64       
 10  type          18249 non-null  object        
 11  year          18249 non-null  int64         
 12  region        18249 non-null  object        
dtypes: datetime64[ns](1), float64(9), int64(1), object(2)
memory usage: 1.9+ MB


In [10]:
df.sort_values(by=["Date","region"], ascending=False).head()


Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2018-03-25,0.84,965185.06,438526.12,199585.9,11017.42,316055.62,153009.89,160999.1,2046.63,conventional,2018,WestTexNewMexico
0,2018-03-25,1.62,15303.4,2325.3,2171.66,0.0,10806.44,10569.8,236.64,0.0,organic,2018,WestTexNewMexico
0,2018-03-25,0.93,7667064.46,2567279.74,1912986.38,118289.91,3068508.43,1309580.19,1745630.06,13298.18,conventional,2018,West
0,2018-03-25,1.6,271723.08,26996.28,77861.39,117.56,166747.85,87108.0,79495.39,144.46,organic,2018,West
0,2018-03-25,1.03,43409835.75,14130799.1,12125711.42,758801.12,16394524.11,12540327.19,3544729.39,309467.53,conventional,2018,TotalUS


In [11]:
print(df["Date"].min())
print(df["Date"].max())

2015-01-04 00:00:00
2018-03-25 00:00:00


The avocado dataset has separate measurements for each `region`, as can be seen above. It also has measurements taken on the same day for different types of avocado: in the sorted output, 2018-03-25 appears twice for WestTexNewMexico, once as conventional and once as organic. So we have separate measurements for both `region` and `type`, and each (region, type) combination forms its own time series.
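As a rough illustration of the answer above, the sketch below counts how many separate time series a frame contains by grouping on `region` and `type`. The `toy` frame and its rows are made up for the example; only the column names follow the avocado data.

```python
import pandas as pd

# Toy rows (made up for illustration) with the same key columns as the avocado data
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2018-03-18", "2018-03-18",
        "2018-03-25", "2018-03-25", "2018-03-25", "2018-03-25",
    ]),
    "region": ["West", "West", "West", "West", "TotalUS", "TotalUS"],
    "type": ["conventional", "organic"] * 3,
})

# Each distinct (region, type) pair is treated as one time series
rows_per_series = toy.groupby(["region", "type"]).size()
n_series = len(rows_per_series)

# Within a well-formed series, no date should appear more than once
has_duplicate_dates = toy.duplicated(subset=["region", "type", "Date"]).any()

print(rows_per_series)
print("Number of time series:", n_series)
print("Duplicate dates within a series:", has_duplicate_dates)
```

Running the same `groupby` on `df` would give the number of series in the real dataset.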

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.2 Equally spaced measurements? 
rubric={points:4}

In the Rain in Australia dataset, the measurements were generally equally spaced but with some exceptions. How about with this dataset? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.2
    
</div>

_Points:_ 4


In [12]:
# taken from demo 20
def plot_time_spacing_distribution(df, feature, region):
    """
    Prints the counts of time spacings for one value of a categorical feature.
    
    Parameters:
        df (pd.DataFrame): The input DataFrame with a 'Date' column and the `feature` column.
        feature (str): The categorical column to filter on (e.g., 'region' or 'type').
        region (str): The value of `feature` to analyze.
    """
    # Ensure 'Date' is in datetime format
    df['Date'] = pd.to_datetime(df['Date'])
    
    # Filter data for the given region
    region_data = df[df[feature] == region]
    
    if region_data.empty:
        print(f"No data available for region: {region}")
        return
    
    # Calculate time differences
    time_diffs = region_data['Date'].sort_values().diff().dropna()
    
    # Count the frequency of each time difference
    value_counts = time_diffs.value_counts().sort_index()
    
    # Display value counts
    print(f"Time spacing counts for {region}:\n{value_counts}\n")
    
    # # Plot the bar chart
    # plt.bar(value_counts.index.astype(str), value_counts.values, color='skyblue', edgecolor='black')
    # plt.title(f"Time Difference Distribution for {region}")
    # plt.xlabel("Time Difference (days)")
    # plt.ylabel("Frequency")
    # plt.xticks(rotation=45)
    # plt.grid(axis='y', linestyle='--', alpha=0.7)
    # plt.show()

In [13]:
unique_regions = df['region'].unique()

for region in unique_regions:
    plot_time_spacing_distribution(df, 'region', region)

Time spacing counts for Albany:
Date
0 days    169
7 days    168
Name: count, dtype: int64

Time spacing counts for Atlanta:
Date
0 days    169
7 days    168
Name: count, dtype: int64

Time spacing counts for BaltimoreWashington:
Date
0 days    169
7 days    168
Name: count, dtype: int64

Time spacing counts for Boise:
Date
0 days    169
7 days    168
Name: count, dtype: int64

Time spacing counts for Boston:
Date
0 days    169
7 days    168
Name: count, dtype: int64

Time spacing counts for BuffaloRochester:
Date
0 days    169
7 days    168
Name: count, dtype: int64

Time spacing counts for California:
Date
0 days    169
7 days    168
Name: count, dtype: int64

Time spacing counts for Charlotte:
Date
0 days    169
7 days    168
Name: count, dtype: int64

Time spacing counts for Chicago:
Date
0 days    169
7 days    168
Name: count, dtype: int64

Time spacing counts for CincinnatiDayton:
Date
0 days    169
7 days    168
Name: count, dtype: int64

Time spacing counts for Columbus:
Date


In [14]:
avocado_types = df['type'].unique()

for avocado_type in avocado_types:
    plot_time_spacing_distribution(df, 'type', avocado_type)

Time spacing counts for conventional:
Date
0 days    8957
7 days     168
Name: count, dtype: int64

Time spacing counts for organic:
Date
0 days    8954
7 days     168
Name: count, dtype: int64



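Within each region shown above, consecutive dates differ by either 0 days or 7 days. The 0-day gaps come from the conventional and organic rows sharing the same date, so the measurements are weekly and appear equally spaced. One caveat: these counts group by `region` or by `type` alone rather than by each (region, type) series identified in 1.1, and organic has three fewer rows than conventional (9,123 versus 9,126), so a few weeks may be missing from individual series.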
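A stricter version of this check would compute the gaps inside each (region, type) series, so that rows sharing a date across types do not show up as 0-day gaps. The sketch below uses a made-up `toy` frame in which one organic week has been dropped on purpose; it is an illustration of the idea, not a result from the avocado data.

```python
import pandas as pd

# Four weekly dates for each of two series in one region (toy data)
dates = pd.date_range("2015-01-04", periods=4, freq="7D")
toy = pd.DataFrame({
    "Date": list(dates) * 2,
    "region": ["Albany"] * 8,
    "type": ["conventional"] * 4 + ["organic"] * 4,
})

# Drop one organic week (2015-01-18) to simulate a missing measurement
toy = toy.drop(index=6).reset_index(drop=True)

# Sort within each series, then take differences inside each group only
toy = toy.sort_values(["region", "type", "Date"])
gaps = toy.groupby(["region", "type"])["Date"].diff().dropna()
gap_counts = gaps.value_counts().sort_index()

print(gap_counts)
```

Here the 14-day gap exposes the missing week, which a grouping by `region` alone would hide because the conventional row still covers that date.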

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.3 Interpreting regions 
rubric={points:4}

In the Rain in Australia dataset, each location was a different place in Australia. For this dataset, look at the names of the regions. Do you think the regions are also all distinct, or are there overlapping regions? Justify your answer by referencing the data.

<div class="alert alert-warning">

Solution_1.3
    
</div>

_Points:_ 4


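There's definitely some overlap. The region names printed in the earlier outputs include specific areas such as Albany and WestTexNewMexico, but also much more general regions such as West and even TotalUS. The `Total Volume` column supports this: on 2018-03-25, TotalUS reports a conventional volume of about 43.4 million, compared with about 7.7 million for West and under 1 million for WestTexNewMexico, which is what we would expect if TotalUS aggregates the other regions.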

<!-- END QUESTION -->

<br><br>

We will use the entire dataset despite any location-based weirdness uncovered in the previous part.

We will be trying to forecast the avocado price. The function below is adapted from [Lecture 19](https://github.com/UBC-CS/cpsc330-2023W1/tree/main/lectures), with some improvements.

In [15]:
def create_lag_feature(df, orig_feature, lag, groupby, new_feature_name=None, clip=False):
    """
    Creates a new feature that's a lagged version of an existing one.
    
    NOTE: assumes df is already sorted by the time columns and has unique indices.
    
    Parameters
    ----------
    df : pandas.core.frame.DataFrame
        The dataset.
    orig_feature : str
        The column name of the feature we're copying
    lag : int
        The lag; negative lag means values from the past, positive lag means values from the future
    groupby : list
        Column(s) to group by in case df contains multiple time series
    new_feature_name : str
        Override the default name of the newly created column
    clip : bool
        If True, remove rows with NaN values for the new feature
    
    Returns
    -------
    pandas.core.frame.DataFrame
        A new dataframe with the additional column added.
        
    """
        
    if new_feature_name is None:
        if lag < 0:
            new_feature_name = "%s_lag%d" % (orig_feature, -lag)
        else:
            new_feature_name = "%s_ahead%d" % (orig_feature, lag)
    
    new_df = df.assign(**{new_feature_name : np.nan})
    for name, group in new_df.groupby(groupby):        
        if lag < 0: # take values from the past
            new_df.loc[group.index[-lag:],new_feature_name] = group.iloc[:lag][orig_feature].values
        else:       # take values from the future
            new_df.loc[group.index[:-lag], new_feature_name] = group.iloc[lag:][orig_feature].values
            
    if clip:
        new_df = new_df.dropna(subset=[new_feature_name])
        
    return new_df

We first sort our dataframe properly:

In [16]:
df_sort = df.sort_values(by=["region", "type", "Date"]).reset_index(drop=True)
df_sort

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico
18247,2018-03-18,1.56,15896.38,2055.35,1499.55,0.00,12341.48,12114.81,226.67,0.0,organic,2018,WestTexNewMexico


We then call `create_lag_feature`. This creates a new column in the dataset `AveragePriceNextWeek`, which is the following week's `AveragePrice`. We have set `clip=True` which means it will remove rows where the target would be missing.

In [17]:
df_hastarget = create_lag_feature(df_sort, "AveragePrice", +1, ["region", "type"], "AveragePriceNextWeek", clip=True)
df_hastarget

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region,AveragePriceNextWeek
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany,1.24
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany,1.17
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany,1.06
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany,0.99
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany,0.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18243,2018-02-18,1.56,17597.12,1892.05,1928.36,0.00,13776.71,13553.53,223.18,0.0,organic,2018,WestTexNewMexico,1.57
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico,1.54
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico,1.56
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico,1.56


Our goal is to predict `AveragePriceNextWeek`. 
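For intuition, the same kind of target column can be sketched with `groupby(...).shift(-1)`, which pulls the next row's value within each group. This is an alternative illustration, not the course's `create_lag_feature`; the `toy` frame is made up (the organic prices in particular are invented), and it assumes the rows are already sorted by date within each group.

```python
import pandas as pd

# Toy data: two series (conventional and organic) for one region
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-04", "2015-01-11", "2015-01-18"] * 2),
    "region": ["Albany"] * 6,
    "type": ["conventional"] * 3 + ["organic"] * 3,
    "AveragePrice": [1.22, 1.24, 1.17, 1.79, 1.77, 1.93],
}).sort_values(["region", "type", "Date"]).reset_index(drop=True)

# shift(-1) within each (region, type) group takes the following week's price,
# so values never leak from one series into another
toy["AveragePriceNextWeek"] = (
    toy.groupby(["region", "type"])["AveragePrice"].shift(-1)
)

# The last week of each series has no "next week", so drop it (like clip=True)
toy_hastarget = toy.dropna(subset=["AveragePriceNextWeek"])
print(toy_hastarget)
```

Each series loses exactly one row, its final week, which is why `df_hastarget` is shorter than `df_sort`.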

Let's split the data:

In [18]:
df_train = df_hastarget[df_hastarget["Date"] <= split_date]
df_test  = df_hastarget[df_hastarget["Date"] >  split_date]

<br><br>

<!-- BEGIN QUESTION -->

### 1.4 `AveragePrice` baseline 
rubric={points:4}

Soon we will want to build some models to forecast the average avocado price a week in advance. Before we start with any ML though, let's try a baseline. Previously we used `DummyClassifier` or `DummyRegressor` as a baseline. This time, we'll do something else as a baseline: we'll assume the price stays the same from this week to next week. So, we'll set our prediction of "AveragePriceNextWeek" exactly equal to "AveragePrice", assuming no change. That is kind of like saying, "If it's raining today then I'm guessing it will be raining tomorrow". This simplistic approach will not get a great score but it's a good starting point for reference. If our model does worse than this, it must not be very good. 

Using this baseline approach, what $R^2$ do you get on the train and test data?

<div class="alert alert-warning">

Solution_1.4
    
</div>

_Points:_ 4


In [19]:
# Baseline: predict that next week's price equals this week's price
y = df_train["AveragePriceNextWeek"].values  # Target: AveragePriceNextWeek (n_samples,)
y_pred = df_train["AveragePrice"].values  # Prediction for each row is that row's current AveragePrice

# Calculate R² score for the train data
train_r2 = r2_score(y, y_pred)
print("R² Score:", train_r2)

R² Score: 0.8285800937261841


In [20]:
# Same baseline on the test data
y_test = df_test["AveragePriceNextWeek"].values  # Target: AveragePriceNextWeek (n_samples,)
y_pred_test = df_test["AveragePrice"].values  # Prediction for each row is that row's current AveragePrice

# Calculate R² score for the test data
test_r2 = r2_score(y_test, y_pred_test)
print("Test R² Score:", test_r2)


Test R² Score: 0.7631780188583048


In [21]:
assert train_r2 is not None, "Are you using the correct variable name?"
assert test_r2 is not None, "Are you using the correct variable name?"
assert sha1(str(round(train_r2, 3)).encode('utf8')).hexdigest() == 'b1136fe2a8918904393ab6f40bfb3f38eac5fc39', "Your training score is not correct. Are you using the right features?"
assert sha1(str(round(test_r2, 3)).encode('utf8')).hexdigest() == 'cc24d9a9b567b491a56b42f7adc582f2eefa5907', "Your test score is not correct. Are you using the right features?"

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.5 Forecasting average avocado price
rubric={points:10}

Now that the baseline is done, let's build some models to forecast the average avocado price a week later. Experiment with a few approaches for encoding the date. Justify the decisions you make. Which approach worked best? Report your test score and briefly discuss your results.

Benchmark: you should be able to achieve $R^2$ of at least 0.79 on the test set. I got to 0.80, but not beyond that. Let me know if you do better!

Note: because we only have 2 splits here, we need to be a bit wary of overfitting on the test set. Try not to test on it a ridiculous number of times. If you are interested in some proper ways of dealing with this, see for example sklearn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), which is like cross-validation for time series data.
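To make the idea concrete, here is a minimal sketch of how `TimeSeriesSplit` carves up an index: each fold trains on an initial stretch and validates on the block that follows. The array `X` is a stand-in with 12 time-ordered rows. Note that the splitter works on row positions, so with several series stacked in one frame (as in the avocado data) the rows would presumably need to be sorted by date first for the folds to respect time.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in feature matrix: 12 rows assumed to be in chronological order
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for fold, (train_idx, val_idx) in enumerate(folds):
    # Training rows always come before validation rows
    print(f"Fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

A splitter like this can be passed as `cv=tscv` to `cross_val_score` or `GridSearchCV`, which leaves the held-out test set untouched during model selection.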

<div class="alert alert-warning">

Solution_1.5
    
</div>

_Points:_ 10


I'm planning on encoding the date with polynomial features that cross month with day of the week. Year is already captured as a separate categorical column, so I'm not planning on extracting it again. I picked these two components of the date-time index because I thought they had the greatest chance of capturing a pattern in the data: growing conditions change from month to month, and shipments usually arrive on certain days of the week, which could affect the price of avocados.

In [22]:
# Step 1: Extract month and day of the week.
# assign() returns new DataFrames, which avoids the SettingWithCopyWarning that
# pandas raises when columns are added in place to a slice of df_hastarget.
df_train = df_train.assign(
    month=df_train["Date"].dt.month,
    day_of_week=df_train["Date"].dt.dayofweek,
)
df_test = df_test.assign(
    month=df_test["Date"].dt.month,
    day_of_week=df_test["Date"].dt.dayofweek,
)

In [23]:
# I'll start with organizing my features

numeric_features = ["Total Volume", "4046", "4225", "4770", "Total Bags", "Small Bags", "Large Bags", "XLarge Bags", "AveragePrice"]
categorical_features = ["type", "year", "region"]
target = ["AveragePriceNextWeek"]
drop_features = ["Date"]

polynomial_features = ['month', 'day_of_week']

In [24]:
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Make a transformer for each type of feature, adapted from demo-20
def preprocess_features(
    df_train,
    df_test,
    numeric_features,
    categorical_features,
    polynomial_features,
    drop_features,
    target
):
    numeric_transformer = StandardScaler()
    categorical_transformer = OneHotEncoder(drop="if_binary", handle_unknown="ignore", sparse_output=False)

    date_poly_pipeline = make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),  # Squares and cross term for month and day_of_week
        StandardScaler(),  # Standardize the polynomial features
    )

    preprocessor = make_column_transformer(
        (numeric_transformer, numeric_features),
        (categorical_transformer, categorical_features),
        (date_poly_pipeline, polynomial_features),
        ("drop", drop_features),
    )
    preprocessor.fit(df_train)

    # Recover output column names in the same order as the transformers above
    ohe_feature_names = (
        preprocessor.named_transformers_["onehotencoder"]
        .get_feature_names_out(categorical_features)
        .tolist()
    )
    poly_feature_names = (
        preprocessor.named_transformers_["pipeline"]
        .named_steps["polynomialfeatures"]
        .get_feature_names_out(polynomial_features)
        .tolist()
    )
    new_columns = numeric_features + ohe_feature_names + poly_feature_names

    X_train_enc = pd.DataFrame(
        preprocessor.transform(df_train), index=df_train.index, columns=new_columns
    )
    X_test_enc = pd.DataFrame(
        preprocessor.transform(df_test), index=df_test.index, columns=new_columns
    )

    y_train = df_train[target]
    y_test = df_test[target]

    return X_train_enc, y_train, X_test_enc, y_test, preprocessor

In [25]:
X_train_enc, y_train, X_test_enc, y_test, preprocessor = preprocess_features(
    df_train,
    df_test,
    numeric_features,
    categorical_features,
    polynomial_features,
    drop_features, target
)



In [26]:
X_train_enc.head()


Unnamed: 0,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,AveragePrice,type_organic,...,region_Syracuse,region_Tampa,region_TotalUS,region_West,region_WestTexNewMexico,month,day_of_week,month^2,month day_of_week,day_of_week^2
0,-0.234535,-0.229503,-0.222203,-0.214954,-0.232206,-0.229907,-0.223154,-0.172063,-0.432512,0.0,...,0.0,0.0,0.0,0.0,0.0,-1.532848,0.0,-1.094284,-1.532848,0.0
1,-0.23444,-0.230948,-0.219448,-0.214272,-0.233587,-0.231513,-0.223789,-0.172063,-0.383676,0.0,...,0.0,0.0,0.0,0.0,0.0,-1.532848,0.0,-1.094284,-1.532848,0.0
2,-0.233469,-0.231018,-0.21953,-0.214196,-0.22985,-0.226469,-0.224325,-0.172063,-0.554604,0.0,...,0.0,0.0,0.0,0.0,0.0,-1.532848,0.0,-1.094284,-1.532848,0.0
3,-0.233283,-0.230996,-0.21817,-0.213945,-0.230999,-0.228629,-0.222193,-0.172063,-0.823205,0.0,...,0.0,0.0,0.0,0.0,0.0,-1.532848,0.0,-1.094284,-1.532848,0.0
4,-0.225747,-0.230668,-0.196131,-0.213811,-0.232627,-0.22993,-0.224856,-0.172063,-0.994133,0.0,...,0.0,0.0,0.0,0.0,0.0,-1.229648,0.0,-1.023758,-1.229648,0.0


In [27]:
from sklearn.pipeline import make_pipeline

def score_lr_print_coeff(preprocessor, df_train, y_train, df_test, y_test, X_train_enc):
    lr_pipe = make_pipeline(preprocessor, Ridge())
    lr_pipe.fit(df_train, y_train)
    print("Train score: {:.2f}".format(lr_pipe.score(df_train, y_train)))
    print("Test score: {:.2f}".format(lr_pipe.score(df_test, y_test)))
    lr_coef = pd.DataFrame(
        data=lr_pipe.named_steps["ridge"].coef_.flatten(),
        index=X_train_enc.columns,
        columns=["Coef"],
    )
    return lr_coef.sort_values(by="Coef", ascending=False)

In [28]:
score_lr_print_coeff(preprocessor, df_train, y_train, df_test, y_test, X_train_enc)


Train score: 0.85
Test score: 0.80




Unnamed: 0,Coef
AveragePrice,0.315157
type_organic,0.115813
region_SanFrancisco,0.101904
region_HartfordSpringfield,0.099640
region_NewYork,0.077909
...,...
region_Denver,-0.053540
month^2,-0.073466
region_SouthCentral,-0.073669
region_DallasFtWorth,-0.076250


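Using Ridge, I achieved a test $R^2$ of 0.80, which meets the benchmark and improves on the baseline's test score of about 0.76. Looking at the coefficients, we can see some helpful results. `AveragePrice` has the largest positive coefficient for `AveragePriceNextWeek` (unsurprisingly), and `type_organic` has the second largest (which is also not surprising). Returning to the research question that prompted the analysis of this dataset, San Francisco is the region with the largest positive coefficient, and Houston seems to be the region with the most negative one. Other Texas-area regions such as DallasFtWorth also have negative coefficients, so if you want cheap avocados, live in Texas. One caveat about the date encoding: in the encoded training data shown above, `day_of_week` and `day_of_week^2` are 0.0 in every displayed row and `month day_of_week` matches `month`, which suggests every observation falls on the same weekday, so only the month terms carry date information.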
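One quick sanity check on date-derived features is whether they vary at all: a column that takes a single value becomes all zeros after `StandardScaler`, as the `day_of_week` column does in the encoded output above. The sketch below runs that check on a made-up weekly date range starting on 2015-01-04, the first date in the dataset; the `features` frame is hypothetical.

```python
import pandas as pd

# Ten weekly dates starting from the dataset's first date (toy stand-in for df["Date"])
dates = pd.Series(pd.date_range("2015-01-04", periods=10, freq="7D"))

features = pd.DataFrame({
    "month": dates.dt.month,
    "day_of_week": dates.dt.dayofweek,
})

# A feature with a single unique value carries no information for the model
n_unique = features.nunique()
print(n_unique)
```

With weekly data every date lands on the same weekday, so `day_of_week` has one unique value while `month` still varies.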

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 2: Short answer questions

<!-- BEGIN QUESTION -->

### 2.1 Time series

rubric={points:6}

The following questions pertain to Lecture 20 on time series data:

1. Sometimes a time series has missing time points or, worse, time points that are unequally spaced in general. Give an example of a real world situation where the time series data would have unequally spaced time points.
2. In class we discussed two approaches to using temporal information: encoding the date as one or more features, and creating lagged versions of features. Which of these (one/other/both/neither) two approaches would struggle with unequally spaced time points? Briefly justify your answer.
3. When studying time series modeling, we explored several ways to encode date information as a feature for the citibike dataset. When we used time of day as a numeric feature, the Ridge model was not able to capture the periodic pattern. Why? How did we tackle this problem? Briefly explain.

<div class="alert alert-warning">

Solution_2.1
    
</div>

_Points:_ 6

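1. Whale sightings off the coast of California: a sighting is recorded whenever a whale happens to be spotted rather than on a fixed schedule, so the time between observations varies.
2. Creating lagged versions of features would struggle. A lag is defined by row position, so if the time points are unevenly spaced, the lagged features are unevenly spaced too. TwoDaysAhead versus OneWeekAhead could have a very different effect on the target value, so you would have to make a separate lagged feature for each time gap. Encoding the date as features does not have this problem, because each row's encoding depends only on its own timestamp.
3. A cyclic pattern like time of day is inherently non-linear, and Ridge can only fit a linear relationship to a single numeric feature, so it was not able to capture the periodic pattern. To address this problem, we encoded time of day as a categorical feature with one-hot encoding, which gives each hour its own coefficient and produced better results.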
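The third answer can be illustrated on synthetic data. The sketch below is not the citibike example from lecture: it generates a made-up daily cycle (a cosine of the hour plus noise) and compares Ridge on the raw numeric hour with Ridge on a one-hot encoded hour. All names and the data-generating choice are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# 50 synthetic days of hourly data with a smooth daily cycle plus noise
hours = np.tile(np.arange(24), 50)
y = np.cos(2 * np.pi * hours / 24) + rng.normal(scale=0.1, size=hours.size)

# Hour as a single numeric column: Ridge can only fit a straight line through it
X_numeric = hours.reshape(-1, 1)
r2_numeric = Ridge().fit(X_numeric, y).score(X_numeric, y)

# Hour as 24 one-hot columns: each hour gets its own coefficient
X_onehot = OneHotEncoder().fit_transform(X_numeric).toarray()
r2_onehot = Ridge().fit(X_onehot, y).score(X_onehot, y)

print(f"Numeric hour R^2: {r2_numeric:.2f}")
print(f"One-hot hour R^2: {r2_onehot:.2f}")
```

The numeric encoding scores close to zero because no single slope describes a curve that falls and then rises again, while the one-hot encoding can follow the cycle hour by hour.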

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.2 Computer vision 
rubric={points:6}

The following questions pertain to Lecture 19 on multiclass classification and introduction to computer vision. 

1. How many parameters (coefficients and intercepts) will `sklearn`’s `LogisticRegression()` model learn for a four-class classification problem, assuming that you have 10 features? Briefly explain your answer.
2. In Lecture 19, we briefly discussed how neural networks are sort of like `sklearn`'s pipelines, in the sense that they involve multiple sequential transformations of the data, finally resulting in the prediction. Why was this property useful when it came to transfer learning?
3. Imagine that you have a small dataset with ~1000 images containing pictures and names of 50 different Computer Science faculty members from UBC. Your goal is to develop a reasonably accurate multi-class classification model for this task. Describe which model/technique you would use and briefly justify your choice in one to three sentences.

<div class="alert alert-warning">

Solution_2.2
    
</div>

_Points:_ 6

1. For multi-class problems, LogisticRegression learns a coefficient for each feature for each class (4 × 10 = 40 coefficients). It also learns an intercept for each class (4 intercepts). Therefore the model would have 40 + 4 = 44 parameters.
2. This is useful because you can start with a robust model and then build on it to tailor it to your question, as opposed to doing everything from scratch. Tailoring an existing model simply involves carrying out more sequential transformations on top of the ones it has already learned.
3. I would use an existing model like Inception and then build on it using transfer learning to tailor it towards our dataset. Because the general datasets like ImageNet that most of these models are trained on probably won't have the classes we're looking for (individual pictures of CS professors), we have to train them on our own dataset. The different classes would presumably be each of the different faculty members, and the model would need to identify their faces and assign them to the right name (class).
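The parameter count in the first answer can be checked directly from the shapes of a fitted model's `coef_` and `intercept_` attributes. The sketch below fits `LogisticRegression` on random, made-up data with 10 features and 4 classes; the data carries no signal and only the shapes matter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Made-up data: 200 rows, 10 features, 4 evenly represented classes
X = rng.normal(size=(200, 10))
y = np.arange(200) % 4

model = LogisticRegression(max_iter=1000).fit(X, y)

# One row of coefficients per class, plus one intercept per class
n_params = model.coef_.size + model.intercept_.size

print("coef_ shape:", model.coef_.shape)
print("intercept_ shape:", model.intercept_.shape)
print("Total parameters:", n_params)
```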

<!-- END QUESTION -->

<br><br>

**Before submitting your assignment, please make sure you have followed all the instructions in the Submission instructions section at the top.** 

![](img/eva-well-done.png)