In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("hw8.ipynb")



# CPSC 330 - Applied Machine Learning

## Homework 8: Introduction to Computer vision, Time Series, and Survival Analysis (Lectures 19 to 20) 

**Due date: see the [Apr 07, 11:59 pm](https://github.com/UBC-CS/cpsc330-2024W2?tab=readme-ov-file#deliverable-due-dates-tentative).**

## Imports

In [2]:
from hashlib import sha1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import r2_score

<div class="alert alert-info">
    
## Submission instructions
<hr>
rubric={points:2}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330-2024W2/blob/main/docs/homework_instructions.md). 

**You may work in a group on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2. 
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).


When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission.
4. Make sure that the plots and output are rendered properly in your submitted file. 
5. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb.

<br><br>

## Exercise 1: time series prediction

In this exercise we'll be looking at a [dataset of avocado prices](https://www.kaggle.com/neuromusic/avocado-prices). You should start by downloading the dataset and storing it under the `data` folder. We will be forcasting average avocado price for the next week. 

In [3]:
df = pd.read_csv("data/avocado.csv", parse_dates=["Date"], index_col=0)
df.head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


In [4]:
df.shape

(18249, 13)

In [5]:
df["Date"].min()

Timestamp('2015-01-04 00:00:00')

In [6]:
df["Date"].max()

Timestamp('2018-03-25 00:00:00')

It looks like the data ranges from the start of 2015 to March 2018 (~2 years ago), for a total of 3.25 years or so. Let's split the data so that we have a 6 months of test data.

In [7]:
split_date = '20170925'
df_train = df[df["Date"] <= split_date]
df_test  = df[df["Date"] >  split_date]

In [8]:
assert len(df_train) + len(df_test) == len(df)

<br><br>

<!-- BEGIN QUESTION -->

### 1.1 How many time series? 
rubric={points:4}

In the [Rain in Australia](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package) dataset from lecture demo, we had different measurements for each Location. 

We want you to consider this for the avocado prices dataset. For which categorical feature(s), if any, do we have separate measurements? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.1
    
</div>

_Points:_ 4

The categorical features that we have separate measurements for are region and type. For region, for example, the date 2015-15-27 appears multiple times, one for each distinct region in our data set. Additionally, the same applies for type, we can see different measurements for organic avocados and conventional avocados.

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.2 Equally spaced measurements? 
rubric={points:4}

In the Rain in Australia dataset, the measurements were generally equally spaced but with some exceptions. How about with this dataset? Justify your answer by referencing the dataset.

<div class="alert alert-warning">

Solution_1.2
    
</div>

_Points:_ 4

Yes, the measurements are generally equally spaced in this dataset. It looks like the measurements are taken every week starting from 2015-01-04 until 2017-12-31 for each region and type. Looking at the index number, we can see that from 0-51 each measurement is 7 days after the previous measurement, which corresponds to 52 weeks in a year. However in 2018, it seems like the measurements were taken every week starting from 2018-01-07 until 2018-03-25, a total of 12 weeks instead of 52 weeks, which may suggest that the data collection finished early. However, the measurements are still taken weekly, thus, we can say that, in general, our data is equally spaced.

In [9]:
df.head(52)["Date"]

0    2015-12-27
1    2015-12-20
2    2015-12-13
3    2015-12-06
4    2015-11-29
5    2015-11-22
6    2015-11-15
7    2015-11-08
8    2015-11-01
9    2015-10-25
10   2015-10-18
11   2015-10-11
12   2015-10-04
13   2015-09-27
14   2015-09-20
15   2015-09-13
16   2015-09-06
17   2015-08-30
18   2015-08-23
19   2015-08-16
20   2015-08-09
21   2015-08-02
22   2015-07-26
23   2015-07-19
24   2015-07-12
25   2015-07-05
26   2015-06-28
27   2015-06-21
28   2015-06-14
29   2015-06-07
30   2015-05-31
31   2015-05-24
32   2015-05-17
33   2015-05-10
34   2015-05-03
35   2015-04-26
36   2015-04-19
37   2015-04-12
38   2015-04-05
39   2015-03-29
40   2015-03-22
41   2015-03-15
42   2015-03-08
43   2015-03-01
44   2015-02-22
45   2015-02-15
46   2015-02-08
47   2015-02-01
48   2015-01-25
49   2015-01-18
50   2015-01-11
51   2015-01-04
Name: Date, dtype: datetime64[ns]

You can see above how the data is collected weekly which suggest measurements were equally spaced

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.3 Interpreting regions 
rubric={points:4}

In the Rain in Australia dataset, each location was a different place in Australia. For this dataset, look at the names of the regions. Do you think the regions are also all distinct, or are there overlapping regions? Justify your answer by referencing the data.

<div class="alert alert-warning">

Solution_1.3
    
</div>

_Points:_ 4

In [10]:
region_names = df['region'].unique()
region_names

array(['Albany', 'Atlanta', 'BaltimoreWashington', 'Boise', 'Boston',
       'BuffaloRochester', 'California', 'Charlotte', 'Chicago',
       'CincinnatiDayton', 'Columbus', 'DallasFtWorth', 'Denver',
       'Detroit', 'GrandRapids', 'GreatLakes', 'HarrisburgScranton',
       'HartfordSpringfield', 'Houston', 'Indianapolis', 'Jacksonville',
       'LasVegas', 'LosAngeles', 'Louisville', 'MiamiFtLauderdale',
       'Midsouth', 'Nashville', 'NewOrleansMobile', 'NewYork',
       'Northeast', 'NorthernNewEngland', 'Orlando', 'Philadelphia',
       'PhoenixTucson', 'Pittsburgh', 'Plains', 'Portland',
       'RaleighGreensboro', 'RichmondNorfolk', 'Roanoke', 'Sacramento',
       'SanDiego', 'SanFrancisco', 'Seattle', 'SouthCarolina',
       'SouthCentral', 'Southeast', 'Spokane', 'StLouis', 'Syracuse',
       'Tampa', 'TotalUS', 'West', 'WestTexNewMexico'], dtype=object)

No, there are some overlapping regions in our dataset. Looking at the unique region names from above, we can see that we have regions such as specific cities like San Diego, San Francisco, and Sacramento, but we also have broader regions such as states like California. This is an overlapping region because San Diego, San Francisco, and Sacramento are all contained within California. Therefore, unlike the In the Rain in Australia dataset, where each location was a different place in Australia, this does not seem to be the case in our avocados dataset. We have some specific US cities, but also broader regions, which is the cause for overlap.

<!-- END QUESTION -->

<br><br>

We will use the entire dataset despite any location-based weirdness uncovered in the previous part.

We will be trying to forecast the avocado price. The function below is adapted from [Lecture 19](https://github.com/UBC-CS/cpsc330-2024W2/tree/main/lectures), with some improvements.

In [11]:
def create_lag_feature(df, orig_feature, lag, groupby, new_feature_name=None, clip=False):
    """
    Creates a new feature that's a lagged version of an existing one.
    
    NOTE: assumes df is already sorted by the time columns and has unique indices.
    
    Parameters
    ----------
    df : pandas.core.frame.DataFrame
        The dataset.
    orig_feature : str
        The column name of the feature we're copying
    lag : int
        The lag; negative lag means values from the past, positive lag means values from the future
    groupby : list
        Column(s) to group by in case df contains multiple time series
    new_feature_name : str
        Override the default name of the newly created column
    clip : bool
        If True, remove rows with a NaN values for the new feature
    
    Returns
    -------
    pandas.core.frame.DataFrame
        A new dataframe with the additional column added.
        
    """
        
    if new_feature_name is None:
        if lag < 0:
            new_feature_name = "%s_lag%d" % (orig_feature, -lag)
        else:
            new_feature_name = "%s_ahead%d" % (orig_feature, lag)
    
    new_df = df.assign(**{new_feature_name : np.nan})
    for name, group in new_df.groupby(groupby):        
        if lag < 0: # take values from the past
            new_df.loc[group.index[-lag:],new_feature_name] = group.iloc[:lag][orig_feature].values
        else:       # take values from the future
            new_df.loc[group.index[:-lag], new_feature_name] = group.iloc[lag:][orig_feature].values
            
    if clip:
        new_df = new_df.dropna(subset=[new_feature_name])
        
    return new_df

We first sort our dataframe properly:

In [12]:
df_sort = df.sort_values(by=["region", "type", "Date"]).reset_index(drop=True)
df_sort

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico
18247,2018-03-18,1.56,15896.38,2055.35,1499.55,0.00,12341.48,12114.81,226.67,0.0,organic,2018,WestTexNewMexico


We then call `create_lag_feature`. This creates a new column in the dataset `AveragePriceNextWeek`, which is the following week's `AveragePrice`. We have set `clip=True` which means it will remove rows where the target would be missing.

In [13]:
df_hastarget = create_lag_feature(df_sort, "AveragePrice", +1, ["region", "type"], "AveragePriceNextWeek", clip=True)
df_hastarget

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region,AveragePriceNextWeek
0,2015-01-04,1.22,40873.28,2819.50,28287.42,49.90,9716.46,9186.93,529.53,0.0,conventional,2015,Albany,1.24
1,2015-01-11,1.24,41195.08,1002.85,31640.34,127.12,8424.77,8036.04,388.73,0.0,conventional,2015,Albany,1.17
2,2015-01-18,1.17,44511.28,914.14,31540.32,135.77,11921.05,11651.09,269.96,0.0,conventional,2015,Albany,1.06
3,2015-01-25,1.06,45147.50,941.38,33196.16,164.14,10845.82,10103.35,742.47,0.0,conventional,2015,Albany,0.99
4,2015-02-01,0.99,70873.60,1353.90,60017.20,179.32,9323.18,9170.82,152.36,0.0,conventional,2015,Albany,0.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18243,2018-02-18,1.56,17597.12,1892.05,1928.36,0.00,13776.71,13553.53,223.18,0.0,organic,2018,WestTexNewMexico,1.57
18244,2018-02-25,1.57,18421.24,1974.26,2482.65,0.00,13964.33,13698.27,266.06,0.0,organic,2018,WestTexNewMexico,1.54
18245,2018-03-04,1.54,17393.30,1832.24,1905.57,0.00,13655.49,13401.93,253.56,0.0,organic,2018,WestTexNewMexico,1.56
18246,2018-03-11,1.56,22128.42,2162.67,3194.25,8.93,16762.57,16510.32,252.25,0.0,organic,2018,WestTexNewMexico,1.56


Our goal is to predict `AveragePriceNextWeek`. 

Let's split the data:

In [14]:
df_train = df_hastarget[df_hastarget["Date"] <= split_date]
df_test  = df_hastarget[df_hastarget["Date"] >  split_date]

<br><br>

<!-- BEGIN QUESTION -->

### 1.4 `AveragePrice` baseline 
rubric={points}

Soon we will want to build some models to forecast the average avocado price a week in advance. Before we start with any ML though, let's try a baseline. Previously we used `DummyClassifier` or `DummyRegressor` as a baseline. This time, we'll do something else as a baseline: we'll assume the price stays the same from this week to next week. So, we'll set our prediction of "AveragePriceNextWeek" exactly equal to "AveragePrice", assuming no change. That is kind of like saying, "If it's raining today then I'm guessing it will be raining tomorrow". This simplistic approach will not get a great score but it's a good starting point for reference. If our model does worse that this, it must not be very good. 

Using this baseline approach, what $R^2$ do you get on the train and test data?

<div class="alert alert-warning">

Solution_1.4
    
</div>

_Points:_ 4

In [15]:
train_r2 = r2_score(df_train["AveragePriceNextWeek"], df_train["AveragePrice"])
train_r2

0.8285800937261841

In [16]:
test_r2 = r2_score(df_test["AveragePriceNextWeek"], df_test["AveragePrice"])
test_r2

0.7631780188583048

In [17]:
assert not train_r2 is None, "Are you using the correct variable name?"
assert not test_r2 is None, "Are you using the correct variable name?"
assert sha1(str(round(train_r2, 3)).encode('utf8')).hexdigest() == 'b1136fe2a8918904393ab6f40bfb3f38eac5fc39', "Your training score is not correct. Are you using the right features?"
assert sha1(str(round(test_r2, 3)).encode('utf8')).hexdigest() == 'cc24d9a9b567b491a56b42f7adc582f2eefa5907', "Your test score is not correct. Are you using the right features?"

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 1.5 Forecasting average avocado price
rubric={points:10}

Now that the baseline is done, let's build some models to forecast the average avocado price a week later. Experiment with a few approachs for encoding the date. Justify the decisions you make. Which approach worked best? Report your test score and briefly discuss your results.

Benchmark: you should be able to achieve $R^2$ of at least 0.79 on the test set. I got to 0.80, but not beyond that. Let me know if you do better!

Note: because we only have 2 splits here, we need to be a bit wary of overfitting on the test set. Try not to test on it a ridiculous number of times. If you are interested in some proper ways of dealing with this, see for example sklearn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), which is like cross-validation for time series data.

<div class="alert alert-warning">

Solution_1.5
    
</div>

_Points:_ 10

After testing a few approaches for encoding the date, the best approach that worked was one-hot encoding by month and adding lag features. This makes sense because by one-hot encoding by month we might be able to capture any seasonal trends which could affect the price of avocados. Additionally, adding lag features would allow us to use information about the past weeks in helping us predicting avocado prices the next week. The past weeks pricing may have an affect on the next week, thus including lag features make sense.

CODE TAKEN FROM LECTURE 20: https://ubc-cs.github.io/cpsc330-2024W2/lectures/notes/20_time-series.html#feature-engineering-encoding-date-time-as-feature-s

In [18]:
def preprocess_features(
    train_df,
    test_df,
    numeric_features,
    categorical_features,
    drop_features,
    target
):

    all_features = set(numeric_features + categorical_features + drop_features + target)
    if set(train_df.columns) != all_features:
        print("Missing columns", set(train_df.columns) - all_features)
        print("Extra columns", all_features - set(train_df.columns))
        raise Exception("Columns do not match")

    numeric_transformer = make_pipeline(
        SimpleImputer(strategy="median"), StandardScaler()
    )
    categorical_transformer = make_pipeline(
        SimpleImputer(strategy="constant", fill_value="missing"),
        OneHotEncoder(handle_unknown="ignore", sparse_output=False),
    )

    preprocessor = make_column_transformer(
        (numeric_transformer, numeric_features),
        (categorical_transformer, categorical_features),
        ("drop", drop_features),
    )
    preprocessor.fit(train_df)
    ohe_feature_names = (
        preprocessor.named_transformers_["pipeline-2"]
        .named_steps["onehotencoder"]
        .get_feature_names_out(categorical_features)
        .tolist()
    )
    new_columns = numeric_features + ohe_feature_names

    X_train_enc = pd.DataFrame(
        preprocessor.transform(train_df), index=train_df.index, columns=new_columns
    )
    X_test_enc = pd.DataFrame(
        preprocessor.transform(test_df), index=test_df.index, columns=new_columns
    )

    y_train = df_train["AveragePriceNextWeek"]
    y_test = df_test["AveragePriceNextWeek"]

    return X_train_enc, y_train, X_test_enc, y_test, preprocessor

In [19]:
numeric_features = [
    "AveragePrice", "Total Volume", "4046", "4225", "4770", 
    "Total Bags", "Small Bags", "Large Bags", "XLarge Bags", "year"
]

categorical_features = [
    "type", "region"
]

drop_features = [
    "Date"
]

target = ["AveragePriceNextWeek"]

In [20]:
df_train = df_train.assign(
    Month=df_train["Date"].apply(lambda x: x.month_name())
)
df_test = df_test.assign(Month=df_test["Date"].apply(lambda x: x.month_name()))

In [21]:
df_train = create_lag_feature(df_train, orig_feature='AveragePrice', lag=1, groupby=['region'], new_feature_name='AveragePrice-1')
df_train = create_lag_feature(df_train, orig_feature='AveragePrice', lag=2, groupby=['region'], new_feature_name='AveragePrice-2')
df_train = create_lag_feature(df_train, orig_feature='AveragePrice', lag=3, groupby=['region'], new_feature_name='AveragePrice-3')

df_test = create_lag_feature(df_test, orig_feature='AveragePrice', lag=1, groupby=['region'], new_feature_name='AveragePrice-1')
df_test = create_lag_feature(df_test, orig_feature='AveragePrice', lag=2, groupby=['region'], new_feature_name='AveragePrice-2')
df_test = create_lag_feature(df_test, orig_feature='AveragePrice', lag=3, groupby=['region'], new_feature_name='AveragePrice-3')

numeric_features_full = numeric_features + ['AveragePrice-1', 'AveragePrice-2', 'AveragePrice-3']

X_train_enc, y_train, X_test_enc, y_test, preprocessor = preprocess_features(
    df_train, df_test,
    numeric_features_full,
    categorical_features,
    drop_features + ["Month"],
    target
)

In [None]:
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train_enc, y_train)

In [None]:
y_pred = model.predict(X_test_enc)
r2 = r2_score(y_test, y_pred)
r2

The model achieved an $R^2$ score of 0.83 on the test set. This goes beyond the benchmark set at 0.79 which may suggest that the model improved significantly in predicting avocado prices with the encoding of month as well as introduction of lag features. However, on the other side, it may mean that our model is also overfitting.

<!-- END QUESTION -->

<br><br><br><br>

## Exercise 2: Short answer questions

<!-- BEGIN QUESTION -->

### 2.1 Time series

rubric={points:6}

The following questions pertain to Lecture 20 on time series data:

1. Sometimes a time series has missing time points or, worse, time points that are unequally spaced in general. Give an example of a real world situation where the time series data would have unequally spaced time points.
2. In class we discussed two approaches to using temporal information: encoding the date as one or more features, and creating lagged versions of features. Which of these (one/other/both/neither) two approaches would struggle with unequally spaced time points? Briefly justify your answer.
3. When studying time series modeling, we explored several ways to encode date information as a feature for the citibike dataset. When we used time of day as a numeric feature, the Ridge model was not able to capture the periodic pattern. Why? How did we tackle this problem? Briefly explain.

<div class="alert alert-warning">

Solution_2.1
    
</div>

_Points:_ 6

1.) An example of a real world situation where the time series data would have unequally spaced time points could be customer support tickets. The reason why this would be unequally spaced is because customers are not issuing support tickets at fixed intervals, they will issue support tickets when they have a problem, which is typically irregular/sporadic. This could lead to gaps in between the data during the day, even worse, we may not get any support tickets issued one day, and 100 support tickets issued another. Therefore we can see how this would lead to unequally spaced time points.

2.) Creating lagged versions of features would struggle with unequally spaced time points, this is because when we use lagged features, we typically want the intervals of observations to be of equal distance. However, if we have unequally spaced time points, each interval of observation will be different from each other, making it hard for the model to pick up on patterns as well as the temporal relationship between data points.

3.) The problem with using time of day as a numeric feature is that the ridge model will not be able to capture the periodic pattern, this is because a linear function can only learn a linear function of the time of day. A linear function would not be able to pick up a periodic function of the time of day. In order to tackle this problem we encoded the feature as a categorical variable instead through OneHotEncoder.

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

### 2.2 Computer vision 
rubric={points:6}

The following questions pertain to Lecture 19 on multiclass classification and introduction to computer vision. 

1. How many parameters (coefficients and intercepts) will `sklearn`’s `LogisticRegression()` model learn for a four-class classification problem, assuming that you have 10 features? Briefly explain your answer.
2. In Lecture 19, we briefly discussed how neural networks are sort of like `sklearn`'s pipelines, in the sense that they involve multiple sequential transformations of the data, finally resulting in the prediction. Why was this property useful when it came to transfer learning?
3. Imagine that you have a small dataset with ~1000 images containing pictures and names of 50 different Computer Science faculty members from UBC. Your goal is to develop a reasonably accurate multi-class classification model for this task. Describe which model/technique you would use and briefly justify your choice in one to three sentences.

<div class="alert alert-warning">

Solution_2.2
    
</div>

_Points:_ 6

1.) Assuming that we have 10 features and model a four-class classification problem using sklearn's LogisticRegression, it will learn a coefficient for each feature and each class, as well as an intercept for each class. Thus, we will have 10 coefficients for each feature for each class, and 4 intercepts as we learn an intercept for each class. The final number of parameters will be 10 coefficients x 4 classes + 4 intercepts = 44. We will learn 44 parameters.

2.) The reason that multiple sequential transformations of data is so useful when it comes to transfer learning is that with pre-trained models they typically have a lot of useful features in the lower layers so that instead of having to re-train the model from scratch, we can just fine-tune the outer layers, which allow us to efficiently adapt the model and tailor it to our tasks.

3.) I would use a pre-trained Convolutional neural networks (CNN), this is because as stated above, the pre-trained models have learned many useful features in their lower layers due to the exposure to large datasets. Thus, we can fine-tune the outer layers to our specific task.

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 2.3 Survival analysis
<hr>

rubric={points:6}

The following questions pertain to Lecture 21 on survival analysis. We'll consider the use case of customer churn analysis.

1. What is the problem with simply labeling customers are "churned" or "not churned" and using standard supervised learning techniques?
2. Consider customer A who just joined last week vs. customer B who has been with the service for a year. Who do you expect will leave the service first: probably customer A, probably customer B, or we don't have enough information to answer? Briefly explain your answer. 
3. If a customer's survival function is almost flat during a certain period, how do we interpret that?

<div class="alert alert-warning">

Solution_2.3
    
</div>

_Points:_ 6

1.) The problem with simply labeling customers as "churned" or "not churned" and using standard supervised learning techniques is that when we do that, we are predicting based on when the data was collected. However, it would make more sense to find out how long before a customer "churns" to help incentivize them to stay, whether that is offering promotions or discounts. Thus, instead of simply labeling customers as "churned" or "not churned", we should instead be interested in discovering the time until the customer churns. Additionally, this leads to something called right-censoring; this is because customers who have "not churned", might "churn" shortly, meaning we do not have access to the entire customer data.

2.) There is not enough information to answer; this is because customer A, who joined last week, may churn early if they discover that the subscription service does not provide enough value or does not fit their needs. Customer B, who has been with the service for a year, has found that the subscription service has fulfilled its value and is no longer necessary to continue, and thus churning shortly after a year. Therefore, without more information, we are unlikely to be able to predict who leaves the service first as both customers have different reasons to leave or stay.

3.) If a customer's survival function is almost flat during a certain period, that means that there is a very low risk that the customer "churns". This suggests that during this period, the likelihood of the event occurring is stable and unchanged, meaning that the customer is most likely not going to "churn".

<!-- END QUESTION -->

<br><br>

**Before submitting your assignment, please make sure you have followed all the instructions in the Submission instructions section at the top.** 

![](img/eva-well-done.png)