---
---
Problem Set 10: Machine Learning II

Applied Data Science using Python

New York University, Abu Dhabi

Out: 21st Nov 2023 || **Due: 28th Nov 2023 at 23:59**

---
---
#Start Here
## Learning Goals
### General Goals
- Learn the fundamental concepts of applied machine learning
- Learn the fundamental concepts of supervised learning

### Specific Goals
- Learn the basics of regression
- Learn to apply different models of regression:
    - linear regression
    - polynomial regression
    - kNN regression
- Understand bias-variance tradeoff
- Learn to apply cross validation
- Learn to apply regularization (L1 vs. L2)
- Learn to evaluate and compare the performance of your regression models
- Learn to apply feature scaling
- Feature engineering
- Understand transfer learning

## Collaboration Policy
- You are allowed to talk with / work with other students on homework assignments.
- You can share ideas but not code, analyses or results; you must submit your own code and results. All submitted code will be compared against all code submitted this and previous semesters and online using MOSS. We will also critically analyze the similarities in the submitted reports, methodologies, and results, **but we will not police you**. We expect you all to be mature and responsible enough to finish your work with full integrity.
- You are expected to comply with the [University Policy on Academic Integrity and Plagiarism](https://www.nyu.edu/about/policies-guidelines-compliance/policies-and-guidelines/academic-integrity-for-students-at-nyu.html). Violations may result in penalties, such as failure in a particular assignment.

## Distribution of Class Materials
These problem sets and recitations are intellectual property of NYUAD, and we request the students to **not** distribute them or their solutions to other students who have not signed up for this class, and/or intend to sign up in the future. We also request you don't post these problem sets, and recitations online or on any public platforms.

## Late Submission Policy
You can submit the homework for upto 3 late days. However, we will deduct **20 points** from your homework grade **for each late day you take**. We will not accept the homework after 3 late days.


## Disclaimer
The number of points do not necessarily signify/correlate to the difficulty level of the tasks.

## Submission
You will submit all your code as a Python Notebook through [Brightspace](https://brightspace.nyu.edu/) as **P10_YOUR NETID.ipynb**.

## Kaggle Username

[ENTER YOUR KAGGLE USERNAME HERE]

---




# General Instructions
This homework is worth 100 points. It has 2 parts. Below each part, we provide a set of concepts required to complete that part. All the parts need to be completed in this Jupyter (Colab) Notebook. Please start this homework early as modeling may take some time.



# Part I: Predicting the Prevalence of the CCD Disease (70 points)

For a long time now, humans of the **United States of America (USA)** have been suffering from a communicable disease called the CCD, short for the **Climate Change Denialism**, a serious disease that is making humans incapable to reason. True story! <sup>1</sup>

The Center of Logical Reasoning has been collecting the data related to the disease since 2010, and has reached out to NYU for help in creating a model for the prediction of the prevalance of **Climate Change Denialism** in different states using a set of features. The dataset is **spatio-temporal** as it has prevalance rates of the disease for ~50 states (spatial), across 7 years (temporal).

------------------
<sup>1. This is a work of fiction. The story, names, writing, data depicted in this problem set are mostly ficticious. Any similarity to actual persons, living or dead, or to actual papers, is not purely coincidental but definitely inspirational. The "Climate Change Denialism" is a fictitious disease that may have been inspired by a same name disorder found amongst certain individuals in the world.</sup>


## A. Training for the US (35 points)

Using the dataset `us_train.csv`, train a machine learning based regression model that predicts the prevalence of **Climate Change Denialism** disease for a particular state in the USA. The features are in the columns labeled as `A`, `B`, ..., `AC`. The outcome variable (i.e. the prevalence of CCD disease) is present in the column `outcome`.

You may try different models (linear, polynomial, kNN) to see which one performs the best for estimation of the prevalence of the disease. You have data for the years 2010 to 2015 for 50 states in the U.S. Your data will be tested on data from 2016. The features for 2016 are provided in the file `us_test_x.csv`. The outcome/labels for 2016 are not provided.

For this part, you are required to train and evaluate your regression models very similar to what we did in the recitation.

As a submission for this part, you will fill the `us_predictions.csv` file and submit that along with this Notebook to Brightspace. You will also submit `us_predictions.csv` file to Kaggle (see Part B).

In [462]:
# Importing libraries you "may" need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import mean_absolute_error, r2_score
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

##### Step 1: Load dataset and sample it

In [463]:
# Load the training data
train_data = pd.read_csv('us_train.csv')

# Display a sample of the training data
train_data.sample(10)


Unnamed: 0,Id,states,year,outcome,A,B,C,D,E,F,...,T,U,V,W,X,Y,Z,AA,AB,AC
49,"Wyoming,2010",Wyoming,2010,7.2,,,,,,,...,,,,,,,,,,
114,"Iowa,2012",Iowa,2012,9.7,5.614808,52.463305,53.459291,38.960644,0.0,30.282616,...,35.446952,37.679391,19.006272,30.825487,47.9024,0.0,59.770235,51.443002,92.371679,63.034177
90,"South Dakota,2011",South Dakota,2011,9.4,5.913704,67.656172,65.015726,0.0,0.0,38.755585,...,42.802329,0.0,35.381513,38.156667,0.0,0.0,79.234963,51.126943,0.0,70.040519
80,"New Mexico,2011",New Mexico,2011,10.0,7.505855,59.988472,60.371746,44.485883,0.0,34.527703,...,31.700018,9.800002,37.512929,31.054514,39.12837,0.0,72.882793,44.066556,96.353161,67.954206
221,"Michigan,2014",Michigan,2014,10.3,4.558037,46.278891,45.439103,39.487153,31.401298,22.12186,...,20.964773,26.326637,28.509873,25.846464,49.564624,120.194129,49.262921,48.816906,94.585621,60.159225
236,"Oregon,2014",Oregon,2014,8.9,2.50692,23.139446,29.480003,17.238046,0.0,16.252795,...,11.079828,10.911698,20.826361,11.487317,20.408963,0.0,23.393007,23.18803,47.901731,29.558001
152,"Arizona,2013",Arizona,2013,10.6,3.290547,48.940443,47.204166,30.924915,0.0,22.002291,...,23.84214,27.110506,28.579189,24.165763,43.040883,0.0,50.488766,44.172483,89.119765,56.819195
92,"Texas,2011",Texas,2011,10.2,3.411752,47.1338,43.56496,27.803677,32.117923,19.730116,...,20.597708,28.823534,17.477615,22.977555,37.361283,99.183684,58.172504,29.945781,63.63623,42.620401
96,"Washington,2011",Washington,2011,8.9,5.686254,45.780676,43.343818,24.352186,0.0,17.616175,...,23.811535,23.443141,25.363855,25.762713,32.312461,51.413583,48.477087,36.032322,70.826764,46.196938
262,"Illinois,2015",Illinois,2015,9.9,3.858192,45.960526,47.365028,29.006354,44.956067,22.365659,...,18.498807,32.214281,57.382783,21.239013,48.629034,116.491699,46.316835,52.612292,94.011899,70.665353


##### Step 2: Impute missing values (mainly found for all states in Y2010)

In [464]:
# Define a function to fill NaNs with mean for each column in a group
def fill_nans(group, imputation_func):
    return group.fillna(imputation_func)

# Apply the function to each group (each state)
train_data = train_data.groupby('states', group_keys=True).apply(lambda group: fill_nans(group, group.mean()))

# Reset the index of the dataframe
train_data = train_data.reset_index(drop=True)

train_data.head(10)
# train_data[train_data['states'] == 'Alabama']
# train_data[train_data['states'] == 'Alaska']


Unnamed: 0,Id,states,year,outcome,A,B,C,D,E,F,...,T,U,V,W,X,Y,Z,AA,AB,AC
0,"Alabama,2010",Alabama,2010,13.1,4.977296,56.006209,53.38604,40.94328,0.0,29.966852,...,26.299833,38.192169,29.174189,31.091563,51.739476,0.0,66.782957,55.159269,100.995312,71.881987
1,"Alabama,2011",Alabama,2011,11.7,5.686254,55.929102,53.95863,39.308647,0.0,29.419012,...,31.846101,31.513731,19.822173,31.054514,43.67231,0.0,76.22604,47.71848,84.129253,60.503087
2,"Alabama,2012",Alabama,2012,12.2,5.1829,54.726428,53.459291,39.861468,0.0,26.437205,...,28.927972,29.658531,19.006272,35.567869,56.714634,0.0,69.583259,50.424329,90.950577,65.401658
3,"Alabama,2013",Alabama,2013,13.8,4.730162,54.839514,52.374146,46.201077,0.0,27.890228,...,26.213871,42.245818,20.162031,32.176514,47.926604,0.0,61.504497,54.301205,100.949823,68.126497
4,"Alabama,2014",Alabama,2014,12.9,4.785939,55.831873,52.975344,35.277862,0.0,29.94728,...,23.35454,45.55201,33.16042,29.981898,52.48019,0.0,63.849149,57.054759,104.734293,72.677907
5,"Alabama,2015",Alabama,2015,13.5,4.501224,58.704126,54.162787,44.067346,0.0,36.140534,...,21.156681,41.990755,53.720052,26.677021,57.903644,0.0,62.751841,66.29757,124.212613,92.700786
6,"Alaska,2010",Alaska,2010,5.3,3.37006,46.371256,55.469361,0.0,0.0,22.103201,...,24.480612,0.0,47.988532,25.029107,0.0,0.0,57.782747,45.811168,0.0,68.252144
7,"Alaska,2011",Alaska,2011,7.9,4.549003,50.967649,54.400914,0.0,0.0,20.082439,...,27.171444,0.0,27.068989,28.687129,0.0,0.0,48.811412,37.736553,0.0,58.714818
8,"Alaska,2012",Alaska,2012,6.9,3.239312,40.941952,53.676605,0.0,0.0,17.785029,...,26.075919,0.0,26.238747,25.334307,0.0,0.0,70.177988,48.896319,0.0,61.554502
9,"Alaska,2013",Alaska,2013,7.1,3.084888,50.251347,55.745873,0.0,0.0,18.43854,...,25.71456,0.0,43.26028,24.165763,0.0,0.0,55.996632,45.57925,0.0,66.147719


##### Step 3: Split the dataset into train and test sets

In [465]:
X = train_data.drop(['Id', 'states', 'year', 'outcome'], axis=1)
y = train_data['outcome']

In [466]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [486]:
# Scale X_train and X_test
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Scale Y_train and Y_test
scaler = StandardScaler()
y_train_scaled = scaler.fit_transform(y_train.values.reshape(-1, 1))
y_test_scaled = scaler.transform(y_test.values.reshape(-1, 1))

##### Linear Regression (No regularization)

In [523]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

linear_model = LinearRegression()

# Perform 5-fold cross-validation
cv_scores_linear = cross_val_score(linear_model, X, y, cv=5, scoring='neg_mean_absolute_error')

# Calculate the average MAE
avg_mae_linear = np.mean(-cv_scores_linear)
print(f"Average MAE for Linear Regression: {avg_mae_linear}")


Average MAE for Linear Regression: 0.8534117010232437


In [524]:
# Initialize the model
linear_model = LinearRegression()

# Train the model on the entire training dataset
linear_model.fit(X_train, y_train)

# Make predictions on the test dataset
y_pred_test = linear_model.predict(X_test)

# Calculate the MAE on the test dataset
mae_test = mean_absolute_error(y_test, y_pred_test)
print(f"MAE on Test Data: {mae_test}")


MAE on Test Data: 0.7120501563092623


##### Linear regression (Ridge and Lasso regularization)

In [527]:
# GridSearchCV for Ridge Regression
alphas = np.linspace(0.01, 100, 10000)
parameters = {'alpha': alphas}
ridge = Ridge()
ridge_grid = GridSearchCV(ridge, parameters, scoring='neg_mean_absolute_error', cv=5)
ridge_grid.fit(X_train, y_train)

# Print best parameters and MAE from GridSearchCV
print("Best parameters:", ridge_grid.best_params_)
print("Best MAE from CV:", -ridge_grid.best_score_)

# Train Ridge model with best alpha
best_alpha = ridge_grid.best_params_['alpha']
ridge_best = Ridge(alpha=best_alpha)
ridge_best.fit(X_train, y_train)

# Make predictions on test data
y_pred_test = ridge_best.predict(X_test)

# Calculate MAE on test data
mae_test = mean_absolute_error(y_test, y_pred_test)
print(f"MAE on Test Data: {mae_test}")


Best parameters: {'alpha': 100.0}
Best MAE from CV: 0.7283418293444243
MAE on Test Data: 0.7084924097437492


In [526]:
# Initialize Ridge and GridSearchCV
ridge = Ridge()
parameters = {'alpha': np.linspace(0.01, 100, 10000)}
ridge_grid_scaled = GridSearchCV(ridge, parameters, scoring='neg_mean_absolute_error', cv=5)
ridge_grid_scaled.fit(X_train_scaled, y_train_scaled)

# Best parameters and MAE
print("Best parameters (scaled data):", ridge_grid_scaled.best_params_)
print("Best MAE (scaled data):", -ridge_grid_scaled.best_score_)

# Train Ridge model with best alpha on scaled training data
best_alpha = ridge_grid.best_params_['alpha']
ridge_best_scaled = Ridge(alpha=best_alpha)
ridge_best_scaled.fit(X_train_scaled, y_train_scaled)

# Make predictions on scaled test data
y_pred_test_scaled = ridge_best_scaled.predict(X_test_scaled)

# Calculate MAE on scaled test data
mae_test_scaled = mean_absolute_error(y_test_scaled, y_pred_test_scaled)
print(f"MAE on Scaled Test Data: {mae_test_scaled}")

Best parameters (scaled data): {'alpha': 9.14}
Best MAE (scaled data): 0.4222854011905584
MAE on Scaled Test Data: 0.41954266212252433


In [509]:
# Define a range of alpha values for Lasso Regression
parameters = {'alpha': np.linspace(0.001, 0.5, 100)}

lasso = Lasso()
lasso_grid = GridSearchCV(lasso, parameters, scoring='neg_mean_absolute_error', cv=5)
lasso_grid.fit(X_train, y_train)

# Best parameters and MAE from GridSearchCV
best_alpha_unscaled = lasso_grid.best_params_['alpha']
print("Best parameters on unscaled data:", best_alpha_unscaled)
print("Best MAE on unscaled data:", -lasso_grid.best_score_)

# Retrain the model with the best alpha and evaluate on the test set
lasso_best_unscaled = Lasso(alpha=best_alpha_unscaled)
lasso_best_unscaled.fit(X_train, y_train)
y_pred_test_unscaled = lasso_best_unscaled.predict(X_test)
mae_test_unscaled = mean_absolute_error(y_test, y_pred_test_unscaled)
print("MAE on unscaled test data:", mae_test_unscaled)


Best parameters on unscaled data: 0.07660606060606061
Best MAE on unscaled data: 0.722297109385398
MAE on unscaled test data: 0.7126673475262825


In [510]:
# GridSearchCV with scaled data
lasso_grid_scaled = GridSearchCV(Lasso(), parameters, scoring='neg_mean_absolute_error', cv=5)
lasso_grid_scaled.fit(X_train_scaled, y_train_scaled)

# Best parameters and MAE from GridSearchCV
best_alpha_scaled = lasso_grid_scaled.best_params_['alpha']
print("Best parameters on scaled data:", best_alpha_scaled)
print("Best MAE on scaled data:", -lasso_grid_scaled.best_score_)

# Retrain the model with the best alpha and evaluate on the scaled test set
lasso_best_scaled = Lasso(alpha=best_alpha_scaled)
lasso_best_scaled.fit(X_train_scaled, y_train_scaled)
y_pred_test_scaled = lasso_best_scaled.predict(X_test_scaled)
mae_test_scaled = mean_absolute_error(y_test_scaled, y_pred_test_scaled)
print("MAE on scaled test data:", mae_test_scaled)


Best parameters on scaled data: 0.0060404040404040404
Best MAE on scaled data: 0.42915095594377595
MAE on scaled test data: 0.4214905193939588


##### Polynomial Regression (No regularization)

In [470]:
# Polynomial Regression
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
cv_scores_poly = cross_val_score(poly_model, X, y, cv=5, scoring='neg_mean_absolute_error')

# Calculate the average MAE
avg_mae_poly = np.mean(-cv_scores_poly)
print(f"Average MAE for Polynomial Regression: {avg_mae_poly}")

Average MAE for Polynomial Regression: 7.414024065889262


##### Polynomial regression (Ridge regularization)

In [471]:
from sklearn.pipeline import Pipeline
from warnings import filterwarnings
filterwarnings('ignore')

# Define a pipeline combining Polynomial Features with Ridge Regression
pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('ridge', Ridge())
])

parameters = {'ridge__alpha': np.linspace(0.001,0.5,100)}
poly_ridge_grid = GridSearchCV(pipeline, parameters, scoring='neg_mean_absolute_error', cv=5)
poly_ridge_grid.fit(X_train, y_train)

print("Best parameters:", poly_ridge_grid.best_params_)
print("Best MAE:", -poly_ridge_grid.best_score_)


Best parameters: {'ridge__alpha': 0.5}
Best MAE: 2.429973694681771


##### Polynomial regression (Lasso regularization)

In [511]:
# Define a pipeline combining Polynomial Features with Lasso Regression
pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('lasso', Lasso())
])

parameters = {'lasso__alpha': np.linspace(0.001, 0.5, 100)}
poly_lasso_grid = GridSearchCV(pipeline, parameters, scoring='neg_mean_absolute_error', cv=5)
poly_lasso_grid.fit(X_train, y_train)

# Best parameters and MAE from GridSearchCV
best_alpha_poly_unscaled = poly_lasso_grid.best_params_['lasso__alpha']
print("Best parameters on unscaled data:", best_alpha_poly_unscaled)
print("Best MAE on unscaled data:", -poly_lasso_grid.best_score_)

# Retrain the model with the best parameters and evaluate on the test set
pipeline_best_unscaled = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('lasso', Lasso(alpha=best_alpha_poly_unscaled))
])
pipeline_best_unscaled.fit(X_train, y_train)
y_pred_test_unscaled = pipeline_best_unscaled.predict(X_test)
mae_test_unscaled = mean_absolute_error(y_test, y_pred_test_unscaled)
print("MAE on unscaled test data:", mae_test_unscaled)

Best parameters on unscaled data: 0.5
Best MAE on unscaled data: 0.8971506140335519
MAE on unscaled test data: 0.8712613998440097


In [512]:
# GridSearchCV with scaled data
poly_lasso_grid_scaled = GridSearchCV(pipeline, parameters, scoring='neg_mean_absolute_error', cv=5)
poly_lasso_grid_scaled.fit(X_train_scaled, y_train_scaled)

# Best parameters and MAE from GridSearchCV
best_alpha_poly_scaled = poly_lasso_grid_scaled.best_params_['lasso__alpha']
print("Best parameters on scaled data:", best_alpha_poly_scaled)
print("Best MAE on scaled data:", -poly_lasso_grid_scaled.best_score_)

# Retrain the model with the best parameters and evaluate on the scaled test set
pipeline_best_scaled = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('lasso', Lasso(alpha=best_alpha_poly_scaled))
])
pipeline_best_scaled.fit(X_train_scaled, y_train_scaled)
y_pred_test_scaled = pipeline_best_scaled.predict(X_test_scaled)
mae_test_scaled = mean_absolute_error(y_test_scaled, y_pred_test_scaled)
print("MAE on scaled test data:", mae_test_scaled)


Best parameters on scaled data: 0.0060404040404040404
Best MAE on scaled data: 0.4036970883205181
MAE on scaled test data: 0.4439842263136876


##### kNN Regression (No optimization via GridSearchCV)

In [473]:
from sklearn.neighbors import KNeighborsRegressor

# Trying kNN with 3 neighbors as an example
knn_model = KNeighborsRegressor(n_neighbors=3)

# Perform 5-fold cross-validation
cv_scores_knn = cross_val_score(knn_model, X, y, cv=5, scoring='neg_mean_absolute_error')

# Calculate the average MAE
avg_mae_knn = np.mean(-cv_scores_knn)
print(f"Average MAE for kNN (3 neighbors): {avg_mae_knn}")


Average MAE for kNN (3 neighbors): 1.2387777777777778


##### kNN Regression (Optimization via GridSearchCV)

In [514]:
parameters = {'n_neighbors': range(1, 51, 2)}

knn = KNeighborsRegressor()
knn_grid = GridSearchCV(knn, parameters, scoring='neg_mean_absolute_error', cv=5)
knn_grid.fit(X_train, y_train)

print("Best parameters:", knn_grid.best_params_)
print("Best MAE:", -knn_grid.best_score_)


Best parameters: {'n_neighbors': 3}
Best MAE: 1.042638888888889


##### Best model: Linear Regression (lasso regularization)

In [532]:
# Load and prepare the us_test_x dataset
test_data = pd.read_csv('us_test_x.csv')
us_test_x = test_data.drop(['Id', 'states', 'year'], axis=1)  # Adjust the columns as needed

# Scale the features of us_test_x using the same scaler as the training data
scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)  # Assuming X_train is your training data
us_test_x_scaled = scaler_X.transform(us_test_x)

# Best alpha from the Lasso regression with polynomial features
best_alpha_poly_scaled = poly_lasso_grid_scaled.best_params_['lasso__alpha']

# Retrain the model with the best alpha from polynomial Lasso regression
pipeline_best_scaled = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('lasso', Lasso(alpha=best_alpha_poly_scaled))
])
pipeline_best_scaled.fit(X_train_scaled, y_train_scaled)  # Assuming y_train_scaled is your scaled target variable

# Make predictions on the scaled us_test_x data
us_test_x_predictions = pipeline_best_scaled.predict(us_test_x_scaled)

# Undo the scaling
us_test_x_predictions = scaler.inverse_transform(us_test_x_predictions.reshape(-1, 1)).flatten()

# Prepare the submission file
submission = pd.DataFrame({
    'Id': test_data['Id'],  # Ensure this matches the identifier column in your dataset
    'Predicted': us_test_x_predictions
})

# Save to CSV file for submission
submission.to_csv('us_predictions.csv', index=False)


### Rubric

- +20 points for logical and reasonable steps to training and testing the models using the techniques taught in the course
- +15 points showing code and evaluation of **at least two regression models** at least one of which makes the same predictions as submitted on Kaggle and in the document `us_predictions.csv`

## B. Kaggle Submission (35 points)

Create an account on Kaggle, and submit your predictions as `us_predictions.csv` with the two columns `Id` and `Predicted` to Kaggle.

You will be evaluated on the `Mean Absolute Error` as a scoring metric.

There are seven benchmarks/baselines that we have provided you on Kaggle. These are as follows:

- `Trivial Baseline`
- `Baseline A (1 and 2)`
- `Baseline B (1 and 2)`
- `Baseline C (1 and 2)`

To be able to get full points on this task, you would need to pass the `Trivial Baseline`, either of `A1` or `A2`, either of `B1` or `B2`, **and** either of `C1` or `C2` baseline.

Note that the score you see on Kaggle Leaderboard for your submission is only based on 50% of the test dataset (i.e. 25 data points) -- we have hidden the other 50% of the dataset, and your score on those will only be revealed once the competition ends. In general, if you pass the baseline on the publicly available data, your model should pass the baselines on the hidden data as well. But we have kept it hidden so that you don't overfit your model on the test set.

The Kaggle data points for the test set are from 2016 the features for which are provided in `us_test_x.csv`.

You have a maximum for 15 submissions per day on Kaggle. Before submitting the notebook, enter your Kaggle username in the **Kaggle Username** section above.

### Rubric

- +10 points for achieving/crossing Baseline A1 or Baseline A2 across both public (7 points) and hidden data points (3 points).
- +5 points for achieving/crossing Trivial Baseline across both public (3 points) and hidden data points (2 points).
- +10 points for achieving/crossing Baseline B1 or Baseline B2 across both public (7 points) and hidden data points (3 points).
- +10 points for achieving/crossing Baseline C1 or Baseline C2 across both public (7 points) and hidden data points (3 points).




## *Concepts required to complete this task*

*   Basics of Machine Learning
*   Basics of Regression
*   Feature Engineering




# Part II: Transfer Learning (30 points)

Many machine learning methods work well only under a common assumption: the training and test data are drawn from the same feature space and/or the same distribution. When the distribution changes, most statistical models need to be rebuilt from scratch using newly collected training data. In many real world applications, it is expensive or impossible to recollect the needed training data and rebuild the models. It would be nice to reduce the need and effort to recollect the training data. In such cases, **knowledge transfer** or **transfer learning** between task domains would be desirable.

**Transfer learning** is a machine learning method where a model developed for a task is reused for a model on a second task. For example, in the paper on *Revealing Inherent Gender Biases in Using Word Embeddings for Sentiment Analysis* in PS7, the (imaginary) authors used word embeddings for sentiment analysis. That was a transfer learning approach where word embeddings were created from a machine learning model that was trained for the purpose of [predicting words](https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/#:~:text=Language%20modeling%20involves%20predicting%20the,machine%20translation%20and%20speech%20recognition.), but the model was later **reused** to extract word embeddings to be used for the sentiment analysis task. Another example would be a **spam filtering model** that has been trained on emails of one user (the source distribution) and is applied to a new user who receives significantly different emails (the target distribution). This process of applying the model to a different target distribution is sometimes also known as **domain adaptation**. <Sup>2</Sup>

---------
<sup> 2. Some people distinguish between **transfer learning** and **domain adaptation**, some don't. These are not very precisely defined terms in the literature.</sup>



## A. Cross-Country Generalizability of the Model (15 points)

Using a model trained on all the data from the United States, estimate the prevalence of Climate Change Denialism disease for the 8 provinces of the **Dominion of Canada** for the years 2011 to 2014. You may have to modify and retrain your model according to the data and features available to you for Canada.

The dataset (i.e. features) for Canada is available as `ca_test_x.csv`. You will submit your final predictions as `ca_predictions.csv` the template for which is provided to you in the handout. As you will notice, the features for Canada are a subset of the features for the USA, therefore, you'll have to train your US based model accordingly.

As a submission for this part, you will fill the `ca_predictions.csv` file and submit that along with this Notebook to Brightspace. You will also submit `ca_predictions.csv` file to Kaggle (see Part B).

In [476]:
######### SOLUTION #########


######### SOLUTION END #########

### Rubric

- +5 points for logical and reasonable steps to training and testing the models using the techniques taught in the course
- +10 points showing code and evaluation of **at least two regression models** at least one of which makes the same predictions as submitted on Kaggle and in the document `ca_predictions.csv`.

## B. Kaggle Submission (15 points)

You will submit your predictions as `ca_predictions.csv` with the two columns `Id` and `Predicted` to Kaggle.

You will be evaluated on the `Mean Absolute Error` scoring metric.

There is one benchmark/baseline that we have provided you on Kaggle that you will have to meet/beat to receive all the points.

Note that the score you see on Kaggle Leaderboard for your submission is only based on 75% of the dataset (i.e. 24 data points) -- we have hidden 25% of the dataset (8 data points), and your score on those will only be revealed once the competition ends.

You have a maximum for 15 submissions per day on Kaggle. Before submitting the notebook, enter your Kaggle username in the **Kaggle Username** section above.

### Rubric

- +15 points for achieving/crossing the baseline provided across both public (10 points) and hidden data points (5 points).

## *Concepts required to complete this task*

*   Basics of Machine Learning
*   Basics of Regression
*   Feature Engineering
*   Concept of Transfer Learning