---
---
Problem Set 10: Machine Learning II

Applied Data Science using Python

New York University, Abu Dhabi

Out: 21st Nov 2023 || **Due: 28th Nov 2023 at 23:59**

---
---
# Start Here
## Learning Goals
### General Goals
- Learn the fundamental concepts of applied machine learning
- Learn the fundamental concepts of supervised learning

### Specific Goals
- Learn the basics of regression
- Learn to apply different models of regression:
    - linear regression
    - polynomial regression
    - kNN regression
- Understand bias-variance tradeoff
- Learn to apply cross validation
- Learn to apply regularization (L1 vs. L2)
- Learn to evaluate and compare the performance of your regression models
- Learn to apply feature scaling
- Feature engineering
- Understand transfer learning

## Collaboration Policy
- You are allowed to talk with / work with other students on homework assignments.
- You can share ideas but not code, analyses, or results; you must submit your own code and results. All submitted code will be compared, using MOSS, against code submitted this semester and in previous semesters, as well as code found online. We will also critically analyze the similarities in the submitted reports, methodologies, and results, **but we will not police you**. We expect you all to be mature and responsible enough to finish your work with full integrity.
- You are expected to comply with the [University Policy on Academic Integrity and Plagiarism](https://www.nyu.edu/about/policies-guidelines-compliance/policies-and-guidelines/academic-integrity-for-students-at-nyu.html). Violations may result in penalties, such as failure in a particular assignment.

## Distribution of Class Materials
These problem sets and recitations are the intellectual property of NYUAD. We ask that you do **not** distribute them or their solutions to students who have not signed up for this class or who intend to sign up in the future. We also ask that you do not post these problem sets and recitations online or on any public platform.

## Late Submission Policy
You can submit the homework up to 3 days late. However, we will deduct **20 points** from your homework grade **for each late day you take**. We will not accept the homework after 3 late days.


## Disclaimer
The number of points does not necessarily reflect the difficulty of a task.

## Submission
You will submit all your code as a Python Notebook through [Brightspace](https://brightspace.nyu.edu/) as **P10_YOUR NETID.ipynb**.

## Kaggle Username

Leaderboard: LosPodiumHermanos  
Username: IDontKnowWhatIAmDoing

---




# General Instructions
This homework is worth 100 points. It has 2 parts. Below each part, we provide a set of concepts required to complete that part. All the parts need to be completed in this Jupyter (Colab) Notebook. Please start this homework early as modeling may take some time.



# Part I: Predicting the Prevalence of the CCD Disease (70 points)

For a long time now, humans of the **United States of America (USA)** have been suffering from a communicable disease called CCD, short for **Climate Change Denialism**, a serious disease that is making humans incapable of reasoning. True story! <sup>1</sup>

The Center of Logical Reasoning has been collecting data related to the disease since 2010, and has reached out to NYU for help in creating a model that predicts the prevalence of **Climate Change Denialism** in different states using a set of features. The dataset is **spatio-temporal** as it has prevalence rates of the disease for ~50 states (spatial), across 7 years (temporal).

------------------
<sup>1. This is a work of fiction. The story, names, writing, and data depicted in this problem set are mostly fictitious. Any similarity to actual persons, living or dead, or to actual papers, is not purely coincidental but definitely inspirational. "Climate Change Denialism" is a fictitious disease that may have been inspired by a disorder of the same name found amongst certain individuals in the world.</sup>


## A. Training for the US (35 points)

Using the dataset `us_train.csv`, train a machine learning based regression model that predicts the prevalence of **Climate Change Denialism** disease for a particular state in the USA. The features are in the columns labeled as `A`, `B`, ..., `AC`. The outcome variable (i.e. the prevalence of CCD disease) is present in the column `outcome`.

You may try different models (linear, polynomial, kNN) to see which one performs best at estimating the prevalence of the disease. You have data for the years 2010 to 2015 for 50 states in the U.S. Your model will be tested on data from 2016. The features for 2016 are provided in the file `us_test_x.csv`. The outcome/labels for 2016 are not provided.

For this part, you are required to train and evaluate your regression models in much the same way as we did in the recitation.

As a submission for this part, you will fill in the `us_predictions.csv` file and submit it along with this Notebook to Brightspace. You will also submit the `us_predictions.csv` file to Kaggle (see Part B).
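
Because the test year (2016) is later than every training year, one option is to validate the same way: hold out the latest training year instead of a random sample. The sketch below is only an illustration on synthetic data. The `year` column, the two feature names, and the `Ridge` settings are assumptions; the real `us_train.csv` may encode time differently.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the training data: 50 states x 6 years.
# The `year` column and the feature names are hypothetical.
rng = np.random.default_rng(0)
years = np.repeat(np.arange(2010, 2016), 50)
df = pd.DataFrame({
    "year": years,
    "A": rng.normal(size=years.size),
    "B": rng.normal(size=years.size),
})
df["outcome"] = 2.0 * df["A"] - df["B"] + rng.normal(scale=0.1, size=len(df))

# Train on 2010-2014 and validate on 2015, mirroring the 2016 test setup.
train = df[df["year"] < 2015]
valid = df[df["year"] == 2015]

features = ["A", "B"]
model = Ridge(alpha=1.0).fit(train[features], train["outcome"])
holdout_mae = mean_absolute_error(valid["outcome"], model.predict(valid[features]))
print(f"2015 hold-out MAE: {holdout_mae:.3f}")
```

A hold-out year gives a single estimate rather than an average over folds, so it is noisier than k-fold cross-validation, but it is closer to how the model will actually be scored.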

In [13]:
# Importing libraries you "may" need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import mean_absolute_error, r2_score
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

##### Load the dataset

In [14]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Load the datasets
train_data_path = 'us_train.csv'
test_data_path = 'us_test_x.csv'
train_data = pd.read_csv(train_data_path)
test_data = pd.read_csv(test_data_path)

# Separate features and target from training data
X_train = train_data.drop(['outcome'], axis=1)
y_train = train_data['outcome']
X_test = test_data.copy()

# Preprocessing steps
# 1. Impute missing values
# 2. One-hot encode the 'states' column
# 3. Drop the 'Id' column
# 4. Normalize / scale the features

preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), [col for col in X_train.columns if col not in ['Id', 'states']]),
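        # handle_unknown='ignore' keeps transform() from failing on a state not seen during fit
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['states'])
    ],
    remainder='drop'  # drop the columns not specified in transformers (i.e. the 'Id' column)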
)

# Creating a pipeline that first preprocesses the data and then applies scaling
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('scaler', StandardScaler())])

# Fit and transform the training data, transform the testing data
X_train_processed = pipeline.fit_transform(X_train)
X_test_processed = pipeline.transform(X_test)


In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

# Linear Regression
linear_model = LinearRegression()
linear_scores = cross_val_score(linear_model, X_train_processed, y_train, cv=5, scoring='neg_mean_absolute_error')
linear_mae = -np.mean(linear_scores)

# Polynomial Regression (degree 2)
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_scores = cross_val_score(poly_model, X_train_processed, y_train, cv=5, scoring='neg_mean_absolute_error')
poly_mae = -np.mean(poly_scores)

# KNN Regression
knn_model = KNeighborsRegressor(n_neighbors=3)
knn_scores = cross_val_score(knn_model, X_train_processed, y_train, cv=5, scoring='neg_mean_absolute_error')
knn_mae = -np.mean(knn_scores)

print(f"Linear Regression MAE: {linear_mae}")
print(f"Polynomial Regression MAE: {poly_mae}")
print(f"KNN Regression MAE: {knn_mae}")


Linear Regression MAE: 0.6769149326536387
Polynomial Regression MAE: 0.7155699353633228
KNN Regression MAE: 0.6071111111111112
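
The scores above use features that were imputed and scaled once on the full training set before cross-validation. A common alternative is to put the scaler inside the estimator pipeline, so that each fold fits the scaler on its own training split only. The sketch below shows that pattern on synthetic data from `make_regression`. The data and the `_demo` names are assumptions, not the course dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real features (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# The scaler is part of the estimator, so cross_val_score refits it on the
# training portion of every fold and the validation fold never influences it.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
fold_scores = cross_val_score(knn_pipeline, X_demo, y_demo, cv=5,
                              scoring="neg_mean_absolute_error")
cv_mae = -fold_scores.mean()
print(f"kNN (scaled inside CV) MAE: {cv_mae:.3f}")
```

For distance-based models such as kNN, the difference between the two approaches is usually small on data of this size, but keeping all fitted preprocessing inside the pipeline avoids having to reason about it.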


In [16]:
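import warnings
from sklearn.exceptions import ConvergenceWarning

# Silence only the convergence warnings that Lasso can emit for very small alphas
warnings.filterwarnings('ignore', category=ConvergenceWarning)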

# Define the alpha grid
alpha_grid = {'alpha': np.logspace(-4, 4, 20)}

# Lasso Regression with GridSearchCV
lasso = Lasso()
lasso_cv = GridSearchCV(lasso, alpha_grid, scoring='neg_mean_absolute_error', cv=10)
lasso_cv.fit(X_train_processed, y_train)
best_lasso_model = lasso_cv.best_estimator_
print("Best Lasso MAE:", -lasso_cv.best_score_, "with alpha:", lasso_cv.best_params_['alpha'])

# Ridge Regression with GridSearchCV
ridge = Ridge()
ridge_cv = GridSearchCV(ridge, alpha_grid, scoring='neg_mean_absolute_error', cv=10)
ridge_cv.fit(X_train_processed, y_train)
best_ridge_model = ridge_cv.best_estimator_
print("Best Ridge MAE:", -ridge_cv.best_score_, "with alpha:", ridge_cv.best_params_['alpha'])


Best Lasso MAE: 0.46289202260645174 with alpha: 0.004832930238571752
Best Ridge MAE: 0.47010888438516807 with alpha: 4.281332398719396
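
Lasso (L1) and Ridge (L2) differ in how they shrink coefficients: the L1 penalty can set some coefficients exactly to zero, while the L2 penalty typically only shrinks them toward zero. The sketch below illustrates this on synthetic data. The dataset and the `alpha` values are illustrative assumptions, not the tuned values reported above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic problem: 20 features, only 3 of which influence the target.
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=3,
                                 noise=10.0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

lasso_demo = Lasso(alpha=5.0).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=5.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
print("Zero coefficients with Lasso:", n_zero_lasso)
print("Zero coefficients with Ridge:", n_zero_ridge)
```

Inspecting `best_estimator_.coef_` in the same way on the real data would show which features the tuned Lasso model discarded.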


In [17]:
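# Lasso had the lowest cross-validated MAE, so use it for the final predictions.
# GridSearchCV (refit=True by default) has already refit best_estimator_ on the
# full training set, so no additional fit is needed.
best_model = best_lasso_model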

# Make predictions on the test set
predictions_2016 = best_model.predict(X_test_processed)

submission = pd.DataFrame({
    "Id": test_data['Id'],
    "Predicted": predictions_2016
})
submission.to_csv('us_predictions.csv', index=False)


### Rubric

- +20 points for logical and reasonable steps for training and testing the models using the techniques taught in the course
- +15 points for showing code and evaluation of **at least two regression models**, at least one of which makes the same predictions as those submitted on Kaggle and in the file `us_predictions.csv`

## B. Kaggle Submission (35 points)

Create an account on Kaggle, and submit your predictions as `us_predictions.csv` with the two columns `Id` and `Predicted` to Kaggle.

You will be evaluated on the `Mean Absolute Error` as a scoring metric.
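
Mean Absolute Error is the average of the absolute differences between the predicted and true values. As a small worked example (the numbers are made up for illustration), the manual computation and scikit-learn's `mean_absolute_error` agree:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true and predicted prevalence values (illustrative only).
y_true_demo = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

# Manual computation: mean of |y - y_hat| = (0.5 + 0.0 + 1.5 + 1.0) / 4.
mae_manual = np.mean(np.abs(y_true_demo - y_pred_demo))
# Same quantity via scikit-learn.
mae_sklearn = mean_absolute_error(y_true_demo, y_pred_demo)

print(mae_manual, mae_sklearn)  # both 0.75
```

Lower is better, and the value is in the same units as the outcome variable.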

There are seven benchmarks/baselines that we have provided you on Kaggle. These are as follows:

- `Trivial Baseline`
- `Baseline A (1 and 2)`
- `Baseline B (1 and 2)`
- `Baseline C (1 and 2)`

To get full points on this task, you need to pass the `Trivial Baseline`, either `A1` or `A2`, either `B1` or `B2`, **and** either `C1` or `C2`.

Note that the score you see on the Kaggle leaderboard for your submission is based on only 50% of the test dataset (i.e. 25 data points); we have hidden the other 50%, and your score on those points will be revealed only once the competition ends. In general, if you pass the baselines on the public data, your model should pass the baselines on the hidden data as well. We keep the rest hidden so that you don't overfit your model to the test set.

The Kaggle test data points are from 2016; their features are provided in `us_test_x.csv`.

You have a maximum of 15 submissions per day on Kaggle. Before submitting the notebook, enter your Kaggle username in the **Kaggle Username** section above.

### Rubric

- +10 points for achieving/crossing Baseline A1 or Baseline A2 across both public (7 points) and hidden data points (3 points).
- +5 points for achieving/crossing Trivial Baseline across both public (3 points) and hidden data points (2 points).
- +10 points for achieving/crossing Baseline B1 or Baseline B2 across both public (7 points) and hidden data points (3 points).
- +10 points for achieving/crossing Baseline C1 or Baseline C2 across both public (7 points) and hidden data points (3 points).




## *Concepts required to complete this task*

*   Basics of Machine Learning
*   Basics of Regression
*   Feature Engineering




# Part II: Transfer Learning (30 points)

Many machine learning methods work well only under a common assumption: the training and test data are drawn from the same feature space and/or the same distribution. When the distribution changes, most statistical models need to be rebuilt from scratch using newly collected training data. In many real world applications, it is expensive or impossible to recollect the needed training data and rebuild the models. It would be nice to reduce the need and effort to recollect the training data. In such cases, **knowledge transfer** or **transfer learning** between task domains would be desirable.

**Transfer learning** is a machine learning method where a model developed for one task is reused as the basis for a model on a second task. For example, in the paper on *Revealing Inherent Gender Biases in Using Word Embeddings for Sentiment Analysis* in PS7, the (imaginary) authors used word embeddings for sentiment analysis. That was a transfer learning approach: the word embeddings came from a machine learning model that was trained for the purpose of [predicting words](https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/#:~:text=Language%20modeling%20involves%20predicting%20the,machine%20translation%20and%20speech%20recognition.), and the model was later **reused** to extract word embeddings for the sentiment analysis task. Another example would be a **spam filtering model** that has been trained on the emails of one user (the source distribution) and is applied to a new user who receives significantly different emails (the target distribution). This process of applying the model to a different target distribution is sometimes also known as **domain adaptation**. <sup>2</sup>

---------
<sup> 2. Some people distinguish between **transfer learning** and **domain adaptation**, some don't. These are not very precisely defined terms in the literature.</sup>
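
To make the distribution-shift problem concrete, the sketch below fits a linear model on a source range of a synthetic, nonlinear relationship and then evaluates it on a shifted target range. The data, the ranges, and the variable names are illustrative assumptions and are unrelated to the problem set's datasets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Source domain: x in [0, 1]; target domain: x in [2, 3]. True relation is y = x^2.
x_source = rng.uniform(0.0, 1.0, size=(200, 1))
x_target = rng.uniform(2.0, 3.0, size=(200, 1))
y_source = x_source[:, 0] ** 2
y_target = x_target[:, 0] ** 2

# Fit on the source domain only.
shift_model = LinearRegression().fit(x_source, y_source)

mae_source = mean_absolute_error(y_source, shift_model.predict(x_source))
mae_target = mean_absolute_error(y_target, shift_model.predict(x_target))
print(f"MAE on source domain: {mae_source:.3f}")
print(f"MAE on target domain: {mae_target:.3f}")  # much larger than on the source
```

The model looks accurate where it was trained and fails where the inputs fall outside that range, which is why cross-validation scores on source data can be an optimistic guide to target performance.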



## A. Cross-Country Generalizability of the Model (15 points)

Using a model trained on all the data from the United States, estimate the prevalence of Climate Change Denialism disease for the 8 provinces of the **Dominion of Canada** for the years 2011 to 2014. You may have to modify and retrain your model according to the data and features available to you for Canada.

The dataset (i.e. features) for Canada is available as `ca_test_x.csv`. You will submit your final predictions as `ca_predictions.csv`, a template for which is provided in the handout. As you will notice, the features for Canada are a subset of the features for the USA, so you will have to train your US-based model accordingly.
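
One way to find the features shared by both datasets is to intersect their column names rather than list them by hand. The sketch below uses tiny made-up frames in place of `us_train.csv` and `ca_test_x.csv`. The values and the `_demo` names are hypothetical, and the set of non-feature columns is an assumption based on the US files.

```python
import pandas as pd

# Tiny made-up stand-ins for the US training and Canadian test files.
us_demo = pd.DataFrame({"Id": [1, 2], "states": ["NY", "CA"],
                        "A": [0.1, 0.2], "B": [1.0, 2.0], "F": [5.0, 6.0],
                        "outcome": [3.2, 4.1]})
ca_demo = pd.DataFrame({"Id": [1, 2], "B": [1.5, 2.5], "F": [5.5, 6.5]})

# Columns that are identifiers or labels rather than model features (assumed).
non_features = {"Id", "states", "outcome"}

# Keep the US column order so the feature matrix layout is deterministic.
shared_features = [col for col in us_demo.columns
                   if col in ca_demo.columns and col not in non_features]
print(shared_features)  # ['B', 'F']
```

Selecting `shared_features` from both frames guarantees that the training and prediction matrices have the same columns in the same order.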

As a submission for this part, you will fill in the `ca_predictions.csv` file and submit it along with this Notebook to Brightspace. You will also submit the `ca_predictions.csv` file to Kaggle (see Part B).

In [36]:
from sklearn.preprocessing import RobustScaler

# Load the U.S. training data again
us_train_path = 'us_train.csv'
us_train_data = pd.read_csv(us_train_path)

# Select the features that are also present in the Canadian dataset
selected_features = ['B', 'F', 'H', 'I', 'K', 'L', 'M', 'O', 'T', 'W', 'X', 'Y', 'AA', 'AC']
X_train = us_train_data[selected_features]
y_train = us_train_data['outcome']

# Preprocessing the data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='mean'), selected_features)
    ]
)

# Creating a pipeline that first preprocesses the data and then applies scaling
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('scaler', RobustScaler())])

# Fit and transform the training data
X_train_processed = pipeline.fit_transform(X_train)


In [44]:
# Define the alpha grid
alpha_grid = {'alpha': np.logspace(-10, 10, 100)}

# Ridge Regression with GridSearchCV
ridge = Ridge()
ridge_cv = GridSearchCV(ridge, alpha_grid, scoring='neg_mean_absolute_error', cv=10)
ridge_cv.fit(X_train_processed, y_train)
best_ridge_model = ridge_cv.best_estimator_

# Lasso Regression with GridSearchCV
lasso = Lasso()
lasso_cv = GridSearchCV(lasso, alpha_grid, scoring='neg_mean_absolute_error', cv=10)
lasso_cv.fit(X_train_processed, y_train)
best_lasso_model = lasso_cv.best_estimator_


In [38]:
# knn Regression with GridSearchCV
param_grid = {'n_neighbors': range(1, 101, 2)}
knn = KNeighborsRegressor()
knn_cv = GridSearchCV(knn, param_grid, scoring='neg_mean_absolute_error', cv=10)
knn_cv.fit(X_train_processed, y_train)
best_knn_model = knn_cv.best_estimator_

In [39]:
# Linear Regression
linear_model = LinearRegression()
linear_scores = cross_val_score(linear_model, X_train_processed, y_train, cv=10, scoring='neg_mean_absolute_error')
linear_mae = -np.mean(linear_scores)

In [40]:
# Polynomial Regression (degree 2)
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_scores = cross_val_score(poly_model, X_train_processed, y_train, cv=10, scoring='neg_mean_absolute_error')
poly_mae = -np.mean(poly_scores)

In [45]:
# Print the cross-validated MAE (and best hyperparameter, where tuned) for each model
print("Best Ridge MAE:", -ridge_cv.best_score_, "with alpha:", ridge_cv.best_params_['alpha'])
print("Best Lasso MAE:", -lasso_cv.best_score_, "with alpha:", lasso_cv.best_params_['alpha'])
print("Best KNN MAE:", -knn_cv.best_score_, "with n_neighbors:", knn_cv.best_params_['n_neighbors'])
print("Linear Regression MAE:", linear_mae)
print("Polynomial Regression MAE:", poly_mae)

Best Ridge MAE: 0.8565632065493402 with alpha: 5.0941380148163855
Best Lasso MAE: 0.853451830837769 with alpha: 0.012045035402587835
Best KNN MAE: 0.7463333333333333 with n_neighbors: 3
Linear Regression MAE: 0.8586570826798494
Polynomial Regression MAE: 1.0990142984376041
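
One plausible contributor to the higher MAE of the degree-2 polynomial model is the growth in feature count: expanding the 14 selected features to degree 2 produces 120 columns, which is large relative to the roughly 300 training rows implied by 50 states over six years. This is a hypothesis and has not been verified on the course data; the sketch below only checks the column count.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_selected = 14  # number of features in selected_features

# Fit on a dummy row; only the number of input columns matters here.
poly_demo = PolynomialFeatures(degree=2).fit(np.zeros((1, n_selected)))
print(poly_demo.n_output_features_)  # 120 = 1 bias + 14 linear + 14 squares + 91 interactions
```

Pairing the polynomial expansion with a regularized model such as `Ridge` or `Lasso` instead of plain `LinearRegression` is one way to test whether overfitting explains the gap.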


In [34]:
# Load the Canadian test data
ca_test_path = 'ca_test_x.csv'
ca_test_data = pd.read_csv(ca_test_path)
X_test_ca = ca_test_data[selected_features]

# Transform the testing data using the same pipeline
X_test_ca_processed = pipeline.transform(X_test_ca)


In [35]:
# Make predictions on the Canadian test set
ca_predictions = best_lasso_model.predict(X_test_ca_processed)

# Prepare the submission file
submission_ca = pd.DataFrame({
    "Id": ca_test_data['Id'],
    "Predicted": ca_predictions
})
submission_ca.to_csv('ca_predictions.csv', index=False)


### Rubric

- +5 points for logical and reasonable steps for training and testing the models using the techniques taught in the course
- +10 points for showing code and evaluation of **at least two regression models**, at least one of which makes the same predictions as those submitted on Kaggle and in the file `ca_predictions.csv`.

## B. Kaggle Submission (15 points)

You will submit your predictions as `ca_predictions.csv` with the two columns `Id` and `Predicted` to Kaggle.

You will be evaluated on the `Mean Absolute Error` scoring metric.

There is one benchmark/baseline that we have provided you on Kaggle that you will have to meet/beat to receive all the points.

Note that the score you see on the Kaggle leaderboard for your submission is based on only 75% of the dataset (i.e. 24 data points); we have hidden 25% of the dataset (8 data points), and your score on those points will be revealed only once the competition ends.

You have a maximum of 15 submissions per day on Kaggle. Before submitting the notebook, enter your Kaggle username in the **Kaggle Username** section above.

### Rubric

- +15 points for achieving/crossing the baseline provided across both public (10 points) and hidden data points (5 points).

## *Concepts required to complete this task*

*   Basics of Machine Learning
*   Basics of Regression
*   Feature Engineering
*   Concept of Transfer Learning