# `DSML_WS_10` - Group Project Q&A

This week, we will use the workshop timeslot to work on the group projects. Therefore, we only have the following content in this notebook:

- **Task**: Predicting electricity demand - continued
- **Task**: Classifying breast cancer samples

---

## 1. Task: Predicting electricity demand - continued

In the preparation task for workshop 8, you predicted **average electrical load** from **average temperature** using polynomial features with `scikit-learn`. Let us continue from there by doing the following:

- Load data and filter dataframe to exclude any observations with `Avg_temp` outside the range of -20 to +30 degrees.
- Define X and y vectors, and perform train/test split.
- Create polynomial features up to degree 50 and scale using `StandardScaler`.
- Initialize and fit model using `LinearRegression`.
- Initialize and fit a second model with an appropriate `alpha` using `Ridge`.
- Initialize and fit a third model with an appropriate `alpha` using `Lasso`.
- Initialize and fit a fourth model with an appropriate value for `n_neighbors` using `KNeighborsRegressor`.
- Compare model performances using `mean_absolute_error` and `r2_score`.
- **Extra task**: for each of the four models, create a scatter plot of the test data and plot the regression line on top of it.
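
The list above leaves open how to find an "appropriate" alpha. One option, sketched here, is to hold out part of the training data as a validation set and keep the alpha with the lowest validation MAE. The synthetic U-shaped data, the polynomial degree of 5, the candidate grid `alphas`, and the helper `make_ridge` are all illustrative assumptions and not part of the workshop solution.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# synthetic stand-in for the load data (assumption): U-shaped demand plus noise
rng = np.random.default_rng(42)
temp = rng.uniform(-20, 30, 500)
load = 1.5 + 0.001 * (temp - 15) ** 2 + rng.normal(0, 0.05, 500)
X = temp.reshape(-1, 1)

# split off a validation set that is used only for choosing alpha
X_tr, X_val, y_tr, y_val = train_test_split(X, load, test_size=0.3, random_state=42)


def make_ridge(alpha, degree=5):
    """Hypothetical helper: polynomial features -> scaling -> ridge regression."""
    return make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        Ridge(alpha=alpha),
    )


alphas = [0.001, 0.01, 0.1, 1.0, 10.0]  # candidate grid (assumption)
val_mae = {}
for alpha in alphas:
    model = make_ridge(alpha).fit(X_tr, y_tr)
    val_mae[alpha] = mean_absolute_error(y_val, model.predict(X_val))

# keep the alpha with the lowest validation error
best_alpha = min(val_mae, key=val_mae.get)
print("Validation MAE per alpha:", val_mae)
print("Best alpha:", best_alpha)
```

The same loop would work for `Lasso` or for `n_neighbors` in `KNeighborsRegressor` by swapping the final pipeline step and the candidate grid.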

In [None]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

In [None]:
# load data
df = pd.read_csv("Pittsburgh_load_data.csv")

# limit data points to range -20 to 30
df = df[(df["Avg_temp"] >= -20) & (df["Avg_temp"] <= 30)]

df.head()

In [None]:
# define X and y vectors
xa = df["Avg_temp"]
ya = df["AVG"]

# train-test split
X_train, X_test, y_train, y_test = train_test_split(xa, ya, test_size=0.3,random_state=42)

# create poly features
poly_reg = PolynomialFeatures(degree = 50, include_bias = False)
X_train_poly = poly_reg.fit_transform(X_train.values.reshape(-1,1))
X_test_poly = poly_reg.transform(X_test.values.reshape(-1,1))

# scale
scaler = StandardScaler()
X_train_poly_scaled = scaler.fit_transform(X_train_poly)
X_test_poly_scaled = scaler.transform(X_test_poly)

# initialize and fit linear model
linear_model = LinearRegression()
linear_model.fit(X_train_poly_scaled,y_train)
linear_model_predictions = linear_model.predict(X_test_poly_scaled)

# initialize and fit ridge model
ridge_model = Ridge(alpha = 0.1, solver = 'lsqr')
ridge_model.fit(X_train_poly_scaled,y_train)
ridge_model_predictions = ridge_model.predict(X_test_poly_scaled)

# initialize and fit lasso model
lasso_model = Lasso(alpha = 0.01)
lasso_model.fit(X_train_poly_scaled,y_train)
lasso_model_predictions = lasso_model.predict(X_test_poly_scaled)

# initialize and fit knn model
knn_model = KNeighborsRegressor(n_neighbors=25)
knn_model.fit(X_train_poly_scaled,y_train)
knn_model_predictions = knn_model.predict(X_test_poly_scaled)

# compare model performances
print("Linear model - MAE: ", mean_absolute_error(y_test, linear_model_predictions), " R2: ", r2_score(y_test, linear_model_predictions))
print("Ridge model - MAE: ", mean_absolute_error(y_test, ridge_model_predictions), " R2: ", r2_score(y_test, ridge_model_predictions))
print("Lasso model - MAE: ", mean_absolute_error(y_test, lasso_model_predictions), " R2: ", r2_score(y_test, lasso_model_predictions))
print("KNN model - MAE: ", mean_absolute_error(y_test, knn_model_predictions), " R2: ", r2_score(y_test, knn_model_predictions))

In [None]:
# EXTRA TASK

# define poly x scaled for full x range
x_full_range = np.linspace(-20, 30, 1000)
x_full_poly = poly_reg.transform(x_full_range.reshape(-1,1))
x_full_poly_scaled = scaler.transform(x_full_poly)

In [None]:
# plot linear model
plt.figure(figsize = (8,6))
plt.scatter(X_test, y_test, marker="x")
plt.plot(x_full_range, linear_model.predict(x_full_poly_scaled), color='C1')
plt.xlim(-20,30)
plt.ylim(1,2.5)
plt.xlabel("Avg Temperature (°C)")
plt.ylabel("Avg Demand (GW)")
plt.show()

In [None]:
# plot ridge model
plt.figure(figsize = (8,6))
plt.scatter(X_test, y_test, marker="x")
plt.plot(x_full_range, ridge_model.predict(x_full_poly_scaled), color='C1')
plt.xlim(-20,30)
plt.ylim(1,2.5)
plt.xlabel("Avg Temperature (°C)")
plt.ylabel("Avg Demand (GW)")
plt.show()

In [None]:
# plot lasso model
plt.figure(figsize = (8,6))
plt.scatter(X_test, y_test, marker="x")
plt.plot(x_full_range, lasso_model.predict(x_full_poly_scaled), color='C1')
plt.xlim(-20,30)
plt.ylim(1,2.5)
plt.xlabel("Avg Temperature (°C)")
plt.ylabel("Avg Demand (GW)")
plt.show()

In [None]:
# plot knn model
plt.figure(figsize = (8,6))
plt.scatter(X_test, y_test, marker="x")
plt.plot(x_full_range, knn_model.predict(x_full_poly_scaled), color='C1')
plt.xlim(-20,30)
plt.ylim(1,2.5)
plt.xlabel("Avg Temperature (°C)")
plt.ylabel("Avg Demand (GW)")
plt.show()

---

## 2. Task: Classifying breast cancer samples

In workshop 9, we looked at the workings of relevant classification algorithms. One issue we did not consider was that we trained our algorithms on the full set of available data. While this is fine for understanding how classification works in general, it is not suitable for developing predictive models (as you know by now).

As a result, the classification metrics from last week's workshop say little about predictive performance, because they were computed on the same data the models were trained on. We need to evaluate on previously unseen data.
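
To see why metrics computed on the training data can mislead, consider the following minimal sketch. It uses synthetic data from `make_classification` with deliberately noisy labels and a 1-nearest-neighbour classifier. Both are assumptions chosen for illustration; neither is taken from workshop 9 or the breast cancer dataset. The classifier memorises the training set, so its training accuracy is perfect while its accuracy on unseen data is lower.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic two-feature data with noisy labels (assumption, not the breast cancer data)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# with n_neighbors=1, every training point is its own nearest neighbour
clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

train_acc = clf.score(X_tr, y_tr)  # accuracy on data the model has already seen
test_acc = clf.score(X_te, y_te)   # accuracy on previously unseen data
print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
```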

**Design a proper model development routine to train a high-performing classification algorithm for the breast cancer dataset. Proceed as follows:**

- Define your feature (let's continue to focus on `area_mean` and `concave points_mean`) and target sets.
- Partition the data into training, validation and test sets, and scale the input features.
- Train a support vector machine on the training set.
- Tweak hyperparameters (e.g., kernel) by validating on the validation set.
- Report test metrics from the unseen test set. Only look at the test set once you have finished validating your model; going back and forth would create leakage!
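
The routine above can be sketched as a loop over candidate kernels, as one possible way to organise the validation step. Synthetic data from `make_classification` stands in for the breast cancer features here. The candidate list and the use of accuracy as the selection metric are assumptions; for a medical task you may prefer to select on recall for the malignant class.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic two-feature data (assumption, stands in for the breast cancer features)
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

# roughly 50% train, 20% validation, 30% test
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trval, y_trval, test_size=0.2 / 0.7, random_state=42)

# fit the scaler on the training set only, then apply it to all three sets
scaler = StandardScaler().fit(X_train)
X_train_s, X_val_s, X_test_s = (scaler.transform(a) for a in (X_train, X_val, X_test))

# candidate models (assumption): one SVM per kernel
candidates = {
    "linear": SVC(kernel="linear"),
    "poly (d=2)": SVC(kernel="poly", degree=2, coef0=1.0),
    "rbf": SVC(kernel="rbf"),
}

# train each candidate on the training set and score it on the validation set
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train_s, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val_s))

# touch the test set exactly once, with the model chosen on validation data
best_name = max(val_scores, key=val_scores.get)
test_acc = accuracy_score(y_test, candidates[best_name].predict(X_test_s))
print("Validation accuracy per kernel:", val_scores)
print("Selected:", best_name, "- test accuracy:", test_acc)
```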

In [None]:
# import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

In [None]:
# load data
cancer_df = pd.read_csv("breast_cancer.csv", index_col = "id")

# define feature and target
x = cancer_df[['area_mean','concave points_mean']].values
y = cancer_df['diagnosis'].values

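# partition data: hold out 30% as test set, then 0.2/0.7 of the remaining 70% as holdout set
# (overall roughly 50% train, 20% holdout, 30% test)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
X_train, X_hold, y_train, y_hold = train_test_split(X_train, y_train, test_size=(0.2/0.7), random_state=42)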

# scale input features (based on X_train)
norm = StandardScaler()
X_train_norm = norm.fit_transform(X_train)
X_hold_norm = norm.transform(X_hold)
X_test_norm = norm.transform(X_test)

In [None]:
# train SVM with linear kernel on training set
# (coef0 is dropped: it only affects the 'poly' and 'sigmoid' kernels)
model_linear = SVC(kernel='linear')
model_linear.fit(X_train_norm, y_train)

# evaluate on holdout set
holdout_predictions = model_linear.predict(X_hold_norm)
print("Holdout Accuracy =", accuracy_score(y_hold, holdout_predictions))
print("Holdout Precision =", precision_score(y_hold, holdout_predictions, pos_label="M"))
print("Holdout Recall =", recall_score(y_hold, holdout_predictions, pos_label="M"))

In [None]:
# train SVM with polynomial kernel (degree 2) on training set
model_poly = SVC(kernel='poly', C = 100, degree=2, coef0=1.0)
model_poly.fit(X_train_norm, y_train)

# evaluate on holdout set
holdout_predictions = model_poly.predict(X_hold_norm)
print("Holdout Accuracy =", accuracy_score(y_hold, holdout_predictions))
print("Holdout Precision =", precision_score(y_hold, holdout_predictions, pos_label="M"))
print("Holdout Recall =", recall_score(y_hold, holdout_predictions, pos_label="M"))

In [None]:
# train SVM with RBF kernel on training set
# (coef0 is dropped: it only affects the 'poly' and 'sigmoid' kernels)
model_RBF = SVC(kernel='rbf', C = 100)
model_RBF.fit(X_train_norm, y_train)

# evaluate on holdout set
holdout_predictions = model_RBF.predict(X_hold_norm)
print("Holdout Accuracy =", accuracy_score(y_hold, holdout_predictions))
print("Holdout Precision =", precision_score(y_hold, holdout_predictions, pos_label="M"))
print("Holdout Recall =", recall_score(y_hold, holdout_predictions, pos_label="M"))

In [None]:
# evaluate preferred model (SVM with polynomial kernel, d=2, C=100) on test set
test_predictions = model_poly.predict(X_test_norm)
print("Test Accuracy =", accuracy_score(y_test, test_predictions))
print("Test Precision =", precision_score(y_test, test_predictions, pos_label="M"))
print("Test Recall =", recall_score(y_test, test_predictions, pos_label="M"))

---