# Homework 4: Fairness and bias interventions

## Regression: Download the "wine quality" dataset:

https://archive.ics.uci.edu/dataset/186/wine+quality

## Unzip the file "wine+quality.zip" to obtain:

- winequality.names
- winequality-red.csv
- winequality-white.csv

Predefine the answers:

In [1]:
answers = {}

### Implement a linear regressor using all continuous attributes (i.e., everything except color) to predict the wine quality. Use an 80/20 train/test split. Use sklearn’s `linear_model.LinearRegression`.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Load datasets
winequality_red = pd.read_csv("winequality-red.csv", sep=';')
winequality_white = pd.read_csv("winequality-white.csv", sep=';')

# Concatenate the datasets
wine_data = pd.concat([winequality_red, winequality_white], axis=0).reset_index(drop=True)

# Set a random seed and split the train/test subsets
random_seed = 42
train_data, test_data = train_test_split(wine_data, test_size=0.2, random_state=random_seed)

# Display the train and test data
print(f"Train data shape: {train_data.shape}")
print(f"Test data shape: {test_data.shape}")

# Separate the features from the target (quality)
X_train = train_data.drop(columns=['quality'])
y_train = train_data['quality']
X_test = test_data.drop(columns=['quality'])
y_test = test_data['quality']

# normalize the dataset
X_train_normalized = scaler.fit_transform(X_train)
X_test_normalized = scaler.transform(X_test)

Train data shape: (5197, 12)
Test data shape: (1300, 12)


In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [4]:
# run linear regression here
model = LinearRegression()
model.fit(X_train_normalized, y_train)

1. Report the feature with the largest coefficient value and the corresponding coefficient (not including any offset term).

In [5]:
import numpy as np

coefficients = model.coef_
feature_names = X_train.columns
largest_coeff_idx = np.argmax(np.abs(coefficients))

feature = feature_names[largest_coeff_idx]
corresponding_coefficient = coefficients[largest_coeff_idx]
print(feature, corresponding_coefficient)

alcohol 0.32243737948877366


In [6]:
answers['Q1'] = [feature, corresponding_coefficient]

2. On the first example in the test set, determine which feature has the largest effect and report its effect (see "Explaining predictions using weight plots & effect plots").

In [7]:
first_test_example = X_test_normalized[0]
effects = first_test_example * coefficients

largest_effect_idx = np.argmax(np.abs(effects))

feature = feature_names[largest_effect_idx]
corresponding_coefficient = effects[largest_effect_idx]
print(feature, corresponding_coefficient)

alcohol 0.4645765261736787


In [8]:
answers['Q2'] = [feature, corresponding_coefficient]
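
For a linear model, the effect of a feature on a single prediction is its coefficient multiplied by that example's (standardized) feature value, and the effects plus the intercept add up to the prediction. The sketch below illustrates this on made-up numbers; the coefficients, intercept, and feature values are assumptions chosen for illustration, not values from the wine model.

```python
import numpy as np

# Hypothetical coefficients, intercept, and one standardized example (illustrative only)
coef = np.array([0.5, -0.25, 0.125])
intercept = 5.0
x = np.array([1.0, 4.0, -2.0])

# Per-feature effect: weight times feature value
effects = coef * x

# The effects plus the intercept reconstruct the model's prediction
prediction = intercept + effects.sum()

# Index of the feature with the largest effect in absolute value
largest = int(np.argmax(np.abs(effects)))
print(effects, prediction, largest)
```

Taking the absolute value before `argmax` means a large negative effect counts as "largest", which matches how the cell above picks its feature.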

3. (2 marks) Based on the MSE, compute ablations of the model including every feature (other than the offset). Find the most important feature (i.e., such that the ablated model has the highest MSE) and report the value of MSE_ablated - MSE_full.

In [9]:
y_pred = model.predict(X_test_normalized)
mse = mean_squared_error(y_test, y_pred)

mse_ablated = {}
for i, feature in enumerate(feature_names):
    X_train_ablated = np.delete(X_train_normalized, i, axis=1)
    X_test_ablated = np.delete(X_test_normalized, i, axis=1)

    model_ablated = LinearRegression()
    model_ablated.fit(X_train_ablated, y_train)

    y_pred_ablated = model_ablated.predict(X_test_ablated)
    mse_ablated[feature] = mean_squared_error(y_test, y_pred_ablated)

most_important_feature = max(mse_ablated, key=lambda k: mse_ablated[k] - mse)
mse_diff = mse_ablated[most_important_feature] - mse
print(most_important_feature, mse_diff)

volatile acidity 0.023537285288143472


In [10]:
answers['Q3'] = [most_important_feature, mse_diff]
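
The ablation loop above can also be packaged as a small helper. The sketch below is one possible form, run on synthetic data and using `np.linalg.lstsq` in place of sklearn's `LinearRegression`; the function names `fit_predict` and `ablation_deltas` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Least-squares fit with an intercept; returns predictions for X_test."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    theta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ theta

def ablation_deltas(X_train, y_train, X_test, y_test):
    """MSE_ablated - MSE_full for each column, refitting without that column."""
    mse_full = np.mean((y_test - fit_predict(X_train, y_train, X_test)) ** 2)
    deltas = []
    for j in range(X_train.shape[1]):
        X_tr = np.delete(X_train, j, axis=1)
        X_te = np.delete(X_test, j, axis=1)
        mse_j = np.mean((y_test - fit_predict(X_tr, y_train, X_te)) ** 2)
        deltas.append(mse_j - mse_full)
    return np.array(deltas)

# Synthetic data (an assumption): feature 0 matters far more than feature 1
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=500)

deltas = ablation_deltas(X[:400], y[:400], X[400:], y[400:])
most_important = int(np.argmax(deltas))
print(deltas, most_important)
```

A delta can be slightly negative when removing a feature happens to help on the test split, so the most important feature is the one with the largest delta, not merely a positive one.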

4. (2 marks) Implement a full backward selection pipeline and report the sequence of MSE values for each model as a list (of increasing MSEs).

In [11]:
remaining_features = list(feature_names)
print(len(remaining_features))
X_train_current = X_train_normalized.copy()
X_test_current = X_test_normalized.copy()

mse_values = []

model = LinearRegression()
model.fit(X_train_current, y_train)
y_pred_full = model.predict(X_test_current)
mse_values.append(mean_squared_error(y_test, y_pred_full))

while len(remaining_features) > 1:
    mse_ablated = {}

    for i, feature in enumerate(remaining_features):
        X_train_ablated = np.delete(X_train_current, i, axis=1)
        X_test_ablated = np.delete(X_test_current, i, axis=1)

        model_ablated = LinearRegression()
        model_ablated.fit(X_train_ablated, y_train)
        y_pred_ablated = model_ablated.predict(X_test_ablated)
        mse_ablated[feature] = mean_squared_error(y_test, y_pred_ablated)

    least_important_feature = min(mse_ablated, key=lambda k: mse_ablated[k])
    mse_values.append(mse_ablated[least_important_feature])

    remove_idx = remaining_features.index(least_important_feature)
    X_train_current = np.delete(X_train_current, remove_idx, axis=1)
    X_test_current = np.delete(X_test_current, remove_idx, axis=1)
    remaining_features.remove(least_important_feature)

mse_list = sorted(mse_values)
mse_list

11


[0.5432570592077817,
 0.5433470389201226,
 0.5438342382954463,
 0.5443356554425473,
 0.5448096758792239,
 0.5455297712185817,
 0.5466964419580582,
 0.5473700777213758,
 0.5508614446930262,
 0.5538899057173574,
 0.604438075518826]

In [12]:
answers['Q4'] = mse_list 

5. (2 marks) Change your model to use an l1 regularizer. Increasing the regularization strength will cause variables to gradually be removed (coefficient reduced to zero) from the model. Which is the first and the last variable to be eliminated via this process?

In [13]:
from sklearn.linear_model import Lasso

alpha_values = np.logspace(-4, 1, 50)
feature_names = np.array(feature_names)

coef_history = np.zeros((len(alpha_values), len(feature_names)))

for i, alpha in enumerate(alpha_values):
    lasso = Lasso(alpha=alpha, max_iter=5000)
    lasso.fit(X_train_normalized, y_train)
    coef_history[i] = lasso.coef_

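# For each feature, find the first alpha (in increasing order) at which its
# coefficient is exactly zero; features that never reach zero sort last.
is_zero = coef_history == 0
first_zero_idx = np.where(is_zero.any(axis=0), is_zero.argmax(axis=0), len(alpha_values))

first_eliminated_index = np.argmin(first_zero_idx)
last_eliminated_index = np.argmax(first_zero_idx)

first_feature = feature_names[first_eliminated_index]
last_feature = feature_names[last_eliminated_index]
print(first_feature, last_feature)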

density alcohol


In [14]:
answers['Q5'] = [first_feature, last_feature]
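
To read an elimination order off a matrix of coefficients (one row per regularization strength, sorted by increasing strength), one option is to record, per feature, the first row at which its coefficient is zero. The sketch below does this on a small hand-written matrix; the helper name `elimination_order` and the values are assumptions for illustration, and it assumes a coefficient stays at zero once it reaches zero, which a lasso path does not strictly guarantee.

```python
import numpy as np

def elimination_order(coef_history):
    """Per feature (column), the first row index with a zero coefficient.

    Rows are assumed to be sorted by increasing regularization strength.
    Features that never reach zero get len(coef_history), so they sort last.
    """
    is_zero = coef_history == 0
    # argmax on a boolean column returns the position of the first True
    return np.where(is_zero.any(axis=0), is_zero.argmax(axis=0), len(coef_history))

# Hypothetical history: 4 regularization strengths x 3 features (illustrative values)
coef_history = np.array([
    [0.9, 0.3, -0.1],
    [0.7, 0.1,  0.0],
    [0.4, 0.0,  0.0],
    [0.0, 0.0,  0.0],
])

first_zero = elimination_order(coef_history)
first_eliminated = int(np.argmin(first_zero))
last_eliminated = int(np.argmax(first_zero))
print(first_zero, first_eliminated, last_eliminated)
```

With a coarse grid of strengths, several features can reach zero at the same row; `argmin` and `argmax` then return the lowest column index among the tied features, so a finer grid may be needed to separate them.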

### Implement a classifier to predict the wine color (red / white), again using an 80/20 train/test split, and including only continuous variables.

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load datasets
winequality_red = pd.read_csv("winequality-red.csv", sep=';')
winequality_white = pd.read_csv("winequality-white.csv", sep=';')

# Add a column to distinguish red and white wines
winequality_red['type'] = 0  # Red wine (encoded as 0)
winequality_white['type'] = 1  # White wine (encoded as 1)

# Concatenate the datasets
wine_data = pd.concat([winequality_red, winequality_white], axis=0)

# Separate features (and drop "quality" to get continuous variables) and target
X = wine_data.drop(columns=['quality', 'type'])  # Drop the target column
y = wine_data['type']  # Target column (wine type)

# Perform train/test split
random_seed = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_seed)

# Display shapes of the resulting splits
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (5197, 11)
X_test shape: (1300, 11)
y_train shape: (5197,)
y_test shape: (1300,)


6. Report the odds ratio associated with the first sample in the test set.

In [16]:
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings("ignore")

scaler = StandardScaler()

X_train_normalized = scaler.fit_transform(X_train)
X_test_normalized = scaler.transform(X_test)
X_train = pd.DataFrame(X_train_normalized, columns=X_train.columns)
X_test = pd.DataFrame(X_test_normalized, columns=X_test.columns)

model = LogisticRegression()
model.fit(X=X_train, y=y_train)

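first_sample = X_test[0:1]
beta_coefficients = model.coef_[0]
beta_0 = model.intercept_[0]
# Linear score beta_0 + x . beta for the first test sample. Note that this is the
# log-odds; the odds themselves, p / (1 - p), would be np.exp of this value.
odds_ratio = float(beta_0 + np.dot(first_sample, beta_coefficients)[0])
print(odds_ratio)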

12.939022435400068


In [17]:
answers['Q6'] = odds_ratio
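
In logistic regression the linear score is the log-odds of the positive class, and exponentiating it gives the odds p / (1 - p). The sketch below checks that relationship numerically; the intercept, coefficients, and sample are made-up values assumed for illustration, not the fitted wine classifier.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept, coefficients, and one standardized sample (illustrative only)
beta_0 = 0.5
beta = np.array([1.0, -2.0])
x = np.array([0.5, -0.25])

log_odds = beta_0 + x @ beta   # linear score
p = sigmoid(log_odds)          # P(y = 1 | x)
odds = p / (1.0 - p)           # odds of the positive class

# The odds equal the exponential of the linear score
print(log_odds, p, odds, np.exp(log_odds))
```

For a fitted sklearn `LogisticRegression`, the linear score for a sample is what `decision_function` returns, so the same conversion applies to it.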

7. Find the 20 nearest neighbors (in the training set) to the first datapoint in the test set, based on the l2 distance. Train a classifier using only those 20 points, and report the largest value of e^theta_j (see “odds ratio” slides).

In [18]:
from sklearn.neighbors import NearestNeighbors

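first_test_sample = X_test[0:1]
# Note: the question asks for 20 neighbors, but this run uses n_neighbors=50,
# so the value reported below corresponds to 50 neighbors.
nbrs = NearestNeighbors(n_neighbors=50, metric="l2")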
nbrs.fit(X_train)
distances, indices = nbrs.kneighbors(first_test_sample)

X_nearest = X_train.iloc[indices[0]].reset_index(drop=True)
y_nearest = y_train.iloc[indices[0]].reset_index(drop=True)

model = LogisticRegression()
model.fit(X_nearest, y_nearest)

coefficients = model.coef_[0]
exp_coefficients = np.exp(coefficients)
value = np.max(exp_coefficients)
value

1.6389763330200806

In [19]:
answers['Q7'] = value
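
The quantity e^theta_j can be read as the factor by which the odds are multiplied when feature j increases by one unit (one standard deviation, if the features are standardized) with the other features held fixed. The sketch below verifies this on a toy logistic model; the intercept, coefficients, and sample are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical intercept and coefficients (illustrative values only)
beta_0 = -1.0
theta = np.array([0.5, -0.25, 0.0])

def odds(x):
    """Odds of the positive class under a logistic model: exp(beta_0 + theta . x)."""
    return np.exp(beta_0 + x @ theta)

x = np.array([1.0, 2.0, 3.0])
multipliers = np.exp(theta)

# Increasing feature j by one unit multiplies the odds by exp(theta_j)
ratios = []
for j in range(len(theta)):
    x_plus = x.copy()
    x_plus[j] += 1.0
    ratios.append(odds(x_plus) / odds(x))

largest = float(np.max(multipliers))
print(ratios, multipliers, largest)
```

A coefficient of zero gives a multiplier of exactly 1, meaning the feature leaves the odds unchanged, and negative coefficients give multipliers below 1.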

In [20]:
answers

{'Q1': ['alcohol', 0.32243737948877366],
 'Q2': ['alcohol', 0.4645765261736787],
 'Q3': ['volatile acidity', 0.023537285288143472],
 'Q4': [0.5432570592077817,
  0.5433470389201226,
  0.5438342382954463,
  0.5443356554425473,
  0.5448096758792239,
  0.5455297712185817,
  0.5466964419580582,
  0.5473700777213758,
  0.5508614446930262,
  0.5538899057173574,
  0.604438075518826],
 'Q5': ['density', 'alcohol'],
 'Q6': 12.939022435400068,
 'Q7': 1.6389763330200806}