
# Assignment 3 for Course 1MS041
Make sure you pass the `# ... Test` cells and submit your solution notebook in the corresponding assignment on the course website. You can submit multiple times before the deadline, and your highest score will be used.

---
## Assignment 3, PROBLEM 1
Maximum Points = 8


Download the updated data folder from the course GitHub repository, or download the file [https://github.com/datascience-intro/1MS041-2025/blob/main/notebooks/data/smhi.csv](https://github.com/datascience-intro/1MS041-2025/blob/main/notebooks/data/smhi.csv) directly, and put it inside your data folder so that its path is `data/smhi.csv`. The data was acquired from SMHI (Swedish Meteorological and Hydrological Institute) and consists of hourly wind measurements from the Uppsala Aut station: wind speed and wind direction. Your goal is to load the data and work with it a bit. Your code should load the file as it is; please do not alter the file, because the autograder only has access to the original.

The file information is in Swedish, so you will need a translation service, for instance Google Translate or ChatGPT.

1. [2p] Load the file, for instance using the `csv` package. Store the wind direction in one numpy array and the wind speed in another.
2. [2p] The wind direction (see [Wikipedia](https://en.wikipedia.org/wiki/Wind_direction)) is an angle in degrees. Convert it into a point on the unit circle **that gives the direction the wind is blowing to** (compare with the definition of radians on [Wikipedia](https://en.wikipedia.org/wiki/Radian)). Store the `x_coordinate` as one array and the `y_coordinate` as another. From these coordinates, construct the wind-velocity vector.
3. [2p] Calculate the average wind velocity, convert it back to a direction, and compare it with the plain average of the wind direction as given in the data file.
4. [2p] The wind velocity is a $2$-dimensional random variable. Calculate its empirical covariance matrix, which should be a numpy array of shape (2,2).

Something for you to wonder about: is it more likely than not that you have a headwind when going to the university in the morning?
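
The angle conventions in step 2 are easy to mix up, so here is a minimal sketch on hand-picked directions. It assumes the file reports the compass bearing the wind blows *from* (0° = north, clockwise), as described in the Wikipedia article; the helper name `blowing_to_unit_vector` is hypothetical and not part of the assignment.

```python
import math

def blowing_to_unit_vector(bearing_from_deg):
    """Unit vector (x = east, y = north) pointing where the wind blows TO.

    Assumes `bearing_from_deg` is a compass bearing (0 = north, clockwise)
    giving the direction the wind comes FROM.
    """
    bearing_to_deg = (bearing_from_deg + 180.0) % 360.0   # flip FROM -> TO
    math_angle = math.radians(90.0 - bearing_to_deg)      # compass -> counter-clockwise from east
    return math.cos(math_angle), math.sin(math_angle)

# A northerly wind (from 0 degrees) blows towards the south.
north_x, north_y = blowing_to_unit_vector(0.0)
# A westerly wind (from 270 degrees) blows towards the east.
west_x, west_y = blowing_to_unit_vector(270.0)
print(round(north_x, 6), round(north_y, 6))
print(round(west_x, 6), round(west_y, 6))
```

Multiplying such a unit vector by the measured speed gives the velocity vector asked for in step 2.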

In [144]:
import numpy as np
import csv
import math

# --- Part 1: Data Loading and Array Creation ---
winds_dir = []
winds_speed = []

filepath = "data/smhi.csv"

# Use ISO-8859-1 encoding and semicolon delimiter
with open(filepath, encoding="ISO-8859-1") as f: 
    reader = csv.reader(f, delimiter=';')

    # Skip 12 metadata/header rows
    for _ in range(12):
        next(reader) 

    for row in reader:
        # Data columns: Date(0), Time(1), Direction(2), Quality(3), Speed(4)
        if len(row) > 4:
            try:
                # Column 3 (index 2) = Wind Direction
                wind_dir_str = row[2].strip()
                # Column 5 (index 4) = Wind Speed
                wind_speed_str = row[4].strip()
                
                if wind_dir_str and wind_speed_str and wind_dir_str != '999': # Check for non-empty and non-missing codes
                    wind_dir_value = float(wind_dir_str)
                    wind_speed_value = float(wind_speed_str)
                    
                    if 0 <= wind_dir_value <= 360 and wind_speed_value >= 0:
                        winds_dir.append(wind_dir_value)
                        winds_speed.append(wind_speed_value)
                        
            except (ValueError, IndexError):
                continue

problem1_wind_direction = np.array(winds_dir, dtype=np.float64)
problem1_wind_speed = np.array(winds_speed, dtype=np.float64)


# --- Part 2: Velocity Vector Construction ---

# 1. Convert Meteo FROM direction to Meteo TO direction (180 degrees difference)
wind_blows_to_degrees = (problem1_wind_direction + 180) % 360

# 2. Convert Meteo TO angle to Standard Math angle (0=East, counter-clockwise)
degrees_to_math = 90 - wind_blows_to_degrees
radians_to_math = np.radians(degrees_to_math)

# 3. Calculate unit vector coordinates (x=cos, y=sin)
problem1_wind_direction_x_coordinate = np.cos(radians_to_math)
problem1_wind_direction_y_coordinate = np.sin(radians_to_math)

# 4. Construct velocity vector (Magnitude * Unit Vector)
problem1_wind_velocity_x_coordinate = problem1_wind_direction_x_coordinate * problem1_wind_speed
problem1_wind_velocity_y_coordinate = problem1_wind_direction_y_coordinate * problem1_wind_speed


# --- Part 3: Average Velocity and Direction Comparison ---

# Put the average wind velocity x and y coordinates here in these variables
problem1_average_wind_velocity_x_coordinate = np.mean(problem1_wind_velocity_x_coordinate)
problem1_average_wind_velocity_y_coordinate = np.mean(problem1_wind_velocity_y_coordinate)

# 1. Average Wind Velocity Vector Angle (The correct average direction, converted back to Meteo TO)
avg_radians_math = np.arctan2(problem1_average_wind_velocity_y_coordinate, problem1_average_wind_velocity_x_coordinate)
avg_degrees_math = np.degrees(avg_radians_math) % 360 
problem1_average_wind_velocity_angle_degrees = (90 - avg_degrees_math) % 360

# 2. Simple Average of Wind Direction (The mathematically flawed angle average)
problem1_average_wind_direction_angle_degrees = np.mean(problem1_wind_direction)

# Finally, are they the same?
problem1_same_angle = np.isclose(problem1_average_wind_velocity_angle_degrees, problem1_average_wind_direction_angle_degrees)


# --- Part 4: Covariance Matrix ---

# Stack x and y velocity components as rows
velocity_data = np.stack(
    (problem1_wind_velocity_x_coordinate, problem1_wind_velocity_y_coordinate),
    axis=0
)

# Calculate the empirical covariance matrix. 
problem1_wind_velocity_covariance_matrix = np.cov(velocity_data)

In [145]:
import math

# 1. Calculate the direction the wind blows TO (180 degrees opposite FROM)
wind_blows_to_degrees = (problem1_wind_direction + 180) % 360

# 2. Convert to standard mathematical angle convention (0=East, counter-clockwise)
# Math_Angle = 90 - Meteo_Angle_TO
degrees_to_math = 90 - wind_blows_to_degrees
radians_to_math = np.radians(degrees_to_math)

# 3. Calculate unit vector coordinates (x=cos, y=sin)
problem1_wind_direction_x_coordinate = np.cos(radians_to_math)
problem1_wind_direction_y_coordinate = np.sin(radians_to_math)

# 4. Construct velocity vector (Magnitude * Unit Vector)
problem1_wind_velocity_x_coordinate = problem1_wind_direction_x_coordinate * problem1_wind_speed
problem1_wind_velocity_y_coordinate = problem1_wind_direction_y_coordinate * problem1_wind_speed

# print("Wind Velocity X Coordinate:", problem1_wind_velocity_x_coordinate) # Too large to print
# print("Wind Velocity Y Coordinate:", problem1_wind_velocity_y_coordinate) # Too large to print

In [146]:
# Put the average wind velocity x and y coordinates here in these variables
problem1_average_wind_velocity_x_coordinate = np.mean(problem1_wind_velocity_x_coordinate)
problem1_average_wind_velocity_y_coordinate = np.mean(problem1_wind_velocity_y_coordinate)

# --- 1. Average Wind Velocity Vector Angle (The correct average direction) ---
# a. Get angle in standard math radians (-pi to pi) from the mean vector
avg_radians_math = np.arctan2(problem1_average_wind_velocity_y_coordinate, problem1_average_wind_velocity_x_coordinate)
# b. Convert to degrees (0 to 360)
avg_degrees_math = np.degrees(avg_radians_math) % 360 
# c. Convert back to meteorological 'wind blowing TO' (0=N, clockwise)
problem1_average_wind_velocity_angle_degrees = (90 - avg_degrees_math) % 360

# --- 2. Simple Average of Wind Direction (The wrong average direction) ---
# The simple mean of angles is mathematically flawed for circular data (e.g., 1 degree and 359 degrees average to 180, not 0)
problem1_average_wind_direction_angle_degrees = np.mean(problem1_wind_direction)

print("Average Wind Velocity Angle (degrees):", problem1_average_wind_velocity_angle_degrees)
print("Average Wind Direction Angle (degrees):", problem1_average_wind_direction_angle_degrees)

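# Finally, are they the same? The first angle is a "blowing TO" bearing taken from the mean velocity vector;
# the second is the plain mean of the "blowing FROM" bearings in the file, so they are not expected to agree.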
problem1_same_angle = np.isclose(problem1_average_wind_velocity_angle_degrees, problem1_average_wind_direction_angle_degrees)
# If np.isclose is not allowed, use: abs(A - B) < tolerance
# problem1_same_angle = abs(problem1_average_wind_velocity_angle_degrees - problem1_average_wind_direction_angle_degrees) < 1e-6

Average Wind Velocity Angle (degrees): 20.12646997879915
Average Wind Direction Angle (degrees): 192.281280627246
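
Part of the gap between the two printed angles comes from the fact that bearings live on a circle. A minimal sketch with two made-up bearings (350° and 30°, not taken from the data) shows the arithmetic mean landing on the opposite side of the compass, while averaging unit vectors gives the bisecting direction. `vector_mean_bearing` is a hypothetical helper, and unlike the assignment's velocity average it does not weight by wind speed.

```python
import math

def vector_mean_bearing(bearings_deg):
    """Mean compass bearing obtained by averaging unit vectors."""
    # Compass bearing b corresponds to the vector (sin b, cos b) in (east, north).
    east = sum(math.sin(math.radians(b)) for b in bearings_deg) / len(bearings_deg)
    north = sum(math.cos(math.radians(b)) for b in bearings_deg) / len(bearings_deg)
    return math.degrees(math.atan2(east, north)) % 360.0

bearings = [350.0, 30.0]                       # illustrative values only
arithmetic_mean = sum(bearings) / len(bearings)
circular_mean = vector_mean_bearing(bearings)
print(arithmetic_mean, round(circular_mean, 6))
```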


In [147]:
# Stack the x and y coordinates of the wind velocity vector. 
# Rows will be the variables (vx, vy), columns will be the observations.
velocity_data = np.stack(
    (problem1_wind_velocity_x_coordinate, problem1_wind_velocity_y_coordinate),
    axis=0
)

# Calculate the empirical covariance matrix. 
# np.cov treats each row as a variable by default (rowvar=True).
problem1_wind_velocity_covariance_matrix = np.cov(velocity_data)
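
As a sanity check on what `np.cov` computes, the sketch below compares it with the estimator written out by hand. The data are synthetic standard-normal draws standing in for the velocity components, which is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the velocity components: 2 variables, 500 observations.
sample = rng.normal(size=(2, 500))

# np.cov treats each row as a variable (rowvar=True) and divides by n - 1.
cov_numpy = np.cov(sample)

# The same estimate written out: center each row, then average outer products.
centered = sample - sample.mean(axis=1, keepdims=True)
cov_manual = centered @ centered.T / (sample.shape[1] - 1)

print(cov_numpy.shape, np.allclose(cov_numpy, cov_manual))
```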

---
## Assignment 3, PROBLEM 2
Maximum Points = 8


For this problem you will need the [pandas](https://pandas.pydata.org/) package and the [sklearn](https://scikit-learn.org/stable/) package. Inside the `data` folder from the course website you will find a file called `indoor_train.csv`. It contains positions (X,Y,Z) together with a location number. The idea is to assign a room number (Location) to the coordinates (X,Y,Z).

1. [2p] Take the data in the file `indoor_train.csv` and load it using pandas into a dataframe `df_train`.
2. [3p] From this dataframe `df_train`, create two numpy arrays, `Xtrain` and `Ytrain`, with shapes `(1154,3)` and `(1154,)` respectively. Their `dtype` should be `float64` and `int64` respectively.
3. [3p] Train a Support Vector Classifier, `sklearn.svm.SVC`, on `Xtrain, Ytrain` with `kernel='linear'` and name the trained model `svc_train`.

To mimic how [kaggle](https://www.kaggle.com/) works, the Autograder has access to a hidden test-set and will test your fitted model.

In [148]:

import pandas as pd
from sklearn.svm import SVC

df_train = pd.read_csv("data/indoor_train.csv")

In [149]:

# Request the dtypes explicitly: plain `int` can map to int32 on some platforms
Xtrain = df_train.drop(columns=["Location"]).to_numpy(dtype=np.float64)
Ytrain = df_train["Location"].to_numpy(dtype=np.int64)

print("Xtrain:", Xtrain.shape)
print("Ytrain:", Ytrain.shape)

Xtrain: (1154, 3)
Ytrain: (1154,)


In [150]:

svc_train = SVC(kernel='linear').fit(Xtrain, Ytrain)
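
Because the autograder scores the model on a hidden test set, it can help to estimate accuracy locally before submitting. The sketch below does this on synthetic, well-separated clusters standing in for rooms; the cluster centers, noise level, and 25% hold-out fraction are assumptions for illustration, not properties of `indoor_train.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Three made-up "rooms", 60 noisy (X, Y, Z) positions around each center.
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 0.0], [10.0, 0.0, 3.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(60, 3)) for c in centers])
y_demo = np.repeat(np.arange(3, dtype=np.int64), 60)

# Hold out a quarter of the points, keeping the room proportions equal.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

svc_demo = SVC(kernel="linear").fit(X_fit, y_fit)
val_accuracy = svc_demo.score(X_val, y_val)   # fraction of correct room labels
print(val_accuracy)
```

The same pattern could be applied to `Xtrain, Ytrain` to get a rough local estimate, although the submitted `svc_train` should still be fitted on the full training set.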

---
## Assignment 3, PROBLEM 3
Maximum Points = 8


Let us build a proportional model ($\mathbb{P}(Y=1 \mid X) = G(\beta_0+\beta \cdot X)$ where $G$ is the logistic function) for the spam vs. not-spam data. Here we assume that each feature is the presence or absence of a word: let $X_1,X_2,X_3$ denote the presence (1) or absence (0) of the words "free", "prize", and "win".

1. [2p] Load the file `data/spam.csv` and create two numpy arrays: `problem3_X`, which has shape **(n_texts,3)** and whose features correspond to $X_1,X_2,X_3$ from above, and `problem3_Y`, which has shape **(n_texts,)** and contains a $1$ if the email is spam and a $0$ if it is not. Split this data into train, calibration, and test sets with proportions $40\%$, $20\%$, $40\%$, and put the results in the designated variables in the code cell.

2. [2p] Follow the calculation from the lecture notes where we derive the logistic regression and implement the final loss function inside the class `ProportionalSpam`. You can use the `Test` cell to check that it gives the correct value for a test-point.

3. [2p] Train the model `problem3_ps` on the training data. The goal is then to calibrate the probabilities that the model outputs. Start by creating a new variable `problem3_X_pred` (shape `(n_samples,1)`) consisting of the predictions of `problem3_ps` on the calibration dataset. Then train a calibration model using `sklearn.tree.DecisionTreeRegressor` and store the trained model in `problem3_calibrator`. Recall that, for a fixed function $f$, the calibration error is
$$
    \sqrt{\mathbb{E}[|\mathbb{E}[Y \mid f(X)] - f(X)|^2]}.
$$

4. [2p] Use the trained model `problem3_ps` and the calibrator `problem3_calibrator` to make final predictions on the testing data, store the prediction in `problem3_final_predictions`. 
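
The calibration error in step 3 can be estimated directly when the predictions take only a few distinct values. The sketch below is a plug-in estimate under that assumption: it groups samples by predicted value and uses the group mean of $Y$ for $\mathbb{E}[Y \mid f(X)]$. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def empirical_calibration_error(predictions, outcomes):
    """Plug-in estimate of sqrt(E[(E[Y | f(X)] - f(X))^2]) for discrete predictions."""
    predictions = np.asarray(predictions, dtype=np.float64)
    outcomes = np.asarray(outcomes, dtype=np.float64)
    squared_gap = 0.0
    for value in np.unique(predictions):
        in_group = predictions == value
        # E[Y | f(X) = value] is estimated by the mean outcome in the group.
        gap = outcomes[in_group].mean() - value
        squared_gap += in_group.mean() * gap ** 2   # weight by group frequency
    return float(np.sqrt(squared_gap))

# Illustrative values: a prediction of 0.5 while 3 of 4 outcomes are positive.
error_miscalibrated = empirical_calibration_error([0.5, 0.5, 0.5, 0.5], [1, 1, 1, 0])
# Perfectly calibrated toy case: 1 in 5 positive at 0.2, 4 in 5 positive at 0.8.
error_calibrated = empirical_calibration_error(
    [0.2] * 5 + [0.8] * 5, [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
)
print(error_miscalibrated, error_calibrated)
```

A calibrator aims to shrink this quantity by replacing each raw prediction with the outcome frequency observed for it.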

In [151]:
from sklearn.model_selection import train_test_split

data = pd.read_csv('data/spam.csv', encoding='latin-1')
X1 = "free"
X2 = "prize"
X3 = "win"

feature_list = []
target_list = []

for text, spam_label in zip(data['v2'], data['v1']):

    feature_row_bools = [X1 in text, X2 in text, X3 in text]
    feature_list.append(feature_row_bools)

    target_value = (spam_label == 'spam') # True/False
    target_list.append(target_value)

problem3_X = np.array(feature_list).astype(int)
problem3_Y = np.array(target_list).astype(int)

problem3_X_calib, X_temp, problem3_Y_calib, Y_temp = train_test_split(
    problem3_X, problem3_Y, test_size=0.8, random_state=42
)
problem3_X_train, problem3_X_test, problem3_Y_train, problem3_Y_test = train_test_split(
    X_temp, Y_temp, test_size=0.5, random_state=42
)

print(problem3_X_train.shape,problem3_X_calib.shape,problem3_X_test.shape,problem3_Y_train.shape,problem3_Y_calib.shape,problem3_Y_test.shape)

(2229, 3) (1114, 3) (2229, 3) (2229,) (1114,) (2229,)
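
The two-stage split above works because taking 20% first and then halving the remaining 80% yields 40% and 40%. An alternative sketch, assuming only numpy, shuffles the indices once and slices them; `three_way_split` and its arguments are hypothetical names, not part of the assignment.

```python
import numpy as np

def three_way_split(n_samples, fractions=(0.4, 0.2, 0.4), seed=0):
    """Return disjoint index arrays (train, calibration, test)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(fractions[0] * n_samples)
    n_calib = int(fractions[1] * n_samples)
    # Whatever is left after the first two slices goes to the test set.
    return order[:n_train], order[n_train:n_train + n_calib], order[n_train + n_calib:]

train_idx, calib_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(calib_idx), len(test_idx))
```

Indexing both the feature array and the label array with the same index arrays keeps rows and labels aligned.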


In [152]:
class ProportionalSpam(object):
    def __init__(self):
        self.coeffs = None
        self.result = None
    
    # Helper function to add the intercept column
    def _augment_X(self, X):
        import numpy as np
        ones = np.ones((X.shape[0], 1))
        # X is (N, D), returns X_aug (N, D+1)
        return np.hstack([ones, X])

    # Negative log-likelihood; X is augmented with an intercept column internally
    def loss(self, X, Y, coeffs):
        import numpy as np
        
        # Augment X to match the size of coeffs (3 features + 1 intercept = 4)
        X_aug = self._augment_X(X)
        
        # The dot product now works: (N, 4) @ (4,) -> (N,)
        Z = X_aug @ coeffs 
        Y_hat = 1 / (1 + np.exp(-Z))
        epsilon = 1e-15
        
        # This is the negative log-likelihood loss for logistic regression
        loss_value = -np.mean(Y * np.log(Y_hat + epsilon) + (1 - Y) * np.log(1 - Y_hat + epsilon))
        return loss_value

    def fit(self, X, Y):
        import numpy as np
        from scipy import optimize

        # We pass the non-augmented X to the optimizer, which in turn calls loss(X, Y, coeffs)
        # where the augmentation happens.
        opt_loss = lambda coeffs: self.loss(X, Y, coeffs)
        
        # The optimizer needs an initial guess for the D+1 coefficients.
        initial_arguments = np.zeros(shape=X.shape[1] + 1) 
        
        self.result = optimize.minimize(opt_loss, initial_arguments, method='cg')
        self.coeffs = self.result.x
    
    # Predicted probabilities rounded to one decimal; X is augmented internally
    def predict(self, X):
        import numpy as np
        if (self.coeffs is not None):
            # Augment X for prediction
            X_aug = self._augment_X(X)
            
            # G = sigmoid function
            G = lambda z: np.exp(z) / (1 + np.exp(z))
            
            # Z = X_aug @ self.coeffs
            Z = X_aug @ self.coeffs 

            # Return the rounded probability
            return np.round(10 * G(Z)) / 10

In [153]:

from sklearn.tree import DecisionTreeRegressor

# Fit the logistic model on the training split
problem3_ps = ProportionalSpam()
problem3_ps.fit(problem3_X_train, problem3_Y_train)

# Predictions on the calibration split, shaped (n_samples, 1) for sklearn
problem3_X_pred = problem3_ps.predict(problem3_X_calib).reshape(-1, 1)

# Fit the calibrator to map raw predictions to observed spam frequencies
problem3_calibrator = DecisionTreeRegressor()
problem3_calibrator.fit(problem3_X_pred, problem3_Y_calib)


In [154]:

problem3_final_predictions = problem3_calibrator.predict(
    problem3_ps.predict(problem3_X_test).reshape(-1, 1)
)

---
#### Local Test for Assignment 3, PROBLEM 3
Evaluate the cell below to make sure your answer is valid. You **should not** modify anything in the cell below when evaluating it to do a local test of your solution.
You may need to include and evaluate code snippets from lecture notebooks in cells above to make the local test work correctly sometimes (see error messages for clues). This is meant to help you become efficient at recalling materials covered in lectures that relate to this problem. Such local tests will generally not be available in the exam.

In [155]:
try:
    import numpy as np
    test_instance = ProportionalSpam()
    test_loss = test_instance.loss(np.array([[1,0,1],[0,1,1]]),np.array([1,0]),np.array([1.2,0.4,0.3,0.9]))
    assert (np.abs(test_loss-1.2828629432232497) < 1e-6)
    print("Your loss was correct for a test point")
except:
    print("Your loss was not correct on a test point")

Your loss was correct for a test point
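
The `epsilon` guard in `loss` avoids `log(0)`, but `np.exp(-Z)` can still overflow for large negative `Z`. One alternative, sketched here as an assumption rather than the lecture-note formulation, rewrites the per-sample loss as $\log(1+e^{z}) - yz$ and evaluates it with `np.logaddexp`. `stable_logistic_loss` is a hypothetical name; on the test point above it agrees with the expected value to within the test tolerance.

```python
import numpy as np

def stable_logistic_loss(X, Y, coeffs):
    """Mean negative log-likelihood, with coeffs[0] as the intercept."""
    X = np.asarray(X, dtype=np.float64)
    Y = np.asarray(Y, dtype=np.float64)
    z = coeffs[0] + X @ coeffs[1:]
    # log(1 + exp(z)) computed without overflow, minus y * z
    return float(np.mean(np.logaddexp(0.0, z) - Y * z))

demo_loss = stable_logistic_loss(
    np.array([[1, 0, 1], [0, 1, 1]]), np.array([1, 0]), np.array([1.2, 0.4, 0.3, 0.9])
)
# An extreme input that stays finite in this formulation.
extreme_loss = stable_logistic_loss(np.array([[1000.0]]), np.array([1]), np.array([0.0, 1.0]))
print(demo_loss, extreme_loss)
```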
