
# Linear Regression

In [None]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

## Data Preprocessing

### **Exploring the dataset**

Let's start by loading the training data from the CSV file into a pandas DataFrame.

Load the datasets from GitHub. The training dataset has already been loaded for you in `df` below. To get the test dataset, use the commented code.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/cronan03/DevSoc_AI-ML/main/train_processed_splitted.csv')
print("--- Exploring the dataset ---")


Let's see what the first 5 rows of this dataset look like

In [None]:
print("First 5 rows:")
print(df.head())

What are all the features present? What is the range for each of the features along with their mean?

In [None]:
print("\nFeatures (Columns) present:")
print(df.columns.tolist())

print("\nRange (Min/Max) and Mean for each feature:")
feature_stats = df.describe().T[['min', 'max', 'mean']]
feature_stats['range'] = feature_stats['max'] - feature_stats['min']
print(feature_stats[['range', 'min', 'max', 'mean']])
print("-" * 35)


### **Feature Scaling and One-Hot Encoding**

You may have noticed that some features `(such as Utilities)` are not continuous values.
  
These features contain values indicating different categories and must be converted to numbers so that the model can work with them. `(Computers only understand numbers and not strings)`
  
These features are called categorical features. We can represent them with a `One-Hot Representation`.
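
As a minimal sketch of what a one-hot representation looks like, the toy frame below is encoded with `pd.get_dummies`. The `Color` and `Area` columns and their values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data: column names and values are illustrative only
toy = pd.DataFrame({"Color": ["red", "blue", "red"], "Area": [10, 20, 30]})

# Each category becomes its own 0/1 column; numeric columns pass through unchanged
toy_encoded = pd.get_dummies(toy, columns=["Color"], dtype="int32")
print(toy_encoded)
```

Each row has exactly one `1` among the `Color_*` columns, which is where the name "one-hot" comes from.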
  
  
You may also have noticed that the other features are each on a different scale. This can be detrimental to the performance of our linear regression model, so we normalize them so that all of them are in the range $[0,1]$.

> NOTE: When you do feature scaling, store the min/max values that you use to normalize. They will be needed again at testing time. Try to think about why we are doing this.
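
The sketch below shows one way the stored statistics might be reused, assuming two toy frames with a single made-up `Area` column. The names `col_min` and `col_max` are illustrative and are not the ones used later in this notebook.

```python
import pandas as pd

# Toy train/test frames; the column name and values are illustrative only
train = pd.DataFrame({"Area": [50.0, 100.0, 150.0]})
test = pd.DataFrame({"Area": [75.0, 200.0]})

# Compute the statistics on the training data only and keep them around
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# Reuse the stored training statistics for both sets
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled["Area"].tolist())  # [0.0, 0.5, 1.0]
print(test_scaled["Area"].tolist())   # [0.25, 1.5]
```

Note that a test value larger than the training maximum ends up outside $[0,1]$. This is fine, because both sets are still measured on the same scale.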

In [None]:
# Do the one-hot encoding here
print("--- Feature Scaling and One-Hot Encoding ---")

y_train_original = df['SalePrice'].copy()
X_train = df.drop('SalePrice', axis=1)

categorical_cols = X_train.select_dtypes(include='object').columns.tolist()
X_train = pd.get_dummies(X_train, columns=categorical_cols, dtype='int32')

if 'Utilities' in X_train.columns:
    X_train = X_train.drop('Utilities', axis=1)

# Fill any remaining NaN values with 0 (Common practice after one-hot encoding and before scaling)
X_train.fillna(0, inplace=True)

In [None]:
# Do the feature scaling here
min_vals = X_train.min(axis=0)
max_vals = X_train.max(axis=0)
# Avoid division by zero for columns with constant values (range=0)
range_vals = max_vals - min_vals
range_vals[range_vals == 0] = 1 # Set range to 1 for constant columns
X_train_scaled = (X_train - min_vals) / range_vals

# Scale Target (SalePrice)
target_min = y_train_original.min()
target_max = y_train_original.max()
y_train_scaled = (y_train_original - target_min) / (target_max - target_min)


print(f"Shape of scaled Features (X): {X_train_scaled.shape}")
print(f"Shape of scaled Target (Y): {y_train_scaled.shape}")
print("-" * 35)

### **Conversion to NumPy**

Now that we have preprocessed all the data, we need to convert it to NumPy arrays for our linear regression model.
  
Assume that our dataset has a total of $N$ datapoints, each with a total of $D$ features (after one-hot encoding). We want our feature array to be of shape $(N, D)$.

In our task, we have to predict the `SalePrice`. We will need 2 NumPy arrays $(X, Y)$. These represent the features and targets respectively.
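
One detail worth checking at this step is the shape of the target array. The small sketch below uses toy values rather than the dataset. It shows why a column of shape $(N, 1)$ is a safer choice than a flat array of shape $(N,)$ once predictions are also stored as a column.

```python
import numpy as np

y_flat = np.array([1.0, 2.0, 3.0])   # shape (3,)
y_col = y_flat.reshape(-1, 1)        # shape (3, 1)
pred = np.zeros((3, 1))              # predictions stored as a column, shape (3, 1)

print((pred - y_col).shape)   # (3, 1): elementwise difference, as intended
print((pred - y_flat).shape)  # (3, 3): broadcasting silently builds a matrix
```

The second case raises no error, so a loss computed from it would be wrong without any warning.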

In [None]:
# Convert to numpy array
print("--- Conversion to NumPy ---")
X = X_train_scaled.to_numpy()
Y = y_train_scaled.to_numpy().reshape(-1, 1)
N, D = X.shape
print(f"N (Datapoints): {N}")
print(f"D (Features after encoding): {D}")
print("-" * 35)


## Linear Regression formulation
  
We now have our data in the form we need. Let's try to create a linear model to get our initial (Really bad) prediction


Let's say a single datapoint in our dataset consists of 3 features $(x_1, x_2, x_3)$, we can pose it as a linear equation as follows:
$$ y = w_1x_1 + w_2x_2 + w_3x_3 + b $$
Here we have to learn 4 parameters $(w_1, w_2, w_3, b)$
  
  
Now how do we extend this to multiple datapoints?  
  
  
Try to answer the following:
- How many parameters will we have to learn in the case of our dataset? (Don't forget the bias term)
- Form a linear equation for our dataset. We need just a single matrix equation which correctly represents all the datapoints in our dataset
- Implement the linear equation as an equation using NumPy arrays (Start by randomly initializing the weights from a standard normal distribution)
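
As a rough illustration of the matrix form, the sketch below assumes a toy problem with 4 datapoints and 3 features. The names `X_toy`, `W_toy` and `b_toy` are hypothetical and the sizes are not the dataset's $N$ and $D$. It checks that one row of the matrix product agrees with the per-datapoint equation $y = w_1x_1 + w_2x_2 + w_3x_3 + b$.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

n_points, n_features = 4, 3                     # toy sizes
X_toy = rng.random((n_points, n_features))      # features, shape (N, D)
W_toy = rng.standard_normal((n_features, 1))    # one weight per feature, shape (D, 1)
b_toy = rng.standard_normal(1)                  # single bias, broadcast over all rows

Y_hat = X_toy @ W_toy + b_toy                   # shape (N, 1): one prediction per datapoint

# Row 0 of the matrix product should match w1*x1 + w2*x2 + w3*x3 + b for datapoint 0
manual = sum(W_toy[j, 0] * X_toy[0, j] for j in range(n_features)) + b_toy[0]
print(Y_hat.shape, np.isclose(Y_hat[0, 0], manual))
```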

In [None]:
print("--- Linear Regression formulation ---")
num_parameters = D + 1
print(f"Number of parameters to learn: D (Features) + 1 (Bias) = {D} + 1 = {num_parameters}")
W = np.random.randn(D, 1) * 0.01
b = np.random.randn(1) * 0.01
Y_pred_initial = X @ W + b

print(f"Initial Weights (W) shape: {W.shape}")
print(f"Initial Bias (b) value: {b}")
print(f"Initial Prediction (Y_pred) shape: {Y_pred_initial.shape}")

How well does our model perform? Try comparing our predictions with the actual values

In [None]:
initial_mse_loss = np.mean((Y_pred_initial - Y)**2)
print(f"Initial MSE Loss (Scaled): {initial_mse_loss:.6f}")
print("-" * 35)

### **Learning weights using gradient descent**

So these results are really horrible. We need to somehow update our weights so that they correctly represent our data. How do we do that?

We must do the following:
- We need some numerical indication of our performance. For this we define a Loss Function ( $\mathscr{L}$ )
- Find the gradients of the `Loss` with respect to the `Weights`
- Update the weights in accordance with the gradients: $W = W - \alpha\nabla_W \mathscr{L}$

Let's define the loss function:
- We will use the MSE loss since it is a regression task. (Specify the assumptions we make while doing so as taught in the class).
- Implement this loss as a function. (Use numpy as much as possible)
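
For reference, with $N$ datapoints, targets $y_i$ and predictions $\hat{y}_i$ (the hat notation is introduced here for convenience), the MSE loss is usually written as:

$$ \mathscr{L} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2 $$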

In [None]:
def mse_loss_fn(y_true, y_pred):
    """
    Calculates the Mean Squared Error (MSE) loss.
    Assumptions: The errors (residuals) are normally distributed,
                 homoscedastic (constant variance), and independent.
    """
    # y_true and y_pred are (N, 1) matrices
    N = y_true.shape[0]
    loss = np.sum((y_pred - y_true)**2) / N
    return loss



Calculate the gradients of the loss with respect to the weights (and bias). First write the equations down on a piece of paper, then proceed to implement them
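
One way to check hand-derived gradients before relying on them is to compare them against finite differences. The sketch below does this on random toy data. It is self-contained, and the helper names `mse` and `analytic_grads` are hypothetical. It assumes the MSE loss of a linear model $XW + b$.

```python
import numpy as np

def mse(X, Y, W, b):
    """Mean squared error of the linear model X @ W + b."""
    return np.mean((X @ W + b - Y) ** 2)

def analytic_grads(X, Y, W, b):
    """Gradients of the MSE above: (2/N) X^T e and (2/N) sum(e), with e = X W + b - Y."""
    n = X.shape[0]
    err = X @ W + b - Y
    return (2 / n) * (X.T @ err), (2 / n) * np.sum(err)

rng = np.random.default_rng(0)
X_toy = rng.random((5, 3))
Y_toy = rng.random((5, 1))
W_toy = rng.standard_normal((3, 1))
b_toy = 0.1

dW, db = analytic_grads(X_toy, Y_toy, W_toy, b_toy)

# Central finite differences: nudge one parameter at a time by eps
eps = 1e-6
dW_num = np.zeros_like(W_toy)
for j in range(W_toy.shape[0]):
    step = np.zeros_like(W_toy)
    step[j, 0] = eps
    loss_plus = mse(X_toy, Y_toy, W_toy + step, b_toy)
    loss_minus = mse(X_toy, Y_toy, W_toy - step, b_toy)
    dW_num[j, 0] = (loss_plus - loss_minus) / (2 * eps)
db_num = (mse(X_toy, Y_toy, W_toy, b_toy + eps) - mse(X_toy, Y_toy, W_toy, b_toy - eps)) / (2 * eps)

print(np.allclose(dW, dW_num, atol=1e-6), np.isclose(db, db_num, atol=1e-6))
```

If the two sets of numbers disagree, the usual suspects are a missing factor of $2/N$ or a transposed matrix.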

In [None]:
def get_gradients(y_true, y_pred, W, b, X):
    r"""
    Calculates the gradients for the MSE loss function with respect to the weights (W) and bias (b).

    Gradient of Loss (L) w.r.t W: $\nabla_W \mathscr{L} = \frac{2}{N} X^T (Y_{pred} - Y_{true})$
    Gradient of Loss (L) w.r.t b: $\nabla_b \mathscr{L} = \frac{2}{N} \sum (Y_{pred} - Y_{true})$
    """
    N = y_true.shape[0]
    error = y_pred - y_true # (N, 1)

    # dW: (D, N) @ (N, 1) -> (D, 1)
    dW = (2/N) * (X.T @ error)

    # db: scalar
    db = (2/N) * np.sum(error)

    return dW, db

Update the weights using the gradients

In [None]:
def update(weights, bias, gradients_weights, gradients_bias, lr):
    r"""
    Updates the weights (and bias) using the gradients and the learning rate.
    $W_{new} = W - \alpha \nabla_W \mathscr{L}$
    $b_{new} = b - \alpha \nabla_b \mathscr{L}$
    """
    weights_new = weights - lr * gradients_weights
    bias_new = bias - lr * gradients_bias
    return weights_new, bias_new

Put all of these together: in a loop, compute the loss value and its gradients, then update the weights. Feel free to play around with different learning rates and epochs
  
> NOTE: The code in comments is just meant to be used as a guide. You will have to make changes based on your code
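
To get a feel for how the learning rate affects training, the sketch below runs the same gradient descent loop on small synthetic data with three different rates. The data, the helper `train`, and the chosen rates are all assumptions made for this illustration. They are independent of the notebook's dataset and functions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.random((50, 2))                       # features already in [0, 1]
Y_toy = X_toy @ np.array([[2.0], [-1.0]]) + 0.5   # noiseless linear targets

def train(lr, epochs=200):
    """Run plain gradient descent on MSE and return the loss after the last update."""
    W = np.zeros((2, 1))
    b = 0.0
    n = X_toy.shape[0]
    for _ in range(epochs):
        err = X_toy @ W + b - Y_toy
        W = W - lr * (2 / n) * (X_toy.T @ err)
        b = b - lr * (2 / n) * np.sum(err)
    return float(np.mean((X_toy @ W + b - Y_toy) ** 2))

final_losses = {lr: train(lr) for lr in (1e-3, 1e-2, 1e-1)}
for lr, loss in final_losses.items():
    print(f"lr={lr:g}: final loss {loss:.6f}")
```

On this toy problem the larger rates reach a lower loss within the same number of epochs. This does not hold in general, because a rate that is too large makes the loss grow instead of shrink.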

In [None]:
NUM_EPOCHS = 1000
LEARNING_RATE = 2e-2

# Re-initialize weights and bias for training
W = np.random.randn(D, 1) * 0.01
b = np.random.randn(1) * 0.01

losses = []

for epoch in range(NUM_EPOCHS):
    # Model prediction: Y_pred = X @ W + b
    Y_pred = X @ W + b # (N, D) @ (D, 1) -> (N, 1)

    # Calculate loss
    loss = mse_loss_fn(Y, Y_pred)
    losses.append(loss)

    # Calculate gradients
    dW, db = get_gradients(Y, Y_pred, W, b, X)

    # Update weights and bias
    W, b = update(W, b, dW, db, LEARNING_RATE)

    # Print loss periodically
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch + 1}/{NUM_EPOCHS}, Loss: {loss:.6f}")

final_train_loss = losses[-1]
print(f"\nFinal Training Loss (Scaled): {final_train_loss:.6f}")

Now use matplotlib to plot the loss graph

In [None]:
plt.figure(figsize=(8, 4))
plt.plot(losses)
plt.xlabel('Epochs')
plt.ylabel('Loss (MSE)')
plt.title('Training Loss over Epochs')
plt.grid(True)
plt.show()

print("-" * 35)

### **Testing with test data**

Load and apply all the preprocessing steps used in the training data for the testing data as well. Remember to use the **SAME** min/max values which you used for the training set and not recalculate them from the test set. Also mention why we are doing this.
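
Reusing what was fitted on the training set also applies to the one-hot columns: the test matrix needs the same columns in the same order as the training matrix. A minimal sketch of one possible way to do this uses `DataFrame.reindex` on toy frames. The column names are illustrative, and this is not necessarily how the cells below do it.

```python
import pandas as pd

# Toy one-hot encoded frames; column names are illustrative only
train_enc = pd.DataFrame({"Area": [1.0, 2.0], "Color_blue": [1, 0], "Color_red": [0, 1]})
test_enc = pd.DataFrame({"Color_green": [1], "Area": [3.0], "Color_red": [0]})

# Keep exactly the training columns, in training order:
# columns missing from the test set are filled with 0, unseen ones are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned.columns.tolist())  # ['Area', 'Color_blue', 'Color_red']
```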

To load test data from GitHub, use the code below.


In [None]:
df_test = pd.read_csv('https://raw.githubusercontent.com/cronan03/DevSoc_AI-ML/main/test_processed_splitted.csv')
print(df_test)

# Let's find all the columns that are missing in the test set
missing_cols = set(df.columns) - set(df_test.columns)

# Add these columns to the test set with all zeros
for col in missing_cols:
    df_test[col] = 0

if 'Utilities_AllPub' not in df_test.columns:
    df_test = df_test.join(pd.get_dummies(df_test['Utilities'], dtype = 'int32', prefix = 'Utilities'))
    df_test = df_test.drop('Utilities', axis = 1)



Using the weights learnt above, predict the values in the test dataset. Also answer the following questions:
- Are the predictions good?
- What is the MSE loss for the test set?
- Is the MSE loss for testing greater or lower than training?
- Why is this the case?
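
A scaled MSE is hard to judge on its own, so it can help to convert it back into the units of `SalePrice`. The sketch below uses hypothetical prices and a hypothetical target range, not results from this notebook. It shows that multiplying the square root of the scaled MSE by the target range gives the same RMSE as computing it directly on the original values.

```python
import numpy as np

# Hypothetical numbers for illustration, not results from this notebook
target_min_toy, target_max_toy = 50_000.0, 550_000.0
y_true = np.array([[100_000.0], [200_000.0], [300_000.0]])
y_pred = np.array([[110_000.0], [190_000.0], [320_000.0]])

scale = target_max_toy - target_min_toy
y_true_scaled = (y_true - target_min_toy) / scale
y_pred_scaled = (y_pred - target_min_toy) / scale
mse_scaled = np.mean((y_pred_scaled - y_true_scaled) ** 2)

# Undo the scaling: MSE scales with the square of the range, RMSE with the range itself
rmse_original = np.sqrt(mse_scaled) * scale
rmse_direct = np.sqrt(np.mean((y_pred - y_true) ** 2))
print(np.isclose(rmse_original, rmse_direct))
```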

In [None]:
# Preprocess the test data using the statistics stored from the training set
print("--- Testing Data Preprocessing and Prediction ---")

# Load test data
df_test = pd.read_csv('https://raw.githubusercontent.com/cronan03/DevSoc_AI-ML/main/test_processed_splitted.csv')
print("Initial Test Data Head:")
print(df_test.head(2))

# Separate features and target
y_test_original = df_test['SalePrice'].copy()
X_test = df_test.drop('SalePrice', axis=1)

# Apply One-Hot Encoding to categorical columns in the test set
categorical_cols_test = X_test.select_dtypes(include='object').columns.tolist()
X_test = pd.get_dummies(X_test, columns=categorical_cols_test, dtype='int32')

# Drop a raw 'Utilities' column if one is still present after one-hot encoding
if 'Utilities' in X_test.columns:
    X_test = X_test.drop('Utilities', axis=1)

# Alignment with Training Data Columns
# Rationale: We must ensure the test data matrix (x_test) has the same number of features (columns)
# and that they are in the **exact same order** as the training data (X), so that each weight $W_i$
# corresponds to the correct feature $X_i$ during matrix multiplication.

X_train_cols = list(min_vals.index) # Use the columns from the training min/max

# Find missing columns in X_test and add them with 0
missing_cols = set(X_train_cols) - set(X_test.columns)
for col in missing_cols:
    X_test[col] = 0

# Find extra columns in X_test and drop them
extra_cols = set(X_test.columns) - set(X_train_cols)
if extra_cols:
    X_test = X_test.drop(columns=list(extra_cols))

# Ensure the columns are in the exact same order as training
X_test = X_test[X_train_cols]
# Fill NaN values
X_test.fillna(0, inplace=True)

# Scale features
range_vals = max_vals - min_vals
range_vals[range_vals == 0] = 1 # Handle constant columns
X_test_scaled = (X_test - min_vals) / range_vals
y_test_scaled = (y_test_original - target_min) / (target_max - target_min)

# Convert to numpy array
x_test = X_test_scaled.to_numpy() # (N_test, D)
y_test = y_test_scaled.to_numpy().reshape(-1, 1) # (N_test, 1)

print(f"\nFinal Test Features shape: {x_test.shape}")
print("-" * 35)

In [None]:
'''extra_cols = list(set(df_test.columns) - set(df.columns))
print("Extra columns in df_test:", extra_cols)

missing_cols = list(set(df.columns) - set(df_test.columns))
print("Missing columns in df_test:", missing_cols)'''

In [None]:
# Make predictions
Y_pred_test = x_test @ W + b # (N_test, 1)

# Calculate test loss (Scaled)
loss_test = mse_loss_fn(y_test, Y_pred_test)

# Scale the predictions and true values back to the original scale
Y_pred_test_original = Y_pred_test * (target_max - target_min) + target_min
Y_test_original = y_test * (target_max - target_min) + target_min

# Display sample results
idx = np.random.randint(0, x_test.shape[0], 5)
Y_pred_test_sample = Y_pred_test_original[idx].round().astype(int)
Y_true_test_sample = Y_test_original[idx].round().astype(int)

print('Predicted SalePrice: \t', Y_pred_test_sample.squeeze().tolist())
print('Actual SalePrice: \t', Y_true_test_sample.squeeze().tolist())
print('\nTest Loss (Scaled): \t', loss_test)


In [None]:
print("\n--- Model Analysis ---")
print(f"Final Training Loss (Scaled): \t {final_train_loss:.6f}")
print(f"Test Loss (Scaled): \t\t {loss_test:.6f}")

# 1. Are the predictions good?
print("\n- **Are the predictions good?**")
print(f"Based on the low scaled Test Loss of **{loss_test:.4f}**, the predictions are generally good. The low MSE indicates that the model's predictions are close to the actual scaled values, and the sample predictions are in the correct order of magnitude.")

# 2. What is the MSE loss for the testset
print("\n- **What is the MSE loss for the testset?**")
print(f"The **Mean Squared Error (MSE) loss** for the test set (on scaled data) is **{loss_test:.6f}**.")

# 3. Is the MSE loss for testing greater or lower than training
comparison = "greater" if loss_test > final_train_loss else "lower"
print("\n- **Is the MSE loss for testing greater or lower than training?**")
print(f"The MSE loss for testing ({loss_test:.6f}) is **{comparison}** than the final training loss ({final_train_loss:.6f}).")

# 4. Why is this the case
print("\n- **Why is this the case?**")
if comparison == "greater":
    print("It is **expected** for the test loss to be somewhat greater. The model's weights were optimized specifically for the training data, while the test data is unseen. A small gap between the two losses suggests that the model is **generalizing** reasonably well, whereas a large gap would point to overfitting.")
else:
    print("If the test loss is lower, it suggests the test set is either 'easier' (less noisy or less variable) than the training set, or that the model exhibits excellent generalization, though this outcome is less common than the test loss being slightly higher.")