## MLOps Assignment: Debugging Machine Learning Pipelines with Python
__Shisheer S Kaushik [ M23IQT006 ]__

__Task 1.1 :__

Using the debugger, I found that an `np.nan` (missing value) in the data array causes preprocessing to fail: the NaN propagates through `np.linalg.norm`, so every entry of the normalized output becomes NaN.

In [1]:
import pdb
import numpy as np
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

def preprocess_data(data):
    scaler = StandardScaler()
    pdb.set_trace()  # I have started the Debugger

    scaled_data = scaler.fit_transform(data)
    normalized_data = scaled_data / np.linalg.norm(scaled_data)
    return normalized_data

data = np.array([[1, 2, 3], [4, 5, np.nan], [7, 8, 9]])
processed_data = preprocess_data(data)
print("Processed Data:\n", processed_data)





> <ipython-input-1-e4f08cff32be>(11)preprocess_data()
      9     pdb.set_trace()  # I have started the Debugger
     10
---> 11     scaled_data = scaler.fit_transform(data)
     12     normalized_data = scaled_data / np.linalg.norm(scaled_data)
     13     return normalized_data
ipdb> p data
array([[ 1.,  2.,  3.],
       [ 4.,  5., nan],
       [ 7.,  8.,  9.]])
ipdb> n
> <ipython-input-1-e4f08cff32be>(12)preprocess_data()
     10




Processed Data:
 [[nan nan nan]
 [nan nan nan]
 [nan nan nan]]
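
The all-NaN result follows from how NaN propagates through the norm. The sketch below reproduces the effect with plain NumPy. The hand-rolled `nanmean`/`nanstd` scaling is an assumption standing in for `StandardScaler`, which in recent scikit-learn versions skips NaN when fitting and leaves it in place in the output.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, np.nan], [7, 8, 9]], dtype=float)

# Stand-in for StandardScaler (assumption): column statistics that skip NaN,
# so only the original missing entry is still NaN after scaling.
scaled = (data - np.nanmean(data, axis=0)) / np.nanstd(data, axis=0)
nan_count_after_scaling = int(np.isnan(scaled).sum())

# The Frobenius norm sums over every entry, so a single NaN makes the norm NaN.
norm = np.linalg.norm(scaled)

# Dividing by a NaN norm turns every entry into NaN.
normalized = scaled / norm
all_nan = bool(np.isnan(normalized).all())

print(nan_count_after_scaling, norm, all_nan)
```

So the scaler itself is not what wipes out the array; the single surviving NaN poisons the global norm, and the division spreads it to every cell.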


__Task 1.2 :__

To handle the above problem, I have replaced the missing values with the mean of their respective columns.

In [2]:
import pdb
import numpy as np
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

def preprocess_data(data):
    scaler = StandardScaler()
    pdb.set_trace()  # I have started the Debugger

    col_means = np.nanmean(data, axis=0) # The pdb found a nan or ( missing values ) is replaced  with the column mean
    data = np.where(np.isnan(data), col_means, data)

    scaled_data = scaler.fit_transform(data)
    normalized_data = scaled_data / np.linalg.norm(scaled_data)
    return normalized_data

data = np.array([[1, 2, 3], [4, 5, np.nan], [7, 8, 9]])
processed_data = preprocess_data(data)
print("Processed Data:\n", processed_data)


> <ipython-input-2-015a884535db>(11)preprocess_data()
      9     pdb.set_trace()  # I have started the Debugger
     10
---> 11     col_means = np.nanmean(data, axis=0) # The pdb found a nan or ( missing values ) is replaced  with the column mean
     12     data = np.where(np.isnan(data), col_means, data)
     13
ipdb> p data
array([[ 1.,  2.,  3.],
       [ 4.,  5., nan],
       [ 7.,  8.,  9.]])
ipdb> n
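
As a minimal sketch of what the column-mean imputation does to this particular array, the snippet below runs the same two NumPy calls and then checks that the normalized result contains no NaN. The manual mean/std scaling is an assumption standing in for `StandardScaler`.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, np.nan], [7, 8, 9]], dtype=float)

# Column means computed while skipping NaN: [4., 5., 6.]
col_means = np.nanmean(data, axis=0)

# Replace each NaN with the mean of its column (col_means broadcasts across rows).
imputed = np.where(np.isnan(data), col_means, data)

# Stand-in for StandardScaler (assumption): zero mean, unit population std per column.
scaled = (imputed - imputed.mean(axis=0)) / imputed.std(axis=0)
normalized = scaled / np.linalg.norm(scaled)

print(imputed)
print(np.isnan(normalized).any())  # False
```

The missing entry becomes 6.0, the mean of 3 and 9, and the normalized array then has unit Frobenius norm.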

__Task 2.1 :__

Using the debugger, I found a sample-count mismatch: `X.shape` is (4, 1) but `y.shape` is (5,), so `model.fit` raises a `ValueError`.

In [3]:
from sklearn.linear_model import LinearRegression
import numpy as np
def train_model(X, y):
  model = LinearRegression()
  pdb.set_trace() # Debugger for model training
  model.fit(X, y)
  return model

# Example Data
X = np.array([[1], [2], [3], [4]])
y = np.array([1, 4, 9, 16, 25]) # Incorrect shape on purpose
trained_model = train_model(X, y)
print("Trained Model Coefficients:", trained_model.coef_)

> <ipython-input-3-e807d9f76359>(6)train_model()
      4   model = LinearRegression()
      5   pdb.set_trace() # Debugger for model training
----> 6   model.fit(X, y)
      7   return model
      8
ipdb> p X.shape
(4, 1)
ipdb> p y.shape
(5,)
ipdb> n
ValueError: Found input variables with inconsistent numbers of samples: [4, 5]
> <ipython-input-3-e807d9f76359>(6)train_model()
      4   model = LinearRegression()
      5

ValueError: Found input variables with inconsistent numbers of samples: [4, 5]
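
A mismatch like this can also be caught before calling `fit`. The following is a minimal sketch, not part of the assignment code: `check_sample_counts` is a hypothetical helper that compares the first dimension of `X` and `y` and fails early with a readable message.

```python
import numpy as np

def check_sample_counts(X, y):
    """Hypothetical helper: raise early if X and y disagree on sample count."""
    n_X, n_y = X.shape[0], y.shape[0]
    if n_X != n_y:
        raise ValueError(f"X has {n_X} samples but y has {n_y}")
    return n_X

X = np.array([[1], [2], [3], [4]])
y_bad = np.array([1, 4, 9, 16, 25])
y_ok = y_bad[: X.shape[0]]  # keep only as many targets as there are rows in X

try:
    check_sample_counts(X, y_bad)
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print("Caught:", err)

n_samples = check_sample_counts(X, y_ok)
```

Trimming `y` is only appropriate here because the extra value (25) has no matching row in `X`; with real data, the right fix depends on which side is actually wrong.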

__Task 2.2 & 2.3 :__

I resolved the issue by removing the extra element from `y`, so that its shape, (4,), matches the four samples in `X`.

In [5]:
from sklearn.linear_model import LinearRegression
import numpy as np
import pdb

def train_model(X, y):
    pdb.set_trace()  # I have started the Debugger
    model = LinearRegression()
    model.fit(X, y)
    return model

# pdb revealed inconsistent sample counts ([4, 5]), so y is replaced with corrected data
X = np.array([[1], [2], [3], [4]])
y = np.array([1, 4, 9, 16])  # Shape matches with X

trained_model = train_model(X, y)
print("Trained Model Coefficients:", trained_model.coef_)


> <ipython-input-5-fdbd0fc531b1>(7)train_model()
      5 def train_model(X, y):
      6     pdb.set_trace()  # I have started the Debugger
----> 7     model = LinearRegression()
      8     model.fit(X, y)
      9     return model
ipdb> p X.shape
(4, 1)
ipdb> p y.shape
(4,)
ipdb> n
> <ipython-input-5-fdbd0fc531b1>(8)train_model()
      6     pdb.set_trace()  # I hav

__Task 3.1 :__

I found an unexpectedly high error, `Mean Squared Error: 337.0`. The debugger shows predictions of [20, 25, 30, 35] against true values of [25, 36, 49, 64].

In [7]:
from sklearn.metrics import mean_squared_error
import numpy as np
def evaluate_model(model, X_test, y_test):
  predictions = model.predict(X_test)

  pdb.set_trace() # Debugger for evaluation
  mse = mean_squared_error(y_test, predictions)
  return mse

X_test = np.array([[5], [6], [7], [8]])
y_test = np.array([25, 36, 49, 64])

# Using trained_model from previous task
mse_score = evaluate_model(trained_model, X_test, y_test)
print("Mean Squared Error:", mse_score)

> <ipython-input-7-c8b28db1e630>(7)evaluate_model()
      5
      6   pdb.set_trace() # Debugger for evaluation
----> 7   mse = mean_squared_error(y_test, predictions)
      8   return mse
      9
ipdb> p predictions
array([20., 25., 30., 35.])
ipdb> p y_test
array([25, 36, 49, 64])
ipdb> n
> <ipython-input-7-c8b28db1e630>(8)evaluate_model()
      6   pdb.set_trace() # Debugger for evaluation
      7   mse = mean_squared_error(

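The value 337.0 can be checked by hand from the debugger output: the errors are 5, 11, 19, and 29, and their squares average to (25 + 121 + 361 + 841) / 4 = 337. The sketch below repeats that arithmetic. As an assumption about what `LinearRegression` learned, it fits a straight line with `np.polyfit`, which gives y = 5x - 5 on the training data.

```python
import numpy as np

# Training data from the earlier task (y = x^2 sampled at x = 1..4).
x_train = np.array([1, 2, 3, 4], dtype=float)
y_train = np.array([1, 4, 9, 16], dtype=float)

# Least-squares straight line; np.polyfit returns [slope, intercept] for deg=1.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

x_test = np.array([5, 6, 7, 8], dtype=float)
y_test = np.array([25, 36, 49, 64], dtype=float)

predictions = slope * x_test + intercept      # approximately [20, 25, 30, 35]
mse = np.mean((y_test - predictions) ** 2)    # approximately 337.0

print(slope, intercept, mse)
```

The error grows with x because a straight line fitted on x = 1..4 falls further below the parabola the further it is extrapolated.
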
__Task 3.2 :__

a) The training data follows a quadratic relationship (y = x^2). However, my model is a linear regression, which cannot capture this quadratic pattern well.

b) So I use `PolynomialFeatures(degree=2)`, which transforms _X_ into the terms [1, x, x^2]. This lets the model fit a quadratic curve, which matches y = x^2.

c) I also transform the __test data__ with the same fitted transformer to ensure consistency with the __training data__.
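
To make points b) and c) concrete, here is a minimal sketch of the degree-2 expansion for a single feature, written with plain NumPy rather than `PolynomialFeatures`. The helper name `expand_degree_2` is hypothetical; for one input column it is meant to produce the same [1, x, x^2] columns.

```python
import numpy as np

def expand_degree_2(X):
    """Hypothetical helper: map a single-column X of shape (n, 1) to columns [1, x, x^2]."""
    x = X[:, 0].astype(float)
    return np.column_stack([np.ones_like(x), x, x ** 2])

X_train = np.array([[1], [2], [3], [4]])
X_test = np.array([[5], [6], [7], [8]])

# The same transformation is applied to both sets, so the columns line up.
X_train_poly = expand_degree_2(X_train)
X_test_poly = expand_degree_2(X_test)

print(X_train_poly)
print(X_test_poly)
```

If the test data skipped this step, the model would receive one column where it was trained on three, and prediction would fail.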

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
import pdb

def train_model(X, y, degree=2):
    pdb.set_trace() # I have started the Debugger

    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)

    model = LinearRegression()
    model.fit(X_poly, y)
    return model, poly


# Same corrected data as in the previous task: X and y both have 4 samples
X = np.array([[1], [2], [3], [4]])
y = np.array([1, 4, 9, 16])  # Shape matches with X

trained_model, poly = train_model(X, y, degree=2)
print("Trained Model Coefficients:", trained_model.coef_)

> <ipython-input-8-ee148a63beb0>(9)train_model()
      7     pdb.set_trace() # I have started the Debugger
      8
----> 9     poly = PolynomialFeatures(degree=degree)
     10     X_poly = poly.fit_transform(X)
     11
ipdb> c
Trained Model Coefficients: [0.00000000e+00 2.10942375e-15 1.00000000e+00]
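
The printed coefficients line up with the feature order [1, x, x^2]: roughly 0 for the constant column, roughly 0 for x, and 1 for x^2, so the fitted model is essentially y = x^2. As a rough cross-check that does not depend on scikit-learn, the sketch below solves the same least-squares problem with `np.linalg.lstsq`. Folding the intercept into the constant column is an assumption of this sketch; `LinearRegression` stores it separately in `intercept_`.

```python
import numpy as np

x_train = np.array([1, 2, 3, 4], dtype=float)
y_train = np.array([1, 4, 9, 16], dtype=float)
x_test = np.array([5, 6, 7, 8], dtype=float)
y_test = np.array([25, 36, 49, 64], dtype=float)

def design(x):
    # Columns [1, x, x^2], matching the degree-2 feature order.
    return np.column_stack([np.ones_like(x), x, x ** 2])

# Least-squares solution; here the intercept lives in the first coefficient.
coef, *_ = np.linalg.lstsq(design(x_train), y_train, rcond=None)

predictions = design(x_test) @ coef
mse = np.mean((y_test - predictions) ** 2)

print(np.round(coef, 6))   # close to [0, 0, 1]
print(mse)                 # close to 0
```

Under these assumptions the test error drops from 337.0 to numerically zero, which is what one would expect when the model family contains the true relationship.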


In [9]:
from sklearn.metrics import mean_squared_error
import numpy as np

def evaluate_model(model, poly, X_test, y_test):
    # Transform X_test to match the polynomial features used during training
    X_test_poly = poly.transform(X_test)

    # Predict using the transformed test data
    predictions = model.predict(X_test_poly)

    pdb.set_trace()  # Debugger for evaluation
    mse = mean_squared_error(y_test, predictions)
    return mse

# Example test data to evaluate the model
X_test = np.array([[5], [6], [7], [8]])
y_test = np.array([25, 36, 49, 64])

# Evaluate the model and calculate MSE
mse_score = evaluate_model(trained_model, poly, X_test, y_test)
print("Mean Squared Error:", mse_score)

> <ipython-input-9-98fb646ddaea>(12)evaluate_model()
     10
     11     pdb.set_trace()  # Debugger for evaluation
---> 12     mse = mean_squared_error(y_test, predictions)
     13     return mse
     14
ipdb> n
> <ipython-input-9-98fb646ddaea>(13)evaluate_model()
     11     pdb.set_trace()  # Debugger for evaluation
     12     mse = mean_squared_error(y_test, predictions)