# Polynomial Regression on nonlinear data
by Santiago Luna

In [None]:
# Metrics
from sklearn.metrics import mean_squared_error
# Load data
import pandas as pd
import numpy as np

DATA_PATH_TRUE = 'DS-5-1-GAP-0-1-N-0_v2.csv'
DATA_PATH_NOISE1 = 'DS-5-1-GAP-1-1-N-1_v2.csv'
DATA_PATH_NOISE2 = 'DS-5-1-GAP-5-1-N-3_v2.csv'

d_true = pd.read_csv(DATA_PATH_TRUE, header=None)
d_noise1 = pd.read_csv(DATA_PATH_NOISE1, header=None)
d_noise2 = pd.read_csv(DATA_PATH_NOISE2, header=None)

# Polynomial interpolation
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

These imports cover the regression models, the MSE calculation, and reading the data from files; Matplotlib is imported later, in the first plotting cell. Reading the data correctly matters for model performance: with `header=None`, pandas treats the first row of each file as data rather than as column names.

Note: I couldn't read the data from the .dat files, so I used the .csv versions instead.
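
For reference, a minimal sketch of one way a `.dat` file might be read with pandas. The layout of the `.dat` files is not shown in this notebook, so this assumes they are plain text with whitespace-separated numeric columns; the sample text and the name `d_example` are hypothetical stand-ins, not the real data.

```python
import io

import pandas as pd

# Hypothetical stand-in for a whitespace-delimited .dat file (assumption:
# the real files may use a different layout or delimiter).
dat_text = """0.0 1.25 1.30
1.0 1.27 1.31
2.0 1.22 1.35
"""

# sep=r"\s+" splits on any run of spaces or tabs; header=None keeps
# integer column labels (0, 1, 2), matching the CSV reads above.
d_example = pd.read_csv(io.StringIO(dat_text), sep=r"\s+", header=None)
print(d_example.shape)
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file path. If the file has comment or header lines, further options would be needed.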


In [None]:
degree = 6  # polynomial degree
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())

X = d_noise1[0]  # time
x = np.array(X)[:, np.newaxis]
Y = d_noise1[1]  # mag_A
y = np.array(Y)[:, np.newaxis]

X_test = d_true[0]
Y_test = d_true[1]

# Training
model.fit(x, y)  # get polynomial model for training data

x_test = np.array(X_test)[:, np.newaxis]
y_test = np.array(Y_test)[:, np.newaxis]

# Testing
y_pred_train = model.predict(x)
y_pred_test = model.predict(x_test)

MSE_train = mean_squared_error(y, y_pred_train)
MSE_test = mean_squared_error(y_test, y_pred_test)

print("MSE train ", MSE_train)
print("MSE test ", MSE_test)

# Plot
import matplotlib.pyplot as plt

plt.plot(x_test, y_test, color='k', label="True")
plt.scatter(X, y, edgecolor='b', s=20, label="Training samples")
plt.plot(x_test, y_pred_test, color='g', label="Polynomial model")
plt.xlabel("time")
plt.ylabel("mag")
plt.legend(loc="best")
plt.title("Degree 1st {}\nMSE_train = {:.8} \nMSE_test = {:.8}".format(
    degree, MSE_train, MSE_test))
plt.show()

### First part
In this part, a pipeline with a polynomial of degree six (6) is created, which gave the best result in terms of MSE. The time and magnitude columns are extracted from the file, and the model is fitted to the training data. Predictions are then made on the test set and the MSE is calculated; both MSE values appear in the title of the plot.


![image.png](attachment:image.png)
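
The degree comparison behind the choice of six is not shown in the notebook. A minimal sketch of how such a comparison could look is given here, assuming synthetic stand-in data (a noisy sine curve) instead of the CSV files; the names `x_noisy`, `y_noisy`, `x_clean`, `y_clean`, `candidate`, and `results` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): a smooth curve plus Gaussian noise.
x_clean = np.linspace(0.0, 1.0, 200)[:, np.newaxis]
y_clean = np.sin(2 * np.pi * x_clean)
x_noisy = np.sort(rng.uniform(0.0, 1.0, 60))[:, np.newaxis]
y_noisy = np.sin(2 * np.pi * x_noisy) + rng.normal(0.0, 0.1, x_noisy.shape)

# Fit one pipeline per degree and record (train MSE, test MSE).
results = {}
for deg in range(1, 11):
    candidate = make_pipeline(PolynomialFeatures(deg), LinearRegression())
    candidate.fit(x_noisy, y_noisy)
    mse_train = mean_squared_error(y_noisy, candidate.predict(x_noisy))
    mse_test = mean_squared_error(y_clean, candidate.predict(x_clean))
    results[deg] = (mse_train, mse_test)
    print(f"degree {deg:2d}  MSE train {mse_train:.5f}  MSE test {mse_test:.5f}")

# Pick the degree with the lowest MSE against the noise-free curve.
best_degree = min(results, key=lambda d: results[d][1])
print("lowest test MSE at degree", best_degree)
```

Note that this picks the degree using the test curve itself, which tends to give an optimistic estimate; a separate validation split would be the more cautious choice.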

In [None]:
####################### Second part
# DATA_PATH_NOISE2 = 'DS-5-1-GAP-5-1-N-1_v2.csv'
degree = 9  # polynomial degree
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())

X = d_noise2[0]  # time
x = np.array(X)[:, np.newaxis]
Y = d_noise2[2]  # mag_A
y = np.array(Y)[:, np.newaxis]

# Training
model.fit(x, y)  # get polynomial model for training data

# Testing
y_pred_train = model.predict(x)
y_pred_test = model.predict(x_test)

MSE_train = mean_squared_error(y, y_pred_train)
MSE_test = mean_squared_error(y_test, y_pred_test)

print("MSE train ", MSE_train)
print("MSE test ", MSE_test)

# Plot (matplotlib.pyplot was already imported as plt in the first part)
plt.plot(x_test, y_test, color='k', label="True")
plt.scatter(X, y, edgecolor='b', s=20, label="Training samples")
plt.plot(x_test, y_pred_test, color='b', label="Polynomial model")
plt.xlabel("time")
plt.ylabel("mag")
plt.legend(loc="best")
plt.title("Degree 2nd {}\nMSE_train = {:.8} \nMSE_test = {:.8}".format(
        degree, MSE_train, MSE_test))
plt.show()

### Second part

Again, a pipeline is created, but in this case the chosen degree is nine (9), which appeared to be one of the best options for minimizing the MSE. The data comes from the second noisy file, and the magnitude is taken from column index 2 instead of 1. The steps of training, prediction, and MSE calculation are repeated, and the resulting plot is shown below.


![image.png](attachment:image.png)
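
Since the training, prediction, and MSE steps are the same in every part, they could be wrapped in a small helper. This is only a sketch: `fit_and_score` and the demo variables are hypothetical names that do not appear in the notebook, and the check at the end uses a synthetic quadratic, not the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def fit_and_score(x_train, y_train, x_eval, y_eval, degree):
    """Fit a polynomial of the given degree; return the model and both MSEs."""
    # scikit-learn expects 2-D feature arrays, so reshape to one column.
    x_train = np.asarray(x_train, dtype=float).reshape(-1, 1)
    x_eval = np.asarray(x_eval, dtype=float).reshape(-1, 1)
    y_train = np.asarray(y_train, dtype=float)
    y_eval = np.asarray(y_eval, dtype=float)

    poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    poly_model.fit(x_train, y_train)

    mse_train = mean_squared_error(y_train, poly_model.predict(x_train))
    mse_eval = mean_squared_error(y_eval, poly_model.predict(x_eval))
    return poly_model, mse_train, mse_eval


# Small synthetic check (assumption: not the notebook's data).
t = np.linspace(-1.0, 1.0, 50)
mag = 2.0 * t**2 - t + 0.5
_, demo_mse_train, demo_mse_eval = fit_and_score(t, mag, t, mag, degree=2)
print(demo_mse_train, demo_mse_eval)
```

Assuming the data frames from the first cell are loaded, a call such as `fit_and_score(d_noise2[0], d_noise2[2], d_true[0], d_true[1], degree=9)` should correspond to the steps of this second part.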

In [None]:
####################### Third part
# DATA_PATH_NOISE2 = 'DS-5-1-GAP-5-1-N-3_v2.csv'
degree = 6  # polynomial degree
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())

X = d_noise2[0]  # time
x = np.array(X)[:, np.newaxis]
Y = d_noise2[3]  # mag_A
y = np.array(Y)[:, np.newaxis]

# Training
model.fit(x, y)  # get polynomial model for training data

# Testing
y_pred_train = model.predict(x)
y_pred_test = model.predict(x_test)

MSE_train = mean_squared_error(y, y_pred_train)
MSE_test = mean_squared_error(y_test, y_pred_test)

print("MSE train ", MSE_train)
print("MSE test ", MSE_test)

# Plot (matplotlib.pyplot was already imported as plt in the first part)
plt.plot(x_test, y_test, color='k', label="True")
plt.scatter(X, y, edgecolor='b', s=20, label="Training samples")
plt.plot(x_test, y_pred_test, color='r', label="Polynomial model")
plt.xlabel("time")
plt.ylabel("mag")
plt.legend(loc="best")
plt.title("Degree 3rd {}\nMSE_train = {:.8} \nMSE_test = {:.8}".format(
        degree, MSE_train, MSE_test))
plt.show()

### Third part

A pipeline with a polynomial of degree six (6) is created again, using the first column [0] and the fourth column [3] of the file to fit the model. The steps of training, prediction, MSE calculation, and visualization are repeated. I selected the fourth column [3] because the file number (N-3) did not mention anything about an intermediate dataset or file that could be the third.


![image.png](attachment:image.png)


If you've reached this point, thank you for taking the time. If you have any comments or suggestions, please feel free to let me know.
