Initially, we randomly shuffled the given dataset and split it into training, validation and test sets of approximately 70%, 10% and 20% respectively. Each set was imported as a dataframe. We also applied min-max normalisation to the data, which gave uniformly scaled surface plots and coefficients and made our results easier to visualise.
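
The shuffle, split and normalisation steps described above could look roughly like the sketch below. The helper name `split_and_normalise`, the fixed seed and the synthetic dataframe are illustrative assumptions, not part of the original notebook. One deliberate difference: the cells in this notebook scale each file with its own minimum and maximum, whereas this sketch reuses the training minimum and maximum for all three splits, which is one common way of keeping the splits on the same scale.

```python
import numpy as np
import pandas as pd

def split_and_normalise(data, train_frac=0.7, val_frac=0.1, seed=0):
    """Shuffle, split into train/validation/test, then min-max scale (hypothetical helper)."""
    shuffled = data.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = int(round(train_frac * len(shuffled)))
    n_val = int(round(val_frac * len(shuffled)))
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]
    # Assumption: every split is scaled with the training minimum and maximum
    lo, hi = train.min(), train.max()
    scale = lambda part: (part - lo) / (hi - lo)
    return scale(train), scale(val), scale(test)

# Synthetic stand-in data, not the LC50 dataset
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'target'])
train, val, test = split_and_normalise(demo)
print(len(train), len(val), len(test))
```

With this choice the validation and test values can fall slightly outside the range 0 to 1, since their extremes need not match those of the training split.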

**TRAINING DATA**




In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/gibsonjackson/FODSIMG/main/TrainFinal.csv')
fd = df
m=len(df)
df=(df-df.min())/(df.max()-df.min()) #normalising the data

In [None]:
x = df.drop(columns = 'quantitative response of LC50')

y = df['quantitative response of LC50']
Y = y.tolist() #target values of LC50 from dataset
Pred_Y=[0.0]*len(Y) #predicted values of LC50 by regression model

x1 = df['MLOGP']
X1 = x1.tolist()

x2 = df['RDCHI']
X2 = x2.tolist()

*Plotting the dataset on a 3D plot and visualising the points as a scatter plot*

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X1,X2,Y)

# label the axes with the plotted columns
ax.set_xlabel('MLOGP')
ax.set_ylabel('RDCHI')
ax.set_zlabel('quantitative response of LC50')
plt.show()

Then, we created a gradient descent regression model of all degrees from 0-9 to predict LC50 value and checked our training error on the training dataset. 

The theta values (weights) are stored in a matrix of size (degree+1) * (degree+1), initialised to 0. Only the entries theta[i][j] with i + j less than or equal to the degree are used.

**Hypothesis function**: Initialises a list of predicted Y values to all 0's. For each data point, a nested loop runs over i and j as long as their sum is less than or equal to the degree, and adds the term theta[i][j] multiplied by X1 raised to the power i and X2 raised to the power j, both taken at that data point. The prediction for a point is the sum of all these terms.
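
The double sum described above can also be written with NumPy arrays instead of a loop over data points. This is only a sketch: `predict_poly` is a hypothetical name, and the small degree-1 example uses made-up weights to show how the indices of theta map to powers of X1 and X2.

```python
import numpy as np

def predict_poly(x1, x2, theta, degree):
    """Sum theta[i][j] * x1**i * x2**j over all i + j <= degree (hypothetical helper)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    pred = np.zeros_like(x1)
    for i in range(degree + 1):
        for j in range(degree + 1 - i):  # keeps i + j <= degree
            pred += theta[i][j] * x1**i * x2**j
    return pred

# Degree-1 example: y = 1 + 2*x1 + 3*x2
theta = [[1.0, 3.0],   # theta[0][0] is the constant, theta[0][1] multiplies x2
         [2.0, 0.0]]   # theta[1][0] multiplies x1, theta[1][1] is unused at degree 1
print(predict_poly([0.0, 1.0, 0.5], [0.0, 1.0, 0.2], theta, degree=1))
```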

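**Cost function**: This function measures the error of the predicted LC50 values (calculated by the hypothesis) with respect to the target values. As written, it takes the square root of each squared difference, which is the absolute error, sums these over all entries and divides by 2m. The result is therefore half the mean absolute error rather than a root mean square error.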
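
Because the square root of a squared difference is its absolute value, the expression used in the cost cell evaluates to half the mean absolute error. The sketch below places that quantity next to a root mean square error on a tiny made-up example; both function names are illustrative.

```python
import numpy as np

def half_mean_absolute_error(pred, target):
    """Same quantity as the cost cell: sum(sqrt(err**2)) / (2 * m)."""
    err = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return np.sum(np.sqrt(err**2)) / (2 * len(err))

def root_mean_square_error(pred, target):
    """RMSE, shown only for comparison."""
    err = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return np.sqrt(np.mean(err**2))

pred = [0.0, 0.0, 0.0, 0.0]
target = [1.0, 1.0, 1.0, 3.0]
print(half_mean_absolute_error(pred, target))  # absolute errors sum to 6, divided by 8
print(root_mean_square_error(pred, target))    # square root of the mean squared error 3
```

The two measures differ most when a few points have large errors, since squaring weights those points more heavily.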

**Gradient Linear Regression function**: This algorithm is run for a fixed number of epochs (10,000 here). A check that stops early once the difference between the current cost and the last cost falls below a very small fixed value is present in the code but commented out. At every epoch the weights are updated from the old weights, the learning rate and the gradient of the squared error, and the cost is then recalculated. The list J stores the cost of every epoch.
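
The stopping rule mentioned above (stop when the cost changes by less than a small tolerance) is sketched below on a plain one-variable linear fit rather than on the polynomial model. The function name, the tolerance `tol` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, epochs=10000, tol=1e-6):
    """Fit y ~ w*x + b by batch gradient descent with a cost-change stopping rule."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w, b = 0.0, 0.0
    costs = []
    for _ in range(epochs):
        err = w * x + b - y
        # Simultaneous update: both gradients use the same residuals
        w -= alpha * np.mean(err * x)
        b -= alpha * np.mean(err)
        costs.append(np.mean((w * x + b - y) ** 2) / 2)
        # Stop once the cost improves by less than tol between epochs
        if len(costs) > 1 and costs[-2] - costs[-1] < tol:
            break
    return w, b, costs

x = np.linspace(0, 1, 50)
y = 2 * x + 1  # noiseless synthetic line
w, b, costs = gradient_descent(x, y)
print(w, b, len(costs))
```

A tolerance on the cost trades accuracy for time: the loop ends well before the epoch limit, but the weights are only approximately at the minimum.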

*Creating a gradient descent regression model for polynomials of degrees 0-9*

In [None]:
def hypothesis(X1,X2, theta,degree):
  Pred_Y=[0.0]*len(X1)
  for y in range(len(Pred_Y)):
    for i in range(degree+1):
      for j in range(degree+1):
        if (i+j>degree):
          break
        Pred_Y[y] = Pred_Y[y] + (theta[i][j]*pow(X1[y],i)*pow(X2[y],j))
  return Pred_Y

In [None]:
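def cost(X1, X2,Y, theta, degree):
    # half the mean absolute error: sqrt(err**2) is |err|
    y1 = hypothesis(X1,X2,theta,degree)
    # divide by the length of Y rather than the global m, so the result does not depend on cell order
    return sum(np.sqrt((np.array(y1)-np.array(Y))**2))/(2*len(Y))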

In [None]:
def gradientLinearRegression(X1,X2,Y,degree,alpha,epoch):
  # theta[i][j] is the weight of the term X1**i * X2**j; only entries with i+j<=degree are used
  theta = [[0.0]*(degree+1) for _ in range(degree+1)]
  J = [] # cost after every epoch
  k = 0
  size = len(Y)
  while k<epoch:
    Pred_Y = hypothesis(X1,X2,theta,degree) # predictions from the weights at the start of the epoch
    for i in range(degree+1):
      for j in range(degree+1):
        if (i+j>degree):
          break
        x1pow=[eachx**i for eachx in X1]
        x2pow=[eachx**j for eachx in X2]
        # step along the gradient of the squared error with respect to theta[i][j]
        theta[i][j]=theta[i][j]-alpha*sum(np.multiply(np.subtract(Pred_Y,Y),np.multiply(x1pow,x2pow)))/size

    current_cost = cost(X1,X2,Y, theta,degree) # renamed from j, which shadowed the loop variable
    J.append(current_cost)
    k+=1
    #if (k>1 and J[len(J)-2]-current_cost<1e-6):
      #break
  return J,theta

*Plotting surface plots of predicted polynomials and calculating training error*

In [None]:
def diagram(X1,X2,Theta, degree):
  # plots the fitted polynomial surface; the X1 and X2 arguments are replaced by a regular grid
  X1=np.arange(-1,1,0.01)
  X2=np.arange(-1,1,0.01)
  fig = plt.figure()
  ax = fig.add_subplot(111, projection='3d') # fig.gca(projection='3d') fails on recent Matplotlib
  X,Y=np.meshgrid(X1,X2)
  
  F=np.zeros_like(X) # same shape as the grid
  f = lambda i,j: Theta[i][j]*(X**i)*(Y**j)
  for i in range(degree+1):
    for j in range(degree+1):
      if (i+j>degree):
        break
      F = F + f(i,j)

  ax.set_xlabel('MLOGP')
  ax.set_ylabel('RDCHI')
  ax.set_zlabel('Quantitative response of LC50')
  
  Z=np.array(F)
  ax.plot_surface(X,Y,Z)
  plt.show()

Next, we calculate the training error for all 10 degrees of our models. Simultaneously, we plot the calculated polynomials for each degree. 

allTheta is a list containing the appended values of weights of all the 10 models, which is displayed below the plots. 

allJ is a list holding the final cost of each of the 10 models over all the entries in the training data, computed after gradient descent polynomial regression has finished. This is tabulated below the theta values.

In [None]:
import matplotlib.pyplot as plt
from tabulate import tabulate
from matplotlib import cm
from mpl_toolkits import mplot3d
%matplotlib inline

allTheta=[]
for i in range(10):
  J,Theta = gradientLinearRegression(X1,X2,Y,i,0.001,10000)
  diagram(X1,X2,Theta,i)
  allTheta.append(Theta)


print(tabulate(allTheta))

allJ=[]
for i in range(10):
  allJ.append(cost(X1,X2,Y,allTheta[i],i))


trainerror = pd.DataFrame(allJ, columns=['Training Error']) 
print(trainerror) 

As we can see above, the training error decreases consistently as the degree of the polynomial increases. Higher-degree polynomials fit the training data more and more closely, which points to **overfitting** if the error on unseen data does not improve in the same way. We will therefore view the impact on the testing data as well.
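
The behaviour of the training error can be reproduced on synthetic data. The sketch below does not use gradient descent: it swaps in an exact least-squares fit through NumPy's `Polynomial.fit`, on a made-up one-variable dataset, so the numbers are unrelated to the LC50 results. With an exact fit the training error cannot increase with the degree, because every lower-degree polynomial is also a candidate at the higher degree.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 60)  # synthetic data, not the LC50 set
x_train, y_train = x[:40], y[:40]
x_held, y_held = x[40:], y[40:]

train_mse, held_mse = [], []
for degree in range(10):
    model = Polynomial.fit(x_train, y_train, degree)  # exact least-squares fit
    train_mse.append(np.mean((model(x_train) - y_train) ** 2))
    held_mse.append(np.mean((model(x_held) - y_held) ** 2))

# One row per degree: training error, then error on the held-out points
for degree, (tr, he) in enumerate(zip(train_mse, held_mse)):
    print(degree, round(tr, 4), round(he, 4))
```

A model trained by gradient descent with a fixed epoch budget, as in this notebook, may not reach the exact minimum, so its training error need not follow this pattern as strictly.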



---



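Then, we evaluate the models on the validation data set and calculate the cost for each degree, similar to the process explained above. Gradient descent is also run on the validation set and the resulting weights are printed, but the validation errors are calculated with the weights learned on the training data (allTheta). This yields the polynomial degrees with the lowest error, in this case degrees 1 and 2.

We will then check the models on the test data, without running gradient descent again, and finalise the degree with the lowest cost.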
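
Choosing the degree from a list of per-degree errors amounts to taking the index of the smallest entry. The sketch below uses made-up error values purely for illustration; they are not the results of this notebook.

```python
import numpy as np

# Hypothetical validation errors for degrees 0-9 (illustrative values only)
validation_errors = [0.090, 0.071, 0.072, 0.075, 0.078,
                     0.080, 0.083, 0.085, 0.088, 0.091]

# The position in the list equals the polynomial degree
best_degree = int(np.argmin(validation_errors))
print(best_degree, validation_errors[best_degree])
```

When two degrees have nearly equal errors, as in this made-up list, preferring the lower degree is a reasonable tie-break because the simpler model has fewer weights.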

**VALIDATION DATA**

In [None]:
dval = pd.read_csv('https://raw.githubusercontent.com/gibsonjackson/FODSIMG/main/Validate.csv')
# dval=(dval-dval.mean())/(dval.std())
dval = (dval-dval.min())/(dval.max()-dval.min())
m=len(dval)

yval = dval['quantitative response of LC50']
Yval = yval.tolist()
matrix_y = [0.0]*len(Yval)

xval = dval.drop(columns = 'quantitative response of LC50')

xval1 = dval['MLOGP']
Xval1 = xval1.tolist()

xval2 = dval['RDCHI']
Xval2 = xval2.tolist()

*Running gradient descent on the validation set*

In [None]:
from tabulate import tabulate
allvalTheta=[]
for i in range(10):
  valJ,valTheta = gradientLinearRegression(Xval1,Xval2,Yval,i,0.001,10000)
  allvalTheta.append(valTheta)

print(tabulate(allvalTheta))

*Finding errors for all degrees of regression models for validation set and choosing the one with the least error*



In [None]:
allvalJ=[]
for i in range(10):
  # the weights learned on the training data (allTheta) are evaluated on the validation set
  allvalJ.append(cost(Xval1,Xval2,Yval,allTheta[i],i))


validateerror = pd.DataFrame(allvalJ, columns=['Validation Error']) 
print(validateerror)



---



**TEST DATA**

In [None]:
dtest = pd.read_csv('https://raw.githubusercontent.com/gibsonjackson/FODSIMG/main/Test.csv')
# dtest=(dtest-dtest.mean())/(dtest.std())
dtest = (dtest-dtest.min())/(dtest.max()-dtest.min())

m=len(dtest)

ytest = dtest['quantitative response of LC50']
Ytest = ytest.tolist()
matrix_y = [0.0]*len(ytest)

xtest = dtest.drop(columns = 'quantitative response of LC50')

xtest1 = dtest['MLOGP']
Xtest1 = xtest1.tolist()

xtest2 = dtest['RDCHI']
Xtest2 = xtest2.tolist()

*Finding test dataset errors*

In [None]:
from tabulate import tabulate

alltestJ=[]
for i in range(10):
  # no gradient descent here: the weights learned on the training data are evaluated on the test set
  alltestJ.append(cost(Xtest1,Xtest2,Ytest,allTheta[i],i))


testerror = pd.DataFrame(alltestJ, columns=['Test Error']) 
print(testerror) 

As is visible above, the costs decrease, reach a minimum value and then increase.

**The lowest cost is obtained at degree 1, with a test error of J = 0.069304.**

Overfitting occurs at higher degrees: on the training set the cost keeps decreasing all the way up to degree 9, but on the testing (unseen) data the cost increases. This implies that the higher-degree models are not able to generalise well.