First, we randomly shuffled the given dataset and split it into training, validation and test sets of approximately 70%, 10% and 20% respectively. Each set was imported as a dataframe. We also applied min-max normalisation to the data, which gave us uniform surface plots and coefficients and made our results easier to visualise.
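The shuffle, split and normalisation steps described above could be sketched as follows. This is only an illustration: the notebook itself loads three CSV files that were already split, so the helper names `shuffle_split` and `min_max_normalise`, the seed, and the toy dataframe are all assumptions rather than the code that produced those files.

```python
import numpy as np
import pandas as pd

def shuffle_split(df, train_frac=0.7, val_frac=0.1, seed=0):
    """Shuffle the rows, then cut into train / validation / test parts (hypothetical helper)."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n = len(shuffled)
    n_train = int(round(train_frac * n))
    n_val = int(round(val_frac * n))
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]  # whatever remains (about 20%)
    return train, val, test

def min_max_normalise(df):
    """Rescale every column to [0, 1] using that column's own min and max."""
    return (df - df.min()) / (df.max() - df.min())

# Toy data standing in for the real dataset (values are made up).
demo = pd.DataFrame({'MLOGP': np.arange(100.0), 'RDCHI': np.arange(100.0)[::-1]})
train, val, test = shuffle_split(demo)
train_norm = min_max_normalise(train)
print(len(train), len(val), len(test))
```

Each split is normalised with its own minimum and maximum here, mirroring what the cell below does for the three CSV files.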

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/gibsonjackson/FODSIMG/main/TrainFinal.csv')
fd = df
df = (df-df.min())/(df.max()-df.min())
Res = df['quantitative response of LC50']
X1 = df['MLOGP']
X2 = df['RDCHI']
Y1 = Res.tolist()
X1 = X1.tolist()
X2 = X2.tolist()


dfv = pd.read_csv('https://raw.githubusercontent.com/gibsonjackson/FODSIMG/main/Validate.csv')
fdv = dfv  # keep a reference to the unnormalised validation data
dfv = (dfv-dfv.min())/(dfv.max()-dfv.min())
Resv = dfv['quantitative response of LC50']
X1v = dfv['MLOGP']
X2v = dfv['RDCHI']
Y1v = Resv.tolist()
X1v = X1v.tolist()
X2v = X2v.tolist()


dft = pd.read_csv('https://raw.githubusercontent.com/gibsonjackson/FODSIMG/main/Test.csv')
fdt = dft  # keep a reference to the unnormalised test data
dft = (dft-dft.min())/(dft.max()-dft.min())
Rest = dft['quantitative response of LC50']
X1t = dft['MLOGP']
X2t = dft['RDCHI']
Y1t = Rest.tolist()
X1t = X1t.tolist()
X2t = X2t.tolist()



*Creating a stochastic gradient descent regression model for polynomials of degrees 0-9*

Then, we created stochastic gradient descent regression models for every polynomial degree from 0 to 9 to predict the LC50 value, and checked the training error of each on the training dataset.

The theta values (weights) are stored in a matrix of size (degree+1) * (degree+1), initialised to 0. Only the entries theta[i][j] with i + j less than or equal to the degree are used.

**Hypothesis function**: Initialises a list of predicted Y values to all 0's, one per sample. For each sample, a nested loop runs over i and j for as long as their sum is less than or equal to the degree, and adds to the prediction the product of theta[i][j], X1 raised to the power i and X2 raised to the power j.
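Written out as a formula, the prediction described above for degree $d$, with $x_1$ standing for MLOGP and $x_2$ for RDCHI, is:

$$\hat{y} = \sum_{i=0}^{d} \; \sum_{j=0}^{d-i} \theta_{ij}\, x_1^{\,i}\, x_2^{\,j}$$

For example, with $d = 2$ this expands to $\theta_{00} + \theta_{01} x_2 + \theta_{02} x_2^2 + \theta_{10} x_1 + \theta_{11} x_1 x_2 + \theta_{20} x_1^2$. In general, $(d+1)(d+2)/2$ entries of the theta matrix take part in the sum.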

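**Cost function**: This function measures the error of the LC50 predictions (calculated by the hypothesis function) with respect to the target values. Because the square root is applied to each squared error before summing, the value returned is the sum of absolute errors divided by 2n, that is, half the mean absolute error, rather than a root mean square error.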

**Gradient Linear Regression function**: The algorithm runs for a fixed number of epochs (10,000 in our runs). The weights are updated from the old weights, the learning rate and the prediction error. The cost is calculated at the end of every epoch and appended to the list J, so consecutive costs can be compared to check convergence. The code below does not stop early when that difference becomes small.

The difference between stochastic and batch gradient descent is that stochastic GD takes one training instance at a time and calculates the gradient from that instance alone, whereas batch gradient descent calculates the gradient every time from the entire training set.
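To make the contrast concrete, here is a minimal sketch of the two update rules on a plain linear model. The function names `batch_gd_step` and `sgd_epoch`, the toy data and the learning rates are assumptions chosen for illustration; this is not the notebook's own training function.

```python
import numpy as np

def batch_gd_step(w, X, y, alpha):
    """One batch step: the gradient is averaged over the whole training set."""
    grad = X.T @ (X @ w - y) / len(y)
    return w - alpha * grad

def sgd_epoch(w, X, y, alpha, rng):
    """One stochastic epoch: one weight update per training instance, in random order."""
    for r in rng.permutation(len(y)):
        residual = X[r] @ w - y[r]       # prediction uses the current weights
        w = w - alpha * residual * X[r]  # gradient from this single instance
    return w

# Toy, noise-free data: y = 0.5 + 2.0 * x (made-up values).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 0.5 + 2.0 * x

w_batch = np.zeros(2)
w_sgd = np.zeros(2)
for _ in range(2000):
    w_batch = batch_gd_step(w_batch, X, y, alpha=0.5)
    w_sgd = sgd_epoch(w_sgd, X, y, alpha=0.1, rng=rng)

print(w_batch, w_sgd)
```

One point worth checking in any implementation is whether the prediction is recomputed after each per-instance update, as it is here. If it is computed only once per epoch and each update is divided by the number of samples, the per-instance updates add up to a single batch step.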

In [None]:
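def hypothesis(X1,X2, theta,degree):
  # Predict Y for every sample as the sum of theta[i][j] * X1**i * X2**j
  # over all exponent pairs with i + j <= degree.
  Pred_Y = [0.0]*len(X1)
  for y in range(len(Pred_Y)):
    for i in range(degree+1):
      for j in range(degree+1):
        if (i+j>degree):
          break  # larger j only increases i+j, so skip the rest of this row
        Pred_Y[y] = Pred_Y[y] + (theta[i][j]*pow(X1[y],i)*pow(X2[y],j))
  return Pred_Y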

In [None]:
def cost(X1,X2,Y,theta,degree):
    y1 = np.array(hypothesis(X1,X2,theta,degree))
    # sqrt of a squared error is the absolute error, so this is sum(|error|)/(2n)
    return sum(np.sqrt((y1-np.array(Y))**2))/(2*len(X1))
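A quick numerical check, using made-up predictions and targets, shows what the expression in `cost` evaluates to and how it compares with a root mean square error. The variable names and numbers here are illustrative only.

```python
import numpy as np

pred = np.array([1.0, 2.0, 3.0])    # made-up predictions
target = np.array([1.0, 4.0, 0.0])  # made-up targets

# Same expression as in cost(): the square root is taken element-wise before summing.
as_written = np.sum(np.sqrt((pred - target) ** 2)) / (2 * len(pred))

# Half the mean absolute error, and the root mean square error, for comparison.
half_mae = np.mean(np.abs(pred - target)) / 2
rmse = np.sqrt(np.mean((pred - target) ** 2))

print(as_written, half_mae, rmse)
```

The first two values coincide, while the RMSE is larger on this example because squaring before averaging weights the big errors more heavily.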

In [None]:
def stochasticLinearRegression(X1,X2,Y,alpha,epoch,degree):
  # theta[i][j] is the weight of the term X1**i * X2**j (used only when i+j<=degree)
  theta = [ [0.0]*(degree+1) for i in range(degree+1)]
  J = []  # cost after each epoch
  k = 0
  size = len(X1)
  while k<epoch:
    # Predictions are computed once per epoch with the current weights
    # and reused for every per-instance update below.
    y_pred = hypothesis(X1,X2,theta,degree)
    for r in range(size):
      for i in range(degree+1):
        for j in range(degree+1):
          if (i+j>degree):
            break
          theta[i][j]=theta[i][j]-alpha*((y_pred[r]-Y[r])*(X1[r]**i)*(X2[r]**j))/size

    k+=1
    J.append(cost(X1,X2,Y,theta,degree))
  return J,theta



---



**TRAINING DATA**


*Plotting surface plots of predicted polynomials and calculating training error*

Next, we calculate the training error for all 10 degrees of our models. Simultaneously, we plot the calculated polynomials for each degree. 

allTheta is a list holding the weight matrices of all 10 models; it is displayed below the plots.

allJ is a list holding the cost of each of the 10 models over all entries in the training data, calculated after gradient descent polynomial regression. It is tabulated below the theta values.

In [None]:
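def diagram(X1,X2,Theta, degree):
  # Plot the fitted polynomial surface on a fixed grid; the X1 and X2
  # arguments are replaced by the grid axes below.
  X1=np.arange(-1,1,0.01)
  X2=np.arange(-1,1,0.01)
  fig = plt.figure()
  # fig.gca(projection='3d') is no longer accepted by recent Matplotlib releases
  ax = fig.add_subplot(projection='3d')
  X,Y=np.meshgrid(X1,X2)
  F=np.zeros_like(X)
  # Sum theta[i][j] * X**i * Y**j over all exponent pairs with i+j<=degree
  for i in range(degree+1):
    for j in range(degree+1):
      if (i+j>degree):
        break
      F = F + Theta[i][j]*(X**i)*(Y**j)

  ax.set_xlabel('MLOGP')
  ax.set_ylabel('RDCHI')
  ax.set_zlabel('Quantitative response of LC50')
  ax.plot_surface(X,Y,F)
  plt.show()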

In [None]:
import matplotlib.pyplot as plt
from tabulate import tabulate
from matplotlib import cm
from mpl_toolkits import mplot3d
%matplotlib inline

allTheta=[]
for i in range(10):
  print(i)
  J,Theta = stochasticLinearRegression(X1,X2,Y1,0.001,10000,i)
  diagram(X1,X2,Theta,i)
  allTheta.append(Theta)

print(tabulate(allTheta))

allJ=[]
for i in range(10):
  allJ.append(cost(X1,X2,Y1,allTheta[i],i))


trainerror = pd.DataFrame(allJ, columns=['Training Error']) 
print(trainerror) 




---




As we can clearly see above, the training error consistently decreases as the degree of the polynomial increases. This indicates **overfitting** at higher degrees, where the model fits the training data almost perfectly. We examine the effect of this on the test data below.

**VALIDATION DATA**

*Running gradient descent on the validation set*

Next, we run gradient descent on the validation set and calculate the cost for each degree, following the same process as above. The polynomial degree with the lowest validation error is, in this case, degree 2.

We then evaluate the weights learned on the training set against the test data, without running gradient descent again, and finalise the degree with the lowest cost.

In [None]:
from tabulate import tabulate
allThetav=[]
Jval = []
for i in range(10):
  Jv,Thetav = stochasticLinearRegression(X1v,X2v,Y1v,0.001,10000,i)
  Jval.append(Jv[-1])  # cost after the final epoch
  allThetav.append(Thetav)

print(tabulate(allThetav))

*Finding errors for all degrees of regression models for validation set and choosing the one with the least error*

In [None]:
allvalJ=[]
for i in range(10):
  allvalJ.append(cost(X1v,X2v,Y1v,allThetav[i],i))


validateerror = pd.DataFrame(allvalJ, columns=['Validation Error']) 
print(validateerror)
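Picking the degree with the least error from a table like the one printed above can be done with `idxmin`. The sketch below uses placeholder error values, not the notebook's results, and the names `example_errors` and `best_degree` are assumptions.

```python
import pandas as pd

# Placeholder values for illustration only; the row index is the polynomial degree.
example_errors = pd.DataFrame({'Validation Error': [0.30, 0.12, 0.08, 0.09, 0.11]})

# idxmin returns the index label (here, the degree) of the smallest error.
best_degree = int(example_errors['Validation Error'].idxmin())
best_error = example_errors['Validation Error'].min()
print(best_degree, best_error)
```

Applied to `validateerror`, the same call would return the degree whose row holds the smallest validation error.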



---



**TEST DATA**

*Finding test dataset errors*

In [None]:
#Evaluate the models trained on the training set against the test dataset
from tabulate import tabulate
allThetat=[]
Jtest = []
for i in range(10):
  J = cost(X1t,X2t,Y1t,allTheta[i],i)
  Jtest.append(J)


testerror = pd.DataFrame(Jtest, columns=['Test Error']) 
print(testerror) 

As is clearly visible, the cost decreases, reaches a minimum value and then increases.

**The lowest cost is obtained at degree 1, with error J = 0.06930381901223667.**

Overfitting occurs at higher degrees: on the training set the cost keeps decreasing all the way to degree 9, but on the test (unseen) data the cost increases. This implies that the higher-degree models do not generalise well.