### Q.1 Generate points from the model y = ax+b+ε, where ε is standard Gaussian noise and x is uniformly distributed on [0,10]. Train linear regression models with polynomial features of degree 2, 5, and 10. Study the out-of-sample performance of each model, and compare the results as the training dataset size changes.

In [97]:
# Importing libraries and important modules
import math
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import r2_score
from sklearn.metrics import classification_report

In [98]:
# Setting the number of data points
N=100000

In [99]:
# Arrays to store the generated input x and epsilon values for training
x = np.random.uniform(0, 10, size=N)  # x: uniform random variable on [0, 10)
e = np.random.normal(0, 1, size=N)    # epsilon: standard Gaussian noise

In [100]:
# Defining the MSE Loss Function
def loss(y,y_pred):
    loss=np.mean((y_pred-y)**2)
    return loss

In [101]:
# Calculating gradient of loss w.r.t parameters (weights and bias).
def gradients(X, y, y_pred):
    # X: Input
    # y: True value
    # y_pred: Predicted value from the hypothesis
    # m: number of training examples in X
    
    m = X.shape[0]
    # Gradient w.r.t. weights. The factor of 2 from differentiating the squared
    # error is omitted, so this is the gradient of half the MSE; the missing
    # factor only rescales the learning rate.
    dw = (1/m)*np.dot(X.T, (y_pred - y))
    # Gradient w.r.t. bias (same convention).
    db = (1/m)*np.sum((y_pred - y)) 
    return dw, db
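
A quick way to sanity-check gradient formulas like these is to compare them with central finite differences of the loss. The sketch below does that on a small random problem; the helper `numerical_grad_w` and the `*_demo` data are illustrative assumptions and not part of the notebook. Because the MSE is differentiated without its factor of 2 in `gradients`, the numerical gradient is compared against `2 * dw`.

```python
import numpy as np

def loss(y, y_pred):
    # Mean squared error, as defined in the notebook
    return np.mean((y_pred - y) ** 2)

def gradients(X, y, y_pred):
    # Same formulas as the notebook (no factor of 2)
    m = X.shape[0]
    dw = (1 / m) * np.dot(X.T, (y_pred - y))
    db = (1 / m) * np.sum(y_pred - y)
    return dw, db

def numerical_grad_w(X, y, w, b, h=1e-6):
    # Hypothetical helper: central finite differences of the loss w.r.t. each weight
    grad = np.zeros_like(w)
    for k in range(w.shape[0]):
        step = np.zeros_like(w)
        step[k, 0] = h
        loss_plus = loss(y, X @ (w + step) + b)
        loss_minus = loss(y, X @ (w - step) + b)
        grad[k, 0] = (loss_plus - loss_minus) / (2 * h)
    return grad

# Small random problem used only for the check
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = rng.normal(size=(50, 1))
w_demo = rng.normal(size=(2, 1))
b_demo = 0.3

dw_demo, db_demo = gradients(X_demo, y_demo, X_demo @ w_demo + b_demo)
num_dw = numerical_grad_w(X_demo, y_demo, w_demo, b_demo)

# The finite-difference gradient of the MSE should equal 2 * dw
grad_matches = np.allclose(num_dw, 2 * dw_demo, rtol=1e-4, atol=1e-6)
print("Analytic and numerical gradients agree:", grad_matches)
```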

In [102]:
# Adding Features (Polynomials of different degrees)
def x_transform(X, d):
    # X: Input
    # d: List of exponents that act on X
    t = X.copy()
    # Appending columns of higher degrees to X
    for i in d:
        X = np.append(X, t**i, axis=1)        
    return X
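
To make the column layout produced by `x_transform` concrete, here is a small usage sketch on a two-row input. The `x_demo` values are arbitrary, and the `np.hstack` variant is only an assumed equivalent shown for comparison; it is not used in the notebook.

```python
import numpy as np

def x_transform(X, d):
    # Same function as in the notebook: append X**i for every exponent i in d
    t = X.copy()
    for i in d:
        X = np.append(X, t ** i, axis=1)
    return X

x_demo = np.array([[2.0], [3.0]])
features = x_transform(x_demo, [3, 2])
# Columns follow the order of d: [x, x**3, x**2]
print(features)

# A vectorized equivalent that relies on broadcasting the exponents across columns
features_vec = np.hstack([x_demo, x_demo ** np.array([3, 2])])
```

The first column is always the original x, so a list such as d=[5,4,3,2] yields five feature columns in total.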

In [103]:
# Generation of the Training Dataset
def x_input(x,d):
    # Let a=0.5 and b=0.25
    a=0.5
    b=0.25
    # Transforming x to higher degrees
    x_matrix = np.array(x)
    x_matrix = x_matrix.reshape(N,1)
    # d=[2] for Polynomial of Degree 2, d=[5,4,3,2] for Polynomial of Degree 5, d=[10,9,8,7,6,5,4,3,2] for Polynomial of Degree 10
    x_t = x_transform(x_matrix,d) 
    # Generating training targets as y = a*(sum of the polynomial features of x) + b + ε.
    # This equals y = ax+b+ε only if no higher powers are appended to x.
    y_matrix = a*x_t.sum(axis=1) + b + np.array(e)
    return x_t,y_matrix


In [104]:
# Function for training the Linear Regression model with Mini-Batch Gradient Descent
def train(X, y, Batch , d, epochs, lr):
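    # X: Input, already expanded into polynomial features by x_transform
    # y: True value
    # Batch: Batch Size
    # epochs: Number of passes over the training data
    # d: List of exponents used to build X (kept for reference; not used here)
    # lr: Learning rate
    # m: number of training examples
    # n: number of features 
    
    # X already contains the polynomial features, so it is used as-is
    x_train = X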
    m = x_train.shape[0]
    n = x_train.shape[1]
    # Initializing weights and bias to zeros
    w = np.zeros((n,1))
    b = 0
    # Reshape y
    y = y.reshape(m,1)
    # List to store losses
    losses = []
    # Training loop using Mini-Batch Gradient Descent
    for epoch in range(epochs):
        for i in range((m-1)//Batch + 1):
            # Defining batches
            start_i = i*Batch
            end_i = start_i + Batch
            xb = x_train[start_i:end_i]
            yb = y[start_i:end_i]
            # Calculating hypothesis
            y_pred = np.dot(xb, w) + b
            # Getting the gradients of loss w.r.t parameters.
            dw, db = gradients(xb, yb, y_pred)
            # Updating the parameters.
            w -= lr*dw
            b -= lr*db
        
        # Calculating loss and adding it in the list 
        l = loss(y, np.dot(x_train, w) + b)
        losses.append(l)
        
    # Returning the weights, the bias and the list of per-epoch losses
    return w, b, losses
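
The learning rates used with this training loop shrink by many orders of magnitude as the polynomial degree grows, because raw columns such as x**10 reach values near 1e10. One common remedy, which the notebook does not use, is to standardize each feature column with statistics computed on the training data. The sketch below illustrates the idea; `fit_scaler` and `apply_scaler` are hypothetical helper names.

```python
import numpy as np

def fit_scaler(X_train):
    # Hypothetical helper: per-column mean and standard deviation of the training data
    mean = X_train.mean(axis=0, keepdims=True)
    std = X_train.std(axis=0, keepdims=True)
    return mean, std

def apply_scaler(X, mean, std):
    # Reuse the training statistics for both training and test inputs
    return (X - mean) / std

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=(1000, 1))

# Raw polynomial features x, x**2, ..., x**10 differ hugely in magnitude
X_raw = np.hstack([x_demo ** p for p in range(1, 11)])
mean, std = fit_scaler(X_raw)
X_scaled = apply_scaler(X_raw, mean, std)

# Ratio between the largest and smallest column maxima before scaling
raw_spread = X_raw.max(axis=0).max() / X_raw.max(axis=0).min()
# After scaling, every column has unit standard deviation
scaled_std = X_scaled.std(axis=0)
print(f"raw spread: {raw_spread:.2e}")
```

With standardized columns a single moderate learning rate may work for all three degrees, although the high-degree columns remain strongly correlated, so convergence can still be slow.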

In [105]:
# Prediction on x with the estimated w,b values
def predict(X, w, b, d):
    # X: Input
    # w: weights (parameter)
    # b: bias (parameter)
    # d: List of exponents that act on X
    
    # Returning predictions of Linear Regression Model
    return np.dot(X, w) + b

In [106]:
# Arrays to store the input x and epsilon values for testing (Out of Sample Performance)
x_test = np.random.uniform(0, 10, size=N)  # x: uniform random variable on [0, 10)
e_test = np.random.normal(0, 1, size=N)    # epsilon: standard Gaussian noise

In [107]:
# Generation of the Testing Dataset
def x_input_test(x,d):
    # Let a=0.5 and b=0.25
    a=0.5
    b=0.25
    # Transforming x to higher degrees
    x_matrix_test = np.array(x)  # use the function argument rather than the global x_test
    x_matrix_test = x_matrix_test.reshape(N,1)
    # d=[2] for Polynomial of Degree 2, d=[5,4,3,2] for Polynomial of Degree 5, d=[10,9,8,7,6,5,4,3,2] for Polynomial of Degree 10
    x_t_test = x_transform(x_matrix_test,d) 
    # Generating test targets as y = a*(sum of the polynomial features of x) + b + ε,
    # using the test noise e_test (the training noise e must not be reused here).
    y_matrix_test = a*x_t_test.sum(axis=1) + b + np.array(e_test)
    return x_t_test,y_matrix_test

In [108]:
# Generating Dataset for Training for Polynomial of Degree=2
x_t_2, y_matrix_2=x_input(x,d=[2])

In [109]:
# Training with the Generated Dataset when Degree of the Polynomial=2
w_2, b_2, l_2 = train(x_t_2, y_matrix_2, Batch=100, d=[2] , epochs=100, lr=0.0001)

In [110]:
# Predicting with the Generated Dataset when Degree of the Polynomial=2
y_pred_2 = predict(x_t_2,w_2,b_2,d=[2])

In [111]:
# Generating Dataset for Testing for Polynomial of Degree=2
x_t_2_test, y_matrix_2_test=x_input_test(x_test,d=[2])

In [112]:
# Performance on the Training Dataset for Polynomial of Degree=2
r2score_2 = r2_score(y_matrix_2,y_pred_2)
r2score_2

0.9963127353529011

In [113]:
# Study of Out of Sample Performance when Degree of the Polynomial=2
y_pred_2_test=predict(x_t_2_test,w_2,b_2,d=[2])

In [114]:
# Performance on the Testing Dataset for Polynomial of Degree=2
r2score_2_test = r2_score(y_matrix_2_test,y_pred_2_test)
r2score_2_test

0.9962972081402133

In [115]:
# Generating Dataset for Training for Polynomial of Degree=5
x_t_5, y_matrix_5=x_input(x,d=[5,4,3,2])

In [116]:
# Training with the Generated Dataset when Degree of the Polynomial=5
w_5, b_5, l_5 = train(x_t_5,y_matrix_5, Batch=100, d=[5, 4, 3, 2] , epochs=100, lr=0.0000000001)

In [117]:
# Predicting with the Generated Dataset when Degree of the Polynomial=5
y_pred_5 = predict(x_t_5,w_5,b_5,d=[5, 4, 3, 2])

In [118]:
# Generating Dataset for Testing for Polynomial of Degree=5
x_t_5_test, y_matrix_5_test=x_input_test(x_test,d=[5,4,3,2])

In [119]:
# Performance on the Training Dataset for Polynomial of Degree=5
r2score_5 = r2_score(y_matrix_5,y_pred_5)
r2score_5

0.9999809588736166

In [120]:
# Study of Out of Sample Performance when Degree of the Polynomial=5
y_pred_5_test=predict(x_t_5_test,w_5,b_5,d=[5, 4, 3, 2])

In [121]:
# Performance on the Testing Dataset for Polynomial of Degree=5
r2score_5_test = r2_score(y_matrix_5_test,y_pred_5_test)
r2score_5_test

0.9999809685253129

In [122]:
# Generating Dataset for Training for Polynomial of Degree=10
x_t_10, y_matrix_10=x_input(x,d=[10,9,8,7,6,5,4,3,2])

In [123]:
# Training with the Generated Dataset when Degree of the Polynomial=10
w_10, b_10, l_10 = train(x_t_10, y_matrix_10, Batch=100, d=[10, 9, 8, 7, 6, 5, 4, 3, 2] , epochs=100, lr=0.0000000000000000001)

In [124]:
# Predicting with the Generated Dataset when Degree of the Polynomial=10
y_pred_10 = predict(x_t_10,w_10,b_10,d=[10, 9, 8, 7, 6, 5, 4, 3, 2])

In [125]:
# Generating Dataset for Testing for Polynomial of Degree=10
x_t_10_test, y_matrix_10_test=x_input_test(x_test,d=[10,9,8,7,6,5,4,3,2])

In [126]:
# Performance on the Training Dataset for Polynomial of Degree=10
r2score_10 = r2_score(y_matrix_10,y_pred_10)
r2score_10

0.9999977153083135

In [127]:
# Study of Out of Sample Performance when Degree of the Polynomial=10
y_pred_10_test=predict(x_t_10_test,w_10,b_10,d=[10, 9, 8, 7, 6, 5, 4, 3, 2])

In [128]:
# Performance on the Testing Dataset for Polynomial of Degree=10
r2score_10_test = r2_score(y_matrix_10_test,y_pred_10_test)
r2score_10_test

0.9999977154300072

### Observations:

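1. The input x is expanded into polynomial features by passing x and a list of exponents to the function x_transform. The target y is then generated from the transformed input as a(x + x^p1 + x^p2 + ...) + b + ε, where p1, p2, ... are the listed exponents, so y coincides with ax+b+ε only when no extra powers are added.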
2. The Linear Regression Model is trained on the generated, transformed dataset using Mini-Batch gradient descent for each of the three polynomial degrees. The weights and bias returned by training are then used to study the out-of-sample performance.
3. The Out-of-Sample Performance for each of the three polynomial degrees is studied by predicting on a separately generated test set with these weights and bias.
4. The metric used to compare the in-sample fit and the out-of-sample performance of the regression model is the R² score (r2_score).
5. For a Dataset size of 100000, the R² score is 0.996 for Degree=2 (lr=1e-4), 0.999 for Degree=5 (lr=1e-10), and 0.999 for Degree=10 (lr=1e-19).
6. For a Dataset size of 10000, the R² score is 0.996 for Degree=2 (lr=1e-4), 0.999 for Degree=5 (lr=1e-10), and 0.999 for Degree=10 (lr=1e-19).
7. For a Dataset size of 1000, the R² score is 0.994 for Degree=2 (lr=1e-4), 0.999 for Degree=5 (lr=1e-10), and 0.999 for Degree=10 (lr=1e-19).
8. For a Dataset size of 100, the R² score is 0.994 for Degree=2 (lr=1e-4), 0.999 for Degree=5 (lr=1e-10), and 0.999 for Degree=10 (lr=1e-19).
9. Across these dataset sizes the reported R² score is essentially unchanged: only the Degree=2 model drops slightly, from 0.996 to 0.994, once the dataset has 1000 points or fewer.
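
The comparison across dataset sizes can also be scripted as a loop instead of re-running the notebook by hand. The sketch below is a deliberately different setup from the cells above, under stated assumptions: it generates y exactly as ax+b+ε (with the notebook's a=0.5 and b=0.25), fits each polynomial by closed-form least squares with `np.polyfit` instead of Mini-Batch gradient descent, and computes R² by hand with the usual 1 - SS_res/SS_tot definition. The helper names `make_data`, `r2` and `out_of_sample_r2` are hypothetical, and the printed scores depend on the random seed.

```python
import numpy as np

A_TRUE, B_TRUE = 0.5, 0.25  # same a and b as in the notebook

def make_data(n, rng):
    # y = a*x + b + eps, with x ~ U[0, 10] and eps ~ N(0, 1)
    x = rng.uniform(0, 10, size=n)
    y = A_TRUE * x + B_TRUE + rng.normal(0, 1, size=n)
    return x, y

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def out_of_sample_r2(n_train, degree, rng, n_test=10_000):
    # Fit a polynomial by least squares, then score it on freshly drawn data
    x_train, y_train = make_data(n_train, rng)
    x_test, y_test = make_data(n_test, rng)
    # Rescale x to [0, 1] to keep the high-degree fit well conditioned
    coeffs = np.polyfit(x_train / 10.0, y_train, degree)
    y_pred = np.polyval(coeffs, x_test / 10.0)
    return r2(y_test, y_pred)

rng = np.random.default_rng(0)
results = {}
for n_train in (100, 1000, 10_000):
    for degree in (2, 5, 10):
        results[(n_train, degree)] = out_of_sample_r2(n_train, degree, rng)

for (n_train, degree), score in results.items():
    print(f"n_train={n_train:>6}  degree={degree:>2}  test R^2={score:.4f}")
```

Under this data-generating process the signal variance is a²·Var(x) = 0.25·(100/12) ≈ 2.08 and the noise variance is 1, so no model can be expected to score much above 2.08/3.08 ≈ 0.68 out of sample. The much higher scores reported above arise because the notebook's targets also contain the higher powers of x, whose variance dwarfs the noise.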

### References:

1. https://towardsdatascience.com/polynomial-regression-in-python-b69ab7df6105