# Argonne Leadership Computing Facility Artificial Intelligence Training Program
## Cesar Francisco Quinones-Martinez (cquinones24)
Session 1: Intro to Artificial Intelligence on Supercomputers, 2024-02-09

In this notebook, I apply Dr. Huihuo Zheng's material on linear regression to the homework assignment. I first run the iterative version of linear regression to verify that it works in this copy, then make the changes the homework asks for.

In [2]:
# Importing the relevant libraries and necessary real estate data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipydis
import time

! [ -e ./slimmed_realestate_data.csv ] || wget https://raw.githubusercontent.com/argonne-lcf/ai-science-training-series/main/01_intro_AI_on_Supercomputer/slimmed_realestate_data.csv
data = pd.read_csv('slimmed_realestate_data.csv')
print(data.columns)

Index(['Unnamed: 0', 'SalePrice', 'GrLivArea'], dtype='object')


We load the data and compute the closed-form linear regression fit.

In [None]:
data = pd.read_csv('slimmed_realestate_data.csv')
data_x = data['GrLivArea'].to_numpy()
data_y = data['SalePrice'].to_numpy()

# Linear Regression by theory:

n = len(data)
sum_xy = np.sum(data_x*data_y)
sum_x = np.sum(data_x)
sum_y = np.sum(data_y)
sum_x2 = np.sum(data_x*data_x)
denominator = n * sum_x2 - sum_x * sum_x

m_calc = (n * sum_xy - sum_x * sum_y) / denominator
b_calc = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
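
As a sanity check on the summation formulas above, the following minimal sketch compares them against NumPy's `np.polyfit` on synthetic data. The helper name `closed_form_fit` and the synthetic values are assumptions made for illustration; they are not part of the homework or the real-estate file.

```python
import numpy as np

def closed_form_fit(x, y):
    """Least-squares slope and intercept from the summation formulas."""
    n = len(x)
    sum_x, sum_y = np.sum(x), np.sum(y)
    sum_xy, sum_x2 = np.sum(x * y), np.sum(x * x)
    denominator = n * sum_x2 - sum_x * sum_x
    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    return m, b

# Synthetic stand-in data (an assumption; not the real-estate file).
rng = np.random.default_rng(0)
x = rng.uniform(500.0, 3000.0, size=200)
y = 90.0 * x + 30000.0 + rng.normal(0.0, 20000.0, size=200)

m_check, b_check = closed_form_fit(x, y)
m_ref, b_ref = np.polyfit(x, y, 1)  # degree-1 fit returns [slope, intercept]

# Both comparisons should report agreement between the two methods.
print(np.isclose(m_check, m_ref), np.isclose(b_check, b_ref))
```

If the two methods agree here, the same formulas applied to `data_x` and `data_y` give the reference values `m_calc` and `b_calc` used later for comparison.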

Defining the necessary functions:

In [None]:
def model(x,m,b):
    return m * x + b

def loss(x,y,m,b):
    y_predicted = model(x,m,b)
    return np.power(y - y_predicted,2)

def updated_m(x,y,m,b,learning_rate):
    dL_dm = - 2 * x * (y - model(x,m,b))
    dL_dm = np.mean(dL_dm)
    return m - learning_rate * dL_dm

def updated_b(x,y,m,b,learning_rate):
    dL_db = - 2 * (y - model(x,m,b))
    dL_db = np.mean(dL_db)
    return b - learning_rate * dL_db

def plot_data(x,y,m,b,plt = plt):
    # plot our data points with 'bo' = blue circles
    plt.plot(x,y,'bo')
    # create the line based on our linear fit
    # first we need to make x points
    # the 'arange' function generates points between two limits (min,max)
    linear_x = np.arange(x.min(),x.max())
    # now we use our fit parameters to calculate the y points based on our x points
    linear_y = linear_x * m + b
    # plot the linear points using 'r-' = red line
    plt.plot(linear_x,linear_y,'r-',label='fit')
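
The update functions above rely on the analytic derivatives of the mean squared loss with respect to `m` and `b`. One way to gain confidence in those expressions is a finite-difference check, sketched below. The names `mean_loss`, `grad_m`, and `grad_b` and the four data points are hypothetical, introduced only for this check.

```python
import numpy as np

def model(x, m, b):
    return m * x + b

def mean_loss(x, y, m, b):
    # Mean squared error, i.e. the mean of the per-point squared residuals.
    return np.mean((y - model(x, m, b)) ** 2)

def grad_m(x, y, m, b):
    # Same expression as dL_dm in updated_m.
    return np.mean(-2 * x * (y - model(x, m, b)))

def grad_b(x, y, m, b):
    # Same expression as dL_db in updated_b.
    return np.mean(-2 * (y - model(x, m, b)))

# Small made-up data set (hypothetical values, for checking only).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
m, b, eps = 0.5, 0.1, 1e-6

# Central finite differences approximate the partial derivatives.
num_dm = (mean_loss(x, y, m + eps, b) - mean_loss(x, y, m - eps, b)) / (2 * eps)
num_db = (mean_loss(x, y, m, b + eps) - mean_loss(x, y, m, b - eps)) / (2 * eps)

print(np.isclose(num_dm, grad_m(x, y, m, b)), np.isclose(num_db, grad_b(x, y, m, b)))
```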

Using the Stochastic Gradient Descent (SGD) Method:

In [None]:
# Initial guess:
m = 5.
b = 1000.
batch_scale = 1
learning_m_scale = 1
learning_b_scale = 1
# Learning rates:
learning_rate_m = 1e-7*learning_m_scale
learning_rate_b = 1e-1*learning_b_scale

loss_history = []
batch_size = 64*batch_scale

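# Cap the batch size at the number of rows before sampling;
# data.sample raises an error if asked for more rows than exist.
if batch_size > n:
    batch_size = len(data)

data_batch = data.sample(batch_size)
data_x = data_batch['GrLivArea'].to_numpy()
data_y = data_batch['SalePrice'].to_numpy()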

loop_N = 30*len(data)//batch_size

for i in range(loop_N):
    # update our slope and intercept based on the current values
    m = updated_m(data_x,data_y,m,b,learning_rate_m)
    b = updated_b(data_x,data_y,m,b,learning_rate_b)
    
    loss_value = np.mean(loss(data_x,data_y,m,b))
    loss_history.append(loss_value)
    print('[%03d]  y_i = %.2f * x + %.2f     previously calculated: y_i = %.2f * x + %.2f    loss: %f' % (i,m,b,m_calc,b_calc,loss_value))
    
    plt.close('all')
    
    fig,ax = plt.subplots(1,2,figsize=(18,6),dpi=80)
    # plot our usual output
    plot_data(data_x,data_y,m,b,ax[0])
    # here we also plot the calculated linear fit for comparison
    line_x = np.arange(data_x.min(),data_x.max())
    line_y = line_x * m_calc + b_calc
    ax[0].plot(line_x,line_y,'b-',label='calculated')
    # add a legend to the plot and x/y labels
    ax[0].legend()
    ax[0].set_xlabel('square footage')
    ax[0].set_ylabel('sale price')
    
    # plot the loss 
    loss_x = np.arange(0,len(loss_history))
    loss_y = np.asarray(loss_history)
    ax[1].plot(loss_x,loss_y, 'o-')
    ax[1].set_yscale('log')
    ax[1].set_xlabel('loop step')
    ax[1].set_ylabel('loss')
    plt.show()
    # gives us time to see the plot
    time.sleep(2.5)
    # clears the plot when the next plot is ready to show.
    ipydis.clear_output(wait=True)

The line data_batch = data.sample(batch_size) draws a random subset of batch_size rows from the data to train on. The batch_scale variable multiplies the base batch size of 64. For batch_scale values above 8, the requested batch exceeds the total number of data points, which would also distort the loop count, so the if statement caps the batch size at the size of the data set.
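
The sampling and capping described above can be wrapped in a small helper, sketched below. The function name `sample_batch` and the stand-in DataFrame are hypothetical; only the column names and the use of `DataFrame.sample` come from the notebook. The cap is applied before sampling because pandas raises an error when asked for more rows than exist without replacement.

```python
import numpy as np
import pandas as pd

def sample_batch(frame, batch_size, seed=None):
    """Return x and y arrays for one random mini-batch, capped at the data size."""
    batch_size = min(batch_size, len(frame))  # cap before sampling
    batch = frame.sample(batch_size, random_state=seed)
    return batch['GrLivArea'].to_numpy(), batch['SalePrice'].to_numpy()

# Hypothetical stand-in frame with the same column names as the CSV.
rng = np.random.default_rng(1)
area = rng.uniform(500.0, 3000.0, size=100)
demo = pd.DataFrame({'GrLivArea': area, 'SalePrice': 100.0 * area + 20000.0})

x_small, y_small = sample_batch(demo, 64, seed=0)
x_all, y_all = sample_batch(demo, 64 * 9, seed=0)  # request exceeds 100 rows
print(len(x_small), len(x_all))
```

Calling such a helper inside the training loop would draw a fresh batch at every step, which is the usual form of SGD; the cell above instead samples one batch and reuses it for all steps.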

Depending on the batch size, the SGD fit can be less accurate than the closed-form linear regression fit, because it is based on fewer data points that may be concentrated on one side of the distribution. However, this shows that with a suitable batch size you can recover similar trends in the data without modeling every data point, which saves compute time.

In [None]:
'''
batch_scale = 1
learning_m_scale = 1
learning_b_scale = 1
'''
# These scale factors adjust the batch size and the two learning rates.
# The data has about 551 points, so batch_scale values above 8 use all points.

For equal scale values, the SGD method diverges from the fit at a value of 8. With learning_b_scale >= 14, the SGD fit diverges when using all data points: it bounces around the calculated fit and the loss increases as it keeps bouncing. Similar behavior occurs with learning_m_scale >= 5, where the fit diverges completely.
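
The bouncing behavior can be reproduced in a simplified setting. The sketch below fits only an intercept `b` to made-up targets, so each step multiplies the error in `b` by `(1 - 2 * learning_rate)`; the iteration therefore overshoots and grows once the learning rate exceeds 1. This is an intercept-only simplification with hypothetical names and data; the thresholds in the two-parameter fit above depend on the data and on both learning rates, so they will differ.

```python
import numpy as np

def fit_intercept(y, learning_rate, steps=20, b0=0.0):
    """Gradient descent on mean((y - b)**2); returns the loss after each step."""
    b = b0
    history = []
    for _ in range(steps):
        dL_db = np.mean(-2 * (y - b))
        b = b - learning_rate * dL_db
        history.append(np.mean((y - b) ** 2))
    return history

# Made-up targets (hypothetical values).
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

stable = fit_intercept(y, learning_rate=0.1)    # error shrinks each step
unstable = fit_intercept(y, learning_rate=1.4)  # error flips sign and grows

print(stable[-1] < stable[0], unstable[-1] > unstable[0])
```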