### MAI ASSIGNMENT 2 - LINEAR REGRESSION

Name : Dario Prawara Teh Wei Rong &emsp; Class : DAAA / FT / 2A / 04 &emsp; Admin Number : 2201858

Objective of Assignment : Given a set of data points with at least one predictor and one continuous response variable, construct a linear model to predict the response. This is the aim of Linear Regression, a supervised learning technique.

Context Given : This assignment uses the transacted resale prices of 4-Room HDB flats in Choa Chu Kang from January to June 2023.

Variables Provided :
- Remaining Lease (years) : HDB flats are sold on 99-year leases
- Resale Price : Transacted price of the resale HDB flat

**The response variable is Resale Price, and the predictor is Remaining Lease.**

#### IMPORTING THE RELEVANT MODULES AND THE DATASET

In [1]:
# Importing relevant modules and libraries
import pandas as pd
import numpy as np
import sympy as sp
import matplotlib.pyplot as plt
from itertools import permutations
from sklearn.preprocessing import StandardScaler

# Importing the dataset
df = pd.read_excel('5. ChoaChuKang Resale transactions 4_room Jan June_2023.xlsx')

#### OBTAIN AN OVERVIEW OF THE DATASET

In [2]:
# Retrieve the first 5 rows
df.head()

Unnamed: 0,S/No,Block,Street Name,Storey,Floor Area (sqm) /,Remaining Lease (years),Resale Price,Resale Registration Date,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,1,818A,Choa Chu Kang Ave 1,01 to 03,92,93,525000,2023-06-01,,Source:,https://services2.hdb.gov.sg/webapp/BB33RTIS/
1,2,249,Choa Chu Kang Ave 2,04 to 06,104,69,485000,2023-06-01,,,
2,3,289,Choa Chu Kang Ave 3,04 to 06,104,69,445000,2023-06-01,,,
3,4,429,Choa Chu Kang Ave 4,10 to 12,104,69,450000,2023-06-01,,,
4,5,442,Choa Chu Kang Ave 4,13 to 15,91,72,448888,2023-06-01,,,


#### WRANGLING OF THE DATASET

For this dataset, wrangling and pre-processing were needed because it contained unnecessary columns and a column whose name needed correcting.

The columns 'Unnamed: 8', 'Unnamed: 9' and 'Unnamed: 10' were dropped because they are empty apart from a single source note, so they add nothing to the analysis.

For 'Floor Area (sqm) /', the column was renamed to remove the trailing slash (/), and its values were passed through pd.to_numeric to ensure a numeric datatype. Since df.info() already reports this column as int64, the conversion acts as a safeguard.

In [3]:
# Printing information on the dataset
df.info()

# Convert Floor Area (sqm) to numeric datatype and modify column name to remove the slash
df.rename(columns={'Floor Area (sqm) /': 'Floor Area (sqm)'}, inplace=True)
df['Floor Area (sqm)'] = pd.to_numeric(df['Floor Area (sqm)'])

# Drop unnecessary columns from the dataset
df = df.drop(columns=["Unnamed: 8","Unnamed: 9", "Unnamed: 10"])

# Viewing the updated dataset
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294 entries, 0 to 293
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   S/No                      294 non-null    int64         
 1   Block                     294 non-null    object        
 2   Street Name               294 non-null    object        
 3   Storey                    294 non-null    object        
 4   Floor Area (sqm) /        294 non-null    int64         
 5   Remaining Lease (years)   294 non-null    int64         
 6   Resale Price              294 non-null    int64         
 7   Resale Registration Date  294 non-null    datetime64[ns]
 8   Unnamed: 8                0 non-null      float64       
 9   Unnamed: 9                1 non-null      object        
 10  Unnamed: 10               1 non-null      object        
dtypes: datetime64[ns](1), float64(1), int64(4), object(5)
memory usage: 25.4+ KB


Unnamed: 0,S/No,Block,Street Name,Storey,Floor Area (sqm),Remaining Lease (years),Resale Price,Resale Registration Date
0,1,818A,Choa Chu Kang Ave 1,01 to 03,92,93,525000,2023-06-01
1,2,249,Choa Chu Kang Ave 2,04 to 06,104,69,485000,2023-06-01
2,3,289,Choa Chu Kang Ave 3,04 to 06,104,69,445000,2023-06-01
3,4,429,Choa Chu Kang Ave 4,10 to 12,104,69,450000,2023-06-01
4,5,442,Choa Chu Kang Ave 4,13 to 15,91,72,448888,2023-06-01
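
The clean-up above names each column by hand. A more general version can be sketched as follows, assuming that every column whose name starts with "Unnamed" is spreadsheet debris and that stray trailing " /" characters should be stripped from column names. The frame `raw` and the helper `tidy_columns` are hypothetical and not part of the original notebook; check the assumption against `df.info()` before applying it to real data.

```python
import pandas as pd

# Hypothetical miniature frame mimicking the raw spreadsheet's layout
raw = pd.DataFrame({
    "S/No": [1, 2],
    "Floor Area (sqm) /": [92, 104],
    "Resale Price": [525000, 485000],
    "Unnamed: 8": [None, None],
    "Unnamed: 9": ["Source:", None],
})

def tidy_columns(frame):
    """Drop 'Unnamed...' columns and strip trailing spaces/slashes from names."""
    # Assumption: any column whose name starts with 'Unnamed' is not needed
    keep = [c for c in frame.columns if not str(c).startswith("Unnamed")]
    # rstrip(" /") only removes trailing characters, so 'S/No' is left intact
    return frame[keep].rename(columns=lambda c: c.rstrip(" /"))

clean = tidy_columns(raw)
print(list(clean.columns))
```

Returning a new frame instead of modifying `raw` in place keeps the original data available for comparison.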


#### REVIEWING OF THE UPDATED DATASET

In [4]:
# Printing updated information on the dataset using .info()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294 entries, 0 to 293
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   S/No                      294 non-null    int64         
 1   Block                     294 non-null    object        
 2   Street Name               294 non-null    object        
 3   Storey                    294 non-null    object        
 4   Floor Area (sqm)          294 non-null    int64         
 5   Remaining Lease (years)   294 non-null    int64         
 6   Resale Price              294 non-null    int64         
 7   Resale Registration Date  294 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(3)
memory usage: 18.5+ KB


#### QUESTION 1C - UNIVARIATE GRADIENT DESCENT ALGORITHM

In [5]:
# Use univariate gradient descent algorithm to find the value of b

# Define the predicted variable
y = df['Resale Price']

# Define the predictor variable
x = df['Remaining Lease (years)']

# Define the parameters for gradient descent
b = 1 # Starting value of b
rate = 0.0001 # Set learning rate
epsilon = 0.0001 # Stop algorithm when absolute change in the loss f(b) between 2 consecutive iterations is less than epsilon
diff = 1 # Difference between 2 consecutive iterates
max_iter = 1000 # Set maximum number of iterations
iter_count = 1  # Iterations counter

# Functions
n = len(x)
f = lambda b: (1/n) * np.sum((y - (x * b)) ** 2) # Loss Function
deriv = lambda b: (-2/n) * np.sum(x * (y - (x * b))) # Derivative of f

# Perform the Gradient Descent Function using a while loop
while diff > epsilon and iter_count < max_iter:
    iter_count += 1
    # Update value of b
    b_new = b - rate * deriv(b)
    # Stopping Criterion
    diff = abs(f(b_new) - f(b))
    b = b_new
    
# Record the final change in the loss (the quantity compared against epsilon)
minimum_error = diff

# Printing the desired output
print("Number of iterations is", iter_count - 1)
print("The local minimum occurs when b is", round(b,4))
print("Minimum error is", minimum_error)

Number of iterations is 13
The local minimum occurs when b is 6369.4159
Minimum error is 7.152557373046875e-06
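
As a sanity check on a result like this, the no-intercept model y = b·x has a closed-form least-squares solution, b = Σxy / Σx². The sketch below is illustrative only: it uses a small made-up sample (`x_demo`, `y_demo`) instead of the assignment's spreadsheet, and the helpers `closed_form_slope` and `gradient_descent_slope` are hypothetical names, not part of the original notebook. The descent loop follows the same starting value, learning rate and loss-based stopping rule as the cell above.

```python
import numpy as np

# Hypothetical demo data (not taken from the HDB dataset)
x_demo = np.array([60.0, 70.0, 80.0, 90.0])                  # remaining lease (years)
y_demo = np.array([380000.0, 430000.0, 500000.0, 560000.0])  # resale price

def closed_form_slope(x, y):
    """Least-squares slope of y = b * x (no intercept): b = sum(x*y) / sum(x^2)."""
    return np.sum(x * y) / np.sum(x ** 2)

def gradient_descent_slope(x, y, b=1.0, rate=0.0001, epsilon=0.0001, max_iter=1000):
    """Minimise the mean squared error of y = b * x by gradient descent."""
    n = len(x)
    loss = lambda b: (1 / n) * np.sum((y - x * b) ** 2)
    grad = lambda b: (-2 / n) * np.sum(x * (y - x * b))
    for _ in range(max_iter):
        b_new = b - rate * grad(b)
        # Stop once the loss changes by no more than epsilon between iterations
        converged = abs(loss(b_new) - loss(b)) <= epsilon
        b = b_new
        if converged:
            break
    return b

b_closed = closed_form_slope(x_demo, y_demo)
b_gd = gradient_descent_slope(x_demo, y_demo)
print("Closed-form b:", round(b_closed, 4))
print("Gradient descent b:", round(b_gd, 4))
```

If the two values disagree noticeably on real data, the learning rate or the tolerance probably needs adjusting.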


#### QUESTION 2C - GRADIENT DESCENT ALGORITHM

#### INITIALIZING THE GRADIENT DESCENT ALGORITHM

In [6]:
# Use gradient descent algorithm to find the value of a and b

# Define the predicted variable
y = df['Resale Price']

# Define the predictor variable
x = df['Remaining Lease (years)']

# Scale the predicted variable (y)
scaler_y = StandardScaler()
y_scaled = scaler_y.fit_transform(y.values.reshape(-1, 1))

# Scale the predictor variable (x)
scaler_x = StandardScaler()
x_scaled = scaler_x.fit_transform(x.values.reshape(-1, 1))

# Define the parameters for gradient descent
a = 2 # Starting value of a
b = 1 # Starting value of b
rate = 0.1 # Set learning rate
epsilon = 0.0000000001  # Stop algorithm when absolute change in the loss f(a, b) between 2 consecutive iterations is less than epsilon
diff = 1  # Difference between 2 consecutive iterates
max_iter = 1000  # Set maximum number of iterations
iter_count = 1  # Iterations counter

n = len(x)
f = lambda a, b: (1/n) * np.sum((y_scaled - a - (x_scaled * b)) ** 2) # Loss Function
deriv_a = lambda a, b: (-2/n) * np.sum(y_scaled - a - (x_scaled * b)) # Partial derivative of the loss with respect to a
deriv_b = lambda a, b: (-2/n) * np.sum(x_scaled * (y_scaled - a - (x_scaled * b))) # Partial derivative of the loss with respect to b

while diff > epsilon and iter_count < max_iter:
    iter_count += 1
    # Update value of a and b
    a_new = a - rate * deriv_a(a, b)
    b_new = b - rate * deriv_b(a, b)
    # Stopping Criterion
    diff = abs(f(a_new, b_new) - f(a, b))
    a, b = a_new, b_new
    
# Record the final change in the loss (the quantity compared against epsilon)
minimum_error = diff
    
# Unscale the coefficients a and b
unscaled_a = (a * scaler_y.scale_[0]) - ((b * (scaler_y.scale_[0] * scaler_x.mean_[0])) / scaler_x.scale_[0]) + scaler_y.mean_[0] 
unscaled_b = b * (scaler_y.scale_[0] / scaler_x.scale_[0])

# Printing the desired output
print("Number of iterations is", iter_count - 1)
print("The local minimum occurs when a is", round(unscaled_a, 4), "and b is", round(unscaled_b, 4))
print("Minimum error is", minimum_error)

Number of iterations is 54
The local minimum occurs when a is 316900.779 and b is 2334.033
Minimum error is 7.9747319858825e-11
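
The unscaling step follows from substituting the standardised variables back into the fitted line. If (y − μ_y)/σ_y = a_s + b_s·(x − μ_x)/σ_x, then b = b_s·σ_y/σ_x and a = a_s·σ_y + μ_y − b·μ_x. The sketch below checks this identity on a small made-up sample. The names `unscale_coefficients`, `x_demo` and `y_demo` are hypothetical, and `np.polyfit` stands in for gradient descent as the fitting step. NumPy's default population standard deviation (`ddof=0`) is used because that is what `StandardScaler` computes.

```python
import numpy as np

def unscale_coefficients(a_s, b_s, x_mean, x_std, y_mean, y_std):
    """Convert intercept/slope fitted on standardised data to original units."""
    b = b_s * y_std / x_std
    a = a_s * y_std + y_mean - b * x_mean
    return a, b

# Hypothetical demo data (not taken from the HDB dataset)
x_demo = np.array([60.0, 70.0, 80.0, 90.0])
y_demo = np.array([380000.0, 430000.0, 500000.0, 560000.0])

# Standardise with population statistics (ddof=0)
x_s = (x_demo - x_demo.mean()) / x_demo.std()
y_s = (y_demo - y_demo.mean()) / y_demo.std()

# Fit on the standardised data; polyfit returns [slope, intercept] for degree 1
b_s, a_s = np.polyfit(x_s, y_s, 1)
a_demo, b_demo = unscale_coefficients(
    a_s, b_s, x_demo.mean(), x_demo.std(), y_demo.mean(), y_demo.std()
)

# Fit directly on the raw data for comparison
b_raw, a_raw = np.polyfit(x_demo, y_demo, 1)
print("Unscaled:", round(a_demo, 4), round(b_demo, 4))
print("Direct  :", round(a_raw, 4), round(b_raw, 4))
```

Both routes should agree up to floating-point error, which confirms that scaling changes the conditioning of the optimisation problem but not the fitted line.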


#### QUESTION 3C - GRADIENT DESCENT ALGORITHM FOR MLR (X, Y, W)

#### NUMBER OF ROWS EXTRACTED FOR MLR : 15 ROWS


Rows were sampled only after removing duplicate combinations of floor area, remaining lease, and resale price, so that no observation appears twice in the sample. Sampling at random, with a fixed seed for reproducibility, is intended to give a varied selection of flats without hand-picking any of them.


As for the number of rows, 15 rows were used instead of the minimum requirement of 10. With more observations, each gradient is averaged over more data points, which should give more stable estimates of the parameter values.


In [7]:
# Select 15 rows of data with unique combinations for floor area, resale price and remaining lease 

# Drop duplicates first to ensure unique combinations
df_unique = df.drop_duplicates(subset=['Remaining Lease (years)', 'Floor Area (sqm)', 'Resale Price'])

# Check that the de-duplicated data (not the raw data) has enough rows to sample from
if len(df_unique) < 15:
    # Stop here so that model3_df is never used while undefined
    raise ValueError("This dataset does not have enough unique combinations for selection.")

# Sample 15 rows from the unique combinations (fixed seed for reproducibility)
model3_df = df_unique.sample(n=15, random_state=42)

# Reset index column
model3_df.reset_index(drop=True, inplace=True)

# Display data retrieved
model3_df

Unnamed: 0,S/No,Block,Street Name,Storey,Floor Area (sqm),Remaining Lease (years),Resale Price,Resale Registration Date
0,256,488A,Choa Chu Kang Ave 5,07 to 09,93,92,502000,2023-01-01
1,7,476C,Choa Chu Kang Ave 5,13 to 15,92,89,510000,2023-06-01
2,83,152,Jln Teck Whye,10 to 12,100,73,522000,2023-05-01
3,228,640,Choa Chu Kang St 64,01 to 03,100,73,538000,2023-02-01
4,126,635,Choa Chu Kang Nth 6,04 to 06,113,73,580000,2023-04-01
5,202,807B,Choa Chu Kang Ave 1,16 to 18,92,93,560000,2023-02-01
6,278,708,Choa Chu Kang St 53,13 to 15,108,71,497000,2023-01-01
7,183,687B,Choa Chu Kang Dr,04 to 06,90,78,460000,2023-03-01
8,10,489C,Choa Chu Kang Ave 5,07 to 09,93,92,520000,2023-06-01
9,32,506,Choa Chu Kang St 51,07 to 09,108,70,494000,2023-06-01


#### INITIALIZING THE GRADIENT DESCENT ALGORITHM

In [8]:
# Use gradient descent algorithm to find the value of a, b and c

# Define the predicted variable
y = model3_df['Resale Price']

# Define the predictor variables
x = model3_df['Remaining Lease (years)']
w = model3_df['Floor Area (sqm)']

# Scale the predicted variable (y)
scaler_y = StandardScaler()
y_scaled = scaler_y.fit_transform(y.values.reshape(-1, 1))

# Scale the predictor variables (x and w)
scaler_x = StandardScaler()
x_scaled = scaler_x.fit_transform(x.values.reshape(-1, 1))

scaler_w = StandardScaler()
w_scaled = scaler_w.fit_transform(w.values.reshape(-1, 1))

# Define the parameters for gradient descent
a = 1 # Starting value of a
b = 1  # Starting value of b
c = 1 # Starting value of c
rate = 0.1 # Set learning rate
epsilon = 0.000000000000001  # Stop algorithm when absolute change in the loss between 2 consecutive iterations is less than epsilon
diff, a_diff, b_diff, c_diff = 1, 1, 1, 1  # Difference between 2 consecutive iterates
max_iter = 1000  # Set maximum number of iterations
iter_count = 1  # Iterations counter

# Functions
n = len(x)
f = lambda a, b, c: np.mean((y_scaled - (a + x_scaled * b + w_scaled * c)) ** 2)  # Loss Function
deriv_a = lambda a, b, c: (-2 / n) * np.sum(y_scaled - (a + x_scaled * b + w_scaled * c))  # Partial derivative of the loss with respect to a
deriv_b = lambda a, b, c: (-2 / n) * np.sum(x_scaled * (y_scaled - (a + x_scaled * b + w_scaled * c)))  # Partial derivative of the loss with respect to b
deriv_c = lambda a, b, c: (-2 / n) * np.sum(w_scaled * (y_scaled - (a + x_scaled * b + w_scaled * c)))  # Partial derivative of the loss with respect to c

# Perform the Gradient Descent Function using a while loop
while diff > epsilon and iter_count < max_iter:
    iter_count += 1
    # Update value of a, b & c
    a_new = a - rate * deriv_a(a, b, c)
    b_new = b - rate * deriv_b(a, b, c)
    c_new = c - rate * deriv_c(a, b, c)
    # Stopping Criterion
    diff = abs(f(a_new, b_new, c_new) - f(a, b, c))
    # Carry the updated a, b and c into the next iteration
    a, b, c = a_new, b_new, c_new
    
# Record the final change in the loss (the quantity compared against epsilon)
minimum_error = diff

# Unscale the coefficients a, b and c
unscaled_a = (a * scaler_y.scale_[0]) - ((b * (scaler_y.scale_[0] * scaler_x.mean_[0])) / scaler_x.scale_[0]) - ((c * (scaler_y.scale_[0] * scaler_w.mean_[0])) / scaler_w.scale_[0]) + scaler_y.mean_[0] 
unscaled_b = b * (scaler_y.scale_[0] / scaler_x.scale_[0])
unscaled_c = c * (scaler_y.scale_[0] / scaler_w.scale_[0])

# Printing the desired output
print("Number of iterations is", iter_count - 1)
print("The local minimum occurs when a is", round(unscaled_a, 4), "and b is", round(unscaled_b, 4), "and c is", round(unscaled_c, 4))
print("Minimum error is", minimum_error)

Number of iterations is 275
The local minimum occurs when a is -246956.5951 and b is 3715.5665 and c is 4652.8425
Minimum error is 9.43689570931383e-16
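
For a model this small, the least-squares coefficients can also be computed directly, which gives an independent check on a gradient descent result. The sketch below is a hedged illustration, not the notebook's method. It builds a made-up sample from known coefficients and recovers them with `np.linalg.lstsq`. The arrays and the values 1000, 2000 and 3000 are assumptions chosen for the demo, not results from the HDB data.

```python
import numpy as np

# Hypothetical demo predictors (not taken from the HDB dataset)
x_demo = np.array([60.0, 65.0, 70.0, 75.0, 80.0, 85.0])    # remaining lease (years)
w_demo = np.array([95.0, 110.0, 90.0, 105.0, 100.0, 92.0])  # floor area (sqm)

# Response generated from known coefficients so the answer can be checked
a_true, b_true, c_true = 1000.0, 2000.0, 3000.0
y_demo = a_true + b_true * x_demo + c_true * w_demo

# Design matrix with a leading column of ones for the intercept a
X = np.column_stack([np.ones_like(x_demo), x_demo, w_demo])

# Solve the least-squares problem min ||X @ coef - y||^2 directly
coef, residuals, rank, _ = np.linalg.lstsq(X, y_demo, rcond=None)
a_hat, b_hat, c_hat = coef
print("a =", round(a_hat, 4), "b =", round(b_hat, 4), "c =", round(c_hat, 4))
```

Applying the same call to the sampled rows should give coefficients close to the unscaled gradient descent values if the descent has converged. A large gap would suggest revisiting the learning rate, the tolerance or the unscaling step.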
