# Principles of Algorithm Design

## Final Exam, Question 6

## Student Name

Hamed Araab

## Student Number

9925003


### Used Packages

#### `numpy`

- `array`: To convert the inputs (data frames or lists) to NumPy arrays
- `concatenate`: To add a column of ones to `X`
- `ones`: To create a column of ones in order to get the intercept values as a row in the coefficients matrix
- `linalg`: To calculate the inverse of `X.T @ X`

#### `pandas`

- `DataFrame`: To use and create data as data frames
- `read_csv`: To read the input file
- `get_dummies`: To transform categorical features (`zipcode` in this question) to dummy features


In [1]:
from numpy import array, concatenate, ones, linalg
from pandas import DataFrame, read_csv, get_dummies

### The Main Class

This is the main class that implements the closed form of Linear Regression.
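For reference, the closed form is commonly written as the normal equation. In the sketch below, $\hat{B}$ is an assumed name for the coefficients matrix (the notebook does not name it), $X$ is the input matrix with the column of ones added, and $Y$ is the target matrix:

$$
\hat{B} = (X^\top X)^{-1} X^\top Y
$$

This corresponds to the expression `linalg.inv(X.T @ X) @ X.T @ Y` used in the class, and it assumes that $X^\top X$ is invertible.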

#### Implementation Steps

1. Transform `X`.
   1. Convert `X` to an array.
   2. Add a column of ones to `X`.
   3. Fit the scaler of `X` by storing its mean and standard deviation (each a single value computed over all entries of the array, not per column).
   4. Normalize `X` using its fitted scaler.
2. Transform `Y`.
   1. Convert `Y` to an array.
   2. Fit the scaler of `Y` by storing its mean and standard deviation (again computed over all entries).
   3. Normalize `Y` using its fitted scaler.
3. Calculate the coefficients matrix using the closed form.
4. Calculate `Y_prediction` and undo normalization on it.
5. Undo normalization on `Y`.
6. Calculate RMSE and R2 as performance metrics.
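The steps above can be traced on a tiny example. This is only a sketch with made-up data (`y = 3 + 2 * x` exactly), and the variable names are illustrative rather than taken from the class:

```python
import numpy as np

# Toy data (made up for illustration): y = 3 + 2 * x exactly.
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 3.0 + 2.0 * x

# Step 1: add the column of ones, then scale with a single mean/std
# computed over all entries of X.
X = np.concatenate((np.ones((x.shape[0], 1)), x), axis=1)
mean_X, std_X = X.mean(), X.std()
X_scaled = (X - mean_X) / std_X

# Step 2: scale the target in the same way.
mean_Y, std_Y = y.mean(), y.std()
Y_scaled = (y - mean_Y) / std_Y

# Step 3: closed form (normal equation).
coefficients = np.linalg.inv(X_scaled.T @ X_scaled) @ X_scaled.T @ Y_scaled

# Step 4: predict and map the predictions back to the original units.
y_prediction = (X_scaled @ coefficients) * std_Y + mean_Y

# Step 6: metrics on the original scale.
rmse = ((y - y_prediction) ** 2).mean() ** 0.5
r2 = 1 - ((y - y_prediction) ** 2).sum() / ((y - y.mean()) ** 2).sum()
```

Because the toy target is an exact linear function of the input, the fit should reproduce `y` up to floating-point error, giving an RMSE near 0 and an R2 near 1.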

#### Capabilities

You can set any number of input and target features (`X` and `Y`). Thus, single and multiple as well as univariate and multivariate linear regression models can all be created with this class.
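To illustrate the multivariate case, here is a minimal sketch showing that the same closed form handles several target columns at once. The data and the matrix `true_B` are made up, and normalization is left out for brevity:

```python
import numpy as np

# Made-up data: 5 records, 2 input features, 2 target features.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
true_B = np.array([
    [1.0, -1.0],  # intercepts, one per target
    [2.0, 0.5],   # weights of the first input feature
    [0.0, 3.0],   # weights of the second input feature
])

# Add the column of ones and generate noise-free targets.
X1 = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
Y = X1 @ true_B

# One closed-form solve yields one column of coefficients per target.
B = np.linalg.inv(X1.T @ X1) @ X1.T @ Y

print(B.shape)  # (3, 2)
```

The first row of the result holds the intercepts, which is why the column of ones is added to `X`.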

#### `predict` Function

This function predicts the target variables (`Y_prediction`) of a set of records based on their input features and the model's coefficients.


In [2]:
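class LinearRegressor:
    """Linear regression fitted with the closed form (normal equation)."""

    def __init__(self, X, Y):
        # Convert X to an array and prepend a column of ones for the intercept.
        X = array(X)
        X = concatenate((ones((X.shape[0], 1)), X), axis=1)

        # Fit the scaler of X, then normalize X with it.
        self._fit_scaler_X(X)

        X = self._normalize_X(X)

        # Convert Y to an array, fit its scaler, and normalize it.
        Y = array(Y)

        self._fit_scaler_Y(Y)

        Y = self._normalize_Y(Y)

        # Closed form: (X^T X)^-1 X^T Y.
        self._coefficients = linalg.inv(X.T @ X) @ X.T @ Y

        # Predict on the training data and map back to the original units.
        Y_prediction = X @ self._coefficients
        Y_prediction = self._undo_normalization_Y(Y_prediction)

        Y = self._undo_normalization_Y(Y)

        # Performance metrics on the training data, in the original units.
        self._rmse = ((Y - Y_prediction) ** 2).mean() ** 0.5
        self._r2 = 1 - (((Y - Y_prediction) ** 2).sum() / ((Y - Y.mean()) ** 2).sum())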

    @property
    def coefficients(self):
        return self._coefficients

    @property
    def rmse(self):
        return self._rmse

    @property
    def r2(self):
        return self._r2

    def _fit_scaler_X(self, X):
        # Without `axis`, `mean()` and `std()` return one scalar over all entries.
        self._mean_X = X.mean()
        self._std_X = X.std()

    def _fit_scaler_Y(self, Y):
        self._mean_Y = Y.mean()
        self._std_Y = Y.std()

    def _normalize_X(self, X):
        return (X - self._mean_X) / self._std_X

    def _normalize_Y(self, Y):
        return (Y - self._mean_Y) / self._std_Y

    def _undo_normalization_Y(self, Y):
        return Y * self._std_Y + self._mean_Y

    def predict(self, X):
        """
        This function predicts the target variables (`Y_prediction`)
        of a set of records based on their input features and the model's
        coefficients.
        """

        X = array(X)
        X = concatenate((ones((X.shape[0], 1)), X), axis=1)
        X = self._normalize_X(X)

        Y_prediction = X @ self._coefficients
        Y_prediction = self._undo_normalization_Y(Y_prediction)

        return Y_prediction
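Explicitly inverting `X.T @ X` can be numerically fragile when the columns of `X` are linearly dependent or nearly so, for example when an intercept column is combined with a full set of dummy columns. The sketch below uses `numpy.linalg.lstsq` as a swapped-in alternative; it is not the method used by the class above, and the data is made up:

```python
import numpy as np

# Made-up design matrix: an intercept column plus a full set of dummy columns.
# The two dummy columns sum to the intercept column, so the columns are
# linearly dependent and X.T @ X is singular.
X = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
])
Y = np.array([[10.0], [12.0], [20.0], [22.0]])

print(np.linalg.matrix_rank(X))  # 2, fewer than the 3 columns

# lstsq returns the minimum-norm least-squares solution even in this case.
coefficients, residuals, rank, singular_values = np.linalg.lstsq(X, Y, rcond=None)

# Each record is predicted as the mean of its group (11 and 21 here).
Y_prediction = X @ coefficients
```

The individual coefficients are not unique in this situation, but the predictions are, so the metrics and predictions of a model can still be meaningful.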

### `home_data.csv`


Here, we load and preview the data that we are going to work with.


In [3]:
home_data = read_csv("home_data.csv")

print(home_data.shape)

home_data.head()

(21613, 21)


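| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7129300520 | 20141013T000000 | 221900 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |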


As you can see, we have 21,613 records and 21 columns, which include identifiers such as `id` and `date` as well as the target `price`.


### Feature Selection

Here, from `home_data`, we select the columns specified by the question as the input and the target features and create `X` and `Y` accordingly. In the process, using `pandas.get_dummies`, we create dummy columns instead of the `zipcode` column since it's a categorical (qualitative) feature.

In the end, we review `X` and `Y` to make sure everything is right.


In [4]:
X = get_dummies(
    home_data[
        [
            "bedrooms",
            "bathrooms",
            "sqft_living",
            "sqft_lot",
            "floors",
            "zipcode",
        ]
    ],
    columns=["zipcode"],
    dtype=int,  # Return 0 or 1 as the binary value
)

Y = home_data[["price"]]

X.head()

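| | bedrooms | bathrooms | sqft_living | sqft_lot | floors | zipcode_98001 | zipcode_98002 | zipcode_98003 | zipcode_98004 | zipcode_98005 | ... | zipcode_98146 | zipcode_98148 | zipcode_98155 | zipcode_98166 | zipcode_98168 | zipcode_98177 | zipcode_98178 | zipcode_98188 | zipcode_98198 | zipcode_98199 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |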
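The effect of `get_dummies` is easier to see on a tiny frame. The following sketch uses made-up rows (the `toy` frame is hypothetical and not part of `home_data`):

```python
from pandas import DataFrame, get_dummies

# Made-up rows; the zip codes are only illustrative values.
toy = DataFrame({"bedrooms": [3, 2, 4], "zipcode": [98178, 98125, 98178]})

# Replace the `zipcode` column with one 0/1 column per distinct zip code.
toy_dummies = get_dummies(toy, columns=["zipcode"], dtype=int)

print(list(toy_dummies.columns))
# ['bedrooms', 'zipcode_98125', 'zipcode_98178']
```

Each record has a 1 in exactly one of the dummy columns, namely the one that matches its zip code.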


In [5]:
X.describe()

| | bedrooms | bathrooms | sqft_living | sqft_lot | floors | zipcode_98001 | zipcode_98002 | zipcode_98003 | zipcode_98004 | zipcode_98005 | ... | zipcode_98146 | zipcode_98148 | zipcode_98155 | zipcode_98166 | zipcode_98168 | zipcode_98177 | zipcode_98178 | zipcode_98188 | zipcode_98198 | zipcode_98199 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | ... | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 |
| mean | 3.370842 | 2.114757 | 2079.899736 | 15106.97 | 1.494309 | 0.016749 | 0.009207 | 0.012955 | 0.014667 | 0.007773 | ... | 0.013325 | 0.002637 | 0.020636 | 0.011752 | 0.012446 | 0.011798 | 0.012122 | 0.006293 | 0.012955 | 0.014667 |
| std | 0.930062 | 0.770163 | 918.440897 | 41420.51 | 0.539989 | 0.128333 | 0.095515 | 0.113084 | 0.120219 | 0.087824 | ... | 0.114666 | 0.051288 | 0.142165 | 0.107771 | 0.110869 | 0.107981 | 0.109435 | 0.079077 | 0.113084 | 0.120219 |
| min | 0.0 | 0.0 | 290.0 | 520.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 3.0 | 1.75 | 1427.0 | 5040.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 3.0 | 2.25 | 1910.0 | 7618.0 | 1.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 4.0 | 2.5 | 2550.0 | 10688.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 33.0 | 8.0 | 13540.0 | 1651359.0 | 3.5 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [6]:
Y.describe()

| | price |
| --- | --- |
| count | 21613.0 |
| mean | 540088.1 |
| std | 367127.2 |
| min | 75000.0 |
| 25% | 321950.0 |
| 50% | 450000.0 |
| 75% | 645000.0 |
| max | 7700000.0 |


### Part 1

In this part, we are going to fit a model to predict the price of a house with `sqft_living` as the input.


In [7]:
model1 = LinearRegressor(X[["sqft_living"]], Y)

print(f"R2: {model1.r2}")
print(f"RMSE: {model1.rmse}")

R2: 0.492853214845565
RMSE: 261440.79072267728


An R2 of about 0.49 means that `sqft_living` alone explains roughly 49% of the variance in `price` on the training data, and the RMSE of about 261,441 is expressed in the same units as `price`. Whether such values count as "good" depends heavily on the selected features and on the quality and quantity of the data fed to the regressor. Thus, R2 and RMSE are most useful for comparing different models fitted to the same data.
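As a reminder of how the two metrics are computed, here is a small sketch that follows the same formulas as the class. The observed and predicted prices are made up for illustration:

```python
import numpy as np

# Made-up observed and predicted prices, for illustration only.
observed = np.array([200000.0, 300000.0, 400000.0, 500000.0])
predicted = np.array([220000.0, 280000.0, 410000.0, 490000.0])

residuals = observed - predicted

# RMSE: square root of the mean squared residual (same units as the prices).
rmse = (residuals ** 2).mean() ** 0.5

# R2: one minus the ratio of the residual sum of squares to the
# total sum of squares around the mean of the observed values.
r2 = 1 - (residuals ** 2).sum() / ((observed - observed.mean()) ** 2).sum()

print(round(rmse, 2))  # 15811.39
print(round(r2, 2))    # 0.98
```

An R2 of 1 would mean that the predictions match the observations exactly, while an R2 of 0 would mean that the model does no better than always predicting the mean price.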


### Part 2

In this part, we are going to fit a model to predict the price of a house with:

- `bedrooms`,
- `bathrooms`,
- `sqft_living`,
- `sqft_lot`,
- `floors`, and
- `zipcode`

as the inputs. Note that `zipcode` was already replaced by dummy features in the Feature Selection section.


In [8]:
model2 = LinearRegressor(X, Y)

print(f"R2: {model2.r2}")
print(f"RMSE: {model2.rmse}")

R2: 0.7395925389936023
RMSE: 187341.1671332787


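As you can see, this model fits the training data better than the previous one: its R2 is higher (about 0.74 versus 0.49) and its RMSE is lower (about 187,341 versus 261,441). A higher R2 means that a larger share of the variance in `price` is explained, and a lower RMSE means a smaller typical prediction error. Both metrics are computed on the same data the model was fitted to, so on their own they cannot show whether the model overfits; that would require evaluating it on held-out data.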
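Claims about overfitting are usually checked on data the model has not seen. The sketch below shows one way to do that with a simple hold-out split. It uses synthetic data rather than the housing data, and `fit_closed_form` and `rmse` are hypothetical helpers, not part of `LinearRegressor`:

```python
import numpy as np


def fit_closed_form(X, Y):
    """Hypothetical helper: normal-equation fit with an intercept column."""
    X1 = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
    return np.linalg.inv(X1.T @ X1) @ X1.T @ Y


def rmse(X, Y, coefficients):
    """Hypothetical helper: RMSE of the fitted model on the given records."""
    X1 = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
    return ((Y - X1 @ coefficients) ** 2).mean() ** 0.5


# Synthetic data (not the housing data): a noisy linear relationship.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
Y = 5.0 + 2.0 * X + rng.normal(0.0, 1.0, size=(200, 1))

# Shuffle the row indices once, then hold out the last 20% of them.
order = rng.permutation(X.shape[0])
train, test = order[:160], order[160:]

coefficients = fit_closed_form(X[train], Y[train])
train_rmse = rmse(X[train], Y[train], coefficients)
test_rmse = rmse(X[test], Y[test], coefficients)

print(f"Train RMSE: {train_rmse}")
print(f"Test RMSE: {test_rmse}")
```

A test RMSE that is much larger than the train RMSE would suggest overfitting, whereas similar values suggest that the model generalizes about as well as it fits.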


### Part 3

In this part, we predict the prices of three houses, two of which are in the training data, using the models of Part 1 and Part 2.

#### `print_predictions` Function

This function prints the observed price of the house and the response value and error of each model for a given house.


In [9]:
def print_predictions(house_name, X, Y=None):
    """
    This function prints the observed price of the house and the
    response value and the response error of each model for a
    given house.
    """

    if Y is not None:
        Y = array(Y)

    print(f"\nHouse {house_name}:\n")

    if Y is not None:
        print(f"Observed Price: {Y}\n")

    Y_prediction_model1 = model1.predict(X[["sqft_living"]])

    print(f"Model 1 Prediction: {Y_prediction_model1}")

    if Y is not None:
        print(f"Model 1 Error: {Y_prediction_model1 - Y}\n")

    Y_prediction_model2 = model2.predict(X)

    print(f"Model 2 Prediction: {Y_prediction_model2}")

    if Y is not None:
        print(f"Model 2 Error: {Y_prediction_model2 - Y}\n")


print_predictions(
    house_name=1,
    X=X.loc[home_data["id"] == 5309101050],
    Y=Y.loc[home_data["id"] == 5309101050],
)

print_predictions(
    house_name=2,
    X=X.loc[home_data["id"] == 1925069082],
    Y=Y.loc[home_data["id"] == 1925069082],
)

print_predictions(
    house_name=3,
    X=DataFrame(
        {
            **{column: [0] for column in X.columns.values},
            **{
                "bedrooms": [8],
                "bathrooms": [25],
                "sqft_living": [50000],
                "sqft_lot": [225000],
                "floors": [4],
                "zipcode_98039": [1],
            },
        }
    ),
)


House 1:

Observed Price: [[489950]]

Model 1 Prediction: [[399804.49495407]]
Model 1 Error: [[-90145.50504593]]

Model 2 Prediction: [[558200.39008693]]
Model 2 Error: [[68250.39008693]]


House 2:

Observed Price: [[2200000]]

Model 1 Prediction: [[1258512.60885303]]
Model 1 Error: [[-941487.39114697]]

Model 2 Prediction: [[1225791.2893959]]
Model 2 Error: [[-974208.7106041]]


House 3:

Model 1 Prediction: [[13987597.59135527]]
Model 2 Prediction: [[14864246.72493902]]


#### Question 6.1:

The predictions of both models for the three houses are printed in the output above.

#### Question 6.2:

For House 1, Model 2 is the winner.

For House 2, Model 1 is the winner.
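The winners can be confirmed by comparing the absolute errors. This sketch simply copies the errors from the printed output; the dictionary layout is an illustrative choice:

```python
# Errors (prediction minus observed price) copied from the printed output.
errors = {
    "House 1": {"Model 1": -90145.50504593, "Model 2": 68250.39008693},
    "House 2": {"Model 1": -941487.39114697, "Model 2": -974208.7106041},
}

# For each house, pick the model whose error is smallest in absolute value.
winners = {
    house: min(by_model, key=lambda name: abs(by_model[name]))
    for house, by_model in errors.items()
}

print(winners)
# {'House 1': 'Model 2', 'House 2': 'Model 1'}
```

Note that for House 2 both models underestimate the price by more than 900,000, so the margin between them is small relative to the error itself.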

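For House 3, there is no observed price, so no error can be computed. The two predictions differ by about 0.88 million (roughly 6%). Since Model 2 has better performance according to the R2 and RMSE metrics, its prediction is the more credible of the two. However, this house lies far outside the training data (for example, 50,000 `sqft_living` against a training maximum of 13,540, and 25 `bathrooms` against a maximum of 8), so both predictions are extrapolations and should be treated with caution.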
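How far House 3 lies from the training data can be checked against the maxima reported by `X.describe()`. A minimal sketch, with the values copied by hand from the outputs above:

```python
# Training maxima copied from the `X.describe()` output.
training_max = {
    "bedrooms": 33.0,
    "bathrooms": 8.0,
    "sqft_living": 13540.0,
    "sqft_lot": 1651359.0,
    "floors": 3.5,
}

# Inputs of House 3, copied from the `print_predictions` call.
house_3 = {
    "bedrooms": 8,
    "bathrooms": 25,
    "sqft_living": 50000,
    "sqft_lot": 225000,
    "floors": 4,
}

# Features whose value exceeds anything seen during training.
outside = [name for name, value in house_3.items() if value > training_max[name]]

print(outside)
# ['bathrooms', 'sqft_living', 'floors']
```

Three of the five numeric inputs exceed the training range, which is a reason to treat both predictions for this house as extrapolations.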
