<img src="https://news.illinois.edu/files/6367/543635/116641.jpg" alt="University of Illinois" width="250"/>

## HW: Deep Learning ##

HW submission by group (up to 4 people)
* John Doe <johndoe@illinois.edu>
* Jane Roe <janeroe@illinois.edu>

### imports and graphics configurations ###

In [1]:
import numpy
import pandas
import time
import random
import matplotlib
#%matplotlib notebook
import matplotlib.pyplot as plt
import scipy.stats
import matplotlib.offsetbox as offsetbox
from matplotlib.ticker import StrMethodFormatter

def saver(fname,dpi=600):
    plt.savefig(fname+".png",bbox_inches="tight",dpi=dpi)

def textbox(txt,fname=None):
    plt.figure(figsize=(1,1))
    plt.gca().add_artist(offsetbox.AnchoredText("\n".join(txt), loc="center",prop=dict(size=30)))
    plt.axis('off')
    if fname is not None:
        saver(fname)
    plt.show()
    plt.close()

In [2]:
#for some reason, this needs to be in a separate cell
params={
    "font.size":15,
    "lines.linewidth":5,
}
plt.rcParams.update(params)

# **Technology** #

**Technology:** Compute $\cos(k\pi/10)$ for $k\in \{0,1,2,\dots,20\}$
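
One possible way to carry out this computation is sketched below. It assumes NumPy is acceptable for the task; the variable names `k` and `values` are chosen here for illustration and are not part of the assignment.

```python
import numpy

# k = 0, 1, ..., 20 (21 values in total)
k = numpy.arange(21)

# vectorized evaluation of cos(k*pi/10)
values = numpy.cos(k * numpy.pi / 10)

for ki, vi in zip(k, values):
    print(f"k={ki:2d}  cos(k*pi/10)={vi:+.6f}")
```

Since the argument sweeps from $0$ to $2\pi$, the values trace one full period of the cosine.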

# **Linear Regression** #

**Feature Importance:** Consider linear regression of price upon the feature set
* Square Feet
* number of Beds
* number of Baths
* Year built
* HOA/Month

One by one, remove (using sklearn if you like) each of these features and repeat linear regression.
* Rank features by how the loss (mean square error) changes as each of the features is removed
* Rank features by how the metric (mean absolute error) changes as each of the features is removed
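
The procedure described above can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the helper `drop_one_ranking`, the columns `a`, `b`, `c`, and the generated data are hypothetical and are not the Redfin data, and both scores are computed on the same rows used for fitting.

```python
import numpy
import pandas
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error


def drop_one_ranking(X, y):
    """Refit a linear regression with each column of X removed in turn.

    Returns a DataFrame indexed by the removed feature, holding the change in
    in-sample MSE and MAE relative to the model that uses every column.
    """
    def fit_scores(cols):
        model = LinearRegression().fit(X[cols], y)
        pred = model.predict(X[cols])
        return mean_squared_error(y, pred), mean_absolute_error(y, pred)

    all_cols = list(X.columns)
    base_mse, base_mae = fit_scores(all_cols)
    rows = {}
    for col in all_cols:
        kept = [c for c in all_cols if c != col]
        mse, mae = fit_scores(kept)
        rows[col] = {"delta_mse": mse - base_mse, "delta_mae": mae - base_mae}
    return pandas.DataFrame.from_dict(rows, orient="index")


# Synthetic stand-in data (NOT the Redfin data): y depends strongly on "a",
# weakly on "b", and not at all on "c".
rng = numpy.random.default_rng(0)
X_demo = pandas.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = 3.0 * X_demo["a"] + 0.5 * X_demo["b"] + rng.normal(scale=0.1, size=200)

ranking = drop_one_ranking(X_demo, y_demo)
print(ranking.sort_values("delta_mse", ascending=False))  # ranking by loss
print(ranking.sort_values("delta_mae", ascending=False))  # ranking by metric
```

A larger increase in a score after removing a feature indicates a more important feature under that score. The two rankings need not agree.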

### Construct_Linear_Regression(5)

In [3]:
# Import the data

def getfile(location_pair,**kwargs): #tries to get local version and then defaults to google drive version
    (loc,gdrive)=location_pair
    try:
        out=pandas.read_csv(loc,**kwargs)
    except FileNotFoundError:
        print("local file not found; accessing Google Drive")
        loc = 'https://drive.google.com/uc?export=download&id='+gdrive.split('/')[-2]
        out=pandas.read_csv(loc,**kwargs)
    return out

In [4]:
url="https://www.redfin.com"
fname=("redfin_data.csv","https://drive.google.com/file/d/1ei7JaZ4M1lrw3TyYcWozESi8HrHccnUx/view?usp=sharing")
plot_title="Home Asking Price (Redfin)"
data_color="red"
markersize=2
thinlinesize=2

In [5]:
data_raw=getfile(fname)
data_raw.head()

local file not found; accessing Google Drive


HTTPError: HTTP Error 404: Not Found

In [7]:
data_raw.shape

(101, 27)

In [8]:
data_raw.columns

Index(['SALE TYPE', 'SOLD DATE', 'PROPERTY TYPE', 'ADDRESS', 'CITY',
       'STATE OR PROVINCE', 'ZIP OR POSTAL CODE', 'PRICE', 'BEDS', 'BATHS',
       'LOCATION', 'SQUARE FEET', 'LOT SIZE', 'YEAR BUILT', 'DAYS ON MARKET',
       '$/SQUARE FEET', 'HOA/MONTH', 'STATUS', 'NEXT OPEN HOUSE START TIME',
       'NEXT OPEN HOUSE END TIME',
       'URL (SEE https://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)',
       'SOURCE', 'MLS#', 'FAVORITE', 'INTERESTED', 'LATITUDE', 'LONGITUDE'],
      dtype='object')

In [9]:
cols_to_be_used = ["SQUARE FEET", "BEDS", "BATHS", "HOA/MONTH", "YEAR BUILT", "PRICE"]

In [10]:
df = data_raw[cols_to_be_used]

In [17]:
features = ["SQUARE FEET", "BEDS", "BATHS", "HOA/MONTH", "YEAR BUILT"]
y_col = ["PRICE"]

In [11]:
import torch
import scipy

In [12]:
class linearRegression(torch.nn.Module):
    def __init__(self, inputSize, outputSize=1, SEED=0): #default to a one-dimensional response
        super().__init__() #run init of torch.nn.Module
        if SEED is not None:
            torch.manual_seed(SEED) #make the weight initialization reproducible
        self.linear = torch.nn.Linear(inputSize,outputSize)
        if torch.cuda.is_available():
            self.cuda() #moves the parameters to the GPU in place; rebinding self is unnecessary

    def forward(self, x):
        out=self.linear(x)
        return out

In [13]:
learningRate = 0.01

In [14]:
model=linearRegression(inputSize=5)

Loss = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learningRate)

In [15]:
df

Unnamed: 0,SQUARE FEET,BEDS,BATHS,HOA/MONTH,YEAR BUILT,PRICE
0,2000.0,3.0,2.5,660.0,1928.0,549000
1,2138.0,2.0,2.5,1250.0,2002.0,925000
2,2198.0,5.0,2.0,,1954.0,699900
3,1426.0,2.0,2.0,446.0,1980.0,315000
4,,2.0,2.5,435.0,1984.0,385000
...,...,...,...,...,...,...
96,725.0,1.0,1.0,264.0,1916.0,159900
97,,,,,,1200000
98,6172.0,5.0,5.0,,1915.0,3250000
99,1807.0,2.0,2.5,465.0,1910.0,530000


In [18]:
X=df[features].squeeze()
Y=df[y_col].squeeze()

In [19]:
X.shape

(101, 5)

In [20]:
X

Unnamed: 0,SQUARE FEET,BEDS,BATHS,HOA/MONTH,YEAR BUILT
0,2000.0,3.0,2.5,660.0,1928.0
1,2138.0,2.0,2.5,1250.0,2002.0
2,2198.0,5.0,2.0,,1954.0
3,1426.0,2.0,2.0,446.0,1980.0
4,,2.0,2.5,435.0,1984.0
...,...,...,...,...,...
96,725.0,1.0,1.0,264.0,1916.0
97,,,,,
98,6172.0,5.0,5.0,,1915.0
99,1807.0,2.0,2.5,465.0,1910.0


In [21]:
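#store the tensors under new names so that the list `features` stays intact (it is reused below)
#note: X still contains NaNs at this point because imputation happens in a later cell
feature_tensor=torch.from_numpy(X.values.astype(numpy.float32))
label_tensor=torch.from_numpy(Y.values.astype(numpy.float32).reshape(-1,1))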

In [22]:
X.shape, Y.shape

((101, 5), (101,))

In [23]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

In [24]:
from sklearn.model_selection import train_test_split

In [25]:
# Imputation : Filling the null values with mean of the columns
X = X.fillna(X.mean())

In [26]:
(feature,featurescale,featurename)=("SQUARE FEET",1000,"SQUARE FEET/1000")
(label,labelscale,labelname)=("PRICE",1.0E6,"PRICE/$1M")

In [27]:
# Rescale feature and label to order one (square feet in thousands, price in $1M)
X[feature] = X[feature] / featurescale
Y = Y / labelscale

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X,Y,random_state=42, test_size=0.2)

In [29]:
y_train

89    0.7990
26    1.4950
42    0.4490
70    0.1680
15    0.1050
       ...  
60    0.1483
71    0.1030
14    0.6990
92    0.4150
51    0.3190
Name: PRICE, Length: 80, dtype: float64

In [30]:
model = LinearRegression()
model.fit(X_train,y_train)

LinearRegression()

In [31]:
y_pred = model.predict(X_test)

In [32]:
mean_squared_error(y_test, y_pred) , mean_absolute_error(y_test, y_pred)

(0.05344045998064856, 0.15909096115422808)

In [33]:
model.feature_names_in_,  model.coef_

(array(['SQUARE FEET', 'BEDS', 'BATHS', 'HOA/MONTH', 'YEAR BUILT'],
       dtype=object),
 array([ 2.33911924e-01, -2.56721718e-02,  1.43598694e-01,  2.07252597e-04,
        -2.73700924e-03]))

In [34]:
X.describe()

Unnamed: 0,SQUARE FEET,BEDS,BATHS,HOA/MONTH,YEAR BUILT
count,101.0,101.0,101.0,101.0,101.0
mean,1.787987,2.925532,2.164894,485.741379,1956.787234
std,1.006252,1.557173,1.171299,175.193925,35.701785
min,0.637,1.0,1.0,23.0,1889.0
25%,1.178,2.0,1.0,421.0,1927.0
50%,1.7,2.925532,2.0,485.741379,1955.0
75%,1.787987,4.0,2.5,485.741379,1993.0
max,6.172,7.0,6.5,1250.0,2022.0


In [35]:
corr_matrix = df.corr()
print(corr_matrix)

             SQUARE FEET      BEDS     BATHS  HOA/MONTH  YEAR BUILT     PRICE
SQUARE FEET     1.000000  0.779819  0.915848   0.205728   -0.054496  0.875522
BEDS            0.779819  1.000000  0.708116   0.014438   -0.386802  0.606917
BATHS           0.915848  0.708116  1.000000   0.173242   -0.041718  0.783133
HOA/MONTH       0.205728  0.014438  0.173242   1.000000    0.082718  0.310515
YEAR BUILT     -0.054496 -0.386802 -0.041718   0.082718    1.000000 -0.212761
PRICE           0.875522  0.606917  0.783133   0.310515   -0.212761  1.000000


In [36]:
model.fit_intercept

True

In [None]:
features

In [None]:
fin_dict = {}
feature_names = list(X.columns) # the five feature columns of the imputed, rescaled X
for i in range(-1,5): # i = -1 is the baseline run that keeps all five features
    run_id = "run_" + str(i)
    missing_var = ""
    vars_to_be_used = feature_names
    if i >= 0:
        vars_to_be_used = feature_names[:i] + feature_names[i+1:]
        missing_var = feature_names[i]

    # fit and evaluate on all 101 rows (no train/test split; see the discussion below)
    model = LinearRegression()
    model.fit(X[vars_to_be_used],Y)
    y_pred = model.predict(X[vars_to_be_used])

    mse_all = mean_squared_error(Y, y_pred)
    mae_all = mean_absolute_error(Y, y_pred)

    fin_dict[run_id] = {
        "vars_used" : model.feature_names_in_,
        "missing_var" : missing_var,
        "mse_all" : mse_all,
        "mae_all" : mae_all,
        "model_coef" : model.coef_,
    }

In [None]:
pandas.DataFrame(fin_dict).round(3)

# Background/Decisions we made:

We first planned to train our model on 80% of the data and test on the remaining 20%. With only 101 rows, however, a few outliers
landing in only the training set or only the test set could distort the results, and the test metrics would rest on just
21 houses. We therefore decided to fit and evaluate the model on all 101 rows of housing data.

With all five features in the model, the MSE and MAE were generally lower than after removing a feature,
but the MAE actually decreased slightly when we removed the "BEDS" feature. Although this is unexpected,
it is plausible given the feature correlation matrix: "BEDS" is highly correlated with "SQUARE FEET" (0.78)
and "BATHS" (0.71), so it adds little information that the model does not already have. Because least squares minimizes
the MSE rather than the MAE, dropping such a redundant feature can lower the in-sample MAE even though the MSE rises.
We also saw the MAE decrease when "YEAR BUILT" was removed.
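
The contrast between the two metrics can be made precise. The following is a brief sketch, assuming the model is fit by ordinary least squares and scored on the same rows used for fitting. The symbols are introduced here for illustration: $X$ is the $n\times p$ feature matrix (including a column of ones for the intercept), $y$ the vector of prices, and $\beta$ the coefficient vector. Removing feature $j$ is equivalent to constraining $\beta_j=0$, so

$$
\mathrm{MSE}_{\text{full}}
= \min_{\beta \in \mathbb{R}^{p}} \frac{1}{n}\lVert y - X\beta \rVert_2^2
\;\le\;
\min_{\beta \in \mathbb{R}^{p},\ \beta_j = 0} \frac{1}{n}\lVert y - X\beta \rVert_2^2
= \mathrm{MSE}_{\text{without } j}.
$$

The in-sample MSE therefore cannot decrease when a feature is removed. No analogous inequality holds for the MAE, because the fitted coefficients minimize squared error rather than absolute error.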

# Answer 1:

Order wrt MSE:

"SQUARE FEET" is the most important feature, because removing it produces the largest MSE relative to the five-feature linear regression.
Removing any single feature increases the MSE relative to the full model, so by this measure none of the features should be removed.

Order of importance(MSE) : SQUARE FEET > BATHS > YEAR BUILT > HOA/MONTH > BEDS

# Answer 2:

Order wrt MAE:

"SQUARE FEET" remains the most important feature, but the MAE decreases when we remove either "BEDS" or "YEAR BUILT". From the perspective of the metric, these two features make the model slightly worse, so they are candidates for removal.

Order of importance(MAE): SQUARE FEET > BATHS > HOA/MONTH > YEAR BUILT > BEDS


