## Regression with multiple features

In the simple case, we had one input variable $x$ and used it to predict the value $y$.

The estimator was a line with the equation $y=ax+b$, and we had to find the values of $a$ and $b$.

*Predicting multiple output variables can easily be done by doing the regression separately for each.*

What if we have different kinds of data (features) about the subjects? For example, we know not just the area of each house but also its number of rooms and floors, its age, and its distance to the city center.

Predicting based on multiple features means that the input data is not just a number $x$ but a vector of numbers $\mathbf{x}$.

Linear regression tries to estimate the output variable with a linear function, which is a linear combination of the input variables:

$$y = a_1x_1+a_2x_2+\dots+a_nx_n + b = b + \sum_{i=1}^n a_ix_i = \mathbf{a}^T\mathbf{x} + b$$

Instead of 2 numbers, $a$ and $b$, we need to determine the values of $n+1$ numbers: $n$ coefficients and the $y$-intercept.
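The formula above can be evaluated for many houses at once with a matrix-vector product. The following is a minimal sketch; the coefficient values and the two-feature layout are made up for illustration and are not fitted to any data.

```python
import numpy as np

# Hypothetical coefficients and intercept, chosen only for illustration.
a = np.array([300.0, 10_000.0])  # weight of living area, weight of bedrooms
b = 50_000.0                     # y-intercept

# Each row is one house: [living area, number of bedrooms].
X_demo = np.array([
    [1000.0, 2.0],
    [2000.0, 4.0],
])

# y = a^T x + b, computed for every row in one step.
y_demo = X_demo @ a + b
print(y_demo)
```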

Other than that, the method is the same; we just have more partial derivatives, one for each coefficient $a_i$ and one for $b$.

Implementing the error function and the gradient descent algorithm for the multivariable case is left as an exercise for the reader. Here, a solution with the scikit-learn library is presented.
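As a starting point for the exercise, here is one possible sketch of gradient descent for the multivariable case. It assumes the error function is the mean squared error with a factor of $\frac{1}{2m}$, which may differ from the definition used earlier; the function name `gradient_descent`, the learning rate, the step count, and the synthetic data are all illustrative choices.

```python
import numpy as np


def gradient_descent(X, y, learning_rate=0.1, steps=2000):
    """Fit y = X @ a + b by minimizing the mean squared error (assumed cost)."""
    m, n = X.shape
    a = np.zeros(n)
    b = 0.0
    for _ in range(steps):
        residual = X @ a + b - y      # prediction error for every example
        grad_a = X.T @ residual / m   # partial derivatives with respect to each a_i
        grad_b = residual.mean()      # partial derivative with respect to b
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Synthetic, noise-free data generated from known parameters.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(200, 3))
true_a = np.array([2.0, -1.0, 0.5])
true_b = 4.0
y_demo = X_demo @ true_a + true_b

a_hat, b_hat = gradient_descent(X_demo, y_demo)
print(a_hat, b_hat)
```

The features here are already on a similar scale. With raw features such as square feet and build year, a fixed learning rate like this one would likely diverge or converge very slowly, which is one motivation for feature scaling.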

### Reading the data

Data source: [Kaggle](https://www.kaggle.com/datasets/harlfoxem/housesalesprediction)

In [3]:
# python -m pip install pandas
import pandas as pd

df = pd.read_csv("kc_house_data.csv")
df

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.00,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.7210,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.00,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.00,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.00,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21608,263000018,20140521T000000,360000.0,3,2.50,1530,1131,3.0,0,0,...,8,1530,0,2009,0,98103,47.6993,-122.346,1530,1509
21609,6600060120,20150223T000000,400000.0,4,2.50,2310,5813,2.0,0,0,...,8,2310,0,2014,0,98146,47.5107,-122.362,1830,7200
21610,1523300141,20140623T000000,402101.0,2,0.75,1020,1350,2.0,0,0,...,7,1020,0,2009,0,98144,47.5944,-122.299,1020,2007
21611,291310100,20150116T000000,400000.0,3,2.50,1600,2388,2.0,0,0,...,8,1600,0,2004,0,98027,47.5345,-122.069,1410,1287


In [4]:
features = ["sqft_living", "sqft_lot", "bedrooms", "bathrooms", "floors", "yr_built"]
X = df[features].copy()
y = df["price"].copy()
X

Unnamed: 0,sqft_living,sqft_lot,bedrooms,bathrooms,floors,yr_built
0,1180,5650,3,1.00,1.0,1955
1,2570,7242,3,2.25,2.0,1951
2,770,10000,2,1.00,1.0,1933
3,1960,5000,4,3.00,1.0,1965
4,1680,8080,3,2.00,1.0,1987
...,...,...,...,...,...,...
21608,1530,1131,3,2.50,3.0,2009
21609,2310,5813,4,2.50,2.0,2014
21610,1020,1350,2,0.75,2.0,2009
21611,1600,2388,3,2.50,2.0,2004


In [5]:
# create training set
train_X = X.iloc[: len(X) // 2]
train_y = y.iloc[: len(y) // 2]
pd.concat([train_X, train_y], axis=1)

Unnamed: 0,sqft_living,sqft_lot,bedrooms,bathrooms,floors,yr_built,price
0,1180,5650,3,1.00,1.0,1955,221900.0
1,2570,7242,3,2.25,2.0,1951,538000.0
2,770,10000,2,1.00,1.0,1933,180000.0
3,1960,5000,4,3.00,1.0,1965,604000.0
4,1680,8080,3,2.00,1.0,1987,510000.0
...,...,...,...,...,...,...,...
10801,1480,1384,3,2.25,3.0,2008,436000.0
10802,2620,8331,4,2.50,2.0,1991,460000.0
10803,1300,5782,3,1.00,1.0,1959,281000.0
10804,3120,7680,4,1.50,1.0,1956,976000.0


### Multiple regression with scikit-learn

In [6]:
# pip install scikit-learn
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(train_X.values, train_y.values)

print(regr.coef_)

[ 3.19253946e+02 -1.99554980e-01 -7.58382693e+04  6.45434185e+04
  4.02050778e+04 -3.61146278e+03]
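The printed array is easier to read when each coefficient is paired with its feature name, and the intercept is stored separately in `intercept_`. The sketch below uses a tiny made-up dataset (not the Kaggle data) so that it runs on its own; the names `demo`, `demo_features`, and `coef_by_feature` are illustrative.

```python
import pandas as pd
from sklearn import linear_model

# Made-up data: price = 200 * sqft_living + 10000 * bedrooms + 50000 exactly.
demo = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "bedrooms": [2, 3, 3, 4, 5],
    "price": [270000.0, 380000.0, 480000.0, 590000.0, 700000.0],
})
demo_features = ["sqft_living", "bedrooms"]

model = linear_model.LinearRegression()
model.fit(demo[demo_features].values, demo["price"].values)

# coef_ follows the column order of the training matrix.
coef_by_feature = pd.Series(model.coef_, index=demo_features)
print(coef_by_feature)
print("intercept:", model.intercept_)
```

Each coefficient is the change in the predicted price for a one-unit change in that feature while the others are held fixed. Because the features are on very different scales, the magnitudes of the coefficients are not directly comparable.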


In [7]:
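# predict each house in the second (test) half and compare with its actual price
predictions = []
for i in range(len(X) // 2, len(X)):
    predicted = regr.predict([X.iloc[i]])[0]
    predictions.append(predicted)
    print(predicted, y.iloc[i])
    # ratio of the predicted price to the actual price
    print(f"\tPredicted price is {predicted / y.iloc[i]:.2%}")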

353559.5655401703 429000.0
	Predicted price is 82.41%
667469.0256868489 729000.0
	Predicted price is 91.56%
557202.231788394 625000.0
	Predicted price is 89.15%
241324.81689872686 218500.0
	Predicted price is 110.45%
175681.5122212423 175000.0
	Predicted price is 100.39%
283905.46162226424 442500.0
	Predicted price is 64.16%
273807.14084775466 480000.0
	Predicted price is 57.04%
678619.8833081378 715500.0
	Predicted price is 94.85%
353934.95892111957 432500.0
	Predicted price is 81.83%
1287442.8010682259 875000.0
	Predicted price is 147.14%
848602.504526752 605000.0
	Predicted price is 140.26%
1488733.205588336 1688000.0
	Predicted price is 88.20%
386764.70748868026 318000.0
	Predicted price is 121.62%
289124.05991409346 311300.0
	Predicted price is 92.88%
458865.03450161126 325000.0
	Predicted price is 141.19%
787438.3727737013 795000.0
	Predicted price is 99.05%
501739.4283077 558000.0
	Predicted price is 89.92%
766079.9412505869 370000.0
	Predicted price is 207.05%
539329.8736273637

In [8]:
import plotly.express as px

show_count = 100
# compare actual and predicted prices on the second (test) half
diffs = pd.DataFrame(data={"actual": y[len(X) // 2 :].values, "predicted": predictions})
diffs["error"] = diffs["actual"] - diffs["predicted"]
# plot the first show_count houses, with the error drawn as bars
fig = px.scatter(diffs[:show_count], y=["actual", "predicted"])
fig.add_bar(y=diffs["error"][:show_count], name="error")
fig.show()
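Besides inspecting individual predictions, the whole test half can be predicted in one call and summarized with a single number. The sketch below uses synthetic data in place of the house prices; the choice of mean absolute error and $R^2$ as metrics is an assumption, not something prescribed above.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic data: a linear relationship plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(100, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + 5.0 + rng.normal(0, 0.1, size=100)

# Train on the first half, evaluate on the second half.
half = len(X_demo) // 2
model = linear_model.LinearRegression().fit(X_demo[:half], y_demo[:half])

# predict accepts the whole matrix, so no Python loop is needed.
test_pred = model.predict(X_demo[half:])

mae = mean_absolute_error(y_demo[half:], test_pred)
r2 = r2_score(y_demo[half:], test_pred)
print(f"MAE: {mae:.3f}, R^2: {r2:.3f}")
```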

### Feature scaling and normalization

In [10]:
# sqft_living, sqft_lot, and yr_built are transformed below before being added
features = ["bedrooms", "bathrooms", "floors"]
X = df[features].copy()
# use age instead of build year, so that a house twice as old gets twice the value
X["age"] = 2015 - df["yr_built"]
# max scaling: divide by the largest value, so that the oldest house gets 1
X["age"] /= X["age"].max()
# mean normalization: center on the mean, then divide by the range
area = df["sqft_living"]
X["living_area"] = (area - area.mean()) / (area.max() - area.min())
# Z-score normalization: center on the mean, then divide by the standard deviation
lot = df["sqft_lot"]
X["lot_area"] = (lot - lot.mean()) / lot.std()
# sklearn.preprocessing.StandardScaler does the same, except that it divides by the
# population standard deviation (ddof=0), while pandas .std() defaults to ddof=1
X

Unnamed: 0,bedrooms,bathrooms,floors,age,living_area,lot_area
0,3,1.00,1.0,0.521739,-0.067917,-0.228316
1,3,2.25,2.0,0.556522,0.036989,-0.189881
2,2,1.00,1.0,0.713043,-0.098860,-0.123296
3,4,3.00,1.0,0.434783,-0.009049,-0.244009
4,3,2.00,1.0,0.243478,-0.030181,-0.169649
...,...,...,...,...,...,...
21608,3,2.50,3.0,0.052174,-0.041502,-0.337417
21609,4,2.50,2.0,0.008696,0.017366,-0.224381
21610,2,0.75,2.0,0.052174,-0.079992,-0.332129
21611,3,2.50,2.0,0.095652,-0.036219,-0.307069


Visualization of Z-score scaling
![](https://scikit-learn.org/stable/_images/sphx_glr_plot_all_scaling_002.png)
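The manual Z-score computation can be compared with `sklearn.preprocessing.StandardScaler` on a few values. This is a small sketch using five lot sizes; `MinMaxScaler` is included only for contrast, and note that it computes $(x - \min) / (\max - \min)$, which is neither the max scaling nor the mean normalization used in the cell above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Five lot sizes, used only to keep the example small.
demo = pd.DataFrame({"sqft_lot": [5650.0, 7242.0, 10000.0, 5000.0, 8080.0]})

# StandardScaler: subtract the mean, divide by the population standard deviation.
z_scaled = StandardScaler().fit_transform(demo)

# MinMaxScaler: map the smallest value to 0 and the largest to 1.
minmax_scaled = MinMaxScaler().fit_transform(demo)

# Manual Z-score with ddof=0 to match StandardScaler.
lot = demo["sqft_lot"]
manual = (lot - lot.mean()) / lot.std(ddof=0)

print(z_scaled[:, 0])
print(manual.to_numpy())
print(minmax_scaled[:, 0])
```

When a train/test split is used, a common practice is to fit the scaler on the training rows only and then apply `transform` to the test rows, so that no information from the test set leaks into the model.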

In [11]:
fig = px.line(pd.concat([X, y / 1_000_000], axis=1)[:40], markers=True)
fig.show()

### Feature engineering

Feature engineering means creating new features from the input data by combining or transforming existing ones.

It also enables polynomial regression: raise a feature to a certain power and add the result as a new feature.

But watch out: too many features, features that correlate only weakly with the target, or too few training examples can all lead to overfitting.
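To illustrate the polynomial case, the sketch below adds the square of a single feature and fits an ordinary linear regression on the expanded features. The data is synthetic and generated from a known quadratic; using `PolynomialFeatures` is one convenient option, and adding the column by hand (for example `x ** 2`) works just as well.

```python
import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data from y = 1 + 2x + 3x^2, without noise.
x = np.linspace(-2, 2, 21).reshape(-1, 1)
y_demo = 1.0 + 2.0 * x[:, 0] + 3.0 * x[:, 0] ** 2

# Expand the single feature x into the two features x and x^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

# The model is still linear in its coefficients, so LinearRegression applies.
model = linear_model.LinearRegression().fit(x_poly, y_demo)
print(model.coef_, model.intercept_)
```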

In [12]:
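# adjust age for renovation: a renovated house counts as if it had been built
# halfway between its construction year and its renovation year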
def adj_age(values):
    built, renovated = values
    if renovated:
        return 2015 - (built + renovated) / 2
    return 2015 - built


X["adj_age"] = df[["yr_built", "yr_renovated"]].apply(adj_age, axis=1)
# scaling
X["adj_age"] /= X["adj_age"].max()
fig = px.line(X.loc[:160, ["age", "adj_age"]])
fig.show()