# California housing dataset with linear and polynomial regression 

In this notebook, we'll use [linear regression](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares) and [polynomial regression](https://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression-extending-linear-models-with-basis-functions), as well as ridge regression, to predict median house values in the California housing dataset using scikit-learn.

First, the needed imports. 

In [None]:
%matplotlib inline

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn import datasets, __version__
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

## Data

Then we load the California housing data. The first time this cell runs, the data has to be downloaded, which can take a while.

In [None]:
chd = datasets.fetch_california_housing()

In [None]:
df = pd.DataFrame(data=chd.data, columns=chd.feature_names)
df['Target'] = pd.Series(chd.target, index=df.index)
df.describe()

In [None]:
plt.figure(figsize=(15,10))
for i in range(8):
    plt.subplot(4,2,i+1)
    plt.scatter(chd.data[:,i], chd.target, s=2, label=chd.feature_names[i])
    plt.legend(loc='best')

In [None]:
train_len = len(chd.data)-5000
X = chd.data
y = chd.target

X_train_all, y_train = X[:train_len], y[:train_len]
X_test_all, y_test = X[train_len:], y[train_len:]

# Single-feature variant: keep only the first column (MedInc) as a 2D array
X_train_single = X_train_all[:,0].reshape(-1, 1)
X_test_single = X_test_all[:,0].reshape(-1, 1)

print()
print('California housing data loaded: train:',len(X_train_all),'test:',len(X_test_all))
print('X_train_all:', X_train_all.shape)
print('y_train:', y_train.shape)
print('X_test_all:', X_test_all.shape)
print('y_test:', y_test.shape)

The full training data (`X_train_all`) is a matrix of size (`train_len`, 8), i.e. it consists of `train_len` housing districts, each characterized by 8 attributes *(MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude)*. `y_train` is a vector containing the target value *(median house value, in units of $100,000)* for each housing district in the training set. `X_train_single` keeps only the first attribute *(MedInc)* and has size (`train_len`, 1).
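
The shapes described above can be checked on a small stand-in array. This is only a sketch: it uses random numbers instead of the real dataset, and the names `X_demo`, `y_demo` and `n_test` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for chd.data: 20 "districts" with 8 attributes each (random values, not real data)
X_demo = rng.normal(size=(20, 8))
y_demo = rng.normal(size=20)

n_test = 5                       # hold out the last rows, as the notebook holds out the last 5000
split = len(X_demo) - n_test

X_demo_train, y_demo_train = X_demo[:split], y_demo[:split]
X_demo_test, y_demo_test = X_demo[split:], y_demo[split:]

# Indexing one column gives a 1D array; scikit-learn estimators expect 2D input,
# so reshape(-1, 1) turns it into a single-column matrix.
X_demo_train_single = X_demo_train[:, 0].reshape(-1, 1)

print(X_demo_train.shape, y_demo_train.shape, X_demo_train_single.shape)
```

Note that a split by position, as used here, assumes the row order carries no structure; shuffling before splitting is a common alternative.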

In [None]:
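# Use only the first feature (MedInc) so that the fitted curves can be plotted
X_train = X_train_single
X_test = X_test_single

# Uncomment to use all eight features instead
#X_train = X_train_all
#X_test = X_test_all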

## Linear regression

### Learning

In [None]:
%%time

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
print('coefficients:', lin_reg.coef_)
print('intercept:', lin_reg.intercept_)

### Inference

Next, we use the fitted model to predict the target values of the test samples and measure the prediction error.

In [None]:
%%time

predictions = lin_reg.predict(X_test)

print("Mean squared error: %.2f"
      % mean_squared_error(y_test, predictions))
print('R2 score: %.2f' % r2_score(y_test, predictions))
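
To make the two reported numbers concrete, here is a small sketch that computes them by hand and compares the result with scikit-learn. The arrays `y_true` and `y_pred` are toy values chosen for illustration, not results from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy targets and predictions (illustrative values only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

# Mean squared error: average of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# R2 (coefficient of determination): 1 - residual sum of squares / total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

An R2 of 1 means perfect predictions, 0 means no better than always predicting the mean of the targets, and negative values mean worse than that.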

In [None]:
if X_test.shape[1] == 1:
    plt.figure(figsize=(10, 10))
    plt.scatter(X_test, y_test, s=5)
    reg_x = np.arange(np.min(X_test), np.max(X_test), 0.01).reshape(-1, 1)
    plt.scatter(reg_x, lin_reg.predict(reg_x), s=8, label='linear')
    plt.legend(loc='best');

## Ridge regression

### Learning

In [None]:
%%time

rdg_reg = Ridge(alpha=10000)
rdg_reg.fit(X_train, y_train)
print('coefficients:', rdg_reg.coef_)
print('intercept:', rdg_reg.intercept_)
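
Ridge regression adds an L2 penalty, weighted by `alpha`, on the coefficients to the least-squares objective, so a large `alpha` pulls the coefficients toward zero. The following sketch illustrates this on synthetic data; the data and the names `X_demo`, `y_demo`, `ridge_weak` and `ridge_strong` are assumptions made for the example, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic data with a true slope of 3 (illustration only)
X_demo = rng.normal(size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
ridge_weak = Ridge(alpha=0.01).fit(X_demo, y_demo)      # almost no regularization
ridge_strong = Ridge(alpha=10000).fit(X_demo, y_demo)   # same alpha as in the cell above

for name, model in [('ols', ols),
                    ('ridge alpha=0.01', ridge_weak),
                    ('ridge alpha=10000', ridge_strong)]:
    print(name, model.coef_)
```

With a small `alpha` the ridge coefficient stays close to the ordinary least-squares one, while with `alpha=10000` it shrinks to a small fraction of it.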

### Inference

In [None]:
%%time

predictions = rdg_reg.predict(X_test)

print("Mean squared error: %.2f"
      % mean_squared_error(y_test, predictions))
print('R2 score: %.2f' % r2_score(y_test, predictions))

In [None]:
if X_test.shape[1] == 1:
    plt.figure(figsize=(10, 10))
    plt.scatter(X_test, y_test, s=5)
    reg_x = np.arange(np.min(X_test), np.max(X_test), 0.01).reshape(-1, 1)
    plt.scatter(reg_x, lin_reg.predict(reg_x), s=8, label='linear');
    plt.scatter(reg_x, rdg_reg.predict(reg_x), s=8, label='ridge')
    plt.legend(loc='best');

## Polynomial regression

### Learning

In [None]:
%%time

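# PolynomialFeatures already adds a constant column of ones, so the linear
# step is fitted without a separate intercept (intercept_ is then 0.0).
poly_model = Pipeline([('poly', PolynomialFeatures(degree=5)),
                      ('linear', LinearRegression(fit_intercept=False))])
poly_model.fit(X_train, y_train)
print('coefficients:', poly_model.named_steps['linear'].coef_)
print('intercept:', poly_model.named_steps['linear'].intercept_)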
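
To see what the `PolynomialFeatures` step feeds into the linear model, the sketch below expands a tiny single-feature input and counts how many terms a degree-5 expansion of eight features would produce. The inputs are toy values used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two samples with a single feature each (toy values)
x = np.array([[2.0], [3.0]])

# Degree 3 on one feature gives the columns [1, x, x^2, x^3]
expanded = PolynomialFeatures(degree=3).fit_transform(x)
print(expanded)

# With 8 input features, degree 5 produces all monomials up to total degree 5
n_terms = PolynomialFeatures(degree=5).fit(np.zeros((1, 8))).n_output_features_
print(n_terms)
```

The number of terms grows quickly with the number of input features, which is worth keeping in mind before switching the polynomial model to all eight features.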

### Inference

In [None]:
%%time

predictions = poly_model.predict(X_test)

print("Mean squared error: %.2f"
      % mean_squared_error(y_test, predictions))
print('R2 score: %.2f' % r2_score(y_test, predictions))

In [None]:
if X_test.shape[1] == 1:
    plt.figure(figsize=(10, 10))
    plt.scatter(X_test, y_test, s=5)
    reg_x = np.arange(np.min(X_test), np.max(X_test), 0.01).reshape(-1, 1)
    plt.scatter(reg_x, lin_reg.predict(reg_x), s=8, label='linear');
    plt.scatter(reg_x, poly_model.predict(reg_x), s=8, label='polynomial')
    plt.legend(loc='best');

## Model tuning

Try to improve the accuracy of the regression models while preserving a reasonable runtime for predicting the whole test set. Things to try include using all eight features instead of a single one (`X_train_all` and `X_test_all`), changing the regularization strength `alpha` of the ridge model, or changing the degree of the polynomial features. See the documentation for [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) and [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html).

See also https://scikit-learn.org/stable/modules/linear_model.html for more information.
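
For the regression models in this notebook, one possible way to tune the polynomial degree and the ridge `alpha` together is a cross-validated grid search. The sketch below is an assumption about how this could be set up, not part of the original exercise; it runs on synthetic data, and the names `X_demo`, `y_demo`, `pipe` and `param_grid` as well as the grid values are chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic one-feature data with a quadratic trend (illustration only)
X_demo = rng.uniform(-2, 2, size=(300, 1))
y_demo = 0.5 * X_demo[:, 0] ** 2 + X_demo[:, 0] + rng.normal(scale=0.1, size=300)

# Polynomial expansion followed by ridge regression; Ridge fits its own intercept,
# so the constant column is left out of the expansion.
pipe = Pipeline([('poly', PolynomialFeatures(include_bias=False)),
                 ('ridge', Ridge())])

# Hypothetical grid: parameters are addressed as <step name>__<parameter name>
param_grid = {'poly__degree': [1, 2, 3],
              'ridge__alpha': [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X_demo, y_demo)

print('best parameters:', search.best_params_)
print('best cross-validated MSE:', -search.best_score_)
```

The same pattern could be applied to `X_train` and `y_train`, keeping the test set aside for the final evaluation only.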