First, we need to import the necessary packages.

**Note:** You will most likely have a more expansive import cell; I'm importing only what I need for this sample.

In [50]:
from sklearn import datasets, linear_model, metrics, model_selection
import pandas as pd
import numpy as np

Next, we need to load the data set and store it in a pandas DataFrame. Feel free to copy and paste the code below directly into your own Jupyter notebook (.ipynb) source file(s).

In [51]:
data = datasets.fetch_california_housing(as_frame=True).frame

Now, we want to see the data in a tabular representation. Executing the following cell will do just that.

In [52]:
data

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


Next, we want to process the data. I will only process it in the following ways:
1. Deal with missing (null or NA) data, if any such records exist.
1. Deal with duplicate data, if any such records exist.

**Note:** You should do more than I'm doing here: in addition to the steps above, transform a variable if its distribution is skewed, remove outliers if you deem it necessary, et cetera. I'm only doing this much to keep the baseline model you'll be using as simple as possible.
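
The extra steps mentioned in the note above could look roughly like the sketch below. It is only an illustration on a made-up toy column: the names "toy", "feature" and "feature_log" are hypothetical, and the log transform and the 1.5 * IQR rule are common choices rather than required ones.

```python
import numpy as np
import pandas as pd

# Toy right-skewed column standing in for a real feature (values are made up).
toy = pd.DataFrame({"feature": [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 120.0]})

# Compare skewness before and after a log transform.
# log1p(x) = log(1 + x), so it is also safe when x is zero.
skew_before = toy["feature"].skew()
toy["feature_log"] = np.log1p(toy["feature"])
skew_after = toy["feature_log"].skew()

# IQR rule: keep only rows within 1.5 * IQR of the first and third quartiles.
q1, q3 = toy["feature"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = toy["feature"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trimmed = toy[in_range]

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
print(f"rows kept by IQR rule: {len(trimmed)} of {len(toy)}")
```

Whether a transformation or an outlier filter actually helps depends on the variable, so check the effect on your own data before keeping either step.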

In [53]:
# check if any data is missing...
data.isna().any()

MedInc         False
HouseAge       False
AveRooms       False
AveBedrms      False
Population     False
AveOccup       False
Latitude       False
Longitude      False
MedHouseVal    False
dtype: bool

In [54]:
# check if any data is duplicated...
data[data.duplicated()].any()

MedInc         False
HouseAge       False
AveRooms       False
AveBedrms      False
Population     False
AveOccup       False
Latitude       False
Longitude      False
MedHouseVal    False
dtype: bool

It appears that there are no missing (null or NA) data entries, nor are there any duplicate data records. 
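
Had either check found something, the records would need to be handled before modelling. A minimal sketch of one way to do that, using a made-up toy DataFrame (the values and the names "toy" and "cleaned" are assumptions for illustration); dropping rows is the simplest option, not the only one:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and one duplicated row (values are made up).
toy = pd.DataFrame({
    "MedInc": [8.3, 7.2, np.nan, 7.2],
    "HouseAge": [41.0, 52.0, 21.0, 52.0],
})

n_missing = int(toy.isna().sum().sum())     # total number of missing cells
n_duplicates = int(toy.duplicated().sum())  # rows that repeat an earlier row

# Drop incomplete rows first, then repeated rows, and renumber the index.
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)

print(f"missing cells: {n_missing}, duplicate rows: {n_duplicates}, rows kept: {len(cleaned)}")
```

Counting with "duplicated().sum()" also gives a single number instead of one flag per column, which can be easier to read than the per-column output above.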

Next, we should split our data into $X$ and $y$ variables, and then split them into training and testing data sets.

**Note:** the "train_test_split" function shuffles the data randomly before splitting it. If you want the split (and therefore the trained model and its scores) to be the same every time you run the notebook, fix the seed by setting the "random_state" parameter explicitly.

In [55]:
# split into X and y...
# make sure X only includes the variables you want to use in the model as explanatory/independent variables...
# here, I'm only using MedInc and HouseAge...
# Note: every time you want to add or remove a variable, you'll need to re-run this cell and all the ones after it...
X = data[["MedInc", "HouseAge"]]
y = data.MedHouseVal

In [56]:
# view X variable...
X

,MedInc,HouseAge
0,8.3252,41.0
1,8.3014,21.0
2,7.2574,52.0
3,5.6431,52.0
4,3.8462,52.0
...,...,...
20635,1.5603,25.0
20636,2.5568,18.0
20637,1.7000,17.0
20638,1.8672,18.0


In [57]:
# view y variable...
y

0        4.526
1        3.585
2        3.521
3        3.413
4        3.422
         ...  
20635    0.781
20636    0.771
20637    0.923
20638    0.847
20639    0.894
Name: MedHouseVal, Length: 20640, dtype: float64

In [58]:
# split X, y into training and testing versions...
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.8, random_state=42)
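
To see what fixing "random_state" buys you, the sketch below splits a small made-up data set twice with the same seed and checks that both splits match. The names "X_toy", "y_toy" and "split" are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn import model_selection

# Toy data standing in for X and y (values are made up).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

def split(seed):
    # Same kind of call as in the cell above, with the seed passed through.
    return model_selection.train_test_split(
        X_toy, y_toy, train_size=0.8, random_state=seed
    )

first = split(42)
second = split(42)

# Identical seeds give identical splits, element for element.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(f"same split with the same seed: {same}")
print(f"train rows: {len(first[0])}, test rows: {len(first[1])}")
```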

At this point, we can create our baseline linear regression model. Please copy and paste the code from every cell below labelled

```
# COPY THIS CELL!!!
```

directly into each of your own Jupyter notebook (.ipynb) source file(s) in which you create a model for task one. 

Also, make sure to label the cell where you define the model you have chosen as your accepted model as follows:

```
# THIS IS MY ACCEPTED MODEL!!!
```

Remember to choose your accepted model by comparing each of your models against this baseline model, using the adjusted-$R^2$ score as your measure of comparison: the model that improves on the baseline the most should be your accepted model.
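
For reference, the adjusted-$R^2$ score is computed from the ordinary $R^2$ score, the number of observations $n$, and the number of explanatory/independent variables $k$:

$$
R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}
$$

Unlike the ordinary $R^2$ score, this value goes down when an added variable does not improve the fit enough to justify the extra term, which is why it is the fairer measure when comparing models that use different numbers of variables.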

In [59]:
# COPY THIS CELL!!!

# define a linear regression model...
baseline_lin_reg = linear_model.LinearRegression()

In [60]:
# COPY THIS CELL!!!

# train the linear regression model using the "fit" function...
baseline_lin_reg.fit(X_train, y_train)

LinearRegression()

In [61]:
# COPY THIS CELL!!!

# use the trained linear regression model to make predictions for y_test based on the X_test data...
y_pred = baseline_lin_reg.predict(X_test)

In [62]:
# COPY THIS CELL!!!

# determine the non-adjusted r2 score of the baseline model...
r2 = metrics.r2_score(y_test, y_pred)

In [63]:
# COPY THIS CELL!!!

# determine the number of observations in the test set (n) and the number of explanatory/independent variables used (k)...
n, k = X_test.shape

In [64]:
# COPY THIS CELL!!!

# calculate the adjusted-r2 score for the baseline linear regression model...
adj_r2 = 1 - (1 - r2) * ((n - 1) / (n - k - 1))

# print the calculated adjusted-r2 score (this is the value you want to beat!!!)
print(f'Baseline Adjusted-R^2 : {adj_r2}')

Baseline Adjusted-R^2 : 0.4938153753759118
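
When you compare your own models against this score, it may be convenient to wrap the calculation from the cells above in one function. The sketch below is one possible way to do that; the function name "adjusted_r2", the synthetic data and the two candidate feature sets are all assumptions for illustration, not part of the task.

```python
import numpy as np
from sklearn import linear_model, metrics, model_selection

def adjusted_r2(y_true, y_pred, k):
    """Adjusted R^2 for predictions from a model with k explanatory variables."""
    n = len(y_true)
    r2 = metrics.r2_score(y_true, y_pred)
    return 1 - (1 - r2) * ((n - 1) / (n - k - 1))

# Synthetic data (made up): y depends on the first two of three columns.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=42
)

# Fit one model per candidate feature set and score each on the same test data.
scores = {}
for name, cols in {"one_feature": [0], "two_features": [0, 1]}.items():
    model = linear_model.LinearRegression().fit(X_tr[:, cols], y_tr)
    scores[name] = adjusted_r2(y_te, model.predict(X_te[:, cols]), k=len(cols))

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Scoring every candidate on the same test split, as done here, keeps the comparison with the baseline fair.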
