# Welcome to week 3 of your PAMD computer labs.

In this week's session, we will implement different linear regression models and evaluate their output. We will use the California Housing dataset from scikit-learn for that purpose.

Have a look at the dataset documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) to familiarise yourself with the variables.

In [11]:
import sklearn.datasets as ds

df = ds.fetch_california_housing(as_frame=True)

In [12]:
# Let's start by printing the dataset description.

print(df.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

# Task 1 - Descriptive statistics and visualisation

The dataset provides us with a pre-defined target variable ('target'), which is the median house value in units of $100,000. We will build a linear regression model to explain and predict this value. The dataset also pre-defines the 8 variables we can use as predictors ('data').

Start your analysis by familiarising yourself with the target variable and the predictors. You can create descriptive statistics or plots to visualise them.

A quick way of creating some descriptive statistics for your data is the pandas [describe() method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html). You call it on a pandas DataFrame and it returns summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric column. You can also use any of the functions we've learned over the past few weeks to calculate descriptive statistics instead.
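
To illustrate what describe() returns, here is a minimal sketch on a small made-up DataFrame. The column names are borrowed from the dataset, but the values and the variable names `toy` and `summary` are invented for this example. In the lab itself you would call describe() on `df.data`, `df.target` or `df.frame` instead.

```python
import pandas as pd

# Made-up stand-in data: the column names come from the dataset, the values do not
toy = pd.DataFrame({
    "MedInc": [1.5, 3.0, 4.5, 6.0],
    "HouseAge": [10.0, 20.0, 30.0, 40.0],
})

# One row per statistic (count, mean, std, min, quartiles, max), one column per variable
summary = toy.describe()
print(summary)

# Individual statistics are also available as separate methods
print(toy["MedInc"].median())
```

The result of describe() is itself a DataFrame, so you can pick out single values with `.loc`, for example `summary.loc["mean", "MedInc"]`.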

In [1]:
# Your code here

# Task 2 - Simple linear regression

Now it's time to build a simple linear regression model. The objective is to explain the variable 'target' (our y) through just one of the available predictors. You can choose any one of them, such as 'MedInc' (our x).

Use the sklearn class LinearRegression() to build your model. You can find the documentation for that [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

Tip: Double check which format the model expects the input data to be in. Then use the model's fit() method. Don't worry too much about optional parameters like weights for now; try to create the simplest model first.
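
The input format is the part that most often goes wrong, so here is a small sketch of it on made-up data (the arrays `x` and `y` are invented and are not part of the dataset). fit() expects the predictors as a 2D array of shape (n_samples, n_features), even when there is only one predictor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x + 1 exactly
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A single predictor is a 1D array, so we reshape it into one column:
# shape (4,) becomes shape (4, 1)
X = x.reshape(-1, 1)
print(X.shape)

# Create the model object, then estimate its parameters from the data
model = LinearRegression()
model.fit(X, y)
```

With a pandas DataFrame you can get the same 2D shape by selecting with double brackets, for example `df.data[["MedInc"]]`, which returns a one-column DataFrame rather than a Series.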

In [14]:
# Your code here

Try plotting the results of your model. You can do so, for example, with the scatter() function in the matplotlib library. Plot your x values and known y values first. Then you can overlay that with a plot of your predicted y values. Try out some different visualisations if you want to.

Tip: The LinearRegression fit() method we used above estimates your parameters (the betas). You can then use the predict() method to obtain the predicted values (y-hat). Save these predictions in their own vector to plot them.
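
One possible way to build such a plot is sketched below on randomly generated data. The data, the seed and the variable names are assumptions made for this example. Swap in your own x and y from the dataset.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Made-up noisy data around the line y = 0.5x + 1
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 0.5 * x + 1.0 + rng.normal(0, 1, size=50)

# Fit the model and store the predictions in their own vector
X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# Observed values as points, predictions as a line on top
fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.5, label="observed y")
order = np.argsort(x)  # sort by x so the line is drawn left to right
ax.plot(x[order], y_hat[order], color="red", label="predicted y-hat")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

In a notebook the figure is normally displayed when the cell finishes running. In a plain Python script you would add `plt.show()` at the end.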

In [2]:
# Your code here

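You can have a look at your linear regression object in more detail. For example, try printing your coefficients using the coef_ attribute of your fitted model. You can also print your intercept using the intercept_ attribute. Note that both are attributes rather than functions, so they take no brackets, and they only exist after you have called fit().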
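
As a quick sketch of how this looks in practice, the example below fits a model to made-up data that follows y = 2x + 1 and then reads off the slope and the intercept. The data and the name `model` are invented for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x + 1 exactly
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X, y)

# coef_ holds one slope per predictor, intercept_ holds the constant term
print(model.coef_)
print(model.intercept_)
```

The printed slope and intercept should be close to 2 and 1, up to floating point rounding.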

In [3]:
# Your code here

# Task 3 - Multiple linear regression

As above, build the same kind of model, but this time include multiple columns of X. You can choose just a few of them or try including all.

You can again use the sklearn class LinearRegression(), which we used above. It works well with multiple predictors.
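
A minimal sketch with two predictors is shown below. The numbers are made up so that y = 1 + 2·x1 − 3·x2 holds exactly. The only change compared to the simple regression is that X now has more than one column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two predictor columns x1 and x2
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 6.0],
])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

model = LinearRegression().fit(X, y)

# One coefficient per column of X, in the same order as the columns
print(model.coef_)
print(model.intercept_)

# score() returns the R-squared of the fit on the data passed in
r_squared = model.score(X, y)
print(r_squared)
```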

However, one disadvantage of the sklearn model is that it has limited built-in functionality for creating a quick summary of the model. This is because sklearn is mostly focused on predictive modelling, so it's more interested in giving us accurate y-hats than in telling us a lot of small details about the model fit.

A popular alternative package which does have a good summary function is statsmodels, with documentation [here](https://www.statsmodels.org/stable/regression.html). I will use that instead this time to be able to print a summary of the model performance, but the sklearn one would work just as well. Use whichever you feel more comfortable with.

If you want to try out the statsmodels version, use the [OLS estimation](https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html). OLS stands for Ordinary Least Squares, and it's the type of least squares estimation we covered during the lecture.

In [4]:
# Your code here

In [5]:
# Your code here