<a href="https://colab.research.google.com/github/Tejes-Aulakh/Python/blob/main/Intro_to_AI_Linear_Regression_California_Housing_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# California Housing Data - Using Linear Regression

The California Housing dataset is another well-known dataset in the field of machine learning and statistics.

## Description:

The California Housing dataset consists of 20,640 samples derived from the 1990 U.S. census. Each sample is a census block group in California, described by eight features. The dataset is commonly used as a regression benchmark for predicting median house value.

## Features:

The dataset includes the following eight features (all continuous):

1. MedInc: Median income in block group
2. HouseAge: Median house age in block group
3. AveRooms: Average number of rooms per household
4. AveBedrms: Average number of bedrooms per household
5. Population: Block group population
6. AveOccup: Average number of household members
7. Latitude: Block group latitude
8. Longitude: Block group longitude

These features represent various demographic and geographic properties of the housing samples.

## Target:

The target variable is the median house value for California districts, measured in hundreds of thousands of dollars.
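
Because the target is expressed in units of $100,000, it has to be multiplied by 100,000 to read it in dollars. A minimal sketch of that conversion follows; the helper name `to_dollars` is hypothetical and introduced only for illustration, and 4.526 is the target value of the first row of the dataset.

```python
def to_dollars(med_house_val: float) -> float:
    """Convert a target value in units of $100,000 into US dollars."""
    return med_house_val * 100_000


# A target of 4.526 corresponds to a median house value of about $452,600.
print(f"${to_dollars(4.526):,.0f}")
```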

## Data Structure:

* Number of samples: 20,640
* Number of features: 8
* Number of classes: Not applicable (continuous target variable)

## Example Data:

Here is a sample from the dataset:

| MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | Median House Value |
|--------|----------|----------|-----------|------------|----------|----------|-----------|--------------------|
| 8.3252 | 41.0     | 6.9841   | 1.0238    | 322        | 2.5556   | 37.88    | -122.23   | 4.526              |
| 8.3014 | 21.0     | 6.2381   | 0.9719    | 2401       | 2.1098   | 37.86    | -122.22   | 3.585              |
| 7.2574 | 52.0     | 8.2881   | 1.0734    | 496        | 2.8023   | 37.85    | -122.24   | 3.521              |


In this notebook, we apply Linear Regression to predict the median house value from the features above. Linear Regression is a fundamental technique that models a continuous target as a weighted sum of the input features. The resulting model is a usable baseline on a real-world dataset, although, as the evaluation at the end shows, it explains only part of the variation in house values.

# Import libraries

Python has a large ecosystem of machine learning libraries, either pre-installed in environments such as Colab or available through package managers such as pip. The libraries we will be using are NumPy, pandas, and scikit-learn. Like the previous example, we will plot the data using matplotlib, here mostly through the plotting helpers built into pandas.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load the dataset

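scikit-learn provides the dataset through a single loader function, `fetch_california_housing()`, which downloads the data on first use and caches it locally. Passing `as_frame=True` returns pandas objects, and the `frame` attribute holds the features and the target together in one DataFrame. It's often a good idea to display the first few rows of a dataset to make sure everything looks as it should.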

In [None]:
# Load the California Housing dataset as pandas objects
housing = fetch_california_housing(as_frame=True)
# Keep the combined DataFrame (eight features plus the MedHouseVal target)
housing = housing.frame

In [None]:
# Display first few rows
housing.head()

# Visualise the Data

With so many features, it is often worthwhile to visualise each one as a graph. Note that this is not required for the algorithm to run; it helps us, as humans, to interpret and understand the data.

Based on the histograms of the various features, we can observe the following:

* The features are distributed across very different scales.
* The `HouseAge` and `MedHouseVal` columns are capped, at 52 and roughly 5 (that is, $500,000), respectively.

To improve accuracy, we should preprocess these features, either through feature engineering or by cleaning the problematic instances.
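
One possible way to clean the capped instances is to drop the rows whose target sits at the cap. The sketch below uses a small made-up DataFrame rather than the real data, and it assumes the cap is simply the largest value in the column. Whether to drop, keep, or flag such rows is a modelling choice, so treat this as one option rather than the required step.

```python
import pandas as pd

# Toy stand-in for the housing frame (values invented for illustration).
toy = pd.DataFrame({
    "MedInc": [8.3, 2.1, 5.6, 9.9, 3.8],
    "MedHouseVal": [4.526, 1.2, 5.00001, 5.00001, 3.422],
})

# Assumption: the cap is the largest value present in the column.
cap = toy["MedHouseVal"].max()

# Keep only the rows strictly below the cap.
uncapped = toy[toy["MedHouseVal"] < cap]

print(f"cap={cap}, dropped {len(toy) - len(uncapped)} of {len(toy)} rows")
```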

In [None]:
# Visualise the data
housing.hist(bins=50, figsize=(12,8))
plt.show()

# Plot the data

We can plot the value of the properties against location using latitude and longitude. You can see this corresponds to a [map of California](https://www.google.com/maps/place/California,+USA/@37.2691675,-119.306607,6z/data=!3m1!4b1!4m6!3m5!1s0x808fb9fe5f285e3d:0x8b5109a227086f55!8m2!3d36.778261!4d-119.4179324!16zL20vMDFuN3E?entry=ttu).


In [None]:
# Colour encodes median house value; marker size scales with population
housing.plot(
    kind="scatter", x="Longitude", y="Latitude",
    c="MedHouseVal", cmap="jet", colorbar=True,
    s=housing["Population"] / 100, label="population",
    alpha=0.7, legend=True, sharex=False, figsize=(10, 7),
)
plt.show()


The color scale represents house values, and the circle sizes indicate the population of each block group. From this visualization, we can conclude that:

* Houses near the ocean have higher values.
* Houses in densely populated areas also tend to have higher values, although this effect diminishes as we move farther from the ocean.
* There are some outliers present.

## Plot Correlations Between Features
Next, we will plot the correlations between the features.

In [None]:
attributes = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup','MedHouseVal']
scatter_matrix(housing[attributes], figsize=(12,8))
plt.show()

Reading along the bottom row (`MedHouseVal`), we can compare how the target relates to each of the other features. The strongest relationship is with `MedInc` (median income), which is worth exploring further.

## Explore in More Detail

In [None]:
housing.plot(kind="scatter", x="MedInc",y="MedHouseVal")
plt.show()

Looking at the plot, there seems to be a relatively strong linear correlation between median income and house value.

There are some issues, however, such as the horizontal lines in the plot, most visibly at the top, which result from capped house values. This should ideally be addressed during preprocessing.

We can view the correlations directly.

In [None]:
corr = housing.corr()
corr['MedHouseVal'].sort_values(ascending=True)

Setting aside the target's correlation with itself (1.0), the strongest positive correlation with median house value is median income (`MedInc`), as expected.
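
`DataFrame.corr()` computes the Pearson correlation coefficient by default. As a rough sketch of what that number means, the coefficient can be reproduced by hand as the covariance divided by the product of the standard deviations. The data below is synthetic (a noisy linear relation), not the housing data, and the variable names are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: y depends linearly on x, plus noise.
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=1.0, size=500)

# Pearson r = cov(x, y) / (std(x) * std(y)), written with centred values.
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# The same quantity via pandas, as used in the notebook.
r_pandas = pd.DataFrame({"x": x, "y": y}).corr().loc["x", "y"]

print(r_manual, r_pandas)
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little linear relationship.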

Everything we've done so far is part of understanding the data, which is our due diligence. Now we need to train the model.

## Split for Testing and Training
Here, we randomly split the data, reserving 80% for training and 20% for testing. Fixing `random_state` makes the split reproducible.

In [None]:
# Features: every column except the last
X = housing.iloc[:, :-1]
# Target: the last column (MedHouseVal)
y = housing.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
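
To see what an 80/20 split means in row counts, the sketch below runs `train_test_split` on placeholder arrays with the same shape as the dataset (20,640 rows, 8 features). The arrays are filled with zeros and the names ending in `_demo` are made up for this illustration; only the sizes matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows and features as the dataset.
X_demo = np.zeros((20640, 8))
y_demo = np.zeros(20640)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (16512, 8) (4128, 8)
```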

## Fitting the Model

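The pipeline first applies `StandardScaler()` to the features and then fits the Linear Regression model. For plain least-squares regression, scaling does not change the predictions or the R<sup>2</sup> score, but it puts the coefficients on a comparable footing and keeps the pipeline ready for models that are sensitive to feature scale. Using a pipeline also makes the code cleaner and more reusable, and it ensures the scaler is fitted on the training data only.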


In [None]:
regression_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])
regression_pipeline.fit(X_train,y_train)
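
A property worth knowing at this point: for ordinary least squares, standardising the features changes the coefficients but not the predictions. The sketch below checks this on synthetic data. The arrays and variable names are made up for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustrative only).
X_demo = np.column_stack([
    rng.normal(5, 2, 200),        # small-scale feature
    rng.normal(1500, 1000, 200),  # large-scale feature
])
y_demo = 0.4 * X_demo[:, 0] + 0.0001 * X_demo[:, 1] + rng.normal(0, 0.1, 200)

# Same model with and without the scaling step.
scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
]).fit(X_demo, y_demo)
plain = LinearRegression().fit(X_demo, y_demo)

# Predictions agree up to floating-point error.
same_predictions = np.allclose(scaled.predict(X_demo), plain.predict(X_demo))

# Coefficients differ: the scaled ones are expressed per standard deviation.
scaled_coef = scaled.named_steps["regressor"].coef_
print(same_predictions, scaled_coef, plain.coef_)
```

The scaled coefficients are easier to compare with each other, since each one describes the effect of a one-standard-deviation change in its feature.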

## Prediction and Evaluation
Now we can put our test data through the model to get the R<sup>2</sup> value.

In [None]:
# Predict on the held-out test set and score against the true values
y_pred = regression_pipeline.predict(X_test)
r2_score(y_test, y_pred)

This gives us a value of 0.58, meaning the model explains a little under 60% of the variance in house values on the test set. That is usable as a baseline, but it isn't highly accurate. As the plots above suggest, median income is positively correlated with house value, but the relationship is noisy, some columns are capped, and several other features also influence the result. More careful preprocessing or more flexible models are natural next steps.
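
For reference, R<sup>2</sup> is defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand and compares it with `r2_score`. The two small arrays are invented for illustration and are not results from the model above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions, purely for illustration.
y_true = np.array([4.5, 3.6, 3.5, 1.2, 2.0])
y_hat = np.array([4.0, 3.9, 3.0, 1.5, 2.4])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

A score of 1 would mean perfect predictions, while a score of 0 is what a model that always predicts the mean of the true values would achieve.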