# 🧪 Task 2: Multiple Linear Regression

In this task, Multiple Linear Regression is used to predict housing prices
using all available input features from the California Housing dataset.


## 1️⃣ Data Retrieval and Collection

The California Housing dataset is loaded using the Scikit-learn library.
This dataset contains housing information from California districts.


In [1]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

housing = fetch_california_housing(as_frame=True)
df = housing.frame

print("Dataset Shape:", df.shape)
print("\nColumn Names:\n", df.columns)

df.head()


Dataset Shape: (20640, 9)

Column Names:
 Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude', 'MedHouseVal'],
      dtype='object')


| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


## 2️⃣ Data Cleaning

The dataset is checked for missing values and data types to ensure
it is suitable for training a regression model.


In [2]:
df.isnull().sum()


MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64

### Explanation

- No missing values are present in the dataset.
- All columns are numerical.
- Therefore, no data cleaning was required.
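
The cell above checks missing values only; the claim that every column is numerical can be verified in the same step. The following is a minimal sketch, assuming a small hypothetical frame `toy` that stands in for `df` and a hypothetical helper `quality_report` (neither is part of the notebook):

```python
import numpy as np
import pandas as pd


def quality_report(frame: pd.DataFrame) -> dict:
    """Summarize missing values and non-numeric columns (hypothetical helper)."""
    missing = frame.isnull().sum()
    non_numeric = frame.select_dtypes(exclude="number").columns.tolist()
    return {
        "missing_total": int(missing.sum()),
        "columns_with_missing": missing[missing > 0].index.tolist(),
        "non_numeric_columns": non_numeric,
    }


# Toy stand-in for df; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "MedInc": [8.3, 7.2, np.nan],
    "HouseAge": [41.0, 52.0, 21.0],
    "Region": ["north", "south", "north"],
})

report = quality_report(toy)
print(report)
```

Calling `quality_report(df)` on the housing frame should report zero missing values and no non-numeric columns, in line with the explanation above.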


## 3️⃣ Feature Design

In Multiple Linear Regression, all available features except the target
variable are used as input features.

**Target (Label):** MedHouseVal  
**Input Features:** All remaining columns


In [12]:
X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']


## Feature Scaling

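Feature scaling is applied because the input features have very different ranges
(for example, `Population` is in the hundreds or thousands, while `AveBedrms` is close to 1).
Standardization rescales each feature to zero mean and unit variance, which makes
the fitted coefficients directly comparable. For ordinary least squares it does
not change the predictions.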


In [13]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
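
Note that the cell above fits the scaler on the full dataset, so the rows later used for testing also influence the means and standard deviations. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates this on synthetic data; the arrays `X_demo` and `y_demo`, the coefficients used to generate them, and the other `_demo` names are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (illustrative values only).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[5.0, 30.0, 1000.0], scale=[2.0, 10.0, 500.0], size=(200, 3))
y_demo = 0.4 * X_demo[:, 0] + 0.01 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Split first, so the test rows never influence the scaling statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn mean/std from training rows
X_te_scaled = scaler_demo.transform(X_te)      # reuse the training statistics

# A pipeline bundles both steps, so fit() only ever sees the training data.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
r2_demo = pipe.score(X_te, y_te)

print("Training column means after scaling:", X_tr_scaled.mean(axis=0).round(6))
print("Test R^2 on synthetic data:", round(float(r2_demo), 3))
```

For plain linear regression the effect of this leak on the predictions is nil, but the split-first habit matters for models and preprocessing steps that are sensitive to scaling.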


## 4️⃣ Algorithm Selection

Linear Regression is selected because the target variable is continuous
and the relationship between features and output is assumed to be linear.
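
Written out, the linear assumption means the prediction is a weighted sum of the eight input features plus an intercept. The symbols below (prediction, intercept, coefficients, features) are notation introduced here for illustration:

```latex
\hat{y} = \beta_0 + \sum_{j=1}^{8} \beta_j x_j
```

Here $\beta_0$ corresponds to `model.intercept_` and $\beta_1, \dots, \beta_8$ to the entries of `model.coef_`.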


## 5️⃣ Loss Function Selection

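Mean Squared Error (MSE), the average of the squared differences between the
predicted and the actual values, is used as the loss function. It is minimized
during training and is also reported when the model is evaluated.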
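
As a small worked example, MSE can be computed by hand and compared with `mean_squared_error`. The target and prediction values below are hypothetical and chosen so that the arithmetic is easy to follow:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions (not taken from the housing data).
y_true_demo = np.array([3.0, 2.5, 4.0])
y_pred_demo = np.array([2.5, 3.0, 4.5])

# MSE = mean of the squared differences between targets and predictions.
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

print(mse_manual, mse_sklearn)
```

Every error here is 0.5 in magnitude, so each squared error is 0.25 and the mean is 0.25 as well.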


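## 6️⃣ Model Training

The data is split into a training set (80%) and a test set (20%), and the model
is fitted on the training set.


In [14]: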
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)


In [15]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)


| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |


### Learning Process

The model learns coefficients for each input feature and an intercept
by minimizing the Mean Squared Error on the training data.
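
To make the learning step concrete, the sketch below solves the same least-squares problem directly with `np.linalg.lstsq` and compares the result with `LinearRegression`. The data is synthetic, and the generating coefficients are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients (illustrative, not the housing data).
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(100, 3))
y_syn = 2.0 + X_syn @ np.array([0.8, -0.3, 0.1]) + rng.normal(scale=0.05, size=100)

# Least squares with an explicit column of ones for the intercept.
A = np.column_stack([np.ones(len(X_syn)), X_syn])
beta, *_ = np.linalg.lstsq(A, y_syn, rcond=None)

# scikit-learn solves the same minimization problem.
ols = LinearRegression().fit(X_syn, y_syn)

print("lstsq intercept and coefficients:  ", beta.round(4))
print("sklearn intercept and coefficients:", np.r_[ols.intercept_, ols.coef_].round(4))
```

Both lines should print the same numbers up to rounding, since minimizing the sum of squared residuals and minimizing the MSE give the same solution.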


## 7️⃣ Model Evaluation

The trained model is evaluated on unseen test data using MSE and R² score.


In [16]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

mse, r2


(0.5558915986952441, 0.575787706032451)

### Interpretation

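- The test MSE is about 0.556; lower values indicate better prediction accuracy.
  Its square root (RMSE) is about 0.746, and because `MedHouseVal` is expressed in
  units of 100,000 US dollars, a typical prediction error is roughly 75,000 dollars.
- The R² score is about 0.576, so the model explains roughly 58% of the variance
  in house values on the test set.
- Multiple Linear Regression is expected to perform better than single-feature
  regression because it can use information, such as location, that one feature
  alone does not carry; the single-feature result is not shown in this task.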
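
To read the MSE in the target's own unit, it helps to take the square root. The sketch below does this for the MSE reported in the output above, and also recomputes R² by hand on hypothetical values (the `_demo` arrays are assumptions for illustration):

```python
import math

import numpy as np
from sklearn.metrics import r2_score

# Test MSE reported in the evaluation output above.
mse_reported = 0.5558915986952441

# RMSE is in the same unit as the target; MedHouseVal is in units of $100,000.
rmse = math.sqrt(mse_reported)
rmse_dollars = rmse * 100_000
print(f"RMSE: {rmse:.3f} (about ${rmse_dollars:,.0f})")

# R^2 by hand on hypothetical values: 1 - SS_res / SS_tot.
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.1, 1.9, 3.2, 3.8])
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot
r2_library = r2_score(y_true_demo, y_pred_demo)
print("Manual R^2:", round(float(r2_manual), 4), "| sklearn:", round(float(r2_library), 4))
```

An R² of 1 would mean perfect predictions, while 0 corresponds to always predicting the mean of the test targets.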


## 📈 Model Interpretation


In [17]:
coefficients = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": model.coef_
})

coefficients


| | Feature | Coefficient |
|---|---|---|
| 0 | MedInc | 0.852382 |
| 1 | HouseAge | 0.122382 |
| 2 | AveRooms | -0.305116 |
| 3 | AveBedrms | 0.371132 |
| 4 | Population | -0.002298 |
| 5 | AveOccup | -0.036624 |
| 6 | Latitude | -0.896635 |
| 7 | Longitude | -0.868927 |


In [18]:
model.intercept_


np.float64(2.067862309508392)

### Explanation

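- Each coefficient is the change in predicted house value (in units of 100,000
  US dollars) for a one-standard-deviation increase in that feature, with the
  other features held constant. The unit is standard deviations because the
  features were standardized before training.
- Positive coefficients increase the predicted house value; negative
  coefficients decrease it.
- `MedInc`, `Latitude`, and `Longitude` have the largest coefficients in
  absolute value, while `Population` has almost no effect.
- The intercept (about 2.07) is the predicted house value when all standardized
  features are zero, that is, when every feature equals its dataset mean.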
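
Because the model was trained on standardized features, its coefficients are expressed per standard deviation. The sketch below shows one way to convert such coefficients back to the original feature units, using the `scale_` and `mean_` attributes of `StandardScaler`. The data is synthetic and all `_demo` names are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative values only).
rng = np.random.default_rng(2)
X_raw = rng.normal(loc=[4.0, 30.0], scale=[2.0, 12.0], size=(300, 2))
y_demo = 0.5 + 0.4 * X_raw[:, 0] - 0.01 * X_raw[:, 1] + rng.normal(scale=0.05, size=300)

scaler_demo = StandardScaler()
X_std = scaler_demo.fit_transform(X_raw)
scaled_model = LinearRegression().fit(X_std, y_demo)

# Per-standard-deviation coefficients -> per-original-unit coefficients.
coef_original = scaled_model.coef_ / scaler_demo.scale_
intercept_original = scaled_model.intercept_ - np.sum(coef_original * scaler_demo.mean_)

# Fitting on the raw features directly should give the same numbers.
raw_model = LinearRegression().fit(X_raw, y_demo)
print("Back-transformed:", coef_original.round(4), round(float(intercept_original), 4))
print("Raw fit:         ", raw_model.coef_.round(4), round(float(raw_model.intercept_), 4))
```

In this sketch the intercept of the model trained on standardized features equals the mean of the training targets, which is why it reads as the prediction for an "average" row.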


## ✅ Final Conclusion

Multiple Linear Regression was implemented through the complete machine learning
pipeline, from data retrieval to model interpretation. Using all eight input
features, the model reached a test MSE of about 0.556 and an R² of about 0.576.
