<a href="https://colab.research.google.com/github/TamilselviMunusamy007/MachineLearning_M606/blob/main/exercises/machine-learning/supervised-learning/linear_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression
You should build a machine learning pipeline using a linear regression model. In particular, you should do the following:
- Load the `housing` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Conduct data exploration, data preprocessing, and feature engineering if necessary.
- Train and test a linear regression model using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.
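
The last item can be tried on a small synthetic problem before touching the real data. The sketch below is only an illustration and not part of the exercise solution: the toy data and the names `X_toy`, `y_toy`, and `model` are assumptions, while `get_params`, `fit`, `score`, `coef_`, `intercept_`, and `n_features_in_` belong to scikit-learn's `LinearRegression` interface.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (assumed for illustration): y = 3*x0 - 2*x1 + 5, without noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + 5

# Hyperparameters are set in the constructor and can be inspected before fitting.
model = LinearRegression()
params = model.get_params()
print(params)

# Methods: fit learns the weights, score returns R² on the given data.
model.fit(X_toy, y_toy)
r2_toy = model.score(X_toy, y_toy)

# Attributes ending in "_" exist only after fitting.
print(model.coef_, model.intercept_, model.n_features_in_, r2_toy)
```

Because the toy target is an exact linear function of the inputs, the fitted coefficients and intercept should match the generating values up to floating-point error.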

In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


In [11]:

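# Load the housing dataset from the course repository.
url = "https://raw.githubusercontent.com/m-mahdavi/teaching/refs/heads/main/datasets/housing.csv"
df = pd.read_csv(url)

# A notebook cell renders only its last expression, so the result of df.head()
# is not shown here. df.info() prints its report itself.
df.head()
df.info()
df.describe()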

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                545 non-null    int64  
 1   price             545 non-null    int64  
 2   area              545 non-null    int64  
 3   bedrooms          545 non-null    int64  
 4   bathrooms         545 non-null    int64  
 5   stories           545 non-null    int64  
 6   stories.1         545 non-null    int64  
 7   guestroom         545 non-null    int64  
 8   basement          545 non-null    int64  
 9   hotwaterheating   545 non-null    int64  
 10  airconditioning   545 non-null    int64  
 11  parking           545 non-null    int64  
 12  prefarea          545 non-null    int64  
 13  furnishingstatus  545 non-null    float64
dtypes: float64(1), int64(13)
memory usage: 59.7 KB


statistic,id,price,area,bedrooms,bathrooms,stories,stories.1,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
count,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0,545.0
mean,272.0,4766729.0,5150.541284,2.965138,1.286239,1.805505,0.858716,0.177982,0.350459,0.045872,0.315596,0.693578,0.234862,0.465138
std,157.47222,1870440.0,2170.141023,0.738064,0.50247,0.867492,0.348635,0.382849,0.477552,0.209399,0.46518,0.861586,0.424302,0.380686
min,0.0,1750000.0,1650.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,136.0,3430000.0,3600.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,272.0,4340000.0,4600.0,3.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5
75%,408.0,5740000.0,6360.0,3.0,2.0,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0
max,544.0,13300000.0,16200.0,6.0,4.0,4.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0


Preprocessing & Feature Engineering

In [12]:
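import pandas as pd

# Reload the dataset so this cell can be run on its own
# (redundant if the cells above have already been executed).
url = "https://raw.githubusercontent.com/m-mahdavi/teaching/refs/heads/main/datasets/housing.csv"
df = pd.read_csv(url)

# List the column names, then preview the first five rows.
print(df.columns)
df.head()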

Index(['id', 'price', 'area', 'bedrooms', 'bathrooms', 'stories', 'stories.1',
       'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
       'parking', 'prefarea', 'furnishingstatus'],
      dtype='object')


index,id,price,area,bedrooms,bathrooms,stories,stories.1,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,0,13300000,7420,4,2,3,1,0,0,0,1,2,1,1.0
1,1,12250000,8960,4,4,4,1,0,0,0,1,3,0,1.0
2,2,12250000,9960,3,2,2,1,0,1,0,0,2,1,0.5
3,3,12215000,7500,4,2,2,1,0,1,0,1,3,1,1.0
4,4,11410000,7420,4,1,2,1,1,1,0,1,2,0,1.0


In [13]:
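# Target is price; features are all other columns, including `id`.
# Caution: rows appear sorted by descending price, so `id` may leak the target.
y = df["price"]
X = df.drop("price", axis=1)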
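
Whether an identifier column carries information about the target can be checked before it is used as a feature. The sketch below rebuilds the five previewed rows by hand (three columns only) and tests whether price never increases as `id` grows. The names `head`, `sorted_by_price`, and `X_no_id` are hypothetical, and dropping `id` is shown as an alternative, not as what the cell above does.

```python
import pandas as pd

# First five rows as shown in the df.head() preview (subset of columns).
head = pd.DataFrame({
    "id":    [0, 1, 2, 3, 4],
    "price": [13300000, 12250000, 12250000, 12215000, 11410000],
    "area":  [7420, 8960, 9960, 7500, 7420],
})

# If price never increases as id grows, the identifier encodes the ranking
# of the target and is a questionable feature.
sorted_by_price = head["price"].is_monotonic_decreasing
print(sorted_by_price)

# Alternative feature matrix (an assumption, not the notebook's choice):
# drop the identifier together with the target.
X_no_id = head.drop(columns=["id", "price"])
print(list(X_no_id.columns))
```

On the full data the same check would be `df["price"].is_monotonic_decreasing`. Five rows are too few to settle the question, so the result here is only a hint.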



Split into training and test sets

In [14]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # hold out 20% of the rows for testing
    random_state=42   # fixed seed so the split is reproducible
)
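
The effect of `test_size` and `random_state` can be seen on stand-in data with the same number of rows as the housing set (545). This is a minimal sketch: `X_demo`, `y_demo`, and the other names are made up for illustration, and only `train_test_split` itself comes from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 545 rows; the values themselves are arbitrary.
X_demo = np.arange(545 * 2).reshape(545, 2)
y_demo = np.arange(545)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Same seed, same split: a second call selects the same test rows.
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(y_te, y_te_again)
print(same_split)
```

Every row lands in exactly one of the two sets, so the training and test sizes always add up to the original row count.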


Train linear regression model

In [16]:
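# All four arguments are set to their default values and are listed only
# to make the available hyperparameters explicit.
lr = LinearRegression(
    fit_intercept=True,   # learn a bias term in addition to the weights
    copy_X=True,          # leave the input data unmodified
    positive=False,       # allow negative coefficients
    n_jobs=None           # or -1 to use all CPU cores
)

# Fit on the training set, then inspect the learned parameters.
lr.fit(X_train, y_train)
print("Coefficients:", lr.coef_)
print("Intercept:", lr.intercept_)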

Coefficients: [-9.54311688e+03  3.06470980e+01 -2.20348762e+04  3.93482077e+05
  6.93123510e+04 -1.09426289e+05 -7.38625266e+04 -5.37451245e+04
  1.46330240e+05  1.36811768e+05  1.09271486e+05  1.22602247e+05
 -3.08870242e+04]
Intercept: 6600095.193022061
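
The printed array is easier to read once each value is paired with its feature. `lr.coef_` follows the column order of `X_train`, which is the column list of `df` with `price` removed. The sketch below copies the printed values into a labelled Series; the names `features`, `values`, and `coef` are hypothetical.

```python
import pandas as pd

# Feature order: the columns of df with "price" removed.
features = ["id", "area", "bedrooms", "bathrooms", "stories", "stories.1",
            "guestroom", "basement", "hotwaterheating", "airconditioning",
            "parking", "prefarea", "furnishingstatus"]

# Values copied from the printed lr.coef_ output above.
values = [-9.54311688e+03, 3.06470980e+01, -2.20348762e+04, 3.93482077e+05,
          6.93123510e+04, -1.09426289e+05, -7.38625266e+04, -5.37451245e+04,
          1.46330240e+05, 1.36811768e+05, 1.09271486e+05, 1.22602247e+05,
          -3.08870242e+04]

coef = pd.Series(values, index=features)

# Sort by absolute size. Raw magnitudes depend on each feature's units,
# so this ordering is not a ranking of feature importance.
order = coef.abs().sort_values(ascending=False).index
print(coef.reindex(order))
```

Inside the notebook the same Series can be built directly with `pd.Series(lr.coef_, index=X_train.columns)`.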


Evaluate on test set & check performance

In [17]:
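from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Predict prices for the held-out test rows.
y_pred = lr.predict(X_test)

# Compare the predictions with the true prices.
mse = mean_squared_error(y_test, y_pred)   # mean squared error (squared price units)
mae = mean_absolute_error(y_test, y_pred)  # mean absolute error (price units)
r2 = r2_score(y_test, y_pred)              # coefficient of determination

print(f"MSE: {mse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"R²: {r2:.3f}")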

MSE: 753837202332.18
MAE: 492478.42
R²: 0.851
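
MSE is measured in squared price units, so its square root (RMSE) is easier to compare with MAE. The sketch below derives RMSE from the MSE printed above and then recomputes R² by hand on a few toy values to show what `r2_score` calculates. The toy arrays and the names `mse_reported`, `y_true`, and `y_hat` are assumptions made for illustration.

```python
import math
import numpy as np
from sklearn.metrics import r2_score

# RMSE from the MSE printed above; it is in the same units as price.
mse_reported = 753837202332.18
rmse = math.sqrt(mse_reported)
print(f"RMSE: {rmse:.2f}")

# R² by hand on toy values: 1 - SS_res / SS_tot.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])
ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_hat))
```

An RMSE well above the MAE, as here, usually means a few large errors dominate the squared-error average.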
