# Linear Regression Analysis for Real Estate Price

## Objective
Apply linear regression to predict housing prices using a real estate dataset.

## Steps
1. Load dataset.
2. Understand Information.
3. Data Preprocessing.
4. Model Fitting.
5. Prediction and Evaluation.
6. Discussion.

## Setup Environment

In [None]:
import pandas as pd

## 1. Load Dataset

We begin by loading a real estate dataset that includes housing prices and relevant features.

In [44]:
csv_file = 'Real_Estate.csv'

data = pd.read_csv(csv_file)
data.head()

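|   | No | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores | X5 latitude | X6 longitude | Y house price of unit area |
|---|----|---------------------|--------------|----------------------------------------|---------------------------------|-------------|--------------|----------------------------|
| 0 | 1 | 2012.917 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
| 1 | 2 | 2012.917 | 19.5 | 306.5947 | 9 | 24.98034 | 121.53951 | 42.2 |
| 2 | 3 | 2013.583 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 47.3 |
| 3 | 4 | 2013.5 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 54.8 |
| 4 | 5 | 2012.833 | 5.0 | 390.5684 | 5 | 24.97937 | 121.54245 | 43.1 |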


## 2. Understand Information

In [46]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   No                                      414 non-null    int64  
 1   X1 transaction date                     414 non-null    float64
 2   X2 house age                            414 non-null    float64
 3   X3 distance to the nearest MRT station  414 non-null    float64
 4   X4 number of convenience stores         414 non-null    int64  
 5   X5 latitude                             414 non-null    float64
 6   X6 longitude                            414 non-null    float64
 7   Y house price of unit area              414 non-null    float64
dtypes: float64(6), int64(2)
memory usage: 26.0 KB
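
`data.info()` reports 414 non-null entries in every column and only numeric dtypes. Two further checks that are often useful at this stage are a per-column missing-value count and the correlation of each feature with the target. The sketch below runs them on a small synthetic frame so that it works without the CSV file; the values and the relationship between distance and price are invented for illustration, and only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset: all values are invented.
rng = np.random.default_rng(3)
n = 100
distance = rng.uniform(50, 5000, n)
frame = pd.DataFrame({
    'X3 distance to the nearest MRT station': distance,
    'X4 number of convenience stores': rng.integers(0, 11, n),
    # Assumed relationship: price falls as distance grows, plus noise.
    'Y house price of unit area': 50 - 0.005 * distance + rng.normal(scale=3, size=n),
})

target = 'Y house price of unit area'

# Number of missing values in each column.
missing_per_column = frame.isna().sum()

# Count, mean, std, min, quartiles, and max for each numeric column.
summary = frame.describe()

# Pearson correlation of each feature with the target.
corr_with_target = frame.corr()[target].drop(target)

print(missing_per_column)
print(summary)
print(corr_with_target)
```

On the notebook's own frame, the same three calls would be applied to `data` instead of `frame`.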


## 3. Data Preprocessing

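We handle any missing values, encode categorical variables, and standardize the numerical features. Because `data.info()` shows that every column is numeric and has 414 non-null entries, the imputation and encoding steps leave this dataset unchanged; they are kept as safeguards.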

In [49]:
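from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Replace missing values with the column mean
data.fillna(data.mean(), inplace=True)

# One-hot encode any categorical columns
data = pd.get_dummies(data)

# Separate the features from the target
X = data.drop('Y house price of unit area', axis=1)
y = data['Y house price of unit area']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)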
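
The `No` column appears to be a running row number rather than a property of the house, so it may be worth leaving out of the feature matrix. The sketch below shows one way to do that and confirms that the scaler statistics come from the training split only. It uses a small synthetic frame (invented values, real column names) and hypothetical `_demo` variable names so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for Real_Estate.csv: values are invented, not real data.
rng = np.random.default_rng(0)
n = 50
demo = pd.DataFrame({
    'No': np.arange(1, n + 1),
    'X2 house age': rng.uniform(0, 40, n),
    'X3 distance to the nearest MRT station': rng.uniform(50, 5000, n),
    'Y house price of unit area': rng.uniform(10, 60, n),
})

target = 'Y house price of unit area'

# Drop the target and the row identifier from the feature matrix.
X_demo = demo.drop(columns=[target, 'No'])
y_demo = demo[target]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the training split only, then reuse those statistics on the test split.
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # close to 0 for each feature
print(X_tr_scaled.std(axis=0))   # close to 1 for each feature
```

The test split is transformed with the training mean and standard deviation, so its columns will generally not have exactly zero mean or unit variance.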

## 4. Model Fitting

We will use linear regression to fit the model to our training data.

In [72]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

reg.fit(X_train, y_train)
reg
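
`LinearRegression` fits an ordinary least squares model: it chooses the coefficients and intercept that minimize the sum of squared residuals. As a rough illustration, the sketch below fits the same synthetic problem with scikit-learn and with `numpy.linalg.lstsq` on a design matrix that has an explicit column of ones, and the two solutions should agree. The data, `true_w`, and the intercept value of 4.0 are assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients (invented for illustration).
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_syn = X_syn @ true_w + 4.0 + rng.normal(scale=0.1, size=100)

# Fit with scikit-learn.
model = LinearRegression().fit(X_syn, y_syn)

# Solve the same least squares problem directly.
# The appended column of ones plays the role of the intercept.
A = np.column_stack([X_syn, np.ones(len(X_syn))])
solution, *_ = np.linalg.lstsq(A, y_syn, rcond=None)

print('sklearn coefficients:', model.coef_, 'intercept:', model.intercept_)
print('lstsq coefficients:  ', solution[:-1], 'intercept:', solution[-1])
```

In the notebook, the fitted values are available as `reg.coef_` and `reg.intercept_`. Since the features were standardized, each coefficient is the change in predicted price per one standard deviation of that feature.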

## 5. Prediction and Evaluation

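We predict housing prices with the fitted model and evaluate its performance. Note that the cells below call `reg.predict(X_train)`, so the reported scores measure in-sample fit on the training set; scoring `reg.predict(X_test)` against `y_test` would give the out-of-sample estimate.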

In [73]:
pred = reg.predict(X_train)
pred

array([38.68928931, 33.98663039, 32.18637377, 40.46470034, 45.42448177,
       33.9548707 , 40.15323131, 12.53301733, 39.91916791, 41.42478505,
       49.54408363, 47.6937185 , 39.40831706, 35.05171431, 43.76591051,
       30.98146663, 48.52421431, 29.48982739, 33.39842351, 43.87098879,
       30.80915741, 48.20923355, 48.46743841, 30.33852893, 25.46352144,
       41.02925977, 38.4083065 , 37.15837455, 40.42563123, 48.5126692 ,
       46.57150295, 19.40115745, 40.42537053, 44.91464581, 43.7788366 ,
       31.34851131, 33.09693918, 45.60293602, 15.62861649, 51.77255574,
       47.8103501 , 37.5961623 , 48.05813635, 15.90529108, 48.49093796,
       12.78354048, 29.79921903, 34.34277268, 40.03041937, 46.3614422 ,
       39.8541305 , 37.49625336, 33.54816526, 37.70853697, 26.71423492,
       34.6338155 , 25.22665633, 37.24813104, 43.50413125, 53.29632729,
       35.97599037, 48.10716824, 40.9995586 , 42.15449284, 43.82817656,
       44.86204615, 30.44301133, 44.13973642, 31.52491836, 39.87

In [76]:
from sklearn import metrics

r2 = metrics.r2_score(y_train, pred)

print(f'R-Squared: {r2}')

R-Squared: 0.5604074935510763


In [77]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_train, pred)

print(f'Mean Squared Error: {mse}')

Mean Squared Error: 82.68317541554838
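
The scores above are computed on `X_train` and `y_train`. A minimal sketch of reporting the same metrics for both splits is shown below. It uses synthetic data and hypothetical `_demo` names so that it runs without the CSV file, and its numbers say nothing about the real estate dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (invented for illustration).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.0]) + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
reg_demo = LinearRegression().fit(X_tr, y_tr)

# Score the same fitted model on the training and the held-out split.
scores = {}
for name, X_part, y_part in [('train', X_tr, y_tr), ('test', X_te, y_te)]:
    pred_part = reg_demo.predict(X_part)
    scores[name] = {
        'r2': r2_score(y_part, pred_part),
        'mse': mean_squared_error(y_part, pred_part),
    }
    print(f"{name}: R-Squared = {scores[name]['r2']:.3f}, MSE = {scores[name]['mse']:.3f}")
```

In the notebook this would amount to calling `reg.predict(X_test)` and passing the result with `y_test` to `metrics.r2_score` and `mean_squared_error`. A test score far below the training score would suggest overfitting.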


## 6. Discussion

We analyze the model's accuracy and discuss potential improvements and biases.

### Model Accuracy
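- The Mean Squared Error (MSE) is the average squared difference between the predicted and actual values; lower values are better. The training MSE of about 82.7 corresponds to a root mean squared error of about 9.1, expressed in the same units as the house price of unit area.
- The R-squared value is the proportion of the variance in the dependent variable that is predictable from the independent variables; values closer to 1 indicate a better fit. The training R-squared of about 0.56 means the model accounts for roughly 56% of the variance in price on the data it was fitted to.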
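
To make the two metrics concrete, the sketch below computes them by hand on four made-up values, using MSE = mean of squared residuals and R-squared = 1 - SS_res / SS_tot. The arrays and the `_demo` names are illustrative assumptions, not values from the dataset.

```python
import numpy as np

# Four invented observations and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_hat                      # [1, 0, -1, 0]

# Mean squared error and its square root (same units as the target).
mse_demo = np.mean(residuals ** 2)              # (1 + 0 + 1 + 0) / 4
rmse_demo = np.sqrt(mse_demo)

# R-squared: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum(residuals ** 2)                 # 2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20
r2_demo = 1 - ss_res / ss_tot

print(f'MSE: {mse_demo:.4f}')        # MSE: 0.5000
print(f'RMSE: {rmse_demo:.4f}')      # RMSE: 0.7071
print(f'R-Squared: {r2_demo:.4f}')   # R-Squared: 0.9000
```

These are the same quantities that `mean_squared_error` and `r2_score` return for the same inputs.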

### Potential Improvements

- Use more advanced regression techniques (e.g., Lasso).
- Incorporate additional relevant features.
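
As one possible illustration of a regularized alternative, the sketch below compares ordinary least squares with scikit-learn's `Lasso` on synthetic data in which only the first two of five features matter. The data, `true_coef`, and the penalty strength `alpha=0.5` are assumptions chosen for the example; on real data `alpha` would need tuning.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: only the first two features influence the target.
rng = np.random.default_rng(5)
X_reg = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y_reg = X_reg @ true_coef + rng.normal(scale=1.0, size=200)

# Plain least squares keeps a small nonzero weight on every feature.
ols = LinearRegression().fit(X_reg, y_reg)

# The L1 penalty shrinks all weights and can set weak ones exactly to zero.
lasso = Lasso(alpha=0.5).fit(X_reg, y_reg)

print('OLS coefficients:  ', np.round(ols.coef_, 3))
print('Lasso coefficients:', np.round(lasso.coef_, 3))
```

The trade-off is that Lasso also shrinks the coefficients of the informative features, so the penalty is usually selected by cross-validation, for example with `LassoCV`.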

### Potential Biases

- The dataset may not be representative of all housing markets.
- The model may overfit or underfit if not properly validated.
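
One common way to check for overfitting or underfitting is k-fold cross-validation. The sketch below wraps scaling and regression in a pipeline so that the scaler is refit inside each fold, then reports R-squared per fold. The synthetic data, the choice of 5 folds, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (invented for illustration).
rng = np.random.default_rng(7)
X_cv = rng.normal(size=(150, 3))
y_cv = X_cv @ np.array([1.5, -0.5, 2.0]) + rng.normal(scale=0.5, size=150)

# Scaling lives inside the pipeline, so each fold scales with its own training rows.
pipeline = make_pipeline(StandardScaler(), LinearRegression())

# Five shuffled folds with a fixed seed for reproducibility.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_r2 = cross_val_score(pipeline, X_cv, y_cv, cv=cv, scoring='r2')

print('R-Squared per fold:', np.round(cv_r2, 3))
print(f'Mean: {cv_r2.mean():.3f}, Std: {cv_r2.std():.3f}')
```

A large spread across folds, or a mean well below the training score, would suggest that the single train/test split is not telling the whole story.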