# Missing Value Imputation Using Linear Regression  
### Real Estate Price Dataset

**Goal:** Demonstrate how to handle missing target values using a predictive modeling approach instead of simple removal or average imputation.

**Dataset:** Real_Estate_with_Missing.csv  
**Guided by:** Aman Kharwal

This case study shows how we can use a **Linear Regression model to estimate missing pricing values** based on other available property features.
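
For contrast, the two simpler strategies mentioned in the goal can be sketched on a tiny made-up series. The values and the names `prices`, `dropped`, and `mean_filled` are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing entry (not from the real dataset).
prices = pd.Series([20.0, np.nan, 40.0, 60.0], name="price")

# Option 1: simple removal. The row with the missing price is discarded.
dropped = prices.dropna()

# Option 2: average imputation. Every gap receives the same value (the mean).
mean_filled = prices.fillna(prices.mean())

print(len(dropped))
print(mean_filled.tolist())
```

Removal loses the row's other features, and mean imputation ignores them; a regression-based estimate uses them instead.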

In [19]:
# Importing essential libraries for data manipulation and modeling
import pandas as pd
import numpy as np


#### Step 1: Load the Dataset
We check the **first few rows** to understand column names and confirm the dataset loaded correctly.

In [3]:
df = pd.read_csv(r"C:\Users\Pratik\Downloads\Real_Estate_with_Missing.csv")
df.head()

| | Transaction date | House age | Distance to the nearest MRT station | Number of convenience stores | Latitude | Longitude | House price of unit area |
|---|---|---|---|---|---|---|---|
| 0 | 2012-09-02 16:42:30.519336 | 13.3 | 4082.015 | 8 | 25.007059 | 121.561694 | 6.488673 |
| 1 | 2012-09-04 22:52:29.919544 | 35.5 | 274.0144 | 2 | 25.012148 | 121.54699 | 24.970725 |
| 2 | 2012-09-05 01:10:52.349449 | 1.1 | 1978.671 | 10 | 25.00385 | 121.528336 | 26.694267 |
| 3 | 2012-09-05 13:26:01.189083 | 22.2 | 1055.067 | 5 | 24.962887 | 121.482178 | 38.091638 |
| 4 | 2012-09-06 08:29:47.910523 | 8.5 | 967.4 | 6 | 25.011037 | 121.479946 | 21.65471 |


Now, we will **identify missing values** and check if there are any 0s that should also be considered missing:

In [4]:
df.isnull().sum()

Transaction date                        0
House age                               0
Distance to the nearest MRT station     0
Number of convenience stores            0
Latitude                                0
Longitude                               0
House price of unit area               10
dtype: int64

In [5]:
(df==0).sum()

Transaction date                        0
House age                              22
Distance to the nearest MRT station     0
Number of convenience stores           58
Latitude                                0
Longitude                               0
House price of unit area               31
dtype: int64

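A zero is plausible for `House age` (a new building) and for `Number of convenience stores` (none nearby), but not for a unit-area price. We therefore treat the 31 zeros in `House price of unit area` as missing and **replace them with NaN**: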

In [9]:
import numpy as np
# A unit-area price of 0 is not a real observation, so mark it as missing
df["House price of unit area"] = df["House price of unit area"].replace(0, np.nan)
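
The column-specific replacement can be checked on a small made-up frame. This is a minimal sketch: the column names mirror the dataset, but the values and the name `toy` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "House age": [0.0, 12.5, 30.1],                    # 0 is plausible here
    "House price of unit area": [0.0, 41.2, np.nan],   # 0 is not a real price
})

# Replace zeros only in the column where zero cannot be a valid value.
toy["House price of unit area"] = toy["House price of unit area"].replace(0, np.nan)

# Count missing values per column after the replacement.
print(toy.isna().sum())
```

Restricting `replace` to one column leaves the legitimate zeros in `House age` untouched.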

#### Step 2: Define Features and Target Variable
Since we are using **Linear Regression**, we need to define the **independent variables (features)** and the **dependent variable (target)**:

In [10]:
features = ["House age","Distance to the nearest MRT station","Number of convenience stores","Latitude"]
target = "House price of unit area"

Next, we split the rows into those with a known price (used to fit the model) and those with a missing price (to be predicted):

In [11]:
df_complete = df[df[target].notna()]
df_missing = df[df[target].isna()]

We then fit the model on the complete rows, setting aside 20% of them as a test set:

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hold out 20% of the complete rows; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    df_complete[features], df_complete[target], test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
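
The held-out split (`X_test`, `y_test`) can be used to gauge how trustworthy the imputed prices are likely to be. The sketch below runs on synthetic data, so the column names `age`, `distance`, `price`, the coefficients, and the resulting scores are all assumptions for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the complete rows (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "age": rng.uniform(0, 40, n),
    "distance": rng.uniform(50, 5000, n),
})
demo["price"] = 50 - 0.3 * demo["age"] - 0.005 * demo["distance"] + rng.normal(0, 2, n)

# Same split settings as in the notebook cell above.
X_train, X_test, y_train, y_test = train_test_split(
    demo[["age", "distance"]], demo["price"], test_size=0.2, random_state=42
)
demo_model = LinearRegression().fit(X_train, y_train)

# Score the model on rows it has not seen during fitting.
y_pred = demo_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}, R^2: {r2:.2f}")
```

If the error on the real test set is large relative to typical prices, the imputed values should be treated with matching caution.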

#### Step 3: Predicting the Missing Values
Now, we will use our trained model to predict the missing values:

In [13]:
# df_missing was selected with the same isna() mask, so predictions line up row for row
df.loc[df[target].isna(), target] = model.predict(df_missing[features])
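
The fill step can be traced on a toy frame where the answer is known in advance. This is a sketch under stated assumptions: the columns `x` and `y`, the exact relationship `y = 2x`, and the `y_was_imputed` flag column are hypothetical additions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x exactly, with two target values missing.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, np.nan, 8.0, np.nan],
})

# Compute the mask once and reuse it so rows and predictions stay aligned.
missing_mask = toy["y"].isna()
toy["y_was_imputed"] = missing_mask  # hypothetical flag marking filled rows

# Fit on the complete rows, then predict only the rows with a missing target.
toy_model = LinearRegression().fit(toy.loc[~missing_mask, ["x"]], toy.loc[~missing_mask, "y"])
toy.loc[missing_mask, "y"] = toy_model.predict(toy.loc[missing_mask, ["x"]])

print(toy)
```

Keeping a flag column is optional, but it lets later analysis tell observed prices apart from model estimates.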

A final check confirms that no missing values remain:

In [14]:
df.isnull().sum()

Transaction date                       0
House age                              0
Distance to the nearest MRT station    0
Number of convenience stores           0
Latitude                               0
Longitude                              0
House price of unit area               0
dtype: int64

Finally, we save the cleaned dataset to a new CSV file:

In [15]:
df.to_csv("Real_estate_cleaned.csv",index = False)
