**Importing modules**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

**Loading Data**

In [2]:
data = pd.read_csv('/content/melb_data.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

**Creating Model**

In [3]:
y = data.Price

predictor = data.drop(['Price'], axis=1)
X = predictor.select_dtypes(exclude=['object'])

X_train,X_val,y_train,y_val = train_test_split(X,y,train_size=0.8,test_size=0.2, random_state=0)

#train a random forest on the training data and return its MAE on the validation data
def score(X_train,X_val,y_train,y_val):
  model = RandomForestRegressor(random_state=0)
  model.fit(X_train,y_train)
  prediction = model.predict(X_val)
  return mean_absolute_error(y_val,prediction)

# **Handling missing values**

**1. Dropping columns with missing values**

Advantage: no missing values remain in the dataset.

Drawback: any column with even a single missing value is removed, so potentially important features are lost, which can reduce predictive power.
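As a minimal sketch of this approach, the snippet below finds and drops the columns with missing values in a tiny toy frame. The values are hypothetical and are not taken from melb_data.csv; only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Car": [1.0, np.nan, 2.0, 1.0],
    "BuildingArea": [np.nan, 120.0, np.nan, 95.0],
})

# Columns that contain at least one missing value
cols_with_missing = [col for col in toy.columns if toy[col].isnull().any()]

# Drop those columns entirely; every row is kept
reduced = toy.drop(columns=cols_with_missing)

print(cols_with_missing)
print(reduced.shape)
```

Note that "Car" is dropped even though only one of its four values is missing, which is the loss of information this approach risks.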

In [4]:
#find the columns that have missing values in the training data
missing_value_col = [col for col in X_train.columns if X_train[col].isnull().any()]


#drop those columns from both the training and the validation data
new_X_train = X_train.drop(missing_value_col,axis=1)
new_X_val = X_val.drop(missing_value_col,axis=1)

#evaluating model
print("\nMean absolute error for (Dropping columns with missing values) :")
approach1 = score(new_X_train,new_X_val,y_train,y_val)
print(approach1)


Mean absolute error for (Dropping columns with missing values) :
175703.48185157913


**2. Imputation**

We use SimpleImputer to replace the missing values in each column with a statistic computed from that column: the mean (the default), the median, or the most frequent value (mode).

Advantage: all features are retained, preserving valuable information.

Drawback: if the imputed values are biased or inaccurate, they may degrade model performance.
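To make the three statistics concrete, here is a small sketch on a single toy column. The values are hypothetical; `strategy` is a real `SimpleImputer` parameter, and `"most_frequent"` is the name scikit-learn uses for the mode.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with one missing value (hypothetical data)
toy = pd.DataFrame({"Car": [1.0, 1.0, np.nan, 2.0, 8.0]})

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    filled = imputer.fit_transform(toy)
    # Row 2 held the missing value; record what it was replaced with
    results[strategy] = float(filled[2, 0])

print(results)
```

The observed values are 1, 1, 2 and 8, so the three strategies fill the gap with different numbers. The mean is pulled upward by the outlier 8, while the median and the mode are not.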

In [5]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer()

#fit the imputer on X_train (learns one statistic per column, the mean by default) and fill its missing values
imputed_X_train = pd.DataFrame(imputer.fit_transform(X_train))

#fill the missing values in X_val with the statistics learned from X_train (the imputer is not refitted)
imputed_X_val = pd.DataFrame(imputer.transform(X_val))

#restoring the column names
imputed_X_train.columns = X_train.columns
imputed_X_val.columns = X_val.columns

#evaluating model
print("\nMean absolute error for (Imputation) :")
approach2 = score(imputed_X_train,imputed_X_val,y_train,y_val)
print(approach2)


Mean absolute error for (Imputation) :
169237.0268668034
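The imputation cell above rebuilds each DataFrame from a NumPy array, which resets the row index to 0..n-1 and requires the column names to be restored afterwards. A possible variant, sketched here on hypothetical toy frames rather than the real data, passes `columns` and `index` directly so both are preserved:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with non-default indices, as produced by a shuffled split (hypothetical data)
toy_train = pd.DataFrame(
    {"Car": [1.0, np.nan, 3.0], "Rooms": [2.0, 3.0, 4.0]}, index=[10, 42, 7]
)
toy_val = pd.DataFrame({"Car": [np.nan], "Rooms": [5.0]}, index=[99])

imputer = SimpleImputer()  # default strategy: column mean

# Fit on the training frame only, keeping its labels
filled_train = pd.DataFrame(
    imputer.fit_transform(toy_train),
    columns=toy_train.columns,
    index=toy_train.index,
)

# Reuse the training means for the validation frame
filled_val = pd.DataFrame(
    imputer.transform(toy_val),
    columns=toy_val.columns,
    index=toy_val.index,
)

print(filled_train)
print(filled_val)
```

The model score does not depend on the index, so this is a convenience rather than a fix; it mainly helps if the imputed frames are later joined back to `y_train` or to other columns by label.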


In [8]:
print("Handling Missing Values by : ")

print(f"\n1. Dropping columns with missing values")

#no. of columns
col_count1 = new_X_train.shape[1]
print(f"Number of columns after dropping: {col_count1}")
print(f"MAE : {approach1}")


print(f"\n2. Imputation")

#no. of columns
col_count2 = imputed_X_train.shape[1]
print(f"Number of columns after imputation: {col_count2}")
print(f"MAE : {approach2}")


print("\n\n*Imputation keeps the columns that contain missing values, so the model can still use those features; on this validation split its MAE is lower than with the column-dropping method.")

Handling Missing Values by : 

1. Dropping columns with missing values
Number of columns after dropping: 9
MAE : 175703.48185157913

2. Imputation
Number of columns after imputation: 12
MAE : 169237.0268668034


*Imputation keeps the columns that contain missing values, so the model can still use those features; on this validation split its MAE is lower than with the column-dropping method.
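To put the gap between the two approaches in perspective, the short calculation below expresses it as an absolute and a relative reduction in MAE. The two numbers are copied from the outputs above; the variable names are only illustrative.

```python
# MAE values copied from the notebook outputs above
mae_drop = 175703.48185157913    # dropping columns with missing values
mae_impute = 169237.0268668034   # mean imputation

abs_gain = mae_drop - mae_impute
rel_gain = abs_gain / mae_drop

print(f"Absolute reduction in MAE: {abs_gain:.2f}")
print(f"Relative reduction in MAE: {rel_gain:.2%}")
```

This comes to a reduction of roughly 3.7%. It is measured on a single train/validation split with one random seed, so it shows that imputation helped here rather than that it always will.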
