
# Objective of this notebook

Dataset link: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

Handle missing values in categorical features.

The following methods are implemented:
- Method 1: Mode imputation - generally works well for a feature with few NaNs
- Method 2: New indicator feature to capture NaNs, followed by mode imputation
- Method 3: Replace with a new label - if a feature has many frequent labels, replace NaN with a new label





# Data import

In [1]:
import pandas as pd
import numpy as np

In [3]:
df=pd.read_csv('/content/drive/MyDrive/Practice Datasets/house-prices-advanced-regression-techniques/train.csv',usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])

In [4]:
df.head()

Unnamed: 0,BsmtQual,FireplaceQu,GarageType,SalePrice
0,Gd,,Attchd,208500
1,Gd,TA,Attchd,181500
2,Gd,TA,Attchd,223500
3,TA,Gd,Detchd,140000
4,Gd,TA,Attchd,250000


In [7]:
df.shape

(1460, 4)

In [8]:
df.isnull().sum()

Unnamed: 0,0
BsmtQual,37
FireplaceQu,690
GarageType,81
SalePrice,0


Out of 1460 records, BsmtQual is missing in 37 (about 2.5%), GarageType in 81 (about 5.5%), and FireplaceQu in 690 (about 47%).
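
To decide which method suits each column, it can help to look at the share of missing values rather than the raw counts. The sketch below runs on a small made-up frame (`toy` is hypothetical, not the Kaggle data); on the notebook's data the equivalent call would be `df.isnull().mean()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Kaggle columns
toy = pd.DataFrame({
    "BsmtQual": ["Gd", "TA", np.nan, "Gd"],
    "FireplaceQu": [np.nan, "TA", np.nan, "Gd"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# isnull() yields booleans; their mean is the fraction of missing rows per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a small share are candidates for plain mode imputation; columns with a large share are usually better served by an indicator column or a dedicated label.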

# Method 1

When only a few NaNs are present, replacing them with the most frequently occurring value does not noticeably distort the distribution.

In [14]:
# compute frequency of labels
df['BsmtQual'].value_counts(ascending=False)

BsmtQual,count
TA,649
Gd,618
Ex,121
Fa,35


In [13]:
df['GarageType'].value_counts(ascending=False)

GarageType,count
Attchd,870
Detchd,387
BuiltIn,88
Basment,19
CarPort,9
2Types,6


In [21]:
# replacing function: fill NaNs in `feature` with the given label
def replace(df,feature,label):
  # assign the result back; inplace=True on a column selection triggers a pandas warning
  df[feature]=df[feature].fillna(label)

In [22]:
# fill with the most frequent label (the mode), not with its count
replace(df,'BsmtQual','TA')
replace(df,'GarageType','Attchd')


In [23]:
df.isnull().sum()

Unnamed: 0,0
BsmtQual,0
FireplaceQu,690
GarageType,0
SalePrice,0
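
One way to check that mode imputation barely shifts the distribution when NaNs are few is to compare label shares before and after filling. This is a minimal sketch on made-up data; `mode_impute` and `toy` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

def mode_impute(series: pd.Series) -> pd.Series:
    """Return a copy with NaNs replaced by the most frequent label."""
    most_frequent = series.mode()[0]  # mode() ignores NaN by default
    return series.fillna(most_frequent)

# Hypothetical column: 10 rows, 1 of them missing
toy = pd.Series(["TA"] * 5 + ["Gd"] * 3 + ["Ex"] + [np.nan])

before = toy.value_counts(normalize=True)              # shares among non-missing rows
after = mode_impute(toy).value_counts(normalize=True)  # shares after filling

print(pd.DataFrame({"before": before, "after": after}))
```

The larger the missing share, the more the mode's share is inflated, which is why this method is best kept for columns with few NaNs.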


In [41]:
df['FireplaceQu'].mode()

Unnamed: 0,FireplaceQu
0,Gd


In [42]:
replace(df,'FireplaceQu','Gd')



# Method 2 - Adding a new column to capture NaN

In [24]:
df2=pd.read_csv('/content/drive/MyDrive/Practice Datasets/house-prices-advanced-regression-techniques/train.csv',usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])

In [25]:
df2.head(1)

Unnamed: 0,BsmtQual,FireplaceQu,GarageType,SalePrice
0,Gd,,Attchd,208500


In [28]:
# capturing NAN values
df2['BsmtQual_NAN']=np.where(df2['BsmtQual'].isnull(),1,0)

In [39]:
# mode imputation (assign back instead of inplace=True on a column selection)
df2['BsmtQual']=df2['BsmtQual'].fillna(df2['BsmtQual'].mode()[0])


In [43]:
df2['FireplaceQu_NAN']=np.where(df2['FireplaceQu'].isnull(),1,0)
df2['GarageType_NAN']=np.where(df2['GarageType'].isnull(),1,0)

In [45]:
df2['GarageType']=df2['GarageType'].fillna(df2['GarageType'].mode()[0])
df2['FireplaceQu']=df2['FireplaceQu'].fillna(df2['FireplaceQu'].mode()[0])


In [46]:
df2.head()

Unnamed: 0,BsmtQual,FireplaceQu,GarageType,SalePrice,BsmtQual_NAN,FireplaceQu_NAN,GarageType_NAN
0,Gd,Gd,Attchd,208500,0,1,0
1,Gd,TA,Attchd,181500,0,0,0
2,Gd,TA,Attchd,223500,0,0,0
3,TA,Gd,Detchd,140000,0,0,0
4,Gd,TA,Attchd,250000,0,0,0
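
The two steps of Method 2 can be wrapped in one helper so that the indicator is always created before the NaNs are overwritten. This is a minimal sketch; `add_nan_flag_and_impute` and the toy frame are hypothetical, and the `_NAN` suffix simply follows the column names used above.

```python
import numpy as np
import pandas as pd

def add_nan_flag_and_impute(frame: pd.DataFrame, features: list) -> pd.DataFrame:
    """Add a 0/1 <feature>_NAN column, then mode-impute each feature."""
    out = frame.copy()  # leave the caller's frame untouched
    for feature in features:
        # The flag must be computed before fillna, otherwise it would be all zeros
        out[feature + "_NAN"] = out[feature].isnull().astype(int)
        out[feature] = out[feature].fillna(out[feature].mode()[0])
    return out

# Hypothetical toy data, not the Kaggle file
toy = pd.DataFrame({
    "FireplaceQu": [np.nan, "TA", "TA", "Gd"],
    "GarageType": ["Attchd", np.nan, "Attchd", "Detchd"],
})
result = add_nan_flag_and_impute(toy, ["FireplaceQu", "GarageType"])
print(result)
```

Returning a copy keeps the original frame available for comparison, so the CSV does not need to be re-read for each method.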


# Method 3
When a feature has many frequent labels, replace NaN with a new label (here, 'missing') instead of the mode.

In [47]:
df3=pd.read_csv('/content/drive/MyDrive/Practice Datasets/house-prices-advanced-regression-techniques/train.csv',usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])

In [48]:
def impute_nan(df,feature):
  # keep the original label where present; use the new label 'missing' where it is NaN
  df[feature+"newvar"]=np.where(df[feature].isnull(),'missing',df[feature])

In [49]:
for i in ['BsmtQual','FireplaceQu','GarageType']:
  impute_nan(df3,i)

In [50]:
df3.head()

Unnamed: 0,BsmtQual,FireplaceQu,GarageType,SalePrice,BsmtQualnewvar,FireplaceQunewvar,GarageTypenewvar
0,Gd,,Attchd,208500,Gd,missing,Attchd
1,Gd,TA,Attchd,181500,Gd,TA,Attchd
2,Gd,TA,Attchd,223500,Gd,TA,Attchd
3,TA,Gd,Detchd,140000,TA,Gd,Detchd
4,Gd,TA,Attchd,250000,Gd,TA,Attchd
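
The same new-label column can also be built with `fillna`, which avoids the round trip through `np.where`. This is a minimal sketch on made-up data; `add_missing_label` and `toy` are hypothetical names, and the `newvar` suffix mirrors the notebook's naming.

```python
import numpy as np
import pandas as pd

def add_missing_label(frame: pd.DataFrame, feature: str, label: str = "missing") -> pd.DataFrame:
    """Return a copy with a <feature>newvar column where NaN becomes `label`."""
    out = frame.copy()
    out[feature + "newvar"] = out[feature].fillna(label)
    return out

# Hypothetical toy data, not the Kaggle file
toy = pd.DataFrame({"FireplaceQu": [np.nan, "TA", "Gd", np.nan]})
result = add_missing_label(toy, "FireplaceQu")

# 'missing' now shows up as its own category in the counts
print(result["FireplaceQunewvar"].value_counts())
```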


In [51]:
# drop returns a new frame, so assign it back to keep the result
df3=df3.drop(['BsmtQual','FireplaceQu','GarageType'],axis=1)
df3

Unnamed: 0,SalePrice,BsmtQualnewvar,FireplaceQunewvar,GarageTypenewvar
0,208500,Gd,missing,Attchd
1,181500,Gd,TA,Attchd
2,223500,Gd,TA,Attchd
3,140000,TA,Gd,Detchd
4,250000,Gd,TA,Attchd
...,...,...,...,...
1455,175000,Gd,TA,Attchd
1456,210000,Gd,TA,Attchd
1457,266500,TA,Gd,Attchd
1458,142125,TA,missing,Attchd
