# Data Preprocessing

In this stage we transform and organize the cleaned data so that it is ready for analysis and for training Machine Learning models. The techniques used are:

1. One-hot encoding
2. Dropping unnecessary columns
       
**Excel file used: Cleaned-Delhi-Prices.xlsx**

**Imports**

In [1]:
import pandas as pd
import os

**Creating a DataFrame from the imported file**

In [2]:
# Build the input path relative to the current working directory
cwd = os.getcwd()
df = pd.read_excel(os.path.join(cwd, "Cleaned-Delhi-Prices.xlsx"))
df

Unnamed: 0,Locality,Area,BHK,Bathroom,Price (in Lakhs),Price per Sqft
0,Sector 5 Dwarka,1900,3,2,178.00,9368.421053
1,Sector 22 Dwarka,1500,2,2,175.00,11666.666667
2,Sector 4 Dwarka,1900,2,2,175.00,9210.526316
3,Sector 5 Dwarka,1900,3,3,175.00,9210.526316
4,Sector 9 Dwarka,1600,2,2,174.00,10875.000000
...,...,...,...,...,...,...
21078,Unknown,700,3,3,32.01,4572.857143
21079,Unknown,800,3,3,29.65,3706.250000
21080,Unknown,800,3,3,29.00,3625.000000
21081,Unknown,600,3,3,25.00,4166.666667


**One-hot encoding the `Locality` column**

In [3]:
dummies = pd.get_dummies(df.Locality)
dummies.head()

Unnamed: 0,Aali Village,Ali Gaon,Ali Vihar Sarita Vihar,Amrita Shergill Marg,Anand Niketan,Anand Vihar,Ashok Nagar,Ashok Vihar,Aurungzeb Road,Babar Road,...,Tuglak Road,Uday Park,Unknown,Uttam Nagar,Vasant Kunj,Vasant Vihar,Vasundhara Enclave,Vikas Puri,Vikaspuri,West End
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


To avoid the "dummy variable trap," drop one of the one-hot encoded columns for each categorical variable. The one-hot columns of a single variable always sum to 1 in every row, so any one of them is an exact linear combination of the others (together with the intercept, if the model has one). Keeping all of them therefore introduces perfect multicollinearity, which leaves the coefficients of a linear model without a unique solution. Here `Ali Gaon` is the dropped column, so it acts as the baseline locality: a row whose remaining locality columns are all `False` belongs to `Ali Gaon`.
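The linear dependence described above can be checked on a small toy frame. This is a minimal sketch, not part of the original notebook: the locality labels `A`, `B`, `C` and the variable names are made up for illustration, and NumPy is assumed to be available alongside pandas.

```python
import numpy as np
import pandas as pd

# Toy data: the locality labels are hypothetical, not taken from the dataset.
toy = pd.DataFrame({"Locality": ["A", "B", "C", "A", "B"]})
toy_dummies = pd.get_dummies(toy["Locality"], dtype=int)

# Each row has exactly one 1, so the dummy columns always sum to 1.
row_sums = toy_dummies.sum(axis=1)

# Add an intercept column and compare the full and reduced design matrices.
intercept = np.ones((len(toy_dummies), 1))
full = np.hstack([intercept, toy_dummies.to_numpy()])
reduced = np.hstack([intercept, toy_dummies.drop("A", axis="columns").to_numpy()])

# The full matrix has 4 columns but one is redundant; the reduced one has full column rank.
rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full, reduced.shape[1], rank_reduced)
```

The full matrix has more columns than its rank, which is the trap; dropping one dummy column removes the redundancy without losing information.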

In [4]:
df2 = pd.concat([df, dummies.drop('Ali Gaon', axis = 'columns')], axis = 'columns')
df2

Unnamed: 0,Locality,Area,BHK,Bathroom,Price (in Lakhs),Price per Sqft,Aali Village,Ali Vihar Sarita Vihar,Amrita Shergill Marg,Anand Niketan,...,Tuglak Road,Uday Park,Unknown,Uttam Nagar,Vasant Kunj,Vasant Vihar,Vasundhara Enclave,Vikas Puri,Vikaspuri,West End
0,Sector 5 Dwarka,1900,3,2,178.00,9368.421053,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,Sector 22 Dwarka,1500,2,2,175.00,11666.666667,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,Sector 4 Dwarka,1900,2,2,175.00,9210.526316,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,Sector 5 Dwarka,1900,3,3,175.00,9210.526316,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,Sector 9 Dwarka,1600,2,2,174.00,10875.000000,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21078,Unknown,700,3,3,32.01,4572.857143,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
21079,Unknown,800,3,3,29.65,3706.250000,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
21080,Unknown,800,3,3,29.00,3625.000000,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
21081,Unknown,600,3,3,25.00,4166.666667,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False


**Creating a final dataframe for Model Training**

In [5]:
model_df = df2.drop(['Locality', 'Price per Sqft'], axis = 'columns')
model_df

Unnamed: 0,Area,BHK,Bathroom,Price (in Lakhs),Aali Village,Ali Vihar Sarita Vihar,Amrita Shergill Marg,Anand Niketan,Anand Vihar,Ashok Nagar,...,Tuglak Road,Uday Park,Unknown,Uttam Nagar,Vasant Kunj,Vasant Vihar,Vasundhara Enclave,Vikas Puri,Vikaspuri,West End
0,1900,3,2,178.00,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,1500,2,2,175.00,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,1900,2,2,175.00,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,1900,3,3,175.00,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,1600,2,2,174.00,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21078,700,3,3,32.01,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
21079,800,3,3,29.65,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
21080,800,3,3,29.00,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
21081,600,3,3,25.00,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
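The encode, concatenate and drop steps above can also be collapsed into one `pd.get_dummies` call. The sketch below is an alternative on toy data, not the notebook's own method: the locality labels and the name `toy_model_df` are hypothetical. Note two differences from the cells above: `drop_first=True` drops the first category in sorted order rather than a column you choose, and the new columns carry a `Locality_` prefix.

```python
import pandas as pd

# Toy rows with the same kinds of columns as the notebook; locality labels are made up.
toy = pd.DataFrame({
    "Locality": ["B", "A", "C", "A"],
    "Area": [1900, 1500, 1600, 800],
    "Price (in Lakhs)": [178.0, 175.0, 174.0, 29.0],
    "Price per Sqft": [9368.42, 11666.67, 10875.0, 3625.0],
})

# Drop the unused column, then encode Locality in place.
# columns=[...] replaces the original column with its dummies;
# drop_first=True omits the first category ("A" here) as the baseline.
toy_model_df = pd.get_dummies(
    toy.drop(columns=["Price per Sqft"]),
    columns=["Locality"],
    drop_first=True,
)
print(toy_model_df.columns.tolist())
```

Choosing the baseline explicitly, as the notebook does with `Ali Gaon`, is the better fit when a particular reference category matters for interpreting the coefficients.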


**Export the final dataframe for Model Training**

In [6]:
# Write the model-ready data to the working directory; index=False omits the row index
model_df.to_excel(os.path.join(os.getcwd(), "Processed-Delhi-Prices.xlsx"), index=False)