***Real Estate Price Prediction Model***

In [None]:
#Importing the necessary libraries

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
import matplotlib
matplotlib.rcParams["figure.figsize"] = (20,10)

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
df1 = pd.read_csv("/content/bengaluru_house_prices.csv")
df1.head()

In [None]:
df1.shape  # Returns (number of rows, number of columns)

In [None]:
# Count how many rows fall under each category of the 'area_type' column
df1.groupby('area_type')['area_type'].agg('count')

In [None]:
# We assume "area_type", "society", "balcony" and "availability" have little influence on price, so we drop those columns

df2 = df1.drop(["area_type", "society", "balcony", "availability"],axis='columns')

df2.head()


In [None]:
# Count the empty/NA values in each column of the dataset
df2.isnull().sum()

Since the dataset has about 13,000 rows, it is safe to drop the rows with empty values instead of filling them with the column mean or median.

For reference, null values can be replaced with the mean or median of a column as follows:
### Mean  
```
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
```
### Median
```
df['column_name'] = df['column_name'].fillna(df['column_name'].median())
```
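To make the difference between the two fills concrete, here is a minimal sketch on a made-up Series (the name `s` and its values are hypothetical, not taken from the dataset). A single extreme value pulls the mean far more than the median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one extreme value and one missing value
s = pd.Series([1.0, 2.0, 3.0, 100.0, np.nan])

mean_filled = s.fillna(s.mean())      # NaN replaced by the mean, 26.5
median_filled = s.fillna(s.median())  # NaN replaced by the median, 2.5
dropped = s.dropna()                  # the missing row is removed instead

print(mean_filled.iloc[-1], median_filled.iloc[-1], len(dropped))
```

Dropping rows, as done in this notebook, avoids choosing a fill value at the cost of losing some data.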

In [None]:
df3 = df2.dropna().copy()  # .copy() avoids a SettingWithCopyWarning when new columns are added to df3 later

df3.isnull().sum()  # Confirms that no empty values remain in the dataset

In [None]:
df3.shape

In [None]:
df3['size'].unique()  # Returns all the unique values in the 'size' column

To clean this column, we split each value of 'size' on the space character and keep the token at position 0, which is the number of bedrooms as an integer.

In [None]:
df3['bhk'] = df3['size'].str.split(' ').str[0].astype(int)

In [None]:
df3.head()
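The split-and-take-first-token step can be sketched on a few made-up values. The sample strings below are hypothetical, and the regular-expression variant is an alternative shown only for comparison, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical sample of 'size' values
sizes = pd.Series(["2 BHK", "4 Bedroom", "1 RK"])

# Approach used in this notebook: keep the first space-separated token
bhk_split = sizes.str.split(" ").str[0].astype(int)

# Alternative: capture the first run of digits with a regular expression
bhk_regex = sizes.str.extract(r"(\d+)", expand=False).astype(int)

print(bhk_split.tolist())
print(bhk_regex.tolist())
```

Both give the same result here; the regular expression would also cope with values that have no space between the number and the label.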

Now, we shall clean the 'total_sqft' column

In [None]:
df3.total_sqft.unique()

Here, we notice values such as '1133 - 1384', which are ranges rather than single numbers.

We also need to find out which values in 'total_sqft' cannot be read as a float.

To do that, we try to convert every value to a float. Any value that raises an exception during conversion is not a plain number, for example a range such as '1133 - 1384'.

In [None]:
def is_float(x):
    # Return True if x can be converted to a float, False otherwise
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

In [None]:
df3[~df3['total_sqft'].apply(is_float)]

# The '~' negates the boolean mask, so only the rows where is_float returned False are shown
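As an alternative sketch of the same check, `pd.to_numeric` with `errors="coerce"` turns unparseable values into NaN, which can then be used as a mask. The sample values below are hypothetical, and this is a comparison rather than the approach taken in this notebook.

```python
import pandas as pd

# Hypothetical sample mixing plain numbers, a range, and a value with a unit
total_sqft = pd.Series(["1056", "1133 - 1384", "34.46Sq. Meter", "2600"])

# errors="coerce" replaces anything that cannot be parsed with NaN
parsed = pd.to_numeric(total_sqft, errors="coerce")

# Keep only the original strings that failed to parse
non_numeric = total_sqft[parsed.isna()]
print(non_numeric.tolist())
```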

'total_sqft' also contains values with units, such as '34.46Sq. Meter' and '4125Perch'. We are going to ignore those values.

For now, we deal only with the range values.

In [None]:
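def convert_sqft_to_num(x):
    # A range such as '1133 - 1384' is replaced by the average of its two ends
    tokens = x.split('-')
    try:
        if len(tokens) == 2:
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(x)
    except ValueError:
        # Values with units (e.g. '34.46Sq. Meter') cannot be parsed
        return None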

In [None]:
convert_sqft_to_num('1133 - 1384')

In [None]:
df4 = df3.copy()
df4['total_sqft'] = df4['total_sqft'].apply(convert_sqft_to_num)

df4
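The effect of the conversion on the different kinds of values can be sketched on a small made-up sample. The helper `to_sqft` is a hypothetical stand-in with the same logic as `convert_sqft_to_num`, repeated here only so the example runs on its own.

```python
import pandas as pd

def to_sqft(value):
    """Hypothetical stand-in: midpoint of a range, else float, else None."""
    parts = value.split("-")
    try:
        if len(parts) == 2:
            return (float(parts[0]) + float(parts[1])) / 2
        return float(value)
    except ValueError:
        return None

# Hypothetical sample: a plain number, a range, and two values with units
sample = pd.Series(["1056", "1133 - 1384", "34.46Sq. Meter", "4125Perch"])
converted = sample.apply(to_sqft)

print(converted.iloc[1])       # midpoint of the range: (1133 + 1384) / 2
print(converted.isna().sum())  # number of values that could not be parsed
```

The values with units come back as missing, so they still need to be dropped or filtered out in a later step.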

<h1>Feature Engineering</h1>

In [None]:
df5 = df4.copy()

# Add a new column, price per sq ft, which is used later for outlier detection

df5['price_per_sqft'] = df5['price']*100000 / df5['total_sqft']

df5.head()
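A quick worked example of the formula, assuming 'price' is expressed in lakhs (1 lakh = 100,000), which is what the factor of 100000 suggests. The two listings below are made up.

```python
import pandas as pd

# Hypothetical listings; 'price' is assumed to be in lakhs
toy = pd.DataFrame({"price": [50.0, 120.0], "total_sqft": [1000.0, 1500.0]})

# Convert the price to full currency units, then divide by the area
toy["price_per_sqft"] = toy["price"] * 100000 / toy["total_sqft"]

print(toy["price_per_sqft"].tolist())  # [5000.0, 8000.0]
```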

Now, we shall explore the location column

In [None]:
len(df5.location.unique())

In [None]:
df5['location'] = df5['location'].str.strip()
#This code removes leading and trailing whitespace from each value in the 'location' column
location_stats = df5['location'].value_counts()

location_stats

Some locations appear only once or twice in the dataset. We therefore set a threshold, and every location whose count is at or below it is renamed to 'other'.

In [None]:
len(location_stats[location_stats <= 10]) #We can observe that 1052 of the 1293 locations appear 10 times or fewer, so we can classify them as 'other'

In [None]:
# location_stats is a Series mapping location -> count; .index gives the names of the rare locations
location_stats_less_than_10 = location_stats[location_stats <= 10].index

# Replace the rare locations with 'other' and keep every other location unchanged
df5['location'] = df5['location'].where(~df5['location'].isin(location_stats_less_than_10), 'other')


In [None]:
len(df5.location.unique())
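The grouping of rare locations can be sketched on a tiny made-up Series. The location values are hypothetical, and the threshold is lowered to 1 only because the sample is so small.

```python
import pandas as pd

# Hypothetical locations; note the trailing space in the second entry
locations = pd.Series(["Whitefield", "Whitefield ", "Hebbal", "Yelahanka", "Whitefield"])

locations = locations.str.strip()   # "Whitefield " becomes "Whitefield"
counts = locations.value_counts()   # location -> number of rows
rare = counts[counts <= 1].index    # labels seen at most once

# Keep frequent locations, replace the rare ones with 'other'
grouped = locations.where(~locations.isin(rare), "other")

print(grouped.tolist())
```

Stripping whitespace first matters: without it, "Whitefield " would be counted as a separate, rare location.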

<h1>Outliers Detection and Removal</h1>
<br>
Outlier detection means finding values that deviate extremely from the rest of the dataset.

For example, row 9 has a total_sqft of 1020 and a bhk of 6, i.e. only 170 sq ft per bedroom, which is unusual.<br>
So we remove all rows where total_sqft / bhk < 300

In [None]:
df6 = df5[~(df5.total_sqft/df5.bhk<300)]

df6.shape
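A minimal sketch of this filter on made-up rows (all values hypothetical). It also shows a side effect worth knowing: a row with a missing total_sqft passes the filter, because a comparison with NaN evaluates to False.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the first mirrors the 1020 sq ft / 6 bhk example
toy = pd.DataFrame({
    "total_sqft": [1020.0, 1200.0, 600.0, np.nan],
    "bhk": [6, 2, 2, 3],
})

sqft_per_bhk = toy["total_sqft"] / toy["bhk"]  # 170.0, 600.0, 300.0, NaN

# Drop rows below 300 sq ft per bedroom; exactly 300 and NaN are kept
kept = toy[~(sqft_per_bhk < 300)]

print(kept)
```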

<h1>Outlier Removal Using Box plot and IQR</h1>

In [None]:
plt.figure(figsize=(18,5))
plt.subplot(1,3,1)
sns.boxplot(x=df6['bath'])  # pass the column as x explicitly; behavior of a positional Series differs across seaborn versions
plt.title('Bathrooms')

plt.subplot(1,3,2)
sns.boxplot(x=df6['bhk'])
plt.title('BHK')

plt.subplot(1,3,3)
sns.boxplot(x=df6['price_per_sqft'])
plt.title('price_per_sqft')

plt.tight_layout()
plt.show()



In [None]:
# Remove outliers using IQR for bath, bhk, price_per_sqft
def remove_iqr_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    return df[(df[column] >= Q1 - 1.5 * IQR) & (df[column] <= Q3 + 1.5 * IQR)]

df6_no_bath_outliers = remove_iqr_outliers(df6, 'bath')
df6_no_bhk_outliers = remove_iqr_outliers(df6_no_bath_outliers, 'bhk')
df7 = remove_iqr_outliers(df6_no_bhk_outliers, 'price_per_sqft')

print("Shape after outlier removal:", df7.shape)
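The IQR rule can be traced by hand on a small made-up Series (values hypothetical). Anything outside the fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR is treated as an outlier.

```python
import pandas as pd

# Hypothetical values with one obvious outlier
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])

q1 = values.quantile(0.25)  # 2.25
q3 = values.quantile(0.75)  # 4.0
iqr = q3 - q1               # 1.75

lower = q1 - 1.5 * iqr      # -0.375
upper = q3 + 1.5 * iqr      # 6.625

# Keep only the values inside the fences
kept = values[(values >= lower) & (values <= upper)]

print(lower, upper, kept.tolist())
```

Note that the notebook applies the rule to three columns in sequence, so the quartiles of each later column are computed on rows that survived the earlier filters.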



In [None]:
# Plot the boxplots again after outlier removal
plt.figure(figsize=(18,5))
plt.subplot(1,3,1)
sns.boxplot(x=df7['bath'])
plt.title('Bathrooms (After Removal)')

plt.subplot(1,3,2)
sns.boxplot(x=df7['bhk'])
plt.title('BHK (After Removal)')

plt.subplot(1,3,3)
sns.boxplot(x=df7['price_per_sqft'])
plt.title('Price per Sqft (After Removal)')

plt.tight_layout()
plt.show()

In [None]:
#Performing one hot encoding on location
dummies = pd.get_dummies(df7.location) #One-hot encoding
dummies.head(3)

In [None]:
df8 = pd.concat([df7, dummies.drop('other',axis="columns")],axis='columns')

df8.head()

In [None]:
df9 = df8.drop('location', axis="columns")

df9.head()
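The encoding steps above can be sketched on a tiny made-up frame (locations and values are hypothetical). One dummy column is dropped because it is redundant: a row with all remaining dummies equal to zero must belong to the dropped category, here 'other'.

```python
import pandas as pd

# Hypothetical rows with three location labels
toy = pd.DataFrame({
    "location": ["Hebbal", "other", "Whitefield", "Hebbal"],
    "bhk": [2, 3, 2, 4],
})

dummies = pd.get_dummies(toy["location"])  # one indicator column per label

# Drop the original text column and the redundant 'other' indicator
encoded = pd.concat(
    [toy.drop("location", axis="columns"), dummies.drop("other", axis="columns")],
    axis="columns",
)

print(encoded.columns.tolist())
print(encoded)
```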

In [None]:
df9.to_csv('preprocessed_data.csv', index=False)
from google.colab import files
files.download('preprocessed_data.csv')
