### Data Wrangling
- Problem Statement: Data Wrangling on Real Estate Market
- Dataset: "RealEstate_Prices.csv"
- Description: The dataset contains information about housing prices in a specific real estate
market. It includes attributes such as property characteristics, location, and sale prices.
- Goal: Perform data wrangling to gain insights into the factors influencing housing prices
and prepare the dataset for further analysis or modeling.
#### Tasks to Perform:
1. Import the "RealEstate_Prices.csv" dataset. Clean column names by removing spaces,
special characters, or renaming them for clarity.
2. Handle missing values in the dataset, deciding on an appropriate strategy (e.g.,
imputation or removal).
3. Perform data merging if additional datasets with relevant information are available
(e.g., neighborhood demographics or nearby amenities).
4. Filter and subset the data based on specific criteria, such as a particular time period,
property type, or location.
5. Handle categorical variables by encoding them appropriately (e.g., one-hot encoding
or label encoding) for further analysis.
6. Aggregate the data to calculate summary statistics or derived metrics such as average
sale prices by neighborhood or property type.
7. Identify and handle outliers or extreme values in the data that may affect the analysis
or modeling process.

In [1]:
import pandas as pd

df = pd.read_csv('RealEstate_Prices.csv')
df.head()

Unnamed: 0,Property_Name,Location,Region,Property_Age,Availability,Area_Tpye,Area_SqFt,Rate_SqFt,Floor_No,Bedroom,Bathroom,Price_Lakh
0,Omkar Alta Monte,W E Highway Malad East Mumbai,Malad Mumbai,0 to 1 Year,Ready To Move,Super Built Up Area,2900.0,17241,14,3,4,500.0
1,T Bhimjyani Neelkanth Woods,Manpada Thane Mumbai,Manpada Thane,1 to 5 Year,Ready To Move,Super Built Up Area,1900.0,12631,8,3,3,240.0
2,Legend 1 Pramila Nagar,Dahisar West Mumbai,Dahisar Mumbai,10+ Year,Ready To Move,Super Built Up Area,595.0,15966,3,1,2,95.0
3,Unnamed Property,Vidyavihar West Vidyavihar West Central Mumbai...,Central Mumbai,5 to 10 Year,Ready To Move,Built Up Area,1450.0,25862,1,3,3,375.0
4,Unnamed Property,176 Cst Road Kalina Mumbai 400098 Santacruz Ea...,Santacruz Mumbai,5 to 10 Year,Ready To Move,Carpet Area,876.0,39954,5,2,2,350.0


In [2]:
# Clean column names by removing spaces, special characters and renaming them for clarity
# We will use snake_case for the column names as it is a common convention in Python

# Define a function to clean column names
def clean_column_name(column_name):
    return column_name.strip().replace(' ', '_').replace('+', 'plus').lower()

# Apply the function to clean all column names
df.columns = [clean_column_name(col) for col in df.columns]

# Display the cleaned column names
df.columns.tolist()


['property_name',
 'location',
 'region',
 'property_age',
 'availability',
 'area_tpye',
 'area_sqft',
 'rate_sqft',
 'floor_no',
 'bedroom',
 'bathroom',
 'price_lakh']
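
The cleaner above only handles spaces and `+`, which is enough for this file's headers. A minimal sketch of a more general version follows, assuming that any run of non-alphanumeric characters should collapse into a single underscore. The function name `to_snake_case` and the sample headers are hypothetical, chosen only for illustration.

```python
import re

def to_snake_case(column_name):
    """Lowercase a header and collapse non-alphanumeric runs into '_'."""
    # Spell out '+' before it is swallowed by the regex below
    name = column_name.strip().replace('+', ' plus ')
    # Any run of characters other than letters or digits becomes one underscore
    name = re.sub(r'[^0-9a-zA-Z]+', '_', name)
    return name.strip('_').lower()

# Hypothetical headers, not taken from RealEstate_Prices.csv
sample_headers = ['Area_Tpye', ' Price (Lakh) ', 'Rate/SqFt', '10+ Years']
cleaned_headers = [to_snake_case(h) for h in sample_headers]
print(cleaned_headers)
```

Typos such as `area_tpye` still need an explicit rename, since no pattern-based cleaner can know the intended spelling.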

In [3]:
# There's a typo in one of the column names: 'area_tpye' should be 'area_type'
# Renaming the column to fix the typo
df.rename(columns={'area_tpye': 'area_type'}, inplace=True)

In [11]:
# Check for missing values in the dataset
df.isnull().sum()

property_name    0
location         0
region           0
property_age     0
availability     0
area_type        0
area_sqft        0
rate_sqft        0
floor_no         0
bedroom          0
bathroom         0
price_lakh       0
dtype: int64
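
Every column reports zero missing values, so no imputation or removal is needed for this dataset. If gaps were present, one possible strategy is sketched below on a small hypothetical frame (the values are made up for illustration): drop rows where the target `price_lakh` is missing, fill numeric features with the median, and fill categorical features with the most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps; not taken from RealEstate_Prices.csv
sample = pd.DataFrame({
    'area_sqft': [600.0, np.nan, 900.0, 1200.0],
    'area_type': ['Carpet Area', 'Carpet Area', None, 'Built Up Area'],
    'price_lakh': [80.0, 120.0, 150.0, np.nan],
})

# Drop rows where the target itself is missing
sample = sample.dropna(subset=['price_lakh']).copy()

# Numeric feature: fill with the median, which is robust to extreme values
sample['area_sqft'] = sample['area_sqft'].fillna(sample['area_sqft'].median())

# Categorical feature: fill with the most frequent value
sample['area_type'] = sample['area_type'].fillna(sample['area_type'].mode()[0])

print(sample)
```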

In [5]:
# For encoding categorical variables, we first need to identify which columns contain categorical data.
# We'll check the data types of each column to identify categorical columns.

categorical_columns = df.select_dtypes(include=['object']).columns.tolist()

# Displaying the categorical columns
categorical_columns


['property_name',
 'location',
 'region',
 'property_age',
 'availability',
 'area_type']
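
Merging and filtering (tasks 3 and 4) are easiest to do at this point, while the raw `region` and `property_age` columns still exist. A minimal sketch follows, using the first three rows shown by `df.head()`. The `demographics` table and its `median_income_lakh` column are invented for illustration, since no second dataset accompanies the source.

```python
import pandas as pd

# A few columns from the first three listings shown by df.head()
listings = pd.DataFrame({
    'region': ['Malad Mumbai', 'Manpada Thane', 'Dahisar Mumbai'],
    'property_age': ['0 to 1 Year', '1 to 5 Year', '10+ Year'],
    'bedroom': [3, 3, 1],
    'price_lakh': [500.0, 240.0, 95.0],
})

# Hypothetical neighborhood-level table (not part of the source dataset)
demographics = pd.DataFrame({
    'region': ['Malad Mumbai', 'Manpada Thane'],
    'median_income_lakh': [12.0, 9.5],
})

# Left join keeps every listing, even if its region has no demographics row
merged = listings.merge(demographics, on='region', how='left')

# Filter: properties with at least 3 bedrooms that are under 5 years old
recent_large = merged[
    (merged['bedroom'] >= 3)
    & (merged['property_age'].isin(['0 to 1 Year', '1 to 5 Year']))
]
print(recent_large)
```

A left join is used so that listings without a matching region are kept with `NaN` in the new column rather than silently dropped.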

In [6]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

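# Initialize encoders
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; the old name was removed in 1.4
onehot_encoder = OneHotEncoder(sparse_output=False)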
label_encoder = LabelEncoder()

# One-hot encoding for low cardinality categorical columns
onehot_encoded_columns = ['region', 'property_age', 'availability', 'area_type']
onehot_encoded_data = pd.DataFrame(onehot_encoder.fit_transform(df[onehot_encoded_columns]))

# The encoder returns a plain array without column names, so we add them back
onehot_encoded_data.columns = onehot_encoder.get_feature_names_out(onehot_encoded_columns)

# Label encoding for high cardinality categorical columns
label_encoded_columns = ['property_name', 'location']
for col in label_encoded_columns:
    df[col + '_label'] = label_encoder.fit_transform(df[col])

# Drop the original categorical columns that have now been encoded
df.drop(columns=onehot_encoded_columns + label_encoded_columns, inplace=True)

# Concatenate the one-hot encoded columns to the original dataframe
real_estate_data = pd.concat([df, onehot_encoded_data], axis=1)

# Displaying the first few rows of the updated dataframe
real_estate_data.head()


Unnamed: 0,area_sqft,rate_sqft,floor_no,bedroom,bathroom,price_lakh,property_name_label,location_label,region_Adaigaon Navi-Mumbai,region_Adharwadi Mumbai,...,property_age_1 to 5 Year,property_age_10+ Year,property_age_5 to 10 Year,property_age_Under Construction,availability_Ready To Move,availability_Under Construction,area_type_Built Up Area,area_type_Carpet Area,area_type_Plot Area,area_type_Super Built Up Area
0,2900.0,17241,14,3,4,500.0,487,1276,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
1,1900.0,12631,8,3,3,240.0,803,886,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
2,595.0,15966,3,1,2,95.0,359,683,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,1450.0,25862,1,3,3,375.0,844,1263,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
4,876.0,39954,5,2,2,350.0,844,246,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
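
One-hot encoding treats `property_age` as unordered, although its values describe increasing age. If that order matters for later analysis, an ordinal encoding is one alternative. The sketch below uses a pandas ordered categorical. The ordering in `age_order`, in particular placing 'Under Construction' first, is an assumption rather than something the dataset states.

```python
import pandas as pd

# Assumed ordering from newest to oldest; the position of
# 'Under Construction' is a judgment call
age_order = ['Under Construction', '0 to 1 Year', '1 to 5 Year',
             '5 to 10 Year', '10+ Year']

# Sample values as they appear in the property_age column
sample_age = pd.Series(['0 to 1 Year', '1 to 5 Year', '10+ Year', '5 to 10 Year'])

# Codes follow the position of each value in age_order
age_cat = pd.Categorical(sample_age, categories=age_order, ordered=True)
age_codes = pd.Series(age_cat.codes, name='property_age_ordinal')
print(age_codes.tolist())
```

Any value missing from `age_order` is encoded as -1, which makes unexpected categories easy to spot.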


In [7]:
# For aggregation, let's calculate some summary statistics for the numerical columns
real_estate_data.describe()


Unnamed: 0,area_sqft,rate_sqft,floor_no,bedroom,bathroom,price_lakh,property_name_label,location_label,region_Adaigaon Navi-Mumbai,region_Adharwadi Mumbai,...,property_age_1 to 5 Year,property_age_10+ Year,property_age_5 to 10 Year,property_age_Under Construction,availability_Ready To Move,availability_Under Construction,area_type_Built Up Area,area_type_Carpet Area,area_type_Plot Area,area_type_Super Built Up Area
count,2580.0,2580.0,2580.0,2580.0,2580.0,2580.0,2580.0,2580.0,2580.0,2580.0,...,2580.0,2580.0,2580.0,2580.0,2580.0,2580.0,2580.0,2580.0,2580.0,2580.0
mean,1026.105058,19111.85,8.839535,1.962016,2.066667,174.389806,575.581783,772.362791,0.000388,0.003876,...,0.343798,0.176357,0.191473,0.005814,0.994186,0.005814,0.155039,0.394961,0.006202,0.443798
std,2287.126278,40760.88,8.100081,0.844726,0.74996,369.484393,280.072257,351.095606,0.019687,0.062149,...,0.475067,0.381197,0.393537,0.076042,0.076042,0.076042,0.362012,0.488937,0.078521,0.496928
min,33.57,84.0,-1.0,1.0,1.0,13.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,630.75,8791.75,3.0,1.0,2.0,67.0,344.75,570.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,850.0,13785.0,6.0,2.0,2.0,111.5,637.0,803.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
75%,1156.0,22650.0,12.0,2.0,2.0,200.0,844.0,1062.25,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
max,100000.0,1650000.0,59.0,6.0,7.0,16500.0,906.0,1307.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [8]:
# For derived metrics, let's calculate the average sale prices (price_lakh) by the number of bedrooms
average_price_by_bedroom = real_estate_data.groupby('bedroom')['price_lakh'].mean()
average_price_by_bedroom

bedroom
1     67.549878
2    154.751389
3    293.437695
4    685.168831
5    857.500000
6    650.428571
Name: price_lakh, dtype: float64
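
The task list also asks for averages by neighborhood. Because `region` is dropped from `df` during encoding, such an aggregation has to run before that cell or on a freshly loaded copy. A minimal sketch on hypothetical rows follows, using pandas named aggregation. The output column names `n_listings`, `mean_price_lakh`, and `median_price_lakh` are illustrative.

```python
import pandas as pd

# Hypothetical listings; values are for illustration only
listings = pd.DataFrame({
    'region': ['Malad Mumbai', 'Malad Mumbai', 'Manpada Thane', 'Dahisar Mumbai'],
    'area_sqft': [2900.0, 1000.0, 1900.0, 595.0],
    'price_lakh': [500.0, 150.0, 240.0, 95.0],
})

# One row per region: group size plus mean and median price
price_by_region = (
    listings.groupby('region')
    .agg(
        n_listings=('price_lakh', 'size'),
        mean_price_lakh=('price_lakh', 'mean'),
        median_price_lakh=('price_lakh', 'median'),
    )
    .sort_values('mean_price_lakh', ascending=False)
)
print(price_by_region)
```

Reporting the group size next to the mean helps judge how much weight a group average deserves, and the median is less sensitive to a few very expensive listings.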

In [9]:
import numpy as np

# Function to detect outliers based on the IQR method
def detect_outliers_iqr(data, threshold=1.5):
    Q1 = np.percentile(data, 25)
    Q3 = np.percentile(data, 75)
    IQR = Q3 - Q1

    outlier_step = IQR * threshold
    outliers = data[(data < Q1 - outlier_step) | (data > Q3 + outlier_step)]
    return outliers

# Let's apply the function to the 'price_lakh' column to identify outliers
outliers_price = detect_outliers_iqr(real_estate_data['price_lakh'])

# Showing some of the outliers
outliers_price.describe(), outliers_price.head()


(count      176.000000
 mean       772.102273
 std       1234.054405
 min        400.000000
 25%        475.000000
 50%        570.000000
 75%        750.250000
 max      16500.000000
 Name: price_lakh, dtype: float64,
 0      500.0
 23     480.0
 24     440.0
 33     813.0
 39    1800.0
 Name: price_lakh, dtype: float64)
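
The IQR fences can be checked by hand from the quartiles of `price_lakh` that `describe()` reported earlier (25% = 67.0, 75% = 200.0):

```python
# Quartiles of price_lakh as reported by describe()
q1, q3 = 67.0, 200.0

iqr = q3 - q1                 # 133.0
lower_fence = q1 - 1.5 * iqr  # -132.5
upper_fence = q3 + 1.5 * iqr  # 399.5
print(lower_fence, upper_fence)
```

The lower fence is negative, so no price can fall below it. All 176 flagged values therefore lie above 399.5, which matches the minimum outlier of 400.0 in the output.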

In [10]:
# Cap 'price_lakh' at its 99th percentile (values above the cap are replaced by it)
percentile_99 = np.percentile(real_estate_data['price_lakh'], 99)
real_estate_data['price_lakh_capped'] = real_estate_data['price_lakh'].clip(upper=percentile_99)

# Let's compare the original and the capped price_lakh values
comparison = real_estate_data[['price_lakh', 'price_lakh_capped']].describe()

# Displaying the comparison
comparison


Unnamed: 0,price_lakh,price_lakh_capped
count,2580.0,2580.0
mean,174.389806,165.111473
std,369.484393,160.640136
min,13.0,13.0
25%,67.0,67.0
50%,111.5,111.5
75%,200.0,200.0
max,16500.0,978.15
