### Data Wrangling
- Problem Statement: Data Wrangling on Real Estate Market
- Dataset: "RealEstate_Prices.csv"
- Description: The dataset contains information about housing prices in a specific real estate
market. It includes attributes such as property characteristics, location, sale prices,
and other relevant features.
- Goal: Perform data wrangling to gain insight into the factors influencing housing prices
and prepare the dataset for further analysis or modeling.
#### Tasks to Perform:
1. Import the "RealEstate_Prices.csv" dataset. Clean column names by removing spaces,
special characters, or renaming them for clarity.
2. Handle missing values in the dataset, deciding on an appropriate strategy (e.g.,
imputation or removal).
3. Perform data merging if additional datasets with relevant information are available
(e.g., neighborhood demographics or nearby amenities).
4. Filter and subset the data based on specific criteria, such as a particular time period,
property type, or location.
5. Handle categorical variables by encoding them appropriately (e.g., one-hot encoding
or label encoding) for further analysis.
6. Aggregate the data to calculate summary statistics or derived metrics such as average
sale prices by neighborhood or property type.
7. Identify and handle outliers or extreme values in the data that may affect the analysis
or modeling process.
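
As a rough orientation for tasks 4 to 7, the sketch below runs each step on a small made-up frame. The column names (`region`, `area_type`, `price_lakh`) and all values are illustrative assumptions, and the 1.5 × IQR rule is only one common way to flag outliers, not necessarily the right choice for this dataset.

```python
import pandas as pd

# Toy data: column names and values are assumptions for illustration only.
toy = pd.DataFrame({
    "region": ["Malad Mumbai", "Malad Mumbai", "Manpada Thane",
               "Manpada Thane", "Dahisar Mumbai", "Dahisar Mumbai"],
    "area_type": ["Super Built Up Area", "Carpet Area", "Super Built Up Area",
                  "Built Up Area", "Carpet Area", "Carpet Area"],
    "price_lakh": [500.0, 350.0, 240.0, 260.0, 95.0, 5000.0],
})

# Task 4: filter rows on a criterion (here, one property/area type).
carpet = toy[toy["area_type"] == "Carpet Area"]

# Task 5: one-hot encode a categorical column.
encoded = pd.get_dummies(toy, columns=["area_type"])

# Task 6: aggregate, e.g. average price per region.
avg_price = toy.groupby("region")["price_lakh"].mean()

# Task 7: flag outliers with the 1.5 * IQR rule and drop them.
q1 = toy["price_lakh"].quantile(0.25)
q3 = toy["price_lakh"].quantile(0.75)
iqr = q3 - q1
within_bounds = toy["price_lakh"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
no_outliers = toy[within_bounds]

print(avg_price)
print(no_outliers)
```

On real data, the filter criterion, the columns to encode, and the outlier threshold would all need to be chosen after inspecting the distributions.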

In [4]:
import pandas as pd

df = pd.read_csv('RealEstate_Prices.csv')
df.head()

Unnamed: 0,Property_Name,Location,Region,Property_Age,Availability,Area_Tpye,Area_SqFt,Rate_SqFt,Floor_No,Bedroom,Bathroom,Price_Lakh
0,Omkar Alta Monte,W E Highway Malad East Mumbai,Malad Mumbai,0 to 1 Year,Ready To Move,Super Built Up Area,2900.0,17241,14,3,4,500.0
1,T Bhimjyani Neelkanth Woods,Manpada Thane Mumbai,Manpada Thane,1 to 5 Year,Ready To Move,Super Built Up Area,1900.0,12631,8,3,3,240.0
2,Legend 1 Pramila Nagar,Dahisar West Mumbai,Dahisar Mumbai,10+ Year,Ready To Move,Super Built Up Area,595.0,15966,3,1,2,95.0
3,Unnamed Property,Vidyavihar West Vidyavihar West Central Mumbai...,Central Mumbai,5 to 10 Year,Ready To Move,Built Up Area,1450.0,25862,1,3,3,375.0
4,Unnamed Property,176 Cst Road Kalina Mumbai 400098 Santacruz Ea...,Santacruz Mumbai,5 to 10 Year,Ready To Move,Carpet Area,876.0,39954,5,2,2,350.0


In [6]:
# Clean column names: strip whitespace, replace spaces and special characters, convert to snake_case
# snake_case is the usual convention for identifiers in Python
import re

# Define a function to clean a single column name
def clean_column_name(column_name):
    # Spell out '+', lowercase, then collapse every run of non-alphanumeric characters into '_'
    name = column_name.strip().replace('+', 'plus').lower()
    return re.sub(r'[^0-9a-z]+', '_', name).strip('_')

# Apply the function to clean all column names
df.columns = [clean_column_name(col) for col in df.columns]

# Display the cleaned column names
df.columns.tolist()


['property_name',
 'location',
 'region',
 'property_age',
 'availability',
 'area_tpye',
 'area_sqft',
 'rate_sqft',
 'floor_no',
 'bedroom',
 'bathroom',
 'price_lakh']

In [8]:
# There's a typo in one of the column names: 'area_tpye' should be 'area_type'
# Renaming the column to fix the typo
df.rename(columns={'area_tpye': 'area_type'}, inplace=True)

In [11]:
# Check for missing values in the dataset
df.isnull().sum()

property_name    0
location         0
region           0
property_age     0
availability     0
area_type        0
area_sqft        0
rate_sqft        0
floor_no         0
bedroom          0
bathroom         0
price_lakh       0
dtype: int64
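
Every column reports zero missing values, so no imputation or removal is needed for this file. For a dataset that does have gaps, one possible strategy is sketched below on a made-up frame that reuses the cleaned column names. The helper `handle_missing` is hypothetical, and the choices it makes (drop rows without a target price, median for numeric columns, mode for categorical columns) are assumptions rather than the only reasonable approach.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries; the values are made up.
toy = pd.DataFrame({
    "region": ["Malad Mumbai", "Manpada Thane", None, "Malad Mumbai"],
    "area_sqft": [2900.0, np.nan, 595.0, 1450.0],
    "price_lakh": [500.0, 240.0, 95.0, np.nan],
})

def handle_missing(frame, target="price_lakh"):
    """Hypothetical helper: drop rows lacking the target, impute the rest."""
    # Rows without a price cannot be used for price analysis, so drop them.
    out = frame.dropna(subset=[target]).copy()

    # Numeric columns: fill gaps with the column median (robust to skew).
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())

    # Non-numeric columns: fill gaps with the most frequent value.
    for col in out.select_dtypes(exclude="number").columns:
        mode = out[col].mode(dropna=True)
        if not mode.empty:
            out[col] = out[col].fillna(mode.iloc[0])
    return out

cleaned = handle_missing(toy)
print(cleaned.isnull().sum())
```

Applied to the `df` loaded above, a function like this would leave the data unchanged, since there is nothing to fill or drop.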