### Load Dataset & Basic Overview

In [1]:
import pandas as pd

df = pd.read_csv("pune_commercial_listings_cleaned.csv")

# Basic dataset information
print("Dataset Shape:", df.shape)
print("\nColumn Names:", df.columns.tolist())
print("\nData Types:")
print(df.dtypes)

# Preview first rows
df.head()

Dataset Shape: (555, 8)

Column Names: ['locality', 'location_full', 'property_type', 'sqft', 'rent', 'deposit', 'price_per_sqft', 'furnishing']

Data Types:
locality           object
location_full      object
property_type      object
sqft                int64
rent                int64
deposit             int64
price_per_sqft    float64
furnishing         object
dtype: object


| | locality | location_full | property_type | sqft | rent | deposit | price_per_sqft | furnishing |
|---|---|---|---|---|---|---|---|---|
| 0 | Kothrud | hapoy colony near van devi mandir karve nagar ... | Office | 650 | 12000 | 24000 | 18.46 | Furnished |
| 1 | Kothrud | Gandhi Bhavan Rd, Gandhi Bhavan (Maharashtra G... | Shop | 160 | 20000 | 60000 | 125.0 | Unfurnished |
| 2 | Kothrud | Chandani chowk , Shinde Farm Golden Group, | Shop | 500 | 100000 | 300000 | 200.0 | Furnished |
| 3 | Kothrud | Paschimanagri,, City Pride- Kothrud | Office | 230 | 100000 | 300000 | 434.78 | Furnished |
| 4 | Kothrud | Late. G A kulkarni road, Opposite Karishma Soc... | Office | 100 | 10000 | 10000 | 100.0 | Furnished |


---------------
--------------

### Missing Values Check

In [2]:
df.isnull().sum()

locality          0
location_full     0
property_type     0
sqft              0
rent              0
deposit           0
price_per_sqft    0
furnishing        0
dtype: int64

------------------
-------------------

### Descriptive Statistics (Numerical Columns)

In [4]:
df.describe()

| | sqft | rent | deposit | price_per_sqft |
|---|---|---|---|---|
| count | 555.0 | 555.0 | 555.0 | 555.0 |
| mean | 410.045045 | 58116.93 | 166676.1 | 168.293081 |
| std | 222.080667 | 120621.8 | 344873.2 | 420.142338 |
| min | 100.0 | 4000.0 | 5625.0 | 8.79 |
| 25% | 220.0 | 18000.0 | 50000.0 | 55.695 |
| 50% | 370.0 | 30000.0 | 90000.0 | 78.74 |
| 75% | 550.0 | 60000.0 | 180000.0 | 133.33 |
| max | 992.0 | 1821000.0 | 5463000.0 | 7883.12 |


---------------------
------------------------

# Categorical Column Analysis

### Property Type Distribution

In [5]:
df['property_type'].value_counts()

property_type
Office        331
Shop          173
Warehouse      24
Commercial     19
Showroom        5
Restaurant      3
Name: count, dtype: int64

----------------
----------------

### Top 10 Localities

In [6]:
df['locality'].value_counts().head(10)

locality
Baner               25
Kalewadi            25
Kondhwa             25
Pashan              24
Hadapsar            24
Camp                24
Wakad               23
Kothrud             23
Pimpri Chinchwad    23
Yerwada             23
Name: count, dtype: int64

---------------------
------------------------

### Number of Unique Localities

In [8]:
df['locality'].nunique()

25

---------------------
---------------------

# Numerical Column Distributions

### Rent Distribution

In [9]:
df['rent'].describe()

count    5.550000e+02
mean     5.811693e+04
std      1.206218e+05
min      4.000000e+03
25%      1.800000e+04
50%      3.000000e+04
75%      6.000000e+04
max      1.821000e+06
Name: rent, dtype: float64

-------------
---------------

### Sqft Distribution

In [10]:
df['sqft'].describe()

count    555.000000
mean     410.045045
std      222.080667
min      100.000000
25%      220.000000
50%      370.000000
75%      550.000000
max      992.000000
Name: sqft, dtype: float64

------------------
------------------

### Price Per Sqft Distribution

In [11]:
df['price_per_sqft'].describe()

count     555.000000
mean      168.293081
std       420.142338
min         8.790000
25%        55.695000
50%        78.740000
75%       133.330000
max      7883.120000
Name: price_per_sqft, dtype: float64

-----------
----------

# Relationship Analysis

### Rent vs Sqft Correlation

In [12]:
df[['rent','sqft']].corr()

| | rent | sqft |
|---|---|---|
| rent | 1.0 | 0.14444 |
| sqft | 0.14444 | 1.0 |


---------------
---------------

### Full Correlation Matrix

In [None]:
df.corr(numeric_only=True)

--------------------
--------------------

### Average Rent by Property Type

In [13]:
df.groupby('property_type')['rent'].mean().sort_values(ascending=False)

property_type
Restaurant    288333.333333
Commercial    129315.736842
Showroom      123000.000000
Office         69023.861027
Warehouse      30854.166667
Shop           27343.919075
Name: rent, dtype: float64

-------------
-------------

### Average Price per Sqft (PPS) by Locality (Top 10)

In [14]:
df.groupby('locality')['price_per_sqft'].mean().sort_values(ascending=False).head(10)

locality
Viman Nagar         590.637391
Deccan              365.290909
Aundh               309.817500
Kothrud             297.225217
Koregaon Park       284.484286
Balewadi            222.997826
Baner               194.435200
Shivajiaagar        191.440000
Kharadi             150.975000
Pimpri Chinchwad    139.960000
Name: price_per_sqft, dtype: float64

--------------
------------

### EDA — Key Insights (Summary)

**Dataset status**
- Rows: 555 (cleaned); no nulls, duplicates removed.  
- Columns kept (8): `locality`, `location_full`, `property_type`, `sqft`, `rent`, `deposit`, `price_per_sqft`, `furnishing` (`parking` and `available_from` removed).  
- Numeric columns cleaned and typed: `rent`, `sqft`, `deposit`, `price_per_sqft`.

---

**1. Property type (market composition)**
- **Office** is the dominant category (~60% of listings).  
- **Shop** is the second most common (~30%).  
- Other types (Warehouse, Commercial, Showroom, Restaurant) are rare: 51 listings combined, about 9% of the total.

**Business implication:** Pune's scraped commercial market is office-heavy — analyses and models should account for class imbalance.
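
The shares quoted above can be reproduced from the `value_counts()` output in cell `In [5]`. The following is a minimal sketch that rebuilds the counts by hand so it runs without the CSV; the variable names `counts` and `share_pct` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the value_counts() output of cell In [5].
counts = pd.Series(
    {"Office": 331, "Shop": 173, "Warehouse": 24,
     "Commercial": 19, "Showroom": 5, "Restaurant": 3},
    name="count",
)

# Share of each property type as a percentage of all listings.
share_pct = (counts / counts.sum() * 100).round(1)
print(share_pct)

# On the real frame, the equivalent one-liner would be:
# df["property_type"].value_counts(normalize=True).mul(100).round(1)
```

Office comes out just under 60% and Shop just over 31%, which matches the rounded figures in the bullets.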

---

**2. Rent (monthly)**
- Typical range (IQR): **₹18,000 – ₹60,000**.  
- Upper bound (IQR rule) ≈ **₹123,000** → ~41 outlier listings above this.  
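- Very high-rent listings are attributed mainly to premium locations rather than unit size: the correlation between `rent` and `sqft` is weak (≈ 0.14) and no unit exceeds 992 sqft. They are treated as valid business cases.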

**Actionable:** Treat high-rent listings as valid premium-segment data (do not drop).

---

**3. Area (sqft)**
- After cleaning, the `sqft` distribution is realistic: **Q1 = 220, median = 370, Q3 = 550**.  
- All `sqft` ≥ 100 (invalid tiny areas were removed).

**Actionable:** `sqft` is reliable for modeling and visualization.

---

**4. Price per sqft (PPS)**
- Recomputed as `price_per_sqft = rent / sqft` for all rows.  
- Typical PPS IQR: **~₹56 – ₹133 / sqft**.  
- PPS outliers (> ~₹250) present (~73 rows) — reflect premium micro-markets (Koregaon Park, FC Road, Baner, Viman Nagar).

**Actionable:** Use PPS to identify premium localities; avoid capping/trimming since outliers are meaningful.
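
As a quick consistency check on the `rent / sqft` definition, the sketch below recomputes PPS for the five preview rows shown by `df.head()` and compares it with the stored column. It assumes the stored values are rounded to two decimals, which the preview suggests but the notebook does not state; `sample`, `recomputed`, and `max_diff` are illustrative names.

```python
import pandas as pd

# The five preview rows from df.head() (sqft, rent, stored PPS).
sample = pd.DataFrame({
    "sqft": [650, 160, 500, 230, 100],
    "rent": [12000, 20000, 100000, 100000, 10000],
    "price_per_sqft": [18.46, 125.0, 200.0, 434.78, 100.0],
})

# Recompute PPS from rent and area.
# ASSUMPTION: the stored column is rounded to two decimals.
recomputed = (sample["rent"] / sample["sqft"]).round(2)

# Largest absolute disagreement between stored and recomputed values.
max_diff = (recomputed - sample["price_per_sqft"]).abs().max()
print(recomputed.tolist(), max_diff)
```

Running the same comparison on the full `df` would confirm whether every row follows the formula.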

---

**5. Deposit**
- Normalized with a business rule: where the recorded deposit was invalid, it was reset to `deposit = 3 × rent`.  
- Correlates strongly with `rent`, as expected when corrected deposits are a fixed multiple of rent. No additional treatment required.
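
The deposit rule could be applied along the following lines. This is only a sketch: the notebook does not say what counts as an "invalid" deposit, so the code assumes it means missing or non-positive, and the `toy` rows are invented for illustration rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration only (not from the dataset).
toy = pd.DataFrame({
    "rent": [10000, 20000, 30000],
    "deposit": [10000.0, np.nan, 0.0],
})

# ASSUMPTION: "invalid" means missing or non-positive;
# the notebook does not define the criterion.
invalid = toy["deposit"].isna() | (toy["deposit"] <= 0)

# Replace only the invalid deposits with 3 x rent; valid values are kept.
toy["deposit"] = toy["deposit"].where(~invalid, 3 * toy["rent"])
print(toy)
```

Using `Series.where` leaves valid deposits untouched, so listings such as the first preview row (deposit of 2 × rent) would keep their recorded value.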

---

**6. Locality effects**
- Locality strongly drives PPS and rent.  
- High-value localities by mean PPS (see the Top 10 table above): Viman Nagar, Deccan (FC Road), Aundh, Kothrud, Koregaon Park, Balewadi, Baner.  
- Lower-cost localities: Pimple Saudagar, Wakad, Moshi, Dhanori, Sus Road.

**Actionable:** For any pricing model, include locality (or locality cluster) as a key feature. Consider grouping rare localities into `Other` for robust models.
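
Grouping rare localities into `Other` might look like the sketch below. The locality counts and the `MIN_LISTINGS` cut-off are hypothetical values chosen for illustration; the real threshold should be tuned against `df['locality'].value_counts()`.

```python
import pandas as pd

# Toy locality column, invented for illustration
# (these counts are not from the dataset).
loc = pd.Series(["Baner"] * 5 + ["Wakad"] * 4 + ["Moshi"] * 1 + ["Dhanori"] * 2)

MIN_LISTINGS = 3  # hypothetical cut-off, not a value from the notebook

# Localities with fewer listings than the cut-off.
counts = loc.value_counts()
rare = counts[counts < MIN_LISTINGS].index

# Keep frequent localities, relabel the rare ones as "Other".
grouped = loc.where(~loc.isin(rare), "Other")
print(grouped.value_counts())
```

With the top ten localities each holding 23 to 25 listings, few localities in this dataset may actually fall below a reasonable cut-off, so it is worth checking the full counts before applying the grouping.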

---

**7. Outliers (IQR method)**
- Detected using standard IQR thresholds (lower = Q1 − 1.5×IQR, upper = Q3 + 1.5×IQR).  
- Approximate counts: `rent` ≈ 41, `price_per_sqft` ≈ 73, `deposit` ≈ 40. `sqft` has none, since its upper bound (550 + 1.5 × 330 = 1,045) exceeds the maximum of 992.  
- **Decision:** Do **not** trim/cap — these outliers reflect real premium listings and should be preserved.
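
The IQR thresholds above can be sketched as two small helpers. The function names `bounds_from_quartiles` and `count_iqr_outliers` are illustrative, not from the original notebook; the quartiles passed in are the ones reported by `df.describe()`.

```python
import pandas as pd

def bounds_from_quartiles(q1, q3, k=1.5):
    """Return (lower, upper) IQR fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def count_iqr_outliers(s: pd.Series, k: float = 1.5) -> int:
    """Count values strictly outside the IQR fences of a Series."""
    lower, upper = bounds_from_quartiles(s.quantile(0.25), s.quantile(0.75), k)
    return int(((s < lower) | (s > upper)).sum())

# Fences computed from the quartiles in the df.describe() output.
rent_lower, rent_upper = bounds_from_quartiles(18000, 60000)
sqft_lower, sqft_upper = bounds_from_quartiles(220, 550)
pps_lower, pps_upper = bounds_from_quartiles(55.695, 133.33)
print(rent_upper, sqft_upper, round(pps_upper, 2))

# On the real frame, counts per column would be obtained with e.g.:
# {c: count_iqr_outliers(df[c]) for c in ["rent", "sqft", "deposit", "price_per_sqft"]}
```

The rent fence reproduces the ₹123,000 bound from section 2, and the PPS fence lands just under ₹250 / sqft, consistent with section 4. The lower fences are all negative, so only the upper side can flag listings in this dataset.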

