# House Price Analysis

## Import Libraries

In [3]:
# Import necessary libraries

import pandas as pd


## Load the Data

In [None]:
# Load the dataset (CSV file)
houses = pd.read_csv("house_prices.csv")


## Explore the Data

In [26]:
# Show first 5 rows
houses.head()


| | Index | Title | Description | Amount(in rupees) | Price (in rupees) | location | Carpet Area | Status | Floor | Transaction | ... | facing | overlooking | Society | Bathroom | Balcony | Car Parking | Ownership | Super Area | Dimensions | Plot Area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 BHK Ready to Occupy Flat for sale in Srushti... | Bhiwandi, Thane has an attractive 1 BHK Flat f... | 42 Lac | 6000.0 | thane | 500 sqft | Ready to Move | 10 out of 11 | Resale | ... | NaN | NaN | Srushti Siddhi Mangal Murti Complex | 1 | 2.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | 2 BHK Ready to Occupy Flat for sale in Dosti V... | One can find this stunning 2 BHK flat for sale... | 98 Lac | 13799.0 | thane | 473 sqft | Ready to Move | 3 out of 22 | Resale | ... | East | Garden/Park | Dosti Vihar | 2 | NaN | 1 Open | Freehold | NaN | NaN | NaN |
| 2 | 2 | 2 BHK Ready to Occupy Flat for sale in Sunrise... | Up for immediate sale is a 2 BHK apartment in ... | 1.40 Cr | 17500.0 | thane | 779 sqft | Ready to Move | 10 out of 29 | Resale | ... | East | Garden/Park | Sunrise by Kalpataru | 2 | NaN | 1 Covered | Freehold | NaN | NaN | NaN |
| 3 | 3 | 1 BHK Ready to Occupy Flat for sale Kasheli | This beautiful 1 BHK Flat is available for sal... | 25 Lac | NaN | thane | 530 sqft | Ready to Move | 1 out of 3 | Resale | ... | NaN | NaN | NaN | 1 | 1.0 | NaN | NaN | NaN | NaN | NaN |
| 4 | 4 | 2 BHK Ready to Occupy Flat for sale in TenX Ha... | This lovely 2 BHK Flat in Pokhran Road, Thane ... | 1.60 Cr | 18824.0 | thane | 635 sqft | Ready to Move | 20 out of 42 | Resale | ... | West | Garden/Park, Main Road | TenX Habitat Raymond Realty | 2 | NaN | 1 Covered | Co-operative Society | NaN | NaN | NaN |


- **Task:** Observe the data structure and column types.
- **Reflection:** Write a short note about what each column represents and any observations.

- **Title:** Short headline for the flat.
- **Description:** More detailed free-text description of the flat.
- **Amount(in rupees):** Asking price written in Indian numbering units (1 Lac = 100,000; 1 Cr = 10,000,000). Observation: stored as text such as `42 Lac`, so it needs to be converted to a number.
- **Price (in rupees):** A numeric price in Indian rupees. Observation: contains NaN values. In the rows shown it is far smaller than the amount (6000.0 against 42 Lac), so it looks like a per-square-foot rate rather than the total price.
- **location:** Area or city where the property is located.
- **Carpet Area:** Actual usable area inside the flat, in square feet.
- **Status:** Whether the flat is under construction or ready to move into.
- **Floor:** The flat's floor relative to the total number of floors in the building (for example, `10 out of 11`).
- **Transaction:** New property or resale. Observation: could help compare new and resale prices.
- **Furnishing:** How the property is equipped with furniture and fittings.
- **facing:** Direction the flat faces (East, West). Observation: contains NaN values.
- **overlooking:** Surroundings of the flat (garden, park, main road).
- **Society:** Housing complex where the flat is located. Observation: categorical feature.
- **Bathroom:** Number of bathrooms.
- **Balcony:** Number of balconies. Observation: contains NaN values.
- **Car Parking:** Number and type of parking spaces (open / covered).
- **Ownership:** Type of ownership (Freehold, Co-operative Society).
- **Super Area:** Total built-up area including walls, balconies, etc. Observation: contains NaN values.
- **Dimensions:** Observation: contains many NaN values.
- **Plot Area:** Land area. Observation: contains NaN values.
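
The note on `Amount(in rupees)` says the values need to be converted. A minimal sketch of one way to do that, assuming every value looks like `42 Lac` or `1.40 Cr`. The helper name `amount_to_rupees` and the sample series are illustrative and not part of the original notebook; the real column may contain other formats that this sketch simply maps to `None`.

```python
import pandas as pd

# Multipliers for the Indian numbering units seen in the first rows:
# 1 Lac = 100,000 and 1 Cr = 10,000,000.
UNIT_MULTIPLIERS = {"lac": 100_000, "cr": 10_000_000}


def amount_to_rupees(text):
    """Convert strings such as '42 Lac' or '1.40 Cr' to rupees.

    Hypothetical helper: returns None for anything it cannot parse.
    """
    if not isinstance(text, str):
        return None
    parts = text.strip().split()
    if len(parts) != 2:
        return None
    number, unit = parts
    multiplier = UNIT_MULTIPLIERS.get(unit.lower())
    if multiplier is None:
        return None
    try:
        return float(number) * multiplier
    except ValueError:
        return None


# Sample values copied from the first five rows of the dataset.
sample = pd.Series(["42 Lac", "98 Lac", "1.40 Cr", "25 Lac", "1.60 Cr"])
amount_rupees = sample.map(amount_to_rupees)
print(amount_rupees.tolist())
```

On the full dataset the same idea would be applied as `houses["Amount(in rupees)"].map(amount_to_rupees)`, followed by a check of how many rows came back empty.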

In [27]:
# Number of rows and columns
houses.shape

(187531, 21)

In [None]:
# Column names
houses.columns

In [None]:

# General info about all columns
houses.info()

`Carpet Area` and `Super Area` have dtype `object` (text), so they must be converted to numbers before they can be used in any numeric analysis.
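
A minimal sketch of such a conversion, assuming the values look like `500 sqft` as in the first rows. The sample series and the name `carpet_sqft` are illustrative; the full column may use other units, which this sketch would turn into NaN rather than convert.

```python
import pandas as pd

# Small stand-in for houses["Carpet Area"]; the real column is not loaded here.
carpet_area = pd.Series(["500 sqft", "473 sqft", None, "530 sqft"])

# Pull out the number that precedes 'sqft', drop thousands separators,
# and convert to float. Missing or non-matching rows become NaN.
numbers = carpet_area.str.extract(r"^\s*([\d.,]+)\s*sqft", expand=False)
carpet_sqft = pd.to_numeric(numbers.str.replace(",", ""), errors="coerce")

print(carpet_sqft)
```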


In [30]:
# Count of missing values in each column
houses.isna().sum().sort_values(ascending=False)

Plot Area            187531
Dimensions           187531
Society              109678
Super Area           107685
Car Parking          103357
overlooking           81436
Carpet Area           80673
facing                70233
Ownership             65517
Balcony               48935
Price (in rupees)     17665
Floor                  7077
Description            3023
Furnishing             2897
Bathroom                828
Status                  615
Transaction              83
Title                     0
location                  0
Amount(in rupees)         0
Index                     0
dtype: int64

In [25]:
# percentage of missing values
print("==== Percentage ====")
percentage = ( houses.isna().sum().sort_values(ascending=False) ) / len(houses) *100
print (percentage)

==== Percentage ====
Plot Area            100.000000
Dimensions           100.000000
Society               58.485264
Super Area            57.422506
Car Parking           55.114621
overlooking           43.425354
Carpet Area           43.018488
facing                37.451408
Ownership             34.936624
Balcony               26.094352
Price (in rupees)      9.419776
Floor                  3.773776
Description            1.612000
Furnishing             1.544811
Bathroom               0.441527
Status                 0.327946
Transaction            0.044259
Title                  0.000000
location               0.000000
Amount(in rupees)      0.000000
Index                  0.000000
dtype: float64


- **Task:** Review missing values.
- **Reflection:** Which columns have the most missing data? Which columns are mostly complete? How might missing data affect your analysis?

Columns with the most missing values:

- Plot Area: 100%
- Dimensions: 100%
- Society: 58.5%
- Super Area: 57.4%
- Car Parking: 55.1%
- overlooking: 43.4%

Most complete columns (percentage missing):

- Title: 0%
- location: 0%
- Amount(in rupees): 0%
- Transaction: 0.04%
- Status: 0.33%
- Price (in rupees): 9.4%

How missing data will affect the analysis:

- Columns that are 100% missing (Plot Area, Dimensions): these are empty, carry no useful information, and will be dropped.
- Columns with more than 50% missing:
  - Society (58.5%): "price by society" can only be analyzed for the remaining 41.5% of rows, so much of this categorical feature is lost.
  - Super Area (57.4%): limits any analysis of the relationship between Super Area, Carpet Area, and price.
  - Car Parking (55.1%): limits any analysis of the relationship between car parking and price.
- Columns with more than 40% missing:
  - overlooking (43.4%): can distort analysis of this categorical feature.
  - Carpet Area (43.0%): price per sqft cannot be calculated for these rows, and estimating Super Area from Carpet Area is affected.
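
The grouping above was done by hand. A minimal sketch of how the same banding could be computed, using a tiny made-up frame instead of the real dataset. The function name `missing_band` and the thresholds are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (not the real dataset).
df = pd.DataFrame(
    {
        "Title": ["a", "b", "c", "d"],
        "Society": ["x", np.nan, np.nan, np.nan],
        "Balcony": [1.0, np.nan, 2.0, 1.0],
        "Plot Area": [np.nan, np.nan, np.nan, np.nan],
    }
)

# isna().mean() gives the fraction of missing values per column.
missing_pct = df.isna().mean() * 100


def missing_band(pct):
    """Label a column by how much of it is missing (illustrative thresholds)."""
    if pct == 100:
        return "empty"
    if pct > 50:
        return "mostly missing"
    if pct > 0:
        return "partly missing"
    return "complete"


bands = missing_pct.map(missing_band)
print(pd.DataFrame({"missing_pct": missing_pct, "band": bands}))
```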


In [17]:
# Summary statistics for numerical columns
houses.describe()

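| | Index | Price (in rupees) | Dimensions | Plot Area |
|---|---|---|---|---|
| count | 187531.0 | 169866.0 | 0.0 | 0.0 |
| mean | 93765.0 | 7583.772 | NaN | NaN |
| std | 54135.681003 | 27241.71 | NaN | NaN |
| min | 0.0 | 0.0 | NaN | NaN |
| 25% | 46882.5 | 4297.0 | NaN | NaN |
| 50% | 93765.0 | 6034.0 | NaN | NaN |
| 75% | 140647.5 | 9450.0 | NaN | NaN |
| max | 187530.0 | 6700000.0 | NaN | NaN |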


- **Task:** Review the summary statistics for numeric columns.
- **Reflection:** Write a note on unusual values (like 0 or extremely high numbers) and columns that may need cleaning.
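
One common way to make "extremely high" concrete is Tukey's 1.5 × IQR rule. It is not part of the original notebook, so treat this as one possible convention. A minimal sketch using only the quartiles, minimum, and maximum that `describe()` printed for `Price (in rupees)`:

```python
# Values of 'Price (in rupees)' copied from the describe() output.
q1, q3 = 4297.0, 9450.0
price_min, price_max = 0.0, 6_700_000.0

# Interquartile range and the usual 1.5 * IQR fences.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print("max above upper fence:", price_max > upper_fence)
print("min is zero:", price_min == 0)
```

Under this rule the maximum of 6,700,000 lies far above the upper fence, and the minimum of 0 is worth checking because a price of zero is unlikely to be a real listing price.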

In [None]:
# Check for duplicate rows
houses.duplicated().sum()

np.int64(0)

## Data Cleaning

In [22]:
# Fill missing 'Price (in rupees)' with the mean



In [None]:
# Fill missing 'Carpet Area' with median (safer than mean for skewed data)

In [None]:

# Fill missing 'Status' with 'Unknown'

In [None]:
# Drop columns with mostly missing values (>80%)

In [None]:
# Show cleaned data

**Task:** Perform basic data cleaning as shown above.
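
The cleaning cells above contain only comments, so here is a minimal sketch of what those steps might look like. It runs on a tiny made-up frame that reuses the real column names; the values are illustrative, and the sketch assumes every non-missing `Carpet Area` is given in sqft.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with the same column names (not the real data).
df = pd.DataFrame(
    {
        "Price (in rupees)": [6000.0, 13799.0, np.nan, 18824.0],
        "Carpet Area": ["500 sqft", "473 sqft", None, "635 sqft"],
        "Status": ["Ready to Move", None, "Ready to Move", "Ready to Move"],
        "Plot Area": [np.nan, np.nan, np.nan, np.nan],
    }
)
cleaned = df.copy()

# 1) Fill missing prices with the column mean.
price_col = "Price (in rupees)"
cleaned[price_col] = cleaned[price_col].fillna(cleaned[price_col].mean())

# 2) 'Carpet Area' is text, so convert it to a number before taking the median.
area = pd.to_numeric(
    cleaned["Carpet Area"].str.extract(r"([\d.]+)", expand=False), errors="coerce"
)
cleaned["Carpet Area"] = area.fillna(area.median())

# 3) Fill missing 'Status' with a placeholder category.
cleaned["Status"] = cleaned["Status"].fillna("Unknown")

# 4) Drop columns where more than 80% of the values are missing.
missing_share = df.isna().mean()
cleaned = cleaned.drop(columns=missing_share[missing_share > 0.8].index)

print(cleaned)
```

Working on a copy keeps the original frame available for comparison, which helps when answering the reflection questions.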

**Reflection** Questions (write your answers):

- Which columns did you clean and why?

- Why did you fill Price (in rupees) with the mean instead of dropping rows?

- Why did you fill Carpet Area with the median?

- Why did you fill Status with 'Unknown'?

- Why did we drop columns with mostly missing values (>80%)?

- How do you think these cleaning steps will affect your analysis later?

## Filtering & Sorting Data

In [None]:
# Flats with Price less than 1 Cr

In [None]:
# Flats with Carpet Area greater than 600 sqft

In [None]:
# Sort by Price descending

In [None]:
# show each data after sorting

**Task:** Apply filtering and sorting.
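
A minimal sketch of the filtering and sorting steps on a small made-up frame. The columns `amount_rupees` and `carpet_sqft` are hypothetical numeric versions of `Amount(in rupees)` and `Carpet Area`; they do not exist in the raw dataset and would have to be created first.

```python
import pandas as pd

ONE_CRORE = 10_000_000  # 1 Cr expressed in rupees

# Illustrative frame; 'amount_rupees' and 'carpet_sqft' are hypothetical
# numeric columns derived from 'Amount(in rupees)' and 'Carpet Area'.
flats = pd.DataFrame(
    {
        "Title": ["A", "B", "C", "D"],
        "amount_rupees": [4_200_000, 9_800_000, 14_000_000, 16_000_000],
        "carpet_sqft": [500, 473, 779, 635],
    }
)

# A boolean mask keeps only the rows where the condition is True.
under_one_crore = flats[flats["amount_rupees"] < ONE_CRORE]
large_flats = flats[flats["carpet_sqft"] > 600]

# Sort from most to least expensive.
by_price_desc = flats.sort_values("amount_rupees", ascending=False)

print(under_one_crore, large_flats, by_price_desc, sep="\n\n")
```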

**Reflection** Questions:

- Which filters did you apply?

- What did you learn from the sorted data?

- Are there any patterns or outliers?

## Selecting Key Columns

In [29]:
# select the following columns from the cleaned dataset  Title , Price (in rupees), Carpet Area, Status, Floor, location


**Task:** Focus on key columns.
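
A minimal sketch of column selection, shown on a one-row made-up frame that has one extra column so the effect is visible. The variable names `key_columns` and `selected` are illustrative.

```python
import pandas as pd

key_columns = [
    "Title",
    "Price (in rupees)",
    "Carpet Area",
    "Status",
    "Floor",
    "location",
]

# Illustrative one-row frame with an extra column to show what gets left out.
df = pd.DataFrame(
    {
        "Title": ["1 BHK Flat"],
        "Price (in rupees)": [6000.0],
        "Carpet Area": ["500 sqft"],
        "Status": ["Ready to Move"],
        "Floor": ["10 out of 11"],
        "location": ["thane"],
        "Bathroom": [1],
    }
)

# Indexing with a list of names returns only those columns, in that order.
# .copy() makes later edits independent of the original frame.
selected = df[key_columns].copy()
print(selected.columns.tolist())
```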

**Reflection** Questions:

- Why did you select these columns?

- How might they relate to each other in analysis?

## Aggregations & Relationships

In [None]:
# Average price

In [None]:
# Average price by Location

In [None]:
# Min and Max carpet area

In [None]:

# Average Price by Status


In [None]:
# Average Carpet Area by Location

**Task:** Calculate aggregations and explore relationships.
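
A minimal sketch of these aggregations on made-up data. The column `carpet_sqft` is a hypothetical numeric version of `Carpet Area`, and the locations and prices are invented for illustration only.

```python
import pandas as pd

# Illustrative data; 'carpet_sqft' is a hypothetical numeric 'Carpet Area'.
df = pd.DataFrame(
    {
        "location": ["thane", "thane", "pune", "pune"],
        "Status": ["Ready to Move", "Unknown", "Ready to Move", "Ready to Move"],
        "Price (in rupees)": [6000.0, 14000.0, 5000.0, 7000.0],
        "carpet_sqft": [500.0, 700.0, 400.0, 600.0],
    }
)
price = "Price (in rupees)"

# Overall average, then averages per group with groupby.
average_price = df[price].mean()
price_by_location = df.groupby("location")[price].mean()
price_by_status = df.groupby("Status")[price].mean()
area_by_location = df.groupby("location")["carpet_sqft"].mean()

# Smallest and largest carpet area.
area_min, area_max = df["carpet_sqft"].min(), df["carpet_sqft"].max()

print(average_price, area_min, area_max)
print(price_by_location.sort_values(ascending=False))
```

Sorting a grouped result, as in the last line, makes it easier to spot which locations sit at the top and bottom.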

**Reflection** Questions:

- What patterns do you see between Price, Carpet Area, Status, and Location?

- Any surprising insights?

## General Questions on Data

In [None]:
# Display the first 10 rows of the cleaned dataset.



In [None]:

# Count the number of flats per Location.


In [None]:
# Find the top 5 most expensive flats

In [None]:
# Filter all flats where Status = Ready to Move


In [20]:
# Calculate median Price and compare with mean Price

In [None]:
# Explore relationship between Floor and Price
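
`Floor` is stored as text such as `10 out of 11`, so it has to be split before it can be compared with price. A minimal sketch, assuming that format holds; the column names `floor_number` and `total_floors` are illustrative, and any non-numeric floor label would become NaN here. The sample values are copied from the first five rows, which is far too few to draw conclusions from.

```python
import pandas as pd

# Values copied from the first five rows; only a tiny sample.
df = pd.DataFrame(
    {
        "Floor": [
            "10 out of 11",
            "3 out of 22",
            "10 out of 29",
            "1 out of 3",
            "20 out of 42",
        ],
        "Price (in rupees)": [6000.0, 13799.0, 17500.0, None, 18824.0],
    }
)

# Split '<floor> out of <total>' into two columns, then make them numeric.
parts = df["Floor"].str.extract(r"^(\S+) out of (\d+)$")
df["floor_number"] = pd.to_numeric(parts[0], errors="coerce")
df["total_floors"] = pd.to_numeric(parts[1], errors="coerce")

# Pearson correlation; rows with a missing price are left out automatically.
floor_price_corr = df["floor_number"].corr(df["Price (in rupees)"])

print(df[["floor_number", "total_floors"]])
print(floor_price_corr)
```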

## Now Write a Short Summary of Your Findings About the House Price Data

### **Note**
We will continue working with this dataset step by step. All the tasks you have done so far are preparing you for a Mini Data Analysis Project.

In the next tasks, we will:

- Apply more advanced visualizations.

- Explore relationships between columns more deeply.

- Answer questions and draw insights from the data.

Keep practicing with this cleaned dataset because it will be the base for all your upcoming exercises. 🥰