# DATA ANALYSIS

You and your classmates have collected data from various locations about houses, including their features (e.g., area, number of rooms, condition) and prices (house prices and rents). The objective is to help a fictional real estate firm analyze trends and patterns in the housing market to make decisions about property investment and pricing strategies.

As the Data Scientist, your task is to analyze this data using measures of central tendency, dispersion, and correlation analysis. Based on your findings, you will provide insights that the firm can use to make informed decisions.

#### Part 1: Central Tendency (Patterns in Data)
1- What is the typical house price and rent price in the dataset? Decide whether the mean, median, or mode is the most appropriate measure of central tendency for these variables, considering the presence of any outliers.

2- Compare the average number of rooms in houses across different locations. Which location tends to have the most spacious homes?

3- What is the most common condition of houses (e.g., Good, Fair, Excellent) in the data? Does the majority align with higher or lower prices?

#### Part 2: Dispersion (Variability in Data)
1- Which location shows the greatest variability in house prices? Use measures such as range, interquartile range (IQR), or standard deviation to support your conclusion.

2- Analyze the variability in rent prices across locations. Are rents more stable in some locations compared to others?

3- Compare the variability in house prices for houses with and without a garden. Do gardens significantly influence the consistency of pricing?

#### Part 3: Correlation (Relationships in Data)
1- Is there a relationship between the size of a house (area) and its house price? Use correlation analysis to identify whether larger houses are priced higher.

2- Investigate whether the number of rooms or the number of washrooms has a stronger correlation with rent prices.

3- Does the year of construction correlate with house prices? For example, are newer houses priced higher, or is the relationship weak?

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_excel("Housing_Updated.xlsx")

In [4]:
df.columns

Index(['Location', 'Area (Marla)', 'No of Rooms', 'No of Washrooms',
       'No of Kitchens', 'Garden', 'Garage', 'Balcony', 'Floors', 'Condition',
       'House Prices (Millions)', 'Rent Prices', 'Year of Construction'],
      dtype='object')

In [5]:
df.shape

(100, 13)

In [6]:
df.head()

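| | Location | Area (Marla) | No of Rooms | No of Washrooms | No of Kitchens | Garden | Garage | Balcony | Floors | Condition | House Prices (Millions) | Rent Prices | Year of Construction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Chatter | 6.0 | 4 | 3 | 1 | Yes | Yes | No | 1 | Good | 12 | 20000 | 2000.0 |
| 1 | Chatter | 8.0 | 6 | 3 | 1 | Yes | Yes | No | 1 | Fair | 20 | 40000 | 2019.0 |
| 2 | Chatter | 7.0 | 9 | 7 | 2 | No | No | No | 1 | Fair | 30 | 200000 | 1999.0 |
| 3 | Chehlla | 3.0 | 2 | 1 | 1 | Yes | Yes | No | 3 | Fair | 7 | 30000 | 2013.5 |
| 4 | Chehlla | 10.0 | 5 | 5 | 1 | No | Yes | Yes | 2 | Good | 15 | 60000 | 2015.0 |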


In [7]:
df.tail()

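| | Location | Area (Marla) | No of Rooms | No of Washrooms | No of Kitchens | Garden | Garage | Balcony | Floors | Condition | House Prices (Millions) | Rent Prices | Year of Construction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 95 | Gojra | 10.0 | 6 | 2 | 1 | No | No | No | 1 | Good | 8 | 9500 | 1980.0 |
| 96 | Ambore | 4.0 | 2 | 2 | 1 | No | No | No | 1 | Good | 8 | 9000 | 2022.0 |
| 97 | Bella Noor Shah | 5.0 | 5 | 2 | 1 | Yes | No | No | 2 | fair | 10 | 18000 | 1998.0 |
| 98 | Ambore | 9.0 | 11 | 8 | 3 | No | No | Yes | 3 | Good | 14 | 20000 | 2009.0 |
| 99 | Naloochi | 8.0 | 8 | 7 | 2 | No | No | Yes | 2 | Good | 18 | 20000 | 2022.0 |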


In [8]:
df.describe()

| | Area (Marla) | No of Rooms | No of Washrooms | No of Kitchens | Floors | House Prices (Millions) | Rent Prices | Year of Construction |
|---|---|---|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 9.845 | 6.59 | 3.46 | 1.56 | 1.62 | 13.01 | 30876.0 | 2010.93 |
| std | 11.705929 | 6.292685 | 2.226652 | 0.715203 | 0.72167 | 8.967388 | 55181.584763 | 10.16456 |
| min | 2.5 | 2.0 | 1.0 | 1.0 | 1.0 | 3.0 | 8000.0 | 1970.0 |
| 25% | 5.0 | 4.0 | 2.0 | 1.0 | 1.0 | 8.0 | 15000.0 | 2007.0 |
| 50% | 7.0 | 5.5 | 3.0 | 1.0 | 2.0 | 10.0 | 20000.0 | 2013.5 |
| 75% | 10.0 | 8.0 | 4.0 | 2.0 | 2.0 | 15.0 | 25000.0 | 2018.25 |
| max | 108.0 | 56.0 | 14.0 | 4.0 | 4.0 | 60.0 | 500000.0 | 2024.0 |


In [9]:
df.isnull().sum()

Location                   0
Area (Marla)               0
No of Rooms                0
No of Washrooms            0
No of Kitchens             0
Garden                     0
Garage                     0
Balcony                    1
Floors                     0
Condition                  0
House Prices (Millions)    0
Rent Prices                0
Year of Construction       0
dtype: int64

# Part 1: Central Tendency (Patterns in Data)

## Step 1: Typical House and Rent Prices

In [10]:
# Calculate mean, median, and mode for house and rent prices
house_price_mean = df['House Prices (Millions)'].mean()
house_price_median = df['House Prices (Millions)'].median()
house_price_mode = df['House Prices (Millions)'].mode()[0]

rent_price_mean = df['Rent Prices'].mean()
rent_price_median = df['Rent Prices'].median()
rent_price_mode = df['Rent Prices'].mode()[0]

# Display the results
house_price_mean, house_price_median, house_price_mode, rent_price_mean, rent_price_median, rent_price_mode

(13.01, 10.0, 10, 30876.0, 20000.0, 20000)

### Insights:

#### Mean:
The arithmetic average: 13.01 million for house prices and 30,876 for rent. The mean is pulled upward by a few very expensive properties (the maximum house price is 60 million and the maximum rent is 500,000), so it overstates what a typical property costs.
#### Median:
The middle value when all prices are sorted from lowest to highest: 10 million for house prices and 20,000 for rent. Because it depends only on the ordering of the values, it is barely affected by extreme prices.
#### Mode:
The most frequently occurring price: 10 million for house prices and 20,000 for rent. It shows the most common price point in the market, and here it coincides with the median.
#### Which measure to use?

Both variables are right-skewed: the mean is above the median in each case, and 75% of rents are at or below 25,000 while the maximum is 500,000. The median is therefore the most appropriate measure of a typical price in this dataset. The mean would be a reasonable choice only if prices were roughly symmetric, and the mode is a useful cross-check on the most common price.
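
One way to check for outliers before choosing a measure is the 1.5 × IQR rule. The sketch below is a minimal illustration, not part of the original notebook: the helper name `iqr_outliers` is hypothetical, and the sample contains only the ten rent values visible in `df.head()` and `df.tail()`, not the full dataset.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Illustrative sample: the ten rents shown in df.head() and df.tail().
sample_rents = pd.Series(
    [20000, 40000, 200000, 30000, 60000, 9500, 9000, 18000, 20000, 20000],
    name="Rent Prices",
)

flagged = iqr_outliers(sample_rents)
print("Flagged as outliers:", flagged.tolist())
print("Mean:", sample_rents.mean(), "Median:", sample_rents.median())
```

On the full data the same call would be `iqr_outliers(df['Rent Prices'])`. With the quartiles reported by `df.describe()` (15,000 and 25,000), the upper fence for rent is 25,000 + 1.5 × 10,000 = 40,000, so the 500,000 maximum lies far outside it.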

## Step 2: Average Number of Rooms Across Locations

In [12]:
# Group by location and calculate the mean number of rooms
rooms_by_location = df.groupby('Location')['No of Rooms'].mean().sort_values(ascending=False)

# Display the results
rooms_by_location

Location
Bela Noor Shah     12.400000
Gojra               9.187500
Airport             9.000000
Ambore              7.285714
Jalalabad           7.000000
Naloochi            6.875000
Bela noor shah      6.500000
Plate               6.461538
Tanga Stand         6.000000
Langerpura          6.000000
Chehlla Bandi       5.250000
Chehlla             5.200000
Chatter             5.142857
Chella Bandi        5.000000
Tarqabad            4.600000
Madina Market       4.400000
Bella Noor Shah     4.250000
Balapeer            3.000000
Shawai              2.000000
Domail Sayedian     2.000000
Name: No of Rooms, dtype: float64

### Insights:

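Bela Noor Shah has the highest average number of rooms (12.4), followed by Gojra (about 9.2) and Airport (9.0), while Shawai and Domail Sayedian have the lowest (2.0). These figures need care for two reasons. First, several locations appear under more than one spelling or capitalization ("Bela Noor Shah", "Bela noor shah", and "Bella Noor Shah"; "Chehlla Bandi" and "Chella Bandi"), so their houses appear to be split across groups, and the names should be standardized before the locations are ranked. Second, the maximum room count in the dataset is 56, and a single value like that can inflate the mean of a small group. Room count is also only a proxy for spaciousness; the area in marla is a more direct measure.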
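
The output above lists some locations under more than one spelling or capitalization, which splits their houses across groups. The sketch below shows one possible clean-up step before grouping. The room counts are hypothetical illustration values, and the `ALIASES` mapping is an assumption: whether "Bella Noor Shah" and "Chella Bandi" really are misspellings of the other names should be confirmed with whoever collected the data.

```python
import pandas as pd

# Illustrative rows only: spellings come from the output above,
# room counts are hypothetical.
sample = pd.DataFrame({
    "Location": ["Bela Noor Shah", "Bela noor shah ", "Bella Noor Shah",
                 "Chehlla Bandi", "Chella Bandi", "Gojra"],
    "No of Rooms": [12, 6, 4, 5, 5, 9],
})

# ASSUMPTION: these spellings refer to the same place.
ALIASES = {
    "Bella Noor Shah": "Bela Noor Shah",
    "Chella Bandi": "Chehlla Bandi",
}

def clean_location(names: pd.Series) -> pd.Series:
    """Trim whitespace, unify capitalization, then apply known aliases."""
    cleaned = names.str.strip().str.title()
    return cleaned.replace(ALIASES)

sample["Location"] = clean_location(sample["Location"])

# Report the group size next to the mean so small groups are visible.
rooms_by_location = (
    sample.groupby("Location")["No of Rooms"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(rooms_by_location)
```

Reporting `count` alongside the mean makes it easier to see when a high average rests on only one or two houses.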

## Step 3: Common Condition of Houses

In [13]:
# Calculate mode for house condition
house_condition_mode = df['Condition'].mode()[0]

# Group by condition and calculate the average house price
price_by_condition = df.groupby('Condition')['House Prices (Millions)'].mean()

# Display the results
house_condition_mode, price_by_condition

('Good',
 Condition
 Excellent    16.111111
 Fair         14.423077
 Good         11.654545
 Poor         15.000000
 fair         10.000000
 Name: House Prices (Millions), dtype: float64)

### Insights:

"Good" is the most common condition, but it does not align with the highest prices: houses in "Good" condition average about 11.65 million, below "Excellent" (16.11), "Poor" (15.00), and "Fair" (14.42). Only "Excellent" follows the expected pattern of better condition meaning a higher price. The high averages for "Fair" and "Poor" suggest that other factors, such as location and size, matter more than condition, or that some groups contain too few houses for a stable average. The label "fair" (lower case) is also counted separately from "Fair", so the Condition column should be standardized before these averages are relied on.
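
Because "fair" and "Fair" are reported as separate groups in the output above, the condition labels can be normalized before grouping. This is a minimal sketch on hypothetical rows, not the notebook's own code; it also reports the group size and median so that averages based on few houses stand out.

```python
import pandas as pd

# Illustrative rows only (hypothetical values, not the full dataset).
sample = pd.DataFrame({
    "Condition": ["Good", "Fair", "fair", "Good ", "Excellent", "Good"],
    "House Prices (Millions)": [12, 20, 10, 8, 25, 14],
})

# Trim whitespace and unify capitalization so "fair" and "Fair" merge.
sample["Condition"] = sample["Condition"].str.strip().str.capitalize()

condition_counts = sample["Condition"].value_counts()
price_by_condition = (
    sample.groupby("Condition")["House Prices (Millions)"]
    .agg(["count", "mean", "median"])
)

print(condition_counts)
print(price_by_condition)
```

On the real data the same two steps would be applied to `df['Condition']` before computing the mode and the per-condition averages.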

# Part 2: Dispersion (Variability in Data)

## Step 1: Variability in House Prices by Location

In [14]:
# Calculate standard deviation and range for house prices by location
price_variability = df.groupby('Location')['House Prices (Millions)'].agg(['std', 'min', 'max'])
price_variability['range'] = price_variability['max'] - price_variability['min']

# Display the results
price_variability.sort_values(by='std', ascending=False)

| Location | std | min | max | range |
|---|---|---|---|---|
| Jalalabad | 36.062446 | 9 | 60 | 51 |
| Bela Noor Shah | 18.005555 | 9 | 50 | 41 |
| Bela noor shah | 14.023789 | 6 | 35 | 29 |
| Chatter | 8.664694 | 5 | 30 | 25 |
| Ambore | 8.124038 | 7 | 30 | 23 |
| Madina Market | 8.105554 | 6 | 25 | 19 |
| Plate | 7.181708 | 7 | 30 | 23 |
| Naloochi | 6.18177 | 5 | 20 | 15 |
| Chehlla Bandi | 5.560276 | 7 | 20 | 13 |
| Gojra | 5.394055 | 3 | 20 | 17 |


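### Insights:

#### Standard Deviation:
This tells us how spread out the house prices are from the average price in each location. Jalalabad has the highest standard deviation (about 36.1 million), followed by Bela Noor Shah (18.0) and "Bela noor shah" (14.0), which appears to be the same place under a different capitalization. A high standard deviation means prices vary a lot, which could indicate a mix of high-end and low-end properties.
#### Range:
The difference between the highest and lowest prices. Jalalabad also has the widest range (51 million, from 9 to 60), so both measures point to the same location. The range depends on only two values, so a single extreme price can dominate it. Jalalabad's standard deviation equals 51/√2 ≈ 36.06, which is exactly what a group of two houses priced at 9 and 60 million would produce, so the group size should be checked before this result is relied on.
#### Which location to focus on?
Locations with high variability might be riskier but could also offer opportunities to buy low and sell high. More stable locations, such as Gojra and Chehlla Bandi in the table above, are safer but might offer less potential for big gains.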
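
The range and standard deviation are both sensitive to single extreme prices and to very small groups. The sketch below adds the interquartile range (IQR) and the group size to the same kind of summary. The location names "A" and "B", the prices, and the `MIN_HOUSES` threshold are all hypothetical and chosen only for illustration.

```python
import pandas as pd

def iqr(values: pd.Series) -> float:
    """Interquartile range: 75th percentile minus 25th percentile."""
    return values.quantile(0.75) - values.quantile(0.25)

# Hypothetical data: location "A" has two houses, location "B" has five.
sample = pd.DataFrame({
    "Location": ["A", "A", "B", "B", "B", "B", "B"],
    "House Prices (Millions)": [9, 60, 7, 8, 10, 12, 30],
})

price_spread = (
    sample.groupby("Location")["House Prices (Millions)"]
    .agg(["count", "std", iqr])
)

# ASSUMPTION: treat groups with fewer than 5 houses as too small to compare.
MIN_HOUSES = 5
reliable = price_spread[price_spread["count"] >= MIN_HOUSES]

print(price_spread)
print(reliable)
```

In this toy data, location "A" has the larger standard deviation, but it rests on only two houses, so it is filtered out before any comparison is made.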

## Step 2: Variability in Rent Prices Across Locations

In [15]:
# Similar approach for rent prices
rent_variability = df.groupby('Location')['Rent Prices'].agg(['std', 'min', 'max'])
rent_variability['range'] = rent_variability['max'] - rent_variability['min']

# Display the results
rent_variability.sort_values(by='std', ascending=False)

| Location | std | min | max | range |
|---|---|---|---|---|
| Gojra | 119046.975147 | 9500 | 500000 | 490500 |
| Bela noor shah | 89814.623902 | 10000 | 200000 | 190000 |
| Chatter | 48354.240584 | 11100 | 200000 | 188900 |
| Jalalabad | 42426.406871 | 40000 | 100000 | 60000 |
| Chehlla | 21288.494545 | 10000 | 60000 | 50000 |
| Bela Noor Shah | 13874.436926 | 15000 | 50000 | 35000 |
| Ambore | 10438.025721 | 9000 | 40000 | 31000 |
| Chehlla Bandi | 6454.972244 | 15000 | 30000 | 15000 |
| Madina Market | 6426.507605 | 8000 | 25000 | 17000 |
| Naloochi | 5926.634796 | 8000 | 30000 | 22000 |


### Insights:

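Rents are far less stable in some locations than in others. Among the locations shown, Gojra has by far the highest standard deviation (about 119,000), driven by its 500,000 maximum, which is also the largest rent in the whole dataset. "Bela noor shah" (about 89,800) and Chatter (about 48,400) follow, each with a maximum of 200,000. At the other end, Naloochi, Madina Market, and Chehlla Bandi all have standard deviations below 6,500, so rents there are comparatively stable and better suited to consistent rental income. Because one extreme rent can dominate the standard deviation of a small group, the high-variability locations should be checked for outliers or data-entry errors before the spread is read as a sign of changing demand or housing quality.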
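
Standard deviation is measured in the same units as rent, so a location with generally higher rents can look more variable than it is. One common way to compare stability across locations with different rent levels is the coefficient of variation (standard deviation divided by mean). The sketch below uses hypothetical locations and rents purely to illustrate the calculation.

```python
import pandas as pd

# Hypothetical rents: both locations have the same spread in absolute
# terms, but very different typical rent levels.
sample = pd.DataFrame({
    "Location": ["A"] * 4 + ["B"] * 4,
    "Rent Prices": [10000, 20000, 30000, 40000,
                    100000, 110000, 120000, 130000],
})

rent_stats = sample.groupby("Location")["Rent Prices"].agg(["mean", "std"])

# Coefficient of variation: spread relative to the typical rent level.
rent_stats["cv"] = rent_stats["std"] / rent_stats["mean"]

print(rent_stats.sort_values("cv", ascending=False))
```

Here both locations have the same standard deviation, yet rents in "A" vary much more relative to their level, which the `cv` column makes visible.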

## Step 3: Variability in Prices for Houses with and without Gardens

In [16]:
# Compare variability in prices for houses with and without gardens
garden_variability = df.groupby('Garden')['House Prices (Millions)'].agg(['std', 'min', 'max', 'mean'])

# Display the results
garden_variability

| Garden | std | min | max | mean |
|---|---|---|---|---|
| No | 8.515385 | 5 | 60 | 11.454545 |
| Yes | 9.231424 | 3 | 50 | 14.911111 |


### Insights:

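Houses with a garden are more expensive on average (about 14.9 million versus 11.5 million), and their prices are slightly more spread out in absolute terms (standard deviation 9.23 versus 8.52). Relative to the mean, however, the garden group is somewhat less variable: 9.23/14.91 ≈ 0.62 compared with 8.52/11.45 ≈ 0.74. The difference in spread is small either way, so the data do not show that gardens strongly influence the consistency of pricing; their clearer association is with a higher price level. A formal test of equal variances would be needed to say whether the difference is statistically significant.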
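
Whether the two standard deviations differ by more than chance would suggest can be checked with a permutation test. The sketch below is one simple, approximate way to do that with NumPy; it is not the method used in the notebook. The function name `spread_permutation_test` is hypothetical, and the sample uses only the ten house prices visible in `df.head()` and `df.tail()`, grouped by their Garden value.

```python
import numpy as np

def spread_permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Approximate p-value for 'both groups have the same spread'."""
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    # Centre each group on its own median so that a difference in price
    # level is not mistaken for a difference in spread.
    a = a - np.median(a)
    b = b - np.median(b)
    observed = abs(a.std(ddof=1) - b.std(ddof=1))
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].std(ddof=1)
                   - shuffled[a.size:].std(ddof=1))
        hits += int(diff >= observed)
    # Add-one correction keeps the estimated p-value above zero.
    return observed, (hits + 1) / (n_permutations + 1)

# House prices (millions) from the rows shown in df.head() and df.tail().
with_garden = [12, 20, 7, 10]
without_garden = [30, 15, 8, 8, 14, 18]

observed, p_value = spread_permutation_test(with_garden, without_garden)
print(f"difference in std: {observed:.2f}, p-value: {p_value:.3f}")
```

A large p-value would mean the observed difference in spread is compatible with chance. With groups this small the test has little power, so it is best run on the full `df` rather than on these ten rows.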

# Part 3: Correlation (Relationships in Data)

## Step 1: Correlation Between House Size and Price

In [17]:
# Calculate correlation between area and house price
area_price_corr = df['Area (Marla)'].corr(df['House Prices (Millions)'])

# Display the result
area_price_corr

0.5184812744651904

### Insights:

Correlation Coefficient: This number (between -1 and 1) measures the strength and direction of the linear relationship between two variables. The value here, about 0.52, is a moderate positive correlation: larger houses tend to be priced higher, but area alone accounts for only about 27% of the variation in price (0.52² ≈ 0.27), so other factors such as location and condition also matter. The area column contains at least one extreme value (a maximum of 108 marla against a median of 7), which can noticeably affect the coefficient.
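
`Series.corr` computes the Pearson coefficient by default, which is sensitive to extreme values such as the 108-marla maximum reported by `df.describe()`. A rank-based (Spearman) coefficient is a common cross-check. The sketch below computes it as the Pearson correlation of the ranks; the first five rows come from `df.head()`, and the sixth (108 marla, 60 million) is a hypothetical extreme house added only for illustration.

```python
import pandas as pd

def pearson_and_spearman(x: pd.Series, y: pd.Series):
    """Return (Pearson r, Spearman rho) for two numeric columns."""
    pearson = x.corr(y)
    # Spearman's rho is the Pearson correlation of the ranks.
    spearman = x.rank().corr(y.rank())
    return pearson, spearman

sample = pd.DataFrame({
    # First five values are from df.head(); the last row is hypothetical.
    "Area (Marla)": [6.0, 8.0, 7.0, 3.0, 10.0, 108.0],
    "House Prices (Millions)": [12, 20, 30, 7, 15, 60],
})

pearson, spearman = pearson_and_spearman(
    sample["Area (Marla)"], sample["House Prices (Millions)"]
)
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

If the two coefficients differ substantially on the full data, the Pearson value is probably being driven by a few extreme houses rather than by the bulk of the market.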

## Step 2: Correlation Between Rooms/Washrooms and Rent Price

In [18]:
# Calculate correlations
rooms_rent_corr = df['No of Rooms'].corr(df['Rent Prices'])
washrooms_rent_corr = df['No of Washrooms'].corr(df['Rent Prices'])

# Display the results
rooms_rent_corr, washrooms_rent_corr

(0.08378082940385487, 0.18004580151231933)

### Insights:

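We are comparing which feature (rooms or washrooms) has a stronger relationship with rent prices. The number of washrooms has the higher correlation with rent (about 0.18, against about 0.08 for the number of rooms), but both are weak: neither feature on its own accounts for much of the variation in rent. The extreme rent values (up to 500,000) may be masking an underlying relationship, so these coefficients are worth recomputing after the outliers have been reviewed. On this evidence, neither room count nor washroom count is a reliable basis for setting rents.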
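
When several candidate features are compared against the same target, a single correlation matrix is often more convenient than one `corr` call per pair. The sketch below ranks features by the absolute size of their correlation with rent. It uses only the ten rows visible in `df.head()` and `df.tail()`, so its numbers will not match the full-dataset results above.

```python
import pandas as pd

# The ten rows shown in df.head() and df.tail().
sample = pd.DataFrame({
    "No of Rooms": [4, 6, 9, 2, 5, 6, 2, 5, 11, 8],
    "No of Washrooms": [3, 3, 7, 1, 5, 2, 2, 2, 8, 7],
    "Rent Prices": [20000, 40000, 200000, 30000, 60000,
                    9500, 9000, 18000, 20000, 20000],
})

# Correlation of every feature with the target column.
corr_with_rent = sample.corr()["Rent Prices"].drop("Rent Prices")

# Order by strength (absolute value) while keeping the sign.
order = corr_with_rent.abs().sort_values(ascending=False).index
corr_with_rent = corr_with_rent.reindex(order)
print(corr_with_rent)
```

On the full data, the feature list could be extended with other numeric columns such as `'Area (Marla)'` or `'Floors'` to see which has the strongest association with rent.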

## Step 3: Correlation Between Year of Construction and House Prices

In [19]:
# Calculate correlation between year of construction and house price
year_price_corr = df['Year of Construction'].corr(df['House Prices (Millions)'])

# Display the result
year_price_corr

-0.018665074271904277

### Insights:

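The correlation between year of construction and house price is about -0.02, which is effectively zero: in this dataset newer houses are not priced higher than older ones, and construction year accounts for well under 1% of the variation in price (r² ≈ 0.0003). The age of a house therefore appears far less important than other features such as area. A coefficient near zero rules out only a linear relationship, so a scatter plot is still worth checking for other patterns.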

# Overall Summary
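The median (10 million for house prices, 20,000 for rent) is the best guide to typical prices, because both variables contain extreme values that inflate the mean. Price and rent variability differ widely between locations, with a few locations dominated by single very high values. Area has a moderate positive correlation with house price (about 0.52), while rooms and washrooms correlate only weakly with rent and year of construction shows no linear relationship with price. Before these results are used for investment or pricing decisions, the location and condition labels should be standardized and the extreme values reviewed. With those checks in place, the firm can use these patterns to decide where to invest, which features to focus on, and how to price properties competitively.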