# Melbourne Housing Snapshot data
https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot 

# About Dataset
## Context

Melbourne real estate is BOOMING. Can you find the insight or predict the next big trend to become a real estate mogul… or even harder, to snap up a reasonably priced 2-bedroom unit?

## Content

This is a snapshot of a dataset created by Tony Pino.

It was scraped from publicly available results posted every week from Domain.com.au. He cleaned it well, and now it's up to you to make data analysis magic. The dataset includes Address, Type of Real estate, Suburb, Method of Selling, Rooms, Price, Real Estate Agent, Date of Sale and distance from C.B.D.



## Notes on specific variables
Rooms: Number of rooms

Price: Price in dollars

Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.

Type: br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.

SellerG: Real Estate Agent

Date: Date sold

Distance: Distance from CBD

Regionname: General Region (West, North West, North, North east …etc)

Propertycount: Number of properties that exist in the suburb.

Bedroom2: Scraped number of bedrooms (from a different source)

Bathroom: Number of Bathrooms

Car: Number of carspots

Landsize: Land Size

BuildingArea: Building Size

CouncilArea: Governing council for the area

## Acknowledgements

This is intended as a static (unchanging) snapshot of https://www.kaggle.com/anthonypino/melbourne-housing-market. It was created in September 2017. Additionally, homes with no Price have been removed.

## Importing the dataset

In [None]:
# First, let's import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

# Next, load the dataset
df = pd.read_csv('data/melb_data.csv') # replace with your CSV's path

## Profiling the dataset

In [None]:
# To get a quick overview of the dataset we can use .info()
df.info()

In [None]:
# Let's also take a look at the first few rows of the data
df.head()

In [None]:
# Let's investigate the "Price" column
price = df['Price']

# We can get the minimum, maximum and average price as follows
min_price = price.min()
max_price = price.max()
average_price = price.mean()

print(f"Minimum Price: {min_price}")
print(f"Maximum Price: {max_price}")
print(f"Average Price: {average_price}")

### pandas Describe 
The describe method in pandas is a very powerful tool for quickly summarizing data, and it can be used with both numeric and non-numeric (also known as categorical or object) data types. 

For the numeric columns, describe() will provide the count, mean, standard deviation, minimum, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), and maximum.

For non-numeric columns, describe() will provide the count, unique (number of distinct objects in the column), top (most frequently occurring object), and freq (frequency of the most common object).

In [None]:
# Describe is a good method to quickly profile all numeric columns
df.describe()

In [None]:
df.describe(include=['object'])
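
To make the two kinds of summary concrete, here is a minimal sketch on a made-up five-row frame. The values are hypothetical and not taken from the Melbourne data, and the `Type` column is created with an explicit `object` dtype so that `include=['object']` selects it regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Melbourne dataset
toy = pd.DataFrame({
    "Rooms": [2, 3, 3, 4, 2],
    "Type": pd.Series(["u", "h", "h", "h", "t"], dtype=object),
})

# Numeric columns: count, mean, std, min, quartiles, max
numeric_summary = toy.describe()

# Non-numeric columns: count, unique, top, freq
object_summary = toy.describe(include=["object"])

print(numeric_summary.loc[["count", "mean", "50%"], "Rooms"])
print(object_summary.loc[["unique", "top", "freq"], "Type"])
```

Here `top` is the most frequent value in the column and `freq` is how often it occurs.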

## Profiling and visualizing
**A picture is worth a thousand words.** Visualizing your data while profiling is really powerful, and the data doesn't need to be perfectly clean to do it.

### Visualize the data with a histogram
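A histogram gives us a visual representation of the data distribution: the values are split into bins, and the height of each bar shows how many observations fall into that bin.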

In [None]:
# Let's create a histogram of the price
plt.figure(figsize=(10,6))
plt.hist(df['Price'], bins=50, edgecolor='black')
plt.title('Distribution of Prices')
plt.xlabel('Price ($)')
ticks = plt.xticks()[0]  # current tick positions, in dollars
plt.xticks(ticks=ticks, labels=[f'{x/1e6:g}M' for x in ticks])  # label the ticks in millions; :g keeps fractions such as 0.5M
plt.xlim(left=0)  # This line sets the left limit of the x-axis to 0
plt.show()
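
The bar heights in a histogram are just counts per bin. The sketch below shows that counting step on its own with `np.histogram`, using six made-up prices (hypothetical values, not from the dataset) and four equal-width bins.

```python
import numpy as np

# Hypothetical prices in dollars, not taken from the Melbourne dataset
prices = np.array([400_000, 450_000, 500_000, 700_000, 900_000, 1_200_000])

# Split the range from the smallest to the largest price into 4 equal-width bins
counts, edges = np.histogram(prices, bins=4)

# Print each bin with the number of prices that fall into it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>9,.0f} to {right:>9,.0f}: {count}")
```

Choosing more bins gives a finer picture but noisier bars, so it is worth trying a few values for `bins`.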

### Identifying outliers
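Outliers can be identified using a boxplot. The box spans the interquartile range (IQR, from Q1 to Q3), the line inside it marks the median, and the whiskers extend to the most extreme observations that still lie within 1.5 × IQR of the box. Points beyond the whiskers are drawn as individual dots and are treated as outliers.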

In [None]:
# Let's create a boxplot for price
plt.figure(figsize=(10,6))
plt.boxplot(df['Price'], vert=False)
plt.title('Boxplot of Prices')
plt.xlabel('Price ($)')
ticks = plt.xticks()[0]  # current tick positions, in dollars
plt.xticks(ticks=ticks, labels=[f'{x/1e6:g}M' for x in ticks])  # label the ticks in millions; :g keeps fractions such as 0.5M
plt.xlim(left=0)  # This line sets the left limit of the x-axis to 0
plt.show()
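
The 1.5 × IQR rule that the boxplot uses can also be applied numerically. This is a minimal sketch on eight made-up prices (hypothetical values, the last one deliberately extreme); the names `lower_fence` and `upper_fence` are only illustrative.

```python
import pandas as pd

# Hypothetical prices; the last value is deliberately extreme
prices = pd.Series([400_000, 550_000, 600_000, 650_000,
                    700_000, 800_000, 900_000, 3_000_000])

# Quartiles and interquartile range
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the box counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(f"Q1={q1:,.0f}  Q3={q3:,.0f}  IQR={iqr:,.0f}")
print(outliers)
```

Whether a flagged point is an error or simply an expensive property is a judgment call; the rule only tells you where to look.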

In [None]:
df.head()

# Exercise 1
**Now it is your turn!**

Preferably work in groups of 4. Two people create the plots for the number of rooms and two people for the landsize (Pair programming). Help each other if you get stuck.

- Rooms
    - Histogram
    - Boxplot
- Landsize
    - Histogram
    - Boxplot

In [None]:
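# TODO
# your Histogram
# Histogram of the number of rooms
plt.figure(figsize=(10, 6))
# Rooms is a whole number, so use one bin per value (edges at 0.5, 1.5, ...)
room_bins = [r + 0.5 for r in range(int(df['Rooms'].min()) - 1, int(df['Rooms'].max()) + 1)]
plt.hist(df['Rooms'], bins=room_bins, edgecolor='black')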
plt.title('Distribution of Number of Rooms')
plt.xlabel('Number of Rooms')
plt.ylabel('Frequency')
plt.show()

# Histogram of landsize
plt.figure(figsize=(10, 6))
plt.hist(df['Landsize'], bins=50, edgecolor='black')
plt.title('Distribution of Landsize')
plt.xlabel('Landsize')
plt.ylabel('Frequency')
plt.show()


In [None]:
# TODO 
# your Boxplot
# Boxplot of the number of rooms
plt.figure(figsize=(10, 6))
plt.boxplot(df['Rooms'], vert=False)
plt.title('Boxplot of Number of Rooms')
plt.xlabel('Number of Rooms')
plt.show()

# Boxplot of landsize
plt.figure(figsize=(10, 6))
plt.boxplot(df['Landsize'], vert=False)
plt.title('Boxplot of Landsize')
plt.xlabel('Landsize')
plt.show()


# Completeness
Missing values in your data can sometimes cause a lot of issues. 

In data science, identifying missing values and checking data completeness is vital for accurate analysis, reliable results, and data integrity. Missing values can bias statistical analysis, affect predictive models, and signal data quality issues. By addressing missing values, data scientists can ensure data quality, make informed decisions about imputation or removal, and derive meaningful insights.



**QUESTION**

What would you do with incomplete data?

- Fill in the missing data?
    - Which value would you choose?
- Drop the columns with missing values?
- Drop the rows with missing values?
- Leave them empty?


In [None]:
columns_to_check = ['Date', 'Price', 'Rooms', 'Landsize', 'BuildingArea', 'YearBuilt']
missing_values = df[columns_to_check].isnull().sum()

print("Missing values:")
print(missing_values)

### Dangers of missing values
In dashboarding and data science, missing values can lead to problems such as biased analysis, incomplete insights, distorted visualizations, impaired predictive models, data integrity concerns, and user misinterpretation. Handling missing values properly is crucial to ensure accurate and reliable analysis, maintain data integrity, and prevent misleading or incomplete results.

- Machine Learning models often can't handle missing values. Either drop them or fill them
- Users can misinterpret your data if the missing values are not clearly communicated
- Missing values can indicate underlying data quality issues

To mitigate these issues, it is crucial to handle missing values appropriately by employing techniques such as data imputation, removal of incomplete cases, or utilizing specialized models capable of handling missing values. Proper documentation and transparency regarding the handling of missing values in the dashboard are also essential to ensure data integrity and user understanding.
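
The options above (dropping rows, dropping columns, or filling values) can be sketched as follows. The frame is made up for illustration, and both the 40% threshold and the choice of the median as fill value are assumptions, not recommendations for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps, not taken from the Melbourne dataset
toy = pd.DataFrame({
    "Price": [500_000, 750_000, 620_000, 910_000],
    "BuildingArea": [120.0, np.nan, 95.0, np.nan],
    "YearBuilt": [1990.0, 2005.0, np.nan, 1970.0],
})

# Share of missing values per column (0.0 = complete, 1.0 = empty)
missing_share = toy.isnull().mean()

# Option 1: drop every row that has at least one missing value
rows_dropped = toy.dropna()

# Option 2: drop columns with more than 40% missing (threshold is an assumption)
cols_dropped = toy.loc[:, missing_share <= 0.4]

# Option 3: fill each gap with the median of its column
filled = toy.fillna(toy.median())

print(missing_share)
print(len(rows_dropped), list(cols_dropped.columns))
print(filled)
```

Note how much data each option costs here: dropping rows keeps only one of four sales, while filling keeps every row but invents values that were never observed.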

# Correlations
**Importance of Finding Correlations in Data**

Identifying correlations in data is crucial for data science. Correlations reveal relationships between variables, providing valuable insights and aiding decision-making. They help discover patterns, select meaningful features, guide data preprocessing, and inspire new hypotheses. By leveraging correlations effectively, data scientists gain a deeper understanding of the data and enhance their analysis.

In [None]:
columns_to_correlate = ['Rooms', 'Price', 'Landsize']
correlation_matrix = df[columns_to_correlate].corr()

print("Correlation matrix:")
print(correlation_matrix)

## Description of the Correlation Matrix

The correlation matrix shows the pairwise linear relationships between the variables. In this case:
- "Rooms" and "Price" have a moderate positive correlation (approximately 0.497). More rooms tend to be associated with higher prices.
- "Rooms" and "Landsize," as well as "Price" and "Landsize," have very weak positive correlations (approximately 0.026 and 0.038, respectively). Values this close to zero mean there is practically no linear relationship between land size and either the number of rooms or the price.

These correlations provide insights into how the variables are related.
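
By default, `.corr()` computes the Pearson correlation coefficient. As a minimal sketch of what that number means, the block below computes it by hand on five made-up rows (hypothetical values) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Melbourne dataset
toy = pd.DataFrame({
    "Rooms": [1, 2, 3, 4, 5],
    "Price": [300_000, 450_000, 500_000, 700_000, 650_000],
})

x = toy["Rooms"].to_numpy(dtype=float)
y = toy["Price"].to_numpy(dtype=float)

# Deviations from the mean of each variable
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson r: co-variation divided by the product of the individual spreads
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# The same coefficient as computed by pandas
r_pandas = toy["Rooms"].corr(toy["Price"])

print(f"manual: {r_manual:.4f}  pandas: {r_pandas:.4f}")
```

The coefficient always lies between -1 and 1, and it only measures linear association, so a value near zero does not rule out a non-linear relationship.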

# Exercise 2
Are the date (of sale) and price correlated?

- Try to find if there is a correlation
- *hint: make a plot*

In [None]:
# Exercise 2 TODO

# Parse the sale dates (assumed to be day/month/year text) so they plot in chronological order
sale_dates = pd.to_datetime(df['Date'], dayfirst=True)

# Scatter plot of Date vs. Price
plt.figure(figsize=(10, 6))
plt.scatter(sale_dates, df['Price'], alpha=0.2)
plt.title('Date vs. Price')
plt.xlabel('Date')
plt.ylabel('Price')

# Rotate and align the tick labels to increase readability
plt.xticks(rotation=45, ha='right')

plt.tight_layout()  # Adjust the layout to prevent overlapping
plt.show()

In [None]:
# Group by sale date (parsed as day/month/year so the groups sort chronologically)
# and calculate the average price per date
date_price_avg = df.groupby(pd.to_datetime(df['Date'], dayfirst=True))['Price'].mean()

# Line plot of Date vs. Average Price
plt.figure(figsize=(10, 6))
plt.plot(date_price_avg.index, date_price_avg.values, '-o', markersize=3)
plt.title('Date vs. Average Price')
plt.xlabel('Date')
plt.ylabel('Average Price')

# Rotate and align the tick labels to increase readability
plt.xticks(rotation=45, ha='right')

plt.tight_layout()  # Adjust the layout to prevent overlapping
plt.show()


In [None]:
import numpy as np  # installed automatically as a pandas dependency
# It's also possible to add a trendline to the data points

# Group by sale date (parsed as day/month/year so the groups sort chronologically)
# and calculate the average price per date
date_price_avg = df.groupby(pd.to_datetime(df['Date'], dayfirst=True))['Price'].mean()

# Line plot of Date vs. Average Price
plt.figure(figsize=(10, 6))
plt.plot(date_price_avg.index, date_price_avg.values, '-o', markersize=3, label='Average Price')

# Add a trendline: fit a straight line to the average price against days since the first sale date
days = (date_price_avg.index - date_price_avg.index[0]).days.to_numpy()
slope, intercept = np.polyfit(days, date_price_avg.values, 1)
trendline = slope * days + intercept
plt.plot(date_price_avg.index, trendline, color='red', label='Trendline')

plt.title('Date vs. Average Price')
plt.xlabel('Date')
plt.ylabel('Average Price')
plt.legend()

# Rotate and align the tick labels to increase readability
plt.xticks(rotation=45, ha='right')

plt.tight_layout()  # Adjust the layout to prevent overlapping
plt.show()
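
A trendline like the one above is a least-squares straight-line fit, and its slope is a number you can read directly. The sketch below fits four made-up average prices (hypothetical values and dates) against the number of days since the first date, so the slope is in dollars per day.

```python
import numpy as np
import pandas as pd

# Hypothetical average prices on four irregularly spaced sale dates
avg_price = pd.Series(
    [1_000_000, 1_020_000, 1_010_000, 1_060_000],
    index=pd.to_datetime(["2017-01-07", "2017-01-14", "2017-02-04", "2017-02-25"]),
)

# Use elapsed days as x so that uneven gaps between dates are respected
days = (avg_price.index - avg_price.index[0]).days.to_numpy()

# Degree-1 polynomial fit: price is approximately slope * days + intercept
slope, intercept = np.polyfit(days, avg_price.to_numpy(), 1)

print(f"Trend: {slope:,.0f} dollars per day, starting near {intercept:,.0f}")
```

A positive slope suggests rising average prices over the period, but with so few points and so much spread per date it should be read as a rough indication rather than proof of a trend.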

## Groupby + agg
The groupby() function in pandas is used to group data based on one or more columns. It allows you to split the dataset into groups based on unique values in the specified column(s).

The agg() function, short for "aggregation," is then applied to calculate summary statistics or perform computations within each group. It enables you to apply one or more aggregation functions, such as mean(), sum(), max(), min(), or custom functions, to derive aggregated results for each group.

By using groupby() in combination with agg(), you can efficiently analyze and summarize data across different categories or segments. It enables you to calculate statistics, extract key insights, and understand the relationships between variables within each group.

This combination is particularly useful when you want to analyze and compare data across groups, perform calculations based on specific criteria, or generate aggregated results for further analysis or visualization.

In [None]:
region_price_avg = df.groupby('Regionname')['Price'].agg('mean') 
region_price_avg

In [None]:
# it is also possible to use a list of aggregation methods
region_price_stats = df.groupby('Regionname')['Price'].agg(['count', 'mean', 'min', 'max'])
region_price_stats
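
`agg()` also accepts custom functions, and keyword arguments let you name the output columns. This is a minimal sketch on made-up data; the region labels, prices, and the helper `price_range` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Melbourne dataset
toy = pd.DataFrame({
    "Regionname": ["Northern", "Northern", "Southern", "Southern", "Southern"],
    "Price": [600_000, 800_000, 900_000, 1_200_000, 1_500_000],
})

def price_range(prices):
    """Custom aggregation: gap between the most and least expensive sale."""
    return prices.max() - prices.min()

# Each keyword becomes a column name; the value is the aggregation to apply
summary = toy.groupby("Regionname")["Price"].agg(
    sales="count",
    median_price="median",
    price_spread=price_range,
)
print(summary)
```

Naming the outputs this way keeps the result readable when it is passed on to a plot or a report.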

# Exercise 3
Explore the relationship between number of rooms and price.

Use:

- .groupby()
- Make a plot

In [None]:
# TODO EXERCISE 3 

# Group by the number of rooms and calculate the average price
rooms_price_avg = df.groupby('Rooms')['Price'].mean()

# Create a plot of the average price based on the number of rooms
plt.figure(figsize=(10, 6))
plt.plot(rooms_price_avg.index, rooms_price_avg.values, '-o')
plt.title('Number of Rooms vs. Average Price')
plt.xlabel('Number of Rooms')
plt.ylabel('Average Price')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Group by Regionname and Rooms, and calculate the average price
region_rooms_price_avg = df.groupby(['Regionname', 'Rooms'])['Price'].mean()

# Get unique Regionname categories
region_names = df['Regionname'].unique()

# Create subplots for each Regionname category
fig, axs = plt.subplots(len(region_names), figsize=(10, 6*len(region_names)), sharey=True)
fig.suptitle('Average Price vs. Number of Rooms by Regionname')

# Iterate over each Regionname and create a plot
for i, region_name in enumerate(region_names):
    # Filter data for the current Regionname
    region_data = region_rooms_price_avg[region_name]
    
    # Extract number of rooms and average price
    num_rooms = region_data.index
    avg_price = region_data.values
    
    # Create a plot for the current Regionname
    axs[i].plot(num_rooms, avg_price, '-o')
    axs[i].set_title(region_name)
    axs[i].set_xlabel('Number of Rooms')
    axs[i].set_ylabel('Average Price')

plt.tight_layout()
plt.show()