# Assumptions & Approach - Group 15

## Approach

The main goal of our project was to forecast rental prices across Victoria for the next three years. We tailored our approach to our primary stakeholders: potential renters, real estate investors, property managers, and urban planners. These groups rely on rental price predictions to make informed decisions about livelihood, investment, development, and urban infrastructure planning, so we took deliberate steps to ensure that our modelling addressed their specific concerns.

1. **Data Collection and Processing:**

    We sourced data from multiple platforms to identify features that could affect the rental price of a property.
    
    * **Rental Property Data from Domain:**
    We scraped data from the [domain.com.au](https://www.domain.com.au) website. This gave us information on properties currently on the market, including their availability, pricing, and other features, which are of particular interest to renters, property managers, and investors.
    
    * **Historical Median Rent Pricing:**
    We accessed this data from the [Department of Families, Fairness, and Housing](https://www.dffh.vic.gov.au/). It provides historical median rent prices for different suburbs, which we were unable to obtain from Domain. This was essential for forecasting, as it let us model long-term trends in rental prices.

    * **Crime Data:**
    Crime severity and crime rates were factored in, as safety is a critical concern for both renters and property managers alike. This data was obtained from the [Crime Statistics Agency](https://www.crimestatistics.vic.gov.au/).

    * **Population and Income Data for Demographics:**
    Population and income data was taken from the 2016 and 2021 Census as we wanted to account for the population growth, affordability, and rental demand across different suburbs, which are essential in determining future rental conditions.

    * **Urban Landmarks and Public Transport:**
    Using openrouteservice, we calculated the distances between rental properties and public transport hubs, as well as key urban landmarks such as shopping centers and employment hubs. These factors are crucial for assessing the desirability of rental properties for future tenants. Data for the landmarks and public transport options were sourced from OpenStreetMap.

    * **Schools and Education:**
    The proximity and rankings for educational institutions were considered, as rental demand from students and families wanting to live near a well-regarded school is a significant factor for stakeholders.


2. **Data Imputation and Forecasting:**

    Some datasets, such as population, income, and crime rates, lacked future projections. We therefore used statistical forecasting techniques to extrapolate these values and fed them into our model, enabling stakeholders to anticipate changes that may influence rental trends.


3. **Feature Engineering:**

    Feature engineering was a crucial step in preparing the data for modelling and ensuring that key stakeholder priorities were reflected. For example, real estate investors and property managers are highly interested in a property's proximity to public transport as well as crime rates, as these factors significantly influence rental demand. 
    
    Population growth and income data were engineered to capture the economic conditions most relevant to urban planners and local government.


4. **Model Fitting:**



## Assumptions

The following are the assumptions we made for our project:

### Education

1. School density representing education quality: we assume that a higher density of educational institutions (primary, secondary, and tertiary) indicates better educational opportunities, regardless of the actual quality or performance of those institutions.

2. Relatively static analysis over time: our analysis is based on data from 2023, making the assumption that the educational landscape remains mostly stable. We presume that key factors such as funding, policies, and infrastructure development will not experience significant changes during the analysis period. 

3. Influence of non-academic factors: it is assumed that residents of the suburb have full access to schools within that suburb. That is, external factors such as catchment zones and private versus public school policies are not taken into consideration.
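
The density measure in assumption 1 can be sketched as a simple count of institutions per suburb, normalised by suburb size. This is a minimal illustration, not the project's actual code: the field name `suburb`, the function `education_density`, and the choice of area in km² as the denominator are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def education_density(
    schools: Iterable[Mapping[str, str]],
    suburb_area_km2: Mapping[str, float],
) -> Dict[str, float]:
    """Institutions per square kilometre for each suburb (hypothetical helper).

    Every institution counts equally, whatever its stage or performance,
    which mirrors assumption 1 above.
    """
    counts = Counter(school["suburb"] for school in schools)
    return {
        suburb: counts[suburb] / area
        for suburb, area in suburb_area_km2.items()
        if area > 0  # skip suburbs with a missing or zero area
    }


# Made-up example data, for illustration only.
schools = [
    {"name": "School 1", "suburb": "Suburb A", "stage": "primary"},
    {"name": "School 2", "suburb": "Suburb A", "stage": "tertiary"},
    {"name": "School 3", "suburb": "Suburb B", "stage": "secondary"},
]
areas = {"Suburb A": 2.0, "Suburb B": 5.0, "Suburb C": 4.0}
density = education_density(schools, areas)
```

A suburb with no recorded institutions gets a density of 0 rather than being dropped, so it can still be compared with the others.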

### Domain Data
1. We expressed all rents as weekly rent, using a 365-day year:
    * Monthly to weekly rent: multiply by 84 and divide by 365. This inverts the monthly rent formula, in which monthly rent is weekly rent divided by 7 (days), multiplied by 365 (days), and divided by 12 (months).
    * Yearly to weekly rent: multiply by 7 and divide by 365.

2. We assumed that the minimum monthly rent should be $500; anything below that is most likely a car space or storage option.

3. We assumed a maximum weekly rent of $5000 and removed properties above that.
    * If the rental price does not specify weekly or monthly and is less than $5000, we assume that it is weekly.
    * As a result, we removed properties that are rented 'by the season'.
    
4. We determine whether a property is furnished from the description or the initial feature list. From the feature list, we only treat a property as furnished if the feature is verified (i.e. no asterisk), because Domain marks features with * as unverified.

5. We assumed that if the feature list did not specify 'Pets Allowed', then a property would not allow pets.

6. When a listing specifies prices for both furnished and unfurnished options, we always take the furnished price, which we take to be the higher number (assuming that furnished properties are always more expensive than unfurnished ones).

7. For weekly rent that specified a range (e.g. 650 - 700 a week), we took the average of the two numbers. 

8. We also removed properties that are rented per night, as we only consider properties rented for long-term stays.

9. We decided to map different property types into just 2 categories, House and Apartment.
    * Houses: House, Townhouse, Villa, New House & Land, Semi-Detached, Duplex, Terrace
    * Apartments: Apartment / Unit / Flat, Studio, New Apartments / Off the Plan

10. We removed properties that do not specify rental price, and do not have any information on the number of rooms (bed and bath). We also removed properties with 0 beds or 0 baths. 

11. We removed properties that had more than 12 parking spaces. We manually inspected the properties with more than 5 parking spaces and found that most of them were still within a reasonable range, whereas properties with more than 12 parking spaces were outliers.
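
The unit conversions and thresholds in items 1, 2, 3, 7, and 9 can be sketched as follows. This is a hedged illustration rather than the project's pipeline: `to_weekly`, `clean_weekly_rent`, and `PROPERTY_TYPE_MAP` are hypothetical names, averaging is applied to any stated range (item 7 only describes it for weekly ranges), and the furnished/unfurnished rule in item 6 is not handled.

```python
from typing import Optional, Sequence

DAYS_PER_YEAR = 365
MIN_MONTHLY_RENT = 500   # item 2: below this is likely a car space or storage
MAX_WEEKLY_RENT = 5000   # item 3: listings above this are removed

# Item 9: collapse Domain property types into two categories.
PROPERTY_TYPE_MAP = {
    "House": "House", "Townhouse": "House", "Villa": "House",
    "New House & Land": "House", "Semi-Detached": "House",
    "Duplex": "House", "Terrace": "House",
    "Apartment / Unit / Flat": "Apartment", "Studio": "Apartment",
    "New Apartments / Off the Plan": "Apartment",
}


def to_weekly(amount: float, period: str) -> float:
    """Convert a rent amount to weekly rent using a 365-day year (item 1)."""
    if period == "weekly":
        return amount
    if period == "monthly":
        return amount * 84 / DAYS_PER_YEAR  # 84 = 7 days x 12 months
    if period == "yearly":
        return amount * 7 / DAYS_PER_YEAR
    raise ValueError(f"unknown period: {period!r}")


def clean_weekly_rent(
    amounts: Sequence[float], period: Optional[str]
) -> Optional[float]:
    """Return the weekly rent, or None when the listing should be removed.

    `period` is None when the listing states no period; such listings are
    assumed to be weekly (item 3).
    """
    if not amounts:
        return None  # item 10: no rental price given
    amount = sum(amounts) / len(amounts)  # item 7: average a stated range
    if period is None:
        period = "weekly"
    if period == "monthly" and amount < MIN_MONTHLY_RENT:
        return None
    weekly = to_weekly(amount, period)
    if weekly > MAX_WEEKLY_RENT:
        return None
    return round(weekly, 2)
```

For instance, `to_weekly(800, "monthly")` gives about 184.11, and a listing of "650 - 700 a week" becomes 675.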

In [None]:
# Month only
# properties[(properties['is_monthly'] == True) & (properties['is_weekly'] == False) & (properties['rent_nums'].apply(lambda x: len(x) > 1))]

# Weekly only
# properties[(properties['is_monthly'] == False) & (properties['is_weekly'] == True) & (properties['rent_nums'].apply(lambda x: len(x) > 1))][['rental_price', 'bond', 'price_furnished', 'features_furnished', 'desc_furnished', 'furnished', 'rent_nums', 'weekly_rent']]

# Both (most pcm == bond, when we removed bond we removed most pcm but it's fine cus we took the minimum for is_both)
# properties[(properties['is_both'] == True) & (properties['rent_nums'].apply(lambda x: len(x) == 1))].head(20)

# Bond (shows why we needed to remove bond) --> lost 2 records tho (properties[properties['weekly_rent'].isna()])
# properties[properties['rental_price'].str.contains('bond',case=False)]
 
# Not indicated
# properties[(properties['not_indicated'] == True) & (properties['rent_nums'].apply(lambda x: len(x) > 1))]

# 1 wack record:
# rental price = '$800 1,2 or 3 Month Lease' --> weekly rent = 184.1
# properties[properties['rental_price'].str.contains('month lease', case=False)]

####################################################################################################################################################################################

# there's an entry priced at 800 that is treated as monthly because of the "1,2,3 month lease" wording
# simplest fix: drop this one property

# if contains_both_furnished_options = True, take the maximum of rent_nums & output to final_rent (new column)

# if not_indicated == True 
    # if len(rent_nums) > 1, we take the average
        # issue with one property: $600/$2607 
    # if len == 1
        # if rent_nums[0] > 5000 (we assume per week), we remove
        # else output to final_rent

# if monthly = True & weekly = False
    # if length rent_nums > 1,
    #   we take the max of rent_nums, x 12 / 52, and output to final_rent
    # else if len == 1
        # if rent_nums[0] < 500, remove (most likely to be car park / storage)
        # else x 12 / 52 and output to final_rent

# if weekly = True & monthly = False
    # if len rent_nums > 1, 
    #   we take the minimum of rent_nums and output to final_rent (bc some rental prices include bond UGHH)
    # else we output to final_rent

    # if less than 125, we remove

# if is_both = True,
    # we take the minimum of rent_nums and output to final_rent


# if len rent_nums == 2
    # if is_both == False:
        # take average of rent_nums and output to final_rent
    # if is_both == True:
        # take minimum

####################################################################################################################################################################################

# if none & if less than 5000, assume weekly

    # remove greater than 5000

    # if pcm, x 12 / 52

    # if weekly, just take the number

    # if both, take the minimum

# find numerical values (if 1 and less than 5000, assume pw)

# if month, keep just the number

# if contains both furnished options, take the max

# If not_indicated = True & less than 5000, then we assume weekly

####################################################################################################################################################################################

## Assumptions

# Rental price only have numbers --> assume weekly for these

# get suburbs and postcode for address (DONE)
# rental_price per week
# split rooms into bed and bath (DONE)
# parking: list format into just number (DONE)
# features: furnished, pets (DONE)

# * is unconfirmed --> assume 0
# just 0's and 1's for all one-hot encoded columns
# number of features would be len of features list
# check if furnished from description (only look for 'furnished' / 'unfurnished' in description)

# keep house & apartment
# townhouse in house
# new house & land to house
# semi detached --> house
# villa in house
# duplex to house

# studio in apt
# new apartments to apts

# if contain furnished --> flag, unless the str also contains 'extra'

# pw, per week, /w, weekly, a week, p.w, p.w., week, / week, /week, wk, / wk, 
# some just have the number --> assume weekly for these
# p.c.m., per month, /month, pcm, calendar month, pm

#################################################################################################################################################################################

# 5000 max per week

# $630 bills and wifi included
# could contain furnished / 'fully furnished' / 'partly-furnished', 'furnish', 'furnished!', 'fully furn'
# 'furnished option extra pw'

# problems: rent2own, for winter season, for the season, per night (get rid of these) (DONE)
# (starting) from xx per month
# $786 - $1572 (got rid of, parking > 12)

# for calendar month: x 12 / 52? 

# Assume if rent doesn't have like p/w or pcm, then if its less than 5000 --> pw, if greater than 5000 --> per month or per year?
# 95000 per year

# Get rid of all rental prices above 5000
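
The period keywords and the furnished rule listed in the notes above can be sketched with two small helpers. The regular expressions, the function names `detect_period` and `is_furnished`, and the returned labels are assumptions for illustration; real listings would likely need more patterns than these.

```python
import re
from typing import Optional

# Keywords from the notes: pw, per week, /w, weekly, a week, p.w., wk, ...
WEEKLY_RE = re.compile(r"week|(?<![a-z])(?:p\.?w|wk)\b|/\s*w\b", re.IGNORECASE)
# Keywords from the notes: p.c.m., per month, /month, pcm, calendar month, pm
MONTHLY_RE = re.compile(
    r"per month|/\s*month|calendar month|(?<![a-z])(?:p\.?c\.?m|pm)\b",
    re.IGNORECASE,
)


def detect_period(rental_price: str) -> Optional[str]:
    """Return 'weekly', 'monthly', 'both', or None when no period is stated."""
    weekly = bool(WEEKLY_RE.search(rental_price))
    monthly = bool(MONTHLY_RE.search(rental_price))
    if weekly and monthly:
        return "both"
    if weekly:
        return "weekly"
    if monthly:
        return "monthly"
    return None  # the notes treat this case as weekly when under 5000


def is_furnished(text: str) -> bool:
    """Flag text that mentions furnishing, unless it describes an
    unfurnished listing or a paid extra ('furnished option extra pw')."""
    lowered = text.lower()
    return (
        "furn" in lowered
        and "unfurn" not in lowered
        and "extra" not in lowered
    )
```

Matching on the stem `furn` covers the variants in the notes ('furnished', 'fully furn', 'partly-furnished'), at the cost of treating any listing that mentions both furnished and unfurnished options as not furnished.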

### Income

We interpolated the income data between the 2016 and 2021 Census values to obtain a figure for each intervening year, assuming that income changed linearly over that period.
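
The linear assumption amounts to adding one fifth of the 2016 to 2021 change for each year. A minimal sketch, where `interpolate_income` is a hypothetical helper and the dollar figures are made up:

```python
from typing import Dict


def interpolate_income(income_2016: float, income_2021: float) -> Dict[int, float]:
    """Linearly interpolate yearly income between the two census years."""
    yearly_change = (income_2021 - income_2016) / (2021 - 2016)
    return {
        year: income_2016 + yearly_change * (year - 2016)
        for year in range(2016, 2022)  # 2016 to 2021 inclusive
    }


# Made-up values for one suburb, for illustration only.
incomes = interpolate_income(1000.0, 1500.0)
```

Here `incomes[2018]` is 1200.0, two fifths of the way from the 2016 value to the 2021 value.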

## Limitations

### Education
1. Ranking criteria differences across school stages: ranking primary and secondary schools together was challenging due to the differences in their educational objectives and outcomes. A secondary school's ranking might influence real estate more significantly due to its relevance to ATAR/VCE scores, while primary schools often lack performance-based rankings. Tertiary institutions are often highly selective and profession-specific in the courses they offer, which makes them difficult to rank given the diversity of their content. Hence, education density was used to identify top suburbs instead.

2. Quality vs. Quantity dilemma: the use of "education density" measures the quantity of schools but fails to account for the quality of education provided by each institution. A suburb with many low-performing schools could appear more favorable than one with fewer but higher-performing schools.