# Prediction Model for High Value Townships in Eastern United States


<img src = 'https://nypost.com/wp-content/uploads/sites/2/2022/09/phoenix-skyline-sunset.jpg' />

_(Skyline of Phoenix - from nypost/Getty Images) - Picture for representation purposes only_

## Problem Introduction and Motivation

_This project assumes a made-up scenario in which we have been employed by a large real estate company that wants to invest in good value homes for sale to High Networth Individuals. For the purpose of this scenario, all the parameters are assumed to have been researched and requested by this fictitious company, "GP Real Estate"._


### **Objective and Background**<br>
GP Real Estate has been in the business of constructing industrial units and warehouses for large corporates for over 25 years. As the company has grown, so has its client base. Some of its clients are in the computer and computer peripherals industry, which has seen strong demand even during the Covid-19 pandemic. With increasing demand and profits, these companies have been purchasing more industrial properties to increase their production. Now, however, their promoters (High Networth Individuals) are looking for personal investments in residential properties. During informal meetings, they voiced their interest in investing in good value homes in good neighbourhoods, and mentioned that they have been on the lookout for a real estate company that could help with their search.
The sales team at GP Real Estate realized that the company could add a new Residential vertical and pitched this idea to the stakeholders with the backing of a few interested clients. Having been given a tentative go-ahead, the company contacted us to develop a model that will help them predict the right kind of townships in which to establish a new housing development. This will give them the right entry into the residential game.
While GP Real Estate has clients across the nation, the company is headquartered in New York and has a large share of clients who are based in the Eastern United States and are looking for investments in the same region.

### **Understanding the Data**<br>
This is a model for predicting which townships are more likely to have residential houses of higher value.
Over the past 2 years, due to the Covid-19 pandemic, the job market saw unprecedented disruption. This also led to a change in the way we work, with many organizations offering the flexibility of 'Work from Home' to counter the spread of the virus and also in response to the Great Resignation. As a result, more and more individuals are relocating to better localities to achieve their dreams of investing in good homes.
This has led to an increase in the value of real estate, especially over the past few months. However, as demand has increased, the values of all houses have gone up irrespective of the location and quality of the house.

Due to this, real estate companies and individual investors are on the lookout for quality houses, and hence need relevant data in order to make well-informed decisions.

There is a great article which dives in deep into the reasons behind the increase in housing prices:
https://www.nyrentownsell.com/blog/why-are-houses-so-expensive/

An excerpt from the article: _Following the onset of the COVID-19 pandemic, interest rates were reduced to boost economic health. The drastic drop in interest rates, combined with numerous Americans’ desire to abandon apartments and cities in favor of residential areas and lower prices, created an increased demand. In contrast, many sellers withdrew from the market due to political and economic instability. More buyers than sellers have since entered the real estate market, and total house prices have dramatically increased as a result._

>Dataset Details:
We have the records for roughly 500 “townships” or municipalities in the Eastern United States. We need to use the area characteristics provided in the data to predict whether or not a township is likely to possess a lot of high value properties.<br>
(Typically, a township in the USA refers to a small geographical area)

Click here to know more details about townships in the USA: https://www.britannica.com/topic/township 

This prediction could be helpful for building companies who want to know where to establish a new housing development. Other real-estate ventures would be interested as well.


A few things to keep in mind before proceeding with the Prediction Model:
>**Target audience**<br>
>Apart from GP Real Estate, there are quite a few real estate investors and HNIs who would benefit greatly from such a prediction model. If this model works out, our firm can use similar models to pitch to new clients who want to venture into residential real estate in different parts of the country<br><br>
>**Actionables**<br>
>Users of this model shall be able to select the right townships for their investments. This will directly impact both their topline as well as bottom line, as the model will be able to help them take quick and informed decisions, suitable for their target audience.<br><br>
>**Who are the key stakeholders?**<br>
>This model shall be of great use to Real estate companies/ property owners, and high networth investors<br><br>
>**Outcome of this Prediction Model**<br>
>We use Machine Learning to predict which townships are likely to have high value properties. The accompanying Streamlit application showcases the findings in an aesthetic manner.


# Project Dependencies

In [1]:
import pandas as pd
                                                            # Here we are importing the pandas data analysis library
                                                            # into our program. We are assigning pandas as pd to use the Pandas
                                                            # library wherever required in the code later
            
from sklearn.linear_model import LogisticRegression
                                                            # Scikit-learn is a Machine Learning library, and we are
                                                            # importing just the parts of the library that we need,
                                                            # which keeps the namespace small and the intent clear
                                                            
                                                            # linear_model is a module of the scikit-learn library
                                                            # and we can use it for ML with linear models
                                                            
                                                            # Logistic Regression is an ML classification technique/ algorithm
                                                            # It is used to predict the probability of a 
                                                            # dependent categorical variable 

from sklearn.metrics import accuracy_score
                                                            # The metrics module measures the classification performance
                                                            # It implements several loss, score, and utility functions
                                                            # The accuracy_score function calculates the accuracy score
                                                            # for a set of predicted labels against the true labels

import pickle
                                                            # The Python pickle module serializes and deserializes objects
                                                            # in binary. Pickling is used to store python objects to a file 
                                                            # that can be loaded to another program again later


# Data Preparation
- [ ] Initial Evaluation
- [ ] Getting the Predictor and Target Variables
- [ ] Modeling - Training the Model


><h3>Initial Evaluation </h3>
Read in the csv data file and try to understand it

In [2]:
df = pd.read_csv('high_value_townships.csv')
                                                            # While it is not mandatory, df is the standard term used to store
                                                            # dataframes - common jargon
                                                            # Here, we are using the read_csv function to read the csv file
                                                            # which is our data set 
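A minimal sketch of what this initial evaluation could look like. The CSV itself is not reproduced in this notebook, so the sketch builds a tiny stand-in frame instead; the column names (`high_value`, `large_lots` and so on) and all values are assumptions for illustration and may not match the headers in `high_value_townships.csv`.

```python
import pandas as pd

# Hypothetical stand-in for high_value_townships.csv.
# Column names and values are made up for illustration only.
sample = pd.DataFrame({
    "high_value": [1, 0, 0, 1],
    "large_lots": [18.0, 0.0, 0.0, 12.5],
    "industrial_land": [2.3, 7.1, 8.1, 7.9],
    "river_side": [0, 0, 1, 0],
    "mean_rooms": [6.6, 6.4, 5.9, 6.0],
    "pupil_teacher_ratio": [15.3, 17.8, 21.0, 15.2],
})

print(sample.head())                  # first few rows
print(sample.dtypes)                  # data type of each column

missing = sample.isna().sum()         # missing values per column
print(missing)

class_counts = sample["high_value"].value_counts()   # balance of the target
print(class_counts)
```

The same four calls can be run on `df` directly once the real file is loaded; checking the balance of the target is useful because accuracy can be misleading when one class dominates.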


><h3>Getting the Predictor and Target Variables </h3>
Creating X and y to store the predictor variables and the Target variable

In [3]:
X = df.iloc[:, 1:]
                                                            # iloc is integer-location based indexing for selection by position
                                                            # Here, we are considering all the rows, however, we are 
                                                            # excluding the 0th column which contains the Target variable
                                                            # and is therefore not a predictor variable
                                                            # X stores all of the predictor variables
y = df.iloc[:, 0]
                                                            # y stores the target variable, as we are considering
                                                            # only the 0th column
model = LogisticRegression(max_iter=800)
                                                            # Here, we are creating a logistic regression estimator and
                                                            # storing it in model, allowing at most 800 solver iterations 


# Modeling


In [4]:
model.fit(X, y)
                                                            # The .fit method trains the model on the predictors and target
predictions = model.predict(X)
                                                            # predict returns a predicted label for each row, based on
                                                            # the trained model. These are the same rows used for training
print(accuracy_score(y, predictions))
                                                            # Displays the accuracy score based on the target variable and 
                                                            # predictions. Since it is measured on the training data,
                                                            # it is likely an optimistic estimate
with open('classifier', mode='wb') as pickle_out:
                                                            # This opens the file for writing. Here we are creating a 
                                                            # file called 'classifier' and we are writing (w) to it in
                                                            # binary (b) mode - for which we have used the mode 'wb'
    pickle.dump(model, pickle_out)
                                                            # dump writes the serialized model to the file and returns None
                                                            # The with block closes the file for us when it ends

0.9525691699604744
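The score above is computed on the same rows the model was trained on, so it may overstate how well the model generalizes to unseen townships. One common alternative is to hold out part of the data for evaluation. The sketch below illustrates the idea on synthetic data from `make_classification` (an assumption, standing in for the township data); `X_demo`, `y_demo` and `holdout_model` are hypothetical names, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 500 rows and 5 predictors, similar in shape
# to the township data described above (values are not real)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Keep 20% of the rows aside; stratify keeps the class balance similar
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

holdout_model = LogisticRegression(max_iter=800)
holdout_model.fit(X_train, y_train)

# Compare accuracy on rows the model has seen and rows it has not
train_acc = accuracy_score(y_train, holdout_model.predict(X_train))
test_acc = accuracy_score(y_test, holdout_model.predict(X_test))
print(f"train accuracy: {train_acc:.3f}, hold-out accuracy: {test_acc:.3f}")
```

On the real data, `X_demo` and `y_demo` would be replaced by `X` and `y`; a large gap between the two numbers would suggest overfitting.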


#### These are the predictor variables (determinants) used to predict whether or not the township is likely to have high value homes:
- Large Lots: The percentage of residential lots which are large in size
- Industrial Land: The percentage of the township land which has been zoned for industrial purposes
- River Side: Provides details about whether the township is on a river side or not (1 = Yes, 0 = No)
- Mean Rooms: The average number of rooms in each of the residential properties
- Pupil Teacher Ratio: The number of students per teacher within the township 

These are the variables that we are considering for now. However, further research can help us determine more relevant datapoints/ predictor variables going forward.
A few that come to mind are: Schools in the area (and their fees), presence of high-end shops nearby, Home Owners Association fees

#### We expect a township to be more likely to have high value properties when:
- A higher percentage of its residential lots are large
- A lower percentage of its land is zoned for industrial purposes
- It is on the riverside
- The average number of rooms per property is higher
- The ratio of students to teachers is lower
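One hedged way to check directional expectations like these is to inspect the sign of each logistic regression coefficient after fitting. The sketch below uses made-up data in which only `mean_rooms` and `pupil_teacher_ratio` drive the label; the column names, the data-generating rule and the names `demo` and `clf` are all assumptions for illustration. A positive coefficient raises the log-odds of the class labelled 1, so the interpretation depends on how the target is coded in the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Made-up predictors, loosely shaped like the variables listed above
demo = pd.DataFrame({
    "large_lots": rng.uniform(0, 100, n),
    "industrial_land": rng.uniform(0, 30, n),
    "river_side": rng.integers(0, 2, n),
    "mean_rooms": rng.normal(6, 1, n),
    "pupil_teacher_ratio": rng.uniform(12, 22, n),
})

# Made-up rule: more rooms and fewer pupils per teacher push towards label 1
score = 1.5 * (demo["mean_rooms"] - 6) - 0.5 * (demo["pupil_teacher_ratio"] - 17)
label = (score + rng.normal(0, 0.5, n) > 0).astype(int)

clf = LogisticRegression(max_iter=800).fit(demo, label)

# One coefficient per predictor, sorted from most negative to most positive
coefs = pd.Series(clf.coef_[0], index=demo.columns).sort_values()
print(coefs)
```

Because the predictors are on different scales, the magnitudes are not directly comparable without standardizing first; the signs are the more reliable thing to read from this output.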


><h3>Running the Prediction Model </h3>
Using the predictors listed above as inputs to the Prediction Model

In [5]:
%%writefile app.py

import pickle
import streamlit as st

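# Load the trained model that was pickled in the previous cell
with open('classifier', 'rb') as pickle_in:
    classifier = pickle.load(pickle_in)

# Define the function which will make the prediction using data
# inputs from users. The parameter order matches the call in main()
# and the order in which the predictors are listed in this notebook.
# Note: newer Streamlit releases deprecate st.cache in favour of st.cache_data
@st.cache()
def prediction(large_lots, industrial_land, river_side,
               mean_rooms, pupil_teacher_ratio):
    
    # Make the prediction; predict() returns an array, so take its only element
    label = classifier.predict(
        [[large_lots, industrial_land, river_side, mean_rooms, pupil_teacher_ratio]])[0]
    
    # Label 0 is mapped to the 'high value' message, as in the original logic
    if label == 0:
        pred = 'This township is likely to possess a lot of high value properties. Research further!'
    else:
        pred = 'This township will probably not have a lot of high value properties. Avoid for residential, but can consider for Industrial projects.'
    return pred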

# This is the main function in which we define our webpage
def main():
    
    # Create input fields
    
    large_lots = st.number_input("Residential lots that are large (%)",
                                  min_value=0,
                                  max_value=100,
                                  value=5,
                                  step=1,
                                 )
    industrial_land = st.number_input("Township land zoned for industrial purposes (%)",
                              min_value=0,
                              max_value=100,
                              value=5,
                              step=1,
                             )

    river_side = st.number_input("Riverside - Yes(1) or No(0)",
                              min_value=0,
                              max_value=1,
                              value=0,
                              step=1
                             )
    mean_rooms = st.number_input("Mean Number of Rooms",
                          min_value=1,
                          max_value=10,
                          value=2,
                          step=1
                         )
    pupil_teacher_ratio = st.number_input("Pupil Teacher Ratio",
                          min_value=1,
                          max_value=25,
                          value=10,
                          step=1
                         )

    result = ""
    
    # When 'Predict' is clicked, make the prediction and store it
    if st.button("Predict"):
        result = prediction(large_lots, industrial_land, river_side, mean_rooms, pupil_teacher_ratio)
        st.success(result)
        
if __name__=='__main__':
    main()

Writing app.py
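Because `app.py` depends on the pickled `classifier` file, a quick round-trip check can confirm that a saved model returns the same predictions as the in-memory one. The sketch below trains a throwaway model on synthetic data and writes it to a temporary directory, so it does not touch the real `classifier` file; `original`, `restored` and the synthetic data are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Throwaway model trained on synthetic data (not the township data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
original = LogisticRegression(max_iter=800).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classifier")

    # Write the model in binary mode, then read it back
    with open(path, "wb") as f:
        pickle.dump(original, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict exactly what the original predicts
same = bool((original.predict(X_demo) == restored.predict(X_demo)).all())
print("round trip preserved predictions:", same)
```

Pickled scikit-learn models are generally only safe to load with the same library version that created them, and pickle files from untrusted sources should never be loaded.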


# Deployment
We can now run the application using Streamlit.
This application showcases the prediction to relevant stakeholders in a visually appealing manner.

In [None]:
!streamlit run app.py