GitHub - TheNJineer/NJRealtor-Scrapper: Full scale portfolio project which scrapes the NJ Realtor website of its monthly median sales data pdfs. The pdf's contents will be extracted to be cleaned and transformed to then be stores in a SQL data base for future use in a machine learning project.

Introduction:

Imagine being new real estate agent; wide-eyed, excited and ready to take on your first client. Miraculously, you get a call on your first day from a buyer with a particular set of criteria: “I’m moving from another state so I know nothing about New Jersey, but they have a max budget of $500k, want to live in the suburbs but close to public transportation, a good school system and plenty of businesses and entertainment. Please show me all cities which meet that criteria, thank!”. You hang the phone up less enthusiastic thank when you picked up. You have no idea where to begin getting this information but you heard from another senior agent that’s there’s four (4) different private multiple listing services (MLS), Zillow, Redfin, Trulia and maybe Wikipedia. Other than that you have to either drive to other cities to do recon or call other agents to get a detailed layout of their city. Only if there was an easier way to get all of this information. You also wish there was an easier way to classify cities with this criteria, because more than likely other people would want to live somewhere just like that. Lastly, how would a municipality having all those great qualities affect the change in population?

Problem Statement:

The current system (or lack thereof) for real estate agents to analyze their local and distance markets is inefficient and time consuming. To properly represent clients and meet their needs, an agent needs to spend many hours throughout the week tracking recent sales, inventory, home prices and conducting comparable sales analysis. With this information, an agent can thoroughly understand market dynamics, when, where and how a buyer should purchase a property or to sell a home. However, markets change on a daily, weekly and monthly basis. This can be overwhelming for new and even established agents as NJ has 4-5 different private MLS services which organize and display their data with different methods. This inefficiency can lead to increased frustration and lack of understanding with ones job, increased inability to keep up with fast paced markets, decreased client satisfaction and loss of revenue (salary).

Goals:

Build and maintain an ETL (extract, transform, load) pipeline of real estate data from various sources through the use of Python, PostgrSQL and Apache Airflow
Perform exploratory data analysis (EDA) to gain insight and recognize trends in acquired data
Visualization the data using tools such as Matplotlib, Seaborn and GeoPandas for static images, and Streamlit to create real estate dashboards
Use regression algorithms to understand the relationship between the real estate data and municipality populations
Use unsupervised machine learning algorithms to cluster “similar municipalities” based on data compiled from the MLS, US Census Bureau, NJ Department of Labor and third party data resources
Use results from real estate data analysis and machine learning to help retail sellers, buyers and investors make data driven decisions
Allow agents to effectively access and analyze all markets across the state of NJ, not only ones within their immediate radius

Data Sources:

NJ Realtor Association
Wikipedia
USPS (Zipcode Lookup) / UnitedStatesZipCodes.org
US Cenus for the 2019-2022 Per Capita Income
US Cenus for the 2019-2022 Median Household Income
US Cenus for the 2019-2022 Median Age
US Cenus for the 2019-2022 Home Ownership Rates
US Cenus for the 2019-2022 Commuting Data
County Business Patterns (CBP) by Zipcode
FBI Uniform Crime Reporting Program (UCRP)
NJ Deptartment of Labor for 2019-2023 Population Estimates
NJ Deptartment of Labor for 2019-2023 General Tax Rates
NJ Deptartment of Labor for 2019-2023 Average Property Tax Bills
NJ Deptartment of Labor for 2019-2023 Unemployment Rates
U.S Environmental Protection Agency Air Quality Index (AQI)
School Digger Best High School Rankings (for NJ)

Designing the System: The pipeline operates as follows:

Requests will be used to send inidividual requests to receive files from the host server for each municipality for that month and year
After all target pdfs are streamed, begin scrapping the data using PyPDF2 and Regex
Store data in a temporary Python dictionary
Convert Python dictionary into Pandas DataFrame
Initiate data cleaning and transformations
Store DataFrame in PostgreSQL
Schedule monthly data aggregation with Apache Airflow

Data Cleaning, Enrichment, Imputations and Transformations

Used BeautifulSoup and Requests to scrape all municipality's latitudinal and longitudunal information to merge with real estate data
Used BeautifulSoup and Requests to scrape of NJs Census Designated Places (CDPs) and convert names to host cities
Transformed all of US Census Data from JSON to Pandas Dataframes
School names cleaned to match their host municipality. Municipality's School Rank was averaged if multiple school systems existed
Transformed FBI Crime Data from unconventional xlsx and pdf formats to Pandas Dataframes
Used the Census Bureau API to request CBP zipcode data. Transformed and filtered data to save NJ Zipcodes early. Use the full zipcode database to match NJ zipcodes with their municipality name
Air Quality Index Data was transformed and filtered NJ Counties only. Data was then converted from a daily to a yearly timeframe
After merging all dataframes, necessary transformations (added constants, log, exp and power) were applied to ensure data was as close to normally distributed as possible

Exploratory Data Analysis & Data Visualization

Figure 1: What is the Median Sales Price of a NJ Home Since 2019?

The median sales price has increase by approximately 65% since 2019, with an average yearly increase of 13%

Figure 2: What is the Average Median Gross Rent by County for 2022?

The median gross rent is highest in northern New Jersey and decreases as one heads south. There's a positive linear correlation between both Latitude and Longitude and Gross Rent

Figure 3: What is the best month to sell your home?

According to data, the best time to sell your home is either in August, July, or June. Each month has respectively accounted for approximately 10% of the states closed sales since NJ Realtor has recorded data. This also correlates to higher sales price during these months

Machine Learning

Figure 4: Supervised Regression R2 Prediction and Best Fit Scores

Distance-based, Decision Tree and Ensemble estimators performed best in predicting the target (Populaiton) and understanding the relationship between it and the dependent variables. They typically do better with higher dimension data. The linear estimators underperformed as expected but, I was able to use linear feature importance techniques to deduce which features were led to the best estimator scores. The best hyperparameters were found using Sci-kit Learn's Grid Search Cross Validation method.

Figure 5: UMAP Dimension Reduction Results (Pre-Clustering)

After applying all imputations and transformations to make the data as normalized as possible, I used the UMAP dimension reduction technique to reduce the dimensions from 66 features down to three (3). There are some clearly formed clusters after the transformation but, more needs to be done to quantify how many individual clusters there are.

Figure 6: UMAP Dimension Reduction Clustering Results

Clustering algorithms like KMeans, DBSCAN and HDBSCAN were used to estimate the numeber of clusters each instance of data belonged to. Additionally, cluster evaluators such as the Silhouete Score and David Bouldin Score were used to estimate how well clusters formed and how different each cluster was to one another. The cluster evaluators determined that KMeans produced the best scored clusters. Result may vary through each use of UMAP with six (6) to eight (8) unique clusters being found (eight (8) clusters shown in this figure)

Additional Actions Considered

Re-run the algorithm using all of the monthly data
Generate normalized data on a monthly scale from the yearly data
Include all discrete data (Municipality Name, County, Quarter) through encoding techniques
Feature engineer a column which would classify a municipality being part of North, Central, or South Jersey
Try to enrich the data with more municipal level data
Perform EDA on the clusters discovered through unsupervised learning. This will help me understand their similaries and apply a classification label
Apply a classification algorithm to predicted the likelihood of an instance being labeled the correct municipality and cluster

Summary

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
Final Project		Final Project
Project Images		Project Images
.gitignore		.gitignore
NJRScrapper.py		NJRScrapper.py
README.md		README.md
nj_realtor_dashboard.py		nj_realtor_dashboard.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Languages

TheNJineer/NJRealtor-Scrapper

Folders and files

Latest commit

History

Repository files navigation

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages