This repository contains the files for Jared Godar's Codeup project on regression and modeling of Zillow real estate data.
The main goal of this project is to accurately predict the values of single-unit properties.
This will be accomplished by using property data from transactions between May and August 2017 to build various regression models, rating the effectiveness of each model, and testing the best model on new data it has never seen.
I will also determine the distribution of tax rates by state and county.
The ability to accurately value a home is essential for both buyers and sellers. Having an accurate model will allow us to determine which houses are over- and undervalued and make appropriate decisions accordingly.
This project provides the opportunity to create and evaluate multiple predictive models as well as implement other essential parts of the data science pipeline.
It will involve pulling relevant data from a SQL database; cleaning that data; splitting the data into training, validation, and test sets; scaling data; feature engineering; exploratory data analysis; modeling; model evaluation; model testing; and effectively communicating findings in written and oral formats.
A home is often the most expensive purchase one makes in their lifetime. Having a good handle on pricing is essential for both buyers and sellers. An accurate pricing model that factors in the properties of similar homes will allow appropriate prices to be set and make it possible to identify under- and overvalued homes.
- What are the main drivers of home price?
- What are the relative importances of the assorted drivers?
- What factors don't matter?
- Are there any other potentially useful features that can be engineered from the current data available?
- Are the relationships suggested by initial visualizations statistically significant?
- Is the data balanced or unbalanced?
- Are there null values or missing data that must be addressed?
- Are there any duplicates in the dataset?
- Which model feature is most important for this data and business case?
- Which model evaluation metrics are most sensitive to this primary feature?
Target | Datatype | Definition |
---|---|---|
home_value | float64 | home value in dollars |

Feature | Datatype | Definition |
---|---|---|
bedrooms | float64 | Number of bedrooms |
bathrooms | float64 | Number of bathrooms |
square_feet | int64 | Area in square feet |
year | int64 | Year built |
taxes | float64 | Tax amount in dollars |
fips_number | int64 | FIPS county code |
zip_code | category | Zip Code |
county_Orange | uint8 | Encoded county information |
county_Ventura | uint8 | Encoded county information |
county_avg (engineered) | float64 | Average home price in county |
baseline (engineered) | float64 | Baseline prediction (median home value) |
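
The `county_Orange` and `county_Ventura` columns look like one-hot indicators derived from `fips_number`. A minimal, library-free sketch of that kind of encoding follows. The FIPS-to-county mapping and the choice of Los Angeles as the dropped reference category are assumptions, not details stated in this README, and `encode_county` is a hypothetical helper rather than a function from the project files.

```python
# Assumed mapping of FIPS codes to county names (not stated in this README).
FIPS_TO_COUNTY = {6037: "Los Angeles", 6059: "Orange", 6111: "Ventura"}


def encode_county(fips_number):
    """Return 0/1 indicator columns for one FIPS code.

    Los Angeles is treated as the reference category, so it is
    represented by zeros in both indicator columns.
    """
    county = FIPS_TO_COUNTY[fips_number]
    return {
        "county_Orange": int(county == "Orange"),
        "county_Ventura": int(county == "Ventura"),
    }


# Encode one example row per county.
encoded = [encode_county(fips) for fips in (6037, 6059, 6111)]
```

Dropping one category avoids a redundant column, since a row with zeros in both indicators can only belong to the remaining county.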
You will need your own env file with database credentials along with all the necessary files listed below to run my final project notebook.
- Read this README.md.
- Download the `zillo_aquire.py`, `zillo_prepare.py`, and `zillo_project_report.ipynb` files into your working directory.
- Add your own `env` file to your directory (user, password, host).
- Run the `zillo_project_report.ipynb` workbook.
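
As a rough illustration, the env file could be a small Python module like the one below. The file name `env.py`, the `get_db_url` helper, the `mysql+pymysql` driver prefix, and the placeholder values are all assumptions made for this sketch. Check what the acquire script actually imports before relying on it.

```python
# env.py (hypothetical layout; replace the placeholders with your own values)
user = "your_username"
password = "your_password"
host = "your.database.host"


def get_db_url(database, user=user, password=password, host=host):
    """Build a SQLAlchemy-style connection string for the given database.

    The driver prefix is an assumption; adjust it to match your setup.
    """
    return f"mysql+pymysql://{user}:{password}@{host}/{database}"
```

Keep this file out of version control (for example, by listing it in `.gitignore`) so your credentials are not published.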
- Acquire, clean, prepare, and split the data:
    - Pull from the Zillow database.
    - Eliminate any unnecessary or redundant fields.
    - Engineer new, potentially informative features.
    - Search for null values and respond appropriately (delete, impute, etc.).
    - Deal with outliers.
    - Scale data appropriately.
    - Correlate IDs to states and counties.
    - Determine the distribution of tax rates by county.
    - Calculate tax rate using home value and taxes paid.
    - Divide the data into training, validation, and testing sets (~50-30-20 split).
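
The tax-rate calculation and the roughly 50-30-20 split described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: `tax_rate` and `split_rows` are hypothetical helpers, the seed is arbitrary, and the rows are made-up values.

```python
import random


def tax_rate(home_value, taxes):
    """Effective tax rate: taxes paid as a fraction of home value."""
    return taxes / home_value


def split_rows(rows, train=0.5, validate=0.3, seed=123):
    """Shuffle a copy of rows and cut it into train/validate/test lists.

    Whatever remains after the train and validate cuts becomes the test set.
    """
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_validate = int(len(shuffled) * validate)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validate],
        shuffled[n_train + n_validate:],
    )


# Made-up rows for illustration only.
rows = [
    {"home_value": 100_000.0 * (i + 1), "taxes": 1_250.0 * (i + 1)}
    for i in range(10)
]
for row in rows:
    row["tax_rate"] = tax_rate(row["home_value"], row["taxes"])

train, validate, test = split_rows(rows)
```

Fixing the seed keeps the split reproducible, so the test set stays untouched across reruns.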
- Exploratory data analysis:
    - Visualize pairwise relationships, looking for correlation with home value.
    - Note any interesting correlations or other findings.
    - Test presumptive relationships for statistical significance.
    - Think about which features would be most useful for the model.
    - Record any other interesting observations or findings.
    - NOTE: This data analysis will be limited to the training dataset.
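
One way to check whether a correlation seen in a plot is statistically significant is a permutation test on Pearson's r. The sketch below uses only the standard library. It swaps in a permutation test for illustration and does not claim to be the test used in the notebook; the helper names, the seed, and the data are all made up.

```python
import random
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(x, y, n_permutations=2000, seed=123):
    """Two-sided p-value: how often shuffled data correlates as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up training data for illustration only.
square_feet = [900, 1200, 1500, 1800, 2100, 2400, 2700, 3000]
home_value = [190_000, 255_000, 300_000, 370_000, 410_000, 480_000, 560_000, 590_000]

r = pearson_r(square_feet, home_value)
p = permutation_p_value(square_feet, home_value)
```

A small `p` suggests the observed correlation is unlikely to arise from chance pairing of the values.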
- Model generation, assessment, and optimization:
    - Establish baseline performance (median home price).
    - Generate a basic regression model using only home area, number of bedrooms, and number of bathrooms.
    - Calculate evaluation metrics to assess quality of models (RMSE, R^2, and p as primary metrics).
    - Generate additional models incorporating other existing fields.
    - Engineer additional features to use in other models.
    - Evaluate ensemble of better models on validation data to look for overfitting.
    - Select the highest performing model.
    - Test that model with the previously unused and unseen test data once and only once.
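
To make the baseline comparison concrete, here is a minimal sketch that scores a median baseline and a one-predictor least-squares line with RMSE and R^2. It uses a single feature (`square_feet`) to stay library-free, whereas the plan above calls for area, bedrooms, and bathrooms. The helper names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, median


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))


def r_squared(actual, predicted):
    """Proportion of variance in actual explained by predicted."""
    avg = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up training data for illustration only.
square_feet = [1000, 1500, 2000, 2500, 3000]
home_value = [210_000, 290_000, 405_000, 495_000, 600_000]

# Baseline: predict the median home value for every property.
baseline_pred = [median(home_value)] * len(home_value)

slope, intercept = fit_line(square_feet, home_value)
model_pred = [slope * x + intercept for x in square_feet]

baseline_rmse = rmse(home_value, baseline_pred)
model_rmse = rmse(home_value, model_pred)
model_r2 = r_squared(home_value, model_pred)
```

A model is only worth keeping if its RMSE beats the baseline on the validation data as well as on the training data.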
- Streamline presentation:
    - Take only the most relevant information from the working notebook and create a succinct report that walks through the rationale, steps, code, and observations for the entire data science pipeline of acquiring, cleaning, preparing, modeling, evaluating, and testing our model.
- Outline next steps for this project:
    - Potential specific changes designed to retain customers
    - Strategy to test and evaluate implementation of those changes
    - Potential revenue and savings for success
- Most important factors
- Factors that don't matter
- Model performance
- Improvement over baseline
- Counties