# Texas Housing Prices

---

This project is part of my "100 Days of Data Projects."

With this data, I hope to build and compare regression models for predicting housing prices in Austin, Texas.

---

This dataset contains 15,171 Zillow home listings (about 15.2K), each with 47 columns: 46 features plus the target, `latestPrice`.

Source: https://www.kaggle.com/datasets/ericpierce/austinhousingprices

# Import Data and Packages

In [2]:
import numpy as np
import pandas as pd 

data = pd.read_csv("texas_housing_prices.csv")

# Check the Size and Type of Data

In [3]:
# check the number of rows and features

print(data.shape)

(15171, 47)


In [4]:
# check the data types

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15171 entries, 0 to 15170
Data columns (total 47 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   zpid                        15171 non-null  int64  
 1   city                        15171 non-null  object 
 2   streetAddress               15171 non-null  object 
 3   zipcode                     15171 non-null  int64  
 4   description                 15171 non-null  object 
 5   latitude                    15171 non-null  float64
 6   longitude                   15171 non-null  float64
 7   propertyTaxRate             15171 non-null  float64
 8   garageSpaces                15171 non-null  int64  
 9   hasAssociation              15171 non-null  bool   
 10  hasCooling                  15171 non-null  bool   
 11  hasGarage                   15171 non-null  bool   
 12  hasHeating                  15171 non-null  bool   
 13  hasSpa                      151

# Train-Test Split

In [5]:
# import package for splitting
from sklearn.model_selection import train_test_split

# declare our X inputs and y outcomes
X = data.drop("latestPrice", axis=1)
y = data["latestPrice"]

# split 80/20; a fixed random_state keeps the split reproducible across runs
# (no stratify: the target is a continuous price, not a class label)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

print("X_train.shape = ", X_train.shape)
print("X_test.shape = ", X_test.shape)

print("y_train.shape = ", y_train.shape)
print("y_test.shape = ", y_test.shape)

X_train.shape =  (12136, 46)
X_test.shape =  (3035, 46)
y_train.shape =  (12136,)
y_test.shape =  (3035,)


# Explore the Data

1) Get to know each feature:
- Name
- Type
- Missing Values
- Noise (stochastic, outliers, rounding errors, etc.)
- Usefulness for the task at hand
- Distribution type (Gaussian, uniform, logarithmic, etc.)

2) Identify the target attribute.

3) Visualize the data.

4) Study the correlations between attributes.

5) Identify what transformations to the features you might want to apply.

6) Document what you have learned.
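
The first and fourth steps above can be sketched with plain pandas. This is a minimal illustration, not the notebook's actual analysis: the frame below is synthetic, `area_sqft` is a hypothetical column name, and `summarize_features` and `target_correlations` are helper names introduced only for this example. On the real data you would pass the training split instead of `toy`.

```python
import numpy as np
import pandas as pd


def summarize_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, unique count, skewness."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })
    # Skewness is only computed for numeric columns; the others stay NaN.
    summary["skew"] = df.select_dtypes(include="number").skew()
    return summary


def target_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric feature with the target, largest magnitude first."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# Synthetic stand-in for the training split (all values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "area_sqft": rng.normal(2000, 500, size=200),   # hypothetical feature
    "garageSpaces": rng.integers(0, 4, size=200),
    "city": ["austin"] * 200,
})
toy["latestPrice"] = 150 * toy["area_sqft"] + rng.normal(0, 20000, size=200)
toy.loc[:4, "area_sqft"] = np.nan  # inject 5 missing values (labels 0..4)

feature_summary = summarize_features(toy)
correlations = target_correlations(toy, "latestPrice")
print(feature_summary)
print(correlations)
```

A strongly skewed column in the summary is a candidate for a log transform, and a column with a single unique value (like `city` in the toy frame) carries no information for the model.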

   

# Prepare the Data

(Write functions for all data transformations applied.)

1) Data Cleaning

- Fix or remove outliers (optional)

- Impute missing values, or drop the affected rows or columns

2) Feature selection (optional)

- Drop attributes that provide no useful information for the task.

3) Feature engineering where appropriate:

- Bin continuous features

- Decompose features (categorical, date/time, etc.)

- Add promising transformations of features (log, sqrt, ^2, etc.)

- Aggregate features into promising new features

4) Feature scaling: Standardize or normalize features
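
One way to keep every transformation in reusable code is a scikit-learn `ColumnTransformer`. The sketch below is an assumption about how the steps could be wired together, not the pipeline this project settled on: `lot_size` is a hypothetical skewed column, the column groupings are placeholders, and the toy rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical column groupings; revise them after exploring the real data.
numeric_cols = ["garageSpaces"]
skewed_cols = ["lot_size"]      # hypothetical right-skewed feature
categorical_cols = ["city"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
skewed_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),   # compress the long right tail
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_pipe, numeric_cols),
        ("skewed", skewed_pipe, skewed_cols),
        ("cat", categorical_pipe, categorical_cols),
    ],
    remainder="drop",       # feature selection: unlisted columns (e.g. zpid) are dropped
    sparse_threshold=0,     # always return a dense array
)

# Made-up rows standing in for the train and test splits.
toy_train = pd.DataFrame({
    "lot_size": [5000.0, 7200.0, np.nan, 120000.0],
    "garageSpaces": [2, 0, 1, 3],
    "city": ["austin", "austin", "pflugerville", "austin"],
    "zpid": [1, 2, 3, 4],
})
toy_test = pd.DataFrame({
    "lot_size": [6500.0], "garageSpaces": [2], "city": ["del valle"], "zpid": [5],
})

# Fit on the training split only, then reuse the fitted statistics on the test split.
X_train_prepared = preprocess.fit_transform(toy_train)
X_test_prepared = preprocess.transform(toy_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

Fitting on the training split and only calling `transform` on the test split keeps medians, means, and category lists from leaking test information into the model.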

# Short-List Promising Models

(If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time. Be aware that this penalizes complex models such as large neural nets or random forests.)

(Try to automate these steps as much as possible.)

1) Train many quick and dirty models from different categories with default parameters

2) Measure and compare their performance.

3) For each model, use N-fold cross-validation and compute the mean and standard deviation of its performance across the N folds.

4) Analyze the most significant variables for each algorithm.

5) Analyze the types of errors the models make.

- What data could be used to avoid these errors?

6) Have a quick round of feature selection and engineering.

7) Have one or two more quick iterations of the five previous steps.

8) Short-list the top three to five most promising models, preferring models that make different types of errors.
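
Steps 1 to 3 above can be automated with a loop over candidate models. The following is a rough sketch on synthetic data from `make_regression`, not results for the Austin listings; the four candidates and the RMSE metric are assumptions chosen for illustration, and the forest uses fewer trees than the default only to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared training data.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in candidates.items():
    # scikit-learn maximizes scores, so RMSE is exposed as a negative value.
    scores = -cross_val_score(model, X_demo, y_demo, cv=5,
                              scoring="neg_root_mean_squared_error")
    results[name] = (scores.mean(), scores.std())

# Report the models from lowest to highest mean RMSE.
for name, (mean_rmse, std_rmse) in sorted(results.items(), key=lambda kv: kv[1][0]):
    print(f"{name:>8}: RMSE = {mean_rmse:.1f} +/- {std_rmse:.1f}")
```

The standard deviation matters as much as the mean here: two models whose mean scores differ by less than their fold-to-fold spread are not clearly distinguishable.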

# Fine-Tune the System

(You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning.)

(As always, automate what you can.)

1) Fine-tune the hyperparameters using cross-validation

- Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or the median value? Or just drop the rows?).

- Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams (https://goo.gl/PEFfGr)).

2) Try ensemble methods. Combining your best models will often perform better than running them individually.

3) Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.

(Don't tweak your model after measuring the generalization error: you would just start overfitting the test set.)
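
As an illustration of step 1, the sketch below runs a random search in which the imputation strategy is searched alongside the model's own hyperparameters. It assumes a random forest as the shortlisted model and uses synthetic data with artificially injected missing values; the search ranges are arbitrary placeholders, not tuned recommendations.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the training data.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)
# Knock out roughly 5% of the entries so that imputation has something to do.
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("model", RandomForestRegressor(random_state=0)),
])

param_distributions = {
    # A data-preparation choice searched like any other hyperparameter.
    # "constant" fills numeric columns with 0 when no fill_value is given.
    "impute__strategy": ["median", "constant"],
    "model__n_estimators": randint(20, 80),   # integers in [20, 80)
    "model__max_depth": randint(2, 12),       # integers in [2, 12)
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=8,                                  # number of sampled configurations
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("CV RMSE:", -search.best_score_)
```

Because the imputer sits inside the pipeline, it is refit on each training fold, so the cross-validated score does not benefit from statistics computed on the held-out fold.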

# Present Your Solution

- Document what you have done.

- Create a nice presentation.

    - Make sure you highlight the big picture first.

- Explain why your solution achieves the business objective.

- Don't forget to present interesting points you noticed along the way.

    - Describe what worked and what did not.

    - List your assumptions and your system's limitations.

- Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., "the median income is the number-one predictor of housing prices").