# Austin House Data Summary

The data come from Kaggle, where the publisher's stated goal was to study Austin's housing market. The publisher describes it as "one of the hottest markets in 2021" and suggests that researchers can use the data to predict home prices.

## Part 1: Formatting The Data Set

In [40]:
# Libraries:
import pandas as pd
import numpy as np

In [41]:
path = "../data/Austin_Housing_Data.csv"
house_data = pd.read_csv(path, header = 0)

The data set has 15,171 observations and 47 columns. Several columns have the object data type, which the coreset algorithms do not handle. We therefore keep only the float and integer columns, along with the boolean columns for now. If we find that the algorithm does not support booleans, we may remove them later.

In [44]:
# Keep only the float, integer, and boolean columns (this drops the string/object columns).
# The mask is True for the columns we keep.
keep_columns = (house_data.dtypes == "float64") | (house_data.dtypes == "int64") | (house_data.dtypes == "bool")
house_data = house_data.iloc[:, keep_columns.values]
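
The same filtering can also be sketched with `DataFrame.select_dtypes`. The frame `toy` below is a hypothetical stand-in for `house_data`, with made-up column values. Note that `"number"` matches every numeric dtype, which is slightly broader than the explicit `float64`/`int64` check above.

```python
import pandas as pd

# Hypothetical stand-in for house_data; the values are illustrative only.
toy = pd.DataFrame({
    "city": ["austin", "austin"],          # string column, should be dropped
    "latestPrice": [305000.0, 295000.0],   # float64
    "garageSpaces": [2, 0],                # int64
    "hasGarage": [True, False],            # bool
})

# Keep numeric and boolean columns; everything else (here, "city") is dropped.
numeric_toy = toy.select_dtypes(include=["number", "bool"])
print(list(numeric_toy.columns))
```

Selecting by dtype family avoids building the boolean mask by hand, at the cost of also keeping numeric dtypes other than `float64` and `int64` if any are present.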

A few of the remaining columns have numeric data types but no numeric meaning. For example, neither the difference between two zip codes nor the maximum zip code is a meaningful quantity. We therefore remove these columns from the data set.

In [45]:
# Remove zpid, zipcode, and latest_salemonth (the mask is True for the columns we keep):
keep_columns = ~house_data.columns.isin(['zpid', 'zipcode', 'latest_salemonth'])
house_data = house_data.iloc[:, keep_columns]
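
As a minimal sketch of an alternative, the same columns can be dropped by name with `DataFrame.drop`. The frame `toy` and the list name `id_like_columns` are hypothetical and only used for illustration.

```python
import pandas as pd

# Hypothetical stand-in for house_data; the values are illustrative only.
toy = pd.DataFrame({
    "zpid": [111, 222],
    "zipcode": [78660, 78617],
    "latest_salemonth": [9, 10],
    "latestPrice": [305000.0, 295000.0],
})

# Columns that are stored as numbers but carry no numeric meaning for the analysis.
id_like_columns = ["zpid", "zipcode", "latest_salemonth"]

# drop raises a KeyError if a listed column is missing, which surfaces typos early.
toy = toy.drop(columns=id_like_columns)
print(list(toy.columns))
```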

Lastly, we change the values of yearBuilt and latest_saleyear. Instead of the year itself, we store the difference between the current year (the data set's publish year, 2021) and the observation year. This means yearBuilt becomes houseAge and latest_saleyear becomes numYearsLastSale (the number of years since the last sale).

In [46]:
# Change the yearBuilt and latest_saleyear columns:
house_data.loc[:,"yearBuilt"] = 2021 - house_data.loc[:,"yearBuilt"]
house_data.loc[:,"latest_saleyear"] = 2021 - house_data.loc[:,"latest_saleyear"]
# Change the column names:
house_data = house_data.rename(columns = {'yearBuilt':'houseAge', 'latest_saleyear':'numYearsLastSale'})
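
The conversion can be checked on a small example. The sketch below uses a hypothetical two-row frame and a hypothetical constant `REFERENCE_YEAR`. It builds the new columns with `assign` instead of overwriting and renaming, so the new columns end up at the end of the frame, not in their original positions.

```python
import pandas as pd

REFERENCE_YEAR = 2021  # publish year of the data set, as stated in the text

# Hypothetical rows; the values are illustrative only.
toy = pd.DataFrame({"yearBuilt": [2012, 1988], "latest_saleyear": [2019, 2021]})

# Create the age-style columns, then drop the original year columns.
toy = (
    toy.assign(
        houseAge=REFERENCE_YEAR - toy["yearBuilt"],
        numYearsLastSale=REFERENCE_YEAR - toy["latest_saleyear"],
    )
    .drop(columns=["yearBuilt", "latest_saleyear"])
)
print(toy)
```

A house built in 2012 gets a houseAge of 9, and a house last sold in 2021 gets a numYearsLastSale of 0.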

In [47]:
# Check for missing values (isnull is an alias of isna, so the two checks are equivalent):
print(all(house_data.isna().any() == False))
print(all(house_data.isnull().any() == False))

True
True
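
When a check like this fails, it helps to know which columns are affected. The following is a minimal sketch on a hypothetical frame with one deliberately missing value. It is not part of the original notebook run.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing lot size; the values are illustrative only.
toy = pd.DataFrame({
    "houseAge": [9, 33, 5],
    "lotSizeSqFt": [6011.0, np.nan, 6185.0],
})

# Count missing values per column and show only the columns that have any.
missing_per_column = toy.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Single flag: True if any cell in the frame is missing.
has_missing = bool(toy.isna().any().any())
print(has_missing)
```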


In [48]:
house_data.shape

(15171, 37)

In [49]:
house_data.dtypes

latitude                      float64
longitude                     float64
propertyTaxRate               float64
garageSpaces                    int64
hasAssociation                   bool
hasCooling                       bool
hasGarage                        bool
hasHeating                       bool
hasSpa                           bool
hasView                          bool
parkingSpaces                   int64
houseAge                        int64
latestPrice                   float64
numPriceChanges                 int64
numYearsLastSale                int64
numOfPhotos                     int64
numOfAccessibilityFeatures      int64
numOfAppliances                 int64
numOfParkingFeatures            int64
numOfPatioAndPorchFeatures      int64
numOfSecurityFeatures           int64
numOfWaterfrontFeatures         int64
numOfWindowFeatures             int64
numOfCommunityFeatures          int64
lotSizeSqFt                   float64
livingAreaSqFt                float64
numOfPrimary

After these steps, the final data set for the algorithm contains 37 columns and no missing values. The final predictor variables include latitude, longitude, property tax rate, garage spaces, parking spaces, house age, number of price changes, number of years since last sale, number of photos, number of accessibility features, number of appliances, number of parking features, number of patio and porch features, number of security features, number of waterfront features, number of window features, number of community features, lot size in square feet, living area in square feet, number of primary schools nearby, number of elementary schools nearby, number of middle schools nearby, number of high schools nearby, average school distance, average school rating, average school size, median number of students per teacher, number of bathrooms, number of bedrooms, number of stories, has association, has cooling, has a garage, has heating, has a spa, and has a view. The response variable is the latest house price.
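
The split into predictors and response can be sketched as follows. The frame `toy` and the names `X` and `y` are hypothetical. Casting the booleans to floats is an assumption about what a downstream algorithm might need, not something the original notebook does.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned house_data; the values are illustrative only.
toy = pd.DataFrame({
    "houseAge": [9, 33],
    "hasGarage": [True, False],
    "latestPrice": [305000.0, 295000.0],
})

# Response: the latest house price.
y = toy["latestPrice"]

# Predictors: every other column. Booleans become 1.0/0.0 under the float cast,
# which is one option if an algorithm does not accept boolean columns.
X = toy.drop(columns="latestPrice").astype("float64")

print(X.dtypes)
print(y.head())
```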

## Part 2: Descriptive Statistics