The objective of this project is to predict the median house value for the given test data using Apache Spark.
In this project, we'll use the California Housing data set. Note, of course, that this is actually 'small' data, and that Spark is used here mainly as a proof of concept.
The California Housing data set appeared in a 1997 paper titled Sparse Spatial Autoregressions, written by R. Kelley Pace and Ronald Barry and published in the journal Statistics & Probability Letters. The researchers built this data set using the 1990 California census data.
The data contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). In this sample, a block group on average includes 1425.5 individuals living in a geographically compact area.
These spatial data contain 20,640 observations on housing prices with 9 economic variables:
Longitude: refers to the angular distance of a geographic place east or west of the prime meridian for each block group
Latitude: refers to the angular distance of a geographic place north or south of the earth's equator for each block group
Housing Median Age: is the median age of the houses in a block group. Note that the median is the value that lies at the midpoint of a frequency distribution of observed values
Total Rooms: is the total number of rooms in the houses per block group
Total Bedrooms: is the total number of bedrooms in the houses per block group
Population: is the number of inhabitants of a block group
Households: refers to units of houses and their occupants per block group
Median Income: is used to register the median income of people who belong to a block group
Median House Value: is the dependent variable and refers to the median house value per block group
What's more, we also learn that all block groups with zero entries for the independent and dependent variables have been excluded from the data.
As the dependent variable, the median house value will be assigned the role of the target variable in our ML model.
Look for houses.zip here and download the data.
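
Once the archive is extracted, the nine variables described above could be declared as a schema before reading the file with Spark. The following is only a minimal sketch under stated assumptions: the column names are illustrative choices of ours, the data file is assumed to be comma-separated with no header row and with columns in the order listed above, and load_housing is a hypothetical helper rather than part of Spark. It relies on the fact that PySpark's DataFrameReader.csv accepts a schema given as a DDL-formatted string.

```python
# Illustrative column names (our own choice), in the order the variables
# are listed above. All nine are treated as numeric.
COLUMNS = [
    "longitude",
    "latitude",
    "housingMedianAge",
    "totalRooms",
    "totalBedrooms",
    "population",
    "households",
    "medianIncome",
    "medianHouseValue",
]

# The dependent variable becomes the target; the rest are candidate features.
TARGET = "medianHouseValue"
FEATURES = [name for name in COLUMNS if name != TARGET]

# DDL-formatted schema string, e.g. "longitude DOUBLE, latitude DOUBLE, ..."
SCHEMA_DDL = ", ".join(f"{name} DOUBLE" for name in COLUMNS)


def load_housing(spark, path):
    """Read the housing data into a DataFrame using the schema above.

    `spark` is an existing SparkSession. The file at `path` is assumed to be
    comma-separated with no header row (an assumption about the download,
    not something stated in the text above).
    """
    return spark.read.csv(path, schema=SCHEMA_DDL, header=False)
```

With a running SparkSession named spark, a call such as load_housing(spark, "path/to/extracted/data/file") would return a DataFrame with nine double-typed columns; the path shown is a placeholder, not the actual file name. Declaring the schema up front avoids a separate pass over the data for type inference and keeps the target column name in one place.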