Ashish-Ghoshal/House-Price-Prediction

Predicting House Prices with Apache Spark

Objective

The objective of this project is to predict the median house value for the given test data using Apache Spark.

This project uses the California Housing data set. Note that this is actually 'small' data; Spark is used here mainly as a proof of concept.

Understanding the Data Set

The California Housing data set appeared in a 1997 paper titled Sparse Spatial Autoregressions, written by R. Kelley Pace and Ronald Barry and published in the journal Statistics & Probability Letters. The researchers built this data set from the 1990 California census data.

The data contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). In this sample, a block group on average includes 1425.5 individuals living in a geographically compact area.

These spatial data contain 20,640 observations on housing prices with 9 economic variables:

Longitude: the angular distance of a geographic place east or west of the prime meridian, for each block group
Latitude: the angular distance of a geographic place north or south of the earth's equator, for each block group
Housing Median Age: the median age of the houses in a block group. Note that the median is the value that lies at the midpoint of a frequency distribution of observed values
Total Rooms: the total number of rooms in the houses per block group
Total Bedrooms: the total number of bedrooms in the houses per block group
Population: the number of inhabitants of a block group
Households: the number of households (housing units and their occupants) per block group
Median Income: the median income of the people who belong to a block group
Median House Value: the dependent variable; the median house value per block group
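
As a rough illustration of how the nine variables above could be represented in code, the following Python sketch maps one delimited text line to a named record. The snake_case column names, the column order, the comma separator, and the helper name parse_row are all assumptions made for this example; check them against the file you actually download. The sample row uses made-up values.

```python
from typing import Dict, List

# Assumed column names and order; verify against the downloaded data.
COLUMNS: List[str] = [
    "longitude",
    "latitude",
    "housing_median_age",
    "total_rooms",
    "total_bedrooms",
    "population",
    "households",
    "median_income",
    "median_house_value",
]


def parse_row(line: str, sep: str = ",") -> Dict[str, float]:
    """Turn one delimited text line into a {column: value} record."""
    values = [float(field) for field in line.strip().split(sep)]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))


# Illustrative row with made-up values (not copied from the data set).
example = parse_row("-120.5,36.2,25,1500,300,800,280,3.5,150000")
print(example["median_house_value"])
```

The same list of names could be reused when assigning column names to a Spark DataFrame, so that later steps refer to columns by name rather than by position.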

What's more, all block groups that have zero entries for the independent and dependent variables have been excluded from the data.

The Median house value is the dependent variable and will be assigned the role of the target variable in our ML model.
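
To make the feature/target split concrete, here is a minimal Python sketch that separates the eight independent variables from the target for a single record. The column names and the helper split_features_target are hypothetical, chosen for this example only; in a Spark ML pipeline the equivalent step would typically be handled by assembling the feature columns into a vector and naming the label column.

```python
from typing import Dict, List, Tuple

# Hypothetical column names for illustration.
TARGET = "median_house_value"
FEATURES: List[str] = [
    "longitude",
    "latitude",
    "housing_median_age",
    "total_rooms",
    "total_bedrooms",
    "population",
    "households",
    "median_income",
]


def split_features_target(record: Dict[str, float]) -> Tuple[List[float], float]:
    """Return (feature values in FEATURES order, target value) for one record."""
    features = [record[name] for name in FEATURES]
    return features, record[TARGET]


# Made-up record, used only to show the shape of the result.
record = {
    "longitude": -120.5,
    "latitude": 36.2,
    "housing_median_age": 25.0,
    "total_rooms": 1500.0,
    "total_bedrooms": 300.0,
    "population": 800.0,
    "households": 280.0,
    "median_income": 3.5,
    "median_house_value": 150000.0,
}
features, label = split_features_target(record)
print(len(features), label)
```

Keeping the target name in a single constant makes it harder to leak the dependent variable into the feature set by accident.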

Look for houses.zip from here and download the data.
