This repository explores predicting California housing prices in the 1990s using a linear regression model. We use scikit-learn for the project, and the required data is available through scikit-learn.
Data Set Characteristics | |
---|---|
Number of instances | 20640 |
Number of attributes | 8 |
Attributes | MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude |
Target | Median house value for California districts, expressed in hundreds of thousands of dollars ($100,000) |
This dataset was obtained from the StatLib repository. https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
A household is a group of people residing within a home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.
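To make the per-household effect concrete, here is a small arithmetic sketch. The numbers and the helper name `per_household_average` are hypothetical and do not come from the dataset; they only illustrate how dividing by a small household count inflates the average.

```python
def per_household_average(total, households):
    """Average of a block-group total (e.g. rooms) per occupied household."""
    return total / households


# Hypothetical residential block group: 500 households, 2,500 rooms in total.
typical = per_household_average(total=2500, households=500)

# Hypothetical resort block group: 120 houses with 5 rooms each (600 rooms),
# but only 12 of them are occupied by a household.
resort = per_household_average(total=600, households=12)

print(typical)  # 5.0
print(resort)   # 50.0
```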
- Standard linear regression model available with scikit-learn
- Input: the above data from the 1990 census
- Output: median house value in hundreds of thousands of dollars
- Split the data into training and test datasets in an 80:20 ratio.
- Create a pipeline containing a StandardScaler and a LinearRegression model.
- Fit the pipeline on the training data to create the model.
- Predict the results using the model on the training data.
- Compare the results to the given housing prices and calculate the r2_score and the mean squared error.
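The steps above can be sketched roughly as follows. This is a minimal sketch, not the repository's actual code: the function name `fit_and_evaluate`, the `seed` parameter, and the synthetic stand-in data are assumptions made so the example runs without downloading anything. It reports metrics for both the training and the test split, since either can be compared against the known prices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_and_evaluate(X, y, seed=0):
    """Split 80:20, fit StandardScaler + LinearRegression, score both splits."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X_train, y_train)

    metrics = {}
    for name, X_part, y_part in (
        ("train", X_train, y_train),
        ("test", X_test, y_test),
    ):
        pred = model.predict(X_part)
        metrics[name] = {
            "r2": r2_score(y_part, pred),
            "mse": mean_squared_error(y_part, pred),
        }
    return model, metrics


# Synthetic stand-in data (assumption): 8 features, linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 8))
y_demo = X_demo @ np.arange(1, 9, dtype=float) + rng.normal(scale=0.5, size=1000)

model, metrics = fit_and_evaluate(X_demo, y_demo)
print(metrics)
```

To run the same sketch on the real data, `X` and `y` can be obtained with `fetch_california_housing(return_X_y=True)` from `sklearn.datasets`, which downloads the dataset on first use.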
Metric | Value |
---|---|
R2 Score | 0.5891435539852219 |
MSE | 0.5472825858911409 |
The R2 score of 0.59 tells us that the model explains only about 59% of the variance in median house values, so a lot of variability remains between the model's predictions and the true values provided in the dataset.
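For reference, both metrics can be computed by hand. The sketch below uses plain Python and a made-up four-point example; the helper names `mse` and `r2` are illustrative, and in the project itself the scikit-learn functions `mean_squared_error` and `r2_score` play this role.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values purely for illustration.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print(mse(y_true, y_pred))  # 0.125
print(r2(y_true, y_pred))   # 0.9
```

Because the target is expressed in units of $100,000, the MSE is in squared units. Taking the square root of the reported MSE of about 0.547 gives roughly 0.74, which corresponds to a typical error of about $74,000.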
- Evaluate whether to normalize the Longitude and Latitude attributes of the dataset. Is there a better way to represent latitude and longitude?
- Evaluate whether external data can be added to improve the model and its accuracy.
- Evaluate other feature engineering techniques to obtain more representative features.
A linear regression model describes the relationship between a dependent variable, y, and one or more independent variables, X. The model is linear in its coefficients. Here we try to fit the n-dimensional hyperplane that best represents the given data.
For more details, see the MathWorks page on linear regression.
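In symbols, the description above can be written as follows. The notation is an assumption added here for illustration: $\beta_0, \dots, \beta_n$ are the coefficients, $n$ is the number of features (8 for this dataset), $m$ is the number of training rows, and $x_{ij}$ is feature $j$ of row $i$. The second expression is the ordinary least squares criterion, which is what scikit-learn's `LinearRegression` minimizes.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n,
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{m}
  \left( y_i - \beta_0 - \sum_{j=1}^{n} \beta_j x_{ij} \right)^2
```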
- scikit-learn
- Linear regression @ MathWorks
- scikit-learn California Housing Dataset
- 123 of AI
- Suburb image by upklyak
If you have any feedback/are interested in collaborating, please reach out to me via 📧