## University of California, Santa Barbara
## PSTAT 135 / 235: Big Data Analytics
## Professor Tashman
## Project 3: Linear Regression Modeling of California Home Prices
Last updated: Jan 31, 2019


## Instructions
In this project, you will work with the California Home Price dataset.  You will train a regression model to predict median home prices.  

Learning Objectives
Students will gain additional expertise in the following:

RDDs, DataFrames, data preprocessing, feature engineering, model training, model evaluation

## Lab Exercises (TOTAL POINT VALUE: 10PTS)

1) Go through all code and fill in the missing cells. This will prep data, train a model, predict, and evaluate model fit.  Compute and report the Mean Squared Error (MSE) in a table at the very bottom, where all MSE values should be summarized.  
Show all work.

2) Repeat (1) with at least one additional feature from the original set.
3) Repeat (1) with at least one engineered feature based on one or more variables from the original set.  
4) Repeat (1) but fit Lasso Regression instead of Linear Regression.


### Data Source
StatLib---Datasets Archive  
http://lib.stat.cmu.edu/datasets/

houses.zip
These spatial data contain 20,640 observations on housing prices with 9 economic covariates. It appeared in Pace and Barry (1997), "Sparse Spatial Autoregressions", Statistics and Probability Letters. Submitted by Kelley Pace (kpace@unix1.sncc.lsu.edu). [9/Nov/99] (536 kbytes)



Data Description
This project uses the California Housing data set. It appeared in the 1997 paper "Sparse Spatial Autoregressions" by R. Kelley Pace and Ronald Barry, published in the journal Statistics and Probability Letters. The researchers built the data set from 1990 California census data.

The data contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). In this sample a block group on average includes 1425.5 individuals living in a geographically compact area. 

These spatial data contain 20,640 observations on housing prices with 9 economic variables.

All the block groups with zero entries for the independent and dependent variables have been excluded from the data.

The median house value is the dependent variable and will serve as the target variable in your ML model.



### Preprocessing (completed offline by instructor)

cadata_raw.txt contains a data description at the top, followed by data.

1. Separated data from header  
   cal_housing_data_raw.txt  contains only data  
   cal_housing_header.txt contains only text
2. Some values are in scientific notation.  
   Spacing is inconsistent (the first 6 fields are separated by 2 spaces; latitude and longitude are separated by 1 space).  
   Ran the following in Python to format the values:

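import pandas as pd

# A multi-character separator is treated as a regular expression, which needs the Python engine
d = pd.read_csv('/home/ubuntu/projects/cal_housing_data_raw.txt', header=None, sep='  ', engine='python')

# v8 holds latitude and longitude together, separated by a single space
d.columns=['v1','v2','v3','v4','v5','v6','v7','v8']  
d['latitude'] = d.v8.map(lambda l: float(l.split()[0]))  
d['longitude'] = d.v8.map(lambda l: float(l.split()[1]))  
# Write comma-separated output with no index column and no header row
d.to_csv('/home/ubuntu/projects/cal_housing_data_preproc.txt', index=False, header=False)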
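As a rough illustration of why this formatting step was needed, the sketch below parses one whitespace-delimited record in plain Python. The sample line is hypothetical (it is not copied from the data file) and only mimics the mixed spacing and scientific notation described above; the helper names `parse_record` and `to_csv_row` are likewise made up for this example.

```python
# Hypothetical record: fields separated by two spaces, except the final
# latitude/longitude pair, which is separated by a single space.
sample_line = (
    "4.526e+05  8.3252e+00  4.1e+01  8.8e+02  1.29e+02  "
    "3.22e+02  1.26e+02  37.88 -122.23"
)


def parse_record(line):
    """Split on any run of whitespace and convert every field to float."""
    return [float(field) for field in line.split()]


def to_csv_row(values):
    """Format the parsed values as one comma-separated line."""
    return ",".join(repr(v) for v in values)


record = parse_record(sample_line)
print(len(record))         # number of fields recovered
print(to_csv_row(record))  # values written in plain decimal notation
```

Splitting on arbitrary whitespace sidesteps the one-space versus two-space inconsistency, and converting to `float` normalizes the scientific notation.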


In [1]:
# import pyspark modules
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import *       # for datatype conversion
from pyspark.sql.functions import *   # for col() function
from pyspark.ml.linalg import DenseVector   # pyspark.ml estimators expect ml (not mllib) vectors
from pyspark.ml.feature import StandardScaler
from pyspark.ml.regression import LinearRegression
import pandas as pd

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)

Read the text file, which is comma-separated

Show the first 5 rows of data

Count the number of rows


Print the schema

Select variables *median_house_value* and *median_income*, and show the first 5 rows (**1PT**)


Show descriptive statistics on the following: (**1PT**)  
*households*, *median_house_value*, *median_income*, *total_bedrooms*

### Additional Preprocessing

1) Scale the response variable *median_house_value* by dividing it by 100000, and save the result in column *median_house_value_final* (**1PT**)

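2) Select the columns below and combine the predictors into a single "features" column, stored in a dataframe alongside the scaled response variable *median_house_value_final*. (*rooms_per_household* is not in the input file and must be derived first.)  
*median_house_value_final*, *total_bedrooms*, *population*, *households*, *median_income*, *rooms_per_household*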

3) Scale the features (**1PT**)
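The sketch below walks through steps 1 to 3 on a few illustrative rows in plain Python, so the arithmetic is visible without a Spark session. The rows are not read from the project file, `total_rooms` is an assumed column name, and `scale_to_unit_std` is a hypothetical helper. It also assumes that "scale" means dividing each feature by its sample standard deviation without centering, which corresponds to the default settings of `pyspark.ml.feature.StandardScaler` (`withStd=True`, `withMean=False`); adjust if a different scaling is intended.

```python
from statistics import stdev

# Illustrative rows: (median_house_value, total_rooms, households, median_income)
rows = [
    (452600.0, 880.0, 126.0, 8.3252),
    (358500.0, 7099.0, 1138.0, 8.3014),
    (352100.0, 1467.0, 177.0, 7.2574),
    (341300.0, 1274.0, 219.0, 5.6431),
]

# Step 1: express the response in units of 100,000.
labels = [value / 100000 for value, _, _, _ in rows]

# Step 2: build one feature vector per row, including a derived ratio
# (total_rooms is an assumed column name).
features = [
    [households, income, total_rooms / households]
    for _, total_rooms, households, income in rows
]


def scale_to_unit_std(vectors):
    """Divide each column by its sample standard deviation (no centering)."""
    columns = list(zip(*vectors))
    stds = [stdev(column) for column in columns]
    return [[x / s for x, s in zip(vector, stds)] for vector in vectors]


# Step 3: scale the features.
scaled = scale_to_unit_std(features)
for label, vector in zip(labels, scaled):
    print(round(label, 3), [round(x, 3) for x in vector])
```

After scaling, every feature column has unit sample standard deviation, so features measured in very different units (counts versus income) contribute on a comparable scale.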

Split the data into training (80%) and test (20%) sets, using `seed=314`
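To make the role of the seed concrete, here is a minimal plain-Python sketch of a seeded random split. It is not Spark's implementation (in PySpark, `DataFrame.randomSplit([0.8, 0.2], seed=314)` plays this role); the function name `random_split` and the toy data are hypothetical. Because each row is assigned by an independent random draw, the proportions come out close to, but not exactly, 80/20.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


data = list(range(1000))  # stand-in for 1,000 rows
train, test = random_split(data, train_fraction=0.8, seed=314)
print(len(train), len(test))  # close to 800 / 200, but not exact
```

Reusing the same seed reproduces the same split, which keeps the MSE values from different runs comparable.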

Initialize a linear regression object with these parameters:  
`maxIter=10`, `regParam=0.3`, `elasticNetParam=0.8`
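For reference, the regularization parameters enter the elastic-net objective that Spark's `LinearRegression` minimizes, as described in the Spark ML documentation. Writing $\lambda$ for `regParam` and $\alpha$ for `elasticNetParam`, and omitting the intercept for brevity:

```latex
\min_{w}\; \frac{1}{2n}\sum_{i=1}^{n}\left(w^{\top}x_i - y_i\right)^2
  + \lambda\left(\alpha\,\lVert w\rVert_1
  + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right)
```

With $\alpha = 1$ the penalty is purely L1 (Lasso), and with $\alpha = 0$ it is purely L2 (ridge). The setting `elasticNetParam=0.8` is therefore a mix weighted toward the L1 term.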

Fit the model using the training data

For each datapoint in the test set, make a prediction (hint: apply `transform()` to the model).
The returned object should be a dataframe.

Convert the dataframe to an RDD. Then select only the `prediction` and `label` fields (hint: use `map()`) (**2PTS**)

Evaluate the model by computing the Mean Squared Error (MSE), which is the mean of the squared differences between the predicted values and the labels.

This can be computed in a single line using `reduce()`
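The reduction can be sketched in plain Python with `functools.reduce`, which follows the same pattern as an RDD's `map()` and `reduce()`: square each difference, sum the squares, then divide by the number of points. The `(prediction, label)` pairs below are made up for illustration.

```python
from functools import reduce

# Made-up (prediction, label) pairs standing in for the test-set output.
pairs = [(2.0, 2.5), (1.0, 1.0), (3.5, 3.0), (4.0, 5.0)]

# Map each pair to its squared error, then reduce by addition.
squared_errors = map(lambda pl: (pl[0] - pl[1]) ** 2, pairs)
sum_squared_error = reduce(lambda a, b: a + b, squared_errors)

# Divide the summed squared error by the number of points.
mse = sum_squared_error / len(pairs)
print(mse)  # 0.375
```

On an RDD, the number of points comes from `count()` rather than `len()`.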

Show all MSE values in a table at the bottom, labeled run1, run2, run3, run4 (**each MSE worth 1PT for 4PTS total**)