### Your name:

Joan Soo Li Lim

### Collaborators:

None


In [55]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Open the housing data


In [56]:
import os
import tarfile

HOUSING_PATH = "."

def fetch_housing_data(housing_path=HOUSING_PATH):
    # Create the target directory if needed, then unpack housing.tgz into it
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

fetch_housing_data()

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


### Build full pipeline for the data analysis following the example of the notebook.
Hint: the main change requested is the algorithm used (Lasso regression).

If you want to learn more about the Lasso regression, see resources below:
- http://scikit-learn.org/stable/modules/linear_model.html#lasso
- https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/

#### Considerations for building pipeline:

- Split data into training and testing sets below.
- Convert all categorical data to one-hot vectors below.
- Normalize all non-categorical data.
- Perform Lasso-based regression using a variety of values for $\alpha$ between 0 and 1 via a grid search, where *housing_labels* is the output and all other features are the input (similar to what was shown in lecture two).

In [57]:
# Check for null values. 
housing.count()

longitude             20640
latitude              20640
housing_median_age    20640
total_rooms           20640
total_bedrooms        20433
population            20640
households            20640
median_income         20640
median_house_value    20640
ocean_proximity       20640
dtype: int64

In [58]:
# Only the "total_bedrooms" column has null values. Fill them with the column median.
median = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)
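The null check above infers missing values from `count()`. As a minimal sketch, assuming a small made-up frame (`toy` is hypothetical, not the housing data), the same check and fill can be written with `isna().sum()`, which reports the number of missing entries per column directly:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Count missing entries per column directly
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Fill the gaps with the column median, assigning the result back to the column
bedroom_median = toy["total_bedrooms"].median()
toy["total_bedrooms"] = toy["total_bedrooms"].fillna(bedroom_median)
```

Assigning the result of `fillna` back to the column avoids relying on `inplace=True` on a selected column, which newer pandas versions warn about.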

In [59]:
# Step 1: One hot vectors for categorical
# Step 2: Normalize numerical data

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer 
from sklearn import preprocessing

debug = False

# Split data into training and testing sets using sklearn's built-in function. 
# As per the example from the text, use the same defaults of test ratio = 20% and random state = 42. 
data_train, data_test = train_test_split(housing, 
                                         test_size=0.2, 
                                         random_state=42)
  
# Create labels for training set    
housing_train = data_train.drop("median_house_value", axis=1) 
housing_train_labels = data_train["median_house_value"].copy() 

if (debug):
    print(len(data_train), "train +", len(data_test), "test")
    print (data_train.head())
    print (data_test.head())
    print ("------------------------")
    print (housing_train.head())
    print (housing_train_labels.head())
    
# First step: Convert categorical data (ocean proximity) to one-hot vectors
# scikit-learn 0.20 was still in DEV, so I chose LabelBinarizer instead.
lb = LabelBinarizer()

# Fit on the entire set to get ALL labels (LabelBinarizer expects a 1-D column)
lb.fit(housing['ocean_proximity'])

# Transform train and test sets separately
train_cat_1hot = lb.transform(housing_train['ocean_proximity'])
test_cat_1hot = lb.transform(data_test['ocean_proximity'])

if (debug):
    print (lb.classes_)
    print (train_cat_1hot[:5])
    print (test_cat_1hot[:5])
    
# Second step: Normalize non-categorical data to unit L2 (Euclidean) norm
# NOTE: preprocessing.normalize rescales each row (sample), not each feature column
housing_num = housing_train.select_dtypes(include=[np.number])
housing_num_normalized = preprocessing.normalize(housing_num, norm='l2')

if (debug):
    print (housing_num.head())
    print (housing_num_normalized[:5])
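`preprocessing.normalize` and feature scaling are easy to confuse. The sketch below, on a made-up matrix (`X` is hypothetical), contrasts the two: `normalize` divides each row by its own L2 norm, while `StandardScaler` centers each column and scales it to unit variance. The latter is what is usually meant by putting features on a common scale before a penalized regression.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Hypothetical toy matrix: rows are samples, columns are features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])

# Each ROW is divided by its own L2 norm
row_scaled = normalize(X, norm="l2")

# Each COLUMN is centered and scaled to unit variance
col_scaled = StandardScaler().fit_transform(X)

print(np.linalg.norm(row_scaled, axis=1))               # per-row norms
print(col_scaled.mean(axis=0), col_scaled.std(axis=0))  # per-column mean and std
```

After row normalization the second column still dominates every row, so the features are not comparable to one another; after column scaling they are.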
    

In [60]:
# Now, using pipeline. I left the above as-is so that I can see what is happening step by step. Cheerios.
# Majority of the code is from the lecture notebook with changes identified in comments.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
    
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]   

# Numerical pipeline  
# NOTE:
# (1) Use median strategy for NA values. 
# (2) Did not combine any other attributes as it wasn't specified in the assignment.
# (3) Did not use StandardScaler as we will normalize using parameter into Lasso later. 
num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', Imputer(strategy="median")),])

# Categorical pipeline 
# NOTE:
# (1) Perform label encoding first, as the input to (2) must be numerical as of scikit-learn 0.19.x. 
# (2) Use One Hot Encoder without a sparse matrix. 

le = LabelEncoder()

# OMIT. Save directly into ocean_proximity --- housing_train[['ocean_encoded']] = housing_train[['ocean_proximity']].copy() #
# Encode ocean proximity with values 0 to 4. Fit on the full set to ensure we get all labels.
# Fit without assigning back, so the original housing column is left intact.
le.fit(housing['ocean_proximity'])
housing_train['ocean_proximity'] = le.transform(housing_train['ocean_proximity'])

if (debug):
    print (data_train["ocean_proximity"].value_counts())
    print (housing_train["ocean_proximity"].value_counts())
    print (housing_train.head())

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('cat_encoder', OneHotEncoder(sparse=False)),])

   
# Full pipeline    
full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),])

housing_prepared = full_pipeline.fit_transform(housing_train)
housing_prepared

array([[-117.03,   32.71,   33.  , ...,    0.  ,    0.  ,    1.  ],
       [-118.16,   33.77,   49.  , ...,    0.  ,    0.  ,    1.  ],
       [-120.48,   34.66,    4.  , ...,    0.  ,    0.  ,    1.  ],
       ...,
       [-118.38,   34.03,   36.  , ...,    0.  ,    0.  ,    0.  ],
       [-121.96,   37.58,   15.  , ...,    0.  ,    0.  ,    0.  ],
       [-122.42,   37.77,   52.  , ...,    0.  ,    1.  ,    0.  ]])
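The pipeline above targets scikit-learn 0.19.x. For readers on a newer release, here is a hedged sketch of an equivalent preprocessing step, assuming scikit-learn 0.20 or later, where `SimpleImputer` replaces `Imputer`, `OneHotEncoder` accepts string categories directly, and `ColumnTransformer` replaces the `DataFrameSelector` plus `FeatureUnion` combination. The frame `toy` and the names `num_cols`, `cat_cols` and `preprocess` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows with the same kinds of columns as the housing data
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"],
})
num_cols = ["total_rooms", "total_bedrooms"]
cat_cols = ["ocean_proximity"]

preprocess = ColumnTransformer(
    transformers=[
        # Median imputation for numeric columns, as in the pipeline above
        ("num", SimpleImputer(strategy="median"), num_cols),
        # One-hot encode the string categories; no LabelEncoder step is needed
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_prepared = preprocess.fit_transform(toy)
print(toy_prepared.shape)
```

Because the encoder is fitted inside the transformer, the same `preprocess.transform` call can be applied to the test set without a separate label-encoding pass.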

In [61]:
# Perform Lasso regression for values α between 0 and 1 via a grid search 

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# A range of alpha values to test from 1 to 0
# I omitted 0 as it was not converging despite increasing iterations to 10000 (noted below)
alphas = np.array([1,0.7,0.5,0.4,0.1,0.01])

# Normalize any numerical values and try max_iter of 10000
lasso = Lasso(normalize=True, max_iter=10000)

# Try 5 folds as increasing to 10 will yield convergence problems
grid_search = GridSearchCV(estimator=lasso, param_grid=dict(alpha=alphas) , cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)

grid_search.fit(housing_prepared, housing_train_labels)

if (debug):
    print(grid_search)
    
# Results of the grid search for best alpha
print(grid_search.best_estimator_.alpha)
rmse = np.sqrt(-grid_search.best_score_)
print (rmse)
print ("------------")

# Results of the grid search in general
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

0.1
68637.22420908543
------------
68643.35988805693 {'alpha': 1.0}
68639.44250156602 {'alpha': 0.7}
68637.81752886708 {'alpha': 0.5}
68637.32710598563 {'alpha': 0.4}
68637.22420908543 {'alpha': 0.1}
68637.50618326574 {'alpha': 0.01}
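The `normalize=True` argument used above was removed from `Lasso` in scikit-learn 1.2. A common substitute is to put a `StandardScaler` in front of the regressor inside a `Pipeline` and tune `lasso__alpha` with the grid search. The sketch below assumes synthetic data (`X` and `y` are made up, not the housing set). `StandardScaler` does not rescale features in exactly the same way as `normalize=True` did, so the best `alpha` found this way is not directly comparable to the values printed above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic regression data with features on very different scales
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y = X @ np.array([2.0, 0.05, 300.0]) + rng.normal(scale=0.1, size=200)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold
model = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", Lasso(max_iter=10000)),
])

# Same alpha candidates as in the grid search above
param_grid = {"lasso__alpha": [1.0, 0.7, 0.5, 0.4, 0.1, 0.01]}
search = GridSearchCV(model, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

best_alpha = search.best_params_["lasso__alpha"]
best_rmse = np.sqrt(-search.best_score_)
print(best_alpha, best_rmse)
```

Keeping the scaler inside the pipeline means the cross-validation folds never see statistics computed from their own held-out data.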


In [62]:
# Find RMSE of test set.
from sklearn.metrics import mean_squared_error

final_model = grid_search.best_estimator_

# Perform same preprocessing
X_test = data_test.drop("median_house_value", axis=1)
y_test = data_test["median_house_value"].copy()
X_test[['ocean_proximity']] = X_test[['ocean_proximity']].apply(le.transform)

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)   

print (final_rmse)

70052.57839145602



Why is it necessary to normalize all continuous variables before performing Lasso? (OPTIONAL)

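I did not have time to research this properly, so this is my best guess. Lasso penalizes the absolute size of each coefficient, and the size a coefficient needs depends on the units of its feature. Without normalization, a feature measured on a small scale needs a large coefficient and is penalized more heavily than an equally informative feature measured on a large scale, so the penalty would reflect the units rather than the importance of each feature.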
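A small experiment can make the scale question concrete. This is a sketch on made-up data, not the housing set: two features contribute equally to `y`, but the second is stored in units that make its values 100 times smaller. The names `x1`, `x2`, `X_raw` and the choice `alpha=0.5` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: two equally informative features
rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = rng.normal(size=1000)
y = x1 + x2 + rng.normal(scale=0.1, size=1000)

# Store the second feature on a scale 100 times smaller,
# so it would need a coefficient of about 100 to contribute equally
X_raw = np.column_stack([x1, x2 / 100.0])

# Same penalty strength, with and without putting the features on a common scale
coef_raw = Lasso(alpha=0.5).fit(X_raw, y).coef_
X_scaled = StandardScaler().fit_transform(X_raw)
coef_scaled = Lasso(alpha=0.5).fit(X_scaled, y).coef_

print("raw features:   ", coef_raw)
print("scaled features:", coef_scaled)
```

On the raw features the penalty drives the small-scale feature's coefficient to zero even though it matters as much as the first one. After scaling, both features are shrunk by a similar amount.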

### Conclusions
For what values of $\alpha$ does Lasso perform best? Does it perform as well on the housing data as the linear regressor from the lectures? Why do you think this is?

Comment:
Personally, I don't think I tuned this well, nor did I execute all steps as per the Module 2 notebook. However, I am out of time. The errors arising from the provided examples were time-consuming, as I searched for 'stable' versions of the APIs I could use. I prefer not to install or use 'unstable' code. Also, I did this within a span of a little over a day so as to submit on Thursday! As mentioned via email, I could not work on this earlier due to health issues. I look forward to the solution discussion, after which I will return and tune this a bit more. I suspect I will learn a great deal. :) 

Answer:
For my hacky grid search, the best alpha was 0.1 with a cross-validated RMSE of 68637.22420908543. Its final RMSE on the entire test set is 70052.57839145602. I don't think it is fair to compare it to the linear regressor (with a smaller final RMSE of 68628.19819848922 for 5 data points, as noted in cell 80), as I did not set it up the same way. If I recall, the linear regressor also utilized combined attributes etc. However, the two are fairly close, which is what I sort of suspected. I may try to play around with this if time permits after our discussion in lecture, and try Ridge as well. I did notice, however, that Lasso did not converge for lower values of alpha, and that might be a negative for this algorithm. Also, I am not completely sure how well Lasso works with one-hot vectors with many zeros.

###  Read appendix B

- Reflect on your last data project, read appendix B. Then, write down a few of the checklist items that your last data project could have used. If you have not yet done a data project, then write down a few of the items that you found most interesting.


The project checklist is most definitely something I would print out and lay beside me, as I have a tendency to dive in without adequate preparation. For the most part, I have done the majority of what was described in the text. However, I could place more emphasis on feature engineering and on analyzing correlations between the new features. This requires a bit more expertise in the subject area, and I suspect that I will improve in that aspect over time. Similarly, I will need to learn more about models in various categories in order to perform 'quick and dirty' analysis with standard parameters. This is fairly important in order to narrow down the exact models I would like to experiment with or build ensemble models from. Finally, I think the section on maintaining the solution is important, as we sometimes do not pay enough attention to data rot or diminishing quality. Just as a software system has scripts that automatically track memory leaks or data locks, the ML solution needs to be monitored.