# California Housing Price 
Reference: the exercise in Chapter 2 of 'Hands-On Machine Learning with Scikit-Learn and TensorFlow' by Aurélien Géron.

##### Tip> shortcuts for Jupyter Notebook
* Ctrl + Enter : run cell
* Shift + Enter : run cell and select below

## 1. Data Load

Load the data with the *read_csv()* function from the __Pandas__ module, then look at the first 10 rows using the *head()* method.

In [None]:
# Data load
import pandas as pd

housing = pd.read_csv('housing.csv')
housing.head(10)

Let's take a brief look at the geographic distribution of the data using the __matplotlib__ module.

In [None]:
# figures plotting with data
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/50, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
    sharex=False)

plt.legend()

To better understand the characteristics of each feature, let's apply the *info()* method.

In [None]:
# check the structure of the data
housing.info()

Let’s look at how much each attribute correlates with the *median house value*:

In [None]:
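# correlation between the median_house_value and the other numeric features
# numeric_only=True skips the text column 'ocean_proximity' (required on pandas >= 2.0)
corr_matrix = housing.corr(numeric_only=True)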
corr_matrix['median_house_value'].sort_values(ascending=False)
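
As a rough illustration of what *corr()* reports, the sketch below computes the Pearson correlation coefficient by hand and compares it with the pandas result. The toy values are made up for this example and are not taken from `housing.csv`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "median_house_value": [110.0, 190.0, 320.0, 390.0, 510.0],
})
x = toy["median_income"]
y = toy["median_house_value"]

# Pearson r = covariance / (std_x * std_y), written with sums of deviations
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity
r_pandas = x.corr(y)

print(r_manual, r_pandas)
```

A value close to 1 means a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little linear relationship.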

## 2. Prepare the Data

This step consists of 'pre-processing', 'train-test separation', and 'feature-label separation'.

### 2-1) Pre-processing 

#### 2-1.1) Data cleaning
Most Machine Learning algorithms cannot work with missing features, so let’s replace the missing values of 'total_bedrooms' with the column median.

In [None]:
# replace the missing values with the median
median = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)
housing.info()
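
As a possible alternative to *fillna()*, scikit-learn's *SimpleImputer* can learn the median and apply it later to other data. The sketch below uses a made-up toy column rather than the real dataset, and only shows that both approaches fill the gap with the same value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with one missing value (hypothetical values, for illustration only)
toy = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 500.0]})

# Option 1: pandas fillna with the column median, as in the cell above
filled_pandas = toy["total_bedrooms"].fillna(toy["total_bedrooms"].median())

# Option 2: SimpleImputer stores the learned median in statistics_
imputer = SimpleImputer(strategy="median")
filled_sklearn = imputer.fit_transform(toy)

print(filled_pandas.tolist())
print(imputer.statistics_)
```

One reason to prefer the imputer is that the stored median can be reused on a test set with `imputer.transform(...)`.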

#### 2-1.2) Attribute combinations
The raw totals depend on how many households a district contains, so *rooms_per_household* is more informative than *total_rooms*. Likewise, *bedrooms_per_room* is more informative than *total_bedrooms*.

In [None]:
# Attribute combinations
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
del housing["total_rooms"], housing["total_bedrooms"]

housing.info()

In [None]:
######################################################### < Quiz. >#####################################################################

# To do : Write the code to calculate the correlation coefficient between 'bedrooms_per_room' and 'median_house_value'.




#########################################################################################################################################

#### 2-1.3) Feature Scaling
Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales.

__Scikit-Learn__ provides a transformer called *StandardScaler* for *standardization*.

In [None]:
# feature standardization
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# exclude the columns that should not be scaled
col_list = list(housing)
col_list.remove('ocean_proximity') # text type
col_list.remove('median_house_value') # the target variable does not need to be scaled

# generate a new dataframe that consists of numeric columns only
housing_numeric = housing[col_list]
housing_scaled = scaler.fit_transform(housing_numeric)
# convert the NumPy array returned by fit_transform back to a DataFrame
housing_scaled_df = pd.DataFrame(housing_scaled, index=housing_numeric.index, columns=housing_numeric.columns)

# Concatenate 
housing = pd.concat([housing_scaled_df, housing['median_house_value'], housing['ocean_proximity']], axis=1)
housing.head()
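
To make the standardization step concrete, the sketch below checks on a made-up toy frame that *StandardScaler* subtracts each column's mean and divides by its population standard deviation (`ddof=0`). The column values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values) with two columns on very different scales
toy = pd.DataFrame({
    "median_income": [2.0, 4.0, 6.0, 8.0],
    "population": [100.0, 300.0, 500.0, 700.0],
})

# Standardize with scikit-learn
scaled = StandardScaler().fit_transform(toy)

# Same computation by hand: z = (x - mean) / std, using the population std
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates because of its units.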

#### 2-1.4) Handling Text and Categorical Attributes
Most Machine Learning algorithms work with numbers, so let’s convert the text attribute 'ocean_proximity' to numbers.

__Pandas__ provides the *get_dummies* function to convert categorical values into one-hot vectors.

In [None]:
# One-hot encoding
housing = pd.get_dummies(housing)
housing.head(10)
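
The effect of *get_dummies* is easier to see on a tiny frame. The sketch below uses made-up rows; the category names mirror the style of 'ocean_proximity' but are only examples.

```python
import pandas as pd

# Toy data (hypothetical rows) with one numeric and one text column
toy = pd.DataFrame({
    "median_income": [3.2, 5.1, 2.4],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# Numeric columns pass through; each category becomes its own indicator column
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one indicator set among the `ocean_proximity_*` columns, which is what "one-hot" means.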

### 2-2) Training and Test Set Separation
__Scikit-Learn__ provides the *train_test_split* function to split a dataset randomly into training and test subsets.

In [None]:
# training - test separation
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

print('# of train_set : %.0f, # of test_set : %.0f' %(train_set.shape[0], test_set.shape[0]))
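
The sketch below illustrates, on a made-up 10-row frame, what `test_size=0.2` and `random_state=42` do: 20% of the rows go to the test set, and repeating the call with the same seed yields the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical values): 10 rows
toy = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# Two splits with the same seed
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))           # sizes of the two subsets
print(test_a.index.equals(test_b.index))   # same seed, same rows
```

Fixing the seed keeps the test set stable between runs, so results from different models stay comparable.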

### 2-3) Features and Target Value Separation of the Training Set
It’s time to prepare the data for your Machine Learning algorithms.

Let’s separate the features from the target value so that we can fit the model H(X).

In [None]:
# feature and label separation of the training set
train_set_features = train_set.drop('median_house_value',axis=1)
train_set_target = train_set["median_house_value"].copy()

## 3. Linear Regression
Generate the linear regression model by using the *LinearRegression* class from __Scikit-learn__.

To calculate the RMSE, we will use the *mean_squared_error* function from __scikit-learn__, together with the __numpy__ module for the square-root operation.

 $$RMSE = \sqrt{\sum{(y - \widehat y)^2}\over N}$$
 <br/>
 
$y$ : actual median_house_value, $\widehat y$ : predicted median_house_value, $N$ : total number of samples<br/>
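
As a small worked example of the formula, the sketch below computes the RMSE directly from its definition and compares it with the square root of *mean_squared_error*. The three price pairs are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values (hypothetical): actual and predicted prices
y_true = np.array([200000.0, 150000.0, 300000.0])
y_pred = np.array([210000.0, 140000.0, 330000.0])

# RMSE from the definition: sqrt( sum((y - y_hat)^2) / N )
rmse_manual = np.sqrt(np.sum((y_true - y_pred) ** 2) / len(y_true))

# Same value via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, the single 30,000 miss contributes more to the result than the two 10,000 misses combined.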

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np # for the square root calculation

# generate model by using training set
lin_reg = LinearRegression()
lin_reg.fit(train_set_features, train_set_target) 

# Feature and target value separation of the test set
test_set_features = test_set.drop('median_house_value',axis=1)
test_set_target = test_set["median_house_value"].copy()

# target value predicted from our model
final_model = lin_reg
final_predictions = final_model.predict(test_set_features)

# RMSE
final_mse = mean_squared_error(test_set_target, final_predictions)
final_rmse = np.sqrt(final_mse)

print('final_linear_RMSE : %.2f' %final_rmse)

## 4. Ridge Regression
__scikit-learn__ provides the *Ridge* class for ridge regression and the *cross_val_score* function for cross-validation.
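
Before applying this to the housing data, the sketch below shows the general pattern on synthetic data generated only for this example. Note that `scoring="neg_mean_squared_error"` returns negated MSE values (scikit-learn treats higher scores as better), so the sign is flipped before taking the square root.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption, for illustration only): linear signal plus small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 5-fold cross-validation; one negated MSE per fold
neg_mse = cross_val_score(Ridge(alpha=1.0), X, y,
                          scoring="neg_mean_squared_error", cv=5)

# Flip the sign, then take the square root to get one RMSE per fold
rmse_per_fold = np.sqrt(-neg_mse)

print(rmse_per_fold)
print(rmse_per_fold.mean())
```

Averaging the per-fold RMSE gives a single number per `alpha`, which makes it possible to compare several regularization strengths.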

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# a function defined to calculate the RMSE with 5-fold cross-validation.
def mean_cv_rmse(model):
    rmse= np.sqrt(-cross_val_score(model, train_set_features, 
                                   train_set_target, scoring="neg_mean_squared_error", cv = 5))
    return(rmse.mean())

# find best alpha
alpha_range = np.arange(0, 1.5, 0.1)
cv_ridge = [mean_cv_rmse(Ridge(alpha = alpha_value)) for alpha_value in alpha_range]
cv_ridge = pd.Series(cv_ridge, index=alpha_range)
ridge_best_alpha = cv_ridge.idxmin()
print("Best alpha : %f" % (ridge_best_alpha))

# plot the RMSE curve according to alpha value
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)
ax1.plot(alpha_range, cv_ridge)
# y-limits are hand-tuned to this dataset; adjust or remove them if the curve is not visible
ax1.set_ylim(67812, 67813.2)
y_formatter = matplotlib.ticker.ScalarFormatter(useOffset=False)
ax1.yaxis.set_major_formatter(y_formatter)
plt.xlabel("alpha")
plt.ylabel("RMSE")

# ridge regression
model_ridge = Ridge(alpha = ridge_best_alpha)
model_ridge.fit(train_set_features, train_set_target)

# ridge RMSE
ridge_predicted = model_ridge.predict(test_set_features)
final_ridge_mse = mean_squared_error(test_set_target, ridge_predicted)
final_ridge_rmse = np.sqrt(final_ridge_mse)
print('final_ridge_RMSE : %.2f' %final_ridge_rmse)

# suppress warning messages
import warnings
warnings.filterwarnings(action = 'ignore')

## 5. Lasso Regression

In [None]:
########### To do : fill in the box with your Lasso code. ###############


