# Regression Challenge

The selling price of a residential property depends on a number of factors, including the property's age, the availability of local amenities, and its location.

In this challenge, you will use a dataset of real estate sales transactions to predict the price-per-unit of a property based on its features. The price-per-unit in this data is based on a unit measurement of 3.3 square meters.

> **Citation**: The data used in this exercise originates from the following study:
>
> *Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.*
>
> It was obtained from the UCI dataset repository (Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science).

## Review the data

Run the following cell to load the data and view the first few rows.

In [None]:
import pandas as pd

df = pd.read_csv('data/real_estate.csv')
df.head()

The data consists of the following variables:

- **transaction_date** - the transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.)
- **house_age** - the house age (in years)
- **transit_distance** - the distance to the nearest light rail station (in meters)
- **local_convenience_stores** - the number of convenience stores within walking distance
- **latitude** - the geographic coordinate, latitude
- **longitude** - the geographic coordinate, longitude
- **price_per_unit** - the house price per unit area (one unit is 3.3 square meters)
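
The **transaction_date** encoding can be unpacked with a little arithmetic. The sketch below assumes the fractional part is the month divided by 12, which is consistent with the two examples given (2013.250 is March, 2013.500 is June). The helper name `decode_transaction_date` is hypothetical, and the source does not say how a fraction of exactly 0 should be read, so treat that case with care.

```python
def decode_transaction_date(value):
    """Split a fractional-year value into (year, month_index).

    Assumption: fractional part = month / 12, matching the examples
    2013.250 -> March and 2013.500 -> June in the variable list.
    """
    year = int(value)
    # round() tolerates fractions stored to three decimals (e.g. 0.083 for 1/12)
    month_index = round((value - year) * 12)
    return year, month_index


print(decode_transaction_date(2013.250))  # (2013, 3)
print(decode_transaction_date(2013.500))  # (2013, 6)
```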

## Train a Regression Model

Your challenge is to explore and prepare the data, identify features that help predict the **price_per_unit** label, and train a regression model with the lowest Root Mean Square Error (RMSE) you can achieve when evaluated against a test subset of the data. The RMSE must be less than **7**.
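
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted labels. A minimal sketch with NumPy follows; the `rmse` helper and the three sample values are made up for illustration and do not come from the dataset.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Square Error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Illustrative values only
actual = [30.0, 40.0, 50.0]
predicted = [33.0, 36.0, 50.0]

# Errors are -3, 4 and 0, so the mean squared error is 25 / 3
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.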

Add markdown and code cells as required to create your solution.

> **Note**: There is no single "correct" solution. A sample solution is provided in [02 - Real Estate Regression Solution.ipynb](02%20-%20Real%20Estate%20Regression%20Solution.ipynb).

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Get the label column
label = df[df.columns[-1]]

# Create a figure for 2 subplots (2 rows, 1 column)
fig, ax = plt.subplots(2, 1, figsize = (9,12))


# plot the histogram

ax[0].hist(label, bins= 100)
ax[0].set_ylabel('Frequency')

# Add lines for the mean (magenta) and median (cyan)
ax[0].axvline(label.mean(), color= 'magenta', linestyle='dashed', linewidth=2)
ax[0].axvline(label.median(), color='cyan', linestyle='dashed', linewidth=2)

# plot the boxplot

ax[1].boxplot(label, vert= False)
ax[1].set_xlabel('label')

# Add a title to the figure

fig.suptitle('Label Distribution')

# Show the figure
plt.show()

## Remove outliers

In [None]:
df = df[df['price_per_unit'] < 70]
# Get the label column

label = df[df.columns[-1]]

# Create a figure for 2 subplots (2 rows, 1 column)
fig, ax = plt.subplots(2, 1, figsize = (9,12))


# plot the histogram

ax[0].hist(label, bins= 100)
ax[0].set_ylabel('Frequency')

# Add lines for the mean (magenta) and median (cyan)
ax[0].axvline(label.mean(), color= 'magenta', linestyle='dashed', linewidth=2)
ax[0].axvline(label.median(), color='cyan', linestyle='dashed', linewidth=2)

# plot the boxplot

ax[1].boxplot(label, vert= False)
ax[1].set_xlabel('label')

# Add a title to the figure

fig.suptitle('Label Distribution')

# Show the figure
plt.show()
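
Before filtering on a fixed cut-off, it can help to check how many rows the cut-off removes. The sketch below applies the same `< 70` condition to a small made-up frame; the `sample` values are illustrative assumptions, not rows from the real dataset.

```python
import pandas as pd

# Made-up values for illustration only
sample = pd.DataFrame({'price_per_unit': [25.0, 38.5, 42.1, 55.0, 78.3, 117.5]})

threshold = 70  # same cut-off as the cell above
mask = sample['price_per_unit'] < threshold

print('Rows kept:', int(mask.sum()))
print('Rows removed:', int((~mask).sum()))

# Apply the filter once the counts look reasonable
filtered = sample[mask]
```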


## View numeric correlations

In [None]:
for col in df[df.columns[:-1]]:
    fig = plt.figure(figsize= (9, 16)) # Aspect ratio
    ax = fig.gca()
    feature = df[col]
    correlation = feature.corr(label)
    plt.scatter(x=feature, y=label)
    plt.xlabel(col)
    plt.ylabel('price_per_unit')
    ax.set_title('Label vs ' + col + ' - correlation: ' + str(correlation))
plt.show()
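
The per-feature correlations shown in the plot titles can also be collected into a single table. This is a sketch on a made-up `toy` frame that reuses the dataset's column names; `DataFrame.corrwith` computes the Pearson correlation of each column with the label by default.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'transit_distance': [100.0, 500.0, 1000.0, 2000.0, 4000.0],
    'local_convenience_stores': [9, 6, 5, 2, 0],
    'price_per_unit': [55.0, 45.0, 40.0, 30.0, 15.0],
})

# Correlate every feature column with the label, then sort for readability
correlations = (
    toy.drop(columns='price_per_unit')
       .corrwith(toy['price_per_unit'])
       .sort_values()
)
print(correlations)
```

Values near -1 or 1 suggest a strong linear relationship with the label, while values near 0 suggest a weak one. Non-linear relationships may not show up in this statistic.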

## View categorical features

(**transaction_date** and **local_convenience_stores** take discrete values, so they might work better if treated as categorical features.)


In [None]:
# plot a boxplot for the label by each categorical feature

for col in df[['transaction_date', 'local_convenience_stores']]:
    fig = plt.figure(figsize= (9, 16))
    ax = fig.gca()
    df.boxplot(column= 'price_per_unit', by = col, ax = ax)
    ax.set_title('label by '+ col)
    ax.set_ylabel("Label Distribution by Categorical Variable")
plt.show()
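
A numeric summary per category can complement the box plots. The sketch below groups a made-up `toy` frame by **local_convenience_stores** and reports the row count and median label for each value; the data is illustrative only.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'local_convenience_stores': [0, 0, 5, 5, 9],
    'price_per_unit': [20.0, 24.0, 40.0, 44.0, 55.0],
})

# Count and median of the label for each category value
summary = (
    toy.groupby('local_convenience_stores')['price_per_unit']
       .agg(['count', 'median'])
)
print(summary)
```

Categories with very few rows produce unreliable medians, so the count column is worth checking alongside the plot.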

## Separate features and label and split data for training and validation

(**transaction_date** doesn't seem to be very predictive, so it is omitted from the features.)


In [None]:
from sklearn.model_selection import train_test_split
# Separate features (columns 1 [house_age] to the last but one) and labels (the last column)

X, y = df[df.columns[1:-1]].values, df[df.columns[-1]].values
# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

print('Training Set: %d rows\nTest Set: %d rows' % (X_train.shape[0], X_test.shape[0]))

## Preprocess the data and train a model in a pipeline

Standardize the numeric features (scale them to zero mean and unit variance), then use a RandomForestRegressor to train a model.


In [None]:
# Train the model
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Define preprocessing for numeric columns (scale them)
numeric_features = [0,1,3,4]
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

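# Combine preprocessing steps
# Note: ColumnTransformer drops unlisted columns by default (remainder='drop'),
# so column 2 (local_convenience_stores) does not reach the regressor as written.
# Pass remainder='passthrough' to keep it unscaled.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
    ])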

# Create preprocessing and training pipeline

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', RandomForestRegressor())])


# Fit the pipeline to train a random forest regression model on the training set
model = pipeline.fit(X_train, y_train)
print(model)
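
The pipeline above lists only columns 0, 1, 3 and 4 in its `ColumnTransformer`. The sketch below shows what that means for the remaining column, using two made-up rows in the same column order as `X`. The names `X_demo`, `drop_ct` and `keep_ct` are illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Two made-up rows: house_age, transit_distance, stores, latitude, longitude
X_demo = np.array([[16.2, 289.3, 5, 24.98, 121.54],
                   [13.6, 4082.0, 0, 24.94, 121.50]])

# Default behaviour: columns not listed are dropped
drop_ct = ColumnTransformer([('num', StandardScaler(), [0, 1, 3, 4])])

# remainder='passthrough' appends the unlisted columns, unchanged, at the end
keep_ct = ColumnTransformer([('num', StandardScaler(), [0, 1, 3, 4])],
                            remainder='passthrough')

print(drop_ct.fit_transform(X_demo).shape)  # 4 columns
print(keep_ct.fit_transform(X_demo).shape)  # 5 columns
```

Whether keeping the extra column improves the model is an empirical question, so compare the test RMSE both ways.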

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score, mean_squared_error
%matplotlib inline

prediction = model.predict(X_test)

mse = mean_squared_error(y_test, prediction)
print("MSE", mse)

rmse = np.sqrt(mse)
print("RMSE", rmse)

r2 = r2_score(y_test, prediction)
print("R2", r2)

plt.scatter(y_test, prediction)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Predictions vs Actuals')

z = np.polyfit(y_test, prediction, 1)  # Fit a degree-1 trend line (drawn in red)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='red')
plt.show()
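
To put the metrics in context, it can help to compute R2 by hand and compare the model against a baseline that always predicts the mean label. The values below are made up for illustration. R2 is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares, which is the definition `r2_score` uses.

```python
import numpy as np

# Illustrative values only
y_true = np.array([30.0, 40.0, 50.0])
y_hat = np.array([33.0, 36.0, 50.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares: 25
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 200
r2_manual = 1 - ss_res / ss_tot
print('R2:', r2_manual)

# Baseline: always predict the mean label (this has R2 = 0 by construction)
baseline_rmse = float(np.sqrt(np.mean((y_true - y_true.mean()) ** 2)))
model_rmse = float(np.sqrt(np.mean((y_true - y_hat) ** 2)))
print('Baseline RMSE:', baseline_rmse)
print('Model RMSE:', model_rmse)
```

A model whose RMSE is not clearly below the baseline RMSE has learned little from the features.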

## Use the trained Model

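Save your trained model, and then use it to predict the price-per-unit for the following real estate transactions (feature values as passed to the model in the cell below):

| house_age | transit_distance | local_convenience_stores | latitude | longitude |
| --------- | ---------------- | ------------------------ | -------- | --------- |
| 16.2 | 289.3248 | 5 | 24.98203 | 121.54348 |
| 13.6 | 4082.015 | 0 | 24.94155 | 121.5038 |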

In [None]:
import joblib

filename = "./real_estate_model.pkl"
joblib.dump(model, filename)

loaded_model = joblib.load(filename)

X_new = np.array([[16.2,289.3248,5,24.98203,121.54348],
                  [13.6,4082.015,0,24.94155,121.5038]])
results = loaded_model.predict(X_new)
print('Predictions:')
for prediction in results:
    print(round(prediction,2))