#  Real Estate Valuation

This is a real estate multivariate regression problem. We'll be going through the "checklist" defined in Appendix B of the book [Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291). I'll be explicitly answering all the questions and going through each step to improve my learning, even though a lot of the questions don't make sense or apply super well when the process is done purely for learning.

I'll preface answers with

> Pretend

if I'm making up an answer for this example.

In [None]:
# allows our matplotlib graphs to be displayed inline
%matplotlib inline

In [None]:
SEED = 42

In [None]:
import urllib.request # for fetching our raw data from the web
import pandas as pd # for easily manipulating our data
from pandas.plotting import scatter_matrix # for comparing all independent and dependent variables against each other
import seaborn as sns # for pretty graphs
import matplotlib.pyplot as plt # to stop graphs from plotting over one another
from scipy.stats import shapiro # for testing for normality
import statsmodels.api as sm # for making QQ plots
from IPython.display import Image, display # for displaying a local image, as the markdown way does not work with sibling folders
from sklearn.preprocessing import RobustScaler # for scaling our numerical independent variables
from sklearn.model_selection import train_test_split # for train test split
from sklearn.linear_model import LinearRegression # for linear regression
from sklearn.ensemble import RandomForestRegressor # for random forest regressor
from sklearn.metrics import mean_absolute_error # for MAE
from keras.models import Sequential # for neural network
from keras.wrappers.scikit_learn import KerasRegressor # for keras regressor
from keras.layers import Dense # for layers
from sklearn.model_selection import cross_val_score, KFold # for crossvalidation

## Part A: Frame the Problem and Look at the Big Picture
 
1. Define the objective in business terms.

(Pretend) We want to predict real estate valuations in New Taipei City, Taiwan to increase our ability to effectively bid on lots.

2. How will your solution be used?

(Pretend) Our solution will be used to ensure our bids are accurate, maximizing profit.

3. What are the current solutions/workarounds (if any)?

(Pretend) None.

4. How should you frame this problem (supervised/unsupervised, online/offline, etc.)?

Supervised. The data sets have house_price_of_unit_area attached. 

Offline. All the data that existed has been already gathered and all the training will be done at once.

5. How should performance be measured?

When deciding which performance metric to use, we have a lot of options to pick from.

1. Mean Squared Error (MSE)
2. Root Mean Squared Error (RMSE)
3. Mean Absolute Error (MAE)
4. R Squared (R²)
5. Adjusted R Squared ($\bar{R}^2$)
6. Mean Square Percentage Error (MSPE)
7. Mean Absolute Percentage Error (MAPE)
8. Root Mean Squared Logarithmic Error (RMSLE)

I have decided that we'll use MAE:

$\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}|{y_{i}-\hat{y}_{i}}|$

I've decided to use MAE, as I believe we'll have some outliers in our data set. 

I used this [article](https://towardsdatascience.com/how-to-select-the-right-evaluation-metric-for-machine-learning-models-part-1-regrression-metrics-3606e25beae0) to help decide what would be a proper performance measurement.
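To make the outlier argument concrete, here is a minimal sketch with made-up numbers (the prices are illustrative and not taken from the data set; the helper names are my own). It computes MAE and MSE by hand, first when every prediction is off by 2 and then when a single prediction misses by 30.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """MAE: the average of |y_i - y_hat_i|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error_manual(y_true, y_pred):
    """MSE: the average of (y_i - y_hat_i)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# hypothetical prices per unit area, not taken from the data set
y_true = [30, 40, 50, 60]
y_close = [32, 38, 52, 58]     # every prediction is off by 2
y_outlier = [32, 38, 52, 90]   # the last prediction is off by 30

mae_close = mean_absolute_error_manual(y_true, y_close)
mae_outlier = mean_absolute_error_manual(y_true, y_outlier)
mse_close = mean_squared_error_manual(y_true, y_close)
mse_outlier = mean_squared_error_manual(y_true, y_outlier)

print(f"MAE: {mae_close} -> {mae_outlier}")
print(f"MSE: {mse_close} -> {mse_outlier}")
```

In this toy case the single large miss multiplies the MAE by 4.5 (2.0 to 9.0) but the MSE by 57 (4.0 to 228.0), which is why MAE is the gentler choice when outliers are expected.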

6. Is the performance measure aligned with the business objective?

Yes. Since we're predicting the value of houses, we'll be able to look at our error function's output and compare different models against each other, knowing we can treat an MAE of 10 as exactly twice as bad as an MAE of 5. This works very well with the financial aspect of what we're modeling here.

7. What would be the minimum performance needed to reach the business objective?

(Pretend) We want to be within 25% of the valuation of a house, so we can see if it is worth gathering more data to try and further refine our model.
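One way this target could be checked is sketched below. It assumes "within 25%" means that a single prediction's absolute percentage error is at most 0.25; the function name and the numbers are hypothetical.

```python
def fraction_within_tolerance(y_true, y_pred, tolerance=0.25):
    """Share of predictions whose absolute percentage error is at most `tolerance`."""
    hits = [abs(t - p) / abs(t) <= tolerance for t, p in zip(y_true, y_pred)]
    return sum(hits) / len(hits)


# hypothetical valuations and predictions, for illustration only
y_true = [40.0, 20.0, 50.0, 10.0]
y_pred = [45.0, 26.0, 38.0, 12.0]

share = fraction_within_tolerance(y_true, y_pred)
print(f"{share:.0%} of predictions are within 25% of the true valuation")
```

Here three of the four predictions land inside the tolerance; the second one misses by 30%.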

8. What are comparable problems? Can you reuse experience or tools?

(Pretend) Our company has had no comparable problems, nor can we reuse experience or tools.

9. Is human expertise available?

(Pretend) No.

10. How would you solve the problem manually?

I'd attempt to solve this problem manually by looking at the houses that sold for the most and the houses that sold for the least, and looking for patterns that may allude to what would affect each house's price.

11. List the assumptions you (or others) have made so far.

While I have no assumptions regarding the dataset as a whole, I'll be listing column-specific assumptions in Part C.

12. Verify assumptions if possible.

Will do in Part C.

## Part B: Get the Data

_Note: automate as much as possible so you can easily get fresh data._

1. List the data you need and how much you need.

We'll just use the dataset we found before starting this project. If we were to answer this without acknowledging the dataset we've already found, we'd want as many rows as possible, where the columns would be potential variables that contribute to the valuation of a house.

2. Find and document where you can get that data.

We can get this data at this [location](https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set).

3. Check how much space it will take.

32 KB.

4. Check legal obligations, and get authorization if necessary.

UCI's data is free.

5. Get access authorizations.

Not needed.

6. Create a workspace (with enough storage space).

Done using [Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/).

7. Get the data.

Done below.

8. Convert the data to a format you can easily manipulate (without changing the data itself).

Not needed.

9. Ensure sensitive information is deleted or protected (e.g., anonymized).

Not needed.

10. Check the size and type of data (time series, sample, geographical, etc.).

I'm not sure what to call the kind of data we're working with. Each row represents the action of selling a house, how much it sold for and a few other potentially related measurements.

11. Sample a test set, put it aside, and never look at it (no data snooping!).

We'll do this later, as we have cleaning to perform in the next section.

In [None]:
# Where we should store our raw housing data
STORE_RAW_HOUSING_DATA_DESTINATION_PATH = "../data/raw/real_estate_valuation_data_set.xlsx"

In [None]:
def fetch_raw_housing_data(STORE_RAW_HOUSING_DATA_DESTINATION_PATH: str = STORE_RAW_HOUSING_DATA_DESTINATION_PATH):
    """Fetches our raw housing data.
    
    """
    # Where we can fetch our raw data from
    RAW_DATA_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00477/Real%20estate%20valuation%20data%20set.xlsx"
    
    urllib.request.urlretrieve(RAW_DATA_URL, STORE_RAW_HOUSING_DATA_DESTINATION_PATH)
    
fetch_raw_housing_data()

In [None]:
def load_raw_housing_data(STORE_RAW_HOUSING_DATA_DESTINATION_PATH: str = STORE_RAW_HOUSING_DATA_DESTINATION_PATH) -> pd.DataFrame:
    """Loads the raw housing data we've previously fetched
    
    """
    
    names = ["No", "transaction_date", "house_age", "distance_to_the_nearest_MRT_station", "number_of_convenience_stores", "latitude", "longitude", "house_price_of_unit_area"]
        
    # creating our base data frame
    # I know there needs to be some cleaning on the `transaction_date` independent variable.
    # We could have done that here using the `parse_dates` keyword argument together with a custom parser, but we'll handle that later
    df = pd.read_excel(STORE_RAW_HOUSING_DATA_DESTINATION_PATH, names=names, index_col="No")

    return df

df = load_raw_housing_data()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.head()

We have successfully gathered the data we need.

## Part C: Explore the Data

_Note: try to get insights from a field expert for these steps._

1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary).

The data is not big enough for me to consider this necessary. 

2. Create a Jupyter notebook to keep a record of your data exploration.

We'll be doing that in this notebook.

3. Study each attribute and its characteristics:

  - Name
  - Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
  - % of missing values
  - Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
  - Possibly useful for the task?
  - Type of distribution (Gaussian, uniform, logarithmic, etc.)
  
Will do on a column-by-column basis. Skipping the noisiness analysis and type of distribution. I rolled "Possibly useful" into a new bullet I called "assumptions".

4. For supervised learning tasks, identify the target attribute(s).

The target attribute is `house_price_of_unit_area`.

5. Visualize the data.

We'll do this column by column and then a big scatter matrix.

6. Study the correlations between attributes.

We'll evaluate the correlations visually and by using Pearson's R.

7. Study how you would solve the problem manually.

I believe we went over this prior, but I would solve the problem manually by looking at the extremes of the data. I'd look at the most expensive houses and the least expensive houses and attempt to explain why those houses were cheaper or more expensive. Just to be clear, I would not want to look at outliers specifically but rather the non-outliers on each end of the valuation.

8. Identify the promising transformations you may want to apply.

I plan to clean `transaction_date`, so it's easier to query. I'm not exactly sure what format I'll use, but I'll go with whatever Pandas suggests.

9. Identify extra data that would be useful (go back to “Get the Data”).

It would be useful to get:

- number of bathrooms
- number of bedrooms
- is there an attic
- is there a basement
- is there a garage
- amount of land
- crime in the area
- quality of education in the area

Map of the location our data is sampled from.

In [None]:
display(Image(filename="../references/city_map.png"))

## All Columns

First we'll do a very quick high-level overview of all the columns together, before diving into each column individually. In a large data set, we'd use this section to help decide which columns to look at individually. Since this dataset is rather small, I'll take the opportunity to explore all columns.

## Scatter Matrix

We'll create a single graphic of all columns against all other columns.

In [None]:
scatter_matrix(df, alpha=0.6, figsize=(30, 30), diagonal='hist')

I like to look at the scatter matrix when we don't have a lot of columns, just to look at the big picture in one big graph. This generally isn't the best way to check for correlation between two columns, as it's only a visual check, and we'll look at each independent column against the dependent column later on anyway. Instead we'll be using Pearson's R, also known as the standard correlation coefficient.

### Correlation Matrix

In [None]:
corr_matrix = df.corr()
corr_matrix["house_price_of_unit_area"].sort_values(ascending=False)

The correlation coefficients range from $-1$ to $1$. Close to $1$ tells us there's a strong positive correlation, while close to $-1$ means there's a strong negative correlation, and close to $0$ means no correlation.
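As a rough illustration of what this coefficient measures, the sketch below computes Pearson's R by hand on made-up values (the function name and the numbers are my own, not from the data set). The second example is a reminder that a value close to $0$ only rules out a linear relationship.

```python
import math


def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# made-up values, for illustration only
distance = [100, 200, 300, 400, 500]
price_falling = [50, 40, 30, 20, 10]   # falls in a straight line as distance grows
price_u_shape = [30, 15, 10, 15, 30]   # clearly related to distance, but not linearly

r_falling = pearson_r(distance, price_falling)
r_u_shape = pearson_r(distance, price_u_shape)

print(r_falling)
print(r_u_shape)
```

The straight-line case comes out at (approximately) $-1$, while the U-shaped case comes out at $0$ even though the two columns are obviously related. That matters for any column where we suspect a curved relationship.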

We see here that `number_of_convenience_stores`, `latitude` and `longitude` look to have a strong positive correlation, while I'd say `transaction_date` has no correlation.

`house_age` has a weak negative correlation and `distance_to_the_nearest_MRT_station` has a strong negative correlation. 

### Correlation Heatmap

In [None]:
top_corr_features = corr_matrix.index
plt.figure(figsize=(15, 15))

g=sns.heatmap(df[top_corr_features].corr(), annot=True, cmap="RdYlGn")

This visualization makes it easy to just look at the row of `house_price_of_unit_area`, and use the color coding to get a quick idea of the correlations.

### Visualize Data Geographically

I believe it'd be very useful to superimpose this graph on a map of the area the data comes from, and to color the dots on a gradient based on their `house_price_of_unit_area`.

In [None]:
df.plot(kind="scatter", x="longitude", y="latitude", alpha=0.2)
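A minimal sketch of the color gradient idea, using a tiny hypothetical data frame in place of `df` (the coordinates and prices below are invented). It does not attempt the map overlay.

```python
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical coordinates and prices standing in for the real data frame
sample = pd.DataFrame({
    "longitude": [121.50, 121.52, 121.54, 121.56],
    "latitude": [24.95, 24.97, 24.98, 24.96],
    "house_price_of_unit_area": [20.0, 35.0, 50.0, 42.0],
})

fig, ax = plt.subplots(figsize=(8, 6))
points = ax.scatter(
    sample["longitude"],
    sample["latitude"],
    c=sample["house_price_of_unit_area"],  # color encodes the target
    cmap="viridis",
    alpha=0.8,
)
fig.colorbar(points, ax=ax, label="house_price_of_unit_area")
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
```

Swapping `sample` for `df` should give the same plot for the full data set, assuming the column names used in this notebook.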

## Each Column

We'll address each column, starting with the information from the [website](https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set) we got the data from and then moving into any observations we have ourselves regarding the data. Then we'll use the `describe` function to get a general feel for the data, along with a violin plot, followed by the Shapiro-Wilk test for normality. This test was chosen because we have a small number of samples, which suits it well.

Additionally, we're going to look for outliers across every column. To do this, I was going to leverage the standard score, also known as the z-score, which measures how many standard deviations a value lies from the mean. Unfortunately this score does not make much sense when the data you're working with is not Gaussian, which is our case.
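The sketch below hints at why z-scores can mislead on skewed data. The distances are made up and the function name is my own; the threshold of 3 is a common rule of thumb, not something this notebook prescribes.

```python
import statistics


def z_scores(values):
    """Number of standard deviations each value lies from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


# made-up, heavily right-skewed distances, for illustration only
distances = [100, 120, 150, 180, 200, 250, 300, 6000]
scores = z_scores(distances)

print([round(s, 2) for s in scores])
```

The extreme value drags both the mean and the standard deviation upward, so even 6000 scores below 3 and would not be flagged, while every ordinary value ends up with a negative score.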

With that in mind, I'll be leveraging a box-and-whisker plot to visualize outliers and use a rule like

> An outlier is any value more than $1.5$ times the interquartile range below the first quartile or above the third quartile

_Note: The assumptions were made prior to running the code._

In [None]:
def violin_plot(series: pd.Series):
    """Displays a violin plot for a single Series
    
    """
    sns.violinplot(x=series).set_title(f"Violin Plot of column {series.name}") # pass the Series as x explicitly, matching box_plot below
    plt.figure() # ensures this graph does not plot over another graph

In [None]:
def evaluate_missing(series: pd.Series):
    """Displays how many rows are missing for a single Series
    
    """
    print(f"{series.isna().sum()} missing")

In [None]:
def box_plot(series: pd.Series):
    """Displays a Boxplot for a single Series
    
    """
    sns.boxplot(x = series).set_title(f"Boxplot of column {series.name}")
    plt.figure() # ensures this graph does not plot over another graph

In [None]:
def outliers_via_iqr(series: pd.Series):
    """Displays number of outliers using the IQR, for a single Series
    
    """
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    print(f"IQR: {IQR}")
    
    outliers = list(filter(lambda row: (row < (Q1 - 1.5 * IQR)) | (row > (Q3 + 1.5 * IQR)), series))
    print(f"found {len(outliers)} outliers")

In [None]:
def evaluate_outliers(series: pd.Series):
    """Displays how many outliers for a given Series, visually and mathematically.
    """
    box_plot(series)
    outliers_via_iqr(series)

In [None]:
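def qq_plot(series: pd.Series):
    """Displays a QQ plot for a single Series
    
    """
    # visual normality test
    # fit=True standardizes the sample first, so the 45-degree reference line is meaningful
    fig = sm.qqplot(series, fit=True, line='45') # qqplot returns a matplotlib Figure
    fig.suptitle(f"Q-Q Plot of column {series.name}")
    plt.figure() # ensures this graph does not plot over another graph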

In [None]:
def shapiro_wilk(series: pd.Series, alpha: float = 0.05):
    """Performs the Shapiro-Wilk test, a mathematical test for normality.
    This test is specifically interested in the tails of a distribution and 
    should not be used with large datasets.
    
    """
    stat, p = shapiro(series)
    
    print(f"\nShapiro-Wilk stat {stat} p {p}")
    
    if p > alpha: 
        print('Sample looks Gaussian (fail to reject H0)')
    else:
        print('Sample does not look Gaussian (reject H0)')

In [None]:
def test_for_normality(series: pd.Series):
    """Performs a series of tests for normality. Both visual and mathematical.
    
    """
    qq_plot(series)
    shapiro_wilk(series)

In [None]:
def scatter_plot(x: pd.Series, y: pd.Series):
    """Plots a scatter plot given an X and Y series
    
    """
    sns.scatterplot(x=x, y=y, alpha=0.8).set_title(f"Scatter Plot of {x.name} against {y.name}")
    plt.figure() # ensures this graph does not plot over another graph

In [None]:
def regression_line(x: pd.Series, y: pd.Series):
    """Plots a regression line, showing a linear relationship between 
    X and Y, if any.
    
    """
    
    #sns.regplot(x=x, y=y).set_title(f"Regression Line Plot of {x.name} against {y.name}") # a different regression style
    sns.jointplot(x=x, y=y, kind="reg") # "reg" adds a fitted regression line to the joint plot
    plt.figure() # ensures this graph does not plot over another graph

In [None]:
def high_level_overview(independent: pd.Series, dependent: pd.Series = None):
    """Gives a high level overview of a single Pandas column.
    
    """
    print(independent.describe())
    violin_plot(independent)
    evaluate_missing(independent)
    evaluate_outliers(independent)
    test_for_normality(independent)
    
    if dependent is not None:
        scatter_plot(independent, dependent)
        regression_line(independent, dependent)

### Independent Variable `transaction_date`

> the transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.)

In [None]:
high_level_overview(df["transaction_date"], df["house_price_of_unit_area"])

- Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
  - date
- Assumptions
  - house_price_of_unit_area and transaction date will have a positive correlation. 
    - I believe as time has gone on, the value of all houses has gone up. This might be due to a bigger population, so demand has gone up and supply has not. I'm drawing from what I know in general in the US, which may not apply to Taiwan.

### Independent Variable `house_age`

> the house age (unit: year)

In [None]:
high_level_overview(df["house_age"], df["house_price_of_unit_area"])

- Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
  - float
- Assumptions
  - `house_age` and `house_price_of_unit_area` will have a negative correlation
    - I believe that newer houses will be worth more. Again, my assumptions are based on what I know of the US. In the US, houses have gotten bigger throughout time. I imagine this trend might exist elsewhere in the world as well.

### Independent Variable `distance_to_the_nearest_MRT_station`

> the distance to the nearest MRT station (unit: meter)

Where MRT = Mass Rapid Transit.

In [None]:
high_level_overview(df["distance_to_the_nearest_MRT_station"], df["house_price_of_unit_area"])

- Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
  - float
- Assumptions
  - `distance_to_the_nearest_MRT_station` and `house_price_of_unit_area` will have a negative correlation
    - I believe the distance to an MRT station will be a good indicator of how urban or rural a house is, where a smaller measurement will indicate more urban, fetching a higher `house_price_of_unit_area`.

### Independent Variable `number_of_convenience_stores`

> the number of convenience stores in the living circle on foot (integer)

In [None]:
high_level_overview(df["number_of_convenience_stores"], df["house_price_of_unit_area"])

- Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
  - int
- Assumptions
  - `number_of_convenience_stores` and `house_price_of_unit_area` will have a positive correlation
    - Similar to `distance_to_the_nearest_MRT_station`, I believe this will be another good measurement of how urban a house is. Again, a larger measurement here will indicate more urban and therefore fetch a higher `house_price_of_unit_area`.

### Independent Variable `latitude`

> the geographic coordinate, latitude. (unit: degree)

In [None]:
high_level_overview(df["latitude"], df["house_price_of_unit_area"])

- Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
  - float
- Assumptions
  - `latitude` and `house_price_of_unit_area` will have a parabolic correlation
    - Since New Taipei City has water on its East and West, I believe the smallest `latitude` and largest `latitude` values will equate to larger `house_price_of_unit_area` due to waterfront properties.

### Independent Variable `longitude`

> the geographic coordinate, longitude. (unit: degree)

In [None]:
high_level_overview(df["longitude"], df["house_price_of_unit_area"])

- Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
  - float
- Assumptions
  - `longitude` and `house_price_of_unit_area` will have a positive correlation
    - New Taipei City is a city bordered by water to the north, so I believe a higher `latitude` will relate to a more coastal city, resulting in a larger `house_price_of_unit_area`.

### Dependent Variable `house_price_of_unit_area`

> house price of unit area (10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 meter squared) 

In [None]:
high_level_overview(df["house_price_of_unit_area"])

Dependent variable.

- Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
  - float

## Part D: Prepare the Data

Notes:
  - Work on copies of the data (keep the original dataset intact).
  - Write functions for all data transformations you apply, for five reasons:
    1. So you can easily prepare the data the next time you get a fresh dataset
    2. So you can apply these transformations in future projects
    3. To clean and prepare the test set
    4. To clean and prepare new data instances once your solution is live
    5. To make it easy to treat your preparation choices as hyperparameters
    
1. Data cleaning:
  - Fix or remove outliers (optional).
    - I have read that this is a big no no and I will not proceed with removing or "fixing" outliers.
  - Fill in missing values (e.g., with zero, mean, median...) or drop their rows (or columns).
2. Feature selection(optional):
  - Drop the attributes that provide no useful information for the task.
3. Feature engineering, where appropriate:
  - Discretize continuous features.
  - Decompose features (e.g., categorical, date/time, etc.).
  - Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
  - Aggregate features into promising new features.
4. Feature scaling: standardize or normalize features.

First we'll start by creating a copy of the data, so we can keep the original intact.

In [None]:
df_clean = df.copy()

### Data Cleaning

#### Outliers

We will not be removing any outliers or "fixing" them. I have no sources or proof that any of these outliers are bad data.

#### Missing Values

No columns have any missing data, so no data imputation techniques will be used.

#### Fixing D-Types

`transaction_date` has a `float64` dtype, where it could have a `datetime` dtype. Pandas readers such as `pandas.read_csv` accept a `parse_dates` argument, but since this encoding is unusual we'll write our own parser and apply it to the column.

- https://stackoverflow.com/questions/21269399/datetime-dtypes-in-pandas-read-csv
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html#pandas.to_datetime
- https://stackoverflow.com/questions/23797491/parse-dates-in-pandas

We leverage the [datetime.strptime format codes](https://docs.python.org/3.4/library/datetime.html#strftime-strptime-behavior) to make the parsing incredibly easy.

In [None]:
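def month_decimal_to_index(s: str) -> int:
    """Converts the fraction-of-a-year encoding this data set uses to an actual month.
    EX: '2013.250' -> 3, for March, the third month of the year
    
    """
    MONTHS_PER_YEAR = 12
    fraction = float('.' + s.split('.')[1]) # we reattach the '.' we split on, as it is a decimal and we want to use it
    # rounding absorbs the three-decimal truncation in the data (e.g. .083 * 12 = 0.996 -> 1)
    # a fraction of .000 still maps to month 1, as before; whether the data set means
    # January or the preceding December there is not documented
    return max(1, round(fraction * MONTHS_PER_YEAR))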

In [None]:
def weird_date_to_normalized_year_month(weird_date_format: float):
    month_as_int = month_decimal_to_index(str(weird_date_format))
    
    return str(weird_date_format).split('.')[0] + '.' + str(month_as_int)

In [None]:
df_clean["transaction_date"] = df_clean["transaction_date"].apply(weird_date_to_normalized_year_month)

At this point, our column has been converted from the strange fraction of the year to an actual month, where $1$ represents January and $12$ represents December. We'll now parse these values into a new dtype.

In [None]:
from datetime import datetime # `pd.datetime` is no longer available in recent pandas releases

df_clean["transaction_date"] = df_clean["transaction_date"].apply(lambda date: datetime.strptime(str(date), "%Y.%m"))

In [None]:
df_clean.info()

And now Pandas knows our `transaction_date` is a `datetime64` dtype. 

It's important to note that the original data did not encode the day of the month, nor the time, so I assumed the 1st of each month. Perhaps their fraction was intended to encode the day, hour, minute and second, but since they did not show it in their examples, I will not try to decode this.

The last step we have is converting from `datetime64` to `datetime`. We do this because of [this SO post](https://stackoverflow.com/a/49758140/1983957).

In [None]:
df_clean["transaction_date"] = df_clean["transaction_date"].dt.to_pydatetime()
# df_clean["transaction_date"] = df_clean["transaction_date"].dt.date
# df_clean["transaction_date"].astype("datetime")

In [None]:
df_clean.info()

In [None]:
df_clean.head()

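TODO figure out why the above isn't converting the dtype to `datetime`. Most likely this is because pandas stores datetime columns as `datetime64[ns]` and converts the `datetime` objects back to that dtype when they are assigned to the column.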

### Feature Engineering

We're going to experiment with creating our own columns based on other columns in the dataset.

### `urbanness`

In [None]:
df_clean["urbanness"] = df_clean["number_of_convenience_stores"] / df_clean["distance_to_the_nearest_MRT_station"] # work on the copy, not the original df

In [None]:
df_clean.head()

Now we'll look at the correlation matrix again.

In [None]:
corr_matrix = df_clean.corr()
corr_matrix["house_price_of_unit_area"].sort_values(ascending=False)

We can see that our `urbanness` column has a positive correlation with `house_price_of_unit_area`, which was what we predicted in our initial EDA.

### `transaction_year`

Now that we have our `transaction_date` cleaned, we can extract the `transaction_year` easily.

In [None]:
df_clean["transaction_year"] = df_clean["transaction_date"].map(lambda x:  x.year) # trying to use https://stackoverflow.com/a/25146337/1983957

In [None]:
df_clean.info()

In [None]:
df_clean.head()

### `transaction_month`
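This section is still empty. A minimal sketch of how the month could be extracted is shown below, assuming the column holds a `datetime64` dtype as it does after the cleaning above. The small data frame is hypothetical and stands in for `df_clean`.

```python
import pandas as pd

# hypothetical dates standing in for the cleaned `transaction_date` column
dates = pd.DataFrame({"transaction_date": pd.to_datetime(["2012-11-01", "2013-03-01", "2013-06-01"])})

# the `.dt` accessor exposes the month as an integer from 1 to 12
dates["transaction_month"] = dates["transaction_date"].dt.month

print(dates)
```

The same `.dt` accessor also offers `.dt.year`, which would be an alternative to the `map` call used for `transaction_year`.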

### Feature Scaling

Generally, models don't work particularly well when numerical independent variables are on different scales from each other. Independent variables with a larger scale can overshadow those with a smaller scale in importance. So one technique is to put all our numerical variables on the same scale, giving each of them the same opportunity to be important. We have many options for scaling our independent variables:

- "standard scaler"
- "min-max scaler"
- "robust scaler"
- "normalizer"

While I originally planned on using the standard scaler, we've already demonstrated that our independent columns are not Gaussian, so we will skip this technique. The next technique, the min-max scaler, seems to be the most widely used, but the problem is that our independent variables have outliers, as we previously demonstrated. That brings us to our third technique, the robust scaler, which is more resistant to outliers. For that reason, we will be using the robust scaler to scale our numerical independent variables.

Used [this link](http://benalexkeen.com/feature-scaling-with-scikit-learn/) to learn about scaling. [This link](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html) demonstrates what can happen by using different scalers.
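To show the difference on a small scale, the sketch below applies the two formulas by hand to a made-up feature with one extreme value. The robust version subtracts the median and divides by the interquartile range, which is my understanding of what `RobustScaler` does with its default settings; the function names are my own.

```python
import numpy as np


def standard_scale(values):
    """(x - mean) / standard deviation."""
    return (values - values.mean()) / values.std()


def robust_scale(values):
    """(x - median) / interquartile range."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return (values - median) / (q3 - q1)


# made-up feature with a single extreme value, for illustration only
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0])

standard = standard_scale(feature)
robust = robust_scale(feature)

print(np.round(standard, 2))
print(np.round(robust, 2))
```

With standard scaling the eight ordinary values get squeezed into a very narrow band because the outlier inflates the mean and standard deviation. With robust scaling they keep a usable spread from $-1$ to $0.75$, and only the outlier itself lands far away.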

Before we scale, let's re-examine the information regarding our independent variables.

In [None]:
df_clean.describe()

We'll look at this exact output again after scaling.

In [None]:
X = df_clean[["number_of_convenience_stores", "latitude", "longitude", "urbanness", "house_age", "distance_to_the_nearest_MRT_station"]] # we had to exclude our date column
Y = df_clean[["house_price_of_unit_area"]]

In [None]:
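scaler = RobustScaler().fit(X)
# the scaler is already fitted above, so we only transform here
# note: `X` itself still holds the unscaled values; only df_clean is updated
df_clean[["number_of_convenience_stores", "latitude", "longitude", "urbanness", "house_age", "distance_to_the_nearest_MRT_station"]] = scaler.transform(X)
df_clean.head()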

In [None]:
df_clean.describe()

### Feature Selection

Why do we perform feature selection? Three benefits of performing feature selection before modeling your data are:

1. Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
2. Improves Accuracy: Less misleading data means modeling accuracy improves.
3. Reduces Training Time: Less data means that algorithms train faster.

Additionally, there are four different techniques for performing feature selection:


1. Forward Selection: The procedure starts with an empty set of features [reduced set]. The best of the original features is determined and added to the reduced set. At each subsequent iteration, the best of the remaining original attributes is added to the set.
2. Backward Elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Recursive Feature Elimination: Recursive feature elimination (RFE) performs a greedy search to find the best performing feature subset. It iteratively creates models and determines the best or the worst performing feature at each iteration. It constructs the subsequent models with the remaining features until all the features are explored. It then ranks the features based on the order of their elimination. In the worst case, if a dataset contains $N$ features, RFE will do a greedy search for $2^N$ combinations of features.
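As an illustration of the fourth technique, here is a minimal sketch using scikit-learn's `RFE` on synthetic data. The data, the choice of `LinearRegression` as the estimator, and the number of features to keep are all assumptions made for this example, not choices this notebook has made.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# synthetic data: only features 0 and 3 drive the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# repeatedly fit the model and drop the feature with the smallest coefficient
selector = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
selector.fit(X_demo, y_demo)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 marks a kept feature; larger numbers were dropped earlier
```

With `step=1` the selector removes one feature per round, so it fits the model only a handful of times here rather than trying every subset.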



We'll address each column and determine if any are not good predictors of the dependent variable.


Used [this link](https://machinelearningmastery.com/feature-selection-machine-learning-python/) and [this link](https://www.datacamp.com/community/tutorials/feature-selection-python), as I don't believe the "Hands-On Machine Learning" book does enough on this section.

In [None]:
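# Import the necessary libraries first
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

In [None]:
# Feature extraction
# chi2 is a classification score that needs non-negative features and class labels,
# so we swap in f_regression, the univariate score for a continuous target
test = SelectKBest(score_func=f_regression, k=4)
fit = test.fit(X, Y.values.ravel()) # ravel turns the single-column frame into the 1-D target expected here

# Summarize scores
np.set_printoptions(precision=3)
print(fit.scores_)

features = fit.transform(X)
# Summarize selected features
features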

## Part E: Short-List Promising Models

Notes:
  - If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware that this penalizes complex models such as large neural nets or Random Forests).
  - Once again, try to automate these steps as much as possible.
  
1. Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
2. Measure and compare their performance. For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds.
3. Analyze the most significant variables for each algorithm.
4. Analyze the types of errors the models make. What data would a human have used to avoid these errors?
5. Have a quick round of feature selection and engineering.
6. Have one or two more quick iterations of the five previous steps.
7. Short-list the top three to five most promising models, preferring models that make different types of errors.

For each model, we'll use the mean absolute error (MAE) to evaluate performance. Remember, this is a non-negative floating point, where the best value is $0.0$.
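The N-fold procedure from step 2 of the checklist could look roughly like the sketch below. It runs on synthetic data, and the pipeline, the fold count and the seed are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# synthetic stand-in for the housing features and target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

# scaling inside the pipeline means each fold fits the scaler on its own training part only
model = make_pipeline(RobustScaler(), LinearRegression())
folds = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so MAE is exposed as its negative
neg_mae = cross_val_score(model, X_demo, y_demo, cv=folds, scoring="neg_mean_absolute_error")
mae_per_fold = -neg_mae

print(f"MAE: {mae_per_fold.mean():.3f} (+/- {mae_per_fold.std():.3f})")
```

Reporting the mean together with the standard deviation across folds gives a sense of how stable a model's error is, which a single train/test split cannot show.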

### Splitting our data set into a training and test set

In [None]:
X_TRAIN, X_TEST, Y_TRAIN, Y_TEST = train_test_split(X, Y, test_size=0.20, random_state=SEED)

### Linear Regression

In [None]:
linear_regression = LinearRegression().fit(X_TRAIN, Y_TRAIN)
Y_PRED = linear_regression.predict(X_TEST)
print("Mean Absolute Error: ", mean_absolute_error(Y_TEST, Y_PRED))  

### Support-vector machine
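
I haven't filled this section in yet. A minimal sketch of what it might look like, assuming scikit-learn's `SVR` with default parameters and mirroring the other model sections. The data here is a synthetic stand-in (an assumption); in the notebook itself, `X_TRAIN`, `X_TEST`, `Y_TRAIN` and `Y_TEST` would be used instead.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Stand-in data (assumption); in the notebook, reuse the existing train/test split.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

svr = SVR().fit(X_tr, y_tr)  # defaults: RBF kernel, C=1.0, epsilon=0.1
svr_pred = svr.predict(X_te)
svr_mae = mean_absolute_error(y_te, svr_pred)
print("Mean Absolute Error: ", svr_mae)
```

Default `SVR` settings are sensitive to the scale of both the features and the target, so a poor score here would not necessarily rule the model out.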

### Random Forest

In [None]:
random_forest_regressor = RandomForestRegressor(n_estimators=20, random_state=SEED).fit(X_TRAIN, Y_TRAIN)  
Y_PRED = random_forest_regressor.predict(X_TEST)  
print("Mean Absolute Error: ", mean_absolute_error(Y_TEST, Y_PRED))  
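
For step 3 of the checklist ("analyze the most significant variables"), a fitted random forest exposes `feature_importances_`. A minimal sketch, assuming synthetic stand-in data and hypothetical feature names (`feature_0` ... `feature_5`); in the notebook the real column names would go there.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Stand-in data (assumption): only 2 of the 6 features carry signal.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, n_informative=2, noise=5.0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

forest = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Importances are normalized to sum to 1.0; larger means the forest relied on that feature more.
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```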

### Neural Network

In [None]:
def baseline_model():
    model = Sequential()
    model.add(Dense(6, input_dim=6, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

In [None]:
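# Keras 2 renamed nb_epoch to epochs
estimator = KerasRegressor(build_fn=baseline_model, epochs=100, batch_size=100, verbose=False)
# shuffle=True is needed for random_state to have any effect
kfold = KFold(n_splits=10, shuffle=True, random_state=SEED)
results = cross_val_score(estimator, X, Y, cv=kfold)
# Depending on the Keras version, score() may return the negated loss, so these values can be negative
print("Results: %.2f (%.2f) MSE" % (results.mean(), results.std()))

# Fit on the training split only, so the test set stays unseen
estimator.fit(X_TRAIN, Y_TRAIN)
Y_PRED = estimator.predict(X_TEST)
print("Mean Absolute Error: ", mean_absolute_error(Y_TEST, Y_PRED))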

## Part F: Fine-Tune the System

Notes:
- You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning.
- As always, automate what you can.

1. Fine-tune the hyperparameters using cross-validation.
  - Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or with the median value? Or just drop the rows?). Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams).

2. Try Ensemble methods. Combining your best models will often perform better than running them individually.
3. Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.

WARNING: Don’t tweak your model after measuring the generalization error: you would just start overfitting the test set.
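
The "prefer random search over grid search" advice in step 1 could look something like this. It is a minimal sketch on synthetic stand-in data, and the search space is purely illustrative (an assumption), not something tuned for this data set. In the notebook, the search should be fit on the training split only, so the test set stays untouched until the very end.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Stand-in data (assumption); in the notebook, use X_TRAIN and Y_TRAIN.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)

# Illustrative search space (assumption): each draw samples one value per hyperparameter.
param_distributions = {
    "n_estimators": randint(10, 100),
    "max_depth": randint(2, 12),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,  # number of random configurations to try
    cv=5,
    scoring="neg_mean_absolute_error",
    random_state=42,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated MAE:", -search.best_score_)
```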

## Part G: Present Your Solution

1. Document what you have done.
2. Create a nice presentation. 
  - Make sure you highlight the big picture first.
3. Explain why your solution achieves the business objective.
4. Don’t forget to present interesting points you noticed along the way. 
  - Describe what worked and what did not. 
  - List your assumptions and your system’s limitations.
5. Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., “the median income is the number-one predictor of housing prices”).

## Part H: Launch!

1. Get your solution ready for production (plug into production data inputs, write unit tests, etc.).
2. Write monitoring code to check your system’s live performance at regular intervals and trigger alerts when it drops.
  - Beware of slow degradation too: models tend to “rot” as data evolves.
  - Measuring performance may require a human pipeline (e.g., via a crowdsourcing service).
  - Also monitor your inputs’ quality (e.g., a malfunctioning sensor sending random values, or another team’s output becoming stale). This is particularly important for online learning systems.
3. Retrain your models on a regular basis on fresh data (automate as much as possible).
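
The monitoring idea in step 2 can be sketched in a few lines. Everything here is hypothetical: `check_live_performance`, the `baseline_mae` value and the 10% tolerance are made-up names and numbers for illustration, and a real system would send an alert somewhere instead of printing.

```python
import numpy as np


def check_live_performance(y_true, y_pred, baseline_mae, tolerance=0.10):
    """Return (live_mae, alert); alert is True if MAE is worse than baseline by more than tolerance."""
    errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    live_mae = float(errors.mean())
    alert = live_mae > baseline_mae * (1.0 + tolerance)
    return live_mae, alert


# Made-up batch of recent labelled examples and the model's predictions for them.
baseline_mae = 5.0
live_mae, alert = check_live_performance(
    [30.0, 42.0, 25.0], [36.0, 35.0, 30.0], baseline_mae=baseline_mae
)
if alert:
    print(f"ALERT: live MAE {live_mae:.2f} is well above the baseline {baseline_mae:.2f}")
```

Running this on a schedule and logging `live_mae` each time would also make slow degradation visible, not just sudden drops.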

## TODO

- Feature selection
- Use cross validation for chosen model
- Save figures back to filesystem

### From Max:
- extract month and year from the `transaction_date` column
  - perform EDA on these new columns
- look at Zillow notebooks
- check how RMSE is NOT insulated from outliers
- check how MAE is insulated from outliers
  - it's not
- type 
- if I had categorical independent variables, I shouldn't use the same high-level analysis. Instead, I'd do different stuff, which we'll find in the future
- use L1 to perform feature selection
- use the broken code to pick the top N, N-1, N-2, etc. down to 1 columns, then evaluate the "best" on the TEST SET, using MSE
- validation set, no cross validation
- neural network needs more hidden layers
  - train on GPU, look into how difficult it is to setup
  - actually pick a learning rate, start with 0.001
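
For the "use L1 to perform feature selection" item above, one possible approach is a `Lasso` model wrapped in `SelectFromModel`. This is a minimal sketch on synthetic stand-in data; `alpha=1.0` is an arbitrary illustrative penalty (an assumption) that would need tuning on the real data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): only 2 of the 6 features carry signal.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, n_informative=2, noise=5.0, random_state=42
)
X_scaled = StandardScaler().fit_transform(X_demo)  # L1 penalties are scale-sensitive

# The L1 penalty pushes the coefficients of unhelpful features to exactly zero.
lasso = Lasso(alpha=1.0).fit(X_scaled, y_demo)
selector = SelectFromModel(lasso, prefit=True)
kept = selector.get_support()
X_selected = selector.transform(X_scaled)

print("Coefficients:", np.round(lasso.coef_, 2))
print("Kept feature indices:", np.flatnonzero(kept))
```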
  
### From Scott:
- ~~Elaborate that "1. What are the current solutions/workarounds (if any)? None" is just pretend~~
- "This may also sound really out there... but in choosing MAE ... totally fine with it... but wouldn't it be interesting to use Median instead of Mean? People default to MSE"
- ~~"Would be useful to try and examine the ones that had some strange relationships from the massive grid corr plot"~~
  - ~~Basically do the grid first, then columns of interest after~~
- bin lat/long
  - could find "beachfront" for example

## Could Do

- Add more normality tests
- Attempt to figure out each column's distribution
- Graph each data point, by `longitude` and `latitude` superimposed on a real map of Taiwan, with a heat map of the `house_price_of_unit_area`
- use `sklearn`'s `Pipeline`s
- use bootstrapping
- reference [this](https://www.century21global.com/for-sale-residential/Taiwan/Taipei-City)
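
As a note to self on the pipeline item above: a minimal sketch of chaining `RobustScaler` and `LinearRegression` with `make_pipeline`, again on synthetic stand-in data (an assumption). The main appeal is that the scaler gets refit on each training fold during cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Stand-in data (assumption); in the notebook, use the unscaled X and Y.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)

# The scaler is refit on each training fold, so no validation-fold statistics leak into it.
pipeline = make_pipeline(RobustScaler(), LinearRegression())
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
pipeline_mae = -cross_val_score(
    pipeline, X_demo, y_demo, cv=kfold, scoring="neg_mean_absolute_error"
)
print(f"MAE {pipeline_mae.mean():.2f} (+/- {pipeline_mae.std():.2f})")
```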