# Workflow instructions
As of 2017/9/30

In [2]:
import os
import sys
# Switch to the repository root so that relative paths such as config/ and features/ resolve
module_path = os.path.abspath(os.path.join('..'))
os.chdir(module_path)

Configuration settings should be stored in `config/some_config_file.py`.

In [10]:
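# Print the configuration file verbatim
with open('config/config_linear.py', 'r') as f:
    for line in f:
        # Strip only the trailing newline; line[:-1] would cut a character from an unterminated last line
        print(line.rstrip('\n'))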

import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Feature list
from features import feature_list_linear
# model
from models import LinearModel

# Configuration
config_linear = {
    # 'pca_components': 15, # a pca_component greater than 0 will automatically set clean_na to True as pca cannot deal with infinite numbers.
    # 'resale_offset': 0,
    'feature_list': feature_list_linear.feature_list,
    'clean_na': True,
    'training_params': {
        'Model': LinearModel.RidgeRegressor,
        'model_params': {'alpha': 1.0, 'random_state': 42},
        'FOLDS': 5,
        'record': False,
        'outliers_up_pct': 99,
        'outliers_lw_pct': 1,
        # 'resale_offset': 0.012
        'pca_components': -1, # clean_na needs to be True to use PCA
        'scaling': True,
        # 'scaler': RobustScaler(quantile_range=(0, 99)),
        # 'scaling_columns': SCALING_COLUMNS
    },
    'stacking_p

### Choose the `feature_list` according to the needs of the model.
Linear models need all categorical features to be one-hot encoded and all NaNs to be eliminated.

Nonlinear models (trees, boosting) do not have these requirements.
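
As a minimal sketch of what a linear model needs, the following uses pandas to one-hot encode a categorical column and fill the remaining missing values. The toy DataFrame, the `county` column, and the median fill strategy are assumptions for illustration; the project's own cleaning lives in `feature_clean` and may differ.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one numeric and one categorical column, both with gaps.
df = pd.DataFrame({
    'bathroomcnt': [1.0, 2.0, np.nan, 3.0],
    'county': ['a', 'b', None, 'a'],  # hypothetical categorical column
})

# One-hot encode the categorical column; dummy_na=True gives "missing" its own indicator.
encoded = pd.get_dummies(df, columns=['county'], dummy_na=True, dtype=float)

# Fill the remaining numeric NaNs (median is an assumed strategy, not the project's).
encoded = encoded.fillna(encoded.median())

print(encoded)
```

A tree-based model could consume `df` with far less preparation, which is why the two model families use different feature lists.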

# Cleaning and Feature Engineering

In [5]:
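# Print the feature list file verbatim
with open('features/feature_list_linear.py', 'r') as f:
    for line in f:
        # Strip only the trailing newline; line[:-1] would cut a character from an unterminated last line
        print(line.rstrip('\n'))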

feature_list = {
    'before_fill': [
        # ('missing_value_one_hot', missing_value_one_hot, {}, 'missing_value_one_hot_pickle', False),
        ('missing_value_count', 'missing_value_count', {}, 'missing_value_count_pickle', False),
    ],
    'original': [
        # required columns
        'parcelid',
        # optional columns
        'basementsqft',
        'bathroomcnt',
        # 'bedroomcnt', # high corr
        'calculatedbathnbr',
        # 'finishedfloor1squarefeet', # seems almost same as finishedsquarefeet50
        'calculatedfinishedsquarefeet',
        # 'finishedsquarefeet12',
        'finishedsquarefeet13',
        # 'finishedsquarefeet15',
        'finishedsquarefeet50',
        'finishedsquarefeet6',
        'fireplacecnt',
        # 'fullbathcnt',
        'garagecarcnt',
        'garagetotalsqft',
        'hashottuborspa',
        'latitude',
        'longitude',
        'lotsizesquarefeet',
        'poolcnt',
        'poolsizesum',
        'pooltypeid10',
    

### `feature_list` contains three parts: `before_fill`, `original`, and `generated`.
`before_fill` features describe the raw data before any processing, such as the count of missing values.

`original` generally holds the original columns of the data, though some may differ slightly from the raw values because of string parsing and similar preprocessing.

`generated` holds any features we created.
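
To make the three-part layout concrete, here is a small hypothetical `feature_list` and a loop that reports the size of each part, similar to the summary printed when the data is loaded. The `before_fill` tuple and the column names are copied from the file above; the `generated` entry is an assumption that simply mirrors the same tuple layout.

```python
# Hypothetical, shortened feature list with the same three keys as the real one.
feature_list = {
    'before_fill': [
        ('missing_value_count', 'missing_value_count', {}, 'missing_value_count_pickle', False),
    ],
    'original': ['parcelid', 'bathroomcnt', 'latitude', 'longitude'],
    'generated': [
        # Assumed to follow the same tuple layout as 'before_fill'.
        ('ratio_basement', 'ratio_basement', {}, 'ratio_basement_pickle', False),
    ],
}

# Report how many features each part contributes.
counts = {part: len(entries) for part, entries in feature_list.items()}
for part, n in counts.items():
    print(f'{part}: {n}')
```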

### The raw data is normally loaded by the `__main__` part of `train.py`. Notebook users can use the following method instead:

In [3]:
import interface
train_df, transactions = interface.load_full_data(config_file='config_linear')



before_fill: 1
original: 31
generated: 46
clean_na: True
training_params: {'Model': <class 'models.LinearModel.RidgeRegressor'>, 'model_params': {'alpha': 1.0, 'random_state': 42}, 'FOLDS': 5, 'record': False, 'outliers_up_pct': 99, 'outliers_lw_pct': 1, 'pca_components': -1, 'scaling': True}
Feature engineering
Using cleaned prop
features/feature_pickles_cleaned/geo_county_pickle
features/feature_pickles_cleaned/category_geo_county_one_hot_pickle
features/feature_pickles_cleaned/geo_neighborhood_pickle
features/feature_pickles_cleaned/geo_zip_pickle
features/feature_pickles_cleaned/multiply_lat_lon_pickle
features/feature_pickles_cleaned/poly_2_structure_tax_value_pickle
features/feature_pickles_cleaned/poly_3_structure_tax_value_pickle
features/feature_pickles_cleaned/ratio_basement_pickle
features/feature_pickles_cleaned/ratio_bedroom_bathroom_pickle
features/feature_pickles_cleaned/ratio_fireplace_pickle
features/feature_pickles_cleaned/ratio_floor_shape_pickle
features/feature_pic

MemoryError: 

The procedure above can be broken down into the following steps:

1. A feature list is loaded and `train.prepare_features()` is called.
1. Depending on whether NaNs need to be cleaned, either `feature_combine.feature_combine_cleaned(feature_list)` or `feature_combine.feature_combine_with_nan(feature_list)` is called.
1. Inside `feature_combine`, `utils.load_properties_data_preprocessed` or `utils.load_properties_data_cleaned` is called.
    1. Both call `utils.load_properties_data` first, which reads the raw properties data from a csv/pickle file.
    1. `preprocess_geo` is always called to fill in certain missing values. If necessary, `preprocess_add_geo_features` is also called; it tries to fill in missing data by grouping properties geographically. Note that this changes the "original" feature list, because some columns are parsed.
    1. If `utils.load_properties_data_cleaned` is called, every column of the properties data is passed through the functions in `feature_clean`. Each function in `feature_clean` must have the same name as its column and return only that column, free of NaN and inf.
1. The returned `prop` DataFrame is fed into a subset of the functions in `feature_eng` (the subset is defined in `feature_list`). Each function returns a DataFrame of one or more columns, and these are concatenated with the original `prop` to form one large DataFrame.
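
The naming convention for `feature_clean` and the concatenation step can be sketched as follows. Everything here is a hypothetical stand-in: the `feature_clean` namespace, the zero-fill cleaning rule, and the `ratio_bedroom_bathroom` function illustrate the pattern and are not the project's actual code.

```python
import types

import numpy as np
import pandas as pd

# Hypothetical stand-in for the feature_clean module: one function per column,
# named after the column, returning that column without NaN or inf.
feature_clean = types.SimpleNamespace(
    bathroomcnt=lambda df: df['bathroomcnt'].replace([np.inf, -np.inf], np.nan).fillna(0),
    bedroomcnt=lambda df: df['bedroomcnt'].replace([np.inf, -np.inf], np.nan).fillna(0),
)

def clean_all(prop):
    """Look up each column's cleaner by name and rebuild the DataFrame."""
    cleaned = {col: getattr(feature_clean, col)(prop) for col in prop.columns}
    return pd.DataFrame(cleaned, index=prop.index)

# Hypothetical feature_eng function: takes prop, returns a DataFrame of new columns.
def ratio_bedroom_bathroom(prop):
    ratio = prop['bedroomcnt'] / prop['bathroomcnt'].replace(0, np.nan)
    return pd.DataFrame({'ratio_bedroom_bathroom': ratio.fillna(0)})

prop = pd.DataFrame({'bathroomcnt': [1.0, np.nan, 2.0], 'bedroomcnt': [2.0, 3.0, np.inf]})
prop = clean_all(prop)

# Concatenate the engineered columns with the original prop, column-wise.
features = pd.concat([prop, ratio_bedroom_bathroom(prop)], axis=1)
print(features)
```

Looking cleaners up by column name keeps the dispatch code free of per-column branches: supporting a new column only requires adding a function with that name.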

`train.prepare_training_data(prop)` is called afterwards to combine `prop` (features) with `transactions` (labels).
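
A minimal sketch of that combination, assuming both frames share the `parcelid` key (listed as a required column above) and that the label column is called `logerror`. The label name and the toy values are assumptions, and the real `train.prepare_training_data` may do more than a join.

```python
import pandas as pd

# Toy features, one row per property.
prop = pd.DataFrame({'parcelid': [1, 2, 3], 'bathroomcnt': [1.0, 2.0, 3.0]})
# Toy labels; 'logerror' is an assumed label name.
transactions = pd.DataFrame({'parcelid': [1, 3], 'logerror': [0.01, -0.02]})

# Keep one row per transaction and attach the matching features.
train_df = transactions.merge(prop, on='parcelid', how='left')
print(train_df)
```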

# Training

`train.train()` splits the DataFrame into training, validation, and test parts. 5-fold cross-validation is applied, and the results are averaged across folds.

After each run, a submission file is created in the `data/submissions` folder.
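
The fold averaging can be sketched as below. This is not `train.train()`: a closed-form ridge fit in NumPy stands in for `LinearModel.RidgeRegressor`, the data is synthetic, mean absolute error is an assumed metric, and the separate test split is left out. Only the fold count, `alpha`, and the seed come from the configuration shown earlier.

```python
import numpy as np

FOLDS, ALPHA, SEED = 5, 1.0, 42  # values taken from config_linear

rng = np.random.default_rng(SEED)
X = rng.normal(size=(200, 3))  # synthetic features
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=200)

def ridge_fit(X, y, alpha):
    """Closed-form ridge weights: (X^T X + alpha * I)^-1 X^T y (no intercept)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Shuffle the row indices once, then cut them into FOLDS validation chunks.
folds = np.array_split(rng.permutation(len(X)), FOLDS)

scores = []
for k, val_idx in enumerate(folds):
    # Train on every fold except the k-th one.
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
    w = ridge_fit(X[train_idx], y[train_idx], ALPHA)
    mae = np.mean(np.abs(X[val_idx] @ w - y[val_idx]))
    scores.append(mae)

mean_mae = float(np.mean(scores))
print(f'MAE per fold: {np.round(scores, 4)}, mean: {mean_mae:.4f}')
```

Averaging over folds gives a steadier estimate than a single train/validation split, since every row is used for validation exactly once.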

# Tuning

# Ensembling and Stacking