## Kaggle Competition - House Prices: Advanced Regression Techniques

In [5]:
from src.models import RegularizationModelLinear, RegularizationModelFeatures, RegularizationModelPolynomial, RegularizationModelSquared
from src.models import SquaredFeatureEncoding
from src.models import MultiModelSquared

import pydash
import warnings
from operator import itemgetter
warnings.filterwarnings('ignore')

## Quartic Feature Encoding

Whilst sklearn.preprocessing.PolynomialFeatures() generates a very large number of cross-term polynomial features, 
linear regression might be better served by using only the powers of each individual feature: ^2 + ^3 + ^4

The optimal value of the X_feature_squared tuning parameter for this dataset is 4
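As a rough illustration of the idea, the sketch below appends the powers of each column without any cross terms and compares the resulting column count with a full polynomial expansion. The helper names `add_power_features` and `polynomial_feature_count` are hypothetical and are not taken from `src.models`; the actual `SquaredFeatureEncoding` implementation may differ.

```python
from math import comb

import numpy as np


def add_power_features(X, max_power):
    """Return X with x**2 ... x**max_power of every column appended (no cross terms).

    Hypothetical helper for illustration; max_power=1 returns X unchanged.
    """
    X = np.asarray(X, dtype=float)
    return np.hstack([X ** p for p in range(1, max_power + 1)])


def polynomial_feature_count(n_features, degree):
    """Column count of a full polynomial expansion up to `degree`, excluding the bias term."""
    return comb(n_features + degree, degree) - 1


X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

encoded = add_power_features(X, max_power=4)
print(encoded.shape)                    # 3 features * 4 powers = 12 columns
print(polynomial_feature_count(3, 4))   # full degree-4 expansion of 3 features
```

With only 3 input features the per-feature powers give 12 columns, whereas a full degree-4 expansion gives 34; the gap widens quickly as the number of features grows.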

In [2]:
results = []
for n in range(1, 10):
    # summary() returns a tuple starting with (score, model name)
    result = SquaredFeatureEncoding(X_feature_squared=n).summary()
    results.append((result[0], result[1], n))
sorted(results)

[(0.1507, 'SquaredFeatureEncoding', 4),
 (0.1532, 'SquaredFeatureEncoding', 3),
 (0.1629, 'SquaredFeatureEncoding', 2),
 (0.1942, 'SquaredFeatureEncoding', 1),
 (0.2231, 'SquaredFeatureEncoding', 8),
 (0.2516, 'SquaredFeatureEncoding', 5),
 (0.3344, 'SquaredFeatureEncoding', 6),
 (0.3696, 'SquaredFeatureEncoding', 7),
 (0.5343, 'SquaredFeatureEncoding', 9)]

Testing this against the range of sklearn.linear_model estimators (next notebook), we discovered:
- Squared/Quartic feature encoding outperforms most PolynomialFeatures encodings
- Squared/Quartic feature encoding works best with LassoLars + ElasticNet
- Tuning X_feature_squared gives no further improvement when using MultiModel selection
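The comparison described above can be sketched as follows. This is a minimal example on synthetic data, not the `MultiModelSquared` implementation: the data, the `alpha` values, the choice of four estimators and the cross-validated RMSE metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LassoLars, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing data (assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Per-feature powers x, x^2, x^3, x^4 with no cross terms
X_powers = np.hstack([X ** p for p in range(1, 5)])

models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'ElasticNet': ElasticNet(alpha=0.01),
    'LassoLars': LassoLars(alpha=0.01),
}

scores = []
for name, model in models.items():
    # scikit-learn reports negated MSE so that higher is always better
    neg_mse = cross_val_score(model, X_powers, y, cv=5,
                              scoring='neg_mean_squared_error')
    rmse = float(np.sqrt(-neg_mse.mean()))
    scores.append((round(rmse, 4), name))

# Lowest error first, mirroring the sorted score lists in this notebook
for score in sorted(scores):
    print(score)
```

Scoring every estimator on the same encoded matrix keeps the comparison about the model rather than the features.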

In [3]:
MultiModelSquared(X_feature_squared=4).model_scores_list()

[(0.1386, 'MultiModelSquared', 'LassoLars'),
 (0.1389, 'MultiModelSquared', 'ElasticNet'),
 (0.1399, 'MultiModelSquared', 'Lasso'),
 (0.1402, 'MultiModelSquared', 'RidgeCV'),
 (0.1413, 'MultiModelSquared', 'Ridge'),
 (0.1507, 'MultiModelSquared', 'LinearRegression'),
 (0.1581, 'MultiModelSquared', 'LassoLarsCV'),
 (0.1581, 'MultiModelSquared', 'LassoLarsIC'),
 (0.1659, 'MultiModelSquared', 'ARDRegression'),
 (0.1785, 'MultiModelSquared', 'OrthogonalMatchingPursuitCV'),
 (0.1801, 'MultiModelSquared', 'BayesianRidge'),
 (0.2526, 'MultiModelSquared', 'TheilSenRegressor'),
 (0.3013, 'MultiModelSquared', 'RidgeClassifierCV'),
 (0.3063, 'MultiModelSquared', 'RidgeClassifier'),
 (0.3906, 'MultiModelSquared', 'LassoCV'),
 (0.3906, 'MultiModelSquared', 'ElasticNetCV'),
 (0.3929, 'MultiModelSquared', 'LarsCV'),
 (0.4008, 'MultiModelSquared', 'Perceptron'),
 (0.4252, 'MultiModelSquared', 'RANSACRegressor'),
 (0.5673, 'MultiModelSquared', 'PassiveAggressiveRegressor'),
 (0.6503, 'MultiModelSquared

In [6]:
results = sorted(pydash.flatten([
    RegularizationModelLinear().model_scores_list(),
    RegularizationModelFeatures().model_scores_list(),
    RegularizationModelPolynomial().model_scores_list(),
    RegularizationModelSquared().model_scores_list(),
]), key=itemgetter(0))
for result in results:
    print(result)

(0.1386, 'RegularizationModelSquared', 'LassoLars')
(0.1389, 'RegularizationModelSquared', 'ElasticNet')
(0.1402, 'RegularizationModelSquared', 'RidgeCV')
(0.1507, 'RegularizationModelSquared', 'LinearRegression')
(0.1581, 'RegularizationModelSquared', 'LassoLarsIC')
(0.1619, 'RegularizationModelFeatures', 'ARDRegression')
(0.162, 'RegularizationModelFeatures', 'RidgeCV')
(0.1659, 'RegularizationModelSquared', 'ARDRegression')
(0.1688, 'RegularizationModelFeatures', 'ElasticNet')
(0.1758, 'RegularizationModelFeatures', 'LassoLars')
(0.1763, 'RegularizationModelFeatures', 'LassoLarsIC')
(0.1823, 'RegularizationModelFeatures', 'LinearRegression')
(0.1899, 'RegularizationModelLinear', 'ElasticNet')
(0.1909, 'RegularizationModelLinear', 'ARDRegression')
(0.1909, 'RegularizationModelPolynomial', 'LassoLarsIC')
(0.1911, 'RegularizationModelLinear', 'LassoLarsIC')
(0.1929, 'RegularizationModelLinear', 'RidgeCV')
(0.1939, 'RegularizationModelLinear', 'LassoLars')
(0.1942, 'RegularizationModelL

### Submit to Kaggle
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/submissions

```
(0.1386, 'LassoLarsSquared', 'LassoLarsSquared', 'X_feature_exclude X_feature_year_ages X_feature_label_encode X_feature_onehot X_feature_squared')
(0.1389, 'ElasticNetSquared', 'ElasticNetSquared', 'X_feature_exclude X_feature_year_ages X_feature_label_encode X_feature_onehot X_feature_squared')
(0.1507, 'SquaredFeatureEncoding', 'LinearRegression', 'X_feature_exclude X_feature_year_ages X_feature_label_encode X_feature_onehot X_feature_squared')
```

```
$ kaggle competitions submit -c house-prices-advanced-regression-techniques -f ./data/submissions/SquaredFeatureEncoding.csv -m "Quartic Features"
```    
- Your submission scored 2.29145, which is not an improvement of your best score. Keep trying!

```
$ kaggle competitions submit -c house-prices-advanced-regression-techniques -f ./data/submissions/ElasticNetSquared.csv -m "ElasticNet + Quartic Features"
```    
- Your submission scored 2.42165, which is not an improvement of your best score. Keep trying!

```
$ kaggle competitions submit -c house-prices-advanced-regression-techniques -f ./data/submissions/LassoLarsSquared.csv -m "LassoLars + Quartic Features"
```    
- Your submission scored 2.42903, which is not an improvement of your best score. Keep trying!

So this is weird!

Whilst Squared/Quartic features produced very good results on the local dataset (scores of 0.1386 to 0.1507), the models were completely useless on Kaggle, scoring between 2.29 and 2.43.
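Kaggle scores this competition on the RMSE between the logarithm of the predicted price and the logarithm of the observed price. The sketch below is one way to reproduce that metric on a hold-out set and to see how strongly a single extreme prediction moves it. The function name `rmsle` and the example prices are hypothetical, and nothing here establishes what actually caused the gap in this notebook.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """RMSE between log prices; all values must be strictly positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))


# Made-up prices for illustration only
y_true = np.array([200000.0, 150000.0, 310000.0])

# Every prediction 10% too high
uniform_error = y_true * 1.1

# Two perfect predictions and one that is off by a factor of 10
one_outlier = np.array([200000.0, 150000.0, 3100000.0])

print(round(rmsle(y_true, uniform_error), 4))
print(round(rmsle(y_true, one_outlier), 4))
```

Running the same check on predictions for rows the model never saw during fitting, and inspecting their minimum and maximum, would show whether a handful of extreme predictions is enough to explain a score above 2.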