<center><h1>Feature Reduction & Starter Models</h1></center>

In this notebook, we'll try different techniques to reduce the number of dimensions of our data.<br><br>
I'll try the following techniques:
<br>

**1. Feature Selection:**
- Forward Feature Selection.
- Backward Feature Selection.
- Remove High Collinearity Features.
- Recursive Feature Elimination method.
- Nearest Neighbors for Feature Extraction.

**2. Dimensionality Reduction:**
- PCA.
- ICA.
- t-SNE.
- UMAP.
- Random Projection.
- other.

For the starter models, I'll try the following models:<br>
1. Linear Regression.
2. Lasso.
3. Ridge.
4. Elastic Net.
5. SVR.
6. GaussianNB.
7. KNeighborsRegressor.
8. MLP.
9. RandomForestRegressor.
10. ExtraTreeRegressor.
11. GTBM (Sklearn).
12. XGBoost (DMLC).
13. LightGBM (Microsoft).
14. CatBoost (Yandex).
15. Average Stacking.
16. Meta layer Stacking.

# Data Overview:
## Import Libraries:

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

from util import *

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

## Read Data:

In [23]:
train = pd.read_csv('../3_Feature Engineering/output/train_engineered.csv').dropna()
test  = pd.read_csv('../3_Feature Engineering/output/test_engineered.csv')

train_labels = train.SalePrice

shape(train, test)

~> [train] has 1,408 rows, and 1,050 columns.
~> [test ] has 1,459 rows, and 1,049 columns.


# Feature Selection:
## Remove Highly Correlated Features:

In [4]:
threshold = .9

# Create correlation matrix
corr = train.corr().abs()

# Select upper triangle of correlation matrix
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Select columns with correlations above threshold
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]

print(f'~> There are {bg(len(to_drop))} columns to remove.  (;´༎ຶД༎ຶ`)')

~> There are 350 columns to remove.  (;´༎ຶД༎ຶ`)


In [5]:
train[to_drop].head()

Unnamed: 0,GarageQual,GarageCond,onehot_LandSlope_3,onehot_ExterQual_4,onehot_HeatingQC_5,onehot_Electrical_5,onehot_FullBath_1,onehot_FullBath_2,onehot_HalfBath_1,onehot_PavedDrive_3,...,diff_GarageYrBlt_MoSold,diff_GarageYrBlt_YrSold,diff_MoSold_YearBuilt,diff_MoSold_YearRemodAdd,diff_MoSold_GarageYrBlt,diff_MoSold_YrSold,diff_YrSold_YearBuilt,diff_YrSold_YearRemodAdd,diff_YrSold_GarageYrBlt,diff_YrSold_MoSold
0,3,3,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,...,2001.0,5.0,2001,2001,2001.0,2006,5,5,5.0,2006
1,3,3,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,...,1971.0,31.0,1971,1971,1971.0,2002,31,31,31.0,2002
2,3,3,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,...,1992.0,7.0,1992,1993,1992.0,1999,7,6,7.0,1999
3,3,3,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,...,1996.0,8.0,1913,1968,1996.0,2004,91,36,8.0,2004
4,3,3,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,...,1988.0,8.0,1988,1988,1988.0,1996,8,8,8.0,1996


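IMHO, this list should be taken with a grain of salt. Standardizing or normalizing the features would not change it, since Pearson correlation is invariant to linear rescaling, but a different threshold could.<br><br>
Either way, I won't rely on this result alone.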
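To make the correlation filter concrete, here is a minimal sketch on a toy frame. The column names (`a`, `a_scaled`, `a_noisy`, `b`) and the helper `correlated_columns` are made up for illustration and are not part of the notebook. It applies the same upper-triangle trick as the cell above, then repeats it on a standardized copy to check whether the drop list changes.

```python
import numpy as np
import pandas as pd


def correlated_columns(df, threshold=0.9):
    """Hypothetical helper: columns whose |Pearson r| with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 1000 * a + 5,                          # exact linear copy of "a"
    "a_noisy": a + rng.normal(scale=0.05, size=200),   # near copy of "a"
    "b": rng.normal(size=200),                         # independent feature
})

# Standardize every column to zero mean and unit variance
standardized = (toy - toy.mean()) / toy.std()

drop_raw = correlated_columns(toy)
drop_std = correlated_columns(standardized)
reduced = toy.drop(columns=drop_raw)

print(drop_raw, drop_std, list(reduced.columns))
```

In this toy case both drop lists are identical, and only the first column of each correlated group survives. Which member of a group is kept depends purely on column order, so it may be worth ordering columns by preference before filtering.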

## Recursive Feature Elimination Method:
For this notebook, I'll use a random forest to build some intuition about the features. In the modeling section, I'll apply the same method to each model and see how the results differ.
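As a quick illustration of what recursive feature elimination with cross-validation produces, here is a small sketch on synthetic data. The dataset from `make_regression`, the small forest, and the built-in `"neg_root_mean_squared_error"` scoring string are assumptions chosen so the example runs in seconds; they are not the settings used on the house-price data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# Synthetic regression problem: 8 features, only 3 of them informative
X, y = make_regression(n_samples=150, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

estimator = RandomForestRegressor(n_estimators=25, random_state=0)

# Drop one feature per step and pick the feature count with the best CV score
selector = RFECV(estimator, step=1, cv=3, scoring="neg_root_mean_squared_error")
selector.fit(X, y)

n_kept = selector.n_features_      # number of features RFECV decided to keep
kept_mask = selector.support_      # boolean mask over the original columns
ranking = selector.ranking_        # 1 = kept, larger = eliminated earlier
X_reduced = selector.transform(X)  # data restricted to the kept columns

print(n_kept, kept_mask, ranking)
```

On a real frame, `train.columns[selector.support_]` would give the names of the surviving features.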

In [28]:
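# First, define our target metric
from sklearn.metrics import mean_squared_error, make_scorer


def rmse(y_true, y_pred):
    # Root mean squared error between the true and predicted targets
    return np.sqrt(mean_squared_error(y_true, y_pred))

# greater_is_better=False makes the scorer return the negated RMSE
scorer = make_scorer(rmse, greater_is_better=False)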
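One detail that is easy to miss: with `greater_is_better=False`, the scorer returns the negated metric, so scikit-learn can keep maximizing scores. A minimal sketch, assuming a `DummyRegressor` and four made-up target values purely for illustration:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, make_scorer


def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))


scorer = make_scorer(rmse, greater_is_better=False)

X = np.zeros((4, 1))                 # features are irrelevant to a dummy model
y = np.array([1.0, 2.0, 3.0, 4.0])
model = DummyRegressor(strategy="mean").fit(X, y)

raw = rmse(y, model.predict(X))      # plain RMSE, always non-negative
signed = scorer(model, X, y)         # what cross-validation tools will see

print(raw, signed)
```

Scores reported by tools such as `RFECV` or `cross_val_score` with this scorer are therefore negative, and values closer to zero are better.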

In [29]:
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestRegressor as RFR

# train.drop('SalePrice', axis=1, inplace=True)

# Create a model for feature selection
estimator = RFR(random_state=10, n_estimators=100, n_jobs=-1)

# Create the object
selector = RFECV(estimator, step=1, cv=5, scoring=scorer, n_jobs=-1)

In [None]:
# Fit on the features only: exclude the SalePrice target to avoid label leakage
selector.fit(train.drop('SalePrice', axis=1), train_labels)

In [24]:
shape(test, train_labels)

~> [test        ] has 1,459 rows, and 1,049 columns.


IndexError: tuple index out of range
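The `IndexError` is what one would expect if the helper reads `shape[1]` from `train_labels`, which is a 1-D `Series` whose shape tuple has a single element. The internals of `util.shape` are not shown here, so that is only a guess. The function below, `describe_shape`, is a hypothetical stand-in (not the notebook's helper) that sketches one way to tolerate both frames and series:

```python
import pandas as pd


def describe_shape(**frames):
    """Hypothetical stand-in for shape(): report rows and columns,
    counting a 1-D Series as a single column."""
    lines = []
    for name, obj in frames.items():
        n_rows = obj.shape[0]
        # A Series has a one-element shape tuple, so guard the column lookup
        n_cols = obj.shape[1] if len(obj.shape) > 1 else 1
        lines.append(f"~> [{name}] has {n_rows:,} rows, and {n_cols:,} columns.")
    return lines


frame = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
labels = pd.Series([10, 20, 30])

report = describe_shape(frame=frame, labels=labels)
print("\n".join(report))
```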

In [25]:
test.dropna().shape

(1314, 1049)

In [22]:
train_labels.dropna().shape

(1460,)

In [26]:
test.isnull().sum().sum()

221

In [None]:
temp = pd.read_csv('../')

In [31]:
1

1