## Feature Selection vs. Feature Extraction

#### https://www.analyticsvidhya.com/blog/2021/06/dimensionality-reduction-using-autoencoders-in-python/

Feature Selection
- Recursive Feature Elimination 
- Genetic Feature Selection 
- Sequential Forward Selection 

Feature Extraction
- AutoEncoder 
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
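
To make the distinction concrete, the sketch below contrasts the two families on a small synthetic matrix. It is an illustration under stated assumptions, not an implementation of the methods listed above: selection is done with a simple variance ranking (a stand-in for the wrapper methods in the first list), and extraction uses PCA computed through NumPy's SVD. The function names and the toy data are hypothetical.

```python
import numpy as np


def select_by_variance(X, k):
    """Feature selection stand-in: keep the k original columns with the highest variance."""
    variances = X.var(axis=0)
    keep = np.sort(np.argsort(variances)[::-1][:k])  # column indices, in original order
    return X[:, keep], keep


def extract_with_pca(X, k):
    """Feature extraction: project centered data onto the top-k principal directions."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T


rng = np.random.default_rng(0)
# Toy data: five columns with deliberately different spreads.
X = rng.normal(size=(100, 5)) * np.array([1.0, 5.0, 0.1, 3.0, 0.5])

X_selected, kept_columns = select_by_variance(X, k=2)
X_extracted = extract_with_pca(X, k=2)

print(kept_columns)       # indices of the original columns that were kept
print(X_selected.shape)   # (100, 2): a subset of the original columns
print(X_extracted.shape)  # (100, 2): new columns, each a linear mix of all five inputs
```

Selection returns columns that already exist in the data, so they keep their original meaning. Extraction returns new columns that have to be interpreted through the transformation that produced them.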

Types of autoencoders include
- Deep Autoencoder
- Sparse Autoencoder
- Undercomplete Autoencoder
- Variational Autoencoder
- LSTM Autoencoder

Hyperparameters of an AutoEncoder
- Code size or the number of units in the bottleneck layer
- Input and output size, which is the number of features in the data
- Number of neurons or nodes per layer
- Number of layers in encoder and decoder
- Activation function
- Optimization function
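
The size-related hyperparameters above fix the shape of the network. As a rough sketch, the helper below turns an input size, a list of hidden-layer widths, and a code size into the full list of layer widths, then counts the trainable parameters of the resulting stack of dense layers. It assumes a symmetric design in which the decoder mirrors the encoder; the function names and the configuration values are hypothetical, and the activation and optimizer choices are not covered.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def autoencoder_layer_sizes(input_size, hidden_units, code_size):
    """Return unit counts from input to output for a symmetric autoencoder.

    hidden_units lists the encoder's hidden layers; the decoder mirrors them.
    """
    encoder = [input_size, *hidden_units, code_size]
    decoder = [*reversed(hidden_units), input_size]
    return encoder + decoder


# Hypothetical configuration: 8 input features, two hidden layers, a 3-unit bottleneck.
sizes = autoencoder_layer_sizes(input_size=8, hidden_units=[32, 16], code_size=3)
print(sizes)                         # [8, 32, 16, 3, 16, 32, 8]
print(dense_parameter_count(sizes))  # 1739
```

A smaller code size forces stronger compression, while wider or deeper hidden layers add capacity and parameters.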

Example

https://docs.anaconda.com/anaconda/user-guide/tasks/tensorflow/

In [1]:
#conda create -n tf-gpu tensorflow-gpu

In [4]:
!pip install pandas

Collecting pandas
  Downloading pandas-1.3.4-cp39-cp39-win_amd64.whl (10.2 MB)
Collecting pytz>=2017.3
  Downloading pytz-2021.3-py2.py3-none-any.whl (503 kB)
Installing collected packages: pytz, pandas
Successfully installed pandas-1.3.4 pytz-2021.3


In [7]:
# The PyPI package is named keras-tuner; "kerastuner" has no matching distribution.
!pip install keras-tuner


In [14]:
!pip install matplotlib

Collecting matplotlib
  Downloading matplotlib-3.4.3-cp39-cp39-win_amd64.whl (7.1 MB)
Collecting kiwisolver>=1.0.1
  Downloading kiwisolver-1.3.2-cp39-cp39-win_amd64.whl (52 kB)
Collecting cycler>=0.10
  Downloading cycler-0.11.0-py3-none-any.whl (6.4 kB)
Collecting pillow>=6.2.0
  Downloading Pillow-8.4.0-cp39-cp39-win_amd64.whl (3.2 MB)
Installing collected packages: pillow, kiwisolver, cycler, matplotlib
Successfully installed cycler-0.11.0 kiwisolver-1.3.2 matplotlib-3.4.3 pillow-8.4.0


In [16]:
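# Note: the `sklearn` PyPI package is only a shim that pulls in scikit-learn;
# `pip install scikit-learn` installs the library directly.
!pip install sklearn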

Collecting sklearn
  Using cached sklearn-0.0.tar.gz (1.1 kB)
Collecting scikit-learn
  Downloading scikit_learn-1.0.1-cp39-cp39-win_amd64.whl (7.2 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.0.0-py3-none-any.whl (14 kB)
Collecting joblib>=0.11
  Downloading joblib-1.1.0-py2.py3-none-any.whl (306 kB)
Building wheels for collected packages: sklearn
  Building wheel for sklearn (setup.py): started
  Building wheel for sklearn (setup.py): finished with status 'done'
  Created wheel for sklearn: filename=sklearn-0.0-py2.py3-none-any.whl size=1315 sha256=59b1f344b1454f70a9704998f3d2e22158eb80b0cc7bd383694344db308b7537
  Stored in directory: c:\users\konutech\appdata\local\pip\cache\wheels\e4\7b\98\b6466d71b8d738a0c547008b9eb39bf8676d1ff6ca4b22af1c
Successfully built sklearn
Installing collected packages: threadpoolctl, joblib, scikit-learn, sklearn
Successfully installed joblib-1.1.0 scikit-learn-1.0.1 sklearn-0.0 threadpoolctl-3.0.0


In [57]:
!pip install keras

Collecting keras
  Using cached keras-2.7.0-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: keras
Successfully installed keras-2.7.0

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.5.0 requires flatbuffers~=1.12, but you have flatbuffers 20210226132247 which is incompatible.
tensorflow 2.5.0 requires grpcio~=1.34.0, but you have grpcio 1.36.1 which is incompatible.


In [59]:
import math

import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
# import kerastuner.tuners as kt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.losses import MeanSquaredLogarithmicError

In [12]:
print(tf.__version__)

2.5.0


In [None]:
#pd.options.display.float_format = '{:.1f}'.format

In [18]:
california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")

In [22]:
california_housing_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  17000 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB


In [21]:
california_housing_dataframe.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
| 1 | -114.47 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.82 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.925 | 65500.0 |


In [20]:
california_housing_dataframe.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 |
| mean | -119.562108 | 35.625225 | 28.589353 | 2643.664412 | 539.410824 | 1429.573941 | 501.221941 | 3.883578 | 207300.912353 |
| std | 2.005166 | 2.13734 | 12.586937 | 2179.947071 | 421.499452 | 1147.852959 | 384.520841 | 1.908157 | 115983.764387 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.79 | 33.93 | 18.0 | 1462.0 | 297.0 | 790.0 | 282.0 | 2.566375 | 119400.0 |
| 50% | -118.49 | 34.25 | 29.0 | 2127.0 | 434.0 | 1167.0 | 409.0 | 3.5446 | 180400.0 |
| 75% | -118.0 | 37.72 | 37.0 | 3151.25 | 648.25 | 1721.0 | 605.25 | 4.767 | 265000.0 |
| max | -114.31 | 41.95 | 52.0 | 37937.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In [31]:
# data in google colab
TRAIN_DATA_PATH = "https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv"
TEST_DATA_PATH = "https://download.mlcc.google.com/mledu-datasets/california_housing_test.csv"
TARGET_NAME = 'median_house_value'

In [32]:
train_data = pd.read_csv(TRAIN_DATA_PATH, sep=",")
test_data = pd.read_csv(TEST_DATA_PATH, sep=",")

x_train, y_train = train_data.drop(TARGET_NAME, axis=1), train_data[TARGET_NAME]
x_test, y_test = test_data.drop(TARGET_NAME, axis=1),  test_data[TARGET_NAME]

In [36]:
def scale_datasets(x_train, x_test):
    """
    Min-max scale the train and test features.
    The scaler is fit on the training data only and then applied to both sets.
    """
    scaler = MinMaxScaler()

    x_train_scaled = pd.DataFrame(
        scaler.fit_transform(x_train),
        columns=x_train.columns
    )

    x_test_scaled = pd.DataFrame(
        scaler.transform(x_test),
        columns=x_test.columns
    )

    return x_train_scaled, x_test_scaled

In [37]:
x_train_scaled, x_test_scaled = scale_datasets(x_train, x_test)
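
`MinMaxScaler` rescales each feature with x' = (x - min) / (max - min), where min and max come from the data it was fit on. Because the function above fits on the training set only, a test value outside the training range lands outside [0, 1]. The pure-Python sketch below works through that arithmetic; the helper names and the numbers are made up for illustration.

```python
def fit_min_max(column):
    """Learn the minimum and maximum of a training column."""
    return min(column), max(column)


def apply_min_max(column, lo, hi):
    """Rescale values with x' = (x - lo) / (hi - lo)."""
    return [(x - lo) / (hi - lo) for x in column]


# Made-up values for one feature.
train_ages = [1.0, 18.0, 29.0, 52.0]
test_ages = [10.0, 60.0]  # 60.0 is larger than anything seen in training

lo, hi = fit_min_max(train_ages)  # statistics come from the training data only
train_scaled = apply_min_max(train_ages, lo, hi)
test_scaled = apply_min_max(test_ages, lo, hi)

print(train_scaled[0], train_scaled[-1])  # 0.0 1.0
print(test_scaled[1] > 1.0)               # True
```

Fitting on the training data alone keeps information about the test set out of the preprocessing step.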

In [103]:
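class AutoEncoders(Model):
    """Autoencoder that compresses the input features into a 3-unit code."""

    def __init__(self, output_units):
        super().__init__()
        # Encoder: narrows the input step by step down to the bottleneck.
        self.encoder = Sequential(
            [
                Dense(32, activation="relu"),
                Dense(16, activation="relu"),
                Dense(3, activation="relu")  # bottleneck (code size = 3)
            ]
        )

        # Decoder: mirrors the encoder; the sigmoid output matches inputs scaled to [0, 1].
        self.decoder = Sequential(
            [
                Dense(16, activation="relu"),
                Dense(32, activation="relu"),
                Dense(output_units, activation="sigmoid")
            ]
        )

    def call(self, inputs):
        # Compress to the code, then reconstruct the original features from it.
        encoded = self.encoder(inputs)
        decoded = self.decoder(encoded)
        return decoded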

In [104]:
auto_encoder = AutoEncoders(len(x_train_scaled.columns))

auto_encoder.compile(
    loss='mae',
    metrics=['mae'],
    optimizer='adam'
)
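
The `'mae'` loss compares each reconstructed value with the corresponding input value and averages the absolute differences. The sketch below computes that quantity by hand on made-up numbers; the function name and the data are hypothetical, and the model's actual reconstructions are not shown here.

```python
def mean_absolute_error(inputs, reconstructions):
    """Average |input - reconstruction| over every value in every row."""
    total, count = 0.0, 0
    for row_in, row_out in zip(inputs, reconstructions):
        for x, x_hat in zip(row_in, row_out):
            total += abs(x - x_hat)
            count += 1
    return total / count


# Made-up scaled inputs and reconstructions (two rows, three features).
inputs = [[0.0, 0.5, 1.0], [0.25, 0.75, 0.5]]
reconstructions = [[0.0, 0.25, 1.0], [0.5, 0.75, 0.0]]

# Absolute differences: 0, 0.25, 0, 0.25, 0, 0.5, which sum to 1.0 over 6 values.
print(mean_absolute_error(inputs, reconstructions))  # 1.0 / 6, about 0.167
```

A lower value means the decoder recovers the scaled inputs more closely from the compressed code.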

In [105]:
history = auto_encoder.fit(
    x_train_scaled,
    x_train_scaled,
    epochs=15,
    batch_size=32,
    validation_data=(x_test_scaled, x_test_scaled)
)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [109]:
layer_names = [layer.name for layer in auto_encoder.layers]

In [110]:
layer_names

['sequential_10', 'sequential_11']

In [111]:
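# Auto-generated names such as 'sequential_10' change between sessions,
# so reference the encoder through the model attribute instead.
encoder_layer = auto_encoder.encoder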

In [112]:
reduced_df = pd.DataFrame(encoder_layer.predict(x_train_scaled))

In [113]:
reduced_df

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.457429 | 1.755553 | 0.0 |
| 1 | 0.609428 | 1.764719 | 0.0 |
| 2 | 0.369452 | 2.083229 | 0.0 |
| 3 | 0.308045 | 1.952719 | 0.0 |
| 4 | 0.430631 | 2.229617 | 0.0 |
| ... | ... | ... | ... |
| 16995 | 3.125538 | 2.452579 | 0.0 |
| 16996 | 2.828446 | 1.819286 | 0.0 |
| 16997 | 2.700769 | 1.132381 | 0.0 |
| 16998 | 2.744479 | 1.214879 | 0.0 |
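
In the rows displayed above, column 2 is 0.0 throughout. That would be consistent with a bottleneck unit whose ReLU never activates, although the displayed rows alone cannot confirm it for the whole dataset. The sketch below shows one way to check for such columns, using only the first five displayed rows copied into a plain list; the function name is hypothetical.

```python
def constant_zero_columns(rows):
    """Return indices of columns whose value is 0.0 in every given row."""
    n_columns = len(rows[0])
    return [j for j in range(n_columns) if all(row[j] == 0.0 for row in rows)]


# First five rows of the encoded output shown above.
displayed_rows = [
    [0.457429, 1.755553, 0.0],
    [0.609428, 1.764719, 0.0],
    [0.369452, 2.083229, 0.0],
    [0.308045, 1.952719, 0.0],
    [0.430631, 2.229617, 0.0],
]

print(constant_zero_columns(displayed_rows))  # [2]
```

On the full DataFrame, `(reduced_df == 0).all()` gives the equivalent per-column answer. If a column turns out to be zero for every row, the effective code size is smaller than the three units configured.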
