# 1. References 
This notebook draws heavily on the concepts and implementations in:
* https://github.com/JingweiToo/Wrapper-Feature-Selection-Toolbox-Python
* https://www.kaggle.com/dwin183287/tps-august-2021-eda-base-model

Thanks to the authors for making such great implementations.

# 2. Importing essential libraries and csv files

In [None]:
# Import essential libraries (duplicates removed)
import os
import random
import time
import warnings
from random import randrange

import joblib
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xg
from matplotlib import ticker
from numpy.random import rand
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler


# setting up options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('float_format', '{:f}'.format)
warnings.filterwarnings('ignore')

# import datasets
train_df = pd.read_csv('../input/tabular-playground-series-aug-2021/train.csv')
test_df = pd.read_csv('../input/tabular-playground-series-aug-2021/test.csv')
submission = pd.read_csv('../input/tabular-playground-series-aug-2021/sample_submission.csv')

# convert columns that hold only whole numbers to integer type
# (an element-wise comparison avoids the case where fractional parts
# of opposite sign cancel out in a sum)
for df in (train_df, test_df):
    for col in df.columns:
        if (df[col] == df[col].astype('int')).all():
            df[col] = df[col].astype('int')

[back to top](#table-of-contents)
<a id="3"></a>
# 3. Dataset Overview
The intent of this overview is to get a feel for the data and its structure in the train, test and submission files. The overview of the train and test datasets includes a quick analysis of missing values and basic statistics, while the sample submission is loaded to see the expected submission format.

<a id="3.1"></a>
## 3.1 Train dataset
The train dataset is mainly used to train the predictive model, since it contains the target variable. It is also used to explore the data itself, including the relation between each predictor and the target variable.

**Observations:**
- **target**
    - The `loss` column is the target variable, which is only available in the `train` dataset.
    - Interestingly, the `loss` column is of type `int64`. **Is this a classification task with a regression evaluation metric?**
- **features**
    - There are `100` features, named `f0` to `f99`.
    - The `train` dataset contains `250,000` observations without any missing values, with a total of `102` columns.
    - Only `id`, `f1`, `f16`, `f27`, `f55`, `f60`, and `f86` are of type `int64`; the other features are `float64`.
    - The means of `f31`, `f36`, `f46` and `f78` are quite close to the mean of the target variable.

### 3.1.1 Quick view
Below are the first 5 rows of the train dataset:

In [None]:
train_df.head()

The dimensions and the number of missing values in the train dataset are shown below:

In [None]:
print(f'Number of rows: {train_df.shape[0]};  Number of columns: {train_df.shape[1]}; No of missing values: {sum(train_df.isna().sum())}')

### 3.1.2 Data types
Except for the `id`, `f1`, `f16`, `f27`, `f55`, `f60`, `f86` and `loss` columns, which are of type `int64`, all columns are `float64`.

In [None]:
train_df.dtypes

### 3.1.3 Basic statistics
Below are the basic statistics for each variable: `count`, `mean`, `standard deviation`, `minimum`, `1st quartile`, `median`, `3rd quartile` and `maximum`.

In [None]:
train_df.describe()

[back to top](#table-of-contents)
<a id="3.2"></a>
## 3.2 Test dataset
The test dataset is used to make predictions with the previously trained model. Exploring this dataset is also needed to see how the data is structured and, especially, how similar it is to the train dataset.

**Observations:**

The feature columns in the `test` dataset are similar to those in `train`, with details as follows:
- There are `100` features, named `f0` to `f99`.
- The `test` dataset contains `150,000` observations without any missing values, with a total of `101` columns.
- Only `id`, `f1`, `f16`, `f27`, `f55`, `f60` and `f86` are of type `int64`; the other features are `float64`.

### 3.2.1 Quick view
Below are the first 5 rows of the test dataset:

In [None]:
test_df.head()

In [None]:
print(f'Number of rows: {test_df.shape[0]};  Number of columns: {test_df.shape[1]}; No of missing values: {sum(test_df.isna().sum())}')

### 3.2.2 Data types
Except for the `id`, `f1`, `f16`, `f27`, `f55`, `f60` and `f86` columns, which are of type `int64`, all columns are `float64`, which is consistent with the train dataset. The test dataset has no `loss` column.

In [None]:
test_df.dtypes

[back to top](#table-of-contents)
<a id="3.3"></a>
## 3.3 Submission
The submission file is expected to have `id` and `loss` columns.

Below are the first 5 rows of the submission file:

In [None]:
submission.head()

[back to top](#table-of-contents)
<a id="4"></a>
# 4. Feature Selection using Particle Swarm Optimization

### 4.1. Creating an array of X (Features) and y (Targets)

* First, we put the feature values into `X_train` by dropping the `id` and `loss` columns.
* Then, we put the loss values into `y_train`.

In [None]:
X_train = train_df.drop(['id','loss'], axis=1).values
y_train = train_df['loss'].values

### 4.2. Scaling features to unit variance

Next, we use `StandardScaler`, which removes the mean and scales each feature to unit variance.

The idea behind `StandardScaler` is to transform the data so that each feature has a mean of 0 and a standard deviation of 1.

The operation is performed feature-wise, that is, independently for each column of the data.

Each value has the mean of its feature subtracted and is then divided by the standard deviation of that feature.
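
As a small illustration of the standardization described above, the sketch below compares a manual computation with `StandardScaler` on a toy matrix. The array `X_toy` and its values are made up for this example and are not part of the competition data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: 4 samples, 2 features on very different scales (illustrative values).
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

# Manual standardization: subtract the column mean, divide by the column std.
# np.std uses the population standard deviation (ddof=0) by default,
# which is also what StandardScaler uses.
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The same transformation through scikit-learn.
X_sklearn = StandardScaler().fit_transform(X_toy)

print(np.allclose(X_manual, X_sklearn))
print(X_sklearn.mean(axis=0), X_sklearn.std(axis=0))
```

After the transformation both columns have mean 0 and standard deviation 1, even though their original scales differ by a factor of 100.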

References and further reading:
1. https://stackoverflow.com/questions/40758562/can-anyone-explain-me-standardscaler
2. https://towardsdatascience.com/how-and-why-to-standardize-your-data-996926c2c832

In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)

Then we check the shapes of our data to confirm that everything is OK.

In [None]:
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)

### 4.3. The beginning of the PSO Algorithm

* **Introduction**

First, here is a quick and basic introduction to the algorithm.

Particle swarm optimization was introduced by Kennedy and Eberhart in their 1995 paper 'Particle Swarm Optimization'; the inertia weight used below was added by Shi and Eberhart in 1998. It locates the minimum of a function by creating a number of 'particles'. Each particle stores its own best position and also has access to the global best position. It is this combination of local and global information that gives rise to 'swarm intelligence'.

Within an iteration, a particle moves slightly towards both the swarm's best position and its own personal best position, so that the particles eventually converge on (hopefully) the global minimum.

Mathematically, the velocity and position updates are defined as follows:

$ v_i^{t + 1}=\omega v_i^t + \phi_br_b(x_{i_b}-x_i) + \phi_gr_g(g_b-x_i) $

$ x_i^{t + 1}=x_i^t + v_i^{t + 1} $

Initially, every particle is given a random velocity $ v_i $, and the function is evaluated for every particle. Each particle is now 'aware' of its previous best position as well as the global best position. On the first iteration, its previous best position is simply its current position, so this term doesn't come into play until the second iteration.

Its current velocity is first scaled by the inertia weight $ \omega $ (`w` in the code) to keep particle velocities from growing exponentially over the iterations.

The term $ \phi_br_b(x_{i_b}-x_i) $ then represents the vector from the particle's position towards its previous best position. It is scaled by a constant $ \phi_b $ (referred to as `c1` in the code) and a random value $ r_b $ drawn uniformly between 0 and 1. This random scaling provides the stochastic element of this optimization scheme.

Likewise, $ \phi_gr_g(g_b-x_i) $ represents the vector from the particle's position towards the swarm's best position, again scaled by a constant $ \phi_g $ (`c2` in the code) and a random variable $ r_g $. Varying these $ \phi $ (or `c`) parameters affects how locally or globally the particle explores the search space: a higher value gives a larger vector, so the particle weights that component more heavily than the other.
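
To make the update rule concrete, here is a minimal sketch of continuous PSO on a toy function. The function `pso_minimize`, the `sphere` objective, and the parameter values (`w=0.7`, `c1=c2=1.5`, search range, seed) are assumptions chosen for illustration. They are not the settings used later in this notebook, and the sketch omits velocity clamping.

```python
import numpy as np

def pso_minimize(f, dim=2, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal continuous PSO sketch; all defaults are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5.0, 5.0, size=(n_particles, dim))   # positions
    v = rng.uniform(-1.0, 1.0, size=(n_particles, dim))   # velocities
    pbest = x.copy()                                      # personal best positions
    pbest_val = np.apply_along_axis(f, 1, x)              # personal best values
    g = pbest[np.argmin(pbest_val)].copy()                # global best position

    for _ in range(n_iter):
        r_b = rng.random((n_particles, dim))
        r_g = rng.random((n_particles, dim))
        # Velocity update: inertia + pull toward personal best + pull toward global best.
        v = w * v + c1 * r_b * (pbest - x) + c2 * r_g * (g - x)
        # Position update uses the freshly updated velocity.
        x = x + v
        vals = np.apply_along_axis(f, 1, x)
        improved = vals < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = vals[improved]
        g = pbest[np.argmin(pbest_val)].copy()
    return g, pbest_val.min()

def sphere(p):
    # Simple convex test function with its minimum of 0 at the origin.
    return float(np.sum(p ** 2))

best_pos, best_val = pso_minimize(sphere)
print(best_pos, best_val)
```

On this simple objective the swarm should end up close to the origin; harder objectives may need more particles, more iterations, or different coefficients.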

In the next part, I explain how this algorithm can be adapted to feature selection.

* **PSO Algorithm for Feature Selection**

I will break this implementation down for better understanding.

The implementation consists of the following 7 functions:

**1. Error Rate Function:**

This is a simple function for measuring the error of a subset of selected features.
It has four inputs:
1. `xtrain`
2. `ytrain`
3. `x`: an array whose length equals the number of features. Each entry is 0 or 1: 0 indicates that the feature at that index is not selected, and 1 indicates that it is selected. An example of this array:

X: [1 1 1 1 0 1 1 0 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 1 0 0 1
 0 1 0 1 0 1 0 0 0 0 1 1 0 0 1 1 1 1 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 0 0 0
 1 0 1 0 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0]

4. `opts`: a dictionary containing the required variables and parameters, which I explain further below.

In general, this function builds the training and validation sets from the subset of features (`x`), trains a model on these features, and measures its error. Note that in the code the data actually comes from the hold-out split stored in `opts['fold']`; the `xtrain` and `ytrain` arguments are overwritten. The same scheme can measure error (for regression and classification) or accuracy (for classification). This competition is evaluated on error, so the function measures the error (the RMSE of a linear regression on the hold-out split).
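
The column selection at the heart of this function is plain boolean indexing. The sketch below shows it on a toy array; `X_toy` and `mask` are made-up values for illustration.

```python
import numpy as np

# Toy data: 3 samples, 5 features (values are illustrative only).
X_toy = np.arange(15).reshape(3, 5)

# Binary particle: 1 keeps the feature at that index, 0 drops it.
mask = np.array([1, 0, 1, 1, 0])

# Boolean indexing on the column axis keeps only the selected features.
X_selected = X_toy[:, mask == 1]

print(X_selected.shape)        # (3, 3)
print(np.flatnonzero(mask))    # [0 2 3]
```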

**2. Fun Function:**

In general, this function gets the error for the selected features from the error rate function and then calculates the objective function: a weighted sum of the error and the fraction of features selected.
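
As a worked example of this objective, the sketch below evaluates the weighted sum for two subset sizes with the same error. The helper name `weighted_cost` and the input numbers are hypothetical; `alpha = 0.99` matches the value used in `Fun`.

```python
def weighted_cost(error, num_selected, num_total, alpha=0.99):
    """Weighted sum of model error and the fraction of features kept."""
    beta = 1 - alpha
    return alpha * error + beta * (num_selected / num_total)

# Hypothetical numbers: same error, different subset sizes.
cost_small = weighted_cost(error=7.9, num_selected=40, num_total=100)
cost_large = weighted_cost(error=7.9, num_selected=80, num_total=100)
print(cost_small, cost_large)
```

With equal error, the smaller subset gets the lower cost. Because an RMSE is not bounded by 1 the way a classification error rate is, the feature-count term only acts as a tie-breaker at this `alpha`.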

**3. init position Function:**

As mentioned above, every feature index has a value of zero (0) or one (1) (not selected or selected). So how do we decide which ones to set to 1 and which to 0? Randomly, of course! (This follows from the definition of the algorithm given above.)

So, in this function, we build N arrays (N is the population size), each with one entry per feature, and initialize every entry with a random value between 0 and 1.

Don't worry, it will make more sense as we move forward.

**4. init velocity Function:**

Velocity in the Particle Swarm Optimization algorithm (PSO) is one of its major features, as it is the mechanism used to move (evolve) the position of a particle to search for optimal solutions. ... This velocity regulation aims to achieve a balance between exploration and exploitation.

So, in this function we initialize the velocity of each particle randomly in [-Vmax, Vmax].
The maximum and minimum velocities are derived from the lower and upper bounds of the positions, which are zero (0) and one (1) for feature selection, so Vmax = (ub - lb) / 2 = 0.5.
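
Both initialization steps (positions and velocities) can also be written without explicit loops. The following is a vectorized sketch under the same bounds; the toy sizes and the seeded `rng` generator are assumptions for illustration and differ from the `rand()` calls used in the notebook code.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen only to make the sketch reproducible

n_particles, dim = 4, 6          # toy sizes, not the notebook's settings
lb = np.zeros((1, dim))          # lower bound per feature
ub = np.ones((1, dim))           # upper bound per feature

# Positions: uniform in [lb, ub) for every particle and dimension.
X = lb + (ub - lb) * rng.random((n_particles, dim))

# Velocity limits: half the search range, symmetric around zero.
Vmax = (ub - lb) / 2
Vmin = -Vmax

# Velocities: uniform in [Vmin, Vmax).
V = Vmin + (Vmax - Vmin) * rng.random((n_particles, dim))

print(X.shape, V.shape)
```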

**5. binary conversion Function:**

Recall that the init position function assigns each dimension of each of the N particles a random value between 0 and 1. Since each feature index should have a value of zero or one, we convert these continuous values to 0 and 1 by applying a threshold (I set the threshold to 0.5).
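
The thresholding step can be expressed as a single comparison. A minimal sketch, with made-up positions in `X_cont`:

```python
import numpy as np

# Hypothetical continuous positions for 2 particles and 5 features.
X_cont = np.array([[0.10, 0.51, 0.50, 0.90, 0.49],
                   [0.75, 0.20, 0.60, 0.05, 0.99]])
thres = 0.5

# Strictly greater than the threshold maps to 1, everything else to 0.
X_bin = (X_cont > thres).astype(int)
print(X_bin)
# [[0 1 0 1 0]
#  [1 0 1 0 1]]
```

Note that a value exactly equal to the threshold is mapped to 0, matching the strict `>` comparison in `binary_conversion`.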

**6. boundary Function:**

In this function we make sure that no position value is greater than the upper bound (which is 1) or smaller than the lower bound (which is 0).
The same function is also used for the velocity values (upper: Vmax, lower: Vmin).
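
For whole arrays, the same clamping is available as `np.clip`. A short sketch with hypothetical velocity values:

```python
import numpy as np

V = np.array([-0.9, -0.2, 0.0, 0.3, 1.4])  # hypothetical velocities
Vmin, Vmax = -0.5, 0.5

# Values below Vmin become Vmin, values above Vmax become Vmax.
V_clipped = np.clip(V, Vmin, Vmax)
print(V_clipped)  # [-0.5 -0.2  0.   0.3  0.5]
```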

**7. jfs Function:**

So, let's put all of these together and run the algorithm using the functions above.
I've commented the code so that you can see what is happening.

In [None]:
# error rate
def error_rate(xtrain, ytrain, x, opts):
    # parameters
    fold = opts['fold']
    xt = fold['xt']
    yt = fold['yt']
    xv = fold['xv']
    yv = fold['yv']
    # number of instances
    num_train = np.size(xt, 0)
    num_valid = np.size(xv, 0)
    # Define selected features
    xtrain = xt[:, x == 1]
    ytrain = yt.reshape(num_train)
    xvalid = xv[:, x == 1]
    yvalid = yv.reshape(num_valid)
    # Training
    mdl     = LinearRegression()
    mdl.fit(xtrain, ytrain)
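    # Prediction
    ypred   = mdl.predict(xvalid)
    # RMSE; np.sqrt of the MSE works whether or not the installed
    # scikit-learn still accepts `squared=False`
    error   = np.sqrt(mean_squared_error(yvalid, ypred))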
    
    return error

In [None]:
# Error rate & Feature size
def Fun(xtrain, ytrain, x, opts):
    # parameters
    alpha = 0.99
    beta = 1 - alpha
    # original feature size
    max_feat = len(x)
    # Number of selected features
    num_feat = np.sum(x == 1)
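    # Penalize the empty subset. RMSE is not bounded by 1, so use infinity
    # to make sure an empty subset never beats a real one
    if num_feat == 0:
        cost = float('inf')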
    else:
        # Get error rate
        error = error_rate(xtrain, ytrain, x, opts)
        # Objective function
        cost = alpha * error + beta * (num_feat / max_feat)
        
    return cost

In [None]:
def init_position(lb, ub, N, dim):
    X = np.zeros([N, dim], dtype='float')
    for i in range(N):
        for d in range(dim):
            X[i,d] = lb[0,d] + (ub[0,d] - lb[0,d]) * rand()        
    
    return X

In [None]:
def init_velocity(lb, ub, N, dim):
    V    = np.zeros([N, dim], dtype='float')
    Vmax = np.zeros([1, dim], dtype='float')
    Vmin = np.zeros([1, dim], dtype='float')
    # Maximum & minimum velocity
    for d in range(dim):
        Vmax[0,d] = (ub[0,d] - lb[0,d]) / 2
        Vmin[0,d] = -Vmax[0,d]
        
    for i in range(N):
        for d in range(dim):
            V[i,d] = Vmin[0,d] + (Vmax[0,d] - Vmin[0,d]) * rand()
        
    return V, Vmax, Vmin

In [None]:
def binary_conversion(X, thres, N, dim):
    Xbin = np.zeros([N, dim], dtype='int')
    for i in range(N):
        for d in range(dim):
            if X[i,d] > thres:
                Xbin[i,d] = 1
            else:
                Xbin[i,d] = 0
    
    return Xbin

In [None]:
def boundary(x, lb, ub):
    if x < lb:
        x = lb
    if x > ub:
        x = ub
    
    return x

In [None]:
def jfs(xtrain, ytrain, opts):
    # Parameters
    ub    = 1
    lb    = 0
    thres = 0.5
    w     = 0.9    # inertia weight
    c1    = 2      # acceleration factor
    c2    = 2      # acceleration factor
    
    N        = opts['N']
    max_iter = opts['T']
    if 'w' in opts:
        w    = opts['w']
    if 'c1' in opts:
        c1   = opts['c1']
    if 'c2' in opts:
        c2   = opts['c2'] 
    
    # Dimension
    dim = np.size(xtrain, 1)
    if np.size(lb) == 1:
        ub = ub * np.ones([1, dim], dtype='float')
        lb = lb * np.ones([1, dim], dtype='float')
        
    # Initialize position & velocity
    X             = init_position(lb, ub, N, dim)
    V, Vmax, Vmin = init_velocity(lb, ub, N, dim) 
    
    # Pre
    fit   = np.zeros([N, 1], dtype='float')
    Xgb   = np.zeros([1, dim], dtype='float')
    fitG  = float('inf')
    Xpb   = np.zeros([N, dim], dtype='float')
    fitP  = float('inf') * np.ones([N, 1], dtype='float')
    curve = np.zeros([1, max_iter], dtype='float') 
    t     = 0
    
    while t < max_iter:
        # Binary conversion
        Xbin = binary_conversion(X, thres, N, dim)
        
        # Fitness
        for i in range(N):
            fit[i,0] = Fun(xtrain, ytrain, Xbin[i,:], opts)
            if fit[i,0] < fitP[i,0]:
                Xpb[i,:]  = X[i,:]
                fitP[i,0] = fit[i,0]
            if fitP[i,0] < fitG:
                Xgb[0,:]  = Xpb[i,:]
                fitG      = fitP[i,0]
        
        # Store result
        curve[0,t] = fitG   # fitG is a scalar, so no copy is needed
        print("Iteration:", t + 1)
        print("Best (PSO):", curve[0,t])
        t += 1
        
        for i in range(N):
            for d in range(dim):
                # Update velocity
                r1     = rand()
                r2     = rand()
                V[i,d] = w * V[i,d] + c1 * r1 * (Xpb[i,d] - X[i,d]) + c2 * r2 * (Xgb[0,d] - X[i,d]) 
                # Boundary
                V[i,d] = boundary(V[i,d], Vmin[0,d], Vmax[0,d])
                # Update position
                X[i,d] = X[i,d] + V[i,d]
                # Boundary
                X[i,d] = boundary(X[i,d], lb[0,d], ub[0,d])
    
                
    # Best feature subset
    Gbin       = binary_conversion(Xgb, thres, 1, dim) 
    Gbin       = Gbin.reshape(dim)
    pos        = np.asarray(range(0, dim))    
    sel_index  = pos[Gbin == 1]
    num_feat   = len(sel_index)
    # Create dictionary
    pso_data = {'sf': sel_index, 'c': curve, 'nf': num_feat}
    
    return pso_data

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(X_train, y_train, test_size=0.3, shuffle=True)
fold = {'xt':xtrain, 'yt':ytrain, 'xv':xtest, 'yv':ytest}

In [None]:
c1  = 2       # cognitive factor
c2  = 2       # social factor
w   = 0.9     # inertia weight
k   = 5       # k-value for KNN in the original toolbox; unused here (the model is LinearRegression)
N   = 20      # number of particles (population size)
T   = 100     # maximum number of iterations
opts = {'k':k, 'fold':fold, 'N':N, 'T':T, 'w':w, 'c1':c1, 'c2':c2}

In [None]:
# perform feature selection
start_time = time.time()
fmdl  = jfs(X_train, y_train, opts)
print("Run Time --- %s seconds ---" % (time.time() - start_time))

sf    = fmdl['sf']

# model with selected features
num_train = np.size(xtrain, 0)
num_valid = np.size(xtest, 0)
x_train   = xtrain[:, sf]
y_train   = ytrain.reshape(num_train)  # Solve bug
x_valid   = xtest[:, sf]
y_valid   = ytest.reshape(num_valid)  # Solve bug

mdl       = LinearRegression()
mdl.fit(x_train, y_train)

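# validation RMSE
y_pred    = mdl.predict(x_valid)
# np.sqrt of the MSE works whether or not the installed scikit-learn still accepts `squared=False`
RMSE      = np.sqrt(mean_squared_error(y_valid, y_pred))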
print("RMSE:", RMSE)

# number of selected features
num_feat = fmdl['nf']
print("Feature Size:", num_feat)

# plot convergence
curve   = fmdl['c']
curve   = curve.reshape(np.size(curve,1))
x       = np.arange(0, opts['T'], 1.0) + 1.0

fig, ax = plt.subplots()
ax.plot(x, curve, 'o-')
ax.set_xlabel('Number of Iterations')
ax.set_ylabel('Fitness')
ax.set_title('PSO')
ax.grid()
plt.show()

This variable holds the indices of the selected features, which we can use to select the same columns in the test dataframe.

In [None]:
fmdl['sf']
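
The indices in `fmdl['sf']` refer to column positions after `id` and `loss` are dropped, so index `i` corresponds to feature `f{i}`. Here is a minimal sketch of mapping indices back to column names on a toy frame; `toy_df` and `sf_toy` are hypothetical stand-ins for `train_df` and `fmdl['sf']`.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column layout as the competition data (id, f0..f4, loss).
toy_df = pd.DataFrame(np.zeros((2, 7)),
                      columns=['id', 'f0', 'f1', 'f2', 'f3', 'f4', 'loss'])

# Hypothetical selection result, standing in for fmdl['sf'].
sf_toy = np.array([0, 2, 3])

# Same column order as the feature matrix built from the frame.
feature_cols = toy_df.drop(['id', 'loss'], axis=1).columns
selected_names = feature_cols[sf_toy].tolist()
print(selected_names)  # ['f0', 'f2', 'f3']
```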

In [None]:
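X_test = test_df.drop(['id'], axis=1).values
# Reuse the scaler fitted on the training features (all 100 columns) instead of
# refitting on the test set, then keep only the selected columns
X_test = sc.transform(X_test)
X_test = X_test[:, sf]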

## 6.1 Regression
The model that will be evaluated is `Linear Regression`.

Now, this is the interesting part of the story!

### 6.1.1 Linear Regression

In [None]:
y_test_pred = mdl.predict(X_test)
y_test_pred

In [None]:
my_submission = pd.DataFrame({'id': test_df.id, 'loss': y_test_pred})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)

In [None]:
my_submission.head()

**Thanks for reading my notebook!**

**I hope you can use it for your own tasks and do great work!**

**I implemented this algorithm not just for the TPS competition; you can use it in any task you want (image processing, signal processing, ...).**

**Your support motivates me to implement more enhanced algorithms (ISSA, ISCA, Jaya, ...) for various tasks.**