# Feature Selection

**Input**: Cleaned data from the "Data Cleaning" phase and a list of feature correlations from the "EDA" phase.

**Output**: A list of selected features to use for modeling. We will use this list to filter the pivot dataframe and save it to PostgreSQL for further use.

In this phase, we will select features that will be used to create the models.

The selection consists of two main steps aimed at reducing the number of features used:

1. Remove features using the "feature_correlation_0.8" document created in the "EDA" phase.
2. Remove features using known methods such as univariate selection and recursive feature elimination.

**Question**: Should each target have its own set of "main" features, or can we use a set of common features for every target we want to predict?

To answer this question, we need to identify features that are useful for specific targets and then remove the ones that have no value for many of them.


## Removing Features with High Correlation

Let's load the document containing feature correlations created during the "EDA" phase.

In [1]:
# move to root to simplify imports
%cd ..

C:\Users\marco\PycharmProjects\portfolio-optimization


In [2]:
from pymongo import MongoClient
from configparser import ConfigParser
import pandas as pd

parser = ConfigParser()
_ = parser.read("credentials.cfg")
username = parser.get("mongo_db", "username")
password = parser.get("mongo_db", "password")

connection_string = f"mongodb+srv://{username}:{password}@cluster0.3dxfmjo.mongodb.net/?" \
                    f"retryWrites=true&w=majority"

client = MongoClient(connection_string)

def get_document(collection_name, document_id):
    database = client['portfolio']
    return database[collection_name].find({'_id':document_id}).next()

feature_correlation = get_document('feature_selection','feature_correlation_0.8')
corr_df = pd.DataFrame(feature_correlation['data'], columns=feature_correlation['data'][0].keys())
corr_df

,col_1,col_2,corr,c_1,c_2,corr_1,corr_2
0,feature100,feature143,0.920896,750,750,0.112123,0.116308
1,feature102,feature118,0.925882,755,755,0.079067,0.079087
2,feature102,feature159,0.935671,755,755,0.079067,0.076662
3,feature102,feature175,0.921395,755,755,0.079067,0.076680
4,feature102,feature176,0.907246,755,755,0.079067,0.080287
...,...,...,...,...,...,...,...
134,feature99,feature102,0.928987,755,755,0.076532,0.079067
135,feature99,feature118,0.983688,755,755,0.076532,0.079087
136,feature99,feature159,0.998215,755,755,0.076532,0.076662
137,feature99,feature175,0.984846,755,755,0.076532,0.076680


We create a dataframe called *corr_df* that contains the correlation data from the "feature_correlation_0.8" document stored in MongoDB. Each row describes one pair of features. The columns are as follows:
- col_1: ID of the first feature
- col_2: ID of the second feature
- corr: absolute correlation between col_1 and col_2
- c_1: count of data points for col_1
- c_2: count of data points for col_2
- corr_1: mean absolute correlation of col_1 with the targets
- corr_2: mean absolute correlation of col_2 with the targets
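The "EDA" phase code is not reproduced here, so the following is only a sketch of how a table with these columns could be built, assuming plain Pearson correlations computed with pandas. The function name `correlated_pairs` and the toy data are hypothetical and not part of the project.

```python
import numpy as np
import pandas as pd


def correlated_pairs(features, targets, threshold=0.8):
    """List feature pairs whose absolute correlation is at least `threshold`."""
    feature_corr = features.corr().abs()
    # mean absolute correlation of every feature with all targets
    full_corr = pd.concat([features, targets], axis=1).corr().abs()
    target_corr = full_corr.loc[features.columns, targets.columns].mean(axis=1)
    counts = features.count()  # non-null data points per feature

    rows = []
    cols = list(features.columns)
    for i, c1 in enumerate(cols):
        for c2 in cols[i + 1:]:  # visit each unordered pair once
            if feature_corr.loc[c1, c2] >= threshold:
                rows.append({"col_1": c1, "col_2": c2, "corr": feature_corr.loc[c1, c2],
                             "c_1": counts[c1], "c_2": counts[c2],
                             "corr_1": target_corr[c1], "corr_2": target_corr[c2]})
    return pd.DataFrame(rows, columns=["col_1", "col_2", "corr", "c_1", "c_2", "corr_1", "corr_2"])


# Toy data (hypothetical): featB is a noisy copy of featA, featC is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
toy_features = pd.DataFrame({"featA": a,
                             "featB": a + 0.1 * rng.normal(size=300),
                             "featC": rng.normal(size=300)})
toy_targets = pd.DataFrame({"targetX": a + rng.normal(size=300)})
pairs = correlated_pairs(toy_features, toy_targets)
print(pairs)
```

On this toy data only the featA/featB pair should pass the 0.8 threshold.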

We now iterate through the rows of this dataframe to populate a list containing the IDs of features to be removed.

First, we load the features dataframe. The example below shows how the selection is performed on a single pair of features.

In [3]:
# get_eda_df is the function described in the "EDA" phase: it retrieves data from PostgreSQL and returns two dataframes,
# one with values and one with names. Another helper function, get_indicator_name, retrieves feature titles.
from portfolio_optimization.helper import get_eda_df, get_indicator_name
df, name_df = get_eda_df()
df

date,feature1,feature10,feature100,feature101,feature102,feature103,feature104,feature105,feature106,feature107,...,target259,target260,target263,target265,target266,target267,target268,target55,target71,target82
1960-01-01,,100.78720,763.258,,40.6,39.6291,52.208,1460.0,0.0500,,...,,,,,,,,0.0472,,
1960-02-01,,100.03520,763.258,,39.2,39.7872,52.208,1503.0,0.0500,,...,,,,,,,,0.0449,,
1960-03-01,,99.05860,763.258,,35.0,40.0180,52.208,1109.0,0.0500,,...,,,,,,,,0.0425,,
1960-04-01,,98.29333,776.204,,31.4,40.5152,52.295,1289.0,0.0500,,...,,,,,,,,0.0428,,
1960-05-01,,97.88153,776.204,,38.9,40.8926,52.295,1271.0,0.0500,,...,,,,,,,,0.0435,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-08-01,,99.73845,,14.14119,606.4,2721.9167,113.832,1508.0,0.0550,,...,3955.000000,31510.429688,11816.200195,831.50,1710.96,181.165396,89.83,0.0290,0.0276,628.55
2022-09-01,,99.51685,,16.86149,574.4,2747.8114,113.832,1465.0,0.0573,,...,3585.620117,28725.509766,10575.620117,921.50,1659.67,181.165396,80.03,0.0352,0.0321,628.55
2022-10-01,,99.32253,,20.86594,645.8,2779.6053,113.963,1426.0,0.0625,,...,3871.979980,32732.949219,10988.150391,882.25,1633.12,181.772737,86.88,0.0398,0.0385,623.66
2022-11-01,,99.17128,,17.70746,761.0,2809.8687,,1419.0,0.0695,,...,4080.110107,34589.769531,11468.000000,795.50,1768.45,,81.05,0.0389,0.0446,


In [4]:
import matplotlib.pyplot as plt

def feature_correlation(r, plot=False):
    
    col_1 = r['col_1']
    col_2 = r['col_2']
    count_1 = r['c_1']
    count_2 = r['c_2']
    corr_1 = r['corr_1']
    corr_2 = r['corr_2']

    removed = False
    to_remove = None  # stays None when neither feature wins on both criteria

    # Remove col_2 if col_1 has at least as many data points and is at least as correlated with the targets;
    # remove col_1 in the opposite case.
    if count_1 >= count_2 and corr_1 >= corr_2:
        to_remove = col_2
        removed = True
    elif count_2 >= count_1 and corr_2 >= corr_1:
        to_remove = col_1
        removed = True
    
    # plot features in the same chart to see how much they are correlated
    if removed and plot:
        title_1 = get_indicator_name(name_df, col_1)
        title_2 = get_indicator_name(name_df, col_2)
        
        fig, ax = plt.subplots()
        
        col_df = df[[col_1, col_2]].dropna()
        
        ax.plot(col_df.index, col_df[col_1], label=f"{col_1} - {title_1}")
        ax.set_ylabel(f"{col_1} - {title_1}")
        
        ax2 = ax.twinx()
        ax2.plot(col_df.index, col_df[col_2], label=f"{col_2} - {title_2}", color='red')
        ax2.set_ylabel(f"{col_2} - {title_2}")
        
        lines = ax.get_lines() + ax2.get_lines()
        ax.legend(lines, [l.get_label() for l in lines], loc='upper center')
        ax.set_ylim(ymin=min(col_df[col_1]))
        ax2.set_ylim(ymin=min(col_df[col_2]))
    
    return to_remove

to_remove = feature_correlation(corr_df.iloc[0], plot=True)
print(f"FEATURE TO REMOVE {to_remove}")
print(corr_df.iloc[0])

FEATURE TO REMOVE feature100
col_1     feature100
col_2     feature143
corr        0.920896
c_1              750
c_2              750
corr_1      0.112123
corr_2      0.116308
Name: 0, dtype: object


In this example, we can see that feature100 and feature143 have a correlation of 0.920896. Both features have 750 data points.

Since corr_1 is less than corr_2, meaning feature100 is less correlated than feature143 with the targets, we decide to remove feature100.

We can now iterate through all rows of corr_df to find the features to remove. A pair is skipped when one of its features is already marked for removal.


In [5]:
remove_list = []
for i, r in corr_df.iterrows():
    
    # skip the pair if either feature is already in remove_list
    if r['col_1'] in remove_list or r['col_2'] in remove_list:
        continue
    
    f = feature_correlation(r, plot=False)
    if f is not None:
        remove_list.append(f)

print(f"FEATURES TO REMOVE: {len(remove_list)}")

FEATURES TO REMOVE: 67


With this step we identified 67 features to remove because of high correlation.

Then we store this list on MongoDB.

In [6]:
def upsert_document(collection_name, document):
    database = client['portfolio']
    return database[collection_name].replace_one({"_id":document["_id"]}, document, upsert=True)

# create document
feature_to_remove = {'_id': "feature_to_remove_correlation", "data": remove_list}
# load document to MongoDB
upsert_document('feature_selection', feature_to_remove)

<pymongo.results.UpdateResult at 0x22df516ef98>

We can now filter our dataframe with this feature_to_remove list. To do so, we modify get_eda_df by adding the following code (shown commented out) before the pivot dataframe is created.

In [7]:
# if remove_correlation:
#         to_remove_corr = get_document('feature_selection','feature_to_remove_correlation')['data']
#         df = df[~df["column_name"].isin(to_remove_corr)]

df, name_df = get_eda_df(remove_correlation=True)
df

date,feature1,feature10,feature101,feature103,feature105,feature106,feature107,feature114,feature119,feature121,...,target259,target260,target263,target265,target266,target267,target268,target55,target71,target82
1960-01-01,,100.78720,,39.6291,1460.0,0.0500,,1092.0,1.817,,...,,,,,,,,0.0472,,
1960-02-01,,100.03520,,39.7872,1503.0,0.0500,,1088.0,1.817,,...,,,,,,,,0.0449,,
1960-03-01,,99.05860,,40.0180,1109.0,0.0500,,955.0,1.817,,...,,,,,,,,0.0425,,
1960-04-01,,98.29333,,40.5152,1289.0,0.0500,,1016.0,1.797,,...,,,,,,,,0.0428,,
1960-05-01,,97.88153,,40.8926,1271.0,0.0500,,1052.0,1.797,,...,,,,,,,,0.0435,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-08-01,,99.73845,14.14119,2721.9167,1508.0,0.0550,,1542.0,1.191,-219596.240950,...,3955.000000,31510.429688,11816.200195,831.50,1710.96,181.165396,89.83,0.0290,0.0276,628.55
2022-09-01,,99.51685,16.86149,2747.8114,1465.0,0.0573,,1564.0,1.191,-429673.478962,...,3585.620117,28725.509766,10575.620117,921.50,1659.67,181.165396,80.03,0.0352,0.0321,628.55
2022-10-01,,99.32253,20.86594,2779.6053,1426.0,0.0625,,1512.0,1.226,-87797.836311,...,3871.979980,32732.949219,10988.150391,882.25,1633.12,181.772737,86.88,0.0398,0.0385,623.66
2022-11-01,,99.17128,17.70746,2809.8687,1419.0,0.0695,,1351.0,,-248534.880653,...,4080.110107,34589.769531,11468.000000,795.50,1768.45,,81.05,0.0389,0.0446,


The resulting features are less correlated with one another. In total, 132 columns remain: 120 features and 12 targets.



## Removing Features with Univariate Selection

Univariate feature selection is a widely used technique in machine learning and data science to select the most relevant features in a dataset.

The basic idea behind this technique is to evaluate each feature individually and then rank them according to their correlation with the target variable. 

In other words, univariate feature selection methods assess the usefulness of each feature independently of the other features in the dataset.

There are several methods to perform univariate feature selection, including:

1. Pearson correlation: This method computes the linear correlation between each feature and the target variable.

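2. ANOVA F-test: For a categorical target, this method tests whether the mean of each feature differs across the classes of the target variable. For a continuous target, the analogous test is the F-test of a univariate linear regression (f_regression in scikit-learn).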

3. Mutual information: This method measures the amount of information that each feature provides about the target variable.

Once the features are ranked according to their relevance, we can choose the top k features to use in our model.

This can help reduce the dimensionality of the dataset and improve the performance of the model, as irrelevant or redundant features can introduce noise and bias in the model.
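As a minimal sketch of the "top k" idea, the snippet below uses scikit-learn's `SelectKBest` with `f_regression` on synthetic data. The data and the choice `k=2` are assumptions made for illustration only; by construction, only columns 0 and 3 influence the target.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (hypothetical): 5 features, only columns 0 and 3 drive the target.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 5))
y_toy = 2.0 * X_toy[:, 0] - 3.0 * X_toy[:, 3] + 0.1 * rng.normal(size=200)

# Score every feature on its own and keep the two with the highest F-statistic.
selector = SelectKBest(score_func=f_regression, k=2).fit(X_toy, y_toy)
selected_idx = selector.get_support(indices=True)
X_reduced = selector.transform(X_toy)

print(selected_idx, X_reduced.shape)
```

The notebook itself passes `k='all'` instead, because it needs the score of every feature rather than a fixed-size subset.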

It is important to note that univariate feature selection only considers the relationship between each feature and the target variable, and not the interactions between the features themselves. 

That is why we removed highly correlated features beforehand.

We are going to evaluate univariate selection with all three methods and then examine the ranks and their distribution to decide which common features are useful for predicting the targets.

scikit-learn provides several implementations of univariate selection. We use them in the following example.

In [8]:
from sklearn.feature_selection import SelectKBest, f_regression, r_regression, mutual_info_regression

target_columns = [col for col in df.columns if 'target' in col]
feature_columns = [col for col in df.columns if 'feature' in col]

# create a new dataframe for storing scores
score_df = pd.DataFrame([[x] for x in feature_columns], columns=["feature"])
# score function to use for evaluation
score_functions = {"f_regr": f_regression, "r_regr": r_regression, "m_regr": mutual_info_regression}

for score_f in score_functions:
    # We are going to use SelectKBest class to identify best features
    kbest_model = SelectKBest(score_func=score_functions[score_f], k='all')

    # for each target we evaluate best k feature and store scores
    for t_col in target_columns:
        r_df = df[feature_columns + [t_col]]
        r_df = r_df.dropna()

        X = r_df[feature_columns]
        Y = r_df[t_col]

        fit = kbest_model.fit(X, Y)

        score_df[f"{t_col}_{score_f}"] = fit.scores_

score_df = score_df.set_index("feature")
score_df

feature,target254_f_regr,target256_f_regr,target259_f_regr,target260_f_regr,target263_f_regr,target265_f_regr,target266_f_regr,target267_f_regr,target268_f_regr,target55_f_regr,...,target259_m_regr,target260_m_regr,target263_m_regr,target265_m_regr,target266_m_regr,target267_m_regr,target268_m_regr,target55_m_regr,target71_m_regr,target82_m_regr
feature1,3.198237,75.999772,0.121778,0.721349,0.098684,145.580278,75.164156,53.077556,163.865170,75.931360,...,1.495759,1.563253,1.504421,0.749843,1.659620,2.028825,1.104472,1.165745,1.634280,2.017327
feature10,32.176010,1.350604,20.788865,22.886800,19.152094,0.182934,7.962655,13.500408,10.951818,1.509269,...,0.494560,0.424580,0.420120,0.356900,0.415358,0.531987,0.253717,0.442076,0.320744,0.678947
feature101,10.632892,0.182663,1.394609,2.213413,0.237421,0.689181,2.499757,2.221440,15.055011,0.370933,...,0.156580,0.188955,0.125425,0.130579,0.242433,0.214584,0.028348,0.138678,0.008100,0.188262
feature103,1988.257815,643.617205,2129.608473,2450.652307,1666.977666,46.496430,515.187394,974.448251,27.282985,652.682696,...,1.692451,1.731906,1.714701,1.078997,1.719724,2.294536,1.204762,1.389491,1.614104,2.292113
feature105,20.274164,96.356885,4.374649,9.022283,4.720386,290.282360,257.382940,7.993397,170.721827,94.351886,...,0.839382,0.914493,0.814692,0.778745,0.972863,1.216079,0.781900,0.880341,0.803029,1.164485
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
feature94,521.066030,575.986855,324.968440,451.179561,191.503152,51.364409,263.845390,1393.441045,145.861850,583.051455,...,1.706673,1.682119,1.697058,0.931710,1.539011,2.478873,1.319260,1.318350,1.756898,2.438730
feature95,1634.172295,878.099612,1029.029857,1448.691117,638.996810,39.069958,370.456211,4239.423975,62.378560,899.770988,...,1.839863,1.768962,1.806308,1.005690,1.559317,2.394874,1.104782,1.492770,1.677836,2.338168
feature96,1384.923534,1421.632455,646.710704,828.368284,523.142145,116.306711,2057.097764,491.817653,103.200903,1450.093059,...,1.646074,1.712329,1.629010,1.017008,1.867422,2.019047,1.189630,1.384617,1.533116,2.040833
feature97,93.612707,98.377934,134.335639,117.582975,212.324607,11.463007,129.414474,51.905402,4.723726,97.861451,...,0.767111,0.953417,0.841472,0.353532,0.822317,1.149909,0.439039,0.803815,0.620784,1.073712


We used **SelectKBest**, which normally keeps the k features with the highest scores. In this case we set k='all', so no feature is dropped at this point: we only collect the scores and derive rankings from them in the next step.

We evaluated scores using 3 different score functions:
- **f_regression**: Univariate linear regression tests returning F-statistic and p-values.
- **r_regression**: Compute Pearson's r for each feature and the target.
- **mutual_info_regression**: Estimate mutual information for a continuous target variable. Mutual information (MI) between two random variables is a non-negative value that measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.

In the resulting score_df, we store scores for each feature and each score function.

We want to keep the features that perform better overall (across different targets and across different scoring functions).
Because the scoring methods have different value ranges, we need to transform the scores into rankings.
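The effect of ranking can be sketched on a toy score table (the feature names and values below are made up). Note that pandas' `DataFrame.rank` is ascending by default, so the direction has to be chosen deliberately: with `ascending=False`, rank 1 goes to the highest score.

```python
import pandas as pd

# Toy scores (hypothetical): two score functions on very different scales.
scores = pd.DataFrame(
    {"f_regr": [120.0, 3.5, 40.0], "m_regr": [1.2, 0.1, 0.9]},
    index=["featA", "featB", "featC"],
)

ascending_rank = scores.rank()                  # rank 1 = lowest score
descending_rank = scores.rank(ascending=False)  # rank 1 = highest score

print(ascending_rank)
print(descending_rank)
```

After ranking, both columns live on the same 1..n scale and can be aggregated across targets and score functions.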

In [9]:
for col in score_df.columns:
    
    # Transform in absolute values
    if "r_regr" in col:
        score_df[col] = abs(score_df[col])
        
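    # Calculate rank. rank() is ascending by default, so rank 1 goes to the lowest score;
    # pass ascending=False to give rank 1 to the highest-scoring feature instead.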
    score_df[col] = score_df[col].rank()

# using describe functionality of a dataframe we can see the rankings distribution.
describe_df = score_df.apply(pd.DataFrame.describe, axis=1)
describe_df

feature,count,mean,std,min,25%,50%,75%,max
feature1,36.0,43.555556,29.922227,2.0,26.00,36.5,68.5,95.0
feature10,36.0,13.083333,8.012045,2.0,6.75,11.0,20.0,26.0
feature101,36.0,6.083333,6.460318,1.0,1.75,3.0,7.0,26.0
feature103,36.0,83.888889,24.432739,35.0,70.75,88.0,100.5,119.0
feature105,36.0,33.555556,28.445883,12.0,18.00,21.0,34.5,115.0
...,...,...,...,...,...,...,...,...
feature94,36.0,75.777778,18.726774,45.0,64.50,69.0,89.5,108.0
feature95,36.0,87.111111,22.248952,39.0,73.00,94.0,101.5,118.0
feature96,36.0,86.638889,15.243083,58.0,75.50,84.5,99.0,118.0
feature97,36.0,27.222222,12.071480,11.0,17.00,24.0,34.0,57.0


In [10]:
# We define as threshold 1/3 of the # of features.
# This is arbitrary and depends on how many features you want to keep.
# The more features you want to keep, the higher the threshold 
threshold = int(len(feature_columns)/3)

original_len = len(describe_df)

# We remove features whose first quartile is greater than or equal to the threshold.
# This means that at least 75% of the 36 scoring/target combinations (27) rank the feature at or above the threshold
univariate_feature_selection = describe_df[describe_df["25%"] >= threshold]

univariate_feature_selection = univariate_feature_selection.index
univariate_feature_selection = list(univariate_feature_selection)

print(f"# TARGET: {len(target_columns)}")
print(f"# FEATURES: {len(feature_columns)}")
print(f"FEATURE SELECTED FROM UNIVARIATE: {len(univariate_feature_selection)}")

# TARGET: 12
# FEATURES: 120
FEATURE SELECTED FROM UNIVARIATE: 64


Univariate selection flags 64 features for removal.
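The first-quartile rule can be checked by hand on a small ranking table. The sketch below uses made-up ranks where rank 1 is the best feature, and `DataFrame.quantile(0.25, axis=1)`, which matches the "25%" column returned by `describe`.

```python
import pandas as pd

# Toy rankings (hypothetical): one column per target/score combination, rank 1 = best.
ranks = pd.DataFrame(
    {"t1_f": [1, 5, 9], "t1_m": [2, 6, 8], "t2_f": [1, 7, 9], "t2_m": [3, 4, 9]},
    index=["featA", "featB", "featC"],
)
threshold = 5

# First quartile of each feature's ranks across all combinations.
first_quartile = ranks.quantile(0.25, axis=1)

# A feature is flagged when at least 75% of its ranks are at or above the threshold.
flagged = list(first_quartile[first_quartile >= threshold].index)

print(first_quartile)
print(flagged)
```

Here featB has a first quartile of 4.75, so it narrowly survives, while featC (8.75) is flagged.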

## Removing Features with Recursive Feature Elimination (RFE)


Recursive feature elimination (RFE) is a popular feature selection technique in machine learning and data science that aims to identify the most important features in a dataset by recursively eliminating the least relevant features. Unlike univariate feature selection methods that evaluate each feature independently, RFE takes into account the interactions between the features and their impact on the performance of the model.

The basic idea behind RFE is to start with all the features in the dataset and train a model on them. The least important feature(s) are then removed from the dataset, and a new model is trained on the remaining features. This process is repeated recursively until a desired number of features is reached or until the performance of the model stops improving.
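The loop described above can be written by hand in a few lines. This is only an illustrative sketch on synthetic data, using the absolute value of linear regression coefficients as the importance measure (reasonable here because all toy features share the same scale); it is not the scikit-learn implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): only columns 0, 1 and 2 influence the target.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 6))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 1.5 * X_toy[:, 2] + 0.1 * rng.normal(size=200)

remaining = list(range(X_toy.shape[1]))  # column indices still in the model
eliminated = []                          # columns in order of elimination

while len(remaining) > 3:
    # Refit on the remaining columns, then drop the one with the smallest |coefficient|.
    model = LinearRegression().fit(X_toy[:, remaining], y_toy)
    weakest = int(np.argmin(np.abs(model.coef_)))
    eliminated.append(remaining.pop(weakest))

print("kept:", remaining, "eliminated:", eliminated)
```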

There are several methods that can be used to rank the importance of the features in RFE, including:

- Coefficient values: This method ranks the features according to the magnitude of their coefficients in a linear model.

- Feature importances: This method ranks the features according to their importance scores in a tree-based model.

- Recursive feature elimination with cross-validation (RFECV): This method uses cross-validation to evaluate the performance of the model at each iteration and select the optimal number of features.
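For the cross-validated variant, scikit-learn provides `RFECV`. The sketch below runs it on synthetic data; the data, the 5-fold split and the R² scoring are assumptions chosen for illustration, and the notebook itself uses plain `RFE`.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): only columns 0, 1 and 2 influence the target.
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(200, 6))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 1.5 * X_toy[:, 2] + 0.1 * rng.normal(size=200)

# Remove one feature per iteration and keep the subset size with the best CV score.
rfecv = RFECV(LinearRegression(), step=1, cv=5, scoring="r2").fit(X_toy, y_toy)

print("number of features kept:", rfecv.n_features_)
print("mask of kept features:", rfecv.support_)
```

The three informative columns should always be kept; whether some noise columns survive as well depends on the cross-validation scores.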

RFE has several advantages over other feature selection techniques. Firstly, it takes into account the interactions between the features, which can be important in datasets with complex relationships between the variables. Secondly, it can be used with a wide range of models, including linear models, tree-based models, and support vector machines (SVMs). Finally, RFE provides a ranking of the features, which can be useful in understanding the underlying patterns in the data.

However, RFE can be computationally expensive, especially for large datasets and complex models. Moreover, the optimal number of features to select may depend on the specific problem and may require tuning.

In summary, recursive feature elimination is a powerful feature selection technique that can help us identify the most important features in a dataset. By recursively eliminating the least relevant features, we can improve the performance of our models and gain better insights into the underlying patterns in the data. However, it is important to be aware of the computational cost and to carefully tune the parameters of the method.

We selected different models to use with RFE: LinearRegression, DecisionTreeRegressor, SGDRegressor, and BayesianRidge.

Again, we use the scikit-learn implementations in the following examples.

In [11]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, SGDRegressor, BayesianRidge
from sklearn.tree import DecisionTreeRegressor

models = {"lin_regr": LinearRegression, "tree_regr": DecisionTreeRegressor, 
          "sgd_regr": SGDRegressor, "ridge_regr": BayesianRidge}

score_df = pd.DataFrame([[x] for x in feature_columns], columns=["feature"])

for m in models:
    for t_col in target_columns:
        r_df = df[feature_columns + [t_col]]
        r_df = r_df.dropna()
        
        X = r_df[feature_columns]
        Y = r_df[t_col]

        model = models[m]()
        rfe = RFE(model, n_features_to_select=1)
        fit = rfe.fit(X, Y)
        score_df[f"{t_col}_{m}"] = fit.ranking_

score_df = score_df.set_index("feature")
score_df

feature,target254_lin_regr,target256_lin_regr,target259_lin_regr,target260_lin_regr,target263_lin_regr,target265_lin_regr,target266_lin_regr,target267_lin_regr,target268_lin_regr,target55_lin_regr,...,target259_ridge_regr,target260_ridge_regr,target263_ridge_regr,target265_ridge_regr,target266_ridge_regr,target267_ridge_regr,target268_ridge_regr,target55_ridge_regr,target71_ridge_regr,target82_ridge_regr
feature1,9,13,30,13,13,20,11,10,18,14,...,94,86,106,102,98,19,88,25,49,84
feature10,57,51,60,61,47,52,62,57,66,53,...,1,1,58,76,34,25,1,42,46,11
feature101,63,86,65,64,62,82,75,98,65,73,...,11,13,50,36,41,56,6,60,57,73
feature103,69,79,70,70,67,99,85,77,70,91,...,18,23,16,11,9,50,10,80,63,24
feature105,86,96,90,88,90,93,97,95,103,104,...,34,49,30,32,32,71,43,100,83,42
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
feature94,108,109,109,109,118,106,111,115,107,109,...,62,73,27,55,52,106,55,103,111,61
feature95,82,90,87,90,91,67,80,66,108,72,...,15,20,43,10,5,46,20,63,73,27
feature96,81,113,85,86,84,100,82,74,74,80,...,26,31,3,24,22,86,47,90,80,41
feature97,45,48,46,46,25,44,44,56,52,39,...,100,91,115,94,82,27,91,32,29,88


We used **RFE**, a scikit-learn class for feature ranking with recursive feature elimination.

We have evaluated scores with 4 different models:
- **LinearRegression**: Ordinary least squares Linear Regression. It fits a linear model with coefficients w = (w1, ..., wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.
- **DecisionTreeRegressor**: A decision tree regressor. The goal of a decision tree is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
- **SGDRegressor**: Linear model fitted by minimizing a regularized empirical loss with SGD. SGD stands for Stochastic Gradient Descent: the gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate).
- **BayesianRidge**: Fit a Bayesian ridge model. Bayesian regression techniques can be used to include regularization parameters in the estimation procedure: the regularization parameter is not set in a hard sense but tuned to the data at hand.
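Setting `n_features_to_select=1` makes `RFE` eliminate features one by one until a single feature is left, so `ranking_` becomes a complete ordering from 1 (best) to the number of features. A minimal sketch on synthetic data (an assumption for illustration):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): columns 0, 1, 2 matter, with decreasing weight.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 6))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 1.5 * X_toy[:, 2] + 0.1 * rng.normal(size=200)

rfe = RFE(LinearRegression(), n_features_to_select=1).fit(X_toy, y_toy)
ranking = rfe.ranking_  # ranking[i] is the rank of column i, 1 = last feature standing

print(ranking)
```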

RFE already returns rankings (rank 1 is the best feature), so there is no need to transform the values.
We can directly call describe to see the distribution of the rankings.

In [12]:
describe_df = score_df.apply(pd.DataFrame.describe, axis=1)
describe_df

feature,count,mean,std,min,25%,50%,75%,max
feature1,48.0,79.083333,42.550012,9.0,23.75,98.0,116.50,120.0
feature10,48.0,46.916667,29.520158,1.0,22.25,52.5,62.00,118.0
feature101,48.0,53.729167,30.289904,6.0,16.75,63.5,73.25,117.0
feature103,48.0,49.583333,26.802535,9.0,32.25,47.0,69.25,116.0
feature105,48.0,57.270833,28.470769,12.0,35.00,44.0,88.50,113.0
...,...,...,...,...,...,...,...,...
feature94,48.0,60.437500,41.569171,2.0,11.25,60.5,106.25,119.0
feature95,48.0,59.062500,26.370846,5.0,42.50,64.5,77.00,108.0
feature96,48.0,62.833333,28.299563,1.0,40.75,74.0,85.00,113.0
feature97,48.0,70.895833,34.213497,10.0,43.50,85.0,102.25,117.0


In [13]:
# remove features whose first quartile is greater than or equal to the threshold.
rfe_selection = describe_df[describe_df["25%"] >= threshold]

rfe_selection = rfe_selection.index
rfe_selection = list(rfe_selection)

print(f"# TARGET: {len(target_columns)}")
print(f"# FEATURES: {len(feature_columns)}")
print(f"FEATURE SELECTED FROM UNIVARIATE: {len(univariate_feature_selection)}")
print(f"FEATURE SELECTED FROM RFE: {len(rfe_selection)}")

# To remove duplicates
total_feature_to_remove = list(set(univariate_feature_selection + rfe_selection))
print(f"TOTAL FEATURES TO REMOVE: {len(total_feature_to_remove)}")

# TARGET: 12
# FEATURES: 120
FEATURE SELECTED FROM UNIVARIATE: 64
FEATURE SELECTED FROM RFE: 24
TOTAL FEATURES TO REMOVE: 76


We merged the two lists of features identified by univariate feature selection and recursive feature elimination to obtain the set of features to remove from the dataframe.

## Conclusion - storing data on PostgreSQL

To conclude this "Feature Selection" phase, we upload the data to our PostgreSQL database in a new table called pivot.

This table stores the pivoted dataframe, keeping only the features that survived all the selection steps.

First we upload the removal list to MongoDB under the document ID "selected_features" (despite its name, this document holds the features to remove), and then we filter the dataframe as we did previously.

In [14]:
upsert_document("feature_selection", {"_id":"selected_features","data":total_feature_to_remove})

<pymongo.results.UpdateResult at 0x22d8a0f3278>

Now we can update get_eda_df with the following code, before creating pivot dataframe:

In [15]:
# if remove_correlation:
#         to_remove_corr = get_document('feature_selection','feature_to_remove_correlation')['data']
#         df = df[~df["column_name"].isin(to_remove_corr)]

# if remove_selected:
#         to_remove_selected = get_document('feature_selection','selected_features')['data']
#         df = df[~df["column_name"].isin(to_remove_selected)]

df, name_df = get_eda_df(remove_correlation=True, remove_selected=True)
df

date,feature1,feature10,feature101,feature105,feature106,feature114,feature121,feature124,feature13,feature130,...,target259,target260,target263,target265,target266,target267,target268,target55,target71,target82
1960-01-01,,100.78720,,1460.0,0.0500,1092.0,,,101.64170,,...,,,,,,,,0.0472,,
1960-02-01,,100.03520,,1503.0,0.0500,1088.0,,,101.37660,,...,,,,,,,,0.0449,,
1960-03-01,,99.05860,,1109.0,0.0500,955.0,,,101.14750,,...,,,,,,,,0.0425,,
1960-04-01,,98.29333,,1289.0,0.0500,1016.0,,,101.02110,,...,,,,,,,,0.0428,,
1960-05-01,,97.88153,,1271.0,0.0500,1052.0,,,101.05300,,...,,,,,,,,0.0435,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-08-01,,99.73845,14.14119,1508.0,0.0550,1542.0,-219596.240950,8.5,96.32501,1.24,...,3955.000000,31510.429688,11816.200195,831.50,1710.96,181.165396,89.83,0.0290,0.0276,628.55
2022-09-01,,99.51685,16.86149,1465.0,0.0573,1564.0,-429673.478962,10.1,96.53947,1.24,...,3585.620117,28725.509766,10575.620117,921.50,1659.67,181.165396,80.03,0.0352,0.0321,628.55
2022-10-01,,99.32253,20.86594,1426.0,0.0625,1512.0,-87797.836311,9.5,96.64154,1.22,...,3871.979980,32732.949219,10988.150391,882.25,1633.12,181.772737,86.88,0.0398,0.0385,623.66
2022-11-01,,99.17128,17.70746,1419.0,0.0695,1351.0,-248534.880653,9.4,96.69122,1.24,...,4080.110107,34589.769531,11468.000000,795.50,1768.45,,81.05,0.0389,0.0446,


This dataframe contains only the remaining features. Let's insert it into PostgreSQL.

In [16]:
# First we delete any existing pivot table, to make this code re-executable.
drop_statement = "DROP TABLE IF EXISTS pivot"
create_statement = "CREATE TABLE pivot (date date"
# copy the date index into a regular column, then replace the index with a plain integer range
df["date"] = df.index
df = df.reset_index(drop=True)
for col in df.columns:
    if col == "date":
        continue
    create_statement += "," + col + " numeric"
create_statement += ")"

# We use our helper functions to execute SQL commands and insert data into postgreSQL.
from portfolio_optimization.db import execute_db_commands, insert_df_into_table
execute_db_commands([drop_statement, create_statement])
insert_df_into_table(df, "pivot")
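The statement-building loop above can also be packaged as a small function. The sketch below is a hypothetical alternative, not the project's helper: it double-quotes identifiers, which PostgreSQL accepts and which protects against column names that clash with keywords. It assumes the names contain no double quotes; note that quoted identifiers are case-sensitive in PostgreSQL.

```python
def build_create_statement(table_name, columns, date_column="date"):
    """Build a CREATE TABLE statement with one date column and numeric columns for the rest."""
    column_defs = [f'"{date_column}" date']
    # every other column is stored as numeric, as in the notebook cell above
    column_defs += [f'"{col}" numeric' for col in columns if col != date_column]
    return f'CREATE TABLE "{table_name}" ({", ".join(column_defs)})'


statement = build_create_statement("pivot", ["date", "feature1", "target55"])
print(statement)
```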

Let's look at the data stored in the pivot table.

In [17]:
from portfolio_optimization.db import get_df_from_table
df = get_df_from_table("pivot")
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(by="date").set_index("date")
df = df.apply(pd.to_numeric)
df = df.asfreq('MS')
df

date,feature1,feature10,feature101,feature105,feature106,feature114,feature121,feature124,feature13,feature130,...,target259,target260,target263,target265,target266,target267,target268,target55,target71,target82
1960-01-01,,100.78720,,1460.0,0.0500,1092.0,,,101.64170,,...,,,,,,,,0.0472,,
1960-02-01,,100.03520,,1503.0,0.0500,1088.0,,,101.37660,,...,,,,,,,,0.0449,,
1960-03-01,,99.05860,,1109.0,0.0500,955.0,,,101.14750,,...,,,,,,,,0.0425,,
1960-04-01,,98.29333,,1289.0,0.0500,1016.0,,,101.02110,,...,,,,,,,,0.0428,,
1960-05-01,,97.88153,,1271.0,0.0500,1052.0,,,101.05300,,...,,,,,,,,0.0435,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-08-01,,99.73845,14.14119,1508.0,0.0550,1542.0,-219596.240950,8.5,96.32501,1.24,...,3955.000000,31510.429688,11816.200195,831.50,1710.96,181.165396,89.83,0.0290,0.0276,628.55
2022-09-01,,99.51685,16.86149,1465.0,0.0573,1564.0,-429673.478962,10.1,96.53947,1.24,...,3585.620117,28725.509766,10575.620117,921.50,1659.67,181.165396,80.03,0.0352,0.0321,628.55
2022-10-01,,99.32253,20.86594,1426.0,0.0625,1512.0,-87797.836311,9.5,96.64154,1.22,...,3871.979980,32732.949219,10988.150391,882.25,1633.12,181.772737,86.88,0.0398,0.0385,623.66
2022-11-01,,99.17128,17.70746,1419.0,0.0695,1351.0,-248534.880653,9.4,96.69122,1.24,...,4080.110107,34589.769531,11468.000000,795.50,1768.45,,81.05,0.0389,0.0446,


In this "Feature Selection" phase, the goal was to reduce the number of features used in the modeling phase.

We defined two methods to select features.

The first followed from the exploratory step, where we identified correlations among features. The second was based on established feature selection techniques.

The result is a reduced dataset with fewer, more useful features, which we stored in PostgreSQL in a new table named "pivot".

### Principal Component Analysis (PCA)

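Another technique we can use to reduce the number of features is Principal Component Analysis (PCA). Strictly speaking, PCA does not select a subset of the original features; it builds new features as combinations of them.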

PCA works by identifying the principal components of the data, which are the directions in the data space that explain the most variance. The first principal component is the direction in which the data varies the most, while the second principal component is the direction that explains the most variance orthogonal to the first principal component, and so on.

The principal components are computed using linear algebra techniques, typically by performing a singular value decomposition (SVD) of the mean-centered data matrix. Once the principal components are computed, we can project the data onto the new space spanned by the first k principal components, where k is the desired number of dimensions.
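The computation described above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the code used in this project: the function name `pca_svd` and the synthetic matrix `X` are hypothetical, and the data is assumed to contain no missing values.

```python
import numpy as np

def pca_svd(X, k):
    """Project X (n_samples x n_features) onto its first k principal components."""
    # Center each feature: PCA is defined on mean-centered data.
    X_centered = X - X.mean(axis=0)
    # Thin SVD: the rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]
    # Variance explained by each retained component.
    explained_variance = S[:k] ** 2 / (X.shape[0] - 1)
    # Coordinates of every sample in the k-dimensional space.
    scores = X_centered @ components.T
    return scores, components, explained_variance

# Synthetic stand-in data (hypothetical): 200 samples, 5 features,
# where the second feature is almost a multiple of the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

scores, components, explained_variance = pca_svd(X, k=2)
```

Here `scores` holds the projected data with one column per retained component, and each row of `components` expresses a principal direction as a linear combination of the original features.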

PCA has several advantages over other dimensionality reduction techniques. Firstly, it is unsupervised, which means it can be applied to any dataset without requiring labeled data. Secondly, it is computationally efficient, especially for large datasets. Finally, it can be used for data visualization and exploration, as it reduces the data to a smaller number of dimensions that can be easily visualized.

However, PCA also has some limitations. Firstly, it is a linear method, which means it may not be suitable for datasets with complex non-linear relationships between the variables. Secondly, the interpretability of the principal components may be limited, as they are linear combinations of the original features.

In summary, PCA is a powerful technique for reducing the dimensionality of a dataset while retaining most of its variance. By identifying the principal components of the data and projecting it onto the space spanned by the most important directions, we can reduce the computational cost and, in some cases, improve the performance of our models. However, it is important to be aware of the limitations and to carefully select the number of principal components to retain.
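One common way to choose the number of components is to keep the smallest k whose cumulative explained variance reaches a chosen threshold. The sketch below assumes a 95% threshold, which is an arbitrary illustrative value; the helper name `n_components_for_variance` and the synthetic data are hypothetical.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.95):
    """Return the smallest k whose components explain at least `threshold` of the variance."""
    X_centered = X - X.mean(axis=0)
    # Only the singular values are needed to compute explained variance.
    S = np.linalg.svd(X_centered, compute_uv=False)
    ratio = S ** 2 / np.sum(S ** 2)
    cumulative = np.cumsum(ratio)
    # First position where the cumulative ratio reaches the threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against floating-point round-off when threshold is close to 1.
    k = min(k, len(cumulative))
    return k, cumulative

# Synthetic stand-in data (hypothetical): 10 observed features
# driven by 3 latent factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(300, 10))

k, cumulative = n_components_for_variance(X, threshold=0.95)
```

Because the synthetic features are generated from three latent factors, at most three components are needed here. On real data the cumulative curve usually rises more gradually, so the threshold is a judgment call.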

We decided not to use PCA during this step because we think the number of selected features is appropriate for modeling, and we wanted to preserve the transparency of the data (data transformed via PCA bears no resemblance to the original data).

In the next phase, "Modeling", we will use cross-validation to check whether PCA helps us improve performance.
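As a rough idea of what such a comparison could look like, the sketch below cross-validates the same model with and without a PCA step using scikit-learn. Everything specific here is an assumption made for illustration: the synthetic data, the `Ridge` model, the choice of 5 components, and the use of `TimeSeriesSplit` (picked because our data is a monthly time series). The actual setup is defined in the "Modeling" notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the monthly features and one target (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(240, 12))
y = X[:, 0] - 0.5 * X[:, 3] + 0.1 * rng.normal(size=240)

# Time-ordered splits: each fold trains on the past and validates on the future.
cv = TimeSeriesSplit(n_splits=5)

# Same model with and without a PCA step. Scaling comes first because
# PCA is sensitive to the scale of the features.
candidates = {
    "no_pca": make_pipeline(StandardScaler(), Ridge()),
    "pca_5": make_pipeline(StandardScaler(), PCA(n_components=5), Ridge()),
}

# Mean negative MSE across folds: values closer to zero are better.
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error").mean()
    for name, model in candidates.items()
}
```

Putting PCA inside the pipeline means the components are re-fitted on each training fold only, which avoids leaking information from the validation fold.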

[Go to Modeling](modeling.ipynb)