**1. Using Stratified K-fold for regression problems**

To use stratified k-fold for a regression problem, we first have to divide the target into bins; we can then use stratified k-fold in the same way as for classification problems. There are several ways to choose the number of bins. If you have a lot of samples (> 10k, > 100k), you don't need to care about the number of bins: just divide the data into 10 or 20 bins. If you do not have a lot of samples, you can use a simple rule such as Sturges' rule to calculate an appropriate number of bins.
```
Sturges' rule:
Number of bins = 1 + log2(N)
where N is the number of samples in your dataset.
```
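
The binning step described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the helper name `create_folds`, the `kfold` and `bins` column names, the synthetic data from `make_regression`, and the choice to round Sturges' rule down are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import StratifiedKFold


def create_folds(data, target_col="target", n_splits=5, seed=42):
    """Add a `kfold` column so every fold has a similar target distribution."""
    data = data.copy()
    data["kfold"] = -1
    # shuffle the rows before splitting
    data = data.sample(frac=1, random_state=seed).reset_index(drop=True)
    # Sturges' rule, rounded down (for 15000 samples this gives 14 bins)
    num_bins = int(np.floor(1 + np.log2(len(data))))
    # bin the continuous target; labels=False returns integer bin ids
    data["bins"] = pd.cut(data[target_col], bins=num_bins, labels=False)
    kf = StratifiedKFold(n_splits=n_splits)
    # stratify on the bin ids instead of the raw target values
    for fold, (_, valid_idx) in enumerate(kf.split(X=data, y=data["bins"].values)):
        data.loc[valid_idx, "kfold"] = fold
    # the helper column is no longer needed once the folds are assigned
    return data.drop(columns="bins")


# synthetic regression data, used only for illustration
X, y = make_regression(n_samples=15000, n_features=10, random_state=0)
df = pd.DataFrame(X, columns=[f"f_{i}" for i in range(X.shape[1])])
df["target"] = y
df = create_folds(df)
print(df["kfold"].value_counts())
```

Equal-width bins can leave the extreme bins with very few samples, in which case scikit-learn warns that the least populated class has fewer members than `n_splits`. Using fewer bins is one way to avoid that.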


**2. Feature selection**

The simplest form of feature selection is to remove features with very low variance. If a feature has very low variance (i.e. very close to 0), it is close to being constant and thus does not add any value to any model. It is better to get rid of such features and hence lower the complexity. Please note that the variance also depends on the scaling of the data. Scikit-learn has an implementation, VarianceThreshold, that does precisely this.

In [2]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
#fetch a regression dataset  
data = fetch_california_housing()  
X = data["data"]
col_names = data["feature_names"]
df = pd.DataFrame(X, columns=col_names) 
df.loc[:, "MedInc_Sqrt"] = df.MedInc.apply(np.sqrt)
y = data["target"]
df.shape

(20640, 9)

In [4]:
from sklearn.feature_selection import VarianceThreshold  
var_thresh = VarianceThreshold(threshold=0.1)  
transformed_data = var_thresh.fit_transform(df)  
#transformed data will have all columns with variance less  
#than 0.1 removed 
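
To illustrate the remark that variance depends on scaling, here is a small sketch on made-up data. The `toy` frame and its column names are hypothetical. The same column is kept or dropped depending only on the unit it is expressed in, and `get_support()` shows which columns survive.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    # variance around 0.0001, close to constant
    "almost_constant": rng.normal(loc=5.0, scale=0.01, size=1000),
    # variance around 4
    "informative": rng.normal(loc=0.0, scale=2.0, size=1000),
})
# the same informative column expressed in a unit 1000 times larger,
# which shrinks its variance by a factor of one million
toy["informative_rescaled"] = toy["informative"] / 1000.0

selector = VarianceThreshold(threshold=0.1)
selector.fit(toy)

# get_support() returns a boolean mask of the columns that were kept
kept_columns = toy.columns[selector.get_support()].tolist()
print(toy.var())
print(kept_columns)
```

The rescaled column carries exactly the same information as the original but is removed, so it can help to bring features onto comparable scales before choosing a threshold.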

We can also remove features that are highly correlated with each other. To calculate the correlation between numerical features, you can use the Pearson correlation.

In [5]:
df.corr()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedInc_Sqrt |
|---|---|---|---|---|---|---|---|---|---|
| MedInc | 1.0 | -0.119034 | 0.326895 | -0.06204 | 0.004834 | 0.018766 | -0.079809 | -0.015176 | 0.984329 |
| HouseAge | -0.119034 | 1.0 | -0.153277 | -0.077747 | -0.296244 | 0.013191 | 0.011173 | -0.108197 | -0.132797 |
| AveRooms | 0.326895 | -0.153277 | 1.0 | 0.847621 | -0.072213 | -0.004852 | 0.106389 | -0.02754 | 0.326688 |
| AveBedrms | -0.06204 | -0.077747 | 0.847621 | 1.0 | -0.066197 | -0.006181 | 0.069721 | 0.013344 | -0.06691 |
| Population | 0.004834 | -0.296244 | -0.072213 | -0.066197 | 1.0 | 0.069863 | -0.108785 | 0.099773 | 0.018415 |
| AveOccup | 0.018766 | 0.013191 | -0.004852 | -0.006181 | 0.069863 | 1.0 | 0.002366 | 0.002476 | 0.015266 |
| Latitude | -0.079809 | 0.011173 | 0.106389 | 0.069721 | -0.108785 | 0.002366 | 1.0 | -0.924664 | -0.084303 |
| Longitude | -0.015176 | -0.108197 | -0.02754 | 0.013344 | 0.099773 | 0.002476 | -0.924664 | 1.0 | -0.015569 |
| MedInc_Sqrt | 0.984329 | -0.132797 | 0.326688 | -0.06691 | 0.018415 | 0.015266 | -0.084303 | -0.015569 | 1.0 |


We see that the feature MedInc_Sqrt has a very high correlation with MedInc. We  can thus remove one of them.
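
Dropping one feature from each highly correlated pair can be automated. The sketch below is one possible way to do it. The helper name `drop_correlated`, the 0.95 cutoff, and the synthetic `demo` frame are assumptions for illustration, and the cutoff in particular should be chosen for your own data.

```python
import numpy as np
import pandas as pd


def drop_correlated(frame, threshold=0.95):
    """Drop the later column of every pair whose |Pearson r| exceeds threshold."""
    corr = frame.corr().abs()
    # keep only the strict upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # a column is dropped if it is too correlated with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


# synthetic data: "a_sqrt" is a monotonic transform of "a", "b" is unrelated
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.uniform(1, 10, size=500), "b": rng.normal(size=500)})
demo["a_sqrt"] = np.sqrt(demo["a"])

reduced, dropped = drop_correlated(demo)
print(dropped)
print(reduced.columns.tolist())
```

Which member of a pair to keep is a judgment call. This sketch simply keeps whichever column appears first.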

**3. Univariate Feature Selection**

Univariate feature selection is nothing but scoring each feature against a given target. Mutual information, the ANOVA F-test and chi2 are some of the most popular methods for univariate feature selection. There are two ways of using these in scikit-learn:

- SelectKBest: keeps the top-k scoring features
- SelectPercentile: keeps the top features within a percentage specified by the user
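
As a quick illustration, both selectors can be used directly on a small problem. The synthetic data from `make_regression` and the variable names are assumptions made for this sketch.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# synthetic data: 20 features, of which 5 actually drive the target
X_demo, y_demo = make_regression(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# keep the 5 features with the highest F-statistic
k_best = SelectKBest(score_func=f_regression, k=5)
X_k = k_best.fit_transform(X_demo, y_demo)

# keep the top 10 percent of features (2 of 20 here)
top_pct = SelectPercentile(score_func=f_regression, percentile=10)
X_p = top_pct.fit_transform(X_demo, y_demo)

print(X_k.shape, X_p.shape)
# column indices of the features SelectKBest kept
print(k_best.get_support(indices=True))
```

Note that `percentile` is given on a 0-100 scale, not as a fraction.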

In [6]:
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif  
from sklearn.feature_selection import f_regression, mutual_info_regression  
from sklearn.feature_selection import SelectKBest, SelectPercentile 

In [7]:
class UnivariateFeatureSelction:
    def __init__(self, n_features, problem_type, scoring):
        """  
        Custom univariate feature selection wrapper on
        different univariate feature selection models from  scikit-learn.
        :param n_features: SelectPercentile if float else SelectKBest  
        :param problem_type: classification or regression  
        :param scoring: scoring function, string 
        """
    
        if problem_type == "classification":  
            valid_scoring = {
                "f_classif": f_classif,
                "chi2": chi2,
                "mutual_info_classif": mutual_info_classif
            }  
        else:  
            valid_scoring = {
                "f_regression": f_regression,
                "mutual_info_regression": mutual_info_regression  
            } 
            
        #raise exception if we do not have a valid scoring method  
        if scoring not in valid_scoring:  
            raise Exception("Invalid scoring function") 
            
        #if n_features is int, we use selectkbest 
        #if n_features is float, we use selectpercentile 
        #sklearn expects an int in both cases, so a fraction such as 0.1
        #is converted to an integer percentile (0.1 -> 10)
        if isinstance(n_features, int):  
            self.selection = SelectKBest(valid_scoring[scoring], k=n_features)
        elif isinstance(n_features, float):  
            self.selection = SelectPercentile(valid_scoring[scoring],  percentile=int(n_features * 100))
        else:
            raise Exception("Invalid type of feature")
            
    #same fit function  
    def fit(self, X, y):
        return self.selection.fit(X, y)
    
    #same transform function  
    def transform(self, X): 
        return self.selection.transform(X)  
    
    #same fit_transform function  
    def fit_transform(self, X, y):  
        return self.selection.fit_transform(X, y)

In [8]:
ufs = UnivariateFeatureSelction(
    n_features=0.1,
    problem_type="regression",  
    scoring="f_regression"  
)  
ufs.fit(X, y) 
X_transformed = ufs.transform(X)

**4. Greedy Feature Selection**

Univariate feature selection may not always perform  well. Most of the time, people prefer doing feature selection using a machine  learning model.

The simplest form of feature selection that uses a model is known as greedy feature selection. The first step is to choose a model. The second step is to select a loss/scoring function. The third and final step is to iteratively evaluate each feature and add it to the list of **good** features if it improves the loss/score. It can't get simpler than this. But keep in mind that this is known as greedy feature selection for a reason: the process fits a given model every time it evaluates a feature, so with n candidate features a full run can require on the order of n² model fits. The computational cost associated with this kind of method is very high.

In [10]:
from sklearn import linear_model
from sklearn import metrics
from sklearn.datasets import make_classification  

In [12]:
class GreedyFeatureSelection: 
    """  
    A simple and custom class for greedy feature selection.  
    You will need to modify it quite a bit to make it suitable  
    for your dataset. 
    """
    def evaluate_score(self, X, y):  
        """  
        This function evaluates model on data and returns  
        Area Under ROC Curve (AUC)
        NOTE: We fit the data and calculate AUC on same data. 
        WE ARE OVERFITTING HERE.  
        But this is also a way to achieve greedy selection.  
        k-fold will take k times longer.   
        :param X: training data  
        :param y: targets  
        :return: overfitted area under the roc curve  
        """  
        #fit the logistic regression model,  
        #and calculate AUC on same data  
        #you can choose any model that suits your data 
        model = linear_model.LogisticRegression()  
        model.fit(X, y)  
        predictions = model.predict_proba(X)[:, 1]  
        auc = metrics.roc_auc_score(y, predictions)  
        return auc 
    
    def _feature_selection(self, X, y):
        """
        This function does the actual greedy selection
        :param X: data, numpy array
        :param y: targets, numpy array
        :return: (best scores, best features)
        """
        #initialize good features list  
        #and best scores to keep track of both  
        
        good_features = []  
        best_scores = []  
        
        #calculate the number of features  
        num_features = X.shape[1] 

        #infinite loop  
        while True:  

            #initialize best feature and score of this loop  
            this_feature = None  
            best_score = 0 

            #loop over all features  
            for feature in range(num_features):  
                #if feature is already in good features,  
                #skip this for loop  
                if feature in good_features: 
                    continue  
                    
                #selected features are all good features till now  
                #and current feature  
                selected_features = good_features + [feature]  
                
                #remove all other features from data  
                xtrain = X[:, selected_features]  
                
                #calculate the score, in our case, AUC  
                score = self.evaluate_score(xtrain, y)  
                
                #if score is greater than the best score  
                #of this loop, change best score and best feature  
                if score > best_score:  
                    this_feature = feature  
                    best_score = score 
                    
            #if we have selected a feature, add it
            #to the good feature list and update best scores list
            if this_feature is not None:
                good_features.append(this_feature)
                best_scores.append(best_score)
            else:
                #every feature is already selected, so nothing is left
                #to try; without this exit the loop would never end
                break

            #if the latest round scored lower than the previous one,
            #exit the while loop
            if len(best_scores) > 2:
                if best_scores[-1] < best_scores[-2]:
                    #the feature added last made the score worse,
                    #so remove it and its score before stopping
                    good_features.pop()
                    best_scores.pop()
                    break

        #return best scores and good features
        return best_scores, good_features

    def __call__(self, X, y): 
        """  
        Call function will call the class on a set of arguments 
        """  
        #select features, return scores and selected indices  
        scores, features = self._feature_selection(X, y)  
        
        #transform data with selected features  
        return X[:, features], scores 

In [14]:
X, y = make_classification(n_samples=1000, n_features=100)  
#transform data by greedy feature selection  
X_transformed, scores = GreedyFeatureSelection()(X, y) 
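
The class above scores each candidate on the same data it was fitted on. As a hedged alternative, scikit-learn (version 0.24 or later) provides `SequentialFeatureSelector`, which performs forward selection with cross-validated scoring. The sketch below swaps in that class rather than reproducing the custom one. The dataset size, the number of features to keep, and `cv=3` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# small synthetic problem so the repeated fits stay cheap
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,   # stop once 3 features have been added
    direction="forward",      # start empty and add one feature per round
    scoring="roc_auc",        # same metric as the custom class
    cv=3,                     # score on held-out folds, not on training data
)
sfs.fit(X_demo, y_demo)

# column indices of the selected features
selected = sfs.get_support(indices=True)
X_selected = sfs.transform(X_demo)
print(selected, X_selected.shape)
```

Cross-validation multiplies the number of model fits by the number of folds, which is the cost the docstring above warns about.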

Another greedy approach is known as recursive feature elimination (RFE). In the previous method, we started with one feature and kept adding new features, but in RFE, we start with all features and, in every iteration, remove the one feature that provides the least value to a given model. But how do we know which feature offers the least value? Well, if we use models like a linear support vector machine (SVM) or logistic regression, we get a coefficient for each feature, which decides the importance of the features. In the case of tree-based models, we get feature importances in place of coefficients. In each iteration, we eliminate the least important feature, and we keep eliminating features until we reach the number of features needed. So, yes, we have the ability to decide how many features we want to keep.

In [15]:
import pandas as pd  
from sklearn.feature_selection import RFE  
from sklearn.linear_model import LinearRegression 
from sklearn.datasets import fetch_california_housing 

#fetch a regression dataset  
data = fetch_california_housing()  
X = data["data"]  
col_names = data["feature_names"] 
y = data["target"] 

#initialize the model  
model = LinearRegression()  
#initialize RFE  
rfe = RFE(  
    estimator=model,  
    n_features_to_select=3  
)  

#fit RFE  
rfe.fit(X, y)  

#get the transformed data with  
#selected columns  
X_transformed = rfe.transform(X) 
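
RFE works the same way with a tree-based model, in which case it ranks features by `feature_importances_` instead of coefficients. The following sketch uses synthetic data and a deliberately small random forest to keep it fast; both are assumptions for illustration. It also shows how to inspect which features were kept.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# synthetic data: 8 features, of which 3 actually drive the target
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

rfe_demo = RFE(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    n_features_to_select=3,
)
rfe_demo.fit(X_demo, y_demo)

# boolean mask: True for the features that were kept
print(rfe_demo.support_)
# rank 1 marks a kept feature; larger ranks were eliminated earlier
print(rfe_demo.ranking_)

X_reduced = rfe_demo.transform(X_demo)
```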