# Feature Selection: Variance Threshold

In [29]:
import pandas as pd

Dummy Data

In [30]:
data = pd.DataFrame({"A": [1,2,4,1,2,4],
                     "B": [4,5,6,7,8,9],
                     "C": [0,0,0,0,0,0],
                     "D": [1,1,1,1,1,1]})

In [31]:
from sklearn.feature_selection import VarianceThreshold
var_thres = VarianceThreshold(threshold = 0)
var_thres.fit(data)

In [32]:
var_thres.get_support()

array([ True,  True, False, False])

In [33]:
data.columns[var_thres.get_support()]

Index(['A', 'B'], dtype='object')

> Columns C and D are constant (zero variance), so only A and B are kept.
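As a cross-check, the same selection can be reproduced by hand. This is a minimal sketch, assuming that `VarianceThreshold` compares the population variance (`ddof=0`) of each column against the threshold and keeps columns strictly above it. The names `variances`, `manual_mask`, and `reduced` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.DataFrame({"A": [1, 2, 4, 1, 2, 4],
                     "B": [4, 5, 6, 7, 8, 9],
                     "C": [0, 0, 0, 0, 0, 0],
                     "D": [1, 1, 1, 1, 1, 1]})

selector = VarianceThreshold(threshold=0)
selector.fit(data)

# Population variance of every column, computed directly with pandas
variances = data.var(ddof=0)

# A column is kept only if its variance is strictly above the threshold
manual_mask = variances > 0

# Keep the surviving columns as a DataFrame (column names are preserved)
reduced = data.loc[:, selector.get_support()]
```

Indexing with `data.loc[:, mask]` keeps the column names, whereas `selector.transform(data)` returns a plain NumPy array by default.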

Actual Data

In [34]:
df = pd.read_csv(r"Data/sample_submission.csv")

In [42]:
var_thres_1 = VarianceThreshold(threshold = 2)
var_thres_1.fit(df)

In [43]:
df.columns[var_thres_1.get_support()]

Index(['ID'], dtype='object')

> Only ID has a variance above the threshold of 2, so every other column is dropped.
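The contents of `sample_submission.csv` are not shown here, so the reason only ID survives is a guess. One plausible explanation is that variance depends on the scale of a column: an identifier that counts upward has a large variance, while a column bounded in [0, 1] can never exceed 0.25. The sketch below illustrates this with a hypothetical two-column frame (`toy`, `prob`, and `kept_cols` are made-up names, not taken from the real file).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: an ID-like counter and a probability-like column in [0, 1]
toy = pd.DataFrame({
    "ID": np.arange(100),
    "prob": np.linspace(0.0, 1.0, 100),
})

selector = VarianceThreshold(threshold=2).fit(toy)

# Names of the columns whose variance exceeds the threshold
kept_cols = list(toy.columns[selector.get_support()])

# Population variances, for comparison with the threshold
variances = toy.var(ddof=0)
```

Because of this scale dependence, a single fixed threshold is easiest to interpret when the features are on comparable scales.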

# Feature Selection: With Correlation

In [44]:
# importing libraries
# load_boston was removed in scikit-learn 1.2 (the dataset has known ethical problems);
# fetch_california_housing is one of the replacements suggested by scikit-learn
from sklearn.datasets import fetch_california_housing
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# the following function returns the names of highly correlated features
# for each correlated pair, it flags the column that appears later in the DataFrame

def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

In [None]:
# X_train is assumed to be the training-feature DataFrame from an earlier train/test split
corr_features = correlation(X_train, 0.7)
len(corr_features)

In [None]:
corr_features
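Since `X_train` is not defined in this notebook, here is a minimal self-contained sketch of how the function behaves, using a small synthetic frame. The column names `x1`, `x1_copy`, and `x2` are hypothetical; `x1_copy` is built as a near-duplicate of `x1` so that the pair exceeds the 0.7 threshold.

```python
import numpy as np
import pandas as pd

def correlation(dataset, threshold):
    col_corr = set()  # names of columns flagged as correlated
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):  # only look at columns that come before column i
            if abs(corr_matrix.iloc[i, j]) > threshold:
                col_corr.add(corr_matrix.columns[i])
    return col_corr

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)

toy = pd.DataFrame({
    "x1": x1,
    "x1_copy": 2 * x1 + 0.01 * rng.normal(size=200),  # almost perfectly correlated with x1
    "x2": x2,                                         # independent of x1
})

to_drop = correlation(toy, 0.7)
reduced = toy.drop(columns=list(to_drop))
```

Only the later column of the pair (`x1_copy`) is flagged, so one representative of each correlated group stays in the data.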

# VIF

VIF stands for **Variance Inflation Factor**, a measure used to detect multicollinearity in a regression model. Multicollinearity occurs when independent variables (predictors) are highly correlated, which can make it difficult to interpret the model and can lead to unreliable coefficient estimates.

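To compute the VIF of a feature, that feature is treated as the dependent variable and regressed on all the remaining independent features. With $R_i^2$ denoting the R² of that regression:

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
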
### Interpretation of VIF:

- **VIF = 1**: No correlation between the variable and other variables (no multicollinearity).
- **1 < VIF < 5**: Moderate correlation, usually acceptable.
- **VIF > 5**: High correlation, indicating multicollinearity.
- **VIF > 10**: Strong multicollinearity, which can affect the regression results and should be addressed.

In [8]:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor # for importing VIF method

In [4]:
df = pd.read_csv(r"Data/RestaurentData.csv")
df

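| | ZomatoFoodRating | SwiggyFoodRating | Years_Old | Location | Cost_for_two |
|---|---|---|---|---|---|
| 0 | 4 | 8.0 | 10 | 1 | 1200 |
| 1 | 3 | 5.0 | 18 | 2 | 1000 |
| 2 | 4 | 7.5 | 12 | 2 | 1300 |
| 3 | 4 | 7.0 | 5 | 5 | 600 |
| 4 | 4 | 6.0 | 20 | 3 | 400 |
| 5 | 4 | 6.0 | 40 | 12 | 200 |
| 6 | 3 | 5.0 | 5 | 1 | 300 |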


In [5]:
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

In [13]:
variance_inflation_factor(X, 0)

np.float64(203.73719967345562)
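To make the call above less of a black box, the following sketch recomputes the VIF of `ZomatoFoodRating` with NumPy only. It assumes that `variance_inflation_factor` regresses the chosen column on the other columns exactly as given, without adding an intercept, so the R² involved is the uncentered one. The helper name `manual_vif` is illustrative, and the predictor values are copied from the table shown earlier.

```python
import numpy as np
import pandas as pd

# Predictor columns copied from the restaurant table
X = pd.DataFrame({
    "ZomatoFoodRating": [4, 3, 4, 4, 4, 4, 3],
    "SwiggyFoodRating": [8.0, 5.0, 7.5, 7.0, 6.0, 6.0, 5.0],
    "Years_Old": [10, 18, 12, 5, 20, 40, 5],
    "Location": [1, 2, 2, 5, 3, 12, 1],
})

def manual_vif(frame, idx):
    """VIF of column `idx`, from a least-squares fit on the other columns (no intercept)."""
    values = frame.to_numpy(dtype=float)
    target = values[:, idx]
    others = np.delete(values, idx, axis=1)
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    residual = target - others @ coef
    # Uncentered R^2, because no intercept column is included
    r_squared = 1.0 - (residual @ residual) / (target @ target)
    return 1.0 / (1.0 - r_squared)

vif_zomato = manual_vif(X, 0)
```

Under these assumptions, `vif_zomato` should agree with the value of about 203.74 returned by the call above.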

In [24]:
def calculate_vif(dataset):
    # Compute the VIF of every column and return the results as a DataFrame
    vif = pd.DataFrame()
    vif['Features'] = dataset.columns
    vif['Vif values'] = [variance_inflation_factor(dataset, i)
                         for i in range(dataset.shape[1])]
    return vif

In [25]:
calculate_vif(X)

| | Features | Vif values |
|---|---|---|
| 0 | ZomatoFoodRating | 203.737200 |
| 1 | SwiggyFoodRating | 167.148061 |
| 2 | Years_Old | 9.596556 |
| 3 | Location | 6.195456 |


* ZomatoFoodRating and SwiggyFoodRating have the highest VIF values.
* Next, check how strongly each of these two columns correlates with the target variable. Keep the column with the stronger correlation and drop the other.

In [26]:
df.corr()

| | ZomatoFoodRating | SwiggyFoodRating | Years_Old | Location | Cost_for_two |
|---|---|---|---|---|---|
| ZomatoFoodRating | 1.000000 | 0.785553 | 0.236454 | 0.387500 | 0.097849 |
| SwiggyFoodRating | 0.785553 | 1.000000 | -0.223692 | -0.082690 | 0.586607 |
| Years_Old | 0.236454 | -0.223692 | 1.000000 | 0.811560 | -0.380386 |
| Location | 0.387500 | -0.082690 | 0.811560 | 1.000000 | -0.558556 |
| Cost_for_two | 0.097849 | 0.586607 | -0.380386 | -0.558556 | 1.000000 |


SwiggyFoodRating is more strongly correlated with the target variable Cost_for_two (0.587 versus 0.098 for ZomatoFoodRating), so we keep SwiggyFoodRating and drop ZomatoFoodRating.
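The same decision can be made programmatically instead of reading it off the correlation matrix. This is a minimal sketch using the restaurant data shown earlier. The names `candidates`, `target_corr`, `keep`, and `drop` are illustrative, and comparing absolute correlations is an assumption about what "more correlation" means here.

```python
import pandas as pd

df = pd.DataFrame({
    "ZomatoFoodRating": [4, 3, 4, 4, 4, 4, 3],
    "SwiggyFoodRating": [8.0, 5.0, 7.5, 7.0, 6.0, 6.0, 5.0],
    "Years_Old": [10, 18, 12, 5, 20, 40, 5],
    "Location": [1, 2, 2, 5, 3, 12, 1],
    "Cost_for_two": [1200, 1000, 1300, 600, 400, 200, 300],
})

# The two features with the highest VIF values
candidates = ["ZomatoFoodRating", "SwiggyFoodRating"]

# Absolute correlation of each candidate with the target
target_corr = df[candidates].corrwith(df["Cost_for_two"]).abs()

keep = target_corr.idxmax()  # candidate most correlated with the target
drop = target_corr.idxmin()  # candidate to remove
```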

In [27]:
df.drop(['ZomatoFoodRating'], axis = 1, inplace = True)

Check the new VIF values

In [28]:
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
calculate_vif(X)

| | Features | Vif values |
|---|---|---|
| 0 | SwiggyFoodRating | 2.548228 |
| 1 | Years_Old | 7.817709 |
| 2 | Location | 5.977085 |


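The VIF of SwiggyFoodRating has dropped sharply (from about 167 to 2.5), while Years_Old (7.8) and Location (6.0) have decreased only slightly and remain above 5.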
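One caveat, offered as a point of comparison rather than as part of the original analysis: the VIFs in this section appear to be computed on the raw predictors without an intercept column, so they also reflect how far each column's mean is from zero. The sketch below computes the centered VIFs for the three remaining predictors, using the fact that centered VIFs equal the diagonal of the inverse correlation matrix. The name `centered_vif` is illustrative.

```python
import numpy as np
import pandas as pd

# Remaining predictors after dropping ZomatoFoodRating
X = pd.DataFrame({
    "SwiggyFoodRating": [8.0, 5.0, 7.5, 7.0, 6.0, 6.0, 5.0],
    "Years_Old": [10, 18, 12, 5, 20, 40, 5],
    "Location": [1, 2, 2, 5, 3, 12, 1],
})

# Centered VIFs are the diagonal entries of the inverse correlation matrix
corr = X.corr().to_numpy()
centered_vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)
```

These values correspond to regressions that include an intercept, so they will differ from the numbers reported above. A centered VIF is never below 1.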