# Variable Variance
The variance of a variable measures how far its values are spread out from their mean.
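To make the definition concrete, here is a minimal sketch that computes variance by hand using only the standard library. The `ages` list is hypothetical and is not taken from the customers data set. It shows both the population variance (divide by n) and the sample variance (divide by n - 1); pandas' `var()` returns the sample variance by default.

```python
from statistics import mean, pvariance, variance

# Hypothetical ages, chosen only for illustration
ages = [27, 30, 38, 40, 44]

# Squared distance of each value from the mean
mu = mean(ages)
squared_deviations = [(x - mu) ** 2 for x in ages]

# Population variance divides by n; sample variance divides by n - 1
population_var = sum(squared_deviations) / len(ages)
sample_var = sum(squared_deviations) / (len(ages) - 1)

print(population_var, sample_var)
```

The hand-computed values should agree with `statistics.pvariance(ages)` and `statistics.variance(ages)` respectively.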

### High Variance for Independent Variables
Higher variance means the variable's values are more spread out from the mean.

When an independent variable has higher variance, the training data covers a wider range of its values, so the machine learning model can predict across that wider range, which tends to make it more generalized.

### Low Variance for Independent Variables
Lower variance means the variable's values are less spread out from the mean.

When an independent variable has lower variance, the training data covers a narrower range of its values, so the machine learning model predicts best within the smaller range it was trained on, which makes it more specialized.

In [108]:
# import libraries
import pandas as pd
import numpy as np

In [100]:
# read the customers csv as a data set
customers_df = pd.read_csv("datasets/customers.csv")

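# fill NaN in each numeric column with that column's mean
# (numeric_only=True skips the text columns, which have no mean)
customers_df = customers_df.fillna(customers_df.mean(numeric_only=True))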

customers_df.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,63777.777778,Yes


In [101]:
# return statistical information of each column
customers_df.describe()

Unnamed: 0,Age,Salary
count,10.0,10.0
mean,38.777778,63777.777778
std,7.253777,11564.099406
min,27.0,48000.0
25%,35.5,55000.0
50%,38.388889,62388.888889
75%,43.0,70750.0
max,50.0,83000.0


In [102]:
# return the sample variance of each numeric (non-categorical) column
customers_df.var(numeric_only=True)

Age       5.261728e+01
Salary    1.337284e+08
dtype: float64

# Feature Scaling Variance
Age and Salary are on very different scales, so the Salary column's variance (about 1.3e+08) is far larger than the Age column's (about 52.6). To compare the two, we need to feature scale the columns.

We can apply a standardization scaler (`StandardScaler`) to the Age and Salary columns.

In [103]:
# import a standardization scaler for feature scaling
from sklearn.preprocessing import StandardScaler

# create a standardization scaler for the age, then fit and transform
sc_Age = StandardScaler()
ages = sc_Age.fit_transform(customers_df["Age"].values.reshape(-1, 1))

# create a standardization scaler for the salary, then fit and transform
sc_Salary = StandardScaler()
salaries = sc_Salary.fit_transform(customers_df["Salary"].values.reshape(-1, 1))

In [104]:
# reshape both ages and salaries from 2D to 1D
ages = ages.reshape(-1)
salaries = salaries.reshape(-1)

# combine the ages and salaries Arrays into a data frame
scaled_df = pd.DataFrame({"Age": ages, "Salary": salaries})

scaled_df.head()

Unnamed: 0,Age,Salary
0,0.758874,0.7494733
1,-1.711504,-1.438178
2,-1.275555,-0.8912655
3,-0.113024,-0.2532004
4,0.177609,6.632192e-16


In [105]:
# return the sample variance (ddof=1, the pandas default) of each column
scaled_df.var()

Age       1.111111
Salary    1.111111
dtype: float64
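The scaled variance is 1.111111 rather than exactly 1 because the two libraries use different divisors. `StandardScaler` divides by the population standard deviation (equivalent to `ddof=0`), while pandas' `var()` computes the sample variance (`ddof=1`). With 10 rows the ratio is 10 / 9 ≈ 1.111111. The sketch below reproduces this arithmetic with NumPy; the values in `x` are hypothetical, and only their count (10) matches the data set.

```python
import numpy as np

# Hypothetical values, used only to illustrate the arithmetic
x = np.array([27.0, 30.0, 38.0, 40.0, 44.0, 35.0, 48.0, 50.0, 37.0, 39.0])
n = len(x)

# Standardize as StandardScaler does: subtract the mean,
# then divide by the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

pop_var = z.var(ddof=0)     # population variance of the scaled values
sample_var = z.var(ddof=1)  # sample variance, the pandas default

print(pop_var, sample_var, n / (n - 1))
```

Calling `scaled_df.var(ddof=0)` on the data frame above should therefore give a variance of 1 for both columns.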

# Handle Low Variance
If a column (variable) has low variance, it usually carries little information for the model, so we can remove it.

We can define a variance threshold using scikit-learn's `VarianceThreshold` class to remove low-variance variables.

In [106]:
# import the variance threshold class
from sklearn.feature_selection import VarianceThreshold

In [107]:
# create a variance selector with a threshold of 0.5 (the default is 0)
selector = VarianceThreshold(threshold=0.5)

# fit the selector to the scaled data frame and transform it
selected_df = selector.fit_transform(scaled_df)

"""
The selector did not remove the Age or Salary column
because their variances were greater than the threshold.
"""
selected_df

array([[ 7.58874362e-01,  7.49473254e-01],
       [-1.71150388e+00, -1.43817841e+00],
       [-1.27555478e+00, -8.91265492e-01],
       [-1.13023841e-01, -2.53200424e-01],
       [ 1.77608893e-01,  6.63219199e-16],
       [-5.48972942e-01, -5.26656882e-01],
       [ 0.00000000e+00, -1.07356980e+00],
       [ 1.34013983e+00,  1.38753832e+00],
       [ 1.63077256e+00,  1.75214693e+00],
       [-2.58340208e-01,  2.93712492e-01]])
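One caveat about this result: after standardization every non-constant column has a population variance of 1, so any threshold below 1 keeps all of them. A variance threshold is more informative on unscaled data, or on data scaled in a way that preserves relative spread. The sketch below mimics the selection rule with plain NumPy (keep a column when its population variance is greater than the threshold) on a hypothetical array `X`; it illustrates the idea and is not the scikit-learn implementation.

```python
import numpy as np

# Hypothetical unscaled data: column 0 varies, column 1 is nearly constant
X = np.array([
    [27.0, 1.00],
    [30.0, 1.00],
    [38.0, 1.01],
    [40.0, 1.00],
    [44.0, 1.00],
])

threshold = 0.5

# Population variance of each column
variances = X.var(axis=0)

# Keep only the columns whose variance exceeds the threshold
keep = variances > threshold
X_selected = X[:, keep]

# After standardization, both columns end up with variance 1,
# so the same threshold can no longer tell them apart
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_variances = X_scaled.var(axis=0)

print(keep, X_selected.shape, scaled_variances)
```

Here the nearly constant column is dropped on the raw data, while on the standardized data both columns would pass the same threshold.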