# Near-Zero Variance
### Description: Calculating the variance of each feature and removing features whose variance is zero or near zero
### References: www.kaggle.com
### Link: https://stackoverflow.com/questions/29298973/removing-features-with-low-variance-using-scikit-learn


## Import Libraries

In [90]:
# calculate variance
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold   # feature selector that removes all low-variance features

print(np.var([1, 9, 5, 6, 8, 7]))          # population variance (ddof=0) of a sample list
print(np.var([4, -11, -5, 16, 5, 7, 9]))   # variance of a second sample list that contains negative values

6.666666666666667
69.10204081632652
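
`np.var` divides by N by default, so the values above are population variances (`ddof=0`). As a minimal sketch, reusing the first sample list, the same number can be recomputed by hand and contrasted with the sample variance (`ddof=1`):

```python
import numpy as np

values = np.array([1, 9, 5, 6, 8, 7])

# Population variance: mean squared deviation from the mean (divide by N).
mean = values.mean()
population_var = ((values - mean) ** 2).sum() / len(values)

# Sample variance: divide by N - 1 instead of N.
sample_var = np.var(values, ddof=1)

print(population_var)   # same value as np.var(values)
print(sample_var)
```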


## Data Set:

In [91]:
data = pd.DataFrame({'A': [2,7,5,6,0,8,9,0,0,4],
                     'B': [0,4,6,0,0,12,14,0,0,20],
                     'C': [3,6,0,12,0,18,0,24,0,30],
                     'D': [4,0,12,16,0,0,28,32,0,40],
                     'E': [0,12,0,24,0,0,42,48,0,60],
                     'F': [0,0,0,0,0,0,0,0,0,0],
                     'G': [2,2,2,2,2,2,3,2,2,2],
                     'H': [1,7,1,1,1,1,1,1,1,1]})
print(data)

   A   B   C   D   E  F  G  H
0  2   0   3   4   0  0  2  1
1  7   4   6   0  12  0  2  7
2  5   6   0  12   0  0  2  1
3  6   0  12  16  24  0  2  1
4  0   0   0   0   0  0  2  1
5  8  12  18   0   0  0  2  1
6  9  14   0  28  42  0  3  1
7  0   0  24  32  48  0  2  1
8  0   0   0   0   0  0  2  1
9  4  20  30  40  60  0  2  1


### Calculating the variance

In [92]:
# variance of each column of the dataframe (pandas default: axis=0, ddof=1)
print(data.var())

# the same column-wise variance with the axis written out
# print(data.var(axis=0))

# Row variance of the dataframe
print(data.var(axis=1))

# variance of the specific column
print(data.loc[:,"B"].var())

A     11.877778
B     53.155556
C    124.900000
D    231.288889
E    547.600000
F      0.000000
G      0.100000
H      3.600000
dtype: float64
0      2.285714
1     16.785714
2     17.928571
3     78.839286
4      0.553571
5     46.696429
6    236.982143
7    353.982143
8      0.553571
9    491.410714
dtype: float64
53.15555555555555
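
The column variances printed here are sample variances, because pandas uses `ddof=1` by default, while `np.var` uses `ddof=0`. A short sketch of the difference, using only columns G and H from the data set above:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'G': [2, 2, 2, 2, 2, 2, 3, 2, 2, 2],
                     'H': [1, 7, 1, 1, 1, 1, 1, 1, 1, 1]})

sample_var = data.var()             # pandas default: ddof=1 (divide by N - 1)
population_var = data.var(ddof=0)   # divide by N, like np.var

print(sample_var)
print(population_var)
print(np.var(data['H'].to_numpy()))  # agrees with population_var['H']
```

The gap matters when a threshold is applied: a column can sit above the threshold under one convention and below it under the other.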


### Finding all-zero and constant columns

In [93]:
print(data != 0)                                 # boolean mask: True where a value is non-zero
print((data != 0).any(axis=0))                   # per column: True if the column has at least one non-zero value
# df = data.loc[:, (data != 0).any(axis=0)]      # keeps only columns with at least one non-zero value (drops all-zero columns)
# print(df)
# Removing constant columns from a dataframe
# data = data.loc[:, data.apply(pd.Series.nunique) != 1]   # NaNs are ignored as usual, so a column is constant if nunique == 1
# print(data)


       A      B      C      D      E      F     G     H
0   True  False   True   True  False  False  True  True
1   True   True   True  False   True  False  True  True
2   True   True  False   True  False  False  True  True
3   True  False   True   True   True  False  True  True
4  False  False  False  False  False  False  True  True
5   True   True   True  False  False  False  True  True
6   True   True  False   True   True  False  True  True
7  False  False   True   True   True  False  True  True
8  False  False  False  False  False  False  True  True
9   True   True   True   True   True  False  True  True
A     True
B     True
C     True
D     True
E     True
F    False
G     True
H     True
dtype: bool
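
The two commented-out filters in the cell above can be sketched on a small frame. The column `K` is a hypothetical constant, non-zero column added only to show that the two filters behave differently:

```python
import pandas as pd

data = pd.DataFrame({'A': [2, 7, 5, 0],
                     'F': [0, 0, 0, 0],
                     'K': [3, 3, 3, 3]})  # 'K' is a hypothetical constant column

# Keep columns that contain at least one non-zero value (drops 'F' only).
non_zero = data.loc[:, (data != 0).any(axis=0)]

# Keep columns with more than one distinct value (drops 'F' and 'K').
non_constant = data.loc[:, data.nunique() > 1]

print(list(non_zero.columns))
print(list(non_constant.columns))
```

The non-zero mask only catches columns that are entirely zero, whereas the `nunique` filter catches every constant column.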


### Removing features with low variance

### Note: VarianceThreshold does not work on string data because it converts the input to float.
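
One way to work around this, sketched here under the assumption that string columns can simply be left out of the variance check, is to select the numeric columns first. The `label` column is hypothetical:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

mixed = pd.DataFrame({'A': [2, 7, 5, 6],
                      'F': [0, 0, 0, 0],
                      'label': ['x', 'y', 'x', 'z']})  # hypothetical string column

# Restrict the selector to numeric columns only.
numeric = mixed.select_dtypes(include='number')

vt = VarianceThreshold()   # default threshold=0.0 removes constant features
vt.fit(numeric)

kept = list(numeric.columns[vt.get_support()])
print(kept)
```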

In [94]:
print(data) 
# optional: extract the underlying NumPy array before fitting
# remaining_columns = data.columns
# d = data.loc[:, remaining_columns].values

# instantiate VarianceThreshold object
vt = VarianceThreshold(threshold=5.0)

# fit vt to data        
vt.fit(data)

# get the indices of the dataframe that are being kept
indices = vt.get_support(indices=True)

# names of the columns that are kept (variance above the threshold)
variance = [data.columns[idx] for idx, _ in enumerate(data.columns) if idx in indices]

# names of the columns that are removed
removed = list(np.setdiff1d(data.columns, variance))
print("Found {0} low-variance columns.".format(len(removed)))

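# equivalent manual filter on a NumPy array d: d[:, vt.variances_ > 5.0]
data_removed = vt.transform(data)

# transform returns a NumPy array, so the new dataframe gets integer column labels
df = pd.DataFrame(data_removed)
print(df)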

   A   B   C   D   E  F  G  H
0  2   0   3   4   0  0  2  1
1  7   4   6   0  12  0  2  7
2  5   6   0  12   0  0  2  1
3  6   0  12  16  24  0  2  1
4  0   0   0   0   0  0  2  1
5  8  12  18   0   0  0  2  1
6  9  14   0  28  42  0  3  1
7  0   0  24  32  48  0  2  1
8  0   0   0   0   0  0  2  1
9  4  20  30  40  60  0  2  1
Found 3 low-variance columns.
   0   1   2   3   4
0  2   0   3   4   0
1  7   4   6   0  12
2  5   6   0  12   0
3  6   0  12  16  24
4  0   0   0   0   0
5  8  12  18   0   0
6  9  14   0  28  42
7  0   0  24  32  48
8  0   0   0   0   0
9  4  20  30  40  60
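
The reduced frame above has lost its column names, because `transform` returns a plain array. A minimal sketch of one way to keep them, using the boolean mask from `get_support()` and a three-column subset of the data set:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.DataFrame({'A': [2, 7, 5, 6, 0, 8, 9, 0, 0, 4],
                     'G': [2, 2, 2, 2, 2, 2, 3, 2, 2, 2],
                     'H': [1, 7, 1, 1, 1, 1, 1, 1, 1, 1]})

vt = VarianceThreshold(threshold=5.0)
reduced = vt.fit_transform(data)

# get_support() returns a boolean mask over the original columns.
kept_columns = data.columns[vt.get_support()]
df_named = pd.DataFrame(reduced, columns=kept_columns, index=data.index)

print(vt.variances_)               # population variances (ddof=0) per column
print(df_named.columns.tolist())
```

Printing `vt.variances_` also shows which variance each column was judged on, which is useful when choosing the threshold.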


### Mean, standard deviation, and variance

In [95]:
# d = df.describe().reindex(['mean', 'std', 'variance'])
# d.loc['variance'] = d.loc['std']**2                

pd.DataFrame([df.mean(), df.std(), df.var()], index=['Mean', 'Std. dev', 'variance'])

                  0          1          2           3           4
Mean            4.1        5.6        9.3        13.2        18.6
Std. dev   3.446415   7.290786  11.175867   15.208185   23.400855
variance  11.877778  53.155556      124.9  231.288889       547.6
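
The commented-out alternative in the cell above builds the same table from `describe()`. A minimal sketch, assuming a two-column subset of the data set, that takes the `mean` and `std` rows and derives the variance as the square of the standard deviation:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 7, 5, 6, 0, 8, 9, 0, 0, 4],
                   'B': [0, 4, 6, 0, 0, 12, 14, 0, 0, 20]})

# Keep only the mean and standard deviation rows from describe().
summary = df.describe().loc[['mean', 'std']]

# Variance is the square of the standard deviation (both use ddof=1 here).
summary.loc['variance'] = summary.loc['std'] ** 2

print(summary)
```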


## Summary:

### * In statistics, variance measures how far the values in a data set lie from their mean (it is the average squared deviation). In other words, it indicates how dispersed the values are.
### * vt.fit(data) computes the variance of each column, and vt.transform(data) returns the data with the low-variance columns dropped. fit_transform(X, y=None, **fit_params) fits the transformer to X (and y, with optional parameters fit_params) and returns the transformed version of X in one call.
### * data.var() returns the variance of each column (axis=0, the default), and data.var(axis=1) returns the variance of each row. pandas computes the sample variance (ddof=1) by default, whereas np.var and VarianceThreshold use the population variance (ddof=0).
### * The variance of a specific column is data.loc[:, "column"].var().
### * (data != 0) returns a boolean mask that is True wherever a value is non-zero.
### * (data != 0).any(axis=0) returns, for each column, whether it contains at least one non-zero value.
### * data.loc[:, (data != 0).any(axis=0)] keeps only the columns with at least one non-zero value, so columns that are entirely zero are removed.
### * vt.get_support(indices=True) returns the indices of the columns that are kept.
### * The list comprehension over enumerate(data.columns) maps those indices back to the names of the kept columns.
### * np.setdiff1d(data.columns, variance) returns the names of the columns that are removed for low variance.
### * "Found {0} low-variance columns.".format(len(removed)) reports how many columns were removed.
### * d[:, vt.variances_ > threshold] (on a NumPy array d) or vt.transform(data) removes the low-variance columns.
### * pd.DataFrame([df.mean(), df.std(), df.var()], index=['Mean', 'Std. dev', 'variance']) computes the mean, standard deviation and variance of each column and collects them in a new dataframe.


