# Dimensionality Reduction

### Challenges with high-dimensional data:
- Visualization becomes difficult
- All the variables might not be important
- More computation time
- More complex models
- Difficulties in data exploration


### Common Dimensionality Reduction Techniques:
- <b>Feature Selection</b>

	- Missing value ratio
	- Low Variance
	- High Correlation
	- Backward Feature Elimination
	- Forward Feature Selection

- <b>Feature Extraction</b>
	
	- Factor Analysis
	- Principal Component Analysis

Feature selection keeps a subset of the original features, while feature extraction creates new features from the existing ones.
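
The difference can be seen in a minimal sketch on made-up data. The column names `f1` to `f4` are hypothetical, and PCA (one of the extraction techniques listed above) is used through scikit-learn's `PCA` class as one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy data, made up for illustration: 100 rows, 4 numeric features.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "f4"])

# Feature selection: keep a subset of the original columns, unchanged.
selected = toy[["f1", "f3"]]

# Feature extraction: build new columns as combinations of all original ones.
pca = PCA(n_components=2)
extracted = pd.DataFrame(pca.fit_transform(toy), columns=["pc1", "pc2"])

print(selected.columns.tolist())
print(extracted.columns.tolist())
```

Both results have two columns, but only `selected` still contains original measurements; `pc1` and `pc2` are new variables.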


In [2]:
#importing libraries 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import mean_squared_error as mse

## Missing value ratio

<b>Steps:</b>
	
	- For each variable, calculate the ratio of missing values
	- Ratio of missing values = (Number of missing values) / (Total number of observations) * 100
	- Set a threshold, say 70%
	- Drop every variable whose ratio of missing values exceeds this threshold
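
The steps above can be sketched compactly with pandas. This is a minimal sketch on made-up data; the helper name `drop_high_missing` and the toy columns are hypothetical. It relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

def drop_high_missing(df, threshold=70.0):
    """Drop columns whose percentage of missing values exceeds `threshold`."""
    # Mean of the boolean mask = fraction of missing values per column.
    missing_pct = df.isna().mean() * 100
    keep = missing_pct[missing_pct <= threshold].index
    return df[keep], missing_pct

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],             # 0% missing
    "b": [np.nan, np.nan, np.nan, 1.0],    # 75% missing
    "c": [1.0, np.nan, 3.0, 4.0],          # 25% missing
})

reduced, missing_pct = drop_high_missing(toy, threshold=70.0)
print(missing_pct)
print(reduced.columns.tolist())
```

With a 70% threshold, only column `b` is dropped.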

<b>How to deal with the remaining variables that still have missing values?</b>

Try to find the reason for the missing data (an error, PII, a customer not filling in the information). Once we know the reason, we can impute the missing values using:

	- Statistical measures like mean, median and mode
	- Train model to predict missing values
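
A minimal sketch of the first option, imputing with statistical measures. The values below are made up; only the column names are borrowed from the dataset. Which statistic to use (mean, median or mode) is a judgment call that depends on the variable.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "humidity": [81.0, 80.0, np.nan, 75.0, 75.0],   # numeric variable
    "season": [1.0, 1.0, 2.0, np.nan, 1.0],         # categorical code
})

imputed = toy.copy()

# Numeric column: fill with the median (the mean is a common alternative).
imputed["humidity"] = imputed["humidity"].fillna(imputed["humidity"].median())

# Categorical column: fill with the most frequent value (the mode).
imputed["season"] = imputed["season"].fillna(imputed["season"].mode().iloc[0])

print(imputed)
```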

In [102]:
data_mv = pd.read_csv('missing_value_ratio.csv')

In [103]:
data_mv.shape

(12980, 10)

In [104]:
data_mv.isna().sum()

ID               0
season           9
holiday       6295
workingday       9
weather          4
temp             0
atemp            0
humidity         5
windspeed     5324
count            0
dtype: int64

In [105]:
cols = data_mv.columns

In [106]:
# Ratio of missing value = (Num of missing values) / (Total num of obs) * 100
ratios = []
for i in (cols):
    ratios_temp = (data_mv[i].isna().sum() / data_mv.shape[0])*100
    ratios.append(ratios_temp)

print("This is ratio out of loop ", ratios)

This is ratio out of loop  [0.0, 0.06933744221879815, 48.497688751926034, 0.06933744221879815, 0.030816640986132512, 0.0, 0.0, 0.038520801232665644, 41.01694915254237, 0.0]


In [49]:
cols[1], ratios[1]

('season', 0.06933744221879815)

In [59]:
# set the threshold

thresh = 40
indexlist = []
for i in range(len(cols)):
    if ratios[i] < thresh:
        indexlist.append(cols[i]) 

# list of columns whose missing-value ratio is below the threshold

indexlist

['ID', 'season', 'workingday', 'weather', 'temp', 'atemp', 'humidity', 'count']

In [107]:
new_data = data_mv[indexlist]

In [61]:
new_data.head()

Unnamed: 0,ID,season,workingday,weather,temp,atemp,humidity,count
0,AB101,1.0,0.0,1.0,9.84,14.395,81.0,16
1,AB102,1.0,0.0,,9.02,13.635,80.0,40
2,AB103,1.0,,1.0,9.02,13.635,80.0,32
3,AB104,,,1.0,9.84,14.395,75.0,13
4,AB105,1.0,0.0,,9.84,14.395,,1


In [62]:
#Recalculate missing value percentage

new_data.isnull().sum() / len(new_data) * 100

ID            0.000000
season        0.069337
workingday    0.069337
weather       0.030817
temp          0.000000
atemp         0.000000
humidity      0.038521
count         0.000000
dtype: float64

In [108]:
# shape of new and original data
new_data.shape, data_mv.shape

((12980, 8), (12980, 10))

## Low Variance Filter
Variance measures the spread of the data: it tells us how far, on average, the points lie from the mean.<br>
	Eg: if all the values in a column are the same number, then the variance is 0.

A variable that barely varies carries little information, so it is unlikely to help explain the target variable.

We can set a threshold value for variance as well. Any column whose variance falls below the threshold is a candidate for dropping. 
<br>
Variance can be computed only for numeric columns, not categorical columns. For a categorical column, if a single category accounts for more than 95% of the rows, the feature can be treated as having low variance.
### <b>IMP </b> - Variance depends on the range of the variable. Therefore, we need to bring the variables to a comparable scale before applying this technique.
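
A minimal sketch of why scale matters, on made-up data. It assumes scikit-learn's `MinMaxScaler` and `VarianceThreshold` as one possible way to rescale each column and filter on variance; the column names and the 0.06 threshold are illustrative only. Note that `sklearn.preprocessing.normalize` rescales each row (sample) to unit norm by default, so a per-column scaler is used in this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

n = 20
# Toy data, made up for illustration.
toy = pd.DataFrame({
    "wide_range": np.linspace(0, 1000, n),       # same shape as small_range, other units
    "small_range": np.linspace(0, 1, n),
    "mostly_constant": [1.0] * (n - 1) + [2.0],  # one value repeated in 95% of rows
})

# Raw variances differ by orders of magnitude only because of the units.
print(toy.var())

# Rescale every column to [0, 1] so that the variances become comparable.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)
print(scaled.var())

# Keep only the columns whose scaled variance exceeds the (made-up) threshold.
selector = VarianceThreshold(threshold=0.06)
selector.fit(scaled)
kept = scaled.columns[selector.get_support()].tolist()
print(kept)
```

After scaling, `wide_range` and `small_range` have the same variance, and only `mostly_constant` falls below the threshold.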


In [64]:
from sklearn.preprocessing import normalize

In [66]:
data_lv = pd.read_csv('low_variance_filter.csv')
data_lv.head()

Unnamed: 0,ID,temp,atemp,humidity,windspeed,count
0,AB101,9.84,14.395,81,0.0,16
1,AB102,9.02,13.635,80,0.0,40
2,AB103,9.02,13.635,80,0.0,32
3,AB104,9.84,14.395,75,0.0,13
4,AB105,9.84,14.395,75,0.0,1


In [72]:
data_lv.shape

(12980, 6)

In [67]:
#first check if there is any missing values
data_lv.isna().sum()

ID           0
temp         0
atemp        0
humidity     0
windspeed    0
count        0
dtype: int64

In [70]:
# The variance filter applies only to numeric columns, so check the dtypes first
data_lv.dtypes

ID            object
temp         float64
atemp        float64
humidity       int64
windspeed    float64
count          int64
dtype: object

In [73]:
# We can drop the ID column: it is only a unique identifier for each row and provides no information about the dependent variable

data_lv = data_lv.drop('ID', axis=1)

In [74]:
data_lv.shape

(12980, 5)

In [78]:
cols_lv = data_lv.columns
cols_lv

Index(['temp', 'atemp', 'humidity', 'windspeed', 'count'], dtype='object')

In [75]:
# normalize the data
# (note: sklearn's normalize rescales each row to unit L2 norm by default, i.e. axis=1)
data_lv_normalize = normalize(data_lv)

In [81]:
# normalize returns a NumPy array. Convert it back to a DataFrame

data_lv_normalize = pd.DataFrame(data_lv_normalize, columns=cols_lv)
data_lv_normalize.head()

Unnamed: 0,temp,atemp,humidity,windspeed,count
0,0.116607,0.170585,0.959872,0.0,0.189604
1,0.099203,0.14996,0.87985,0.0,0.439925
2,0.102851,0.155473,0.912202,0.0,0.364881
3,0.126009,0.184339,0.960431,0.0,0.166475
4,0.127781,0.186932,0.97394,0.0,0.012986


In [82]:
variances = data_lv_normalize.var()

In [83]:
variances

temp         0.005877
atemp        0.007977
humidity     0.093491
windspeed    0.008756
count        0.111977
dtype: float64

In [92]:
variances_features = []
for i in range(len(variances)):
    if (variances.iloc[i] >= 0.006):
        variances_features.append(cols_lv[i])

variances_features

['atemp', 'humidity', 'windspeed', 'count']

In [97]:
new_data_lv = data_lv[variances_features]
new_data_lv.head()

Unnamed: 0,atemp,humidity,windspeed,count
0,14.395,81,0.0,16
1,13.635,80,0.0,40
2,13.635,80,0.0,32
3,14.395,75,0.0,13
4,14.395,75,0.0,1


In [98]:
new_data_lv.var()

atemp           73.137484
humidity       398.549141
windspeed       69.322053
count        25843.419864
dtype: float64

In [99]:
# shape of new and original data
new_data_lv.shape, data_lv.shape

((12980, 4), (12980, 5))

## High Correlation Filter

Correlation: <br>
   - Measures the strength of the linear relationship between two variables
   - The higher the magnitude of the correlation, the stronger the relationship

So, if we think that two variables are correlated, we can: <br>
   - Draw a scatter plot and look for a trend in it
   - Verify it with the Pearson correlation coefficient, whose magnitude should be high

Highly correlated variables convey similar information, so it is not necessary to keep all of them. They also lead to the **multicollinearity** problem we saw in Linear Regression (in github notebook - https://github.com/Neelam-Singhal/Linear-Models/blob/master/LinearModel.ipynb)

Steps:

   - Calculate the correlation between all pairs of independent variables
   - If the correlation of a pair crosses a certain threshold (eg: 0.5 - 0.6), drop one of the two variables
   - Drop the one which has the lower correlation with our target variable
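
The three steps above, including the comparison against the target, can be sketched as follows. This is a minimal sketch on made-up data; the helper name `correlation_filter` is hypothetical, and the column names are borrowed from the dataset only for readability.

```python
import numpy as np
import pandas as pd

def correlation_filter(X, y, threshold=0.6):
    """Drop one feature from every pair whose |corr| exceeds `threshold`,
    keeping the member that is more correlated with the target `y`."""
    corr = X.corr().abs()
    target_corr = X.corrwith(y).abs()
    to_drop = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue
            if corr.loc[a, b] > threshold:
                # Drop the member of the pair that tells us less about y.
                to_drop.add(a if target_corr[a] < target_corr[b] else b)
    dropped = sorted(to_drop)
    return X.drop(columns=dropped), dropped

# Toy data, made up for illustration: atemp is a noisy copy of temp.
rng = np.random.default_rng(0)
temp = rng.normal(size=200)
X_toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp + 0.3 * rng.normal(size=200),
    "humidity": rng.normal(size=200),
})
y_toy = pd.Series(3 * temp + 0.1 * rng.normal(size=200))

reduced, dropped = correlation_filter(X_toy, y_toy, threshold=0.6)
print(dropped)
print(reduced.columns.tolist())
```

Here `temp` and `atemp` form the only highly correlated pair, and `atemp` is dropped because it is the less correlated of the two with the target.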


In [101]:
data_hc = pd.read_csv('high_correlation_fllter.csv')

In [110]:
data_hc.head()

Unnamed: 0,ID,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
0,AB101,1,0,0,1,9.84,14.395,81,0.0,16
1,AB102,1,0,0,1,9.02,13.635,80,0.0,40
2,AB103,1,0,0,1,9.02,13.635,80,0.0,32
3,AB104,1,0,0,1,9.84,14.395,75,0.0,13
4,AB105,1,0,0,1,9.84,14.395,75,0.0,1


In [109]:
data_hc.shape

(12980, 10)

In [111]:
# Check if any null
data_hc.isna().sum()

ID            0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
count         0
dtype: int64

In [112]:
#drop the target variable
data_hc = data_hc.drop('count', axis=1)

In [113]:
data_hc.shape

(12980, 9)

In [169]:
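# ID is not numeric, so restrict the correlation to numeric columns (requires pandas >= 1.5)
corr_matrix = data_hc.corr(numeric_only=True).abs()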

In [170]:
# selecting upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

In [171]:
upper

Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed
season,,0.010959,0.014343,0.013005,0.39456,0.397765,0.181712,0.135762
holiday,,,0.248558,0.018406,0.025104,0.032903,0.02952,0.021646
workingday,,,,0.052788,0.060589,0.06484,0.028026,0.001986
weather,,,,,0.093655,0.094877,0.432497,0.01112
temp,,,,,,0.991839,0.048478,0.008669
atemp,,,,,,,0.031606,0.049997
humidity,,,,,,,,0.296975
windspeed,,,,,,,,


In [182]:
to_drop = [columns for columns in upper.columns if any(upper[columns] > 0.6)]

In [183]:
to_drop

['atemp']

In [184]:
# dropping the variable and creating new dataset
new_data_hc = data_hc.drop(columns=to_drop)

In [185]:
# shape of new and original data
new_data_hc.shape, data_hc.shape

((12980, 8), (12980, 9))

## Backward Feature Elimination

**Assumptions:**

    - No missing values in dataset
    - Variance of the variables is high
    - Low correlation between the independent variables

**Steps:**

    - Train the model using all the variables (n)
    - Calculate the performance of the model
    - Eliminate one variable at a time, train the model on the remaining variables (n-1), and calculate its performance each time
    - Identify the variable whose removal affects the performance the least, and drop it
    - Repeat until no more variables can be dropped
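
The loop described above can be sketched by hand as follows. This is a minimal sketch on made-up data, not the `mlxtend` implementation: the helper names `cv_score` and `backward_elimination` are hypothetical, the model is scored with 5-fold cross-validated negative MSE, and the loop stops at a fixed number of features (`k_features`), which is an assumption made to keep the stopping rule simple.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_score(X, y, features):
    """Mean 5-fold negative MSE of a linear model on the given features."""
    scores = cross_val_score(LinearRegression(), X[features], y, cv=5,
                             scoring="neg_mean_squared_error")
    return scores.mean()

def backward_elimination(X, y, k_features):
    """Greedily remove features until only `k_features` remain."""
    features = list(X.columns)
    while len(features) > k_features:
        # Score every subset obtained by removing exactly one feature.
        scores = {f: cv_score(X, y, [c for c in features if c != f])
                  for f in features}
        # The feature whose removal leaves the best score matters the least.
        least_useful = max(scores, key=scores.get)
        features.remove(least_useful)
    return features

# Toy data, made up for illustration: only x1 and x2 drive the target.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(300, 4)),
                     columns=["x1", "x2", "noise1", "noise2"])
y_toy = 5 * X_toy["x1"] + 2 * X_toy["x2"] + 0.5 * rng.normal(size=300)

selected = backward_elimination(X_toy, y_toy, k_features=2)
print(selected)
```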

In [204]:
data_bf = pd.read_csv('backward_feature_elimination.csv')

In [205]:
data_bf.head()

Unnamed: 0,ID,season,holiday,workingday,weather,temp,humidity,windspeed,count
0,AB101,1,0,0,1,9.84,81,0.0,16
1,AB102,1,0,0,1,9.02,80,0.0,40
2,AB103,1,0,0,1,9.02,80,0.0,32
3,AB104,1,0,0,1,9.84,75,0.0,13
4,AB105,1,0,0,1,9.84,75,0.0,1


In [206]:
# checking missing values in the data
data_bf.isnull().sum()

ID            0
season        0
holiday       0
workingday    0
weather       0
temp          0
humidity      0
windspeed     0
count         0
dtype: int64

In [208]:
# creating the training data
X = data_bf.drop(['ID', 'count'], axis=1)
y = data_bf['count']

In [211]:
X.shape, y.shape

((12980, 7), (12980,))

In [210]:
!pip install mlxtend



In [193]:
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression

In [198]:
lr = LinearRegression()
sfs_backward = sfs(lr, forward=False, verbose=1, k_features=5, scoring='neg_mean_squared_error')

In [212]:
sfs_backward = sfs_backward.fit(X, y)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    0.2s finished
Features: 6/5[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.1s finished
Features: 5/5

In [213]:
feature_names = list(sfs_backward.k_feature_names_)
feature_names

['holiday', 'workingday', 'weather', 'temp', 'humidity']

In [215]:
# .copy() gives an independent DataFrame, so adding a column does not raise SettingWithCopyWarning
new_data_bf = data_bf[feature_names].copy()
new_data_bf['count'] = y


In [216]:
new_data_bf.head()

Unnamed: 0,holiday,workingday,weather,temp,humidity,count
0,0,0,1,9.84,81,16
1,0,0,1,9.02,80,40
2,0,0,1,9.02,80,32
3,0,0,1,9.84,75,13
4,0,0,1,9.84,75,1


## Forward Feature Selection

**Steps:**

    - Train n models, each using one of the n features individually, and check the performance
    - Choose the variable that gives the best performance
    - Repeat the process, adding one of the remaining variables at a time to the chosen ones
    - Retain the variable that produces the highest improvement
    - Repeat the entire process until there is no significant improvement in the model’s performance
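
The steps above can be sketched by hand as follows. This is a minimal sketch on made-up data, not the `mlxtend` implementation: the helper names `cv_score` and `forward_selection` are hypothetical, and "no significant improvement" is assumed here to mean a relative gain of less than 5% in cross-validated MSE (the `min_improvement` parameter).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_score(X, y, features):
    """Mean 5-fold negative MSE of a linear model on the given features."""
    scores = cross_val_score(LinearRegression(), X[features], y, cv=5,
                             scoring="neg_mean_squared_error")
    return scores.mean()

def forward_selection(X, y, min_improvement=0.05):
    """Greedily add the feature that improves the score the most."""
    selected, remaining = [], list(X.columns)
    best_score = -np.inf
    while remaining:
        # Try adding each remaining feature to the ones already selected.
        scores = {f: cv_score(X, y, selected + [f]) for f in remaining}
        candidate = max(scores, key=scores.get)
        gain = scores[candidate] - best_score
        # Stop once the best candidate no longer helps enough.
        if selected and gain < min_improvement * abs(best_score):
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected

# Toy data, made up for illustration: only x1 and x2 drive the target.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(300, 4)),
                     columns=["x1", "x2", "noise1", "noise2"])
y_toy = 5 * X_toy["x1"] + 2 * X_toy["x2"] + 0.5 * rng.normal(size=300)

selected = forward_selection(X_toy, y_toy)
print(selected)
```

The features are returned in the order in which they were added, so the most useful single feature comes first.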


In [217]:
data_ff = pd.read_csv('forward_feature_selection.csv')

In [219]:
# creating the training data
X_ff = data_ff.drop(['ID', 'count'], axis=1)
y_ff = data_ff['count']

In [220]:
# calling the linear regression model
lreg = LinearRegression()
sfs1 = sfs(lreg, k_features=4, forward=True, verbose=2, scoring='neg_mean_squared_error')

In [221]:
sfs1 = sfs1.fit(X_ff, y_ff)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    0.0s finished

[2021-06-26 10:50:42] Features: 1/4 -- score: -23364.95550318101[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.1s finished

[2021-06-26 10:50:42] Features: 2/4 -- score: -21454.89933921974[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s finished

[2021-06-26 10:50:42] Features: 3/4 -- score: -21458.27878856438[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 

In [222]:
feat_names = list(sfs1.k_feature_names_)
print(feat_names)

['holiday', 'workingday', 'temp', 'humidity']


In [223]:
# creating a new dataframe using the above variables and adding the target variable
# .copy() avoids the SettingWithCopyWarning when the new column is added
new_data_ff = data_ff[feat_names].copy()
new_data_ff['count'] = data_ff['count']


In [224]:
# shape of new and original data
new_data_ff.shape, data_ff.shape

((12980, 5), (12980, 9))