## What is Dimensionality Reduction?

We generate a tremendous amount of data every day. By some widely cited estimates, roughly 90% of the world's data was generated in just the last few years. The numbers are truly mind-boggling. Below are a few examples of the kind of data being collected:

* Facebook collects data on what you like, share, and post, the places you visit, the restaurants you like, etc.
* Your smartphone apps collect a lot of personal information about you
* Amazon collects data on what you buy, view, click, etc. on its site
* Casinos keep track of every move each customer makes

As data generation and collection keep increasing, visualizing the data and drawing inferences from it become more and more challenging. We can easily visualize 2D data, but once the number of dimensions starts to increase, it becomes impossible to plot such data directly on a 2D plane.


Here are some of the benefits of applying dimensionality reduction to a dataset:

* Space required to store the data is reduced as the number of dimensions comes down
* Fewer dimensions lead to less computation and training time
* Some algorithms do not perform well when the data has a large number of dimensions, so the dimensions need to be reduced for the algorithm to be useful
* It takes care of multicollinearity by removing redundant features. For example, suppose you have two variables: ‘time spent on treadmill in minutes’ and ‘calories burnt’. These variables are highly correlated, since the more time you spend running on a treadmill, the more calories you burn. Hence, there is little point in storing both when one of them carries most of the information you need
* It helps in visualizing data. As discussed earlier, it is very difficult to visualize data in higher dimensions so reducing our space to 2D or 3D may allow us to plot and observe patterns more clearly

Dimensionality reduction can be done in two different ways:

1. By only keeping the most relevant variables from the original dataset (this technique is called feature selection)
2. By finding a smaller set of new variables, each being a combination of the input variables, containing basically the same information as the input variables (this technique is usually called feature extraction, and it is what people often mean by dimensionality reduction in the narrower sense)


![alt text](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/08/Screenshot-from-2018-08-10-12-07-43.png)

**1) Missing Value Ratio**

Suppose you’re given a dataset. What would be your first step? You would naturally want to explore the data before building a model. While exploring, you find that your dataset has some missing values. Now what? You would try to find out the reason for these missing values and then either impute them or drop the affected variables entirely (using appropriate methods).

What if we have too many missing values (say more than 50%)? Should we impute the missing values or drop the variable? We can set a threshold value and if the percentage of missing values in any variable is more than that threshold, we will drop the variable.
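As a rough illustration of the thresholding idea, here is a minimal pandas sketch. The column names, the toy values, and the 50% cutoff are all assumptions made up for this example; the right threshold depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: names and values are invented for illustration.
df = pd.DataFrame({
    "item_weight": [9.3, np.nan, np.nan, np.nan, 8.9, np.nan],
    "item_mrp": [249.8, 48.3, 141.6, 182.1, 53.9, 51.4],
    "outlet_size": ["Medium", np.nan, "Medium", "Small", "High", "Medium"],
})

# Percentage of missing values in each column.
missing_pct = df.isnull().mean() * 100

# Assumed threshold: drop any column with more than 50% missing values.
threshold = 50
keep_cols = missing_pct[missing_pct <= threshold].index.tolist()
df_reduced = df[keep_cols]

print(missing_pct.round(1))
print(keep_cols)
```

Columns that survive the filter (here, those with only a few gaps) would still need imputation before modelling.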

**2) Low Variance Filter**

Consider a variable in our dataset where all the observations have the same value, say 1. If we use this variable, do you think it can improve the model we will build? The answer is no, because this variable will have zero variance.

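So, we need to calculate the variance of each variable we are given and then drop the variables that have low variance compared to the other variables in our dataset. The reason is that a variable that barely changes carries little information that could help explain the target variable. Because variance depends on the scale of a variable, it is usually sensible to bring the variables onto a comparable scale before comparing their variances.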
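The filter can be sketched in a few lines of pandas. The feature names, the values, and the cutoff of 0.001 below are assumptions chosen for illustration, and the toy columns are deliberately on a similar scale so that their variances are comparable.

```python
import pandas as pd

# Hypothetical features that already share a similar (0 to 1) scale.
df = pd.DataFrame({
    "feature_a": [0.10, 0.90, 0.40, 0.70, 0.20],  # varies a lot
    "feature_b": [0.50, 0.51, 0.50, 0.49, 0.50],  # barely varies
    "feature_c": [1.0, 1.0, 1.0, 1.0, 1.0],       # constant, zero variance
})

# Sample variance of each column.
variances = df.var()

# Assumed cutoff: keep only columns whose variance exceeds 0.001.
threshold = 0.001
keep_cols = variances[variances > threshold].index.tolist()
df_reduced = df[keep_cols]

print(variances)
print(keep_cols)
```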

**3) High Correlation Filter**

High correlation between two variables means they have similar trends and are likely to carry similar information. This can bring down the performance of some models drastically (linear and logistic regression models, for instance). We can calculate the correlation between each pair of independent numerical variables. If the correlation coefficient crosses a certain threshold value, we can drop one of the two variables (which variable to drop is highly subjective and should always be decided keeping the domain in mind).

As a general guideline, we should keep those variables which show a decent or high correlation with the target variable.
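One way to sketch this filter is shown below, reusing the treadmill example from earlier. The data is synthetic, the 0.9 threshold is an assumption, and the rule of dropping the later column of each correlated pair is only one simple convention; in practice the choice should follow domain knowledge and correlation with the target.

```python
import numpy as np
import pandas as pd

# Synthetic data: calories burnt is almost a linear function of treadmill time.
rng = np.random.default_rng(0)
minutes = rng.uniform(10, 60, size=100)
df = pd.DataFrame({
    "treadmill_minutes": minutes,
    "calories_burnt": minutes * 9.5 + rng.normal(0, 5, size=100),
    "resting_heart_rate": rng.normal(65, 8, size=100),
})

# Absolute pairwise correlations between the numerical variables.
corr = df.corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Assumed threshold: flag the later column of any pair correlated above 0.9.
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)
```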

**4) Use of Random Forest**

Random Forest is one of the most widely used algorithms for feature selection. It comes packaged with in-built feature importance so you don’t need to program that separately. This helps us select a smaller subset of features.



    # Illustrative snippet: assumes `model` is an already-fitted random forest
    # and `df` is the DataFrame of features it was trained on.
    import numpy as np
    import matplotlib.pyplot as plt

    features = df.columns
    importances = model.feature_importances_
    indices = np.argsort(importances)[-10:]  # indices of the 10 most important features
    plt.title('Feature Importances')
    plt.barh(range(len(indices)), importances[indices], color='b', align='center')
    plt.yticks(range(len(indices)), [features[i] for i in indices])
    plt.xlabel('Relative Importance')
    plt.show()

The resulting plot looks something like this:

![alt text](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/08/Screenshot-from-2018-07-26-23-28-54.png)



Alternatively, we can use scikit-learn's SelectFromModel to do this. It keeps the features whose importance weights are above a threshold.

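    # Illustrative snippet: assumes `model` is a random forest estimator, `df` holds
    # the features, and `train.Item_Outlet_Sales` is the target column.
    from sklearn.feature_selection import SelectFromModel

    feature = SelectFromModel(model)
    # Fit the selector and keep only the columns whose importance passes the threshold
    Fit = feature.fit_transform(df, train.Item_Outlet_Sales)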
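For a version that runs end to end, here is a small self-contained sketch. It uses synthetic data from `make_regression` instead of the article's dataset, and the feature names `f0` to `f7` and the forest settings are assumptions for illustration. With a random forest, SelectFromModel's default threshold is the mean feature importance.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic regression data: 8 features, of which only 3 drive the target.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

# Fit a random forest and read off its built-in feature importances.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(df, y)
importances = pd.Series(model.feature_importances_, index=df.columns)
print(importances.sort_values(ascending=False))

# SelectFromModel refits a clone of the estimator and keeps the features
# whose importance is at or above the threshold (the mean importance here).
selector = SelectFromModel(RandomForestRegressor(n_estimators=50, random_state=0))
selector.fit(df, y)
selected = df.columns[selector.get_support()].tolist()
print(selected)
```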

**5) Backward Feature Elimination**

Follow the below steps to understand and use the ‘Backward Feature Elimination’ technique:

* We first take all the n variables present in our dataset and train the model using them
* We then calculate the performance of the model
* Now, we compute the performance of the model after eliminating each variable (n times), i.e., we drop one variable every time and train the model on the remaining n-1 variables
* We identify the variable whose removal has produced the smallest (or no) change in the performance of the model, and then drop that variable
* Repeat this process until no variable can be dropped.


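This method can be used when building Linear Regression or Logistic Regression models. Let’s look at its Python implementation. Note that the snippet uses scikit-learn's RFE (recursive feature elimination), a closely related approach that ranks features by the model's coefficients or feature importances and removes the weakest ones, rather than re-scoring the model after each removal: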

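    # Illustrative snippet: assumes `df` holds the features and
    # `train.Item_Outlet_Sales` is the target column.
    from sklearn.linear_model import LinearRegression
    from sklearn.feature_selection import RFE

    lreg = LinearRegression()
    rfe = RFE(lreg, n_features_to_select=10)  # keep the 10 highest-ranked features
    rfe = rfe.fit(df, train.Item_Outlet_Sales)  # fit() returns the selector itself
    df_selected = rfe.transform(df)  # data reduced to the selected features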

We need to specify the estimator and the number of features to select; the fitted selector then tells us which variables were kept. We can also check the ranking of the variables through the `ranking_` attribute of the fitted RFE object.
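The step-by-step procedure listed at the start of this section (retrain after dropping each variable, then remove the one that matters least) can also be sketched directly. The function name `backward_elimination`, the tolerance `tol`, the use of cross-validated R² as the performance measure, and the synthetic data are all assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def backward_elimination(X, y, estimator, tol=1e-3, cv=5):
    """Drop one feature at a time while the CV score falls by at most `tol`."""
    features = list(X.columns)
    best_score = cross_val_score(estimator, X[features], y, cv=cv).mean()
    while len(features) > 1:
        # Score the model with each remaining feature left out in turn.
        trials = []
        for candidate in features:
            remaining = [f for f in features if f != candidate]
            score = cross_val_score(estimator, X[remaining], y, cv=cv).mean()
            trials.append((score, candidate))
        # The removal that leaves the highest score hurts the model least.
        score, candidate = max(trials)
        if score < best_score - tol:
            break  # every possible removal degrades the model too much
        features.remove(candidate)
        best_score = score
    return features


# Synthetic data: only x0 and x1 influence the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{i}" for i in range(5)])
y = 3.0 * X["x0"] - 2.0 * X["x1"] + rng.normal(scale=0.5, size=200)

selected = backward_elimination(X, y, LinearRegression())
print(selected)
```

Each pass retrains the model once per remaining feature, which is why this approach becomes expensive as the number of variables grows.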

**6) Forward Feature Selection**

This is the opposite process of the Backward Feature Elimination we saw above. Instead of eliminating features, we try to find the best features which improve the performance of the model. This technique works as follows:

* We start with a single feature. Essentially, we train the model n times, using each feature separately
* The variable giving the best performance is selected as the starting variable
* Then we repeat this process and add one variable at a time. The variable that produces the highest increase in performance is retained
* We repeat this process until no significant improvement is seen in the model’s performance
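The steps above can be sketched with scikit-learn's `SequentialFeatureSelector` (available in scikit-learn 0.24 and later). As a simplifying assumption, this example stops after a fixed number of features (2) instead of stopping when the improvement becomes insignificant, and it uses synthetic data in which only `x2` and `x4` influence the target.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: only x2 and x4 influence the target.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{i}" for i in range(5)])
y = 4.0 * X["x2"] + 2.0 * X["x4"] + rng.normal(scale=0.5, size=200)

# Greedy forward selection: at each step, add the feature that gives the
# best cross-validated score when combined with those already chosen.
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,  # assumed stopping point for this sketch
    direction="forward",
    cv=5,
)
sfs.fit(X, y)

selected = X.columns[sfs.get_support()].tolist()
print(selected)
```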

NOTE: Both Backward Feature Elimination and Forward Feature Selection are time-consuming and computationally expensive. In practice, they are only used on datasets that have a small number of input variables.

The techniques we have seen so far are generally used when we do not have a very large number of variables in our dataset. These are more or less feature selection techniques.