# Dimensionality Reduction

### Challenges with high-dimensional data:
- Visualization becomes difficult
- All the variables might not be important
- More computation time
- More complex models
- Difficulties in data exploration


### Common Dimensionality Reduction Techniques:
- <b>Feature Selection</b>

	- Missing value ratio
	- Low Variance
	- High Correlation
	- Backward Feature Elimination
	- Forward Feature Selection

- <b>Feature Extraction</b>
	
	- Factor Analysis
	- Principal Component Analysis

Feature selection keeps a subset of the original features while feature extraction creates new features using the existing features.


## Missing value ratio

<b>Steps:</b>

- Calculate the ratio of missing values for each variable
- Ratio of missing values = (number of missing values) / (total number of observations) * 100
- Set a threshold, say 70%
- Drop all the variables whose ratio of missing values is above this threshold
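The steps above can be sketched with pandas as follows. This is a minimal illustration on a made-up `toy` DataFrame (the column names and values are hypothetical, not taken from the insurance data), using the 70% threshold mentioned in the steps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is 20% missing, "c" is 80% missing.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [1.0, np.nan, 3.0, 4.0, 5.0],
    "c": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

# Ratio of missing values per column, as a percentage.
missing_ratio = toy.isnull().mean() * 100

# Keep only the columns whose missing ratio does not exceed the threshold.
threshold = 70  # percent
kept_columns = missing_ratio[missing_ratio <= threshold].index
toy_reduced = toy[kept_columns]

print(missing_ratio)
print(list(toy_reduced.columns))
```

Here `isnull().mean()` works because the mean of a boolean column is the fraction of `True` values, which is the missing-value ratio.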

<b>How to deal with the remaining variables which still have missing values in them?</b>

Try to find the reason for the missing data (an error, PII, the customer not filling in the info). Once we have the reason, we can try to impute those missing values by:

- Using statistical measures like mean, median and mode
- Training a model to predict the missing values
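As a rough sketch of the first option, the snippet below fills a numeric column with its median and a categorical column with its mode. The `toy` frame is hypothetical: the column names are borrowed from the insurance data for familiarity, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
toy = pd.DataFrame({
    "bmi": [27.9, np.nan, 33.0, 22.7],                         # numeric
    "region": ["southwest", "southeast", None, "southeast"],   # categorical
})

imputed = toy.copy()

# Numeric column: fill with the median of the observed values.
imputed["bmi"] = imputed["bmi"].fillna(imputed["bmi"].median())

# Categorical column: fill with the most frequent observed value (the mode).
imputed["region"] = imputed["region"].fillna(imputed["region"].mode()[0])

print(imputed)
```

The median is often preferred over the mean for skewed numeric columns, because it is less affected by outliers.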

## Low Variance Filter
Variance measures the spread of the data: it tells us how far the points are, on average, from the mean.<br>
	E.g. if all the values in a column are the same number, then the variance is 0.

A variable with very low variance is nearly constant, so it carries little information and is unlikely to have much impact on the target variable.

We can set a threshold value for variance as well. Any column whose variance is below the threshold is a candidate for dropping.<br>
### <b>IMP</b> - Variance is range dependent. Therefore, we need to normalize the variables to a common range (for example, min-max scaling to [0, 1]) before applying this technique.
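A minimal sketch of this filter is shown below, assuming min-max scaling as the normalization step and a made-up threshold of 0.01. Both choices are assumptions for illustration, and the `toy` values are hypothetical. Standardizing to unit variance would not work here, since it makes every variance equal to 1.

```python
import pandas as pd

# Hypothetical toy data: "constant" never changes, the other columns do.
toy = pd.DataFrame({
    "age": [19, 18, 28, 33, 31],
    "charges": [16884.9, 1725.6, 4449.5, 21984.5, 3756.6],
    "constant": [7, 7, 7, 7, 7],
})

# Min-max scale every column to [0, 1] so the variances are comparable.
# A constant column has range 0; dividing by 1 instead avoids a 0/0 division.
col_range = toy.max() - toy.min()
scaled = (toy - toy.min()) / col_range.replace(0, 1)

# Variance of each scaled column.
variances = scaled.var()

# Drop the columns whose scaled variance falls below the threshold.
threshold = 0.01  # hypothetical value
low_variance_columns = variances[variances < threshold].index.tolist()
reduced = toy.drop(columns=low_variance_columns)

print(variances)
print(low_variance_columns)
```

Without the scaling step, `charges` would have a variance many orders of magnitude larger than `age` simply because of its units, so a single threshold could not be applied to both.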


## High Correlation Filter

Correlation:

- Measures the relationship between two variables
- The higher the magnitude of the correlation, the stronger the relationship

So, if we think that two variables are correlated, we can:

- Draw a scatter plot and look for a trend in it
- Verify it with the Pearson correlation coefficient, whose magnitude should be high

Highly correlated variables convey similar information, so it is not necessary to keep all of them. They also lead to the multicollinearity problem we saw in Linear Regression (in the GitHub notebook).

<b>Steps:</b>

- Calculate the correlation between all pairs of independent variables
- If the correlation of a pair crosses a certain threshold (e.g. 0.5 - 0.6), drop one variable of the pair
- Drop the one which has the lesser correlation with our target variable
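The steps above can be sketched roughly as follows. The data is synthetic (the names `x1`, `x2`, `x3`, `target` are hypothetical), with `x2` built as a noisy copy of `x1` so that the pair is highly correlated, and the threshold of 0.6 is taken from the range suggested in the steps.

```python
import numpy as np
import pandas as pd

# Synthetic data: x2 is a noisy copy of x1, x3 is independent of both.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.5, size=n)
x3 = rng.normal(size=n)
target = 3 * x1 + 2 * x3 + rng.normal(scale=0.1, size=n)
toy = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "target": target})

features = ["x1", "x2", "x3"]
threshold = 0.6

# Absolute Pearson correlation between features, and of each feature with the target.
corr = toy[features].corr().abs()
target_corr = toy[features].corrwith(toy["target"]).abs()

to_drop = set()
for i, a in enumerate(features):
    for b in features[i + 1:]:
        if a in to_drop or b in to_drop:
            continue  # one variable of this pair is already gone
        if corr.loc[a, b] > threshold:
            # Drop whichever of the two is less correlated with the target.
            to_drop.add(a if target_corr[a] < target_corr[b] else b)

reduced = toy.drop(columns=sorted(to_drop))
print(sorted(to_drop))
print(list(reduced.columns))
```

The absolute value is used because a strong negative correlation is just as redundant as a strong positive one.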


## Backward Feature Elimination

<b>Assumptions:</b>

- No missing values in the dataset
- Variance of the variables is high
- Low correlation between the independent variables

<b>Steps:</b>

- Train the model using all the variables (n)
- Calculate the performance of the model
- Eliminate one variable and train the model on the remaining variables (n-1)
- Calculate the performance of the model trained on the reduced data
- Repeat this for each variable and identify the one whose elimination doesn’t impact the performance much; drop it
- Repeat until no more variables can be dropped
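A minimal sketch of these steps is given below, assuming a linear regression model, validation MSE as the performance measure, and a made-up stopping rule: stop when the best removal increases the error by more than 10%. The data is synthetic, and `validation_error` and `tolerance` are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import train_test_split

# Synthetic data: only f1 and f2 influence y; f3 and f4 are pure noise.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["f1", "f2", "f3", "f4"])
y = 5 * X["f1"] - 3 * X["f2"] + rng.normal(scale=0.5, size=n)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def validation_error(columns):
    """Fit on the training split with the given columns; return validation MSE."""
    model = LinearRegression().fit(X_train[columns], y_train)
    return mse(y_val, model.predict(X_val[columns]))

tolerance = 0.10  # hypothetical: allowed relative increase in error per removal
selected = list(X.columns)
current_error = validation_error(selected)

while len(selected) > 1:
    # Try removing each remaining feature in turn.
    trials = {
        col: validation_error([c for c in selected if c != col])
        for col in selected
    }
    # The feature whose removal hurts the least is the candidate to drop.
    candidate = min(trials, key=trials.get)
    if trials[candidate] > current_error * (1 + tolerance):
        break  # every possible removal degrades the model too much
    selected.remove(candidate)
    current_error = trials[candidate]

print(selected)
```

Each pass trains one model per remaining feature, so the cost grows quickly with the number of variables.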

## Forward Feature Selection

<b>Steps:</b>

- Train n models, one for each feature (n) individually, and check their performance
- Choose the variable that gives the best performance
- Repeat the process, adding one of the remaining variables at a time
- Retain the variable that produces the highest improvement
- Repeat the entire process until there is no significant improvement in the model’s performance
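The steps above can be sketched as follows, again assuming a linear regression model and validation MSE as the performance measure. Treating a relative improvement below 5% as "not significant" is an assumption made for this example. The data is synthetic, and `validation_error` and `min_improvement` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import train_test_split

# Synthetic data: only f1 and f2 influence y; f3 and f4 are pure noise.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["f1", "f2", "f3", "f4"])
y = 5 * X["f1"] - 3 * X["f2"] + rng.normal(scale=0.5, size=n)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def validation_error(columns):
    """Fit on the training split with the given columns; return validation MSE."""
    model = LinearRegression().fit(X_train[columns], y_train)
    return mse(y_val, model.predict(X_val[columns]))

min_improvement = 0.05  # hypothetical: required relative drop in error
selected = []
remaining = list(X.columns)

# Baseline with no features: always predict the training mean of y.
best_error = mse(y_val, np.full(len(y_val), y_train.mean()))

while remaining:
    # Try adding each remaining feature to the current selection.
    trials = {col: validation_error(selected + [col]) for col in remaining}
    candidate = min(trials, key=trials.get)
    if best_error - trials[candidate] < min_improvement * best_error:
        break  # no remaining feature gives a significant improvement
    selected.append(candidate)
    remaining.remove(candidate)
    best_error = trials[candidate]

print(selected)
```

Forward selection starts from an empty set, so it tends to be cheaper than backward elimination when only a few of the variables are useful.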


In [7]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import mean_squared_error as mse

In [8]:
df = pd.read_csv("insurance.csv")

In [9]:
df.head(3)

|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |


In [26]:
df.shape

(1338, 7)

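In [25]:
# Create train and test sets with random numbers, as an alternative to train_test_split.
# Each row gets a random integer from 1 to 5 (the upper bound 6 is exclusive).
df['randomnum'] = np.random.randint(1, 6, df.shape[0])

# Rows with 3, 4 or 5 go to train (about 60%); rows with 1 or 2 go to test (about 40%).
train_temp = df[df['randomnum'] >= 3]
test_temp = df[df['randomnum'] < 3]

# Remove the helper column from both splits and from the original DataFrame.
train_temp = train_temp.drop('randomnum', axis=1)
test_temp = test_temp.drop('randomnum', axis=1)

df = df.drop('randomnum', axis=1)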
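The split above changes on every run because the random numbers are not seeded. The sketch below shows a reproducible variant of the same idea on a small hypothetical `toy` frame. It uses NumPy's `default_rng` with a fixed seed (42 is an arbitrary choice) and keeps the random numbers in a separate array, so no helper column is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real DataFrame.
toy = pd.DataFrame({
    "age": np.arange(18, 28),
    "bmi": np.linspace(20.0, 38.0, 10),
})

# Seeded generator: the same split is produced on every run.
rng = np.random.default_rng(42)

# One random integer from 1 to 5 per row (the upper bound 6 is exclusive).
random_num = rng.integers(1, 6, size=len(toy))

# 3, 4, 5 -> train (about 60% on average); 1, 2 -> test (about 40% on average).
train = toy[random_num >= 3]
test = toy[random_num < 3]

print(len(train), len(test))
```

With only a handful of rows the actual proportions can be far from 60/40; they approach that ratio as the number of rows grows.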

In [31]:
train_temp

|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 5 | 31 | female | 25.740 | 0 | no | southeast | 3756.62160 |
| 7 | 37 | female | 27.740 | 3 | no | northwest | 7281.50560 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1332 | 52 | female | 44.700 | 3 | no | southwest | 11411.68500 |
| 1334 | 18 | female | 31.920 | 0 | no | northeast | 2205.98080 |
| 1335 | 18 | female | 36.850 | 0 | no | southeast | 1629.83350 |
| 1336 | 21 | female | 25.800 | 0 | no | southwest | 2007.94500 |
