### INTRODUCTION 
The main goal of dimensionality reduction techniques is to reduce the number of dimensions by removing redundant and dependent
features. They do this by transforming the features from a higher-dimensional space, which may suffer from the curse of dimensionality, to a space with fewer dimensions.
LDA stands for Linear Discriminant Analysis. It is used both as a multiclass classification algorithm and as a dimensionality
reduction technique.
As a dimensionality reduction technique, it reduces the number of input features (columns) in the given dataset. LDA focuses on maximizing separability among 
known categories.
LDA is a supervised approach, which means the training data must carry class labels.

### WHAT IS DIMENSIONALITY REDUCTION?
Dimensionality reduction techniques are important in applications of Machine Learning, Data Mining, Bioinformatics, and Information Retrieval. The main aim is to remove redundant and dependent features by projecting the dataset onto a lower-dimensional space.

In simple terms, they reduce the number of dimensions (i.e. variables) in a particular dataset while retaining most of the information.

Multi-dimensional data comprises multiple features that are often correlated with one another. With dimensionality reduction, we can plot multi-dimensional data in just 2 or 3 dimensions. This allows the data to be presented in a clear manner that can be easily understood by a layperson.

### HOW ARE LDA MODELS REPRESENTED

The representation of LDA is straightforward. The model consists of statistical properties of your data, calculated for each class. For a single input variable, these are the mean and the variance of the variable for each class. For multiple variables, the same properties are calculated over the multivariate Gaussian, namely the class means and the covariance matrix.

These statistical properties are estimated from your data and plugged into the LDA equation to make predictions. The estimated values are what is saved to file as the LDA model.
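
As a rough illustration of this representation, the sketch below fits scikit-learn's `LinearDiscriminantAnalysis` and inspects the estimated properties. The dataset sizes and the `*_demo` names are assumptions made only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Small synthetic problem; the sizes are arbitrary and chosen for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=1, n_classes=3,
                                     n_clusters_per_class=1, random_state=0)

# store_covariance=True asks the default 'svd' solver to keep the covariance matrix.
lda_demo = LinearDiscriminantAnalysis(store_covariance=True).fit(X_demo, y_demo)

print(lda_demo.priors_.shape)      # one prior probability per class
print(lda_demo.means_.shape)       # one mean vector per class
print(lda_demo.covariance_.shape)  # a single covariance matrix shared by all classes
```

The fitted attributes `priors_`, `means_` and `covariance_` are the quantities a prediction relies on, which is why storing them is enough to store the model.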

### WORKING
The two basic principles on which LDA works can be summarized as follows:

<ul>
    <li> Maximizing the distance between the means of the given classes.</li>
    <li> Minimizing the variation (which LDA calls scatter) within each class.</li>
</ul>
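
The two principles above can be sketched with plain NumPy. This is a minimal illustration of the textbook formulation using within-class and between-class scatter matrices, not necessarily how a given library implements LDA. The helper `scatter_matrices` and the toy data are hypothetical names introduced for this example.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (S_W, S_B): within-class and between-class scatter matrices."""
    overall_mean = X.mean(axis=0)
    n_features = X.shape[1]
    S_W = np.zeros((n_features, n_features))
    S_B = np.zeros((n_features, n_features))
    for label in np.unique(y):
        X_c = X[y == label]
        mean_c = X_c.mean(axis=0)
        centered = X_c - mean_c
        # Scatter of the samples around their own class mean (to be minimized).
        S_W += centered.T @ centered
        # Scatter of the class mean around the overall mean (to be maximized).
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += X_c.shape[0] * (diff @ diff.T)
    return S_W, S_B

# Toy data: two Gaussian blobs in 2D (an assumption for illustration only).
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

S_W, S_B = scatter_matrices(X_toy, y_toy)

# Directions that maximize between-class scatter relative to within-class
# scatter are the leading eigenvectors of inv(S_W) @ S_B.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
w = eigvecs[:, order[0]].real  # most discriminative direction
print(w)
```

Projecting the data onto `w` (for example `X_toy @ w`) gives the one-dimensional representation in which the two class means are far apart relative to the spread inside each class.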
    


### ADVANTAGES
<ul>
    <li> Helps in reducing computational costs for a given classification task. </li>
    <li> Helpful in avoiding overfitting by minimizing the error in parameter estimation. </li>
</ul>


### LIMITATIONS
<ul>
    <li> LDA fails to find the lower-dimensional space if the number of dimensions is much higher than 
         the number of samples in the data matrix, because the within-class scatter matrix then becomes singular. </li>
    <li> <b>LDA produces at most C-1 feature projections (where C is the number of classes):</b> If the classification error estimates establish that more features are needed, some other method must be employed to provide those additional features. </li> 
    <li> <b>LDA is a parametric method since it assumes unimodal Gaussian likelihoods:</b> If the distributions are significantly non-Gaussian, the LDA projections will not be able to preserve any complex structure of the data that may be needed for classification. </li>
    <li> LDA will fail when the discriminatory information is not in the mean, but rather in the variance of the data. </li>
</ul>
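
The C-1 limit can be checked directly. The sketch below uses the Iris dataset (3 classes, 4 features) purely as an illustration, and the behaviour shown for an oversized `n_components` is what recent scikit-learn releases do; older versions may only warn.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_iris, y_iris = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

# With C = 3 classes, at most C - 1 = 2 discriminant components exist.
projected = LinearDiscriminantAnalysis().fit_transform(X_iris, y_iris)
print(projected.shape)

# Asking for more components than C - 1 is rejected when fitting.
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X_iris, y_iris)
    too_many_ok = True
except ValueError as err:
    too_many_ok = False
    print("ValueError:", err)
```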

### DIFFERENCE BETWEEN PCA AND LDA
<ul>
    <li> PCA is an unsupervised algorithm while LDA is a supervised algorithm. </li>
    <li> The goal of PCA is to find the directions of maximum variance in the given dataset, while LDA focuses on 
         maximizing separability among known categories. </li>
    <li> LDA tends to perform better than PCA on multi-class classification tasks. However, PCA can perform better when the sample size is comparatively small. An example would be comparisons of classification accuracy in image classification.</li>
</ul>
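
The supervised versus unsupervised distinction shows up directly in how the two transformers are fitted. The sketch below uses the Iris dataset as an assumed example; the `*_cmp` names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_cmp, y_cmp = load_iris(return_X_y=True)

# PCA is unsupervised: the class labels are never passed in.
X_pca = PCA(n_components=2).fit_transform(X_cmp)

# LDA is supervised: fitting requires the class labels.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_cmp, y_cmp)

print(X_pca.shape, X_lda.shape)
```

Both projections have two columns, but the PCA axes follow overall variance while the LDA axes are chosen to separate the labelled classes.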
  

### Extensions of LDA for cases where non-linear discriminant analysis is needed:
<ul>
    <li><b>Quadratic Discriminant Analysis (QDA):</b> Each class uses its own estimate of variance (or covariance when there are multiple input variables). </li>
    <li><b>Flexible Discriminant Analysis (FDA):</b> Uses non-linear combinations of inputs, such as splines.</li>
    <li><b>Regularized Discriminant Analysis (RDA):</b> Introduces regularization into the estimate of the variance (actually covariance), moderating the influence of different variables on LDA. </li>
</ul>
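
Of these extensions, QDA is available in scikit-learn as `QuadraticDiscriminantAnalysis`. The sketch below is an illustration on made-up data in which both classes share the same mean and differ only in variance, the situation where LDA is expected to struggle. The data and the variable names are assumptions for this example.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

# Two classes centred at the origin: one tight, one widely spread.
rng = np.random.default_rng(42)
X_var = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
                   rng.normal(0.0, 4.0, size=(500, 2))])
y_var = np.array([0] * 500 + [1] * 500)

# Training accuracy of each model on the same data.
lda_acc = LinearDiscriminantAnalysis().fit(X_var, y_var).score(X_var, y_var)
qda_acc = QuadraticDiscriminantAnalysis().fit(X_var, y_var).score(X_var, y_var)

print("LDA accuracy:", lda_acc)
print("QDA accuracy:", qda_acc)
```

Because QDA estimates a separate covariance per class, it can draw a curved boundary around the tight class, while a single linear boundary cannot separate two classes with identical means.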

### Applications:
<ul>
    <li><b>Face Recognition:</b> In the field of Computer Vision, face recognition is a very popular application in which each face is represented by a very large number of pixel values. Linear discriminant analysis (LDA) is used here to reduce the number of features to a more manageable number before the process of classification. Each of the new dimensions generated is a linear combination of pixel values, which form a template. The linear combinations obtained using Fisher’s linear discriminant are called Fisher faces.</li>
    <li><b>Medical:</b> In this field, Linear discriminant analysis (LDA) is used to classify the patient's disease state as mild, moderate or severe based upon the patient's various parameters and the medical treatment they are undergoing. This helps doctors to intensify or reduce the pace of the treatment.</li>
    <li><b>Customer Identification:</b> Suppose we want to identify the type of customers who are most likely to buy a particular product in a shopping mall. By conducting a simple question-and-answer survey, we can gather the features of the customers. Here, Linear discriminant analysis will help us to identify and select the features that describe the characteristics of the group of customers most likely to buy that particular product in the shopping mall. </li>
</ul>

In [1]:
# Import necessary modules
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

In [2]:
# Generating data for our problem
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7, n_classes=10)

In [3]:
print(X)

[[ 2.3548775  -1.69674567  1.6193882  ... -3.33390362  2.45147541
  -1.23455205]
 [ 2.0204277  -1.62734821 -2.27697377 ... -0.28274722 -7.28166465
  -0.91070347]
 [-1.02400669  1.01276423  1.05505825 ...  3.83923974 -1.63530582
   3.96050914]
 ...
 [-0.36448581 -0.2996303   2.21875138 ... -1.11303373  3.67576043
  -1.44164572]
 [ 0.05614772  1.87270289 -2.63165761 ... -3.07434527  2.31606352
   1.65068838]
 [ 1.09853247  1.61067335  2.7977282  ... -1.62233539 14.09727916
   2.27215759]]


In [4]:
print(y)

[9 4 4 1 2 7 9 4 6 9 1 3 9 5 6 4 9 2 0 4 4 8 0 9 3 8 6 0 0 7 8 3 8 5 5 9 7
 1 3 1 8 7 6 7 4 6 5 6 2 8 3 1 7 0 7 0 4 5 1 6 6 8 3 3 3 2 5 8 0 5 6 2 7 1
 3 8 7 2 8 0 6 8 2 9 9 8 2 2 5 6 9 4 6 1 4 9 0 9 7 8 7 2 0 8 8 1 7 9 8 1 6
 3 9 2 5 5 3 9 1 1 2 0 8 0 7 2 0 5 0 1 8 0 2 2 1 3 0 2 5 9 3 8 8 7 7 0 4 3
 4 0 8 5 3 7 4 4 5 5 0 4 5 1 1 4 3 5 2 6 4 2 1 6 6 9 5 3 7 0 1 5 9 5 4 7 3
 9 0 0 1 9 5 2 2 7 4 0 1 2 6 4 3 7 6 8 8 3 0 8 3 0 5 5 1 7 8 6 8 4 1 1 3 1
 9 9 3 2 8 1 8 1 7 6 1 1 7 6 5 3 4 1 6 5 2 8 6 5 9 0 6 9 6 2 3 4 8 3 8 4 8
 1 0 4 0 6 3 8 4 6 9 2 9 2 7 5 1 6 3 0 6 9 3 7 1 5 5 0 9 4 8 9 2 8 2 9 3 2
 3 5 1 8 0 0 6 5 1 3 2 8 1 8 6 7 3 2 5 9 6 2 3 4 2 1 5 4 2 9 5 1 7 1 6 0 2
 8 6 1 8 7 8 0 3 0 7 1 0 4 1 4 2 0 8 2 7 9 7 3 5 1 5 1 4 9 0 4 9 5 0 8 9 1
 2 9 2 8 4 7 9 7 8 4 9 1 7 8 3 7 3 1 9 6 2 9 4 6 8 1 1 5 6 3 0 3 4 8 7 5 6
 9 9 6 4 8 2 6 2 7 0 6 8 0 7 0 1 5 7 3 2 2 3 5 2 1 3 6 9 5 4 3 6 7 9 2 4 2
 5 0 2 7 4 5 9 1 3 1 8 6 3 1 1 3 3 7 6 6 5 5 8 7 8 9 5 0 7 4 6 3 9 4 7 4 3
 5 7 6 7 6 7 9 7 7 7 7 5 

In [5]:
# Shape of input data
print(X.shape)

(1000, 20)


In [6]:
# Defining model
def model(n_components=None, solver='svd', shrinkage=None, priors=None,
          store_covariance=False, tol=0.0001, covariance_estimator=None):
    '''
    Build an unfitted LinearDiscriminantAnalysis estimator with the given settings.

    n_components : int, default=None
                   Number of components (<= min(n_classes - 1, n_features)) for dimensionality reduction. 
                   If None, will be set to min(n_classes - 1, n_features). This parameter only affects the transform method.
                   
    solver : {‘svd’, ‘lsqr’, ‘eigen’}, default=’svd’
             Solver to use, possible values:
             ‘svd’: Singular value decomposition (default). Does not compute the covariance matrix, therefore this solver is recommended for data with a large number of features.
             ‘lsqr’: Least squares solution. Can be combined with shrinkage or custom covariance estimator.
             ‘eigen’: Eigenvalue decomposition. Can be combined with shrinkage or custom covariance estimator.
      
    shrinkage : ‘auto’ or float, default=None
                Shrinkage parameter, possible values:
                None: no shrinkage (default).
                ‘auto’: automatic shrinkage using the Ledoit-Wolf lemma.
                float between 0 and 1: fixed shrinkage parameter.
                This should be left to None if covariance_estimator is used. Note that shrinkage works only with ‘lsqr’ and ‘eigen’ solvers.

    priors : array-like of shape (n_classes,), default=None
             The class prior probabilities. By default, the class proportions are inferred from the training data.

    store_covariance : bool, default=False
                       If True, explicitly compute the weighted within-class covariance matrix when solver is ‘svd’. The matrix is always computed
                       and stored for the other solvers.

    tol : float, default=1.0e-4
          Absolute threshold for a singular value of X to be considered significant, used to estimate the rank of X. Dimensions whose singular 
          values are non-significant are discarded. Only used if solver is ‘svd’.

    covariance_estimator : covariance estimator, default=None
                           If not None, covariance_estimator is used to estimate the covariance matrices instead of relying on the empirical 
                           covariance estimator (with potential shrinkage). The object should have a fit method and a covariance_ attribute 
                           like the estimators in sklearn.covariance. If None, the shrinkage parameter drives the estimate.
    '''
    lda = LinearDiscriminantAnalysis(solver=solver, shrinkage=shrinkage, 
                                     priors=priors, n_components=n_components, store_covariance=store_covariance, 
                                     tol=tol, covariance_estimator=covariance_estimator)
    return lda

In [7]:
# Fitting the data to the model
lda = model(n_components=5)  # keep 5 discriminant components
lda.fit(X,y)



LinearDiscriminantAnalysis(n_components=5)

In [8]:
# Transforming data 
data_transformation = lda.transform(X)

In [9]:
print(data_transformation)

[[-1.34250698 -0.410752   -0.05284109 -2.52177124 -2.32197387]
 [ 0.92569633 -0.92633682 -0.29396574 -0.62144384  1.61682597]
 [-0.36265323 -0.87103112  1.53812275  0.59888243 -1.39423894]
 ...
 [-0.83323633  0.06686996  0.39414469 -0.5877848   0.11590941]
 [ 0.47329133  1.42040541  0.49439799 -0.05149737 -0.53591346]
 [-1.04969306  0.27613461 -0.13712968 -1.21293132 -0.22775809]]


In [10]:
# Notice the reduction of dimensions
print(data_transformation.shape)

(1000, 5)
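
The first cell imports `Pipeline`, `GaussianNB`, `cross_val_score` and `RepeatedStratifiedKFold` without using them. One plausible use, sketched below as an assumption rather than as the notebook's intended continuation, is to chain the LDA projection with a Naive Bayes classifier and estimate accuracy with repeated stratified cross-validation. The fold counts and the `*_cv` names are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Same synthetic dataset settings as in the notebook above.
X_cv, y_cv = make_classification(n_samples=1000, n_features=20, n_informative=15,
                                 n_redundant=5, random_state=7, n_classes=10)

# LDA reduces 20 features to 5 components, then Naive Bayes classifies them.
pipeline = Pipeline(steps=[
    ("lda", LinearDiscriminantAnalysis(n_components=5)),
    ("nb", GaussianNB()),
])

# 10 folds repeated 3 times gives 30 accuracy estimates.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X_cv, y_cv, scoring="accuracy", cv=cv)

print("Accuracy: %.3f (%.3f)" % (scores.mean(), scores.std()))
```

Putting the LDA step inside the pipeline means the projection is re-fitted on each training fold, so the held-out fold never influences the transformation.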
