# Introduction

[ML DATASETS](http://archive.ics.uci.edu/ml/)

## 1. Preprocessing Data

### Standardization, or mean removal and variance scaling

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, **_then scale it by dividing non-constant features by their standard deviation_**.

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

[Should I normalize/standardize/rescale the data](http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html "Should I normalize/standardize/rescale the data")

[**StandardScaler**](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)

The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set.
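The fit-on-train, transform-on-test pattern described above can be sketched on synthetic data. This is a minimal illustration, not the notebook's own dataset: the array `X_demo` and every variable name in it are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
rng = np.random.RandomState(0)
X_demo = np.column_stack([rng.normal(50, 5, 100), rng.normal(0, 1000, 100)])
X_demo_train, X_demo_test = X_demo[:70], X_demo[70:]

# fit() learns the per-feature mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_demo_train)
X_demo_train_ss = scaler.transform(X_demo_train)
X_demo_test_ss = scaler.transform(X_demo_test)  # reuses the training statistics

# The same transformation by hand: subtract the training mean, divide by the training std.
manual = (X_demo_test - X_demo_train.mean(axis=0)) / X_demo_train.std(axis=0)

print(X_demo_train_ss.mean(axis=0))  # close to 0 for each feature
print(X_demo_train_ss.std(axis=0))   # close to 1 for each feature
print(np.allclose(manual, X_demo_test_ss))
```

Because the test rows are scaled with the training statistics, their standard deviation is generally not exactly 1.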


[**MinMaxScaler**](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)

Scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. 
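As a small sketch of what min-max scaling computes, the snippet below compares `MinMaxScaler` with the formula written out by hand. The array `X_small` and the range `(0, 2)` are illustrative choices only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_small = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])  # hypothetical toy data
lo, hi = 0, 2

mms_demo = MinMaxScaler(feature_range=(lo, hi)).fit(X_small)
scaled = mms_demo.transform(X_small)

# By hand: map each column's [min, max] onto [lo, hi].
col_min, col_max = X_small.min(axis=0), X_small.max(axis=0)
manual_scaled = (X_small - col_min) / (col_max - col_min) * (hi - lo) + lo

print(scaled)
print(np.allclose(scaled, manual_scaled))
```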


In [None]:
# scale
sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)

# StandardScaler
class sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)

# StandardScaler implements the fit/transform API; the scale function does not

In [None]:
# MinMaxScaler
class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)

In [1]:
# Coding in here
import numpy as np 
import pandas as pd 

In [134]:
df  = pd.read_csv("forestfires.csv")

In [4]:
df1 = df.loc[:,"FFMC":"rain"]

In [6]:
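# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split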

In [8]:
#70% train
# 30% test
X_train, X_test, y_train, y_test  = train_test_split(df1,df['area'],train_size = 0.7)

In [10]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

In [11]:
ss.fit(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [12]:
X_train_ss = ss.transform(X_train)
X_test_ss = ss.transform(X_test)

In [22]:
X_test_ss.std(axis = 0)

array([ 0.60462456,  1.02987456,  0.99204461,  0.91540587,  0.96692585,
        0.91336288,  0.99149453,  4.72647151])

In [23]:
from sklearn.preprocessing import MinMaxScaler

In [24]:
mms = MinMaxScaler((0,2))

In [25]:
mms.fit(X_train)

MinMaxScaler(copy=True, feature_range=(0, 2))

In [27]:
mms.transform(X_train)

array([[ 1.88129032,  1.88785047,  1.91456219, ...,  0.68235294,
         0.84705882,  0.        ],
       [ 1.90451613,  0.38283143,  1.00401227, ...,  0.56470588,
         1.05882353,  0.        ],
       [ 1.9483871 ,  0.80373832,  1.35284399, ...,  0.61176471,
         0.30588235,  0.        ],
       ..., 
       [ 1.89419355,  0.76220145,  1.52513571, ...,  0.75294118,
         0.        ,  0.        ],
       [ 2.        ,  1.20733818,  1.54330895, ...,  0.63529412,
         0.30588235,  0.        ],
       [ 1.88129032,  0.68605054,  1.08189757, ...,  0.58823529,
         0.09411765,  0.        ]])

###  Normalization

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.
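To make the link between unit norm and similarity concrete, here is a minimal sketch on a made-up array (`X_rows` is hypothetical). After L2 normalization, the dot product of two rows equals their cosine similarity.

```python
import numpy as np
from sklearn.preprocessing import normalize

X_rows = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])  # hypothetical samples

# Scale each row (axis=1) to unit L2 norm.
X_unit = normalize(X_rows, norm='l2', axis=1)
row_norms = np.linalg.norm(X_unit, axis=1)

# Pairwise dot products of unit vectors are cosine similarities.
cos_sim = X_unit @ X_unit.T

print(row_norms)
print(cos_sim)
```

Rows 0 and 2 point in the same direction, so their similarity is 1 even though their original lengths differ.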


In [None]:
# normalize
sklearn.preprocessing.normalize(X, norm='l2', axis=1, copy=True, return_norm=False)

# Normalizer
class sklearn.preprocessing.Normalizer(norm='l2', copy=True)

# Normalizer provides transform API, the fit method does nothing

In [28]:
# Coding in here
from sklearn.preprocessing import Normalizer
norm = Normalizer()

In [29]:
norm.fit(X_train)

Normalizer(copy=True, norm='l2')

In [30]:
norm.transform(X_train)

array([[ 0.10529772,  0.31474362,  0.94158691, ...,  0.05057969,
         0.00517292,  0.        ],
       [ 0.20600115,  0.12560502,  0.96497619, ...,  0.08685454,
         0.01202601,  0.        ],
       [ 0.15642214,  0.19461439,  0.96493533, ...,  0.06808183,
         0.00365317,  0.        ],
       ..., 
       [ 0.13710091,  0.16553335,  0.97369931, ...,  0.06996463,
         0.00133975,  0.        ],
       [ 0.13875792,  0.25313945,  0.95457373, ...,  0.06058038,
         0.00317326,  0.        ],
       [ 0.18774904,  0.20537613,  0.9557574 , ...,  0.08198648,
         0.00266456,  0.        ]])

In [31]:
norm.transform(X_test)

array([[ 0.11969767,  0.13705642,  0.98206496, ...,  0.03497659,
         0.00284994,  0.        ],
       [ 0.12580536,  0.23996207,  0.96025943, ...,  0.05207629,
         0.0086337 ,  0.        ],
       [ 0.12845887,  0.17876839,  0.97015415, ...,  0.09892322,
         0.00508748,  0.        ],
       ..., 
       [ 0.1493367 ,  0.20856462,  0.96142429, ...,  0.08709988,
         0.00633454,  0.        ],
       [ 0.14113963,  0.27935169,  0.94452611, ...,  0.09399036,
         0.00755005,  0.        ],
       [ 0.11836265,  0.21227236,  0.96823249, ...,  0.0455241 ,
         0.00234124,  0.        ]])

### Binarization

Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multivariate Bernoulli distribution. For instance, this is the case for sklearn.neural_network.BernoulliRBM.
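A minimal sketch of thresholding, assuming a recent scikit-learn release where the input must be 2-D with shape `(n_samples, n_features)` and `threshold` is passed by keyword. The four values are the first `DC` entries of the dataset used in this notebook; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# One feature, four samples, shaped as a column.
values = np.array([[94.3], [669.1], [686.9], [77.5]])

# Values strictly greater than the threshold become 1, all others become 0.
binarized = Binarizer(threshold=548).fit_transform(values)

print(binarized.ravel())
```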

In [4]:
# Binarizer
class sklearn.preprocessing.Binarizer(threshold=0.0, copy=True)

In [128]:
df['DC'].mean()

547.9400386847191

In [129]:
from sklearn.preprocessing import Binarizer
bi = Binarizer(threshold=548)

In [131]:
DC_bi = bi.fit_transform(df['DC'])



In [132]:
DC_bi.shape

(1, 517)

In [136]:
df['DC_bi'] =DC_bi[0,:]

In [137]:
df.head()

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area,DC_bi
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0,1.0
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0,1.0
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0,0.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0,0.0


Return indices of half-open bins to which each value of x belongs.
```python
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
```



In [140]:
pd.cut(df['DC'],5)

0      (7.0473, 178.44]
1      (519.52, 690.06]
2      (519.52, 690.06]
3      (7.0473, 178.44]
4      (7.0473, 178.44]
5      (348.98, 519.52]
6      (348.98, 519.52]
7      (519.52, 690.06]
8       (690.06, 860.6]
9       (690.06, 860.6]
10      (690.06, 860.6]
11      (690.06, 860.6]
12     (519.52, 690.06]
13     (519.52, 690.06]
14      (690.06, 860.6]
15      (690.06, 860.6]
16     (7.0473, 178.44]
17     (519.52, 690.06]
18     (7.0473, 178.44]
19     (7.0473, 178.44]
20      (690.06, 860.6]
21      (690.06, 860.6]
22     (178.44, 348.98]
23     (519.52, 690.06]
24     (519.52, 690.06]
25     (519.52, 690.06]
26     (519.52, 690.06]
27     (519.52, 690.06]
28      (690.06, 860.6]
29      (690.06, 860.6]
             ...       
487    (519.52, 690.06]
488    (519.52, 690.06]
489    (519.52, 690.06]
490    (519.52, 690.06]
491    (519.52, 690.06]
492    (519.52, 690.06]
493    (519.52, 690.06]
494    (519.52, 690.06]
495    (519.52, 690.06]
496    (519.52, 690.06]
497    (519.52, 

### Encoding categorical features

We could encode categorical features as integers, but such an integer representation cannot be used directly with scikit-learn estimators: they expect continuous input and would interpret the categories as being ordered, which is often not desired.

One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is implemented in [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder). This estimator transforms each categorical feature with m possible values into m binary features, with only one active.
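A small sketch of one-hot encoding on a made-up column. The frame `toy` is hypothetical, and the snippet assumes a scikit-learn release (0.20 or later) in which `OneHotEncoder` accepts string categories directly and exposes `categories_`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"month": ["mar", "oct", "oct", "aug"]})  # hypothetical mini-frame

enc = OneHotEncoder(handle_unknown="ignore")
# The encoder returns a sparse matrix by default; densify it for display.
onehot = enc.fit_transform(toy[["month"]]).toarray()

print(enc.categories_)  # categories are sorted: aug, mar, oct
print(onehot)           # one column per category, exactly one 1 per row
```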

In [None]:
# OneHotEncoder
class sklearn.preprocessing.OneHotEncoder(n_values='auto', categorical_features='all', dtype=<type 'numpy.float64'>,
                                          sparse=True, handle_unknown='error')

Convert categorical variable into dummy/indicator variables
```python
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)
```

In [96]:
# Coding in here 

In [155]:
modelData = pd.get_dummies(data = df,columns=['month','day','DC_bi'])

###  Imputation of missing values

For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning. A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing missing values. However, this comes at the price of losing data which may be valuable (even though incomplete). **_A better strategy is to impute the missing values, i.e., to infer them from the known part of the data._**


The [Imputer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#sklearn.preprocessing.Imputer) class provides basic strategies for imputing missing values, either using the mean, the median or the most frequent value of the row or column in which the missing values are located. This class also allows for different missing values encodings.



**The imputation strategy:**
1. If “mean”, then replace missing values using the mean along the axis.
2. If “median”, then replace missing values using the median along the axis.
3. If “most_frequent”, then replace missing values using the most frequent value along the axis.
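The three strategies can be compared side by side. This sketch uses `sklearn.impute.SimpleImputer`, which replaced the `Imputer` class in later scikit-learn releases; it is a swapped-in equivalent rather than the class linked above, and the column `col` is made-up data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature with two missing entries (hypothetical values).
col = np.array([[np.nan], [669.1], [686.9], [np.nan], [686.9], [608.2]])

imputed = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    imputed[strategy] = imputer.fit_transform(col)

for strategy, filled_col in imputed.items():
    print(strategy, filled_col[0, 0])  # the value used to fill the first gap
```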

In [None]:
# Imputer
class sklearn.preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)


In [144]:
# Copy DC into DC_na, keeping values >= 600 and marking the rest as missing
df.loc[df['DC']<600,'DC_na'] = np.nan
df.loc[df['DC']>=600,'DC_na'] = df['DC']

In [146]:
from sklearn.preprocessing import Imputer
im =Imputer()

In [148]:
df.head()

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area,DC_bi,DC_na
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0,0.0,
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0,1.0,669.1
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0,1.0,686.9
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0,0.0,
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0,0.0,


In [149]:
im.fit_transform(df['DC_na'])



array([[ 669.1,  686.9,  608.2,  692.6,  698.6,  698.6,  713. ,  665.3,
         686.5,  699.6,  713.9,  664.2,  692.6,  724.3,  601.4,  668. ,
         686.5,  721.4,  728.6,  692.3,  709.9,  706.8,  718.3,  724.3,
         730.2,  669.1,  682.6,  686.9,  624.2,  647.1,  698.6,  735.7,
         692.3,  686.5,  706.4,  631.2,  654.1,  654.1,  661.3,  706.4,
         730.2,  691.8,  631.2,  638.8,  661.3,  668. ,  668. ,  668. ,
         692.3,  614.5,  713.9,  601.4,  631.2,  647.1,  654.1,  661.3,
         706.4,  706.4,  706.4,  728.6,  624.2,  601.4,  638.8,  704.4,
         601.4,  601.4,  601.4,  614.5,  647.1,  674.4,  631.2,  698.6,
         709.9,  704.4,  724.3,  608.2,  608.2,  680.7,  671.9,  692.3,
         691.8,  728.6,  673.8,  691.8,  685.2,  680.7,  686.5,  692.6,
         686.5,  671.9,  647.1,  685.2,  692.3,  721.4,  647.1,  721.4,
         654.1,  654.1,  668. ,  674.4,  704.4,  654.1,  699.6,  609.6,
         601.4,  686.5,  624.2,  624.2,  631.2,  735.7,  614.5, 

In [None]:
#Coding in here 
df.replace(np.inf,np.nan)

In [152]:
X_train_ss

array([[ 0.1795478 ,  2.57963786,  1.07729284, ..., -0.00977236,
         0.25997318, -0.10746504],
       [ 0.32641356, -0.84937862, -0.47609705, ..., -0.30874022,
         0.7615485 , -0.10746504],
       [ 0.60382666,  0.10961035,  0.1190067 , ..., -0.18915307,
        -1.02183043, -0.10746504],
       ..., 
       [ 0.26113989,  0.01497328,  0.4129348 , ...,  0.16960836,
        -1.74632813, -0.10746504],
       [ 0.93019502,  1.02916721,  0.44393818, ..., -0.1293595 ,
        -1.02183043, -0.10746504],
       [ 0.1795478 , -0.15852801, -0.34322544, ..., -0.24894665,
        -1.52340576, -0.10746504]])

In [151]:
df.fillna(-1)

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area,DC_bi,DC_na
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.00,0.0,-1.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.00,1.0,669.1
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.00,1.0,686.9
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.00,0.0,-1.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.00,0.0,-1.0
5,8,6,aug,sun,92.3,85.3,488.0,14.7,22.2,29,5.4,0.0,0.00,0.0,-1.0
6,8,6,aug,mon,92.3,88.9,495.6,8.5,24.1,27,3.1,0.0,0.00,0.0,-1.0
7,8,6,aug,mon,91.5,145.4,608.2,10.7,8.0,86,2.2,0.0,0.00,1.0,608.2
8,8,6,sep,tue,91.0,129.5,692.6,7.0,13.1,63,5.4,0.0,0.00,1.0,692.6
9,7,5,sep,sat,92.5,88.0,698.6,7.1,22.8,40,4.0,0.0,0.00,1.0,698.6


#### Some practical tips
1. Do not drop samples that contain missing values.
2. Fill the gaps with an out-of-range placeholder value such as -999 or -1.
3. Some useful tricks:
    * np.nan
    * np.inf
    * df.fillna
    * df.replace
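
The tricks listed above combine naturally: turn infinities into NaN first, then fill every gap with a placeholder. A minimal sketch on a made-up frame (`tips` is hypothetical):

```python
import numpy as np
import pandas as pd

tips = pd.DataFrame({"a": [1.0, np.inf, np.nan], "b": [np.nan, 2.0, -np.inf]})

# Treat positive and negative infinity as missing values.
cleaned = tips.replace([np.inf, -np.inf], np.nan)

# Mark every missing entry with a sentinel outside the normal data range.
filled = cleaned.fillna(-999)

print(filled)
```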
    

## 2. Feature selection

The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets.

SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. The features are considered unimportant and removed, if the corresponding coef_ or feature_importances_ values are below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”.
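As a sketch of the string-threshold heuristic, the example below fits a Lasso on synthetic data where only two of five features influence the target, then keeps the features whose absolute coefficient is at least the mean. The data, the `alpha` value and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X_sel = rng.normal(size=(200, 5))
# Only features 0 and 3 drive the target; the rest are noise.
y_sel = 3.0 * X_sel[:, 0] - 2.0 * X_sel[:, 3] + rng.normal(scale=0.1, size=200)

# threshold="mean" keeps features whose |coef_| is at least the mean |coef_|.
selector = SelectFromModel(Lasso(alpha=0.1), threshold="mean").fit(X_sel, y_sel)
kept = selector.get_support()
X_reduced = selector.transform(X_sel)

print(kept)
print(X_reduced.shape)
```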

In [None]:
# SelectFromModel
class sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False)

In [153]:
from sklearn.feature_selection import SelectFromModel

In [157]:
xdata = modelData.drop("area",axis = 1)
ydata = modelData['area']

In [160]:
from sklearn.linear_model import Lasso
lasso = Lasso()
lasso.fit(xdata.fillna(-999),ydata)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [163]:
xdata.head()

Unnamed: 0,X,Y,FFMC,DMC,DC,ISI,temp,RH,wind,rain,...,month_sep,day_fri,day_mon,day_sat,day_sun,day_thu,day_tue,day_wed,DC_bi_0.0,DC_bi_1.0
0,7,5,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,7,4,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,7,4,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3,8,6,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,8,6,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [161]:
lasso.coef_

array([  1.88324733e+00,   9.58928425e-02,  -1.25434649e-02,
         8.75375954e-02,  -2.88298355e-02,  -4.81990096e-01,
         7.97824030e-01,  -2.21426241e-01,   1.41927224e+00,
        -0.00000000e+00,   6.52084401e-03,   0.00000000e+00,
        -0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,  -0.00000000e+00,
        -0.00000000e+00,   0.00000000e+00,  -0.00000000e+00,
        -0.00000000e+00,   4.03266456e+00,  -0.00000000e+00,
        -0.00000000e+00,   7.53637559e+00,  -0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,  -0.00000000e+00,
         0.00000000e+00,  -0.00000000e+00])

In [173]:
model = SelectFromModel(lasso,prefit=True)

In [174]:
model.transform(xdata.fillna(-999))

array([[   7. ,    5. ,   86.2, ..., -999. ,    0. ,    0. ],
       [   7. ,    4. ,   90.6, ...,  669.1,    0. ,    0. ],
       [   7. ,    4. ,   90.6, ...,  686.9,    0. ,    1. ],
       ..., 
       [   7. ,    4. ,   81.6, ...,  665.6,    0. ,    0. ],
       [   1. ,    4. ,   94.4, ...,  614.7,    0. ,    1. ],
       [   6. ,    3. ,   79.5, ..., -999. ,    0. ,    0. ]])

####  L1-based feature selection

#### Tree-based feature selection

In [169]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()

In [170]:
rf.fit(xdata.fillna(-999),ydata)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [171]:
rf.feature_importances_

array([  2.48512870e-02,   9.99227611e-02,   2.67107651e-02,
         4.08986519e-02,   1.81259785e-02,   6.00357740e-02,
         4.10830196e-01,   4.97455827e-02,   4.26855284e-02,
         1.55210487e-06,   2.65642821e-02,   5.71645885e-04,
         2.95787712e-04,   1.96861294e-04,   4.12236375e-05,
         0.00000000e+00,   3.97521808e-02,   2.97067553e-04,
         1.86031754e-04,   3.72592817e-04,   0.00000000e+00,
         1.12109345e-03,   2.25167799e-03,   1.00845439e-03,
         1.60373994e-02,   2.12936823e-02,   2.98536498e-02,
         6.87283229e-02,   9.09556384e-03,   5.68486594e-03,
         1.50494190e-05,   2.82449056e-03])

In [175]:
model_rf = SelectFromModel(rf,prefit=True)

In [177]:
model_rf.transform(xdata.fillna(-999))

array([[   5. ,   26.2,    5.1, ...,    6.7,    0. ,    0. ],
       [   4. ,   35.4,    6.7, ...,    0.9,    0. ,    0. ],
       [   4. ,   43.7,    6.7, ...,    1.3,    0. ,    0. ],
       ..., 
       [   4. ,   56.7,    1.9, ...,    6.7,    0. ,    0. ],
       [   4. ,  146. ,   11.3, ...,    4. ,    0. ,    0. ],
       [   3. ,    3. ,    1.1, ...,    4.5,    0. ,    0. ]])

### Removing features with low variance

VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
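A minimal sketch of the default behaviour on a made-up matrix (`X_var` is hypothetical): the two constant columns are dropped and the two varying columns are kept.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Columns 0 and 3 are constant; columns 1 and 2 vary.
X_var = np.array([[0, 2, 0, 3],
                  [0, 1, 4, 3],
                  [0, 1, 1, 3]])

vt = VarianceThreshold()  # default threshold=0.0 removes zero-variance features
X_kept = vt.fit_transform(X_var)

print(vt.variances_)
print(vt.get_support())
print(X_kept)
```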

### Univariate feature selection
1. SelectKBest removes all but the k highest scoring features
2. SelectPercentile removes all but a user-specified highest scoring percentage of features

These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile):
* For regression: f_regression, mutual_info_regression
* For classification: chi2, f_classif, mutual_info_classif
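For a regression target, univariate selection might look like the sketch below. The synthetic data and names are assumptions; two of six features carry signal, and `SelectKBest` with `f_regression` is expected to pick them out.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.RandomState(0)
X_uni = rng.normal(size=(200, 6))
# Only features 1 and 4 influence the target.
y_uni = 4.0 * X_uni[:, 1] + 2.0 * X_uni[:, 4] + rng.normal(size=200)

# Score each feature independently with an F-test and keep the two best.
kbest = SelectKBest(score_func=f_regression, k=2).fit(X_uni, y_uni)
chosen = np.flatnonzero(kbest.get_support())
X_best = kbest.transform(X_uni)

print(kbest.scores_)
print(chosen)
```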

In [None]:
class sklearn.feature_selection.VarianceThreshold(threshold=0.0)

class sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, k=10)

class sklearn.feature_selection.SelectPercentile(score_func=<function f_classif>, percentile=10)

In [179]:
from sklearn.feature_selection import VarianceThreshold,SelectKBest,SelectPercentile
v = VarianceThreshold(0.5)

In [183]:
v.fit_transform(modelData.fillna(-999)).shape

(517, 11)

In [184]:
modelData.shape

(517, 33)

###  Feature selection as part of a pipeline

Pipeline of transforms with a final estimator.
Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’. A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting it to None.
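A short sketch of feature selection inside a pipeline, including the `step__parameter` naming convention. The step names (`scale`, `select`, `model`), the synthetic data and the parameter values are all illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_pipe = rng.normal(size=(150, 8))
y_pipe = 2.0 * X_pipe[:, 0] - 1.0 * X_pipe[:, 5] + rng.normal(scale=0.5, size=150)

# Intermediate steps are transformers; the last step is the estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression)),
    ("model", Lasso()),
])

# Parameters of a step are addressed as <step name>__<parameter name>.
pipe.set_params(select__k=3, model__alpha=0.05)
pipe.fit(X_pipe, y_pipe)
preds = pipe.predict(X_pipe[:5])

print(preds)
```

The same `select__k` and `model__alpha` keys can be used in a parameter grid for `GridSearchCV`, so that selection and model settings are cross-validated together.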

## 3. Dimensionality reduction

###  Principal component analysis (PCA)

PCA works by mapping the original dataset into a new space where the new column vectors of the matrix are each orthogonal. From a data analysis perspective, PCA transforms the covariance matrix of the data into column vectors that can "explain" certain percentages of the variance.
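The two properties mentioned above, orthogonal directions and a share of variance "explained" by each, can be checked on synthetic data. In this sketch `X_pca` is a made-up matrix in which the second column is nearly a multiple of the first.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
base = rng.normal(size=(300, 1))
X_pca = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(300, 1)),  # strongly correlated with column 0
    rng.normal(size=(300, 1)),                        # independent noise
])

pca = PCA(n_components=2).fit(X_pca)
Z = pca.transform(X_pca)

# Rows of components_ are the new axes; they are orthonormal.
gram = pca.components_ @ pca.components_.T

print(pca.explained_variance_ratio_)  # share of total variance per component
print(np.round(gram, 6))
```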

[Principal components analysis: the maximum-variance interpretation](http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020209.html)

[Principal components analysis: the least-squares-error interpretation](http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020216.html)

### Truncated singular value decomposition and latent semantic analysis

TruncatedSVD is very similar to PCA, but differs in that it works on sample matrices X directly instead of their covariance matrices.

Truncated SVD is different from regular SVDs in that it produces a factorization where the number of columns is equal to the specified truncation. For example, given an n x n matrix, SVD will produce matrices with n columns, whereas truncated SVD will produce matrices with the specified number of columns. This is how the dimensionality is reduced.
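The difference in output shape can be seen directly. This sketch compares NumPy's full SVD with `TruncatedSVD` on a random 20 x 20 matrix; the matrix and the choice of 3 components are arbitrary.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
X_svd = rng.rand(20, 20)  # hypothetical n x n matrix with n = 20

# Full SVD: U and Vt are both 20 x 20.
U, s, Vt = np.linalg.svd(X_svd)

# Truncated SVD: keep only the 3 leading components.
svd = TruncatedSVD(n_components=3, random_state=0)
X_3 = svd.fit_transform(X_svd)

print(U.shape, Vt.shape)
print(X_3.shape, svd.components_.shape)
```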

In [None]:
# PCA
class sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0,
                                iterated_power='auto', random_state=None)
# TruncatedSVD
class sklearn.decomposition.TruncatedSVD(n_components=2, algorithm='randomized', n_iter=5, random_state=None, 
                                         tol=0.0