Principal component analysis (PCA) is a linear dimensionality reduction method. It finds the eigenvalues and eigenvectors of the covariance matrix of the data. The eigenvectors are the principal components, and each eigenvalue is the amount (not the proportion) of the data set's total variance that its principal component explains. Dimensionality reduction is done by taking a (mean-centered) input vector and computing its dot product with a principal component, which yields a single real number: a component score. If you keep fewer principal components than there are features in the data set, the kept components will (often) explain less than 100% of the variance, but the transformed data will have fewer "features"/dimensions than the original data set. Note that this is a linear algorithm: component scores are linear combinations of the input features.
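
To make the eigenvalue/eigenvector description concrete, here is a minimal sketch that computes them directly with NumPy on the iris data and compares the result with scikit-learn's `PCA`. The variable names (`cov`, `eigvals`, `eigvecs`, `score`) are illustrative choices, and the comparison assumes scikit-learn's convention of dividing by `n_samples - 1` when computing variance.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Covariance matrix of the features (rowvar=False: columns are features).
cov = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in
# ascending order, so reorder them from largest to smallest.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca = PCA().fit(X)

# The eigenvalues should match the explained variance of each component.
print(np.allclose(eigvals, pca.explained_variance_))

# The eigenvectors should match the components up to sign
# (the sign of an eigenvector is arbitrary).
print(np.allclose(np.abs(eigvecs.T), np.abs(pca.components_)))

# A component score: dot product of a centered input with a component.
x_centered = X[0] - X.mean(axis=0)
score = x_centered @ eigvecs[:, 0]
print(score)
```

Because eigenvector signs are arbitrary, the score computed this way may differ in sign from what `pca.transform` returns; the magnitude should agree.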

Resources that can help you understand principal component analysis:
+ The math behind it and what it is: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
+ Interpreting the results: https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/multivariate/how-to/principal-components/interpret-the-results/key-results/
+ More on interpreting things and selecting the number of principal components: https://newonlinecourses.science.psu.edu/stat505/lesson/11/11.4
+ Importance of standardizing your data: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html

In [1]:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

In [7]:
X, y = load_iris(return_X_y=True)

In [8]:
print('X\'s shape: {}'.format(X.shape))
print('y\'s shape: {}'.format(y.shape))

X's shape: (150, 4)
y's shape: (150,)


In [6]:
help(PCA)

Help on class PCA in module sklearn.decomposition.pca:

class PCA(sklearn.decomposition.base._BasePCA)
 |  Principal component analysis (PCA)
 |  
 |  Linear dimensionality reduction using Singular Value Decomposition of the
 |  data to project it to a lower dimensional space.
 |  
 |  It uses the LAPACK implementation of the full SVD or a randomized truncated
 |  SVD by the method of Halko et al. 2009, depending on the shape of the input
 |  data and the number of components to extract.
 |  
 |  It can also use the scipy.sparse.linalg ARPACK implementation of the
 |  truncated SVD.
 |  
 |  Notice that this class does not support sparse input. See
 |  :class:`TruncatedSVD` for an alternative with sparse data.
 |  
 |  Read more in the :ref:`User Guide <PCA>`.
 |  
 |  Parameters
 |  ----------
 |  n_components : int, float, None or string
 |      Number of components to keep.
 |      if n_components is not set all components are kept::
 |  
 |          n_components == min(n_samples, n

In [9]:
pca = PCA().fit(X)

In [22]:
pca.components_

array([[ 0.36138659, -0.08452251,  0.85667061,  0.3582892 ],
       [ 0.65658877,  0.73016143, -0.17337266, -0.07548102],
       [-0.58202985,  0.59791083,  0.07623608,  0.54583143],
       [-0.31548719,  0.3197231 ,  0.47983899, -0.75365743]])

In [11]:
pca.explained_variance_

array([4.22824171, 0.24267075, 0.0782095 , 0.02383509])

In [12]:
pca.explained_variance_ratio_

array([0.92461872, 0.05306648, 0.01710261, 0.00521218])

Note, I did not standardize each feature in X. This matters because PCA works on the covariance matrix, so a feature with a large variance of its own dominates the leading principal components regardless of how it covaries with the other features. This affects the choice of the principal components, the explained variance, and so on. Therefore, if I standardize the data to have unit variance (and center it at 0), I would expect to get different results that treat every feature on an equal footing.
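
As a rough check of the point about unequal variances, the sketch below prints the per-feature variance of the raw iris data and confirms that the total equals the sum of the eigenvalues reported by `PCA`. The names `raw_var` and `total` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data
feature_names = iris.feature_names

# Sample variance of each raw feature (ddof=1 matches scikit-learn's PCA).
raw_var = X.var(axis=0, ddof=1)
for name, v in zip(feature_names, raw_var):
    print('{}: {:.3f}'.format(name, v))

# The eigenvalues sum to the total variance of the features.
total = PCA().fit(X).explained_variance_.sum()
print(np.isclose(raw_var.sum(), total))

# After standardizing, every feature has unit variance
# (StandardScaler uses the population variance, ddof=0).
X_ss = StandardScaler().fit_transform(X)
print(X_ss.var(axis=0))
```

The feature with the largest raw variance contributes the most to the total, which is why it weighs so heavily in the first non-standardized component.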

In [13]:
from sklearn.preprocessing import StandardScaler

iris_ss = StandardScaler().fit(X)

In [15]:
X_ss = iris_ss.transform(X)

In [16]:
pca_ss = PCA().fit(X_ss)

In [18]:
print('Non-standardized Principal Components:\n{}'.format(pca.components_))
print('Standardized Principal Components:\n{}'.format(pca_ss.components_))

Non-standardized Principal Components:
[[ 0.36138659 -0.08452251  0.85667061  0.3582892 ]
 [ 0.65658877  0.73016143 -0.17337266 -0.07548102]
 [-0.58202985  0.59791083  0.07623608  0.54583143]
 [-0.31548719  0.3197231   0.47983899 -0.75365743]]
Standardized Principal Components:
[[ 0.52106591 -0.26934744  0.5804131   0.56485654]
 [ 0.37741762  0.92329566  0.02449161  0.06694199]
 [-0.71956635  0.24438178  0.14212637  0.63427274]
 [-0.26128628  0.12350962  0.80144925 -0.52359713]]


In [20]:
print('Non-standardized Explained Variance of Principal Components:\n{}'.format(pca.explained_variance_))
print('Standardized Explained Variance of Principal Components:\n{}'.format(pca_ss.explained_variance_))

Non-standardized Explained Variance of Principal Components:
[4.22824171 0.24267075 0.0782095  0.02383509]
Standardized Explained Variance of Principal Components:
[2.93808505 0.9201649  0.14774182 0.02085386]


In [27]:
print('Non-standardized Proportion of Explained Variance of Principal Components:\n{}'.format(pca.explained_variance_ratio_))
print('Standardized Proportion of Explained Variance of Principal Components:\n{}'.format(pca_ss.explained_variance_ratio_))

Non-standardized Proportion of Explained Variance of Principal Components:
[0.92461872 0.05306648 0.01710261 0.00521218]
Standardized Proportion of Explained Variance of Principal Components:
[0.72962445 0.22850762 0.03668922 0.00517871]


As you can probably tell, standardizing the data does make a difference. Not only did the numbers in the principal components change, but some of the signs changed as well. Thus, you should generally standardize inputs to PCA, especially when the features are measured in different units or on very different scales.

Now for the interesting part: the interpretation of the principal components and the rest. Each row in the principal component matrices above is one principal component, and each column (that is, each entry within a row) corresponds to a feature. Which column corresponds to which feature can be determined simply by looking at the ordering of the features in the data set; the orders are the same.

In [25]:
load_iris()['feature_names']

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

The values in each principal component (which are eigenvectors of the covariance matrix) are the coefficients used to compute the component score for each row of the data in X_ss. For example, the first component score of the standardized iris data set, for any given (standardized) input $x = [x_1, x_2, x_3, x_4]$, is computed from the first row of the standardized component matrix: $p_1 = 0.521(x_1) - 0.269(x_2) + 0.580(x_3) + 0.565(x_4)$. Note that a component score is just a linear transformation of the input data, namely the dot product of the data vector with the principal component. Also note that the entries in a principal component are often called "loadings," although some authors reserve that term for the eigenvector entries scaled by the square root of the eigenvalue.
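
The dot-product description can be checked against `PCA.transform`. This is a small sketch that rebuilds the standardized data so it runs on its own; `p1` and `all_scores` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_ss = StandardScaler().fit_transform(X)
pca_ss = PCA().fit(X_ss)

# First component score of the first row: dot product of the
# standardized input with the first row of components_.
x = X_ss[0]
p1 = np.dot(x, pca_ss.components_[0])

# transform() computes the scores for every row and every component.
scores = pca_ss.transform(X_ss)
print(np.isclose(p1, scores[0, 0]))

# The same thing for all rows at once, as a matrix product.
all_scores = X_ss @ pca_ss.components_.T
print(np.allclose(all_scores, scores))
```

The manual product matches `transform` here because the standardized data is already centered; on uncentered data, `transform` subtracts the fitted mean first.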

Another thing about principal components is that we can select how many of them to keep by looking at the explained variance. As per Kaiser's criterion (which applies to standardized data), only principal components with eigenvalues (explained variance) greater than 1 should be kept. By that criterion, only the first (standardized) principal component would be kept. However, the criterion is only a rule of thumb, and you can keep as many principal components as you like. Just be sure to consider how including or excluding components affects the cumulative explained variance proportion, that is, the sum of the proportions of all kept components.
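
Both selection rules can be sketched in a few lines. The 0.95 threshold below is an arbitrary value chosen for illustration, not something the criterion prescribes, and `n_kaiser`, `cumulative`, and `n_95` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_ss = StandardScaler().fit_transform(X)
pca_ss = PCA().fit(X_ss)

# Kaiser's criterion: count the components whose eigenvalue exceeds 1.
n_kaiser = int((pca_ss.explained_variance_ > 1).sum())
print('Components kept by Kaiser\'s criterion: {}'.format(n_kaiser))

# Cumulative proportion of explained variance after each component.
cumulative = np.cumsum(pca_ss.explained_variance_ratio_)
print('Cumulative explained variance: {}'.format(cumulative))

# Smallest number of components reaching an (assumed) 95% threshold.
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print('Components needed for 95%: {}'.format(n_95))
```

With the standardized eigenvalues printed earlier, the two rules disagree: Kaiser's criterion keeps fewer components than the cumulative threshold does, which is a reminder that the choice is a judgment call.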

Yet another thing about principal components is that the entries in each principal component are not the correlations between the features and the component scores. To get correlations and see how each feature is related to each principal component, you have to compute the correlation, across all rows of the data set, between each feature's values and each component's scores. See: https://newonlinecourses.science.psu.edu/stat505/lesson/11/11.3 and https://newonlinecourses.science.psu.edu/stat505/lesson/11/11.4. In the example at those links, notice that the loading for Arts and the correlation coefficient for Arts for principal component 1 differ. You have to decide, for your task at hand, what level of correlation is important (e.g., correlation > 0.5 or < -0.5 is worth looking at).
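
One way to compute these correlations for the standardized iris data is with `np.corrcoef`. This is a sketch rather than the method used at the links above; `corr` is an illustrative name, and its entry `[j, k]` is the correlation between feature `j` and the scores of component `k`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_ss = StandardScaler().fit_transform(iris.data)
pca_ss = PCA().fit(X_ss)
scores = pca_ss.transform(X_ss)

n_features = X_ss.shape[1]

# corrcoef stacks the feature columns and the score columns into one
# correlation matrix; the upper-right block holds feature-vs-score values.
full = np.corrcoef(X_ss, scores, rowvar=False)
corr = full[:n_features, n_features:]

for name, row in zip(iris.feature_names, corr):
    print('{:<20} {}'.format(name, np.round(row, 3)))

# For comparison: the loadings, arranged as features x components.
print(np.round(pca_ss.components_.T, 3))
```

Comparing the two printed tables shows that the loadings and the correlations are related but not equal, which is the point made above.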

With the correlations of each feature to the component scores, you can begin to tell a story, especially if the first component explains a lot of variance. You can tell which features increase or decrease together and the degree to which they do, and if a correlation is high and the component's proportion of explained variance is high, then that feature is important to look at. You can also look at all of the correlations together, again especially for the first component if its proportion of explained variance is high enough, and do what https://newonlinecourses.science.psu.edu/stat505/lesson/11/11.4 does in its interpretation of the first component. In other words, you can find some general rules or form some hypotheses, though, admittedly, there is no formulaic and straightforward way of doing so.