Standard deviation measures the dispersion of the values in a dataset: it quantifies how far the values lie from the dataset's mean. A low standard deviation indicates that the values are close to the mean, while a high standard deviation indicates that the values are spread out over a wider range.

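The population standard deviation, which treats the dataset as the entire population, is calculated using the following formula. (For a sample drawn from a larger population, the denominator is \( n - 1 \) instead of \( n \).)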

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} \]

Where:
- \( \sigma \) is the standard deviation.
- \( x_i \) represents each individual value in the dataset.
- \( \mu \) is the mean of the dataset.
- \( n \) is the number of values in the dataset.

Let's go through an example to calculate the standard deviation for a small dataset:

Consider the dataset:
\[ X = [5, 10, 15, 20, 25] \]

1. **Calculate the Mean (\( \mu \))**:
   \[ \mu = \frac{5 + 10 + 15 + 20 + 25}{5} = \frac{75}{5} = 15 \]

2. **Calculate the Variance**:
   - Calculate the squared differences from the mean for each value:
     \[ (5 - 15)^2 = 100 \]
     \[ (10 - 15)^2 = 25 \]
     \[ (15 - 15)^2 = 0 \]
     \[ (20 - 15)^2 = 25 \]
     \[ (25 - 15)^2 = 100 \]
   - Average the squared differences to obtain the variance:
     \[ \text{Variance} = \frac{100 + 25 + 0 + 25 + 100}{5} = \frac{250}{5} = 50 \]

3. **Calculate the Standard Deviation (\( \sigma \))**:
   - Take the square root of the variance:
     \[ \sigma = \sqrt{50} \approx 7.07 \]

Therefore, the standard deviation of the dataset \( X = [5, 10, 15, 20, 25] \) is approximately 7.07. In root-mean-square terms, the values lie about 7 units away from the mean of 15.
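
The three steps above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and `statistics.pstdev` is used only as a cross-check because it also divides by \( n \).

```python
import math
import statistics

X = [5, 10, 15, 20, 25]

# Step 1: mean
mu = sum(X) / len(X)

# Step 2: variance as the average squared difference from the mean
squared_diffs = [(x - mu) ** 2 for x in X]
variance = sum(squared_diffs) / len(X)

# Step 3: standard deviation as the square root of the variance
sigma = math.sqrt(variance)

print(mu, variance, round(sigma, 2))  # 15.0 50.0 7.07

# Cross-check against the standard library's population standard deviation
assert math.isclose(sigma, statistics.pstdev(X))
```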

In [35]:
from sklearn.preprocessing import scale
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

In [25]:
# Load the county data and drop rows with missing values
us_df = pd.read_csv('acs2015_county_data.csv').dropna()

In [26]:
# Identifier columns, excluded from the numeric features
labels = ['CensusId', 'State', 'County']

In [30]:
# Keep only the feature columns
features = us_df.drop(labels, axis=1)

In [32]:
# Standardize each feature to zero mean and unit variance
us_scaled = scale(features)
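
The `scale` call above ties back to the standard deviation: each column is centered on its mean and divided by its population standard deviation. The sketch below is not scikit-learn's implementation; it only illustrates the idea in plain Python on the small dataset from the worked example, with illustrative variable names.

```python
import statistics

X = [5, 10, 15, 20, 25]

mu = statistics.fmean(X)
sigma = statistics.pstdev(X)  # population standard deviation (divides by n)

# Standardize: subtract the mean, then divide by the standard deviation
z = [(x - mu) / sigma for x in X]

print([round(v, 3) for v in z])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

After standardization the values have mean 0 and standard deviation 1, which keeps features measured on large scales from dominating the PCA.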

In [37]:
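# A float n_components keeps the fewest components that explain more than 85% of the variance
pca = PCA(n_components=0.85)
pca.fit_transform(us_scaled)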

array([[ 4.42934293e-01,  5.18969413e-01, -1.59499024e+00, ...,
        -4.29625446e-01, -7.91981003e-02, -2.96048593e-02],
       [ 1.71989852e+00,  3.43068145e-01, -1.66609490e+00, ...,
        -4.11465700e-02,  1.15833114e+00,  2.06054001e-01],
       [-1.75940609e+00,  3.69092706e+00,  7.07138712e-03, ...,
        -1.14593387e+00,  5.99572336e-01, -9.22270266e-01],
       ...,
       [-4.14477054e+00,  7.00634238e+00,  4.08624219e+00, ...,
        -1.47155607e+00, -1.37838261e+00, -4.67169874e-01],
       [-3.58184259e+00,  7.45413922e+00,  3.76754665e+00, ...,
        -5.18276192e-01, -1.48977485e+00,  9.62547378e-01],
       [-2.99379525e+00,  7.39835693e+00,  3.32653897e+00, ...,
        -1.19875324e+00, -4.38306390e-01,  2.83486576e-01]])

In [39]:
# Fraction of the total variance explained by each retained component
pca.explained_variance_ratio_

array([0.20059818, 0.17109461, 0.13235941, 0.06281225, 0.05108803,
       0.0428644 , 0.03679227, 0.03265611, 0.03195313, 0.03024532,
       0.02677679, 0.02469989, 0.02231643])

In [41]:
# Number of components retained
len(pca.components_)

13
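
The count of 13 is consistent with the 0.85 threshold passed to `PCA`. As a rough check, the sketch below accumulates the ratios printed above (copied by hand from that output) and finds the first point where the running total exceeds 0.85. The names `ratios`, `threshold`, and `n_needed` are illustrative, and this only mirrors the selection rule rather than reproducing scikit-learn's code.

```python
from itertools import accumulate

# Ratios copied from the explained_variance_ratio_ output above
ratios = [
    0.20059818, 0.17109461, 0.13235941, 0.06281225, 0.05108803,
    0.0428644, 0.03679227, 0.03265611, 0.03195313, 0.03024532,
    0.02677679, 0.02469989, 0.02231643,
]

# Running total of explained variance after each component
cumulative = list(accumulate(ratios))

# Smallest number of components whose cumulative ratio exceeds the threshold
threshold = 0.85
n_needed = next(i for i, total in enumerate(cumulative, start=1) if total > threshold)

print(n_needed)                  # 13
print(round(cumulative[-2], 4))  # 0.8439 (12 components fall just short)
print(round(cumulative[-1], 4))  # 0.8663
```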