# Principal Component Analysis Exploration

We begin by importing the libraries needed for PCA and one-hot encoding the categorical variables in our CCRB dataset. PCA operates only on numerical data, so this is a crucial preprocessing step. Next, we standardize the data, because PCA is sensitive to the scale of the variables: without standardization, variables with larger variance dominate the principal components regardless of their true importance. Standardizing puts all features on a common scale, so the principal components reflect the structure of the data rather than the units of the features. Finally, we check for NaN and infinite values once the standardization is complete. `StandardScaler` subtracts the mean of each feature and divides by its standard deviation. Dividing by a standard deviation of zero would produce undefined or infinite values, although scikit-learn's `StandardScaler` guards against this case by leaving zero-variance features unscaled. Missing values in the input, on the other hand, are ignored when fitting and passed through as NaN in the output, and PCA cannot handle them.
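As a quick sanity check of how `StandardScaler` treats constant columns and missing values, here is a minimal sketch on a toy matrix. The values are made up for illustration and are not taken from the CCRB data; the behavior shown assumes a reasonably recent scikit-learn release.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix (hypothetical values): column 0 varies, column 1 is constant,
# column 2 contains one missing value.
X_toy = np.array([
    [1.0, 5.0, 2.0],
    [2.0, 5.0, np.nan],
    [3.0, 5.0, 4.0],
])

X_toy_std = StandardScaler().fit_transform(X_toy)

print(X_toy_std)
print("any inf:", np.isinf(X_toy_std).any())
print("NaN positions:", np.argwhere(np.isnan(X_toy_std)).tolist())
```

In this sketch the constant column comes out as zeros rather than infinities, while the missing value stays NaN, which is why the NaN check matters more than the infinity check in practice.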

In [4]:
# import libraries for PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# load dataset
ccrb = pd.read_csv("CCRB Complaint Database Raw 04.20.2021.csv")
ccrb = ccrb.drop(columns = ["AsOfDate", "AllegationID", "LastName", "FirstName", "ShieldNo", "ComplaintID",
                            "PenaltyRec", "NYPDDisposition", "PenaltyDesc", "LocationType", "ReceivedDate",
                            "CloseDate", "IncidentRank", "CurrentRank", "BoardCat", "OfficerID", "LastActive"])

# lowercase all column names and the string values in each text column
ccrb.columns = ccrb.columns.str.lower()
for column in list(ccrb.columns):
    if type(ccrb[column][0]) == str:
        ccrb[column] = ccrb[column].str.lower()
    
# binarize the ccrbdisposition column: (1) substantiated, (0) everything else
# "unsubstantiated" contains "substantiated" as a substring, so exclude it explicitly
ccrbDispositionList = list(ccrb["ccrbdisposition"])
for i in range(len(ccrbDispositionList)):
    if "substantiated" in ccrbDispositionList[i] and "unsubstantiated" not in ccrbDispositionList[i]:
        ccrbDispositionList[i] = 1
    else:
        ccrbDispositionList[i] = 0
ccrb["ccrbdisposition"] = ccrbDispositionList
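# drop the incident date, which is not used in the PCA
ccrb = ccrb.drop(columns = ["incidentdate"])
# remove rows with implausible values (negative tenure, negative or extreme ages)
ccrb = ccrb.drop(ccrb[ccrb["daysonforce"] < 0].index)
ccrb = ccrb.drop(ccrb[ccrb["impactedage"] < 0].index)
ccrb = ccrb.drop(ccrb[ccrb["impactedage"] > 116].index)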

In [6]:
# select the columns to one-hot encode
categorical_cols = ['officerrace', 'officergender', 'currentranklong', 'currentcommand',
                    'incidentranklong', 'incidentcommand', 'status', 'fadotype',
                    'allegation', 'ccrbdisposition', 'contactreason', 'contactoutcome',
                    'incidentprecinct', 'impactedrace', 'impactedgender']

# create a new data frame with one-hot encoded categorical columns
ccrb_onehot = pd.get_dummies(ccrb, columns=categorical_cols)

# standardize data (important since PCA is sensitive to scale of variables)
scaler = StandardScaler()
ccrb_std = scaler.fit_transform(ccrb_onehot)

# check for NaN and infinite values
print(np.argwhere(np.isnan(ccrb_std)))
print(np.argwhere(np.isinf(ccrb_std)))

[[     6      1]
 [     7      1]
 [    38      1]
 ...
 [279401      1]
 [279402      1]
 [279403      1]]
[]
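The second value in each pair printed by `np.argwhere` is a column index, so it can be mapped back to a column name to see which feature carries the missing values. The sketch below illustrates the idea on a small hypothetical frame standing in for `ccrb_onehot`; the column names and values are illustrative only and do not claim to identify the affected column in the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ccrb_onehot (names and values are illustrative only)
toy = pd.DataFrame({
    "daysonforce": [100.0, 250.0, 400.0],
    "impactedage": [30.0, np.nan, 45.0],
    "officergender_male": [1, 0, 1],
})

# Same check as above, applied to the numeric array
arr = toy.to_numpy(dtype=float)
nan_positions = np.argwhere(np.isnan(arr))

# Unique column indices that contain at least one NaN, mapped to names
nan_col_idx = np.unique(nan_positions[:, 1])
nan_col_names = toy.columns[nan_col_idx].tolist()
print(nan_col_names)
```

Applied to the real data, the same indexing would use `ccrb_onehot.columns`, since `fit_transform` keeps the column order of its input.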


For PCA to run, we must remove the missing values from our standardized dataset. The check above found NaN entries but no infinite ones, so we drop every row that contains a NaN.

In [7]:
# remove rows with missing values after data standardization
ccrb_std = ccrb_std[~np.isnan(ccrb_std).any(axis=1)]

Before we run the PCA algorithm on our standardized dataset, we can see that the dataset currently has a shape of (156413, 1727). PCA lets us reduce the number of columns substantially.

In [8]:
# standardized data before dimensionality reduction
ccrb_std.shape

(156413, 1727)

Now, we can fit the PCA algorithm to our clean, standardized data and plot the cumulative explained variance ratio in order to determine the optimal number of principal components. For each number of components k, this plot shows the proportion of the total variance in the data explained by the first k principal components combined. As shown below, the range from 1250 to 1500 principal components is where we can reduce the dimensionality of our dataset without losing much of the original variance.

In [None]:
# create instance of PCA class and fit it to standardized data
pca = PCA()
pca.fit(ccrb_std)

# plot the cumulative explained variance ratio to determine optimal number of principal components
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_.cumsum(), marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.show()
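Reading the component count off the plot can be complemented by a numeric rule. The following is a minimal sketch, on synthetic data, of picking the smallest number of components whose cumulative explained variance reaches a target. The 0.95 threshold and the synthetic matrix are assumptions made for illustration; they are not values used in this analysis.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 10 features with decreasing scales (illustrative only)
X_demo = rng.normal(size=(200, 10)) * np.linspace(5.0, 0.5, 10)

pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.95  # assumed target, not a value taken from the analysis
# Smallest k such that the first k components reach the threshold
k = int(np.searchsorted(cumulative, threshold)) + 1
print(k, cumulative[k - 1])

# Shortcut: a float in (0, 1) asks PCA to choose the component count itself
pca_95 = PCA(n_components=threshold, svd_solver="full").fit(X_demo)
print(pca_95.n_components_)
```

A threshold rule like this makes the choice reproducible, whereas an elbow read from a plot depends on the reader's judgment.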

We chose the lower end of the "elbow" range in our plot. While 1500 principal components would represent our original data with essentially no loss of information, keeping more components than necessary is not always the best option, as it can lead to overfitting and poor generalization to new data. A common practice is to select a smaller number of principal components that still explains most of the variance in the data, keeping the overall number of dimensions low. Therefore, we chose 1250 principal components as our optimal number.

In [None]:
# select the optimal number of principal components based on the explained variance ratio
n_components = 1250

# create a new instance of PCA with the optimal number of components
pca = PCA(n_components=n_components)

# fit the PCA to the standardized data
pca.fit(ccrb_std)

# apply the PCA transformation to the standardized dataset
ccrb_pca = pca.transform(ccrb_std)

# convert the numpy array to a pandas dataframe
ccrb_pca_df = pd.DataFrame(ccrb_pca, columns=['PC{}'.format(i) for i in range(1, n_components + 1)])

Our new dataframe now has a shape of (156413, 1250), down from the original (156413, 1727), a reduction of 477 columns.

In [None]:
ccrb_pca_df.shape

In [None]:
ccrb_pca_df

When using PCA, the new column names generated by the transformation will correspond to the principal components themselves (i.e., "PC1", "PC2", etc.). These new columns represent linear combinations of the original features that capture the most variance in the data.

In general, the principal components are hard to interpret directly in terms of the original features, since each component mixes all of the original features at once. However, it is possible to examine the loadings of each feature on each principal component to gain some insight into which original features are most strongly associated with each principal component. This process is done below. Each row in `loadings_df` corresponds to a principal component and each column corresponds to an original feature, with the values in the cells representing the weight of each feature on each principal component. We can use this data frame to interpret the relationship between the original features and the principal components.
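To make the layout of such a table concrete, here is a minimal sketch on toy data of how a loadings frame with one row per component and one column per feature might be built from the `components_` attribute of a fitted `PCA`. The feature names, the random data, and the names `toy_loadings` and `top_pc1` are hypothetical and separate from the notebook's own `loadings_df`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
feature_names = ["f1", "f2", "f3", "f4"]  # hypothetical feature names
X_small = rng.normal(size=(100, 4))       # random toy data, not the CCRB data

pca_toy = PCA(n_components=2).fit(X_small)

# components_ has shape (n_components, n_features): one row per component
toy_loadings = pd.DataFrame(
    pca_toy.components_,
    index=["PC1", "PC2"],
    columns=feature_names,
)

# Rank features by the absolute size of their weight on PC1
top_pc1 = toy_loadings.loc["PC1"].abs().sort_values(ascending=False)
print(toy_loadings.shape)
print(top_pc1.head(2))
```

Sorting by absolute value matters because a large negative weight is as informative as a large positive one.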