# Principal Component Analysis (EDA)

This notebook applies Principal Component Analysis (PCA) to the merged rodents dataset (loaded below as `roadents_df`). The outputs are the principal components and their explained variance, plus, for each of the first five components, the ten columns with the largest absolute loadings. These loadings show which variables contribute most to each component, which can help with feature selection and dimensionality reduction.
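
As a rough illustration of how explained variance can guide dimensionality reduction, the sketch below computes the explained-variance ratios from the singular values of centered data and counts how many leading components are needed to reach a target share. The helper names (`explained_variance_ratio`, `components_needed`), the synthetic data, and the 90% threshold are assumptions made for this example; they are not part of the notebook, which standardizes the real data and uses scikit-learn's `PCA` instead.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance carried by each principal component of X."""
    Xc = X - X.mean(axis=0)                  # PCA operates on centered data
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, in descending order
    return s**2 / np.sum(s**2)

def components_needed(ratios, threshold):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ratios))               # guard against floating-point shortfall

# Synthetic data (an assumption for illustration): 6 columns driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 6))

ratios = explained_variance_ratio(X)
k = components_needed(ratios, 0.90)
print("explained variance ratios:", np.round(ratios, 4))
print("components needed for 90% of the variance:", k)
```

The same cumulative-sum check can be applied to the `explained_variance_ratio_` array of a fitted scikit-learn `PCA` object.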

The analysis excludes columns whose names start with 't_' or 'd_' (breakdowns of rodent sightings by time of day and day of week). It also drops four columns: 'year', 'num_sightings', 'num_dsny_complaints', and 'spatial_id'.
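
The column selection can be sketched with the standard library alone. The pattern and the dropped names are taken from the PCA cell further down; the helper `select_pca_columns` and the shortened column list are hypothetical and exist only for this example. Matching with `re.search` is intended to mirror how `DataFrame.filter(regex=...)` tests each column label.

```python
import re

# Same pattern the notebook passes to DataFrame.filter(regex=...).
# The negative lookahead rejects names that begin with "t_" or "d_".
KEEP_PATTERN = re.compile(r"^(?!t_|d_).*")

# Columns dropped explicitly in the notebook's PCA cell.
DROPPED = {"year", "num_sightings", "num_dsny_complaints", "spatial_id"}

def select_pca_columns(columns):
    """Return the names that pass the regex filter and are not explicitly dropped."""
    return [c for c in columns if KEEP_PATTERN.search(c) and c not in DROPPED]

# A few column names taken from the dataset header.
sample_columns = [
    "spatial_id", "year", "l_Commercial_sum", "d_Friday_sum",
    "t_Evening_sum", "num_sightings", "s_Trash:Street_sum",
    "num_dsny_complaints", "subway_count",
]
selected = select_pca_columns(sample_columns)
print(selected)
# -> ['l_Commercial_sum', 's_Trash:Street_sum', 'subway_count']
```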

In [36]:
from datetime import datetime, timedelta
import geopandas as gpd
import json
import pandas as pd
import mapclassify
import matplotlib.pyplot as plt
import numpy as np
import os
import requests
from sklearn.decomposition import PCA
from io import StringIO
import warnings

warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option("display.max_rows", None)
np.set_printoptions(threshold=np.inf)

## Load Data

In [37]:
# Parameters
parent_dir = os.path.abspath('..')  # get the absolute path of the parent directory

In [38]:
file_path = os.path.join(parent_dir, 'Data', 'rodents_per_year_merged.csv')  # construct the file path
roadents_df = pd.read_csv(file_path)
print(len(roadents_df))
roadents_df.head()

38958


Unnamed: 0,spatial_id,year,l_Commercial_sum,l_Other_sum,l_Outdoor_sum,l_Residential_sum,l_Residential-Mixed_sum,l_Vacant_Space_sum,d_Friday_sum,d_Monday_sum,d_Saturday_sum,d_Sunday_sum,d_Thursday_sum,d_Tuesday_sum,d_Wednesday_sum,t_Evening_sum,t_Midday_sum,t_Morning_sum,num_sightings,s_Dead_Animal:Residential_sum,s_Dead_Animal:Street_sum,s_Dog_waste:Street_sum,s_Illegal_Dumping:Street_sum,s_Trash:Residential_sum,s_Trash:Street_sum,s_Trash_MissedService:Street_sum,s_Trash_Overflowing:Street_sum,s_Trash_Time:Street_sum,s_Trash_Unsecure:Residential_sum,s_Trash_Unsecure:Street_sum,num_dsny_complaints,subway_count
0,360050001000,2020,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,360050001000,2021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,360050001000,2022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,360050001000,2018,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,360050001000,2019,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Run PCA
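
The cell in this section standardizes each column by subtracting its mean and dividing by its standard deviation. If a column were constant (for example, a count that is zero in every row), that division would produce NaN values. The sketch below shows one possible guard; the `standardize` helper and the toy array are assumptions for illustration and are not taken from the notebook.

```python
import numpy as np

def standardize(X):
    """Z-score each column; zero-variance columns are centered and left at 0."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)                      # population std (ddof=0), np.std's default
    safe_std = np.where(std == 0, 1.0, std)  # avoid dividing by zero
    return (X - mean) / safe_std

# Toy data: the last column is constant, as an all-zero count column would be.
X_toy = np.array([[1.0, 10.0, 0.0],
                  [2.0, 20.0, 0.0],
                  [3.0, 60.0, 0.0]])
Z = standardize(X_toy)
print("column means:", Z.mean(axis=0))
print("column stds: ", Z.std(axis=0))
```

Columns with non-zero variance end up with mean 0 and standard deviation 1, so no single count column dominates the components purely because of its scale.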

In [39]:
# Keep columns that do not start with 't_' or 'd_', then drop identifiers and totals
df = roadents_df.filter(regex='^(?!t_|d_).*')
df = df.drop(['year', 'num_sightings', 'num_dsny_complaints', 'spatial_id'], axis=1)
print(df.dtypes)

# Center and scale the data
X = df.values
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

# Create a PCA object and fit it to the data
pca = PCA()
pca.fit(X)

# Access the principal components and their explained variance
pcs = pca.components_
variance = pca.explained_variance_ratio_

# Print the explained variance for each principal component
for i, var in enumerate(variance):
    print("PC", i+1, "explains", round(var*100, 2), "% of the variance")

# Score each column by the absolute value of its loading on each component
importance_scores = np.abs(pcs)
column_importance = pd.DataFrame(importance_scores.T, index=df.columns)

# Print the top 10 columns and their importance for the first 5 components
for i in range(5):
    print("Top 10 columns for PC", i+1)
    importance = column_importance[i].nlargest(10)
    print(importance)
    print()


l_Commercial_sum                    float64
l_Other_sum                         float64
l_Outdoor_sum                       float64
l_Residential_sum                   float64
l_Residential-Mixed_sum             float64
l_Vacant_Space_sum                  float64
s_Dead_Animal:Residential_sum       float64
s_Dead_Animal:Street_sum            float64
s_Dog_waste:Street_sum              float64
s_Illegal_Dumping:Street_sum        float64
s_Trash:Residential_sum             float64
s_Trash:Street_sum                  float64
s_Trash_MissedService:Street_sum    float64
s_Trash_Overflowing:Street_sum      float64
s_Trash_Time:Street_sum             float64
s_Trash_Unsecure:Residential_sum    float64
s_Trash_Unsecure:Street_sum         float64
subway_count                        float64
dtype: object
PC 1 explains 15.94 % of the variance
PC 2 explains 10.24 % of the variance
PC 3 explains 6.66 % of the variance
PC 4 explains 6.03 % of the variance
PC 5 explains 5.63 % of the variance
PC 6 ex