CodeNomad-I/finalCapstone

PCA Analysis on US Arrests Dataset

Project Description

This project contains a report on a principal component analysis (PCA) that I carried out on the US Arrests dataset, available on Kaggle at https://www.kaggle.com/datasets/halimedogan/usarrests.

A description of the data is given as: “This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.”

The aim of this project was to:

  • Reduce the dimensionality of the dataset through PCA.
  • Use hierarchical clustering to determine how many distinct groups there are.
  • Use k-means clustering to identify which states belong to which cluster.
  • Attempt to gain some insight into what distinguishes the clusters from one another.
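
As a rough illustration of the first aim, here is a minimal NumPy sketch of PCA on standardized data. It is not the notebook's code: the function name `pca_explained_variance` is hypothetical, and `X_demo` is synthetic placeholder data with 50 rows and 4 columns (the same shape as the dataset described above), not the real arrest figures.

```python
import numpy as np

def pca_explained_variance(X):
    """Standardize X (rows = observations) and run PCA via SVD.

    Returns the component scores, the loadings (one principal axis per row)
    and the fraction of total variance explained by each component.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each column so that features on large scales do not dominate.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized data: the rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * S
    explained_variance = S**2 / (Z.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return scores, Vt, ratio

# Synthetic placeholder data (assumption): 50 observations, 4 features
# on deliberately different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4)) * np.array([4.0, 80.0, 14.0, 9.0])

scores, loadings, ratio = pca_explained_variance(X_demo)
cumulative = np.cumsum(ratio)

# Number of components needed to reach 90% of the variance
# (the 90% threshold is an arbitrary choice for this sketch).
n_keep = int(np.searchsorted(cumulative, 0.9) + 1)

print("Explained variance ratio:", np.round(ratio, 3))
print("Cumulative:", np.round(cumulative, 3))
print("Components kept:", n_keep)
```

The absolute values in each row of `loadings` give a simple view of how much each original feature contributes to that component.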

In the Jupyter Notebook you will see:

  • How I approached the cleaning of the dataset.
  • The other pre-processing steps I took before PCA, such as inspecting the distributions and correlations of the features.
  • The PCA itself, with feature importance and a cumulative explained variance graph thrown in.
  • My hierarchical clustering approach, including my geeking out at how nice the complete linkage dendrogram looked.
  • Using the number of clusters from the hierarchical approach to set k, then performing k-means clustering.
  • Displaying the clusters and their states in a neatly formatted fashion.
  • And last but not least, some observations and guesses I made towards the story behind the clusters.
  • Just when you didn't think it could get any more fun, my commentary and analysis are sprinkled throughout the entire document too.
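
The clustering and display steps can be sketched in a similar way. The block below is a NumPy-only version of Lloyd's algorithm for k-means, offered as an illustration under stated assumptions rather than as the notebook's implementation. The `kmeans` function, the `demo_scores` array (two synthetic blobs standing in for PCA scores) and the `state_00`-style names are all hypothetical, and `k` is set by hand here where the notebook takes it from the dendrogram.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: returns (labels, centroids)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the final centroids.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids

# Synthetic stand-in for PCA scores (assumption): two well-separated blobs.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(25, 2))
blob_b = rng.normal(loc=20.0, scale=0.5, size=(25, 2))
demo_scores = np.vstack([blob_a, blob_b])
names = [f"state_{i:02d}" for i in range(50)]  # hypothetical labels

k = 2  # in the notebook this comes from the dendrogram; fixed by hand here
labels, centroids = kmeans(demo_scores, k)

# Group the names by cluster and print them one cluster per line.
clusters = {j: [n for n, lab in zip(names, labels) if lab == j] for j in range(k)}
for j, members in clusters.items():
    print(f"Cluster {j} ({len(members)} states): {', '.join(members)}")
```

Because k-means depends on its starting centroids, a library implementation with multiple restarts is usually a safer choice on real data than this single-run sketch.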
