Affinity Propagation clustering of VA counties and independent cities
For the final assignment of the fourth course in my Master's degree program at Northcentral University, I was instructed to correlate two or more otherwise unrelated datasets and generate some new knowledge. In a previous course, I had looked at COVID-19 cases in Virginia along with county demographic information and political affiliation data from the 2019 House of Delegates election to see if there was a correlation between political affiliation and COVID-19 cases in midsummer 2020. For that course, only a brief proposal was required.
For this course, I used similar datasets for COVID-19 cases in Virginia, demographic information, and votes for president in the 2020 general election to see if I could cluster Virginia counties and independent cities based on those characteristics. Polling data suggested that political affiliation and other demographic features such as race, age, and income influenced people's willingness to get a COVID-19 vaccine. The new knowledge I wanted to generate was a clustering of Virginia's counties and independent cities based on demographic and political characteristics that a hypothetical marketing department could use to develop a vaccine promotion campaign.
Affinity Propagation clustering is an unsupervised learning technique that determines the number of clusters from the data itself rather than requiring it to be chosen in advance. Messages are sent between all pairs of data points to determine how similar each point is to another and whether any one point might be considered an exemplar of another. Each cluster consists of the data points that share an exemplar data point.
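To make the message passing concrete, here is a minimal pure-Python sketch of the technique on six made-up 2-D points. It is not the code from the notebooks: the function name, the toy points, the damping factor, the iteration count, and the choice of the median off-diagonal similarity as the default preference are all assumptions made for illustration. For real data, a library implementation such as scikit-learn's `AffinityPropagation` would be the more practical choice.

```python
from statistics import median


def affinity_propagation(points, preference=None, damping=0.5, iterations=200):
    """Toy affinity propagation (illustrative only).

    Returns (exemplar_indices, labels), where labels[i] is the position in
    exemplar_indices of the exemplar chosen for points[i].
    """
    n = len(points)
    # Similarity s[i][k]: negative squared Euclidean distance between points.
    s = [[-sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]
    # The diagonal holds the "preference", i.e. how suitable each point is as
    # an exemplar. Assumption: default to the median off-diagonal similarity.
    if preference is None:
        preference = median(s[i][k] for i in range(n) for k in range(n) if i != k)
    for i in range(n):
        s[i][i] = preference

    r = [[0.0] * n for _ in range(n)]  # responsibility messages, point i -> candidate k
    a = [[0.0] * n for _ in range(n)]  # availability messages, candidate k -> point i
    for _ in range(iterations):
        # Responsibility: how well suited k is as exemplar for i, compared
        # with the best competing candidate.
        for i in range(n):
            for k in range(n):
                rival = max(a[i][j] + s[i][j] for j in range(n) if j != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)
        # Availability: how much support k has gathered from other points.
        for i in range(n):
            for k in range(n):
                support = sum(max(0.0, r[j][k]) for j in range(n) if j not in (i, k))
                new = support if i == k else min(0.0, r[k][k] + support)
                a[i][k] = damping * a[i][k] + (1 - damping) * new

    # A point is an exemplar when its combined self-messages are positive.
    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    if not exemplars:
        return [], [-1] * n
    labels = []
    for i in range(n):
        # Exemplars label themselves; other points join the most similar exemplar.
        best = i if i in exemplars else max(exemplars, key=lambda k: s[i][k])
        labels.append(exemplars.index(best))
    return exemplars, labels


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (0, -1), (10, 0), (10, 1), (10, -1)]
exemplars, labels = affinity_propagation(points)
print("exemplar indices:", exemplars)
print("cluster labels:  ", labels)
```

Note that the number of clusters is never passed in. It emerges from the similarities and the preference value, and a lower preference generally yields fewer clusters.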
The four notebooks in this repository show the cleaning process for each of the four datasets I used, the collating process, the affinity propagation clustering, and the additional exploration of the resulting clusters by certain target characteristics. The target characteristics were based on national polling data showing who was less willing to get a vaccine (younger people, lower-income people, Black Americans, Republicans), along with population density and the actual number of reported cases per county or independent city.
The four datasets are the COVID-19 cases as reported by the Virginia Department of Health to the public as of December 20, 2020; the votes for president in the November 3, 2020 general election as reported by the Virginia Department of Elections; poverty data from the US Census Bureau's Small Area Income and Poverty Estimates for 2019; and age, racial, and rural population data from the 2020 County Health Rankings, which drew on the US Census Population Estimates for 2018.