![projectimage.jpg](./images/projectbanner.jpg)

# Demonstration Notebook for Cluster Analysis of the "Views of the Electorate Research Survey" (VOTERS) dataset
_a project by [Kaleb Nyquist](http://kalebnyquist.me)_
&nbsp;

&nbsp;

This project demonstrates a potential application of unsupervised machine learning to help policy entrepreneurs and political scientists better understand one of the contributing factors to contemporary American political dysfunction — that is, the reality that majority support in terms of public opinion does not always translate to desired policy outcomes. 

For example, most Americans want to see [campaign finance reform](https://www.pewresearch.org/fact-tank/2018/05/08/most-americans-want-to-limit-campaign-spending-say-big-donors-have-greater-political-influence/), [background checks for gun ownership](https://news.gallup.com/poll/1645/guns.aspx), and [reductions in fossil fuel consumption](https://news.gallup.com/poll/248006/americans-support-reducing-fossil-fuel.aspx). However, this majority popular support has resulted in minimal national policy action at best, although there has been some state-level policy action and grassroots action in each of these issue arenas.

What this project illustrates is one partial explanation for this phenomenon. With resonances to the [Condorcet paradox](https://en.wikipedia.org/wiki/Condorcet_paradox), where majorities can simultaneously prefer candidates for elected office in a cyclical "rock-paper-scissors" manner, here we assume there are limited opportunities for political representation available to voters (in terms of political parties, candidates for elected office, civil society advocacy organizations, activist movements, etc.) and that voters can coalesce around these opportunities for representation in a quasi-paradoxical manner. The best metaphor for this dilemma is a sort of "mathematical gerrymandering" that, while less nefarious and more accidental than [geographic gerrymandering](https://www.theatlantic.com/ideas/archive/2019/03/how-courts-can-objectively-measure-gerrymandering/585574/), is similar in that contested issues where the political support is highly concentrated within certain networks will tend to lose out to issues where the political support is evenly distributed across networks (at least enough so to form slight majorities per network).

To illustrate, consider the simplified model below:

![](./images/theorygraph.gif)

Pretend that each dot represents a single voter, and that the horizontal axis represents that voter's "concern for economic issues" and the vertical axis represents "concern for social issues". In terms of a simple direct majority, positive support for social issues outweighs negative support for social issues, 17 voters to 13 voters. However, if we consider that each voter has to choose their representation in terms of economic and social issues simultaneously, we find that those voters self-organize into five distinct networks that are stratified in terms of their positive or negative concern for economic issues. Assuming that each network gets an equal amount of representation, and each voter gets a single vote within each network, we see that negative support for social issues wins out, 3 networks to 2 networks.  
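The arithmetic of this toy model can be checked in a few lines of Python. This is only a sketch: the per-network tallies below are hypothetical, chosen to reproduce the totals in the text (17 to 13 voters, 3 to 2 networks) rather than read off the figure.

```python
# Hypothetical tallies per network: (voters with positive concern for social
# issues, voters with negative concern). Chosen to match the totals in the
# text; they are not read off the figure.
networks = [(6, 0), (6, 0), (2, 4), (2, 4), (1, 5)]

# Direct majority: count individual voters.
total_positive = sum(pos for pos, neg in networks)
total_negative = sum(neg for pos, neg in networks)

# Representative outcome: each network casts one vote for its internal majority.
networks_positive = sum(1 for pos, neg in networks if pos > neg)
networks_negative = sum(1 for pos, neg in networks if neg > pos)

print(f"Voters:   {total_positive} positive vs. {total_negative} negative")
print(f"Networks: {networks_positive} positive vs. {networks_negative} negative")
```

The same voters produce opposite outcomes depending on whether they are counted individually or by network.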

**Is it possible to measure how significant a factor this phenomenon is in American politics?** The assertion of this project is that __*yes, it is possible*__ and that the machine learning technique of __*cluster analysis is a viable method of doing so*__. The dataset we use for this project is the "Views of the Electorate Research Survey" (or VOTERS) dataset made publicly available by Democracy Fund's [Voter Study Group](https://www.voterstudygroup.org/data). Although not specifically designed for machine learning applications, due to its size and scope it is one of the best freely available datasets for machine learning within the domain of political science.

What follows is a step-by-step walkthrough of a Python-language program designed to do exactly this form of cluster analysis on VOTERS. Since I am trying to recommend Python code rather than sell 21st century snake oil, there also will be some discussion on the limitations of this analysis given current data capacities alongside the current development status of this code (which may be best understood as a working prototype that invites input rather than a finished product). As you work through this notebook, feel free to (if you are a Python developer) [fork the repository on GitHub](https://github.com/KalebNyquist/voters-clustering) to contribute your own amendments or (if you are a political scientist or policy entrepreneur) send me your comments and suggestions via e-mail to contact@kalebnyquist.me.

&nbsp;

&nbsp;

## Import Libraries

Below, we import libraries of Python code created specifically for working with the VOTERS dataset.

In [None]:
import VOTERSdata
import VOTERSguide
import VOTERSvisualizer
import VOTERSclustering

&nbsp;

#### <mark style="background-color:lime">Developer-specific</mark>

Module for reloading libraries that have already been imported. Useful for troubleshooting and checking new features added to the custom libraries above.

In [None]:
import importlib
importlib.reload(VOTERSclustering)

Pickle module. Useful for saving versions of the dataset and other results that take a long time to generate. Good pickling points marked by 🥒 in the code. 

In [None]:
import pickle

&nbsp;

&nbsp;

## Data Retrieval

The code below imports the .csv version of the data into a Pandas dataframe suitable for Python. Access to the free 2018 Survey Full Data Set can be requested [here](https://www.voterstudygroup.org/publication/2018-voter-survey-1). Note that to submit a request for the data, you will have to give your name and email.

In [None]:
voters = VOTERSdata.import_csv_file('data/VOTER_Survey_April18_Release1.csv')

For intelligibility, the code below scrapes labels of survey variables from the corresponding guide. The "Guide to 2018 Survey Data" is available either [here](https://www.voterstudygroup.org/publication/2018-voter-survey-1) or when downloading the full data set.

In [None]:
guide_dict = VOTERSguide.compile_dictionary('data/dfvsg_Guide_May2018_VOTER_Survey.pdf')

&nbsp;

&nbsp;


## Choose Respondent Group to Analyze

The 2018 Survey Full Data Set can be subsetted to one of six groups, based on the year(s) they responded to questions or if they belong to an oversample of Latino voters or an oversample of youth voters. Furthermore, each respondent is weighted according to an algorithm designed by survey administrator YouGov, which matches a respondent to a multi-dimensional strata of age, gender, race and education proportional to the U.S. Census Bureau's American Community Survey and other databases.

Here, you can [view a sketch of the respondent groups across the sample](./images/respondentgroups.png). The following code allows you to select a specific subgroup and expand the data according to the sample weights. The tradeoff in the `magnitude` parameter below is precision versus computing time. The higher the magnitude, the more "data points" are created from a particular respondent; however, the computing time needed to create clusters grows faster than the number of data points.

In [None]:
voters_panel = VOTERSdata.choose_subset(voters, 'panel')

In [None]:
reconstituted_voters_panel = VOTERSdata.apply_weights(voters_panel, 'panel', magnitude=8)
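As a rough illustration of what expanding by weight means, here is a minimal sketch with pandas. It assumes each respondent is copied in proportion to `weight * magnitude`; the column names and the rounding rule are hypothetical, and the internals of `VOTERSdata.apply_weights` may differ.

```python
import pandas as pd

# Toy respondents with hypothetical survey weights.
respondents = pd.DataFrame(
    {"respondent_id": [1, 2, 3], "weight": [0.5, 1.0, 1.75]}
)

magnitude = 8  # assumed to play the same role as the `magnitude` argument

# Each respondent is copied round(weight * magnitude) times.
copies = (respondents["weight"] * magnitude).round().astype(int)
expanded = respondents.loc[respondents.index.repeat(copies)].reset_index(drop=True)

print(copies.tolist())  # number of copies per respondent
print(len(expanded))    # total rows after expansion
```

A larger magnitude tracks the weights more closely but produces more rows for the clustering step to process.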

In [None]:
#🥒 
# Write mode ('wb') overwrites any earlier pickle, so a later load returns this version.
with open('reconstituted_data', 'wb') as reconstituted_pickle:
    pickle.dump(reconstituted_voters_panel, reconstituted_pickle)

&nbsp;

&nbsp;

## Choose Features of Interest

VOTERS contains responses collected over many years for a wide variety of questions, ranging from "If you had that chance to do it again, how would you vote in 2016?" (asked in 2017) to "Outside of attending religious services, how often do you pray?" (asked in 2011).

Below, the Features Picker can be used to create a version of the VOTERS data subset that includes any combination of questions from various categories, such as: 
* __Issue Importance__ (regarding issues, e.g. "Climate change" or "Religious liberty")
* __Feeling Thermometer__ (towards groups, e.g. "Asians" or "Feminists")
* __Favorability__ (towards public figures, e.g. "Ted Cruz" or "Nancy Pelosi")
* __Institutional Confidence__ (regarding large institutions, e.g., "Supreme Court" or "Big business")

In [None]:
voters_issues = VOTERSdata.FeaturesPicker(IssueImportance=True, FeelingThermometer=False).process(reconstituted_voters_panel)

⚠️ *In the current version of this data analysis code, only the `IssueImportance` features are guaranteed to be clustered and visualized neatly.*

&nbsp;

&nbsp;

## Reduce Number of Features

Because some questions were asked multiple times in different years to the same respondents, there exist a number of features in the data that are made redundant (assuming we are not doing an analysis of changes over time).

Here, we condense the data by the most recent value available. Note that if a datapoint is missing for a question asked multiple times, this function imputes the "most recent" as the most recent response: that is, if a question asked in 2017 did not receive a response from an individual who answered that question in 2016, then the 2016 data is used to impute the missing 2017 value. The assumption here is that political preferences are relatively stable over time.

In [None]:
voters_issues_condensed = VOTERSdata.condense_by_most_recent_feature(voters_issues)
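The "most recent available answer" rule can be sketched as follows. This is an illustration under assumptions, not the code inside `VOTERSdata`: the helper name and the `<question>_<year>` column names are hypothetical.

```python
import pandas as pd

# Hypothetical column names in "<question>_<year>" form.
answers = pd.DataFrame(
    {
        "imiss_a_2016": [1.0, 2.0, 3.0],
        "imiss_a_2017": [1.0, None, 4.0],
        "imiss_b_2016": [2.0, 2.0, None],
    }
)

def condense_by_most_recent(df):
    """Keep one column per question, preferring the latest non-missing answer."""
    condensed = {}
    questions = sorted({col.rsplit("_", 1)[0] for col in df.columns})
    for question in questions:
        # Order this question's columns from newest to oldest year.
        cols = sorted(
            (c for c in df.columns if c.rsplit("_", 1)[0] == question),
            key=lambda c: int(c.rsplit("_", 1)[1]),
            reverse=True,
        )
        # Back-fill across columns so the first column holds the newest answer.
        condensed[question] = df[cols].bfill(axis=1).iloc[:, 0]
    return pd.DataFrame(condensed)

condensed = condense_by_most_recent(answers)
print(condensed)
```

The second respondent skipped the 2017 question, so their 2016 answer is carried forward; a question with no answer in any year stays missing.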

To see what this process achieves, the code below can be uncommented to print a list of the features in the original subset and in the condensed subset.

In [None]:
# print("\n--- Original ---")
# print(voters_issues.columns)
# print("\n--- Condensed ---")
# print(voters_issues_condensed.columns)

&nbsp;

&nbsp;

## Label Features into Readable English

During the data retrieval stage, we scraped the VOTERS Survey Guide .pdf file to create a means of connecting variable codenames with readable English names. Here, the module below is used to undertake the actual translation.

In [None]:
voters_issues_condensed.columns = VOTERSguide.feature_labeler(voters_issues_condensed, guide_dict, make_concise=True)

To see the list of labeled features, uncomment and run the code below.

In [None]:
#voters_issues_condensed.columns

&nbsp;

&nbsp;

## Manage missing values

We can visualize missing values in our data with help of the `missingno` library, below.

In [None]:
import missingno
missingno.matrix(voters_issues_condensed)

As is typical for survey data, some respondents did not answer a couple of the questions in our variables of interest. Furthermore, a few respondents hardly answered any questions at all (if any). For the clustering algorithm to work, we need at least some value for the computer to work with. The first line of code below removes low-information respondents who did not answer a significant number of questions, while the second line of code imputes missing values in order to retain high-information respondents.

In [None]:
voters_issues_condensed = VOTERSdata.remove_low_response_respondents(voters_issues_condensed)

In [None]:
voters_issues_imputed = VOTERSdata.impute_nas(voters_issues_condensed)

Note that the imputation algorithm, with an emphasis on mode, works best for numerical data on a small scale (for example, Issue Importance, evaluated on a 1-4 scale). Larger scale numerical data would be better imputed with an emphasis on mean or median (for example, the "Feeling Thermometer" questions, evaluated on a 1-100 scale). ⚠️ *In the current version of this data analysis code, the latter solution has not yet been engineered.*
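For readers who want to see the two steps in miniature, here is a hedged sketch using plain pandas. The response threshold and the pure mode-based fill are assumptions made for illustration; the rules inside `remove_low_response_respondents` and `impute_nas` may be more elaborate.

```python
import pandas as pd

# Toy 1-4 "issue importance" answers; None marks a skipped question.
answers = pd.DataFrame(
    {
        "Crime": [1, 1, None, 2],
        "Jobs": [1, None, None, 1],
        "Taxes": [2, 2, None, 2],
    }
)

# Step 1: drop respondents who answered fewer than 2 of the 3 questions
# (the threshold is an arbitrary choice for this sketch).
kept = answers.dropna(thresh=2)

# Step 2: fill each remaining gap with that question's most common answer.
column_modes = kept.mode().iloc[0]
imputed = kept.fillna(column_modes)

print(imputed)
```

The respondent who skipped everything is removed, while the respondent who skipped one question keeps their other answers and receives the modal answer for the gap.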

Now that we have removed low-information respondents and imputed the remaining missing values, we can return to our missing values matrix and see that we are now working with a complete dataset.

In [None]:
missingno.matrix(voters_issues_imputed)

&nbsp;

&nbsp;

## Feature Agglomeration

Survey data are prone to a number of subjectivity biases that can skew the data.

The purpose of feature agglomeration is to reduce bias created by (creators of) the survey instrument itself. That is, if two or three questions are asked about similar topics (e.g., "climate change" and the "environment"), then these topics are condensed into one so the underlying concern is not given a double or triple amount of significance in determining the cluster, relative to a more unique question topic (e.g., "religious liberty").

In [None]:
voters_agglomerated = VOTERSdata.feature_agglomeration(voters_issues_imputed, 15, rounding=False)

⚠️ _The number of hybrid features, 15, is arbitrary. One way to improve this project would be to compare clustering metrics across different numbers of hybrid features, or to aggregate results across different numbers of hybrid features._
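To show the idea on a small scale, the sketch below uses scikit-learn's `FeatureAgglomeration` on synthetic data. Whether `VOTERSdata.feature_agglomeration` wraps this class is an assumption; the toy data and the choice of two hybrid features are purely illustrative.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

rng = np.random.default_rng(0)

# Toy data: 200 respondents, 6 questions. Questions 0-2 share one underlying
# concern and questions 3-5 share another, plus a little noise.
concern_a = rng.normal(size=(200, 1))
concern_b = rng.normal(size=(200, 1))
answers = np.hstack(
    [concern_a + 0.1 * rng.normal(size=(200, 3)),
     concern_b + 0.1 * rng.normal(size=(200, 3))]
)

# Merge the 6 questions into 2 hybrid features (the mean of each group).
agglomerator = FeatureAgglomeration(n_clusters=2)
hybrid = agglomerator.fit_transform(answers)

print(hybrid.shape)          # (respondents, hybrid features)
print(agglomerator.labels_)  # which hybrid feature each question joined
```

Questions that move together end up pooled into one hybrid feature, so a topic asked about three times no longer counts three times.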

To see the list of agglomerated features, uncomment and run the code below.

In [None]:
#voters_agglomerated.columns

&nbsp;

&nbsp;

## Correcting for Respondent Bias

Next, we remove two forms of bias created by the respondents and their answers to the questions. First, we remove "Abortion" as an issue of concern. Because of the unique nature of this issue, it is unclear if the concern being reported is "Abortion rights" or "Abortion restrictions". (Arguably, "Immigration" has a similar ambiguity, but this does not rise to the same degree as "Abortion", and it does not appear to affect our clusters in the same way abortion does, so we maintain this feature.) Note that, later on, we can still see how the "Abortion" feature is distributed among clusters even if we do not use this feature to construct the clusters.

In [None]:
voters_agglomerated.drop('Abortion', axis=1, inplace=True)

To make the math easier and more intuitive, the 4-1 scale is converted to a 0-3 scale.

In [None]:
voters_inverted = -((voters_agglomerated - 1) - 3)
print("🔀 Scale inverted.")
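A quick check of the inversion on toy data: the expression `-((x - 1) - 3)` simplifies to `4 - x`. The coding assumed here (1 = "very important" through 4 = "unimportant") is inferred from the text rather than confirmed against the survey guide.

```python
import pandas as pd

# Assumed survey coding: 1 = "very important" ... 4 = "unimportant".
raw = pd.DataFrame({"Jobs": [1, 2, 3, 4], "Crime": [4, 4, 1, 2]})

# -((x - 1) - 3) simplifies to 4 - x.
inverted = -((raw - 1) - 3)

assert inverted.equals(4 - raw)
print(inverted)
```

After inversion, higher numbers mean more concern and 0 means no concern, which is what the proportional scaling in the next step relies on.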

Second, many respondents are likely to respond to nearly every question as "very important". In terms of our clustering algorithm, this creates a problem in that the initial clusters that might be created are simply (1) people who are very concerned just about everything and (2) people who are not very concerned about much of anything. This is not as interesting as relative preferences, so to correct for this we create a "proportional" version of the dataset that adds up the "total amount of concern" from each respondent and uses that to scale each specific amount of issue concern. 

In [None]:
voters_proportional = VOTERSdata.importance_scale_proportionizer(voters_inverted)
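One plausible way to build such a proportional dataset is to divide each answer by the respondent's row total. The sketch below assumes exactly that, and also assumes that respondents with zero total concern are left at zero; `importance_scale_proportionizer` may handle either detail differently.

```python
import pandas as pd

# Inverted 0-3 scores for three toy respondents.
scores = pd.DataFrame(
    {"Jobs": [3, 1, 0], "Crime": [3, 0, 0], "Climate": [3, 1, 0]},
    index=["concerned about everything", "selective", "concerned about nothing"],
)

# Divide each answer by the respondent's total concern, so rows sum to 1.
# Rows with zero total concern are left at 0 (an assumption of this sketch).
totals = scores.sum(axis=1)
proportional = scores.div(totals.where(totals > 0), axis=0).fillna(0)

print(proportional.round(2))
```

The respondent who marks everything as very important ends up with a flat profile, while the selective respondent's priorities stand out.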

&nbsp;
 
&nbsp;

## Rank features by mean value (visualization aid)

In [None]:
#voters_proportional.columns

To make the graph easier to interpret, features are ranked by their mean total value. Uncomment code above and below to see the difference.

In [None]:
features_ranked = voters_proportional.mean().sort_values().keys()
voters_proportional = voters_proportional[features_ranked[::1]]
voters_inverted = voters_inverted[features_ranked[::1]]

In [None]:
#voters_proportional.columns

&nbsp;

&nbsp;

## Visualize entire dataset, pre-clustering

Before moving to a somewhat "magical" step of clustering our data, this is a good checkpoint for us to take a step back and visualize what our processed dataset now looks like. 

However, for our visualization to display properly, we first need to repeat some of our previous steps on a copy of the imputed version of the dataset, where rounding is introduced to the feature agglomeration function.

In [None]:
voters_agglomerated_visualization = VOTERSdata.feature_agglomeration(voters_issues_imputed, 15, rounding=True)
voters_agglomerated_visualization.drop('Abortion', axis=1, inplace=True)
voters_inverted_visualization = voters_agglomerated_visualization.applymap(VOTERSdata.importance_scale_inverter)
voters_inverted_visualization = voters_inverted_visualization[features_ranked[::1]]

Now that we have a version of the dataset that can be used for visualization, we can plug this into the code below to view the dataset in agglomerated form.

In [None]:
VOTERSvisualizer.cluster_visualization(voters_proportional, voters_proportional, voters_inverted_visualization, "Entire Dataset")

To visualize each cluster, the grey bars in the back indicate percentage of respondents who gave a particular answer to a survey question (or, in the case of features combined by agglomeration, the visualization is "rounded" to the nearest answer; i.e., two unimportants and one not very important becomes an unimportant). **<mark style="background-color:#999999">Dark grey</mark>** bars indicate the percentage of "very important" responses, whereas **<mark style="background-color:#EEEEEE">light grey</mark>** bars indicate the percentage of "unimportant" responses.

 

&nbsp;

&nbsp;

# Clustering Algorithm

All the code prior to this point has been for the purpose of preparing our data to be "clustered", an unsupervised machine learning technique that detects patterns in the data.

An easy example of clustering would be to imagine a dataset of U.S. cities in a two-dimensional space of latitude and longitude. A clustering algorithm might find the following clusters: (Seattle, Portland, Boise), (Minneapolis, Milwaukee, Chicago, Indianapolis), (Los Angeles, San Diego, San Francisco, Las Vegas), (New York, Philadelphia, Washington DC, Boston). 

In the case of the VOTERS dataset we have modified, the idea is the same, except the number of dimensions is equal to number of agglomerated features. That is: instead of (A) latitude and (B) longitude, we might have something like (A) "The economy, Jobs" (B) "Health Care, Social Security, Medicare", (C) "Crime, Terrorism, Taxes", and so on.

The algorithm used here is [Ward's method](https://en.wikipedia.org/wiki/Ward%27s_method), an agglomerative hierarchical clustering technique. Compared to [other clustering methods](https://scikit-learn.org/stable/modules/clustering.html), Ward's has several advantages.

* It is **hierarchical**. This means that every additional cluster is created as a split between two pre-existing clusters. Therefore, the structure of our clustering does not radically change if, for example, we go from five to six clusters or from ten to seven. Furthermore, because we can chart this hierarchy, we can easily see which clusters are most closely related to each other. 
* It is **comprehensive**. Every datapoint gets assigned a cluster, even if it is an outlier or ambiguously between two clusters. This reflects the reality that fringe or idiosyncratic political opinions do not necessarily discourage political participation: "every vote counts."
* It produces clusters of **relatively equal sizes**. While there will likely be some small clusters (particularly if a peculiar pattern of responses is given by a highly weighted respondent or two) and there likely will be a large center cluster or two, for the most part the size of clusters produced will be of a similar scale. 

That said, it would not be difficult to run a different clustering algorithm at this point. It would just take a few tweaks to the underlying code. All the pre-processing steps that were applied earlier, and all the visualization and analysis code that comes later, should work with most clustering algorithms.
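To make the city example concrete, here is a minimal sketch of Ward clustering with scikit-learn. The coordinates are approximate, treating degrees of latitude and longitude as flat x/y distances is a simplification, and the resulting groups are not guaranteed to match the illustrative clusters listed earlier.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Approximate (latitude, longitude) pairs for the cities named in the text.
cities = {
    "Seattle": (47.6, -122.3), "Portland": (45.5, -122.7), "Boise": (43.6, -116.2),
    "Minneapolis": (45.0, -93.3), "Milwaukee": (43.0, -87.9),
    "Chicago": (41.9, -87.6), "Indianapolis": (39.8, -86.2),
    "Los Angeles": (34.1, -118.2), "San Diego": (32.7, -117.2),
    "San Francisco": (37.8, -122.4), "Las Vegas": (36.2, -115.1),
    "New York": (40.7, -74.0), "Philadelphia": (40.0, -75.2),
    "Washington DC": (38.9, -77.0), "Boston": (42.4, -71.1),
}

coordinates = np.array(list(cities.values()))

# Ward linkage merges, at each step, the two clusters whose union adds the
# least within-cluster variance.
model = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = model.fit_predict(coordinates)

city_labels = dict(zip(cities, labels))
for cluster_id in sorted(set(labels)):
    members = [name for name, label in city_labels.items() if label == cluster_id]
    print(cluster_id, members)
```

Swapping in a different algorithm would mostly mean replacing the `AgglomerativeClustering` line with another estimator that exposes `fit_predict`.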

&nbsp;

&nbsp;

### Performance Metrics over Number of Clusters

First, we need to determine the number of clusters we want to work with.

The code below takes our data and evaluates the performance of the algorithm across a range of total clusters. For both metrics displayed in the first visualization below, silhouette and Calinski-Harabasz, the higher the number the better. In the second visualization, the greater the distance between "splits" the better.

In [None]:
silhouette_scores, calinski_harabasz_scores, cal = VOTERSclustering.performance_per_no_clusters(voters_proportional, 15)
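For intuition about the two metrics, the sketch below scores Ward clusterings of synthetic data over a range of cluster counts, using scikit-learn's `silhouette_score` and `calinski_harabasz_score`. The data is a stand-in with three obvious groups, and the loop is an assumption about the general approach, not a copy of `performance_per_no_clusters`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import calinski_harabasz_score, silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the survey data: three well-separated 2-D blobs.
centers = np.array([[0, 0], [10, 10], [-10, 10]])
points = np.vstack([center + rng.normal(size=(50, 2)) for center in centers])

silhouette = {}
calinski_harabasz = {}
for n_clusters in range(2, 7):
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, linkage="ward"
    ).fit_predict(points)
    silhouette[n_clusters] = silhouette_score(points, labels)
    calinski_harabasz[n_clusters] = calinski_harabasz_score(points, labels)

for n_clusters in silhouette:
    print(n_clusters, round(silhouette[n_clusters], 3),
          round(calinski_harabasz[n_clusters], 1))
```

With data this clean, the silhouette score peaks at the true number of groups; survey data rarely gives such an unambiguous answer.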

While higher Calinski-Harabasz and higher Silhouette scores are desirable, we also are interested in a higher number of clusters. That is, the more clusters we have, the more "precise" we can be, even though we risk being less "accurate". One way to determine a threshold number of clusters we might want to work with is by looking for local maxima in the Silhouette score (where the Silhouette score is increasing). The code below is helpful for this purpose.

In [None]:
VOTERSclustering.metrics_table(calinski_harabasz_scores, silhouette_scores)

In [None]:
#🥒 
cluster_metrics = VOTERSclustering.metrics_table(calinski_harabasz_scores, silhouette_scores)
with open('cluster_metrics', 'wb') as metrics_pickle:
    pickle.dump(cluster_metrics, metrics_pickle)

&nbsp;

&nbsp;

### Make Determined Number of Clusters from Agglomerated Features

Once we have determined the number of clusters we want to use, we can now make cluster assignments and a dendrogram given that specific number.

In [None]:
cluster_assignments = VOTERSclustering.get_cluster_assignments(voters_proportional, 8, dendrogram_generate=True)

⚠️ *Note that in order to match individual clusters from the dendrogram above with the visualizations below, this process has to be done by cluster size rather than an id or order number. This admittedly is a limitation in the current usability of the code, and is the consequence of using two different Python modules for cluster assignment and the dendrogram calculations.*

In [None]:
#🥒 
with open('cluster_assignments', 'wb') as cluster_pickle:
    pickle.dump(cluster_assignments, cluster_pickle)

&nbsp;

&nbsp;

### Make clusters from non-agglomerated features

We now create a new copy of the dataset from right before we agglomerated the features. Although we use agglomerated features for purposes of clustering, for interpretation of the clusters it makes sense to keep features separate (that is, saying a respondent considers "Education" to be *very important* and "Poverty" to be *somewhat unimportant* tells us more than if they consider "Education, Poverty" to be *somewhat important*).

In [None]:
voters_nonagglomerated = voters_issues_imputed.copy()

We run the new copy of the dataset through the same pre-processing steps as before that correct for respondent bias.

In [None]:
voters_inverted = voters_nonagglomerated.applymap(VOTERSdata.importance_scale_inverter)
voters_proportional = VOTERSdata.importance_scale_proportionizer(voters_inverted)

And, again, we rank features to make our dataset easier to visualize.

In [None]:
features_ranked = voters_proportional.mean().sort_values().keys()
voters_proportional = voters_proportional[features_ranked[::1]]
voters_inverted = voters_inverted[features_ranked[::1]]

We now apply our cluster assignments to the non-agglomerated dataset.

In [None]:
clusters_proportional = VOTERSclustering.assign_clusters(voters_proportional, cluster_assignments)
clusters_absolute = VOTERSclustering.assign_clusters(voters_inverted, cluster_assignments)

And, now, we can visualize our dataset on a per cluster basis! Note that **<font color="red">red lines</font>** indicate that for that particular feature, the cluster as a whole considers the issue less important than the average person in the sample, whereas **<font color="blue">blue lines</font>** indicate that the cluster as a whole considers the issue to be more important than the average person.

In [None]:
VOTERSvisualizer.show_all_clusters(clusters_proportional, voters_proportional, clusters_absolute)
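The red and blue lines described above can be thought of as each cluster's mean minus the sample mean. A small sketch of that calculation follows; the `cluster` column and the toy values are hypothetical, and the plotting code in `VOTERSvisualizer` may compute the lines differently.

```python
import pandas as pd

# Toy proportional scores with a hypothetical "cluster" column.
data = pd.DataFrame(
    {
        "cluster": ["A", "A", "B", "B"],
        "Jobs": [0.6, 0.5, 0.2, 0.3],
        "Climate": [0.4, 0.5, 0.8, 0.7],
    }
)

overall_mean = data[["Jobs", "Climate"]].mean()
cluster_means = data.groupby("cluster")[["Jobs", "Climate"]].mean()

# Positive values correspond to blue lines (more important than average),
# negative values to red lines (less important than average).
deviation = cluster_means - overall_mean
print(deviation.round(2))
```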

Notice in particular places where there are some respondents who consider the issue to be of high importance (indicated by **<mark style="background-color:#999999">dark grey</mark>**) but the cluster as a whole considers the issue to be of less importance (indicated by a long **<font color="red">red line</font>**). 

Similarly, there are some respondents who consider the issue to be of low importance (indicated by **<mark style="background-color:#EEEEEE">light grey</mark>**) but the cluster as a whole considers the issue to be of greater importance (indicated by a long **<font color="blue">blue line</font>**).

These "outliers" are significant in the context of our investigation, because they indicate voters whose ability to express a policy preference on a particular issue is crowded out by their other preferences. In terms of efficient use of resources within a political economy, political institutions (e.g. elected officials, broad-based advocacy organizations, political parties, etc.) will tend towards satisfying the concerns of a cluster as a whole rather than the idiosyncrasies of each individual voter.

One democratically frustrating outcome this can potentially create is that voters who care deeply for a particular issue ("Animal Rights", to use an example not in the survey) are concentrated (say, 80%) in clusters that do not add up to a majority and spread out among other clusters (say, 40%) that do add up to a majority. So even if over 50% of the population cares deeply about Animal Rights, this direct majority might not be enough to effect policy change in a representative system.
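The numbers in this hypothetical can be worked through directly. In the sketch below, the 80% and 40% figures come from the text, while the 45%/55% population split is an assumption added to complete the example.

```python
# Hypothetical population shares and levels of deep concern for one issue.
# "concentrated" clusters hold 45% of voters; 80% of them care deeply.
# "diffuse" clusters hold 55% of voters; 40% of them care deeply.
groups = {
    "concentrated": {"population_share": 0.45, "support": 0.80},
    "diffuse": {"population_share": 0.55, "support": 0.40},
}

# Direct majority: population-weighted support across all voters.
overall_support = sum(g["population_share"] * g["support"] for g in groups.values())

# Representative outcome: share of the population living in clusters
# where supporters are the internal majority.
represented_support = sum(
    g["population_share"] for g in groups.values() if g["support"] > 0.5
)

print(f"Overall support:        {overall_support:.0%}")
print(f"Clusters in favor hold: {represented_support:.0%} of the population")
```

Under these assumptions a 58% direct majority is represented only by clusters holding 45% of the population.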

&nbsp;

&nbsp;

## Identify possible coalitions

The above graphs are made available so that an interpreter can describe each cluster by its policy preference profile, and the hierarchical dendrogram can be used for identifying which clusters are most closely related to each other. 

That said, given most parameters, it is unlikely that any particular cluster already represents 50% of the sampled population. The code below aids in interpretation by finding possible coalitions of clusters that add up to 50% of the population.

First, we need to calculate cluster sizes (relative to the total dataset).

In [None]:
cluster_sizes = VOTERSclustering.cluster_sizes_dict(clusters_proportional, voters_proportional) #can also be voters inverted
display_cluster_sizes(cluster_sizes)

The first analysis that can be done is by issue. Here, we add up the clusters that are "positive" on a particular issue, and also add up the clusters that are "negative" on a particular issue. If concern for an issue was evenly distributed between clusters, we would expect something like a 50/50 split here. However, if above-average support for an issue is concentrated in particular clusters, then the clusters that are positive on it will add up to less than half of the population (and vice versa). The greater this difference, the more potentially frustrating it will be to achieve desired policy outcomes through a representative democratic process on that particular issue.

In [None]:
VOTERSclustering.issues_by_clustered_advantage(clusters_proportional, voters_proportional, cluster_sizes)
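The per-issue tally can be sketched in plain Python. The cluster sizes and above/below-average flags here are invented for illustration, and the output format of `issues_by_clustered_advantage` is not reproduced.

```python
# Hypothetical cluster sizes (shares of the sample) and, per issue, whether
# each cluster's mean concern is above (True) or below (False) the sample mean.
cluster_sizes = {"A": 0.40, "B": 0.35, "C": 0.25}
above_average = {
    "Jobs": {"A": True, "B": True, "C": False},
    "Climate": {"A": False, "B": False, "C": True},
}

advantage = {}
for issue, by_cluster in above_average.items():
    positive = sum(cluster_sizes[c] for c, above in by_cluster.items() if above)
    negative = sum(cluster_sizes[c] for c, above in by_cluster.items() if not above)
    advantage[issue] = round(positive - negative, 2)

print(advantage)
```

A strongly negative value flags an issue whose above-average support is confined to clusters that together fall short of half the sample.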

However, looking at the landscape on an issue-by-issue basis doesn't tell the full story. While thinking in terms of 50% or more support might make sense for passing legislation on a piecemeal basis, it certainly does not indicate if those clusters could work together on a sustained basis (especially, in the American context, within the executive branch of government). 

The following code can be used to see which cluster coalitions are feasible. Here, we search for consensus coalitions which represent half of the population and agree on at least four issues. 

In [None]:
cluster_coalitions = VOTERSclustering.generate_cluster_coalitions(cluster_sizes, maximum = 10, threshold = .5)
VOTERSclustering.find_consensus_coalitions(cluster_coalitions, clusters_proportional, voters_proportional, 
                                           agreements = 4, show_negatives=False)
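The search for coalitions that reach a population threshold can be illustrated with `itertools.combinations`. This is a brute-force sketch with hypothetical cluster sizes (in whole percentage points, to avoid floating-point ties at the threshold); it assumes "reaching" the threshold means strictly exceeding it, and it does not implement the issue-agreement check that `find_consensus_coalitions` performs.

```python
from itertools import combinations

# Hypothetical cluster sizes in percentage points of the sample.
cluster_sizes = {"A": 30, "B": 25, "C": 20, "D": 15, "E": 10}

def coalitions_over_threshold(sizes, threshold=50):
    """Return every combination of clusters whose combined share exceeds threshold."""
    found = []
    for r in range(1, len(sizes) + 1):
        for combo in combinations(sizes, r):
            share = sum(sizes[name] for name in combo)
            if share > threshold:
                found.append((combo, share))
    return found

majority_coalitions = coalitions_over_threshold(cluster_sizes, threshold=50)
smallest = min(len(combo) for combo, _ in majority_coalitions)
print(f"{len(majority_coalitions)} coalitions; smallest uses {smallest} clusters")
for combo, share in majority_coalitions[:3]:
    print(combo, share)
```

Because the number of combinations doubles with every added cluster, a cap on how many coalitions or clusters are considered keeps this kind of search manageable.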

This code is versatile! Here, it can be used to see which consensus coalitions represent a supermajority (two-thirds) of the population and agree on at least one issue.

In [None]:
cluster_coalitions = VOTERSclustering.generate_cluster_coalitions(cluster_sizes, maximum = 10, threshold = .67)
VOTERSclustering.find_consensus_coalitions(cluster_coalitions, clusters_proportional, voters_proportional, 
                                           agreements = 1, show_negatives=False)

&nbsp;

&nbsp;

## What else can we know about these clusters?

While creating these clusters only required that we use a subset of the data available to us in the VOTERS data, we can use the remainder of data available to us through VOTERS to learn more about each cluster. First, we match our clusters up with the original (post-expansion-by-weight) dataset.

In [None]:
full_data_clusters = VOTERSclustering.create_full_data_clusters(reconstituted_voters_panel, clusters_absolute)
print("🗃️🔺✨ Cluster assignments merged with original dataset, expanded by weight.")

Then, we can use most any variable in the survey to learn more about a particular cluster. Below are results for political party identification in 2017, followed by race, religion, and gender.

⚠️ _Note that the dictionary scraper isn't perfect! For example, it is likely to be confused by questions where there are numerals in the answer, such as "highest level of education completed" having an answer like "**4**-year degree"_.

In [None]:
VOTERSdata.investigate_variable_across_clusters(guide_dict, 'pid3_2017', full_data_clusters)

In [None]:
VOTERSdata.investigate_variable_across_clusters(guide_dict, 'race_2017', full_data_clusters)

In [None]:
VOTERSdata.investigate_variable_across_clusters(guide_dict, 'religpew_2017', full_data_clusters)

In [None]:
VOTERSdata.investigate_variable_across_clusters(guide_dict, 'gender_2018', full_data_clusters)
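For readers curious what such a per-cluster breakdown looks like in plain pandas, here is a minimal sketch with `pd.crosstab`. The toy rows and column names are hypothetical, and the output of `investigate_variable_across_clusters` may be formatted differently.

```python
import pandas as pd

# Toy stand-in for the merged data: one row per (weighted) respondent.
respondents = pd.DataFrame(
    {
        "cluster": ["A", "A", "A", "B", "B", "B", "B"],
        "party": ["Democrat", "Democrat", "Independent",
                  "Republican", "Republican", "Independent", "Democrat"],
    }
)

# Share of each party within each cluster (rows sum to 1).
breakdown = pd.crosstab(respondents["cluster"], respondents["party"], normalize="index")
print(breakdown.round(2))
```

Because the dataset was already expanded by weight, a simple row count per cluster approximates the weighted share.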

Note that if, for reference purposes, you are interested in seeing the breakdown of a particular variable for the entire dataset, the following code is also available.

In [None]:
VOTERSdata.investigate_variable_by_subset(guide_dict, 'religpew_2017', reconstituted_voters_panel)

&nbsp;

&nbsp;

## Conclusion

This project demonstrates a possible application of unsupervised machine learning that may be of interest to political scientists and policy entrepreneurs alike. In particular, it shows the potential of clustering algorithms to identify issues where popular concern may be disproportionately concentrated in particular patterns of policy preferences, frustrating attempts at legislative reform or executive action on the issue.

That said, there are a number of places where the methodology of this project may still be significantly improved upon.

**First, the underlying dataset, while generally terrific for a whole range of investigations, was not necessarily designed for machine learning applications and has some notable limitations for our purposes here.** For one, a more comprehensive set of issues would have been useful. While it is impossible to cover everything, it appears that gun ownership, marijuana legalization, criminal justice reform, foreign aid, and trade policy are all significant issues that have been omitted. In addition, the policy implications of each issue could be made clearer in the questions that are asked. Especially in the case of "abortion" or "immigration", it is not clear if what is desired by the respondent are "restrictions" or "rights". Finally, there was a tendency for respondents to answer nearly every issue question as "very important", a sort of ask-for-anything Santa Claus view of the political economy. To better reflect reality, it might have made more sense for respondents to be asked something along the lines of "if you had $100 to allocate to each of the following issues, how much would you allocate to each?" This would help separate high responses due to social desirability bias from high responses related to genuine concern that would affect voter behavior. 

**Second, the machine learning algorithm remains a work in progress. (Note that the silhouette score is often in the 0.05 - 0.20 range, where we would prefer for it to be at least 0.5.)** Whether due to limitations in the data (above) or the highly complicated nature of political reality the data attempts to reflect, these clusters are not incredibly robust. While the hierarchical clustering method means that the structure of the clusters is relatively stable across a range of number of clusters being generated, the clusters are prone to appearing radically different given changes in parameters such as number of features agglomerated, the imputation formula that is used, or the "magnitude" by which the dataset is unpacked according to respondent weight. One possible solution to this problem is to aggregate results across parameters and see which results are more likely to appear consistently. Another possible solution would be to add other features (for example, demographic variables) to the clustering to see if that results in more stable clusters; these "dirty" clusters might better reflect reality, even though for making theoretical conclusions they are less ideal than "pure" clusters made strictly from policy preferences.

**Speaking of which, our theoretical model could use some work as well.** Most obviously, individual issue politics are often complicated by stakeholder interests that may encourage or discourage particular policy outcomes through lobbying or campaigning.

Furthermore, in the American political system voters are federated (e.g., the Electoral College and the Senate) and systematically over/underrepresented (e.g., voter identification laws and gerrymandering), which deeply complicates the math involved in figuring out how much political power a particular cluster of policy preferences actually has. That said, the process of forming "cluster coalitions" seems to reflect the political experience of multiparty parliamentary democracies (e.g. Canada) more so than the United States example, but I would argue that this actually highlights the value of this form of analysis. Whereas in other countries this process of "clusters forming coalitions" is explicit and formalized, in the United States the same process occurs but is more organic and implicit (e.g., it happens within and between the two major parties and a large swath of independents). While "clusters forming coalitions" is not the whole story of American politics, if the technical aspects detailed above can be improved upon, then this part of the American political landscape can be told more clearly and inserted into appropriate theoretical models that can tell a larger story.
    
### Roadmap for future development
*A possible addition would be a bot that takes a random voter in the sample and describes them in plain English.*


### Example analysis for publication

&nbsp;

&nbsp;

## Special Thanks
<div align="right">
<img src="https://upload.wikimedia.org/wikipedia/commons/6/61/FS_wiki.png" width="300" align="left"/>
</div> <p><br><br>This code and demonstration notebook were completed as part of a <a href="https://flatironschool.com/career-courses/data-science-bootcamp/dc">Data Science Fellowship</a> with the Flatiron School's Washington DC Campus. Special thanks to the numerous instructors, coaches, curriculum designers and cohort-mates who guided me through the learning and development process.</p>
