This assignment asks you to cluster data from a credit union in Washington. The data was gathered by survey. I've included a full description of the data in the appendix below.

In class we used code from _Data Science From Scratch_ to cluster, but it's more traditional to use the [`KMeans` class](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) from `scikit-learn`. I'll show you a brief example of that with the survey data from class, then ask you to do your own clustering.

### Clustering Class Survey Data

In class we surveyed students from the MSBA and put them into groups. We'll repeat that work here. 


In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans

In [None]:
example_data = pd.read_csv("survey_responses.txt",sep="\t")

In [None]:
example_data

As in class, let's rename the columns and convert the Yes/No responses to numeric values.

In [None]:
current_names = example_data.columns
new_names = "timestamp name birth_dist age_15_dist school_years ug_mkt ug_biz hh_size".split()

In [None]:
example_data = example_data.rename(columns = dict(zip(current_names,new_names)))

In [None]:
# Map the Yes/No responses to 1/0 so they can be used as numeric features.
# Any value missing from the mapping becomes NaN.
example_data.ug_mkt = example_data.ug_mkt.map(dict(Yes=1,No=0))
example_data.ug_biz = example_data.ug_biz.map(dict(Yes=1,No=0))


In [None]:
example_data

In [None]:
numeric_cols = "birth_dist age_15_dist school_years ug_mkt ug_biz hh_size".split()


In [None]:
num_example_data = example_data.loc[:,numeric_cols]

In [None]:

# Min-max scale each column to the [0, 1] range.
example_normalized = (num_example_data - num_example_data.min()) / (num_example_data.max() - num_example_data.min())

In [None]:
example_normalized

Let's set up our K-Means model. Some of the below code is adapted from 

In [None]:
km = KMeans(
    n_clusters=3, init='random',
    n_init=10, max_iter=300, 
    tol=1e-04, random_state=0
)


In [None]:
clustering = km.fit_predict(example_normalized)

In [None]:
example_data['cluster'] = clustering

In [None]:
for cluster in set(clustering) :
    print(f'Printing Results for Cluster {cluster}.\n\n')
    
    print(example_data.query(f"cluster == {cluster}"))
    
    print("\n\n")
    

There's much more we could do here, such as printing out the cluster means. With the current min-max scaling, the clusters appear to split mainly on the undergraduate-degree columns (`ug_mkt` and `ug_biz`). It might be more interesting to standardize instead, subtracting each column's mean and dividing by its standard deviation.
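Here is a minimal sketch of those two ideas, standardizing by mean and standard deviation and then printing cluster means. It runs on a small made-up table rather than the class survey; the `toy` frame, its values, and the choice of two clusters are assumptions for illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for num_example_data; the values are made up for illustration.
toy = pd.DataFrame({
    "birth_dist": [5, 1200, 30, 8000, 15, 2500],
    "school_years": [16, 18, 17, 20, 16, 19],
    "ug_biz": [1, 0, 1, 0, 1, 0],
    "hh_size": [2, 4, 1, 3, 2, 5],
})

# Z-score: subtract each column's mean and divide by its standard deviation.
toy_standardized = (toy - toy.mean()) / toy.std()

# Two clusters is an arbitrary choice for this small example.
km_z = KMeans(n_clusters=2, init="random", n_init=10, random_state=0)
toy["cluster"] = km_z.fit_predict(toy_standardized)

# Cluster means on the original (unscaled) columns are easier to interpret.
cluster_means = toy.groupby("cluster").mean()
print(cluster_means)
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, so the result differs slightly from scalers that use the population standard deviation.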

## Demographic Clustering

In this section, perform a cluster analysis using the fields unrelated to the values survey:

* age
* gender
* engagement
* account.age

Choose a number of clusters and use K-means to cluster the data using these fields. Briefly describe the clusters. How evenly do your clusters align with the regions?
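Two of these fields (gender and engagement) are categorical, so they need a numeric encoding before K-means can use them. The sketch below shows one possible approach, one-hot encoding followed by the same min-max scaling used earlier, and then a cross-tabulation of clusters against regions. The toy rows, the engagement labels, and the region labels are all made up; the real file may use different values.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up rows using the assignment's column names; values are hypothetical.
toy = pd.DataFrame({
    "age": [23, 35, 47, 59, 31, 64, 28, 52],
    "gender": ["female", "male", "female", "male",
               "other", "female", "male", "female"],
    "engagement": ["low", "medium", "high", "high",
                   "low", "medium", "low", "high"],
    "account.age": [1, 5, 12, 20, 3, 25, 2, 15],
    "region": ["West", "East", "West", "East",
               "West", "East", "West", "East"],
})

# One-hot encode the categorical fields; dtype=float keeps everything numeric.
features = pd.get_dummies(
    toy[["age", "gender", "engagement", "account.age"]],
    columns=["gender", "engagement"],
    dtype=float,
)

# Min-max scale so age and account age do not dominate the 0/1 columns.
scaled = (features - features.min()) / (features.max() - features.min())

# Three clusters is an arbitrary choice for this sketch.
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0)
toy["cluster"] = km_demo.fit_predict(scaled)

# Rows are clusters, columns are regions, cells are member counts.
region_table = pd.crosstab(toy["cluster"], toy["region"])
print(region_table)
```

A table like `region_table` makes it easy to see whether a cluster is concentrated in one region or spread across several.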

In [None]:
# Your work here

## Values Clustering

As in the previous section, perform a cluster analysis, this time on the values questions:

* fair
* harm
* in.group
* authority
* purity
* public.sector
* sustainability
* localism

After you’ve built your clusters, report the following information on each cluster:

* Predominant region
* Average age and account age
* Most common focal value
* Mean results on the questions of `pub.greater.priv`, `experience.more.important`, and `teachers.underpaid`. 
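One way to assemble a report like this is a single grouped aggregation, sketched below on made-up data. The column names follow the data description, but every value, including the focal-value and region labels and the pre-assigned `cluster` column, is hypothetical.

```python
import pandas as pd

# Made-up rows with cluster labels already assigned; values are hypothetical.
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "region": ["West", "West", "East", "East", "East", "West"],
    "age": [30, 40, 50, 20, 60, 70],
    "account.age": [2, 4, 6, 1, 10, 13],
    "main.focal.value": ["Education", "Education", "Environment",
                         "Health", "Health", "Education"],
    "pub.greater.priv": [4, 5, 3, 2, 3, 1],
    "experience.more.important": [3, 3, 3, 5, 6, 4],
    "teachers.underpaid": [6, 5, 4, 3, 2, 4],
})


def most_common(series):
    """Return the most frequent value (first one alphabetically on ties)."""
    return series.mode().iloc[0]


# Named aggregation: one output column per (input column, function) pair.
summary = toy.groupby("cluster").agg(
    predominant_region=("region", most_common),
    mean_age=("age", "mean"),
    mean_account_age=("account.age", "mean"),
    top_focal_value=("main.focal.value", most_common),
    mean_pub_greater_priv=("pub.greater.priv", "mean"),
    mean_experience=("experience.more.important", "mean"),
    mean_teachers_underpaid=("teachers.underpaid", "mean"),
)
print(summary)
```

The result has one row per cluster, which makes the clusters easy to compare side by side.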

In [None]:
# your work here

## Appendix: Full Data Description
A financial institution in Washington has become concerned that their current membership 
base is not well-aligned with their corporate values. Through that concern they 
realized that they don't actually understand their membership's values very well. They 
surveyed 2,421 members to shed light on the issue. 

The heart of the survey was the Moral Foundations Theory of Jonathan Haidt. Members 
were surveyed on the Moral Foundations Questionnaire, which you should take so you 
understand the test. Survey respondents were scored on the five foundations as well 
as a single-number summary, Progressivism. 

The financial institution values Localism, Sustainability, and Education. These aspects 
of members' values were assessed in the survey as well. Localism and Sustainability used
validated scales and thus can be summarized via a single score, where higher values indicate
greater support for the values. Education is summarized by the following three questions, 
which we do not have evidence can be combined into a single score:

* In general, public schools provide a better education than private schools.
* Public school teachers are underpaid.
* Experience is more important than education in determining success in life.

These questions were evaluated on a 1 to 6 scale where 1 indicated "Strongly Disagree" and 
6 indicated "Strongly Agree". 

Finally, we have information on the member that can be used to understand variation 
in their values. 

The data consists of the following columns:

* ID: a unique identifier for the survey respondent.
* age: the age of the respondent.
* gender: gender was evaluated with a robust scale and collapsed into male/female/other, 
  where "other" covers those whose gender identity was not male or female.
* engagement: three categories of engagement with the financial institution.
* mem.edu: the self-reported education level of the member with the following scale:
* zip: the member zip code. 
* channel: how the member joined the financial institution. Options are "Loan" if they joined 
  via an auto loan, "Branch" if they joined at a branch, and other for online or unknown. 
* progressivism/harm/fair/in.group/authority/purity: The MFQ results.
* account.age: the age of the member's account, in years. 
* region: The region of Washington the member lives in. May be easier to work with than zip.
* public.sector: has the person ever been a public employee?
* sustainability/localism: Scores on the validated scales. Higher values indicate greater
  support for the value.
* pub.greater.priv/experience.more.important/teachers.underpaid: The responses to the 
  education questions above. 
* main.focal.value: Respondents were asked, "Below is a list of broad areas to which people 
  often dedicate their volunteer or philanthropic efforts. From this list, please select the 
  most important to you. If an area of particular importance is missing, please let us know 
  about it in the space for 'other.'" This column holds the respondent's answer to that question.
* support.of.focal.value: Respondents were given an opportunity to indicate how they 
  supported their focal value. Those responses were collapsed into a single score, where 
  a higher value indicates more support.
