This assignment asks you to cluster survey data from a credit union in Washington. I've included a fairly full description of the data in the appendix below.

In class we used code from _Data Science From Scratch_ to cluster, but it's more traditional to use the [`KMeans` class](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) from `scikit-learn`. I'll show you a brief example of that with the survey data from class, then I'll ask you to do your own clustering.

### Clustering Class Survey Data

In class we surveyed students from the MSBA and put them into groups. We'll repeat that work here. 


In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans

In [None]:
example_data = pd.read_csv("survey_responses.txt",sep="\t")

In [None]:
example_data

As in class, let's convert the data to numeric and we'll do some renaming as well. 

In [None]:
current_names = example_data.columns
new_names = "timestamp name birth_dist age_15_dist school_years ug_mkt ug_biz hh_size".split()

In [None]:
example_data = example_data.rename(columns = dict(zip(current_names,new_names)))

In [None]:
# Map the Yes/No answers to 1/0 so these columns are numeric
example_data["ug_mkt"] = example_data["ug_mkt"].map({"Yes": 1, "No": 0})
example_data["ug_biz"] = example_data["ug_biz"].map({"Yes": 1, "No": 0})


In [None]:
example_data

In [None]:
numeric_cols = "birth_dist age_15_dist school_years ug_mkt ug_biz hh_size".split()


In [None]:
num_example_data = example_data.loc[:,numeric_cols]

In [None]:

# Min-max scale each column to the [0, 1] range
example_normalized = (num_example_data - num_example_data.min()) / (num_example_data.max() - num_example_data.min())

In [None]:
example_normalized

Let's set up our K-Means model. Some of the below code is adapted from 

In [None]:
km = KMeans(
    n_clusters=3, init='random',
    n_init=10, max_iter=300, 
    tol=1e-04, random_state=0
)


In [None]:
clustering = km.fit_predict(example_normalized)

In [None]:
example_data['cluster'] = clustering

In [None]:
for cluster in set(clustering) :
    print(f'Printing Results for Cluster {cluster}.\n\n')
    
    print(example_data.query(f"cluster == {cluster}"))
    
    print("\n\n")
    

There's much more we could do here, such as printing out the cluster means. With this current scaling approach it appears that we're basically splitting into UG degrees. It might be more interesting to normalize by subtracting the mean of the column and dividing by the standard deviation. 
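
Both ideas can be sketched in a few lines. The cell below is only an illustration: `toy` is a small made-up table standing in for `num_example_data` (the values are invented, not the class survey), and the names `toy_z` and `km_z` are hypothetical. It standardizes each column to a z-score, clusters on the standardized values, and prints the cluster means in the original units. Note that pandas' `.std()` uses the sample standard deviation by default.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented stand-in for num_example_data; these are not the real responses.
toy = pd.DataFrame({
    "birth_dist":   [5, 10, 2000, 2500, 8, 3000],
    "school_years": [16, 17, 18, 16, 20, 17],
    "ug_biz":       [1, 1, 0, 0, 1, 0],
    "hh_size":      [2, 3, 1, 4, 2, 5],
})

# Z-score: subtract each column's mean and divide by its standard deviation.
toy_z = (toy - toy.mean()) / toy.std()

# Cluster on the standardized values.
km_z = KMeans(n_clusters=2, init="random", n_init=10, random_state=0)
toy["cluster"] = km_z.fit_predict(toy_z)

# Cluster means, reported in the original (unscaled) units.
print(toy.groupby("cluster").mean())
```

Unlike min-max scaling, z-scores do not force a 0/1 column to span the same range as every other column, so the binary degree columns may dominate less. Whether that actually changes the grouping would need to be checked on the real data.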

## Demographic Clustering

In this section, perform a cluster analysis using the fields unrelated to the values survey:

* age
* gender
* engagement
* account.age

Choose a number of clusters and use K-means to cluster the data using these fields. Briefly describe the clusters. How evenly do your clusters align with the regions?
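
One common heuristic for choosing the number of clusters is the "elbow" in the within-cluster sum of squares, which `KMeans` exposes as `inertia_` after fitting. The sketch below assumes synthetic data (three made-up blobs in `points`), not the survey itself; to apply it here you would pass your normalized demographic columns instead.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the normalized columns: three loose blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.2, 0.2], [0.8, 0.2], [0.5, 0.9]])
points = np.vstack([c + rng.normal(scale=0.05, size=(50, 2)) for c in centers])

# Fit K-means for a range of k and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, init="random", n_init=10, random_state=0)
    model.fit(points)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.3f}")
```

Inertia generally keeps falling as k grows, so the useful signal is where the drop levels off rather than where it is smallest. On real survey data the elbow is often much less sharp than in this toy example.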

In [None]:
# Read in data

survey_data = pd.read_csv("washington_survey_data.txt",sep="\t")

survey_data


In [None]:
# Change column names
current_names_ws = survey_data.columns
new_names_ws = "id age gender engagement edu zip channel progressivism harm fair group authority purity accountage region publicsector sustainability localism pubgreaterpriv experiencemoreimportant teachersunderpaid mainfocalvalue supportoffocalvalue".split()


In [None]:
# Change column names
survey_data = survey_data.rename(columns = dict(zip(current_names_ws,new_names_ws)))

In [None]:
# Change columns to numeric
survey_data['gender'] = survey_data['gender'].map({'female': 1, 'male': 2, 'other': 3})
survey_data['publicsector'] = survey_data['publicsector'].map({'yes': 1, 'no': 0}) 
survey_data['engagement'] = survey_data['engagement'].map({'Not Engaged':0,'Engaged':1,'Highly Engaged':2})


In [None]:
survey_data

In [None]:
demo_cols = "age gender engagement accountage".split()

In [None]:
num_survey_data_demo = survey_data.loc[:,demo_cols]

In [None]:
# Min-max scale each demographic column to the [0, 1] range
survey_normalized_demo = (num_survey_data_demo - num_survey_data_demo.min()) / (num_survey_data_demo.max() - num_survey_data_demo.min())

In [None]:
# Set up demographic clustering model

d_km = KMeans(
    n_clusters=4, init='random',
    n_init=10, max_iter=300, 
    tol=1e-04, random_state=0
)

In [33]:
# Fit the model
demo_clustering = d_km.fit_predict(survey_normalized_demo)

In [34]:
# Create cluster column
survey_data['clusterdemo'] = demo_clustering

In [44]:
pd.set_option('display.max_columns', None)

In [45]:
# Results
for clusterdemo in set(demo_clustering) :
    print(f'Printing Results for Cluster {clusterdemo}.\n\n')
    
    print(survey_data.query(f"clusterdemo == {clusterdemo}"))
    
    print("\n\n")

Printing Results for Cluster 0.


         id  age  gender  engagement  edu      zip channel  progressivism  \
1       348   66       1           2    3  98012.0  Branch       2.291667   
9       371   71       1           2    6      NaN  Branch       3.166667   
10      372   73       1           2    6  98506.0  Branch       1.958333   
11      372   73       1           2    6  98506.0  Branch       1.958333   
16      424   38       1           2    4  98660.0  Branch       1.500000   
...     ...  ...     ...         ...  ...      ...     ...            ...   
2409  21255   45       1           2    4  98503.0  Branch       1.208333   
2410  21767   25       1           2    5      NaN  Branch       0.375000   
2411  21768   55       1           2    4  98579.0  Branch       1.916667   
2412  21872   64       1           2    4  99205.0  Branch       1.458333   
2417  22163   93       1           1    5  98103.0  Branch       1.583333   

      harm  fair  group  authority  purit

In [30]:
regional_analysis = survey_data.groupby(['clusterdemo','region']).size().reset_index(name='counts')
regional_analysis

   clusterdemo          region  counts
0            0      E WA Metro      45
1            0  E WA Non Metro      50
2            0        Thurston     203
3            0      W WA Metro     187
4            0  W WA Non Metro      58
5            1      E WA Metro      68
6            1  E WA Non Metro     124
7            1        Thurston     235
8            1      W WA Metro     364
9            1  W WA Non Metro     109


In [31]:
# Total members in each cluster, aligned row by row with regional_analysis
total_counts_per_cluster = regional_analysis.groupby('clusterdemo')['counts'].transform('sum')

# Percentage of each cluster's members living in each region
regional_analysis['percentage'] = regional_analysis['counts'] / total_counts_per_cluster * 100

In [29]:
regional_analysis

   clusterdemo          region  counts  percentage
0            0      E WA Metro      45    8.287293
1            0  E WA Non Metro      50    9.208103
2            0        Thurston     203   37.384899
3            0      W WA Metro     187   34.438306
4            0  W WA Non Metro      58   10.681400
5            1      E WA Metro      68    7.555556
6            1  E WA Non Metro     124   13.777778
7            1        Thurston     235   26.111111
8            1      W WA Metro     364   40.444444
9            1  W WA Non Metro     109   12.111111


I chose four clusters for this analysis. Cluster zero is predominantly female with higher engagement, concentrated in Thurston County and Western Washington Metro locations. Cluster one is also mostly female but has lower engagement and far fewer members in Thurston County; most people in this cluster live in Western Washington Metro locations. Cluster two is mostly male with high engagement, and most of its members live in Thurston County and Western Washington Metro. Finally, cluster three is predominantly male, with most members living in the Western Washington Metro area and lower engagement than cluster two.

I don't think my clusters align closely with the regions. Western Washington Metro and Thurston County probably have higher percentages across all clusters because those areas have larger populations and more customers. I expected Eastern Washington to have its own cluster, and the non-metro areas to have theirs, but that didn't happen with this model. The columns that influenced this model the most are gender and engagement.
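
One way to check the population explanation is to compare each cluster's regional mix against the regional mix of the whole sample. The sketch below uses a handful of made-up rows in `toy` as a stand-in for `survey_data[['clusterdemo', 'region']]`; the names `mix`, `overall`, and `diff` are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for survey_data[['clusterdemo', 'region']].
toy = pd.DataFrame({
    "clusterdemo": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "region": ["Thurston", "Thurston", "W WA Metro", "E WA Metro",
               "W WA Metro", "W WA Metro", "W WA Metro", "Thurston",
               "E WA Metro", "E WA Metro"],
})

# Row percentages: each cluster's regional mix sums to 100.
mix = pd.crosstab(toy["clusterdemo"], toy["region"], normalize="index") * 100

# Baseline: the regional mix of the whole sample.
overall = toy["region"].value_counts(normalize=True) * 100

# Positive values mean a region is over-represented in that cluster.
diff = mix - overall
print(diff.round(1))
```

If the differences are all close to zero, the clusters simply mirror where members live, which would be consistent with the clusters not aligning with regions.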


## Values Clustering

As in the previous section, perform a cluster analysis, this time on the values questions:

* fair
* harm
* in.group
* authority
* purity
* public.sector
* sustainability
* localism

After you’ve built your clusters, report the following information on each cluster:

* Predominant region
* Average age and account age
* Most common focal value
* Mean results on the questions of `pub.greater.priv`, `experience.more.important`, and `teachers.underpaid`. 

In [36]:
val_cols = "fair harm group authority purity sustainability localism".split()

In [37]:
num_survey_data_val = survey_data.loc[:,val_cols]

In [38]:
# Min-max scale each values column to the [0, 1] range
survey_normalized_val = (num_survey_data_val - num_survey_data_val.min()) / (num_survey_data_val.max() - num_survey_data_val.min())

In [39]:
survey_normalized_val

      fair  harm  group  authority  purity  sustainability  localism
0     0.20  0.20   0.20       0.20    0.25        0.266667      0.35
1     0.95  1.00   0.45       0.65    0.45        1.000000      0.95
2     0.85  0.65   0.45       0.65    0.55        0.833333      0.70
3     0.95  0.85   0.95       0.65    0.75        0.900000      0.70
4     0.90  0.40   0.70       0.90    0.65        0.366667      0.60
...    ...   ...    ...        ...     ...             ...       ...
2416  0.90  0.95   0.90       0.55    0.65        0.533333      0.55
2417  0.85  0.95   0.55       0.45    0.75        0.833333      0.95
2418  0.75  0.80   0.55       0.60    0.45        0.700000      0.90
2419  0.95  1.00   1.00       0.80    1.00        0.833333      1.00


In [40]:
v_km = KMeans(
    n_clusters=5, init='random',
    n_init=10, max_iter=300, 
    tol=1e-04, random_state=0
)

In [41]:
val_clustering = v_km.fit_predict(survey_normalized_val)

In [42]:
survey_data['clusterval'] = val_clustering

In [43]:
# Results
for clusterval in set(val_clustering) :
    print(f'Printing Results for Cluster {clusterval}.\n\n')
    
    print(survey_data.query(f"clusterval == {clusterval}"))
    
    print("\n\n")

Printing Results for Cluster 0.


         id  age  gender  engagement  edu      zip channel  progressivism  \
0       346   56       3           1    2  98503.0  Branch      -0.083333   
15      412   71       2           0    6  98532.0  Branch       0.041667   
19      429   46       2           2    5      NaN  Branch       1.541667   
20      430   29       1           1    6  98938.0    Loan       1.375000   
30      444   64       2           2    4      NaN  Branch      -0.416667   
...     ...  ...     ...         ...  ...      ...     ...            ...   
2376  19456   52       1           0    5  98038.0   Other       1.125000   
2380  19530   30       1           2    2  98501.0  Branch       0.000000   
2382  19542   46       1           1    6  98275.0  Branch      -0.083333   
2383  19560   35       2           0    6  98008.0  Branch       0.041667   
2393  19960   55       1           0    4  98503.0  Branch       0.291667   

      harm  fair  ...  publicsector  sust

In [46]:
# Predominant region

# Return the most frequent value of a series (None if the series has no mode)
def get_mode(series):
    modes = series.mode()
    return modes.iloc[0] if not modes.empty else None

# Group by 'clusterval' and apply the mode function to the 'region' column
predominant_regions = survey_data.groupby('clusterval')['region'].apply(get_mode)

# Display the result
print(predominant_regions)

clusterval
0    W WA Metro
1    W WA Metro
2    W WA Metro
3    W WA Metro
4      Thurston
Name: region, dtype: object


In [47]:
# Average age
average_age_per_cluster = survey_data.groupby('clusterval')['age'].mean()

print(average_age_per_cluster)

clusterval
0    44.150943
1    48.022936
2    51.572354
3    52.875994
4    55.093817
Name: age, dtype: float64


In [48]:
# Average accountage
average_accountage_per_cluster = survey_data.groupby('clusterval')['accountage'].mean()

print(average_accountage_per_cluster)

clusterval
0    12.709111
1    13.542447
2    13.538082
3    15.221382
4    13.528475
Name: accountage, dtype: float64


In [49]:
# Most common focal value
value_counts_per_cluster = survey_data.groupby('clusterval')['mainfocalvalue'].value_counts()

print(value_counts_per_cluster)

clusterval  mainfocalvalue               
0           Education                         91
            Health (i.e. cancer research)     75
            Hunger/Poverty                    54
            Human/Civil rights                39
            Environment                       37
            Animal rights                     29
            Specify                           27
            Domestic abuse                    23
            Law enforcement                   19
            Homelessness                      17
            Arts & culture                    13
1           Environment                      111
            Human/Civil rights                85
            Hunger/Poverty                    56
            Education                         52
            Health (i.e. cancer research)     38
            Homelessness                      25
            Animal rights                     21
            Specify                           20
            Arts & culture 

In [50]:
# Mean for pubgreaterpriv
mean_pubgreaterpriv_per_cluster = survey_data.groupby('clusterval')['pubgreaterpriv'].mean()

print(mean_pubgreaterpriv_per_cluster)

clusterval
0    3.096698
1    3.348624
2    2.846652
3    3.271860
4    3.319829
Name: pubgreaterpriv, dtype: float64


In [51]:
# Mean for experiencemoreimportant
mean_experiencemoreimportant_per_cluster = survey_data.groupby('clusterval')['experiencemoreimportant'].mean()

print(mean_experiencemoreimportant_per_cluster)

clusterval
0    3.955189
1    3.878440
2    4.110151
3    3.976153
4    4.093817
Name: experiencemoreimportant, dtype: float64


In [52]:
# Mean for teachersunderpaid
mean_teachersunderpaid_per_cluster = survey_data.groupby('clusterval')['teachersunderpaid'].mean()

print(mean_teachersunderpaid_per_cluster)

clusterval
0    4.521226
1    5.357798
2    4.352052
3    5.155803
4    4.950959
Name: teachersunderpaid, dtype: float64


For this model, the clusters look like this:

* Cluster 0 had people living mostly in Western Washington Metro areas. The average age was 44 years, the mean account age was 12.7 years, and the most common focal value was education. This cluster scored second lowest of the five clusters on all three education questions.
* Cluster 1 also had most people living in Western Washington Metro areas. The average age was 48 years, the mean account age was 13.5 years, and the most common focal value was the environment. This cluster scored highest on the "teachers underpaid" question.
* Cluster 2 was also predominantly Western Washington Metro area residents. The average age was 52 years, the mean account age was 13.5 years, and the most common focal value was health. This cluster scored highest on the "experience more important" question.
* Cluster 3 was dominated by the Western Washington Metro area. The average age was 53 years, the mean account age was 15.2 years, and the most common focal value was the environment. This cluster scored near the middle on the education questions.
* Cluster 4 was mostly from Thurston County. The average age was 55 years, the mean account age was 13.5 years, and the most common focal value was health. This cluster scored second highest on two of the three education questions.
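
The per-cluster figures could also be gathered into a single table with one `groupby(...).agg(...)` call instead of a separate cell per statistic. This is a sketch under assumptions: `toy` is a made-up stand-in for `survey_data` after clustering, only one of the three education questions is shown, and `most_common` and `summary` are hypothetical names.

```python
import pandas as pd

# Made-up rows standing in for survey_data after clustering.
toy = pd.DataFrame({
    "clusterval": [0, 0, 0, 1, 1, 1],
    "region": ["W WA Metro", "W WA Metro", "Thurston",
               "Thurston", "Thurston", "E WA Metro"],
    "age": [40, 50, 45, 60, 55, 65],
    "accountage": [10.0, 12.0, 14.0, 20.0, 15.0, 10.0],
    "mainfocalvalue": ["Education", "Education", "Environment",
                       "Environment", "Hunger/Poverty", "Environment"],
    "teachersunderpaid": [4, 5, 6, 5, 5, 5],
})

def most_common(series):
    """Return the most frequent value (first in sorted order on ties)."""
    modes = series.mode()
    return modes.iloc[0] if not modes.empty else None

# Named aggregation: one output column per (input column, function) pair.
summary = toy.groupby("clusterval").agg(
    predominant_region=("region", most_common),
    mean_age=("age", "mean"),
    mean_accountage=("accountage", "mean"),
    top_focal_value=("mainfocalvalue", most_common),
    mean_teachersunderpaid=("teachersunderpaid", "mean"),
)
print(summary)
```

Having every statistic in one row per cluster makes it easier to compare clusters side by side than scrolling between separate outputs.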

## Appendix: Full Data Description
A financial institution in Washington has become concerned that their current membership 
base is not well-aligned with their corporate values. Through that concern they 
realized that they don't actually understand their membership's values very well. They 
surveyed 2,421 members to shed light on the issue. 

The heart of the survey was the Moral Foundations Theory of Jonathan Haidt. Members 
were surveyed on the Moral Foundations Questionnaire, which you should take so you 
understand the test. Survey respondents were scored on the five foundations as well 
as a single-number summary, Progressivism. 

The financial institution values Localism, Sustainability, and Education. These aspects 
of members' values were assessed in the survey as well. Localism and Sustainability used
validated scales and thus can be summarized via a single score, where higher values indicate
greater support for the values. Education is summarized by the following three questions, 
for which we have no evidence that they can be combined into a single score:

* In general, public schools provide a better education than private schools.
* Public school teachers are underpaid.
* Experience is more important than education in determining success in life.

These questions were evaluated on a 1 to 6 scale where 1 indicated "Strongly Disagree" and 
6 indicated "Strongly Agree". 

Finally, we have information on the member that can be used to understand variation 
in their values. 

The data consists of the following columns:

* ID: a unique identifier for the survey respondent.
* age: the age of the respondent.
* gender: gender was evaluated with a robust scale and collapsed into male/female/other, where 
  "other" covers those whose gender identity was not male or female.
* engagement: three categories of engagement with the financial institution.
* mem.edu: the self-reported education level of the member with the following scale:
* zip: the member zip code. 
* channel: how the member joined the financial institution. Options are "Loan" if they joined 
  via an auto loan, "Branch" if they joined at a branch, and "Other" for online or unknown. 
* progressivism/harm/fair/in.group/authority/purity: The MFQ results.
* account.age: the age of the member's account, in years. 
* region: The region of Washington the member lives in. May be easier to work with than zip.
* public.sector: has the person ever been a public employee?
* sustainability/localism: Scores on the validated scales. Higher values indicate greater
  support for the value.
* pub.greater.priv/experience.more.important/teachers.underpaid: The responses to the 
  education questions above. 
* main.focal.value: Respondents were asked, "Below is a list of broad areas to which people 
  often dedicate their volunteer or philanthropic efforts. From this list, please select the 
  most important to you. If an area of particular importance is missing, please let us know 
  about it in the space for 'other.'" This column holds the respondents' answer to that question.
* support.of.focal.value: Respondents were given an opportunity to indicate how they 
  supported their focal value. Those responses were collapsed into a single score, where 
  a higher value indicates more support.
