This assignment asks you to cluster data from a credit union in Washington. This data was gathered by survey. I've included a pretty full description of the data as an appendix down below. 

In class we used code from _Data Science From Scratch_ to cluster, but it's more traditional to use the [`KMeans` class](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) from `scikit-learn`. I'll show you a brief example of that with the survey data from class, then I'll ask you to do your own clustering.

### Clustering Class Survey Data

In class we surveyed students from the MSBA and put them into groups. We'll repeat that work here. 


In [25]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans

In [26]:
example_data = pd.read_csv("survey_responses.txt",sep="\t")

In [27]:
example_data[:5]

Unnamed: 0,Timestamp,What's your family or last name?,Distance from where you were born to UMT,Distance from where you lived at 15 to UMT,How many years have you been in post-secondary school?,Were you an undergraduate marketing major?,Were you an undergraduate business major?,How many people live in your house/apartment (including you)?
0,11/12/2019 21:50:35,Schwartz,1726.0,250.0,9.0,No,No,1
1,11/13/2019 4:18:47,Thompson,2358.0,2787.0,6.0,No,Yes,2
2,11/13/2019 9:24:37,Bankston,1273.0,45.0,6.0,Yes,Yes,2
3,11/13/2019 10:57:34,Barr,1982.0,5.0,7.0,No,No,1
4,11/13/2019 14:50:06,Gabrielsen,276.0,276.0,6.0,No,Yes,3


As in class, let's convert the data to numeric and we'll do some renaming as well. 

In [28]:
current_names = example_data.columns
new_names = "timestamp name birth_dist age_15_dist school_years ug_mkt ug_biz hh_size".split()

In [29]:
example_data = example_data.rename(columns = dict(zip(current_names,new_names)))

In [30]:
# Encode the Yes/No answers as 1/0 so they can be used as numeric features
example_data.ug_mkt = example_data.ug_mkt.map(dict(Yes=1,No=0))
example_data.ug_biz = example_data.ug_biz.map(dict(Yes=1,No=0))


In [31]:
example_data[:5]

Unnamed: 0,timestamp,name,birth_dist,age_15_dist,school_years,ug_mkt,ug_biz,hh_size
0,11/12/2019 21:50:35,Schwartz,1726.0,250.0,9.0,0,0,1
1,11/13/2019 4:18:47,Thompson,2358.0,2787.0,6.0,0,1,2
2,11/13/2019 9:24:37,Bankston,1273.0,45.0,6.0,1,1,2
3,11/13/2019 10:57:34,Barr,1982.0,5.0,7.0,0,0,1
4,11/13/2019 14:50:06,Gabrielsen,276.0,276.0,6.0,0,1,3


In [32]:
numeric_cols = "birth_dist age_15_dist school_years ug_mkt ug_biz hh_size".split()


We'll cluster only on the numeric columns, so the list above names the columns to carry forward into normalization and clustering. The next cell selects them.

In [33]:
num_example_data = example_data.loc[:,numeric_cols]

In [34]:
#example_normalized = (num_example_data - num_example_data.min()) / (num_example_data.max() - num_example_data.min())

# Calculate the mean and standard deviation for each column
mean_values = num_example_data.mean()
std_values = num_example_data.std()

# Perform z-score normalization
example_normalized = (num_example_data - mean_values) / std_values

In [35]:
example_normalized[:10]

Unnamed: 0,birth_dist,age_15_dist,school_years,ug_mkt,ug_biz,hh_size
0,0.092412,-0.546875,0.557685,-0.422282,-1.069687,-1.150927
1,0.409156,0.886745,-0.27484,-0.422282,0.916875,-0.295953
2,-0.134621,-0.662717,-0.27484,2.322548,0.916875,-0.295953
3,0.220713,-0.685321,0.002668,-0.422282,-1.069687,-1.150927
4,-0.634294,-0.532183,-0.27484,-0.422282,0.916875,0.559022
5,3.438265,0.623981,-0.27484,-0.422282,-1.069687,-1.150927
6,-0.772117,-0.685321,0.835193,2.322548,0.916875,0.559022
7,-0.020854,-0.654241,-0.552348,2.322548,0.916875,-0.295953
8,-0.636298,-0.534443,-0.552348,-0.422282,0.916875,-0.295953
9,-0.599712,-0.493192,-0.27484,-0.422282,0.916875,-1.150927
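
One detail worth knowing about the z-score step: pandas' `.std()` uses the sample standard deviation (`ddof=1`), while scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below compares the two on a small stand-in frame (the `toy` name is hypothetical, and the values are the first five `birth_dist` and `hh_size` entries shown above). It is only an illustration of the difference, not a claim that either choice is the right one for this assignment.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for num_example_data, using two columns of the first five rows.
toy = pd.DataFrame({
    "birth_dist": [1726.0, 2358.0, 1273.0, 1982.0, 276.0],
    "hh_size": [1, 2, 2, 1, 3],
})

# pandas: .std() defaults to the sample standard deviation (ddof=1).
z_pandas = (toy - toy.mean()) / toy.std()

# scikit-learn: StandardScaler divides by the population standard deviation (ddof=0).
z_sklearn = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# Matching ddof makes the two approaches agree.
z_pandas_ddof0 = (toy - toy.mean()) / toy.std(ddof=0)
print(np.allclose(z_pandas_ddof0, z_sklearn))
```

The two versions differ only by a constant factor per column, so with a few dozen rows or more the effect on the clustering is usually small.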


Let's set up our K-Means model. Some of the below code is adapted from 

In [36]:
km = KMeans(
    n_clusters=4, 
    init='random',
    n_init=10, 
    max_iter=300, 
    tol=1e-04, 
    random_state=0
)


In [37]:
clustering = km.fit_predict(example_normalized)

In [38]:
example_data['cluster'] = clustering

In [39]:
# Order the rows by cluster label so members of each cluster print together.
# A stable sort keeps the original row order within each cluster.
cluster_df = example_data.sort_values('cluster', kind='mergesort').reset_index(drop=True)

print(cluster_df)

              timestamp         name  birth_dist  age_15_dist  school_years  \
0    11/13/2019 4:18:47     Thompson      2358.0       2787.0           6.0   
1   11/13/2019 14:50:06   Gabrielsen       276.0        276.0           6.0   
2   11/13/2019 16:03:36    Cerkovnik       272.0        272.0           5.0   
3   11/13/2019 16:03:48        McNea       345.0        345.0           6.0   
4   11/13/2019 16:04:18       Toepke       543.0        840.0           6.0   
5    11/3/2020 11:59:47    Hendricks        32.0         32.0           5.0   
6    11/3/2020 14:06:10     Anderson       700.0        700.0           5.0   
7    11/3/2020 16:48:11        Moore       871.0        892.0           8.0   
8    11/5/2020 17:23:28      Gilbert      1579.0       1579.0           2.0   
9    11/9/2020 10:26:20    Stahlberg        95.0         95.0           7.0   
10   11/9/2020 10:30:12      Hoffman       272.0        272.0           6.0   
11   11/1/2021 17:56:58    Stahlberg       120.0    

In [21]:
# Write cluster_df to csv
cluster_df.to_csv('data/survey_cluster_results_fourgroup.csv', index=False)


There's much more we could do here, such as printing out the cluster means. With min-max scaling (the commented-out line in the normalization cell above) it appears that we're basically splitting into UG degrees, so the code above instead normalizes by subtracting the mean of the column and dividing by the standard deviation.
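
As a minimal sketch of printing cluster means, the block below clusters a small made-up frame (the `toy` data is hypothetical, not the class survey) and then summarizes each cluster in the original units with `groupby`. The same pattern should carry over to `example_data` once it has a `cluster` column.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data standing in for example_data (values are made up).
toy = pd.DataFrame({
    "birth_dist": [10.0, 12.0, 11.0, 2000.0, 2100.0, 1900.0],
    "school_years": [5.0, 6.0, 5.0, 8.0, 9.0, 7.0],
})

# Cluster on z-scored features, as in the cells above.
toy_normalized = (toy - toy.mean()) / toy.std()
km = KMeans(n_clusters=2, init="random", n_init=10, random_state=0)
toy["cluster"] = km.fit_predict(toy_normalized)

# Summarize in the original units, which are easier to interpret than z-scores.
cluster_means = toy.groupby("cluster").mean()
cluster_sizes = toy["cluster"].value_counts().sort_index()

print(cluster_means)
print(cluster_sizes)
```

Reporting means on the un-normalized columns keeps the description in miles and years rather than standard deviations.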

## Demographic Clustering

In this section, perform a cluster analysis using the fields unrelated to the values survey:

* age
* gender
* engagement
* account.age

Choose a number of clusters and use K-means to cluster the data using these fields. Briefly describe the clusters. How evenly do your clusters align with the regions?
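
One common heuristic for choosing the number of clusters is the "elbow" approach: fit K-means for several values of `k` and look for the point where the within-cluster sum of squares (`inertia_` in scikit-learn) stops dropping quickly. The sketch below uses synthetic data with three made-up blobs (all names and values are hypothetical), so treat it as an illustration of the loop rather than a recommendation for a particular `k` on the credit union data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs (made up for illustration).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
toy = pd.DataFrame(points, columns=["x1", "x2"])
toy_normalized = (toy - toy.mean()) / toy.std()

# Fit K-means for a range of k and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, init="random", n_init=10, random_state=0)
    km.fit(toy_normalized)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On real survey data the elbow is often less sharp than on synthetic blobs, so interpretability of the resulting clusters is a reasonable tiebreaker.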

In [6]:
# Your work here
demographic_data = pd.read_csv("washington_survey_data.txt",sep="\t")
num_demo_data = demographic_data

In [7]:
numeric_cols = "age gender engagement account.age".split()


In [8]:
# Take an explicit copy so the recoding below doesn't trigger SettingWithCopyWarning
num_demographic_data = demographic_data.loc[:,numeric_cols].copy()



In [9]:
num_demographic_data.gender = num_demographic_data.gender.map(dict(other=2,female=1,male=0))
num_demographic_data.engagement = num_demographic_data.engagement.map({"Highly Engaged" : 2,"Engaged": 1, "Not Engaged": 0})


In [10]:
#demo_normalized = (num_demographic_data - num_demographic_data.min()) / (num_demographic_data.max() - num_demographic_data.min())

# Calculate the mean and standard deviation for each column
mean_values = num_demographic_data.mean()
std_values = num_demographic_data.std()

# Perform z-score normalization
demo_normalized = (num_demographic_data - mean_values) / std_values

In [11]:
km = KMeans(
    n_clusters=4, 
    init='random',
    n_init=10, 
    max_iter=300, 
    tol=1e-04, 
    random_state=0
)


In [12]:
clustering = km.fit_predict(demo_normalized)

In [13]:
demo_normalized['cluster'] = clustering

In [14]:
# demo_normalized already holds the cluster labels assigned in the previous cell
cluster_df = demo_normalized

print(cluster_df)

# Attach the labels to the original (un-normalized) survey data
demographic_data['cluster'] = cluster_df['cluster']

           age    gender  engagement  account.age  cluster
0     0.331628  2.349315   -0.059365    -0.327557        3
1     0.951989  0.552256    1.179627    -0.421563        0
2     1.262170 -1.244803   -1.298358    -0.715660        2
3     0.951989 -1.244803   -0.059365    -0.391664        2
4    -0.040589 -1.244803    1.179627    -0.190813        0
...        ...       ...         ...          ...      ...
2416  2.006603  0.552256   -0.059365    -0.968539        3
2417  2.626965  0.552256   -0.059365     2.515685        1
2418 -0.350769  0.552256   -1.298358     0.583703        3
2419 -1.095203 -1.244803   -0.059365    -0.868367        2
2420 -1.467419  0.552256   -1.298358    -1.150470        3

[2421 rows x 5 columns]


In [15]:
# Write the survey data, with its cluster labels, to csv
demographic_data.to_csv('data/demo_cluster_results_fourgroup.csv', index=False)
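
To look at how the clusters line up with the regions, a cross-tabulation of cluster label against `region` is one simple option. The sketch below uses a tiny made-up frame (the region names "A", "B", "C" and the labels are hypothetical). With the real data, the same two calls would presumably be applied to `demographic_data['cluster']` and `demographic_data['region']`.

```python
import pandas as pd

# Hypothetical cluster labels and regions (values are made up).
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B", "B", "C"],
    "cluster": [0, 0, 1, 1, 0, 2],
})

# Raw counts of members in each (cluster, region) cell.
counts = pd.crosstab(toy["cluster"], toy["region"])

# Row-normalized shares: what fraction of each cluster comes from each region.
shares = pd.crosstab(toy["cluster"], toy["region"], normalize="index")

print(counts)
print(shares.round(2))
```

If the rows of the share table all look roughly like the overall regional mix, the clusters are not aligned with region.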


## Values Clustering

Similarly to the previous section, perform a cluster analysis, this time on the values questions:

* fair
* harm
* in.group
* authority
* purity
* public.sector
* sustainability
* localism

After you’ve built your clusters, report the following information on each cluster:

* Predominant region
* Average age and account age
* Most common focal value
* Mean results on the questions of `pub.greater.priv`, `experience.more.important`, and `teachers.underpaid`. 

In [16]:
demographic_data = pd.read_csv("washington_survey_data.txt",sep="\t")
num_demo_data = demographic_data

In [17]:
numeric_cols = "fair harm in.group authority purity public.sector sustainability localism".split()
# Take an explicit copy so the recoding below doesn't trigger SettingWithCopyWarning
num_demographic_data = demographic_data.loc[:,numeric_cols].copy()



In [18]:
# Encode public.sector (yes/no) as 1/0 so it can be used as a numeric feature
num_demographic_data['public.sector'] = num_demographic_data['public.sector'].map({"yes": 1, "no": 0})


In [19]:
#demo_normalized = (num_demographic_data - num_demographic_data.min()) / (num_demographic_data.max() - num_demographic_data.min())

# Calculate the mean and standard deviation for each column
mean_values = num_demographic_data.mean()
std_values = num_demographic_data.std()

# Perform z-score normalization
demo_normalized = (num_demographic_data - mean_values) / std_values

In [20]:
km = KMeans(
    n_clusters=5, 
    init='random',
    n_init=10, 
    max_iter=300, 
    tol=1e-04, 
    random_state=0
)


In [21]:
clustering = km.fit_predict(demo_normalized)

In [22]:
demo_normalized['cluster'] = clustering

In [23]:
# demo_normalized already holds the cluster labels assigned in the previous cell
cluster_df = demo_normalized

print(cluster_df)

# Attach the labels to the original (un-normalized) survey data
demographic_data['cluster'] = cluster_df['cluster']

          fair      harm  in.group  authority    purity  public.sector  \
0    -3.866462 -3.784118 -2.165151  -1.902780 -1.380514       0.853083   
1     1.168521  1.400023 -0.889021   0.293643 -0.527144       0.853083   
2     0.497190 -0.868039 -0.889021   0.293643 -0.100459       0.853083   
3     1.168521  0.427997  1.663239   0.293643  0.752912      -1.171735   
4     0.832855 -2.488083  0.387109   1.513878  0.326227       0.853083   
...        ...       ...       ...        ...       ...            ...   
2416  0.832855  1.076015  1.408013  -0.194451  0.326227       0.853083   
2417  0.497190  1.076015 -0.378569  -0.682545  0.752912      -1.171735   
2418 -0.174141  0.103988 -0.378569   0.049596 -0.527144       0.853083   
2419  1.168521  1.400023  1.918465   1.025784  1.819625       0.853083   
2420  0.161524  0.103988  0.131883  -0.926592 -0.953829      -1.171735   

      sustainability  localism  cluster  
0          -1.950921 -2.300722        3  
1           1.635396  1.200

In [24]:
# Write the survey data, with its cluster labels, to csv
demographic_data.to_csv('data/values_cluster_results_fivegroup.csv', index=False)
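
Once every row has a cluster label, the per-cluster report requested at the top of this section can be sketched with a single `groupby` and named aggregations. The frame below is hypothetical: the column names follow the appendix, but the values (including the region names "West" and "East") are made up, and `most_common` is a helper introduced only for this illustration.

```python
import pandas as pd

# Hypothetical rows; column names follow the appendix, values are made up.
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "region": ["West", "West", "East", "East", "East", "West"],
    "age": [30, 40, 50, 60, 70, 80],
    "account.age": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "main.focal.value": ["Education", "Education", "Environment",
                         "Health", "Health", "Education"],
    "pub.greater.priv": [1, 2, 3, 4, 5, 6],
    "experience.more.important": [2, 2, 2, 5, 5, 5],
    "teachers.underpaid": [6, 5, 4, 3, 2, 1],
})

def most_common(series):
    """Return the most frequent value (ties resolve to the first in sorted order)."""
    return series.mode().iloc[0]

# One row per cluster, one column per requested summary.
report = toy.groupby("cluster").agg(
    predominant_region=("region", most_common),
    mean_age=("age", "mean"),
    mean_account_age=("account.age", "mean"),
    top_focal_value=("main.focal.value", most_common),
    mean_pub_greater_priv=("pub.greater.priv", "mean"),
    mean_experience_more_important=("experience.more.important", "mean"),
    mean_teachers_underpaid=("teachers.underpaid", "mean"),
)

# Transpose so each cluster is a column, which is easier to read with many rows.
print(report.T)
```

Because the summaries are computed on the original columns rather than the z-scores, the ages and 1 to 6 survey responses stay in their natural units.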


##  Appendix: Full Data Description
A financial institution in Washington has become concerned that their current membership 
base is not well-aligned with their corporate values. Through that concern they 
realized that they don't actually understand their membership's values very well. They 
surveyed 2,421 members to shed light on the issue. 

The heart of the survey was the Moral Foundations Theory of Jonathan Haidt. Members 
were surveyed on the Moral Foundations Questionnaire, which you should take so you 
understand the test. Survey respondents were scored on the five foundations as well 
as a single-number summary, Progressivism. 

The financial institution values Localism, Sustainability, and Education. These aspects 
of members' values were assessed in the survey as well. Localism and Sustainability used
validated scales and thus can be summarized via a single score, where higher values indicate
greater support for the values. Education is summarized by the following three questions, 
which we do not have evidence can be combined into a single score:

* In general, public schools provide a better education than private schools.
* Public school teachers are underpaid.
* Experience is more important than education in determining success in life.

These questions were evaluated on a 1 to 6 scale where 1 indicated "Strongly Disagree" and 
6 indicated "Strongly Agree". 

Finally, we have information on the member that can be used to understand variation 
in their values. 

The data consists of the following columns:

* ID: a unique identifier for the survey respondent.
* age: the age of the respondent.
* gender: gender was evaluated with a robust scale and collapsed into male/female/other for 
  those whose gender identity was not male or female.
* engagement: three categories of engagement with the financial institution.
* mem.edu: the self-reported education level of the member with the following scale:
* zip: the member zip code. 
* channel: how the member joined the financial institution. Options are "Loan" if they joined 
  via an auto loan, "Branch" if they joined at a branch and other for online or unknown. 
* progressivism/harm/fair/in.group/authority/purity: The MFQ results.
* account.age: the age of the member's account, in years. 
* region: The region of Washington the member lives in. May be easier to work with than zip.
* public.sector: has the person ever been a public employee?
* sustainability/localism: Scores on the validated scales. Higher values indicate greater
  support for the value.
* pub.greater.priv/experience.more.important/teachers.underpaid: The responses to the 
  education questions above. 
* main.focal.value: Respondents were asked, "Below is a list of broad areas to which people 
  often dedicate their volunteer or philanthropic efforts. From this list, please select the 
  most important to you. If an area of particular importance is missing, please let us know 
  about it in the space for 'other.'" This column holds the respondents' answer to that question.
* support.of.focal.value: Respondents were given an opportunity to indicate how they 
  supported their focal value. Those responses were collapsed into a single score, where 
  a higher value indicates more support.
