# Clustering Class Survey Data

The purpose of this project is to cluster survey data from a credit union in Washington. A fairly full description of the data is included as an appendix below.

We use the [`KMeans` class](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) from `scikit-learn`. Below is a brief example using the MSBA student survey data, followed by clustering of the credit union survey data.

### A word on clustering:

Clustering is a technique in data analysis that aims to group similar data points together based on their inherent characteristics or patterns. It is an unsupervised learning method that does not rely on predefined labels or categories.

The goal of clustering is to identify natural groupings or clusters within a dataset, where data points within the same cluster are more similar to each other compared to data points in other clusters. Clustering can help uncover underlying structures or relationships in the data, discover patterns, and gain insights into the data distribution.

The process of clustering involves the following steps:

1. **Data Preparation**: Preprocess and transform the raw data into a suitable format for clustering. This may involve cleaning the data, handling missing values, and selecting relevant features.

2. **Feature Selection/Extraction**: If needed, reduce the dimensionality of the data by selecting important features or extracting new features that capture the essence of the data.

3. **Similarity/Distance Calculation**: Determine the similarity or dissimilarity between data points using appropriate distance metrics. Common distance metrics include Euclidean distance, Manhattan distance, or cosine similarity.

4. **Clustering Algorithm Selection**: Choose an appropriate clustering algorithm based on the nature of the data and the problem at hand. Popular clustering algorithms include k-means, hierarchical clustering, DBSCAN, and Gaussian mixture models.

5. **Clustering**: Apply the selected algorithm to partition the data into clusters. Each data point is assigned to a cluster based on its similarity to other data points.

6. **Evaluation**: Assess the quality of the clustering results. This can be done using internal evaluation metrics such as silhouette score or external evaluation measures if ground truth labels are available.

7. **Interpretation and Analysis**: Analyze the resulting clusters to understand the patterns and characteristics of each cluster. This may involve visualizing the clusters, examining the cluster centroids or representatives, and conducting further analysis or exploration.

Clustering has various applications in different domains, such as customer segmentation, image segmentation, document clustering, anomaly detection, and recommender systems. It is a valuable tool for exploratory data analysis, pattern recognition, and gaining insights from unlabeled data.

### K-means clustering:

K-means clustering is a popular unsupervised machine learning algorithm used for clustering analysis. It aims to partition a dataset into distinct groups or clusters based on the similarity of data points.

The algorithm works as follows:

1. **Initialization**: Choose the number of clusters, denoted by "K," that you want to create. Randomly initialize K centroids within the feature space.

2. **Assignment**: Assign each data point to the nearest centroid based on a distance metric, typically the Euclidean distance. This step creates K clusters, where each data point belongs to the cluster with the closest centroid.

3. **Update**: Recalculate the centroids of the clusters based on the mean (average) of the data points within each cluster. The centroid becomes the new center of the cluster.

4. **Re-assignment**: Reassign each data point to the nearest centroid based on the updated centroids.

5. **Iteration**: Repeat steps 3 and 4 until convergence or until a maximum number of iterations is reached. Convergence occurs when the centroids no longer change significantly, or when the assignments remain unchanged.

6. **Termination**: The algorithm terminates, and the final clusters are determined.

K-means clustering aims to minimize the within-cluster sum of squares, also known as inertia or distortion. The objective is to create clusters that are internally homogeneous (data points within the same cluster are similar) and externally distinct (data points from different clusters are dissimilar).
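The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the toy data, and the choice to start from randomly chosen data points are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-4, seed=0):
    """Minimal K-means sketch: returns (labels, centroids, inertia)."""
    rng = np.random.default_rng(seed)
    # Initialization: use k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid becomes the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        # Convergence: stop once the centroids barely move.
        if shift < tol:
            break
    # Final assignment and within-cluster sum of squares (inertia).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float((dists[np.arange(len(X)), labels] ** 2).sum())
    return labels, centroids, inertia

# Toy data: two tight, well-separated groups of three points each.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
toy_labels, toy_centroids, toy_inertia = kmeans_sketch(toy, k=2)
print(toy_labels, toy_inertia)
```

On data this cleanly separated, the two groups end up in different clusters whatever the starting points are. Real data is less forgiving, which is why the notebook code later uses several initializations (`n_init=10`).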

It is important to note that K-means clustering requires the number of clusters, K, to be specified in advance. However, determining the optimal value of K can be challenging and may require domain knowledge, heuristics, or evaluation metrics such as the elbow method or silhouette score.
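As a rough sketch of how the elbow method and silhouette score can guide the choice of K, the example below fits `KMeans` for several values of K on synthetic data and records both quantities. The synthetic blobs, their centers, and the variable names are assumptions chosen for illustration; none of this comes from the survey data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [0, 5]],
    cluster_std=0.5,
    random_state=0,
)

inertias = {}
silhouettes = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(X_demo)
    inertias[k] = model.inertia_                        # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X_demo, labels)   # higher is better

# The K with the highest silhouette score is one reasonable candidate.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

Plotting `inertias` against K gives the usual elbow chart: the curve drops steeply up to a suitable K and flattens afterwards. The two heuristics do not always agree, so domain knowledge still matters.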

K-means clustering has various applications, including customer segmentation, image compression, document clustering, anomaly detection, and more. It is a computationally efficient algorithm but can be sensitive to the initial centroid positions and may converge to suboptimal solutions. Therefore, it is common to run the algorithm multiple times with different initializations to mitigate this issue.

### Euclidean distance:

In the context of K-means clustering, Euclidean distance is a commonly used distance metric to measure the similarity or dissimilarity between two data points in a feature space. It is named after the Greek mathematician Euclid and is based on the Pythagorean theorem.

The Euclidean distance between two points, represented as vectors in a multidimensional space, is calculated as the square root of the sum of the squared differences between the corresponding coordinates of the points. Mathematically, the Euclidean distance between two points, P and Q, in an n-dimensional space can be defined as:

```
d(P, Q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
```

where `(p1, p2, ..., pn)` and `(q1, q2, ..., qn)` are the coordinates of points P and Q, respectively.

In K-means clustering, the Euclidean distance measures the dissimilarity between a data point and the centroid of a cluster. During the assignment step, each data point is assigned to the cluster with the closest centroid. Because every point goes to its nearest centroid, each assignment step can only lower (or leave unchanged) the total sum of squared Euclidean distances between points and their centroids.

By using the Euclidean distance, K-means clustering aims to create clusters where the data points within each cluster are closer to their own centroid compared to centroids of other clusters. This promotes the formation of tight and internally homogeneous clusters in the feature space.
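A small worked example may make the distance calculation concrete. The points and centroids below are made up for illustration.

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Square root of the summed squared coordinate differences.
# The differences are (-3, -4, 0), so the distance is sqrt(9 + 16 + 0) = 5.
d_manual = np.sqrt(np.sum((p - q) ** 2))

# np.linalg.norm of the difference vector gives the same value.
d_norm = np.linalg.norm(p - q)

# Assignment step for a single point: pick the index of the nearest centroid.
centroids = np.array([[0.0, 0.0, 0.0],
                      [4.0, 5.0, 3.0]])
nearest = int(np.argmin(np.linalg.norm(centroids - p, axis=1)))

print(d_manual, d_norm, nearest)
```

Here `p` is about 3.74 from the first centroid and about 4.24 from the second, so it is assigned to the first one (index 0).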

### Now, the code:

In [1]:
# The first line imports the pandas library, which provides data structures and functions for data manipulation and 
# analysis.

import pandas as pd

# The second line imports the numpy library, which is used for numerical computing and provides array objects and 
# mathematical functions.

import numpy as np

# The third line imports the pyplot module from the matplotlib library, which is a plotting library in Python. 
# It provides a MATLAB-like interface for creating visualizations.

from matplotlib import pyplot as plt

# The fourth line imports the KMeans class from the sklearn.cluster module. This class implements the K-means 
# clustering algorithm, which is a popular clustering method.

from sklearn.cluster import KMeans

In [2]:
# Used to modify the display options in pandas, specifically the maximum number of rows to be displayed in a DataFrame. 
    # By setting it to None, it allows all rows of a DataFrame to be displayed without any truncation.

# pd.set_option('display.max_rows', None)

In [3]:
# Reading in the data and displaying it.

example_data = pd.read_csv("survey_responses.txt",sep="\t")

example_data

Unnamed: 0,Timestamp,What's your family or last name?,Distance from where you were born to UMT,Distance from where you lived at 15 to UMT,How many years have you been in post-secondary school?,Were you an undergraduate marketing major?,Were you an undergraduate business major?,How many people live in your house/apartment (including you)?
0,11/12/2019 21:50:35,Schwartz,1726,250,9,No,No,1
1,11/13/2019 4:18:47,Thompson,2358,2787,6,No,Yes,2
2,11/13/2019 9:24:37,Bankston,1273,45,6,Yes,Yes,2
3,11/13/2019 10:57:34,Barr,1982,5,7,No,No,1
4,11/13/2019 14:50:06,Gabrielsen,276,276,6,No,Yes,3
5,11/13/2019 16:01:33,Campestre,8402,2322,6,No,No,1
6,11/13/2019 16:02:42,Leonard,1,5,10,Yes,Yes,3
7,11/13/2019 16:03:33,Martin,1500,60,5,Yes,Yes,2
8,11/13/2019 16:03:36,Cerkovnik,272,272,5,No,Yes,2
9,11/13/2019 16:03:48,McNea,345,345,6,No,Yes,1


Convert the data to numeric and do some renaming as well. 

In [4]:
# This line assigns the column names of the DataFrame example_data to the variable current_names.

current_names = example_data.columns

# This line splits the string "timestamp name birth_dist age_15_dist school_years ug_mkt ug_biz hh_size" into a list of 
    # individual words and assigns it to the variable new_names. Each word in the string becomes an element of the 
    # resulting list. Because split() is called without a delimiter, the string is split on runs of whitespace.

new_names = "timestamp name birth_dist age_15_dist school_years ug_mkt ug_biz hh_size".split()

# This line renames the columns of the DataFrame example_data using the rename() method. It takes a dictionary as an 
    # argument, where the keys are the current column names (current_names) and the values are the new column names 
    # (new_names). The zip() function pairs the corresponding elements of current_names and new_names, and dict() 
    # converts these pairings into a dictionary. The columns in example_data are renamed according to that dictionary.

example_data = example_data.rename(columns = dict(zip(current_names, new_names)))

# This line maps the values in the column "ug_mkt" of example_data to new values using the map() method. It takes a 
    # dictionary as an argument, where the keys are the original values in the column ("Yes" and "No") and the values are 
    # the corresponding new values (1 and 0). The values in the "ug_mkt" column will be replaced accordingly.

example_data.ug_mkt = example_data.ug_mkt.map(dict(Yes=1,No=0))

# This line performs a similar operation as the previous line, but for the "ug_biz" column of example_data. The values in 
    # the "ug_biz" column will be replaced based on the provided dictionary mapping.

example_data.ug_biz = example_data.ug_biz.map(dict(Yes=1,No=0))

In [5]:
# Looking at the transformations

example_data

Unnamed: 0,timestamp,name,birth_dist,age_15_dist,school_years,ug_mkt,ug_biz,hh_size
0,11/12/2019 21:50:35,Schwartz,1726,250,9,0,0,1
1,11/13/2019 4:18:47,Thompson,2358,2787,6,0,1,2
2,11/13/2019 9:24:37,Bankston,1273,45,6,1,1,2
3,11/13/2019 10:57:34,Barr,1982,5,7,0,0,1
4,11/13/2019 14:50:06,Gabrielsen,276,276,6,0,1,3
5,11/13/2019 16:01:33,Campestre,8402,2322,6,0,0,1
6,11/13/2019 16:02:42,Leonard,1,5,10,1,1,3
7,11/13/2019 16:03:33,Martin,1500,60,5,1,1,2
8,11/13/2019 16:03:36,Cerkovnik,272,272,5,0,1,2
9,11/13/2019 16:03:48,McNea,345,345,6,0,1,1


In [6]:
# In this line, the string "birth_dist age_15_dist school_years ug_mkt ug_biz hh_size" is split into a list of individual 
    # column names. The .split() method is called with no arguments, so it splits on whitespace. The resulting list 
    # numeric_cols contains the individual column names as elements.

numeric_cols = "birth_dist age_15_dist school_years ug_mkt ug_biz hh_size".split()

In [7]:
# Here, the DataFrame example_data is indexed with the loc accessor, and all rows are selected (:). Only the columns 
    # specified in the numeric_cols list are included in the resulting DataFrame num_example_data.

num_example_data = example_data.loc[:,numeric_cols]

num_example_data

Unnamed: 0,birth_dist,age_15_dist,school_years,ug_mkt,ug_biz,hh_size
0,1726,250,9,0,0,1
1,2358,2787,6,0,1,2
2,1273,45,6,1,1,2
3,1982,5,7,0,0,1
4,276,276,6,0,1,3
5,8402,2322,6,0,0,1
6,1,5,10,1,1,3
7,1500,60,5,1,1,2
8,272,272,5,0,1,2
9,345,345,6,0,1,1


In [8]:
# In this line, the minimum of each column (from .min()) is subtracted element-wise from num_example_data. The result 
    # is then divided element-wise by the range of each column, that is, the column maximum minus the column minimum 
    # (from .max() and .min()). The resulting DataFrame example_normalized contains the min-max normalized values of 
    # the numeric columns, so each column is scaled to lie between 0 and 1.

example_normalized =(num_example_data -num_example_data.min())/(num_example_data.max()-num_example_data.min())

example_normalized

Unnamed: 0,birth_dist,age_15_dist,school_years,ug_mkt,ug_biz,hh_size
0,0.205333,0.037004,0.333333,0.0,0.0,0.0
1,0.280562,0.412522,0.190476,0.0,1.0,0.166667
2,0.151411,0.006661,0.190476,1.0,1.0,0.166667
3,0.235805,0.00074,0.238095,0.0,0.0,0.0
4,0.032734,0.040853,0.190476,0.0,1.0,0.333333
5,1.0,0.343694,0.190476,0.0,0.0,0.0
6,0.0,0.00074,0.380952,1.0,1.0,0.333333
7,0.178431,0.008881,0.142857,1.0,1.0,0.166667
8,0.032258,0.040261,0.142857,0.0,1.0,0.166667
9,0.040948,0.051066,0.190476,0.0,1.0,0.0
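For comparison, scikit-learn's `MinMaxScaler` computes the same min-max transformation as the manual formula above. The sketch below checks this on a few values taken from the first rows of the survey table. The frame `demo_cols` and the other variable names are introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# A few values from the first five survey rows (two columns only).
demo_cols = pd.DataFrame({
    "birth_dist": [1726, 2358, 1273, 1982, 276],
    "hh_size": [1, 2, 2, 1, 3],
})

# Manual min-max normalization, as in the cell above.
manual = (demo_cols - demo_cols.min()) / (demo_cols.max() - demo_cols.min())

# The same transformation using scikit-learn; fit_transform returns a NumPy array.
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(demo_cols),
    columns=demo_cols.columns,
    index=demo_cols.index,
)

print(np.allclose(manual.values, scaled.values))
```

One practical difference: a fitted scaler remembers the column minima and maxima, so the same scaling can later be applied to new rows with `transform`.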


Let's set up our K-Means model.

In [9]:
# In this line, a KMeans object is created with the specified parameters. The n_clusters parameter is set to 3, indicating 
    # that the algorithm should aim to identify 3 clusters. The init parameter is set to 'random', meaning that the 
    # initial cluster centroids will be randomly assigned. The n_init parameter is set to 10, indicating that the algorithm 
    # will be run 10 times with different centroid seeds and the final results will be based on the best run. The max_iter 
    # parameter is set to 300, indicating the maximum number of iterations for each run. The tol parameter is set to 1e-04, 
    # which is the tolerance for convergence. The random_state parameter is set to 0, ensuring reproducibility of results.

km = KMeans(
    n_clusters=3, init='random',
    n_init=10, max_iter=300, 
    tol=1e-04, random_state=0
)

In [10]:
# In this line, the fit_predict method is called on the km object, passing the normalized example data (example_normalized) 
    # as the input. This method fits the K-means model to the data and assigns each sample to a cluster. The resulting 
    # cluster labels are stored in the clustering variable.

clustering = km.fit_predict(example_normalized)

In [11]:
# Here, a new column named 'cluster' is added to the example_data DataFrame, and the cluster labels stored in the 
    # clustering variable are assigned to this column.

example_data['cluster'] = clustering

In [12]:
# This loop iterates over each unique cluster label in the clustering variable. For each cluster, it prints a header 
    # indicating the cluster number. It then uses the query method to filter the example_data DataFrame and retrieve only the 
    # rows where the 'cluster' column matches the current cluster label. Finally, it prints the filtered DataFrame to show the 
    # results for each cluster.

for cluster in set(clustering) :
    print(f'Printing Results for Cluster {cluster}.\n\n')    
    print(example_data.query(f"cluster == {cluster}"))    
    print("\n\n") 

Printing Results for Cluster 0.


              timestamp       name  birth_dist  age_15_dist  school_years  \
2    11/13/2019 9:24:37   Bankston        1273           45             6   
6   11/13/2019 16:02:42    Leonard           1            5            10   
7   11/13/2019 16:03:33     Martin        1500           60             5   
10  11/13/2019 16:04:03     Runkel           5           56             5   
15   11/2/2020 21:12:19   Danicich         478            5             5   
23   11/6/2020 19:40:50   Scheibel         202            0             5   
37   11/7/2022 16:06:36  Halderman        1782         1906             5   

    ug_mkt  ug_biz  hh_size  cluster  
2        1       1        2        0  
6        1       1        3        0  
7        1       1        2        0  
10       1       1        2        0  
15       1       1        2        0  
23       1       1        2        0  
37       1       1        2        0  



Printing Results for Cluster 1.




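There's much more we could do here, such as printing out the cluster means. With the current min-max scaling, the clusters appear to split mainly on undergraduate degree. A likely reason is that the binary `ug_mkt` and `ug_biz` columns always span the full 0 to 1 range, while the distance columns are compressed toward 0 by a few very large values. It might be more interesting to normalize by subtracting the column mean and dividing by the column standard deviation.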
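A minimal sketch of both ideas, z-score normalization and printing cluster means, is shown below. It uses only the first ten survey rows displayed earlier, so the resulting clusters are illustrative and will differ from a run on the full data. The names `sample`, `sample_z`, `km_z`, and `cluster_means` are introduced for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans

# The first ten rows of the numeric survey columns, as displayed above.
sample = pd.DataFrame({
    "birth_dist":   [1726, 2358, 1273, 1982, 276, 8402, 1, 1500, 272, 345],
    "age_15_dist":  [250, 2787, 45, 5, 276, 2322, 5, 60, 272, 345],
    "school_years": [9, 6, 6, 7, 6, 6, 10, 5, 5, 6],
    "ug_mkt":       [0, 0, 1, 0, 0, 0, 1, 1, 0, 0],
    "ug_biz":       [0, 1, 1, 0, 1, 0, 1, 1, 1, 1],
    "hh_size":      [1, 2, 2, 1, 3, 1, 3, 2, 2, 1],
})

# z-score normalization: subtract the column mean, divide by the column
# standard deviation (pandas uses the sample standard deviation, ddof=1).
sample_z = (sample - sample.mean()) / sample.std()

# Cluster on the standardized values.
km_z = KMeans(n_clusters=3, n_init=10, random_state=0)
sample["cluster"] = km_z.fit_predict(sample_z)

# Cluster means in the original units, one row per cluster.
cluster_means = sample.groupby("cluster").mean()
print(cluster_means)
```

In the notebook itself, the same two lines would apply to `num_example_data` instead of the ten-row `sample` frame.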

## Demographic Clustering

In this section, we perform a cluster analysis using the fields unrelated to the values survey:

* age
* gender
* engagement
* account.age

Choose a number of clusters and use K-means to cluster the data using these fields. Briefly describe the clusters. How evenly do your clusters align with the regions?

In [13]:
# In this line, the pd.read_csv() function is used to read the contents of a file named "washington_survey_data.txt" into 
    # a DataFrame called wsdata. The sep parameter is set to "\t", indicating that the data in the file is separated by 
    # tabs. This is commonly used when dealing with tab-delimited files.

wsdata = pd.read_csv("washington_survey_data.txt",sep="\t") 
wsdata

Unnamed: 0,id,age,gender,engagement,mem.edu,zip,channel,progressivism,harm,fair,...,account.age,region,public.sector,sustainability,localism,pub.greater.priv,experience.more.important,teachers.underpaid,main.focal.value,support.of.focal.value
0,346,56,other,Engaged,2,98503.0,Branch,-0.083333,1.00,1.00,...,9.951,Thurston,yes,2.333333,2.75,3,3,2,Homelessness,0.000000
1,348,66,female,Highly Engaged,3,98012.0,Branch,2.291667,5.00,4.75,...,8.838,W WA Metro,yes,6.000000,5.75,1,4,5,Environment,18.627830
2,349,71,male,Not Engaged,7,98506.0,Branch,1.000000,3.25,4.25,...,5.356,Thurston,yes,5.166667,4.50,4,3,6,Hunger/Poverty,28.739986
3,352,66,male,Engaged,7,,Branch,0.583333,4.25,4.75,...,9.192,W WA Metro,no,5.500000,4.50,3,3,4,Education,18.055372
4,358,50,male,Highly Engaged,4,98233.0,Branch,-0.500000,2.00,4.50,...,11.570,W WA Non Metro,yes,2.833333,4.00,2,4,5,Education,4.400921
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2416,22070,83,female,Engaged,6,99163.0,Branch,1.125000,4.75,4.50,...,2.362,E WA Non Metro,yes,3.666667,3.75,3,4,5,Domestic abuse,21.061463
2417,22163,93,female,Engaged,5,98103.0,Branch,1.583333,4.75,4.25,...,43.614,W WA Metro,no,5.166667,5.75,4,3,6,Education,22.968902
2418,22830,45,female,Not Engaged,4,,Branch,1.208333,4.00,3.75,...,20.740,E WA Metro,yes,4.500000,5.50,4,5,6,Specify,37.564720
2419,23167,33,male,Engaged,6,,Branch,0.208333,5.00,4.75,...,3.548,Thurston,yes,5.166667,6.00,6,4,6,Education,73.189127


In [14]:
# Create list of original column names and look at them.

orig_names = wsdata.columns 
orig_names

Index(['id', 'age', 'gender', 'engagement', 'mem.edu', 'zip', 'channel',
       'progressivism', 'harm', 'fair', 'in.group', 'authority', 'purity',
       'account.age', 'region', 'public.sector', 'sustainability', 'localism',
       'pub.greater.priv', 'experience.more.important', 'teachers.underpaid',
       'main.focal.value', 'support.of.focal.value'],
      dtype='object')

In [15]:
# Create a list of names with each '.' replaced by '_' to make them easier to work with.

_names = [name.replace('.', '_') for name in orig_names]
_names

['id',
 'age',
 'gender',
 'engagement',
 'mem_edu',
 'zip',
 'channel',
 'progressivism',
 'harm',
 'fair',
 'in_group',
 'authority',
 'purity',
 'account_age',
 'region',
 'public_sector',
 'sustainability',
 'localism',
 'pub_greater_priv',
 'experience_more_important',
 'teachers_underpaid',
 'main_focal_value',
 'support_of_focal_value']

In [16]:
# In this line of code, the rename() method of the DataFrame wsdata is used to rename the columns. It takes a dictionary 
    # that maps the original column names (orig_names) to the new column names (_names). The zip() function is used to 
    # pair the original and new names together, creating a dictionary that specifies the column name mappings.

wsdata = wsdata.rename(columns = dict(zip(orig_names,_names))) 
wsdata

Unnamed: 0,id,age,gender,engagement,mem_edu,zip,channel,progressivism,harm,fair,...,account_age,region,public_sector,sustainability,localism,pub_greater_priv,experience_more_important,teachers_underpaid,main_focal_value,support_of_focal_value
0,346,56,other,Engaged,2,98503.0,Branch,-0.083333,1.00,1.00,...,9.951,Thurston,yes,2.333333,2.75,3,3,2,Homelessness,0.000000
1,348,66,female,Highly Engaged,3,98012.0,Branch,2.291667,5.00,4.75,...,8.838,W WA Metro,yes,6.000000,5.75,1,4,5,Environment,18.627830
2,349,71,male,Not Engaged,7,98506.0,Branch,1.000000,3.25,4.25,...,5.356,Thurston,yes,5.166667,4.50,4,3,6,Hunger/Poverty,28.739986
3,352,66,male,Engaged,7,,Branch,0.583333,4.25,4.75,...,9.192,W WA Metro,no,5.500000,4.50,3,3,4,Education,18.055372
4,358,50,male,Highly Engaged,4,98233.0,Branch,-0.500000,2.00,4.50,...,11.570,W WA Non Metro,yes,2.833333,4.00,2,4,5,Education,4.400921
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2416,22070,83,female,Engaged,6,99163.0,Branch,1.125000,4.75,4.50,...,2.362,E WA Non Metro,yes,3.666667,3.75,3,4,5,Domestic abuse,21.061463
2417,22163,93,female,Engaged,5,98103.0,Branch,1.583333,4.75,4.25,...,43.614,W WA Metro,no,5.166667,5.75,4,3,6,Education,22.968902
2418,22830,45,female,Not Engaged,4,,Branch,1.208333,4.00,3.75,...,20.740,E WA Metro,yes,4.500000,5.50,4,5,6,Specify,37.564720
2419,23167,33,male,Engaged,6,,Branch,0.208333,5.00,4.75,...,3.548,Thurston,yes,5.166667,6.00,6,4,6,Education,73.189127


In [17]:
# In this line of code, the map() function is applied to the 'engagement' column of the DataFrame wsdata. The map() 
    # function takes a lambda function as an argument, which performs string manipulations on each element of the 
    # 'engagement' column.

# The lambda function lambda x: x.replace('.', '').replace(' ', '') is defined to remove periods ('.') and spaces (' ') 
    # from each string in the 'engagement' column. The replace() method is used twice to substitute these characters with 
    # empty strings, effectively removing them from the strings.

wsdata.engagement = wsdata.engagement.map(lambda x: x.replace('.', '').replace(' ', '')) 
wsdata.engagement

0             Engaged
1       HighlyEngaged
2          NotEngaged
3             Engaged
4       HighlyEngaged
            ...      
2416          Engaged
2417          Engaged
2418       NotEngaged
2419          Engaged
2420       NotEngaged
Name: engagement, Length: 2421, dtype: object

In [18]:
# # df.MySeriesName.map(dict(yes=1, no=0))

# wsdata.gender = wsdata.gender.map(dict(other=2,male=1,female=0))
# wsdata.engagement = wsdata.engagement.map(dict(HighlyEngaged=2,Engaged=1,NotEngaged=0))

In [19]:
# Subset wsdata columns to to_cluster_a for clustering.

# In this line of code, a new DataFrame to_cluster_a is created by selecting specific columns from the wsdata DataFrame. 
    # The columns included in the new DataFrame are 'age', 'gender', 'engagement', and 'account_age'.

# The double square brackets [['age', 'gender', 'engagement', 'account_age']] are used to indicate column selection. 
    # By enclosing the column names within the outer square brackets, it creates a new DataFrame containing only those 
    # specified columns from the original wsdata DataFrame. The order of the columns in to_cluster_a will match the order 
    # specified in the column selection.

to_cluster_a = wsdata[['age','gender','engagement','account_age']] 
to_cluster_a

Unnamed: 0,age,gender,engagement,account_age
0,56,other,Engaged,9.951
1,66,female,HighlyEngaged,8.838
2,71,male,NotEngaged,5.356
3,66,male,Engaged,9.192
4,50,male,HighlyEngaged,11.570
...,...,...,...,...
2416,83,female,Engaged,2.362
2417,93,female,Engaged,43.614
2418,45,female,NotEngaged,20.740
2419,33,male,Engaged,3.548


In [20]:
# The 'gender' column of the DataFrame to_cluster_a is mapped using a dictionary. The values 'other', 'male', and 'female' 
    # are mapped to 2, 1, and 0, respectively.
    
to_cluster_a.gender = to_cluster_a.gender.map(dict(other=2,male=1,female=0)) 

# The 'engagement' column of to_cluster_a is also mapped using a dictionary. The values 'HighlyEngaged', 'Engaged', and 
    # 'NotEngaged' are mapped to 2, 1, and 0, respectively.

to_cluster_a.engagement = to_cluster_a.engagement.map(dict(HighlyEngaged=2,Engaged=1,NotEngaged=0))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  to_cluster_a.gender = to_cluster_a.gender.map(dict(other=2,male=1,female=0))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  to_cluster_a.engagement = to_cluster_a.engagement.map(dict(HighlyEngaged=2,Engaged=1,NotEngaged=0))
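The warning above appears because `to_cluster_a` was created by selecting columns from `wsdata`, and pandas cannot tell whether the later assignment is meant to affect the original frame. One common way to avoid it is to take an explicit copy when subsetting. The sketch below uses a three-row stand-in for `wsdata`; the names `wsdata_demo` and `subset` are introduced only for this example.

```python
import pandas as pd

# Three-row stand-in for wsdata, using values displayed above.
wsdata_demo = pd.DataFrame({
    "age": [56, 66, 71],
    "gender": ["other", "female", "male"],
    "engagement": ["Engaged", "HighlyEngaged", "NotEngaged"],
    "account_age": [9.951, 8.838, 5.356],
})

# .copy() makes the subset an independent DataFrame, so assigning to its
# columns is unambiguous and does not trigger SettingWithCopyWarning.
subset = wsdata_demo[["age", "gender", "engagement", "account_age"]].copy()
subset["gender"] = subset["gender"].map({"other": 2, "male": 1, "female": 0})
subset["engagement"] = subset["engagement"].map(
    {"HighlyEngaged": 2, "Engaged": 1, "NotEngaged": 0}
)

print(subset)
```

The original frame is left untouched, which matches what the notebook relies on later when it prints the text labels from `wsdata`.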


In [21]:
# A new DataFrame toclusta_normalized is created by performing min-max normalization on the to_cluster_a DataFrame. The 
    # min() method returns the minimum value of each column, and the max() method returns the maximum value of each 
    # column. Subtracting the column minimum from each element and dividing by the column range (maximum minus 
    # minimum) scales each column to lie between 0 and 1. This brings all the features to a similar scale.

toclusta_normalized =(to_cluster_a -to_cluster_a.min())/(to_cluster_a.max()-to_cluster_a.min()) 
toclusta_normalized

Unnamed: 0,age,gender,engagement,account_age
0,0.4750,1.0,0.5,0.203923
1,0.6000,0.0,1.0,0.181108
2,0.6625,0.5,0.0,0.109731
3,0.6000,0.5,0.5,0.188365
4,0.4000,0.5,1.0,0.237111
...,...,...,...,...
2416,0.8125,0.0,0.5,0.048357
2417,0.9375,0.0,0.5,0.893979
2418,0.3375,0.0,0.0,0.425087
2419,0.1875,0.5,0.5,0.072669
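One caveat about the integer coding of `gender`: after scaling, "female" and "other" sit a full unit apart while "male" is half a unit from each, which implies an ordering the categories do not have. A possible alternative, shown here only as a sketch and not what the notebook does, is one-hot encoding with `pd.get_dummies`. The frame `demo` is a made-up three-row example.

```python
import pandas as pd

# Three illustrative rows; gender is a nominal (unordered) category.
demo = pd.DataFrame({
    "age": [56, 66, 71],
    "gender": ["other", "female", "male"],
})

# One indicator column per category, stored as floats for clustering.
encoded = pd.get_dummies(demo, columns=["gender"], dtype=float)

print(encoded)
```

With indicator columns, every pair of distinct categories is the same Euclidean distance apart. The trade-off is that a three-level category then contributes three columns to the distance, so its weight relative to the other features changes.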


In [22]:
# KMeans object km2 is created with parameters specifying the number of clusters, initialization method, number of 
    # initializations, maximum iterations, tolerance, and random state.

km2 = KMeans(
    n_clusters=3, init='random', 
    n_init=10, max_iter=300, 
    tol=1e-04, random_state=0
)

In [23]:
# The fit_predict() method is called on km2 using the normalized data toclusta_normalized. This performs K-means 
    # clustering and assigns each sample to a cluster, returning the cluster labels.

clustered_a = km2.fit_predict(toclusta_normalized) 

In [24]:
# A new column 'cluster_a' is added to the wsdata DataFrame, and the cluster labels are assigned to 
    # this column.

wsdata['cluster_a'] = clustered_a 

In [25]:
# A loop is executed over each unique cluster label in clustered_a. For each cluster, the code prints the results for that 
    # specific cluster by filtering the wsdata DataFrame using the query cluster_a == {cluster_a} and displaying the 
    # corresponding data.

for cluster_a in set(clustered_a) :
    print(f'Printing Results for Cluster_a {cluster_a}.\n\n')     
    print(wsdata.query(f"cluster_a == {cluster_a}"))    
    print("\n\n") 

Printing Results for Cluster_a 0.


         id  age gender  engagement  mem_edu      zip channel  progressivism  \
0       346   56  other     Engaged        2  98503.0  Branch      -0.083333   
2       349   71   male  NotEngaged        7  98506.0  Branch       1.000000   
3       352   66   male     Engaged        7      NaN  Branch       0.583333   
6       367   49   male     Engaged        4      NaN  Branch       1.791667   
7       368   43   male  NotEngaged        6  99163.0  Branch       2.500000   
...     ...  ...    ...         ...      ...      ...     ...            ...   
2391  19904   55   male  NotEngaged        5  99163.0  Branch       0.666667   
2401  20181   71   male     Engaged        8  98103.0  Branch       0.791667   
2413  21907   48   male     Engaged        3  99006.0  Branch       0.166667   
2414  21962   49   male  NotEngaged        3  98596.0  Branch      -0.083333   
2419  23167   33   male     Engaged        6      NaN  Branch       0.208333   

   

In [26]:
# In the code below, three new DataFrames clust_a0, clust_a1, and clust_a2 are created by filtering the wsdata DataFrame 
    # based on the values in the 'cluster_a' column.

# clust_a0 contains the rows from wsdata where the 'cluster_a' column has a value of 0.
# clust_a1 contains the rows from wsdata where the 'cluster_a' column has a value of 1.
# clust_a2 contains the rows from wsdata where the 'cluster_a' column has a value of 2.

# These DataFrames represent subsets of the original data grouped by their assigned clusters.

clust_a0 = wsdata.loc[(wsdata['cluster_a'] == 0)]
clust_a1 = wsdata.loc[(wsdata['cluster_a'] == 1)]
clust_a2 = wsdata.loc[(wsdata['cluster_a'] == 2)]

The next three cells print some features of each cluster. From these summaries alone, the clusters appear to align somewhat with region, but it is difficult to tell without a map: every cluster consists mostly of people from W WA Metro or Thurston. The mean and standard deviation of age are very similar across clusters (mean ages of roughly 49 to 53).
* Cluster 0: No females, mid to low engagement, similar account age to cluster 2
* Cluster 1: Older account ages than clusters 0 and 2, mixed gender, all highly engaged
* Cluster 2: All female, mid to low engagement, similar account ages to cluster 0

In [27]:
# Would be nice to get the next three cells to print side by side. Must work on this at some point.
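One possible way to see the clusters side by side is to build a single table with one column per cluster, using `pd.crosstab` for categorical fields and `groupby` for numeric ones. The sketch below uses eight rows from the displayed data with cluster labels assigned purely for illustration; `demo`, `region_by_cluster`, and `numeric_summary` are names introduced for this example.

```python
import pandas as pd

# Eight rows taken from the displayed data; the cluster_a labels here are
# assigned by hand for illustration, not produced by the model.
demo = pd.DataFrame({
    "cluster_a":   [0, 1, 0, 0, 1, 2, 2, 2],
    "region":      ["Thurston", "W WA Metro", "Thurston", "W WA Metro",
                    "W WA Non Metro", "E WA Non Metro", "W WA Metro", "E WA Metro"],
    "age":         [56, 66, 71, 66, 50, 83, 93, 45],
    "account_age": [9.951, 8.838, 5.356, 9.192, 11.570, 2.362, 43.614, 20.740],
})

# Counts of each region, with one column per cluster.
region_by_cluster = pd.crosstab(demo["region"], demo["cluster_a"])

# Mean and standard deviation of the numeric fields, one row per cluster.
numeric_summary = demo.groupby("cluster_a")[["age", "account_age"]].agg(["mean", "std"])

print(region_by_cluster)
print(numeric_summary)
```

Applied to `wsdata`, the same two calls would replace the three separate print cells, and `pd.crosstab` could be repeated for `gender` and `engagement`.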

In [46]:
print('Cluster A 0')
print()
print(clust_a0['region'].value_counts())
print()
print(clust_a0['gender'].value_counts())
print()
print(clust_a0['engagement'].value_counts())
print()
print(clust_a0['age'].describe())
print()
print(clust_a0['account_age'].describe())

Cluster A 0

W WA Metro        269
Thurston          180
E WA Non Metro    104
W WA Non Metro     79
E WA Metro         37
Name: region, dtype: int64

male     591
other     78
Name: gender, dtype: int64

Engaged       355
NotEngaged    314
Name: engagement, dtype: int64

count    669.000000
mean      50.530643
std       15.978214
min       18.000000
25%       37.000000
50%       52.000000
75%       63.000000
max       92.000000
Name: age, dtype: float64

count    669.000000
mean      11.916713
std       11.323581
min        0.003000
25%        2.230000
50%        7.959000
75%       19.148000
max       47.948000
Name: account_age, dtype: float64


In [48]:
print('Cluster A 1')
print()
print(clust_a1['region'].value_counts())
print()
print(clust_a1['gender'].value_counts())
print()
print(clust_a1['engagement'].value_counts())
print()
print(clust_a1['age'].describe())
print()
print(clust_a1['account_age'].describe())

Cluster A 1

Thurston          309
W WA Metro        289
W WA Non Metro     88
E WA Non Metro     87
E WA Metro         76
Name: region, dtype: int64

female    540
male      270
other      39
Name: gender, dtype: int64

HighlyEngaged    849
Name: engagement, dtype: int64

count    849.000000
mean      52.983510
std       14.903929
min       19.000000
25%       41.000000
50%       55.000000
75%       65.000000
max       85.000000
Name: age, dtype: float64

count    849.000000
mean      18.272398
std       11.847089
min        0.022000
25%        8.430000
50%       16.397000
75%       27.044000
max       48.786000
Name: account_age, dtype: float64


In [49]:
print('Cluster A 2')
print()
print(clust_a2['region'].value_counts())
print()
print(clust_a2['gender'].value_counts())
print()
print(clust_a2['engagement'].value_counts())
print()
print(clust_a2['age'].describe())
print()
print(clust_a2['account_age'].describe())

Cluster A 2

W WA Metro        366
Thurston          236
E WA Non Metro    124
W WA Non Metro    109
E WA Metro         68
Name: region, dtype: int64

female    903
Name: gender, dtype: int64

Engaged       484
NotEngaged    419
Name: engagement, dtype: int64

count    903.000000
mean      48.555925
std       17.020951
min       18.000000
25%       35.000000
50%       49.000000
75%       62.000000
max       98.000000
Name: age, dtype: float64

count    903.000000
mean      11.068504
std       10.965885
min        0.005000
25%        2.264500
50%        6.896000
75%       17.515000
max       47.197000
Name: account_age, dtype: float64
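The note at the top of this group of cells asks for a side-by-side view of the three clusters. Here is a minimal sketch of one way to get it, using a small made-up frame in place of `wsdata` and assuming the cluster labels sit in a column (hypothetically named `cluster_a` here). `pd.crosstab` places the counts for each cluster in adjacent columns, and `groupby(...).describe()` gives one row of statistics per cluster.

```python
import pandas as pd

# Toy stand-in for wsdata; the column name `cluster_a` is an assumption.
toy = pd.DataFrame({
    "cluster_a": [0, 0, 1, 1, 2, 2],
    "gender": ["male", "other", "female", "male", "female", "female"],
    "age": [50, 30, 55, 41, 35, 62],
})

# One column per cluster, one row per category: counts side by side.
gender_by_cluster = pd.crosstab(toy["gender"], toy["cluster_a"])
print(gender_by_cluster)

# One row per cluster with the usual describe() statistics.
age_by_cluster = toy.groupby("cluster_a")["age"].describe()
print(age_by_cluster)
```

The same two calls could be repeated for `region`, `engagement`, and `account_age` to replace the three separate printing cells.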


## Values Clustering

As in the previous section, we perform a cluster analysis, this time on the values questions:

* fair
* harm
* in.group
* authority
* purity
* public.sector
* sustainability
* localism

After the clusters are built, we report the following information on each cluster:

* Predominant region
* Average age and account age
* Most common focal value
* Mean results on the questions of `pub.greater.priv`, `experience.more.important`, and `teachers.underpaid`. 
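A per-cluster report like the one listed above can be assembled in one call with `groupby` and named aggregation. The sketch below is illustrative only: it uses a toy frame in place of `wsdata`, and the helper `most_common` is a hypothetical name introduced here, not something from the notebook.

```python
import pandas as pd

# Toy stand-in for wsdata with the cluster labels already attached.
toy = pd.DataFrame({
    "cluster_b": [0, 0, 0, 1, 1, 1],
    "region": ["Thurston", "W WA Metro", "W WA Metro",
               "Thurston", "Thurston", "E WA Metro"],
    "age": [60, 50, 58, 40, 44, 48],
    "account_age": [20.0, 10.0, 15.0, 5.0, 7.0, 9.0],
    "main_focal_value": ["Education", "Education", "Environment",
                         "Environment", "Environment", "Education"],
    "teachers_underpaid": [4, 5, 6, 5, 5, 2],
})

def most_common(series):
    """Return the most frequent value (first in sorted order on a tie)."""
    return series.mode().iloc[0]

summary = toy.groupby("cluster_b").agg(
    predominant_region=("region", most_common),
    mean_age=("age", "mean"),
    mean_account_age=("account_age", "mean"),
    top_focal_value=("main_focal_value", most_common),
    mean_teachers_underpaid=("teachers_underpaid", "mean"),
)
print(summary.round(3))
```

Because `mode()` silently picks one value on a tie, it is worth checking `value_counts()` as well when two categories are close.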

In [32]:
# Convert public_sector from yes/no to 1/0.

wsdata['public_sector'] = wsdata['public_sector'].map({'yes': 1, 'no': 0})
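As a small illustration of what `Series.map` does with a dictionary, the sketch below uses a made-up series; the extra value `"maybe"` is hypothetical and is included only to show what happens to values missing from the mapping.

```python
import pandas as pd

# Toy column; the real values in wsdata are assumed to be "yes"/"no".
public_sector = pd.Series(["yes", "no", "yes", "maybe"])

mapped = public_sector.map({"yes": 1, "no": 0})
print(mapped)

# Values missing from the dictionary become NaN rather than raising.
n_missing = mapped.isna().sum()
print(n_missing)
```

One practical consequence is that running the conversion cell a second time would map the existing 1/0 values to NaN, since neither is a key in the dictionary.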

In [33]:
# Subset the dataframe for use in clustering.

to_cluster_b = wsdata[['fair','harm','in_group','authority','purity','public_sector','sustainability','localism']]
to_cluster_b

      fair  harm  in_group  authority  purity  public_sector  sustainability  localism
0     1.00  1.00      1.00       1.00    1.25              1        2.333333      2.75
1     4.75  5.00      2.25       3.25    2.25              1        6.000000      5.75
2     4.25  3.25      2.25       3.25    2.75              1        5.166667      4.50
3     4.75  4.25      4.75       3.25    3.75              0        5.500000      4.50
4     4.50  2.00      3.50       4.50    3.25              1        2.833333      4.00
...    ...   ...       ...        ...     ...            ...             ...       ...
2416  4.50  4.75      4.50       2.75    3.25              1        3.666667      3.75
2417  4.25  4.75      2.75       2.25    3.75              0        5.166667      5.75
2418  3.75  4.00      2.75       3.00    2.25              1        4.500000      5.50
2419  4.75  5.00      5.00       4.00    5.00              1        5.166667      6.00


In [34]:
# Normalize the subset with min-max scaling so every column lies in [0, 1].

toclustb_normalized = (to_cluster_b - to_cluster_b.min()) / (to_cluster_b.max() - to_cluster_b.min())
toclustb_normalized

      fair  harm  in_group  authority  purity  public_sector  sustainability  localism
0     0.20  0.20      0.20       0.20    0.25            1.0        0.266667      0.35
1     0.95  1.00      0.45       0.65    0.45            1.0        1.000000      0.95
2     0.85  0.65      0.45       0.65    0.55            1.0        0.833333      0.70
3     0.95  0.85      0.95       0.65    0.75            0.0        0.900000      0.70
4     0.90  0.40      0.70       0.90    0.65            1.0        0.366667      0.60
...    ...   ...       ...        ...     ...            ...             ...       ...
2416  0.90  0.95      0.90       0.55    0.65            1.0        0.533333      0.55
2417  0.85  0.95      0.55       0.45    0.75            0.0        0.833333      0.95
2418  0.75  0.80      0.55       0.60    0.45            1.0        0.700000      0.90
2419  0.95  1.00      1.00       0.80    1.00            1.0        0.833333      1.00
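The normalization used here is min-max scaling: each column `x` becomes `(x - min) / (max - min)`, so every feature lies in [0, 1] and no single question dominates the Euclidean distances that k-means relies on. In the output above, a `fair` score of 1.00 maps to 0.20 and 4.75 maps to 0.95, which implies a column minimum of 0 and a maximum of 5. The sketch below reproduces that arithmetic on a made-up frame.

```python
import pandas as pd

# Toy data chosen so that `fair` spans 0 to 5, as the output above implies.
toy = pd.DataFrame({
    "fair": [0.0, 1.0, 4.75, 5.0],
    "public_sector": [1, 0, 1, 1],
})

# Column-wise min-max scaling, the same expression as in the notebook cell.
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

A column with a single constant value would produce 0/0 and therefore NaN under this expression, so it is worth checking for such columns before clustering.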


In [35]:
# Fit the model.

clustered_b = km2.fit_predict(toclustb_normalized)
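The `km2` object is defined earlier in the notebook and is not shown in this section. For readers who want to see the call in isolation, here is a minimal sketch with a hypothetical stand-in estimator. The settings `n_clusters=3`, `n_init=10`, and `random_state=0` are assumptions chosen for illustration (three clusters matches the three groups reported below), not necessarily the notebook's own configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy features already scaled to [0, 1]: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.1, 0.1], [0.5, 0.9], [0.9, 0.2]])
X = np.vstack([c + rng.normal(scale=0.02, size=(20, 2)) for c in centers])

# Hypothetical stand-in for `km2`.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)  # fits the model and returns one label per row

sizes = np.bincount(labels)
print(sizes)                      # number of rows in each cluster
print(km.cluster_centers_.shape)  # one centroid per cluster
```

Fixing `random_state` keeps the cluster numbering stable between runs, which matters when written summaries refer to clusters by number.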

In [36]:
# Assign clusters to original data set.

wsdata['cluster_b'] = clustered_b

In [37]:
# Print the clusters.

for cluster_b in set(clustered_b):
    print(f'Printing Results for Cluster_b {cluster_b}.\n\n')
    print(wsdata.query(f"cluster_b == {cluster_b}"))
    print("\n\n")

Printing Results for Cluster_b 0.


         id  age  gender     engagement  mem_edu      zip channel  \
4       358   50    male  HighlyEngaged        4  98233.0  Branch   
5       361   40  female        Engaged        3  98520.0  Branch   
8       369   61    male  HighlyEngaged        3  98133.0  Branch   
19      429   46    male  HighlyEngaged        5      NaN  Branch   
25      437   43  female     NotEngaged        4      NaN  Branch   
...     ...  ...     ...            ...      ...      ...     ...   
2410  21767   25  female  HighlyEngaged        5      NaN  Branch   
2413  21907   48    male        Engaged        3  99006.0  Branch   
2415  22002   71    male  HighlyEngaged        5  99224.0  Branch   
2416  22070   83  female        Engaged        6  99163.0  Branch   
2419  23167   33    male        Engaged        6      NaN  Branch   

      progressivism  harm  fair  ...  public_sector  sustainability  localism  \
4         -0.500000  2.00  4.50  ...              1   

In [38]:
# Divide the dataframe based on cluster value.

clust_b0 = wsdata.loc[(wsdata['cluster_b'] == 0)]
clust_b1 = wsdata.loc[(wsdata['cluster_b'] == 1)]
clust_b2 = wsdata.loc[(wsdata['cluster_b'] == 2)]
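If the number of clusters changes, writing one line per cluster becomes tedious. A possible alternative, sketched here on a made-up frame, is to let `groupby` do the split and keep the pieces in a dictionary keyed by cluster label. The name `clusters` is hypothetical.

```python
import pandas as pd

# Toy stand-in for wsdata with cluster labels attached.
toy = pd.DataFrame({"cluster_b": [0, 2, 1, 0, 2], "age": [50, 33, 61, 46, 25]})

# One DataFrame per label, keyed by the label itself.
clusters = {label: frame for label, frame in toy.groupby("cluster_b")}

print(sorted(clusters))
print(clusters[2])
```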

Cluster 0:
* Predominant region- W WA Metro and Thurston (tied at 240 each)
* Average age- 56
* Average account age- 16.5
* Most common focal value- Health (i.e. cancer research)
* Mean pub_greater_priv- 3.085
* Mean experience_more_important- 3.99
* Mean teachers_underpaid- 4.644

In [50]:
# Print the cluster attributes.

print('Cluster B 0')
print()
print(clust_b0['region'].value_counts())
print()
print(clust_b0['main_focal_value'].value_counts())
print()
print(clust_b0['age'].describe())
print()
print(clust_b0['account_age'].describe())
print()
print('mean pub_greater_priv')
print(round(clust_b0['pub_greater_priv'].mean(),3))
print()
print('mean experience_more_important')
print(round(clust_b0['experience_more_important'].mean(),3))
print()
print('mean teachers_underpaid')
print(round(clust_b0['teachers_underpaid'].mean(),3))

Cluster B 0

W WA Metro        240
Thurston          240
E WA Non Metro    110
W WA Non Metro    103
E WA Metro         79
Name: region, dtype: int64

Health (i.e. cancer research)    150
Education                        123
Hunger/Poverty                   110
Human/Civil rights                71
Law enforcement                   65
Environment                       64
Specify                           54
Homelessness                      45
Animal rights                     43
Domestic abuse                    38
Arts & culture                     9
Name: main_focal_value, dtype: int64

count    772.000000
mean      55.977979
std       14.240961
min       18.000000
25%       46.000000
50%       58.000000
75%       67.000000
max       98.000000
Name: age, dtype: float64

count    772.000000
mean      16.539876
std       12.683709
min        0.011000
25%        5.575750
50%       14.024500
75%       25.421000
max       47.616000
Name: account_age, dtype: float64

mean pub_greater_priv


Cluster 1:
* Predominant region- W WA Metro
* Average age- 52.5
* Average account age- 16
* Most common focal value- Environment
* Mean pub_greater_priv- 3.362
* Mean experience_more_important- 3.797
* Mean teachers_underpaid- 5.06

In [51]:
# Print the cluster attributes.

print('Cluster B 1')
print()
print(clust_b1['region'].value_counts())
print()
print(clust_b1['main_focal_value'].value_counts())
print()
print(clust_b1['age'].describe())
print()
print(clust_b1['account_age'].describe())
print()
print('mean pub_greater_priv')
print(round(clust_b1['pub_greater_priv'].mean(),3))
print()
print('mean experience_more_important')
print(round(clust_b1['experience_more_important'].mean(),3))
print()
print('mean teachers_underpaid')
print(round(clust_b1['teachers_underpaid'].mean(),3))

Cluster B 1

W WA Metro        260
Thurston          214
E WA Non Metro     65
W WA Non Metro     65
E WA Metro         25
Name: region, dtype: int64

Environment                      133
Human/Civil rights                97
Education                         94
Hunger/Poverty                    81
Health (i.e. cancer research)     63
Homelessness                      40
Animal rights                     40
Specify                           28
Arts & culture                    24
Domestic abuse                    21
Law enforcement                    8
Name: main_focal_value, dtype: int64

count    629.000000
mean      52.616852
std       15.108876
min       18.000000
25%       41.000000
50%       56.000000
75%       65.000000
max       93.000000
Name: age, dtype: float64

count    629.000000
mean      15.766618
std       11.549673
min        0.005000
25%        5.395000
50%       13.893000
75%       24.304000
max       47.948000
Name: account_age, dtype: float64

mean pub_greater_priv


Cluster 2:
* Predominant region- W WA Metro
* Average age- 45.5
* Average account age- 10.5
* Most common focal value- Education
* Mean pub_greater_priv- 3.146
* Mean experience_more_important- 4.141
* Mean teachers_underpaid- 4.966

In [52]:
# Print the cluster attributes.

print('Cluster B 2')
print()
print(clust_b2['region'].value_counts())
print()
print(clust_b2['main_focal_value'].value_counts())
print()
print(clust_b2['age'].describe())
print()
print(clust_b2['account_age'].describe())
print()
print('mean pub_greater_priv')
print(round(clust_b2['pub_greater_priv'].mean(),3))
print()
print('mean experience_more_important')
print(round(clust_b2['experience_more_important'].mean(),3))
print()
print('mean teachers_underpaid')
print(round(clust_b2['teachers_underpaid'].mean(),3))

Cluster B 2

W WA Metro        424
Thurston          271
E WA Non Metro    140
W WA Non Metro    108
E WA Metro         77
Name: region, dtype: int64

Education                        191
Health (i.e. cancer research)    152
Hunger/Poverty                   145
Environment                      133
Human/Civil rights                88
Animal rights                     75
Domestic abuse                    61
Homelessness                      60
Law enforcement                   52
Specify                           41
Arts & culture                    22
Name: main_focal_value, dtype: int64

count    1020.000000
mean       45.414706
std        16.481279
min        18.000000
25%        31.000000
50%        45.000000
75%        58.000000
max        97.000000
Name: age, dtype: float64

count    1020.000000
mean       10.582764
std        10.513686
min         0.003000
25%         2.201500
50%         6.697000
75%        16.422750
max        48.786000
Name: account_age, dtype: float64

mean p

## Appendix: Full Data Description

#### This appendix was authored by Professor John Chandler, Ph.D., of UM, Missoula.

A financial institution in Washington has become concerned that their current membership 
base is not well-aligned with their corporate values. Through that concern they 
realized that they don't actually understand their membership's values very well. They 
surveyed 2,421 members to shed light on the issue. 

The heart of the survey was the Moral Foundations Theory of Jonathan Haidt. Members 
were surveyed on the Moral Foundations Questionnaire, which you should take so you 
understand the test. Survey respondents were scored on the five foundations as well 
as a single-number summary, Progressivism. 
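For orientation, a Progressivism score of this kind is commonly computed as the mean of the Harm and Fairness scores minus the mean of the In-group, Authority, and Purity scores. Whether this data set was scored exactly that way is an assumption. However, the one respondent whose scores are all visible in the notebook output (row 4: harm 2.00, fair 4.50, in_group 3.50, authority 4.50, purity 3.25, progressivism -0.5) is consistent with it, as this sketch checks.

```python
def progressivism(harm, fair, in_group, authority, purity):
    """Mean of harm and fair minus the mean of in_group, authority, purity."""
    return (harm + fair) / 2 - (in_group + authority + purity) / 3

# Scores for row 4 as shown in the notebook output.
score = progressivism(harm=2.00, fair=4.50, in_group=3.50,
                      authority=4.50, purity=3.25)
print(score)
```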

The financial institution values Localism, Sustainability, and Education. These aspects 
of members' values were assessed in the survey as well. Localism and Sustainability used
validated scales and thus can be summarized via a single score, where higher values indicate
greater support for the values. Education is summarized by the following three questions, 
which we do not have evidence can be combined into a single score:

* In general, public schools provide a better education than private schools.
* Public school teachers are underpaid.
* Experience is more important than education in determining success in life.

These questions were evaluated on a 1 to 6 scale where 1 indicated "Strongly Disagree" and 
6 indicated "Strongly Agree". 

Finally, we have information on the member that can be used to understand variation 
in their values. 

The data consists of the following columns:

* ID: a unique identifier for the survey respondent.
* age: the age of the respondent.
* gender: gender was evaluated with a robust scale and collapsed into male/female/other for 
  those whose gender identity was not male or female.
* engagement: three categories of engagement with the financial institution.
* mem.edu: the self-reported education level of the member with the following scale:
* zip: the member zip code. 
* channel: how the member joined the financial institution. Options are "Loan" if they joined 
  via an auto loan, "Branch" if they joined at a branch, and other for online or unknown. 
* progressivism/harm/fair/in.group/authority/purity: The MFQ results.
* account.age: the age of the member's account, in years. 
* region: The region of Washington the member lives in. May be easier to work with than zip.
* public.sector: has the person ever been a public employee?
* sustainability/localism: Scores on the validated scales. Higher values indicate greater
  support for the value.
* pub.greater.priv/experience.more.important/teachers.underpaid: The responses to the 
  education questions above. 
* main.focal.value: Respondents were asked, "Below is a list of broad areas to which people 
  often dedicate their volunteer or philanthropic efforts. From this list, please select the 
  most important to you. If an area of particular importance is missing, please let us know 
  about it in the space for 'other.'" This column holds each respondent's answer to that question.
* support.of.focal.value: Respondents were given an opportunity to indicate how they 
  supported their focal value. Those responses were collapsed into a single score, where 
  a higher value indicates more support.
