# **Welcome To Machine Learning Algorithms** 
---



## *K Means Clustering*


First, we import the necessary libraries into the file.
- pandas is a very useful library for working with data and includes functions for analyzing and manipulating it.
- scikit-learn (sklearn) is the library that contains the tools and functions for machine learning and analysis.
- NumPy is a general-purpose library used for creating and working with arrays.
- Matplotlib is used to create graphs and visualizations of the data.

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt


Next, we need to read the data we want to analyze. We could enter data manually into the script, but the pandas library has a function that reads a CSV file, which is an easy way to load a large dataset into memory. We first save our file path in a variable (file_path) and pass it as the parameter to the read_csv() function. If the CSV file is saved in the same directory as the script, we only need the name and extension of the file, which keeps reading the file simple.

In [None]:
# Read the CSV file
file_path = 'atendees.csv'  # Update this with the actual path to your CSV file
data = pd.read_csv(file_path)
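
As a minimal sketch of what read_csv() returns, the following reads a small in-memory CSV instead of a file on disk. The column names match those used later in this notebook; the three rows are made-up sample values, not the real attendee data.

```python
import io
import pandas as pd

# Hypothetical sample rows, for illustration only
csv_text = """name,years_of_experience,work_specialty
Alex,2,clinical
Sam,15,research
Jo,7,clinical
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # years_of_experience should be a numeric type
print(sample.head())  # first few rows
```

Checking the shape and column types right after loading is a quick way to confirm the file was parsed the way you expected.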

This section of the code prepares the data in a format that the machine learning algorithm can understand. The k-means clustering algorithm only accepts numeric values, and the features should be on a similar scale to avoid a bias toward values with a higher magnitude.
- The work specialty column forms discrete groups, so numeric codes are used to represent the distinct groups (for example, every row labeled clinical gets the same code). pandas assigns the codes in sorted order of the labels, starting at 0.
- A continuous numeric variable such as years of experience may have a much larger magnitude, so we use StandardScaler from the scikit-learn library to normalize the data. (It shifts the values so that their average is 0 and rescales them so that they have a standard deviation of 1.)
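
To make the scaling step concrete, here is a small sketch comparing StandardScaler with the same calculation done by hand. The five values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical years of experience, for illustration only
years = np.array([[1.0], [3.0], [5.0], [7.0], [9.0]])

scaled = StandardScaler().fit_transform(years)

# Same result by hand: subtract the mean, then divide by the
# population standard deviation (NumPy's default for .std())
manual = (years - years.mean()) / years.std()

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

After scaling, the column has a mean of 0 and a standard deviation of 1, whatever units the original values were in.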

Printing the results to the console allows us to look for errors in our data. Small spelling or capitalization errors can lead to skewed results, because each variant of a label is treated as a separate category.

In [None]:
# Encode the work specialty
data['work_specialty_encoded'] = data['work_specialty'].astype('category').cat.codes

# Normalize the years of experience
scaler = StandardScaler()
data['experience_normalized'] = scaler.fit_transform(data[['years_of_experience']]).ravel()

# Prepare features for clustering
features = data[['experience_normalized', 'work_specialty_encoded']]

# Prints the results of the preparation
print(features)
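
Capitalization and stray spaces in a categorical column silently create extra category codes. The sketch below shows one possible way to catch this before encoding, using made-up labels. It only normalizes case and whitespace; genuine misspellings would still need to be corrected by hand.

```python
import pandas as pd

# Hypothetical labels with inconsistent capitalization and spacing
specialty = pd.Series(['clinical', 'Clinical', 'clinical ', 'research'])

# Before cleaning, each variant gets its own code
codes_before = specialty.astype('category').cat.codes
print(codes_before.nunique())

# Strip surrounding whitespace and lowercase everything
cleaned = specialty.str.strip().str.lower()
print(cleaned.value_counts())

# After cleaning, the three "clinical" variants share one code
codes_after = cleaned.astype('category').cat.codes
print(codes_after.tolist())
```

Printing value_counts() for the column is often enough to spot a label that appears only once or twice because of a typo.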

*For simplicity, we set the number of clusters manually. When you do not know how many clusters to use, a technique such as the Elbow Method or the Silhouette Score can help determine how many clusters best represent the dataset. The following code is an example of the Elbow Method, which plots the inertia (the sum of squared distances from each point to its assigned centroid) for different numbers of clusters. The 'elbow' on the graph is the point where adding more clusters no longer reduces the inertia by much; beyond it, extra clusters add little information and risk over-fitting the model and subdividing actual groups.*

- How many natural groups do you think this dataset forms? What should we use for K?

In [None]:
# Elbow Method to find the optimal number of clusters
inertia = []
K = range(1, 11)  # Test cluster numbers from 1 to 10

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(features)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Method result
plt.figure(figsize=(10, 6))
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal k')
plt.grid(True)
plt.show()
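
The Silhouette Score mentioned earlier can be computed in a similar loop. The sketch below uses scikit-learn's silhouette_score on synthetic data with three well-separated groups; the data, the range of k, and the n_init value are assumptions chosen for illustration. In the notebook you would pass `features` instead of `X`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustration only)
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A higher score means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f'k={k}: silhouette={score:.3f}')
print('Best k by silhouette score:', best_k)
```

Unlike the elbow plot, which has to be read by eye, the silhouette score gives a single number per k, so the best candidate can be picked programmatically.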

This section applies the k-means algorithm to the prepared data.
- The n_clusters parameter sets the number of clusters that will be calculated (we are using 3 here, but you would choose this number based on the results of the elbow method or a similar analysis).
- The random_state parameter seeds the random number generator behind the scenes. The value itself does not matter; keeping it the same across runs makes the results reproducible so you can validate them.

The algorithm starts by choosing initial centroids, with some randomness involved. Each data point is then assigned to the cluster whose centroid is closest, and each centroid is recalculated as the center (mean) of the points assigned to it. These two steps repeat until the assignments stop changing. The result is a set of clusters whose members are as similar to each other as possible.
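
The assign-then-recalculate loop can be sketched in plain NumPy. This is a simplified illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the purely random choice of starting points, and the sample data are all assumptions (scikit-learn's KMeans uses the k-means++ initialization by default).

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Return, for each point, the index of its closest centroid."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans_sketch(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = nearest_centroid(points, centroids)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    # Final assignment against the final centroids
    return nearest_centroid(points, centroids), centroids

# Made-up 2D points scattered around three centers
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(20, 2))
    for center in ([0, 0], [5, 5], [0, 5])
])
labels, centroids = kmeans_sketch(points, k=3)
print(centroids)
```

Because the starting centroids are random, different seeds can give different final clusters, which is why fixing random_state matters in the real KMeans call.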

In [None]:
# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
data['cluster'] = kmeans.fit_predict(features)

# Calculate distances to the cluster centroids
data['distance_to_centroid'] = np.linalg.norm(features.values - kmeans.cluster_centers_[data['cluster']], axis=1)

# Sort individuals within each cluster by distance to the centroid
data = data.sort_values(by=['cluster', 'distance_to_centroid'])


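This step creates three empty buckets (lists), then loops through the rows, which are sorted by cluster and by distance to the centroid, and adds each person to whichever bucket currently holds the fewest people. Because the members of each cluster are dealt out across the buckets in turn, we essentially create the most similar groups we can and then split them evenly into balanced groups.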

In [None]:
# Initialize lists to hold the balanced groups
groups = [[] for _ in range(3)]

# Distribute individuals to ensure balanced groups
for i, person in data.iterrows():
    smallest_group = min(groups, key=len)
    smallest_group.append(person)

# Assign group numbers to the data
data['balanced_group'] = None
for group_number, group in enumerate(groups):
    for person in group:
        data.loc[person.name, 'balanced_group'] = group_number
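
Because the loop always adds to the first of the smallest lists, the rows end up dealt out in turn: row 0 to group 0, row 1 to group 1, row 2 to group 2, row 3 back to group 0, and so on. A possible shorter way to get the same assignment is sketched below on a made-up table that stands in for the sorted data; the name `sorted_data` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, already sorted by cluster and distance to centroid
sorted_data = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'cluster': [0, 0, 0, 1, 1, 2, 2],
})

n_groups = 3
# Position in the sorted table, modulo the number of groups
sorted_data['balanced_group'] = np.arange(len(sorted_data)) % n_groups

print(sorted_data)
print(sorted_data['balanced_group'].value_counts().sort_index())
```

This avoids iterating over rows one at a time, which can be slow on large tables, but it relies on the table already being sorted in the desired order.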


This step uses the Matplotlib library to plot the clusters on a two-dimensional graph. More than two variables can be used with this algorithm, but that makes visualization difficult. There are many ways to customize and label the graph; these are some of the most common.

In [None]:
# Plot the clusters and centroids
plt.figure(figsize=(10, 6))

# Colors for the clusters
colors = ['red', 'green', 'blue']
for i in range(3):
    cluster_data = data[data['cluster'] == i]
    plt.scatter(cluster_data['experience_normalized'], cluster_data['work_specialty_encoded'], c=colors[i], label=f'Cluster {i}', alpha=0.5)

# Plot centroids
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], s=300, c='yellow', marker='X', label='Centroids')

# Add labels and legend
plt.xlabel('Normalized Years of Experience')
plt.ylabel('Work Specialty Encoded')
plt.title('K-Means Clustering of Individuals')
plt.legend()
plt.grid(True)
plt.show()


Finally, we print the results to the console. The balanced_group column represents the buckets we created to hold the evenly distributed clusters, and it labels each row with a group number.

In [None]:
# Print the resulting balanced groups
print(data[['name', 'years_of_experience', 'work_specialty', 'balanced_group']])
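
One way to check how balanced the groups actually are is to summarize them. The sketch below uses a small made-up results table with the same column names as the notebook; in the notebook you would run the same calls on `data`.

```python
import pandas as pd

# Hypothetical results table, for illustration only
results = pd.DataFrame({
    'years_of_experience': [1, 2, 10, 12, 20, 22],
    'work_specialty': ['clinical', 'clinical', 'research',
                       'research', 'clinical', 'research'],
    'balanced_group': [0, 1, 2, 0, 1, 2],
})

# Group sizes and average experience per balanced group
summary = results.groupby('balanced_group')['years_of_experience'].agg(['count', 'mean'])
print(summary)

# How many people of each specialty landed in each group
print(pd.crosstab(results['balanced_group'], results['work_specialty']))
```

Similar counts and a similar mix of specialties across the groups suggest the distribution step worked as intended.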

Congratulations! You just used the k-means clustering machine learning algorithm to group data with multiple variables into similar clusters and distribute them into balanced groups!

### Further reading:
- https://en.wikipedia.org/wiki/K-means_clustering
- https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
- https://www.geeksforgeeks.org/k-means-clustering-introduction/
- http://varianceexplained.org/r/kmeans-free-lunch/
- https://en.wikipedia.org/wiki/Elbow_method_(clustering)
- https://medium.com/@evgen.ryzhkov/5-stages-of-data-preprocessing-for-k-means-clustering-b755426f9932
- https://medium.com/analytics-vidhya/why-is-scaling-required-in-knn-and-k-means-8129e4d88ed7
- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
- https://www.baeldung.com/cs/k-means-flaws-improvements
- https://datarundown.com/k-means-clustering-pros-cons/
