Hierarchical Clustering:

# Import linkage and fcluster functions, plus the plotting libraries used below
from scipy.cluster.hierarchy import linkage, fcluster
import seaborn as sns
import matplotlib.pyplot as plt

# Use the linkage() function to build the linkage (distance) matrix with Ward's method
Z = linkage(df, 'ward')

# Generate cluster labels
df['cluster_labels'] = fcluster(Z, 2, criterion='maxclust')

# Plot the points with seaborn
sns.scatterplot(x='x', y='y', hue='cluster_labels', data=df)
plt.show()

----
# Import kmeans and vq functions
from scipy.cluster.vq import kmeans, vq

# Compute cluster centers
centroids,_ = kmeans(df, 2)

# Assign cluster labels
df['cluster_labels'], _ = vq(df, centroids)

# Plot the points with seaborn
sns.scatterplot(x='x', y='y', hue='cluster_labels', data=df)
plt.show()

----
Normalization of Data:

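Normalization rescales each feature by its standard deviation so that features measured on different scales become comparable: x_new = x / std_dev(x)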


from scipy.cluster.vq import whiten

data = [5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5]

scaled_data = whiten(data)

print(scaled_data)
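To check the formula against whiten(), here is a minimal sketch that rescales by hand and compares. The second column (the same values multiplied by 100) is a made-up feature, added only to show that each column is divided by its own standard deviation.

```python
import numpy as np
from scipy.cluster.vq import whiten

values = np.array([5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5], dtype=float)

# Two features on very different scales (second column is hypothetical)
obs = np.column_stack([values, values * 100])

# Manual rescaling: divide each column by its population standard deviation
manual = obs / obs.std(axis=0)

# whiten() applies the same per-column rescaling
scaled = whiten(obs)

same = np.allclose(manual, scaled)
column_stds = scaled.std(axis=0)
print(same)
print(column_stds)
```

After scaling, both columns have unit standard deviation, so neither dominates a distance calculation just because of its units.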

----
# Prepare data
rate_cuts = [0.0025, 0.001, -0.0005, -0.001, -0.0005, 0.0025, -0.001, -0.0015, -0.001, 0.0005]

# Use the whiten() function to standardize the data
scaled_data = whiten(rate_cuts)

# Plot original data
plt.plot(rate_cuts, label='original')

# Plot scaled data
plt.plot(scaled_data, label='scaled')

plt.legend()
plt.show()

----
# Scale wage and value
fifa['scaled_wage'] = whiten(fifa['eur_wage'])
fifa['scaled_value'] = whiten(fifa['eur_value'])

# Plot the two columns in a scatter plot
fifa.plot(x='scaled_wage', y='scaled_value', kind = 'scatter')
plt.show()

# Check mean and standard deviation of scaled values
print(fifa[['scaled_wage', 'scaled_value']].describe())

----
K-means clustering requires two steps.

**Step 1: Generate cluster centers.**

kmeans(obs, k_or_guess, iter, thresh, check_finite)

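obs: standardized observations, e.g. scaled with the whiten method

k_or_guess: number of clusters (or an array of initial cluster centers to start from)

iter: number of times to run k-means, keeping the code book with the lowest distortion (default: 20)

thresh: threshold (default: 1e-05); a run terminates when the change in distortion since the last iteration is <= the threshold

check_finite: whether to check that the observations contain only finite numbers, i.e. no infinite or NaN values (default: True)

Returns two objects: the cluster centers (aka the code book) and the distortion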
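As a rough illustration of Step 1, the sketch below calls kmeans with the parameters listed above spelled out. The two-blob data set is synthetic and only there so the call has something to cluster.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

rng = np.random.default_rng(0)

# Hypothetical data: two well-separated 2-D blobs of 50 points each
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
obs = whiten(np.vstack([blob_a, blob_b]))

# Defaults written out explicitly for readability
code_book, distortion = kmeans(obs, 2, iter=20, thresh=1e-05, check_finite=True)

print(code_book.shape)    # one row per cluster center, one column per feature
print(float(distortion))  # a single number
```

The initial centers are chosen at random, so the exact centers and distortion can vary slightly between runs.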

**Step 2: Generate the cluster labels.**

vq(obs, code_book, check_finite = True)

obs: standardized observations, e.g. scaled with the whiten method

code_book: cluster centers, the first output of the k-means method

check_finite: whether to check that the observations contain only finite numbers, i.e. no infinite or NaN values (default: True)

**A note on distortions**

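kmeans returns a single distortion value: the mean distance between the observations and their nearest cluster centers

vq returns a list of distortions, one per observation: the distance from each observation to its assigned cluster center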
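The difference between the two distortion outputs can be seen in a small sketch, again on synthetic data (the blobs are an assumption made for illustration):

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

rng = np.random.default_rng(1)

# Hypothetical data: two 2-D blobs of 30 points each
obs = whiten(np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.5, size=(30, 2)),
]))

# Step 1: cluster centers plus one overall distortion value
code_book, distortion = kmeans(obs, 2)

# Step 2: one label and one distance per observation
labels, distances = vq(obs, code_book)

print(np.ndim(distortion))  # 0 dimensions: a scalar
print(distances.shape)      # one entry per observation
print(distances.mean())     # expected to be close to the kmeans distortion
```

The mean of the vq distances should land near the kmeans distortion, though the two are not guaranteed to match exactly.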

----
**Running k-means**

#import kmeans and vq functions

from scipy.cluster.vq import kmeans, vq

#Generate cluster centers and labels

cluster_centers, _ = kmeans(df[['scaled_x', 'scaled_y']], 3)

df['cluster_labels'], _ = vq(df[['scaled_x', 'scaled_y']], cluster_centers)

#Plot clusters

sns.scatterplot(x = 'scaled_x', y = 'scaled_y', hue = 'cluster_labels', data = df)

plt.show()

----
# Import the kmeans and vq functions
from scipy.cluster.vq import kmeans, vq

# Generate cluster centers
cluster_centers, distortion = kmeans(comic_con[['x_scaled', 'y_scaled']], 2)

# Assign cluster labels
comic_con['cluster_labels'], distortion_list = vq(comic_con[['x_scaled', 'y_scaled']], cluster_centers)

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', 
                hue='cluster_labels', data = comic_con)
plt.show()

----
* There is no absolute method for finding the right number of clusters (k) in k-means clustering
* The elbow method gives a reasonable indication: plot distortion against k and look for the point where the curve flattens out

**Elbow method**

#Declaring variables for use

distortions = []

num_clusters = range(2, 7)


#Populating distortions for various clusters

for i in num_clusters:

    centroids, distortion = kmeans(df[['scaled_x', 'scaled_y']], i)

    distortions.append(distortion)


#Plotting elbow plot data
elbow_plot_data = pd.DataFrame({'num_clusters': num_clusters,

                                'distortions': distortions})


sns.lineplot(x = 'num_clusters', y = 'distortions',

             data = elbow_plot_data)


plt.show()
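For a version of the elbow loop that runs without an existing df, here is a minimal sketch on synthetic data. The three blob centers are made up, and the plot is replaced by a printout so the numbers can be inspected directly.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

rng = np.random.default_rng(42)

# Hypothetical data: three 2-D blobs of 40 points each
centers = [(0.0, 0.0), (6.0, 0.0), (3.0, 6.0)]
obs = whiten(np.vstack(
    [rng.normal(loc=c, scale=0.6, size=(40, 2)) for c in centers]
))

# Distortion for each candidate number of clusters
num_clusters = range(2, 7)
distortions = [float(kmeans(obs, k)[1]) for k in num_clusters]

# Reduction in distortion gained by adding one more cluster
drops = -np.diff(distortions)

for k, d in zip(num_clusters, distortions):
    print(k, round(d, 3))
print(drops)
```

With three true blobs, the largest drop should come when going from 2 to 3 clusters, with only small gains afterwards; that bend is the elbow.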