### Clustering Basics

- grouping items with similar characteristics
- items within a group are more similar to each other than to items in other groups
- unsupervised learning
- examples
    - news articles
    - customer segmentation
    
#### Types of Clustering Algorithms

- Hierarchical
- K-means
- DBSCAN
- Gaussian Methods

#### Data Prep

- different units: dollars, cm, feet
- different scales: km, mm
- data should be normalized before clustering so that no single feature dominates the distance calculation

##### whiten

- `whiten()` from `scipy.cluster.vq` normalizes data by dividing each feature by its standard deviation

        # Import the whiten function and matplotlib for plotting
        from scipy.cluster.vq import whiten
        import matplotlib.pyplot as plt

        goals_for = [4,3,2,3,1,1,2,0,1,4]

        # Use the whiten() function to standardize the data
        scaled_data = whiten(goals_for)
        print(scaled_data)

        # Plot original data
        plt.plot(goals_for, label='original')

        # Plot scaled data
        plt.plot(scaled_data, label='scaled')

        # Show the legend in the plot
        plt.legend()

        # Display the plot
        plt.show()

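A minimal sketch of what `whiten()` does to each column, assuming a small made-up array (the values and the names `data`, `scaled`, `manual` are illustrative only). It checks that whitening matches dividing each column by its standard deviation.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Hypothetical observations: column 0 in cm, column 1 in km (very different scales)
data = np.array([
    [170.0, 1.2],
    [160.0, 3.4],
    [180.0, 2.2],
    [175.0, 0.5],
])

# whiten() rescales each column to unit variance (it does not subtract the mean)
scaled = whiten(data)

# Manual equivalent: divide each column by its standard deviation
manual = data / data.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.std(axis=0))
```

After whitening, both columns have a standard deviation of 1, so neither dominates a Euclidean distance.
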
### Hierarchical Clustering

#### linkage

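- computes the distances between clusters and returns a linkage matrix that records each merge
    - method= how to calculate proximity between clusters
        - no single right method for all problems
        - understand the data before selecting a method
        - single - based on the two closest objects
        - complete - based on the two farthest objects
        - average - based on the arithmetic mean of all pairwise distances between objects
        - centroid - based on the distance between cluster centroids (the mean position of each cluster's objects)
        - median - like centroid, but the centroid of a merged cluster is the midpoint of the two merged centroids
        - ward - based on the increase in the within-cluster sum of squares
    - metric= distance metric (e.g., euclidean)
    - optimal_ordering= whether to reorder the linkage matrix so that successive leaves are as close as possible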
    
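The `method=` options above can be compared on the same data. This is a rough sketch on randomly generated points (the data and the names `points`, `results` are assumptions for illustration), printing the shape of each linkage matrix and the height of its final merge.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)

# Hypothetical data: two loose groups of five 2-D points each
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(5, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(5, 2)),
])

results = {}
for method in ["single", "complete", "average", "centroid", "median", "ward"]:
    # Each row of Z: [cluster index, cluster index, merge distance, size of new cluster]
    Z = linkage(points, method=method, metric="euclidean")
    results[method] = Z
    print(method, Z.shape, round(float(Z[-1, 2]), 3))
```

Every method returns an `(n - 1, 4)` matrix; only the merge distances (and sometimes the merge order) differ.
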
#### fcluster

- creates the cluster labels
    - Z= linkage matrix, the output of the linkage() function
    - t= threshold; with criterion='maxclust' it is the maximum number of clusters
    - criterion= how to decide thresholds to form clusters
    
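            # Import linkage and fcluster functions, plus the plotting libraries
            from scipy.cluster.hierarchy import linkage, fcluster
            import matplotlib.pyplot as plt
            import seaborn as sns

            # Use the linkage() function to compute distances on the 'x' and 'y' columns
            Z = linkage(df[['x', 'y']], 'ward')

            # Generate cluster labels (2 is an example value; choose the number of clusters for your data)
            num_clusters = 2
            df['cluster_labels'] = fcluster(Z, num_clusters, criterion='maxclust')

            # Plot the points with seaborn
            sns.scatterplot(x='x', y='y', hue='cluster_labels', data=df)
            plt.show()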
            
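A self-contained sketch of `linkage()` followed by `fcluster()`, assuming three well-separated synthetic groups (the data, seed, and variable names are made up for illustration). It shows two values of `criterion=`: `'maxclust'` and `'distance'`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)

# Hypothetical data: three groups of 20 points around different centers
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(20, 2)),
    rng.normal(loc=[4, 4], scale=0.3, size=(20, 2)),
    rng.normal(loc=[0, 4], scale=0.3, size=(20, 2)),
])

Z = linkage(points, method="ward")

# criterion='maxclust': t is the maximum number of clusters to form
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.unique(labels, return_counts=True))

# criterion='distance': t is a merge-height cutoff; here it is placed
# between the third-to-last and second-to-last merges
cutoff = (Z[-3, 2] + Z[-2, 2]) / 2
labels_by_distance = fcluster(Z, t=cutoff, criterion="distance")
print(len(set(labels_by_distance)))
```

Labels returned by `fcluster()` start at 1, not 0.
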
#### Limitations
- runtime of linkage() grows steeply (at least quadratically) as the number of data points increases
- infeasible for large datasets

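One reason for the steep growth is the number of pairwise distances, which is n(n-1)/2 for n points. A small sketch of that arithmetic (the helper name `pairwise_distance_count` is hypothetical), cross-checked against `pdist` from `scipy.spatial.distance`:

```python
import numpy as np
from scipy.spatial.distance import pdist


def pairwise_distance_count(n_points: int) -> int:
    """Number of distinct pairs among n_points observations."""
    return n_points * (n_points - 1) // 2


for n in (100, 1_000, 10_000, 100_000):
    print(n, pairwise_distance_count(n))
# 100 -> 4950
# 1000 -> 499500
# 10000 -> 49995000
# 100000 -> 4999950000

# pdist returns one distance per pair, so its length matches the formula
print(pdist(np.zeros((100, 2))).shape[0] == pairwise_distance_count(100))
```

Multiplying the number of points by 10 multiplies the number of distances by roughly 100.
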
### KMeans Clustering

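    # Import kmeans and vq functions, plus the plotting libraries
    from scipy.cluster.vq import kmeans, vq
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Compute cluster centers from the 'x' and 'y' columns (2 clusters)
    centroids, _ = kmeans(df[['x', 'y']], 2)

    # Assign cluster labels (vq labels start at 0)
    df['cluster_labels'], _ = vq(df[['x', 'y']], centroids)

    # Plot the points with seaborn
    sns.scatterplot(x='x', y='y', hue='cluster_labels', data=df)
    plt.show()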

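A self-contained sketch of the same `kmeans()` and `vq()` steps, assuming two well-separated synthetic groups (the data and variable names are illustrative). It also keeps the second return value of each call instead of discarding it.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

rng = np.random.default_rng(7)

# Hypothetical data: two groups of 30 points around different centers
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(30, 2)),
    rng.normal(loc=[10, 10], scale=0.5, size=(30, 2)),
])

# Normalize each feature before clustering
scaled = whiten(points)

# kmeans() returns the cluster centers and the distortion
# (mean distance from each observation to its nearest center)
centroids, distortion = kmeans(scaled, 2)

# vq() assigns each observation to its nearest center and
# returns the labels plus the distance to that center
labels, distances = vq(scaled, centroids)

print(centroids.shape)
print(np.unique(labels, return_counts=True))
```

Labels from `vq()` start at 0, while labels from `fcluster()` start at 1, which matters when mapping labels to colors.
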
### Visualization

#### scatterplots
- provides a good initial look at potential cluster formations
- try to make sense of the clusters formed
- spot trends in the data

##### matplotlib

        # Matplotlib method with color assignment
        import matplotlib.pyplot as plt

        # Define a colors dictionary for clusters
        # (keys must match the label values; fcluster labels start at 1)
        colors = {1:'red', 2:'blue'}

        # Plot a scatter plot, mapping each cluster label to its color
        comic_con.plot.scatter(x = 'x_scaled', 
                               y = 'y_scaled',
                               c = comic_con['cluster_labels'].map(colors))
        plt.show()
        
##### seaborn  

        # Seaborn method using cluster labels as color
        import matplotlib.pyplot as plt
        import seaborn as sns

        # Plot a scatter plot using seaborn
        sns.scatterplot(x='x_scaled', 
                        y='y_scaled', 
                        hue='cluster_labels', 
                        data = comic_con)
        plt.show()
        
##### dendrograms

- show the progression of merges as clusters form
- branching diagram
- can help decide how many clusters to use

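        # Import the dendrogram function and matplotlib for display
        from scipy.cluster.hierarchy import dendrogram
        import matplotlib.pyplot as plt

        # Create a dendrogram (distance_matrix is the output of linkage())
        dn = dendrogram(distance_matrix)

        # Display the dendrogram
        plt.show()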
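
One way a dendrogram can inform the number of clusters is to look for the largest jump in merge heights. The sketch below is a heuristic under stated assumptions, not a rule: the data is synthetic, the variable names are made up, and `no_plot=True` is used so the layout can be inspected without drawing.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(3)

# Hypothetical data: three groups of 10 points around different centers
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(10, 2)),
    rng.normal(loc=[5, 0], scale=0.3, size=(10, 2)),
    rng.normal(loc=[0, 5], scale=0.3, size=(10, 2)),
])

Z = linkage(points, method="ward")

# no_plot=True returns the dendrogram layout as a dict without drawing it
dn = dendrogram(Z, no_plot=True)
print(len(dn["leaves"]))

# Heuristic (an assumption, not part of SciPy): cut the tree in the
# largest gap between consecutive merge heights
heights = Z[:, 2]
largest_gap = int(np.argmax(np.diff(heights)))

# After merge i (0-based) there are n - (i + 1) clusters left
suggested_clusters = len(points) - (largest_gap + 1)
print(suggested_clusters)
```

On a plotted dendrogram the same gap appears as the longest vertical stretch with no horizontal merge line.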