### Clustering Basics

- grouping items with similar characteristics
- items within a group are more similar to each other than to items in other groups
- unsupervised learning
- examples
    - news articles
    - customer segmentation
    
#### Types of Clustering Algorithms

- Hierarchical
- K-means
- DBSCAN
- Gaussian Methods

#### Data Prep

- different units $_$, cm, feet
- different scales km, mm
- data needs to be normalized before clustering

##### whiten

- `whiten()` from `scipy.cluster.vq` normalizes data by dividing each feature by its standard deviation

        # Import the whiten function and matplotlib for plotting
        from scipy.cluster.vq import whiten
        import matplotlib.pyplot as plt

        goals_for = [4,3,2,3,1,1,2,0,1,4]

        # Use the whiten() function to standardize the data
        scaled_data = whiten(goals_for)
        print(scaled_data)
        # Plot original data
        plt.plot(goals_for, label='original')

        # Plot scaled data
        plt.plot(scaled_data, label='scaled')

        # Show the legend in the plot
        plt.legend()

        # Display the plot
        plt.show()
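
What `whiten()` does can be checked by hand. The sketch below uses a small made-up two-column array (the values are illustrative assumptions) and compares the result with a manual division by each column's standard deviation. Note that the mean is not subtracted.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Two features on different scales; the values are made up for illustration
data = np.array([[4, 40], [3, 10], [2, 30], [3, 20], [1, 50]], dtype=float)

scaled = whiten(data)

# whiten() divides each column by its (population) standard deviation
manual = data / data.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.std(axis=0))  # every column now has unit standard deviation
```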

### Hierarchical Clustering

#### linkage

- computes the distances used to merge clusters step by step
    - method= how to calculate proximity between clusters
        - no single right method for every problem
        - understand the data before selecting a method
        - single - based on the two closest objects
        - complete - based on the two farthest objects
        - average - based on the mean of all pairwise distances between objects
        - centroid - based on the distance between cluster centroids
        - median - like centroid, but a merged cluster's center is the midpoint of the two merged centers
        - ward - based on the increase in the within-cluster sum of squares
    - metric= distance metric (default 'euclidean')
    - optimal_ordering= whether to reorder the leaves of the resulting tree (default False)
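
To see how the choice of `method` changes the result, the sketch below runs `linkage()` with each method on a tiny made-up dataset (the points are illustrative assumptions) and prints the height of the final merge.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy data: two tight groups of three points each (illustrative values)
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])

# The last row of the linkage matrix describes the final merge;
# column 2 holds the distance at which the two clusters were joined
final_merge = {}
for method in ['single', 'complete', 'average', 'centroid', 'median', 'ward']:
    Z = linkage(points, method=method, metric='euclidean')
    final_merge[method] = Z[-1, 2]
    print(f'{method:>8}: final merge distance = {Z[-1, 2]:.2f}')
```

For these two groups, single linkage reports the smallest gap between the groups and complete linkage the largest, with average linkage in between.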
    
#### fcluster

- creates the cluster labels
    - distance_matrix= output of the linkage() function (named `Z` in SciPy)
    - num_clusters= number of clusters (passed as `t` in SciPy when criterion='maxclust')
    - criterion= how to decide thresholds to form clusters
    
            # Import linkage and fcluster functions, plus plotting libraries
            from scipy.cluster.hierarchy import linkage, fcluster
            import matplotlib.pyplot as plt
            import seaborn as sns

            # Use the linkage() function to compute distance
            Z = linkage(df, 'ward')

            # Generate cluster labels (num_clusters is a placeholder; set it for your data)
            num_clusters = 2
            df['cluster_labels'] = fcluster(Z, num_clusters, criterion='maxclust')

            # Plot the points with seaborn
            sns.scatterplot(x='x', y='y', hue='cluster_labels', data=df)
            plt.show()
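
The `criterion` argument changes what the threshold means. A minimal sketch on made-up points (the data and the cut-off of 5.0 are illustrative assumptions) compares `'maxclust'` with `'distance'`:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups (illustrative values)
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
Z = linkage(points, method='ward')

# criterion='maxclust': t is the maximum number of clusters to form
labels_maxclust = fcluster(Z, t=2, criterion='maxclust')

# criterion='distance': t is a cut-off height on the dendrogram
labels_distance = fcluster(Z, t=5.0, criterion='distance')

print(labels_maxclust)   # labels start at 1, not 0
print(labels_distance)
```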
            
#### Limitations
- runtime of linkage() grows much faster than linearly (roughly quadratically) as the number of data points increases
- infeasible for large datasets
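
One way to see the growth: hierarchical clustering starts from all pairwise distances, and there are n(n-1)/2 of them. The sketch below counts them with `pdist` on random points (the dataset sizes are arbitrary choices for illustration).

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Number of pairwise distances for increasing dataset sizes
pair_counts = {}
for n in [10, 100, 1000]:
    points = rng.random((n, 2))
    pair_counts[n] = len(pdist(points))   # condensed distance matrix
    print(n, pair_counts[n])              # equals n * (n - 1) // 2
```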

### KMeans Clustering

- runs much faster than hierarchical clustering

#### kmeans()

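- obs - standardized observations
- k_or_guess - number of clusters (or an array of initial cluster centers)
- iter - number of times to run k-means; the run with the lowest distortion is returned (default 20)
- thresh - stop when the change in distortion between iterations is at most this value (default 1e-05)
- check_finite - boolean, whether to check that observations contain only finite numbers
- returns cluster centers, distortion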

##### distortion

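- measures how far points are from their cluster centers; usually defined as the sum of squared distances
- SciPy's kmeans() reports the mean (non-squared) Euclidean distance between observations and their nearest center
- distortion decreases with increasing # of clusters (zero when == # of points)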
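
The points above can be checked on synthetic data. This is a sketch under assumptions: the two blobs are made up, and the `seed` keyword requires a reasonably recent SciPy.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

rng = np.random.default_rng(0)

# Toy data: two blobs of 20 points each (illustrative values)
points = np.vstack([rng.normal(0, 1, (20, 2)),
                    rng.normal(10, 1, (20, 2))])
scaled = whiten(points)

centers, distortion = kmeans(scaled, 2, seed=0)

# vq() returns each point's distance to its nearest center;
# averaging those distances gives a value comparable to the reported distortion
labels, distances = vq(scaled, centers)
print(distortion, distances.mean())

# With as many clusters as points, every point is its own center
_, distortion_all = kmeans(scaled, len(scaled), seed=0)
print(distortion_all)
```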

#### vq()

- generates cluster labels
- obs - standardized observations
- code_book - cluster centers
- check_finite - boolean whether to check if observations contain only finite numbers
- returns array of cluster labels (starting at 0), array of distances from each observation to its nearest center

        # Import the kmeans and vq functions, plus plotting libraries
        from scipy.cluster.vq import kmeans, vq
        import matplotlib.pyplot as plt
        import seaborn as sns

        # Generate cluster centers
        cluster_centers, distortion = kmeans(comic_con[['x_scaled', 'y_scaled']], 2)

        # Assign cluster labels
        comic_con['cluster_labels'], distortion_list = vq(comic_con[['x_scaled', 'y_scaled']],code_book=cluster_centers)

        # Plot clusters
        sns.scatterplot(x='x_scaled', y='y_scaled', 
                        hue='cluster_labels', data = comic_con)
        plt.show()
        
#### Limits of kmeans

- no built-in way to find the right # of clusters
- cluster centers start from random points, so different random seeds can give different results
- biased towards equal sized clusters
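
The effect of the random seed can be explored with a short experiment. In this sketch the overlapping blobs are made up, `iter=1` switches off SciPy's built-in restarts so each seed is a single random start, and the `seed` keyword assumes a reasonably recent SciPy. The distortions may or may not differ between seeds, depending on the data.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

rng = np.random.default_rng(1)

# Toy data: three overlapping blobs (illustrative values)
points = np.vstack([rng.normal(0, 1.5, (30, 2)),
                    rng.normal(3, 1.5, (30, 2)),
                    rng.normal(6, 1.5, (30, 2))])
scaled = whiten(points)

# iter=1 disables the built-in restarts, so each seed is a single random start
for seed in range(5):
    _, distortion = kmeans(scaled, 3, iter=1, seed=seed)
    print(seed, round(float(distortion), 4))

# The same seed reproduces the same centers
centers_a, _ = kmeans(scaled, 3, iter=1, seed=42)
centers_b, _ = kmeans(scaled, 3, iter=1, seed=42)
print(np.allclose(centers_a, centers_b))
```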

### Visualization

#### scatterplots
- provides a good initial look at potential cluster formations
- try to make sense of the clusters formed
- spot trends in the data

##### matplotlib

        # Matplotlib method with color assignment
        import matplotlib.pyplot as plt

        # Define a colors dictionary for clusters
        # (keys must match the label values: fcluster labels start at 1, vq labels at 0)
        colors = {1:'red', 2:'blue'}

        # Plot a scatter plot
        comic_con.plot.scatter(x = 'x_scaled', 
                               y = 'y_scaled',
                               c = comic_con['cluster_labels'].apply(lambda x: colors[x]))
        plt.show()
        
##### seaborn  

        # Seaborn method using cluster labels as color
        import seaborn as sns

        # Plot a scatter plot using seaborn
        sns.scatterplot(x='x_scaled', 
                        y='y_scaled', 
                        hue='cluster_labels', 
                        data = comic_con)
        plt.show()
        
#### dendrograms

- shows the progression of merges as clusters form
- branching diagram
- can help decide how many clusters to use

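        # Import the dendrogram function and matplotlib for display
        from scipy.cluster.hierarchy import dendrogram
        import matplotlib.pyplot as plt

        # Create a dendrogram (distance_matrix is the output of linkage())
        dn = dendrogram(distance_matrix)

        # Display the dendrogram
        plt.show()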
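
Reading a dendrogram for the number of clusters amounts to looking for a large jump in merge height. The sketch below applies that idea numerically to made-up blobs. The largest-gap rule is one simple heuristic assumed here for illustration, not a rule built into SciPy.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Toy data: three well-separated blobs of 10 points each (illustrative values)
blob_centers = [(0, 0), (10, 0), (0, 10)]
points = np.vstack([rng.normal(c, 0.5, (10, 2)) for c in blob_centers])

Z = linkage(points, method='ward')

# Column 2 of the linkage matrix holds the merge heights drawn in a dendrogram
heights = Z[:, 2]

# Heuristic: cut the tree inside the largest gap between consecutive merges
gaps = np.diff(heights)
merges_kept = int(np.argmax(gaps)) + 1
suggested_clusters = len(points) - merges_kept
print(suggested_clusters)

labels = fcluster(Z, t=suggested_clusters, criterion='maxclust')
print(np.unique(labels))
```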
        
#### Elbow charts

- helps in determining # of clusters in k-means clustering
- x = # of clusters, y = distortion
- only gives _indication_ of optimal clusters

        import pandas as pd
        import seaborn as sns
        import matplotlib.pyplot as plt
        from scipy.cluster.vq import kmeans

        distortions = []
        num_clusters = range(1, 7)

        # Create a list of distortions from the kmeans function
        for i in num_clusters:
            cluster_centers, distortion = kmeans(comic_con[['x_scaled', 'y_scaled']], i)
            distortions.append(distortion)

        # Create a data frame with two lists - num_clusters, distortions
        elbow_plot = pd.DataFrame({'num_clusters': num_clusters, 'distortions': distortions})

        # Create a line plot of num_clusters and distortions
        sns.lineplot(x='num_clusters', y='distortions', data = elbow_plot)
        plt.xticks(num_clusters)
        plt.show()
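
The elbow can also be inspected numerically, without a plot. The sketch below uses three made-up blobs, so the "right" answer is known to be 3, and prints how much the distortion falls with each added cluster. The data and the `seed` keyword (recent SciPy) are assumptions.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

rng = np.random.default_rng(0)

# Toy data: three well-separated blobs (illustrative values)
blob_centers = [(0, 0), (10, 0), (0, 10)]
points = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in blob_centers])
scaled = whiten(points)

num_clusters = range(1, 7)
distortions = [kmeans(scaled, k, seed=0)[1] for k in num_clusters]

# Drop in distortion gained by adding one more cluster
drops = -np.diff(distortions)
for k, drop in zip(list(num_clusters)[1:], drops):
    print(f'{k - 1} -> {k} clusters: distortion falls by {drop:.3f}')
```

The drop should shrink sharply after 3 clusters, which is where the elbow sits for this data. On real data the bend is often less clear.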


### Other methods to determine optimal clusters

#### Average Silhouette
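
For each point, the silhouette compares the mean distance to points in its own cluster (a) with the mean distance to points in the nearest other cluster (b), giving (b - a) / max(a, b). Averaging this over all points and picking the number of clusters with the highest average is one way to choose k. A minimal sketch follows, where `average_silhouette` is a hypothetical helper written for this note and the blobs are made up:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten
from scipy.spatial.distance import cdist

def average_silhouette(points, labels):
    """Mean silhouette width over all points (hypothetical helper)."""
    dists = cdist(points, points)
    cluster_ids = np.unique(labels)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a singleton cluster scores 0
        a = dists[i, same].sum() / (n_same - 1)
        b = min(dists[i, labels == c].mean() for c in cluster_ids if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

rng = np.random.default_rng(0)

# Toy data: three well-separated blobs (illustrative values)
blob_centers = [(0, 0), (10, 0), (0, 10)]
points = np.vstack([rng.normal(c, 0.5, (20, 2)) for c in blob_centers])
scaled = whiten(points)

silhouettes = {}
for k in range(2, 6):
    centers, _ = kmeans(scaled, k, seed=0)
    labels, _ = vq(scaled, centers)
    silhouettes[k] = average_silhouette(scaled, labels)
    print(k, round(float(silhouettes[k]), 3))

best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```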

#### Gap Statistic

### Clustering Example (Dominant Colors)

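    # Import image class of matplotlib and the other libraries used below
    import matplotlib.image as img
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns
    from scipy.cluster.vq import kmeans, whiten

    # Read batman image and print dimensions
    batman_image = img.imread('batman.jpg')
    print(batman_image.shape)

    # Store RGB values of all pixels in lists r, g and b
    r, g, b = [], [], []
    for row in batman_image:
        for temp_r, temp_g, temp_b in row:
            r.append(temp_r)
            g.append(temp_g)
            b.append(temp_b)

    # Build a DataFrame of pixel colors and standardize each channel
    batman_df = pd.DataFrame({'red': r, 'green': g, 'blue': b})
    for color in ['red', 'green', 'blue']:
        batman_df['scaled_' + color] = whiten(batman_df[color])

    distortions = []
    num_clusters = range(1, 7)

    # Create a list of distortions from the kmeans function
    # (columns in red, green, blue order so the centers unpack correctly later)
    for i in num_clusters:
        cluster_centers, distortion = kmeans(batman_df[['scaled_red', 'scaled_green', 'scaled_blue']], i)
        distortions.append(distortion)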

    # Create a data frame with two lists, num_clusters and distortions
    elbow_plot = pd.DataFrame({'num_clusters': num_clusters, 'distortions': distortions})

    # Create a line plot of num_clusters and distortions
    sns.lineplot(x='num_clusters', y='distortions', data = elbow_plot)
    plt.xticks(num_clusters)
    plt.show()
    
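    # Re-run kmeans with the number of clusters suggested by the elbow plot
    # (3 is a placeholder; the loop above leaves cluster_centers at 6 clusters)
    cluster_centers, distortion = kmeans(batman_df[['scaled_red', 'scaled_green', 'scaled_blue']], 3)

    # Get standard deviations of each color (ddof=0 matches what whiten() uses)
    r_std, g_std, b_std = batman_df[['red', 'green', 'blue']].std(ddof=0)

    colors = []
    for cluster_center in cluster_centers:
        scaled_r, scaled_g, scaled_b = cluster_center
        # Convert each standardized value back to the 0-1 range expected by imshow
        colors.append((
            scaled_r * r_std / 255,
            scaled_g * g_std / 255,
            scaled_b * b_std / 255
        ))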

    # Display colors of cluster centers
    plt.imshow([colors])
    plt.show()
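
The same idea can be tried without an image file. This sketch builds a synthetic two-color image (a stand-in assumption for a real photo), flattens it with `reshape` instead of nested loops, and recovers the two dominant colors. The `seed` keyword assumes a reasonably recent SciPy.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

rng = np.random.default_rng(0)

# Synthetic 20x20 RGB image: left half reddish, right half bluish, plus noise
image = np.zeros((20, 20, 3))
image[:, :10] = (200, 30, 30)
image[:, 10:] = (20, 40, 220)
image = np.clip(image + rng.normal(0, 5, image.shape), 0, 255)

# Flatten to one row per pixel instead of looping over rows and columns
pixels = image.reshape(-1, 3)

# Standardize each channel, cluster, then undo the scaling
channel_std = pixels.std(axis=0)
scaled_pixels = whiten(pixels)
cluster_centers, _ = kmeans(scaled_pixels, 2, seed=0)
dominant_colors = cluster_centers * channel_std / 255   # back to the 0-1 range

print(np.round(dominant_colors * 255))
```

Passing `[dominant_colors]` to `plt.imshow()` would display the two colors as swatches.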

### Clustering Example (Document)

    # Import TfidfVectorizer class from sklearn and kmeans from scipy
    from sklearn.feature_extraction.text import TfidfVectorizer
    from scipy.cluster.vq import kmeans

    # Initialize TfidfVectorizer
    # (remove_noise is a tokenizer function and plots a list of documents, both defined elsewhere)
    tfidf_vectorizer = TfidfVectorizer(min_df=0.1, max_df=0.75, max_features=50, tokenizer=remove_noise)

    # Use the .fit_transform() method on the list plots
    tfidf_matrix = tfidf_vectorizer.fit_transform(plots)

    num_clusters = 2

    # Generate cluster centers through the kmeans function
    # (.toarray() gives a plain ndarray, which kmeans handles more reliably than the matrix from .todense())
    cluster_centers, distortion = kmeans(tfidf_matrix.toarray(), num_clusters)

    # Generate terms from the tfidf_vectorizer object
    # (get_feature_names() was removed in scikit-learn 1.2)
    terms = tfidf_vectorizer.get_feature_names_out()

    for i in range(num_clusters):
        # Sort the terms and print top 3 terms
        center_terms = dict(zip(terms, list(cluster_centers[i])))
        sorted_terms = sorted(center_terms, key=center_terms.get, reverse=True)
        print(sorted_terms[:3])

### Clustering Example (Multiple Features)
- consider feature reduction using factor analysis or multidimensional scaling

        # Import the kmeans and vq functions
        from scipy.cluster.vq import kmeans, vq

        # Create centroids with kmeans for 2 clusters
        cluster_centers, _ = kmeans(fifa[scaled_features], 2)

        # Assign cluster labels and print cluster centers
        fifa['cluster_labels'], _ = vq(fifa[scaled_features], cluster_centers)
        print(fifa.groupby('cluster_labels')[scaled_features].mean())

        # Print the size of the clusters
        print(fifa.groupby('cluster_labels')['ID'].count())

        # Print the mean value of wages in each cluster
        print(fifa.groupby('cluster_labels')['eur_wage'].mean())
        
        # Plot cluster centers to visualize clusters
        fifa.groupby('cluster_labels')[scaled_features].mean().plot(legend=True, kind='bar')
        plt.show()
        
        # Get the names of the first 5 players listed in each cluster
        for cluster in fifa['cluster_labels'].unique():
            print(cluster, fifa[fifa['cluster_labels'] == cluster]['name'].values[:5])