# Document Clustering with pandas, flair, and sklearn

The following Python packages are used to vectorize text, cluster it, and visualize the result:  
- **flair** is a powerful and well-documented NLP package: https://flairnlp.github.io/docs/intro  
- **numpy** is one of the most widely used packages for numerical and vectorized computation: https://numpy.org
- **scikit-learn** (sklearn) is a well-known and powerful machine learning package: https://scikit-learn.org/stable/index.html
- **matplotlib** is a powerful package for visualizing data: https://matplotlib.org
- **pandas** is used to handle, analyze, and manipulate data quickly and efficiently: https://pandas.pydata.org

In [None]:
%pip install flair tqdm scikit-learn pandas matplotlib

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from tqdm import tqdm # package to visualize progress

# imports for the documents embeddings
from flair.embeddings import TransformerDocumentEmbeddings
from flair.data import Sentence

# imports for clustering and pca
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# initiate pandas progress bar
tqdm.pandas(ncols=50)

# read the csv data
# source: https://www.kaggle.com/datasets/notshrirang/spotify-million-song-dataset
# despite its name, it does not contain a million songs :)
data = pd.read_csv('./spotify_millsongdata.csv')

# some songs probably appear more than once in the data
data.drop_duplicates(subset='text', inplace=True, ignore_index=True)
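
A small sketch of what the `drop_duplicates` call does, on an invented toy table. The rows are made up for illustration; only the column names follow the dataset. Note that the match is exact, so lyrics that differ only in whitespace or punctuation would still count as different songs.

```python
import pandas as pd

# Toy stand-in for the lyrics table; the rows are invented for illustration.
toy = pd.DataFrame({
    "artist": ["A", "A", "B", "C"],
    "song": ["One", "One (Live)", "Two", "Three"],
    "text": ["la la la", "la la la", "do re mi", "fa so la"],
})

# Rows count as duplicates if their 'text' is identical;
# the first occurrence is kept and the index is renumbered from 0.
deduplicated = toy.drop_duplicates(subset="text", ignore_index=True)

print(len(toy), "->", len(deduplicated))
print(deduplicated["song"].tolist())
```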

In [None]:
# the data is now in a pandas dataframe; show its columns and size
data.info()

In [None]:
# to cluster, we choose four artists with different styles,
# hoping that their lyrics reflect this
chosen_artists = ['Bob Marley', 'Zucchero', 'Snoop Dogg', 'Alice Cooper']
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added later
data = data[data['artist'].isin(chosen_artists)].copy()

In [None]:
# the filtered dataframe; the entry count shows how many songs remain in total
data.info()
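
`data.info()` only reports the total number of rows. To see the number of songs per artist, and to catch an artist name that does not match the dataset's spelling, something like the following may help. This is a sketch on an invented toy frame; `toy`, `subset`, and `missing` are hypothetical names.

```python
import pandas as pd

# Invented rows for illustration; only the 'artist' column matters here.
toy = pd.DataFrame({
    "artist": ["Bob Marley", "Snoop Dogg", "ABBA", "Bob Marley", "Zucchero"],
    "text": ["t1", "t2", "t3", "t4", "t5"],
})
chosen = ["Bob Marley", "Zucchero", "Snoop Dogg", "Alice Cooper"]

# Same filter as in the notebook, applied to the toy frame.
subset = toy[toy["artist"].isin(chosen)].copy()

# Number of songs per remaining artist.
songs_per_artist = subset["artist"].value_counts()
print(songs_per_artist)

# Chosen artists that have no rows at all (e.g. because of a spelling mismatch).
missing = sorted(set(chosen) - set(subset["artist"]))
print("no songs found for:", missing)
```

In the notebook itself, the same two checks could be run on `data` and `chosen_artists`.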

### Embedding, dimensionality reduction, and clustering

In [None]:
# load a RoBERTa model to vectorize the text; the small distilroberta-base is used here
# roberta-base and roberta-large are alternatives, but the larger models need more computational power
# feel free to test other roberta (or bert) models: https://huggingface.co/models?pipeline_tag=fill-mask&library=transformers&language=en&sort=trending&search=roberta

embedding = TransformerDocumentEmbeddings('distilroberta-base')

In [None]:
# progress_apply() works like apply(), but shows the progress with tqdm
# the embedding takes place here
data['text_embedding'] = data['text'].progress_apply(lambda x: embedding.embed(Sentence(x)))

In [None]:
# this is what an embedding vector looks like
data['text_embedding'].iloc[0][0].embedding

In [None]:
# flair returns a list of sentences; we need the embedding of the first (and only) one
data['embedding'] = data['text_embedding'].progress_apply(
    lambda x: x[0].embedding
)

In [None]:
data.embedding

In [None]:
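# the embeddings are torch tensors; they have to be converted to a 2-D numpy array
# and standardized before PCA and clustering
tensor_data = data['embedding'].values
# detach().cpu() also makes this work if the embeddings live on a GPU
flattened_tensors = [tensor.detach().cpu().numpy().flatten() for tensor in tensor_data]
flattened_array = np.array(flattened_tensors)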

# standardize the flattened array
scaler = StandardScaler()
standardized_data = scaler.fit_transform(flattened_array)

# apply PCA down to 2 dimensions for visualization
pca = PCA(n_components=2)
pca_result = pca.fit_transform(standardized_data)
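
Two principal components usually keep only part of the variance of a high-dimensional embedding, so the 2-D plot is a rough view of the data that the clustering actually sees. The sketch below shows how to read off that share with `explained_variance_ratio_`. It uses random numbers as a stand-in for the embeddings, so the printed values say nothing about the song data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embeddings: 200 "documents" with 50 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Standardize each feature to zero mean and unit variance, then project to 2-D.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

# Share of the total variance kept by each component, and by both together.
ratios = pca.explained_variance_ratio_
total = ratios.sum()
print(X_2d.shape)
print(ratios, total)
```

In the notebook, `pca.explained_variance_ratio_` can be inspected directly after the `fit_transform` call above.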

In [None]:
# apply agglomerative clustering (with as many clusters as artists were chosen)
clustering = AgglomerativeClustering(
    n_clusters=len(chosen_artists)
)
cluster_labels = clustering.fit_predict(standardized_data)
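
To get a feeling for what `AgglomerativeClustering` returns, here is a minimal sketch on synthetic, well-separated 2-D blobs instead of song embeddings. The data and the explicit `linkage="ward"` (scikit-learn's default) are choices made for this illustration only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Four clearly separated synthetic groups with 30 points each.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=120, centers=centers, cluster_std=0.5, random_state=0)

# Ward linkage merges the two clusters whose union increases
# the within-cluster variance the least.
model = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = model.fit_predict(X)

# One integer label per sample; the label numbers themselves carry no meaning.
print(np.unique(labels))
print(np.bincount(labels))
```

Real lyric embeddings are unlikely to separate this cleanly, so the cluster sizes in the notebook can be far more uneven.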

### Analyze the data

In [None]:
# the distinct cluster labels
np.unique(cluster_labels)

In [None]:
data['cluster_label'] = cluster_labels

In [None]:
# count the songs per cluster and artist and put the result in a dataframe
# (the counts column is named 'count' in pandas >= 2.0)
result = pd.DataFrame(data.value_counts(subset=['cluster_label', 'artist'])).sort_values(by=['cluster_label'])
result

In [None]:
# reset the index
result = result.reset_index(drop=False)
result
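
An alternative view of the same counts is a contingency table, with one row per cluster and one column per artist. The sketch below also computes the adjusted Rand index, which is one possible way (not used in this notebook) to summarize how well clusters and artists agree in a single number. The toy labels are invented.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Invented artist and cluster labels for six songs.
toy = pd.DataFrame({
    "artist": ["A", "A", "A", "B", "B", "B"],
    "cluster_label": [0, 0, 1, 1, 1, 1],
})

# Rows: clusters, columns: artists, cells: number of songs.
table = pd.crosstab(toy["cluster_label"], toy["artist"])
print(table)

# 1.0 means clusters and artists match perfectly;
# values near 0 mean the agreement is about what chance would give.
ari = adjusted_rand_score(toy["artist"], toy["cluster_label"])
print(round(ari, 3))
```

In the notebook, the same calls would take `data['cluster_label']` and `data['artist']`.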

In [None]:
# get a list of strings for every cluster - needed for the visualization legend
result_list = []

for _, cluster_data in result.groupby('cluster_label'):
    artist_counts = []
    for index, row in cluster_data.iterrows():
        artist_counts.append(f"{row['artist']}: {row['count']}")
    result_list.append(artist_counts)

In [None]:
result_list

In [None]:
# visualize the result
fig = plt.figure(figsize=(10,16))
ax = plt.subplot(211)

scatter = ax.scatter(pca_result[:, 0], pca_result[:, 1], c=cluster_labels, cmap='rainbow')

# create a readable legend
legend_labels = ['\n'.join(l) for l in result_list]
# one legend handle per cluster label (0 .. number of clusters - 1)
handles = scatter.legend_elements(num=list(range(len(chosen_artists))))[0]
ax.legend(
    handles=handles,
    labels=legend_labels,
    bbox_to_anchor=(1,0.5),
    loc='center left',
    fontsize=10,
    shadow=True,
)

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Agglomerative Clustering Result')

plt.show()

As a next step, compare some **song lyrics** that you would not expect to end up in the same cluster.
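
One way to make such a comparison concrete is to compute a similarity score between two embedding vectors. The sketch below uses cosine similarity on invented 3-D vectors. This is an assumption made for illustration: the clustering above does not use cosine similarity, and `cosine_similarity` here is a small helper written for this example, not a library function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented stand-ins for three song embeddings.
song_a = np.array([1.0, 0.0, 1.0])
song_b = np.array([2.0, 0.0, 2.0])   # same direction as song_a
song_c = np.array([0.0, 1.0, 0.0])   # orthogonal to song_a

sim_ab = cosine_similarity(song_a, song_b)
sim_ac = cosine_similarity(song_a, song_c)
print(sim_ab, sim_ac)
```

In the notebook, the inputs could be two rows of `standardized_data` that belong to songs by different artists in the same cluster.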