# Text Clustering with K-Means
In this notebook we use the [k-means algorithm](https://www.datascience.com/blog/k-means-clustering), a simple and popular __*unsupervised clustering*__ algorithm, to cluster the national anthems of the world into groups.

The objective of K-means is simple: group similar data points together and discover underlying patterns. To do this, K-means looks for a fixed, predefined number (k) of centroids in a dataset. A centroid is the center of a cluster, and a cluster is a collection of data points grouped together because of their similarity to each other. The 'means' in K-means refers to averaging the data, that is, finding the centroid. The algorithm is said to be unsupervised because we have no prior knowledge of the groups or classes in our dataset; the algorithm discovers the underlying groups itself.

The animation below visualizes the algorithm. Each data point is assigned to its closest (green) centroid to form clusters; each centroid then moves to the center of its cluster, and the points are reassigned to the closest centroid. These two steps repeat until the assignments stop changing.

![alt text](Images/kmeans.gif "Title")
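The assignment and update steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the scikit-learn implementation used later in the notebook: the function name `kmeans_sketch`, the random initialization, and the toy points are all hypothetical.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain Lloyd-style k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (an assumption;
    # scikit-learn defaults to the k-means++ initialization instead).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy groups of three points each.
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_centroids, toy_labels = kmeans_sketch(toy_points, k=2)
print(toy_labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label values.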

**Steps:**

__1.__ Explore our collection of national anthems (corpus) <br>
__2.__ Engineer the dataset to get the best performance from the K-means algorithm <br>
__3.__ Run the algorithm many times, each time testing with a different number of clusters <br>
__4.__ Use different metrics to visualize our results and find the best number of clusters (*i.e., why a total of X clusters is better than a total of Y clusters*) <br>
__5.__ Cluster Analysis

**Metrics Utilized for Determining the Best Number of K Clusters:**
- [Elbow Method](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) (based on the `inertia_` attribute of each fitted `KMeans` model)
- [Silhouette Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)
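As a rough illustration of the elbow method, the sketch below fits `KMeans` for several values of k on hypothetical toy data (three tight blobs, not the notebook's corpus) and records each model's `inertia_`, the within-cluster sum of squared distances. The "elbow" is the k after which inertia stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical toy data: three tight blobs of 30 points each.
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
toy_X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers])

inertia_by_k = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_X)
    inertia_by_k[k] = model.inertia_  # within-cluster sum of squared distances

for k, inertia in inertia_by_k.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

On data like this, inertia should fall steeply up to k = 3 and flatten afterwards; on real text data the bend is usually much less pronounced, which is why the silhouette score is used alongside it.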

## Importing Libraries:

In [2]:
# Data Structures
import numpy  as np
import pandas as pd
# import geopandas as gpd
import json

# Corpus Processing
import re
import nltk.corpus
# from unidecode                        import unidecode
from sklearn.feature_extraction.text  import TfidfVectorizer
from sklearn.preprocessing            import normalize

# K-Means
from sklearn import cluster

# Visualization and Analysis
import matplotlib.pyplot  as plt
import matplotlib.cm      as cm
# import seaborn            as sns
from sklearn.metrics                  import silhouette_samples, silhouette_score
# from wordcloud                        import WordCloud

# Map Viz
# import folium
#import branca.colormap as cm
# from branca.element import Figure

In [3]:
# Read the CSV file
data = pd.read_csv('./lyrics_p.csv')

# Filter rows for each decade and sample 100 rows from each
random_samples = []

# Decades to consider
decades = [(1930, 1939), (1940, 1949), (1950, 1959) ,(1960,1969),(1970,1979),(1980,1989),(1990,1999),(2000,2009)]

for decade in decades:
    start_year, end_year = decade
    decade_data = data[(data['Year'] >= start_year) & (data['Year'] <= end_year)]
    sample = decade_data.sample(n=100)  # Sampling 100 rows from each decade
    random_samples.append(sample)
    
print(len(data))
print(len(random_samples[0]))

# data = data.sample(n=100, random_state=42)
random_samples = pd.concat(random_samples)
data = random_samples

print(data.head())

10827
100
                                                  Title             Film  \
2804    ambuuvaa kii daaliidaalii jhuum rahii hai aalii        vidyapati   
5951  he ho dhoye mahobe ghaat  dhobiyaa re dhobiyaa...           pukaar   
1223  more anganaa men aaye aalii main chaal chaluun...        vidyapati   
9234               prem nagar men banaauungii ghar main        chandidas   
8878                 koii priit kii riit bataa do hamen  kaarwaanehayaat   

      Year                             Singer                  Composer  \
2804  1937              kanan devi dhumi khan                 r c boral   
5951  1939    sardar akhtar male voice chorus                 mir sahab   
1223  1937                         kanan devi                 r c boral   
9234  1934                  saigal uma shashi                 r c boral   
8878  1935  saigal pahadi sanyal female voice  mihir kiran bhattacharya   

            Lyricist                                             Lyrics  
2804    

In [4]:
# Remove extra spaces from data
# data['Lyrics'] = data['Lyrics'].str.strip()
# Replace tab spaces with 2 spaces
# data['Lyrics'] = data['Lyrics'].str.replace(r'\t', '  ', regex=True)
# Replace more than two spaces with two spaces
# data['Lyrics'] = data['Lyrics'].str.replace(r'\s{3,}', '  ', regex=True)

# print(data.head())

In [5]:
from indic_transliteration import sanscript
def roman_to_devanagari(text_list):
    devanagari_sentences = []
    for text in text_list:
        # Transliterate Romanized Hindi to Devanagari
        devanagari_text = sanscript.transliterate(text, sanscript.ITRANS, sanscript.DEVANAGARI)
        devanagari_sentences.append(devanagari_text)
    return devanagari_sentences

# Test the function
roman_sentences = data['Lyrics'].to_list()
print(len(roman_sentences))
print(roman_sentences[:5])

devanagari_sentences = roman_to_devanagari(roman_sentences)
print(len(devanagari_sentences))
print(devanagari_sentences[:5])

# Convert the list to a pandas Series
devanagari_series = pd.Series(devanagari_sentences)
print("Data length: ", len(data))
print(len(devanagari_series))
print(devanagari_series.head())

data.reset_index(drop=True, inplace=True)
# Add the series as a new column in the DataFrame
data['Devanagari'] = devanagari_series
print(data.head())

800
['\tkaa\tambuuvaa kii daaliidaalii\tjhuum rahii hai aalii\tjhuum rahii hai aalii\tdhu\tmain pii kar mad kii pyaalii\tyuun chaal chaluun matawaalii\tyuun chaal chaluun matawaalii\tkaa\tambuuvaa kii daaliidaalii\tjhuum rahii hai aalii\tjhuum rahii hai aalii\tmatawaalii\tmatawaalii\tmatawaalii\tdhu\tmatawaalii\tmatawaalii\tmatawaalii\tmain chaal chaluun matawaalii kaa\tmain chaal chaluun matawaalii dhu\tdaraa dar dar dar dar dar da kaa\tdaraa dar dar dar dar dar da\tmore anganaa men aaye aaliiaalii\tmain chaal chaluun matawaalii\tjhuum rahii hai aalii\tambuuvaa kii daalii\tjhuum rahii hai aalii\tambuuvaa kii daalii\tambuuvaa kii daaliidaalii\tjhuum rahii hai aalii\tjhuum rahii hai aalii\tdo\tmatawaalii\tmatawaalii\tmatawaalii\tmatawaalii\tmatawaalii\tmatawaalii\tmain chaal chaluun matawaalii\tmain chaal chaluun matawaalii\tmore anganaa men aaye aaliiaalii\tmain chaal chaluun matawaalii\tjhuum rahii hai aalii\tambuuvaa kii daalii\tjhuum rahii hai aalii\tambuuvaa kii daalii\tambuuvaa ki

In [6]:
# # Apply the function to the 'Lyrics' column
# data['lyrics_devanagari'] = data['Lyrics'].apply(back_transliterate)

# Print the first 5 rows of the DataFrame
# print((data))

from spello.model import SpellCorrectionModel
sp = SpellCorrectionModel(language='hi')
sp.load('hi.pkl')
temp = data['Devanagari'].apply(sp.spell_correct)
data['Devanagari_spell_check'] = temp.apply(lambda x: x['spell_corrected_text'])
print(data['Devanagari'], data['Devanagari_spell_check'])


from spello.model import SpellCorrectionModel 
sp = SpellCorrectionModel(language='en')  
sp.load('/home/ubuntu/model.pkl')
sp.config.min_length_for_spellcorrection = 4 # default is 3
sp.config.max_length_for_spellcorrection = 12 # default is 15
sp.save(model_save_dir='/home/ubuntu/')




0      \tका\tअम्बूवा की दालीदाली\tझूम् रही है आली\tझू...
1      \tको\tहे हो धोये महोबे घात्\tहे हो धोये महोबे ...
2      \tअन्गना मेन् आये आली मैन् चाल् चलून् मतवाली\t...
3       उ\tप्रेम् नगर् मेन् बनाऊन्गी घर् मैन् सजके घर...
4       कोई प्रीत् की रीत् बता दो हमेन् कोई मन् का मी...
                             ...                        
795     इश्क़् की ज़िन्दा निशानी है ये सदियोन् पुरानी क...
796     मेरी नीन्द् चुरा ले मेरा चैन् चुरा ले दिल् मे...
797     पम् पम् पम् पम् प र र र र ऐली रे ऐली क्या है ...
798     ओ ये ये ये कोई मेरी जान् ले ले तदपता हुआ यून्...
799     यार् तेरी बेवफ़ाई का हमको ज़रा सा गम् नहीन्\tयह...
Name: Devanagari, Length: 800, dtype: object 0      का\tअम्बूवा की दालीदाली\tझूम् रही है आली\tझूम्...
1      को\tहे हो धोखे महोबा घात्\tहे हो धोखे महोबा घा...
2      अन्ना मेन आये आली मेन चाल चलने मतवाली\tअन्गना ...
3      उ\tप्रेम् नगर' मेन बनाऊंगा घर) मैन सके घर) सन्...
4      कोई प्रीति की रीति बता दो हमें कोई मन का मीता ...
                             ...           

In [7]:
import stanza

# Download the Hindi models for stanza
stanza.download('hi')

# Initialize the Hindi pipeline without the MWT processor
nlp = stanza.Pipeline(lang='hi', processors='tokenize,pos,lemma')

# Initialize a counter
counter = 0

# Define a function to extract lemmas
def extract_lemmas(text):
    # # Join the tokens into a string
    # text = ' '.join(text)
    global counter
    # Process the text
    doc = nlp(text)
    # Extract lemmas and join them into a string
    lemmas = ' '.join(word.lemma for sent in doc.sentences for word in sent.words)
    # Increment the counter
    counter += 1
    # Print the counter
    print(f"Processed {counter} lines.")
    return lemmas

# Apply the function to the 'Devanagari_spell_check' column
data['lyrics_lemmatized'] = data['Devanagari_spell_check'].apply(extract_lemmas)

  from .autonotebook import tqdm as notebook_tqdm
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 379kB [00:00, 19.1MB/s]                    
2024-04-26 19:42:49 INFO: Downloaded file to /home/rushil/stanza_resources/resources.json
2024-04-26 19:42:49 INFO: Downloading default packages for language: hi (Hindi) ...
2024-04-26 19:42:50 INFO: File exists: /home/rushil/stanza_resources/hi/default.zip
2024-04-26 19:42:51 INFO: Finished downloading models and saved to /home/rushil/stanza_resources
2024-04-26 19:42:51 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 379kB [00:00, 18.7MB/s]                    
2024-04-26 19:42:51 INFO: Downloaded file to /home/rushil/stanza_resources/resources.json

Processed 1 lines.
Processed 2 lines.
Processed 3 lines.

In [8]:
# Set the maximum column width to None to display the full content
pd.set_option('display.max_colwidth', None)

import Levenshtein as lev

# Define a function to calculate the Levenshtein distance
def calculate_change(original, lemmatized):
    # Calculate the Levenshtein distance
    distance = lev.distance(original, lemmatized)
    # Normalize the distance by the length of the original text
    normalized_distance = distance / max(len(original), 1)
    return normalized_distance

# Apply the function to the 'Devanagari_spell_check' and 'lyrics_lemmatized' columns
data['lemmatization_change'] = data.apply(lambda row: calculate_change(row['Devanagari_spell_check'], row['lyrics_lemmatized']), axis=1)

# Print the first 5 rows of the DataFrame
print(data[['Devanagari_spell_check', 'lyrics_lemmatized', 'lemmatization_change']].head())

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        

## Corpus Loading:

We'll use pandas to read the CSV file containing the national anthem for each country and its corresponding country code. The anthems were extracted from Wikipedia and many of them contain words that use non-UTF-8 characters (generally names of places and such), so we'll read the file with the _latin1_ encoding.

Then we'll extract the __Anthem__ column into a list of texts for our corpus.

In [9]:
# data = pd.read_csv('datasets/anthems.csv', encoding='utf-8')
# data.columns = map(str.lower, data.columns)

# continents = ['Europe', 'South_America', 'North_America']
# data = data.loc[data['continent'].isin(continents)]
# data.head(6)

In [10]:
# corpus = data['anthem'].tolist()
# corpus[18][0:447]

## Corpus Processing

### 1. Stop Words and Stemming
We will run a data engineering routine on our anthems dataset so that we can later build a good statistical model. To do so, we'll remove all words that don't contribute to the semantic meaning of the text (including tokens with characters outside the English alphabet) and keep the remaining words in the simplest form possible, so that we can apply a function that weights each word without introducing bias or outliers. There are many techniques for cleaning up a corpus; here we will remove the most common words ([stop words](https://www.geeksforgeeks.org/removing-stop-words-nltk-python/)) and apply [stemming](https://www.researchgate.net/figure/Stemming-process-Algorithms-of-stemming-methods-are-divided-into-three-parts-mixed_fig2_324685008), a technique that reduces a word to its root.

The methods that apply stemming and stop word removal are listed below. We will also define a method that flags any word with 2 or fewer letters, or 21 or more letters, so that it can be removed to clean our corpus even more.
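The filtering idea can be shown on a toy sentence. This is a simplified sketch, not the notebook's helper functions: the stop word list and the name `clean_tokens` are hypothetical, stemming is left out, and the length bounds mirror the rule of dropping tokens with 2 or fewer or 21 or more characters.

```python
toy_stop_words = {"the", "of", "and", "is"}  # hypothetical stop word list

def clean_tokens(tokens, stop_words, min_len=3, max_len=20):
    """Keep tokens that are not stop words and whose length is within [min_len, max_len]."""
    return [t for t in tokens
            if t not in stop_words and min_len <= len(t) <= max_len]

tokens = "the land of the free and the home of the brave is us".split()
print(clean_tokens(tokens, toy_stop_words))
# → ['land', 'free', 'home', 'brave']
```

Note that `us` disappears because of its length, not because it is in the stop word list.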

In [11]:
# # removes a list of words (ie. stopwords) from a tokenized list.
# def removeWords(listOfTokens, listOfWords):
#     return [token for token in listOfTokens if token not in listOfWords]

# # applies stemming to a list of tokenized words
# def applyStemming(listOfTokens, stemmer):
#     return [stemmer.stem(token) for token in listOfTokens]

# # collects the tokens with 2 or fewer, or 21 or more, characters (removed later via removeWords)
# def twoLetters(listOfTokens):
#     twoLetterWord = []
#     for token in listOfTokens:
#         if len(token) <= 2 or len(token) >= 21:
#             twoLetterWord.append(token)
#     return twoLetterWord

### 2. The main corpus processing function.

A section back, while exploring our dataset, we noticed some words containing odd characters that should be removed. Using RegEx, our main processing function removes unknown symbols, special characters, numbers, e-mails, URLs, etc. (it's a bit of overkill, I know). It also uses the auxiliary functions defined above.
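The kind of RegEx clean-up described here can be sketched with the standard library alone. The function name `clean_text` and the sample sentence are hypothetical, and the order of the substitutions is a design choice of this sketch: URLs, e-mails, and digit-bearing tokens are removed before general punctuation so that their structure is still recognizable when they are matched.

```python
import re

def clean_text(text):
    """Lowercase the text and strip URLs, e-mail-like tokens, digit-bearing tokens, and punctuation."""
    text = text.casefold()
    text = re.sub(r"http\S+|www\S+", " ", text)   # URLs
    text = re.sub(r"\S*@\S*", " ", text)          # e-mails and @mentions
    text = re.sub(r"\S*\d\S*", " ", text)         # tokens containing digits, e.g. BR-381
    text = re.sub(r"[\W_]+", " ", text)           # remaining non-word characters and extra spaces
    return text.strip()

cleaned = clean_text("Visit http://example.com or mail me@example.com, road BR-381 is CLOSED!")
print(cleaned)
# → visit or mail road is closed
```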

In [12]:
# def processCorpus(corpus, language):   
#     stopwords = nltk.corpus.stopwords.words(language)
#     param_stemmer = SnowballStemmer(language)
#     countries_list = [line.rstrip('\n') for line in open('lists/countries.txt')] # Load .txt file line by line
#     nationalities_list = [line.rstrip('\n') for line in open('lists/nationalities.txt')] # Load .txt file line by line
#     other_words = [line.rstrip('\n') for line in open('lists/stopwords_scrapmaker.txt')] # Load .txt file line by line
    
#     for document in corpus:
#         index = corpus.index(document)
#         corpus[index] = corpus[index].replace(u'\ufffd', '8')   # Replaces the Unicode replacement character (U+FFFD) with '8'
#         corpus[index] = corpus[index].replace(',', '')          # Removes commas
#         corpus[index] = corpus[index].rstrip('\n')              # Removes line breaks
#         corpus[index] = corpus[index].casefold()                # Makes all letters lowercase
        
#         corpus[index] = re.sub('\W_',' ', corpus[index])        # removes specials characters and leaves only words
#         corpus[index] = re.sub("\S*\d\S*"," ", corpus[index])   # removes numbers and words concatenated with numbers IE h4ck3r. Removes road names such as BR-381.
#         corpus[index] = re.sub("\S*@\S*\s?"," ", corpus[index]) # removes emails and mentions (words with @)
#         corpus[index] = re.sub(r'http\S+', '', corpus[index])   # removes URLs with http
#         corpus[index] = re.sub(r'www\S+', '', corpus[index])    # removes URLs with www

#         listOfTokens = word_tokenize(corpus[index])
#         twoLetterWord = twoLetters(listOfTokens)

#         listOfTokens = removeWords(listOfTokens, stopwords)
#         listOfTokens = removeWords(listOfTokens, twoLetterWord)
#         listOfTokens = removeWords(listOfTokens, countries_list)
#         listOfTokens = removeWords(listOfTokens, nationalities_list)
#         listOfTokens = removeWords(listOfTokens, other_words)
        
#         listOfTokens = applyStemming(listOfTokens, param_stemmer)
#         listOfTokens = removeWords(listOfTokens, other_words)

#         corpus[index]   = " ".join(listOfTokens)
#         corpus[index] = unidecode(corpus[index])

#     return corpus

In [13]:
# language = 'english'
# corpus = processCorpus(corpus, language)
# corpus[18][0:460]

### Statistical Weighting of Words

Now we will apply the [TF-IDF](https://jmotif.github.io/sax-vsm_site/morea/algorithm/TFIDF.html) function, short for term frequency-inverse document frequency, a numerical statistic intended to reflect how important a word is to a document in a corpus. With the default settings of scikit-learn's `TfidfVectorizer`, each document vector is L2-normalized, so every word in a document receives a score between 0 and 1.

In [14]:
# vectorizer = TfidfVectorizer()
# X = vectorizer.fit_transform(corpus)
# tf_idf = pd.DataFrame(data = X.toarray(), columns=vectorizer.get_feature_names_out())

# final_df = tf_idf

# print("{} rows".format(final_df.shape[0]))
# final_df.T.nlargest(5, 0)

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Perform TF-IDF on the 'lyrics_lemmatized' column
X = vectorizer.fit_transform(data['lyrics_lemmatized'])

# Check if the vectorizer was fitted correctly
if not hasattr(vectorizer, 'vocabulary_'):
    print("The vectorizer was not fitted correctly.")
else:
    # Convert the result to a DataFrame
    tf_idf = pd.DataFrame(data = X.toarray(), columns=vectorizer.get_feature_names_out())

    # Assign the DataFrame to final_df
    final_df = tf_idf

    # Print the number of rows in the DataFrame
    print("{} rows".format(final_df.shape[0]))

    # Print the five terms with the highest TF-IDF weight in document 0
    print(final_df.T.nlargest(5, 0))

800 rows
          0         1         2    3    4    5         6         7    \
मतव  0.682286  0.000000  0.264788  0.0  0.0  0.0  0.000000  0.000000   
आल   0.605069  0.000000  0.313094  0.0  0.0  0.0  0.000000  0.000000   
अम   0.282806  0.000000  0.000000  0.0  0.0  0.0  0.000000  0.000000   
दर   0.217668  0.078083  0.000000  0.0  0.0  0.0  0.116900  0.138207   
रह   0.147901  0.000000  0.139149  0.0  0.0  0.0  0.086652  0.000000   

          8    9    ...       790       791  792       793       794  795  \
मतव  0.000000  0.0  ...  0.000000  0.000000  0.0  0.000000  0.000000  0.0   
आल   0.000000  0.0  ...  0.000000  0.000000  0.0  0.000000  0.000000  0.0   
अम   0.000000  0.0  ...  0.000000  0.000000  0.0  0.000000  0.000000  0.0   
दर   0.418796  0.0  ...  0.000000  0.000000  0.0  0.000000  0.393158  0.0   
रह   0.000000  0.0  ...  0.112771  0.103315  0.0  0.321609  0.058286  0.0   

     796     797  798       799  
मतव  0.0  0.0000  0.0  0.000000  
आल   0.0  0.0000  0.0  0.00

In [16]:
# first 5 words with highest weight on document 0:
final_df.T.nlargest(5, 0)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,790,791,792,793,794,795,796,797,798,799
मतव,0.682286,0.0,0.264788,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
आल,0.605069,0.0,0.313094,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
अम,0.282806,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
दर,0.217668,0.078083,0.0,0.0,0.0,0.0,0.1169,0.138207,0.418796,0.0,...,0.0,0.0,0.0,0.0,0.393158,0.0,0.0,0.0268,0.0,0.124435
रह,0.147901,0.0,0.139149,0.0,0.0,0.0,0.086652,0.0,0.0,0.0,...,0.112771,0.103315,0.0,0.321609,0.058286,0.0,0.0,0.0,0.0,0.0
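The short terms in the table above (for example `मतव` and `आल`) look like fragments of longer words. One possible explanation, offered here as an assumption rather than something the notebook confirms, is the vectorizer's default `token_pattern`, `(?u)\b\w\w+\b`: in Python's `re` module, `\w` does not match Devanagari vowel signs or the virama, so a word is cut at its first such mark. The sketch below reproduces the effect with `re` alone and compares it with a whitespace-based pattern; the sample word is chosen for illustration.

```python
import re

default_pattern = r"(?u)\b\w\w+\b"   # TfidfVectorizer's default token_pattern
whitespace_pattern = r"\S+"          # alternative: split on whitespace only

word = "मतवाली"
default_tokens = re.findall(default_pattern, word)
whitespace_tokens = re.findall(whitespace_pattern, word)
print(default_tokens)      # the word is cut at the first vowel sign
print(whitespace_tokens)   # the word is kept whole
```

If this is indeed the cause, passing `token_pattern=r"\S+"` (or a custom `tokenizer`) to `TfidfVectorizer` would keep whole words as features, at the cost of changing the vocabulary and every score shown above.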


## K-Means

##### Function that runs the K-Means algorithm once for each k from 2 to *max_k* and returns a dictionary of the fitted models, keyed by k

In [17]:
def run_KMeans(max_k, data):
    max_k += 1  # make the range below inclusive of max_k
    kmeans_results = dict()
    for k in range(2 , max_k):
        # n_jobs was removed from KMeans in scikit-learn 1.0, and
        # algorithm='full' was renamed to 'lloyd' in scikit-learn 1.1
        kmeans = cluster.KMeans(n_clusters = k
                               , init = 'k-means++'
                               , n_init = 10
                               , tol = 0.0001
                               , random_state = 1
                               , algorithm = 'lloyd')

        kmeans_results.update( {k : kmeans.fit(data)} )
        
    return kmeans_results

#### Silhouette Score

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
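For a single point, the silhouette value is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The sketch below computes this by hand on hypothetical toy data and compares it with `silhouette_samples`; the function name `silhouette_by_hand` is illustrative only.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def silhouette_by_hand(X, labels):
    """Per-sample silhouette values using Euclidean distances."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a singleton cluster gets a score of 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Hypothetical toy data: two compact, well-separated clusters.
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

hand_scores = silhouette_by_hand(toy_X, toy_labels)
print(hand_scores.round(3))
print(hand_scores.mean().round(3))  # the average is what silhouette_score reports
```

Values close to 1 mean a point sits well inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may belong to a different cluster.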

In [18]:
def printAvg(avg_dict):
    for avg in sorted(avg_dict.keys(), reverse=True):
        print("Avg: {}\tK:{}".format(avg.round(4), avg_dict[avg]))
        
def plotSilhouette(df, n_clusters, kmeans_labels, silhouette_avg):
    fig, ax1 = plt.subplots(1)
    fig.set_size_inches(8, 6)
    ax1.set_xlim([-0.2, 1])
    ax1.set_ylim([0, len(df) + (n_clusters + 1) * 10])
    
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--") # The vertical line for average silhouette score of all the values
    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.2, 0, 0.2, 0.4, 0.6, 0.8, 1])
    plt.title(("Silhouette analysis for K = %d" % n_clusters), fontsize=10, fontweight='bold')
    
    y_lower = 10
    sample_silhouette_values = silhouette_samples(df, kmeans_labels) # Compute the silhouette scores for each sample
    for i in range(n_clusters):
        ith_cluster_silhouette_values = sample_silhouette_values[kmeans_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values, facecolor=color, edgecolor=color, alpha=0.7)

        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i)) # Label the silhouette plots with their cluster numbers at the middle
        y_lower = y_upper + 10  # Compute the new y_lower for next plot. 10 for the 0 samples
    plt.show()
    
        
def silhouette(kmeans_dict, df, plot=False):
    df = df.to_numpy()
    avg_dict = dict()
    for n_clusters, kmeans in kmeans_dict.items():      
        kmeans_labels = kmeans.predict(df)
        silhouette_avg = silhouette_score(df, kmeans_labels) # Average Score for all Samples
        avg_dict.update( {silhouette_avg : n_clusters} )
    
        if(plot): plotSilhouette(df, n_clusters, kmeans_labels, silhouette_avg)

    return avg_dict  # maps each average silhouette score to its k, ready for printAvg

In [19]:
# # Running Kmeans
# k = 8
# kmeans_results = run_KMeans(k, final_df)

# # Plotting Silhouette Analysis
# #silhouette(kmeans_results, final_df, plot=True)

In [20]:
def run_KMeans(max_k, data):
    kmeans_results = dict()
    for k in range(2 , max_k):  # k runs from 2 to max_k - 1
        kmeans = cluster.KMeans(n_clusters = k
                               , init = 'k-means++'
                               , n_init = 10
                               , max_iter = 300
                               , random_state = 42)
        kmeans_results[k] = kmeans.fit(data)
    return kmeans_results

max_k = 10
kmeans_results = run_KMeans(max_k, final_df)

## Cluster Analysis

Now we can choose the best value of K and take a deeper look at each cluster. The plots above suggest that the clusters are best defined when K = 5. So first we will use a simple bar chart to look at the most dominant words in each cluster:
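The idea behind "most dominant words" is to average the TF-IDF rows of each cluster and rank the terms by that mean. A minimal sketch on a hypothetical 4x4 matrix follows; the vocabulary, weights, labels, and the name `top_terms` are made up for illustration.

```python
import numpy as np

toy_terms = np.array(["dil", "pyaar", "raat", "chaand"])   # hypothetical vocabulary
toy_tfidf = np.array([[0.9, 0.4, 0.0, 0.0],
                      [0.8, 0.6, 0.0, 0.0],
                      [0.0, 0.1, 0.7, 0.7],
                      [0.0, 0.0, 0.6, 0.8]])
toy_labels = np.array([0, 0, 1, 1])                        # hypothetical cluster labels

def top_terms(tfidf, labels, terms, n_terms=2):
    """Return {cluster: [(term, mean weight), ...]} ordered by descending mean weight."""
    result = {}
    for label in np.unique(labels):
        cluster_mean = tfidf[labels == label].mean(axis=0)   # average weight per term
        order = np.argsort(cluster_mean)[::-1][:n_terms]     # indices of the largest means
        result[int(label)] = [(str(terms[i]), float(cluster_mean[i])) for i in order]
    return result

top = top_terms(toy_tfidf, toy_labels, toy_terms)
for label, pairs in top.items():
    print(label, pairs)
```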

In [21]:
import seaborn as sns  # needed by plotWords; the import at the top of the notebook is commented out

def get_top_features_cluster(tf_idf_array, prediction, n_feats):
    labels = np.unique(prediction)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    features = vectorizer.get_feature_names_out()
    dfs = []
    for label in labels:
        id_temp = np.where(prediction==label) # indices for each cluster
        x_means = np.mean(tf_idf_array[id_temp], axis = 0) # returns average score across cluster
        sorted_means = np.argsort(x_means)[::-1][:n_feats] # indices with the top n_feats scores
        best_features = [(features[i], x_means[i]) for i in sorted_means]
        df = pd.DataFrame(best_features, columns = ['features', 'score'])
        dfs.append(df)
    return dfs

def plotWords(dfs, n_feats):
    for i in range(0, len(dfs)):
        plt.figure(figsize=(8, 4))  # one figure per cluster
        plt.title(("Most Common Words in Cluster {}".format(i)), fontsize=10, fontweight='bold')
        sns.barplot(x = 'score' , y = 'features', orient = 'h' , data = dfs[i][:n_feats])
        plt.show()

In [22]:
# best_result = 5
# kmeans = kmeans_results.get(best_result)

# final_df_array = final_df.to_numpy()
# prediction = kmeans.predict(final_df)
# n_feats = 20
# dfs = get_top_features_cluster(final_df_array, prediction, n_feats)
# plotWords(dfs, 13)

#### Map of Words

Now that the graphs above show the best-scoring words in each cluster, we can also make the result prettier by drawing a word cloud for each cluster!

In [24]:
# Transforms a centroids dataframe into a dictionary to be used on a WordCloud.
def centroidsDict(centroids, index):
    a = centroids.T[index].sort_values(ascending = False).reset_index().values
    centroid_dict = dict()

    for i in range(0, len(a)):
        centroid_dict.update( {a[i,0] : a[i,1]} )

    return centroid_dict

def generateWordClouds(centroids):
    wordcloud = WordCloud(max_font_size=100, background_color = 'white')
    for i in range(0, len(centroids)):
        centroid_dict = centroidsDict(centroids, i)        
        wordcloud.generate_from_frequencies(centroid_dict)

        plt.figure()
        plt.title('Cluster {}'.format(i))
        plt.imshow(wordcloud)
        plt.axis("off")
        plt.show()

In [25]:
# centroids = pd.DataFrame(kmeans.cluster_centers_)
# centroids.columns = final_df.columns
# generateWordClouds(centroids)

In [26]:
# generateWordClouds needs WordCloud; the import at the top of the notebook is commented out
from wordcloud import WordCloud

for k, kmeans in kmeans_results.items():
    print(f"Results for k={k}:")
    
    # Get the centroids
    centroids = pd.DataFrame(kmeans.cluster_centers_)
    centroids.columns = final_df.columns
    
    # Generate word clouds
    generateWordClouds(centroids)
    
    # Add any other analysis you want to perform for each k

### Preparing our final groups for visualization

Now that we're satisfied with our clustering we should assign which country belongs to which group.

In [None]:
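# Assigning the cluster labels to each row.
# Note: `kmeans` is whichever model was bound last (the final k of the loop above);
# pick a specific model with kmeans_results[k] if a different k is wanted.
labels = kmeans.labels_ 
data['label'] = labels
data.head()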

### Visualizing the Clustered Countries on a Map

Now that we have our final grouping it would be really cool to visualize it in an interactive map. To do this we'll use the awesome Folium library!

We'll load a geojson file of polygons and country codes with geopandas and merge it with the labelled dataframe from the cell above.

In [None]:
# Map Viz
import json
import geopandas as gpd

# Loading countries polygons
geo_path = 'datasets/world-countries.json'
with open(geo_path) as geo_file:  # close the file once it has been read
    country_geo = json.load(geo_file)
gpf = gpd.read_file(geo_path)

# Merging on the alpha-3 country codes
merge = pd.merge(gpf, data, left_on='id', right_on='alpha-3')
data_to_plot = merge[["id", "name", "label", "geometry"]]

data_to_plot.head(3)

Now we'll create a color_step for each group

In [None]:
# Aliased as bcm so it does not shadow matplotlib.cm, which was imported above as cm
import branca.colormap as bcm

# Creating a discrete color map
values = data_to_plot[['label']].to_numpy()
color_step = bcm.StepColormap(['r', 'y','g','b', 'm'], vmin=values.min(), vmax=values.max(), caption='step')

color_step

### Painting the Groups into a Choropleth Map

Now that we have all the information that we want to plot into a Dataframe, we'll create a function that makes a Choropleth Map to be displayed on a folium map.

In [None]:
import folium
from branca.element import Figure

def make_geojson_choropleth(display, data, colors):
    '''creates geojson choropleth map using a colormap, with tooltip for country names and groups'''
    group_dict = data.set_index('id')['label'] # Dictionary of Countries IDs and Clusters
    tooltip = folium.features.GeoJsonTooltip(["name", "label"], aliases=display, labels=True)
    return folium.GeoJson(data[["id", "name","label","geometry"]],
                          style_function = lambda feature: {
                               'fillColor': colors(group_dict[feature['properties']['id']]),
                               #'fillColor': test(feature),
                               'color':'black',
                               'weight':0.5
                               },
                          highlight_function = lambda x: {'weight':2, 'color':'black'},
                          smooth_factor=2.0,
                          tooltip = tooltip)

# Makes map appear inline on notebook
def display(m, width, height):
    """Takes a folium instance and embed HTML."""
    fig = Figure(width=width, height=height)
    fig.add_child(m)
    #return fig

In [None]:
# Initializing our Folium Map
m = folium.Map(location=[43.5775, -10.106111], zoom_start=2.3, tiles='cartodbpositron')

# Making a choropleth map with geojson
geojson_choropleth = make_geojson_choropleth(["Country:", "Group:"], data_to_plot, color_step)
geojson_choropleth.add_to(m)

width, height = 1300, 675
display(m, width, height)
m