# **Exploring National Anthems: A Data Journey**

## **Introduction**

Welcome to the captivating world of national anthems! 🌍🎶 In this data science project, we embark on a musical voyage across continents, exploring the lyrical expressions that resonate with patriotism, history, and culture. Our dataset contains the stirring verses of anthems from diverse nations, each encapsulating the spirit of its people.

### **Project Goals**

Our mission centers on one goal:

1. **Uncover Anthem Themes**: We'll dissect the anthems, unraveling their hidden themes and sentiments. Are there common threads that bind anthems together? Do certain regions favor love, war, or freedom in their lyrical odes?


### **Tools at Our Disposal**

We'll wield three powerful tools:

1. **K-Means Clustering**: Like musical harmonies, K-Means will group anthems into clusters based on their lyrical content. Are there clusters of anthems celebrating unity, resilience, or nature?

2. **Power BI**: Our canvas for visual storytelling! Power BI dashboards will breathe life into our data, allowing us to explore trends, sentiments, and geographic patterns.

3. **Natural Language Processing (NLP)**: NLP will decode the poetic language. We'll analyze sentiments, spot historical references, and identify notable entities—whether they're legendary heroes or cherished landscapes.

## Final Briefings

Final briefings start at 2:30pm on Friday. Everyone is expected to bring their insights to the table. You may brief from a notebook, PowerPoint, or Power BI.

- 10 minutes MAX per briefing
- Final output should be data-driven insights (think actionable!)
- If Python and/or Machine Learning is not your jam, analyze the data YOUR way, just get me insights!

### **Let the Anthem Symphony Begin! 🎵**

Gather your curiosity, tune your analytical instruments, and let's dive into the rich tapestry of national anthems. From the Himalayan peaks to the African savannas, every stanza carries a tale waiting to be told.

---

Remember, data science is not just about numbers; it's about weaving narratives from raw data. So, let's harmonize data and creativity, and celebrate the anthems that echo through time! 🌟🎤

In [1]:
# Data Manipulation
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Text analytics/NLP toolkits
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import PorterStemmer

# Unsupervised ML Models
from sklearn.cluster import KMeans


### Read CSV

In [2]:
path = "national_anthems.csv"
df = pd.read_csv(path)
print(df.shape)
df.head()

(190, 5)


Unnamed: 0,Country,Alpha-2,Alpha-3,Continent,Anthem
0,Albania,AL,ALB,Europe,"Around our flag we stand united, With one wish..."
1,Armenia,AM,ARM,Europe,"Our Fatherland, free, independent, That has fo..."
2,Austria,AT,AUT,Europe,"Land of mountains, land by the river, Land of ..."
3,Azerbaijan,AZ,AZE,Europe,"Azerbaijan, Azerbaijan! The glorious Fatherlan..."
4,Belarus,BY,BLR,Europe,"We, Belarusians, are peaceful people, Wholehea..."


### Profile and Clean Data

In [3]:
# Possible null values
df.info(memory_usage=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 190 entries, 0 to 189
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Country    190 non-null    object
 1   Alpha-2    189 non-null    object
 2   Alpha-3    190 non-null    object
 3   Continent  190 non-null    object
 4   Anthem     190 non-null    object
dtypes: object(5)
memory usage: 7.6+ KB


In [4]:
# there are missing and duplicated data
df.describe().T

Unnamed: 0,count,unique,top,freq
Country,190,190,Albania,1
Alpha-2,189,188,CH,2
Alpha-3,190,190,ALB,1
Continent,190,6,Africa,56
Anthem,190,188,"Arise, ye who refuse to be slaves; With our ve...",2


In [5]:
# Looking for missing data
df[df['Alpha-2'].isna()]

Unnamed: 0,Country,Alpha-2,Alpha-3,Continent,Anthem
168,Namibia,,NAM,Africa,Namibia land of the brave Freedom fight we he ...


In [6]:
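# Replace the missing value with "NA" (Namibia's code) in the "Alpha-2" column.
# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df.
df['Alpha-2'] = df['Alpha-2'].fillna('NA')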
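A likely cause of the missing value is the loader rather than the file: `pd.read_csv` treats the literal string `NA` as missing by default, and `NA` is Namibia's ISO alpha-2 code. The sketch below uses a hypothetical two-row CSV, not the real `national_anthems.csv`, to show one way to keep the code as text at load time. Whether the real file actually stores `NA` for Namibia is an assumption.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for national_anthems.csv.
csv_text = "Country,Alpha-2,Alpha-3\nNamibia,NA,NAM\nAlbania,AL,ALB\n"

# Default parsing: the string "NA" is on pandas' default missing-value list.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df["Alpha-2"].isna().sum())

# Turn the default list off and treat only empty fields as missing,
# so "NA" survives as an ordinary string.
fixed_df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)
print(fixed_df.loc[0, "Alpha-2"])
```

Loading the data this way would make the later `fillna` step unnecessary, at the cost of having to list explicitly any other markers that should count as missing.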


In [7]:
# check df again for missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 190 entries, 0 to 189
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Country    190 non-null    object
 1   Alpha-2    190 non-null    object
 2   Alpha-3    190 non-null    object
 3   Continent  190 non-null    object
 4   Anthem     190 non-null    object
dtypes: object(5)
memory usage: 7.6+ KB


In [8]:
# looking for the duplicated data
duplicate_mask = df.duplicated(subset=['Alpha-2'], keep=False)
duplicate_rows = df[duplicate_mask]
print("Duplicate rows based on 'Alpha-2':")
duplicate_rows


Duplicate rows based on 'Alpha-2':


Unnamed: 0,Country,Alpha-2,Alpha-3,Continent,Anthem
41,Switzerland,CH,CHE,Europe,"When the morning skies grow red, and over us t..."
47,Chile,CH,CHL,South_America,"Beloved Homeland, receive the vows That Chile ..."


In [9]:
# Change the Alpha-2 value for Chile to "CL". (The correct 2 letter country code)
df.at[47, 'Alpha-2'] = "CL"


In [10]:
# Checking the value in row index 47 of the "Alpha-2" column
value_at_index_47 = df.at[47, 'Alpha-2']

# Printing the value
print("Value at index 47 in the 'Alpha-2' column:", value_at_index_47)


Value at index 47 in the 'Alpha-2' column: CL


In [11]:
# looks like there are no missing values now
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 190 entries, 0 to 189
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Country    190 non-null    object
 1   Alpha-2    190 non-null    object
 2   Alpha-3    190 non-null    object
 3   Continent  190 non-null    object
 4   Anthem     190 non-null    object
dtypes: object(5)
memory usage: 7.6+ KB


In [12]:
# Looks like there are no more duplicated values except for some Anthems
df.describe().T

Unnamed: 0,count,unique,top,freq
Country,190,190,Albania,1
Alpha-2,190,190,AL,1
Alpha-3,190,190,ALB,1
Continent,190,6,Africa,56
Anthem,190,188,"Arise, ye who refuse to be slaves; With our ve...",2
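The null and duplicate checks above were done by eye, one column at a time. As a rough sketch, they could be folded into a small helper; `check_key_columns` and the toy frame are hypothetical names and data made up for illustration, not part of the original notebook.

```python
import pandas as pd


def check_key_columns(frame, columns):
    """Report null counts and duplicated values for columns that should be unique."""
    report = {}
    for col in columns:
        nulls = int(frame[col].isna().sum())
        dup_mask = frame[col].duplicated(keep=False)
        dupes = sorted(frame.loc[dup_mask, col].dropna().unique())
        if nulls or dupes:
            report[col] = {"nulls": nulls, "duplicates": dupes}
    return report


# Toy frame reproducing the two problems found in the real data.
toy = pd.DataFrame({
    "Country": ["Switzerland", "Chile", "Namibia"],
    "Alpha-2": ["CH", "CH", None],
    "Alpha-3": ["CHE", "CHL", "NAM"],
})
key_columns = ["Country", "Alpha-2", "Alpha-3"]
problems = check_key_columns(toy, key_columns)
print(problems)

# Apply the same fixes as in the notebook, then re-check.
toy.loc[1, "Alpha-2"] = "CL"
toy.loc[2, "Alpha-2"] = "NA"
after = check_key_columns(toy, key_columns)
print(after)
```

The `Anthem` column is left out on purpose, since shared anthems are legitimate duplicates.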


In [13]:
# Some Duplicate anthems are not a mistake. For example, Cyprus and Greece share the same anthem.
# will leave these alone for now.

duplicate_mask = df.duplicated(subset=['Anthem'], keep=False)
duplicate_rows = df[duplicate_mask]
print("Duplicate rows based on 'Anthem':")
duplicate_rows

Duplicate rows based on 'Anthem':


Unnamed: 0,Country,Alpha-2,Alpha-3,Continent,Anthem
9,Cyprus,CY,CYP,Europe,"We knew thee of old, O, divinely restored, By ..."
17,Greece,GR,GRC,Europe,"We knew thee of old, O, divinely restored, By ..."
96,China,CN,CHN,Asia,"Arise, ye who refuse to be slaves; With our ve..."
109,Macau,MO,MAC,Asia,"Arise, ye who refuse to be slaves; With our ve..."


### Exploring Anthem feature

In [14]:
# add new column text_length (anthem length in characters)
df['text_length'] = df['Anthem'].apply(len)
df.head(3)

Unnamed: 0,Country,Alpha-2,Alpha-3,Continent,Anthem,text_length
0,Albania,AL,ALB,Europe,"Around our flag we stand united, With one wish...",794
1,Armenia,AM,ARM,Europe,"Our Fatherland, free, independent, That has fo...",468
2,Austria,AT,AUT,Europe,"Land of mountains, land by the river, Land of ...",613


In [15]:
# overall average length of anthem
mean_text_length = df['text_length'].mean()
print("Mean text length:", mean_text_length)

Mean text length: 748.5105263157894


In [16]:
# average anthem length grouped by continent
mean_text_length_by_continent = df.groupby('Continent')['text_length'].mean()
print("Mean text length by continent:")
print(mean_text_length_by_continent)


Mean text length by continent:
Continent
Africa            648.482143
Asia              630.909091
Europe            688.318182
North_America     899.083333
Oceania           499.500000
South_America    1773.583333
Name: text_length, dtype: float64
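The South American average is more than double any other continent's, but a mean over a small group can be pulled up by one or two very long texts. A minimal sketch of how to check that is to report the group size and median alongside the mean. The numbers below are invented toy values, not the anthem data.

```python
import pandas as pd

# Invented toy values: group "B" contains a single very long text.
toy = pd.DataFrame({
    "Continent": ["A", "A", "A", "B", "B", "B"],
    "text_length": [500, 600, 700, 400, 500, 3000],
})

# Count, mean, median and max side by side make outlier-driven means visible.
summary = toy.groupby("Continent")["text_length"].agg(
    ["count", "mean", "median", "max"]
)
print(summary)
```

On the real frame the same call would be `df.groupby('Continent')['text_length'].agg([...])`. A large gap between mean and median would suggest a few long anthems rather than a continent-wide pattern.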


## Starting NLP

In [17]:
# create a series for the Anthem column
X = df['Anthem']
X

0      Around our flag we stand united, With one wish...
1      Our Fatherland, free, independent, That has fo...
2      Land of mountains, land by the river, Land of ...
3      Azerbaijan, Azerbaijan! The glorious Fatherlan...
4      We, Belarusians, are peaceful people, Wholehea...
                             ...                        
185    O defenders of the Homeland! Rally around to t...
186    Oh Uganda! May God uphold Thee, We lay our fut...
187    O sons of the Sahara! In the battlefield, you ...
188    Stand and sing of Zambia, proud and free, Land...
189    Oh lift high the banner, the flag of Zimbabwe ...
Name: Anthem, Length: 190, dtype: object

### Apply CountVectorizer

In [18]:
# Apply count vectorizer with unigrams and bigrams, English stop words removed
vect = CountVectorizer(ngram_range=(1, 2), stop_words='english')
X_dtm = vect.fit_transform(X)
X_dtm.shape

(190, 14317)

In [19]:
# Last 50 features
print(vect.get_feature_names_out()[-50:])

['young free' 'young heroes' 'young illustrious' 'young men' 'young old'
 'young stand' 'young tree' 'youth' 'youth sense' 'youth tire'
 'youth truth' 'youth tunisia' 'youthful' 'youthful men' 'zambezi'
 'zambezi limpopo' 'zambia' 'zambia free' 'zambia praise' 'zambia proud'
 'zambia sky' 'zambia zambia' 'zeal' 'zeal loyalty' 'zeal make'
 'zeal tires' 'zealand' 'zealand let' 'zealand men' 'zealand mountains'
 'zealand peace' 'zealous' 'zealous adore' 'zenith' 'zenith faith'
 'zenith skies' 'zimbabwe' 'zimbabwe symbol' 'zimbabwe wondrously' 'zion'
 'zion hope' 'zion jerusalem' 'ºciuszko' 'ºciuszko god' 'ãƒâ' 'ãƒâ rpãƒâ'
 'œending' 'œending love' 'šawice' 'šawice koã']


### Apply TF-IDF

In [20]:
# Initialize TF-IDF transformer
tfidf_transformer = TfidfTransformer()

# Fit and transform the document-term matrix to TF-IDF representation
X_tfidf = tfidf_transformer.fit_transform(X_dtm)

# Display the shape of the resulting TF-IDF matrix
print(X_tfidf.shape)


(190, 14317)
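The two-step route used here (`CountVectorizer` followed by `TfidfTransformer`) should give the same matrix as the already imported `TfidfVectorizer` with matching settings. With default options scikit-learn uses a smoothed idf, ln((1 + n) / (1 + df)) + 1, and scales each row to unit L2 length. The sketch below checks both points on made-up sentences.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up sentences, not from the anthem dataset.
docs = [
    "Land of hope and glory",
    "Glory to the land",
    "Hope for the people",
]
settings = {"ngram_range": (1, 2), "stop_words": "english"}

# Two-step route, as in the notebook.
counts = CountVectorizer(**settings).fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route with the same settings.
one_step = TfidfVectorizer(**settings).fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
row_norms = np.linalg.norm(two_step.toarray(), axis=1)
print(same, row_norms)
```

Because rows are length-normalized, a long anthem does not outweigh a short one simply by containing more words.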


## Starting Unsupervised ML

### K-means clustering

In [21]:
# Initialize KMeans with the desired number of clusters
num_clusters = 5  
kmeans = KMeans(n_clusters=num_clusters, n_init=10)

# Fit KMeans to the TF-IDF matrix
kmeans.fit(X_tfidf)

# Get the cluster labels
cluster_labels = kmeans.labels_

# Print the cluster labels
print("Cluster labels:", cluster_labels)


Cluster labels: [3 3 2 3 2 1 2 2 2 0 2 1 3 2 0 4 1 0 4 0 1 1 1 4 2 4 2 3 2 1 0 4 1 3 4 2 0
 1 4 4 3 2 2 4 1 1 4 1 2 3 2 3 4 2 0 3 2 2 3 2 4 1 4 4 1 0 4 1 0 2 0 1 0 4
 4 0 2 2 4 1 2 2 2 3 2 2 3 1 1 0 2 4 2 4 4 1 0 0 2 1 4 2 2 4 4 4 1 1 4 0 4
 2 4 1 2 4 4 3 3 0 4 4 1 2 2 3 4 3 3 4 4 4 1 1 3 1 4 2 3 2 3 1 4 1 1 1 3 4
 1 0 1 1 1 3 1 4 0 1 2 4 3 3 2 1 4 0 4 3 0 1 1 1 4 3 3 1 4 2 2 1 1 4 2 1 1
 4 2 0 2 2]
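Two caveats apply to the cell above. K-Means starts from random centroids, so without a fixed `random_state` the labels can change from run to run. The value `num_clusters = 5` was also picked by hand. A minimal sketch of one way to address both is shown below, using synthetic 2-D points from `make_blobs` as a stand-in for `X_tfidf` and the silhouette score as the selection criterion. Both choices are assumptions for illustration, not what the notebook did.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_tfidf: three well-separated groups of points.
points, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Score each candidate k; a higher silhouette means tighter, better-separated clusters.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(best_k)

# Fixing random_state makes repeated runs return identical labels.
labels_a = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(points)
labels_b = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(points)
print((labels_a == labels_b).all())
```

On sparse, high-dimensional TF-IDF data silhouette values tend to be low, so the score is better read as a rough comparison between values of k than as a verdict.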


In [22]:
# Add cluster labels to the original DataFrame
df['Cluster'] = cluster_labels

# Analyze cluster characteristics
for cluster_num in range(num_clusters):
    print(f"Cluster {cluster_num}:")
    cluster_df = df[df['Cluster'] == cluster_num]
    
    # Print out some statistics or characteristics of each cluster
    print("Number of countries:", len(cluster_df))
    print("Countries:", cluster_df['Country'].unique())
    print("Continents:", cluster_df['Continent'].unique())
    print()


Cluster 0:
Number of countries: 22
Countries: ['Cyprus' 'France' 'Greece' 'Iceland' 'Norway' 'Serbia' 'Uruguay'
 'Nicaragua' 'Puerto Rico' 'Trinidad and Tobago' 'Belize' 'Grenada'
 'Tonga' 'China' 'India' 'Macau' 'Philippines' 'Eritrea' 'Ivory Coast'
 'Mauritius' 'Namibia' 'Western Sahara']
Continents: ['Europe' 'South_America' 'North_America' 'Oceania' 'Asia' 'Africa']

Cluster 1:
Number of countries: 47
Countries: ['Belgium' 'Denmark' 'Germany' 'Ireland' 'Italy' 'Latvia'
 'Netherlands (the)' 'Portugal' 'Slovakia' 'Argentina' 'Bolivia' 'Chile'
 'Haiti' 'El Salvador' 'Panama' 'Bahamas' 'Greenland' 'Kiribati'
 'Federated States of Micronesia' 'Cambodia' 'Iran' 'Kyrgyzstan' 'Laos'
 'Myanmar' 'Singapore' 'Vietnam' 'Yemen' 'Angola' 'Cape Verde' 'Chad'
 'Comoros' 'Democratic Republic of Congo' 'Equatorial Guinea' 'Ethiopia'
 'Gabon' 'Gambia' 'Guinea' 'Kenya' 'Mali' 'Niger' 'Nigeria'
 'Republic of the Congo' 'Senegal' 'South Africa' 'South Sudan' 'Tanzania'
 'Togo']
Continents: ['Europe' 'So

In [23]:
# Get the feature names (unigrams and bigrams) from the fitted CountVectorizer
feature_names = vect.get_feature_names_out()

# Get the centroid of each cluster
centroids = kmeans.cluster_centers_

# Find the top words for each cluster
for i in range(num_clusters):
    print(f"Cluster {i}:")
    # Get the indices of the 10 terms with the highest centroid weights (mean TF-IDF in the cluster)
    top_words_indices = centroids[i].argsort()[-10:][::-1]
    # Map indices to words
    top_words = [feature_names[idx] for idx in top_words_indices]
    print("Top words:", top_words)
    print()


Cluster 0:
Top words: ['thy', 'thee', 'liberty', 'march', 'god', 'arise', 'eritrea', 'hail', 'namibia', 'land']

Cluster 1:
Top words: ['let', 'people', 'freedom', 'africa', 'sing', 'nation', 'bless', 'unity', 'song', 'arise']

Cluster 2:
Top words: ['land', 'free', 'thee', 'god', 'oh', 'home', 'dear', 'love', 'mother', 'salute']

Cluster 3:
Top words: ['fatherland', 'flag', 'shall', 'god', 'thy', 'warrior', 'fiji', 'land', 'blood', 'glorious']

Cluster 4:
Top words: ['homeland', 'country', 'live', 'long', 'glory', 'long live', 'flag', 'god', 'king', 'land']
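The project goal asks whether regions favor particular themes. One way to approach that, sketched here on an invented toy frame with the same `Continent` and `Cluster` column names, is to cross-tabulate cluster labels against continents and look at row shares.

```python
import pandas as pd

# Invented toy labels; on the real data use df['Continent'] and df['Cluster'].
toy = pd.DataFrame({
    "Continent": ["Africa", "Africa", "Europe", "Europe", "Europe", "Asia"],
    "Cluster": [1, 1, 2, 3, 2, 1],
})

# Raw counts of anthems per continent and cluster.
counts = pd.crosstab(toy["Continent"], toy["Cluster"])

# Share of each continent's anthems that falls in each cluster (rows sum to 1).
shares = pd.crosstab(toy["Continent"], toy["Cluster"], normalize="index")
print(counts)
print(shares.round(2))
```

A continent whose share is concentrated in one cluster would hint at a regional theme. The resulting table can also be exported for a Power BI heat map.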



## With more time

- Remove one-letter words and non-English words.
- Apply stemming or lemmatization.
- Change the number of clusters.
- Group countries into more specific geographic areas.