<a href="https://colab.research.google.com/github/GuidoGiacomoMussini/Text_Mining-Lyrics_Analysis/blob/main/3_Topic_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DEPENDENCIES

In [None]:
from google.colab import drive
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import re
import string
from tqdm import tqdm as progress_bar
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
!cp /content/drive/MyDrive/Colab\ Notebooks/Text\ Mining/Utils/topic_detection.py /content
import topic_detection as TD # the import takes 30-40 minutes because it downloads the FastText model

Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.it.300.bin.gz





# Data

In [None]:
#df = pd.read_excel(r'/content/drive/MyDrive/DSE/Text_Mining/Text_mining_Preprocessed.xlsx')
df =  pd.read_hdf(r'/content/drive/MyDrive/Colab Notebooks/Text Mining/Files/lyrics_preprocessing.h5', key='data')

df.head()

Unnamed: 0,Author,Title,Lyrics,verses,processed_lyrics,processed_verses,POS
0,Fabrizio De André,La Guerra Di Piero,dormi sepolto in un campo di grano non è la ro...,"[dormi sepolto in un campo di grano, non è la...",dormire seppellire campo Grano rosa Tulipano f...,"[dormire seppellire campo Grano, rosa Tulipan...","[[VERB, VERB, ADP, DET, NOUN, ADP, NOUN], [ADV..."
1,Fabrizio De André,Don Raffaè,io mi chiamo pasquale cafiero e son brigadiero...,"[io mi chiamo pasquale cafiero, e son brigadi...",chere pasquale Cafiero son brigadiero carcere ...,"[chere pasquale Cafiero, son brigadiero carce...","[[PRON, PRON, VERB, ADJ, ADJ], [CCONJ, AUX, NO..."
2,Fabrizio De André,Dolcenera,amìala ch'a l'arìa amìa cum'a l'é cum'a l'é am...,"[amìala ch'a l'arìa amìa cum'a l'é cum'a l'é, ...",amìalare ch' a il arìa amìa cum' a il é cum' a...,[amìalare ch' a il arìa amìa cum' a il é cum' ...,"[[VERB, SCONJ, ADP, DET, PROPN, PROPN, NUM, AD..."
3,Fabrizio De André,Bocca Di Rosa,la chiamavano bocca di rosa metteva l'amore me...,"[la chiamavano bocca di rosa, metteva l'amore...",chiamare bocca rosa mettere il amore mettere i...,"[chiamare bocca rosa, mettere il amore metter...","[[PRON, VERB, NOUN, ADP, NOUN], [VERB, DET, NO..."
4,Fabrizio De André,Il Testamento di Tito,non avrai altro dio all'infuori di me spesso m...,"[non avrai altro dio all'infuori di me, spess...",altro dio a il infuore me spesso fare pensare ...,"[altro dio a il infuore me, spesso fare pensa...","[[ADV, VERB, ADJ, NOUN, ADP, NOUN, ADP, PRON],..."


# TOPIC DETECTION
Each song is assigned to a topic by an algorithm, here called the *Fast Text Algorithm*, which exploits the [FastText similarity](https://arxiv.org/abs/1607.04606).


**FastText Classification Algorithm**:

Let $A$ be the list of words in a document, with each word denoted $a_i$, and let $T$ be the list of topics, with each topic $t_j$ represented by a single word.

Let $S(x, y)$ be the similarity between two words $x$ and $y$; here $S$ is the *FastText similarity*.

The algorithm involves calculating the similarity between each word $a_i$ in $A$ and each topic $t_j$ in $T$. The document is assigned to the topic $t_k$ such that the sum of similarities between $t_k$ and all words in $A$ is greater than the sum of similarities between $t_j$ and all words in $A$, where $t_j$ belongs to $T \setminus \{ t_k \}$.

Mathematically, this can be expressed as:

$$ t_k = \underset{t_j \in T}{\arg \max} \; w_j \sum_{i=1}^{|A|} S(a_i, t_j), \qquad \text{so that} \quad w_k \sum_{i=1}^{|A|} S(a_i, t_k) > w_j \sum_{i=1}^{|A|} S(a_i, t_j) \quad \forall \, t_j \in T \setminus \{ t_k \} $$

In words: $t_k$ is the topic whose weighted sum of similarities with all words in $A$ is larger than that of every other topic $t_j$ in $T \setminus \{ t_k \}$.
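The assignment rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 2-d vectors in `toy_vectors` are made-up stand-ins for FastText embeddings, cosine similarity stands in for $S$, and `similarity` and `assign_topic` are hypothetical helpers, not functions of `topic_detection.py`.

```python
import numpy as np

# Toy 2-d "embeddings" standing in for FastText vectors (made-up values).
toy_vectors = {
    "amore": np.array([1.0, 0.0]),
    "guerra": np.array([0.0, 1.0]),
    "cuore": np.array([0.9, 0.1]),
    "bacio": np.array([0.8, 0.2]),
    "soldato": np.array([0.1, 0.9]),
}

def similarity(x, y):
    """Cosine similarity between the toy vectors of two words (stand-in for S)."""
    u, v = toy_vectors[x], toy_vectors[y]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def assign_topic(words, topics, weights):
    """Return the topic with the largest weighted sum of similarities, plus all scores."""
    scores = {t: weights[t] * sum(similarity(a, t) for a in words) for t in topics}
    return max(scores, key=scores.get), scores

document = ["cuore", "bacio", "soldato"]
topics = ["amore", "guerra"]
weights = {"amore": 1.0, "guerra": 1.0}  # uniform weights: no popularity correction

best_topic, scores = assign_topic(document, topics, weights)
print(best_topic, scores)
```

With uniform weights the rule reduces to a plain argmax over the summed similarities; changing the entries of `weights` shows how a weight can flip the assignment.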


The weights $w_j, w_k \in [0,1]$ come from a metric called **Inverse Popularity**.

The popularity of a word represents how common the word is within the language.

Let $X$ be a random variable that picks a word at random from a language, $c$ a common word and $n$ a less common word, where *common* refers to the word's frequency in the dataset used to train the embedding model.

Under the assumption that the word distribution in the dataset is consistent with the word distribution in the language, the event $S(c, x) > S(n, x)$ is likely for a word $x$ drawn from $X$.

This means that FastText inherently tends to produce a higher similarity score when a common word is involved. As a result, documents tend to be assigned to the topics represented by the most common terms.

To counteract this effect, the algorithm derives, for each topic $t$, a weight $w_t$ defined as:

$$ w_t \propto \left( \sum_{i=1}^{|L|} S(l_i, t) \right)^{-1} $$

Here, $L$ is the list of all words $l_i$ across all documents.

Therefore, the most *popular* topics are penalized in the algorithm.
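A minimal numerical sketch of the weight formula, assuming a small made-up similarity matrix. The values in `sim` are invented for illustration, and dividing by the total is just one possible way to turn the proportionality into weights in $[0,1]$.

```python
import numpy as np

# Hypothetical similarities S(l_i, t): rows = vocabulary words, columns = topics.
topics = ["amore", "politica"]
sim = np.array([
    [0.60, 0.20],
    [0.50, 0.30],
    [0.70, 0.10],
])

popularity = sim.sum(axis=0)       # sum_i S(l_i, t) for each topic
inverse = 1.0 / popularity         # w_t is proportional to this quantity
weights = inverse / inverse.sum()  # assumed normalisation so the weights sum to 1
weights_by_topic = dict(zip(topics, weights.tolist()))
print(weights_by_topic)
```

Here "amore" is three times as popular as "politica" (1.8 against 0.6), so it receives a weight three times smaller.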

---

In practice, I don't use every word of a song (or of the whole set of songs) to compute the similarity with the topics: if 'love' is a topic, for example, there is little point in computing S('love', 'the').

Instead, I use only the nouns (`NOUN`) in each song to compute the similarity and assign the topic. You are free to choose other parts of speech (POS), such as `ADJ` or `VERB`.
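The filtering idea can be sketched as follows. The `(lemma, POS)` pairs are a made-up example of what a tagger might produce, and `keep_pos` and `most_common_words` are hypothetical helpers written for this illustration; they are not the implementation of `TD.extract_POS` or `TD.common_words`.

```python
from collections import Counter

# Hypothetical (lemma, POS) pairs for one verse, as a POS tagger might return them.
tagged = [
    ("dormire", "VERB"), ("campo", "NOUN"), ("grano", "NOUN"),
    ("rosa", "NOUN"), ("campo", "NOUN"), ("in", "ADP"),
]

def keep_pos(tagged_tokens, wanted=("NOUN",)):
    """Keep only the lemmas whose POS tag is in `wanted`."""
    return [lemma for lemma, pos in tagged_tokens if pos in wanted]

def most_common_words(words, n):
    """Return the n most frequent words, most frequent first."""
    return [word for word, _ in Counter(words).most_common(n)]

nouns = keep_pos(tagged)
top = most_common_words(nouns, 2)
print(nouns, top)
```

Passing `wanted=("NOUN", "ADJ")` would keep adjectives as well, at the cost of more similarity computations per song.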


In [None]:
#Extract the nouns from each song

POS = ['NOUN'] #here you can add other POS to be extracted
df['important_POS'] = [TD.extract_POS(words, POS) for words in progress_bar(df['processed_lyrics'])]

#To speed up the next computations, you can choose to restrict the similarity derivation to the most common words in each song:

#Extract the 10 most common nouns from each song
df['common_words'] = [TD.common_words(words, 10) for words in progress_bar(df['important_POS'])]

df.head(1)

100%|██████████| 441/441 [01:12<00:00,  6.06it/s]
100%|██████████| 441/441 [00:00<00:00, 1676.29it/s]


Unnamed: 0,Author,Title,Lyrics,verses,processed_lyrics,processed_verses,POS,important_POS,common_words
0,Fabrizio De André,La Guerra Di Piero,dormi sepolto in un campo di grano non è la ro...,"[dormi sepolto in un campo di grano, non è la...",dormire seppellire campo Grano rosa Tulipano f...,"[dormire seppellire campo Grano, rosa Tulipan...","[[VERB, VERB, ADP, DET, NOUN, ADP, NOUN], [ADV...","[campo, fan, veglia, ombra, sponda, luccio, ca...","[tempo, terra, campo, fan, veglia, ombra, inve..."


**Fast Text Algorithm**
* Define the topics
* Derive the popularity
* Calculate the weighted similarity for each song
* Extract the topic with the highest weighted similarity

In [None]:
topics = ['amore', 'dio', 'natura', 'politica', 'morte', 'guerra'] #chosen topics
vocabulary = [word for list_ in df['important_POS'] for word in list_] #retrieve the vocabulary of the whole set of songs

#now we extract the relative frequency in the lyrics-set of the words representing the topics.
topics_popularity = TD.find_popularity(topics, vocabulary)
topics_popularity

{'dio': 0.319,
 'amore': 0.186,
 'morte': 0.162,
 'guerra': 0.124,
 'natura': 0.124,
 'politica': 0.086}

The weights $w_j$ are derived from the *topic frequency dictionary*:

$$ w_j = (\text{frequency}_j)^{-1} $$

Note that the first output of the algorithm is a dictionary *topic: value*, where each value is the share of the song that can be associated with that topic. The dictionary therefore describes how much a song is about *each* of the topics, which gives further information about the text.
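As a rough sketch of how such a dictionary could be built, the snippet below applies $w_j = 1/\text{frequency}_j$ to the frequencies printed above and normalises the weighted scores so that they sum to 1. The `raw_scores` values are invented, and the normalisation and sorting steps are assumptions about what `TD.find_topic_similarity` returns, not its actual code.

```python
# Relative frequencies copied from the topics_popularity output above.
topics_popularity = {"dio": 0.319, "amore": 0.186, "morte": 0.162,
                     "guerra": 0.124, "natura": 0.124, "politica": 0.086}

# w_j = 1 / frequency_j: rarer topic words receive larger weights.
weights = {t: 1.0 / f for t, f in topics_popularity.items()}

# Hypothetical unweighted similarity sums for a single song.
raw_scores = {"dio": 3.0, "amore": 2.0, "morte": 1.5,
              "guerra": 1.2, "natura": 1.0, "politica": 0.8}

weighted = {t: raw_scores[t] * weights[t] for t in raw_scores}
total = sum(weighted.values())

# Share of each topic, sorted from the most to the least associated.
shares = dict(sorted(((t, s / total) for t, s in weighted.items()),
                     key=lambda kv: kv[1], reverse=True))
best = next(iter(shares))
print(best, shares)
```

In this toy example "dio" has the largest raw score, but after weighting the song is assigned to "amore", which is the kind of correction the inverse popularity is meant to provide.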

In [None]:
#derive the FastText Similarity between each topic and song
df['topic_similarity'] = [TD.find_topic_similarity(topics_popularity, words, weight = True) for words in progress_bar(df['important_POS'])]

#extract the topic with the highest weighted similarity (does not rely on the dictionary order)
df['topic'] = [max(topic, key=topic.get) for topic in df['topic_similarity']]

print("\n# topic assigned by song:\n", Counter(df['topic']))

100%|██████████| 441/441 [00:08<00:00, 55.04it/s]


# topic assigned by song:
 Counter({'dio': 104, 'amore': 89, 'politica': 81, 'guerra': 58, 'morte': 56, 'natura': 53})





In [None]:
#example
df[(df.Author == "Fabrizio De André")].head(5)

Unnamed: 0,Author,Title,Lyrics,verses,processed_lyrics,processed_verses,POS,important_POS,common_words,topic_similarity,topic
0,Fabrizio De André,La Guerra Di Piero,dormi sepolto in un campo di grano non è la ro...,"[dormi sepolto in un campo di grano, non è la...",dormire seppellire campo Grano rosa Tulipano f...,"[dormire seppellire campo Grano, rosa Tulipan...","[[VERB, VERB, ADP, DET, NOUN, ADP, NOUN], [ADV...","[campo, fan, veglia, ombra, sponda, luccio, ca...","[tempo, terra, campo, fan, veglia, ombra, inve...","{'guerra': 0.178, 'morte': 0.173, 'dio': 0.168...",guerra
1,Fabrizio De André,Don Raffaè,io mi chiamo pasquale cafiero e son brigadiero...,"[io mi chiamo pasquale cafiero, e son brigadi...",chere pasquale Cafiero son brigadiero carcere ...,"[chere pasquale Cafiero, son brigadiero carce...","[[PRON, PRON, VERB, ADJ, ADJ], [CCONJ, AUX, NO...","[brigadiero, carcere, cafiero, catenaccio, ser...","[cafè, carcere, ricetta, cumpagno, fortuna, uo...","{'dio': 0.182, 'politica': 0.176, 'morte': 0.1...",dio
2,Fabrizio De André,Dolcenera,amìala ch'a l'arìa amìa cum'a l'é cum'a l'é am...,"[amìala ch'a l'arìa amìa cum'a l'é cum'a l'é, ...",amìalare ch' a il arìa amìa cum' a il é cum' a...,[amìalare ch' a il arìa amìa cum' a il é cum' ...,"[[VERB, SCONJ, ADP, DET, PROPN, PROPN, NUM, AD...","[ch, amiala, ch, amiala, amiala, ch, porgere, ...","[acqua, ch, amiala, amore, moglie, via, aegua,...","{'natura': 0.179, 'amore': 0.176, 'guerra': 0....",natura
3,Fabrizio De André,Bocca Di Rosa,la chiamavano bocca di rosa metteva l'amore me...,"[la chiamavano bocca di rosa, metteva l'amore...",chiamare bocca rosa mettere il amore mettere i...,"[chiamare bocca rosa, mettere il amore metter...","[[PRON, VERB, NOUN, ADP, NOUN], [VERB, DET, NO...","[bocca, amore, amore, bocca, amore, cosa, staz...","[amore, bocca, gente, stazione, voglie, consig...","{'amore': 0.194, 'guerra': 0.173, 'morte': 0.1...",amore
4,Fabrizio De André,Il Testamento di Tito,non avrai altro dio all'infuori di me spesso m...,"[non avrai altro dio all'infuori di me, spess...",altro dio a il infuore me spesso fare pensare ...,"[altro dio a il infuore me, spesso fare pensa...","[[ADV, VERB, ADJ, NOUN, ADP, NOUN, ADP, PRON],...","[dio, infuore, gente, est, fondo, nome, coltel...","[dolore, nome, padre, madre, uomo, amore, dio,...","{'dio': 0.207, 'amore': 0.181, 'morte': 0.177,...",dio


# Song similarity

* The same logic used for topic detection can be used to find which songs are most similar to each other.
* As an example, consider the songs of Guccini and De André.

In [None]:
#extract all Guccini and De André songs:
fg = df[df.Author == 'Francesco Guccini'][['Author', 'Title', 'common_words']].reset_index(drop = True)
fda = df[df.Author == 'Fabrizio De André'][['Author', 'Title', 'common_words']].reset_index(drop = True)

Compute the similarity for every pair of songs.

Note that in this case the score cannot be corrected for popularity: the vocabulary is so large that the popularity of each word flattens to 0, which translates into a weight that tends to infinity.
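One plausible way to score a pair of songs without weights is to sum the similarity over all word pairs. The sketch below assumes exactly that; the definition, the toy vectors and the helper names are illustrative assumptions and may differ from what `TD.song_similarity` actually does.

```python
import numpy as np

# Toy 2-d "embeddings" standing in for FastText vectors (made-up values).
toy_vectors = {
    "dio": np.array([1.0, 0.0]),
    "fede": np.array([0.9, 0.1]),
    "mare": np.array([0.0, 1.0]),
    "vento": np.array([0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pairwise_song_similarity(words_a, words_b):
    """Unweighted sum of similarities over all word pairs (assumed definition)."""
    return sum(cosine(toy_vectors[a], toy_vectors[b])
               for a in words_a for b in words_b)

song_1 = ["dio", "fede"]
song_2 = ["fede", "dio"]
song_3 = ["mare", "vento"]

sim_12 = pairwise_song_similarity(song_1, song_2)
sim_13 = pairwise_song_similarity(song_1, song_3)
print(sim_12, sim_13)
```

Because the score is a sum rather than an average, it grows with the number of word pairs, so it is only comparable between songs described by the same number of words.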

In [None]:
song_pairs, sim_list= [], []
for de_andre in progress_bar(range(len(fda))):
  for guccini in range(len(fg)):
    sim_list.append(TD.song_similarity(fda.common_words[de_andre], fg.common_words[guccini]))
    song_pairs.append((str(fda.Title[de_andre]), str(fg.Title[guccini])))

df_similarity=pd.DataFrame({'Titles': song_pairs, 'Similarity':sim_list}).sort_values(by='Similarity', ascending=False)

100%|██████████| 119/119 [01:58<00:00,  1.00it/s]


In the top positions of the ranking you can see how the result is driven by the words themselves: there are songs in dialect, words that were probably misinterpreted by the embedding model, and songs about faith, which I observed to be the area that generally yields the highest similarity scores.


In [None]:
df_similarity

Unnamed: 0,Titles,Similarity
13558,"(A Pittima, Barun Litrun)",42.045179
7642,"(Sinan Capudan Pascia, Barun Litrun)",39.986924
13442,"(Â cúmba, Barun Litrun)",37.677947
6537,"(Leggenda Di Natale, Bisanzio)",36.630282
4649,"(Coda di Lupo, Dio è morto)",35.740505
...,...,...
9571,"(Primo Intermezzo, Ho Ancora La Forza)",3.940267
9620,"(Primo Intermezzo, Sei minuti all’alba)",3.838603
9586,"(Primo Intermezzo, Canzone delle ragazze che s...",3.837559
8918,"(Introduzione, Barun Litrun)",2.846110


# Store the file

In [None]:
files_path = '/content/drive/MyDrive/Colab Notebooks/Text Mining/Files/'
file_ = 'lyrics_topic_detection.h5'

df.to_hdf(files_path + file_, key='data', mode='w')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['Author', 'Title', 'Lyrics', 'verses', 'processed_lyrics',
       'processed_verses', 'POS', 'important_POS', 'common_words',
       'topic_similarity', 'topic'],
      dtype='object')]

  df.to_hdf(files_path + file_, key='data', mode='w')


In [None]:
!rm /content/topic_detection.py