Considered features, following https://github.com/sgiammy/emotion-patterns-in-music-playlists:

<ul>
    <li>**Title_vector**</li>
    <li>**Lyric_vector**</li>
    <li>**%Rhymes**:<br> The number of rhymes as a percentage of the total number of lines. A rhyme is counted only when two consecutive lines rhyme.</li>
    <li>**%Past_tense_verbs**:<br> The number of past tense verbs as a percentage of the total number of verbs.</li>
    <li>**%Present_tense_verbs**:<br> The number of present tense verbs as a percentage of the total number of verbs.</li>
    <li>**%Future_tense_verbs**:<br> The number of future tense verbs as a percentage of the total number of verbs, where future tense means only will + base form.</li>
    <li>**%ADJ**:<br> Percentage of adjectives over the total number of words.</li>
    <li>**%ADP**:<br> Percentage of adpositions (e.g. in, to, during) over the total number of words.</li>
    <li>**%ADV**:<br> Percentage of adverbs (e.g. very, tomorrow, down, where, there) over the total number of words.</li>
    <li>**%AUX**:<br> Percentage of auxiliaries (e.g. is, has (done), will (do), should (do)) over the total number of words.</li>
    <li>**%INTJ**:<br> Percentage of interjections (e.g. psst, ouch, bravo, hello) over the total number of words.</li>
    <li>**%NOUN**:<br> Percentage of nouns over the total number of words.</li>
    <li>**%NUM**:<br> Percentage of numerals over the total number of words.</li>
    <li>**%PRON**:<br> Percentage of pronouns (e.g. I, you, he, she, myself, themselves, somebody,...) over the total number of words.</li> 
    <li>**%PROPN**:<br> Percentage of proper nouns (e.g. Mary, John) over the total number of words.</li>
    <li>**%PUNCT**:<br> Percentage of punctuation (e.g. ., (, ), ?) over the total number of words.</li>
    <li>**%VERB**:<br> Percentage of verbs over the total number of words.</li>
    <li>**Selfish_degree**:<br> Percentage of 'I' pronouns over the total number of pronouns.</li>
    <li>**%Echoism**:<br> Percentage of echoisms over the total number of words, where an echoism is either a word repeated immediately after itself or the repetition of a vowel within a word.</li>
    <li>**%Duplicates**:<br> Percentage of duplicate words over the total number of words.</li>
    <li>**isTitleInLyric**:<br> Boolean, true if the title string is also a substring of the lyric.</li>
    <li>**sentiment**:<br> Sentiment score between -1 and 1.</li>
    <li>**subjectivity degree**:<br> Degree of subjectivity of the text.</li>
</ul>
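
A few of the simpler lexical features in the list can be approximated without a POS tagger. The sketch below uses only the standard library and is an illustration, not the referenced repository's implementation. The regex tokenizer, the function names, the case-insensitive title match, and the reading of "repetition of a vowel" as three or more identical vowels in a row are all assumptions.

```python
import re

# Assumption: a "repeated vowel" means the same vowel three or more times in a
# row (e.g. "oooh"), so ordinary spellings such as "book" are not counted.
STRETCHED_VOWEL = re.compile(r"([aeiou])\1{2,}")


def tokenize(text):
    """Split text into lowercase word tokens (crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def duplicates_pct(tokens):
    """Percentage of tokens that repeat a token seen earlier in the text."""
    if not tokens:
        return 0.0
    return 100 * (len(tokens) - len(set(tokens))) / len(tokens)


def echoism_pct(tokens):
    """Percentage of tokens that repeat the previous token or stretch a vowel."""
    if not tokens:
        return 0.0
    count = 0
    for i, tok in enumerate(tokens):
        repeats_previous = i > 0 and tok == tokens[i - 1]
        if repeats_previous or STRETCHED_VOWEL.search(tok):
            count += 1
    return 100 * count / len(tokens)


def is_title_in_lyric(title, lyric):
    """True if the title occurs in the lyric (case-insensitive by assumption)."""
    return title.lower() in lyric.lower()


# Made-up lyric used only to exercise the helpers.
lyric = "Oooh baby baby, hold me tight\nHold me tight tonight"
tokens = tokenize(lyric)
print(duplicates_pct(tokens), echoism_pct(tokens), is_title_in_lyric("Hold Me Tight", lyric))
```

Each token is counted at most once in `echoism_pct`, even if it both repeats the previous word and contains a stretched vowel. The features that depend on part-of-speech tags, rhyme detection, or sentiment scoring need external libraries and are not covered here.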

In [20]:
import os
import json
import pandas as pd

from langdetect import detect


In [55]:
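def detect_language(text):
    """Return the language code guessed by langdetect, or None if detection fails."""
    try:
        language = detect(text)
    except Exception as e:
        # langdetect raises on empty or featureless text (e.g. instrumentals).
        print(e)
        language = None
    print(f"Detected lang = {language}")

    return language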

INSTRUMENTAL_COMMENT = "This song is an instrumental"

def load_lyric_dataset(input_path):
    """Load every song JSON file in input_path into a DataFrame indexed by song id."""
    rows = []
    ids = []

    lyric_files = [
        os.path.join(input_path, name)
        for name in os.listdir(input_path)
        if name.endswith('.json')
    ]

    for file_path in lyric_files:
        with open(file_path) as f:
            song_info = json.load(f)

        # Strip the "ML" prefix so the id becomes an integer.
        try:
            song_id = int(song_info['id'].replace("ML", ""))
        except (KeyError, TypeError, AttributeError, ValueError):
            song_id = None
            print(f"For {file_path} there is no id")

        try:
            mood = song_info['mood']
        except (KeyError, TypeError):
            mood = None
            print(f"For {file_path} there is no mood")

        try:
            title = song_info['title']
        except (KeyError, TypeError):
            title = None
            print(f"For {file_path} there is no title")

        try:
            lyric = song_info['song']['lyrics']
            if lyric == '':
                print(f"For {file_path} lyric is empty")
        except (KeyError, TypeError):
            lyric = None
            print(f"For {file_path} there is no lyrics")

        # Fall back to automatic detection when the language is missing or blank.
        try:
            language = song_info['song']['language']
            if language == '':
                print(f"For {file_path} language is empty")
                language = detect_language(lyric)
            elif language is None:
                print(f"For {file_path} language is null")
                language = detect_language(lyric)
        except (KeyError, TypeError):
            print(f"For {file_path} there is no language info in dataset")
            language = detect_language(lyric)

        # The comment key is spelled '//coment' in the dataset files.
        try:
            comment = song_info['song']['//coment']
            instrumental = comment == INSTRUMENTAL_COMMENT
            if instrumental:
                print(f"For {file_path} is instrumental\n")
        except (KeyError, TypeError):
            instrumental = False

        rows.append((mood, title, lyric, language, instrumental))
        ids.append(song_id)

    df = pd.DataFrame(rows, columns=['mood', 'title', 'lyric', 'language', 'instrumental'], index=ids)

    return df
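
The field-by-field extraction in `load_lyric_dataset` can also be written with `dict.get`, which returns `None` for a missing key instead of raising. The sketch below is one possible shape, not a replacement used in this notebook: `parse_song` and the sample record are hypothetical, the record layout is inferred from the keys the loader reads, and unlike the loader this version prints no diagnostics and does not fall back to `detect_language`.

```python
INSTRUMENTAL_COMMENT = "This song is an instrumental"


def parse_song(song_info):
    """Pull the loader's fields out of one decoded JSON record (hypothetical helper)."""
    # A missing or null "song" entry is treated as an empty dict.
    song = song_info.get("song") or {}

    raw_id = song_info.get("id")
    # Assumes a well-formed id such as "ML1004"; a malformed id would raise ValueError.
    song_id = int(raw_id.replace("ML", "")) if raw_id else None

    return {
        "id": song_id,
        "mood": song_info.get("mood"),
        "title": song_info.get("title"),
        "lyric": song.get("lyrics"),
        # Normalise an empty-string language to None.
        "language": song.get("language") or None,
        "instrumental": song.get("//coment") == INSTRUMENTAL_COMMENT,
    }


# Hypothetical record, shaped after the keys the loader reads.
record = {
    "id": "ML1004",
    "mood": "happy",
    "title": "Hold Me Tight",
    "song": {"lyrics": "It feels so right", "language": "en"},
}
parsed = parse_song(record)
print(parsed["id"], parsed["language"], parsed["instrumental"])
```

The trade-off is that `dict.get` hides which field was missing, so the per-file messages printed by the loader would need to be added back if they are useful for auditing the dataset.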

In [56]:
input_path = os.path.join('..', '..', 'database', 'lyrics')

dataset = load_lyric_dataset(input_path) 
dataset.head(10)


For ..\..\database\lyrics\ML1159.json lyric is empty
For ..\..\database\lyrics\ML1159.json there is no language info in dataset
No features in text.
Detected lang = None
For ..\..\database\lyrics\ML1159.json is instrumental

For ..\..\database\lyrics\ML1230.json lyric is empty
For ..\..\database\lyrics\ML1230.json there is no language info in dataset
No features in text.
Detected lang = None
For ..\..\database\lyrics\ML1230.json is instrumental

For ..\..\database\lyrics\ML1336.json lyric is empty
For ..\..\database\lyrics\ML1336.json there is no language info in dataset
No features in text.
Detected lang = None
For ..\..\database\lyrics\ML1336.json is instrumental

For ..\..\database\lyrics\ML1349.json lyric is empty
For ..\..\database\lyrics\ML1349.json there is no language info in dataset
No features in text.
Detected lang = None
For ..\..\database\lyrics\ML1349.json is instrumental

For ..\..\database\lyrics\ML136.json there is no language info in dataset
Detected lang = cs
For ..\

id,mood,title,lyric,language,instrumental
1,happy,I Want Your Sex,I Want Your Sex Lyrics[From a PSA recorded for...,en,False
10,happy,Heart of Glass,Heart of Glass Lyrics[Verse 1]\nOnce I had a l...,en,False
100,happy,Crazy Little Thing Called Love,Crazy Little Thing Called Love Lyrics[Intro]\n...,en,False
1000,happy,Almost,Almost Lyrics[Verse 1]\nI almost got drunk at ...,en,False
1001,happy,Glow,Glow Lyrics[Verse 1]\nI never thought that you...,en,False
1002,sad,The Kids,The Kids Lyrics[Verse 1]\nThey're taking her c...,en,False
1003,relaxed,Political,Political LyricsA loose grip on a thin line\nL...,en,False
1004,happy,Hold Me Tight,Hold Me Tight Lyrics[LUCY]\nIt feels so right ...,en,False
1005,relaxed,I Will Never See the Sun,I Will Never See the Sun Lyrics[Chorus]\nI wil...,en,False
1006,happy,Superfast Jellyfish,Superfast Jellyfish Lyrics[Intro]\nThis mornin...,en,False


In [54]:
dataset.describe()

statistic,mood,title,lyric,language,instrumental
count,2000,2000,2000,1986,2000
unique,4,1978,1987,17,2
top,happy,Fire,,en,False
freq,500,3,5,1890,1995


In [48]:
# Keep English songs only, then drop instrumentals.
dataset = dataset.loc[dataset['language'] == "en"]
en_dataset = dataset.loc[~dataset['instrumental']]
en_dataset.describe()

statistic,mood,title,lyric,language,instrumental
count,1890,1890,1890,1890,1890
unique,4,1871,1881,1,1
top,angry,Fire,Back to This LyricsWe were giving up time\nWe ...,en,False
freq,493,3,2,1890,1890
