### Folder, modules and data import

In [1]:
%cd /content/drive/MyDrive/Projects/mamasita/data

/content/drive/MyDrive/Projects/mamasita/data


In [2]:
!pip install pyspellchecker

Collecting pyspellchecker
  Downloading https://files.pythonhosted.org/packages/64/c7/435f49c0ac6bec031d1aba4daf94dc21dc08a9db329692cdb77faac51cea/pyspellchecker-0.6.2-py3-none-any.whl (2.7MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.6.2


In [3]:
import os
import pathlib
import pandas as pd
import re
from spellchecker import SpellChecker

from collections import Counter

Specify the path to the JSON file containing the scraped lyrics, the export path for the cleaned lyrics (used at the end of the notebook), and the path to the word list of the Spanish dictionary.

In [8]:
current_path = pathlib.Path().absolute()
lyrics_path = str(current_path) + '/lyrics_expanded.json'
clean_lyrics_path = str(current_path) + '/clean_lyrics_extended.csv'
spanish_dict_path = str(current_path) + '/spanish_dict.txt'

In [9]:
with open(spanish_dict_path, encoding='utf8') as f:
  spanish_dictionary = f.read() #Import the words in the Spanish dictionary
  spanish_dictionary = spanish_dictionary.split('\n')

Import the data from the lyrics JSON into a pandas DataFrame.

In [10]:
songs_df = pd.read_json(lyrics_path, orient='records')
songs_df = songs_df.dropna()
# Transform from single-element-list to string for titles and artist columns
songs_df['title'] = [l[0] for l in songs_df['title']]
songs_df['artist'] = [l[0] for l in songs_df['artist']]

# Data cleaning

## Removing artist insertion and punctuation elements

Many lyrics contain insertions such as (artist_name) or [artist_name] that mark a change of singer between sentences and are not part of the lyrics. After removing them, punctuation marks are removed as well, since they are interpretations by the person who uploaded the lyrics. Only the closing question mark (?) is kept as an objective sign (e.g. the number of questions in a song could be a valid feature).

In [11]:
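def remove_artist_insertion(lyrics:pd.Series, artists:pd.Series, spanish_dict:list):
  spanish_words = set(spanish_dict) #Set lookup is much faster than scanning a list
  cleaned = []
  for dirty_text, artist_name in zip(lyrics, artists):
    text = dirty_text.lower() #Lowercase every song, with or without insertions
    for word in artist_name.split():
      if word not in spanish_words:
        #Match the name wrapped in (), [] or "", escaping any regex metacharacters
        re_str = r'[\(\[\"]' + re.escape(word.lower()) + r'[\]\)\"]'
        text = re.sub(re_str, '', text) #Accumulate the removals of every name part
    cleaned.append(text)
  return pd.Series(cleaned, index=lyrics.index)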

songs_df['lyrics'] = remove_artist_insertion(songs_df['lyrics'], songs_df['artist'], spanish_dictionary)

def remove_special_signs(lyrics:pd.Series):
  #Every sign is dropped except '?', which is kept as a separate token
  signs_correction = [(',', ''), ('¿', ''), ('?', ' ? '), ('¡', ''), ('!',''), 
                    ('(', ''), (')', ''), ('[', ''), (']', ''), ('-', ''),
                    ('.', ''), (":", ""), (";", ""), ("_", "")]
  for replacement in signs_correction:
    lyrics = [l.replace(replacement[0], replacement[1]) for l in lyrics]
  return lyrics

songs_df['lyrics'] = remove_special_signs(songs_df['lyrics'])
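To make the two cleaning steps easier to follow, here is a minimal, standalone sketch applied to a single made-up line. The helper names `strip_artist_tag` and `strip_punctuation` and the sample text are assumptions for illustration only; they are not part of the notebook's pipeline.

```python
import re

def strip_artist_tag(text, artist_word):
    # Remove the artist word when it is wrapped in (), [] or double quotes
    pattern = r'[\(\[\"]' + re.escape(artist_word.lower()) + r'[\]\)\"]'
    return re.sub(pattern, '', text.lower())

def strip_punctuation(text):
    # Keep '?' as its own token, then drop the remaining signs
    text = text.replace('?', ' ? ')
    for sign in ',¿¡!()[]-.:;_':
        text = text.replace(sign, '')
    return text

# Made-up sample line, not taken from the dataset
sample = '(Maluma) ¿Dónde estás? Dime, [maluma] baby!'
cleaned = strip_punctuation(strip_artist_tag(sample, 'Maluma'))
print(cleaned.split())
```

Splitting the result on whitespace shows that the singer tags and the punctuation are gone while the question mark survives as a separate token.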


We can see the result of this preliminary cleaning:

In [13]:
songs_df.head(10)

| | title | lyrics | artist |
|---|---|---|---|
| 0 | Yo Se Bien Quien Tu Eres | me pregunto que qué edad yole pongo le dije 1... | Maluma |
| 1 | Ya No Es Niña | pretty boy dirty boy yo la conocí cuando era... | Maluma |
| 2 | Vuelo Hacia El Olvido | eh ye ye ye aohoh me llevo tus abrigos para ... | Maluma |
| 3 | Zombie (feat. Yaviah) | metiendo miedo como zombie azicalao con las t... | Alexis y Fido |
| 4 | Vuelve (feat. Paulina Rubio & DCS) | tengo la sensación de que no vuelves nunca te... | Juan Magan |
| 5 | Tu Y Yo | sabes quien llego gente de zona una ves mas ... | Gente de Zona |
| 6 | YO VISTO ASÍ | yehyeh yeh yehyeh yehyeh yehyeh yehyeh yehy... | Bad Bunny |
| 7 | Yo soy tu hombre | las knarias con nicky jam… ya tu sabes como v... | Nicky Jam |
| 8 | Yo Te Lo Dije | pero yo te dejé todo claro en que habíamos q... | J Balvin |
| 9 | Yo Quiero Ser | llevo tanto tiempo esperando que regreses lle... | J Balvin |


## Spelling correction

The next step is to reduce the number of misspelled words as much as possible. This is a complex task for several reasons:
* Slang and geographical variations not recognized in the reference dictionaries are common.
* English terms are widely used, sometimes with a meaning that differs from the original English word.

Slang and English words are left untouched since they might be significant features.

A manual feedback process was used to identify the most common misspellings. These are hard-coded in the *tuned_spelling* list of the *refine_spelling* function and are replaced first. A subsequent pass takes the remaining words that end with an apostrophe (') and replaces them with the candidate proposed by the SpellChecker.correction() method, since a manual check showed that the accuracy of the proposed candidate for this kind of misspelling was quite high.

In [None]:
def refine_spelling(lyrics:pd.Series):
  spelling_check = SpellChecker(language='es') #Using the module built-in dict
  tuned_spelling = [("pa’", "para"), ("to'", "todo"), ("vamo'", "vamos"), ("e'", "es"), ("na'", "nada"), 
                    ("pa'", "para"), ("lo'", "los"), ("to'a", "toda"), ("ere'", "eres"), ("quiere'", "quieres"),
                    ("pa'l", "para el"), ("ma'", "más"), ("po'", "por"), ("tiene'", "tienes"), ("la'", "las"),
                    ("sabe'", "sabes"), ("va'", "vas"), ("estamo'", "estamos"), ("hacemo'", "hacemos"), ("somo'", "somos"),
                    ("está'", "estás"), ("no'", "nos"), ("haga'", "hagas"), ("llama'", "llamas"), ("yo'", "yo"),
                    ("la'o", "lado"), ("perriarte", "perrear"), ("toa", "toda"), ("amigo'", "amigos"), ("claro'", "claro"),
                    ("partío'", "partido"), ("má'", "más"), ("mojaíta", "mojada"), ('to’', 'todo'), ("oí'te", "oíste"),
                    ("beso'", "beso"), ("nosotro'", "nosotros"), ("rompe'", "romper"), ("dize", "dice"), ("dio'", "dios"),
                    ("vece'", "veces"), ("ve'", "ves"), ("cosa'", "cosas"), ("pa'lante", "para adelante"), ("sabe’", "sabes"),
                    ("perrearte", "perrear"), ("perriando", "perreando"), ("hora'", "horas"), ("diga'", "digas"), ("pá'", "papá"), ("toa'", "toda"),
                    ("despué'", "después"), ("andamo'", "andamos"), ("pasa'o", "pasado"), ("entonce'", "entonces")]
  for replacement in tuned_spelling:
    lyrics = [l.replace(replacement[0], replacement[1]) for l in lyrics]
  adhoc_dict = {} #Cache of word -> correction, so each word is looked up only once
  for index in range(len(lyrics)):
    for word in re.findall(r"(\w*)\'\B", lyrics[index]): 
      #Target remaining words that end in "'"
      word = word+"'"
      if word not in adhoc_dict:
        #correction() can return None in some versions; keep the word in that case
        adhoc_dict[word] = spelling_check.correction(word) or word
      correction = adhoc_dict[word]
      #Plain string replacement; the word holds no regex metacharacters
      lyrics[index] = lyrics[index].replace(word, correction)
  return lyrics, adhoc_dict

songs_df['lyrics'], words_corrected = refine_spelling(songs_df['lyrics'])
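The apostrophe pass can be tried in isolation with the sketch below. It reuses the same regular expression, but swaps SpellChecker.correction() for a small hypothetical lookup table (`toy_corrections`) so that it runs without pyspellchecker. The sample sentence and the function name `correct_apostrophe_words` are also assumptions.

```python
import re

# Hypothetical stand-in for SpellChecker.correction(); not the real candidates
toy_corrections = {"podemo'": "podemos", "estamo'": "estamos"}

def correct_apostrophe_words(text, corrections):
    cache = {}  # word -> correction, filled on first sight of each word
    # (\w*)'\B matches words whose apostrophe is NOT followed by a letter,
    # so "podemo'" is caught while "pa'l" is left alone
    for stem in re.findall(r"(\w*)\'\B", text):
        word = stem + "'"
        if word not in cache:
            cache[word] = corrections.get(word, word)  # unknown words stay as-is
        text = text.replace(word, cache[word])
    return text, cache

sample = "podemo' salir pa'l barrio y estamo' bien"
fixed, cache = correct_apostrophe_words(sample, toy_corrections)
print(fixed)
```

The cache plays the role of `adhoc_dict`: it avoids repeated lookups and doubles as a record of what was changed.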

As an interesting side detail, the dictionary of corrected words is shown below. The automatic correction is not perfect (for example, "guaya'" becomes 'guapa' and "mai'" becomes 'mail'), but the overall accuracy is high.

In [19]:
print('The amount of words corrected was: {0}'.format(len(words_corrected)))

The amount of words corrected was: 1175


In [15]:
words_corrected

{"podemo'": 'podemos',
 "empezamo'": 'empezamos',
 "estábamo'": 'estábamos',
 "pesa'": 'pesar',
 "blackfatha'": "blackfatha'",
 "loco'": 'loco',
 "raya'": 'raya',
 "guaya'": 'guapa',
 "cruzamo'": 'cruzamos',
 "toda'": 'todas',
 "pal'": 'palo',
 "maaa'": 'maaa',
 "aja'": 'aja',
 "prrrra'": "prrrra'",
 "yasid'": 'yacido',
 "maravish'": "maravish'",
 "copiaa'": 'copia',
 "copiaaa'": 'copiaba',
 "round'": 'round',
 "pra'": 'pra',
 "voa'": 'voy',
 "pendejaa'": 'pendejadas',
 "pinta'": 'pinta',
 "pista'": 'pista',
 "quimico'": 'quimicos',
 "yao'": 'yao',
 "supiera'": 'supiera',
 "pierda'": 'pierdas',
 "mientra'": 'mientras',
 "toca'": 'tocar',
 "deja'": 'dejar',
 "aventura'": 'aventura',
 "diverso'": 'diversos',
 "mai'": 'mail',
 "salimo'": 'salimos',
 "tenemo'": 'tenemos',
 "prendío'": 'prendí',
 "sea'": 'sea',
 "amiga'": 'amiga',
 "ojo'": 'ojos',
 "bellaca'": 'bellaca',
 "todo'": 'todo',
 "fuimo'": 'fuimos',
 "rankeá'": "rankeá'",
 "cuidao'": 'cuidaos',
 "amarrao'": 'amarrado',
 "paramo'":

As a final sanity check, we can peek at the non-recognised words still present in the lyrics, together with their occurrence counts. This gives a quick intuition of the *dirtiness* of our lyrics dataset. The candidate proposed by the SpellChecker.correction() method is also included, to show how many of these words could be fixed by a drastic automatic correction of all non-recognised terms.

In [16]:
def count_ocurrences(df, spanish_dict_path = None):
  spelling_check = SpellChecker(language='es') #Using the module built-in dict
  if spanish_dict_path: #Specifying the path as an argument will merge the
    #dictionary given by the path with the built in one.
    spelling_check.word_frequency.load_text_file(spanish_dict_path)
  word_count = Counter() #Occurrences of every word across all the lyrics
  for lyrics in df.lyrics:
    word_count.update(lyrics.lower().split())
  no_dict_words = dict()
  for word, count in word_count.items():
    if spelling_check.unknown([word]):
      if count > 3: #Only look up a candidate for words seen more than 3 times
        dict_correction = spelling_check.correction(word)
      else:
        dict_correction = 'unknown'      
      no_dict_words[word] = {'count' : count, 
                             'Dict_candidate': dict_correction}
  return no_dict_words

no_dict_word = count_ocurrences(songs_df, spanish_dict_path)

In [20]:
print('The amount of non-recognised words in the dataset is: {0}'.format(len(no_dict_word)))

The amount of non-recognised words in the dataset is: 15244
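The counting and filtering logic can be reproduced on toy data with only the standard library. In this sketch the set `known_words` is a hypothetical stand-in for the spell checker's vocabulary, and the three lyric strings are made up.

```python
from collections import Counter

# Hypothetical mini vocabulary standing in for the Spanish dictionary
known_words = {"yo", "te", "quiero", "baby", "no"}

# Made-up lyrics, already lowercased and stripped of punctuation
toy_lyrics = ["yo te quiero uah uah", "maluma baby uah", "no te quiero maluma"]

# Count every word across all songs
word_count = Counter()
for text in toy_lyrics:
    word_count.update(text.lower().split())

# Keep only the words missing from the vocabulary, most frequent first
unknown = {w: c for w, c in word_count.items() if w not in known_words}
ranked = sorted(unknown.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Sorting by count puts the most frequent unknown terms first, which is where manual inspection pays off most.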


As can be seen below, many of the remaining words are the artists' mentions of themselves (quite common in reggaeton), English terms, and representations of sung sounds (e.g. uah, aah, ohohoh). One tempting idea would be to replace the self-mentions with a common token representing a self-mention feature. However, the cleaning of these elements is left as a *might do* once the results of the model are available.
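The self-mention idea was not implemented in this notebook. Purely as an illustration, one possible approach is sketched here: each part of the artist name is replaced by a placeholder token. The token `selfmention`, the function name and the minimum length of three characters are all assumptions.

```python
import re

SELF_TOKEN = "selfmention"  # hypothetical placeholder token

def replace_self_mentions(lyrics, artist, token=SELF_TOKEN):
    # Replace every part of the artist name that appears as a whole word
    for name_part in artist.lower().split():
        if len(name_part) < 3:
            continue  # skip short connectors such as "y" or "de"
        pattern = r"\b" + re.escape(name_part) + r"\b"
        lyrics = re.sub(pattern, token, lyrics)
    return lyrics

# Made-up lines, not taken from the dataset
solo = replace_self_mentions("maluma baby ya tu sabes maluma", "Maluma")
duo = replace_self_mentions("alexis y fido en la casa", "Alexis y Fido")
print(solo)
print(duo)
```

A limitation of this sketch is that names made of ordinary words (such as Gente de Zona) would also have those words replaced wherever they occur, so a real implementation would need an extra check, for instance against the Spanish dictionary.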

In [21]:
{k: v for k, v in sorted(no_dict_word.items(), key=lambda item: item[1]['count'], reverse=True)}

{'uah': {'Dict_candidate': 'ah', 'count': 413},
 'balvin': {'Dict_candidate': 'calvin', 'count': 299},
 'maluma': {'Dict_candidate': 'malum', 'count': 244},
 "'e": {'Dict_candidate': 'de', 'count': 224},
 "i'm": {'Dict_candidate': 'im', 'count': 207},
 'anuel': {'Dict_candidate': 'aquel', 'count': 188},
 'x2': {'Dict_candidate': 'x', 'count': 169},
 'eheh': {'Dict_candidate': 'ehh', 'count': 163},
 "don't": {'Dict_candidate': 'donut', 'count': 132},
 "vo'a": {'Dict_candidate': 'vota', 'count': 128},
 'aa': {'Dict_candidate': 'a', 'count': 127},
 'vico': {'Dict_candidate': 'vino', 'count': 112},
 "it's": {'Dict_candidate': 'its', 'count': 97},
 'reggaetón': {'Dict_candidate': 'reggaeton', 'count': 96},
 'chojin': {'Dict_candidate': 'chopin', 'count': 95},
 "'tá": {'Dict_candidate': 'tá', 'count': 86},
 'luian': {'Dict_candidate': 'lian', 'count': 81},
 'corillo': {'Dict_candidate': 'cerillo', 'count': 79},
 'kingz': {'Dict_candidate': 'king', 'count': 74},
 'não': {'Dict_candidate': 'no

In [22]:
songs_df.to_csv(path_or_buf=clean_lyrics_path) #Export the clean set to a csv file