# Further pre-processing

Although we already did a pre-processing step, we realized during the weak learning process that the data could still be improved.

These are the additional steps we apply:

- We deleted most of the English words earlier, but some of them are still mixed into the lyrics. The idea is to remove those using an English dictionary.
    - To achieve this, we take a list of the 1000 most common English words and manually delete the Spanish cognates, which leaves 964 words.

- Remove punctuation

- Remove stop words

- Lowercase


In [18]:
eval_path = '/content/drive/MyDrive/Colab Notebooks/regaetton_songs_nlp/data/eval_lyrics.csv'
train_path = '/content/drive/MyDrive/Colab Notebooks/regaetton_songs_nlp/data/train_lyrics.csv'
english_dict = '/content/drive/MyDrive/Colab Notebooks/regaetton_songs_nlp/data/english_dict.txt'
normalized_train_path = '/content/drive/MyDrive/Colab Notebooks/regaetton_songs_nlp/data/normalized_train_lyrics.csv'
normalized_eval_path = '/content/drive/MyDrive/Colab Notebooks/regaetton_songs_nlp/data/normalized_eval_lyrics.csv'

In [2]:
import pandas as pd
import nltk
import re
import string

from nltk.corpus import stopwords
from nltk import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
eval_data = pd.read_csv(eval_path)
train_data = pd.read_csv(train_path)
train_data.head()

Unnamed: 0,song_name,artist,lyrics,sexual_content,women_denigration,drugs
0,,La Base,che vos la corres de derretir\ny a mi que me i...,0,-1,-1
1,,Eddy Lover,\n\nVete si ya tu no me quieres\nya no sigo ma...,0,-1,-1
2,,Tego Calderon,Es Tego calder con el voltaje\n\n\n\n\n\nCon L...,1,-1,-1
3,,Hector El Father,Ladies and gentlemen\nAnd now from Puerto Rico...,0,-1,-1
4,,Alexis Y Fido,\n\nEsta es la última que me juego!!!\nSolo un...,0,-1,-1


The lyrics are separated into paragraphs. Earlier this mattered because we had to read the lyrics to create some labels, but that is no longer the case, so we ignore the paragraph structure from here on.

In [4]:
count = 0
words_deleted = list()

def remove_english_words(lyrics, english_words):
    """
    Lowercase the lyrics and remove every word that appears
    in the English dictionary. Also count how many words are
    removed and record which ones (in the module-level
    `count` and `words_deleted`).
    """
    global count
    global words_deleted
    words = lyrics.lower().split()
    new_lyrics = list()
    for word in words:
        if word in english_words:
            count += 1
            words_deleted.append(word)
        else:
            new_lyrics.append(word)

    return " ".join(new_lyrics)

def read_file(path):
    """Read the English word list, one word per line."""
    with open(path, encoding='utf-8') as f:
        # strip() removes the trailing \t (and any other whitespace) around each word
        return [word.strip() for word in f.read().split('\n')]
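Two things are worth keeping in mind about this step. It runs before punctuation removal, so a token such as `love,` does not match the dictionary entry `love`, and looking a word up in a list is a linear scan. The following is a minimal sketch of a variant, assuming it is acceptable to ignore leading and trailing punctuation during the lookup. `remove_english_words_strict` is a hypothetical name and is not part of this notebook; it returns the removed words instead of updating global counters.

```python
import string

def remove_english_words_strict(lyrics, english_words):
    """Hypothetical variant: set lookup and punctuation-insensitive matching.

    Returns the filtered lyrics and the list of removed words.
    """
    english_set = set(english_words)  # constant-time membership tests
    kept, removed = [], []
    for token in lyrics.lower().split():
        # Compare the token without leading/trailing punctuation
        core = token.strip(string.punctuation + "¿¡")
        if core and core in english_set:
            removed.append(core)
        else:
            kept.append(token)
    return " ".join(kept), removed

cleaned, removed = remove_english_words_strict(
    "Baby, you know que te quiero", ["you", "know", "baby"]
)
print(cleaned)   # que te quiero
print(removed)   # ['baby', 'you', 'know']
```

Whether the stricter matching helps depends on how much punctuation is attached to English words in the corpus, so the counts it produces would differ from the ones reported below.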

In [5]:
english_words = read_file(english_dict)
print(english_words[:10])
len(english_words)

['the', 'of', 'to', 'and', 'in', 'is', 'it', 'you', 'that', 'he']


964

# Removing English words from the train and eval datasets

In [6]:
train_data_copy = train_data.copy()
train_data_copy['lyrics'] = train_data_copy.lyrics.apply(lambda x: remove_english_words(x, english_words))

In [7]:
eval_data_copy = eval_data.copy()
eval_data_copy['lyrics'] = eval_data_copy.lyrics.apply(lambda x: remove_english_words(x, english_words))

In [8]:
print(f"We removed {count} english words. \nThe set of unique english words removed is {len(set(words_deleted))}")

We removed 40440 english words. 
The set of unique english words removed is 694


### Uncomment to see the words in the set

In [9]:
# set(words_deleted)
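Besides listing the unique words, it can be useful to see which removed words are the most frequent, since a very frequent entry may turn out to be a Spanish word that should not be in the dictionary. A small sketch using `collections.Counter`; the list `words_deleted_example` is a toy stand-in for the notebook's `words_deleted`.

```python
from collections import Counter

# Toy stand-in for the notebook's `words_deleted` list
words_deleted_example = ["baby", "you", "baby", "love", "you", "baby"]

removed_counts = Counter(words_deleted_example)
top_removed = removed_counts.most_common(2)
print(top_removed)  # [('baby', 3), ('you', 2)]
```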

# Remove punctuation

In [10]:
def remove_punctuation(lyrics):
    new_lyrics = ""
    for char in lyrics:
        if char not in string.punctuation:
            new_lyrics += char

    return new_lyrics
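Note that `string.punctuation` only contains ASCII characters, so Spanish marks such as `¿` and `¡` pass through the function above. A possible variant, sketched under the assumption that these marks and typographic quotes should also be dropped; `remove_punctuation_es` and `EXTRA_PUNCTUATION` are hypothetical names, not part of this notebook.

```python
import string

# Assumption: these extra marks are worth removing for this corpus; adjust as needed
EXTRA_PUNCTUATION = "¿¡«»“”‘’…"
# Translation table that maps every listed character to None (deletes it)
PUNCT_TABLE = str.maketrans("", "", string.punctuation + EXTRA_PUNCTUATION)

def remove_punctuation_es(lyrics):
    """Hypothetical variant of remove_punctuation that also drops Spanish marks."""
    return lyrics.translate(PUNCT_TABLE)

print(remove_punctuation_es("¿Qué pasa? ¡Dímelo, mami!"))  # Qué pasa Dímelo mami
```

`str.translate` also avoids building the result one character at a time, which may matter on long lyrics.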

In [11]:
train_data_copy['lyrics'] = train_data_copy.lyrics.apply(remove_punctuation)
eval_data_copy['lyrics'] = eval_data_copy.lyrics.apply(remove_punctuation)

# Remove stop-words and lowercase

As a starting point, we use the Spanish stop-word list from NLTK. The plan is to take out of that list the words that we consider meaningful for our task (detecting sexual content); for now, the code only extends the list with filler interjections such as "yeah" and "uh". After some trial and error, we may add or delete more words.
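Customizing the list in both directions can be sketched as a small helper. This is only an illustration: `build_stopwords` is a hypothetical function, the base list is a toy stand-in for `stopwords.words('spanish')`, and the words passed to `keep` are arbitrary examples rather than choices made in this project.

```python
def build_stopwords(base, keep=(), extra=()):
    """Return a set of stop words: the base list minus `keep`, plus `extra`."""
    return (set(base) - set(keep)) | set(extra)

# Toy base list; in the notebook this would be stopwords.words('spanish')
base = ["de", "la", "que", "sin", "contigo", "no"]
custom = build_stopwords(base, keep=["contigo", "no"], extra=["yeah", "uh"])
print(sorted(custom))  # ['de', 'la', 'que', 'sin', 'uh', 'yeah']
```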

In [12]:
stopwords_count = 0

# Build the stop-word set once: NLTK's Spanish list plus filler interjections
spanish_stopwords = set(stopwords.words('spanish'))
spanish_stopwords.update(['yeah', 'uhh', 'ehh', 'ie', 'ee', 'uh', 'yeh', 'ah', 'ohh', 'uohh'])

def remove_stopwords_lowercase(lyrics):
    """Lowercase and tokenize the lyrics, then drop stop words (counting them)."""
    global stopwords_count
    words = word_tokenize(lyrics.lower(), language='spanish')
    new_lyrics = list()
    for word in words:
        if word in spanish_stopwords:
            stopwords_count += 1
        else:
            new_lyrics.append(word)

    return " ".join(new_lyrics)




In [13]:
train_data_copy['lyrics'] = train_data_copy.lyrics.apply(remove_stopwords_lowercase)
eval_data_copy['lyrics'] = eval_data_copy.lyrics.apply(remove_stopwords_lowercase)

In [14]:
print(f"We deleted {stopwords_count} words in total")

We deleted 1141897 words in total


# Observe some examples

In [15]:
def show_lyrics(df, index):
    lyrics = df.lyrics.iloc[index]

    print(lyrics)

In [16]:
eval_data_copy.head()

Unnamed: 0,song_name,artist,lyrics,sexual_content,women_denigration,drugs
0,Te quedas o te vas,Nicky Jam,sé pasando siento conmigo jugando preguntando ...,0,0,0
1,Sola,Anuel,real muerte welcome remix remix infierno vi an...,1,1,1
2,Pensando en ti. Feat Wisin,Anuel,bebé real muerte w conexión anuel pensando pen...,1,1,0
3,Desperte sin ti,Noriel,desperté casa vino encima dime pasa diablo hic...,0,0,0
4,Tú y yo,Makano,ramo imagines simple formula caminan hadas aqu...,0,0,0


In [17]:
show_lyrics(eval_data_copy, 300)

combo bambalan hacen na hora verdad hacen na suelo va correr sangre fantasee muere combo tambalean fantasee tos pillemos prestado ustedes hacen na despertaron difunto mato papi apunto si bloque seria dueño punto hombre cabrones mirenme tos juntos envueltas loco amaneces moscas tas guerreando motherfucker tira pies bote amaneces loco tira damos soplamocon ando combo pistolón si pilla seguro parto melon ronques conozco lloron loco tires enano cabrón mama bichos roncan pasan hablando sabiendo tos ustedes semi cojones grammy notorio gamma usted payaso remi durmieron orilla país llevo tsunami va pasar manny tumba piquete benny socio usted vale penny papi pillemos vengas llamar mami


In [19]:
train_data_copy.to_csv(normalized_train_path, index=False)
eval_data_copy.to_csv(normalized_eval_path, index=False)