# Final Project: KidzBopifying Lyrics
Cleo De Rocco (cleo.m.de.rocco.24@dartmouth.edu)<br>
Abigail Kayser (abigail.e.kayser.24@dartmouth.edu)<br>
Stefel Smith (stefel.s.smith.24@dartmouth.edu)<br>

Dartmouth College, LING48, Spring 2023

Our project produces clean edits of music lyrics. This notebook contains the code for the phonetic, semantic, and combinatorial (both phonetic and semantic) approaches. Choose Run All to start the process. After the first full run, comment out the cell that downloads the fastText model: the download takes around 3 minutes and should not be rerun once the files exist. Make sure the files "combinedLyrics.txt" and "bad_words.txt" are in the same directory as the code so that they can be opened.

We first train a bigram model on a set of clean lyrics: the Kidzbop discography. We use NLTK's Laplace model because its add-one smoothing gives unseen bigrams a nonzero probability, which we need for the perplexity calculation that evaluates the suggestions. We also write a function using nltk's ConditionalFreqDist that returns up to 100 of the most frequent words following an input word.
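As a rough illustration of why smoothing matters here, the sketch below computes add-one (Laplace) bigram probabilities in plain Python. It is not the notebook's code: the helper names `train_bigram_counts` and `laplace_prob` and the toy sentence are assumptions made for this example, and it ignores the padding and unknown-word symbols that NLTK's `Laplace` model also counts.

```python
from collections import Counter

def train_bigram_counts(tokens):
    """Count left contexts and bigrams in a token list (hypothetical helper)."""
    context_counts = Counter(tokens[:-1])             # how often each word starts a bigram
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # how often each pair occurs
    vocab = set(tokens)
    return context_counts, bigram_counts, vocab

def laplace_prob(prev, word, context_counts, bigram_counts, vocab):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + |V|)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

# Toy training text, invented for this example
tokens = "we will dance all night we will sing all day".split()
context_counts, bigram_counts, vocab = train_bigram_counts(tokens)

seen = laplace_prob("we", "will", context_counts, bigram_counts, vocab)    # observed pair
unseen = laplace_prob("we", "day", context_counts, bigram_counts, vocab)   # never observed

print(seen, unseen)
```

The unseen pair still receives a small positive probability, so a sentence containing it has a finite perplexity instead of an infinite one.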

We then assemble our bag of bad words from a list of bad words, adding the stem of each word so that the bag is as robust as possible within the confines of our list.

We then write the phonetic and semantic functions, which are described in more detail in their own sections.

Our final two cells are the main program, which invites the user to tailor the replacements to their wishes. The user enters a sentence, lyric, or phrase, and the program flags potential bad words. The user can then start replacing or enter another input. For each flagged word, the user chooses between phonetic replacement, semantic replacement, phonetic-then-semantic replacement (which ignores the bigram model entirely), all three at once, or no replacement. The process continues until the user has given a response for every flagged word, at which point they are prompted for another sentence. To exit, the user must enter 'exit' at the first input prompt; if words have been flagged or replacement has started, they must finish that prompt sequence first.

The phonetic replacement usually returns its suggestions as a list. When it finds rhyming words among the bigram model's candidate words, each suggestion is printed in the format "[word] (3, "Rhymes!")": the word, then its phonetic score, then the label "Rhymes!". The phonetic score is the number of phonemes the word shares with the target, plus a one-point bonus for rhyming. The semantic replacement returns a list of words, each followed by a number, which is the distance computed by the semantic function. The list is sorted by increasing distance, so the first word is semantically closest to the target and the last is furthest. Finally, the combinatorial approach finds all of the words that rhyme with the target word, then passes those rhymes to the semantic function to produce a semantically ranked list of rhymes, in the same format as the semantic approach.
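The phonetic score described above can be sketched without any external data. In the example below, the pronunciations are hand-written in the CMUdict (ARPAbet) style for illustration only, and `rhyme_tail` is a simplified, hypothetical stand-in for the rhyme lookup that the notebook delegates to the `pronouncing` library.

```python
# Toy ARPAbet-style pronunciations, written by hand for this example
TOY_PRONUNCIATIONS = {
    "heck": ["HH", "EH1", "K"],
    "deck": ["D", "EH1", "K"],
    "help": ["HH", "EH1", "L", "P"],
    "moon": ["M", "UW1", "N"],
}

def rhyme_tail(phones):
    """Return the phonemes from the last stressed vowel to the end of the word."""
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1] in "12":      # vowels carry a stress digit; 1 or 2 means stressed
            return phones[i:]
    return phones

def phonetic_score(target, candidate, prons=TOY_PRONUNCIATIONS):
    """Shared distinct phonemes, plus one bonus point if the words rhyme."""
    t, c = prons[target], prons[candidate]
    overlap = len(set(t) & set(c))
    if t != c and rhyme_tail(t) == rhyme_tail(c):
        return overlap + 1, "Rhymes!"
    return overlap, ""

for word in ("deck", "help", "moon"):
    print(word, phonetic_score("heck", word))
```

With these toy entries, "deck" shares two phonemes with "heck" and rhymes with it, so it scores (3, "Rhymes!"), which is the format shown above.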

Lastly, the code for calculating perplexities is included in the final cell for evaluation of the model's performance.
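For readers unfamiliar with the metric, here is a minimal sketch of how perplexity follows from per-bigram probabilities: it is 2 raised to the average negative base-2 log probability. The function name `perplexity_from_probs` and the probabilities passed to it are made up for this example; in the notebook the value comes from the trained model's own `perplexity` method.

```python
import math

def perplexity_from_probs(bigram_probs):
    """Perplexity = 2 ** (average negative log2 probability of the bigrams)."""
    avg_neg_log = -sum(math.log2(p) for p in bigram_probs) / len(bigram_probs)
    return 2 ** avg_neg_log

# Invented probabilities: a sentence whose three bigrams each have probability 1/4
print(perplexity_from_probs([0.25, 0.25, 0.25]))

# A less likely bigram raises the perplexity
print(perplexity_from_probs([0.25, 0.25, 0.03125]))
```

Lower perplexity means the model finds the sentence more predictable, which is why a good replacement should not raise it sharply.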

Enjoy clean editing your lyrics!


## Imports

In [2]:
# Upgrade from version in the VM
!pip install -U nltk==3.4
import nltk
nltk.download('punkt')
nltk.download('cmudict')  # corpus required by cmudict.dict() in the phonetic functions

!pip install pronouncing
from nltk.corpus import cmudict
import pronouncing



[nltk_data] Downloading package punkt to
[nltk_data]     /Users/abbykayser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!




In [22]:
import io 
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.lm import Laplace
from nltk import word_tokenize, bigrams
from nltk.stem import SnowballStemmer
import fasttext
import fasttext.util
import numpy as np

## Load fastText English model

In [4]:
# Comment out this cell after the first run: the download takes about 3 minutes

!curl -o en.bin.gz https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
!gzip -d en.bin.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4294M  100 4294M    0     0  22.0M      0  0:03:14  0:03:14 --:--:-- 25.6M
en.bin already exists -- do you wish to overwrite (y or n)? ^C


## Bigram Model
Our bigram model is built from a dataset of Kidzbop lyrics.

In [39]:
# Open the lyrics file and read it into memory (the file is closed automatically)
with io.open('combinedLyrics.txt', encoding='utf8') as file:
    text = file.read()

In [24]:
# BIGRAM
n = 2
paddedLine = [list(pad_both_ends(word_tokenize(text.lower()), n))]
train, vocab = padded_everygram_pipeline(n, paddedLine)

# Train a n-gram Laplace model.
bigram_model = Laplace(n) 
bigram_model.fit(train, vocab)

# Tokenize the text into words
words = nltk.word_tokenize(text)

# Build frequency distribution of words that come after each word in the text
cfd = nltk.ConditionalFreqDist(
    (prev_word, next_word)
    for prev_word, next_word in nltk.bigrams(words)
)

# Define a function to get up to 100 of the most frequent words that come after an input word
def get_top_words(input_word):
    # Get the frequency distribution for the input word
    # (the lookup is lowercased, so only lowercase contexts in the lyrics are matched)
    freq_dist = cfd[input_word.lower()]

    # Get the top 100 most likely words that come after the input word
    top_100_words = freq_dist.most_common(100)

    # Return just the words, dropping their counts
    return [word[0] for word in top_100_words]


Top 10 words that come after "more":
['than', 'time', ',', 'I', 'dance', 'hours', 'But', 'shot', '...', "'Cause", '(', ')', 'things', 'they', 'now', 'And', 'gas', 'Ooh', 'So', 'It', 'exciting', 'song', 'Got', 'Sorry', 'love', 'Are', 'blue', 'you', 'Glasses', 'You', 'and', 'night', 'or', 'more', 'Then', 'Wasted', 'This', 'days', 'counting', 'Watch', 'what', 'Have', 'in', 'of', 'Oh', 'cynical', 'Atari', 'hits', 'style', '.', 'People', 'seats', 'frustrated', 'famous', 'Than', 'Said', 'Keep', 'try', 'She', 'she', 'Run', 'air', 'every', 'let', 'can', '?', 'Someone', 'like', 'smart', 'messed', 'Well', 'to']


## Bad Word Bag


In [77]:
# Open file
with io.open('bad_words.txt', encoding='utf8') as file:
    text = file.read()

In [102]:
# Tokenize the bad words
bad_words = word_tokenize(text)

# Create a set to store unique stemmed words
bag_of_words = set()

stemmer = SnowballStemmer("english")

# Add original bad words to the bag
bag_of_words.update(bad_words)

# Add stemmed versions of bad words to the bag
for word in bad_words:
    stemmed_word = stemmer.stem(word)
    if stemmed_word not in bad_words:
        bag_of_words.add(stemmed_word)

# Function to flag bad words in a sentence
def flag_bad_words(sentence):
    flagged_words = []
    for word in sentence:
        if word in bag_of_words:
            flagged_words.append(word)
    return flagged_words
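Because `flag_bad_words` returns only the words, later code has to search the sentence again to find each one, which always lands on the first occurrence. A possible variant, sketched here with a hypothetical name (`flag_bad_word_positions`) and a placeholder bag of mild words, returns positions as well so that repeated words and sentence-initial words can be handled explicitly.

```python
def flag_bad_word_positions(tokens, bag):
    """Return (index, word) pairs for every token found in the bag."""
    return [(i, tok) for i, tok in enumerate(tokens) if tok in bag]

def previous_word(tokens, index):
    """Return the token before `index`, or None when the word starts the sentence."""
    return tokens[index - 1] if index > 0 else None

# Placeholder bag and sentence, invented for this example
toy_bag = {"darn", "heck"}
tokens = "darn that darn cat".split()

flagged = flag_bad_word_positions(tokens, toy_bag)
print(flagged)
print([previous_word(tokens, i) for i, _ in flagged])
```

Each occurrence keeps its own index, so the word before it can be looked up without calling `list.index` a second time.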


## Phonetic Functions

In [32]:
#### PHONETIC ####

def calculate_phonetic_similarity(target_word, candidate_words):
    similarity_scores = {}
    
    # Load the CMU Pronouncing Dictionary
    pronouncing_dict = cmudict.dict()
    
    # Check if the target word exists in the CMU Pronouncing Dictionary
    if target_word.lower() not in pronouncing_dict:
        print("Target word not found in CMU Pronouncing Dictionary.")
        return similarity_scores
    
    # Use the first listed pronunciation of the target word
    target_phonemes = pronouncing_dict[target_word.lower()][0]
    # Look up the target word's rhymes once instead of once per candidate
    target_rhymes = set(pronouncing.rhymes(target_word))
    
    for candidate_word in candidate_words:
        # Check if the candidate word exists in the CMU Pronouncing Dictionary
        if candidate_word.lower() in pronouncing_dict:
            candidate_phonemes = pronouncing_dict[candidate_word.lower()][0]
            # Score = number of distinct phoneme symbols shared by the two words
            similarity_score = len(set(target_phonemes) & set(candidate_phonemes))
            if candidate_word in target_rhymes:
                similarity_scores[candidate_word] = (similarity_score + 1, "Rhymes!")
            else:
                similarity_scores[candidate_word] = (similarity_score, "")
        else:
            similarity_scores[candidate_word] = (0, "") # Assign a similarity score of 0 if candidate word not found
    
    return similarity_scores

# To find all of the words that strictly rhyme with the target word
def strict_rhymes(target_word, candidate_words, similarity_scores): 
    rhymes_from_ngram = []
    rhymes = []
    print_rhymes = []

    # look up the target word's rhymes once
    all_rhymes = pronouncing.rhymes(target_word)

    # append all words in candidates that rhyme to a list
    for word in candidate_words:
        if word in all_rhymes:
            rhymes_from_ngram.append(word)

    # print out the rhymes from the bigram model and their similarity scores calculated in calculate_phonetic_similarity
    if len(rhymes_from_ngram) != 0:
        print("Yay, there is one or more matching rhymes from the dataset! Try using:")
        for rhyme in rhymes_from_ngram:
            print(rhyme, similarity_scores[rhyme])
    # otherwise print a (cleaned) list of all of the words that rhyme with the target word
    else:
        rhymes = all_rhymes
        print("No rhymes were found from the dataset. Try any of these instead!")
        for rhyme in rhymes:
            if rhyme not in bad_words:
                print_rhymes.append(rhyme)
        print(print_rhymes)

    return print_rhymes

def calculate_suggestion(target_word, candidate_words):
    similarity_scores = calculate_phonetic_similarity(target_word, candidate_words)
    rhymes = strict_rhymes(target_word, candidate_words, similarity_scores)
    return rhymes

def phonetics(target_word, candidate_words):
    rhymes = calculate_suggestion(target_word, candidate_words)
    return rhymes
    

## Semantic Functions

In [11]:
#### SEMANTICS ####

embeddings = fasttext.load_model('en.bin')   # load the embeddings into memory

# takes in the bad word and a list of possible replacement words (e.g. from the bigram model)
# and returns (word, Euclidean distance) pairs sorted from closest to furthest
def similarity(bad_word, edits):
    edit_scores = {}
    w1 = embeddings.get_word_vector(bad_word)   # target vector, computed once
    for word in edits:
        if word not in bag_of_words:            # never suggest another bad word
            w2 = embeddings.get_word_vector(word)
            dist = np.linalg.norm(w2 - w1)
            edit_scores[word] = dist 
    
    sorted_edit_scores = sorted(edit_scores.items(), key=lambda x: x[1])
    return sorted_edit_scores 


def semantics(bad_word, suggested_edits):
    print(f"Top words semantically similar to {bad_word}:\n")
    print(similarity(bad_word,suggested_edits))
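To make the ranking step concrete without loading the fastText model, the sketch below applies the same idea (Euclidean distance, sorted from closest to furthest) to tiny hand-made vectors. The 3-dimensional vectors, the words, and the name `rank_by_distance` are all invented for illustration; real fastText vectors have 300 dimensions.

```python
import math

# Invented 3-dimensional "embeddings", for illustration only
TOY_VECTORS = {
    "darn":   (1.0, 0.0, 0.0),
    "dang":   (1.0, 0.0, 0.5),
    "shoot":  (1.0, 1.0, 0.0),
    "banana": (4.0, 4.0, 0.0),
}

def rank_by_distance(target, candidates, vectors):
    """Return (word, distance) pairs sorted by increasing Euclidean distance."""
    target_vec = vectors[target]
    scored = [(word, math.dist(target_vec, vectors[word])) for word in candidates]
    return sorted(scored, key=lambda pair: pair[1])

ranked = rank_by_distance("darn", ["banana", "shoot", "dang"], TOY_VECTORS)
print(ranked)
```

The closest word comes first, so a caller can take the head of the list as the preferred suggestion. Passing a list of rhymes as the candidates gives the combined phonetic and semantic ranking.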



## Replacement Program, user input

In [103]:
# this cell contains helper functions
# prompts user for input and calls replacement functions
def replace(sentence, words):
    for bad_word in words:
        
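        # get the word before the bad word and the candidate words that follow it in the lyrics
        position = sentence.index(bad_word)
        if position > 0:
            candidate_words = get_top_words(sentence[position - 1])
        else:
            # no preceding word: index -1 would wrap around to the last word
            candidate_words = []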

        replacing = True
        while(replacing):
            replace_choice = input("Which type of replacement would you like for '" + str(bad_word) + "' ? Enter [p]honetic, [s]emantic, [b]oth phonetic and semantic, [a]ll, or [n]one:     ")
            
            # PHONETIC
            if replace_choice == "p":  
                replacing = False
                print("You chose phonetic replacement for " + str(bad_word) + ".")
                rhymes = phonetics(bad_word, candidate_words)

            # SEMANTIC
            elif replace_choice == "s":
                replacing = False
                print("You chose semantic replacement for " + str(bad_word) + ".")
                semantics(bad_word, candidate_words)
            
            # COMBINATORIAL
            elif replace_choice == "b":
                replacing = False
                print("You chose phonetic and semantic replacement for " + str(bad_word) + ".")
                both_candidates = phonetics(bad_word, [])
                semantics(bad_word, both_candidates)
            
            # ALL 
            elif replace_choice == "a":
                replacing = False
                print("You chose all replacements for " + str(bad_word) + ".")
                rhymes = phonetics(bad_word, candidate_words)
                semantics(bad_word, candidate_words)
                both_candidates = phonetics(bad_word, [])
                semantics(bad_word, both_candidates)
            
            # If the user chooses not to replace the word
            elif replace_choice == "n":
                replacing = False
                print("You chose not to replace " + str(bad_word) + ".")
            
            else:
                print("Please enter [p]honetic, [s]emantic, [b]oth phonetic and semantic, [a]ll, or [n]one")  

# Helper function to remove punctuation
def clean(userString): 

    userString = userString.replace(",", "")
    userString = userString.replace("' ", "  ")
    userString = userString.replace(".", "")
    userString = userString.replace("!", "")
    userString = userString.replace("?", "")
    userString = userString.replace(" '", "  ")
    userString = userString.lower()


    return userString


In [106]:
# Running this cell initiates the code
stemmer = SnowballStemmer("english")

responding = True
while(responding):
    userString = input("USER:     ")
    
    # exit if the user is done
    if userString == "exit":
        print("Goodbye!")
        responding = False
        break
    
    # print and clean the string
    print("You entered: '" + str(userString) + "'")
    userString = clean(userString)
    
    # split the string and stem it to find stemmed bad words
    input_string = userString.split()
    stemmed_string = []
    output = flag_bad_words(input_string)       # flag initial bad words
    
    for word in input_string:
        stemmed_word = stemmer.stem(word)
        if word not in output:                      # do not add repeats
            stemmed_string.append(stemmed_word)
        else:
            stemmed_string.append("")
        
    # add stem bad words to bad words
    stemmed_output = flag_bad_words(stemmed_string)
    for word in stemmed_output:
        output.append(input_string[stemmed_string.index(word)])


    if not output:
        print("No bad words found.")
        continue

    # initiate replacement process if wished by the user
    replacing = True
    while(replacing):
        print("Flagged words: " + str(output))
        replace_want = input("Start replacement process for the word(s)? Enter [y] [n]:    ")
        
        if replace_want == "y":
            replace(input_string, output)
            replacing = False
        
        elif replace_want == "n":
            print("Continue entering sentences to flag.")
            replacing = False
        
        else:
            print("Please enter [y] or [n]")


Goodbye!


In [107]:
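# Replace the placeholder with the sentence(s) to evaluate
test_sentences = ["[enter a sentence here to calculate the perplexity!]"]
# Lowercase the tokens to match the lowercased training text
tokenized_text = [list(map(str.lower, word_tokenize(sent))) for sent in test_sentences]

# Split each sentence into unpadded bigrams and score it with the Laplace model
test_data = [bigrams(t,  pad_right=False, pad_left=False) for t in tokenized_text]
for i, test in enumerate(test_data):
    print("PP({0}):{1}".format(test_sentences[i], bigram_model.perplexity(test)))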

PP([enter a sentence here to calculate the perplexity!]):6458.06842686986
