# The Profanity Project: A Comparative Study of Popular Rap and Pop Lyrics

## 1. Introduction & Background

This research studies the contextual use of profanity across music genres. By comparing two popular genres, rap and pop, we aim to provide a nuanced perspective on language in popular music, potentially challenging (or reaffirming) stereotypes regarding the use of explicit language within rap. Common preconceptions characterise rap music as aggressive, offensive, misogynistic or even violent in its use of profanity (Howard et al., 2021). One study found that rap accounted for 76% of the profanity in top 10 songs on the Billboard Hot 100 from 2009 to 2018 (Moloney and Sylva, 2020). Another found that rap ranked first for profanity, with a swear word every 47 words, while pop ranked fourth, with a swear word every 904 words (Medium, 2015).

However, profanity can serve various purposes, from expressing intense emotions (both positive and negative) to emphasising statements or simply creating linguistic rhythm. For instance, while the term 'bitch' is frequently used to dehumanise and insult women, it can also serve as a symbol of empowerment or an expression of camaraderie. By examining these functions, this research contributes to a more comprehensive understanding of the complex role of profanity and of linguistic choices in contemporary musical expression.

Pop music is often produced for mass consumption, so it tends to feature catchy, simple melodies, repetitive beats and straightforward lyrics (Musical Dictionary, 2024). As a result, it is often the most accessible and recognisable genre, but because of its broad target audience, pop music typically avoids profanity or uses it sparingly. Conversely, rap music, while also a form of popular music, emerged in the 1970s through the creativity of urban Black performers. It is commonly identified by its repetitive beat patterns and rapid, slang-heavy, and often boastful rhyming vocal delivery (Dictionary.com, 2025).

Over the past decade, rap music has achieved significant prominence in mainstream media, propelled by artists such as Kendrick Lamar, Travis Scott, and Drake, and now accounts for a substantial share of contemporary music airplay (Bruner, 2018). The goal of this project is to conduct research that explores usage of profanity in rap and pop music lyrics, addressing the question: 
- *How does the sentiment and usage of profanity differ between rap and pop music lyrics, and does this support or challenge stereotypes about profanity in rap music?*

To gain insight into the music commonly consumed by the general public within the rap and pop genres, this study utilises Apple Music's playlists, *Top 100 2024: Shazam’s Radio Chart*. This resource includes curated playlists for several genres, including hip-hop and pop, comprising songs that are currently popular on radio stations worldwide. As such, it serves as a representative sample of the music that is being listened to by the general public. 

Although a rap playlist was not available, we determined that the hip-hop playlist would be a suitable alternative, given the close relationship between the two terms and their frequent interchangeable use. While all rap falls under the umbrella of hip-hop, not all hip-hop is rap, as hip-hop also encompasses instrumental tracks and other vocal styles. In mainstream media, however, the term "hip-hop" is often used synonymously with rap. It is important to emphasise that hip-hop represents a broader cultural movement, whereas rap is the musical art form that serves as a central component of this culture (Medium, 2022).

## 2. Tutorial

In this tutorial, we will walk through the process of collecting, managing, exploring, analysing, and visualising this study's data using Python. Our focus will be on analysing song lyrics to extract and annotate profanity, while also providing context for each instance. We will use various Python libraries, including `pandas`, `selenium`, `BeautifulSoup` and `langdetect`, to achieve our goal.

### 2.1 Collecting the Data

#### Web Scraping Lists of Song Titles and Artists for Each Genre

In this section, we will collect data from the *Top 100 2024: Shazam Radio Charts* for both Hip-Hop and Pop genres:

- Hip-Hop: https://music.apple.com/be/playlist/top-100-2024-shazam-radio-chart-hip-hop/pl.745180bfbd374d56aa07d84c9420e5dc
- Pop: https://music.apple.com/nl/playlist/top-100-2024-shazam-radio-chart-pop/pl.bdb7c5da0c8c44479af4299a62c67b78?l=en-GB

We are interested in the song titles and artists, which we will use later on to scrape the lyrics of those specific songs from [AZlyrics](https://www.azlyrics.com/).

To do this, we will use the Selenium package. The data we want to scrape is dynamically loaded, meaning that it is not present in the initial HTML response you receive when you request the webpage (for example with `requests`, before parsing the result with BeautifulSoup). Instead, the content is generated and inserted into the page after the initial load, typically through JavaScript. For example, a webpage might initially load an empty list and then use JavaScript to fetch song titles and artists from a server and display them. If you view the page source in this case, you may not see the song titles and artists, because they are added to the DOM only after the page has loaded. Selenium is a web automation tool that controls a web browser programmatically. It can wait for JavaScript to execute and for elements to be rendered on the page, allowing you to scrape dynamically loaded content.

*Note that the code in this notebook was written specifically to extract the data from Apple Music's Top 100 2024: Shazam Radio Charts as they were at the time of this study (January 2025). Apple Music may change the songs listed in these playlists at some point, which may have a minor effect on how well some of the code below works.*

#### Installation

To get started, you first need to install the required packages:

In [13]:
# If you don't already have the packages below installed you can uncomment the following lines and run the code

# !pip install -q selenium
# !pip install -q webdriver-manager

#### Code to Scrape Music Data

Run the following code cell to scrape the song titles and artists for both the Hip-Hop and Pop songs. This results in one list per genre, consisting of tuples in which the first item is the song title and the second is the artist.

*Note that on Apple Music some songs list multiple artists. However, the AZLyrics URL is structured using only one artist name, so we will extract only the first artist listed in the Apple Music playlists for each song.*

In [16]:
# Import the required libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

def scrape_music_data(url):
    # Set up Selenium WebDriver
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service)

    # Initialise empty song_titles and artists lists
    song_titles = []
    artists = []

    # Open the webpage
    driver.get(url)

    try:
        # Wait for the song titles to be present
        WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.CLASS_NAME, 'songs-list-row__song-name'))
        )

        # Find all song title elements
        title_tags = driver.find_elements(By.CLASS_NAME, 'songs-list-row__song-name')

        # Extract song titles
        song_titles = [title.text for title in title_tags]

        # Find all song artist containers
        artist_containers = driver.find_elements(By.XPATH, "//div[contains(@class, 'songs-list__col songs-list__col--secondary svelte-113if15')]")

        # Extract the first artist from each song
        for container in artist_containers:
            # Find all artist <a> tags in the current container
            artist_tags = container.find_elements(By.XPATH, ".//a[@data-testid='click-action']")
            
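            # Only append the first artist; fall back to an empty string if none is found,
            # so that titles and artists stay aligned when they are zipped together
            artists.append(artist_tags[0].text if artist_tags else "")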

    except Exception as e:
        print("An error occurred:", e)
        return []  # Return an empty list on error

    finally:
        # Close the driver
        driver.quit()

    # Create a list of tuples (song title, artist)
    song_artist_tuples = list(zip(song_titles, artists))

    return song_artist_tuples

# Scrape Hip-Hop songs' data
url = 'https://music.apple.com/be/playlist/top-100-2024-shazam-radio-chart-hip-hop/pl.745180bfbd374d56aa07d84c9420e5dc'
songs_hh = scrape_music_data(url)

# Scrape Pop songs' data
url = 'https://music.apple.com/nl/playlist/top-100-2024-shazam-radio-chart-pop/pl.bdb7c5da0c8c44479af4299a62c67b78?l=en-GB'
songs_pop = scrape_music_data(url)

#### Formatting Song titles and Artists for Web Scraping

To scrape the lyrics of the songs whose titles and artists we just retrieved, we first need a function that takes our list of tuples and returns it with the song titles and artists formatted the way AZLyrics expects them in its URLs.

For example, the song title `MILLION DOLLAR BABY (VHS)` by `Tommy Richman` needs to be converted to `milliondollarbaby` and `tommyrichman`.
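
As a minimal sketch of that conversion, the snippet below applies the same three steps (drop bracketed text, drop punctuation, lowercase and remove spaces) to the example above. The helper name `to_azlyrics_slug` is hypothetical and used only for illustration; the URL pattern is the one used by the lyrics-scraping function later in this tutorial.

```python
import re

def to_azlyrics_slug(text):
    """Hypothetical helper: strip bracketed text and punctuation, lowercase, drop spaces."""
    text = re.sub(r'\(.*?\)|\[.*?\]', '', text)  # remove "(VHS)", "[Remix]", ...
    text = re.sub(r'[^\w\s]', '', text)           # remove punctuation
    return text.lower().replace(" ", "")          # lowercase and remove spaces

title_slug = to_azlyrics_slug("MILLION DOLLAR BABY (VHS)")
artist_slug = to_azlyrics_slug("Tommy Richman")
example_url = f"https://www.azlyrics.com/lyrics/{artist_slug}/{title_slug}.html"

print(title_slug, artist_slug)
print(example_url)
```

Whether a page actually exists at the resulting URL still depends on how AZLyrics spells the artist and title.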

*Note that some artists' names are formatted or spelled differently on AZLyrics than on Apple Music, so we will need to account for those specifically.*

Run the following code cell to format the song titles and artists correctly.

In [18]:
# Import regular expressions library
import re

def format_song_data(song_data):
    # Initialise empty list for formatted data
    formatted_data = []

    # Artist names that AZLyrics spells differently from Apple Music
    # (defined once, outside the loop)
    artist_replacements = {
        'Macklemore & Ryan Lewis': 'macklemore',
        'Latto': 'missmulatto',
        'GIMS': 'matregims',
        'The Notorious B.I.G.': 'notoriousbig',
        'The Kid LAROI': 'kidlaroi',
        'P!nk': 'pink',
        'The Human League': 'humanleague'
    }

    for title, artist in song_data:
        # Replace the artist name if it matches any key in the replacements dictionary
        artist = artist_replacements.get(artist, artist)

        # For artist, remove text in (square) brackets and punctuation, then format to lowercase and remove spaces
        formatted_artist = re.sub(r'\(.*?\)|\[.*?\]', '', artist)
        formatted_artist = re.sub(r'[^\w\s]', '', formatted_artist) 
        formatted_artist = formatted_artist.lower().replace(" ", "")

        # For title, remove text in (square) brackets and punctuation, then format to lowercase and remove spaces
        formatted_title = re.sub(r'\(.*?\)|\[.*?\]', '', title)
        formatted_title = re.sub(r'[^\w\s]', '', formatted_title)  
        formatted_title = formatted_title.lower().replace(" ", "")
        
        # Replace 'ºC' in the title
        formatted_title = formatted_title.replace('ºc', 'oc')  # Ensure 'ºC' is replaced with 'oc'
        
        # Append the formatted tuple to the list
        formatted_data.append((formatted_title, formatted_artist))
        
    return formatted_data

# Format the song data for Hip-Hop and Pop
formatted_songs_hh = format_song_data(songs_hh)
formatted_songs_pop = format_song_data(songs_pop)

#### Web Scraping Lyrics Function

Next, we will define a function to scrape the lyrics for each song from AZLyrics using BeautifulSoup:

In [22]:
# Import requests and the BeautifulSoup library.
import requests
from bs4 import BeautifulSoup

def scrape_lyrics(formatted_title, formatted_artist):
    # Construct URL using formatted title and artist
    url = f'https://www.azlyrics.com/lyrics/{formatted_artist}/{formatted_title}.html'

    try:
        # Get HTML data, extract text and transform into BeautifulSoup document
        # (the timeout keeps a stalled request from hanging the scraping loop)
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # Raise error if not successful
        document = BeautifulSoup(response.text, 'html.parser') 

        # Find all <div> elements where lyrics are found
        lyrics_tag = document.find_all("div", attrs={"class": None, "id": None})

        # Extract text from each lyric tag and strip leading/trailing whitespace
        lyrics = [content.get_text(separator=" ").strip() for content in lyrics_tag]

        # Join lines to form complete lyrics
        lyrics_text = "\n".join(lyrics)
        
        return lyrics_text

    # Print error message if issue fetching the lyrics
    except requests.exceptions.RequestException as e:
        print(f"Error fetching lyrics for {formatted_title} by {formatted_artist}: {e}")
        return None 


#### Saving Lyrics as Files Function

We will also define a function to save the lyrics into text files:

In [25]:
# Import os module 
import os

def save_lyrics_to_file(formatted_title, formatted_artist, lyrics, genre):
    # Use os to create directory for the genre if it doesn't exist
    os.makedirs(f'lyrics/{genre}', exist_ok=True)
    
    # Create filename
    filename = f"lyrics/{genre}/{formatted_artist}_{formatted_title}.txt"
    
    # Write lyrics to file
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(lyrics)
    print(f"Lyrics saved to {filename}")

#### Scraping and Saving Lyrics

Now, we will loop through our formatted song data to scrape and save the lyrics of both genres using the two functions we just defined:

*Note that despite our efforts to account for irregularly formatted song titles and artists in the `format_song_data` function, some lyrics (6 Hip-Hop songs, 10 Pop songs) will still fail to scrape, either because their AZLyrics URL is formatted irregularly or because AZLyrics does not list the same first artist as Apple Music for songs with multiple artists.*

In [28]:
# Import time module
import time

# Scrape and save lyrics for Hip-Hop
for formatted_title, formatted_artist in formatted_songs_hh:
    lyrics = scrape_lyrics(formatted_title, formatted_artist)
    if lyrics:
        save_lyrics_to_file(formatted_title, formatted_artist, lyrics, 'hiphop')
        print(f"Lyrics for {formatted_title} by {formatted_artist} have been saved in the hiphop folder.\n")
    else:
        print(f"Failed to scrape lyrics for: {formatted_title} by {formatted_artist}")

    time.sleep(10)  # Use sleep to avoid triggering anti-bot measures


# Scrape and save lyrics for Pop
for formatted_title, formatted_artist in formatted_songs_pop:
    lyrics = scrape_lyrics(formatted_title, formatted_artist)
    if lyrics:
        save_lyrics_to_file(formatted_title, formatted_artist, lyrics, 'pop')
        print(f"Lyrics for {formatted_title} by {formatted_artist} have been saved in the pop folder.\n")
    else:
        print(f"Failed to scrape lyrics for: {formatted_title} by {formatted_artist}")

    time.sleep(10)  # Use sleep to avoid triggering anti-bot measures


Lyrics saved to lyrics/hiphop/jackharlow_lovinonme.txt
Lyrics for lovinonme by jackharlow have been saved in the hiphop folder.

Lyrics saved to lyrics/hiphop/dojacat_paintthetownred.txt
Lyrics for paintthetownred by dojacat have been saved in the hiphop folder.

Lyrics saved to lyrics/hiphop/eminem_houdini.txt
Lyrics for houdini by eminem have been saved in the hiphop folder.

Lyrics saved to lyrics/hiphop/tommyrichman_milliondollarbaby.txt
Lyrics for milliondollarbaby by tommyrichman have been saved in the hiphop folder.

Lyrics saved to lyrics/hiphop/dojacat_agorahills.txt
Lyrics for agorahills by dojacat have been saved in the hiphop folder.

Lyrics saved to lyrics/hiphop/postmalone_circles.txt
Lyrics for circles by postmalone have been saved in the hiphop folder.

Lyrics saved to lyrics/hiphop/coolio_gangstasparadise.txt
Lyrics for gangstasparadise by coolio have been saved in the hiphop folder.

Error fetching lyrics for whatitis by doechii: 404 Client Error: Not Found for url: h

### 2.2 Managing the Data

Having collected the lyrics, we will now write a script with the following features:
1. Preprocessing of lyrics (lowercase conversion, punctuation removal, empty line elimination)
2. Profanity detection using a predefined list
3. Extraction of lines containing profanities, including context (lines before and after)
4. Highlighting of detected profanities for easy identification when annotating the resulting corpus
5. Generation of statistics including:
   - Profanity count per genre
   - Total word count per genre
   - Frequency of occurrence for each profanity per genre

These features will then all be combined into one function that analyses the lyrics and generates a corpus of all lines containing profanity, together with their context, so that they can be annotated for sentiment. The same function also generates the above-mentioned statistics.
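
To make the detection, context and highlighting steps concrete before defining the full functions, here is a minimal sketch on made-up data. The three toy lines and the two-word profanity list are assumptions for illustration only; they are not taken from the corpus or from the profanity list used in this study.

```python
import re

# Toy, already-preprocessed lines and a toy profanity list (illustration only)
toy_lines = ["first line", "well damn that hurt", "classic last line"]
toy_profanities = ["damn", "ass"]

hits = []
for i, line in enumerate(toy_lines):
    for word in toy_profanities:
        pattern = r"\b" + re.escape(word) + r"\b"  # whole-word match only
        if re.search(pattern, line):
            before = toy_lines[i - 1] if i > 0 else ""
            after = toy_lines[i + 1] if i < len(toy_lines) - 1 else ""
            highlighted = re.sub(pattern, word.upper(), line)
            hits.append((before, highlighted, after, word))

for hit in hits:
    print(hit)
```

Because of the `\b` word boundaries, "ass" is not detected inside "classic", so only the middle line is extracted, together with the line before and the line after it.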

The predefined profanity list used in this analysis can be found [here](https://gist.github.com/ryanlewis/a37739d710ccdb4b406d). It is sourced from a GitHub Gist by Ryan Lewis and consists of a curated collection of common swear words.

#### Installation

To get started, install the `langdetect` package if needed and import the functions we use from it:

In [42]:
# If you don't already have the package below installed you can uncomment the following line and run the code
# !pip install -q langdetect

# Import the language detection functions (required by the cells below)
from langdetect import detect, DetectorFactory

#### Setting Up Functions

We will first define all the necessary separate functions:

In [45]:
# Set seed for reproducibility
DetectorFactory.seed = 0

def preprocess_lyrics(lyrics):
    """Convert lyrics to lowercase, remove punctuation, and remove empty lines."""
    lyrics = lyrics.lower()
    lyrics = re.sub(r'[^\w\s]', '', lyrics)
    lines = lyrics.splitlines()
    non_empty_lines = [line for line in lines if line.strip()]
    return non_empty_lines 

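def load_profanity_list(file_path):
    """Load the profanity list from a file, skipping blank lines."""
    with open(file_path, 'r', encoding='utf-8') as f:
        # A blank entry would become the pattern \b\b, which matches almost every line
        profanity_list = [line.strip().lower() for line in f if line.strip()]
    return profanity_list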

def extract_profanity_lines(processed_lines, profanity_list):
    """Extract lines containing profanities and return them along with context."""
    profanity_lines = []
    
    for index, line in enumerate(processed_lines):
        for profanity in profanity_list:
            # Use regex to check for whole word matches
            if re.search(r'\b' + re.escape(profanity) + r'\b', line, flags=re.IGNORECASE):
                # Initialise before and after lines
                before = processed_lines[index - 1] if index > 0 else ""  # Line before, or empty if it's the first line
                after = processed_lines[index + 1] if index < len(processed_lines) - 1 else ""  # Line after, or empty if it's the last line
                profanity_lines.append((before, line, after, profanity))  # Include the specific profanity

    return profanity_lines

def highlight_profanity(profanity_line, profanity_list):
    """Highlight profanity words in the given line by converting them to uppercase."""
    for profanity in profanity_list:
        # Use regex to replace the profanity with its uppercase version
        profanity_line = re.sub(
            r'\b' + re.escape(profanity) + r'\b',  # Ensure whole word match
            lambda match: match.group(0).upper(),  # Convert matched profanity to uppercase
            profanity_line,
            flags=re.IGNORECASE
        )
    return profanity_line

#### Analysing Lyrics

Now, we will combine all our functions into a single one, which will process the lyrics and generate a DataFrame containing the results.

In [48]:
# Import necessary libraries
import os
import pandas as pd
from collections import Counter
import csv

def analyse_lyrics(base_directory='lyrics', profanity_file='profanity_list'):
    # Load the profanity list
    profanity_list = load_profanity_list(profanity_file)
    
    # Initialise list to store profanity lines (converted to a DataFrame at the end)
    profanity_data = []
    
    # Counters for total word count and profanity counts
    genre_word_count = {'hiphop': 0, 'pop': 0}
    genre_profanity_count = {'hiphop': 0, 'pop': 0}
    profanity_frequency = {'hiphop': Counter(), 'pop': Counter()}

    # Walk through the base directory
    for genre in os.listdir(base_directory):
        genre_path = os.path.join(base_directory, genre)
        
        if os.path.isdir(genre_path):
            for filename in os.listdir(genre_path):
                if filename.endswith('.txt'):
                    # Read lyrics from each file
                    with open(os.path.join(genre_path, filename), 'r', encoding='utf-8') as file:
                        lyrics = file.read()
                    
                    # Check if lyrics are in English and skip if non-English
                    try:
                        if detect(lyrics) != 'en':
                            continue 
                    except Exception as e:
                        print(f"Error detecting language for {filename}: {e}")
                        continue

                    # Extract song title and artist from the filename for metadata
                    name_without_extension = filename[:-4]
                    parts = name_without_extension.split('_', 1)
                    if len(parts) == 2:
                        formatted_artist = parts[0]
                        formatted_title = parts[1]
                        
                        # Preprocess the lyrics
                        processed_lines = preprocess_lyrics(lyrics)
                        
                        # Calculate total word count
                        word_count = sum(len(line.split()) for line in processed_lines)
                        genre_word_count[genre] += word_count
                        
                        # Extract profanity lines
                        profanity_lines = extract_profanity_lines(processed_lines, profanity_list)
                        genre_profanity_count[genre] += len(profanity_lines)

                        # Build one corpus row per detected profanity
                        for before, line, after, profanity in profanity_lines:
                            highlighted_profanity_line = highlight_profanity(line, profanity_list)

                            # Combine lines into one string with line breaks
                            combined_line = f"{before}\n{highlighted_profanity_line}\n{after}" 
                            profanity_data.append((formatted_title, formatted_artist, genre, combined_line))

                            # Count frequency of each profanity
                            if profanity in line:
                                profanity_frequency[genre][profanity] += 1

    # Create a DataFrame from the extracted profanity lines
    df_profanity = pd.DataFrame(profanity_data, columns=['Song Title', 'Artist', 'Genre', 'Profanity Lines'])

    # Create new empty column for sentiment annotation
    df_profanity['Sentiment Annotation'] = None  

    # Print results
    print("Total Word Count by Genre:", genre_word_count)
    print("Total Profanity Count by Genre:", genre_profanity_count)
    print("Profanity Frequency by Genre:", {genre: dict(freq) for genre, freq in profanity_frequency.items()})
    
    return df_profanity, genre_word_count, genre_profanity_count, profanity_frequency 

#### Running the Analysis and Saving the Results

Next, we will call the function and save the resulting DataFrame to a CSV file ready for annotation:

In [51]:
# Call the function
df_profanity, genre_word_count, genre_profanity_count, profanity_frequency = analyse_lyrics()

# Save the DataFrame to a CSV file
df_profanity.to_csv('full_profanity_lyrics_corpus.csv', index=False, quoting=csv.QUOTE_ALL) # Use index=False to avoid writing row indices

Total Word Count by Genre: {'hiphop': 43815, 'pop': 31335}
Total Profanity Count by Genre: {'hiphop': 633, 'pop': 92}
Profanity Frequency by Genre: {'hiphop': {'ass': 49, 'bitch': 85, 'dick': 15, 'damn': 34, 'butt': 2, 'fuck': 59, 'niggas': 50, 'shit': 71, 'fuckin': 25, 'fucking': 12, 'cock': 2, 'prick': 2, 'motherfuckin': 5, 'balls': 2, 'pussy': 29, 'bitches': 38, 'hell': 10, 'sex': 15, 'nigga': 73, 'dink': 1, 'fucked': 10, 'goddamn': 1, 'shits': 2, 'motherfucker': 16, 'piss': 1, 'bum': 5, 'cum': 1, 'tits': 2, 'tit': 1, 'fag': 1, 'titties': 2, 'motherfuck': 1, 'asses': 2, 'pissed': 2, 'porno': 1, 'penis': 1, 'pussies': 1, 'clitoris': 1, 'viagra': 1, 'crap': 1, 'pornography': 1}, 'pop': {'hell': 8, 'sex': 1, 'bitch': 8, 'damn': 18, 'fuck': 3, 'shit': 26, 'ass': 1, 'fuckin': 4, 'goddamn': 8, 'fucked': 6, 'lust': 1, 'motherfucker': 2, 'snatch': 2, 'faggot': 3, 'fucks': 1}}


#### Creating a Smaller DataFrame for Pilot Annotation

Before annotating the entire corpus, we want to run a pilot annotation to ensure our main annotator and second annotator (for quality control) are on the same page when it comes to defining the different sentiments of each profanity occurrence. To facilitate the annotation process, guidelines were written to define each sentiment category (with examples included). These guidelines can be found as a `.pdf` file under the name `annotation_guidelines` in the same repository as this file.

For the pilot annotation, we will create a smaller DataFrame with 20 rows for each genre from the full corpus we created:

In [54]:
# Initialise an empty list to hold the smaller DataFrame rows
smaller_df_rows = []

# Get the unique genres in the DataFrame
genres = df_profanity['Genre'].unique()

# Loop through each genre and sample the specified number of rows
for genre in genres:
    genre_rows = df_profanity[df_profanity['Genre'] == genre]
    sampled_rows = genre_rows.sample(n=min(20, len(genre_rows)), random_state=1)  # Sample with a fixed random state for reproducibility
    smaller_df_rows.append(sampled_rows)

# Concatenate the sampled rows into a new DataFrame
df_profanity_pilot = pd.concat(smaller_df_rows, ignore_index=True)

# Save the DataFrame to a CSV file
df_profanity_pilot.to_csv('pilot_profanity_lyrics_corpus.csv', index=False, quoting=csv.QUOTE_ALL) # Again use index=False to avoid writing row indices

This pilot corpus will now be annotated by our annotators individually, after which they shall discuss their annotations and alter the guidelines if needed before starting on the annotation of the full corpus. 

### 2.3 Exploring & Analysing the Data

**Notes for Pragati**

There are 2 parts to the analysis since we're technically trying to answer two questions, one on the usage of profanity, if rap actually uses more swearing than other genres (and I thought it might be interesting to see which ones are used more by which genre), and one on sentiment (for which the annotations must be done).

- Usage:\
I built in some counters into the `analyse_lyrics` function (they created dictionaries) so that you can use those numbers for analysis and visualisations and interpret/discuss them:\
~ The dict 'genre_word_count' gives the total word count by genre.\
~ The dict 'genre_profanity_count' gives the total profanity count by genre.\
~ The dict 'profanity_frequency' gives the frequency of occurrence of each profanity by genre.

[This article](https://medium.com/musixmatch-blog/profanity-in-lyrics-most-used-swear-words-and-their-usage-by-popular-genres-d8a12c776713) might serve as an inspiration.\
[The notebook from this project](https://github.com/pilyu/Collecting-Data-) might serve as an example.

- Sentiment:\
~ Calculate the Inter Annotator Agreement for a quality control of the annotations. One of you is technically the 'main' annotator and one the 'help' annotator whose annotations are used basically to check if the main annotations are 'accurate':\
Load in the annotated corpus, create a list of just the annotation column, and do the same for Nat's, then check [the notebook from lab 3, step 5](https://github.com/yevgenm/iaa) on how to calculate the Inter Annotator Agreement and the slides of that week of how to interpret it and discuss it.
  
Then take whoever's annotations you regard as the 'main' annotations and discuss the sentiments in each genre (make visualisations) and compare them.
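
As a rough sketch of the agreement check described above, unweighted Cohen's kappa can be computed from two equally long lists of labels using only the standard library. The function name `cohen_kappa` and the label values below are hypothetical placeholders, not the study's actual annotations or categories (only 'N/A' is mentioned in these notes). A library implementation such as scikit-learn's `cohen_kappa_score` should give the same value for the same two lists.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equally long lists of nominal labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same, non-empty set of items.")
    n = len(labels_a)

    # Observed agreement: share of items that received the same label from both
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement: chance overlap given each annotator's label distribution
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(counts_a) | set(counts_b)) / n ** 2

    if expected == 1:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical annotations, for illustration only
main_annotator = ["positive", "negative", "neutral", "negative",
                  "positive", "N/A", "negative", "neutral"]
second_annotator = ["positive", "negative", "negative", "negative",
                    "positive", "N/A", "neutral", "neutral"]

kappa = cohen_kappa(main_annotator, second_annotator)
print(f"Cohen's kappa: {kappa:.3f}")
```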

In [58]:
genre_word_count

{'hiphop': 43815, 'pop': 31335}

In [60]:
genre_profanity_count

{'hiphop': 633, 'pop': 92}

In [62]:
print(profanity_frequency)

{'hiphop': Counter({'bitch': 85, 'nigga': 73, 'shit': 71, 'fuck': 59, 'niggas': 50, 'ass': 49, 'bitches': 38, 'damn': 34, 'pussy': 29, 'fuckin': 25, 'motherfucker': 16, 'dick': 15, 'sex': 15, 'fucking': 12, 'hell': 10, 'fucked': 10, 'motherfuckin': 5, 'bum': 5, 'butt': 2, 'cock': 2, 'prick': 2, 'balls': 2, 'shits': 2, 'tits': 2, 'titties': 2, 'asses': 2, 'pissed': 2, 'dink': 1, 'goddamn': 1, 'piss': 1, 'cum': 1, 'tit': 1, 'fag': 1, 'motherfuck': 1, 'porno': 1, 'penis': 1, 'pussies': 1, 'clitoris': 1, 'viagra': 1, 'crap': 1, 'pornography': 1}), 'pop': Counter({'shit': 26, 'damn': 18, 'hell': 8, 'bitch': 8, 'goddamn': 8, 'fucked': 6, 'fuckin': 4, 'fuck': 3, 'faggot': 3, 'motherfucker': 2, 'snatch': 2, 'sex': 1, 'ass': 1, 'lust': 1, 'fucks': 1})}
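
Because the two genres differ in total word count, raw profanity counts are easier to compare once they are normalised. The sketch below is one possible way to do this, using only the totals printed above; the names `profanity_rate`, `per_1000_words` and `words_per_profanity` are illustrative. Note that the counts are the matches produced by `extract_profanity_lines`, so a profanity repeated within a single line is counted once.

```python
# Totals copied from the output of analyse_lyrics() shown above
genre_word_count = {'hiphop': 43815, 'pop': 31335}
genre_profanity_count = {'hiphop': 633, 'pop': 92}

profanity_rate = {}
for genre, words in genre_word_count.items():
    hits = genre_profanity_count[genre]
    profanity_rate[genre] = {
        'per_1000_words': 1000 * hits / words,  # profanities per 1,000 words
        'words_per_profanity': words / hits,    # average gap between profanities
    }
    print(f"{genre}: {profanity_rate[genre]['per_1000_words']:.2f} per 1,000 words, "
          f"one every {profanity_rate[genre]['words_per_profanity']:.1f} words")
```

On these totals, this gives roughly 14.4 profanities per 1,000 words for hip-hop and roughly 2.9 for pop.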


### Conclusions & Limitations

**More notes for Pragati**

**Limitations:** (came across these when writing the code, thought it would be best to explain them in the conclusion)
- Non-regularised formatting of AZLyrics urls causes some songs to be left out (6 hip-hop, 10 pop songs).
- Certain words were recognised as profanities even though they weren't.\
  For example, due to the preprocessing of removing any punctuation, he'll > hell.\
  Took this into account for the annotation with the addition of a fourth 'N/A' category.
- We only compared rap to pop; a comparison with other musical genres would also be interesting for future research.
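
The `he'll > hell` false positive mentioned above can be reproduced in a few lines. The second half of the sketch shows a hypothetical alternative that was not used in this study: keeping straight apostrophes during preprocessing. It would not help with typographic (curly) apostrophes, which would still be removed.

```python
import re

line = "he'll be fine"

# Preprocessing as in this study: delete all punctuation, including apostrophes
stripped = re.sub(r'[^\w\s]', '', line.lower())
false_hit = re.search(r'\bhell\b', stripped) is not None

# Hypothetical alternative (not used in this study): keep straight apostrophes
kept = re.sub(r"[^\w\s']", '', line.lower())
still_hit = re.search(r'\bhell\b', kept) is not None

print(repr(stripped), false_hit)
print(repr(kept), still_hit)
```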

**Future research:**
- Comparison to other genres.
- Other languages besides English? See if there are differences between languages.


## 3. Active Learning Exercises: Calculating Cohen's Kappa in Python

In our analysis we calculated inter-annotator agreement using the Cohen's kappa function from the `scikit-learn` library. However, this statistic can also be calculated manually. In this exercise we will teach you how to compute Cohen's Kappa by hand, going through the calculations step by step.

### 3.1. Understanding the Confusion Matrix

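Cohen's Kappa is based on a confusion matrix, a table that cross-tabulates the labels assigned by the two annotators. With two categories it has four cells, analogous to the true positives, false negatives, false positives, and true negatives of a classifier evaluation.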

Let's assume we have two annotators, Annotators A and B, who classified profanity in 100 lines of lyrics into two categories: "Positive" and "Negative":

| A |  B | Count |
|---------|---------|-------|
| Positive| Positive| 50    |
| Positive| Negative| 10    |
| Negative| Positive| 5     |
| Negative| Negative| 35    |

In this table:
- The first row indicates that both annotators classified 50 profanity occurrences as "Positive".
- The second row indicates that Annotator A classified 10 profanity occurrences as "Positive" while Annotator B classified them as "Negative".
- The third row indicates that Annotator A classified 5 profanity occurrences as "Negative" while Annotator B classified them as "Positive".
- The fourth row indicates that both annotators classified 35 profanity occurrences as "Negative".
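
In practice, such a table is tallied from the annotators' raw labels. The sketch below shows one way to do this with the standard library, using a small hypothetical set of six labels per annotator (not the 100-line example above, which is left for the exercise).

```python
from collections import Counter

# Hypothetical labels from two annotators for the same six lines (illustration only)
annotator_a = ["Positive", "Positive", "Negative", "Negative", "Positive", "Negative"]
annotator_b = ["Positive", "Negative", "Negative", "Positive", "Positive", "Negative"]

# Count how often each (A label, B label) combination occurs
pair_counts = Counter(zip(annotator_a, annotator_b))

categories = ["Positive", "Negative"]
for a_label in categories:
    for b_label in categories:
        print(f"{a_label:<9}| {b_label:<9}| {pair_counts[(a_label, b_label)]}")
```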

Your first exercise is to create a confusion matrix based on the example data above using the `NumPy` package, which is a fundamental package for numerical computing in Python. It provides support for arrays and mathematical functions.

*(Hint: you'll need the `array` function, see documentation [here](https://numpy.org/doc/2.1/reference/generated/numpy.array.html))*

In [66]:
# Import NumPy:

# Create the confusion matrix:

# Print the matrix:
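If you want to check your work, one possible solution is sketched below. It assumes that rows hold Annotator A's labels and columns hold Annotator B's labels, both in the order Positive, Negative. The variable name `confusion_matrix` is our own choice and is not prescribed by the exercise.

```python
import numpy as np

# Rows: Annotator A (Positive, Negative)
# Columns: Annotator B (Positive, Negative)
confusion_matrix = np.array([[50, 10],
                             [5, 35]])

print(confusion_matrix)
```

With this layout, the top-left and bottom-right cells hold the agreements, and the other two cells hold the disagreements.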

### 3.2. Calculate Observed Agreement (Po)

Next, we need to calculate the observed agreement (Po): the proportion of cases in which the annotators agree on their classification. These cases lie on the *diagonal* of the confusion matrix (top left to bottom right).

The formula for observed agreement is:

$P_o = \frac{a + d}{N}$

Where:
- a = number of agreements in the positive category (top-left cell in the confusion matrix)
- d = number of agreements in the negative category (bottom-right cell in the confusion matrix)
- N = total number of observations (sum of all cells in the confusion matrix)

Your task is to calculate Po using the confusion matrix and formula. Index the confusion matrix to extract the necessary values.

In [69]:
# Use indexing to extract the values ('a' and 'd') needed for the calculation of Po:

# Calculate the total number of observations (N) by summing all the elements in the confusion matrix:

# Calculate Po using the formula:

# Print the result:
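A possible solution for checking your answer is sketched below. It rebuilds the matrix so that the snippet runs on its own, and it assumes the same row/column layout as before (rows for Annotator A, columns for Annotator B). The variable names are illustrative.

```python
import numpy as np

confusion_matrix = np.array([[50, 10],
                             [5, 35]])

# Agreements sit on the diagonal of the matrix
a = confusion_matrix[0, 0]  # both annotators chose Positive
d = confusion_matrix[1, 1]  # both annotators chose Negative

# Total number of observations
N = confusion_matrix.sum()

# Observed agreement: (a + d) / N
Po = (a + d) / N
print(Po)
```

For the example data this works out to (50 + 35) / 100 = 0.85.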

### 3.3. Calculate Expected Agreement (Pe)

We will now do the same for the expected agreement (Pe), the agreement between the two annotators that would be expected by chance, given how often each annotator uses each category.

The formula for expected agreement is:

$P_e = \left(\frac{(a+b)(a+c)}{N^2}\right) + \left(\frac{(c+d)(b+d)}{N^2}\right)$ 

Where:
- b = number of items rated positive by Annotator A and negative by Annotator B (top-right cell in the confusion matrix)
- c = number of items rated negative by Annotator A and positive by Annotator B (bottom-left cell in the confusion matrix)

*(Note that you already created variables for values `a`, `d` and `N` in the previous exercise)*

In [72]:
# Use indexing to extract the values ('b' and 'c') needed for the calculation of Pe:

# Calculate Pe using the formula:

# Print the result:
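One way to check your answer is sketched below. The snippet is self-contained, so it repeats the extraction of `a`, `d` and `N`. The layout (rows for Annotator A, columns for Annotator B) and the variable names are assumptions carried over from the earlier sketches.

```python
import numpy as np

confusion_matrix = np.array([[50, 10],
                             [5, 35]])

a = confusion_matrix[0, 0]
b = confusion_matrix[0, 1]  # A: Positive, B: Negative
c = confusion_matrix[1, 0]  # A: Negative, B: Positive
d = confusion_matrix[1, 1]
N = confusion_matrix.sum()

# Chance agreement on Positive plus chance agreement on Negative
Pe = ((a + b) * (a + c)) / N**2 + ((c + d) * (b + d)) / N**2
print(Pe)
```

By hand: (60 × 55) / 100² + (40 × 45) / 100² = 0.33 + 0.18, so Pe is approximately 0.51.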

### 3.4. Calculate Cohen's Kappa ($\kappa$)

Now we have the values needed to calculate Cohen's Kappa ($\kappa$). 

Calculate $\kappa$ with its formula:

$\kappa = \frac{P_o - P_e}{1 - P_e}$

In [75]:
# Calculate k using the formula:

# Print the result:
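The full calculation can be sketched end to end as follows, so that you can compare it with your own result. The second half is an optional generalisation that is not part of the exercise: `kappa_from_matrix` is a hypothetical helper of our own that applies the same idea to a square matrix with any number of categories, using the row and column totals instead of the letters a to d.

```python
import numpy as np

confusion_matrix = np.array([[50, 10],
                             [5, 35]])

# Unpack the four cells (rows: Annotator A, columns: Annotator B)
a, b = confusion_matrix[0]
c, d = confusion_matrix[1]
N = confusion_matrix.sum()

Po = (a + d) / N
Pe = ((a + b) * (a + c)) / N**2 + ((c + d) * (b + d)) / N**2

# Cohen's Kappa: agreement beyond chance, relative to the maximum possible
kappa = (Po - Pe) / (1 - Pe)
print(round(kappa, 3))


def kappa_from_matrix(matrix):
    """Hypothetical helper: Cohen's Kappa for any square agreement matrix."""
    matrix = np.asarray(matrix, dtype=float)
    total = matrix.sum()
    observed = np.trace(matrix) / total  # share of cases on the diagonal
    # Per category: product of the two annotators' totals, summed over categories
    expected = (matrix.sum(axis=1) * matrix.sum(axis=0)).sum() / total**2
    return (observed - expected) / (1 - expected)


print(round(kappa_from_matrix(confusion_matrix), 3))
```

For the example data, both calculations give (0.85 - 0.51) / (1 - 0.51), which is roughly 0.69.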

### 3.5. Interpreting Cohen's Kappa

After calculating Cohen's Kappa ($\kappa$), you have a numerical value between -1 and 1 that indicates the level of agreement between the two annotators: 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance. Of course, we would also like to know what the value you obtained actually *means* in practice.

Consulting [this](https://datatab.net/tutorial/cohens-kappa) documentation, write a few sentences below interpreting the $\kappa$ value you just computed:

    *Write your text here*

### 3.6. Reflection

We now know that Cohen's Kappa is a statistical measure used to assess inter-annotator reliability in classification tasks. High agreement is not equally critical for every application (our example of annotating profanity sentiment in music lyrics is fairly low-stakes), but there are many fields where consistent classification is crucial, and it never hurts to report the measure in your research.

Can you identify any fields/domains where achieving a high Cohen's Kappa is particularly important?

    *Write your text here*

## 4. References

> Bruner, R. (2018). How hip-hop became the sound of the mainstream. Time. https://time.com/5118041/rap-music-mainstream/

> Dictionary.com. (2025). Rap (music). https://www.dictionary.com/browse/rap--music 

> Howard, S., Hennes, E. P., & Sommers, S. R. (2021). Stereotype Threat Among Black Men Following Exposure to Rap Music. Social Psychological and Personality Science, 12(5), 719-730. https://doi.org/10.1177/1948550620936852 

> Medium. (2015). Profanity in lyrics: Most used swear words and their usage by popular genres. Medium. https://medium.com/musixmatch-blog/profanity-in-lyrics-most-used-swear-words-and-their-usage-by-popular-genres-d8a12c776713

> Medium. (2022). What is hip-hop and how is it different from rap? Medium. https://medium.com/@byvinci.io/what-is-hip-hop-and-how-is-it-different-from-rap-5726264814c9

> Moloney, M. J., & Sylva, H. M. (2020). 'And I Swear...' – Profanity in pop music lyrics on the American Billboard charts 2009-2018 and the effect on YouTube popularity. International Journal of Scientific & Technology Research, 9(2), 5212-5214. 

> Musical Dictionary. (2024). What is pop: Understanding pop music and its cultural significance. https://www.musicaldictionary.com/what-is-pop/ 