### Quick Description

This notebook calculates the toxicity score of every curse word in a file, finds the curse word with the median toxicity score (according to Perspective), and replaces every curse word placeholder in the subtitles with that median word.

In [None]:
import os
import glob
import yaml
import time
import json

from tqdm import tqdm
from googleapiclient import discovery

In [None]:
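def get_toxicity_score(service, text):
    """Return the Perspective TOXICITY summary score for `text`, or None on failure."""
    try:
        # Skip empty strings and texts of 3000 characters or more.
        if 0 < len(text) < 3000:
            analyze_request = {
                'comment': {'text': text},
                'requestedAttributes': {'TOXICITY': {}}
            }
            response = service.comments().analyze(body=analyze_request).execute()
            toxicity_score = (
                response.get("attributeScores")
                .get("TOXICITY")
                .get("summaryScore")
                .get("value")
            )
        else:
            toxicity_score = None
    except Exception:
        # Any API or parsing error is treated as "no score available".
        toxicity_score = None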
    
    return toxicity_score
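
For reference, the nested `.get` chain in `get_toxicity_score` walks a response of roughly the following shape. This is a minimal sketch that makes no network calls: `sample_response` is a hand-written dictionary with a made-up score, and `extract_toxicity` is a hypothetical helper that is not part of the notebook.

```python
# Hand-written sample mimicking the part of the response that the function
# reads. The value 0.87 is illustrative, not a real API result.
sample_response = {
    "attributeScores": {
        "TOXICITY": {
            "summaryScore": {"value": 0.87, "type": "PROBABILITY"},
        }
    }
}


def extract_toxicity(response):
    """Hypothetical helper: return the TOXICITY summary score, or None if absent."""
    try:
        return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
    except (KeyError, TypeError):
        return None


print(extract_toxicity(sample_response))  # 0.87
print(extract_toxicity({}))               # None
```

Catching only `KeyError` and `TypeError` here is a design choice for the sketch; the notebook function catches every exception so that API errors also yield `None`.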

In [None]:
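# yaml.safe_load avoids the missing-Loader error raised by yaml.load in recent PyYAML versions.
with open("../credentials.yaml") as f:
    credentials = yaml.safe_load(f)["perspective-api"]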

service = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=credentials["key-1"],
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

### Calculate the Toxicity for Every Curse Word

Calculate the toxicity score for every curse word in the file referenced [here](https://www.cs.cmu.edu/~biglou/resources/) built by [Luis von Ahn](https://www.cs.cmu.edu/~biglou/)'s Research Group.

We need to take into consideration that some words in the list do not receive considerable toxicity scores. For example, "nook", "dahmer", and "palesimian" are all listed as curse words yet score below 5.0, whereas "muslim" (18.99), "medication" (18.6), "interracial" (18.39), and "atheist" (15.83) are scored as substantially more toxic. These figures appear to be percentages; the API itself returns values between 0 and 1.

In [None]:
with open(os.path.join("../data/", "bad_words.txt")) as f:
    curse_words = f.read().split()

curse_words_score = {}
for word in tqdm(curse_words, total=len(curse_words)):
    toxicity = get_toxicity_score(service, word)
    
    if toxicity is not None:
        curse_words_score[word] = toxicity

    time.sleep(1)

Remove low-scoring words (toxicity score of 0.2 or less) that should not be considered curse words

In [None]:
# Keep only the words whose toxicity score is above 0.2.
curse_words_score = {
    word: score for word, score in curse_words_score.items() if score > 0.2
}

Persist the scored curse words

In [None]:
with open('../data/bad_words_scored.json', 'w') as f:
    json.dump(dict(curse_words_score), f)

Get the curse word with the median toxicity score

In [None]:
curse_words_score = sorted(curse_words_score.items(), key=lambda item: item[1])

median_idx = len(curse_words_score) // 2
median_curse_word, median_score = curse_words_score[median_idx]

print(f"The median word is `{median_curse_word}` with a toxicity score of {median_score}")
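
A note on the indexing: `len(...) // 2` selects the middle element when the number of words is odd and the upper of the two middle elements when it is even. The sketch below illustrates this with made-up words and scores (`scores`, `ranked`, and the `w1`..`w4` keys are hypothetical, not data from the notebook).

```python
# Made-up words and scores, for illustration only.
scores = {"w1": 0.9, "w2": 0.3, "w3": 0.5, "w4": 0.7}

# Sort (word, score) pairs by ascending score.
ranked = sorted(scores.items(), key=lambda item: item[1])

# With four items, index 4 // 2 == 2 picks the upper middle element.
median_word, median_value = ranked[len(ranked) // 2]
print(median_word, median_value)  # w4 0.7
```

If the lower middle element were preferred for even-length lists, `(len(ranked) - 1) // 2` would select it instead.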

---

### Replace Curse Words

Replace every curse word placeholder (`[ __ ]`) in the subtitles with the word that has the median toxicity score predicted by Perspective, and remove the `[Music]` markers

In [None]:
input_path = "../data/00_raw"
output_path = "../data/01_preprocessed/"

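os.makedirs(output_path, exist_ok=True)  # make sure the output directory exists

# os.path.basename is portable, unlike splitting on "/".
filenames = [os.path.basename(path) for path in glob.glob(os.path.join(input_path, "*"))]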

for filename in tqdm(filenames, total=len(filenames)):
    with open(os.path.join(input_path, filename)) as file:
        text = file.read()
    
    text = text.replace("[Music]", '')
    text = text.replace("[ __ ]", median_curse_word)
    
    with open(os.path.join(output_path, filename), 'w') as file:
        file.write(text)
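
To see the effect of the two `replace` calls, here is a small sketch on a made-up subtitle line. The `sample` text and the stand-in word `heck` are assumptions for illustration; the notebook itself substitutes `median_curse_word`.

```python
# Made-up subtitle line containing both markers handled by the notebook.
sample = "[Music] what the [ __ ] is going on"
replacement = "heck"  # stand-in for median_curse_word

# Drop the music marker, then fill in the censored-word placeholder.
cleaned = sample.replace("[Music]", "").replace("[ __ ]", replacement)
print(repr(cleaned))  # ' what the heck is going on'
```

Note that removing `[Music]` leaves the surrounding whitespace in place, so a leading space remains in this example.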