### Quick Description

This notebook calculates the toxicity score of every curse word in a file, finds the curse word with the median toxicity score (according to Perspective), and replaces every curse word identifier in the subtitles with that median word.

In [1]:
import os
import glob
import yaml
import time
import json
import sys
import re

from tqdm import tqdm
from googleapiclient import discovery

sys.path.append("../utils")
from toxicity_api_communication import get_toxicity_score

In [2]:
with open("../config/filepaths.yaml") as f:
    filepaths = yaml.safe_load(f)
with open("../credentials.yaml") as f:
    credentials = yaml.safe_load(f)["perspective-api"]

service = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=credentials["key-1"],
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

FileNotFoundError: [Errno 2] No such file or directory: '../credentials.yaml'

### Calculate the Toxicity for Every Curse Word

Calculate the toxicity score for every curse word in the file referenced [here](https://www.cs.cmu.edu/~biglou/resources/) built by [Luis von Ahn](https://www.cs.cmu.edu/~biglou/)'s Research Group.

Note that some words in the list do not receive considerable toxicity scores. For example, "nook", "dahmer", and "palesimian" are all listed as curse words yet score below 5.0, whereas "muslim" (18.99), "medication" (18.6), "interracial" (18.39), and "atheist" (15.83) score substantially higher.
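The `get_toxicity_score` helper is imported from `../utils` and is not shown in this notebook. As a rough sketch of what such a helper might do, the block below builds a Perspective `comments.analyze` request and reads the `TOXICITY` summary score from the response. The function names here are hypothetical, and the error handling is an assumption rather than the behaviour of the real helper.

```python
from typing import Optional


def build_analyze_request(text: str) -> dict:
    """Build the JSON body for a Perspective `comments.analyze` call."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }


def extract_toxicity(response: dict) -> Optional[float]:
    """Return the TOXICITY summary score, or None if it is missing."""
    try:
        return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
    except (KeyError, TypeError):
        return None


def get_toxicity_score_sketch(service, text: str) -> Optional[float]:
    """Hypothetical stand-in for the notebook's `get_toxicity_score`."""
    request = service.comments().analyze(body=build_analyze_request(text))
    try:
        return extract_toxicity(request.execute())
    except Exception:
        # Assumption: any API failure (e.g. an HTTP error) is reported as None,
        # which matches the `if toxicity is not None` check in the scoring loop.
        return None
```

Perspective reports summary scores between 0 and 1, so the figures quoted above appear to be expressed on a 0 to 100 scale.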

In [3]:
%%script False

with open(filepaths["bad_words_raw"]) as f:
    curse_words = f.read().split()

curse_words_score = {}
for word in tqdm(curse_words, total=len(curse_words)):
    toxicity = get_toxicity_score(service, word)
    
    if toxicity is not None:
        curse_words_score[word] = toxicity

    time.sleep(1)

Couldn't find program: 'False'


Persist the scored curse words

In [4]:
%%script False

with open(filepaths["bad_words_scored"], "w") as f:
    json.dump(dict(curse_words_score), f)

Couldn't find program: 'False'


Remove irrelevant words that should not be considered curse words

In [5]:
%%script False

# Keep only the words whose toxicity score exceeds 0.2.
curse_words_score = {
    word: score for word, score in curse_words_score.items() if score > 0.2
}

Couldn't find program: 'False'


Get the median curse word index

In [6]:
%%script False

curse_words_score = sorted(curse_words_score.items(), key=lambda item: item[1])

median_idx = len(curse_words_score) // 2
median_curse_word, median_score = curse_words_score[median_idx]

print(f"The median word is `{median_curse_word}` with a toxicity score of {median_score}")

Couldn't find program: 'False'
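Since the cells above are skipped, here is a small self-contained illustration of the filter-then-median steps on made-up data. The words, scores, and the `median_entry` name are all hypothetical; only the 0.2 threshold and the `len(...) // 2` indexing come from the notebook.

```python
def median_entry(scores: dict, threshold: float = 0.2):
    """Drop low-scoring words, sort by score, and return the middle (word, score) pair."""
    kept = {word: score for word, score in scores.items() if score > threshold}
    ranked = sorted(kept.items(), key=lambda item: item[1])
    return ranked[len(ranked) // 2]


# Made-up scores for illustration only.
toy_scores = {"alpha": 0.05, "bravo": 0.30, "charlie": 0.55, "delta": 0.80, "echo": 0.95}

word, score = median_entry(toy_scores)
print(word, score)  # delta 0.8
```

With an even number of remaining words (four here), `len(ranked) // 2` selects the upper of the two middle entries rather than averaging them, which keeps the result an actual word from the list.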


---

In [7]:
def replace_tokens(input_path, output_path):
    # Collect the base name of every file in the input directory.
    filenames = [os.path.basename(path) for path in glob.glob(os.path.join(input_path, "*"))]

    for filename in tqdm(filenames, total=len(filenames)):
        with open(os.path.join(input_path, filename)) as file:
            text = file.read()

        # Skip transcripts that contain the censored-word placeholder "[ __ ]".
        if re.search(r"\[ __ \]+", text) is not None:
            continue

        # Strip the "[Music]" caption marker before saving.
        text = text.replace("[Music]", "")

        with open(os.path.join(output_path, filename), "w") as file:
            file.write(text)
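To make the per-file logic easier to check without touching the file system, the same two steps can be tried on plain strings. This is only a sketch: the `preprocess` name and the sample sentences are hypothetical, while the pattern and the `[Music]` replacement are taken from the function above.

```python
import re

# Same pattern as in the function above: the "[ __ ]" placeholder for a censored word.
CENSORED = re.compile(r"\[ __ \]+")


def preprocess(text: str):
    """Return the cleaned text, or None when the transcript contains a censored token."""
    if CENSORED.search(text) is not None:
        return None
    return text.replace("[Music]", "")


print(preprocess("what the [ __ ] is this"))  # None
print(repr(preprocess("[Music] hello there")))  # ' hello there'
```

Transcripts with a censored token are dropped entirely rather than edited, which is why the output directory can end up with fewer files than the input directory.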

---

### Remove files with censored curse words

In [8]:
replace_tokens(filepaths["raw_data"], filepaths["preprocessed_data"])

100%|█████████████████████████████████████████████████████████████| 634/634 [00:00<00:00, 7939.80it/s]
