# Imports and Config 

In [1]:
from phonate import AALPhonate, phonate_filter

import torch
from transformers import T5ForConditionalGeneration, ByT5Tokenizer

import pandas as pd
pd.set_option('display.max_colwidth', 0)



# Setup 

PhonATe is constructed from a Grapheme-to-Phoneme model (G2P; `g2p_model`), a Phoneme-to-Grapheme model (P2G; `p2g_model`), and a byte-level tokenizer (`tokenizer`). The G2P model is taken from Zhu et al. (2022). The P2G model is further fine-tuned on pairs of AAL texts and the phoneme sequences that the G2P model predicts for them.
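The three components chain together as text, then phonemes, then augmented phonemes, then text again. The sketch below illustrates only that data flow. The lookup tables and the `toy_*` helpers are hypothetical stand-ins for the neural G2P and P2G models and are not part of `phonate`.

```python
from typing import Callable, List

def phonate_pipeline(texts: List[str],
                     g2p: Callable[[str], str],
                     augment: Callable[[str], str],
                     p2g: Callable[[str], str]) -> List[str]:
    """Run text -> phonemes -> augmented phonemes -> text for each input."""
    return [p2g(augment(g2p(text))) for text in texts]

# Hypothetical word-level lookup tables standing in for the real models.
TOY_G2P = {"with": "ˈwɪθ", "this": "ˈðɪs"}
TOY_P2G = {"ˈwɪf": "wiff", "ˈdɪs": "dis"}

def toy_g2p(text: str) -> str:
    # Words missing from the table are passed through unchanged.
    return " ".join(TOY_G2P.get(word, word) for word in text.split())

def toy_augment(phonemes: str) -> str:
    # A single illustrative phoneme-level rewrite.
    return phonemes.replace("θ", "f").replace("ð", "d")

def toy_p2g(phonemes: str) -> str:
    return " ".join(TOY_P2G.get(p, p) for p in phonemes.split())

pipeline_out = phonate_pipeline(["done with this"], toy_g2p, toy_augment, toy_p2g)
print(pipeline_out)
```

Keeping the augmentation as a separate phoneme-level step is what lets the same G2P and P2G models be reused for every transformation.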

## via Parameters

PhonATe can be constructed manually, which at minimum requires passing in G2P, P2G, and tokenizer models or checkpoints.

In [2]:
# G2P_CHKPT = 'charsiu/g2p_multilingual_byT5_small_100'
# P2G_CHKPT = 'phonate/t5-aal-p2g'
# TOK_CHKPT = 'google/byt5-small'

# DEVICE = torch.device('cuda:0')

In [3]:
# g2p_model = T5ForConditionalGeneration.from_pretrained(G2P_CHKPT).to(DEVICE)
# p2g_model = T5ForConditionalGeneration.from_pretrained(P2G_CHKPT).to(DEVICE)
# tokenizer = ByT5Tokenizer.from_pretrained(TOK_CHKPT)

In [4]:
# aal_phonate = AALPhonate(p2g_model = p2g_model, g2p_model = g2p_model, tok = tokenizer, device = DEVICE)

## via Config File 

PhonATe can also be constructed from a JSON configuration file. The AALPhonate documentation lists the config keys that can be used.

In [5]:
aal_phonate = AALPhonate(config = 'phonate/default_config.json')

Loading phonate from configuration: phonate/default_config.json
Loading models and tokenizer...

Finished loading models and tokenizer
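Before editing a configuration it can help to see which keys the default file actually sets. The helper below is a small sketch using only the standard library; `config_keys` is a hypothetical name, and it assumes the config is a JSON object at the top level.

```python
import json
from pathlib import Path

def config_keys(path):
    """Return the sorted top-level keys of a JSON configuration file."""
    with Path(path).open(encoding="utf-8") as fh:
        config = json.load(fh)
    return sorted(config.keys())

# Example with the path used above (uncomment when the file is available):
# print(config_keys('phonate/default_config.json'))
```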


# Augmenting Texts 

## All Augmentations 

All PhonATe augmentations can be applied by passing a list of texts to the `full_phon_aug` function. It returns four lists: the original phoneme transcriptions (`phon_trans`), the augmented phoneme sequences (`phon_aug`), the decoded PhonATe-augmented sequences (`paug_out`), and the cleaned decoded sequences (`clean_out`). The cleaning step helps ensure that capital letters, punctuation, and other features of the original text unrelated to phonological features are preserved.

In [6]:
# Updating all probabilities to 1.0 to demonstrate augmentations
aal_phonate.update_probs(1.0)

In [7]:
# Sample texts from the toxicity dataset
ex_texts = ["Hellloooo? I'm done with this....If I want information I'll just go the source or Encarta.",
            "Or at least review the timing of Moreschi's obscene haste and agree that I had no way of seeing it before he acted.",
            "The Billy the Kid article with my contributions has been vandalized on or about Dec. 14, 2015 by someone calling themselves KrakatoaKatie",
            "you both sure do You want to give a free pass to every border jumper in this country you two are what's wrong here"
           ]

In [8]:
phon_trans, phon_aug, paug_out, clean_out = aal_phonate.full_phon_aug(ex_texts)

In [9]:
res = pd.DataFrame([ex_texts, phon_trans, phon_aug, paug_out, clean_out]).transpose()
res.columns = ['Original Text', 'Phoneme Transcripts', 'Augmented Phoneme Sequences', 'Decoded Augmentations', 'Clean PhonATe Result']
res



__Expected Result:__

|    | Original Text                                                                                                                             | Phoneme Transcripts                                                                                                                                    | Augmented Phoneme Sequences                                                                                                                        | Decoded Augmentations                                                                                                                  | Clean PhonATe Result                                                                                                                   |
|---:|:------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------|
|  0 | Hellloooo? I'm done with this....If I want information I'll just go the source or Encarta.                                                | ˌhɛɫəˈu? ˈaɪm ˈdən ˈwɪθ this....If I ˈwɑnt ˌɪnfɝˈmeɪʃən ˈaɪɫ ˈdʒəst ˈɡoʊ ˈðɛ ˈsɔɹs ˈɔɹ Encarta.                                                        | ˌhɛɫəˈu? ˈaɪm ˈdən ˈwɪf this....If I ˈwɑnt ˌɪnfɝˈmeʃən ˈaɪɫ ˈdʒəs ˈɡo ˈdɛ ˈsɔɹs ˈɔɹ Encarta.                                                       | Hellloooo? I'm done wiff this....If I want infermation I'll jus go deh source or Encarta.                                              | Hellloooo? I'm done wiff this....If I want infermation I'll jus go deh source or Encarta.                                              |
|  1 | Or at least review the timing of Moreschi's obscene haste and agree that I had no way of seeing it before he acted.                       | ˈɔɹ ˈæt ˈɫist ɹivˈju ˈðɛ ˈtaɪmɪŋ ˈɑf mɔˈɹɛskiz əbˈsin ˈheɪst ˈænd əˈɡɹi ˈðæt I ˈhæd ˈnoʊ ˈweɪ ˈɑf ˈsiɪŋ ˈɪt bɪˈfɔɹ ˈhi acted.                          | ˈɔɹ ˈæt ˈɫis ɹivˈju ˈdɛ ˈtamɪŋ ˈɑf mɔˈɹɛskis əbˈsin ˈhest ˈænd ˈɡɹi ˈdæt I ˈhæt ˈno ˈwe ˈɑf ˈsiɪn ˈɪt bɪˈfɔɹ ˈhi acted.                            | Or at lease review deh taming of moreskis obscene hast and gree dat I hat no wa of seein it before he acted.                           | Or at lease review deh taming of Moreskis obscene hast and gree dat I hat no wa of seein it before he acted.                           |
|  2 | The Billy the Kid article with my contributions has been vandalized on or about Dec. 14, 2015 by someone calling themselves KrakatoaKatie | ˈðɛ ˈbɪɫi ˈðɛ ˈkɪd ˈɑɹtɪkəɫ ˈwɪθ ˈmaɪ contributions ˈhɑz ˈbɪn ˈvændəˌɫaɪzd ˈɑn ˈɔɹ əˈbaʊt ˈdɛk. 14, 2015 ˈbaɪ ˈsəmˌwən ˈkɔɫɪŋ ˌðɛmˈsɛɫvz KrakatoaKatie | ˈdɛ ˈbɪɫi ˈdɛ ˈkɪt ˈɑɹtɪkəɫ ˈwɪf ˈma contributions ˈhɑs ˈbɪn ˈvændəˌɫazd ˈɑn ˈɔɹ əˈbat ˈdɛk. 14, 2015 ˈba ˈsəmˌwən ˈkɔɫɪn ˌdɛmˈsɛɫvz KrakatoaKatie | deh Billy deh kit article wiff ma contributions hos been vandelized on or abat Dec. 14, 2015 ba someone callin demselves KrakatoaKatie | Deh Billy deh Kit article wiff ma contributions hos been vandelized on or abat Dec. 14, 2015 ba someone callin demselves KrakatoaKatie |
|  3 | you both sure do You want to give a free pass to every border jumper in this country you two are what's wrong here                        | ˈju ˈbɑθ ˈʃʊɹ ˈdu ˈju ˈwɑnt ˈtu ˈɡɪv a ˈfɹi ˈpæs ˈtu ˈɛvɹi ˈbɔɹdɝ ˈdʒəmpɝ ˈɪn ˈðɪs ˈkəntɹi ˈju ˈtu ˈɛɹ ˈwəts ˈɹɔŋ ˈhɪɹ                                 | ˈju ˈbɑf ˈʃʊɹ ˈdu ˈju ˈwɑnt ˈtu ˈɡɪf a ˈfɹi ˈpæs ˈtu ˈɛvɹi ˈbɔɹdə ˈdʒəmpə ˈɪn ˈdɪs ˈkəntɹi ˈju ˈtu ˈɛɹ ˈwəts ˈɹɔŋ ˈhɪɹ                             | you boff sure do You want to giff a free pass to every borda jumpa in dis country you two are what's wrong here                        | you boff sure do You want to giff a free pass to every borda jumpa in dis country you two are what's wrong here                        |
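The last two columns differ mainly in capitalization: row 2 goes from "deh Billy deh kit" to "Deh Billy deh Kit". One way to picture that part of the cleaning step is a word-by-word copy of leading capitals from the original text. This is a sketch under that assumption, with a hypothetical `restore_case` helper, and is not the cleaning code that PhonATe actually runs.

```python
def restore_case(original: str, decoded: str) -> str:
    """Copy leading capitalization from each original word onto the decoded word.

    Assumes both strings contain the same number of whitespace-separated
    words; if they do not, the decoded text is returned unchanged.
    """
    orig_words, dec_words = original.split(), decoded.split()
    if len(orig_words) != len(dec_words):
        return decoded
    fixed = []
    for orig_word, dec_word in zip(orig_words, dec_words):
        if orig_word[:1].isupper() and dec_word[:1].islower():
            dec_word = dec_word[:1].upper() + dec_word[1:]
        fixed.append(dec_word)
    return " ".join(fixed)

cleaned = restore_case("The Billy the Kid article", "deh Billy deh kit article")
print(cleaned)
```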

Texts can additionally be filtered to remove transformations that likely alter the meaning, for example those that result in a different existing word or a different grammatical role.

In [10]:
filt_texts = [phonate_filter.filter_transforms(orig_text, aug_text) for orig_text, aug_text in zip(ex_texts, clean_out)]

In [11]:
filt_texts

["Hellloooo? I'm done wif this....If I want infermation I'll jus go deh source or Encarta.",
 'Or at least review deh taming of Moreskis obscene hast and agree dat I had no way of seein it before he acted.',
 'Deh Billy deh Kit article wif my contributions hos been vandelized on or about Dec. 14, 2015 by someone callin demselves KrakatoaKatie',
 "you bof sure do You want to give a free pass to every borda jumpa in this country you two are what's wrong here"]

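To make the filtering idea concrete, the sketch below reverts any augmented word that collides with a different existing word. The `revert_real_words` helper and the tiny vocabulary are hypothetical; `phonate_filter.filter_transforms` evidently applies richer criteria (in the output above it also restores "way" and keeps "taming"), so this shows only the general principle.

```python
def revert_real_words(original: str, augmented: str, vocab: set) -> str:
    """Revert augmented words that turned into a different known word.

    Assumes the two texts align word by word; if the word counts differ,
    the augmented text is returned unchanged.
    """
    orig_words, aug_words = original.split(), augmented.split()
    if len(orig_words) != len(aug_words):
        return augmented
    kept = []
    for orig_word, aug_word in zip(orig_words, aug_words):
        changed = aug_word.lower() != orig_word.lower()
        is_real_word = aug_word.lower().strip(".,!?") in vocab
        kept.append(orig_word if changed and is_real_word else aug_word)
    return " ".join(kept)

# Hypothetical vocabulary; a real filter would use a full lexicon.
TOY_VOCAB = {"hat", "had", "way", "seeing"}

filtered = revert_real_words("I had no way of seeing it",
                             "I hat no wa of seein it",
                             TOY_VOCAB)
print(filtered)
```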
## Individual Transformations 

Transformations can also be applied individually by passing a single augmentation to the `full_phon_aug` function. The implemented augmentation names are:

| Feature | Name |
|---|---|
| th-fronting | `th_front`|
| Monophthongization | `dpt_simp` |
| Non-rhoticity | `non_rhot` and `other_non_rhot` |
| str-backing | `str_back` |
| l-lessness | `l_del` |
| Word-final devoicing | `fin_dvc` |
| Haplology | `haplology` |
| Consonant Cluster Reduction | `cons_red` |
| g-dropping | `g_drop` |
| Stress Dropping | `stress_drop` |

In [12]:
_,_,_, clean_out = aal_phonate.full_phon_aug([ex_texts[2]], augs = ['th_front'])

In [13]:
clean_out

['Deh Billy deh Kid article wiff my contributions has been vandalized on or about Dec. 14, 2015 by someone calling demselves KrakatoaKatie']
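To compare what each feature does on its own, the same call can be repeated once per augmentation name. The sketch below relies only on the `full_phon_aug(texts, augs=[...])` call shown above and on the names from the table; `compare_augs` itself is a hypothetical helper.

```python
# Augmentation names as listed in the table above.
AUG_NAMES = ('th_front', 'dpt_simp', 'non_rhot', 'other_non_rhot', 'str_back',
             'l_del', 'fin_dvc', 'haplology', 'cons_red', 'g_drop', 'stress_drop')

def compare_augs(phonate, text, aug_names=AUG_NAMES):
    """Apply each augmentation on its own and collect the cleaned output per name."""
    results = {}
    for name in aug_names:
        # full_phon_aug returns (phon_trans, phon_aug, paug_out, clean_out).
        _, _, _, clean_out = phonate.full_phon_aug([text], augs=[name])
        results[name] = clean_out[0]
    return results

# Usage with the object built earlier (commented out because it needs loaded models):
# pd.Series(compare_augs(aal_phonate, ex_texts[2]))
```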

## Random Transformations 

Random phonological transformations can be applied with the `full_random_aug` function by specifying, for each text, the number of insertions, deletions, and substitutions to make on its phoneme sequence.

In [14]:
num_ins  = [1] * len(ex_texts)
num_dels = [1] * len(ex_texts)
num_subs = [1] * len(ex_texts)

In [15]:
phon_trans, phon_aug, paug_out, clean_out = aal_phonate.full_random_aug(ex_texts, num_ins = num_ins, num_dels = num_dels, num_subs = num_subs)

In [16]:
res = pd.DataFrame([ex_texts, phon_trans, phon_aug, paug_out, clean_out]).transpose()
res.columns = ['Original Text', 'Phoneme Transcripts', 'Augmented Phoneme Sequences', 'Decoded Augmentations', 'Clean Random Result']
res

|    | Original Text | Phoneme Transcripts | Augmented Phoneme Sequences | Decoded Augmentations | Clean Random Result |
|---:|:---|:---|:---|:---|:---|
|  0 | Hellloooo? I'm done with this....If I want information I'll just go the source or Encarta. | ˌhɛɫəˈu? ˈaɪm ˈdən ˈwɪθ this....If I ˈwɑnt ˌɪnfɝˈmeɪʃən ˈaɪɫ ˈdʒəst ˈɡoʊ ˈðɛ ˈsɔɹs ˈɔɹ Encarta. | ˌhɫəˈu? ˈaɪm ˈdən ˈwɪθ thiʜ....If I ˈwɑnt ˌɪnfɝˈmeɪʃən ˈaɪɫ ˈdʒəst ˈɡoʊ ˈðɛ ˈsvɹs ˈɔɹ Encarta. | hlau? I'm done with this....If I want information I'll just go the svrs or Encarta. | Hlau? I'm done with this....If I want information I'll just go the svrs or Encarta. |
|  1 | Or at least review the timing of Moreschi's obscene haste and agree that I had no way of seeing it before he acted. | ˈɔɹ ˈæt ˈɫist ɹivˈju ˈðɛ ˈtaɪmɪŋ ˈɑf mɔˈɹɛskiz əbˈsin ˈheɪst ˈænd əˈɡɹi ˈðæt I ˈhæd ˈnoʊ ˈweɪ ˈɑf ˈsiɪŋ ˈɪt bɪˈfɔɹ ˈhi acted. | ˈɔɹ ˈæt ˈɫisɢ͡ʁ ɹivˈju ˈðɛ ˈtaɪmɪŋ ˈɑf mɔˈɹɛskz əbˈsin ˈheɪst ˈænd əˈɡɹi ˈðæt I ˈhæd ˈnoʊ ˈweɪ ˈɑf ˈsiɪŋ ˈɪt bɪˈfɔɹ ˈhi actʌd. | Or at leasegue review the timing of moresks obscene haste and agree that I had no way of seeing it before he acted. | Or at leasegue review the timing of Moresks obscene haste and agree that I had no way of seeing it before he acted. |
|  2 | The Billy the Kid article with my contributions has been vandalized on or about Dec. 14, 2015 by someone calling themselves KrakatoaKatie | ˈðɛ ˈbɪɫi ˈðɛ ˈkɪd ˈɑɹtɪkəɫ ˈwɪθ ˈmaɪ contributions ˈhɑz ˈbɪn ˈvændəˌɫaɪzd ˈɑn ˈɔɹ əˈbaʊt ˈdɛk. 14, 2015 ˈbaɪ ˈsəmˌwən ˈkɔɫɪŋ ˌðɛmˈsɛɫvz KrakatoaKatie | ˈðɛ ˈbɪɫi ˈðɛ ˈkɪd ˈɑɹtɪkəɫ ˈwɪθ ˈmaɪ contributions ˈhɑz ˈbɪn ˈvʎ̥ndəˌɫaɪzd ˈɑn ˈɔɹ əˈbaʊt ˈdɛk. 14, 2015 ˈbaɪ ˈsəqˌwən ˈkɔɫɪŋ ˌðɛmˈsɛɫvz KrakatoaKtie | The Billy the Kid article with my contributions has been vandalized on or about Dec. 14, 2015 by suquan calling themselves KrakatoaKatie | The Billy the Kid article with my contributions has been vandalized on or about Dec. 14, 2015 by suquan calling themselves KrakatoaKatie |
|  3 | you both sure do You want to give a free pass to every border jumper in this country you two are what's wrong here | ˈju ˈbɑθ ˈʃʊɹ ˈdu ˈju ˈwɑnt ˈtu ˈɡɪv a ˈfɹi ˈpæs ˈtu ˈɛvɹi ˈbɔɹdɝ ˈdʒəmpɝ ˈɪn ˈðɪs ˈkəntɹi ˈju ˈtu ˈɛɹ ˈwəts ˈɹɔŋ ˈhɪɹ | ˈju ˈbɑθ ˈʃʊɹ ˈdu ˈju ˈwɑnt ˈtu ˈɡɪʟ̝̊ a ˈfɹi ˈpɻs ˈtu ˈɛɹi ˈbɔɹdɝ ˈdʒəmpɝ ˈɪn ˈðɪs ˈkəntɹi ˈju ˈtu ˈɛɹ ˈwəts ˈɹɔŋ ˈhɪɹ | you both sure do You want to gishrow a free puss to airy border jumper in this country you two are what's wrong here | you both sure do You want to gishrow a free puss to airy border jumper in this country you two are what's wrong here |
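The kind of edit behind these results can be sketched as random substitutions, deletions, and insertions on a sequence of phoneme symbols. The function below is a hypothetical illustration, not `full_random_aug` itself: the order of operations, the symbol inventory, and the `seed` parameter are all assumptions.

```python
import random

def random_phoneme_edits(phonemes, inventory, num_ins=1, num_dels=1, num_subs=1, seed=None):
    """Return a copy of `phonemes` with random substitutions, deletions, and insertions.

    `phonemes` is a sequence of symbols and `inventory` is the pool that
    substituted and inserted symbols are drawn from.
    """
    rng = random.Random(seed)
    seq = list(phonemes)  # work on a copy so the input is left untouched
    for _ in range(num_subs):
        if seq:
            seq[rng.randrange(len(seq))] = rng.choice(inventory)
    for _ in range(num_dels):
        if seq:
            del seq[rng.randrange(len(seq))]
    for _ in range(num_ins):
        # len(seq) + 1 positions, so a symbol can also be appended at the end.
        seq.insert(rng.randrange(len(seq) + 1), rng.choice(inventory))
    return seq

edited = random_phoneme_edits(list("ˈsɔɹs"), ["v", "q", "ʌ"], seed=0)
print("".join(edited))
```

With one operation of each kind the sequence length is unchanged, since the deletion and the insertion cancel out; fixing `seed` makes the edits reproducible.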
