# Creating a New Word "Game"
We have all heard of anagrams, where you take a word or phrase and shuffle the letters to make a new one. But I was wondering: what if, instead of letters, we shuffled a word's base sounds around? The ARPAbet is a set of phonetic transcription codes developed by the Advanced Research Projects Agency (ARPA) in the 1970s. It lets us break words down further, into a written representation of their speech sounds. Now what if we shuffled these around to make what I am calling an ARPAbet-agram? This is a bit more complex than anyone would ever care to deal with by hand (it makes anagrams look pretty easy by comparison). Calling it a word "game" is somewhat generous, but we can make a computer do all the tedious work of finding ARPAbetagrams and just enjoy the results. Feel free to play with this notebook or just download the output .CSV file of ~135,000 words and their possible ARPAbetagrams.

## Defining an ARPAbetagram

I will define an ARPAbetagram as a word (or phrase) that is formed by rearranging the ARPAbet phones of another. Two additional rules:

- The ARPAbetagram cannot be a homophone.

- Stresses can be ignored. Since I'm making this up, I can do that. It is like ignoring spaces or capitalization in anagrams. I find it makes for more interesting results; ARPAbetagrams are sparse enough as it is.

As an example, the word 'accounts' has the ARPAbet pronunciation "AH K AW N T S" and the word 'countess' has the ARPAbet pronunciation "K AW N T AH S". Note that both words use the same phones (AH AW K N S T), so they are considered ARPAbetagrams of each other.
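
The check in this example can be sketched in a few lines. This is only an illustration, assuming pronunciations are given as space-separated phone strings with the stress digits already removed; the helper name `is_arpabetagram` is made up for this sketch and is not part of the notebook.

```python
from collections import Counter

def is_arpabetagram(pron_a, pron_b):
    """Return True if the two pronunciations use the same phones in a different order.

    Hypothetical helper for illustration; inputs are space-separated phones
    with stress digits already removed.
    """
    phones_a = pron_a.split()
    phones_b = pron_b.split()
    # Identical phone sequences are homophones, which the first rule excludes.
    if phones_a == phones_b:
        return False
    # The same multiset of phones means one is a rearrangement of the other.
    return Counter(phones_a) == Counter(phones_b)

print(is_arpabetagram("AH K AW N T S", "K AW N T AH S"))  # accounts vs. countess -> True
print(is_arpabetagram("K AW N T AH S", "K AW N T AH S"))  # same pronunciation -> False
```

Comparing phone counts rather than phone sets matters here, since a phone that appears twice in one word must also appear twice in the other.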

In [None]:
import pandas as pd
import numpy as np
import os
print(os.listdir("../input"))

In [None]:
dictionary = open('../input/cmudict.dict', 'r')

# Process the ARPAbet Dictionary
First we'll reformat the dictionary into a DataFrame with each word and its pronunciation. I am removing the digits from the pronunciations, since in ARPAbet they only mark the stress level of vowels (0 for no stress, 1 for primary, 2 for secondary). Try as I might, I am unable to hear the difference between these stresses, so I am discounting them in this exercise.
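
To make the stress-stripping and sorting step concrete, here is a small standalone sketch, separate from the notebook's own processing cell. The helper names `strip_stress` and `sorted_key` are invented for this example, and the stress digits in the sample strings are assumed for illustration rather than copied from the dictionary file.

```python
import re

def strip_stress(pronunciation):
    """Remove the stress digits (0, 1, 2) attached to vowel phones."""
    return re.sub(r"[012]", "", pronunciation)

def sorted_key(pronunciation):
    """Sort the stress-free phones so that rearrangements share one key."""
    return " ".join(sorted(strip_stress(pronunciation).split()))

# Sample strings with assumed stress markings, for illustration only.
print(strip_stress("AH0 K AW1 N T S"))  # AH K AW N T S
print(sorted_key("AH0 K AW1 N T S"))    # AH AW K N S T
print(sorted_key("K AW1 N T AH0 S"))    # AH AW K N S T
```

Two words are candidates for being ARPAbetagrams of each other exactly when their sorted keys are equal, which is what the `pronunciation_sorted` column is for.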

In [None]:
%%time

with dictionary as f:
    phonics = [line.rstrip('\n') for line in f]

word = []
pronunciation = []
pronunciation_sorted = []

for line in phonics:
    # each line looks like: word PHONE PHONE ...
    parts = line.split(' ')
    word.append(parts[0])
    p = ' '.join(parts[1:])
    # remove the stress digits (0, 1, 2) from the pronunciation
    for digit in '012':
        p = p.replace(digit, '')
    pronunciation.append(p)
    # sort the phones so that rearrangements of each other share the same key
    pronunciation_sorted.append(' '.join(sorted(p.split(' '))))

df = pd.DataFrame({
        "word": word,
        "pronunciation": pronunciation,
        "pronunciation_sorted": pronunciation_sorted
    })

# add placeholder columns
df['ARPAbetagrams'] = ''
df['index'] = df.index
df[:10]

# Find All ARPAbetagrams
Note: This runs a bit slowly but gets the job done. It takes ~1 hour to complete. The result will be a new column listing all the ARPAbetagrams of each word.

In [None]:
%%time
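def fillARPAbetagrams(line):
    cp = line['pronunciation']           # current pronunciation
    cpa = line['pronunciation_sorted']   # current pronunciation with phones sorted
    i = line['index']
    # progress report: with ~135,000 rows, every 1,350 rows is about 1%
    if i % 1350 == 0:
        print(str(i/1350)+'% done')

    # same phones in any order, but not the same pronunciation (no homophones)
    pg = df.loc[(df['pronunciation_sorted'] == cpa) & (df['pronunciation'] != cp)]['word'].values.tolist()

    return ','.join(pg)
df['ARPAbetagrams'] = df[['word', 'pronunciation', 'pronunciation_sorted', 'index']].apply(fillARPAbetagrams, axis = 1)

# drop the helper column; assign the result so the column is actually removed
df = df.drop(['index'], axis=1)
df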
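
The cell above scans the whole table once per word, which is where the long runtime comes from. A possible alternative, sketched here in plain Python rather than pandas, is to bucket words by their sorted phones in a single pass and then compare only within each bucket. The function name `find_arpabetagrams` and the three hand-typed sample entries are assumptions made for this sketch, not part of the notebook, and the approach has not been timed against the full dictionary.

```python
from collections import defaultdict

def find_arpabetagrams(entries):
    """Map each word to the words that share its sorted phones but not its pronunciation.

    `entries` is a list of (word, pronunciation) pairs with stress digits removed.
    Hypothetical helper for illustration only.
    """
    # Bucket entries by their sorted phones in a single pass.
    groups = defaultdict(list)
    for word, pron in entries:
        key = " ".join(sorted(pron.split()))
        groups[key].append((word, pron))

    # Within each bucket, keep only words whose pronunciation differs (no homophones).
    result = {}
    for members in groups.values():
        for word, pron in members:
            result[word] = [w for w, p in members if p != pron]
    return result

# Hand-typed sample entries, for illustration only.
sample = [
    ("accounts", "AH K AW N T S"),
    ("countess", "K AW N T AH S"),
    ("cat", "K AE T"),
]
matches = find_arpabetagrams(sample)
print(matches["accounts"])  # ['countess']
print(matches["cat"])       # []
```

Joining each list with `','` would give the same string format that the `ARPAbetagrams` column uses.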

# Look at the Results
As you can see, ARPAbetagrams are pretty rare. Most words have none. Many of the words that do have a few only have them because the dataset includes some questionable words. That being said, there are some pretty interesting and unexpected ARPAbetagrams mixed throughout. Making a program that can go through a phrase and find ARPAbetagrams of it might be phase 2 of this notebook, but I will leave it here for now.


In [None]:
# df.loc[(df['word'] == 'accord')]
df[:50]

# Output the CSV File
Enjoy going through the dataset.

In [None]:
df.to_csv("ARPAbetagrams_Dataset.csv", index=False, header=True)