# Procedurally generated names

These names will be used for populations (pops) and celestial objects. The generator is based on a dataset of city names worldwide.

**Note** that this file generates the `syllables.p` file that is part of the repo.

I want names that are:
* Procedurally generated
* Culturally ambiguous
* Sound like real words

To do this, I'm going to parse a dataset of cities and build a model that picks syllables based on how often they occur.

In [45]:
import pickle
import re

import numpy as np
import pandas as pd
from nltk import word_tokenize
from nltk.tokenize import SyllableTokenizer


from collections import Counter
import altair as alt


In [14]:
# # Install the nltk material if you haven't already
# import nltk
# nltk.download('all')

In [15]:
cities = pd.read_csv("../../data/world-cities.csv")
cities["name"] = cities["name"].str.lower()
cities["country"] = cities["country"].str.lower()


ci = cities["name"].drop_duplicates().values
co = cities["country"].drop_duplicates().values

ci

array(['les escaldes', 'andorra la vella', 'umm al qaywayn', ...,
       'beitbridge', 'epworth', 'chitungwiza'], dtype=object)

Combine the city and country names into one long, space-separated string of words.

In [16]:
words = pd.concat([pd.DataFrame(ci), pd.DataFrame(co)]).drop_duplicates().values
words = " ".join(words.flatten())
words[:400]

'les escaldes andorra la vella umm al qaywayn ras al-khaimah khawr fakkān dubai dibba al-fujairah dibba al-hisn sharjah ar ruways al fujayrah al ain ajman adh dhayd abu dhabi zaranj taloqan shīnḏanḏ shibirghān shahrak sar-e pul sang-e chārak aībak rustāq qarqīn qarāwul pul-e khumrī paghmān nahrīn maymana mehtar lām mazār-e sharīf lashkar gāh kushk kunduz khōst khulm khāsh khanabad karukh kandahār k'

In [12]:
SSP = SyllableTokenizer()

tokens = [SSP.tokenize(token) for token in word_tokenize(words)]

  " assigning as vowel: '{}'".format(c)


In [19]:
tokens[:10]

[['les'],
 ['es', 'cal', 'des'],
 ['an', 'dor', 'ra'],
 ['la'],
 ['vel', 'la'],
 ['umm'],
 ['al'],
 ['qay', 'wayn'],
 ['ras'],
 ['al', '-', 'khai', 'mah']]

Each word is now a list of syllables, so the result is a list of lists covering city names worldwide.

In [69]:
all_syls = np.concatenate(tokens).ravel()
all_syls


array(['les', 'es', 'cal', ..., 'zim', 'ba', 'bwe'], dtype='<U8')

Clean up to remove special characters (which may be less inclusive, but will prevent Unicode encoding errors).

In [77]:
# Keep only string entries, then only syllables made entirely of a-z
syl = [i for i in all_syls if isinstance(i, np.str_)]
clean_syll = [i for i in syl if re.fullmatch(r"[a-z]+", i)]
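
To illustrate what this filter keeps and drops, here is a minimal sketch on a hand-picked sample. The list `raw_syllables` is hypothetical (loosely modeled on the tokens shown earlier), and the sketch uses only the standard library rather than the notebook's NumPy array.

```python
import re

# Hypothetical sample of raw syllables, including a hyphen and accented characters.
raw_syllables = ["al", "-", "khai", "mah", "fak", "kān", "shīn", "pul"]

# Keep only syllables made up entirely of ASCII lowercase letters.
kept = [s for s in raw_syllables if re.fullmatch(r"[a-z]+", s)]
dropped = [s for s in raw_syllables if s not in kept]

print(kept)
print(dropped)
```

Anything containing punctuation or a non-ASCII letter is discarded whole, which is the trade-off mentioned above.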

In [80]:
counts = Counter(clean_syll)
df = pd.DataFrame.from_dict(dict(counts), orient='index', columns=['count'])
df['pct'] = df['count']/(df['count'].sum())
df = df[df['count']>1]
df

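| syllable | count | pct |
|---|---|---|
| les | 85 | 0.001289 |
| es | 82 | 0.001244 |
| cal | 39 | 0.000592 |
| des | 45 | 0.000683 |
| an | 253 | 0.003837 |
| ... | ... | ... |
| bloem | 2 | 0.000030 |
| bwa | 2 | 0.000030 |
| slands | 13 | 0.000197 |
| tius | 2 | 0.000030 |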


In [81]:
hist = alt.Chart(df).mark_bar().encode(x = alt.X('count',
                                            bin = alt.BinParams(maxbins = 30)),
                                            y = 'count()')
hist

That's a pretty long tail. Let's trim it further by keeping only syllables that appear more than 20 times.

In [103]:
# .copy() avoids pandas' SettingWithCopyWarning when adding the column below
shorter_df = df[df['count']>20].copy()
shorter_df['pct'] = shorter_df['count']/(shorter_df['count'].sum())
shorter_df


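| syllable | count | pct |
|---|---|---|
| les | 85 | 0.001779 |
| es | 82 | 0.001716 |
| cal | 39 | 0.000816 |
| des | 45 | 0.000942 |
| an | 253 | 0.005294 |
| ... | ... | ... |
| kiy | 56 | 0.001172 |
| tsu | 63 | 0.001318 |
| gorsk | 28 | 0.000586 |
| heights | 35 | 0.000732 |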

In [83]:
hist = alt.Chart(shorter_df).mark_bar().encode(x = alt.X('count',
                                            bin = alt.BinParams(maxbins = 30)),
                                            y = 'count()')
hist

## First iteration: choosing purely at random

In [110]:
# Actual production function
def make_word(n, spaces=True):
    """Join n uniformly sampled syllables into one capitalized word."""
    # TODO: Spaces not implemented
    syl = np.random.choice(shorter_df.index.to_list(), n)
    word = "".join(syl)
    return word.capitalize()


In [90]:
[make_word(3) for word in range(10)]

['Nerhuling',
 'Helwalher',
 'Gaskdro',
 'Mauviapro',
 'Landfonay',
 'Carrezgro',
 'Turonin',
 'Fenvavil',
 'Ziburgua',
 'Herletche']

Randomizing the length of each word makes the results look pretty good.

In [94]:
[make_word(np.random.choice([1, 2, 3])) for word in range(10)]

['Damti',
 'May',
 'Loski',
 'Rio',
 'Linglisis',
 'Pet',
 'Ter',
 'Springsrin',
 'Delhunat',
 'Neu']

## Second iteration: using the probability distribution to weight the choice

In [108]:
syllables = shorter_df.index.to_list()
syllables_dist = shorter_df['pct'].values

# Actual production function
def make_dist_word(n, spaces=True):
    """Join n syllables, sampled in proportion to their frequency, into one capitalized word."""
    # TODO: Spaces not implemented
    syl = np.random.choice(syllables, n, p=syllables_dist)
    word = "".join(syl)
    return word.capitalize()


In [114]:
[make_dist_word(np.random.choice([2,3,4])) for word in range(10)]


['Zotavorich',
 'Coli',
 'Cicama',
 'Linreno',
 'Dailiki',
 'Langnakai',
 'Tovo',
 'Chican',
 'Tate',
 'Albara']

In [113]:
[make_word(np.random.choice([2,3,4])) for word in range(10)]

['Wooddassochar',
 'Sulsy',
 'Nebonder',
 'Yhaberg',
 'Halheskrasno',
 'Raykra',
 'Santko',
 'Franbarin',
 'Eljilai',
 'Eastbermai']
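
To make the difference between the two iterations concrete, the sketch below draws many syllables both ways and counts them. It swaps in the standard library's `random.choices` for `np.random.choice`, and the syllables and weights are invented for illustration, not taken from the dataset.

```python
import random
from collections import Counter

# Hypothetical syllables and relative frequencies (not from the dataset).
syllables = ["an", "les", "tsu"]
weights = [0.7, 0.2, 0.1]

rng = random.Random(0)  # fixed seed so the comparison is repeatable

# First iteration: every syllable is equally likely.
uniform_draws = Counter(rng.choices(syllables, k=10_000))

# Second iteration: common syllables are drawn more often.
weighted_draws = Counter(rng.choices(syllables, weights=weights, k=10_000))

print(uniform_draws)
print(weighted_draws)
```

Uniform sampling gives rare syllables as much airtime as common ones, while weighted sampling lets the most frequent syllables dominate.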

I'm not sure which I like better, so I'll try them both for a while and make a decision later.

In [116]:
with open("../../data/syllables.p", "wb") as f:
    pickle.dump(syllables, f)
with open("../../data/syllables_dist.p", "wb") as f:
    pickle.dump(syllables_dist, f)

**Note** that you'll need to move your `.p` files to `app/creators/specs/` manually, as a process control. That way you can review them before updating prod.
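
As a rough sketch of how the pickled files might be read back and used, the example below writes two small files to a temporary directory and loads them again. The data and paths are hypothetical stand-ins, plain lists replace the notebook's NumPy array, and `random.choices` is swapped in for `np.random.choice` to keep the example dependency-free; how the app actually consumes the files is not shown here.

```python
import pickle
import random
import tempfile
from pathlib import Path

# Hypothetical stand-ins; the real files hold the notebook's syllables and weights.
syllables = ["an", "les", "tsu"]
syllables_dist = [0.7, 0.2, 0.1]

with tempfile.TemporaryDirectory() as tmp:
    syl_path = Path(tmp) / "syllables.p"
    dist_path = Path(tmp) / "syllables_dist.p"

    # Write both files, one object per file.
    with open(syl_path, "wb") as f:
        pickle.dump(syllables, f)
    with open(dist_path, "wb") as f:
        pickle.dump(syllables_dist, f)

    # Read them back, as a consumer of the files might.
    with open(syl_path, "rb") as f:
        loaded_syllables = pickle.load(f)
    with open(dist_path, "rb") as f:
        loaded_dist = pickle.load(f)

# Build one three-syllable name from the loaded data.
name = "".join(random.choices(loaded_syllables, weights=loaded_dist, k=3)).capitalize()
print(name)
```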