## Find five five-letter words with 25 different characters

### First approach

No optimization: every word is treated as equally valuable, no matter how obscure it is.

#### Loading the needed modules

In [1]:
import pulp as plp
import pandas as pd
import numpy as np

#### Loading and processing the word list

Source of word list: https://github.com/dwyl/english-words

In [2]:
# loading words
file = 'words_alpha.txt'
df = pd.read_csv(file, names=['word'])

# get length of each word
df['length'] = df['word'].apply(lambda x: len(str(x)))

# keep only words of length 5
dff = df[df.length == 5].copy()

# filter out words which do not contain vowels (y is counted as a vowel here)
dff = dff[dff['word'].str.contains('a|e|i|o|u|y')]

# get number of unique characters of each word
dff['count unique'] = dff['word'].apply(lambda word: len(set(word)))

# keep only words with 5 unique characters
word_list = dff[dff['count unique'] == 5]['word'].values

# return length of lists
print(f'Original list has length {len(df)}')
print(f'Filtered list of five letter words has length {len(dff)}')
print(f'Filtered list with unique five letter characters has length {len(word_list)}')

Original list has length 370105
Filtered list of five letter words has length 15912
Filtered list with unique five letter characters has length 10172
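The three filter steps can be illustrated on a handful of words without pandas. This is only a sketch: `toy_words`, `vowels` and `kept` are names introduced here for illustration and do not appear in the notebook.

```python
# toy input, chosen to hit every filter rule (not taken from words_alpha.txt)
toy_words = ['fjord', 'hello', 'crwth', 'glyph', 'cat']
vowels = set('aeiouy')

kept = [
    word for word in toy_words
    if len(word) == 5            # exactly five letters
    and vowels & set(word)       # at least one of a, e, i, o, u, y
    and len(set(word)) == 5      # no repeated letter
]
print(kept)  # ['fjord', 'glyph']
```

Here `hello` is dropped for its repeated `l`, `crwth` for having none of the six vowel letters, and `cat` for its length.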


#### Solving the ILP problem

This approach returns one of potentially many feasible solutions.
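Written as a formula, the model built in the next cell can be sketched as follows. The notation is introduced here for illustration and is not part of the notebook: $W$ is the filtered word list, $x_w$ is the binary variable of word $w$, and $\Sigma$ is the set of the 26 lowercase letters. There is no objective in this first approach, so any feasible assignment is accepted.

```latex
\begin{aligned}
\text{find}\quad & x_w \in \{0, 1\} && \forall w \in W \\
\text{s.t.}\quad & \sum_{w \in W} x_w = 5 \\
& \sum_{w \in W :\, c \in w} x_w \le 1 && \forall c \in \Sigma
\end{aligned}
```

Since every word in $W$ has five distinct letters, five selected words that share no letter cover exactly 25 letters, which leaves one letter of the alphabet unused.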

In [3]:
# set variables
variables = plp.LpVariable.dict('words', (word_list), lowBound=0, upBound=1, cat='Binary')

# initialize problem
prob = plp.LpProblem('Find_five_words')

# constraint 1: only five words
prob += plp.lpSum([variables[word] for word in word_list]) == 5

# constraint 2: each char is allowed to occur only once
for char in 'abcdefghijklmnopqrstuvwxyz':
    prob += plp.lpSum([variables[word] for word in word_list if char in word]) <= 1

prob.solve()

# returning the solution
solution = [word for word in word_list if variables[word].value() == 1.0]

GLPSOL--GLPK LP/MIP Solver 5.0
Parameter(s) specified in the command line:
 --cpxlp /tmp/0ba961bbfe52462787f2e414eee8e872-pulp.lp -o /tmp/0ba961bbfe52462787f2e414eee8e872-pulp.sol
Reading problem data from '/tmp/0ba961bbfe52462787f2e414eee8e872-pulp.lp'...
27 rows, 10173 columns, 61032 non-zeros
10172 integer variables, all of which are binary
22398 lines were read
GLPK Integer Optimizer 5.0
27 rows, 10173 columns, 61032 non-zeros
10172 integer variables, all of which are binary
Preprocessing...
27 rows, 10172 columns, 61032 non-zeros
10172 integer variables, all of which are binary
Scaling...
 A: min|aij| =  1.000e+00  max|aij| =  1.000e+00  ratio =  1.000e+00
Problem data seem to be well scaled
Constructing initial basis...
Size of triangular part is 27
Solving LP relaxation...
GLPK Simplex Optimizer 5.0
27 rows, 10172 columns, 61032 non-zeros
      0: obj =   0.000000000e+00 inf =   2.400e+01 (6)
     57: obj =   0.000000000e+00 inf =   0.000e+00 (0)
OPTIMAL LP SOLUTION FOUND
Intege

In [4]:
solution

['frack', 'jowly', 'pbxes', 'vingt', 'zhmud']
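A quick sanity check of the result is cheap to write. The helpers below, `is_valid_solution` and `unused_letters`, are hypothetical names introduced for this sketch; they only confirm that the five words use 25 different letters and report which letter is left over.

```python
def is_valid_solution(words):
    """Return True if `words` are five five-letter words sharing no letter."""
    letters = "".join(words)
    return (
        len(words) == 5
        and all(len(word) == 5 for word in words)
        and len(set(letters)) == 25
    )


def unused_letters(words):
    """Return the alphabet letters that occur in none of the words."""
    alphabet = set("abcdefghijklmnopqrstuvwxyz")
    return sorted(alphabet - set("".join(words)))


candidate = ['frack', 'jowly', 'pbxes', 'vingt', 'zhmud']
print(is_valid_solution(candidate))  # True
print(unused_letters(candidate))     # ['q']
```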

## Getting more interesting solutions

### Using Pulp for optimization

#### Load a word frequency dataset
We load Google's NGram dataset to retrieve the word frequencies used in the optimization.

Source: https://www.kaggle.com/datasets/wheelercode/english-word-frequency-list

In [3]:
# load Google Books Ngram dataset from Kaggle
file_freq = 'ngram_freq.csv'
# file_freq = 'unigram_freq.csv'
df_freq = pd.read_csv(file_freq)

# get length of words
df_freq['length'] = df_freq['word'].apply(lambda x: len(str(x)))
dff_freq = df_freq[df_freq['length'] == 5].copy()

# get number of unique characters of each word
dff_freq['count unique'] = dff_freq['word'].apply(lambda word: len(set(word)))

# keep only words with 5 unique characters
dfff_freq = dff_freq[dff_freq['count unique'] == 5].reset_index(drop=True)

# transfer data into dictionary for later look up
freq_dict = dict(zip(dfff_freq['word'], dfff_freq['count']))

# return length of lists
print(f'Original list has length {len(df_freq)}')
print(f'Filtered list with unique five letter characters has length {len(freq_dict)}')

Original list has length 9244879
Filtered list with unique five letter characters has length 427383


#### Adding the frequencies to the word list

Now we match the words of the original list with the counts found in the NGram dataset. Words that are not included in the NGram dataset are given a count of 1.

In [4]:
word_dict = {}
not_found_count = 0

for word in word_list:
    try: 
        # try to get counts from NGram dataset
        count = freq_dict[word]
        word_dict[word] = count
    except KeyError:
        # if key is not found in NGram dataset, add it to list with a count of 1
        word_dict[word] = 1
        not_found_count += 1

word_list = list(word_dict.keys())
len(word_dict), not_found_count

(10172, 119)
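The same lookup with a fallback can be written more compactly with `dict.get`. The following is a minimal sketch on made-up toy data: the `toy_*` names and the counts are illustrative and are not taken from the NGram dataset.

```python
# toy stand-ins for the notebook's freq_dict and word_list (illustrative values)
toy_freq_dict = {'fjord': 1200, 'glyph': 3400}
toy_word_list = ['fjord', 'glyph', 'zhmud']

# look up each word, falling back to a count of 1 when it is missing
toy_word_dict = {word: toy_freq_dict.get(word, 1) for word in toy_word_list}

# count how many words had to fall back to the default
toy_not_found = sum(word not in toy_freq_dict for word in toy_word_list)

print(toy_word_dict)   # {'fjord': 1200, 'glyph': 3400, 'zhmud': 1}
print(toy_not_found)   # 1
```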

#### Solve the adapted problem

Finding the optimal solution to the problem. We take the logarithm of the counts to get better solutions, since the frequencies range over several orders of magnitude. After finding an optimal solution that maximizes the summed log frequency, we exclude the word with the highest frequency count in that solution from the problem and re-run the solver.
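To see why the logarithm matters, consider a small sketch with made-up counts (all numbers below are hypothetical, and `raw_score` and `log_score` are names introduced here). With raw counts, a single very frequent word dominates the sum; with log counts, five moderately common words score higher than one common word plus four unknown ones.

```python
import math

# hypothetical counts for two candidate solutions
counts_a = [1_000_000_000, 1, 1, 1, 1]   # one very common word, four unknown words
counts_b = [100_000] * 5                 # five moderately common words


def raw_score(counts):
    """Objective value if raw counts were summed."""
    return sum(counts)


def log_score(counts):
    """Objective value if log counts are summed, as in the cell below."""
    return sum(math.log(c) for c in counts)


print(raw_score(counts_a) > raw_score(counts_b))  # True
print(log_score(counts_a) > log_score(counts_b))  # False
```

A word with the fallback count of 1 contributes `log(1) = 0` to the objective, so unknown words neither help nor hurt the score.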

In [5]:
# set variables
variables = plp.LpVariable.dict('words', (word_list), lowBound=0, upBound=1, cat='Binary')

# initialize problem
prob = plp.LpProblem('Find_five_words', plp.LpMaximize)

# objective function
prob += plp.lpSum([variables[word] * np.log(word_dict[word]) for word in word_list])

# constraint 1: only five words
prob += plp.lpSum([variables[word] for word in word_list]) == 5

# constraint 2: each char is allowed to occur only once
for char in 'abcdefghijklmnopqrstuvwxyz':
    prob += plp.lpSum([variables[word] for word in word_list if char in word]) <= 1
          
solutions = []
   
# keep solving until the solver no longer reports an optimal solution
while True:
    prob.solve()
    
    # break if ILP solution is not optimal
    if plp.LpStatus[prob.status] != "Optimal":
        break
        
    # retrieve results       
    results = [word for word in word_list if variables[word].value() == 1.0]
     
    # find result with most counts for exclusion
    results_dict = {r: word_dict[r] for r in results}        
    exclude = max(results_dict, key=results_dict.get)
        
    # exclude word from problem by forcing its variable to 0
    prob += variables[exclude] == 0
    
    # append new results to solutions list
    solutions.append(results)

GLPSOL--GLPK LP/MIP Solver 5.0
Parameter(s) specified in the command line:
 --cpxlp /tmp/59f7da6e6e9747ec8bbeef854ae07a40-pulp.lp -o /tmp/59f7da6e6e9747ec8bbeef854ae07a40-pulp.sol
Reading problem data from '/tmp/59f7da6e6e9747ec8bbeef854ae07a40-pulp.lp'...
27 rows, 10172 columns, 61032 non-zeros
10172 integer variables, all of which are binary
27422 lines were read
GLPK Integer Optimizer 5.0
27 rows, 10172 columns, 61032 non-zeros
10172 integer variables, all of which are binary
Preprocessing...
27 rows, 10172 columns, 61032 non-zeros
10172 integer variables, all of which are binary
Scaling...
 A: min|aij| =  1.000e+00  max|aij| =  1.000e+00  ratio =  1.000e+00
Problem data seem to be well scaled
Constructing initial basis...
Size of triangular part is 27
Solving LP relaxation...
GLPK Simplex Optimizer 5.0
27 rows, 10172 columns, 61032 non-zeros
      0: obj =  -0.000000000e+00 inf =   2.400e+01 (6)
     46: obj =   5.092732070e+01 inf =   6.277e-18 (0)
*    84: obj =   7.097035546e+01

+   247: >>>>>   5.835624047e+01 <=   6.926062373e+01  18.7% (12; 3)
+   669: mip =   5.835624047e+01 <=     tree is empty   0.0% (0; 83)
INTEGER OPTIMAL SOLUTION FOUND
Time used:   0.5 secs
Memory used: 15.2 Mb (15939871 bytes)
Writing MIP solution to '/tmp/615e63904ea4461da08107bb774443ba-pulp.sol'...
GLPSOL--GLPK LP/MIP Solver 5.0
Parameter(s) specified in the command line:
 --cpxlp /tmp/57c85a3cea3940edab5762138af6e9b8-pulp.lp -o /tmp/57c85a3cea3940edab5762138af6e9b8-pulp.sol
Reading problem data from '/tmp/57c85a3cea3940edab5762138af6e9b8-pulp.lp'...
33 rows, 10172 columns, 61038 non-zeros
10172 integer variables, all of which are binary
27428 lines were read
GLPK Integer Optimizer 5.0
33 rows, 10172 columns, 61038 non-zeros
10172 integer variables, all of which are binary
Preprocessing...
27 rows, 10166 columns, 60996 non-zeros
10166 integer variables, all of which are binary
Scaling...
 A: min|aij| =  1.000e+00  max|aij| =  1.000e+00  ratio =  1.000e+00
Problem data seem to be w

+   734: >>>>>   5.256865312e+01 <=   6.168690577e+01  17.3% (43; 11)
+  1208: >>>>>   5.507140593e+01 <=   5.507140593e+01   0.0% (22; 91)
+  1208: mip =   5.507140593e+01 <=     tree is empty   0.0% (0; 189)
INTEGER OPTIMAL SOLUTION FOUND
Time used:   0.9 secs
Memory used: 17.3 Mb (18176072 bytes)
Writing MIP solution to '/tmp/d7e92a0354484175885eb10c9bf27709-pulp.sol'...
GLPSOL--GLPK LP/MIP Solver 5.0
Parameter(s) specified in the command line:
 --cpxlp /tmp/b275b54a268c4f95a9785007ef8b8ca0-pulp.lp -o /tmp/b275b54a268c4f95a9785007ef8b8ca0-pulp.sol
Reading problem data from '/tmp/b275b54a268c4f95a9785007ef8b8ca0-pulp.lp'...
38 rows, 10172 columns, 61043 non-zeros
10172 integer variables, all of which are binary
27433 lines were read
GLPK Integer Optimizer 5.0
38 rows, 10172 columns, 61043 non-zeros
10172 integer variables, all of which are binary
Preprocessing...
27 rows, 10161 columns, 60966 non-zeros
10161 integer variables, all of which are binary
Scaling...
 A: min|aij| =  1.000e

+  1733: >>>>>   4.703876094e+01 <=   5.722237620e+01  21.6% (37; 47)
+  2131: >>>>>   4.715040760e+01 <=   5.374073233e+01  14.0% (36; 81)
+  2392: mip =   4.715040760e+01 <=     tree is empty   0.0% (0; 257)
INTEGER OPTIMAL SOLUTION FOUND
Time used:   1.5 secs
Memory used: 16.4 Mb (17172334 bytes)
Writing MIP solution to '/tmp/d155278c9b804cfb95d42e2895c6e0cf-pulp.sol'...
GLPSOL--GLPK LP/MIP Solver 5.0
Parameter(s) specified in the command line:
 --cpxlp /tmp/4736c252f42440ff8d3e982d9a0f7d10-pulp.lp -o /tmp/4736c252f42440ff8d3e982d9a0f7d10-pulp.sol
Reading problem data from '/tmp/4736c252f42440ff8d3e982d9a0f7d10-pulp.lp'...
44 rows, 10172 columns, 61049 non-zeros
10172 integer variables, all of which are binary
27439 lines were read
GLPK Integer Optimizer 5.0
44 rows, 10172 columns, 61049 non-zeros
10172 integer variables, all of which are binary
Preprocessing...
27 rows, 10155 columns, 60930 non-zeros
10155 integer variables, all of which are binary
Scaling...
 A: min|aij| =  1.000e

+   760: >>>>>   3.479598872e+01 <=   6.499025931e+01  86.8% (43; 10)
+  1566: >>>>>   4.072192016e+01 <=   5.341195141e+01  31.2% (52; 43)
+  1981: >>>>>   4.099733902e+01 <=   4.545947533e+01  10.9% (19; 140)
+  1992: >>>>>   4.117439433e+01 <=   4.501173263e+01   9.3% (15; 148)
+  2018: mip =   4.117439433e+01 <=     tree is empty   0.0% (0; 239)
INTEGER OPTIMAL SOLUTION FOUND
Time used:   1.3 secs
Memory used: 18.1 Mb (18963166 bytes)
Writing MIP solution to '/tmp/e872e2a3e54d46568a350f7429d230b1-pulp.sol'...
GLPSOL--GLPK LP/MIP Solver 5.0
Parameter(s) specified in the command line:
 --cpxlp /tmp/d46e3f1099f54245b192bd49aa449502-pulp.lp -o /tmp/d46e3f1099f54245b192bd49aa449502-pulp.sol
Reading problem data from '/tmp/d46e3f1099f54245b192bd49aa449502-pulp.lp'...
50 rows, 10172 columns, 61055 non-zeros
10172 integer variables, all of which are binary
27445 lines were read
GLPK Integer Optimizer 5.0
50 rows, 10172 columns, 61055 non-zeros
10172 integer variables, all of which are bina

In [6]:
solutions

[['dwarf', 'glyph', 'jocks', 'muntz', 'vibex'],
 ['breck', 'flows', 'japyx', 'vingt', 'zhmud'],
 ['fjord', 'glyph', 'muntz', 'vibex', 'wacks'],
 ['brews', 'flock', 'japyx', 'vingt', 'zhmud'],
 ['brows', 'fleck', 'japyx', 'vingt', 'zhmud'],
 ['breck', 'fowls', 'japyx', 'vingt', 'zhmud'],
 ['blows', 'freck', 'japyx', 'vingt', 'zhmud'],
 ['bowls', 'freck', 'japyx', 'vingt', 'zhmud'],
 ['breck', 'japyx', 'vingt', 'wolfs', 'zhmud'],
 ['dhikr', 'expwy', 'fultz', 'gconv', 'jambs'],
 ['brock', 'flews', 'japyx', 'vingt', 'zhmud'],
 ['flong', 'japyx', 'twick', 'verbs', 'zhmud'],
 ['dumbs', 'fritz', 'gconv', 'japyx', 'whelk'],
 ['becks', 'fultz', 'japyx', 'mordv', 'whing'],
 ['frack', 'jowly', 'pbxes', 'vingt', 'zhmud'],
 ['blitz', 'fconv', 'gryph', 'judex', 'mawks'],
 ['block', 'fremt', 'japyx', 'vughs', 'windz'],
 ['bovld', 'freck', 'japyx', 'muntz', 'whigs'],
 ['chivw', 'expdt', 'flank', 'grosz', 'jumby'],
 ['flong', 'jarvy', 'pbxes', 'twick', 'zhmud'],
 ['expdt', 'gconv', 'jumby', 'whilk', 'z