# Word Shapes

The code in this Jupyter notebook implements ideas from Prof. Hinrich Schütze's lecture on word shapes:
http://www.cis.lmu.de/~hs/teach/15w/intro/pdf/wordshapes.pdf 

Word shapes can help a system infer the nature of an unknown (out-of-vocabulary) word. They can also be used as features for tasks such as Named Entity Recognition.

In [1]:
# Some preliminaries
from collections import defaultdict
import regex as re

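The following function takes a word (as a string) and returns its word shape (also as a string). It relies on the third-party regex package (imported as re above) rather than the standard library re module, because regex supports POSIX character classes such as [[:lower:]] and [[:upper:]], which also match non-ASCII letters.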

In [2]:
def get_word_shape(word):
    """Given a word, return its shape."""

    # Replace all lowercase letters with 'x'
    word_shape = re.sub(r"[[:lower:]]", 'x', word)

    # Replace all uppercase letters with 'X'
    word_shape = re.sub(r"[[:upper:]]", 'X', word_shape)

    # Replace all digits with '9'
    word_shape = re.sub(r'\d', '9', word_shape)

    # "Deduplicate": any sequence of n > 1 identical characters c
    # is replaced by a single copy of c
    word_shape = re.sub(r'(.)\1{1,}', r'\1', word_shape)

    return word_shape
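
For readers who do not have the regex package installed, the same mapping can be approximated with the standard library alone. The sketch below is not part of the original notebook: the name word_shape_stdlib is hypothetical, and it assumes that str.islower, str.isupper and str.isdecimal are close enough substitutes for [[:lower:]], [[:upper:]] and \d. The two approaches may disagree on unusual Unicode characters.

```python
from itertools import groupby


def word_shape_stdlib(word):
    """Hypothetical stdlib-only approximation of get_word_shape."""
    mapped = []
    for ch in word:
        if ch.islower():
            mapped.append("x")
        elif ch.isupper():
            mapped.append("X")
        elif ch.isdecimal():
            mapped.append("9")
        else:
            # Punctuation, symbols, and anything else are kept as-is
            mapped.append(ch)
    # Collapse each run of identical characters into a single character
    return "".join(key for key, _ in groupby(mapped))


print(word_shape_stdlib("state-of-the-art"))  # x-x-x-x
print(word_shape_stdlib("£24.4m"))            # £9.9x
```

itertools.groupby groups consecutive equal items, so keeping one key per group has the same effect as the deduplication regex in the function above.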

## Examples

### (1) Word compounds

In [3]:
get_word_shape("state-of-the-art")

'x-x-x-x'

In [4]:
get_word_shape("Myths/Facts")

'Xx/Xx'

In [5]:
get_word_shape("pre-Islamic")

'x-Xx'

### (2) Currency words 

In [6]:
get_word_shape("£24.4m")

'£9.9x'

In [7]:
get_word_shape("$15.81m")

'$9.9x'

### (3) Words with measurements

In [8]:
get_word_shape("32km/h")

'9x/x'

In [9]:
get_word_shape("2mm/year")

'9x/x'

### (4) Words with non-ASCII characters 

In [10]:
get_word_shape("fiancé")

'x'

In [11]:
get_word_shape("Über")

'Xx'

### (5) Web links and emails

In [12]:
get_word_shape("https://stackoverflow.com/questions/")

'x:/x.x/x/'

In [13]:
get_word_shape("andrew.johnson@nlp.edu")

'x.x@x.x'

### (6) Words from other languages 

In [14]:
# French
get_word_shape("d'accord")

"x'x"

In [15]:
# Czech
get_word_shape("Čecháček")

'Xx'

In [16]:
# Russian
get_word_shape("Пожалуйста")

'Xx'

## Corpus Analysis

First, read a large text corpus and populate a word count dictionary. The loop iterates over the file line by line instead of loading the whole file into memory, so a large text file can be processed without exhausting memory. The text file is assumed to have been properly tokenized beforehand (e.g., with the standard word tokenizer in NLTK). This analysis uses the monolingual corpus from the Workshop on Statistical Machine Translation (WMT), which can be downloaded from http://www.statmt.org/wmt10/training-monolingual.tgz

In [17]:
word_counts = defaultdict(int)

with open('/home/badr/word_shapes/news.en.tokenized.all') as f:
    for line in f:
        for w in line.split():
            word_counts[w] += 1

Number of word types in the corpus

In [18]:
len(word_counts)

2261601

Number of word tokens in the corpus

In [19]:
sum(word_counts.values())

1130643099
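
The counting loop and the distinction between word types and word tokens can be checked on a toy input. The sketch below is an illustration only: the helper count_words and the two-line sample text are made up, and collections.Counter is swapped in for the defaultdict used in the notebook. Any iterable of lines, including an open file handle, could be passed in the same way.

```python
import io
from collections import Counter


def count_words(lines):
    """Count whitespace-separated tokens from any iterable of lines."""
    counts = Counter()
    for line in lines:
        # Only one line is held in memory at a time
        counts.update(line.split())
    return counts


# A tiny in-memory stand-in for the tokenized corpus file (made-up text)
sample = io.StringIO("the cat sat\nthe dog sat down\n")
toy_counts = count_words(sample)

print(len(toy_counts))           # word types: 5
print(sum(toy_counts.values()))  # word tokens: 7
```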

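Given the word count dictionary, populate two dictionaries: one that maps each word shape to its total token count, and one that maps each word shape to the set of word types that have that shape.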

In [20]:
shape_counts = defaultdict(int)
words_per_shape = defaultdict(set)

for word in word_counts:
    w_shape = get_word_shape(word) 
    shape_counts[w_shape] += word_counts[word]
    words_per_shape[w_shape].add(word)
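
To make the aggregation step concrete, here is a minimal sketch on made-up counts. The function toy_shape is a hypothetical ASCII-only stand-in for get_word_shape (so that the example runs with the standard library alone), and the toy_ names and numbers are invented for illustration. Note that the shape counts are weighted by token frequency, while the sets collect word types.

```python
import re
from collections import defaultdict


def toy_shape(word):
    # Hypothetical ASCII-only stand-in for get_word_shape (stdlib re only)
    shape = re.sub(r"[a-z]", "x", word)
    shape = re.sub(r"[A-Z]", "X", shape)
    shape = re.sub(r"[0-9]", "9", shape)
    return re.sub(r"(.)\1+", r"\1", shape)


# Made-up word counts, for illustration only
toy_word_counts = {"London": 3, "Paris": 2, "2010": 4, "1999": 1, "the": 10}

toy_shape_counts = defaultdict(int)
toy_words_per_shape = defaultdict(set)

for word, count in toy_word_counts.items():
    shape = toy_shape(word)
    toy_shape_counts[shape] += count       # token-weighted count per shape
    toy_words_per_shape[shape].add(word)   # word types sharing this shape

print(dict(toy_shape_counts))              # {'Xx': 5, '9': 5, 'x': 10}
print(sorted(toy_words_per_shape["Xx"]))   # ['London', 'Paris']
```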

Number of word shapes 

In [21]:
len(shape_counts)

13681

Get the most frequent shapes in the corpus and show some examples

In [22]:
# Sort shapes by total token count, most frequent first
shape_counts_sorted = sorted(shape_counts.items(), key=lambda x: x[1], reverse=True)

# Row template: the shape followed by five example words
s = "{0:10} {1:20} {2:20} {3:20} {4:20} {5:20}"

print("{0:10} {1:>50}".format('Shape', 'Examples'))
print('-' * 110)

for (w_shape, count) in shape_counts_sorted[:50]:
    words = list(words_per_shape[w_shape])

    # Only show shapes that have more than five example words
    if len(words) > 5:
        print(s.format(w_shape, *words[:5]))

Shape                                                Examples
--------------------------------------------------------------------------------------------------------------
x          antivenoms           stomaches            speke                fillums              decoupaging         
Xx         Burullus             Sportica             Krabs                Chalard              Petticoat           
X          LHP                  ELMA                 RAYTOWN              MSSC                 BAGATELLE           
9          5321                 715931               21284377             20528                430342              
'x         'finest              'nearly              'grandstanding       ''nonsense           'sapo               
x-x        mad-hatter           shingle-back         check-boxes          busby-haired         semi-homestand      
Xx.        Honestly..           Korean..             Col.                 Hartlepool..         Qiyadi..            
9.9        183.