This notebook shows a simple use of fastText's wiki-news word embeddings: it uses gensim to find similar words, compare word similarities, and solve word analogies.

To run the notebook, you will need to:
1. Download 'wiki-news-300d-1M.vec.zip' and 'wiki-news-300d-1M-subword.vec.zip' from https://fasttext.cc/docs/en/english-vectors.html
2. Unzip both archives and put the resulting '.vec' files in a subfolder titled 'wiki.en' (the folder name used in the loading cell)
3. Install gensim in your virtual environment with 'pip install gensim'

Then you can run the cells below.
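
The loading cell reads plain '.vec' files, so the downloaded zip archives have to be extracted first. Here is a minimal sketch of doing that from Python with the standard library. The helper name `extract_vectors` is made up for this example, and the default folder 'wiki.en' is simply the path the loading cell uses.

```python
import zipfile
from pathlib import Path


def extract_vectors(zip_path, dest_dir="wiki.en"):
    """Extract every file in zip_path into dest_dir and return the extracted paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return [dest / name for name in archive.namelist()]


# Example calls, assuming the two archives sit next to the notebook:
# extract_vectors("wiki-news-300d-1M.vec.zip")
# extract_vectors("wiki-news-300d-1M-subword.vec.zip")
```

Unzipping by hand works just as well; the only thing that matters is that the '.vec' files end up in the folder the loading cell points at.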

In [1]:
#import fasttext  # not used below; the vectors are loaded through gensim
from gensim.models import KeyedVectors
#import numpy as np
#from sklearn.manifold import TSNE
#import matplotlib.pyplot as plt

In [5]:
# You need these two files in your folder for this to run properly (this cell will take a long time to run)
en_model = KeyedVectors.load_word2vec_format('wiki.en/wiki-news-300d-1M.vec')
sw_model = KeyedVectors.load_word2vec_format('wiki.en/wiki-news-300d-1M-subword.vec')
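
For context, `load_word2vec_format` reads the word2vec text format: a header line with the number of words and the vector size, then one line per word with the word followed by its numbers. The sketch below parses that layout by hand on a tiny made-up sample, just to show what is inside a '.vec' file. The function `read_vec` and the sample data are illustrative only and are not part of gensim.

```python
import io


def read_vec(handle):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 v2 ...' lines."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        if parts == [""]:
            continue  # skip blank lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError("expected {} values for {!r}".format(dim, word))
        vectors[word] = values
    if len(vectors) != count:
        raise ValueError("header promised {} words, found {}".format(count, len(vectors)))
    return vectors


# Toy sample: 2 words with 3 dimensions each (the real files have 300 dimensions).
sample = io.StringIO("2 3\nboy 0.1 0.2 0.3\ngirl 0.2 0.1 0.3\n")
toy = read_vec(sample)
```

If the long load time gets in the way while experimenting, gensim's `load_word2vec_format` also accepts a `limit` argument that reads only the first N vectors of the file.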

# Finding Words with Top Similarities

In [6]:
# prints the five most similar words for one word,
# first for the regular embeddings and then for the subword embeddings
def word_similarity(word):
  for label, model in (("Non-subword result:", en_model), ("\nSubword result:", sw_model)):
    # look the word up before printing, so an unknown word raises KeyError right away
    similar_words = model.most_similar(positive=[word], topn=5)
    print(label)
    for similar_word, score in similar_words:
      print("{} ({:.2%})".format(similar_word, score))

In [7]:
word_similarity('girl')

Non-subword result:
boy (86.18%)
girls (77.47%)
woman (74.41%)
lady (72.61%)
Girl (71.93%)

Subword result:
boy (87.73%)
girl- (81.40%)
girl-girl (78.55%)
girly-girl (77.92%)
girl-boy (77.81%)
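
The percentages printed next to each word are cosine similarities between the two word vectors, formatted as a percent. Below is a minimal sketch of that calculation in plain Python. The 3-dimensional vectors are made up for illustration and are not real embedding values.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for two 300-dimensional word vectors.
toy_girl = [1.0, 2.0, 0.0]
toy_boy = [1.0, 1.5, 0.5]
print("{:.2%}".format(cosine_similarity(toy_girl, toy_boy)))
```

A value near 100% means the vectors point in almost the same direction. A value near 0% means they are close to orthogonal.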


In [8]:
word_similarity('wheelchair')

Non-subword result:
wheelchairs (80.41%)
Wheelchair (70.48%)
wheel-chair (68.36%)
wheelchair-bound (66.63%)
powerchair (65.19%)

Subword result:
wheelchairs (84.40%)
wheelchair-bound (76.75%)
wheel-chair (75.34%)
wheelchair-accessible (73.51%)
wheelchair-using (73.01%)


In [9]:
word_similarity('autism')

Non-subword result:
Autism (83.04%)
autistic (77.12%)
ADHD (69.74%)
autism-related (69.41%)
non-autistic (66.70%)

Subword result:
Autism (79.67%)
MMR-autism (76.53%)
autism-related (74.40%)
autistic (74.07%)
ADHD (72.68%)


In [10]:
word_similarity('neurotypical')

Non-subword result:
non-autistic (82.53%)
neurotypicals (79.43%)
autists (76.10%)
low-functioning (75.66%)
aspies (74.55%)

Subword result:
neurotypicals (83.25%)
Neurotypical (75.67%)
non-autistic (73.28%)
autistic (72.99%)
autistics (71.03%)


In [11]:
word_similarity('disability')

Non-subword result:
disabilities (79.35%)
Disability (78.34%)
disabilty (66.93%)
disability-related (66.93%)
disablity (65.48%)

Subword result:
disabilty (81.55%)
disabilities (80.94%)
non-disability (79.30%)
disability-related (77.25%)
cross-disability (74.04%)


In [12]:
word_similarity('disabled')

Non-subword result:
handicapped (73.02%)
Disabled (70.26%)
disable (65.97%)
diabled (65.33%)
disabling (63.79%)

Subword result:
non-disabled (80.13%)
handicapped (75.37%)
nondisabled (74.92%)
disabled. (74.38%)
diabled (73.78%)


In [13]:
word_similarity('deaf')

Non-subword result:
Deaf (79.00%)
hearing-impaired (71.58%)
hard-of-hearing (69.63%)
deafness (67.84%)
deaf-mute (67.39%)

Subword result:
deaf-blind (81.17%)
deaf-mute (79.24%)
non-deaf (78.95%)
deafblind (78.08%)
hearing-impaired (75.46%)


In [14]:
word_similarity('accessibility')

Non-subword result:
Accessibility (75.92%)
accesibility (70.11%)
accessability (70.00%)
accessibilty (69.57%)
accessiblity (69.34%)

Subword result:
accessibilty (80.31%)
accessability (80.23%)
accesibility (79.54%)
accessiblity (79.10%)
inaccessibility (75.02%)


## Words that didn't work

In [15]:
word_similarity('neurodivergent')

KeyError: "word 'neurodivergent' not in vocabulary"

In [16]:
word_similarity('neuroatypical')

KeyError: "word 'neuroatypical' not in vocabulary"
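
Both models here were loaded from '.vec' files into `KeyedVectors`, which only holds vectors for the words listed in the file, so a word missing from the file raises `KeyError` in both models. Here is a minimal sketch of one way to guard the lookup, assuming the loaded object supports `word in model` as gensim's `KeyedVectors` does. `ToyModel` is a hypothetical stand-in used only so the example runs without the large files.

```python
def top_similar_or_none(model, word, topn=5):
    """Return model.most_similar(...) for word, or None if the word has no stored vector."""
    if word not in model:
        return None
    return model.most_similar(positive=[word], topn=topn)


class ToyModel:
    """Hypothetical stand-in for a loaded KeyedVectors object (illustration only)."""

    def __init__(self, neighbours):
        self._neighbours = neighbours

    def __contains__(self, word):
        return word in self._neighbours

    def most_similar(self, positive, topn=5):
        return self._neighbours[positive[0]][:topn]


# Toy neighbour list; the scores are placeholders, not model output.
toy_model = ToyModel({"autism": [("Autism", 0.83), ("autistic", 0.77)]})
print(top_similar_or_none(toy_model, "neurodivergent"))
```

With the real models the same helper would be called as `top_similar_or_none(en_model, 'neurodivergent')`.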

# Checking Word Similarities

In [17]:
# prints the similarity percentages for two words given a comparison word
def compare(worda, wordb, com_word):
  print("Non-subword result:")
  print("{} and {}: {:.2%}".format(worda, com_word, en_model.similarity(worda, com_word)))
  print("{} and {}: {:.2%}".format(wordb, com_word, en_model.similarity(wordb, com_word)))

  print("\nSubword result:")
  print("{} and {}: {:.2%}".format(worda, com_word, sw_model.similarity(worda, com_word)))
  print("{} and {}: {:.2%}".format(wordb, com_word, sw_model.similarity(wordb, com_word)))

In [18]:
compare('man','woman', 'doctor')

Non-subword result:
man and doctor: 53.02%
woman and doctor: 58.92%

Subword result:
man and doctor: 53.79%
woman and doctor: 57.38%


In [19]:
compare('man','woman', 'nurse')

Non-subword result:
man and nurse: 44.37%
woman and nurse: 57.77%

Subword result:
man and nurse: 46.69%
woman and nurse: 57.18%


In [20]:
compare('adult', 'elder', 'productivity')

Non-subword result:
adult and productivity: 29.30%
elder and productivity: 27.10%

Subword result:
adult and productivity: 31.16%
elder and productivity: 17.17%


# Trying to Solve Word Analogies

In [21]:
# A is to B as C is to D: given A, B and C, prints the model's best guess for D
def word_analogy(worda, wordb, wordc):
  print("{} is to {} as {} is to {}".format(worda, wordb, wordc, 
                                            en_model.most_similar(negative=[worda], positive=[wordb, wordc])[0][0]) )

def sw_word_analogy(worda, wordb, wordc):
  print("{} is to {} as {} is to {}".format(worda, wordb, wordc, 
                                            sw_model.most_similar(negative=[worda], positive=[wordb, wordc])[0][0]) )
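
The idea behind `most_similar(positive=[wordb, wordc], negative=[worda])` is vector arithmetic: take B - A + C and look for the nearest remaining word by cosine similarity, skipping the three input words. Below is a rough sketch of that idea in plain Python. The 3-dimensional vectors are made up for illustration, `solve_analogy` is a name invented for this example, and gensim's actual implementation differs in detail.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    ua, ub, uc = unit(vectors[a]), unit(vectors[b]), unit(vectors[c])
    target = unit([y - x + z for x, y, z in zip(ua, ub, uc)])  # b - a + c
    best_word, best_score = None, -math.inf
    for word, vector in vectors.items():
        if word in (a, b, c):
            continue  # the input words are never returned as the answer
        score = sum(t * v for t, v in zip(target, unit(vector)))  # cosine similarity
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Toy vectors; the dimensions loosely mean (royal, male, female).
toy_vectors = {
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "prince": [1.0, 1.0, 0.2],
    "apple": [1.0, 1.0, 1.0],
}
print(solve_analogy(toy_vectors, "man", "king", "woman"))
```

Skipping the input words matters: without that step the nearest word to B - A + C is often B or C itself.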

In [22]:
word_analogy('man', 'king', 'woman')
word_analogy('grass', 'green', 'sky')
word_analogy('human', 'house', 'bird') 
word_analogy('USA', 'Canada', 'fries')
word_analogy('USA', 'France', 'fries')

man is to king as woman is to queen
grass is to green as sky is to blue
human is to house as bird is to mansion
USA is to Canada as fries is to poutine
USA is to France as fries is to frites


In [23]:
sw_word_analogy('man', 'king', 'woman')
sw_word_analogy('grass', 'green', 'sky')
sw_word_analogy('human', 'house', 'bird') 
sw_word_analogy('USA', 'Canada', 'fries')
sw_word_analogy('USA', 'France', 'fries')

man is to king as woman is to queen
grass is to green as sky is to blue
human is to house as bird is to birdhouse
USA is to Canada as fries is to poutine
USA is to France as fries is to frites
