DOsinga/country2vec

country2vec

Color a world or US map by how semantically close a word is to each country (or US state) name. The project is built on the Google News Word2Vec model: each region is shaded by how close its name sits to the query word, which turns the map into a fuzzy choropleth of meaning. For example, "vodka" lights up Eastern Europe, "malaria" the African belt, and "cowboy" the western US.

Try it live at douwe.com/projects/mapof.

Run

pip install -r requirements.txt
python build_db.py

build_db.py fetches the 3M-word Google News word2vec model via gensim (~1.5 GB download, cached under ~/gensim-data/), collapses every word's case variants into a single float32 vector per lowercase word, and writes the vectors to static/word2vec.db (~3.3 GB) as a sqlite-vec vec0 virtual table. It also writes a small words_lower index for autocomplete prefix lookups. The build takes about four minutes once the download finishes.
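As an illustration of the autocomplete lookup, here is a minimal sketch of a prefix index using only the standard library sqlite3 module. The table name words_lower comes from the description above; the single word column, the function names, and the range-scan query are assumptions made for the example, not the schema build_db.py actually writes.

```python
import sqlite3


def build_prefix_index(conn, words):
    """Store each word once, lowercased (hypothetical one-column schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS words_lower (word TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO words_lower (word) VALUES (?)",
        ((w.lower(),) for w in words),
    )
    conn.commit()


def complete(conn, prefix, limit=10):
    """Return up to `limit` stored words that start with `prefix`."""
    prefix = prefix.lower()
    # Every string starting with `prefix` sorts inside [prefix, prefix + U+10FFFF),
    # so the primary-key index can answer the query with a range scan.
    upper = prefix + "\U0010ffff"
    rows = conn.execute(
        "SELECT word FROM words_lower WHERE word >= ? AND word < ? "
        "ORDER BY word LIMIT ?",
        (prefix, upper, limit),
    )
    return [row[0] for row in rows]
```

A range comparison is used instead of LIKE 'prefix%' because SQLite only optimizes LIKE into an index scan under certain collation settings, whereas a plain range on the primary key always uses the index.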

mapof.py is the scoring code: it loads the country / state name embeddings once, then for each query word computes cosine distance against every region and rescales the result per map. The output is the (region_code, score, name) data that gets rendered as a Plotly choropleth — natural earth projection for the world map, albers usa for the US map. Map type is selected via ?map=world (default) or ?map=usa; adding a third map type means appending an entry to the COUNTRIES / STATES-style dicts at the top of mapof.py.
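The scoring step can be sketched in a few lines of plain Python. This is not the code in mapof.py: the function names, the shape of the regions dict, and the choice of a min-max rescale that maps the nearest region to 1 are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def score_regions(query_vec, regions):
    """Score every region of one map against a query vector.

    `regions` maps region_code -> (name, vector). Returns a list of
    (region_code, score, name) tuples with scores rescaled to [0, 1].
    Assumption: the rescale is min-max within the map, flipped so the
    region nearest to the query gets 1 and the farthest gets 0.
    """
    dists = {code: cosine_distance(query_vec, vec)
             for code, (_, vec) in regions.items()}
    lo, hi = min(dists.values()), max(dists.values())
    span = hi - lo
    result = []
    for code, (name, _) in regions.items():
        # If every region is equally distant there is nothing to rank.
        score = 0.5 if span == 0 else (hi - dists[code]) / span
        result.append((code, score, name))
    return result
```

Rescaling within each map means a word that is far from every region still produces a full range of colors, so the scores are relative rankings rather than absolute similarities.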

How it works

For each lowercase word, the stored vector is the sum of the word's lowercase and Titlecase variants in Google News. The ALL-CAPS variant is treated as noise and skipped unless it's the only form present. Querying then becomes a single primary-key lookup per word — both the user's word and each country (or US state) name resolve to one vector apiece, and cosine distance between them is used as the choropleth value (rescaled per map to [0, 1]).
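The case-collapsing rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the logic in build_db.py: the function name is hypothetical, Titlecase is approximated with str.title(), mixed-case forms such as "iPhone" are simply ignored, and plain Python lists stand in for float32 arrays.

```python
def collapse_case_variants(vectors):
    """Collapse case variants into one vector per lowercase word.

    `vectors` maps each original token to a list of floats. The result
    maps each lowercase word to the sum of its lowercase and Titlecase
    vectors, falling back to the ALL-CAPS vector only when neither of
    those forms exists.
    """
    # Group every token under its lowercase key.
    groups = {}
    for word, vec in vectors.items():
        groups.setdefault(word.lower(), {})[word] = vec

    collapsed = {}
    for lower, variants in groups.items():
        # A set avoids counting a form twice when lowercase == Titlecase.
        wanted = {lower, lower.title()}
        kept = [variants[form] for form in wanted if form in variants]
        if not kept and lower.upper() in variants:
            # ALL-CAPS is used only when it is the sole usable form.
            kept = [variants[lower.upper()]]
        if kept:
            collapsed[lower] = [sum(parts) for parts in zip(*kept)]
    return collapsed
```

Because cosine distance ignores vector length, summing the variants rather than averaging them gives the same scores at query time.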

The original (2016) version of this project imported the same data into Postgres and queried with the cube extension. The current code reads from sqlite-vec, so it has no Postgres dependency and ships the DB as a single file.

Files

  • mapof.py — scoring code; MAPS, COUNTRIES, STATES config dicts up top.
  • mapof.html — Plotly choropleth template (transparent background).
  • build_db.py — produces static/word2vec.db from Google News word2vec.
  • requirements.txt — minimal deps for the build and the scoring code.
  • static/word2vec.db — generated, not tracked in git.

About

Country2Vec uses Word2Vec to color a map based on the distance between words and the country names.
