# A Problem Set to introduce you to Dimensionality Reduction and Clustering in Julia

By Lyndon White (@oxinabox)  
The University of Western Australia.




# First we load up some data
For the example presented here, we will use a subset of word embeddings, trained using [Word2Vec.jl](https://github.com/tanmaykm/Word2Vec.jl).
These are 100-dimensional vectors, which encode syntactic and semantic information about words.

In [157]:
all_w

String["tucson","damascus","diving","cairo","pigeon","lima","jacksonville","tehran","arabia","algiers","minsk","philippines","camel","tashkent","taipei","kenya","mozambique","rowing","khartoum","bogotá","sailing","baltimore","kampala","iraq","duck","vietnam","singapore","moscow","gymnastics","atlanta","rabat","louisville","bucharest","swimming","dallas","brazil","houston","jakarta","japan","nairobi","taekwondo","sacramento","stockholm","antananarivo","congo","philadelphia","las","vegas","mouse","kayak","india","sudan","accra","thailand","milwaukee","seoul","boston","ukraine","london","soccer","france","south","austin","weightlifting","havana","mexico","spain","yemen","poland","budapest","madagascar","albuquerque","venezuela","equestrian","rugby","columbus","boxing","kinshasa","ireland","portland","indianapolis","llama","colombia","baku","field","archery","chicago","dove","seattle","caracas","china","indonesia","volleyball","ankara","ferret","luanda","santiago","nigeria","morocco","madr

In [17]:
using JLD
countries = ["afghanistan","algeria","angola","arabia","argentina","australia","bangladesh","brazil","britain","canada","china","colombia","congo","egypt","england","ethiopia","france","germany","ghana","india","indonesia","iran","iraq","ireland","italy","japan","kenya","korea","madagascar","malaysia","mexico","morocco","mozambique","myanmar","nepal","nigeria","pakistan","peru","philippines","poland","russia","south","spain","sudan","tanzania","thailand","uganda","ukraine","usa","uzbekistan","venezuela","vietnam","wales","yemen"]
usa_cities = ["albuquerque","atlanta","austin","baltimore","boston","charlotte","chicago","columbus","dallas","denver","detroit","francisco","fresno","houston","indianapolis","jacksonville","las","louisville","memphis","mesa","milwaukee","nashville","omaha","philadelphia","phoenix","portland","raleigh","sacramento","san","seattle","tucson","vegas","washington"]
world_capitals = ["accra","algiers","amman","ankara","antananarivo","athens","baghdad","baku","bangkok","beijing","beirut","berlin","bogotá","brasília","bucharest","budapest","cairo","caracas","damascus","dhaka","hanoi","havana","jakarta","kabul","kampala","khartoum","kinshasa","kyiv","lima","london","luanda","madrid","manila","minsk","moscow","nairobi","paris","pretoria","pyongyang","quito","rabat","riyadh","rome","santiago","seoul","singapore","stockholm","taipei","tashkent","tehran","tokyo","vienna","warsaw","yaoundé"]
animals = ["alpaca","camel","cattle","dog","dove","duck","ferret","goldfish","goose","guineafowl","llama","mouse","pigeon","yak"]
sports = ["archery","badminton","basketball","boxing","cycling","diving","equestrian","fencing","field","football","golf","gymnastics","handball","hockey","judo","kayak","pentathlon","polo","rowing","rugby","sailing","shooting","soccer","swimming","taekwondo","tennis","triathlon","volleyball","weightlifting","wrestling"]


words_by_class = [countries, usa_cities, world_capitals, animals, sports]
all_words = vcat(words_by_class...)
classes = vcat(((1:5) .* ones.(length.(words_by_class)))...);
embeddings = load("ClusteringAndDimentionalityReduction.jld", "embeddings")

Dict{String,Array{Float32,1}} with 185 entries:
  "ferret"       => Float32[0.0945707,-0.435267,0.0109875,-0.107674,0.169001,-0…
  "gymnastics"   => Float32[-0.269173,-0.343412,-0.00603042,-0.186179,0.0342606…
  "vegas"        => Float32[-0.00530534,-0.264874,0.0167432,-0.289836,-0.14033,…
  "archery"      => Float32[0.0279714,-0.485648,0.105468,-0.0696941,0.182807,-0…
  "jacksonville" => Float32[-0.418758,-0.0284594,0.00847164,-0.0989162,0.098186…
  "ankara"       => Float32[-0.139109,0.0872892,0.749557,-0.0308427,-0.0936718,…
  "pentathlon"   => Float32[-0.357405,-0.379595,-0.134314,-0.31008,-0.0245871,-…
  "seoul"        => Float32[0.0274904,-0.153844,-0.0936614,-0.0269344,-0.091449…
  "china"        => Float32[0.132423,-0.515862,-0.0381339,-0.287565,-0.285202,-…
  "korea"        => Float32[0.236904,-0.128355,-0.0816942,-0.0702621,-0.148426,…
  "argentina"    => Float32[-0.113967,-0.437523,-0.226014,-0.439572,-0.230062,-…
  "mozambique"   => Float32[0.309411,-0.13457,-0.632055,-0.30

# MultivariateStats.jl
[MultivariateStats.jl](https://github.com/JuliaStats/MultivariateStats.jl) is the main library for dimensionality reduction (DR).

In [2]:
using MultivariateStats
using Plots
plotly()

In [3]:
embeddings_mat = hcat(getindex.([embeddings], all_words)...)

100×185 Array{Float32,2}:
  0.0386423   -0.0747454   …  -0.194131    -0.0949871   0.0184777
 -0.0707636    0.00147601     -0.521243    -0.540243   -0.0992318
  0.122178    -0.030897        0.0806444    0.0674903   0.343439 
  0.187411    -0.201719       -0.237717    -0.0968779  -0.113297 
 -0.215721    -0.181733        0.125805     0.277859    0.254373 
 -0.33405     -0.0827407   …  -0.202835     0.153194    0.359169 
  0.198505     0.356985       -0.194464    -0.0815657   0.332574 
  0.290666     0.204581       -0.210431    -0.253662   -0.548761 
 -0.264896    -0.240784        0.11638      0.295445    0.0797238
 -0.370904    -0.276216        0.0468465    0.0898132  -0.0984195
 -0.140316    -0.1886      …   0.180491    -0.147654    0.090978 
 -0.0271654   -0.336009        0.00966041   0.116254    0.163717 
 -0.245324    -0.002544       -0.381931    -0.646284   -0.321171 
  ⋮                        ⋱                                     
 -0.426754    -0.0195873      -0.581407    -0.2974

In [77]:
#Direct projection -- no DR -- just throw away the information in the other axes
xs=embeddings_mat
scatter(xs[1,:], xs[2,:], xs[3,:]; hover=all_words, zcolor=classes)

### PCA

In [80]:

M = fit(PCA, embeddings_mat; maxoutdim=3)
xs = transform(M, embeddings_mat)
scatter(xs[1,:], xs[2,:], xs[3,:]; hover=all_words, zcolor=classes)

In [79]:
M = fit(PCA, embeddings_mat; maxoutdim=2)
xs = transform(M, embeddings_mat)
scatter(xs[1,:], xs[2,:]; hover=all_words, zcolor=classes)

In [78]:
M = fit(PCA, embeddings_mat; maxoutdim=1)
xs = transform(M, embeddings_mat)
scatter(xs[1,:], ones(length(xs)); hover=all_words, zcolor=classes)


### ICA

In [81]:
# Convert the Float32 embeddings to Float64 before fitting ICA
embeddings_mat_f64 = convert(Matrix{Float64}, embeddings_mat)

M = fit(ICA, embeddings_mat_f64, 5)
xs = transform(M, embeddings_mat_f64)


5×185 Array{Float64,2}:
 -0.845775   0.129334    0.00349142  …   0.370089    0.145215    0.788521
 -0.146869  -1.48463    -2.47835         0.380199   -0.0228977   0.609718
  0.67031    0.0679638  -0.8022         -0.187843    0.43557     0.398659
 -1.64131   -1.68317     0.380134        0.0429227  -0.0657806   0.372419
  0.4542    -0.0607897   0.0588406      -2.71555    -3.16978    -1.37088 

In [82]:
scatter(xs[1,:], xs[2,:], xs[3,:]; hover=all_words, zcolor=classes)

# T-SNE
T-SNE is another popular DR method.  
For some reason the [TSne.jl](https://github.com/lejon/TSne.jl) package is not registered.  
It is maintained though.  
However, it is sideways: it expects one observation per row (row major), so transpose the inputs and outputs.
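The layout difference can be sketched without TSne.jl itself. The snippet below assumes a recent Julia 1.x and uses a made-up stand-in, `rowwise_reducer`, for any routine that expects one observation per row and returns one row per observation. It is not TSne.jl's API, only an illustration of where the two transposes go.

```julia
# Hypothetical stand-in for a row-oriented DR routine: takes an
# (observations × features) matrix and returns (observations × outdim).
# Here it simply keeps the first `outdim` features.
rowwise_reducer(X, outdim) = X[:, 1:outdim]

# Column-oriented toy data, laid out like the matrices in this notebook:
# 4 features × 6 observations.
data = reshape(collect(1.0:24.0), 4, 6)

# Transpose on the way in (observations become rows) and again on the
# way out (observations become columns once more).
reduced = permutedims(rowwise_reducer(permutedims(data), 3))

size(reduced)  # (3, 6): 3 output dimensions × 6 observations
```

For real-valued data, `permutedims` and the postfix `'` give the same values. `permutedims` returns a plain matrix, while `'` returns a lazy wrapper in Julia 1.x.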

In [75]:
Pkg.clone("git://github.com/lejon/TSne.jl.git");

INFO: Cloning TSne from git://github.com/lejon/TSne.jl.git

LoadError: TSne already exists

In [42]:
using TSne

INFO: Precompiling module TSne.

In [65]:
#μ = mean(std(embeddings_mat,1))
#σ = std(embeddings_mat.-μ,1)

1×185 Array{Float32,2}:
 0.211171  0.22969  0.240134  0.236254  …  0.250757  0.265375  0.271414

In [83]:
xs = tsne(embeddings_mat', 3, 500, 1000, 20.0)'

3×185 Array{Float64,2}:
  43.1512    37.3727    48.3936  50.8145   …  -39.8871  -21.107   -25.2251
 -18.3487     9.72389   22.016   -3.92098     -25.0028  -30.9606  -41.0196
   0.575971   5.54296  -18.9601   9.95554      30.0511   45.5005   17.7756

In [84]:
scatter(xs[1,:], xs[2,:], xs[3,:]; hover=all_words, zcolor=classes)

## Clustering
The main clustering package for Julia is, unsurprisingly, named [Clustering.jl](https://github.com/JuliaStats/Clustering.jl).
 - It supports K-means, K-medoids, Affinity Propagation, and DBSCAN.
 - It also supports hierarchical clustering, but that is not currently in the docs.

You'll also want [Distances.jl](https://github.com/JuliaStats/Distances.jl) for all your distance metric needs.
It is traditional with word2vec to use cosine distance.
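To make the metric concrete, here is a minimal sketch of cosine distance written by hand with only the standard library. It assumes a recent Julia 1.x (where `dot` and `norm` live in `LinearAlgebra`). The helper names `cosine_distance` and `pairwise_cosine` are illustrative and not part of Distances.jl, and the toy vectors are made up. As with the embeddings matrix, each column is treated as one observation.

```julia
using LinearAlgebra

# Cosine distance: 1 minus the cosine of the angle between two vectors.
cosine_distance(a, b) = 1 - dot(a, b) / (norm(a) * norm(b))

# Pairwise distances between the columns of X (one observation per column).
function pairwise_cosine(X)
    n = size(X, 2)
    return [cosine_distance(view(X, :, i), view(X, :, j)) for i in 1:n, j in 1:n]
end

# Toy data: three 2-dimensional "embeddings" stored as columns.
X = [1.0 2.0 0.0;
     0.0 0.0 3.0]
D = pairwise_cosine(X)
```

Columns 1 and 2 point in the same direction, so their distance is 0 regardless of their lengths. Columns 1 and 3 are orthogonal, so their distance is 1. This insensitivity to vector length is one reason the metric suits word embeddings.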

In [131]:
using Clustering
using Distances

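# Cosine similarity is 1 minus cosine distance
similarity = 1f0 - pairwise(CosineDist(), embeddings_mat)
# The diagonal of the similarity matrix is each point's "preference" in affinity
# propagation terms (called `availability` here); tweaking it controls the number of clusters
availability = 0.01*ones(size(similarity,1))

similarity[diagind(size(similarity)...)] = availability
aprop = affinityprop(similarity)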

185-element Array{Float64,1}:
 0.01
 0.01
 0.01
 0.01
 0.01
 0.01
 0.01
 0.01
 0.01
 0.01
 0.01
 0.01
 0.01
 ⋮   
 0.01
 0.01
 0.01
 0.01
 0.01
 0.01
 0.01
 0.01
 0.01
 0.01
 0.01
 0.01

In [147]:
for (cluster_ii, exemplar_ind) in enumerate(aprop.exemplars)
    println("-"^32)
    println("Exemplar: ", all_words[exemplar_ind])
    cluster_member_inds = find(assignments(aprop).==cluster_ii)
    println(join(getindex.([all_words], cluster_member_inds), ", "))
end

--------------------------------
Exemplar: bangladesh
bangladesh, india, nepal, pakistan, dhaka
--------------------------------
Exemplar: colombia
argentina, brazil, colombia, mexico, peru, spain, venezuela, bogotá, brasília, caracas, havana, lima, madrid, quito, santiago
--------------------------------
Exemplar: indonesia
indonesia, malaysia, myanmar, philippines, thailand, bangkok, jakarta, manila, singapore
--------------------------------
Exemplar: iran
afghanistan, iran, iraq, uzbekistan, yemen, kabul, tehran
--------------------------------
Exemplar: korea
china, japan, korea, vietnam, hanoi, pyongyang, seoul, taipei, tokyo
--------------------------------
Exemplar: poland
france, germany, poland, south, warsaw
--------------------------------
Exemplar: uganda
angola, congo, ethiopia, ghana, kenya, madagascar, mozambique, nigeria, sudan, tanzania, uganda, accra, antananarivo, kampala, kinshasa, luanda, nairobi, pretoria, yaoundé
--------------------------------
Exemplar: wales


### Affinity Propagation
If you set the availability right, you can get a breakdown where the ball sports are clustered separately from the other sports. You may still have problems with some of the cities being classed as sports: this word2vec representation was trained on a dump of Wikipedia taken in 2014, and there are a lot of sports pages talking about the Athens and Beijing Olympics.
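As a small illustration of the knob being tuned, the sketch below writes a per-point value onto the diagonal of a toy similarity matrix. It assumes a recent Julia 1.x and made-up similarity values, and it stops short of running affinity propagation. In the affinity propagation literature this diagonal value is usually called the preference, and lower values tend to produce fewer exemplars and therefore fewer clusters.

```julia
using LinearAlgebra

# Toy similarity matrix for 4 points (symmetric, made-up values).
similarity = [1.0 0.8 0.1 0.2;
              0.8 1.0 0.3 0.1;
              0.1 0.3 1.0 0.7;
              0.2 0.1 0.7 1.0]

# One preference value per point; this is the value to tweak.
preference = fill(0.01, size(similarity, 1))

# Overwrite the diagonal in place; off-diagonal similarities are untouched.
similarity[diagind(similarity)] .= preference
```

Re-running the clustering over a handful of preference values and inspecting the exemplars each time is a simple way to find a setting that gives a useful breakdown.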
