# Topic modelling with Word2Vec

Import the [gensim library](https://radimrehurek.com/gensim/models/word2vec.html), which we use to work with Word2Vec.

In [1]:
from gensim.models.word2vec import Word2Vec
from gensim.models import Phrases
import ujson as json
import re

## Training a simple model

In practice, topic modelling requires a large quantity of training data to ensure good coverage. For this example, though, we trained the model on only 10,000 posts so that the demo runs quickly.

*Unfortunately we cannot provide you with the data we used for this example. The data file we used contained line-delimited JSON, where each object had an `interaction.content` property, for example:*

    { "interaction": { "content": "Some example post content" } }

*You can easily adapt the sample() function below to yield words from sample data you have access to.*

In [2]:
path='path to your data file'

In [3]:
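# This function yields each post as a list of lowercase words (one list per post)
def sample():
    with open(path) as f:
        for line in f:
            try:
                post = json.loads(line)
                # Split the post content on runs of non-word characters
                doc = re.split(r'\W+', post['interaction']['content'].lower())
                yield doc
            except Exception:
                # Skip malformed lines instead of yielding the exception as a document
                continue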
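
As one way to adapt `sample()` to other data, here is a minimal sketch that tokenises plain text with one post per line. The name `sample_from_lines` and the one-post-per-line layout are assumptions for illustration, not part of the original data pipeline.

```python
import re

def sample_from_lines(lines):
    """Yield each non-empty line as a list of lowercase word tokens."""
    for line in lines:
        # Split on runs of non-word characters and drop empty strings
        tokens = [t for t in re.split(r'\W+', line.lower()) if t]
        if tokens:
            yield tokens

# Any iterable of strings works, including an open file object
posts = ["New brake pads fitted!", "", "Ford Mustang, 2011"]
docs = list(sample_from_lines(posts))
print(docs)
```

Unlike the JSON version, this variant also filters out the empty strings that `re.split` produces when a line starts or ends with punctuation.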

In [None]:
# For example, this is the first post broken into words
for i in sample():
    print(i)
    break

Train the model, passing the output of sample() as the training corpus:
* **min_count** = the minimum number of times a word must appear in the training set to be included in the vocabulary
* **size** = the dimensionality of the word vectors (300 here)
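
To make the effect of **min_count** concrete, the sketch below counts words across a toy corpus and keeps only those that reach the threshold. The helper name `keep_frequent_words` and the toy documents are made up for illustration. gensim does this pruning internally when it builds the vocabulary.

```python
from collections import Counter

def keep_frequent_words(docs, min_count):
    """Return the set of words that occur at least min_count times."""
    counts = Counter(word for doc in docs for word in doc)
    return {word for word, n in counts.items() if n >= min_count}

# Toy corpus: "discs" and "clio" each appear only once
docs = [["brake", "pads"], ["brake", "discs"], ["brake", "pads", "clio"]]
vocab = keep_frequent_words(docs, min_count=2)
print(sorted(vocab))
```

With a small corpus and `min_count=10`, many legitimate but rare words fall below the threshold and never receive a vector.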

In [6]:
model = Word2Vec(min_count=10,size=300,workers=4)
model.build_vocab(sample())
model.train(sample())

302070
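
The cell above uses the gensim API that was current when this notebook was written. In newer gensim releases (4.0 and later) the `size` argument is called `vector_size`, `train()` needs `total_examples` and `epochs`, and word lookups go through `model.wv`. Below is a rough sketch of the equivalent training step under those assumptions. `RestartableCorpus` and `train_word2vec` are hypothetical helper names, and the wrapper exists because training makes several passes over the data, which a plain generator cannot support.

```python
class RestartableCorpus(object):
    """Wrap a generator function so the corpus can be iterated more than once."""

    def __init__(self, make_iterator):
        self.make_iterator = make_iterator

    def __iter__(self):
        # Call the generator function again on every pass
        return iter(self.make_iterator())


def train_word2vec(corpus):
    # Imported here so the rest of the sketch runs without gensim installed
    from gensim.models import Word2Vec

    model = Word2Vec(min_count=10, vector_size=300, workers=4)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model


# Usage with the notebook's generator function (not run here):
# model = train_word2vec(RestartableCorpus(sample))
# model.wv.most_similar("brake")
```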

In [7]:
# The number of words/features in the model (plotted in the vector space)
len(model.vocab)

3773

In [14]:
# Words in the model are represented as a 300-dimension vector
model['brake']

array([-0.01042447,  0.02375209, -0.01082089, -0.0425309 ,  0.08850687,
        0.01835847,  0.09388897,  0.0229075 ,  0.06897521,  0.00579358,
        0.02837576,  0.01110327,  0.04553978, -0.03796777, -0.03296932,
       -0.0075057 ,  0.00058247,  0.00256892, -0.00880944, -0.03536413,
        0.05947323,  0.07111327, -0.00194172, -0.04563519, -0.0275267 ,
        0.05214039, -0.05474159,  0.01259535,  0.05187937, -0.04075418,
        0.03328146, -0.13077228,  0.01483395,  0.0171774 ,  0.05006329,
        0.09928679,  0.02557462,  0.02554442,  0.03180938,  0.10886257,
        0.04842255,  0.01555276,  0.06251799,  0.08433534,  0.10723314,
       -0.03573573,  0.00350808,  0.04262157, -0.01684722, -0.0122436 ,
        0.02143183, -0.08224846, -0.10467756,  0.09027454,  0.04416394,
        0.01605871,  0.07377852, -0.09633812, -0.03675043, -0.00154874,
       -0.04733178,  0.10887686, -0.05309524,  0.0071005 ,  0.01628326,
       -0.02001717, -0.01103722,  0.05804011, -0.00028361, ...

We can now query our simple model to look for term similarity.

In [9]:
model.most_similar("brake")

[(u'pads', 0.8729451894760132),
 (u'discs', 0.8260030746459961),
 (u'coil', 0.8060240149497986),
 (u'battery', 0.8055069446563721),
 (u'clutch', 0.8049130439758301),
 (u'radiator', 0.8034414052963257),
 (u'alternator', 0.8013132810592651),
 (u'diff', 0.7878439426422119),
 (u'mounts', 0.7793276309967041),
 (u'plugs', 0.7779099345207214)]
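
The scores in the output are cosine similarities between word vectors. As a minimal sketch of that ranking, the code below scores every other word against the query and sorts the results. The 3-dimensional vectors are made up for illustration; the real model uses 300 dimensions and learned values.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(vectors, word, topn=2):
    """Rank all other words by cosine similarity to the query word."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy vectors, not values from the trained model
toy = {
    "brake": [1.0, 0.9, 0.0],
    "pads": [0.9, 1.0, 0.1],
    "clutch": [0.7, 0.4, 0.3],
    "holiday": [0.0, 0.1, 1.0],
}
ranked = most_similar(toy, "brake")
print(ranked)
```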

Notice how terms cluster around topics: the nearest neighbours of the brand Ford include Ford model and trim names such as Mustang and XLT, alongside other brands and model years.

In [10]:
model.most_similar(positive=["ford"])

[(u'jeep', 0.7569979429244995),
 (u'subaru', 0.7534029483795166),
 (u'dodge', 0.7268344759941101),
 (u'xlt', 0.7241834402084351),
 (u'mustang', 0.7216092944145203),
 (u'05', 0.7102600336074829),
 (u'01', 0.7099963426589966),
 (u'lx', 0.7038288116455078),
 (u'2011', 0.6979385018348694),
 (u'f', 0.6966347694396973)]

You would expect a query that combines two brands to return mostly other brands. Here the results are mixed, with body styles, numbers and model names appearing alongside brands, so we may need a larger training set for the model to learn this.

In [11]:
model.most_similar(positive=["ford","audi"])

[(u'convertible', 0.8094639778137207),
 (u'subaru', 0.7635812759399414),
 (u'jeep', 0.7633465528488159),
 (u'01', 0.758976936340332),
 (u'17', 0.7423786520957947),
 (u'52', 0.7350496053695679),
 (u'volkswagen', 0.7329504489898682),
 (u'mustang', 0.7269424796104431),
 (u'lx', 0.7261062264442444),
 (u'suv', 0.72319495677948)]

We can also use the model to identify terms that don't belong to a group.

In [12]:
model.doesnt_match(["ford", "audi", "bmw", "wheel"])

'wheel'
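
One way to picture how the odd term is found: average the (length-normalised) vectors of the group, then pick the word whose vector is least similar to that average. The sketch below follows that idea with made-up 2-dimensional vectors. It is an illustration of the approach as we understand it, not gensim's actual code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def doesnt_match(vectors, words):
    """Return the word least similar to the mean of the group."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    # Dot product of unit vectors equals cosine similarity with the group mean
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]

# Made-up toy vectors: the three brands point roughly the same way
toy = {
    "ford": [1.0, 0.1],
    "audi": [0.9, 0.2],
    "bmw": [0.95, 0.15],
    "wheel": [0.1, 1.0],
}
odd = doesnt_match(toy, ["ford", "audi", "bmw", "wheel"])
print(odd)
```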

Notice that, because our training set is small, we quickly hit terms that are not in the model's vocabulary (any word seen fewer than `min_count` times is dropped). Topic modelling requires a large training set to be effective for our use case.

In [13]:
model.most_similar(positive=["renault","clio"])

KeyError: "word 'clio' not in vocabulary"
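
To avoid the `KeyError`, you can check the terms against the vocabulary before querying. This is a small sketch in which `vocab` is a stand-in set for the model's vocabulary (`model.vocab` in this notebook) and `split_known` is a hypothetical helper name.

```python
def split_known(vocab, words):
    """Separate query words into those the model knows and those it does not."""
    known = [w for w in words if w in vocab]
    missing = [w for w in words if w not in vocab]
    return known, missing

# Stand-in vocabulary for illustration
vocab = {"renault", "ford", "brake"}
known, missing = split_known(vocab, ["renault", "clio"])
if missing:
    print("Not in vocabulary:", missing)
# Query the model with the words in `known` only
```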

## Exploring the Google News model

The [word2vec project page](https://code.google.com/p/word2vec/) provides a pre-trained model that you can download and load. It was trained on part of the Google News dataset (about 100 billion words) and contains 300-dimensional vectors for 3 million words and phrases.

We can load this example model and explore the topic relationships.

In [15]:
# Load the model
google_news = Word2Vec.load_word2vec_format('/Users/richard/Downloads/GoogleNews-vectors-negative300.bin', binary=True)
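
In newer gensim releases the loader lives on `KeyedVectors` rather than `Word2Vec`. A hedged sketch of the equivalent call follows. `load_google_news` is a hypothetical wrapper name, and the optional `limit` argument loads only the first N vectors, which helps on machines with little memory.

```python
def load_google_news(path, limit=None):
    """Load pre-trained vectors stored in word2vec's binary format."""
    # Imported here so the sketch can be read and run without gensim installed
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)


# Usage (not run here; the path is wherever you saved the download):
# google_news = load_google_news('GoogleNews-vectors-negative300.bin', limit=500000)
# google_news.most_similar(positive=['Germany'])
```

The object returned supports the same `most_similar`, `similarity` and `doesnt_match` queries used in the rest of this section.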

Immediately we can see the similarity between topics such as countries.

In [16]:
google_news.most_similar(positive=['Germany'])

[(u'Austria', 0.7461062073707581),
 (u'German', 0.7178748846054077),
 (u'Germans', 0.6628648042678833),
 (u'Switzerland', 0.6506868004798889),
 (u'Hungary', 0.6504981517791748),
 (u'Germnay', 0.649348258972168),
 (u'Netherlands', 0.6437495946884155),
 (u'Cologne', 0.6430778503417969),
 (u'symbol_RSTI', 0.6389946937561035),
 (u'Annita_Kirsten', 0.634294867515564)]

And politicians...

In [17]:
google_news.most_similar(positive=['Clinton'])

[(u'Hillary_Clinton', 0.7631065845489502),
 (u'Obama', 0.7526832818984985),
 (u'Bill_Clinton', 0.7416832447052002),
 (u'Hillary_Rodham_Clinton', 0.7254317402839661),
 (u'Sen._Hillary_Clinton', 0.7086110711097717),
 (u'Hillary', 0.6970474720001221),
 (u'Senator_Hillary_Clinton', 0.6961780190467834),
 (u'McCain', 0.6851686835289001),
 (u'Clintons', 0.6733236312866211),
 (u'Barack_Obama', 0.6713167428970337)]

Adding multiple positive terms gives a stronger focus on a concept such as brands...

In [20]:
google_news.most_similar(positive=['Ford', 'Audi'])

[(u'BMW', 0.7375781536102295),
 (u'Porsche', 0.7340955138206482),
 (u'Volkswagen', 0.716567873954773),
 (u'Mercedes_Benz', 0.7152481079101562),
 (u'Nissan', 0.7118146419525146),
 (u'Volvo', 0.6946060657501221),
 (u'Mazda', 0.692476749420166),
 (u'Jaguar', 0.6696500182151794),
 (u'VW', 0.6689708232879639),
 (u'Toyota', 0.6618623733520508)]

You can also ask the model for the similarity between any two terms.

In [21]:
google_news.similarity('Germany','France')

0.62707561920002575

Notice that the model considers China less similar to Germany than France is, presumably because the stories and contexts in which China is discussed differ from those for Germany.

In [22]:
google_news.similarity('Germany','China')

0.38773252084548004

You can also use vector maths to find related topics. 

For example, here we're effectively asking for the capital of Germany. This works because the vector from France to Paris is approximately equal to the vector from Germany to Berlin in the vector space.

    France - Paris ~= Germany - Capital of Germany

Rearranging this...

    Capital of Germany ~= Germany + Paris - France

In [23]:
google_news.most_similar(positive=['germany', 'paris'], negative=['france'])

[(u'berlin', 0.4841364920139313),
 (u'german', 0.4656967520713806),
 (u'lindsay_lohan', 0.4559224843978882),
 (u'heidi', 0.44840937852859497),
 (u'switzerland', 0.44479838013648987),
 (u'lil_kim', 0.4430604577064514),
 (u'las_vegas', 0.4418063759803772),
 (u'christina', 0.43938425183296204),
 (u'joel', 0.4375365078449249),
 (u'russia', 0.43744248151779175)]

Returning to our example on banking, we can explore terms similar to 'bank'.

In [24]:
google_news.most_similar(positive=['bank'])

[(u'banks', 0.7440758943557739),
 (u'banking', 0.690161406993866),
 (u'Bank', 0.6698698401451111),
 (u'lender', 0.6342284679412842),
 (u'banker', 0.6092954277992249),
 (u'depositors', 0.6031532287597656),
 (u'mortgage_lender', 0.579797625541687),
 (u'depositor', 0.5716427564620972),
 (u'BofA', 0.5714625120162964),
 (u'Citibank', 0.5589520931243896)]

Again, by using multiple terms we can focus on well-known banks.

In [25]:
google_news.most_similar(positive=['bank','hsbc'])

[(u'wells_fargo', 0.6432552337646484),
 (u'banks', 0.640464186668396),
 (u'citibank', 0.6323216557502747),
 (u'banking', 0.6259457468986511),
 (u'barclays', 0.6250307559967041),
 (u'citigroup', 0.6233223676681519),
 (u'BankAm', 0.6208717226982117),
 (u'Vietnam_Sacombank', 0.6041629314422607),
 (u'goldman_sachs', 0.599765956401825),
 (u'jpmorgan', 0.5980450510978699)]