# FastTextPy Usage

## Word Representations

Train a skip-gram model:

In [1]:
from fastTextPy import FastText

The text8 data can be downloaded from [http://mattmahoney.net/dc/text8.zip](http://mattmahoney.net/dc/text8.zip).

In [2]:
model = 'skipgram'  # use 'cbow' for CBOW
sg = FastText().train("text8.txt", model=model, lr=0.025, dim=100)

Training takes about 3 minutes on my machine.

## Word Similarity 

Now that the model is trained, we can test it on word similarity:

In [3]:
print(sg.similarity('apple','mac'))
print(sg.similarity('dog','cat'))

0.688960558038
0.732212970772
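
For reference, a similarity score of this kind is commonly the cosine of the angle between the two word vectors. The sketch below assumes that `sg.similarity` works this way, which has not been checked against the library source; `cosine_similarity` is a hypothetical helper, not part of fastTextPy.

```python
import numpy as np


def cosine_similarity(u, v):
    """Return the cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        # A zero vector has no direction; report no similarity.
        return 0.0
    return float(np.dot(u, v) / denom)
```

If the assumption holds, `cosine_similarity(sg['dog'], sg['cat'])` should be close to the second value printed above.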


You can also inspect the vector of a single word:

In [4]:
sg['cat'].shape

(100,)

In [5]:
sg['cat'][:10]

array([ 0.26467094,  0.05061885,  0.16552883,  0.07861796, -0.58607894,
        0.19022098, -0.02961079,  0.21464205, -0.04263076,  0.29155275], dtype=float32)

Note that word representations are generated on the fly by averaging n-gram vectors (see [1](#enriching-word-vectors-with-subword-information) for details). To generate embeddings for every word in the vocabulary, use the `transform` method:

In [6]:
sg.transform()

Now all word embeddings reside in `sg.word_emb`, and `sg.vocab` is a dictionary mapping each word to its index in `sg.word_emb`.
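
To make the on-the-fly averaging of n-gram vectors more concrete, here is a minimal sketch of the idea from [1]: a word is wrapped in boundary markers `<` and `>`, split into character n-grams (lengths 3 to 6 in the paper), and the vectors of those n-grams are averaged. The helper names and the plain dictionary `ngram_vectors` are assumptions made for illustration; the actual library hashes n-grams into a fixed-size table and also includes a vector for the whole word.

```python
import numpy as np


def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in '<' and '>'."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def word_vector(word, ngram_vectors, dim):
    """Average the vectors of all n-grams of `word` found in the table.

    `ngram_vectors` is a hypothetical dict from n-gram string to vector.
    """
    vecs = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not vecs:
        # No known n-gram: fall back to a zero vector.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vecs, axis=0)
```

Because the vector is built from subword pieces, this scheme can also produce a representation for a word that never appeared in the training data, as long as some of its n-grams did.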

Now we can look up the most similar words for a given word, say 'apple':

In [7]:
sg.most_similar(u'apple',topn=5)

([u'macintosh', u'macintoshes', u'amiga', u'workstation', u'pc'],
 [0.88937724, 0.8704108, 0.80300415, 0.78809255, 0.78022552])

It returns the words and their cosine similarities with "apple".
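
A nearest-neighbour lookup of this kind can be sketched with a word-to-index dictionary and an embedding matrix, such as `sg.vocab` and `sg.word_emb`. The function below is a hypothetical illustration, not the library's implementation, and it assumes the matrix is a 2-D array with one row per vocabulary word.

```python
import numpy as np


def most_similar_sketch(word, vocab, word_emb, topn=5):
    """Return the `topn` words closest to `word` by cosine similarity."""
    word_emb = np.asarray(word_emb, dtype=np.float64)
    index_to_word = {i: w for w, i in vocab.items()}

    # Normalise rows so that a dot product equals the cosine similarity.
    norms = np.linalg.norm(word_emb, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty rows
    unit = word_emb / norms

    query = vocab[word]
    scores = np.dot(unit, unit[query])
    scores[query] = -np.inf  # exclude the query word itself

    best = np.argsort(-scores)[:topn]
    words = [index_to_word[int(i)] for i in best]
    sims = [float(scores[i]) for i in best]
    return words, sims
```

Normalising the whole matrix once keeps each query to a single matrix-vector product, which matters when the vocabulary is large.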

To save the model for later use:

In [8]:
sg.save('text8.bin')

## Text Classification[2](#bag-of-tricks-for-efficient-text-classification) 

We use the DBpedia data to train the model. Here is a sample line:

`` __label__14 , the yorkshire times , the yorkshire times is an online newspaper founded in 2011 by richard trinder and the sole online-only paper in yorkshire. rather than employing journalists the yorkshire times focuses instead on citizen journalism with opinion commentary and analysis prevailing over simply reporting local events . as of 1 january 2014 the newspaper receives 35000 unique readers and 500000 reads per month .``
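
In this format each line carries its label as a token starting with `__label__`, followed by the text. A minimal sketch for separating the two is shown below; `split_labelled_line` is a hypothetical helper written for illustration and is not part of fastTextPy.

```python
def split_labelled_line(line, prefix="__label__"):
    """Split one training line into its label tokens and remaining text."""
    labels, tokens = [], []
    for token in line.split():
        if token.startswith(prefix):
            labels.append(token)
        else:
            tokens.append(token)
    return labels, " ".join(tokens)
```

This can be handy for checking the label distribution of a training file before running `train`.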

In [9]:
sup = FastText().train("dbpedia.train",dim=10,model='supervised',
                      lr=0.1,word_ngrams=2,min_count=1,bucket=10000000,
                      epoch=5,thread=4)

Training took about **10** seconds.

Before going further, let's have a look at the labels:

In [10]:
print(sup.labels)

{0: u'__label__2', 1: u'__label__12', 2: u'__label__4', 3: u'__label__8', 4: u'__label__14', 5: u'__label__1', 6: u'__label__9', 7: u'__label__3', 8: u'__label__13', 9: u'__label__11', 10: u'__label__10', 11: u'__label__7', 12: u'__label__6', 13: u'__label__5'}


Let's see what these labels mean:

In [11]:
classes = {
    1: "Company",
    2: "EducationalInstitution",
    3: "Artist",
    4: "Athlete",
    5: "OfficeHolder",
    6: "MeanOfTransportation",
    7: "Building",
    8: "NaturalPlace",
    9: "Village",
    10: "Animal",
    11: "Plant",
    12: "Album",
    13: "Film",
    14: "WrittenWork",
}

In [12]:
classes = {u'__label__'+str(k):v for k,v in classes.items()}

In [13]:
classes

{u'__label__1': 'Company',
 u'__label__10': 'Animal',
 u'__label__11': 'Plant',
 u'__label__12': 'Album',
 u'__label__13': 'Film',
 u'__label__14': 'WrittenWork',
 u'__label__2': 'EducationalInstitution',
 u'__label__3': 'Artist',
 u'__label__4': 'Athlete',
 u'__label__5': 'OfficeHolder',
 u'__label__6': 'MeanOfTransportation',
 u'__label__7': 'Building',
 u'__label__8': 'NaturalPlace',
 u'__label__9': 'Village'}

Now, given a line of text, we can predict its topic:

In [14]:
lines="""the world is not enough , the world is not enough ( 1999 ) is the nineteenth spy film
in the james bond series and the third to star pierce brosnan as the fictional mi6 agent 
james bond . the film was directed by michael apted with the original story and screenplay 
written by neal purvis robert wade and bruce feirstein . 
it was produced by michael g . wilson and barbara broccoli ."""

The text above is the introduction to a film; let's see whether the model predicts the right class:

In [15]:
sup.predict(lines)

[u'__label__13']

In [16]:
classes[u'__label__13']

'Film'

Try another sentence:

In [17]:
sup.predict('I love this song')

[u'__label__12']

In [18]:
classes[u'__label__12']

'Album'
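
The two-step lookup used above (predict, then index into `classes`) can be wrapped in a small helper. This is only a sketch: `label_names` is a hypothetical function, and it assumes `predict` returns a list of label strings as in the outputs shown.

```python
def label_names(predicted, classes, default="Unknown"):
    """Map predicted label strings to readable class names.

    Labels missing from `classes` are reported as `default`.
    """
    return [classes.get(label, default) for label in predicted]
```

With the objects from this notebook, `label_names(sup.predict('I love this song'), classes)` would be expected to give `['Album']`.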

### Enriching Word Vectors with Subword Information

[1] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/pdf/1607.04606v1.pdf)

```
@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}
```

### Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/pdf/1607.01759v2.pdf)

```
@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}
```
