In [3]:
import matplotlib
%matplotlib inline


How to download pre-trained models and corpora
==============================================

Demonstrates simple and quick access to common corpora, models, and other data.


In [4]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

One of Gensim's features is simple and easy access to some common data.
The `gensim-data <https://github.com/RaRe-Technologies/gensim-data>`_ project stores a variety of corpora, models and other data.
Gensim has a :py:mod:`gensim.downloader` module for programmatically accessing this data.
The module leverages a local cache that ensures data is downloaded at most once.
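
As a rough illustration of the cache check, the sketch below tests whether a folder for a resource already exists. It assumes the default layout visible in this tutorial's log output and file paths (a ``gensim-data`` folder in your home directory, with one subfolder per resource). ``resource_dir`` and ``is_cached`` are hypothetical helpers, not part of :py:mod:`gensim.downloader`, and the real module may apply further checks.

```python
from pathlib import Path


def resource_dir(name, base_dir=None):
    """Return the folder where a resource named `name` would be cached.

    Assumption: the cache lives in ~/gensim-data unless `base_dir` is given.
    """
    base = Path(base_dir) if base_dir is not None else Path.home() / "gensim-data"
    return base / name


def is_cached(name, base_dir=None):
    """Return True if a cache folder for `name` already exists."""
    return resource_dir(name, base_dir).is_dir()


print(is_cached("text8"))
```

If your cache lives somewhere else, pass that location as ``base_dir``.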

This tutorial:

* Retrieves the text8 corpus, unless it is already on your local machine
* Trains a Word2Vec model from the corpus (see `sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py` for a detailed tutorial)
* Leverages the model to calculate word similarity
* Demonstrates using the API to load other models and corpora

Let's start by importing the api module.




In [5]:
import gensim.downloader as api

Now, let's download the text8 corpus and load it. The download happens automatically if the corpus is not already in the local cache.




In [6]:
corpus = api.load('text8')

2020-07-29 12:01:42,953 : INFO : Creating /home/rishabhmakes/gensim-data




2020-07-29 12:04:04,444 : INFO : text8 downloaded


In this case, ``corpus`` is an iterable.
If you look under the covers, it has the following definition:



In [7]:
import inspect
print(inspect.getsource(corpus.__class__))

class Dataset(object):
    def __init__(self, fn):
        self.fn = fn

    def __iter__(self):
        corpus = Text8Corpus(self.fn)
        for doc in corpus:
            yield doc



For more details, look inside the file that defines the Dataset class for your particular resource.




In [8]:
print(inspect.getfile(corpus.__class__))

/home/rishabhmakes/gensim-data/text8/__init__.py
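
The ``Dataset`` wrapper defines ``__iter__`` instead of being a one-shot generator. This matters because training makes several passes over the data, as the ``EPOCH`` lines in the training log suggest. The sketch below shows the same pattern with a hypothetical ``LineDataset`` class and a throwaway text file. It is an illustration of the idea, not the code Gensim ships.

```python
import os
import tempfile


class LineDataset:
    """Hypothetical stand-in for a Dataset: streams one tokenized line at a time."""

    def __init__(self, fn):
        self.fn = fn

    def __iter__(self):
        # Reopen the file on every pass, so the object can be iterated repeatedly.
        with open(self.fn, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


# Write a tiny throwaway file to iterate over.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("anarchism originated as a term\nof abuse first used against\n")

docs = LineDataset(tmp.name)
first_pass = list(docs)
second_pass = list(docs)  # works again because __iter__ restarts from the top
os.remove(tmp.name)

print(first_pass)
print(first_pass == second_pass)
```

A plain generator would be exhausted after the first pass, so the second ``list(...)`` call would return an empty list.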


Now that the corpus has been downloaded and loaded, let's train a Word2Vec model on it.




In [9]:
from gensim.models.word2vec import Word2Vec
model = Word2Vec(corpus)

2020-07-29 12:04:13,734 : INFO : collecting all words and their counts
2020-07-29 12:04:13,742 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-07-29 12:04:18,973 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2020-07-29 12:04:18,974 : INFO : Loading a fresh vocabulary
2020-07-29 12:04:19,164 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2020-07-29 12:04:19,165 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2020-07-29 12:04:19,349 : INFO : deleting the raw counts dictionary of 253854 items
2020-07-29 12:04:19,355 : INFO : sample=0.001 downsamples 38 most-common words
2020-07-29 12:04:19,356 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2020-07-29 12:04:19,542 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2020-07-29 12:04:19,543 : 

2020-07-29 12:05:23,659 : INFO : EPOCH 5 - PROGRESS: at 24.51% examples, 1018499 words/s, in_qsize 4, out_qsize 1
2020-07-29 12:05:24,668 : INFO : EPOCH 5 - PROGRESS: at 32.51% examples, 1014455 words/s, in_qsize 5, out_qsize 0
2020-07-29 12:05:25,669 : INFO : EPOCH 5 - PROGRESS: at 40.45% examples, 1011140 words/s, in_qsize 5, out_qsize 0
2020-07-29 12:05:26,670 : INFO : EPOCH 5 - PROGRESS: at 48.32% examples, 1007553 words/s, in_qsize 6, out_qsize 0
2020-07-29 12:05:27,673 : INFO : EPOCH 5 - PROGRESS: at 56.38% examples, 1008068 words/s, in_qsize 5, out_qsize 0
2020-07-29 12:05:28,675 : INFO : EPOCH 5 - PROGRESS: at 64.61% examples, 1010546 words/s, in_qsize 5, out_qsize 0
2020-07-29 12:05:29,678 : INFO : EPOCH 5 - PROGRESS: at 72.84% examples, 1012829 words/s, in_qsize 5, out_qsize 0
2020-07-29 12:05:30,679 : INFO : EPOCH 5 - PROGRESS: at 81.42% examples, 1016913 words/s, in_qsize 4, out_qsize 0
2020-07-29 12:05:31,695 : INFO : EPOCH 5 - PROGRESS: at 89.65% examples, 1016711 words/s
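
The log reports that ``effective_min_count=5`` retains only 28% of the word types but 98% of the corpus. Rare words make up most of the vocabulary yet account for few of the tokens. The sketch below reproduces that effect on toy sentences that are made up for illustration. ``prune_vocabulary`` is a hypothetical helper and a simplification, not Gensim's implementation.

```python
from collections import Counter


def prune_vocabulary(sentences, min_count):
    """Count tokens and keep only those seen at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = {token: n for token, n in counts.items() if n >= min_count}
    return counts, kept


# Toy data, made up for illustration.
toy_sentences = [
    ["the", "tree", "has", "a", "leaf"],
    ["the", "tree", "has", "bark"],
    ["the", "bird", "sat", "in", "the", "tree"],
]
counts, kept = prune_vocabulary(toy_sentences, min_count=2)
retained_share = sum(kept.values()) / sum(counts.values())

print(sorted(kept))
print(f"{len(kept)} of {len(counts)} word types kept, {retained_share:.0%} of the tokens")
```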

Now that we have our Word2Vec model, let's find the words most similar to 'tree':




In [10]:
print(model.wv.most_similar('tree'))

2020-07-29 12:05:43,526 : INFO : precomputing L2-norms of word weight vectors


[('trees', 0.7150980830192566), ('leaf', 0.7039482593536377), ('bark', 0.6521586179733276), ('avl', 0.625106155872345), ('fruit', 0.6216151714324951), ('flower', 0.6119766235351562), ('bird', 0.6109820604324341), ('cactus', 0.5930464267730713), ('egg', 0.5897567868232727), ('pond', 0.5840590596199036)]
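
The number next to each word is the cosine similarity between its vector and the vector for 'tree'. The sketch below computes the same kind of ranking in plain Python. The three-dimensional vectors are invented for illustration, and ``cosine_similarity`` and ``most_similar`` here are hypothetical helpers, not the Gensim methods.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "tree": [1.0, 0.0, 0.0],
    "trees": [0.9, 0.1, 0.0],
    "leaf": [0.6, 0.8, 0.0],
    "car": [0.0, 0.0, 1.0],
}
ranked = most_similar("tree", toy_vectors)
print(ranked)
```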


You can use the API to download many corpora and models. To list everything that is available, use the code below:




In [11]:
import json
info = api.info()
print(json.dumps(info, indent=4))

{
    "corpora": {
        "semeval-2016-2017-task3-subtaskBC": {
            "num_records": -1,
            "record_format": "dict",
            "file_size": 6344358,
            "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py",
            "license": "All files released for the task are free for general research use",
            "fields": {
                "2016-train": [
                    "..."
                ],
                "2016-dev": [
                    "..."
                ],
                "2017-test": [
                    "..."
                ],
                "2016-test": [
                    "..."
                ]
            },
            "description": "SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collect

There are two types of data: corpora and models.



In [12]:
print(info.keys())

dict_keys(['corpora', 'models'])


Let's have a look at the available corpora:



In [14]:
for corpus_name, corpus_data in sorted(info['corpora'].items()):
    print(
        '%s (%d records): %s' % (
            corpus_name,
            corpus_data.get('num_records', -1),
            corpus_data['description'][:40] + '...',
        )
    )

20-newsgroups (18846 records): The notorious collection of approximatel...
__testing_matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Synopsis of t...
__testing_multipart-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Synopsis of t...
fake-news (12999 records): News dataset, contains text and metadata...
patent-2017 (353197 records): Patent Grant Full Text. Contains the ful...
quora-duplicate-questions (404290 records): Over 400,000 lines of potential question...
semeval-2016-2017-task3-subtaskA-unannotated (189941 records): SemEval 2016 / 2017 Task 3 Subtask A una...
semeval-2016-2017-task3-subtaskBC (-1 records): SemEval 2016 / 2017 Task 3 Subtask B and...
text8 (1701 records): First 100,000,000 bytes of plain text fr...
wiki-english-20171001 (4924894 records): Extracted Wikipedia dump from October 20...
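
Since the catalogue is a plain dictionary, you can filter and sort it before deciding what to download. The sketch below works on a small excerpt copied from the listing above, so that it runs without a network connection. ``smallest_first`` is a hypothetical helper, and treating a non-positive ``num_records`` as "count not available" is an assumption.

```python
# Excerpt copied from the listing above; info['corpora'] has the same shape.
corpora_excerpt = {
    "20-newsgroups": {"num_records": 18846},
    "fake-news": {"num_records": 12999},
    "semeval-2016-2017-task3-subtaskBC": {"num_records": -1},
    "text8": {"num_records": 1701},
}


def smallest_first(entries):
    """Return (name, num_records) pairs with a positive record count, smallest first."""
    sized = [
        (name, data.get("num_records", -1))
        for name, data in entries.items()
        if data.get("num_records", -1) > 0
    ]
    return sorted(sized, key=lambda pair: pair[1])


ordered = smallest_first(corpora_excerpt)
print(ordered)
```

To run it on the full catalogue, pass ``info['corpora']`` instead of the excerpt.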


... and the same for models:



In [15]:
for model_name, model_data in sorted(info['models'].items()):
    print(
        '%s (%d records): %s' % (
            model_name,
            model_data.get('num_records', -1),
            model_data['description'][:40] + '...',
        )
    )

__testing_word2vec-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Word vecrors ...
conceptnet-numberbatch-17-06-300 (1917247 records): ConceptNet Numberbatch consists of state...
fasttext-wiki-news-subwords-300 (999999 records): 1 million word vectors trained on Wikipe...
glove-twitter-100 (1193514 records): Pre-trained vectors based on  2B tweets,...
glove-twitter-200 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-25 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-50 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-wiki-gigaword-100 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-200 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-300 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-50 (400000 records): Pre-trained vectors based on Wikipedia 2...
word2vec-google-news-300 (3000000 records): Pre-trai

To get detailed information about a specific model or corpus, pass its name to ``api.info``:




In [16]:
fake_news_info = api.info('fake-news')
print(json.dumps(fake_news_info, indent=4))

{
    "num_records": 12999,
    "record_format": "dict",
    "file_size": 20102776,
    "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/fake-news/__init__.py",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "fields": {
        "crawled": "date the story was archived",
        "ord_in_thread": "",
        "published": "date published",
        "participants_count": "number of participants",
        "shares": "number of Facebook shares",
        "replies_count": "number of replies",
        "main_img_url": "image from story",
        "spam_score": "data from webhose.io",
        "uuid": "unique identifier",
        "language": "data from webhose.io",
        "title": "title of story",
        "country": "data from webhose.io",
        "domain_rank": "data from webhose.io",
        "author": "author of story",
        "comments": "number of Facebook comments",
        "site_url": "site URL from BS detector",
        "text": "tex

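The ``file_size`` field helps you judge a download before starting it. The sketch below turns the value reported for ``fake-news`` into a readable size. It assumes the field is a byte count, which this tutorial does not state explicitly, and ``format_size`` is a hypothetical helper.

```python
def format_size(num_bytes):
    """Format a byte count using binary units (1 KiB = 1024 B)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024


# file_size value taken from the fake-news entry above.
print(format_size(20102776))
```
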
Sometimes you do not want to load the model into memory and only need the path to its file. In that case, pass ``return_path=True``:




In [17]:
print(api.load('glove-wiki-gigaword-50', return_path=True))

[--------------------------------------------------] 0.2% 0.1/66.0MB downloaded

KeyboardInterrupt: 
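
With a path in hand, you can inspect or process the file yourself. The sketch below reads the first lines of a file without loading the rest. It assumes the path points to a gzip-compressed text file, which this tutorial does not confirm, so check the resource's ``reader_code`` first. ``peek_lines`` is a hypothetical helper, and the demo file is a made-up stand-in, not real model data.

```python
import gzip
import itertools
import os
import tempfile


def peek_lines(path, n=2):
    """Return the first `n` lines of a gzip-compressed text file, read lazily."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]


# Build a small stand-in file; its contents are made up.
demo_path = os.path.join(tempfile.mkdtemp(), "demo-vectors.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as out:
    out.write("2 3\nglass 0.1 0.2 0.3\ncup 0.2 0.1 0.4\n")

preview = peek_lines(demo_path, n=2)
print(preview)
```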

If you want to load the model into memory instead, use:




In [None]:
model = api.load("glove-wiki-gigaword-50")
model.most_similar("glass")

For corpora, the data is never loaded into memory: every corpus is wrapped in a special ``Dataset`` class that provides an ``__iter__`` method.
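
Because such an object only promises iteration, the cheapest way to look at a few documents is to stop early rather than to build a full list. The sketch below does this with ``itertools.islice``. ``stream_documents`` is a hypothetical stand-in for a streamed corpus and yields made-up token lists.

```python
import itertools


def stream_documents():
    """Hypothetical stand-in for a streamed corpus; yields made-up token lists."""
    for i in range(1_000_000):
        yield ["doc", str(i)]


# Only the first two documents are ever produced; the rest are never generated.
first_two = list(itertools.islice(stream_documents(), 2))
print(first_two)
```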


