This notebook builds one dictionary per keyword from a vocabulary of unique words. In each dictionary the keyword is the key, and the value is a list of document (text) IDs paired with the keyword's tf-idf weight in each document.

Creating the dictionaries takes a long time, so the work was split into 4 batches. To do this, create 3 copies of this notebook (4 notebooks in total). In each notebook, specify the corresponding slice of the vocabulary (described below) and the corresponding folder for saving the batch.

The dictionaries can be divided into any convenient number of batches; just create the appropriate number of copies of this notebook.

## Loading the required libraries

In [None]:
import json

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Reading tf-idf model with IDs of texts

In [None]:
tfidf_data_withID = []
# The file holds one JSON value per line; the with-block closes it after reading
with open('drive/MyDrive/Colab Notebooks/tfidf_data_withID.json', 'r') as f:
    for line in f:
        tfidf_data_withID.append(json.loads(line))

In [None]:
print(tfidf_data_withID[:2])

[[['text0', 'anz', 0.0482463], ['text0', 'appearance', 0.0182636], ['text0', 'arch', 0.0301242], ['text0', 'australian', 0.0180499], ['text0', 'bankstown', 0.1825756], ['text0', 'beaten', 0.0292881], ['text0', 'born', 0.0080996], ['text0', 'brisbane', 0.1207643], ['text0', 'broncos', 0.2182702], ['text0', 'bulldogs', 0.107212], ['text0', 'canterbury', 0.2510069], ['text0', 'career', 0.0108858], ['text0', 'centre', 0.0163003], ['text0', 'children', 0.0134892], ['text0', 'club', 0.0280191], ['text0', 'com', 0.0168096], ['text0', 'contract', 0.0194018], ['text0', 'couple', 0.0218856], ['text0', 'cronulla', 0.0455894], ['text0', 'david', 0.0426766], ['text0', 'deal', 0.0387317], ['text0', 'debut', 0.0345433], ['text0', 'defeat', 0.0220487], ['text0', 'dropped', 0.0216048], ['text0', 'early', 0.0082615], ['text0', 'end', 0.0203032], ['text0', 'extended', 0.0195448], ['text0', 'external', 0.0030123], ['text0', 'fill', 0.0274206], ['text0', 'final', 0.0511782], ['text0', 'finished', 0.0181537
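As the output shows, each element of `tfidf_data_withID` is a list of `[text ID, word, weight]` triples for one text. The following is a minimal sketch of working with that layout, assuming every record has exactly three fields. The `text0` values are copied from the output above; the `text1` line and the names `sample_lines`, `records`, and `flat` are made up for illustration.

```python
import json

# Sample JSON lines in the same layout as the file (the text1 line is hypothetical)
sample_lines = [
    '[["text0", "anz", 0.0482463], ["text0", "appearance", 0.0182636]]',
    '[["text1", "club", 0.0280191]]',
]
records = [json.loads(line) for line in sample_lines]

# Unpack each triple into named parts instead of indexing with [0], [1], [2]
flat = [
    (text_id, word, weight)
    for text in records
    for text_id, word, weight in text
]
print(flat[0])
```

Naming the three fields this way can make later loops over the data easier to read.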

## Reading a dictionary of keywords

In [None]:
with open('drive/MyDrive/Colab Notebooks/gensim_vocabulary.json', 'r') as f:
    basic_dict = json.load(f)

In [None]:
print(basic_dict)



## Specifying dictionary slices for each notebook copy

To speed up data processing, you can run several copies of this notebook, each processing one slice of the data. For example, a keyword dictionary of 78844 words can be divided into 4 parts as shown below.
In each copy of the notebook, specify the corresponding slice for the **slice_basic_dict** variable:

1.   [:19750]
2.   [19750:39500]
3.   [39500:59250]
4.   [59250:]







In [None]:
lst = list(basic_dict.items())
slice_basic_dict = dict(lst[:19750])
# print(slice_basic_dict)
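The slice boundaries listed above can also be computed instead of typed by hand, which may help when a different number of batches is used. This is only a sketch: `slice_bounds` is a hypothetical helper, not part of the original notebook, and the chunk size of 19750 is taken from the list above.

```python
def slice_bounds(total, chunk_size):
    """Return (start, stop) index pairs that cover range(total) in chunks."""
    return [
        (start, min(start + chunk_size, total))
        for start in range(0, total, chunk_size)
    ]

bounds = slice_bounds(78844, 19750)
print(bounds)

# In notebook copy number n (0-based), the slice could then be taken as:
# start, stop = bounds[n]
# slice_basic_dict = dict(lst[start:stop])
```

The last pair stops at the vocabulary size, so it matches the open-ended `[59250:]` slice.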

## Creating dictionaries

The dictionaries are saved in 4 batches, one folder per notebook copy:


1.   temporary1
2.   temporary2
3.   temporary3
4.   temporary4





In [None]:
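for k, v in slice_basic_dict.items():
    # v is the keyword; collect [text ID, weight] pairs from every text that contains it
    temp_dict = {v: []}
    for text in tfidf_data_withID:
        for word in text:
            # word is a [text ID, keyword, weight] triple
            if word[1] == v:
                temp_dict[v].append([word[0], word[2]])
    # Open the file only once the dictionary is complete.
    # Change temporary1 to the folder for this notebook copy.
    with open('drive/MyDrive/Colab Notebooks/all_temporary/temporary1/%s.json' % k, 'w') as f:
        json.dump(temp_dict, f, indent=4)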
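
The loop above rescans the whole tf-idf data once per keyword. A possible alternative, sketched here on the assumption that the grouped data fits in memory, collects all `[text ID, weight]` pairs by keyword in a single pass and then looks each keyword up. `build_index`, `toy_data`, and `toy_vocab` are illustrative names with made-up values and are not part of the original notebook.

```python
from collections import defaultdict

def build_index(tfidf_data):
    """Map each word to its list of [text ID, weight] pairs in one pass."""
    index = defaultdict(list)
    for text in tfidf_data:
        for text_id, word, weight in text:
            index[word].append([text_id, weight])
    return index

# Toy data in the same [text ID, word, weight] layout (values are made up)
toy_data = [
    [["text0", "club", 0.5], ["text0", "career", 0.1]],
    [["text1", "club", 0.2]],
]
toy_vocab = {"0": "club", "1": "career", "2": "debut"}

index = build_index(toy_data)
for k, v in toy_vocab.items():
    # .get avoids adding empty entries for keywords that never occur
    temp_dict = {v: index.get(v, [])}
    print(k, temp_dict)
```

Each `temp_dict` could then be written with `json.dump` in the same way as in the loop above.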