In this notebook, we look at how to choose the best storage mode for your ConvoKit corpus: the original RAM-based implementation or the new database-based storage mode.

In [1]:
from convokit import Corpus, Speaker, Utterance, download



### Historically...
Historically, ConvoKit has let you work with conversational data directly in program memory through the Corpus class, with long-term storage provided by dumping the contents of a Corpus to disk in JSON format. This paradigm works well for distributing and storing static datasets, and for workloads that compute over some or all of the data in a short time period and optionally store the results in the permanent representation of the dataset. For example, ConvoKit distributes the datasets included with the library in JSON format, which you can load into program memory to explore and compute with.

In [2]:
# First we download the JSON files for the corpus from the ConvoKit servers, 
# if we don't already have a local copy. Then, we load the corpus into memory 
# by constructing a Corpus object and giving it the path to the corpus (as returned by download). 
reddit_small = Corpus(filename=download('reddit-corpus-small'), storage_type='mem')

# Now we can easily access and work with the data through the corpus object.
seen_utts = []
for conversation in reddit_small.iter_conversations():
    for utterance in conversation.iter_utterances():
        print(f'{utterance.speaker.id}: {utterance.text}')
        print('------------------------------')
        utterance.meta['seen'] = True
        seen_utts.append(utterance.id)
    break
    
# Finally, we write the changes to the corpus (in this case, which utterances we saw in our test)
# back to the JSON files, so they are reflected in our local persistent representation of the corpus.
reddit_small.dump(corpus_id='reddit-corpus-small')

# We can confirm this by constructing a new corpus from the updated JSON files.
reddit_small_2 = Corpus(corpus_id='reddit-corpus-small', storage_type='mem')
for utterance in reddit_small_2.iter_utterances():
    if utterance.id in seen_utts:
        assert utterance.meta['seen'] == True
    else:
        assert 'seen' not in utterance.meta

Dataset already exists at /Users/eoin/.convokit/downloads/reddit-corpus-small
True
Loading corpus None from disk at /Users/eoin/.convokit/downloads/reddit-corpus-small


100%|██████████| 297132/297132 [00:01<00:00, 188091.79it/s]


AutoModerator: Talk about your day. Anything goes, but subreddit rules still apply. Please be polite to each other! 

------------------------------
belmont_lay: How to spoil a kpop fangirl's day. Tell her you want to send her a pic of g dragon.

She'll be expecting something like [this](https://cdn2.i-scmp.com/sites/default/files/styles/landscape/public/images/methode/2018/01/18/482a92dc-fc0b-11e7-b2f7-03450b80c791_1280x720_135443.jpg?itok=tXZBSZ0u), but no. Send his [candid pics without makeup](https://koreaboo-cdn.storage.googleapis.com/2017/10/GDragon-Beard-01.jpg).

Amazing what makeup can do for guys too, not just girls.
------------------------------
littlefiredragon: His "candid" pics look better leh
------------------------------
belmont_lay: wat.. he looks like a random extra in a JAV
------------------------------
rheinl: “Hi Kpop girl can I send you a pic of g-dragon?”

“Uhhh ok” (what the hell? this guy is so weird)

“Here you go” *sends pic of g-dragon with no makeup*

“U

100%|██████████| 297132/297132 [00:01<00:00, 177488.88it/s]


Notice that the cell above constructs a corpus in two different ways. First, we use the `filename` argument on line 4, and later we use the `corpus_id` argument on lines 18 and 21. This is because a local download of a ConvoKit-provided corpus is best treated as a locally cached copy of the original corpus as it is distributed (i.e., we should not write to it directly).

A downloaded corpus called `<corpus_id>` will by default be locally cached at `~/.convokit/downloads/<corpus_id>`; we use the `filename` parameter with `download` because `download` returns the full path to the corpus on disk.

On the other hand, your own local corpora generally live on disk in your `data_dir`: the directory specified in the configuration file at `~/.convokit/config.yml`, which is `~/.convokit/saved-corpora` by default. Using the `corpus_id` parameter for initialization and for `dump` will read from or write to `<data_dir>/<corpus_id>`. You can also pass `data_dir` as an argument to `dump`, `download`, or a `Corpus` initialization to override the global default.

Therefore, we use these two different ways to initialize a corpus in order to keep the original, unaltered version of the corpus at `~/.convokit/downloads/reddit-corpus-small`, while storing the version we are working with and modifying at `~/.convokit/saved-corpora/reddit-corpus-small`.
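The path rules described above can be summarized in a small standalone sketch. It does not call ConvoKit; the helper names `download_cache_path` and `working_copy_path` are hypothetical and only mirror the defaults stated in the text, so treat it as an illustration of the convention rather than the library's implementation.

```python
from pathlib import PurePosixPath

# Defaults as described in the text above. The constant and helper names
# are hypothetical; they are not part of the ConvoKit API.
DOWNLOADS_DIR = PurePosixPath("~/.convokit/downloads")
DEFAULT_DATA_DIR = PurePosixPath("~/.convokit/saved-corpora")


def download_cache_path(corpus_id):
    """Default location of the cached, unmodified download of a corpus."""
    return DOWNLOADS_DIR / corpus_id


def working_copy_path(corpus_id, data_dir=None):
    """Location read from / written to when a corpus is addressed by corpus_id.

    An explicit data_dir overrides the default, mirroring the override
    behavior described in the text.
    """
    base = PurePosixPath(data_dir) if data_dir is not None else DEFAULT_DATA_DIR
    return base / corpus_id


print(download_cache_path("reddit-corpus-small"))
print(working_copy_path("reddit-corpus-small"))
print(working_copy_path("reddit-corpus-small", data_dir="/tmp/my-corpora"))
```

The first two lines printed correspond to the two locations used in the cell above: the cached download that stays untouched, and the working copy that receives the `dump`.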

### In ConvoKit version (x.x.x)...
In ConvoKit version (x.x.x), we introduce a new option for storing conversational data: database storage. Consider a use case where you want to collect conversational data over a long time period and keep a persistent representation of the dataset even if your data collection program unexpectedly crashes. In the memory storage paradigm, this would require regularly dumping your corpus to JSON files, which means repeated, expensive write operations. With database storage, on the other hand, all your data is automatically saved for long-term storage in the database as it is added to the corpus. Let's look at an example of constructing a corpus of Reddit comments as they are posted, using ConvoKit alongside praw, a wrapper library around the Reddit API.

In [3]:
import praw
from time import sleep

# You can follow these instructions to get a client_id and client_secret to run this code yourself
# https://www.geeksforgeeks.org/how-to-get-client_id-and-client_secret-for-python-reddit-api-registration/
# (or, just view the output of running this code from before I removed my own credentials)
reddit = praw.Reddit(client_id='<redacted>',
                     client_secret='<redacted>',
                     user_agent='jack')

Version 6.4.0 of praw is outdated. Version 7.5.0 was released Sunday November 14, 2021.


In [4]:
reddit_live = Corpus(corpus_id='reddit_live', storage_type='db')

Corpus reddit_live_v0 not found in the DB; building new corpus


In [5]:
c = 0
ids = []
last_corpus_id = None
for comment in reddit.subreddit('funny').stream.comments(skip_existing=True):
    utt = Utterance(id=comment.id,
                    text=comment.body,
                    reply_to=comment.parent_id.split('_')[1],
                    speaker=Speaker(
                        id=comment.author.name if comment.author is not None else "n/a",),
                    conversation_id=comment.submission.id,
                    timestamp=comment.created_utc)
    ids.append(utt.id) # Will use this external list of ids for a check in the next cell.
    reddit_live = reddit_live.add_utterances([utt])
    last_corpus_id = reddit_live.id
    sleep(1)
    c += 1
    if c >= 5:
        # Simulating a server crash after 5 iterations
        del reddit_live
        break 
    

No filename or corpus name specified for DB storage; using name 667873
Corpus 667873_v0 not found in the DB; building new corpus


100%|██████████| 1/1 [00:00<00:00,  2.53it/s]


No filename or corpus name specified for DB storage; using name 193424
Corpus 193424_v0 not found in the DB; building new corpus


100%|██████████| 1/1 [00:00<00:00,  5.63it/s]


running _merge_utterances
No filename or corpus name specified for DB storage; using name 42080
Corpus 42080_v0 not found in the DB; building new corpus


100%|██████████| 2/2 [00:00<00:00,  8.38it/s]


No filename or corpus name specified for DB storage; using name 213506
Corpus 213506_v0 not found in the DB; building new corpus


100%|██████████| 1/1 [00:01<00:00,  1.24s/it]


running _merge_utterances
No filename or corpus name specified for DB storage; using name 209106
Corpus 209106_v0 not found in the DB; building new corpus


100%|██████████| 3/3 [00:01<00:00,  1.97it/s]


No filename or corpus name specified for DB storage; using name 671252
Corpus 671252_v0 not found in the DB; building new corpus


100%|██████████| 1/1 [00:00<00:00,  3.38it/s]


running _merge_utterances
No filename or corpus name specified for DB storage; using name 984186
Corpus 984186_v0 not found in the DB; building new corpus


100%|██████████| 4/4 [00:00<00:00, 14.55it/s]


No filename or corpus name specified for DB storage; using name 738356
Corpus 738356_v0 not found in the DB; building new corpus


100%|██████████| 1/1 [00:00<00:00,  5.42it/s]


running _merge_utterances
No filename or corpus name specified for DB storage; using name 636432
Corpus 636432_v0 not found in the DB; building new corpus


100%|██████████| 5/5 [00:00<00:00,  9.76it/s]


Now, with no dump necessary, the data is already stored persistently in the database despite the crash. 

In [6]:
reddit_live_2 = Corpus(corpus_id=last_corpus_id, storage_type='db', in_place=True)
print('still has utts',[utt.id for utt in reddit_live_2.iter_utterances()])
for utt_id in ids:  # avoid shadowing the built-in id()
    assert reddit_live_2.has_utterance(utt_id)
    utterance = reddit_live_2.get_utterance(utt_id)
    print(f'{utterance.speaker.id}: {utterance.text}')
    print('------------------------------')

Corpus 636432_v0 not found in the DB; building new corpus
still has utts ['hqpzhhs', 'hqpzhlj', 'hqpzj9l', 'hqpzjob', 'hqpzjve']
Appropriate_Jacket_5: Nice
------------------------------
hypercube33: Take them to flavor town
------------------------------
SkyShazad: Who thinks of this SHIT
------------------------------
NoLifeGuy_4k: So where can i get a best freind to turn into a diamond but without the fleshy bits
------------------------------
CatGotNoTail: Damn, you’re a hoss. I’ve only done it twice. What’s your secret?
------------------------------
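For contrast, the memory-storage alternative mentioned earlier (regularly dumping the whole corpus to JSON) can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not ConvoKit code: the function `collect_with_checkpoints`, the fake stream, and the dict-of-utterances layout are all hypothetical stand-ins.

```python
import json
import tempfile
from pathlib import Path


def collect_with_checkpoints(stream, out_path, every=2):
    """Accumulate utterance-like dicts in memory and rewrite the whole
    JSON file after every `every` items. Returns the number of writes.

    Hypothetical stand-in for "dump the corpus regularly"; not ConvoKit API.
    """
    utterances = {}
    writes = 0
    for item in stream:
        utterances[item["id"]] = item
        if len(utterances) % every == 0:
            # Each checkpoint rewrites everything collected so far,
            # so the cost of a single write grows with the corpus.
            out_path.write_text(json.dumps(utterances))
            writes += 1
    return writes


# Five fake comments; the "crash" happens right after the fifth arrives.
fake_stream = [{"id": f"u{i}", "text": f"comment {i}"} for i in range(5)]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "utterances.json"
    n_writes = collect_with_checkpoints(fake_stream, path, every=2)
    saved = json.loads(path.read_text())

print(n_writes, sorted(saved))
```

In this sketch the fifth item never reaches disk because it arrived after the last checkpoint. Checkpointing more often narrows that window but multiplies the full-file rewrites, which is the trade-off that database storage is meant to avoid.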
