Getting started with Terality
=================

## Setup

Ensure you have created your Terality account by following the two easy steps [here](https://docs.terality.com/getting-terality/quick-start/setup).

## First steps

Terality exposes dataframes and other data structures with exactly the same API as pandas. There is no new framework to learn: import the package and start processing data exactly as you would in pandas!

In [1]:
import pandas as pd
import terality as te

An easy way to get started is to create a `terality.DataFrame` from a `pandas.DataFrame` with the `from_pandas` function:

In [2]:
df_pd = pd.DataFrame({"a": [1, 4, 9], "b": ["hello", "world", "!"]})
df_te = te.DataFrame.from_pandas(df_pd)

A `terality.DataFrame` is a different class from a pandas DataFrame, but it looks, feels and behaves like its pandas equivalent and exposes the same API.

In [3]:
df_te

Unnamed: 0,a,b
0,1,hello
1,4,world
2,9,!


In [4]:
df_te.info()

<class 'terality.DataFrame'>
Index: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       3 non-null      int64
 1   b       3 non-null      object
dtypes: int64(1), object(1)
memory usage: 230 bytes (run with deep=True)


In [5]:
df_te[(df_te["a"] >= 2) & (df_te["b"].str.len() > 3)]

Unnamed: 0,a,b
1,4,world


We can go back to a `pd.DataFrame` if needed by using `to_pandas()`:

In [6]:
df_pd_roundtrip = df_te.to_pandas()

In [7]:
pd.testing.assert_frame_equal(df_pd, df_pd_roundtrip)
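
`pd.testing.assert_frame_equal` returns nothing when the two frames match and raises an `AssertionError` otherwise, so a silent cell means the round trip preserved values, dtypes and index. A minimal pandas-only sketch of that behaviour follows; the round trip is simulated here with `copy()`, and no Terality call is made:

```python
import pandas as pd

df_pd = pd.DataFrame({"a": [1, 4, 9], "b": ["hello", "world", "!"]})

# Stand-in for a lossless round trip: an identical copy passes silently.
same = df_pd.copy()
pd.testing.assert_frame_equal(df_pd, same)

# A frame whose column dtype changed (int64 -> float64) is rejected,
# because dtypes are compared by default.
changed = df_pd.astype({"a": "float64"})
try:
    pd.testing.assert_frame_equal(df_pd, changed)
    dtype_mismatch_detected = False
except AssertionError:
    dtype_mismatch_detected = True

print(dtype_mismatch_detected)
```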

We ran our example on a `DataFrame`, but the same applies to `Index`, `Series` and other pandas data structures.

## Processing data at scale

The first steps above covered the very basics of Terality, but there is little value in using a distributed processing solution for a dataframe with 3 rows. In this section, we are going to show you the real power of Terality!

Terality can provide value whenever you start to run into one of these two problems:
- memory errors
- slow computations

This depends on your setup, but our users generally report that they benefit from Terality for datasets over 1GB (memory size once loaded in pandas). Whatever data size you process, you have no infrastructure to manage: everything is handled on Terality's side.
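
To see where a dataset stands relative to that 1GB mark, you can measure its in-memory size in pandas. The sketch below is only an illustration: `memory_gb` is a hypothetical helper name, and the sample frame is made up.

```python
import pandas as pd


def memory_gb(df: pd.DataFrame) -> float:
    """Return the in-memory size of df in gigabytes (10**9 bytes)."""
    # deep=True also counts the contents of object (string) columns.
    return df.memory_usage(deep=True).sum() / 1e9


# Made-up sample data, only here to make the sketch runnable.
sample = pd.DataFrame({"a": range(1000), "b": ["some text"] * 1000})
size_gb = memory_gb(sample)
print(f"{size_gb:.6f} GB")
```

Measuring a representative sample and scaling by the row count gives a rough estimate when the full dataset does not fit in memory.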

### Importing data

Let's start by importing some data. To enable you to run the tutorial yourself, we provide an open dataset to experiment on. This dataset contains all the reddit comments from May 2015, and is about 38GB once loaded in memory in pandas. That is probably too much to process on your computer, which is exactly what Terality is here for.

This dataset is stored in parquet (an efficient and modern data format) on AWS S3. In general you can import data from your own disk or cloud provider (such as AWS S3).

In [8]:
%%time
s3_folder = "s3://terality-public/datasets/reddit/full/"
comments = te.read_parquet(f"{s3_folder}comments/")

CPU times: user 220 ms, sys: 27.2 ms, total: 247 ms
Wall time: 38.7 s


### First explorations

Let's take a first look at what the data looks like.

In [9]:
comments.info(memory_usage="deep")

<class 'terality.DataFrame'>
Index: 54504400 entries, 0 to 54504399
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   id                54504400 non-null  object
 1   link_id           54504400 non-null  object
 2   parent_id         54504400 non-null  object
 3   name              54504400 non-null  object
 4   author            54504400 non-null  object
 5   subreddit_id      54504400 non-null  object
 6   created_utc       54504400 non-null  int64
 7   retrieved_on      54504400 non-null  int64
 8   score             54504400 non-null  int64
 9   ups               54504400 non-null  int64
10   downs             54504400 non-null  int64
11   gilded            54504400 non-null  int64
12   distinguished     442611 non-null  object
13   controversiality  54504400 non-null  int64
14   score_hidden      54504400 non-null  int64
15   edited            54504400 non-null  int64
16   archived          54504400 non-

Nearly 38 GB and 54M rows!

The data can be displayed exactly like a `pd.DataFrame`:

In [10]:
comments

Unnamed: 0,id,link_id,parent_id,name,author,subreddit_id,created_utc,retrieved_on,score,ups,downs,gilded,distinguished,controversiality,score_hidden,edited,archived,body
0,cqug90g,t3_34di91,t3_34di91,t1_cqug90g,rx109,t5_378oi,1430438400,1432703079,4,4,0,0,,0,0,0,0,くそ\n読みたいが買ったら負けな気がする\n図書館に出ねーかな
1,cqug90h,t3_34g8mx,t3_34g8mx,t1_cqug90h,WyaOfWade,t5_2qo4s,1430438400,1432703079,4,4,0,0,,0,0,0,0,gg this one's over. off to watch the NFL draft...
2,cqug90i,t3_34f7mc,t1_cqufim0,t1_cqug90i,Wicked_Truth,t5_2cneq,1430438400,1432703079,0,0,0,0,,0,0,0,0,Are you really implying we return to those tim...
3,cqug90j,t3_34f9rh,t1_cqug2sr,t1_cqug90j,jesse9o3,t5_2qh1i,1430438400,1432703079,3,3,0,0,,0,0,0,0,No one has a European accent either because i...
4,cqug90k,t3_34fvry,t3_34fvry,t1_cqug90k,beltfedshooter,t5_2qh1i,1430438400,1432703079,3,3,0,0,,0,0,0,0,"That the kid ""..reminds me of Kevin."" so sad..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54504395,crrbeoj,t3_37qart,t1_crow61z,t1_crrbeoj,TheMarraMan,t5_2tg3p,1433116799,1433505920,2,2,0,0,,0,0,0,0,Shame the US didn't do the same with it's rema...
54504396,crrbeok,t3_380av9,t3_380av9,t1_crrbeok,peperawr,t5_2s1me,1433116799,1433505920,3,3,0,0,,0,0,0,0,i think that the first pic looks fine
54504397,crrbeol,t3_37ylbt,t1_crrax2y,t1_crrbeol,Alcyoneous,t5_36buk,1433116799,1433505920,3,3,0,0,,0,0,0,0,"A filthy presser, hypocrite, AND genocidal man..."
54504398,crrbeom,t3_37zeyc,t3_37zeyc,t1_crrbeom,JDFNTO,t5_2rfxx,1433116799,1433505920,0,0,0,0,,0,0,0,0,"Well, i've many accounts on plat with pretty h..."


### Sorting

What are the most upvoted and downvoted comments in our dataset?

In [11]:
%%time
comments_best_scores = comments.sort_values(by="score", ascending=False)

CPU times: user 14.9 ms, sys: 0 ns, total: 14.9 ms
Wall time: 36.1 s


Around 36s to sort a 38GB dataset, not bad! We can now check the first and last rows of the sorted comments:

In [12]:
best_comment = comments_best_scores.iloc[0, :]
print(f"Comment with score {best_comment['score']}:\n{best_comment['body']}")

Comment with score 6761:
Then you got yourself a one night standoff.


In [13]:
worst_comment = comments_best_scores.iloc[-1, :]
print(f"Comment with score {worst_comment['score']}:\n{worst_comment['body']}")

Comment with score -1712:
Long time redditor and English major here!

I think /r/pics and potentially /r/all (because this is a quality submission) would appreciate if you would resubmit this post with a revised title. If it was confusing to me, it may be confusing for others.

Might I suggest, *My friend Julia, a really good server, was made an instant fan of "The Simpsons" from chancing upon one of the directors during a shift.*

I'd also suggest thinking of a different way to relate the story of how it happened. There is plentiful karma to be had by choosing the right words!

Happy redditing :)
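
The sort-then-`iloc` pattern used above can be tried on a toy frame in plain pandas. The sketch also shows `idxmax`/`idxmin` as an alternative when only the extremes are needed. The data is made up, and whether Terality runs the alternative faster than a full sort is not something this tutorial states.

```python
import pandas as pd

# Made-up toy data standing in for the comments table.
toy = pd.DataFrame(
    {
        "score": [3, -5, 12, 0],
        "body": ["ok", "bad take", "great joke", "meh"],
    }
)

# Same pattern as in the tutorial: sort once, then read both ends.
ranked = toy.sort_values(by="score", ascending=False)
best, worst = ranked.iloc[0], ranked.iloc[-1]

# Alternative: look up the rows holding the max and min score directly.
best_alt = toy.loc[toy["score"].idxmax()]
worst_alt = toy.loc[toy["score"].idxmin()]

print(best["body"], "|", worst["body"])
```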


### Merging tables

Right now we have only one table related to comments. Let's add more information by adding some data related to the user who posted the comment and the subreddit where it was posted.
- First we'll add users:

In [14]:
users = te.read_parquet(f"{s3_folder}users.parquet")
users.info()

<class 'terality.DataFrame'>
Index: 2611446 entries, 0 to 2611445
Data columns (total 3 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   author                  2611446 non-null  object
 1   author_flair_css_class  528362 non-null  object
 2   author_flair_text       525701 non-null  object
dtypes: object(3)
memory usage: 350.3 MB (run with deep=True)


In [15]:
%%time
comments = comments.merge(users, on=["author"], how="left")

CPU times: user 32.7 ms, sys: 4.47 ms, total: 37.2 ms
Wall time: 1min 31s


- and then subreddits:

In [16]:
subreddits = te.read_parquet(f"{s3_folder}subreddits.parquet")
subreddits.info()

<class 'terality.DataFrame'>
Index: 50138 entries, 0 to 50137
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   subreddit_id  50138 non-null  object
 1   subreddit     50138 non-null  object
dtypes: object(2)
memory usage: 6.7 MB (run with deep=True)


In [17]:
%%time
comments = comments.merge(subreddits, on=["subreddit_id"], how="left")

CPU times: user 49.1 ms, sys: 636 µs, total: 49.7 ms
Wall time: 2min 2s
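
Both merges use `how="left"`, which keeps every comment and fills the added columns with missing values when no match exists. A small pandas sketch on made-up data illustrates this. The `validate` argument is a pandas option that checks the right-hand keys are unique; it is assumed here to be relevant only to plain pandas, since the tutorial does not use it with Terality.

```python
import pandas as pd

# Made-up toy tables mirroring the comments/users relationship.
comments_toy = pd.DataFrame(
    {"author": ["ann", "bob", "ann", "zed"], "score": [4, 0, 3, 1]}
)
users_toy = pd.DataFrame(
    {"author": ["ann", "bob"], "author_flair_text": ["mod", None]}
)

# Left merge: one output row per comment, in the original order.
merged = comments_toy.merge(
    users_toy, on=["author"], how="left", validate="many_to_one"
)

print(len(merged))
print(merged["author_flair_text"].isna().sum())
```

Because each author appears at most once in the users table, the row count of the comments table is unchanged by the merge.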


In [18]:
comments.memory_usage(deep=True).sum()

48459045898

After the merges, our dataset now takes about 48 GB in memory (roughly 45 GiB).
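
The byte count returned by `memory_usage(deep=True).sum()` can be converted to more readable units. A quick arithmetic check, using the value printed above:

```python
total_bytes = 48_459_045_898  # value printed by memory_usage(deep=True).sum()

size_gb = total_bytes / 10**9   # decimal gigabytes
size_gib = total_bytes / 2**30  # binary gibibytes

print(f"{size_gb:.1f} GB, {size_gib:.1f} GiB")  # prints "48.5 GB, 45.1 GiB"
```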

Now that we have the full context of a comment, let's find out which subreddits are the most popular:

In [19]:
comments.value_counts(subset=["subreddit"])

subreddit
AskReddit               4234969
leagueoflegends         1223184
nba                      756195
funny                    745916
pics                     630924
                         ...   
1000aday                      1
0rbitalis                     1
0XhiVXJV1CXbP5lwA0Qc          1
01s                           1
00s                           1
Length: 50138, dtype: int64
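
`value_counts` counts the rows for each distinct value and sorts the counts in descending order. The pandas sketch below, on made-up data, shows the call next to an equivalent `groupby(...).size()` formulation:

```python
import pandas as pd

# Made-up toy data: one row per comment.
toy = pd.DataFrame(
    {"subreddit": ["nba", "pics", "nba", "funny", "nba", "pics"]}
)

# Same call as in the tutorial.
counts = toy.value_counts(subset=["subreddit"])

# Equivalent formulation: group, count rows, sort descending.
by_group = toy.groupby("subreddit").size().sort_values(ascending=False)

print(counts.iloc[0], by_group.index[0])
```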

### Groupby aggregations

Interested in some aggregated data? Which users collected the most upvotes and the most downvotes during this period?

In [20]:
comments.groupby("author")["ups"].sum().sort_values(ascending=False)

author
[deleted]          6497689
Donald_Keyman       333499
AutoModerator       278641
diamondpatch        179002
dick-nipples        147574
                    ...   
Kontiki1947          -4066
TheMacMan            -5058
MadrunBadrun         -5436
ItWillBeMine         -6276
tha_meme_master     -11238
Name: ups, Length: 2611446, dtype: int64
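
The same group-then-aggregate pattern can be explored on a toy frame in plain pandas. The sketch uses made-up data and additionally shows `agg` with several functions, which is standard pandas; the tutorial itself only uses `sum`.

```python
import pandas as pd

# Made-up toy data: one row per comment.
toy = pd.DataFrame(
    {
        "author": ["ann", "bob", "ann", "cat", "bob"],
        "ups": [10, -3, 5, 2, -4],
    }
)

# Total ups per author, highest first (same call as in the tutorial).
totals = toy.groupby("author")["ups"].sum().sort_values(ascending=False)

# Several aggregations at once: total ups and number of comments.
summary = toy.groupby("author")["ups"].agg(["sum", "count"])

print(totals.index[0], totals.iloc[-1])
```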

### Indexing

Being able to index into a huge dataset like this is also very convenient. We're going to find the comments of a given user:

In [21]:
comments = comments.set_index("author")

Note that Terality only builds the index the first time you perform an indexing operation, so as not to waste resources. This is why the following cell might take some time the first time you run it. Subsequent indexing operations only perform a quick look-up.

In [22]:
comments.loc["tha_meme_master", ["subreddit", "score", "body"]]

author,subreddit,score,body
tha_meme_master,todayilearned,-48,Were you trying to make a joke? Your comment i...
tha_meme_master,mildlyinteresting,-14,Can we please stop it with these stupid pun th...
tha_meme_master,videos,-24,Can we please stop it with these stupid pun th...
tha_meme_master,mildlyinteresting,-32,"Huh? Were you trying to make a joke, because i..."
tha_meme_master,pics,-56,"XD *le maymay alert* lel!1!1\n\nSeriously, I h..."
...,...,...,...
tha_meme_master,atheism,-19,I'm sorry but I really think you could have ma...
tha_meme_master,pics,-23,(yes! I *fucking* love lyric threads! all aboa...
tha_meme_master,todayilearned,-184,ha ha lmao @ manlets ITT thinking 6'4 is tall;...
tha_meme_master,Unexpected,-219,ha ha ha! nice *le maymay* m'gentlesir! i love...


Well, that was definitely not the result of a single bad comment!
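
Since an author can post many comments, the index built with `set_index("author")` is not unique, and `.loc` with a single label returns every matching row. A pandas sketch on made-up data shows this. Passing the label inside a list is standard pandas and always yields a DataFrame, even for a single match; the tutorial does not say whether this detail matters in Terality.

```python
import pandas as pd

# Made-up toy data, indexed by a non-unique "author" column.
toy = pd.DataFrame(
    {
        "author": ["ann", "bob", "ann", "cat"],
        "subreddit": ["pics", "nba", "funny", "pics"],
        "score": [4, -2, 7, 1],
    }
).set_index("author")

# One label, several matching rows: the result is a DataFrame.
ann_rows = toy.loc["ann", ["subreddit", "score"]]

# A list of labels always gives a DataFrame, even with one match.
cat_rows = toy.loc[["cat"], ["subreddit", "score"]]

print(len(ann_rows), len(cat_rows))
```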

### Exporting data

Once you're done processing your data, you probably want to get it back. 

One option is the `to_pandas()` function we saw earlier, but you might not want to download a dataframe of over 40GB and try to fit it in memory. Only use `to_pandas()` if you want to continue working on the dataframe in pandas AND the dataframe is small enough to fit in memory!

The usual way is to export your data as a parquet file (or another format), either on disk or on your own cloud provider. In this tutorial you probably have no interest in downloading the dataset, so you don't need to run this last cell:

In [23]:
# export_path = "..."
# comments.to_parquet(export_path)
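
To try the export step locally without touching the large dataset, the sketch below writes a small made-up frame to a temporary file and reads it back in plain pandas. CSV is used here only because it needs no extra dependency; the cell above uses `to_parquet`, which in pandas requires a parquet engine such as pyarrow.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up aggregated result, small enough to export anywhere.
result = pd.DataFrame({"subreddit": ["nba", "pics"], "n_comments": [3, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    export_path = Path(tmp_dir) / "result.csv"
    # index=False avoids writing the row index as an extra column.
    result.to_csv(export_path, index=False)
    reloaded = pd.read_csv(export_path)

print(reloaded)
```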

## Conclusion

You're now ready to use Terality.

- you can keep on exploring the dataset, running other functions on it if you wish
- you can start working on your own data
- you can also check out the rest of the documentation for more details about Terality. The [User Guide](https://docs.terality.com/getting-terality/user-guide) is a great place to get started