<img src="https://raw.githubusercontent.com/Spratiher9/Files/master/DAT_Primary_Lock_up_Black.svg" alt="Databricks" width="500" height="600">
# __+__
<img src="https://raw.githubusercontent.com/Spratiher9/Files/master/terality.svg" alt="Terality" width="300" height="300">

## Welcome to Terality on Databricks
This quickstart guide shows how to get _Terality_ up and running on _Databricks_.

__Note:__ _This quickstart notebook is inspired by Terality's [live demo notebook](https://api.terality2.com/v1/notebooks). It is not an official notebook written by Terality._

For any queries related to this notebook, email the [author](mailto:souvik.pratiher@databricks.com).

## Setup

* Create a Terality account [here](https://accounts.terality.com/sign-up). After registering, you can get the required _API key_ from the Terality dashboard.
* Next, create the __*init script*__ shown below and add it to the cluster.
* The init script installs the library, so *there is no need to install Terality separately*, either from the cluster's Libraries tab or in the notebook.
* Add two environment variables in the cluster:
  1. __*TERALITY_EMAIL*__, which __*contains the email address registered with Terality*__
  2. __*TERALITY_KEY*__, which __*contains the Terality API key*__
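
As a minimal sketch, assuming the cluster environment variables are exposed to the notebook's Python process through `os.environ`, the check below reports which of the two variables are missing. The helper name `missing_env_vars` is hypothetical and is not part of Terality or Databricks.

```python
import os

REQUIRED_VARS = ["TERALITY_EMAIL", "TERALITY_KEY"]


def missing_env_vars(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing cluster environment variables: {', '.join(missing)}")
else:
    print("Both Terality environment variables are set.")
```

Running this in a notebook cell after the cluster restarts is a cheap way to catch a typo in a variable name before the init script silently configures Terality with empty values.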

Creating the init script

In [0]:
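# Create the scripts folder on DBFS, then write the init script into it.
# The final True argument of dbutils.fs.put overwrites any existing file.
dbutils.fs.mkdirs("dbfs:/databricks/scripts/")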
dbutils.fs.put("/databricks/scripts/terality-configure.sh","""
#!/bin/sh
source /etc/environment
pip install --upgrade --force-reinstall terality
terality account configure --email $TERALITY_EMAIL --overwrite --api-key $TERALITY_KEY
""", True)

Checking that the init script was created successfully

In [0]:
display(dbutils.fs.ls("/databricks/scripts/terality-configure.sh"))

| path | name | size |
| --- | --- | --- |
| dbfs:/databricks/scripts/terality-configure.sh | terality-configure.sh | 171 |
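
The `size` value is a byte count, so it can serve as a quick sanity check that the script was written in full. The sketch below recomputes it locally from the same script text. It uses only plain Python and does not touch DBFS.

```python
# Same text that was passed to dbutils.fs.put above.
init_script = """
#!/bin/sh
source /etc/environment
pip install --upgrade --force-reinstall terality
terality account configure --email $TERALITY_EMAIL --overwrite --api-key $TERALITY_KEY
"""

# Size in bytes of the UTF-8 encoded script.
expected_size = len(init_script.encode("utf-8"))
print(expected_size)  # 171, matching the size reported by dbutils.fs.ls
```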


## First steps

Terality exposes dataframes and other data structures with exactly the same API as pandas. No need to learn a new framework, just import the package and start processing data exactly as in pandas!

In [0]:
import pandas as pd
import terality as te

An easy way to get started and create a `terality.DataFrame` is by importing a `pandas.DataFrame` using the function `from_pandas`:

In [0]:
df_pd = pd.DataFrame({"a": [1, 4, 9], "b": ["hello", "world", "!"]})
df_te = te.DataFrame.from_pandas(df_pd)

A `terality.DataFrame` is a different class from a `pandas.DataFrame`, but it looks, feels and behaves like its pandas equivalent and exposes the same API.

In [0]:
df_te

| | a | b |
| --- | --- | --- |
| 0 | 1 | hello |
| 1 | 4 | world |
| 2 | 9 | ! |


In [0]:
df_te.info()

In [0]:
df_te[(df_te["a"] >= 2) & (df_te["b"].str.len() > 3)]

| | a | b |
| --- | --- | --- |
| 1 | 4 | world |


We can go back to a `pd.DataFrame` if needed by using `to_pandas()`:

In [0]:
df_pd_roundtrip = df_te.to_pandas()

In [0]:
pd.testing.assert_frame_equal(df_pd, df_pd_roundtrip)
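
`pd.testing.assert_frame_equal` returns nothing when the two frames match and raises an `AssertionError` otherwise, which is why the cell above prints no output on success. Here is a small pandas-only sketch of both outcomes on toy data:

```python
import pandas as pd

df_a = pd.DataFrame({"a": [1, 4, 9], "b": ["hello", "world", "!"]})
df_b = df_a.copy()

# Identical frames: the call passes silently.
pd.testing.assert_frame_equal(df_a, df_b)

# One differing value: the call raises AssertionError.
df_c = df_a.assign(a=[1, 4, 10])
try:
    pd.testing.assert_frame_equal(df_a, df_c)
    frames_match = True
except AssertionError:
    frames_match = False

print(frames_match)  # False
```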

We ran our example on a `DataFrame`, but the same applies to `Index`, `Series` and other pandas data structures.

## Processing data at scale

The first steps covered the basics of Terality, but there is obviously little value in using a distributed processing solution for a dataframe with 3 rows. In this section, we are going to look at a real use case of __Terality__ with __Databricks__.

Terality can provide value whenever we start to run into one of these two problems:
- memory errors
- slow computations

This will of course depend on the setting, but users will generally benefit from Terality for datasets over 1GB (memory size once loaded in pandas). Whatever the size of the data being processed, users won't have to manage any infrastructure: everything is handled on Terality's side.
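
To estimate where a dataset sits relative to that 1GB mark, pandas can report the in-memory size of a frame. Below is a rough sketch in which the helper `memory_gb` and the toy frame are illustrative only:

```python
import pandas as pd


def memory_gb(df: pd.DataFrame) -> float:
    """In-memory size of `df` in gigabytes (10**9 bytes), string contents included."""
    return float(df.memory_usage(deep=True).sum()) / 1e9


toy = pd.DataFrame({"a": range(1000), "b": ["hello"] * 1000})
size_gb = memory_gb(toy)
print(f"{size_gb:.6f} GB in memory; above the 1GB mark: {size_gb > 1}")
```

Passing `deep=True` matters for text-heavy data such as comments, because without it pandas does not count the contents of object (string) columns.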

### Importing data

Let's start by importing some data. For this quickstart, we will use an open dataset provided by Terality. It contains a subset of all the Reddit comments from May 2015 and is about 5GB once loaded in memory in pandas.

This dataset is stored in Parquet (an efficient and modern data format) on AWS S3. In general, we can import data from our own disk or from a cloud provider (such as AWS S3).

In [0]:
%%time
s3_folder = "s3://terality-public/datasets/reddit/medium/"
comments = te.read_parquet(f"{s3_folder}comments/")

### First explorations

Let's take a first look at what the data looks like.

In [0]:
comments.info(memory_usage="deep")

5.3 GB and 7.6M rows.

The data can be displayed exactly like a `pd.DataFrame`:

In [0]:
comments

| | id | link_id | parent_id | name | author | subreddit_id | created_utc | retrieved_on | score | ups | downs | gilded | distinguished | controversiality | score_hidden | edited | archived | body |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | cqug90g | t3_34di91 | t3_34di91 | t1_cqug90g | rx109 | t5_378oi | 1430438400 | 1432703079 | 4 | 4 | 0 | 0 | | 0 | 0 | 0 | 0 | くそ\n読みたいが買ったら負けな気がする\n図書館に出ねーかな |
| 1 | cqug90h | t3_34g8mx | t3_34g8mx | t1_cqug90h | WyaOfWade | t5_2qo4s | 1430438400 | 1432703079 | 4 | 4 | 0 | 0 | | 0 | 0 | 0 | 0 | gg this one's over. off to watch the NFL draft... |
| 2 | cqug90i | t3_34f7mc | t1_cqufim0 | t1_cqug90i | Wicked_Truth | t5_2cneq | 1430438400 | 1432703079 | 0 | 0 | 0 | 0 | | 0 | 0 | 0 | 0 | Are you really implying we return to those tim... |
| 3 | cqug90j | t3_34f9rh | t1_cqug2sr | t1_cqug90j | jesse9o3 | t5_2qh1i | 1430438400 | 1432703079 | 3 | 3 | 0 | 0 | | 0 | 0 | 0 | 0 | No one has a European accent either because i... |
| 4 | cqug90k | t3_34fvry | t3_34fvry | t1_cqug90k | beltfedshooter | t5_2qh1i | 1430438400 | 1432703079 | 3 | 3 | 0 | 0 | | 0 | 0 | 0 | 0 | That the kid "..reminds me of Kevin." so sad... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7630611 | cqz1wyi | t3_34xrwi | t1_cqz1th8 | t1_cqz1wyi | stephfly | t5_2vpf3 | 1430833306 | 1432782898 | 1 | 1 | 0 | 0 | | 0 | 0 | 0 | 0 | open |
| 7630612 | cqz1wyj | t3_34xuj5 | t3_34xuj5 | t1_cqz1wyj | DTHKSTK | t5_36v9d | 1430833306 | 1432782898 | -7 | -7 | 0 | 0 | | 0 | 0 | 0 | 0 | http://livedoor.blogimg.jp/kwi9v709ul/imgs/7/c... |
| 7630613 | cqz1wyk | t3_34xt3o | t3_34xt3o | t1_cqz1wyk | BigBlueWookiee | t5_2s48x | 1430833306 | 1432782898 | 1 | 1 | 0 | 0 | | 0 | 0 | 0 | 0 | Light Assault, VS in particular, is all about ... |
| 7630614 | cqz1wyl | t3_34xok9 | t1_cqz0nee | t1_cqz1wyl | snow-sakura | t5_36v9d | 1430833306 | 1432782898 | 2 | 2 | 0 | 0 | | 0 | 0 | 0 | 0 | 一応、Google Wireless Transcoder (ガラケー用に表示を簡易化・変換... |


### Sorting

What are the most upvoted and downvoted comments in our dataset?

In [0]:
%%time
comments_best_scores = comments.sort_values(by="score", ascending=False)

We can now check the third row (to keep this quickstart SFW) and the last row of the sorted comments:

In [0]:
best_comment = comments_best_scores.iloc[2, :]
print(f"Comment with score {best_comment['score']}:\n{best_comment['body']}")

In [0]:
worst_comment = comments_best_scores.iloc[-1, :]
print(f"Comment with score {worst_comment['score']}:\n{worst_comment['body']}")
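
The two lookups above rely on `iloc` being positional: after sorting, position 2 is the third-highest score and position -1 is the lowest, whatever the original index labels were. Here is a small sketch of the same pattern in plain pandas, on made-up scores:

```python
import pandas as pd

toy = pd.DataFrame({"score": [5, -3, 10, 7, 0], "body": ["a", "b", "c", "d", "e"]})
ranked = toy.sort_values(by="score", ascending=False)

third_best = ranked.iloc[2, :]  # third row by position after sorting
worst = ranked.iloc[-1, :]      # last row by position, i.e. the lowest score

print(third_best["score"], worst["score"])  # 5 -3
print(ranked.index.tolist())                # [2, 3, 0, 4, 1]: original labels are kept
```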

### Merging tables

Right now we have only one table related to comments. Let's add more information by adding some data related to the user who posted the comment and the subreddit where it was posted.
- First we'll add users:

In [0]:
users = te.read_parquet(f"{s3_folder}users.parquet")
users.info()

In [0]:
%%time
comments = comments.merge(users, on=["author"], how="left")

- and then subreddits:

In [0]:
subreddits = te.read_parquet(f"{s3_folder}subreddits.parquet")
subreddits.info()

In [0]:
%%time
comments = comments.merge(subreddits, on=["subreddit_id"], how="left")
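
With `how="left"`, every comment row is kept and rows without a match on the right side get missing values in the new columns. The sketch below shows this in plain pandas on toy frames. The `karma` column is made up, and `validate="many_to_one"` is a pandas option that we have not verified against Terality.

```python
import pandas as pd

comments_toy = pd.DataFrame({"author": ["ann", "bob", "ann", "cy"], "score": [3, -1, 8, 2]})
users_toy = pd.DataFrame({"author": ["ann", "bob"], "karma": [120, 45]})

# validate="many_to_one" makes pandas raise a MergeError if `author`
# is duplicated in users_toy, which would otherwise multiply comment rows.
merged = comments_toy.merge(users_toy, on=["author"], how="left", validate="many_to_one")

print(len(merged) == len(comments_toy))  # True: no rows added or dropped
print(merged["karma"].isna().sum())      # 1: "cy" has no matching user
```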

In [0]:
comments.memory_usage(deep=True).sum()

After the merges, our dataset is now 6.7GB.

Now that we have the full context of each comment, let's find out which subreddits are the most popular:

In [0]:
comments.value_counts(subset=["subreddit"])
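
`value_counts` counts how many rows share each distinct value of the selected columns and returns the counts sorted from most to least frequent. Here is a toy illustration in plain pandas, with made-up subreddit names:

```python
import pandas as pd

toy = pd.DataFrame({"subreddit": ["funny", "pics", "funny", "aww", "funny", "pics"]})

# One count per distinct subreddit, most frequent first.
counts = toy.value_counts(subset=["subreddit"])

print(counts.tolist())  # [3, 2, 1]
```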

### Groupby aggregations

Interested in some aggregated data? Who were the users with the most upvotes and downvotes during this period?

In [0]:
comments.groupby("author")["ups"].sum().sort_values(ascending=False)
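
The chain above groups rows by author, sums `ups` within each group, and sorts the totals so that the top of the result holds the highest totals and the bottom the lowest. The same chain on a toy frame, in plain pandas with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({"author": ["ann", "bob", "ann", "cy", "bob"], "ups": [3, -1, 8, 2, -4]})
totals = toy.groupby("author")["ups"].sum().sort_values(ascending=False)

print(totals.index[0], totals.iloc[0])    # ann 11
print(totals.index[-1], totals.iloc[-1])  # bob -5
```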

### Indexing

Being able to look up rows by index on a huge dataset like this is also very convenient. We're going to find the comments of a specific user:

In [0]:
comments = comments.set_index("author")

Note that Terality only builds the index the first time we perform a lookup, so as not to waste resources. This is why the following cell might take some time the first time you run it. On subsequent lookups, only a quick look-up is performed.

In [0]:
comments.loc["tha_meme_master", ["subreddit", "score", "body"]]

| author | subreddit | score | body |
| --- | --- | --- | --- |
| tha_meme_master | todayilearned | -48 | Were you trying to make a joke? Your comment i... |
| tha_meme_master | mildlyinteresting | -14 | Can we please stop it with these stupid pun th... |
| tha_meme_master | videos | -24 | Can we please stop it with these stupid pun th... |
| tha_meme_master | mildlyinteresting | -32 | Huh? Were you trying to make a joke, because i... |
| tha_meme_master | pics | -56 | XD \*le maymay alert\* lel!1!1\n\nSeriously, I h... |
| ... | ... | ... | ... |
| tha_meme_master | todayilearned | -246 | I have never understood Reddit's obsession wit... |
| tha_meme_master | funny | -43 | This is not a competition, and I'd prefer \*not... |
| tha_meme_master | Jokes | 113 | Ha ha this brings up memories. I used to be ha... |
| tha_meme_master | aww | -70 | No he doesn't. I used to work in a dog shelter... |
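
One detail of label-based lookups in pandas is worth keeping in mind: the shape of the result depends on how often the label occurs in the index. The sketch below shows this in plain pandas on toy data. We assume, but have not checked, that Terality mirrors this behaviour.

```python
import pandas as pd

toy = pd.DataFrame({
    "author": ["ann", "bob", "ann", "cy"],
    "subreddit": ["pics", "aww", "funny", "pics"],
    "score": [3, -1, 8, 2],
}).set_index("author")

cols = ["subreddit", "score"]
many = toy.loc["ann", cols]           # label occurs twice: DataFrame with 2 rows
one = toy.loc["cy", cols]             # label occurs once: Series
always_frame = toy.loc[["cy"], cols]  # list of labels: always a DataFrame

print(type(many).__name__, type(one).__name__, type(always_frame).__name__)
```

Passing the label inside a list is a simple way to get a dataframe back regardless of how many comments a user has.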


Well, that was definitely not the result of a single bad comment!

### Exporting data

Once we're done processing our data, we probably want to get it back.

One option would be the `to_pandas()` function we saw earlier, but we might not want to download a 6.7GB dataframe and try to fit it in memory. Only use `to_pandas()` if you want to continue working on the dataframe in pandas AND the dataframe is small enough to fit in memory!

The usual way is to export our data as a Parquet file (or another format), either on disk or on our own cloud provider. In this tutorial we probably have no interest in downloading the dataset, so we don't need to run this last cell:

In [0]:
# export_path = "..."
# comments.to_parquet(export_path)

## Conclusion

We are now ready to use Terality on Databricks.

- We can keep on exploring the dataset, running other functions on it if we wish
- We can start working on our own data
- We can also check out the rest of the documentation for more details about Terality. The [User Guide](https://docs.terality.com/getting-terality/user-guide) is a great place to get started