# Data Wrangling

**Problem**: How to process 36 million tweets (20GB) on a 16GB RAM machine?

**Solution**: Use [Dask](https://www.dask.org) to distribute processing across multiple cores in parallel.

If you're not familiar with Dask, check out this [Introduction to Dask](https://coiled.io/blog/what-is-dask/).

## What is Dask? 

Dask is a flexible library for parallel computing in Python that follows the syntax of the PyData ecosystem. If you are familiar with NumPy, pandas, and scikit-learn, think of Dask as their cousin that scales to multiple cores and to data larger than memory. For example:

```python
import pandas as pd                    # pandas: one file, evaluated eagerly
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

import dask.dataframe as dd            # Dask: many files, evaluated lazily
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()
```

## A Dask Cluster

<img src="https://raw.githubusercontent.com/coiled/pydata-global-dask/master/images/dask-cluster.svg"
     width="75%"
     alt="Dask cluster">


## Dask DataFrames

For the most part, a Dask DataFrame feels like a Pandas DataFrame. However, internally a Dask DataFrame is composed of many Pandas DataFrames (see the image below). 

<img src="http://dask.pydata.org/en/latest/_images/dask-dataframe.svg" width="30%">

Dask DataFrames are split along their index into **partitions**, where each partition is a normal Pandas DataFrame. These Pandas objects may live on disk or on other machines.

For many purposes Dask DataFrames can serve as drop-in replacements for Pandas DataFrames. Much like the Dask Delayed interface, Dask DataFrames are lazily evaluated. You can use the DataFrame API to automatically build up a task graph representing complex computations and then call `compute()` to evaluate the graph in parallel.

## When to use Dask DataFrames

Pandas is great for tabular datasets that fit in memory. If your data fits in memory, you should use Pandas. **Dask becomes useful when the dataset you want to analyze is larger than your machine's RAM**, which is where Pandas would normally raise a `MemoryError`:

```python
    MemoryError:  ...
```
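The sizing argument can be written out as a few lines of arithmetic. This is a rough sketch: it assumes "GB" means 10^9 bytes, the helper names are hypothetical, and the 64 MiB block size is only an illustrative chunk size. Because a CSV usually takes more space once parsed into a DataFrame, the real gap is typically larger than the raw byte counts suggest.

```python
import math

GB = 10**9   # assumption: decimal gigabytes
MiB = 2**20

dataset_bytes = 20 * GB   # size of the tweet CSVs from the problem statement
ram_bytes = 16 * GB       # RAM of the machine from the problem statement


def fits_in_memory(dataset: int, ram: int) -> bool:
    """Rough check: can the raw bytes be held in RAM all at once?"""
    return dataset <= ram


def min_partitions(dataset: int, blocksize: int) -> int:
    """Lower bound on the number of chunks for a given block size."""
    return math.ceil(dataset / blocksize)


too_big = not fits_in_memory(dataset_bytes, ram_bytes)
chunks = min_partitions(dataset_bytes, 64 * MiB)
print(too_big)   # True
print(chunks)    # 299
```

The chunk count is a lower bound because files are split individually, so a dataset spread over many files can end up with more partitions.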

## 1. Spin up Remote Dask Cluster

In [7]:
import coiled

In [2]:
cluster = coiled.Cluster(
    name="dask-for-nlp",
    software="dask-nlp",
    n_workers=10,
    worker_cpu=4,
    worker_memory="24GiB",
    scheduler_options={'idle_timeout': '3 hours'}
)


In [3]:
# connect Dask to remote cluster
from distributed import Client

In [4]:
client = Client(cluster)
client


+---------+----------------+---------------+---------------+
| Package | client         | scheduler     | workers       |
+---------+----------------+---------------+---------------+
| msgpack | 1.0.3          | 1.0.2         | 1.0.2         |
| python  | 3.9.10.final.0 | 3.9.7.final.0 | 3.9.7.final.0 |
+---------+----------------+---------------+---------------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6


**Client**

| Field | Value |
|---|---|
| Connection method | Cluster object |
| Cluster type | coiled.Cluster |
| Dashboard | http://44.198.174.76:8787 |

**Cluster info**

| Field | Value |
|---|---|
| Dashboard | http://44.198.174.76:8787 |
| Workers | 10 |
| Total threads | 80 |
| Total memory | 305.74 GiB |

**Scheduler**

| Field | Value |
|---|---|
| Comm | tls://10.4.0.177:8786 |
| Dashboard | http://10.4.0.177:8787/status |
| Workers | 10 |
| Total threads | 80 |
| Started | 15 minutes ago |
| Total memory | 305.74 GiB |

**Workers**

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tls://10.4.12.71:33437 | http://10.4.12.71:46289/status | tls://10.4.12.71:38563 | 8 | 30.57 GiB | /dask-worker-space/worker-jtxb_zde |
| tls://10.4.14.61:42929 | http://10.4.14.61:34197/status | tls://10.4.14.61:41313 | 8 | 30.57 GiB | /dask-worker-space/worker-swzp0th0 |
| tls://10.4.12.131:33859 | http://10.4.12.131:35015/status | tls://10.4.12.131:33743 | 8 | 30.57 GiB | /dask-worker-space/worker-nfl6yi85 |
| tls://10.4.2.21:43041 | http://10.4.2.21:45853/status | tls://10.4.2.21:45541 | 8 | 30.57 GiB | /dask-worker-space/worker-0w3e5_gt |
| tls://10.4.2.55:34687 | http://10.4.2.55:36007/status | tls://10.4.2.55:37367 | 8 | 30.57 GiB | /dask-worker-space/worker-h60wodmp |
| tls://10.4.2.1:41137 | http://10.4.2.1:46771/status | tls://10.4.2.1:38317 | 8 | 30.57 GiB | /dask-worker-space/worker-m9foksa5 |
| tls://10.4.10.52:36629 | http://10.4.10.52:43741/status | tls://10.4.10.52:34543 | 8 | 30.57 GiB | /dask-worker-space/worker-yzhl9z7h |
| tls://10.4.6.230:34213 | http://10.4.6.230:38611/status | tls://10.4.6.230:45571 | 8 | 30.57 GiB | /dask-worker-space/worker-611m_6a2 |
| tls://10.4.7.100:33495 | http://10.4.7.100:35333/status | tls://10.4.7.100:36137 | 8 | 30.57 GiB | /dask-worker-space/worker-s1kqdubi |
| tls://10.4.0.249:33401 | http://10.4.0.249:46645/status | tls://10.4.0.249:45723 | 8 | 30.57 GiB | /dask-worker-space/worker-thfws1ze |
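For readers without a Coiled account, the same `Client` workflow can be tried on a single machine. The sketch below is not part of the original notebook: it swaps the remote cluster for a `LocalCluster` from `dask.distributed`, with a hypothetical two-worker configuration, and assumes the `distributed` package is installed.

```python
from dask.distributed import Client, LocalCluster

# In-process local cluster with 2 single-threaded workers and no dashboard.
# The worker counts are illustrative; a real run would use more resources.
with LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,
) as cluster:
    with Client(cluster) as client:
        # Ask the scheduler which workers it knows about.
        n_workers = len(client.scheduler_info()["workers"])

        # Submit a trivial task to confirm the cluster executes work.
        total = client.submit(sum, [1, 2, 3]).result()

print(n_workers, total)
```

Code written against `Client` does not change when the local cluster is later replaced by a remote one; only the cluster object passed to `Client` differs.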


## 2. Load Data

In [17]:
client.restart()

**Client**

| Field | Value |
|---|---|
| Connection method | Cluster object |
| Cluster type | coiled.Cluster |
| Dashboard | http://44.198.174.76:8787 |

**Cluster info**

| Field | Value |
|---|---|
| Dashboard | http://44.198.174.76:8787 |
| Workers | 10 |
| Total threads | 80 |
| Total memory | 305.74 GiB |

**Scheduler**

| Field | Value |
|---|---|
| Comm | tls://10.4.0.177:8786 |
| Dashboard | http://10.4.0.177:8787/status |
| Workers | 10 |
| Total threads | 80 |
| Started | 2 hours ago |
| Total memory | 305.74 GiB |

**Workers**

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tls://10.4.12.71:45131 | http://10.4.12.71:37557/status | tls://10.4.12.71:38563 | 8 | 30.57 GiB | /dask-worker-space/worker-lvcp68m6 |
| tls://10.4.14.61:45307 | http://10.4.14.61:34171/status | tls://10.4.14.61:41313 | 8 | 30.57 GiB | /dask-worker-space/worker-aqs8jvkt |
| tls://10.4.12.131:33995 | http://10.4.12.131:46031/status | tls://10.4.12.131:33743 | 8 | 30.57 GiB | /dask-worker-space/worker-sajwmvu3 |
| tls://10.4.2.21:42075 | http://10.4.2.21:35993/status | tls://10.4.2.21:45541 | 8 | 30.57 GiB | /dask-worker-space/worker-7_ig6gvq |
| tls://10.4.2.55:36597 | http://10.4.2.55:41641/status | tls://10.4.2.55:37367 | 8 | 30.57 GiB | /dask-worker-space/worker-thfw58rp |
| tls://10.4.2.1:37621 | http://10.4.2.1:41589/status | tls://10.4.2.1:38317 | 8 | 30.57 GiB | /dask-worker-space/worker-jims5shs |
| tls://10.4.10.52:39365 | http://10.4.10.52:38239/status | tls://10.4.10.52:34543 | 8 | 30.57 GiB | /dask-worker-space/worker-_hrc_dsl |
| tls://10.4.6.230:32851 | http://10.4.6.230:34383/status | tls://10.4.6.230:45571 | 8 | 30.57 GiB | /dask-worker-space/worker-umss1kwy |
| tls://10.4.7.100:38463 | http://10.4.7.100:45179/status | tls://10.4.7.100:36137 | 8 | 30.57 GiB | /dask-worker-space/worker-lh8tvzvd |
| tls://10.4.0.249:39827 | http://10.4.0.249:42327/status | tls://10.4.0.249:45723 | 8 | 30.57 GiB | /dask-worker-space/worker-5hqmb9xp |


In [18]:
import dask.dataframe as dd

In [19]:
import os

# read S3 data into a Dask DataFrame
# AWS credentials are read from environment variables instead of being hard-coded
ddf = dd.read_csv(
    "s3://twitter-saudi-us-east-2/sa_eg_ae_022020_tweets_csv_hashed_*.csv",
    #blocksize="64MiB",
    usecols=[
        'tweetid',
        'userid',
        'user_screen_name',
        'follower_count',
        'following_count',
        'tweet_language',
        'tweet_text',
        'tweet_time',
        'tweet_client_name',
        'is_retweet',
        'retweet_userid',
        'retweet_tweetid'],
    engine='python',
    on_bad_lines='warn',
    na_values='None',
    # load every column as a string first; convert types after inspection
    dtype={
        "tweetid": "object",
        "userid": "object",
        "user_screen_name": "object",
        "follower_count": "object",
        "following_count": "object",
        "tweet_language": "object",
        "tweet_text": "object",
        "tweet_time": "object",
        "tweet_client_name": "object",
        "is_retweet": "object",
        "retweet_userid": "object",
        "retweet_tweetid": "object"
    },
    storage_options={
        'key': os.environ['AWS_ACCESS_KEY_ID'],
        'secret': os.environ['AWS_SECRET_ACCESS_KEY'],
    },
)

In [13]:
ddf

| | tweetid | userid | user_screen_name | follower_count | following_count | tweet_language | tweet_text | tweet_time | tweet_client_name | is_retweet | retweet_userid | retweet_tweetid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=379 | | | | | | | | | | | | |
| | object | object | object | object | object | object | object | object | object | object | object | object |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


In [20]:
ddf.head()

| | tweetid | userid | user_screen_name | follower_count | following_count | tweet_language | tweet_text | tweet_time | tweet_client_name | is_retweet | retweet_userid | retweet_tweetid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1185949426633924609 | 2MQmlE42efLqBwlUzJlSkfVEByrmi6Q3bVbj3Hlt4= | 2MQmlE42efLqBwlUzJlSkfVEByrmi6Q3bVbj3Hlt4= | 1422 | 1616 | und | RT @daltawater: #كشف_تسربات_المياه\n#كشف_تسربا... | 2019-10-20 16:02 | Twitter for iPhone | True | | 1185853959296245760 |
| 1 | 1196674108450385920 | 993642585892818944 | rahil_76 | 12576 | 12682 | ar | RT @aljoory120j: #فايز_المالكي\n📍لنشر ودعم حسا... | 2019-11-19 06:18 | Twitter for Android | True | | 1196324752190824448 |
| 2 | 1186489444565733376 | 2MQmlE42efLqBwlUzJlSkfVEByrmi6Q3bVbj3Hlt4= | 2MQmlE42efLqBwlUzJlSkfVEByrmi6Q3bVbj3Hlt4= | 1422 | 1616 | ar | RT @Kw_787: .\nٰ\nٰ\nوإني #أحبك على مرأى العال... | 2019-10-22 03:48 | Twitter for Android | True | | 1184617361250406402 |
| 3 | 1187301099352530944 | 2MQmlE42efLqBwlUzJlSkfVEByrmi6Q3bVbj3Hlt4= | 2MQmlE42efLqBwlUzJlSkfVEByrmi6Q3bVbj3Hlt4= | 1422 | 1616 | ar | RT @makkha245: متميزون في تفصيل الاثاث المنزلي... | 2019-10-24 09:33 | Twitter for Android | True | | 1187130162275598337 |
| 4 | 1192322251292651520 | 993642585892818944 | rahil_76 | 12576 | 12682 | ar | RT @2whood: ساعة باتيك فيليب رجالي c\n\nالتوصي... | 2019-11-07 06:05 | Twitter for Android | True | | 1192185729151098882 |


In [21]:
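# persist() returns a new collection; reassign it so later steps reuse the in-memory data
ddf = ddf.persist()
ddf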

| | tweetid | userid | user_screen_name | follower_count | following_count | tweet_language | tweet_text | tweet_time | tweet_client_name | is_retweet | retweet_userid | retweet_tweetid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=379 | | | | | | | | | | | | |
| | object | object | object | object | object | object | object | object | object | object | object | object |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
