## Demo Notebook
# Coiled & MongoDB for Large-Scale NLP Analysis


This notebook walks through a basic NLP workflow to illustrate how we can use Coiled and MongoDB for large-scale NLP analyses.

1. Load a toy dataset (Airbnb sample dataset from MongoDB)
2. Apply some NLP preprocessing using NLTK and spaCy
3. Create vectors for ML using Dask-ML
4. Train an XGBoost Classifier

## 1. Launch Coiled Cluster

In [1]:
import coiled

In [2]:
cluster = coiled.Cluster(
    name="dask-nlp-mongodb",
    software="rrpelgrim/dask-nlp",
    n_workers=20,
    shutdown_on_close=False,
    scheduler_options={'idle_timeout':'2 hours'}
)

Found software environment build
Created fw rule: inbound [8786-8787] [0.0.0.0/0] []
Created FW rules: coiled-dask-rrpelgr71-113585-firewall
Created fw rule: cluster [0-65535] [None] [coiled-dask-rrpelgr71-113585-firewall -> coiled-dask-rrpelgr71-113585-firewall]
Created FW rules: coiled-dask-rrpelgr71-113585-cluster-firewall
Created fw rule: cluster [0-65535] [None] [coiled-dask-rrpelgr71-113585-cluster-firewall -> coiled-dask-rrpelgr71-113585-cluster-firewall]
Created scheduler VM: coiled-dask-rrpelgr71-113585-scheduler (type: t3.medium, ip: ['3.238.138.156'])


In [3]:
from dask.distributed import Client

client = Client(cluster)
client


+-------------+---------------+---------------+---------------+
| Package     | client        | scheduler     | workers       |
+-------------+---------------+---------------+---------------+
| dask        | 2021.11.1     | 2022.02.0     | 2022.02.0     |
| distributed | 2021.11.1     | 2022.02.0     | 2022.02.0     |
| msgpack     | 1.0.3         | 1.0.2         | 1.0.2         |
| numpy       | 1.21.2        | 1.21.5        | 1.21.5        |
| pandas      | 1.3.4         | 1.4.1         | 1.4.1         |
| python      | 3.9.6.final.0 | 3.9.7.final.0 | 3.9.7.final.0 |
+-------------+---------------+---------------+---------------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6


Connection method: Cluster object
Cluster type: coiled.Cluster
Dashboard: http://3.238.138.156:8787

Dashboard: http://3.238.138.156:8787
Workers: 1
Total threads: 2
Total memory: 7.48 GiB

Comm: tls://10.4.5.130:8786
Workers: 1
Dashboard: http://10.4.5.130:8787/status
Total threads: 2
Started: Just now
Total memory: 7.48 GiB

Comm: tls://10.4.13.76:46331
Total threads: 2
Dashboard: http://10.4.13.76:38797/status
Memory: 7.48 GiB
Nanny: tls://10.4.13.76:34009
Local directory: /dask-worker-space/worker-u2blmgg2


## 2. Read Data from MongoDB

In [4]:
from dask_mongo import read_mongo
import urllib.parse

In [5]:
# Replace the username, password, and cluster address with your own connection details
host_uri = "mongodb+srv://richard:" + urllib.parse.quote("Rp@976559MO") + "@cluster0.ffttf.mongodb.net/myFirstDatabase?retryWrites=true&w=majority"
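Hard-coding credentials in a notebook makes them easy to leak. Here is a minimal sketch of one alternative, assuming the credentials live in environment variables. The helper `build_mongo_uri`, the variable names `MONGO_USER`, `MONGO_PASSWORD` and `MONGO_HOST`, and the fallback values are all hypothetical and not part of dask-mongo.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, database):
    """Assemble a mongodb+srv URI, percent-encoding the credentials."""
    # quote_plus escapes reserved characters such as '@', ':' and '/'
    # so they cannot be mistaken for URI delimiters.
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}?retryWrites=true&w=majority"
    )


# Hypothetical environment variable names; the fallbacks are placeholders only.
host_uri = build_mongo_uri(
    user=os.environ.get("MONGO_USER", "my-user"),
    password=os.environ.get("MONGO_PASSWORD", "my-p@ssword"),
    host=os.environ.get("MONGO_HOST", "cluster0.example.mongodb.net"),
    database="myFirstDatabase",
)
```

The resulting `host_uri` can be passed to `read_mongo` through `connection_kwargs` in the same way as the hard-coded string.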

In [6]:
bag = read_mongo(
    connection_kwargs={"host": host_uri},
    database="sample_airbnb",
    collection="listingsAndReviews",
    chunksize=500,
)

In [7]:
bag.take(1)

({'_id': '10006546',
  'listing_url': 'https://www.airbnb.com/rooms/10006546',
  'name': 'Ribeira Charming Duplex',
  'summary': 'Fantastic duplex apartment with three bedrooms, located in the historic area of Porto, Ribeira (Cube) - UNESCO World Heritage Site. Centenary building fully rehabilitated, without losing their original character.',
  'space': 'Privileged views of the Douro River and Ribeira square, our apartment offers the perfect conditions to discover the history and the charm of Porto. Apartment comfortable, charming, romantic and cozy in the heart of Ribeira. Within walking distance of all the most emblematic places of the city of Porto. The apartment is fully equipped to host 8 people, with cooker, oven, washing machine, dishwasher, microwave, coffee machine (Nespresso) and kettle. The apartment is located in a very typical area of the city that allows to cross with the most picturesque population of the city, welcoming, genuine and happy people that fills the streets w

This is a LOT of information.

Let's boil this down to something simple for this demo. Let's say we want to use the Description text to predict the Review Rating.

### Subset Data
Below we define a processing function that extracts only the relevant fields from each record and skips records where a field is missing. We'll then select **only listings whose property type is Apartment**, flatten the data structure, and later turn it into a Dask DataFrame.

In [8]:
def process(record):
    try:
        yield {
            "description": record["description"],
            "review_rating": int(str(record["review_scores"]["review_scores_rating"])),
            #"accommodates": record["accommodates"],
            #"bedrooms": record["bedrooms"],
            #"price": float(str(record["price"])),
            #"country": record["address"]["country"],
        }
    except KeyError:
        pass

In [9]:
# Filter only apartments
b_flattened = (
    bag.filter(lambda record: record["property_type"] == "Apartment")
    .map(process)
    .flatten()
)
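To see what the filter, map and flatten chain does without a cluster, here is a plain-Python sketch of the same logic on three made-up records. It repeats the `process` generator from above (without the commented-out fields) and uses `itertools.chain` in place of the Bag's `flatten`. The sample records are invented for illustration.

```python
from itertools import chain


def process(record):
    # Yield one flat dict per record, or nothing at all
    # if an expected key is missing.
    try:
        yield {
            "description": record["description"],
            "review_rating": int(str(record["review_scores"]["review_scores_rating"])),
        }
    except KeyError:
        pass


# Made-up records for illustration only.
records = [
    {"property_type": "Apartment", "description": "Cozy room",
     "review_scores": {"review_scores_rating": 100}},
    {"property_type": "House", "description": "Big garden",
     "review_scores": {"review_scores_rating": 90}},
    {"property_type": "Apartment", "description": "No reviews yet",
     "review_scores": {}},
]

apartments = filter(lambda r: r["property_type"] == "Apartment", records)
# Each call to process() returns a generator of zero or one dicts;
# chaining them together drops the empty ones.
flattened = list(chain.from_iterable(map(process, apartments)))
print(flattened)
```

Only the first record survives: the house is filtered out, and the apartment without a rating raises a `KeyError` inside `process` and yields nothing.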

In [10]:
b_flattened.take(3)

({'description': 'Here exists a very cozy room for rent in a shared 4-bedroom apartment. It is located one block off of the JMZ at Myrtle Broadway.  The neighborhood is diverse and appeals to a variety of people.',
  'review_rating': 100},
 {'description': "Murphy bed, optional second bedroom available. Wifi available, Hulu, Netflix, TV Eat-in kitchen. Bathroom with great shower/bath.  Washer/dryer in basement. New York City! Great neighborhood - many terrific restaurants, bakeries, bagelries. Within easy walking distance are restaurants with the cuisines from India, Thailand, Japan, China, Mexico, South America and Europe.  As well as the many small independent stores that line Broadway, there chain stores such as Urban Outfitters (clothing), Whole Foods (groceries), Sephora (cosmetics), Michaels (crafts), and Modell's (sporting goods). Equidistant to Central Park and Riverside Park which have walking/running/biking trails as well as tennis and racquet ball courts. 10-15 blocks from C

## 3. Tokenization with NLTK

Let's tokenize the Description text and remove stop words.

In [11]:
from nltk.corpus import stopwords 
from nltk.tokenize import RegexpTokenizer
from functools import partial

In [12]:
tokenizer = RegexpTokenizer(r'\w+')

In [14]:
# define processing functions
def extract_description(element):
    return element['description'].lower()

def extract_rating(element):
    return element['review_rating']

def filter_stopword(word, stopwords):
    return word not in stopwords

def filter_stopwords(tokens, stopwords):
    return list(filter(partial(filter_stopword, stopwords=stopwords), tokens))

In [15]:
# define set of stopwords
stopword_set = set(stopwords.words('english'))

In [16]:
# get cleaned, tokenized description texts
description_text = b_flattened.map(extract_description)
description_text_tokens = description_text.map(tokenizer.tokenize)
description_text_clean = description_text_tokens.map(partial(filter_stopwords, stopwords=stopword_set))

In [17]:
# verify
description_text_clean.take(1)

(['exists',
  'cozy',
  'room',
  'rent',
  'shared',
  '4',
  'bedroom',
  'apartment',
  'located',
  'one',
  'block',
  'jmz',
  'myrtle',
  'broadway',
  'neighborhood',
  'diverse',
  'appeals',
  'variety',
  'people'],)
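The two steps above can be tried locally on a single string. This is a rough sketch that swaps in the standard library: `re.findall` stands in for `RegexpTokenizer(r'\w+')`, which we understand to behave the same way for this pattern, and `TOY_STOPWORDS` is a tiny hand-picked set, not NLTK's English list.

```python
import re

# A tiny, hand-picked stopword set for illustration;
# NLTK's English list is much longer.
TOY_STOPWORDS = {"here", "a", "very", "for", "in", "it", "is"}


def tokenize(text):
    # Lowercase the text and keep runs of word characters.
    return re.findall(r"\w+", text.lower())


def remove_stopwords(tokens, stopwords):
    # Keep only the tokens that are not in the stopword set.
    return [t for t in tokens if t not in stopwords]


tokens = tokenize("Here exists a very cozy room for rent.")
clean = remove_stopwords(tokens, TOY_STOPWORDS)
print(clean)
```

Using a `set` for the stopwords keeps each membership test constant-time, which matters once the same check runs over every token of every description.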

## 4. Lemmatization with spaCy

In [18]:
import spacy

In [19]:
nlp = spacy.load("en_core_web_sm")

In [20]:
def lemmatize(text, nlp=nlp):
    doc = nlp(" ".join(text))
    lemmatized = [token.lemma_ for token in doc]
    return lemmatized

In [21]:
lemmas = description_text_clean.map(lemmatize)

In [22]:
lemmas.take(1)

(['exist',
  'cozy',
  'room',
  'rent',
  'share',
  '4',
  'bedroom',
  'apartment',
  'locate',
  'one',
  'block',
  'jmz',
  'myrtle',
  'broadway',
  'neighborhood',
  'diverse',
  'appeal',
  'variety',
  'people'],)

Great, we now have our lemmatized tokens and can turn this into an ML classification problem.

We'll start by casting our Dask Bag into a Dask DataFrame and then pass the descriptions to the Dask-ML HashingVectorizer to get our NLP features. Note that the cells below build the DataFrame from `b_flattened`, so they work with the raw description text rather than the lemmatized tokens.

## ML Classification

In [23]:
ddf = b_flattened.to_dataframe()

In [24]:
ddf

| | description | review_rating |
|---|---|---|
| npartitions=12 | | |
| | object | int64 |
| | ... | ... |
| ... | ... | ... |
| | ... | ... |
| | ... | ... |


In [25]:
ddf.head()

| | description | review_rating |
|---|---|---|
| 0 | Here exists a very cozy room for rent in a sha... | 100 |
| 1 | Murphy bed, optional second bedroom available.... | 94 |
| 2 | The Apartment has a living room, toilet, bedro... | 98 |
| 3 | Loft Suite Deluxe @ Henry Norman Hotel Located... | 88 |
| 4 | Clean, fully furnish, Spacious 1 bedroom flat ... | 100 |


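Optionally, we can write this to an S3 bucket as a Parquet dataset. The call is left commented out here; uncomment it and point it at a bucket you can write to.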

In [26]:
# ddf.to_parquet(
#     's3://coiled-datasets/airbnb-monogo/description-and-ratings.parquet',
#     engine="pyarrow",
# )

Now we're all set to turn this into an ML classification problem.

We'll create a train/test split and then vectorize the Description column.

### Create train/test split

In [27]:
from dask_ml.model_selection import train_test_split

In [28]:
X = ddf['description'].to_dask_array(lengths=True)
y = ddf['review_rating'].to_dask_array(lengths=True)

In [29]:
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.20, 
    random_state=40
)

### Vectorize

In [30]:
from dask_ml.feature_extraction.text import HashingVectorizer

HashingVectorizer has some built-in tokenization and preprocessing capabilities we could explore.

We'll just use it out-of-the-box for now.

In [31]:
vect = HashingVectorizer()
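For intuition, the hashing trick behind a hashing vectorizer can be sketched in plain Python. This is not the library's implementation: as far as we know, the scikit-learn vectorizer that Dask-ML builds on uses a signed MurmurHash3 and L2-normalizes each row by default, whereas this sketch swaps in `hashlib.md5` and keeps raw counts. The helpers `bucket` and `hash_vectorize` are hypothetical names.

```python
import hashlib
from collections import Counter

N_FEATURES = 2 ** 20  # 1048576, the array width seen in this notebook


def bucket(token, n_features=N_FEATURES):
    # Stable hash of the token, reduced to a column index.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features


def hash_vectorize(tokens, n_features=N_FEATURES):
    # Sparse row as {column index: count}. No vocabulary is stored,
    # so nothing has to be fitted or shared between workers.
    return dict(Counter(bucket(t, n_features) for t in tokens))


vec = hash_vectorize(["cozy", "room", "cozy", "apartment"])
print(vec)
```

Because the column index depends only on the token itself, every partition can be vectorized independently, which is what makes this approach a good fit for distributed data. The trade-off is that distinct tokens can collide in the same column.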

In [32]:
X_train_vect = vect.fit_transform(X_train)

In [33]:
X_train_vect

| | Array | Chunk |
|---|---|---|
| Shape | (nan, 1048576) | (nan, 1048576) |
| Count | 108 Tasks | 12 Chunks |
| Type | float64 | scipy.sparse.csr.csr_matrix |


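Vectorizing produces an array with unknown chunk sizes (the `nan` values in the shape), because Dask does not know how many rows each chunk holds until the chunks are computed. Calling `compute_chunk_sizes()` resolves them: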

In [34]:
X_train_vect.compute_chunk_sizes()

| | Array | Chunk |
|---|---|---|
| Shape | (2139, 1048576) | (211, 1048576) |
| Count | 108 Tasks | 12 Chunks |
| Type | float64 | scipy.sparse.csr.csr_matrix |


In [35]:
X_train_vect.blocks[0].compute()

<176x1048576 sparse matrix of type '<class 'numpy.float64'>'
	with 15469 stored elements in Compressed Sparse Row format>

In [36]:
X_train_vect.shape

(2139, 1048576)

Each block in `X_train_vect` is a **scipy.sparse matrix**.

We can now use these scipy.sparse blocks as input for the distributed XGBoost classifier.
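A quick back-of-the-envelope calculation shows why the sparse format matters here. The sketch below uses the numbers reported for the first block (176 rows, 1048576 columns, 15469 stored elements) and assumes 32-bit integer indices for the CSR arrays, which is an assumption about the storage layout rather than something the output states.

```python
rows, cols = 176, 1_048_576   # shape of the first block
nnz = 15_469                  # stored (non-zero) elements in that block

# A dense float64 array needs 8 bytes for every cell.
dense_bytes = rows * cols * 8

# CSR stores three arrays: the values (float64), one column index per
# value, and one row pointer per row plus one.
# Assumption: indices and pointers are 32-bit integers.
csr_bytes = nnz * 8 + nnz * 4 + (rows + 1) * 4

density = nnz / (rows * cols)

print(f"dense:   {dense_bytes / 2**30:.3f} GiB")
print(f"CSR:     {csr_bytes / 2**10:.1f} KiB")
print(f"density: {density:.2e}")
```

Under these assumptions a single dense block would take about 1.4 GiB, while the CSR block stays under 200 KiB, so converting blocks to dense arrays is best avoided.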

## 5. Train XGBoost Model

In [37]:
import xgboost as xgb
from xgboost.dask import DaskXGBClassifier

In [38]:
clf = DaskXGBClassifier()

In [39]:
%%time
clf.fit(X_train_vect, y_train)

AttributeError: divisions not found

The error above is caused by a bug in the XGBoost package. An issue has been raised, and a PR that resolves it is ready to be merged:
https://github.com/dmlc/xgboost/issues/7454

In [None]:
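# Vectorize the test set with the same (stateless) vectorizer, then predict with the fitted classifier
proba = clf.predict_proba(vect.transform(X_test))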