This notebook demonstrates how you can quickly use this library (`cluestar`) to visualise clusters in a collection of texts.

## Data

Let's start by importing a dataset. We're going to toy around with a subset of a [customer service dataset from Twitter](https://www.kaggle.com/thoughtvector/customer-support-on-twitter), keeping only the tweets that don't contain a link.

In [1]:
import pandas as pd

df = pd.read_csv("data/tesco_support.csv").loc[lambda d: ~d['text'].str.contains("https")]
texts = list(df['text'].sample(2000))
texts[:3]

['@tesco rumours online are that the priority Xmas delivery go online soon but you need an email to book. I only signed up yesterday will I still have access to these and how do I Book?',
 "@Tesco is doing an advent calendar for cats this year?! That's top of my list of things to buy for Christmas 😂",
 '@Tesco do i need receipt to return item of clothing?']
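
The sample above is drawn without a seed, so every run of the notebook picks different tweets. If you want the same points in every plot, a minimal sketch along these lines might help. The rows below are made-up stand-ins for `data/tesco_support.csv`, and the seed value is an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-in rows for data/tesco_support.csv.
df = pd.DataFrame({"text": [
    "@Tesco do i need receipt to return item of clothing?",
    "@Tesco see https://example.com/offer",
    "@Tesco my delivery is late",
]})

# Same filter as in the notebook: drop every tweet that contains a link.
no_links = df.loc[lambda d: ~d["text"].str.contains("https", regex=False)]

# Passing random_state makes the sample reproducible across runs.
texts = list(no_links["text"].sample(2, random_state=42))
```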

Suppose we want to predict labels for this data. What might some good labels be?

## Arrays 

Let's try an embed-reduce-then-visualise technique for this.

In [2]:
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD())

X = pipe.fit_transform(texts)
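
`TruncatedSVD` reduces to two components by default, which is why `X` can go straight into a scatter plot. As a rough sanity check you could inspect the shape of `X` and how much variance the two components keep. The sketch below uses a handful of made-up tweets instead of the real data, and fixes `random_state` only to make it repeatable.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up example texts, standing in for the sampled tweets.
texts = [
    "when will my delivery arrive",
    "my delivery slot was cancelled",
    "the voucher code does not work",
    "how do i use my clubcard voucher",
    "too much plastic on the vegetables",
    "please reduce plastic packaging",
]

pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
X = pipe.fit_transform(texts)

# make_pipeline names each step after its lowercased class name.
tfidf = pipe.named_steps["tfidfvectorizer"]
svd = pipe.named_steps["truncatedsvd"]

print("shape of X:", X.shape)
print("vocabulary size:", len(tfidf.vocabulary_))
print("variance kept by two components:", svd.explained_variance_ratio_.sum())
```

A low share of retained variance is a hint that two linear components throw away a lot of the TF-IDF signal.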

## Visualise

Given that we now have a numeric representation `X`, we can visualise and explore. 

Hint: you can click + drag in the embedding space!

In [3]:
from cluestar import plot_text

plot_text(X, texts)

In [4]:
import pathlib 

schema = plot_text(X, texts).to_json()
pathlib.Path("docs/plot_one.json").write_text(schema)

788802
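
The number printed above is the return value of `write_text`, which reports how many characters were written. Note that `write_text` fails if the target folder does not exist yet. The sketch below shows a slightly more defensive export. It uses a small hand-written JSON string as a stand-in for the output of `.to_json()` and a temporary folder instead of `docs/`, so both are assumptions made for illustration.

```python
import json
import pathlib
import tempfile

# Stand-in for `plot_text(X, texts).to_json()`; a real chart spec is much larger.
schema = json.dumps({"mark": "point", "data": {"values": [{"x": 0.1, "y": 0.2}]}})

with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / "docs" / "plot_example.json"
    # Create the folder first, otherwise write_text raises FileNotFoundError.
    out.parent.mkdir(parents=True, exist_ok=True)
    # write_text returns the number of characters written.
    n_chars = out.write_text(schema)
    # Read the file back to confirm that it contains valid JSON.
    loaded = json.loads(out.read_text())

print(n_chars, "characters written")
```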

## Improve 

You may already be able to pick out some clusters, but we can often improve the view by using UMAP to reduce the embeddings instead. UMAP typically keeps local clusters more intact. More background can be found in [Understanding UMAP](https://pair-code.github.io/understanding-umap/).

In [5]:
from umap import UMAP

pipe = make_pipeline(TfidfVectorizer(), UMAP())

X = pipe.fit_transform(texts)

plot_text(X, texts)


In [6]:
import pathlib 

schema = plot_text(X, texts).to_json()
pathlib.Path("docs/plot_two.json").write_text(schema)

780787

You can add color to the plot by highlighting the texts that contain specific words, which can make certain clusters stand out.

In [7]:
plot_text(X, texts, color_words=["plastic", "voucher", "deliver"])
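
To get a feel for what word-based coloring does, here is a minimal sketch of the idea in plain Python. It is not how `cluestar` is implemented. The helper `first_matching_word`, the case-insensitive substring match, and the "first word wins" rule are all assumptions made for illustration, and the texts are made up.

```python
from collections import Counter

def first_matching_word(text, color_words):
    """Return the first word from color_words found in text, else 'none'."""
    lowered = text.lower()
    for word in color_words:
        # Substring match, so "deliver" also matches "delivery" and "delivered".
        if word.lower() in lowered:
            return word
    return "none"

# Made-up example texts, standing in for the sampled tweets.
texts = [
    "@Tesco when will my delivery arrive?",
    "@Tesco my voucher code is not working",
    "@Tesco why so much plastic on the cucumbers",
    "@Tesco do i need receipt to return item of clothing?",
    "@Tesco the voucher for free DELIVERY expired",
]
color_words = ["plastic", "voucher", "deliver"]

labels = [first_matching_word(t, color_words) for t in texts]
counts = Counter(labels)
print(counts)
```

Counting the labels this way also gives a quick estimate of how many texts each candidate word would cover before you look at the plot.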

In [8]:
import pathlib 

schema = plot_text(X, texts, color_words=["plastic", "voucher", "deliver"]).to_json()
pathlib.Path("docs/plot_three.json").write_text(schema)

832092

## More Fancy 

You can try to get extra fancy by introducing language models. The [whatlies](https://github.com/koaning/whatlies/) library makes it easy to produce a scikit-learn pipeline for that.

In [9]:
from whatlies.language import UniversalSentenceLanguage

pipe = make_pipeline(
    UniversalSentenceLanguage(), 
    UMAP()
)

X = pipe.fit_transform(texts)

plot_text(X, texts, color_words=["plastic", "voucher", "deliver"])

The goal here is exploration. By using these embedding tricks we're able to confirm that many folks complain about deliveries and vouchers. These could be interesting labels to try and predict, so a next step might be to attach labels to this dataset via something like [prodigy](https://prodi.gy/).

In [10]:
import pathlib 

schema = plot_text(X, texts, color_words=["plastic", "voucher", "deliver"]).to_json()
pathlib.Path("docs/plot_four.json").write_text(schema)

834426

## Version Numbers

This notebook was made with the following tool versions:

In [11]:
%load_ext watermark
%watermark --machine --python --packages numpy,pandas,scikit-learn,whatlies,tensorflow,umap

Python implementation: CPython
Python version       : 3.8.6
IPython version      : 8.1.1

numpy       : 1.19.5
pandas      : 1.4.1
scikit-learn: 1.0.2
whatlies    : 0.6.5
tensorflow  : 2.4.4
umap        : 0.5.2

Compiler    : GCC 10.2.0
OS          : Linux
Release     : 5.11.0-7614-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 12
Architecture: 64bit
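
If the `watermark` extension is not installed, a similar package report can be produced with the standard library alone. This is a minimal sketch, not a replacement for the output above. The helper name `installed_versions` is made up, and the names passed in are PyPI distribution names, which can differ from import names (UMAP is distributed as `umap-learn`).

```python
from importlib import metadata

def installed_versions(distributions):
    """Map each distribution name to its installed version, if any."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of raising.
            versions[name] = "not installed"
    return versions

report = installed_versions(
    ["numpy", "pandas", "scikit-learn", "whatlies", "tensorflow", "umap-learn"]
)
for name, version in report.items():
    print(f"{name:<12}: {version}")
```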

