In [10]:
import os
import requests
from nucliadb_sdk.knowledgebox import KnowledgeBox
from nucliadb_sdk.labels import Label
from nucliadb_sdk.utils import create_knowledge_box, get_or_create
from sentence_transformers import SentenceTransformer


## Setup

Make sure the **NucliaDB container** is running:

``` 
docker run -it \
       -e LOG=INFO \
       -p 8080:8080 \
       -p 8060:8060 \
       -p 8040:8040 \
       -v nucliadb-standalone:/data \
       nuclia/nucliadb:latest
```
Then, we'll check the connection:

In [4]:
response = requests.get("http://localhost:8080")
assert response.ok
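
A container that has just been started may need a few seconds before it answers, in which case a single request can fail even though nothing is wrong. The sketch below retries the check a few times. `is_up` and `wait_until_ready` are hypothetical helpers written for this notebook, not part of any SDK, and the sketch uses the standard library's `urllib` so that it has no extra dependencies.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=2.0):
    """Return True if `url` answers with a 2xx/3xx status, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False


def wait_until_ready(check, attempts=10, delay=1.0, sleep=time.sleep):
    """Call `check()` up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

With the container from the previous step, `wait_until_ready(lambda: is_up("http://localhost:8080"))` should return `True` once the service is reachable.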

## Setup - creating a KB

In NucliaDB, data containers are called knowledge boxes (KBs).

To start working, we need to create one.

*We create it with the function `get_or_create`, so it is not created again if it already exists.*

In [12]:
my_kb = get_or_create("my_reddit_data_kb")

## Setup - preparing data & model

We load the go_emotions dataset, take a random sample of 10,000 rows, and load the sentence embedding model we are going to use.

In [13]:
from datasets import load_dataset
dataset = load_dataset("go_emotions", "raw")

sample = dataset["train"].shuffle().select(range(10000))

Found cached dataset go_emotions (/Users/ciniesta/.cache/huggingface/datasets/go_emotions/raw/0.0.0/2637cfdd4e64d30249c3ed2150fa2b9d279766bfcd6a809b9f085c61a90d776d)


In [14]:
encoder = SentenceTransformer("all-MiniLM-L6-v2")

## Uploading data to our KB

We use the `upload` function to index the text, the label and the computed vector for each sentence of our dataset.

Tips:
- We can store more than one set of vectors in our data: just add another entry to the vectors dict, e.g. `vectors={"roberta-vectors": vectors_roberta, "bert-vectors": vectors_bert}`
- To avoid uploading the same data twice by mistake, add a `key` to your upload. It is a unique identifier: uploading again with the same key updates the resource instead of duplicating it. Example: `key="my_reddit_sample"`
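
As a sketch of the `key` tip, one option is to derive the key from the row content, so that re-running the upload cell targets the same resources. `make_key` and `upload_once` are hypothetical helpers, and hashing the subreddit plus the text is an assumption about what should count as "the same data". The `kb` argument only needs an `upload` method that accepts the `key`, `text`, `labels` and `vectors` arguments used in this notebook.

```python
import hashlib


def make_key(subreddit, text):
    """Build a stable identifier from the row content (hypothetical helper)."""
    return hashlib.sha1(f"{subreddit}\n{text}".encode("utf-8")).hexdigest()


def upload_once(kb, rows, encode):
    """Upload each distinct (subreddit, text) pair once; return how many were sent."""
    seen = set()
    for row in rows:
        key = make_key(row["subreddit"], row["text"])
        if key in seen:
            # Same content already uploaded in this run
            continue
        seen.add(key)
        kb.upload(
            key=key,
            text=row["text"],
            labels=[f"reddit/{row['subreddit']}"],
            vectors={"all-MiniLM-L6-v2": encode(row["text"])},
        )
    return len(seen)
```

With the objects defined in this notebook, a call could look like `upload_once(my_kb, sample, lambda t: encoder.encode([t])[0].tolist())`.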

In [15]:
for row in sample:
    label = row["subreddit"]
    my_kb.upload(
        text=row["text"],
        labels=[f"reddit/{label}"],
        vectors={"all-MiniLM-L6-v2": encoder.encode([row["text"]])[0].tolist()},
    )


Vectorset is not created, we will create it for you
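
The loop above calls the encoder once per row with a one-element list. Since `encode` accepts a list of sentences, a possible variation is to encode a chunk of texts per call and then upload the rows one by one. This is a sketch under assumptions: `batches` and `upload_in_batches` are hypothetical helpers, `rows` must be a plain list of dicts, and `encode_many` must return one plain Python list of floats per input text.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upload_in_batches(kb, rows, encode_many, size=64):
    """Encode up to `size` texts per model call, then upload the rows one by one."""
    uploaded = 0
    for chunk in batches(rows, size):
        # One encoder call for the whole chunk
        vectors = encode_many([row["text"] for row in chunk])
        for row, vector in zip(chunk, vectors):
            kb.upload(
                text=row["text"],
                labels=[f"reddit/{row['subreddit']}"],
                vectors={"all-MiniLM-L6-v2": vector},
            )
            uploaded += 1
    return uploaded
```

With the objects from this notebook, a call could look like `upload_in_batches(my_kb, list(sample), lambda texts: encoder.encode(texts).tolist())`. Whether this is noticeably faster depends on the hardware and the batch size.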


## Checks

We uploaded each resource with a single label: its subreddit.
We could have added more, for example from the emotion annotations that the dataset also provides.

Let's check if the numbers agree!

In [20]:
my_labels = my_kb.get_uploaded_labels()
print("Labelsets info : ")
print(my_labels)
print("Labelset: ", ", ".join(my_labels.keys()))
print("Labels:", ", ".join(my_labels["reddit"].labels.keys()))
print("Tagged resources:", my_labels["reddit"].count)

# Fetch the vectorsets once and reuse the result
my_vectorsets = my_kb.list_vectorset()
print("-----------------")
print("Vectorsets info : ")
print(my_vectorsets)
print("Vectorset: ", ", ".join(my_vectorsets.vectorsets.keys()))
print("Dimension:", my_vectorsets.vectorsets["all-MiniLM-L6-v2"].dimension)

Labelsets info : 
{'reddit': LabelSet(count=10000, labels={'socialanxiety': 48, 'danganronpa': 47, 'loveafterlockup': 47, '90DayFiance': 43, 'Blackops4': 43, 'Marriage': 43, 'devils': 43, 'AnimalsBeingBros': 42, 'TheSimpsons': 42, 'gaybros': 42, 'videos': 42, 'wholesomememes': 42, '4PanelCringe': 41, 'OkCupid': 41, 'ProtectAndServe': 41, 'rant': 40, 'tifu': 40, 'traaaaaaannnnnnnnnns': 40, 'TeenMomOGandTeenMom2': 39, 'saltierthancrait': 39, 'timberwolves': 39, 'Dodgers': 38, 'DunderMifflin': 38, 'NYGiants': 38, 'NYYankees': 38, 'relationship_advice': 38, 'teenagers': 38, 'Advice': 37, 'AnimalsBeingJerks': 37, 'Artifact': 37, 'Mavericks': 37, 'WeWantPlates': 37, 'deadbydaylight': 37, 'morbidquestions': 37, 'youseeingthisshit': 37, 'AskMen': 36, 'ChoosingBeggars': 36, 'MaliciousCompliance': 36, 'Scotland': 36, 'SubredditSimulator': 36, 'TheWalkingDeadGame': 36, 'askwomenadvice': 36, 'cringe': 36, 'minnesotavikings': 36, 'nonononoyes': 36, 'rpdrcringe': 36, 'sadcringe': 36, 'ComedyCemetery
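
To check that the numbers agree without reading the printout by eye, the per-label counts can be recomputed from the sample and compared with what the KB reports. This is a minimal sketch: `expected_label_counts` and `count_mismatches` are hypothetical helpers, and the comparison assumes the KB contains only the rows uploaded in this notebook.

```python
from collections import Counter


def expected_label_counts(rows, field="subreddit"):
    """Count how many rows carry each value of `field`."""
    return Counter(row[field] for row in rows)


def count_mismatches(expected, reported):
    """Return {label: (expected, reported)} for every label whose counts differ."""
    labels = set(expected) | set(reported)
    return {
        label: (expected.get(label, 0), reported.get(label, 0))
        for label in labels
        if expected.get(label, 0) != reported.get(label, 0)
    }
```

With the objects from this notebook, `count_mismatches(expected_label_counts(sample), my_labels["reddit"].labels)` would be empty if every row was indexed with its label.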

## Filter by label

Let's explore the results from one of the subreddits:

In [21]:
results = my_kb.search(
    filter=[Label(labelset="reddit", label="socialanxiety")]
)
for result in results:
    print(result.text)
    print(result.labels)

hello there, i sent some hugs your way. ^.^
['socialanxiety']
Write down the problems and how you feel when and after you have them. Then take that to therapy and actually get vulnerable with your therapist.
['socialanxiety']
Hell yeah, high five. Good job!
['socialanxiety']
Yeah i miss lectures for this exact reason. Its pretty bad, but what can ya do :/
['socialanxiety']
Hell yeah, high five. Good job!
['socialanxiety']
Reading this made me pretty damn happy. Congrats hope it works out for you two.
['socialanxiety']
Hey, I love you! 😁
['socialanxiety']
Yeah i miss lectures for this exact reason. Its pretty bad, but what can ya do :/
['socialanxiety']
Congrats. Having a steady job (and one that I actually enjoy somewhat) is the #1 thing that helped me out with depression/anxiety.
['socialanxiety']
It developed because of my childhood and robbed me of a good bit of good stuff as an adult.
['socialanxiety']
Thank you :) I'll try to stay positive
['socialanxiety']
Sameee, if I got a doll

## Text search

Now let's try full-text search:


In [22]:
results = my_kb.search(text="developer")
for result in results:
    print(result.text)
    print(result.labels)
    print(result.score)
    print(result.key)
    print(result.score_type)
    print("------")

It's just bizarre you would criticize [NAME] on his development capabilities when he's not a developer... His specialties are centered around crypto philosophy
['CryptoCurrency']
5.891018390655518
8cf5c77655264ef081e840043bb5ec41
ScoreType.BM25
------
It's just bizarre you would criticize [NAME] on his development capabilities when he's not a developer... His specialties are centered around crypto philosophy
['CryptoCurrency']
5.891018390655518
0517466b9e4c4ec8afabd2e539c235f7
ScoreType.BM25
------
I'm kinda in this same boat as OP and currently learning to code to become a web developer. Wondering if this field is compatible with ASD .
['aspergers']
5.637315273284912
4279a693878645c8aeb778d74d48b7fc
ScoreType.BM25
------
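
The output above lists the same comment twice under two different keys, which suggests the sample contains repeated texts that were uploaded as separate resources. If only distinct texts are of interest, the results can be filtered on the client side. `unique_by_text` is a hypothetical helper that only relies on each result having a `text` attribute, as used in the cells above.

```python
def unique_by_text(results):
    """Yield results in their original order, skipping texts already seen."""
    seen = set()
    for result in results:
        if result.text in seen:
            continue
        seen.add(result.text)
        yield result
```

For example, `for result in unique_by_text(my_kb.search(text="developer")): print(result.text)` would print each matching comment once, keeping the first occurrence in the order returned by the search.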


In [23]:
# Run several keyword queries and print the matching text and labels
for query in ["tech", "technology", "code"]:
    results = my_kb.search(text=query)
    for result in results:
        print(result.text)
        print(result.labels)


Damn. Low tech bait bike
['KidsAreFuckingStupid']
Key that we keep welcoming ~~young~~ international tech talent to London.
['london']
Sitting at home drunk waiting for my tech to charge so I can sit out back and listen to music lol
['drunk']
Technology is amazing. Now we can SEE how new plagues start. Just incredible.
['AnimalsBeingBros']
Maybe because those are real, tangible things and not lines of code
['SubredditDrama']
What do you mean? It would be the same code for every system?
['instant_regret']
What do you mean? It would be the same code for every system?
['instant_regret']
How does ones SO even go through your ph ? Mine is locked with a pass code and a finger print
['adultery']
Because a lot of people on this thread are ignoring the pirate code and I’m trying to explain that rules are there for a reason.
['Seaofthieves']
Because a lot of people on this thread are ignoring the pirate code and I’m trying to explain that rules are there for a reason.
['Seaofthieves']
I'm kinda 

## Vector search

Now we perform semantic search.
To do so, we convert our query to a vector with the same model we used for the uploaded data and pass it to the `search` function.
We need to use the `vector` field together with `vectorset`, which names the set of vectors to search, and we can use `min_score` if we want to define a minimum cosine similarity value for our results.
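
To make the role of `min_score` concrete, here is a toy illustration of cosine similarity and of filtering by a threshold, in plain Python. It is not how NucliaDB computes or filters scores internally; `cosine_similarity` and `above_min_score` are hypothetical helpers, and treating the threshold as inclusive is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def above_min_score(query, candidates, min_score):
    """Return (score, name) pairs with score >= min_score, best match first."""
    scored = [(cosine_similarity(query, vector), name) for name, vector in candidates.items()]
    return sorted((pair for pair in scored if pair[0] >= min_score), reverse=True)
```

Vectors pointing in the same direction score close to 1, unrelated (orthogonal) vectors score close to 0, so a higher `min_score` keeps only results that are closer in meaning to the query.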

In [24]:
query_vectors = encoder.encode(["Tech, developers, programming and coding"])[0].tolist()
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.15)
   
for result in results:
    print(result.text)
    print(result.labels)
    print(result.score)
    print(result.key)
    print(result.score_type)
    print("------")


Yeah, if companies really wanted to ameliorate the lack of programmers they could hire and train more juniors, but that doesn't benefit them much.
['TrueReddit']
0.42079004645347595
6609af856e8449f6b8f03962fe9ae66d
ScoreType.COSINE
------
You are terrible developers.
['TownofSalemgame']
0.39345312118530273
db673044c06e49fe9e764c7cc403d03b
ScoreType.COSINE
------
You are terrible developers.
['TownofSalemgame']
0.39345312118530273
ed9fa918d76a497e9b69f31809eedd26
ScoreType.COSINE
------
Where do you work/what do you do?
['rant']
0.3706532120704651
a28255cc49154a9084fc99f4d8bcdc99
ScoreType.COSINE
------
I mean, who doesnt? But youre right, techgore is not the place for it.
['lostredditors']
0.36273515224456787
a314e87cbbcc4d9db4b82d661ccb7140
ScoreType.COSINE
------
I work at an R1 university, with most of my team composed of CS/CE undergrad students. Yes, I've known many...
['OutOfTheLoop']
0.35480910539627075
c0fb23c7ff554fa59e7cf74cd702cc1c
ScoreType.COSINE
------
Maybe because those

In [25]:
query_vectors = encoder.encode(["What is happiness"])[0].tolist()
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.15)
   
for result in results:
    print(result.text)
    print(result.labels)
    print(result.score)
    print(result.key)
    print(result.score_type)
    print("------")

Money can't buy me happiness...
['2meirl4meirl']
0.6215714812278748
9e070c12f4b34d5a876ac39971325668
ScoreType.COSINE
------
Me, for example. I am very very happy.
['FrankOcean']
0.5764293670654297
5c3e0a1369fa4b9695d472d2f711b97c
ScoreType.COSINE
------
Emotions are temporary. That's why we have multiple different ones. If you were permanently happy, you'd never feel sadness or envy or anger again. 
['Drugs']
0.5472765564918518
5f67a5d2ecf84140ba22c36baf32fdaf
ScoreType.COSINE
------
Apathetic. Not happy, not sad, just don't care
['Jokes']
0.5173780918121338
ce587828dfad443182c59841ae0ee6e9
ScoreType.COSINE
------
I am happy for you. No, seriously, I am.
['drunk']
0.4953000843524933
dd87d424c0064e278b9eadaa6f65d1c1
ScoreType.COSINE
------
I hope that you find happiness, too. I'm happy that you've had a place to vent. 
['depression']
0.48470982909202576
b6f1fe0a6e974840a6fe104378db2d62
ScoreType.COSINE
------
Be happy because I won't be short anymore one day
['danganronpa']
0.444295018

In [26]:
query_vectors = encoder.encode(["The meaning of life"])[0].tolist()
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.15)
   
for result in results:
    print(result.text)
    print(result.labels)
    print(result.score)
    print(result.key)
    print(result.score_type)
    print("------")

There is no *inherent* meaning, but this does not necessarily render it meaningless. Only you can decide whether your life has meaning or not.
['DebateAnAtheist']
0.6319499015808105
eec5ea268fa8472d8fd7c858e87606b0
ScoreType.COSINE
------
Is this the real life?
['4PanelCringe']
0.4298684000968933
5cc64354c59845b8a5259363fa7c41b7
ScoreType.COSINE
------
Such a sad way to think of life all because these people believe there's something greater than this.
['exmuslim']
0.41576701402664185
48a482ac93904e149cb24e6d5b6e8cf3
ScoreType.COSINE
------
A life well spent....
['videos']
0.396190881729126
7c024dbe0da74392a54550ddb105d7ee
ScoreType.COSINE
------
My life!
['conspiracy']
0.390045166015625
26b32d31a45a45d9899c7ab080cf6ad3
ScoreType.COSINE
------
I'm going to put my life first and run away.
['rant']
0.35358867049217224
a3b4ac8c082846b8938fdc808369eabb
ScoreType.COSINE
------
Ah the high life. Must be sweet.
['ProtectAndServe']
0.346894234418869
4ee21622d92f4788a2feaf17707c405e
ScoreType.C

In [27]:
query_vectors = encoder.encode(["What is love?"])[0].tolist()
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.15)
   
for result in results:
    print(result.text)
    print(result.labels)
    print(result.score)
    print(result.key)
    print(result.score_type)
    print("------")

I would say love is never enough. Love is unconditional. Relationships are not.
['polyamory']
0.5383067727088928
32851600b3924d5e808ea7d6fab71c36
ScoreType.COSINE
------
Love has no number
['teenagers']
0.4977535605430603
b1d809a8dc004b7588f1bc932174925b
ScoreType.COSINE
------
Love has no number
['teenagers']
0.4977535605430603
d0c05c5f565049b7a216407b63389db5
ScoreType.COSINE
------
[NAME] loves (the taste of) people
['AnimalsBeingBros']
0.4653770923614502
1ff3e1bffe284275965c946aece7d429
ScoreType.COSINE
------
I love you. I had no idea before now, but I love you.
['TIHI']
0.453585147857666
9ceb5619df1043ccbaced517b714dcb3
ScoreType.COSINE
------
And also something against [NAME]. (My definition of [NAME] is just people at the bottom of love distribution)
['IncelsWithoutHate']
0.44246917963027954
b8f79e9b1bff4e538b649cfcf5e74e6f
ScoreType.COSINE
------
Don't judge our love
['COMPLETEANARCHY']
0.431450754404068
ee5ba51c526b4e368cdcb134385b4cc7
ScoreType.COSINE
------
Or a lovegod Tha