# Mapping Semantic Similarity with Atlas 

Atlas is a platform for interacting with both small and internet-scale unstructured datasets. Developed by Nomic, which describes itself as the world's first information cartography company, it lets researchers cluster and map texts by semantic similarity. This notebook prepares texts for ingestion into Atlas, then runs the pipeline to serve a map of text embeddings at https://atlas.nomic.ai. To use this pipeline, first create an account at https://home.nomic.ai/.
 
Atlas Documentation: https://docs.nomic.ai/

Example Code: https://github.com/nomic-ai/nomic/blob/main/examples/map_text.py

Example Output: https://atlas.nomic.ai/map/wiki500k 

## Setup

In [None]:
!pip install nomic

In [None]:
#Log in to Nomic from the command line
!nomic login #[insert token; run without a token to generate one the first time]

In [None]:
from nomic import AtlasClient
from transformers import AutoTokenizer, AutoModel
import numpy as np
import torch

atlas = AtlasClient()

## Upload Texts for Mapping (CSV File Recommended)

In [None]:
import os
import pandas as pd

#Get current working directory 
path = os.getcwd()
print(path)

#Change working directory (os.chdir returns None, so its result is not assigned)
os.chdir("/PATHNAME")

#Load dataframe
df = pd.read_csv('filename.csv')

#Drop first column (unnamed)
df = df.iloc[: , 1:]

df

In [None]:
df = df.dropna()
sentences = df['Text'].tolist()
len(sentences)

In [None]:
#Load the pretrained BERT-mini model and its tokenizer from Hugging Face
model = AutoModel.from_pretrained("prajjwal1/bert-mini")
tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-mini")
model

#Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

In [None]:
#Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
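
Tokenizing and embedding every sentence in a single call can exhaust memory on larger corpora. A minimal sketch of one way around this is to split the sentence list into fixed-size batches. The helper name `iter_batches` and the batch size are illustrative assumptions, not part of the original pipeline or of any library.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Toy stand-in for the `sentences` list built earlier (hypothetical values).
example_sentences = ["a", "b", "c", "d", "e"]
example_batches = list(iter_batches(example_sentences, batch_size=2))
print(example_batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Each batch could then be tokenized, run through the model under `torch.no_grad()`, and pooled on its own, with the pooled results joined by `torch.cat(..., dim=0)`. Pooling per batch matters because `padding=True` pads to the longest sequence in each batch, so raw token embeddings from different batches may not share a shape.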

In [None]:
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask
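
To make the arithmetic of `mean_pooling` concrete, here is a plain-Python sketch of the same idea on a single toy sentence. It is not the tensor implementation above: `masked_mean` is a hypothetical helper written only for illustration, and the numbers are made up.

```python
from typing import List


def masked_mean(token_vectors: List[List[float]], mask: List[int]) -> List[float]:
    """Average the token vectors whose mask value is 1, ignoring padded positions."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_vectors, mask):
        for j in range(dim):
            totals[j] += vector[j] * keep
    # Mirror the clamp in mean_pooling: avoid dividing by zero for an all-padding row.
    count = max(float(sum(mask)), 1e-9)
    return [total / count for total in totals]


# One "sentence" with two real tokens and one padding token (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(tokens, attention_mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` contributes nothing to the result. An unmasked mean over all three rows would be pulled far away from the two real tokens, which is why the attention mask is applied before averaging.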

In [None]:
#Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

In [None]:
#Convert to np array 
sentence_embeddings = sentence_embeddings.numpy()

In [None]:
#Set metadata
df['Title'] = df['Title'].astype(str)
title = df['Title'].tolist()
title

df['Genre'] = df['Genre'].astype(str)
genre = df['Genre'].tolist()
genre

In [None]:
data = [{'title': title[i % len(title)], 'genre': genre[i % len(genre)],'id': i}
            for i in range(len(sentence_embeddings))]
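
The modulo indexing above silently wraps around if the metadata lists and the embeddings ever differ in length, which would attach the wrong title or genre to a point on the map. A hedged alternative is to check the lengths first and fail loudly. `build_records` is a hypothetical helper, and the toy values below stand in for the real `title` and `genre` lists.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[str, int]]


def build_records(titles: List[str], genres: List[str], n_embeddings: int) -> List[Record]:
    """Pair each embedding row index with its title and genre, failing on a length mismatch."""
    if not (len(titles) == len(genres) == n_embeddings):
        raise ValueError(
            f"length mismatch: {len(titles)} titles, {len(genres)} genres, "
            f"{n_embeddings} embeddings"
        )
    return [
        {"title": t, "genre": g, "id": i}
        for i, (t, g) in enumerate(zip(titles, genres))
    ]


# Hypothetical toy values standing in for the `title` and `genre` lists.
records = build_records(["Book A", "Book B"], ["fantasy", "detective"], n_embeddings=2)
print(records[1])  # {'title': 'Book B', 'genre': 'detective', 'id': 1}
```

In this notebook the call would look like `build_records(title, genre, len(sentence_embeddings))`, producing the same records as the comprehension above whenever the three lengths agree.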

In [None]:
response = atlas.map_embeddings(embeddings=sentence_embeddings,
                                is_public=True,
                                data=data,
                                colorable_fields=['genre'],
                                map_name="Modeling Science Fiction, Fantasy and Detective Fiction",
                                map_description="Map of texts from Temple University's Science Fiction Collection and Project Gutenberg.",
                               )
print(response)