**Using Atlas to Visualize a Dataset of Text**

See [docs.nomic.ai](https://docs.nomic.ai) for documentation.

In [46]:
!pip install langchain nomic sentence-transformers transformers torch > /dev/null

In [None]:
import nomic
import time
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import AtlasDB
from langchain.document_loaders import TextLoader
from nomic import atlas
nomic.login('Mug83c2mM5lD-I-XEtNFAFrTtxIqNznl8SS0Obz9tApfe')  # API key for a limited demo account; create your own account at atlas.nomic.ai

In [None]:
embedd = HuggingFaceEmbeddings()

In [None]:
splitter = RecursiveCharacterTextSplitter(chunk_size=50,
                                          chunk_overlap=20,
                                          length_function=len)
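
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window character chunker. It is a simplification written for illustration: `chunk_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself also splits on separators such as newlines and spaces, so its chunks will not match these windows exactly.

```python
def chunk_with_overlap(text, chunk_size=50, chunk_overlap=20):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; not LangChain's algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = ("Atlas is a platform for visually and programmatically "
          "interacting with massive unstructured datasets.")
chunks = chunk_with_overlap(sample)
for c in chunks:
    print(repr(c))
```

Each window repeats the last 20 characters of the previous one, so a phrase cut at a chunk boundary still appears whole in at least one chunk.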

In [None]:
AD = """How Atlas Works
Atlas is a platform for visually and programmatically interacting with massive unstructured datasets of text documents, images and embeddings.
Data model

Atlas lets you store and manipulate data like a standard noSQL document engine. On upload, your data is stored in an abstraction called a Project. You can add, update, read and delete (CRUD) data in a project via API calls from the Atlas Python client.
What kind of data can I store in Atlas?

Atlas can natively store:
    Embedding vectors
    Text Documents
Our roadmap includes first class support for data modalities such as images, audio and video. You can still store images, audio and video in Atlas now but you must generate embeddings for it yourself.
Data stored in an Atlas Project is semantically indexed by Atlas. This indexing allows you to interact, view and search through your dataset via meaning instead of matching on words.
How does Atlas semantically index data?
Atlas semantically indexes unstructured data by:
    Converting data points into embedding vectors (if they aren't embeddings already)
    Organizing the embedding vectors for fast semantic search and human interpretability
If you have embedding vectors of your data from an embedding API such as OpenAI or Cohere, you can attach them during upload.
If you don't already have embedding vectors for your data points, Atlas will create them by running your data through neural networks that semantically encode your data points. For example, if you upload text documents Atlas will run them through neural networks that semantically encode text. It is often cheaper and faster to use Atlas' internal embedding models as opposed to an external model APIs.
How is Atlas different from a noSQL database?

Unlike existing data stores, Atlas is built with embedding vectors as first class citizens. Embedding vectors are representations of data that computers can semantically manipulate. Most operations you do in Atlas, under the hood, are performed on embeddings.
Atlas makes embeddings human interpretable

Despite their utility, embeddings cannot be easily interpreted because they reside in high dimensions.

During indexing, Atlas builds a contextual two-dimensional data map of embeddings. This map preserves high-dimensional relationships present between embeddings in a two-dimensional, human interpretable view.

Reading an Atlas Map

Atlas Maps lay out your dataset contextually. We will use the above map of news articles generated by Atlas to describe how to read Maps.

An Atlas Map has the following properties:

    Points close to each other on the map are semantically similar/related. For example, all news articles about sports are at the bottom of the map. Inside the sports region, the map breaks down by type of sport because news articles about a fixed sport (e.g. baseball) have more similarity to each other than with news articles about other types of sports (e.g. tennis).
    Relative distances between points correlate with semantic relatedness but the numerical distance between 2D point positions does not have meaning. For example, the observation that the Tennis and Golf news article clusters are adjacent signify a relationships between Tennis and Golf in the embedding space. You should not, however, make claims or draw conclusions using the Euclidean distance between points in the two clusters. Distance information is only meaningful in the ambient embedding space and can be retrieved with vector_search.
    Floating labels correspond to distinct topics in your data. For example, the Golf cluster has the label 'Ryder Cup'. Labels are automatically determined from the textual contents of your data and are crucial for navigating the Map.
    Topics have a hierarchy. As you zoom around the Map, more granular versions of topics will emerge.
    Maps update as your data updates. When new data enters your project, Atlas can reindex the map to reflect how the new data relates to existing data.
All information and operations that are visually presented on an Atlas map have a programmatic analog. For example, you can access topic information and vector search through the Python client.
Technical Details
Atlas visualizes your embeddings in two-dimensions using a non-linear dimensionality reduction algorithm. Atlas' dimensionality reduction algorithm is custom-built for scale, speed and dynamic updates. Nomic cannot share the technical details of the algorithm at this time.
Data Formats and Integrity
Atlas stores and transfers data using a subset of the Apache Arrow standard.
pyarrow is used to convert python, pandas, and numpy data types to Arrow types; you can also pass any Arrow table (created by polars, duckdb, pyarrow, etc.) directly to Atlas and the types will be automatically converted.
Before being uploaded, all data is converted with the following rules:
    Strings are converted to Arrow strings and stored as UTF-8.
    Integers are converted to 32-bit integers. (In the case that you have larger integers, they are probably either IDs, in which case you should convert them to strings; or they are a field that you want perform analysis on, in which case you should convert them to floats.)
    Floats are converted to 32-bit (single-precision) floats.
    Embeddings, regardless of precision, are uploaded as 16-bit (half-precision) floats, and stored in Arrow as FixedSizeList.
    All dates and datetimes are converted to Arrow timestamps with millisecond precision and no time zone. (If you have a use case that requires timezone information or micro/nanosecond precision, please let us know.)
    Categorical types (called 'dictionary' in Arrow) are supported, but values stored as categorical must be strings.
Other data types (including booleans, binary, lists, and structs) are not supported. Values stored as a dictionary must be strings.
All fields besides embeddings and the user-specified ID field are nullable.
Permissions and Privacy
To create a Project in Atlas, you must first sign up for an account and obtain an API key.
Projects you create in Atlas have configurable permissions and privacy levels.
When you create a project, it's ownership is assigned to your Atlas team. You can add people to this team to collaborate on projects together. For example, if you want to invite somone to help you tag points on an Atlas Map, you would add them to your team and give them the appropriate editing permissions on your project.
"""

In [None]:
text_split = splitter.split_text(AD)

In [47]:
text_split[0]

'How Atlas Works'

In [53]:
text_dataset = []

for indi,text in enumerate(text_split):
  text_dataset.append({"id":indi,
                       "text":text})

In [92]:
embeddings = embedd.embed_documents(text_split)

In [93]:
import pandas

#load a demo dataset of 25k news articles
news_articles = pandas.read_csv('https://raw.githubusercontent.com/nomic-ai/maps/main/data/ag_news_25k.csv').to_dict('records')

In [94]:
news_articles[0]

{'id': 0,
 'text': 'Nasdaq planning \\$100m-share sale The owner of the Nasdaq index, an icon of the internet boom, is planning to sell \\$100m of shares to the public and list itself on the market it operates.',
 'label': 2}

In [141]:
from nomic import atlas, AtlasProject

# modality='embedding' tells Atlas that you will upload your own embeddings.
project = AtlasProject(name='atlas testing',
                       unique_id_field='id',
                       modality='embedding')


2023-07-01 16:22:18.905 | INFO     | nomic.project:_create_project:749 - Creating project `atlas testing` in organization `kamaljp`


In [142]:
project.schema

In [143]:
from nomic import atlas, AtlasProject
import numpy as np


# Add the HuggingFace embeddings computed earlier, together with their metadata, to the Atlas project
project.add_embeddings(
    embeddings=np.array(embeddings),
    data=text_dataset
)

1it [00:01,  1.61s/it]
2023-07-01 16:22:37.172 | INFO     | nomic.project:_add_data:1371 - Upload succeeded.


In [145]:
project.create_index(name=project.name,
                     build_topic_model=True,
                     topic_label_field='text')
print(project.maps[0])

2023-07-01 16:23:01.916 | INFO     | nomic.project:create_index:1081 - Created map `atlas testing` in project `atlas testing`: https://atlas.nomic.ai/map/2768e6c9-9870-4ca3-8b0f-5577d3f64a20/3273cc1d-88ba-4a12-9233-f7cea9477e78


atlas testing: https://atlas.nomic.ai/map/2768e6c9-9870-4ca3-8b0f-5577d3f64a20/3273cc1d-88ba-4a12-9233-f7cea9477e78


In [150]:
map = project.maps[0]

In [151]:
map

In [152]:
print(project.get_data(ids=[0,10]))

[{'id': '0', 'id_': 'AA', 'text': 'How Atlas Works'}, {'id': '10', 'id_': 'Cg', 'text': 'your data is stored in an abstraction called a'}]


In [103]:
map.topics.df

Unnamed: 0,id,topic_depth_1,topic_depth_2,topic_depth_3
0,0,Embeddings,"Atlas - Create, team, project,",Atlas
1,1,Embeddings,"Atlas - Create, team, project,",Atlas
2,2,Embeddings,Embeddings (2),OpenAI - Programmatically Interacting
3,3,Embeddings,Embeddings (2),OpenAI - Programmatically Interacting
4,4,Embeddings,Embeddings (2),OpenAI - Programmatically Interacting
...,...,...,...,...
205,205,Embeddings,"Atlas - Create, team, project,",Atlas
206,206,Embeddings,"Atlas - Create, team, project,",Atlas
207,207,Sports,Sports (2),Sports (3)
208,208,Computer Science,🤷‍♂️5🤷‍♀️,Account and permissions management


In [104]:
map.topics.hierarchy

{'Computer Science': ['🤷\u200d♂️5🤷\u200d♀️'],
 'Sports': ['Sports (2)', 'Distance'],
 'Embeddings': ['Search and Access of Topics and Documents',
  'Neural networks for semantically running encode',
  'Embeddings (2)',
  'Atlas - Create, team, project,'],
 'Convert types to Arrow': ['🤷\u200d♂️9🤷\u200d♀️'],
 'Search and Access of Topics and Documents': ['Arrow',
  'Map of topics and their labels',
  'Search and Access of Topics and Vectors'],
 'Embeddings (2)': ['Audio, Video, Image, and Data',
  'Embeddings (3)',
  'OpenAI - Programmatically Interacting'],
 '🤷\u200d♂️9🤷\u200d♀️': ['Convert Pandas DataFrame to Pyarrow',
  'Bit integers',
  'Arrow Types'],
 'Neural networks for semantically running encode': ['🤷\u200d♂️17🤷\u200d♀️'],
 'Distance': ['🤷\u200d♂️18🤷\u200d♀️'],
 'Atlas - Create, team, project,': ['Atlas'],
 '🤷\u200d♂️5🤷\u200d♀️': ['Account and permissions management', 'Stored data'],
 'Sports (2)': ['Sports (3)', 'Golf and Tennis']}
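
The hierarchy is a plain dictionary mapping each topic to its child topics, so it can be walked like any tree. The sketch below uses a few entries copied from the output above; `find_roots` and `render` are hypothetical helpers written for illustration and are not part of the nomic client.

```python
# A few entries copied from the hierarchy output above.
hierarchy = {
    'Sports': ['Sports (2)', 'Distance'],
    'Sports (2)': ['Sports (3)', 'Golf and Tennis'],
    'Atlas - Create, team, project,': ['Atlas'],
}


def find_roots(tree):
    """Return the topics that never appear as a child of another topic."""
    children = {c for kids in tree.values() for c in kids}
    return [name for name in tree if name not in children]


def render(tree, node, depth=0):
    """Return the subtree under node as a list of indented lines."""
    lines = ['  ' * depth + node]
    for child in tree.get(node, []):
        lines.extend(render(tree, child, depth + 1))
    return lines


for root in find_roots(hierarchy):
    print('\n'.join(render(hierarchy, root)))
```

In this excerpt `Atlas - Create, team, project,` shows up as a root only because its parent entry (`Embeddings`) was left out of the sample.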

In [153]:
map.topics.metadata

Unnamed: 0,depth,topic_id,topic_depth_1,topic_description,topic_short_description,topic_depth_2,topic_depth_3
0,1,1,Convert Python types to and from Arrow.,converted/bit/convert/strings/precision/Arrow/...,Convert Python types to and from Arrow.,,
1,1,2,Atlas - Map - Create - Stores -,Atlas/Map/create/stores/builds/contextual/rein...,Atlas - Map - Create - Stores -,,
2,1,3,Sports,sports/distance/Tennis/topics/news/sport/Golf/...,Sports,,
3,1,4,Embeddings,embeddings/search/images/embedding/vectors/net...,Embeddings,,
4,2,5,Sports,distance/2D/Euclidean/close/semantically/dista...,Distance,Distance,
5,2,6,Sports,sports/Tennis/Golf/sport/region/type/baseball/...,Sports (2),Sports (2),
6,2,7,Embeddings,audio/networks/map/space/view,audio network map,audio network map,
7,2,8,Sports,permissions/account/collaborate/team/project/P...,Account permissions,Account permissions,
8,2,9,Convert Python types to and from Arrow.,types/strings/dictionary/stored/semantically/I...,Types,Types,
9,2,10,Sports,topics/Map/granular/zoom/Floating/labels/corre...,Map of topics,Map of topics,


In [110]:
query = "how atlas works"

query_embed = np.array([embedd.embed_query(query)])

In [111]:
map.embeddings.vector_search(queries=query_embed,
                             k = 2)

([['0', '147']], [[4.755779414722383e-08, 0.49030232429504395]])
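
The result pairs a list of neighbor IDs with a list of numbers that look like distances: the first value is almost zero because chunk `'0'` has the same text as the query apart from capitalization. The sketch below shows the same idea as a brute-force search over a toy set of vectors. Cosine distance is an assumption here, since the notebook does not say which metric Atlas uses, and `brute_force_search` is a hypothetical helper, not the Atlas implementation.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_search(query, vectors, k):
    """Return (ids, distances) of the k vectors closest to query."""
    scored = sorted((cosine_distance(query, v), doc_id)
                    for doc_id, v in vectors.items())
    top = scored[:k]
    return [doc_id for _, doc_id in top], [dist for dist, _ in top]


# Toy 2-D "embeddings" keyed by string IDs, mirroring the IDs Atlas returns.
toy_vectors = {'0': [1.0, 0.0], '1': [0.0, 1.0], '2': [1.0, 1.0]}
ids, distances = brute_force_search([1.0, 0.0], toy_vectors, k=2)
print(ids, distances)
```

A brute-force scan like this is only practical for small collections; it is meant to explain the shape of the result, not to replace an index.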

In [162]:
project.get_data(ids=['0', '147'])

[{'id': '0', 'id_': 'AA', 'text': 'How Atlas Works'},
 {'id': '147',
  'id_': 'kw',
  'text': 'Atlas stores and transfers data using a subset of'}]

In [167]:
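def vector_search(query, atlas_map, project, doc_num):
  """Return the texts of the doc_num chunks nearest to query on atlas_map."""
  # Embed the query with the same model used for the documents (global `embedd`).
  query_embed = np.array([embedd.embed_query(query)])
  # The call returns, per query, a list of neighbor IDs and a matching list of
  # values that appear to be distances (see the earlier output).
  neigh, distances = atlas_map.embeddings.vector_search(queries=query_embed,
                                                        k=doc_num)
  docs = project.get_data(ids=neigh[0])

  print(f'The Ids are {neigh[0]}')

  return [d['text'] for d in docs]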

In [114]:
tns = """

Search is a lot about discovery—the basic human need to learn and broaden your horizons. But searching still requires a lot of hard work by you, the user. So today I’m really excited to launch the Knowledge Graph, which will help you discover new information quickly and easily.

Take a query like [taj mahal]. For more than four decades, search has essentially been about matching keywords to queries. To a search engine the words [taj mahal] have been just that—two words.

But we all know that [taj mahal] has a much richer meaning. You might think of one of the world’s most beautiful monuments, or a Grammy Award-winning musician, or possibly even a casino in Atlantic City, NJ. Or, depending on when you last ate, the nearest Indian restaurant. It’s why we’ve been working on an intelligent model—in geek-speak, a “graph”—that understands real-world entities and their relationships to one another: things, not strings.

The Knowledge Graph enables you to search for things, people or places that Google knows about—landmarks, celebrities, cities, sports teams, buildings, geographical features, movies, celestial objects, works of art and more—and instantly get information that’s relevant to your query. This is a critical first step towards building the next generation of search, which taps into the collective intelligence of the web and understands the world a bit more like people do.

Google’s Knowledge Graph isn’t just rooted in public sources such as Freebase, Wikipedia and the CIA World Factbook. It’s also augmented at a much larger scale—because we’re focused on comprehensive breadth and depth. It currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects. And it’s tuned based on what people search for, and what we find out on the web.

The Knowledge Graph enhances Google Search in three main ways to start:
1. Find the right thing

Language can be ambiguous—do you mean Taj Mahal the monument, or Taj Mahal the musician? Now Google understands the difference, and can narrow your search results just to the one you mean—just click on one of the links to see that particular slice of results:


Taj Mahal

This is one way the Knowledge Graph makes Google Search more intelligent—your results are more relevant because we understand these entities, and the nuances in their meaning, the way you do.
2. Get the best summary

With the Knowledge Graph, Google can better understand your query, so we can summarize relevant content around that topic, including key facts you’re likely to need for that particular thing. For example, if you’re looking for Marie Curie, you’ll see when she was born and died, but you’ll also get details on her education and scientific discoveries:
Marie Curie

How do we know which facts are most likely to be needed for each item? For that, we go back to our users and study in aggregate what they’ve been asking Google about each item. For example, people are interested in knowing what books Charles Dickens wrote, whereas they’re less interested in what books Frank Lloyd Wright wrote, and more in what buildings he designed.

The Knowledge Graph also helps us understand the relationships between things. Marie Curie is a person in the Knowledge Graph, and she had two children, one of whom also won a Nobel Prize, as well as a husband, Pierre Curie, who claimed a third Nobel Prize for the family. All of these are linked in our graph. It’s not just a catalog of objects; it also models all these inter-relationships. It’s the intelligence between these different entities that’s the key.
3. Go deeper and broader

Finally, the part that’s the most fun of all—the Knowledge Graph can help you make some unexpected discoveries. You might learn a new fact or new connection that prompts a whole new line of inquiry. Do you know where Matt Groening, the creator of the Simpsons (one of my all-time favorite shows), got the idea for Homer, Marge and Lisa’s names? It’s a bit of a surprise:
Matt Groening

We’ve always believed that the perfect search engine should understand exactly what you mean and give you back exactly what you want. And we can now sometimes help answer your next question before you’ve asked it, because the facts we show are informed by what other people have searched for. For example, the information we show for Tom Cruise answers 37 percent of next queries that people ask about him. In fact, some of the most serendipitous discoveries I’ve made using the Knowledge Graph are through the magical “People also search for” feature. One of my favorite books is The White Tiger, the debut novel by Aravind Adiga, which won the prestigious Man Booker Prize. Using the Knowledge Graph, I discovered three other books that had won the same prize and one that won the Pulitzer. I can tell you, this suggestion was spot on!

We’ve begun to gradually roll out this view of the Knowledge Graph to U.S. English users. It’s also going to be available on smartphones and tablets—read more about how we’ve tailored this to mobile devices. And watch our video (also available on our site about the Knowledge Graph) that gives a deeper dive into the details and technology, in the words of people who've worked on this project:
Introducing the Knowledge Graph
2:45
Introducing the Knowledge Graph

We hope this added intelligence will give you a more complete picture of your interest, provide smarter search results, and pique your curiosity on new topics. We’re proud of our first baby step—the Knowledge Graph—which will enable us to make search more intelligent, moving us closer to the "Star Trek computer" that I've always dreamt of building. Enjoy your lifelong journey of discovery, made easier by Google Search, so you can spend less time searching and more time doing what you love."""

In [115]:
tns_text = splitter.split_text(tns)

In [126]:
project = AtlasProject(name='atlas documentation')

2023-07-01 16:09:25.558 | INFO     | nomic.project:__init__:672 - Loading existing project `atlas documentation` from organization `kamaljp`.


In [137]:
exist_length = len(text_split)
exist_length

210

In [154]:
tns_dataset = []

for indi,text in enumerate(tns_text):
  tns_dataset.append({"id":indi + exist_length,
                       "text":text,
                      "id_":""})

In [None]:
tns_dataset
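
The new records start their IDs at `exist_length` so they do not collide with the IDs already in the project. A minimal sketch of that pattern, with a uniqueness check before upload, is shown below. `make_records` is a hypothetical helper and the sample texts are placeholders; it also omits the empty `id_` field that the cell above adds.

```python
def make_records(texts, start_id=0):
    """Build upload records whose integer IDs begin at start_id."""
    return [{"id": start_id + i, "text": t} for i, t in enumerate(texts)]


first_batch = make_records(["chunk a", "chunk b", "chunk c"])
# Continue numbering where the first batch stopped.
second_batch = make_records(["chunk d", "chunk e"], start_id=len(first_batch))

all_ids = [r["id"] for r in first_batch + second_batch]
assert len(all_ids) == len(set(all_ids)), "IDs collide across batches"
print(all_ids)
```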

In [117]:
tns_embeddings = embedd.embed_documents(tns_text)

In [130]:
project.id_field

'id'

In [147]:
project.schema

id: int32
text: string
id_: string
_embeddings: fixed_size_list<item: halffloat>[768]
  child 0, item: halffloat
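
The schema stores embeddings as `halffloat`, which matches the documentation text quoted earlier: embeddings are uploaded as 16-bit floats. The sketch below uses only the standard library to show how much precision a value loses in that round trip. `to_half_and_back` is a hypothetical helper, and the sample values are arbitrary.

```python
import struct


def to_half_and_back(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


values = [0.1, 0.49030232429504395, 1.0]
for v in values:
    h = to_half_and_back(v)
    print(f"{v!r} -> {h!r} (abs error {abs(v - h):.2e})")
```

Values that half precision can represent exactly, such as 1.0, survive unchanged; others are rounded to roughly three significant decimal digits.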

In [155]:
project.add_embeddings(
      embeddings=np.array(tns_embeddings),
      data=tns_dataset
  )

1it [00:01,  1.05s/it]
2023-07-01 16:25:53.003 | INFO     | nomic.project:_add_data:1371 - Upload succeeded.


In [156]:
project.create_index(name=project.name,
                     build_topic_model=True,
                     topic_label_field='text')

2023-07-01 16:27:20.134 | INFO     | nomic.project:create_index:1081 - Created map `atlas testing` in project `atlas testing`: https://atlas.nomic.ai/map/2768e6c9-9870-4ca3-8b0f-5577d3f64a20/7400158b-3782-47dc-825d-d26c53417352


In [173]:
map_1 = project.maps[0]

In [174]:
map_1

In [176]:
query = "Strings not things"

vector_search(query,map_1,project,5)

The Ids are ['240', '159', '166', '160', '185']


['things, not strings.',
 'Strings are converted to Arrow strings and',
 'convert them to strings; or they are a field that',
 'Arrow strings and stored as UTF-8.',
 'but values stored as categorical must be strings.']

In [177]:
query = "What is Atlas"

vector_search(query,map_1,project,5)

The Ids are ['0', '1', '147', '58', '78']


['How Atlas Works',
 'Atlas is a platform for visually and',
 'Atlas stores and transfers data using a subset of',
 'Unlike existing data stores, Atlas is built with',
 'Reading an Atlas Map']