In [1]:
import chromadb
import pandas as pd

In [2]:
news = pd.read_csv("rss_feed_data.csv")
news.head()

Unnamed: 0,title,link,domain,published,summary
0,"Sugar in India, Fueled by Child Marriage and H...",https://www.nytimes.com/2024/03/24/world/europ...,www.nytimes.com,"Sun, 24 Mar 2024 12:08:58 +0000",An investigation into the sugar-cane industry ...
1,Senegal Votes in an Election That Almost Didn’...,https://www.nytimes.com/2024/03/24/world/afric...,www.nytimes.com,"Sun, 24 Mar 2024 12:03:16 +0000","The top opposition politician, Ousmane Sonko, ..."
2,Russia’s Battle With Extremists Has Simmered f...,https://www.nytimes.com/2024/03/24/world/europ...,www.nytimes.com,"Sun, 24 Mar 2024 11:03:40 +0000",The Islamic State has long threatened to strik...
3,"In Hezbollah’s Sights, a Stretch of Northern I...",https://www.nytimes.com/2024/03/24/world/middl...,www.nytimes.com,"Sun, 24 Mar 2024 09:01:10 +0000",For the few Israelis remaining in the evacuate...
4,"Inside the Battle for a Bunker in Avdiivka, Uk...",https://www.nytimes.com/2024/03/24/world/europ...,www.nytimes.com,"Sun, 24 Mar 2024 04:01:10 +0000",A struggle for a position held by Ukrainian fo...


In [3]:
client = chromadb.PersistentClient(path="storage")
client

<chromadb.api.client.Client at 0x21de4474430>

In [4]:
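# get_or_create_collection avoids an error when this cell is re-run against the persistent store
collection = client.get_or_create_collection(name="rss_news")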

- Chroma handles embedding automatically with a relatively small default model (all-MiniLM-L6-v2), which may be enough for my short text chunks.
- A query returns separate lists for ids, metadatas, distances, and the documents themselves.
- Each of those is a list of lists with one inner list per query text, so even a single query has to be indexed with `[0]`.
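
To make the nesting concrete, here is a minimal sketch of flattening one query's slice into per-hit records. `unpack_query` is a hypothetical helper, not part of chromadb, and `sample` is a hand-written stand-in that only mimics the shape of the result dict shown further down.

```python
def unpack_query(results, query_index=0):
    """Flatten one query's slice of a Chroma-style result dict into a list of hits."""
    keys = ("ids", "documents", "metadatas", "distances")
    # Each field is a list of lists: the outer index selects the query text,
    # the inner list holds that query's hits in ranked order.
    columns = [results[key][query_index] for key in keys]
    names = ("id", "document", "metadata", "distance")
    return [dict(zip(names, row)) for row in zip(*columns)]


# Hand-written stand-in with the same shape as a single-query result (assumption).
sample = {
    "ids": [["a", "b"]],
    "documents": [["first text", "second text"]],
    "metadatas": [[{"domain": "example.com"}, {"domain": "example.com"}]],
    "distances": [[0.5, 0.9]],
}

hits = unpack_query(sample)
print(hits[0]["id"], hits[0]["distance"])
```

With two query texts, `unpack_query(results, 1)` would give the hits for the second one.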

In [5]:
collection.add(documents=news.apply(lambda row: str(row['title']) + ' ' + str(row['summary']), axis=1).tolist(),
               metadatas=news[['link', 'domain', 'published']].to_dict(orient='records'),
               ids=news.apply(lambda row: str(row['link']), axis=1).tolist())

C:\Users\vlady\.cache\chroma\onnx_models\all-MiniLM-L6-v2\onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:07<00:00, 10.7MiB/s]
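
One thing to watch in the `documents` argument: `str()` on a missing summary yields the literal text `nan`, which then gets embedded with the title. A possible guard, sketched in plain Python, is below; `build_document` is a hypothetical helper, and it assumes missing values arrive as `None` or a float NaN, which is how pandas usually represents them.

```python
import math


def build_document(title, summary):
    """Join title and summary, skipping values that are None or NaN."""
    parts = []
    for value in (title, summary):
        if value is None:
            continue
        # pandas represents a missing text cell as a float NaN
        if isinstance(value, float) and math.isnan(value):
            continue
        parts.append(str(value).strip())
    return " ".join(parts)


print(build_document("Here's the latest on the attack in Russia.", float("nan")))
```

It could replace the lambda above, for example `news.apply(lambda row: build_document(row['title'], row['summary']), axis=1).tolist()`.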


In [27]:
results = collection.query(
    query_texts=["Relations between Russia and Ukraine", "China and US trade war"],
    n_results=3
)
results

{'ids': [['https://www.nytimes.com/2024/03/23/world/europe/ukraine-russia-moscow-attack.html',
   'https://www.nytimes.com/2024/03/23/world/europe/moscow-attack-putin.html',
   'https://www.nytimes.com/2024/03/24/world/europe/russia-extremism-isis-syria.html'],
  ['https://www.nytimes.com/2024/03/24/business/china-development-forum-economy.html',
   'https://www.nytimes.com/2024/03/22/technology/china-ai-talent.html',
   'https://www.nytimes.com/2024/03/22/world/middleeast/israel-gaza-security-council-veto.html']],
 'distances': [[0.960934042930603, 0.972374677658081, 1.0162230730056763],
  [1.1636111736297607, 1.2395870016028705, 1.2690579891204834]],
 'metadatas': [[{'domain': 'www.nytimes.com',
    'link': 'https://www.nytimes.com/2024/03/23/world/europe/ukraine-russia-moscow-attack.html',
    'published': 'Sat, 23 Mar 2024 20:08:27 +0000'},
   {'domain': 'www.nytimes.com',
    'link': 'https://www.nytimes.com/2024/03/23/world/europe/moscow-attack-putin.html',
    'published': 'Sat,

In [20]:
# Print each document returned for the first query, separated by a blank line
for document in results['documents'][0]:
    print(document + '\n')

From Russia, Elaborate Tales of Fake Journalists As the Ukraine war grinds on, the Kremlin has created increasingly complex fabrications online to discredit Ukraine’s leader and undercut aid. Some have a Hollywood-style plot twist.

Here’s the latest on the attack in Russia. nan

Russian Attack Leaves Over a Million in Ukraine Without Electricity Power plants and a major hydroelectric dam were damaged in what Ukrainian officials said was one of the war’s largest assaults on energy infrastructure.

In [28]:
results['documents'][0]

['Ukraine Rejects Russian Speculation That It Had Role in Attack Kyiv has accused Russia of falsely suggesting it was to blame for the terrorist attack in Moscow and of using the assault to escalate the fighting in Ukraine.',
 'Putin Tries to Link Moscow Concert Hall Attack to Ukraine American officials, who have assessed that a branch of the Islamic State was responsible, have voiced concern that the Russian leader could seek to falsely blame Ukraine.',
 'Russia’s Battle With Extremists Has Simmered for Years The Islamic State has long threatened to strike Russia for helping the Syrian president, Bashar al-Assad, stay in control.']

In [29]:
results['documents'][1]

['China’s Plan to Spur Growth: A New Slogan for Building Factories As China’s leaders promote their strategy, other countries worry about manufacturing overcapacity and plans for more exports.',
 'In One Key A.I. Metric, China Pulls Ahead of the U.S.: Talent China has produced a huge number of top A.I. engineers in recent years. New research shows that, by some measures, it has already eclipsed the United States.',
 'U.S. Call for Gaza Cease-Fire Runs Into Russia-China Veto The American draft resolution before the Security Council did not go far enough to end the Israel-Hamas war, Russia and China said, after the United States had vetoed three earlier resolutions.']

## Testing OOP implementation

- The news DataFrame can be passed straight from NewsRetriever to the storage class without saving it to disk first.
- Later, add management for deleting old news directly from the vector DB.
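
A rough sketch of how the cleanup could pick its targets, assuming the `published` metadata keeps the RSS date format seen above. `stale_ids` is a hypothetical helper and the ids and cutoff here are made up for illustration. In the real store the inputs would presumably come from `collection.get()` and the result would go to `collection.delete(ids=...)`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def stale_ids(ids, metadatas, cutoff):
    """Return the ids whose 'published' metadata is older than cutoff."""
    stale = []
    for item_id, metadata in zip(ids, metadatas):
        # RSS dates such as 'Sat, 23 Mar 2024 20:08:27 +0000' parse to aware datetimes
        published = parsedate_to_datetime(metadata["published"])
        if published < cutoff:
            stale.append(item_id)
    return stale


# Made-up ids; the dates are copied from the notebook output above.
ids = ["old", "new"]
metadatas = [
    {"published": "Sat, 23 Mar 2024 20:08:27 +0000"},
    {"published": "Sun, 24 Mar 2024 12:08:58 +0000"},
]
cutoff = datetime(2024, 3, 24, tzinfo=timezone.utc)

to_delete = stale_ids(ids, metadatas, cutoff)
print(to_delete)
# Assumed follow-up against the real collection: collection.delete(ids=to_delete)
```

Filtering in Python keeps the sketch independent of how the dates are stored. Storing a numeric timestamp in the metadata instead would be another option.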

In [2]:
from NewsVectorStorage import NewsVectorStorage
import pandas as pd

news = pd.read_csv("rss_feed_data.csv")
news_vector_storage = NewsVectorStorage(news_dataframe=news)

In [3]:
news_vector_storage.load_news()