## Experiments with NYTimes API

A compilation of experiments with the [NYTimes API](https://developer.nytimes.com/).

In [10]:
import pandas as pd  # used below for the pd.DataFrame annotation
from services.classify_civic.update_current_events.sync_nytimes_articles import *

In [None]:
res = load_section_topstories("politics")

In [None]:
res

In [None]:
res["results"][0]
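
For context, a helper like `load_section_topstories` presumably wraps the NYTimes Top Stories endpoint. The sketch below is an assumption about how such a call could look, not the project's actual implementation: `build_topstories_url` and `fetch_section_topstories` are hypothetical names, and the API key is a placeholder.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Public base URL of the NYTimes Top Stories API (v2).
TOPSTORIES_BASE_URL = "https://api.nytimes.com/svc/topstories/v2"


def build_topstories_url(section: str, api_key: str) -> str:
    """Build the request URL for one section, e.g. "politics"."""
    query = urlencode({"api-key": api_key})
    return f"{TOPSTORIES_BASE_URL}/{quote(section)}.json?{query}"


def fetch_section_topstories(section: str, api_key: str, timeout: float = 10.0) -> dict:
    """Hypothetical stand-in for load_section_topstories: fetch and decode the JSON payload."""
    with urlopen(build_topstories_url(section, api_key), timeout=timeout) as response:
        return json.load(response)
```

With a valid key, `fetch_section_topstories("politics", api_key)["results"][0]` would correspond to the lookup in the cell above.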

Now let's see the output of the full sync: load the top stories for several sections, then write them to the DB.

In [11]:
section_to_topstories_map: dict[str, list] = (
    load_all_section_topstories(sections=["politics", "us", "world"])
)

Loading articles for politics section...
Finished loading 25 articles.
----------
Loading articles for us section...
Finished loading 24 articles.
----------
Loading articles for world section...
Finished loading 37 articles.
----------


In [12]:
for section, topstories_list in section_to_topstories_map.items():
    print(f"Writing the latest articles for the {section} section to DB.")
    processed_articles = process_articles(topstories_list)
    write_articles_to_db(processed_articles)
    print('-' * 10)

Writing the latest articles for the politics section to DB.
Writing 25 articles to DB.
Finished writing 25 articles to DB.
----------
Writing the latest articles for the us section to DB.
Writing 24 articles to DB.
Finished writing 24 articles to DB.
----------
Writing the latest articles for the world section to DB.
Writing 37 articles to DB.
Finished writing 37 articles to DB.
----------
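
The stored rows have the columns `nytimes_uri`, `title`, `abstract`, `url`, `published_date`, and `captions`. As a rough sketch of what a step like `process_articles` might do per article, the function below flattens one raw Top Stories result into that shape. Everything here is an assumption: `flatten_topstory` is a hypothetical name, and joining the unique `multimedia` captions into one string is a guess at how `captions` is derived.

```python
def flatten_topstory(result: dict) -> dict:
    """Flatten one raw Top Stories result into a flat row (hypothetical helper)."""
    # "multimedia" can be missing or null, so fall back to an empty list.
    multimedia = result.get("multimedia") or []

    # The same image is often listed in several formats with an identical caption,
    # so keep each distinct caption only once, in order of first appearance.
    captions: list[str] = []
    for item in multimedia:
        caption = (item.get("caption") or "").strip()
        if caption and caption not in captions:
            captions.append(caption)

    return {
        "nytimes_uri": result.get("uri", ""),
        "title": result.get("title", ""),
        "abstract": result.get("abstract", ""),
        "url": result.get("url", ""),
        "published_date": result.get("published_date", ""),
        "captions": " ".join(captions),
    }


# Made-up input for illustration only; not real API output.
example_row = flatten_topstory(
    {
        "uri": "nyt://article/example",
        "title": "Example title",
        "abstract": "Example abstract.",
        "multimedia": [{"caption": "Example caption."}, {"caption": "Example caption."}],
    }
)
```

An article without images ends up with an empty `captions` string rather than a missing value, which keeps later string concatenation simple.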


In [13]:
df = load_all_articles_as_df()

Execution time: 0 minutes, 0 seconds


In [16]:
foo = df["abstract"][0] + " " + df["title"][0] + " " + df["captions"][0]

In [17]:
foo

'Marilyn Lands flipped a State House seat in the deep-red state by 25 percentage points, underscoring the continued political potency of reproductive rights. Democrat Running on Abortion and I.V.F. Access Wins Special Election in Alabama The Democratic candidate Marilyn Lands defeated her Republican opponent, Teddy Powell, by about 25 percentage points — an extraordinary margin in a swing district.'

In [14]:
df

|  | id | nytimes_uri | title | abstract | url | published_date | captions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | nyt://article/d45c86b2-117e-5fed-807d-a41dff8d... | Democrat Running on Abortion and I.V.F. Access... | Marilyn Lands flipped a State House seat in th... | https://www.nytimes.com/2024/03/27/us/politics... | 2024-03-27T09:43:38-04:00 | The Democratic candidate Marilyn Lands defeate... |
| 1 | 2 | nyt://article/6d602738-cde6-5e8e-8921-f49634e9... | New Georgia Data Gives Insight on Primaries, P... | The vote history data supports the polling tha... | https://www.nytimes.com/2024/03/27/upshot/bide... | 2024-03-27T05:04:15-04:00 | Data from the March 12 primary in Georgia is c... |
| 2 | 3 | nyt://article/6d39505a-2866-5087-8fff-28af9239... | One Grieving Mother Hasn’t Given Up Hope for a... | A year after losing her daughter in the Covena... | https://www.nytimes.com/2024/03/27/us/politics... | 2024-03-27T05:03:32-04:00 | Katy Dieckhaus looking through one of the jour... |
| 3 | 4 | nyt://interactive/682e22bb-3e31-5fba-a32c-8ac5... | How Trump Moved Money to Pay $100 Million in L... | Trump supporters poured money into his effort ... | https://www.nytimes.com/interactive/2024/03/27... | 2024-03-27T05:03:02-04:00 |  |
| 4 | 5 | nyt://article/cbe159e1-0021-5857-8bbd-c448ac91... | They Grow Your Berries and Peaches, but Often ... | Farmers of fruits and vegetables say coverage ... | https://www.nytimes.com/2024/03/27/business/ec... | 2024-03-27T05:00:52-04:00 | Bernie Smiarowski grows potatoes and strawberr... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 81 | 82 | nyt://article/780eb804-a9fc-510d-a8cb-d756c7c2... | A French-Malian Singer Is Caught in an Olympic... | Aya Nakamura’s music is one of France’s top cu... | https://www.nytimes.com/2024/03/26/world/europ... | 2024-03-26T00:01:13-04:00 | Aya Nakamura is France’s most popular singer a... |
| 82 | 83 | nyt://article/057743db-2792-535e-a2e5-f14081be... | U.N. Security Council Calls for Immediate Ceas... | The U.S. decision not to vote on the resolutio... | https://www.nytimes.com/2024/03/25/world/middl... | 2024-03-25T19:00:14-04:00 | Linda Thomas-Greenfield, the U.S. ambassador t... |
| 83 | 84 | nyt://article/99a756c9-b045-570d-801e-c794f72c... | Tuesday Briefing: U.N. Voted for a Gaza Cease-... | Also, searching for Iceland’s northern lights. | https://www.nytimes.com/2024/03/25/briefing/un... | 2024-03-25T16:35:37-04:00 | Palestinians inspected the damage to a buildin... |
| 84 | 85 | nyt://article/8f4bce50-e374-589c-a078-1e56c202... | Videos and Online Profiles Link Suspects to Mo... | Clothing and other details appear to show a co... | https://www.nytimes.com/2024/03/25/world/europ... | 2024-03-25T13:42:30-04:00 | A photograph released by the Islamic State pur... |


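Now let's experiment with vectorizing the articles: build one text field per article, embed it, and index the embeddings with FAISS for similarity search.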

In [18]:
all_articles_df: pd.DataFrame = load_all_articles_as_df()

Execution time: 0 minutes, 0 seconds


In [19]:
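# Concatenate the text fields. fillna("") keeps a missing caption from
# turning the whole concatenated string into NaN.
all_articles_df["full_text"] = (
    all_articles_df["title"].fillna("") + " "
    + all_articles_df["abstract"].fillna("") + " "
    + all_articles_df["captions"].fillna("")
)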

In [21]:
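MAX_CHUNK_SIZE = 512  # measured in characters, not tokens
# Slice each text to at most MAX_CHUNK_SIZE characters; this can cut mid-word.
all_articles_df["full_text_truncated"] = all_articles_df["full_text"].apply(
    lambda x: x[:MAX_CHUNK_SIZE]
)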
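
The slice above cuts at exactly 512 characters, which can split the last word. As an optional variant (not what the notebook uses), here is a minimal sketch that backs up to the previous space instead. `truncate_at_word_boundary` is a hypothetical helper name.

```python
def truncate_at_word_boundary(text: str, max_chars: int = 512) -> str:
    """Cut text to at most max_chars characters without splitting a word when possible."""
    if len(text) <= max_chars:
        return text

    cut = text[:max_chars]
    # If the next character is whitespace, the cut already ends on a whole word.
    if text[max_chars].isspace():
        return cut.rstrip()

    # Otherwise drop the trailing partial word, provided there is an earlier space.
    head, sep, _ = cut.rpartition(" ")
    return head.rstrip() if sep else cut
```

It could be swapped in with `all_articles_df["full_text"].apply(truncate_at_word_boundary)`. Either way the limit is still in characters, so it is only a loose proxy for the embedding model's token limit.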

In [22]:
all_articles_dict_list = all_articles_df.to_dict(orient="records")

In [23]:
from langchain.docstore.document import Document as LangchainDocument

In [24]:
raw_knowledge_base: list[LangchainDocument] = [
    LangchainDocument(
        page_content=article["full_text_truncated"],
        metadata={
            "id": article["id"], "nytimes_uri": article["nytimes_uri"]
        }
    )
    for article in all_articles_dict_list
]

In [28]:
raw_knowledge_base[0].__dict__

{'page_content': 'Democrat Running on Abortion and I.V.F. Access Wins Special Election in Alabama Marilyn Lands flipped a State House seat in the deep-red state by 25 percentage points, underscoring the continued political potency of reproductive rights. The Democratic candidate Marilyn Lands defeated her Republican opponent, Teddy Powell, by about 25 percentage points — an extraordinary margin in a swing district.',
 'metadata': {'id': 1,
  'nytimes_uri': 'nyt://article/d45c86b2-117e-5fed-807d-a41dff8da796'},
 'type': 'Document'}

In [30]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.faiss import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy

In [31]:
DEFAULT_EMBEDDING_MODEL = "thenlper/gte-small"

# tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
embedding_model = HuggingFaceEmbeddings(
    model_name=DEFAULT_EMBEDDING_MODEL,
    multi_process=True,  # starts a worker pool on each embed call (see the logs below)
    # model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},  # unit-length vectors, so inner product equals cosine similarity
)

Load pretrained SentenceTransformer: thenlper/gte-small



Use pytorch device_name: mps
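
The `normalize_embeddings` setting above matters because, for unit-length vectors, the plain dot product and the cosine similarity are the same number. A small self-contained check in pure Python, using toy 2-D vectors rather than real embeddings:

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
cosine = cosine_similarity(a, b)
dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))

# Both values agree up to floating-point rounding.
print(math.isclose(cosine, dot_of_normalized))
```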


In [33]:
knowledge_vector_db = FAISS.from_documents(
    raw_knowledge_base, embedding_model, distance_strategy=DistanceStrategy.COSINE
)

CUDA/NPU is not available. Starting 4 CPU workers
Start multi-process pool on devices: cpu, cpu, cpu, cpu
Loading faiss.
Successfully loaded faiss.


In [37]:
knowledge_vector_db.save_local("faiss_index_nytimes")

In [38]:
# Reload the saved index from disk. Newer langchain_community releases also
# require allow_dangerous_deserialization=True in this call.
new_db = FAISS.load_local("faiss_index_nytimes", embedding_model)


In [34]:
user_query = "Can you believe what's happening with Democrats and IVF access?"
query_vector = embedding_model.embed_query(user_query)

CUDA/NPU is not available. Starting 4 CPU workers
Start multi-process pool on devices: cpu, cpu, cpu, cpu


In [35]:
retrieved_docs = knowledge_vector_db.similarity_search(query=user_query, k=5)

CUDA/NPU is not available. Starting 4 CPU workers
Start multi-process pool on devices: cpu, cpu, cpu, cpu


In [39]:
retrieved_docs = new_db.similarity_search(query=user_query, k=5)

CUDA/NPU is not available. Starting 4 CPU workers
Start multi-process pool on devices: cpu, cpu, cpu, cpu


In [36]:
retrieved_docs

[Document(page_content='Democrat Running on Abortion and I.V.F. Access Wins Special Election in Alabama Marilyn Lands flipped a State House seat in the deep-red state by 25 percentage points, underscoring the continued political potency of reproductive rights. The Democratic candidate Marilyn Lands defeated her Republican opponent, Teddy Powell, by about 25 percentage points — an extraordinary margin in a swing district.', metadata={'id': 1, 'nytimes_uri': 'nyt://article/d45c86b2-117e-5fed-807d-a41dff8da796'}),
 Document(page_content='Democrat Running on Abortion and I.V.F. Access Wins Special Election in Alabama Marilyn Lands flipped a State House seat in the deep-red state by 25 percentage points, underscoring the continued political potency of reproductive rights. The Democratic candidate Marilyn Lands defeated her Republican opponent, Teddy Powell, by about 25 percentage points — an extraordinary margin in a swing district.', metadata={'id': 26, 'nytimes_uri': 'nyt://article/d45c86

In [41]:
type(retrieved_docs[0])

langchain_core.documents.base.Document
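
The first two hits above carry the same text under different ids (1 and 26), which suggests the same article was stored once for each section it appeared in. One possible way to handle this at query time is to drop repeats by `nytimes_uri`. The sketch below uses a small `Doc` dataclass as a stand-in for the LangChain `Document` so that it runs on its own; `dedupe_by_uri` is a hypothetical helper, and the sample documents are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_by_uri(docs: list) -> list:
    """Keep the first document per nytimes_uri, preserving the ranking order."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        uri = doc.metadata.get("nytimes_uri")
        if uri is not None:
            if uri in seen:
                continue  # a higher-ranked copy of this article was already kept
            seen.add(uri)
        unique.append(doc)  # documents without a URI are always kept
    return unique


# Made-up documents for illustration only.
sample_docs = [
    Doc("Story A", {"id": 1, "nytimes_uri": "nyt://article/example-a"}),
    Doc("Story A", {"id": 26, "nytimes_uri": "nyt://article/example-a"}),
    Doc("Story B", {"id": 2, "nytimes_uri": "nyt://article/example-b"}),
]
deduped = dedupe_by_uri(sample_docs)
```

Since the function only reads `metadata`, it should work unchanged on `retrieved_docs`. Deduplicating before the DB write would avoid the problem at the source, but that depends on how `write_articles_to_db` is implemented.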