# The models we will use
- sentence-transformers/multi-qa-mpnet-base-dot-v1:
    - This model is built on MPNet and trained on a large corpus of question-answer pairs.
    - It produces sentence embeddings that can be used to measure semantic similarity between sentences.
    - It is tuned for semantic search: given a question, it retrieves the passages most likely to answer it.
- E5-large-v2
    - E5 is a transformer-based model, trained to generate embeddings for a wide variety of NLP tasks, including semantic search. It excels at capturing both query intent and document meaning.
- OpenAI text-embedding-3-small (via API)
    - This is the smaller of OpenAI's text-embedding-3 models, one of the most powerful embedding families OpenAI offers. It generates dense embeddings suited to a wide range of tasks, including semantic search.
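
The three models are all compared with cosine similarity later on, so it helps to see how that differs from a raw dot product. According to its model card, multi-qa-mpnet-base-dot-v1 is tuned for dot-product scoring and does not normalize its embeddings, so the two measures can rank documents differently. A minimal sketch with made-up 2-D vectors (not real model output):

```python
import numpy as np

def dot_score(a, b):
    """Raw dot product: sensitive to vector length."""
    return float(np.dot(a, b))

def cosine_score(a, b):
    """Cosine similarity: dot product of the length-normalized vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, chosen only to make the difference visible
query = np.array([1.0, 0.0])
doc_short = np.array([2.0, 0.0])   # same direction as the query, small norm
doc_long = np.array([3.0, 3.0])    # different direction, larger norm

print("dot:   ", dot_score(query, doc_short), dot_score(query, doc_long))
print("cosine:", cosine_score(query, doc_short), cosine_score(query, doc_long))
```

With these toy vectors the dot product prefers `doc_long` while cosine similarity prefers `doc_short`, which is worth keeping in mind when reading the mpnet scores.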

## Load BBC News Data

In [75]:
import pandas as pd

# Load the dataset
file_path = 'bbc_news_data/bbc_news.csv'
df = pd.read_csv(file_path)

# Display the first few rows to confirm the data is loaded correctly
df.head()

Unnamed: 0,title,pubDate,guid,link,description
0,Ukraine: Angry Zelensky vows to punish Russian...,"Mon, 07 Mar 2022 08:01:56 GMT",https://www.bbc.co.uk/news/world-europe-60638042,https://www.bbc.co.uk/news/world-europe-606380...,The Ukrainian president says the country will ...
1,War in Ukraine: Taking cover in a town under a...,"Sun, 06 Mar 2022 22:49:58 GMT",https://www.bbc.co.uk/news/world-europe-60641873,https://www.bbc.co.uk/news/world-europe-606418...,"Jeremy Bowen was on the frontline in Irpin, as..."
2,Ukraine war 'catastrophic for global food',"Mon, 07 Mar 2022 00:14:42 GMT",https://www.bbc.co.uk/news/business-60623941,https://www.bbc.co.uk/news/business-60623941?a...,One of the world's biggest fertiliser firms sa...
3,Manchester Arena bombing: Saffie Roussos's par...,"Mon, 07 Mar 2022 00:05:40 GMT",https://www.bbc.co.uk/news/uk-60579079,https://www.bbc.co.uk/news/uk-60579079?at_medi...,The parents of the Manchester Arena bombing's ...
4,Ukraine conflict: Oil price soars to highest l...,"Mon, 07 Mar 2022 08:15:53 GMT",https://www.bbc.co.uk/news/business-60642786,https://www.bbc.co.uk/news/business-60642786?a...,Consumers are feeling the impact of higher ene...


## Filter News for the First Half of 2024

In [76]:
# Convert the 'pubDate' to datetime format
df['pubDate'] = pd.to_datetime(df['pubDate'], errors='coerce')

# Filter news from January to June 2024
df_filtered = df[(df['pubDate'] >= '2024-01-01') & (df['pubDate'] < '2024-07-01')]

# Drop duplicates based on 'title'
df_filtered = df_filtered.drop_duplicates(subset=['title'], keep='first')

# Drop duplicates based on 'description'
df_filtered = df_filtered.drop_duplicates(subset=['description'], keep='first')

# Reset the index after filtering and removing duplicates
df_filtered = df_filtered.reset_index(drop=True)

print(f"Filtered {len(df_filtered)} unique news articles.")
df_filtered.head()

Filtered 7305 unique news articles.


Unnamed: 0,title,pubDate,guid,link,description
0,Justin Welby: Political leaders should treat o...,2024-01-01 00:00:04,https://www.bbc.co.uk/news/uk-67844356,https://www.bbc.co.uk/news/uk-67844356?at_medi...,The Archbishop of Canterbury urges politicians...
1,Almost three million tested for cancer in England,2024-01-01 00:09:56,https://www.bbc.co.uk/news/health-67841348,https://www.bbc.co.uk/news/health-67841348?at_...,Record numbers are being tested for cancer but...
2,Household energy price rise of 5% comes into f...,2024-01-01 00:00:16,https://www.bbc.co.uk/news/business-67785266,https://www.bbc.co.uk/news/business-67785266?a...,A higher cap for the next three months adds £9...
3,Primrose Hill stabbing: Harry Pitman named as ...,2024-01-01 17:11:13,https://www.bbc.co.uk/news/uk-england-london-6...,https://www.bbc.co.uk/news/uk-england-london-6...,"Harry Pitman, 16, was attacked on London's Pri..."
4,Israel Supreme Court strikes down judicial ref...,2024-01-01 19:47:58,https://www.bbc.co.uk/news/world-middle-east-6...,https://www.bbc.co.uk/news/world-middle-east-6...,The controversial plans triggered nationwide p...
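
To make the filtering logic easier to check, here is a minimal sketch of the same steps on a toy DataFrame. The rows are invented for illustration; only the column names match the dataset. The date window is half-open, so an article stamped exactly 2024-07-01 00:00:00 is excluded.

```python
import pandas as pd

# Invented rows: two share a title, two share a description, one is out of range
toy = pd.DataFrame({
    "title": ["A", "A", "B", "C", "D"],
    "description": ["x", "y", "z", "z", "w"],
    "pubDate": [
        "2024-01-01 00:00:04", "2024-02-01 10:00:00",
        "2024-06-30 23:59:59", "2024-03-05 12:00:00",
        "2024-07-01 00:00:00",
    ],
})
toy["pubDate"] = pd.to_datetime(toy["pubDate"], errors="coerce")

# Half-open interval [start, end): January through June 2024
start, end = pd.Timestamp("2024-01-01"), pd.Timestamp("2024-07-01")
in_range = toy[(toy["pubDate"] >= start) & (toy["pubDate"] < end)]

# Drop repeated titles first, then repeated descriptions
unique = (
    in_range
    .drop_duplicates(subset=["title"], keep="first")
    .drop_duplicates(subset=["description"], keep="first")
    .reset_index(drop=True)
)
print(unique[["title", "description"]])
```

Because the two `drop_duplicates` calls run in sequence, a row survives only if both its title and its description are first occurrences among the rows still present at each step.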


## Merge title and description Columns

In [79]:
# Create a new column 'content' by merging 'title' and 'description'
df_filtered['content'] = df_filtered['title'].fillna('') + ' ' + df_filtered['description'].fillna('')

df_filtered.head()

Unnamed: 0,title,pubDate,guid,link,description,content
0,Justin Welby: Political leaders should treat o...,2024-01-01 00:00:04,https://www.bbc.co.uk/news/uk-67844356,https://www.bbc.co.uk/news/uk-67844356?at_medi...,The Archbishop of Canterbury urges politicians...,Justin Welby: Political leaders should treat o...
1,Almost three million tested for cancer in England,2024-01-01 00:09:56,https://www.bbc.co.uk/news/health-67841348,https://www.bbc.co.uk/news/health-67841348?at_...,Record numbers are being tested for cancer but...,Almost three million tested for cancer in Engl...
2,Household energy price rise of 5% comes into f...,2024-01-01 00:00:16,https://www.bbc.co.uk/news/business-67785266,https://www.bbc.co.uk/news/business-67785266?a...,A higher cap for the next three months adds £9...,Household energy price rise of 5% comes into f...
3,Primrose Hill stabbing: Harry Pitman named as ...,2024-01-01 17:11:13,https://www.bbc.co.uk/news/uk-england-london-6...,https://www.bbc.co.uk/news/uk-england-london-6...,"Harry Pitman, 16, was attacked on London's Pri...",Primrose Hill stabbing: Harry Pitman named as ...
4,Israel Supreme Court strikes down judicial ref...,2024-01-01 19:47:58,https://www.bbc.co.uk/news/world-middle-east-6...,https://www.bbc.co.uk/news/world-middle-east-6...,The controversial plans triggered nationwide p...,Israel Supreme Court strikes down judicial ref...


In [5]:
from sentence_transformers import SentenceTransformer

# Load the models
model_mpnet = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')
model_e5 = SentenceTransformer('intfloat/e5-large-v2')

# Define embedding functions
def embed_with_mpnet(texts):
    return model_mpnet.encode(texts, convert_to_tensor=True)

def embed_with_e5(texts):
    return model_e5.encode(texts, convert_to_tensor=True)
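
One caveat for E5: its model card asks for every input to carry a role prefix, `query: ` for search queries and `passage: ` for documents, whereas the functions above pass raw text. A minimal sketch of helpers that add those prefixes follows. `as_e5_query` and `as_e5_passages` are hypothetical names, not part of sentence-transformers, and whether the prefixes change the rankings in this notebook has not been tested here.

```python
def as_e5_query(text):
    """Prefix a search query the way the E5 model card describes."""
    return f"query: {text}"

def as_e5_passages(texts):
    """Prefix each document with the passage marker."""
    return [f"passage: {t}" for t in texts]

# Intended use (assumes embed_with_e5 from the cell above is defined):
#   doc_vectors = embed_with_e5(as_e5_passages(df_filtered['content'].tolist()))
#   query_vector = embed_with_e5([as_e5_query("AI ethics")])[0]
print(as_e5_query("AI ethics"))
print(as_e5_passages(["First article", "Second article"]))
```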




In [52]:
from openai import OpenAI
client = OpenAI(api_key='<your api key>')

def batch(iterable, batch_size):
    """Helper function to split a list into smaller batches."""
    for i in range(0, len(iterable), batch_size):
        yield iterable[i:i + batch_size]

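def embed_with_openai_batched(texts, batch_size=1000):
    """Embed a list of texts using the OpenAI API in batches.

    Keep batch_size within the API's per-request input limit
    (2048 inputs at the time of writing, per the OpenAI API reference).
    """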
    emb_list = []

    # Process texts in batches
    for batch_texts in batch(texts, batch_size):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch_texts
        )
        # Collect embeddings from the response
        for emb in response.data:
            emb_list.append(emb.embedding)

    return emb_list
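
The batching logic can be exercised without calling the API by passing the embedding call in as a parameter. The sketch below is one way to do that; `embed_in_batches`, `fake_embed` and `MAX_INPUTS_PER_REQUEST` are hypothetical names introduced for illustration, and the 2048 cap is an assumption taken from the OpenAI API reference that should be checked against the current documentation.

```python
MAX_INPUTS_PER_REQUEST = 2048  # assumed per-request cap; verify against current docs

def embed_in_batches(texts, embed_fn, batch_size=1000):
    """Call embed_fn on consecutive chunks of texts and concatenate the results."""
    if not 1 <= batch_size <= MAX_INPUTS_PER_REQUEST:
        raise ValueError(f"batch_size must be in 1..{MAX_INPUTS_PER_REQUEST}")
    embeddings = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        vectors = embed_fn(chunk)
        # Guard against silently misaligned rows
        if len(vectors) != len(chunk):
            raise RuntimeError("embed_fn returned a different number of vectors")
        embeddings.extend(vectors)
    return embeddings

def fake_embed(chunk):
    """Stand-in for the API call: one 1-D 'embedding' per text."""
    return [[float(len(text))] for text in chunk]

result = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(result)
```

In the notebook, `embed_fn` would wrap `client.embeddings.create(...)` and return the list of `embedding` values from `response.data`.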



## If running on Google Colab

In [26]:
# import numpy as np

# # Generate embeddings for each model
# df_embedded['mpnet_embedding'] = list(embed_with_mpnet(df_embedded['content'].tolist()))
# df_embedded['e5_embedding'] = list(embed_with_e5(df_embedded['content'].tolist()))
# df_embedded['openai_embedding'] = list(embed_with_openai(df_embedded['content'].tolist()))

# # Save embeddings to a local file (NumPy format)
# np.save('mpnet_embeddings.npy', np.array(df_embedded['mpnet_embedding'].tolist()))
# np.save('e5_embeddings.npy', np.array(df_embedded['e5_embedding'].tolist()))
# np.save('openai_embeddings.npy', np.array(df_embedded['openai_embedding'].tolist()))

# print("Embeddings saved successfully.")
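
The idea behind the commented-out cell is to cache embeddings on disk so they do not have to be recomputed. A minimal sketch of that round trip with toy vectors; the file name and the temporary directory are placeholders, not paths used elsewhere in this notebook.

```python
import os
import tempfile

import numpy as np

# Toy per-row embeddings, standing in for a DataFrame column of vectors
embeddings = [
    np.array([0.1, 0.2, 0.3], dtype=np.float32),
    np.array([0.4, 0.5, 0.6], dtype=np.float32),
]

# Stack into one (n_rows, dim) matrix before saving
matrix = np.vstack(embeddings)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embeddings.npy")  # placeholder file name
    np.save(path, matrix)
    restored = np.load(path)

print(restored.shape, restored.dtype)
```

Row `i` of the restored matrix lines up with row `i` of the DataFrame only if the DataFrame is not re-filtered or re-ordered between saving and loading.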


In [80]:
import numpy as np

# Helper function to move tensors to CPU and convert to NumPy
def to_numpy(tensor):
    return tensor.detach().cpu().numpy()

# Generate embeddings and move them to CPU
df_filtered['mpnet_embedding'] = [
    to_numpy(embedding) for embedding in embed_with_mpnet(df_filtered['content'].tolist())
]

df_filtered['e5_embedding'] = [
    to_numpy(embedding) for embedding in embed_with_e5(df_filtered['content'].tolist())
]

df_filtered


Unnamed: 0,title,pubDate,guid,link,description,content,mpnet_embedding,e5_embedding
0,Justin Welby: Political leaders should treat o...,2024-01-01 00:00:04,https://www.bbc.co.uk/news/uk-67844356,https://www.bbc.co.uk/news/uk-67844356?at_medi...,The Archbishop of Canterbury urges politicians...,Justin Welby: Political leaders should treat o...,"[0.053571213, 0.08881092, -0.17919879, 0.08407...","[0.020664986, -0.03336663, -0.0038330439, -0.0..."
1,Almost three million tested for cancer in England,2024-01-01 00:09:56,https://www.bbc.co.uk/news/health-67841348,https://www.bbc.co.uk/news/health-67841348?at_...,Record numbers are being tested for cancer but...,Almost three million tested for cancer in Engl...,"[0.41341785, -0.009250954, -0.23477213, -0.173...","[0.0058183777, -0.04546223, -0.0034279262, -0...."
2,Household energy price rise of 5% comes into f...,2024-01-01 00:00:16,https://www.bbc.co.uk/news/business-67785266,https://www.bbc.co.uk/news/business-67785266?a...,A higher cap for the next three months adds £9...,Household energy price rise of 5% comes into f...,"[-0.22885162, -0.16641885, -0.20064497, -0.054...","[0.019705322, -0.033490993, 0.01671098, -0.023..."
3,Primrose Hill stabbing: Harry Pitman named as ...,2024-01-01 17:11:13,https://www.bbc.co.uk/news/uk-england-london-6...,https://www.bbc.co.uk/news/uk-england-london-6...,"Harry Pitman, 16, was attacked on London's Pri...",Primrose Hill stabbing: Harry Pitman named as ...,"[0.10062553, 0.095038205, -0.1422333, 0.230631...","[-0.009218128, -0.057979673, -0.0024853114, -0..."
4,Israel Supreme Court strikes down judicial ref...,2024-01-01 19:47:58,https://www.bbc.co.uk/news/world-middle-east-6...,https://www.bbc.co.uk/news/world-middle-east-6...,The controversial plans triggered nationwide p...,Israel Supreme Court strikes down judicial ref...,"[0.18970646, 0.09234373, -0.18555368, -0.26797...","[-0.018320104, -0.03711855, 0.032993697, -0.03..."
...,...,...,...,...,...,...,...,...
7300,Ex-Olympian among first-time election candidates,2024-06-30 20:59:43,https://www.bbc.com/news/articles/cw000381nzyo#12,https://www.bbc.com/news/articles/cw000381nzyo,"Marc Jenkins said he had a ""massive case of im...",Ex-Olympian among first-time election candidat...,"[-0.323094, 0.19344077, -0.115317434, -0.10344...","[0.0026792518, -0.056466263, 0.015168941, -0.0..."
7301,Where are the seats that could decide the elec...,2024-06-25 14:49:05,https://www.bbc.com/news/articles/c133p016pg4o#1,https://www.bbc.com/news/articles/c133p016pg4o,The parties' top battleground targets across t...,Where are the seats that could decide the elec...,"[-0.16227768, -0.15128608, -0.24611586, 0.1056...","[-0.004053095, -0.069911756, 0.0006357004, -0...."
7302,I recognised my sister in video of refugees ca...,2024-06-30 22:46:01,https://www.bbc.com/news/articles/c3g3nk15jrdo#2,https://www.bbc.com/news/articles/c3g3nk15jrdo,Eritreans tell the BBC their relatives are bei...,I recognised my sister in video of refugees ca...,"[0.002135083, -0.2321746, -0.25640497, -0.0904...","[-0.008929085, -0.0357612, 0.032336436, -0.041..."
7303,'We have to accept this is England's identity',2024-06-30 21:41:24,https://www.bbc.com/sport/football/articles/cx...,https://www.bbc.com/sport/football/articles/cx...,Former captain Alan Shearer on England's drama...,'We have to accept this is England's identity'...,"[0.095595285, 0.20854455, -0.1086355, -0.11150...","[-0.012937819, -0.028105535, -0.007200785, -6...."


In [81]:
df_filtered['openai_embedding'] = embed_with_openai_batched(df_filtered['content'].tolist(), 1000)
# np.save('openai_embeddings.npy', np.array(df_filtered['openai_embedding'].tolist()))
print("OpenAI embeddings generated successfully.")

OpenAI embeddings generated successfully.


## Implement Query Search Function

In [9]:
# from sklearn.metrics.pairwise import cosine_similarity

# def search_news(query, model_name):
#     # Embed the query with the appropriate model
#     if model_name == 'mpnet':
#         query_embedding = embed_with_mpnet([query])[0]
#         news_embeddings = np.load('mpnet_embeddings.npy')
#     elif model_name == 'e5':
#         query_embedding = embed_with_e5([query])[0]
#         news_embeddings = np.load('e5_embeddings.npy')
#     # elif model_name == 'openai':
#     #     query_embedding = embed_with_openai([query])[0]
#     #     news_embeddings = np.load('openai_embeddings.npy')
    
#     # Compute cosine similarity
#     similarities = cosine_similarity([query_embedding], news_embeddings)[0]

#     # Get the top 5 most similar news articles
#     top_indices = similarities.argsort()[-5:][::-1]
#     top_articles = df_embedded.iloc[top_indices]

#     return top_articles[['guid', 'link', 'content']]


In [97]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def to_numpy(tensor):
    """Helper function to move a tensor to CPU and convert it to NumPy."""
    return tensor.detach().cpu().numpy()

def search_news(query, model_name):
    # Generate the query embedding based on the chosen model
    if model_name == 'mpnet':
        query_embedding = embed_with_mpnet([query])[0]  # Output is a tensor
        query_embedding = to_numpy(query_embedding)     # Move to CPU and convert to NumPy
        news_embeddings = np.vstack(df_filtered['mpnet_embedding'].values)
    elif model_name == 'e5':
        query_embedding = embed_with_e5([query])[0]     # Output is a tensor
        query_embedding = to_numpy(query_embedding)     # Move to CPU and convert to NumPy
        news_embeddings = np.vstack(df_filtered['e5_embedding'].values)
    elif model_name == 'openai':
        query_embedding = embed_with_openai_batched([query])[0]  # Plain Python list of floats
        news_embeddings = np.vstack(df_filtered['openai_embedding'].values)
    else:
        raise ValueError(f"Unknown model name: {model_name!r}")

    # Compute cosine similarity
    similarities = cosine_similarity([query_embedding], news_embeddings)[0]

    # Get the top 5 most similar news articles
    top_indices = similarities.argsort()[-5:][::-1]

    # Copy the top rows so adding a column does not trigger SettingWithCopyWarning
    top_articles = df_filtered.iloc[top_indices].copy()
    top_articles['cosine_similarity'] = similarities[top_indices]

    return top_articles
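
The ranking step in `search_news` can also be written with NumPy alone, which makes the arithmetic explicit. A minimal sketch on toy vectors; `top_k_cosine` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=5):
    """Return (indices, scores) of the k rows most cosine-similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=np.float64)
    doc_matrix = np.asarray(doc_matrix, dtype=np.float64)
    # Normalize so that a plain dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    k = min(k, len(scores))
    top = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return top, scores[top]

# Toy 2-D "embeddings"
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
idx, sims = top_k_cosine([1.0, 0.0], docs, k=2)
print(idx, sims)
```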


## Test Query Search

In [99]:
def search_with_models(query: str):
    models_names = ['mpnet', 'e5', 'openai']  # List of model names to use
    results = {}

    # Search across all models
    for model_name in models_names:
        results[model_name] = search_news(query, model_name)

    # Display the results
    for model_name, result_df in results.items():
        print(f"\033[94m \t Model: {model_name} \033[0m")
        for _, row in result_df.iterrows():
            print(f"\033[96m Similarity: {row['cosine_similarity']:.3f}::\033[0m {row['content']}")
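
Raw cosine scores are not directly comparable across models: in the runs below, E5 scores sit around 0.8 while mpnet scores range from about 0.4 to 0.7. One way to compare models is to look at which articles they return rather than at the scores. A minimal sketch using Jaccard overlap; `jaccard_overlap` is a hypothetical helper and the row indices are invented.

```python
def jaccard_overlap(a, b):
    """Share of items that two result lists have in common (0.0 to 1.0)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty result lists agree trivially
    return len(set_a & set_b) / len(set_a | set_b)

# Invented top-5 row indices from two models for the same query
mpnet_top = [10, 42, 7, 3, 99]
openai_top = [42, 10, 8, 3, 55]

overlap = jaccard_overlap(mpnet_top, openai_top)
print(f"Jaccard overlap: {overlap:.3f}")
```

With the notebook's data, the inputs would be the index of each DataFrame returned by `search_news` for the same query.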

In [100]:
query = "Recommended diet and exercise habits for managing obesity"
search_with_models(query)

Model: mpnet
Similarity: 0.464:: Weighing up diet drugs Obesity jabs 'could reduce heart attack risk' says new study
Similarity: 0.413:: May horrified to learn about risks of diabetes disordered eating Better awareness and more NHS support are needed, the former prime minister says.
Similarity: 0.400:: Child obesity in pandemic could have lifelong effects, study says Researchers say children who gained weight in the pandemic could develop diseases later in life.
Similarity: 0.389:: Just five more ways Michael Mosley made us healthier Michael Mosley’s simple and accessible health hacks made him a household name.  Remember these?
Similarity: 0.380:: Post Christmas debt: Which bills should I pay first? What help and options are available to people struggling with debt repayments?
Model: e5
Similarity: 0.827:: Weighing up diet drugs Obesity jabs 'could reduce heart attack risk' says new study
Simila



In [101]:
query = "Keep your children safe on the internet"
search_with_models(query)

Model: mpnet
Similarity: 0.624:: Tame toxic algorithms to protect children, big tech told Big tech companies will have to make changes to their algorithms to comply with new online safety laws.
Similarity: 0.600:: Tech firms told to hide 'toxic' content from children Social media firms like Instagram and TikTok will have to make changes to comply with new online safety laws.
Similarity: 0.571:: Bereaved parents win online harm battle Tech firms will have to hand over personal data of children whose death may be related to online harm.
Similarity: 0.541:: No More Phones 4u School Kids Schools in England given new guidance on stopping mobile phone use
Similarity: 0.539:: Meta tool to block nude images in teens' private messages The move is designed to help stop teenagers receiving inappropriate pictures, even in encrypted chats
Model: e5
Similarity: 0.819:: Bereaved parents win online harm battle Tech f



In [102]:
search_with_models("What are the lates news about AI ethics?")

Model: mpnet
Similarity: 0.587:: Urgent need for terrorism AI laws, warns think tank The government should 'urgently consider' AI-specific legislation a think-tank says.
Similarity: 0.581:: How AI is being used to prevent illegal fishing Illegal fishing remains a huge global problem, but AI is now being used to tackle the issue.
Similarity: 0.579:: AI could 'supercharge' election disinformation, US tells the BBC US Deputy Attorney General Lisa Monaco says the US wants tougher sentences for crimes involving AI.
Similarity: 0.563:: Concern rises over AI in adult entertainment AI in adult entertainment could have negative effects on society and individuals, experts say.
Similarity: 0.555:: AI and humanity’s future: chilling or thrilling? Amol & Nick take on Stephen Fry’s challenge & look at what the future of AI might entail.
Model: e5
Similarity: 0.806:: AI and humanity’s future: chilling or thrilling? 



In [105]:
search_with_models("How the war between Israel and Hamas is affecting global politics?")

Model: mpnet
Similarity: 0.714:: Why have Israel and Iran attacked each other? The long-running shadow war between the two countries has come into the open.
Similarity: 0.687:: Israel-Palestinian bitterness deepened by Hamas attack and war Both Israelis and Palestinians believe the world does not understand their pain and suffering.
Similarity: 0.660:: Israel 'vows revenge' as it 'weighs up response' The fallout from Iran's unprecedented aerial attack and fears of regional escalation dominates the papers.
Similarity: 0.653:: Bowen: As Israel debates Iran attack response, can US and allies stop slide into all-out war? World leaders are scrambling to prevent the Middle East entering a damaging wider conflict.
Similarity: 0.641:: Jeremy Bowen: The Israel-Gaza war is at a crossroads Will the killing of foreign aid workers exhaust the patience of Israel's allies?
Model: e5
Similarity: 0.835:: Could the Isr



In [106]:
search_with_models("Latest trends in the fashion industry")

Model: mpnet
Similarity: 0.557:: In pictures: London Fashion Week's big moments A surprise appearance from the supermodel topped off the 40th anniversary celebrations.
Similarity: 0.555:: Champagne, caffeine and chaos: Fashion week descends on Paris Louis Vuitton, Dior, Hermes, Rick Owens, Loewe, Kenzo and Dries Van Noten all introduce new collections in Paris.
Similarity: 0.520:: The outfits: Stars turn on the style for the Oscars The fashion is as important as the films at the glittering Hollywood award ceremony.
Similarity: 0.519:: How Zendaya perfected 'method dressing' Film stars have made fancy dress fashionable in their recent red carpet looks.
Similarity: 0.508:: London Fashion Week: Celebrating 40 years of catwalks As London Fashion Week marks four decades of creativity, we look back at some memorable moments.
Model: e5
Similarity: 0.808:: How Zendaya perfected 'method dressing' Film stars h



## OpenAI model wins; now let's find the right threshold

In [118]:
def search_open_ai_news(query):
    # Embed the query with the OpenAI model
    query_embedding = embed_with_openai_batched([query])[0]
    news_embeddings = np.vstack(df_filtered['openai_embedding'].values)

    # Compute cosine similarity
    similarities = cosine_similarity([query_embedding], news_embeddings)[0]

    # Get the top 10 most similar news articles
    top_indices = similarities.argsort()[-10:][::-1]

    # Copy the top rows so adding a column does not trigger SettingWithCopyWarning
    top_articles = df_filtered.iloc[top_indices].copy()
    top_articles['cosine_similarity'] = similarities[top_indices]

    # Display the results
    print(f"\033[94mQuery: {query} \033[0m")
    for _, row in top_articles.iterrows():
        print(f"\033[96m (Similarity: {row['cosine_similarity']:.3f})\033[0m RSS news content: {row['content']}")
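
Once a threshold has been chosen, applying it is a small filter on top of the ranking. A minimal sketch follows. `filter_by_threshold` is a hypothetical helper, the scores are toy values, and the cut-off of 0.45 is an assumption used only for illustration, not a value derived in this notebook.

```python
import numpy as np

def filter_by_threshold(similarities, threshold, max_results=10):
    """Return (row_index, score) pairs with score >= threshold, best first."""
    similarities = np.asarray(similarities, dtype=np.float64)
    order = np.argsort(similarities)[::-1][:max_results]
    return [
        (int(i), float(similarities[i]))
        for i in order
        if similarities[i] >= threshold
    ]

# Toy similarity scores for five articles
toy_scores = [0.528, 0.294, 0.477, 0.407, 0.569]
hits = filter_by_threshold(toy_scores, threshold=0.45)  # 0.45 is an assumed cut-off
print(hits)
```

A query with no sufficiently similar article then returns an empty list instead of five weak matches.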


In [119]:
search_open_ai_news("Keep your children safe on the internet")

Query: Keep your children safe on the internet
(Similarity: 0.528) RSS news content: Tame toxic algorithms to protect children, big tech told Big tech companies will have to make changes to their algorithms to comply with new online safety laws.
(Similarity: 0.477) RSS news content: Tech firms told to hide 'toxic' content from children Social media firms like Instagram and TikTok will have to make changes to comply with new online safety laws.
(Similarity: 0.477) RSS news content: Bereaved parents win online harm battle Tech firms will have to hand over personal data of children whose death may be related to online harm.
(Similarity: 0.423) RSS news content: Brianna Ghey's mother and Molly Russell's father join forces to combat online harm Esther Ghey and Ian Russell want better protection for teenagers on social media.
(Similarity: 0.416) RSS news content: Could Ofcom ban social media for under-18s? A quick, simple guide to t



In [120]:
search_open_ai_news("What are the lates news about AI ethics?")

Query: What are the lates news about AI ethics?
(Similarity: 0.569) RSS news content: AI and humanity’s future: chilling or thrilling? Amol & Nick take on Stephen Fry’s challenge & look at what the future of AI might entail.
(Similarity: 0.528) RSS news content: Concern rises over AI in adult entertainment AI in adult entertainment could have negative effects on society and individuals, experts say.
(Similarity: 0.520) RSS news content: AI could 'supercharge' election disinformation, US tells the BBC US Deputy Attorney General Lisa Monaco says the US wants tougher sentences for crimes involving AI.
(Similarity: 0.520) RSS news content: CES 2024: AI pillows and toothbrushes - is it all getting a bit silly? Companies are clamouring to present their products as AI-powered, but are their claims justified?
(Similarity: 0.487) RSS news content: What happens when you think AI is lying about you? BBC Tech Editor Zoe Kleinman tried to 



In [121]:
search_open_ai_news("Recommended diet and exercise habits for managing obesity")
search_open_ai_news("How the war between Israel and Hamas is affecting global politics?")
search_open_ai_news("Latest trends in the fashion industry")

Query: Recommended diet and exercise habits for managing obesity
(Similarity: 0.427) RSS news content: Weighing up diet drugs Obesity jabs 'could reduce heart attack risk' says new study
(Similarity: 0.365) RSS news content: Doctors question science behind blood sugar diet trend Experts say there is "no strong evidence" the monitors, proven to be effective in managing diabetes, can also help people without the condition.
(Similarity: 0.344) RSS news content: May horrified to learn about risks of diabetes disordered eating Better awareness and more NHS support are needed, the former prime minister says.
(Similarity: 0.312) RSS news content: Child obesity in pandemic could have lifelong effects, study says Researchers say children who gained weight in the pandemic could develop diseases later in life.
(Similarity: 0.294) RSS news content: Why fat Labradors can blame their genes Scientists used the 'sausage in a box' test to find

Query: How the war between Israel and Hamas is affecting global politics?
(Similarity: 0.569) RSS news content: Israel-Palestinian bitterness deepened by Hamas attack and war Both Israelis and Palestinians believe the world does not understand their pain and suffering.
(Similarity: 0.554) RSS news content: No let-up for Gazans while world focused on Iran attacks Fighting continues in Gaza, where a humanitarian crisis is overshadowed by the wider regional conflict.
(Similarity: 0.551) RSS news content: Chris Mason: How Gaza conflict is contorting UK politics The Conservatives, Labour, and Parliament itself are finding themselves torn by the furious arguments provoked by the Israel-Gaza war.
(Similarity: 0.546) RSS news content: Jeremy Bowen: The Israel-Gaza war is at a crossroads Will the killing of foreign aid workers exhaust the patience of Israel's allies?
(Similarity: 0.542) RSS news content: Was this the week Israel and He



In [122]:
search_open_ai_news("Best Marvel movie this year") 
search_open_ai_news("biden vs trump")
search_open_ai_news("Best fishing places in the world")



Query: Best Marvel movie this year
(Similarity: 0.407) RSS news content: Moana: Disney's hit animation to get surprise cinema sequel this year Disney's unexpected announcement sets up a box office battle with the first Wicked film.
(Similarity: 0.366) RSS news content: Marvel star Jeremy Renner: I'm so blessed a year after accident The Marvel star reflects on his recovery after being run over by his own snow plough last New Year's Day.
(Similarity: 0.363) RSS news content: Deadpool 3 and Wicked trailers air in Super Bowl adverts Fans were given a first look at the upcoming films during Sunday night's NFL showdown.
(Similarity: 0.348) RSS news content: Marvel star Majors avoids jail and gets probation Ex-Marvel star avoids jail for assaulting his ex-girlfriend but will attend an intervention programme.
(Similarity: 0.345) RSS news content: All you need to know for tonight's Bafta Games Awards Zelda, Baldur's Gate 3 and Spider-M



Query: biden vs trump
(Similarity: 0.560) RSS news content: Where Biden and Trump stand on key issues How the two candidates' policies compare on the economy, immigration, abortion and other big issues.
(Similarity: 0.544) RSS news content: Biden and Trump make competing trips to US border The two likely presidential candidates make speeches in an effort to show they can tackle illegal crossings.
(Similarity: 0.537) RSS news content: Biden says he's ready for election debate with Trump US President says he is "happy" to face rival who claims he is ready "anytime, anywhere, anyplace".
(Similarity: 0.530) RSS news content: Watch key moments from Biden and Trump's first debate The pair threw insults and clashed on stage about the biggest issues American voters care about - here's what they said.
(Similarity: 0.526) RSS news content: Big stakes and high tension as Biden-Trump debate looms Thursday's debate between the two will be 



## Optimal similarity threshold: 0.45
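
One way to apply this cutoff is to drop every result whose similarity score falls below 0.45. The sketch below uses scores and shortened headlines copied from the outputs above. The function name `filter_by_threshold` and the `(score, text)` tuple layout are assumptions made for illustration, not part of `search_open_ai_news`.

```python
THRESHOLD = 0.45  # cutoff taken from the heading above

def filter_by_threshold(results, threshold=THRESHOLD):
    """Keep only (score, text) pairs whose score is at least the threshold."""
    return [(score, text) for score, text in results if score >= threshold]

# Scores and shortened headlines copied from the outputs above.
marvel = [
    (0.407, "Moana: Disney's hit animation to get surprise cinema sequel this year"),
    (0.366, "Marvel star Jeremy Renner: I'm so blessed a year after accident"),
    (0.363, "Deadpool 3 and Wicked trailers air in Super Bowl adverts"),
]
biden_trump = [
    (0.560, "Where Biden and Trump stand on key issues"),
    (0.544, "Biden and Trump make competing trips to US border"),
    (0.537, "Biden says he's ready for election debate with Trump"),
]

print(len(filter_by_threshold(marvel)))       # no Marvel result reaches 0.45
print(len(filter_by_threshold(biden_trump)))  # all three Biden/Trump results are kept
```

With this cutoff, the loosely related "Best Marvel movie this year" results are suppressed, while the on-topic "biden vs trump" results pass through.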

In [123]:
search_open_ai_news("Northern Lights")

Query: Northern Lights
(Similarity: 0.609) RSS news content: Northern lights give spectacular surprise display across UK In a sudden display the aurora was seen from the Highlands down to Cornwall on Sunday night.
(Similarity: 0.598) RSS news content: Northern Lights in dazzling display across the UK A solar storm of this scale can cause disruptions to infrastructure such as the power grid.
(Similarity: 0.567) RSS news content: Can I see the Northern Lights tonight? Missed the Northern lights last night? Don't worry, there will be another opportunity to see them tonight. Here's how.
(Similarity: 0.557) RSS news content: More Northern Lights soon as Sun storms strengthen Another spectacular light show could come within two weeks as Sun storms reach 11-year high.
(Similarity: 0.535) RSS news content: In pictures: Northern Lights dazzle around the world The aurora borealis was visible around the world on Friday night, stunning ph

