#Introduction

This exercise is designed to deepen your understanding of, and skills in, modern deep learning techniques. We have two main tasks for you. The first focuses on using SBERT for semantic search, and the second involves hands-on exercises with gradient descent and the attention mechanism.

#Part 1

Task Description
Create something innovative using SBERT and semantic search, or even more! The guidelines are intentionally broad to encourage creativity. Here are some ideas to get you started:

Implement a GIF search engine or YouTube search function using images and CLIP.
(Optional)

Use SetFit for supervised tasks with SBERT models.


Consider building a search engine using a Gradio or Streamlit app.

In [None]:
!pip install -U sentence_transformers -q

In [None]:
from IPython.display import HTML
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
import pandas as pd
import requests
from io import StringIO

# Download the TSV file from the URL
url = "https://raw.githubusercontent.com/SigneByrith/GIF-/main/tgif-v1.0.tsv"
response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed
data = StringIO(response.text)

# Read the tab-separated data into a DataFrame, supplying the column names explicitly
column_names = ['url', 'description']
df = pd.read_csv(data, sep='\t', names=column_names)

In [None]:
df.head(5)

                                                 url                                        description
0  https://38.media.tumblr.com/9f6c25cc350f12aa74...  a man is glaring, and someone with sunglasses ...
1  https://38.media.tumblr.com/9ead028ef62004ef6a...           a cat tries to catch a mouse on a tablet
2  https://38.media.tumblr.com/9f43dc410be85b1159...                   a man dressed in red is dancing.
3  https://38.media.tumblr.com/9f659499c8754e40cf...     an animal comes close to another in the jungle
4  https://38.media.tumblr.com/9ed1c99afa7d714118...  a man in a hat adjusts his tie and makes a wei...


In [None]:
df.shape

(125782, 2)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125782 entries, 0 to 125781
Data columns (total 2 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   url          125782 non-null  object
 1   description  125782 non-null  object
dtypes: object(2)
memory usage: 1.9+ MB


In [None]:
# Check for duplicates based on the 'URL' column
duplicates_url = df[df.duplicated(subset=['url'])]

# Print the duplicate rows based on 'URL'
print("Duplicate Rows based on 'URL':")
print(duplicates_url)

Duplicate Rows based on 'URL':
                                                      url  \
49      https://38.media.tumblr.com/8cc0d8b94f1694a74f...   
60      https://38.media.tumblr.com/712368408b70ce32ee...   
75      https://38.media.tumblr.com/832c81c5fd750a6e2a...   
100     https://38.media.tumblr.com/a5ae5a1270a7246170...   
145     https://38.media.tumblr.com/634701cb36814d10a4...   
...                                                   ...   
125777  https://38.media.tumblr.com/5c0633e677a97a023c...   
125778  https://38.media.tumblr.com/402a02c59c7c47c300...   
125779  https://38.media.tumblr.com/02fa66bd747ddbed58...   
125780  https://38.media.tumblr.com/01e70784925ab9fe09...   
125781  https://38.media.tumblr.com/51d2172ef413bc3e88...   

                                              description  
49      a man is sitting in a room, he has a gun on hi...  
60               a woman is putting her finger on a badge  
75      two women are next to each other, one of them ..

In [None]:
len(df)

125782

In [None]:
len(df['url'].unique())

102068
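
The gap between the two counts above is the number of rows whose URL already appeared earlier in the file. A quick check of the arithmetic, using only the numbers printed above:

```python
total_rows = 125782   # len(df), from the output above
unique_urls = 102068  # len(df['url'].unique()), from the output above

# Rows that repeat a URL already seen earlier in the file
repeated_rows = total_rows - unique_urls
repeated_share = repeated_rows / total_rows

print(repeated_rows)
print(f"{repeated_share:.1%}")
```

So a noticeable share of the rows are extra descriptions for GIFs that are already in the dataset, which matters later when the same GIF can show up more than once in the search results.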

In [None]:
duplicates = df['url'].value_counts().sort_values(ascending = False)

In [None]:
duplicates.head(5)

https://38.media.tumblr.com/ddbfe51aff57fd8446f49546bc027bd7/tumblr_nowv0v6oWj1uwbrato1_500.gif    4
https://33.media.tumblr.com/46c873a60bb8bd97bdc253b826d1d7a1/tumblr_nh7vnlXEvL1u6fg3no1_500.gif    4
https://38.media.tumblr.com/b544f3c87cbf26462dc267740bb1c842/tumblr_n98uooxl0K1thiyb6o1_250.gif    4
https://33.media.tumblr.com/88235b43b48e9823eeb3e7890f3d46ef/tumblr_nkg5leY4e21sof15vo1_500.gif    4
https://31.media.tumblr.com/69bca8520e1f03b4148dde2ac78469ec/tumblr_npvi0kW4OD1urqm0mo1_400.gif    4
Name: url, dtype: int64

In [None]:
dupe_url = "https://31.media.tumblr.com/69bca8520e1f03b4148dde2ac78469ec/tumblr_npvi0kW4OD1urqm0mo1_400.gif"
dupe_df = df[df['url'] == dupe_url]

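from IPython.display import display

# Let's take a look at this GIF and its duplicated descriptions
for _, gif in dupe_df.iterrows():
    # Call display() explicitly so the image is rendered from inside the loop
    display(HTML(f"<img src='{gif['url']}' style='width:120px; height:90px'>"))
    print(gif["description"])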

man is laying down in a field shooting a gun


a soldier is on the ground shooting a weapon


a soldier is lying on the ground while shots are fired at him.


a soldier is laid down firing a fun.
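
Because one GIF can carry several descriptions, it can help to group the rows by URL before deciding how to treat them (for example, keeping every description as its own search entry, or keeping one row per GIF). The following is a minimal, library-free sketch of that grouping. The duplicated URL and its four descriptions are copied from the output above; the second URL is a hypothetical placeholder, not a row from the dataset.

```python
from collections import defaultdict

dupe_url = "https://31.media.tumblr.com/69bca8520e1f03b4148dde2ac78469ec/tumblr_npvi0kW4OD1urqm0mo1_400.gif"
other_url = "https://example.com/cat.gif"  # hypothetical placeholder URL

rows = [
    (dupe_url, "man is laying down in a field shooting a gun"),
    (other_url, "a cat tries to catch a mouse on a tablet"),
    (dupe_url, "a soldier is on the ground shooting a weapon"),
    (dupe_url, "a soldier is lying on the ground while shots are fired at him."),
    (dupe_url, "a soldier is laid down firing a fun."),
]

# Collect every description under the URL it belongs to
descriptions_by_url = defaultdict(list)
for url, description in rows:
    descriptions_by_url[url].append(description)

for url, descriptions in descriptions_by_url.items():
    print(len(descriptions), url)
```

With the real DataFrame, the same idea would presumably be expressed with a pandas `groupby` on the `url` column; the plain-Python version is used here only to keep the sketch self-contained.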


In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
text_to_embedding = df['description'].tolist()

In [None]:
# Preview a few entries only: printing all 125,782 descriptions exceeds the notebook's IOPub data rate limit
print(text_to_embedding[:5])


In [None]:
corpus_embeddings = model.encode(text_to_embedding, convert_to_tensor=True, show_progress_bar=True)

Batches:   0%|          | 0/3931 [00:00<?, ?it/s]
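
The 3931 batches in the progress bar follow from the corpus size, assuming `encode` runs with its default `batch_size` of 32 (the call above does not override it). A quick check:

```python
import math

num_descriptions = 125782  # len(df), from the earlier output
batch_size = 32            # assumption: default batch_size of SentenceTransformer.encode

# The last batch is only partially filled, hence the ceiling
num_batches = math.ceil(num_descriptions / batch_size)
print(num_batches)
```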

In [None]:
print(corpus_embeddings)

tensor([[-0.0401, -0.0151, -0.0378,  ...,  0.0036, -0.0127,  0.0259],
        [ 0.0185, -0.0249,  0.0773,  ...,  0.1123,  0.0210,  0.0758],
        [-0.0034, -0.0659, -0.0570,  ...,  0.0038, -0.0003,  0.0096],
        ...,
        [ 0.0529, -0.0009,  0.0328,  ...,  0.0446, -0.0500,  0.0013],
        [ 0.0550,  0.0434, -0.0448,  ..., -0.0665, -0.0131,  0.0522],
        [ 0.0101, -0.0595, -0.0088,  ...,  0.0861, -0.0242, -0.0298]])


In [None]:
from sentence_transformers import SentenceTransformer, util
from PIL import Image
import glob
import torch
import pickle
import zipfile
from IPython.display import display
from IPython.display import Image as IPImage
import os
from tqdm.autonotebook import tqdm
torch.set_num_threads(4)

In [None]:
def search(query, k=3):
    query_emb = model.encode([query], convert_to_tensor=True, show_progress_bar=False)
    hits = util.semantic_search(query_emb, corpus_embeddings, top_k=k)[0]

    print("Query:", query)
    for hit in hits:
        img_url = df['url'][hit['corpus_id']]  # Use the URL directly from the DataFrame
        print(img_url)
        display(IPImage(url=img_url, width=200))

In [None]:
search("Soldier")

Query: Soldier
https://38.media.tumblr.com/1ec9c3ce8539f2454988a72f14d9f61c/tumblr_neqejmkXNv1tg7a1io1_400.gif


https://33.media.tumblr.com/538470555df8bc2b1a1425dbbf4daf37/tumblr_nqodf4dQo71uza3xso1_500.gif


https://31.media.tumblr.com/69bca8520e1f03b4148dde2ac78469ec/tumblr_npvi0kW4OD1urqm0mo1_400.gif
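
To make the ranking step less of a black box: as far as the library's defaults go, `util.semantic_search` scores every corpus embedding against the query with cosine similarity and returns the `top_k` best matches. The sketch below reproduces that idea in plain Python on hypothetical 3-dimensional vectors (real embeddings from the model are much longer). The helper names `cosine_similarity` and `top_k` are made up for this illustration and are not part of `sentence_transformers`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus_vecs, k=3):
    """Return the k corpus entries most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(corpus_vecs)]
    scored.sort(reverse=True)
    return [{"corpus_id": idx, "score": score} for score, idx in scored[:k]]

# Hypothetical toy "embeddings", one per description
toy_corpus = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.1, 0.0]

hits = top_k(toy_query, toy_corpus, k=2)
for hit in hits:
    print(hit["corpus_id"], round(hit["score"], 3))
```

Each hit mirrors the shape used in `search` above: `corpus_id` is the row position of the matching description, which is why it can be used to look up the URL in `df`.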


In [None]:
!pip install gradio -q

In [None]:
import gradio as gr
from sentence_transformers import util

# Assuming you have the DataFrame df, the model, and corpus_embeddings

# Function to perform a semantic search
def semantic_search(query, df, model, corpus_embeddings, k=3):
    # Encode the query
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Calculate cosine similarity between the query and document embeddings
    similarities = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]

    # Get the indices of the k most similar documents as a plain Python list,
    # so that pandas can use them for positional indexing
    most_similar_indices = similarities.argsort(descending=True)[:k].tolist()

    # Retrieve the relevant information for the most similar documents
    results = df.iloc[most_similar_indices]

    return results[['url', 'description']]

# Gradio Interface
iface = gr.Interface(
    fn=lambda query: semantic_search(query, df, model, corpus_embeddings),
    inputs=gr.Textbox(),
    outputs=gr.Dataframe(),
    live=True
)

iface.launch()


Thanks for being a Gradio user! If you have questions or feedback, please join our Discord server and chat with us: https://discord.gg/feTf9x3ZSB
Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://5a86907481846abd15.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




In [None]:
import gradio as gr
from sentence_transformers import util

# Assuming you have the DataFrame df, the model, and corpus_embeddings

# Function to perform a semantic search
def semantic_search(query, df, model, corpus_embeddings, k=3):
    # Encode the query
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Calculate cosine similarity between the query and document embeddings
    similarities = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]

    # Get the indices of the most similar documents
    most_similar_indices = similarities.argsort(descending=True)[:k]

    # Retrieve the relevant information for the most similar documents
    results = df.iloc[most_similar_indices]

    return results[['url', 'description']]

# Gradio Interface
gr.Interface(
    fn=lambda query: semantic_search(query, df, model, corpus_embeddings),
    inputs=["text"],
    outputs="text",
    examples=[
        ["Deep Learning"],
        ["Neural Networks"],
        ["Coding stuff"],
        ["More examples"],
        ["Too many examples?"],
        ["NO MORE EXAMPLES PLEASE LORD NO!"],
    ],
    examples_per_page=6,
).launch()





In [None]:
gr.Interface(
    fn=search,
    inputs=["text"],
    outputs="text",
).launch()

ValueError: ignored

In [None]:
import gradio as gr
import requests
from PIL import Image
from io import BytesIO
from sentence_transformers import util

# Assumes 'df', 'model' and 'corpus_embeddings' were created in the cells above

def search(query, k=3):
    query_emb = model.encode([query], convert_to_tensor=True, show_progress_bar=False)
    hits = util.semantic_search(query_emb, corpus_embeddings, top_k=k)[0]

    result_lines = ["Query: " + query]
    images = []
    for hit in hits:
        img_url = df['url'].iloc[hit['corpus_id']]  # corpus_id is the row position in df
        result_lines.append(img_url)

        # Download the image for the gallery; report the error as text if that fails
        try:
            response = requests.get(img_url, timeout=10)
            response.raise_for_status()
            images.append(Image.open(BytesIO(response.content)))
        except Exception as e:
            result_lines.append(f"Error loading image: {e}")

    # The Textbox output takes a single string, the Gallery a list of images
    return "\n".join(result_lines), images

iface = gr.Interface(
    fn=search,
    inputs=gr.Textbox(),
    outputs=[gr.Textbox(label="Search Results"), gr.Gallery(label="Images")]
)
iface.launch()


