# [**Dataset**](https://www.kaggle.com/competitions/learn-ai-bbc)

# **Prerequisites**

*   [Image Classification with Vector Semantic Search using Pinecone metadata filters](https://youtu.be/85czhoo14NE?si=xqh3Xby_R7w19uIp)
*   [Real-Time Image Clustering using Amazon Titan Multimodal Embedding](https://youtu.be/uV3Wfd3FbaI?si=RlIbB7JoxJc9H7wV)
*   [Accelerating Data Processing for Gen AI Applications in Python with Pandarallel](https://youtu.be/YhEHnA323rU?si=TyeattT7-uIhCqcw)

In [None]:
!unzip /content/learn-ai-bbc.zip

Archive:  /content/learn-ai-bbc.zip
  inflating: BBC News Sample Solution.csv  
  inflating: BBC News Test.csv       
  inflating: BBC News Train.csv      


In [None]:
!pip install sentence-transformers pandarallel psycopg


In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
import psycopg



In [None]:
dbhost="{DB Host Name}"
dbuser="{DB User Name}"
dbpass="{DB Password}"
dbport=5432

In [None]:
dbconn = psycopg.connect(host=dbhost, user=dbuser, password=dbpass, port=dbport, connect_timeout=10, autocommit=True,dbname='{Database Name}')

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')


In [None]:
df = pd.read_csv('/content/BBC News Train.csv')

In [None]:
df.head()

Unnamed: 0,ArticleId,Text,Category
0,1833,worldcom ex-boss launches defence lawyers defe...,business
1,154,german business confidence slides german busin...,business
2,1101,bbc poll indicates economic gloom citizens in ...,business
3,1976,lifestyle governs mobile choice faster bett...,tech
4,917,enron bosses in $168m payout eighteen former e...,business


In [None]:
def generate_embeddings(query):
  embeddings = model.encode(query)
  return embeddings

In [None]:
ms=generate_embeddings("Hello There")
ms
type(ms)
len(ms)

384

In [None]:
# Set up pandarallel so embeddings for all the articles can be generated in parallel - approx 3 min to complete

from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=8)

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


In [None]:
# Generate embeddings for every training article
df['Text_Embedding'] = df['Text'].parallel_apply(generate_embeddings)



In [None]:
df.head()

Unnamed: 0,ArticleId,Text,Category,Text_Embedding
0,1833,worldcom ex-boss launches defence lawyers defe...,business,"[-0.06418245, 0.031060327, -0.00069747923, -0...."
1,154,german business confidence slides german busin...,business,"[-0.049361926, -0.002379681, 0.035682473, 0.09..."
2,1101,bbc poll indicates economic gloom citizens in ...,business,"[0.025284942, -0.031954322, 0.03909955, 0.0636..."
3,1976,lifestyle governs mobile choice faster bett...,tech,"[-0.000986551, 0.042648092, 0.08374498, -0.061..."
4,917,enron bosses in $168m payout eighteen former e...,business,"[-0.042055998, -0.00088698365, 0.052950922, -0..."


In [None]:
# Convert each NumPy array to a plain Python list before inserting into PostgreSQL
df['Text_Embedding'] = df['Text_Embedding'].apply(lambda x: x.tolist())

**DB Queries:**
```
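-- Enable pgvector and confirm the vector type is registered
CREATE EXTENSION IF NOT EXISTS vector;
SELECT typname FROM pg_type WHERE typname = 'vector';
COMMIT;

-- 384 matches the embedding size of all-MiniLM-L6-v2
CREATE TABLE IF NOT EXISTS bbc_news (
    ArticleId       text,
    text            text,
    Category        text,
    text_embeddings vector(384)
);

SELECT * FROM bbc_news;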

```
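
The `vector(384)` column only accepts vectors with exactly 384 dimensions, which is the size reported by `len(ms)` for `all-MiniLM-L6-v2`. pgvector also accepts a vector as a text literal of the form `[x1,x2,...]`. The following is a minimal sketch of a helper that checks the dimension and builds that literal; `to_vector_literal` and `EMBEDDING_DIM` are hypothetical names and are not part of the notebook.

```python
EMBEDDING_DIM = 384  # assumption: must match vector(384) in the CREATE TABLE statement


def to_vector_literal(embedding, dim=EMBEDDING_DIM):
    """Format a sequence of numbers as a pgvector text literal, e.g. '[0.5,-1.0]'."""
    values = [float(v) for v in embedding]
    if len(values) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(values)}")
    return "[" + ",".join(repr(v) for v in values) + "]"


print(to_vector_literal([0.5, -1.0, 2.0], dim=3))  # [0.5,-1.0,2.0]
```

Failing early in Python gives a clearer error than letting the database reject a row with the wrong number of dimensions.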



In [None]:
# Insert one row per training article (DataFrame columns: ArticleId, Text, Category, Text_Embedding)
for _, x in df.iterrows():
    dbconn.execute("""INSERT INTO bbc_news (articleid, text, category, text_embeddings) VALUES (%s, %s, %s, %s);""",
                   (x.get('ArticleId'), x.get('Text'), x.get('Category'), x.get('Text_Embedding')))
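
Calling `execute` once per row sends each article to the database separately. One possible alternative is to send all rows through `cursor.executemany`, which psycopg 3 provides. The sketch below is illustrative only: `build_rows` and `insert_articles` are hypothetical helper names, and it assumes `records` is a list of dicts such as `df.to_dict("records")`.

```python
INSERT_SQL = """
    INSERT INTO bbc_news (articleid, text, category, text_embeddings)
    VALUES (%s, %s, %s, %s::vector);
"""


def build_rows(records):
    """Turn dict-like records into parameter tuples for INSERT_SQL."""
    return [
        (
            str(r["ArticleId"]),  # the articleid column is declared as text
            r["Text"],
            r["Category"],
            # '[0.1, 0.2, ...]' text form, cast to vector by the ::vector in the SQL
            str([float(v) for v in r["Text_Embedding"]]),
        )
        for r in records
    ]


def insert_articles(conn, records):
    """Insert all records with one executemany call; returns the number of rows sent."""
    rows = build_rows(records)
    with conn.cursor() as cur:
        cur.executemany(INSERT_SQL, rows)
    return len(rows)


# Usage with the notebook's objects (assumption: dbconn and df exist as defined above):
# insert_articles(dbconn, df.to_dict("records"))
```

Passing the embedding as text with an explicit `::vector` cast means the helper does not depend on how the driver adapts Python lists.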

# **Testing Data**

In [None]:
df_testing = pd.read_csv('/content/BBC News Test.csv')

In [None]:
df_testing.head()

Unnamed: 0,ArticleId,Text
0,1018,qpr keeper day heads for preston queens park r...
1,1319,software watching while you work software that...
2,1138,d arcy injury adds to ireland woe gordon d arc...
3,459,india s reliance family feud heats up the ongo...
4,1020,boro suffer morrison injury blow middlesbrough...


In [None]:
# Generate embeddings for every test article
df_testing['Text_Embedding'] = df_testing['Text'].parallel_apply(generate_embeddings)



In [None]:
df_testing['Text_Embedding'] = df_testing['Text_Embedding'].apply(lambda x: x.tolist())

In [None]:
df_testing.head()

Unnamed: 0,ArticleId,Text,Text_Embedding
0,1018,qpr keeper day heads for preston queens park r...,"[-0.018894726410508156, 0.001542752841487527, ..."
1,1319,software watching while you work software that...,"[-0.07700029760599136, 0.01986844092607498, -0..."
2,1138,d arcy injury adds to ireland woe gordon d arc...,"[-0.07531845569610596, 0.00927541870623827, 0...."
3,459,india s reliance family feud heats up the ongo...,"[-0.10554075986146927, 0.0013207214651629329, ..."
4,1020,boro suffer morrison injury blow middlesbrough...,"[0.08324451744556427, -0.0036723732482641935, ..."


In [None]:
# Query vector: the embedding of the first training article (row 0 of df)
query_vector = str(df['Text_Embedding'][0])
result = dbconn.execute("""SELECT category
                           FROM bbc_news
                           ORDER BY text_embeddings <=> %s::vector
                           LIMIT 1""", (query_vector,)).fetchall()[0][0]

In [None]:
result

'business'
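
pgvector's `<=>` operator is cosine distance, that is, one minus cosine similarity, so ordering by it and taking one row returns the category of the single closest training article. The following is a minimal pure-Python sketch of the same nearest-neighbour lookup. `cosine_distance` and `nearest_category` are illustrative names, and the two-dimensional toy rows are made up for the example.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's <=> operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_category(query, rows):
    """Return the category of the row whose embedding is closest to the query."""
    category, _ = min(rows, key=lambda row: cosine_distance(query, row[1]))
    return category


# Toy stand-ins for (category, text_embeddings) rows in bbc_news
toy_rows = [
    ("business", [1.0, 0.0]),
    ("sport", [0.0, 1.0]),
]
print(nearest_category([0.9, 0.1], toy_rows))  # business
```

Because the query vector in the cell above is itself a stored training embedding, its nearest neighbour is the same article, which is why the result matches that article's label.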

In [None]:
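def similarity_search(embedding):
  # Return the category of the stored article closest (by cosine distance) to the given embedding
  query = """SELECT category
             FROM bbc_news
             ORDER BY text_embeddings <=> %s::vector LIMIT 1;"""
  r = dbconn.execute(query, (str(embedding),)).fetchall()
  return r[0][0]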

In [None]:
df_testing['classifier'] = df_testing['Text_Embedding'].apply(similarity_search)

In [None]:
df_testing.head()

Unnamed: 0,ArticleId,Text,Text_Embedding,classifier
0,1018,qpr keeper day heads for preston queens park r...,"[-0.018894726410508156, 0.001542752841487527, ...",sport
1,1319,software watching while you work software that...,"[-0.07700029760599136, 0.01986844092607498, -0...",tech
2,1138,d arcy injury adds to ireland woe gordon d arc...,"[-0.07531845569610596, 0.00927541870623827, 0....",sport
3,459,india s reliance family feud heats up the ongo...,"[-0.10554075986146927, 0.0013207214651629329, ...",business
4,1020,boro suffer morrison injury blow middlesbrough...,"[0.08324451744556427, -0.0036723732482641935, ...",sport


In [None]:
output=df_testing[['ArticleId','classifier']].rename(columns={'classifier':'Category'})

In [None]:
output.head()

Unnamed: 0,ArticleId,Category
0,1018,sport
1,1319,tech
2,1138,sport
3,459,business
4,1020,sport


In [None]:
output.to_csv('output.csv',index=False)
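
Before uploading, it can be worth checking that the file has the two columns the notebook writes, `ArticleId` and `Category`, and that no prediction is empty. The following is a small sketch using only the standard library. `check_submission` is a hypothetical helper, and the expected columns are taken from the `rename` step above rather than from the competition rules.

```python
import csv


def check_submission(path, expected_columns=("ArticleId", "Category")):
    """Return the number of data rows, raising ValueError on a malformed file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if tuple(reader.fieldnames or ()) != tuple(expected_columns):
            raise ValueError(f"unexpected header: {reader.fieldnames}")
        count = 0
        for row in reader:
            if not row["ArticleId"] or not row["Category"]:
                raise ValueError(f"empty value in data row {count + 1}")
            count += 1
    return count


# Usage (assumption: output.csv was written by the cell above):
# print(check_submission("output.csv"))
```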