### **Retrieval-Augmented Generation (RAG) with GPT-4 on IPL 2024 News Articles**

### Azure OpenAI Configuration

In [None]:
%%capture
!pip3 install openai --upgrade

In [None]:
import os
from openai import AzureOpenAI

In [None]:
from google.colab import userdata
key= userdata.get('OAIKEY')

In [None]:
client = AzureOpenAI(
    api_key=key,
    api_version="2024-02-01",
    azure_endpoint = "https://ragprojectv1.openai.azure.com/"
)

gpt_four = "gpt-four-ai"
emd_deployment_name = "adaembedoai" # embedding model

In [None]:
# Test Connection
prompt = "Tell me a funny joke"

response = client.chat.completions.create(
    model= gpt_four, # model = "deployment_name".
    messages=[
        {"role": "system", "content": "Act as a standup comedian"},
        {"role": "user", "content": prompt}
    ], max_tokens= 25, temperature= 0
)

print(response.choices[0].message.content)

Sure, here's one for you:

Why don't scientists trust atoms?

Because they make up everything!


In [None]:
response = client.embeddings.create(
    input = "Your text string goes here",
    model= emd_deployment_name  # model = "deployment_name".
)

In [None]:
len(response.data[0].embedding)

1536

### Data Ingestion and Processing

In [None]:
%%capture
!pip3 install -qU langchain-community \
  langchain-core \
  pinecone-client \
  langchain-pinecone \
  newspaper3k

In [None]:
# Define the URLs of the articles
url = [ "https://www.financialexpress.com/sports/ipl/kkr-vs-srh-qualifier-1-live-scorecard-ipl-2024-match-71-kolkata-knight-riders-vs-sunrisers-hyderabad-live-score/3495970/",
        "https://www.financialexpress.com/sports/ipl/rr-vs-rcb-live-match-score-ipl-2024-rajasthan-royals-vs-royal-challengers-bengaluru-eliminator-live-match-updates-scorecard/3497628/",
        "https://www.financialexpress.com/sports/ipl/srh-vs-rr-live-score-sunrisers-hyderabad-vs-rajasthan-royals-scorecard-qualifier-2-may-24-ipl-match-today-live-updates/3500393/",
        "https://www.financialexpress.com/sports/ipl/kkr-vs-srh-live-score-ipl-2024-final-match-live-updates-kolkata-knight-riders-vs-sunrisers-hyderabad-ipl-final-may-26-today-scorecard-latest-updates/3501995/"
]

In [None]:
# Import necessary modules
from newspaper import Article
import pandas as pd

# Function to extract article text from a given URL
def extract_article_text(url):
    article = Article(url)
    article.download()
    article.parse()
    return article.text

# Extract text for each article
data = {
    'source': [],
    'text': []
}

# Use a distinct loop variable so the `url` list is not overwritten
for article_url in url:
    text = extract_article_text(article_url)
    data['source'].append(article_url)
    data['text'].append(text)

# Create a DataFrame
df = pd.DataFrame(data)

In [None]:
import re

def clean_ipl_text(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Replace specific characters with a space
    text = re.sub(r"[@#|)'(]", ' ', text)

    # Remove Twitter image links
    text = re.sub(r'pic\.twitter\.com/\w+', '', text)

    # Remove all non-ASCII characters (emojis, but also curly quotes and accented letters)
    text = re.sub(r'[^\x00-\x7F]+', '', text)

    # Replace multiple spaces or newlines with a single space
    text = re.sub(r'\s+', ' ', text)

    # Trim leading and trailing whitespace
    text = text.strip()

    return text

In [None]:
df['text'] = df['text'].apply(clean_ipl_text)
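
As a quick illustration of what the cleaning steps do, the sketch below applies the same three substitutions (symbols, non-ASCII, whitespace) to a made-up sample string. The sample text is an assumption for demonstration only. Note that the non-ASCII step also drops curly apostrophes, which is why words such as "couldnt" appear in the cleaned articles.

```python
import re

# Made-up sample containing a curly apostrophe, an emoji, and a hashtag
sample = "RCB couldn\u2019t accelerate \U0001F3CF  #IPL2024"

step1 = re.sub(r"[@#|)'(]", ' ', sample)      # listed symbols become spaces
step2 = re.sub(r'[^\x00-\x7F]+', '', step1)   # every non-ASCII run is removed
cleaned = re.sub(r'\s+', ' ', step2).strip()  # collapse whitespace and trim

print(cleaned)
```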

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
# Splitting text into chunks

def chunk_text(text, chunk_size=350, chunk_overlap= 15):
    splitter = RecursiveCharacterTextSplitter(
        separators = ["\n\n", "\n", " "],  # custom separators (the default is ["\n\n", "\n", " ", ""])
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_text(text)
    return chunks

df['chunks'] = df['text'].apply(chunk_text)
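
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on the given separators, whereas the hypothetical `sliding_chunks` helper below simply cuts at fixed character offsets.

```python
def sliding_chunks(text, chunk_size=350, chunk_overlap=15):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghij" * 5  # 50 characters
pieces = sliding_chunks(sample, chunk_size=20, chunk_overlap=5)
print([len(p) for p in pieces])
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears partly in both neighbours.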

In [None]:
# Exploded the dataframe for embeddings
#flattened_df = df.explode('chunks')

In [None]:
# Added two new columns for creation of id
# flattened_df['year'] = '2024'
# flattened_df['no'] =  range(1, len(flattened_df)+1)

In [None]:
flattened_df = df.copy()

In [None]:
flattened_df.head(5)

| | source | text | chunks |
|---|---|---|---|
| 0 | https://www.financialexpress.com/sports/ipl/kk... | Kolkata Knight Riders vs Sunrisers Hyderabad Q... | [Kolkata Knight Riders vs Sunrisers Hyderabad ... |
| 1 | https://www.financialexpress.com/sports/ipl/rr... | Rajasthan Royals vs Royal Challengers Bengalur... | [Rajasthan Royals vs Royal Challengers Bengalu... |
| 2 | https://www.financialexpress.com/sports/ipl/sr... | Rajasthan Royals vs Sunrisers Hyderabad Highli... | [Rajasthan Royals vs Sunrisers Hyderabad Highl... |
| 3 | https://www.financialexpress.com/sports/ipl/kk... | Kolkata Knight Riders vs Sunrisers Hyderabad I... | [Kolkata Knight Riders vs Sunrisers Hyderabad ... |


In [None]:
#flattened_df.loc[flattened_df['no'] ==  2]['chunks'][0]

In [None]:
def create_embeddings(text, model=emd_deployment_name):
    # Return the embedding of `text`; if `text` is a list, only the first item's embedding is returned
    embeddings = client.embeddings.create(input = text, model=model).data[0].embedding
    return embeddings

In [None]:
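# Create one embedding per row and store them in a list.
# Note: each row's 'chunks' is a list and create_embeddings keeps only data[0],
# so the stored vector represents the first chunk of each article.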

embeddings = []
for chunk in flattened_df['chunks']:
    embeddings.append(create_embeddings(chunk))

# store the embeddings in the dataframe
flattened_df['embeddings'] = embeddings
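
Because `chunks` holds a list per row, the loop above yields one vector per article rather than one per chunk. If one vector per chunk is wanted, the data could be flattened first. The sketch below shows the idea in plain Python; `build_chunk_records` and `fake_embed` are hypothetical names, and `fake_embed` is only a stand-in so the example runs without an API call (in the notebook, `create_embeddings` would take its place).

```python
def build_chunk_records(articles, embed_fn, year="2024"):
    """Flatten articles into one record per chunk.

    articles: list of dicts with 'source' and 'chunks' (a list of strings).
    embed_fn: callable mapping one chunk string to its embedding vector.
    """
    records = []
    for article in articles:
        for chunk in article["chunks"]:
            records.append({
                "id": f"{year}_{len(records) + 1}",  # running counter, e.g. 2024_1
                "values": embed_fn(chunk),
                "metadata": {"text": chunk, "source": article["source"]},
            })
    return records

# Stand-in embedding function (assumption): returns a 1-dimensional dummy vector.
def fake_embed(text):
    return [float(len(text))]

articles = [
    {"source": "article-a", "chunks": ["first chunk", "second chunk"]},
    {"source": "article-b", "chunks": ["third chunk"]},
]
records = build_chunk_records(articles, fake_embed)
print([r["id"] for r in records])
```

Storing the chunk itself in `metadata['text']`, rather than the full article, also keeps each retrieved context short.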

In [None]:
flattened_df.head()

| | source | text | chunks | embeddings |
|---|---|---|---|---|
| 0 | https://www.financialexpress.com/sports/ipl/kk... | Kolkata Knight Riders vs Sunrisers Hyderabad Q... | [Kolkata Knight Riders vs Sunrisers Hyderabad ... | [0.0097493976354599, -0.022035764530301094, 0.... |
| 1 | https://www.financialexpress.com/sports/ipl/rr... | Rajasthan Royals vs Royal Challengers Bengalur... | [Rajasthan Royals vs Royal Challengers Bengalu... | [0.002805645577609539, -0.002850682707503438, ... |
| 2 | https://www.financialexpress.com/sports/ipl/sr... | Rajasthan Royals vs Sunrisers Hyderabad Highli... | [Rajasthan Royals vs Sunrisers Hyderabad Highl... | [-0.003593101631850004, 0.0017598184058442712,... |
| 3 | https://www.financialexpress.com/sports/ipl/kk... | Kolkata Knight Riders vs Sunrisers Hyderabad I... | [Kolkata Knight Riders vs Sunrisers Hyderabad ... | [-0.0005278954049572349, -0.013996017165482044... |


In [None]:
flattened_df['year'] = '2024'
flattened_df['no'] =  range(1, len(flattened_df)+1)

In [None]:
# Create the id and metadata columns
flattened_df['id'] = flattened_df['year'].astype(str) + '_' + flattened_df['no'].astype(str)
flattened_df['metadata'] = flattened_df.apply(lambda x: { 'text': x['text'],'source': x['source']}, axis=1)

### Pinecone Index Configuration

In [None]:
PC_KEY = userdata.get('PC_TOKEN')

In [None]:
from pinecone import Pinecone
from pinecone import ServerlessSpec

In [None]:
# Configure the Pinecone vector database client
pc = Pinecone(api_key=PC_KEY)

# Config Pinecone ServerlessSpec
cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

# Name of the index to create in Pinecone

index_name = 'ipl-rag-2024'

In [None]:
# Check whether the index already exists (it should not on the first run)
if index_name not in pc.list_indexes().names():
    # If it does not exist, create the index
    pc.create_index(
        index_name,
        dimension=len(response.data[0].embedding),
        metric='cosine',
        spec=spec
    )

# connect to index
index = pc.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 4}},
 'total_vector_count': 4}

In [None]:
# Upserting vectors and metadata in index
for _, row in flattened_df.iterrows():
    record = {
        "id": row["id"],
        "values": row["embeddings"],
        "metadata": row["metadata"]
    }
    index.upsert(vectors=[record])
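
With only four records, upserting one record per call is fine. For larger datasets it is usually cheaper to send several records per request. The sketch below shows a hypothetical `batched` helper; the batch size of 100 is an arbitrary assumption, and the `index.upsert` call is left as a comment so the example runs without a live index.

```python
def batched(records, batch_size=100):
    """Yield successive slices of `records` with at most `batch_size` items."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Dummy records (assumption) in the same id/values/metadata shape used above
records = [{"id": f"2024_{i}", "values": [0.0], "metadata": {}} for i in range(1, 8)]
batches = list(batched(records, batch_size=3))
print([len(b) for b in batches])

# With a live index, each batch would be sent in a single call:
# for batch in batched(records):
#     index.upsert(vectors=batch)
```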

In [None]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 4}},
 'total_vector_count': 4}

### Retrieval Of Relevant Documents

In [None]:
query = "Who won RCB vs RR IPL match?"

query_vectors = create_embeddings(query)

In [None]:
# Retrieve the 3 closest vectors along with their metadata
result = index.query(vector=query_vectors, top_k= 3, include_metadata=True)
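
The index was created with `metric='cosine'`, so matches are ranked by cosine similarity between the query vector and the stored vectors. The sketch below illustrates that ranking on toy 2-dimensional vectors. It is only meant to show the metric: Pinecone performs the search server-side, and the `cosine_similarity` and `top_k` helpers and the toy vectors are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, vectors, k=3):
    """Return (id, score) pairs for the k vectors most similar to query_vec."""
    scored = [(vec_id, cosine_similarity(query_vec, vec)) for vec_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

toy_vectors = {
    "2024_1": [1.0, 0.0],
    "2024_2": [0.6, 0.8],
    "2024_3": [0.0, 1.0],
}
ranking = top_k([0.0, 2.0], toy_vectors, k=2)
print(ranking)
```

The query vector has length 2 while the stored vectors have length 1, yet `2024_3` still ranks first: cosine similarity depends only on direction, not magnitude.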

In [None]:
result

{'matches': [{'id': '2024_2',
              'metadata': {'source': 'https://www.financialexpress.com/sports/ipl/rr-vs-rcb-live-match-score-ipl-2024-rajasthan-royals-vs-royal-challengers-bengaluru-eliminator-live-match-updates-scorecard/3497628/',
                           'text': 'Rajasthan Royals vs Royal Challengers '
                                   'Bengaluru Highlights, IPL 2024 Eliminator: '
                                   'Rajasthan Royals have been the best '
                                   'bowling side in this edition of Indian '
                                   'Premier League and tonight, they are '
                                   'living up to that image. Winning toss and '
                                   'electing to field first, RR bowlers '
                                   'maintained a chokehold on the RCB batters '
                                   'from the word go. Though Virat Kohli '
                                   'achieved that magnificent

In [None]:
# get list of retrieved text
contexts = [item['metadata']['text'] for item in result['matches']]

relevant_docs = "\n\n---\n\n".join(contexts)+"\n\n-----\n\n"+query

In [None]:
print(relevant_docs)

Rajasthan Royals vs Royal Challengers Bengaluru Highlights, IPL 2024 Eliminator: Rajasthan Royals have been the best bowling side in this edition of Indian Premier League and tonight, they are living up to that image. Winning toss and electing to field first, RR bowlers maintained a chokehold on the RCB batters from the word go. Though Virat Kohli achieved that magnificent milestone of becoming first-ever player to get 8,000 runs in the history of IPL, RCB were not able to get as many runs on the board. Both Kohli and Faf Du Plessis fell cheaply and Rajat Patidar and Cameron Green couldnt give the much-needed acceleration. Not a single player managed to score a fifty showing the sheer dominance of RR bowlers. Ravichandran Ashwin picked up two wickets in an over whereas Trent Boult and Yuzvendra Chahal picked up one wicket each. Avesh Khan emerged as the most successful bowler as he picked up three wickets. But he conceded 44 runs in his spell. Rajat Patidar 34 off 22 was the top-scorer

### Ask a Question to Get an Answer

In [None]:
def generate_answer(user_input):

    # Convert the question to a query vector
    query_vector = create_embeddings(user_input)

    # get relevant contexts to answer question
    result = index.query(vector=query_vector, top_k= 3, include_metadata=True)

    # get list of retrieved text

    context_data = [item['metadata']['text'] for item in result['matches']]

    context = "\n\n---\n\n".join(context_data)+"\n\n-----\n\n"+user_input


    # create a message object
    messages=[
        {"role": "system", "content": "You are an AI assistant that answers based on the given context. Please don't answer if you don't know."},
        {"role": "user", "content": context}
    ]

    # use chat completion to generate a response
    response = client.chat.completions.create(
        model=gpt_four,
        temperature=0,
        max_tokens=50,
        messages=messages
    )

    return response.choices[0].message.content
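
`generate_answer` joins the retrieved passages and the question into a single user message, separated only by dashes. One possible variant, sketched below, labels the two parts explicitly so the model can tell context from question. The `build_messages` helper, its wording, and the placeholder passages are assumptions, not part of the notebook.

```python
def build_messages(contexts, question):
    """Assemble chat messages with the context and the question clearly labelled."""
    context_block = "\n\n---\n\n".join(contexts)
    user_content = f"Context:\n{context_block}\n\nQuestion: {question}"
    return [
        {"role": "system",
         "content": "Answer only from the given context. "
                    "If the context does not contain the answer, say you don't know."},
        {"role": "user", "content": user_content},
    ]

# Placeholder passages stand in for the texts returned by the index query
messages = build_messages(
    ["first retrieved passage", "second retrieved passage"],
    "Who won the match?",
)
print(messages[1]["content"])
```

The returned list can be passed directly as the `messages` argument of `client.chat.completions.create`.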

In [None]:
user_input = "Who won the IPL 2024 title and who was the man of the match?"

generate_answer(user_input)