# Combined Retrieval (Text and Image) System Documentation

## Author: 
ShrugalTayal (shrugal20408@iiitd.ac.in)

## Introduction
This Jupyter Notebook explores combined retrieval using both text and image data. The main tasks are:

1. Composite Similarity Score Calculation: for each (image, review) pair, compute the average of the image similarity score and the text similarity score.
2. Pair Ranking Based on Composite Similarity Score: rank the pairs by their composite similarity scores, highest first.

Combining image and text signals is intended to improve retrieval accuracy over using either one alone. The sections below walk through the implementation and its results.
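The two tasks can be sketched in a few lines of plain Python. This is a minimal illustration rather than the notebook's own implementation: the variable names are hypothetical, and the scores are the three rows shown in the result table at the end of this notebook.

```python
# Image and text cosine similarities for three candidate pairs
# (values copied from the notebook's final result table).
image_scores = [1.0, 0.672529, 0.743428]
text_scores = [1.0, 0.31766, 0.200606]

# Task 1: composite score = arithmetic mean of the two similarities per pair.
composite = [(i + t) / 2 for i, t in zip(image_scores, text_scores)]

# Task 2: rank pair positions from highest to lowest composite score.
ranking = sorted(range(len(composite)), key=lambda k: composite[k], reverse=True)

for rank, k in enumerate(ranking, start=1):
    print(f"rank {rank}: pair {k}, composite = {composite[k]:.6f}")
```

Because the mean weights both modalities equally, a pair with a high image score but a low text score can still rank below a pair that is moderately similar on both.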

In [2]:
import pickle
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import os
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Data Loading
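- Load the precomputed image features and preprocessed reviews from pickle files, and the raw data (image URLs and text reviews) from the CSV file.
- Build the internal index that relates each image URL to its review, using the shared row position of the `Image` and `Review Text` columns.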
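One way to express the URL-to-review relation is a lookup dictionary keyed by URL. The sketch below uses made-up URLs and reviews, and `url_to_row` and `review_for` are hypothetical names; it is not the notebook's own indexing code, which looks rows up with `list.index()`.

```python
# Hypothetical data; in the notebook these come from the CSV's
# 'Image' and 'Review Text' columns, which share row positions.
image_urls = [
    "https://example.com/a.jpg",
    "https://example.com/b.jpg",
    "https://example.com/a.jpg",  # duplicate URL on a later row
]
reviews = ["first review", "second review", "third review"]

# Map each URL to the first row where it appears.
# setdefault keeps the earliest row, matching what list.index() returns.
url_to_row = {}
for row, url in enumerate(image_urls):
    url_to_row.setdefault(url, row)

def review_for(url):
    """Return the review stored on the same row as the given image URL."""
    return reviews[url_to_row[url]]

print(review_for("https://example.com/b.jpg"))
```

A dictionary lookup takes constant time per URL, whereas `list.index()` scans the list on every call.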

In [3]:
image_features_path = r'C:\Users\HP\Documents\Shrugal IIITD\Semester 8\Information Retrieval\CSE508_Winter2024_A2_2020408\dumps\image_feature_extraction_dumps\extracted_features.pkl'
text_reviews_path = r'C:\Users\HP\Documents\Shrugal IIITD\Semester 8\Information Retrieval\CSE508_Winter2024_A2_2020408\dumps\text_feature_extraction_dumps\preprocessed_text.pkl'

# Load extracted image features and reviews from pickle files
with open(image_features_path, 'rb') as f:
    image_features_dict = pickle.load(f)

with open(text_reviews_path, 'rb') as f:
    reviews = pickle.load(f)

In [4]:
# Read data from CSV file
data = pd.read_csv(r'C:\Users\HP\Documents\Shrugal IIITD\Semester 8\Information Retrieval\CSE508_Winter2024_A2_2020408\res\A2_Data.csv')

# Extract the 'Image' column
image_column = data['Image']

# Extract image URLs from the 'Image' column
image_urls = image_column.str.extract(r'(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)', expand=False)

# Drop the last two characters of each match (trailing characters captured by the regex)
image_urls = image_urls.str[:-2]

# Display the cleaned image URLs
for index, url in enumerate(image_urls):
    print(f"{index}: {url}")

image_urls = image_urls.to_list()

0: https://images-na.ssl-images-amazon.com/images/I/81q5+IxFVUL._SY88.jpg
1: https://images-na.ssl-images-amazon.com/images/I/71HSx4Y-5dL._SY88.jpg
2: https://images-na.ssl-images-amazon.com/images/I/71Md5ihUFLL._SY88.jpg
3: https://images-na.ssl-images-amazon.com/images/I/71Isri9SEaL._SY88.jpg
4: https://images-na.ssl-images-amazon.com/images/I/71w8aOdrTuL._SY88.jpg
5: https://images-na.ssl-images-amazon.com/images/I/81dxkALs4CL._SY88.jpg
6: https://images-na.ssl-images-amazon.com/images/I/71cS64LddWL._SY88.jpg
7: https://images-na.ssl-images-amazon.com/images/I/71z9biVe+ML._SY88.jpg
8: https://images-na.ssl-images-amazon.com/images/I/51p2V+jw7AL._SY88.jpg
9: https://images-na.ssl-images-amazon.com/images/I/71raKRNMOPL._SY88.jpg
10: https://images-na.ssl-images-amazon.com/images/I/61kCyfAeq-L._SY88.jpg
11: https://images-na.ssl-images-amazon.com/images/I/614m5CUST7L._SY88.jpg
12: https://images-na.ssl-images-amazon.com/images/I/71rPvq9ZAPL._SY88.jpg
13: https://images-na.ssl-images-am

In [5]:
# Handle missing values
data['Review Text'] = data['Review Text'].fillna('')  # Replace NaN with empty string

# Extract text reviews from the data
text_reviews = data['Review Text'].tolist()

# Preprocessing techniques
def preprocess_text(text):
    # Lowercasing
    text = text.lower()
    # Tokenization
    tokens = nltk.word_tokenize(text)
    # Removing punctuation
    tokens = [re.sub(r'[^\w\s]', '', token) for token in tokens]
    # Stopword removal
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(tokens)

# Apply preprocessing to text reviews
preprocessed_reviews = [preprocess_text(review) for review in text_reviews]
print('preprocessed_reviews', preprocessed_reviews)

preprocessed_reviews ['love vintag spring vintag strat  good tension great stabil  float bridg want spring way go ', 'work great guitar bench mat  rug enough abus take care  take care  make organ workspac much easier screw wo nt roll around  color good ', 'use everyth acoust bass ukulel  know smaller model avail uke  violin  etc   nt yet order  work smaller instrument one nt extend foot maximum width  gentl instrument  grippi materi keep secur  greatest benefit write music comput need set guitar use keyboardmous  easier hang stand  sever gave one friend christma well  use mine stage  fold small enough fit right gig bag ', 'great price good qualiti  nt quit match radiu sound hole close enough ', 'bought bass split time primari bass dean edg  might win  bass boost outstand  activ pickup realli allow adjust sound want  recommend anyon  beginn like long ago  excel bass start  tour andor music make money  bass beati stage  color bit darker pictur   around  great buy ', 'toy side instrument 

In [6]:
def Cosine_similarity(vector1, vector2):
    dot_product = np.dot(vector1, vector2)
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)
    similarity = dot_product / (norm_vector1 * norm_vector2)
    return similarity

## Custom Inputs

In [7]:
print('Input:')
print('Image and Text Query Input :')
input_image_url = input('Image: \n')
input_image_features = image_features_dict[input_image_url] # https://images-na.ssl-images-amazon.com/images/I/81q5+IxFVUL._SY88.jpg
input_review = input('Review: \n') # Loving these vintage springs on my vintage strat. They have a good tension and great stability. If you are floating your bridge and want the most out of your springs than these are the way to go.
input_review = preprocess_text(input_review)
print('Image: \n' + input_image_url)
print('Review: ' + input_review)

Input:
Image and Text Query Input :
Image: 
https://images-na.ssl-images-amazon.com/images/I/81q5+IxFVUL._SY88.jpg
Review: love vintag spring vintag strat  good tension great stabil  float bridg want spring way go 


## Function: calculate_cosine_similarity_features

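This function computes the cosine similarity between the input image features and every feature vector in `image_features_dict`, and returns the scores as a dictionary keyed by image URL. The `top_k` parameter is accepted but not used; scores for all images are returned.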
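For illustration, the sketch below shows cosine similarity with a zero-norm guard, plus one way a `top_k` argument could be honoured. It is written in plain Python to stay dependency-free; `cosine`, `top_k_similar`, and the toy feature vectors are hypothetical and are not part of the notebook.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_similar(query, features_by_url, k=3):
    """Return the k (url, score) pairs most similar to query, best first."""
    scores = {url: cosine(query, vec) for url, vec in features_by_url.items()}
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Hypothetical 2-D vectors standing in for real image features.
features = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0], "img_c": [1.0, 1.0]}
best = top_k_similar([1.0, 0.0], features, k=2)
print(best)
```

The notebook needs scores for every image so they can be averaged with the text scores, which may be why its function returns the full dictionary rather than only the top entries.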

In [8]:
def calculate_cosine_similarity_features(input_image_features, top_k=3):
    similarity_scores = {}
    for image_url, features in image_features_dict.items():
        similarity_scores[image_url] = cosine_similarity(input_image_features.reshape(1, -1), features.reshape(1, -1))[0][0]
    return similarity_scores


image_retrieval_results = calculate_cosine_similarity_features(input_image_features)
image_retrieval_keys = list(image_retrieval_results.keys())
cosine_similarity_features = list(image_retrieval_results.values())

## Function: calculate_cosine_similarity_reviews

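This function computes the TF-IDF cosine similarity between the preprocessed input text and each review in a list. Before it is called, the preprocessed reviews are reordered to follow the key order of `image_retrieval_results`, so that the image and text scores line up pair by pair.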
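A toy run makes the TF-IDF step concrete. The query and two-document corpus below are made up for illustration, and the snippet assumes scikit-learn is installed, as it is for the rest of this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical, already-preprocessed query and corpus.
query = "love vintag spring"
corpus = ["love vintag spring", "great bass guitar"]

# Fit on query + corpus together so all rows share one vocabulary and IDF weights.
tfidf_matrix = TfidfVectorizer().fit_transform([query] + corpus)

# Row 0 is the query; the remaining rows are the corpus documents.
scores = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:])[0]
print(scores)
```

A review identical to the query scores approximately 1, and a review sharing no terms with it scores 0. Note that fitting the vectorizer on the query together with the corpus lets the query influence the IDF weights; fitting on the corpus alone and then transforming the query is a common alternative.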

In [9]:
index_mapped_reviews = []
for key in image_retrieval_results:
    index = image_urls.index(key)
    index_mapped_reviews.append(preprocessed_reviews[index])

def calculate_cosine_similarity_reviews(input_text, text_list):
    # Initialize TF-IDF vectorizer
    tfidf_vectorizer = TfidfVectorizer()

    # Fit and transform the input text and text list
    tfidf_matrix = tfidf_vectorizer.fit_transform([input_text] + text_list)

    # Calculate cosine similarity between input text and text list
    similarity_scores = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:])

    # Return cosine similarity scores
    return similarity_scores[0]

cosine_similarity_reviews = calculate_cosine_similarity_reviews(input_review, index_mapped_reviews)

## Composite Cosine Similarity Score Calculation and Ranking

In [10]:
# Convert lists to NumPy arrays
arr_cosine_similarity_features = np.array(cosine_similarity_features)
arr_cosine_similarity_reviews = np.array(cosine_similarity_reviews)

# Sum corresponding elements of the arrays
sum_array = arr_cosine_similarity_features + arr_cosine_similarity_reviews

# Divide each element of the array by 2
result_array = sum_array / 2

# Convert the result back to a list if needed
result_array = result_array.tolist()

In [11]:
# Create DataFrame
df = pd.DataFrame({'Image URL': image_retrieval_keys, 'Text Review': index_mapped_reviews, 'Image Cosine Similarity': cosine_similarity_features, 'Text Cosine Similarity': cosine_similarity_reviews, 'Composite Cosine Similarity': result_array})

# Sort DataFrame by 'Composite Cosine Similarity' column in descending order
df_sorted = df.sort_values(by='Composite Cosine Similarity', ascending=False)

# Display DataFrame
df_sorted.head(3)

| | Image URL | Text Review | Image Cosine Similarity | Text Cosine Similarity | Composite Cosine Similarity |
|---|---|---|---|---|---|
| 0 | https://images-na.ssl-images-amazon.com/images... | love vintag spring vintag strat good tension ... | 1.0 | 1.0 | 1.0 |
| 268 | https://images-na.ssl-images-amazon.com/images... | nice solid spring defeinit silent easi instal... | 0.672529 | 0.31766 | 0.495095 |
| 738 | https://images-na.ssl-images-amazon.com/images... | great qualiti adjust tension well made | 0.743428 | 0.200606 | 0.472017 |
