In [None]:
from google.colab import drive
drive.mount('/content/drive')

1. Image Feature Extraction [25]
a. Use basic image pre-processing techniques such as altering contrast, resizing,
geometrical orientation, random flips, brightness and exposure, or any other
relevant operation.
b. Use a pre-trained Convolutional Neural Network architecture such as ResNet,
VGG16, Inception-v3, or MobileNet (or any other CNN, preferably pre-trained on
the ImageNet dataset) to extract relevant features from the images in the given
training set. Choose only one of the networks for your final pipeline.
c. Normalize the extracted features.
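
The operations listed in (a) can be sketched with plain NumPy before an image is handed to the network. This is only an illustration: the helper names `adjust_brightness`, `adjust_contrast`, and `random_flip` are hypothetical, and images are assumed to be `uint8` arrays of shape (height, width, 3).

```python
import numpy as np

def adjust_brightness(img, delta):
    """Add a constant offset to every pixel, clipping to the 0-255 range."""
    return np.clip(img.astype(np.float32) + delta, 0, 255).astype(np.uint8)

def adjust_contrast(img, factor):
    """Scale each pixel's distance from the mean intensity by `factor`."""
    pixels = img.astype(np.float32)
    mean = pixels.mean()
    return np.clip((pixels - mean) * factor + mean, 0, 255).astype(np.uint8)

def random_flip(img, rng):
    """Flip the image left-to-right with probability 0.5."""
    return img[:, ::-1, :] if rng.random() < 0.5 else img

# Tiny 2x2 RGB stand-in image (assumed data, not from the dataset).
rng = np.random.default_rng(0)
demo = np.arange(2 * 2 * 3, dtype=np.uint8).reshape(2, 2, 3) * 20
augmented = random_flip(adjust_contrast(adjust_brightness(demo, 30), 1.2), rng)
print(augmented.shape)
```

Such augmentations are usually applied to training images only; for retrieval, the stored features would typically come from the unaugmented image.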

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing import image
import requests
from io import BytesIO
from PIL import UnidentifiedImageError

# Step 1: Load pre-trained ResNet50 model
model = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Step 2: Load and preprocess images
def load_and_preprocess_image(image_path):
    if image_path.startswith('http'):
        response = requests.get(image_path)
        img = image.load_img(BytesIO(response.content), target_size=(224, 224))
    else:
        img = image.load_img(image_path, target_size=(224, 224))
    img = image.img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = preprocess_input(img)
    return img

# Step 3: Extract features from images
def extract_features(image_path):
    img = load_and_preprocess_image(image_path)
    features = model.predict(img)
    return features.flatten()

# Step 4: Normalize extracted features
def normalize_features(features):
    normalized_features = features / np.linalg.norm(features)
    return normalized_features

# Step 5: Read CSV file
data = pd.read_csv('/content/drive/MyDrive/A2_Data.csv')

# Step 6: Extract features for all images
image_features = []
error_count = 0
for index, row in data.iterrows():
    # Strip the list brackets and quotes around the URL. A cell that lists
    # several URLs is not split, so it still fails to load below.
    image_path = row['Image'].strip("[]'")
    try:
        features = extract_features(image_path)
        normalized_features = normalize_features(features)
        image_features.append(normalized_features)
    except Exception as e:
        # Covers UnidentifiedImageError as well as download failures.
        # Skipped rows are not recorded, so row i of image_features no longer
        # lines up with row i of data once an error has occurred.
        print(f"Error processing image at path: {image_path}")
        print(f"Error message: {str(e)}")
        error_count += 1

# Convert list to numpy array
image_features = np.array(image_features)

print("Shape of extracted features:", image_features.shape)
print("Number of images processed:", len(data) - error_count)
# Write extracted features to a CSV file
features_df = pd.DataFrame(image_features)
features_df.to_csv('/content/drive/MyDrive/extracted_features.csv', index=False)



Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
Error processing image at path: https://images-na.ssl-images-amazon.com/images/I/71F3npeHUDL._SY88.jpg', 'https://images-na.ssl-images-amazon.com/images/I/71wHUWncMGL._SY88.jpg
Error message: cannot identify image file <_io.BytesIO object at 0x7fd3ac70be20>
Error processing image at path: https://images-na.ssl-images-amazon.com/images/I/71B8OOE5N8L._SY88.jpg', 'https://images-na.ssl-images-amazon.com/images/I/81SX3oAWbNL._SY88.jpg
Error message: cannot identify image file <_io.BytesIO object at 0x7fd3ac8228e0>
Error processing image at path: https://images-na.ssl-images-amazon.com/images/I/71gUNOo9ulL._SY88.png', 'https://images-na.ssl-images-amazon.com/images/I/41DLH9EbDjL._SY88.jpg', 'https://images-na.ssl-images-amazon.com/images/I/41DLH9EbDjL._SY88.jpg
Error message: cannot identify image file <_io.BytesIO object at 0x7fd3ac57bd80>
Error pro
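
Each failing path in the log above contains several URLs joined by `', '`, which suggests that the affected `Image` cells hold a list with more than one entry; `strip("[]'")` only trims the ends of the string and leaves the inner separators in place. A minimal sketch of parsing such a cell follows, assuming every cell is the string form of a Python list. The helper name `parse_image_urls` and the example URLs are hypothetical.

```python
import ast

def parse_image_urls(cell):
    """Turn a cell such as "['u1', 'u2']" into a list of URL strings."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Not a Python literal: treat the cell as a single bare URL.
        return [cell.strip("[]'\" ")]
    if isinstance(value, str):
        return [value]
    return [str(item) for item in value]

cell = "['https://example.com/a.jpg', 'https://example.com/b.jpg']"
for url in parse_image_urls(cell):
    print(url)
```

Each URL could then be passed to `extract_features` on its own. How to combine several images belonging to one review (use the first, or average their features) is a design decision the notebook does not make.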

In [None]:
# Read extracted features from the CSV file
extracted_features_df = pd.read_csv('/content/drive/MyDrive/extracted_features.csv')

# Print the first few rows of the DataFrame
print("First few rows of extracted features:")
print(extracted_features_df.head())

# Optionally, print specific rows or a range of rows
print("\nSpecific rows from extracted features:")
print(extracted_features_df.iloc[10:20])  # Print rows 10 to 19


First few rows of extracted features:
     0         1    2    3    4    5         6        7         8    9  ...  \
0  0.0  0.003699  0.0  0.0  0.0  0.0  0.006365  0.00175  0.000000  0.0  ...   
1  0.0  0.000373  0.0  0.0  0.0  0.0  0.000000  0.00000  0.002696  0.0  ...   
2  0.0  0.005153  0.0  0.0  0.0  0.0  0.000000  0.00000  0.000000  0.0  ...   
3  0.0  0.005026  0.0  0.0  0.0  0.0  0.000000  0.00000  0.000000  0.0  ...   
4  0.0  0.003462  0.0  0.0  0.0  0.0  0.000000  0.00000  0.000000  0.0  ...   

   100342  100343  100344  100345    100346  100347  100348    100349  100350  \
0     0.0     0.0     0.0     0.0  0.000000     0.0     0.0  0.000000     0.0   
1     0.0     0.0     0.0     0.0  0.000000     0.0     0.0  0.000000     0.0   
2     0.0     0.0     0.0     0.0  0.000000     0.0     0.0  0.000000     0.0   
3     0.0     0.0     0.0     0.0  0.000000     0.0     0.0  0.003468     0.0   
4     0.0     0.0     0.0     0.0  0.000534     0.0     0.0  0.000000     0.0   
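
The column labels run past 100,000 because ResNet50 without its top layers returns a 7 x 7 x 2048 feature map for a 224 x 224 input, and flattening it gives 100,352 values per image. Global average pooling is a common way to shrink this to 2,048 values. The sketch below shows the idea on a random stand-in array rather than on real model output.

```python
import numpy as np

# Stand-in for one model.predict(img) result:
# batch of 1, 7x7 spatial grid, 2048 channels.
rng = np.random.default_rng(0)
feature_map = rng.random((1, 7, 7, 2048), dtype=np.float32)

flattened = feature_map.flatten()           # what the notebook stores: 100,352 values
pooled = feature_map.mean(axis=(1, 2))[0]   # global average pooling: 2,048 values

# L2-normalize, guarding against an all-zero vector.
norm = np.linalg.norm(pooled)
pooled_unit = pooled / norm if norm > 0 else pooled

print(flattened.shape, pooled_unit.shape)
```

In Keras, passing `pooling='avg'` to the `ResNet50` constructor applies the same pooling inside the model, so `model.predict` returns the 2,048-value vector directly.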



2. Text Feature Extraction [25]
a. Implement relevant pre-processing techniques such as lower-casing, tokenization,
punctuation removal, stop-word removal, stemming, and lemmatization on
the given text reviews in the data.
b. Calculate the Term Frequency-Inverse Document Frequency (TF-IDF) scores for
the textual reviews.
Note: Please make sure to save your extracted features and the TF-IDF scores using
the pickle module so that you can run your code quickly in the demo.
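
To make the TF-IDF computation concrete, here is a small pure-Python sketch. It follows the smoothed formulation that scikit-learn documents as the `TfidfVectorizer` default: raw term counts, `idf = ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document. The function name `tfidf` and the three toy documents are made up for illustration, and tokenization is a plain whitespace split.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

docs = ["locking tuner guitar", "guitar string", "locking tuner tuner"]
for row in tfidf(docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

Because every row has unit length, the cosine similarity between two documents reduces to the dot product of their weight vectors.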

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string
import pickle

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the dataset
data = pd.read_csv('/content/drive/MyDrive/A2_Data.csv')

# Handling missing values (assign back instead of using inplace on a column selection)
data['Review Text'] = data['Review Text'].fillna('')

# Text Pre-processing
# Build the stop-word set, stemmer and lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Lowercasing
    text = text.lower()
    # Tokenization
    tokens = word_tokenize(text)
    # Removing punctuation and stop words
    tokens = [token for token in tokens if token not in string.punctuation and token not in STOP_WORDS]
    # Stemming and lemmatization: both are computed, but only the lemmatized
    # tokens are returned; the stems are not used downstream
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(lemmatized_tokens)

# Apply text pre-processing to the reviews
data['Processed_Review'] = data['Review Text'].apply(preprocess_text)

# Calculate TF-IDF scores
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data['Processed_Review'])

# Save TF-IDF matrix and processed reviews
with open('tfidf_matrix.pkl', 'wb') as f:
    pickle.dump(tfidf_matrix, f)

with open('processed_reviews.pkl', 'wb') as f:
    pickle.dump(data['Processed_Review'], f)

print("TF-IDF matrix and processed reviews saved successfully.")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


TF-IDF matrix and processed reviews saved successfully.
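
The retrieval cell later calls `tfidf_vectorizer.transform`, so the fitted vectorizer has to be available in the demo as well, yet only the matrix and the processed reviews are pickled here. One option is to bundle all related objects into a single pickle file. The sketch below uses stand-in values, and the helper names `save_artifacts` and `load_artifacts` are hypothetical.

```python
import os
import pickle
import tempfile

def save_artifacts(path, **objects):
    """Pickle several named objects into a single file."""
    with open(path, 'wb') as f:
        pickle.dump(objects, f)

def load_artifacts(path):
    """Load the dict of named objects written by save_artifacts."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in values; in the notebook these would be tfidf_vectorizer,
# tfidf_matrix and data['Processed_Review'].
demo_path = os.path.join(tempfile.gettempdir(), 'text_artifacts_demo.pkl')
save_artifacts(demo_path,
               vocabulary={'tuner': 0, 'guitar': 1},
               reviews=['locking tuner'])
restored = load_artifacts(demo_path)
print(sorted(restored))
```

Keeping the vectorizer and the matrix in the same file also avoids pairing a matrix with a vocabulary from a different run.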


In [None]:
import pickle

# Open and print the content of tfidf_matrix.pkl
with open('tfidf_matrix.pkl', 'rb') as f:
    tfidf_matrix_loaded = pickle.load(f)

print("TF-IDF Matrix:")
print(tfidf_matrix_loaded)

# Open and print the content of processed_reviews.pkl
with open('processed_reviews.pkl', 'rb') as f:
    processed_reviews_loaded = pickle.load(f)

print("\nProcessed Reviews:")
print(processed_reviews_loaded)

TF-IDF Matrix:
  (0, 2266)	0.15249335228213476
  (0, 5544)	0.16588768102702053
  (0, 5521)	0.1646604093196775
  (0, 823)	0.20034361009496549
  (0, 2071)	0.3001633781498329
  (0, 4774)	0.28150388943417615
  (0, 2297)	0.09821595091936058
  (0, 5057)	0.26284440071851933
  (0, 2273)	0.1187719346167174
  (0, 4855)	0.1977862876772809
  (0, 4758)	0.5365295429991451
  (0, 5467)	0.45691524524039634
  (0, 3003)	0.2898943204813532
  (1, 1136)	0.15936194573512816
  (1, 481)	0.15026338858383356
  (1, 4268)	0.23358995486390477
  (1, 5622)	0.17560958459769407
  (1, 4397)	0.14842772681414163
  (1, 1701)	0.19491761152654735
  (1, 3278)	0.1264325184979872
  (1, 5645)	0.2891409990506189
  (1, 3510)	0.2891409990506189
  (1, 3062)	0.12608855408576616
  (1, 949)	0.41680119017864525
  (1, 5000)	0.3187238914702563
  :	:
  (998, 278)	0.22734100903504434
  (998, 4265)	0.3969431633159404
  (998, 5399)	0.17362978012801422
  (998, 2506)	0.16836643346580088
  (998, 1140)	0.16603152529121076
  (998, 5534)	0.55098975

In [None]:
import pandas as pd
import numpy as np
import pickle

# Load extracted features from CSV file
extracted_features_df = pd.read_csv('/content/drive/MyDrive/extracted_features.csv')

# Convert DataFrame to numpy array
extracted_features = extracted_features_df.values

# Save extracted features to a pickle file
with open('extracted_features.pkl', 'wb') as f:
    pickle.dump(extracted_features, f)


3. Image Retrieval and Text Retrieval [25]
a. For the input (image, review) pair, find the most similar images (preferably your
top three) to your input based on extracted image features/embeddings using a
similarity measure (cosine similarity) and a suitable data-structure.
b. For the input (image, review) pair, find the most similar reviews (preferably your
top three) to your input review based on TF-IDF scores using a similarity
measure (cosine similarity).
c. Use Python’s pickle module to save and load your results.
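
As an illustration of the "similarity measure plus suitable data structure" idea, the sketch below computes cosine similarity in plain Python and uses a heap (`heapq.nlargest`) to keep only the best k matches. The function names `cosine` and `top_k` and the toy vectors are assumptions made for the example.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, vectors, k=3):
    """Return [(index, score), ...] for the k most similar vectors, best first."""
    scored = ((cosine(query, v), i) for i, v in enumerate(vectors))
    return [(i, s) for s, i in heapq.nlargest(k, scored)]

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.1]]
print(top_k([1.0, 0.0], vectors, k=2))
```

For a dataset of about a thousand rows, sorting all scores is just as practical; the heap only avoids a full sort when k is small relative to the number of candidates.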

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import pickle

# Load extracted features from pickle file
with open('/content/extracted_features.pkl', 'rb') as f:
    extracted_features = pickle.load(f)

# Load processed reviews
with open('/content/processed_reviews.pkl', 'rb') as f:
    processed_reviews = pickle.load(f)

# Load TF-IDF matrix
with open('/content/tfidf_matrix.pkl', 'rb') as f:
    tfidf_matrix = pickle.load(f)

# Function to find most similar images
def find_most_similar_images(input_image_features, num_images=3):
    similarity_scores = cosine_similarity([input_image_features], extracted_features)
    most_similar_images_indices = similarity_scores.argsort()[0][-num_images:][::-1]
    return most_similar_images_indices, similarity_scores[0][most_similar_images_indices]

# Function to find most similar reviews
def find_most_similar_reviews(input_review_tfidf, num_reviews=3):
    similarity_scores = cosine_similarity(input_review_tfidf, tfidf_matrix)
    most_similar_reviews_indices = similarity_scores.argsort()[0][-num_reviews:][::-1]
    return most_similar_reviews_indices, similarity_scores[0][most_similar_reviews_indices]

# Function to format output
def format_output(image_indices, image_similarities, review_indices, review_similarities):
    output = []
    for i in range(len(image_indices)):
        image_url = data.loc[image_indices[i], 'Image']
        image_similarity = image_similarities[i]
        review = processed_reviews[review_indices[i]]
        review_similarity = review_similarities[i]
        output.append((image_url, image_similarity, review, review_similarity))
    return output

# Sample test case input
print("Please enter the image URL:")
image_url = input()
print("Please enter the review text:")
review_text = input()

# Process review text
processed_review = preprocess_text(review_text)

# Extract features for input image
input_image_features = extract_features(image_url)

# Calculate TF-IDF for input review
input_review_tfidf = tfidf_vectorizer.transform([processed_review])

# Find most similar images and reviews
image_indices, image_similarities = find_most_similar_images(input_image_features)
review_indices, review_similarities = find_most_similar_reviews(input_review_tfidf)

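# Format the output
# format_output pairs rank i of one result list with rank i of the other, so the
# image and review printed together may come from different dataset rows.
image_output = format_output(image_indices, image_similarities, review_indices, review_similarities)
review_output = format_output(review_indices, review_similarities, image_indices, image_similarities)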

# Print the output
print("USING IMAGE RETRIEVAL")
for i, (image_url, image_similarity, review, review_similarity) in enumerate(image_output, 1):
    print(f"{i}) Image URL: {image_url}")
    print(f"   Review: {review}")
    print(f"   Cosine similarity of images: {image_similarity:.4f}")
    print(f"   Cosine similarity of text: {review_similarity:.4f}")

print("\nUSING TEXT RETRIEVAL")
# review_output was built with the argument order swapped, so each tuple holds
# (image URL, text similarity, review, image similarity)
for i, (image_url, text_similarity, review, image_similarity) in enumerate(review_output, 1):
    print(f"{i}) Image URL: {image_url}")
    print(f"   Review: {review}")
    print(f"   Cosine similarity of images: {image_similarity:.4f}")
    print(f"   Cosine similarity of text: {text_similarity:.4f}")

# Calculate composite similarity scores
composite_image_similarity = np.mean(image_similarities)
composite_review_similarity = np.mean(review_similarities)

# Print composite similarity scores
print("\nComposite similarity scores:")
print(f"Composite similarity scores of images: {composite_image_similarity:.4f}")
print(f"Composite similarity scores of text: {composite_review_similarity:.4f}")


Please enter the image URL:
https://images-na.ssl-images-amazon.com/images/I/71bztfqdg+L._SY88.jpg
Please enter the review text:
Review: I have been using Fender locking tuners for about five years on various strats and teles. Definitely helps with tuning stability and way faster to restring if there is a break.
USING IMAGE RETRIEVAL
1) Image URL: ['https://images-na.ssl-images-amazon.com/images/I/71haYjMvVWL._SY88.jpg']
   Review: using fender locking tuner five year various strats teles definitely help tuning stability way faster restring break
   Cosine similarity of images: 1.0000
   Cosine similarity of text: 0.9850
2) Image URL: ['https://images-na.ssl-images-amazon.com/images/I/61d3F7FBv+L._SY88.jpg']
   Review: went fender chrome non-locking fender gold locking made guitar look beautiful play beautiful think locking tuner way go new locking tuner look youtube instruction
   Cosine similarity of images: 0.4363
   Cosine similarity of text: 0.2983
3) Image URL: ['https://images-n

In [None]:
# Save image retrieval results
with open('image_retrieval_results.pkl', 'wb') as f:
    pickle.dump((image_indices, image_similarities), f)

# Save text retrieval results
with open('text_retrieval_results.pkl', 'wb') as f:
    pickle.dump((review_indices, review_similarities), f)


In [None]:
import pickle

# Load image retrieval results
with open('image_retrieval_results.pkl', 'rb') as f:
    image_results = pickle.load(f)

# Load text retrieval results
with open('text_retrieval_results.pkl', 'rb') as f:
    text_results = pickle.load(f)

# Print image retrieval results
print("Image Retrieval Results:")
print("Image Indices:", image_results[0])
print("Image Similarities:", image_results[1])

# Print text retrieval results
print("\nText Retrieval Results:")
print("Review Indices:", text_results[0])
print("Review Similarities:", text_results[1])


Image Retrieval Results:
Image Indices: [753 651 932]
Image Similarities: [1.         0.43625011 0.2942573 ]

Text Retrieval Results:
Review Indices: [758 622 439]
Review Similarities: [0.98497965 0.29826057 0.15816347]
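
These indices may not refer to the same dataset rows. The feature matrix only has rows for images that were processed successfully, so every row skipped during extraction shifts all later feature rows. The display cell further down finds the query URL at dataset row 758, while the image match with similarity 1.0 is feature row 753, which would be consistent with five earlier rows having been skipped (an inference from the outputs, not something the notebook verifies). A minimal sketch of keeping the mapping follows; `extract_with_row_ids` and `fake_extract` are hypothetical names.

```python
def extract_with_row_ids(rows, extract):
    """Run `extract` on each (row_id, item) and remember which row ids succeeded."""
    kept_row_ids, features = [], []
    for row_id, item in rows:
        try:
            features.append(extract(item))
        except Exception as exc:  # e.g. a failed download or an unreadable image
            print(f"Skipping row {row_id}: {exc}")
            continue
        kept_row_ids.append(row_id)
    return kept_row_ids, features

def fake_extract(item):
    # Hypothetical stand-in for extract_features: fails on 'bad' inputs.
    if item == 'bad':
        raise ValueError('cannot identify image file')
    return [float(len(item))]

rows = [(0, 'img_a'), (1, 'bad'), (2, 'img_c')]
kept_row_ids, features = extract_with_row_ids(rows, fake_extract)

# Feature row 1 corresponds to dataset row 2, not dataset row 1.
print(kept_row_ids[1])
```

With such a list saved next to the features, a retrieved feature index `i` can be translated to the dataset row `kept_row_ids[i]` before looking up the image URL or review.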


In [None]:
import pandas as pd
from PIL import Image

# Load the CSV file containing image paths and reviews
data = pd.read_csv("/content/drive/MyDrive/A2_Data.csv")  # Replace with the actual path to your CSV file

# Function to display images from the provided indices
def display_images_from_indices(indices, dataframe):
    for index in indices:
        # Each 'Image' cell holds the string form of a list of URLs, not a local file path
        image_path = dataframe.loc[index, 'Image']
        if pd.notnull(image_path):
            try:
                # Image.open treats the string as a file name, so URLs end up in the except branch
                image = Image.open(image_path)
                image.show()
            except FileNotFoundError:
                print(f"Image not found at path: {image_path}")
        else:
            print(f"Invalid image path at index: {index}")

# Image indices from the Image Retrieval Results
image_indices = [753, 651, 932]

# Display images corresponding to the Image Retrieval Results
print("Images corresponding to Image Retrieval Results:")
display_images_from_indices(image_indices, data)

# Review indices from the Text Retrieval Results
review_indices = [758, 622, 439]

# Display images corresponding to the Text Retrieval Results
print("\nImages corresponding to Text Retrieval Results:")
display_images_from_indices(review_indices, data)


Images corresponding to Image Retrieval Results:
Image not found at path: ['https://images-na.ssl-images-amazon.com/images/I/71haYjMvVWL._SY88.jpg']
Image not found at path: ['https://images-na.ssl-images-amazon.com/images/I/61d3F7FBv+L._SY88.jpg']
Image not found at path: ['https://images-na.ssl-images-amazon.com/images/I/71jaTAQxhnL._SY88.jpg']

Images corresponding to Text Retrieval Results:
Image not found at path: ['https://images-na.ssl-images-amazon.com/images/I/71bztfqdg+L._SY88.jpg']
Image not found at path: ['https://images-na.ssl-images-amazon.com/images/I/61DvLcapd8L._SY88.jpg']
Image not found at path: ['https://images-na.ssl-images-amazon.com/images/I/61clqkZnKxL._SY88.jpg', 'https://images-na.ssl-images-amazon.com/images/I/61NeE5N1eQL._SY88.jpg']
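
Every lookup above fails because the value passed to `Image.open` is the string form of a list of URLs, which Pillow interprets as a local file name. The cell would first have to be split into individual URLs, and each URL downloaded before it is opened. A hedged sketch of the download step follows; the helper names `is_remote` and `open_image` are hypothetical, and `requests` and Pillow are assumed to be installed, as they are in the notebook.

```python
from io import BytesIO

def is_remote(source):
    """True when `source` looks like an HTTP(S) URL rather than a local path."""
    return source.startswith(('http://', 'https://'))

def open_image(source, timeout=10):
    """Open an image from a URL or a local path and return a PIL image."""
    # Imported lazily so is_remote stays usable without these packages.
    import requests
    from PIL import Image

    if is_remote(source):
        response = requests.get(source, timeout=timeout)
        response.raise_for_status()
        return Image.open(BytesIO(response.content))
    return Image.open(source)

print(is_remote('https://example.com/a.jpg'), is_remote('/content/a.jpg'))
```

In a notebook, `IPython.display.display(img)` is generally the more reliable way to render the returned image inline, since `Image.show()` relies on an external viewer.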


4. Combined Retrieval (Text and Image)
a. Get a composite similarity score (average) for the pairs generated in 3a) and 3b)
b. Rank the pairs based on the composite similarity score.

In [None]:
import numpy as np

# Image Similarities from Image Retrieval Results
image_similarities = np.array([1.0, 0.43625011, 0.2942573])

# Text Similarities from Text Retrieval Results
text_similarities = np.array([0.98497965, 0.29826057, 0.15816347])

# Calculate average similarity for each pair
composite_similarity_scores = (image_similarities + text_similarities) / 2

# Rank the pairs based on composite similarity score
ranked_indices = np.argsort(composite_similarity_scores)[::-1]

# Print the ranked pairs
# Note: `index` is the position within the top-three lists above (0, 1, 2), not a dataset row index
print("Ranked Pairs based on Composite Similarity Score:")
for rank, index in enumerate(ranked_indices, 1):
    print(f"Rank {rank}: Image Index {index} | Composite Similarity Score: {composite_similarity_scores[index]}")


Ranked Pairs based on Composite Similarity Score:
Rank 1: Image Index 0 | Composite Similarity Score: 0.992489825
Rank 2: Image Index 1 | Composite Similarity Score: 0.36725534000000004
Rank 3: Image Index 2 | Composite Similarity Score: 0.226210385
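
The cell above averages rank i of the image results with rank i of the text results, although those ranks point at different dataset rows (753, 651, 932 versus 758, 622, 439). One alternative, offered here as an assumption rather than the assignment's prescribed method, is to score every candidate row on both modalities and average per row. The function name `composite_ranking` is hypothetical and the scores below are made up for illustration.

```python
def composite_ranking(image_scores, text_scores):
    """Average image and text similarity per row and sort best first.

    Both arguments map a dataset row id to a similarity. A row missing
    from one mapping is treated as similarity 0.0 for that modality.
    """
    candidates = set(image_scores) | set(text_scores)
    ranked = [
        (row, (image_scores.get(row, 0.0) + text_scores.get(row, 0.0)) / 2)
        for row in candidates
    ]
    # Sort by descending score, breaking ties by row id.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))

# Made-up scores for illustration only.
image_scores = {10: 1.0, 11: 0.4, 12: 0.3}
text_scores = {10: 0.9, 12: 0.5, 13: 0.2}
for row, score in composite_ranking(image_scores, text_scores):
    print(row, round(score, 3))
```

Treating a missing score as 0.0 is a simplification. In the notebook the full similarity vectors are available, so the actual image and text similarity of each candidate row could be looked up instead.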


5. Results and Analysis
a. Present the top-ranked (image, review) pairs along with the cosine similarity
scores.
b. Observe which of the two retrieval techniques gives a better similarity score
and argue the reason.
c. Discuss the challenges faced and potential improvements in the retrieval process.

In [None]:
# Calculate composite similarity scores
composite_similarity_scores = []
for i in range(len(image_indices)):
    composite_similarity = (image_similarities[i] + review_similarities[i]) / 2
    composite_similarity_scores.append(composite_similarity)

# Rank the pairs based on composite similarity scores
ranked_pairs = sorted(zip(image_indices, review_indices, composite_similarity_scores), key=lambda x: x[2], reverse=True)

# Results and Analysis
print("Top-ranked (image, review) pairs along with composite similarity scores:")
for rank, (image_index, review_index, similarity_score) in enumerate(ranked_pairs, 1):
    image_url = data.loc[image_index, 'Image']
    review_text = data.loc[review_index, 'Review Text']
    print(f"{rank}) Image URL: {image_url}")
    print(f"   Review: {review_text}")
    print(f"   Composite Similarity Score: {similarity_score:.4f}")

# Compare retrieval techniques and provide analysis
average_image_similarity = sum(image_similarities) / len(image_similarities)
average_review_similarity = sum(review_similarities) / len(review_similarities)
print("\nAverage similarity score from Image Retrieval:", average_image_similarity)
print("Average similarity score from Text Retrieval:", average_review_similarity)
if average_image_similarity > average_review_similarity:
    print("Image Retrieval yields a better similarity score.")
else:
    print("Text Retrieval yields a better similarity score.")




Top-ranked (image, review) pairs along with composite similarity scores:
1) Image URL: ['https://images-na.ssl-images-amazon.com/images/I/71haYjMvVWL._SY88.jpg']
   Review: I have been using Fender locking tuners for about five years on various strats and teles. Definitely helps with tuning stability and way faster to restring if there is a break.
   Composite Similarity Score: 0.9925
2) Image URL: ['https://images-na.ssl-images-amazon.com/images/I/61d3F7FBv+L._SY88.jpg']
   Review: I went from fender chrome non-locking to fender gold locking. It made my guitar look beautiful and play beautiful. I think locking tuners are the way to go. If you are new to locking tuners look on YouTube for instructions.
   Composite Similarity Score: 0.3673
3) Image URL: ['https://images-na.ssl-images-amazon.com/images/I/71jaTAQxhnL._SY88.jpg']
   Review: Now all I have to do is install these on my Burswood Strat Copy. I know I'm gonna have to drill holes for the locating pins I may even have to drill t



Challenges Faced:

Handling image and text representations: Image features are dense ResNet50 activations, while text features are sparse TF-IDF vectors, so combining their similarity scores into a single number is not straightforward.

Scalability: Flattening the ResNet50 output gives 100,352 values per image, and storing these in a CSV file makes feature extraction and similarity calculation slow and memory-hungry as the dataset grows.

Noise in data: Some rows list several image URLs in one cell, which the extraction step could not load, and noisy or irrelevant features in images or reviews can reduce the accuracy of the similarity scores.

Potential Improvements:

Fine-tuning feature extraction models: Fine-tuning pre-trained models or using domain-specific feature extractors can improve the quality of extracted features.

Ensemble methods: Combining multiple retrieval techniques or similarity metrics can enhance the robustness of the retrieval system.

Incorporating semantic understanding: Utilizing techniques from natural language processing and computer vision to extract semantic meaning from images and text can lead to more meaningful similarity comparisons.

Data augmentation: Augmenting the dataset with additional images or reviews can help improve the diversity and representativeness of the data, leading to better retrieval results.

Interactive feedback: Implementing user feedback mechanisms to refine retrieval results based on user preferences and relevance feedback can enhance the overall user experience.