**Data Cleaning and Transformation**

Remove Duplicates: If there are any duplicate rows, remove them.

Handle Missing Values: Handle any missing values in columns like category and about_product that may be useful for content-based filtering.

Standardize Categories: Normalize text fields (convert to lowercase, remove extra spaces, etc.).

In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv("/content/amazon.csv")

print("Data Shape:", data.shape)
print(data.columns)

# Cleaning
data = data.drop_duplicates()
data.fillna('', inplace=True)

data['category'] = data['category'].str.lower().str.strip()
data['about_product'] = data['about_product'].str.lower().str.strip()
data['combined_features'] = data['category'] + " " + data['about_product']

print(data.head())


Data Shape: (1465, 16)
Index(['product_id', 'product_name', 'category', 'discounted_price',
       'actual_price', 'discount_percentage', 'rating', 'rating_count',
       'about_product', 'user_id', 'user_name', 'review_id', 'review_title',
       'review_content', 'img_link', 'product_link'],
      dtype='object')
   product_id                                       product_name  \
0  B07JW9H4J1  Wayona Nylon Braided USB to Lightning Fast Cha...   
1  B098NS6PVG  Ambrane Unbreakable 60W / 3A Fast Charging 1.5...   
2  B096MSW6CT  Sounce Fast Phone Charging Cable & Data Sync U...   
3  B08HDJ86NZ  boAt Deuce USB 300 2 in 1 Type-C & Micro USB S...   
4  B08CF3B7N1  Portronics Konnect L 1.2M Fast Charging 3A 8 P...   

                                            category discounted_price  \
0  computers&accessories|accessories&peripherals|...             ₹399   
1  computers&accessories|accessories&peripherals|...             ₹199   
2  computers&accessories|accessories&peripherals|...   

**Collaborative Filtering**

We will now create a User-Item Interaction Matrix based on implicit feedback (browsing history or purchase interactions). Since the dataset doesn't have explicit ratings, we will treat interactions as binary values (i.e., user has interacted with the product or not).

In [None]:
interaction_matrix = pd.pivot_table(data, index='user_id', columns='product_id', aggfunc='count', fill_value=0)
print(interaction_matrix.head())


                                                   about_product             \
product_id                                            B002PD61Y4 B002SZEOLG   
user_id                                                                       
AE22Y3KIS7SE6LI3HE2VS6WWPU4Q,AHWEYO2IJ5I5GDWZAH...             0          0   
AE23RS3W7GZO7LHYKJU6KSKVM4MQ,AEQUNEY6GQOTEGUMS6...             0          0   
AE242TR3GQ6TYC6W4SJ5UYYKBTYQ                                   0          0   
AE27UOZENYSWCQVQRRUQIV2ZM7VA,AGMYSLV6NNOAYES25J...             0          0   
AE2JTMRKTUOIVIZWS2WDGTMNTU4Q,AF4QXCB32VC2DVE7O3...             0          0   

                                                                          \
product_id                                         B003B00484 B003L62T7W   
user_id                                                                    
AE22Y3KIS7SE6LI3HE2VS6WWPU4Q,AHWEYO2IJ5I5GDWZAH...          0          0   
AE23RS3W7GZO7LHYKJU6KSKVM4MQ,AEQUNEY6GQOTEGUMS6...          0  

**Singular Value Decomposition (SVD) for Collaborative Filtering**

We'll apply Singular Value Decomposition (SVD) to factorize the interaction matrix and use it to generate user-product
recommendations based on similarity.

In [5]:
from sklearn.decomposition import TruncatedSVD
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Apply SVD for dimensionality reduction
svd = TruncatedSVD(n_components=20, random_state=42)
svd_matrix = svd.fit_transform(interaction_matrix)

# Reconstruct the matrix with the reduced dimensions
reconstructed_matrix = np.dot(svd_matrix, svd.components_)

# Calculate similarity between users
user_similarity = cosine_similarity(reconstructed_matrix)


print(user_similarity[:5])


[[ 1.         -0.40654363  0.45059835 ...  0.56906135 -0.28500942
  -0.08969197]
 [-0.40654363  1.         -0.10407396 ... -0.63880637  0.47620303
   0.12734604]
 [ 0.45059835 -0.10407396  1.         ... -0.0130442  -0.05828457
   0.06469386]
 [-0.2365112   0.0230934  -0.56591956 ...  0.23565627  0.49816114
   0.17609086]
 [ 0.15889617  0.11432748  0.49287107 ... -0.10516032  0.29107228
  -0.39183791]]


**Generate Collaborative Filtering Recommendations**

Using the similarity between users, we will recommend products to a user based on what similar users have interacted with.

In [7]:
def get_collaborative_recommendations(user_id, top_n=5):
    # Get the index of the user in the interaction matrix
    user_idx = interaction_matrix.index.get_loc(user_id)
    similarity_scores = user_similarity[user_idx]
    similar_user_indices = similarity_scores.argsort()[-(top_n + 1):-1][::-1]

    recommended_products = set()

    for idx in similar_user_indices:
        similar_user = interaction_matrix.index[idx]
        similar_user_interactions = interaction_matrix.loc[similar_user]

        # Recommend products that similar users interacted
        recommended_products.update(similar_user_interactions[similar_user_interactions > 0].index)

    return list(recommended_products)[:top_n]



**Content-Based Filtering**

Now, let’s implement Content-Based Filtering using TF-IDF vectorization. We will use the combined_features column (which includes both category and about_product) to compute the similarity between products.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(data['combined_features'])
content_similarity_matrix = cosine_similarity(tfidf_matrix)


print(content_similarity_matrix[0])


[1.         0.1026719  0.20312501 ... 0.02831326 0.03023311 0.00254099]


**Hybrid Recommendation System**

Now that we have both collaborative and content-based recommendations, we can combine the two methods to form a Hybrid Recommendation System. The idea is to take the union of both sets of recommendations or give weights to each. bold text

In [14]:
def get_content_based_recommendations(product_id, top_n=5):
    product_idx = data[data['product_id'] == product_id].index[0]
    similarities = content_similarity_matrix[product_idx]
    similar_product_idx = similarities.argsort()[-(top_n + 1):-1][::-1]

    recommended_product_ids = data.iloc[similar_product_idx]['product_id'].tolist()

    return recommended_product_ids

# Example:
product_id = "B07JW9H4J1"
content_based_recommendations = get_content_based_recommendations(product_id)
print(f"Content-based recommendations for product {product_id}: {content_based_recommendations}")


Content-based recommendations for product B07JW9H4J1: ['B07JW9H4J1', 'B07JW9H4J1', 'B07JH1CBGW', 'B07JH1C41D', 'B07JW1Y6XV']


**Real-Time Recommendations**

For real-time updates, you can simulate real-time browsing or interaction and immediately generate recommendations based on the recent activity.

In [16]:
def get_hybrid_recommendations(user_id, product_id, top_n=5, alpha=0.6):
    # collaborative filtering
    collaborative_recs = get_collaborative_recommendations(user_id, top_n=top_n)

    # content-based filtering
    content_recs = get_content_based_recommendations(product_id, top_n=top_n)

    # Combine recommendations: a weighted average of both
    combined_recs = set(collaborative_recs) | set(content_recs)

    # Return top N unique recommendations
    return list(combined_recs)[:top_n]


**Evaluation**

We can evaluate the recommendation system using Precision, Recall, or Mean Average Precision (MAP) by comparing the generated recommendations with actual user preferences.

In [18]:
recently_browsed_product = "B07JW9H4J1"
recent_user_id = "AG3D6O4STAQKAY2UVGEUV46KN35Q,AHMY5CWJMMK5BJRBBSNLYT3ONILA,AHCTC6ULH4XB6YHDY6PCH2R772LQ,AGYHHIERNXKA6P5T7CZLXKVPT7IQ,AG4OGOFWXJZTQ2HKYIOCOY3KXF2Q,AENGU523SXMOS7JPDTW52PNNVWGQ,AEQJHCVTNINBS4FKTBGQRQTGTE5Q,AFC3FFC5PKFF5PMA52S3VCHOZ5FQ"  # Example user who viewed the product

# Function to get product names for a list of product_ids
def get_product_names(product_ids):
    # Assuming 'data' is the original dataset loaded with product details
    product_names = data[data['product_id'].isin(product_ids)]['product_name'].tolist()
    return product_names

# real-time recommendations based on the recent activity (product IDs)
real_time_recommendations_ids = get_hybrid_recommendations(user_id=recent_user_id, product_id=recently_browsed_product, top_n=5)

# Get product names for the recommended product IDs
real_time_recommendations_names = get_product_names(real_time_recommendations_ids)

print(f"Real-time recommendations for user {recent_user_id} after viewing product {recently_browsed_product}: {real_time_recommendations_names}")


Real-time recommendations for user AG3D6O4STAQKAY2UVGEUV46KN35Q,AHMY5CWJMMK5BJRBBSNLYT3ONILA,AHCTC6ULH4XB6YHDY6PCH2R772LQ,AGYHHIERNXKA6P5T7CZLXKVPT7IQ,AG4OGOFWXJZTQ2HKYIOCOY3KXF2Q,AENGU523SXMOS7JPDTW52PNNVWGQ,AEQJHCVTNINBS4FKTBGQRQTGTE5Q,AFC3FFC5PKFF5PMA52S3VCHOZ5FQ after viewing product B07JW9H4J1: []
