
Amazon Product Reviews Workflow for Item-Based Collaborative Filtering

1. Introduction

This document outlines a structured workflow for building product recommendations from Amazon product reviews with item-based collaborative filtering. The approach treats two products as similar when the same users rate them similarly, then recommends the products most similar to one a user has already interacted with.

2. Workflow Stages

A. Data Acquisition

In [7]:
!pip install opendatasets
import opendatasets as od
import pandas as pd
import os # Import the os module

# Download dataset from Kaggle
od.download("https://www.kaggle.com/datasets/irvifa/amazon-product-reviews")

# Get the current working directory
current_directory = os.getcwd()

# Define the expected directory and filename
dataset_directory = os.path.join(current_directory, "amazon-product-reviews")
dataset_filename = "amazon_co-ecommerce_sample.csv"  # Update with actual filename if different

# Construct the full file path
file_path = os.path.join(dataset_directory, dataset_filename)
# Check that the file exists at the constructed path, not in the working directory
if os.path.exists(file_path):
    # Load the dataset
    reviews_df = pd.read_csv(file_path)

    # Initial data exploration
    print(reviews_df.head())
    reviews_df.info()  # info() prints its own summary and returns None
else:
    print(f"Error: File not found at {file_path}")
    print("Please ensure the dataset was downloaded and the filename is correct.")

Skipping, found downloaded files in "./amazon-product-reviews" (use force=True to force download)
Error: File not found at /content/amazon-product-reviews/amazon_co-ecommerce_sample.csv
Please ensure the dataset was downloaded and the filename is correct.
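
The run above reports that the expected file was not found, which suggests the hardcoded filename does not match what the download produced. Rather than guessing the name, one option is to list the CSV files that are actually present. The following is a minimal sketch; `find_csv_files` is a hypothetical helper, not part of `opendatasets`, and the folder name is taken from the download message above.

```python
from pathlib import Path


def find_csv_files(directory):
    """Return every CSV file under `directory`, sorted by path (hypothetical helper)."""
    root = Path(directory)
    if not root.is_dir():
        # A missing folder means the download did not land where expected.
        return []
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".csv"
    )


# "amazon-product-reviews" is the folder named in the download message above.
candidates = find_csv_files("amazon-product-reviews")
if candidates:
    for path in candidates:
        print(path)
else:
    print("No CSV files found; check that the download completed.")
```

Any path printed here can be assigned to `file_path` in place of the hardcoded filename.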


B. Data Preprocessing

In [8]:
!pip install opendatasets  # Install opendatasets if not already installed
import opendatasets as od
import pandas as pd
import os # Import the os module

# Download dataset from Kaggle
od.download("https://www.kaggle.com/datasets/irvifa/amazon-product-reviews")

# Get the current working directory
current_directory = os.getcwd()

# Define the expected directory and filename
dataset_directory = os.path.join(current_directory, "amazon-product-reviews")
dataset_filename = "amazon_co-ecommerce_sample.csv"  # Update with actual filename if different

# Construct the full file path
file_path = os.path.join(dataset_directory, dataset_filename)
# Check that the file exists at the constructed path, not in the working directory
if os.path.exists(file_path):
    # Load the dataset
    reviews_df = pd.read_csv(file_path)

    # Initial data exploration
    print(reviews_df.head())
    reviews_df.info()  # info() prints its own summary and returns None

    # Remove duplicate rows and rows with missing values
    reviews_df.drop_duplicates(inplace=True)
    reviews_df.dropna(inplace=True)

    # Make sure ratings are numeric (assumes a 'rating' column exists)
    reviews_df['rating'] = reviews_df['rating'].astype(float)

    # Scale ratings to the [0, 1] range
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    reviews_df[['rating']] = scaler.fit_transform(reviews_df[['rating']])

    # Create the user-item interaction matrix. pivot_table averages repeated
    # (user_id, product_id) pairs, where pivot would raise a ValueError.
    # Note: after scaling, the lowest rating is also 0, the value used for "not rated".
    user_item_matrix = reviews_df.pivot_table(
        index='user_id', columns='product_id', values='rating', aggfunc='mean'
    ).fillna(0)
else:
    print(f"Error: File not found at {file_path}")
    print("Please ensure the dataset was downloaded and the filename is correct.")

Skipping, found downloaded files in "./amazon-product-reviews" (use force=True to force download)
Error: File not found at /content/amazon-product-reviews/amazon_co-ecommerce_sample.csv
Please ensure the dataset was downloaded and the filename is correct.
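
One thing to watch when building the interaction matrix is repeated (user, product) pairs: `DataFrame.pivot` raises a `ValueError` on them, while `pivot_table` aggregates them. The sketch below illustrates this on made-up data, assuming the same `user_id`, `product_id`, and `rating` column names as the cell above; averaging duplicates is one possible choice, not the only one.

```python
import pandas as pd

# Toy ratings (invented for illustration); u2 rated p1 twice.
toy_reviews = pd.DataFrame({
    "user_id":    ["u1", "u1", "u2", "u2", "u3", "u3"],
    "product_id": ["p1", "p2", "p1", "p1", "p2", "p3"],
    "rating":     [5.0, 3.0, 4.0, 2.0, 1.0, 4.0],
})

# pivot_table collapses the duplicate (u2, p1) pair into its mean rating.
toy_matrix = toy_reviews.pivot_table(
    index="user_id", columns="product_id", values="rating", aggfunc="mean"
)
print(toy_matrix)

# NaN marks "not rated"; fill with 0 only when a dense matrix is required.
dense_matrix = toy_matrix.fillna(0)
print(dense_matrix)
```

Keeping the `NaN` version around makes it possible to tell a missing rating apart from a genuinely low one.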


C. Similarity Computation

In [37]:
!pip install opendatasets  # Install opendatasets if not already installed
import opendatasets as od
import pandas as pd
import os  # Import the os module
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np  # Import numpy
from sklearn.preprocessing import MinMaxScaler


# Download dataset from Kaggle
od.download("https://www.kaggle.com/datasets/irvifa/amazon-product-reviews")

# Get the current working directory
current_directory = os.getcwd()

# Define the expected directory and filename
dataset_directory = os.path.join(current_directory, "amazon-product-reviews")
dataset_filename = "amazon_co-ecommerce_sample.csv"  # Update with actual filename if different

# Construct the full file path
file_path = os.path.join(dataset_directory, dataset_filename)

# Check that the file exists at the constructed path
if os.path.exists(file_path):
    # Load the dataset
    reviews_df = pd.read_csv(file_path)

    # Initial data exploration
    print(reviews_df.head())
    reviews_df.info()  # info() prints its own summary and returns None

    # Remove duplicate rows and rows with missing values
    reviews_df.drop_duplicates(inplace=True)
    reviews_df.dropna(inplace=True)

    # Make sure ratings are numeric (assumes a 'rating' column exists)
    reviews_df['rating'] = reviews_df['rating'].astype(float)

    # Scale ratings to the [0, 1] range
    scaler = MinMaxScaler()
    reviews_df[['rating']] = scaler.fit_transform(reviews_df[['rating']])

    # Treat product IDs as strings so they can be looked up by string label later
    reviews_df['product_id'] = reviews_df['product_id'].astype(str)

    # Create the user-item interaction matrix. pivot_table averages repeated
    # (user_id, product_id) pairs, where pivot would raise a ValueError.
    user_item_matrix = reviews_df.pivot_table(
        index='user_id', columns='product_id', values='rating', aggfunc='mean'
    ).fillna(0)

    # Calculate cosine similarity between products
    item_similarity = cosine_similarity(user_item_matrix.T)

    # Convert the similarity matrix to a DataFrame for easier handling
    item_similarity_df = pd.DataFrame(
        item_similarity,
        index=user_item_matrix.columns,
        columns=user_item_matrix.columns,
    )

    # Print some similarity values (optional)
    print(item_similarity_df.head())

else:
    print(f"Error: File not found at {file_path}")  # Print the full file path
    print("Please ensure the dataset was downloaded and the filename is correct.")



Skipping, found downloaded files in "./amazon-product-reviews" (use force=True to force download)
Error: File not found at /content/amazon-product-reviews/amazon_co-ecommerce_sample.csv
Please ensure the dataset was downloaded and the filename is correct.
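
Because the cell above did not reach the similarity step, it may help to see what item-item cosine similarity produces on a small matrix. The sketch below uses invented data and computes the similarity directly with NumPy; `item_cosine_similarity` is a hypothetical helper written for this illustration, intended to mirror what `cosine_similarity(user_item_matrix.T)` computes for items with at least one rating.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix: rows are users, columns are products, 0 means "no rating".
toy_matrix = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [4.0, 5.0, 0.0],
     [0.0, 0.0, 3.0]],
    index=["u1", "u2", "u3"],
    columns=["p1", "p2", "p3"],
)


def item_cosine_similarity(user_item):
    """Cosine similarity between the columns (items) of a user-item DataFrame."""
    vectors = user_item.to_numpy(dtype=float).T            # one row per item
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                                # items nobody rated stay all-zero
    unit = vectors / norms                                 # scale each item vector to length 1
    return pd.DataFrame(
        unit @ unit.T, index=user_item.columns, columns=user_item.columns
    )


toy_similarity = item_cosine_similarity(toy_matrix)
print(toy_similarity.round(3))
```

Here p1 and p2 come out highly similar because u1 and u2 rated both, while p3 shares no raters with either and gets a similarity of 0.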


D. Recommendation Generation

In [17]:
# Function to recommend similar items
def recommend_items(item, similarity_matrix, num_recommendations=5):
    """
    Recommends items similar to the given item based on cosine similarity.

    Args:
        item (str): The ID of the item for which to generate recommendations.
        similarity_matrix (pd.DataFrame): The item-item similarity matrix.
        num_recommendations (int, optional): The number of recommendations to generate. Defaults to 5.

    Returns:
        pd.Series or str: A Series of recommended item IDs and their similarity scores,
                          or an error message string if the item is not in the similarity matrix.
    """
    if item not in similarity_matrix.columns:
        return "Item not found in similarity matrix"

    # Drop the item itself by label, then take the highest-scoring neighbours.
    # Slicing off the first row after sorting is unreliable when another item
    # ties with the item's self-similarity of 1.0.
    similar_items = similarity_matrix[item].drop(labels=item).nlargest(num_recommendations)
    return similar_items

# item_similarity_df must already exist (see C. Similarity Computation).
# Example usage: replace the ID below with a product ID present in item_similarity_df.
item_to_recommend = '0594451647'
try:  # Catch the NameError raised when item_similarity_df has not been computed
    recommendations = recommend_items(item_to_recommend, item_similarity_df, num_recommendations=10)
except NameError:
    print("item_similarity_df is not defined. Please make sure it is calculated and available in the current scope.")
else:
    if isinstance(recommendations, str):  # Check if an error message was returned
        print(recommendations)
    else:
        print(f"Recommended items for {item_to_recommend}:\n{recommendations}")

item_similarity_df is not defined. Please make sure it is calculated and available in the current scope.
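
Since `item_similarity_df` was never created in this run, the lookup can be illustrated on a small hand-written similarity matrix instead. This is a sketch under stated assumptions: the scores are invented, and `top_similar` is a hypothetical variant that excludes the query item by label and raises `KeyError` for unknown items rather than returning a message string.

```python
import pandas as pd

# Invented item-item similarity scores (symmetric, 1.0 on the diagonal).
items = ["p1", "p2", "p3", "p4"]
toy_similarity = pd.DataFrame(
    [[1.0, 0.9, 0.1, 0.4],
     [0.9, 1.0, 0.2, 0.3],
     [0.1, 0.2, 1.0, 0.8],
     [0.4, 0.3, 0.8, 1.0]],
    index=items,
    columns=items,
)


def top_similar(item, similarity, k=2):
    """Return the k items most similar to `item`, excluding `item` itself."""
    if item not in similarity.columns:
        raise KeyError(f"{item!r} is not in the similarity matrix")
    # Remove the self-similarity entry by label, then keep the k largest scores.
    return similarity[item].drop(labels=item).nlargest(k)


print(top_similar("p1", toy_similarity))
```

Raising an exception keeps the return type a `pd.Series` in every successful case, so callers do not need an `isinstance` check.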


E. Evaluation and Optimization

In [34]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Placeholder evaluation function (assumes ground-truth labels are available)
def evaluate_model(true_labels, predicted_labels):
    """Return weighted precision, recall, and F1 for predicted vs. true labels."""
    precision = precision_score(true_labels, predicted_labels, average='weighted')
    recall = recall_score(true_labels, predicted_labels, average='weighted')
    f1 = f1_score(true_labels, predicted_labels, average='weighted')

    return {"Precision": precision, "Recall": recall, "F1 Score": f1}

# Placeholder for A/B testing (implementation depends on the business setup)
def ab_testing(strategy_a, strategy_b):
    """Compare two recommendation strategies. Not implemented yet."""
    raise NotImplementedError("A/B testing depends on the business setup.")

# Placeholder for hyperparameter tuning (similarity metric, normalization, etc.)
def tune_hyperparameters():
    """Try alternative settings, e.g. Pearson correlation instead of cosine similarity. Not implemented yet."""
    raise NotImplementedError("Hyperparameter tuning is not implemented yet.")
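
The label-based metrics above suit classification; for a ranked recommendation list, precision and recall at a cutoff k are a common alternative. The sketch below is one way to compute them, assuming a set of held-out items the user is known to have liked. `precision_recall_at_k` and the example data are hypothetical and not part of the original workflow.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user's ranked recommendation list."""
    top_k = list(recommended)[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0                      # share of the top k that is relevant
    recall = hits / len(relevant) if relevant else 0.0      # share of relevant items recovered
    return precision, recall


# Invented example: a ranked list and the items the user actually liked.
recommended = ["p2", "p4", "p3"]
held_out = {"p2", "p3"}

precision, recall = precision_recall_at_k(recommended, held_out, k=2)
print(f"precision@2={precision:.2f}, recall@2={recall:.2f}")
```

Averaging these values over many users gives a single score that can be compared across similarity metrics or normalization choices.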