Let's start by dividing the task into the components we need to implement:

1. Data Preprocessing 
2. Text-Based Product Recommendation
3. Image-Based Product Recommendation
4. Integration - Combine recommendation systems and get a list of final recommendations
5. Evaluation - Fine-tuning the parameters

In [1]:
import pandas as pd
from urllib.request import urlopen
from PIL import Image
from io import BytesIO
import os
import requests

# Load the dataset
dataset = pd.read_csv('28k_apparel_data.csv')


dataset.info()

# Display the first few rows of the dataframe
dataset.head()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17593 entries, 0 to 17592
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         17593 non-null  int64 
 1   asin               17593 non-null  object
 2   brand              17543 non-null  object
 3   color              17593 non-null  object
 4   medium_image_url   17593 non-null  object
 5   product_type_name  17593 non-null  object
 6   title              17593 non-null  object
 7   formatted_price    17593 non-null  object
dtypes: int64(1), object(7)
memory usage: 1.1+ MB


,Unnamed: 0,asin,brand,color,medium_image_url,product_type_name,title,formatted_price
0,4,B004GSI2OS,FeatherLite,Onyx Black/ Stone,https://images-na.ssl-images-amazon.com/images...,SHIRT,Featherlite Ladies' Long Sleeve Stain Resistan...,$26.26
1,6,B012YX2ZPI,HX-Kingdom Fashion T-shirts,White,https://images-na.ssl-images-amazon.com/images...,SHIRT,Women's Unique 100% Cotton T - Special Olympic...,$9.99
2,15,B003BSRPB0,FeatherLite,White,https://images-na.ssl-images-amazon.com/images...,SHIRT,FeatherLite Ladies' Moisture Free Mesh Sport S...,$20.54
3,27,B014ICEJ1Q,FNC7C,Purple,https://images-na.ssl-images-amazon.com/images...,SHIRT,Supernatural Chibis Sam Dean And Castiel O Nec...,$7.39
4,43,B0079BMKDS,FeatherLite,White,https://images-na.ssl-images-amazon.com/images...,APPAREL,Featherlite Ladies' Silky Smooth Pique (White)...,$13.53


In [2]:
def preprocessing(dataset):
    # Check for missing values
    null_values = dataset.isnull().sum().sort_values(ascending=False)
    null_values = pd.DataFrame(data=null_values, columns=['Null Values'])
    missing_values = null_values[null_values['Null Values'] > 0]
    print(missing_values)

    # Fill missing brands with the most frequent brand. Assigning the result back
    # avoids the chained `inplace=True` call, which stops working in pandas 3.0.
    dataset['brand'] = dataset['brand'].fillna(dataset['brand'].mode()[0])

    null_values_after_fill = dataset.isnull().sum().sort_values(ascending=False)
    missing_values_after_fill = null_values_after_fill[null_values_after_fill > 0]
    if len(missing_values_after_fill) == 0:
        print("All missing values filled successfully.")
    else:
        print("Some missing values could not be filled.")

    return dataset


preprocessing(dataset)

       Null Values
brand           50
All missing values filled successfully.


,Unnamed: 0,asin,brand,color,medium_image_url,product_type_name,title,formatted_price
0,4,B004GSI2OS,FeatherLite,Onyx Black/ Stone,https://images-na.ssl-images-amazon.com/images...,SHIRT,Featherlite Ladies' Long Sleeve Stain Resistan...,$26.26
1,6,B012YX2ZPI,HX-Kingdom Fashion T-shirts,White,https://images-na.ssl-images-amazon.com/images...,SHIRT,Women's Unique 100% Cotton T - Special Olympic...,$9.99
2,15,B003BSRPB0,FeatherLite,White,https://images-na.ssl-images-amazon.com/images...,SHIRT,FeatherLite Ladies' Moisture Free Mesh Sport S...,$20.54
3,27,B014ICEJ1Q,FNC7C,Purple,https://images-na.ssl-images-amazon.com/images...,SHIRT,Supernatural Chibis Sam Dean And Castiel O Nec...,$7.39
4,43,B0079BMKDS,FeatherLite,White,https://images-na.ssl-images-amazon.com/images...,APPAREL,Featherlite Ladies' Silky Smooth Pique (White)...,$13.53
...,...,...,...,...,...,...,...,...
17588,183081,B01MRV2IFS,YueLian,Black,https://images-na.ssl-images-amazon.com/images...,SHIRT,YueLian Women's Chiffon Short Sleeves Sun Prot...,$19.25
17589,183092,B01LY4QWLF,Vintage America,White,https://images-na.ssl-images-amazon.com/images...,SHIRT,Vintage America Women's Large Lace Up Collared...,$23.24
17590,183096,B07167SCNH,Tart Collections,Black,https://images-na.ssl-images-amazon.com/images...,SHIRT,"Tart Womens Collections Ann Wrap Top, Xs, Black",$29.99
17591,183101,B07575N2WX,Soprano,Gray,https://images-na.ssl-images-amazon.com/images...,SHIRT,Soprano Womens Small Tie-Fringe Slub-Knit Tank...,$22.83
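
The fill step can also be written so that it leaves the input untouched and returns a new frame. The following is a minimal sketch on toy data, not the notebook's own method; the helper name `fill_with_mode` and the toy `brand` values are assumptions made for illustration.

```python
import pandas as pd

def fill_with_mode(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of `df` with missing values in `column` replaced by its mode."""
    filled = df.copy()
    most_frequent = filled[column].mode()[0]  # mode() ignores missing values by default
    # Assign the result back instead of calling fillna(..., inplace=True) on a column
    filled[column] = filled[column].fillna(most_frequent)
    return filled

# Toy data standing in for the real catalogue (hypothetical values)
toy = pd.DataFrame({"brand": ["FeatherLite", None, "FeatherLite", "Soprano"]})
filled = fill_with_mode(toy, "brand")
print(filled["brand"].tolist())
```

Returning a copy keeps the raw data available for comparison, at the cost of holding two frames in memory.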


In [4]:
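image_directory = 'images/'
os.makedirs(image_directory, exist_ok=True)

def download_and_save_image(url, filename):
    """Download one image and write it to `filename`, reporting any failure."""
    try:
        # A timeout keeps one stalled request from blocking the whole loop
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            with open(filename, 'wb') as f:
                f.write(response.content)
            print(f"Image saved: {filename}")
        else:
            print(f"Failed to download image from {url}. Status code: {response.status_code}")
    except (requests.RequestException, OSError) as e:
        print(f"Error downloading image from {url}: {e}")

# Iterate over the dataset and download images.
# This cell needs the 'asin' and 'medium_image_url' columns, so it must run
# before the later cell that drops them. Run after that cell, row['asin']
# raises the KeyError shown below.
for index, row in dataset.iterrows():
    image_url = row['medium_image_url']
    image_name = f"{row['asin']}.jpg"
    image_path = os.path.join(image_directory, image_name)
    download_and_save_image(image_url, image_path)

KeyError: 'asin'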
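
With 17,593 rows, restarting the download loop from scratch after an interruption is slow. One possible approach, sketched here under assumptions rather than taken from the notebook, is to skip files that already exist on disk. The names `fetch_bytes` and `download_missing` are hypothetical, and the fetcher is passed in as a parameter so the logic can be tried without network access.

```python
import os
from urllib.request import urlopen

def fetch_bytes(url, timeout=10):
    """Default fetcher: read the raw bytes behind `url`."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()

def download_missing(items, directory, fetch=fetch_bytes):
    """Download each (name, url) pair unless the file is already on disk.

    Returns three lists of names: (saved, skipped, failed).
    """
    os.makedirs(directory, exist_ok=True)
    saved, skipped, failed = [], [], []
    for name, url in items:
        path = os.path.join(directory, f"{name}.jpg")
        if os.path.exists(path):
            skipped.append(name)  # already downloaded in an earlier run
            continue
        try:
            data = fetch(url)
        except OSError:  # URLError and socket timeouts are OSError subclasses
            failed.append(name)
            continue
        with open(path, "wb") as f:
            f.write(data)
        saved.append(name)
    return saved, skipped, failed

# Possible usage with the notebook's columns (requires network access):
# download_missing(zip(dataset['asin'], dataset['medium_image_url']), 'images/')
```

Collecting the failed names instead of printing one line per image makes it easier to retry only the products that are still missing.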

Text-Based Recommendation:


Text Preprocessing:
Lowercase and tokenize the text data, then remove stopwords and non-alphanumeric tokens.


In [3]:
pip install nltk

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Download the stopword list and the Punkt tokenizer models
nltk.download('stopwords')
nltk.download('punkt')

# Build the stopword set once so membership tests are fast
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Tokenize and convert to lowercase
    tokens = word_tokenize(text.lower())
    # Keep alphanumeric tokens that are not stopwords
    filtered_tokens = [token for token in tokens if token.isalnum() and token not in stop_words]
    return " ".join(filtered_tokens)

# Apply text preprocessing
dataset['title'] = dataset['title'].apply(preprocess_text)
dataset['product_type_name'] = dataset['product_type_name'].apply(preprocess_text)

# Display the preprocessed text
print(dataset[['title', 'product_type_name']].head())


[nltk_data] Downloading package stopwords to /home/fozia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/fozia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


                                               title product_type_name
0  featherlite ladies long sleeve stain resistant...             shirt
1  women unique 100 cotton special olympics world...             shirt
2  featherlite ladies moisture free mesh sport sh...             shirt
3  supernatural chibis sam dean castiel neck fema...             shirt
4     featherlite ladies silky smooth pique white xl           apparel


In [5]:
# Drop unnecessary columns
dataset = dataset.drop(columns=['Unnamed: 0', 'asin', 'color', 'medium_image_url', 'formatted_price'])

# Display the modified dataframe
print(dataset.head())

                         brand product_type_name  \
0                  FeatherLite             shirt   
1  HX-Kingdom Fashion T-shirts             shirt   
2                  FeatherLite             shirt   
3                        FNC7C             shirt   
4                  FeatherLite           apparel   

                                               title  
0  featherlite ladies long sleeve stain resistant...  
1  women unique 100 cotton special olympics world...  
2  featherlite ladies moisture free mesh sport sh...  
3  supernatural chibis sam dean castiel neck fema...  
4     featherlite ladies silky smooth pique white xl  


In [6]:
pd.set_option('display.max_colwidth', None)

dataset

,brand,product_type_name,title
0,FeatherLite,shirt,featherlite ladies long sleeve stain resistant tapered twill shirt 2xl onyx stone
1,HX-Kingdom Fashion T-shirts,shirt,women unique 100 cotton special olympics world games 2015 white size l
2,FeatherLite,shirt,featherlite ladies moisture free mesh sport shirt white
3,FNC7C,shirt,supernatural chibis sam dean castiel neck female purple l
4,FeatherLite,apparel,featherlite ladies silky smooth pique white xl
...,...,...,...
17588,YueLian,shirt,yuelian women chiffon short sleeves sun protection outerwear blouse
17589,Vintage America,shirt,vintage america women large lace collared blouse white l
17590,Tart Collections,shirt,tart womens collections ann wrap top xs black
17591,Soprano,shirt,soprano womens small tank top gray


Bag of Words (BoW)

Bag of Words (BoW) represents text data in numerical form: each distinct word in the corpus
is treated as a separate feature, and a document is described by how often each word occurs in it, regardless of word order.
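
To make the idea concrete, here is a minimal sketch that builds word-count vectors by hand and compares them with cosine similarity. It uses only the standard library; the helper names `bag_of_words` and `cosine` are chosen for illustration, and the three titles are copied from the preprocessed output above.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Map each whitespace-separated word to the number of times it occurs."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

t1 = bag_of_words("featherlite ladies moisture free mesh sport shirt white")
t2 = bag_of_words("featherlite ladies silky smooth pique white xl")
t3 = bag_of_words("soprano womens small tank top gray")

print(cosine(t1, t2))  # three shared words, so the score is between 0 and 1
print(cosine(t1, t3))  # no shared words
```

Titles that share words get a positive score, while titles with no words in common score zero.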

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# Function to create Bag of Words (BoW) representation of product types
def create_bow(product_type):
    bow = {}
    for pt in product_type.split('/'):
        bow[pt.strip()] = 1
    return bow

# Create BoW representations for product types
bags_of_words = [create_bow(pt) for pt in dataset['product_type_name']]

# Store the BoW representations in a DataFrame. Titles serve only as row labels,
# so the similarity below depends on product type alone.
bow_df = pd.DataFrame(bags_of_words, index=dataset['title']).fillna(0)

# Calculate cosine similarity matrix between products
cosine_similarity_matrix = cosine_similarity(bow_df)

# Create DataFrame with cosine similarity scores
similarity_df = pd.DataFrame(cosine_similarity_matrix, index=bow_df.index, columns=bow_df.index)

# Ask the user for a product they like
product = input('Enter a product you like: ')

# Find the index of the product in the similarity DataFrame
product_index = similarity_df.index.get_loc(product)

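# Get the top 10 most similar products to the input product.
# The [1:11] slice skips the first entry on the assumption that it is the query itself;
# when many products tie at the same score, the query can appear elsewhere in the list.
top_10 = similarity_df.iloc[product_index].sort_values(ascending=False)[1:11]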

# Print the top 10 most similar products to the input product
print(f'Top 10 similar products to {product}:')
print(top_10)


Top 10 similar products to featherlite ladies long sleeve stain resistant tapered twill shirt 2xl onyx stone:
title
featherlite ladies long sleeve stain resistant tapered twill shirt 2xl onyx stone     1.0
women unique 100 cotton special olympics world games 2015 white size l                1.0
featherlite ladies moisture free mesh sport shirt white                               1.0
supernatural chibis sam dean castiel neck female purple l                             1.0
soprano womens small tank top gray                                                    1.0
fifth degree womens gold foil graphic tees junior top short sleeve printed shirt l    1.0
ladies green seamless ribbed diamond patterned cap sleeve top wide                    1.0
tart womens collections ann wrap top xs black                                         1.0
feel piece sami dip dye top one size navy                                             1.0
ladies fuchsia pink seamless stone set tube top                           
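
Every score above is 1.0 because the vectors encode only the product type, so any two shirts look identical. A possible alternative, sketched here under the assumption that the words of the title are the signal of interest, is to count title words with scikit-learn's `CountVectorizer` and compare a single query row against the corpus instead of building the full product-by-product matrix. The function name `top_similar_titles` and its parameters are hypothetical, and the toy titles are copied from the preprocessed dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_similar_titles(titles, query_position, k=10):
    """Rank titles by cosine similarity of their word counts to one query title."""
    counts = CountVectorizer().fit_transform(titles)  # sparse (n_titles, n_words)
    # Compare only the query row against the corpus: n scores, not an n x n matrix
    scores = cosine_similarity(counts[query_position], counts).ravel()
    scores[query_position] = -1.0  # exclude the query by position, not by rank
    best = np.argsort(-scores, kind="stable")[:k]
    return [(titles[i], float(scores[i])) for i in best]

titles = [
    "featherlite ladies long sleeve stain resistant tapered twill shirt 2xl onyx stone",
    "featherlite ladies moisture free mesh sport shirt white",
    "featherlite ladies silky smooth pique white xl",
    "soprano womens small tank top gray",
]
for title, score in top_similar_titles(titles, query_position=1, k=2):
    print(f"{score:.3f}  {title}")
```

Excluding the query by its position avoids relying on it being sorted first, which does not hold when scores tie.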

Using TF-IDF Vectorizer

