In [1]:
import os
import pandas as pd
import configparser
#import mysql.connector
from sqlalchemy import create_engine

import seaborn as sns
import matplotlib.pyplot as plt

config = configparser.ConfigParser()
config.read('..\\config.ini')

host = config['mysql']['host']
database = config['mysql']['database']
user = config['mysql']['user']
password = config['mysql']['password']
port = config['mysql']['port']

In [2]:
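def read_query(query):
    """Run a SQL query against the configured MySQL database and return the result as a DataFrame."""
    # The mysql+pymysql URL requires the PyMySQL driver to be installed.
    engine = create_engine(f'mysql+pymysql://{user}:{password}@{host}:{port}/{database}')
    df = pd.read_sql(query, con=engine)
    print('Query Executed!')
    return df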

The goal is to build a Collaborative Filtering (CF) recommender system. In this approach, we analyze a customer's past transaction history and compare it with that of other customers to identify trends and similarities. Based on the patterns we find, we can recommend new items that the customer may be interested in.

More specifically, we cannot build a user-based CF, where items are recommended to a customer based on similarities between that customer and other customers, because the only thing we know about the customers is their age group. Instead, we opt for item-based CF, which recommends items to a customer based on similarities between the items they have already interacted with and other items they may like.
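To make the item-based idea concrete, here is a minimal sketch on a toy purchase table. The data, the `recommend` helper, and the choice of cosine similarity over a binary customer-by-item matrix are all assumptions made for illustration; they are not taken from the H&M data or from the method used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy purchase history (hypothetical data, not from the H&M dataset).
transactions = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c2", "c3", "c3", "c3"],
    "article_id":  [101,  102,  101,  102,  102,  103,  104],
})

# Binary customer x item matrix: 1 if the customer bought the item at least once.
interactions = pd.crosstab(
    transactions["customer_id"], transactions["article_id"]
).clip(upper=1)

# Cosine similarity between items, computed on the item x customer vectors.
item_matrix = interactions.to_numpy(dtype=float).T
norms = np.linalg.norm(item_matrix, axis=1, keepdims=True)
item_sim = pd.DataFrame(
    (item_matrix @ item_matrix.T) / (norms @ norms.T),
    index=interactions.columns,
    columns=interactions.columns,
)

def recommend(customer_id, top_n=2):
    """Rank unseen items by their summed similarity to the customer's purchases."""
    bought = interactions.loc[customer_id]
    scores = item_sim.loc[:, bought[bought == 1].index].sum(axis=1)
    scores = scores[bought == 0]  # drop items the customer already bought
    return scores.sort_values(ascending=False).head(top_n).index.tolist()

print(recommend("c3", top_n=1))
```

Note that nothing in this sketch uses customer attributes: the similarities come entirely from which items were bought together, which is why the approach still works when only the age group is known.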

## Exploring article features:

In [4]:
q = """
SELECT 
    *
FROM
    articles
LIMIT 5;
"""

read_query(q)

Query Executed!


Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


In [17]:
q="""
SELECT 
    COUNT(*) AS total_items
FROM
    articles;
"""

read_query(q)

Query Executed!


Unnamed: 0,total_items
0,105542


There are a total of 105542 items.

According to the metadata, the images folder contains one image per `article_id`, and the images are placed in subfolders named after the first three digits of the `article_id`. Also, not all `article_id` values have a corresponding image.
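As a rough illustration of that layout, the sketch below builds the expected path of an image from its `article_id`. The helpers `image_path` and `has_image` are hypothetical, and so is the assumption that file names are the `article_id` zero-padded to 10 digits with a `.jpg` extension; adjust `id_width` if the files on disk are named differently.

```python
import os

def image_path(article_id, images_root, id_width=10):
    """Return the expected image path for an article (hypothetical helper)."""
    # Assumption: file names are the article_id zero-padded to `id_width` digits.
    padded = str(article_id).zfill(id_width)
    # The subfolder is named after the first three characters of the padded ID.
    return os.path.join(images_root, padded[:3], padded + ".jpg")

def has_image(article_id, images_root, id_width=10):
    """Check whether the expected image file exists on disk."""
    return os.path.isfile(image_path(article_id, images_root, id_width))

print(image_path(108775015, "images"))
```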

Since our website will only list articles that have images, we first identify the articles with no corresponding image and remove them from our dataset.

Getting the `article_id` of items that have an image:

In [23]:
location_of_subfolders = "..//h-and-m-personalized-fashion-recommendations//images"

# List the subfolders of the images folder
subfolder_names = os.listdir(location_of_subfolders)

all_articles = []

# Collect the article_id of every image file in each subfolder
for name in subfolder_names:
    location_of_images = os.path.join(location_of_subfolders, name)
    image_names = os.listdir(location_of_images)
    all_articles.extend(int(x.replace('.jpg', '')) for x in image_names)

Total items with images available:

In [24]:
len(all_articles)

105100

So we have a total of 105100 items to choose from and recommend! The remaining 442 items (105542 - 105100) are missing images, so we will remove them from our training set.
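The figure of 442 is a difference of two counts, so it equals the number of articles without images only if every image on disk belongs to a row in `articles`. A set difference checks this directly. The sketch below uses small made-up stand-ins (`articles`, `ids_with_images`) rather than the real table and image list.

```python
import pandas as pd

# Hypothetical stand-ins for the `articles` table and the IDs found on disk.
articles = pd.DataFrame({"article_id": [1, 2, 3, 4, 5]})
ids_with_images = {1, 2, 4, 9}

table_ids = set(articles["article_id"].tolist())

# Articles listed in the table that have no image on disk.
missing_images = sorted(table_ids - ids_with_images)

# Images on disk whose ID does not appear in the table.
orphan_images = sorted(ids_with_images - table_ids)

print(missing_images, orphan_images)
```

If the second list is empty on the real data, the count of missing images matches the simple subtraction.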

In [18]:
all_items = """
SELECT 
    *
FROM
    articles
"""

items_all = read_query(all_items)

Query Executed!


In [25]:
new_articles_df = items_all[items_all['article_id'].isin(all_articles)]