# Project GEVPRO (H&M) - MAIN

We will work with the following dataset: 
Source: https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/overview
This dataset has been reduced to use less memory: only sales data from after 1 August 2020 (`t_dat` > 2020-08-01) is kept.

In [3]:
# Used libraries:
%matplotlib inline
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt
import matplotlib.image as mpimg

from sklearn.metrics.pairwise import cosine_similarity
from tqdm.notebook import tqdm

ModuleNotFoundError: No module named 'seaborn'

## 0.  Research questions:
We try to answer the following main research question: “Can we use the H&M dataset to explore data about its latest fashion trends and customer base?” This main research question is divided into three subcategories:
1. Sales --> What is popular?
2. Customer base --> Who is our customer?
3. Personalised fashion recommendation --> Can we recommend products based on what other customers bought?

## 1.  Reducing the dataset: 
This step was only run once, on the original dataset (transactions_train.csv) from Kaggle, to reduce the file size to 91.3 MB (instead of 3+ GB).

In [4]:
# df = pd.read_csv('data/transactions_train.csv', low_memory=False)
# df['t_dat'] = pd.to_datetime(df['t_dat'])
# mask = df['t_dat'] > '2020-08-01'
# df_reduced = df.loc[mask]
# df_reduced.to_csv('C:\\Users\\Nils\\Jupyter Notebooks\\Project_gevpro\\transactions_reduced.csv')
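
If the full file does not fit comfortably in memory, the same reduction can also be done in chunks. The block below is only a minimal sketch of that idea: it uses a tiny in-memory CSV with made-up rows instead of transactions_train.csv, and the names `raw_csv`, `cutoff` and `kept_chunks` are hypothetical.

```python
import io
import pandas as pd

# Tiny stand-in for transactions_train.csv (made-up rows, same column names).
raw_csv = io.StringIO(
    "t_dat,customer_id,article_id,price,sales_channel_id\n"
    "2020-07-30,a,108775015,0.0169,2\n"
    "2020-08-01,b,108775044,0.0254,1\n"
    "2020-08-02,a,110065001,0.0339,2\n"
    "2020-09-15,c,110065002,0.0508,2\n"
)

cutoff = pd.Timestamp("2020-08-01")  # ISO format avoids day/month ambiguity
kept_chunks = []

# With chunksize set, read_csv yields DataFrames of at most 2 rows each,
# so only one chunk has to be held in memory at a time.
for chunk in pd.read_csv(raw_csv, parse_dates=["t_dat"], chunksize=2):
    kept_chunks.append(chunk[chunk["t_dat"] > cutoff])

df_reduced = pd.concat(kept_chunks, ignore_index=True)
print(df_reduced)
```

On the real file a much larger `chunksize` would be used. Saving with `to_csv(..., index=False)` would also keep the old row index out of the reduced file, which is presumably where the `Unnamed: 0` column of the reduced transactions comes from.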

In [5]:
# Load the customer and article data and the reduced transaction (sales) dataset.
customers = pd.read_csv('data/customers.csv', low_memory=False)
articles = pd.read_csv('data/articles.csv', low_memory=False)
transactions = pd.read_csv('data/transactions_reduced.csv', low_memory=False)

## 2. Exploring the data

#### 2.1 Exploring customers.csv

The following table describes the columns of customers.csv:

| Column | Description |
| --- | --- |
| customer_id | A unique customer id |
| FN | ? |
| Active | Whether an account is active or not |
| club_member_status | The customer's H&M club membership status |
| fashion_news_frequency | How often the customer is notified about H&M news |
| age | The age of the customer |
| postal_code | The customer's postal code (stored as an anonymised string) |

In [6]:
customers.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,,,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...


In [7]:
customers.describe()

Unnamed: 0,FN,Active,age
count,476930.0,464404.0,1356119.0
mean,1.0,1.0,36.38696
std,0.0,0.0,14.31363
min,1.0,1.0,16.0
25%,1.0,1.0,24.0
50%,1.0,1.0,32.0
75%,1.0,1.0,49.0
max,1.0,1.0,99.0


#### 2.2 Exploring articles.csv
This section will show the information contained in the articles.csv file.

In [8]:
articles.head()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


In [9]:
articles.describe()

Unnamed: 0,article_id,product_code,product_type_no,graphical_appearance_no,colour_group_code,perceived_colour_value_id,perceived_colour_master_id,department_no,index_group_no,section_no,garment_group_no
count,105542.0,105542.0,105542.0,105542.0,105542.0,105542.0,105542.0,105542.0,105542.0,105542.0,105542.0
mean,698424600.0,698424.563378,234.861875,1009515.0,32.233822,3.206183,7.807972,4532.777833,3.171534,42.664219,1010.43829
std,128462400.0,128462.384432,75.049308,22413.59,28.086154,1.563839,5.376727,2712.692011,4.353234,23.260105,6.731023
min,108775000.0,108775.0,-1.0,-1.0,-1.0,-1.0,-1.0,1201.0,1.0,2.0,1001.0
25%,616992500.0,616992.5,252.0,1010008.0,9.0,2.0,4.0,1676.0,1.0,20.0,1005.0
50%,702213000.0,702213.0,259.0,1010016.0,14.0,4.0,5.0,4222.0,2.0,46.0,1009.0
75%,796703000.0,796703.0,272.0,1010016.0,52.0,4.0,11.0,7389.0,4.0,61.0,1017.0
max,959461000.0,959461.0,762.0,1010029.0,93.0,7.0,20.0,9989.0,26.0,97.0,1025.0


#### 2.3 Exploring transactions_reduced.csv
This section will show the information contained in the transactions_reduced.csv file.

In [10]:
transactions.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2020-08-02,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,853474004,0.033881,2
1,2020-08-02,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,865594003,0.013542,2
2,2020-08-02,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,865699001,0.025407,2
3,2020-08-02,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,822959001,0.050831,2
4,2020-08-02,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,888024005,0.084729,2


In [11]:
transactions.describe()

Unnamed: 0,article_id,price,sales_channel_id
count,1993503.0,1993503.0,1993503.0
mean,801188900.0,0.02957675,1.67581
std,130548500.0,0.02038062,0.4680714
min,108775000.0,0.0003220339,1.0
25%,754238000.0,0.0169322,1.0
50%,845790000.0,0.02540678,2.0
75%,889550000.0,0.03388136,2.0
max,956217000.0,0.5067797,2.0


### 3. Exploring H&M's customer base

In the customer section of this project we have explored the following research questions: 
* What is the most frequent age of H&M customers? 
* What is the distribution of the ages of H&M customers?
* What is the most frequent postal code? 
* How many customers have a club membership?
* What kind of fashion news frequency is most popular with what club membership?
* How many people receive fashion news? 
* What is the spread of fashion news in contrast to club member status? 
* What is the relation between receiving fashion news and the customers age?

We have expressed the data in different ways, varying from plots to tables. With this research, we try to create a better overview of the customers shopping at H&M.

In [12]:
df = pd.read_csv('data/customers.csv', low_memory=False)
df = df.dropna(subset=['age', 'postal_code'])  # drop rows with missing values in these columns
df.shape

(1356119, 7)

#### 3.1 What is the most frequent age of H&M customers?

In [13]:
# What are the 10 most frequent ages?
df["age"].value_counts().head(10)

# The most frequent age is 21, and the ten most frequent ages all fall between 20 and 29.

21.0    67530
24.0    56124
20.0    55196
25.0    54989
23.0    54867
26.0    53658
22.0    51869
27.0    49134
28.0    44294
29.0    40697
Name: age, dtype: int64

#### 3.2 What is the distribution of the ages of H&M customers? 

In [14]:
# What is the distribution of age?
sns.set_theme(style='darkgrid')
ax = sns.histplot(x="age", data=df, color='green', bins=40)

# The age distribution has two peaks: one around 20 to 30 and one around 45 to 55.

NameError: name 'sns' is not defined

#### 3.3 What is the most frequent postal code? 

In [None]:
# What are the 10 most frequent postal codes?
df['postal_code'].value_counts().head(10)

# The most frequent postal code (an anonymised value) occurs 120303 times.

#### 3.4 How many customers have a club membership?

In [None]:
# How many customers have a club membership?
sns.set_theme(style='darkgrid')
ax = sns.histplot(x="club_member_status", data=df, color='green', bins=40)

# Most customers have a club membership.

#### 3.5 What kind of fashion news frequency is most popular with what club membership?

In [None]:
# Unify all values that mean "no news" but are stored under different names
df.loc[~df['fashion_news_frequency'].isin(['Regularly', 'Monthly']), 'fashion_news_frequency'] = 'None'

In [None]:
# Show the new values for fashion news frequency
df['fashion_news_frequency'].unique()

In [None]:
# Club member status against fashion news frequency
pd.crosstab(df['club_member_status'], df['fashion_news_frequency'])

# No customers who left the club receive monthly news updates.
# Most customers who left the club do not receive fashion news at all.
# Most customers have an active status but do not receive news updates.
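
Because the membership groups differ a lot in size, raw counts can be hard to compare. One option is to normalise the crosstab per row, so each row shows the share of that membership status receiving each news frequency. The block below is a small sketch on made-up data. The frame `toy` and the statuses other than `ACTIVE` are hypothetical stand-ins, not values taken from the output above.

```python
import pandas as pd

# Made-up customers: only the two columns used in the crosstab.
toy = pd.DataFrame({
    "club_member_status": ["ACTIVE", "ACTIVE", "ACTIVE", "PRE-CREATE", "LEFT CLUB", "ACTIVE"],
    "fashion_news_frequency": ["None", "Regularly", "None", "None", "None", "Regularly"],
})

# normalize="index" divides every row by its row total,
# so each row sums to 1 and rows of different sizes become comparable.
shares = pd.crosstab(
    toy["club_member_status"],
    toy["fashion_news_frequency"],
    normalize="index",
)
print(shares.round(2))
```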

In [None]:
# Distribution of fashion_news_frequency
sns.set_theme(style='darkgrid')
ax = sns.histplot(x="fashion_news_frequency", data=df, color='green', bins=40)

# Most customers do not receive fashion news.

In [None]:
# A visual representation of the fashion news frequency and the club member status
fashion_status = sns.countplot(y='fashion_news_frequency', hue='club_member_status', data=df, palette="Set3")

#### 3.6 What is the relation between receiving fashion news and the customers age?

In [None]:
# Fashion news frequency in relation to age (counts per age)
pd.crosstab(df['fashion_news_frequency'], df['age'], margins=True)

In [None]:
# Fashion news frequency in relation to age (one plot per frequency)
sns.displot(data=df, x="age", hue="fashion_news_frequency", col="fashion_news_frequency")

# The largest group of customers does not receive fashion news.
# Almost no customers receive fashion news monthly.

### 4. Exploring H&M's sales data to figure out the latest trends
This section will try to answer the following questions:
1. What are the most sold articles?
   - What is the most sold article?
2. What are the most sold types of articles?
3. What are the worst selling articles?
   - What is the least sold article?
4. What are the worst selling types of articles?
5. What color is the most popular?
6. What was the most successful week in sales?
7. Do expensive articles (or categories) sell better than cheaper ones?

In [None]:
# Load the customer and article data and the reduced transaction (sales) dataset.
customers = pd.read_csv('data/customers.csv', low_memory=False)
articles = pd.read_csv('data/articles.csv', low_memory=False)
transactions = pd.read_csv('data/transactions_reduced.csv', low_memory=False)

#### 4.1 What are the most sold articles?

In [None]:
# What are the most sold articles?
transactions["article_id"].value_counts().head(10)

#### 4.2 What is the most sold article?

In [None]:
# The most sold article
import os
%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.image as mpimg

img = mpimg.imread(os.getcwd()+'/photos/'+str(751471001).zfill(10)+'.jpg')
imgplot = plt.imshow(img)
plt.show()

# The most sold article is a pair of black trousers.

#### 4.3 What are the most sold types of articles?

In [None]:
# What are the most sold types of articles?
df = pd.merge(transactions,articles, on="article_id", how="inner")
df["product_type_name"].value_counts().head(10).plot.bar(color="lightblue")

plt.title("Top 10 most sold types of items")
plt.xlabel("Type of product")
plt.ylabel("Amount of sales")

# The most sold type of item appears to be trousers.

In [None]:
# What are the worst selling articles?
transactions['article_id'].value_counts().tail(10)

# These are 10 of the worst selling items in the transactions file.
# Of course there will also be items that aren't sold at all.
# These articles understandably don't appear in the transaction file.
# The amount of articles without sales will most likely be far too many to give as a reasonable output.
# Therefore we have decided to not consider items with 0 sales.
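
Even without listing them, the articles with no sales can be counted by comparing the article list with the sales counts. The block below is a hedged sketch on made-up data; `articles_toy` and `transactions_toy` are hypothetical stand-ins for the real `articles` and `transactions` frames.

```python
import pandas as pd

# Made-up catalogue of four articles and five sales of two of them.
articles_toy = pd.DataFrame({
    "article_id": [1, 2, 3, 4],
    "prod_name": ["Strap top", "Bra", "Trousers", "Hair clip"],
})
transactions_toy = pd.DataFrame({"article_id": [1, 1, 3, 3, 3]})

# Sales per article that appears in the transactions.
sales_per_article = transactions_toy["article_id"].value_counts()

# Reindex over the full catalogue: articles without any sale get a count of 0.
sales_all = sales_per_article.reindex(articles_toy["article_id"], fill_value=0)

unsold = sales_all[sales_all == 0]
print(f"{len(unsold)} of {len(sales_all)} articles have no sales")
```

Applied to the real data, `len(unsold)` would give the number of articles without sales in the selected period without printing every one of them.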

#### 4.4 What are the least sold articles?

In [None]:
# One of the least sold articles
import os
%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.image as mpimg

img = mpimg.imread(os.getcwd()+'/photos/'+str(865792012).zfill(10)+'.jpg')
imgplot = plt.imshow(img)
plt.show()

#### 4.5 What are the worst selling types of articles?

In [None]:
# What are the least sold types of articles?
df = pd.merge(transactions,articles, on="article_id", how="inner")
df["product_type_name"].value_counts().tail(10).plot.bar(color="orange")

plt.title("Top 10 least sold types of items")
plt.xlabel("Type of product")
plt.ylabel("Amount of sales")

# The least sold types of items appear to be the zipper head, sewing kit and bumbag.

#### 4.6 What color is the most popular?

In [None]:
# What color is the most popular?
df = pd.merge(transactions,articles, on="article_id", how="inner")
df["colour_group_name"].value_counts().head(5).plot.bar(color="lightpink")

plt.title("Top 5 most popular colors according to sales")
plt.xlabel("Product color")
plt.ylabel("Amount of sales")

# It turns out that black is by far the most popular color.

#### 4.7 What was the most successful week in sales?

In [None]:
# How do the weekly sales compare?
transactions['t_dat'] = pd.to_datetime(transactions['t_dat'], errors='coerce')
week = transactions['t_dat'].dt.isocalendar().week  # ISO week number of each sale
week.value_counts().sort_index().plot(color="purple")  # sort by week so the line runs chronologically

plt.title("Comparison of sales per week in 2020")
plt.xlabel("Week")
plt.ylabel("Amount of sales")

# Week 32 was the most successful in sales.
# This week lasted from the 3rd until the 9th of August 2020.

### 5. Exploring H&M's dataset to build a recommendation system
#### Building an item-based recommendation system (using cosine similarity)
In this chapter we use an item-based collaborative filtering approach to recommend items to users. The item recommendations can be used to suggest other items to users while they are shopping for products. The main idea is to find products that are frequently bought together.

We first reduce the dataset even more. The current approach of using cosine similarity could not be used on the entire dataset, since the matrix would become too large to fit into memory. We reduced the dataset to only contain data after 09/01/2020 and selected the first 20 000 rows of customer and article combinations from H&M sales. This probably impacts accuracy, since a lot of previous transactional data is not taken into account.
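
To make the idea concrete, here is a minimal sketch of item-based cosine similarity on a made-up purchase matrix. The customers, the article names and the frame `purchases` are hypothetical, and the similarity is computed directly with NumPy from the definition (dot product divided by the product of the vector lengths) rather than with scikit-learn.

```python
import numpy as np
import pandas as pd

# Made-up purchases: rows are customers, columns are articles, 1 = bought.
purchases = pd.DataFrame(
    [[1, 1, 0],
     [1, 1, 0],
     [0, 1, 1],
     [0, 0, 1]],
    index=["c1", "c2", "c3", "c4"],
    columns=["bikini_bottom", "bikini_top", "hair_clip"],
)

# Transpose so that every row is one article described by its buyers.
item_vectors = purchases.T.to_numpy(dtype=float)

# Scale every article vector to length 1; the dot products of the
# scaled vectors are then exactly the cosine similarities.
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
unit = item_vectors / norms
similarity = pd.DataFrame(
    unit @ unit.T, index=purchases.columns, columns=purchases.columns
)

# Rank the other articles by similarity to the bikini bottom.
ranked = (
    similarity["bikini_bottom"]
    .drop("bikini_bottom")
    .sort_values(ascending=False)
)
print(ranked)
```

Articles bought by the same customers get a similarity close to 1, and articles that are never bought by the same customer get 0. The matrix has one row and one column per article, which is why it grows quickly as more articles are included.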

Used sources for this chapter:
* https://heartbeat.comet.ml/recommender-systems-with-python-part-i-content-based-filtering-5df4940bd831
* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
* https://en.wikipedia.org/wiki/Cosine_similarity
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html
* https://www.datasource.ai/uploads/6b86b1630562b323a26143f90d97fe08.html

In [None]:
customers = pd.read_csv('data/customers.csv', low_memory=False)
articles = pd.read_csv('data/articles.csv', low_memory=False)
transactions = pd.read_csv('data/transactions_reduced.csv', low_memory=False)

In [None]:
# Reducing the dataset even further
articles = pd.read_csv('data/articles.csv', low_memory=False)
transactions = pd.read_csv('data/transactions_reduced.csv', low_memory=False)
transactions['t_dat'] = pd.to_datetime(transactions['t_dat'])
mask = transactions['t_dat'] > '2020-09-01'
transactions = transactions.loc[mask]
transactions.head()

In [None]:
# Select only the relevant columns and count how often each customer bought each article
collaborative_filtering_df = transactions[['customer_id', 'article_id']]
collaborative_filtering_df = collaborative_filtering_df.groupby(['article_id', 'customer_id']).size().reset_index(name='quantity')
collaborative_filtering_df = collaborative_filtering_df.sort_values(by=['customer_id'])
collaborative_filtering_df.head()

In [None]:
# Select only the first 20 000 rows and represent them as a customer x article matrix
customer_article_matrix = collaborative_filtering_df[:20000].pivot_table(
    index='customer_id',
    columns='article_id',
    values='quantity',
    aggfunc='sum')

customer_article_matrix

In [None]:
# Replace the NaNs with 0 and every purchase count above 0 with 1
customer_article_matrix = customer_article_matrix.fillna(0)
customer_article_matrix = (customer_article_matrix > 0).astype(int)

In [None]:
"""
We transpose (T) the dataframe, which swaps its rows and columns.
We do this so that the index now represents articles and the columns represent customers.
Afterwards we apply cosine similarity to the transposed matrix.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
"""

article_similarity_matrix = pd.DataFrame(cosine_similarity(customer_article_matrix.T))
article_similarity_matrix

In [None]:
# The similarity matrix has no labels yet, so we set both the index and the column names to the corresponding article ids.
article_similarity_matrix.columns = customer_article_matrix.T.index
article_similarity_matrix['article_id'] = customer_article_matrix.T.index
article_similarity_matrix = article_similarity_matrix.set_index('article_id')

In [None]:
"""
This code selects only the top 5 results and merges in information from the articles csv to get additional details on the products.
I decided to use a bikini bottom (882759003) because, looking at other Kaggle notebooks, I saw that bikini bottoms are usually bought
with a matching top. So I thought it was a good way to test our model.
"""
bikini_bottom_id = 882759003
# Skip position 0: that is the article itself, with similarity 1
similar_items = article_similarity_matrix.loc[bikini_bottom_id].sort_values(ascending=False).reset_index()[1:6]
similar_items = similar_items.set_axis(['article_id', 'similarity'], axis=1)
similar_items_detailed = pd.merge(similar_items, articles, on='article_id', how='inner')
similar_items_detailed

In [None]:
# This code displays the article (bikini bottom) that is used for the prediction
import os
img = mpimg.imread(os.path.join(os.getcwd(), 'photos', str(bikini_bottom_id).zfill(10) + '.jpg'))
imgplot = plt.imshow(img)
plt.show()

In [None]:
# These are the results for the prediction visualized.
article_ids = similar_items_detailed['article_id'].to_list()

fig = plt.figure(figsize=(20, 5))
columns = 5
rows = 1
for i in range(1, columns*rows +1):
    img = mpimg.imread(os.path.join(os.getcwd(), 'photos', str(article_ids[i-1]).zfill(10) + '.jpg'))
    fig.add_subplot(rows, columns, i)
    plt.imshow(img)
plt.show()

As we can see, the method shows promising results for this clothing item. It recommends the matching bikini top and shows alternative tops that are often bought with it. It also recommends hair clips, which, surprisingly, people also buy when shopping for bikini bottoms. This method needs further investigation and evaluation before it can be put into use. How to implement it efficiently also needs more thought, but I think that is out of scope for this assignment.