# Lecture 2: EDA

In [65]:
import json

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [66]:
articles = pd.read_csv('../data/articles.csv')
customers = pd.read_csv('../data/customers.csv')
transactions = pd.read_csv('../data/transactions_train.csv')
dataset_dict = {'articles': articles, 'customers': customers, 'transactions': transactions}

In [67]:
have_na_values = []
for name,datafile in dataset_dict.items():
    for column in datafile.columns:
        if datafile[column].isna().sum() > 0:
            have_na_values.append([name,column])
print("Columns containing empty cells:")
print(have_na_values)

Columns containing empty cells:
[['articles', 'detail_desc'], ['customers', 'FN'], ['customers', 'Active'], ['customers', 'club_member_status'], ['customers', 'fashion_news_frequency'], ['customers', 'age']]


In [68]:
for datafile_key, column in have_na_values:
    table = dataset_dict[datafile_key]
    missing_pct = round(table[column].isna().sum() / table.shape[0] * 100, 2)
    print(f'{missing_pct}% of {datafile_key}, {column} has missing values')

0.39% of articles, detail_desc has missing values
65.24% of customers, FN has missing values
66.15% of customers, Active has missing values
0.44% of customers, club_member_status has missing values
1.17% of customers, fashion_news_frequency has missing values
1.16% of customers, age has missing values


Only 0.39% of the articles are missing a detailed description, which seems acceptable to me.
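The per-column percentages above can also be computed in a single pass, since the mean of a boolean `isna()` mask is the fraction of missing cells. The following is a minimal sketch; the helper name `missing_share` and the toy frame are hypothetical and only stand in for the real tables.

```python
import pandas as pd


def missing_share(df):
    """Return the percentage of missing cells for each column that has any."""
    share = df.isna().mean() * 100  # mean of a boolean mask = fraction missing
    return share[share > 0].round(2)


# Hypothetical toy frame standing in for the customers table.
toy = pd.DataFrame({
    "FN": [1.0, None, None, 1.0],
    "age": [20.0, 31.0, None, 45.0],
    "customer_id": [1, 2, 3, 4],
})
result = missing_share(toy)
print(result)
```

On the real data this would be called as `missing_share(customers)`.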

In [69]:
print("Unique values in customer['FN'] : "  + str(customers['FN'].unique()))
print("Unique values in customer['Active'] : "  + str(customers['Active'].unique()))

Unique values in customer['FN'] : [nan  1.]
Unique values in customer['Active'] : [nan  1.]


In [70]:
print("total customer count: " + str(customers.shape[0]))
print('customer count where FN is 1 or missing: ' + str(customers[(customers['FN'] == 1) | customers['FN'].isna()].shape[0]))
print('customer count where Active is 1 or missing: ' + str(customers[(customers['Active'] == 1) | customers['Active'].isna()].shape[0]))

total customer count: 1371980
customer count where FN is 1 or missing: 1371980
customer count where Active is 1 or missing: 1371980


FN indicates whether the customer follows fashion news. The missing values seem intended: 1 means that the user follows FN, and a missing value means they don't. There are no other values. Similarly, 'Active' also seems to be a binary value where an empty value is used instead of zero. Again, no other values are present. In both cases, the missing values can be replaced by 0.

The missing ages could be replaced by the mean of the other ages.
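The replacements suggested above could look roughly like this. It is a sketch rather than a final preprocessing step: the helper name `fill_customer_gaps` and the toy frame are hypothetical, and mean imputation for age is only one of several reasonable choices.

```python
import pandas as pd


def fill_customer_gaps(customers):
    """Return a copy with FN/Active gaps set to 0 and age gaps set to the mean age."""
    filled = customers.copy()
    # A missing FN or Active value is interpreted as "not set", i.e. 0.
    filled[["FN", "Active"]] = filled[["FN", "Active"]].fillna(0)
    # mean() skips missing values, so this is the mean of the known ages.
    filled["age"] = filled["age"].fillna(filled["age"].mean())
    return filled


# Hypothetical toy frame standing in for the customers table.
toy = pd.DataFrame({
    "FN": [1.0, None, None],
    "Active": [1.0, None, 1.0],
    "age": [20.0, None, 40.0],
})
filled = fill_customer_gaps(toy)
print(filled)
```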

In [71]:
print("Unique values in customer['club_member_status'] : "  + str(customers['club_member_status'].unique()))
print("Unique values in customer['fashion_news_frequency'] : "  + str(customers['fashion_news_frequency'].unique()))

Unique values in customer['club_member_status'] : ['ACTIVE' nan 'PRE-CREATE' 'LEFT CLUB']
Unique values in customer['fashion_news_frequency'] : ['NONE' 'Regularly' nan 'Monthly' 'None']


My first assumption was that a missing club member status was intentional and indicated that the user was not a member. However, only 0.44% of users are missing a value, so the missing values are probably not intentional. Even so, I think it's safe to assume that if H&M cannot tell whether a user is a club member, that user probably doesn't get any club benefits (or whatever being a club member actually means) and is functionally not a club member. Either replace the missing values with 'PRE-CREATE', or add a new value and contact those users about their member status.

If H&M doesn't know whether a user is receiving fashion news, I would assume that the user is not being sent news, so the missing value can be replaced by 'NONE'.

Additionally, the 'None' and 'NONE' values in fashion_news_frequency should be merged into just 'NONE'.
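A possible way to apply these clean-up rules is sketched below. The helper name `clean_membership` and its `missing_status` parameter are hypothetical; 'PRE-CREATE' is used as the default only because it is one of the two options mentioned above.

```python
import pandas as pd


def clean_membership(customers, missing_status="PRE-CREATE"):
    """Return a copy with merged 'NONE' spellings and filled membership gaps."""
    cleaned = customers.copy()
    news = cleaned["fashion_news_frequency"]
    # Merge the two spellings first, then treat unknown as "no news".
    cleaned["fashion_news_frequency"] = news.replace("None", "NONE").fillna("NONE")
    cleaned["club_member_status"] = cleaned["club_member_status"].fillna(missing_status)
    return cleaned


# Hypothetical toy frame standing in for the customers table.
toy = pd.DataFrame({
    "club_member_status": ["ACTIVE", None, "LEFT CLUB", "PRE-CREATE"],
    "fashion_news_frequency": ["NONE", "None", None, "Monthly"],
})
cleaned = clean_membership(toy)
print(cleaned)
```

Passing a different `missing_status` (for example a new 'UNKNOWN' label) would implement the second option instead.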

In [72]:
# https://stackoverflow.com/a/49966702
for name, datafile in dataset_dict.items():
    for column in datafile.columns:
        counts = datafile[column].value_counts()
        lowest_count = counts.values[-1]
        lowest_key = counts.keys()[-1]
        print( str(name) + ', ' + str(column) + ': "' + str(lowest_key) + '" occurs ' + str(lowest_count) + ' times, which is ' + str((lowest_count/datafile[column].shape[0])*100) + '% of the time')
        ties = 1
        index = -2
        try:
            while counts.values[index] == lowest_count and ties < 1000:
                ties += 1
                index -= 1
        except IndexError:
            pass
        if ties > 1:
            if ties == 1000:
                print('\t At least ' + str(ties) + ' ties.')
            else:
                print('\t Exactly ' + str(ties) + ' ties.')
        else:
            print('\t No ties')

articles, article_id: "959461001" occurs 1 times, which is 0.0009474900987284683% of the time
	 At least 1000 ties.
articles, product_code: "959461" occurs 1 times, which is 0.0009474900987284683% of the time
	 At least 1000 ties.
articles, prod_name: "Lounge dress" occurs 1 times, which is 0.0009474900987284683% of the time
	 At least 1000 ties.
articles, product_type_no: "483" occurs 1 times, which is 0.0009474900987284683% of the time
	 Exactly 12 ties.
articles, product_type_name: "Clothing mist" occurs 1 times, which is 0.0009474900987284683% of the time
	 Exactly 12 ties.
articles, product_group_name: "Fun" occurs 2 times, which is 0.0018949801974569365% of the time
	 No ties
articles, graphical_appearance_no: "1010029" occurs 8 times, which is 0.007579920789827746% of the time
	 No ties
articles, graphical_appearance_name: "Hologram" occurs 8 times, which is 0.007579920789827746% of the time
	 No ties
articles, colour_group_code: "80" occurs 14 times, which is 0.0132648613821985

Knowing the least common value isn't very interesting for most columns: it makes sense that unique values like IDs each occur only once, and it makes sense that in a binary column like customer FN even the least common value occurs often and has no ties. I will discuss a few of the more interesting columns. These are the ones where a rarely occurring value could be caused by a typo, like product_type_name in articles.

In [73]:
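def get_lowest_ties(datafile_name, column_name):
    """Print every value tied for least common in a column (at most 1000)."""
    print('Least occurring in ' + datafile_name + ', ' +  column_name + ': ')
    counts = dataset_dict[datafile_name][column_name].value_counts()
    lowest_count = counts.values[-1]
    ties = 0
    index = -1
    try:
        # value_counts() sorts descending, so walk backwards from the end.
        while counts.values[index] == lowest_count and ties < 1000:
            print("\t " + str(counts.keys()[index]))
            ties += 1  # was never incremented, so the 1000 cap had no effect
            index -= 1
    except IndexError:
        # Every value in the column is tied for least common.
        pass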

In [74]:
get_lowest_ties('articles','product_type_name')

Least occurring in articles, product_type_name: 
	 Clothing mist
	 Blanket
	 Cushion
	 Headband
	 Keychain
	 Washing bag
	 Sewing kit
	 Towel
	 Wood balls
	 Bra extender
	 Pre-walkers
	 Bumbag


The least common product types turn out to be niche items rather than typos. Capitalisation is consistent.
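The tied least-common values can also be collected without an explicit loop by comparing every count to the minimum count. This is a minimal sketch; the helper name `least_common_values` and the toy series are hypothetical.

```python
import pandas as pd


def least_common_values(series):
    """Return all values that are tied for the lowest occurrence count."""
    counts = series.value_counts()
    # Boolean mask keeps every value whose count equals the minimum count.
    return list(counts[counts == counts.min()].index)


# Hypothetical toy column standing in for articles['product_type_name'].
toy = pd.Series(["Dress", "Dress", "Towel", "Bumbag", "Dress"])
rare = least_common_values(toy)
print(rare)
```

Unlike the loop version, this returns the values instead of printing them, which makes it easier to inspect the matching rows afterwards.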

In [75]:
table = dataset_dict['articles']
print("Both articles in the 'Fun' category: ")
print(table[table['product_group_name'] == 'Fun'][['prod_name','detail_desc']].values)

Both articles in the 'Fun' category: 
[['HLW MASK OWN'
  'Scary glow-in-the-dark fancy dress mask in plastic foam in the shape of a skull with holes for the eyes and an elastic strap with a hook and loop fastening at the back.']
 ['HLW Bucket'
  'Plastic bucket in a spooky shape with a handle at the top.']]


In [76]:
colour_names = dataset_dict['articles']['perceived_colour_master_name'].values
for candidate in ['Bluish Green', 'Blueish Green', 'blueish green', 'Blueish green', 'Blue Green']:
    print(candidate in colour_names)

True
False
False
False
False


"Bluish" is apparently the correct spelling, and there are no other perceived_colour_master_name values with a similar name (at least none that I could come up with).
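Instead of guessing alternative spellings by hand, near-duplicates could be searched for with fuzzy string matching. The sketch below uses `difflib.get_close_matches` from the standard library; the helper name `similar_names`, the cutoff of 0.8, and the sample list of colour names are assumptions for illustration, not values taken from the dataset.

```python
import difflib


def similar_names(candidate, known_names, cutoff=0.8):
    """Return the known names whose spelling is close to `candidate`."""
    return difflib.get_close_matches(candidate, known_names, n=5, cutoff=cutoff)


# Hypothetical sample; on the real data this would be
# articles['perceived_colour_master_name'].dropna().unique().
known = ["Bluish Green", "Blue", "Green", "Turquoise", "Khaki green"]
matches = similar_names("Blueish Green", known)
print(matches)
```

Running this for every unique value against all the others would flag pairs such as 'None' and 'NONE' once the case is normalised, at the cost of some false positives.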

According to [a comment by the competition organizer](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/306016#1680549), the provided transaction price values do not represent any real currency. [In the same thread](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/306016#1680549), it is explained that the price has been scaled for privacy reasons. According to [a user's Kaggle post](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/discussion/310496), the prices in the dataset are the true prices in euros divided by 590. This could be the scaling referred to by the organizers.

In [77]:
# From https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/discussion/310496#1709991
from fractions import Fraction
import numpy as np
smallest_price_diff = min(np.diff(np.sort(transactions.price.unique())))
Fraction(smallest_price_diff).limit_denominator()

Fraction(1, 59000)

The smallest price difference is 1/59000. Assuming this smallest step corresponds to 1 cent, 59000 cents make up 1 price unit, so one price unit is 590 euros.
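The arithmetic behind this conclusion can be checked with a few lines. This is a sketch under the same assumption as the forum post, namely that the smallest price step equals one euro cent; the names `EUR_PER_UNIT` and `scaled_price_to_eur` are hypothetical.

```python
from fractions import Fraction

# Assumption from the forum post: one price unit is 590 euros.
EUR_PER_UNIT = 590


def scaled_price_to_eur(price):
    """Convert a scaled dataset price to euros, rounded to whole cents."""
    return round(price * EUR_PER_UNIT, 2)


# The smallest observed price step, expressed in euros.
smallest_step = Fraction(1, 59000)
step_in_eur = smallest_step * EUR_PER_UNIT
print(step_in_eur)  # Fraction(1, 100), i.e. one cent

example_eur = scaled_price_to_eur(29.99 / 590)
print(example_eur)
```

If the assumption holds, converted prices should land on typical retail price points, which is a quick sanity check to run on `transactions.price`.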

While I did not analyze the images, it seems some of them are [mislabeled](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/discussion/324232). If you want to analyze images without using the ~30GB dataset, consider using the [reduced-resolution images posted on the Kaggle forums](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/discussion/306152). When you want to find the image matching an article id, make sure to [read the article id correctly](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/discussion/307390) (don't remove leading zeros).
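The leading-zero problem arises because `pd.read_csv` parses article ids as integers by default; passing `dtype={'article_id': str}` avoids that. If the ids have already been read as integers, they can be padded back, as in the sketch below. It assumes that article ids are 10 characters long and that each image sits in a subfolder named after the first three characters of its id; the helper name `image_path` and the `image_root` default are hypothetical.

```python
def image_path(article_id, image_root="images"):
    """Build the image path for an article id, restoring dropped leading zeros."""
    # Assumption: ids are 10 characters wide once the leading zeros are restored.
    padded = str(article_id).zfill(10)
    # Assumption: images are grouped in folders named after the first 3 characters.
    return f"{image_root}/{padded[:3]}/{padded}.jpg"


path_from_int = image_path(108775015)
path_from_str = image_path("0108775015")
print(path_from_int)
```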