# Import packages

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# Import data

In [2]:
data  = pd.read_csv('../input/Sales_Transactions_Dataset_Weekly.csv')

# Have a look

In [3]:
data.head()

In [4]:
data.describe()

In [5]:
data

No obvious indications of bad or missing values here. We could just look through the whole frame, since it has only 811 rows, but that's not really a scalable approach.


In [6]:
data.isnull().values.any()

It seems like this particular data set doesn't have any missing values. That's not to say there isn't any noise in the data, just that none of the elements are undefined.
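
As a small illustration of the check above, here is a sketch on a hypothetical toy frame (the column names `W0` and `W1` are made up). It shows the single yes/no test alongside a per-column count, which can be handy when the answer is "yes" and you want to know where the gaps are.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the sales table.
toy = pd.DataFrame({"W0": [1.0, 2.0, np.nan], "W1": [4.0, 5.0, 6.0]})

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())

# Number of missing entries in each column.
missing_per_column = toy.isnull().sum()

print(has_missing)
print(missing_per_column[missing_per_column > 0])
```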

# General approach

First we'll put together a rough method for finding groups of products that are more related to one another. That is, related in terms of their sales patterns, even if they're from completely different categories. 
As for the time scale: to start with, we'll just look at the overall pattern over the year for which we have data. A potential challenge is that with such a limited data set, we only have one observation per calendar week. It may be hard to separate seasonal, monthly, and weekly patterns from noise.

# Feature selection
Having had a look at the columns above, we see that there are product counts per week as well as normalized sales. Without making any assumptions about the types of products in the list, we don't know that the per-item sales counts are a valid way to compare products. Some products tend to be purchased several at a time, while others only one at a time. We'll assume that their relative variations are more important, and keep only the normalized columns for now.

In [7]:
data_norm = data.copy()

data_norm[['Normalized {}'.format(i) for i in range(0,52)]].head()

In [8]:
data_norm = data_norm[['Normalized {}'.format(i) for i in range(0,52)]]

In [9]:
data_norm.head()
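
To illustrate why normalized values are easier to compare than raw counts, here is a minimal sketch. It assumes a per-product min-max rescaling; how the dataset's own `Normalized` columns were produced is not confirmed here, and the product names and counts are hypothetical.

```python
import pandas as pd

# Hypothetical weekly unit counts for two products with very different volumes.
counts = pd.DataFrame(
    [[10, 20, 30, 20], [100, 200, 300, 200]],
    columns=["W0", "W1", "W2", "W3"],
    index=["P_small", "P_large"],
)

row_min = counts.min(axis=1)
row_max = counts.max(axis=1)

# Rescale each row to [0, 1]; axis=0 aligns the per-row min/max on the row index.
normalized = counts.sub(row_min, axis=0).div(row_max - row_min, axis=0)
print(normalized)
```

Both rows end up with the same shape after rescaling, even though their raw volumes differ by a factor of ten. A product with constant sales would give a zero denominator here, so a real pipeline would need to handle that case.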

# Weekly sales differences ("velocities")
To understand whether products are related, we probably want to know if their sales vary together week to week. To approach this question, it might be helpful to calculate the "sales velocities", or the difference matrix showing how much the sales went up or down in each week. The assumption is that products with similar sales fluctuations are similar, and should be restocked around the same time. *In reality, this should be subject to constraints on the restock-order volume and frequency!*

In [101]:
data_norm.diff(axis=1).head()

In [102]:
# Drop the now nonsense first column
data_norm_diff = data_norm.diff(axis=1).drop('Normalized 0', axis=1).copy()

In [103]:
data_norm_diff.head()
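
To make the differencing step concrete, here is a small sketch on a hypothetical three-week frame. `diff(axis=1)` subtracts each column from the one to its right, so the first column has nothing to compare against and comes out as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical normalized sales for two products over three weeks.
toy = pd.DataFrame(
    [[0.2, 0.5, 0.4], [0.9, 0.9, 0.1]],
    columns=["Normalized 0", "Normalized 1", "Normalized 2"],
)

velocities = toy.diff(axis=1)        # week-over-week change; first column is NaN
velocities = velocities.iloc[:, 1:]  # drop the undefined first column
print(velocities)
```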

In [104]:
import matplotlib.pyplot as plt
plt.scatter(range(0,51), data_norm_diff.values[0])
plt.scatter(range(0,51), data_norm_diff.values[10])


In [14]:
data_norm_diff.head()

## Example relatedness test for Product 1
Having gotten the differences for all of the products, across the whole year, we can now see how each product varied with the others on a weekly basis.

In [15]:
data_norm_diff_prod1 =  data_norm_diff.values - data_norm_diff.values[0,:]

In [16]:
data_norm_diff_prod1

In [17]:
data_norm_diff_prod1_sum = data_norm_diff_prod1.sum(axis=1)

Let's plot the "errors" for Product 1, relative to all of the other products.

In [105]:
plt.scatter(range(0,811),data_norm_diff_prod1_sum)

In [18]:
print(data_norm_diff_prod1_sum.shape)
data_norm_diff_prod1_sum

In [19]:
prod1_velocities = pd.DataFrame(data_norm_diff_prod1_sum**2, columns=["Vel_total_diff"])

In [20]:
prod1_velocities.sort_values(by="Vel_total_diff")
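
One property of the score above is worth keeping in mind: summing weekly differences telescopes to "last week minus first week", so squaring the summed error mostly compares net change over the year. The sketch below, on hypothetical random data, checks this and contrasts it with a sum of squared weekly errors. That second measure is shown only as a possible alternative, not as what this notebook uses.

```python
import numpy as np

rng = np.random.default_rng(0)
sales = rng.random((4, 6))       # hypothetical: 4 products, 6 weeks
diffs = np.diff(sales, axis=1)   # weekly differences, shape (4, 5)

# Summing weekly differences telescopes to (last week - first week).
net_change = sales[:, -1] - sales[:, 0]
telescopes = np.allclose(diffs.sum(axis=1), net_change)

errors = diffs - diffs[0]                   # discrepancy relative to product 0
squared_of_sum = errors.sum(axis=1) ** 2    # sum first, then square
sum_of_squares = (errors ** 2).sum(axis=1)  # square each week, then sum

print(telescopes)
print(squared_of_sum)
print(sum_of_squares)
```

With the squared-of-sum form, weekly mismatches of opposite sign can cancel; the sum-of-squares form penalizes every weekly mismatch.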

In [21]:
def getWeeklyDiffs(products_sales_table):
    # Week-over-week differences; the first column is all NaN, so drop it.
    return products_sales_table.diff(axis=1).drop(products_sales_table.columns[0], axis=1).copy()

def getProductErrors(product_index, products_diffs):
    # Subtract the chosen product's weekly differences from every product's row.
    return products_diffs - products_diffs.iloc[product_index]

def getTotalSquaredError(per_product_error):
    # Sum each product's weekly errors, then square the total.
    return pd.DataFrame(per_product_error.sum(axis=1)**2, columns=["Total Error"])

def makeProductVelErrorMatrix(products_diffs, nproducts):
    product_error_matrix = pd.DataFrame()

    for i in range(0, nproducts):
        # Column i holds every product's error relative to product i.
        product_errors_table = getProductErrors(i, products_diffs)
        product_errors_sumsq = getTotalSquaredError(product_errors_table)
        product_error_matrix[i] = product_errors_sumsq["Total Error"]

    return product_error_matrix


In [22]:
product_diffs  = getWeeklyDiffs(data_norm)

In [23]:
error_matrix = makeProductVelErrorMatrix(product_diffs, 811)
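
Because each entry of the matrix depends only on the two products' summed differences, the same result can be obtained without a Python loop. The following is a sketch on hypothetical random data that compares a broadcasting version with a loop written in the same spirit as the functions above; the names `error_matrix_fast` and `error_matrix_loop` are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
diffs = pd.DataFrame(rng.normal(size=(5, 8)))  # hypothetical: 5 products, 8 weekly diffs

# Total of weekly differences for each product.
row_totals = diffs.sum(axis=1).to_numpy()

# Entry [j, i] is (total_j - total_i) ** 2, computed for all pairs by broadcasting.
error_matrix_fast = pd.DataFrame((row_totals[:, None] - row_totals[None, :]) ** 2)

# Loop version for comparison: one column per reference product.
loop_columns = {
    i: (diffs - diffs.iloc[i]).sum(axis=1) ** 2 for i in range(len(diffs))
}
error_matrix_loop = pd.DataFrame(loop_columns)

print(np.allclose(error_matrix_fast.values, error_matrix_loop.values))
```

For 811 products the broadcasting version builds the full 811 x 811 matrix in a single step, which avoids inserting columns into a DataFrame one at a time.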


In [24]:
import seaborn as sb
plt.figure(figsize=(15,15))

sb.heatmap(error_matrix, 
           square=True)


The error matrix shows us how each product's normalized weekly sales changed relative to the other products. In other words, darker colors indicate products that, over the course of a year, tended to vary together in terms of sales increases or decreases. Brighter colors indicate products that didn't tend to vary together.

Keep in mind that the sales trends are compared on a per-week basis, and then the total squared discrepancy for the whole year is taken. So the matrix gives us just a general view of the whole year.

All that I'm comfortable saying so far is that there seems to be a cluster of products (#1-200) that make up a "dark block" in the upper left of the matrix: these products appear to be more closely related to each other. There is another, larger block, product #s 200-811, that also appears related, but the trend seems less uniform than with the smaller one.

This probably just indicates that the products, in the order they are given, are already somehow pre-categorized. We would probably want to have more information about these categories before acting. Are they food vs. non-food items? Generic vs. brand-name?

In general, we can use a fine-tuned form of the method above, to determine perhaps smaller groupings of similar products.

Apply

In [25]:
from sklearn.decomposition import PCA

In [26]:
pca = PCA(n_components=3, whiten=True)

In [27]:
pca.fit_transform(error_matrix)

In [28]:
pca.explained_variance_ratio_

In [29]:
components = pca.components_

In [30]:
plt.scatter(y = components[1,:], 
            x = components[0,:])

In [83]:
pca_data_norm = PCA(n_components=2)

In [84]:
pca_data_norm.fit_transform(data_norm.T)

In [85]:
pca_data_norm.explained_variance_ratio_
print(pca_data_norm.explained_variance_ratio_.sum())

The difficulty so far is that PCA is not able to squeeze much variance into only 2 components, or even 3.
That's a problem because 3 is the most that we mere mortals can plot.

The first component alone explains 8.65% of the variance, which implies that even though we don't get a nice clean dimension reduction, we can still do better than 811 dimensions.
Let's do a quick test to see how many components we need to explain an arbitrary 99% of the variance:

In [100]:
def determineNComponents(data):
    
    n_components = 0
    sum_explained_variance = 0
    
    while sum_explained_variance < 0.99:
        n_components += 1
        pca_data_norm = PCA(n_components=n_components)
        pca_data_norm.fit_transform(data)
        sum_explained_variance = pca_data_norm.explained_variance_ratio_.sum()
        
    return n_components

determineNComponents(data_norm.T)
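
The loop above refits PCA once per candidate component count. The same number can be read off a single decomposition by accumulating the explained-variance ratios. The sketch below does this with plain NumPy (SVD of the centered data), which mirrors how the ratios are defined; with scikit-learn, `np.cumsum(PCA().fit(X).explained_variance_ratio_)` should give the same curve. The function name `n_components_for` is hypothetical.

```python
import numpy as np

def n_components_for(X, target=0.99):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target`."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    ratio = s**2 / np.sum(s**2)              # explained-variance ratio per component
    cumulative = np.cumsum(ratio)
    # First index where the cumulative ratio reaches the target, as a count.
    n = int(np.searchsorted(cumulative, target)) + 1
    return min(n, len(s))

# Hypothetical data that is exactly rank 2, so two components explain everything.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 10))
print(n_components_for(X_demo))
```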

We can compress our data to 50 dimensions, and still explain 99% of the variance. Maybe we can take these as product groups?

In [93]:
pca_data_norm = PCA(n_components=50)
pca_data_norm.fit_transform(data_norm.T)


In [98]:
plt.scatter(range(0,811), pca_data_norm.components_[1,:])

In [64]:
components_data_norm = pca_data_norm.components_

In [60]:
import seaborn as sb
sb.jointplot(y = components_data_norm[1,:], 
            x = components_data_norm[0,:],
            kind = 'hex')

In [65]:
pca_data_diffs = PCA(n_components=2, whiten=True)

In [66]:
pca_data_diffs.fit_transform(data_norm_diff.T)

In [67]:
pca_data_diffs.explained_variance_ratio_

In [68]:
components_data_diffs = pca_data_diffs.components_

In [69]:
plt.scatter(y = components_data_diffs[1,:], 
            x = components_data_diffs[0,:],
             alpha = 0.5)

In [75]:
from sklearn.cluster import KMeans  

In [76]:
kmeans = KMeans(n_clusters=20)  
kmeans.fit(data_norm)  

In [77]:
print(kmeans.cluster_centers_)  

In [78]:
plt.scatter(x= components_data_norm[0,:],
            y= components_data_norm[1,:],
            c=kmeans.labels_, 
            cmap='rainbow') 