<img style='float:right;border-radius:5%;width:20%;height:20%;background-blend-mode:screen;margin-right:25px;margin-top:25px;' src='mirum_purple.png'>
<br>
<br>
<br>
<br>

<h1 style='text-align:center;'> Market Basket Analysis </h1>
<br>

**Instructor**: Elie Kawerk, Ph.D. 
<p style="padding-left:74px"> Data Scientist at Mirum Agency MEA </p>

<img style='float:right;width:250px;height:210px;margin-right:30px;padding-left:15px;' src='http://1.bp.blogspot.com/-sYQ4ex-tM0E/UnKYjSpV5cI/AAAAAAAABLY/ZXWnHiADtyw/s1600/social-media-humor-21.png'>

# Introduction

The goal of market basket analysis is to discover items that are likely to be purchased together. Such discoveries are usually made by mining historical transaction data. This discipline is very important for e-commerce websites and retail stores, as it allows them to better organize their layouts.

In this workshop, you will use the customers' purchase history of a grocery store to cluster items that should be placed or bundled together.
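
To make the idea of "purchased together" concrete, here is a minimal sketch that counts how often two items share a basket. The baskets and item names are made up for illustration and are not taken from the workshop data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets; the item names are invented for this example
toy_baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]

pair_counts = Counter()
for basket in toy_baskets:
    # sorted(set(...)) drops duplicates and gives every pair a canonical order
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # → [(('bread', 'butter'), 3)]
```

Raw pair counts favor popular items, which is one reason the workshop later works with normalized similarities instead.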

The datasets are available in this repo as `csv` files:

- `purchase_history.csv`: contains the data of individual baskets or carts. It consists of two columns: `customer_id` holds the id of a customer, while `basket` holds the ids of the items purchased by that customer. Note that the item ids are separated by commas.
<br>
<br>
- `item_to_id.csv`: contains the mapping between the names of the grocery items (`Item_name`) and their corresponding ids (`Item_id`).
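
As a rough illustration of the layout described above, the sketch below reads a two-row, in-memory stand-in for `purchase_history.csv`. The rows are hypothetical, and it is an assumption that the real file quotes the `basket` field in the same way.

```python
import io
import pandas as pd

# Hypothetical two-row extract; the real file may differ in detail
toy_csv = io.StringIO(
    'customer_id,basket\n'
    '101,"3,7,3"\n'
    '102,"5,2"\n'
)
toy_baskets = pd.read_csv(toy_csv)

# Each basket is read as one string; split it into a list of integer item ids
toy_baskets["items"] = toy_baskets["basket"].str.split(",").apply(
    lambda ids: [int(i) for i in ids]
)
print(toy_baskets)
```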


### API Versions

Start by checking the versions of the libraries that you'll be using throughout the workshop.

In [None]:
# Load the watermark extension
# If watermark is not available, go to your terminal and type: pip install watermark
%load_ext watermark

In [None]:
# Enter your name here
%watermark  -a "Author: Your name here; Date: " -dt -v -p sklearn,pandas,matplotlib,seaborn

# 1. Import Libraries

In [None]:
# Data Analysis and Plotting
import ____ as ____  # pandas  as pd
import ____.____ as ____ # pyplot as plt
import numpy as np # numpy
import ____ as ____ # seaborn as sns
from mpl_toolkits.mplot3d import Axes3D # 3D plotting
from matplotlib import cm
%matplotlib inline

# Other utilities
import itertools as it
from collections import Counter

# display
from IPython.display import display

# Set custom preferences for displaying and visualizing data
sns.set_style('white')
pd.set_option('display.max_colwidth', 120)
pd.set_option('display.max_columns', 200)
pd.set_option('display.precision', 2)

# 2. Import and Inspect the Datasets

You can start by reading the two `csv` files containing the datasets.

**Task**: Use `pd.read_csv` to read the two available datasets as `pandas.DataFrame`s. 

- Assign the dataframe corresponding to `purchase_history.csv` to `df_baskets`.
- Assign the dataframe corresponding to `item_to_id.csv` to `df_categories`.

In [None]:
# Read purchase_history.csv; assign the dataframe to `df_baskets`
df_baskets = ____

# Read item_to_id.csv; assign the dataframe to `df_categories`
df_categories = ____

**Task**: Inspect the head of df_baskets.

In [None]:
# Inspect the head of df_baskets
____.____()

**Task**: Set `customer_id` as the index of `df_baskets`. Inspect `df_baskets` again to make sure that everything went OK.

In [None]:
# Set `customer_id` as the index of `df_baskets`
____.____('____', ____=____, ____=True)

# Inspect the head of `df_baskets` again ...
____.____()

**Task**: Inspect the head of `df_categories`. Inspect the names of `df_categories` columns; did you notice anything about these names?

In [None]:
# Inspect the head of `df_categories`
____.____()

In [None]:
# make column names lowercase
____

In [None]:
# set 'item_id' as index
____.____('____', ____=____)

# 3. Missing Values

**Task**: Check the number of missing values per column for each of the two dataframes. Devise a method to treat these missing values in case they are present.
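
A minimal sketch of one way to count missing values per column, shown on a hypothetical toy frame. Dropping incomplete rows is only one possible treatment and is an assumption here, not a recommendation for the workshop data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which one basket is missing
toy = pd.DataFrame(
    {"customer_id": [1, 2, 3], "basket": ["1,2", np.nan, "3"]}
)

# Count the missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One possible treatment: drop the rows that have no basket
toy_clean = toy.dropna(subset=["basket"])
```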

In [None]:
# Inspect the number of missing values per column for `df_baskets` 
____

In [None]:
# Inspect the number of missing values per column for `df_categories` 
____

# 4. Construct a Customer-Item Dataframe 

You will now construct a customer-item dataframe whose index holds the customer ids and whose columns hold the item ids. Each cell in this table should contain the number of purchases made by a certain customer for a certain item.

You will first start by building a transaction-items dataframe where each column corresponds to an item id.

**Tasks**:

- Write a function called `customer_item_count` that takes a `pandas.Series` `s` as an argument:
    - Raise a `ValueError` if the `basket` column is not contained in `s`.
    - Manipulate the `basket` column in `s` to determine the list of items bought by a customer. Assign the result to `items`.
    - Return a `pandas.Series` object whose index corresponds to the id of an item and whose values correspond to the number of purchases of that item. You can use the `Counter` class from the `collections` module to facilitate your task.
    
    
- Apply the function `customer_item_count` to the `df_baskets` dataframe using the `.apply()` method:
    - Make sure to apply this function along the rows of `df_baskets` by setting the `axis` argument correctly. 
    - Finally chain the result with the method `.fillna()` to fill missing values with zeros. Assign the final dataframe to `df_customer_transaction`.
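
The counting step can be sketched on two hypothetical basket strings. This only illustrates how `Counter` output turns into a `pandas.Series` and how series with different items line up in a table; it is not the workshop solution.

```python
from collections import Counter
import pandas as pd

# Two hypothetical baskets in the comma-separated format
basket_a = pd.Series(Counter("1,2,2".split(",")))  # item '2' bought twice
basket_b = pd.Series(Counter("2,3".split(",")))

# Stacking the series aligns them on item id; items absent from a
# basket show up as NaN, which fillna(0) turns into a count of zero
stacked = pd.DataFrame([basket_a, basket_b]).fillna(0)
print(stacked)
```

Note that the item ids are still strings at this point, so the column labels are `'1'`, `'2'` and `'3'`.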

In [None]:
def customer_item_count(s):
    '''
    This function takes a pandas.Series s corresponding to a row in df_baskets
    and returns a pandas.Series whose index corresponds to the id of an item
    and whose values correspond to the number of purchases of a particular
    item.
    
    Parameters
    ---------
    s: pandas Series
    
    Returns
    --------
    pandas Series    
    
    Raises
    ------
    ValueError if the 'basket' column is not contained in s
    
    '''
    
    # Make sure that 'basket' is contained in s
    if ____ not in ____:
        raise ____("____")
    
    # Compute the list of items in s['basket']
    items =  ____
    
    # Return a pd.Series whose index corresponds to item ids and
    # whose values correspond to item counts
    return ____

#Transform df_baskets into a tidy form
df_customer_transaction = ____.____(____,____=____).____(____, ____='____')

Now that you have constructed a transaction-item dataframe, it's time to construct a customer-item dataframe that takes into account all historical purchases made by a customer.

**Task**: 

Manipulate `df_customer_transaction` to obtain a customer-item dataframe in which the index corresponds to the id of a customer and where the cells contain the total number of purchases of an item. Assign the result to `df_customer_items`.

*Hint*: You can use the `groupby()` method to achieve this task.
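
A small sketch of the grouping idea on hypothetical data, assuming the customer id is the (named) index and that a customer may appear in several rows:

```python
import pandas as pd

# Hypothetical transaction-item counts; customer 10 appears twice
toy_transactions = pd.DataFrame(
    {"1": [1, 0, 2], "2": [0, 3, 1]},
    index=pd.Index([10, 11, 10], name="customer_id"),
)

# Sum all rows that share the same customer id
toy_customer_items = toy_transactions.groupby(level="customer_id").sum()
print(toy_customer_items)
```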

In [None]:
# Group by customer_id
df_customer_items = ____

**Task**

Print out a random sample of `df_customer_items` consisting of 10 rows.

In [None]:
# print out a random sample consisting of 10 records
____

In [None]:
# Downcast the columns to integer
for col in df_customer_items.columns:
    df_customer_items[col] = pd.to_numeric(df_customer_items[col], downcast='integer')

# 5. Most purchased item

The owner of the store wants to know which item in the transactions database was purchased the most. For this purpose, you will analyze the data to find the most purchased item.

**Tasks**:

- Manipulate `df_customer_items` to find the total count of purchases of the different items. Assign the result to `item_purchases`.
- Use `item_purchases` to find the name and the id of the most purchased item. Also find the total number of times this item was purchased.
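
The aggregation pattern can be sketched on a hypothetical customer-item table. The frame, the item names and the variable names below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical counts: rows are customers, columns are item ids
toy_customer_items = pd.DataFrame(
    {1: [3, 0, 1], 2: [1, 5, 2]},
    index=pd.Index([10, 11, 12], name="customer_id"),
)
toy_names = pd.Series({1: "apples", 2: "bread"}, name="item_name")

# Summing down the rows (axis=0) gives one total per item
totals_per_item = toy_customer_items.sum(axis=0)
top_item_id = totals_per_item.idxmax()    # label of the largest total
top_item_count = totals_per_item.max()    # the largest total itself
top_item_name = toy_names.loc[top_item_id]

# Summing across the columns (axis=1) gives one total per customer
totals_per_customer = toy_customer_items.sum(axis=1)
top_customer_id = totals_per_customer.idxmax()
```

Keep in mind that `idxmax` returns only the first label when several entries tie for the maximum.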

In [None]:
# Construct 'item_purchases'
item_purchases = ____

# Find the id of the most purchased item
max_purchased_item_id = ____

# Find the name of the most purchased item
max_purchased_item_name = ____

# Find the number of times the most purchased item appeared
max_purchased_item_count = ____

# print the result
print("The most purchased item was {0}, it has the item id {1} and was purchased {2} times.".format(
                  max_purchased_item_name, max_purchased_item_id, max_purchased_item_count))

# 6. Customer who Purchased the Greatest Number of Items

The store's owner wants to reward the customer who purchased the greatest number of items. Your task here is to identify this customer's id and determine the total number of items she/he purchased throughout the studied period.

**Tasks**:

- Manipulate `df_customer_items` to find the total count of purchases for all customers. Assign the result to `customer_purchases`.
- Use `customer_purchases` to find the id of the customer who purchased the greatest number of items. Also find the total number of items bought by this customer.

In [None]:
# Sum over columns to find the # of purchases of each customer
customer_purchases = ____

# Find the id of the customer who purchased the maximum number of items
most_purchasing_customer_id = ____

# Find the total number of items purchased by the most purchasing customer
tot_items_purchased = ____

print("The customer who purchased the greatest number of items has the id {0},"
      " and has purchased {1} items.".format(most_purchasing_customer_id, tot_items_purchased))

# 7. Distribution of basket size

The store's owner wants to get an idea of the distribution of the customers' basket size and to visualize what it looks like. 

You will help the owner in this task by using the awesome `seaborn` library.

**Task**:

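- Use `sns.distplot()` to plot the distribution of basket size. Note that `distplot` is deprecated in recent seaborn releases, where `sns.histplot()` is the suggested replacement.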
- Also plot vertical lines indicating the median and the average basket size.
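
As a rough sketch of the kind of figure meant here, the helper below draws a histogram with mean and median markers. It uses plain matplotlib instead of seaborn to avoid depending on a particular seaborn version; the helper name and the toy sizes are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical basket sizes
toy_sizes = np.array([2, 3, 3, 4, 5, 5, 5, 8, 12])

def plot_size_distribution(sizes):
    """Hypothetical helper: histogram with mean and median markers."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.hist(sizes, bins=10, color="lightblue", edgecolor="white")
    # Vertical lines for the two summary statistics
    ax.axvline(np.mean(sizes), color="red", ls="--", label="mean")
    ax.axvline(np.median(sizes), color="green", ls=":", label="median")
    ax.set(xlabel="Basket Size", ylabel="Count")
    ax.legend(loc="best")
    return fig, ax

fig, ax = plot_size_distribution(toy_sizes)
```

In a skewed distribution such as this toy one, the mean sits to the right of the median, which is why plotting both is informative.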

In [None]:
fig, ax = plt.subplots(figsize=(13,8))

# Plot the distribution of basket size
____
____
____
ax.set_title("Distribution of basket size", fontsize=15)
ax.set(xlabel='Basket Size')
ax.legend(loc='best')
plt.show()

# 8. Most purchasing customer by item

For each item, find the id of the customer who bought that item the most. You should include each item's name and id.

**Tasks**:

- Use `df_customer_items` to construct a dataframe `max_purchase_by_item`  that shows for each item_id, the maximum number of purchases by a single customer as well as the id of the customer. 
- Set the name of `max_purchase_by_item`'s index to `item_id`.
- Join `max_purchase_by_item` with `df_categories` using `max_purchase_by_item`'s `.join()` method. Assign the result to `max_purchase_by_item`.
- Display `max_purchase_by_item` and inspect the results.
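
The steps above can be sketched on hypothetical data as follows. The column names `max_count` and `customer_id` are assumptions chosen for this example.

```python
import pandas as pd

# Hypothetical counts: rows are customers, columns are item ids
toy_customer_items = pd.DataFrame(
    {1: [3, 0, 1], 2: [1, 5, 2]},
    index=pd.Index([10, 11, 12], name="customer_id"),
)
toy_names = pd.DataFrame(
    {"item_name": ["apples", "bread"]},
    index=pd.Index([1, 2], name="item_id"),
)

# For each item (column): the largest count and the customer who holds it
toy_max = pd.DataFrame(
    {
        "max_count": toy_customer_items.max(axis=0),
        "customer_id": toy_customer_items.idxmax(axis=0),
    }
)
toy_max.index.name = "item_id"

# join() aligns the two frames on their shared item_id index
toy_max = toy_max.join(toy_names)
print(toy_max)
```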

In [None]:
# Construct 'max_purchase_by_item'
max_purchase_by_item = ____
    
# Set the name of the index of 'max_purchase_by_item' to 'item_id'    
____ = '____'

# Join max_purchase_by_item to df_categories
max_purchase_by_item = ____.____(____)

# Display max_purchase_by_item
display(____)

# 9. Items similarity matrix

In this section, you will build a matrix that represents the similarity between items based on the past purchase behavior of customers. You will first have to normalize each column in the `df_customer_items` dataframe to bring all the 'item' features on a similar scale. 

**Tasks**:

- Import `normalize` from `sklearn.preprocessing`.
- Normalize the columns of `df_customer_items` using the `normalize` function. Assign the result to `customer_item_matrix`.
- Find a way to construct a matrix that measures the similarity between items. Assign the result to `item_similarity`.
- Transform `item_similarity` to a dataframe whose indices and columns correspond to the ids of items. Assign the result to `df_item_similarity`.
- Inspect the resulting dataframe.
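
One common choice of similarity measure is cosine similarity, sketched below with plain NumPy on a hypothetical count matrix. Whether cosine similarity is the measure intended here is an assumption. Dividing each column by its L2 norm is what `normalize(..., axis=0)` does with its default `norm='l2'`.

```python
import numpy as np

# Hypothetical counts: 3 customers (rows) x 3 items (columns)
toy_counts = np.array([
    [2.0, 4.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Scale every item column to unit L2 length
col_norms = np.linalg.norm(toy_counts, axis=0)
toy_normalized = toy_counts / col_norms

# Dot products between unit-length columns are cosine similarities
toy_similarity = toy_normalized.T @ toy_normalized
print(toy_similarity.round(2))
```

Items 0 and 1 are bought in the same proportions by the same customers, so their similarity is 1, while item 2 shares no customers with them and gets a similarity of 0.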

In [None]:
# Import normalize from sklearn.preprocessing
from ____.____ import ____

# Normalize df_customer_items column by column (axis=0)
customer_item_matrix = ____(____, axis=0)

# Find the matrix of item similarity
item_similarity = ____

# Transform the similarity matrix into a dataframe
df_item_similarity = pd.DataFrame(____, 
                                  index= ____,
                                  columns= ____)

# Delete item_similarity to save space
del item_similarity

# Inspect df_item_similarity
df_item_similarity

# 10. Machine Learning

Wow! That was pretty lengthy. You went all the way from having a table that contained individual transactions to a dataframe representing the similarity between the items present in the transactional database.

Now it's time to apply some machine learning techniques to cluster similar items together!

Let's first start by performing some basic imports.

In [None]:
# Machine Learning
from sklearn.metrics import silhouette_samples, silhouette_score # Model evaluation
from sklearn.cluster import KMeans # Clustering
from sklearn.decomposition import PCA # Dimensionality reduction
from sklearn.pipeline import make_pipeline # Import make_pipeline
from sklearn.preprocessing import StandardScaler # StandardScaler 

## 11. Dimensionality Reduction with PCA

Before feeding `df_item_similarity` to a clustering model, we will reduce its dimensionality using Principal Component Analysis (PCA). The PCA algorithm diagonalizes the covariance matrix of `df_item_similarity` and finds decorrelated components that serve as new features. These new components are linear combinations of the original features.

To reduce the dimensionality of the dataset, we can use the explained variance ratio to keep just enough components to capture a chosen percentage of `df_item_similarity`'s variance.
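
To show what "diagonalizing the covariance matrix" means in practice, here is a NumPy-only sketch on synthetic data. It mirrors the idea behind passing a fraction as `n_components`, but it is an illustration under stated assumptions, not scikit-learn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features; the third nearly copies the first
base = rng.normal(size=(200, 2))
toy_X = np.column_stack(
    [base[:, 0], base[:, 1], base[:, 0] + 0.05 * rng.normal(size=200)]
)

# Center the data, then diagonalize its covariance matrix
centered = toy_X - toy_X.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns ascending eigenvalues; reorder from largest to smallest
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of the total variance carried by each component, and its running sum
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
toy_reduced = centered @ eigenvectors[:, :n_keep]
print(n_keep, toy_reduced.shape)
```

Because the third feature is almost redundant, two components are enough to pass the 95% threshold in this toy case.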

**Tasks**:

- Set the variable `percentage_variance` to 0.95 in order to capture 95% of the variance in the PCA model.
- Instantiate a `PCA` object; set the parameter `n_components` to `percentage_variance`. Assign this object to `pca`.
- Fit `pca` to `df_item_similarity`.
- Use `pca` to transform `df_item_similarity` to `X_transformed`.
- Convert `X_transformed` into a dataframe whose index corresponds to the index of `df_item_similarity` and whose columns are denoted `PCi` where `i` is an integer that runs from 1 to the reduced total number of dimensions.

In [None]:
# Instantiate a PCA object that keeps enough components to capture 95% of the variance
percentage_variance = ____
pca = ____(____=____)

# Find the PCA transform of df_item_similarity
X_transformed = ____.____(____)

# Convert 'X_transformed' into a dataframe
X_transformed = ____

### Plotting the cumulative variance explained ratio

The code chunk below produces a barplot of the cumulative variance explained ratio.

In [None]:
# Plot cumulative variance explained ratio

# Determine the number of columns to take into account
no_cols = ____

# create a pandas.Series of the cumulative explained variance ratio
# Set the index as the column names and sort the values of the series
cum_explained_variance = pd.Series(____,
                                   index = ____).sort_values(ascending=____)

# Draw a barplot of 'cum_explained_variance'
fig, ax = plt.subplots(figsize=(15,5))
____.plot(____='____', ax=ax, color='lightblue')

ax.grid(ls='--')
ax.axhline(y=percentage_variance, color='red', ls='--')
plt.suptitle('Cumulative explained variance ratio', fontsize=20)
plt.xticks(range(1,no_cols), rotation=0)
plt.show()

To capture 95% of the variance, PCA reduced the dimensionality of the dataset from 48 to 40. Not bad!

## 12. Clustering

It's now time to perform clustering using the K-Means algorithm. Since you do not know the optimal number of clusters a priori, you will vary this number and record the inertia as well as the average silhouette score for each value.

**Tasks**:

- Define a list named `clusters` that contains the number of clusters ranging from 2 to 30.
- Define two empty lists named `silhouette_scores` and `inertias`.
- Iterate over the `clusters` list using `n_clusters` as a for-loop variable:

    - Instantiate a `KMeans` class with `n_clusters`; assign this instance to `kmeans`.
    - Fit `kmeans` on `X_transformed`.
    - Predict the labels of `X_transformed`, assign the result to `predicted_labels`.
    - Append the inertia of the clustering model to the list `inertias`.
    - Append the silhouette score to the list `silhouette_scores`.
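
The scan over cluster counts can be sketched on synthetic data with three well-separated blobs. The data, the variable names, and the `n_init` and `random_state` values are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Synthetic data: three tight blobs of 20 points each
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_points = np.vstack([c + 0.5 * rng.normal(size=(20, 2)) for c in centers])

toy_inertias, toy_silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(toy_points)
    # inertia_: sum of squared distances to the closest cluster center
    toy_inertias[k] = model.inertia_
    # Mean silhouette coefficient over all samples
    toy_silhouettes[k] = silhouette_score(toy_points, labels)

best_k = max(toy_silhouettes, key=toy_silhouettes.get)
print(best_k)
```

Inertia tends to keep shrinking as clusters are added, so it is usually read for an "elbow", whereas the silhouette score peaks at a specific value. On real data the peak is often much less pronounced than in this toy case.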

In [None]:
# Perform K-means clustering for k ranging from 2 to 30 on transformed data

# Define the list 'clusters'
clusters = ____

# Define silhouette_scores and inertias
silhouette_scores = []
inertias = []

for ____ in ____:
    
    # Instantiate a KMeans object with n_clusters clusters, assign it to kmeans
    kmeans = ____
    
    # Fit kmeans to X_transformed
    ____
    
    # Predict labels
    predicted_labels = ____

    # Determine inertia
    ____
    
    # Determine silhouette score
    score_silhouette = ____
    silhouette_scores.append(____)
      
    print('Finished clustering with {0} clusters'.format(n_clusters))   

### Plot Inertia and Silhouette Score

Execute the following chunk of code to plot the inertia diagram as well as the silhouette plot. Review the plots to estimate the optimal number of clusters.

In [None]:
inertia_scores = pd.Series(inertias, index=clusters)
silhouette_scores = pd.Series(silhouette_scores, index = clusters)

fig, ax = plt.subplots(1,2, figsize=(12,6))
inertia_scores.plot(ax=ax[0], marker='o')
ax[0].set_xlabel('Number of Clusters')
ax[0].set_ylabel('Inertia')
ax[0].grid(ls='--',alpha=0.5)

silhouette_scores.plot(ax=ax[1], marker='o')
ax[1].set_xlabel('Number of Clusters')
ax[1].set_ylabel('Silhouette Score')
ax[1].grid(ls='--',alpha=0.5)
plt.show()

## 13. Silhouette Plot

You will now draw a silhouette plot of the clustering model with the optimal number of clusters. For your convenience, the function `silhouette_plot` is written in the code chunk below. You can use it to obtain a silhouette plot of your clustering model.
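
For reference, the silhouette coefficient of a single sample $i$ is commonly defined as follows, where $a(i)$ is the mean distance from $i$ to the other members of its own cluster and $b(i)$ is the smallest mean distance from $i$ to the members of any other cluster:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
$$

Values close to 1 suggest that a sample sits well inside its cluster, values near 0 suggest that it lies between two clusters, and negative values suggest that it may have been assigned to the wrong cluster. `silhouette_score` reports the mean of $s(i)$ over all samples, which the plot marks with a red dashed line.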

**Tasks:**

- Use the graphs above to select the optimal number of clusters `n_optimal_clusters`. 
- Instantiate a `KMeans()` class with `n_clusters` set to `n_optimal_clusters`; assign the result to `kmeans`.
- Fit `kmeans` on `X_transformed`.
- Call `silhouette_plot` and pass `kmeans` and `X_transformed` as arguments.

In [None]:
##########################################################################################
def silhouette_plot(model, X):
    '''
    This function draws a silhouette plot of a clustering model given the features 
    matrix X
    
    Parameters
    ---------
    model: sklearn.cluster estimator
        The fitted clustering model to visualize.
    
    X: numpy.array or pandas.DataFrame
        The features matrix.
    
    Returns
    -------
    None
        The silhouette plot is drawn and displayed with plt.show().
   
    '''
    
    y_km = model.predict(X)
    cluster_labels = np.unique(y_km)

    n_clusters = cluster_labels.shape[0]
    
    silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
    
    y_ax_lower, y_ax_upper = 0, 0
    yticks = []

    plt.figure(figsize=(12,8))
    
    for i, c in enumerate(cluster_labels):
        c_silhouette_vals = silhouette_vals[y_km == c]
        c_silhouette_vals.sort()
        y_ax_upper += len(c_silhouette_vals)
        color = cm.jet(float(i) / n_clusters)
        plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0, 
                 edgecolor='none', color=color)

        yticks.append((y_ax_lower + y_ax_upper) / 2.)
        y_ax_lower += len(c_silhouette_vals)

    silhouette_avg = silhouette_score(X, y_km, metric='euclidean')
    plt.axvline(silhouette_avg, color="red", linestyle="--") 

    plt.yticks(yticks, cluster_labels + 1)
    plt.ylabel('Cluster')
    plt.xlabel('Silhouette coefficient')

    plt.tight_layout()
    plt.suptitle('K-means with {0} clusters'.format(len(cluster_labels)), fontsize=20)
    plt.xlim(-0.1,0.35)
    plt.show()
##########################################################################################

# Set n_optimal_clusters
n_optimal_clusters = ____

# Instantiate a KMeans 
kmeans = ____(____=____)

# Fit kmeans to X_transformed
____

# Call silhouette plot
____

## 14. Visualizing the Clusters in 2D

Finally, now that you have selected the optimal model, you can visualize the shape of the obtained clusters in 2D. 

You'll also examine the items forming each of the obtained clusters.

For your convenience, the function `show_clusters` is written in the code chunk below. You can call this function to produce a figure showing a projection of the data along the first two principal components. The function also outputs the items contained in each cluster. 

**Tasks:**

- Call the function `show_clusters` and pass the trained `kmeans` object as well as the dataframe `X_transformed` as arguments.
- Inspect the output and analyze whether the results make sense.

In [None]:
################################################################################################
def show_clusters(model, X):
    """
    Display a figure that shows the projection of X along the first 2 principal components
    
    Parameters
    ----------
    model: a trained sklearn.cluster class
    
    X: a pandas.DataFrame object corresponding to the PCA-reduced features matrix,
       the column names should be 'PCi' with i running from 1 to the maximum number
       of components.
    
    Returns
    -------
    None
        The figure is displayed and the items of each cluster are printed.
    
    """
    labels = model.predict(X)

    fig = plt.figure(figsize=(15, 10))
    colors =  it.cycle (["b","g","r","c","m","y","k"])

    groups = X.groupby(labels)
    
#     ax = fig.add_subplot(111, projection='3d')
    ax = fig.add_subplot(111,)
    for (label,group) in groups:        
#         ax.scatter(group['PC1'],group['PC2'],group['PC3'],c=next(colors),label = label, )
        ax.scatter(group['PC1'],group['PC2'],c=next(colors),label = label, )

        print("\n*********** Cluster [{}] ***********\n".format(label+1))
        names = df_categories.loc[ df_categories.index.isin(group.index), 'item_name']
        for index, name in enumerate(names):
            print("\t{} {}".format(index+1,name))

    # annotate
    for itemid in X.index:
        x = X.loc[itemid,"PC1"]
        y = X.loc[itemid,"PC2"]
#         z = X.loc[itemid, "PC3"]
        name = df_categories.loc[df_categories.index == itemid,"item_name"].values[0]
#         ax.text(x,y,z,name)
        ax.text(x, y ,name)
        
    plt.legend(loc='best')
    plt.show()
################################################################################################
    
# Call show_clusters and pass kmeans and X_transformed as arguments
____

# Thank you for attending the workshop!