# Today you are a Machine Learning Engineer at the Department of New Products at Target Cosmetics!
This work relies on processed data from Kaggle https://www.kaggle.com/mkechinov/ecommerce-events-history-in-cosmetics-shop

This work is motivated by the publication https://arxiv.org/pdf/2010.02503.pdf

### So far you have seen user-product interaction data that can be used to classify a user-product relationship as ending in purchase or no purchase, and to cluster (categorize) user behaviors.

### In this assignment, we have a very small training set to work with, and the test set has very few features. We'll first introduce an automated machine learning (AutoML) library called TPOT and show how it can be used to search over many ML model pipelines. Then we'll use the Label Spreading method for semi-supervised learning, which lets us combine a small amount of labeled data with a larger amount of unlabeled data. Finally, we'll have a more open-ended task centered on system design for zero-shot learning.

### Labeled data is sparse, and in our hypothetical application (cosmetics purchase prediction) the intention is to maximize Recall, so that no popular cosmetic is understocked. Digital overstocking is allowed since it will not cause customers to disengage.

## Task 1: Exploratory Data Analysis (EDA) and Data Preparation

In [None]:
# similar data as last week, and just to remind you the original Kaggle data was session-level data like the following:
from IPython.display import Image
Image(filename='user_journey_descriptions.png')  # change this path to wherever you downloaded this image

FileNotFoundError: ignored

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
## Importing some libraries we'll use
import os
import numpy as np
import pandas as pd
import seaborn as sb
import warnings

warnings.filterwarnings('ignore')

In [None]:
# Load the data from previous months (past)
Past = pd.read_csv("/content/drive/MyDrive/FourthBrain/Week 4/Past_month_products.csv") 
#^ change to wherever you stored the data, either locally or on Google Drive
print(Past.shape)
Past.head()

(5000, 37)


Unnamed: 0,product_id,user_id,NumOfEventsInJourney,NumSessions,interactionTime,maxPrice,minPrice,NumCart,NumView,NumRemove,InsessionCart,InsessionView,InsessionRemove,Weekend,Fr,Mon,Sat,Sun,Thu,Tue,Wed,2019,2020,Jan,Feb,Oct,Nov,Dec,Afternoon,Dawn,EarlyMorning,Evening,Morning,Night,Purchased?,Noon,Category
0,5866936,561897800.0,1.333333,1.333333,5550.0,15.84,15.84,0.0,1.333333,0.0,0.0,1.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.333333,0.0,0.333333,0.333333,0.666667,0.333333,0.333333,0.333333,0.0,0.0,0.0,0.0,0.666667,0.333333,0.0,0.0,0,0.0,1.0
1,5647110,532652900.0,2.25,1.5,27556.5,5.8,5.565,1.25,0.25,0.25,3.75,2.25,9.0,0.0,0.0,0.25,0.0,0.25,0.0,0.25,0.25,0.5,0.5,0.0,0.5,0.0,0.25,0.25,0.75,0.0,0.0,0.25,0.0,0.0,0,0.0,1.0
2,5790472,457810900.0,1.0,1.0,0.0,6.2725,6.2725,0.25,0.75,0.0,17.25,30.0,2.5,0.0,0.25,0.25,0.25,0.25,0.0,0.0,0.0,0.5,0.5,0.0,0.5,0.25,0.25,0.0,0.0,0.0,0.0,0.75,0.25,0.0,0,0.0,1.0
3,5811598,461264100.0,1.5,1.5,131532.5,5.56,5.56,0.25,1.0,0.25,3.25,10.5,1.0,0.0,0.0,0.25,0.25,0.0,0.25,0.25,0.0,0.5,0.5,0.5,0.0,0.0,0.25,0.25,0.0,0.0,0.0,0.5,0.0,0.25,0,0.25,1.0
4,5846363,515799300.0,1.875,1.375,11055.875,4.08625,4.08625,0.5,1.0,0.25,4.875,3.375,4.25,0.0,0.125,0.125,0.375,0.0,0.25,0.125,0.0,0.75,0.25,0.125,0.125,0.25,0.25,0.25,0.375,0.0,0.125,0.25,0.25,0.0,1,0.0,1.0


In [None]:
# Next, load the data regarding products to be launched next month
Next = pd.read_csv("/content/drive/MyDrive/FourthBrain/Week 4/Next_month_products.csv")
print(Next.shape)
Next.head()

(30091, 5)


Unnamed: 0,product_id,maxPrice,minPrice,Purchased?,Category
0,5866502,7.616667,7.616667,0,1.0
1,5870408,6.27,6.27,0,3.0
2,5900580,10.008,10.008,0,1.0
3,5918778,5.98,5.98,0,2.5
4,5848772,26.83,26.83,0,1.0


### Notice that next month's data (our test data in this exercise) has far fewer features: besides the `Purchased?` target, only the `product_id`, `maxPrice`, `minPrice`, and `Category` columns are common to both the training and test data. The training dataset is also very small (5,000 rows), while the test data has many more samples (30,091 rows).
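The shared feature columns can also be found programmatically rather than by eye. The sketch below uses two tiny, made-up DataFrames (`past_demo` and `next_demo` are hypothetical stand-ins, not the real CSV files) and assumes the target is named `Purchased?` and the identifier `product_id`, as in the data above.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the Past and Next DataFrames
past_demo = pd.DataFrame({
    "product_id": [1, 2],
    "NumCart": [0.5, 1.0],      # interaction feature present only in past data
    "maxPrice": [5.0, 6.0],
    "minPrice": [5.0, 5.5],
    "Purchased?": [0, 1],
    "Category": [1.0, 2.0],
})
next_demo = pd.DataFrame({
    "product_id": [3, 4],
    "maxPrice": [7.0, 8.0],
    "minPrice": [7.0, 7.5],
    "Purchased?": [0, 0],
    "Category": [1.0, 3.0],
})

# Columns present in both frames
common = past_demo.columns.intersection(next_demo.columns)

# Drop the target and the identifier to get usable features
feature_cols = [c for c in common if c not in ("Purchased?", "product_id")]
print(feature_cols)
```

Applied to the real `Past` and `Next` frames, the same two lines should leave only the price and category columns as candidate features.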

### Imagine that you are helping plan the launch of new products. You have to figure out how to mine last month's cosmetic sales data, use the relevant features, and estimate which products will sell more.


## EDA: Doing your due diligence. Find the following:
1. Percentage of Purchased events in train data: 
2. Percentage of Purchased events in test data:
3. Are there any overlaps in product ID between train and test data?

In [None]:
### START CODE HERE ###
y_train = Past['Purchased?'].values
print(f"Percentage of Purchased in Training data = {(np.sum(y_train)/len(y_train))*100}")
y_test = Next['Purchased?'].values
print(f"Percentage of Purchased in Test data = {(np.sum(y_test)/len(y_test))*100}")

# Verify that every product ID in the training data appears only once
print(f"Every product ID in the training data appears only once: {len(np.unique(Past['product_id'])) == Past.shape[0]}")
# Verify that every product ID in the test data appears only once
print(f"Every product ID in the test data appears only once: {len(np.unique(Next['product_id'])) == Next.shape[0]}")

# Concatenate the product_id columns of the training and test DataFrames
frames = [Past['product_id'], Next['product_id']]  # select by name rather than by column position
result = np.array(pd.concat(frames))
# Get all the unique product IDs and their counts
prod, prod_counts = np.unique(result, return_counts=True)
# Determine whether any product IDs appear in both the training and test data
num = (prod_counts > 1).astype(int)
overlap = set(Past['product_id']).intersection(set(Next['product_id']))
print(f"Number of product ids with count > 1 for training and test data combined = {sum(num)}")
print(f"These product IDs are present in both the training and test data: {overlap}")
### END CODE HERE ###

Percentage of Purchased in Training data = 34.38
Percentage of Purchased in Test data = 34.42557575354757
Every product ID in the training data appears only once: True
Every product ID in the test data appears only once: True
Number of product ids with count > 1 for training and test data combined = 0
These product IDs are present in both the training and test data: set()


## Next, create `X_train`, `y_train`, `X_test`, and `y_test` using a function called `return_train_test_data`. Remember the following: 
1. The `Purchased?` column is the target
2. `X_train` and `X_test` should contain the same features
3. `product_id` should NOT be one of those features. Can you see why?

In [None]:
def return_train_test_data(df_old, df_new):
    ### START CODE HERE ###
    X_train = df_old[['maxPrice', 'minPrice', 'Category']].values
    y_train = df_old[['Purchased?']].values
    X_test  = df_new[['maxPrice', 'minPrice', 'Category']].values
    y_test  = df_new[['Purchased?']].values
    ### END CODE HERE ###
    return X_train, y_train, X_test, y_test
    
X_train, y_train, X_test, y_test = return_train_test_data(Past, Next)    
print(X_train.shape, y_train.shape, X_test.shape)

(5000, 3) (5000, 1) (30091, 3)


## Task 2: Build the best classifier you can using only the Past month's data.
### Start by using the TPOT library. Then try something more manual, implementing one of the methods we have covered so far in the course.

You may have to install TPOT first; you can follow [these instructions](https://epistasislab.github.io/tpot/installing/), using either conda or pip.
There are some other dependencies, some of which are optional (like xgboost).

In [None]:
# If you're using Colab, this should work:
!pip install tpot

Collecting tpot
  Downloading TPOT-0.11.7-py3-none-any.whl (87 kB)
Collecting update-checker>=0.16
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting stopit>=1.1.1
  Downloading stopit-1.1.2.tar.gz (18 kB)
Collecting xgboost>=1.1.0
  Downloading xgboost-1.4.2-py3-none-manylinux2010_x86_64.whl (166.7 MB)
Collecting deap>=1.2
  Downloading deap-1.3.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (160 kB)
Building wheels for collected packages: stopit
  Building wheel for stopit (setup.py) ... done
  Created wheel for stopit: filename=stopit-1.1.2-py3-none-any.whl size=11952 sha256=4878644ab5da18afc0c494dcd6b9be7f3119d998bfaaf73544c3bb189321d757
  Stored in directory: /root/.cache/pip/wheels/e2/d2/79/eaf81edb391e27c87f51

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((5000, 3), (5000, 1), (30091, 3), (30091, 1))

In [None]:
# TPOT for classification
from tpot import TPOTClassifier
### START CODE HERE ###
# Instantiate and train a TPOT auto-ML classifier
# These parameters are set fairly arbitrarily, and with some trial-and-error:
# Set generations to 5, population_size to 40, and verbosity to 2 (so you can see each generation's performance)
tpot = TPOTClassifier(generations=5, population_size=40, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
# Evaluate the classifier on the test data
# By default, the scoring function is accuracy
print(f"{tpot.score(X_test, y_test)}")
### END CODE HERE ###

# Export the optimized pipeline as Python code.
tpot.export('tpot_products_pipeline.py')

Optimization Progress:   0%|          | 0/240 [00:00<?, ?pipeline/s]



TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: BernoulliNB(XGBClassifier(input_matrix, learning_rate=0.5, max_depth=3, min_child_weight=8, n_estimators=100, n_jobs=1, subsample=1.0, verbosity=0), alpha=10.0, fit_prior=False)
0.8724867900701206


### When you call `tpot.export('tpot_products_pipeline.py')` above, it writes a Python file with the code necessary to produce the classifier that TPOT found through its AutoML search. You can open it up and see what it found. Pretty cool, right?

## In the cell below, paste the appropriate lines of `tpot_products_pipeline.py` (and modify the relevant names) to write a function which returns the predicted labels generated by the best classifier which TPOT found. Call the predicted labels `pred`.

## There is randomness in the way that TPOT searches for a classifier, so yours may be different from ours and from your peers'. This is okay! If there's some scikit-learn functionality in `tpot_products_pipeline.py` that you're not familiar with, look it up!

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((5000, 3), (5000, 1), (30091, 3), (30091, 1))

In [None]:
X_test

array([[ 7.61666667,  7.61666667,  1.        ],
       [ 6.27      ,  6.27      ,  3.        ],
       [10.008     , 10.008     ,  1.        ],
       ...,
       [ 6.35      ,  6.35      ,  1.83333333],
       [ 3.4725    ,  3.4725    ,  1.        ],
       [ 5.56      ,  5.56      ,  1.        ]])

In [None]:
# this is just what we got on a particular run of TPOT
### START CODE HERE ###
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def return_tpot_results(X_train, y_train, X_test):
    ### START CODE HERE ###    
    exported_pipeline = make_pipeline(
    RFE(estimator=ExtraTreesClassifier(criterion="entropy", max_features=0.1, n_estimators=100), step=0.15000000000000002),
    GradientBoostingClassifier(learning_rate=0.1, max_depth=9, max_features=0.35000000000000003, min_samples_leaf=19, min_samples_split=6, n_estimators=100, subsample=0.9500000000000001)
    )

    exported_pipeline.fit(X_train, y_train)
    prediction = exported_pipeline.predict(X_test)
    ### END CODE HERE ### 
    return prediction

pred = return_tpot_results(X_train, y_train, X_test)
### END CODE HERE ###

## Evaluate the results of the best classifier which TPOT found

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score as accuracy
from sklearn.metrics import recall_score as recall
from sklearn.metrics import precision_score as precision
from sklearn.metrics import f1_score

### START CODE HERE ###
# TPOT confusion matrix
cmtp = confusion_matrix(y_test, pred) 
acc  = accuracy(y_test, pred)
rec  = recall(y_test, pred)
prec = precision(y_test, pred)
f1   = f1_score(y_test, pred)
### END CODE HERE ###
print(f'Accuracy = {acc}, Precision = {prec}, Recall = {rec}, F1-score = {f1}')
print('Confusion Matrix is:')
print(cmtp)

Accuracy = 0.8726529527101127, Precision = 0.9604910399322704, Recall = 0.6571097596293078, F1-score = 0.7803507967442395
Confusion Matrix is:
[[19452   280]
 [ 3552  6807]]
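
As a sanity check, the four reported metrics can be recomputed by hand from the confusion matrix printed above. scikit-learn lays the binary matrix out as `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions), so the sketch below only needs the four counts.

```python
# Counts copied from the TPOT confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 19452, 280
fn, tp = 3552, 6807

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # of predicted purchases, how many were real
recall = tp / (tp + fn)                      # of real purchases, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy = {accuracy:.4f}, Precision = {precision:.4f}, "
      f"Recall = {recall:.4f}, F1-score = {f1:.4f}")
```

The 3552 false negatives are the products that would be understocked, which is why Recall is the number to watch in this application.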


## Now, in the time remaining in the first breakout session, see if you can compete with this performance by manually training a model, rather than having TPOT do it for you. You can use anything we've covered so far in the course.

In [None]:
y_train

array([[0],
       [0],
       [0],
       ...,
       [0],
       [0],
       [0]])

In [None]:
pred

array([0, 0, 0, ..., 1, 0, 0])

In [None]:
### START CODE HERE ####
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
# SVC confusion matrix
cmx = confusion_matrix(y_test, pred)
acc  = accuracy(y_test, pred)
rec  = recall(y_test, pred)
prec = precision(y_test, pred)
f1   = f1_score(y_test, pred)
print(f'Accuracy = {acc}, Precision = {prec}, Recall = {rec}, F1-score = {f1}')
print('Confusion Matrix is:')
print(cmx)
### END CODE HERE ###

Accuracy = 0.8319431059120668, Precision = 0.9263428755226761, Recall = 0.5560382276281495, F1-score = 0.6949387705857514
Confusion Matrix is:
[[19274   458]
 [ 4599  5760]]


## Task 3: Semi-supervised learning: Apply label spreading on the data.

We won't use any of the labels for the test set. We'll just use labels for the training set. We will, however, use the **features** from the test set along with the features from the training set. Since we're using the features of a large number of samples, but only a small number of these samples have labels, this is semi-supervised learning.

Step 1: Concatenate `X_train` and `X_test`, calling this matrix `X`. These are the feature vectors for all samples.

Step 2: Concatenate `y_train` with a vector of all -1's, effectively creating a dummy label for the `X_test` rows in `X`. Call this `y`.

Step 3: Run label spreading on this data. Use knn spreading with `n_neighbors` varying as 2,3,5,7,9,11. What's the best neighborhood?


### Concatenate `X_train` and `X_test`

In [None]:
### START CODE HERE ###
X = np.concatenate((X_train, X_test), axis=0)
### END CODE HERE ### 
print(X.shape[0])

35091


### Create `y`. Make it a $k \times 1$ column vector where $k$ is the number of rows in `X`

In [None]:
### START CODE HERE ###
y = np.concatenate((y_train, -1*np.ones((X_test.shape[0],1))), axis=0)
### END CODE HERE ###

### scikit-learn provides two label propagation models: LabelPropagation and LabelSpreading. Both work by constructing a similarity graph over all items in the input dataset. LabelSpreading is similar to the basic Label Propagation algorithm, but it uses an affinity matrix based on the normalized graph Laplacian and soft clamping across the labels. Let's use LabelSpreading for this notebook.
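To make "normalized graph Laplacian and soft clamping" more concrete, here is a minimal NumPy sketch of the label spreading iteration $F \leftarrow \alpha S F + (1-\alpha) Y$, where $S = D^{-1/2} W D^{-1/2}$ is the symmetrically normalized affinity matrix. It is an illustration on a toy 1-D dataset with a dense RBF affinity, not scikit-learn's implementation (which, for the `knn` kernel, builds a sparse nearest-neighbor graph instead). The helper name `spread_labels` and the toy data are assumptions made for this example.

```python
import numpy as np

def spread_labels(W, Y, alpha=0.01, n_iter=30):
    """Iterate F <- alpha * S @ F + (1 - alpha) * Y with S = D^-1/2 W D^-1/2."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        # Soft clamping: labeled rows are pulled back toward Y by (1 - alpha)
        F = alpha * (S @ F) + (1.0 - alpha) * Y
    return F

# Toy data: two well-separated 1-D clusters, one labeled point in each
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
W = np.exp(-(x[:, None] - x[None, :]) ** 2)  # RBF affinity
np.fill_diagonal(W, 0.0)                     # no self-loops

# One-hot initial labels; unlabeled rows are all zeros
Y = np.zeros((6, 2))
Y[0, 0] = 1.0   # first point labeled class 0
Y[3, 1] = 1.0   # fourth point labeled class 1

F = spread_labels(W, Y, alpha=0.01)
toy_pred = F.argmax(axis=1)
print(toy_pred)
```

With a small `alpha` such as 0.01, the labeled points keep almost all of their original label mass, while each unlabeled point takes the class of the cluster it is connected to.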

### Instantiate and train the LabelSpreading model. Use a KNN kernel and set `alpha` to 0.01. Try the `n_neighbors` values mentioned above.

In [None]:
from sklearn.semi_supervised import LabelSpreading
### START CODE HERE ###
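n_neighbors = [2,3,5,7,9,11]
params = dict()  # maps each recall score to the n_neighbors value that produced it

for i in n_neighbors:
  model = LabelSpreading(kernel='knn',alpha=0.01, n_neighbors=i)
  model.fit(X, y)
  pred = model.predict(X)
  # Caution: y still holds -1 for every unlabeled test row, so those rows always count as misses.
  # This weighted recall therefore equals the number of correctly predicted labeled rows
  # divided by the total number of rows in X; it is not a recall on the test set.
  rec = recall(y, pred, average='weighted')
  params[rec] = i
best_recall_score = max(params.keys())
params['n_neighbors'] = params[best_recall_score]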
### END CODE HERE ###
print(f"best_recall: {best_recall_score:.4f}, best n_neighbors: {params['n_neighbors']}")

best_recall: 0.1247, best n_neighbors: 7
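
Note that the score above is computed against `y`, which contains the placeholder label -1 for every test row, so it mostly measures agreement on the labeled training rows. A possible alternative, sketched below under stated assumptions, is to hide the labels of a slice of the labeled rows, fit on everything, and score Recall only on that hidden slice. The data here is synthetic (two Gaussian blobs standing in for the real features), and the 25% hold-out size is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 200 labeled and 600 unlabeled rows, 3 features
X_lab = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(3, 1, (100, 3))])
y_lab = np.array([0] * 100 + [1] * 100)
X_unl = np.vstack([rng.normal(0, 1, (300, 3)), rng.normal(3, 1, (300, 3))])

# Hold out 25% of the labeled rows for validation
val_idx = rng.permutation(len(y_lab))[:50]

X_all = np.vstack([X_lab, X_unl])
y_all = np.concatenate([y_lab, -np.ones(len(X_unl), dtype=int)])
y_all[val_idx] = -1  # hide the validation labels from the model

scores = {}
for k in [2, 3, 5, 7, 9, 11]:
    model_k = LabelSpreading(kernel='knn', alpha=0.01, n_neighbors=k)
    model_k.fit(X_all, y_all)
    val_pred = model_k.transduction_[val_idx]          # labels inferred for hidden rows
    scores[k] = recall_score(y_lab[val_idx], val_pred)  # Recall of the positive class

best_k = max(scores, key=scores.get)
print(f"best n_neighbors: {best_k}, validation recall: {scores[best_k]:.4f}")
```

Because the hidden rows have known labels that the model never saw, this score is a fairer basis for choosing `n_neighbors` than one computed on rows whose labels were used in fitting.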


### Based on the best `n_neighbors`, fit the `LabelSpreading` on kernel `knn` and alpha `0.01`

In [None]:
### START CODE HERE ###
model = LabelSpreading(kernel='knn',alpha=0.01, n_neighbors=params['n_neighbors'])  # best value found above (7 in this run)
model.fit(X, y)
### END CODE HERE ###

LabelSpreading(alpha=0.01, gamma=20, kernel='knn', max_iter=30, n_jobs=None,
               n_neighbors=7, tol=0.001)

### Extract the label predictions (transductions) for the test data

In [None]:
### START CODE HERE ###
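# The first X_train.shape[0] rows of X are the labeled training rows; the rest are the test rows
pred = model.transduction_[X_train.shape[0]:]
### END CODE HERE ###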

### Evaluate the test predictions against the true test labels

In [None]:
### START CODE HERE ###
cm   = confusion_matrix(y_test, pred)
acc  = accuracy(y_test, pred)
rec  = recall(y_test, pred)
prec = precision(y_test, pred)
f1   = f1_score(y_test, pred)
### END CODE HERE ###
print(f'Accuracy = {acc}, Precision = {prec}, Recall = {rec}, F1-score = {f1}')
print('Confusion Matrix is:')
print(cm)

Accuracy = 0.8061546641853046, Precision = 0.7644927536231884, Recall = 0.6314316053673135, F1-score = 0.6916204070843247
Confusion Matrix is:
[[17717  2015]
 [ 3818  6541]]


## Collect your results in the table below


| Method          | Recall | F1-score | Accuracy |
|-----------------|--------|----------|----------|
| TPOT (AutoML)   | 0.6571 | 0.7804   | 0.8727   |
| Manual SVC      | 0.5560 | 0.6949   | 0.8319   |
| Label Spreading | 0.6314 | 0.6916   | 0.8062   |

## Task 4: System Design for Zero-Shot Learning
So far we have been looking at 3 product-level features (min price, max price, product category) to classify whether a particular product will get purchased or not.
Now, let's say you have access to some more information regarding each Past sold cosmetic item and each Next cosmetic item. Design a system to enable accurate identification of items that are more likely to be purchased.
Think through the following:
1. What additional data fields do you need per cosmetic in the Past and Next catalogues? How would you process these data fields?
2. You have access to images of each cosmetic. How will you use these images to extract relevant features for gauging interest in the new cosmetics?
3. Design an end-to-end system workflow using the additional cosmetic data and cosmetic images to predict purchasing popularity.
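
One possible shape for the prediction step of such a workflow, sketched under heavy assumptions: suppose every product (past or new) has already been mapped to a vector in a shared embedding space, for example from image and attribute features. A new product with no purchase history can then borrow a score from its most similar past products. The function name `similarity_weighted_score`, the 2-D embeddings, and the choice of cosine similarity are all hypothetical choices for illustration.

```python
import numpy as np

def cosine_similarity_matrix(A, B):
    """Pairwise cosine similarity between the rows of A and the rows of B."""
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_n @ B_n.T

def similarity_weighted_score(past_emb, past_purchased, new_emb, k=2):
    """Score each new item by the similarity-weighted purchase rate of its k nearest past items."""
    sims = cosine_similarity_matrix(new_emb, past_emb)   # shape (n_new, n_past)
    top_k = np.argsort(-sims, axis=1)[:, :k]             # indices of the k most similar past items
    rows = np.arange(new_emb.shape[0])[:, None]
    weights = np.clip(sims[rows, top_k], 0.0, None)      # ignore negative similarities
    votes = past_purchased[top_k]
    return (weights * votes).sum(axis=1) / np.maximum(weights.sum(axis=1), 1e-12)

# Hypothetical embeddings: past items 0 and 1 were purchased, items 2 and 3 were not
past_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
past_purchased = np.array([1, 1, 0, 0])
new_emb = np.array([[0.95, 0.05], [0.05, 0.95]])

zsl_scores = similarity_weighted_score(past_emb, past_purchased, new_emb, k=2)
print(zsl_scores)
```

The quality of such a system depends almost entirely on the embedding, so the design questions above (which fields to collect, how to featurize images) matter more than the scoring rule itself.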

In [None]:
"""
1. Extra data fields that might help: the number of months per year the product is on sale, and
   product placement (front, middle, or back of the store; or top, middle, or bottom of the page).
   Placement can be encoded as numbers, e.g. 0 for front, 1 for middle, and 2 for back, or
   0 for top of page, 1 for middle of page, and 2 for bottom of page.

2. Store each image as a matrix of numbers and look for correlations between purchases and which
   numbers in the matrix are higher (e.g. a high value in the middle of the matrix = more likely
   to buy). Later, look at the images with many purchases and change the other images to match.

3. Clean the image set, keeping images that are clearly visible and not blurry. Turn the images
   into matrices, then find correlations between matrix values and purchases. Use a model
   (kmeans?) and look at which images/products the model predicts as highly purchased. Change
   the other images to match the color or placement of those images, then run the prediction
   again to see whether the model now predicts them as highly purchased. Sell with the new
   images and check whether purchases increase.
"""


# **Summary and Discussion:** **Discuss** "What would you report back as the best method to gauge product popularity?"
# Think in terms of Data, Process, and Outcomes specifically.
## Consider the following:
1. Can you store the data in some other way to enable ZSL or more efficient information storage/retrieval?
2. Given a new data set on the job, how would you report the best "method"? What are the steps to always follow? 
3. What is the metric/metrics you would use to report your results?

# Share your screen and discuss your findings. Think about generalizability (something that works across data sets). Also, look into ML system design in terms of Data, Process, and Outcome.