# Assignment 8: Automated Machine Learning (Part 2)
## Objective:

As we learned in class, the high demand for machine learning has produced a large number of data scientists with expertise in tools and algorithms. The features in the data directly influence model results, but manually designing and selecting features is tedious and does not scale, especially without domain knowledge. AutoML techniques can therefore save data scientists considerable labour and time.
After completing this assignment, you should be able to answer the following questions:

1. Why do we need AutoML?
2. How does auto feature generation work?
3. How to use the featuretools library to automatically generate features?
4. How to find useful features in a large feature space?

Imagine you are a data scientist at an online retail company, for example, Amazon. Your task is to recommend products to customers based on their historical purchase records.

In this assignment, we predict whether a customer will buy **Banana** in the next 4 weeks. This is a binary classification problem. To simplify the problem, we have already generated some features and report the performance of a Random Forest model trained on them. Your task is to generate **10** useful features and beat our model's performance (AUC = 0.61, see below).

For example, <br>
`MODE(orders.MODE(order_products.product_name)) = Bag of Organic Bananas` indicates whether the customer's most frequently purchased product is Bag of Organic Bananas.

```
1: Feature: MODE(orders.MODE(order_products.product_name)) = Bag of Organic Bananas
2: Feature: MODE(order_products.aisle_id) is unknown
3: Feature: SUM(orders.NUM_UNIQUE(order_products.product_name))
4: Feature: MODE(orders.MODE(order_products.product_name)) = Boneless Skinless Chicken Breasts
5: Feature: MODE(order_products.product_name) = Boneless Skinless Chicken Breasts
6: Feature: STD(orders.NUM_UNIQUE(order_products.aisle_id))
7: Feature: MODE(order_products.aisle_id) = 83
8: Feature: MEDIAN(orders.MINUTE(order_time))
9: Feature: MODE(orders.DAY(order_time)) = 23
10: Feature: MODE(orders.MODE(order_products.department)) = produce

AUC 0.61
```
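To make the nested feature names concrete, here is a minimal sketch that computes `MODE(orders.MODE(order_products.department)) = produce` by hand in plain Python. The toy rows are made up for illustration and are not taken from the assignment data, and the tie-breaking rule (first value seen) is an assumption that may differ from what featuretools does.

```python
from collections import Counter, defaultdict

# Hypothetical toy data, not from the assignment files.
orders = [(1, "u1"), (2, "u1"), (3, "u1"), (4, "u2")]  # (order_id, user_id)
order_products = [  # (order_id, department)
    (1, "produce"), (1, "produce"), (1, "dairy eggs"),
    (2, "produce"), (2, "snacks"), (2, "produce"),
    (3, "snacks"),
    (4, "beverages"), (4, "beverages"), (4, "produce"),
]

def mode(values):
    """Most frequent value; ties resolve to the value seen first."""
    return Counter(values).most_common(1)[0][0]

# Inner step: MODE(order_products.department) for each order.
departments_by_order = defaultdict(list)
for order_id, department in order_products:
    departments_by_order[order_id].append(department)
order_mode = {oid: mode(deps) for oid, deps in departments_by_order.items()}

# Outer step: MODE(orders.<inner feature>) for each user.
order_modes_by_user = defaultdict(list)
for order_id, user_id in orders:
    order_modes_by_user[user_id].append(order_mode[order_id])
user_mode = {uid: mode(modes) for uid, modes in order_modes_by_user.items()}

# Final boolean feature: "... = produce".
feature = {uid: value == "produce" for uid, value in user_mode.items()}
print(feature)
```

The inner aggregation runs once per order and the outer one once per user, which is the stacking that the nested name describes.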


## Preliminary
If you have never used featuretools before, you need to learn some basics first.
These are some good resources:
* [featuretools documentation](https://docs.featuretools.com/en/stable/)
* [Tutorial: Automated Feature Engineering in Python](https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219)

The data can be downloaded from [A8-2-data.zip](A7-2-data.zip). 

## 0. Preparation
Import relevant libraries and load the dataset: <br>
users: <br>
* user_id: customer identifier
* label: 1 if the customer will buy bananas in the next 4 weeks, 0 otherwise

orders: <br>
* order_id: order identifier
* user_id: customer identifier
* order_time: date the order was placed

order_products: <br>
* order_id: order identifier
* order_product_id: foreign key
* reordered:  1 if this product has been ordered by this user in the past, 0 otherwise
* product_name: name of the product
* aisle_id: aisle identifier
* department: the name of the department
* order_time: date the order was placed

In [1]:
import pandas as pd
!pip install featuretools
import featuretools as ft
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import os
ft.__version__

Collecting featuretools
  Downloading featuretools-1.6.0-py3-none-any.whl (356 kB)



'1.6.0'

In [3]:
orders = pd.read_csv("orders.csv")
order_products = pd.read_csv("order_products.csv")
users = pd.read_csv("users.csv")

print(users["label"].value_counts())
orders.shape

False    628
True     139
Name: label, dtype: int64


(5997, 4)

## Task 1. Feature Generation
In this task, you need to use featuretools to generate candidate features by using the above three tables.

### 1.1 Representing Data with EntitySet

Define entities and their relationships (see [https://docs.featuretools.com/en/stable/generated/featuretools.EntitySet.html](https://docs.featuretools.com/en/stable/generated/featuretools.EntitySet.html))

In [None]:
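# Get the relationship between entities
def load_entityset(orders, order_products, users):
    # --- Write your code below ---
    # Returns a tuple (dataframes, relationships, EntitySet); the EntitySet is the last item.
    # Each entry maps a dataframe name to (dataframe, index column).
    # iloc[:, 1:] drops the first column of each CSV (presumably a saved row index).
    dataframes = {
        "users": (users.iloc[:, 1:], "user_id"),
        "orders": (orders.iloc[:, 1:], "order_id"),
        # order_id repeats across rows in order_products, so it cannot be the index.
        # "order_id1" does not appear to be an existing column; featuretools then
        # warns and creates a new integer index column under that name.
        "order_products": (order_products.iloc[:, 1:], "order_id1"),
    }

    # Each relationship is (parent dataframe, parent column, child dataframe, child column).
    relationships = [
        ("users", "user_id", "orders", "user_id"),
        ("orders", "order_id", "order_products", "order_id"),
    ]

    return (dataframes, relationships, ft.EntitySet("my-entity-set", dataframes, relationships))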
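Each relationship in an EntitySet links a parent column to a child column in a one-to-many fashion. As a minimal sketch in plain Python (not part of the featuretools API; `check_relationship` and the toy rows are hypothetical), the following checks the two properties such a link relies on: the parent key is unique, and every child row points at an existing parent.

```python
def check_relationship(parent_rows, parent_key, child_rows, child_key):
    """Return a list of problems that would break a one-to-many relationship."""
    problems = []
    parent_ids = [row[parent_key] for row in parent_rows]
    if len(parent_ids) != len(set(parent_ids)):
        problems.append(f"{parent_key} is not unique in the parent table")
    # Child values that do not match any parent row.
    orphans = sorted({row[child_key] for row in child_rows} - set(parent_ids))
    if orphans:
        problems.append(f"child rows reference unknown {child_key} values: {orphans}")
    return problems

# Hypothetical toy rows, not from the assignment files.
toy_users = [{"user_id": 1}, {"user_id": 2}]
good_orders = [{"order_id": 10, "user_id": 1}, {"order_id": 11, "user_id": 2}]
bad_orders = good_orders + [{"order_id": 12, "user_id": 3}]

print(check_relationship(toy_users, "user_id", good_orders, "user_id"))
print(check_relationship(toy_users, "user_id", bad_orders, "user_id"))
```

Running a check like this on `users`/`orders` and `orders`/`order_products` before building the EntitySet can surface key problems early.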


### 1.2 Deep Feature Synthesis

In [None]:
# Automatically generate features
es = load_entityset(orders, order_products, users)

# use ft.dfs to perform feature engineering
# --- Write your code below ---
feature_matrix, feature_defs = ft.dfs(entityset=es[2],target_dataframe_name="users")




In [None]:
# output what features you generate
feature_matrix

| user_id | label | COUNT(orders) | COUNT(order_products) | MAX(order_products.aisle_id) | MAX(order_products.reordered) | MEAN(order_products.aisle_id) | MEAN(order_products.reordered) | MIN(order_products.aisle_id) | MIN(order_products.reordered) | MODE(order_products.department) | ... | SUM(orders.MEAN(order_products.aisle_id)) | SUM(orders.MEAN(order_products.reordered)) | SUM(orders.MIN(order_products.aisle_id)) | SUM(orders.MIN(order_products.reordered)) | SUM(orders.NUM_UNIQUE(order_products.department)) | SUM(orders.NUM_UNIQUE(order_products.product_name)) | SUM(orders.SKEW(order_products.aisle_id)) | SUM(orders.SKEW(order_products.reordered)) | SUM(orders.STD(order_products.aisle_id)) | SUM(orders.STD(order_products.reordered)) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | False | 4 | 21 | 121.0 | 1.0 | 60.523810 | 0.523810 | 21.0 | 0.0 | snacks | ... | 241.366667 | 2.100000 | 88.0 | 1.0 | 16.0 | 21.0 | 1.138542 | -0.608581 | 160.426417 | 1.095445 |
| 2 | True | 7 | 85 | 123.0 | 1.0 | 63.752941 | 0.341176 | 1.0 | 0.0 | produce | ... | 410.886447 | 2.277656 | 117.0 | 0.0 | 35.0 | 85.0 | 2.695193 | 7.175015 | 261.248415 | 2.443579 |
| 3 | False | 5 | 41 | 123.0 | 1.0 | 69.048780 | 0.341463 | 13.0 | 0.0 | produce | ... | 347.170707 | 1.669697 | 95.0 | 0.0 | 20.0 | 41.0 | -0.092793 | 2.432523 | 208.749499 | 1.999461 |
| 7 | False | 4 | 73 | 123.0 | 1.0 | 65.493151 | 0.493151 | 21.0 | 0.0 | beverages | ... | 262.386905 | 1.773810 | 90.0 | 0.0 | 31.0 | 73.0 | 0.823014 | -1.487708 | 142.991508 | 1.230281 |
| 10 | False | 4 | 114 | 123.0 | 1.0 | 67.342105 | 0.175439 | 5.0 | 0.0 | produce | ... | 246.256917 | 0.628327 | 51.0 | 0.0 | 21.0 | 114.0 | 0.257729 | 6.950682 | 146.689346 | 1.074301 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 996 | False | 4 | 45 | 128.0 | 1.0 | 52.711111 | 0.400000 | 3.0 | 0.0 | snacks | ... | 210.139835 | 1.669231 | 12.0 | 0.0 | 21.0 | 45.0 | 0.429243 | -0.967009 | 180.136629 | 1.489449 |
| 997 | False | 4 | 33 | 129.0 | 1.0 | 64.272727 | 0.212121 | 9.0 | 0.0 | produce | ... | 241.861538 | 0.607692 | 78.0 | 0.0 | 18.0 | 33.0 | 1.461201 | 1.981310 | 138.419278 | 0.963430 |
| 998 | False | 7 | 60 | 116.0 | 1.0 | 64.966667 | 0.683333 | 4.0 | 0.0 | dairy eggs | ... | 492.290548 | 5.014069 | 154.0 | 3.0 | 36.0 | 60.0 | -0.674960 | -3.921700 | 239.109143 | 1.308153 |
| 999 | True | 12 | 300 | 131.0 | 1.0 | 73.936667 | 0.606667 | 3.0 | 0.0 | produce | ... | 849.528198 | 7.922145 | 185.0 | 2.0 | 90.0 | 300.0 | -3.345759 | -9.896658 | 436.775733 | 3.930090 |
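The feature names in this output follow a regular pattern: an aggregation applied to `order_products` columns, optionally wrapped in a second aggregation over `orders`. The sketch below enumerates names of that shape in plain Python to show how quickly the candidate space grows with stacking depth. It is a deliberate simplification, not how featuretools works internally: the five aggregations and two columns are assumptions chosen for illustration, and real Deep Feature Synthesis also applies type rules and other primitives.

```python
from itertools import product

# Simplifying assumptions: five numeric aggregations, two numeric child columns.
AGGREGATIONS = ["MAX", "MEAN", "MIN", "STD", "SUM"]
ORDER_PRODUCT_COLUMNS = ["aisle_id", "reordered"]

def enumerate_user_features(aggregations, columns):
    """List candidate feature names for the users table up to depth 2."""
    # Depth 1: aggregate order_products columns.
    depth_1 = [f"{agg}(order_products.{col})"
               for agg, col in product(aggregations, columns)]
    # Depth 2: aggregate the per-order depth-1 features up to each user.
    depth_2 = [f"{outer}(orders.{inner})"
               for outer, inner in product(aggregations, depth_1)]
    return depth_1 + depth_2

features = enumerate_user_features(AGGREGATIONS, ORDER_PRODUCT_COLUMNS)
print(len(features))  # 10 depth-1 names + 50 depth-2 names = 60
print(features[-1])
```

Even this restricted setup yields 60 candidates from two columns, which is why a selection step is needed afterwards.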


## Task 2. Feature Selection
In this task, you will select 10 useful features and train a *Random Forest* model. The goal is to beat the AUC shown earlier. Note that you must use the Random Forest and the hyperparameters we provide in Section 2.2. In other words, your job is to achieve an AUC higher than 0.61 through feature generation/selection rather than through hyperparameter tuning or model selection.

### 2.1 Select top features

In [None]:
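# --- Write your code below ---
# Select top-10 features and return X, y (X.shape = (767, 10))
# Spearman correlation matrix over the feature matrix; the label column is included.
y0=feature_matrix.corr(method='spearman')
# Keep only each feature's correlation with the label.
y1=pd.DataFrame(y0["label"].dropna())
# Sort by signed correlation, so only positively correlated features rank at the top.
y2=y1.sort_values(by=['label'], ascending=False)
# Row 0 is the label itself (correlation 1), so take the next 10 rows.
y3=y2[1:11]
# Move the feature names from the index into a column named "index".
y5=y3.reset_index()
y=feature_matrix["label"].dropna()
x=feature_matrix[y5["index"]].dropna()
x.shape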

(767, 10)
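For reference, Spearman correlation is the Pearson correlation of the ranks, with tied values sharing an average rank. The following is a minimal plain-Python sketch of that definition together with a top-k selection. Unlike the cell above, it ranks by absolute correlation so that strongly negative features are also kept; that is a variation, not what the cell does. The column names and values are hypothetical.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    return pearson(average_ranks(a), average_ranks(b))

def top_k_features(columns, label, k):
    """Names of the k columns with the largest absolute Spearman correlation."""
    scores = {name: spearman(vals, label) for name, vals in columns.items()}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)[:k]

# Hypothetical toy columns, not from the assignment data.
label = [0, 0, 1, 1, 1, 0]
columns = {
    "basket_size": [2, 3, 8, 9, 7, 1],          # rises with the label
    "days_since_last": [30, 25, 3, 5, 4, 28],   # falls with the label
    "noise": [5, 1, 4, 2, 6, 3],
}
print(top_k_features(columns, label, 2))
```

A correlation filter like this scores each feature on its own, so it can select several near-duplicate features, as the list of selected features in the next section suggests.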

### 2.2 Get accuracy and list features

In [None]:
clf = RandomForestClassifier(n_estimators=400, n_jobs=-1)
scores = cross_val_score(estimator=clf,X=x, y=y, cv=3,
                             scoring="roc_auc", verbose=True)

print("AUC %.2f" % (scores.mean()))

# Print top-10 features
for i in y5["index"]:
  print(i)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


AUC 0.67
MEAN(orders.NUM_UNIQUE(order_products.product_name))
MEAN(orders.COUNT(order_products))
MEAN(orders.SUM(order_products.aisle_id))
COUNT(order_products)
SUM(orders.NUM_UNIQUE(order_products.product_name))
SUM(order_products.aisle_id)
NUM_UNIQUE(order_products.product_name)
MIN(orders.COUNT(order_products))
MIN(orders.NUM_UNIQUE(order_products.product_name))
MEAN(orders.NUM_UNIQUE(order_products.department))


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    6.7s finished
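AUC is used here rather than accuracy because the classes are imbalanced: with 628 negative and 139 positive users, always predicting the negative class already reaches about 82% accuracy (628/767). AUC instead measures how often a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below implements that pairwise definition in plain Python; it is an illustration of the metric, not the code scikit-learn uses, and the scores are made up.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 3 of 4 pairs are ordered correctly: 0.75
```

A model that gives every user the same score gets 0.5 under this definition, which is the baseline that both 0.61 and 0.67 should be compared against.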


## Task 3. Writing Questions

1. Please list three advantages and three disadvantages of featuretools.
2. For the disadvantages you listed above, do you have any ideas for improving them?

--- Write your answer here---

Task 3.1) 

Advantages:

1) Reduced development time, because features are generated easily.

2) Featuretools works alongside many existing tools such as pandas and scikit-learn.

3) New features are generated by transforming the existing data.

4) Good handling of nested data.

Disadvantages:

1) In the case of supervised machine learning, one must supply one's own labels.

2) If the data is split into train/test sets before using Featuretools, additional steps need to be taken to ensure the same features are generated for training and testing.

3) Some of the generated features might not be useful for the problem at hand.

4) Featuretools is meant for generating features from small datasets that can be stored and processed on a single machine.


Task 3.2)

1) About Disadvantage 1: To simplify the process, one can use Compose, an open source project for automatically generating labels with cutoff times.

2) About Disadvantage 2: One way is to create a separate EntitySet from the test data and call calculate_feature_matrix() with the feature definitions from the training set.

3) About Disadvantage 3: Filter methods, wrapper methods and some domain knowledge can be used to select appropriate features.

4) About Disadvantage 4: In the case of big data, use the Feature Labs APIs to run Featuretools natively on Apache Spark.
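To illustrate what "labels with cutoff times" means for this assignment's target, here is a minimal plain-Python sketch. It does not use the Compose API; `make_labels`, the dates, and the purchase rows are all hypothetical. Users with history up to the cutoff receive a label, and the label is true only if a matching purchase falls in the 4 weeks after the cutoff.

```python
from datetime import datetime, timedelta

def make_labels(purchases, cutoff, window=timedelta(weeks=4), keyword="banana"):
    """purchases: list of (user_id, product_name, order_time) tuples."""
    end = cutoff + window
    # Only users with history at or before the cutoff get a label.
    labels = {user: False for user, _, t in purchases if t <= cutoff}
    for user, product_name, t in purchases:
        # A purchase counts only if it falls inside (cutoff, cutoff + window].
        if user in labels and cutoff < t <= end and keyword in product_name.lower():
            labels[user] = True
    return labels

# Hypothetical toy purchases and cutoff.
cutoff = datetime(2015, 3, 1)
purchases = [
    ("u1", "Bag of Organic Bananas", datetime(2015, 2, 10)),
    ("u1", "Banana", datetime(2015, 3, 10)),   # inside the window
    ("u2", "Milk", datetime(2015, 2, 20)),
    ("u2", "Banana", datetime(2015, 4, 15)),   # after the window ends
    ("u3", "Banana", datetime(2015, 3, 5)),    # no history before the cutoff
]
print(make_labels(purchases, cutoff))
```

Features for each user would then be computed only from rows at or before the cutoff, so that nothing from the prediction window leaks into the inputs.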



## Submission
Complete the code in this notebook, and submit it to the CourSys activity Assignment 8.