# Assignment 2 DSC 102 2020 WI

## Introduction

In this assignment we will conduct data engineering for the Amazon dataset. The extracted features will be used for your next assignment, where you train a model (or models) to predict user ratings for a product.

We will be using Apache Spark for this assignment. The default Spark API will be DataFrame, as it is now the recommended choice over the RDD API. That said, feel free to switch to the RDD API if you see it as a better fit for a task. You can request the input data in RDD format to start with (see ```INPUT_FORMAT``` below), and you can also convert between DataFrame and RDD within your solution.

You will be conducting the tasks on AWS EMR. You will first spawn a smaller cluster for development and then switch to a deployment cluster for testing.

## Dataset description
You are expected to extract features from three tables. Their schemas and descriptions are listed below:
```
1. product
    |-- asin: string, the product id, e.g., 'B00I8HVV6E'
    |-- salesRank: map, a map between category and sales rank, e.g., {'Home & Kitchen': 796318}
    |    |-- key: string, category, e.g., 'Home & Kitchen'
    |    |-- value: integer, rank, e.g., 796318
    |-- categories: array, list of list of categories, e.g., [['Home & Kitchen', 'Artwork']]
    |    |-- element: array, list of categories, e.g., ['Home & Kitchen', 'Artwork']
    |    |    |-- element: string, category, e.g., 'Home & Kitchen'
    |-- title: string, title of product, e.g., 'Intelligent Design Cotton Canvas'
    |-- price: float, price of product, e.g., 27.9
    |-- related: map, related information, e.g., {'also_viewed': ['B00I8HW0UK']}
    |    |-- key: string, the attribute name of the information, e.g., 'also_viewed'
    |    |-- value: array, array of product ids, e.g., ['B00I8HW0UK']
    |    |    |-- element: string product id , e.g., 'B00I8HW0UK'
2. product_processed
    |-- asin: string, same as above
    |-- title: string, the imputed title column, e.g., 'Intelligent Design Cotton Canvas'
    |-- category: string, the extracted category column, e.g., 'Home & Kitchen'
3. review
    |-- reviewerID: string, the reviewer id, e.g., 'A1MIP8H7G33SHC'
    |-- asin: string, the product id, e.g., 'B00I8HVV6E'
    |-- overall: float, the rating associated with the review, e.g., 5.0
```

The ```review``` table will be useful for extracting the rating information for each product in Task 1. We will be working primarily with ```product``` table throughout Task 1-4. ```product_processed``` is used for Task 5-6.

Refer to https://spark.apache.org/docs/latest/api/python/pyspark.sql.html for API guide.

## Task summary
You will be asked to complete six tasks in total. In each task you will need to implement a function ```task_i()```. The function signatures are fixed. Each function will take in several inputs and conduct the desired transformations. At the end of each task, you will be asked to extract several statistical properties (mean, variance, etc.) from the transformed data. You will need to programmatically put these properties in a python dictionary named ```res```, the schema of which is also given.

Each task will be unit tested: every function you write is tested in isolation from the rest of your code. Partial points are awarded for the tasks that pass, even if other tasks fail.

## Conventions
### Result format
Each task comes with a pre-defined schema for the output results. The result must be stored as a native Python dictionary and must contain all the keys and nested structures.

For example the following schema:
```
res
 | -- count_total: int -- count of total rows of the entire table after your operations
 | -- mean_price: float -- mean value of column price
```
Python code that composes this dictionary would look like:

```python
import pyspark.sql.functions as F

data = ... # Your transformed data
res = {
    'count_total': None,
    'mean_price': None
} # Skeleton given for the result
res['count_total'] = data.count() # Do not hard-code the value!
# select() returns a DataFrame, so pull the scalar out of its first row
res['mean_price'] = data.select(F.avg(F.col('price'))).head()[0] # Do not hard-code the value!
```
### Dealing with ```null``` values
The input tables contain empty values, ```null``` values and dangling references. You do not need to deal with empty values and dangling references unless instructed. For ```null``` values we follow the common practice in SQL: unless instructed otherwise, ignore all ```null``` entries when calculating statistics such as count, mean and variance. Of course, do not ignore ```null``` when you are explicitly asked to count the number of ```null``` entries.
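
To make the convention concrete, here is a minimal plain-Python sketch on a made-up column, with ```None``` standing in for ```null```. The column values and the ```stats``` dictionary are hypothetical and not part of the assignment. The variance shown is the sample variance (denominator n - 1), which is also what Spark's ```F.variance``` computes.

```python
from statistics import mean, variance

# Hypothetical toy column; None stands in for Spark's null.
prices = [10.0, None, 14.0, None, 12.0]

# Drop nulls before computing statistics, as SQL aggregates do.
non_null = [p for p in prices if p is not None]

stats = {
    'count_total': len(prices),                         # row count keeps null rows
    'mean_price': mean(non_null),                       # nulls ignored
    'variance_price': variance(non_null),               # sample variance, nulls ignored
    'numNulls_price': sum(p is None for p in prices),   # nulls counted explicitly
}
print(stats)
```

Note that ```count_total``` counts rows of the table, so rows holding ```null``` still contribute to it; only the per-column statistics skip them.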

## Submission
You will **not** submit this notebook. Instead, you need to put your implementation of ```task_1``` to ```task_6```, along with all the dependencies you imported, in the file co-located with this notebook: ```assignment2.py```. Then rename the file to ```<your pid>_assignment2.py```. For instance, if your pid is ```a45333444```, your file will be named ```a45333444_assignment2.py```.

You need to make sure your script runs under the deployment environment (```emr-launch -n 4 -d```), otherwise you may lose points.

TBD: upload the py file.

### Set the following parameters

In [None]:
PID = '' # your pid, for instance: 'a43223333'
INPUT_FORMAT = 'dataframe' # choose a format of your input data, valid options: 'dataframe', 'rdd'
DEPLOY = False # Is it deployment phase

In [None]:
%load_ext autoreload
%autoreload 2
import os
from pyspark.sql import SparkSession
from math import isclose
from utilities import SEED
from utilities import PA2Test
from utilities import PA2Data
import time
if INPUT_FORMAT == 'dataframe':
    import pyspark.ml as M
    import pyspark.sql.functions as F
    import pyspark.sql.types as T
elif INPUT_FORMAT == 'rdd':
    import pyspark.mllib as M
# Boilerplate
os.environ['PYSPARK_SUBMIT_ARGS'] = '--py-files utilities.py,assignment2.py \
--master yarn \
--deploy-mode client \
--conf spark.memory.fraction=0.8 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.sql.crossJoin.enabled=true \
pyspark-shell'
class args:
    data_root = 's3://dsc102-pa2-public/dataset'
    review_filename = 'user_reviews_train.csv'
    product_filename = 'metadata_header.csv'
    product_processed_filename = 'product_processed.csv'
    output_root = 's3://{}-pa2/test_results'.format(PID)
    test_results_root = 's3://dsc102-pa2-public/test_results'
    pid = PID
review_path = os.path.join(args.data_root, args.review_filename)
product_path = os.path.join(args.data_root, args.product_filename)
product_processed_path = os.path.join(args.data_root, args.product_processed_filename)

begin = time.time()

spark = SparkSession.builder.appName(args.pid).getOrCreate()
url = spark.conf.get("spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES")
print("Connect to Spark UI: {}".format(url))

path_dict = {
    'review': review_path,
    'product': product_path,
    'product_processed': product_processed_path
}

tests = PA2Test(spark, args.test_results_root)

data_io = PA2Data(spark, path_dict, args.output_root, deploy=DEPLOY)

data_dict, count_dict = data_io.load_all(input_format=INPUT_FORMAT)

In [None]:
# Import your own dependencies



#-----------------------------

# (WIP)Task0: warm up 
This task is provided for you to get familiar with the Spark API. It will not be graded.

Your task is to implement the function below. 
1. For each product ID ```asin``` in ```product_data```, compute the mean value of its ratings. The ratings are stored in the column ```overall``` of ```review_data```, whose ```asin``` column references the product table. Store the mean value in a new column named ```meanRating``` in table ```product_data```.

1. Similarly, put the count of ratings in a new column named ```countRating```.

1. You need to conduct the above operations, then extract some statistics out of the generated columns. You need to put the statistics in a python dictionary named ```res```. The description and schema of it are as follows:
    ```
    res
     | -- count_total: int -- count of total rows of the entire table after your operations
     | -- mean_meanRating: float -- mean value of column meanRating
     | -- variance_meanRating: float -- variance of meanRating
     | -- numNulls_meanRating: int -- count of null-value entries of meanRating
     | -- mean_countRating: float -- mean value of countRating
     | -- variance_countRating: float -- variance of countRating
     | -- numNulls_countRating: int -- count of null-value entries of countRating
     
    ```
If a product ID has no matching row in ```review```, meaning it was never reviewed, put ```null``` in both ```meanRating``` and ```countRating```. 

In [None]:
def task_0():
    res = None
    return res

# Task1: mean and count of ratings 
First you will aggregate and extract some information from the user review table. For each product, we want to know the mean rating and the number of ratings it received.

Your task is to implement the function below. 
1. For each product ID ```asin``` in ```product_data```, compute the mean value of its ratings. The ratings are stored in the column ```overall``` of ```review_data```, whose ```asin``` column references the product table. Store the mean value in a new column named ```meanRating``` in table ```product_data```.

1. Similarly, put the count of ratings in a new column named ```countRating```.

1. You need to conduct the above operations, then extract some statistics out of the generated columns. You need to put the statistics in a python dictionary named ```res```. The description and schema of it are as follows:
    ```
    res
     | -- count_total: int -- count of total rows of the entire table after your operations
     | -- mean_meanRating: float -- mean value of column meanRating
     | -- variance_meanRating: float -- variance of meanRating
     | -- numNulls_meanRating: int -- count of null-value entries of meanRating
     | -- mean_countRating: float -- mean value of countRating
     | -- variance_countRating: float -- variance of countRating
     | -- numNulls_countRating: int -- count of null-value entries of countRating
     
    ```
If a product ID has no matching row in ```review```, meaning it was never reviewed, put ```null``` in both ```meanRating``` and ```countRating```. 
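
The expected semantics resemble a left join from the product table onto aggregated reviews. The following plain-Python sketch illustrates them on hypothetical toy rows (the ```products``` and ```reviews``` lists and the ```joined``` result are invented for illustration). It is not a Spark solution, only a picture of which rows should survive and where ```null``` should appear.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data standing in for the two tables.
products = ['A', 'B', 'C']                       # asin values in product_data
reviews = [('u1', 'A', 5.0), ('u2', 'A', 3.0),   # (reviewerID, asin, overall)
           ('u3', 'B', 4.0), ('u4', 'Z', 1.0)]   # 'Z' is not a known product

# Group ratings by product ID.
ratings = defaultdict(list)
for _reviewer, asin, overall in reviews:
    ratings[asin].append(overall)

# Left-join semantics: every product is kept; unreviewed ones get None.
joined = []
for asin in products:
    r = ratings.get(asin)
    joined.append({
        'asin': asin,
        'meanRating': mean(r) if r else None,
        'countRating': len(r) if r else None,
    })
print(joined)
```

Product ```'C'``` stays in the output with ```None``` in both new columns, while the review for the unknown product ```'Z'``` does not add a row.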

In [None]:
# %load -s task_1 assignment2.py
def task_1(data_io, review_data, product_data):
    # -----------------------------Column names--------------------------------
    # Inputs:
    asin_column = 'asin'
    overall_column = 'overall'
    # Outputs:
    mean_rating_column = 'meanRating'
    count_rating_column = 'countRating'
    # -------------------------------------------------------------------------

    # ---------------------- Your implementation begins------------------------





    # -------------------------------------------------------------------------

    # ---------------------- Put results in res dict --------------------------
    # Calculate the values programmatically. Do not change the keys and do not
    # hard-code values in the dict. Your submission will be evaluated with
    # different inputs.
    # Modify the values of the following dictionary accordingly.
    res = {
        'count_total': None,
        'mean_meanRating': None,
        'variance_meanRating': None,
        'numNulls_meanRating': None,
        'mean_countRating': None,
        'variance_countRating': None,
        'numNulls_countRating': None
    }
    # Modify res:




    # -------------------------------------------------------------------------

    # ----------------------------- Do not change -----------------------------
    data_io.save(res, 'task_1')
    return res
    # -------------------------------------------------------------------------


In [None]:
res = task_1(data_io, data_dict['review'], data_dict['product'][['asin']])


# Task 2: flattening ```categories``` and ```salesRank```
Implement a function ```task_2()``` to conduct the following operations:

1. For the ```product``` table, each item in column ```categories``` contains an array of arrays of hierarchical categories. The schema is ```ArrayType(ArrayType(StringType))```. We are only going to use the most general category, which is the first element of the nested array: ```array[0][0]```. Create a new column named ```category``` and, for each row, put the ```array[0][0]``` of column ```categories``` in it. If the entry in ```categories``` is ```null```, put ```null``` in ```category``` as well. Also put ```null``` if the ```categories``` value is not ```null``` but the array is empty.

1. On the other hand, each value in column ```salesRank``` is a dictionary with a single ```(category, rank)``` pair. Your task is to retrieve this key-value pair and put them in two columns, respectively. Put the category in a new column named ```bestSalesCategory``` and the rank in ```bestSalesRank```. You should put ```null``` in these new columns if the original entry in ```salesRank``` was ```null```. Note this ```bestSalesCategory``` may or may not be identical to ```category```.

1. You need to conduct the above operations, then extract some statistics out of the generated columns. You need to put the statistics in a python dictionary named ```res```. The description and schema of it are as follows:
    ```
    res
     | -- count_total: int -- count of total rows of the entire table
     | -- mean_bestSalesRank: float -- mean value of *bestSalesRank*, excluding null-value entries
     | -- variance_bestSalesRank: float -- variance of *bestSalesRank*, excluding null-value entries
     | -- numNulls_category: int -- count of null-value entries of *category*
     | -- countDistinct_category: int -- count of all distinct values of *category*, excluding null
     | -- numNulls_bestSalesCategory: int -- count of null-value entries of *bestSalesCategory*
     | -- countDistinct_bestSalesCategory: int -- count of distinct values of *bestSalesCategory*, excluding null
    ```

Hint: use ```DataFrame.withColumn()``` to apply operations on a column and add the result as a new column. To drop a column, use ```DataFrame.drop()```; to rename one, use ```DataFrame.withColumnRenamed()```.

References: https://spark.apache.org/docs/latest/ml-features
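
The per-row logic can be pictured with a small plain-Python helper. The function ```flatten_row``` is hypothetical and only mirrors the rules stated above. Returning ```None``` for an empty inner array or an empty ```salesRank``` map is an assumption, since the text only covers ```null``` entries and empty outer arrays.

```python
def flatten_row(categories, sales_rank):
    """Return (category, bestSalesCategory, bestSalesRank) for one row."""
    # categories[0][0], or None when the array is null or empty.
    # Treating an empty inner array as None is an assumption.
    category = categories[0][0] if categories and categories[0] else None

    # salesRank holds a single (category, rank) pair.
    # Treating an empty map like null is an assumption.
    if sales_rank:
        best_category, best_rank = next(iter(sales_rank.items()))
    else:
        best_category, best_rank = None, None
    return category, best_category, best_rank


print(flatten_row([['Home & Kitchen', 'Artwork']], {'Home & Kitchen': 796318}))
print(flatten_row(None, None))
print(flatten_row([], {'Books': 5}))
```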

In [None]:
# %load -s task_2 assignment2.py
def task_2(data_io, product_data):
    # -----------------------------Column names--------------------------------
    # Inputs:
    salesRank_column = 'salesRank'
    categories_column = 'categories'
    asin_column = 'asin'
    # Outputs:
    category_column = 'category'
    bestSalesCategory_column = 'bestSalesCategory'
    bestSalesRank_column = 'bestSalesRank'
    # -------------------------------------------------------------------------

    # ---------------------- Your implementation begins------------------------





    # -------------------------------------------------------------------------

    # ---------------------- Put results in res dict --------------------------
    res = {
        'count_total': None,
        'mean_bestSalesRank': None,
        'variance_bestSalesRank': None,
        'numNulls_category': None,
        'countDistinct_category': None,
        'numNulls_bestSalesCategory': None,
        'countDistinct_bestSalesCategory': None
    }
    # Modify res:




    # -------------------------------------------------------------------------

    # ----------------------------- Do not change -----------------------------
    data_io.save(res, 'task_2')
    return res
    # -------------------------------------------------------------------------


In [None]:
res = task_2(data_io, data_dict['product'][['asin', 'categories', 'salesRank']])
print(res)

# Task 3: flattening ```related```

Inside the ```related``` column there is a map containing four keys/attributes: ```also_bought```, ```also_viewed```, ```bought_together```, and ```buy_after_viewing```. Each of them contains an array of ```asin```s (Amazon product ID). We call these arrays attribute arrays. We need to flatten the schema by calculating the length of the arrays. In addition to the above, we would like to know the average prices of the products.

1. The logic for all four attributes is identical. For the sake of simplicity, you are only required to flatten the ```also_viewed``` attribute. Your task is to implement the following function ```task_3()```.

1. For each row, you need to:
    1. First calculate the mean price of all products from the ```also_viewed``` attribute array. 
    1. Then put the mean price in a new column named ```meanPriceAlsoViewed```. When you calculate these mean values, remember to ignore products that do not match any record in ```product``` or that have ```null``` in price.
    1. Similarly, put the length of that array in a new column ```countAlsoViewed```. You do not need to check whether the product IDs in that array are dangling references. Put ```null``` (instead of zero) in the new column if the attribute array is ```null``` or empty.

1. You need to conduct the above operations, then extract some statistics out of the generated columns. You need to put the statistics in a python dictionary named ```res```. Its description and schema are as follows:
    ```
    res
     | -- count_total: int -- number of rows of the entire processed table
     | -- mean_meanPriceAlsoViewed: float -- mean value of meanPriceAlsoViewed
     | -- variance_meanPriceAlsoViewed: float -- variance of meanPriceAlsoViewed
     | -- numNulls_meanPriceAlsoViewed: int -- count of null-value entries of meanPriceAlsoViewed
     | -- mean_countAlsoViewed: float -- mean value of countAlsoViewed
     | -- variance_countAlsoViewed: float -- variance of countAlsoViewed
     | -- numNulls_countAlsoViewed: int -- count of null-value entries of countAlsoViewed
    ```
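
The rules for dangling references and ```null``` prices are easy to get wrong, so here is a plain-Python sketch on a hypothetical four-row product table. The ```products``` dictionary and the ```flatten_also_viewed``` helper are invented for illustration. Putting ```None``` in the mean column when the array is ```null``` or empty, or when no viewed product has a usable price, is an assumption that follows from there being nothing to average.

```python
from statistics import mean

# Hypothetical toy product table: asin -> (price, also_viewed array).
products = {
    'A': (10.0, ['B', 'C', 'X']),   # 'X' is a dangling reference
    'B': (20.0, ['A']),
    'C': (None, []),                # null price, empty array
    'D': (5.0, None),               # null array
}

# Lookup table used to resolve each referenced product's price.
price_of = {asin: price for asin, (price, _) in products.items()}


def flatten_also_viewed(also_viewed):
    """Return (meanPriceAlsoViewed, countAlsoViewed) for one row."""
    if not also_viewed:             # null or empty array
        return None, None
    # Skip dangling references and products whose price is null.
    prices = [price_of[a] for a in also_viewed if price_of.get(a) is not None]
    mean_price = mean(prices) if prices else None
    # The count uses the raw array length, dangling references included.
    return mean_price, len(also_viewed)


rows = {asin: flatten_also_viewed(av) for asin, (_, av) in products.items()}
print(rows)
```

For product ```'A'```, only ```'B'``` contributes a price, so the mean is 20.0 while the count remains 3.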





In [None]:
# %load -s task_3 assignment2.py
def task_3(data_io, product_data):
    # -----------------------------Column names--------------------------------
    # Inputs:
    asin_column = 'asin'
    price_column = 'price'
    attribute = 'also_viewed'
    related_column = 'related'
    # Outputs:
    meanPriceAlsoViewed_column = 'meanPriceAlsoViewed'
    countAlsoViewed_column = 'countAlsoViewed'
    # -------------------------------------------------------------------------

    # ---------------------- Your implementation begins------------------------





    # -------------------------------------------------------------------------

    # ---------------------- Put results in res dict --------------------------
    res = {
        'count_total': None,
        'mean_meanPriceAlsoViewed': None,
        'variance_meanPriceAlsoViewed': None,
        'numNulls_meanPriceAlsoViewed': None,
        'mean_countAlsoViewed': None,
        'variance_countAlsoViewed': None,
        'numNulls_countAlsoViewed': None
    }
    # Modify res:




    # -------------------------------------------------------------------------

    # ----------------------------- Do not change -----------------------------
    data_io.save(res, 'task_3')
    return res
    # -------------------------------------------------------------------------


In [None]:
res = task_3(data_io, data_dict['product'][['asin', 'related', 'price']])

# Task 4: data imputation
You may have noticed that there are lots of ```null``` values in the table. Now your task is to impute them with other values that can be used.

Since we have already flattened the schema, we only have two types of values in our table: numerical (including integer and floating numbers) and string. Now you need to impute a numerical column ```price```, as well as a string column ```title```.

1. Please implement a function ```task_4()```. For the numerical column ```price```, first cast it to ```FloatType```. Then impute the ```null``` values in the column with the mean value of all the non-```null``` values. Store the outputs in a new column ```meanImputedPrice```.
1. Same as above, but this time impute ```null``` values with the **median** value of all the non-```null``` values in column ```price```. Store the outputs in a new column ```medianImputedPrice```.
1. As for the ```StringType``` column ```title```, we simply impute ```null``` values with the special string ```'unknown'```. Please also impute empty strings ```''```. Store the outputs in a new column ```unknownImputedTitle```.

1. You need to conduct the above operations, then extract some statistics out of the generated columns. You need to put the statistics in a python dictionary named ```res```. The description and schema are as follows:
    ```
    res
     | -- count_total: int -- count of total rows of the entire table after above operations
     | -- mean_meanImputedPrice: float or None -- mean
     | -- variance_meanImputedPrice: float -- variance
     | -- numNulls_meanImputedPrice: int -- count of null-value entries
     | -- mean_medianImputedPrice: float or None -- mean
     | -- variance_medianImputedPrice: float -- variance
     | -- numNulls_medianImputedPrice: int -- count of null-value entries
     | -- numUnknowns_unknownImputedTitle: int -- count of 'unknown' value entries
    ```
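
A plain-Python sketch of the three imputations on hypothetical toy columns follows. The ```prices``` and ```titles``` lists are invented for illustration. It uses the exact median; Spark offers approximate quantiles (for example ```DataFrame.approxQuantile```), so a Spark result may differ slightly from an exact computation.

```python
from statistics import mean, median

# Hypothetical toy columns; None stands in for null.
prices = [10.0, None, 20.0, 60.0, None]
titles = ['Canvas', None, '', 'Piano', 'unknown']

# Statistics are computed from the non-null values only.
observed = [p for p in prices if p is not None]
mean_price, median_price = mean(observed), median(observed)

# Each output is a new column; the original column is left untouched.
mean_imputed = [mean_price if p is None else p for p in prices]
median_imputed = [median_price if p is None else p for p in prices]
unknown_imputed = ['unknown' if t is None or t == '' else t for t in titles]

num_unknowns = unknown_imputed.count('unknown')
print(mean_imputed, median_imputed, unknown_imputed, num_unknowns)
```

After imputation neither price column contains ```None```. The count of ```'unknown'``` entries includes titles that were already ```'unknown'``` in the input.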


In [None]:
# %load -s task_4 assignment2.py
def task_4(data_io, product_data):
    # -----------------------------Column names--------------------------------
    # Inputs:
    price_column = 'price'
    title_column = 'title'
    # Outputs:
    meanImputedPrice_column = 'meanImputedPrice'
    medianImputedPrice_column = 'medianImputedPrice'
    unknownImputedTitle_column = 'unknownImputedTitle'
    # -------------------------------------------------------------------------

    # ---------------------- Your implementation begins------------------------





    # -------------------------------------------------------------------------

    # ---------------------- Put results in res dict --------------------------
    res = {
        'count_total': None,
        'mean_meanImputedPrice': None,
        'variance_meanImputedPrice': None,
        'numNulls_meanImputedPrice': None,
        'mean_medianImputedPrice': None,
        'variance_medianImputedPrice': None,
        'numNulls_medianImputedPrice': None,
        'numUnknowns_unknownImputedTitle': None
    }
    # Modify res:




    # -------------------------------------------------------------------------

    # ----------------------------- Do not change -----------------------------
    data_io.save(res, 'task_4')
    return res
    # -------------------------------------------------------------------------


In [None]:
res = task_4(data_io, data_dict['product'][['price', 'title']])

# Task 5: embedding ```title``` with ```word2vec```
*This task assumes the ```title``` column is already imputed with ```unknown```. We will provide the imputed data table ```product_processed_data```.*

In this task we want to transform ```title``` into a fixed-length vector by training and then applying word2vec. 

1. You need to implement function ```task_5()```. For each row, 
    1. convert all characters in ```title``` to lowercase, 
    1. then split ```title``` by whitespace (```' '```) to an array of strings, store it in a new column ```titleArray```

1. Train a word2vec model on the column ```titleArray```, then for each row, transform ```titleArray``` into a vector. First transform every word in the array to a vector, then simply average the word vectors to obtain the vector for the title. Put the title vector in a new column named ```titleVector```. Do not try to implement word2vec yourself. Instead, use ```M.feature.Word2Vec```, which has a built-in method to do the transformation. See instructions below.

1. Use your trained word2vec model to get the 10 closest synonyms, along with their similarity scores (based on cosine similarity of word vectors, in descending order), for each of the three input words ```<word_0>```, ```<word_1>```, and ```<word_2>```. ```M.feature.Word2Vec``` also has a built-in method for this task.

1. You need to conduct the above operations, then extract some statistics out of the generated columns. You need to put the statistics in a python dictionary named ```res```. The description and schema are as follows:
    ```
    res
     | -- count_total: int -- count of total rows of the entire transformed table
     | -- size_vocabulary: int -- the size of the vocabulary of your word2vec model
     | -- word_0_synonyms: list -- synonyms tuples of word_0
     |    | -- element: tuple -- tuple of format (synonym, score)
     |    |    | -- element: string -- synonym
     |    |    | -- element: float -- score
     | -- word_1_synonyms: list 
     |    | -- element: tuple 
     |    |    | -- element: string 
     |    |    | -- element: float
     | -- word_2_synonyms: list 
     |    | -- element: tuple 
     |    |    | -- element: string 
     |    |    | -- element: float
    ```


**word2vec instructions**:

1. Set ```minCount```, the minimum number of times a token must appear to be included in the word2vec model’s vocabulary to be ```100```.

1. Set the dimension of output word embedding to ```16```.

1. Set the random seed to ```SEED```, a global variable defined to be 102.

1. Set ```numPartitions``` to be ```4```.

1. You should keep all other settings as default.

1. ```M.feature.Word2Vec``` is not fully reproducible (although we have set the seed here). We are aware of the issue and your score will not be affected by its internal randomness.

https://spark.apache.org/docs/latest/ml-features.html#word2vec
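
To illustrate what the steps compute, here is a plain-Python sketch with hypothetical 2-dimensional word vectors; a trained model would supply the real ones. The ```vectors``` dictionary and the three helper functions are invented for illustration. Skipping words that have no vector is an assumption, and the built-in transformer may treat out-of-vocabulary words differently.

```python
import math

# Hypothetical word vectors; a trained word2vec model would provide these.
vectors = {'grand': [1.0, 0.0], 'piano': [0.0, 1.0], 'bench': [1.0, 1.0]}


def title_to_array(title):
    """Lowercase the title and split it on single spaces."""
    return title.lower().split(' ')


def title_to_vector(words, dim=2):
    """Average the vectors of the words (unknown words skipped: an assumption)."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def synonyms(word, k):
    """Return the k most similar words as (synonym, score), best first."""
    scored = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


words = title_to_array('Grand Piano')
print(words, title_to_vector(words), synonyms('piano', 2))
```

The list returned by ```synonyms``` has the same shape as the ```word_i_synonyms``` entries in ```res```: a list of ```(synonym, score)``` tuples sorted by descending score.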

In [None]:
# %load -s task_5 assignment2.py
def task_5(data_io, product_processed_data, word_0, word_1, word_2):
    # -----------------------------Column names--------------------------------
    # Inputs:
    title_column = 'title'
    # Outputs:
    titleArray_column = 'titleArray'
    titleVector_column = 'titleVector'
    # -------------------------------------------------------------------------

    # ---------------------- Your implementation begins------------------------





    # -------------------------------------------------------------------------

    # ---------------------- Put results in res dict --------------------------
    res = {
        'count_total': None,
        'size_vocabulary': None,
        'word_0_synonyms': [(None, None), ],
        'word_1_synonyms': [(None, None), ],
        'word_2_synonyms': [(None, None), ]
    }
    # Modify res:
    res['count_total'] = product_processed_data_output.count()
    res['size_vocabulary'] = model.getVectors().count()
    for name, word in zip(
        ['word_0_synonyms', 'word_1_synonyms', 'word_2_synonyms'],
        [word_0, word_1, word_2]
    ):
        res[name] = model.findSynonymsArray(word, 10)
    # -------------------------------------------------------------------------

    # ----------------------------- Do not change -----------------------------
    data_io.save(res, 'task_5')
    return res
    # -------------------------------------------------------------------------


In [None]:
res = task_5(data_io, data_dict['product_processed'], 'piano', 'rice', 'laptop')

# Task 6: one-hot encoding ```category``` and PCA
*Assume the schema of ```categories``` is already flattened and ```unknown``` imputed for the input data. We will provide you with the preprocessed table*

Now you need to one-hot encode the categorical features. Also, these categories may be correlated and as a practice, we want to run PCA on these categories. 
    
1. Implement function ```task_6()```. First one-hot encode ```category``` and put the output vectors in a new column ```categoryOneHot```. Note that you need to ensure the dimension of the generated encoding vector equals the size of the domain. For example, if we have three categories in total, ```V = {'Electronics', 'Books', 'Appliances'}```, then the encoding of 'Electronics' can be ```[1, 0, 0] or [0, 1, 0] or [0, 0, 1]```, but the dimension of this vector must be 3. Hint: before one-hot encoding a StringType column, you may need to first convert that column of strings to a column of numerical indices with ```M.feature.StringIndexer```. Then use ```M.feature.OneHotEncoderEstimator``` to do the encoding.

1. Second, use ```M.feature.PCA``` on the one-hot encoded column. Reduce the dimension of each one-hot vector to ```15``` and put the transformed vectors in a new column ```categoryPCA```.

1. Columns ```categoryOneHot``` and ```categoryPCA``` will be of VectorType. You do not need to worry about whether the vectors are sparsely or densely represented.

1. You need to conduct the above operations, then extract some statistics out of the generated columns. You need to put the statistics in a python dictionary named ```res```. The description and schema are as follows:
    ```
    res
     | -- count_total: int -- count of total rows of the entire transformed table
     | -- meanVector_categoryOneHot: list -- the mean vector of all transformed one-hot encoding vectors
     |    | -- element: float -- each element of the mean vector, from first to last dimension
     | -- meanVector_categoryPCA: list -- mean vector of PCA-transformed vectors
     |    | -- element: float
    ```
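
The dimension requirement and the meaning of the mean one-hot vector can be seen in a small plain-Python sketch. The ```categories``` list and the ```one_hot``` helper are hypothetical. Indices are assigned by descending frequency, which to our understanding matches the default ordering of ```M.feature.StringIndexer```. Also note that the Spark encoder has a ```dropLast``` parameter that, when left enabled, yields vectors one dimension shorter than the domain, so it is worth checking that setting on your cluster's Spark version.

```python
from collections import Counter

# Hypothetical toy category column.
categories = ['Books', 'Electronics', 'Books', 'Appliances', 'Books', 'Electronics']

# Assign indices by descending frequency (assumed to mirror StringIndexer).
counts = Counter(categories)
index_of = {c: i for i, (c, _) in enumerate(counts.most_common())}


def one_hot(category):
    """Encode a category as a vector whose length equals the domain size."""
    vec = [0.0] * len(index_of)
    vec[index_of[category]] = 1.0
    return vec


encoded = [one_hot(c) for c in categories]

# The mean one-hot vector is the relative frequency of each category.
mean_vector = [sum(column) / len(encoded) for column in zip(*encoded)]
print(index_of, mean_vector)
```

Because each row has exactly one non-zero entry, the elements of ```mean_vector``` sum to 1, which gives a quick sanity check for ```meanVector_categoryOneHot```.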



In [None]:
# %load -s task_6 assignment2.py
def task_6(data_io, product_processed_data):
    # -----------------------------Column names--------------------------------
    # Inputs:
    category_column = 'category'
    # Outputs:
    categoryIndex_column = 'categoryIndex'
    categoryOneHot_column = 'categoryOneHot'
    categoryPCA_column = 'categoryPCA'
    # -------------------------------------------------------------------------    

    # ---------------------- Your implementation begins------------------------





    # -------------------------------------------------------------------------

    # ---------------------- Put results in res dict --------------------------
    res = {
        'count_total': None,
        'meanVector_categoryOneHot': [None, ],
        'meanVector_categoryPCA': [None, ]
    }
    # Modify res:




    # -------------------------------------------------------------------------

    # ----------------------------- Do not change -----------------------------
    data_io.save(res, 'task_6')
    return res
    # -------------------------------------------------------------------------


In [None]:
res = task_6(data_io, data_dict['product_processed'])

In [None]:
print("End to end time: {}".format(time.time()-begin))