# Category to feature example
---

This notebook applies a category-to-feature conversion, in which new features are created from the categories of one categorical column and filled with the values of another column. The method works on Pandas but fails on Dask when there are multiple categories.

## Importing the necessary packages

In [None]:
import dask.dataframe as dd                # Dask to handle big data in dataframes
import pandas as pd                        # Pandas to load the data initially
from dask.distributed import Client        # Dask scheduler
import os                                  # os handles directory/workspace changes
import numpy as np                         # NumPy to handle numeric and NaN operations
from tqdm import tqdm_notebook             # tqdm tracks code execution progress
from IPython.display import display        # Display multiple outputs on the same cell
import data_utils as du                    # Data science and machine learning relevant methods

In [None]:
# Debugging packages
import pixiedust                           # Debugging in Jupyter Notebook cells

In [None]:
# Set up local cluster
client = Client()
client

In [None]:
client.run(os.getcwd)

## Creating data

Encoded dataframes:

In [None]:
data_df = pd.DataFrame([[103, 0, 'cat_a', 'val_a1'], 
                        [103, 1, 'cat_a', 'val_a2'],
                        [103, 2, 'cat_b', 'val_b1'],
                        [104, 0, 'cat_c', 'val_c1'],
                        [105, 0, 'cat_a', 'val_a3'],
                        [106, 0, 'cat_c', 'val_c2'],
                        [107, 0, 'cat_b', 'val_b1'],
                        [108, 0, 'cat_b', 'val_b2'],
                        [108, 1, 'cat_d', 'val_d1'],
                        [108, 2, 'cat_a', 'val_a1'],
                        [108, 3, 'cat_a', 'val_a3']],
                       columns=['id', 'ts', 'categories', 'values'])
data_df

In [None]:
data_df.to_csv('category_to_feature_test_data_df.csv')

## Applying the method on Pandas

Remember that we want each category (from `categories`) to turn into a feature, with values extracted from the column `values`.
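For intuition, the conversion can be sketched in plain Pandas as shown here. This is only an illustration under assumptions: `category_to_feature_sketch` is a hypothetical helper, not the actual implementation in `data_utils`, and it keeps the original columns, which the real method may or may not do.

```python
import pandas as pd


def category_to_feature_sketch(df, categories_feature, values_feature):
    """Hypothetical sketch: add one column per category, holding the matching
    values and NaN elsewhere. Not the data_utils implementation."""
    # Work on a copy so the input dataframe is left untouched
    result = df.copy()
    for category in result[categories_feature].unique():
        # Keep the value where the row belongs to this category, NaN otherwise
        result[category] = result[values_feature].where(
            result[categories_feature] == category)
    return result


# Small illustrative dataframe with the same column layout as data_df
example_df = pd.DataFrame([[103, 0, 'cat_a', 'val_a1'],
                           [103, 1, 'cat_b', 'val_b1'],
                           [104, 0, 'cat_a', 'val_a2']],
                          columns=['id', 'ts', 'categories', 'values'])
sketch_df = category_to_feature_sketch(example_df, 'categories', 'values')
print(sketch_df)
```

Each row ends up with a value only in the column named after its own category; every other new column holds `NaN` for that row.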

In [None]:
converted_df = du.data_processing.category_to_feature(data_df, categories_feature='categories', values_feature='values')
converted_df

In [None]:
converted_df.to_csv('category_to_feature_test_converted_df.csv')

All is good; it worked as intended. Now let's try it on Dask.

## Applying the method on Dask

As before, we want each category (from `categories`) to turn into a feature, with values extracted from the column `values`.

In [None]:
data_ddf = dd.from_pandas(data_df, npartitions=1)
data_ddf.compute()

In [None]:
du.data_processing.category_to_feature(data_ddf, categories_feature='categories', values_feature='values').compute()

It failed! Notice how every new column ends up with the same values as the last added column, `cat_d`. We can confirm this by printing the dataframe step by step:

In [None]:
# Copy the dataframe to avoid potentially unwanted inplace changes
copied_df = data_ddf.copy()
copied_df.compute()

In [None]:
# Find the unique categories
categories = copied_df['categories'].unique()
if isinstance(copied_df, dd.DataFrame):
    categories = categories.compute()
categories

In [None]:
# Create a feature for each category
for category in categories:
    # Convert category to feature
    copied_df[category] = copied_df.apply(lambda x: x['values'] if x['categories'] == category
                                                    else np.nan, axis=1)
    print(f'Dataframe after adding feature {category}:')
    display(copied_df.compute())
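A plausible explanation, offered here as an assumption rather than a confirmed diagnosis, is Python's late binding of closures. Each lambda looks up `category` only when it is called, and Dask calls the lambdas lazily at `.compute()` time, after the loop variable has already moved on. The pure-Python sketch below reproduces that effect without Dask and shows the usual workaround of binding the current value through a default argument. The `build_filters` helper is hypothetical and exists only for this illustration.

```python
def build_filters(categories, bind_early):
    """Hypothetical helper: build one equality check per category."""
    filters = []
    for category in categories:
        if bind_early:
            # The default argument freezes the current value of `category`
            filters.append(lambda value, category=category: value == category)
        else:
            # `category` is looked up when the lambda is called, so every
            # lambda sees the last value taken by the loop variable
            filters.append(lambda value: value == category)
    return filters


categories = ['cat_a', 'cat_b', 'cat_c']

# All lambdas are called only after the loop has finished, as with lazy evaluation
late_results = [check('cat_a') for check in build_filters(categories, bind_early=False)]
early_results = [check('cat_a') for check in build_filters(categories, bind_early=True)]

print(late_results)   # [False, False, False]: every check compares against 'cat_c'
print(early_results)  # [True, False, False]: each check kept its own category
```

Applied to the loop above, the workaround would mean writing `lambda x, category=category: ...` inside `apply`. Whether that resolves the Dask behaviour seen here has not been verified in this notebook.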