## Receipt.ID 
### Item Category Hierarchy Classification
The task is taxonomic classification: assign each item one or more categories from a pre-defined taxonomy. It is a multi-class **and** multi-label classification problem in which the labels are nodes of a tree and are therefore hierarchically related.

#### Items
- Items come from a wide range of categories, such as Produce, Meat, Beverage, and Supplies.
- Example item to category mapping:


|item|mapping|
|---|---|
|Kale  | "Food/Produce/Kale"  |
|Vinegar white wine 50 grain  | "Food/Dry-Grocery/Vinegars/White Wine Vinegar"  |
|Imported nat flank steak  | "Food/Meats/Beef/Flank Steak"  |
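
Because the taxonomy is a tree, a mapping such as "Food/Produce/Kale" implies a label at every level of its branch. The sketch below shows one way to expand a mapping into those labels, assuming `/` is the only separator. The helper name `expand_path` is hypothetical and is not part of the notebook's tools.

```python
def expand_path(path, sep="/"):
    """Return every ancestor prefix of a category path, root first."""
    parts = [p for p in path.split(sep) if p]
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


labels = expand_path("Food/Meats/Beef/Flank Steak")
print(labels)
# ['Food', 'Food/Meats', 'Food/Meats/Beef', 'Food/Meats/Beef/Flank Steak']
```

Treating each prefix as its own label is what makes the problem multi-label as well as multi-class.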


To solve this problem, I will undertake the following course of action:
1. Explore the dataset
    - Explore the dataset to ensure its integrity and understand the context. 
2. Identify features that may be used. 
    - If possible, engineer features that might provide greater discrimination.
3. Build k independent *text-based* classifiers for the text-based features and feed the output from these classifiers into the next layer classifier which takes in the other features. Explore a couple of classifiers that might be well suited for the problem at hand.
    - Decision Trees
    - SVM
    - AdaBoost
4. Select the most appropriate classifier based on the evaluation metric and tune it.
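
Step 3 describes a two-layer (stacked) arrangement. The sketch below illustrates only the data flow, using a deliberately simple token-voting scorer in place of the Decision Tree, SVM, or AdaBoost models listed above. Every class, function, and value in it is hypothetical toy material, not the notebook's implementation.

```python
from collections import Counter, defaultdict


class TokenVoteScorer:
    """Toy first-layer text classifier: tokens vote for labels seen in training."""

    def fit(self, texts, labels):
        self.votes = defaultdict(Counter)
        for text, label in zip(texts, labels):
            for token in text.lower().split():
                self.votes[token][label] += 1
        return self

    def scores(self, text):
        total = Counter()
        for token in text.lower().split():
            total.update(self.votes.get(token, {}))
        return total


def second_layer_input(scorers, text_fields, other_features):
    """Combine each text scorer's top label with the non-text features."""
    row = {}
    for name, scorer in scorers.items():
        ranked = scorer.scores(text_fields[name]).most_common(1)
        row[name + "_top_label"] = ranked[0][0] if ranked else None
    row.update(other_features)
    return row


# Toy training data, made up for illustration only.
name_scorer = TokenVoteScorer().fit(
    ["kale green", "flank steak", "white wine vinegar"],
    ["Produce", "Meats", "Dry-Grocery"],
)
row = second_layer_input(
    {"item_name": name_scorer},
    {"item_name": "imported nat flank steak"},
    {"vendor_id": 42},  # hypothetical non-text feature value
)
print(row)
```

The next-layer classifier would then be trained on rows like `row`, so it sees the text classifiers' outputs alongside the other features.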

This notebook covers the data pre-processing step.



## Preliminaries

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
# Import libraries
from __future__ import absolute_import, division, print_function

import sys
sys.path.append('tools/')

import numpy as np
import pandas as pd
import pickle
from pivottablejs import pivot_ui
   
# Graphing Libraries
import qgrid
import matplotlib.pyplot as pyplt
import seaborn as sns
sns.set()
sns.set_style("white")
qgrid.nbinstall(overwrite=True)

## Transform Data from TSV

In [None]:
! python tools/data_conversion.py 'data/item-categorization-training2.tsv' 'data/data_training.dat'
! python tools/data_conversion.py 'data/item-categorization-validation.tsv' 'data/data_validation.dat'
! python tools/data_conversion.py 'data/item-categorization-test.tsv' 'data/data_test.dat'

In [None]:
df_ = pd.read_pickle('data/data_training.dat')

## Data Preprocessing

- Break out `category` into one column per level:
    - level_id_1
    - ...
    - level_id_7
- Break out the mapped category into one column per level:
    - mapped_level_1
    - ...
    - mapped_level_7
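
The breakout above can be pictured as padding each category path to a fixed number of level columns. The sketch below assumes seven levels and uses `None` for missing levels. `to_level_columns` is a hypothetical helper; the actual logic lives in `tools/make_categories.py`, which is not shown here and may differ.

```python
def to_level_columns(category, prefix="mapped_level_", depth=7, start=1):
    """Map a list of category names to a fixed set of level columns."""
    # Truncate to `depth` names, then pad short branches with None.
    padded = list(category[:depth]) + [None] * (depth - len(category))
    return {"{0}{1}".format(prefix, i + start): name
            for i, name in enumerate(padded)}


row = to_level_columns(["Food", "Produce", "Kale"])
print(row["mapped_level_1"], row["mapped_level_3"], row["mapped_level_4"])
# Food Kale None
```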

In [None]:
! python tools/make_categories.py 'data/data_training.dat' 'data/data_training_expanded.dat'
! python tools/make_categories.py 'data/data_validation.dat' 'data/data_validation_expanded.dat'
! python tools/make_categories.py 'data/data_test.dat' 'data/data_test_expanded.dat'

## Load Data

In [3]:
df_1 = pd.read_pickle('data/data_training_expanded.dat')
df_2 = pd.read_pickle('data/data_validation_expanded.dat')
df_3 = pd.read_pickle('data/data_test_expanded.dat')

### Append the dataframes along rows

In [4]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
data = pd.concat([df_1, df_2, df_3], ignore_index=True)
    
print("{0} records appended together".format(len(data)))    

158956 records appended together


### How many categories in the dataset?

In [5]:
def find_category(list_of_records):
    """Return the categories of one record as a plain list."""
    return list(list_of_records)

In [6]:
flatten = lambda l: [item for sublist in l for item in sublist]

In [7]:
lst = []
for category in data.category:
    lst.append(find_category(category))
    
len(set(flatten(lst)))

2607
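
The same count can be computed without the intermediate helper by chaining the per-record lists into one set. This sketch runs on toy records; it assumes, as the cells above imply, that each entry of `data.category` is a list of category names.

```python
from itertools import chain

# Toy stand-in for data.category
records = [
    ["Food", "Produce", "Kale"],
    ["Food", "Produce", "Mushrooms"],
    ["Food", "Meats", "Beef", "Flank Steak"],
]

unique_categories = set(chain.from_iterable(records))
print(len(unique_categories))  # 7
```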

### Engineered Features: *`branch_lenght`*
- The number of levels in each item's category branch

In [8]:
y = data.category.map(len)
data['branch_lenght'] = y

In [9]:
col_name = [
          u'item_id', u'item_name', u'vendor_id', 'branch_lenght',
       u'mapped_level_0', u'mapped_level_1', u'mapped_level_2',
       u'mapped_level_3', u'mapped_level_4', u'mapped_level_5', u'mapped_level_6']

### Engineered Features: *`Item_name_match`*
- Does any word in the item name match a node in its category branch?
    - dried sundried cranberry -> [Food, Produce, Fruits, Cranberries]
    - mushroom portabella -> [Food, Produce, Mushrooms]

Words in `item_name` often appear in the tree branch the item belongs to.
- Weight each occurrence by its frequency in the branch.
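
One way to read the description above is as a score that sums, over the words of the item name, how frequent each word is within the item's branch. The sketch below is an assumption about that idea, not the code in `tools/data_item_name_match.py` (which is not shown here). The function name `item_name_match_score` and the crude plural handling are hypothetical.

```python
import re
from collections import Counter


def _normalize(word):
    """Very rough plural stripping, for illustration only."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def _tokens(text):
    return [_normalize(w) for w in re.findall(r"[a-z]+", text.lower())]


def item_name_match_score(item_name, branch):
    """Sum the relative branch frequency of each item-name token."""
    branch_tokens = [t for node in branch for t in _tokens(node)]
    if not branch_tokens:
        return 0.0
    counts = Counter(branch_tokens)
    total = float(len(branch_tokens))
    return sum(counts[t] / total for t in _tokens(item_name))


score = item_name_match_score(
    "dried sundried cranberry", ["Food", "Produce", "Fruits", "Cranberries"]
)
print(score)  # 0.25: only "cranberry" matches, and it is 1 of 4 branch tokens
```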

In [10]:
df_ = data[['item_name', u'mapped_level_0',
       u'mapped_level_1', u'mapped_level_2', u'mapped_level_3',
       u'mapped_level_4', u'mapped_level_5']]

In [11]:
df_.to_pickle('data/df_data.dat')

### Run the data transformation 

In [12]:
! python tools/data_item_name_match.py 'data/df_data.dat' 'data/df_data_expanded.dat'

Reading in file: data/df_data.dat

Read in 158956 records
Generate item name match score
Write file out to disk


In [13]:
df_ = pd.read_pickle('data/df_data_expanded.dat')

In [14]:
## update the data table

data['item_name_match'] = df_['item_name_match']

### Write data out

In [15]:
data.to_pickle('data/data_expanded.dat')

## Process item labels

In [8]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

def preprocess_label_names(text_df):
    """
    Simple preprocessing pipeline which uses RegExp, sets basic token requirements, and removes stop words.
    """

    # extend stop words to capture modifiers like gluten
    stopwords_ = set(stopwords.words('english'))
    stopwords_.add('organic')
    stopwords_.add('gluten')
    stopwords_.add('free')
    stopwords_.add('pesticide')
    
    # tokenizer, stops, and stemmer
    tokenizer = RegexpTokenizer(r'\w+')
    stop_words = stopwords_
    stemmer = SnowballStemmer('english')

    # process each item name
    label_list = []
    for article in text_df:
        cleaned_tokens = []
        tokens = tokenizer.tokenize(article.lower())
        for token in tokens:
            if token not in stop_words:
                if len(token) > 0 and len(token) < 20: # drops overly long tokens
                    if not token[0].isdigit() and not token[-1].isdigit(): # drops tokens that start or end with a digit
                        stemmed_tokens = stemmer.stem(token)
                        cleaned_tokens.append(stemmed_tokens)
        # add the processed item name
        label_list.append(cleaned_tokens)

    # echo results and return
    print ('preprocessed content for %d records' % len(label_list))
    return label_list
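
To see which tokens survive the filters in `preprocess_label_names` without installing NLTK, the sketch below reproduces only the tokenizing and filtering steps with the standard library. It deliberately omits stemming and uses a tiny hand-picked stop list (an assumption; the function above uses NLTK's English list plus the four added modifiers), so its output will not match the real pipeline exactly. `filter_tokens` is a hypothetical name.

```python
import re

# Illustrative stop list only; far smaller than NLTK's English list.
STOP_WORDS = {"the", "and", "of", "organic", "gluten", "free", "pesticide"}


def filter_tokens(name):
    """Tokenize an item name and apply the same length and digit filters."""
    kept = []
    for token in re.findall(r"\w+", name.lower()):
        if token in STOP_WORDS:
            continue
        if not 0 < len(token) < 20:
            continue
        if token[0].isdigit() or token[-1].isdigit():
            continue
        kept.append(token)
    return kept


print(filter_tokens("Vinegar white wine 50 grain"))
# ['vinegar', 'white', 'wine', 'grain']
print(filter_tokens("Organic gluten free oats 2lb"))
# ['oats']
```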


In [None]:
## Only use item labels as input 

df = pd.read_pickle('data/data_expanded.dat')
text_features = df[[ u'item_name']]
text_features = text_features.fillna('')
text_features_data = text_features.apply(lambda x: ' '.join(x), axis=1)

# process item names
print ("Tokenizing, stemming and removing stop words ...")

processed_features_list = preprocess_label_names(text_features_data)
df['item_labels'] = processed_features_list
df.to_pickle('data/data_expanded.dat')

In [17]:
## update the data table

df['item_name_match'] = df_['item_name_match']

#### Write data out

In [19]:
df.to_pickle('data/data_test_expanded.dat')