# <h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import Requirements" data-toc-modified-id="Import-Requirements-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Requirements</a></span></li><li><span><a href="#Prepare Training Data" data-toc-modified-id="Prepare-Training-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Prepare Training Data</a></span><ul class="toc-item"></ul></li><li><span><a href="#Model Training" data-toc-modified-id="Model Training-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Model Training</a></span></li><li><span><a href="#Model Saving" data-toc-modified-id="Model Saving-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Model Saving</a></span><ul class="toc-item"></ul></li><li><span><a href="#Validation and Results" data-toc-modified-id="Validation and Results-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Validation and Results</a></span><ul class="toc-item"></ul></li></ul></div>

<a id='Import Requirements'></a>

# Import Requirements

In [2]:
import pandas as pd
import numpy as np
import json
import os
import re
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc, confusion_matrix
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet
import string

<a id='Prepare Training Data'></a>

# Prepare Training Data

The training input combines historical data with CICD data (production-run data whose ML predictions have been manually validated by an agent).

In [3]:
# build the stopword set and the lemmatizer once, not on every call
STOPWORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(message):
    """Lowercase, strip punctuation and digits, drop stopwords, and lemmatize."""
    # lowercase and delete punctuation (no space is left behind, so "Molly's" becomes "mollys")
    message = re.sub(r'[^\w\s]', '', message.lower())
    # replace digits and any other non-letter character with a space
    message = re.sub(r'[^a-zA-Z]', ' ', message)
    # drop stopwords and single-character tokens
    message = ' '.join(word for word in message.split() if word not in STOPWORDS and len(word) > 1)
    # lemmatize each remaining token
    message = ' '.join(LEMMATIZER.lemmatize(w) for w in nltk.word_tokenize(message) if w not in string.punctuation)

    return message
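
To make the two regex passes concrete, here is a minimal stdlib-only sketch that traces one item name from the sample data through them. It deliberately skips stopword removal and lemmatization (both need NLTK corpora), and the helper name `strip_to_letters` is hypothetical, not part of the notebook.

```python
import re

def strip_to_letters(message):
    """Apply only the two regex passes and the length filter from preprocess_text."""
    # pass 1: lowercase, then delete punctuation outright (no space is left behind)
    message = re.sub(r'[^\w\s]', '', message.lower())
    # pass 2: replace digits and any other non-letter character with a space
    message = re.sub(r'[^a-zA-Z]', ' ', message)
    # drop single-character leftovers such as the "l" from "1.0L"
    return [word for word in message.split() if len(word) > 1]

tokens = strip_to_letters("Molly's Irish Cream, 1.0L liqueur (40.0% ABV)")
print(tokens)
```

Because punctuation is deleted before digits are replaced, the apostrophe in "Molly's" fuses into "mollys", while "1.0L" collapses to a lone "l" that the length filter discards.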

In [53]:
#read input from cicd data into dataframe
data_cicd=pd.read_csv('../data/TaxML-CICD - Prod_Data.csv', usecols = ['Item','Description','establishment_type','Agent Corrected CAT Name', 'Agent Corrected Integer'])

#remove duplicate rows
data_cicd.drop_duplicates(inplace=True)

#remove empty rows from dataframe
data_cicd.dropna(how='all',inplace=True)

#remove rows having empty 'Agent Corrected CAT Name' or 'Agent Corrected Integer'
data_cicd.dropna(subset=['Agent Corrected CAT Name', 'Agent Corrected Integer'],inplace=True)


In [54]:
data_cicd.head()

Unnamed: 0,Item,Description,establishment_type,Agent Corrected CAT Name,Agent Corrected Integer
0,Philly Cheese Steak,,GROCERY,"CAT_PREPARED_FOOD,TEMP_HEATED",1011
1,Mango Smoothie,,GROCERY,"CAT_PREPARED_DRINK,TEMP_HEATED",1141
2,Banana Strawberry Smoothie,,GROCERY,"CAT_PREPARED_DRINK,TEMP_HEATED",1141
3,"Fruits Salad Mango, Banana, Strwberry, and Ora...",,GROCERY,"CAT_PREPARED_DRINK,TEMP_HEATED",1141
4,Tea,"Chamomile green tea, lipton, raspberry, or gin...",GROCERY,"CAT_TEA,CONTAINER_BOTTLED,TEMP_HEATED",115111


In [55]:
# combine the columns Item, Description and establishment_type into one column 'combined_text'
data_cicd['combined_text'] = data_cicd[['Item','Description','establishment_type']].apply(lambda x: ' '.join(x[x.notnull()]), axis = 1)

# apply data preprocessing steps on the prepared column
data_cicd['processed_text']= data_cicd['combined_text'].map(lambda s:preprocess_text(s)) 

data_cicd = data_cicd.reset_index(drop=True)
# prepare the target column by combining 'Agent Corrected CAT Name' and 'Agent Corrected Integer'
data_cicd['target']= data_cicd['Agent Corrected CAT Name'] + ":" + data_cicd['Agent Corrected Integer']

#remove rows having empty target column
data_cicd.dropna(subset=['target'],inplace=True)

# drop rows whose label is a spreadsheet '#REF!' error value rather than a real category
data_cicd = data_cicd[data_cicd['target']!= '#REF!:#REF!']

X_cicd= data_cicd[['Item','Description','establishment_type','processed_text']]
y_cicd= data_cicd['target']

# split the cicd data into train and test 
X_train_cicd, X_test_cicd, y_train_cicd, y_test_cicd = train_test_split(X_cicd, y_cicd, test_size = .10, random_state = 42)
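
As a small illustration of how the text and target columns are assembled, the sketch below runs the same two expressions on a toy frame. The rows are invented for illustration and only mimic the shape of the CICD data.

```python
import numpy as np
import pandas as pd

# toy rows (invented) with the same column names as the CICD export
toy = pd.DataFrame({
    'Item': ['Tea', 'Mango Smoothie'],
    'Description': ['Chamomile green tea', np.nan],
    'establishment_type': ['GROCERY', 'GROCERY'],
    'Agent Corrected CAT Name': ['CAT_TEA', 'CAT_PREPARED_DRINK,TEMP_HEATED'],
    'Agent Corrected Integer': ['115', '114,1'],
})

text_cols = ['Item', 'Description', 'establishment_type']
# join only the non-null fields of each row, so NaN never reaches str.join
toy['combined_text'] = toy[text_cols].apply(lambda row: ' '.join(row[row.notnull()]), axis=1)
# label = category name and integer code joined by ":"
toy['target'] = toy['Agent Corrected CAT Name'] + ':' + toy['Agent Corrected Integer']

print(toy[['combined_text', 'target']])
```

The `notnull()` filter matters here because the description is missing for many items; without it, `' '.join` would raise a `TypeError` on the float `NaN`.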



In [60]:
#read input from historical data into dataframe
data_df = pd.read_csv('../data/historical_data_14_01_22.csv', encoding='utf8',engine='python',usecols=['Item','Description','establishment_type','target'])
#shuffle the full dataset (frac=1 keeps every row; only the order changes)
data_df = data_df.sample(frac=1, random_state=42)

#fill blanks with ''
data_df = data_df.fillna('')
# combine the columns Item, Description and establishment_type into one column 'combined_text'
data_df['combined_text'] = data_df[['Item','Description','establishment_type']].apply(lambda x: ' '.join(x[x.notnull()]), axis = 1)
# apply data preprocessing steps on the prepared column
data_df['processed_text'] = data_df['combined_text'].map(lambda s:preprocess_text(s)) 
data_df = data_df.reset_index(drop=True)

X = data_df[['Item','Description','establishment_type','processed_text']]
y = data_df['target']

# split the historical data into train and test 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .20, random_state = 42)


In [67]:
# y_train_final.unique().tolist()

The CICD data is appended to the historical data to create the final train and test sets.
The training set contains 80% of the historical data and 90% of the CICD data.
The test set contains the remaining 20% of the historical data and 10% of the CICD data.
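
The proportions above can be checked on toy data. This is a minimal sketch, assuming two invented frames of 100 and 50 rows; it stacks the per-source splits with `pd.concat`, which also works on pandas 2.x where `DataFrame.append` is no longer available.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-ins for the two sources (sizes are made up for illustration)
historical = pd.DataFrame({'processed_text': [f'hist item {i}' for i in range(100)]})
cicd = pd.DataFrame({'processed_text': [f'cicd item {i}' for i in range(50)]})

# same split ratios as the notebook: 80/20 for historical, 90/10 for CICD
hist_train, hist_test = train_test_split(historical, test_size=0.20, random_state=42)
cicd_train, cicd_test = train_test_split(cicd, test_size=0.10, random_state=42)

# pd.concat stacks the per-source splits; the original row indices are preserved
train_final = pd.concat([hist_train, cicd_train])
test_final = pd.concat([hist_test, cicd_test])

print(len(train_final), len(test_final))
```

Since both sources keep their own row indices, the combined frames can contain duplicate index values; pass `ignore_index=True` to `pd.concat` if a unique index is needed.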

In [68]:
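# DataFrame.append / Series.append were removed in pandas 2.0; pd.concat is the equivalent
X_train_final = pd.concat([X_train, X_train_cicd])
X_test_final = pd.concat([X_test, X_test_cicd])
y_train_final = pd.concat([y_train, y_train_cicd])
y_test_final = pd.concat([y_test, y_test_cicd])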

In [69]:
X_train_final.head()

Unnamed: 0,Item,Description,establishment_type,processed_text
91785,Beef Bologna (1 lb),,SPECIALITY_STORE,beef bologna lb speciality store
7351,"Molly's Irish Cream, 1.0L liqueur (40.0% ABV)",,GROCERY,molly irish cream liqueur abv grocery
99725,Reuben Sandwich,"Corned beef, melted swiss, sauerkraut, and Rus...",GROCERY,reuben sandwich corned beef melted swiss sauer...
31499,Coors Light | 12 Pack 12 oz Cans,\N,GROCERY,coors light pack oz can grocery
153803,Golden Road Point The Way,Its light malt body is the perfect canvas for ...,GROCERY,golden road point way light malt body perfect ...


<a id='Model Training'></a>

In [70]:
print('Training data size: {}'.format(len(X_train_final)))
print('Test data size: {}'.format(len(X_test_final)))

Training data size: 167201
Test data size: 36471


In [71]:
# counts labels in the historical training split (y_train) only, not in y_train_final
print('Number of unique labels : {}'.format(len(y_train.unique().tolist())))

Number of unique labels : 294


# Model Training

The model pipeline consists of three stages: (1) CountVectorizer, (2) TfidfTransformer, and (3) RandomForestClassifier.
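
To show what the first two stages produce, the sketch below runs them on a tiny invented corpus and compares the result with scikit-learn's `TfidfVectorizer`, which is documented as equivalent to `CountVectorizer` followed by `TfidfTransformer`. It uses default settings, not the notebook's `token_pattern` or `max_df`, so treat it as an illustration of the mechanism only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import Pipeline

# toy corpus in the style of processed_text (invented for illustration)
docs = [
    'beef bologna lb speciality store',
    'coors light pack oz can grocery',
    'reuben sandwich corned beef grocery',
]

# stage 1 counts tokens per document, stage 2 reweights the counts by TF-IDF
two_step = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer())])
one_step = TfidfVectorizer()

X_two = two_step.fit_transform(docs)
X_one = one_step.fit_transform(docs)

# both routes should give the same document-term matrix
same = np.allclose(X_two.toarray(), X_one.toarray())
print(X_two.shape, same)
```

Keeping the two stages separate, as the notebook does, leaves the raw counts available for inspection; the combined vectorizer is simply the shorter spelling.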

In [72]:
# create a result dataframe to store final results
# copy so that adding result columns does not modify X_test_final itself
result = X_test_final.copy()

#create the model pipeline
rf = Pipeline([('vect', CountVectorizer(strip_accents='ascii', token_pattern=u'(?ui)\\b\\w*[a-z]+\\w*\\b', stop_words='english', max_df=0.85)),
       ('tfidf', TfidfTransformer()),
       ('clf', RandomForestClassifier(oob_score=True, n_jobs=-1, random_state=42))])

# perform model training
rf.fit(X_train_final['processed_text'].values, y_train_final.values)

# model prediction
y_pred = rf.predict(X_test_final['processed_text'].values)

result['original_cat']= y_test_final
result['predicted_cat'] = y_pred

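# per-row maximum class probability; without axis=1, .max() returns a single value for the whole matrix
result['prediction_cat_confscore'] = rf.predict_proba(X_test_final['processed_text'].values).max(axis=1)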

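# summary metrics; scikit-learn metric functions take (y_true, y_pred) in that order
output = {'accuracy': accuracy_score(y_test_final, y_pred),
          'precision_score': precision_score(y_test_final, y_pred, average='macro'),
          'recall_score': recall_score(y_test_final, y_pred, average='macro'),
          'f1_score': f1_score(y_test_final, y_pred, average='macro')}

# note: despite its name, this column stores the metrics dict above as a string, not a confusion matrix
result['confusion_matrix'] = str(output)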
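
The confidence score is meant to be the highest class probability for each item. A minimal sketch of the difference between a global `.max()` and a per-row `.max(axis=1)` on a `predict_proba`-style matrix (the probabilities are hypothetical):

```python
import numpy as np

# hypothetical class probabilities for 3 items over 3 classes (each row sums to 1)
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [1.0, 0.0, 0.0],
])

global_max = proba.max()         # one scalar for the whole matrix
per_row_max = proba.max(axis=1)  # one confidence value per item

print(global_max)
print(per_row_max)
```

Assigning the scalar to a DataFrame column broadcasts it to every row, so a single fully confident prediction anywhere in the test set makes every row show a confidence of 1.0.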




<a id='Model Saving'></a>

# Model Saving

In [73]:
import pickle
# save the model to disk; the context manager closes the file handle once the dump completes
filename_primary = 'finalized_model.sav'
with open(filename_primary, 'wb') as model_file:
    pickle.dump(rf, model_file)
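
Loading the model back is the mirror image of saving it. The sketch below round-trips a small stand-in pipeline through a temporary file and checks that predictions are unchanged; the toy texts, labels, and file name are invented, and `toy_rf` only stands in for the trained `rf`.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# toy pipeline and data (invented) standing in for the trained `rf`
texts = ['red wine bottle', 'white wine bottle', 'beef jerky snack', 'potato chip snack']
labels = ['CAT_WINE', 'CAT_WINE', 'CAT_SNACK', 'CAT_SNACK']
toy_rf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=42)),
])
toy_rf.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.sav')
    # write the fitted pipeline, then read it back from disk
    with open(path, 'wb') as f:
        pickle.dump(toy_rf, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

before = toy_rf.predict(texts)
after = restored.predict(texts)
print((before == after).all())
```

Pickled scikit-learn models should be loaded with the same library version that created them, and only from files you trust, since unpickling can execute arbitrary code.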

<a id='Validation and Results'></a>

# Validation and Results

In [74]:
#accuracy score of the model
accuracy = rf.score(X_test_final['processed_text'].values, y_test_final)
print("Accuracy = {}".format(accuracy))

Accuracy = 0.7852540374544158


In [75]:
from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt

In [76]:
#classification report 
classification_report = metrics.classification_report(y_test_final, y_pred, output_dict=True)



In [78]:
display(pd.DataFrame(classification_report).transpose())

Unnamed: 0,precision,recall,f1-score,support
"CAT_ALCOHOL,TEMP_COLD:109,1",0.592593,0.543689,0.567089,206.000000
"CAT_ALCOHOL,TEMP_HEATED:109,1",0.664890,0.619149,0.641205,1410.000000
"CAT_ALCOHOL,TEMP_UNHEATED:109,1",0.214286,0.082569,0.119205,218.000000
CAT_ALCOHOL:109,0.000000,0.000000,0.000000,0.000000
CAT_ANTI_FREEZE:774,0.000000,0.000000,0.000000,2.000000
...,...,...,...,...
CAT_WINE:534,0.889805,0.927585,0.908303,3839.000000
"TEMP_HEATED,CAT_PREPARED_FOOD:1,101",0.956522,0.956522,0.956522,23.000000
accuracy,0.785227,0.785227,0.785227,0.785227
macro avg,0.516489,0.423734,0.448933,36471.000000
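
The gap between accuracy and the macro average in the report comes from how the macro average is computed: it is the unweighted mean of the per-class scores, so rare or zero-support classes count as much as large ones. A minimal sketch on invented labels, where class `c` is predicted once but never occurs in the true labels (similar to the zero-support rows above):

```python
import numpy as np
from sklearn.metrics import classification_report

# toy labels (invented): class "c" is predicted once but never appears in y_true
y_true = ['a', 'a', 'a', 'b', 'b', 'a']
y_pred = ['a', 'a', 'b', 'b', 'b', 'c']

# zero_division=0 scores undefined precision/recall as 0 instead of warning
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

# macro average = plain mean of the per-class scores, regardless of support
per_class_f1 = [report[label]['f1-score'] for label in ('a', 'b', 'c')]
macro_f1 = float(np.mean(per_class_f1))

print(macro_f1, report['macro avg']['f1-score'], report['accuracy'])
```

The `zero_division` parameter is available in scikit-learn 0.22 and later. The report's `weighted avg` row weights each class by its support and is the better summary when the many small categories matter less than overall volume.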


In [79]:
# check the misclassifications
misclassifications= result.loc[result['original_cat']!=result['predicted_cat']]

In [80]:
misclassifications

Unnamed: 0,Item,Description,establishment_type,processed_text,original_cat,predicted_cat,prediction_cat_confscore,confusion_matrix
82041,Sprite (2 lt),,CONVENIENCE,sprite lt convenience,"CAT_SOFT_DRINK,CONTAINER_BOTTLED,TEMP_COLD:112...","CAT_SOFT_DRINK,CONTAINER_BOTTLED,TEMP_HEATED:1...",1.0,"{'accuracy': 0.785226618409147, 'precision_sco..."
130470,Kettle Chips (2 oz),,GROCERY,kettle chip oz grocery,"CAT_PREPACKAGED_FOOD,CAT_SNACK,TEMP_UNHEATED:1...",CAT_PREPACKAGED_FOOD_SNACK_CHIPS:746,1.0,"{'accuracy': 0.785226618409147, 'precision_sco..."
88982,Bud Light 24pk 12oz Btl 4.2% ABV,\N,GROCERY,bud light pk oz btl abv grocery,"CAT_BEER,TEMP_COLD:533,1",CAT_BEER:533,1.0,"{'accuracy': 0.785226618409147, 'precision_sco..."
138088,Jack Links 10000008418 Origin Beef Jerky,1.25 Oz,GROCERY,jack link origin beef jerky oz grocery,"CAT_PREPARED_FOOD,TEMP_HEATED:101,1","CAT_PREPACKAGED_FOOD,CAT_SNACK,TEMP_HEATED:106...",1.0,"{'accuracy': 0.785226618409147, 'precision_sco..."
19877,MXD Cocktail Co. Long Island Ice Tea 16oz Can,,GROCERY,mxd cocktail co long island ice tea oz grocery,"CAT_ALCOHOL,TEMP_HEATED:109,1","CAT_TEA,CONTAINER_BOTTLED,TEMP_HEATED:115,11,1",1.0,"{'accuracy': 0.785226618409147, 'precision_sco..."
...,...,...,...,...,...,...,...,...
17124,Finesse Gel Spray (Extra Control Mousse) (7 oz),"Avalon, Aussie, desert essence, dove, finesse,...",PHARMACY,finesse gel spray extra control mousse oz aval...,CAT_TPP_SKIN_CARE_PRODUCTS:818,CAT_TPP:531,1.0,"{'accuracy': 0.785226618409147, 'precision_sco..."
17166,Yardley (Oatmeal & Almond) (2 pk) (8.5 oz),"A la Maison, camay, dead sea, dermis, dial, do...",PHARMACY,yardley oatmeal almond pk oz la maison camay d...,CAT_MEDICATED_ITEMS:525,CAT_TPP:531,1.0,"{'accuracy': 0.785226618409147, 'precision_sco..."
13416,Talenti Salted Caramel Truffle Layers,Our Salted Caramel Truffle is an ode to our be...,SPECIALITY_STORE,talenti salted caramel truffle layer salted ca...,"CAT_PREPACKAGED_FOOD,CAT_ICECREAM,TEMP_COLD:10...","CAT_PREPACKAGED_FOOD,CAT_ICECREAM,TEMP_HEATED:...",1.0,"{'accuracy': 0.785226618409147, 'precision_sco..."
9527,La Costena Green Pickled Whole Jalapeno Pepper...,\N,LIQUOR,la costena green pickled whole jalapeno pepper...,"CAT_PREPACKAGED_FOOD,TEMP_COLD:106,1",CAT_LIQUOR:535,1.0,"{'accuracy': 0.785226618409147, 'precision_sco..."
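
Scrolling through individual rows gets unwieldy at this size, so one option is to count how often each (true, predicted) pair occurs. The following is a sketch on a toy frame with the same `original_cat` and `predicted_cat` columns; the rows are invented, and `confused_pairs` is a hypothetical name.

```python
import pandas as pd

# toy stand-in for the `misclassifications` frame (rows invented for illustration)
toy_errors = pd.DataFrame({
    'original_cat': ['CAT_BEER,TEMP_COLD:533,1', 'CAT_BEER,TEMP_COLD:533,1', 'CAT_MEDICATED_ITEMS:525'],
    'predicted_cat': ['CAT_BEER:533', 'CAT_BEER:533', 'CAT_TPP:531'],
})

# count each (true, predicted) pair and list the most frequent confusions first
confused_pairs = (
    toy_errors.groupby(['original_cat', 'predicted_cat'])
    .size()
    .sort_values(ascending=False)
    .reset_index(name='count')
)
print(confused_pairs)
```

Applied to the real frame, the top rows would show which label pairs the model confuses most often, for example labels that differ only in their temperature tag.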
