## Description:

Our product does not cover routine, wellness or preventive care.  We believe that costs that pet owners can expect periodically and budget for should be separate from an insurance policy meant to cover accidents and illnesses.

Use the data contained in p2_data.csv to build a binary classifier to predict the “PreventiveFlag” label using the text features provided.  This model can be used to automate the detection of ineligible line items.  The expected output are prediction probabilities for rows 10001 through 11000, where the labels are currently null.

## Data Exploration

### Original Data

To get an idea what data is being dealt with, loading the data and use data visualization tools to draw some insights from it. Below is the Power BI visualization of the original data

![title](Images/PetPic01.jpg)

#### Insights:
1. There're total of <font color='red'>10,000</font> records
2. Only <font color='red'>6.78%</font> of the records are categorized as "Preventive"
3. From the left-bottom word cloud we could see the most frequent diagnosis
4. It seems ItemDescription has the format <font color='red'>'Name: Description'</font>
5. Diagnosis doesn't say much about the category.

#### Comments:
* It would be really interesting to see what are the most common words in Diagnosis/ItemDescription for preventive diagnosis

### Count word frequencies for [PreventiveFlag = 1] records

Now load the original data into pandas dataframe, and filter by PreventiveFlag

In [1]:
import pandas as pd

ntest = 10000
df_p2 = pd.read_csv('./Data/p2_data.csv', encoding='latin1', nrows=ntest)
df_p2_pflag = df_p2.loc[df_p2['PreventiveFlag'] == 1]

pd.set_option('display.max_rows', 7)
df_p2_pflag

Unnamed: 0,id,ItemDescription,Diagnosis,PreventiveFlag
13,14,Shayna: Leptospirosis Vaccine,Transitional Cell Carcinoma Work up for transi...,1
47,48,RAPHAEL: Canine Rabies Booster 3 year,Lyme Postive test,1
52,53,Reese: Bordetella Oral - Annual,Exam & Anal Sac Expression,1
...,...,...,...,...
9938,9939,"Tiya: Bordetella, Booster (Injectable)","Deciduous Teeth Retained, Pre-Ops for Extraction",1
9939,9940,Tiya: PREPAID Fecal via Centrifugation,"Deciduous Teeth Retained, Pre-Ops for Extraction",1
9941,9942,Tiya: Parvovirus VACCINE Level (DA2PPV),"Deciduous Teeth Retained, Pre-Ops for Extraction",1


In [2]:
df_p2_npflag = df_p2.loc[df_p2['PreventiveFlag'] == 0]
df_p2_npflag

Unnamed: 0,id,ItemDescription,Diagnosis,PreventiveFlag
0,1,Six: Urgent Care Exam - Daytime (8am-6pm),colitis,0
1,2,Jafar: Office Visit/Physical Exam,Stomach Issues,0
2,3,Jafar: Fecal Smears,Stomach Issues,0
...,...,...,...,...
9997,9998,Stella: Doxycycline 100mg/ml,Heart Disease,0
9998,9999,MILLIE: Adequan Injection per cc,Complications of TPLO (Millie only),0
9999,10000,MILLIE: Biohazard Waste Disposal Fee,Complications of TPLO (Millie only),0


Count word frequencies for column ItemDescription, we analyse this column because the other column, 'Diagnosis', doesn't seem to have high relation to the category, but we'll see in later sections when we train our classification model

In [3]:
import string
from collections import defaultdict

def GetItemDescW2Freq(df):
    dict_w2c = defaultdict(lambda:0)
    ct = 0
    for index, row in df['ItemDescription'].iteritems():
        words = []
        try:
            index_desc = row.index(':') + 1
            words = [word.strip(string.punctuation) for word in row[index_desc:].split()]
        except:
            words = [word.strip(string.punctuation) for word in row.split()]
        
        for word in words:
            dict_w2c[word] += 1.0
            ct += 1
    for word in dict_w2c:
        dict_w2c[word] *= 100
        dict_w2c[word] /= ct
    sf = pd.Series(dict_w2c)
    dict_w2c = pd.DataFrame({'word':sf.index, 'count':sf.values})
    return dict_w2c

# Get word -> count dict from [PreventiveFlag == 1] records
df_w2c_id_p = GetItemDescW2Freq(df_p2_pflag)
df_w2c_id_p['is_preventive'] = True

# Get word -> count dict from [PreventiveFlag == 0] records
df_w2c_id_np = GetItemDescW2Freq(df_p2_npflag)
df_w2c_id_np['is_preventive'] = False

In [4]:
df_w2c_id_merged = pd.concat([df_w2c_id_p, df_w2c_id_np])
df_w2c_id_merged

Unnamed: 0,word,count,is_preventive
0,Leptospirosis,0.690369,True
1,Vaccine,3.175699,True
2,Canine,2.312737,True
...,...,...,...
5777,illness,0.002829,False
5778,case,0.002829,False
5779,pk,0.002829,False


In [5]:
df_w2c_id_merged.to_csv('./Data/WordFreqItemDesc.csv')

In [6]:
df_w2c_id_merged_col = df_w2c_id_p.merge(df_w2c_id_np, how='outer', left_on='word', right_on='word')
df_w2c_id_merged_col = df_w2c_id_merged_col[['word', 'count_x', 'count_y']]
df_w2c_id_merged_col['count_x'].fillna(0, inplace=True)
df_w2c_id_merged_col['count_y'].fillna(0, inplace=True)
df_w2c_id_merged_col.columns = ['word', 'count_p', 'count_np']
df_w2c_id_merged_col['diff_p_np'] = df_w2c_id_merged_col['count_p'] - df_w2c_id_merged_col['count_np']
df_w2c_id_merged_col

Unnamed: 0,word,count_p,count_np,diff_p_np
0,Leptospirosis,0.690369,0.002829,0.687540
1,Vaccine,3.175699,0.048096,3.127603
2,Canine,2.312737,0.987382,1.325355
...,...,...,...,...
6173,Fi,0.000000,0.002829,-0.002829
6174,illness,0.000000,0.002829,-0.002829
6175,case,0.000000,0.002829,-0.002829


In [7]:
df_w2c_id_merged_col.to_csv('./Data/WordFreqItemDescDiff.csv', index=False)

Now we're able to see what are the most freqently occurred words for both preventive and non-preventive scenarios

![Word Frequency ItemDescription](Images/PetPic02.jpg)

The figure above shows the frequency of words in ItemDescription for both preventive and non-preventive case, it can be seen some of the words are strong indicators for certain category.

The numbers in both left and right images are percentage, you might see the numbers are very small, but consider the amount of words, even two words with difference of 0.5 would mean a big difference in occurence.

#### Insights:

* Words like Vaccine, Bordetella, Rabies, Canine, Fecal, Annual, Heartgard, Interceptor have much higher occurence frequency in prentive catogory
* Words like Exam, mg, Consultation, Examination, Recheck, etc. have much higher occurence frequency in non-prentive category
* For reasons unclear, some words, like 3, 1, 6 has unusual bound to a certain category.

#### Conclusion:

Through the data exploration process, now we have a better understanding of our data, the next step is to figure out how to generate the model for our classification task, we could try traditional machine learning feature engineering way, then the exploration results can be used as a reference for that process, or we could also use deep learning techniques.

## Data Preparation

Next a decision has to be made in which way to go, I'm lean more towards to the deep learning way, for text classification task, TextCNN seems a good way to go. Then the real next step is to prepare our data, splitting our current dataset into three parts: train set, validation set and test set.

In [1]:
import torch
import pandas as pd
import numpy as np

pd.set_option('display.max_rows', 7)

ntest = 10000
df_p2_all = pd.read_csv('./Data/p2_data.csv', encoding='latin1')
df_p2_all_p = df_p2_all[df_p2_all['PreventiveFlag'] == 1]
df_p2_all_np = df_p2_all[df_p2_all['PreventiveFlag'] == 0]

df_p2_val_p = df_p2_all_p.sample(frac=0.05)
df_p2_trn_p = df_p2_all_p.append(df_p2_val_p).drop_duplicates(keep=False)
print(len(df_p2_val_p), len(df_p2_trn_p))

df_p2_val_np = df_p2_all_np.sample(frac=0.05)
df_p2_trn_np = df_p2_all_np.append(df_p2_val_np).drop_duplicates(keep=False)
print(len(df_p2_val_np), len(df_p2_trn_np))

df_p2_all_tst = df_p2_all[np.isnan(df_p2_all['PreventiveFlag'])]

def ConstructDatasetCore(df):
    df['CombinedText'] = df['ItemDescription'] + ' | ' + df['Diagnosis']
    del df['ItemDescription']
    del df['Diagnosis']
    try:
        df['PreventiveFlag'] = df['PreventiveFlag'].astype(int)
        del df['id']
    except:
        pass
    return df

def ConstructTrnValDataset(df_p, df_np, path):
    df = pd.concat([df_p, df_np])
    df = ConstructDatasetCore(df)
    df.to_csv(path, index=False)
    return df

def ConstructTstDataset(df_p, path):
    df = df_p.copy()
    df = ConstructDatasetCore(df)
    df.to_csv(path, index=False)
    return df

ConstructTstDataset(df_p2_all_tst, './Data/p2_tst.csv')
ConstructTrnValDataset(df_p2_val_p, df_p2_val_np, './Data/p2_val.csv')
ConstructTrnValDataset(df_p2_trn_p, df_p2_trn_np, './Data/p2_trn.csv')

34 644
466 8856


Unnamed: 0,PreventiveFlag,CombinedText
13,1,Shayna: Leptospirosis Vaccine | Transitional ...
47,1,RAPHAEL: Canine Rabies Booster 3 year | Lyme ...
52,1,Reese: Bordetella Oral - Annual | Exam & Anal...
...,...,...
9997,0,Stella: Doxycycline 100mg/ml | Heart Disease
9998,0,MILLIE: Adequan Injection per cc | Complicati...
9999,0,MILLIE: Biohazard Waste Disposal Fee | Compli...


In [2]:
pd.read_csv('./Data/p2_val.csv')

Unnamed: 0,PreventiveFlag,CombinedText
0,1,Cocoa: Sentinel Yellow (26 - 50 lbs) | Thyro-...
1,1,Coco Chanel: Heartgard Plus K9 S 1-25lb/1-11k...
2,1,"Teddy: Sentinel Yellow 11.5/230 mg, 12-22kg |..."
...,...,...
497,0,Seco: Cortisol (2) ACTH Stimulation T440 ACTH...
498,0,Linus: Urine Test Strip | Diabetes
499,0,Angie: Consultation per 10 min | Recheck of CHF


In [3]:
pd.read_csv('./Data/p2_tst.csv')

Unnamed: 0,id,PreventiveFlag,CombinedText
0,10001,,MILLIE: Polygylcan SA IM Arthritis Inj per ML...
1,10002,,Ebbet: Zignature Kangaroo Formula 13oz. | vom...
2,10003,,Ginger: Office Visit | ear infection
...,...,...,...
997,10998,,Theo: Thyroid Free T4 (ED) Add on - ADD50 | e...
998,10999,,MYSTIQUE: Exam/Medical Progress FollowUp | De...
999,11000,,Cricket: RAD FollowUp Radiograph | Vomiting


## Training

### Prepare train/validation/test dataset

Load dataset

In [4]:
import torch
from torchtext.data import Field
from torchtext.data import TabularDataset
import spacy
spacy_en = spacy.load('en')

# Use spacy as the tokenizer
tokenize = lambda x: [tok.text for tok in spacy_en.tokenizer(x)]

# Convert to lowercase & Tokenize
TEXT = Field(sequential=True, tokenize=tokenize, lower=True)

# Labels are already processed
LABEL = Field(sequential=False, use_vocab=False)

tv_datafields = [("PreventiveFlag", LABEL), ("CombinedText", TEXT)]

trn, vld = TabularDataset.splits(
        path="./Data",
        train='p2_trn.csv', validation="p2_val.csv",
        format='csv',
        skip_header=True,
        fields=tv_datafields)

tst_datafields = [("PreventiveFlag", None), ("CombinedText", TEXT)]

tst = TabularDataset(
        path="Data/p2_tst.csv",
        format='csv',
        skip_header=True,
        fields=tst_datafields)

In [5]:
print(trn[15].CombinedText)
print(len(trn))

['moss', ':', ' ', 'interceptor', 'plus', '11.4', '-', '22.7', 'kg', '(', 'yellow', ')', '|', 'diarrhea']
9500


Load in pretrained word embeddings

In [6]:
TEXT.build_vocab(trn, vectors="glove.6B.100d")

In [7]:
TEXT.vocab.freqs.most_common(10)

[(' ', 11488),
 (':', 9779),
 ('|', 9507),
 ('-', 2614),
 ('/', 2599),
 (',', 1771),
 (')', 1660),
 ('(', 1655),
 ('mg', 1424),
 ('exam', 939)]

In [8]:
print(TEXT.vocab.stoi["<unk>"])

0


### Construct Iterator

In [9]:
from torchtext.data import Iterator, BucketIterator

trn_iter, vld_iter = BucketIterator.splits(
    (trn, vld), # we pass in the datasets we want the iterator to draw data from
    batch_sizes=(64, 64),
    device=torch.device('cuda:0'), # Use GPU
    sort_key=lambda x: len(x.CombinedText),
    sort_within_batch=False,
    repeat=False # we pass repeat=False because we want to wrap this Iterator layer.
)

tst_iter = Iterator(
    tst, 
    batch_size=64, 
    device=torch.device('cuda:0'), 
    sort=False, 
    sort_within_batch=False, 
    repeat=False)

In [10]:
b = next(iter(trn_iter))
print(b)


[torchtext.data.batch.Batch of size 64]
	[.PreventiveFlag]:[torch.cuda.LongTensor of size 64 (GPU 0)]
	[.CombinedText]:[torch.cuda.LongTensor of size 34x64 (GPU 0)]


In [11]:
'''
b.CombinedText[10]
'''

'\nb.CombinedText[10]\n'

In [12]:
'''
b.CombinedText[0].size()
'''

'\nb.CombinedText[0].size()\n'

Wrapping the iterator

In [13]:
class BatchWrapper:
    def __init__(self, dl, xvar, yvar):
        # we pass in the list of attributes for x and y
        self.dl, self.xvar, self.yvar = dl, xvar, yvar
    
    def __iter__(self):
        for batch in self.dl:
            # we assume only one input in this wrapper
            x = getattr(batch, self.xvar)
            
            if self.yvar is not None:
                y = torch.cat([getattr(batch, self.yvar).unsqueeze(1)], dim=1).float()
            else:
                y = torch.zeros((1))

            yield (x, y)
    
    def __len__(self):
        return len(self.dl)
    

trn_dl = BatchWrapper(trn_iter, "CombinedText", "PreventiveFlag")
vld_dl = BatchWrapper(vld_iter, "CombinedText", "PreventiveFlag")
tst_dl = BatchWrapper(tst_iter, "CombinedText", None)

In [14]:
'''
next(trn_dl.__iter__())
'''

'\nnext(trn_dl.__iter__())\n'

### Prepare the Text CNN model

In [15]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

class CNN_Text(nn.Module):
    
    def __init__(self, args):
        super(CNN_Text, self).__init__()
        self.args = args
        
        V = args.embed_num
        D = args.embed_dim
        C = args.class_num
        Ci = 1
        Co = args.kernel_num
        Ks = args.kernel_sizes

        self.embed = nn.Embedding(V, D)
        self.convs1 = nn.ModuleList([nn.Conv2d(Ci, Co, (K, D)) for K in Ks])
        self.dropout = nn.Dropout(args.dropout)
        self.fc1 = nn.Linear(len(Ks)*Co, C)

    def forward(self, x):
        x = self.embed(x)  # (N, W, D)
        # print(x.size())
        x = x.unsqueeze(1)  # (N, Ci, W, D)
        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1]  # [(N, Co, W), ...]*len(Ks)
        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]  # [(N, Co), ...]*len(Ks)
        x = torch.cat(x, 1)
        x = self.dropout(x)  # (N, len(Ks)*Co)
        logit = self.fc1(x)  # (N, C)
        return logit

Initialize the model

In [16]:
import os
from datetime import datetime

class Object(object):
    pass

args = Object()
args.embed_num = len(TEXT.vocab)
args.embed_dim = 128
args.dropout = 0.5
args.class_num = 2
args.kernel_num = 100
args.kernel_sizes = [3, 4, 5]
args.save_dir = os.path.join('Chkpt', datetime.now().strftime('%Y-%m-%d_%H-%M-%S'))

print(args.embed_num)

7071


In [17]:
cnn = CNN_Text(args)

### Now we train the model

In [18]:
import tqdm

args.lr = 1e-5
args.epochs = 1000

def save(model, save_dir, save_prefix, steps):
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    save_prefix = os.path.join(save_dir, save_prefix)
    save_path = '{}_steps_{}.pt'.format(save_prefix, steps)
    torch.save(model.state_dict(), save_path)

def train(trn_iter, vld_iter, model, args):
    model.cuda()
    optim = torch.optim.Adam(model.parameters(), lr=args.lr)
    best_acc = 0
    last_step = 0
    running_loss = 0.0
    model.train()
    step = 0
    for epoch in range(1, args.epochs + 1):
        step += 1
        crct_trn, acur_trn = 0, 0
        crct_vld, acur_vld = 0, 0
        
        for batch in trn_iter:
            feature, target = batch.CombinedText, batch.PreventiveFlag
            optim.zero_grad()
            feature.transpose_(0, 1)
            logit = model(feature)
            loss = F.cross_entropy(logit, target)
            loss.backward()
            optim.step()
            running_loss += loss.data.item() * feature.size(0)
            crct_trn += (torch.max(logit, 1)[1].view(target.size()).data == target.data).sum()
        epoch_loss = running_loss / len(trn)
        acur_trn = crct_trn.data.item() * 1.0 / len(trn)
        
        val_loss = 0.0
        model.eval() # turn on evaluation mode
        for batch in vld_iter:
            feature, target = batch.CombinedText, batch.PreventiveFlag
            feature.transpose_(0, 1)
            logit = model(feature)
            loss = F.cross_entropy(logit, target)
            val_loss += loss.data.item() * feature.size(0)
            crct_vld += (torch.max(logit, 1)[1].view(target.size()).data == target.data).sum()
        acur_vld = crct_vld.data.item() / len(vld)

        val_loss /= len(vld)
        if step % 15 == 1:
            print('Epoch: {}, Trn Loss: {:.4f}, Val Loss: {:.4f}, acc_trn: {:.4f}({}/{}), acc_vld: {:.4f}({}/{})'
                  .format(epoch, epoch_loss, val_loss, acur_trn, crct_trn.data.item(), len(trn), acur_vld, crct_vld.data.item(), len(vld)))
            save(model, args.save_dir, 'snapshot', step)

In [19]:
train(trn_iter, vld_iter, cnn, args)

Epoch: 1, Trn Loss: 0.3024, Val Loss: 0.2625, acc_trn: 0.9228(8767/9500), acc_vld: 0.9320(466/500)
Epoch: 16, Trn Loss: 3.4010, Val Loss: 0.1970, acc_trn: 0.9343(8876/9500), acc_vld: 0.9340(467/500)
Epoch: 31, Trn Loss: 5.4319, Val Loss: 0.1478, acc_trn: 0.9588(9109/9500), acc_vld: 0.9520(476/500)
Epoch: 46, Trn Loss: 6.7918, Val Loss: 0.1217, acc_trn: 0.9738(9251/9500), acc_vld: 0.9580(479/500)
Epoch: 61, Trn Loss: 7.7428, Val Loss: 0.1065, acc_trn: 0.9840(9348/9500), acc_vld: 0.9660(483/500)
Epoch: 76, Trn Loss: 8.4092, Val Loss: 0.0970, acc_trn: 0.9914(9418/9500), acc_vld: 0.9700(485/500)
Epoch: 91, Trn Loss: 8.8684, Val Loss: 0.0909, acc_trn: 0.9957(9459/9500), acc_vld: 0.9680(484/500)
Epoch: 106, Trn Loss: 9.1779, Val Loss: 0.0880, acc_trn: 0.9988(9489/9500), acc_vld: 0.9680(484/500)
Epoch: 121, Trn Loss: 9.3828, Val Loss: 0.0867, acc_trn: 0.9996(9496/9500), acc_vld: 0.9680(484/500)
Epoch: 136, Trn Loss: 9.5172, Val Loss: 0.0875, acc_trn: 0.9998(9498/9500), acc_vld: 0.9680(484/500

### Evaluate the model

It seems epoch 121 achieves the best validation accuracy, we use checkpoint from this epoch to evaluate

In [65]:
cnn_eval = CNN_Text(args)
cnn_eval.load_state_dict(torch.load('./Chkpt/2019-02-25_00-00-37/snapshot_steps_121.pt'))
cnn_eval = cnn_eval.cuda()
cnn_eval.eval()

CNN_Text(
  (embed): Embedding(7071, 128)
  (convs1): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(3, 128), stride=(1, 1))
    (1): Conv2d(1, 100, kernel_size=(4, 128), stride=(1, 1))
    (2): Conv2d(1, 100, kernel_size=(5, 128), stride=(1, 1))
  )
  (dropout): Dropout(p=0.5)
  (fc1): Linear(in_features=300, out_features=2, bias=True)
)

In [72]:
df_vld = pd.read_csv('./Data/p2_val.csv')

for i in range(len(df_vld)):
    PreventiveFlag = df_vld.iloc[i]['PreventiveFlag']
    CombinedText = df_vld.iloc[i]['CombinedText']
    text = TEXT.preprocess(CombinedText)
    text = [[TEXT.vocab.stoi[x] for x in text]]
    x = torch.tensor(text)
    x = x.cuda()
    try:
        output = cnn_eval(x)
        _, predicted = torch.max(output, 1)
        predicted = predicted.data.item()
    
        if predicted != PreventiveFlag:
            print(i, predicted, PreventiveFlag)
    except:
        pass
        print(CombinedText)

0 0 1
2 0 1
3 0 1
8 0 1
11 0 1
12 0 1
14 0 1
19 0 1
26 0 1
32 0 1
33 0 1
Sales Tax | Dermatitis 
137 1 0
218 1 0
238 1 0
298 1 0
373 1 0
Sales Tax | Wellness 


In [90]:
df_tst = pd.read_csv('./Data/p2_tst.csv')

for i in range(len(df_tst)):
    PreventiveFlag = df_tst.iloc[i]['PreventiveFlag']
    CombinedText = df_tst.iloc[i]['CombinedText']
    text = TEXT.preprocess(CombinedText)
    text = [[TEXT.vocab.stoi[x] for x in text]]
    x = torch.tensor(text)
    x = x.cuda()
    try:
        output = cnn_eval(x)
        _, predicted = torch.max(output, 1)
        predicted = predicted.data.item()
        df_tst.iloc[i, df_tst.columns.get_loc('PreventiveFlag')] = int(predicted)
    except:
        pass
        print(CombinedText)
        df_tst.iloc[i, df_tst.columns.get_loc('PreventiveFlag')] = 0

Sales Tax | Vomiting 


In [91]:
df_tst['PreventiveFlag'] = df_tst['PreventiveFlag'].astype(int)

In [92]:
df_tst

Unnamed: 0,id,PreventiveFlag,CombinedText
0,10001,0,MILLIE: Polygylcan SA IM Arthritis Inj per ML...
1,10002,0,Ebbet: Zignature Kangaroo Formula 13oz. | vom...
2,10003,0,Ginger: Office Visit | ear infection
...,...,...,...
997,10998,0,Theo: Thyroid Free T4 (ED) Add on - ADD50 | e...
998,10999,0,MYSTIQUE: Exam/Medical Progress FollowUp | De...
999,11000,0,Cricket: RAD FollowUp Radiograph | Vomiting


In [93]:
df_tst.to_csv('./Data/p2_solution.csv')