The purpose of this notebook is to get the work from my earlier notebooks ready to move into a script. I can think of the following steps:
- create dummy data (5 observations from the original dataset, saved as csv)
- function to clean data
- ensure all [pickled](https://docs.python.org/3/library/pickle.html#examples) models perform properly
- function to vectorize data
- function to get predictions from model
- make it possible to call script from command line with [argparse](https://docs.python.org/3/library/argparse.html#example)

In [38]:
import argparse
import json
import pandas as pd
import pickle
import regex as re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

The first step is to create dummy data. Let's get on that.

In [3]:
df = pd.read_csv('alldata_1_for_kaggle.csv', encoding='MacRoman', index_col=0)
df.head()

   0               a
0  Thyroid_Cancer  Thyroid surgery in children in a single insti...
1  Thyroid_Cancer  " The adopted strategy was the same as that us...
2  Thyroid_Cancer  coronary arterybypass grafting thrombosis Ô¨Åb...
3  Thyroid_Cancer  Solitary plasmacytoma SP of the skull is an u...
4  Thyroid_Cancer  This study aimed to investigate serum matrix ...


I'll take the first five instances and save them to a csv without the label information attached. For the record, all five are labeled Thyroid_Cancer, so every model prediction should come back as thyroid cancer.

In [6]:
dummy_data = df.a[:5]
dummy_data

0    Thyroid surgery in  children in a single insti...
1    " The adopted strategy was the same as that us...
2    coronary arterybypass grafting thrombosis Ô¨Åb...
3     Solitary plasmacytoma SP of the skull is an u...
4     This study aimed to investigate serum matrix ...
Name: a, dtype: object

In [7]:
dummy_data.to_csv('dummy_data.csv')

In [13]:
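# Note: to_csv writes UTF-8 by default, so decoding as MacRoman here garbles non-ASCII characters a second time
dd = pd.read_csv('dummy_data.csv', encoding='MacRoman', index_col=0)
dd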

   a
0  Thyroid surgery in children in a single insti...
1  " The adopted strategy was the same as that us...
2  coronary arterybypass grafting thrombosis √î¬®...
3  Solitary plasmacytoma SP of the skull is an u...
4  This study aimed to investigate serum matrix ...
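
A side note on row 2: the `Ô¨Å` sequence in the first preview looks like what the UTF-8 bytes of the ligature "ﬁ" (as in "ﬁbrin") become when decoded as MacRoman, and the longer `√î¬®...` after the round trip looks like the same mistake applied twice. The sketch below reproduces both strings with Python's built-in `mac_roman` codec. That the source file stores this character as UTF-8 is an assumption based on the output, not something checked against the file itself.

```python
# Assumption: the original character was the ligature U+FB01 ("fi") stored as UTF-8.
original = "\ufb01"

# First read: UTF-8 bytes decoded as MacRoman.
once = original.encode("utf-8").decode("mac_roman")
print(once)

# Writing that string as UTF-8 and reading it back as MacRoman garbles it again.
twice = once.encode("utf-8").decode("mac_roman")
print(twice)

# Undoing a single wrong decode recovers the original character.
print(once.encode("mac_roman").decode("utf-8") == original)
```

If the assumption holds, reading dummy_data.csv without `encoding='MacRoman'` would at least avoid the second round of garbling. Either way, these characters are stripped by the cleaning step further down, which would explain why the word ends up as "brin".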


Success. Now I need to incorporate work done in previous notebooks to prepare the text data.

In [31]:
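def clean_text(df):
    """Return the first column as a lowercase Series with special characters removed."""
    col = df.columns[0]
    # Keep only ASCII letters, whitespace and apostrophes
    regexp = r"[^a-zA-Z\s']"
    text = df[col].str.replace(regexp, "", regex=True)

    return text.str.lower()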

In [34]:
dd = clean_text(dd)
dd

0    thyroid surgery in  children in a single insti...
1     the adopted strategy was the same as that use...
2    coronary arterybypass grafting thrombosis brin...
3     solitary plasmacytoma sp of the skull is an u...
4     this study aimed to investigate serum matrix ...
Name: a, dtype: object
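
To make the effect of the regular expression concrete, here is a small sketch that applies the same pattern to single strings with the standard `re` module. The input strings and the `clean_string` helper are made up for illustration and are not taken from the dataset. Note that disallowed characters are deleted rather than replaced with a space, so hyphenated or alphanumeric terms get fused into one token.

```python
import re

# Same pattern as clean_text: keep ASCII letters, whitespace and apostrophes.
PATTERN = r"[^a-zA-Z\s']"

def clean_string(text):
    """Delete disallowed characters, then lowercase (mirrors clean_text for one string)."""
    return re.sub(PATTERN, "", text).lower()

# Hypothetical inputs chosen to show the edge cases.
print(clean_string("SARS-CoV-2 patients (n=25)"))              # sarscov patients n
print(clean_string("Patients' TSH levels weren't elevated."))  # patients' tsh levels weren't elevated
```

This fusing is presumably where a feature name like sarscov, which shows up among the columns later on, comes from.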

Special characters are removed and all text is lowercase. Let's remove stopwords and stem the words using the function written in cleanup.ipynb. It may look slightly different because we are working with a series instead of a dataframe.

In [35]:
unpacked_stopwords = stopwords.words('english')
ps = PorterStemmer()

def remove_stopwords(article, stopword_list=unpacked_stopwords, stemmer=ps):
    """Tokenize an article, drop stopwords, stem the rest, and rejoin into one string."""
    tok_article = word_tokenize(article)
    approved_words = []

    for word in tok_article:
        # Skip stopwords; stem everything else
        if word not in stopword_list:
            approved_words.append(stemmer.stem(word))

    return " ".join(approved_words)

In [37]:
dd = dd.apply(remove_stopwords)
dd

0    thyroid surgeri children singl institut osama ...
1    adopt strategi use prior year base four exclus...
2    coronari arterybypass graft thrombosi brin bri...
3    solitari plasmacytoma sp skull uncommon clinic...
4    studi aim investig serum matrix metalloprotein...
Name: a, dtype: object

Great. Let's transform these features using the vectorizer fit to our data in the preproc notebook.

Note: at this point, I realized I also need the selector.support_ list of selected columns from that notebook. I'll save the list as json there and read it into this notebook.
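
One way that export could look, sketched with the standard library only. The mask values and the temporary file path are placeholders. In the other notebook the list would come from `selector.support_`, which as far as I know is a NumPy boolean array and would need `.tolist()` before `json.dump`, since NumPy arrays are not JSON serializable.

```python
import json
import os
import tempfile

# Placeholder mask; in the other notebook this would be selector.support_.tolist()
support_mask = [True, False, True, True, False]

# Temporary path for illustration; this notebook reads 'feature_list.json'
path = os.path.join(tempfile.mkdtemp(), "feature_list.json")

with open(path, "w") as f:
    json.dump(support_mask, f)

with open(path, "r") as f:
    col_bools = json.load(f)

print(col_bools)  # [True, False, True, True, False]
```

The values come back as plain Python booleans, so the list can be used directly as a column mask.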

In [39]:
with open('feature_list.json', 'r') as f:
    col_bools = json.load(f)    

In [41]:
with open('vec.pickle', 'rb') as f:
    vectorizer = pickle.load(f)

In [42]:
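def create_features(df, vec=vectorizer, cols=col_bools):
    """Vectorize cleaned text and keep only the selected feature columns."""
    df_vec = vec.transform(df)
    # Caution: vocabulary_ maps term -> column index and its key order is not the
    # column order, so these labels may not line up with the values beneath them.
    df_vec = pd.DataFrame(df_vec.toarray(), columns=vec.vocabulary_.keys())

    # cols is a boolean mask, so the selection is positional
    return df_vec.loc[:, cols]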

In [44]:
dd_vecs = create_features(dd)
dd_vecs

   lung      howev     signific  despit    higher   tabl  high      grade  associ    lower     ...  accuraci  led  efficaci  period    stabl  coloni    sarscov   nk        ibd       exercis
0  0.0       0.0       0.0       0.0       0.0      0.0   0.0       0.0    0.125068  0.0       ...  0.0       0.0  0.0       0.00923   0.0    0.016814  0.717622  0.0       0.0       0.0
1  0.0       0.013307  0.0       0.007407  0.02151  0.0   0.011682  0.0    0.039784  0.047076  ...  0.0       0.0  0.0       0.0       0.0    0.048138  0.026682  0.0       0.015507  0.0
2  0.017623  0.0       0.0       0.033805  0.0      0.0   0.0       0.0    0.166444  0.0       ...  0.019381  0.0  0.021267  0.040202  0.0    0.0       0.020296  0.160208  0.011796  0.0
3  0.02077   0.0       0.0       0.0       0.0      0.0   0.0       0.0    0.0535    0.0       ...  0.0       0.0  0.0       0.094762  0.0    0.0       0.02392   0.0       0.041708  0.0
4  0.0       0.0       0.0       0.0       0.0      0.0   0.006535  0.0    0.050074  0.0       ...  0.007126  0.0  0.0       0.022173  0.0    0.0       0.291049  0.0       0.004337  0.0
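
A caveat about the column labels in this output: in scikit-learn vectorizers, `vocabulary_` is a dict mapping each term to its column index, and the order of its keys is not necessarily the order of the columns. Labeling the array with `vocabulary_.keys()` can therefore pair values with the wrong terms, which might explain why a thyroid surgery article shows a large weight under sarscov. The selection by `cols` is a positional boolean mask, so the values passed on to the model should be unaffected. Below is a minimal sketch with a made-up vocabulary showing how names can be recovered in column order.

```python
# Made-up vocabulary in the shape scikit-learn uses: term -> column index.
# Key order reflects insertion order, not column position.
vocabulary = {"thyroid": 3, "cancer": 0, "surgeri": 2, "lung": 1}

# Labels as built in create_features: key order
labels_by_key_order = list(vocabulary.keys())

# Labels in true column order: sort the terms by their index
labels_by_column = sorted(vocabulary, key=vocabulary.get)

print(labels_by_key_order)  # ['thyroid', 'cancer', 'surgeri', 'lung']
print(labels_by_column)     # ['cancer', 'lung', 'surgeri', 'thyroid']
```

On recent scikit-learn versions, `vec.get_feature_names_out()` returns the names in column order directly.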


Great success. Now I need to get predictions from the model. That means loading two pickled objects: the xgboost model and the label encoder. I'm expecting numeric labels back from the model (0, 1, or 2), and I want to transform those into their string names. Remember, we are expecting to see thyroid cancer predicted for every observation in this mock dataset.

In [45]:
with open('xgb.pickle', 'rb') as f:
    xgb = pickle.load(f)

with open('enc.pickle', 'rb') as f:
    encoder = pickle.load(f)    
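
Since one goal listed at the top is making sure the pickled objects behave properly, the script could check each loaded object for the method it is about to call. A minimal sketch follows. `load_pickled` and its `required_method` argument are hypothetical names, and the demo pickles a plain `Counter` to a temporary file as a stand-in for the real model files.

```python
import os
import pickle
import tempfile
from collections import Counter

def load_pickled(path, required_method):
    """Unpickle an object and fail early if it lacks the method the script needs."""
    with open(path, "rb") as f:
        obj = pickle.load(f)
    if not callable(getattr(obj, required_method, None)):
        raise TypeError(f"{path} does not provide a callable .{required_method}()")
    return obj

# Stand-in demo: a Counter plays the role of a pickled model.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pickle")
with open(demo_path, "wb") as f:
    pickle.dump(Counter("thyroid"), f)

counter = load_pickled(demo_path, "most_common")  # passes the check
print(type(counter).__name__)  # Counter
```

In the script this might be called as `load_pickled('xgb.pickle', 'predict')` and `load_pickled('enc.pickle', 'inverse_transform')`. Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code.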

In [48]:
def make_preds(df, model=xgb, enc=encoder):
    """Predict numeric labels and map them back to their string class names."""
    preds = model.predict(df)
    return enc.inverse_transform(preds)

In [49]:
make_preds(dd_vecs)

array(['Thyroid_Cancer', 'Thyroid_Cancer', 'Thyroid_Cancer',
       'Thyroid_Cancer', 'Thyroid_Cancer'], dtype=object)

YUP. All five predictions come back as Thyroid_Cancer, as expected.

I think I'm just going to figure out the final bullet point (making this callable from the command line) as I develop the script. There's no point scripting in a notebook.
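
As a rough starting point for that last bullet, the command-line interface might look something like the sketch below. The argument names (`input_csv`, `--output`) and the `build_parser` helper are hypothetical, and nothing here is wired to the functions above yet.

```python
import argparse

def build_parser():
    """Build a parser for a hypothetical prediction script."""
    parser = argparse.ArgumentParser(
        description="Predict labels for articles stored in a csv file."
    )
    # Positional argument: csv with one text column, like dummy_data.csv
    parser.add_argument("input_csv", help="path to the csv of articles")
    # Optional argument: where to write the predictions
    parser.add_argument("-o", "--output", default=None,
                        help="optional path for a csv of predictions")
    return parser

# Parsing an explicit list keeps the sketch runnable inside a notebook
args = build_parser().parse_args(["dummy_data.csv"])
print(args.input_csv, args.output)  # dummy_data.csv None
```

In the real script, calling `parse_args()` with no arguments reads from `sys.argv`, so it could be run as `python predict.py dummy_data.csv -o preds.csv` (the script name is hypothetical).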