# NLP Vectorizer and Model

This model and vectorizer are built to work with the textual data in the master dataset. This notebook contains functions to pre-process data, train the search algorithm, and export pipelines as .pkl or similar files.

**Note:** Deploying large spaCy models to Heroku is not advised. For the web application, use the module referenced below.

>See: nlp_model.py for the functions to import when integrating with the web application

## Data PreProcessing

In [3]:
# Load Data
import pandas as pd
import os

data_path = '../master-2019-10-24 01:09.csv'
df = pd.read_csv(data_path)
df.head()

Unnamed: 0.1,Unnamed: 0,Strain,Type,Percent Indica,Percent Sativa,Generated Description,Flavor,Effects,Rating,Carene,...,Cymene,CBN,CBC,Description,Total_THC,Total_CBD,Total_CBG,Nerolidol,Caryophyllene,Terpinene
0,0,sugar-cane,hybrid,0.4,0.6,Sugar Cane is a rare slightly sativa dominant ...,Earthy Sweet Candy Grape Spicy Fruity Herbal P...,Body High Cerebral Creative Energizing Relaxin...,,,...,,,,Sugar Cane is a rare slightly sativa dominant ...,20.0,,,,,
1,1,mac1,hybrid,0.5,0.5,"MAC 1, also known as “Miracle Alien Cookies X1...",Sweet Diesel Sour Spicy Herbal Pungent,Creative Euphoria Happy Motivation Relaxing Up...,,,...,,,,"MAC 1, also known as “Miracle Alien Cookies X1...",21.5,,,,,
2,2,chemdawg,hybrid,0.55,0.45,With a near-even balance between sativa and in...,Earthy Pungent Chemical Diesel Pine Diesel Ear...,Cerebral Creative Euphoria Happy Relaxing Cere...,4.3,,...,,0.07,0.069,With a near-even balance between sativa and in...,18.160529,0.239965,0.876875,,0.812522,
3,3,jack-herer,sativa,,,Jack Herer is easily one of the best-known str...,Earthy Sweet Spicy Herbal Lemon Pine Woody Ear...,Body High Cerebral Creative Euphoria Happy Bod...,4.4,0.023,...,0.01,0.082,0.046667,Jack Herer is easily one of the best-known str...,15.461573,0.206155,0.962574,,0.656769,0.034
4,4,nerds,hybrid,0.5,0.5,"Nerds, also known as “Nerdz,” is an evenly bal...",Earthy Sweet Grape Spicy Herbal Fruity Berry W...,Cerebral Creative Euphoria Focus Relaxing Cere...,,,...,,,,"Nerds, also known as “Nerdz,” is an evenly bal...",15.5,,,,,


### Create Mass Text Field

Combine all text fields (strain name, effects, flavor, and description) into one large field for vectorization.

In [3]:
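# Create a new column holding the combined text of each record
# Note: the columns are concatenated without a separator, so adjacent
# fields fuse together (e.g. '100-ogcreative' in the output below)
df['mass_text'] = df.Strain + df.Effects + df.Flavor + df.Description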
df.mass_text[0]

'100-ogcreative,energetic,tingly,euphoric,relaxedearthy,sweet,citrus$100 og is a 50/50 hybrid strain that packs a strong punch. the name supposedly refers to both its strength and high price when it first started showing up in hollywood. as a plant, $100 og tends to produce large dark green buds with few stems. users report a strong body effect of an indica for pain relief with the more alert, cerebral feeling thanks to its sativa side.'
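The output above shows the last word of one field fused with the first word of the next (`100-ogcreative`, `relaxedearthy`), and a missing value in any column would turn the whole sum into `NaN`. As a minimal sketch of one alternative, assuming a toy frame with made-up values and the same four column names, the fields could be joined with a space while skipping missing entries:

```python
import pandas as pd

# Toy frame standing in for the master dataset (values are illustrative only)
toy = pd.DataFrame({
    "Strain": ["100-og", "mac1"],
    "Effects": ["creative,energetic", None],
    "Flavor": ["earthy,sweet", "sweet,diesel"],
    "Description": ["a 50/50 hybrid strain.", "also known as miracle alien cookies."],
})

text_columns = ["Strain", "Effects", "Flavor", "Description"]

# Replace missing values with empty strings, then join the non-empty
# fields of each row with a single space
toy["mass_text"] = (
    toy[text_columns]
    .fillna("")
    .apply(lambda row: " ".join(value for value in row if value), axis=1)
)

print(toy["mass_text"].tolist())
```

This is not the notebook's method; it changes the text that reaches the tokenizer, so vectors and distances would differ from the outputs recorded here.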

### Tokenize, Clean, Vectorize Text

Use spaCy's medium English model (`en_core_web_md`) to tokenize mass_text, remove stop words and punctuation, and then convert each document into a vector.

In [1]:
# Load Spacy Model
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_md")

# Wrap filter/tokenizer
def filter_data(func):
    def wrapper(text):
        return filter_doc(func(text))
    return wrapper

# Filter out stop words and punctuation
def filter_doc(doc):
    filtered_sentence = []
    for word in doc:
        lexeme = doc.vocab[word.text]
        if not lexeme.is_stop and not word.is_punct:
            filtered_sentence.append(word.text)
#     return filtered_sentence  #  Use to return a list of strings
#     return ' '.join(filtered_sentence)  # Use to return a single string with stop words, punctuation removed
    # Rebuild a spacy.tokens.Doc from the kept words, each followed by a space
    return Doc(nlp.vocab, words=filtered_sentence, spaces=[True] * len(filtered_sentence))


# Helper functions

# Upgraded version (TODO: fix errors finding the spaCy model when run in a parallel IPython process)
@filter_data
def tokenize_text(text):
    return nlp(text)

In [4]:
# Example tokenizer use:

sample_string = 'A sample of text is the greatness of all'

example = tokenize_text(sample_string)

display(example, type(example))

sample text greatness 

spacy.tokens.doc.Doc

In [5]:
# Apply tokenizer to mass_text
df['mass_text'] = df.mass_text.apply(tokenize_text)
df.mass_text[0:2]

0    (100-ogcreative, energetic, tingly, euphoric, ...
1    (98-white, widowrelaxed, aroused, creative, ha...
Name: mass_text, dtype: object

In [114]:
# Extract vectors from each Doc (mass_text description)

def get_vector_from_doc(x):
    return x.vector

df['mass_vector'] = df.mass_text.apply(get_vector_from_doc)
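spaCy documents `Doc.vector` as defaulting to the average of the token vectors, so each `mass_vector` summarizes the filtered description as a single point. The idea can be sketched with plain NumPy; the 4-dimensional vectors and the `average_vector` helper below are hypothetical stand-ins, not spaCy's internals:

```python
import numpy as np

# Hypothetical 4-dimensional word vectors (the notebook's vectors have 300 dimensions)
toy_vectors = {
    "sample": np.array([1.0, 0.0, 2.0, 0.0]),
    "text": np.array([0.0, 1.0, 0.0, 2.0]),
    "greatness": np.array([2.0, 2.0, 1.0, 1.0]),
}

def average_vector(tokens, table):
    """Average the vectors of the given tokens (a stand-in for Doc.vector)."""
    return np.mean([table[token] for token in tokens], axis=0)

doc_vector = average_vector(["sample", "text", "greatness"], toy_vectors)
print(doc_vector.shape)
```

Because the result is an average, word order is lost: two descriptions containing the same words in a different order map to the same vector.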

In [118]:
import numpy as np

# Expand each document vector into one column per dimension
vectors = df.mass_vector.apply(pd.Series)

vectors.shape

(2273, 300)
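`apply(pd.Series)` builds one Series per row, which can be slow on larger frames. A possible alternative, sketched here on a toy frame with placeholder vectors, is to stack the arrays directly with `np.vstack`, which yields a plain 2-D array of the same shape:

```python
import numpy as np
import pandas as pd

# Toy column of 300-dimensional vectors (placeholder values, not real embeddings)
toy = pd.DataFrame({
    "mass_vector": [np.zeros(300), np.ones(300), np.full(300, 0.5)]
})

# Stack the per-row arrays into a single (n_rows, 300) matrix
matrix = np.vstack(toy["mass_vector"].tolist())
print(matrix.shape)
```

The result is a NumPy array rather than a DataFrame, so row slices such as `matrix[:1]` still work, but the pandas index is not carried along.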

## Building Model

Build a scikit-learn `KDTree` over the document vectors so that nearest-neighbor queries can rank search results by distance.

In [123]:
# Create Tree
from sklearn.neighbors import KDTree

kdtree = KDTree(vectors, leaf_size=2)
kdtree

<sklearn.neighbors.kd_tree.KDTree at 0x55c94a47c0d8>

In [153]:
# Test Tree Search

dist, ind = kdtree.query(vectors[:1], k=3)
vectors[:1].shape

(1, 300)

In [141]:
display(ind, dist)

array([[   0,  374, 1996]])

array([[0.        , 0.88292231, 0.88879175]])
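The first hit above is row 0 at distance 0 because the query vector is itself stored in the tree. With the default metric, distances are Euclidean and results come back sorted from nearest to farthest. The same behavior can be reproduced on random toy data (the array sizes and seed here are arbitrary assumptions):

```python
import numpy as np
from sklearn.neighbors import KDTree

# Random toy points standing in for the document vectors
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 8))

tree = KDTree(points, leaf_size=2)

# Query with the first stored point; it should match itself first
dist, ind = tree.query(points[:1], k=3)
print(ind, dist)
```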

### Testing Outputs and Index Matchup

In [146]:
# Create a mock query from the first record's filtered text

test_string = """100-ogcreative energetic tingly euphoric relaxedearthy sweet citrus$100 og 
50/50 hybrid strain packs strong punch supposedly refers strength high price started showing 
hollywood plant $ 100 og tends produce large dark green buds stems users report strong body effect 
ndica pain relief alert cerebral feeling thanks sativa""" 
display(test_string)

'100-ogcreative energetic tingly euphoric relaxedearthy sweet citrus$100 og \n50/50 hybrid strain packs strong punch supposedly refers strength high price started showing \nhollywood plant $ 100 og tends produce large dark green buds stems users report strong body effect \nndica pain relief alert cerebral feeling thanks sativa'

In [160]:
# Example vectorization pipeline

input_vector = get_vector_from_doc(
    tokenize_text(test_string)
)

# input_vector = pd.Series(input_vector).to_numpy().reshape(1,-1)
input_vector = input_vector.reshape(1,-1)
input_vector.shape

(1, 300)

In [161]:
# Search Tree for x number of nearest matches

num_matches = 5

dist, ind = kdtree.query(input_vector, k=num_matches)

display(ind, dist)

array([[   0,  374, 1996, 1449,  390]])

array([[0.27035835, 0.85993068, 0.89083888, 0.90266355, 0.90718331]])

In [169]:
# Convert the matched indices to strain information

response = df[['StrainID', 'Effects', 'Flavor', 'Description']].iloc[ind[0]]
response

Unnamed: 0,StrainID,Effects,Flavor,Description
0,0,"creative,energetic,tingly,euphoric,relaxed","earthy,sweet,citrus",$100 og is a 50/50 hybrid strain that packs a ...
374,374,"happy,euphoric,relaxed,uplifted,creative","earthy,diesel,pungent",bruce banner might be best known as the alter-...
1996,1996,"happy,relaxed,sleepy,euphoric,giggly","earthy,woody,spicy/herbal",nirvana seeds created supergirl by backcrossin...
1449,1449,"energetic,happy,uplifted,euphoric,talkative","citrus,flowery,tea",oca’s cloud 9 is a phenotype of the mysterious...
390,390,"happy,focused,aroused,talkative,uplifted","pepper,mint,blueberry",bubblegun is a hybrid strain whose name plays ...


In [171]:
# Example conversion to JSON

response.to_json(orient='records')

'[{"StrainID":0,"Effects":"creative,energetic,tingly,euphoric,relaxed","Flavor":"earthy,sweet,citrus","Description":"$100 og is a 50\\/50 hybrid strain that packs a strong punch. the name supposedly refers to both its strength and high price when it first started showing up in hollywood. as a plant, $100 og tends to produce large dark green buds with few stems. users report a strong body effect of an indica for pain relief with the more alert, cerebral feeling thanks to its sativa side."},{"StrainID":374,"Effects":"happy,euphoric,relaxed,uplifted,creative","Flavor":"earthy,diesel,pungent","Description":"bruce banner might be best known as the alter-ego of comic book hero the incredible hulk, but maybe he wouldn\\u2019t be such a stressed out ball of anger if he just had some of his namesake strain. this green monster also has hidden strength and features dense nugs that pack the power of very high thc content. it\\u2019s a powerful strain whose effects come on quickly and strong and th
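The query, lookup, and JSON steps above could be wrapped in one helper. The sketch below assumes a hypothetical function name (`search_to_json`) and a toy 2-dimensional dataset; in the notebook the query vector would come from `get_vector_from_doc(tokenize_text(...))`:

```python
import json

import numpy as np
import pandas as pd
from sklearn.neighbors import KDTree

def search_to_json(tree, frame, query_vector, k=5):
    """Return the k nearest rows of `frame` as a JSON string (hypothetical helper)."""
    query = np.asarray(query_vector).reshape(1, -1)
    _, ind = tree.query(query, k=k)
    # Tree positions map to frame rows by position, so use iloc
    return frame.iloc[ind[0]].to_json(orient="records")

# Toy data: four records with 2-dimensional vectors
toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
toy_frame = pd.DataFrame({
    "StrainID": [0, 1, 2, 3],
    "Flavor": ["earthy", "sweet", "citrus", "pine"],
})
toy_tree = KDTree(toy_vectors)

result = json.loads(search_to_json(toy_tree, toy_frame, [0.9, 0.1], k=2))
print(result)
```

The lookup relies on the frame rows being in the same order as the vectors used to build the tree; reordering or filtering the frame afterwards would break the index matchup.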

## Export Model

**The vectorizer and input transformation are NOT INCLUDED** in this export; only the fitted KDTree is pickled. A re-implementation of the spaCy tokenizer and the pandas/numpy transforms can be found in nlp_model.py.

In [172]:
import pickle

with open('kdtree_model.pkl', 'wb') as f:
    pickle.dump(kdtree, f)
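On the consuming side, the tree can be restored with `pickle.load` and queried as before. This is a minimal round-trip sketch on toy points, written to a temporary directory so it does not touch the real `kdtree_model.pkl`:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import KDTree

# Toy tree standing in for the exported model
points = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
tree = KDTree(points)

# Write the tree to a temporary file, then read it back
path = os.path.join(tempfile.mkdtemp(), "kdtree_model.pkl")
with open(path, "wb") as f:
    pickle.dump(tree, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored tree answers queries like the original
dist, ind = restored.query(np.array([[0.9, 0.9]]), k=1)
print(ind, dist)
```

Pickles should only be loaded from trusted sources, and a pickled estimator is generally safest to load with the same scikit-learn version that created it.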