# MediZen Strain Recommendation API Model

## Version 1.1 - 2019-11-19

---

## Imports and Config

In [1]:
# General imports
import pandas as pd
import janitor
import os

In [2]:
# NLP Imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

In [3]:
# Configure pandas to display entire text of column
pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_columns', 200)  # Display up to 200 columns

---

## Data Loading and First Looks

In [4]:
# Load the data into pd.DataFrame
filepath = "/Users/Tobias/workshop/buildbox/medizen_ds_api/data/"
data_filename = "cannabis.csv"
data_filepath = os.path.join(filepath, data_filename)

df1 = pd.read_csv(data_filepath)

In [5]:
df1.head()

Unnamed: 0,Strain,Type,Rating,Effects,Flavor,Description
0,100-Og,hybrid,4.0,"Creative,Energetic,Tingly,Euphoric,Relaxed","Earthy,Sweet,Citrus","$100 OG is a 50/50 hybrid strain that packs a strong punch. The name supposedly refers to both its strength and high price when it first started showing up in Hollywood. As a plant, $100 OG tends ..."
1,98-White-Widow,hybrid,4.7,"Relaxed,Aroused,Creative,Happy,Energetic","Flowery,Violet,Diesel",The ‘98 Aloha White Widow is an especially potent cut of White Widow that has grown in renown alongside Hawaiian legends like Maui Wowie and Kona Gold. This White Widow phenotype reeks of diesel a...
2,1024,sativa,4.4,"Uplifted,Happy,Relaxed,Energetic,Creative","Spicy/Herbal,Sage,Woody","1024 is a sativa-dominant hybrid bred in Spain by Medical Seeds Co. The breeders claim to guard the secret genetics due to security reasons, but regardless of its genetic heritage, 1024 is a THC p..."
3,13-Dawgs,hybrid,4.2,"Tingly,Creative,Hungry,Relaxed,Uplifted","Apricot,Citrus,Grapefruit",13 Dawgs is a hybrid of G13 and Chemdawg genetics bred by Canadian LP Delta 9 BioTech. The two potent strains mix to create a balance between indica and sativa effects. 13 Dawgs has a sweet earthy...
4,24K-Gold,hybrid,4.6,"Happy,Relaxed,Euphoric,Uplifted,Talkative","Citrus,Earthy,Orange","Also known as Kosher Tangie, 24k Gold is a 60% indica-dominant hybrid that combines the legendary LA strain Kosher Kush with champion sativa Tangie to create something quite unique. Growing tall i..."


---

## Data Wrangling and Feature Engineering

The end result that should be passed into the model is a single long string.  
Therefore, the new feature will simply be a concatenation of all three current features:

- `type`
- `effects`
- `flavor`

The cell below uses [pyjanitor](https://pyjanitor.readthedocs.io/) method-chaining to:

1. Clean up the feature names, which in this case only makes them lowercase
2. Concatenate the three features into one comma-separated feature
3. Remove all of the features except the new one
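
To make the three steps concrete, here is a minimal sketch that applies them to a single record using plain Python dictionaries instead of pyjanitor. The record is copied from the `df1.head()` output above (description omitted); the variable names are only illustrative.

```python
# One raw record, copied from the df1.head() output (description omitted)
row = {
    "Strain": "100-Og",
    "Type": "hybrid",
    "Rating": 4.0,
    "Effects": "Creative,Energetic,Tingly,Euphoric,Relaxed",
    "Flavor": "Earthy,Sweet,Citrus",
}

# 1. Clean up the feature names (here: lowercase only)
cleaned = {key.lower(): value for key, value in row.items()}

# 2. Concatenate the three features into one comma-separated string
combined = ",".join(cleaned[col] for col in ("type", "effects", "flavor"))

# 3. Keep only the strain name and the new feature
result = {"strain": cleaned["strain"], "type_effects_flavor": combined}
print(result)
```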

In [6]:
# Use pyjanitor to wrangle the data

df2 = (df1
        .clean_names()  # In this case, fixes Title Case
        .concatenate_columns(
            # Explanation above - create one feature for NLP analysis
            column_names=["type", "effects", "flavor"],
            new_column_name="type_effects_flavor",
            sep=",",  # Staying consistent with comma-separation
        )
        .remove_columns(column_names=[
            "rating",
            "description",
            "type",
            "effects",
            "flavor",
        ]))

In [7]:
# Look at the resulting dataframe
print(df2.shape)
df2.head()

(2351, 2)


Unnamed: 0,strain,type_effects_flavor
0,100-Og,"hybrid,Creative,Energetic,Tingly,Euphoric,Relaxed,Earthy,Sweet,Citrus"
1,98-White-Widow,"hybrid,Relaxed,Aroused,Creative,Happy,Energetic,Flowery,Violet,Diesel"
2,1024,"sativa,Uplifted,Happy,Relaxed,Energetic,Creative,Spicy/Herbal,Sage,Woody"
3,13-Dawgs,"hybrid,Tingly,Creative,Hungry,Relaxed,Uplifted,Apricot,Citrus,Grapefruit"
4,24K-Gold,"hybrid,Happy,Relaxed,Euphoric,Uplifted,Talkative,Citrus,Earthy,Orange"


In [8]:
# Look at null values - hint: there shouldn't be any
# because they all have values in at least one of the three columns
df2.isnull().sum()

strain                 0
type_effects_flavor    0
dtype: int64

---

## TF-IDF

TF-IDF (term frequency-inverse document frequency) scores how distinctive each word is for a document (here, a string).  
A word's score rises with how often it appears in the document and falls with how many documents contain it.  
The result is the distinctive topics rising to the top. (They are called _top_-ics, after all)...
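
The scoring can be sketched by hand on a toy corpus. The snippet below is a simplified illustration, not what `TfidfVectorizer` runs internally: it splits on commas instead of using the vectorizer's own tokenizer, and the three documents are made up. It does use the smoothed IDF formula and L2 normalisation that scikit-learn applies by default.

```python
import math
from collections import Counter

# Toy documents in the same comma-separated style as the data (made up)
docs = [
    "hybrid,happy,citrus",
    "hybrid,happy,earthy",
    "indica,sleepy,earthy",
]
tokenized = [doc.split(",") for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_vector(tokens):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(tokens)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


weights = tfidf_vector(tokenized[0])
# "citrus" appears in one document, "hybrid" and "happy" in two,
# so "citrus" receives the largest weight in the first document
print(sorted(weights, key=weights.get, reverse=True))
```

Because every row is scaled to unit length, the weights of different documents are directly comparable, which matters for the distance-based search later on.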

In [10]:
# Instantiate the vectorizer object
tfidf = TfidfVectorizer(stop_words="english")

# Create a vocabulary from the new feature
dtm = tfidf.fit_transform(df2["type_effects_flavor"])

# See the resulting feature matrix as a dataframe
# (scikit-learn versions before 1.0 use tfidf.get_feature_names() instead)
docs = pd.DataFrame(dtm.toarray(), columns=tfidf.get_feature_names_out())
docs.head()

Unnamed: 0,ammonia,apple,apricot,aroused,berry,blue,blueberry,butter,cheese,chemical,chestnut,citrus,coffee,creative,diesel,dry,earthy,energetic,euphoric,flowery,focused,fruit,giggly,grape,grapefruit,happy,herbal,honey,hungry,hybrid,indica,lavender,lemon,lime,mango,menthol,mint,minty,mouth,nan,nutty,orange,peach,pear,pepper,pine,pineapple,plum,pungent,relaxed,rose,sage,sativa,skunk,sleepy,spicy,strawberry,sweet,talkative,tar,tea,tingly,tobacco,tree,tropical,uplifted,vanilla,violet,woody
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.408775,0.0,0.351685,0.0,0.0,0.28758,0.37546,0.223409,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.272443,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.214536,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.295473,0.0,0.0,0.0,0.477579,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.35866,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22211,0.339356,0.0,0.0,0.237126,0.0,0.328363,0.0,0.0,0.0,0.0,0.0,0.127147,0.0,0.0,0.0,0.172065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.135493,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.691871,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.23885,0.0,0.0,0.0,0.254998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.13673,0.371594,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.145705,0.0,0.564854,0.297667,0.0,0.0,0.371594,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1608,0.0,0.0,0.358211
3,0.0,0.0,0.645008,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.248993,0.0,0.214218,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.509129,0.0,0.0,0.0,0.258509,0.165951,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.130678,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.290903,0.0,0.0,0.0,0.144217,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.364152,0.0,0.0,0.0,0.0,0.256186,0.0,0.199021,0.0,0.0,0.0,0.0,0.0,0.0,0.179345,0.0,0.0,0.0,0.242702,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.64339,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.191117,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.419669,0.0,0.0,0.0,0.0,0.0,0.0,0.210917,0.0,0.0,0.0


## Get Similarities with Nearest Neighbor (K-NN)

To get a list of strains that are similar to a given input string, 
scikit-learn's `NearestNeighbors` model will be used. With `algorithm='ball_tree'`, 
the model builds a ball tree that recursively partitions the points into nested regions, 
so a query can find the desired number (k) of nearest data points without measuring the distance to every point.
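
A ball tree is an exact method, so it returns the same neighbors as an exhaustive search, only faster on larger datasets. The brute-force sketch below shows what is being computed; the 2-D points and the `k_nearest` helper are made up for illustration and stand in for the TF-IDF vectors and `kneighbors`.

```python
import math


def k_nearest(query, points, k):
    """Brute-force k-nearest-neighbour search using Euclidean distance."""
    # Pair every distance with the index of the point it belongs to
    distances = [(math.dist(query, point), index) for index, point in enumerate(points)]
    distances.sort()
    nearest = distances[:k]
    # Return distances and indexes separately, mirroring kneighbors
    return [d for d, _ in nearest], [i for _, i in nearest]


# Made-up 2-D points standing in for the TF-IDF vectors
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
dists, idxs = k_nearest((0.9, 0.1), points, k=2)
print(idxs)
```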

In [11]:
# Instantiate the knn model
nn = NearestNeighbors(n_neighbors=10, algorithm='ball_tree')

# Fit (train) the model on the TF-IDF vectors created above
nn.fit(dtm.toarray())

NearestNeighbors(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                 radius=1.0)

### Running the KNN Model

In [12]:
# Example string to demonstrate the model
# This is a realistic example of what will be passed 
# into the model once it is integrated into the API
ex_1_str = "indica,happy,relaxed,hungry,talkative,citrus,tangy,flowery"

In [14]:
# Create the input vector
ex_1_vec = tfidf.transform([ex_1_str])

In [15]:
# Pass that vector into the trained knn model, specifying the number of neighbors to return
# This returns a tuple of two arrays: one holds the distance to each neighbor (its 'near-ness'),
# the other (the one we want) contains the indexes of the neighbors
rec_array = nn.kneighbors(ex_1_vec.toarray(), n_neighbors=10)
rec_array

(array([[0.67093641, 0.76730563, 0.77827539, 0.81093074, 0.81093074,
         0.81189686, 0.81484009, 0.84178747, 0.84178747, 0.84178747]]),
 array([[ 983,  731, 1948, 2081,  241,  397, 1450, 2220,  971,  956]]))
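
The distances in the first array can be read as cosine similarities. This assumes both the query and the stored vectors have unit length, which holds under `TfidfVectorizer`'s default `norm='l2'` whenever the query contains at least one known term, and that the metric is Euclidean, as in the model repr above. For unit vectors, the squared distance equals 2 - 2 * cos, so the conversion is a one-liner:

```python
# Distances copied from the kneighbors output above
distances = [0.67093641, 0.76730563, 0.77827539, 0.81093074, 0.81093074,
             0.81189686, 0.81484009, 0.84178747, 0.84178747, 0.84178747]

# For unit-length vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b)
cosine_similarities = [1 - d ** 2 / 2 for d in distances]
print([round(s, 3) for s in cosine_similarities])
```

A smaller distance therefore means a higher cosine similarity, and the neighbor ordering is the same under either measure.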

In [16]:
# Extract the second array - the list of strain ids (indexes) that are 'closest' to input
rec_id_list = rec_array[1][0]
rec_id_list

array([ 983,  731, 1948, 2081,  241,  397, 1450, 2220,  971,  956])

In [17]:
# Although the API will return only this list of indexes,
# for the purposes of this demo I'll hydrate that list with the rest 
# of the strain data from the original (pre-wrangled) dataframe
recommendations = df1.iloc[rec_id_list]
recommendations

Unnamed: 0,Strain,Type,Rating,Effects,Flavor,Description
983,Haoma,indica,4.8,"Relaxed,Happy,Sleepy,Hungry,Uplifted","Citrus,Flowery,Earthy","A cross between The Purps and Afghani, Haoma is a 70% indica strain with calming, stress-relieving effects. Haoma’s dense, compact buds have a fruity, floral aroma and melt away anxiety, pain, inf..."
731,Early-Girl,indica,3.8,"Relaxed,Happy,Hungry,Uplifted,Talkative","Citrus,Earthy,Pine","Early Girl is the wallflower of cannabis strains since its introduction in the 1980s. Lovingly preserved by the breeders at Sensi Seeds, this strain is lazy and relaxed, nothing over the top. A 75..."
1948,Sour-Patch-Kiss,hybrid,4.7,"Uplifted,Happy,Relaxed,Creative,Talkative","Citrus,Earthy,Flowery","Sour Patch Kiss by Elev8 Seeds was designed as a heavy-yielding trichome producer. This was achieved by crossing Kimbo Kush’s sweet, doughy aroma with Sour Kush’s pungent odor and generous product..."
2081,Sweet-Kush,hybrid,4.4,"Hungry,Happy,Relaxed,Euphoric,Sleepy","Citrus,Sweet,Flowery","Sweet Kush is the potent daughter of Sweet Tooth and OG Kush. Citrusy and sweet, Sweet Kush tastes just like a lemon drop candy. Combining the best of both cannabis types, this hybrid provides bot..."
241,Black-Velvet,hybrid,3.9,"Sleepy,Happy,Hungry,Relaxed,Euphoric","Sweet,Flowery,Citrus",This 50/50 hybrid strain is a cross of The Black and Burmese Kush that yields a potent flower with both cerebral and physical effects. The flower gets its density and purple-black hue from its Bla...
397,Bubblegum-Kush,indica,4.3,"Relaxed,Happy,Sleepy,Uplifted,Talkative","Sweet,Earthy,Flowery","An 80% indica strain from Bulldog Seeds in the Netherlands, Bubblegum Kush is a cross between Bubble Gum and an undisclosed Kush. An easy-to-grow plant that produces huge yields of frosty, resinou..."
1450,Negra-44,indica,5.0,"Creative,Energetic,Relaxed,Hungry,Talkative","Sweet,Earthy,Citrus","Negra 44 is an indica-dominant strain bred by R-Kiem Seeds in Spain. This award-winning variety crosses a Top 44 indica with native Ghana landrace strains, and inherits an earthy, fruity aroma. In..."
2220,Vader-Og,indica,4.5,"Sleepy,Relaxed,Happy,Euphoric,Hungry","Earthy,Flowery,Sweet","Vader OG by Ocean Grown Seeds is the namesake strain of one of OGS’s master growers, Vader. This cross began in 2006 with the combination of SFV OG and Larry OG, and evolved over a laborious proce..."
971,Gumbo,indica,4.6,"Relaxed,Euphoric,Happy,Sleepy,Hungry","Earthy,Sweet,Flowery","Getting its name from the classic bubble gum flavor, Gumbo is a perfect medicine for the evenings and has a smooth taste and finish. Gumbo is great for treatment of muscle spasms, sleeplessness, h..."
956,Green-Poison,indica,4.2,"Relaxed,Happy,Euphoric,Sleepy,Hungry","Sweet,Flowery,Earthy","Green Poison is a dangerously flavorful indica cross championed by Sweet Seeds. It pulls you in with a fruity and floral aroma, then delivers a potent dose of euphoria and body-numbing relaxation...."


---

## The FuncZone

In order to more easily integrate this recommendation process into the Flask API,
the steps can be grouped into a function that will take in a request and return the recommendations.

In [22]:
def recommend(req, n=10):
    """Function to recommend top n strains given a request."""
    # Create vector from request
    req_vec = tfidf.transform([req])

    # Access the top n indexes
    top_id = nn.kneighbors(req_vec.toarray(), n_neighbors=n)[1][0]

    # Index-locate the neighbors in original dataframe
    top_df = df1.iloc[top_id]

    return top_df

In [23]:
# Another example request to test out the above function
ex_2_str = "hybrid,euphoric,energetic,creative,woody,earthy"

# Run the function, this time asking for the top 5 recommendations
ex_2_recs = recommend(ex_2_str, 5)
ex_2_recs

Unnamed: 0,Strain,Type,Rating,Effects,Flavor,Description
736,Earthquake,hybrid,4.5,"Creative,Euphoric,Uplifted,Happy,Energetic","Woody,Pine,Earthy",
1585,Pine-Cone,hybrid,4.5,"Creative,Uplifted,Energetic,Euphoric,Focused","Pine,Earthy,Woody",Pine Cone by Glen’s Plant Farm is the hybrid cross of Blue Tahoe and Cinex. This combination develops a strong forest aroma and tight resinous nuggets that explode with earthy flavor upon vaporiza...
873,Gods-Green-Crack,hybrid,4.5,"Relaxed,Happy,Euphoric,Uplifted,Energetic","Earthy,Woody,Sweet","God’s Green Crack is a balanced hybrid strain bred by Jordan of the Islands, who wanted to lighten up the heavy effects of God Bud with a high-flying Green Crack sativa. The indica and sativa pare..."
2310,White-Widow,hybrid,4.3,"Happy,Relaxed,Euphoric,Uplifted,Energetic","Earthy,Woody,Pungent","Among the most famous strains worldwide is White Widow, a balanced hybrid first bred in the Netherlands by Green House Seeds. A cross between a Brazilian sativa landrace and a resin-heavy South In..."
23,Ak-47,hybrid,4.2,"Happy,Relaxed,Uplifted,Euphoric,Energetic","Earthy,Pungent,Woody",Don't let its intense name fool you: AK-47 will leave you relaxed and mellow. This sativa-dominant hybrid delivers a steady and long-lasting cerebral buzz that keeps you mentally alert and engaged...


In [24]:
# The API should return a JSON object with only the ids
# Here's a slightly modified version to accomplish that
def recommend_json(req, n=10):
    """Return the indexes of the top n recommended strains as a JSON array string."""
    # Create vector from request
    req_vec = tfidf.transform([req])

    # Access the top n indexes
    rec_id = nn.kneighbors(req_vec.toarray(), n_neighbors=n)[1][0]

    # Convert np.ndarray to pd.Series then to JSON
    rec_json = pd.Series(rec_id).to_json(orient="records")

    return rec_json
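
The string returned by `to_json(orient="records")` on a Series should be a plain JSON array of indexes. The sketch below shows the same round trip with only the standard library, as one possible alternative: the indexes are copied from the earlier `kneighbors` output, and the cast to `int` is there because the `json` module cannot serialise NumPy integers directly.

```python
import json

# Indexes copied from the earlier kneighbors output; in the notebook these
# arrive as a NumPy array, so each value is cast to a built-in int first
rec_id = [983, 731, 1948, 2081, 241, 397, 1450, 2220, 971, 956]
rec_json = json.dumps([int(i) for i in rec_id])

# What a client of the API would do with the response body
decoded = json.loads(rec_json)
print(decoded[:3])
```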

## Pickling

In order to use the model in the Flask app, it can be pickled. 
The pickle module and the pickle file format allow Python objects
to be serialized and de-serialized. In this case, the trained vectorizer
and model can be made into pickle files, which are then loaded into the
Flask app for use in the recommendation API.

In [25]:
# Create pickle func to make pickling (a little) easier
def picklizer(to_pickle, filename, path):
    """
    Creates a pickle file.
    
    Parameters
    ----------
    to_pickle : Python object
        The trained / fitted instance of the 
        transformer or model to be pickled.
    filename : string
        The desired name of the output file,
        including the '.pkl' extension.
    path : string or path-like object
        The path to the desired output directory.
    """
    import os
    import pickle

    # Create the path to save location
    picklepath = os.path.join(path, filename)

    # Use context manager to open file
    with open(picklepath, "wb") as p:
        pickle.dump(to_pickle, p)

In [26]:
# Picklize!

# Export vectorizer as pickle
picklizer(tfidf, "vect_02.pkl", filepath)

# Export knn model as pickle
picklizer(nn, "knn_02.pkl", filepath)
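
On the Flask side, the files are read back with `pickle.load`. The round trip below is a minimal sketch that uses a stand-in dictionary and a temporary directory rather than the real vectorizer and data path; the file name `example.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted object; any picklable Python object behaves the same way
fitted_object = {"vocabulary": ["citrus", "earthy", "hybrid"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "example.pkl")

    # Serialise, as picklizer does
    with open(pickle_path, "wb") as f:
        pickle.dump(fitted_object, f)

    # De-serialise, as the Flask app would at start-up
    with open(pickle_path, "rb") as f:
        restored = pickle.load(f)

print(restored == fitted_object)
```

When loading the real files, the app needs scikit-learn installed, ideally in the same version that was used to fit the objects.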