# Toy cannabis recommender for med_cabinet_4
## Author: JAE Finger
### Updated: 06/18/2020

**Goals:** Step 1: Clean and examine dataset

Step 2: Develop baseline model to recommend a strain

Step 3: Develop robust model to recommend a strain

Step 4: Deploy app to cloud

Step 5: Connect cloud app to website

## Import packages and necessaries

In [None]:
# Import sys and install pandas-profiling (with notebook extras) into the current kernel's environment.
import sys
!{sys.executable} -m pip install -U pandas-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension

Requirement already up-to-date: pandas-profiling[notebook] in /usr/local/lib/python3.6/dist-packages (2.8.0)
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


In [None]:
# Import packages
# Data analysis
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
import requests

# Data cleaning
import re

# Tokenizing words
import spacy
from spacy.tokenizer import Tokenizer
from collections import Counter

# TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity



## Import data and examine raw values

In [None]:
# Import csv obtained from Kaggle
kaggle_1 = pd.read_csv("https://raw.githubusercontent.com/jae-finger/med_cabinet_4/master/cannabis.csv")

In [None]:
# Drop nulls
print(f"The shape before removing nulls is {kaggle_1.shape}")
df = kaggle_1.copy()
df = df.dropna()
print(f"The shape after removing nulls is {df.shape}")
df.head()

The shape before removing nulls is (2351, 6)
The shape after removing nulls is (2277, 6)


Unnamed: 0,Strain,Type,Rating,Effects,Flavor,Description
0,100-Og,hybrid,4.0,"Creative,Energetic,Tingly,Euphoric,Relaxed","Earthy,Sweet,Citrus",$100 OG is a 50/50 hybrid strain that packs a ...
1,98-White-Widow,hybrid,4.7,"Relaxed,Aroused,Creative,Happy,Energetic","Flowery,Violet,Diesel",The ‘98 Aloha White Widow is an especially pot...
2,1024,sativa,4.4,"Uplifted,Happy,Relaxed,Energetic,Creative","Spicy/Herbal,Sage,Woody",1024 is a sativa-dominant hybrid bred in Spain...
3,13-Dawgs,hybrid,4.2,"Tingly,Creative,Hungry,Relaxed,Uplifted","Apricot,Citrus,Grapefruit",13 Dawgs is a hybrid of G13 and Chemdawg genet...
4,24K-Gold,hybrid,4.6,"Happy,Relaxed,Euphoric,Uplifted,Talkative","Citrus,Earthy,Orange","Also known as Kosher Tangie, 24k Gold is a 60%..."


In [None]:
# Create a test strain manually
STRAIN_NAME = "User_Strain"
STRAIN_TYPE = "Sativa"
STRAIN_RATING = 5
STRAIN_EFFECTS = "Uplifed, Happy, Relaxed, Energetic, Creative"
STRAIN_FLAVORS = "Spicy, Herbal, Sage, Woody"
STRAIN_DESCRIPTION = "a sativadominant hybrid bred in spain by medical seeds co the breeders claim to guard the secret genetics due to security reasons but regardless of its genetic heritage it is a thc powerhouse with a sweet and spicy bouquet subtle fruit flavors mix with an herbal musk to produce uplifting sativa effects one specific phenotype is noted for having a pungent odor that fills a room similar to burning incense"
TEST_STRAIN = {'Strain': [STRAIN_NAME], 'Type': [STRAIN_TYPE], 'Rating': [STRAIN_RATING], 'Effects': [STRAIN_EFFECTS], 'Flavor' : [STRAIN_FLAVORS], 'Description' : [STRAIN_DESCRIPTION]}
test_strain_input = pd.DataFrame(TEST_STRAIN)
test_strain_input.head()

# Create a test strain through JSON


Unnamed: 0,Strain,Type,Rating,Effects,Flavor,Description
0,User_Strain,Sativa,5,"Uplifed, Happy, Relaxed, Energetic, Creative","Spicy, Herbal, Sage, Woody",a sativadominant hybrid bred in spain by medic...
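
The cell above leaves the JSON route as a stub. Below is a minimal sketch of one way it could be filled in, assuming the incoming payload is a flat JSON object. The key names (`name`, `type`, `rating`, `effects`, `flavors`, `description`) and the helper `strain_from_json` are hypothetical and do not come from the original notebook. The helper returns the same column-to-list dictionary shape as `TEST_STRAIN`, so its result could be passed to `pd.DataFrame` unchanged.

```python
import json

# Hypothetical mapping from JSON keys to the dataset's column names.
KEY_TO_COLUMN = {
    "name": "Strain",
    "type": "Type",
    "rating": "Rating",
    "effects": "Effects",
    "flavors": "Flavor",
    "description": "Description",
}


def strain_from_json(payload):
    """Parse a JSON string into a one-row {column: [value]} dictionary."""
    record = json.loads(payload)
    missing = [key for key in KEY_TO_COLUMN if key not in record]
    if missing:
        raise ValueError(f"Missing keys in payload: {missing}")
    return {column: [record[key]] for key, column in KEY_TO_COLUMN.items()}


# Example payload (made-up values mirroring the manual test strain).
example_payload = json.dumps({
    "name": "User_Strain",
    "type": "Sativa",
    "rating": 5,
    "effects": "Happy, Relaxed, Energetic, Creative",
    "flavors": "Spicy, Herbal, Sage, Woody",
    "description": "a sativa dominant hybrid with a sweet and spicy bouquet",
})

json_strain = strain_from_json(example_payload)
# pd.DataFrame(json_strain) would then build a one-row frame like test_strain_input.
```

Validating the keys up front means a malformed request fails with a clear message instead of producing a row with missing columns.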


In [None]:
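# Add to list of strains (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df = pd.concat([df, test_strain_input])
# The index is left as-is (the new row keeps label 0), so rows are looked up by position with .iloc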
# Verify it was added
leng = len(df)-1
df.iloc[leng]

Strain                                               User_Strain
Type                                                      Sativa
Rating                                                         5
Effects             Uplifed, Happy, Relaxed, Energetic, Creative
Flavor                                Spicy, Herbal, Sage, Woody
Description    a sativadominant hybrid bred in spain by medic...
Name: 0, dtype: object

In [None]:
# lowercase text
df['Strain'] = df['Strain'].apply(lambda x: x.lower())
df['Description'] = df['Description'].apply(lambda x: x.lower())
df['Effects'] = df['Effects'].apply(lambda x: x.lower())
df['Flavor'] = df['Flavor'].apply(lambda x: x.lower())
df['Type'] = df['Type'].apply(lambda x: x.lower())

# remove symbols
df['Strain'] = df['Strain'].apply(lambda x: re.sub('[^a-zA-Z 0-9]', ' ', x))
df['Description'] = df['Description'].apply(lambda x: re.sub('[^a-zA-Z 0-9]', '', x))
df['Effects'] = df['Effects'].apply(lambda x: re.sub('[^a-zA-Z 0-9]', ' ', x))
df['Flavor'] = df['Flavor'].apply(lambda x: re.sub('[^a-zA-Z 0-9]', ' ', x))
df['Type'] = df['Type'].apply(lambda x: re.sub('[^a-zA-Z 0-9]', ' ', x))
df.head()

Unnamed: 0,Strain,Type,Rating,Effects,Flavor,Description
0,100 og,hybrid,4.0,creative energetic tingly euphoric relaxed,earthy sweet citrus,100 og is a 5050 hybrid strain that packs a st...
1,98 white widow,hybrid,4.7,relaxed aroused creative happy energetic,flowery violet diesel,the 98 aloha white widow is an especially pote...
2,1024,sativa,4.4,uplifted happy relaxed energetic creative,spicy herbal sage woody,1024 is a sativadominant hybrid bred in spain ...
3,13 dawgs,hybrid,4.2,tingly creative hungry relaxed uplifted,apricot citrus grapefruit,13 dawgs is a hybrid of g13 and chemdawg genet...
4,24k gold,hybrid,4.6,happy relaxed euphoric uplifted talkative,citrus earthy orange,also known as kosher tangie 24k gold is a 60 i...
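
The cleaning cell replaces symbols with a space in most columns but deletes them outright in `Description`. The sketch below shows the difference on a made-up sentence; the helper name `clean_text` and the sample string are assumptions for illustration, while the regular expression is the one used above.

```python
import re

# Same character class as the cleaning cell: keep letters, digits, and spaces.
SYMBOLS = re.compile(r'[^a-zA-Z 0-9]')


def clean_text(text, replacement=' '):
    """Lowercase text and replace every character that is not a letter, digit, or space."""
    return SYMBOLS.sub(replacement, text.lower())


sample = "A 50/50 Sativa-Dominant hybrid"

spaced = clean_text(sample)      # symbols become spaces, words stay separate
fused = clean_text(sample, '')   # symbols are deleted, neighbouring words fuse

print(spaced)
print(fused)
```

Deleting symbols is what produces tokens such as `5050` and `sativadominant` in the `Description` column shown above; replacing them with a space keeps the pieces as separate words.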


In [None]:
# Create a corpus
print("Combining text features")
# Join the text columns with single spaces so adjacent words do not fuse
df['combined_text'] = df['Type'] + ' ' + df['Effects'] + ' ' + df['Flavor'] + ' ' + df['Description']
df.head()

Combining text features


Unnamed: 0,Strain,Type,Rating,Effects,Flavor,Description,combined_text
0,100 og,hybrid,4.0,creative energetic tingly euphoric relaxed,earthy sweet citrus,100 og is a 5050 hybrid strain that packs a st...,hybrid creative energetic tingly euphoric rela...
1,98 white widow,hybrid,4.7,relaxed aroused creative happy energetic,flowery violet diesel,the 98 aloha white widow is an especially pote...,hybrid relaxed aroused creative happy energeti...
2,1024,sativa,4.4,uplifted happy relaxed energetic creative,spicy herbal sage woody,1024 is a sativadominant hybrid bred in spain ...,sativa uplifted happy relaxed energetic creati...
3,13 dawgs,hybrid,4.2,tingly creative hungry relaxed uplifted,apricot citrus grapefruit,13 dawgs is a hybrid of g13 and chemdawg genet...,hybrid tingly creative hungry relaxed uplifted...
4,24k gold,hybrid,4.6,happy relaxed euphoric uplifted talkative,citrus earthy orange,also known as kosher tangie 24k gold is a 60 i...,hybrid happy relaxed euphoric uplifted talkati...


## Spacy model

In [None]:
# Initialize spaCy model and tokenizer
nlp = spacy.load("en_core_web_sm")
tokenizer = Tokenizer(nlp.vocab)

In [None]:
# Add a couple custom stop words
STOP_WORDS = nlp.Defaults.stop_words.union(['weed', 'strain'])

In [None]:
# Setting up the word counter.
def count(docs):
    '''Count word frequencies across a list of tokenized documents.

    # Arguments
        docs: list, tokenized list of documents (each document is a list of tokens)

    # Returns
        wc: dataframe, one row per word with columns word, appears_in, count,
            rank, pct_total, cul_pct_total, and appears_in_pct
    '''
    
    word_counts = Counter()
    appears_in = Counter()

    total_docs = len(docs)

    for doc in docs:
        word_counts.update(doc)
        appears_in.update(set(doc))

    temp = zip(word_counts.keys(), word_counts.values())

    wc = pd.DataFrame(temp, columns = ['word', 'count'])

    wc['rank'] = wc['count'].rank(method='first', ascending=False)
    total = wc['count'].sum()

    wc['pct_total'] = wc['count'].apply(lambda x: x / total)

    wc = wc.sort_values(by='rank')
    wc['cul_pct_total'] = wc['pct_total'].cumsum()

    t2 = zip(appears_in.keys(), appears_in.values())
    ac = pd.DataFrame(t2, columns=['word', 'appears_in'])
    wc = ac.merge(wc, on='word')

    wc['appears_in_pct'] = wc['appears_in'].apply(lambda x: x / total_docs)

    return wc.sort_values(by='rank')
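
To make the two counters inside `count` concrete, here is a small sketch on made-up tokenized documents. It mirrors only the counting step (total occurrences versus number of documents containing a word), not the ranking or cumulative columns.

```python
from collections import Counter

# Two made-up tokenized documents.
docs = [["happy", "relaxed", "happy"], ["happy", "sweet"]]

word_counts = Counter()   # total occurrences of each word
appears_in = Counter()    # number of documents that contain each word

for doc in docs:
    word_counts.update(doc)
    appears_in.update(set(doc))  # set() so a word counts once per document

# Share of documents in which each word appears.
appears_in_pct = {word: n / len(docs) for word, n in appears_in.items()}

print(word_counts)
print(appears_in_pct)
```

Here `happy` occurs three times but appears in only two documents, which is the distinction between the `count` and `appears_in` columns.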

In [None]:
# Tokenize the combined text
tokens = []
for doc in tokenizer.pipe(df['combined_text'], batch_size=250):
    
    doc_tokens = []
    for token in doc: 
        if token.text.lower() not in STOP_WORDS:
            doc_tokens.append(token.text.lower())
   
    tokens.append(doc_tokens)
    
df['spaCy_tokens'] = tokens

In [None]:
# Print spacy token count
wc = count(df['spaCy_tokens'])
print(wc.shape)
wc.head()

(12306, 7)


Unnamed: 0,word,appears_in,count,rank,pct_total,cul_pct_total,appears_in_pct
42,hybrid,1359,2357,1.0,0.019871,0.019871,0.596576
97,happy,1846,1947,2.0,0.016414,0.036285,0.81036
3,relaxed,1705,1773,3.0,0.014947,0.051233,0.748464
15,euphoric,1634,1772,4.0,0.014939,0.066172,0.717296
8,sweet,1172,1532,5.0,0.012916,0.079087,0.514486


In [None]:
# Get token lemmas
def get_lemmas(text):

    doc = nlp(text)
    
    lemmas = []
    for token in doc: 
        if not token.is_stop and not token.is_punct and token.pos_ != 'PRON':
            lemmas.append(token.lemma_)
    
    return lemmas

In [None]:
# Get and check out lemmas
df['lemmas'] = df['combined_text'].apply(get_lemmas)
df['lemmas'].head()

0    [hybrid, creative, energetic, tingly, euphoric...
1    [hybrid, relaxed, arouse, creative, happy, ene...
2    [sativa, uplift, happy, relaxed, energetic, cr...
3    [hybrid, tingly, creative, hungry, relaxed, up...
4    [hybrid, happy, relaxed, euphoric, uplifted, t...
Name: lemmas, dtype: object

In [None]:
# View lemma word count
wc = count(df['lemmas'])
print(wc.shape)
wc.head(50)

(10780, 7)


Unnamed: 0,word,appears_in,count,rank,pct_total,cul_pct_total,appears_in_pct
32,strain,1768,2873,1.0,0.023747,0.023747,0.776119
43,hybrid,1369,2397,2.0,0.019812,0.043559,0.600966
98,happy,1846,1950,3.0,0.016118,0.059677,0.81036
16,euphoric,1634,1772,4.0,0.014646,0.074323,0.717296
41,effect,1273,1552,5.0,0.012828,0.087151,0.558824
9,sweet,1177,1539,6.0,0.012721,0.099872,0.516681
36,indica,908,1458,7.0,0.012051,0.111923,0.398595
241,relax,1191,1376,8.0,0.011373,0.123296,0.522827
10,earthy,1010,1194,9.0,0.009869,0.133165,0.443371
224,cross,1001,1069,10.0,0.008836,0.142001,0.439421


## TFIDF


In [None]:
# Set up TF-IDF

def tokenize(document):
    '''Optional spaCy lemma tokenizer; currently unused because the tokenizer
    argument below is commented out.'''
    doc = nlp(document)

    return [token.lemma_.strip() for token in doc if not token.is_stop and not token.is_punct]

# Instantiate vectorizer object
tfidf = TfidfVectorizer(
    stop_words = 'english',
    # tokenizer = tokenize,
    ngram_range = (1,2),
    min_df = 1, 
    max_df = 0.9,
    max_features = 5000)

In [None]:
# Create a vocabulary and tf-idf score per document
text = df['combined_text']
dtm = tfidf.fit_transform(text)
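
For intuition about what the scores in the matrix represent, the sketch below computes TF-IDF by hand on three made-up documents. It is a simplified illustration, not what the notebook runs: it assumes whitespace tokenization and unigrams only, with no stop words, `max_df`, or `max_features`. The weighting follows the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization that scikit-learn documents as its defaults. The helper name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalized TF-IDF weights."""
    n_docs = len(docs)
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # guard empty docs
        rows.append([w / norm for w in weights])
    return vocab, rows


# Made-up documents for illustration.
toy_docs = ["sativa happy citrus", "indica relaxed earthy", "hybrid happy earthy"]
vocab, rows = tfidf_rows(toy_docs)

for doc, row in zip(toy_docs, rows):
    print(doc, [round(w, 3) for w in row])
```

Within the first document, `citrus` (found in one document) gets a higher weight than `happy` (found in two), which is the effect the IDF term is meant to produce.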

In [None]:
# Get feature names to use as dataframe column headers
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
dtm = pd.DataFrame(dtm.toarray(), columns=tfidf.get_feature_names_out())

# View Feature Matrix as DataFrame
print(dtm.shape)
dtm.head()

(2278, 5000)


Unnamed: 0,10,10 week,10 weeks,100,11,11 cbdthc,11 ratio,11 weeks,12,13,14,14 weeks,15,1520,16,18,19,1960s,1970s,1980s,1990s,1996,1st,1st hawaiian,1st place,1st prize,20,20 cbd,20 thc,2002,2003,2004,2005,2006,2007,2009,2010,2011,2012,2013,...,wonderful,wonders,wont,wood,woody,woody aroma,woody earthy,woody pine,woody spicy,woodythe,word,work,works,world,world regions,worlds,worldwide,worries,worth,worthwhile,worthy,wrapped,wreck,xiii,xxx,years,yellow,yield,yield attentive,yield potency,yielding,yields,yields following,youll,youre,youre looking,zest,zesty,zesty lemon,zombie
0,0.0,0.0,0.0,0.194436,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.088039,0.112682,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.289975,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.126616,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Similarity (Recommender)

In [None]:
# Calculate pairwise cosine similarity of the TF-IDF vectors (higher means more similar)
dist_matrix  = cosine_similarity(dtm)
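
As a reference for what each entry of the matrix means, here is a minimal hand-written cosine similarity for two vectors. The function name `cosine` and the toy vectors are assumptions for illustration; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 0.0], [2.0, 0.0])   # identical direction
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # no shared terms
partial = cosine([1.0, 1.0], [1.0, 0.0])          # one shared term out of two

print(same_direction, orthogonal, partial)
```

Because TF-IDF weights are non-negative, the scores fall between 0 (no shared terms) and 1 (same direction), so larger values mean more similar strains.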

In [None]:
# Turn it into a DataFrame
cosine_df = pd.DataFrame(dist_matrix)
print(cosine_df.shape)
cosine_df.head()

(2278, 2278)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,2238,2239,2240,2241,2242,2243,2244,2245,2246,2247,2248,2249,2250,2251,2252,2253,2254,2255,2256,2257,2258,2259,2260,2261,2262,2263,2264,2265,2266,2267,2268,2269,2270,2271,2272,2273,2274,2275,2276,2277
0,1.0,0.02548,0.03577,0.035198,0.049681,0.074152,0.075141,0.029039,0.049631,0.03783,0.025712,0.079832,0.064655,0.10728,0.040829,0.068684,0.035231,0.061321,0.028653,0.024437,0.036713,0.109,0.022661,0.049608,0.069387,0.008573,0.067368,0.062918,0.054581,0.040287,0.063045,0.049566,0.026303,0.062014,0.052476,0.041175,0.161492,0.027713,0.049748,0.015091,...,0.049015,0.092725,0.06928,0.040558,0.050808,0.033563,0.023365,0.069447,0.012308,0.096068,0.033295,0.036704,0.057213,0.010384,0.018179,0.072484,0.070909,0.068461,0.025219,0.05279,0.024739,0.04613,0.054326,0.063011,0.074792,0.059831,0.016752,0.035526,0.017604,0.053937,0.020421,0.036504,0.047146,0.027658,0.072872,0.055305,0.021397,0.057112,0.013269,0.036188
1,0.02548,1.0,0.02851,0.020209,0.027724,0.013317,0.01739,0.013324,0.009935,0.008277,0.011554,0.012583,0.019752,0.013683,0.013598,0.024841,0.033124,0.055438,0.017873,0.018458,0.018618,0.015518,0.016893,0.013125,0.033061,0.014819,0.020947,0.044574,0.038943,0.02711,0.009663,0.039053,0.028817,0.024249,0.006895,0.009654,0.049469,0.070988,0.023365,0.061197,...,0.034196,0.037642,0.053125,0.049091,0.008888,0.001181,0.017438,0.01279,0.066503,0.010395,0.020104,0.006876,0.053559,0.079132,0.011363,0.031771,0.064963,0.071633,0.02404,0.011932,0.001669,0.039932,0.040645,0.024622,0.026904,0.022268,0.028869,0.047304,0.007072,0.047294,0.010401,0.003497,0.023152,0.014218,0.012524,0.014101,0.034709,0.009745,0.032202,0.028844
2,0.03577,0.02851,1.0,0.092315,0.023537,0.013123,0.071111,0.017081,0.03944,0.019486,0.007935,0.01808,0.026642,0.018915,0.026595,0.002299,0.006045,0.014945,0.100252,0.013156,0.008632,0.018308,0.027482,0.046201,0.063091,0.032094,0.020871,0.017135,0.038478,0.015687,0.031787,0.02189,0.03256,0.065506,0.02804,0.093644,0.090432,0.052708,0.030697,0.012645,...,0.043247,0.018108,0.015425,0.039001,0.06899,0.039681,0.034772,0.029261,0.019053,0.043725,0.041909,0.020134,0.046823,0.022781,0.079889,0.01479,0.051529,0.04061,0.030059,0.026619,0.008805,0.005439,0.090888,0.044534,0.018479,0.014067,0.032091,0.070138,0.097078,0.022596,0.034533,0.080743,0.056053,0.148902,0.045022,0.077644,0.071267,0.006633,0.019528,0.988449
3,0.035198,0.020209,0.092315,1.0,0.032869,0.021993,0.027604,0.051286,0.056858,0.024332,0.014725,0.03092,0.052392,0.024597,0.032091,0.017883,0.007023,0.030521,0.035785,0.03089,0.045677,0.043094,0.021036,0.065376,0.056737,0.086963,0.024401,0.020778,0.021784,0.024552,0.021491,0.022174,0.030822,0.120373,0.013679,0.033489,0.041837,0.117219,0.046102,0.016482,...,0.044671,0.042345,0.011953,0.026588,0.035565,0.026784,0.041921,0.051659,0.060394,0.035797,0.05678,0.02259,0.025662,0.045763,0.021783,0.057617,0.037631,0.034948,0.027806,0.062915,0.028444,0.021575,0.055617,0.046706,0.016664,0.043789,0.015795,0.080343,0.026825,0.065465,0.039388,0.049913,0.05269,0.06187,0.02729,0.0733,0.020279,0.03591,0.051053,0.092002
4,0.049681,0.027724,0.023537,0.032869,1.0,0.026728,0.026568,0.031561,0.032546,0.039795,0.008858,0.037676,0.060373,0.105601,0.070832,0.034922,0.022005,0.059575,0.017237,0.01736,0.033228,0.022107,0.010881,0.023706,0.025142,0.079702,0.01559,0.069835,0.155883,0.065278,0.036677,0.01292,0.013122,0.045463,0.011343,0.029917,0.036007,0.032623,0.028539,0.031517,...,0.029221,0.040908,0.036021,0.027583,0.037539,0.021449,0.042127,0.026889,0.030233,0.034634,0.011715,0.045492,0.042902,0.053982,0.022873,0.198882,0.063541,0.031695,0.043809,0.076588,0.020619,0.011179,0.028281,0.04596,0.065071,0.047619,0.011702,0.015015,0.010254,0.029148,0.037778,0.029583,0.036904,0.046783,0.034129,0.016228,0.038845,0.020305,0.02152,0.022609


In [None]:
# Grab the top 5 most similar strains to the custom strain added at the start.
last_cosine = len(cosine_df) - 1  # position of the custom strain (last row and column)
# Take the custom strain's similarity column, drop its self-match, and keep the 5 highest scores
cosine_results = cosine_df[last_cosine].drop(last_cosine).sort_values(ascending=False).head(5)
cos_results = cosine_results.index.tolist()
cos_results

[2, 1918, 1667, 1040, 261]
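
The lookup above boils down to "rank one row of similarity scores and skip the query itself". A minimal sketch of that step in plain Python follows; the helper name `top_k_similar` and the toy scores are made up for illustration.

```python
def top_k_similar(similarities, query_index, k=5):
    """Return the positions of the k highest scores, excluding the query's own position."""
    candidates = [
        (score, idx)
        for idx, score in enumerate(similarities)
        if idx != query_index
    ]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in candidates[:k]]


# Made-up similarity scores; the query is the last entry and matches itself with 1.0.
toy_row = [0.04, 0.03, 0.99, 0.09, 0.02, 1.0]
top_two = top_k_similar(toy_row, query_index=5, k=2)
print(top_two)
```

Excluding the query by position, rather than by assuming it sorts first, avoids returning the seed strain when another entry ties with it.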

In [None]:
# Check results
separator = '----------------------------'
print(separator)
print(separator)
print("Seed strain:")
print(df.iloc[leng])
print(separator)
print(separator)
print('Similar strains:')
print(separator)
for each in cos_results:
    print(df.iloc[each])

----------------------------
----------------------------
Seed strain:
Strain                                                 user strain
Type                                                        sativa
Rating                                                           5
Effects               uplifed  happy  relaxed  energetic  creative
Flavor                                  spicy  herbal  sage  woody
Description      a sativadominant hybrid bred in spain by medic...
combined_text    sativa uplifed  happy  relaxed  energetic  cre...
spaCy_tokens     [sativa, uplifed,  , happy,  , relaxed,  , ene...
lemmas           [sativa, uplifed,  , happy,  , relaxed,  , ene...
Name: 0, dtype: object
----------------------------
----------------------------
Similar strains:
----------------------------
Strain                                                        1024
Type                                                        sativa
Rating                                                         4.