# MediZen Data Science

## (with notes for presentation)

The presentation notebook: [MediZen Presentation Notebook](05-medizen-ds-present.ipynb)

---

## The Data Team

### DS

- Tobias Reaper
- Vera Mendes
- Alex Gerwer

### ML

- Maxime Vacher-Materno

---

## The Problem

- Although recreational cannabis has been legalized in many states, it is important to remember that, for many people, cannabis is a medication they use to treat a specific condition or set of conditions
- As with any medication, there are many factors that contribute to its effectiveness
  - Not only dosage and intake method, but the patient's weight, age, fitness, etc.
  - No data, but I bet there are many people out there who could benefit greatly from it
    - But don't want to because of a bad experience in the past
- Take Kate, for example
  - "I don't want to feel like I'm high; I want to feel like I'm just having a really great day."
  - I'd bet many people who think the same don't realize that this experience is very possible
  - But it takes some effort to get there
- The MediZen app aims to help with this

![Kate Russel](../images/kate.png)

---

## MediZen

### Notes

- Unfortunately our front end folks were not able to get everything working in time
- I'll be using screenshots to demonstrate the app's UI

### Functionality

> The goal with the app is to help users find their best high.

- Find the right strain
- Determine the right dosage
- Create a treatment plan

### MVP

- User creation and login
- Strain recommendations
- Save recommendations

### Stretch

- Dosages
- Intake methods
- Intake schedule

![Desktop Landing](../screenshots/Desktop-1-Landing.png)

---

## The DS Problem: Strain Recommendations 

To start thinking about this problem, let's take a brief look at the data

- Aggregated from user reviews on Leafly, posted to Kaggle (by kingburrito666)
- Over 2300 strains, each one with characteristics, rating, and description
  - The characteristics are the important bit here
    - type, effects, flavor
  - Because they were the basis for the app's recommendation system

In [2]:
# Load and look at the dataset
import pandas as pd
import janitor  # pyjanitor; importing it registers clean_names() etc. on DataFrames

datapath = "../../data/cannabis.csv"

df1 = pd.read_csv(datapath)

df1.head()

| Unnamed: 0 | Strain | Type | Rating | Effects | Flavor | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 100-Og | hybrid | 4.0 | Creative,Energetic,Tingly,Euphoric,Relaxed | Earthy,Sweet,Citrus | $100 OG is a 50/50 hybrid strain that packs a ... |
| 1 | 98-White-Widow | hybrid | 4.7 | Relaxed,Aroused,Creative,Happy,Energetic | Flowery,Violet,Diesel | The ‘98 Aloha White Widow is an especially pot... |
| 2 | 1024 | sativa | 4.4 | Uplifted,Happy,Relaxed,Energetic,Creative | Spicy/Herbal,Sage,Woody | 1024 is a sativa-dominant hybrid bred in Spain... |
| 3 | 13-Dawgs | hybrid | 4.2 | Tingly,Creative,Hungry,Relaxed,Uplifted | Apricot,Citrus,Grapefruit | 13 Dawgs is a hybrid of G13 and Chemdawg genet... |
| 4 | 24K-Gold | hybrid | 4.6 | Happy,Relaxed,Euphoric,Uplifted,Talkative | Citrus,Earthy,Orange | Also known as Kosher Tangie, 24k Gold is a 60%... |


![Desktop Filters](../screenshots/Desktop-4-Filters.png)

#### The DS Problem, Cont'd: Recs or Filters

> ...to Filter or to Smart-Filter?

The input we would be receiving is a list of the user's desired characteristics.

We had two primary options for how to tackle the problem. We could...

---

### Use Naive Filtering

- Break up `type`, `effects`, and `flavor` into separate columns for each individual element
  - Filter based on booleans for each column
- Use Pandas `str.contains()` or SQL `WHERE` clauses to filter
- These solutions could be implemented relatively easily...
  - They don't necessarily require ML modeling
- These filtering solutions are _naive_: they strictly enforce the preferences sent to us by the user
  - If the user chose to filter by `sativa`, they would _only see sativa strains_
  - What if the best strain for their particular case, given the other characteristics, is a `hybrid`?
  - Sure, they might still find a good match
  - However, the recommendations would fail to accurately gauge their need
    - and therefore miss out on providing the best recommendations
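
To make the limitation concrete, here is a minimal sketch of naive filtering with `str.contains()`. The toy rows and the `naive_filter` helper are made up for illustration; they are not part of the app's code.

```python
import pandas as pd

# Toy stand-in for the strain data; these rows are invented for illustration.
toy = pd.DataFrame(
    {
        "Strain": ["A", "B", "C"],
        "Type": ["sativa", "hybrid", "indica"],
        "Effects": [
            "Happy,Energetic,Focused",
            "Relaxed,Happy,Sleepy",
            "Relaxed,Sleepy,Hungry",
        ],
    }
)


def naive_filter(df, strain_type, effects):
    """Keep only rows whose type matches exactly and that list every effect."""
    mask = df["Type"].eq(strain_type)
    for effect in effects:
        mask &= df["Effects"].str.contains(effect, case=False, regex=False)
    return df[mask]


# The user asked for sativa but picked effects that only other types have
strict = naive_filter(toy, "sativa", ["Relaxed", "Sleepy"])
# The same effects with a different type do match
other = naive_filter(toy, "hybrid", ["Relaxed", "Sleepy"])

print(len(strict), other["Strain"].tolist())
```

The strict filter comes back empty even though a close match exists under a different type, which is the gap a recommendation engine is meant to close.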

---

### Build a Recommendation Engine

- A good recommendation engine will be a little more flexible
  - The recommendations result from the filter set as a whole
    - Not from individual elements
  - To continue the example, the user thinks that the strain for them is `sativa`, so they check that on the filters list
    - Sativas are typically associated with effects like `energetic`, `creative`, `focused`
  - For the effects, however, they check off: `relaxed`, `giggly`, `sleepy`, `happy`, `hungry`
    - These effects are more likely to be associated with `hybrid` or `indica`
    - A more robust recommendation engine can recommend those, even if the user indicated `sativa`

> How to build that type of engine?

---

## Recommended Model Recommendations

- NLP models are typically very good at breaking up and analyzing long strings (documents)
  - ...such as...a list of characteristics
  - Therefore, we thought it would simplify things to concatenate the three characteristics columns
  - The result is a single feature containing a single long string of each strain's characteristics
  - The input coming from the app can easily be concatenated and formatted to match our new feature
- Now we can think about methods, or models, that compare the input string to that of the dataset
- In order to use such data in an ML model, the words must be vectorized
  - or converted from words into numbers
  - The method we used is called TF-IDF
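
As a rough sketch of how the app's input could be formatted to match the concatenated feature: the `build_query` helper and its argument names are hypothetical, and the sketch assumes the front end sends one type plus lists of effects and flavors.

```python
def build_query(strain_type, effects, flavors):
    """Join the user's selections into one comma-separated string."""
    parts = [strain_type, *effects, *flavors]
    # Drop empty selections and stray whitespace before joining
    return ",".join(p.strip() for p in parts if p and p.strip())


query = build_query("sativa", ["happy", "energetic"], ["earthy", "woody"])
print(query)  # sativa,happy,energetic,earthy,woody
```

The resulting string has the same shape as the single feature built from the dataset, so it can go through the same vectorizer.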

### Feature Engineering

In [4]:
# Use pyjanitor to wrangle the data and engineer that single feature
df2 = (df1
        .clean_names()  # Lowercases column names, e.g. Strain -> strain
        .concatenate_columns(
            # Create a single feature for NLP analysis
            column_names=["type", "effects", "flavor"],
            new_column_name="type_effects_flavor",
            sep=",",
        )
        .remove_columns(column_names=[
            "rating",
            "description",
            "type",
            "effects",
            "flavor",
        ]))

What we're left with is a single feature containing the `type`, `effects`, and `flavor` of each strain.

In [6]:
# Configure pandas to display entire text of column
pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_columns', 200)  # Display up to 200 columns

df2.head()

| Unnamed: 0 | strain | type_effects_flavor |
| --- | --- | --- |
| 0 | 100-Og | hybrid,Creative,Energetic,Tingly,Euphoric,Relaxed,Earthy,Sweet,Citrus |
| 1 | 98-White-Widow | hybrid,Relaxed,Aroused,Creative,Happy,Energetic,Flowery,Violet,Diesel |
| 2 | 1024 | sativa,Uplifted,Happy,Relaxed,Energetic,Creative,Spicy/Herbal,Sage,Woody |
| 3 | 13-Dawgs | hybrid,Tingly,Creative,Hungry,Relaxed,Uplifted,Apricot,Citrus,Grapefruit |
| 4 | 24K-Gold | hybrid,Happy,Relaxed,Euphoric,Uplifted,Talkative,Citrus,Earthy,Orange |


---

### Vectorizing with TF-IDF

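- TF-IDF (term frequency-inverse document frequency) is a method of finding the distinguishing terms of documents (strings)
  - The more common a word is across the documents, the lower its weight
  - The result is that the terms unique to each document rise to the top
- This way, we can compare the weighted terms of an input string to those of each strain
  - In our case, we want to find the strains whose vectors are most similar to the input's
  - In the TF-IDF matrix, 0 means the term does not appear in that strain; in the neighbor distances computed later, 0 means identical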
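
A small illustration of the weighting, using three made-up strings in the same shape as the concatenated feature. The names `toy_tfidf`, `toy_matrix`, and `idf` are only for this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents shaped like the type_effects_flavor feature
docs = [
    "hybrid,Happy,Relaxed,Earthy",
    "sativa,Happy,Energetic,Citrus",
    "indica,Happy,Sleepy,Earthy",
]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(docs)

# Map each term to its inverse-document-frequency weight
idf = {term: toy_tfidf.idf_[i] for term, i in toy_tfidf.vocabulary_.items()}

# "happy" appears in every document, "citrus" in only one,
# so "happy" gets the lower weight
print(idf["happy"] < idf["citrus"])

# A term that is absent from a document gets a weight of 0 in that row
absent_weight = toy_matrix[0, toy_tfidf.vocabulary_["citrus"]]
print(absent_weight)
```

Note that the vectorizer lowercases terms by default, which is why the lookup keys are lowercase.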

In [9]:
# Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate the vectorizer object
tfidf = TfidfVectorizer(stop_words="english")

# Fit the vectorizer on the new feature to learn the vocabulary and IDF weights
# (fit() returns the vectorizer itself, so dtm is the fitted vectorizer, not a matrix)
dtm = tfidf.fit(df2["type_effects_flavor"])

# This fitted vectorizer is what we want to pickle and use in the app
import pickle
with open("vector_vocab.pkl", "wb") as p:
    pickle.dump(dtm, p)

# Create vectorized version of the concatenated feature
sparse = tfidf.transform(df2["type_effects_flavor"])

# The result is a sparse matrix, which can be converted to a dataframe
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
vdtm = pd.DataFrame(sparse.toarray(), columns=tfidf.get_feature_names_out())

vdtm

---

#### Model spec recap

- The most important thing that our model must be able to do is...
  - Find strains in our dataset with similar characteristics to the input
  - Ideally, rank these similar rows from most to least similar

#### Finding the Nearest Neighbors

- Nearest Neighbors is a great method of finding a list of similar "neighbors" to a given input
  - We used the ball tree algorithm to index the data for that search
  - Ball trees partition data into a series of nested hyper-spheres
  - A ball tree recursively divides the data into nodes defined by a centroid and radius
    - Such that each point in the node lies within the hyper-sphere defined by the centroid and radius
- Unsupervised model (validation)
  - Our model is not "predicting" as much as breaking down and analyzing the input
  - Using that analysis to find out where that input would fit into the dataset
  - Then finding the k nearest rows (k as in k-nearest neighbors)
In [8]:
# Recommendation Model
from sklearn.neighbors import NearestNeighbors

In [16]:
# Instantiate the knn model
nn = NearestNeighbors(n_neighbors=10, algorithm='ball_tree')

# Fit (train) the model on the TF-IDF vector dataframe created above
nn.fit(vdtm)

NearestNeighbors(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                 radius=1.0)

In [15]:
# This trained model is what we want to pickle and use in the app
import pickle
with open("nn_rec_model.pkl", "wb") as p:
    pickle.dump(nn, p)

In [17]:
input1 = "sativa,happy,energetic,focused,euphoric,earthy,woody,flowery"
num_recs = 10

# Create vector using the vocab that was fit above
input_vector = tfidf.transform([input1])

# Use NN model to calculate the top n similar strains
# (toarray() gives a plain ndarray; newer scikit-learn rejects the np.matrix from todense())
top_id = nn.kneighbors(input_vector.toarray(), n_neighbors=num_recs)[1][0]

#### Returns

- Pass the vectorized input into the trained knn model, specifying the number of neighbors to return
- This returns a tuple of two arrays: one holds each neighbor's distance from the input (its "near-ness")
- The other (the one we want) contains the indexes of the neighbors
  - The API returns a list of only the indexes
  - For the purposes of this demo, I'll hydrate that list with the rest of the data from the original (pre-wrangled) dataframe
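
For reference, a minimal sketch of unpacking both arrays and keeping the distances next to the hydrated rows. The three toy strains and the names `vec`, `model`, and `hydrated` are assumptions for this example only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Invented rows shaped like the wrangled dataframe
toy = pd.DataFrame(
    {
        "Strain": ["A", "B", "C"],
        "type_effects_flavor": [
            "sativa,Happy,Energetic,Earthy",
            "indica,Relaxed,Sleepy,Sweet",
            "hybrid,Happy,Relaxed,Citrus",
        ],
    }
)

vec = TfidfVectorizer()
X = vec.fit_transform(toy["type_effects_flavor"]).toarray()
model = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(X)

# kneighbors returns (distances, indices), one row per query
query = vec.transform(["sativa,happy,energetic,earthy"]).toarray()
distances, indices = model.kneighbors(query)

# Hydrate the indexes with the original rows and attach the distances
hydrated = toy.iloc[indices[0]].assign(distance=distances[0])
print(hydrated)
```

Because the query uses exactly the terms of strain A, that row comes back first with a distance of (approximately) zero.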

In [18]:
pd.set_option('display.max_colwidth', 60)

# Index-locate the neighbors in original dataframe
top_df = df1.iloc[top_id]

top_df

| Unnamed: 0 | Strain | Type | Rating | Effects | Flavor | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 2335 | Y-Griega | sativa | 4.8 | Happy,Energetic,Uplifted,Focused,Euphoric | Earthy,Woody,Flowery | Also known as simply “Y,” the 80% sativa Y Griega is an ... |
| 2129 | Thai-Tanic | sativa | 4.0 | Energetic,Uplifted,Happy,Focused,Euphoric | Sweet,Earthy,Woody | Thai-Tanic is a very compact sativa variety with that cl... |
| 8 | 3D-Cbd | sativa | 4.6 | Uplifted,Focused,Happy,Talkative,Relaxed | Earthy,Woody,Flowery | 3D CBD from Snoop Dogg’s branded line of cannabis strain... |
| 987 | Harlequin | sativa | 4.3 | Relaxed,Focused,Happy,Uplifted,Energetic | Earthy,Sweet,Woody | Harlequin is a 75/25 sativa-dominant strain renowned for... |
| 2125 | Thai | sativa | 4.2 | Happy,Relaxed,Focused,Uplifted,Energetic | Earthy,Flowery,Sweet | Thai refers to a cannabis variety that grows natively in... |
| 475 | Charlottes-Web | sativa | 4.5 | Relaxed,Uplifted,Focused,Happy,Energetic | Earthy,Flowery,Sweet | Charlotte’s Web is a cultivar with less than 0.3% THC th... |
| 1100 | Jack-Herer | sativa | 4.4 | Happy,Uplifted,Energetic,Focused,Euphoric | Earthy,Pine,Woody | Jack Herer is a sativa-dominant cannabis strain that has... |
| 948 | Green-Haze | sativa | 3.8 | Happy,Talkative,Creative,Focused,Hungry | Woody,Flowery,Earthy | Green Haze by A.C.E. Seeds is another version of their s... |
| 1164 | Kali-Mist | sativa | 4.1 | Energetic,Focused,Uplifted,Euphoric,Creative | Woody,Earthy,Citrus | Kali Mist is known to deliver clear-headed, energetic ef... |
| 2047 | Super-Green-Crack | sativa | 4.5 | Happy,Giggly,Energetic,Focused,Euphoric | Earthy,Flowery,Pungent | Super Green Crack is a true sativa. Like a cup of strong... |


![Mobile Filters](../screenshots/Desktop-3-Discover.png)

---

## The FuncZone

- In order to more easily integrate this recommendation process into the Flask API,
  - we grouped it into a function that will take in a request and return the recommendations
- To use the pre-trained vocab (vectorizer) and NN model in the app...pickles!
  - The pickle module, and the pickle file format, allow Python objects to be serialized and de-serialized
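
A minimal sketch of the pickle round trip, using in-memory bytes instead of the `.pkl` files so it runs on its own. The tiny vectorizer here is a stand-in, not the one trained on the strain data.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for the fitted vectorizer from the notebook
fitted = TfidfVectorizer().fit(["sativa,Happy,Earthy", "indica,Relaxed,Sweet"])

# Serialize to bytes; pickle.dump() writes the same bytes to a file
blob = pickle.dumps(fitted)

# Later (for example at app start-up), restore the object from those bytes
restored = pickle.loads(blob)

# The restored vectorizer produces the same vectors as the original
before = fitted.transform(["sativa,earthy"]).toarray()
after = restored.transform(["sativa,earthy"]).toarray()
print(np.array_equal(before, after))
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code. It is also safest to load with the same scikit-learn version that created the file.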

In [16]:
def recommend(req, n=10):
    """Function to recommend top n strains given a request."""
    # Create vector from request
    req_vec = tfidf.transform([req])

    # Access the top n indexes (toarray() avoids passing an np.matrix to scikit-learn)
    top_id = nn.kneighbors(req_vec.toarray(), n_neighbors=n)[1][0]

    # Index-locate the neighbors in original dataframe
    top_df = df1.iloc[top_id]

    return top_df

### JSON Version

In [18]:
# The API should return a JSON array with only the ids
# Here's a slightly modified version to accomplish that
def recommend_json(req, n=10):
    """Return the indexes of the top n strains for a request, as a JSON array."""
    # Create vector from request
    req_vec = tfidf.transform([req])

    # Access the top n indexes (toarray() avoids passing an np.matrix to scikit-learn)
    rec_id = nn.kneighbors(req_vec.toarray(), n_neighbors=n)[1][0]

    # Convert np.ndarray to pd.Series then to JSON
    rec_json = pd.Series(rec_id).to_json(orient="records")

    return rec_json
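
On the receiving side, the JSON string can be parsed back into a list of indexes and used to look up the full rows. This is a sketch with a made-up payload and a three-row stand-in for the strain table; how the real front end or back end consumes the ids is not covered in these notes.

```python
import json

import pandas as pd

# Hypothetical payload, built the same way recommend_json builds its output
rec_json = pd.Series([2, 0]).to_json(orient="records")
print(rec_json)  # [2,0]

# Stand-in for the strain table held by the consumer
strains = pd.DataFrame({"Strain": ["A", "B", "C"]})

# Parse the ids and look up the matching rows, preserving the ranking order
ids = json.loads(rec_json)
hydrated = strains.iloc[ids]
print(hydrated["Strain"].tolist())  # ['C', 'A']
```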