## Recommender System

The goal of this document is to provide a way to recommend treatments to users.  For each condition, we can see what treatments have worked for other patients.  We can also go one step further and say, if Treatment/Tag A has worked for you, then other people who have had success with Treatment/Tag A have also had success with Treatment/Tag B.

The same will also be possible in reverse.  Some Treatments/Tags may cause Conditions/Symptoms to worsen, and we may be able to recommend against those Treatments/Tags.

In order to say that a treatment is working, we need a measure of that.  There are a few strategies for doing that, so it's handled in a separate notebook "treatment_effectiveness".

### Filter Type

We will use a collaborative filter to make our recommendations, but there are two different types that we need to consider.  Item based filtering will form groups of associated items(in our case, an item is a treatment/tag), and recommend people who have good results with items in that set to other items in that set.  User based filtering will try to form groups of users that have success with similar items, and make recommendations based on what items work well for that group.  We will of course figure out which one is best for our situation by trying both.


In [2]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("effectiveness.csv")
print df.head()

   user_id         condition     treatment  before_value  after_value  \
0       20           Fatigue      Provigil      1.750000     1.444444   
1       20  major somnolence      Provigil      2.000000     1.333333   
2       20        sleepiness      Provigil      1.750000     1.777778   
3       52           Allergy  Escitalopram      1.083333     0.500000   
4       52           Allergy     Magnesium      0.800000     0.333333   

   effectiveness  
0       0.305556  
1       0.666667  
2      -0.027778  
3       0.583333  
4       0.466667  


### Item Based Collaborative Filtering

We will start by trying to predict a single treatment for a single condition, and see how that goes.  Since Depression is common, we will start with that.

In [32]:
print df[df['condition'] == "Depression"]['treatment'].value_counts().head(15)
#print df[(df['treatment'] == "good sleep") & (df['condition'] == "Depression")]

stressed         44
tired            41
ate breakfast    35
good sleep       30
period           21
had sex          20
happy            20
alcohol          18
walked           16
exercise         14
Anxious          14
Ibuprofen        13
Superlong nap    13
travel           12
poor sleep       12
Name: treatment, dtype: int64


OK, so we are hurting for samples of specific treatments.  Let's go with "good sleep" as it has the most samples out of the tags which sound like they might help with depression.
We will hold a few users back who have reported good sleep while suffering from depression.  Our first goal will be to use the rest of the users to create a model which can accurately predict the effectiveness for the test users.

First up we will make a recommendation on "good sleep" by finding which other treatment/tag is most correlated to it.  Correlation makes a great distance measure because it gives a p-value which can be used to assess how significant the distance measures are. Pearson correlation is a measure of how much a variable changes relative to another variable, divided by how much they change independently.  This will help us accomodate the fact that not all users will rate their symptoms the same.  Presumably, some users will consistently rate their symptoms as being worse than others.

In [16]:
from scipy.stats.stats import pearsonr

good_sleep_users = list(set(df[(df['treatment'] == "good sleep") & (df['condition'] == "Depression")]['user_id']))
test_users = good_sleep_users[:5]
train_users = good_sleep_users[5:]
train_rows = df[df['user_id'].isin(train_users)]
test_rows = df[df['user_id'].isin(test_users)]

#just going to abstract this now in case we need it later
#finds the pearson correlation between the specified treatment, and all of the treatments used to in conjunction with the specified condition
def correlate_treatments(train_df, treatment, condition):
    affected_rows = train_df[train_df['condition'] == condition]
    other_treatments = list(set(affected_rows[affected_rows['treatment'] != treatment]['treatment']))
    treatment_correlations = {}
    for treatment2 in other_treatments:
        users_with_treatment = list(set(affected_rows[affected_rows['treatment'] == treatment2]['user_id']))
        treatment1_values = affected_rows[(affected_rows['user_id'].isin(users_with_treatment)) & (affected_rows['treatment'] == treatment)]['effectiveness']
        treatment2_values = affected_rows[(affected_rows['user_id'].isin(users_with_treatment)) & (affected_rows['treatment'] == treatment2)]['effectiveness']
        if len(treatment1_values) > 1 :
            correlation = pearsonr(treatment1_values, treatment2_values)[0]
            treatment_correlations[treatment2] = correlation
    return treatment_correlations
        
treatment_correlations = correlate_treatments(train_rows, 'good sleep', 'Depression')
print treatment_correlations

{'humid': 0.93698444082858501, 'neck ache': 1.0, 'Stayed at home': 1.0, 'Upset stomach': 0.54524358134172757, 'ate breakfast': 0.80653698125459583, 'middleschmertz': -0.99999999999999978, 'busy': 0.71530534640575005, 'family': 1.0, 'ovarian cramps': -1.0, "can't sleep": 1.0, 'paranoia': 1.0, 'ovulating': -1.0, 'dairy': -0.97493712242917863, 'shoulder pain': 1.0, 'crying': 1.0, ' toothache': 0.99999999999999989, 'nausea': 1.0, 'good day': 1.0, 'right knee weakness': 1.0, 'Adam over': -1.0, 'chest pain': 1.0, 'sore legs': 1.0, 'distraction': 1.0, 'Benadryl': 0.99999999999999989, 'congested': -1.0, 'bad sleep': 0.6645785337312623, 'Day off': -1.0, 'Marijuana': -1.0, 'anxious': 1.0, 'doctor appointment': 0.99999999999999989, 'had therapy': 1.0, 'overslept': 1.0, 'Went to work': 0.46368590165453483, 'neck pain': 0.99999999999999989, 'unproductive': 1.0, 'tired': 0.57515640453310901, 'household chores': 1.0, 'Period': 1.0, 'fast food': 1.0, 'napped': 0.65546509150066878, 'worried': 0.5999300

We can see that the number of correlations we've learned between "good sleep" and other treatments for depression is very low.  
The reason for this is that if we look at users that have tried "good sleep" and any other tag, they are almost always the only user that has tried that combination.  Which leaves us comparing single values, which Pearson can't help us with.  We can also see a lot of correlations that are basically 1 or -1, which are usually occurring when we have just two users with the same tags.

This may still be the best way to build the recommender system, but the volume of data would need to increase, probably by orders of magnitude.

We can still try a different distance measure, so let's try the cosine similarity.  This way we can still get the distance between two points.  Whether those distances will be useful remains to be seen.

In [15]:
from scipy.spatial import distance

def cosine_distances(train_df, treatment, condition):
    affected_rows = train_df[train_df['condition'] == condition]
    other_treatments = list(set(affected_rows[affected_rows['treatment'] != treatment]['treatment']))
    treatment_correlations = {}
    for treatment2 in other_treatments:
        users_with_treatment = list(set(affected_rows[affected_rows['treatment'] == treatment2]['user_id']))
        treatment1_values = affected_rows[(affected_rows['user_id'].isin(users_with_treatment)) & (affected_rows['treatment'] == treatment)]['effectiveness']
        treatment2_values = affected_rows[(affected_rows['user_id'].isin(users_with_treatment)) & (affected_rows['treatment'] == treatment2)]['effectiveness']
        if not np.isnan(treatment2_values).any():
            cos_distance = distance.cosine(treatment1_values, treatment2_values)
            treatment_correlations[treatment2] = cos_distance
    return treatment_correlations

treatment_correlations = cosine_distances(train_rows, 'good sleep', 'Depression')

This one is a bit long to print out.  There are a lot of tags, but a lot of them have a distance of 0.   Still, if one of the 0 length tags is found in our test set these entries might be useful.  It's the equivalent of saying "this treatment worked this well for this one other person".  So maybe better than nothing.

I will now perform a test and validate.

In [50]:
from sklearn.metrics import r2_score

def predict_effectiveness(x):
    highestCorrelationValue = 0
    highestCorrelationKey = ""
    highestCorrelationEffectiveness = 0
    for value in x:
        if value in treatment_correlations.keys():
            if treatment_correlations[value] > highestCorrelationValue:
                highestCorrelationValue = treatment_correlations[value]
                highestCorrelationKey = value
                highestCorrelationEffectiveness = treatment_correlations
    print "value most correlated to good sleep was " + highestCorrelationKey + " with value " + str(highestCorrelationValue)
    return str(highestCorrelationValue) + "," + highestCorrelationKey

test_depression_rows = test_rows[test_rows['condition'] == 'Depression']
test_depression_rows['closest_correlation'] = test_depression_rows.groupby('user_id')['treatment'].transform(predict_effectiveness)
real_values = []
predicted_values = []
for user in test_users:
    corr_value,corr_key = test_depression_rows[(test_depression_rows['user_id'] == user) & (test_depression_rows['treatment'] == 'good sleep')]['closest_correlation'].values[0].split(',')
    predicted_value = test_depression_rows[(test_depression_rows['user_id'] == user) & (test_depression_rows['treatment'] == corr_key)]['effectiveness'].values[0]
    real_value = test_depression_rows[(test_depression_rows['user_id'] == user)]['effectiveness'].values[0]
    print "predicted value was is " + str(predicted_value) + " real value is " + str(real_value)
    real_values.append(real_value)
    predicted_values.append(predicted_value)
print "r2 accuracy score " + str(r2_score(real_values, predicted_values))


value most correlated to good sleep was hot and humid with value 1.0
value most correlated to good sleep was feet tingly with value 1.0
value most correlated to good sleep was family with value 1.0
value most correlated to good sleep was cleaning with value 1.0
value most correlated to good sleep was ate breakfast with value 0.806536981255
predicted value was is 0.3 real value is 0.222222222222
predicted value was is 0.0 real value is -1.0
predicted value was is 0.166666666667 real value is 0.2
predicted value was is -0.8 real value is -0.8
predicted value was is -1.66666666667 real value is -1.25
r2 accuracy score 0.382066572767


### User Based Collaborative Filtering

Initial results on the item based filter are not exactly ready from production, but also not entirely discouraging.  It's at least showing some predictive power, and on very little data.

Lets try the other approach.