## Recommender System

The goal of this document is to provide a way to recommend treatments to users.  For each condition, we can see what treatments have worked for other patients.  We can also go one step further and say, if Treatment/Tag A has worked for you, then other people who have had success with Treatment/Tag A have also had success with Treatment/Tag B.

The same will also be possible in reverse.  Some Treatments/Tags may cause Conditions/Symptoms to worsen, and we may be able to recommend against those Treatments/Tags.

In order to say that a treatment is working, we need a measure of that.  There are a few strategies for doing that, so it's handled in a separate notebook "treatment_effectiveness".

### Filter Type

We will use a collaborative filter to make our recommendations, but there are two different types that we need to consider.  Item based filtering forms groups of associated items (in our case, an item is a treatment/tag), and recommends other items in a group to people who have had good results with items in that group.  User based filtering tries to form groups of users that have success with similar items, and makes recommendations based on which items work well for that group.  We will figure out which one is best for our situation by trying both.
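As a rough illustration of the difference, the sketch below builds a tiny user-by-treatment matrix and computes similarities along each axis. The data and the names `ratings`, `matrix`, `item_similarity`, and `user_similarity` are made up for this example and are not part of the notebook's dataset. Pearson correlation is used here only because it is the default of pandas' `corr`.

```python
import pandas as pd

# Hypothetical toy data: one effectiveness score per (user, treatment) pair.
ratings = pd.DataFrame({
    "user_id":       [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "treatment":     ["A", "B", "C"] * 4,
    "effectiveness": [0.5, 0.4, -0.2,
                      0.1, 0.2, 0.3,
                      -0.3, -0.1, 0.6,
                      0.8, 0.5, -0.4],
})

# Rows are users, columns are treatments.
matrix = ratings.pivot_table(index="user_id", columns="treatment",
                             values="effectiveness")

# Item based view: how similarly do two treatments behave across users?
item_similarity = matrix.corr()

# User based view: how similarly do two users respond across treatments?
user_similarity = matrix.T.corr()

print(item_similarity.shape, user_similarity.shape)
```

The same matrix feeds both views; the only difference is whether columns (treatments) or rows (users) are compared.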

VERSION INFO: The user-based recommender below uses profile information which was added in the 083016 version of the data file; use that version or later.

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("effectiveness_083016.csv")
print(df.head())

   user_id  age     sex country         condition     treatment  before_value  \
0       20   49    male      US           Fatigue      Provigil      1.750000   
1       20   49    male      US  major somnolence      Provigil      2.000000   
2       20   49    male      US        sleepiness      Provigil      1.750000   
3       52   44  female      US           Allergy  Escitalopram      1.083333   
4       52   44  female      US           Allergy     Magnesium      0.800000   

   after_value  effectiveness  
0     1.444444       0.305556  
1     1.333333       0.666667  
2     1.777778      -0.027778  
3     0.500000       0.583333  
4     0.333333       0.466667  


### Item Based Collaborative Filtering

We will start by trying to predict a single treatment for a single condition, and see how that goes.  Since Depression is common, we will start with that.

In [2]:
print(df[df['condition'] == "Depression"]['treatment'].value_counts().head(15))
#print df[(df['treatment'] == "good sleep") & (df['condition'] == "Depression")]

stressed         45
tired            42
ate breakfast    35
good sleep       31
period           22
happy            20
had sex          19
alcohol          18
walked           17
Anxious          14
exercise         14
travel           13
Headache         13
Superlong nap    13
bad sleep        13
Name: treatment, dtype: int64


OK, so we are hurting for samples of specific treatments.  Let's go with "good sleep" as it has the most samples out of the tags which sound like they might help with depression.
We will hold a few users back who have reported good sleep while suffering from depression.  Our first goal will be to use the rest of the users to create a model which can accurately predict the effectiveness for the test users.

First up we will make a recommendation on "good sleep" by finding which other treatment/tag is most correlated to it.  Correlation makes a useful similarity measure because it comes with a p-value which can be used to assess how significant each correlation is.  Pearson correlation is the covariance of two variables (how much they change together) divided by the product of their standard deviations (how much each changes on its own).  This will help us accommodate the fact that not all users will rate their symptoms the same.  Presumably, some users will consistently rate their symptoms as being worse than others.
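To make that definition concrete, here is a minimal sketch that computes the coefficient by hand and compares it with `scipy.stats.pearsonr`. The arrays `x` and `y` are invented values, not data from the notebook. The last step shows that shifting and rescaling a whole variable leaves the coefficient unchanged. That invariance holds per variable, so it is only a partial answer to users who rate on different scales.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical effectiveness scores for two treatments across five users.
x = np.array([0.2, 0.5, -0.1, 0.7, 0.3])
y = np.array([0.1, 0.6, -0.3, 0.4, 0.5])

# Covariance divided by the product of the standard deviations.
covariance = np.mean((x - x.mean()) * (y - y.mean()))
manual_r = covariance / (x.std() * y.std())

# scipy returns the same coefficient together with a p-value.
r, p_value = pearsonr(x, y)

# A positive rescaling plus a shift of one variable does not change r.
r_shifted, _ = pearsonr(x, 2.0 * y + 1.0)

print(manual_r, r, r_shifted, p_value)
```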

In [3]:
from scipy.stats import pearsonr

#set gives us the distinct users; its iteration order is arbitrary, but it is not a true random shuffle
good_sleep_users = list(set(df[(df['treatment'] == "good sleep") & (df['condition'] == "Depression")]['user_id']))
test_users = good_sleep_users[:6]  #hold out about 20% of users for testing
train_users = good_sleep_users[6:]
train_rows = df[df['user_id'].isin(train_users)]
test_rows = df[df['user_id'].isin(test_users)]

#abstracting this into a function now in case we need it later
#finds the pearson correlation between the specified treatment and every other treatment used in conjunction with the specified condition
def correlate_treatments(train_df, treatment, condition):
    affected_rows = train_df[train_df['condition'] == condition]
    other_treatments = list(set(affected_rows[affected_rows['treatment'] != treatment]['treatment']))
    treatment_correlations = {}
    for treatment2 in other_treatments:
        users_with_treatment = list(set(affected_rows[affected_rows['treatment'] == treatment2]['user_id']))
        treatment1_values = affected_rows[(affected_rows['user_id'].isin(users_with_treatment)) & (affected_rows['treatment'] == treatment)]['effectiveness']
        treatment2_values = affected_rows[(affected_rows['user_id'].isin(users_with_treatment)) & (affected_rows['treatment'] == treatment2)]['effectiveness']
        if len(treatment1_values) > 1 :
            correlation = pearsonr(treatment1_values, treatment2_values)[0]
            treatment_correlations[treatment2] = correlation
    return treatment_correlations
        
treatment_correlations = correlate_treatments(train_rows, 'good sleep', 'Depression')
print(treatment_correlations)

{'humid': 0.93698444082858501, 'neck ache': 1.0, 'Stayed at home': 1.0, 'Upset stomach': 0.54524358134172757, 'cold': -1.0, 'ate breakfast': 0.4853091842706253, 'middleschmertz': -0.99999999999999978, 'busy': 0.72247406517454238, 'family': 1.0, 'ovarian cramps': -1.0, "can't sleep": 1.0, 'paranoia': 1.0, 'sugar': 1.0, 'ovulating': -1.0, 'dairy': 0.12883874324150038, 'shoulder pain': 1.0, ' toothache': 0.99999999999999989, 'distraction': 1.0, 'good day': -1.0, 'right knee weakness': 1.0, 'Adam over': -1.0, 'chest pain': 1.0, 'sore legs': 1.0, 'nausea': 1.0, 'Benadryl': 0.99999999999999989, 'congested': -1.0, 'bad sleep': 0.67050082708071557, 'Day off': -0.99999999999999978, 'Marijuana': -1.0, 'anxious': 0.99999999999999989, 'doctor appointment': 1.0, 'had therapy': 1.0, 'overslept': 1.0, 'Went to work': 0.4767596381184977, 'neck pain': 0.99999999999999989, 'unproductive': 1.0, 'tired': 0.52507391318031171, 'household chores': 1.0, 'Period': 1.0, 'fast food': 1.0, 'napped': 0.66666669027

We can see that the number of correlations we've learned between "good sleep" and other treatments for depression is very low.  
The reason for this is that when we look at users who have tried "good sleep" and any other tag, there is almost always only one user who has tried that combination.  That leaves us comparing single values, which Pearson can't help us with.  We can also see a lot of correlations that are essentially 1 or -1, which usually occur when only two users share the same pair of tags.
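The ±1 values follow directly from the arithmetic: with only two users, the two points always lie on a straight line, so Pearson can only report a perfect positive or perfect negative correlation. A small sketch with made-up numbers:

```python
from scipy.stats import pearsonr

# Hypothetical scores from just two users for a pair of treatments.
# Both treatments improve from user 1 to user 2: r is +1.
r_up, _ = pearsonr([0.10, 0.50], [0.30, 0.90])

# One treatment improves while the other gets worse: r is -1.
r_down, _ = pearsonr([0.10, 0.50], [0.30, 0.20])

print(r_up, r_down)
```

The magnitude of the change plays no role here, which is why these entries carry little information about how related the tags really are.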

This may still be the best way to build the recommender system, but the volume of data would need to increase, probably by orders of magnitude.

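We can still try a different distance measure, so let's try the cosine distance (one minus the cosine similarity).  Unlike Pearson, it can still be computed when a pair of treatments is shared by only a single user, so we can still get a distance between two points.  Whether those distances will be useful remains to be seen.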

In [4]:
from scipy.spatial import distance

def cosine_distances(train_df, treatment, condition):
    affected_rows = train_df[train_df['condition'] == condition]
    other_treatments = list(set(affected_rows[affected_rows['treatment'] != treatment]['treatment']))
    treatment_correlations = {}
    for treatment2 in other_treatments:
        users_with_treatment = list(set(affected_rows[affected_rows['treatment'] == treatment2]['user_id']))
        treatment1_values = affected_rows[(affected_rows['user_id'].isin(users_with_treatment)) & (affected_rows['treatment'] == treatment)]['effectiveness']
        treatment2_values = affected_rows[(affected_rows['user_id'].isin(users_with_treatment)) & (affected_rows['treatment'] == treatment2)]['effectiveness']
        if not np.isnan(treatment2_values).any():
            cos_distance = distance.cosine(treatment1_values, treatment2_values)
            treatment_correlations[treatment2] = cos_distance
    return treatment_correlations

treatment_correlations = cosine_distances(train_rows, 'good sleep', 'Depression')

This one is a bit long to print out.  There are a lot of tags, and many of them have a distance of 0.  Still, if one of the zero-distance tags is found in our test set, these entries might be useful.  It's the equivalent of saying "this treatment worked this well for this one other person".  So maybe better than nothing.
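The many zero distances are what one would expect when a pair of tags is shared by a single user. For one-element vectors the cosine distance depends only on the signs of the two values: it is 0 when they agree and 2 when they disagree. The sketch below uses invented numbers to show this. A distance of exactly 2.0 elsewhere in the results would be consistent with such a single opposite-sign pair.

```python
from scipy.spatial import distance

# One shared user, both effectiveness values positive: distance 0.
same_sign = distance.cosine([0.5], [0.2])

# One shared user, values with opposite signs: distance 2.
opposite_sign = distance.cosine([0.5], [-0.2])

# With two or more shared users the distance can take intermediate values.
two_users = distance.cosine([0.5, 0.1], [0.2, 0.4])

print(same_sign, opposite_sign, two_users)
```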

I will now perform a test and validate.

In [5]:
from sklearn.metrics import r2_score

#takes all of the treatments that a user has tried, and picks the one with the highest score relative to "good sleep"
#NOTE: treatment_correlations was last assigned by cosine_distances in In [4], so the scores here are cosine distances,
#where a larger value means less similar; picking the closest tag would mean taking the minimum instead
def predict_effectiveness(x):
    highestCorrelationValue = 0
    highestCorrelationKey = ""
    for value in x:
        if value in treatment_correlations:
            if treatment_correlations[value] > highestCorrelationValue:
                highestCorrelationValue = treatment_correlations[value]
                highestCorrelationKey = value
    print("value most correlated to good sleep was " + highestCorrelationKey + " with value " + str(highestCorrelationValue))
    return str(highestCorrelationValue) + "," + highestCorrelationKey

#copy so that adding a column does not write into a slice of test_rows
test_depression_rows = test_rows[test_rows['condition'] == 'Depression'].copy()
test_depression_rows['closest_correlation'] = test_depression_rows.groupby('user_id')['treatment'].transform(predict_effectiveness)
real_values = []
predicted_values = []
for user in test_users:
    corr_value,corr_key = test_depression_rows[(test_depression_rows['user_id'] == user) & (test_depression_rows['treatment'] == 'good sleep')]['closest_correlation'].values[0].split(',')
    predicted_value = test_depression_rows[(test_depression_rows['user_id'] == user) & (test_depression_rows['treatment'] == corr_key)]['effectiveness'].values[0]
    #NOTE: this takes the user's first Depression row, whichever treatment it belongs to;
    #a treatment == 'good sleep' filter is needed to compare against the good sleep effectiveness itself
    real_value = test_depression_rows[(test_depression_rows['user_id'] == user)]['effectiveness'].values[0]
    print("predicted value is " + str(predicted_value) + " real value is " + str(real_value))
    real_values.append(real_value)
    predicted_values.append(predicted_value)
print("r2 accuracy score " + str(r2_score(real_values, predicted_values)))


value most correlated to good sleep was happy with value 0.563930772324
value most correlated to good sleep was tired with value 0.625685887011
value most correlated to good sleep was couch potato with value 2.0
value most correlated to good sleep was happy with value 0.563930772324
value most correlated to good sleep was cleaning with value 0.366664212496
value most correlated to good sleep was walked with value 0.385388498727
predicted value is 0.75 real value is 0.222222222222
predicted value is 0.0 real value is -1.0
predicted value is -0.0454545454545 real value is 0.2
predicted value is -0.8 real value is -0.8
predicted value is -0.666666666667 real value is -1.41666666667
predicted value is 1.16666666667 real value is 1.16666666667
r2 accuracy score 0.591630696131


### User Based Collaborative Filtering with KMeans

Initial results on the item based filter are not exactly ready for production, but still encouraging.  It's showing some predictive power, and on very little data.  Let's not get too excited though, given the very small test set size.

Let's try the other approach.  This way of building a collaborative filter uses the strategy of finding a user's neighbours.  This can be done either with K-Means or with K-Nearest Neighbours.  Our clustering analysis shows that both of these options are good candidates, so we are going to have to try both.
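As a minimal sketch of the K-Means flavour of this idea: cluster the users, compute the mean target effectiveness within each cluster, and predict that mean for any new user who lands in the cluster. The feature values, the targets, and the names `features`, `target`, and `cluster_means` are hypothetical and chosen so that the two groups are obvious.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical user features forming two well separated groups.
features = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                     [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
# Hypothetical effectiveness of the target treatment for each user.
target = np.array([0.5, 0.7, 0.6, -0.4, -0.2, -0.3])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Mean target effectiveness of the training users in each cluster.
cluster_means = {c: target[kmeans.labels_ == c].mean() for c in range(2)}

# A new user is assigned to a cluster and receives that cluster's mean.
new_user = np.array([[0.1, 0.1]])
prediction = cluster_means[kmeans.predict(new_user)[0]]
print(prediction)
```

Every user in a cluster gets the same prediction, so the quality of the result depends entirely on whether the clusters group users who respond alike to the target treatment.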

In [59]:
def pivot_to_per_user(df):
    user_df = df.pivot_table(index='user_id',columns='treatment',values='effectiveness').reset_index()
    return user_df.fillna(0)

all_good_sleep = pivot_to_per_user(df[df['user_id'].isin(good_sleep_users)])

#now add in one-hotted profile info
#since we are trying to find users most similar to each other age, sex, and location might prove helpful
profile = df[['user_id','sex','age','country']].drop_duplicates()
all_good_sleep = pd.merge(all_good_sleep, profile, on='user_id', how='left')
sex = pd.get_dummies(all_good_sleep['sex_y'])
country = pd.get_dummies(all_good_sleep['country'])
all_good_sleep = pd.concat([all_good_sleep, sex], axis=1)
all_good_sleep = pd.concat([all_good_sleep, country], axis=1)
all_good_sleep = all_good_sleep.drop('user_id', axis=1)
all_good_sleep = all_good_sleep.drop('sex_y', axis=1)
all_good_sleep = all_good_sleep.drop('country', axis=1)
all_good_sleep['age'] = all_good_sleep['age'].fillna(np.mean(all_good_sleep['age']))

In [71]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import r2_score

#PCA, then K-means, then find the mean effectiveness for the target treatment for each cluster
#data is the df containing both your train and test sets
#split is how many users you want split into the test set
#pca_components and num_clusters are hyperparameters
#treatment is the target treatment
def find_cluster_effectiveness(data, split, pca_components, num_clusters, treatment):
    #applying PCA because the number of features in the pivoted dataset is very high
    #NOTE: data still contains the target treatment column, so the target is part of the PCA input
    pca = PCA(n_components=pca_components)
    pca.fit(data)
    fit_data = pca.transform(data)
    train = fit_data[split:]
    test = fit_data[:split]
    test_df = data.iloc[:split,:] #splitting on same lines so I can look these users up at test time

    kmeans = KMeans(n_clusters=num_clusters)
    kmeans.fit(train)
    pred_train = kmeans.predict(train)
    pred_test = kmeans.predict(test)

    #now look up which cluster each test sample falls into, and take the mean target effectiveness for that cluster
    #NOTE: pred_train is indexed relative to the training rows (which start at position split), while iloc below
    #indexes data from position 0; offsetting the positions by split would line the two up
    cluster_effectiveness = []
    for i in range(num_clusters):
        rows_in_cluster = data.iloc[np.where(pred_train == i)[0],:][treatment]
        cluster_effectiveness.append(np.mean(rows_in_cluster))

    predicted = []
    actual = []
    for i in range(len(pred_test)):
        predicted.append(cluster_effectiveness[pred_test[i]])
        actual.append(test_df.iloc[i,:][treatment])
        #print "predicted value is " + str(cluster_effectiveness[pred_test[i]]) + " actual value is " + str(test_df.iloc[i,:][treatment])
    r2 = r2_score(actual, predicted)
    return r2

#search to find best PCA components and num clusters
num_clusters_list = [3, 5, 10, 15,20,25]
pca_components_list = [50, 100, 200]
for num_clusters in num_clusters_list:
    for pca_components in pca_components_list:
        r2 = find_cluster_effectiveness(all_good_sleep, 6, pca_components,num_clusters,"good sleep")
        print("r2 accuracy score with num_clusters=" + str(num_clusters) + " pca_components=" +str(pca_components) + " : "+ str(r2))


r2 accuracy score with num_clusters=3 pca_components=50 : -0.619685461877
r2 accuracy score with num_clusters=3 pca_components=100 : -0.619685461877
r2 accuracy score with num_clusters=3 pca_components=200 : -0.619685461877
r2 accuracy score with num_clusters=5 pca_components=50 : -0.595061946057
r2 accuracy score with num_clusters=5 pca_components=100 : -0.677492200554
r2 accuracy score with num_clusters=5 pca_components=200 : -0.622317158881
r2 accuracy score with num_clusters=10 pca_components=50 : -0.724690080404
r2 accuracy score with num_clusters=10 pca_components=100 : -0.811948469328
r2 accuracy score with num_clusters=10 pca_components=200 : -0.829812694566
r2 accuracy score with num_clusters=15 pca_components=50 : -0.545811060786
r2 accuracy score with num_clusters=15 pca_components=100 : -0.54021837861
r2 accuracy score with num_clusters=15 pca_components=200 : -0.54021837861
r2 accuracy score with num_clusters=20 pca_components=50 : -0.249063845758
r2 accuracy score with nu

Negative scores: this model is performing worse than just guessing the mean of good sleep effectiveness.  Brutal.
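For reference, here is a small sketch of how to read a negative score. `r2_score` compares the squared error of the predictions with the squared error of always predicting the mean of the actual values. Predicting that mean scores exactly 0, and anything with a larger error scores below 0. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual effectiveness values for four test users.
actual = np.array([0.2, -1.0, 0.4, 1.2])

# Baseline: always predict the mean of the actual values.
mean_guess = np.full_like(actual, actual.mean())
r2_mean = r2_score(actual, mean_guess)

# Predictions that are further off than the baseline give a negative score.
poor_guess = np.array([1.0, 0.8, -0.9, -0.5])
r2_poor = r2_score(actual, poor_guess)

print(r2_mean, r2_poor)
```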


### User Based Collaborative Filtering with K-Nearest Neighbours

This approach finds the K users that are most like the test user, and takes the mean of their target values.  Fortunately for me, this is exactly what scikit-learn's KNeighborsRegressor does.  And this time I'll be able to use exhaustive grid search to find my parameters.
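A quick way to confirm that behaviour is to compare the regressor's prediction with a hand-computed mean over the nearest neighbours. The sketch below uses a made-up one-dimensional feature. `X`, `y`, and `query` are illustrative names only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical users described by a single feature, with target values.
X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0.1, 0.3, 0.5, -1.0])

knn = KNeighborsRegressor(n_neighbors=2, weights="uniform").fit(X, y)

# The two nearest users to 0.4 are the ones at 0.0 and 1.0.
query = np.array([[0.4]])
prediction = knn.predict(query)[0]

# Reproduce the prediction by averaging the neighbours' targets.
_, indices = knn.kneighbors(query)
manual = y[indices[0]].mean()

print(prediction, manual)
```

With `weights="distance"` the average would instead be weighted by inverse distance, which is the other option explored in the grid search.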

In [68]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

#using same data structure as kmeans solution
X = all_good_sleep.drop('good sleep', axis=1)
y = all_good_sleep['good sleep']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

parameters = {'weights':('uniform', 'distance'), 'n_neighbors':[1,2,3,4,5,6,7,8]}
nn = KNeighborsRegressor(1)
clf = GridSearchCV(nn, parameters)  #scorer defaults to the estimator's own score, which is r2 for regressors
clf.fit(X_train, y_train)
print(clf.best_params_)
pred = clf.predict(X_test)
print(pred)
print(y_test)
print(r2_score(y_test, pred))

{'n_neighbors': 5, 'weights': 'uniform'}
[-0.01175926 -0.12170198  0.04530992  0.23189434 -0.35776607 -0.06608979
 -0.02884755]
2     0.070833
29   -0.380952
13    0.125000
10    1.250000
27    0.711111
25    0.297259
22   -0.191139
Name: good sleep, dtype: float64
-0.304938246319


Ouch.  We might be able to improve that with PCA.  But given the extremely low score, it doesn't seem like a valuable use of time to continue this line of thinking.


I'm uncertain, but I believe the user based filtering is not working well because so many of the features are unrelated to the predicted value.  It is even likely that almost all of the features that I'm using to describe a user are totally irrelevant to the value that I'm trying to predict.  The item-based method does not suffer from this because it finds the most relevant feature (by cosine distance) and discards the rest.  So some further investigation may involve trying to solve this problem.


### Wrap-up

The item based recommender is greatly outperforming both types of user based recommenders.  For the time being I'll be proceeding with building that out.  It should be noted, though, that as this dataset grows in both breadth and depth, we should revisit this notebook.  Just drop in a new CSV, rerun each section, and see how the r2 scores change.
