In [13]:
import pandas as pd
import numpy as np

# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Recommendation Engines

Week 11 | Lesson 3.1

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Explain what a recommendation engine is
- Explain the math behind recommendation engines
- Explain the types of recommendation engines and their pros and cons

### LESSON GUIDE
| TIMING  | TYPE  | TOPIC  |
|:-:|---|---|
| 5 min  | [Opening](#opening)  |  Opening |
| 5 min | [Introduction](#introduction) | Introduction |
| 30 min | [Demo/Guided Practice](#collaborative) | Collaborative Filtering |
| 10 min | [Demo/Guided Practice](#content) | Content-based Filtering |
| 25 min | [Independent Practice](#independent) | Independent Practice |
| 10 min | [Conclusion](#conclusion) | Conclusion |

![](http://res.cloudinary.com/goodsearch/image/upload/v1410895418/hi_resolution_merchant_logos/target_coupons.gif)

![](https://cdn1.vox-cdn.com/thumbor/lazP2aCcxVUI5RnbcxWpmjr7MU0=/cdn0.vox-cdn.com/uploads/chorus_asset/file/4109214/Discover_Weekly_Snapshot.0.png)

![](https://pmcvariety.files.wordpress.com/2015/09/pandora-logo.jpg?w=670&h=377&crop=1)

![](http://techlogitic.com/wp-content/uploads/2015/11/rs_560x415-140917143530-1024.Tinder-Logo.ms_.091714_copy.jpg)

![](https://pbs.twimg.com/profile_images/744949842720391168/wuzyVTTX.jpg)

### So how might we go about recommending things to people that they have never seen or tried before? How can we know what they'll like before they do?

### We have essentially two options:
- Based upon similar people
- Based upon similar characteristics of the item

- The first is called **Collaborative Filtering**
- The second is called **Content-based Filtering**

## Collaborative Filtering

We'll first look at user-to-user filtering. The idea behind this method is finding your taste doppelgänger. This is the person who is most similar to you based upon the ratings both of you have given to a mix of products.

Let's take a look at how this works.

We begin with what's called a utility matrix.
![](./assets/images/utility.png)

Now if we want to find the users most similar to user A, we can use something called cosine similarity. Cosine similarity uses the cosine of the angle between two vectors to compute a scalar value that represents how closely related these vectors are. If the vectors have an angle of 0 (they are pointing in exactly the same direction), then the cosine is 1 and they are perfectly similar. If they are perpendicular (the angle is 90 degrees), then the cosine similarity is 0 and they are unrelated. If they point in opposite directions (the angle is 180 degrees), the cosine similarity is -1.
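To make the definition concrete, here is a minimal NumPy sketch of cosine similarity computed straight from the dot product and the vector lengths. The helper name `cosine` and the three toy vector pairs are our own illustration, not part of the lesson data.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same = cosine([1, 2], [2, 4])          # same direction, angle of 0
orthogonal = cosine([1, 0], [0, 1])    # perpendicular, angle of 90 degrees
opposite = cosine([1, 2], [-1, -2])    # opposite direction, angle of 180 degrees
print(same, orthogonal, opposite)
```

Up to floating-point rounding, the three values come out as 1, 0, and -1.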

With that, let's calculate the cosine similarity of A against all other users. We'll start with B. We have a sparse matrix so let's just fill in 0 for the missing values.

```python
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(np.array([4,0,5,3,5,0,0]).reshape(1,-1),\
np.array([0,4,0,4,0,5,0]).reshape(1,-1))
```
This gives us a cosine similarity of .1835.

This is a low similarity, which makes sense since the two users have only one rated item in common.

Let's run it for user A and C now.

```python
cosine_similarity(np.array([4,0,5,3,5,0,0]).reshape(1,-1),\
np.array([2,0,2,0,1,0,0]).reshape(1,-1))
```

This gives us a cosine similarity of .8852, which suggests these users are very similar. But are they?

## We can't use zeros!

By filling the missing values with 0, we have treated every missing rating as a strongly negative one, which creates agreement where there is none. We should instead represent a missing rating with a neutral value. We can do this by mean-centering each user's ratings at zero. Let's see how that works.

We add up all the ratings for user A and then divide by the number of ratings. In this case that is 17/4, or 4.25. We then subtract 4.25 from each of A's ratings. We then do the same for every other user, using that user's own mean. That gives us the following table:

![](./assets/images/centered.png)

```python
cosine_similarity(np.array([-.25,0,.75,-1.25,.75,0,0])\
.reshape(1,-1),\
np.array([0,-.33,0,-.33,0,.66,0])\
.reshape(1,-1))
```

The new similarity for A and B is .3077.

```python
cosine_similarity(np.array([-.25,0,.75,-1.25,.75,0,0])\
.reshape(1,-1),\
np.array([.33,0,.33,0,-.66,0,0])\
.reshape(1,-1))
```
The new similarity for A and C is -0.246.

So A and B became more similar and A and C moved further apart, which is what we'd hope to see. Centering has another benefit: easy and hard raters are put on the same footing.
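The centering step can also be done in code instead of by hand. The sketch below assumes, as in the cells above, that 0 marks a missing rating. The helper names `center_observed` and `cosine` are hypothetical. Because it uses exact means rather than the rounded .33 and .66, the results match the figures above only to about three decimal places.

```python
import numpy as np

def center_observed(ratings):
    """Subtract the mean of the observed (non-zero) ratings; missing ratings stay at 0."""
    r = np.asarray(ratings, dtype=float)
    observed = r != 0
    centered = np.zeros_like(r)
    centered[observed] = r[observed] - r[observed].mean()
    return centered

def cosine(u, v):
    """Cosine similarity of two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Raw ratings for users A, B and C, taken from the zero-filled cells above
A = [4, 0, 5, 3, 5, 0, 0]
B = [0, 4, 0, 4, 0, 5, 0]
C = [2, 0, 2, 0, 1, 0, 0]

sim_ab = cosine(center_observed(A), center_observed(B))
sim_ac = cosine(center_observed(A), center_observed(C))
print(round(sim_ab, 3), round(sim_ac, 3))
```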

#### Exercise: Find the similarity between X and Y and X and Z for the following.

|User | Snarky's Potato Chips | SoSo Smooth Lotion | Duffly Beer | BetterTap Water | XXLargeLivin' Football Jersey | Snowy Cotton Balls | Disposos Diapers |
|:-:|---|---|---|---|---|---|---|
| X | | 4 | | 3 | | 4 | ? |
| Y | | 3.5 | | 2.5 | | 4 | 4 |
| Z | | 4 | | 3.5 | | 4.5 | 4.5 |

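In [5]:
from sklearn.metrics.pairwise import cosine_similarity

# Users A and B from above, with missing ratings filled in as 0
cosine_similarity(np.array([4,0,5,3,5,0,0]).reshape(1,-1),
                  np.array([0,4,0,4,0,5,0]).reshape(1,-1))

array([[ 0.18353259]])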

In [27]:
Y = [0, 3.5, 0, 2.5, 0, 4, 4]
X = [0, 4, 0, 3, 0, 4, 0]
Z = [0, 4, 0, 3.5, 0, 4.5, 4.5]

# Standardize X's observed (non-zero) ratings:
# subtract their mean, then divide by their standard deviation
non_zeroes = [i for i in X if i > 0]
meanie = np.mean(non_zeroes)
stdev = np.std(non_zeroes)
new_mat = [(i - meanie) / stdev for i in non_zeroes]

print(new_mat)
print(X)
print(Y)
print(Z)

[0.70710678118654779, -1.4142135623730947, 0.70710678118654779]
[0, 4, 0, 3, 0, 4, 0]
[0, 3.5, 0, 2.5, 0, 4, 4]
[0, 4, 0, 3.5, 0, 4.5, 4.5]


In [17]:
# between X and Y
cosine_similarity(np.array(X).reshape(1,-1),\
np.array(Y).reshape(1,-1))

array([[ 0.67101992]])

In [18]:
# between X and Z
cosine_similarity(np.array(X).reshape(1,-1),\
np.array(Z).reshape(1,-1))

array([[ 0.69793691]])

In [19]:
# between Y and Z
cosine_similarity(np.array(Y).reshape(1,-1),\
np.array(Z).reshape(1,-1))

array([[ 0.99348623]])

## But how do we get the rating for an item?

Next we'll find the expected rating by User X for Disposos Diapers using the weighted ratings of the two closest users, Y and Z. (We only have two neighbors here; normally you would select the k most similar users.)

We do this by multiplying each neighbor's rating by that neighbor's similarity to X and adding up the results. We then divide by the sum of the similarities to arrive at our rating.

(.42447212 * (4) + .46571861 * (4.5)) / (.42447212 + .46571861) = 4.26
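The same arithmetic works for any number of neighbors. Here is a minimal sketch; the helper name `weighted_rating` is hypothetical, and the similarities and ratings are simply the numbers from the equation above.

```python
def weighted_rating(similarities, ratings):
    """Similarity-weighted average of the neighbors' ratings (hypothetical helper)."""
    weighted_sum = sum(s * r for s, r in zip(similarities, ratings))
    return weighted_sum / sum(similarities)

# Similarities of Y and Z to X, and their ratings of the diapers, from the equation above
prediction = weighted_rating([0.42447212, 0.46571861], [4, 4.5])
print(round(prediction, 2))
```

Because the weights are normalized by their sum, the prediction always falls between the lowest and highest neighbor rating when all similarities are positive.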

#### Check: What might be some problems with user-to-user filtering?

In practice, there is a type of collaborative filtering that performs much better than user-to-user filtering: item-to-item filtering.

## Item-to-item filtering

Let's take a look at an example ratings table.

![](./assets/images/songs.png)

Just as in user-to-user filtering, we need to center our values by row.

#### Exercise: Center the values by row and find the cosine similarity for each row vs. row 5 (S5).

In [35]:
# Each song's ratings from users U1..U5; 0 marks a missing rating
S1 = [2, 0, 4, 0, 5]
S2 = [0, 3, 0, 3, 0]
S3 = [1, 0, 5, 0, 4]
S4 = [0, 4, 4, 4, 0]
S5 = [3, 0, 0, 0, 5]

# Mean and standard deviation of S1's observed ratings only
S1mean = np.mean([r for r in S1 if r != 0])
S1std = np.std([r for r in S1 if r != 0])

# Center the observed ratings on the row mean; missing ratings stay at 0
new_s1 = [r - S1mean if r != 0 else 0 for r in S1]

print(S1mean)
print(S1std)

new_s1

3.66666666667
1.24721912892


[-1.6666666666666665, 0, 0.33333333333333348, 0, 1.3333333333333335]

The songs nearest to S5 should be S1 and S3. To calculate U3's rating for our target song, S5, using a k of 2, we have the following equation:

(.98 * (4) + .72 * (5)) / (.98 + .72) = 4.42

This is the similarity of our closest song, S1, times U3's rating of S1, plus the similarity of S3 times U3's rating of S3. The total is then divided by the sum of the two similarities.

Based on this item-to-item collaborative filtering, we expect U3 to rate S5 highly, at about 4.42.
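The similarities of .98 and .72 used above can be reproduced by centering each song's row and comparing it with S5. The sketch below assumes 0 marks a missing rating. It also treats a row that centers to all zeros (S2 and S4, where every observed rating is identical) as having similarity 0, which is our choice rather than something the lesson specifies. The helper names are hypothetical.

```python
import numpy as np

# Song ratings by users U1..U5; 0 marks a missing rating
songs = {
    "S1": [2, 0, 4, 0, 5],
    "S2": [0, 3, 0, 3, 0],
    "S3": [1, 0, 5, 0, 4],
    "S4": [0, 4, 4, 4, 0],
    "S5": [3, 0, 0, 0, 5],
}

def center_observed(ratings):
    """Subtract the mean of the observed (non-zero) ratings; missing ratings stay at 0."""
    r = np.asarray(ratings, dtype=float)
    observed = r != 0
    centered = np.zeros_like(r)
    centered[observed] = r[observed] - r[observed].mean()
    return centered

def cosine(u, v):
    """Cosine similarity; defined here as 0 when either vector has zero length."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / norm) if norm > 0 else 0.0

centered = {name: center_observed(r) for name, r in songs.items()}
sims = {name: cosine(vec, centered["S5"])
        for name, vec in centered.items() if name != "S5"}

# The k = 2 songs most similar to S5
nearest = sorted(sims, key=sims.get, reverse=True)[:2]
print(nearest, [round(sims[name], 2) for name in nearest])
```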

## Content-based Filtering

Finally, there is another method called content-based filtering. In content-based filtering, the items are broken down into "feature baskets". These are the characteristics that represent the item. The idea is that if you like the features of song X, then finding a song that has similar characteristics will tell us that you're likely to like it as well.


The quintessential example of this is Pandora with its musical genome. Each song is rated on ~450 characteristics by a trained musicologist.
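As a toy illustration of the idea, the sketch below scores songs by how close their feature baskets are to a song the listener already likes. Everything in it is made up for illustration: the song names, the four features, and their values are hypothetical and have nothing to do with Pandora's actual characteristics.

```python
import numpy as np

# Hypothetical feature baskets: [acoustic, electronic, vocal-heavy, fast tempo], each scored 0 to 1
features = {
    "song_x": np.array([0.9, 0.1, 0.8, 0.2]),
    "song_a": np.array([0.8, 0.2, 0.7, 0.3]),
    "song_b": np.array([0.1, 0.9, 0.2, 0.9]),
}

def cosine(u, v):
    """Cosine similarity of two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

liked = "song_x"  # a song the listener is known to like
scores = {name: cosine(vec, features[liked])
          for name, vec in features.items() if name != liked}

# Recommend the song whose features are closest to the liked song
best = max(scores, key=scores.get)
print(best)
```

No other users' ratings are needed here, only the item features, which is the key difference from collaborative filtering.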

## Independent Exercise:

Write a function that takes in a utility matrix with songs along the index and users along the columns, as in the item-to-item filtering example above. The function should also accept a target user and a target song, and return the predicted rating for that pair.

Use the following as your utility matrix:

In [36]:
df = pd.DataFrame({'U1': [2, None, 1, None, 3],
                   'U2': [None, 3, None, 4, None],
                   'U3': [4, None, 5, 4, None],
                   'U4': [None, 3, None, 4, None],
                   'U5': [5, None, 4, None, 5]})
df.index = ['S1', 'S2', 'S3', 'S4', 'S5']

In [37]:
df

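| | U1 | U2 | U3 | U4 | U5 |
|:-:|---|---|---|---|---|
| S1 | 2.0 | NaN | 4.0 | NaN | 5.0 |
| S2 | NaN | 3.0 | NaN | 3.0 | NaN |
| S3 | 1.0 | NaN | 5.0 | NaN | 4.0 |
| S4 | NaN | 4.0 | 4.0 | 4.0 | NaN |
| S5 | 3.0 | NaN | NaN | NaN | 5.0 |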


In [44]:
def The_Recommender(df):
    # Mean-center each song's observed ratings; missing ratings are left at 0
    df = df.fillna(0)
    for i in range(len(df)):
        # Rebuild these for every row so values do not accumulate across rows
        non_zeroes = [r for r in df.iloc[i, :] if r != 0]
        meanie = np.mean(non_zeroes)
        # No division by the standard deviation: it is 0 for rows such as S2 and S4
        new_row = [r - meanie if r != 0 else 0 for r in df.iloc[i, :]]
        df.iloc[i, :] = new_row
    return df

In [45]:
The_Recommender(df)
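One possible way to complete the exercise is sketched below. It is not the only valid answer. The function name `predict_rating`, the `k` parameter, and the choices to skip songs the user has not rated and to ignore non-positive similarities are all our own assumptions. It follows the same steps as the lesson: center each song's row, compare rows by cosine similarity, then take a similarity-weighted average of the user's ratings for the nearest songs.

```python
import numpy as np
import pandas as pd

def predict_rating(df, user, song, k=2):
    """Item-to-item estimate of `user`'s rating for `song` (hypothetical helper).

    df has songs on the index and users on the columns; NaN marks a missing rating.
    """
    # Mean-center each song's observed ratings, then treat missing ratings as 0 (neutral)
    centered = df.sub(df.mean(axis=1), axis=0).fillna(0)
    target = centered.loc[song].to_numpy()
    sims = {}
    for other in df.index:
        # Only songs the user has actually rated can contribute to the estimate
        if other == song or pd.isna(df.loc[other, user]):
            continue
        vec = centered.loc[other].to_numpy()
        norm = np.linalg.norm(vec) * np.linalg.norm(target)
        sims[other] = float(vec @ target / norm) if norm > 0 else 0.0
    # Keep the k most similar songs, ignoring non-positive similarities
    nearest = [s for s in sorted(sims, key=sims.get, reverse=True)[:k] if sims[s] > 0]
    if not nearest:
        return float("nan")
    weights = np.array([sims[s] for s in nearest])
    ratings = np.array([df.loc[s, user] for s in nearest])
    return float(weights @ ratings / weights.sum())

df = pd.DataFrame(
    {
        "U1": [2, np.nan, 1, np.nan, 3],
        "U2": [np.nan, 3, np.nan, 4, np.nan],
        "U3": [4, np.nan, 5, 4, np.nan],
        "U4": [np.nan, 3, np.nan, 4, np.nan],
        "U5": [5, np.nan, 4, np.nan, 5],
    },
    index=["S1", "S2", "S3", "S4", "S5"],
)

print(round(predict_rating(df, "U3", "S5"), 2))
```

For U3 and S5 this should land on the 4.42 worked out by hand in the item-to-item section, since S1 and S3 are again the two nearest songs.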

## Conclusion

We have looked at the major types of recommender systems in this lesson. Let's quickly wrap up by looking at the pros and cons of each.

**Collaborative Filtering**

Pros:
- No need to hand craft features

Cons:
- Needs a large existing set of ratings (cold-start problem)
- Sparsity occurs when the number of items far exceeds what a person could purchase

**Content-based Filtering**

Pros:
- No need for a large number of users

Cons:
- Lacks serendipity
- May be difficult to generate the right features

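In fact, the best solution, and the one most likely in use in any large-scale production system, is a combination of both of these. This is known as a **hybrid system**. By combining the two, each method can cover for the other's weaknesses: content features help with new items and users that have few ratings, while ratings from similar users bring back some serendipity.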