<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Similar Users and Recommender Systems 

_Authors: Dave Yerrington (SF)_

---


## Preface: working with sets

In mathematics, a set is a collection of distinct objects.  In Python, a `set` is an unordered collection with no duplicate entries. Set objects also support mathematical operations like union, intersection, difference, and symmetric difference.

> _Fun fact for your next party:  Technically, in CPython, sets are implemented using hash tables, much like dictionaries (under the hood)._

**Here are two sets of colors:**

In [1]:
a = set(["Red", "Green", "Blue"])
b = set(["Black", "White", "Green"])

To find out which items are in both sets (**both sets only**), use the "intersection" method:

In [2]:
a.intersection(b)

{'Green'}

To find the items in `a` but not in `b`, use the "difference" method:

In [3]:
a.difference(b)

{'Blue', 'Red'}

To find the items in `b` but not in `a`:

In [4]:
b.difference(a)

{'Black', 'White'}

To find all unique items across both sets (aka: union):

In [5]:
set(list(a) + list(b))

{'Black', 'Blue', 'Green', 'Red', 'White'}

In [6]:
a.union(b)

{'Black', 'Blue', 'Green', 'Red', 'White'}

How many items in `b` are not in `a`?

In [22]:
print("Number of different items in b:  %d" % len(b.difference(a)))

Number of different items in b:  2
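The method calls above also have operator shorthands, and the symmetric difference mentioned at the start of this section has not been shown yet. Here is a minimal sketch using the same two color sets; `sorted()` is used only to make the printed order predictable, since sets are unordered.

```python
a = {"Red", "Green", "Blue"}
b = {"Black", "White", "Green"}

# Operator shorthands for the methods used above
print(sorted(a & b))  # intersection
print(sorted(a | b))  # union
print(sorted(a - b))  # difference: in a but not in b

# Symmetric difference: items that appear in exactly one of the two sets
sym_diff = a ^ b
print(sorted(sym_diff))
print(sym_diff == a.symmetric_difference(b))
```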


## From sets to lists
---

Now that we're experts on Python sets, let's get savvy working with lists and unstructured data.

Using the `split()` method on a string, we can "split" it by a delimiter and get back a list.  The `.split()` method is available on any string object and, by default, splits on whitespace.  

> *Note: You can pass a parameter to split to change which character it will split on, such as ",", if you're trying to turn a comma-separated string of items into a list.*

The following will turn a space-delimited *string* into a **list**.

In [8]:
"my name is dave my name is dave my name is dave".split()

['my',
 'name',
 'is',
 'dave',
 'my',
 'name',
 'is',
 'dave',
 'my',
 'name',
 'is',
 'dave']

If we had many values, it would be hard to know which of them are unique.  That's when we use sets.

In [9]:
set("my name is dave my name is dave my name is dave".split())

{'dave', 'is', 'my', 'name'}
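The note above about passing a delimiter is worth a quick illustration, because splitting on `","` alone leaves the surrounding spaces attached to each item. The sketch below uses a made-up string (`raw` is a hypothetical example, not survey data) and strips each piece before building the set.

```python
raw = "Blues, Reggae,Electronic Music , Blues"

# Splitting on "," alone keeps the surrounding spaces
pieces = raw.split(",")
print(pieces)  # ['Blues', ' Reggae', 'Electronic Music ', ' Blues']

# Strip each piece so that "Blues" and " Blues" count as the same item
cleaned = [piece.strip() for piece in pieces]
unique_items = set(cleaned)
print(sorted(unique_items))  # ['Blues', 'Electronic Music', 'Reggae']
```

Without the `strip()` step, the set would hold both `'Blues'` and `' Blues'` as distinct entries.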

## Who has similar tastes in music?
---

We will build a small process that takes feedback from a survey and applies a distance function to it, so that we can find similar users based on [Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index).

**Along the way we will be:**
* Working with requests
* Using Python sets and lists
* Cleaning up bad data
* Implementing the Jaccard distance function
* Finding similar users

First, we will be taking a survey!  The survey grows each time someone does this lab, so you will be able to compare your answers to past cohorts.

> [Take the DSI music survey](https://docs.google.com/forms/d/1sSUwdx6hj-K5GjVV00W_3we7r6QeCZvgfjYSL7VrAOE/edit)

### Loading the data

First we will load our results via HTTP. Then we will wrap the response in a `BytesIO` object, which lets us treat in-memory bytes as if they were a file. Finally we will load them into a pandas DataFrame.  

This is set up for us below.

In [23]:
import pandas as pd
import requests

from io import StringIO, BytesIO

%matplotlib inline

# if you can't run a survey and load from google spreadsheets, 
# you can use the local csv.
# local_csv = './datasets/favorite_music_responses.csv'
# df = pd.read_csv(local_csv, index_col=0)
# df.dropna(inplace=True)

spreadsheet = "https://docs.google.com/spreadsheets/d/1cpUb7XbN-qOq4xbGdYfhY9FtrMqRd0izz4PmTPMejt0/export?format=csv&id=1cpUb7XbN-qOq4xbGdYfhY9FtrMqRd0izz4PmTPMejt0&gid=216538035"
http = requests.get(spreadsheet)

csv_data = BytesIO(http.content)
df = pd.read_csv(csv_data, index_col=0)
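To see what the in-memory file objects are doing without touching the network, here is a small sketch. The `payload` bytes are a hypothetical stand-in for `http.content`, and the standard-library `csv` module is used only to keep the example dependency-free; `pd.read_csv` accepts the same kind of file-like object, as the cell above shows.

```python
import csv
from io import BytesIO, StringIO

# Hypothetical CSV payload, standing in for the bytes of an HTTP response
payload = b'Name,genres\nAda,"Blues, Rock"\nGrace,Reggae\n'

# BytesIO wraps bytes in a file-like object; read() returns what was stored
byte_buffer = BytesIO(payload)
text = byte_buffer.read().decode("utf-8")

# StringIO does the same for text, so csv.reader can iterate over it like a file
rows = list(csv.reader(StringIO(text)))
print(rows)
```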

In [24]:
df.head(50)

| Timestamp | Name | Favorite Genres / Genres you like | What time of day do you like to listen to music? |
|---|---|---|---|
| 2/1/2017 9:14:18 | Kat | Metal, Ultra Speed Metal | Afternoon, Night |
| 2/1/2017 9:14:26 | Austin Whaley | Electronic Music, Hip Hop / Rap, Pop, R&B / Soul | Night |
| 2/1/2017 9:14:45 | William Tell | Blues, European Music (Folk / Pop), Latin Musi... | Morning, Afternoon, Special occasions |
| 2/1/2017 9:15:13 | Laura | Alternative Music, Blues, Country, Hip Hop / R... | Morning, Afternoon, Night, Special occasions |
| 2/1/2017 9:15:17 | Vinnie | Easy Listening, Hip Hop / Rap, R&B / Soul | Morning, Afternoon, Night, Special occasions, ... |
| 2/1/2017 9:15:22 | Jim | Alternative Music, Country, Indie Pop, Asian P... | Night |
| 2/1/2017 9:15:23 | Rashim Khadka | Blues, Country, Dance, Hip Hop / Rap, Rock | Morning, Afternoon, Night |
| 2/1/2017 9:16:00 | Kara | Blues, Country, Latin Music, Rock | depends on my mood |
| 2/1/2017 9:16:02 | Luke Armbruster | Electronic Music, Hip Hop / Rap, R&B / Soul, R... | Morning, Noon, Afternoon, Night, 24/7 |
| 2/1/2017 9:16:06 | medhi mugnier | Hip Hop / Rap, Reggae | 24/7 |


### 1. Rename the genre feature

For ease of reference, rename the feature **"Favorite Genres / Genres you like"** to **"genres"**.


In [None]:
# Renaming the time of day feature for later as well

### 2. Select only your response from the new "genres" feature

Try printing out only the first value, where `df["Name"] == "[Your name]"`.

### 3. Take your survey response for "genres" and split it into a list with one item per genre you chose

For example, if you chose "Blues, Reggae, Electronic Music", convert it to a list that looks like `["Blues", "Reggae", "Electronic Music"]`.

In [None]:
# You can use .values or .iloc

### 4. Create a function that takes 2 lists and calculates the Jaccard distance

You can do this! Double check the lecture slides and refer to the set operations for how to calculate this.  

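Jaccard similarity is the number of items the two sets share divided by the number of unique items across both sets. Jaccard distance is one minus that similarity:

$$
\text{Jaccard similarity} = \frac{|A \cap B|}{|A \cup B|} = \frac{\text{Number of items in common (intersecting)}}{\text{Number of unique items in A and B combined}}
$$

$$
\text{Jaccard distance} = 1 - \frac{|A \cap B|}{|A \cup B|}
$$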
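As a way to check your own function, here is one possible sketch of the calculation on the two color lists from the preface. The names `jaccard_similarity` and `jaccard_distance` are illustrative only, and returning 1.0 for two empty inputs is an assumption rather than something the lab specifies.

```python
def jaccard_similarity(items_a, items_b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty collections are treated as identical
        return 1.0
    return len(set_a & set_b) / len(union)


def jaccard_distance(items_a, items_b):
    """Distance is the complement of similarity."""
    return 1.0 - jaccard_similarity(items_a, items_b)


a = ["Red", "Green", "Blue"]
b = ["Black", "White", "Green"]
print(jaccard_similarity(a, b))  # 1 shared item out of 5 unique items -> 0.2
print(jaccard_distance(a, b))
```

Converting to sets first means duplicate entries in either list do not inflate the counts.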

In [None]:
# Update the jaccard function
def jaccard(list1, list2):
    pass


list1 = ['blue', 'green', 'yellow']
list2 = ['black', 'orange', 'yellow', 'green']

jaccard(list1, list2)

### 5.  Now for our final trick: calculate the distance between your genre preferences and everyone else's

Loop through everyone in the dataframe and create a list out of their "genres" string. Print out each person's name and the distance between your genre preferences and theirs.
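The loop can be prototyped on plain Python data before wiring it up to the dataframe. In this sketch the `responses` dictionary is hypothetical toy data standing in for the name and genres columns, and `to_list` and `jaccard_similarity` are illustrative helper names. It prints similarity; subtract from 1 if you want distance.

```python
def jaccard_similarity(items_a, items_b):
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def to_list(response):
    # Split a comma-separated survey answer and trim stray whitespace
    return [item.strip() for item in response.split(",")]


# Hypothetical responses, standing in for the name and genres columns
responses = {
    "Me": "Blues, Country, Rock",
    "Ana": "Blues, Country, Latin Music, Rock",
    "Ben": "Hip Hop / Rap, Reggae",
}

mine = to_list(responses["Me"])
scores = {
    name: jaccard_similarity(mine, to_list(genres))
    for name, genres in responses.items()
    if name != "Me"
}

# Most similar respondents first
for name, score in sorted(scores.items(), key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.2f}")
```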

### 6. Try calculating the distance on the time of day feature

Make a new dataframe comparing just you against everyone else, using Jaccard on the time of day feature. Do you see any interesting patterns?

### 7. What can you say about the selection of options for genre or time and what they mean?

One thing that is pretty obvious is that there are fewer options for time of day.  Time of day is a much broader category and may not be a great predictor of individual preferences within the dataset.

Also, options that broadly generalize other options already in the set dilute the preference signal.  For instance, options such as "24/7", "all", or "everything" could describe other options in the same set and don't point to a preference for anything specific.  If you're going to ask explicitly for feedback, these items will not be very useful.
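A small sketch makes the problem with catch-all options concrete. The three answers below are hypothetical, modeled on the kinds of responses in the survey. Someone who answers "24/7" presumably listens in the morning too, but a set comparison only sees labels, so it finds no overlap with "Morning".

```python
def jaccard_similarity(items_a, items_b):
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Hypothetical answers to the time-of-day question
always = ["24/7"]
specific = ["Morning", "Afternoon", "Night"]
mixed = ["Morning", "Noon", "Afternoon", "Night", "24/7"]

print(jaccard_similarity(always, specific))  # 0.0: no shared labels
print(jaccard_similarity(always, mixed))     # 0.2: only the "24/7" label matches
print(jaccard_similarity(specific, mixed))   # 0.6
```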

### 8. Bonus:  Try Jaccard out on the LastFM dataset and compare it to Pearson and Cosine.

In [27]:
from sklearn.preprocessing import StandardScaler
import pandas as pd, numpy as np
import sqlite3
conn = sqlite3.connect("./db.sqlite3")
conn.text_factory = lambda x: str(x, 'latin1')

sql = """
SELECT r.userID, r.artistID, r.tagID, 
a.name AS artist,
t.tagValue as genre
FROM rec_user_artist_tags r
LEFT JOIN rec_artists a on r.artistID = a.id
LEFT JOIN rec_tags t on r.tagID = t.tagID
WHERE a.name IS NOT NULL
LIMIT 15000
"""

artists = pd.read_sql(sql, con=conn)
artist_genre = artists.groupby(["artist", "genre"]).size().sort_values(ascending=False).unstack().fillna(0)
artist_genre

genre,00,00s,10s,1970,1970s,1973,1978,1979,1979 songs,1980,...,wooooooaaaaahh,world,world music,worst lyrics ever,x factor,xtina,xtina love,yildirim turker,you,zadrotstvo
artist,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
!!!,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
#####,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
(hed) Planet Earth,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
*NSYNC,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
+44,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12 Stones,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1200 Micrograms,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12012,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13th Floor Elevators,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2 Unlimited,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
## Calculate similarity on a smaller subset at first
## Reference artists "2Pac" and "Nickelback", our favorite band, in the similarity matrix (and sort the scores)
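Before running anything on the full artist-by-genre matrix, it may help to see the three measures side by side on two short vectors. This is a plain-Python sketch so that each formula is visible. The vectors `artist_a` and `artist_b` are hypothetical tag counts, not rows from the LastFM data, and binarizing the counts for Jaccard is an assumption about how you might adapt it to count data.

```python
from math import sqrt


def jaccard(u, v):
    # Treat any non-zero count as "tag present"
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    return both / either if either else 1.0


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def pearson(u, v):
    # Pearson correlation is the cosine similarity of the mean-centered vectors
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([x - mean_u for x in u], [y - mean_v for y in v])


# Hypothetical tag counts for two artists over the same four genres
artist_a = [3, 0, 1, 0]
artist_b = [1, 0, 0, 2]

print(round(jaccard(artist_a, artist_b), 3))
print(round(cosine(artist_a, artist_b), 3))
print(round(pearson(artist_a, artist_b), 3))
```

Jaccard ignores how often a tag was applied, cosine takes the counts into account, and Pearson additionally centers each vector on its mean, so the three scores can disagree for the same pair.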