## Load Data and Library

1. What are the top 3 and the bottom 3 states in terms of number of users?
2. What are the top 3 and the bottom 3 states in terms of user engagement? You can choose how to mathematically define user engagement. What the CEO cares about here is in which states users are using the product a lot/very little.
3. The CEO wants to send a gift to the first user who signed up in each state. That is, the first user who signed up from California, from Oregon, etc. Can you give him a list of those users?
4. Build a function that takes as input any of the songs in the data and returns the song most likely to be listened to next. That is, if, for instance, a user is currently listening to "Eight Days A Week", which song has the highest probability of being played right after it by the same user? This is going to be v1 of a song recommendation model.
5. How would you set up a test to check whether your model works well and is improving engagement?

In [79]:
import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
%matplotlib inline

In [4]:
data = pd.read_json('song.json')
data['time_played'] = pd.to_datetime(data['time_played'])
data['user_sign_up_date'] = pd.to_datetime(data['user_sign_up_date'])
data.head()

,id,user_id,user_state,user_sign_up_date,song_played,time_played
0,GOQMMKSQQH,122,Louisiana,2015-05-16,Hey Jude,2015-06-11 21:51:35
1,HWKKBQKNWI,3,Ohio,2015-05-01,We Can Work It Out,2015-06-06 16:49:19
2,DKQSXVNJDH,35,New Jersey,2015-05-04,Back In the U.S.S.R.,2015-06-14 02:11:29
3,HLHRIDQTUW,126,Illinois,2015-05-16,P.s. I Love You,2015-06-08 12:26:10
4,SUKJCSBCYW,6,New Jersey,2015-05-01,Sgt. Pepper's Lonely Hearts Club Band,2015-06-28 14:57:00


## Exploratory Data Analysis

In [6]:
data.isnull().sum()

id                   0
user_id              0
user_state           0
user_sign_up_date    0
song_played          0
time_played          0
dtype: int64

In [7]:
#check unique values
data.nunique() #user_ids repeat across rows: 196 distinct users and 100 distinct songs

id                   4000
user_id               196
user_state             41
user_sign_up_date      20
song_played           100
time_played          3997
dtype: int64

In [32]:
def unique_count(x):
    return len(np.unique(x))

In [33]:
#What are the top 3 and the bottom 3 states in terms of number of users?
state = pd.DataFrame(data.groupby('user_state')['user_id'].apply(unique_count))
state = state.sort_values(by='user_id',ascending = True)

In [34]:
state.head(3)

user_state,user_id
Arizona,1
New Mexico,1
Connecticut,1


In [35]:
state.tail(3)

user_state,user_id
Texas,15
California,21
New York,23
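
The ranking above can also be sketched with pandas built-ins instead of a custom helper. This is a minimal illustration on a hypothetical toy play log (the rows and the two-letter state codes are made up, not taken from `song.json`); it uses `n=2` only because the toy data has four states.

```python
import pandas as pd

# Hypothetical toy play log: one row per play, users may repeat.
plays = pd.DataFrame({
    "user_state": ["NY"] * 5 + ["CA"] * 3 + ["TX"] * 3 + ["OR"],
    "user_id":    [1, 2, 3, 4, 1,   5, 6, 7,   8, 9, 8,   10],
})

# Distinct users per state (repeat plays by the same user count once).
users_per_state = plays.groupby("user_state")["user_id"].nunique()

# Largest and smallest states by distinct users.
top = users_per_state.nlargest(2)
bottom = users_per_state.nsmallest(2)

print(top)
print(bottom)
```

With ties, as among the single-user states in the real data, `nsmallest` keeps the first occurrences by default, so the bottom 3 is not unique.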


What are the top 3 and the bottom 3 states in terms of user engagement?

In [36]:
#define user engagement as the average number of plays per user in each state
#first step: total number of plays per state
songs_played = pd.DataFrame(data.groupby('user_state')['id'].count())
songs_played.head()

user_state,id
Alabama,104
Alaska,58
Arizona,22
Arkansas,34
California,425


In [37]:
state = state.sort_values(by='user_state',ascending = True)

In [38]:
state

user_state,user_id
Alabama,4
Alaska,2
Arizona,1
Arkansas,2
California,21
Colorado,3
Connecticut,1
Florida,7
Georgia,6
Idaho,1


In [51]:
#plays per user in each state; name the ratio before merging so no auto-generated column needs renaming
engagement = (songs_played['id']/state['user_id']).rename('engagement').reset_index()
data = data.merge(engagement, on = 'user_state')


In [59]:
data.head()

,id,user_id,user_state,user_sign_up_date,song_played,time_played,engagement
0,GOQMMKSQQH,122,Louisiana,2015-05-16,Hey Jude,2015-06-11 21:51:35,21.0
1,YLAXXRCAOR,193,Louisiana,2015-05-20,Reprise / Day in the Life,2015-06-24 18:22:24,21.0
2,EBICGVXPFG,122,Louisiana,2015-05-16,Birthday,2015-06-25 17:43:51,21.0
3,LTLJOPWZWS,193,Louisiana,2015-05-20,Revolution,2015-06-05 22:30:48,21.0
4,VFGACYYOVG,193,Louisiana,2015-05-20,Yesterday,2015-06-28 14:24:51,21.0
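
To get from the per-state engagement value to the top and bottom states, the states still need to be ranked. A minimal sketch on hypothetical toy data (column names follow the notebook, but the rows and the helper names `n_plays` and `n_users` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy play log: one row per play.
plays = pd.DataFrame({
    "id": [f"p{i}" for i in range(11)],
    "user_state": ["NY"] * 6 + ["CA"] * 4 + ["TX"],
    "user_id":    [1, 1, 1, 2, 2, 2,   3, 3, 4, 4,   5],
})

# Plays and distinct users per state in one pass.
per_state = plays.groupby("user_state").agg(
    n_plays=("id", "count"),
    n_users=("user_id", "nunique"),
)

# Engagement defined as average plays per user, as in the notebook.
per_state["engagement"] = per_state["n_plays"] / per_state["n_users"]

# Most engaged states first; head(3) / tail(3) give top and bottom 3 on real data.
ranked = per_state["engagement"].sort_values(ascending=False)
print(ranked)
```

States with very few users produce noisy averages under this definition, so the extremes of the ranking are worth reading alongside `n_users`.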


The CEO wants to send a gift to the first user who signed up in each state. That is, the first user who signed up from California, from Oregon, etc. Can you give him a list of those users?

In [65]:
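#note: min() is applied to each column separately, so the returned user_id matches the earliest
#sign-up date only if user_ids were assigned in sign-up order (which appears to hold in this data)
early_signup = data.groupby('user_state')[['user_id','user_sign_up_date']].min().reset_index()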

In [67]:
early_signup.sort_values(by='user_sign_up_date')

,user_state,user_id,user_sign_up_date
0,Alabama,5,2015-05-01
35,Texas,7,2015-05-01
30,Oregon,1,2015-05-01
28,Ohio,3,2015-05-01
26,North Carolina,2,2015-05-01
24,New Mexico,4,2015-05-01
23,New Jersey,6,2015-05-01
31,Pennsylvania,11,2015-05-02
25,New York,10,2015-05-02
19,Minnesota,8,2015-05-02
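
Taking the minimum of `user_id` and of `user_sign_up_date` separately only identifies the right user when ids grow with sign-up date. A sketch that does not rely on that, using hypothetical toy rows in which the earliest user in each state does not have the smallest id (breaking same-day ties by smaller `user_id` is an assumption of this sketch):

```python
import pandas as pd

# Hypothetical users: in CA the earliest sign-up (user 9) does not have the smallest id.
users = pd.DataFrame({
    "user_state": ["CA", "CA", "OR", "OR"],
    "user_id": [9, 2, 5, 7],
    "user_sign_up_date": pd.to_datetime(
        ["2015-05-01", "2015-05-03", "2015-05-04", "2015-05-02"]
    ),
})

# Sort by date (then id for ties) and keep the first row seen for each state.
first_signup = (
    users.sort_values(["user_sign_up_date", "user_id"])
         .drop_duplicates("user_state", keep="first")
         .sort_values("user_state")
         .reset_index(drop=True)
)
print(first_signup)
```

On this toy data a column-wise `min()` would pair user 2 with 2015-05-01 for CA, a combination that exists in no row.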


## Modeling

Build a function that takes as input any of the songs in the data and returns the song most likely to be listened to next. That is, if, for instance, a user is currently listening to "Eight Days A Week", which song has the highest probability of being played right after it by the same user? This is going to be v1 of a song recommendation model.

In [71]:
#build a song-by-user matrix (1 if the user ever played the song, else 0)
song_user = data.groupby(['song_played','user_id'])['id'].count().unstack()
song_user = (song_user>0).astype(int)

In [72]:
song_user.head()

song_played,1,2,3,4,5,6,7,8,9,10,...,191,192,193,194,195,196,197,198,199,200
A Day In The Life,0,0,1,1,0,1,0,0,0,0,...,0,0,1,1,0,1,0,0,1,0
A Hard Day's Night,0,0,0,0,0,1,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
A Saturday Club Xmas/Crimble Medley,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ANYTIME AT ALL,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Across The Universe,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [76]:
#calculate the song-song cosine similarity matrix
song_user_norm = normalize(song_user, axis = 1) #L2-normalize each song's row
similarity = np.dot(song_user_norm, song_user_norm.T)
song_similarity = pd.DataFrame(similarity,index = song_user.index, columns = song_user.index)

In [77]:
song_similarity

song_played,A Day In The Life,A Hard Day's Night,A Saturday Club Xmas/Crimble Medley,ANYTIME AT ALL,Across The Universe,All My Loving,All You Need Is Love,And Your Bird Can Sing,BAD BOY,BALLAD OF JOHN AND YOKO,...,We Can Work It Out,When I'm 64,While My Guitar Gently Weeps,Wild Honey Pie,With a Little Help From My Friends,YOUR MOTHER SHOULD KNOW,Yellow Submarine,Yesterday,You Never Give Me Your Money,You're Going To Lose That Girl
A Day In The Life,1.000000,0.264392,0.139347,0.148968,0.132196,0.301023,0.295599,0.098533,0.197066,0.201129,...,0.516528,0.056888,0.578459,0.279852,0.399723,0.088131,0.330489,0.365433,0.164222,0.0
A Hard Day's Night,0.264392,1.000000,0.000000,0.000000,0.100000,0.146385,0.111803,0.000000,0.000000,0.091287,...,0.305788,0.129099,0.266996,0.000000,0.000000,0.000000,0.050000,0.215003,0.074536,0.0
A Saturday Club Xmas/Crimble Medley,0.139347,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.161165,0.000000,0.000000,0.182574,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
ANYTIME AT ALL,0.148968,0.000000,0.000000,1.000000,0.000000,0.164957,0.094491,0.125988,0.000000,0.000000,...,0.172292,0.000000,0.188044,0.097590,0.191663,0.000000,0.000000,0.103835,0.000000,0.0
Across The Universe,0.132196,0.100000,0.000000,0.000000,1.000000,0.097590,0.000000,0.000000,0.000000,0.000000,...,0.101929,0.000000,0.133498,0.000000,0.000000,0.000000,0.000000,0.061430,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
YOUR MOTHER SHOULD KNOW,0.088131,0.000000,0.000000,0.000000,0.000000,0.195180,0.111803,0.149071,0.000000,0.182574,...,0.152894,0.258199,0.133498,0.000000,0.151186,1.000000,0.100000,0.122859,0.000000,0.0
Yellow Submarine,0.330489,0.050000,0.000000,0.000000,0.000000,0.243975,0.111803,0.223607,0.000000,0.000000,...,0.254824,0.000000,0.289246,0.173205,0.188982,0.100000,1.000000,0.215003,0.074536,0.0
Yesterday,0.365433,0.215003,0.000000,0.103835,0.061430,0.209822,0.274721,0.137361,0.091574,0.112154,...,0.422650,0.237915,0.464708,0.283731,0.325054,0.122859,0.215003,1.000000,0.228934,0.0
You Never Give Me Your Money,0.164222,0.074536,0.000000,0.000000,0.000000,0.072739,0.166667,0.000000,0.111111,0.000000,...,0.189934,0.000000,0.232175,0.086066,0.112687,0.000000,0.074536,0.228934,1.000000,0.0
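
Normalizing each song's row to unit L2 length and taking dot products yields cosine similarity between songs. A small NumPy-only sketch on a hypothetical 4-song by 4-user matrix makes the arithmetic easy to check by hand (the matrix is made up, and the sketch assumes no song has an all-zero row):

```python
import numpy as np

# Hypothetical binary matrix: rows = songs, columns = users (1 = user played the song).
listened = np.array([
    [1, 1, 0, 0],   # song 0
    [1, 1, 0, 0],   # song 1: same listeners as song 0
    [0, 0, 1, 1],   # song 2: no listeners in common with song 0
    [1, 0, 1, 0],   # song 3: shares one listener with song 0
], dtype=float)

# L2-normalize each row, then dot products between rows are cosine similarities.
norms = np.linalg.norm(listened, axis=1, keepdims=True)
unit_rows = listened / norms
cosine = unit_rows @ unit_rows.T

print(np.round(cosine, 3))
```

Songs 0 and 1 come out at 1.0, songs 0 and 2 at 0.0, and songs 0 and 3 at 1 / (√2 · √2) = 0.5.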


In [88]:
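def top_k(song, similarity, k=1):
    #rank all songs by similarity to `song`; the first row is `song` itself (similarity 1.0)
    df = similarity.loc[song].sort_values(ascending = False)[:k+1].reset_index()
    #the similarity column keeps the queried song's title as its name
    df = df.rename(columns={'song_played':'song'})
    return df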

In [91]:
df = top_k(song = 'A Day In The Life',similarity = song_similarity,k = 10)

In [92]:
df

,song,A Day In The Life
0,A Day In The Life,1.0
1,Revolution,0.705327
2,Come Together,0.691885
3,Get Back,0.671014
4,Hello Goodbye,0.610658
5,Back In the U.S.S.R.,0.607872
6,Let It Be,0.594578
7,Hey Jude,0.591295
8,Lucy In The Sky With Diamonds,0.580249
9,While My Guitar Gently Weeps,0.578459
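
The similarity ranking measures which songs share listeners, not which song is played right after another. An alternative that targets the question more directly is to count, per user and in time order, how often each song is immediately followed by each other song. The sketch below is a swapped-in technique, not the notebook's method; the function names `build_transitions` and `most_likely_next` and the toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy play log using the notebook's column names.
plays = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "song_played": ["A", "B", "C", "A", "B", "A"],
    "time_played": pd.to_datetime([
        "2015-06-01 10:00", "2015-06-01 10:05", "2015-06-01 10:10",
        "2015-06-01 11:00", "2015-06-01 11:05", "2015-06-01 11:10",
    ]),
})

def build_transitions(plays):
    """Count (song, next_song) pairs played consecutively by the same user."""
    ordered = plays.sort_values(["user_id", "time_played"])
    # Next song within each user's own history; NaN after a user's last play.
    next_song = ordered.groupby("user_id")["song_played"].shift(-1)
    pairs = pd.DataFrame(
        {"song": ordered["song_played"], "next_song": next_song}
    ).dropna()
    return pairs.groupby(["song", "next_song"]).size()

def most_likely_next(song, transitions):
    """Return the most frequent follower of `song`, or None if none was observed."""
    if song not in transitions.index.get_level_values("song"):
        return None
    return transitions.loc[song].idxmax()

transitions = build_transitions(plays)
print(most_likely_next("A", transitions))
```

With about 4000 plays spread over 100 songs, many pairs will be observed rarely or never, so falling back to the similarity ranking when a song has no observed follower is one reasonable design choice.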
