<a href="https://colab.research.google.com/github/Bryan-Az/Neurobytes/blob/notebooks/mlops/notebooks/users_eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Data Analysis of the Synthetic User Preferences

This synthetic dataset was generated from the Kaggle 'Spotify Million Song' dataset.

The original data contains the features:

- 'artist': The name of the artist
- 'song': The name of the song
- 'text': The song lyrics
- 'link': The link to the song via the Spotify API

Because the original dataset is very large, a sample of 10,000 rows was drawn from it to compute the synthetic user preferences. A document-term matrix was then built from this sample, treating each song as a document and the words in its lyrics as the terms.
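
As a rough illustration of the document-term matrix idea, the sketch below counts words per song using only the standard library. The tokenizer, the function names `tokenize` and `document_term_matrix`, and the toy lyrics are all assumptions made for this example; the notebook that produced the real dataset may have used a different tokenizer or a library vectorizer.

```python
from collections import Counter
import re

def tokenize(lyrics):
    """Lowercase the lyrics and split them into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", lyrics.lower())

def document_term_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of every term seen in any document.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Toy lyrics, invented for illustration only.
toy_lyrics = [
    "love me love me",
    "rain on me",
    "love the rain",
]
vocabulary, dtm = document_term_matrix(toy_lyrics)
```

Each row of `dtm` describes one song and each column one term, so two songs with similar lyrics end up with similar rows.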

Cosine similarity was then computed between every pair of songs, based on their lyrics. These similarities were used to generate the synthetic user data: a 'starter' song was sampled for each user, and the 3 songs most similar to that starter were taken as the user's sample preferences. Finally, the following two columns were added to a new user preferences dataset:

- 'songID': a many-to-one foreign key into the original song dataset.
- 'userID': a many-to-one user identifier, shared by every row that belongs to the same user.
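
The generation procedure described above can be sketched roughly as follows. This is a minimal standard-library version under stated assumptions: the helper names (`cosine_similarity`, `top_similar`, `synthetic_preferences`), the fixed random seed, the tie-breaking rule, and the choice to record only the similar songs (not the starter itself) are illustrative and are not taken from the original generation code.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_similar(song_id, dtm, k=3):
    """Return the IDs of the k songs most similar to song_id, excluding itself."""
    scores = [
        (cosine_similarity(dtm[song_id], row), other_id)
        for other_id, row in enumerate(dtm)
        if other_id != song_id
    ]
    # Sort by descending similarity; ties are broken by ascending song ID.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scores[:k]]

def synthetic_preferences(dtm, n_users, k=3, seed=0):
    """Build (userID, songID) rows: sample a starter song per user, keep its top-k neighbours."""
    rng = random.Random(seed)
    rows = []
    for user_id in range(n_users):
        starter = rng.randrange(len(dtm))
        for song_id in top_similar(starter, dtm, k):
            rows.append({"userID": user_id, "songID": song_id})
    return rows

# Toy document-term matrix (rows are songs, columns are terms), invented for illustration.
toy_dtm = [
    [2, 2, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [2, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
preferences = synthetic_preferences(toy_dtm, n_users=2)
```

Each resulting row pairs a `userID` with a `songID`, which is the shape of the two columns listed above. On a sample of 10,000 songs, a vectorized implementation would be far faster than these nested Python loops.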




In [21]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import pandas as pd

In [22]:
# Authenticate this Colab session, then reuse its default credentials for PyDrive.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [23]:
users_glink = 'https://drive.google.com/file/d/1fw3kszKGlaz4x6oMgCMNmY_s3ejwdGxT/view?usp=drive_link'
filename = 'user_preferences.csv'

In [24]:
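def get_link_content(link, name):
  """Download the Drive file behind a share link and save it locally as `name`."""
  # Share links look like .../file/d/<file_id>/view?..., so the ID is the second-to-last segment.
  file_id = link.split('/')[-2]
  downloaded = drive.CreateFile({'id': file_id})
  downloaded.GetContentFile(name)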
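
Taking the second-to-last path segment works for share links that end in `/view?...`, but it would pick the wrong segment if the link had no trailing segment after the ID. A slightly more defensive way to pull out the file ID might look like the sketch below. The helper name `extract_drive_file_id` is hypothetical, and the two URL shapes it handles (`/d/<id>` and `?id=<id>`) are assumed to cover the common cases rather than every format Drive can produce.

```python
import re

def extract_drive_file_id(link):
    """Return the file ID from a Google Drive share link, or raise ValueError."""
    # Try the /file/d/<id>/... shape first, then fall back to ...?id=<id>.
    match = re.search(r"/d/([\w-]+)", link) or re.search(r"[?&]id=([\w-]+)", link)
    if match is None:
        raise ValueError(f"No Drive file ID found in link: {link!r}")
    return match.group(1)

file_id = extract_drive_file_id(
    "https://drive.google.com/file/d/1fw3kszKGlaz4x6oMgCMNmY_s3ejwdGxT/view?usp=drive_link"
)
```

For the link used in this notebook, the helper returns the same ID as `link.split('/')[-2]`.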

In [25]:
get_link_content(users_glink, filename)

In [26]:
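# Load the CSV (its first column is the saved row index), order rows by user, and fix the column order.
user_preferences = pd.read_csv(filename, index_col=0).sort_values(by='userID').loc[:, ['userID', 'songID', 'artist', 'song', 'text', 'link']]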

In [27]:
user_preferences.sample(10)

| Unnamed: 0 | userID | songID | artist | song | text | link |
|---|---|---|---|---|---|---|
| 56910 | 318 | 8120 | Ice Cube | Rollin' Wit The Lench Mob | You can't fuck with the criminal rapping over ... | /i/ice+cube/rollin+wit+the+lench+mob_20066630.... |
| 131631 | 738 | 32258 | Fabolous | Funkmaster Flex Freestyle | Feel like I'm Big Meech \r\nWhen I be cashing... | /f/fabolous/funkmaster+flex+freestyle_20887392... |
| 79562 | 445 | 16402 | Prince | Breakdown | Listen to me closely as the story unfolds \r\... | /p/prince/breakdown_21087577.html |
| 137563 | 771 | 36431 | Ice Cube | Tales From The Darkside | Verse one: ice cube \r\n \r\nPeace. Haha don... | /i/ice+cube/tales+from+the+darkside_20066631.html |
| 162745 | 912 | 43869 | Michael W. Smith | Cover Me | In the darkness and in the flood \r\nYou're t... | /m/michael+w+smith/cover+me_20486294.html |
| 42363 | 236 | 43657 | Michael Buble | Frosty The Snowman | Frosty the snowman was a jolly happy soul, \r... | /m/michael+buble/frosty+the+snowman_20988329.html |
| 18305 | 103 | 49685 | R. Kelly | A Woman's Threat | My time, my patience, my love \r\nMy blood, m... | /r/r+kelly/a+womans+threat_20112981.html |
| 35575 | 199 | 43127 | Maroon 5 | Keep On Rockin' In The Free World | [Neil Young cover] \r\n \r\nColors on the st... | /m/maroon+5/keep+on+rockin+in+the+free+world_2... |
| 146840 | 823 | 45850 | Nina Simone | Feeling Good | Birds flying high you know how I feel \r\nSun... | /n/nina+simone/feeling+good_20100629.html |
| 177783 | 996 | 51128 | Roy Orbison | Never | A new sweet song of love has got you off my mi... | /r/roy+orbison/never_20119005.html |


In [28]:
# Average number of non-null rows per user, reported separately for each column.
user_preferences.groupby('userID').count().mean()

songID    178.34
artist    178.34
song      178.34
text      178.34
link      178.34
dtype: float64

In [29]:
user_preferences.shape

(178340, 6)
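
The per-user mean and the total row count reported above are linked: when no values are missing, the mean number of rows per user equals the total number of rows divided by the number of distinct users. The sketch below checks that identity on a small toy frame (invented for illustration, not drawn from the real data) and then applies it to the two reported figures, which suggests the dataset holds about 1,000 users.

```python
import pandas as pd

# Toy stand-in for user_preferences, invented for illustration.
toy = pd.DataFrame({
    "userID": [0, 0, 0, 1, 1],
    "songID": [10, 11, 12, 10, 13],
})

mean_rows_per_user = toy.groupby("userID")["songID"].count().mean()
n_users = toy["userID"].nunique()

# With no missing values, mean rows per user equals total rows / number of users.
assert mean_rows_per_user == len(toy) / n_users

# Applying the same identity to the figures reported above (178,340 rows, mean of 178.34).
implied_users = round(178340 / 178.34)
```

On the real data, `user_preferences['userID'].nunique()` would confirm the user count directly.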