<a href="https://colab.research.google.com/github/Bryan-Az/Neurobytes/blob/notebooks/mlops/notebooks/users_eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Data Analysis of the Synthetic User Preferences

This synthetic dataset was generated from the Kaggle 'Spotify Million Song' dataset.

The original data contains the features:

- 'artist': The name of the artist
- 'song': The name of the song
- 'text': The song lyrics
- 'link': The link to the song via the Spotify API

Because the original dataset is very large, a sample of 10,000 rows was used to calculate the synthetic user preferences. A document-term matrix was then computed from the sample, treating each song as a document and the words in its lyrics as the terms.
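
As a rough illustration of the document-term matrix described above, here is a minimal sketch using only the Python standard library. The notebook does not say which library or tokenizer was used to build the real matrix, so the function name `build_document_term_matrix`, the simple regex tokenizer, and the toy lyrics are all assumptions made for this example.

```python
from collections import Counter
import re


def build_document_term_matrix(lyrics):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    # Assumed tokenizer: lowercase, then keep runs of letters and apostrophes.
    tokenized = [re.findall(r"[a-z']+", text.lower()) for text in lyrics]
    # One column per distinct term, in alphabetical order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term, count in Counter(tokens).items():
            row[column[term]] = count
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical toy lyrics, not taken from the dataset.
toy_lyrics = ["love me love you", "people need love", "dance all night"]
vocabulary, dtm = build_document_term_matrix(toy_lyrics)
```

Each row of `dtm` corresponds to one song and each column to one word in `vocabulary`, so the first toy song gets a count of 2 in the `love` column.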

The cosine similarity between songs was then calculated from their lyrics. Synthetic user data was generated from these similarities by sampling a 'starter' song for each user and taking the 3 songs most similar to it as that user's preferences. Finally, the following two columns were added to a new user preferences dataset:

- 'songID': a many-to-one foreign key to the original song dataset.
- 'userID': the ID of the synthetic user; many rows map to one user, one row per preferred song.
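
The sampling procedure described above could be sketched roughly as follows. This is not the code that produced `user_preferences.csv`: the function names, the fixed random seed, the toy matrix, and the choice to keep the starter song alongside its 3 nearest neighbours (which would give 4 rows per user) are all assumptions for illustration.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def make_user_preferences(dtm, n_users, top_k=3, seed=0):
    """Build (userID, songID) rows: one random starter song plus its top_k most similar songs."""
    rng = random.Random(seed)
    rows = []
    for user_id in range(n_users):
        starter = rng.randrange(len(dtm))
        # Rank every other song by similarity to the starter, most similar first.
        others = [s for s in range(len(dtm)) if s != starter]
        others.sort(key=lambda s: cosine_similarity(dtm[starter], dtm[s]), reverse=True)
        # Assumption: the starter itself is kept as one of the user's songs.
        for song_id in [starter] + others[:top_k]:
            rows.append({"userID": user_id, "songID": song_id})
    return rows


# Hypothetical document-term matrix: 5 songs, 4 terms.
toy_dtm = [
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 2],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]
preferences = make_user_preferences(toy_dtm, n_users=2)
```

Here `songID` is the row position in the toy matrix, standing in for the foreign key into the song dataset.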




In [1]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import pandas as pd



In [2]:
# Authenticate the Colab user, then build a PyDrive client from those credentials.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [3]:
users_glink = 'https://drive.google.com/file/d/1fw3kszKGlaz4x6oMgCMNmY_s3ejwdGxT/view?usp=drive_link'
filename = 'user_preferences.csv'

In [4]:
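def get_link_content(link, name):
    """Download the Drive file behind a '.../d/<file_id>/view' link to the local file `name`."""
    # The file ID is the second-to-last path segment of the sharing link.
    file_id = link.split('/')[-2]
    downloaded = drive.CreateFile({'id': file_id})
    downloaded.GetContentFile(name)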

In [5]:
get_link_content(users_glink, filename)

In [15]:
# Load the downloaded CSV, order the rows by user, and fix the column order.
user_preferences = pd.read_csv(filename, index_col=0).sort_values(by='userID').loc[:, ['userID', 'songID', 'artist', 'song', 'text', 'link']]

In [19]:
user_preferences.sample(10)

| Unnamed: 0 | userID | songID | artist | song | text | link |
|---|---|---|---|---|---|---|
| 3466 | 866 | 329 | Air Supply | Speaking Of Love | I've been in love, I've walked alone \r\nI tu... | /a/air+supply/speaking+of+love_20004974.html |
| 212 | 53 | 1019 | Barbie | Two Voices One Song | It's so rare to find a friend like you \r\nSo... | /b/barbie/two+voices+one+song_20907675.html |
| 3842 | 960 | 71 | ABBA | People Need Love | People need hope, people need lovin' \r\nPeop... | /a/abba/people+need+love_20002847.html |
| 1042 | 260 | 1099 | Barbra Streisand | I'd Rather Be Blue Over You (Than Happy With S... | I'd rather be blue, thinking of you \r\nI'd r... | /b/barbra+streisand/id+rather+be+blue+over+you... |
| 1873 | 468 | 430 | Alabama | Pass It On Down | We live in the land of plenty \r\nBut many th... | /a/alabama/pass+it+on+down_20005129.html |
| 2522 | 630 | 312 | Air Supply | My Hearts With You | I know you're leaving \r\nI won't ask you to ... | /a/air+supply/my+hearts+with+you_20004877.html |
| 174 | 43 | 44 | ABBA | I've Been Waiting For You | I, I've been in love before \r\nI thought I w... | /a/abba/ive+been+waiting+for+you_20002832.html |
| 453 | 113 | 1310 | Bette Midler | Have Yourself A Merry Little Christmas | Have yourself a merry little Christmas, \r\nL... | /b/bette+midler/have+yourself+a+merry+little+c... |
| 742 | 185 | 1077 | Barbra Streisand | Have Yourself A Merry Little Christmas | Christmas' future is far away \r\nChristmas' ... | /b/barbra+streisand/have+yourself+a+merry+litt... |
| 2039 | 509 | 1274 | Bee Gees | Alive | Maybe you talk too high, man. \r\nMaybe I tal... | /b/bee+gees/alive_20015780.html |


In [18]:
user_preferences.groupby('userID').count().mean()

songID    4.0
artist    4.0
song      4.0
text      4.0
link      4.0
dtype: float64