# Sparkify Sample dataset notebook
This notebook contains steps of exploration, processing and modeling with a tiny subset (128MB) of the full dataset available (12GB). Full dataset is treated separately in the notebook on AWS platform.

In [6]:
# import libraries
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, DateType
from pyspark.sql.window import Window

from pyspark.ml.feature import CountVectorizer, IDF, CountVectorizerModel
from pyspark.ml.feature import OneHotEncoder, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.pipeline import PipelineModel
from datetime import datetime

import pandas as pd
import numpy as np
from itertools import chain
from typing import Dict
import sweetviz as sv

In [7]:
# create a Spark session
spark = SparkSession \
    .builder \
    .appName("Sparkify") \
    .getOrCreate()

In [8]:
# rimestamp coefficient
TS_COEF = 1000*60*60*24

# today date
TODAY = str(datetime.today().date())

# Load and Clean Dataset
In this notebook we load the mini-dataset from locally stored file `mini_sparkify_event_data.json`

In [9]:
# Read in full sparkify dataset
event_data = "mini_sparkify_event_data.json"
df = spark.read.json(event_data)
df.head()

Row(artist='Martha Tilston', auth='Logged In', firstName='Colin', gender='M', itemInSession=50, lastName='Freeman', length=277.89016, level='paid', location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Rockpools', status=200, ts=1538352117000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30')

# Exploratory Data Analysis
Since we are looking at a small subset, it's quite convenient to perform EDA using pandas.
Our analysis consists of 3 steps:
* Explore Data
* Define Churn
* Explore churned vs stayed users

#### Explore Data
I used [sweetviz](https://pypi.org/project/sweetviz/) package to visualize data and make first observations. At this stage we identify the structure of each column, check the nulls and ranges/lists of column values.

#### Define Churn
I create a column `churn` to use as the label for your model. I used the `Cancellation Confirmation` events to define the churn, which happen for both paid and free users.

#### Explore churned vs stayed users
Once we've defined churn, we run exploratory data analysis by comparing users who stayed vs users who churned. This is important for the next stage of feature engineering. Looking at major differences, we define the logic for user-level features.

### EDA observations
I convert Spark dataframe into pandas dataframe to run EDA with more flexibility. Using `sweetviz` I look at the major properties of each column. 

Here are the **first observations**:
1. There are 225 registered users in the dataset and 2354 sessions during 63 days. 97% of records cover the events for these users and only 3% include the data about the guests.
2. For guest users (`auth='Guest'`) we don't have neiver songs data or user demographics data, nor *userId*. Their page visits are limited to: Home, Help, Register, About, Submit Registration, Error. We exclude guest users from model dataset.
3. There are 3% of records with `auth='Logged Out'`, which include Home, Login, About, Help and Error events. There is no *userId* data for these events, so we exclude them from modelling dataset.
3. 80% of records describe NextSong event and include artist and song data. 20% of events cover all over possible actions.
4. We have 52 cancellation events, which are described by `auth='Cancelled'` and `page='Cancellation Confirmation'`. There are 52 unique userId, who cancelled subscription. So this event is unique per user.]

We **define Churn** as a fact of cancellation of subscription from existing user. The fact of cancellation is translated through 2 columns: `auth='Cancelled'` and `page='Cancellation Confirmation'`, which are uniquely defined, so we can use any of 2 to define the target. Let's use `page='Cancellation Confirmation'` as our target. 

Before moving to the Feature engineering step, let's **compare** behaviour of **churned users VS stayed users**. Here are some observations:
1. Among those who churn there are more males (57% M / 43% F), and vice versa, there are more women among those users, who stay subscrbed (42% M / 58% F).
2. Churned users usually have smaller number of items in session (churn median = 66 VS other median = 71).
3. Among churned users there are more free users (28% in churn VS 19% in other)
4. From page events statistics we see that among churned there are less *Thumbs Up*, more *Thumbs Down*, almost two times higher frequency of *Roll Advert*.
5. Churned users have smaller lifetime.

In [6]:
pandas_data = df.toPandas()
pandas_data[:2]

Unnamed: 0,artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userAgent,userId
0,Martha Tilston,Logged In,Colin,M,50,Freeman,277.89016,paid,"Bakersfield, CA",PUT,NextSong,1538173000000.0,29,Rockpools,200,1538352117000,Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) G...,30
1,Five Iron Frenzy,Logged In,Micah,M,79,Long,236.09424,free,"Boston-Cambridge-Newton, MA-NH",PUT,NextSong,1538332000000.0,8,Canada,200,1538352180000,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",9


In [7]:
# check number of records and columns in the subset
pandas_data.shape

(286500, 18)

In [8]:
# explore time range of data provided
print('Earliest date is', pd.to_datetime(pandas_data['ts'], unit='ms').dt.date.min())
print('Last date is', pd.to_datetime(pandas_data['ts'], unit='ms').dt.date.max())
print('Total number of days:', (pandas_data['ts'].max() - pandas_data['ts'].min())//(TS_COEF))

Earliest date is 2018-10-01
Last date is 2018-12-03
Total number of days: 63


In [10]:
# generate sweetviz report
analysis = sv.analyze([pandas_data, 'sample_data'])
analysis.show_html('./EDA_reports/sample_data_overview.html')

Report ./EDA_reports/sample_data_overview.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


In [9]:
# explore data structure for guest visitors
pandas_data[pandas_data['auth']=='Guest'].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 97 entries, 97633 to 199445
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   artist         0 non-null      object 
 1   auth           97 non-null     object 
 2   firstName      0 non-null      object 
 3   gender         0 non-null      object 
 4   itemInSession  97 non-null     int64  
 5   lastName       0 non-null      object 
 6   length         0 non-null      float64
 7   level          97 non-null     object 
 8   location       0 non-null      object 
 9   method         97 non-null     object 
 10  page           97 non-null     object 
 11  registration   0 non-null      float64
 12  sessionId      97 non-null     int64  
 13  song           0 non-null      object 
 14  status         97 non-null     int64  
 15  ts             97 non-null     int64  
 16  userAgent      0 non-null      object 
 17  userId         97 non-null     object 
dtypes: f

In [10]:
# check what actions are available for guest users
pandas_data[pandas_data['auth']=='Guest']['page'].value_counts()

Home                   36
Help                   23
Register               18
About                  14
Submit Registration     5
Error                   1
Name: page, dtype: int64

In [11]:
# check the events associated with empty song data
pandas_data[pandas_data['song'].isnull()]['page'].value_counts()

Home                         14457
Thumbs Up                    12551
Add to Playlist               6526
Add Friend                    4277
Roll Advert                   3933
Login                         3241
Logout                        3226
Thumbs Down                   2546
Downgrade                     2055
Help                          1726
Settings                      1514
About                          924
Upgrade                        499
Save Settings                  310
Error                          258
Submit Upgrade                 159
Submit Downgrade                63
Cancellation Confirmation       52
Cancel                          52
Register                        18
Submit Registration              5
Name: page, dtype: int64

In [12]:
# check the events associated with full song data
pandas_data[~pandas_data['song'].isnull()]['page'].value_counts()

NextSong    228108
Name: page, dtype: int64

In [13]:
# check page events associated with Cancelled status
pandas_data[pandas_data['auth']=='Cancelled']['page'].value_counts()

Cancellation Confirmation    52
Name: page, dtype: int64

In [14]:
# check unique user and total count per different values of auth status
pandas_data.groupby('auth').agg({'userId': pd.Series.nunique,
                                 'ts': 'count'})

Unnamed: 0_level_0,userId,ts
auth,Unnamed: 1_level_1,Unnamed: 2_level_1
Cancelled,52,52
Guest,1,97
Logged In,225,278102
Logged Out,1,8249


In [17]:
# compare 2 subsets: churned users VS stayed users
known_users_df = pandas_data[pandas_data['userId']!=''].copy()

# add lifetime column
known_users_df['max_ts'] = known_users_df.groupby('userId')['ts'].transform('max')
known_users_df['lifetime'] = (known_users_df['max_ts']-known_users_df['registration'])/TS_COEF

# list of churned users
churned_uid_list = pandas_data[pandas_data['page']=='Cancellation Confirmation']['userId'].to_list()

report = sv.compare_intra(known_users_df, known_users_df['userId'].isin(churned_uid_list), ["Churn", "Stayed"])
report.show_html('./EDA_reports/Churn vs stayed.html')

Report ./EDA_reports/Churn vs stayed.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


In [18]:
# explore time range of cancellations
cancel_data = pandas_data[pandas_data['page']=='Cancellation Confirmation']

print('Earliest date of cancellation is', pd.to_datetime(cancel_data['ts'], unit='ms').dt.date.min())
print('Last date of cancellation is', pd.to_datetime(cancel_data['ts'], unit='ms').dt.date.max())
print('Total number of days between first and last:', (cancel_data['ts'].max() - cancel_data['ts'].min())//(TS_COEF))

Earliest date of cancellation is 2018-10-01
Last date of cancellation is 2018-11-29
Total number of days between first and last: 58


# Feature Engineering
Once you've familiarized yourself with the data, build out the features you find promising to train your model on. To work with the full dataset, you can follow the following steps.
- Write a script to extract the necessary features from the smaller subset of data
- Ensure that your script is scalable, using the best practices discussed in Lesson 3
- Try your script on the full data set, debugging your script if necessary

If you are working in the classroom workspace, you can just extract features based on the small subset of data contained here. Be sure to transfer over this work to the larger dataset when you work on your Spark cluster.

### Compile the modelling dataset
1. Exclude records with empty *userId*.
2. Add label: 1 = Churn, 0 = Not churn. Condition: `page='Cancellation Confirmation'`
3. Remove records of `page='Cancellation Confirmation'`.
4. Sort dataframe by `userId` and `ts`
5. Aggregate features at user level:
    * create list of songs
    * create list of artists
    * list of page events (Cancellation Confirmation and Cancel events preliminary filtered out to remove the leak)
    * session frequency
    * average number of songs per session
    * binary feature: Male gender = 1/0
    * binary feature: paid acoount = 1/0
    * lifetime (days): time difference between last activity and registration date

**Step 1**: Aggregate user-level properties

In [10]:
def preprocess_data(df: pyspark.sql.DataFrame) -> pyspark.sql.DataFrame:
    """
    Aggregate App data at user level and collects
    dataframe for further feature engineering steps
    =================
    Args:
        df (pyspark Dataframe) : data extraction from Sparkify
        
    Return:
        preprocessed pyspark dataframe
    """
    w = Window.partitionBy(df.userId).orderBy(df.ts)
    w_uid = Window.partitionBy(df.userId)

    preprocessed_df = (df
                       .filter(F.col('userId')!='') #filter out guests
                       .withColumn('cancelled', (F.col('page')=='Cancellation Confirmation').cast(IntegerType())) 
                       .withColumn('churn', F.max('cancelled').over(w_uid)) # define churn label
                       .withColumn('current_level', F.last('level').over(w)) # sort levels of subscription by date
                       .withColumn('last_userAgent', F.last('userAgent').over(w)) # sort agents by date
                       .filter(~F.col('page').isin(['Cancellation Confirmation',
                                                    'Cancel'])) #remove cancellation page events from dataset
                       .groupby('userId') # aggregate features at user level
                       .agg(F.collect_list('artist').alias('artist_list'), # combine into list all artist
                            F.collect_list('song').alias('song_list'), # combine into list all songs
                            F.collect_list('page').alias('page_list'), # combine into list all page events
                            F.countDistinct('sessionId').alias('session_count'), # calculate total number of sessions
                            F.count('song').alias('song_count'), # calculate total number of songs
                            F.first('gender').alias('gender'), # gender data
                            F.last('current_level').alias('current_level'), # take last level value
                            F.max('churn').alias('churn'), 
                            F.min('ts').alias('min_ts'), # start timestamp 
                            F.max('ts').alias('max_ts'), # end timestamp
                            F.last('last_userAgent').alias('last_userAgent'), # recent agent
                            F.min('registration').alias('registration') # registration date
                           )
                       # frequency of sessions
                       .withColumn('session_freq', F.col('session_count')/((F.col('max_ts')-F.col('min_ts'))/TS_COEF))
                       # avg number of songs per session
                       .withColumn('song_per_session', F.col('song_count')/F.col('session_count'))
                       # binary feature: Male = 1/0
                       .withColumn('gender_Male', (F.col('gender')=='M').cast(IntegerType()))
                       # binary feature: paid = 1/0
                       .withColumn('is_paid', (F.col('current_level')=='paid').cast(IntegerType()))
                       # lifetime
                       .withColumn('lifetime', (F.col('max_ts')-F.col('registration'))/TS_COEF)
                       # extract device/OS pointers from agent
                       .withColumn('agent_Windows', F.col('last_userAgent').contains('Windows').cast(IntegerType()))
                       .withColumn('agent_Mac', F.col('last_userAgent').contains('Mac').cast(IntegerType()))
                       .withColumn('agent_iPhone', F.col('last_userAgent').contains('iPhone').cast(IntegerType()))
                       .withColumn('agent_iPad', F.col('last_userAgent').contains('iPad').cast(IntegerType()))
                       .withColumn('agent_Linux', F.col('last_userAgent').contains('Linix').cast(IntegerType()))
                      ).cache()
    
    return preprocessed_df

In [11]:
preprocessed_df = preprocess_data(df)
preprocessed_df.count()

225

In [12]:
preprocessed_df.groupby('churn').count().toPandas()
# churn <--> 23%

Unnamed: 0,churn,count
0,1,52
1,0,173


**Step 2**: Prepare transformers to collect feature vector

Used features:
* Apply TF-IDF to artist list, song list and page list. We limit vocabSize to 100 elements
* Beside TF-IDF generated features keep session frequency, avg number of songs per session, lifetime, gender, paid, agent based features

In [13]:
def tf_idf_transformer(list_name: str,
                       vocabSize: int=100):
    """
    Combines TF and IDF pyspark transformers
    ------------
    
    Args:
        list_name (string) : prefix of the feature with work list in the format
            prefix_list
        vocabSize (int)    : number of top-output words to keep
    
    Returns:
        tf transformer, idf transformer
    """
    tf = CountVectorizer(inputCol=f"{list_name}_list", outputCol=f"TF_{list_name}", vocabSize=vocabSize)
    tf_idf = IDF(inputCol=f"TF_{list_name}", outputCol=f"TFIDF_{list_name}")
    return tf, tf_idf


artist_tf, artist_tf_idf = tf_idf_transformer('artist')
song_tf, song_tf_idf = tf_idf_transformer('song')
page_tf, page_tf_idf = tf_idf_transformer('page')

assembler = VectorAssembler(inputCols=["TFIDF_artist", "TFIDF_song", "TFIDF_page",
                                       "session_freq", "song_per_session", 
                                       "lifetime", "gender_Male", 
                                       "is_paid", "agent_Windows",
                                       "agent_Mac", "agent_iPhone", "agent_iPad", 
                                       "agent_Linux"], 
                            outputCol="features", 
                            handleInvalid="skip")


feature_pipeline = Pipeline(stages=[artist_tf, artist_tf_idf, 
                                   song_tf, song_tf_idf,
                                   page_tf, page_tf_idf,
                                   assembler
                                   ])

In [15]:
test = feature_pipeline.fit(preprocessed_df)
test_df = test.transform(preprocessed_df)
test_df.count()

225

**Step 3**: Extract feature names and vocabularies from feature transformers to use it later for feature importance analyzis 

In [16]:
# extract vocabularies for future explanations
stages = test.stages
vectorizers = [s for s in stages if isinstance(s, CountVectorizerModel)]
vocab_dict = {}
vocab_dict['artist'] = vectorizers[0].vocabulary
vocab_dict['song'] = vectorizers[1].vocabulary
vocab_dict['page'] = vectorizers[2].vocabulary


# extract feature names after vector assembler
attrs = sorted(
    (attr["idx"], attr["name"]) for attr in (chain(*test_df
        .schema["features"]
        .metadata["ml_attr"]["attrs"].values())))

# Modeling
We split the full dataset into train (70%) and test (30%). During cross-validation process train data is additionally split into train and validation subsets. Test data is used only to check the model (nexer seen during training).

We try 2 models:
* Random Forest Classifier
* Gradient Boosted Tree Classifier

Note: since we use tree-based models, we don't don't need to scale numerical features.
Our problem is imbalanced: 23% of positive cases (churn) and 67% of negative (stayed). Thus, we use F1-score to tune hyperparameters and check final quality of the model.

In [17]:
(train_data, test_data) = preprocessed_df.randomSplit([0.7, 0.3], seed=10)

# cache dataframes
train_data = train_data.cache()
test_data = test_data.cache()

In [18]:
def score_the_model(test_data: pyspark.sql.DataFrame, 
                    model: Pipeline, 
                    metric_name: str='accuracy'):
    """
    Calculate model score by metric given in metric_name
    --------------------
    Args:
        test_data (pyspark DataFrame): dataframe with test samples
        model (Pipeline)             : pretrained model
        metric_name (string)         : one of the MulticlassClassificationEvaluator metrics
    Return:
        None
    """
    # Make predictions
    predictions = model.transform(test_data)

    # Set up evaluator and compute score
    evaluator = MulticlassClassificationEvaluator(
        labelCol="churn", 
        predictionCol="prediction", 
        metricName=metric_name)
    score = evaluator.evaluate(predictions)
    print("Score = ", score)

In [19]:
def rename(x: str,
           vocab_dict: Dict) -> str:
    """
    Rename raw attribute names according to vocabularies
    ----------------
    Args:
        x (string) : original name
        vocab_dict (Dict) : dictionary containing all vocabularies
    Return:
        string : new name
    """
    if 'TFIDF' in x:
        components = x.split('_')
        new_components = components[:-1]
        new_components.append(vocab_dict[components[1]][int(components[-1])])
        new_x = '_'.join(new_components)
    else:
        new_x = x
    return new_x

### Random Forest Classifier

In [71]:
%%time
# Tune model
rf = RandomForestClassifier(labelCol="churn", featuresCol="features", 
                            seed = 10)
rf_pipeline = Pipeline(stages=[feature_pipeline, rf])

# set parameters grid
paramGrid = (ParamGridBuilder()
            .addGrid(rf.maxDepth, [5, 7])
            .addGrid(rf.numTrees, [20, 30])
            .build()
            )

# choose evaluater
evaluator = MulticlassClassificationEvaluator(labelCol="churn", 
                                               predictionCol="prediction", 
                                               metricName="f1")

# define cross-validator
crossval = CrossValidator(estimator=rf_pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3,
                          seed=10)

# run cross-validation
cvModel = crossval.fit(train_data)

CPU times: user 3.33 s, sys: 738 ms, total: 4.07 s
Wall time: 13min 20s


In [72]:
# check best combination of parameters
cvModel.getEstimatorParamMaps()[ np.argmax(cvModel.avgMetrics) ]

{Param(parent='RandomForestClassifier_6a30cb8053ca', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 5,
 Param(parent='RandomForestClassifier_6a30cb8053ca', name='numTrees', doc='Number of trees to train (>= 1).'): 20}

In [73]:
# let's test it
score_the_model(test_data, cvModel, metric_name='f1')

Score =  0.7254676842034173


In [74]:
# save model
cvModel.bestModel.write().overwrite().save("./saved_models/rf_model")

In [75]:
# load model
persistedModel = PipelineModel.load("./saved_models/rf_model")

In [86]:
feature_importance_tab = pd.DataFrame([(name, persistedModel.stages[-1].featureImportances[idx])
                                       for idx, name in attrs
                                       if persistedModel.stages[-1].featureImportances[idx]],
                                      columns=['feature_name_raw', 'importance'])
feature_importance_tab['feature_name'] = feature_importance_tab['feature_name_raw'].apply(lambda x: rename(x, vocab_dict))
feature_importance_tab.sort_values(by='importance', ascending=False)[:20]

Unnamed: 0,feature_name_raw,importance,feature_name
126,session_freq,0.123965,session_freq
128,lifetime,0.060926,lifetime
117,TFIDF_page_6,0.031007,TFIDF_page_Logout
40,TFIDF_artist_56,0.022657,TFIDF_artist_Shakira
124,TFIDF_page_14,0.018132,TFIDF_page_Error
71,TFIDF_song_10,0.01808,TFIDF_song_Ain't Misbehavin
14,TFIDF_artist_18,0.018072,TFIDF_artist_Harmonia
12,TFIDF_artist_16,0.017009,TFIDF_artist_Linkin Park
9,TFIDF_artist_13,0.016618,TFIDF_artist_Taylor Swift
69,TFIDF_song_7,0.015998,TFIDF_song_Use Somebody


### Gradient Boosted Tree classifier

In [15]:
%%time
# Tune model
gbt = GBTClassifier(labelCol="churn", featuresCol="features")
gbt_pipeline = Pipeline(stages=[feature_pipeline, gbt])

# set parameters grid
paramGrid = (ParamGridBuilder()
            .addGrid(gbt.maxDepth, [3, 5])
            .addGrid(gbt.maxIter, [5, 10])
            .build()
            )

# choose evaluater
evaluator = MulticlassClassificationEvaluator(labelCol="churn", 
                                               predictionCol="prediction", 
                                               metricName="f1")

# define cross-validator
crossval = CrossValidator(estimator=gbt_pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3,
                          seed=10)

# run cross-validation
cvModel = crossval.fit(train_data)

CPU times: user 4.82 s, sys: 939 ms, total: 5.76 s
Wall time: 36min 4s


In [16]:
# check best combination of parameters
cvModel.getEstimatorParamMaps()[ np.argmax(cvModel.avgMetrics) ]

{Param(parent='GBTClassifier_e80ef7dbf8d5', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 3,
 Param(parent='GBTClassifier_e80ef7dbf8d5', name='maxIter', doc='max number of iterations (>= 0).'): 5}

In [17]:
# let's test it
score_the_model(test_data, cvModel, metric_name='f1')

Score =  0.7313432835820896


In [18]:
# save the model
cvModel.bestModel.write().overwrite().save("./saved_models/gbt_model")
print('GBT model saved to ./saved_models/gbt_model')

GBT model saved to ./saved_models/gbt_model


In [22]:
# load model
persistedModel = PipelineModel.load("./saved_models/gbt_model")

In [23]:
feature_importance_tab = pd.DataFrame([(name, persistedModel.stages[-1].featureImportances[idx])
                                       for idx, name in attrs
                                       if persistedModel.stages[-1].featureImportances[idx]],
                                      columns=['feature_name_raw', 'importance'])
feature_importance_tab['feature_name'] = feature_importance_tab['feature_name_raw'].apply(lambda x: rename(x, vocab_dict))
feature_importance_tab.sort_values(by='importance', ascending=False)[:20]

Unnamed: 0,feature_name_raw,importance,feature_name
6,TFIDF_song_5,0.163089,TFIDF_song_Dog Days Are Over (Radio Edit)
19,lifetime,0.156076,lifetime
18,session_freq,0.112537,session_freq
2,TFIDF_artist_4,0.09969,TFIDF_artist_BjÃÂ¶rk
11,TFIDF_song_63,0.094081,TFIDF_song_Tighten Up
10,TFIDF_song_59,0.05725,TFIDF_song_I Gotta Feeling
3,TFIDF_artist_56,0.054867,TFIDF_artist_Shakira
9,TFIDF_song_58,0.042169,TFIDF_song_Master Of Puppets
5,TFIDF_artist_88,0.035837,TFIDF_artist_The Notorious B.I.G.
0,TFIDF_artist_0,0.032679,TFIDF_artist_Kings Of Leon
