# Making The Recommendation System

Made by: Alexander Beaucage

Date: June 23 2023

Contact Info: Beaucagealex202@gmail.com

The goal of this notebook is to get a recommender running. To do this I will build an association rules table, which finds songs that frequently co-occur in the same playlists.
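
As a rough illustration of what "co-occur frequently" means, the sketch below counts how often pairs of songs share a playlist. The playlists and song names are made up for the example and do not come from the data set.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy playlists (not from the real data)
playlists = [
    ["Song A", "Song B", "Song C"],
    ["Song A", "Song B"],
    ["Song B", "Song C"],
    ["Song A", "Song B", "Song D"],
]

# Count every unordered pair of songs that shares a playlist
pair_counts = Counter()
for songs in playlists:
    for pair in combinations(sorted(set(songs)), 2):
        pair_counts[pair] += 1

# Support of a pair = share of playlists that contain both songs
pair_support = {pair: count / len(playlists) for pair, count in pair_counts.items()}

# The most frequent pair is the strongest co-occurrence
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count, pair_support[top_pair])
```

The association rules table built later in this notebook reports the same kind of pair support, plus metrics derived from it.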

In [1]:
# Import the libraries I need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [2]:
# Path to the cleaned data file
datadir = r"C:\Users\Alexander\Documents\Data Science Boot Camp\Capstone\Copys\Clean data\cleandata.csv"
# Load the data, using the first CSV column as the index
data = pd.read_csv(datadir, index_col=0)

In [3]:
# Take a look at the data
data.head()

Unnamed: 0,user_id,artistname,trackname,playlistname
0,9cc0cfd4d7d7885102480dd99e7a90d6,Elvis Costello,(The Angels Wanna Wear My) Red Shoes,HARD ROCK 2010
1,9cc0cfd4d7d7885102480dd99e7a90d6,Elvis Costello & The Attractions,"(What's So Funny 'Bout) Peace, Love And Unders...",HARD ROCK 2010
2,9cc0cfd4d7d7885102480dd99e7a90d6,Tiffany Page,7 Years Too Late,HARD ROCK 2010
3,9cc0cfd4d7d7885102480dd99e7a90d6,Elvis Costello & The Attractions,Accidents Will Happen,HARD ROCK 2010
4,9cc0cfd4d7d7885102480dd99e7a90d6,Elvis Costello,Alison,HARD ROCK 2010


I'm going to reduce the 4 columns to 2 by combining `user_id` with `playlistname`, and `artistname` with `trackname`. This gives unique identifiers for playlists and songs, because different users can have playlists with the same name and different artists can have songs with the same name.

In [4]:
# Making a new column for the song artist combination
new_col = []
# Loop through each item in data frame only getting the trackname and artistname
for item in data[["trackname","artistname"]].values:
    #print(item[0], item[1])
    new_col.append(str(item[0]) + " by " + str(item[1]))

In [5]:
# Adding the new column to the data set
data["song_artist"] = new_col

In [6]:
# Drop the old columns
data.drop(columns = ["artistname","trackname"], inplace = True)

In [7]:
# See if it looks right
data.head()

Unnamed: 0,user_id,playlistname,song_artist
0,9cc0cfd4d7d7885102480dd99e7a90d6,HARD ROCK 2010,(The Angels Wanna Wear My) Red Shoes by Elvis ...
1,9cc0cfd4d7d7885102480dd99e7a90d6,HARD ROCK 2010,"(What's So Funny 'Bout) Peace, Love And Unders..."
2,9cc0cfd4d7d7885102480dd99e7a90d6,HARD ROCK 2010,7 Years Too Late by Tiffany Page
3,9cc0cfd4d7d7885102480dd99e7a90d6,HARD ROCK 2010,Accidents Will Happen by Elvis Costello & The ...
4,9cc0cfd4d7d7885102480dd99e7a90d6,HARD ROCK 2010,Alison by Elvis Costello


In [8]:
# Making a new column for the playlist user combinations
new_col = []
# Loop through each item in the data frame
for item in data[["playlistname","user_id"]].values:
    #print(item[0], item[1])
    new_col.append(str(item[0]) + " by " + str(item[1]))

In [9]:
# Add the new column to the data frame
data["playlist_user"] = new_col

In [10]:
# Drop the old columns
data.drop(columns = ["playlistname","user_id"], inplace = True)

In [11]:
# Take a look at the new data frame
data.head()

Unnamed: 0,song_artist,playlist_user
0,(The Angels Wanna Wear My) Red Shoes by Elvis ...,HARD ROCK 2010 by 9cc0cfd4d7d7885102480dd99e7a...
1,"(What's So Funny 'Bout) Peace, Love And Unders...",HARD ROCK 2010 by 9cc0cfd4d7d7885102480dd99e7a...
2,7 Years Too Late by Tiffany Page,HARD ROCK 2010 by 9cc0cfd4d7d7885102480dd99e7a...
3,Accidents Will Happen by Elvis Costello & The ...,HARD ROCK 2010 by 9cc0cfd4d7d7885102480dd99e7a...
4,Alison by Elvis Costello,HARD ROCK 2010 by 9cc0cfd4d7d7885102480dd99e7a...
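
The two loops above can also be sketched as vectorized string concatenation. This is a minimal sketch on a made-up two-row frame (the `toy` frame and its values are hypothetical); it is meant to show the idea, not to replace the cells above.

```python
import pandas as pd

# Hypothetical toy frame with the same four columns as the cleaned data
toy = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "artistname": ["Elvis Costello", "Coldplay"],
    "trackname": ["Alison", "Magic"],
    "playlistname": ["Favourites", "Favourites"],
})

# Build "<track> by <artist>" and "<playlist> by <user>" without a Python loop
toy["song_artist"] = toy["trackname"].astype(str) + " by " + toy["artistname"].astype(str)
toy["playlist_user"] = toy["playlistname"].astype(str) + " by " + toy["user_id"].astype(str)

# Keep only the two combined columns
toy = toy[["song_artist", "playlist_user"]]
print(toy)
```

Note that the two toy playlists share the name "Favourites" but end up with different `playlist_user` values, which is the reason for combining the columns.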


Now that I've got the columns sorted out I will be selecting songs that appear 10 or more times in the data.

In [12]:
# Make a set of the songs that appear 10 or more times (a set keeps the membership checks below fast)
song_counts = data["song_artist"].value_counts()
songs10 = set(song_counts[song_counts >= 10].index)

In [13]:
# Creating a selector for songs that appear 10 or more times
selector = []
# Looping through each item in the song_artist column
for item in data["song_artist"].values:
    # If the item appeared 10 or more times
    if item in songs10:
        selector.append(True)
    # Otherwise the item appeared fewer than 10 times
    else:
        selector.append(False)

In [14]:
# Does the selector look right?
selector

[False,
 False,
 False,
 False,
 True,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,


In [15]:
# Filter down to the songs that appear 10 or more times
data = data[selector]
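
The same filter can be sketched without an explicit loop by mapping each row to its song's count. This is a minimal sketch on invented data; `toy` and `min_count` are hypothetical names, and the threshold is lowered to 3 so the toy frame stays small.

```python
import pandas as pd

# Hypothetical toy frame: song "A" appears 3 times, "B" twice, "C" once
toy = pd.DataFrame({
    "song_artist": ["A", "A", "A", "B", "B", "C"],
    "playlist_user": ["p1", "p2", "p3", "p1", "p2", "p3"],
})

min_count = 3  # stand-in for the notebook's threshold of 10

# Count how many rows each song has, aligned row by row with the frame
row_counts = toy["song_artist"].map(toy["song_artist"].value_counts())

# Boolean mask: True where the song appears at least min_count times
mask = row_counts >= min_count

filtered = toy[mask]
print(filtered)
```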

In [16]:
# How many unique playlists?
data["playlist_user"].unique().shape

(13695,)

In [17]:
# How many unique songs?
data["song_artist"].unique().shape

(13054,)

After reducing the size of the data by keeping only songs that appear 10 or more times, it's time to make an association rules table.

In [18]:
# Group by playlist_user and collect each playlist's songs into a list
plgrouped = data.groupby("playlist_user")["song_artist"].apply(list)

In [19]:
# Take a look at the data to see if it makes sense
plgrouped.head()

playlist_user
 - Starred -  by 1d55a12fc82e5fc88f8dad24ab86c8b7                      [(I Can't Get No) Satisfaction - (Original Sin...
 Amys 80s by b358a18274eb1e8ecc8e731ff348c268                          [Always Something There To Remind Me by Naked ...
 Bad Company and Lynyrd Skynyrd by 1db5901cdada0674460ec5cb2b72d72e    [Bad Company (Remastered Album Version) by Bad...
 Champs-Élysées  by 6e4855ad98cfcfdbff0806912125f76d                   [Je t'aime moi non plus by Serge Gainsbourg, J...
 Fαvorιтer / becki by 4c18bed0f15d0b1be954b165b45399d5                 [Already Gone by Kelly Clarkson, Best Thing I ...
Name: song_artist, dtype: object

In [20]:
# Instantiate the transaction encoder
te = TransactionEncoder()

# Fit the encoder and transform the grouped data into a boolean array
onehot = te.fit_transform(plgrouped)

# Put the encoded array into a data frame with one column per song
onehot_df = pd.DataFrame(onehot, columns=te.columns_)

# Take a look at the encoded data
onehot_df.head()

Unnamed: 0,#1 Crush by Garbage,#41 by Dave Matthews Band,#9 Dream by José González,#Beautiful by Mariah Carey,#GETITRIGHT by Miley Cyrus,#NAME? by Aurora Beltrán,#SELFIE by The Chainsmokers,#thatPOWER by will.i.am,& It Was U by How To Dress Well,'Cause I'm A Man by Tame Impala,...,¡Corre! by Jesse & Joy,¿Aha Han Vuelto? by Lori Meyers,¿No Podiamos Ser Agua? by Maldita Nerea,À tout à l'heure by Bibio,Águas De Março by Antônio Carlos Jobim,Ára bátur by Sigur Rós,Ég anda by Sigur Rós,Éxtasis by Pablo Alboran,Ísjaki by Sigur Rós,ÜBerlin by R.E.M.
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
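
To make the encoding step concrete, here is a small plain-Python sketch of the same idea: one row per playlist and one boolean column per song. The toy playlists are invented, and the code only mimics the shape of the encoder output; it is not mlxtend's implementation.

```python
# Hypothetical toy playlists keyed by a playlist label
toy_playlists = {
    "p1": ["Song A", "Song B"],
    "p2": ["Song B", "Song C"],
    "p3": ["Song A"],
}

# One column per distinct song, in sorted order
columns = sorted({song for songs in toy_playlists.values() for song in songs})

# One row per playlist: True where the playlist contains the column's song
onehot_rows = [
    [song in set(songs) for song in columns]
    for songs in toy_playlists.values()
]

for label, row in zip(toy_playlists, onehot_rows):
    print(label, dict(zip(columns, row)))
```

With thousands of songs and only a handful per playlist, almost every cell is `False`, which is why the `head()` output above shows nothing else.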


In [21]:
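# Find frequent itemsets of at most 2 songs that appear in at least 0.01% of playlists
x = apriori(onehot_df,
            min_support=0.0001,
            use_colnames=True,
            max_len = 2,
            low_memory = True)

# Create the association rules table, keeping rules with a lift of at least 1
assorules = association_rules(x, metric="lift", min_threshold=1.0)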

In [22]:
# Take a look at the association rules table I just made
assorules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(#1 Crush by Garbage),((Antichrist Television Blues) by Arcade Fire),0.001022,0.001387,0.000146,0.142857,102.969925,0.000145,1.165048,0.991302
1,((Antichrist Television Blues) by Arcade Fire),(#1 Crush by Garbage),0.001387,0.001022,0.000146,0.105263,102.969925,0.000145,1.116505,0.991664
2,(1901 by Phoenix),(#1 Crush by Garbage),0.007594,0.001022,0.000146,0.019231,18.811813,0.000138,1.018566,0.954087
3,(#1 Crush by Garbage),(1901 by Phoenix),0.001022,0.007594,0.000146,0.142857,18.811813,0.000138,1.157807,0.947811
4,(1979 by The Smashing Pumpkins),(#1 Crush by Garbage),0.006864,0.001022,0.000292,0.042553,41.62614,0.000285,1.043377,0.982722
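
The metric columns in the first row can be reproduced by hand. The sketch below assumes counts of 14, 19 and 2 playlists; these are inferred by multiplying the printed supports by the 13,695 playlists and rounding, not read from the data directly.

```python
n_playlists = 13695  # unique playlists after filtering (from the notebook)
n_antecedent = 14    # inferred: playlists containing "#1 Crush by Garbage"
n_consequent = 19    # inferred: playlists containing "(Antichrist Television Blues) by Arcade Fire"
n_both = 2           # inferred: playlists containing both songs

antecedent_support = n_antecedent / n_playlists
consequent_support = n_consequent / n_playlists
support = n_both / n_playlists

# confidence: share of antecedent playlists that also hold the consequent
confidence = support / antecedent_support
# lift: confidence relative to how common the consequent is overall
lift = confidence / consequent_support
# leverage: observed co-occurrence minus what independence would predict
leverage = support - antecedent_support * consequent_support
# conviction: how much more often the rule would fail under independence
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 6), round(lift, 6), round(leverage, 6), round(conviction, 6))
```

Under these assumed counts the values agree with row 0 of the table to rounding. A lift this high mostly reflects how rare both songs are, so a pair seen in only 2 playlists should be read with caution.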


In [23]:
# How many rows of data?
assorules.shape

(9403132, 10)

In [24]:
# Create a string of the lowercase alphabet
alpha = "abcdefghijklmnopqrstuvwxyz"

# Adding the alphabet in uppercase, a space, and numbers
alpha += alpha.upper() + " " + "1234567890"

# Seeing if the output looks right
alpha

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ 1234567890'

In [25]:
# This function takes a single-item frozenset and returns its item as a string of letters, digits, and spaces
def remove_punctuation(string):
    # Get the song out of the frozenset directly (splitting its text on ' breaks for titles that contain an apostrophe)
    string = next(iter(string))
    
    # Creating an empty string to return
    new = ""
    
    # Loop through each character in string
    for item in string:
        
        # If the item is an allowed character (letter, digit, or space)
        if item in alpha:
            
            # Append it to the string
            new += item
    
    # Return the string with only the allowed characters
    return new

In [26]:
# Make the antecedents column strings instead of frozensets
assorules["antecedents"] = assorules["antecedents"].apply(remove_punctuation)

In [86]:
# Use regular expression to get the song for a recommendation 
#selector = assorules["antecedents"].str.match("^One More Time.*$")
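
A partial-title lookup along the lines of the commented-out cell could be sketched as follows. The series and the search text are hypothetical stand-ins for `assorules["antecedents"]` and a user query.

```python
import re
import pandas as pd

# Hypothetical antecedent strings, already cleaned to letters, digits, and spaces
antecedents = pd.Series([
    "One More Time by Daft Punk",
    "One by U2",
    "Magic by Coldplay",
])

title_start = "One More Time"  # hypothetical search text

# str.match anchors at the start of each string, so no leading ^ is needed;
# re.escape guards against regex metacharacters in the search text
selector = antecedents.str.match(re.escape(title_start))
print(selector.tolist())
```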

In [109]:
# Create a selector for this song (get recommendations for this song)
selector = (assorules["antecedents"] == "Magic by Coldplay")

In [110]:
# See what the top 10 recommendations are
assorules[selector].sort_values(by = "support", ascending = False)[0:10]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
330836,Magic by Coldplay,(A Sky Full of Stars by Coldplay),0.009785,0.009858,0.004162,0.425373,43.151741,0.004066,1.723105,0.986478
7226029,Magic by Coldplay,(Rather Be (feat. Jess Glynne) by Clean Bandit),0.009785,0.012413,0.002629,0.268657,21.642669,0.002507,1.350374,0.96322
2985952,Magic by Coldplay,(Counting Stars by OneRepublic),0.009785,0.012486,0.002264,0.231343,18.527756,0.002141,1.284727,0.955375
7225772,Magic by Coldplay,(Paradise by Coldplay),0.009785,0.009566,0.002264,0.231343,24.185086,0.00217,1.288526,0.968125
7225912,Magic by Coldplay,(Pompeii by Bastille),0.009785,0.015407,0.002045,0.208955,13.562283,0.001894,1.244674,0.935419
7227236,Magic by Coldplay,(The Scientist by Coldplay),0.009785,0.010442,0.001972,0.201493,19.296785,0.001869,1.23926,0.957547
4488296,Magic by Coldplay,(Fix You by Coldplay),0.009785,0.010004,0.001972,0.201493,20.1419,0.001874,1.239809,0.959743
6809856,Magic by Coldplay,(Let Her Go by Passenger),0.009785,0.012924,0.001825,0.186567,14.435239,0.001699,1.213469,0.939922
5963548,Magic by Coldplay,(I See Fire by Ed Sheeran),0.009785,0.00774,0.001825,0.186567,24.104126,0.00175,1.219843,0.967985
7227738,Magic by Coldplay,(Viva La Vida by Coldplay),0.009785,0.010734,0.001752,0.179104,16.685958,0.001647,1.205106,0.949358


In [111]:
# Make a recommendation by picking one of the top 10 rules at random
randnum = np.random.randint(0,10)
assorules[selector].sort_values(by = "support", ascending = False)["consequents"].values[randnum]

frozenset({'The Scientist by Coldplay'})
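
The last few cells can be wrapped into one helper. The sketch below is one possible shape, not the notebook's own code: `recommend`, `top_n`, and `rng` are hypothetical names, and the toy table reuses three rows from the output above. It also returns a plain string instead of a `frozenset`, and it copes with songs that have fewer than 10 rules.

```python
import numpy as np
import pandas as pd

# Toy rules table using three rows copied from the output above
toy_rules = pd.DataFrame({
    "antecedents": ["Magic by Coldplay"] * 3,
    "consequents": [
        frozenset({"A Sky Full of Stars by Coldplay"}),
        frozenset({"Paradise by Coldplay"}),
        frozenset({"Pompeii by Bastille"}),
    ],
    "support": [0.004162, 0.002264, 0.002045],
})


def recommend(rules, song, top_n=10, rng=None):
    """Return one consequent song title for `song`, or None if no rule matches."""
    rng = np.random.default_rng() if rng is None else rng
    # Rules whose antecedent is the requested song, strongest support first
    matches = rules[rules["antecedents"] == song].sort_values("support", ascending=False)
    top = matches["consequents"].head(top_n).tolist()
    if not top:
        return None
    # Pick at random among however many top rules exist (may be fewer than top_n)
    chosen = top[rng.integers(len(top))]
    # Each consequent is a single-item frozenset here, so unpack its only element
    return next(iter(chosen))


print(recommend(toy_rules, "Magic by Coldplay", rng=np.random.default_rng(0)))
```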