In [None]:
from plotly.offline import init_notebook_mode, iplot_mpl, download_plotlyjs, plot, iplot
import plotly_express as px
import plotly.figure_factory as ff
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')
init_notebook_mode(connected=True)
import pandas_profiling
import statsmodels.formula.api as smf
import missingno as msno
from sklearn.preprocessing import LabelEncoder
from statsmodels.compat import lzip
import statsmodels.api as sm
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler,MinMaxScaler
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
try:
    import apyori
except ImportError:
    # apyori is not preinstalled in every environment; install it on demand
    !pip install apyori

from apyori import apriori

******
## Association rules mining - Apriori Algorithm using Python
- author: "Xavier Martinez Bartra"
- date: "November 2020"
******

In this project we follow the steps of a data mining project to generate association rules. The DataSet contains each user's listening history on a music web portal: "user" identifies the listener, "artist" is the name of the artist or group the user has listened to, and "sex" and "country" describe the user.

## 1. DataSet Loading

In [None]:
#we load the data
data=pd.read_csv("../input/lastfm/lastfm.csv")

In [None]:
data

In [None]:
df=data.copy()

In [None]:
df.info()

We note that the DataSet has no missing or null values. All variables are strings, except the user variable, which is an integer.

In [None]:
pandas_profiling.ProfileReport(df)

We have 2 duplicate rows. We assume it is an error in the DataSet.

In [None]:
df[df.duplicated(keep=False)]

We remove the duplicate rows.

In [None]:
df.drop_duplicates(inplace=True)

## 2. DataSet Exploration

In [None]:
print("User unique: ",len(df.user.unique()))
print("Artist unique: ",len(df.artist.unique()))
print("Country unique: ",len(df.country.unique()))

In [None]:
len(df.artist.unique())

The DataSet contains 15,000 unique users from 159 countries and 1,004 artists.

In [None]:
fig = px.treemap(df, path=['sex','country'], title='DataSet Treemap by Sex & Country',
                 color_discrete_sequence=px.colors.qualitative.Pastel).update_traces(dict(marker_line_width=1,
          marker_line_color="black")).update_layout(paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)')

fig.show() 

In [None]:
df.sex.value_counts(normalize=True)

73% of the observations belong to male users.

The most played artists (with the most support):

In [None]:
df.artist.value_counts(normalize=True)

In [None]:
fig = px.treemap(df, path=['artist'], title='DataSet Treemap by Artist',
                 color_discrete_sequence=px.colors.qualitative.Pastel).update_traces(dict(marker_line_width=1,
          marker_line_color="black")).update_layout(paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)')

fig.show() 

## 3. Transformation of the data set

In this section we transform the data set so that the Apriori algorithm can be run on it.

In [None]:
#select the relevant columns for the algorithm (user and artist)
df = df[['user','artist']]

In [None]:
# one list of artists per user; sort=False keeps users in order of first appearance
transactions = df.groupby('user', sort=False)['artist'].apply(list).tolist()

In [None]:
len(transactions)

We have transformed the data set into a list of transactions, one per user, each containing all the artists that user has played (15,000 transactions in total).

The objective of the algorithm is to generate knowledge by identifying listening patterns shared across users.
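
As a minimal sketch of this transformation, the snippet below groups a handful of made-up `(user, artist)` rows into one transaction per user using only the standard library. The rows, the `baskets` name and the per-user deduplication are assumptions for illustration, not part of the original notebook.

```python
from collections import defaultdict

# Hypothetical (user, artist) rows standing in for the real lastfm.csv data.
rows = [
    (1, "bob dylan"),
    (1, "the beatles"),
    (2, "metallica"),
    (2, "iron maiden"),
    (2, "metallica"),  # repeated play by the same user
    (3, "the beatles"),
]

baskets = defaultdict(list)
for user, artist in rows:
    # Keep each artist once per user: a transaction records presence, not play counts.
    if artist not in baskets[user]:
        baskets[user].append(artist)

# One list of artists per user, in order of first appearance.
toy_transactions = list(baskets.values())
print(toy_transactions)
```

Each inner list plays the role of a shopping basket in the classic market-basket setting, with artists as the items.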

In [None]:
# We check the values of the first 5 users in the list
transactions[0:5]

## 4. Apriori algorithm

Explanation of the algorithm:

The Apriori algorithm is a data mining technique used to extract frequent item sets and, from them, relevant association rules. It reveals which items tend to appear together in the same transaction.
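
To make the idea of frequent item set extraction concrete, here is a minimal sketch of the level-wise search that gives Apriori its name: an item set of size k is only counted if all of its subsets of size k-1 are already frequent. The function name `frequent_itemsets` and the toy data are hypothetical; this is not the implementation used by the `apyori` package.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Share of transactions that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: discard candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Toy transactions (illustrative only).
toy = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b", "c"],
]
result = frequent_itemsets(toy, min_support=0.5)
for itemset, supp in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(supp, 2))
```

The pruning step is what keeps the search tractable: once an item set falls below the support threshold, none of its supersets needs to be counted.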

Things to know before implementation:

**Association rule:**

An implication of the form {X -> Y} that captures a frequent pattern or association between sets of items: transactions that contain X tend to also contain Y.

             
**Support.**

This measures how popular an item set is, as the proportion of transactions in which it appears. The min_support parameter sets the minimum share of transactions in which an item set must appear for its rules to be considered.


**Confidence.**

This indicates the probability that item Y appears in a transaction given that item X appears in it, expressed as {X -> Y}. It is measured by the proportion of transactions with item X in which item Y also appears.

One drawback of the confidence measure is that it can misrepresent the importance of an association, because it only accounts for how popular item X is, not Y. If item Y is also very popular in general, there is a higher chance that a transaction containing X will also contain Y, which inflates the confidence.

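**Lift.**

This measures how much more likely item Y is to appear when item X appears, while controlling for the overall popularity of item Y. It is the confidence of {X -> Y} divided by the support of Y: a lift of 1 means X and Y are independent, and a lift above 1 means they occur together more often than expected by chance.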
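
The three measures can be sketched in a few lines of plain Python. The helper names (`support`, `confidence`, `lift`) and the four toy transactions are assumptions for illustration; the `apyori` package computes these values internally.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """P(Y | X) = support(X and Y) / support(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

def lift(transactions, x, y):
    """confidence(X -> Y) / support(Y); 1.0 means X and Y are independent."""
    return confidence(transactions, x, y) / support(transactions, y)

# Toy transactions (illustrative only).
toy = [
    ["bob dylan", "the beatles"],
    ["bob dylan", "the beatles"],
    ["the beatles"],
    ["metallica"],
]

x, y = ["bob dylan"], ["the beatles"]
print("support(X, Y):", support(toy, x + y))
print("confidence(X -> Y):", confidence(toy, x, y))
print("lift(X -> Y):", round(lift(toy, x, y), 3))
```

In this toy data every Bob Dylan listener also plays The Beatles, so the confidence is 1.0, but because The Beatles appear in three of the four transactions anyway, the lift is only about 1.33.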

#### 4.1 Parameterization with lift > 2, confidence > 0.4 and support > 0.03

We are going to run the Apriori algorithm with min_support = 0.03, min_confidence = 0.4 and min_lift = 2 to see the associations that the algorithm finds with a minimum of robustness.

In [None]:
association_rules = apriori(transactions, min_support=0.03, min_confidence=0.4, min_lift=2)
association_results = list(association_rules)

In [None]:
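# Function to flatten the apyori results into rows for the DataFrame.
# Each result is (items, support, ordered_statistics); only the first ordered
# statistic and the first item on each side of the rule are kept.
def inspect(results):
    lhs         = [tuple(result[2][0][0])[0] for result in results]  # antecedent (items_base)
    rhs         = [tuple(result[2][0][1])[0] for result in results]  # consequent (items_add)
    supports    = [result[1] for result in results]                  # support of the item set
    confidences = [result[2][0][2] for result in results]
    lifts       = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))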


Apriori = pd.DataFrame(inspect(association_results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])

Apriori['Transaction']= Apriori['Left Hand Side']+"--"+Apriori['Right Hand Side']
Apriori.sort_values(by='Lift',inplace=True,ascending=False)

In [None]:
Apriori

With the selected parameters, the Apriori algorithm has found only 7 rules, shown here ordered by their lift score. Each rule can be read as: if a user listens to the "Left Hand Side" artist, then the "Right Hand Side" artist is predicted with the corresponding level of support, confidence and lift.

Here we see the association rules created by the Apriori algorithm. For example, let's take a look at rule number 2.

   - The first value, lhs, corresponds to Bob Dylan, the artist the user has already listened to.

   - The second value, rhs, corresponds to The Beatles, the artist that Apriori predicts will be heard in association with Bob Dylan.

   - The third value, support, is the number of transactions that include both artists, divided by the total number of transactions (as described above when we chose the parameters for Apriori). Thus, 3.4% of all transactions contain both Bob Dylan and The Beatles.

   - The fourth value, confidence, is the probability that the rule holds. It tells us how often the transactions that contain Bob Dylan also contain The Beatles. In this case it is 49.71%.

   - The fifth value, lift, tells us how far the two artists are from being independent. It relates the confidence of the rule to the popularity of The Beatles in the whole data set: it is the probability of listening to The Beatles given that Bob Dylan has been heard, divided by the probability of hearing The Beatles without any knowledge of Bob Dylan. A lift greater than 1 indicates a positive association between {X} and {Y}. The higher the lift, the greater the chance of hearing artist Y if the user has already listened to artist X. In this case, a lift of 2.79 means a user who has listened to Bob Dylan is 2.79 times more likely to listen to The Beatles than a user picked at random.
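
As a quick consistency check on these figures, the definitions of confidence and lift can be inverted to recover the individual supports they imply. The sketch below uses only the rounded values quoted above, so the results are approximate and are not taken from the notebook's output.

```python
# Values quoted in the text for the rule {Bob Dylan -> The Beatles}.
support_xy = 0.034   # share of users who listened to both artists (rounded)
confidence = 0.4971  # P(The Beatles | Bob Dylan)
lift = 2.79

# confidence = support(X, Y) / support(X)  =>  support(X) = support(X, Y) / confidence
support_x = support_xy / confidence

# lift = confidence / support(Y)  =>  support(Y) = confidence / lift
support_y = confidence / lift

print(f"implied support(Bob Dylan)   ~ {support_x:.3f}")
print(f"implied support(The Beatles) ~ {support_y:.3f}")
```

Under these rounded inputs, roughly 7% of users would have listened to Bob Dylan and roughly 18% to The Beatles, which is why a confidence near 50% translates into a lift well above 1.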

#### 4.2 Parameterization with lift > 2, confidence > 0.25 and support > 0.02

We lower the minimum support and confidence levels to investigate further associations with lower support.

In [None]:
association_rules = apriori(transactions, min_support=0.02, min_confidence=0.25, min_lift=2)
association_results = list(association_rules)

Apriori = pd.DataFrame(inspect(association_results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])

Apriori['Transaction']= Apriori['Left Hand Side']+"--"+Apriori['Right Hand Side']
Apriori.sort_values(by='Lift',inplace=True,ascending=False)

In [None]:
Apriori

In this case, the algorithm has found 70 rules. For example, we observe that 49.99% of the users who listened to Iron Maiden also listened to Metallica, and the lift between the two is very strong: a lift of 4.85 tells us that Metallica is 4.85 times more likely to be heard if the user has also listened to Iron Maiden (controlling for Metallica's popularity).

#### 4.3 Parameterization with lift > 1.5, confidence > 0.2 and support > 0.01

We lower the thresholds of all parameters even further.

In [None]:
association_rules = apriori(transactions, min_support=0.01, min_confidence=0.2, min_lift=1.5)
association_results = list(association_rules)

Apriori = pd.DataFrame(inspect(association_results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])

Apriori['Transaction']= Apriori['Left Hand Side']+"--"+Apriori['Right Hand Side']
Apriori.sort_values(by='Lift',inplace=True,ascending=False)

In [None]:
Apriori

With this parameterization, the algorithm has found 762 rules. The lower we set the thresholds, the more rules the algorithm generates, but the weaker the evidence behind each additional rule.
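
The effect of the thresholds on the number of rules can be reproduced on a small scale. The sketch below brute-forces every one-to-one rule {X -> Y} in ten made-up transactions and counts how many survive three increasingly permissive settings. The `pair_rules` helper, the toy data and the threshold values are all assumptions for illustration and do not correspond to the real DataSet.

```python
from itertools import permutations

def pair_rules(transactions, min_support, min_confidence, min_lift):
    """Enumerate single-item rules X -> Y that pass all three thresholds."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    items = sorted({item for b in baskets for item in b})

    def supp(*wanted):
        return sum(set(wanted) <= b for b in baskets) / n

    rules = []
    for x, y in permutations(items, 2):
        s = supp(x, y)
        if s == 0:
            continue  # the pair never occurs together
        conf = s / supp(x)
        lft = conf / supp(y)
        if s >= min_support and conf >= min_confidence and lft >= min_lift:
            rules.append((x, y, s, conf, lft))
    return rules

# Toy transactions (illustrative only).
toy = [
    ["a", "b"], ["a", "b"], ["a", "b"], ["a", "c"], ["c", "d"],
    ["c", "d"], ["a"], ["b", "e"], ["e"], ["c"],
]

# (min_support, min_confidence, min_lift), from strict to permissive.
settings = [(0.15, 0.55, 2.0), (0.15, 0.45, 1.4), (0.05, 0.22, 1.2)]
counts = [len(pair_rules(toy, *s)) for s in settings]
print(counts)
```

The rule count can only grow as the thresholds are relaxed, since every rule that passes a strict setting also passes a looser one; the extra rules are the ones with lower support, confidence or lift.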

## 5. Conclusions
We have executed the Apriori algorithm with different levels of support, confidence and lift, and interpreted some of the resulting rules. Depending on the purpose of the analysis, we will choose thresholds that are more or less strict. The knowledge generated by the algorithm is useful for understanding users' preferences among the different musical artists (at the respective confidence levels) and for making smart, data-driven decisions.