## Transformation of transactional data into incidence matrix format

Dataset: lastfm.csv

The dataset contains 4 columns: "user", "artist", "sex", and "country".

Transform the data into an incidence matrix in which each listener is a row and each artist is a column, with 1s and 0s
indicating whether or not the listener has played that artist.

The transactional data is therefore converted into a matrix built from the 'user' and 'artist' columns only, with
'artist' values as the columns and 'user' values as the row index.
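
As a minimal sketch of the target layout, assume a small made-up table with the same 'user' and 'artist' columns (the rows below are illustrative and are not taken from lastfm.csv). One possible way to obtain the matrix is `pd.crosstab`, with the counts capped at 1 so that repeated plays still appear as a single 1:

```python
import pandas as pd

# Hypothetical toy data standing in for the 'user' and 'artist' columns.
toy = pd.DataFrame({
    "user":   [1, 1, 3, 3, 4],
    "artist": ["goldfrapp", "le tigre", "goldfrapp", "goldfrapp", "dropkick murphys"],
})

# Count each (user, artist) pair, then cap the counts at 1 so the result
# holds only 0s and 1s: one row per user, one column per artist.
incidence = pd.crosstab(toy["user"], toy["artist"]).clip(upper=1)

print(incidence)
```

Note that `pd.crosstab` sorts the row and column labels, so the artists appear in alphabetical order rather than in order of first appearance.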

In [1]:
#importing libraries
import pandas as pd
import numpy as np

#reading the data file.
radio=pd.read_csv('lastfm.csv') 
print('\n Sample view of dataset lastfm radio:\n\n',radio.head(5))

#shape of the dataset 
print('\n\nDataset shape:',radio.shape)

# Describe the data
print('\n\nDescribe data:\n', radio.user.describe())

# counts of each user 
tab= pd.crosstab(radio.user,columns='counts')
print('\n\nCounts for each user:\n',tab[0:5])

# print the unique artists list
artists = radio.artist.unique()
print('\n\nNo. of unique artists:',artists.shape[0])
print('\nUnique artists list sample:\n',artists[0:5])

# print the unique users list
users = radio.user.unique()
print('\n\nNo. of unique users:',users.shape[0])
print('\nUnique users list sample:\n',users[0:5])

# remove duplicate rows here
radio.drop_duplicates(inplace = True)
print('\n\nSize of data after duplicate records removal:',radio.shape)

# Build a dictionary with the user id as the key and the list of that user's artists as the value.
d = {}

# Walk through every (user, artist) pair in the dataset.
for user_id, artist_name in zip(radio.user, radio.artist):
    # setdefault creates an empty list the first time a user id is seen;
    # the artist name is then appended to the list stored against that user id.
    d.setdefault(user_id, []).append(artist_name)
    
# Create an empty dataframe with the artists as columns and the user ids as the row index.
# It will hold the binary representation of which artists each user has played.
df = pd.DataFrame(columns=artists, index=users)

# Fill in the binary representation with two nested for loops:
# the outer loop walks through the user ids one by one,
# the inner loop walks through the artist names one by one.
# If the artist name is in the list of artists stored against the user id,
# the cell at (row, column) is set to 1; otherwise it is set to 0.

for user_id in users:
    for artist_name in artists:
    
        if artist_name in d[user_id]:
            df.loc[user_id,artist_name] = 1
        else:
            df.loc[user_id,artist_name] = 0
        
print('\n\nTransfomed data:\n',df.head(5))



 Sample view of dataset lastfm radio:

    user                   artist sex  country
0     1    red hot chili peppers   f  Germany
1     1  the black dahlia murder   f  Germany
2     1                goldfrapp   f  Germany
3     1         dropkick murphys   f  Germany
4     1                 le tigre   f  Germany


Dataset shape: (289955, 4)


Describe data:
 count    289955.000000
mean       9852.460447
std        5692.355041
min           1.000000
25%        4935.000000
50%        9838.000000
75%       14769.000000
max       19718.000000
Name: user, dtype: float64


Counts for each user:
 col_0  counts
user         
1          16
3          29
4          27
5          11
6          23


No. of unique artists: 1004

Unique artists list sample:
 ['red hot chili peppers' 'the black dahlia murder' 'goldfrapp'
 'dropkick murphys' 'le tigre']


No. of unique users: 15000

Unique users list sample:
 [1 3 4 5 6]


Size of data after duplicate records removal: (289953, 4)


Transfomed data: