# Sampling user groups from the LFM-1b dataset

**BEFORE YOU RUN THE NOTEBOOK**:

Place the unzipped [LFM-1b dataset](https://drive.jku.at/ssf/s/readFile/share/1056/266403063659030189/publicLink/LFM-1b.zip) (published in [this paper](http://www.cp.jku.at/people/schedl/Research/Publications/pdf/schedl_ism_mam_2017.pdf)) in the `./data` folder, so that its files are available under `./data/LFM-1b/`.

This notebook:
- replicates the sampling performed in [this paper about unfairness in music recommendations](https://link.springer.com/chapter/10.1007%2F978-3-030-45442-5_5) (the "extreme" strategy)
- replicates the sampling strategy used in [this paper about unfairness in music recommendations](https://arxiv.org/abs/1907.13286) (the "percentile" strategy)
- compares these two ways of creating user samples

In [2]:
import pandas as pd
import numpy as np
import os

In [3]:
out_folder = "./data/user_groups"

# Create the output folder (and any missing parent folders) if it does not exist yet
os.makedirs(out_folder, exist_ok=True)

In [4]:
extreme_relevant_user_ids = set()
percentile_relevant_user_ids = set()

## Load LFM-1b data

In [5]:
users_additional = pd.read_csv("data/LFM-1b/LFM-1b_users_additional.txt", sep="\t")

In [6]:
# Rank all users from least to most mainstream
sorted_user_add = users_additional.sort_values(by="mainstreaminess_global")

## Sample according to "extreme" strategy

In [7]:
sorted_user_add["mainstreaminess_global"] == sorted_user_add["mainstreaminess_global"].median()

69165     False
25230     False
105083    False
116054    False
110784    False
          ...  
18190     False
35278     False
13605     False
32889     False
31726     False
Name: mainstreaminess_global, Length: 120322, dtype: bool

In [8]:
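# The 1000 users with the lowest mainstreaminess
low_main_users = sorted_user_add.iloc[:1000]

# The 1000 users centred on the middle of the ranking
n = sorted_user_add.shape[0]
lower = n // 2 - 500
higher = n // 2 + 500
medium_main_users = sorted_user_add.iloc[lower:higher]

# The 1000 users with the highest mainstreaminess
high_main_users = sorted_user_add.iloc[-1000:]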
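
The slicing above can be wrapped in a small helper and checked on synthetic data. This is only a sketch: the function name `sample_extreme_groups`, the `group_size` parameter, and the toy frame are hypothetical and not part of the notebook or the dataset.

```python
import numpy as np
import pandas as pd


def sample_extreme_groups(users, column="mainstreaminess_global", group_size=1000):
    """Return the lowest, middle and highest `group_size` users by `column`.

    Hypothetical helper mirroring the slicing used in this notebook.
    """
    ordered = users.sort_values(by=column)
    n = len(ordered)
    if group_size <= 0 or n < 3 * group_size:
        raise ValueError("need at least 3 * group_size users for disjoint groups")
    start = n // 2 - group_size // 2  # left edge of the middle window
    low = ordered.iloc[:group_size]
    medium = ordered.iloc[start:start + group_size]
    high = ordered.iloc[-group_size:]
    return low, medium, high


# Toy data standing in for LFM-1b_users_additional.txt (values are random)
rng = np.random.default_rng(0)
toy_users = pd.DataFrame({
    "user-id": np.arange(30),
    "mainstreaminess_global": rng.random(30),
})
low, medium, high = sample_extreme_groups(toy_users, group_size=4)
```

The size check guards against the three windows overlapping when the input is small, which the fixed indices used in the notebook do not need for the full dataset.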


In [9]:
low_main_users[["user-id", "mainstreaminess_global"]].to_csv(f"{out_folder}/extreme_low_main_users.csv")
medium_main_users[["user-id", "mainstreaminess_global"]].to_csv(f"{out_folder}/extreme_medium_main_users.csv")
high_main_users[["user-id", "mainstreaminess_global"]].to_csv(f"{out_folder}/extreme_high_main_users.csv")

extreme_relevant_user_ids = extreme_relevant_user_ids.union(low_main_users["user-id"])
extreme_relevant_user_ids = extreme_relevant_user_ids.union(medium_main_users["user-id"])
extreme_relevant_user_ids = extreme_relevant_user_ids.union(high_main_users["user-id"])

## Sample according to "percentile" strategy

In [10]:
mainstreaminess = users_additional["mainstreaminess_global"]
twenty_percentile = mainstreaminess.quantile(0.2)
eighty_percentile = mainstreaminess.quantile(0.8)

# Boolean masks for the bottom 20%, the middle 60% and the top 20% of users
is_low = mainstreaminess <= twenty_percentile
is_medium = (mainstreaminess > twenty_percentile) & (mainstreaminess < eighty_percentile)
is_high = mainstreaminess >= eighty_percentile

# Draw 1000 random users from each pool
low_percentile_main_users = users_additional[is_low].sample(1000)
medium_percentile_main_users = users_additional[is_medium].sample(1000)
high_percentile_main_users = users_additional[is_high].sample(1000)
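
Because `.sample(1000)` draws different users on every run, the exported groups change between executions. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable for the replication. The helper name `sample_percentile_groups`, its parameters, and the seed value are hypothetical choices, not taken from either paper.

```python
import numpy as np
import pandas as pd


def sample_percentile_groups(users, column="mainstreaminess_global",
                             group_size=1000, lower_q=0.2, upper_q=0.8, seed=42):
    """Draw `group_size` random users from the low, medium and high pools.

    Hypothetical helper; `seed` is an assumption added for reproducibility.
    """
    scores = users[column]
    lower = scores.quantile(lower_q)
    upper = scores.quantile(upper_q)
    pools = (
        users[scores <= lower],                       # bottom pool
        users[(scores > lower) & (scores < upper)],   # middle pool
        users[scores >= upper],                       # top pool
    )
    # random_state makes every call return the same rows for the same input
    return tuple(pool.sample(n=group_size, random_state=seed) for pool in pools)


# Toy data standing in for LFM-1b_users_additional.txt (values are random)
rng = np.random.default_rng(0)
toy_users = pd.DataFrame({
    "user-id": np.arange(100),
    "mainstreaminess_global": rng.random(100),
})
first_run = sample_percentile_groups(toy_users, group_size=5)
second_run = sample_percentile_groups(toy_users, group_size=5)
```

With a fixed `random_state`, repeated runs on the same input select the same users, so the CSV files written afterwards stay stable.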

In [11]:
low_percentile_main_users[["user-id", "mainstreaminess_global"]].to_csv(f"{out_folder}/percentile_low_main_users.csv")
medium_percentile_main_users[["user-id", "mainstreaminess_global"]].to_csv(f"{out_folder}/percentile_medium_main_users.csv")
high_percentile_main_users[["user-id", "mainstreaminess_global"]].to_csv(f"{out_folder}/percentile_high_main_users.csv")

percentile_relevant_user_ids = percentile_relevant_user_ids.union(low_percentile_main_users["user-id"])
percentile_relevant_user_ids = percentile_relevant_user_ids.union(medium_percentile_main_users["user-id"])
percentile_relevant_user_ids = percentile_relevant_user_ids.union(high_percentile_main_users["user-id"])
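
One way to compare the two strategies is to look at the score range each group covers and at how many users the groups share. The sketch below does this for the low-mainstreaminess groups only, on synthetic data. The names `toy`, `summarize`, and the group size of 10 are hypothetical, and the real comparison may use different statistics.

```python
import numpy as np
import pandas as pd

col = "mainstreaminess_global"

# Toy data standing in for LFM-1b_users_additional.txt (values are random)
rng = np.random.default_rng(1)
toy = pd.DataFrame({"user-id": np.arange(200), col: rng.random(200)})

# "Extreme" low group: the 10 lowest-ranked users
extreme_low = toy.sort_values(col).iloc[:10]

# "Percentile" low group: 10 random users from the bottom 20%
q20 = toy[col].quantile(0.2)
percentile_low = toy[toy[col] <= q20].sample(n=10, random_state=0)


def summarize(groups):
    """Return one row per group with the min, median and max score."""
    stats = {name: frame[col].agg(["min", "median", "max"])
             for name, frame in groups.items()}
    return pd.DataFrame(stats).T


summary = summarize({"extreme_low": extreme_low, "percentile_low": percentile_low})
shared_users = set(extreme_low["user-id"]) & set(percentile_low["user-id"])
print(summary)
print(f"users in both groups: {len(shared_users)}")
```

By construction the extreme group sits at the very bottom of the ranking, while the percentile group is spread across the whole bottom pool, so the summary makes the difference in coverage visible.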

## Create user events files

In [44]:
cols = ['user', 'artist', 'album', 'track', 'timestamp']

In [46]:
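# Parsing the binary `LFM-1b_LEs.mat` file with `pd.read_csv` failed with:
#   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 1: invalid start byte
# Assumption: the listening events are also shipped as a tab-separated text file,
# `LFM-1b_LEs.txt`, without a header row. Adjust the path if your copy differs.
iter_csv = pd.read_csv('data/LFM-1b/LFM-1b_LEs.txt', sep="\t", names=cols, encoding="utf-8", chunksize=1_000_000)
# Keep only the events of the sampled users, one chunk at a time, to limit memory use
percentile_user_le = pd.concat([chunk[chunk['user'].isin(percentile_relevant_user_ids)] for chunk in iter_csv])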
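
The chunked filtering idea can be tried on a small in-memory file before running it on the full listening-events file. This is a sketch under assumptions: the helper `filter_events`, the tab separator, and the toy rows are hypothetical and only illustrate the pattern.

```python
import io

import pandas as pd


def filter_events(source, user_ids, chunksize=2, sep="\t"):
    """Stream `source` in chunks and keep only rows whose user is in `user_ids`.

    Hypothetical helper; the column names follow the `cols` list of this notebook.
    """
    cols = ["user", "artist", "album", "track", "timestamp"]
    reader = pd.read_csv(source, sep=sep, names=cols, header=None, chunksize=chunksize)
    kept = [chunk[chunk["user"].isin(user_ids)] for chunk in reader]
    return pd.concat(kept, ignore_index=True)


# Five made-up listening events: user, artist, album, track, timestamp
toy_events = (
    "1\t10\t100\t1000\t1357000000\n"
    "2\t11\t101\t1001\t1357000100\n"
    "3\t12\t102\t1002\t1357000200\n"
    "1\t13\t103\t1003\t1357000300\n"
    "4\t14\t104\t1004\t1357000400\n"
)
events = filter_events(io.StringIO(toy_events), user_ids={1, 3})
```

Only one chunk is held in memory at a time while filtering, so the memory footprint depends on `chunksize` and on the number of matching rows rather than on the size of the whole file.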

In [30]:
percentile_user_le

Unnamed: 0,user,artist,album,track,timestamp


In [32]:
users_additional["user-id"].isin(percentile_relevant_user_ids)

0         False
1         False
2         False
3         False
4         False
          ...  
120317    False
120318    False
120319    False
120320    False
120321    False
Name: user-id, Length: 120322, dtype: bool

## Recreate analysis