# Notes for Revisiting Popularity and Demographic Biases in Recommender Evaluation and Effectiveness


First, we load the data and lay out a plan.

### Plan for ML1M:

Prepare dataset:
- Age binning
- Gender binning
- Usage binning
- Pop-Index binning 
- Divide data into train & validate/test
    - For ML1M:
        - Partition the whole set of 6,040 users into five splits of 1,208 users each, one per cross-validation iteration.
        - Hold out 20% of the items each test user has interacted with to use as test data.
        - All other users and the remaining items of the test users are used for model training in each iteration.
        - The ML1M dataset only includes users who have rated at least 20 movies, so none are removed.
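
The splitting step of the plan could be sketched roughly as follows. This is a minimal pure-Python illustration, not the paper's code: `make_folds`, its parameters, the fixed seed, and the toy data are all hypothetical names and values chosen for this example.

```python
import random

def make_folds(user_items, n_folds=5, holdout_frac=0.2, seed=42):
    """Yield one (train, test) pair of dicts (user -> items) per fold.

    user_items maps a user ID to the list of items that user interacted with.
    All names and defaults here are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    users = sorted(user_items)
    rng.shuffle(users)
    # Deal the shuffled users round-robin into n_folds groups.
    folds = [users[i::n_folds] for i in range(n_folds)]
    for test_users in folds:
        # Start with everything in training, then carve out the held-out items.
        train = {u: list(items) for u, items in user_items.items()}
        test = {}
        for u in test_users:
            items = list(user_items[u])
            rng.shuffle(items)
            n_test = max(1, round(holdout_frac * len(items)))
            test[u] = items[:n_test]    # held-out items of a test user
            train[u] = items[n_test:]   # the rest stays in training
        yield train, test

# Toy data with the ML1M user count: 6,040 users with 20 made-up items each.
toy = {u: list(range(20)) for u in range(1, 6041)}
splits = list(make_folds(toy))
print(len(splits), len(splits[0][1]))
```

Each user ends up as a test user in exactly one fold, and users outside the current test split contribute all of their items to training.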


### Loading in the Movie Lens Data

First, we load the packages we need.

Notes regarding the data used below: I replaced the `::` separator with `|`, then copied the data into an Excel file to work around problems loading rows with differing numbers of columns and to name the variables.

In [7]:
import numpy as np
import pandas as pd
import implicit
import scipy

## USERS FILE DESCRIPTION


User information is in the file "users.dat" and is in the following
format:

UserID::Gender::Age::Occupation::Zip-code

All demographic information is provided voluntarily by the users and is
not checked for accuracy.  Only users who have provided some demographic
information are included in this data set.

- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:

	*  1:  "Under 18"
	* 18:  "18-24"
	* 25:  "25-34"
	* 35:  "35-44"
	* 45:  "45-49"
	* 50:  "50-55"
	* 56:  "56+"

- Occupation is chosen from the following choices:

	*  0:  "other" or not specified
	*  1:  "academic/educator"
	*  2:  "artist"
	*  3:  "clerical/admin"
	*  4:  "college/grad student"
	*  5:  "customer service"
	*  6:  "doctor/health care"
	*  7:  "executive/managerial"
	*  8:  "farmer"
	*  9:  "homemaker"
	* 10:  "K-12 student"
	* 11:  "lawyer"
	* 12:  "programmer"
	* 13:  "retired"
	* 14:  "sales/marketing"
	* 15:  "scientist"
	* 16:  "self-employed"
	* 17:  "technician/engineer"
	* 18:  "tradesman/craftsman"
	* 19:  "unemployed"
	* 20:  "writer"
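
The record format above can also be read directly, without going through Excel. The following is a small sketch under assumptions: `AGE_LABELS` and `parse_user_line` are hypothetical helpers, and the sample line assumes that the raw zip code of user 4 carries a leading zero (it shows up as 2460 in the Excel-based output).

```python
# Age codes and their ranges, copied from the description above.
AGE_LABELS = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}

def parse_user_line(line):
    """Split one 'UserID::Gender::Age::Occupation::Zip-code' record."""
    user_id, gender, age, occupation, zip_code = line.strip().split("::")
    return {
        "UserID": int(user_id),
        "Gender": gender,
        "Age": int(age),
        "AgeLabel": AGE_LABELS[int(age)],
        "Occupation": int(occupation),
        "ZipCode": zip_code,  # kept as text so a leading zero is not lost
    }

# Assumed sample record for illustration only.
example = parse_user_line("4::M::45::7::02460")
print(example)
```

Keeping the zip code as a string is a deliberate choice here, since a numeric column drops leading zeros.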

In [3]:
user_data = pd.read_excel(r'C:\mahmoud uni\TU\WS2022_2023\Experiment Design\EX2\users.xlsx')
print(user_data)

      UserID Gender  Age  Occupation Zip-Code
0          1      F    1          10    48067
1          2      M   56          16    70072
2          3      M   25          15    55117
3          4      M   45           7     2460
4          5      M   25          20    55455
...      ...    ...  ...         ...      ...
6035    6036      F   25          15    32603
6036    6037      F   45           1    76006
6037    6038      F   56           1    14706
6038    6039      F   45           0     1060
6039    6040      M   25           6    11106

[6040 rows x 5 columns]


In [16]:
# We use the dat file because it loads faster
rating_raw = np.genfromtxt('ratings.dat',
                     names=True,
                     dtype=None,
                     encoding=None,
                     delimiter='::')

rating_data = pd.DataFrame(rating_raw, columns=["UserID", "MovieID", "Rating", "Timestamp"])
print(rating_data)

         UserID  MovieID  Rating  Timestamp
0             1     1193       5  978300760
1             1      661       3  978302109
2             1      914       3  978301968
3             1     3408       4  978300275
4             1     2355       5  978824291
...         ...      ...     ...        ...
1000204    6040     1091       1  956716541
1000205    6040     1094       5  956704887
1000206    6040      562       5  956704746
1000207    6040     1096       4  956715648
1000208    6040     1097       4  956715569

[1000209 rows x 4 columns]


In [17]:
movies_data = pd.read_excel(r'C:\mahmoud uni\TU\WS2022_2023\Experiment Design\EX2\movies.xlsx')
print(movies_data)

      MovieID                               Title    Genre 1     Genre 2  \
0           1                    Toy Story (1995)  Animation  Children's   
1           2                      Jumanji (1995)  Adventure  Children's   
2           3             Grumpier Old Men (1995)     Comedy     Romance   
3           4            Waiting to Exhale (1995)     Comedy       Drama   
4           5  Father of the Bride Part II (1995)     Comedy         NaN   
...       ...                                 ...        ...         ...   
3878     3948             Meet the Parents (2000)     Comedy         NaN   
3879     3949          Requiem for a Dream (2000)      Drama         NaN   
3880     3950                    Tigerland (2000)      Drama         NaN   
3881     3951             Two Family House (2000)      Drama         NaN   
3882     3952               Contender, The (2000)      Drama    Thriller   

      Genre 3 Genre 4 Genre 5 Genre 6  
0      Comedy     NaN     NaN     NaN  
1     F

In [18]:
data = pd.merge(movies_data, rating_data, how='inner', on = "MovieID")
data = pd.merge(data, user_data, how = "inner", on = "UserID")

## Alternative Approach using Implicit Dataset

In [9]:
import numpy as np
import pandas as pd
import implicit
import scipy

# ALS
from implicit.datasets.movielens import get_movielens
from implicit.als import AlternatingLeastSquares

# NDCG
from sklearn.metrics import ndcg_score

#### Alternating Least Squares

Possible fixes/notes:
- There is a movielens.py file in the examples folder of the implicit repository (https://github.com/benfred/implicit).
- implicit package documentation: https://benfred.github.io/implicit/




In [10]:
# Alternative approach
movies, ratings = get_movielens(variant="1m")

# Take the transpose, since most of the functions in implicit expect (user, item)
# sparse matrices instead of (item, user)
moviesT = ratings.T.tocsr()

In [16]:
ratings.head

AttributeError: 'numpy.ndarray' object has no attribute 'head'

Implicit provides implementations of several different algorithms for implicit feedback recommender systems. In the paper they use the AlternatingLeastSquares model.

This model learns a binary target indicating whether each user has interacted with each item, and weights each interaction by a confidence value expressing how much we trust that user/item interaction. The implementation in implicit reads the confidences from the values of a sparse matrix, where a nonzero entry means the user has interacted with the item.
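
As a rough sketch, the confidence-weighted objective described above is commonly written in the following form. The symbols are assumptions introduced here, not notation from the paper or the library: $r_{ui}$ is the raw interaction value, $p_{ui}$ the binary target, $c_{ui}$ the confidence, $x_u$ and $y_i$ the user and item factor vectors, and $\lambda$ the regularization weight.

$$
\min_{x,\,y} \; \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
\;+\; \lambda \Bigl( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Bigr),
\qquad
p_{ui} =
\begin{cases}
1 & \text{if } r_{ui} > 0 \\
0 & \text{otherwise}
\end{cases}
$$

A frequent choice in the implicit-feedback literature is $c_{ui} = 1 + \alpha\, r_{ui}$. Whether the library applies exactly this scaling should be checked against its documentation. In the call below, `factors` would correspond to the length of $x_u$ and $y_i$, and `regularization` to $\lambda$.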

In [11]:
# AlternatingLeastSquares
model = AlternatingLeastSquares(factors=50, regularization=0.01)
# fit trains the model in place, so there is no result to keep
model.fit(moviesT)

  0%|          | 0/15 [00:00<?, ?it/s]

The .recommend call computes the N best recommendations for each user in the input and returns the item IDs in the ids array and the computed scores in the scores array. We can see which movies are recommended for each user by looking up the ids in the movies array:

#### Roadmap
- Make 1000 recommendations for every user
- measure results (scores?) using NDCG
- MRR
- RBP
- Perform Kruskal-Wallis significance on mean NDCG values between demographic groups
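
Two of the metrics on this roadmap, MRR and RBP, are short enough to sketch in plain Python. This is an illustration under assumptions: the function names, the toy ranking, and the patience value of 0.8 are chosen here and are not taken from the paper.

```python
def reciprocal_rank(ranking, relevant):
    """Return 1 / rank of the first relevant item, or 0.0 if none is found."""
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def rank_biased_precision(ranking, relevant, patience=0.8):
    """RBP: (1 - p) * sum of p**(rank - 1) over the relevant positions."""
    return (1 - patience) * sum(
        patience ** i for i, item in enumerate(ranking) if item in relevant
    )

# Made-up ranking and relevance set for a single user.
ranking = ["a", "b", "c", "d"]
relevant = {"b", "d"}
rr = reciprocal_rank(ranking, relevant)
rbp = rank_biased_precision(ranking, relevant)
print(rr, rbp)
```

MRR is then the mean of the per-user reciprocal ranks. In this setting the relevant set would be the held-out test items of each user.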

#### ALS

In [12]:
# TODO make recommendations for all the users
# min 1
# max 6040

userid = 1
ids, scores = model.recommend(userid, moviesT[userid], N = 1000, filter_already_liked_items=False)
pd.DataFrame({"movies": movies[ids], "score": scores, "already_liked": np.in1d(ids, moviesT[userid].indices)})


| | movies | score | already_liked |
|---|---|---|---|
| 0 | Beauty and the Beast (1991) | 1.045676 | True |
| 1 | Toy Story (1995) | 1.023153 | True |
| 2 | Lion King, The (1994) | 0.973881 | False |
| 3 | Toy Story 2 (1999) | 0.971791 | True |
| 4 | Wizard of Oz, The (1939) | 0.933014 | True |
| ... | ... | ... | ... |
| 995 | Bandits (1997) | 0.037338 | False |
| 996 | Klute (1971) | 0.037207 | False |
| 997 | Thing From Another World, The (1951) | 0.037140 | False |
| 998 | Footloose (1984) | 0.037104 | False |


In [13]:
# Calculating 1k Scores for all users
df_dict = {}

for userid in range(1, 6041):  # user IDs run from 1 to 6040 inclusive
    ids, scores = model.recommend(userid, moviesT[userid], N = 1000, filter_already_liked_items=False)
    df_dict["userScore{0}".format(userid)] = pd.DataFrame({"movies": movies[ids], "score": scores, "already_liked": np.in1d(ids, moviesT[userid].indices)})


In [14]:
# Alternative way of calculating batch recommendations
# The notes above give 1 as the minimum and 6040 as the maximum user ID
userids = np.arange(1, 6041)
bIds, bScores = model.recommend(userids, moviesT[userids], N = 1000)

#### NDCG

In [17]:
df_dict["userScore1"]

| | movies | score | already_liked |
|---|---|---|---|
| 0 | Toy Story (1995) | 1.006420 | True |
| 1 | Beauty and the Beast (1991) | 1.005634 | True |
| 2 | Toy Story 2 (1999) | 1.005403 | True |
| 3 | Aladdin (1992) | 0.938332 | True |
| 4 | Lion King, The (1994) | 0.930784 | False |
| ... | ... | ... | ... |
| 995 | Dick (1999) | 0.035864 | False |
| 996 | When the Cats Away (Chacun cherche son chat) (... | 0.035856 | False |
| 997 | Age of Innocence, The (1993) | 0.035845 | False |
| 998 | Madame Sousatzka (1988) | 0.035817 | False |


In [16]:
# TODO: Problem: true relevance scores are needed for all metrics -> maybe normalize
# the ratings and use them as the true relevance?
y_score = df_dict["userScore1"]
ndcg_score(y_true, y_score["score"])

TypeError: ndcg_score() missing 1 required positional argument: 'y_score'
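
One possible way to approach the open question of true relevance is to treat the held-out ratings of a user as gains and score the ranked list against them. The sketch below is pure Python with made-up item IDs and ratings. The helper names `dcg` and `ndcg` are hypothetical, and using raw ratings as linear gains is an assumption, not the paper's stated method. Note also that `sklearn.metrics.ndcg_score` expects both `y_true` and `y_score` as 2D arrays of shape (n_samples, n_labels), so the vectors of a single user would have to be wrapped in an outer list before calling it.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))

def ndcg(ranking, true_gain, k=None):
    """NDCG of a ranked item list against a dict of item -> true gain."""
    k = len(ranking) if k is None else k
    # Items without a known gain count as non-relevant (gain 0).
    gains = [true_gain.get(item, 0.0) for item in ranking[:k]]
    ideal = sorted(true_gain.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: items A and C are held out with made-up ratings 5 and 4.
held_out = {"A": 5.0, "C": 4.0}
ranking = ["A", "B", "C"]
score = ndcg(ranking, held_out)
print(score)
```

A ranking that places all held-out items first, in order of their ratings, scores 1.0 under this definition.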