### Foma Mironenko, <br>SPbU, Faculty of Mathematics and Mechanics,<br>431

# Clusterization methods, *Part I*

### The purpose of this work is to create a movie recommendation system. In the code below we apply different clusterization techniques to structure data from the MovieLens dataset. We try three well-known algorithms: `K-means`, `DBSCAN` and `OPTICS`. As a result we obtain a set of users subdivided into groups with similar genre preferences, which serves as the basis for film recommendations.

# Part I

### In this notebook all necessary data is pre-processed and exported into `.csv` files for further use. In *Part II* this data is imported and clusterization is applied.

# Dataset structure

### As source data for clusterization we use a pre-processed MovieLens dataset. Each film in this dataset is tagged with a subset of a set of 18 genres. There is also a list of user ratings. Each user has rated multiple films, and a rating is a number from 0.5 to 5 in half-star steps.

In [16]:
#----- data handling -----#
import pandas as pd
import numpy as np
from tqdm import tqdm

# Constants definition

In [17]:
GENRES = {
    'action':      0,  'adventure': 1,  'animation': 2,
    'children':    3,  'comedy':    4,  'crime':     5,
    'documentary': 6,  'drama':     7,  'fantasy':   8,
    'film-noir':   9,  'horror':    10, 'musical':   11,
    'mystery':     12, 'romance':   13, 'sci-fi':    14,
    'thriller':    15, 'war':       16, 'western':   17
}

N_GEN = len(GENRES)

TAGS = 'tags'
MOVS = 'movies'
RATS = 'ratings'
GEN_SCRS = 'genome-scores'
GEN_TAGS = 'genome-tags'

# Functions definition

##### Read the first `Nrows` rows of `"{name}.csv"`, or the whole file if `Nrows` is not passed

In [18]:
def get_dataset(name: str, Nrows: int = None):
    """Load ./ml-25m/{name}.csv; read only the first Nrows rows if given."""
    # pd.read_csv reads the whole file when nrows is None
    return pd.read_csv(f"./ml-25m/{name}.csv", nrows=Nrows, encoding="utf-8")

##### Convert a `|`-separated genre string into a 0/1 vector of length `len(GENRES)`; unknown genres are ignored

In [26]:
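def parse_genre(genre_str: str):
    # Split e.g. 'Adventure|Comedy' into lowercase names and keep only known genres
    gens = set(genre_str.lower().split('|'))
    gens &= GENRES.keys()
    inds = [GENRES[genre] for genre in gens]
    # Indicator vector: 1 at the position of every genre present
    res = np.zeros(N_GEN, dtype=int)
    if inds:
        res[np.array(inds)] = 1
    return res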

# Prepare data

In [42]:
# MOVIES: sort by movieId and use it as the index
movs_df = get_dataset(MOVS)
movs_df.sort_values(axis=0, by='movieId', ascending=True, inplace=True)
movs_df.set_index('movieId', inplace=True)
# RATINGS: only the first 1,000,000 rows are loaded
rats_df = get_dataset(RATS, 1000000)

In [43]:
movs_df.head(1)

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [44]:
rats_df.head(1)

,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044


## User ratings data

##### For each user we compute an average rating for each genre: from the films rated by the user we take those with a given genre and average their ratings. The rating sum is divided by `1 + n_genres`, which avoids division by zero for genres the user never rated, so the stored values are slightly below the plain mean.
##### ``n_genres`` -- number of films of each genre rated by the user <br>``mean_rates`` -- average rating for each genre

In [45]:
# One row per user; each cell holds a vector of length N_GEN
user_marks = pd.DataFrame(columns=['uId', 'n_genres', 'mean_rates'])
user_marks.set_index('uId', inplace=True)

for i in tqdm(range(len(rats_df))):
    rate = rats_df.iloc[i]
    mId = rate['movieId']
    # Skip ratings of films missing from the movies table
    if mId not in movs_df.index.values:
        continue
    movie = movs_df.loc[mId]
    genre_vec = parse_genre(movie['genres'])
    # The rating is counted once for every genre of the film
    weights_vec = genre_vec * rate['rating']
    if rate.userId in user_marks.index.values:
        # 'mean_rates' holds running rating sums until it is normalized later
        user_marks.loc[rate.userId, 'mean_rates'] += weights_vec
        user_marks.loc[rate.userId, 'n_genres'] += genre_vec
    else:
        user_marks.loc[rate.userId] = [genre_vec, weights_vec]

100%|██████████| 1000000/1000000 [10:55<00:00, 1524.87it/s]


In [46]:
# Dividing by 1 + count avoids division by zero for unrated genres (and slightly lowers the mean)
user_marks['mean_rates'] = user_marks['mean_rates'] / (1 + user_marks['n_genres'])
user_marks.head(5)

uId,n_genres,mean_rates
1.0,"[4, 11, 2, 3, 23, 8, 1, 53, 5, 1, 1, 5, 4, 18,...","[3.3, 3.4166666666666665, 2.6666666666666665, ..."
2.0,"[66, 75, 17, 25, 63, 18, 0, 91, 29, 0, 3, 11, ...","[3.6417910447761193, 3.8552631578947367, 3.416..."
3.0,"[334, 198, 50, 48, 176, 132, 3, 232, 78, 5, 45...","[3.629850746268657, 3.670854271356784, 3.90196..."
4.0,"[145, 114, 31, 28, 81, 37, 5, 49, 39, 0, 10, 7...","[3.164383561643836, 3.0478260869565217, 3.3593..."
5.0,"[18, 21, 4, 9, 49, 14, 0, 45, 8, 0, 3, 7, 7, 2...","[3.526315789473684, 3.6818181818181817, 3.0, 3..."
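
##### The row-by-row loop above took about eleven minutes for one million ratings. As a rough sketch only, the same sums and counts can probably be obtained with a merge and a `groupby`. The toy tables, the shortened three-genre list `GENRE_NAMES` and all variable names below are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical, shortened genre list; the notebook uses 18 genres
GENRE_NAMES = ['action', 'comedy', 'drama']

# Toy stand-ins for movs_df and rats_df (made-up values)
movies = pd.DataFrame(
    {'movieId': [1, 2, 3],
     'genres': ['Action|Comedy', 'Drama', 'Comedy']}
).set_index('movieId')
ratings = pd.DataFrame(
    {'userId': [1, 1, 2, 2, 2],
     'movieId': [1, 2, 1, 3, 99],   # movieId 99 is absent from movies
     'rating': [4.0, 3.0, 5.0, 2.0, 1.0]}
)

# 0/1 genre indicator columns, one row per film
genre_mat = (movies['genres'].str.lower()
             .str.get_dummies(sep='|')
             .reindex(columns=GENRE_NAMES, fill_value=0))

# The inner merge drops ratings whose movieId is unknown, like the loop's `continue`
joined = ratings.merge(genre_mat, left_on='movieId', right_index=True, how='inner')

# Number of rated films per user and genre
counts = joined.groupby('userId')[GENRE_NAMES].sum()
# Sum of ratings per user and genre
sums = (joined[GENRE_NAMES]
        .mul(joined['rating'], axis=0)
        .groupby(joined['userId'])
        .sum())
# Same normalization as in the notebook: divide by 1 + count
smoothed = sums / (1 + counts)
print(smoothed)
```

##### Unlike `user_marks`, this layout keeps one column per genre instead of one array per cell, so it would need reshaping before being compared with the notebook's output.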


##### Save the pre-processing result to a CSV file for further use

In [47]:
user_marks.to_csv('rates_data.csv')
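
##### Each cell of `user_marks` holds a NumPy array, so `to_csv` writes its string form and the reader has to parse it back. How *Part II* actually does this is not shown here. The following is only a sketch, with a hypothetical helper `parse_array_cell` and a made-up one-row frame.

```python
import io

import numpy as np
import pandas as pd

def parse_array_cell(cell: str, dtype=float) -> np.ndarray:
    """Hypothetical helper: turn a cell like '[ 4 11  2]' back into an array."""
    # str.split() also copes with the line breaks NumPy inserts in long arrays
    return np.array(cell.strip('[]').split(), dtype=dtype)

# Made-up frame with the same layout as user_marks
demo = pd.DataFrame(
    {'n_genres': [np.array([4, 11, 2])],
     'mean_rates': [np.array([3.3, 3.5, 2.0])]},
    index=pd.Index([1.0], name='uId'),
)

# Write to an in-memory buffer instead of a file on disk
buf = io.StringIO()
demo.to_csv(buf)
buf.seek(0)

restored = pd.read_csv(
    buf,
    index_col='uId',
    converters={
        'n_genres': lambda s: parse_array_cell(s, int),
        'mean_rates': parse_array_cell,
    },
)
print(restored.loc[1.0, 'n_genres'])
```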

## Film data

##### For each film we compute a binary genre vector of length 18 (stored as integers). An element of the vector is set to 1 if and only if the genre is present in the film description.

In [50]:
movie_genres = pd.DataFrame(columns=['mId', 'name', 'genre_vec'])
movie_genres.set_index('mId', inplace=True)

# NOTE: mId here is the 0-based row position in movs_df, not the MovieLens movieId
for mId in tqdm(range(len(movs_df))):
    movie = movs_df.iloc[mId]
    genre_vec = parse_genre(movie['genres'])
    movie_genres.loc[mId] = [movie['title'], genre_vec]

100%|██████████| 62423/62423 [03:06<00:00, 335.24it/s]


In [51]:
movie_genres.head(5)

mId,name,genre_vec
0,Toy Story (1995),"[0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
1,Jumanji (1995),"[0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
2,Grumpier Old Men (1995),"[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ..."
3,Waiting to Exhale (1995),"[0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, ..."
4,Father of the Bride Part II (1995),"[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
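
##### The loop above needed roughly three minutes for about 62,000 films and indexes the result by row position. As a hedged alternative, the sketch below encodes a whole `genres` column at once with `Series.str.get_dummies` and keeps `movieId` as the index. The list `GENRE_NAMES`, the function `genre_matrix` and the three-row toy table are assumptions for illustration only.

```python
import pandas as pd

# Genre names in the same order as the GENRES dictionary
GENRE_NAMES = [
    'action', 'adventure', 'animation', 'children', 'comedy', 'crime',
    'documentary', 'drama', 'fantasy', 'film-noir', 'horror', 'musical',
    'mystery', 'romance', 'sci-fi', 'thriller', 'war', 'western',
]

def genre_matrix(genres: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: one 0/1 column per genre, one row per film."""
    dummies = genres.str.lower().str.get_dummies(sep='|')
    # Drop labels outside GENRE_NAMES and add all-zero columns for absent genres
    return dummies.reindex(columns=GENRE_NAMES, fill_value=0)

# Toy stand-in for movs_df, indexed by movieId
toy_movies = pd.DataFrame(
    {'movieId': [1, 2, 3],
     'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
     'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
                'Adventure|Children|Fantasy',
                'Comedy|Romance']}
).set_index('movieId')

genre_mat = genre_matrix(toy_movies['genres'])
print(genre_mat.loc[1].tolist())
```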


##### Save the pre-processing result to a CSV file for further use

In [52]:
movie_genres.to_csv('genre_data.csv')