### Foma Mironenko, <br>SPbU, Faculty of Mathematics and Mechanics,<br>431

# Clusterization methods, *Part II*

### In this part we load the pre-processed data from *Part I* from `.csv` files and try several clusterization methods on it.

In [1]:
#----- data handling -----#
import pandas as pd
import numpy as np
from tqdm import tqdm

In [2]:
#----- clustering -----#
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.cluster import OPTICS
from sklearn.cluster import cluster_optics_dbscan

# Cluster user data

##### A NumPy array is stored in a csv cell as a string, so it must be decoded back into an array of numbers before use.

In [3]:
user_marks = pd.read_csv('rates_data.csv');

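# Each cell holds a bracketed, space-separated list of numbers:
# arr[1:-1] drops the brackets and np.fromstring parses the rest.
user_marks['n_genres'] = user_marks['n_genres'].apply(
    lambda arr: np.fromstring(arr[1:-1], sep=' ', dtype=int))
user_marks['mean_rates'] = user_marks['mean_rates'].apply(
    lambda arr: np.fromstring(arr[1:-1], sep=' ', dtype=float))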

# user_marks['n_genres'] = user_marks['n_genres'] / (1 + user_marks['n_genres'].apply(lambda arr: np.sum(arr)));
# user_marks['n_genres'] = 10 * user_marks['n_genres'];
user_marks.head(5)

| Unnamed: 0 | uId | n_genres | mean_rates |
|---|---|---|---|
| 0 | 1.0 | [4, 11, 2, 3, 23, 8, 1, 53, 5, 1, 1, 5, 4, 18,... | [3.3, 3.41666667, 2.66666667, 2.875, 3.7083333... |
| 1 | 2.0 | [66, 75, 17, 25, 63, 18, 0, 91, 29, 0, 3, 11, ... | [3.64179104, 3.85526316, 3.41666667, 3.5192307... |
| 2 | 3.0 | [334, 198, 50, 48, 176, 132, 3, 232, 78, 5, 45... | [3.62985075, 3.67085427, 3.90196078, 3.6326530... |
| 3 | 4.0 | [145, 114, 31, 28, 81, 37, 5, 49, 39, 0, 10, 7... | [3.16438356, 3.04782609, 3.359375, 3.10344828,... |
| 4 | 5.0 | [18, 21, 4, 9, 49, 14, 0, 45, 8, 0, 3, 7, 7, 2... | [3.52631579, 3.68181818, 3.0, 3.0, 3.5, 3.8666... |
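
The decoding step can be checked on a toy array without touching the real file. The sketch below is an illustration only: `decode_array` is a hypothetical helper (not part of the notebook) that does the same job as the `np.fromstring` lambdas with plain `str.split`, and it assumes the cell was produced by `str()` on a short 1-D array. Arrays longer than NumPy's default print threshold are abbreviated with `...` when converted to text and could not be restored this way.

```python
import numpy as np

def decode_array(cell, dtype=float):
    """Turn a string such as '[ 4 11  2]' back into a 1-D NumPy array."""
    # Drop the surrounding brackets, then split on any run of whitespace.
    return np.array(cell.strip().strip('[]').split(), dtype=dtype)

# Round trip on made-up values: array -> text (as stored in the csv) -> array.
original = np.array([4, 11, 2, 3, 23])
as_text = str(original)
restored = decode_array(as_text, dtype=int)

print(as_text)
print(restored)
```
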


##### ``genre_batch`` --- the number of films of each genre rated by a user<br> ``rates_batch`` --- the average rating the user gave to each genre

In [4]:
genre_batch = user_marks['n_genres'].to_numpy();
genre_batch = np.stack(genre_batch);
rates_batch = user_marks['mean_rates'].to_numpy();
rates_batch = np.stack(rates_batch);
# batch1 = np.concatenate((genre_batch, rates_batch), axis=1);
batch1 = genre_batch;
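
As a sanity check on the shapes involved, the following sketch repeats the stacking on a made-up frame with two users and three genres. All names prefixed with `toy` and all values are assumptions for illustration, not the real data. It shows that `to_numpy()` on such a column yields a 1-D object array of arrays, that `np.stack` turns it into a regular `(n_users, n_genres)` matrix, and what the commented-out concatenation would produce.

```python
import numpy as np
import pandas as pd

# Toy stand-in for user_marks: two users, three genres (values are made up).
toy = pd.DataFrame({
    'n_genres':   [np.array([4, 11, 2]), np.array([66, 75, 17])],
    'mean_rates': [np.array([3.3, 3.4, 2.7]), np.array([3.6, 3.9, 3.4])],
})

# to_numpy() gives a 1-D object array whose elements are themselves arrays ...
raw = toy['n_genres'].to_numpy()
# ... and np.stack joins them into a regular 2-D matrix, one row per user.
toy_genre_batch = np.stack(raw)
toy_rates_batch = np.stack(toy['mean_rates'].to_numpy())

# The concatenated variant places counts and mean ratings side by side.
toy_both = np.concatenate((toy_genre_batch, toy_rates_batch), axis=1)

print(raw.shape, toy_genre_batch.shape, toy_both.shape)
```

Note that with the concatenated variant the raw counts and the mean ratings live on very different scales, so distance-based methods would be dominated by the counts unless the features are rescaled.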

### **KMeans**

In [11]:
processor1 = KMeans(n_clusters=18, init='k-means++');
processor1.fit(batch1);

In [12]:
labels_kmeans = processor1.labels_;
cluster_volumes = [sum(labels_kmeans == i) for i in range(max(labels_kmeans) + 1)]
print(cluster_volumes)

[144, 553, 2, 40, 264, 9, 3364, 94, 27, 15, 7, 113, 152, 30, 24, 411, 1, 1497]


### **DBSCAN**

In [13]:
processor2 = DBSCAN(eps=100, min_samples=2, metric='manhattan');
processor2.fit(batch1);

In [14]:
labels_dbscan = processor2.labels_;
cluster_volumes = [sum(labels_dbscan == i) for i in range(max(labels_dbscan) + 1)]
print(cluster_volumes)

[5523, 4, 2, 2, 4, 3, 2, 2, 2, 2, 2, 2, 3, 2, 4, 4, 2, 2, 2, 3, 2, 3, 2, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, 2]
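
The list comprehension above starts counting at label `0`, so points that scikit-learn's `DBSCAN` marks as noise (label `-1`) do not appear in the printed list. A minimal sketch of a count that keeps the noise label is given below; `cluster_sizes` is a hypothetical helper and the labels are a toy example, not the notebook's output.

```python
import numpy as np

def cluster_sizes(labels):
    """Map every label, including -1 (noise), to the number of points carrying it."""
    values, counts = np.unique(labels, return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts)}

# Toy labels: two clusters (0 and 1) and three noise points (-1).
toy_labels = np.array([0, 0, -1, 1, 0, -1, 1, -1])
sizes = cluster_sizes(toy_labels)
n_noise = sizes.get(-1, 0)

print(sizes)
print("noise points:", n_noise)
```

Applied to `labels_dbscan` or `labels_optics`, the same call would report how many users were left unassigned.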


### **OPTICS**

In [15]:
processor3 = OPTICS(min_samples=3, max_eps=200, metric='manhattan');
processor3.fit(batch1);

In [16]:
labels_optics = cluster_optics_dbscan(
            reachability   = processor3.reachability_,
            core_distances = processor3.core_distances_,
            ordering       = processor3.ordering_,
            eps            = 100
);

In [17]:
cluster_volumes = [sum(labels_optics == i) for i in range(max(labels_optics) + 1)]
print(cluster_volumes)

[5523, 2, 3, 3, 3, 2, 3, 3, 3, 1, 4, 3, 2]


##### None of the algorithms produced a satisfactory result. ``OPTICS`` and ``DBSCAN`` both found a single huge cluster (``n = 5523``) and a number of tiny side clusters with ``n <= 4``. ``K-means``, in contrast, found two huge clusters ``(n = 3364, n = 1497)``, several mid-sized clusters ``(100 < n < 600)`` and a number of small ones ``(n < 100)``.
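
To put these sizes in proportion, the sketch below computes what share of the users falls into the dominant clusters. The numbers are copied from the printed outputs above; the variable names are assumptions introduced for this illustration. The user count is taken as the sum of the `KMeans` cluster sizes, which is valid because `KMeans` assigns every point to a cluster.

```python
# Cluster sizes copied from the printed KMeans output.
kmeans_volumes = [144, 553, 2, 40, 264, 9, 3364, 94, 27, 15, 7,
                  113, 152, 30, 24, 411, 1, 1497]
# Size of the largest cluster reported by both DBSCAN and OPTICS.
largest_density_cluster = 5523

# KMeans leaves no point unassigned, so its sizes add up to the number of users.
n_users = sum(kmeans_volumes)

share_density = largest_density_cluster / n_users
top_two = sorted(kmeans_volumes, reverse=True)[:2]
share_kmeans_top_two = sum(top_two) / n_users

print(f"users: {n_users}")
print(f"largest DBSCAN/OPTICS cluster: {share_density:.1%}")
print(f"two largest KMeans clusters:   {share_kmeans_top_two:.1%}")
```
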

# Cluster films data

##### Parse strings from csv

In [33]:
movie_genres = pd.read_csv('genre_data.csv');
movie_genres['genre_vec'] = movie_genres['genre_vec'].apply(
    lambda arr: np.fromstring(arr[1:-1],  sep=' ', dtype=int));
movie_genres.head(5)

| Unnamed: 0 | mId | name | genre_vec |
|---|---|---|---|
| 0 | 0 | Toy Story (1995) | [0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ... |
| 1 | 1 | Jumanji (1995) | [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ... |
| 2 | 2 | Grumpier Old Men (1995) | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ... |
| 3 | 3 | Waiting to Exhale (1995) | [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, ... |
| 4 | 4 | Father of the Bride Part II (1995) | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |


In [34]:
batch2 = movie_genres['genre_vec'].to_numpy();
batch2 = np.stack(batch2);

### **DBSCAN**

In [35]:
processor2 = DBSCAN(eps=3, min_samples=100, metric='manhattan');
processor2.fit(batch2);
labels_dbscan = processor2.labels_;

In [37]:
cluster_volumes = [sum(labels_dbscan == i) for i in range(max(labels_dbscan) + 1)]
cluster_volumes

[62422]