# DBSCAN
This notebook is split into three sections, one for each DBSCAN implementation:
scikit-learn, PyCaret, and our own implementation based on the slides.
The idea is to cluster movies and users so that, for either one, we can
recommend the entire cluster it belongs to.

Compared to the other algorithms, this means that we cannot predict
how much a user will like a given movie; we can only return
the cluster they reside in.
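
As a rough illustration of recommending "the entire cluster", the sketch below looks up every other item that shares a label with a given item. The helper name `recommend_from_cluster` and the toy labels are hypothetical and not part of this notebook; the only assumption carried over from scikit-learn is that noise points are labelled `-1`.

```python
import numpy as np


def recommend_from_cluster(labels, index):
    """Return the indices of all other items in the same cluster as `index`."""
    labels = np.asarray(labels)
    cluster = labels[index]
    if cluster == -1:
        # Noise points belong to no cluster, so there is nothing to recommend.
        return np.array([], dtype=int)
    members = np.flatnonzero(labels == cluster)
    return members[members != index]


# Hypothetical labels for six items; -1 marks noise.
example_labels = [0, 0, 1, -1, 0, 1]
print(recommend_from_cluster(example_labels, 0).tolist())  # [1, 4]
print(recommend_from_cluster(example_labels, 3).tolist())  # []
```

A noise point gets an empty recommendation here; falling back to a different recommender for those items would be one option.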

## Initialise PySpark and data

In [1]:
from pyspark.ml.linalg import DenseVector
from pyspark.mllib.random import RandomRDDs
import pyspark.sql
import pyspark
from pyspark import SparkContext, SparkConf, SQLContext
from sklearn.cluster import DBSCAN
import os
import numpy as np

from density.slides_dbscan import my_DBSCAN

if os.path.basename(os.getcwd()) == 'density':
    print("Current dir is", os.getcwd())
    print("Changing dir to be in root")
    os.chdir('..')
    print('now in', os.getcwd())

SPARK_CONF = SparkConf()
SPARK_CONF.set("spark.driver.memory", "10g")
SPARK_CONF.set("spark.cores.max", "4")
SPARK_CONF.set("spark.executor.heartbeatInterval", "3600")
SPARK_CONF.setAppName("word2vec")

SPARK_CONTEXT = SparkContext.getOrCreate(SPARK_CONF)
SPARK = SQLContext(SPARK_CONTEXT)

Current dir is E:\Users\nicol\Documents\GitHub\3804ICT-Data-mining-try-4\density
Changing dir to be in root
now in E:\Users\nicol\Documents\GitHub\3804ICT-Data-mining-try-4


In [2]:
# TODO: load the real data.
# Placeholders until loading is implemented; both will hold Spark DataFrames.
MOVIES: pyspark.sql.DataFrame = 1  # placeholder
USERS: pyspark.sql.DataFrame = 1  # placeholder

# Until then, use random matrices as stand-in data so the rest of the notebook runs
size = 1000
np.random.seed(42)
MOVIES_SIMILARITY_MATRIX = np.random.rand(size, size)
USERS_SIMILARITY_MATRIX = np.random.rand(size, size)

# To keep the comparison fair, every implementation uses the same parameters, defined up front
MOVIES_RADIUS = 0.001
MOVIES_MINIMUM_POINTS = 4

USERS_RADIUS = 0.001
USERS_MINIMUM_POINTS = 4

In [3]:
print(MOVIES_SIMILARITY_MATRIX)

[[0.37454012 0.95071431 0.73199394 ... 0.13681863 0.95023735 0.44600577]
 [0.18513293 0.54190095 0.87294584 ... 0.06895802 0.05705472 0.28218707]
 [0.26170568 0.2469788  0.90625458 ... 0.30978786 0.29004553 0.87141403]
 ...
 [0.08526213 0.09203751 0.69438213 ... 0.93032574 0.91330033 0.17538921]
 [0.76021961 0.96382646 0.00477478 ... 0.33470457 0.40011529 0.95614839]
 [0.11429115 0.15980354 0.82572709 ... 0.41807198 0.42867126 0.92944855]]
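
One caveat when using `metric='precomputed'`: scikit-learn's `DBSCAN` interprets the input as a square distance matrix, so small values mean "close". If the real matrices hold similarities, they would need converting first. The sketch below shows one possible conversion, assuming similarities lie in [0, 1] with 1 meaning identical; the helper `similarity_to_distance` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np


def similarity_to_distance(similarity):
    """Turn a similarity matrix in [0, 1] into a symmetric distance matrix."""
    similarity = np.asarray(similarity, dtype=float)
    # Average with the transpose so that d(i, j) == d(j, i).
    symmetric = (similarity + similarity.T) / 2.0
    # Assumption: similarity 1 means identical, so distance = 1 - similarity.
    distance = 1.0 - symmetric
    # Every point is at distance 0 from itself.
    np.fill_diagonal(distance, 0.0)
    return distance


rng = np.random.default_rng(42)
toy_similarity = rng.random((5, 5))
toy_distance = similarity_to_distance(toy_similarity)
print(toy_distance.round(3))
```

With this convention `eps` becomes a maximum distance, so a radius such as 0.001 would correspond to a similarity of at least 0.999.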


## Scikit DBSCAN
[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN)
### Implementation

In [4]:
scikit_movies_clustering = DBSCAN(
    eps=MOVIES_RADIUS, min_samples=MOVIES_MINIMUM_POINTS, metric='precomputed', n_jobs=-1
).fit(MOVIES_SIMILARITY_MATRIX)

# Fit the users model on the users matrix only
scikit_users_clustering = DBSCAN(
    eps=USERS_RADIUS, min_samples=USERS_MINIMUM_POINTS, metric='precomputed', n_jobs=-1
).fit(USERS_SIMILARITY_MATRIX)

In [5]:
print(set(scikit_movies_clustering.labels_))
print(set(scikit_users_clustering.labels_))

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, -1}
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, -1}
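
The label sets printed above mix cluster ids with `-1`, which scikit-learn uses for noise. A small summary helper can make the result easier to read; `summarise_labels` below is a hypothetical name, shown here on toy labels rather than the notebook's data.

```python
import numpy as np


def summarise_labels(labels):
    """Count clusters and noise points in a DBSCAN label array."""
    labels = np.asarray(labels)
    clustered = labels[labels != -1]  # drop noise before counting clusters
    return {
        "n_clusters": int(np.unique(clustered).size),
        "n_noise": int(np.sum(labels == -1)),
        "n_points": int(labels.size),
    }


print(summarise_labels([0, 0, 1, -1, -1, 2]))
# {'n_clusters': 3, 'n_noise': 2, 'n_points': 6}
```

The same function could be applied to `scikit_movies_clustering.labels_` to see how many points ended up as noise.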


## PyCaret DBSCAN

## Slides
### Implementation
Our `my_DBSCAN` class inherits from scikit-learn's `DBSCAN` and only replaces the `fit` method, so we
should be able to follow the same process here as in the scikit-learn section.
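
To illustrate the inherit-and-override pattern, here is a minimal sketch of what such a class could look like. This is not the actual `my_DBSCAN` from `density.slides_dbscan`; the class name `SketchDBSCAN` is hypothetical, and the sketch only handles `metric='precomputed'`, where `X[i, j]` is the distance from point `i` to point `j`.

```python
import numpy as np
from sklearn.cluster import DBSCAN


class SketchDBSCAN(DBSCAN):
    """Hypothetical example: reuse DBSCAN's constructor, replace only fit."""

    def fit(self, X, y=None, sample_weight=None):
        # Assumption: X is a precomputed square distance matrix.
        distances = np.asarray(X)
        n_points = distances.shape[0]
        # Neighbourhood of each point: everything within eps (itself included
        # when the diagonal is 0).
        neighbours = [np.flatnonzero(distances[i] <= self.eps) for i in range(n_points)]
        is_core = np.array([len(n) >= self.min_samples for n in neighbours])

        labels = np.full(n_points, -1, dtype=int)  # -1 marks noise
        cluster = 0
        for start in range(n_points):
            if labels[start] != -1 or not is_core[start]:
                continue  # only unlabelled core points seed a new cluster
            labels[start] = cluster
            stack = [start]
            while stack:
                point = stack.pop()
                if not is_core[point]:
                    continue  # border points join a cluster but do not expand it
                for other in neighbours[point]:
                    if labels[other] == -1:
                        labels[other] = cluster
                        stack.append(other)
            cluster += 1

        self.labels_ = labels
        self.core_sample_indices_ = np.flatnonzero(is_core)
        return self


# Toy distance matrix: three mutually close points and one far away.
toy = np.array([
    [0.0, 0.1, 0.1, 9.0],
    [0.1, 0.0, 0.1, 9.0],
    [0.1, 0.1, 0.0, 9.0],
    [9.0, 9.0, 9.0, 0.0],
])
model = SketchDBSCAN(eps=0.5, min_samples=3, metric='precomputed').fit(toy)
print(model.labels_.tolist())  # [0, 0, 0, -1]
```

Because the constructor is inherited unchanged, the call site looks the same as for scikit-learn's `DBSCAN`; parameters such as `n_jobs` are accepted but ignored by this sketch.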

In [24]:
my_dbscan_movies_clustering = my_DBSCAN(
    eps=MOVIES_RADIUS, min_samples=MOVIES_MINIMUM_POINTS, metric='precomputed', n_jobs=-1
).fit(MOVIES_SIMILARITY_MATRIX)

my_dbscan_users_clustering = my_DBSCAN(
    eps=USERS_RADIUS, min_samples=USERS_MINIMUM_POINTS, metric='precomputed', n_jobs=-1
).fit(USERS_SIMILARITY_MATRIX)

In [25]:
print(set(my_dbscan_movies_clustering.labels_))
print(set(my_dbscan_users_clustering.labels_))

{0.0, 1.0, 2.0, 3.0, -1.0}
{0.0, 1.0, 2.0, -1.0}
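
Comparing sets of labels only shows how many clusters each implementation found, not whether they group the same points together. One way to compare two labelings directly is the adjusted Rand index from `sklearn.metrics`. The helper `compare_clusterings` below is a hypothetical sketch on toy labels; it casts to `int` because the labels printed above are floats, and it treats noise (`-1`) as its own group, which is a simplification.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def compare_clusterings(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same points."""
    a = np.asarray(labels_a).astype(int)  # float labels such as 0.0 become 0
    b = np.asarray(labels_b).astype(int)
    return adjusted_rand_score(a, b)


# Same grouping with different cluster ids and float labels: perfect agreement.
print(compare_clusterings([0, 0, 1, -1], [1.0, 1.0, 0.0, -1.0]))  # 1.0
```

A score of 1.0 means both labelings induce the same partition regardless of how the clusters are numbered; values near 0 indicate agreement no better than chance.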
