# Galaxy cluster identification

We have already studied several unsupervised machine learning algorithms for solving the clustering problem. In this notebook, we will choose a clustering algorithm and tune its hyperparameters to get the best result.

In this task, we will be dealing with real data from the *Sloan Digital Sky Survey*. 

In [None]:
# Let's import all the libraries and tools we'll need

import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import adjusted_rand_score

### Sloan Digital Sky Survey (SDSS)

SDSS is a large-scale survey that collects multispectral images and redshift spectra of stars and galaxies using a 2.5-meter wide-angle telescope at Apache Point Observatory in New Mexico. The project is named after the Alfred P. Sloan Foundation.

Observations began in 2000. Over the course of the project, more than 35% of the celestial sphere was mapped, with photometric observations of about 500 million objects and spectra for more than 3 million objects. The average redshift of the imaged galaxies was 0.1; redshifts reach up to *z = 0.4* for bright red galaxies and up to *z = 5* for quasars.



### Dataset

The dataset consists of 4717 observations (galaxies). For each galaxy we know three features and a ground-truth label:

- **ra** - the right ascension angle, which ranges from 0 to 360 degrees;
- **dec** - the declination angle, which ranges from -90 to 90 degrees;
- **z** - the redshift of the galaxy, which is positive;
- **iGrID** - the ground-truth cluster label of the galaxy, which we want to recover.
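The value ranges listed above can be checked with a few lines of pandas. The sketch below is not part of the original notebook: the helper name `check_ranges` is hypothetical, and the `toy` rows are made up for illustration rather than taken from the dataset.

```python
import pandas as pd

def check_ranges(df):
    """Return one boolean per documented range; True means every row complies."""
    return {
        "ra in [0, 360]": bool(df["ra"].between(0, 360).all()),
        "dec in [-90, 90]": bool(df["dec"].between(-90, 90).all()),
        "z > 0": bool((df["z"] > 0).all()),
    }

# Made-up rows for illustration only; these are NOT values from galaxies.csv
toy = pd.DataFrame({
    "ra": [10.5, 200.0, 359.9],
    "dec": [-45.0, 0.0, 89.0],
    "z": [0.05, 0.10, 0.30],
    "iGrID": [1, 1, 2],
})

checks = check_ranges(toy)
print(checks)
```

Once the real data is loaded, the same call could be applied to `data` to confirm that it matches the description.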

The two angles give the position of a galaxy on the celestial sphere, and we treat the redshift as a proxy for the distance from the observer to the galaxy.
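One way to make the distance analogy concrete is to convert each galaxy to 3-D Cartesian coordinates, using the redshift as a radius. This is a sketch rather than the notebook's method: the function name `to_cartesian` is hypothetical, and using z directly as a radial distance assumes that distance is roughly proportional to redshift, which holds only approximately at low z.

```python
import numpy as np

def to_cartesian(ra_deg, dec_deg, z):
    """Map (ra, dec, z) to Cartesian (x, y, w), treating z as a radial distance proxy."""
    ra = np.radians(np.asarray(ra_deg, dtype=float))
    dec = np.radians(np.asarray(dec_deg, dtype=float))
    r = np.asarray(z, dtype=float)  # assumption: redshift stands in for distance
    x = r * np.cos(dec) * np.cos(ra)
    y = r * np.cos(dec) * np.sin(ra)
    w = r * np.sin(dec)
    return np.column_stack([x, y, w])

# Two made-up galaxies on either side of the ra = 0/360 wrap-around
demo = to_cartesian([359.5, 0.5], [10.0, 10.0], [0.1, 0.1])
separation = np.linalg.norm(demo[0] - demo[1])
print(demo.shape, separation)
```

In this representation the two demo galaxies end up close together, whereas a Euclidean distance on the raw `ra` values would place them almost 360 degrees apart. Whether this helps the clustering score on the real data would need to be checked experimentally.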

In [None]:
url = 'https://raw.githubusercontent.com/HSE-LAMBDA/ML-IDS-private/main/1_course/7_US/galaxies.csv?token=ANW5UZ2ZZHEYPTEWLIU7BATBRKO3G'
data = pd.read_csv(url)

In [None]:
data.head()

In [None]:
X = data[['ra','dec','z']]  # the data to detect galaxy clusters 
y = data.iGrID              # ground truth labels to compare results of clustering

In [None]:
# Let's visualize the positions of the galaxies on the (ra, dec) plane

plt.figure(figsize=(14,7))

plt.scatter(X.ra, X.dec, s=1)
plt.xlim((0,360))
plt.ylim((-90,90))
plt.xlabel('Right ascension')
plt.ylabel('Declination')
plt.title('The celestial sphere projected onto a plane\n');

### Baseline

**K-means**

In [None]:
algorithm = KMeans(n_clusters=500, random_state=42)
pred = algorithm.fit_predict(X)

print('ARI:', round(adjusted_rand_score(y,pred),2))

**Agglomerative Clustering**

In [None]:
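# Euclidean distance is the default, so the metric argument is omitted
# (the `affinity` keyword was removed in recent scikit-learn releases)
algorithm = AgglomerativeClustering(n_clusters=1000, linkage='single')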
pred = algorithm.fit_predict(X)

print('ARI:', round(adjusted_rand_score(y,pred),2))

**DBSCAN**

In [None]:
algorithm = DBSCAN(eps=1.45, min_samples=3, metric='euclidean')

pred = algorithm.fit_predict(X)

print('ARI:', round(adjusted_rand_score(y,pred),2))

The best baseline score comes from DBSCAN: **0.82**.

## Task
Tune the hyperparameters of the algorithms to improve the clustering result.
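A possible starting point for the tuning is a plain grid search that scores every parameter combination with ARI against the ground-truth labels. The sketch below is only one option: the helper name `grid_search_dbscan` and the parameter grids are hypothetical, and it runs on synthetic blobs from `make_blobs` as a stand-in for the galaxy data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

def grid_search_dbscan(X, y_true, eps_values, min_samples_values):
    """Try every (eps, min_samples) pair and keep the one with the highest ARI."""
    best = {"ari": float("-inf"), "eps": None, "min_samples": None}
    for eps in eps_values:
        for min_samples in min_samples_values:
            pred = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            ari = adjusted_rand_score(y_true, pred)
            if ari > best["ari"]:
                best = {"ari": ari, "eps": float(eps), "min_samples": min_samples}
    return best

# Synthetic stand-in data (assumption: four well-separated blobs, not the galaxy dataset)
X_demo, y_demo = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=42,
)

best = grid_search_dbscan(X_demo, y_demo, np.linspace(0.2, 2.0, 10), [3, 5, 10])
print(best)
```

For the real task, the same function could be called with `X` and `y` from the cells above and a grid centered on the baseline values (`eps=1.45`, `min_samples=3`). The same loop pattern applies to `n_clusters` for K-means and agglomerative clustering.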