## Notebook 2: Clustering

Welcome! This notebook applies clustering to separate AGN from SFGs, uncovering groupings inherent in the data and validating them against known labels.


---

## Introduction

### Brief overview of AGN and SFGs

Galaxies hosting an Active Galactic Nucleus (AGN) emit a significant fraction of their luminosity from the region around a central supermassive black hole, often outshining the emission from the rest of the galaxy. This intense radiation arises from the gravitational energy released as matter accretes onto the black hole. Star-Forming Galaxies (SFGs), on the other hand, are dominated by star formation processes. Their radio emission comes primarily from synchrotron radiation produced by relativistic electrons, together with free-free emission from HII regions. Distinguishing between the two is crucial for understanding galaxy formation and evolution.

### What is Clustering

Clustering is an unsupervised machine learning method in which data points are grouped into subsets, or "clusters", based on their similarity, without any prior labels for the groups. The goal is for data points in the same cluster to be more similar to each other than to points in other clusters. Common applications include customer segmentation, image segmentation, and anomaly detection. Which technique to use (for example K-Means, hierarchical clustering, or DBSCAN) depends on the nature of the data and the requirements of the task.
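
To make the idea concrete, here is a minimal sketch on synthetic data (not the MIGHTEE catalogue). The two blobs, their positions, and the names `toy_points`, `toy_kmeans`, and `toy_labels` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic toy data (not MIGHTEE): two well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
toy_points = np.vstack([blob_a, blob_b])

# Fit K-Means with two clusters; no labels are supplied to the model.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(toy_points)

# Count how many points fall in each cluster.
# Which blob gets index 0 and which gets index 1 is arbitrary.
print(np.bincount(toy_labels))
```

Note that the algorithm only returns cluster indices. Deciding which cluster corresponds to which physical class is a separate step.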

### Objective

Our primary aim in this notebook is to use unsupervised learning techniques to cluster the data and separate AGN from SFGs. By analyzing their features and patterns, we hope to reveal the characteristics that set these two populations apart.



---

### Loading the data

In [None]:
import pickle

# Loading data from pickle file
with open("mightee3Feat.pkl", "rb") as file:
    mightee_data = pickle.load(file)

print(mightee_data)  

---

### Clustering using KMeans

#### Initializing KMeans with Two Clusters
We set up the KMeans algorithm to identify two clusters, intended to represent the Active Galactic Nuclei (AGN) and the Star-Forming Galaxies (SFGs). This configuration segments the data into two groups based on the input features.



In [None]:
k_optimal = 2

#### Training KMeans

In [None]:
from sklearn.cluster import KMeans

In [None]:
kmeans = KMeans(n_clusters=k_optimal, random_state=1912)
# Fit on the two selected features and get a cluster index (0 or 1) for each sample
clusters = kmeans.fit_predict(mightee_data[['Mstar', 'qir']])

Now, let us count the number of samples in each of the two clusters.

In [None]:
import numpy as np

In [None]:
print(np.sum(clusters == 1))  # number of samples in cluster 1
print(np.sum(clusters == 0))  # number of samples in cluster 0

### Visualisation

In [None]:
import matplotlib.pyplot as plt 


In [None]:
mightee_data['Cluster'] = clusters

# Plot the two features the model was trained on, so that the points
# and the centroids share the same axes.
mightee_array = mightee_data[['Mstar', 'qir']].values

plt.scatter(mightee_array[clusters == 0, 0], mightee_array[clusters == 0, 1], label='Cluster 0')
plt.scatter(mightee_array[clusters == 1, 0], mightee_array[clusters == 1, 1], label='Cluster 1')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', label='Centroids')
plt.xlabel('Mstar')
plt.ylabel('qir')
plt.legend()
plt.show()

We advise removing the outliers for a better visualization.
Good luck!
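
One possible way to follow the advice above is to keep only the points that lie within a percentile range on every axis. The sketch below uses synthetic data; `inlier_mask` is a hypothetical helper, and the 1st/99th percentile cuts are an arbitrary assumption that you may want to tune.

```python
import numpy as np

def inlier_mask(values, lower_pct=1.0, upper_pct=99.0):
    """Boolean mask of rows whose every column lies within the given percentiles.

    Hypothetical helper; the default 1st/99th percentile cuts are an arbitrary choice.
    """
    values = np.asarray(values, dtype=float)
    low = np.percentile(values, lower_pct, axis=0)
    high = np.percentile(values, upper_pct, axis=0)
    return np.all((values >= low) & (values <= high), axis=1)

# Synthetic stand-in for two feature columns, with one extreme point appended.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(size=(200, 2)), [[50.0, -50.0]]])

mask = inlier_mask(points)
print(points.shape[0], "points,", int(mask.sum()), "kept")
```

In this notebook the same mask could be applied to both the plotted array and `clusters` (for example `mightee_array[mask]` and `clusters[mask]`) so that only the plot changes, not the fitted model.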

---

### Testing

In [None]:
import pandas as pd

val = pd.read_pickle("data/val.pkl")

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [None]:
clusters_val = kmeans.predict(val[['Mstar', 'qir']])  # Predictions on the validation set from the trained KMeans model.

acc = accuracy_score(val.label, clusters_val)

# Cluster indices are arbitrary: if the accuracy is below 50%, the two cluster indices are swapped relative to the labels, so use 1 - acc.
print(f"Accuracy: {acc}")
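
Because K-Means numbers its clusters arbitrarily, cluster 0 may correspond to label 1 and vice versa. A minimal sketch of how to handle this is to score both possible mappings and keep the better one. It assumes the labels are encoded as 0 and 1, which the notebook does not state explicitly; `aligned_accuracy` is a hypothetical helper.

```python
import numpy as np

def aligned_accuracy(true_labels, cluster_ids):
    """Accuracy for binary labels after choosing the better cluster-to-label mapping.

    Hypothetical helper; assumes both inputs contain only 0 and 1.
    """
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    acc_direct = np.mean(true_labels == cluster_ids)       # cluster 0 -> label 0
    acc_flipped = np.mean(true_labels == 1 - cluster_ids)  # cluster 0 -> label 1
    return float(max(acc_direct, acc_flipped))

# Toy example: the clusters separate the classes perfectly but are numbered the other way round.
y_true = [0, 0, 1, 1, 1]
y_clusters = [1, 1, 0, 0, 0]
print(aligned_accuracy(y_true, y_clusters))
```

For two clusters this is equivalent to taking `max(acc, 1 - acc)`.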

---

### Hackathon Task

As a team, come up with the best unsupervised model that can separate AGN from SFGs.
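
As a possible starting point, the sketch below compares two of the clustering families mentioned in the introduction using the silhouette score, which needs no labels. It runs on synthetic data, and the candidate list, the names `toy_features`, `candidates`, and `scores`, and the choice of metric are all assumptions for illustration rather than a recommended solution.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

# Synthetic toy features (not MIGHTEE): two blobs of 60 points each.
rng = np.random.default_rng(1)
toy_features = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.6, size=(60, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.6, size=(60, 2)),
])

# Candidate unsupervised models, each asked for two clusters.
candidates = {
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=2),
}

# Silhouette score ranges from -1 to 1; higher means tighter, better-separated clusters.
scores = {}
for name, model in candidates.items():
    labels = model.fit_predict(toy_features)
    scores[name] = silhouette_score(toy_features, labels)
    print(f"{name}: silhouette = {scores[name]:.3f}")
```

A high silhouette score does not guarantee that the clusters match AGN and SFGs, so the validation labels remain the final check.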