# Notebook 1: Data Exploration

Welcome! The main goal of this notebook is to build a deep understanding of the dataset, its characteristics, and any hidden patterns.

---

## Introduction

#### Data Description

This project analyzes radio continuum data from the MeerKAT International GHz Tiered Extragalactic Exploration ([MIGHTEE](https://idia.ac.za/mightee/)) survey, focusing on extragalactic radio emissions in the COSMOS field. MIGHTEE offers insights into galaxy evolution, given its depth and multiwavelength coverage. The COSMOS field, devoid of bright radio, UV, and X-ray sources, is ideal for deep astronomical surveys.

The MIGHTEE data includes deep fields like COSMOS, XMM-LSS, E-CDFS and ELAIS-S1, covering a 20 deg^2 total area at a central frequency of 1284 MHz. These fields were observed using MeerKAT's L-band receivers for about 1000 hours. They were chosen due to their extensive multiwavelength data from prior and potential future surveys.

The **MIGHTEE-COSMOS multiwavelength catalogue** we will use here combines radio data with optical, near-to-far infrared, and X-ray measurements from various sources in the COSMOS field. The main radio catalogue was cross-matched with optical and near-infrared data from other studies, resulting in a matched set of 5,224 radio sources. Additionally, 572 of these sources were detected in X-ray observations. This dataset also incorporated Mid-Infrared (MIR) data, referencing specific IRAC flux densities. The catalogue further integrated far-infrared data from the Herschel Extra-galactic Legacy Project, identifying thousands of radio sources across different instruments.

#### Why explore the data?

Data exploration serves as the foundational step in data analysis, aimed at understanding inherent structures, patterns, and potential anomalies within a dataset. It helps ascertain the quality of data, understand features' distributions, formulate hypotheses, guide preprocessing decisions, and select appropriate analytical tools and models. Efficient data exploration also aids in communicating initial insights to stakeholders. In this notebook, we will perform some of these critical tasks.

<div align="center">
  <img src="images/cweb.jpg" width="400" height="400">
    <p><center>Cosmic Web</center></p>
</div>

---

### First we import some libraries:

In [None]:
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd 
from astropy.table import Table 
import astropy.io.fits as fits 


---

### Reading data

In [None]:
import pandas as pd

mightee_data = pd.read_pickle("data/mightee.pkl")

In [None]:
mightee_data.shape

From what we can observe, we have 144 features (columns) and 2824 rows. The following are the first 10 feature names:

In [None]:
list(mightee_data.columns[:10]) # printing the first 10 columns
# For feature description please check the following file @@

In [None]:
print(mightee_data.isna().any().any()) # checking if the dataframe has any missing values

If the cell above prints `False`, the dataframe has no missing values. That is because someone already cleaned it for you. You're welcome :D
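When a dataframe is not already clean, a single `True`/`False` does not say where the gaps are. The sketch below counts missing values per column on a small made-up dataframe (`toy` and its values are hypothetical, not taken from the MIGHTEE catalogue); the same two lines could be applied to `mightee_data`.

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for mightee_data (values are made up)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, np.nan, 9.0],
})

# Count missing values per column, then keep only the columns that have any
missing_per_column = toy.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```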

---

### Descriptive Statistics:

**Descriptive statistics** provide a concise overview of the central tendency, spread, and shape of the data distribution. By looking at the count, mean, standard deviation, minimum, maximum, and quartiles (the `50%` row is the median) that `describe()` reports, we can quickly grasp the characteristics of our dataset and identify potential anomalies or patterns.

In [None]:
desc_stats = mightee_data.describe()

# Display the statistics
desc_stats
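`describe()` does not report the mode or the skewness, so these have to be computed separately if they are needed. A minimal sketch on a made-up column (`toy` and the column name `x` are hypothetical):

```python
import pandas as pd

# Toy column with one large value to make the distribution right-skewed
toy = pd.DataFrame({"x": [1.0, 2.0, 2.0, 3.0, 10.0]})

summary = toy.describe()
median = toy["x"].median()        # same value as the "50%" row of describe()
mode = toy["x"].mode().iloc[0]    # most frequent value (first one if tied)
skewness = toy["x"].skew()        # positive when the right tail is longer

print(summary)
print(median, mode, skewness)
```

A mean that sits well above the median, as it does here, is a quick hint that a feature has a long right tail.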

### Data Normalization

From the descriptive statistics above, we can see that our features are on very different scales. Different scales can skew our data visualization and mask meaningful patterns. By normalizing, we rescale each feature to the range `[0,1]`, ensuring uniformity for better data visualization and insights.



In [None]:
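from sklearn.preprocessing import MinMaxScaler

# Rescale every column to [0, 1]; this assumes all columns are numeric
scaler = MinMaxScaler()
normalized_mightee = scaler.fit_transform(mightee_data)

# fit_transform returns a NumPy array, so restore the column names and row index
normalized_mightee = pd.DataFrame(
    normalized_mightee, columns=mightee_data.columns, index=mightee_data.index
)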

In [None]:
normalized_mightee.describe()
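With its default settings, `MinMaxScaler` applies `(x - min) / (max - min)` to each column. The sketch below writes that formula out by hand with pandas on a made-up dataframe (`toy`, `flux`, and `z` are hypothetical names and values), which can help when checking what the scaler did.

```python
import pandas as pd

# Toy features on very different scales (values are made up)
toy = pd.DataFrame({"flux": [2.0, 4.0, 10.0], "z": [0.1, 0.6, 1.1]})

# Per-column minimum and maximum
col_min = toy.min()
col_max = toy.max()

# Min-max scaling: every column ends up spanning exactly [0, 1]
scaled = (toy - col_min) / (col_max - col_min)
print(scaled)
```

Note that this hand-written version yields `NaN` for a constant column, because the denominator is zero.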

---

### Feature selection

Having well over a hundred features can be overwhelming for visualization and clustering. To simplify our dataset and remove redundant or highly correlated features, we can use correlation analysis. By examining how features correlate with one another, we can identify and remove those that carry similar information, thus streamlining our dataset for more effective analysis.

In [None]:
correlation_matrix = mightee_data.corr()
upper_tri = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))

# Find feature columns whose correlation with an earlier column is greater than 0.80
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.80)]

mightee_data.drop(columns=to_drop, inplace=True)

In [None]:
mightee_data.shape
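To see what the upper-triangle filter does, here is the same logic on a small synthetic dataframe. The column names (`x`, `x_copy`, `noise`) and the data are made up for illustration; only the 0.80 threshold is taken from the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_copy": 2.0 * x + 1.0,         # perfectly correlated with x
    "noise": rng.normal(size=200),   # generated independently of x
})

corr = toy.corr()

# Keep only the entries above the diagonal so each pair is checked once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it correlates strongly with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.80).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Because only the upper triangle is inspected, the first column of each correlated pair is kept and the later one is dropped. The comparison is on the signed correlation, so strongly anti-correlated features are kept; comparing `upper.abs()` instead would catch those too, if that is what you want.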

---

### Data Visualisation

Well, I will let you explore all 46 features, but I am going to explore only 3:

- qir
- Mstar
- COS_best_z_v5

In [None]:
import seaborn as sns

# Settings for better visualization
sns.set(font_scale=1)
sns.set_style('ticks')

# Using pairplot with hue as 'COS_best_z_v5'
sns.pairplot(mightee_data[['Mstar', 'qir', 'COS_best_z_v5']], hue='COS_best_z_v5')
plt.show()

---

### Saving the data

Now we will save the three features above so that we can use them for clustering in the next notebook.

In [None]:
import pickle

with open("mightee3Feat.pkl", "wb") as file:
    pickle.dump(mightee_data[['Mstar', 'qir', 'COS_best_z_v5']], file)
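A pickled dataframe can be read back with `pd.read_pickle`, the same call used at the top of this notebook. The sketch below is a round trip on a made-up dataframe written to a temporary directory; the file name `toy3Feat.pkl` and the values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the three selected features (values are made up)
toy = pd.DataFrame({
    "Mstar": [10.2, 11.0],
    "qir": [2.4, 2.1],
    "COS_best_z_v5": [0.5, 1.3],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy3Feat.pkl")  # hypothetical file name
    toy.to_pickle(path)                # write the dataframe to disk
    restored = pd.read_pickle(path)    # read it back

# The restored dataframe has the same values, columns, and index
print(restored.equals(toy))
```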