# **Hunting for exoplanets using Machine Learning**
*  **[Use machine learning to hunt for exoplanets](https://www.youtube.com/watch?v=y1k2jc3YTeg&list=PL7HQvd_RTCc3Vope7dkx4pggrH5f-uvZe)**.

* Method used for hunting exoplanets: **[Transit Photometry](https://www.planetary.org/articles/down-in-front-the-transit-photometry-method)**

* Dataset used: **[Kepler Space Telescope Dataset](https://www.kaggle.com/datasets/keplersmachines/kepler-labelled-time-series-data)**

In [None]:
# Import required libraries

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')

## Analysing the dataset ##

In [None]:
# Read training dataset (CSV file) into a dataframe

train_df = pd.read_csv("/kaggle/input/kepler-labelled-time-series-data/exoTrain.csv")
train_df.head()

In [None]:
# Shape of the training dataframe

train_df.shape

## Check for missing values ##

In [None]:
# Display the rows with null values in the dataframe

train_df[train_df.isnull().any(axis = 1)]

In [None]:
# We can visualize null values using a heatmap as well
# Display null values in training dataframe
# In this case, it'll return a blank plot as there are no missing values

sns.heatmap(train_df.isnull())

## Decoding the labels in the dataset ##

In [None]:
# Check the number of labels in the train dataframe

train_df['LABEL'].unique()

In [None]:
# Extract indexes of stars with exoplanets

train_df[train_df['LABEL'] == 2].index

* Only rows 0 to 36 have label 2, which indicates that only 37 stars in the train data have exoplanets.
* The distribution of labels can also be visualized using a countplot.

In [None]:
# Visualize distribution of both labels using countplot

plt.figure(figsize = (3, 5))
ax = sns.countplot(x = 'LABEL', data = train_df)
ax.bar_label(ax.containers[0])

* The train data is highly imbalanced as of now. We will work on both balanced and imbalanced data and compare the results.

## Replacing the labels ##
Replace the labels for ease of working:

* Stars with exoplanets: 2 -> 1
* Stars without exoplanets: 1 -> 0

In [None]:
#Replacing the labels 2 and 1 with 1 and 0 respectively

train_df = train_df.replace({"LABEL" : {2 : 1, 1: 0}})
train_df.LABEL.unique()

## Visualizing the light curves in the data ##

When an exoplanet passes between the telescope and the star, the flux value of the star decreases, which causes a dip in the light curve of the star. In other words, if we plot the flux values of a particular star and the light curve follows a pattern where the flux initially decreases, remains constant and then increases over time, this can hint at the star being a candidate with an exoplanet.
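To make the shape of such a dip concrete, here is a minimal sketch that builds a synthetic light curve with one box-shaped transit. The function name `box_transit` and all of its parameters (start, duration, depth) are made up for illustration and are not taken from the Kepler data; a real transit also has gradual ingress and egress rather than vertical edges.

```python
import numpy as np

def box_transit(n_points=3197, start=1500, duration=100, depth=50.0, baseline=0.0):
    """Return a flat light curve with one box-shaped dip (hypothetical toy model)."""
    flux = np.full(n_points, baseline, dtype=float)
    # The star appears dimmer while the planet is in front of it
    flux[start:start + duration] -= depth
    return flux

flux = box_transit()

# Indices where the flux is below its maximum, i.e. the in-transit samples
in_transit = np.flatnonzero(flux < flux.max())
print(in_transit[0], in_transit[-1], flux.min())
```

Plotting `flux` against its index with `plt.plot` gives the idealized pattern described above: a flat baseline, a drop, a flat minimum and a return to the baseline.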

In [None]:
# Drop the label column as we do not want to plot it in the curve

plot_df = train_df.drop(['LABEL'], axis = 1)
plot_df

In [None]:
# Plot the light curve for one star - here we plot the star at index 3 (the 4th row) of the plot dataframe

time = range(1, 3198) # X - axis will hold the time periods starting from 1 to 3197
flux_values = plot_df.iloc[3, :].values # Y - axis will hold the range of flux or brightness values for the star
plt.figure(figsize = (15, 5))
plt.plot(time, flux_values, linewidth = 1)

We can try to plot for multiple stars. We can observe that if a plot has multiple dips, then this could possibly be a multiplanetary system where the star is being orbited by more than one exoplanet. If the plot has no dips and almost follows a straight line, this could mean the star has no exoplanet(s) orbiting it.

Here, for a few stars (like star 2998), we can observe that some flux values are extremely high and lie far outside the range of the rest. These high flux values act as extreme outliers that can be problematic for the machine learning model we later use to classify the stars.

## Handling outliers ##
We first visualize the outliers using boxplots.

In [None]:
plt.figure(figsize = (20, 9))
for i in range(1, 4):
    plt.subplot(1, 4, i)
    sns.boxplot(data = train_df, x = 'LABEL', y = 'FLUX.' + str(i))

By observing the dataset, we can infer that any flux values above 0.25 x 10⁶ are extreme outliers. We drop the rows in which FLUX.2 exceeds this threshold.

In [None]:
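# Drop the rows whose FLUX.2 value exceeds the outlier threshold

train_df.drop(train_df[train_df['FLUX.2'] > 0.25e6].index, axis = 0, inplace = True)
# Check a randomly chosen flux column (columns run from FLUX.1 to FLUX.3197; FLUX.0 does not exist)
sns.boxplot(data = train_df, x = 'LABEL', y = 'FLUX.' + str(np.random.randint(1, 3198)))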
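The cell above filters on a single column, FLUX.2. As a hedged sketch of a broader variant, the snippet below drops a row when any of its flux columns exceeds the same 0.25e6 threshold. The miniature dataframe `toy_df` is made up for illustration, and whether this stricter filter is appropriate for the real data is an assumption that would need checking, for example by counting how many stars of each label it removes.

```python
import pandas as pd

# Made-up miniature of the dataset: one label column and three flux columns
toy_df = pd.DataFrame({
    "LABEL":  [1, 0, 0, 0],
    "FLUX.1": [10.0, -5.0, 3.0e5, 2.0],
    "FLUX.2": [12.0, -4.0, 1.0, 4.0e5],
    "FLUX.3": [11.0, -6.0, 2.0, 1.0],
})

threshold = 0.25e6
flux_cols = [c for c in toy_df.columns if c.startswith("FLUX.")]

# True for rows where at least one flux value exceeds the threshold
is_outlier = (toy_df[flux_cols] > threshold).any(axis=1)
filtered_df = toy_df[~is_outlier]

print(filtered_df.index.tolist())
```

In this toy example a filter on FLUX.2 alone would remove only the last row, while the all-column filter also removes the row whose spike sits in FLUX.1.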


## The K - Nearest Neighbours Algorithm ##
* Here, we use the KNN algorithm for classifying the data.
* Although KNN is sensitive to outliers and imbalanced data, its performance is shown to be better than other classification algorithms for this dataset.
* Below are the steps to classify a new data point for a predetermined value of K:
  1. Determine the value of K.
  2. Use Euclidean distance to compute the distance between the new data point and all the other existing data points.
  3. Choose the K data points that are nearest to the new data point.
  4. Among these K points, pick the class that the majority of points are classified into.
  5. Assign this class to the new data point.
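The steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation used later in the notebook; the function name `knn_predict` and the toy 2-D data are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest neighbours."""
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((x_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Steps 4 and 5: the most common label among those neighbours is the prediction
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Toy data (made up for illustration): two well-separated clusters in 2-D
x_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(x_train, y_train, np.array([0.3, 0.3]), k=3)
pred_b = knn_predict(x_train, y_train, np.array([4.8, 5.0]), k=3)
print(pred_a, pred_b)
```

With K = 3 each query point is surrounded by three neighbours from a single cluster, so the vote is unanimous. When one class is far rarer than the other, the neighbours of almost any point are dominated by the majority class, which is why imbalance matters for KNN.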

## Implementing KNN ##
We first implement KNN on imbalanced data.

In [None]:
# We first extract the independent (x) and dependent (y) features from train dataframe
# Here, our independent features are the flux values and the dependent feature is the label

x = train_df.drop(['LABEL'], axis = 1)
y = train_df.LABEL
x, y

In [None]:
# Split dependent and independent features into train and test sets

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

## Feature Scaling ##
The flux values in this dataset do not lie in a common range; they vary widely from star to star. So we use feature scaling to bring the flux values onto a comparable scale.

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train_sc = sc.fit_transform(x_train)
x_test_sc = sc.transform(x_test)
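As a rough sketch of what the scaling cell computes, the snippet below standardizes each column by hand: it subtracts the column mean and divides by the column standard deviation, both estimated on the training split only, and then reuses those statistics for the test split. The toy arrays are made-up numbers, not Kepler flux values.

```python
import numpy as np

# Toy matrices (made-up numbers): rows are stars, columns are time steps
x_train_toy = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]])
x_test_toy = np.array([[20.0, 800.0]])

# Statistics come from the training split only
mean = x_train_toy.mean(axis=0)
std = x_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses

x_train_scaled = (x_train_toy - mean) / std
x_test_scaled = (x_test_toy - mean) / std  # reuse the training mean and std

print(x_train_scaled.mean(axis=0), x_train_scaled.std(axis=0))
print(x_test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1. The test row is expressed in the same units, so a value far from the training mean stays visibly far from it.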

## Data Modelling ##

In [None]:
# Fit the KNN classifier model on the scaled train data
from sklearn.neighbors import KNeighborsClassifier as KNC

# Choosing k = 1
knn_classifier = KNC(n_neighbors = 1, metric = 'minkowski', p = 2)

# Fitting the model
knn_classifier.fit(x_train_sc, y_train)

# Predict the labels for the scaled test set
y_pred = knn_classifier.predict(x_test_sc)

# Results
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_curve, auc

print("Validation accuracy of KNN: ", accuracy_score(y_test, y_pred))
print()
print("Classification report: \n", classification_report(y_test, y_pred))

# Confusion matrix
plt.figure(figsize = (15,11))
plt.subplots_adjust(wspace = 0.3)
plt.suptitle("KNN Performance before handling the imbalance in data", color = 'b', weight = 'bold')
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test, y_pred), annot = True, cmap = "Set2", fmt = "d", linewidths = 3, cbar = False, xticklabels = ['Non - exoplanet', 'Exoplanet'], yticklabels = ['Non - exoplanet', 'Exoplanet'], square = True)
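# confusion_matrix puts actual labels on rows (y - axis) and predicted labels on columns (x - axis)
plt.xlabel("Predicted Labels", fontsize = 15, weight = 'bold', color = 'tab:pink')
plt.ylabel("Actual Labels", fontsize = 15, weight = 'bold', color = 'tab:pink')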
plt.title("Confusion Matrix", fontsize = 20, color = 'm')

# ROC Curve and Area under the curve
predicting_probabilities = knn_classifier.predict_proba(x_test_sc)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, predicting_probabilities)
plt.subplot(222)
plt.plot(fpr, tpr, label = f"AUC: {auc(fpr, tpr):.3f}", color = 'g')
plt.plot([1,0], [1,0], "k--")
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC - Curve and Area under the curve", fontsize = 20, color = 'm')
plt.show()




The classification report shows that precision, recall and F1 - score are all 0 for label 1, i.e., for stars with exoplanets. This is due to the high imbalance in the data: stars without exoplanets far outnumber stars with exoplanets. This imbalance biases the KNN model towards predicting stars without exoplanets. To handle this, we first balance the data and then fit the model on the balanced data.
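A small arithmetic sketch shows why accuracy alone hides this problem. It uses the class counts reported in this notebook (37 stars with exoplanets, 5050 without) and a hypothetical classifier that always predicts "no exoplanet". The counts describe the full train data rather than the exact 30% validation split, so the numbers are illustrative only.

```python
import numpy as np

# Class counts reported in this notebook for the train data
n_negative, n_positive = 5050, 37
y_true = np.array([0] * n_negative + [1] * n_positive)

# Hypothetical classifier that always predicts label 0 (no exoplanet)
y_always_zero = np.zeros_like(y_true)

accuracy = (y_always_zero == y_true).mean()
true_positives = int(((y_always_zero == 1) & (y_true == 1)).sum())
recall = true_positives / n_positive

print(f"accuracy = {accuracy:.4f}, recall for label 1 = {recall:.1f}")
```

The accuracy is 5050 / 5087, roughly 99.3%, even though not a single exoplanet star is found. This is why the per-class precision, recall and F1 - score are more informative here than the overall accuracy.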

## Handling the imbalance in the data ##
We use RandomOverSampler to handle the imbalance in the data. RandomOverSampler oversamples by duplicating some of the original samples from the minority class.
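The idea of duplicating minority samples can be sketched in plain NumPy. This is a toy, binary-only illustration and not the imblearn implementation; the function name `random_oversample`, its `seed` parameter and the toy arrays are all made up for the example.

```python
import numpy as np

def random_oversample(x, y, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal counts.

    Toy binary-only sketch; not the imblearn implementation.
    """
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    # Sample with replacement from the existing minority rows
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return x[keep], y[keep]

# Made-up data: 8 majority rows and 2 minority rows
x_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

x_bal, y_bal = random_oversample(x_toy, y_toy)
print(np.bincount(y_bal))
```

Every added row is an exact copy of an existing minority row, so the class counts are balanced without creating new information. One thing to keep in mind is that, if resampling happens before a train/test split, copies of the same star can land in both splits, which tends to make test scores look better than they would on genuinely new stars.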

In my case, there was an incompatibility between the installed sklearn and imblearn versions. Hence, in the next two cells, I'm clearing the cache and installing compatible versions. You would need to restart the kernel after installing the new versions.

In [None]:
# Clear Python cache
import sys
if 'imblearn' in sys.modules:
    del sys.modules['imblearn']
if 'sklearn' in sys.modules:
    del sys.modules['sklearn']

# Force reimport
import importlib
importlib.invalidate_caches()


In [None]:
!pip install --no-deps scikit-learn==1.4.0 --force-reinstall --quiet
!pip install --no-deps imbalanced-learn==0.12.0 --force-reinstall --quiet

In [None]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler()
x_ros, y_ros = ros.fit_resample(x, y)

In [None]:
y_ros.value_counts().plot(kind = 'bar', title = 'After applying RandomOverSampler')


Let's compare and visualize imbalanced and balanced data

In [None]:
from collections import Counter
print(f"Before ROS: {Counter(y)}\nAfter ROS: {Counter(y_ros)}")

Initially, the count of class 0 was 5050. One entry was dropped as an outlier, which leaves 5049.
After applying ROS, both classes 0 and 1 have 5049 entries. To balance the data, 5012 additional (duplicate) entries were added to the initial 37 class 1 entries. This also increased the size of the dataset.

In [None]:
# Initial size of imbalanced dataset
print(len(y))

# Size of dataset after balancing
print(len(y_ros))

## Splitting balanced data into train and test set, scaling the data and data modelling ##

In [None]:
# Split dependent and independent features into train and test sets

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_ros, y_ros, test_size = 0.3, random_state = 0)


In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train_sc = sc.fit_transform(x_train)
x_test_sc = sc.transform(x_test)

## Choosing optimal K value ##

In [None]:
err_rate = []

for k in range(1,11):
    knn_clasfr = KNC(n_neighbors = k)
    knn_clasfr.fit(x_train_sc, y_train)
    pred = knn_clasfr.predict(x_test_sc)
    err_rate.append(np.mean(pred != y_test))

arg, val = err_rate.index(min(err_rate)), min(err_rate)

plt.figure(figsize = (5,5))
plt.plot(range(1, 11), err_rate, 'co--', markersize = 8)
plt.plot(arg+1, val, marker = 'o', markersize = 8, markerfacecolor = 'r', markeredgecolor = 'g')
plt.title("Error rate vs. K, with the minimum - error K marked")
plt.ylabel("Error Rate")
plt.xlabel("K values")

In [None]:
# Fit the KNN classifier model on the scaled train data
from sklearn.neighbors import KNeighborsClassifier as KNC

# Choosing k = 1
knn_classifier = KNC(n_neighbors = 1, metric = 'minkowski', p = 2)

# Fitting the model
knn_classifier.fit(x_train_sc, y_train)

# Predict the labels for the scaled test set
y_pred = knn_classifier.predict(x_test_sc)

# Results
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_curve, auc

print("Validation accuracy of KNN: ", accuracy_score(y_test, y_pred))
print()
print("Classification report: \n", classification_report(y_test, y_pred))

# Confusion matrix
plt.figure(figsize = (15,11))
plt.subplots_adjust(wspace = 0.3)
plt.suptitle("KNN Performance after handling the imbalance in data", color = 'b', weight = 'bold')
plt.subplot(221)
sns.heatmap(confusion_matrix(y_test, y_pred), annot = True, cmap = "Set2", fmt = "d", linewidths = 3, cbar = False, xticklabels = ['Non - exoplanet', 'Exoplanet'], yticklabels = ['Non - exoplanet', 'Exoplanet'], square = True)
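# confusion_matrix puts actual labels on rows (y - axis) and predicted labels on columns (x - axis)
plt.xlabel("Predicted Labels", fontsize = 15, weight = 'bold', color = 'tab:pink')
plt.ylabel("Actual Labels", fontsize = 15, weight = 'bold', color = 'tab:pink')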
plt.title("Confusion Matrix", fontsize = 20, color = 'm')

# ROC Curve and Area under the curve
predicting_probabilities = knn_classifier.predict_proba(x_test_sc)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, predicting_probabilities)
plt.subplot(222)
plt.plot(fpr, tpr, label = f"AUC: {auc(fpr, tpr):.3f}", color = 'g')
plt.plot([1,0], [1,0], "k--")
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC - Curve and Area under the curve", fontsize = 20, color = 'm')
plt.show()

## Testing the model on the test set (unseen data) ##

In [None]:
# Read test dataset (CSV file) into a dataframe

test_df = pd.read_csv("/kaggle/input/kepler-labelled-time-series-data/exoTest.csv")
test_df.head()

In [None]:
test_df.shape

## Pre - processing the test dataset ##

In [None]:
# Check for missing values

test_df[test_df.isnull().any(axis = 1)]

In [None]:
# Visualize distribution of labels using countplot

plt.figure(figsize = (3, 5))
ax = sns.countplot(x = 'LABEL', data = test_df)
ax.bar_label(ax.containers[0])

In [None]:
# Replacing the labels 2 and 1 with 1 and 0 respectively
test_df = test_df.replace({"LABEL": {2 : 1, 1 : 0}})
test_df.LABEL.unique()

In [None]:
# Handle outliers

plt.figure(figsize = (20, 9))
for i in range(1, 4):
    plt.subplot(1, 4, i)
    sns.boxplot(data = test_df, x = 'LABEL', y = 'FLUX.' + str(i))

In [None]:
test_df.drop(test_df[test_df['FLUX.2'] > 0.25e6].index, axis = 0, inplace = True)
sns.boxplot(data = test_df, x = 'LABEL', y = 'FLUX.' + str(np.random.randint(1000)))

In [None]:
# Extract features and labels

x_unseen = test_df.drop(['LABEL'], axis = 1)
y_unseen = test_df.LABEL
x_unseen, y_unseen

In [None]:
# Feature scaling: reuse the scaler fitted on the training data instead of refitting it on the test set

x_unseen_sc = sc.transform(x_unseen)

In [None]:
# Predict the labels for the scaled test set
y_pred = knn_classifier.predict(x_unseen_sc)

# Results
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_curve, auc

print("Test accuracy of KNN: ", accuracy_score(y_unseen, y_pred))
print()
print("Classification report: \n", classification_report(y_unseen, y_pred))

# Confusion matrix
plt.figure(figsize = (15,11))
plt.subplots_adjust(wspace = 0.3)
plt.suptitle("KNN Performance on the unseen test data", color = 'b', weight = 'bold')
plt.subplot(221)
sns.heatmap(confusion_matrix(y_unseen, y_pred), annot = True, cmap = "Set2", fmt = "d", linewidths = 3, cbar = False, xticklabels = ['Non - exoplanet', 'Exoplanet'], yticklabels = ['Non - exoplanet', 'Exoplanet'], square = True)
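# confusion_matrix puts actual labels on rows (y - axis) and predicted labels on columns (x - axis)
plt.xlabel("Predicted Labels", fontsize = 15, weight = 'bold', color = 'tab:pink')
plt.ylabel("Actual Labels", fontsize = 15, weight = 'bold', color = 'tab:pink')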
plt.title("Confusion Matrix", fontsize = 20, color = 'm')

# ROC Curve and Area under the curve
predicting_probabilities = knn_classifier.predict_proba(x_unseen_sc)[:,1]
fpr, tpr, thresholds = roc_curve(y_unseen, predicting_probabilities)
plt.subplot(222)
plt.plot(fpr, tpr, label = f"AUC: {auc(fpr, tpr):.3f}", color = 'g')
plt.plot([1,0], [1,0], "k--")
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC - Curve and Area under the curve", fontsize = 20, color = 'm')
plt.show()

As we can observe, with the test dataset being highly imbalanced as well, the model did not predict the exoplanet class for any star, so the true positives are zero.