# Exoplanet Habitability Predictor

## 1. Data Collection and Cleaning

**Objective**: Gather data on exoplanets and prepare it for analysis.

- **Source Data**: [NASA Exoplanet Archive](https://exoplanetarchive.ipac.caltech.edu/)
- **Tasks**:
    - Gather data on exoplanets.
    - Clean: handle missing values, remove outliers, drop irrelevant columns.
    - Normalize/standardize features.
    - Split data into training, validation, and test sets.
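
The cleaning, standardization, and splitting tasks above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the helper name `clean_and_split`, the 70/15/15 split, and the synthetic demo frame are all assumptions, and outlier removal is left out for brevity.

```python
import numpy as np
import pandas as pd


def clean_and_split(df, feature_cols, seed=0, frac=(0.7, 0.15, 0.15)):
    """Drop incomplete rows, shuffle, split, then standardize with training statistics."""
    # Keep only the chosen columns and drop rows with any missing value.
    clean = df[feature_cols].dropna()

    # Shuffle once so the three splits are random but reproducible.
    clean = clean.sample(frac=1.0, random_state=seed).reset_index(drop=True)

    n = len(clean)
    n_train = int(frac[0] * n)
    n_val = int(frac[1] * n)
    train = clean.iloc[:n_train]
    val = clean.iloc[n_train:n_train + n_val]
    test = clean.iloc[n_train + n_val:]

    # Compute mean/std on the training split only, so no information
    # from the validation or test rows leaks into the scaling.
    mean = train.mean()
    std = train.std(ddof=0).replace(0, 1.0)

    def scale(part):
        return (part - mean) / std

    return scale(train), scale(val), scale(test)


# Tiny synthetic frame for illustration; the values are made up.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "mass_multiplier": rng.uniform(0.1, 20.0, size=100),
    "orbital_radius": rng.uniform(0.01, 5.0, size=100),
})
demo.loc[[3, 17], "orbital_radius"] = np.nan  # simulate missing values

train, val, test = clean_and_split(demo, ["mass_multiplier", "orbital_radius"])
print(len(train), len(val), len(test))
```

Scaling with statistics from the training split alone keeps the validation and test sets untouched by the fitting step, which is the usual reason to split before standardizing.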

---

## 2. Exploratory Data Analysis (EDA)

**Objective**: Understand the distribution of the data and identify potential features.

- **Tasks**:
    - Visualize distributions of features.
    - Examine correlations between features and habitability.
    - Identify candidate features for the model.
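
One way to approach the correlation task is to rank numeric features by their Pearson correlation with the target. The sketch below assumes a label column named `habitability`, which is hypothetical, and runs on synthetic data; the helper name `rank_features_by_correlation` is also an assumption.

```python
import numpy as np
import pandas as pd


def rank_features_by_correlation(df, target):
    """Return numeric features ordered by absolute Pearson correlation with target."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by magnitude but keep the sign in the returned values.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic data: "habitability" is a hypothetical label column.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "orbital_radius": x,
    "eccentricity": rng.normal(size=200),
    "habitability": 2.0 * x + 0.1 * rng.normal(size=200),
})

ranking = rank_features_by_correlation(demo, "habitability")
print(ranking)
```

Pearson correlation only captures linear relationships, so a low score here does not rule a feature out.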

---

## 3. Model Training & Evaluation

**Objective**: Develop a model that predicts habitability scores based on exoplanet features.

- **Tasks**:
    - Choose a machine learning model (e.g., a regression model).
    - Train the model on the training set.
    - Validate on the validation set.
    - Evaluate with mean absolute error (MAE) and root mean squared error (RMSE).
    - Tune hyperparameters with grid search or random search.
    - Test on the test set.
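
To make the metrics and the grid search concrete, here is a small NumPy sketch. MAE is the mean of the absolute errors and RMSE is the square root of the mean squared error. The plan only says "regression", so the closed-form ridge regression used here is an assumption chosen to keep the example short, as are the helper names and the synthetic data.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression weights (no intercept, for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def grid_search_alpha(X_train, y_train, X_val, y_val, alphas):
    """Try each alpha and return (best_alpha, best_validation_rmse)."""
    scores = {}
    for alpha in alphas:
        weights = fit_ridge(X_train, y_train, alpha)
        scores[alpha] = rmse(y_val, X_val @ weights)
    best = min(scores, key=scores.get)
    return best, scores[best]


# Synthetic regression problem; the values are made up.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
X_train, y_train, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

alphas = [0.01, 0.1, 1.0, 10.0]
best_alpha, best_rmse = grid_search_alpha(X_train, y_train, X_val, y_val, alphas)
print("best alpha:", best_alpha, "validation RMSE:", round(best_rmse, 3))
```

The hyperparameter is selected on the validation set only; the test set stays out of the loop until the final evaluation.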

---

## 4. Model Deployment with Flask

**Objective**: Integrate the trained model into a Flask application for real-time predictions.

- **Tasks**:
    - Create a Flask app.
    - Load the trained model with `pickle`.
    - Design endpoints: receive exoplanet parameters, return scores.
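
A prediction endpoint along these lines might look like the sketch below. The `/predict` route, the `FEATURES` list, the JSON field names, and the `MeanModel` stand-in are all assumptions made for illustration; the real app would pass in the model returned by `load_model` instead.

```python
import pickle

from flask import Flask, jsonify, request

# Hypothetical feature order; it must match the order used at training time.
FEATURES = ["mass_multiplier", "radius_multiplier", "orbital_radius",
            "orbital_period", "eccentricity"]


def load_model(path):
    """Load a pickled model. Only unpickle files from a trusted source."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def create_app(model):
    """Build a Flask app that serves predictions from the given model."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True) or {}
        missing = [name for name in FEATURES if name not in payload]
        if missing:
            return jsonify(error=f"missing fields: {missing}"), 400
        try:
            row = [float(payload[name]) for name in FEATURES]
        except (TypeError, ValueError):
            return jsonify(error="all fields must be numeric"), 400
        score = float(model.predict([row])[0])
        return jsonify(habitability_score=score)

    return app


class MeanModel:
    """Stand-in with a scikit-learn-style predict(); not a real habitability model."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


app = create_app(MeanModel())
# To serve locally, call app.run(debug=True) or use the `flask run` command.
```

Passing the model into `create_app` keeps the endpoint testable with Flask's built-in test client, without needing a pickle file on disk.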

---

## 5. Front-End Development

**Objective**: Create a user-friendly interface for the habitability predictor.

- **Tasks**:
    - Design web interface (HTML, CSS, JavaScript).
    - Input: exoplanet parameters.
    - Output: predicted habitability score.

---

## 6. Testing & Debugging

**Objective**: Ensure the complete system functions as expected.

- **Tasks**:
    - Test system end-to-end.
    - Check Flask app functionality.
    - Verify communication between the web interface and the Flask app.
    - Verify error handling for invalid input.

---

## 7. Documentation & User Guide

**Objective**: Document the process and provide a guide for users.

- **Tasks**:
    - Document: Data collection, model training, deployment.
    - User guide: Using the predictor.

---

## 8. Deployment & Monitoring

**Objective**: Make the predictor available to users and ensure it functions over time.

- **Tasks**:
    - Deploy Flask app (e.g., Heroku).
    - Monitor for issues/bugs.
    - Collect user feedback.



## 2. Exploratory Data Analysis (EDA)

### 2.1 Load Cleaned Data

    import pandas as pd

    # Load the cleaned dataset
    data_path = "../data/cleaned_5250.csv"
    data = pd.read_csv(data_path)

    # Display the first few rows to understand its structure
    data.head()

| | name | distance | stellar_magnitude | planet_type | discovery_year | mass_multiplier | mass_wrt | radius_multiplier | radius_wrt | orbital_radius | orbital_period | eccentricity | detection_method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 Comae Berenices b | 304.0 | 4.72307 | Gas Giant | 2007 | 19.4 | Jupiter | 1.08 | Jupiter | 1.29 | 0.892539 | 0.23 | Radial Velocity |
| 1 | 11 Ursae Minoris b | 409.0 | 5.013 | Gas Giant | 2009 | 14.74 | Jupiter | 1.09 | Jupiter | 1.53 | 1.4 | 0.08 | Radial Velocity |
| 2 | 14 Andromedae b | 246.0 | 5.23133 | Gas Giant | 2008 | 4.8 | Jupiter | 1.15 | Jupiter | 0.83 | 0.508693 | 0.0 | Radial Velocity |
| 3 | 14 Herculis b | 58.0 | 6.61935 | Gas Giant | 2002 | 8.13881 | Jupiter | 1.12 | Jupiter | 2.773069 | 4.8 | 0.37 | Radial Velocity |
| 4 | 16 Cygni B b | 69.0 | 6.215 | Gas Giant | 1996 | 1.78 | Jupiter | 1.2 | Jupiter | 1.66 | 2.2 | 0.68 | Radial Velocity |


### 2.2 Data Overview

    data.describe()

| | distance | stellar_magnitude | discovery_year | mass_multiplier | radius_multiplier | orbital_radius | orbital_period | eccentricity |
|---|---|---|---|---|---|---|---|---|
| count | 5233.0 | 5089.0 | 5250.0 | 5227.0 | 5233.0 | 4961.0 | 5250.0 | 5250.0 |
| mean | 2167.168737 | 12.683738 | 2015.73219 | 6.434812 | 1.015121 | 6.962942 | 479.1509 | 0.063568 |
| std | 3245.522087 | 3.107571 | 4.307336 | 12.972727 | 0.603479 | 138.6736 | 16804.45 | 0.141424 |
| min | 4.0 | 0.872 | 1992.0 | 0.02 | 0.2 | 0.0044 | 0.0002737851 | -0.52 |
| 25% | 389.0 | 10.939 | 2014.0 | 1.804 | 0.325 | 0.053 | 0.01259411 | 0.0 |
| 50% | 1371.0 | 13.543 | 2016.0 | 4.17014 | 1.12 | 0.1028 | 0.03449692 | 0.0 |
| 75% | 2779.0 | 15.021 | 2018.0 | 8.0 | 1.41 | 0.286 | 0.1442163 | 0.06 |
| max | 27727.0 | 44.61 | 2023.0 | 752.0 | 6.9 | 7506.0 | 1101370.0 | 0.95 |
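
The summary statistics above hint at two issues worth checking before modeling: the per-column counts range from 4961 to 5250, so some values are still missing, and the minimum eccentricity is -0.52 even though the eccentricity of a bound orbit lies in [0, 1). A minimal sketch of such a check follows; the helper name `sanity_report` and the demo rows are illustrative assumptions, not part of the dataset.

```python
import pandas as pd


def sanity_report(df):
    """Return missing-value counts per column and rows with eccentricity outside [0, 1)."""
    missing = df.isna().sum()
    missing = missing[missing > 0]  # keep only columns that have gaps
    bad_ecc = df[(df["eccentricity"] < 0) | (df["eccentricity"] >= 1)]
    return missing, bad_ecc


# Illustrative rows only; not taken from the dataset.
demo = pd.DataFrame({
    "name": ["planet a", "planet b", "planet c"],
    "distance": [304.0, None, 58.0],
    "eccentricity": [0.23, -0.52, 0.0],
})

missing, bad_ecc = sanity_report(demo)
print(missing)
print(bad_ecc["name"].tolist())
```

On the real data, the same call would be `sanity_report(data)`; how to treat the flagged rows (drop, correct, or impute) is a separate decision.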


### 2.3 Feature Distributions
Histograms of each numeric feature show its range, skew, and potential outliers.

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Plot numeric columns only; note that `columns` is an attribute, not a method.
    features = data.select_dtypes(include="number").columns

    for feature in features:
        plt.figure(figsize=(10, 5))
        # Drop missing values so each histogram reflects observed data only.
        sns.histplot(data[feature].dropna(), bins=50, kde=True)
        plt.title(f'Distribution of {feature}')
        plt.show()
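
Several features span many orders of magnitude (in the summary of section 2.2, `orbital_period` has a median near 0.034 and a maximum above one million), so a linear-scale histogram puts almost every planet into a single bin. One possible remedy is to plot the base-10 logarithm instead. The helper below is a sketch under that assumption; `log10_positive` is a hypothetical name and the demo values are made up.

```python
import numpy as np
import pandas as pd


def log10_positive(series):
    """Return log10 of the strictly positive, non-missing values of a Series."""
    values = series.dropna()
    values = values[values > 0]  # the logarithm is undefined for zero and negatives
    return np.log10(values).rename(f"log10({series.name})")


# Illustrative values spanning several orders of magnitude.
demo = pd.Series([0.001, 0.01, 1.0, 100.0, None, 0.0], name="orbital_period")
logged = log10_positive(demo)
print(logged)
```

The transformed Series can be passed to `sns.histplot` in place of the raw column. Alternatively, `sns.histplot` accepts `log_scale=True`, which keeps the axis labels in the original units.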