# Exploring NASA Exoplanet Data
In this project, we work with the NASA Exoplanet Dataset to analyze planetary and stellar properties using machine learning. The goal is to understand how exoplanets are discovered, classify them based on temperature, and explore natural groupings in the data. The project progresses from basic data exploration to supervised and unsupervised learning, with a focus on practical implementation and model comparison rather than theoretical explanations.

## Objectives

- Perform exploratory data analysis (EDA) on NASA exoplanet data

- Predict the discovery method of exoplanets using supervised classification

- Classify exoplanets into temperature-based categories

- Compare multiple machine learning models using evaluation metrics

- Apply unsupervised learning to identify clusters of similar exoplanets

## Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_columns", None)

# NOTE: machine-specific absolute path; point this at your local copy of the CSV.
df = pd.read_csv(r"C:\Users\madha\Downloads\12310219-PA\nasa_exoplanets.csv")
df.head()

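The load step above depends on one machine's folder layout. As a hedged alternative, the sketch below wraps the read in a small helper that fails early with a clear message when the file is missing. The helper name `load_exoplanets` and the demo file are hypothetical, and the `comment="#"` option rests on the assumption that the export may carry `#`-prefixed description lines above the header row and that no data value contains `#`.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_exoplanets(csv_path):
    """Read an exoplanet CSV, ignoring '#'-prefixed comment lines (hypothetical helper)."""
    csv_path = Path(csv_path)
    if not csv_path.exists():
        raise FileNotFoundError(f"CSV not found: {csv_path}")
    # Assumption: '#' only ever marks comment lines, never part of a data value.
    return pd.read_csv(csv_path, comment="#")


# Demo on a tiny made-up file so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_exoplanets.csv"
    demo_path.write_text(
        "# demo header comment\n"
        "pl_name,pl_rade\n"
        "Demo b,1.0\n"
        "Demo c,2.5\n"
    )
    demo_df = load_exoplanets(demo_path)

print(demo_df.shape)
```

In the notebook itself, the call would take a path relative to the notebook, for example a copy of `nasa_exoplanets.csv` placed next to it.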

## Basic Dataset Overview

In [None]:
df.shape

In [None]:
df.info()

## Key Columns and Missing Values

In [None]:
key_columns = [
    "discoverymethod",
    "pl_orbper",
    "pl_rade",
    "pl_bmasse",
    "pl_eqt",
    "st_teff",
    "st_mass"
]

df[key_columns].isnull().sum()

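Raw missing counts are easier to compare once they are expressed as a share of all rows. The following is a minimal sketch of that idea; `missing_summary` is a hypothetical helper, and the demo frame is made-up data that only borrows column names from `key_columns`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame, columns):
    """Return missing counts and percentages, most incomplete column first."""
    counts = frame[columns].isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing_pct", ascending=False)


# Made-up demo values, not taken from the real dataset.
demo = pd.DataFrame({
    "pl_rade": [1.0, np.nan, 2.0, np.nan],
    "pl_eqt": [300.0, 500.0, np.nan, 800.0],
    "st_teff": [5000.0, 5200.0, 4800.0, 6100.0],
})
demo_summary = missing_summary(demo, ["pl_rade", "pl_eqt", "st_teff"])
print(demo_summary)
```

On the real data the equivalent call would be `missing_summary(df, key_columns)`.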
## Target Variable: Discovery Method

In [None]:
df["discoverymethod"].value_counts().head(10)

In [None]:
plt.figure(figsize=(10, 6))

sns.countplot(
    data=df,
    y="discoverymethod",
    order=df["discoverymethod"].value_counts().head(10).index,
    palette="viridis"
)

plt.title("Top 10 Exoplanet Discovery Methods")
plt.xlabel("Number of Exoplanets")
plt.ylabel("Discovery Method")

plt.tight_layout()
plt.show()

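When a few discovery methods dominate the counts, one common option before classification is to merge the rarest methods into a single bucket. The sketch below illustrates that idea only; `collapse_rare_classes`, the `"Other"` label, and the `min_count` threshold are all assumptions rather than choices made in this project, and the demo labels are made up.

```python
import pandas as pd


def collapse_rare_classes(labels, min_count, other_label="Other"):
    """Replace labels seen fewer than min_count times with other_label."""
    counts = labels.value_counts()
    rare = counts[counts < min_count].index
    # Keep a label where it is NOT rare; otherwise substitute other_label.
    return labels.where(~labels.isin(rare), other_label)


# Made-up demo labels, not real counts from the dataset.
demo_methods = pd.Series(
    ["Transit"] * 6 + ["Radial Velocity"] * 3 + ["Imaging", "Astrometry"]
)
collapsed = collapse_rare_classes(demo_methods, min_count=3)
shares = collapsed.value_counts(normalize=True)
print(shares)
```

Applied to the notebook, `labels` would be `df["discoverymethod"]`; a suitable threshold depends on how many samples the downstream model needs per class.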
## Planetary and Stellar Feature Distributions

In [None]:
plt.figure(figsize=(12, 7))
sns.set_style("whitegrid")

# 1. Planet Radius
plt.subplot(2, 2, 1)
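# Restrict the bins to the plotted range so all 30 fall inside the visible x-axis.
sns.histplot(df["pl_rade"], bins=30, binrange=(0, 20), kde=True)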
plt.xlim(0, 20)
plt.title("Planet Radius Distribution")
plt.xlabel("Planet Radius (Earth radii)")

# 2. Planet Equilibrium Temperature
plt.subplot(2, 2, 2)
sns.histplot(df["pl_eqt"], bins=30, kde=True)
plt.title("Planet Equilibrium Temperature")
plt.xlabel("Temperature (K)")

# 3. Stellar Temperature
plt.subplot(2, 2, 3)
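# Restrict the bins to the plotted range so all 30 fall inside the visible x-axis.
sns.histplot(df["st_teff"], bins=30, binrange=(0, 10000), kde=True)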
plt.xlim(0, 10000)
plt.title("Stellar Temperature")
plt.xlabel("Temperature (K)")

# 4. Orbital Period (log)
plt.subplot(2, 2, 4)
sns.histplot(np.log10(df["pl_orbper"] + 1), bins=30, kde=True)
plt.title("Orbital Period (Log Scale)")
plt.xlabel("log10(Orbital Period + 1)")

plt.tight_layout()
plt.show()

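The axis limits in the plots above are fixed by hand (20 Earth radii, 10,000 K). A hedged alternative is to derive them from quantiles so that a few extreme values do not dictate the view. `bulk_range` is a hypothetical helper, the default quantiles are assumptions, and the demo values are made up.

```python
import numpy as np
import pandas as pd


def bulk_range(values, lower=0.01, upper=0.99):
    """Return the (lower, upper) quantile bounds of the non-missing values."""
    clean = values.dropna()
    return float(clean.quantile(lower)), float(clean.quantile(upper))


# Made-up radii with one extreme outlier and one missing value.
demo_radius = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 11.0, np.nan, 400.0])
low, high = bulk_range(demo_radius, lower=0.0, upper=0.75)
print(low, high)
```

The resulting pair can be passed to `sns.histplot` through its `binrange` argument and to `plt.xlim`, which keeps the bins and the visible axis in agreement.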
## Relationships Between Features

In [None]:
plt.figure(figsize=(8, 6))
sns.set_style("whitegrid")

sns.scatterplot(
    data=df,
    x="pl_orbper",
    y="pl_rade",
    alpha=0.5
)

plt.xscale("log")
plt.xlabel("Orbital Period (days)")
plt.ylabel("Planet Radius (Earth radii)")
plt.title("Orbital Period vs Planet Radius")

plt.tight_layout()
plt.show()

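Because the scatter plot uses a log-scaled x-axis, a rank-based measure is a natural numeric companion: it is unchanged by monotonic transforms such as the logarithm. The sketch below computes Spearman's rank correlation as a Pearson correlation on ranks, which avoids extra dependencies. `spearman_rank_corr` is a hypothetical helper and the demo values are made up.

```python
import numpy as np
import pandas as pd


def spearman_rank_corr(frame, x, y):
    """Spearman correlation of two columns, computed as Pearson on ranks."""
    pair = frame[[x, y]].dropna()   # keep only rows where both values exist
    ranks = pair.rank()             # average ranks, so ties are handled
    return ranks[x].corr(ranks[y])  # default method is Pearson


# Made-up, strictly increasing relationship plus one incomplete row.
demo_pairs = pd.DataFrame({
    "pl_orbper": [1.0, 10.0, 100.0, 1000.0, 5.0],
    "pl_rade": [1.0, 2.0, 3.0, 50.0, np.nan],
})
rho = spearman_rank_corr(demo_pairs, "pl_orbper", "pl_rade")
print(rho)
```

On the real data the call would be `spearman_rank_corr(df, "pl_orbper", "pl_rade")`; the value it returns is not known in advance and should be read alongside the plot.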
## Correlation Analysis (Selected Features)

In [None]:
features = [
    "pl_orbper",
    "pl_orbsmax",
    "pl_rade",
    "pl_bmasse",
    "pl_eqt",
    "st_mass",
    "st_rad",
    "st_teff"
]

plt.figure(figsize=(10, 6))
sns.heatmap(df[features].corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.tight_layout()
plt.show()

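A heatmap with eight features holds 28 distinct pairs, so it can help to list the strongest ones explicitly. The following is a minimal sketch of one way to do that; `top_correlated_pairs` is a hypothetical helper and the demo columns are made up.

```python
from itertools import combinations

import pandas as pd


def top_correlated_pairs(frame, columns, n=5):
    """List the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame[columns].corr()
    # combinations() yields each unordered pair once and skips the diagonal.
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(columns, 2)]
    pairs = pd.DataFrame(rows, columns=["feature_a", "feature_b", "corr"]).dropna()
    order = pairs["corr"].abs().sort_values(ascending=False).index
    return pairs.loc[order].head(n).reset_index(drop=True)


# Made-up demo: b is an exact multiple of a, c is only loosely related.
demo_features = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
top = top_correlated_pairs(demo_features, ["a", "b", "c"])
print(top)
```

In the notebook this would be called as `top_correlated_pairs(df, features)`, reusing the `features` list defined for the heatmap.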

## Summary of Exploratory Analysis

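- The dataset combines planetary, stellar, and discovery-related parameters for confirmed exoplanets.
- The key features span wide ranges and differ in how much data is missing, so missing values need handling before modeling.
- Discovery methods are unevenly distributed, which makes the classification target imbalanced.
- Planetary and stellar properties follow differently shaped distributions; orbital period in particular is easier to read on a log scale.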

This exploratory analysis informs preprocessing and modeling decisions in subsequent modules.
