# Initial Exoplanet Data Exploration of TESS Candidates

## Purpose

The main goal of this notebook is to explore and understand the TESS confirmed exoplanet dataset and the set of false positive exoplanet examples.

Key features of this dataset will be analyzed to:

1. **Gain Familiarity:** Develop an understanding of the types of data available and their distributions
2. **Identify Patterns:** Look for any distinguishing characteristics between confirmed exoplanets and their false positive counterparts
3. **Guide Feature Engineering:** Use the insights gained to pick features for our machine learning model, which will aim to classify exoplanet transit signals

## Dataset

We will be using the following data:
* **NASA_TESS_Project_Candidates_Labelled_CSV.csv:** Contains metadata for TESS Objects of Interest (TOIs), including each target's TIC ID and its disposition label (for example `PC` or `FP`).
* **TESS Light Curves:** We will download light curve data from MAST for a subset of confirmed exoplanets and false positives, using the TIC IDs provided in the CSV file.
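
As a rough illustration of the download step described above, the sketch below looks up one target by TIC ID with `lightkurve`. The helper names `tic_target` and `download_first_light_curve` are hypothetical, and restricting the search to `author="SPOC"` is an assumption rather than something this notebook specifies.

```python
def tic_target(tic_id):
    """Format a numeric TIC ID as a search target string such as 'TIC 50365310'."""
    return f"TIC {int(tic_id)}"


def download_first_light_curve(tic_id, author="SPOC"):
    """Search MAST for TESS light curves of one target and download the first match.

    Returns None when the search finds nothing. Requires network access.
    The default author="SPOC" is an assumption made for this sketch.
    """
    import lightkurve as lk  # imported lazily so tic_target() works without lightkurve

    search_result = lk.search_lightcurve(tic_target(tic_id), mission="TESS", author=author)
    if len(search_result) == 0:
        return None
    return search_result[0].download()


# Offline check of the target string; the download itself is not run here.
print(tic_target(50365310))
```

Calling `download_first_light_curve(50365310)` would then attempt the actual search and download for the first TIC ID in the table.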

## Libraries / Prerequisites to run

* `pandas`: For data manipulation and analysis.
* `numpy`: For numerical operations.
* `matplotlib`: For creating visualizations.
* `lightkurve`: For downloading and manipulating TESS light curves.

# 1. Loading and Cleaning the data

In [1]:
try: 
    import lightkurve as lk
except ImportError:
    %pip install lightkurve
    import lightkurve as lk
    print("Lightkurve installed successfully")
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [49]:
# Load the labeled dataset, skipping the 69 lines of header information at the top of the file

labeled_planet_data = pd.read_csv('data/NASA_TESS_Project_Candidates_Labelled_CSV.csv', skiprows=69)
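
A hard-coded `skiprows=69` only works while the header block stays exactly that long. One possible alternative, assuming every header line in the file starts with `#` (not verified here for this particular file), is to let pandas skip those lines itself. The in-memory `sample_csv` below is a made-up stand-in for the real file.

```python
import io

import pandas as pd

# Tiny stand-in for the real file: '#' header lines followed by a normal CSV table.
sample_csv = io.StringIO(
    "# Stand-in header text, not copied from the real file\n"
    "# COLUMN toi: TESS Object of Interest\n"
    "toi,tid,tfopwg_disp\n"
    "1000.01,50365310,FP\n"
    "1001.01,88863718,PC\n"
)

# comment="#" makes pandas ignore fully commented lines, so the column names
# are read from the first line that does not start with '#'.
sample_df = pd.read_csv(sample_csv, comment="#")
print(sample_df.columns.tolist())
```

Note that `comment="#"` also cuts off the rest of any data line at the first `#`, so this only suits files whose data values never contain that character.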


In [50]:
# Check the first 5 rows of the dataset to see if it was loaded correctly
print(labeled_planet_data.head().to_markdown(index=False, numalign="left", stralign="left"))

| toi     | tid       | tfopwg_disp   | rastr        | ra      | decstr        | dec      | st_pmra   | st_pmraerr1   | st_pmraerr2   | st_pmralim   | st_pmdec   | st_pmdecerr1   | st_pmdecerr2   | st_pmdeclim   | pl_tranmid   | pl_tranmiderr1   | pl_tranmiderr2   | pl_tranmidlim   | pl_orbper   | pl_orbpererr1   | pl_orbpererr2   | pl_orbperlim   | pl_trandurh   | pl_trandurherr1   | pl_trandurherr2   | pl_trandurhlim   | pl_trandep   | pl_trandeperr1   | pl_trandeperr2   | pl_trandeplim   | pl_rade   | pl_radeerr1   | pl_radeerr2   | pl_radelim   | pl_insol   | pl_insolerr1   | pl_insolerr2   | pl_insollim   | pl_eqt   | pl_eqterr1   | pl_eqterr2   | pl_eqtlim   | st_tmag   | st_tmagerr1   | st_tmagerr2   | st_tmaglim   | st_dist   | st_disterr1   | st_disterr2   | st_distlim   | st_teff   | st_tefferr1   | st_tefferr2   | st_tefflim   | st_logg   | st_loggerr1   | st_loggerr2   | st_logglim   | st_rad   | st_raderr1   | st_raderr2   | st_radlim   | toi_created         | rowupdate   

In [51]:
# Check the columns in the dataset
print(labeled_planet_data.columns)

Index(['toi', 'tid', 'tfopwg_disp', 'rastr', 'ra', 'decstr', 'dec', 'st_pmra',
       'st_pmraerr1', 'st_pmraerr2', 'st_pmralim', 'st_pmdec', 'st_pmdecerr1',
       'st_pmdecerr2', 'st_pmdeclim', 'pl_tranmid', 'pl_tranmiderr1',
       'pl_tranmiderr2', 'pl_tranmidlim', 'pl_orbper', 'pl_orbpererr1',
       'pl_orbpererr2', 'pl_orbperlim', 'pl_trandurh', 'pl_trandurherr1',
       'pl_trandurherr2', 'pl_trandurhlim', 'pl_trandep', 'pl_trandeperr1',
       'pl_trandeperr2', 'pl_trandeplim', 'pl_rade', 'pl_radeerr1',
       'pl_radeerr2', 'pl_radelim', 'pl_insol', 'pl_insolerr1', 'pl_insolerr2',
       'pl_insollim', 'pl_eqt', 'pl_eqterr1', 'pl_eqterr2', 'pl_eqtlim',
       'st_tmag', 'st_tmagerr1', 'st_tmagerr2', 'st_tmaglim', 'st_dist',
       'st_disterr1', 'st_disterr2', 'st_distlim', 'st_teff', 'st_tefferr1',
       'st_tefferr2', 'st_tefflim', 'st_logg', 'st_loggerr1', 'st_loggerr2',
       'st_logglim', 'st_rad', 'st_raderr1', 'st_raderr2', 'st_radlim',
       'toi_created', 'rowupdate'],
      dtype='object')
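
Most columns in this listing come in groups: a value column plus `err1`, `err2`, and `lim` companions. A small helper can pick out just the value columns, which may be useful when shortlisting features later. This is a sketch; `measurement_columns` is a hypothetical helper, and the sample list is a subset of the column names shown above.

```python
def measurement_columns(columns):
    """Return the columns that have both '<name>err1' and '<name>err2' companions."""
    available = set(columns)
    return [
        name
        for name in columns
        if f"{name}err1" in available and f"{name}err2" in available
    ]


# Subset of the column names printed above
sample_columns = [
    "toi", "tid", "tfopwg_disp",
    "pl_orbper", "pl_orbpererr1", "pl_orbpererr2", "pl_orbperlim",
    "st_teff", "st_tefferr1", "st_tefferr2", "st_tefflim",
]

measured = measurement_columns(sample_columns)
print(measured)
```

On the full table the same call would be `measurement_columns(list(labeled_planet_data.columns))`.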

In [55]:
# To download the light curves we only need each target's identifiers and its label,
# so keep just the toi, tid, and tfopwg_disp columns and give them readable names
filtered_data = labeled_planet_data[['toi', 'tid', 'tfopwg_disp']].rename(columns={'toi': 'TOI', 'tid': 'TIC ID', 'tfopwg_disp': 'Disposition'})


# Check the first 5 rows of the filtered dataset
print(filtered_data.head())

       TOI     TIC ID Disposition
0  1000.01   50365310          FP
1  1001.01   88863718          PC
2  1002.01  124709665          FP
3  1003.01  106997505          FP
4  1004.01  238597883          FP


In [64]:
# Split the table into the two classes by TFOPWG disposition.
# Note: in the TOI table 'PC' stands for "planetary candidate"; confirmed planets
# are coded 'CP' (or 'KP' for known planets). The 'PC' rows serve as the planet class here.
confirmed_planets = filtered_data[filtered_data['Disposition'] == 'PC'].dropna()
false_positives = filtered_data[filtered_data['Disposition'] == 'FP'].dropna()

# Check how many examples fall into each class
print(f"Number of confirmed planets: {len(confirmed_planets)}")
print(f"Number of false positives: {len(false_positives)}")


Number of confirmed planets: 4653
Number of false positives: 1035
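
The two counts above show that the classes are imbalanced, and the disposition column can also hold labels beyond `PC` and `FP` (such as `CP`, `KP`, `APC`, and `FA`). A quick way to see every label at once is `value_counts`. The sketch below runs it on the five rows printed earlier rather than on the full table, and computes the class ratio from the reported counts.

```python
import pandas as pd

# The five rows shown in the filtered_data.head() output above
sample = pd.DataFrame({
    "TOI": [1000.01, 1001.01, 1002.01, 1003.01, 1004.01],
    "TIC ID": [50365310, 88863718, 124709665, 106997505, 238597883],
    "Disposition": ["FP", "PC", "FP", "FP", "FP"],
})

# Count rows per label; dropna=False also reports rows with no disposition
disposition_counts = sample["Disposition"].value_counts(dropna=False)
print(disposition_counts)

# Class ratio from the counts reported above (4653 'PC' rows vs. 1035 'FP' rows)
class_ratio = 4653 / 1035
print(f"PC to FP ratio: {class_ratio:.2f}")
```

Running `filtered_data['Disposition'].value_counts(dropna=False)` on the full table would give the complete label breakdown, which can inform how to handle the imbalance when training a classifier.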
