# This notebook is in progress and not meant to be a showpiece
I will update this notebook as I go and will remove this notice from the top when it is complete.

# Identifying Exoplanets using the TESS Objects of Interest Data Set
***Matt Paterson***<br>
***Machine Learning Engineer***<br>
***Santa Cruz, California***<br>

**Note about the Data**<br>
This data set comes from the NASA Exoplanet Archive, hosted by Caltech, found <a href='https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=TOI'>here</a>. It was downloaded on 2021-03-06 for the purpose of creating this instructional notebook.

**Note about the Spacecraft**<br>
TESS was launched by NASA in 2018 aboard a SpaceX rocket and completed its initial mission in July of 2020, positively identifying 66 planets and recording over 2100 objects of interest. You can learn more about TESS <a href='https://exoplanets.nasa.gov/tess/'>here</a>.

## The Data Science Problem
Can we use data from over 2000 observations made by the Transiting Exoplanet Survey Satellite (TESS) to positively identify more exoplanets in the sky using Machine Learning models?

## Import the necessary Python libraries
In this notebook we will use Python exclusively, and we'll build our models with scikit-learn. This particular notebook focuses on the Logistic Regression model only.

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model    import LogisticRegression
from sklearn.metrics         import confusion_matrix
from sklearn.model_selection import train_test_split

## Import your data
The data that we'll be using has already been downloaded from Caltech/NASA, cleaned up a little for use here, and saved in a folder called 'data'. That 'data' folder sits in the same directory as the 'code' folder where this notebook lives.

To import the data set, you'll have to give the pandas function a path to your csv file. You can use the absolute path, which is recommended for production code, but for ease in this situation you can go ahead and use the relative path.

In [7]:
datapath = '../data/'
filename = 'tess_oi.csv'
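
If you would rather not build the path by joining strings, the standard library's `pathlib` can do the joining and checking for you. This is only a minimal sketch that reuses the folder and file names from the cell above; `data_dir` and `csv_path` are names introduced here for illustration and are not used elsewhere in this notebook.

```python
from pathlib import Path

# Hypothetical names for this sketch; the folder and file match the cell above.
data_dir = Path("../data")
csv_path = data_dir / "tess_oi.csv"  # the / operator joins path parts

# The relative path as written.
print(csv_path)

# The absolute path it points to, which depends on where the notebook is run from.
print(csv_path.resolve())

# A cheap check before handing the path to pandas.
print(csv_path.exists())
```

`pd.read_csv` accepts a `Path` object as well as a string, so `csv_path` could be passed to it directly.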

When you import the data set into a pandas DataFrame, go ahead and set the index column to 'rowid' using an argument in the pandas function. Then display 5 randomly sampled rows of data from your DataFrame.

In [9]:
df = pd.read_csv(datapath + filename, index_col='rowid')
df.sample(5)

| rowid | toi | toipfx | tid | ctoi_alias | pl_pnum | tfopwg_disp | rastr | ra | raerr1 | raerr2 | ... | st_loggerr2 | st_logglim | st_loggsymerr | st_rad | st_raderr1 | st_raderr2 | st_radlim | st_radsymerr | toi_created | rowupdate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1519 | 2319.01 | 2319 | 349488688 | 349488688.0 | 1 | PC | 16h55m08.37s | 253.784885 | NaN | NaN | ... | -0.092576 | 0 | 1 | 1.39619 | 0.06407 | -0.06407 | 0 | 1 | 10/5/2020 21:01 | 11/3/2020 12:00 |
| 1536 | 2334.01 | 2334 | 200743403 | 200743403.0 | 1 | PC | 03h16m37.01s | 49.154193 | NaN | NaN | ... | -0.08 | 0 | 1 | 1.7 | 0.1 | -0.1 | 0 | 1 | 10/23/2020 19:58 | 2/1/2021 10:00 |
| 1520 | 232.01 | 232 | 402026209 | 402026209.0 | 1 | KP | 23h34m15.1s | 353.562915 | NaN | NaN | ... | NaN | 0 | 1 | 0.894974 | 0.055553 | -0.055553 | 0 | 1 | 11/16/2018 21:15 | 11/6/2020 16:00 |
| 1000 | 1861.01 | 1861 | 323295479 | 323295479.0 | 1 | PC | 08h43m12.03s | 130.80013 | NaN | NaN | ... | -0.33 | 0 | 1 | 1.06 | 0.06 | -0.06 | 0 | 1 | 5/18/2020 20:24 | 6/17/2020 10:00 |
| 1629 | 2419.01 | 2419 | 358248442 | 358248442.0 | 1 | PC | 03h33m04.89s | 53.270375 | NaN | NaN | ... | -0.33 | 0 | 1 | 1.3542 | 0.062663 | -0.062663 | 0 | 1 | 11/25/2020 22:47 | 1/5/2021 16:00 |


How many rows and columns of data do you have in this data set? How many null values do you have in each column? Are the data types in the table logical and ready for modeling? Why or why not? Which Python, pandas, or matplotlib functions can you use to inspect this data?

In [4]:
df.shape

(2542, 86)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2542 entries, 1 to 2542
Data columns (total 86 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   toi                2542 non-null   float64
 1   toipfx             2542 non-null   int64  
 2   tid                2542 non-null   int64  
 3   ctoi_alias         2542 non-null   float64
 4   pl_pnum            2542 non-null   int64  
 5   tfopwg_disp        2535 non-null   object 
 6   rastr              2542 non-null   object 
 7   ra                 2542 non-null   float64
 8   raerr1             0 non-null      float64
 9   raerr2             0 non-null      float64
 10  decstr             2542 non-null   object 
 11  dec                2542 non-null   float64
 12  decerr1            0 non-null      float64
 13  decerr2            0 non-null      float64
 14  st_pmra            2526 non-null   float64
 15  st_pmraerr1        2526 non-null   float64
 16  st_pmraerr2        2526 

In [17]:
df.isna().sum()

toi               0
toipfx            0
tid               0
ctoi_alias        0
pl_pnum           0
               ... 
st_raderr2      403
st_radlim         0
st_radsymerr      0
toi_created       0
rowupdate         0
Length: 86, dtype: int64
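
Beyond `shape`, `info()`, and `isna().sum()`, a few other pandas calls can help answer the questions above. The sketch below runs on a small made-up frame called `toy` (a hypothetical stand-in whose values are for illustration only, not taken from the archive), so you can see what each call returns before trying it on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: the values are made up for illustration only.
toy = pd.DataFrame(
    {
        "tfopwg_disp": ["PC", "KP", "PC", None],
        "st_rad": [1.39619, 0.894974, np.nan, 1.06],
        "raerr1": [np.nan, np.nan, np.nan, np.nan],
    }
)

# Summary statistics for the numeric columns only.
summary = toy.describe()

# Share of missing values per column, largest first.
null_share = toy.isna().mean().sort_values(ascending=False)

# Frequency of each label, counting missing labels as well.
disp_counts = toy["tfopwg_disp"].value_counts(dropna=False)

# Columns that are entirely empty carry no information for a model.
all_null_cols = toy.columns[toy.isna().all()].tolist()

print(summary)
print(null_share)
print(disp_counts)
print(all_null_cols)
```

Looking at the share of nulls rather than the raw count makes it easier to spot columns such as `raerr1`, which the `info()` output above shows as having 0 non-null values.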

## Quick and dirty data cleaning
With any dataset, you'll want to spend lots of time looking at each and every column of data and learning what each is and how it relates to the other columns. You'll employ correlation matrices, scatterplots, and bar charts to see patterns. You'll have to impute missing data and decide if some data can be imputed at all. You'll have to engineer data columns and aggregate data columns and convert columns to different units of measurement. 
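
As a rough illustration of two of those steps, here is a minimal sketch of median imputation followed by a correlation matrix. It assumes a made-up frame called `toy` with invented values; the column names are borrowed from this data set only to make it readable, and median imputation is just one of several possible choices.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame(
    {
        "st_rad": [1.4, np.nan, 0.9, 1.1],
        "st_tmag": [9.5, 10.2, np.nan, 11.0],
        "pl_trandep": [1200.0, 800.0, 950.0, 400.0],
    }
)

# Fill each column's gaps with that column's median.
imputed = toy.fillna(toy.median())

# Pairwise Pearson correlations between the numeric columns.
corr = imputed.corr()

print(imputed)
print(corr.round(2))
```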

This particular notebook is intended to help you learn how to use the Logistic Regression Classification Model, so rather than spend all of our time on the data (a typical Data Scientist spends most of their time acquiring, cleaning, and maintaining their data), we're going to make what I like to call the 'Quick and Dirty Model' by dropping all columns with null values.

## Drop all null values
Use a pandas method to drop every column in your table that contains null values, and save the new DataFrame to a variable called 'tess'. Then print the number of rows and columns that the new 'tess' DataFrame contains.

In [18]:
tess = df.dropna(axis=1)
tess.shape

(2542, 37)
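
To see exactly what `dropna(axis=1)` does, here is a small sketch on a made-up three-column frame (`toy`, `toy_clean`, and `dropped` are hypothetical names, and the values are illustrative only). It also shows one way to list which columns were removed.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame(
    {
        "toi": [2319.01, 2334.01, 232.01],
        "raerr1": [np.nan, np.nan, np.nan],      # entirely empty
        "st_rad": [1.39619, np.nan, 0.894974],   # one missing value
    }
)

# axis=1 drops columns; a single null is enough to remove a column.
toy_clean = toy.dropna(axis=1)

# Columns present before but not after are the ones that were dropped.
dropped = toy.columns.difference(toy_clean.columns).tolist()

print(toy_clean.shape)
print(dropped)
```

The row count is untouched by this operation, which matches the real data: 2542 rows before and after, while the columns go from 86 to 37.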

How many of your new columns are non-numeric?

In [19]:
tess.dtypes

toi                  float64
toipfx                 int64
tid                    int64
ctoi_alias           float64
pl_pnum                int64
rastr                 object
ra                   float64
decstr                object
dec                  float64
pl_tranmid           float64
pl_tranmidlim          int64
pl_tranmidsymerr       int64
pl_orbperlim           int64
pl_orbpersymerr        int64
pl_trandurh          float64
pl_trandurhlim         int64
pl_trandurhsymerr      int64
pl_trandep           float64
pl_trandeplim          int64
pl_trandepsymerr       int64
pl_radelim             int64
pl_radesymerr          int64
st_tmag              float64
st_tmagerr1          float64
st_tmagerr2          float64
st_tmaglim             int64
st_tmagsymerr          int64
st_distlim             int64
st_distsymerr          int64
st_tefflim             int64
st_teffsymerr          int64
st_logglim             int64
st_loggsymerr          int64
st_radlim              int64
st_radsymerr  