# 1. Data Exploration & Cleaning

## Import dependencies, set up environment

1. `python3 -m venv .venv`
2. Activate the virtual environment
    - (LINUX/MAC) `source .venv/bin/activate`
    - (WINDOWS) `.venv\Scripts\Activate.ps1`
3. `pip install -r requirements.txt`

In [1]:
# Jupyter magic
%run ../util/dependencies.py

### Importing Data

Upon reviewing the quality of data from the "Planetary Systems" (PS) database, it was deemed better to pivot towards the "Planetary Systems Composite Data" (PSCompPars) database. In brief:
1. PS holds one record for each exoplanet and each of its references (this helps us reach the original literature analyses of these bodies). Missing data is prevalent.
2. PSCompPars curates a “best available” or “most complete” set of parameters for each planet, pulling from multiple references.

For our exploration of exoplanets and their stars (a study of the whole population of exoplanets discovered thus far), the per-reference detail in PS is outside the scope of our analysis and makes exploring the data cumbersome. Using PSCompPars limits the time spent cleaning the dataset and reduces our analysis from a daunting ~32,000 records to 6065.

For an explanation on how the composite dataset aggregates all available information on exoplanet figures, please see <https://exoplanetarchive.ipac.caltech.edu/docs/pscp_calc.html>.
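To make the difference between the two tables concrete, the sketch below collapses several per-reference rows into one row per planet by keeping the first non-missing value of each parameter. The records and the `collapse_by_planet` helper are hypothetical, and this simple rule is only an illustration; the archive's actual selection logic is described at the link above.

```python
# Toy per-reference records in the style of the PS table
# (planet names and values are illustrative, not archive data).
ps_rows = [
    {"pl_name": "Planet A b", "pl_radj": None, "pl_massj": 1.2},
    {"pl_name": "Planet A b", "pl_radj": 0.9, "pl_massj": None},
    {"pl_name": "Planet B c", "pl_radj": 0.3, "pl_massj": None},
]

def collapse_by_planet(rows, key="pl_name"):
    """Merge rows sharing `key`, keeping the first non-missing value per field."""
    merged = {}
    for row in rows:
        target = merged.setdefault(row[key], {})
        for field, value in row.items():
            # Fill a field only if it is absent or still missing.
            if target.get(field) is None:
                target[field] = value
    return list(merged.values())

composite = collapse_by_planet(ps_rows)
print(len(ps_rows), "per-reference rows ->", len(composite), "composite rows")
```

A parameter that no reference reports (here the mass of the second planet) stays missing in the collapsed row, so a composite table reduces missing data without necessarily eliminating it.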

In [5]:
# TAP base URL (Planetary Systems Composite Data)
url = "https://exoplanetarchive.ipac.caltech.edu/TAP/sync"
# The "Planetary Systems Composite Data" database (confirmed exoplanets) is encoded as "PSCompPars" within the Exoplanet Archive
ADQL_query =    "SELECT " \
                "pl_controv_flag, pl_name, hostname, pl_letter, sy_snum, discoverymethod, disc_year, " \
                "pl_radj, pl_massj, st_spectype, st_rad, st_mass, st_met, st_lum, st_teff " \
                "FROM PSCompPars"

# Request data as CSV
params = {
    "query": ADQL_query,
    "format": "csv"
}
response = requests.get(url, params=params)

# Load into the "Planetary Systems Composite Data" DataFrame
pscp = pd.read_csv(io.StringIO(response.text))
print("Data loaded successfully. Number of records:", len(pscp))

# Save raw data as CSV for local use and the following notebooks
file_name = 'pscp_01_raw.csv'

try:
    pscp.to_csv('../data/' + file_name, index=False)
    print('Data saved as \'' + file_name + '\'')
except Exception as e:
    print('Data failed to save as \'' + file_name + '\': ' + str(e))

Data loaded successfully. Number of records: 6065
Data saved as 'pscp_01_raw.csv'
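One way to keep the column list easy to maintain is to build the ADQL string from a Python list, dropping any column that was listed twice. The sketch below is a hedged illustration: `build_adql` is a hypothetical helper, not part of the Exoplanet Archive API, and the URL is only assembled, never requested.

```python
from urllib.parse import urlencode

TAP_URL = "https://exoplanetarchive.ipac.caltech.edu/TAP/sync"

def build_adql(columns, table):
    """Return a SELECT statement with duplicate column names removed, order preserved."""
    unique_columns = list(dict.fromkeys(columns))
    return "SELECT " + ", ".join(unique_columns) + " FROM " + table

# A shortened column list with one deliberate repeat (st_rad).
columns = ["pl_name", "hostname", "st_rad", "st_mass", "st_teff", "st_rad"]
query = build_adql(columns, "PSCompPars")
print(query)

# Full request URL, built without sending anything over the network.
request_url = TAP_URL + "?" + urlencode({"query": query, "format": "csv"})
print(request_url)
```

The resulting `query` string can be passed as the `"query"` entry of `params` exactly as in the cell above.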


The full PSCompPars table has over 355 columns! These are all features we could analyze in a related project on exoplanet exploration and the methods used for it.

After reviewing the column descriptions (as defined here: <https://exoplanetarchive.ipac.caltech.edu/docs/API_PS_columns.html>), the following features are relevant to our exploration:
1. pl_controv_flag (is the confirmation of this planet questioned?)
2. pl_name
3. hostname (most common star name)
4. pl_letter
5. sy_snum
6. discoverymethod
7. disc_year
8. pl_radj
9. pl_massj
10. st_spectype
11. st_rad
12. st_mass
13. st_met
14. st_lum
15. st_teff
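Before relying on these columns, it can help to check how complete each one is. The sketch below computes the fraction of missing values per column on a small stand-in DataFrame; the rows are illustrative rather than archive data, and in the notebook the same expression would presumably be applied to `pscp`.

```python
import pandas as pd

# Toy stand-in for a few of the selected columns
# (planet names and values are illustrative, not archive data).
sample = pd.DataFrame({
    "pl_name": ["Planet A b", "Planet B c", "Planet C d", "Planet D e"],
    "pl_radj": [0.9, None, 0.3, None],
    "st_spectype": [None, None, None, "G2 V"],
})

# Fraction of missing values per column, largest first.
missing_fraction = sample.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```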

As we will see later, it is important to get an overview of all of the stars observed thus far, to draw a picture of which exoplanet host (stellar) data is accessible to us versus the stars that exist in the observable universe. This is done as a means to detect bias from:
1. **Educated Assumptions:** stars we choose to observe,
2. **Technical Limitations:** stars that are easier to observe, and
3. **Physical Stellar Characteristics:** stars that tend to have more planets

As such, the following dataset was appended to the 

In [6]:
# TAP base URL (Kepler Stellar Table)
url = "https://exoplanetarchive.ipac.caltech.edu/TAP/sync"
# The "Kepler Stellar Table" database is encoded as "keplerstellar" within the Exoplanet Archive
ADQL_query =    "SELECT " \
                "kepid, tm_designation, teff, feh, radius, mass, dens, " \
                "nconfp, nkoi, ntce " \
                "FROM keplerstellar"

# Request data as CSV
params = {
    "query": ADQL_query,
    "format": "csv"
}
response = requests.get(url, params=params)

# Load into "Kepler Stellar" DataFrame
ks = pd.read_csv(io.StringIO(response.text))
print("Data loaded successfully. Number of records:", len(ks))

# Save raw data as csv
file_name = 'ks_01_raw.csv'

try:
    ks.to_csv('../data/' + file_name, index=False)
    print('Data saved as \'' + file_name + '\'')
except Exception as e:
    print('Data failed to save as \'' + file_name + '\': ' + str(e))

Data loaded successfully. Number of records: 990244
Data saved as 'ks_01_raw.csv'
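With both tables loaded, one simple way to look for the selection biases described earlier is to compare how host stars and the wider stellar sample are distributed over a stellar property such as effective temperature. The sketch below is a minimal illustration under stated assumptions: `bin_fractions` is a hypothetical helper, the temperatures and bin edges are made up, and in the notebook the inputs would presumably be `pscp['st_teff']` and `ks['teff']` with missing values dropped.

```python
from bisect import bisect_right
from collections import Counter

def bin_fractions(values, edges):
    """Return the fraction of values falling in each [edges[i], edges[i+1]) bin."""
    counts = Counter()
    for value in values:
        index = bisect_right(edges, value) - 1
        if 0 <= index < len(edges) - 1:  # ignore values outside the bin range
            counts[index] += 1
    total = sum(counts.values())
    return [counts[i] / total for i in range(len(edges) - 1)]

# Illustrative effective temperatures in kelvin (not archive data).
host_teff = [5200, 5600, 5800, 6100]
field_teff = [3500, 3900, 4800, 5500, 5900, 6300, 7200, 4100]
edges = [3000, 5000, 6000, 8000]

host = bin_fractions(host_teff, edges)
field = bin_fractions(field_teff, edges)
for i, (h, f) in enumerate(zip(host, field)):
    print(f"{edges[i]}-{edges[i + 1]} K: hosts {h:.2f}, all stars {f:.2f}")
```

A bin where the host fraction differs strongly from the overall fraction is a hint, not proof, that one of the three bias sources is at work.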
