# Authors
\<Author Names\>

# Data Domain

The data set describes exoplanets and comes from the NASA Exoplanet Archive. It includes various attributes of known exoplanets, such as their names, host stars, orbital characteristics, and physical parameters like mass and radius. The table is designed for researchers and enthusiasts to analyze and compare the properties of these distant worlds.

The first 88 rows of the CSV file contain full descriptions of the columns' shorthand names. These rows must be removed (or skipped) before pandas can parse the file.
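The description rows can also be kept as a lookup table from shorthand name to full description. The following is a minimal sketch of that idea; the `# COLUMN <name>: <description>` layout and the sample lines are assumptions made for illustration, not text copied from the actual file.

```python
# Hypothetical description lines; the "# COLUMN name: description" layout is assumed.
desc_lines = [
    "# COLUMN pl_name:        Planet Name",
    "# COLUMN hostname:       Host Name",
    "# This line has no column description",
]

column_descriptions = {}
for line in desc_lines:
    body = line.lstrip("#").strip()
    # Only keep lines that follow the assumed "COLUMN name: description" pattern
    if body.startswith("COLUMN") and ":" in body:
        short_name, description = body[len("COLUMN"):].split(":", 1)
        column_descriptions[short_name.strip()] = description.strip()

print(column_descriptions)
```

A dictionary like this could later be passed to `DataFrame.rename(columns=...)` if descriptive column names are preferred.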

# Importing Libraries

The following cell imports all libraries that will be used for this notebook.

In [10]:
import numpy as np
import scipy as sp
import pandas as pd
# IMPORTANT: Allows this notebook to get data if on Colab.
# Comment out if running locally.
from google.colab import drive

# Data Preprocessing

# Gathering the Data

The following code cell will import the data from the CSV file and put it in a Pandas `DataFrame`.

Here, the `pd.read_csv` function from the pandas library reads the CSV file that currently holds the data. The `comment` parameter tells the function which character marks a comment; with `comment='#'`, everything from a `#` to the end of the line is ignored, which prevents errors and keeps garbage out of the `DataFrame` `df_data`.
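To illustrate the effect of `comment`, here is a small sketch that parses an in-memory CSV instead of the real file. The sample text and the names `sample_csv` and `df_sample` are made up for this example.

```python
import io

import pandas as pd

# Hypothetical miniature file: comment lines first, then the header and data.
sample_csv = io.StringIO(
    "# COLUMN pl_name: Planet Name\n"
    "# COLUMN sy_pnum: Number of Planets\n"
    "pl_name,sy_pnum\n"
    "11 Com b,1\n"
    "14 Her b,2\n"
)

# Lines that start with '#' are skipped, so the first remaining line is the header.
df_sample = pd.read_csv(sample_csv, comment="#")
print(df_sample)
```

One caveat: `comment` also cuts off a data line at the first `#` it contains, so this approach assumes no data value includes that character.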

In [12]:
# Path to the file. If on Colab, use the appropriate Colab package to import
#  the data.
path = "/PSCompPars_2024.09.17_17.11.11.csv"

# IMPORTANT: If running on Google Colab, uncomment this line and use instead.
#  You will need to enable permissions on your account to use.
# If you are doing this locally, please use the file path above instead.
drive.mount('/content/drive')
#path = "/content/drive/My Drive/School/SacStateClasses/CSC177/CSC177_GroupFolder/1_DataPreprocessing/PSCompPars_2024.09.17_17.11.11.csv"

df_data = pd.read_csv(path, comment='#')


print("Display information about the DataFrame:")
print(df_data.dtypes)
print("\nShow contents of the data:")
df_data

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Display information about the DataFrame:
pl_name             object
hostname            object
sy_snum              int64
sy_pnum              int64
discoverymethod     object
                    ...   
sy_kmagerr1        float64
sy_kmagerr2        float64
sy_gaiamag         float64
sy_gaiamagerr1     float64
sy_gaiamagerr2     float64
Length: 84, dtype: object

Show contents of the data:


Unnamed: 0,pl_name,hostname,sy_snum,sy_pnum,discoverymethod,disc_year,disc_facility,pl_controv_flag,pl_orbper,pl_orbpererr1,...,sy_disterr2,sy_vmag,sy_vmagerr1,sy_vmagerr2,sy_kmag,sy_kmagerr1,sy_kmagerr2,sy_gaiamag,sy_gaiamagerr1,sy_gaiamagerr2
0,11 Com b,11 Com,2,1,Radial Velocity,2007,Xinglong Station,0,323.210000,0.060000,...,-1.92380,4.72307,0.023,-0.023,2.282,0.346,-0.346,4.44038,0.003848,-0.003848
1,11 UMi b,11 UMi,1,1,Radial Velocity,2009,Thueringer Landessternwarte Tautenburg,0,516.219970,3.200000,...,-1.97650,5.01300,0.005,-0.005,1.939,0.270,-0.270,4.56216,0.003903,-0.003903
2,14 And b,14 And,1,1,Radial Velocity,2008,Okayama Astrophysical Observatory,0,186.760000,0.110000,...,-0.71400,5.23133,0.023,-0.023,2.331,0.240,-0.240,4.91781,0.002826,-0.002826
3,14 Her b,14 Her,1,2,Radial Velocity,2002,W. M. Keck Observatory,0,1765.038900,1.677090,...,-0.00730,6.61935,0.023,-0.023,4.714,0.016,-0.016,6.38300,0.000351,-0.000351
4,16 Cyg B b,16 Cyg B,3,1,Radial Velocity,1996,Multiple Observatories,0,798.500000,1.000000,...,-0.01110,6.21500,0.016,-0.016,4.651,0.016,-0.016,6.06428,0.000603,-0.000603
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5751,ups And b,ups And,2,3,Radial Velocity,1996,Lick Observatory,0,4.617033,0.000023,...,-0.06290,4.09565,0.023,-0.023,2.859,0.274,-0.274,3.98687,0.008937,-0.008937
5752,ups And c,ups And,2,3,Radial Velocity,1999,Multiple Observatories,0,241.258000,0.064000,...,-0.06290,4.09565,0.023,-0.023,2.859,0.274,-0.274,3.98687,0.008937,-0.008937
5753,ups And d,ups And,2,3,Radial Velocity,1999,Multiple Observatories,0,1276.460000,0.570000,...,-0.06290,4.09565,0.023,-0.023,2.859,0.274,-0.274,3.98687,0.008937,-0.008937
5754,ups Leo b,ups Leo,1,1,Radial Velocity,2021,Okayama Astrophysical Observatory,0,385.200000,2.800000,...,-0.89630,4.30490,0.023,-0.023,2.184,0.248,-0.248,4.03040,0.008513,-0.008513


# Preprocessing the Data

## Remove Duplicate Data

The first operation will be to remove any possible duplicate data from the `DataFrame`.

In [9]:
# Code for duplicate data.
dups = df_data.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
# Uncomment this line if the data has any duplicates
#df_data = df_data.drop_duplicates()

Number of duplicate rows = 0
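Since the data set contains no duplicates, the effect of the commented-out `drop_duplicates` call is not visible here. As a rough sketch, this is how the same two calls behave on a toy frame; `df_dup_demo` and its values are invented for illustration.

```python
import pandas as pd

# Toy frame in which the second row repeats the first one exactly.
df_dup_demo = pd.DataFrame({
    "pl_name": ["11 Com b", "11 Com b", "14 Her b"],
    "sy_pnum": [1, 1, 2],
})

# duplicated() marks every row that repeats an earlier row.
dup_mask = df_dup_demo.duplicated()
n_dups = int(dup_mask.sum())
print('Number of duplicate rows = %d' % n_dups)

# drop_duplicates() keeps the first occurrence of each row.
df_unique = df_dup_demo.drop_duplicates().reset_index(drop=True)
print(df_unique)
```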


## Fill In Missing Values

The next operation is to fill in any missing data values. In several columns the uncertainty is null; we replace those entries with the column's mean uncertainty. This takes two steps:

1. Replace any missing data points with `np.nan`.
2. Replace `np.nan` with the *mean* of that attribute.
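The two steps above can be sketched on a toy frame as follows. The frame `df_missing_demo` and its values are hypothetical. Note that `pd.read_csv` with default settings already parses empty fields as NaN, so step 1 mainly matters for data that did not pass through that parser.

```python
import numpy as np
import pandas as pd

# Toy frame with raw text values; the empty string stands for a missing entry.
df_missing_demo = pd.DataFrame({
    "pl_name": ["a", "b", "c"],
    "pl_orbper": ["10.0", "", "30.0"],
})

# Step 1: turn empty strings into NaN, then make the column numeric.
df_missing_demo = df_missing_demo.replace("", np.nan)
df_missing_demo["pl_orbper"] = pd.to_numeric(df_missing_demo["pl_orbper"])

# Step 2: fill each numeric column's NaN values with that column's mean.
column_means = df_missing_demo.mean(numeric_only=True)
df_filled = df_missing_demo.fillna(column_means)
print(df_filled)
```

Mean imputation keeps every row but shrinks the variance of the filled columns, which is worth keeping in mind for later analysis.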

In [6]:
# This will go through the entire DataFrame and replace any empty attributes
#  with NaN using np.nan
df_data = df_data.replace('', np.nan)
df_data


Unnamed: 0,pl_name,hostname,sy_snum,sy_pnum,discoverymethod,disc_year,disc_facility,pl_controv_flag,pl_orbper,pl_orbpererr1,...,sy_disterr2,sy_vmag,sy_vmagerr1,sy_vmagerr2,sy_kmag,sy_kmagerr1,sy_kmagerr2,sy_gaiamag,sy_gaiamagerr1,sy_gaiamagerr2
0,11 Com b,11 Com,2,1,Radial Velocity,2007,Xinglong Station,0,323.210000,0.060000,...,-1.92380,4.72307,0.023,-0.023,2.282,0.346,-0.346,4.44038,0.003848,-0.003848
1,11 UMi b,11 UMi,1,1,Radial Velocity,2009,Thueringer Landessternwarte Tautenburg,0,516.219970,3.200000,...,-1.97650,5.01300,0.005,-0.005,1.939,0.270,-0.270,4.56216,0.003903,-0.003903
2,14 And b,14 And,1,1,Radial Velocity,2008,Okayama Astrophysical Observatory,0,186.760000,0.110000,...,-0.71400,5.23133,0.023,-0.023,2.331,0.240,-0.240,4.91781,0.002826,-0.002826
3,14 Her b,14 Her,1,2,Radial Velocity,2002,W. M. Keck Observatory,0,1765.038900,1.677090,...,-0.00730,6.61935,0.023,-0.023,4.714,0.016,-0.016,6.38300,0.000351,-0.000351
4,16 Cyg B b,16 Cyg B,3,1,Radial Velocity,1996,Multiple Observatories,0,798.500000,1.000000,...,-0.01110,6.21500,0.016,-0.016,4.651,0.016,-0.016,6.06428,0.000603,-0.000603
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5751,ups And b,ups And,2,3,Radial Velocity,1996,Lick Observatory,0,4.617033,0.000023,...,-0.06290,4.09565,0.023,-0.023,2.859,0.274,-0.274,3.98687,0.008937,-0.008937
5752,ups And c,ups And,2,3,Radial Velocity,1999,Multiple Observatories,0,241.258000,0.064000,...,-0.06290,4.09565,0.023,-0.023,2.859,0.274,-0.274,3.98687,0.008937,-0.008937
5753,ups And d,ups And,2,3,Radial Velocity,1999,Multiple Observatories,0,1276.460000,0.570000,...,-0.06290,4.09565,0.023,-0.023,2.859,0.274,-0.274,3.98687,0.008937,-0.008937
5754,ups Leo b,ups Leo,1,1,Radial Velocity,2021,Okayama Astrophysical Observatory,0,385.200000,2.800000,...,-0.89630,4.30490,0.023,-0.023,2.184,0.248,-0.248,4.03040,0.008513,-0.008513


After replacing missing attribute values with NaN, the next step is to replace those NaN values with the mean of each feature.

In [7]:
sr_mean_vals = df_data.mean(numeric_only=True) # Series that contains the mean
df_data.fillna(sr_mean_vals, inplace=True)
print(f"df_data shape: {df_data.shape}")
df_data

df_data shape: (5756, 84)


Unnamed: 0,pl_name,hostname,sy_snum,sy_pnum,discoverymethod,disc_year,disc_facility,pl_controv_flag,pl_orbper,pl_orbpererr1,...,sy_disterr2,sy_vmag,sy_vmagerr1,sy_vmagerr2,sy_kmag,sy_kmagerr1,sy_kmagerr2,sy_gaiamag,sy_gaiamagerr1,sy_gaiamagerr2
0,11 Com b,11 Com,2,1,Radial Velocity,2007,Xinglong Station,0,323.210000,0.060000,...,-1.92380,4.72307,0.023,-0.023,2.282,0.346,-0.346,4.44038,0.003848,-0.003848
1,11 UMi b,11 UMi,1,1,Radial Velocity,2009,Thueringer Landessternwarte Tautenburg,0,516.219970,3.200000,...,-1.97650,5.01300,0.005,-0.005,1.939,0.270,-0.270,4.56216,0.003903,-0.003903
2,14 And b,14 And,1,1,Radial Velocity,2008,Okayama Astrophysical Observatory,0,186.760000,0.110000,...,-0.71400,5.23133,0.023,-0.023,2.331,0.240,-0.240,4.91781,0.002826,-0.002826
3,14 Her b,14 Her,1,2,Radial Velocity,2002,W. M. Keck Observatory,0,1765.038900,1.677090,...,-0.00730,6.61935,0.023,-0.023,4.714,0.016,-0.016,6.38300,0.000351,-0.000351
4,16 Cyg B b,16 Cyg B,3,1,Radial Velocity,1996,Multiple Observatories,0,798.500000,1.000000,...,-0.01110,6.21500,0.016,-0.016,4.651,0.016,-0.016,6.06428,0.000603,-0.000603
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5751,ups And b,ups And,2,3,Radial Velocity,1996,Lick Observatory,0,4.617033,0.000023,...,-0.06290,4.09565,0.023,-0.023,2.859,0.274,-0.274,3.98687,0.008937,-0.008937
5752,ups And c,ups And,2,3,Radial Velocity,1999,Multiple Observatories,0,241.258000,0.064000,...,-0.06290,4.09565,0.023,-0.023,2.859,0.274,-0.274,3.98687,0.008937,-0.008937
5753,ups And d,ups And,2,3,Radial Velocity,1999,Multiple Observatories,0,1276.460000,0.570000,...,-0.06290,4.09565,0.023,-0.023,2.859,0.274,-0.274,3.98687,0.008937,-0.008937
5754,ups Leo b,ups Leo,1,1,Radial Velocity,2021,Okayama Astrophysical Observatory,0,385.200000,2.800000,...,-0.89630,4.30490,0.023,-0.023,2.184,0.248,-0.248,4.03040,0.008513,-0.008513


In [None]:
import csv
import pandas as pd

localpath = r"./PSCompPars_2024.09.17_17.11.11.csv"

# Read the data
data = []
with open(localpath, "r") as file:
    reader = csv.reader(file)
    for row in reader:
        data.append(row)

# Remove the first 88 rows
data2 = data[88:]

# Convert the data to a DataFrame
df = pd.DataFrame(data2[1:], columns=data2[0])

# Take rows 4 through 87 of the file (the column descriptions) and join the
#  fields of each row back into a single string
header = [" ".join(row) for row in data[3:87]]

# Remove all text up to and including the first ":"
header = [h.split(":", 1)[1].strip() if ":" in h else h for h in header]

# Replace the shorthand column names with the descriptive header
df.columns = header

# Print the first 5 rows of the updated DataFrame to verify
print(df.head())

FileNotFoundError: [Errno 2] No such file or directory: './PSCompPars_2024.09.17_17.11.11.csv'

In several columns the uncertainty is null; we replace those entries with the column's mean uncertainty.
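The same idea can be written in a more compact, vectorized form. The sketch below uses a toy frame, `df_unc_demo`, whose column names only imitate the descriptive headers; unlike a per-column loop, it records which cells were missing across all uncertainty columns before filling them.

```python
import pandas as pd

# Toy frame with text values; empty strings stand for missing uncertainties.
df_unc_demo = pd.DataFrame({
    "Planet Name": ["a", "b", "c"],
    "Orbital Period Upper Unc. [days]": ["0.5", "", "1.5"],
    "Orbital Period Lower Unc. [days]": ["-0.5", "-1.5", ""],
})

# Select the uncertainty columns and convert them to numbers (bad values -> NaN).
unc_cols = [c for c in df_unc_demo.columns if "Unc" in c]
df_unc_demo[unc_cols] = df_unc_demo[unc_cols].apply(pd.to_numeric, errors="coerce")

# Remember which cells were missing, then fill each column with its own mean.
was_missing = df_unc_demo[unc_cols].isna()
df_unc_demo[unc_cols] = df_unc_demo[unc_cols].fillna(df_unc_demo[unc_cols].mean())

# Show only the rows in which at least one uncertainty was filled in.
rows_filled = df_unc_demo.loc[was_missing.any(axis=1), ["Planet Name"] + unc_cols]
print(rows_filled)
```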

In [None]:
# Identify columns with "Unc" in their names
unc_columns = [col for col in df.columns if "Unc" in col]

# Calculate the means of those columns and replace empty cells with their means
for col in unc_columns:
    # Convert to numeric, forcing errors to NaN
    df[col] = pd.to_numeric(df[col], errors="coerce")

    # Calculate the mean
    mean_value = df[col].mean()
    #print(f"Mean of {col}: {mean_value}")

    # Check for NaN values before filling
    nan_count_before = df[col].isna().sum()
    #print(f"Number of NaN values in {col} before filling: {nan_count_before}")

    # Replace NaN values with the mean
    nan_indices = df[col].index[df[col].isna()].tolist()
    df[col] = df[col].fillna(mean_value)

    # Ensure no silent downcasting
    df[col] = df[col].infer_objects(copy=False)

    # Check for NaN values after filling
    nan_count_after = df[col].isna().sum()
    #print(f"Number of NaN values in {col} after filling: {nan_count_after}")

# Print the rows that were NaN in the last column processed (nan_indices is
#  overwritten on every loop iteration), showing the planet name and the
#  "Unc" columns
if nan_indices:
    print(df.loc[nan_indices, ['Planet Name'] + unc_columns])
else:
    print(f"No NaN values were found in {col} to fill.")

                        Planet Name  Orbital Period Upper Unc. [days]  \
12     2MASS J01033563-5515561 AB b                      96120.774305   
84         CFBDSIR J145829+101343 b                       2737.500000   
85                 CFHTWIR-Oph 98 b                     470000.000000   
135                      DMPP-3 A b                          0.001100   
146               EPIC 201757695.02                          0.000100   
...                             ...                               ...   
5686  WISEP J121756.91+162640.2 A b                      96120.774305   
5710                      alf Ari b                          0.300000   
5711                      alf Tau b                          0.900000   
5716                      bet UMi b                          2.700000   
5725                     gam1 Leo b                          1.250000   

      Orbital Period Lower Unc. [days]  Orbit Semi-Major Axis Upper Unc. [au]  \
12                       -21952.508592    