The data comes from [The Extrasolar Planet Encyclopedia](http://exoplanet.eu/). Thanks to Ilya Marchenko for sharing this dataset on [Kaggle](https://www.kaggle.com/ilyamarchenko/full-exoplanet-catalog?select=exoplanet_confirm_and_candidates.csv).

In [None]:
from google.colab import drive
import pandas as pd
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
exo_full_dataset = pd.read_csv('/content/drive/My Drive/exoplanets.csv')
exo_full_dataset.head()

Unnamed: 0,# name,planet_status,mass,mass_error_min,mass_error_max,mass_sini,mass_sini_error_min,mass_sini_error_max,radius,radius_error_min,...,star_sp_type,star_age,star_age_error_min,star_age_error_max,star_teff,star_teff_error_min,star_teff_error_max,star_detected_disc,star_magnetic_field,star_alternate_names
0,11 Com b,Confirmed,,,,19.4,1.5,1.5,,,...,G8 III,,,,4742.0,100.0,100.0,,,
1,11 Oph b,Confirmed,21.0,3.0,3.0,,,,,,...,M9,0.011,0.002,0.002,2375.0,175.0,175.0,,,"Oph 1622-2405, Oph 11A"
2,11 UMi b,Confirmed,,,,10.5,2.47,2.47,,,...,K4III,1.56,0.54,0.54,4340.0,70.0,70.0,,,
3,11 Uma b,Unconfirmed,3.72,0.82,0.82,,,,,,...,K5III,,,,4090.0,70.0,70.0,,,
4,14 And b,Confirmed,,,,5.33,0.57,0.57,,,...,K0III,,,,4813.0,20.0,20.0,,,


In [None]:
exo = exo_full_dataset.loc[:, ['radius', 'mass', 'planet_status', 'orbital_period', 'star_distance']] 

In [None]:
print("\nUnique values\n",exo.nunique())
print("\nNull values\n\n", exo.isna().sum())


Unique values
 radius            1572
mass              1057
planet_status        5
orbital_period    7050
star_distance     2532
dtype: int64

Null values

 radius            1504
mass              5815
planet_status        0
orbital_period     335
star_distance     2665
dtype: int64


## Create dummy example data

We'll demonstrate each technique first on the simple DataFrame created below, then on the more realistic exoplanet CSV file.

In [None]:
import numpy as np

alien_species = {"alien_height":[80, 63, 70, 93, np.nan], "alien_age":[12, np.nan, 87, 415, 892], "home_planet":["Mars", "Jupiter", "Europa", "Mars", "Europa"]}

alien_df = pd.DataFrame(alien_species)
alien_df.head()

Unnamed: 0,alien_height,alien_age,home_planet
0,80.0,12.0,Mars
1,63.0,,Jupiter
2,70.0,87.0,Europa
3,93.0,415.0,Mars
4,,892.0,Europa


## Imputation
First the simple DataFrame.

### `alien_df` Example

In [None]:
from sklearn.impute import SimpleImputer
features = alien_df.loc[:, ["alien_height", "alien_age"]]
print(features.head(), "\n")
imp = SimpleImputer()
imp.fit(features)
imputed = imp.transform(features)

# The rest of this code block only reformats the data so it prints clearly. Don't sweat it!
# scikit-learn returns a plain NumPy array, which drops the column headers, so add them back like so:
imputed_alien_df = pd.DataFrame(imputed,columns=features.columns)
print(imputed_alien_df.head())
# adding back the categorical data
imputed_alien_df["home_planet"] = alien_df["home_planet"]

   alien_height  alien_age
0          80.0       12.0
1          63.0        NaN
2          70.0       87.0
3          93.0      415.0
4           NaN      892.0 

   alien_height  alien_age
0          80.0       12.0
1          63.0      351.5
2          70.0       87.0
3          93.0      415.0
4          76.5      892.0
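
The filled-in values above (76.5 and 351.5) are the column means, because `SimpleImputer()` uses `strategy="mean"` unless told otherwise. The sketch below rebuilds the same two columns so it runs on its own, inspects the learned fill values through `statistics_`, and tries `strategy="median"` as one alternative. The names `mean_imp`, `median_imp`, and `median_filled` are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same numeric columns as alien_df, rebuilt here so the sketch is self-contained.
features = pd.DataFrame({
    "alien_height": [80, 63, 70, 93, np.nan],
    "alien_age": [12, np.nan, 87, 415, 892],
})

# Default behaviour: each NaN is replaced by the mean of its column.
mean_imp = SimpleImputer(strategy="mean")
mean_imp.fit(features)
print("Column means used for filling:", mean_imp.statistics_)

# Alternative: the median is less sensitive to extreme values.
median_imp = SimpleImputer(strategy="median")
median_filled = pd.DataFrame(
    median_imp.fit_transform(features),
    columns=features.columns,  # restore the headers dropped by scikit-learn
)
print(median_filled)
```

Which strategy is appropriate depends on the data. For a skewed column, the median may be a more representative fill value than the mean.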


Now let's perform imputation on the exoplanets dataset.

### Exoplanets Example

In [None]:
exo_numbers = exo.loc[:, ['radius', 'mass', 'orbital_period', 'star_distance']]
print(exo.head())
print("\nNull values\n\n", exo.isna().sum(), "\n")
imp = SimpleImputer()
imp.fit(exo_numbers)
imputed = imp.transform(exo_numbers)
# reformatting the imputed data below
imputed_exo_df = pd.DataFrame(imputed, columns=exo_numbers.columns)
# adding back in the categorical data
imputed_exo_df["planet_status"] = exo["planet_status"]
print(imputed_exo_df.head())
print("\nNull values\n\n", imputed_exo_df.isna().sum())

   radius   mass planet_status  orbital_period  star_distance
0     NaN    NaN     Confirmed          326.03          110.6
1     NaN  21.00     Confirmed       730000.00          145.0
2     NaN    NaN     Confirmed          516.22          119.5
3     NaN   3.72   Unconfirmed          651.90           31.6
4     NaN    NaN     Confirmed          185.84           76.4

Null values

 radius            1504
mass              5815
planet_status        0
orbital_period     335
star_distance     2665
dtype: int64 

     radius       mass  orbital_period  star_distance planet_status
0  5.946769   6.872148          326.03          110.6     Confirmed
1  5.946769  21.000000       730000.00          145.0     Confirmed
2  5.946769   6.872148          516.22          119.5     Confirmed
3  5.946769   3.720000          651.90           31.6   Unconfirmed
4  5.946769   6.872148          185.84           76.4     Confirmed

Null values

 radius            0
mass              0
orbital_period    0


## One-Hot Encoding
First the `alien_df` data.

### `alien_df` Example

In [None]:
enc_alien_df = pd.get_dummies(imputed_alien_df)

print(imputed_alien_df.head(), "\n")
print(enc_alien_df.head())

   alien_height  alien_age home_planet
0          80.0       12.0        Mars
1          63.0      351.5     Jupiter
2          70.0       87.0      Europa
3          93.0      415.0        Mars
4          76.5      892.0      Europa 

   alien_height  alien_age  home_planet_Europa  home_planet_Jupiter  \
0          80.0       12.0                   0                    0   
1          63.0      351.5                   0                    1   
2          70.0       87.0                   1                    0   
3          93.0      415.0                   0                    0   
4          76.5      892.0                   1                    0   

   home_planet_Mars  
0                 1  
1                 0  
2                 0  
3                 1  
4                 0  
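
A note on the output above: from pandas 2.0 onward, `pd.get_dummies` returns `True`/`False` columns by default rather than 0/1. The sketch below assumes you want the integer form shown here, so it passes `dtype=int`. It also passes `columns=` to state explicitly which column gets encoded. The DataFrame `df` and the name `encoded` are stand-ins for this illustration, rebuilt from the imputed values printed above.

```python
import pandas as pd

# Stand-in for imputed_alien_df, rebuilt so the sketch is self-contained.
df = pd.DataFrame({
    "alien_height": [80.0, 63.0, 70.0, 93.0, 76.5],
    "alien_age": [12.0, 351.5, 87.0, 415.0, 892.0],
    "home_planet": ["Mars", "Jupiter", "Europa", "Mars", "Europa"],
})

# Encode only the named column and ask for 0/1 integers instead of booleans.
encoded = pd.get_dummies(df, columns=["home_planet"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one 1 across the three `home_planet_*` columns, which is what "one-hot" refers to.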


Now the exoplanet dataset.

### Exoplanets Example

In [None]:
print(imputed_exo_df.loc[:, "planet_status"].unique())
imputed_exo_df.head()

In [None]:
enc_exo_df = pd.get_dummies(imputed_exo_df)
print(imputed_exo_df.head(), "\n")
print(enc_exo_df.head())

     radius       mass  orbital_period  star_distance planet_status
0  5.946769   6.872148          326.03          110.6     Confirmed
1  5.946769  21.000000       730000.00          145.0     Confirmed
2  5.946769   6.872148          516.22          119.5     Confirmed
3  5.946769   3.720000          651.90           31.6   Unconfirmed
4  5.946769   6.872148          185.84           76.4     Confirmed 

     radius       mass  orbital_period  star_distance  \
0  5.946769   6.872148          326.03          110.6   
1  5.946769  21.000000       730000.00          145.0   
2  5.946769   6.872148          516.22          119.5   
3  5.946769   3.720000          651.90           31.6   
4  5.946769   6.872148          185.84           76.4   

   planet_status_Candidate  planet_status_Confirmed  \
0                        0                        1   
1                        0                        1   
2                        0                        1   
3                        0 

## Further Practice
If you want to sharpen your skills, try feature scaling the exoplanet data below. Then you'd have a wholly preprocessed dataset!

(Since the one-hot encoded columns contain only 0s and 1s, they are already on a small, bounded scale, so you don't need to worry about scaling them.)

In [None]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# remember how this goes?
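
If you'd like a hint before trying it on the exoplanet data, here is a minimal sketch on the toy alien data instead. It assumes standardization with `StandardScaler` (one of several scalers in scikit-learn) and an arbitrary 60/40 split. The names `toy`, `numeric_cols`, `train`, and `test` are made up for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: already imputed, then one-hot encoded as 0/1 integers.
toy = pd.get_dummies(
    pd.DataFrame({
        "alien_height": [80.0, 63.0, 70.0, 93.0, 76.5],
        "alien_age": [12.0, 351.5, 87.0, 415.0, 892.0],
        "home_planet": ["Mars", "Jupiter", "Europa", "Mars", "Europa"],
    }),
    columns=["home_planet"],
    dtype=int,
)

# Only the continuous columns are scaled; the 0/1 columns are left alone.
numeric_cols = ["alien_height", "alien_age"]

# Split first so the scaler never sees the test rows.
train, test = train_test_split(toy, test_size=0.4, random_state=0)
train, test = train.copy(), test.copy()

scaler = StandardScaler()
train[numeric_cols] = scaler.fit_transform(train[numeric_cols])  # learn mean/std on train
test[numeric_cols] = scaler.transform(test[numeric_cols])        # reuse them on test

print(train)
print(test)
```

Fitting the scaler on the training rows only, then reusing it on the test rows, keeps information about the test set from leaking into preprocessing.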