# Exoplanet Ranking
Data discovery, preprocessing, similarity computation and clustering.

**!! What is actually an exoplanet?**

## 1. Data Discovery
> To identify the important data, it is necessary to go through the dataset and check what the columns stand for and which of them are useful for further processing. In this stage we will also fill in the information about the Earth (the planet with rowid 0), since the values currently stored for it are "random".

> Some data was already removed in the very first processing step (e.g. links, duplicated values, irrelevant data). Original dataset: `data/data_original.csv`

In [56]:
# imports
import pandas as pd

In [57]:
# loading data
df_origin = pd.read_csv("../data/data_preprocess.csv", index_col="rowid")
print("Number of:\n\trows = {}\n\tcolumns = {}".format(df_origin.shape[0], df_origin.shape[1]))
display(df_origin.head())
display(df_origin.info())

Number of:
	rows = 4056
	columns = 22


| rowid | fpl_orbper | fpl_smax | fpl_eccen | fpl_bmasse | fpl_rade | fpl_dens | fpl_tranflag | fpl_cbflag | fpl_snum | dec | ... | fst_spt | fst_teff | fst_logg | fst_lum | fst_mass | fst_rad | fst_met | fst_metratio | fst_age | simil |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000.0 | 3.0 | 0.3 | 1000.0 | 12.4 | 10.0 | 0 | 0 | 1 | 30.0 | ... | K0 III | 4742.0 | 3.0 | 1.0 | 1.5 | 7.0 | -0.01 | [Fe/H] | 0.4 | 84.0188 |
| 1 | 326.03 | 1.29 | 0.231 | 6165.6 | 12.1 | 19.1 | 0 | 0 | 1 | 17.792868 | ... | K0 III | 4742.0 | 2.31 | 2.243 | 2.7 | 19.0 | -0.35 | [Fe/H] | 4.236718 | 39.4383 |
| 2 | 516.21997 | 1.53 | 0.08 | 4684.8142 | 12.3 | 13.8 | 0 | 0 | 1 | 71.823898 | ... | K4 III | 4213.0 | 1.93 | 2.43 | 2.78 | 29.79 | -0.02 | [Fe/H] | 1.56 | 78.3099 |
| 3 | 185.84 | 0.83 | 0.0 | 1525.5 | 12.9 | 3.9 | 0 | 0 | 1 | 39.236198 | ... | G8 III | 4813.0 | 2.63 | 1.763 | 2.2 | 11.0 | -0.24 | [Fe/H] | 4.5 | 79.844 |
| 4 | 1773.40002 | 2.93 | 0.37 | 1481.0878 | 12.9 | 3.79 | 0 | 0 | 1 | 43.817646 | ... | K0 V | 5338.0 | 4.45 | -0.151 | 0.9 | 0.93 | 0.41 | [Fe/H] | 5.24 | 91.1647 |


<class 'pandas.core.frame.DataFrame'>
Int64Index: 4056 entries, 0 to 4055
Data columns (total 22 columns):
fpl_orbper      4056 non-null float64
fpl_smax        4056 non-null float64
fpl_eccen       1553 non-null float64
fpl_bmasse      4056 non-null float64
fpl_rade        4056 non-null float64
fpl_dens        4056 non-null float64
fpl_tranflag    4056 non-null int64
fpl_cbflag      4056 non-null int64
fpl_snum        4056 non-null int64
dec             4056 non-null float64
fst_optmag      4056 non-null float64
fst_nirmag      4056 non-null float64
fst_spt         1366 non-null object
fst_teff        4056 non-null float64
fst_logg        4056 non-null float64
fst_lum         4056 non-null float64
fst_mass        4056 non-null float64
fst_rad         4056 non-null float64
fst_met         4056 non-null float64
fst_metratio    4056 non-null object
fst_age         4056 non-null float64
simil           4056 non-null float64
dtypes: float64(17), int64(3), object(2)
memory usage: 728.8+ KB


None
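The introduction mentions replacing the placeholder values stored for the Earth (rowid 0). A minimal sketch of that step is shown below, assuming the planet columns use days, AU, Earth masses, Earth radii and g/cm^3, and that `fst_teff` is in kelvin. The frame `df_demo` and the dictionary `earth` are hypothetical stand-ins covering only a few columns, and the Earth and Sun values are approximate reference figures.

```python
import pandas as pd

# Tiny stand-in for df_origin with only a few of its columns.
df_demo = pd.DataFrame(
    {
        "fpl_orbper": [1000.0, 326.03],
        "fpl_smax": [3.0, 1.29],
        "fpl_bmasse": [1000.0, 6165.6],
        "fpl_rade": [12.4, 12.1],
        "fpl_dens": [10.0, 19.1],
        "fst_teff": [4742.0, 4742.0],
    },
    index=pd.Index([0, 1], name="rowid"),
)

# Approximate reference values for the Earth and the Sun.
# Assumed units: days, AU, Earth masses, Earth radii, g/cm^3, kelvin.
earth = {
    "fpl_orbper": 365.256,
    "fpl_smax": 1.0,
    "fpl_bmasse": 1.0,
    "fpl_rade": 1.0,
    "fpl_dens": 5.51,
    "fst_teff": 5772.0,
}

# Overwrite only the listed columns of the row with rowid 0.
for column, value in earth.items():
    df_demo.loc[0, column] = value

print(df_demo.loc[0])
```

The remaining columns would need the same treatment, with units checked against the dataset's column definitions before any values are filled in.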

### 1.1 Missing values

In [58]:
# fraction of non-null values in each column
display(df_origin.notnull().mean())

fpl_orbper      1.000000
fpl_smax        1.000000
fpl_eccen       0.382890
fpl_bmasse      1.000000
fpl_rade        1.000000
fpl_dens        1.000000
fpl_tranflag    1.000000
fpl_cbflag      1.000000
fpl_snum        1.000000
dec             1.000000
fst_optmag      1.000000
fst_nirmag      1.000000
fst_spt         0.336785
fst_teff        1.000000
fst_logg        1.000000
fst_lum         1.000000
fst_mass        1.000000
fst_rad         1.000000
fst_met         1.000000
fst_metratio    1.000000
fst_age         1.000000
simil           1.000000
dtype: float64

> Most columns are fully populated, but two columns have missing values: `fpl_eccen` and `fst_spt`. `fpl_eccen` holds the eccentricity of the planet's orbit (the amount by which the orbit deviates from a perfect circle). `fst_spt` holds the spectral type of the star that the planet orbits.
> Since only about 34 % (`fst_spt`) and 38 % (`fpl_eccen`) of the values are valid, it is better to remove these columns than to try to compute the missing values using, for example, k-NN.
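The same decision can be made programmatically instead of naming the columns by hand. The sketch below drops every column whose share of non-null values falls under a threshold. The threshold `MIN_FILLED` and the toy frame `df_demo` are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame: two columns are mostly empty, two are complete.
df_demo = pd.DataFrame(
    {
        "fpl_orbper": [1.0, 2.0, 3.0, 4.0],
        "fpl_eccen": [0.1, np.nan, np.nan, np.nan],
        "fst_spt": ["K0 III", None, None, None],
        "fst_teff": [4742.0, 4213.0, 4813.0, 5338.0],
    }
)

MIN_FILLED = 0.5  # hypothetical threshold: keep columns at least 50 % filled

# Share of non-null values per column.
filled = df_demo.notnull().mean()

# Columns below the threshold are dropped.
sparse_columns = filled[filled < MIN_FILLED].index.tolist()
df_dense = df_demo.drop(columns=sparse_columns)

print(sparse_columns)
print(df_dense.columns.tolist())
```

With a threshold of 0.5, both `fpl_eccen` and `fst_spt` from the real dataset would be dropped as well, since their filled shares are roughly 0.38 and 0.34.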

In [59]:
df_modified = df_origin.drop(labels=["fpl_eccen", "fst_spt"], axis=1)
print(df_modified.columns)

Index(['fpl_orbper', 'fpl_smax', 'fpl_bmasse', 'fpl_rade', 'fpl_dens',
       'fpl_tranflag', 'fpl_cbflag', 'fpl_snum', 'dec', 'fst_optmag',
       'fst_nirmag', 'fst_teff', 'fst_logg', 'fst_lum', 'fst_mass', 'fst_rad',
       'fst_met', 'fst_metratio', 'fst_age', 'simil'],
      dtype='object')


### 1.2 Non-numeric values
> The column `fst_metratio` states which abundance ratio the host star's metallicity value (`fst_met`) refers to. The column has data type `object`.


In [60]:
df_modified["fst_metratio"].unique()

array(['[Fe/H]', '[M/H]', '[m/H]'], dtype=object)

> `Fe/H` denotes iron abundance and `M/H` or `m/H` denotes general metal content.

In [62]:
# replacing all occurrences of "[m/H]" with "[M/H]" and then replacing the categories with integer codes

df_modified.loc[df_modified["fst_metratio"] == "[m/H]", "fst_metratio"] = "[M/H]"
df_modified["fst_metratio"] = df_modified["fst_metratio"].astype("category")
cat_columns = df_modified.select_dtypes("category").columns
df_modified[cat_columns] = df_modified[cat_columns].apply(lambda x: x.cat.codes)

df_modified["fst_metratio"].unique()

array([0, 1])
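After the conversion only the codes 0 and 1 remain, so it can help to record which label each code stands for before the labels are discarded. A minimal sketch on a standalone series follows. The names `ratios` and `code_to_label` are illustrative, and the mapping relies on pandas ordering string categories lexically by default.

```python
import pandas as pd

# Standalone example with the three labels found in fst_metratio.
ratios = pd.Series(["[Fe/H]", "[M/H]", "[m/H]", "[Fe/H]"], name="fst_metratio")

# Unify the spelling, then convert to a categorical type.
ratios = ratios.replace("[m/H]", "[M/H]").astype("category")

# Keep the code -> label mapping before switching to integer codes.
code_to_label = dict(enumerate(ratios.cat.categories))
codes = ratios.cat.codes

print(code_to_label)
print(codes.tolist())
```

Under that ordering, code 0 corresponds to `[Fe/H]` and code 1 to `[M/H]`.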

## 2. Data Preprocessing
> Improving quality of results

## 3. Visualizing
> Visualizing what the data looks like, checking correlations, ...

## 4. Similarity Computation
> Computing similarity between the Earth and every other exoplanet
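The notebook does not yet fix a similarity measure, so the following is only one possible sketch: standardize each column, take the Euclidean distance of every row to the Earth row (rowid 0), and map the distance into the range (0, 1]. The frame `df_demo`, its values, and the transformation `1 / (1 + distance)` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: rowid 0 plays the role of the Earth.
df_demo = pd.DataFrame(
    {
        "fpl_rade": [1.0, 1.1, 12.0],
        "fpl_bmasse": [1.0, 1.3, 300.0],
        "fst_teff": [5772.0, 5600.0, 4700.0],
    },
    index=pd.Index([0, 1, 2], name="rowid"),
)

# Standardize columns so that no single unit dominates the distance.
standardized = (df_demo - df_demo.mean()) / df_demo.std()

# Euclidean distance of every row to the row with rowid 0.
difference = standardized - standardized.loc[0]
distance = np.sqrt((difference ** 2).sum(axis=1))

# Turn distance into a similarity score: 1 for identical rows, approaching 0 for distant ones.
similarity = 1.0 / (1.0 + distance)

print(similarity.sort_values(ascending=False))
```

Other choices, such as cosine similarity or weighting columns by physical relevance, would fit the same structure by swapping the distance step.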

## 5. Clustering
> Are the planets divided into some groups based on their parameters?