# Exoplanet Ranking
Data discovery, preprocessing, similarity computation, and clustering.

**!! What is actually an exoplanet?**

## 1. Data Discovery
> To identify the important data, it is necessary to go through the dataset and check what the columns stand for and which of them are useful for further processing. In this stage we will also fill in the information about the Earth (the planet with rowid 0), since its actual data is "random".

> Some data has already been removed during the very first processing step (e.g. links, duplicated values, irrelevant data). Original dataset: `data/data_original.csv`

In [None]:
# imports
#%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
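# the seaborn styles were renamed to 'seaborn-v0_8-*' in Matplotlib 3.6; fall back to the old name otherwise
style = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use(style)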

In [None]:
# loading data
df_origin = pd.read_csv("../data/data_preprocess.csv", index_col="rowid")
print("Number of:\n\trows = {}\n\tcolumns = {}".format(df_origin.shape[0], df_origin.shape[1]))
display(df_origin.head())
df_origin.info() # prints the summary itself and returns None, so display() is not needed

### 1.1 Missing values

In [None]:
# fraction of valid (non-null) values in each column
display(df_origin.notnull().mean())

> Most of the columns are fully filled, but two columns have missing values: `fpl_eccen` and `fst_spt`. `fpl_eccen` holds the eccentricity of the planet's orbit (the amount by which the orbit deviates from a perfect circle). `fst_spt` holds the spectral type of the star that the planet orbits.
> Since only 33-38 % of their values are valid, it is better to remove these columns than to try to impute the values using, for example, k-NN.
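
> The same decision can be expressed as a threshold on the fraction of valid values. The sketch below is only an illustration on toy data: the helper `drop_sparse_columns` and its `min_valid` parameter are hypothetical names, and the 0.5 threshold is an assumption, not a value taken from this analysis.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, min_valid=0.5):
    """Drop columns whose fraction of non-null values is below `min_valid` (hypothetical helper)."""
    valid_ratio = df.notnull().mean()  # fraction of valid values per column
    sparse = valid_ratio[valid_ratio < min_valid].index
    return df.drop(columns=sparse), valid_ratio

# toy data, not taken from the dataset
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],           # fully filled
    "b": [np.nan, np.nan, np.nan, 1.0],  # 25 % valid
    "c": [1.0, np.nan, 3.0, 4.0],        # 75 % valid
})
cleaned, ratio = drop_sparse_columns(toy, min_valid=0.5)
print(ratio)
print(list(cleaned.columns))
```

> With a threshold like this, the two sparse columns above would be dropped without naming them explicitly.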

In [None]:
df_modified = df_origin.drop(labels=["fpl_eccen", "fst_spt"], axis=1)
print(df_modified.columns)

### 1.2 Non-numeric values
> The column `fst_metratio` indicates which element abundance is reported (the metallicity ratio). Its data type is `object`.


In [None]:
df_modified["fst_metratio"].unique()

> `Fe/H` denotes iron abundance and `M/H` or `m/H` denotes general metal content.

In [None]:
# replace all occurrences of "[m/H]" with "[M/H]", then replace the categories with integer codes

df_modified.loc[df_modified["fst_metratio"] == "[m/H]", "fst_metratio"] = "[M/H]"
df_modified["fst_metratio"] = df_modified["fst_metratio"].astype("category")
cat_columns = df_modified.select_dtypes("category").columns
cont_cols = list(df_modified.select_dtypes(exclude="category").columns) # columns with continuous values
dis_cols = list(cat_columns) # cols with discrete values
df_modified[cat_columns] = df_modified[cat_columns].apply(lambda x: x.cat.codes)

df_modified["fst_metratio"].unique()
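
> To make the integer codes easier to interpret, here is a small sketch of how `cat.codes` assigns them. The values are toy data, not the dataset. Note that a missing value is encoded as `-1`, so it would show up as a separate value in any later count.

```python
import pandas as pd

# toy values, including one missing entry
ratio = pd.Series(["[Fe/H]", "[m/H]", "[M/H]", None, "[Fe/H]"])
ratio = ratio.replace({"[m/H]": "[M/H]"}).astype("category")

mapping = dict(enumerate(ratio.cat.categories))  # code -> original label
codes = ratio.cat.codes                          # missing values become -1

print(mapping)
print(codes.tolist())
```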

### 1.3 Earth Filling-in

In [None]:
def dropColumn(elem, lists):
    """Remove `elem` in place from every list in `lists` that contains it."""
    for l in lists:
        if elem in l:
            l.remove(elem)

In [None]:
display(df_modified.describe())

df_modified.loc[0, "fpl_orbper"] = 365.256363 # orbital period in days
df_modified.loc[0, "fpl_smax"] = 1.000001018 # semi-major axis of the elliptic orbit (AU)
df_modified.loc[0, "fpl_bmasse"] = 1 # mass of the planet (earth unit)
df_modified.loc[0, "fpl_rade"] = 1 # radius (earth unit)
df_modified.loc[0, "fpl_dens"] = 5.51 # density of the planet (g/cm**3)
df_modified.loc[0, "fpl_tranflag"] = 1 # does planet transit the star
df_modified.loc[0, "fpl_cbflag"] = 0 # does planet orbit a binary solar system
df_modified.loc[0, "fpl_snum"] = 1 # number of stars in the solar system
df_modified.loc[0, "dec"] = 23.4 # declination of the planetary system
df_modified.drop(labels=["fst_optmag"], axis=1, inplace=True) # optical magnitude
dropColumn("fst_optmag", [cont_cols, dis_cols])
df_modified.drop(labels=["fst_nirmag"], axis=1, inplace=True) # near-IR magnitude
dropColumn("fst_nirmag", [cont_cols, dis_cols])
df_modified.loc[0, "fst_teff"] = 5772 # effective temperature of the star (the Sun) in kelvins
df_modified.loc[0, "fst_logg"] = 4.437751 # gravitational acceleration at the star surface, log10(cm/s**2); about 2.74e4 cm/s**2 for the Sun
df_modified.loc[0, "fst_lum"] = 0 # star luminosity, log10(luminosity in solar units)
df_modified.loc[0, "fst_mass"] = 1 # stellar mass (sun unit)
df_modified.loc[0, "fst_rad"] = 1 # stellar radius (sun unit)
df_modified.loc[0, "fst_met"] = 0.012 # star metallicity
df_modified.loc[0, "fst_metratio"] = 0 # metallicity ratio code (0 = [Fe/H])
df_modified.loc[0, "fst_age"] = 4.603 # stellar age (in billions of years)
df_modified.drop(labels=["simil"], axis=1, inplace=True) # random similarity (for visualizing purposes)
dropColumn("simil", [cont_cols, dis_cols])

display(df_modified.head())
print("cols with continuous values:", cont_cols)
print("cols with discrete values:", dis_cols)

## 2. General Data Analysis
> Analysing the data: 
* mean and deviation
* distribution
* joint distribution
* ...

In [None]:
df_vis = df_modified.copy()
# cont_cols -- continuous columns
# dis_cols -- discrete columns

### 2.1 Mean, Deviation and Ratio

In [None]:
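# log-scaled mean of each variable with its standard deviation (square root of the variance)
# note: log10 of a non-positive mean is NaN, so such variables are not drawn

mean_log = np.log10(df_vis[cont_cols].mean())
std_log = np.log10(df_vis[cont_cols].std())
y_vals = np.arange(len(mean_log))

fig, ax = plt.subplots(figsize=(15,5))
# errorbar() rejects negative error values, so use the magnitude of the log-scaled deviation
ax.errorbar(mean_log, y_vals, xerr=np.abs(std_log), ls='none', fmt='o')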
ax.set_xlabel("Value")
ax.set_ylabel("Variable")
ax.set_yticks(y_vals)
ax.set_yticklabels(cont_cols)
ax.set_title("Log-scaled means + deviations")
plt.show()

> As the figure above shows, the stellar mass (`fst_mass`) is nearly constant across the data. The orbital period, on the other hand, varies from $ 10^{-2} $ to $ 10^8 $. All values are log-scaled so that they can be compared easily.

In [None]:
for col in dis_cols:
    # number of occurrences of each distinct value in the column
    counts = df_vis[col].value_counts()
    fig, ax = plt.subplots(figsize=(7,7))
    ax.pie(counts, labels=counts.index, autopct='%1.1f%%', pctdistance=0.75, radius=1)
    ax.set_title(col)

print("In fst_metratio (metal abundance): \n\t0 = iron abundance, \n\t1 = general metal abundance")

> It hardly needs emphasizing that iron abundance is the one reported for most of the exoplanets...

### 2.2 Density of Random Variable
> Since there are 16 random variables, only a few of the interesting ones have been chosen.

In [None]:
def displayDensity(data, earth):
    fig, ax = plt.subplots(figsize=(15,5))
    ax.set_title("Density: {}".format(data.name))
    ax.hist(data, density=True, bins=50)
    ax.axvline(earth, color="red", label="Earth", linewidth=3)
    ax.legend() # without this call the "Earth" label is never shown
    
int_dist = ["fpl_rade", "dec", "fst_logg", "fst_age"] # interesting columns
    
for col in int_dist:
    displayDensity(df_vis[col], df_vis.loc[0, col])

## 3. Similarity Computation
> Computing the similarity between the Earth and each exoplanet.
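
> This section does not fix a similarity measure yet. As one possible starting point, the sketch below standardizes every column (z-score) and converts the Euclidean distance to the Earth row into a score in (0, 1]. The function `similarity_to_reference`, the choice of metric, and the toy values are all assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def similarity_to_reference(df, ref_index=0):
    """Score in (0, 1] for every row; 1 means identical to the reference row (hypothetical helper)."""
    std = df.std(ddof=0).replace(0, 1.0)  # avoid division by zero for constant columns
    z = (df - df.mean()) / std            # z-score each column so that units do not dominate
    dist = np.sqrt(((z - z.loc[ref_index]) ** 2).sum(axis=1))  # Euclidean distance to the reference
    return 1.0 / (1.0 + dist)

# toy data: row 0 plays the role of the Earth, the other rows are made up
toy = pd.DataFrame(
    {"fpl_rade": [1.0, 1.1, 11.2], "fpl_orbper": [365.25, 400.0, 4332.6]},
    index=pd.Index([0, 1, 2], name="rowid"),
)
sim = similarity_to_reference(toy)
print(sim.sort_values(ascending=False))
```

> Standardizing first matters here because the columns live on very different scales; without it the orbital period would dominate the distance.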

## 4. Clustering
> Are the planets divided into some groups based on their parameters?
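
> The clustering method is not chosen in this notebook yet. Purely as an illustration of how the question could be approached, here is a minimal k-means sketch in NumPy on synthetic data. The function `kmeans`, the number of clusters, and the data are assumptions; in practice the standardized planet parameters would be used as input.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means (Lloyd's algorithm); returns the labels and the cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random data points as initial centers
    for _ in range(n_iter):
        # distance of every point to every center, then nearest-center assignment
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its points (keep it if the cluster is empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# synthetic data: two well-separated groups of 20 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)), rng.normal(5.0, 0.1, size=(20, 2))])
labels, centers = kmeans(X, k=2)
print(np.bincount(labels, minlength=2))
```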