# Exoplanet Ranking
Data discovery, preprocessing, similarity computation and clustering.

**!! What is actually an exoplanet?**

## 1. Data Discovering
> To detect the important data, it's necessary to go through the dataset and check what the columns stand for and which of them are useful for further processing. In this stage we will also fill in information about the Earth (planet with rowid 0), since its actual data is "random".

> Some data had already been removed in the very first processing step (e.g. links, duplicated values, irrelevant data). Original dataset: `data/data_original.csv`

In [None]:
# imports
#%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial import distance

In [None]:
# loading data
df_origin = pd.read_csv("../data/data_preprocess.csv", index_col="rowid")
print("Number of:\n\trows = {}\n\tcolumns = {}".format(df_origin.shape[0], df_origin.shape[1]))
display(df_origin.head())
df_origin.info()  # info() prints its report itself and returns None

### 1.1 Missing values

In [None]:
# fraction of non-missing values in each column
display(df_origin.notnull().mean())

> Most of the columns are fully filled, but two columns have missing values: `fpl_eccen` and `fst_spt`. `fpl_eccen` holds the eccentricity of the planet's orbit (the amount by which the orbit deviates from a perfect circle). `fst_spt` holds the spectral type of the star that the planet orbits.
> Since only 33 % to 38 % of their values are valid, it is better to remove these columns than to try to impute the values using, for example, k-NN.

In [None]:
df_modified = df_origin.drop(labels=["fpl_eccen", "fst_spt"], axis=1)
print(df_modified.columns)
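
> The removal above hard-codes the two column names. As a minimal sketch, the same decision could be driven by a threshold on the fraction of valid values. The toy frame and the 50 % threshold below are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical values): column "b" is mostly missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 0.3, np.nan],
    "c": [5.0, np.nan, 7.0, 8.0],
})

min_valid = 0.5  # assumed threshold: keep columns with at least 50 % valid values
valid_ratio = toy.notnull().mean()  # fraction of non-missing values per column
sparse_cols = valid_ratio[valid_ratio < min_valid].index.tolist()
toy_dense = toy.drop(columns=sparse_cols)

print(valid_ratio)
print("dropped:", sparse_cols)
```

> With a threshold like this, the list of dropped columns follows from the data instead of being typed in by hand.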

### 1.2 Non-numeric values
> The column `fst_metratio` describes which kind of metal abundance is recorded. It has dtype `object`, i.e. it is non-numeric.


In [None]:
df_modified["fst_metratio"].unique()

> `Fe/H` denotes iron abundance and `M/H` or `m/H` denotes general metal content.

In [None]:
# replacing all occurrences of "m/H" with "M/H" and then replacing the categories with integer codes

df_modified.loc[df_modified["fst_metratio"] == "[m/H]", "fst_metratio"] = "[M/H]"
df_modified["fst_metratio"] = df_modified["fst_metratio"].astype("category")
cat_columns = df_modified.select_dtypes("category").columns
cont_cols = list(df_modified.select_dtypes(exclude="category").columns) # columns with continuous values
dis_cols = list(cat_columns) # cols with discrete values
df_modified[cat_columns] = df_modified[cat_columns].apply(lambda x: x.cat.codes)

df_modified["fst_metratio"].unique()
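
> To see which integer code ends up on which label, here is a small sketch on a toy column (hypothetical values that mimic `fst_metratio`). It relies on pandas sorting the categories of a string column, so `[Fe/H]` receives code 0 and `[M/H]` receives code 1.

```python
import pandas as pd

# toy column (hypothetical values) mimicking fst_metratio
ratio = pd.Series(["[Fe/H]", "[m/H]", "[M/H]", "[Fe/H]"])

ratio = ratio.replace("[m/H]", "[M/H]")   # unify the two spellings
ratio = ratio.astype("category")          # categories are sorted lexically
code_to_label = dict(enumerate(ratio.cat.categories))
codes = ratio.cat.codes.tolist()

print(code_to_label)
print(codes)
```

> Keeping a mapping such as `code_to_label` around makes it easier to label plots later than remembering the codes by heart.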

### 1.3 Earth Filling-in

In [None]:
def dropColumn(elem, lists):
    for l in lists:
        if(elem in l):
            l.remove(elem)

In [None]:
display(df_modified.describe())

df_modified.loc[0, "fpl_orbper"] = 365.256363 # orbital period in days
df_modified.loc[0, "fpl_smax"] = 1.000001018 # semi-major axis, i.e. the longest radius of the elliptic orbit (AU)
df_modified.loc[0, "fpl_bmasse"] = 1 # mass of the planet (earth unit)
df_modified.loc[0, "fpl_rade"] = 1 # radius (earth unit)
df_modified.loc[0, "fpl_dens"] = 5.51 # density of the planet (g/cm**3)
df_modified.loc[0, "fpl_tranflag"] = 1 # does planet transit the star
df_modified.loc[0, "fpl_cbflag"] = 0 # does planet orbit a binary solar system
df_modified.loc[0, "fpl_snum"] = 1 # number of stars in the solar system
df_modified.loc[0, "dec"] = 23.4 # declination of the planetary system
df_modified.drop(labels=["fst_optmag"], axis=1, inplace=True) # optical magnitude
dropColumn("fst_optmag", [cont_cols, dis_cols])
df_modified.drop(labels=["fst_nirmag"], axis=1, inplace=True) # near-IR magnitude
dropColumn("fst_nirmag", [cont_cols, dis_cols])
df_modified.loc[0, "fst_teff"] = 252 # effective temperature in Kelvins
df_modified.loc[0, "fst_logg"] = 5.437751 # gravity acceleration at the stellar surface, log10(cm/s**2)
df_modified.loc[0, "fst_lum"] = 0 # stellar luminosity, log10(luminosity in sun units)
df_modified.loc[0, "fst_mass"] = 1 # stellar mass (sun unit)
df_modified.loc[0, "fst_rad"] = 1 # stellar radius (sun unit)
df_modified.loc[0, "fst_met"] = 0.012 # star metallicity
df_modified.loc[0, "fst_metratio"] = 0 # metal abundance category (0 = iron abundance)
df_modified.loc[0, "fst_age"] = 4.603 # stellar age (billions of years)
df_modified.drop(labels=["simil"], axis=1, inplace=True) # random similarity (for visualizing purposes)
dropColumn("simil", [cont_cols, dis_cols])

display(df_modified.head())
print("cols with continuous values:", cont_cols)
print("cols with discrete values:", dis_cols)

## 2. General Data Analysis
> Analysing the data: 
* mean and deviation
* distribution
* joint distribution,
* ... 

In [None]:
df_vis = df_modified.copy()
# cont_cols -- continuous columns
# dis_cols -- discrete columns

### 2.1 Mean, Deviation and Ratio

In [None]:
# log-scaled mean of each variable with its standard deviation (square root of the variance)
# 0.1 is added to the means and deviations to avoid computing the log of 0
mean_log = np.log10(df_vis[cont_cols].mean() + 0.1)
std_log = np.log10(df_vis[cont_cols].std() + 0.1)
y_vals = np.arange(len(mean_log))

fig, ax = plt.subplots(figsize=(15,10))
ax.errorbar(mean_log, y_vals, xerr=std_log, ls='none', fmt='o')
ax.set_xlabel("Value")
ax.set_ylabel("Variable")
ax.set_yticks(y_vals)
ax.set_yticklabels(cont_cols)
ax.set_title("Log-scaled means + deviations")
plt.show()

> As the figure above shows, the stellar mass (`fst_mass`) is fairly constant across the data. On the other hand, the orbital period (`fpl_orbper`) varies from $ 10^{-2} $ to $ 10^8 $. All values are log-scaled so that they can be compared easily.

In [None]:
for col in dis_cols:
    # number of rows per distinct value (missing values are ignored)
    counts = df_vis[col].value_counts()
    fig, ax = plt.subplots(figsize=(7,7))
    ax.pie(counts, labels=counts.index, autopct='%1.1f%%', pctdistance=0.75, radius=1)
    ax.set_title(col)
    
print("In fst_metratio (metal abundance): \n\t0 = iron abundance, \n\t1 = general metal abundance")

> It hardly needs emphasizing that iron abundance is the prevailing category among the exoplanets...

### 2.2 Density of Random Variable
> Since there are 16 random variables (and ${16\choose2} = 120$ possible pairs), only a few interesting ones have been chosen for plotting.

In [None]:
def displayDensity(data, earth):
    fig, ax = plt.subplots(figsize=(15,5))
    ax.set_title("Density: {}".format(data.name))
    ax.hist(data, density=True, bins=50)
    ax.axvline(earth, color="red", label="Earth", linewidth=3)
    plt.legend()
    plt.show()
    
int_dist = ["fpl_rade", "dec", "fst_logg", "fst_age"] # interesting columns
    
for col in int_dist:
    displayDensity(df_vis[col], df_vis.loc[0, col])

> `fpl_rade` (radius of the exoplanet) suggests that there are two types of planets -- those with a radius similar to the Earth's and those with a radius of around 12 EU (Earth units).
`dec` (declination of the planetary system) most commonly lies around 40 degrees. This parameter is considered only for "fun", since it is not supposed to affect the habitability of the planet in any way.
The `fst_logg` chart shows that the Sun has a greater gravity acceleration at its surface than most of the other stars. The last chart, showing `fst_age` (stellar age), confirms that the Sun is close to the mean age of all the stars.

### 2.3 Correlation of Variables

In [None]:
cor_matrix = df_vis.corr()
plt.figure(figsize=(15,15))
sns.heatmap(cor_matrix, annot=True, cmap="YlGnBu", linewidths=.5)

> Most of the variables are not correlated at all, but a few are strongly correlated:
* `fst_logg` and `fst_lum` ($-0.72$) = luminosity decreases as gravity acceleration increases,
* `fst_logg` and `fst_rad` ($-0.75$) = stellar radius decreases as gravity acceleration increases,
* `fst_lum` and `fst_mass` ($0.52$) = stellar mass increases with luminosity.
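
> Reading the strongest pairs off a 17x17 heatmap by eye is error-prone. The sketch below shows one possible way to list them programmatically. The helper `strongest_pairs` and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def strongest_pairs(frame, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # keep only the strict upper triangle so each pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# toy data (hypothetical): y falls with x, z is unrelated noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "y": -2 * x + rng.normal(scale=0.1, size=200),
    "z": rng.normal(size=200),
})

top_pairs = strongest_pairs(toy, n=1)
print(top_pairs)
```

> Sorting by the absolute value keeps strong negative correlations, such as the two `fst_logg` pairs listed above, at the top of the list.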

### 2.4 Joint Distribution of Highly Correlated Variables

In [None]:
def displayJointDistribution(a, b):
    fig, ax = plt.subplots(figsize=(15,5))
    # Earth has rowid 0; mark it by label rather than by position
    colors = ["red" if rowid == 0 else "blue" for rowid in a.index]
    ax.axvline(a.loc[0], color="red", label="Earth")
    ax.axhline(b.loc[0], color="red")
    ax.scatter(a, b, c=colors)
    ax.set_title("{} vs. {}".format(a.name, b.name))
    ax.set_xlabel(a.name)
    ax.set_ylabel(b.name)
    plt.legend()
    plt.show()
    
print("Gravity acceleration (fst_logg) at the stellar surface and star luminosity (fst_lum).")
displayJointDistribution(df_vis["fst_logg"], df_vis["fst_lum"])
print("Gravity acceleration (fst_logg) at the stellar surface and the stellar radius (fst_rad).")
displayJointDistribution(df_vis["fst_logg"], df_vis["fst_rad"])
print("Stellar luminosity (fst_lum) and the stellar mass (fst_mass).")
displayJointDistribution(df_vis["fst_lum"], df_vis["fst_mass"])

> It's curious that stars with a surface gravity acceleration of $10^{4.3}\,\text{cm}/\text{s}^2$ have widely varying luminosity (from $10^{-3.8}\,\text{L}_\odot$ to $10^{2.1}\,\text{L}_\odot$).
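
> Both `fst_logg` and `fst_lum` store base-10 logarithms (see the comments in the Earth filling-in cell), so a physical value is recovered as $10^x$. A short sketch of the arithmetic, using the three numbers read from the plot:

```python
# fst_logg and fst_lum hold base-10 logarithms, so the physical value is 10**x
logg = 4.3                      # surface gravity exponent read from the plot
lum_low, lum_high = -3.8, 2.1   # luminosity exponents read from the plot

gravity_cgs = 10 ** logg                 # surface gravity in cm/s^2
lum_span = 10 ** (lum_high - lum_low)    # ratio of brightest to faintest star

print(round(gravity_cgs))
print(f"{lum_span:.2e}")
```

> The luminosity range therefore spans almost six orders of magnitude at a nearly fixed surface gravity.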

## 3. Clustering
> Are the planets divided into some groups based on their parameters?

### 3.1 Normalization
> Some variables have a much greater deviation than others, so all values are rescaled into the range $\left[0,1\right]$.

In [None]:
df_simil = df_vis.copy()
x = df_simil.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_simil = pd.DataFrame(x_scaled, columns=df_simil.columns, index=df_simil.index)  # keep the rowid index
df_simil.index.name = "rowid"
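
> For reference, min-max scaling with the default range maps every column through $(x - x_{min}) / (x_{max} - x_{min})$. The sketch below applies that formula by hand to a toy matrix (hypothetical values). It assumes no column is constant, since a constant column would divide by zero here.

```python
import numpy as np

# toy matrix (hypothetical values): rows are planets, columns are variables
x = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 800.0]])

col_min = x.min(axis=0)
col_max = x.max(axis=0)
x_scaled = (x - col_min) / (col_max - col_min)  # each column now spans [0, 1]

print(x_scaled)
```

> Both columns end up with identical scaled values even though their raw magnitudes differ by a factor of 200, which is the point of the normalization.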

### 3.2 KMeans

In [None]:
## finding the best number of clusters

n = []
score = []

for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state = 1)
    kmeans.fit(df_simil)
    n.append(k)
    score.append(kmeans.inertia_)

fig, ax = plt.subplots(figsize=(15,5))
ax.plot(n, score)

> There is a significant elbow at 2 clusters, which suggests that 2 is the optimum, even though a few dozen clusters were expected...
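
> Reading an elbow off a curve is somewhat subjective. One possible cross-check, not used in this notebook, is the silhouette score, where the `k` with the highest score is preferred. The sketch below runs on synthetic data (three hypothetical, well-separated groups), not on the exoplanet data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic data (hypothetical): three well-separated groups in 2-D
points, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=1)

scores = {}
for k in range(2, 7):  # the silhouette score is undefined for k = 1
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

> If the silhouette score and the elbow agree on the same `k`, the choice rests on more than a visual impression.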

In [None]:
# kmeans with the optimal number of clusters
optimum = 2
kmeans = KMeans(n_clusters=optimum, random_state = 1)
kmeans.fit(df_simil)

# projecting into lower-dimensional space
svd = TruncatedSVD(n_components=2)
reduced = svd.fit_transform(df_simil)

# coloring clusters
colors = []
for point in kmeans.labels_:
    if(point == 0):
        colors.append("green")
    if(point == 1):
        colors.append("blue")
        
# plotting
fig, ax = plt.subplots(1, 2, figsize=(15,5))
ax[0].scatter(reduced[:,0], reduced[:,1], c=colors)
ax[0].axvline(reduced[0,0], c="red", label="Earth")
ax[0].axhline(reduced[0,1], c="red")
ax[0].set_title("Clustered exoplanets")

# adding cluster to dataset
df_simil["cluster"] = kmeans.labels_

# cluster ratio (pie-chart)
cluster_count = []
cluster_colors = []
cluster_count.append(df_simil.loc[df_simil["cluster"] == 0, "cluster"].count())
cluster_colors.append("green")
cluster_count.append(df_simil.loc[df_simil["cluster"] == 1, "cluster"].count())
cluster_colors.append("blue")

ax[1].pie(cluster_count, autopct='%1.1f%%', pctdistance=1.2, colors=cluster_colors)
ax[1].set_title("Cluster-ratio")
plt.show()

# means in each cluster
mean_a = df_simil[df_simil["cluster"] == 0].mean()
mean_b = df_simil[df_simil["cluster"] == 1].mean()
x = np.arange(mean_a.shape[0])
fig, ax = plt.subplots(figsize=(15,7))
ax.set_title("Normalized Means in Each Cluster")
ax.scatter(mean_a, x, color="green")
ax.scatter(mean_b, x, color="blue")
ax.set_yticks(x)
ax.set_yticklabels(df_simil.columns)
plt.show()

> Earth is situated in the green cluster, which contains 78% of all the exoplanets. Plotting the means of each cluster shows that the clusters differ in:
* `fst_logg` gravity acceleration at the surface of the star,
* `dec` declination of the planetary system,
* `fpl_tranflag` whether the planet transits the star and
* `fpl_rade` radius of the planet.
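
> The variables listed above were picked by eye from the plot. A minimal sketch of how the same ranking could be computed, using a toy frame with hypothetical values in place of `df_simil`:

```python
import pandas as pd

# toy normalized data (hypothetical values) with a cluster label per row
toy = pd.DataFrame({
    "fpl_rade": [0.10, 0.12, 0.80, 0.85],
    "fst_mass": [0.50, 0.52, 0.51, 0.49],
    "cluster":  [0, 0, 1, 1],
})

cluster_means = toy.groupby("cluster").mean()
# absolute gap between the two cluster means, largest first
gap = (cluster_means.loc[0] - cluster_means.loc[1]).abs().sort_values(ascending=False)

print(gap)
```

> Variables at the top of `gap` are the ones that separate the two clusters most.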

## 4. Similarity Computation
> Computing the similarity between Earth and every other exoplanet. Similarity is computed as the cosine similarity between the planets' vectors.

### 4.1 Basic Cosine Similarity

In [None]:
# computing similarity
df_without_cluster = df_simil.drop(labels=["cluster"], axis=1)
cosine_simil = cosine_similarity(df_without_cluster)
df_cos = df_vis.copy()
df_cos["cos_simil"] = cosine_simil[:,0]

# selecting top-n exoplanets
top = df_cos.sort_values("cos_simil", ascending=False).head(10)
print("Top 10 exoplanets ranked by cosine similarity")
display(top)
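
> Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells the formula out on toy vectors (hypothetical values). The helper `cosine_sim` is only an illustration.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between vectors u and v."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy normalized planet vectors (hypothetical values)
earth = [0.2, 0.4, 0.4]
twin = [0.1, 0.2, 0.2]    # same direction as earth, half the magnitude
other = [0.4, 0.0, 0.0]

print(cosine_sim(earth, twin))
print(cosine_sim(earth, other))
```

> Note that `twin` scores as a perfect match although every one of its values is half of Earth's, because cosine similarity compares only the direction of the vectors and not their magnitude.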

### 4.2 Weighted Ranking
> Some variables are more important than others. Therefore a vector `w` of weights has been created...

In [None]:
w = [5,  # fpl_orbper
     3,  # fpl_smax
     5,  # fpl_bmasse
     5,  # fpl_rade
     7,  # fpl_dens
     1,  # fpl_tranflag
     5,  # fpl_cbflag
     8,  # fpl_snum
     3,  # dec
     10, # fst_teff
     4,  # fst_logg
     4,  # fst_lum
     6,  # fst_mass
     6,  # fst_rad
     3,  # fst_met
     1,  # fst_metratio
     8   # fst_age
    ]

weighted_distances = []
earth = df_without_cluster.iloc[0,:]

for row in df_without_cluster.iterrows():
    weighted_distances.append(1 - distance.cosine(earth, row[1], np.true_divide(w,10)))

df_cos["weighted_simil"] = weighted_distances
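
> As a sketch of what the weighting does, assume the weighted similarity has the form $\sum_i w_i u_i v_i / \sqrt{\sum_i w_i u_i^2 \cdot \sum_i w_i v_i^2}$, which is our reading of how `distance.cosine` treats its `w` argument. The helper and the toy values below are hypothetical.

```python
import numpy as np

def weighted_cosine_sim(u, v, w):
    """Cosine similarity where dimension i contributes with weight w[i]."""
    u, v, w = (np.asarray(a, dtype=float) for a in (u, v, w))
    num = np.sum(w * u * v)
    den = np.sqrt(np.sum(w * u * u) * np.sum(w * v * v))
    return float(num / den)

# toy vectors and weights (hypothetical values)
earth = [0.2, 0.4, 0.4]
planet = [0.4, 0.1, 0.3]
weights = np.array([10, 1, 5])

s_raw = weighted_cosine_sim(earth, planet, weights)
s_scaled = weighted_cosine_sim(earth, planet, weights / 10)
s_uniform = weighted_cosine_sim(earth, planet, [1, 1, 1])

print(s_raw, s_scaled, s_uniform)
```

> Under this form, only the ratios between the weights matter: dividing all of them by 10 leaves the similarity unchanged, and uniform weights reduce it to the plain cosine similarity.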

### 4.3 Saving Results

In [None]:
# adding cluster to the result
df_cos["cluster"] = df_simil["cluster"].to_numpy()  # assign by position; both frames share the same row order

# display result
display(df_cos.head())

# saving the result
df_cos.to_csv("output.csv")