### This is a simple notebook to build, visualize, and diagnose the performance of decision tree (DT) algorithms on the (larger) habitable planets data set.

It accompanies Chapter 3 of the book.

Data for this exercise come from [here](https://phl.upr.edu/).

Author: Viviana Acquaviva

In [None]:
import pandas as pd

import numpy as np

import sklearn.tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics 
from sklearn.model_selection import cross_val_predict, cross_val_score, cross_validate
from sklearn.model_selection import KFold, StratifiedKFold

from scipy import stats

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

In [None]:
from io import StringIO  
from IPython.display import Image  
import pydotplus
from sklearn.tree import export_graphviz

In [None]:
import matplotlib
font = {'size'   : 20}

matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=20) 
matplotlib.rc('ytick', labelsize=20) 
matplotlib.rcParams['figure.dpi'] = 300

### Step 1: Preliminary data analysis/exploration.

Once we are working with research-level data sets, our first step should always be data exploration.

We can read the data in a data frame, as we did previously, and do some preliminary data analysis.

In [None]:
df = pd.read_csv('../data/phl_exoplanet_catalog.csv', sep = ',')

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.describe()

In [None]:
df.groupby('P_HABITABLE').count()

#### Start by lumping together Probably and Possibly Habitable planets.

In [None]:
# What are we doing here? Creating a new data frame called bindf and dropping the old habitability tag
bindf = df.drop('P_HABITABLE', axis = 1) 

In [None]:
# how about here? creating my new habitability tag
bindf['P_HABITABLE'] = (np.logical_or((df.P_HABITABLE == 1) , (df.P_HABITABLE == 2))) 

# and here? Re-casting this column as integer
bindf['P_HABITABLE'] = bindf['P_HABITABLE'].astype(int) 
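
The same binary tag can also be built in one step with `Series.isin`. The sketch below runs on a small made-up frame (`toy` is hypothetical, not the catalog) and assumes, as in the cells above, that labels 1 and 2 mark the two habitable classes.

```python
import pandas as pd

# Hypothetical stand-in for the catalog's habitability column (values 0, 1, or 2)
toy = pd.DataFrame({'P_HABITABLE': [0, 1, 2, 0, 2]})

# True where the label is 1 or 2, then cast the boolean column to integer 0/1
binary_tag = toy['P_HABITABLE'].isin([1, 2]).astype(int)
print(binary_tag.tolist())
```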

In [None]:
bindf.head()

### Let's select some columns.

S_MAG - star magnitude 

S_DISTANCE - star distance (parsecs)

S_METALLICITY - star metallicity (dex)

S_MASS - star mass (solar units)

S_RADIUS - star radius (solar units)

S_AGE - star age (Gy)

S_TEMPERATURE - star effective temperature (K)

S_LOG_G - star log(g)

P_DISTANCE - planet mean distance from the star (AU) 

P_FLUX - planet mean stellar flux (earth units)

P_PERIOD - planet period (days) 

### We can select the same features as we did in Chapter 2.

In [None]:
final_features = bindf[['S_MASS', 'P_PERIOD', 'P_DISTANCE']] 

In [None]:
targets = bindf.P_HABITABLE

In [None]:
final_features.head()

### There are some NaNs. We can see this by using the "describe" method, whose "count" row only includes the non-missing values in each column.

In [None]:
final_features.shape

In [None]:
final_features.describe()

### We can count missing data by column...

In [None]:
for i in range(final_features.shape[1]):
    print(len(np.where(final_features.iloc[:,i].isna())[0]))
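
A more compact way to get the same per-column counts is to chain `isna()` and `sum()`. This is a sketch on a hypothetical toy frame; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries
toy = pd.DataFrame({'S_MASS': [1.0, np.nan, 0.8],
                    'P_PERIOD': [365.0, 10.0, np.nan],
                    'P_DISTANCE': [1.0, np.nan, np.nan]})

# isna() marks missing entries as True; sum() counts them column by column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```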

### ...and get rid of them (Note: there are much better imputing strategies!)

In [None]:
final_features = final_features.dropna(axis = 0) # gets rid of any instance with at least one NaN in any column
final_features.shape
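
As one example of an alternative to dropping rows, the sketch below fills each missing value with the median of its column. This is only an illustration on a hypothetical toy frame, not the approach used in this notebook, and whether median imputation is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({'S_MASS': [1.0, np.nan, 0.8, 1.2],
                    'P_PERIOD': [365.0, 10.0, np.nan, 20.0]})

# DataFrame.median skips NaNs by default; fillna matches each median to its column by name
imputed = toy.fillna(toy.median())
print(imputed)
```

Unlike `dropna`, this keeps every row, at the cost of inserting values that were never measured.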

### Learning Check-in

What is a 'NaN' and why do we need to remove them from this data?

<details><summary><b>Click here for the answer!</b></summary>
<p>

```
NaN is not, in fact, data about your grandmother; it stands for "Not a Number" and is Python's way of telling us that a value is missing where there should be one.

If we don't remove them from our data set, we will run into trouble with any calculations that fail when operating on NaN values.

Try it! Comment out the first line of the previous code block and re-run the following blocks. Does anything look different?
```

</p>
</details>

### Next step: search for outliers

Method 1 - plot!

In [None]:
plt.hist(final_features.iloc[:,0], bins = 100, alpha = 0.5)

There is a remarkable outlier; the same happens for other features. 

We could also have spotted this from the difference between mean and median (which, in fact, is even more pronounced for orbital distance and period).

In [None]:
final_features.describe()
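
The gap between mean and median can also be checked directly. In this sketch (toy values, made up for illustration) a single extreme value pulls the mean far above the median, which is the signature to look for in the `describe` output.

```python
import pandas as pd

# Hypothetical sample: five ordinary values and one extreme one
toy = pd.Series([0.8, 0.9, 1.0, 1.1, 1.2, 50.0])

toy_mean = toy.mean()      # strongly affected by the extreme value
toy_median = toy.median()  # barely affected by it
print(toy_mean, toy_median)
```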

In [None]:
final_features = final_features[(np.abs(stats.zscore(final_features)) < 5).all(axis=1)] 

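# This removes every row in which any feature lies more than 5 standard deviations from its mean;
# since the mean and standard deviation are themselves affected by outliers, this cut might not be ideal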
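
Because the mean and standard deviation are themselves pulled by outliers, one common alternative is a robust z-score built from the median and the median absolute deviation (MAD). The sketch below is a swapped-in technique, not the cut used in this notebook; the toy data, the helper name `robust_zscore`, and the threshold of 5 are assumptions for illustration.

```python
import pandas as pd

def robust_zscore(frame):
    """Column-wise (x - median) / (1.4826 * MAD). Hypothetical helper.

    The factor 1.4826 makes the MAD comparable to a standard deviation
    for Gaussian data. A column whose MAD is zero would need special handling.
    """
    median = frame.median()
    mad = (frame - median).abs().median()
    return (frame - median) / (1.4826 * mad)

# Hypothetical toy frame: column 'a' contains one extreme value
toy = pd.DataFrame({'a': [1.0, 1.1, 0.9, 1.2, 0.8, 100.0],
                    'b': [10.0, 11.0, 9.0, 12.0, 8.0, 10.5]})

# Keep rows whose robust z-score is below the threshold in every column
keep = (robust_zscore(toy).abs() < 5).all(axis=1)
cleaned = toy[keep]
print(cleaned)
```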

In [None]:
targets = targets[final_features.index]

### Now reset index.

In [None]:
final_features = final_features.reset_index(drop=True)

In [None]:
final_features.head()

### And don't forget to do the same for the label vector.

In [None]:
targets = targets.reset_index(drop=True)
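
A quick sanity check can confirm that features and labels still line up after filtering and re-indexing. The sketch below reproduces the same sequence of steps on hypothetical toy data (`features` and `labels` are made-up names and values).

```python
import pandas as pd

# Hypothetical toy features and labels sharing the default 0..3 index
features = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})
labels = pd.Series([0, 1, 0, 1])

features = features[features['x'] != 2.0]   # drop one row; the index is now 0, 2, 3
labels = labels.loc[features.index]         # select the surviving rows by index label

# Reset both so that they share a clean 0..n-1 index
features = features.reset_index(drop=True)
labels = labels.reset_index(drop=True)

# The two objects should have the same length and identical indices
aligned = len(features) == len(labels) and features.index.equals(labels.index)
print(aligned)
```

Note that the labels must be selected before the features' index is reset; afterwards the original row labels are gone.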

In [None]:
targets.head()

### Comparing the shapes, we can see that 9 outliers were eliminated.

In [None]:
targets.shape

### Check balance of data set

In [None]:
#Simple way: count 0/1s, get fraction of total

In [None]:
np.sum(targets)/len(targets)

In [None]:
np.bincount(targets) #this shows the distribution of the two classes

### This tells us that our data set is extremely imbalanced, and therefore, we need to be careful.
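
One reason to be careful: with a strongly imbalanced label vector, a classifier that always predicts the majority class already reaches a high accuracy. The sketch below uses a made-up label array; its 2% positive fraction is an assumption for illustration, not the catalog's value.

```python
import numpy as np

# Hypothetical labels: 98 negatives and 2 positives
toy_targets = np.array([0] * 98 + [1] * 2)

counts = np.bincount(toy_targets)                 # number of examples in each class
majority_accuracy = counts.max() / counts.sum()   # accuracy of always predicting the larger class
print(majority_accuracy)
```

Any model we train should be compared against this baseline rather than against 50%.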

#### We can also just take a look at the first two features, using different colors for the two classes.

In [None]:
plt.figure(figsize=(10,6))

cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ['#20B2AA','#FF00FF'])

a = plt.scatter(final_features['S_MASS'], final_features['P_PERIOD'], marker = 'o',
            c = targets, s = 100, cmap=cmap)

# Make the marker faces transparent
a.set_facecolor('none')

plt.yscale('log')
plt.xlabel('Mass of Parent Star (Solar Mass Units)')
plt.ylabel('Period of Orbit (days)');

# Build the legend by hand, with one colored patch per class
bluepatch = mpatches.Patch(color='#20B2AA', label='Not Habitable')
magentapatch = mpatches.Patch(color='#FF00FF', label='Habitable')

plt.legend(handles=[magentapatch, bluepatch],
           loc = 'lower right', fontsize = 14);

### Learning Check-in

Based on this graph, would you expect DT or kNN to perform better? Why?

<details><summary><b>Click here for the answer!</b></summary>
<p>

```
Possibly kNN, because a DT can only split perpendicular to a feature axis (one feature and one threshold per split) and cannot cut the data set diagonally.
```

</p>
</details>
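
To illustrate the point about axis-aligned splits, here is a minimal sketch on synthetic data (not the planet catalog) in which the class boundary is the diagonal of the unit square. The data, the random seed, and the variable names are assumptions for illustration, and the scores are training accuracies only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: points in the unit square, labeled by which side of the diagonal they fall on
rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 2))
y = (X[:, 0] > X[:, 1]).astype(int)

# A single axis-aligned split versus a tree that is allowed to grow freely
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(random_state=0).fit(X, y)

stump_acc = stump.score(X, y)   # training accuracy with one split
deep_acc = deep.score(X, y)     # training accuracy with many splits
print(stump_acc, deep_acc, deep.get_depth())
```

A single split cannot follow the diagonal, so the tree has to approximate it with a staircase of many axis-aligned cuts.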


What kind of performance can we expect? Qualitatively, is the information sufficient? Do you expect latent (hidden) variables, beyond the features we have, to affect the outcome?

<details><summary><b>Click here for the answer!</b></summary>
<p>

```
There is a lot of overlap between the two classes, which suggests that we can't expect a great performance unless we collect more features.
```

</p>
</details>