BloomTech Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone can be enough. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
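
The target-distribution and metric questions in the checklist above can be answered with a quick frequency table. The following is a minimal sketch on a made-up target; `describe_target` and the toy series are hypothetical names introduced for illustration, and the 50% to 70% band is the rule of thumb from the checklist.

```python
import pandas as pd

def describe_target(y: pd.Series) -> pd.DataFrame:
    """Return the count and relative frequency of each class in a target."""
    counts = y.value_counts(dropna=False)
    return pd.DataFrame({"count": counts, "share": counts / len(y)})

# Toy target for illustration only (not real data): 95 "N" and 5 "Y"
toy_target = pd.Series(["N"] * 95 + ["Y"] * 5, name="target")

summary = describe_target(toy_target)
majority_share = float(summary["share"].max())

# Rule of thumb from the checklist: accuracy alone is reasonable only
# when the majority class share is at least 50% and below 70%
accuracy_is_reasonable = 0.5 <= majority_share < 0.7

print(summary)
print(f"majority share = {majority_share:.2f}")
print(f"accuracy alone is reasonable: {accuracy_is_reasonable}")
```

With a 95% majority class the check comes out false, which is the signal to pick additional metrics.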

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [1]:
%%capture
import sys

# Google Colab setup
if 'google.colab' in sys.modules:
    DATA_PATH = "/content/drive/My Drive/Kaggle"
    !pip install category_encoders==2.*

    # Mount Google Drive. The mount point is the Drive root, not the data
    # folder, and it must not contain spaces.
    from google.colab import drive
    drive.mount("/content/drive", force_remount=True)

# Local data store on drive D:
else:
    DATA_PATH = "D:/Datafiles"

In [2]:
# Import Block
import pandas as pd

#sklearn
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector

## Part One - Dataset Basics

For this week's modules I am using the asteroid dataset found on Kaggle here: https://www.kaggle.com/datasets/sakhawat18/asteroid-dataset
(This is a cleaner version of the NASA/JPL datasets used in my portfolio projects, so the daily assignments do not need the same level of cleaning.)

The target for this exercise will be `pha`, the Potentially Hazardous Asteroid (PHA) classification feature.
Unfortunately (or fortunately for the Earth), this feature is heavily imbalanced in favor of 'no', so accuracy alone would be misleading and we will need other metrics.

This is an imbalanced binary classification problem, so we will evaluate balanced accuracy and F1 score, as well as ROC AUC.
We will also evaluate models trained on synthetically balanced datasets, e.g. using SMOTE.
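
To illustrate why these metrics are preferred over plain accuracy here, the sketch below scores a trivial "always predict the majority class" baseline on toy labels. The labels and the 95/5 split are made up for illustration and are not taken from the asteroid data.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    roc_auc_score,
)

# Toy labels for illustration only: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A baseline "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)
# The same baseline expressed as a constant score for every row
y_score = np.zeros(len(y_true), dtype=float)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
auc = roc_auc_score(y_true, y_score)

print(f"accuracy          = {acc:.2f}")
print(f"balanced accuracy = {bal_acc:.2f}")
print(f"F1 (positive)     = {f1:.2f}")
print(f"ROC AUC           = {auc:.2f}")
```

The baseline reaches 0.95 accuracy while balanced accuracy and ROC AUC sit at 0.50 and F1 at 0.00, so the three chosen metrics expose a model that never finds a positive case.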

The features in this dataset have a lot of redundancy because they were merged from various sources. There are also
many label (identifier) features that need to be dropped for modeling purposes and to prevent leakage.

Also worth noting: the PHA designation is a direct function of two features, absolute magnitude (`H`) and minimum orbit intersection distance (`moid`),
so models that include those features will eventually be compared with models built without them.
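
The dependence on those two features can be written down as a rule. The sketch below uses the commonly cited PHA thresholds (absolute magnitude of 22.0 or less and an Earth minimum orbit intersection distance of 0.05 AU or less). Whether this dataset's `pha` flag follows that rule exactly is an assumption that has not been checked here; `pha_rule` and the toy rows are hypothetical.

```python
import pandas as pd

# Thresholds from the commonly cited PHA definition.
# Assumed here, not verified against this dataset's `pha` column.
H_MAX = 22.0        # absolute magnitude at or below this value
MOID_MAX_AU = 0.05  # Earth MOID at or below this value, in AU

def pha_rule(frame: pd.DataFrame) -> pd.Series:
    """Flag rows that satisfy both thresholds.

    Rows with a missing H or moid compare as False, so they are not flagged.
    """
    return (frame["H"] <= H_MAX) & (frame["moid"] <= MOID_MAX_AU)

# Toy rows for illustration only (not real asteroids)
toy = pd.DataFrame(
    {
        "H": [18.0, 25.0, 21.9, None],
        "moid": [0.03, 0.01, 0.20, 0.04],
    }
)
toy["pha_rule"] = pha_rule(toy)
print(toy)
```

Only the first toy row passes both thresholds. Comparing a column like this against the real `pha` values (after mapping them to booleans) would show how much of the target a model can recover from `H` and `moid` alone.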

In [13]:
# Wrangling Functions

def wrangle(filepath):
    """Load the asteroid CSV and drop label, duplicate, and leaky columns."""
    # pdes = primary designation number
    df = pd.read_csv(filepath, index_col=['pdes'])

    # Extraneous label/index/constant columns
    label_cols = ['id', 'spkid', 'full_name', 'name', 'prefix', 'orbit_id', 'equinox', 'class']
    # Duplicate date/distance columns
    # NOTE: we want consistent Julian dates / AU distances
    duplicate_cols = ['epoch_mjd', 'epoch_cal', 'tp_cal', 'per_y', 'moid_ld']
    # Leaky columns
    leaky_cols = ['neo']
    df = df.drop(columns=label_cols + duplicate_cols + leaky_cols)

    # Drop rows with no target value
    df = df.dropna(subset=['pha'])

    return df

In [15]:
df = wrangle(DATA_PATH + "/Asteroids/dataset.csv")

print(df.shape)
print("\n")
df.info()  # info() prints directly and returns None, so no print() wrapper

(938603, 30)


<class 'pandas.core.frame.DataFrame'>
Index: 938603 entries, 1 to 2678 T-3
Data columns (total 30 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   pha             938603 non-null  object 
 1   H               932341 non-null  float64
 2   diameter        136209 non-null  float64
 3   albedo          135103 non-null  float64
 4   diameter_sigma  136081 non-null  float64
 5   epoch           938603 non-null  float64
 6   e               938603 non-null  float64
 7   a               938603 non-null  float64
 8   q               938603 non-null  float64
 9   i               938603 non-null  float64
 10  om              938603 non-null  float64
 11  w               938603 non-null  float64
 12  ma              938602 non-null  float64
 13  ad              938599 non-null  float64
 14  n               938603 non-null  float64
 15  tp              938603 non-null  float64
 16  per             938599 non-null  float64
 17

In [17]:
#print(df.value_counts())

print(df.isna().sum()) #missing H may be a problem later


pha                    0
H                   6262
diameter          802394
albedo            803500
diameter_sigma    802522
epoch                  0
e                      0
a                      0
q                      0
i                      0
om                     0
w                      0
ma                     1
ad                     4
n                      0
tp                     0
per                    4
moid                   0
sigma_e                1
sigma_a                1
sigma_q                1
sigma_i                1
sigma_om               1
sigma_w                1
sigma_ma               1
sigma_ad               5
sigma_n                1
sigma_tp               1
sigma_per              5
rms                    1
dtype: int64
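
Given the counts above, `diameter`, `albedo`, and `diameter_sigma` are missing in roughly 85% of rows, while the remaining gaps are small. One possible way to handle this, sketched on toy data, is to drop columns above a missing-share threshold and impute the rest with the median. The helper `high_missing_columns`, the 0.5 threshold, and the toy frame are assumptions for illustration, not the approach the notebook settles on.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

def high_missing_columns(frame: pd.DataFrame, threshold: float = 0.5) -> list:
    """Return the columns whose share of missing values exceeds the threshold."""
    missing_share = frame.isna().mean()
    return missing_share[missing_share > threshold].index.tolist()

# Toy frame for illustration only (not real data)
toy = pd.DataFrame(
    {
        "diameter": [np.nan, np.nan, np.nan, 2.5],  # 75% missing
        "H": [18.0, np.nan, 21.0, 19.0],            # 25% missing
        "e": [0.1, 0.2, 0.3, 0.4],                  # complete
    }
)

sparse_cols = high_missing_columns(toy)
reduced = toy.drop(columns=sparse_cols)

# Median imputation for the remaining numeric gaps
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(
    imputer.fit_transform(reduced),
    columns=reduced.columns,
    index=reduced.index,
)

print("dropped:", sparse_cols)
print(filled)
```

In a real workflow the imputer would be fit on the training split only, for example inside a pipeline, so that validation and test rows do not influence the imputed values.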
