In [1]:
import os

os.chdir("../")

# Data Cleaning - Handling Missing Values

The purpose of this notebook is to handle missing values in the raw data collected from the Jet Propulsion Laboratory.

## Loading Dataset

In this section, I load the raw dataset and look at its shape and column data types.

In [2]:
import pandas as pd
import plotly.express as px

In [3]:
df = pd.read_csv("data/Asteroid_Data.csv", low_memory=False)
print(f"Number of (rows, columns) = {df.shape}")

Number of (rows, columns) = (1340607, 43)


In [4]:
df.sample(3)

Unnamed: 0,full_name,a,e,i,om,w,q,ad,per_y,data_arc,...,moid,moid_ld,sigma_e,sigma_a,sigma_q,sigma_i,sigma_per,class,first_obs,last_obs
892442,(2014 KW141),3.179,0.2239,16.58,236.92,112.95,2.467,3.89,5.67,6220.0,...,1.52,591.0,2.1e-07,9.3e-08,6.2e-07,3.5e-05,9.1e-05,MBA,2004-11-19,2021-11-30
688827,(2006 ST162),2.673,0.235,12.62,195.22,236.32,2.045,3.3,4.37,4813.0,...,1.09,423.0,5.6e-07,5.8e-08,1.5e-06,1.4e-05,5.2e-05,MBA,2006-09-24,2019-11-28
1019622,(2015 YA23),2.596,0.2037,15.13,116.04,340.45,2.067,3.13,4.18,4374.0,...,1.09,424.0,7.5e-07,1.8e-07,2.1e-06,9e-06,0.00016,MBA,2011-12-26,2023-12-17


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1340607 entries, 0 to 1340606
Data columns (total 43 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   full_name       1340607 non-null  object 
 1   a               1340607 non-null  float64
 2   e               1340607 non-null  float64
 3   i               1340607 non-null  float64
 4   om              1340607 non-null  float64
 5   w               1340607 non-null  float64
 6   q               1340607 non-null  float64
 7   ad              1340602 non-null  float64
 8   per_y           1340602 non-null  float64
 9   data_arc        1340063 non-null  float64
 10  condition_code  1340584 non-null  object 
 11  n_obs_used      1340607 non-null  int64  
 12  n_del_obs_used  1034 non-null     float64
 13  n_dop_obs_used  1034 non-null     float64
 14  H               1339457 non-null  float64
 15  epoch_mjd       1340607 non-null  int64  
 16  ma              1340606 non-null  fl

## Identify Missing Columns

In this section, I identify which columns have missing values and what percentage of each column is missing. I visualize these statistics in a bar plot, then chart a course for handling the different levels of missingness.

In [6]:
missing = pd.DataFrame(
    df.isna().sum().sort_values(ascending=True)  # per-column count of missing values
).reset_index()

missing.rename(columns={0: "Missing", "index": "Column"}, inplace=True)
missing["Percent"] = missing["Missing"] / df.shape[0] * 100

In [7]:
fig = px.bar(missing[missing.Missing > 0], x="Column", y="Percent", text="Missing")
fig.update_layout(
    height=600,
    width=800,
    title_x=0.5,
    title_text="Bar Chart<br><sup>Missing Values of each column</sup>",
)
fig.show()

### Observation 1

Nearly all values in `rot_per` to `IR` are missing. Predicting them from the existing ones will be hard as there isn't enough data. 

    The best way to deal with these columns is to drop them. If I learn of a better way to handle these missing values, 
    I'll come and deal with them later on.

In [8]:
missing[missing.Percent > 90]

Unnamed: 0,Column,Missing,Percent
32,rot_per,1306504,97.456152
33,spec_B,1338941,99.875728
34,n_del_obs_used,1339573,99.922871
35,n_dop_obs_used,1339573,99.922871
36,BV,1339586,99.92384
37,spec_T,1339627,99.926899
38,UB,1339628,99.926973
39,G,1340488,99.991123
40,extent,1340587,99.998508
41,GM,1340592,99.998881


### Observation 2

A big chunk of `diameter` and `albedo` values are missing. 

    Predicting them with a Machine Learning model should be possible from the roughly 10\% of the data that is available. 
    I'll use a deep learning model to do this.

In [9]:
missing[missing.Column.isin(["diameter", "albedo"])]

Unnamed: 0,Column,Missing,Percent
29,diameter,1200983,89.585016
31,albedo,1202111,89.669157


### Observation 3

Some columns have absolutely no missing values. 

    Nothing needs to be done for these columns. I'll use these to help me in imputing other missing values.

In [10]:
missing[missing.Missing == 0]

Unnamed: 0,Column,Missing,Percent
0,full_name,0,0.0
1,class,0,0.0
2,n,0,0.0
3,first_obs,0,0.0
4,epoch_mjd,0,0.0
5,n_obs_used,0,0.0
6,last_obs,0,0.0
7,a,0,0.0
8,e,0,0.0
9,w,0,0.0


### Observation 4

Most columns have $<5\%$ of their data missing. 

    These can be filled in using imputation techniques. 
    
* For numerical columns, I'll use imputation by group median. 

* For categorical, I'll impute by group mode.

In [11]:
missing[(missing.Percent < 5) & (missing.Missing > 0)]

Unnamed: 0,Column,Missing,Percent
13,ma,1,7.5e-05
14,per,5,0.000373
15,neo,5,0.000373
16,ad,5,0.000373
17,per_y,5,0.000373
18,condition_code,23,0.001716
19,data_arc,544,0.040579
20,H,1150,0.085782
21,moid_ld,2112,0.157541
22,moid,2112,0.157541


## Dropping Columns

In this section, I drop the columns that have more than 90\% of their values missing. Additionally, I remove columns that aren't necessary for my objective. These include `full_name`, `sigma_i`, `sigma_q`, `sigma_a`, `sigma_e`, `sigma_per`, and `diameter_sigma`.

In [12]:
df.drop(
    columns=[
        "full_name",
        "rot_per",
        "spec_B",
        "spec_T",
        "G",
        "BV",
        "UB",
        "IR",
        "GM",
        "extent",
        "n_del_obs_used",
        "n_dop_obs_used",
        "sigma_i",
        "sigma_q",
        "sigma_a",
        "sigma_e",
        "sigma_per",
        "diameter_sigma",
    ],
    inplace=True,
)

print(f"After dropping, dataframe shape = {df.shape}")

After dropping, dataframe shape = (1340607, 25)
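The high-missingness part of the drop list could also be derived from the data rather than typed by hand. The following is a minimal sketch on a made-up toy frame, assuming the same 90\% cutoff; the names `toy`, `threshold`, and `sparse_columns` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the asteroid data (values are made up).
toy = pd.DataFrame(
    {
        "a": [2.1, 2.7, 3.0, 2.4],
        "rot_per": [np.nan, np.nan, np.nan, np.nan],
        "H": [17.2, np.nan, 16.9, 18.1],
    }
)

# Share of missing values per column, expressed as a percentage.
missing_pct = toy.isna().mean() * 100

# Columns whose missing percentage exceeds the assumed 90% cutoff.
threshold = 90
sparse_columns = missing_pct[missing_pct > threshold].index.tolist()

trimmed = toy.drop(columns=sparse_columns)
print(sparse_columns)
print(trimmed.shape)
```

Columns dropped for reasons other than missingness (such as the `sigma_*` columns) would still need to be listed explicitly.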


## Imputation

This section fills in missing values based on existing values. In particular, I employ two strategies:

1. **Impute by Group**: Except for `albedo` and `diameter`, most columns have a very small percentage of missing values. I fill in these values using each group's central tendency, namely

    * _Median_ for numerical columns. I use the median because the values contain many outliers.
    
    * _Mode_ for categorical columns. This is the majority category of a group.

2. **Impute using MLP**: I design and train a Multilayer Perceptron (MLP) on the existing data for the `albedo` and `diameter` columns separately. I then use each model to predict the missing values and impute with its predictions.
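To make the first strategy concrete, here is a minimal sketch of group-wise imputation on a small made-up frame. The grouping column and values are illustrative only; the notebook's own cells further down choose the grouping columns per case.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "class": ["MBA", "MBA", "MBA", "APO", "APO", "APO"],
        "H": [16.0, np.nan, 18.0, 20.0, 22.0, np.nan],
        "condition_code": ["0", "0", None, "7", None, "7"],
    }
)

# Numerical column: fill each gap with the median of its own group.
toy["H"] = toy.groupby("class")["H"].transform(lambda s: s.fillna(s.median()))

# Categorical column: fill each gap with the most frequent value in its group.
toy["condition_code"] = toy.groupby("class")["condition_code"].transform(
    lambda s: s.fillna(s.mode().iloc[0])
)

print(toy)
```

Using `transform` on a single selected column keeps the original row index, so the result can be assigned straight back to the frame.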

### Imputation by Group

In this subsection, I'll impute missing values for columns with less than 5\% of their data missing.

#### Categorical Columns

These columns are imputed using group mode.

##### `neo` column

In [13]:
df[df.neo.isna()]

Unnamed: 0,a,e,i,om,w,q,ad,per_y,data_arc,condition_code,...,albedo,neo,pha,n,per,moid,moid_ld,class,first_obs,last_obs
1084193,-57.17,1.0219,145.42,166.04,77.96,1.25,,,17.0,,...,,,N,0.00228,,0.601,234.0,HYA,2016-12-11,2016-12-28
1125900,-1.272,1.2011,122.74,24.6,241.81,0.256,,,80.0,,...,,,N,0.6867,,0.0958,37.3,HYA,2017-10-14,2018-01-02
1220367,-50670.0,1.0011,72.83,287.13,29.58,53.433,,,56.0,,...,,,N,8.641e-08,,52.5,20400.0,HYA,2020-06-20,2020-08-15
1285026,-2290.0,1.0013,137.14,228.8,193.28,2.995,,,389.0,,...,,,N,8.991e-06,,2.01,784.0,HYA,2021-10-30,2022-11-23
1328329,-15820.0,1.0001,12.11,347.39,86.85,1.358,,,25.0,,...,,,N,4.955e-07,,0.445,173.0,HYA,2023-09-15,2023-10-10


I can group by the `pha` column to impute these values. The other categorical columns are not usable here:

* `condition_code` is null for all of these rows.
* `class` can't be used, because every row in this `class` group (**HYA**) has a null `neo` value.

In [13]:
df.groupby("pha").neo.apply(lambda x: x.mode().iloc[0])

pha
N    N
Y    Y
Name: neo, dtype: object

Rows with a `pha` of **N** have a `neo` mode of **N**. So, I'll impute the missing `neo` values with this.

In [14]:
# Assign the result back instead of calling fillna in place on a column accessor
df["neo"] = df["neo"].fillna("N")

To confirm, I should have zero missing values now.

In [15]:
df.neo.isna().sum()

0

##### `condition_code` column

Impute using `class` and `neo` group mode.

In [16]:
df["condition_code"] = (
    df.groupby(["class", "neo"])
    .transform(lambda x: x.fillna(x.mode().iloc[0]))
    .condition_code
)

Let's check if there are null values after imputation.

In [18]:
df[df.condition_code.isna()]

Unnamed: 0,a,e,i,om,w,q,ad,per_y,data_arc,condition_code,...,albedo,neo,pha,n,per,moid,moid_ld,class,first_obs,last_obs
1084193,-57.17,1.0219,145.42,166.04,77.96,1.25,,,17.0,,...,,N,N,0.00228,,0.601,234.0,HYA,2016-12-11,2016-12-28
1125900,-1.272,1.2011,122.74,24.6,241.81,0.256,,,80.0,,...,,N,N,0.6867,,0.0958,37.3,HYA,2017-10-14,2018-01-02
1220367,-50670.0,1.0011,72.83,287.13,29.58,53.433,,,56.0,,...,,N,N,8.641e-08,,52.5,20400.0,HYA,2020-06-20,2020-08-15
1285026,-2290.0,1.0013,137.14,228.8,193.28,2.995,,,389.0,,...,,N,N,8.991e-06,,2.01,784.0,HYA,2021-10-30,2022-11-23
1328329,-15820.0,1.0001,12.11,347.39,86.85,1.358,,,25.0,,...,,N,N,4.955e-07,,0.445,173.0,HYA,2023-09-15,2023-10-10


Every `condition_code` value in the **HYA** `class` is null, so there was no group mode to impute with. I'll therefore impute by `neo` group instead.
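The fall-back from a fine grouping to a coarser one can be written as a small helper. This is only a sketch on made-up data: `group_mode_fill` is a hypothetical function, not something defined in this notebook or in pandas.

```python
import pandas as pd


def group_mode_fill(frame, column, group_cols):
    """Fill gaps in `column` with the per-group mode; all-null groups are left as they are."""

    def fill(series):
        modes = series.mode()  # mode() skips missing values, so it is empty for all-null groups
        return series.fillna(modes.iloc[0]) if not modes.empty else series

    return frame.groupby(group_cols)[column].transform(fill)


toy = pd.DataFrame(
    {
        "class": ["MBA", "MBA", "MBA", "HYA", "HYA"],
        "neo": ["N", "N", "N", "N", "N"],
        "condition_code": ["0", None, "0", None, None],
    }
)

# First pass: fine-grained groups. The HYA rows stay null because their group has no mode.
toy["condition_code"] = group_mode_fill(toy, "condition_code", ["class", "neo"])
remaining = int(toy["condition_code"].isna().sum())

# Second pass: grouping on `neo` alone picks up the leftovers.
toy["condition_code"] = group_mode_fill(toy, "condition_code", ["neo"])

print(remaining, toy["condition_code"].tolist())
```

Checking for an empty mode before indexing avoids relying on how `x.mode().iloc[0]` behaves for a group with no observed values.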

In [17]:
df.groupby("neo").apply(lambda x: x.mode().iloc[0]).condition_code

neo
N    0
Y    7
Name: condition_code, dtype: object

Rows with a `neo` value of **N** mostly have a `condition_code` of 0.

In [18]:
# Assign the result back instead of calling fillna in place on a column accessor
df["condition_code"] = df["condition_code"].fillna("0")

Check if there are any missing values.

In [19]:
df.condition_code.isna().sum()

0

##### `pha` column

In [20]:
df["pha"] = (
    df.groupby(["class", "neo"]).transform(lambda x: x.fillna(x.mode().iloc[0])).pha
)

df["pha"].isnull().sum()

0

#### Numerical Columns

As before, I first impute by group median and then check whether any missing values remain after imputation.

In [21]:
impute_columns = ["ma", "per", "ad", "per_y", "data_arc", "H", "moid", "moid_ld"]

# transform keeps the original row index, so the filled values stay aligned with df
df[impute_columns] = df.groupby(by=["neo", "condition_code"])[impute_columns].transform(
    lambda x: x.fillna(x.median())
)

df[impute_columns].isnull().sum()

ma          0
per         0
ad          0
per_y       0
data_arc    0
H           1
moid        1
moid_ld     1
dtype: int64

Then I look at which values are still missing. In this case, a single row is missing all 3 values.

In [22]:
df[(df["H"].isnull())|(df["moid"].isnull())|(df["moid_ld"].isnull())].dropna(how="all")

Unnamed: 0,a,e,i,om,w,q,ad,per_y,data_arc,condition_code,...,albedo,neo,pha,n,per,moid,moid_ld,class,first_obs,last_obs
1306656,3.171,0.1008,8.52,342.48,8.14,2.851,3.53,4.62,2.0,7,...,,N,N,0.1745,1690.0,,,MBA,2022-09-16,2022-09-28


Next, I see what the median is for `condition_code` of 7.

In [23]:
df.groupby("condition_code")[impute_columns].apply(lambda x: x.median()).loc[
    "7", ["H", "moid", "moid_ld"]
]

H           18.62
moid         1.10
moid_ld    428.00
Name: 7, dtype: float64

Then, I impute using these values.

In [24]:
# Fill the remaining gaps with the condition_code 7 medians computed above
df["H"] = df["H"].fillna(18.62)
df["moid"] = df["moid"].fillna(1.10)
df["moid_ld"] = df["moid_ld"].fillna(428.00)

Finally, one last check to see if there are any more missing values.

In [25]:
df[impute_columns].isnull().sum()

ma          0
per         0
ad          0
per_y       0
data_arc    0
H           0
moid        0
moid_ld     0
dtype: int64

### Impute using MLP

In this subsection, I use my custom `mlp` architecture to fill in missing values for `albedo` and `diameter` columns.

##### `albedo` column

In [26]:
import torch

torch.manual_seed(29)

from src.deep_learning import mlp, train_script, create_dataloader

I first specify which columns will be included and what their types are.

In [28]:
categorical_columns = ["pha", "neo", "condition_code", "class"]
numerical_columns = df.columns.drop(categorical_columns).drop(
    ["first_obs", "last_obs", "diameter"]
)
target_column = "albedo"
exclude_columns = ["first_obs", "last_obs", "diameter"]

Then I create data loaders to feed into the model for training, validation, and inference.

In [34]:
train_loader, valid_loader, inf_loader = create_dataloader.create_dataloader(
    df,
    numerical_columns,
    categorical_columns,
    target_column,
    exclude_columns,
    2048,
    False,
)

Number of examples for training purposes: 138496
Number of examples for inference purposes: 1202111
Training X shape: torch.Size([128256, 45])
Training y shape: torch.Size([128256])
Validation X shape: torch.Size([10240, 45])
Validation y shape: torch.Size([10240])
Inference X shape: torch.Size([1202111, 45])
Inference y shape: torch.Size([1202111])
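The internals of `create_dataloader` are not shown in this notebook. Judging only from the printed counts, rows with a known target go to training and validation, and rows with a missing target go to inference. The following is a hypothetical sketch of that kind of split on made-up data, with one-hot encoding assumed for the categorical features; it is not the actual implementation.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "a": [2.1, 2.7, 3.0, 2.4, 1.9],
        "neo": ["N", "N", "Y", "N", "Y"],
        "albedo": [0.05, np.nan, 0.21, np.nan, 0.10],
    }
)
target = "albedo"

# One-hot encode the categorical feature; numerical features pass through unchanged.
features = pd.get_dummies(toy.drop(columns=[target]), columns=["neo"], dtype=float)

known = toy[target].notna()
X_fit, y_fit = features[known], toy.loc[known, target]  # rows for training/validation
X_infer = features[~known]  # rows whose target must be predicted

print(X_fit.shape, X_infer.shape)
```

One-hot encoding would also explain why the feature width in the output above (45) is larger than the number of columns in `df`, though that is an inference rather than something the notebook states.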


Next, I create and train the model.

In [36]:
model = mlp.MLP_Albedo(
    n=3,
    num_output_list=[256, 128, 64],
    dropout_list=[0.2, 0.15, 0.1],
    device=mlp.device,
)

model = train_script.train_epoch(
    model,
    device=mlp.device,
    num_epochs=10000,
    learning_rate=1e-2,
    gamma=0.999,
    patience=50,
    root_save_dir="model_dir/model_albedo",
    model_name="resnet",
    train_loader=train_loader,
    valid_loader=valid_loader,
)


Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.



The training process has done!


I repeated this step a couple of times, checking the logs for loss values and overall performance. I also checked the model's predictions on the inference set to see whether they fall within the acceptable range of `albedo` values, which is $[0.0, 1.0]$.

In [37]:
model.load_state_dict(torch.load("model_dir/model_albedo/resnet"))

<All keys matched successfully>

In [39]:
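predictions = []

model.eval()  # switch off dropout so predictions are deterministic
with torch.no_grad():  # gradients aren't needed for inference
    for X, _ in inf_loader:
        X = X.to(mlp.device)
        predictions += model(X).cpu().tolist()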

len(predictions)

1202111

In [38]:
df[target_column].describe()

count    138496.000000
mean          0.130099
std           0.110358
min           0.001000
25%           0.053000
50%           0.078000
75%           0.189000
max           1.000000
Name: albedo, dtype: float64

In [40]:
df.loc[(df["albedo"].isnull()), "albedo"] = predictions
df["albedo"].describe()

count    1.340607e+06
mean     3.300324e-02
std      7.646102e-02
min     -4.001474e-01
25%     -8.701958e-03
50%      2.037045e-02
75%      5.300000e-02
max      1.000000e+00
Name: albedo, dtype: float64

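I check how many of the values are out of range. Because of the `.dropna()` call, this count covers only rows with no other missing values, which at this point means rows whose `diameter` is known.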

In [44]:
df[(df["albedo"] < 0) | (df["albedo"] > 1)].dropna().albedo.count()

528

For now, it's easier to drop the rows with negative predictions and move on. Ideally, the model shouldn't output out-of-range values at all. However, it was trained on only a small fraction of the data (about 10\%) compared to the amount it is predicting, so it struggles to generalize.
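One possible alternative to dropping rows, which I have not applied here, is to train on a logit-transformed target and map predictions back with a sigmoid, so every prediction lands strictly inside $(0, 1)$. The sketch below only illustrates the transform with NumPy; the `raw_outputs` array is made up and nothing in it reflects how `mlp.MLP_Albedo` is built.

```python
import numpy as np


def logit(p, eps=1e-6):
    """Map values in (0, 1) to the real line; eps guards against log(0) at the edges."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))


def sigmoid(z):
    """Inverse of logit: maps any real number back into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


albedo = np.array([0.001, 0.053, 0.189, 1.0])  # sample values spanning the observed range
z = logit(albedo)  # the regression target would be z instead of albedo

raw_outputs = np.array([-12.0, -0.4, 3.0])  # made-up unbounded model outputs
bounded = sigmoid(raw_outputs)  # always strictly between 0 and 1

print(bounded)
```

The trade-off is that errors are then measured in logit space, which weights values near 0 and 1 more heavily than a plain squared error on `albedo` would.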

In [45]:
df.drop(index=df[df.albedo < 0].index, inplace=True)

##### `diameter` column

The same process is carried out for this column as well.

In [124]:
categorical_columns = ["pha", "neo", "condition_code", "class"]
numerical_columns = df.columns.drop(categorical_columns).drop(["first_obs", "last_obs"])
target_column = ["diameter"]
exclude_columns = ["first_obs", "last_obs"]

In [125]:
train_loader, valid_loader, inf_loader = create_dataloader.create_dataloader(
    df,
    numerical_columns,
    categorical_columns,
    target_column,
    exclude_columns,
    2048,
    False,
)

Number of examples for training purposes: 139624
Number of examples for inference purposes: 1200975
Training X shape: torch.Size([129384, 46])
Training y shape: torch.Size([129384, 1])
Validation X shape: torch.Size([10240, 46])
Validation y shape: torch.Size([10240, 1])
Inference X shape: torch.Size([1200975, 46])
Inference y shape: torch.Size([1200975, 1])


In [151]:
model = mlp.MLP_Diameter(
    n=3,
    num_output_list=[256, 128, 64],
    dropout_list=[0.2, 0.15, 0.1],
    device=mlp.device,
)

model = train_script.train_epoch(
    model,
    device=mlp.device,
    num_epochs=1000,
    learning_rate=1e-4,
    gamma=0.99,
    patience=50,
    root_save_dir="model_dir/model_diameter",
    model_name="resnet",
    train_loader=train_loader,
    valid_loader=valid_loader,
)


Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.



The training process has done!


In [152]:
model.load_state_dict(torch.load("model_dir/model_diameter/resnet"))

<All keys matched successfully>

In [153]:
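predictions = []

model.eval()  # switch off dropout so predictions are deterministic
with torch.no_grad():  # gradients aren't needed for inference
    for X, _ in inf_loader:
        X = X.to(mlp.device)
        predictions += model(X).cpu().tolist()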

len(predictions)

1200975

In [154]:
df[target_column].describe()

Unnamed: 0,diameter
count,139624.0
mean,5.458924
std,9.308008
min,0.0025
25%,2.763
50%,3.949
75%,5.731
max,939.4


In [157]:
df.loc[(df["diameter"].isnull()), "diameter"] = predictions
df["diameter"].describe()

count    1.340599e+06
mean     2.968320e+00
std      3.139675e+00
min      2.500000e-03
25%      2.616776e+00
50%      2.807216e+00
75%      2.922971e+00
max      9.394000e+02
Name: diameter, dtype: float64

In [158]:
df.isna().sum()

a                 0
e                 0
i                 0
om                0
w                 0
q                 0
ad                0
per_y             0
data_arc          0
condition_code    0
n_obs_used        0
H                 0
epoch_mjd         0
ma                0
diameter          0
albedo            0
neo               0
pha               0
n                 0
per               0
moid              0
moid_ld           0
class             0
first_obs         0
last_obs          0
dtype: int64

Finally, I save the imputed data so that later data-cleaning steps can load it.

In [159]:
df.to_csv("data/Asteroid_Imputed.csv")
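Since `to_csv` writes the row index by default, the saved file carries the original row labels as an unnamed first column. A minimal sketch of the round trip on made-up data, using an in-memory buffer in place of a file:

```python
import io

import pandas as pd

toy = pd.DataFrame({"a": [2.1, 2.7], "H": [17.2, 16.9]}, index=[3, 8])

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as an "Unnamed: 0" column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # the index is restored as row labels

print(list(naive.columns))
print(list(restored.index))
```

Passing `index_col=0` when reloading keeps the original row labels, which matters here because some rows were dropped and the index is no longer contiguous.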