In [1]:
import os
os.chdir("../")

# Cleaning Data

This notebook deals with **noise**, **outliers**, and **inconsistencies** present in the data.

In [2]:
import pandas as pd
import plotly.express as px

In [3]:
df = pd.read_csv("data/Asteroid_Imputed.csv", low_memory=False, index_col=0)
df.shape

(1340599, 25)

In [4]:
df.columns.values

array(['a', 'e', 'i', 'om', 'w', 'q', 'ad', 'per_y', 'data_arc',
       'condition_code', 'n_obs_used', 'H', 'epoch_mjd', 'ma', 'diameter',
       'albedo', 'neo', 'pha', 'n', 'per', 'moid', 'moid_ld', 'class',
       'first_obs', 'last_obs'], dtype=object)

The columns `pha`, `neo`, `class` and `condition_code` are categorical, so I'll handle them first.

## Categorical Columns

### `pha` and `neo`

These are binary attributes. I'll replace **N** and **Y** with 0 and 1 respectively and set the data type to **int**.

In [5]:
df.pha.value_counts()

pha
N    1338200
Y       2399
Name: count, dtype: int64

In [6]:
df.neo.value_counts()

neo
N    1306649
Y      33950
Name: count, dtype: int64

In [7]:
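# Assign whole columns so they take the new int dtype; df.loc[:, cols] = ... writes into the existing object columns.
df[["neo", "pha"]] = df[["neo", "pha"]].replace({"N": 0, "Y": 1}).astype(int)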
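A more defensive variant of the encoding above checks for unexpected labels before casting. This is a minimal sketch on a made-up toy frame; `toy` and `flag_map` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({"neo": ["N", "Y", "N"], "pha": ["N", "N", "Y"]})

flag_map = {"N": 0, "Y": 1}
for col in ["neo", "pha"]:
    encoded = toy[col].map(flag_map)
    # map() yields NaN for any label missing from flag_map, so check before casting.
    if encoded.isna().any():
        raise ValueError(f"unexpected label in column {col!r}")
    toy[col] = encoded.astype(int)

print(toy.dtypes)
```

Using `map` with an explicit dictionary makes a stray label fail loudly instead of passing through unchanged.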

### `condition_code`

This takes values from **0** to **9**. A condition code of **0** indicates the highest certainty about an object's orbit, and higher codes indicate progressively less certainty. The steps between codes are not evenly spaced in terms of uncertainty, which is why I treat it as an ordinal attribute.

Condition Code | Orbit Longitude runoff
------- | -------
0      | < 1.0 arc seconds
1      | < 4.4 arc seconds
2      | < 19.6 arc seconds
3      | < 1.4 arc minutes
4      | < 6.4 arc minutes
5      | < 28.2 arc minutes
6      | < 2.1 degrees
7      | < 9.2 degrees
8      | < 40.7 degrees
9      | > 40.7 degrees

In [8]:
df.condition_code.value_counts()

condition_code
0    1026990
1      90234
2      59380
5      30619
4      27015
6      24559
3      22834
7      20780
9      19845
8      18342
E          1
Name: count, dtype: int64

Here, one asteroid is mislabelled with a condition code of **E**.

In [9]:
df[df.condition_code == 'E']

Unnamed: 0,a,e,i,om,w,q,ad,per_y,data_arc,condition_code,...,albedo,neo,pha,n,per,moid,moid_ld,class,first_obs,last_obs
764883,2.775,0.2737,13.78,270.1,11.37,2.015,2.51,3.65,6532.0,E,...,0.200889,0,0,0.2132,1330.0,1.24,484.0,MBA,2010-04-29,2010-05-01


I'll replace this value with the most common condition code (the mode) among the rows that share its `neo`, `pha` and `class` values.

In [10]:
df.groupby(["neo", "pha", "class"]).condition_code.apply(lambda x: x.mode().iloc[0])[0, 0, "MBA"]

'0'

So this row should have a condition code of **0**.

In [11]:
df.loc[df.condition_code == 'E', "condition_code"] = "0"
df.condition_code.value_counts()

condition_code
0    1026991
1      90234
2      59380
5      30619
4      27015
6      24559
3      22834
7      20780
9      19845
8      18342
Name: count, dtype: int64
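The group-mode replacement can also be written so that it handles any number of invalid labels and leaves them out when computing the mode. This is a minimal sketch on made-up data that groups by `class` only for brevity; `toy`, `valid` and `group_mode` are hypothetical names.

```python
import pandas as pd

# Toy data: one invalid code ("E") in the MBA group.
toy = pd.DataFrame({
    "class": ["MBA", "MBA", "MBA", "OMB", "OMB"],
    "condition_code": ["0", "0", "E", "1", "1"],
})

# A valid code is a single digit 0-9.
valid = toy["condition_code"].str.fullmatch(r"[0-9]")

# Mode per group, computed from the valid codes only.
group_mode = (
    toy[valid]
    .groupby("class")["condition_code"]
    .agg(lambda s: s.mode().iloc[0])
)

# Fill each invalid row with the mode of its own group.
toy.loc[~valid, "condition_code"] = toy.loc[~valid, "class"].map(group_mode)
print(toy)
```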

Now, I'll convert the dtype of the column to be **int**.

In [12]:
df.condition_code = df.condition_code.astype(int)
df.condition_code.dtype

dtype('int64')
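Because the condition code is ordinal, one alternative to a plain integer column is an ordered categorical, which keeps the ordering but does not invite arithmetic such as means. This is a minimal sketch on made-up values, not what the notebook does; `toy_codes` is a hypothetical name.

```python
import pandas as pd

# Made-up condition codes.
toy_codes = pd.Series([0, 3, 9, 1])

# Declare all ten codes as ordered categories, from most to least certain.
ordered_codes = pd.Series(
    pd.Categorical(toy_codes, categories=list(range(10)), ordered=True)
)

# Order comparisons still work on an ordered categorical.
well_determined = ordered_codes <= 3
print(well_determined.tolist())
print(ordered_codes.max())
```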

### `class`

Asteroids can be classified by several criteria, most commonly by their orbital characteristics or by their spectral properties. The `class` column uses orbit-based classes.

| Class | Orbit                                                     | Spectral Type | Examples              |
| ----- | --------------------------------------------------------- | ------------- | --------------------- |
| MBA   | Main belt (between Mars and Jupiter)                      | Various       | Ceres, Vesta, Juno    |
| OMB   | Outer main belt                                           | Various       | Cybele, Sylvia        |
| MCA   | Mars-crossing                                             | Various       | Aethra                |
| AMO   | Amor (approaches Earth's orbit from outside, no crossing) | Various       | Eros, Ganymed         |
| IMB   | Inner main belt                                           | Various       | Hungaria              |
| TJN   | Jupiter Trojan                                            | Various       | Patroclus, Menelaus   |
| CEN   | Centaur (between Jupiter and Neptune)                     | Icy           | Chiron, Pholus        |
| APO   | Apollo (Earth-crossing, semi-major axis > 1 AU)           | Various       | Bennu, Toutatis       |
| ATE   | Aten (Earth-crossing, semi-major axis < 1 AU)             | Various       | Aten, Apophis         |
| TNO   | Trans-Neptunian                                           | Icy           | Pluto, Eris, Makemake |
| IEO   | Interior to Earth's orbit (Atira)                         | Various       | Atira                 |
| HYA   | Hyperbolic orbit (e > 1)                                  | Various       | ʻOumuamua             |
| AST   | Orbit not matching any other class                        | Various       |                       |

[Source: adapted from a response from Google's Bard](https://g.co/bard/share/0951ce8978a0)

In [13]:
df["class"].value_counts()

class
MBA    1192472
OMB      41141
IMB      28803
MCA      25654
APO      19127
TJN      13114
AMO      12114
TNO       4594
ATE       2677
CEN        742
AST        126
IEO         32
HYA          3
Name: count, dtype: int64

In this study, I consider three groups of asteroid classes: the **Main Belt** (MBA), the **Outer Main Belt** (OMB), and everything else, which I'll lump into a single group called **Others**.

In [14]:
# Keep MBA and OMB; relabel every other class as "Others".
df["class"] = df["class"].where(df["class"].isin(["MBA", "OMB"]), "Others")
df["class"].value_counts()

class
MBA       1192472
Others     106986
OMB         41141
Name: count, dtype: int64

## Numerical Columns

### `a`, `q`, and `ad`

These are, respectively, the **semi-major axis**, **perihelion distance**, and **aphelion distance**, all measured in astronomical units (AU). One AU is defined as the mean Earth-Sun distance.

The **semi-major axis** is half of the major axis, which is the total distance between the closest and farthest points of the asteroid's orbit from the Sun, known as the perihelion (`q`) and aphelion (`ad`), so $a = (q + ad) / 2$. It is often described as the asteroid's average orbital distance from the Sun.
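Since the semi-major axis is half of the major axis, `a` should equal `(q + ad) / 2` for a closed orbit, which gives a simple consistency check between the three columns. This is a minimal sketch on a toy frame: the first two rows are made up, the third reuses the values displayed earlier for row 764883, and the tolerance `rtol=1e-3` is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [2.5, 3.0, 2.775],
    "q": [2.0, 2.4, 2.015],
    "ad": [3.0, 3.6, 2.51],
})

# Semi-major axis implied by the perihelion and aphelion distances.
expected_a = (toy["q"] + toy["ad"]) / 2

# Flag rows whose stored a agrees with the implied value within tolerance.
toy["axis_consistent"] = np.isclose(toy["a"], expected_a, rtol=1e-3)
print(toy)
```

Rows flagged `False` are candidates for a closer look; the check does not say which of the three values is the wrong one.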

In [15]:
df[["a", "q", "ad"]].describe()

Unnamed: 0,a,q,ad
count,1340599.0,1340599.0,1340599.0
mean,2.892441,2.395216,3.391732
std,21.16951,2.090726,19.93217
min,-15820.0,0.07,0.65
25%,2.397,1.972,2.8
50%,2.662,2.238,3.08
75%,3.021,2.588,3.39
max,14510.0,80.538,20162.05


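Near-Earth asteroids are divided into **four** groups based on their **semi-major axis** together with their perihelion distance (`q`) or aphelion distance (`ad`). The cell below assigns a rough label from `a` alone, using 0.983, 1.0 and 1.017 as cut-offs, and does so for every row rather than only for near-Earth asteroids.

| Class  | Description                                   | Semimajor axis (AU) | Other criterion (AU) |
| ------ | --------------------------------------------- | ------------------- | -------------------- |
| Atira  | Entire orbit inside Earth's                   | < 1.0               | ad < 0.983           |
| Aten   | Crosses Earth's orbit, smaller semimajor axis | < 1.0               | ad > 0.983           |
| Apollo | Crosses Earth's orbit, larger semimajor axis  | \> 1.0              | q < 1.017            |
| Amor   | Approaches Earth, doesn't cross               | \> 1.0              | 1.017 < q < 1.3      |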

In [16]:
def classify_neo(a):
    if a < 0.983:
        return "Atira"
    elif a < 1.0:
        return "Aten"
    elif a > 1.017:
        return "Apollo"
    return "Amor"

df["neo_type"] = df.a.map(classify_neo)
df["neo_type"].value_counts()

neo_type
Apollo    1337662
Atira        2508
Amor          225
Aten          204
Name: count, dtype: int64
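The cell above labels every row from `a` alone, which is why almost all asteroids end up as `Apollo`. A classifier closer to the usual group definitions also needs `q` and `ad`. The following is a minimal sketch under stated assumptions: the 0.983 and 1.017 AU cut-offs are the ones used in this notebook, the 1.3 AU perihelion limit for near-Earth objects is the commonly quoted definition, `classify_neo_group` is a hypothetical name, and the toy orbits are made up.

```python
import pandas as pd

def classify_neo_group(a, q, ad):
    """Label one orbit from semi-major axis, perihelion and aphelion (all in AU)."""
    if q >= 1.3:
        # Perihelion too far out to count as a near-Earth object.
        return "Not NEO"
    if a < 1.0:
        # Orbit smaller than Earth's: Atira if it stays inside, Aten if it crosses.
        return "Atira" if ad < 0.983 else "Aten"
    # Orbit larger than Earth's: Apollo if it crosses, Amor if it only approaches.
    return "Apollo" if q < 1.017 else "Amor"

# Made-up orbits, one per expected label.
toy = pd.DataFrame({
    "a": [0.74, 0.97, 1.47, 1.46, 2.77],
    "q": [0.50, 0.79, 0.65, 1.13, 2.55],
    "ad": [0.98, 1.14, 2.29, 1.78, 2.98],
})
toy["neo_group"] = [
    classify_neo_group(a, q, ad)
    for a, q, ad in zip(toy["a"], toy["q"], toy["ad"])
]
print(toy)
```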

### `albedo`

In astronomy, the geometric albedo of a celestial body is the ratio of its actual brightness as seen from the light source (i.e., at zero phase angle) to that of an idealized flat, fully reflecting, diffusely scattering (Lambertian) disk with the same cross-section. Albedo is measured on a scale of zero to one, zero representing a surface that reflects no light, and one representing an object that reflects all incoming light.

In [17]:
df["albedo"].describe()

count    1.340599e+06
mean     1.409916e-01
std      5.176200e-02
min      5.831197e-04
25%      1.144613e-01
50%      1.370182e-01
75%      1.599091e-01
max      1.000000e+00
Name: albedo, dtype: float64

Geometric albedo gives a rough indication of an asteroid's spectral type.

| Asteroid Type | Albedo Range | Major Components                                |
| ------------- | ------------ | ----------------------------------------------- |
| C-type        | 0.03 - 0.10  | Carbon, silicates, water ice, organic compounds |
| M-type        | 0.10 - 0.30  | Nickel-iron, iron sulfide                       |
| S-type        | 0.10 - 0.25  | Silicates, iron-nickel                          |

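I'll create a categorical variable indicating the probable composition of an asteroid from its albedo: values of 0.1 or below are labelled `carbonaceous` and everything above is labelled `metallic`. Because the M-type and S-type ranges overlap, albedo alone cannot separate them, so the `metallic` label also covers silicate-rich S-type asteroids.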

In [18]:
df["composition"] = df.albedo.map(lambda x: "carbonaceous" if x <= 0.1 else "metallic")
df["composition"].value_counts()

composition
metallic        1143291
carbonaceous     197308
Name: count, dtype: int64
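If more than two bands are wanted, `pd.cut` can bin the albedo directly. This is a minimal sketch on made-up values, not the notebook's method: the bin edges follow the table above, M-type and S-type are merged because their ranges overlap, and the label names are hypothetical.

```python
import pandas as pd

# Made-up albedo values.
toy_albedo = pd.Series([0.05, 0.10, 0.20, 0.50])

bands = pd.cut(
    toy_albedo,
    bins=[0.0, 0.10, 0.30, 1.0],          # right-inclusive edges
    labels=["C-like", "M/S-like", "above table range"],
    include_lowest=True,                   # keep an albedo of exactly 0.0
)
print(bands.tolist())
```

As in the cell above, an albedo of exactly 0.10 falls in the lower band, and values below the table's 0.03 lower bound are also counted as `C-like`.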

### `e` 

The **eccentricity** of an asteroid's orbit measures how much the orbit deviates from a perfect circle.

In [19]:
df.e.describe()

count    1.340599e+06
mean     1.582108e-01
std      9.366209e-02
min      0.000000e+00
25%      9.310000e-02
50%      1.472000e-01
75%      2.035000e-01
max      1.201100e+00
Name: e, dtype: float64

The shape of an orbit follows from its eccentricity:

1. Circular orbit: $e=0$

2. Elliptic orbit: $0 < e < 1$

3. Parabolic trajectory: $e = 1$

4. Hyperbolic trajectory: $e > 1$

I'll create a categorical column recording each asteroid's orbital shape based on its eccentricity, using a tolerance `eps` around the boundary values 0 and 1.

In [20]:
def classify_orbital_shapes(e, eps = 1e-2):
    if e < eps:
        return "Circular"
    elif eps < e < 1 - eps:
        return "Elliptic"
    elif e < 1 + eps:
        return "Parabolic"
    return "Hyperbolic"

df["orbital_shape"] = df.e.map(classify_orbital_shapes)
df["orbital_shape"].value_counts()

orbital_shape
Elliptic      1336493
Circular         4004
Parabolic         100
Hyperbolic          2
Name: count, dtype: int64
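One detail of the cell above: a value exactly equal to `eps` (for example `e = 0.01`) fails both `e < eps` and the strict test `eps < e`, so it falls through to `Parabolic`. A variant with half-open intervals avoids that gap. This is a minimal sketch with a hypothetical function name, and it would change the counts shown above if applied to the data.

```python
def classify_orbit_shape(e, eps=1e-2):
    """Classify an orbit by eccentricity using half-open intervals."""
    if e < eps:
        return "Circular"      # [0, eps)
    if e < 1 - eps:
        return "Elliptic"      # [eps, 1 - eps)
    if e <= 1 + eps:
        return "Parabolic"     # [1 - eps, 1 + eps]
    return "Hyperbolic"        # (1 + eps, inf)

# Made-up eccentricities, including the boundary value e == eps.
samples = [0.0, 0.01, 0.5, 1.0, 1.2011]
shapes = [classify_orbit_shape(e) for e in samples]
print(shapes)
```

Because each branch is only reached when the previous test failed, every eccentricity lands in exactly one interval.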

In [21]:
df.to_csv("data/Asteroid_Cleaned.csv")