In [1]:
import os
os.chdir("../")

# Cleaning Data

This notebook deals with **noise**, **outliers**, and **inconsistencies** present in the data.

In [2]:
import pandas as pd
import plotly.express as px

In [3]:
df = pd.read_csv("data/Asteroid_Imputed.csv", low_memory=False, index_col=0)
df.shape

(1340599, 25)

In [4]:
df.columns.values

array(['a', 'e', 'i', 'om', 'w', 'q', 'ad', 'per_y', 'data_arc',
       'condition_code', 'n_obs_used', 'H', 'epoch_mjd', 'ma', 'diameter',
       'albedo', 'neo', 'pha', 'n', 'per', 'moid', 'moid_ld', 'class',
       'first_obs', 'last_obs'], dtype=object)

I know that `pha`, `neo`, `class` and `condition_code` are categorical columns.

## Categorical Columns

### `pha` and `neo`

These are binary attributes. I'll replace **N** and **Y** with 0 and 1 respectively, and set the data type to **int**.

In [5]:
df.pha.value_counts()

pha
N    1338200
Y       2399
Name: count, dtype: int64

In [6]:
df.neo.value_counts()

neo
N    1306649
Y      33950
Name: count, dtype: int64

In [7]:
# Assign whole columns rather than df.loc[:, ...] so the int dtype replaces the original object dtype.
df[["neo", "pha"]] = df[["neo", "pha"]].replace({"N": 0, "Y": 1}).astype(int)
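The replacement above assumes the two columns contain nothing but **N** and **Y**. As a minimal sketch of a stricter variant, the snippet below runs on a small made-up frame (`toy` and `flag_map` are hypothetical names, not part of the notebook) and raises an error if any other label shows up.

```python
import pandas as pd

# Toy frame standing in for the asteroid data (values are illustrative).
toy = pd.DataFrame({"neo": ["N", "Y", "N"], "pha": ["N", "N", "Y"]})

flag_map = {"N": 0, "Y": 1}  # hypothetical name for the N/Y encoding
for col in ["neo", "pha"]:
    encoded = toy[col].map(flag_map)
    # map() yields NaN for any label missing from flag_map, so this check
    # fails loudly instead of silently passing unexpected labels through.
    if encoded.isna().any():
        raise ValueError(f"unexpected values in {col!r}")
    toy[col] = encoded.astype(int)

print(toy.dtypes)
```

Using `map` with an explicit check is a design choice: `replace` leaves unknown labels untouched, which would only surface later when `astype(int)` fails.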

### `condition_code`

This takes on values from **0** to **9**. An orbital condition code of **0** means the asteroid's orbit is known with the highest certainty. The higher the condition code, the less confident we are about the orbit. However, the steps between consecutive codes don't correspond to equal changes in certainty, which is why it's an ordinal attribute.

Condition Code | Orbit longitude runoff (per decade)
------- | -------
0      | < 1.0 arc seconds
1      | < 4.4 arc seconds
2      | < 19.6 arc seconds
3      | < 1.4 arc minutes
4      | < 6.4 arc minutes
5      | < 28.2 arc minutes
6      | < 2.1 degrees
7      | < 9.2 degrees
8      | < 40.7 degrees
9      | > 40.7 degrees
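Because the codes are ordinal, one option is to store them as an ordered categorical. This is only a sketch on made-up values, not what this notebook does (it converts the column to **int** later); `code_dtype` and `codes` are hypothetical names.

```python
import pandas as pd

# Ordered categorical: 0 is the most certain orbit, 9 the least certain.
code_dtype = pd.CategoricalDtype(categories=list(range(10)), ordered=True)

# Illustrative condition codes, not taken from the dataset.
codes = pd.Series([0, 3, 9, 1], dtype="int64").astype(code_dtype)

# Order comparisons are meaningful for an ordered categorical.
well_determined = codes <= 2
print(well_determined.tolist())

# min() and max() respect the declared order as well.
print(codes.min(), codes.max())
```

An ordered categorical keeps comparisons and sorting available while signalling that the spacing between codes is not uniform. Plain integers are simpler when a downstream model expects numeric input.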

In [8]:
df.condition_code.value_counts()

condition_code
0    1026990
1      90234
2      59380
5      30619
4      27015
6      24559
3      22834
7      20780
9      19845
8      18342
E          1
Name: count, dtype: int64

Here, one asteroid is mislabeled with a condition code of **E**.

In [9]:
df[df.condition_code == 'E']

Unnamed: 0,a,e,i,om,w,q,ad,per_y,data_arc,condition_code,...,albedo,neo,pha,n,per,moid,moid_ld,class,first_obs,last_obs
764883,2.775,0.2737,13.78,270.1,11.37,2.015,2.51,3.65,6532.0,E,...,0.200889,0,0,0.2132,1330.0,1.24,484.0,MBA,2010-04-29,2010-05-01
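Reading `value_counts()` by eye works for ten codes. As a hedged sketch of a programmatic check, the snippet below flags every value that is not a number between 0 and 9; `raw_codes` holds made-up values and is a hypothetical name.

```python
import pandas as pd

# Illustrative raw codes as strings, including one invalid label.
raw_codes = pd.Series(["0", "1", "E", "9", "3"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
parsed = pd.to_numeric(raw_codes, errors="coerce")

# Valid codes fall between 0 and 9; between() evaluates to False for NaN,
# so unparseable labels end up flagged as invalid.
invalid = ~parsed.between(0, 9)
print(raw_codes[invalid].tolist())
```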


I'll replace this value with an appropriate one: I'll group the rows by `neo`, `pha` and `class`, then use the mode of the group this asteroid belongs to.

In [10]:
df.groupby(["neo", "pha", "class"]).condition_code.apply(lambda x: x.mode().iloc[0])[0, 0, "MBA"]

'0'

So this row should have a condition code of **0**.
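With a single bad row, looking up one group by hand is enough. If there were several, the same idea could be applied to all of them at once. The sketch below uses a made-up frame (`toy`, `bad`, `valid` and `group_mode` are hypothetical names) and computes the group mode from valid rows only.

```python
import pandas as pd

# Toy frame standing in for the asteroid data (values are illustrative).
toy = pd.DataFrame({
    "neo": [0, 0, 0, 0, 1, 1],
    "pha": [0, 0, 0, 0, 0, 0],
    "class": ["MBA", "MBA", "MBA", "MBA", "APO", "APO"],
    "condition_code": ["0", "0", "1", "E", "2", "2"],
})

bad = toy["condition_code"] == "E"

# Mask the bad labels as NaN so they cannot influence the mode
# (Series.mode() ignores NaN by default).
valid = toy["condition_code"].where(~bad)

# Broadcast each group's mode back to every row of that group.
# Note: a group made up entirely of bad rows would raise here.
group_mode = valid.groupby(
    [toy["neo"], toy["pha"], toy["class"]]
).transform(lambda s: s.mode().iloc[0])

# Overwrite only the bad rows; assignment aligns on the index.
toy.loc[bad, "condition_code"] = group_mode[bad]
print(toy["condition_code"].tolist())
```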

In [11]:
df.loc[df.condition_code == 'E', "condition_code"] = "0"
df.condition_code.value_counts()

condition_code
0    1026991
1      90234
2      59380
5      30619
4      27015
6      24559
3      22834
7      20780
9      19845
8      18342
Name: count, dtype: int64

Now, I'll convert the column's dtype to **int**.

In [12]:
df.condition_code = df.condition_code.astype(int)
df.condition_code.dtype

dtype('int64')

### `class`

Asteroids are classified based on various criteria, most commonly their orbital characteristics and spectral properties. The `class` column holds the orbit class; here *a* is the semi-major axis and *e* the eccentricity.

| Class | Orbit                                           | Spectral Type | Examples                       |
| ----- | ----------------------------------------------- | ------------- | ------------------------------ |
| MBA   | Main belt (Mars-Jupiter)                        | Various       | Vesta, Ceres, Juno             |
| OMB   | Outer main belt                                 | Various       | Cybele, Hilda                  |
| MCA   | Mars-crossing                                   | Various       | 132 Aethra                     |
| AMO   | Amor (near-Earth, stays outside Earth's orbit)  | Various       | 433 Eros, 1036 Ganymed         |
| IMB   | Inner main belt                                 | Various       | 434 Hungaria                   |
| TJN   | Jupiter Trojan                                  | Various       | Patroclus, Menelaus            |
| CEN   | Centaur (Jupiter-Neptune)                       | Icy           | Chiron, Pholus                 |
| APO   | Apollo (Earth-crossing, a > 1 AU)               | Various       | 1862 Apollo, Toutatis, Bennu   |
| ATE   | Aten (Earth-crossing, a < 1 AU)                 | Various       | 2062 Aten, Apophis             |
| TNO   | Trans-Neptunian                                 | Icy           | Pluto, Eris, Makemake          |
| IEO   | Interior Earth Object (inside Earth's orbit)    | Various       | 163693 Atira                   |
| HYA   | Hyperbolic asteroid (e > 1)                     | Various       | 1I/ʻOumuamua                   |
| AST   | Other (matches no defined orbit class)          | Various       |                                |

[Source: Response from Google's Bard](https://g.co/bard/share/0951ce8978a0), with orbit descriptions and examples checked against the JPL Small-Body Database orbit classes.

In [15]:
df["class"].value_counts()

class
MBA    1192472
OMB      41141
IMB      28803
MCA      25654
APO      19127
TJN      13114
AMO      12114
TNO       4594
ATE       2677
CEN        742
AST        126
IEO         32
HYA          3
Name: count, dtype: int64

With only 3 rows, **HYA** has too few data points. So, I'll remove them from the dataset and not consider them in my analysis.

In [18]:
df.drop(index=df[df["class"] == "HYA"].index, inplace=True)
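The drop above targets one class by name. As a rough sketch of a more general rule, rare classes could be filtered with a minimum row count instead. The cutoff `min_rows` and the frame `toy` are hypothetical and made up for illustration; the notebook itself only removes **HYA**.

```python
import pandas as pd

# Toy frame standing in for the asteroid data (counts are illustrative).
toy = pd.DataFrame({"class": ["MBA"] * 5 + ["APO"] * 3 + ["HYA"]})

min_rows = 2  # hypothetical cutoff, chosen only for this example
counts = toy["class"].value_counts()
rare = counts[counts < min_rows].index

# Keep every row whose class meets the cutoff; the original index is kept.
toy = toy[~toy["class"].isin(rare)]
print(toy["class"].value_counts().to_dict())
```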