In [1]:
import os
os.chdir("../")

# Data Cleaning

This notebook covers data cleaning: filling in missing values, handling noise, resolving inconsistencies, and so on.

## Loading Dataset

In [2]:
import pandas as pd
import plotly.express as px

In [3]:
df = pd.read_csv("data/Asteroid_Updated.csv", low_memory=False)
print(f"Number of (rows, columns) = {df.shape}")

Number of (rows, columns) = (839714, 31)


In [4]:
df.sample(3)

Unnamed: 0,name,a,e,i,om,w,q,ad,per_y,data_arc,...,UB,IR,spec_B,spec_T,G,moid,class,n,per,ma
419273,,2.4366,0.146305,3.527073,104.025651,350.643126,2.080114,2.793086,3.80351,4436.0,...,,,,,,1.09705,MBA,0.259136,1389.232151,142.535564
748950,,3.159899,0.095921,12.304661,41.62114,66.149572,2.856799,3.462999,5.617175,2190.0,...,,,,,,1.90082,MBA,0.175467,2051.673219,288.300996
781001,,1.894457,0.092908,23.51799,148.660291,247.662815,1.718446,2.070468,2.607566,3621.0,...,,,,,,0.7813,IMB,0.377987,952.413517,185.758794


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 839714 entries, 0 to 839713
Data columns (total 31 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   name            21967 non-null   object 
 1   a               839712 non-null  float64
 2   e               839714 non-null  float64
 3   i               839714 non-null  float64
 4   om              839714 non-null  float64
 5   w               839714 non-null  float64
 6   q               839714 non-null  float64
 7   ad              839708 non-null  float64
 8   per_y           839713 non-null  float64
 9   data_arc        824240 non-null  float64
 10  condition_code  838847 non-null  object 
 11  n_obs_used      839714 non-null  int64  
 12  H               837025 non-null  float64
 13  neo             839708 non-null  object 
 14  pha             823272 non-null  object 
 15  diameter        137636 non-null  object 
 16  extent          18 non-null      object 
 17  albedo    

## Missing Values

This section deals with handling missing values.

### Identify Missing Columns

In this subsection, I identify which columns have missing values and what percentage of each column is missing. I visualize these statistics in a bar plot, then chart a course for handling the different levels of missingness.

In [6]:
# Count missing values per column; the vectorized isna() avoids a slow row-wise apply.
missing = pd.DataFrame(
    df.isna().sum().sort_values(ascending=True)
).reset_index()

missing.rename(columns={0: "Missing", "index": "Column"}, inplace=True)
missing["Percent"] = missing["Missing"] / df.shape[0] * 100

In [7]:
fig = px.bar(missing[missing.Missing > 0], x="Column", y="Percent", text="Missing")
fig.update_layout(
    height=600,
    width=800,
    title_x=0.5,
    title_text="Bar Chart<br><sup>Missing Values of each column</sup>"
)
fig.show()

**Observation 1**

More than 97\% of the values in `name` and in every column from `rot_per` to `IR` are missing. Predicting them from the existing values would be hard because there is too little data to learn from.

    The best way to deal with these columns is to drop them. If I learn of a better way to handle these missing values, I'll revisit them later.

In [8]:
missing[missing.Percent > 90]

Unnamed: 0,Column,Missing,Percent
21,name,817747,97.38399
22,rot_per,820918,97.761619
23,spec_B,838048,99.801599
24,BV,838693,99.878411
25,spec_T,838734,99.883294
26,UB,838735,99.883413
27,G,839595,99.985829
28,extent,839696,99.997856
29,GM,839700,99.998333
30,IR,839713,99.999881


**Observation 2**

About 84\% of the `diameter` and `albedo` values are missing.

    Predicting them with a machine learning model should be possible from the roughly 16\% of values that are available. I'll use a simple deep learning model to do this.

In [9]:
missing[missing.Column.isin(["diameter", "albedo"])]

Unnamed: 0,Column,Missing,Percent
19,diameter,702078,83.609181
20,albedo,703305,83.755302
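
Before any model can be trained on `diameter`, the rows have to be split into those with a known value and those to be filled in. The following is a minimal sketch of that preparation step on a made-up toy frame, not the model itself. The feature list is an assumption chosen only for illustration. The `pd.to_numeric` step is included because `df.info()` reports `diameter` as `object` rather than a numeric dtype.

```python
import pandas as pd

# Toy stand-in for the asteroid table; all values are made up for illustration.
toy = pd.DataFrame(
    {
        "a": [2.4, 3.1, 1.9, 2.7, 2.2],
        "e": [0.14, 0.09, 0.09, 0.20, 0.05],
        "H": [16.2, 15.1, 17.8, 14.9, 16.0],
        "diameter": ["3.2", None, "1.1", None, "2.5"],  # stored as text (object dtype)
    }
)

# Coerce the text column to floats; entries that cannot be parsed become NaN.
toy["diameter"] = pd.to_numeric(toy["diameter"], errors="coerce")

features = ["a", "e", "H"]  # assumed predictors, chosen only for this sketch

known = toy[toy["diameter"].notna()]    # rows usable for training
unknown = toy[toy["diameter"].isna()]   # rows whose diameter must be predicted

X_train, y_train = known[features], known["diameter"]
X_predict = unknown[features]

print(f"{len(known)} rows to train on, {len(unknown)} rows to fill in")
```

Any regressor can then be fitted on `X_train` and `y_train` and applied to `X_predict`.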


**Observation 3**

Some columns have absolutely no missing values. 

    Nothing needs to be done for these columns. I'll use them to help impute the missing values in other columns.

In [10]:
missing[missing.Missing == 0]

Unnamed: 0,Column,Missing,Percent
0,e,0,0.0
1,i,0,0.0
2,om,0,0.0
3,w,0,0.0
4,q,0,0.0
5,class,0,0.0
6,n_obs_used,0,0.0


**Observation 4**

Most columns are missing $<5\%$ of their values.

    These can be filled in using imputation techniques. For numerical columns, I'll use imputation by group median. For categorical columns, I'll impute by group mode.

In [11]:
missing[(missing.Percent < 5) & (missing.Missing > 0)]

Unnamed: 0,Column,Missing,Percent
7,per_y,1,0.000119
8,a,2,0.000238
9,n,2,0.000238
10,ad,6,0.000715
11,neo,6,0.000715
12,per,6,0.000715
13,ma,8,0.000953
14,condition_code,867,0.103249
15,H,2689,0.320228
16,data_arc,15474,1.84277
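
Group-wise imputation can be sketched on a small made-up frame as follows. Grouping by `class` is an assumption for this example (it is one of the columns with no missing values), and the toy values are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one numerical and one categorical column containing gaps.
toy = pd.DataFrame(
    {
        "class": ["MBA", "MBA", "MBA", "IMB", "IMB", "IMB"],
        "H": [16.0, np.nan, 18.0, 14.0, 15.0, np.nan],
        "neo": ["N", "N", None, "Y", None, "Y"],
    }
)

# Numerical column: fill each gap with the median of its own group.
group_median = toy.groupby("class")["H"].transform("median")
toy["H"] = toy["H"].fillna(group_median)

# Categorical column: fill each gap with the most frequent value of its group.
# Note: a group with no observed values at all would make mode() empty and
# .iloc[0] fail, so such groups need separate handling.
group_mode = toy.groupby("class")["neo"].transform(lambda s: s.mode().iloc[0])
toy["neo"] = toy["neo"].fillna(group_mode)

print(toy)
```

Using `transform` keeps the result aligned with the original index, so `fillna` only touches the rows that were missing.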


### Dropping Columns

In this subsection, I drop the columns that have more than 90\% of their values missing.

In [13]:
df.drop(
    columns=[
        "name",
        "rot_per",
        "spec_B",
        "spec_T",
        "G",
        "BV",
        "UB",
        "IR",
        "GM",
        "extent",
    ],
    inplace=True,
)

print(f"After dropping, dataframe shape = {df.shape}")

After dropping, dataframe shape = (839714, 21)
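
The list of columns above is written out by hand. As an alternative, the list could be derived from the 90\% threshold itself, so that it stays in sync if the data changes. A minimal sketch on a made-up toy frame (the names `threshold`, `to_drop`, and `trimmed` are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame: one complete column, one lightly missing, one almost empty.
toy = pd.DataFrame(
    {
        "e": np.linspace(0.1, 0.3, 20),
        "H": [np.nan] + [15.0] * 19,    # 5% missing
        "IR": [np.nan] * 19 + [-0.3],   # 95% missing
    }
)

threshold = 90  # percent of missing values above which a column is dropped

# isna().mean() gives the fraction of missing values per column.
percent_missing = toy.isna().mean() * 100
to_drop = percent_missing[percent_missing > threshold].index.tolist()

# drop() returns a new frame here, leaving the original untouched.
trimmed = toy.drop(columns=to_drop)

print(f"Dropped {to_drop}; shape after dropping = {trimmed.shape}")
```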
