# Step 4.2: Statistics

What does the classification distribution look like? How are the Bus Classes and the corresponding Main Groups distributed? In total we have about 1,340 spectra that shall be used for training, validation and testing. It is likely that we cannot use the Bus Classes and have to fall back on the Main Groups instead. Otherwise the algorithm might become too biased towards the few well-populated classes.

In [1]:
# Import standard libraries
import os

# Import installed libraries
import ipywidgets
from matplotlib import pyplot as plt
import pandas as pd

In [2]:
# Let's mount the Google Drive, where we store files and models (if applicable, otherwise work
# locally)
try:
    from google.colab import drive
    drive.mount('/gdrive')
    core_path = "/gdrive/MyDrive/Colab/asteroid_taxonomy/"
except ModuleNotFoundError:
    core_path = ""

In [3]:
# Read the data
asteroids_df = pd.read_pickle(os.path.join(core_path, "data/lvl2/", "asteroids.pkl"))

## Taking a simple look...

In [4]:
asteroids_df

| | Name | Bus_Class | SpectrumDF | Main_Group |
|---|---|---|---|---|
| 0 | 1 Ceres | C | Wavelength_in_microm Reflectance_norm550n... | C |
| 1 | 2 Pallas | B | Wavelength_in_microm Reflectance_norm550n... | C |
| 2 | 3 Juno | Sk | Wavelength_in_microm Reflectance_norm550n... | S |
| 3 | 4 Vesta | V | Wavelength_in_microm Reflectance_norm550n... | Other |
| 4 | 5 Astraea | S | Wavelength_in_microm Reflectance_norm550n... | S |
| ... | ... | ... | ... | ... |
| 1334 | 1996 UK | Sq | Wavelength_in_microm Reflectance_norm550n... | S |
| 1335 | 1996 VC | S | Wavelength_in_microm Reflectance_norm550n... | S |
| 1336 | 1997 CZ5 | S | Wavelength_in_microm Reflectance_norm550n... | S |
| 1337 | 1997 RD1 | Sq | Wavelength_in_microm Reflectance_norm550n... | S |


## Print some descriptive statistics

In [5]:
# Simple description
asteroids_df[["Bus_Class", "Main_Group"]].describe()

| | Bus_Class | Main_Group |
|---|---|---|
| count | 1339 | 1339 |
| unique | 25 | 4 |
| top | S | S |
| freq | 383 | 549 |


In [6]:
# Statistics for the Bus Class
asteroids_df.groupby(["Main_Group", "Bus_Class"])["Bus_Class"].agg(["count"])

| Main_Group | Bus_Class | count |
|---|---|---|
| C | B | 60 |
| C | C | 141 |
| C | Cb | 33 |
| C | Cg | 9 |
| C | Cgh | 15 |
| C | Ch | 138 |
| Other | A | 16 |
| Other | D | 9 |
| Other | K | 31 |
| Other | L | 34 |


In [7]:
# Statistics for the Main Group
asteroids_df.groupby(['Main_Group'])["Main_Group"].agg(['count'])

| Main_Group | count |
|---|---|
| C | 396 |
| Other | 157 |
| S | 549 |
| X | 237 |
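
Absolute counts are easier to judge as relative shares. The following is a minimal sketch that re-enters the four Main Group counts from the output above by hand (so it runs without the pickle file) and derives the percentage per group and the ratio between the largest and smallest group. The variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Main Group counts copied from the groupby output above
main_group_counts = pd.Series(
    {"C": 396, "Other": 157, "S": 549, "X": 237}, name="count"
)

# Relative share of each Main Group in percent
main_group_share = (main_group_counts / main_group_counts.sum() * 100).round(1)
print(main_group_share)

# Ratio between the most and the least populated group
imbalance_ratio = main_group_counts.max() / main_group_counts.min()
print(f"Largest / smallest group: {imbalance_ratio:.2f}")
```

With the full dataframe loaded, `asteroids_df["Main_Group"].value_counts(normalize=True)` should give the same shares as fractions.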


## Summary

Simple descriptive statistics show that some Bus Classes are severely under-represented, e.g., the O- and R-Class with 1 and 4 spectra, respectively. Since we want to perform a classification, such a severe imbalance in the training, validation and test data should be avoided.
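
One way to make "under-represented" concrete is to flag every class below a minimum number of spectra. The sketch below uses only the Bus Class counts quoted in this notebook (the rows shown in the groupby output plus the O- and R-Class counts from the summary), so it covers a subset of the 25 classes. The threshold `MIN_SPECTRA` is a hypothetical value chosen for illustration, not a number from the original analysis.

```python
import pandas as pd

# Bus Class counts quoted in this notebook (subset of all 25 classes)
bus_class_counts = pd.Series(
    {
        "B": 60, "C": 141, "Cb": 33, "Cg": 9, "Cgh": 15, "Ch": 138,
        "A": 16, "D": 9, "K": 31, "L": 34, "O": 1, "R": 4,
    },
    name="count",
)

# Hypothetical threshold: classes with fewer spectra are considered too rare
MIN_SPECTRA = 10

# Select the rare classes and sort them from rarest to most common
rare_classes = bus_class_counts[bus_class_counts < MIN_SPECTRA].sort_values()
print(rare_classes)
```

On the full dataframe, the same filter could be applied to `asteroids_df["Bus_Class"].value_counts()`.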

Going forward, we use only the Main Group classification. However, we keep the Bus Class data in case we want to take a deeper look later, after the classification part.
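
Even the Main Groups are not evenly populated, so a later split into training and test data would ideally preserve the group proportions. The following is a rough sketch of such a stratified split using only pandas. It builds a toy label column from the Main Group counts above instead of loading the real spectra, and both the 20 % test fraction and the random seed are assumptions made for illustration. The actual split used later in this project may be implemented differently.

```python
import pandas as pd

# Toy stand-in for asteroids_df: one label per spectrum, sizes from the table above
counts = {"C": 396, "Other": 157, "S": 549, "X": 237}
labels_df = pd.DataFrame(
    {"Main_Group": [group for group, n in counts.items() for _ in range(n)]}
)

# Assumed test fraction and seed (not taken from the original notebook)
TEST_FRACTION = 0.2
SEED = 42

# Draw the same fraction from every Main Group, then use the rest for training
test_df = labels_df.groupby("Main_Group").sample(frac=TEST_FRACTION, random_state=SEED)
train_df = labels_df.drop(test_df.index)

# Compare the class shares of both subsets
print(train_df["Main_Group"].value_counts(normalize=True).round(3))
print(test_df["Main_Group"].value_counts(normalize=True).round(3))
```

Sampling per group keeps the share of each Main Group nearly identical in both subsets, which a plain random split does not guarantee for the smaller groups.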