# **ETL**

## Objectives

* Extract the raw dataset, transform it, and load the cleaned dataset.

## Inputs

* The initial input is the asteroid dataset downloaded from Kaggle. As this is a very large dataset, I will work with a much smaller random subset (a 5% sample), which is still substantial at around 48,000 rows.

## Outputs

* The output will be a cleaned and encoded file. This involves dropping a number of fields (for instance, ID fields), handling missing values where necessary, and encoding variables.



---

# Change working directory

* The notebooks are stored in a subfolder, so when a notebook is run in the editor, the working directory must be changed to the project root.

We need to change the working directory from its current folder to its parent folder.
* os.getcwd() returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\mikee\\Desktop\\CapstonePreparation\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\mikee\\Desktop\\CapstonePreparation'
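
One caveat of the approach above is that running the os.chdir cell a second time moves up another level. A minimal sketch of a guard against that follows. The helper name `move_to_project_root` is hypothetical, and the sketch assumes the notebooks folder is called `jupyter_notebooks`, as in the path shown above.

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move to the parent directory only while inside the notebooks subfolder.

    Returns the working directory after the call.
    """
    current_dir = os.getcwd()
    # Only move up when we are still inside the notebooks subfolder,
    # so running the cell twice does not climb a second level.
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling `move_to_project_root()` any number of times leaves the working directory at the project root, provided the folder name assumption holds.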

# Extraction

### Import Libraries

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

As the dataset is very large, I will first take a random subset and save it to a CSV file. The original dataset will be listed in .gitignore, as it is too large to push to GitHub.


In [9]:
# Save a random 5% subset of Asteroid_Dataset to a new CSV file.
# Commented out after the first run, as it does not need to run again.
# df = pd.read_csv('data/Raw/Asteroid_Dataset.csv')
# df_sample = df.sample(frac=0.05, random_state=101)
# df_sample.to_csv('data/Raw/Asteroid_Sample_Dataset.csv', index=False)
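
As an alternative to commenting the sampling cell out, the step could be skipped automatically when the sample file already exists. This is only a sketch: the helper `ensure_sample` is a hypothetical name, and the defaults simply mirror the `frac` and `random_state` values used above.

```python
import os

import pandas as pd


def ensure_sample(raw_path, sample_path, frac=0.05, random_state=101):
    """Create the sample CSV once; later calls reuse the existing file.

    Returns True if the sample was written, False if it already existed.
    """
    if os.path.exists(sample_path):
        return False  # sample already on disk, nothing to do
    df_full = pd.read_csv(raw_path)
    df_sample = df_full.sample(frac=frac, random_state=random_state)
    df_sample.to_csv(sample_path, index=False)
    return True
```

In this notebook the call would be `ensure_sample('data/Raw/Asteroid_Dataset.csv', 'data/Raw/Asteroid_Sample_Dataset.csv')`, which keeps the notebook runnable top-down without editing cells between runs.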

Next, create a dataframe from the raw sample.

In [8]:
#load the sample dataset
df = pd.read_csv('data/Raw/Asteroid_Sample_Dataset.csv')
df.head(), df.shape

(         id    spkid            full_name        pdes name prefix neo pha  \
 0  a0346924  2346924   346924 (2009 YR22)      346924  NaN    NaN   N   N   
 1  bK04VD6H  3995840         (2004 VH136)  2004 VH136  NaN    NaN   N   N   
 2  a0140610  2140610    140610 (2001 UG5)      140610  NaN    NaN   N   N   
 3  bK14D43M  3666578          (2014 DM43)   2014 DM43  NaN    NaN   N   N   
 4  bK15H45T  3853059          (2015 HT45)   2015 HT45  NaN    NaN   N   N   
 
         H  diameter  ...   sigma_i  sigma_om   sigma_w  sigma_ma  \
 0  18.100       NaN  ...  0.000008  0.000129  0.000133  0.000025   
 1  16.982       NaN  ...  0.000011  0.000032  0.000049  0.000037   
 2  15.200     2.892  ...  0.000006  0.000008  0.000013  0.000009   
 3  18.492       NaN  ...  0.005861  0.001193  0.344360  0.670270   
 4  18.500       NaN  ...  0.000011  0.000095  0.000114  0.000072   
 
        sigma_ad       sigma_n  sigma_tp  sigma_per  class      rms  
 0  2.108900e-08  3.548600e-09  0.000088   0

In [10]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47926 entries, 0 to 47925
Data columns (total 45 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              47926 non-null  object 
 1   spkid           47926 non-null  int64  
 2   full_name       47926 non-null  object 
 3   pdes            47926 non-null  object 
 4   name            1042 non-null   object 
 5   prefix          4 non-null      object 
 6   neo             47926 non-null  object 
 7   pha             46941 non-null  object 
 8   H               47636 non-null  float64
 9   diameter        6742 non-null   float64
 10  albedo          6685 non-null   float64
 11  diameter_sigma  6737 non-null   float64
 12  orbit_id        47926 non-null  object 
 13  epoch           47926 non-null  float64
 14  epoch_mjd       47926 non-null  int64  
 15  epoch_cal       47926 non-null  float64
 16  equinox         47926 non-null  object 
 17  e               47926 non-null 

| | spkid | H | diameter | albedo | diameter_sigma | epoch | epoch_mjd | epoch_cal | e | a | ... | sigma_q | sigma_i | sigma_om | sigma_w | sigma_ma | sigma_ad | sigma_n | sigma_tp | sigma_per | rms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 47926.0 | 47636.0 | 6742.0 | 6685.0 | 6737.0 | 47926.0 | 47926.0 | 47926.0 | 47926.0 | 47926.0 | ... | 46941.0 | 46941.0 | 46941.0 | 46941.0 | 46941.0 | 46941.0 | 46941.0 | 46941.0 | 46941.0 | 47926.0 |
| mean | 3818407.0 | 16.902491 | 5.53924 | 0.131158 | 0.473669 | 2458873.0 | 58872.379856 | 20197020.0 | 0.156213 | 2.930959 | ... | 29.20377 | 0.787479 | 2.028642 | 924222.4 | 924044.2 | 25.63852 | 0.0284379 | 263240900.0 | 70120.78 | 0.558041 |
| std | 6864509.0 | 1.77013 | 10.230165 | 0.110567 | 0.478214 | 688.4447 | 688.444678 | 18933.36 | 0.093042 | 9.84616 | ... | 4307.123 | 68.32666 | 68.35221 | 150121500.0 | 150095300.0 | 4280.933 | 0.8811618 | 42841200000.0 | 11052390.0 | 0.103558 |
| min | 2000004.0 | 3.0 | 0.008 | 0.001 | 0.001 | 2448566.0 | 48565.0 | 19911100.0 | 1.7e-05 | 0.555418 | ... | 1.9286e-09 | 2.1706e-07 | 3.8808e-07 | 1.7893e-07 | 9.5226e-07 | 5.4038e-10 | 8.2311e-11 | 1.6837e-06 | 2.6384e-07 | 0.00351 |
| 25% | 2239225.0 | 16.0 | 2.783 | 0.053 | 0.18 | 2459000.0 | 59000.0 | 20200530.0 | 0.091853 | 2.390434 | ... | 1.4658e-07 | 6.0877e-06 | 3.6357e-05 | 5.7534e-05 | 2.5779e-05 | 2.3528e-08 | 2.7738e-09 | 0.00011132 | 1.8083e-05 | 0.51785 |
| 50% | 2479192.0 | 16.9 | 4.0175 | 0.079 | 0.331 | 2459000.0 | 59000.0 | 20200530.0 | 0.145279 | 2.647835 | ... | 2.2598e-07 | 8.6809e-06 | 6.6467e-05 | 0.00010508 | 4.8974e-05 | 4.3279e-08 | 4.6367e-09 | 0.00022181 | 3.4721e-05 | 0.56629 |
| 75% | 3750910.0 | 17.71125 | 5.81275 | 0.191 | 0.625 | 2459000.0 | 59000.0 | 20200530.0 | 0.200672 | 2.997342 | ... | 6.3993e-07 | 1.5815e-05 | 0.00016127 | 0.00030362 | 0.00016523 | 1.1661e-07 | 1.0953e-08 | 0.00078019 | 9.5611e-05 | 0.613357 |
| max | 54017230.0 | 30.9 | 525.4 | 1.0 | 9.9 | 2459000.0 | 59000.0 | 20200530.0 | 0.997396 | 1711.799446 | ... | 865700.0 | 13917.0 | 7240.9 | 30846000000.0 | 30840000000.0 | 919270.0 | 101.72 | 8829900000000.0 | 2357800000.0 | 2.653 |


In [11]:
df.isna().sum()

id                    0
spkid                 0
full_name             0
pdes                  0
name              46884
prefix            47922
neo                   0
pha                 985
H                   290
diameter          41184
albedo            41241
diameter_sigma    41189
orbit_id              0
epoch                 0
epoch_mjd             0
epoch_cal             0
equinox               0
e                     0
a                     0
q                     0
i                     0
om                    0
w                     0
ma                    0
ad                    0
n                     0
tp                    0
tp_cal                0
per                   0
per_y                 0
moid                985
moid_ld               3
sigma_e             985
sigma_a             985
sigma_q             985
sigma_i             985
sigma_om            985
sigma_w             985
sigma_ma            985
sigma_ad            985
sigma_n             985
sigma_tp        
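
The raw counts above are easier to act on as shares of the 47,926 rows: name, prefix, diameter, albedo and diameter_sigma are each missing in more than 85% of rows. The sketch below shows one way to surface such columns. The helper `missing_summary` and the threshold values are assumptions for illustration, not a decision already made in this notebook, and the small frame is made-up data.

```python
import pandas as pd


def missing_summary(df, threshold=0.8):
    """Return the share of missing values per column and the columns above a threshold."""
    missing_share = df.isna().mean().sort_values(ascending=False)
    mostly_missing = missing_share[missing_share > threshold].index.tolist()
    return missing_share, mostly_missing


# Tiny made-up frame for illustration; in the notebook this would be `df`.
example = pd.DataFrame({
    "name": [None, None, None, None, "A"],
    "H": [18.1, 16.9, None, 18.5, 15.2],
    "neo": ["N", "N", "N", "Y", "N"],
})
share, candidates = missing_summary(example, threshold=0.7)
print(share)       # missing share per column, largest first
print(candidates)  # columns whose missing share exceeds the threshold
```

Columns flagged this way are candidates for dropping; columns with only a few missing values, such as H or pha, would more likely call for row removal or imputation.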

In [12]:
df["pha"].value_counts()

pha
N    46828
Y      113
Name: count, dtype: int64

In [13]:
df["neo"].value_counts()

neo
N    46794
Y     1132
Name: count, dtype: int64

---


# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [7]:
# import os
# try:
#   # create your folder here
#   # os.makedirs(name='')
# except Exception as e:
#   print(e)
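
A runnable variant of the template cell above might look like the following sketch. It assumes the cleaned dataframe is written as a CSV file; the helper name `save_dataframe` and any folder or file name passed to it are hypothetical, since the notebook does not yet name an output location.

```python
import os

import pandas as pd


def save_dataframe(df, folder, filename):
    """Create the folder if needed and write the dataframe as CSV; returns the file path."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)
    return path
```

Using `exist_ok=True` removes the need for the try/except block, because an existing folder is no longer treated as an error.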
