# **ETL2**

## Objectives

* Perform ETL on the Near-Earth Objects (NEO) dataset.

## Inputs

* Kaggle Link 

## Outputs

* `data/Processed/neo_updated.csv`: the raw data with an added `observations` column.
* `neo_profile_report.html`: a ydata-profiling report, saved in the processed-data folder.

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Change working directory

* The notebooks are stored in a subfolder, so the working directory must be changed before running them in the editor.

We need to change the working directory from the notebook's folder to its parent folder.
* `os.getcwd()` returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\mikee\\Desktop\\CapstonePreparation\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` returns the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\mikee\\Desktop\\CapstonePreparation'
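The cells above move up one level every time they run, so re-running them walks further up the folder tree. A minimal sketch of a guard follows, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path printed earlier). `project_root` is a hypothetical helper name and the example paths are illustrative.

```python
from pathlib import Path


def project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of `cwd` if `cwd` is the notebooks folder, else `cwd` unchanged."""
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Illustrative paths only; they do not need to exist on disk
print(project_root(Path("/home/user/CapstonePreparation/jupyter_notebooks")))
print(project_root(Path("/home/user/CapstonePreparation")))
```

In the notebook this could be used as `os.chdir(project_root(Path.cwd()))`, which leaves the directory untouched on a second run.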

# Extraction

Import Libraries

In [4]:
import pandas as pd
import numpy as np      
import matplotlib.pyplot as plt
import seaborn as sns   
sns.set_style("whitegrid")


Load Raw data file

In [5]:
df = pd.read_csv('data/Raw/neo.csv')
df.shape, df.head()

((90836, 10),
         id                 name  est_diameter_min  est_diameter_max  \
 0  2162635  162635 (2000 SS164)          1.198271          2.679415   
 1  2277475    277475 (2005 WK4)          0.265800          0.594347   
 2  2512244   512244 (2015 YE18)          0.722030          1.614507   
 3  3596030          (2012 BV13)          0.096506          0.215794   
 4  3667127          (2014 GE35)          0.255009          0.570217   
 
    relative_velocity  miss_distance orbiting_body  sentry_object  \
 0       13569.249224   5.483974e+07         Earth          False   
 1       73588.726663   6.143813e+07         Earth          False   
 2      114258.692129   4.979872e+07         Earth          False   
 3       24764.303138   2.543497e+07         Earth          False   
 4       42737.733765   4.627557e+07         Earth          False   
 
    absolute_magnitude  hazardous  
 0               16.73      False  
 1               20.00       True  
 2               17.83      

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90836 entries, 0 to 90835
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  90836 non-null  int64  
 1   name                90836 non-null  object 
 2   est_diameter_min    90836 non-null  float64
 3   est_diameter_max    90836 non-null  float64
 4   relative_velocity   90836 non-null  float64
 5   miss_distance       90836 non-null  float64
 6   orbiting_body       90836 non-null  object 
 7   sentry_object       90836 non-null  bool   
 8   absolute_magnitude  90836 non-null  float64
 9   hazardous           90836 non-null  bool   
dtypes: bool(2), float64(5), int64(1), object(2)
memory usage: 5.7+ MB


In [7]:
df.isna().sum()

id                    0
name                  0
est_diameter_min      0
est_diameter_max      0
relative_velocity     0
miss_distance         0
orbiting_body         0
sentry_object         0
absolute_magnitude    0
hazardous             0
dtype: int64

Examine columns

In [32]:
# Percentage of hazardous vs. non-hazardous observations
df["hazardous"].value_counts(normalize=True) * 100

hazardous
False    90.268176
True      9.731824
Name: proportion, dtype: float64

Our principal hypotheses will examine correlations between hazardous status and the other features.
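One way to get a first look at such correlations is to cast the boolean target to 0/1 and compute Pearson correlations against the numeric columns (for a binary variable this is the point-biserial correlation). The sketch below uses a tiny made-up frame rather than `neo.csv`, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Illustrative values only; not taken from neo.csv
sample = pd.DataFrame({
    "est_diameter_max": [0.2, 0.5, 1.6, 2.7, 0.1, 3.0],
    "absolute_magnitude": [24.0, 21.0, 18.0, 16.5, 26.0, 16.0],
    "hazardous": [False, False, True, True, False, True],
})

# Cast the target to 0/1, keep numeric columns, and read off the target's column
corr_with_target = (
    sample.assign(hazardous=sample["hazardous"].astype(int))
    .select_dtypes(include="number")
    .corr()["hazardous"]
    .drop("hazardous")
    .sort_values()
)
print(corr_with_target)
```

Applied to `df`, the same chain would skip the text columns (`name`, `orbiting_body`) because of `select_dtypes`.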

In [9]:
df["sentry_object"].value_counts()

sentry_object
False    90836
Name: count, dtype: int64

In [10]:
df["orbiting_body"].value_counts()

orbiting_body
Earth    90836
Name: count, dtype: int64

Both the `sentry_object` and `orbiting_body` columns contain a single value, so they can be safely dropped.
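The notebook does not show the drop itself. A minimal sketch of one way to do it is to detect every single-valued column and remove them together. The small frame below is a stand-in for `df`, and `constant_cols` and `trimmed` are illustrative names.

```python
import pandas as pd

# Stand-in for df with the two constant columns described above
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "orbiting_body": ["Earth", "Earth", "Earth"],
    "sentry_object": [False, False, False],
    "hazardous": [False, True, False],
})

# A column with one distinct value carries no information for analysis
constant_cols = [col for col in sample.columns if sample[col].nunique(dropna=False) == 1]
trimmed = sample.drop(columns=constant_cols)

print(constant_cols)
print(trimmed.columns.tolist())
```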

In [11]:
df.duplicated().sum()

0

In [12]:
df["name"].nunique(), df.shape[0]

(27423, 90836)

In [13]:
# Add a column holding the number of observations (rows) per name
# df["name"].value_counts() gives the per-name counts that map() looks up
df["observations"] = df["name"].map(df["name"].value_counts())
df.shape, df.head()

((90836, 11),
         id                 name  est_diameter_min  est_diameter_max  \
 0  2162635  162635 (2000 SS164)          1.198271          2.679415   
 1  2277475    277475 (2005 WK4)          0.265800          0.594347   
 2  2512244   512244 (2015 YE18)          0.722030          1.614507   
 3  3596030          (2012 BV13)          0.096506          0.215794   
 4  3667127          (2014 GE35)          0.255009          0.570217   
 
    relative_velocity  miss_distance orbiting_body  sentry_object  \
 0       13569.249224   5.483974e+07         Earth          False   
 1       73588.726663   6.143813e+07         Earth          False   
 2      114258.692129   4.979872e+07         Earth          False   
 3       24764.303138   2.543497e+07         Earth          False   
 4       42737.733765   4.627557e+07         Earth          False   
 
    absolute_magnitude  hazardous  observations  
 0               16.73      False             1  
 1               20.00       True   
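The `map(value_counts())` pattern above attaches each name's row count to every row with that name. As a sanity check, a `groupby(...).transform("count")` should produce the same column. The sketch below compares the two on a toy frame with made-up names.

```python
import pandas as pd

sample = pd.DataFrame({"name": ["A", "B", "A", "C", "A"]})

# Approach used in the notebook: look up each name in the count table
via_map = sample["name"].map(sample["name"].value_counts())

# Equivalent alternative: broadcast the group size back to each row
via_transform = sample.groupby("name")["name"].transform("count")

print(via_map.tolist())
print(via_transform.tolist())
```

Note that `"count"` ignores missing values, so the two approaches agree here only because `name` has no nulls.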

In [17]:
# Save the updated dataframe to a new CSV file
df.to_csv('data/Processed/neo_updated.csv', index=False)
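`to_csv` raises an error if the target folder does not exist. A small sketch of guarding against that with `os.makedirs(..., exist_ok=True)` is shown below. A temporary directory stands in for the project root and the two-row frame is made up.

```python
import os
import tempfile

import pandas as pd

sample = pd.DataFrame({"id": [1, 2], "hazardous": [False, True]})

# A temporary directory stands in for the project root in this sketch
with tempfile.TemporaryDirectory() as root:
    out_dir = os.path.join(root, "data", "Processed")
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder already exists
    out_path = os.path.join(out_dir, "neo_updated.csv")
    sample.to_csv(out_path, index=False)
    round_trip = pd.read_csv(out_path)  # read back to confirm the write

print(round_trip.shape)
```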


In [25]:
neo_agg = (
    df.groupby(["id", "name"], as_index=False)
    .agg({
        "est_diameter_min": "mean",
        "est_diameter_max": "mean",
        "relative_velocity": "mean",
        "miss_distance": ["mean", "min"],  # both the mean and the minimum
        "absolute_magnitude": "mean",
        "hazardous": "max",
        "observations": "first"
    })
)

# Flatten the MultiIndex column names
neo_agg.columns = [
    "_".join(col).strip("_") for col in neo_agg.columns.to_flat_index()
]
neo_agg.shape, neo_agg.head()

((27423, 10),
         id                    name  est_diameter_min_mean  \
 0  2000433      433 Eros (A898 PA)              23.043847   
 1  2000719    719 Albert (A911 TB)               2.044349   
 2  2001036  1036 Ganymed (A924 UB)              37.892650   
 3  2001566   1566 Icarus (1949 MA)               1.427431   
 4  2001580  1580 Betulia (1950 KA)               3.065879   
 
    est_diameter_max_mean  relative_velocity_mean  miss_distance_mean  \
 0              51.527608            19682.887099        3.754117e+07   
 1               4.571303            27551.597194        4.258288e+07   
 2              84.730541            51496.923293        5.372124e+07   
 3               3.191832           104242.329527        4.609560e+07   
 4               6.855513           107171.338891        4.413007e+07   
 
    miss_distance_min  absolute_magnitude_mean  hazardous_max  \
 0       2.672952e+07                    10.31          False   
 1       4.258288e+07                    1
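The dict-based `agg` above produces MultiIndex columns that then need flattening. An alternative, sketched here on a toy frame with made-up values, is pandas named aggregation, which sets flat output names directly. The column names mirror the pattern in the output above, but the data is illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "id": [1, 1, 2],
    "name": ["A", "A", "B"],
    "miss_distance": [10.0, 30.0, 5.0],
    "hazardous": [False, True, False],
})

# Each keyword is an output column: (input column, aggregation function)
agg = sample.groupby(["id", "name"], as_index=False).agg(
    miss_distance_mean=("miss_distance", "mean"),
    miss_distance_min=("miss_distance", "min"),
    hazardous_max=("hazardous", "max"),
)
print(agg)
```

With this form the flattening step is not needed, at the cost of spelling out every output name.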

In [43]:
# Show the share of observations belonging to hazardous and non-hazardous objects as a percentage
obs_summary = (
    neo_agg["observations_first"].groupby(neo_agg["hazardous_max"]).sum() /
    neo_agg["observations_first"].sum() * 100
)
obs_summary

hazardous_max
False    90.268176
True      9.731824
Name: observations_first, dtype: float64
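The percentages above weight each object by its number of observations, which is why they match the row-level split computed earlier. Counting each unique object once can give a different figure. The sketch below contrasts the two on a toy aggregate with made-up numbers.

```python
import pandas as pd

# Toy stand-in for neo_agg; values are illustrative
agg = pd.DataFrame({
    "hazardous_max": [False, False, True],
    "observations_first": [1, 2, 5],
})

# Share of observations (rows in the raw data), as in the cell above
by_observation = (
    agg.groupby("hazardous_max")["observations_first"].sum()
    / agg["observations_first"].sum() * 100
)

# Share of unique objects, each counted once
by_object = agg["hazardous_max"].value_counts(normalize=True) * 100

print(by_observation.to_dict())
print(by_object.round(2).to_dict())
```

On the real data, `neo_agg["hazardous_max"].value_counts(normalize=True) * 100` would give the per-object split.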

In [15]:
# Create a ydata-profiling report
# Required installing setuptools and ipywidgets, and upgrading ydata-profiling
from ydata_profiling import ProfileReport

#df = pd.read_csv("Data/Raw/neo.csv")
profile = ProfileReport(df, title="neo Data Profile Report", explorative=True)
profile.to_file(output_file="data/Processed/neo_profile_report.html")
profile.to_notebook_iframe()


---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down: no cell should require going back to an earlier cell to perform a task, such as refreshing a variable's contents.

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [16]:
# import os
# try:
#   # create your folder here
#   # os.makedirs(name='')
# except Exception as e:
#   print(e)
