# **ETL**
### (Extraction, Transformation & Loading)

## Objectives

* Extract and validate the raw NEO dataset (`Data/Raw/neo.csv`).
* Clean and standardise fields (dates, numeric types), handle missing values and duplicates, and normalise units where necessary (e.g., distances in km).
* Engineer modelling features such as average estimated diameter, `observation_count`, `miss_distance_km`, `relative_velocity_km_s`, `absolute_magnitude`, and a binary `is_hazardous` label.
* Produce reproducible, versioned processed datasets for modelling and visualisation and record a short data-validation report.
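
The cleaning and feature-engineering objectives above can be sketched as a single function. This is a minimal sketch, not the notebook's actual implementation: the raw column names (`id`, `est_diameter_min`, `est_diameter_max`, `relative_velocity`, `miss_distance`, `absolute_magnitude`, `hazardous`), the assumed units (velocity in km/h, distance already in km), the feature name `est_diameter_avg`, and the choice to drop incomplete rows rather than impute them are all assumptions to check against the real `neo.csv`.

```python
import pandas as pd

# Assumed raw column names; adjust to match the real file.
NUMERIC_COLS = ["est_diameter_min", "est_diameter_max",
                "relative_velocity", "miss_distance", "absolute_magnitude"]


def clean_and_engineer(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean a raw NEO table and add the modelling features listed above."""
    # Remove exact duplicate rows and work on a copy of the input.
    df = raw.drop_duplicates().copy()

    # Coerce numeric columns; unparseable values become NaN instead of raising.
    for col in NUMERIC_COLS:
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # One possible missing-value policy: drop rows lacking any numeric input.
    df = df.dropna(subset=NUMERIC_COLS)

    # Average of the minimum and maximum estimated diameter.
    df["est_diameter_avg"] = (df["est_diameter_min"] + df["est_diameter_max"]) / 2

    # Number of remaining rows recorded for each object id.
    df["observation_count"] = df.groupby("id")["id"].transform("count")

    # Assumption: miss_distance is already in km, relative_velocity is in km/h.
    df["miss_distance_km"] = df["miss_distance"]
    df["relative_velocity_km_s"] = df["relative_velocity"] / 3600

    # Map True/False (boolean or text) to a 1/0 label.
    df["is_hazardous"] = (
        df["hazardous"].astype(str).str.lower().eq("true").astype(int)
    )
    return df.reset_index(drop=True)
```

If imputation is preferred over dropping rows, the `dropna` step is the place to change, and the reason should be recorded alongside it.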

## Inputs

* `Data/Raw/neo.csv` — primary raw dataset (source: Kaggle / JPL CNEOS).
* `requirements.txt` — to confirm the runtime environment and required packages.
* A ydata-profiling report for preliminary analysis of the data.

## Outputs

* `Data/Processed/neo_clean.csv` — cleaned and typed dataset ready for downstream analysis.
* `Data/Processed/neo_features.csv` — dataset with engineered features used by `Modelling.ipynb` and `Visualisation.ipynb`.
* Summary of validation checks, missingness, and transformation notes.
* Summary statistics and basic visualisations for the cleaned datasets.
* Clearly annotated code, using either Markdown cells or Python comments.

## Additional Comments

* Do not overwrite `Data/Raw/neo.csv`; write all outputs to `Data/Processed/`.
* Document every transformation, including why an imputation or filter was applied, either inline or in Markdown, so that results are reproducible and auditable.
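
One way to enforce the rule about never writing into the raw folder is a small save helper. The sketch below is illustrative only: `save_processed`, `RAW_DIR` and `PROCESSED_DIR` are hypothetical names, and the paths are resolved relative to the current working directory.

```python
from pathlib import Path

import pandas as pd

RAW_DIR = Path("Data/Raw")              # read-only by convention
PROCESSED_DIR = Path("Data/Processed")  # all outputs go here


def save_processed(df: pd.DataFrame, filename: str,
                   out_dir: Path = PROCESSED_DIR) -> Path:
    """Write df as CSV under out_dir, refusing to write inside the raw folder."""
    out_dir = Path(out_dir)
    target = (out_dir / filename).resolve()

    # Guard: the target must not sit anywhere below Data/Raw.
    if RAW_DIR.resolve() in target.parents:
        raise ValueError(f"Refusing to write into the raw data folder: {target}")

    out_dir.mkdir(parents=True, exist_ok=True)
    df.to_csv(target, index=False)
    return target
```

In the notebook this might be called as `save_processed(df_clean, "neo_clean.csv")`, where `df_clean` stands for whatever the cleaned DataFrame is named.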



---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's folder to its parent folder.
* `os.getcwd()` returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\mikee\\Desktop\\Near-Earth-Asteroid-Analysis\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` returns the parent directory of the path it is given
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\mikee\\Desktop\\Near-Earth-Asteroid-Analysis'
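
Re-running the cell that calls `os.chdir` would move up one more level, out of the project folder. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path printed above); `move_to_project_root` is a hypothetical helper name, not part of the original notebook:

```python
import os
from pathlib import Path


def move_to_project_root(notebook_folder: str = "jupyter_notebooks") -> Path:
    """Change to the parent directory only when currently inside the notebook folder."""
    cwd = Path.cwd()
    # Only step up if we are still inside the notebooks subfolder,
    # so calling this more than once has no further effect.
    if cwd.name == notebook_folder:
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling `move_to_project_root()` once or several times leaves the working directory at the project root either way.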

# Extraction

Import Libraries

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

Load the raw data file
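
A minimal sketch of this loading step, assuming the working directory is the project root so that the relative path `Data/Raw/neo.csv` resolves. `load_raw` and `RAW_PATH` are hypothetical names introduced for illustration; the only library call used is `pd.read_csv`.

```python
from pathlib import Path

import pandas as pd

RAW_PATH = Path("Data/Raw/neo.csv")


def load_raw(path: Path = RAW_PATH) -> pd.DataFrame:
    """Read the raw CSV into a DataFrame, failing early if the file is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; check that the working directory is the project root"
        )
    df = pd.read_csv(path)
    # Basic extraction check: report the size of what was loaded.
    print(f"Loaded {len(df)} rows and {df.shape[1]} columns from {path}")
    return df
```

In the notebook, `df_raw = load_raw()` followed by `df_raw.head()` and `df_raw.info()` would give a first look at the columns and dtypes.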

---


NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook cells should run top-down: do not create a workflow in which, at some point, you have to go back to an earlier cell to execute a task, such as refreshing a variable's contents.

---

# Conclusions