# Level 1
## Task 2: Data Cleaning and Preprocessing

In this task, we will:
- Load a raw dataset from `data/raw/`
- Automatically detect dataset type (tabular, time series, or text)
- Clean and preprocess the data:
  - Handle missing values
  - Remove outliers
  - Normalize numerical features
- Save the cleaned dataset to `data/cleaned/`
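
The notebook does not show how the dataset type is detected. As a rough sketch only, one heuristic could look like the following. The function name `detect_dataset_type`, the `min_avg_chars` threshold, and the rules themselves are assumptions for illustration, not the logic used in `cleaner.py`.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def detect_dataset_type(df: pd.DataFrame, min_avg_chars: int = 50) -> str:
    """Hypothetical heuristic: return 'time series', 'text', or 'tabular'."""
    # A datetime index or any datetime column suggests a time series.
    if isinstance(df.index, pd.DatetimeIndex):
        return "time series"
    if any(is_datetime64_any_dtype(df[col]) for col in df.columns):
        return "time series"

    # Count non-numeric columns whose values are long strings on average.
    long_text_cols = 0
    for col in df.columns:
        if is_numeric_dtype(df[col]):
            continue
        avg_chars = df[col].dropna().astype(str).str.len().mean()
        if avg_chars > min_avg_chars:
            long_text_cols += 1

    # If at least half of the columns hold long free-form text, call it text.
    if df.shape[1] > 0 and long_text_cols / df.shape[1] >= 0.5:
        return "text"

    # Everything else is treated as a plain table.
    return "tabular"
```

A purely numeric frame such as the housing data used below would fall through to `"tabular"` under these rules.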

---

## Setup and Imports

Here we configure the project root path and import the necessary utility functions for data cleaning.

In [None]:
import os
import sys

# Detect paths
notebook_dir = os.getcwd()  # e.g., root/notebooks/
root_dir = os.path.abspath(os.path.join(notebook_dir, ".."))  # e.g., root/

# Show path info
print("Notebook dir:", notebook_dir)
print("Root dir:", root_dir)

# Ensure root_dir is in sys.path for module imports
if root_dir not in sys.path:
    sys.path.append(root_dir)
    print(f"Added {root_dir} to sys.path")
else:
    print(f"✅ {root_dir} already in sys.path")

# Show current sys.path
print("📂 Current sys.path:")
for p in sys.path:
    print("  ", p)


from Level1_Basic.Task2_DataCleaning.cleaner import (
    load_raw_data,
    clean_and_preprocess,
    save_cleaned_data,
    generate_cleaning_summary
)


Notebook dir: e:\CODveda\codveda-internship\notebooks
Root dir: e:\CODveda\codveda-internship
✅ e:\CODveda\codveda-internship already in sys.path
📂 Current sys.path:
   e:\CODveda
   C:\Users\Kabelo Matlakala\AppData\Local\Programs\Python\Python311\python311.zip
   C:\Users\Kabelo Matlakala\AppData\Local\Programs\Python\Python311\DLLs
   C:\Users\Kabelo Matlakala\AppData\Local\Programs\Python\Python311\Lib
   C:\Users\Kabelo Matlakala\AppData\Local\Programs\Python\Python311
   e:\CODveda\codveda-internship\codveda-env
   
   e:\CODveda\codveda-internship\codveda-env\Lib\site-packages
   e:\CODveda\codveda-internship\codveda-env\Lib\site-packages\win32
   e:\CODveda\codveda-internship\codveda-env\Lib\site-packages\win32\lib
   e:\CODveda\codveda-internship\codveda-env\Lib\site-packages\Pythonwin
   e:\CODveda\codveda-internship


## Load Raw Dataset

We will now load a dataset from the `data/raw/` directory.

If the file is space-delimited or has no headers, the loader function will handle it automatically.
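
The implementation of `load_raw_data` is not shown here. A minimal sketch of the fallback idea, assuming a hypothetical helper `load_table` and a file without a header row, might be:

```python
import pandas as pd


def load_table(path, column_names=None) -> pd.DataFrame:
    """Hypothetical loader: try comma-separated first, then whitespace-separated."""
    df = pd.read_csv(path)

    # A single parsed column often means the delimiter was not a comma.
    if df.shape[1] == 1:
        # Assumption: a whitespace-delimited file has no header row.
        retry = pd.read_csv(path, sep=r"\s+", header=None)
        if retry.shape[1] > 1:
            df = retry
            # Apply caller-supplied names only when the count matches.
            if column_names is not None and len(column_names) == df.shape[1]:
                df.columns = list(column_names)
    return df
```

Whether the first row should be read as data (`header=None`) or as column names depends on the file, so check the resulting row count against the source.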


## Raw Data Preview

Let's take a quick look at the raw, unprocessed data.

In [32]:
# --- Load Raw Data ---
filename = "house_prediction.csv"  # Replace with your actual file in data/raw
df_raw = load_raw_data(filename)

print(f"Raw data loaded: {filename} (shape={df_raw.shape})")
display(df_raw.head())

⚠️ Auto-split space-separated values in house_prediction.csv → shape=(505, 14)
✅ Applied column names for house_prediction.csv
Loaded raw data from e:\CODveda\codveda-internship\data\raw\house_prediction.csv (shape=(505, 14))
Raw data loaded: house_prediction.csv (shape=(505, 14))


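| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 1 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 2 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 3 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |
| 4 | 0.02985 | 0.0 | 2.18 | 0 | 0.458 | 6.43 | 58.7 | 6.0622 | 3 | 222.0 | 18.7 | 394.12 | 5.21 | 28.7 |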


## Clean & Preprocess

We'll now run the cleaning pipeline on the raw dataset. The process includes:
- Imputing missing values
- Removing statistical outliers (Z-score)
- Normalizing numerical values
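
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the body of `clean_and_preprocess`: the function name `clean_numeric`, median imputation, the threshold of 3 standard deviations, and standardization to zero mean and unit variance are all choices made for this example.

```python
import numpy as np
import pandas as pd


def clean_numeric(df: pd.DataFrame, z_thresh: float = 3.0) -> pd.DataFrame:
    """Hypothetical pipeline: median imputation, Z-score filtering, standardization."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns

    # 1. Impute missing numeric values with each column's median.
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())

    # 2. Drop rows where any numeric value has |z| > z_thresh.
    #    Constant columns (std == 0) give NaN z-scores and never trigger a drop.
    std = out[num_cols].std(ddof=0).replace(0, np.nan)
    z = (out[num_cols] - out[num_cols].mean()) / std
    is_outlier = (z.abs() > z_thresh).any(axis=1)
    out = out.loc[~is_outlier].reset_index(drop=True)

    # 3. Standardize the remaining rows; constant columns become 0.0.
    std = out[num_cols].std(ddof=0).replace(0, np.nan)
    out[num_cols] = ((out[num_cols] - out[num_cols].mean()) / std).fillna(0.0)
    return out
```

Because a row is dropped when any single column exceeds the threshold, a dataset with many columns can lose a sizeable share of its rows.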

In [33]:
# --- Clean & Preprocess ---
df_clean = clean_and_preprocess(df_raw)

print(f"Cleaned data (shape={df_clean.shape})")
display(df_clean.head())

Cleaned data (shape=(414, 14))


| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.496564 | -0.487068 | -0.557095 | 0.0 | -0.70807 | 0.252183 | 0.410033 | 0.567526 | -0.818617 | -0.943926 | -0.316899 | 0.440548 | -0.487823 | -0.092857 |
| 1 | -0.496568 | -0.487068 | -0.557095 | 0.0 | -0.70807 | 1.501743 | -0.224054 | 0.567526 | -0.818617 | -0.943926 | -0.316899 | 0.361627 | -1.278277 | 1.529041 |
| 2 | -0.495531 | -0.487068 | -1.277731 | 0.0 | -0.807684 | 1.195895 | -0.769084 | 1.124906 | -0.696488 | -1.068054 | 0.105255 | 0.396531 | -1.446886 | 1.368089 |
| 3 | -0.488038 | -0.487068 | -1.277731 | 0.0 | -0.807684 | 1.439592 | -0.469852 | 1.124906 | -0.696488 | -1.068054 | 0.105255 | 0.440548 | -1.077183 | 1.714754 |
| 4 | -0.496045 | -0.487068 | -1.277731 | 0.0 | -0.807684 | 0.266903 | -0.309549 | 1.124906 | -0.696488 | -1.068054 | 0.105255 | 0.386641 | -1.095745 | 0.786187 |


## Cleaning Summary

Here's a quick summary of what changed during cleaning:
- Rows removed due to outliers
- Any changes in shape
- Final dataset type
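
A summary like this can be derived from the shapes of the two frames alone. The sketch below uses a hypothetical `summarize_cleaning` helper and takes the dataset type as an argument; how `generate_cleaning_summary` actually computes its values is not shown in this notebook.

```python
import pandas as pd


def summarize_cleaning(df_before: pd.DataFrame, df_after: pd.DataFrame,
                       dataset_type: str = "tabular") -> dict:
    """Hypothetical summary: compare row and column counts before and after."""
    rows_before, cols_before = df_before.shape
    rows_after, cols_after = df_after.shape
    return {
        "Dataset type": dataset_type,
        "Rows before cleaning": rows_before,
        "Rows after cleaning": rows_after,
        "Columns before cleaning": cols_before,
        "Columns after cleaning": cols_after,
        # Assumes every dropped row was dropped by the outlier filter.
        "Outlier rows removed": rows_before - rows_after,
    }
```

Note the assumption in the last entry: if the pipeline also dropped rows for other reasons (duplicates, unparseable values), the difference in row counts would overstate the number of outliers.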

In [None]:
# --- Generate Summary ---
summary = generate_cleaning_summary(df_raw, df_clean)
print("Cleaning Summary:")
for k, v in summary.items():
    print(f"- {k}: {v}")

Cleaning Summary:
- Dataset type: tabular
- Rows before cleaning: 505
- Rows after cleaning: 414
- Columns before cleaning: 14
- Columns after cleaning: 14
- Outlier rows removed: 91


### 💾 Save Cleaned Data

We now save the cleaned dataset to the `data/cleaned/` folder for use in further analysis or modeling.
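
For reference, a save step of this kind can be written with `pathlib` and `DataFrame.to_csv`. The helper name `save_dataframe` and its `output_dir` parameter are assumptions for this sketch; the signature of `save_cleaned_data` beyond the `filename` keyword used below is not shown.

```python
from pathlib import Path

import pandas as pd


def save_dataframe(df: pd.DataFrame, filename: str,
                   output_dir: str = "data/cleaned") -> Path:
    """Hypothetical saver: write df as CSV under output_dir and return the path."""
    out_dir = Path(output_dir)
    # Create the target folder (and any parents) if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    # index=False keeps the row index out of the file, avoiding an extra
    # unnamed column when the CSV is read back.
    df.to_csv(out_path, index=False)
    return out_path
```

Returning the path lets the caller print the exact location that was written instead of rebuilding it by hand.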

### 📊 Cleaned Data Preview

A sample of the final cleaned dataset appears in the Clean & Preprocess output above; the cell below writes that dataset to disk.

In [None]:
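# --- Save Cleaned Data ---
# Assumption: output_dir_clean is not defined earlier in this notebook;
# "data/cleaned" matches the path printed in the output below.
output_dir_clean = "data/cleaned"
cleaned_filename = filename.replace(".csv", "_cleaned.csv")
save_cleaned_data(df_clean, filename=cleaned_filename)
print(f"Cleaned data saved to: {os.path.join(output_dir_clean, cleaned_filename)}")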

Cleaned data saved at e:\CODveda\codveda-internship\data\cleaned\house_prediction_cleaned.csv
Cleaned data saved to: data/cleaned\house_prediction_cleaned.csv
