# **02. Feature Engineering**
*This notebook will focus on adding extra features like saleYear, saleMonth, etc., derived from the saledate column.*


## Objectives

* Engineer time-based features (such as `saleYear` and `saleMonth`) from the `saledate` column to prepare the data for modelling

## Inputs

* The raw training and validation data: `data/raw/bluebook-for-bulldozers/TrainAndValid.csv`

## Outputs

* A working copy of the dataset saved as `data/processed/TrainAndValid_processed.csv`



---

# Execution Timestamp

Purpose: This code block prints a timestamp to record when the notebook was last run
- Helps monitor when the analysis was last performed
- Makes it easier to match results to a specific run
- Useful for debugging and version control

In [64]:
# Timestamp
import datetime
print(f"Notebook last run (end-to-end): {datetime.datetime.now()}")

Notebook last run (end-to-end): 2025-02-15 11:24:36.324399


# Project Directory Structure and Working Directory

**Purpose: This code block establishes and explains the project organization**
- Creates a standardized project structure for data science workflows
- Documents the purpose of each directory for team collaboration
- Gets current working directory for file path management

## Key Components:
1. `data/` stores all datasets (raw, processed, interim)
2. `src/` contains all source code (data preparation, models, utilities)
3. `notebooks/` holds Jupyter notebooks for experimentation
4. `results/` stores output files and visualizations
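
The layout above can be sketched in code. This is a minimal illustration, not the project's actual setup script: the `PROJECT_DIRS` list and the `create_project_dirs` helper are hypothetical names, and the demo runs inside a temporary directory so no real folders are touched.

```python
from pathlib import Path
import tempfile

# Hypothetical layout mirroring the folders listed above
PROJECT_DIRS = [
    "data/raw",
    "data/processed",
    "data/interim",
    "src",
    "notebooks",
    "results",
]

def create_project_dirs(root: Path) -> list[Path]:
    """Create each project folder under root, leaving existing folders alone."""
    created = []
    for relative in PROJECT_DIRS:
        target = root / relative
        # parents=True builds intermediate folders; exist_ok=True avoids errors on reruns
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created

# Demonstrate in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    paths = create_project_dirs(Path(tmp))
    all_created = all(p.is_dir() for p in paths)
    print(f"All folders created: {all_created}")
```

Because `mkdir` is called with `exist_ok=True`, the helper is safe to run repeatedly, which suits a notebook that is executed top-down more than once.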

## Setting Up Working Directory
This code block sets up the working environment by:
- Changing to the project directory where our code and data files are located
- Verifying the current working directory to ensure we're in the right place

In [73]:
import os

# Move to the desired directory
os.chdir('c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository\\About-BulldozerPriceGenius-_BPG-_v2')

# Get the current directory to verify the change
current_dir = os.getcwd()
current_dir

'c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository\\About-BulldozerPriceGenius-_BPG-_v2'

## Set Working Directory to Project Root
**Purpose: Changes the current working directory to the parent directory**
- Gets the folder one level above the current one
- Makes sure all file locations work correctly throughout the project
- Keeps files and folders organized in a clean way

In [66]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


## Get Current Working Directory
**Purpose: Switches to the repository folder, then retrieves and stores the current working directory path**
- Changes into the folder that holds the project repositories
- Saves this location in a variable called `current_dir` so we can use it later
- Helps us find and work with files in the right place

In [74]:
import os

# Change the current working directory
os.chdir('c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository')

# Get the current working directory
current_dir = os.getcwd()
current_dir

'c:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository'
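
The cells above rely on absolute paths that only exist on one machine. As a possible alternative, the sketch below walks upward from a starting folder until it finds one containing a marker folder. The `find_project_root` helper and the choice of `data` as the marker are assumptions made for illustration, not part of the original notebook.

```python
from pathlib import Path
import tempfile

def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from start until a folder containing `marker` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' folder")

# Demonstrate with a throwaway project: <root>/data and <root>/jupyter_notebooks
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "data").mkdir()
    notebooks = root / "jupyter_notebooks"
    notebooks.mkdir()

    found = find_project_root(notebooks)
    root_found = found == root
    print(f"Root located correctly: {root_found}")
```

In a notebook, the result could be passed to `os.chdir` once, after which every path can be written relative to the project root.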

# **Import Essential Data Science Libraries and Check Versions**

**Purpose: This code block imports fundamental Python libraries for data analysis and visualization**
- `pandas`: For data manipulation and analysis
- `numpy`: For numerical computations
- `matplotlib`: For creating visualizations and plots

**The version checks help ensure:**
- *Code compatibility across different environments*
- *Reproducibility of analysis*
- *Easy debugging of version-specific issues*


In [68]:
# Import data analysis tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")

pandas version: 2.2.3
NumPy version: 2.2.2
matplotlib version: 3.10.0
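
Versions can also be read without importing each library, which is handy when checking an environment before running the notebook. The sketch below uses the standard library's `importlib.metadata`; the helper names `installed_version` and `report_versions` are hypothetical.

```python
from importlib import metadata

def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def report_versions(packages):
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}

versions = report_versions(["pandas", "numpy", "matplotlib"])
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```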


# **1.3 Adding extra features to our DataFrame**
### What is Feature Engineering?

Feature engineering is a powerful technique that allows us to enhance our dataset by deriving new meaningful information from existing data. In this section, we'll explore how to extract valuable temporal features from our sale dates to improve our analysis.

### Time-Based Components

Our approach will focus on breaking down sale dates into multiple time-based components:

- What year it was sold
- What month it was sold
- What day it was sold
- What day of the week it was sold (pandas' `dt.dayofweek` numbers these from zero: Monday = 0, Tuesday = 1)
- What day of the year it was sold (like January 1st = 1, January 2nd = 2)
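
The components listed above map onto the pandas `.dt` accessor. The following is a minimal sketch on two made-up dates rather than the real dataset. `saleYear` and `saleMonth` follow the names used in the notebook introduction; the other column names and the `add_sale_date_features` helper are assumptions for illustration. Note that pandas numbers the day of the week from Monday = 0.

```python
import pandas as pd

def add_sale_date_features(df: pd.DataFrame, date_col: str = "saledate") -> pd.DataFrame:
    """Return a copy of df with calendar parts of date_col added as new columns."""
    out = df.copy()
    dates = out[date_col].dt  # datetime accessor; requires a datetime64 column
    out["saleYear"] = dates.year
    out["saleMonth"] = dates.month
    out["saleDay"] = dates.day
    out["saleDayOfWeek"] = dates.dayofweek  # Monday = 0 ... Sunday = 6
    out["saleDayOfYear"] = dates.dayofyear  # January 1st = 1
    return out

# Toy dates chosen for illustration only
sample = pd.DataFrame({"saledate": pd.to_datetime(["2024-01-01", "2006-11-16"])})
features = add_sale_date_features(sample)
print(features)
```

The helper returns a new DataFrame instead of modifying its input, so the original data stays untouched.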

### Data Safety

To ensure data integrity throughout this process, we'll first create a backup of our original dataset. This precautionary step will allow us to revert any changes if needed.
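
To see why a copy protects the original, consider this small sketch. The prices are toy values made up for the example; `DataFrame.copy()` makes a deep copy by default, so edits to the copy do not reach the source.

```python
import pandas as pd

# Toy values for illustration only
original = pd.DataFrame({"SalePrice": [66000.0, 57000.0]})

# copy() is deep by default, so the two frames do not share data
working = original.copy()
working["SalePrice"] = working["SalePrice"] * 2

unchanged = original["SalePrice"].tolist() == [66000.0, 57000.0]
print(f"Original unchanged: {unchanged}")
```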


## **Creating a Safe Working Copy**
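Before we start modifying our dataset, it's crucial to create a backup copy so that we can always return to our original data if needed. As a first step, the cell below confirms the current working directory and lists the folders it contains, so we know where the data lives.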

In [75]:
import os

# Check the current directory
current_directory = os.getcwd()
print(f"Current Directory: {current_directory}")

# List all directories in the project
project_directory = current_directory  # Change this if your project directory is different
directories = [d for d in os.listdir(project_directory) if os.path.isdir(os.path.join(project_directory, d))]

print("Directories in the project:")
for directory in directories:
    print(directory)

Current Directory: c:\Users\blign\Dropbox\1 PROJECT\VS Code Project Respository
Directories in the project:
About-BulldozerPriceGenius-_BPG-_v2
BulldozerPriceGenuis-BPG-
churnometer
CI-Malaria-Detection
Culture-Project
data
housing-price-data-ml
inputs
job_board_django_ztm
NederLearn
NederLearn_V2
Nederlearn_V3
Nederlearn_V4
NederLearn_V5
PriceBulldozerAI
Recipe-App
Recipe-App-Tutorial
Recipe-Tutorial-Dee-MC
Scartch-Pad
ZTM-Django-bitly-forms
ztm_bd
ztm_django_bitly_clone_project
ztm_django_jobs_board
ztm_django_movie_app


## File Path Verification Code

This code block serves two essential purposes:

- Verifies the existence of our training dataset (`TrainAndValid.csv`) before attempting to use it
- Provides immediate feedback about file accessibility, helping prevent data loading errors

In [None]:
import os

file_path = "C:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository\\About-BulldozerPriceGenius-_BPG-_v2\\data\\raw\\bluebook-for-bulldozers\\TrainAndValid.csv"

# Check if file exists in specified path
if os.path.exists(file_path):
    print(f"The file {file_path} exists.")
else:
    print(f"The file {file_path} does not exist.")

The file C:\Users\blign\Dropbox\1 PROJECT\VS Code Project Respository\About-BulldozerPriceGenius-_BPG-_v2\data\raw\bluebook-for-bulldozers\TrainAndValid.csv exists.


## Folder Path Verification

This code block verifies that our processed data folder exists before we write to it. It performs two key functions:

- Checks if the specified folder path exists in our project structure
- Provides immediate feedback about folder accessibility to prevent data storage/retrieval errors

In [None]:
import os

folder_path = "C:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository\\About-BulldozerPriceGenius-_BPG-_v2\\data\\processed"

# Check if folder exists in specified path
if os.path.exists(folder_path):
    print(f"The folder {folder_path} exists.")
else:
    print(f"The folder {folder_path} does not exist.")

The folder C:\Users\blign\Dropbox\1 PROJECT\VS Code Project Respository\About-BulldozerPriceGenius-_BPG-_v2\data\processed exists.
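
Both checks above use `os.path.exists`, which returns `True` for a file or a folder alike. A stricter variant, sketched here with `pathlib`, distinguishes the two and stops early when something is missing. The helpers `require_file` and `require_dir` are hypothetical names, and the demo uses a temporary folder in place of the real project paths.

```python
from pathlib import Path
import tempfile

def require_file(path: Path) -> Path:
    """Return path if it is an existing file; otherwise raise immediately."""
    if not path.is_file():
        raise FileNotFoundError(f"Expected a file at {path}")
    return path

def require_dir(path: Path) -> Path:
    """Return path if it is an existing folder; otherwise raise immediately."""
    if not path.is_dir():
        raise NotADirectoryError(f"Expected a folder at {path}")
    return path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "processed"
    folder.mkdir()
    csv_file = folder / "example.csv"
    csv_file.write_text("SalesID\n1\n")

    file_ok = require_file(csv_file) == csv_file
    dir_ok = require_dir(folder) == folder

    try:
        require_file(folder)  # a folder exists here, but it is not a file
        folder_rejected_as_file = False
    except FileNotFoundError:
        folder_rejected_as_file = True

print(file_ok, dir_ok, folder_rejected_as_file)
```

Raising an exception instead of printing a message means a top-down run halts at the problem instead of failing later at `read_csv` or `to_csv`.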


## Loading and Preprocessing the Bulldozer Dataset

This code block performs essential data loading and preprocessing steps:

- Reads the raw bulldozer price dataset using pandas, with specific configurations:
    - Disables low memory mode to handle mixed data types
    - Automatically parses the 'saledate' column as datetime
- Creates a safe working copy of the data to prevent modifications to the original dataset
- Saves the preprocessed dataset to our processed data folder for further analysis
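
The same loading options can be tried on a tiny in-memory CSV before running them on the full file. The rows below are made up for illustration, and the `_demo` variable names are hypothetical. The sketch also shows one consequence of saving to CSV: the datetime dtype is not stored in the file, so `parse_dates` is needed again when the processed copy is reloaded.

```python
import io
import pandas as pd

# Made-up rows standing in for the real dataset
csv_text = (
    "SalesID,SalePrice,saledate\n"
    "1,66000,2006-11-16\n"
    "2,57000,2004-03-26\n"
)

# Load with the same options as the notebook cell
df_demo = pd.read_csv(io.StringIO(csv_text), low_memory=False, parse_dates=["saledate"])
df_tmp_demo = df_demo.copy()

# Write the copy to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_tmp_demo.to_csv(buffer, index=False)

# Reload twice: once without and once with date parsing
reloaded_plain = pd.read_csv(io.StringIO(buffer.getvalue()))
reloaded_parsed = pd.read_csv(io.StringIO(buffer.getvalue()), parse_dates=["saledate"])

print(df_demo["saledate"].dtype)
print(reloaded_plain["saledate"].dtype)
print(reloaded_parsed["saledate"].dtype)
```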

In [96]:
# Import the training and validation set
df = pd.read_csv(filepath_or_buffer="../data/raw/bluebook-for-bulldozers/TrainAndValid.csv",
                 low_memory=False,  # process the file in one pass to avoid the mixed dtypes warning
                 parse_dates=["saledate"])  # parse the "saledate" column as datetime on load

# Confirm that "saledate" now has a datetime dtype
df.info()

# Make a copy of the original DataFrame to perform edits on
df_tmp = df.copy()

# Save the copy to the processed folder
processed_file_path = "C:\\Users\\blign\\Dropbox\\1 PROJECT\\VS Code Project Respository\\About-BulldozerPriceGenius-_BPG-_v2\\data\\processed\\TrainAndValid_processed.csv"
df_tmp.to_csv(processed_file_path, index=False)



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   SalesID                   412698 non-null  int64         
 1   SalePrice                 412698 non-null  float64       
 2   MachineID                 412698 non-null  int64         
 3   ModelID                   412698 non-null  int64         
 4   datasource                412698 non-null  int64         
 5   auctioneerID              392562 non-null  float64       
 6   YearMade                  412698 non-null  int64         
 7   MachineHoursCurrentMeter  147504 non-null  float64       
 8   UsageBand                 73670 non-null   object        
 9   saledate                  412698 non-null  datetime64[ns]
 10  fiModelDesc               412698 non-null  object        
 11  fiBaseModel               412698 non-null  object        
 12  fi


---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is filled in
except Exception as e:
  print(e)