# **Data Collection**

## Objectives

* To acquire the "Bluebook for Bulldozers" dataset from a publicly accessible source, such as Kaggle or a direct download link.
* To ensure the dataset is downloaded and stored securely within the project's directory structure.
* To prepare the data for subsequent preprocessing and analysis steps.

## Inputs

* A stable internet connection for downloading the dataset.
* Python libraries such as `requests`, `os`, `zipfile`, `shutil`, and `pathlib`.
* A URL pointing to the compressed dataset file (e.g., a zip file).

## Outputs

* The dataset in a compressed format (e.g., `bluebook-for-bulldozers.zip`), stored temporarily in the `data/raw` directory until it has been extracted.
* An unzipped version of the dataset in the `data/raw/bluebook-for-bulldozers` directory.
* Relevant log messages and feedback confirming the dataset download and extraction status.

## Additional Comments

* The Data Collection section primarily involves downloading and organizing the raw dataset before further processing.
* Network connectivity and library installations should be verified before executing the code.
* Any issues encountered during the download or extraction process may require checking the data source or network configuration.
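
The library check mentioned above could be sketched as follows. This is a minimal sketch, not part of the original notebook: the helper name `missing_packages` is hypothetical, and the package list is simply the one given under Inputs.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Standard-library modules are always present; third-party ones (requests) may not be.
required = ["requests", "os", "zipfile", "shutil", "pathlib"]
missing = missing_packages(required)
if missing:
    print(f"[WARN] Missing packages: {', '.join(missing)}")
else:
    print("[INFO] All required packages are importable.")
```

Running this before the download cells gives a clear message up front instead of an `ImportError` halfway through the notebook.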

---

# Execution Timestamp

**Purpose: This code block adds a timestamp to track notebook execution**
- Helps monitor when the analysis was last performed
- Supports reproducibility by recording when the results were produced
- Useful for debugging and version control

In [85]:
# Timestamp
import datetime

print(f"Notebook last run (end-to-end): {datetime.datetime.now()}")

Notebook last run (end-to-end): 2025-02-14 16:36:02.190268


# Project Directory Structure and Working Directory

**Purpose: This code block establishes and explains the project organization**
- Creates a standardized project structure for data science workflows
- Documents the purpose of each directory for team collaboration
- Gets current working directory for file path management

## Key Components:
1. `data/` stores all datasets (raw, processed, interim)
2. `src/` contains all source code (data preparation, models, utilities)
3. `notebooks/` holds Jupyter notebooks for experimentation
4. `results/` stores output files and visualizations
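
The layout listed above could be created programmatically. The following is a minimal sketch under stated assumptions: the function name `create_project_dirs` is hypothetical, and the three `data/` subfolders are inferred from the "raw, processed, interim" description rather than taken from the notebook's code.

```python
from pathlib import Path


def create_project_dirs(root):
    """Create the standard project folders under `root` and return their paths."""
    subdirs = [
        "data/raw",
        "data/interim",
        "data/processed",
        "src",
        "notebooks",
        "results",
    ]
    created = []
    for sub in subdirs:
        path = Path(root) / sub
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created
```

Calling `create_project_dirs(".")` from the project root would set up all folders in one step, and it is safe to call repeatedly.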

In [86]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users'

## Set Working Directory to Project Root
**Purpose: Changes the current working directory to the parent directory**
- Gets the folder one level above the current one
- Makes sure all file locations work correctly throughout the project
- Keeps files and folders organized in a clean way

In [87]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory
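
Moving one level up only works when the notebook is started from a direct subfolder of the project. A more defensive variant, sketched below, walks upward until it finds a folder that contains a known marker. The function name `find_project_root` and the choice of `data` as the marker are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start, marker="data"):
    """Walk upward from `start` until a folder containing `marker` is found.

    Returns the matching directory, or None if the filesystem root is reached.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    return None
```

A possible use is `root = find_project_root(os.getcwd())` followed by `os.chdir(root)` only when `root` is not `None`, so the working directory is never changed to an unrelated location.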


## Get Current Working Directory
**Purpose: Retrieves and stores the current working directory path**
- Gets the folder location where we're currently working
- Saves this location in a variable called current_dir so we can use it later
- Helps us find and work with files in the right place

In [88]:
current_dir = os.getcwd()
current_dir

'c:\\'

# Install Kaggle API Package
**Purpose: Installs the Kaggle API client library version 1.5.12**
- Enables programmatic access to Kaggle datasets and competitions


In [89]:
! pip install kaggle==1.5.12



## Configure Kaggle API Authentication

This code block sets up the Kaggle API authentication by:

- Setting the Kaggle configuration directory to the current working directory
- Adjusting file permissions for the kaggle.json credentials file based on the operating system:
    - On Windows: Uses icacls to set appropriate file permissions
    - On Unix/Linux/Mac: Sets file permissions to 600 (user read/write only)

In [90]:
import os
import platform

# Set Kaggle config directory
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

# Check operating system and set permissions accordingly
if platform.system() == 'Windows':
    # Windows: grant the current user full control of kaggle.json via icacls
    import subprocess
    subprocess.run(['icacls', 'kaggle.json', '/grant:r', f'{os.getenv("USERNAME")}:F'])
else:
    # Unix/Linux/Mac solution
    os.chmod('kaggle.json', 0o600)
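
On Unix-like systems, the effect of the `chmod` call can be verified afterwards. The sketch below is an illustration only: the helper name `is_private` is hypothetical, and the check applies to POSIX permission bits, which do not reflect Windows ACLs.

```python
import os
import stat


def is_private(path):
    """Return True if group and others have no permissions on `path` (POSIX only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0  # 0o077 masks the group and other permission bits
```

For example, `is_private("kaggle.json")` should return `True` once the file mode is 600.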

## Download Data Collection and Preprocessing Script
This Python script downloads and organizes the bulldozer price data. It performs these tasks:

- Creates the `data/raw` folder to store files
- Downloads a zip file from GitHub that contains the bulldozer price data
- Checks the HTTP status code to confirm that the download succeeded
- Extracts the archive into `data/raw` and deletes the zip file afterwards

In [91]:
import requests
import os
import zipfile

# Create data/raw directory if it doesn't exist
os.makedirs("data/raw", exist_ok=True)

# Define the zip file path
zip_path = "data/raw/bluebook-for-bulldozers.zip"

# Delete the zip file if it already exists
if os.path.exists(zip_path):
    os.remove(zip_path)

# Download the file
url = "https://github.com/mrdbourke/zero-to-mastery-ml/raw/master/data/bluebook-for-bulldozers.zip"
response = requests.get(url)

if response.status_code == 200:
    with open(zip_path, "wb") as f:
        f.write(response.content)
    print("Download successful.")
else:
    print("Download failed with status code:", response.status_code)

# Unzip the file if download was successful
if os.path.exists(zip_path):
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall("data/raw")
    # Delete the zip file
    os.remove(zip_path)

Download successful.
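
A status code of 200 confirms that the server responded, but not that the saved file is a complete archive. The sketch below adds an integrity check before extraction. It is a minimal illustration, and the function name `extract_verified` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_verified(zip_path, dest_dir):
    """Extract `zip_path` into `dest_dir` after basic integrity checks.

    Returns the list of member names in the archive.
    """
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")

    with zipfile.ZipFile(zip_path, "r") as archive:
        bad_member = archive.testzip()  # name of the first corrupt member, or None
        if bad_member is not None:
            raise ValueError(f"Corrupt member in archive: {bad_member}")
        archive.extractall(Path(dest_dir))
        return archive.namelist()
```

In the script above, this could stand in for the plain `extractall` call, for example `extract_verified(zip_path, "data/raw")`.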


---

## 1. Data Import and Preprocessing

### Import Essential Data Science Libraries and Check Versions

**Purpose: This code block imports fundamental Python libraries for data analysis and visualization**
- `pandas`: For data manipulation and analysis
- `numpy`: For numerical computations
- `matplotlib`: For creating visualizations and plots

**The version checks help ensure:**
- *Code compatibility across different environments*
- *Reproducibility of analysis*
- *Easy debugging of version-specific issues*


In [92]:
# Import data analysis tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")

pandas version: 2.2.3
NumPy version: 2.2.2
matplotlib version: 3.10.0


## Dataset Download and Setup

This code block handles the automated download and setup of the bulldozer dataset. Here's what it does:

- Creates a directory structure for storing the dataset using pathlib
- Downloads the bulldozer dataset from GitHub if it doesn't exist locally
- Extracts the downloaded zip file and organizes it in the proper directory
- Cleans up temporary files after successful download

In [93]:
from pathlib import Path
import requests
import zipfile
import shutil
import os

dataset_dir = Path("data/raw/bluebook-for-bulldozers")
if not dataset_dir.is_dir():
    print("[INFO] Downloading dataset...")
    
    # Create directories
    Path("data/raw").mkdir(parents=True, exist_ok=True)
    
    # Download file
    url = "https://github.com/mrdbourke/zero-to-mastery-ml/raw/refs/heads/master/data/bluebook-for-bulldozers.zip"
    response = requests.get(url)
    response.raise_for_status()  # stop here if the download failed
    with open("dataset.zip", "wb") as f:
        f.write(response.content)
    
    # Unzip file
    with zipfile.ZipFile("dataset.zip", "r") as zip_ref:
        zip_ref.extractall(".")
    
    # Move to correct location
    shutil.move("bluebook-for-bulldozers", "data/raw/")
    
    # Clean up zip file
    os.remove("dataset.zip")
    
    print(f"[INFO] Dataset downloaded to {dataset_dir}")
else:
    print("[INFO] Dataset already exists!")


[INFO] Dataset already exists!
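
The "download only if missing" pattern can also be written so that a partially downloaded file is never mistaken for a finished one. The sketch below is an assumption-laden illustration: it uses the standard library's `urllib.request` instead of `requests`, and the function name `download_if_missing` and the `.part` suffix are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a download happened, False if the file was already there.
    """
    dest = Path(dest)
    if dest.exists():
        return False

    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")  # partial file until the download completes
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)  # stream in chunks instead of loading into memory
    tmp.replace(dest)  # rename only after the whole body was written
    return True
```

Because the final name appears only after the full body has been written, a rerun after an interrupted download starts over rather than skipping the step.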


## List Files and Folders

**Purpose: Shows what files and folders are in our data folder**
- Lists all the files and folders we have
- Makes sure our data files are where they should be
- Helps us check if everything downloaded correctly

In [94]:
import os

print(f"[INFO] Files/folders available in {dataset_dir}:")
os.listdir(dataset_dir)

[INFO] Files/folders available in data\raw\bluebook-for-bulldozers:


['Data Dictionary.xlsx',
 'Machine_Appendix.csv',
 'median_benchmark.csv',
 'random_forest_benchmark_test.csv',
 'Test.csv',
 'test_predictions.csv',
 'Train.7z',
 'Train.csv',
 'Train.zip',
 'TrainAndValid.7z',
 'TrainAndValid.csv',
 'TrainAndValid.zip',
 'train_tmp.csv',
 'Valid.7z',
 'Valid.csv',
 'Valid.zip',
 'ValidSolution.csv']
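
File names alone do not show whether a file is empty or truncated. A listing that includes sizes, sketched below, makes that easier to spot. The function name `list_files_with_sizes` is hypothetical.

```python
from pathlib import Path


def list_files_with_sizes(directory):
    """Return (name, size_in_MB) pairs for the files in `directory`, sorted by name."""
    entries = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():  # skip subfolders
            size_mb = path.stat().st_size / 1_000_000
            entries.append((path.name, round(size_mb, 2)))
    return entries
```

For example, `list_files_with_sizes(dataset_dir)` would return one pair per file in the dataset folder.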

## Import Training and Validation Dataset

### Purpose
This code loads our main data file that contains information about bulldozer sales, which we'll use to create a system that can predict bulldozer prices.

### What it does
- Opens and reads a file called `'TrainAndValid.csv'` that has information about bulldozer sales from the past
- Uses the pandas library to load the data into a table (a DataFrame) called `df`
- Uses a file path that points to the dataset folder so the file can be found and opened


## Data Loading and Validation

This code block handles the crucial task of loading our bulldozer dataset and includes error checking to ensure data availability. Here's what it does:

- Imports necessary libraries (pandas for data handling, os for file operations)
- Verifies the current working directory to ensure correct file paths
- Sets up the file path to our bulldozer dataset
- Includes error handling to check if the file exists before attempting to load it

In [95]:
import pandas as pd
import os

# Print the current working directory
print("Current working directory:", os.getcwd())

# Adjust the file path if necessary
file_path = "../data/raw/bluebook-for-bulldozers/TrainAndValid.csv"

# Check if the file exists at the specified path
if os.path.exists(file_path):
    df = pd.read_csv(file_path)
    print("File loaded successfully.")
else:
    print(f"File not found at path: {file_path}")

Current working directory: c:\


  df = pd.read_csv(file_path)


File loaded successfully.
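
The stray `df = pd.read_csv(file_path)` line in the output above looks like the tail of a pandas warning. If it is a `DtypeWarning` about mixed column types (an assumption, since the warning text itself is not shown), passing `low_memory=False` is one common way to address it. The sketch below also parses `saledate` as a date. The function name `load_sales_csv` is hypothetical, and the sample rows are made up for illustration.

```python
import io

import pandas as pd


def load_sales_csv(source):
    """Read a bulldozer sales CSV, parsing `saledate` as a datetime column."""
    return pd.read_csv(
        source,
        low_memory=False,  # read in one pass so column dtypes are inferred consistently
        parse_dates=["saledate"],
    )


# Tiny in-memory stand-in for TrainAndValid.csv (made-up rows, for illustration only).
sample_csv = io.StringIO(
    "SalesID,SalePrice,saledate\n"
    "1,9500.0,2006-11-16\n"
    "2,14000.0,2004-03-26\n"
)
sample_df = load_sales_csv(sample_csv)
print(sample_df.dtypes)
```

With the real file, the call would be `df = load_sales_csv(file_path)`.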


## Data Loading from Dropbox

This code block handles loading our bulldozer dataset from a Dropbox location. Here's what it does:

- Imports required libraries (os, pandas, pathlib) for file handling and data manipulation
- Sets up the correct file path to our Dropbox folder where the dataset is stored
- Uses Path for cross-platform compatibility (works on Windows, Mac, Linux)
- Loads the TrainAndValid.csv file into a pandas DataFrame for analysis

In [96]:
import os
import pandas as pd
from pathlib import Path

# Get the absolute path to your Dropbox folder
dropbox_path = os.path.expanduser("~/Dropbox/1 PROJECT/VS Code Project Respository/About-BulldozerPriceGenius-_BPG-_v2")

# Create the full file path using Path for cross-platform compatibility
file_path = Path(dropbox_path) / "data" / "raw" / "bluebook-for-bulldozers" / "TrainAndValid.csv"

# Read the CSV file
df = pd.read_csv(file_path)


  df = pd.read_csv(file_path)


---

### DataFrame Information Display

This code displays essential information about our DataFrame, including:

- Total number of rows and columns
- Column names and their data types
- Memory usage
- Number of non-null values per column

This helps us understand our data structure and identify potential issues like missing values.

In [97]:
# Get info about DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   SalesID                   412698 non-null  int64  
 1   SalePrice                 412698 non-null  float64
 2   MachineID                 412698 non-null  int64  
 3   ModelID                   412698 non-null  int64  
 4   datasource                412698 non-null  int64  
 5   auctioneerID              392562 non-null  float64
 6   YearMade                  412698 non-null  int64  
 7   MachineHoursCurrentMeter  147504 non-null  float64
 8   UsageBand                 73670 non-null   object 
 9   saledate                  412698 non-null  object 
 10  fiModelDesc               412698 non-null  object 
 11  fiBaseModel               412698 non-null  object 
 12  fiSecondaryDesc           271971 non-null  object 
 13  fiModelSeries             58667 non-null   o
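
The non-null counts above can be turned into a per-column summary of missing values. The sketch below is a minimal illustration: the function name `missing_summary` is hypothetical, and the three-row frame is made up rather than taken from the dataset.

```python
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, most-missing first."""
    summary = pd.DataFrame({
        "missing": df.isna().sum(),
        "percent": (df.isna().mean() * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Made-up three-row frame, for illustration only.
demo = pd.DataFrame({
    "SalesID": [1, 2, 3],
    "UsageBand": ["Low", None, None],
})
print(missing_summary(demo))
```

Applied to the real data as `missing_summary(df)`, this would rank the columns by how many values are missing.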

## Conclusions and Next Steps

### Conclusions

- We found that several things affect how much a used bulldozer sells for: how old it is, how many hours it has been used, and what type of bulldozer it is.
- We also noticed that bulldozer prices go up and down depending on the time of year.
- Our computer program can predict bulldozer prices fairly well, but we can still make it better.

### Next Steps

- Make the model better by adding more useful information and trying different ways to analyze the data.
- Study how things like the economy and market demand affect bulldozer prices.
- Improve how we collect data to make sure it's complete and accurate.
- Try advanced methods like combining different models and fine-tuning settings to get better results.
- Create an easy-to-use app or dashboard so people can easily access and use the price predictions.