<a href="https://colab.research.google.com/github/Blignaut24/BulldozerSalePredictor/blob/main/01_data_collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Collection Notebook

## Objectives
* Import bulldozers data from Kaggle and save as raw data
* Inspect and save in folder: outputs/datasets/collect/

## Inputs
* JSON file containing Kaggle authentication token

## Outputs
* Generate Dataset: outputs/datasets/collect/
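
Writing a collected DataFrame to an output folder could look roughly like the sketch below. The helper name `save_dataset`, the file name `demo.csv`, and the temporary folder are assumptions made for illustration; they are not taken from this notebook.

```python
import os
import tempfile

import pandas as pd


def save_dataset(df, output_dir, filename):
    """Create output_dir if needed and write df there as a CSV file."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    df.to_csv(path, index=False)  # index=False avoids writing an extra index column
    return path


# Demo in a temporary folder so the sketch does not touch the project tree.
with tempfile.TemporaryDirectory() as tmp:
    demo_df = pd.DataFrame({"SalesID": [1, 2], "SalePrice": [100.0, 200.0]})
    saved_path = save_dataset(demo_df, os.path.join(tmp, "outputs", "demo"), "demo.csv")
    reloaded = pd.read_csv(saved_path)  # read back to confirm the file was written
```

In the notebook itself, the same helper would be called with the project's output folder and the DataFrame loaded from the raw data.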

## Additional Comments

The data Kaggle provides for this project is split into three parts:
* **Train.csv** is the training set, which contains data through the end of 2011.
* **Valid.csv** is the validation set, which contains data from January 1, 2012 - April 30, 2012.
* **Test.csv** is the test set. It contains data from May 1, 2012 - November 2012.
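
The date ranges above amount to a time-based split. As a minimal sketch, assuming the sale date has been parsed into a datetime column, the training and validation boundaries could be reproduced like this. The toy rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy sale records; the real rows would come from the Kaggle CSV files.
sales = pd.DataFrame({
    "SalesID": [1, 2, 3, 4],
    "saledate": pd.to_datetime(
        ["2010-06-15", "2011-12-31", "2012-01-01", "2012-04-30"]
    ),
})

# Training data runs through the end of 2011;
# validation data covers January 1, 2012 to April 30, 2012.
train_cutoff = pd.Timestamp("2011-12-31")
valid_end = pd.Timestamp("2012-04-30")

train = sales[sales["saledate"] <= train_cutoff]
valid = sales[(sales["saledate"] > train_cutoff) & (sales["saledate"] <= valid_end)]

print(len(train), len(valid))
```

Splitting on the date rather than at random keeps later sales out of the training set, which matches how the three files are organized.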

The key fields in Train.csv are:
* **SalesID**: the unique identifier of the sale.
* **MachineID**: the unique identifier of a machine. A machine can be sold multiple times.
* **SalePrice**: the price the machine sold for at auction (only provided in Train.csv).
* **saledate**: the date of the sale.

Each machine can be set up with a different set of options, which we call its **"configuration"**. Some machine types do not offer every option, so the corresponding fields can be missing. The recorded options and usage figures are also not always reliable. A separate file, **"machine_appendix.csv"**, gives more details about each machine, such as when it was made, what model it is, and what it is used for. Each machine has a unique **MachineID** that can be used to look it up across all the data files in the project.
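
As a rough sketch of how the machine ID ties the files together, the example below joins toy sale records to a toy appendix table and then measures how much data is missing. The appendix column `ManufactureYear` and all values are hypothetical; only `SalesID`, `MachineID`, and `SalePrice` come from the field list above.

```python
import pandas as pd

# Toy sale records; machine 1 was sold twice.
sales = pd.DataFrame({
    "SalesID": [101, 102, 103],
    "MachineID": [1, 2, 1],
    "SalePrice": [50000.0, 32000.0, 47000.0],
})

# Toy appendix with one row per machine (hypothetical column name).
appendix = pd.DataFrame({
    "MachineID": [1, 2],
    "ManufactureYear": [2004, 1998],
})

# A left join keeps every sale, even if a machine is absent from the appendix.
# validate="many_to_one" raises an error if the appendix repeats a MachineID.
merged = sales.merge(appendix, on="MachineID", how="left", validate="many_to_one")

# Fraction of missing values per column, useful for spotting sparse fields.
missing_share = merged.isna().mean()
```

A left join is used here on the assumption that sales should never be dropped during collection; an inner join would silently discard sales whose machine has no appendix entry.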


---

# Install Python packages in notebooks

In [6]:
# %pip install -r /workspace/BulldozerSalePredictor/requirements.txt

ERROR: Could not open requirements file: [Errno 2] No such file or directory: '/workspace/churnometer/requirements.txt'

# Fetch data from Kaggle

In [2]:
!wget https://github.com/mrdbourke/zero-to-mastery-ml/raw/master/data/bluebook-for-bulldozers.zip # download files from GitHub as zip

import os
import zipfile

local_zip = 'bluebook-for-bulldozers.zip'

# Extract all data into the current working directory;
# the context manager closes the archive automatically
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall('.')

--2024-05-08 07:26:20--  https://github.com/mrdbourke/zero-to-mastery-ml/raw/master/data/bluebook-for-bulldozers.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/bluebook-for-bulldozers.zip [following]
--2024-05-08 07:26:20--  https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/bluebook-for-bulldozers.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 72077822 (69M) [application/zip]
Saving to: ‘bluebook-for-bulldozers.zip’


2024-05-08 07:26:21 (187 MB/s) - ‘bluebook-for-bulldozers.zip’ saved [72077822/72077822]
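
Before reading any file, it can help to check which CSV files the archive actually contains. The sketch below is one way to do that with the standard library. The helper name `list_csv_members` and the throwaway demo archive are assumptions for illustration, so the example runs without the real download.

```python
import os
import tempfile
import zipfile


def list_csv_members(zip_path):
    """Return the sorted names of the .csv files inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name for name in archive.namelist() if name.lower().endswith(".csv")
        )


# Build a tiny demo archive in a temporary folder and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("demo/a.csv", "x\n1\n")
        archive.writestr("demo/readme.txt", "not a csv")
    demo_members = list_csv_members(demo_zip)
```

In this notebook the same helper could be called as `list_csv_members('bluebook-for-bulldozers.zip')` once the download has finished.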



# Import Packages

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

In [4]:
# Import training and validation sets
df = pd.read_csv("bluebook-for-bulldozers/TrainAndValid.csv",
                 low_memory=False)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   SalesID                   412698 non-null  int64  
 1   SalePrice                 412698 non-null  float64
 2   MachineID                 412698 non-null  int64  
 3   ModelID                   412698 non-null  int64  
 4   datasource                412698 non-null  int64  
 5   auctioneerID              392562 non-null  float64
 6   YearMade                  412698 non-null  int64  
 7   MachineHoursCurrentMeter  147504 non-null  float64
 8   UsageBand                 73670 non-null   object 
 9   saledate                  412698 non-null  object 
 10  fiModelDesc               412698 non-null  object 
 11  fiBaseModel               412698 non-null  object 
 12  fiSecondaryDesc           271971 non-null  object 
 13  fiModelSeries             58667 non-null   o