# **Data Collection**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token. 

## Outputs



## Additional Comments



---

# Change working directory

In [38]:
# Timestamp
import datetime

print(f"Notebook last run (end-to-end): {datetime.datetime.now()}")

Notebook last run (end-to-end): 2025-02-10 01:08:10.874730


We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [39]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\blign\\Dropbox\\1 PROJECT'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [40]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [41]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\blign\\Dropbox'
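The directory change above can also be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the function name move_to_parent is hypothetical, and it assumes you want the parent of whatever folder you pass in (or of the current directory by default).

```python
import os
from pathlib import Path


def move_to_parent(start=None):
    """Change the working directory to the parent of `start`.

    `start` defaults to the current working directory.
    Returns the new working directory as a Path.
    """
    start_path = Path(start) if start is not None else Path.cwd()
    parent = start_path.resolve().parent
    os.chdir(parent)
    return parent
```

Note that every call moves one level further up, exactly like re-running the os.chdir() cell, so it is worth confirming the result with os.getcwd() before running the rest of the notebook.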

# Fetch data from Kaggle
Install Kaggle package to fetch data


In [42]:
! pip install kaggle==1.5.12



# Kaggle Authentication and Permissions Setup
This cell configures Kaggle authentication: it points the Kaggle config directory at the current working directory and adjusts the file permissions of kaggle.json according to the operating system (Windows/Mac/Linux), so that the credentials are only accessible to the current user.

In [43]:
import os
import platform

# Set Kaggle config directory
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

# Check operating system and set permissions accordingly
if platform.system() == 'Windows':
    # Windows solution: call icacls through subprocess to grant the current user full control
    import subprocess
    subprocess.run(['icacls', 'kaggle.json', '/grant:r', f'{os.getenv("USERNAME")}:F'])
else:
    # Unix/Linux/Mac solution
    os.chmod('kaggle.json', 0o600)
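Before relying on the token, it can help to confirm that kaggle.json is present and well-formed. The sketch below is an illustration, not part of the original notebook: check_kaggle_token is a hypothetical helper, and it assumes the token is a JSON object containing the "username" and "key" fields.

```python
import json
from pathlib import Path

# Fields expected in a Kaggle API token file
REQUIRED_KEYS = {"username", "key"}


def check_kaggle_token(token_path="kaggle.json"):
    """Return (ok, message) describing whether the token file looks usable."""
    path = Path(token_path)
    if not path.is_file():
        return False, f"{path} not found"
    try:
        token = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as err:
        return False, f"{path} is not valid JSON: {err}"
    if not isinstance(token, dict):
        return False, f"{path} does not contain a JSON object"
    missing = REQUIRED_KEYS - set(token)
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "token looks well-formed"
```

The check only inspects the file's structure. It does not contact Kaggle, so a well-formed but expired token would still pass.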

Get the dataset path from the Kaggle URL
* When viewing the dataset on Kaggle, the dataset path is the part of the URL that follows https://www.kaggle.com/.
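The path extraction described above can be sketched with the standard library. This is an illustrative helper, not part of the original notebook: kaggle_path_from_url is a hypothetical name, and the URL in the usage note is only an example.

```python
from urllib.parse import urlparse

# Hosts accepted as Kaggle URLs in this sketch
KAGGLE_HOSTS = {"www.kaggle.com", "kaggle.com"}


def kaggle_path_from_url(url):
    """Return the part of a Kaggle URL that follows the domain."""
    parsed = urlparse(url)
    if parsed.netloc not in KAGGLE_HOSTS:
        raise ValueError(f"Not a Kaggle URL: {url}")
    # Drop the leading and trailing slashes; query strings are ignored
    return parsed.path.strip("/")
```

For example, kaggle_path_from_url("https://www.kaggle.com/c/bluebook-for-bulldozers") returns "c/bluebook-for-bulldozers".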

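Define the dataset and the destination folder, then download it. Note that the cell below fetches a copy of the archive hosted on GitHub rather than going through the Kaggle API.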

# Download and Setup Dataset

In [44]:
import requests
import os
import zipfile

# Create data/raw directory if it doesn't exist
os.makedirs("data/raw", exist_ok=True)

# Download the file and stop early on an HTTP error
url = "https://github.com/mrdbourke/zero-to-mastery-ml/raw/master/data/bluebook-for-bulldozers.zip"
response = requests.get(url)
response.raise_for_status()
with open("data/raw/bluebook-for-bulldozers.zip", "wb") as f:
    f.write(response.content)
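After a download like the one above, a quick integrity check can catch a truncated file or an HTML error page saved under a .zip name. The following is a minimal sketch using only the standard zipfile module. The helper name inspect_zip is hypothetical.

```python
import zipfile
from pathlib import Path


def inspect_zip(zip_path):
    """Validate a zip archive and return the names of its members."""
    path = Path(zip_path)
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a valid zip archive")
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the first corrupt member, or None if all are fine
        bad_member = archive.testzip()
        if bad_member is not None:
            raise ValueError(f"Corrupt member in {path}: {bad_member}")
        return archive.namelist()
```

In this notebook it could be called as inspect_zip("data/raw/bluebook-for-bulldozers.zip") once the download cell has finished.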

---

# Load and Inspect Kaggle data


In [45]:
# Timestamp
import datetime

print(f"Notebook last run (end-to-end): {datetime.datetime.now()}")

Notebook last run (end-to-end): 2025-02-10 01:08:22.292442


## 1. Data Import and Preprocessing

This section covers the initial setup of the environment and required libraries.

The following libraries will be imported: pandas (data manipulation), NumPy (numerical operations), and matplotlib (visualization).

In [46]:
# Import data analysis tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")

pandas version: 2.2.3
NumPy version: 2.2.2
matplotlib version: 3.10.0


## Download and Check Dataset
This cell makes sure the bulldozer dataset is available locally. It first checks whether the folder data/raw/bluebook-for-bulldozers already exists. If it does not, the cell downloads the zip archive, extracts it, moves the extracted folder into data/raw, and deletes the archive. If the folder already exists, the cell simply reports that the dataset is ready to use.

In [47]:
from pathlib import Path
import requests
import zipfile
import shutil
import os

dataset_dir = Path("data/raw/bluebook-for-bulldozers")
if not dataset_dir.is_dir():
    print("[INFO] Downloading dataset...")
    
    # Create directories
    Path("data/raw").mkdir(parents=True, exist_ok=True)
    
    # Download file and stop early on an HTTP error
    url = "https://github.com/mrdbourke/zero-to-mastery-ml/raw/refs/heads/master/data/bluebook-for-bulldozers.zip"
    response = requests.get(url)
    response.raise_for_status()
    with open("dataset.zip", "wb") as f:
        f.write(response.content)
    
    # Unzip file
    with zipfile.ZipFile("dataset.zip", "r") as zip_ref:
        zip_ref.extractall(".")
    
    # Move to correct location
    shutil.move("bluebook-for-bulldozers", "data/raw/")
    
    # Clean up zip file
    os.remove("dataset.zip")
    
    print(f"[INFO] Dataset downloaded to {dataset_dir}")
else:
    print("[INFO] Dataset already exists!")


[INFO] Downloading dataset...
[INFO] Dataset downloaded to data\raw\bluebook-for-bulldozers


### Directory Contents Checker

This simple code snippet helps you see what files and folders are inside a specific directory (folder). It prints out a list of everything in that directory, making it easy to check what's available.

In [48]:
import os

print(f"[INFO] Files/folders available in {dataset_dir}:")
os.listdir(dataset_dir)

[INFO] Files/folders available in data\raw\bluebook-for-bulldozers:


['Data Dictionary.xlsx',
 'Machine_Appendix.csv',
 'median_benchmark.csv',
 'random_forest_benchmark_test.csv',
 'Test.csv',
 'test_predictions.csv',
 'Train.7z',
 'Train.csv',
 'Train.zip',
 'TrainAndValid.7z',
 'TrainAndValid.csv',
 'TrainAndValid.zip',
 'train_tmp.csv',
 'Valid.7z',
 'Valid.csv',
 'Valid.zip',
 'ValidSolution.csv']
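The listing above shows names only. A slightly richer view, with entry type and file size, can help when deciding which file to load. This is a sketch under the assumption that a sorted, case-insensitive listing is wanted. The helper name summarize_directory is hypothetical.

```python
from pathlib import Path


def summarize_directory(directory):
    """Return (name, kind, size_in_bytes) for each entry, sorted by name."""
    entries = []
    for entry in sorted(Path(directory).iterdir(), key=lambda p: p.name.lower()):
        kind = "dir" if entry.is_dir() else "file"
        # Sizes are reported for files only; directories are listed as 0
        size = entry.stat().st_size if entry.is_file() else 0
        entries.append((entry.name, kind, size))
    return entries
```

For the dataset folder used here, the call would be summarize_directory(dataset_dir).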

---

### Data Directory and File Creation

This code does two simple things:

- Creates a new folder structure to store our data files
- Saves our data as a CSV file in that folder

In [49]:
#import os
#try:
  # create the folder that will hold the CSV file for further analysis
  #os.makedirs(name='outputs/datasets/collection')
#except Exception as e:
  #print(e)

#df.to_csv("outputs/datasets/collection/house_prices_ames_iowa.csv",index=False)
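The two steps described above (create the folder, then save a CSV into it) can be sketched without assuming how df is built. The helper below is an illustration only: save_collection_copy is a hypothetical name, and it copies an existing raw CSV file as-is instead of writing a DataFrame with df.to_csv.

```python
import shutil
from pathlib import Path


def save_collection_copy(source_csv, output_dir="outputs/datasets/collection"):
    """Copy a raw CSV file into the collection folder and return the new path."""
    source = Path(source_csv)
    if not source.is_file():
        raise FileNotFoundError(f"{source} does not exist")
    destination_dir = Path(output_dir)
    # exist_ok=True makes the cell safe to re-run
    destination_dir.mkdir(parents=True, exist_ok=True)
    destination = destination_dir / source.name
    shutil.copy2(source, destination)
    return destination
```

Using mkdir(parents=True, exist_ok=True) avoids the try/except around os.makedirs, because an existing folder is no longer an error.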