# 00 - Kaggle Data Download Walkthrough
This notebook shows how to fetch the Book-Crossing dataset directly from Kaggle, populate `data/raw`, and run the preprocessing pipeline end-to-end.

## 1. Install Kaggle CLI
Run once inside your virtual environment.

In [42]:
%pip install kaggle --quiet



## 2. Provide `kaggle.json`
1. Visit [Kaggle Account](https://www.kaggle.com/account/token) and create a new API token.
2. Move the downloaded `kaggle.json` to `~/.kaggle/` (Linux/macOS) or `%USERPROFILE%\.kaggle\` (Windows).
3. Make sure only you can read the file (`chmod 600 ~/.kaggle/kaggle.json` on Unix).
4. Alternatively, skip the file and set the environment variables `KAGGLE_USERNAME` and `KAGGLE_KEY`.
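
Before calling the CLI, it can help to confirm which of the credential options above is actually in place. The following is a minimal sketch, not part of the project: `kaggle_credentials_status` is a hypothetical helper, and it only checks the two locations named in the list (the `kaggle.json` file under the home directory and the two environment variables).

```python
import os
import stat
from pathlib import Path


def kaggle_credentials_status(home=None, env=None):
    """Report where Kaggle credentials were found (hypothetical helper).

    Returns "env", "file", "file-insecure", or "missing".
    """
    home = Path(home) if home is not None else Path.home()
    env = os.environ if env is None else env

    # Option 4: both environment variables are set and non-empty.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "env"

    # Options 2-3: token file under ~/.kaggle/.
    token = home / ".kaggle" / "kaggle.json"
    if not token.is_file():
        return "missing"

    # On Unix, flag a token that group or other users can access.
    if os.name == "posix":
        mode = stat.S_IMODE(token.stat().st_mode)
        if mode & (stat.S_IRWXG | stat.S_IRWXO):
            return "file-insecure"
    return "file"
```

Calling `kaggle_credentials_status()` with no arguments inspects the current user's home directory and environment. A result of `"file-insecure"` suggests running the `chmod 600` step.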

## 3. Configure dataset slug
Change `DATASET_SLUG` if you use a fork or a different copy of the dataset.

In [43]:
from pathlib import Path

DATASET_SLUG = "arashnic/book-recommendation-dataset"
PROJECT_ROOT = Path.cwd().parent
RAW_DIR = PROJECT_ROOT / "data" / "raw"
RAW_DIR.mkdir(parents=True, exist_ok=True)
RAW_DIR.resolve()

WindowsPath('E:/Book recommend/data/raw')
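
`Path.cwd().parent` only points at the project root when the notebook is started from the `notebooks/` folder. As a more defensive alternative, the sketch below walks upward until it finds a directory that contains a marker folder. `find_project_root` is a hypothetical helper, and using `src` as the marker is an assumption based on the `src/` package this notebook imports from.

```python
from pathlib import Path


def find_project_root(start=None, marker="src"):
    """Return the nearest ancestor of `start` (inclusive) containing `marker`."""
    start = Path(start) if start is not None else Path.cwd()
    start = start.resolve()
    # Check the starting directory first, then each parent in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")
```

With this helper, `RAW_DIR = find_project_root() / "data" / "raw"` would resolve to the same folder whether the kernel starts in the project root or in `notebooks/`.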

## 4. Download & unzip
Run the Kaggle CLI to download the dataset archive and unpack it into `data/raw`.

In [44]:
import subprocess

cmd = ['kaggle', 'datasets', 'download', '-d', DATASET_SLUG, '-p', str(RAW_DIR), '--unzip']
print('Running:', ' '.join(cmd))
completed = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
print(completed.stdout)
if completed.returncode != 0:
    raise SystemExit('Kaggle download failed. Check API token and dataset slug.')


Running: kaggle datasets download -d arashnic/book-recommendation-dataset -p e:\Book recommend\data\raw --unzip
Dataset URL: https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset
License(s): CC0-1.0
Downloading book-recommendation-dataset.zip to e:\Book recommend\data\raw

  0%|          | 0.00/24.3M [00:00<?, ?B/s]

100%|██████████| 24.3M/24.3M [00:00<00:00, 833MB/s]
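
If the `kaggle` executable is not on `PATH`, `subprocess.run` raises `FileNotFoundError` before the return-code check in the cell above is reached. The sketch below is one possible variant that checks for the CLI first and lets `check=True` raise on a non-zero exit. `build_download_cmd` and `download_dataset` are hypothetical names, and the CLI arguments are the same ones used in the cell above.

```python
import shutil
import subprocess
from pathlib import Path


def build_download_cmd(slug, dest):
    """Build the argument list for `kaggle datasets download`."""
    return ["kaggle", "datasets", "download", "-d", slug, "-p", str(dest), "--unzip"]


def download_dataset(slug, dest):
    """Run the Kaggle CLI and return the CompletedProcess (hypothetical helper)."""
    if shutil.which("kaggle") is None:
        raise FileNotFoundError("kaggle CLI not found on PATH; install it first")
    Path(dest).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError when the CLI exits with a non-zero code.
    return subprocess.run(
        build_download_cmd(slug, dest),
        check=True,
        text=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces, such as `E:\Book recommend\data\raw`.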



## 5. Verify files
The download should leave `Books.csv`, `Ratings.csv`, and `Users.csv` in `data/raw`.

In [45]:
list(RAW_DIR.glob('*.csv'))

[WindowsPath('e:/Book recommend/data/raw/Books.csv'),
 WindowsPath('e:/Book recommend/data/raw/Ratings.csv'),
 WindowsPath('e:/Book recommend/data/raw/Users.csv')]
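
Listing the folder shows what is present, but it does not fail when a file is absent. A small check such as the following sketch makes the expectation explicit. `missing_raw_files` is a hypothetical helper, and the file names are the three named in this section.

```python
from pathlib import Path

EXPECTED_FILES = ("Books.csv", "Users.csv", "Ratings.csv")


def missing_raw_files(raw_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `raw_dir`."""
    raw_dir = Path(raw_dir)
    return [name for name in expected if not (raw_dir / name).is_file()]
```

In the notebook, `assert not missing_raw_files(RAW_DIR)` would stop execution early if the download was incomplete.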

## 6. Run preprocessing pipeline
This step uses `src/data_preprocessing.py` to clean the raw CSVs and save the processed files.

In [46]:
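import sys
from pathlib import Path
import pandas as pd

# Make the project root importable before importing from src/ and config.
PROJECT_ROOT = Path.cwd().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src.data_preprocessing import normalize_column_names
from config import data_config

# Read only the header row to compare raw and normalized column names.
ratings_raw = pd.read_csv(data_config.ratings_file, nrows=0)
raw_cols = list(ratings_raw.columns)
norm_cols = list(normalize_column_names(ratings_raw).columns)
raw_cols, norm_cols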

(['User-ID', 'ISBN', 'Book-Rating'], ['user_id', 'isbn', 'book_rating'])
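
The implementation of `normalize_column_names` is not shown in this notebook. As an illustration only, a rule that lower-cases each name and replaces runs of non-alphanumeric characters with underscores would reproduce the mapping in the output above. `normalize_name` is a hypothetical stand-in, not the project's function.

```python
import re


def normalize_name(name):
    """Lower-case a column name and replace non-alphanumeric runs with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


print([normalize_name(c) for c in ["User-ID", "ISBN", "Book-Rating"]])
# ['user_id', 'isbn', 'book_rating']
```

The pipeline output later in this section shows a `rating` column rather than `book_rating`, so the real pipeline presumably applies a further rename that this sketch does not cover.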

In [47]:
# Ensure CSVs are in project-level data/raw
from pathlib import Path
import shutil

PROJECT_ROOT = Path.cwd().parent
NOTEBOOKS_RAW = Path.cwd() / "data" / "raw"
ROOT_RAW = PROJECT_ROOT / "data" / "raw"
ROOT_RAW.mkdir(parents=True, exist_ok=True)

for fname in ["Books.csv", "Users.csv", "Ratings.csv"]:
    src = NOTEBOOKS_RAW / fname
    dst = ROOT_RAW / fname
    if src.exists() and not dst.exists():
        shutil.move(str(src), str(dst))

sorted(ROOT_RAW.glob("*.csv"))

[WindowsPath('e:/Book recommend/data/raw/Books.csv'),
 WindowsPath('e:/Book recommend/data/raw/Ratings.csv'),
 WindowsPath('e:/Book recommend/data/raw/Users.csv')]

In [48]:

import importlib
import sys
from pathlib import Path

# The notebook lives in notebooks/, so the project root is its parent.
PROJECT_ROOT = Path.cwd().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src import data_preprocessing

# Reload so edits to src/data_preprocessing.py take effect without a kernel restart.
importlib.reload(data_preprocessing)

books, users, ratings = data_preprocessing.preprocess_pipeline()
ratings.head()


|     | user_id | isbn       | rating |
|-----|---------|------------|--------|
| 16  | 276747  | 60517794   | 9      |
| 19  | 276747  | 671537458  | 9      |
| 20  | 276747  | 679776818  | 8      |
| 120 | 276813  | 8426449476 | 8      |
| 133 | 276822  | 60096195   | 10     |


## 7. Quick sanity-check recommendation
Fit a user-based collaborative filtering (CF) model and request a few recommendations to confirm that the pipeline works end to end.

In [49]:
from config import web_config
from src.collaborative_filtering import UserBasedCF

cf_model = UserBasedCF().fit(ratings)

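# Use the configured default user; if that user has no ratings after
# preprocessing, fall back to the user with the most ratings.
user_id = web_config.default_user_id
if user_id not in ratings["user_id"].values:
    user_id = int(ratings["user_id"].value_counts().idxmax())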

cf_model.recommend(user_id=user_id, books=books, top_n=5)


[Recommendation(item_id='0399148760', score=10.000000000000002, metadata={'title': 'Seizure', 'author': 'Robin Cook'}),
 Recommendation(item_id='0425129586', score=10.000000000000002, metadata={'title': 'And Then There Were None', 'author': 'Agatha Christie'}),
 Recommendation(item_id='0006485200', score=10.0, metadata={'title': "The Piano Man's Daughter", 'author': 'Timothy Findley'}),
 Recommendation(item_id='002026478X', score=10.0, metadata={'title': 'AGE OF INNOCENCE (MOVIE TIE-IN)', 'author': 'Edith Wharton'}),
 Recommendation(item_id='0020427859', score=10.0, metadata={'title': 'Over Sea, Under Stone', 'author': 'Susan Cooper'})]
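
The internals of `UserBasedCF` are not shown here. To make the idea concrete, the following is a minimal pure-Python sketch of one common form of user-based CF: cosine similarity between users, then a similarity-weighted mean rating for items the target user has not rated. The function names and the toy data are hypothetical, and the project's class may score items differently.

```python
from collections import defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse {item: rating} vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=5):
    """Rank unseen items by the similarity-weighted mean rating of other users."""
    seen = ratings[user]
    weighted = defaultdict(float)  # sum of similarity * rating per item
    weights = defaultdict(float)   # sum of similarity per item
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(seen, other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap
        for item, rating in other_ratings.items():
            if item not in seen:
                weighted[item] += sim * rating
                weights[item] += sim
    scores = {item: weighted[item] / weights[item] for item in weighted}
    # Highest score first; ties broken by item id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy data: user -> {isbn: rating}
toy = {
    "u1": {"a": 9, "b": 8},
    "u2": {"a": 9, "b": 7, "c": 10},
    "u3": {"d": 5},
}
print(recommend(toy, "u1"))
```

A weighted mean of this kind is computed in floating point, which would be consistent with scores such as `10.000000000000002` in the output above, although that is only a plausible explanation.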