# Step 4 Data Acquisition

In the first three steps, we laid the foundation of the project. The next step is to gather data.

In real-life projects, data can come from multiple sources such as data lakes, databases, and file storage. Each of these sources presents unique challenges for data acquisition. Typical steps include:

**Data Lakes:**
- Establish secure connections to cloud storage (AWS S3, Azure Data Lake Storage, Google Cloud Storage)
- Handle various file formats (Parquet, Delta, ORC) and compression types
- Manage access permissions and data governance policies
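
As a small illustration of the file-format point above, the sketch below guesses a file's format from its leading "magic" bytes before choosing a reader. The helper name `sniff_format` and the lookup table are hypothetical and cover only Parquet, ORC, and gzip. Delta tables are not included because they are, as far as I know, stored as Parquet files plus a transaction log rather than as a single file type.

```python
import gzip

# Leading byte signatures for a few common formats (illustrative, not exhaustive)
MAGIC_BYTES = {
    b"PAR1": "parquet",
    b"ORC": "orc",
    b"\x1f\x8b": "gzip",
}


def sniff_format(header: bytes) -> str:
    """Return a best-guess format name based on the first bytes of a file."""
    for magic, name in MAGIC_BYTES.items():
        if header.startswith(magic):
            return name
    return "unknown"


# Demo on in-memory bytes instead of real data lake objects
gzip_guess = sniff_format(gzip.compress(b"Time,Amount,Class\n"))
parquet_guess = sniff_format(b"PAR1" + b"\x00" * 8)
csv_guess = sniff_format(b"Time,Amount,Class\n")
print(gzip_guess, parquet_guess, csv_guess)
```

In practice you would read only the first few bytes of the remote object and then hand the file to a format-specific reader.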

**Databases:**
- Set up database connections with proper authentication
- Write optimized queries to extract relevant data subsets, or use existing stored procedures
- For large datasets, use pagination or streaming
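
The streaming idea can be sketched with Python's built-in `sqlite3` module: rather than loading a whole result set at once, the cursor is read in fixed-size chunks. This is a minimal sketch under stated assumptions. The in-memory database, the `transactions` table, and the helper name `stream_rows` are all made up for illustration, and a production database would use its own driver and connection settings.

```python
import sqlite3
from typing import Iterator


def stream_rows(conn: sqlite3.Connection, query: str, chunk_size: int = 1000) -> Iterator[list]:
    """Yield query results in lists of at most `chunk_size` rows."""
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows


# Demo with an in-memory database standing in for a real one
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO transactions (amount) VALUES (?)",
    [(float(i),) for i in range(10)],
)

chunk_sizes = [
    len(chunk)
    for chunk in stream_rows(conn, "SELECT id, amount FROM transactions ORDER BY id", chunk_size=4)
]
print(chunk_sizes)
conn.close()
```

Each chunk can be processed and discarded before the next one is fetched, which keeps memory use bounded regardless of table size.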

**File Storage Systems:**
- Configure secure file transfer protocols for API endpoints
- Handle different file formats and possible encoding incompatibilities
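
One common way to cope with encoding incompatibilities is to try a short list of candidate encodings in order. The sketch below assumes UTF-8 first and Windows-1252 as a fallback. Both the helper name `read_text_with_fallback` and the choice of candidates are illustrative and should be adapted to the files you actually receive.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Decode a file with the first encoding that works; return (text, encoding)."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path} with any of: {encodings}")


# Demo: a file written in cp1252 is not valid UTF-8, so the fallback is used
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes("name,drink\nAna,caf\u00e9\n".encode("cp1252"))
    text, used_encoding = read_text_with_fallback(sample)

print(used_encoding)
```

Put the strictest encoding first: permissive single-byte encodings can decode almost any input, so they would hide errors if tried early.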

In our project, we'll download the [Kaggle Credit Card Fraud Detection Dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) and save it in our `../data` folder.

You can follow the code below, which requires you to create an API key on Kaggle and save it in a `kaggle.json` file located at `../kaggle.json`. An example of the file's contents is shown below. Alternatively, you can log in to Kaggle, download the data manually, and place it in the `data` folder.

In [3]:
# Sample kaggle.json file content
# This should be saved as ../kaggle.json (one level up from the src folder)

{
    "username": "your-kaggle-username",
    "key": "your-kaggle-api-key"
}

{'username': 'your-kaggle-username', 'key': 'your-kaggle-api-key'}
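
Before attempting a download, it can help to confirm that the credentials file is valid JSON and contains both fields shown above. The following is a minimal sketch. The helper name `missing_kaggle_keys` is hypothetical, and the demo writes a throwaway file to a temporary directory rather than touching your real `kaggle.json`.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("username", "key")


def missing_kaggle_keys(config_path):
    """Return the required fields that are absent or empty in a kaggle.json file."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return [name for name in REQUIRED_KEYS if not config.get(name)]


# Demo: a file that lacks the "key" field
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "kaggle.json"
    sample.write_text(json.dumps({"username": "your-kaggle-username"}), encoding="utf-8")
    missing = missing_kaggle_keys(sample)

print(missing)
```

An empty list means both fields are present. Since the file holds a secret, it is also worth keeping it out of version control.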

In [8]:
import os
from pathlib import Path

# Set up paths (assumes the notebook runs from the src folder)
current_dir = Path.cwd()
project_root = current_dir.parent
data_dir = project_root / "data"
kaggle_config = project_root / "kaggle.json"

data_dir.mkdir(exist_ok=True)

# Point the Kaggle client at the project root. This must happen before
# `import kaggle`, because the package reads its configuration and
# authenticates at import time.
if kaggle_config.exists():
    os.environ['KAGGLE_CONFIG_DIR'] = str(project_root)
    print(f"Kaggle config exists at: {kaggle_config}")
else:
    print(f"Kaggle config not found at: {kaggle_config}")
    print("Please create kaggle.json with your API credentials or download manually from Kaggle")

dataset_name = "mlg-ulb/creditcardfraud"
dataset_file = "creditcard.csv"
dataset_path = data_dir / dataset_file

if dataset_path.exists():
    print(f"Dataset already exists at: {dataset_path}")
else:
    try:
        # Imported here so that KAGGLE_CONFIG_DIR (set above) is picked up and
        # any exception raised during import-time authentication is handled below
        import kaggle

        print(f"Downloading dataset: {dataset_name}")
        print(f"Saving to: {data_dir}")

        # Download the dataset archive and unzip it into the data directory
        kaggle.api.dataset_download_files(
            dataset_name,
            path=str(data_dir),
            unzip=True
        )

        if dataset_path.exists():
            print("Dataset downloaded successfully!")
        else:
            print(f"Download completed but {dataset_file} not found")

    except Exception as e:
        print(f"Error downloading dataset: {e}")
        print("Please download the dataset manually from Kaggle and place it in the data folder")

Kaggle config exists at: C:\Users\ArashNozarinejad\Documents\Personal\GitHub\credit-card-fraud-detection\kaggle.json
Downloading dataset: mlg-ulb/creditcardfraud
Saving to: C:\Users\ArashNozarinejad\Documents\Personal\GitHub\credit-card-fraud-detection\data
Dataset URL: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
Dataset downloaded successfully!
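
Whether the file was downloaded through the API or manually, a quick sanity check confirms it is a readable CSV. The sketch below uses only the standard library. The helper name `summarize_csv` is hypothetical, and the demo runs on a tiny stand-in file whose three columns are placeholders, not the real dataset schema.

```python
import csv
import tempfile
from pathlib import Path


def summarize_csv(path):
    """Return (header, row_count) for a CSV file without loading it into memory."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        row_count = sum(1 for _ in reader)
    return header, row_count


# Demo on a small stand-in file (placeholder columns and values)
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("Time,Amount,Class\n0,149.62,0\n1,2.69,0\n", encoding="utf-8")
    header, row_count = summarize_csv(sample)

print(header, row_count)
```

In the notebook, calling `summarize_csv(dataset_path)` after the download cell would report the column names and number of rows in `creditcard.csv`.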
