# **Data Collection Notebook**

**Objective:**
Collect and preprocess raw data related to insider transactions and stock prices.
Fetch the data from Kaggle, store it as raw data, then inspect it and save interim files.

**Inputs:**
- Raw TSV files: `NONDERIV_TRANS.tsv`, `SUBMISSION.tsv`, `REPORTING_OWNER.tsv`
- Stock price data files in `../data/raw/stock_prices/` directory
- Kaggle JSON file containing the authentication token

**Outputs:**
- Interim CSV files:
  - `interim_insider_transactions.csv`
  - `interim_stock_prices.csv`
  - `interim_merged_insider_transactions_stock_prices.csv`

---


# Step 1: Imports & Kaggle Endpoint 

In [None]:
import sys
import json
from pathlib import Path
import os
import pandas as pd
import numpy as np
import re
from dotenv import load_dotenv
import zipfile
import shutil
import subprocess

# Change working directory

* Access current directory and change to parent directory

In [None]:
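current_dir = os.getcwd()
print("Current directory:", current_dir)

# Move up one level to the parent directory (re-running this cell moves up again)
os.chdir(os.path.dirname(current_dir))
print("Working directory changed to:", os.getcwd())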

Confirm the current directory

In [None]:
current_dir = os.getcwd()
current_dir
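
Because `os.chdir(os.path.dirname(...))` moves up one level every time the cell is re-run, a repeatable alternative is to search upward for the project root. The following is a minimal sketch, not the notebook's own method: `find_project_root` is a hypothetical helper, and using a `data` folder as the marker is an assumption based on the `data/...` paths used later in this notebook.

```python
from pathlib import Path


def find_project_root(start, marker="data"):
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper: the `data` marker is an assumption, not part of the notebook.
    """
    start = Path(start).resolve()
    # Check the starting directory first, then each parent in turn
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' found above {start}")
```

Calling `os.chdir(find_project_root(os.getcwd()))` would then land in the same directory no matter how many times the cell runs.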

---

# Fetch data from Kaggle

### Set Up Credentials

In [None]:
# Define the path to the .env file in the current directory (MarketPulseAnalytics)
env_path = os.path.join(os.getcwd(), '.env')

# Load environment variables from the .env file if it exists
if os.path.exists(env_path):
    load_dotenv(env_path)
    print(".env file loaded from:", env_path)
else:
    print("No .env file found in the current directory. Ensure environment variables are set in the hosting environment.")

# Access the environment variables
kaggle_username = os.getenv('KAGGLE_USERNAME')
kaggle_key = os.getenv('KAGGLE_KEY')

# Verify environment variables are set
if not kaggle_username or not kaggle_key:
    print("Warning: KAGGLE_USERNAME and/or KAGGLE_KEY environment variables are not set. Make sure they are configured in the production environment.")
else:
    print("Environment variables loaded successfully.")


Set the download paths

In [None]:
# Define paths
insider_transactions_download_path = 'data/downloaded/zip_insider_transactions/'
insider_transactions_filename = 'zip_insider_transactions.zip'

insider_transactions_unzip_path = 'data/raw/insider_transactions'

Define the Kaggle datasets

In [None]:
# Specify the dataset names
stock_prices_dataset = "borismarjanovic/price-volume-data-for-all-us-stocks-etfs"
insider_transactions_dataset = "osawani/sec-insider-transactions"

Download the dataset

In [None]:
# Function to download the dataset using the Kaggle CLI
def download_dataset(dataset_name, download_path):
    # Build the Kaggle CLI command as an argument list (avoids shell quoting issues)
    command = ["kaggle", "datasets", "download", "-d", dataset_name, "-p", download_path]

    # Print the command for debugging purposes
    print(f"Running command: {' '.join(command)}")

    # Run the command and keep its exit code (subprocess is imported in Step 1)
    result = subprocess.run(command)

    # Report success only if the CLI exited cleanly
    if result.returncode == 0:
        print(f"Dataset {dataset_name} downloaded successfully to {download_path}")
    else:
        print(f"Download of {dataset_name} failed with exit code {result.returncode}")
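
To avoid downloading the archive again on every run, the download call can be guarded by a check for an existing file. This is a minimal sketch under stated assumptions: `archive_present` is a hypothetical helper, and treating any `.zip` file in the download folder as the already-downloaded dataset is an assumption.

```python
from pathlib import Path


def archive_present(download_path):
    """Return True if `download_path` already contains at least one .zip file.

    Hypothetical helper: assumes any .zip in the folder is the dataset archive.
    """
    folder = Path(download_path)
    # A missing folder counts as "nothing downloaded yet"
    return folder.is_dir() and any(folder.glob("*.zip"))
```

A guarded call could then be written as `if not archive_present(download_path): download_dataset(dataset_name, download_path)`.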

In [None]:
# Download the insider transactions dataset
download_dataset(insider_transactions_dataset, insider_transactions_download_path)
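
Once the archive is on disk, it still has to be extracted into `insider_transactions_unzip_path`. The following is a minimal sketch using the standard-library `zipfile` module. `extract_archive` is a hypothetical helper, and because the archive's file name depends on what the Kaggle CLI actually saved, the path is passed in rather than assumed.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract every member of `zip_path` into `target_dir` and return their names.

    Hypothetical helper, not part of the original notebook.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()
```

The returned list of member names gives a quick way to confirm that the expected TSV files are present after extraction.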