# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle authentication token (the kaggle.json file).

## Outputs

* Generate Dataset: outputs/datasets/collection/hand_history.txt

## Additional Comments

* The data comes from an open, public source and poses no ethical or privacy concerns.

---

# Install Python packages

In [None]:
%pip install -r ../requirements.txt

---

# Change working directory

Since the Jupyter notebooks are in a subfolder, we need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
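Note that re-running the os.chdir() cell moves up one more level each time it runs. A minimal sketch of a guard against that, assuming the notebooks live in a folder named jupyter_notebooks (a hypothetical name, as is the helper move_to_project_root; adjust both to the actual project):

```python
import os

def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Move up one level only while still inside the notebooks folder.

    notebook_dir_name is an assumed folder name, not taken from the project.
    Returns the working directory after the (possible) change.
    """
    current_dir = os.getcwd()
    if os.path.basename(current_dir) == notebook_dir_name:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling the function a second time leaves the working directory unchanged, so the cell is safe to re-run.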

---

# Data acquisition from Kaggle

Install the Kaggle package to fetch the data.

In [None]:
%pip install kaggle==1.5.12

* To download the data, you need a personal authentication token (a JSON file) to authenticate with Kaggle. You can acquire one by signing up for Kaggle.

* Once you have your token, drag and drop the file into the current working directory and then run the following:

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
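Before running the download, it can help to confirm that the token file is where KAGGLE_CONFIG_DIR points. A small sketch of such a check; the helper name kaggle_token_present is hypothetical:

```python
import os

def kaggle_token_present(config_dir):
    """Return True if a kaggle.json file exists in config_dir."""
    return os.path.isfile(os.path.join(config_dir, "kaggle.json"))

# Fall back to the current directory if the variable has not been set yet
config_dir = os.environ.get("KAGGLE_CONFIG_DIR", os.getcwd())
if not kaggle_token_present(config_dir):
    print(f"kaggle.json not found in {config_dir}; the download will likely fail to authenticate")
```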

In [None]:
KaggleDatasetPath = "simasjakubenas/poker-hand-history"
DestinationFolder = "inputs/datasets"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

* Unzip the downloaded file
* Delete the zip file and delete the kaggle.json file

In [None]:
import glob
zip_files = glob.glob(f"{DestinationFolder}/*.zip") # Get all zip files in the folder

for zip_file in zip_files: # Extract each ZIP file individually
    !tar -xf "{zip_file}" -C "{DestinationFolder}"

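for zip_file in zip_files: # Remove ZIP files
    !del "{zip_file}"

# Remove the Kaggle API key file (del is a Windows command; the comment is kept
# on its own line because cmd.exe does not treat # as a comment marker)
!del /Q kaggle.json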
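The cell above relies on shell commands (tar, del) that are not available on every platform. As a possible alternative, here is a sketch that does the same steps with the Python standard library only; the helper names extract_and_remove_zips and remove_if_exists are hypothetical:

```python
import glob
import os
import zipfile

def extract_and_remove_zips(folder):
    """Extract every ZIP file in folder into that folder, then delete the ZIP.

    Returns the names of the archive members that were extracted.
    """
    extracted = []
    for zip_path in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        os.remove(zip_path)  # the archive is closed at this point
    return extracted

def remove_if_exists(path):
    """Delete path if it exists; do nothing otherwise."""
    if os.path.exists(path):
        os.remove(path)
```

In this notebook the calls would be extract_and_remove_zips(DestinationFolder) followed by remove_if_exists("kaggle.json").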

---

# Load and Inspect Kaggle Data

In [None]:
import pandas as pd
txt_files = glob.glob(f"{DestinationFolder}/*.txt")
data = pd.read_csv(txt_files[0], header=None, delimiter="\t")
print(f"Hand history from {len(txt_files)} sessions")
data.head(35)

* Combine the hand histories from all sessions

In [None]:
# Read every session file as a single text column and stack them into one DataFrame
combined_hand_history = pd.concat(
    [pd.read_csv(file, header=None, delimiter="\t", on_bad_lines='warn') for file in txt_files],
    ignore_index=True,
)
combined_hand_history.head(5)

---

# Push files to Repo

* Create the output file name dynamically

In [None]:
filename = txt_files[0]

# Split on " - " (with spaces around the dash)
parts = filename.split(" - ", 2)  # Split only twice to keep everything after the second '-'

# Get everything after the second '-'
output_filename = parts[2] if len(parts) > 2 else ""
output_filename
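If the first file name does not contain two " - " separators, output_filename ends up empty and the save step has no file name to write to. A minimal sketch of the same split wrapped in a function with a fallback; the function name derive_output_name is hypothetical, the fallback reuses the hand_history.txt name listed under Outputs, and the file names in the comments are made up for illustration:

```python
import os

def derive_output_name(path, fallback="hand_history.txt"):
    """Return everything after the second ' - ' in the file name.

    Only the base name is split, so a ' - ' in a folder name is ignored.
    Falls back to `fallback` when the name has fewer than two separators.
    """
    name = os.path.basename(path)
    parts = name.split(" - ", 2)  # at most two splits, three parts
    return parts[2] if len(parts) > 2 else fallback

# Illustrative, made-up file names:
# derive_output_name("inputs/datasets/A - B - C - D.txt") returns "C - D.txt"
# derive_output_name("inputs/datasets/plain.txt") returns "hand_history.txt"
```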

* Save the combined hand history

In [None]:
import os

output_folder = "outputs/datasets/collection"
# Create outputs/datasets/collection; do nothing if it already exists
os.makedirs(output_folder, exist_ok=True)

combined_hand_history.to_csv(f"{output_folder}/{output_filename}", index=False, header=False)
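One caveat with saving through to_csv: with its default quoting, pandas wraps any field that contains a comma or a quote character in quotes, so some lines may not be written exactly as they appear in the source files. If a verbatim merge of the raw text is what is wanted, a possible alternative is to concatenate the files directly. This is a sketch under assumptions: the helper name concatenate_text_files is hypothetical, and UTF-8 encoding is assumed rather than known from the dataset.

```python
def concatenate_text_files(paths, output_path, encoding="utf-8"):
    """Write the lines of each file in paths to output_path, unchanged.

    A newline is added after a file whose last line lacks one, so the
    first line of the next file does not get glued onto it.
    Returns the number of lines written.
    """
    line_count = 0
    with open(output_path, "w", encoding=encoding, newline="") as out:
        for path in paths:
            last_line = ""
            with open(path, "r", encoding=encoding, newline="") as src:
                for line in src:
                    out.write(line)
                    line_count += 1
                    last_line = line
            if last_line and not last_line.endswith("\n"):
                out.write("\n")
    return line_count
```

In this notebook the call would be concatenate_text_files(txt_files, f"{output_folder}/{output_filename}").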
