# **Data Collection Notebook**

## Objectives

* Fetch the data from Kaggle, save it under inputs/datasets/raw, and unzip it.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/insurance.csv



---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
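
As an optional alternative, the parent-directory step above can be sketched with `pathlib`. The helper name `parent_of` is hypothetical and is not part of the original notebook.

```python
from pathlib import Path


def parent_of(directory):
    """Return the parent folder of `directory` as a string (hypothetical helper)."""
    return str(Path(directory).parent)


# Show the parent of the current working directory.
# Pass this value to os.chdir() to move up one level.
print(parent_of(Path.cwd()))
```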

# Fetch Data from Kaggle

Install Kaggle Packages

In [None]:
%pip install kaggle==1.5.12

Point Kaggle at the kaggle.json token in the current directory and restrict the file's permissions

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
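
The permission step can also be done from Python, which makes it possible to fail early when the token is missing. This is a minimal sketch assuming a POSIX file system; the function name `secure_token` is hypothetical.

```python
import os
import stat


def secure_token(path="kaggle.json"):
    """Restrict the token file to owner read/write, like `chmod 600` (hypothetical helper)."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{path} not found; add the token file first")
    # S_IRUSR | S_IWUSR is the octal mode 600
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    # Return the resulting permission bits so the caller can confirm them
    return stat.S_IMODE(os.stat(path).st_mode)
```

Calling `secure_token()` with no argument checks `kaggle.json` in the current working directory.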

Define the Kaggle dataset and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "mirichoi0218/insurance"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, then delete the zip file and the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
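
Where the `unzip` shell command is unavailable, the extraction step could be sketched with the standard-library `zipfile` module instead. The helper name `unzip_and_remove` is hypothetical, and unlike the cell above it does not delete kaggle.json.

```python
import glob
import os
import zipfile


def unzip_and_remove(folder):
    """Extract every .zip archive in `folder` into it, then delete the archive."""
    extracted = []
    for zip_path in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been extracted successfully
        os.remove(zip_path)
    return extracted
```

For this notebook the call would be `unzip_and_remove("inputs/datasets/raw")`, which returns the names of the extracted files.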

---

# Load and Inspect Kaggle data

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/insurance.csv")
print(df.head())

In [None]:
df.info()

Check for null values in the dataframe

In [None]:
# Check for null values in the dataframe
print("Number of null values in each column:")
print(df.isnull().sum())

print("\nTotal number of null values in the dataframe:", df.isnull().sum().sum())

In [None]:
df.shape

In [None]:
df.dtypes

We have 3 categorical features:

- sex
- smoker
- region

We will convert these to numerical values later.
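
The categorical features can also be listed programmatically rather than read off the `dtypes` output. The sketch below treats every non-numeric column as categorical, which is an assumption that fits this dataset but not every dataset. The helper name and the two sample rows are illustrative only and are not taken from insurance.csv.

```python
import pandas as pd


def categorical_columns(frame):
    """Return the names of all non-numeric columns (hypothetical helper)."""
    return frame.select_dtypes(exclude="number").columns.tolist()


# Illustrative toy rows using the dataset's column names, not real records
sample = pd.DataFrame({
    "age": [30, 40],
    "sex": ["female", "male"],
    "smoker": ["yes", "no"],
    "region": ["southwest", "southeast"],
    "charges": [1000.0, 2000.0],
})
print(categorical_columns(sample))
```

In the notebook, `categorical_columns(df)` would be called on the loaded dataframe instead of the toy sample.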

---

# Push files to Repo

* Create the output folder, if it does not already exist, and save the dataset there so it can be pushed to the repo.

In [None]:
import os
# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save the dataframe to a CSV file in the outputs folder
df.to_csv('outputs/datasets/collection/insurance.csv', index=False)
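
To confirm that the file was written correctly, the saved CSV can be read back and compared with the dataframe in memory. This is a minimal sketch; `save_and_verify` is a hypothetical helper, and it only checks the shape and the column names, not the cell values.

```python
import os

import pandas as pd


def save_and_verify(frame, path):
    """Save `frame` to `path`, reload it, and compare shape and column names."""
    folder = os.path.dirname(path)
    if folder:
        # Create the target folder if it does not exist yet
        os.makedirs(folder, exist_ok=True)
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    same_shape = reloaded.shape == frame.shape
    same_columns = list(reloaded.columns) == list(frame.columns)
    return same_shape and same_columns
```

In the notebook this would be called as `save_and_verify(df, "outputs/datasets/collection/insurance.csv")`.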

# Conclusions and Next Steps

### Conclusion
We have successfully downloaded the dataset from Kaggle, inspected it, and saved it in the appropriate directories. The next steps are to preprocess the data, convert the categorical features to numerical values, and prepare the data for analysis or modeling.

### Key Points
* Fetched data from Kaggle and saved it as a raw file.
* Inspected the data for null values and identified categorical features.
* Saved the inspected data under outputs/datasets/collection.

### Next Steps
* Exploratory Data Analysis (EDA) on the dataset.
* Correlation study between features.
* Visualization of data distributions.
* Feature engineering and transformation.